The Pearson Correlation Coefficient (Statistics in Erlang part 6)
August 23, 2008 7:46 pm
Erlang, Math, Programming, Statistics, Tools and Libraries
I went over much of the concept of correlations in Part 5; if you don’t know what statistical correlations are, you should read Part 5 first.
The Pearson Correlation Coefficient is one way of measuring how strongly two numeric data sets are linearly correlated. You can get a good math-based explanation at David Lane’s Hyperstat.
In plain English, you take two numeric lists of the same length, X and Y, then calculate five sums and a length from them:
- The sum of the items in List X (SumX)
- The sum of the items in List Y (SumY)
- The sum of the squares of the items in List X (SumXX)
- The sum of the squares of the items in List Y (SumYY)
- The sum of the products of the matched items in Lists X and Y (SumXY)
- The length of the lists, which should be the same (N)
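For readers who prefer notation, here is how those six quantities are usually combined into the coefficient r. This is the standard computational form of the Pearson formula, written with the names from the list above rather than any notation used elsewhere in this series:

```latex
r = \frac{N \cdot \mathrm{SumXY} - \mathrm{SumX} \cdot \mathrm{SumY}}
         {\sqrt{\left(N \cdot \mathrm{SumXX} - \mathrm{SumX}^{2}\right)
                \left(N \cdot \mathrm{SumYY} - \mathrm{SumY}^{2}\right)}}
```

The result always falls between -1 and 1.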
Using those, you can construct a formula which is honestly best expressed in code:
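%% Pearson correlation coefficient of two equal-length numeric lists.
%% lists:zip/2 raises an error if the lists differ in length.
pearson_correlation(List1, List2) when is_list(List1), is_list(List2) ->
    % Sum of the products of the matched items
    SumXY = lists:sum([A*B || {A,B} <- lists:zip(List1,List2)]),
    SumX  = lists:sum(List1),
    SumY  = lists:sum(List2),
    % Sums of squares
    SumXX = lists:sum([L*L || L <- List1]),
    SumYY = lists:sum([L*L || L <- List2]),
    N     = length(List1),
    Numer = (N*SumXY) - (SumX * SumY),
    % Denom is zero when either list is constant, so the division below fails
    Denom = math:sqrt(((N*SumXX)-(SumX*SumX)) * ((N*SumYY)-(SumY*SumY))),
    {r, (Numer/Denom)}.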
This code is part of the ScUtil Library. The ScUtil library is free and MIT licensed, because the GPL is evil.
1> X = [1,3,5,6,8,9,6,4,3,2].
[1,3,5,6,8,9,6,4,3,2]
2> Y = [2,5,6,6,7,7,5,3,1,1].
[2,5,6,6,7,7,5,3,1,1]
3> scutil:pearson_correlation(X,Y).
{r,0.854706}
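To see where that number comes from, the intermediate values for this X and Y can be worked out by hand. The arithmetic below follows the same steps as the function, using the same names:

```latex
\begin{aligned}
N &= 10, \quad \mathrm{SumX} = 47, \quad \mathrm{SumY} = 43 \\
\mathrm{SumXX} &= 281, \quad \mathrm{SumYY} = 235, \quad \mathrm{SumXY} = 249 \\
\mathrm{Numer} &= 10 \cdot 249 - 47 \cdot 43 = 2490 - 2021 = 469 \\
\mathrm{Denom} &= \sqrt{(10 \cdot 281 - 47^{2})(10 \cdot 235 - 43^{2})}
               = \sqrt{601 \cdot 501} = \sqrt{301101} \approx 548.727 \\
r &= 469 / 548.727 \approx 0.854706
\end{aligned}
```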
Verification of test data is available at Changing Minds. This closes issue 140.
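The function above assumes well-behaved input: it crashes if the lists differ in length, and it divides by zero if either list is constant. As a minimal sketch of one way to guard against both, here is a variant that gathers the five sums and the length in a single fold and returns an error tuple instead of crashing. The module name pearson_sketch, the function safe_pearson/2, and the error atoms are all hypothetical names made up for this illustration; none of them are part of ScUtil.

```erlang
-module(pearson_sketch).
-export([safe_pearson/2]).

%% Hypothetical guarded variant; NOT part of the ScUtil library.
%% Returns {r, Float}, {error, length_mismatch} or {error, zero_variance}.
safe_pearson(Xs, Ys) when is_list(Xs), is_list(Ys) ->
    case length(Xs) =:= length(Ys) of
        false -> {error, length_mismatch};
        true  -> from_sums(sums(Xs, Ys))
    end.

%% One pass over the paired items, accumulating
%% {N, SumX, SumY, SumXX, SumYY, SumXY}.
sums(Xs, Ys) ->
    lists:foldl(
        fun({X, Y}, {N, SX, SY, SXX, SYY, SXY}) ->
            {N + 1, SX + X, SY + Y, SXX + X*X, SYY + Y*Y, SXY + X*Y}
        end,
        {0, 0, 0, 0, 0, 0},
        lists:zip(Xs, Ys)).

%% Same formula as pearson_correlation/2, but refuses to divide
%% when either list has no spread (this also covers empty lists).
from_sums({N, SX, SY, SXX, SYY, SXY}) ->
    VarX = N*SXX - SX*SX,
    VarY = N*SYY - SY*SY,
    if
        VarX > 0, VarY > 0 ->
            {r, (N*SXY - SX*SY) / math:sqrt(VarX * VarY)};
        true ->
            {error, zero_variance}
    end.
```

Returning a tagged error keeps the {r, Value} shape of the original for the success case, so callers that pattern match on {r, R} continue to work and simply fail the match on bad input.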
