The Pearson Correlation Coefficient (Statistics in Erlang part 6)


I went over much of the concept of correlations in Part 5; if you don’t know what statistical correlations are, you should read Part 5 first.

The Pearson Correlation Coefficient is one way to measure how strongly two numeric data sets are linearly related. You can get a good math-based explanation at David Lane’s Hyperstat.

In English, basically, you take two numeric lists of the same length, X and Y, then calculate five sums and a length from them:

  1. The sum of the items in List X (SumX)
  2. The sum of the items in List Y (SumY)
  3. The sum of the squares of the items in List X (SumXX)
  4. The sum of the squares of the items in List Y (SumYY)
  5. The sum of the products of the matched items in Lists X and Y (SumXY)
  6. The length of the lists, which should be the same (N)

Using those, you can build the formula for the coefficient, which is honestly best expressed in code:

pearson_correlation(List1, List2) when is_list(List1), is_list(List2) ->

    % Sum of the products of the matched items
    SumXY = lists:sum([A*B || {A,B} <- lists:zip(List1,List2) ]),

    SumX  = lists:sum(List1),
    SumY  = lists:sum(List2),

    % Sums of the squares of the items
    SumXX = lists:sum([L*L || L<-List1]),
    SumYY = lists:sum([L*L || L<-List2]),

    N     = length(List1),

    Numer = (N*SumXY) - (SumX * SumY),
    Denom = math:sqrt(((N*SumXX)-(SumX*SumX)) * ((N*SumYY)-(SumY*SumY))),

    {r, (Numer/Denom)}.
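If you prefer notation to code, the function above matches what is usually called the computational form of Pearson's r. Here x_i and y_i are the matched items of the two lists and N is their length; those symbols are my shorthand, not names from the library.

```latex
r = \frac{N \sum_{i} x_i y_i \;-\; \sum_{i} x_i \sum_{i} y_i}
         {\sqrt{\left(N \sum_{i} x_i^2 - \left(\sum_{i} x_i\right)^2\right)
                \left(N \sum_{i} y_i^2 - \left(\sum_{i} y_i\right)^2\right)}}
```

The numerator is Numer, the square root is Denom, and the five sums are SumXY, SumX, SumY, SumXX and SumYY in that order of appearance.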

This code is part of the ScUtil Library.  The ScUtil library is free and MIT licensed, because the GPL is evil.

1> X = [1,3,5,6,8,9,6,4,3,2].      
[1,3,5,6,8,9,6,4,3,2]
2> Y = [2,5,6,6,7,7,5,3,1,1].  
[2,5,6,6,7,7,5,3,1,1]
3> scutil:pearson_correlation(X,Y).
{r,0.854706}
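To see where that number comes from, here is the same calculation done by hand for the X and Y lists in the session, using the variable names from the function (values rounded for display):

```latex
\begin{aligned}
N &= 10, \quad \mathrm{SumX} = 47, \quad \mathrm{SumY} = 43 \\
\mathrm{SumXX} &= 281, \quad \mathrm{SumYY} = 235, \quad \mathrm{SumXY} = 249 \\
\mathrm{Numer} &= 10 \cdot 249 - 47 \cdot 43 = 2490 - 2021 = 469 \\
\mathrm{Denom} &= \sqrt{(10 \cdot 281 - 47^2)(10 \cdot 235 - 43^2)}
               = \sqrt{601 \cdot 501} = \sqrt{301101} \approx 548.727 \\
r &= \frac{469}{548.727} \approx 0.854706
\end{aligned}
```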

Verification of test data is available at Changing Minds.  This closes issue 140.
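One thing the function takes on trust is that the two lists really are the same length and that neither one is constant. With lists of different lengths, lists:zip/2 fails with a function_clause error, and when every item in a list is identical Denom comes out as zero, so the final division raises badarith. Below is a minimal sketch of a more defensive variant. The module name pearson_sketch, the function name correlation and the error atoms are all hypothetical names I made up for illustration; none of this is part of ScUtil.

```erlang
-module(pearson_sketch).
-export([correlation/2]).

%% Hypothetical wrapper, NOT part of ScUtil. Same arithmetic as
%% pearson_correlation/2, but returns an error tuple instead of crashing
%% on inputs the formula cannot handle.
correlation(Xs, Ys) when is_list(Xs), is_list(Ys) ->
    N = length(Xs),
    case length(Ys) of
        N when N > 0 -> correlation(Xs, Ys, N);   % same length, non-empty
        _            -> {error, length_mismatch_or_empty}
    end.

correlation(Xs, Ys, N) ->
    SumX  = lists:sum(Xs),
    SumY  = lists:sum(Ys),
    SumXX = lists:sum([X*X || X <- Xs]),
    SumYY = lists:sum([Y*Y || Y <- Ys]),
    SumXY = lists:sum([X*Y || {X,Y} <- lists:zip(Xs, Ys)]),

    Numer   = (N*SumXY) - (SumX*SumY),
    %% Square of the denominator; zero when either list has no variance.
    DenomSq = ((N*SumXX) - (SumX*SumX)) * ((N*SumYY) - (SumY*SumY)),

    if
        DenomSq > 0 -> {r, Numer / math:sqrt(DenomSq)};
        true        -> {error, zero_variance}
    end.
```

Called with the X and Y lists from the session above, this should return the same {r, ...} tuple as scutil:pearson_correlation/2. Returning {error, Reason} rather than crashing is a design choice, not the only sensible one; letting it crash is equally idiomatic Erlang if the caller is supervised.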
