Spearman’s Rank Correlation Coefficient (Statistics in Erlang part 7)
August 24, 2008 10:29 am
Erlang, Math, Programming, Statistics, Tools and Libraries
Spearman’s Rank Correlation Coefficient, usually just called the Spearman Correlation and sometimes Spearman’s Rank or Spearman’s Rho, is a method of measuring how closely the rankings of two paired numeric lists agree (we also go over the Pearson and the Kendall). If you don’t know what a correlation is, start with Statistics in Erlang part 5, which covers the basic idea of correlations. A good math-based explanation is available at David M Lane’s Hyperstat.
In English, for two lists of length N, you rank each list, take the square of the difference between the two ranks in each matched row, sum those squares, multiply by six, divide by N cubed minus N, and subtract the result from one. That gives the rank correlation itself, which lies in the interval [-1, 1]: 1 means the two rankings agree exactly, and -1 means they are exactly reversed. The function returns the value in a tuple tagged rsquared; despite that label, the number is the coefficient and not its square, which is why it can come out negative.
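For anyone who would rather see it as a formula, here is the same calculation written out. The symbols are just labels of convenience, not anything from the library: d_i is the difference between the two ranks in row i, and N is the length of the lists. Note that this shortcut form is exact only when neither list contains tied values; with ties it is an approximation.

```latex
\rho = 1 - \frac{6 \sum_{i=1}^{N} d_i^{2}}{N^{3} - N},
\qquad d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i)
```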
Sorry again about the screwy formatting; I’m just getting things to fit on the blog. The stuff in the library is better formatted.
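%% Lists of different lengths cannot be paired row by row.
spearman_correlation(List1, List2)
  when is_list(List1), is_list(List2), length(List1) /= length(List2) ->
    {error, lists_must_be_same_length};

spearman_correlation(List1, List2) when is_list(List1), is_list(List2) ->
    %% Keep only the first element of each pair from ordered_ranks_of/1 (Part 4).
    {TR1,_} = lists:unzip(ordered_ranks_of(List1)),
    {TR2,_} = lists:unzip(ordered_ranks_of(List2)),
    %% Six times the sum of the squared differences between matched entries.
    Numerator = 6 * lists:sum([ (D1-D2)*(D1-D2)
                                || {D1,D2} <- lists:zip(TR1,TR2) ]),
    %% N cubed minus N; math:pow/2 returns a float.
    Denominator = math:pow(length(List1),3) - length(List1),
    {rsquared, 1-(Numerator/Denominator)}.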
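Since the function above leans on the ranking code from Part 4, here is a rough self-contained sketch of the same calculation for anyone who wants to try it without the library. The module name spearman_sketch and the helper average_ranks/1 are made up for this example; average_ranks/1 is a simple stand-in, not the library's ordered_ranks_of, and it assumes tied values should share the average of the positions they occupy.

```erlang
-module(spearman_sketch).
-export([average_ranks/1, rho/2]).

%% Hypothetical helper (not part of ScUtil): returns one rank per element,
%% in input order, with 1 for the smallest value. Tied values share the
%% mean of the positions they occupy in the sorted list.
average_ranks(List) ->
    Sorted = lists:sort(List),
    [mean_position(V, Sorted) || V <- List].

%% Mean of the 1-based positions at which V occurs in Sorted.
mean_position(V, Sorted) ->
    Indexed = lists:zip(lists:seq(1, length(Sorted)), Sorted),
    Positions = [P || {P, X} <- Indexed, X == V],
    lists:sum(Positions) / length(Positions).

%% 1 - 6 * sum(d^2) / (N^3 - N), where d is the rank difference per row.
%% Needs at least two rows, otherwise the denominator is zero.
rho(Xs, Ys) when length(Xs) =:= length(Ys), length(Xs) > 1 ->
    N = length(Xs),
    Pairs = lists:zip(average_ranks(Xs), average_ranks(Ys)),
    SumD2 = lists:sum([(A - B) * (A - B) || {A, B} <- Pairs]),
    1 - (6 * SumD2) / (N * N * N - N).
```

As a sanity check, spearman_sketch:rho([1,2,3], [3,2,1]) gives -1.0, since the second list ranks everything in exactly the opposite order. The helper rescans the sorted list for every element, so it is quadratic; that is fine for a sketch but not something to run on large lists.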
Test data is available at Geography Fieldwork.
1> X = [50,175,270,375,425,580,710,790,890,980].
[50,175,270,375,425,580,710,790,890,980]
2> Y = [1.80,1.20,2.00,1.00,1.00,1.20,0.80,0.60,1.00,0.85].
[1.80000, 1.20000, 2.00000, 1.00000, 1.00000,
1.20000, 0.800000, 0.600000, 1.00000, 0.850000]
3> scutil:spearman_correlation(X,Y).
{rsquared,-0.730303}
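That figure can be checked by hand. This working assumes both lists are ranked from smallest to largest and that tied values in Y share the average of their positions (the three 1.00s each get rank 5, the two 1.20s each get 7.5); those assumptions reproduce the result above. X is already in ascending order, so its ranks are 1 through 10, and the ranks of Y are 9, 7.5, 10, 5, 5, 7.5, 2, 1, 5, 3.

```latex
\sum_{i=1}^{10} d_i^{2} = 64 + 30.25 + 49 + 1 + 0 + 2.25 + 25 + 49 + 16 + 49 = 285.5

\rho = 1 - \frac{6 \times 285.5}{10^{3} - 10} = 1 - \frac{1713}{990} \approx -0.7303
```

A value this far below zero says that as X goes up, Y tends to go down.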
This code is part of the ScUtil library. This code is free and MIT licensed, because the GPL is evil. This code uses the ordered ranks code from Part 4. This closes issue 139.
