Moments, Skewness and Kurtosis (Statistics in Erlang part 8)


So, it was pointed out to me that I had the central moments, but not the raw moments – that is, the ones not centered on the input’s mean.  Also, it was pointed out that most people don’t know that kurtosis and skewness are related to the central moments.  Furthermore, it turns out (and I didn’t know this) that there are in fact meaningful uses of floating-point exponents in moments.

So, I implemented moments, I replaced my central moments implementation, and I gave name wrappers for skewness and kurtosis to make them easier to identify.

This closes issues 169, 170, 171, 172 and 173.  As usual, this code is part of the ScUtil library, which is free and MIT licensed, because the GPL is evil.

moment(List, N) when is_list(List), is_number(N) ->
    scutil:arithmetic_mean( [ math:pow(Item, N) || Item <- List ] ).

moments(List) ->
    moments(List, [2,3,4]).

moments(List, Moments) when is_list(Moments) ->
    [ moment(List, M) || M <- Moments ].

central_moment(List, N) when is_list(List), is_number(N) ->
    ListAMean = scutil:arithmetic_mean(List),
    scutil:arithmetic_mean( [ math:pow(Item-ListAMean, N) || Item <- List ] ).

central_moments(List) ->
    central_moments(List, [2,3,4]).

central_moments(List, Moments) when is_list(Moments) ->
    [ central_moment(List, M) || M <- Moments ].

skewness(List) -> central_moment(List, 3).
kurtosis(List) -> central_moment(List, 4).
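
A few hand-worked calls as a sanity check (a sketch, assuming these functions are exported from scutil; the expected values are given as comments rather than shell output, since float formatting varies between shell versions):

```erlang
%% Hand-computed examples for the moment functions above.
scutil:moment([1,2,3], 2).          % (1 + 4 + 9) / 3  =  14/3  ~ 4.667
scutil:central_moment([1,2,3], 2).  % (1 + 0 + 1) / 3  =  2/3   ~ 0.667
scutil:skewness([1,2,3]).           % (-1 + 0 + 1) / 3 =  0.0
scutil:kurtosis([1,2,3]).           % (1 + 0 + 1) / 3  ~ 0.667
```

Note that, as defined above, skewness/1 and kurtosis/1 are the raw third and fourth central moments, not the variance-normalized textbook quantities.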

Spearman’s Rank Correlation Coefficient (Statistics in Erlang part 7)


Spearman’s Rank Correlation Coefficient – usually just called The Spearman Correlation, sometimes Spearman’s Rank or Spearman’s Rho – is a method of determining the similarity between two numeric sets (we also go over the Pearson and the Kendall).  If you don’t know what a correlation is, start with Statistics in Erlang part 5, which covers the basic idea of correlations.  A good math-based explanation is available at David M Lane’s Hyperstat.

In English, for two lists of length N, you take the square of the difference between each matched row in the two lists, sum them, multiply by six, divide by N cubed minus N, then subtract the result from one.  This provides the rank correlation squared, which is over the interval [-1, 1].  However, with Spearman you usually use the rank squared, so we leave it that way instead of providing the root; the tuple’s atom label makes it clear that’s happening.

Sorry again about the screwy formatting; I’m just getting things to fit on the blog.  The stuff in the library is better formatted.

spearman_correlation(List1, List2) when
    is_list(List1), is_list(List2),
    length(List1) /= length(List2) -> {error, lists_must_be_same_length};

spearman_correlation(List1, List2) when is_list(List1), is_list(List2) ->

    {TR1,_} = lists:unzip(ordered_ranks_of(List1)),
    {TR2,_} = lists:unzip(ordered_ranks_of(List2)),

    Numerator   = 6 * lists:sum([ (D1-D2)*(D1-D2)
                               || {D1,D2} <- lists:zip(TR1,TR2) ]),
    Denominator = math:pow(length(List1),3)-length(List1),

    {rsquared,1-(Numerator/Denominator)}.

Test data is available at Geography Fieldwork.

1> X = [50,175,270,375,425,580,710,790,890,980].
[50,175,270,375,425,580,710,790,890,980]
2> Y = [1.80,1.20,2.00,1.00,1.00,1.20,0.80,0.60,1.00,0.85].
[1.80000, 1.20000, 2.00000, 1.00000, 1.00000,
 1.20000, 0.800000, 0.600000, 1.00000, 0.850000]
3> scutil:spearman_correlation(X,Y).
{rsquared,-0.730303}

This code is part of the ScUtil library.  This code is free and MIT licensed, because the GPL is evil.  This code uses the ordered ranks code from Part 4.  This closes issue 139.

The Pearson Correlation Coefficient (Statistics in Erlang part 6)


I went over much of the concept of correlations in Part 5; if you don’t know what statistical correlations are, you should read part 5 first.

The Pearson Correlation Coefficient is one method of generating the correlation of sets.  You can get a good math-based explanation at David Lane’s Hyperstat.

In English, basically, you take two numeric lists of the same length, X and Y, then calculate five sums and a length from them:

  1. The sum of the items in List X (SumX)
  2. The sum of the items in List Y (SumY)
  3. The sum of the squares of the items in List X (SumXX)
  4. The sum of the squares of the items in List Y (SumYY)
  5. The sum of the products of the matched items in Lists X and Y (SumXY)
  6. The length of the lists, which should be the same (N)

Using those, you can construct a polynomial which is honestly best expressed in code:

pearson_correlation(List1, List2) when is_list(List1), is_list(List2) ->

    SumXY = lists:sum([A*B || {A,B} <- lists:zip(List1,List2) ]),

    SumX  = lists:sum(List1),
    SumY  = lists:sum(List2),

    SumXX = lists:sum([L*L || L<-List1]),
    SumYY = lists:sum([L*L || L<-List2]),

    N     = length(List1),

    Numer = (N*SumXY) - (SumX * SumY),
    Denom = math:sqrt(((N*SumXX)-(SumX*SumX)) * ((N*SumYY)-(SumY*SumY))),

    {r, (Numer/Denom)}.

This code is part of the ScUtil Library.  The ScUtil library is free and MIT licensed, because the GPL is evil.

1> X = [1,3,5,6,8,9,6,4,3,2].      
[1,3,5,6,8,9,6,4,3,2]
2> Y = [2,5,6,6,7,7,5,3,1,1].  
[2,5,6,6,7,7,5,3,1,1]
3> scutil:pearson_correlation(X,Y).
{r,0.854706}

Verification of test data is available at Changing Minds.  This closes issue 140.

Statistical Correlations (Statistics in Erlang part 5)


Since most of my readers are programmers, I’m going to explain this in programmer-speak.  Also, it’s damned hard to find a non-math explanation of this stuff.

The general idea with correlations is simple: we want to measure how much changes in one set affect the other set – that is, their correlation.  Correlations aren’t about the actual values involved in the two columns, so much as how they seem to affect one another.

A simple example is the edge size and volume of a cube.  As the edge size goes up, so will the volume.  To that end, if you make a two-column table where one column is edge size and the other volume (or, for that matter, face size works too), and then the rows are just a bunch of example data, you would want to see a “perfect correlation” – without fail, the change in column 1 should show a perfect match for changes in column 2.  For a perfect match like that, you get a correlation of 1.0.  Similarly, if you measure the average density of a fixed number of particles in that space, as the edge size goes up, the average density goes down; you would see a “perfect inverse” correlation, or a value of -1.0.  If you measure two values which aren’t correlated – where values in one column don’t seem to affect values in the other – you should get a value at or near zero.

The purpose of the correlation coefficient is to tell you how strongly two columns are correlated, as well as whether their correlation is positive (similar) or negative (inverse).  You can use it to determine whether two sets of measurements are related.

Consider, for example, a table of height and weight among a distribution of people.  One expects a strong correlation, but not perfect; some people are over- and under-weight for their height.  The closer that measurement comes to 1, the less the outside factors matter.  The closer that measurement comes to zero, the less dominant the measured term is in the measured result.  In practical terms, if you see (for example) a stronger correlation between users of Medicine X and outbreaks of Symptom Y than in the general population, that suggests Medicine X may have Symptom Y as a long-term ramification.

The way this is achieved is through ranking, which was covered in Statistics in Erlang part 4.  The general idea is straightforward: just make a list of your values’ ranks from most significant (usually largest) to least, starting counting at 1.  Do that for both columns, then sort by the first column (keep the columns correlated of course).  At that point, what you actually do to measure the correlation varies from method to method, but the general landscape of things should now be apparent: we’re just measuring how much the difference in rank varies when sorted by one column.
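
To make the ranking step concrete, here is a minimal sketch (not part of ScUtil) that ranks two equal-length columns and pairs the ranks up row by row, assuming the ordered_ranks_of() helper from part 4:

```erlang
%% Sketch only: rank both columns in their original order, then pair the
%% ranks by row.  Assumes scutil:ordered_ranks_of/1 from part 4, which
%% returns {Rank, Value} tuples in the column's original order.
rank_pairs(Col1, Col2) when length(Col1) =:= length(Col2) ->
    {Ranks1, _Vals1} = lists:unzip(scutil:ordered_ranks_of(Col1)),
    {Ranks2, _Vals2} = lists:unzip(scutil:ordered_ranks_of(Col2)),
    lists:zip(Ranks1, Ranks2).
```

The correlation methods below differ mainly in what they do with those rank pairs.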

There are several ways to get such correlations.  We’re going to go over the big three – the Pearson Correlation Coefficient, the Kendall Tau Rank Correlation, and the Spearman Rank Correlation Coefficient.  Each one is covered in one of the upcoming tutorials: Pearson Correlation in Erlang (part 6), Spearman Correlation in Erlang (part 7) and Kendall Correlation in Erlang (part 8).

Ranks Of, Ordered Ranks Of and Tied Ranks Of (Statistics in Erlang part 4)


The upcoming Statistics in Erlang post, on correlations, requires some groundwork tools.  At this time, I don’t actually know the proper statistical names for these functions, and in fact such names probably don’t exist, so I just made some up; these functions are treated by the texts describing correlations as an inline part of the process, but the process is much easier to understand if they’re separate, so I separated them.

Witness ranks_of(), ordered_ranks_of() and tied_ranks_of().  On their own, their importance isn’t tremendously apparent, but when we get to correlations, it’ll suddenly become obvious.

  • ranks_of()
    • Ranks of takes a list and reverse sorts it, then returns a series of 2-tuples whose first member is the “rank” or 1-offset ordinal of the item’s list position, and whose second member is the original list value.  Erlang sort is used to order the terms, so they don’t actually need to be strictly numeric, though the function is pretty meaningless otherwise.  Soon, a version will be provided that takes a predicate for custom sorting.
    • Tied (equal) values are listed in order as separate ranks.  If you want those ties to be expressed as average values, use tied_ranks_of() below.  Because there are no ties, ranks are expressed as integers.
    • Values are presented with highest rank (rank 1, highest literal value) sorted to the front.
    • 1> scutil:ranks_of([3,8,22,8,535]).
      [{1,535},{2,22},{3,8},{4,8},{5,3}]
  • tied_ranks_of()
    • Very similar to ranks_of(), except that ranks are presented as tied when values are equivalent.  Notice how with ranks_of(), the two 8s are ranked 3 and 4; here, they are both ranked 3.5.
    • Because ties can create half-values, ranks are presented as floats.
    • 86> scutil:tied_ranks_of([3,8,22,8,535]).
      [{1.00000,535},{2.00000,22},{3.50000,8},{3.50000,8},{5.00000,3}]
  • ordered_ranks_of()
Correlations frequently need to be column cross-sorted.  The easiest way to deal with this is to provide a ranking function which returns the rankings in the column’s original order.  ordered_ranks_of() is tied_ranks_of(), except that it does not alter the list’s ordering.
    • 3> scutil:ordered_ranks_of([3,8,22,8,535]).
      [{5.00000,3},{3.50000,8},{2.00000,22},{3.50000,8},{1.00000,535}]

These functions are important building blocks to the correlations, coming up in parts 5 through 7.

As usual, this code is part of the ScUtil library. ScUtil is free and MIT licensed, because the GPL is evil.

This code doesn’t need any of the prior parts 1, 2, or 3 of Statistics in Erlang, but I’m linking them to make them easy to find.

This closes issue 129.

Sorry about the screwy formatting; it’s hard fitting code into a blog.  The formatting in the module is much nicer.

ranks_of(List) when is_list(List) ->
    lists:zip(lists:seq(1,length(List)),lists:reverse(lists:sort(List))).

tied_ranks_of(List) ->
    tied_rank_worker(ranks_of(List), [], no_prev_value).

tied_add_prev(Work, {FoundAt, NewValue}) ->
    lists:duplicate(
        length(FoundAt),
        {lists:sum(FoundAt) / length(FoundAt), NewValue}
    ) ++ Work.

tied_rank_worker([], Work, PrevValue) ->
    lists:reverse(tied_add_prev(Work, PrevValue));

tied_rank_worker([Item|Remainder], Work, PrevValue) ->
    case PrevValue of
        no_prev_value ->
            {BaseRank,BaseVal} = Item,
            tied_rank_worker(Remainder, Work, {[BaseRank],BaseVal});
        {FoundAt,OldVal} ->
            case Item of
                {Id,OldVal} ->
                    tied_rank_worker(
                        Remainder,
                        Work,
                        {[Id]++FoundAt,OldVal});
                {Id,NewVal} ->
                    tied_rank_worker(Remainder,
                        tied_add_prev(Work, PrevValue),
                        {[Id],NewVal})
            end
    end.

ordered_ranks_of(List) when is_list(List) ->
    ordered_ranks_of(List, tied_ranks_of(List), []).

ordered_ranks_of([], [], Work) ->
    lists:reverse(Work);

ordered_ranks_of([Front|Rem], Ranks, Work) ->
    {value,Item} = lists:keysearch(Front,2,Ranks),
    {IRank,Front} = Item,
    ordered_ranks_of(Rem, Ranks--[Item], [{IRank,Front}]++Work).

Standard Deviation, Root Mean Square and the Central Moments (Statistics in Erlang part 3)


These are standard statistical tools for measuring differences, drift and error within sets, as well as Kth moments about the mean.  Central moments are also the building blocks of skewness and kurtosis, which are covered in a later post.  Root mean square is particularly useful as a measure of set error.  Standard deviation is great for figuring out how much spread/differentiation there is within a set.

This code requires the arithmetic mean stuff from Statistics in Erlang part 1.  There’s interesting, unrelated stuff in Part 2.

As usual, this code is part of the ScUtil library.  ScUtil is free and MIT licensed, because the GPL is evil.

This implements issue 128.  This implements issue 138.  This finishes implementing issue 119.

std_deviation(Values) when is_list(Values) ->
    Mean = arithmetic_mean(Values),
    math:sqrt(arithmetic_mean([ (Val-Mean)*(Val-Mean) || Val <- Values ])).

root_mean_square(List) when is_list(List) ->
    math:sqrt(arithmetic_mean([ Val*Val || Val <- List ])).

central_moments(Items) ->
    Cnt  = length(Items),
    Mean = lists:sum(Items) / Cnt,
    TFun = fun(X) ->
        Base = X-Mean, B2=Base*Base, B3=B2*Base, B4=B3*Base, {B2,B3,B4}
    end,
    collapse_central_moments(Cnt, [ TFun(I) || I <- Items ], {0,0,0}).

collapse_central_moments(N, [], {WM2, WM3, WM4}) ->
    { WM2/N, WM3/N, WM4/N };

collapse_central_moments(N, [{I2,I3,I4}|Rem], {WM2, WM3, WM4}) ->
    collapse_central_moments(N, Rem, {I2+WM2, I3+WM3, I4+WM4}).

Mean, Median, Mode and Histograph (Statistics in Erlang Part 2)


Mean is a complex topic, and is covered in Statistics in Erlang part 1.

Median and mode are less complex.  It’s worth noting that mode reports its results as a list, because it’s possible for there to be several modes for a list.  Also, mode is not strictly numeric – it works for mixed-type lists too (you can take the mode of a list of atoms, for example.)

Mode is really just a reduction of the results of histograph, which is similarly open to arbitrary type list contents.  Median also uses a toy function, even_or_odd, which is provided here.

As usual, this code is part of the ScUtil library.  ScUtil is free and MIT licensed, because the GPL is evil.

This closes issue 100.  This closes issue 105.  This closes issue 134.  This closes issue 135.

histograph(List) when is_list(List) ->
    [Head|Tail] = lists:sort(List),
    histo_count(Tail, Head, 1, []).

histo_count([], Current, Count, Work) ->
    lists:reverse([{Current,Count}]++Work);

histo_count([Current|Tail], Current, Count, Work) ->
    histo_count(Tail, Current, Count+1, Work);

histo_count([New|Tail], Current, Count, Work) ->
    histo_count(Tail, New, 1, [{Current,Count}]++Work).

even_or_odd(Num) when is_integer(Num) ->
    if
        Num band 1 == 0 -> even;
        true            -> odd
    end.

median(List) when is_list(List) ->
    SList = lists:sort(List),
    Length = length(SList),
    case even_or_odd(Length) of
        even -> [A,B] = lists:sublist(SList, round(Length/2), 2), (A+B)/2;
        odd  -> lists:nth( round((Length+1)/2), SList )
    end.

mode([]) -> [];

mode(List) when is_list(List) ->
    mode_front(lists:reverse(lists:keysort(2, scutil:histograph(List)))).

mode_front([{Item,Freq}|Tail]) ->
    mode_front(Tail, Freq, [Item]).

mode_front([ {Item, Freq} | Tail], Freq, Results) ->
    mode_front(Tail, Freq, [Item]++Results);

mode_front([{_Item,_Freq} |_Tail],_Better, Results) ->
    Results;

mode_front([], _Freq, Results) -> Results.
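
A few hand-worked calls against the functions above (expected values are in the comments; note the atom-list examples, since mode and histograph aren’t restricted to numbers):

```erlang
%% Hand-computed examples.
histograph([b,a,b,c,c,c]).  % [{a,1},{b,2},{c,3}]  (pairs sorted by value)
mode([b,a,b,c,c,c]).        % [c]    (always a list; there can be several modes)
median([1,3,5,7]).          % even length: (3 + 5) / 2 = 4.0
median([1,3,5]).            % odd length: the middle element, 3
even_or_odd(7).             % odd
```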

Arithmetic Mean, Geometric Mean, Harmonic Mean and Weighted Arithmetic Mean (Statistics in Erlang Part 1)


I’ll be putting up some statistical functions I’ve had to write recently.  This is the first batch.  I confess, I find the Erlang implementations far more readable than the pure math definitions one finds around; I’ve been thinking about writing tutorials, but with this code here, I’m not entirely sure it’s necessary.

At any rate, the code follows.  As with so much of my Erlang code, this code is part of the ScUtil library.  ScUtil is free and MIT licensed, because the GPL is evil.

There are more statistics coming; I just don’t want to make any one post too huge, and I don’t want keyword saturation.

This partially closes issue 119.  This closes issue 133.  This closes issue 136.  This closes issue 137.

list_product(List) when is_list(List) ->
    list_product(List, 1).

list_product([], Counter) ->
    Counter;

list_product([Head|Tail], Counter) ->
    list_product(Tail, Counter*Head).

arithmetic_mean(List) when is_list(List) ->
    lists:sum(List) / length(List).

geometric_mean(List) when is_list(List) ->
    math:pow(scutil:list_product(List), 1/length(List)).

harmonic_mean(List) when is_list(List) ->
    length(List) / lists:sum([ 1/X || X<-List ]).

weighted_arithmetic_mean(List) when is_list(List) ->
    weighted_arithmetic_mean(List, 0, 0).

weighted_arithmetic_mean([], Num, Denom) ->
    Num/Denom;

weighted_arithmetic_mean([{W,V}|Tail], Num, Denom) ->
    weighted_arithmetic_mean(Tail, Num+(W*V), Denom+W).

Implementing Geometric Mean in MySQL


This turns out to be pretty easy.  You might also want to read Implementing Harmonic Mean in MySQL.

Why bother blogging something this simple?  Because I couldn’t find one pre-made, and that means it’s time to bring my mighty google rank of like negative two to bear to fix the problem.

CREATE TABLE example(val integer);
INSERT INTO example VALUES(1),(2),(4),(8),(16);
SELECT exp(avg(ln(val))) as gmean from example;

That’s it.  You should see output like this:

mysql> CREATE TABLE example(val integer);
Query OK, 0 rows affected (0.09 sec)

mysql> INSERT INTO example VALUES(1),(2),(4),(8),(16);
Query OK, 5 rows affected (0.01 sec)
Records: 5  Duplicates: 0  Warnings: 0

mysql> SELECT exp(avg(ln(val))) as gmean from example;
+-------+
| gmean |
+-------+
|     4 |
+-------+
1 row in set (0.00 sec)

Still haven’t figured out how to do central moments, skewness or kurtosis.

Implementing Harmonic Mean in MySQL


This turns out to be pretty easy.  You might also want to read Implementing Geometric Mean in MySQL.

Why bother blogging something this simple?  Because I couldn’t find one pre-made, and that means it’s time to bring my mighty google rank of like negative two to bear to fix the problem.

CREATE TABLE example(val integer);
INSERT INTO example VALUES(1),(2),(4),(8),(16);
SELECT count(val) / sum(1/val) as hmean from example;

That’s it.  You should see output like this:

mysql> CREATE TABLE example(val integer);
Query OK, 0 rows affected (0.09 sec)

mysql> INSERT INTO example VALUES(1),(2),(4),(8),(16);
Query OK, 5 rows affected (0.01 sec)
Records: 5  Duplicates: 0  Warnings: 0

mysql> SELECT count(val)/sum(1/val) as hmean from example;
+--------+
| hmean  |
+--------+
| 2.5806 |
+--------+
1 row in set (0.00 sec)

Still haven’t figured out how to do central moments, skewness or kurtosis.