Statistical Correlations (Statistics in Erlang part 5)

4:45 pm Erlang, Math, Programming, Statistics, Tools and Libraries

Since most of my readers are programmers, I’m going to explain this in programmer-speak.  Also, it’s damned hard to find a non-math explanation of this stuff.

The general idea with correlations is simple: we want to measure how closely changes in one set track changes in the other set; that is, their correlation.  Correlations aren’t about the actual values in the two columns so much as how the columns move together.

A simple example is the edge length and volume of a cube.  As the edge length goes up, so does the volume.  So if you make a two-column table where one column is edge length and the other is volume (or, for that matter, face area works too), and the rows are just a bunch of example data, you would expect to see a “perfect correlation”: without fail, every increase in column 1 is matched by an increase in column 2.  For a perfect match like that, the rank-based coefficients give a correlation of exactly 1.0 (Pearson, which measures straight-line relationships, comes out below 1.0 here because volume grows as the cube of the edge).  Similarly, if you measure the average density of a fixed number of particles in that space, the average density goes down as the edge length goes up; you would see a “perfect inverse” correlation, or a value of -1.0.  If you measure two values which aren’t correlated, where values in one column tell you nothing about values in the other, you should get a value at or near zero.
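To make the cube example concrete, here is a minimal sketch that builds the table and checks which way the second column moves.  The module name `cube_table`, both function names and the particle count of 1000 are made up for illustration; this is not one of the correlation coefficients, just a check of the "always rises" and "always falls" patterns described above.

```erlang
-module(cube_table).
-export([rows/1, direction/1]).

%% Build {Edge, Volume, Density} rows for a cube holding a fixed
%% number of particles (1000 is an arbitrary, illustrative count).
rows(Edges) ->
    [{E, E * E * E, 1000 / (E * E * E)} || E <- Edges].

%% Given at least two {X, Y} pairs, report whether Y always rises as
%% X rises (up), always falls as X rises (down), or neither (mixed).
direction(Pairs) ->
    Sorted = lists:keysort(1, Pairs),
    Ys = [Y || {_, Y} <- Sorted],
    %% Pair each Y with the one that follows it.
    Steps = lists:zip(lists:droplast(Ys), tl(Ys)),
    Rising = lists:all(fun({A, B}) -> B > A end, Steps),
    Falling = lists:all(fun({A, B}) -> B < A end, Steps),
    if
        Rising -> up;
        Falling -> down;
        true -> mixed
    end.
```

With `Rows = cube_table:rows([1, 2, 3, 4, 5])`, the edge and volume columns give `up` (the pattern behind a rank correlation of 1.0) and the edge and density columns give `down` (the pattern behind -1.0).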

The purpose of the correlation coefficient is to tell how strongly two columns are correlated, as well as whether their correlation is positive (similar) or negative (inverse).  You can use it to determine whether two sets of measurements are related.

Consider, for example, a table of height and weight across a population of people.  One expects a strong correlation, but not a perfect one; some people are over- or under-weight for their height.  The closer the coefficient comes to 1, the less the outside factors matter.  The closer it comes to zero, the less one measurement tells you about the other.  In practical terms, if you see (for example) a higher rate of Symptom Y among users of Medicine X than in the general population, that is a reason to suspect that Medicine X has Symptom Y as a long-term ramification; the correlation alone doesn’t prove the medicine is the cause, since some third factor could be driving both.

For the rank-based methods, this is achieved through ranking, which was covered in Statistics in Erlang part 4.  (Pearson is the exception: it works on the raw values rather than their ranks.)  The general idea is straightforward: just make a list of your values’ ranks from most significant (usually largest) to least, starting counting at 1.  Do that for both columns, then sort by the first column (keeping each row’s two values paired, of course).  At that point, what you actually do to measure the correlation varies from method to method, but the general landscape of things should now be apparent: we’re measuring how far the second column’s ranks stray from the first column’s once the rows are sorted by one column.
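The ranking steps above can be sketched roughly as follows.  The module name `rank_sketch` and its functions are hypothetical and are not the code from part 4; in particular, this sketch assumes there are no tied values, and simply gives equal values consecutive ranks.

```erlang
-module(rank_sketch).
-export([ranks/1, rank_table/1]).

%% Return the rank of each value, in the original order, counting
%% from 1 for the largest value.  Assumption: no special handling of
%% ties; equal values just receive consecutive ranks.
ranks(Values) ->
    N = length(Values),
    Indexed = lists:zip(lists:seq(1, N), Values),
    Descending = lists:sort(fun({_, A}, {_, B}) -> A >= B end, Indexed),
    %% Ranked is a list of {Rank, {OriginalIndex, Value}}.
    Ranked = lists:zip(lists:seq(1, N), Descending),
    ByIndex = lists:keysort(1, [{I, R} || {R, {I, _}} <- Ranked]),
    [R || {_, R} <- ByIndex].

%% Rank both columns of a list of {X, Y} rows, keep each row's two
%% ranks paired, sort by the first column's rank, and attach the rank
%% difference: [{RankX, RankY, RankX - RankY}].
rank_table(Pairs) ->
    {Xs, Ys} = lists:unzip(Pairs),
    Rows = lists:zip(ranks(Xs), ranks(Ys)),
    [{Rx, Ry, Rx - Ry} || {Rx, Ry} <- lists:keysort(1, Rows)].
```

For the cube rows `[{1,1},{2,8},{3,27}]`, `rank_sketch:rank_table/1` returns `[{1,1,0},{2,2,0},{3,3,0}]`: every rank difference is zero, which is the perfect case.  For some made-up height and weight rows, `[{180,80},{165,70},{175,60},{190,95}]`, it returns `[{1,1,0},{2,2,0},{3,4,-1},{4,3,1}]`; the non-zero differences are what the individual rank-based methods go on to summarize.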

There are several ways to get such correlations.  We’re going to go over the big three – the Pearson Correlation Coefficient, the Kendall Tau Rank Correlation, and the Spearman Rank Correlation Coefficient.  Each one is covered in one of the upcoming tutorials: Pearson Correlation in Erlang (part 6), Spearman Correlation in Erlang (part 7) and Kendall Correlation in Erlang (part 8).
