c++0x standard finalized; now open for last international comment phase
October 31, 2008 C/C++, General Interest, Programming No CommentsIt’s finally done!
And, thank god, “concepts” are officially in.
It’s finally done!
And, thank god, “concepts” are officially in.
[digg-reddit-me]One of the things I was most looking forward to about my new iPhone, knowing there were SSH clients, was the ability to use it as a genuinely remote terminal, no matter where I was, to do little shell tasks and write simple code and so on.
Ha! The problem is, the iPhone has an autocorrecting keyboard which corrects if you don’t tell it not to (the obnoxious kind like Outlook has), and it makes completely asinine replacements (its becomes it’s, as if the word its doesn’t exist). This is bad enough if you just speak above the level of an eighth grader, but it makes using unix shells and writing code genuinely impossible.
Classic apple zealot response from IRC: “don’t be stupid, just teach the iPhone every word you want to use when programming.” Like they don’t even think before they answer.
Apple: why can’t I turn this off? It’d be simple enough: there’s bound to be some function somwhere get_best_replacement(char* current), which signals no reasonable match (as you get for, say, ‘zzzzz’) by way of an empty string, or something similar. That’s the hack point. Add if (customer_isnt_retarded()) { return “”; } else { previous_logic(); } and it’s fixed.
Seriously, who locks people into an autocorrecting keyboard? Ugh. This ruins the iPhone for any kind of technical use. What a mess.
If you hate this too, vote this up on digg and reddit, so that an Apple employee will see it.
[digg-reddit-me]No, really.
It’s not really a supercomputer; it’s a cluster. Specifically, it’s an eight node cluster in one box. Each node can have up to two Quad Xeons and up to 64 gig of RAM, meaning the box, maxed out, would be running 64 cores and half a terabyte of ram. It runs Windows Server HPC 2008, which is why Microsoft is involved.
No word on how decked out the box is at $25,000, but it’s supposed to be selling on Amazon starting today, and Amazon’s good about giving out specs. As of this writing it isn’t yet available.
Presumably because of the success of the NetFlix Prize, ESPN decided to hold a hundred thousand dollar purse to see who could provide the best algorithm to predict the outcome of upcoming college football games. Fortunately, ESPN went looking for experienced help to design such a game. Unfortunately, they chose TopCoder.
It’s a shame that ESPN chose to do this through TopCoder, as TopCoder’s general practices are poison for a machine learning contest. TopCoder chose to impose a gig memory limit and a nine minute runtime on any approach to this problem, which murders most machine learning tactics right out the door. It’s a shame they didn’t do this themselves on the NetFlix model, where contestants just submit predictions.
This contest isn’t to get football predictions. It’s to get football predictions under arbitrary ram and cpu caps. ESPN’s staff wouldn’t face such restrictions when using the work – one gig for nine minutes? C’mon; there is literally no reason for this limitation to exist.
This contest’s design precludes most modern approaches to machine learning to no appreciable benefit, and is therefore fundamentally flawed. ESPN is going to get seriously quality-limited results.
Very disappointing. That money would go to much better effect if the contest had been designed with the kind of foresight of which the NetFlix Prize had had the benefit.
[digg-reddit-me]So, it was pointed out to me that I had the central moments, but not the moments – ie, the ones not normalized against the input’s average. Also, it was pointed out that most people don’t know that kurtosis and skewness are related to the central moments. Furthermore, it turns out (and I didn’t know this) that there are in fact meaningful uses of floating-point exponents in moments.
So, I implemented moments, I replaced my central moments implementation, and I gave name wrappers for skewness and kurtosis to make them easier to identify.
This closes issues 169, 170, 171, 172 and 173. As usual, this code is part of the ScUtil library, which is free and MIT licensed, because the GPL is evil.
moment(List, N) when is_list(List), is_number(N) ->
scutil:arithmetic_mean( [ pow(Item, N) || Item <- List ] ).
moments(List) ->
moments(List, [2,3,4]).
moments(List, Moments) when is_list(Moments) ->
[ moment(List, M) || M <- Moments ].
central_moment(List, N) when is_list(List), is_number(N) ->
ListAMean = scutil:arithmetic_mean(List),
scutil:arithmetic_mean( [ pow(Item-ListAMean, N) || Item <- List ] ).
central_moments(List) ->
central_moments(List, [2,3,4]).
central_moments(List, Moments) when is_list(Moments) ->
[ central_moment(List, M) || M <- Moments ].
skewness(List) -> central_moment(List, 3).
kurtosis(List) -> central_moment(List, 4).
[digg-reddit-me]Spearman’s Rank Correlation Coefficient – usually just called The Spearman Correlation, sometimes Spearman’s Rank or Spearman’s Rho – is a method of determining the similarity between two numeric sets (we also go over the Pearson and the Kendall). If you don’t know what a correlation is, start with Statistics in Erlang part 5, which covers the basic idea of correlations. A good math-based explanation is available at David M Lane’s Hyperstat.
In English, for two lists of length N, you take the square of the difference between each matched row in the two lists, sum them, multiply by six, and divide by N cubed minus N. This provides the rank correlation squared, which is over the interval [-1, 1]. However, with spearman you usually use the rank squared, so we leave it that way instead of providing the root; the tuple’s atom label makes it clear that’s happening.
Sorry again about the screwy formatting; I’m just getting things to fit on the blog. The stuff in the library is better formatted.
spearman_correlation(List1, List2) when
is_list(List1), is_list(List2),
length(List1) /= length(List2) -> {error, lists_must_be_same_length};
spearman_correlation(List1, List2) when is_list(List1), is_list(List2) ->
{TR1,_} = lists:unzip(ordered_ranks_of(List1)),
{TR2,_} = lists:unzip(ordered_ranks_of(List2)),
Numerator = 6 * lists:sum([ (D1-D2)*(D1-D2)
|| {D1,D2} <- lists:zip(TR1,TR2) ]),
Denominator = math:pow(length(List1),3)-length(List1),
{rsquared,1-(Numerator/Denominator)}.
Test data is available at Geography Fieldwork.
1> X = [50,175,270,375,425,580,710,790,890,980].
[50,175,270,375,425,580,710,790,890,980]
2> Y = [1.80,1.20,2.00,1.00,1.00,1.20,0.80,0.60,1.00,0.85].
[1.80000, 1.20000, 2.00000, 1.00000, 1.00000,
1.20000, 0.800000, 0.600000, 1.00000, 0.850000]
3> scutil:spearman_correlation(X,Y).
{rsquared,-0.730303}
This code is part of the ScUtil library. This code is free and MIT licensed, because the GPL is evil. This code uses the ordered ranks code from Part 4. This closes issue 139.
[digg-reddit-me]I went over much of the concept of correlations in Part 5; if you don’t know what statistical correlations are, you should read part 5 first.
The Pearson Correlation Coefficient is one method of generating the correlation of sets. You can get a good math-based explanation at David Lane’s Hyperstat.
In english, basically, you take two numeric lists of the same length, X and Y, then calculate five sums and a length from them:
Using those, you can construct a polynomial which is honestly best expressed in code:
pearson_correlation(List1, List2) when is_list(List1), is_list(List2) ->
SumXY = lists:sum([A*B || {A,B} <- lists:zip(List1,List2) ]), ]
SumX = lists:sum(List1),
SumY = lists:sum(List2),
SumXX = lists:sum([L*L || L<-List1]),
SumYY = lists:sum([L*L || L<-List2]),
N = length(List1),
Numer = (N*SumXY) - (SumX * SumY),
Denom = math:sqrt(((N*SumXX)-(SumX*SumX)) * ((N*SumYY)-(SumY*SumY))),
{r, (Numer/Denom)}.
This code is part of the ScUtil Library. The ScUtil library is free and MIT licensed, because the GPL is evil.
1> X = [1,3,5,6,8,9,6,4,3,2].
[1,3,5,6,8,9,6,4,3,2]
2> Y = [2,5,6,6,7,7,5,3,1,1].
[2,5,6,6,7,7,5,3,1,1]
3> scutil:pearson_correlation(X,Y).
{r,0.854706}
Verification of test data is available at Changing Minds. This closes issue 140.
[digg-reddit-me]Since most of my readers are programmers, I’m going to explain this in programmer-speak. Also, it’s damned hard to find a non-math explanation of this stuff.
The general idea with correlations is simple: we want to measure how much changes in one set affect the other set – that is, their correlation. Correlations aren’t about the actual values involved in the two columns, so much as how they seem to affect one another.
A simple example is the edge size and volume of a cube. As the edge size goes up, so will the volume. To that end, if you make a two-column table where one column is edge size and the other volume (or, for that matter, face size works too), and then the rows are just a bunch of example data, you would want to see a “perfect correlation” – without fail, the change in column 1 should show a perfect match for changes in column 2. For a perfect match like that, you get a correlation of 1.0. Similarly, if you measure the average density of a fixed number of particles in that space, as the edge size goes up, the average density goes down; you would see a “perfect inverse” correlation, or a value of -1.0. If you measure two values which aren’t correlated – where values in one column don’t seem to affect values in the other – you should get a value at or near zero.
The purpose of the correlation coefficient is to tell how how strongly two columns are correlated, as well as whether their correlation is positive (similar) or negative (inverse). You can use measurements to determine whether sets of measurements are related.
Consider, for example, a table of height and weight among a distribution of people. One expects a strong correlation, but not perfect; some people are over- and under-weight for their height. The closer that measurement comes to 1, the less the outside factors matter. The closer that measurement comes to zero, the less dominant the measured term is in the measured result. In practical terms, if you see (for example) a stronger correlation between users of Medicine X and outbreaks of Symptom Y than in the general population, it is likely that Medicine X has Symptom Y as a long-term ramification.
The way this is achieved is through ranking, which was covered in Statistics in Erlang part 4. The general idea is straightforward: just make a list of your values’ ranks from most significant (usually largest) to least, starting counting at 1. Do that for both columns, then sort by the first column (keep the columns correlated of course). At that point, what you actually do to measure the correlation varies from method to method, but the general landscape of things should now be apparent: we’re just measuring how much the difference in rank varies when sorted by one column.
There are several ways to get such correlations. We’re going to go over the big three – the Pearson Correlation Coefficient, the Kendall Tau Rank Correlation, and the Spearman Rank Correlation Coefficient. Each one is covered in one of the upcoming tutorials: Pearson Correlation in Erlang (part 6), Spearman Correlation in Erlang (part 7) and Kendall Correlation in Erlang (part 8).
[digg-reddit-me]The upcoming Statistics in Erlang post, on correlations, requires some groundwork tools. At this time, I don’t actually know the proper statistical names for these functions, and in fact such names probably don’t exist, so I just made some up; these functions are treated by the texts describing correlations as an inline part of the process, but the process is much easier to understand if they’re seperate, so I seperated them.
Witness ranks_of(), ordered_ranks_of() and tied_ranks_of(). On their own, their importance isn’t tremendously apparent, but when we get to correlations, it’ll suddenly become obvious.
These functions are important building blocks to the correlations, coming up in parts 5 and 6.
As usual, this code is part of the ScUtil library. ScUtil is free and MIT license, because the GPL is evil.
This code doesn’t need any of the prior parts 1, 2, or 3 of Statistics in Erlang, but I’m linking them to make them easy to find.
This closes issue 129.
Sorry about the screwy formatting; it’s hard fitting code into a blog. The formatting in the module is much nicer.
ranks_of(List) when is_list(List) ->
lists:zip(lists:seq(1,length(List)),lists:reverse(lists:sort(List))).
tied_ranks_of(List) ->
tied_rank_worker(ranks_of(List), [], no_prev_value).
tied_add_prev(Work, {FoundAt, NewValue}) ->
lists:duplicate(
length(FoundAt),
{lists:sum(FoundAt) / length(FoundAt), NewValue}
) ++ Work.
tied_rank_worker([], Work, PrevValue) ->
lists:reverse(tied_add_prev(Work, PrevValue));
tied_rank_worker([Item|Remainder], Work, PrevValue) ->
case PrevValue of
no_prev_value ->
{BaseRank,BaseVal} = Item,
tied_rank_worker(Remainder, Work, {[BaseRank],BaseVal});
{FoundAt,OldVal} ->
case Item of
{Id,OldVal} ->
tied_rank_worker(
Remainder,
Work,
{[Id]++FoundAt,OldVal});
{Id,NewVal} ->
tied_rank_worker(Remainder,
tied_add_prev(Work, PrevValue),
{[Id],NewVal})
end
end.
ordered_ranks_of(List) when is_list(List) ->
ordered_ranks_of(List, tied_ranks_of(List), []).
ordered_ranks_of([], [], Work) ->
lists:reverse(Work);
ordered_ranks_of([Front|Rem], Ranks, Work) ->
{value,Item} = lists:keysearch(Front,2,Ranks),
{IRank,Front} = Item,
ordered_ranks_of(Rem, Ranks--[Item], [{IRank,Front}]++Work).
[digg-reddit-me]These are standard statistical tools for measuring differences, drift and error within sets, as well as Kth moments about the mean. Central moments are also the building blocks of skewness and kurtosis, which are covered in a later post. Root mean square is particularly useful as a measure of set error. Standard deviation is great for figuring out how much spread/differentiation there is within a set.
This code requires the arithmetic mean stuff from Statistics in Erlang part 1. There’s interesting, unrelated stuff in Part 2.
As usual, this code is part of the ScUtil library. ScUtil is free and MIT license, because the GPL is evil.
This implements issue 128. This implements issue 138. This finishes implementing issue 119.
std_deviation(Values) when is_list(Values) ->
Mean = arithmetic_mean(Values),
math:sqrt(arithmetic_mean([ (Val-Mean)*(Val-Mean) || Val <- Values ])).
root_mean_square(List) when is_list(List) ->
math:sqrt(arithmetic_mean([ Val*Val || Val <- List ])).
central_moments(Items) ->
Cnt = length(Items),
Mean = lists:sum(Items) / Cnt,
TFun = fun(X) ->
Base = X-Mean, B2=Base*Base, B3=B2*Base, B4=B3*Base, {B2,B3,B4}
end,
collapse_central_moments(Cnt, [ TFun(I) || I <- Items ], {0,0,0}).
collapse_central_moments(N, [], {WM2, WM3, WM4}) ->
{ WM2/N, WM3/N, WM4/N };
collapse_central_moments(N, [{I2,I3,I4}|Rem], {WM2, WM3, WM4}) ->
collapse_central_moments(N, Rem, {I2+WM2, I3+WM3, I4+WM4}).