Sunday, July 26, 2009

The Netflix prize winner

I'll post some reflections on the Netflix prize at a later date, but as someone who has been with the competition since the beginning, I thought it might be useful to explain why the second-place team on the leaderboard, BellKor's Pragmatic Chaos, are almost definitely the winners.

The reason is that every competitor is judged against two datasets. The first is the Quiz dataset, whose results are reported back to the competitors and appear on the leaderboard. The second is the Test dataset, whose results are not reported back and which is actually used to determine the winner. The purpose of this is to stop what is called "overfitting", i.e. using the score you get back on the Quiz dataset every time you make a submission to work out the actual values.

Now with roughly 1.4 million datapoints it's impossible to figure out each value, but I'm sure both teams used some of the information from their Quiz results to work out the optimal combination of their predictions to submit. Given that the teams are separated by less than one point in the fourth decimal place, only a very small amount of overfitting could cause the positions to switch. In this case it looks like BellKor's Pragmatic Chaos overfitted slightly less than "The Ensemble" and hence, according to the posts on the Netflix forum, are the ones to be validated.
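To make the Quiz versus Test point concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the ratings are random, the two "models" are just the true ratings plus noise, and the names (`make_split`, `tune_on_quiz`) are my own. It is not how either team actually blended their submissions. It only shows the general idea of picking a blend weight from Quiz feedback and then being scored on a separate Test set.

```python
import math
import random

def rmse(predictions, ratings):
    """Root mean squared error, the metric used on the leaderboard."""
    total = sum((p - r) ** 2 for p, r in zip(predictions, ratings))
    return math.sqrt(total / len(ratings))

def blend(a, b, w):
    """Linear blend of two prediction lists: w * a + (1 - w) * b."""
    return [w * x + (1 - w) * y for x, y in zip(a, b)]

def make_split(rng, n):
    """Hypothetical data: random 1-5 ratings and two noisy 'models'."""
    ratings = [rng.randint(1, 5) for _ in range(n)]
    model_a = [r + rng.gauss(0, 0.95) for r in ratings]
    model_b = [r + rng.gauss(0, 1.00) for r in ratings]
    return ratings, model_a, model_b

def tune_on_quiz(quiz, weights):
    """Pick the blend weight with the lowest RMSE on the Quiz split."""
    ratings, a, b = quiz
    return min(weights, key=lambda w: rmse(blend(a, b, w), ratings))

rng = random.Random(0)          # fixed seed so the run is repeatable
quiz = make_split(rng, 2000)    # scores on this split are fed back
test = make_split(rng, 2000)    # this split decides the winner

weights = [i / 100 for i in range(101)]
best_w = tune_on_quiz(quiz, weights)

quiz_rmse = rmse(blend(quiz[1], quiz[2], best_w), quiz[0])
test_rmse = rmse(blend(test[1], test[2], best_w), test[0])

print(f"weight tuned on Quiz: {best_w:.2f}")
print(f"Quiz RMSE: {quiz_rmse:.4f}")
print(f"Test RMSE: {test_rmse:.4f}")
```

The weight is the best one for the Quiz split by construction, so the Quiz score is a slightly flattering estimate of how the blend does on data it was not tuned against. The two scores will generally differ somewhere in the later decimal places, which is the scale at which this contest was decided.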

There is a final stage still to go through: validation, where the top team have to demonstrate how they achieved their results and publish how they did it. Given that BellKor have been through this twice before on the progress prize, it should be a formality.

3 comments:

Michael said...

I think this is a good explanation. Now the question is: why was Ensemble's submission more over-fitted?

My first guess is that Ensemble blended more submissions than BellKor et al. And these submissions, generated by a diverse group of developers, were all tuned in some way to minimize the quiz score. Since you have more submissions to blend, each with a bias towards lowering the quiz score, you are more prone to be cursed by over-fitting.

It could also be that Ensemble used the quiz scores with their blending algorithm, which may have put them past the 10% threshold, but ultimately burned them.

Data Miner said...

If BellKor has a lower test score, you can claim that Ensemble over-fitted more, but... both teams broke the 10% threshold on the test set, so the difference is at most a few hundredths of a percent (unless BellKor performed better on test than quiz). This is just noise. The reality is that the outcome was decided by lottery and BellKor held the lucky ticket!

Just a guy in a garage said...

Data Miner - I can only agree with you - this one must have been very close.