Wednesday, June 29, 2011

Recommendation system competitions

Having met (in person) three people in the last month who were kind enough to say that they read my blog, I reckon I ought to spend a little more time keeping it up to date.

This entry describes my efforts in the KDD Cup 2011 competition, for which Yahoo kindly supplied some 300 million music ratings. No Netflix-style $1m at the end of it, but considerable kudos if you can win.

From the outset I decided to limit myself to a single method (stochastic gradient descent of "real world" models of human rating behaviour, since you ask). As the only psychologist (at least, the only one who has come out so far) entering these kinds of competition, I'm determined to try and drive through approaches based on new models of human behaviour rather than try to improve a computer science approach to the problem. In addition, I set myself the target of creating a single model that would perform close to the ensemble methods that won the Netflix prize (mainly, I must admit, because I didn't have time to try out a variety of approaches).
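For anyone unfamiliar with the machinery, here is a minimal sketch of what a stochastic gradient descent update for a rating model looks like in practice. The toy model below (global mean plus user and item biases) is purely illustrative and is not the behavioural model I used; the function name, learning rate and regularisation values are all made up.

```python
# Minimal SGD sketch for a toy rating model: global mean + user bias + item bias.
# Purely illustrative -- not the behavioural model described in this post.
from collections import defaultdict

def train_sgd(ratings, n_epochs=10, lr=0.005, reg=0.02):
    """ratings: list of (user_id, item_id, score) tuples, scores on a 0-100 scale."""
    mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean
    b_user = defaultdict(float)                          # per-user bias
    b_item = defaultdict(float)                          # per-item bias
    for _ in range(n_epochs):
        for u, i, r in ratings:
            pred = mu + b_user[u] + b_item[i]
            err = r - pred
            # gradient step on the squared error, with L2 regularisation
            b_user[u] += lr * (err - reg * b_user[u])
            b_item[i] += lr * (err - reg * b_item[i])
    return mu, b_user, b_item

ratings = [("u1", "song_a", 90), ("u1", "song_b", 70), ("u2", "song_a", 30)]
mu, b_user, b_item = train_sgd(ratings)
print(mu + b_user["u2"] + b_item["song_b"])  # prediction for an unseen user/item pair
```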

I guess my biggest discovery - which was enough to propel me to 5th place at one stage of the competition - was that ratings are not made in isolation: they are made relative to the other items being rated at the time and the ratings those items receive. One simple example is the anchoring effect well known in behavioural (behavioral, I guess, for the US) economics: if you give a high value to one rating, you are likely to give a high value to the next one regardless of how much you actually like it, and vice versa. This effect can plainly be seen in the Yahoo dataset, as there is a very high correlation between consecutive ratings even after adjusting for all other effects. So unless Yahoo were feeding items to be rated in a particular order, this would seem to provide significant support for the existence of such an effect. In fact, without knowing anything at all about preferences for different items, you can achieve a score of around 24.5 RMSE on the validation set simply by exploiting this fact.
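To make the anchoring idea concrete, here is a toy sketch of the kind of preference-free baseline I mean: predict each rating as a blend of the user's running mean and the previous rating in the same session. The blending weight and the data are invented for illustration - the ~24.5 RMSE figure comes from the real validation set, not from anything like this toy.

```python
# Toy anchoring baseline: predict a rating purely from the user's mean
# and the previous rating in the same session (no item preferences at all).
# The blending weight of 0.5 is invented for illustration.
def anchored_prediction(user_mean, prev_rating, weight=0.5):
    if prev_rating is None:          # first rating in the session: nothing to anchor on
        return user_mean
    return (1 - weight) * user_mean + weight * prev_rating

# Ratings one user gives, in order, during a single session (0-100 scale, made up)
session = [90, 85, 80, 30, 35]
user_mean = 60.0
prev = None
for actual in session:
    pred = anchored_prediction(user_mean, prev)
    print(f"predicted {pred:5.1f}  actual {actual}")
    prev = actual
```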

In addition, there are other such anchoring-type effects. As a thought experiment, try rating anything on (say) five different criteria. The question to ponder is: did you at any stage consider the relative ratings you gave to the different criteria rather than the absolute values? This relative effect led to the development of a series of moving-average models.

Rather than trying to predict preferences in isolation from previous items, I developed a series of models that compared the moving average of recent scores, and the moving average of recent predicted preferences, with an individual item's predicted preference and score. If, for example, the last five items had an average score of 75 and an average calculated preference of 70, and the item I was trying to predict had a calculated preference of 90, then I simply multiplied 75 by 90/70 (or, depending on the model, added (90-70) to 75). The advantage of this approach is that it eliminated many of the issues to do with how preferences change over time (do they change over time? - or is it just the score you attribute to a preference that changes? - it still seems an open question to me) and led to a single-model score of 21.5 RMSE on the validation set - alas, this did not translate to the same score on the test set. I think, however, that is probably better than most single models based on SVD, which was the preferred approach during the Netflix competition - it was certainly better than the tests I ran using some of the standard Netflix approaches.
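Here is a minimal sketch of that adjustment step, using the numbers from the example above. The function name and the five-item window are just for illustration, and how the adjustment is wired into the rest of the model isn't spelled out here, so treat it as a standalone illustration of the arithmetic rather than the model itself.

```python
# Sketch of the moving-average adjustment described above.
# recent_scores / recent_preds are the actual scores and predicted preferences
# of the last few items the user rated (a five-item window here, as in the example).
def adjust_prediction(recent_scores, recent_preds, item_pred, multiplicative=True):
    ma_score = sum(recent_scores) / len(recent_scores)
    ma_pred = sum(recent_preds) / len(recent_preds)
    if multiplicative:
        return ma_score * item_pred / ma_pred      # e.g. 75 * 90/70
    return ma_score + (item_pred - ma_pred)        # e.g. 75 + (90 - 70)

recent_scores = [70, 80, 75, 70, 80]   # averages to 75
recent_preds  = [65, 75, 70, 68, 72]   # averages to 70
print(adjust_prediction(recent_scores, recent_preds, 90))                        # ~96.4
print(adjust_prediction(recent_scores, recent_preds, 90, multiplicative=False))  # 95.0
```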

Anyway, with the time available I realised I couldn't catch the leader - hats off to them for a very good score almost from the outset; it will be interesting to hear what they did. So that's where I left it.


3 comments:

Rohan said...

Awesome! Is your final test score using just a single model?

Gavin said...

Hi Rohan

Yes - although it's a combination of the same model using different parameters.

Zeno said...

I know this is an old post, but it came up on Twitter, so I may as well comment on it.

Temporal aspects were noticed by many of the participants in track 1, and there is a nice paper by (some of) the conference organizers that describes and exploits session bias quite well: http://www.eng.tau.ac.il/~noamk/papers/DKK11.pdf

If I remember correctly, all the top teams used time information for their predictions.

The papers can be seen here:
http://kddcup.yahoo.com/workshop.php

... and there is a JMLR special issue on the cup in preparation.