Tuesday, June 9, 2009

The psychological meaning of billions of parameters

The leaders in the Netflix competition have made great strides since my last post.

Essentially, my understanding is that they have done this by modelling thousands of factors on a daily basis, i.e. for each person they model (say) 2,000 factors on an individual, per-day basis. The set of ratings provided for the competition gives enough information to work out that a particular person, on a particular day, had a preference of a particular strength for watching something funny or, given that there are 2,000 or so factors, something rather more obscure (a preference for something in sepia, perhaps). The ratings also enable you to calculate how well a film meets those requirements (again on a particular day: what seemed funny in one time period may not seem funny in another).

By combining the two sets of factors you can then work out how a person will rate a particular movie and improve your score in the competition. This is an undoubtedly impressive feat from a statistical / machine learning viewpoint.
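For readers who want the mechanics: in latent factor models of this kind, the usual way to combine the two factor sets is an inner product. A minimal sketch, with made-up numbers rather than the leaders' actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

n_factors = 5  # tiny stand-in for the ~2,000 factors discussed above

# Hypothetical learned vectors: one per (user, day), one per movie.
user_day_factors = rng.normal(size=n_factors)  # e.g. "wants something funny today"
movie_factors = rng.normal(size=n_factors)     # e.g. "how funny this movie is"

global_mean = 3.6  # illustrative overall average rating

# Predicted rating = baseline + inner product of the two factor vectors,
# clipped to the 1..5 star scale.
prediction = np.clip(global_mean + user_day_factors @ movie_factors, 1.0, 5.0)
print(round(float(prediction), 3))
```

Each factor dimension contributes (user preference strength) × (movie content strength) to the predicted rating, which is what lets the model express "likes funny films, and this film is funny".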

It strikes me that this is also interesting from a psychological viewpoint: do we really believe that people have such nuanced preferences across such a large number of dimensions? I have an open mind about this. A priori, I would have thought people use far fewer factors in arriving at a rating decision; certainly 2,000 factors (or even 20) can't all be combined consciously, so the subconscious must be heavily involved. Maybe, on the other hand, there are only a few factors that we take into account, but they differ from person to person, and the only way they can be expressed is as a mix of the 2,000 or so factors that are modelled.

It strikes me that your view on this question will shape your choice of research direction on the Netflix competition, on recommendation systems, and indeed on psychological processes in general.

I'd welcome views.


edr said...

Hello Mr. Potter,

First, I would like to thank you for publishing your insights into the Netflix Prize problem. I discovered this competition about 2 months ago and have been playing with the data on and off since then. While I am not particularly interested in recommender systems, many of the ideas developed, and *tested*, in the course of this competition are incredibly interesting and will surely have applicability to other problem domains.

The complexity (# of model parameters) required by the front-runners to get their predictions has astounded me, but this might be expected as one asymptotically approaches the 'noise floor' in which the signal is buried. What I find especially amazing is how *fast* the parameter count rises. A simple average movie rating (0 parameters/customer) gets you an RMSE of 1.05. Taking account of global effects as you described in one of your papers requires only a few parameters/customer and gets you to around 0.95. A 20-factor SVD I ran got 0.92 (20 parameters/customer). The 'blended' approaches described by many of the front-runners must use the equivalent of thousands or tens of thousands of parameters/customer - and the best RMSE to date is ~0.86. This 20% improvement in accuracy over the average rating has come at a cost of complexity measured in orders of magnitude!
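As a rough illustration of what those parameters/customer buy, here is a toy sketch on synthetic low-rank data, using plain unregularized SGD rather than anything the front-runners actually ran (all sizes and rates are invented):

```python
import numpy as np

rng = np.random.default_rng(1)

n_users, n_movies, k = 120, 80, 3
# Synthetic low-rank "ratings" around a global mean of 3.6, plus noise.
true_u = rng.normal(scale=0.5, size=(n_users, k))
true_m = rng.normal(scale=0.5, size=(n_movies, k))
ratings = 3.6 + true_u @ true_m.T + rng.normal(scale=0.3, size=(n_users, n_movies))

mu = ratings.mean()
baseline_rmse = np.sqrt(((ratings - mu) ** 2).mean())  # 0 parameters/customer

# SGD matrix factorization: k parameters per customer (and per movie).
lr = 0.02
U = rng.normal(scale=0.1, size=(n_users, k))
M = rng.normal(scale=0.1, size=(n_movies, k))
for _ in range(20):
    for i in range(n_users):
        for j in range(n_movies):
            err = ratings[i, j] - (mu + U[i] @ M[j])
            u_old = U[i].copy()  # use pre-update user factors for the movie step
            U[i] += lr * err * M[j]
            M[j] += lr * err * u_old
model_rmse = np.sqrt(((ratings - (mu + U @ M.T)) ** 2).mean())
print(baseline_rmse, model_rmse)
```

Even this crude version shows the shape of the curve Ed describes: a handful of factors per customer already beats the flat average, and each further gain costs disproportionately more parameters.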

While I find the 'analytics' aspects of the Netflix Prize more interesting than the psychological ones, the results relative to the requisite model complexity have been disappointing - it shouldn't require hugely more sophisticated models to get as small an improvement as we have been seeing. One thought I had was that different people are responding to different movie factor sets. While an SVD with a large number of factors *should* handle this, it is too easy to overtrain large-dimension models. One experiment I ran was to train a low-dimension SVD (10 factors) on the whole customer population, take the subset of customers who weren't modeled well, and train a separate SVD model for them. While the overall results were disappointing (assuming I did it right), it was interesting to see that about 45% of the customers were modeled quite well (train RMSE < 0.7) with the 10-factor model, but the remainder were a bad fit (average train RMSE ~ 1.15), and mostly stayed a bad fit even when given their own SVD parameter set. This suggests to me that different customers might be best represented by structurally different models - and not by merely trying to 'fit' them with a single model with more parameters. I believe this model-to-customer matching is different from the current 'blending' schemes, which sound like they are based on linear weightings applied to entire prediction sets. Whether it works any better, however, is anybody's guess.
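A rough sketch of the well-fit/badly-fit split, on synthetic data where half the "customers" follow a shared low-rank structure by construction (so the counts are purely illustrative, not a reproduction of my experiment on the real data):

```python
import numpy as np

rng = np.random.default_rng(2)

# 50 users who share a rank-4 structure, 50 who are essentially noise.
movies = rng.normal(size=(80, 4))
structured = rng.normal(size=(50, 4)) @ movies.T + rng.normal(scale=0.2, size=(50, 80))
noisy = rng.normal(scale=1.0, size=(50, 80))
ratings = np.vstack([structured, noisy])

# Rank-4 truncated SVD of the (dense, toy) rating matrix.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
approx = U[:, :4] * s[:4] @ Vt[:4]

# Per-user training RMSE, then split on a threshold.
per_user_rmse = np.sqrt(((ratings - approx) ** 2).mean(axis=1))
threshold = 0.7
well_fit = per_user_rmse < threshold
print(well_fit.sum(), "users fit well;", (~well_fit).sum(), "fit badly")
```

The structured users come out with small residuals and the noise-like users stay badly fit no matter how the shared factors are chosen - the toy analogue of the subpopulation that "mostly stayed a bad fit even when given their own SVD parameter set".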

best regards,

Ed Ramsden

jbourne said...

I wonder if this approach also leads to overfitting to the qualifying set. The qualifying set contains dates that also occur in the training set, so the predictions are not strictly for future ratings. As a result, the ultimately best approach might simply be the one that best overfits the qualifying set - which, in my opinion, is why the leading approaches keep adding model parameters. Do you think this might be a drawback?

PragmaticTheory said...

I have been experimenting with mega-models myself, but I found them rather unhelpful. Not only did they not improve our overall score significantly, but we found that more compact models (with far fewer parameters) can provide better accuracy. On our own blog, we report a model that achieved an RMSE of 0.8732 on the quiz set. This model uses approximately 300M parameters. That is still a lot, considering it is trained on about 100M entries. However, it is two orders of magnitude smaller than the model with over 34,000M parameters described in the main post (over 17M user-date pairs times 2,000 parameters), yet it achieves greater accuracy. Our experiments show that modeling a user with more than 200 factors offers almost no improvement. A very decent model can be achieved with as few as 50 factors per user (0.8758 on the quiz set, for example).

200 factors to describe the tastes of a user still looks like a large number. I would argue that these factors are not independent, and that it actually takes multiple factors to properly model one taste dimension. For the sake of argument, let's assume that the movie factors are pre-computed and constant. A linear regression of user ratings then provides an estimate of the user factors. The keyword here is "linear". My intuition is that user tastes are not linear. It is plausible that a given user might like a little romantic touch in a movie, but will dislike too much of it. A linear model has difficulty capturing something like this. To compensate, the training procedure assigns multiple factors to model the romantic aspect of the movie: one that measures the amount of romantic content, another that measures the amount of "a little" romantic content (i.e. the movie gets a low score if there is not enough or too much romance), and so on. Each dimension of taste gets modeled as multiple factors to account for non-linearities. I believe this explains why models require longer factor vectors than intuition would credit.
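The inverted-U intuition is easy to make concrete with a toy fit on synthetic data (the "romance" scores and the peak at 0.4 are invented for illustration): one linear factor cannot fit a taste that peaks in the middle, but adding a second, derived factor fixes it.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical "romance content" of 200 movies, in [0, 1].
romance = rng.uniform(size=200)

# A user who likes a little romance but not too much: ratings peak at
# moderate romance content (an inverted U), plus some rating noise.
ratings = 4.0 - 6.0 * (romance - 0.4) ** 2 + rng.normal(scale=0.1, size=200)

def fit_rmse(X):
    """Least-squares fit of ratings on the columns of X; returns training RMSE."""
    coef, *_ = np.linalg.lstsq(X, ratings, rcond=None)
    return np.sqrt(((ratings - X @ coef) ** 2).mean())

ones = np.ones_like(romance)
one_factor = fit_rmse(np.column_stack([ones, romance]))                        # "romance"
two_factors = fit_rmse(np.column_stack([ones, romance, (romance - 0.4) ** 2]))  # + "too much romance"
print(one_factor, two_factors)
```

The single "amount of romance" factor leaves a large residual, while the pair of factors drives the error down to the noise level - one taste dimension consuming two slots in the factor vector, exactly as argued above.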

Another valid question is why use more parameters than training samples. The answer lies in the widely varying number of ratings provided per user. Some users have provided only a handful of ratings, while others have provided thousands. For simplicity, models use a fixed number of factors for all users. A larger number of factors (e.g. 200) is certainly unhelpful for users with few ratings, but may capture small correlations for more prolific users. Proper regularization ensures that prediction accuracy for users with few ratings is unaffected, so it is all gain and no loss. As we try to increase the number of factors even further, the pool of users with sufficient ratings becomes smaller and smaller, and the improvement becomes negligible.

A word of warning: as the size of the model increases to mega-values, model accuracy may appear to improve or degrade simply because of non-optimal regularization. Sometimes it is hard to tell whether a large model captures taste better, or whether its regularization was simply better tuned.

Martin Piotte

Just a guy in a garage said...

Thanks for the comments. I had also played with edr's ideas of fitting different factors to different groups but with similar lack of success. Oh well...

I'm sure pragmatic theory's views are correct that it takes more than one factor to model a particular dimension of taste, which suggests to me, at least, that there might be a more parsimonious way of describing the data. I guess we will all have to keep on trying...

The other problem with these multiple factors is that they make interpretation very difficult. I'm up for the challenge, if anyone else is, of trying to figure out what all of this means - of interpreting these factors in psychological terms.

It seems like there might be another industry forming - not in interpreting the data, but in interpreting the models that interpret the data... if you get my drift.

The Pageman said...

I think that's why you have PCA and PFA - when you use a scree plot or the elbow method, you can isolate the few factors that actually account for most of the tendencies - although 2,000 seems a lot ...
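For what it's worth, the scree-plot idea is easy to sketch numerically (synthetic data with 4 planted factors; real ratings data would be far messier):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data with 4 strong latent factors buried in noise.
latent = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 50))
data = latent + rng.normal(scale=0.5, size=(300, 50))

# Scree-plot values: fraction of variance explained by each principal component.
centered = data - data.mean(axis=0)
sing_vals = np.linalg.svd(centered, compute_uv=False)
explained = sing_vals ** 2 / (sing_vals ** 2).sum()

# The "elbow": the first few components dominate, the rest are noise.
print(np.round(explained[:8], 3))
```

The first four values tower over the rest, and the elbow after them is where the scree/elbow method would tell you to stop - the open question in this thread is whether the Netflix data has an elbow anywhere near 4, or really does need something closer to 2,000.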