
jstepien | 14 years ago

During the previous semester I spent some time building a recommender using this data as a project for a data mining class. It turned out to be far more challenging than I had initially anticipated.

I used methods known as collaborative filtering, whose goal is to estimate how a given user would rate a given item based on the preferences of other users with similar interests. The initial scope included a naïve Bayesian classifier and a technique called Slope One [1]. The latter is particularly interesting because, according to its authors, it produces very good estimates in very little time using nothing more than a simple linear model. The preprocessing is expensive in both time and space, though, as it requires building a matrix of deviations between rated items.
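For anyone curious, here is a minimal Python sketch of the weighted Slope One scheme from [1]: the preprocessing step builds the item-to-item deviation matrix, and prediction combines a user's own ratings with those deviations. The data layout and function names are my own illustration, not code from the project.

```python
from collections import defaultdict

def slope_one_deviations(ratings):
    """Preprocessing: build the deviation matrix.
    ratings: {user: {item: rating}} (hypothetical layout).
    Returns dev[(j, i)] = average of (r_uj - r_ui) over users who
    rated both j and i, plus the co-rating counts used as weights."""
    diffs = defaultdict(float)
    counts = defaultdict(int)
    for user_ratings in ratings.values():
        items = list(user_ratings.items())
        for j, r_j in items:
            for i, r_i in items:
                if i != j:
                    diffs[(j, i)] += r_j - r_i
                    counts[(j, i)] += 1
    dev = {pair: diffs[pair] / counts[pair] for pair in diffs}
    return dev, counts

def predict(user_ratings, target, dev, counts):
    """Weighted Slope One prediction for `target` given one user's ratings."""
    num = den = 0.0
    for i, r_i in user_ratings.items():
        if (target, i) in dev:
            c = counts[(target, i)]
            num += (dev[(target, i)] + r_i) * c
            den += c
    return num / den if den else None
```

The quadratic loop over each user's rated items is exactly where the time and space cost comes from: the deviation structure grows with the square of the number of items.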

After reducing the data set to a single subreddit and filtering out users who weren't avid voters, I ran the algorithms, and after some tuning I was pleased to see promising ROC curves and decent AUC values. Models built around NBC and S1 achieved comparable results on metrics such as precision, recall, and F-measure.

When I went to discuss the results with the professor teaching the class, I heard: "That's indeed promising, but how about comparing those results with a really naïve model which would just take the average of a given user's existing votes?" Guess what: the model built with a single call to the avg function was nearly as good as the NBC and S1 models.
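That "really naïve" baseline is literally a one-liner. A hypothetical sketch, assuming ratings stored as {user: {item: rating}}:

```python
def user_average_baseline(ratings):
    """Predict every unseen item as the user's mean past rating.
    ratings: {user: {item: rating}} -- an assumed, illustrative layout."""
    return {u: sum(r.values()) / len(r) for u, r in ratings.items()}
```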

Now I understand why the guys from Reddit are looking for external help with the recommender. It's a far less obvious task than it might seem.

[1] http://lemire.me/fr/documents/publications/lemiremaclachlan_...

Edit: s/machine learning/data mining/

_dps | 14 years ago

Out of curiosity, did you compare to any other baselines? I suspect you did a lot better than you think you did, because that particular baseline is actually very misleading for ranking/recommendation tasks (this is a common source of confusion for newcomers). Here's why, in two parts:

1) Say you estimate (as you propose) that a user will always give their average rating. This might get you good-ish error and ROC as a prediction task, but will give zero recommendation value because the prediction for a given user will be constant for all possible recommendations.

2) Say you estimate that a user will give the average score that the item has received across all users. Again, possibly good-ish in terms of prediction ROC and RMS error, but this offers no personalization (all users get the same predictions, i.e. you're basically just showing the default Reddit ranking).
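The structural problem with both baselines can be made concrete in a few lines of Python (an illustrative sketch, not anyone's actual code): each baseline's scoring function ignores one of its two arguments, so it can never rank items differently for a given user, or users differently for a given item.

```python
from statistics import mean

def per_user_mean(ratings):
    """Baseline 1: predict a user's own average rating for every item.
    ratings: {user: {item: rating}} -- an assumed, illustrative layout."""
    avgs = {u: mean(r.values()) for u, r in ratings.items()}
    # Constant in `item`: every candidate gets the same score,
    # so this carries zero ranking information per user.
    return lambda user, item: avgs[user]

def per_item_mean(ratings):
    """Baseline 2: predict each item's global average rating."""
    sums, counts = {}, {}
    for user_ratings in ratings.values():
        for item, r in user_ratings.items():
            sums[item] = sums.get(item, 0.0) + r
            counts[item] = counts.get(item, 0) + 1
    avgs = {i: sums[i] / counts[i] for i in sums}
    # Constant in `user`: every user sees the same ranking,
    # i.e. no personalization at all.
    return lambda user, item: avgs[item]
```

Both functions can post respectable RMS error while being useless as recommenders, which is exactly the trap.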

Both of these baselines are vastly inferior to even really stupid models like "how many times have I upvoted stories from this submitter" in terms of recommendation value, but the latter is (if I recall from my own experiments) much worse when evaluated on the basis of overall ROC.

I would strongly suspect that a correctly implemented NB or S1 would vastly outperform either of the two baselines in terms of actual recommendation utility (even though when you look at the baseline's ability to predict actual numbers, they might be comparably good in an RMS sense).

The moral of the story: one must be very careful when trying to quantify the performance of learning systems; actual utility is often difficult to evaluate merely by looking at standard statistical measures of accuracy.

jstepien | 14 years ago

No, I didn't compare against any other baselines. Thanks a lot for sharing your thoughts; I'll have to reconsider my results in light of your comment.

greendestiny | 14 years ago

I implemented Slope One for the Netflix Prize and found its results pretty unimpressive. So I decided to extend it by building an SVD predictor over the Slope One values, figuring it might do better than SVD by itself. It didn't.

Turns out increasing the dimensionality of the input 17 thousand times just reduces the amount of training data for each attribute. Duh :)