For Amazon, though, which is the example in Evan Miller's post, I don't really get why you'd first dichotomize the five-star rating into positive vs. negative and then use Wilson intervals. Just construct a run-of-the-mill 95% confidence interval for the mean of a continuous distribution and sort by the (still plausible) worst case scenario a.k.a. the lower bound of that: `mean - 1.96 * SE`, where the standard error is `SE = stddev(scores)/sqrt(n)`.
Thanks to the central limit theorem this works even when the scores themselves are not normally distributed, provided n is reasonably large.
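A minimal sketch of that sort key (names are mine, not from the comment):

```python
import math

def lower_bound(scores, z=1.96):
    """Plausible worst-case mean: mean - z * SE (needs at least 2 scores)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)   # sample variance
    return mean - z * math.sqrt(var / n)

items = {"A": [5, 5, 4, 5, 1], "B": [4, 4]}
ranked = sorted(items, key=lambda name: lower_bound(items[name]), reverse=True)
```

Note how "B", with only two consistent ratings, can outrank "A", whose higher-variance ratings drag its lower bound down.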
For better accuracy with small samples you could use the multinomial distribution instead. The covariance matrix for the rating probabilities can be found here for example: http://www.math.wsu.edu/faculty/genz/papers/mvnsing/node8.ht...
The variance of the expected rating is then the quadratic form w^T Σ w, where w holds the star values and Σ is that covariance matrix.
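A plain-Python sketch of that calculation, assuming star values 1-5 as the weights and the standard multinomial covariance Cov(p_i, p_j) = (p_i·[i==j] - p_i·p_j)/n:

```python
import math

def rating_lower_bound(counts, z=1.96):
    """Lower confidence bound on the mean star rating from rating counts.

    counts[k] is the number of (k+1)-star ratings."""
    n = sum(counts)
    p = [c / n for c in counts]                     # estimated rating probabilities
    stars = range(1, len(counts) + 1)
    mean = sum(s * pi for s, pi in zip(stars, p))
    # variance of the mean rating: quadratic form w^T Cov w with w = star values
    var = sum(
        si * sj * ((pi if i == j else 0.0) - pi * pj) / n
        for i, (si, pi) in enumerate(zip(stars, p))
        for j, (sj, pj) in enumerate(zip(stars, p))
    )
    return mean - z * math.sqrt(var)
```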
These companies really should be hiring statistics consultants instead of relying on the intuitions of their programmers.
I have a fundamental problem with democratic voting systems: whatever the majority likes tends to come out on top, hence cat pictures on reddit. The most philosophically elegant solution I've encountered so far is "quadratic voting" (see https://news.ycombinator.com/item?id=9477747), where every user has a limited number of credits to spend per time period and every vote has a quadratic cost.
Assume a user has or earns 1000 karma points a month. If he merely likes a post, he casts 1 vote, which costs him 1 karma. If he strongly wants one post on top, he can spend at most 31 votes on it (31² = 961 karma, the most he can afford). This way minorities also get extra influence over the voting process.
The requirement is that each user has exactly one account, enforced e.g. by some form of payment for the 1000 karma, to prevent vote fraud via fake accounts. Bitcoin, if it catches on, could address the privacy concerns.
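A toy sketch of the proposed scheme (the class and numbers are illustrative, not an existing system):

```python
class QuadraticVoter:
    """Toy quadratic-voting ledger: n votes on one post cost n^2 karma."""

    def __init__(self, budget=1000):
        self.budget = budget          # karma credits per time period

    def vote(self, votes):
        price = votes ** 2            # quadratic cost
        if price > self.budget:
            raise ValueError("not enough karma")
        self.budget -= price
        return votes
```

With a 1000-karma budget, 31 votes on a single post cost 961 karma, leaving 39 for ordinary 1-karma votes elsewhere.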
Do you think this will hinder the workings of sites like reddit?
What I always thought was that there really should be some user-based weighting system. Like, if a user upvotes 90% of the things he sees, his upvotes are probably worth less than upvotes by someone who upvotes only 1% of the posts he sees.
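One hypothetical way to implement such weighting, treating a rare upvote as carrying more information (the formula is my own illustration, not an established method):

```python
import math

def vote_weight(user_upvotes, user_votes_seen):
    """Weight a vote by how selective the voter is: rarer upvotes count more."""
    rate = (user_upvotes + 1) / (user_votes_seen + 2)   # smoothed upvote rate
    return -math.log(rate)                              # information content of an upvote
```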
Same thing applies to things like Yelp reviews. Maybe a user with close to a 5-star lifetime rating average should have his reviews "renormalized" to 3's because his standards are probably just lower than the guy with a 1-star average.
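A sketch of that renormalization as a z-score mapping onto a site-wide scale (the target mean of 3.0 is an assumption for illustration, not a real Yelp parameter):

```python
import statistics

def renormalize(rating, reviewer_history, site_mean=3.0, site_sd=1.0):
    """Map a reviewer's rating onto the site-wide scale via a z-score."""
    mu = statistics.mean(reviewer_history)
    sd = statistics.stdev(reviewer_history) or 1.0   # guard against zero spread
    return site_mean + ((rating - mu) / sd) * site_sd
```

A 5-star rating from a reviewer who hands out 5s constantly lands near 3; the same 5 from a habitual 1-star reviewer would land far above it.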
The problem is that there are so many other factors here (maybe the person with the 5-star average only visits, or at least only reviews, really good places; maybe the heavy upvoter just spends more time reading each page on reddit). These complicating factors are hard to predict, and if the simple case is working, why bother? If there were a simple, clearly better way of voting/rating, it would already be done.
> Do you think this will hinder the workings of sites like reddit?
I do. Users will think "Given that I only have X to spend, I'd better be careful what I vote for." That will lead to fewer votes being cast, and thus reduced interaction / community activity.
It's the same reason you don't want to cap the amount of submissions or comments someone makes. A flood control (max X per Y time period) to stop bots and spammers, sure, but no hard cap - or an inhumanly high one at least, bearing in mind that some people are indeed inhuman when it comes to voting/commenting.
I think sites like reddit just want engagement, even over quality. It's been known forever that reddit's algorithm leaves much to be desired, but it's too risky to change.

E.g. the following (since patched) bug: http://technotes.iangreenleaf.com/posts/2013-12-09-reddits-e...
I find this a strange comment. Quadratic voting is much better when we have to reach a consensus and want to protect minorities, so it has its applications (and is woefully under-used). But surely it's not always the best method. Often you want the most popular comment at the top of the list, because it's what appeals to the most people.
Secondly, one of the big problems that quadratic voting addresses is when an ambivalent majority outweighs a minority simply because everyone votes. But not everyone has to vote on every comment on a site like reddit, so the ambivalent majority is already ignored.
It might work if there were a nominal cost to creating an account.
It would be better to take the recommendation engine of netflix. Just throw it on top of reddit and charge a fee for tailored recommendations. Now that is something I would pay money for.
Wrong solution #1 sounds like it could work quite well for UrbanDictionary, since it would tend to reward posts that have a lot of engagement. It's probably a good solution for a lot of sites.
The problem here is feedback: the higher-rated posts are seen by more people, which gets them rated higher still. That opens up a whole extra can of worms you don't want to deal with.
I agree. It's a good solution for all cases where the intent is to have a negative vote exactly cancel out a positive one.

If this article wants to make its point, it should show cases where its ordering differs from UrbanDictionary's.
The method in the article combines a quality rating with a quantity rating, but it's a bit unwieldy and difficult to tune intuitively. It seems to me that for a lot of purposes you could get a sufficiently similar effect by using method #1 and then multiplying the result by a sigmoid function applied to the ratio. The advantage would be that the only magic numbers in the formula are tuning factors you put in yourself.
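A sketch of that combination, taking method #1 to be the article's "wrong solution #1" (net votes); the steepness k and midpoint m are the self-chosen tuning factors the comment alludes to, with names and defaults of my own:

```python
import math

def combined_score(positive, negative, k=10.0, m=0.7):
    """Net votes, damped by a sigmoid of the positive ratio."""
    total = positive + negative
    if total == 0:
        return 0.0
    net = positive - negative
    ratio = positive / total
    return net / (1 + math.exp(-k * (ratio - m)))   # sigmoid centred at m
```

Items whose positive ratio falls below m keep little of their net score, so a large but controversial item no longer swamps a small unanimous one.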
This seems more appealing to me than "((positive + 1.9208) / (positive + negative) - 1.96 * SQRT((positive * negative) / (positive + negative) + 0.9604) / (positive + negative)) / (1 + 3.8416 / (positive + negative))".
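For comparison, that expression is just the lower bound of the Wilson score interval with z = 1.96 (so z² ≈ 3.8416, z²/2 ≈ 1.9208), which reads more clearly as code:

```python
import math

def wilson_lower_bound(positive, negative, z=1.96):
    """Lower bound of the Wilson score interval for the positive fraction."""
    n = positive + negative
    if n == 0:
        return 0.0
    phat = positive / n
    centre = phat + z * z / (2 * n)
    spread = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (centre - spread) / (1 + z * z / n)
```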
I've also seen a variation of this that gives different weights to positive and negative votes, the theory being that people aren't equally likely to vote when they like something as when they don't.
Some of the comments from that posting give concrete examples where the formula fails. Such as: an item with 1000 upvotes and 2000 downvotes will get ranked above one with 1 upvote and 2 downvotes. This is because the formula uses the lower bound of the Wilson interval.
I'm ranking movies by critics' ratings. Most of them have too few ratings, so naive ranking by the average does not work. IMDB gets away with it, but I cannot.
And you should be able to see a good preview of the expected ranking even with low numbers.
So you need to check the confidence interval with Wilson, but you also need to check the quality of the reviewer. Some are in the 90% range, but there are also frequent outliers, i.e. extreme ratings. Mostly French, btw.

I updated the C and Perl versions, compiled and pure Perl, here: https://github.com/rurban/confidence_interval
The first two points are great, but then why do we see this:
"Given the ratings I have, there is a 95% chance that the "real" fraction of positive ratings is at least what?"
What normal person thinks in terms of confidence intervals?
The obvious answer is that people want the product with the highest "real" rating, i.e. the rating the product would get if it had arbitrarily many ratings.
To get this you just take the mean of your posterior probability distribution. For positive and negative reviews only, that's basically (positive + a) / (total + b), where a and b depend on your prior.
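A sketch: with a Beta(α, β) prior the posterior mean is (positive + α) / (total + α + β), so a = α and b = α + β, and the defaults below encode a uniform prior:

```python
def posterior_mean(positive, total, a=1.0, b=2.0):
    """Mean of the Beta posterior over the 'real' positive fraction."""
    return (positive + a) / (total + b)
```

An item with no reviews scores the prior mean of 0.5, and each review pulls the estimate away from it.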
His proposal would mean that a product with zero reviews would be rated below a product with 1 positive review. This may deal with spam and vote manipulation since things with less information are penalized more but that is a separate issue.
I have always wondered what amazon was thinking with that way of sorting.
Perhaps it's a deliberate way to spread purchases out over a span of products instead of just the two top products?
I think it's about product discovery. If we always sorted this way, new products wouldn't have a chance.
And I don't think Amazon would sort like this; it would make more sense for them to sort the HN/reddit way, which gives new items a chance to reach the top.
There is a much simpler and more elegant method: just rank posts by their estimated probability of getting an upvote, which is (upvotes + 1) / (upvotes + downvotes + 2).
This gives an advantage to new posts for which the probability is much more uncertain: it's easier to get 1 upvote and 0 downvotes (rank 2/3) than to get 1999 upvotes and 999 downvotes (also rank 2/3). Maybe that's what you want, but the post is exactly about those cases when this is not what you want.
Interestingly, I _think_ the Reddit algorithm basically makes this mistake too -- although embedded in a more complicated algorithm that combines with 'newest first' altered by positives minus negatives.
I don't think the HN algorithm is public, but wouldn't be surprised if it does the same.
Perhaps the generally much smaller number of 'votes' on a HN/reddit post makes it less significant.
For posts, I'm not sure what the algorithm is (I think it's deliberately more complicated, and has to take into account time of posting?), but after this article [the op] was written, reddit implemented the method for comments, as explained by Randall Munroe: http://www.redditblog.com/2009/10/reddits-new-comment-sortin...
You only get this ranking method if you sort the comments by 'best', though.
Fundamentally: assessing quality of complex products, including information goods, is hard.

First: Farmer and Glass have covered much of this ground.

Book: http://shop.oreilly.com/product/9780596159801.do
Wiki: http://buildingreputation.com/doku.php
Blog: http://buildingreputation.com/

I can pretty much guarantee there are elements of this you're not considering which are addressed there (though there are also elements which Farmer and Glass don't hit either). But it's an excellent foundation.
Second: If you're going to have a quality classification system, you need to determine what you are ranking for. As the Cheshire Cat said, if you don't know where you're going, it doesn't much matter how you get there. Rating for popularity, sales revenue maximization, quality or truth, optimal experience, ideological purity, etc., are all different.
Beyond that I've compiled some thoughts of my own from 20+ years of using (and occasionally building) reputation systems myself:
"Content rating, moderation, and ranking systems: some non-brief thoughts"
http://redd.it/28jfk4
⚫ Long version: Moderation, Quality Assessment, & Reporting are Hard
⚫ Simple vote counts or sums are largely meaningless.
⚫ Indicating levels of agreement / disagreement can be useful.
⚫ Likert scale moderation can be useful.
⚫ There's a single-metric rating that combines many of these fairly well -- yes, Evan Miller's lower-bound Wilson score.
⚫ Rating for "popularity" vs. "truth" is very, very different.
⚫ Reporting independent statistics for popularity (n), rating (mean), and variance or controversiality (standard deviation) is more informative than a single statistic.
⚫ Indirect quality measures also matter. I should add: a LOT.
⚫ There almost certainly isn't a single "best" ranking. Fuzzing scores with randomness can help.
⚫ Not all rating actions are equally valuable. Not everyone's ratings carry the same weight.
⚫ There are things which don't work well.
⚫ Showing scores and score components can be counterproductive and leads to various perverse incentives.
I'm also increasingly leaning toward a multi-part system, one which rates:
1. Overall favorability.
2. Any flaggable aspects. Ultimately, "ToS" is probably the best bucket, comprising spam, harassment, illegal activity, NSFW/NSFL content (or improperly labeled same), etc.
3. A truth or validity rating. Likely rolled up in #2, but worth mentioning separately.
4. Long-term author reputation.
There's also the general problem associated with Gresham's Law, which I'm increasingly convinced is a serious challenge to market-based and popularity-based systems. Assessment of complex products, especially information products, is difficult, which is to say, expensive.
I'm increasingly in favour of presenting newer / unrated content to subsets of the total audience, and increasing its reach as positive approval rolls in. This seems like a behavior HN's "New" page could benefit from. Decrease the exposure for any one rater, but spread ratings over more submissions, for longer.
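A toy sketch of such an exposure ramp (the function and its parameters are entirely hypothetical):

```python
def exposure_fraction(approvals, views, base=0.05, full_at=0.5):
    """Fraction of the audience shown an item: start with a small slice,
    widening exposure as the approval rate climbs toward `full_at`."""
    if views == 0:
        return base                       # unseen items start at the base slice
    rate = approvals / views
    return min(1.0, base + (1 - base) * min(rate / full_at, 1.0))
```

A new submission reaches 5% of readers; once half of those who see it approve, it reaches everyone.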
And there are other problems. Limiting individuals to a single vote (or negating the negative effects of vote gaming) is key. Watching the watchmen. Regression toward mean intelligence / content. The "evaporative cooling" effect (http://blog.bumblebeelabs.com/social-software-sundays-2-the-...).