
YouTube Comes To A 5-Star Realization: Its Ratings Are Useless

43 points | vaksel | 16 years ago | techcrunch.com | reply

31 comments

[+] acangiano|16 years ago|reply
Having "xx% liked this video" under each video would be more useful and meaningful.

BTW, they really should have used a histogram here.

[+] icey|16 years ago|reply
I'm shocked that such a big-data company has missed the mark on ratings by this much. Especially given that they have Flash running in the browser.

If I were YouTube, I would start by automatically ranking videos by the percentage of times they were played to completion, discounting plays where the page was completed but simply left open with no activity (to counteract people who opened the page and ignored it afterwards, or left their machines idle). Discard the 5% outliers on either end of the spectrum and you should have a pretty decent idea of which videos are popular and which are boring.

After all, don't they care mostly about whether the videos are engaging?

It seems like there should be a ton of techniques that they should have already been using to supplement the star ratings in order to fine tune any algorithms they wanted to play with.
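The ranking heuristic described above could be sketched roughly like this. All field names and the idle-play tracking are hypothetical, and this is just one reading of the idea, not anything YouTube actually runs:

```python
def rank_by_completion(videos, trim=0.05):
    """Rank videos by completion rate, ignoring idle completions and
    trimming the extreme `trim` fraction of scores on each end.

    Each video is a dict with hypothetical fields: id, total_plays,
    completed_plays, idle_completed_plays (completions where the tab
    was left open with no user activity)."""
    scored = []
    for v in videos:
        # Count only completions where the viewer was actually active.
        engaged = v["completed_plays"] - v["idle_completed_plays"]
        scored.append((engaged / v["total_plays"], v["id"]))
    scored.sort(reverse=True)
    # Drop the outliers on both ends of the score distribution.
    k = int(len(scored) * trim)
    trimmed = scored[k:len(scored) - k] if k else scored
    return [vid for _, vid in trimmed]
```

With only a handful of videos the 5% trim removes nothing; it only starts discarding entries once the list is large enough.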

[+] stcredzero|16 years ago|reply
I'm shocked that such a big-data company has missed the mark on ratings by this much

Well, given how badly they messed up with comments...

http://xkcd.com/202/

[+] InclinedPlane|16 years ago|reply
At least they realize it. Amazon book ratings have a similar problem, to a lesser degree. You really have to dig into the individual reviews to get any reasonable ratings info. I find, at least for technical books, that comparing the price the book is being sold for used to the new price is one of the more accurate ways of determining if a book has legitimate value.
[+] netsp|16 years ago|reply
Actually, Amazon's review & rating system seems to be one of the few that works well. Seeing the distribution and reading the most useful positive and negative reviews works really well in my experience.
[+] tel|16 years ago|reply
Though I don't have aggregate data, I don't think Amazon has this issue. People seem able to differentiate within the 2-4 range when there's an actual purchase involved.

Then again, Amazon doesn't account for the statistical uncertainty of votes, so it's a little odd as well.
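One standard way to account for that statistical uncertainty (my suggestion, not something the thread names) is to rank items by the lower bound of the Wilson score interval on the fraction of positive votes, rather than by the raw average:

```python
import math

def wilson_lower_bound(pos, n, z=1.96):
    """Lower bound of the Wilson score interval for a proportion:
    `pos` positive votes out of `n` total, ~95% confidence by default.
    An item with few votes gets a pessimistic score, so 1-of-1
    positive doesn't outrank 90-of-100 positive."""
    if n == 0:
        return 0.0
    phat = pos / n
    denom = 1 + z * z / n
    centre = phat + z * z / (2 * n)
    spread = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (centre - spread) / denom
```

This applies most naturally to up/down votes; star ratings would first have to be collapsed to positive/negative, which is itself a judgment call.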

[+] netsp|16 years ago|reply
Maybe it's because Amazon asks (and receives) written reviews as well. They make people think a little more. I assume there is still a bias towards the 1 & 5 stars though.
[+] mynameishere|16 years ago|reply
http://74.125.113.132/search?q=cache:QCWrYt03Y8UJ:sloanwordp...

The econometric results reveal that the reviews for the majority of the products have an asymmetric bimodal distribution.

etc. I thought everyone knew this, and it makes sense. You don't review something if you think it's "okay". You review it if you love it or hate it. Rottentomatoes probably has a bell curve because the professional reviewers review everything, whether they want to or not.

Also, the reviews on youtube are quite useful, and a three-or-less star average indicates that something is labeled incorrectly (for copyrighted material, tv shows, movies, videos, and the like).

[+] DanielStraight|16 years ago|reply
I've always thought smaller scales were better. I think asking people to rate something on a 1-5 scale is optimistic. A 1-10 scale is simply insanity. No one, without substantial effort, can reasonably rank something from 1-10... even if they think they can. I once had someone try to convince me that they could rank the attractiveness of girls accurately on a scale of 1-100, differentiating between every point on the scale. I don't think anyone would dispute that that's insane. I feel the same way about 1-10 scales. It simply isn't possible. When a particular rating is really important, I say use 1-4. I think preventing people from picking a middle answer will help get more honest opinions. If it's an issue about which you can truly be apathetic, 1-3. Hacker News seems to do quite well with a scale of 1-1 on submissions, or perhaps we should call it a scale of apathy-opinion.

Maybe for YouTube they could count opinion votes when someone watches the same video twice or uses the embed/share feature. If someone watches a video but never shares it or watches it again, then I think it's fair to say they were apathetic about it.

[+] derefr|16 years ago|reply
Just to supply evidence to the contrary, I make full use of the five-star scale in iTunes, with each additional star translating to some privilege the song gets as a strict subset of the songs with lesser ratings (e.g. 4+ = always on iPod.) I frequently wish for half-ratings across the board, which would make the full, discrete scale 1-10 :) I imagine I may indeed be in the extreme minority, though.
[+] NathanKP|16 years ago|reply
I commented on the article:

"Five-star systems are supposed to work on the basic idea that the one- and five-star votes will average out to a value somewhere in the middle. In this way the two, three, and four star ratings are averages based on the ratio of one-star votes to five-star votes."

However, that doesn't always work. People seem to be lazy and they don't want to judge the comparative value of different items.

[+] IsaacL|16 years ago|reply
I found this interesting, as I've just started working again on a web app for rating learning resources - I made a previous post on HN asking about the merits of a 5-point system versus a 2-point system (thumbs up / thumbs down). The consensus seemed to favor the five-point system, as it's more informative.

However, TC seems to believe that the 5-star system is too poorly defined. Opinions? I've been wondering whether to include guidelines for the different rankings (3 stars means this, 4 stars means this, etc.) or just to let people use their own definitions (since, in all likelihood, the guidelines would be ignored). I've always thought it strange when reviewers provide a little box saying "5 stars - Excellent, 4 stars - Good..." and so on, since most people should have got the idea by now.

I also want to segregate links by level - Getting Started/Beginner/Intermediate/Advanced - and was trying to think of precise definitions for each. Again, I've also thought that since people likely would ignore any such definitions, it might be better for them to use their own definitions. Theory being that if 100 people thought this link should be in the 'beginner' category, then others won't be surprised to find it there.

So, give definitions for each ranking, or let users work by their own interpretations for each ranking?

2 ideas to improve things:

Idea 1: Change the weighting of a vote based on the voter's voting habits. E.g., if a person only gives 1- and 5-star votes, decrease the weighting of their votes. I doubt I'm the first person to come up with this idea; does anyone know of any sites that implement such a scheme?

Idea 2: Users with 'editor' privileges have the ability to move things around into their 'correct' place. This could make it more useful to have predefined definitions for each ranking and category.
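A minimal sketch of Idea 1, assuming a made-up linear discount for voters whose history is all extremes (the 0.75 slope and 0.25 floor are arbitrary choices, not from any real site):

```python
def voter_weight(ratings):
    """Weight a voter by how much of the 1-5 scale they actually use.
    A voter who only ever votes 1 or 5 is discounted toward a floor
    of 0.25; a voter with no extreme votes keeps full weight."""
    if not ratings:
        return 1.0
    extreme = sum(1 for r in ratings if r in (1, 5))
    return max(0.25, 1.0 - 0.75 * extreme / len(ratings))

def weighted_average(votes):
    """votes: list of (rating, voter_history) pairs.
    Returns the habit-weighted mean rating."""
    total = weight_sum = 0.0
    for rating, history in votes:
        w = voter_weight(history)
        total += w * rating
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0
```

The effect is that a 5-star vote from a habitual extremist moves the average less than a 4-star vote from someone who uses the whole scale.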

[+] JCThoughtscream|16 years ago|reply
I'd say that a rating based on the number of favorites makes more sense. It's less arbitrary than the five-star system, in that it tracks the nominal level of interest in the video. People don't generally favorite a video at random, so any favoriting at all is generally a good sign as to the video's content.
[+] brazzy|16 years ago|reply
I really don't see why, as the article claims, the fact that opinions are subjective and everyone has a different one makes the 1-5 star rating any more useless than a thumbs up/thumbs down/do nothing or a favorite/not favorite rating. Those are subjective and different for everyone as well, and the idea is that it averages out over lots of people.

No, the problem is that people can't be bothered to spend time thinking about just how much they like something compared to everything else, and that it can become uncomfortable to try, because you realize that your relative liking may be neither constant nor consistent, and trying to make it so could be a lot of work (I like A a lot, but B even more... hey, C is really cool! But some things about A are better than C... now what?).

[+] potatolicious|16 years ago|reply
"I really don't see why, as the article claims, the fact that opinions are subjective and everyone has a different one makes the 1-5 star rating any more useless than a thumbs up/thumbs down"

Because you're implementing a system that on paper has a lot more resolution than what you're really getting. Imagine buying a 1080p HDTV and then showing only solid colors on it: not only is the engineering effort wasted, but any subsequent systems built around the validity of your 5-star ratings will be fundamentally broken as well.

Also, based on their data there's a very concrete reason why the 1-5 star rating is worse than the thumbs up/thumbs down. With the thumbs up/down system you have a single dimension of data ("likedness"), whereas with the 1-5 star rating system they're only getting data from people who like the video (and almost none from people who disliked it - look at the distribution). This makes the data practically useless for determining the quality and user preference for a video. Consequently ranking algorithms just won't work on the star system - the difference between video #1 and video #100,000 can be an average rating of 4.92 versus 4.8.

"Those are subjective and different for everyone as well"

So are movie ratings - but it's still a very useful metric to a lot of people. With a large enough sample size you get the lowest common denominator preference measure - which may be what YouTube wants.

"the problem is that people can't be bothered"

I object to the labeling of basic user behaviour as laziness or some type of stupidity. Users will behave how they behave - assigning value judgments to this behaviour just makes you a prick, and disconnects you from your users (who also happen to be your customers, yay!). If your users aren't using your system in the way you intended, you need to fix it. Trying to pawn off your responsibility in the equation as "lazy users can't be bothered" is simply a cop-out.

[+] Timothee|16 years ago|reply
Actually, I remember reading a post around the Netflix prize that basically said it wasn't important to know exactly what people thought 2, 3, 4... stars meant, because overall it was still enough to obtain the data and ratings relevant to your point of view.
[+] JDigital|16 years ago|reply
The more motivated you are to vote, the more likely you are to vote it a five. A mediocre video is characterized by a lack of votes, not a preponderance of three-star votes.

Back when Youtube rounded the average vote result down to the nearest star, I joked that Youtube videos really only had three ratings:

Five star: New video (eventually it will accrue a vote other than 'five' and drop to four stars)

Four star: Top notch (mainly fives, a few troll one-star votes)

Three star: Disgusting trash (at least as many one-star votes as fives)

You don't see anything less than three-star since it will have been deleted by the time you get there.

[+] johnfn|16 years ago|reply
The way to make ratings more valuable can be seen on a site like www.rateyourmusic.com . You only get one vote per item, and voting again just changes your previous vote. Furthermore, you can see your voting distribution (this usually cows people into not just voting 5 on everything, because that makes you look like an idiot). On the other hand, youtubers probably aren't too concerned with looking like an idiot...
[+] bdmac97|16 years ago|reply
Kinda always had a feeling that's what happens but nice to see graphical proof! I feel good now about choosing +/- rating only for launchly.
[+] chrischen|16 years ago|reply
You still have to be careful with a +/- rating system, because for it to work you need a comparable number of viewers for every item. The more the actual number of viewers deviates from that ideal, the more unfair the voting is. An example would be a popular item getting more total votes simply because a larger sample of people were exposed to it. So a +/- system probably must be expressed as a percentage to correct this problem. However, I can see a 5-star system being improved by showing some sort of relative data too.

So I guess a +/- system may not necessarily be more advantageous. It all depends on how carefully and correctly you interpret the data.
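Expressing +/- votes as a percentage, as suggested above, still over-trusts tiny samples; one simple fix is a smoothing prior (the prior counts here are arbitrary assumptions, just a sketch):

```python
def like_percentage(ups, downs, prior_ups=1, prior_downs=1):
    """Smoothed fraction of positive votes (Laplace-style prior).
    The phantom prior votes keep an item with 1 up / 0 down from
    beating one with 95 up / 5 down."""
    return (ups + prior_ups) / (ups + downs + prior_ups + prior_downs)
```

With the default prior, a single up-vote scores 2/3 rather than a perfect 100%, while 95-of-100 scores about 0.94 and ranks above it.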

[+] rsheridan6|16 years ago|reply
It really doesn't matter very much. The main thing the rating system does is allow you to avoid utterly crappy or misrepresented videos. I don't see what difference it makes whether you're avoiding one and two star videos or videos with too many thumbs down.

I don't think changing to a thumbs up or down system would affect my experience of youtube one way or another.

[+] chrischen|16 years ago|reply
YouTube is slow to realize it... You don't need statistics to show that 5-star ratings are inaccurate.