When you don't start your y-axis at 0, you skew the interpretation of your data. At best, this is a significant mistake, and at worst, this is intentionally misleading.
Take this graph: http://i.imgur.com/bBzCK.png
It looks like women are rated as more than twice as smart as men. Huge difference.
That is, until you run the numbers. Women are rated about 4.3% "smarter" than men. Not twice as smart, as the graph implies. Not 20% smarter. Not even 5%.
Please, pay attention to your graphs. They're great tools, but they can mislead as much as they can help elucidate.
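To make this concrete, here is a tiny sketch of how a truncated y-axis turns a ~4% gap into what looks like a multiple-fold difference. The means here are invented for illustration; they are not the blog's actual numbers.

```python
# How a truncated y-axis exaggerates a small difference.
# Hypothetical means, chosen to be ~4.3% apart like the reported gap.
men, women = 5.75, 6.00
axis_floor = 5.70  # where the truncated y-axis starts

true_ratio = women / men                                   # what the data says
visual_ratio = (women - axis_floor) / (men - axis_floor)   # what the bars show

print(f"actual difference: {100 * (true_ratio - 1):.1f}%")  # ~4.3%
print(f"visual bar ratio:  {visual_ratio:.1f}x")            # 6.0x
```

The same 0.25-point gap reads as a six-fold difference in bar height once the axis starts at 5.7 instead of 0.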
Gah, this data is so interesting, but it's not presented very well.
The first change I would make is to put all the red bars next to each other, and the blue bars next to each other; there is more value in comparing across ethnicities/genders than there is in comparing across variables. The second change I would make is to get rid of the eye-popping primary colors and choose two more neutral ones (two different shades of beige, maybe?)
Finally, the Y-axis scale changes too frequently. Also, error bars would be nice, as has been mentioned.
edit: Now that I think more about it, it would be nice to show, at a glance, a scatter plot with a dot for each of Male/Female/Ethnicity/etc in the way that you have a scatter plot on your home page. http://judg.me
And worse than that, the y-axis range shifts. For example, the range on the Fat vs. Normal chart is much smaller than on the surrounding charts, giving the illusion that the spread is much larger than it is (site is down or I'd link).
A cached copy of the post: http://judg.me.nyud.net/blog/judgment-day/
True, and I was having a tough time deciding which option to take (whether to start the y-axis at 0, or at 5). As far as averages are concerned, the differences are in the tenths and hundredths, so starting the y-axis at 0 would make it more difficult to see any differences across the graphs.
Inter-rater reliability is super important for tests like this. http://en.wikipedia.org/wiki/Inter-rater_reliability The gist is: you can't simply mark one person as "asian" and assume that categorization is correct. In that respect, the data would reveal more about the person sorting the photos than it would reveal about the perceptions of those that are rating the photos.
Second, there is a huge problem with causality here. For instance, the author writes: "Be Asian if you want to appear smart; Latino if you want to appear extroverted." The problem is that there is a methodological flaw. On the first photo I saw on judg.me, I was presented with this image: http://images.judg.me/82e7fcbd988dbdcac0d00bd53fb93e96.jpg This would appear to me to be a Latino or Hispanic male at a party. I'm highly inclined to rate him highly on the extrovert scale: he's at a party. But that doesn't mean that stereotypically Latino or Hispanic features indicate extroversion. It could be that people with stereotypically Latino or Hispanic features were more likely to upload photos portraying a more stereotypically extroverted activity.
Third, it appears that users can upload a photo to the site and see the feedback from votes. It seems highly possible that users self-select a photo that will best affirm the image of themselves they wish to cultivate. In that respect, there's both a huge confirmation bias and a huge self-selection bias. If I want to think of myself as an academic, I'll upload a picture of me at my desk studying and watch the "intellectual" ratings pour in. Then I can feel assured that other people perceive me the way I want to be perceived. Additionally, if one wants to conform to social expectations (and things like Asch's line test http://en.wikipedia.org/wiki/Asch_conformity_experiments indicate conformity is common), this data might really show nothing more than the degree to which people post photos affirming their conformity to social expectations (i.e. 'smart' ethnicities posting 'smart-looking' photos), and say nothing at all about how people actually perceive ethnic cues.
There are huge methodological concerns with this 'study'. The real revelation of this data might actually be the insight that "pictures of yourself at social events make you look more social." Taking much of anything at all away from this data set would be rather unwise.
"In that respect, the data would reveal more about the person sorting the photos than it would reveal about the perceptions of those that are rating the photos."
The problem is that none of this matters, since we don't know anything about the people rating the photos. Not their sex, not their age, not their location. Nothing.
To wit:
"and nothing about users who judge the photos."
http://news.ycombinator.com/item?id=3921271
In addition to the inter-rater reliability issue, there are a lot of unanswered questions about the statistical distributions involved. The results are reported as population means, but without information about the underlying distribution it's unclear whether the mean is a meaningful measure of central tendency for the data, or how much overlap there was between the distributions. How did the mean compare with the median and mode? What were the standard deviations? The interquartile range? They're using a visual analog scale for the ranking, which is reasonable, but it seems to have simply been assumed that the results can be treated as interval data, and the validity of that assumption hasn't been established. If I were doing the analysis I'd have been inclined to bin the data and report the results as odds ratios with 95% confidence intervals (e.g. people wearing glasses are N times more likely, with a 95% CI around N, to be regarded as "smart" than those without glasses, where "smart" is defined as a score at or above some reasonable threshold on the "smartness" axis).
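The binned odds-ratio analysis this commenter suggests can be sketched in a few lines. The 2x2 counts below are hypothetical, and the 95% CI comes from the standard error of the log odds ratio (Woolf's method):

```python
import math

# "smart" = rating >= some threshold; counts are invented for the sketch.
glasses_smart, glasses_not = 120, 80   # glasses wearers rated smart / not smart
plain_smart, plain_not = 90, 110       # non-wearers rated smart / not smart

odds_ratio = (glasses_smart / glasses_not) / (plain_smart / plain_not)

# 95% CI via the standard error of the log odds ratio (Woolf's method)
se_log_or = math.sqrt(1/glasses_smart + 1/glasses_not + 1/plain_smart + 1/plain_not)
log_or = math.log(odds_ratio)
ci_low = math.exp(log_or - 1.96 * se_log_or)
ci_high = math.exp(log_or + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

With these made-up counts the interval excludes 1, which is exactly the kind of statement the blog's bar charts never support.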
I would think that a multi-level/hierarchical/mixed GLM would be an interesting approach to their data. Multilevel modeling assumes that there is correlation between observations inside the same "level". This is in stark contrast to a regular GLM (even one with dummy variables to represent categories), which assumes that all observations are 100% independent.
E.g. in a model that predicts students' GPA, you could divide your data into a hierarchy consisting of, at the highest level, geographic area, followed by high school, maybe followed by teacher. In that model, the correlation between students who are in the same state, the same school or in the same classroom would be accounted for. You could even go as deep as at an individual level if you have >1 observation per student.
In addition to regular predictive variables, judg.me could probably use their weblogs to group people's judgement scores by country of origin and by individuals, among other possibilities.
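As a toy illustration of the within-group correlation a multilevel model accounts for, here is a simulation of rater groups (say, countries of origin; all numbers are invented) and the intraclass correlation ICC(1) computed from a one-way ANOVA decomposition. An ICC near 0 would justify the independence assumption of a plain GLM; a large ICC would not.

```python
import random
import statistics as stats

# Simulate 20 rater "countries", each with its own baseline leniency.
random.seed(0)
k = 30  # raters per group
groups = []
for _ in range(20):
    baseline = random.gauss(5.0, 0.8)   # group-level effect
    groups.append([baseline + random.gauss(0, 0.4) for _ in range(k)])

# One-way ANOVA decomposition -> intraclass correlation ICC(1)
grand = stats.mean(x for g in groups for x in g)
msb = k * sum((stats.mean(g) - grand) ** 2 for g in groups) / (len(groups) - 1)
msw = stats.mean(stats.variance(g) for g in groups)
icc = (msb - msw) / (msb + (k - 1) * msw)
print(f"ICC ~ {icc:.2f}  (0 would mean observations are independent)")
```

With the simulated group effect dominating the noise, the ICC lands around 0.8, i.e. observations within a group are far from independent.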
While inter-rater reliability is a concern for identifying ethnicities, other traits should be more reliable (gender, hair color). Still, categories like that need to be pre-tested. You basically let the coders code a limited set of photos and check the correlation between their codes. The higher the correlation, the better; >.7 is the convention in social science (but that's still pretty bad, and higher would be better).
You should also check intra-coder reliability, i.e. give the same person the same set of photos with two weeks or so in between. You can then again calculate the correlation. This tells you whether your categories are too fuzzy (e.g. what exactly is medium-length hair?).
All in all this has serious methodological flaws; from a social science perspective it's not salvageable, and I haven't even talked about the complete lack of statistical tests (which, to be honest, would just be like polishing a turd).
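A minimal version of the pre-test described above is Cohen's kappa, which corrects raw agreement between two coders for agreement expected by chance. The coders and labels below are hypothetical:

```python
from collections import Counter

# Two hypothetical coders labeling the same eight photos.
coder_a = ["asian", "latino", "asian", "white", "black", "asian", "white", "latino"]
coder_b = ["asian", "latino", "white", "white", "black", "asian", "latino", "latino"]

n = len(coder_a)
observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

# Chance agreement from each coder's marginal label frequencies
freq_a, freq_b = Counter(coder_a), Counter(coder_b)
expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2

kappa = (observed - expected) / (1 - expected)
print(f"observed {observed:.2f}, chance {expected:.2f}, kappa = {kappa:.2f}")
```

Here raw agreement is 0.75 but kappa is only about 0.66, below the >.7 convention mentioned above: the categorizations would need rework before any downstream analysis.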
The problem with this blog is so obvious that I'm surprised I haven't seen it in the comments yet (I probably missed it): you can't use a random selection of photographs for this if you expect gross ratings to mean something. You would have to normalize each trait you were comparing against every other trait. Otherwise, when you were trying to isolate how smart people judge black people to be, and black people were wearing caps a quarter more often than the average person, you would think you were getting interesting data for black people when really you were getting interesting data for caps.
If you didn't plan to use gross ratings like this blog did (I think), then I'm pretty sure that you could do a post-normalization by analyzing the frequencies in the sample and determining how much you'd expect each of the traits to affect the rating for every other trait, then trying to determine if the deviations from that were statistically significant in a universe that contained only those traits.
Honestly - just take the original data and assign every trait a 5 rating, then pick a random trait and pull that value up or down, then check and see what the gross ratings now say about the other traits.
I apologize if the methodology is more complicated than it looks, and I hope there's a link to a spreadsheet of the original distribution somewhere in the blog that I missed, so someone could make sense of this data.
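That perturbation check can be sketched directly: make ratings depend on a single trait (caps), let cap frequency differ between two otherwise identical groups, and a spurious group difference shows up in the gross averages. All rates and effect sizes here are invented:

```python
import random
import statistics as stats

random.seed(1)

def rating(cap_rate):
    # The rating is driven ONLY by the cap; group membership has zero effect.
    cap = random.random() < cap_rate
    return 5.0 - (1.0 if cap else 0.0)

group_a = [rating(0.45) for _ in range(2000)]   # wears caps often
group_b = [rating(0.20) for _ in range(2000)]   # wears caps rarely

print(f"group a mean: {stats.mean(group_a):.2f}")
print(f"group b mean: {stats.mean(group_b):.2f}")  # looks 'smarter' purely via caps
```

The gross averages differ by about a quarter of a point even though, by construction, the groups are identical; that's the caps-versus-ethnicity confound in miniature.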
I have no idea what the standard deviation on this is. Lots of the numbers look close enough to be noise. Others in this thread have pointed out other missing information that makes this a fairly poor survey.
What a tragic waste of data and time. Not one mention of confidence intervals (are _any_ of these differences statistically significant??), selection bias (who was more likely to submit photos, and why did they choose a specific photo??), or sampling errors (who rated the attributes, and how consistent were they?). The OK Cupid blog posts are a great source for similar (but statistically sound) studies.
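The significance check being asked for here is cheap to run. Below is a normal-approximation 95% confidence interval on the difference between two groups' mean ratings; the group means and spread are simulated, not taken from the post:

```python
import math
import random
import statistics as stats

# Simulated ratings for two groups; means and spread are invented.
random.seed(2)
group1 = [random.gauss(5.81, 1.8) for _ in range(500)]
group2 = [random.gauss(5.57, 1.8) for _ in range(500)]

diff = stats.mean(group1) - stats.mean(group2)
se = math.sqrt(stats.variance(group1) / len(group1)
               + stats.variance(group2) / len(group2))
lo, hi = diff - 1.96 * se, diff + 1.96 * se

print(f"difference {diff:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# If the interval spans 0, the difference could plausibly be noise.
```

With per-group spreads this wide, a quarter-point gap sits right at the edge of detectability even with 500 ratings per group, which is exactly why the post needed to report intervals.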
"thousands of photos have been uploaded and judged by users since."
Who are the users that are judging? What is the breakdown of those users (age, sex, location, education, etc.)? What can possibly be inferred from this without knowing that info?
It's entirely anonymous - I know nothing about the users who are uploading the photos (apart from the email addresses they use when uploading a photo), and nothing about users who judge the photos.
The entire premise of the site is for the user to be judged by strangers. Why would age/sex/location/education of the person doing the judging matter?
I read just enough to decide it isn't really worth reading. I love the articles OK Cupid does with hard statistical data backing up their inferences about similar social stuff. This does not strike me as of that ilk.
I am disappointed. I was recently thinking about how people are judged based on looks (and blogged about it) so was hoping for/looking forward to something meatier.
Your graphs appear to be very misleading, and there is little to be learned from the data. Learn some data analysis, and learn how not to introduce bias via your graphs.
That's usually when you have some idea of the error in what you're measuring. They are just reporting on a social poll they ran; isn't that usually a bit different? Of course this isn't a rigorous scientific study, but that doesn't mean it's useless either.
It is an interesting study, so I hope they update the post once they have been in business longer.
Why do people assume that parameters are independent?
If most black women who sent in their photos are fat, and people don't rate fat women highly, then black women will be rated low: not because of race, but because being a black woman and being a fat woman are correlated in the sample data.
Owners of such sites have a large sample of data, assume that large equals representative, and go on slicing their data by different parameters without controlling for anything, making statements that are only technically true with respect to their data but strongly misleading in many ways.
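A quick simulation of this point: below, the rating depends only on a "weight" attribute, but weight is correlated with group membership in the sample. The naive group comparison manufactures an effect, and stratifying by the confounder makes it vanish. All frequencies and effect sizes are invented:

```python
import random
import statistics as stats

random.seed(3)

def sample(n, heavy_rate):
    # (heavy?, rating) pairs; the rating penalizes 'heavy' and nothing else.
    out = []
    for _ in range(n):
        heavy = random.random() < heavy_rate
        out.append((heavy, 6.0 - (1.5 if heavy else 0.0) + random.gauss(0, 0.3)))
    return out

g1 = sample(3000, heavy_rate=0.5)   # group over-represented in the 'heavy' category
g2 = sample(3000, heavy_rate=0.2)

naive = stats.mean(r for _, r in g2) - stats.mean(r for _, r in g1)

# Control for the confounder: compare only within the same weight category.
within = stats.mean(
    stats.mean(r for h, r in g2 if h == flag) - stats.mean(r for h, r in g1 if h == flag)
    for flag in (True, False)
)
print(f"naive group gap:    {naive:.2f}")   # looks like a group effect
print(f"within-stratum gap: {within:.2f}")  # ~0: the 'effect' was the confound
```

Slicing by group alone shows a gap of almost half a point; comparing like with like shows essentially nothing.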
The authors conflate "extroversion" and "social skills". For example, based on his pic I'd rate this guy high on extroversion but low on social skills:
http://madconfessionsofaman.files.wordpress.com/2011/05/douc...
Similarly, being introverted doesn't mean you have low social skills.
I don't know how you can say that. The definition here: http://dictionary.reference.com/browse/introvert?s=t implies low social skills.
Even when people discuss it here, they talk about wanting to be left alone, not going to parties because they aren't interested in socializing, and feeling "weak" after socializing for a short amount of time.
With all of this, I don't know how your social skills could ever be considered high.
The y-axes vary a great deal, and there's no information on the distribution of ratings for each class. It's really difficult to tell if there's anything meaningful, or even interesting, here at all.
Awesome analysis, though I agree the results look like they're within the error bars.
I know you mention a random sample of 1000 images, but what were your overall metrics? Did you have a good data set across the board (i.e., as many Hispanic females as Caucasian males)? What kind of advertising did you do as well?
The reason I ask is that I've been working on trying to build a face-morpher based on different criteria (make you look 80, fat, African), and these are some of the questions I've got bouncing around in my head about how to collect the data.
The central limit theorem says that your error should scale like 1/sqrt(N), where N is the sample size. In this case N = 1000, so 1/sqrt(N) ~ 3%.
That's 1 standard error (i.e., roughly 68% of sample means lie within an interval of about 3% of the reported value). To be on the safe side you should take 2 or 3 standard errors for the error bars. This already nullifies most of the results!
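Putting numbers on that estimate, assuming a hypothetical standard deviation of 2 points on a 0-10 rating scale (the post doesn't report the actual spread):

```python
import math

# Standard error of the mean shrinks like 1/sqrt(N).
n = 1000    # ratings per category, per the post
sd = 2.0    # assumed spread of individual ratings on a 0-10 scale

sem = sd / math.sqrt(n)
print(f"1 SE  = {sem:.3f} rating points")
print(f"2 SEs = {2 * sem:.3f} rating points")
```

Under this assumption, two standard errors is about 0.13 rating points; many of the gaps in the charts are of roughly that size, so they could be noise.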
I'm sorry, but there's no 'unsexy data crunching' here - just a series of ratios compared against one another. There is a whole body of statistical literature about how to do anything of this kind, and they haven't done any of it. I'd quite happily believe that none of these differences have any kind of significance in the statistical sense (i.e., it's due to background variation). But then again, I wasn't given any information to know whether they've even looked. So I can't say...
Prof. Dan Ariely mentioned the Hot or Not website in one of his books; he used it to get his attractiveness score and other interesting data. The book is a great read, and an analysis of how people perceive you by your looks. As for the website, I think that judg.me looks very promising as a source of social data, which is otherwise very difficult to obtain.
These people are using the wrong definition of extroversion.
The actual site rates extroversion vs introversion, but the analysis here mistakenly uses the term social scale, implying that extroversion and sociability are interchangeable. They are correlated, but by absolutely no means are they interchangeable. This analysis should have stuck with the original vocabulary more consistently.
For someone in his early 30s whose hair is starting to thin out the results are interesting, though expected. I won't lose many social points but will pick up a good amount of perceived smartness when the baldness battle is finally lost. And to throw the "I'm a fun-loving extrovert" vibe out there for special occasions, I just throw on some shades.