Transforming Wikipedia into a cultural knowledge quiz

[+] testplzignore|7 years ago|reply

> Digging in, it turned out “Apple” belonged to the category Steve Jobs which eventually belonged to… “People,” of course. It turned out Wikipedia categories aren’t strictly hierarchical at all, but are used for so many “related” things as to make them useless for determining what kind of a thing an article represented.

This feels like a problem worth solving on Wikipedia itself. It would be nice if categories could be marked as non-hierarchical, so that for a given category, you could know whether its articles could be classified under all of the ancestor categories.

https://en.wikipedia.org/wiki/Category:Eponymous_categories would be a good place to start. Probably most of those are not hierarchical.
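
To make the idea concrete, here is a minimal sketch of what a "non-hierarchical" flag would buy you. The graph is a toy: apart from "Apple" sitting under "Steve Jobs" and eventually reaching "People" (which is from the article), every category name and the `NON_HIERARCHICAL` set are made up for illustration, and nothing here reads real Wikipedia data.

```python
from collections import deque

# Toy "article/category -> parent categories" graph.
# ASSUMPTION: the intermediate category names are hypothetical stand-ins,
# not real Wikipedia category data.
PARENTS = {
    "Apple Inc.": ["Steve Jobs", "Technology companies"],
    "Steve Jobs": ["American businesspeople"],
    "American businesspeople": ["People"],
    "Technology companies": ["Companies"],
    "Companies": ["Organizations"],
}

# Hypothetical flag: categories that mean "related to", not "is a kind of".
NON_HIERARCHICAL = {"Steve Jobs"}


def ancestors(node, parents, skip=frozenset()):
    """Return every category reachable via parent links, never entering `skip`."""
    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent in skip or parent in seen:
                continue
            seen.add(parent)
            queue.append(parent)
    return seen


naive = ancestors("Apple Inc.", PARENTS)
filtered = ancestors("Apple Inc.", PARENTS, skip=NON_HIERARCHICAL)

print("People" in naive)     # the company is wrongly classified as a person
print("People" in filtered)  # the flagged category is no longer followed
```

With the flag in place, the walk simply refuses to pass through an eponymous category, so the remaining ancestors can be read as "kinds of thing" again.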

[+] tyingq|7 years ago|reply

Wikidata is structured. Here's the entry for Apple: https://m.wikidata.org/wiki/Q312

[+] thaumasiotes|7 years ago|reply

>> It turned out Wikipedia categories aren’t strictly hierarchical at all, but are used for so many “related” things as to make them useless for determining what kind of a thing an article represented.

> This feels like a problem worth solving on Wikipedia itself.

Why would that be a problem at all? "Apple" (I assume this refers to the company) makes perfect sense in the "Steve Jobs" category. If you looked up the listing for Category:Steve_Jobs you'd expect to find it there. "Fixing" this problem would just make Wikipedia worse.

The better fix would be to realize that categories aren't strictly hierarchical, and you shouldn't operate as if they are.

[+] unknown|7 years ago|reply

[deleted]

[+] thanatropism|7 years ago|reply

The next step is to make crossword puzzles from this.

(I'm next to illiterate about constraint satisfaction programming. How hard is it to make reasonable crossword puzzles?)

[+] theoh|7 years ago|reply

The constraint programming part isn't the bit that makes the difference between a good and bad crossword. The difference is in the quality of the clues. Generating plausible 'cryptic' clues is probably well beyond the ability of current AI.

If you wanted 'Jeopardy'-style clues, that's easier.

[+] WAthrowaway|7 years ago|reply

Creating a grid to fit the chosen words in wouldn't be so hard. The real difficulty would be coming up with clever clues that weren't just the first paragraph of the wiki article with the names removed.

[+] Udik|7 years ago|reply

Like others, I did the test and was rather disappointed at my score :). And yes, there seem to be a lot of rappers and Bollywood stars and movies in the quiz that don't really appeal to my European sense of "culture". I wonder if instead of (or better, in addition to) page popularity it wouldn't be wise to use the number of translations of an entry into other languages. That should at least ensure that an item is considered important across local cultures, which is usually a good indicator of cultural importance. Did you try that?

[+] JauntyHatAngle|7 years ago|reply

Isn't it a bit much to say "Scientifically Accurate" when people are more or less just checking boxes? My feeling would be that people are massively overrepresenting their own knowledge.

[+] crazygringo|7 years ago|reply

Author/creator here, happy to answer any questions.

[+] kbenson|7 years ago|reply

That's an amazing project, and I loved reading all the ways you worked through the problems as you encountered them.

> There was just one final problem: occasionally items would pop up that were definitely NSFW, or just made you feel icky when reading the description. To make the quiz more family-friendly, I filtered out anything related to adult entertainment (quite a few porn stars in the top 10,000), as well as contemporary people notable principally for violent crime (whether as perpetrators or victims). There are just… some things you’d rather not read about while eating lunch.

I can't help but think it would be fun to take the version with NSFW content still in, or even limited to only those items.

It would be really interesting to use that to see how aware certain subgroups are of this content, e.g. certain subreddits and 4chan...

[+] psychometry|7 years ago|reply

Wouldn't it make more sense to ask country first and then generate examples? I'm from the U.S. and 10% of my examples were Bollywood actors.

[+] ppereira|7 years ago|reply

You may wish to check out the MIT Pantheon project which ranks people not just by page views, but also by the log of birth year and number of languages that their biography has been translated into. With that metric, knowing Aristotle would be much more valuable than knowing Justin Bieber, whose name is likely to decline in importance well within one lifetime, and is perhaps hardly known at all outside certain countries.

[+] tobr|7 years ago|reply

I wonder if you're mostly measuring someone's interpretation of "uniquely identify" and "already knew existed"?

Did you consider making it an actual quiz with options to verify if someone actually knows what they claim? (Only skimmed the article, sorry if you mentioned it)

[+] 8bitsrule|7 years ago|reply

I realize you had to boil down a giant pile into a representation, but how? Some cultural categories seem to be under-represented. E.g. I didn't see a single composer in the whole lot ... or cathedral ... very few artworks ... too many films ... And it's pop-culture heavy.

Reminds me of Kenneth Clark's definition of 'Civilization' including only Europe ...

You've got the start of something here, but is it culture or people magazine?

[+] throwawaw666|7 years ago|reply

Have you ever seen this sketch? https://www.youtube.com/watch?v=vZ9myHhpS9s

[+] webwanderings|7 years ago|reply

How are you throwing Indian cultural entities? It doesn't seem to make sense, though I have not read your literature.

[+] rcMgD2BwE72F|7 years ago|reply

Why use Wikipedia infoboxes instead of Wikidata items?

[+] crazygringo|7 years ago|reply

I would have loved to use Wikidata -- I actually first attempted a prototype of this several years ago using the similar Freebase as a datasource, until it was bought and shut down by Google.

Wikidata looked very promising, but I was worried about whether it would contain all the data I would wind up needing, or whether it would be in the same format 2 or 5 years from now. Wikipedia is a household name and the information in it has a lot of eyeballs on it constantly, while I couldn't tell whether I could be equally confident in Wikidata as a project -- so really, taking a conservative approach is the only reason.

[+] sandov|7 years ago|reply

It's worth pointing out that "Your culture" here means "first world, English speaking countries' culture".

[+] telesilla|7 years ago|reply

More than that - pop culture. I hardly know any rappers or RnB singers in the 21st century but I can name, for a start, a number of contemporary philosophers, computer languages and their creators, artists and composers. I don't think I'm in the minority for not knowing pop culture? Still, impressive project from the author and I understand that the result is ultimately pop-culture skewed, given the restrictions.

[+] RLN|7 years ago|reply

I'd go as far as to say it generally just means United States culture.

[+] dang|7 years ago|reply

Recent discussion: https://news.ycombinator.com/item?id=18175910

[+] kevinwang|7 years ago|reply

Shouldn't the logistic regression picture show a curve that looks more like this instead of a line? https://en.m.wikipedia.org/wiki/Logistic_regression#/media/F.... Or is the curve just really flat?

[+] crazygringo|7 years ago|reply

You're right, I mixed up the terms while writing this. It doesn't use the logistic function, it's a more general case of binomial regression [1]. (The example is a line, but the site actually uses a logarithmic function as its link function.) Just corrected the post, thanks.

[1] https://en.wikipedia.org/wiki/Binomial_regression

[+] rahulcap|7 years ago|reply

Very interesting article. In the end, I was a bit confused about how you converted the binomial regression to a single number. I understood that the output was a probability that I know each of the 10,000 items, so then did you need to use some cutoff to decide that I "knew it"?

Anyways, I am interested to see what analysis you do after you get more data.

[+] crazygringo|7 years ago|reply

Thanks for the interest -- it's actually just a sum of the probabilities for the items from 1 to 10,000. For example, if there's a 0.1 chance you know each of 10 items, it adds up to a total value of 1 -- no cutoff needed.

Mathematically, there's a trick where you don't even need to compute the sum item-by-item... I calculate the binomial regression which gives me the two relevant parameters, from which I can calculate the probability density function (PDF) [1] for an item of given rank. Then I just calculate the associated cumulative distribution function (CDF) with the same two parameters [2] for rank 10,000 -- and that's the final result.

[1] https://en.wikipedia.org/wiki/Probability_density_function

[2] https://en.wikipedia.org/wiki/Cumulative_distribution_functi...
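
A rough sketch of the arithmetic, for anyone who wants to see it run. The first part is exactly the 10 x 0.1 example above. The second part is only an illustration of "sum versus closed form": the curve `p_known`, its parameters `A` and `B`, and the use of a plain integral in place of the CDF are all assumptions, since the site's fitted function is not given in this thread.

```python
import math

# Worked example from the comment: ten items, each known with probability 0.1.
# The expected number of known items is just the sum of the probabilities.
expected_known = math.fsum([0.1] * 10)
print(expected_known)

# ASSUMPTION: a made-up curve p(rank) = A - B * ln(rank). The real fitted
# function and its two parameters are not stated in the thread.
A, B, N = 0.95, 0.10, 10_000


def p_known(rank):
    """Hypothetical probability of knowing the item at a given popularity rank."""
    return A - B * math.log(rank)


def item_by_item(n):
    """Add up the per-item probabilities for ranks 1..n, one at a time."""
    return math.fsum(p_known(r) for r in range(1, n + 1))


def closed_form(n):
    """Integral of A - B*ln(x) from 1 to n, standing in for the CDF shortcut."""
    return A * (n - 1) - B * (n * math.log(n) - n + 1)


print(item_by_item(N), closed_form(N))
```

Because the assumed curve is decreasing, the item-by-item sum and the closed form can differ by at most the probability at rank 1, which is why skipping the loop costs so little accuracy over 10,000 items.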

[+] arielbaz|7 years ago|reply

Is combosaurus one of the 10,000? https://www.quora.com/Whatever-happened-to-Combosaurus

[+] dpatrick86|7 years ago|reply

Following the instructions pedantically (e.g. emphasizing "uniquely identify," especially for the individuals) probably leads to an enormous biasing of the score.

[+] personjerry|7 years ago|reply

How did you drive traffic to your quiz site? I feel like this will significantly affect your results.

[+] jordiburgos|7 years ago|reply

How are the items selected?

[+] asianthrowaway|7 years ago|reply

I guess I'm not cultured for not knowing about all the rappers and actors who seem to make up 90% of the list.

I did have a good laugh at Jared Kushner being listed as an "investor".

[+] jonwachob91|7 years ago|reply

His money is in real estate investments. Might not be tech VC, but an investor is still an investor.

[+] mwfunk|7 years ago|reply

It is funny (sad and maybe a red flag but also funny). His job description is based on where his money goes, not where it comes from.

46 comments