
What interests Reddit? A network analysis of 84M comments by 200K users

175 points| alexcasalboni | 11 years ago |markallenthornton.com

58 comments

[+] faizshah|11 years ago|reply
I'm working on a project relevant to this. Does anyone know if the author has shared this data set anywhere? Or does anyone know of any data sets that could be used for developing mixture models to classify users into interest groups (like photographers, programmers etc)?
[+] aroch|11 years ago|reply
If I'm remembering correctly, the raw data was given under an NDA/DND as a one-time-only deal. There was a subreddit associated with the data and collection, but it's since been banned.
[+] placeybordeaux|11 years ago|reply
Yeah the thing I most wanted out of the post is a torrent to the data.
[+] stared|11 years ago|reply
The author released data, links are at the bottom of the article (great thanks to him!).
[+] Houshalter|11 years ago|reply
The admins released a bunch of anonymized voting data once. And several people have distributed datasets of scraped comments. Sorry, I don't have links handy. Check through /r/redditdev
[+] stared|11 years ago|reply
I really like analyses like this that are based on the network structure (not long ago I made something similar for Stack Exchange - http://stared.github.io/tagoverflow/; a continuation of my older one https://github.com/stared/tag-graph-map-of-stackexchange/wik...).

Though, technology-wise, it is one use case where SVG beats pixel graphics, both in terms of usability and interface (whether it is custom D3.js or something graph-oriented such as http://sigmajs.org/).

[+] wamatt|11 years ago|reply
if you set tag coloring to "% answered", an interesting pattern emerges

http://i.imgur.com/ZLmWHrq.png

responsiveness of the community in order of most to least

- oldschool hacker (c/c++,bash,perl,regex)

- web dev (jquery, javascript, html, css)

- app dev (ios, objectivec, android, java)

[+] jedberg|11 years ago|reply
Fun fact: We did this exact analysis at reddit many years ago, and used it to figure out which subreddits were related to each other. We never got around to productizing it, unfortunately, but the idea was to use it to suggest new reddits to you.
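As a rough illustration of how "related subreddits" could be derived from comment activity, here is a minimal sketch that scores each pair of subreddits by the overlap of their commenters (Jaccard similarity). This is an assumption about one plausible approach, not reddit's actual method; the function name `related_subreddits`, the `min_similarity` threshold, and the toy data are all hypothetical.

```python
from itertools import combinations


def related_subreddits(commenters, min_similarity=0.2):
    """Score subreddit pairs by Jaccard overlap of their commenter sets.

    `commenters` maps a subreddit name to the set of users who commented there.
    Returns (subreddit_a, subreddit_b, score) tuples, highest score first.
    """
    pairs = []
    for a, b in combinations(sorted(commenters), 2):
        users_a, users_b = commenters[a], commenters[b]
        union = users_a | users_b
        if not union:
            continue  # two empty subreddits: similarity is undefined, skip
        score = len(users_a & users_b) / len(union)
        if score >= min_similarity:
            pairs.append((a, b, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical toy data, not taken from the article's data set.
commenters = {
    "python": {"ann", "bob", "cy"},
    "programming": {"ann", "bob", "dee"},
    "cats": {"dee", "eve"},
}

pairs = related_subreddits(commenters)
for a, b, score in pairs:
    print(f"{a} <-> {b}: {score:.2f}")
```

To suggest new subreddits to a user, one could then look up the highest-scoring neighbours of the subreddits that user is already active in.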
[+] sinemetu11|11 years ago|reply
I guess this might get into some special sauce territory, but was there a specific reason why this type of recommendation system was deprecated?
[+] swalsh|11 years ago|reply
One potential downside of using an algorithm like that is the possibility of a feedback loop.
[+] hooo|11 years ago|reply
I find these network visualizations nice to look at, but not all that insightful. They're generally hard to read and track relations outside of the main clusters. Am I missing something?
[+] th0ma5|11 years ago|reply
No I don't think so. A lot of people call these things "hairballs," and probably a more useful interface would be some kind of faceted browser that allows you to do pivots and look at aggregate stats of the various lenses you can put on top of a graph. Additionally, measurements such as node separation, "betweenness," or perhaps even looking at common chain patterns are probably more useful ways of trying to dissect graph structures.
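To make the "betweenness" measurement mentioned above concrete, here is a minimal sketch of betweenness centrality for a small unweighted, undirected graph, using Brandes' algorithm in plain Python. The toy graph borrows word labels discussed elsewhere in this thread purely for illustration; its edges are hypothetical and not taken from the article's data.

```python
from collections import deque


def betweenness(graph):
    """Betweenness centrality via Brandes' algorithm.

    `graph` maps each node to the set of its neighbours (undirected, unweighted);
    every neighbour must also appear as a key.
    """
    centrality = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search, counting shortest paths from `source`.
        order = []                        # nodes in order of non-decreasing distance
        preds = {v: [] for v in graph}    # predecessors on shortest paths
        sigma = {v: 0 for v in graph}     # number of shortest paths from source
        dist = {v: -1 for v in graph}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:           # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, farthest nodes first.
        delta = {v: 0.0 for v in graph}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Each undirected shortest path was counted once from each endpoint.
    return {v: c / 2 for v, c in centrality.items()}


# Hypothetical toy graph: "state" bridges three otherwise unconnected words.
graph = {
    "government": {"state"},
    "state": {"government", "property", "force"},
    "property": {"state"},
    "force": {"state"},
}

scores = betweenness(graph)
print(scores)
```

Nodes with high scores sit on many shortest paths between other nodes, which is one way to pick out the "bridge" terms that a hairball layout tends to hide.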
[+] SwellJoe|11 years ago|reply
What interests reddit? Casual racism and misogyny. Also cats.

Seriously though, it's interesting how interconnected some things can be in this view. I'm not sure what sense I can make of those interconnections, though. Mousing around, while being a very frequent redditor (so my own neural network is making connections based on experience), I can kinda infer order out of things like the "government->state" topics connected to "force" and "property" among others (hints at the libertarian-leaning general population), and the "women" topic connecting to a whole host of stuff...the cyan colored section off to the top right might even kinda hint at the casual misogyny thing (which was a "ha ha only serious" kind of joke), with words like "bullshit", "logic", "proof", "assumption", "reasoning", and "evidence", being connected to "women" but not to "men".

But, without having spent years on reddit, and without my particular flavor of reddit (the subs I'm subscribed to), maybe I'd interpret the data very differently. I never quite know how to interpret network graphs like this, honestly, short of for things that are networks. i.e. a computer network topology on a graph shows useful data...the hops from one machine to the next. When connecting up one word to the next, it seems difficult to draw meaningful conclusions. Like my interpretation of the meaning of "government->state->property" as being a hint at the libertarian leanings of many subreddits, or the connection of "women->reasoning->evidence" as being a hint of many redditors' belief that women are illogical liars (which is the impression many of my female friends have of reddit, in general, particularly when topics like date rape or the "friend zone" come up). Is that actually the context in which these connections are made? I wouldn't really know how to check. It'd be cool to be able to drill down to conversations in which the connections were made, but presenting that in a coherent UI seems challenging.

[+] thejaredhooper|11 years ago|reply
I agree. It would be nice to drill down into the data in order to further analyze everything. I also feel there was a particular sort of censorship in the dataset, an indicator of which was the explicit racist and misogynistic words that were absent. There was a large lack of swears and bad terms in this analysis (bitch being a particularly obvious cut) and I, for one, see examples of these slurs prevalently used by young men far too often on the site.

Perhaps the data was tailored when it was provided to the analyst, or it was censored after reception, but this felt too "PG-13" for an analysis of reddit's "interests".

[+] 6stringmerc|11 years ago|reply
I get the feeling that Conde Nast may not like this type of approach when they're not directly profiting from it. A study of the language between the SFW and NSFW type tags might be pretty interesting, or, well, not very pleasant. I did participate in a couple music communities for a while, but there's something in the stew over there that I'm glad I closed my account and never looked back. YMMV.
[+] shillster|11 years ago|reply
Or even better, if we could systematically measure the brigades, moderator manipulation and psyops.
[+] erroneousfunk|11 years ago|reply
Small point: Is it really considered "scraping" ("I scraped approximately 84 million comments") if you used a Python library that uses the Reddit API, not the actual site directly?
[+] fspacef|11 years ago|reply
Salute the effort put into this, quite thought provoking
[+] okasaki|11 years ago|reply
Maybe I'm just stupid, but I don't see anything thought provoking.

In fact I feel that a better way to see what redditors are interested in would be to just find (there may even be stats on reddit on this) the ~50 most active subreddits.

[+] grabcocque|11 years ago|reply
Misogyny seems to be big in Reddit comments.
[+] seany|11 years ago|reply
You misspelled misandry.