I'm working on a project relevant to this. Does anyone know if the author has shared this data set anywhere? Or does anyone know of any data sets that could be used for developing mixture models to classify users into interest groups (like photographers, programmers etc)?
If I'm remembering correctly, the raw data was given under an NDA/DND as a one-time-only deal. There was a subreddit associated with the data and collection, but it's since been banned.
The admins released a bunch of anonymized voting data once. And several people have distributed datasets of scraped comments. Sorry, I don't have links handy. Check through /r/redditdev.
Though, technology-wise, it is one use case where SVG beats pixel graphics, both in terms of usability and interface (whether it is custom D3.js or something graph-oriented such as http://sigmajs.org/).
Fun fact: We did this exact analysis at reddit many years ago, and used it to figure out which subreddits were related to each other. We never got around to productizing it, unfortunately, but the idea was to use it to suggest new reddits to you.
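For anyone wanting to try something like this on their own data, here is a minimal sketch of one way to score subreddit relatedness. It is an assumption on my part, not the method reddit actually used: it ranks pairs of subreddits by the Jaccard overlap of their commenters. The function name `related_subreddits` and the toy `activity` data are hypothetical.

```python
from itertools import combinations


def related_subreddits(activity):
    """Rank subreddit pairs by Jaccard overlap of their users.

    activity: dict mapping subreddit name -> set of user names
    (a hypothetical input format, not an official reddit export).
    """
    scores = []
    for a, b in combinations(sorted(activity), 2):
        union = activity[a] | activity[b]
        if not union:
            continue  # both subreddits are empty, nothing to compare
        jaccard = len(activity[a] & activity[b]) / len(union)
        scores.append((jaccard, a, b))
    # Highest overlap first
    return sorted(scores, reverse=True)


# Toy data, made up for illustration
activity = {
    "photography": {"ann", "bob", "cat"},
    "analog": {"bob", "cat", "dan"},
    "programming": {"eve", "fay", "ann"},
}
ranked = related_subreddits(activity)
```

On real data you would probably want to weight by comment counts and correct for very large subreddits, which overlap with everything.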
I find these network visualizations nice to look at, but not all that insightful. They're generally hard to read and track relations outside of the main clusters. Am I missing something?
No, I don't think so. A lot of people call these things "hairballs," and a more useful interface would probably be some kind of faceted browser that lets you pivot and look at aggregate stats through the various lenses you can put on top of a graph. Additionally, measurements such as node separation, "betweenness," or perhaps even common chain patterns are probably more useful ways of trying to dissect graph structures.
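To make the "betweenness" idea concrete, here is a rough sketch of betweenness centrality for a small unweighted, undirected graph, using Brandes' algorithm in plain Python. The adjacency-dict input format and the three-node example (borrowing words from the visualization) are assumptions for illustration only.

```python
from collections import deque


def betweenness(graph):
    """Betweenness centrality for an undirected, unweighted graph.

    graph: dict mapping node -> list of neighbours; every neighbour
    must also appear as a key (assumed input format).
    """
    score = {v: 0.0 for v in graph}
    for s in graph:
        # BFS from s: count shortest paths (sigma) and record predecessors
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:  # first time we reach w
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints
    return {v: c / 2 for v, c in score.items()}


# Toy chain: government - state - property
graph = {
    "government": ["state"],
    "state": ["government", "property"],
    "property": ["state"],
}
centrality = betweenness(graph)
```

In the toy chain, "state" is the only node sitting between two others, so it is the only one with a nonzero score. A ranked table of such scores is often easier to read than the hairball itself.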
What interests reddit? Casual racism and misogyny. Also cats.
Seriously though, it's interesting how interconnected some things can be in this view. I'm not sure what sense I can make of those interconnections, though. Mousing around, while being a very frequent redditor (so my own neural network is making connections based on experience), I can kinda infer order out of things like the "government->state" topics connected to "force" and "property" among others (hints at the libertarian-leaning general population), and the "women" topic connecting to a whole host of stuff...the cyan colored section off to the top right might even kinda hint at the casual misogyny thing (which was a "ha ha only serious" kind of joke), with words like "bullshit", "logic", "proof", "assumption", "reasoning", and "evidence", being connected to "women" but not to "men".
But, without having spent years on reddit, and without my particular flavor of reddit (the subs I'm subscribed to), maybe I'd interpret the data very differently. I never quite know how to interpret network graphs like this, honestly, short of for things that are networks, e.g. a computer network topology on a graph shows useful data...the hops from one machine to the next. When connecting up one word to the next, it seems difficult to draw meaningful conclusions. Like my interpretation of the meaning of "government->state->property" as being a hint at the libertarian leanings of many subreddits, or the connection of "women->reasoning->evidence" as being a hint of many redditors' belief that women are illogical liars (which is the impression many of my female friends have of reddit, in general, particularly when topics like date rape or the "friend zone" come up). Is that actually the context in which these connections are made? I wouldn't really know how to check. It'd be cool to be able to drill down to conversations in which the connections were made, but presenting that in a coherent UI seems challenging.
I agree. It would be nice to drill down into the data in order to further analyze everything. I also feel there was a particular sort of censorship in the dataset, an indicator of which was the explicit racist and misogynistic words that were absent. There was a large lack of swears and bad terms in this analysis (bitch being a particularly obvious cut) and I, for one, see examples of these slurs prevalently used by young men far too often on the site.
Perhaps the data was tailored when it was provided to the analyst, or it was censored after reception, but this felt too "PG-13" for an analysis of reddit's "interests".
I get the feeling that Conde Nast may not like this type of approach when they're not directly profiting from it. A study of the language between the SFW and NSFW type tags might be pretty interesting, or, well, not very pleasant. I did participate in a couple of music communities for a while, but there's something in the stew over there, and I'm glad I closed my account and never looked back. YMMV.
Small point: Is it really considered "scraping" ("I scraped approximately 84 million comments") if you used a Python library that uses the Reddit API, not the actual site directly?
Maybe I'm just stupid, but I don't see anything thought provoking.
In fact I feel that a better way to see what redditors are interested in would be to just find (there may even be stats on reddit on this) the ~50 most active subreddits.
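As a sketch of that simpler approach, assuming you already have a list with one subreddit name per comment (the `comments` list below is made up), counting the most active subreddits takes only a few lines:

```python
from collections import Counter


def most_active(comment_subreddits, n=50):
    """Return the n subreddits with the most comments, busiest first."""
    return Counter(comment_subreddits).most_common(n)


# Hypothetical sample: one entry per comment
comments = ["pics", "programming", "pics", "cats", "pics", "programming"]
top = most_active(comments, n=2)
```

This only measures volume, so it says nothing about how the topics relate to each other.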
[+] [-] faizshah|11 years ago|reply
[+] [-] aroch|11 years ago|reply
[+] [-] placeybordeaux|11 years ago|reply
[+] [-] stared|11 years ago|reply
[+] [-] Houshalter|11 years ago|reply
[+] [-] stared|11 years ago|reply
[+] [-] wamatt|11 years ago|reply
http://i.imgur.com/ZLmWHrq.png
Responsiveness of the community, in order from most to least:
- oldschool hacker (c/c++,bash,perl,regex)
- web dev (jquery, javascript, html, css)
- app dev (ios, objectivec, android, java)
[+] [-] stared|11 years ago|reply
Link "ate" the semicolon.
[+] [-] jedberg|11 years ago|reply
[+] [-] sinemetu11|11 years ago|reply
[+] [-] swalsh|11 years ago|reply
[+] [-] hooo|11 years ago|reply
[+] [-] th0ma5|11 years ago|reply
[+] [-] SwellJoe|11 years ago|reply
[+] [-] thejaredhooper|11 years ago|reply
[+] [-] 6stringmerc|11 years ago|reply
[+] [-] brandonwamboldt|11 years ago|reply
Since 2012, Reddit has operated as an independent company (though Advance Publications, the parent company of Condé Nast, is a majority shareholder).
See: http://www.redditblog.com/2013/08/reddit-myth-busters_6.html...
[+] [-] shillster|11 years ago|reply
[+] [-] erroneousfunk|11 years ago|reply
[+] [-] fspacef|11 years ago|reply
[+] [-] okasaki|11 years ago|reply
[+] [-] _sword|11 years ago|reply
[deleted]
[+] [-] grabcocque|11 years ago|reply
[+] [-] seany|11 years ago|reply