
Show HN: Sense2vec model trained on all 2015 Reddit comments

225 points | syllogism | 10 years ago | sense2vec.spacy.io | reply

79 comments

[+] minimaxir|10 years ago|reply
As someone who's tried to quantify the Reddit hivemind by analyzing word usage (http://minimaxir.com/2015/10/reddit-topwords/), I think the spaCy semantic approach is much more robust for the Reddit data corpus.

However, I'm not sure I agree with the use of phrase similarity as an indicator of Reddit hivemind behavior, which was discussed in the accompanying blog post. The tool is more an indicator of the writing styles of Reddit's primary demographic (male, 18-30) and of phrases that co-occur, rather than a weighting of the importance of given phrases to Reddit discussion.

[+] rockmeamedee|10 years ago|reply
Word vectors are really intriguing to me, and I have a few questions:

If you also trained a model on a different news aggregator's comments, would it be possible to "match up" the meaning vector spaces and see differences in what meanings each community ascribes to words?

Additionally, could you determine sentiments from the positions of words in the meaning space? For example, looking for underlying assumptions, like associations between the words 'black person' and 'criminal'. Or, idk, man + sex = player, but woman + sex = slut.

Would it be possible to go higher level and see how much a corpus agrees with "free markets are good" based on its word positions?

It seems like word2vec has the potential to bring Sapir-Whorf to a whole new level.

[+] syllogism|10 years ago|reply
Let's say we want to know how usage differs in one subreddit vs another. If you just train two entirely separate models, you end up with two vector spaces. The meanings within each space are entirely relative --- the absolute positions obviously aren't significant. You can try to learn a mapping, but the transform is not necessarily linear. (Interesting empirical question there...)
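If the transform does turn out to be roughly linear, you could estimate it from the shared vocabulary. A minimal sketch with numpy, assuming two word-to-vector dicts from separately trained models (all names illustrative):

    import numpy as np

    def fit_orthogonal_map(space_a, space_b, anchors):
        # Stack vectors for the anchor words shared by both spaces.
        A = np.stack([space_a[w] for w in anchors])
        B = np.stack([space_b[w] for w in anchors])
        # Orthogonal Procrustes: the rotation W minimising ||A @ W - B||
        # is U @ Vt, where A.T @ B = U S Vt.
        U, _, Vt = np.linalg.svd(A.T @ B)
        return U @ Vt

    # W = fit_orthogonal_map(space_a, space_b, shared_words)
    # space_a["combat"] @ W is then comparable to vectors in space_b.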

What you need is a 'seam' that connects the two vector spaces. I would do it like this.

Train a single model, with the words decorated by their subreddit. So you have combat:/r/gaming and combat:/r/history as different tokens. Then you have shared tokens which aren't decorated in this way.
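A rough sketch of that decoration step, using gensim's Word2Vec (the scheme is illustrative, not the actual sense2vec preprocessing):

    from gensim.models import Word2Vec

    FOCUS = {"combat"}  # words we want per-subreddit senses for

    def decorate(tokens, subreddit):
        # Tag only the words of interest; everything else stays
        # shared between subreddits and acts as the seam.
        return [t + ":" + subreddit if t in FOCUS else t for t in tokens]

    sentences = [
        decorate("the combat system is clunky".split(), "/r/gaming"),
        decorate("hand to hand combat in rome".split(), "/r/history"),
    ]
    model = Word2Vec(sentences, vector_size=100, min_count=1)
    # combat:/r/gaming and combat:/r/history now live in one
    # vector space and can be compared directly.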

[+] MasterScrat|10 years ago|reply
Now, can't you experiment with algebraic operations on those vectors, like "queen minus female plus male = king"?
[+] syllogism|10 years ago|reply
This is supported in the underlying data, but we don't expose it in the UI at the moment. We'll probably do something a bit different to help people make those queries.
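In the meantime, if you have the raw vectors, gensim will answer that sort of query directly (file name hypothetical):

    from gensim.models import KeyedVectors

    vecs = KeyedVectors.load("reddit_vectors.kv")  # hypothetical path
    # queen - female + male ~= ?
    print(vecs.most_similar(positive=["queen", "male"],
                            negative=["female"], topn=5))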
[+] andrewtbham|10 years ago|reply
This is very cool. I have played with the word2vec download a lot and have been sorta amazed. If you had a lot of money... like Google... it would be cool to train word vectors on the whole internet.

Something like this definitely makes WordNet obsolete. :-( http://wordnetweb.princeton.edu/perl/webwn

[+] ecesena|10 years ago|reply
I don't think WordNet (and similar) will ever be obsolete. From my point of view, they are a "curated" subset of the semantics you can obtain with word2vec. The latter can fail, even in very trivial cases (I discussed geography in another comment), and curated, structured knowledge bases such as WordNet help overcome the limitations of any machine learning algorithm.
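For contrast, here's the kind of hand-curated structure WordNet gives you for free via NLTK (requires nltk.download("wordnet") first):

    from nltk.corpus import wordnet as wn

    dog = wn.synsets("dog")[0]
    print(dog.definition())
    # Explicit, curated hypernyms, not learned similarities:
    print([h.name() for h in dog.hypernyms()])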
[+] RBerenguel|10 years ago|reply
I wouldn't go that far; WordNet has other interesting uses (and its Prolog version makes for a cool "query" language).
[+] 1ris|10 years ago|reply
It's impressive how good it is.

But there is still room for improvement. Searching "Haskell" leads to Clojure and C++. Yes, these are all programming languages, but of all of them I personally wouldn't have picked C++. :)

"Scheme" leads to Haskell, which is very fitting, and Prolog, which seems to fit as a "university language" as well.

For "Agda" it outputs "typeclasses" at 74%. That's a much-discussed topic, but for a "truly" semantic understanding, it should know that "Agda" and "Haskell" are in the same category and that "typeclasses" is a property that elements of this category either have or don't have.

Still, very impressive. But not the singularity yet.

[+] verroq|10 years ago|reply
I would have said C++ because the template system is a functional programming language.
[+] gradi3nt|10 years ago|reply
The first name to show up when I search 'evil man' is 'Mr Rogers'...what the hell, reddit.

(first name, not first word, had to scroll down a bit, past words like 'Nazi soldier' and 'evildoer')

[+] realusername|10 years ago|reply
I was half-expecting this, but Comcast is associated with "swastika" and "Nazi".
[+] andrewtbham|10 years ago|reply
The band names are particularly cool. It is sorta like a recommendation engine: if you like Pink Floyd, you might like Led Zeppelin.
[+] fla|10 years ago|reply
Submitting 'NSFW' (without quotes) will display mangled urls.
[+] Houshalter|10 years ago|reply
Not necessarily a bug on their end; those URLs are actually from a bot, AutoWikibot, which uses the word "NSFW" in every post, so all of its formatting is highly correlated with it: https://www.reddit.com/user/AutoWikibot

The last line (Parent commenter can toggle NSFW or delete...) has this formatting:

    ^Parent ^commenter ^can [^toggle ^NSFW](/message/compose?to=autowikibot&subject=AutoWikibot NSFW toggle&message=%2Btoggle-nsfw+ct3omf8) ^or[](#or) [^delete](/message/compose?to=autowikibot&subject=AutoWikibot Deletion&message=%2Bdelete+ct3omf8)^. ^Will ^also ^delete ^on ^comment ^score ^of ^-1 ^or ^less. ^| [^(FAQs)](/r/autowikibot/wiki/index) ^| [^Mods](/r/autowikibot/comments/1x013o/for_moderators_switches_commands_and_css/) ^| [^Call ^Me](/r/autowikibot/comments/1ux484/ask_wikibot/)
You can see how that formatting might fuck up their nice natural language processor and tokenizer.
[+] ecesena|10 years ago|reply
Very cool!

I've been working for a while on extracting "semantics" from raw text (mostly news). One of the big limits of word2vec is that the semantics are tied to word proximity in sentences, which works in some cases, but not always.

I'll give one example: travel/geography. If you query something like Italy [1], the results are other European states. But if you're looking for news about Italy, or planning a vacation to Italy, or searching for some Italian food... or anything related to Italy itself, you probably don't expect "Spain" to be the first result.

It would be nice to have some easy way in word2vec to define domains and their relationship to word proximity in sentences, to overcome situations like this one.

[1] https://sense2vec.spacy.io/?Italy%7CGPE

[+] nl|10 years ago|reply
Don't think of Word2Vec (etc) as a search - it's a different thing.

It's more like a recommender: If I like Pizza in Napoli where should I go in Spain and what should I eat there?

Italy:Pizza -> Spain:?

Italy:Napoli -> Spain:?

[+] syllogism|10 years ago|reply
Well, you can't do this in the web API obviously, but there are a few ways you could. This is a good idea really -- thanks for bringing it up.

One way would be to predefine the entity types or tags that you want to get in your results. So you could ask for things like Italy that are nouns.

The other way is to use the vector space. The classic demonstration of this is the arithmetic, doing something like "Italy|GPE - *|GPE + food". My results for this have been very mixed. I wouldn't expect the query above to work.

I would think you'd have more luck specifying the query as a combination of constraints: first query for foods in some way, and then sort them by distance from Italy.
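A sketch of that combined-constraints idea in plain numpy, assuming a dict of decorated tokens like "pizza|NOUN" or "Italy|GPE" mapped to vectors (the data layout is illustrative, not the actual sense2vec API):

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def foods_near(vocab, anchor="Italy|GPE", seed="food|NOUN"):
        # First constraint: nouns reasonably similar to a food seed word.
        candidates = [w for w in vocab if w.endswith("|NOUN")
                      and cosine(vocab[w], vocab[seed]) > 0.5]
        # Second constraint: rank those candidates by similarity to Italy.
        return sorted(candidates,
                      key=lambda w: cosine(vocab[w], vocab[anchor]),
                      reverse=True)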

[+] dbunkerx|10 years ago|reply
Tried something similar a while back, only using Spark's Word2Vec implementation, and just looking at individual organizations as used in different subreddits. It is surprising how far Word2Vec can take you in deriving word similarity.

Used POS tagging in a previous post, though not with Word2Vec, since I wasn't sure whether distinguishing duck-the-verb from duck-the-noun would improve the result: verbs and nouns already tend to occupy different positions. It's certainly an interesting approach, though, and I'm wondering if going the other way might yield better results for the POS tagger as well, since the verb and noun senses would span disparate word clusters.

1) http://dbunker.github.io/2016/01/05/spark-word2vec-on-reddit...
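A sketch of what the POS-decorated variant would look like with spaCy (token scheme illustrative):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def pos_tokens(text):
        # "duck|VERB" and "duck|NOUN" become distinct vocabulary
        # items, so each gets its own vector downstream.
        return [t.lower_ + "|" + t.pos_ for t in nlp(text) if not t.is_space]

    print(pos_tokens("the duck swam across the pond"))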

[+] Grue3|10 years ago|reply
Strange, I enter "Harry Kane" and it shows me various footballers (none of which are on the same team as him). But then I enter "Jamie Vardy" and there are no results whatsoever, even though he's been HUGE on /r/soccer in 2015.
[+] syllogism|10 years ago|reply
Hmm. Interesting. I wonder whether we're missing some data. I didn't do so much to verify that. Thanks.
[+] empath75|10 years ago|reply
Oh this is fun if you look up politically charged words like 'thug' or even something that should be pretty harmless like 'girls'.
[+] Houshalter|10 years ago|reply
I don't see anything surprising about those searches. E.g. "thug" gets "gangster" which seems pretty fitting. "girls" gets "female friends", which is semantically very close.
[+] Palomides|10 years ago|reply
is it just me or does it not search when you hit enter?
[+] syllogism|10 years ago|reply
The server logs say everything's working, so I hope not! Are you still having trouble?

Sometimes I get tricked when I enter a query that has the same top result as the one that's currently displayed. Then it looks like the results haven't changed, but further down the list, they have.

[+] justinsaccount|10 years ago|reply
It's not just you, I need to click the search icon.
[+] cowsandmilk|10 years ago|reply
tried "netflix and chill", didn't find anything? That to me is the phrase of the year that is more than the sum of its parts.
[+] syllogism|10 years ago|reply
The underlying linguistics of a "phrase" here are sort of narrow. What we did is retokenize the text so that entities and basic noun phrases are merged into a single token. "Netflix and chill" is analysed as multiple tokens, so it doesn't come up as a query result.
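Roughly what that retokenisation looks like in today's spaCy (a sketch, not the exact preprocessing script we used):

    import spacy
    from spacy.util import filter_spans

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Natural language processing works well on Reddit comments.")

    # Merge entities and base noun phrases into single tokens,
    # discarding overlapping spans first.
    spans = filter_spans(list(doc.ents) + list(doc.noun_chunks))
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)

    print([t.text for t in doc])
    # A coordination like "Netflix and chill" still splits at
    # "and", which is why it never becomes a single query token.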
[+] sgarman|10 years ago|reply
I tried "rick and morty" and got the same issue. My guess is it has something to do with the term "and."
[+] IshKebab|10 years ago|reply
This is really cool. The next logical thing is to extend it to allow multiple meanings for the same word with the same type. For example "lead is a heavy element" vs "I own a dog lead". Not sure how you'd do that without explicitly giving each unique word an ID and manually annotating the training data.