The cognoscenti already know this, but word vectors are a game changer for multi-class text classification. Finding the right representation of the text makes the classification task much easier.
Thanks for this, I'll have to take a look at FastText. I've been using word2vec, turning the output into a matrix and running it through a CNN, based on Yoon Kim's[0] work. I haven't had much luck on my 92-class problem, though. Maybe FastText will work better, although I think there is still a lot of room to improve my model.
I'm working on a somewhat similar problem, with binary classification on a class-imbalanced dataset, but fastText and bidirectional LSTMs both appear to work pretty terribly, even with oversampling. Is there a better alternative?
Word vectors are at the same time amazing, because they contain a huge amount of latent information, and not good enough, because they collapse a space of very high dimensionality into ~300 dimensions, so there is a limit to how finely they can discriminate between close topics. I have done a lot of experiments on classifying text into thousands of topics; sometimes word vectors work amazingly well, and other times they are really hard to use, depending on how close together the topics I want to discriminate between are.
My mind was blown when I found out how easy it was to get started with pre-trained GloVe embeddings in Keras. Took my Kaggle game up a few notches overnight.
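For anyone curious what those pre-trained files actually contain: GloVe ships as plain text, one word per line followed by its vector components, so you can peek at it without any framework. A minimal sketch with a fake two-line file (made-up numbers):

```python
import io

# Two fake lines in GloVe's plain-text format
# (real files have hundreds of thousands of lines and 50-300 floats each).
fake_glove = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

embeddings = {}
for line in fake_glove:
    word, *values = line.split()
    embeddings[word] = [float(v) for v in values]

print(embeddings["cat"])  # -> [0.4, 0.5, 0.6]
```

In Keras, a dict like this is typically used to fill one row per vocabulary word in an embedding matrix before handing it to an Embedding layer.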
This is a highly localized problem, but I've wanted to read the past couple of articles by Allison and cannot, because my org has to block gists.
You might be interested to know that Annoy is also integrated into Gensim, which lets you train, use and query word embeddings on your own data. Gensim also implements fast Doc2Vec and FastText, which are somewhat newer embedding techniques [0] :-)
A dimension-reduction technique. For word vectors, the usual use case is to take, say, a 400-dimensional vector and turn it into a 2-dimensional one that you can plot in a scatterplot. Similar things end up close together; dissimilar things end up far apart. It's kind of like principal component analysis, but quirkier.
For anyone who had to google it like me: t-SNE is t-distributed Stochastic Neighbor Embedding, and the way I understand it, it's essentially a statistical way to reduce the dimensionality of data while still preserving its structure.
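To make the dimension-reduction idea concrete: PCA, the linear cousin mentioned above, fits in a few lines. A sketch assuming numpy is available, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 synthetic "word vectors" in 50 dimensions, drawn from two clusters.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 50)),
    rng.normal(5.0, 1.0, size=(50, 50)),
])

# PCA by hand: center the data, then project onto the top-2
# right singular vectors of the centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T  # shape (100, 2): ready for a scatterplot

print(X2.shape)
```

t-SNE (e.g. `sklearn.manifold.TSNE`) replaces this linear projection with a non-linear embedding that works harder to keep neighbours next to each other.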
wenc | 8 years ago
For my problem (1000+ total classes, 1 class per input), I experimented with Naive Bayes + TF-IDF (~50% accuracy, < 1 sec training), then a Word2Vec + CNN model on GPU (~70% accuracy, 6 hrs training), and finally FastText (99% accuracy, 10 minutes training).
FastText [0] in particular is quite impressive. It is essentially a variant of Word2Vec that also supports n-grams, and there is a reference implementation in C++ with a built-in classifier that runs on the command line (no need to set up TensorFlow or anything like that).
Despite running on plain CPUs and only supporting a linear classifier, it beats GPU-trained Word2Vec CNN models in both accuracy and speed in my use cases. I later discovered a paper from the authors comparing CNNs (and other algorithms) to FastText, and their results track my experience [1].
This goes to show that while GPU-accelerated models are cool, a simpler, better-suited model can sometimes have a significantly better payoff.
[0] https://fasttext.cc/docs/en/supervised-tutorial.html
[1] https://arxiv.org/abs/1607.01759
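If anyone wants to reproduce this: the command-line classifier expects one example per line, with the class marked by a `__label__` prefix. A minimal sketch of preparing such a file (the sample data and filename here are made up):

```python
# fastText's supervised input format: "__label__<class> <text>", one per line.
samples = [
    ("sports", "the team won the championship game"),
    ("tech", "the new gpu trains models much faster"),
]

lines = [f"__label__{label} {text}" for label, text in samples]

with open("train.txt", "w") as f:
    f.write("\n".join(lines) + "\n")

print(lines[0])  # -> __label__sports the team won the championship game
```

After which something like `fasttext supervised -input train.txt -output model` trains the classifier; see [0] for the actual flags.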
bomb199 | 8 years ago
[0] https://arxiv.org/abs/1408.5882
aisofteng | 8 years ago
To clarify: fasttext does not support n-grams of words, but instead considers n-grams of characters within words.
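Concretely, fastText wraps each word in boundary markers and extracts all character n-grams (the paper uses n = 3..6, plus the whole word). A toy sketch:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of a word, with '<' and '>' boundary
    markers; the full bracketed word is kept as a feature too."""
    w = f"<{word}>"
    grams = {w}
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    return grams

print(sorted(char_ngrams("where", n_min=3, n_max=3)))
# -> ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```

A word's vector is then the sum of its n-gram vectors, which is what lets fastText produce embeddings for words it never saw in training.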
ovi256 | 8 years ago
In the "Doing bad digital humanities with color vectors" section: if you treat colors as 3D vectors, which they do, summing enough uniformly sampled vectors always gives medium browns, because that's the color at the center of the colorspace. Instead, you should model colors in a polar (spherical) space and combine the vectors there. This prevents drifting into the interior of the sphere and losing color saturation.
It's explained quite well here in the Interpolation section: http://www.inference.vc/high-dimensional-gaussian-distributi...
If you want to understand the contemporary use of word embeddings in ML, a nice simple model is explained, with full code, here: https://blog.keras.io/using-pre-trained-word-embeddings-in-a...
The original model comes from Kim 2014 (https://arxiv.org/abs/1408.5882). It's a very neat use of CNNs for language processing, instead of the more popular RNNs/LSTMs, and CNNs have the advantage of training much faster.
wodenokoto | 8 years ago
I mostly see word2vec and fastText, neither of which is a CNN or an RNN.
visarga | 8 years ago
Another problem with word vectors is that a word may have multiple senses, while vectors are just point estimates. To be correct, we would first need to find the right sense of each word in a phrase and only then assign the vector. There is research into "on-the-fly" word vectors that adapt to context, but they are much harder to use.
A third problem is out-of-vocabulary (OOV) words and words with low frequency. For OOV words, the usual solution is character or character-n-gram embeddings, which can be composed into embeddings for new words. Low-frequency words are usually just ignored (a frequency cutoff is applied).
Then there is the problem of phrases and collocations - some words go together, such as "New York" and "give up". The meaning of the phrase is different from the sum of the meanings of the component words. In these cases we need to have lists of phrases and replace them in the original text before training the vectors, so we have proper vectors for phrases.
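That replacement step can be as simple as joining known phrases with underscores before training, so each phrase gets a single vector. A toy sketch with a hand-made phrase list (real pipelines usually mine collocations statistically, e.g. by pointwise mutual information):

```python
# Hand-made phrase list; longer phrases are replaced first so that
# overlapping phrases don't clobber each other.
phrases = ["new york", "give up"]

def join_phrases(text, phrases):
    out = text.lower()
    for p in sorted(phrases, key=len, reverse=True):
        out = out.replace(p, p.replace(" ", "_"))
    return out

print(join_phrases("I might give up on New York", phrases))
# -> i might give_up on new_york
```

The embedding model then sees `new_york` as one token and learns a vector for the phrase as a whole.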
By the way, one amazing tool that goes with word vectors is the library 'annoy', which can do approximate similarity search in log time. You can do roughly 1000 lookups per second per CPU even if the database contains millions of vectors, which is pretty good. Annoy can be used to find similar articles or to make music recommendations. Another remark: my preferred word vectors are computed with Doc2VecC (a variant of doc2vec with corruption). Doc2VecC seems better at discriminating between topics, but the secret is to feed it gigabytes of text.
Playing with word vectors has taught me intuitively how it is to navigate a space of high dimensionality. It feels different than 3d-space because each point has a shortcut to other points, each point leads to hundreds of other places which might be far apart. It's like a kaleidoscope where a small change can create a very different perspective.
doc2vecC: https://github.com/mchen24/iclr2017
annoy: https://github.com/spotify/annoy
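For intuition about what annoy is speeding up: the exact version of a similarity lookup is just a linear scan over cosine similarities, which is what becomes too slow at millions of vectors. A stdlib-only sketch with made-up toy vectors:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Toy "database" of word vectors (made-up numbers).
db = {
    "cat": [1.0, 0.2, 0.0],
    "dog": [0.9, 0.3, 0.1],
    "car": [0.0, 0.1, 1.0],
}

def nearest(query, db):
    """Exact nearest neighbour by cosine similarity: O(n) per query."""
    return max(db, key=lambda w: cosine(db[w], query))

print(nearest([1.0, 0.25, 0.05], db))  # -> cat
```

Annoy answers the same kind of query approximately, using a forest of random-projection trees, which is how it gets to logarithmic lookup time.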
aisofteng | 8 years ago
There has been work on representing words not as vectors but as multimodal Gaussian distributions in order to try to deal with polysemy, such as [0], of which an implementation (which I have not tried to use) is available on GitHub at [1].
>It feels different than 3d-space because each point has a shortcut to other points, each point leads to hundreds of other places which might be far apart. It's like a kaleidoscope where a small change can create a very different perspective.
I appreciate that the author of the parent comment has found some sort of intuition, but I would caution others against trying to use the above quote to develop their own, as it is not meaningful in any rigorous sense.
[0] https://arxiv.org/abs/1704.08424
[1] https://github.com/benathi/word2gm
aglionby | 8 years ago
This is interesting, and there seems to be some debate about it (at least for compositional distributional semantic models). [0][1] seem to show that sense disambiguation helps in some contexts, while [2] shows that it doesn't in others. It isn't immediately clear who is right. I agree with you, though, that disambiguating seems likely to be helpful.
[0] https://www.aclweb.org/anthology/W13-3513
[1] https://www.aclweb.org/anthology/P16-1018
[2] https://www.aclweb.org/anthology/D10-1115
RoboTeddy | 8 years ago
Interesting! I wonder if you could, e.g., arbitrarily split a word into some number of symbols, say two, and each time you apply a training update, only update one of the symbols: perhaps initially choosing the symbol facing the greatest loss (forcing the symbols apart in vector space), and eventually switching to picking the symbol with the smallest loss (letting each settle onto its own precise meaning)?
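To make that idea concrete, here is one entirely hypothetical way the mechanics could look: keep K vectors per word and route each training update to a single one of them, chosen by how well it already matches the context.

```python
import random

random.seed(0)

DIM, K = 4, 2  # toy dimensionality; K candidate "sense" vectors per word

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# K randomly initialised sense vectors for a single word.
senses = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(K)]

def update(context_vec, lr=0.1, pick="closest"):
    """Route one training update to a single sense vector.
    pick='farthest' forces the senses apart early on;
    pick='closest' lets each settle onto its own meaning later."""
    order = sorted(range(K), key=lambda i: dot(senses[i], context_vec))
    i = order[-1] if pick == "closest" else order[0]
    senses[i] = [s + lr * (c - s) for s, c in zip(senses[i], context_vec)]
    return i
```

With two distinct kinds of context repeatedly fed in, each sense vector would (in this toy) drift toward a different one; this is only a sketch of the proposal above, not an implementation of any published method.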
stared | 8 years ago
And to get an idea of why it works, and to play with examples in your browser: http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html
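The headline demo from that post is plain vector arithmetic: king - man + woman lands nearest to queen. With toy 2-d vectors (made-up numbers, one axis roughly "royalty", the other roughly "gender") the mechanics are:

```python
# Toy 2-d "embeddings" (made-up numbers).
vecs = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
}

def analogy(a, b, c):
    """Word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    dist = lambda w: sum((p - q) ** 2 for p, q in zip(vecs[w], target))
    return min((w for w in vecs if w not in (a, b, c)), key=dist)

print(analogy("king", "man", "woman"))  # -> queen
```

Real embeddings do the same thing in a few hundred dimensions, usually with cosine similarity instead of Euclidean distance.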
sumitgt | 8 years ago
https://blog.keras.io/using-pre-trained-word-embeddings-in-a...
debt | 8 years ago
ah shit this makes me want to stand on a desk.
ZeroCool2u | 8 years ago
Allison, we know this is a lot of work for a very small group, but if you see this, a couple of us here would be super stoked if you could mirror your articles somewhere else as well!
titanomachy | 8 years ago
What's it like being a software dev in North Korea?
Radim | 8 years ago
[0] https://twitter.com/gensim_py/status/969222857246101504
patelajay285 | 8 years ago
https://github.com/plasticityai/magnitude
b_tterc_p | 8 years ago
For those who like t-SNE: check out the relatively new UMAP, which seems to be faster and better.