
Show HN: Word2Bits – Quantized Word Vectors

221 points | maxlam | 8 years ago | github.com

110 comments

[+] dougb5|8 years ago|reply
This is great and echoes a recent fascination for me. One application of compact word embeddings is that if they're small enough you can ship a whole model to the user in a web app so that semantic computations can be done entirely client-side, which is useful for privacy. I did a naive 1-bit quantization a few months ago in order to fit a large-vocabulary word embedding into a smallish (< 3 MB) JSON file so that synonyms could be generated by the user without hitting a server. (See https://docs.google.com/presentation/d/1SfbfnuNSlDRtUOpN1KNc... for a description and https://prototypes.cortico.ai/search-history/ for the application -- happy to send you more details/code if you want) I never got as far as formally evaluating the quality of these vectors, though.
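For the curious, a minimal sketch of that kind of naive 1-bit quantization into a compact JSON payload (toy random vectors and a made-up vocabulary, not the actual app code):

```python
import json
import random

# Hypothetical toy embedding: in practice these would be pretrained
# vectors (e.g. 300-d GloVe); values here are random placeholders.
random.seed(0)
vocab = ["cat", "dog", "piano"]
embedding = {w: [random.uniform(-1, 1) for _ in range(300)] for w in vocab}

def quantize_1bit(vec):
    """Keep only the sign of each component, as a string of bits."""
    return "".join("1" if x >= 0 else "0" for x in vec)

compact = {w: quantize_1bit(v) for w, v in embedding.items()}
payload = json.dumps(compact)  # small enough to ship to the browser

# Each vector is now 300 bit-characters instead of ~300 JSON floats.
assert all(len(bits) == 300 for bits in compact.values())
```

Packing the bits into base64 rather than "0"/"1" characters would shrink the file a further 8x, at the cost of a decode step on the client.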
[+] visarga|8 years ago|reply
You can also replace rare words with a linear combination of 1..3 of their closest neighbours, especially for topic classification tasks. This would allow higher precision for more frequent words and still handle the rare words, reducing the number of vectors you need to send to the thin client.
[+] narrator|8 years ago|reply
Finally a use for 5g bandwidth: sending big ML models to the client.
[+] rpedela|8 years ago|reply
Can you expand on the privacy use case? I don't understand how applying a word vector model on the client improves privacy.
[+] tmalsburg2|8 years ago|reply
Interesting to see that "science" and "fiction" are so similar according to this metric. Not surprising, though, given how often they co-occur in text, but this clearly shows the limitations of the method. The unit of interest is not really character strings but lexical entries and there can be multiple lexical entries associated with one character string. For instance, the word "bank" can mean "financial institution" or "edge of a river" but both meanings would contribute to a common word vector. You could say that this only affects ambiguous words, but first ambiguous words are very common and second these words probably distort the whole vector space (by introducing "short circuits") and therefore also affect the vectors for non-ambiguous words. A long way left to go for these methods I suppose.

Edit: Have people tried to detect ambiguous words by measuring local conflict in word2vec space? E.g. "laboratory" is similar to "science" and "science" is similar to "fiction" but there is no evidence to suggest that "laboratory" should be similar to "fiction".
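The "local conflict" idea from the edit can be sketched as a triangle check over cosine similarities (hand-picked toy 2-d vectors; the 0.6 threshold is arbitrary):

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 2-d vectors chosen by hand to mimic the pattern described:
# "science" sits between "laboratory" and "fiction", which are far apart.
vecs = {
    "laboratory": [1.0, 0.0],
    "science":    [0.7, 0.7],
    "fiction":    [0.0, 1.0],
}

def conflict(a, b, pivot, vecs, thresh=0.6):
    """Flag `pivot` as potentially ambiguous: it is similar to both
    a and b, while a and b are dissimilar to each other."""
    return (cos(vecs[a], vecs[pivot]) > thresh
            and cos(vecs[b], vecs[pivot]) > thresh
            and cos(vecs[a], vecs[b]) < thresh)

assert conflict("laboratory", "fiction", "science", vecs)
```

Running this over all neighbor triples of a real embedding would be O(n^3) naively, so in practice you would restrict it to each word's top-k neighbor list.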

[+] morelandjs|8 years ago|reply
One method to partially rectify the problem you mention is to add additional preprocessing to the text to identify ngrams and add part-of-speech tags to words. For example [The, river, bank] might be restructured as [The, river bank (bigram)] where "river bank" is embedded as its own word vector.

This can also be used to disambiguate words like "hit" which can be used as a verb and a noun. You just replace "hit" with "hit|noun" and "hit|verb".
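The tag-folding step is trivial once you have a POS tagger; a sketch with hand-written (token, tag) pairs standing in for real tagger output:

```python
# Sketch of the tagging trick: given (token, tag) pairs from any POS
# tagger, fold the tag into the token so "hit" the noun and "hit" the
# verb get separate entries in the embedding vocabulary downstream.
tagged = [("The", "DET"), ("hit", "NOUN"), ("was", "VERB"), ("a", "DET"), ("hit", "NOUN")]

def fold_tags(tagged_tokens):
    return ["{}|{}".format(tok.lower(), tag.lower()) for tok, tag in tagged_tokens]

tokens = fold_tags(tagged)
assert tokens[1] == "hit|noun"
```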

[+] popinman322|8 years ago|reply
This is a known problem and work is being done on it. I'm not sure whether there's work in the specific direction that you specified in your edit, but I did manage to find this paper from 2012 via a quick search: http://www.aclweb.org/anthology/P12-1092

This paper in particular has ~700 citations right now; I didn't go through them to see the latest work but this likely doesn't represent the cutting edge.

I imagine that marginal computational cost would be the deciding factor in choosing a more advanced model here. Granted, these embeddings are often generated infrequently so I don't see the harm in extra one-off training time. It's possible that choosing between definitions adds overhead elsewhere in the system.

[+] Sukotto|8 years ago|reply
I thought I knew what a vector was, but I have no idea what any of this project means.

Would someone please explain in simple language what this is, and why it's cool?

[+] maxlam|8 years ago|reply
The idea is that you can kind of capture the "meaning" of a word with a sequence of numbers (a vector) -- and then you use these vectors for machine learning tasks to do cool stuff like answer questions!

Word2Vec is one of the algorithms to do this. Given a bunch of text (like Wikipedia) it turns words into vectors. These vectors have interesting properties like:

vector("man") - vector("woman") + vector("queen") = vector("king")

and

distance(vector("man"), vector("woman")) < distance(vector("man"), vector("cat"))

What Word2Bits does is make sure that the numbers that represent a word are limited to just 2 values (-.333 and +.333). This reduces the amount of storage the vectors take and, surprisingly, improves accuracy in some scenarios.
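A sketch of what such a 1-bit quantization function might look like (the ±1/3 scale is the value reported above; the function name is mine):

```python
def quantize_1bit(x, scale=1.0 / 3):
    """Map a real-valued component to one of two values, +scale or -scale."""
    return scale if x >= 0 else -scale

vec = [0.82, -0.05, 0.11, -1.4]
qvec = [quantize_1bit(x) for x in vec]
# Only the signs survive; storage per component drops to a single bit,
# since the shared scale needs to be stored only once per model.
```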

If you're interested in learning more, check out http://colah.github.io/posts/2014-07-NLP-RNNs-Representation... which has a lot more details about representations in deep learning!

[+] opportune|8 years ago|reply
Quick run down on word2vec and what happened here:

A big problem with NLP is understanding the semantic associations between words (or even lemmas; a lemma in this context refers to a particular meaning of a word, like a baseball bat vs. a vampire bat). For example, "run" and "sprint" are similar in meaning but convey different connotations; kings and queens are both high-level monarchs, but we need to encode the difference in gender between them for a true semantic understanding. The problem is that words themselves don't accurately convey all of this information through their string (or spoken) representations. Even dictionary definitions lack explanations of connotations or subtle differences in context, and furthermore aren't easily explained to a computer.

Word2vec is an algorithm that maps words to vectors, which can be compared to one another to analyze semantic meaning. These are typically high-dimensional (e.g. several hundred dimensions). When words are used often together, or in similar contexts, they are embedded within the vector space closer to each other. The idea is that words like "computer" and "digital" are placed closer together than "inference" and "paddling".

Usually these vector mappings are represented as something like a Python dictionary, with each key corresponding to some token or word appearing at least once (maybe more) in some set of training data. As you can imagine, these can be quite large if the vocabulary of the training data is diverse, and due to being in such a high-dimensional space, the precision of a vector entry may not need to be as high as floats (32 bits) or doubles (64 bits) encode. The authors of this paper/repo figured out a way to quantize vector entries into representations with smaller numbers of bits, which can be used to make saved word2vec models even smaller. This is really useful because a big problem with running word2vec instances is that they can take up space on the order of gigabytes. I haven't read it all yet, but it seems the big innovation might have been figuring out a way to work with these quantized word vectors in-memory without losing much performance
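The storage argument is easy to make concrete. A sketch with assumed sizes (100k-word vocabulary, 300 dimensions), packing sign bits with NumPy:

```python
import numpy as np

# Assumed sizes: a 100k-word vocabulary of 300-d float32 vectors.
vocab_size, dim = 100_000, 300
full = np.random.randn(vocab_size, dim).astype(np.float32)

# 1-bit quantization: keep signs only, then pack 8 signs per byte.
bits = (full >= 0)
packed = np.packbits(bits, axis=1)

print(full.nbytes)    # 120,000,000 bytes (~120 MB)
print(packed.nbytes)  # 3,800,000 bytes (~3.8 MB), roughly a 32x reduction
```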

Edit: seems it may not work in-memory

[+] knolan|8 years ago|reply
From the arXiv paper: multidimensional word vectors are used in natural language processing. However, they can be many gigabytes in size, making them cumbersome. This approach aims to solve that issue.

The wiki page on Word2vec seems helpful.

https://en.m.wikipedia.org/wiki/Word2vec

[+] creichenbach|8 years ago|reply
Only tangentially related, but we've recently tried to find an encoding of text that's "stable" with regard to its characters (as opposed to stable with regard to semantic meaning, as here). That is, similar words (or fragments) such as "carg" and "cargo" should yield a similar encoding.

To our surprise, we couldn't find any example or description of someone doing this before. Is this such an uncommon problem or did we just not search in the right places?

FWIW, our use case is a search tool with live result listing, so we're dealing with word fragments and would like the outcome to be somewhat stable as the user types along. We ended up rolling our own, but it has certain shortcomings, such as a hard character limit.

[+] wodenokoto|8 years ago|reply
> To our surprise, we couldn't find any example or description of someone doing this before. Is this such an uncommon problem or did we just not search in the right places?

This is one of the defining differences between Word2Vec and fastText. But fastText incorporates these character vectors as part of calculating the semantic vector, so you can't expect "carg" and "cargo" to end up being similar, but people have thought of it.

I don't think partial search is that uncommon, but I don't think it is usually solved by vector representations similar to word vectors. It seems like what you are looking for is usually accomplished by edit distance / Levenshtein distance [1].

[1] https://en.wikipedia.org/wiki/Levenshtein_distance
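For reference, a self-contained sketch of the classic dynamic-programming edit distance:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

assert levenshtein("carg", "cargo") == 1
assert levenshtein("kitten", "sitting") == 3
```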

[+] opportune|8 years ago|reply
You might be interested in phonetic algorithms for similarly sounding words:

https://en.wikipedia.org/wiki/New_York_State_Identification_...

https://en.wikipedia.org/wiki/Soundex

Your specific example would be relevant to word stemming and lemmatization. Stemming is the process of removing suffixes from words for standardization (e.g. swim, swims, swimming, swimmer could all be stemmed to just "swim") across inflections/conjugations. Lemmatization is similar but uses context. Actually, some stemmers wouldn't stem "cargo" to "carg" by default, but they definitely could be modified to exhibit that kind of behavior, or used as one step in a multistep standardization process.

https://en.wikipedia.org/wiki/Stemming

https://en.wikipedia.org/wiki/Lemmatisation

Levenshtein distance is a good metric for individual comparisons but if you're doing a lot of pairwise comparisons/want to index it's not a great option sometimes.

https://en.wikipedia.org/wiki/Levenshtein_distance

You definitely want to also look into tries/prefix trees. These use each character in the word as an O(1) index into the next level of the tree. For example, "brea" queries the top node "b", which points to the next node "r", then "e", then "a". If you next read "d", the trie would indicate that this represents a completed word fragment at the b->r->e->a->d node of the trie. If you combine this data structure with a statistical model, you can use it for things like spell-checking and autocompletion.

https://en.wikipedia.org/wiki/Trie
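A minimal trie sketch of that prefix lookup (dict-of-dicts nodes, with "$" as an assumed end-of-word marker):

```python
class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker

    def completions(self, prefix):
        """All inserted words that start with `prefix`."""
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        out = []
        def walk(n, acc):
            for k, v in n.items():
                if k == "$":
                    out.append(prefix + acc)
                else:
                    walk(v, acc + k)
        walk(node, "")
        return out

t = Trie()
for w in ["bread", "break", "brew", "cargo"]:
    t.insert(w)
assert sorted(t.completions("brea")) == ["bread", "break"]
```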

(I've edited this comment twice now to make it more clear, hopefully this is sufficient). Let me know if you'd like me to point you to any other resources. I've worked with NLP a decent amount and could even work with you guys, if interested my email is in my profile and we can arrange further conversations

[+] patelajay285|8 years ago|reply
We recently open sourced a library that does exactly this:

https://github.com/plasticityai/magnitude

For word2vec, GloVe, and fastText. It is able to generate vectors for out-of-vocabulary words through "fragments" of words, or rather subword character n-grams.

[+] isoprophlex|8 years ago|reply
You can try encoding your input in shingles maybe, and feeding vectorized shingles (instead of a one hot encoded dictionary) into the usual CBOW or skipgram thing to train the embedding.

You'd need to invest plenty of effort into shingling and vectorizing properly to get useful results, though.
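A sketch of what character shingling looks like (the padding scheme and k=3 are arbitrary choices):

```python
def shingles(text, k=3):
    """Character k-grams ("shingles") of a string, padded at the edges
    so fragments of a word share most shingles with the full word."""
    padded = "#" + text + "#"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

a = set(shingles("carg"))
b = set(shingles("cargo"))
jaccard = len(a & b) / len(a | b)
# "carg" and "cargo" share most of their shingles, so a vectorizer
# built on shingle counts gives them nearby representations.
```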

[+] nl|8 years ago|reply
FastText can generate word vectors for partial and/or unknown words based on similar chargram patterns.
[+] Radim|8 years ago|reply
Optimizations are always great :-)

This sounds similar to Maciej Kula's experiments in "Binary Latent Representations for Efficient Ranking: Empirical Assessment", https://arxiv.org/abs/1706.07479.

Maciej shared this [1]: "FWIW I ran similar experiments on recommendation tasks. My initial results were very encouraging (and similar to those reported here), but the effect disappeared entirely after adding regularization to the baseline. I would have more confidence in the results reported here if the authors ran a more principled hyperparameter search, and added regularization to their baseline."

[1] https://twitter.com/Maciej_Kula/status/976028573487239168

[+] a_bonobo|8 years ago|reply
This could be useful with DNA/protein sequences - there's dna2vec and similar approaches to mine sequences for regulatory elements and more. The problem compared with word2vec is that you have an order of magnitude more words: you don't really have words, so you split the continuous DNA sequence into overlapping k-mers and treat those as words (afaik). That means that dna2vec takes forever (>24h with one chromosome) and builds a huge vector. I wonder whether this will speed up things...
[+] rococode|8 years ago|reply
Very cool that this beats Word2Vec on SQuAD! I wonder if the current state of the art models are using the standard GloVe word vecs and might see improvement from this. Tbh I don't know too much about how those have been implemented though, haha.

I'm curious, how many values did you try for the quantization functions? Without thinking too much about it, that seems like one of the hyperparams that could have a pretty big impact on performance.

[+] maxlam|8 years ago|reply
You're definitely right, the quantization function and its values definitely have an impact on performance.

For 1 bit I think I tried something like -1/+1, -.5/+.5, -.25/+.25, -.333/+.333, and something like -10/+10 (and I think a few more). It seemed -.333/+.333 worked the best while -10/+10 did the worst on the Google analogy task (getting like 0% right). All this was tuned on 100MB of Wikipedia data.

[+] loxias|8 years ago|reply
Interesting.

So, this approach computes a "traditional" neural embedding in, say, R^50, and then "brutally" replaces each of the reals with an integer in Z_2,4,8...

I can't quite put my finger on it, but my hunch is that this naive method, while already delivering interesting results, can be drastically improved upon.

* don't use a fixed bit depth for all vector components

I guess it depends on what you're trying to optimize for -- what algebraic properties you wish to preserve for the end application:

If the end goal is: "linearity be damned, I want a stupidly fast but inaccurate way of doing approximate nearest neighbor search", then turning words into bitvectors, and using hamming distance, not(xor(a,b)), &c" works.
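That bitvector-plus-Hamming-distance trick is only a few lines in practice; a toy sketch (hand-picked 5-d vectors):

```python
# Sketch of the "stupidly fast approximate NN" idea: pack each word's
# sign bits into a Python int and compare with XOR + popcount.
def to_bits(vec):
    out = 0
    for x in vec:
        out = (out << 1) | (x >= 0)
    return out

def hamming(a, b):
    """Number of differing bits between two packed sign vectors."""
    return bin(a ^ b).count("1")

man   = to_bits([0.8, -0.2, 0.5, -0.9, 0.1])
woman = to_bits([0.7, -0.1, 0.4, -0.8, -0.3])
cat   = to_bits([-0.5, 0.9, 0.3, 0.2, -0.6])

assert hamming(man, woman) < hamming(man, cat)
```

On real hardware this maps to a handful of XOR and POPCNT instructions per 64 dimensions, which is where the speedup comes from.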

Either way, thanks for the ideas, OP. (was going to go on with some mathematical stuff which I suspect would improve upon it, but decided to either shutup, or put-up-and-credit-you.)

[+] _mhr_|8 years ago|reply
I would appreciate if you elaborated on your supposed improvements.
[+] mark_l_watson|8 years ago|reply
Very nice, thanks for creating this!

In addition to (possibly new) hardware that supports much larger memory, compression techniques like this might allow us to start operations with “phrase embedding” or eventually even whole “sentence embedding.”

[+] rspeer|8 years ago|reply
It's interesting and slightly uncomfortable that the illustration for similar words uses the word "man" as an example, given the gender biases that result from learning word vectors solely from word distributions.

To explain, although I’m sure the author himself is familiar with the issue: For any word that is disproportionately associated with one gender in the corpus, the model will learn that gender difference as part of the representation of the word, pretty much baking it into all applications. [1] It's all fun and games when this helps you find the difference between "king" and "queen", but it becomes a problem when the same difference appears between "genius" and "beautiful".

I haven't evaluated these vectors for built-in biases, but I assume they would have similar problems to the pre-computed word2vec and GloVe embeddings. (If they don't -- if quantization is a natural way to counteract bias -- then that's an awesome result! But an unlikely one.)

To the author: I don’t mean this to sound like an accusation that you haven’t done this yet; I know that short papers can’t tell two stories at the same time. But the next step is pretty clear, right? How do these vectors measure on the Word Embedding Association Test for implicit bias? Does Bolukbasi’s de-biasing method, or an analogue of it, work on quantized embeddings?

[1] Bolukbasi et al., "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings." https://arxiv.org/abs/1607.06520

[+] BjoernKW|8 years ago|reply
In modern linguistics grammar is used descriptively rather than prescriptively, i.e. it's about describing how speakers actually use language rather than being about telling them how to use language 'properly'.

If there's gender bias in certain words it might be interesting to point that out but it's not the linguists' job to 'de-bias' the grammar and the underlying model.

There's an undeniable gender bias in certain words (men probably are less frequently referred to as 'beautiful' than women while on the other hand 'genius' likely is more often used when referring to men). Glossing over that by smoothing models not only misrepresents how speakers use language but probably doesn't really help the cause either.

If the corpus used displays a disproportionate association of certain words with one gender this could just mean that the corpus is insufficient for representing a language in general (as opposed to just a particular register or sociolect), which is a common problem in computational linguistics, not just when it comes to gender biases.

[+] fvdessen|8 years ago|reply
Personally I am rather uncomfortable with the idea of 'correcting' the learning process to have the machine output what we think it should instead of what it objectively learns from the dataset. It is trading a bias against an ideal world with a bias against reality.
[+] maxlam|8 years ago|reply
Interesting, definitely need to think about debiasing. Seems like it won't really work straight out of the box since it'd destroy the 1 bit-ness of the vectors.

Though if only a few of the vectors are de-biased then you can still save a lot of space since all the other vectors are still represented using 2 numbers (while the de-biased vectors are represented using the full range of 32 bit numbers).

[+] hnuser1234|8 years ago|reply
One of the closest related words is lady, and then effeminate, so it's quite unclear exactly what there is to be upset about? In fact it appears it's saying men are related to being effeminate, which should get you excited.
[+] andreyk|8 years ago|reply
Very cool! I like the visualizations a lot. Did you try to get an interpretation for what each quantized vector dimension means (have just skimmed, not read)?

Also, I am curious why you chose to go straight to publishing on arXiv? I am actually also in CS224N right now and have a project my collaborator and I feel is publication worthy, but our plan is to go the normal route of submitting to a conference and only putting it on arXiv after the review process (though our code is open source, so now that I think about it maybe that's not that useful a plan...).

[+] maxlam|8 years ago|reply
Definitely tried to figure out if the dimensions mean anything -- as far as I can tell they don't really mean much :(
[+] malcook|8 years ago|reply
Is there an explanation for why almost all the words most distant from "man" are either uppercase or otherwise lexically anomalous? Is it largely driven by their being low frequency in the corpus?
[+] Jack000|8 years ago|reply
can this be done with fasttext as well?

Word2bits is definitely great for memory-constrained applications but for server use memory isn't as much a constraint (there's a direct word -> vector relationship so you can just put it in a database)

it would be amazing to combine this with fasttext's ability to generate vectors for out-of-vocab words.

[+] Dawny33|8 years ago|reply
The data engineer in me thanks you. Great work.

The paper would be a great weekend read.

[+] wodenokoto|8 years ago|reply
Can someone explain how to read the graph?
[+] iampims|8 years ago|reply
From the paper

> Figure 2 shows a visualisation of 800 dimensional 1 bit word vectors trained on English Wikipedia (2017). The top 100 closest and furthest word vectors to the target word vector are plotted. Distance is measured by dot product; every 5 word vectors are labelled. A turquoise line separates the 100 closest vectors to the target word from the 100 furthest vectors (labelled “...”). We see that there are qualitative similarities between word vectors whose words are related to each other.

[+] maxlam|8 years ago|reply
Yeah, I should definitely put more detail in the writeup -- thanks for the feedback!

What's happening with figure 1a (epochs vs google accuracy) is that as you train for more epochs the full precision loss continues to decrease (dotted red line) but accuracy also starts decreasing (solid red line). This indicates overfitting (since you'd expect accuracy to increase if loss decreases). The blue lines (quantized training with 1 bit) do not show this which suggests that quantized training seems to act as a form of regularization.

Figure 1b is pretty similar, except on the x axis we have vector dimension. As you increase vector dimension, full precision loss decreases, yet after a certain point full precision accuracy decreases as well. I took this to mean that word2vec training was overfitting with respect to vector dimension.

[+] newman8r|8 years ago|reply
interesting that 'artists' is a furthest neighbor of 'man' in the example
[+] maxlam|8 years ago|reply
This might be because "Artist" has an uppercase "A" -- I trained all the word vectors to be case sensitive so "Artist" is not the same as "artist" (which should be closer to "man" than "Artist")