> The 2D map analogy was a nice stepping stone for building intuition but now we need to cast it aside, because embeddings operate in hundreds or thousands of dimensions. It’s impossible for us lowly 3-dimensional creatures to visualize what “distance” looks like in 1000 dimensions. Also, we don’t know what each dimension represents, hence the section heading “Very weird multi-dimensional space”.5 One dimension might represent something close to color. The king - man + woman ≈ queen anecdote suggests that these models contain a dimension with some notion of gender. And so on. Well Dude, we just don’t know.nit. This suggests that the model contains a direction with some notion of gender, not a dimension. Direction and dimension appear to be inextricably linked by definition, but with some handwavy maths, you find that the number of nearly orthogonal dimensions within n dimensional space is exponential with regards to n. This helps explain why spaces on the order of 1k dimensions can "fit" billions of concepts.
PaulHoule|9 months ago
In
https://nlp.stanford.edu/projects/glove/
there are a number of graphs where they have about N=20 points that seem to fall in "the right place" but there are a lot of dimensions involved and with 50 dimensions to play with you can always find a projection that makes the 20 points fall exactly where you want them fall. If you try experiments with N>100 words you go endlessly in circles and produce the kind of inconclusively negative results that people don't publish.
The BERT-like and other transformer embeddings far outperform word vectors because they can take into account the context of the word. For instance you can't really build a "part of speech" classifier that can tell you "red" is an adjective because it is also a noun, but give it the context and you can.
In the context of full text search, bringing in synonyms is a mixed bag because a word might have 2 or 3 meanings and the the irrelevant synonyms are... irrelevant and will bring in irrelevant documents. Modern embeddings that recognize context not only bring in synonyms but the will suppress usages of the word with different meanings, something the IR community has tried to figure out for about 50 years.
yorwba|9 months ago
While it would certainly have been possible to choose a projection where the two groups of words are linearly separable, that isn't even the case for https://nlp.stanford.edu/projects/glove/images/man_woman.jpg : "woman" is inside the "nephew"-"man"-"earl" triangle, so there is no way to draw a line neatly dividing the masculine from the feminine words. But I think the graph wasn't intended to show individual words classified by gender, but rather to demonstrate that in pairs of related words, the difference between the feminine and masculine word vectors points in a consistent direction.
Of course that is hardly useful for anything (if you could compare unrelated words, at least you would've been able to use it to sort lists...) but I don't think the GloVe authors can be accused of having created unrealistic graphs when their graph actually very realistically shows a situation where the kind of simple linear classifier that people would've wanted doesn't exist.
minimaxir|9 months ago
In addition to being able to utilize attention mechanisms, modern embedding models use a form of tokenization such as BPE which a) includes punctuation which is incredibly important for extracting semantic meaning and b) includes case, without as much memory requirements as a cased model.
The original BERT used an uncased, SentencePiece tokenizer which is out of date nowadays.
manmal|9 months ago
philipwhiuk|9 months ago
Ramsey theory (or 'the Woolworths store alignment hypothesis')
kaycebasques|9 months ago
I think your comment is also clicking for me now because I previously did not really understand how cosine similarity worked, but then watched videos like this and understand it better now: https://youtu.be/e9U0QAFbfLI
I will eventually update the post to correct this inaccuracy, thank you for improving my own wetware's conceptual model of embeddings
manmal|9 months ago
It’s the first in a series of three that I can very highly recommend.
> there's no way that there's a 1-to-1 correspondence between concepts and dimensions.
I don’t know about that! Once you go very high dimensional, there is a lot of direction vectors that are almost perfectly perpendicular to each other (meaning they can cleanly encode a trait). Maybe they don’t even need to be perfectly perpendicular, the dot product just needs to be very close to zero.
OJFord|9 months ago
So the distinction between a direction and a dimension expressing 'gender' is that maybe gender isn't 'important' (or I guess high-information-density) enough to be an entire dimension, but rather is expressed by a linear combination of two (or more) yet more abstract dimensions.
benatkin|9 months ago
This is maybe showing some age as well, or maybe not. It seems that text generation will soon be writing top tier technical docs - the research done on the problem with sycophancy will likely result something significantly better than what LLMs had before the regression to sycophancy. Either way, I take "having the biggest impact on technical writing" to mean in the near term. If having great search and organization tools (ambient findability and such) is going to steal the thunder from LLMs writing really good technical docs, it's going to need to happen fast.
aaronblohowiak|9 months ago
nit within a nit: I believe you intended to write "nearly orthogonal directions within n dimensional space" which is important as you are distinguishing direction from dimension in your post.
tyho|9 months ago
ohxh|9 months ago
I always wondered: if we want to preserve distances between a billion points within 10%, that would mean we need ~18k dimensions. 1% would be 1.8m. Is there a stronger version of the lemma for points that are well spread out? Or are embeddings really just fine with low precision for the distance?
[1] https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_...
gweinberg|9 months ago
otabdeveloper4|9 months ago
You can test this hypothesis with some clever LLM prompting. When I did this I got "male monarch" for "king" but "British ruler" for "queen".
Oops!
pletnes|9 months ago
rdtsc|9 months ago
nit for the nit (micro nit!): Is it meant to be "a number of nearly orthogonal directions within n dimensional space"? Otherwise n dimensional space will have just n dimensions.
kaycebasques|9 months ago
rahimnathwani|9 months ago
https://transformer-circuits.pub/2022/toy_model/index.html
drc500free|9 months ago
emaro|9 months ago
daxfohl|9 months ago
daxfohl|9 months ago
osigurdson|9 months ago
minimaxir|9 months ago
aswanson|9 months ago
pyinstallwoes|9 months ago
alok-g|9 months ago
>> nit. This suggests that the model contains a direction with some notion of gender ...
In fact, it is likely even more restrictive ...
Even if the said vector arithmetic were to be (approximately) honored by the gender-specific words, it only means there's a specific vector (with a specific direction and magnitude) for such gender translation. 'Woman' + 'king - man' goes to 'queen, however, p * ('king - man') with p being significantly different from one may be a different relation altogether.
The meaning of the vector 'King' - 'man' may be further restricted in that the vector added to a 'Queen' need not land onto some still more royal version of a queen! The networks can learn non-linear behaviors, so the meaning of the vector could be dependent on something about the starting position too.
... unless shown otherwise via experimental data or some reasoning.