tyho | 9 months ago

> The 2D map analogy was a nice stepping stone for building intuition but now we need to cast it aside, because embeddings operate in hundreds or thousands of dimensions. It’s impossible for us lowly 3-dimensional creatures to visualize what “distance” looks like in 1000 dimensions. Also, we don’t know what each dimension represents, hence the section heading “Very weird multi-dimensional space”. One dimension might represent something close to color. The king - man + woman ≈ queen anecdote suggests that these models contain a dimension with some notion of gender. And so on. Well Dude, we just don’t know.

nit. This suggests that the model contains a direction with some notion of gender, not a dimension. Direction and dimension appear to be inextricably linked by definition, but with some handwavy maths, you find that the number of nearly orthogonal dimensions within n dimensional space is exponential with regards to n. This helps explain why spaces on the order of 1k dimensions can "fit" billions of concepts.
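
One way to make the handwavy maths concrete is a standard concentration bound for random unit vectors plus a union bound. This is only a sketch and not necessarily the argument the commenter had in mind; the symbols n (dimension), M (number of directions) and \varepsilon (allowed |cosine| between any two directions) are introduced here for illustration.

```latex
% u, v: independent, uniformly random unit vectors in R^n; 0 < \varepsilon < 1
\Pr\bigl[\,|\langle u, v\rangle| \ge \varepsilon\,\bigr]
  \le 2\exp\!\left(-\tfrac{n\varepsilon^{2}}{2}\right)

% union bound over the M(M-1)/2 < M^2/2 pairs among M random unit vectors
\Pr\bigl[\exists\, i \ne j:\ |\langle u_i, u_j\rangle| \ge \varepsilon\bigr]
  < \frac{M^{2}}{2}\cdot 2\exp\!\left(-\tfrac{n\varepsilon^{2}}{2}\right)
  = M^{2}\exp\!\left(-\tfrac{n\varepsilon^{2}}{2}\right)

% the right-hand side is at most 1, so a good set of M directions exists, whenever
M \le \exp\!\left(\tfrac{n\varepsilon^{2}}{4}\right)
```

For a fixed tolerance \varepsilon the count grows exponentially in n. The bound is loose, though, and the exponent is small when \varepsilon is small: with n = 1000 and \varepsilon = 0.1 (roughly 84 degrees) it only guarantees about exp(2.5) ≈ 12 directions, so the exponential growth shows up at looser tolerances or larger n.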

PaulHoule|9 months ago

Note you don't see arXiv papers where somebody feeds in 1000 male gendered words into a word embedding and gets 950 correct female gendered words. Statistically it does better than chance, but word embeddings don't do very well.

In https://nlp.stanford.edu/projects/glove/ there are a number of graphs where they have about N=20 points that seem to fall in "the right place", but there are a lot of dimensions involved, and with 50 dimensions to play with you can always find a projection that makes the 20 points fall exactly where you want them to fall. If you try experiments with N>100 words you go endlessly in circles and produce the kind of inconclusively negative results that people don't publish.
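
The "you can always find a projection" claim can be checked with a small sketch. Assumptions, all mine: the 20 points are random stand-ins rather than real GloVe vectors, the target layout is an arbitrary 5x4 grid, and "projection" is read loosely as any linear map from 50 dimensions to 2 (not necessarily an orthogonal projection such as PCA). The helper `solve` is a hypothetical name for a plain Gauss-Jordan solver.

```python
import random

def solve(A, B):
    """Solve A @ Z = B for Z by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row_a[:] + row_b[:] for row_a, row_b in zip(A, B)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

random.seed(0)
N, D = 20, 50
# Stand-in "embeddings": 20 random points in 50 dimensions (not real word vectors).
X = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(N)]
# Arbitrary layout we want the points to land on: a 5x4 grid in 2D.
Y = [[float(i % 5), float(i // 5)] for i in range(N)]

# Minimum-norm linear map W = X^T (X X^T)^-1 Y, so that X @ W = Y.
G = [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]  # Gram matrix, N x N
Z = solve(G, Y)                                                     # N x 2
W = [[sum(X[i][d] * Z[i][k] for i in range(N)) for k in range(2)] for d in range(D)]

projected = [[sum(x[d] * W[d][k] for d in range(D)) for k in range(2)] for x in X]
max_error = max(abs(p - t) for pr, tr in zip(projected, Y) for p, t in zip(pr, tr))
print(f"max deviation from the chosen layout: {max_error:.2e}")
```

This works whenever the N points are linearly independent, which is the generic case for N=20 points in 50 dimensions, so a tidy 2D picture of 20 points is weak evidence on its own.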

The BERT-like and other transformer embeddings far outperform word vectors because they can take into account the context of the word. For instance you can't really build a "part of speech" classifier that can tell you "red" is an adjective because it is also a noun, but give it the context and you can.

In the context of full text search, bringing in synonyms is a mixed bag because a word might have 2 or 3 meanings and the irrelevant synonyms are... irrelevant and will bring in irrelevant documents. Modern embeddings that recognize context not only bring in synonyms but they will suppress usages of the word with different meanings, something the IR community has tried to figure out for about 50 years.

yorwba|9 months ago

> there are a lot of dimensions involved and with 50 dimensions to play with you can always find a projection that makes the 20 points fall exactly where you want them to fall.

While it would certainly have been possible to choose a projection where the two groups of words are linearly separable, that isn't even the case for https://nlp.stanford.edu/projects/glove/images/man_woman.jpg : "woman" is inside the "nephew"-"man"-"earl" triangle, so there is no way to draw a line neatly dividing the masculine from the feminine words. But I think the graph wasn't intended to show individual words classified by gender, but rather to demonstrate that in pairs of related words, the difference between the feminine and masculine word vectors points in a consistent direction.

Of course that is hardly useful for anything (if you could compare unrelated words, at least you would've been able to use it to sort lists...) but I don't think the GloVe authors can be accused of having created unrealistic graphs when their graph actually very realistically shows a situation where the kind of simple linear classifier that people would've wanted doesn't exist.

minimaxir|9 months ago

> The BERT-like and other transformer embeddings far outperform word vectors because they can take into account the context of the word.

In addition to being able to utilize attention mechanisms, modern embedding models use a form of tokenization such as BPE which a) includes punctuation, which is incredibly important for extracting semantic meaning, and b) includes case, without as large a memory requirement as a cased model.

The original BERT used an uncased, WordPiece tokenizer which is out of date nowadays.

manmal|9 months ago

Don’t the high end embedding services use a transformer with attention to compute embeddings? If so, I thought that would indeed capture the semantic meaning quite well, including the trait-is-described-by-direction-vector.

philipwhiuk|9 months ago

> In https://nlp.stanford.edu/projects/glove/ there are a number of graphs where they have about N=20 points that seem to fall in "the right place" but there are a lot of dimensions involved and with 50 dimensions to play with you can always find a projection that makes the 20 points fall exactly where you want them to fall.

Ramsey theory (or 'the Woolworths store alignment hypothesis')

kaycebasques|9 months ago

Oh yes, this makes a lot of sense, thank you for the "nit" (which doesn't feel like a nit to me, it feels like an important conceptual correction). When I was writing the post I definitely paused at that part, knowing that something was off about describing the model as having a dimension that maps to gender. As you said, since the models are general-purpose and work so well in so many domains, there's no way that there's a 1-to-1 correspondence between concepts and dimensions.

I think your comment is also clicking for me now because I previously did not really understand how cosine similarity worked, but then watched videos like this and understand it better now: https://youtu.be/e9U0QAFbfLI

I will eventually update the post to correct this inaccuracy, thank you for improving my own wetware's conceptual model of embeddings.

manmal|9 months ago

This video explains the direction-encodes-trait topic very well IMO: https://youtu.be/wjZofJX0v4M

It’s the first in a series of three that I can very highly recommend.

> there's no way that there's a 1-to-1 correspondence between concepts and dimensions.

I don’t know about that! Once you go very high dimensional, there are a lot of direction vectors that are almost perfectly perpendicular to each other (meaning they can cleanly encode a trait). Maybe they don’t even need to be perfectly perpendicular, the dot product just needs to be very close to zero.

OJFord|9 months ago

I would think of it as the whole embedding concept again on a finer grained scale: you wouldn't say the model 'has a dimension of whether the input is king', instead the embedding expresses the idea of 'king' with fewer dimensions than would be needed to cover all ideas/words/tokens like that.

So the distinction between a direction and a dimension expressing 'gender' is that maybe gender isn't 'important' (or I guess high-information-density) enough to be an entire dimension, but rather is expressed by a linear combination of two (or more) yet more abstract dimensions.

benatkin|9 months ago

> Machine learning (ML) has the potential to advance the state of the art in technical writing. No, I’m not talking about text generation models like Claude, Gemini, LLaMa, GPT, etc. The ML technology that might end up having the biggest impact on technical writing is embeddings.

This is maybe showing some age as well, or maybe not. It seems that text generation will soon be writing top tier technical docs - the research done on the problem with sycophancy will likely result in something significantly better than what LLMs had before the regression to sycophancy. Either way, I take "having the biggest impact on technical writing" to mean in the near term. If having great search and organization tools (ambient findability and such) is going to steal the thunder from LLMs writing really good technical docs, it's going to need to happen fast.

aaronblohowiak|9 months ago

>nearly orthogonal dimensions within n dimensional space

nit within a nit: I believe you intended to write "nearly orthogonal directions within n dimensional space" which is important as you are distinguishing direction from dimension in your post.

tyho|9 months ago

FFS, it's too late for me to edit. You are of course correct.

ohxh|9 months ago

Johnson-Lindenstrauss lemma [1] for anyone curious. But you can only map to k > 8(\ln N)/\varepsilon^{2} dimensions if you want to preserve distances within a factor of 1 ± \varepsilon with a JL-transform. This is tight up to a constant factor too.

I always wondered: if we want to preserve distances between a billion points within 10%, that would mean we need ~18k dimensions. 1% would be 1.8m. Is there a stronger version of the lemma for points that are well spread out? Or are embeddings really just fine with low precision for the distance?

[1] https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_...
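
As a rough check of the dimension counts in this comment, here is a minimal sketch that plugs N = one billion points into two common forms of the bound. `jl_dim_simple` uses the 8 ln(N)/ε² form quoted above; `jl_dim_refined` uses 4 ln(N)/(ε²/2 - ε³/3), which I believe is the form behind scikit-learn's `johnson_lindenstrauss_min_dim`, but treat that attribution as an assumption. Both function names are made up for this example.

```python
import math

def jl_dim_simple(n_points, eps):
    """Target dimension from the k > 8 ln(N) / eps^2 form of the lemma."""
    return math.ceil(8 * math.log(n_points) / eps**2)

def jl_dim_refined(n_points, eps):
    """Target dimension from the 4 ln(N) / (eps^2/2 - eps^3/3) form."""
    return math.ceil(4 * math.log(n_points) / (eps**2 / 2 - eps**3 / 3))

n_points = 10**9
for eps in (0.1, 0.01):
    print(
        f"eps={eps}: simple bound ~{jl_dim_simple(n_points, eps):,} dims, "
        f"refined bound ~{jl_dim_refined(n_points, eps):,} dims"
    )
```

Both forms land in the same range as the figures above: on the order of 17k to 18k dimensions for 10% distortion and roughly 1.7M for 1%. These are worst-case guarantees over all point sets, which is part of why the question about well-spread points is a fair one.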

gweinberg|9 months ago

It's not at all a nit. If one of the dimensions did indeed correspond to gender, you might find "king" and "queen" pretty much only differed in one dimension. More generally, if these dimensions individually refer to human-meaningful concepts, you can find out what these concepts are just by looking at words that pretty much differ only along one dimension.

otabdeveloper4|9 months ago

That's the layman intuition, but actual models can give surprising results.

You can test this hypothesis with some clever LLM prompting. When I did this I got "male monarch" for "king" but "British ruler" for "queen".

Oops!

pletnes|9 months ago

There’s absolutely no reason to believe that the coordinate system of the embeddings would be aligned along the directions of individual concepts, even if they were linear and one dimensional in the embedding space.
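
A toy sketch of this point, under loud assumptions: the four 3-d "embeddings" below are invented for illustration (axis 0 plays gender, axis 1 royalty), and `rotate`, `diff` and `cosine` are hypothetical helpers. Rotating the whole space leaves the geometry intact, but afterwards no single coordinate carries gender, even though the gender direction is still perfectly consistent.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rotate(v, angle):
    """Rotate a 3-d vector in the plane of axes 0 and 1, then in the plane of axes 1 and 2."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = v
    x, y = c * x - s * y, s * x + c * y
    y, z = c * y - s * z, s * y + c * z
    return [x, y, z]

def diff(emb, a, b):
    return [x - y for x, y in zip(emb[a], emb[b])]

# Hypothetical toy embeddings: axis 0 = gender, axis 1 = royalty, axis 2 = unused.
toy = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [-1.0, 0.0, 0.0],
    "king":  [1.0, 1.0, 0.0],
    "queen": [-1.0, 1.0, 0.0],
}
# The same geometry expressed in a rotated coordinate system.
rotated = {word: rotate(vec, math.radians(35)) for word, vec in toy.items()}

for name, emb in (("axis-aligned", toy), ("rotated", rotated)):
    royal = diff(emb, "king", "queen")
    common = diff(emb, "man", "woman")
    print(name, [round(x, 2) for x in royal], "cosine:", round(cosine(royal, common), 3))
```

In the axis-aligned version king and queen differ in one coordinate only; in the rotated version they differ in all three, yet king - queen is still parallel to man - woman. A trained model has no particular reason to prefer the first coordinate system over the second.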

rdtsc|9 months ago

> you find that the number of nearly orthogonal dimensions within n dimensional space is exponential with regards to n.

nit for the nit (micro nit!): Is it meant to be "a number of nearly orthogonal directions within n dimensional space"? Otherwise n dimensional space will have just n dimensions.

kaycebasques|9 months ago

Yes, confirmed here: https://news.ycombinator.com/item?id=43966937

rahimnathwani|9 months ago

Nice article related to the last point (nearly orthogonal vectors):

https://transformer-circuits.pub/2022/toy_model/index.html

drc500free|9 months ago

Is this because we can essentially treat each dimension like a binary digit, so we get 2^n directions we can encode? Or am I barking up totally the wrong tree?

emaro|9 months ago

Basically, but it gets even better. If you allow directions of 'meaning' to wiggle a little bit (say, between 89 and 91 degrees to all other directions), you get a lot more degrees of freedom. In 3 dimensions, you still only get 3 meaningful directions with that wiggle-freedom. However in high-dimensional spaces, this small additional freedom allows you to fit a lot more almost orthogonal directions than the number of strictly orthogonal ones. That means in a 1000-dimensional space you can fit a huge number >> 1000 of binary concepts.
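
A quick way to get a feel for this is to sample random directions and measure how far each pair is from 90 degrees. This is only an empirical sketch with the standard library: the sample size of 60, the seed and the function names are arbitrary choices of mine, and random sampling says nothing about the best possible packing.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Sample a direction uniformly at random by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def pairwise_angle_deviation(dim, count, seed=0):
    """Return (mean, max) deviation from 90 degrees over all pairs of random directions."""
    rng = random.Random(seed)
    vecs = [random_unit_vector(dim, rng) for _ in range(count)]
    devs = []
    for i in range(count):
        for j in range(i + 1, count):
            cos = sum(a * b for a, b in zip(vecs[i], vecs[j]))
            cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
            devs.append(abs(90.0 - math.degrees(math.acos(cos))))
    return sum(devs) / len(devs), max(devs)

for dim in (3, 1000):
    mean_dev, max_dev = pairwise_angle_deviation(dim, count=60)
    print(f"dim={dim:5d}  mean deviation={mean_dev:5.2f} deg  max deviation={max_dev:5.2f} deg")
```

In 3 dimensions the pairs are all over the place, while in 1000 dimensions every pair ends up within a few degrees of perpendicular without any effort. Note that the typical deviation for random directions in 1000 dimensions is closer to 2 degrees than 1, so a strict 89 to 91 degree window needs either more dimensions or a more careful construction than random sampling.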

daxfohl|9 months ago

Wait, but if gender was composed of say two dimensions, then there'd be no way to distinguish between "the gender is different" and "the components represented by each of those dimensions are individually different", right?

daxfohl|9 months ago

Oh, so I think what it does is take a nearly infinite-dimensional nonlinear space, and transform it into "the N dimensional linear space that best preserves approximations of linear combinations of elements". That way, any two (or more) terms can combine to make others, so there isn't such a thing as "prime" terms (similar to real dictionaries, every word is defined in terms of other words). Though some, like gender, may have strong enough correlations so as to be approximately prime in a large enough space. Is that about right?

osigurdson|9 months ago

You can't visualize it but you can certainly compute the euclidean distance. Tools like UMAP can be used to drop the dimensionality as well.
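
For instance, a minimal sketch with the standard library only (the two vectors are random stand-ins, not real embeddings; UMAP itself is not shown because it needs a third-party package):

```python
import math
import random

rng = random.Random(0)
dim = 1000
# Two random stand-in vectors in 1000 dimensions.
a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
b = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# math.dist (Python 3.8+) computes Euclidean distance for any number of dimensions.
distance = math.dist(a, b)
# The same quantity written out by hand, for comparison.
manual = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
print(f"Euclidean distance in {dim} dimensions: {distance:.3f}")
```

The formula is the same square root of summed squared differences as in 2D or 3D, just with more terms.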

minimaxir|9 months ago

Speaking of UMAP, a new update to the cuML library (https://github.com/rapidsai/cuml) released last month allows UMAP to feasibly be used on big data without shenanigans/spending a lot of money. This opens up quite a few new opportunities and I'm getting very good results with it.

aswanson|9 months ago

Any good umap links?

pyinstallwoes|9 months ago

I posit the fundamental foundation for logic is the recognition of the penis and vagina. From there follows spatial recognition and difference.

alok-g|9 months ago

>> The king - man + woman ≈ queen anecdote ...

>> nit. This suggests that the model contains a direction with some notion of gender ...

In fact, it is likely even more restrictive ...

Even if the said vector arithmetic were to be (approximately) honored by the gender-specific words, it only means there's a specific vector (with a specific direction and magnitude) for such gender translation. 'woman' + ('king' - 'man') goes to 'queen'; however, p * ('king' - 'man') with p being significantly different from one may be a different relation altogether.

The meaning of the vector 'King' - 'man' may be further restricted in that the vector added to a 'Queen' need not land onto some still more royal version of a queen! The networks can learn non-linear behaviors, so the meaning of the vector could be dependent on something about the starting position too.

... unless shown otherwise via experimental data or some reasoning.