(no title)
gojomo | 3 months ago
So of those 3, despite the superficially "large" distances, 2 of the 3 are just as good at this particular analogy as Google's 2013 word2vec vectors, in that 'queen' is the closest word to the target, when query-words ('king', 'woman', 'man') are disqualified by rule.
But also: to really mimic the original vector-math and comparison using L2 distances, I believe you might need to leave the word-vectors unnormalized before the 'king'-'man'+'woman' calculation – to reflect that the word-vectors' varied unnormalized magnitudes may have relevant translational impact – but then ensure the comparison of the target-vector to all candidates is between unit-vectors (so that L2 distances match the rank ordering of cosine-distances). Or, just copy the original `word2vec.c` code's cosine-similarity-based calculations exactly.
Another wrinkle worth considering, for those who really care about this particular analogical-arithmetic exercise, is that some papers proposed simple changes that could make word2vec-era (shallow neural network) vectors better for that task, and the same tricks might give a lift to larger-model single-word vectors as well.
For example:
- Levy & Goldberg's "Linguistic Regularities in Sparse and Explicit Word Representations" (2014), suggesting a different vector-combination ("3CosMul")
- Mu, Bhat & Viswanath's "All-but-the-Top: Simple and Effective Postprocessing for Word Representations" (2017), suggesting recentering the space & removing some dominant components
jdthedisciple|3 months ago
> you might need to leave the word-vectors unnormalized before the 'king'-'man'+'woman' calculation – to reflect that the word-vectors' varied unnormalized magnitudes may have relevant translational impact
I believe translation should be scale-invariant, and scale should not affect rank ordering
gojomo|3 months ago
I don't believe this is true with regard to ending angles after addition steps between vectors of varying magnitudes.
Imagine just in 2D: vector A at 90° & magnitude 1.0, vector B at 0° & magnitude 0.5, and vector B' at 0° but normalized to magnitude 1.0.
The vectors (A+B) and (A+B') will be at both different magnitudes and different directions.
Thus, cossim(A,(A+B')) will be notably less than cossim(A,(A+B)), and more generally, if imagining the whole unit circles as filled with candidate nearest-neighbors, (A+B) and (A+B') may have notably different ranked lists of cosine-similarity nearest-neighbors.