(no title)
thisiszilff | 1 year ago
Most fancy modern embedding strategies basically start with this and then proceed to build on top of it to reduce dimensions, represent words as vectors in their own right, pass this into some neural layer, etc.
thisiszilff | 1 year ago
Most fancy modern embedding strategies basically start with this and then proceed to build on top of it to reduce dimensions, represent words as vectors in their own right, pass this into some neural layer, etc.
light_hue_1|1 year ago
What you're describing is an idea from the 90s that was a dead end. Bag of words representations.
It has no relationship to modern methods. It's based on totally different theory (bow instead of the distributional hypothesis).
There is no conceptual or practical path from what you describe to what modern embeddings are. It's horribly misleading.
microtonal|1 year ago
There certainly is. At least there is a strong relation between bag of word representations and methods like word2vec. I am sure you know all of this, but I think it's worth expanding a bit on this, since the top-level comment describes things in a rather confusing way.
In traditional Information Retrieval, two kinds of vectors were typically used: document vectors and term vectors. If you make a |D| x |T| matrix (where |D| is the number of documents and |T| is the number of terms that occur in all documents), we can go through a corpus and note in each |T|-length row for a particular the frequency of each term in that document (frequency here means the raw counts or something like TF-IDF). Each row is a document vector, each column a term vector. The cosine similarity between two document vectors will tell you whether two documents are similar, because similar documents are likely to have similar terms. The cosine similarity between two term vectors will tell you whether two terms are similar, because similar terms tend to occur in similar documents. The top-level comment seems to have explained document vectors in a clumsy way.
Over time (we are talking 70-90s), people have found that term vectors did not really work well, because documents are often too coarse-grained as context. So, term vectors were defined as |T| x |T| matrices where if you have such a matrix C, C[i][j] contains the frequency of how often the j-th term occurs in the context of the i-th term. Since this type of matrix is not bound to documents, you can choose the context size based on the goals you have in mind. For instance, you could only count terms that are within 10 (text) distance of the occurrences of the term i.
One refinement is that rather than raw frequencies, we can use some other measure. One issue with raw frequencies is that a frequent word like the will co-occur with pretty much every word, so it's frequency in the term vector is not particularly informative, but it's large frequency will have an outsized influence on e.g. dot products. So, people would typically use pointwise mutual information (PMI) instead. It's beyond the scope of a comment to explain PMI, but intuitively you can think of the PMI of two words to mean: how much more often do the words cooccur than chance? This will result in low PMIs for e.g. PMI(information, the) but a high PMI for PMI(information, retrieval). Then it's also common practice to replace negative PMI values by zero, which leads to PPMI (positive PMI).
So, what do we have now? A |T|x|T| matrix with PPMI scores, where each row (or column) can be used as a word vector. However, it's a bit unwieldy, because the vectors are large (|T|) and typically somewhat sparse. So people started to apply dimensionality reduction, e.g. by applying Singular Value Decomposition (SVD, I'll skip the details here of how to use it for dimensionality reduction). So, suppose that we use SVD to reduce the vector dimensionality to 300, we are left with a |T|x300 matrix and we finally have dense vectors, similar to e.g. word2vec.
Now, the interesting thing is that people have found that word2vec's skipgram with negative sampling (SGNS) is implicitly vectorizing a PMI-based word-context matrix [1], exactly like the IR folks were doing before. Conversely, if you matrix-multiply the word and context embedding matrices that come out of word2vec SGNS, you get an approximation of the |T|x|T| PMI matrix (or |T|x|C| if a different vocab is used for the context).
Summarized, there is a strong conceptual relation between bag-of-word representations of old days and word2vec.
Whether it's an interesting route didactically for understanding embeddings is up for debate. It's not like the mathematics behind word2vec are complex (understanding the dot product and the logistic function goes a long way) and understanding word2vec in terms of 'neural net building blocks' makes it easier to go from word2vec to modern architectures. But in an exhaustive course about word representations, it certainly makes sense to link word embeddings to prior work in IR.
[1] https://proceedings.neurips.cc/paper_files/paper/2014/file/f...
thisiszilff|1 year ago
Learning about BoW and simple ways of convert text to fixed length vectors that can be used in ML algos clarified a whole for me, especially the fact that embeddings aren’t magic they are just a way to convert text to a fixed length vector.
BoW and tf-idf vectors are still workhorses for routine text classification tasks despite their limitations, so they aren’t really a dead end. Similarity a lot of things that follow BoW make a whole lot more sense if you think of them as addressing limitations of BoW.