Admittedly I more or less skimmed and plan on going back over this tomorrow, but I don't see how these vectors are actually created. I get that I could use your LLM tool or whatever, but that seems unsatisfactory. How is the sausage made? (Or if that's explained, can someone point me at the right place to look?)
n2d4|2 years ago
Now, on how to get a compression vector from an LLM, simplified: most ML models are built from different layers, executed one after another. Some layers are bigger, some smaller, but each has a defined input and output size. If a layer's input size is smaller than the model's input size, then (lossy) compression must have happened to get there. So you just run the LLM on whatever you want to embed and take the activations at the smallest layer's input, and that's your embedding vector.
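A toy sketch of that idea (not a real LLM; the weights and sizes here are made up for illustration): a two-layer network whose hidden layer is smaller than its input, so the hidden activations act as a lossy compressed representation of the input.

```python
# Toy bottleneck network: 4-dim input -> 2-dim hidden -> 4-dim output.
# The 2-dim hidden activation is the "embedding vector" in the sense above.

def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def relu(v):
    return [max(0.0, x) for x in v]

# Hypothetical fixed weights (a real model would learn these).
W1 = [[0.5, -0.2, 0.1, 0.3],
      [0.0, 0.4, -0.1, 0.2]]
W2 = [[0.3, 0.1], [-0.2, 0.5], [0.4, 0.0], [0.1, -0.3]]

def embed(x):
    # Activation at the smallest layer = the embedding.
    return relu(matvec(W1, x))

def forward(x):
    # The full model reconstructs/predicts from the bottleneck.
    return matvec(W2, embed(x))

vec = embed([1.0, 0.0, 2.0, 0.5])
print(len(vec))  # prints 2: a 2-dim embedding of a 4-dim input
```

With a real LLM the idea is the same, just with far bigger matrices: you run the forward pass and read off an intermediate layer's activations.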
Not every compression vector makes for good semantic embeddings (which require that two similar phrases sit next to each other in the embedding space), but because of how ML models work, this tends to be the case empirically.
optimalsolver|2 years ago
Can this be used to compress non-text sequences such as byte strings?
Olreich|2 years ago
Once all the words have vectors, you can assume there's meaning in there and move on to doing math on those vectors against each other to find interesting correlations. It looks like the scoring for the initial training is based on making the vectors computable in various ways, so you could likely come up with a comparison criterion different from the ones the papers use and get a more useful vectorization for your own purposes. Cosine similarity seems good enough for most things, though.
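For concreteness, cosine similarity is just the dot product of two vectors divided by the product of their lengths; a minimal stdlib-only version:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); 1.0 = same direction, 0.0 = orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

With real embeddings you'd compare, say, the vectors for two phrases and treat a score near 1 as "semantically close".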
PeterisP|2 years ago
In general, the core task for the various "LLM tools" involves prediction of a hidden word, trained on very large quantities of real text - thus also mirroring whatever structure (linguistic, syntactic, semantic, factual, social bias, etc) exists there.
If you want to see how the sausage is made and look at the actual algorithms, the two key approaches to read up on would probably be Mikolov's word2vec (https://arxiv.org/abs/1301.3781), with the CBOW (Continuous Bag of Words) and Continuous Skip-Gram models, which are based on relatively simple math optimization, and then BERT (https://arxiv.org/abs/1810.04805), which does a conceptually similar thing but with a large neural network that can learn more from the same data. For both, you can either read the original papers or look up blog posts or videos that explain them; different people have different preferences on how readable academic papers are.
lelanthran|2 years ago