x1000|2 years ago
1) An English dictionary as input.
2) A wiki page listing words that start with "app" as input.
3) Other alphabetically sorted pieces of text.
4) Elementary school spelling homework.
5) Papers on glyphs, diphthongs, and other phonetic concepts.
You begin to recognize that the tokens in these lists appear near each other in this strange context. You have hardly ever seen token 11346 ("apple") and token 99015 ("appli") this close to each other before. But you see it frequently enough that you decide to nudge these two tokens' embeddings closer to one another.
Your ability to predict the next token in a sequence has improved. You have no idea why these two tokens sit close together in roughly every ten-millionth training example. Your word embeddings start to encode spelling information. Your word embeddings start to encode handwriting information. Your word embeddings start to encode phonetic information. You've never seen or heard the actual word "apple". But after enough training, your embeddings contain enough information that, if you're asked ["How do", "you", "spell", "apple"], you confidently proclaim ["a", "p", "p", "l", "e", "."] as the obvious answer.
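The "nudge embeddings closer" idea above can be sketched in a few lines. This is a toy illustration, not how real training works (real models update embeddings via gradients on a prediction loss); the 2-D vectors and the learning rate are made up:

```python
# Toy sketch of nudging co-occurring tokens' embeddings toward each other.
# The 2-D embeddings and learning rate are invented for illustration.
emb = {
    "apple": [0.9, 0.1],
    "appli": [0.5, 0.7],
}

def nudge(a, b, lr=0.1):
    """Move the embeddings of two co-occurring tokens toward each other."""
    for i in range(len(emb[a])):
        delta = emb[b][i] - emb[a][i]
        emb[a][i] += lr * delta
        emb[b][i] -= lr * delta

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(emb[a], emb[b])) ** 0.5

before = dist("apple", "appli")
for _ in range(100):          # seen together often enough...
    nudge("apple", "appli")
after = dist("apple", "appli")
assert after < before         # ...and the two embeddings end up close
```

Each nudge shrinks the gap a little; after many "co-occurrences" the two vectors are nearly identical, which is the intuition behind similar tokens getting similar embeddings.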
sriram_malhar|2 years ago
Of course, those coordinates are not the only way in which the object can be represented, but for a certain problem context, these location coordinates are useful.
Given objects A, B, C, or rather, given their coordinates, one can tell which two are closest to each other, or find the point D that completes the parallelogram formed by A, B, and C. In fact, coordinates let you do analogy tests like "A:B :: C:D". This is all standard vector algebra.
Now, imagine each word associated with a 100-dimensional vector. You can do the same thing. Amazingly, one can do things like "man:woman :: king:?" and get the answer "queen", just by treating each word as a vector and looking up the inverse mapping from vector to word. It almost feels ... intelligent!
This embedding -- each word associated with an n-D vector -- is obtained while training neural nets. In fact, there are now ready-made, pre-trained embedding approaches like word2vec.
https://www.tensorflow.org/tutorials/text/word2vec
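The "man:woman :: king:?" analogy can be demonstrated with toy vectors. The 2-D coordinates here (roughly: royalty, maleness) are invented for the sketch; real word2vec embeddings have hundreds of learned dimensions:

```python
# Toy illustration of solving "man:woman :: king:?" with vector algebra.
# The 2-D coordinates are hand-made for this sketch, not learned.
words = {
    "man":   [0.0, 1.0],
    "woman": [0.0, 0.0],
    "king":  [1.0, 1.0],
    "queen": [1.0, 0.0],
}

# king - man + woman should land near queen
target = [k - m + w for k, m, w in zip(words["king"], words["man"], words["woman"])]

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# The "inverse mapping": find the word whose vector is nearest the target
answer = min(words, key=lambda w: dist(words[w], target))
print(answer)  # -> queen
```

The nearest-neighbor lookup at the end is the "inverse mapping for vector to word" mentioned above.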
pallas_athena|2 years ago
During training, each token (or word) gets an embedding assigned.
Critically, _similar words will get similar embeddings_. And "similar" can mean semantically or (as in the example above) syntactically similar ("apple" and "appli").
And being vectors, you can do operations on them. To give the classic example: Embedding(`king`) - Embedding(`man`) + Embedding(`woman`) ≈ Embedding(`queen`).
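"Similar words get similar embeddings" is usually measured with cosine similarity. A minimal sketch, with 3-D vectors invented purely for illustration:

```python
import math

# Toy check of "similar words get similar embeddings" via cosine similarity.
# These 3-D vectors are made up for illustration, not real learned embeddings.
emb = {
    "apple":      [0.9, 0.8, 0.1],
    "appli":      [0.8, 0.9, 0.2],  # similar word -> nearby vector
    "carburetor": [0.1, 0.2, 0.9],  # unrelated word -> distant vector
}

def cosine(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_close = cosine(emb["apple"], emb["appli"])
sim_far   = cosine(emb["apple"], emb["carburetor"])
assert sim_close > sim_far
```
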
stormfather|2 years ago
When an LLM (large language model) is fed a word, it transforms that word into a vector in n-dimensional space. Imagine a two-dimensional toy version where the dimensions track weight and redness:

basketball ==> (1.0, 0.7)  # heavier, redder
baseball   ==> (0.2, 0.2)  # less heavy, less red

A real embedding has many more dimensions:

basketball -> [0.5, 0.3, 0.6, ..., 0.9]  # Here the embedding is many, many numbers
It does this because computers process numbers, not words. These numbers each represent some property of the word/concept "basketball" in a way that makes sense to the model. It learns this mapping during its training, and the humans who train these models can only guess what the learned embeddings actually represent. This is the first step an LLM takes when it processes text.
wyldfire|2 years ago
When I read the GP description of "embedding" above, I thought of the perceptron [1].
Definitely not supernatural at all. The act of making an automaton that "can perceive" feels to me like it's closer to the opposite. Taking that which might seem mystical and breaking it down into something predictable and reproducible.
[1] https://en.wikipedia.org/wiki/Perceptron
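The perceptron really is that un-mystical: a weighted sum, a threshold, and a simple error-driven update. A minimal sketch, trained here on AND (a linearly separable function, which is the case the perceptron can handle):

```python
# Minimal perceptron: weighted sum, threshold, error-driven weight updates.
def predict(weights, bias, x):
    s = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > 0 else 0

def train(samples, epochs=20, lr=0.1):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            bias += lr * error
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights, bias

# AND is linearly separable, so the perceptron learning rule converges.
AND = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train(AND)
assert all(predict(weights, bias, x) == t for x, t in AND)
```

Everything here is predictable and reproducible, which is the point: nothing supernatural, just arithmetic.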
rrrrrrrrrrrryan|2 years ago
Is it possible for the current generation of LLMs to assign confidence intervals to their responses?
That's my main qualm with ChatGPT so far: sometimes it will give you an answer, but it will be confidently incorrect.
terramex|2 years ago
> GPT-4 can also be confidently wrong in its predictions, not taking care to double-check work when it’s likely to make a mistake. Interestingly, the pre-trained model is highly calibrated (its predicted confidence in an answer generally matches the probability of being correct). However, after the post-training process, the calibration is reduced (Figure 8).
pages 10-11: https://cdn.openai.com/papers/gpt-4.pdf
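"Calibrated" in the quoted passage means: among answers given with confidence p, about a fraction p turn out to be correct. A toy sketch of that bookkeeping, with fabricated counts purely to illustrate the check:

```python
# Toy calibration check: for each confidence bucket, compare stated
# confidence to observed accuracy. The counts below are fabricated.
# confidence -> (number of answers, number correct)
buckets = {0.9: (10, 9), 0.5: (10, 5), 0.2: (10, 2)}

for conf, (n, correct) in buckets.items():
    observed = correct / n
    # Well calibrated: stated confidence matches observed accuracy.
    assert abs(observed - conf) < 1e-9
```

A miscalibrated model (like the post-trained one described in the paper) would show observed accuracies that systematically diverge from the stated confidences.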
askiiart|2 years ago
Take this with a grain of salt though, I'm far from an expert, and it's been a while since I've played around with that feature.