(no title)
kherud
|
1 year ago
Now that context length seems abundant for most tasks, I'm wondering why sub-word tokens are still used. I'm really curious how character-based LLMs would compare. With 2 M context, the compute bottleneck fades away. I'm not sure though what role the vocabulary size has. Maybe a large size is critical, since the embedding already contains a big chunk of the knowledge. On the other hand, using a character-based vocabulary would solve multiple problems, I think, like glitch tokens and possibly things like arithmetic and rhyming capabilities. Implementing sub-word tokenizers correctly and training them seems also quite complex. On a character level this should be trivial.
AaronFriel|1 year ago
I think we may get to this point eventually, in the limit we will want multimodal LLMs that understand images and sounds down to the pixel and frequency, and it seems like for text, too, we will eventually want that as well.
thomasahle|1 year ago
Just make sure to have some big MLPs at the start too, to enrich the "tokens" with the information currently stored in the embedding tables.
yk|1 year ago
Is there a good paper (or talk) how inference looks at scale? (Kinda like ELI-using-single-gpus)
darby_eight|1 year ago
Characters are not the semantic components of words—these are syllables. Generally speaking, anyway. I've got to imagine this approach would yield higher quality results than the roman alphabet. I'm curious if this could be tested by just looking at how LLMs handle English vs Chinese.
inbetween|1 year ago
joaogui1|1 year ago
1. latency, which would get worse if you have to sequentially generate more output
2. These models very roughly turn tokens -> "average meaning" on the embedding layer, followed by attention layers that combine the meanings, and feed forward layers that match the current meaning combination to some kind of learned archetype/prototype almost. When you move from word parts to characters all of that becomes more confusing (what's the average meaning of a?) and so I don't think there are good enough techniques to learn character-based models yet
novaRom|1 year ago