item 42263763

prometheon1 | 1 year ago

I don't know if this is being done already, but couldn't we add some training data to teach the LLM how to spell? We also teach kids what each letter means and how they combine into words. Maybe we can do this with tokens as well? E.g.:

Token 145 (ar) = Token 236 (a) + Token 976 (r)

Repeat many times with different combinations and different words?
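The proposal above can be sketched mechanically: walk a tokenizer's vocabulary and, for every multi-character token, emit a training line that spells it out in terms of its single-character tokens. This is a toy illustration only; the vocabulary and token IDs below are invented (real BPE vocabularies are learned, and a character may map to several byte-level tokens).

```python
# Toy sketch of the proposed spelling-supervision data. The vocab and
# token IDs are made up for illustration, not from any real tokenizer.
toy_vocab = {"a": 236, "r": 976, "ar": 145, "t": 310, "art": 512}

def spelling_examples(vocab):
    """Emit one 'Token X = Token Y + Token Z + ...' line per multi-char token."""
    lines = []
    for text, tid in vocab.items():
        if len(text) > 1:
            # Decompose the merged token into its single-character tokens.
            parts = " + ".join(f"Token {vocab[c]} ({c})" for c in text)
            lines.append(f"Token {tid} ({text}) = {parts}")
    return lines

for line in spelling_examples(toy_vocab):
    print(line)
# → Token 145 (ar) = Token 236 (a) + Token 976 (r)
# → Token 512 (art) = Token 236 (a) + Token 976 (r) + Token 310 (t)
```

Repeating this over every merged token in the vocabulary would give the model explicit supervision tying token IDs to their letter-level composition.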

acchow | 1 year ago

> but couldn't we add some training data to teach the LLM how to spell?

Sure, but then we would lose a benchmark to measure progress of emergent behavior.

The goal is not to add one capability at a time by hand - because this doesn’t scale and we would never finish. The goal is that it picks up new capabilities automatically, all on its own.

lupire | 1 year ago

Training data is already provided by humans and certainly already includes spelling instruction, which the model is blind to because of forced tokenization. Tokenizing on words is already an arbitrary capability added by hand. It's just the wrong one. LLMs should be tokenizing by letter, but they don't, because they aren't good enough yet, so they get a massive deus ex machina (human ex machina?) of wordish tokenization.
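The blindness lupire describes can be shown with a toy comparison: with character-level tokens, a spelling question like "how many r's?" is a simple tally over symbols the model actually sees, while a word-ish split hides the letters inside opaque chunks. The two-chunk split below is a hypothetical BPE-style merge, not any real tokenizer's output.

```python
# Character-level vs word-ish tokenization of the same word.
word = "strawberry"

char_tokens = list(word)             # every letter is a visible symbol
wordish_tokens = ["straw", "berry"]  # hypothetical BPE-style split

# With character tokens, counting a letter is a direct tally:
print(char_tokens.count("r"))  # → 3

# With word-ish tokens, 'r' never appears as a separate symbol; the
# model can only answer if it has memorized each chunk's spelling.
print("r" in wordish_tokens)  # → False
```

This is why spelling tasks serve as a probe of emergent ability under word-ish tokenization: the letter identities have to be inferred rather than read off.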