item 43776051

2snakes | 10 months ago

I read one characterization which is that LLMs don't produce new information (except from the perspective of the user, who is learning); they reorganize old information.


barrenko | 10 months ago

Custodians of human knowledge.

docmechanic | 10 months ago

That’s only true if you tokenize words rather than characters. Character tokenization generates new content outside the training vocabulary.

selfhoster11 | 10 months ago

All major tokenisers have explicit support for encoding arbitrary byte sequences. There's usually a consecutive range of token IDs reserved for the bytes 0x00 to 0xFF, so you can encode any novel UTF-8 word or structure with it — including emoji and characters that weren't part of the model's initial training, if you show it some examples.
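The byte-fallback scheme described above can be sketched roughly as follows. This is an illustrative toy, not any real tokeniser: the token IDs, the tiny word vocabulary, and the greedy longest-match loop are all made up for the example; real BPE tokenisers merge byte pairs rather than matching whole words.

```python
# Toy byte-fallback tokeniser. Assumes (hypothetically) that ids 3..258
# are reserved for the raw bytes 0x00..0xFF, as the comment describes.
BYTE_TOKEN_BASE = 3

# A tiny made-up "learned" vocabulary of word tokens.
WORD_TOKENS = {"hello": 259, " world": 260}

def encode(text: str) -> list[int]:
    """Greedy longest-match on word tokens; unknown text falls back to bytes."""
    ids, i = [], 0
    while i < len(text):
        for word, tid in sorted(WORD_TOKENS.items(), key=lambda kv: -len(kv[0])):
            if text.startswith(word, i):
                ids.append(tid)
                i += len(word)
                break
        else:
            # No vocabulary match: emit one token per UTF-8 byte of this char.
            for b in text[i].encode("utf-8"):
                ids.append(BYTE_TOKEN_BASE + b)
            i += 1
    return ids

def decode(ids: list[int]) -> str:
    inv = {tid: w for w, tid in WORD_TOKENS.items()}
    out = bytearray()
    for tid in ids:
        if tid in inv:
            out += inv[tid].encode("utf-8")
        else:
            out.append(tid - BYTE_TOKEN_BASE)
    return out.decode("utf-8")
```

An emoji that never appears in the vocabulary still round-trips, because it is carried as four raw UTF-8 byte tokens: `decode(encode("hello 🦙"))` gives back `"hello 🦙"`.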

asdff | 10 months ago

Why stop there? Just have it spit out the state of the bits on the hardware. English seems like a serious shackle for an LLM.

emaro | 10 months ago

Kind of, but character-based tokens make it a lot harder and more expensive to learn semantics.