
Ask HN: Why aren't we using "semantic tokenizers"/codebooks for text?

1 point | gavinray | 1 day ago

BPE tokenizes subwords efficiently, but it has zero awareness of semantic structure -- it's purely optimizing vocabulary/sequence length tradeoffs.

I read LanDiff [0], where they train a "semantic tokenizer" with codebooks that compresses 3D visual features into a 1D discrete token stream, then train an LM over those semantic tokens (~14,000× compression vs raw visual features). The results beat Sora and models 3× its size.

So why can't we do the analogous thing for text? Learn a discrete semantic codebook over spans/phrases, reason over that compressed sequence, decode back to natural language.
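To make the idea concrete, here's a minimal sketch of that span → codebook → discrete-token pipeline. Everything here is a placeholder assumption (random codebook, hash-based stand-in encoder, made-up sizes), not anything from the LanDiff paper; a real system would learn the codebook with something like VQ-VAE or residual VQ.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 64
CODEBOOK_SIZE = 512  # hypothetical vocabulary of discrete semantic codes

# In a real system this codebook would be learned (e.g. VQ-VAE / RVQ);
# here it is random just to show the quantization mechanics.
codebook = rng.normal(size=(CODEBOOK_SIZE, EMBED_DIM))

def embed_spans(spans):
    """Stand-in span encoder: hash each span to a deterministic vector.

    A real encoder would be a trained model producing semantic embeddings.
    """
    vecs = []
    for s in spans:
        local = np.random.default_rng(abs(hash(s)) % (2**32))
        vecs.append(local.normal(size=EMBED_DIM))
    return np.stack(vecs)

def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (L2)."""
    # squared distance from every vector to every code, argmin per vector
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

spans = ["the cat sat", "on the mat", "the cat sat"]
codes = quantize(embed_spans(spans), codebook)
# Identical spans map to identical codes; an LM would then model this
# short discrete code sequence instead of the raw subword tokens,
# and a decoder would map codes back to natural-language spans.
print(codes)
```

The open question is exactly the one above: the quantization step is lossy, and for text it's unclear what "nearest in meaning" should preserve.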

Is it that:

- text is already a high-density symbolic representation, so gains are marginal

- "semantic fidelity" is too hard to define for a lossy text codec

- scaling raw tokens keeps working, so nobody's motivated

- some combination of the above

I think the recent "neural codec" research (Meta's BLT, DeepMind 2024) is somewhat similar to this, just applied to raw byte/signal data?

[0] https://arxiv.org/pdf/2503.04606


No comments yet.