top | item 35730748


liliumregale | 2 years ago

I'm going to add a contrarian take here: this preprint is not a research paper. While it's nice to see an improvement on their one task, this is not "semantically" driven tokenization; it's morphologically driven. To be semantically driven, it would be reasonable to expect synonyms to have similar representations. I got really excited by the title, and the content is a let-down.

The line of research here has been going on for 30+ years, from Michael Brent's work, to Linguistica, to Morfessor, and now several approaches to incorporate morphology into tokenizers. The stand-out example is [0]. This paper doesn't seem to acknowledge any of that intellectual legacy. It's not a _research_ paper.

I'm getting a bit tired of people putting their class projects or quick engineering projects on arXiv. I don't know why they're surfacing so high on HN either.

[0]: https://aclanthology.org/2021.acl-long.279/


PaulHoule|2 years ago

If a transformer has a good "place" to assign meanings to, I think it does a pretty good job of (1) discovering similar meanings in synonyms, and (2) representing words differently based on context. That latter one is a huge advance over word embeddings, which I thought were holding progress back instead of advancing it.

You're right that what they are doing is morphological, not semantic, but it helps a lot. I would say that

   日本語
("Japanese language") is a good token to apply embedding, attention, etc. to, because it has a definite meaning to which the transformer can attach whatever syntax and semantics it learns in terms of activations. If BPE gives up and processes it as UTF-8 bytes

  e6 97 a5 e6 9c ac e8 aa 9e
there is no clear meaning for any one of those tokens, and the model is going to have to work a lot harder.
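The byte fallback above is easy to verify; a quick illustrative check in Python shows how three characters with a definite meaning become nine bytes with none:

```python
# 日本語 is 3 characters, but its UTF-8 encoding is 9 bytes.
# No single byte corresponds to a character, let alone a word, which is
# what a byte-fallback tokenizer would hand to the model.
text = "日本語"
raw = text.encode("utf-8")

print(raw.hex(" "))          # e6 97 a5 e6 9c ac e8 aa 9e
print(len(text), len(raw))   # 3 9
```
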

liliumregale|2 years ago

By your first paragraph's argument, the semantics are in the Transformer, not the tokenizer.

And yes, what they do helps on their two test tasks. I'm not disputing that. It's the fact that there's no scholarship here.

There are thousands of knobs to twiddle in a model these days, and they went after the one that's commonly regarded in the NLP community as the defect: the only part of the model that's not trained end-to-end along with the rest. Which would be great, if they acknowledged it! But there's no citation to any tokenization literature beyond BPE or SentencePiece. The literature review is as superficial as what you'd find in a blog post.

There are certainly byte-level or character-level tokenizers (think about CANINE or ByT5), and we can argue back and forth about their data-hungriness or slow inference. It would be nice to give more helpful units to a Transformer, so it doesn't have to learn syllables (or even characters) all on its own. Rebracketing/incorrect segmentation is a problem! And these authors have clued into that, but so have several hundred (or thousand?) researchers they don't cite.

What I'm having trouble with is the notion that this paper uncovered some exciting, revelatory fact about tokenization. Yes, "Japanese Language" would be a reasonable semantic unit! But these authors didn't discover that fact. Nobody's questioning whether 'good tokenization is better than bad tokenization'. Tokenization has seen ongoing attention in NLP forever.

These authors tried one variant, compared it against a library default option (and nothing else), evaluated on one task, put a bit of marketing around it, and called it a day. In the NLP course I used to TA, this wouldn't even qualify as a complete final project for the course.

probably_wrong|2 years ago

> I'm getting a bit tired of people putting their class projects or quick engineering projects on arXiv.

Whenever something becomes a status symbol, there will be people willing to exploit it. Perhaps arXiv should recruit some volunteers to check for a minimum of quality before acceptance? (/s, in case it's not clear.)

Anecdotally, the second-worst paper I've ever read was hosted on arXiv and presented in an NLP group as a possible breakthrough. Tearing it apart in front of the person presenting it was no fun.

gliptic|2 years ago

> To be semantically driven, it would be reasonable to expect that synonyms would have similar representations.

How could a tokenizer do anything about that unless the synonyms actually share substrings? The vector embedding is learned, not part of the tokenizer.
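That separation is easy to make concrete. In the toy sketch below (the vocabulary and greedy longest-match segmenter are hypothetical, for illustration only), the synonyms "big" and "large" share no substrings, so they get completely unrelated token IDs; any similarity between them can only come from the learned embedding table, which is outside the tokenizer:

```python
# Hypothetical toy subword vocabulary; real BPE vocabularies are learned
# from corpus statistics, but the point is the same.
vocab = {"big": 0, "large": 1, "larg": 2, "e": 3}

def tokenize(word: str) -> list[int]:
    """Greedy longest-match segmentation over the toy vocabulary."""
    ids, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                ids.append(vocab[word[i:j]])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {word!r}")
    return ids

print(tokenize("big"))    # [0]
print(tokenize("large"))  # [1] -- no overlap with "big" at the ID level
```

Whatever makes token 0 and token 1 end up near each other in vector space happens in training, not here.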

thomastjeffery|2 years ago

It couldn't, which is why it's a good idea to avoid the word "semantic".

The same problem also exists in the name, "Large Language Model". Sure, the content being modeled contains language, but the model itself is not specific or limited to language patterns. We ought to call them "Large Text Models"; or better yet, "Text Inference Models".

The words we use to describe software are very important: they inform goals and expectations. They define the context that software exists in.

I see our biggest mistake as calling these tools "Artificial Intelligence". That phrase began as a goal and a category of work: it doesn't belong in the name or description of software unless that software has actually met the goal.

blatant303|2 years ago

A morpheme is the smallest *meaningful* unit in a language though.

liliumregale|2 years ago

I was being generous: stemming is a poor man's morphology. Empirically useful (ask the IR folks), but incredibly heuristic.

thatsadude|2 years ago

> I'm getting a bit tired of people putting their class projects or quick engineering projects on arXiv.

I got downvoted when I expressed a similar opinion about MiniGPT4. I guess the HN crowd values usefulness more than real contribution.