liliumregale | 2 years ago
The line of research here has been going on for 30+ years, from Michael Brent's work, to Linguistica, to Morfessor, and now several approaches to incorporate morphology into tokenizers. The stand-out example is [0]. This paper doesn't seem to acknowledge any of that intellectual legacy. It's not a _research_ paper.
I'm getting a bit tired of people putting their class projects or quick engineering projects on arXiv. I don't know why they're surfacing so high on HN either.
PaulHoule | 2 years ago
You're right that what they are doing is morphological, not semantic, but it helps a lot. I would say that "Japanese Language" is a good token to apply embedding, attention, etc. to because it has a definite meaning to which the transformer can attach whatever syntax and semantics it learns in terms of activations. If BPE gives up and processes it as UTF-8 bytes, there is no clear meaning for any one of those tokens, and the model is going to have to work a lot harder.
liliumregale | 2 years ago
And yes, what they do helps on their two test tasks. I'm not disputing that. It's the fact that there's no scholarship here.
There are so many thousands of knobs to twiddle with in a model these days, and they went after one that's commonly regarded in the NLP community as the 'defect'—the only part of the model that's not end-to-end trained along with the rest. Which would be great, if they acknowledged it! But there's no citation to any tokenization literature beyond BPE or SentencePiece. The literature review is as superficial as what you could find in a blog.
There are certainly byte-level or character-level tokenizers (think about CANINE or ByT5), and we can argue back and forth about their data-hungriness or slow inference. It would be nice to give more helpful units to a Transformer, so it doesn't have to learn syllables (or even characters) all on its own. Rebracketing/incorrect segmentation is a problem! And these authors have clued into that, but so have several hundred (or thousand?) researchers they don't cite.
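To make the data-hungriness concrete, here's a toy sketch in plain Python (no tokenizer library assumed) of why byte- or character-level units push more work onto the model:

```python
# Toy sketch (plain Python, no tokenizer library assumed) of the
# byte-level trade-off: the same text becomes a much longer sequence
# of individually meaningless units that the model must learn to compose.
text = "日本語"  # three characters meaning "Japanese language"

byte_tokens = list(text.encode("utf-8"))  # ByT5-style: raw UTF-8 bytes
char_tokens = list(text)                  # CANINE-style: characters

print(len(byte_tokens))  # 9 -- each of these characters is 3 bytes in UTF-8
print(len(char_tokens))  # 3
```

Three characters become nine byte tokens, none of which individually carries the meaning "Japanese language"; longer sequences also mean slower inference, which is exactly the trade-off being argued about.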
What I'm having trouble with is the notion that this paper uncovered some exciting, revelatory fact about tokenization. Yes, "Japanese Language" would be a reasonable semantic unit! But these authors didn't discover that fact. Nobody's questioning whether 'good tokenization is better than bad tokenization'. Tokenization has seen ongoing attention in NLP forever.
These authors tried one variant, compared it against a library default option (and nothing else), evaluated on one task, put a bit of marketing around it, and called it a day. In the NLP course I used to TA, this wouldn't even qualify as a complete final project for the course.
probably_wrong | 2 years ago
Whenever something becomes a status symbol, there will be people willing to exploit it. Perhaps arXiv should hire some volunteers to check for a minimum of quality before acceptance? (/s, in case it's not clear.)
Anecdotally, the second-worst paper I've ever read was hosted on arXiv and presented in an NLP group as a possible breakthrough. Tearing it apart in front of the person presenting it was no fun.
gliptic | 2 years ago
How could a tokenizer do anything about that unless the synonyms actually share substrings? The vector embedding is learned, not part of the tokenizer.
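A minimal sketch of that point (the vocabulary and names here are hypothetical, not any real library's API): the tokenizer only maps strings to integer IDs, and any similarity between synonyms has to be learned into a separate embedding table.

```python
import numpy as np

# Toy sketch (hypothetical vocab, not any real library's API).
# A tokenizer is just a string -> integer-ID table; the semantics live in a
# separate, *learned* embedding matrix that the tokenizer never sees.
vocab = {"car": 0, "automobile": 1, "cat": 2}

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 4))  # random init, i.e. before training

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# At initialization, the synonyms "car"/"automobile" are no more similar than
# "car"/"cat": their IDs share nothing, so any similarity must be learned.
sim_syn = cosine(embeddings[vocab["car"]], embeddings[vocab["automobile"]])
sim_unrel = cosine(embeddings[vocab["car"]], embeddings[vocab["cat"]])
print(sim_syn, sim_unrel)
```

Unless the tokenizer happens to split synonyms into shared subword pieces, nothing in the tokenization step can make their representations related; that relationship comes entirely from training the embedding table.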
thomastjeffery | 2 years ago
The same problem also exists in the name, "Large Language Model". Sure, the content being modeled contains language, but the model itself is not specific or limited to language patterns. We ought to call them "Large Text Models"; or better yet, "Text Inference Models".
The words we use to describe software are very important: they inform goals and expectations. They define the context that software exists in.
I see our biggest mistake as calling these tools, "Artificial Intelligence". That title began as a goal and a category of work: it doesn't belong in the title or description of software unless that software has actually met the goal.
blatant303 | 2 years ago
liliumregale | 2 years ago
thatsadude | 2 years ago
I got downvoted when I expressed a similar opinion about MiniGPT4. I guess the HN crowd values usefulness more than real contribution.