
What happened to BERT and T5?

251 points| fzliu | 1 year ago |yitay.net | reply

68 comments

[+] hdhshdhshdjd|1 year ago|reply
Maybe in SOTA ml/nlp research, but in the world of building useful tools and products, BERT models are dead simple to tune, work great if you have decent training data, and most importantly are very very fast and very very cheap to run.

I have a small Swiss army collection of custom BERT fine-tunes that are equal to or better than the best LLM and execute document classification tasks in 2.4ms. Find me an LLM that can do anything in 2.4ms.

[+] deepsquirrelnet|1 year ago|reply
Latency, throughput and cost are still very important for many applications.

Also the output of a purpose-built encoder model is preferable to natural language. Not only is it unambiguous, but scores are often an important part of the result.

Last, if you need to get into some advanced methods of training, like pseudolabeling and semi-supervised learning, there are different options and outlets for utilizing real-world datasets.

That said, I’m not sure there’s much value in scaling up current encoder models. It seems like there’s already a point of diminishing returns.

[+] Tostino|1 year ago|reply
Want to share your collection with the class so we can all learn? Seems useful.
[+] ipsum2|1 year ago|reply
What does your swiss army collection do?
[+] Seattle3503|1 year ago|reply
What technique do you use to get BERT to work on longer documents?
[+] llm_trw|1 year ago|reply
Yeah, pretty much. When you have 2b files you need to trawl through, good luck using anything but a vector database. Once you do a level or two of pruning of the results, you can feed them into an LLM for final classification.
[+] janalsncm|1 year ago|reply
BERT didn’t go anywhere and I have seen fine-tuned BERT backbones everywhere. They are useful for generating embeddings to be used downstream, and small enough to be handled on consumer (pre-Ampere) hardware. One of the trends I have seen is scaling BERT down rather than up: since BERT already gave good performance, we want to be able to do it faster and cheaper. That gave rise to RoBERTa, ALBERT and DistilBERT.

T5 I have worked less with but I would be curious about its head to head performance with decoder-only models these days. My guess is the downsides from before (context window limitations) are less of a factor than they used to be.

[+] hdhshdhshdjd|1 year ago|reply
I tried some large scale translation tasks with T5 and results were iffy at best. I’m going to try the same task with the newest Mistral small models and compare. My guess is Mistral will be better.
[+] vintermann|1 year ago|reply
For people like me who gave up trying to follow arXiv ML papers 3+ years ago, articles like these are gold. I would love a YouTube channel or blog which does retrospectives on "big" papers of the last decade (those that everyone paid attention to at the time) and looks at where the ideas are today.
[+] ildon|1 year ago|reply
All you need is uninterrupted attention
[+] matusp|1 year ago|reply
BERT is still the most downloaded LM on Hugging Face with 46M downloads last month. XLM-RoBERTa has 24M and DistilBERT is at 15M. I feel like BERTs are doing okay.
[+] andy_xor_andrew|1 year ago|reply
I'm a bit embarrassed to admit, but I still don't understand decoder vs encoder vs decoder/encoder models.

Is the input/output of these models any different? Are they all just "text context goes in, scores for all tokens in the vocabulary come out" ? Is the difference only in how they achieve this output?

[+] ambrozk|1 year ago|reply
Encoder: Text tokens -> Fixed representation vector

Decoder: Fixed representation vector + N decoded text tokens -> N+1th text token

Encoder/Decoder architecture: You take some tokenized text, run an encoder on it to get a fixed representation vector, and then recursively apply the decoder to your fixed representation vector and the 0...N tokens you've already produced to produce the N+1th token.

Decoder-only architecture: You take some tokenized text, and recursively apply a decoder to the 0...N tokens you've already produced to produce the N+1th token (without ever using an encoded representation vector).

Basically, an encoder produces this intermediate output which a decoder knows how to combine with some existing output to create more output (imagine, e.g., encoding a sentence in French, and then feeding a decoder the vector representation of that sentence plus the three words you've translated so far, so that it can figure out the next word in the translation). A decoder can be made to require an intermediate context vector, or (this is how it's done in decoder-only architectures) it can be made to require only the text produced so far.
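
To make the two generation loops above concrete, here is a minimal sketch in Python. Nothing here is a real model: `toy_encode` and `toy_decoder_step` are hypothetical stand-ins that "translate" by uppercasing, and the `=>` separator is an invented prompt convention. The only point is the shape of the loop: the decoder is called repeatedly on the tokens produced so far, with or without an encoded representation.

```python
EOS = "<eos>"

def toy_encode(source):
    """Hypothetical encoder: maps the whole input to one fixed object."""
    return tuple(tok.upper() for tok in source)

def toy_decoder_step(memory, produced):
    """Hypothetical decoder call: returns token N+1 given tokens 0..N.

    With `memory` (encoder-decoder) it reads the encoded input.
    With memory=None (decoder-only) it only sees the tokens so far,
    which must already contain the prompt followed by "=>".
    """
    if memory is not None:
        i = len(produced)
        return memory[i] if i < len(memory) else EOS
    sep = produced.index("=>")
    prompt = produced[:sep]
    i = len(produced) - sep - 1
    return prompt[i].upper() if i < len(prompt) else EOS

def generate(step, memory, prefix, max_new=16):
    """Autoregressive loop: append one token at a time until EOS."""
    out = list(prefix)
    for _ in range(max_new):
        nxt = step(memory, out)
        if nxt == EOS:
            break
        out.append(nxt)
    return out

# Encoder-decoder: encode once, then decode from an empty output.
memory = toy_encode(["le", "chat"])
enc_dec_output = generate(toy_decoder_step, memory, [])

# Decoder-only: the input is just the start of the sequence being extended.
dec_only_output = generate(toy_decoder_step, None, ["le", "chat", "=>"])

print(enc_dec_output)
print(dec_only_output)
```

In the decoder-only case the prompt and the output live in the same token sequence, which is why those models are driven purely by prompting.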

[+] kelseyfrog|1 year ago|reply
You can think of encoder/decoder models as specifically addressing the translation problem. They are also known as sequence-to-sequence models.

Take the task of translation. A translator needs to keep in mind the original text and the translation so far in order to predict the next translated token. The original text is encoded, and the translation so far is passed into the decoder to generate the next translated token. The next token is appended to the translation and the process repeats autoregressively.

Decoder-only models use just the decoder architecture of encoder/decoders. They are prompted and generate completions autoregressively.

Encoder-only models use just the encoder architecture which you can think of similarly to embedding. A task here is, producing vectors where vector distance is related to the semantic similarity of the input documents. This can be useful for retrieval tasks among other things.

You can of course translate using just the decoder, by constructing a "please translate this from A to B, <original text>" prompt and generating tokens just using the decoder. I'll leave it to people with more expertise than I have to describe the pros and cons of these.
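
The encoder-only retrieval use mentioned above (vectors whose distance tracks similarity) can be sketched roughly like this. `toy_embed` is a hypothetical stand-in that uses plain word counts, so it only captures word overlap, not the semantic similarity a trained encoder would give; the documents and query are made up. The ranking step is the part that carries over.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an encoder: text -> sparse count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "the cat sat on the mat",
    "stock prices fell sharply today",
    "a cat chased the mouse",
]
query = "where is the cat"

# Embed the query once, then rank documents by similarity to it.
query_vec = toy_embed(query)
ranked = sorted(docs, key=lambda d: cosine(query_vec, toy_embed(d)), reverse=True)

for doc in ranked:
    print(round(cosine(query_vec, toy_embed(doc)), 3), doc)
```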

[+] mbowcut2|1 year ago|reply
The biggest difference is that when you feed a sequence into a decoder-only model, it will only attend to the current and previous tokens when computing hidden states for the current token. So the hidden states for the nth token are only based on tokens ≤n. This is where you hear the talk about "causal masking", as the attention matrix is masked to achieve this restriction. Encoder architectures, on the other hand, allow each position in the sequence to attend to every other position in the sequence.

Encoder architectures have been used for semantic analysis and feature extraction of sequences, and decoder-only architectures for generation (i.e. next token prediction).
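
As a rough illustration of the masking described above, the sketch below builds both kinds of mask and applies a row-wise softmax in plain Python. The raw scores are all zero (an assumption made so the result is easy to read), and the mask follows the usual convention that a position may attend to itself as well as earlier positions. The helper names are invented for this example.

```python
import math

def attention_mask(n, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax; disallowed positions get exactly zero weight."""
    weights = []
    for row, allowed in zip(scores, mask):
        top = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

n = 4
scores = [[0.0] * n for _ in range(n)]  # uniform raw scores, purely illustrative

causal_weights = masked_softmax(scores, attention_mask(n, causal=True))
full_weights = masked_softmax(scores, attention_mask(n, causal=False))

print("decoder-style (causal):")
for row in causal_weights:
    print(row)
print("encoder-style (bidirectional):")
for row in full_weights:
    print(row)
```

In the causal case the weight matrix is lower-triangular; in the bidirectional case every row spreads weight over the whole sequence.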

[+] chant4747|1 year ago|reply
Don't be embarrassed. This article makes the mistake of _saying_ they're going to catch the under-informed up to speed but then immediately dives all the way into the deep end.
[+] lalaland1125|1 year ago|reply
The key to understanding the difference is that transformers are attention models where tokens can "attend" to different tokens.

Encoder models allow all tokens to attend to every other token. This increases the number of connections and makes it easier for the model to reason, but requires all tokens at once to produce any output. These models generally can't generate text.

Decoder models only allow tokens to attend to previous tokens in the sequence. This decreases the number of connections, but allows the model to be run incrementally, one token at a time. This incremental processing is key to allowing the models to generate text.
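
A toy way to see why the causal restriction permits incremental processing: in the sketch below, "attention" is simplified to a plain average over the positions a token may see (an assumption, real attention uses learned weights), and tokens are just numbers. With the causal rule, appending a token leaves earlier outputs untouched, so they can be computed once and reused. With the bidirectional rule, every output changes.

```python
def mix(values, causal):
    """Uniform-attention toy: each position averages the values it may see."""
    out = []
    for i in range(len(values)):
        visible = values[: i + 1] if causal else values
        out.append(sum(visible) / len(visible))
    return out

def num_connections(n, causal):
    """Number of (query, key) pairs that are allowed to interact."""
    return n * (n + 1) // 2 if causal else n * n

prefix = [1.0, 3.0, 5.0]
longer = prefix + [7.0]  # same sequence with one more token appended

causal_before, causal_after = mix(prefix, True), mix(longer, True)
full_before, full_after = mix(prefix, False), mix(longer, False)

print("causal:", causal_before, "->", causal_after)
print("full:  ", full_before, "->", full_after)
print("connections for n=4:", num_connections(4, True), "vs", num_connections(4, False))
```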

[+] lalaland1125|1 year ago|reply
I think the big reason why BERT and T5 have fallen out of favor is the lack of zero shot (or few shot) ability.

When you have hundreds or thousands of examples, BERT works great. But that is very restricting.

[+] jerrygenser|1 year ago|reply
Yes, but you can use an LLM to label data and then train a BERT model, which then costs a small fraction of the time and money to run compared with the original LLM.
[+] deepsquirrelnet|1 year ago|reply
They are there, you just have to look. Tasksource, NuNER, Flan, T0. There’s not a lot, but still at least a few good zero shot models in both architectures.
[+] visarga|1 year ago|reply
It's because you need to mess with embeddings or even train new heads on top of a network to use it. LLMs are just tokens-in, tokens-out: they don't classify with a softmax over classes, they softmax over vocabulary tokens. LLMs are more convenient.
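
A small sketch of the distinction drawn above, with made-up numbers throughout: a classification head produces one logit per class, while an LLM produces one logit per vocabulary token, and a class has to be read off the next-token distribution. The five-word vocabulary, the logits, and the `label_tokens` mapping are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Encoder + trained head: the output space is exactly the set of classes.
class_names = ["spam", "not_spam"]
class_logits = [2.0, 0.5]  # illustrative values
class_probs = dict(zip(class_names, softmax(class_logits)))

# LLM: the output space is the whole vocabulary.
vocab = ["the", "spam", "ham", "yes", "no"]
vocab_logits = [0.1, 2.0, 0.5, -1.0, -1.0]  # illustrative values
token_probs = dict(zip(vocab, softmax(vocab_logits)))

# To get class scores, pick one token per class and renormalize over them.
label_tokens = {"spam": "spam", "not_spam": "ham"}
label_mass = sum(token_probs[tok] for tok in label_tokens.values())
llm_class_probs = {c: token_probs[tok] / label_mass for c, tok in label_tokens.items()}

print(class_probs)
print(llm_class_probs)
print("probability mass spent on non-label tokens:", 1.0 - label_mass)
```

The head needs labeled data to train, but its scores are defined over the classes only; the LLM needs no new head, but some probability always lands on tokens that are not valid labels.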
[+] minimaxir|1 year ago|reply
What happened is that "transformers go whrrrrrr." (yes, that's the academic term)

In the end, LLMs using causal language modeling or masked language modeling learn to best solve their objectives by creating an efficient global model of language patterns: CLM is actually a harder problem to solve since MLM can leak information through surrounding context, and with transformer scaling law research post-BERT/GPT it's not a surprise CLM won out in the long run.

[+] k8si|1 year ago|reply
I believe many high-quality embedding models are still based on BERT, even recent ones, so I don't think it's entirely fair to characterize it as "deprecated".
[+] htrp|1 year ago|reply
Feels like large language models sucked all the air out of the room because it was a lot easier to scale compute and data, and after RoBERTa, no one was willing to continue exploring.
[+] nshm|1 year ago|reply
No, there are mathematical reasons LLMs are better. They are trained with a multi-objective loss (coding skills, translation skills, etc.), so they understand the world much better than MLM. The original post discusses that, but with more words and points than necessary.
[+] riku_iki|1 year ago|reply
T5 is an LLM, I think the first one of them.
[+] jszymborski|1 year ago|reply
> It is also worth to note that, generally speaking, an Encoder-Decoders of 2N parameters has the same compute cost as a decoder-only model of N parameters which gives it a different FLOP to parameter count ratio.

Can someone explain this to me? I'm not sure how the compute costs are the same between the 2N and N nets.

[+] phillypham|1 year ago|reply
You can break your sequence into two parts. One part goes through the encoder and the other goes through the decoder, so each token only goes through one transformer stack.
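
The arithmetic behind this answer can be sketched with the common rule of thumb of about 2 FLOPs per parameter per token for a forward pass. Assumptions, all mine: the 2N-parameter encoder-decoder splits evenly into an N-parameter encoder and an N-parameter decoder, attention-score and cross-attention costs are ignored, and the token counts are arbitrary.

```python
def forward_flops(params, tokens):
    """Rule-of-thumb forward cost: ~2 FLOPs per parameter per token."""
    return 2 * params * tokens

N = 1_000_000_000  # parameters in one transformer stack (illustrative)
input_tokens, output_tokens = 300, 200

# Decoder-only, N parameters: every token passes through the single stack.
decoder_only = forward_flops(N, input_tokens + output_tokens)

# Encoder-decoder, 2N parameters: inputs pass through the encoder only,
# outputs through the decoder only, so each token touches N parameters.
encoder_decoder = forward_flops(N, input_tokens) + forward_flops(N, output_tokens)

print(decoder_only, encoder_decoder)
```

Under these simplifications the two totals match for any split of the sequence, because each token still passes through only N parameters.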
[+] bugglebeetle|1 year ago|reply
Wasn’t there a recent paper that demonstrated BERT models are still competitive or beat LLMs in many tasks?
[+] IAmBurger|1 year ago|reply
IMO GenAI gets all the hype, but in the industry, the robustness (e.g., does not hallucinate) of extractive models is much appreciated.
[+] GaggiX|1 year ago|reply
>If BERT worked so well, why not scale it?

I mean, the scaling already happened in 2019 with RoBERTa, my guess is that these models are already good enough at what they need to do (creating meaningful text embeddings), and making them extremely large wasn't feasible for deployment.

[+] PaulHoule|1 year ago|reply
For text classification/clustering/retrieval I am pretty happy with BERT-family models. It's only in the last few months that I've seen better models come out that are practical (i.e. you don't have to sell all your children to OpenAI to afford them).
[+] iandanforth|1 year ago|reply
nit: I find the writing in this post very distracting. (Grammar and style pet peeves)

Luckily, it is now trivial to drop the post into Claude and say "Re-write this without <list of things that bother me>"

So, just in case you also felt like you were driving over a road filled with potholes trying to read this post, don't just click away, have your handy LLM take a pass at it. There's good stuff to be found.