Maybe in SOTA ml/nlp research, but in the world of building useful tools and products, BERT models are dead simple to tune, work great if you have decent training data, and most importantly are very very fast and very very cheap to run.
I have a small Swiss army collection of custom BERT fine-tunes that are equal to or better than the best LLM and execute document classification tasks in 2.4ms. Find me an LLM that can do anything in 2.4ms.
Latency, throughput and cost are still very important for many applications.
Also the output of a purpose-built encoder model is preferable to natural language. Not only is it unambiguous, but scores are often an important part of the result.
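To make the point about scores concrete, here is a minimal sketch of what a classification head hands back once its logits are normalized. The label names and logit values are made up for illustration; in practice the logits would come from a fine-tuned encoder.

```python
import math

def softmax(logits):
    """Turn raw class logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical labels and logits, standing in for a fine-tuned encoder's output.
LABELS = ["invoice", "contract", "resume"]
logits = [2.0, 0.5, -1.0]

# The result is a fixed set of labels with a score each, not free text to parse.
scores = dict(zip(LABELS, softmax(logits)))
best_label = max(scores, key=scores.get)
print(best_label, round(scores[best_label], 3))
```

Downstream code can threshold or rank on `scores` directly, which is harder to do reliably with a natural-language answer.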
Last, if you need to get into some advanced methods of training, like pseudolabeling and semi-supervised learning, there are different options and outlets for utilizing real-world datasets.
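For anyone who hasn't seen pseudolabeling, a rough sketch of one round is below: run the current model over unlabeled data and keep only the predictions it is confident about as extra training pairs. The `toy_predict_proba` function and the 0.9 threshold are assumptions for illustration, not anything from a specific library.

```python
def pseudo_label(predict_proba, unlabeled, threshold=0.9):
    """Keep only unlabeled examples the current model is confident about."""
    kept = []
    for text in unlabeled:
        probs = predict_proba(text)            # dict: label -> probability
        label = max(probs, key=probs.get)      # most likely label
        if probs[label] >= threshold:
            kept.append((text, label))
    return kept

# Toy stand-in for a trained classifier (hypothetical, for illustration only).
def toy_predict_proba(text):
    if "refund" in text:
        return {"complaint": 0.95, "other": 0.05}
    return {"complaint": 0.55, "other": 0.45}

unlabeled = ["I want a refund now", "hello there"]
new_training_pairs = pseudo_label(toy_predict_proba, unlabeled)
print(new_training_pairs)
```

The kept pairs would then be mixed into the labeled set and the model retrained; the threshold trades off how much new data you get against how noisy it is.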
That said, I’m not sure there’s much value in scaling up current encoder models. It seems like there’s already a point of diminishing returns.
Yeah, pretty much. When you have 2b files you need to trawl through, good luck using anything but a vector database. Once you have done a level or two of pruning on the results, you can feed them into an LLM for final classification.
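The funnel described here looks roughly like the sketch below: a cheap similarity pass over embeddings narrows the set, and only the survivors reach the expensive model. The brute-force search stands in for a real vector database, and `classify_with_llm` is a hypothetical placeholder rather than a real API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, index, k):
    """Stage 1: cheap pruning by embedding similarity (brute force here)."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

def classify_with_llm(doc_id):
    """Stage 2 placeholder: a hypothetical, expensive final classifier."""
    return f"label-for-{doc_id}"

# Tiny made-up index of (doc_id, embedding) pairs.
index = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0])]
candidates = top_k([1.0, 0.0], index, k=2)
results = {doc_id: classify_with_llm(doc_id) for doc_id in candidates}
print(candidates)
```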
BERT didn’t go anywhere and I have seen fine-tuned BERT backbones everywhere. They are useful for generating embeddings to be used downstream, and small enough to be handled on consumer (pre-Ampere) hardware. One of the trends I have seen is scaling BERT down rather than up: since BERT already gave good performance, we want to be able to do it faster and cheaper. That gave rise to ALBERT and DistilBERT (RoBERTa, by contrast, kept roughly BERT's size and improved the training recipe).
T5 I have worked less with but I would be curious about its head to head performance with decoder-only models these days. My guess is the downsides from before (context window limitations) are less of a factor than they used to be.
I tried some large scale translation tasks with T5 and results were iffy at best. I’m going to try the same task with the newest Mistral small models and compare. My guess is Mistral will be better.
For people like me who gave up trying to follow arXiv ML papers 3+ years ago, articles like these are gold. I would love a YouTube channel or blog which does retrospectives on "big" papers of the last decade (those that everyone paid attention to at the time) and looks at where the ideas are today.
BERT is still the most downloaded LM on Hugging Face with 46M downloads last month. XLM-RoBERTa has 24M and DistilBERT is at 15M. I feel like BERTs are doing okay.
I'm a bit embarrassed to admit, but I still don't understand decoder vs encoder vs decoder/encoder models.
Is the input/output of these models any different? Are they all just "text context goes in, scores for all tokens in the vocabulary come out" ? Is the difference only in how they achieve this output?
Encoder: Text tokens -> Fixed representation vector
Decoder: Fixed representation vector + N decoded text tokens -> N+1th text token
Encoder/Decoder architecture: You take some tokenized text, run an encoder on it to get a fixed representation vector, and then recursively apply the decoder to your fixed representation vector and the 0...N tokens you've already produced to produce the N+1th token.
Decoder-only architecture: You take some tokenized text, and recursively apply a decoder to the 0...N tokens you've already produced to produce the N+1th token (without ever using an encoded representation vector).
Basically, an encoder produces this intermediate output which a decoder knows how to combine with some existing output to create more output (imagine, e.g., encoding a sentence in French, and then feeding a decoder the vector representation of that sentence plus the three words you've translated so far, so that it can figure out the next word in the translation). A decoder can be made to require an intermediate context vector, or (this is how it's done in decoder-only architectures) it can be made to require only the text produced so far.
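The loop described above can be sketched as follows. This is a toy under stated assumptions: `toy_encode` and `toy_next_token` are hypothetical stand-ins for real networks, and in actual transformer encoder/decoders the encoder output is usually one vector per input token rather than a single vector.

```python
def generate(next_token, prompt_tokens, max_new_tokens, context=None, eos="<eos>"):
    """Greedy autoregressive loop shared by both architectures.

    context=None -> decoder-only: the decoder sees only the tokens so far.
    context=...  -> encoder/decoder: the decoder also sees the encoder output.
    """
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens, context)   # predict token N+1 from tokens 0..N
        if tok == eos:
            break
        tokens.append(tok)
    return tokens

# Toy "encoder": a word-for-word lookup standing in for a learned representation.
def toy_encode(source_tokens):
    lexicon = {"le": "the", "chat": "cat", "dort": "sleeps"}
    return [lexicon[t] for t in source_tokens]

# Toy "decoder": emits the context entry at the current position, then stops.
def toy_next_token(tokens_so_far, context):
    i = len(tokens_so_far)
    return context[i] if i < len(context) else "<eos>"

context = toy_encode(["le", "chat", "dort"])
translation = generate(toy_next_token, [], max_new_tokens=10, context=context)
print(translation)
```

A decoder-only model would use the same loop with `context=None` and the source text placed in `prompt_tokens` instead.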
You can think of encoder/decoder models as specifically addressing the translation problem. They are also known as sequence-to-sequence models.
Take the task of translation. A translator needs to keep in mind the original text and the translation so far in order to predict the next translated token. The original text is encoded, and the translation so far is passed into the decoder to generate the next translated token. The next token is appended to the translation and the process repeats autoregressively.
Decoder-only models use just the decoder architecture of encoder/decoders. They are prompted and generate completions autoregressively.
Encoder-only models use just the encoder architecture which you can think of similarly to embedding. A task here is, producing vectors where vector distance is related to the semantic similarity of the input documents. This can be useful for retrieval tasks among other things.
You can of course translate using just the decoder, by constructing a "please translate this from A to B, <original text>" prompt and generating tokens just using the decoder. I'll leave it to people with more expertise than I have to describe the pros and cons of these.
The biggest difference is that when you feed a sequence into a decoder-only model, it attends only to the current and previous tokens when computing hidden states for the current token. So the hidden states for the nth token are based only on tokens ≤ n. This is where the talk about "causal masking" comes from: the attention matrix is masked to enforce this restriction. Encoder architectures, on the other hand, allow each position in the sequence to attend to every other position in the sequence.
Encoder architectures have been used for semantic analysis and feature extraction of sequences, and decoder-only architectures for generation (i.e. next-token prediction).
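A small sketch of the two masking patterns may help. It only builds the boolean masks, not a full attention layer, and it assumes the common convention that a position may attend to itself as well as to earlier positions.

```python
def attention_mask(n, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

causal = attention_mask(4, causal=True)          # decoder-style (lower triangular)
bidirectional = attention_mask(4, causal=False)  # encoder-style (everything visible)

# Print the causal mask: 'x' = allowed, '.' = masked out.
for row in causal:
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

With 4 positions the causal mask allows 10 of the 16 possible connections, while the bidirectional mask allows all 16.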
Don't be embarrassed. This article makes the mistake of _saying_ they're going to catch the under-informed up to speed but then immediately dives all the way into the deep end.
The key to understanding the difference is that transformers are attention models where tokens can "attend" to different tokens.
Encoder models allow all tokens to attend to every other token. This increases the number of connections and makes it easier for the model to reason, but requires all tokens at once to produce any output. These models generally can't generate text.
Decoder models only allow tokens to attend to previous tokens in the sequence. This decreases the number of connections, but allows the model to be run incrementally, one token at a time. This incremental processing is key to allowing the models to generate text.
They are there, you just have to look. Tasksource, NuNER, Flan, T0. There’s not a lot, but still at least a few good zero shot models in both architectures.
It's because you need to mess with embeddings or even train new heads on top of a network to use it. LLMs are just tokens in, tokens out: they don't classify with a softmax over classes, they softmax over vocabulary tokens. LLMs are more convenient.
What happened is that "transformers go whrrrrrr." (yes, that's the academic term)
In the end, LLMs using causal language modeling or masked language modeling learn to best solve their objectives by creating an efficient global model of language patterns: CLM is actually a harder problem to solve since MLM can leak information through surrounding context, and with transformer scaling law research post-BERT/GPT it's not a surprise CLM won out in the long run.
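To see why MLM can "leak" through surrounding context, it helps to look at how training examples are built for each objective. The sketch below is simplified: it uses a plain mask-or-keep rule (BERT's actual recipe masks about 15% of tokens and sometimes substitutes random or unchanged tokens), and the function names are just illustrative.

```python
import random

def clm_example(tokens):
    """Causal LM: predict each token from the tokens before it."""
    return tokens[:-1], tokens[1:]   # inputs, targets (shifted by one)

def mlm_example(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Masked LM: hide some tokens and predict them from both sides."""
    rng = rng or random.Random(0)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)       # loss is computed at this position
        else:
            inputs.append(tok)
            targets.append(None)      # no loss at unmasked positions
    return inputs, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"]
clm_inputs, clm_targets = clm_example(tokens)
mlm_inputs, mlm_targets = mlm_example(tokens, mask_prob=0.5)
print(clm_inputs, clm_targets)
print(mlm_inputs, mlm_targets)
```

In the CLM case every position is a prediction target and only the left context is visible; in the MLM case only the masked positions are targets, and the model can use tokens on both sides of each one.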
I believe many high-quality embedding models are still based on BERT, even recent ones, so I don't think it's entirely fair to characterize it as "deprecated".
Feels like large language models sucked all the air out of the room because it was a lot easier to scale compute and data, and after RoBERTa, no one was willing to continue exploring.
No, there are mathematical reasons LLMs are better. They are trained with a multi-objective loss (coding skills, translation skills, etc.), so they understand the world much better than MLM models do. The original post discusses that, but with more words and points than necessary.
> It is also worth to note that, generally speaking, an Encoder-Decoders of 2N parameters has the same compute cost as a decoder-only model of N parameters which gives it a different FLOP to parameter count ratio.
Can someone explain this to me? I'm not sure how the compute costs are the same between the 2N and N nets.
You can break your sequence into two parts. One part goes through the encoder and the other goes through the decoder, so each token only goes through one transformer stack.
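A back-of-the-envelope version of that argument, using the common rule of thumb that a forward pass costs about 2 FLOPs per parameter per token. This ignores the attention computation itself and the extra cross-attention weights in the decoder, and the parameter and token counts are illustrative only.

```python
def forward_flops(params, tokens):
    """Rule-of-thumb forward cost: about 2 FLOPs per parameter per token."""
    return 2 * params * tokens

N = 1_000_000_000                    # parameters per stack (illustrative)
src_tokens, tgt_tokens = 300, 200    # a 500-token sequence split into two parts

# Decoder-only, N parameters: every token goes through the single stack.
decoder_only = forward_flops(N, src_tokens + tgt_tokens)

# Encoder/decoder, 2N parameters: each token goes through only one of the stacks.
enc_dec = forward_flops(N, src_tokens) + forward_flops(N, tgt_tokens)

print(decoder_only == enc_dec)
```

Under these assumptions both come out to 2 * N * 500 FLOPs, even though the encoder/decoder holds twice as many parameters.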
I mean, the scaling already happened in 2019 with RoBERTa. My guess is that these models are already good enough at what they need to do (creating meaningful text embeddings), and making them extremely large wasn't feasible for deployment.
For text classification/clustering/retrieval I am pretty happy with BERT-family models. It's only in the last few months that I've seen better models come out that are practical (i.e. you don't have to sell all your children to OpenAI to afford them).
nit: I find the writing in this post very distracting. (Grammar and style pet peeves)
Luckily, it is now trivial to drop the post into Claude and say "Re-write this without <list of things that bother me>"
So, just in case you also felt like you were driving over a road filled with potholes trying to read this post, don't just click away, have your handy LLM take a pass at it. There's good stuff to be found.
[+] [-] hdhshdhshdjd|1 year ago|reply
[+] [-] deepsquirrelnet|1 year ago|reply
[+] [-] Tostino|1 year ago|reply
[+] [-] ipsum2|1 year ago|reply
[+] [-] Seattle3503|1 year ago|reply
[+] [-] llm_trw|1 year ago|reply
[+] [-] janalsncm|1 year ago|reply
[+] [-] isaacfung|1 year ago|reply
https://stability.ai/news/stable-diffusion-3-research-paper
https://t5tts.github.io/
Related discussion
https://www.reddit.com/r/StableDiffusion/comments/1c0by2y/wh...
[+] [-] hdhshdhshdjd|1 year ago|reply
[+] [-] vintermann|1 year ago|reply
[+] [-] swyx|1 year ago|reply
[+] [-] ildon|1 year ago|reply
[+] [-] matusp|1 year ago|reply
[+] [-] andy_xor_andrew|1 year ago|reply
[+] [-] ambrozk|1 year ago|reply
[+] [-] kelseyfrog|1 year ago|reply
[+] [-] mbowcut2|1 year ago|reply
[+] [-] chant4747|1 year ago|reply
[+] [-] lalaland1125|1 year ago|reply
[+] [-] thomasahle|1 year ago|reply
- Bert is encoder only.
- GPT is decoder only.
- T5 uses both the encoder and the decoder.
[+] [-] lalaland1125|1 year ago|reply
When you have hundreds or thousands of examples, BERT works great. But that is very restricting.
[+] [-] jerrygenser|1 year ago|reply
[+] [-] byefruit|1 year ago|reply
[+] [-] deepsquirrelnet|1 year ago|reply
[+] [-] visarga|1 year ago|reply
[+] [-] minimaxir|1 year ago|reply
[+] [-] k8si|1 year ago|reply
[+] [-] a_bonobo|1 year ago|reply
[+] [-] htrp|1 year ago|reply
[+] [-] nshm|1 year ago|reply
[+] [-] riku_iki|1 year ago|reply
[+] [-] jszymborski|1 year ago|reply
[+] [-] phillypham|1 year ago|reply
[+] [-] bugglebeetle|1 year ago|reply
[+] [-] caprock|1 year ago|reply
[+] [-] swyx|1 year ago|reply
[+] [-] IAmBurger|1 year ago|reply
[+] [-] GaggiX|1 year ago|reply
[+] [-] PaulHoule|1 year ago|reply
[+] [-] iandanforth|1 year ago|reply