patelajay285 | 1 year ago

We found the same result a few years ago in our ICLR paper: https://arxiv.org/pdf/2209.14500

We found that Google's T5 models, which were released in 2019 (pre-GPT-3), were "secretly" capable of in-context learning with a simple inference technique.

Given that they use a bidirectional MLM (Masked Language Modeling) objective, it wasn't obvious how to do it, but MLM objectives are known to produce better language representations than causal (next-token prediction) objectives. We were able to outperform much larger GPT-3 models, or come very close to their performance, with far smaller T5 models.
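To make the difference between the two objectives concrete, here is a toy sketch of the training pairs each one produces from the same sentence. This is an illustration only: it tokenizes by whitespace (real models use subword tokenizers), and the sentinel-token format loosely follows T5's span-corruption convention.

```python
def causal_pairs(tokens):
    """Causal LM: predict each token from its left context only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def span_corruption(tokens, spans):
    """T5-style MLM: replace contiguous spans with sentinel tokens;
    the target reconstructs the masked spans. `spans` is a list of
    (start, end) index pairs into `tokens`."""
    inp, tgt = [], []
    prev = 0
    for k, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inp += tokens[prev:s] + [sentinel]
        tgt += [sentinel] + tokens[s:e]
        prev = e
    inp += tokens[prev:]
    return inp, tgt

tokens = "the quick brown fox jumps".split()
print(causal_pairs(tokens)[0])          # (['the'], 'quick')
inp, tgt = span_corruption(tokens, [(1, 3)])
print(inp)  # ['the', '<extra_id_0>', 'fox', 'jumps']
print(tgt)  # ['<extra_id_0>', 'quick', 'brown']
```

The causal objective only ever conditions on the left context, while the span-corruption target is predicted with the full bidirectional context visible, which is the usual intuition for why MLM yields stronger representations.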

cscurmudgeon | 1 year ago

Are there any intrinsic advantages or disadvantages of bidirectional models over causal models for in-context learning? It seems that unidirectional models have just been explored and worked on more.

patelajay285 | 1 year ago

When you train bidirectionally only, you don't get a generative model; that's the downside. However, you can train on a mixture of causal and bidirectional objectives, as some LLM pre-training has done. As far as I'm aware, there are no downsides to that; it isn't more common simply because the standard practice has been to train causal-only, and there isn't enough funding/attention to experiment on every axis of pre-training (which can be very expensive).
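A mixed-objective data pipeline can be sketched as below. This is a hypothetical illustration of the idea, not any specific model's recipe: the mixture weight `p_causal` and the one-token masking are stand-ins for whatever a real scheme would choose.

```python
import random

def mixed_batch(sentences, rng, p_causal=0.5):
    """Sketch of mixed-objective pre-training data: each example is
    prepared either for causal LM or for span-corruption training.
    `p_causal` is a hypothetical mixture weight."""
    batch = []
    for toks in sentences:
        if rng.random() < p_causal:
            # Causal: predict each next token from the prefix.
            batch.append(("causal", toks[:-1], toks[1:]))
        else:
            # Bidirectional: blank out one interior token with a sentinel.
            i = rng.randrange(1, len(toks) - 1)
            inp = toks[:i] + ["<extra_id_0>"] + toks[i + 1:]
            batch.append(("mlm", inp, ["<extra_id_0>", toks[i]]))
    return batch

rng = random.Random(0)
sents = [s.split() for s in ["a b c d", "e f g h"]]
for kind, inp, tgt in mixed_batch(sents, rng):
    print(kind, inp, tgt)
```

The per-example coin flip is the simplest way to mix objectives; a real pipeline might instead dedicate separate batches or stages to each objective.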

toxik | 1 year ago

From the paper it seems the sampling method (SAP) is also slower, so the fact that it beats larger models seems expected.

patelajay285 | 1 year ago

It's not at all expected. T5 models are not generative models by default, and they were not thought to be able to perform generation, let alone in-context learning. Remember, these models were released before any of the existing LLMs, and in-context learning/prompting only became popularized as a technique with GPT-3.

While the technique requires multiple samples to coax generations from this particular model, other LLM training schemes have since incorporated both unidirectional and bidirectional objectives. However, this exploration hasn't been fully resolved, as most models are still trained only on the causal objective by standard practice. There's still a lot of exploration that can be done on pre-training objectives.
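The "multiple samples" idea can be sketched as iterative infilling: append a blank at the end of the sequence, let the model fill it, keep the new tokens, and repeat. This is a rough illustration of coaxing left-to-right generation from an infilling model, not the paper's exact SAP algorithm; `fill_blank` is a hypothetical stand-in for a T5 forward pass.

```python
def generate_iteratively(prompt_tokens, fill_blank, max_steps=5, stop="</s>"):
    """Coax generation from an infilling model: repeatedly ask it to
    fill a trailing blank and append the result."""
    out = list(prompt_tokens)
    for _ in range(max_steps):
        new = fill_blank(out + ["<extra_id_0>"])  # model fills the blank
        if not new or new[0] == stop:
            break
        out += new
    return out

# Toy stand-in model: deterministically continues a counting sequence.
def toy_fill(tokens):
    nums = [t for t in tokens if t.isdigit()]
    return [str(int(nums[-1]) + 1)] if nums else ["</s>"]

print(generate_iteratively(["1", "2", "3"], toy_fill, max_steps=3))
# ['1', '2', '3', '4', '5', '6']
```

Each round costs a full forward pass for only a small amount of new text, which is why this style of sampling is slower than ordinary autoregressive decoding.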