I'd love to know more about how OpenAI (or Alec Radford et al.) even decided GPT-1 was worth investing more in. At a glance, its output is barely distinguishable from a Markov chain's. If in 2018 you'd told me that scaling the algorithm up 100-1000x would lead to computers talking to people, coding, reasoning, and beating the IMO, I'd have told you to take your meds.
arugulum|6 months ago
Overnight, GPT-1 single-handedly upset the whole field. It was somewhat overshadowed by BERT (and, later, T5), which tended to perform even better in the pretrain-and-finetune setting. Nevertheless, the success of GPT-1 already warranted scaling up the approach.
A better question is how OpenAI decided to scale GPT-2 up to GPT-3. GPT-2 was an awkward in-between model. It generated better text, for sure, but the zero-shot performance reported in the paper, while neat, was not great, and its fine-tuned task performance paled in comparison to much smaller encoder-only Transformers. (The answer: scaling laws allowed for predictable increases in performance.)
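To make "predictable" concrete: the scaling-law papers fit test loss as a power law in model size or compute, so a curve fit on small runs can be extrapolated before paying for a big one. A minimal sketch of that kind of fit, with made-up numbers rather than the published constants:

    import numpy as np

    # Hypothetical (parameter count, test loss) pairs from small runs.
    # These numbers are illustrative, not real GPT measurements.
    params = np.array([1e6, 1e7, 1e8, 1e9])
    loss = np.array([5.0, 4.2, 3.5, 2.9])

    # A power law L(N) = a * N^(-b) is linear in log-log space:
    # log L = log a - b * log N, so fit it with least squares.
    slope, log_a = np.polyfit(np.log(params), np.log(loss), 1)
    a, b = np.exp(log_a), -slope

    # Extrapolate the fitted curve to a ~100x larger model
    # before committing to the training run.
    n_big = 100e9
    print(f"L(N) ~= {a:.2f} * N^(-{b:.3f}); "
          f"predicted loss at 100B params: {a * n_big ** -b:.2f}")

That, roughly, was the bet: the loss curve fit on smaller models kept holding as compute grew.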
gnerd00|6 months ago
No, this is the winners rewriting history. Transformer-style encoders are now applied to lots and lots of disciplines, but they do not "trivially" do anything; the hyped retelling is obscuring the facts of the history. Specifically in human-language text translation, yes, the "Attention Is All You Need" Transformer did "blow others out of the water", for that application.
hadlock|6 months ago
I'm really looking forward to the "The Social Network" treatment of OpenAI, whenever that movie happens.
muzani|6 months ago
I have the feeling they kept at this until GPT-4o (which was trained on a different kind of data).
therein|6 months ago
Also, slightly tangentially: people will tell me we were only impressed because it was new and novel, but I almost think things went downhill after ChatGPT 3. I felt like 2.5 (or whatever they called it) was able to give better insights from the model weights themselves. The moment tool use became a thing and we started adding RAG, memory, and search-engine tools, it actually got worse.
I am also pretty sure we are lobotomizing the parts that would feel closest to critical thinking by training the model to be sensitive to the taboo of the day. I suspect the earlier models were less broken because of it.
How would it distinguish between knowing something from training and needing to use a tool to synthesize a response, anyway?
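For what it's worth, in current tool-use setups the model doesn't explicitly check whether it "knows" something; it is fine-tuned to emit a structured tool call when one looks useful, and the client loops until it gets plain text back. A minimal sketch of that dispatch pattern; the `chat` and `web_search` functions here are hypothetical stubs, not a real API:

    import json

    # Hypothetical stand-ins: a real client would call a chat-completions
    # API and a search backend. These stubs only illustrate the shapes.
    def web_search(query: str) -> str:
        return f"(search results for {query!r})"

    def chat(messages: list[dict]) -> dict:
        # A tool-tuned model emits either plain text or a structured tool
        # call. Which one it emits is a learned token pattern, not an
        # explicit "do I know this?" test.
        if any(m["role"] == "tool" for m in messages):
            return {"content": "Answer synthesized from the tool result."}
        query = messages[-1]["content"]
        return {"tool_call": {"name": "web_search",
                              "arguments": json.dumps({"query": query})}}

    def answer(question: str) -> str:
        messages = [{"role": "user", "content": question}]
        while True:
            reply = chat(messages)
            if "tool_call" not in reply:
                return reply["content"]  # answered from the weights alone
            args = json.loads(reply["tool_call"]["arguments"])
            messages.append({"role": "tool",
                             "content": web_search(args["query"])})

    print(answer("Who won the IMO this year?"))

So the distinguishing is itself learned behavior, and miscalibration there (reaching for tools too eagerly, or not eagerly enough) is exactly the failure mode described above.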