That's true. You'd think an LLM would condition on the joke context and make the surprising completion more probable. I guess this only works well once the model is really good, which fits with GPT-4.5 having better humor.
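(The conditioning point can be shown with a toy sketch. The contexts and probabilities below are made up for illustration, not any real model's distribution: in a neutral context the expected continuation dominates, while a joke-setup context shifts mass toward the surprising one.)

```python
# Toy context-conditioned next-token distribution.
# All numbers are invented for illustration; a real LLM learns these from data.
probs = {
    "neutral": {"expected_word": 0.90, "surprise_word": 0.10},
    "joke_setup": {"expected_word": 0.40, "surprise_word": 0.60},
}

def pick_likely(context):
    """Return the most probable completion given the context."""
    dist = probs[context]
    return max(dist, key=dist.get)

print(pick_likely("neutral"))     # expected_word
print(pick_likely("joke_setup"))  # surprise_word
```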
Genuinely new jokes are like novel ideas: really hard even for humans. I mean fuck, we have an entire profession dedicated just to making up and telling them, and even theirs don't land half the time.
Exactly. It feels like ever since LLMs hit the at-the-time astounding breakthrough of "LLMs can generate coherent stories" with GPT-2, people have constantly been saying "yeah? Well it can't do <this thing that is really hard even for competent humans>."
moffkalast|6 months ago
Which is notable, because GPT-4.5 is one of the largest models ever trained. It's larger than today's production models powering GPT-5.
IshKebab|6 months ago
That breakthrough was only 6 years ago!
https://openai.com/index/better-language-models/
> We’ve trained a large-scale unsupervised language model which generates coherent paragraphs of text...
That was big news. I guess this is because it's quite hard for most people to appreciate the enormous difficulty gulf between "generate a coherent paragraph" and "create a novel, funny joke".
ACCount37|6 months ago
Goes to show that "bad at jokes" is not a fundamental issue of LLMs, and that there are still performance gains from increasing model scale, as expected. But not exactly the same performance gains you get from reasoning or RLVR.