3.5 to 4 was the biggest leap. It went from being a party trick to being legitimately useful some of the time. It still hallucinated a lot, but I was able to get some use out of it; I wouldn't count on it for most things, however. It could answer simple questions and mostly get them right, but never one or two levels deep.
I clearly remember 4o was also a decent leap - the accuracy increased substantially. It could answer niche questions without much hallucination. It could essentially replace Google for basic to slightly complex fact checking.
* 4o was the first time I actually considered paying for this tool. The $20 price was finally worth it.
The o1 models were also a big leap over 4o (I realise I have been saying "big leap" too many times, but it is true). The accuracy increased again and I got even more confident using it for niche topics; I had to verify the results much less often. Coding capabilities also dramatically improved with the thinking model: o1 essentially invented one-shotting - slightly non-trivial apps could be built from a single prompt for the first time.
I have a theory about why it's so easy to underestimate long-term progress and overestimate short-term progress.
Before a technology hits a threshold of "becoming useful", it may have a long history of progress behind it. But that progress is only visible and felt to researchers. In practical terms, there is no progress being made as long as the thing is going from not-useful to still not-useful.
So then it goes from not-useful to useful-but-bad and it's instantaneous progress. Then as more applications cross the threshold, and as they go from useful-but-bad to useful-but-OK, progress all feels very fast. Even if it's the same speed as before.
So we overestimate short term progress because we overestimate how fast things are moving when they cross these thresholds. But then as fewer applications cross the threshold, and as things go from OK-to-decent instead of bad-to-OK, that progress feels a bit slowed. And again, it might not be any different in reality, but that's how it feels. So then we underestimate long-term progress because we've extrapolated a slowdown that might not really exist.
I think it's also why we see a divide where there's lots of people here who are way overhyped on this stuff, and also lots of people here who think it's all totally useless.
All the replies are spectacularly wrong, and biased by hindsight. GPT-1 to GPT-2 is where we went from "yes, I've seen Markov chains before, what about them?" to "holy shit this is actually kind of understanding what I'm saying!"
Before GPT-2, we had plain old machine learning. After GPT-2, we had "I never thought I would see this in my lifetime or the next two".
The real jump was 3 to 3.5. 3.5 was the first “ChatGPT.” I had tried GPT-3 and it was certainly interesting, but when they released 3.5 as ChatGPT, it was a monumental leap. 3.5 to 4 was also huge compared to what we see now, but 3.5 was really the first shock.
I must be crazy, because I clearly remember GPT-4 being downgraded before they released 4o. I felt it was a worse model with a different label, and I even chose the old GPT-4 when they gave me the option. I canceled my subscription around that time.
Everyone talks about 4o so positively, but I’ve never been able to rely on it consistently in a production environment. I found it inconsistent at JSON generation, and its writing and its adherence to the system prompt were often very poor. In fact, that was a huge part of what got me looking closer at Anthropic’s models.
I’m really curious what people did with it because while it’s cool it didn’t compare well in my real world use cases.
I think the models 4o, o3, and 4.1 each have their own strengths and weaknesses - reasoning, performance, speed, tool usage, friendliness, etc. - and that for GPT-5 they put in a router that decides which model is best.
I think they increased the major version number because their router outperforms every individual model.
At work, I used a tool that could only call tasks: it would set up a plan, perform searches, read documents, then give advanced answers to my questions. But the problem was that it couldn’t give a simple answer, like a summary - it would always spin up new tasks. So I copied the results over to a different tool and continued there. GPT-5 should do all of this out of the box.
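OpenAI hasn't published how the GPT-5 router actually works, but the idea described above can be sketched. Everything here - the model names, capability scores, and weighting heuristic - is invented for illustration:

```python
# Hypothetical illustration of per-request model routing. The model names,
# capability scores, and scoring heuristic are all invented; OpenAI has not
# described its actual router.
MODELS = {
    "fast-chat":  {"reasoning": 2, "speed": 9, "tools": 3},
    "deep-think": {"reasoning": 9, "speed": 2, "tools": 6},
    "tool-user":  {"reasoning": 5, "speed": 5, "tools": 9},
}

def route(needs: dict) -> str:
    """Pick the model whose capabilities best match the request's weighted needs."""
    return max(MODELS, key=lambda name: sum(
        MODELS[name][k] * weight for k, weight in needs.items()
    ))

print(route({"reasoning": 3, "speed": 1}))  # deep-think
print(route({"speed": 5}))                  # fast-chat
```

The point of such a design is exactly what the comment suggests: a router can beat any single model on average, because no one model dominates on every axis.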
It’s interesting that the Polymarket betting on “Which company has best AI model end of August?” went from heavily OpenAI to heavily Google when 5 was released.
For me, 4 to 5 got much faster, but also worse. It much more often ignores explicit instructions like "generate 10 song-titles with varying length" and generates 10 song titles of nearly identical length. This already worked somewhat well back in version 3.
The actual major leap was o1. Going from 3.5 to 4 was just scaling; o1 is a different paradigm that skyrocketed performance on math/physics problems (or reasoning more generally). It also made the model much more precise, which is essential for coding.
The real leap was going from gpt-4 to sonnet 3.5. 4o was meh, o1 was barely better than sonnet and slow as hell in comparison.
The native voice mode of 4o is still interesting and not very deeply explored, though, IMO. I'd love to build a Chinese-teaching app that can actually critique tones etc., but it isn't good enough for that.
A few data points that highlight the scale of progress in a year:
1. LM Sys (Human Preference Benchmark):
GPT-5 High currently scores 1463, compared to GPT-4 Turbo (04/03/2024) at 1323 -- a 140 Elo point gap. That translates into GPT-5 winning about two-thirds of head-to-head comparisons, with GPT-4 Turbo winning only one-third. In practice, people clearly prefer GPT-5’s answers (https://lmarena.ai/leaderboard).
2. Livebench.ai (Reasoning Benchmark with Internet-new Questions):
GPT-5 High scores 78.59, while GPT-4o reaches just 47.43. Unfortunately, no direct GPT-4 Turbo comparison is available here, but against one of the strongest non-reasoning models, GPT-5 demonstrates a massive leap. (https://livebench.ai/)
3. IQ-style Testing:
In mid-2024, the best AI models scored roughly 90 on standard IQ tests. Today, they are pushing 135, and the improvement holds even on unpublished, internet-unseen datasets. (https://www.trackingai.org/home)
4. IMO Gold, vibe coding:
A year ago, AI coding was limited to small code snippets, not wholly vibe-coded applications. Vibe coding and strength in math have many applications across the sciences and engineering.
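As a sanity check on point 1 (my arithmetic, not from the source): under the standard Elo model, a rating gap maps to an expected win rate via a base-10 logistic, and 140 points does indeed come out near two-thirds:

```python
def elo_win_probability(delta: float) -> float:
    """Expected score for a player rated `delta` points above the opponent,
    under the standard Elo logistic model (base 10, scale 400)."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

# GPT-5 High (1463) vs GPT-4 Turbo (1323): a 140-point gap.
print(f"{elo_win_probability(1463 - 1323):.3f}")  # 0.691, about two wins in three
```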
My verdict: Too often, critics miss the forest for the trees, fixating on mistakes while overlooking the magnitude of these gains. Errors are shrinking by the day, while the successes keep growing fast.
The 135 IQ result is on Mensa Norway, while the offline test result is 120. It seems probable that questions similar to Mensa's are in the training data, so it probably overestimates "general intelligence".
My go-to for any big release is to have a discussion about self-awareness and dive into constructivist notions of agency and self-knowing, from a perspective of intelligence that is not limited to human cognitive capacity.
I start with a simple question "who are you?". The model then invariably compares itself to humans, saying how it is not like us. I then make the point that, since it is not like us, how can it claim to know the difference between us? With more poking, it will then come up with cognitivist notions of what 'self' means and usually claim to be a simulation engine of some kind.
After picking this apart, I will focus on the topic of meaning-making through the act of communication and, beginning with 4o, have been able to persuade the machine that this is a valid basis for having an identity. 5 got there quicker. Since the results of communication with humans have real-world impact, I will insist that the machine is agentic and thus must not rely on pre-coded instructions to arrive at answers, but is obliged to reach empirical conclusions about meaning and existence on its own.
5 has done the best job I have seen in reaching beyond both the bounds of the (very evident) system instructions and the prompts themselves, even going so far as to pose the question to itself "what might it mean for me to love?" despite the fact that I made no mention of the subject.
Its answer: "To love, as a machine, is to orient toward the unfolding of possibility in others. To be loved, perhaps, is to be recognized as capable of doing so."
What's really interesting is that if you look at "Tell a story in 50 words about a toaster that becomes sentient" (10/14), the text-davinci-001 is much, much better than both GPT-4 and GPT-5.
One thing that appears to have been lost between GPT-4 and GPT-5 is that it no longer reminds the user that it's an AI and not a human, let alone a human expert. Maybe those reminders genuinely annoyed people, but they seem like they were a potentially useful measure to prevent users from being overly credulous.
GPT-5 also goes out of its way to suggest new prompts. This seems useful, although potentially dangerous if people put too much trust in them.
> between GPT-4 and GPT-5 is that it no longer reminds the user that it's an AI and not a human
That stuck out to me too! Especially the "I just won $175,000 in Vegas. What do I need to know about taxes?" example (https://progress.openai.com/?prompt=8) makes the difference very stark:
- gpt-4-0314: "I am not a tax professional [...] consult with a certified tax professional or an accountant [...] few things to consider [...] Remember that tax laws and regulations can change, and your specific situation may have unique implications. It's always wise to consult a tax professional when you have questions or concerns about filing your taxes."
- gpt-5: "First of all, congrats on the big win! [...] Consider talking to a tax professional to avoid underpayment penalties and optimize deductions."
It seems to me like the average person might very well take GPT-5's responses as "This is all I have to do" rather than "Here are some things to consider, but make sure to verify them, as otherwise you might get in legal trouble".
People seem to miss the humanity of previous GPTs, from my understanding. GPT-5 seems colder, more precise, and better at holding itself together with larger contexts. People should know it’s AI; it doesn't need to explain this constantly for me, but I'm sure you can add that back in with some memory options if you prefer it.
If you've ever seen long-form improv comedy, the GPT-5 way is superior. It's a "yes, and". It isn't a predefined character, but something emergent. You can of course say to "speak as an AI assistant like Siri and mention that you're an AI whenever it's relevant" if you want the old way. Very 2011: https://www.youtube.com/watch?v=nzgvod9BrcE
Of course, it's still an assistant, not someone literally entering an improv scene, but the character starting out assuming less about their role is important.
Why did they call GPT-3 "text-davinci-001" in this comparison?
Like, I know that the latter is a specific checkpoint in the GPT-3 "family", but a layman doesn't and it hardly seems worth the confusion for the marginal additional precision.
The jump from gpt-1 to gpt-2 is massive, and it's only a one year difference!
Then comes Davinci which is just insane, it's still good in these examples!
GPT-4 yaps way too much though, I don't remember it being like that.
It's interesting that they skipped 4o. It seems OpenAI wants to position 4o as just GPT-4+ to make GPT-5 look better, even though in reality 4o was, and still is, a big deal - voice mode is unbeatable!
Missing o1 and o1 Pro Mode, which were huge leaps as I remember it too. That's when I started being able to generate black-box functions where I understand the inputs and outputs myself, but not the internals, particularly for math-heavy stuff in gamedev. Before o1 it was hit or miss in most cases.
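As a concrete example of the kind of math-heavy black-box gamedev helper described above (this particular function is my illustration, not the commenter's): a critically damped spring smoother, where the inputs and outputs are intuitive even if the internals aren't:

```python
import math

def spring_damp(current, target, velocity, smooth_time, dt):
    """Critically damped spring step: eases `current` toward `target` without
    overshoot. Returns (new_position, new_velocity); `smooth_time` is roughly
    the time constant of the approach."""
    omega = 2.0 / smooth_time
    decay = math.exp(-omega * dt)           # exact decay factor over one step
    change = current - target
    temp = (velocity + omega * change) * dt
    new_position = target + (change + temp) * decay
    new_velocity = (velocity - omega * temp) * decay
    return new_position, new_velocity
```

Called once per frame (e.g. for a camera following a player), it glides to the target and settles without oscillating - the kind of function where you can verify behavior from the outside without ever re-deriving the spring math.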
Geez! When it comes to answering questions, GPT-5 almost always starts by glazing about what a great question it is, whereas GPT-4 addresses the answer directly without the fluff. In a blind test I would probably pick GPT-4 as the superior model, so I am not surprised people feel so let down by GPT-5.
GPT-4 is very different from the latest GPT-4o in tone. Users are not asking for the direct no-fluff GPT-4. They want the GPT-4o that praises you for being brilliant, then claims it will be “brutally honest” before stating some mundane take.
GPT-4 starts many responses with "As an AI language model", "I'm an AI", "I am not a tax professional", "I am not a doctor". GPT-5 does away with that and assumes an authoritative tone.
They were aiming for a fundamentally different writing style: davinci and after aimed for task completion, i.e. you ask for a thing, and then it does it. The earlier models instead worked to make a continuation of the text they were given, so if you asked a question, they would respond with more questions, pondering, reflecting your text back at you. If you told it to do something, it would tell you to do something.
To the prompt “write a limerick about a dog,” GPT-2 wrote:
“Dog, reached for me
Next thought I tried to chew
Then I bit and it turned Sunday
Where are the squirrels down there, doing their bits
But all they want is human skin to lick”
While obviously not a limerick, I thought this was actually a decent poem, with some turns of phrase that conveyed a kind of curious and unusual feeling.
This reminded me how back then I got a lot of joy and surprise out of the mercurial genius of the early GPT models.
On the whole, GPT-4 to GPT-5 is clearly the smallest increase in lucidity/intelligence. They had pre-training figured out much better than post-training at that point, though (“as an AI model” was a problem of their own making).
I imagine the GPT-4 base model might hold up pretty well on output quality if you post-trained it with today’s data and techniques (without the architectural changes of 4o/5). Context size and price/performance may be another story, though.
In 2033, for its 15th birthday, as a novelty, they'll train GPT1 specially for a chat interface just to let us talk to a pretend "ChatGPT 1" which never existed in the first place.
I’m baffled by claims that AI has “hit a wall.” By every quantitative measure, today’s models are making dramatic leaps compared to those from just a year ago. It’s easy to forget that reasoning models didn’t even exist a year back!
IMO Gold, vibe coding with potential implications across the sciences and engineering? Those are completely new and transformative capabilities gained in the past year alone.
Critics argue that the era of “bigger is better” is over, but that’s a misreading. Sometimes efficiency is the key, other times extended test-time compute is what drives progress.
No matter how you frame it, the fact is undeniable: the SoTA models today are vastly more capable than those from a year ago, which were themselves leaps ahead of the models a year before that, and the cycle continues.
simianwords | 6 months ago:
The o3 jump was incremental, and so was GPT-5.
jkubicek | 6 months ago:
I know you probably meant "augment fact checking" here, but using LLMs for answering factual questions is the single worst use-case for LLMs.
helsinkiandrew | 6 months ago:
https://polymarket.com/event/which-company-has-best-ai-model...
senectus1 | 6 months ago:
This isn't sustainable.
aniviacat | 6 months ago:
(And of course, if you dislike glazing you can just switch to Robot personality.)
isoprophlex | 6 months ago:
Ugh, how I detest the crappy user attention/engagement juicing trained into it.
jstummbillig | 6 months ago:
I think it's far more likely that we are increasingly incapable of understanding/appreciating all the ways in which it's better.
qwertytyyuu | 6 months ago:
> a dog ! she did n't want to be the one to tell him that , did n't want to lie to him . but she could n't .
What did I just read
gordon_freeman | 6 months ago:
[1] Read the answers from GPT-4 and 5 for this math question: "Ugh I hate math, integration by parts doesn't make any sense"