IMO a critical feature of the Turing test/imitation game, which many modern implementations including this site's linked paper ignore, is that the interrogator talks to both a human and a bot and must decide which of the two is the human. So fooling an interrogator means having them choose the bot as human over an actual human, not just judging the bot to be human in isolation (while probably judging actual humans to be human even more frequently).
When the interrogator is only answering "do you think your conversation partner was a human?" individually, bots can score fairly highly simply by giving little information in either direction - like pretending to be a non-english-speaking child, or sending very few messages.
Whereas when pitted against a human, the bot is forced to give evidence of being human at least as strong as the average human's (over enough tests). Giving zero evidence becomes a bad strategy when the opponent (the real human) is likely giving some positive evidence of their personhood.
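The intuition above can be sketched with a toy simulation. All the numbers here are made-up assumptions: humans emit noisy positive "personhood evidence", a low-information bot emits roughly zero, and the judge only says "bot" on clearly negative evidence in the individual test but picks the stronger signal in the paired test:

```python
import random

random.seed(0)
N = 100_000

def human_evidence():
    # assumption: humans usually give some positive signal of personhood
    return random.gauss(1.0, 1.0)

def bot_evidence():
    # assumption: a low-information bot gives roughly zero signal either way
    return random.gauss(0.0, 0.3)

# Individual test: judge answers "human?" yes unless evidence is clearly negative.
threshold = -0.5  # hypothetical judge leniency
individual_pass = sum(bot_evidence() > threshold for _ in range(N)) / N

# Paired test: judge picks whichever of the two gave stronger evidence.
paired_win = sum(bot_evidence() > human_evidence() for _ in range(N)) / N

print(f"individual pass rate: {individual_pass:.2f}")  # high
print(f"paired win rate:      {paired_win:.2f}")       # well below 0.5
```

Under these assumptions the zero-evidence strategy sails through the individual test but loses badly in the forced-choice pairing, which is the point being made.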
That's not the original Turing test either. The original imitation game as proposed by Turing involves reading a text transcript of a human and a computer and having the evaluator determine which is which. The evaluator does not interact directly with the conversing parties.
I'm skeptical of the claim. I think most folks, given the test you describe, would be able to pick out which is human. I think it can get there, but I'm not sure anyone has made one yet. ChatGPT responses are heavily downvoted and mocked because they're easy to spot.
Does there exist a public LLM that isn't so...wordy, excited, and guardrailed all the time?
You can pretty much spot the bot today by prompting something horribly offensive. Their response is always very inhuman, probably due to lack of emotional energy.
> I don't think people really appreciate how simple ARC-AGI-1 was, and what solving it really means.
> It was designed as the simplest, most basic assessment of fluid intelligence possible. Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar situations.

> Passing it means your system exhibits non-zero fluid intelligence -- you're finally looking at something that isn't pure memorized skill. But it says rather little about how intelligent your system is, or how close to human intelligence it is.

https://bsky.app/profile/fchollet.bsky.social/post/3les3izgd...
> Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar situations.
Not necessarily. Get a human to solve ARC-AGI if the problems are shown as a string. They'll perform badly. But that doesn't mean that humans can't reason. It means that human reasoning doesn't have access to the non-reasoning building blocks it needs (things like concepts, words, or in this case: spatially local and useful visual representations).
Humans have good resolution-invariant visual perception. For example, take an ARC-AGI problem, and for each square, duplicate it a few times, increasing its resolution from X*X to 2X*2X. To a human, the problem will be almost exactly equally difficult. Not for LLMs that have to deal with 4x as much context. Maybe for an LLM if it can somehow reason over the output of a CNN, and if it was trained to do that, the way humans are built to.
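That upscaling experiment is easy to state precisely. Here is a toy version in plain Python (the helper name `upscale` is made up): each cell is duplicated `k` times along both axes, so a human sees the same picture while a model that reads the grid cell-by-cell sees k-squared times as many cells.

```python
def upscale(grid, k):
    """Duplicate each cell k times along both axes: an X*X grid becomes kX*kX."""
    out = []
    for row in grid:
        wide = [cell for cell in row for _ in range(k)]  # stretch horizontally
        out.extend(list(wide) for _ in range(k))         # stretch vertically
    return out

grid = [[1, 0],
        [0, 2]]
big = upscale(grid, 2)
# Visually the same puzzle to a human, but 4x the cells for the model:
# [[1, 1, 0, 0],
#  [1, 1, 0, 0],
#  [0, 0, 2, 2],
#  [0, 0, 2, 2]]
```

The transformation adds no information at all, which is what makes it a probe of representation rather than reasoning.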
Honestly, after that, I've tuned out completely on him and ARC-AGI. A nice minor side story at one point in time.
He's right that this isn't solving all human-intelligence domain level problems.
But the whole stunt, this whole time, was that this was the ARC-AGI benchmark.
The conceit was that the fact LLMs couldn't do well on it proved they weren't intelligent, and that real researchers would step up to score well on it, avoiding the ideological tarpit of LLMs, which could never be intelligent.
It's fine to turn around and say "My AGI benchmark says little about intelligence", but, the level of conversation is decidedly more that of punters at the local stables than rigorous analysis.
I assumed this was about chatbot users committing suicide in order to "join" the bot they are chatting with. It's already happened a couple of times, apparently:

https://futurism.com/teen-suicide-obsessed-ai-chatbot

https://garymarcus.substack.com/p/the-first-known-chatbot-as...
Yea, I too was not expecting a list of past benchmarks. If not the aforementioned actual human deaths, I had expected either a list of companies whose pivot to AI/LLMs led to their downfall (but I guess we're going to need to wait a year or two for that) or a list of industries (such as audio transcription) that are being killed by AI as we speak.
We really do live in interesting times. Usually I feel pretty confident about predicting how a trend will continue, but as it is the only prediction I can make with confidence for this latest AI research is that it is and will be used by militaries to kill a lot of people. Oh, hey, that's another thing this article could have listed!
Outside of that, all bets are open. Possible wagers include: "Turns out to be mostly useful in specific niche applications and only seemingly useful anywhere else", "Extremely useful for businesses looking to offset responsibility for unpopular decisions", "Ushers in an end to work and a golden age for all mankind", "Ushers in an end to work and a dark age for most of the world", "Combines with profit motives to damage all art, culture, and community", etc etc.
I know many folk have strong opinions one way or the other, but I think it's literally anyone's game at this point, though I will say I'm not leaning optimistic.
I thought it was a credible source of actual jobs replaced by LLMs. When I see headlines like this, I ad hominem the source as an unprofitable company's CEO, a big consulting firm, a bootcamp seller, etc.
The tortoise lays on its back, its belly baking in the hot sun, beating its legs trying to turn itself over, but it can't. Not without your help. But you're not helping.
ARC-AGI has not yet been killed by LLMs. o3 achieved a breakthrough only on ARC-AGI-PUB, which is semi-private. Nothing guarantees that the test data wasn't leaked to OpenAI in previous testing rounds, because the model is not run offline.
I think this should be discussed more. Models that can only be accessed via API cannot be tested without giving their owners access to the test data. You just have to trust that they’ll do the right thing.
Interesting choice to have a little (i) icon in the Turing Test card whose mouseover doesn't bring up any text, or link icons in that card that do nothing when clicked.
Looks like a bug - that card has an overlay at a higher z-index that obscures its mouseover and clicks. In the source the (i) links to Turing's original "Imitation Game" paper, and the (?) has this hover text:
> (?) While the Turing Test remains philosophically significant, modern LLMs can consistently pass it, making it no longer effective at measuring the frontier of AI capabilities.
Wozniak's coffee test would be a really fun one to attempt. As long as you could get a capable enough robot, I imagine it's possible. Something like the Spot Arm[1] would be sufficient.
Something like:
- Key the robot controls to a series of tools (move_forward(x), extend_arm(y))
- Add a camera and pass each frame to the AI model along with the task "make a cup of coffee" and the list of available tools it can call.
And it would likely succeed some percentage of the time today!

[1] https://bostondynamics.com/products/spot/arm/
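The steps above can be sketched as a minimal loop. Everything here is a hypothetical stand-in: the tool names, the fake camera frame, and the canned "model" response would be replaced by a real robot API and a real vision-language model:

```python
# Hypothetical robot tools keyed by name, as proposed above.
TOOLS = {
    "move_forward": lambda x: f"moved {x}cm",
    "extend_arm":   lambda y: f"arm extended {y}cm",
}

def get_camera_frame():
    return "<jpeg bytes>"  # placeholder for a real camera frame

def call_model(task, frame, tool_names):
    # Stand-in for an API call that returns the next tool invocation;
    # a real model would choose based on the frame and the task.
    return ("move_forward", 10)

task = "make a cup of coffee"
log = []
for _ in range(3):  # a real loop would run until the model signals "done"
    frame = get_camera_frame()
    name, arg = call_model(task, frame, list(TOOLS))
    log.append(TOOLS[name](arg))

print(log)  # ['moved 10cm', 'moved 10cm', 'moved 10cm']
```

The design choice worth noting is that each frame is re-sent with the full task and tool list, so the model needs no persistent state beyond the conversation.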
I find the claim that the MATH challenge was "solved" by AI hard to believe. The reason given was "saturation". Could anybody explain it a bit?
Also, in my daily use, I still find a lot of simple math problems that none of the frontier models can solve: long logic puzzles, reasoning over many cases, and particularly geometry problems. I don't know where the 97% number for o1 comes from, but in my experience they score much lower than that, and math, even elementary math, certainly cannot be considered "solved".
As far as I can see, OpenAI has trained their models on all these public problems, so testing on them to record a benchmark is tainted at best, if not outright cheating.
I've found o1 to be entirely useful at math problems that are beyond my own (admittedly modest) skills. I've had it write full proofs of correctness for me (one shot, verified), I've had it optimize equations to reduce necessary precision, I've had it optimize equations to remove specific expensive operations (making them computationally more efficient), and finally I've had it prove a handful of my conjectures, which was helpful for taking algorithmic shortcuts in a security sensitive environment.
Mostly all algebra and calculus, but definitely all problems that most undergrads would struggle with.
It's most useful because it has deep knowledge of related and adjacent conjectures that are well understood, even if you've never heard of them. So it can mix and match things with a lot more ease than a tinkering mathematician.
A very reliable, very unethical test would be to deploy LLMs on the internet as humans and gauge how other humans react (ignore, call them out as LLMs, engage, etc.). There isn't much stopping a company from doing that (there should be!).
I'm working on operationalizing AI, and our Turing test is whether, by watching a screenshare of the AI worker, you can tell that an AI worker (vs. a human) did the task. If you can't, the AI worker passes the test.
The page doesn't seem to define what "killed" or "defeated" means. The LLM being better than a human? The LLM having been trained against the benchmark, making it useless?
If this doesn't show overfitting, I don't know what would.
See: https://news.ycombinator.com/item?id=42478098
ARC-AGI-1 will be replaced by ARC-AGI-2. So yes, ARC-AGI-1 was killed.
It would also be nice to see the "unbeaten" list: standardized tests LLMs still fail (for now). e.g. Wozniak's coffee test.
Last week there was a post where slightly changing one of the tests caused LLMs to drop off drastically.
https://epoch.ai/frontiermath/the-benchmark
Too bad the real world isn't like that.