
Killed by LLM

218 points | yz-exodao | 1 year ago | r0bk.github.io

95 comments

[+] Ukv|1 year ago|reply
IMO a critical feature of the Turing test/imitation game, which many modern implementations including this site's linked paper ignore, is that the interrogator talks to both a human and a bot and must decide that one xor the other is a human. So fooling an interrogator means having them choose the bot as human over an actual human, not just judging the bot to be human (while probably judging humans to be human even more frequently).

When the interrogator is only answering "do you think your conversation partner was a human?" individually, bots can score fairly highly simply by giving little information in either direction - like pretending to be a non-english-speaking child, or sending very few messages.

Whereas when pitted against a human, the bot is forced to give stronger or equally strong evidence of being human as the average human (over enough tests). To be chosen as human, giving 0 evidence becomes a bad strategy when the opponent (the real human) is likely giving some positive non-zero evidence towards their personhood.

[+] fastball|1 year ago|reply
That's not the original Turing test either. The original imitation game as proposed by Turing involves reading a text transcript of a human and a computer and having the evaluator determine which is which. The evaluator does not interact directly with the conversing parties.
[+] silisili|1 year ago|reply
I'm skeptical of the claim. I think most folks, given the test you describe, would be able to pick out which is human. I think it can get there, but I'm not sure anyone has made one yet. ChatGPT responses are heavily downvoted and mocked because they're easy to spot.

Does there exist a public LLM that isn't so...wordy, excited, and guardrailed all the time?

You can pretty much spot the bot today by prompting something horribly offensive. Their response is always very inhuman, probably due to lack of emotional energy.

[+] lamename|1 year ago|reply
Posted by Chollet himself:

> I don't think people really appreciate how simple ARC-AGI-1 was, and what solving it really means. It was designed as the simplest, most basic assessment of fluid intelligence possible. Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar situations.

> Passing it means your system exhibits non-zero fluid intelligence -- you're finally looking at something that isn't pure memorized skill. But it says rather little about how intelligent your system is, or how close to human intelligence it is.

https://bsky.app/profile/fchollet.bsky.social/post/3les3izgd...

[+] energy123|1 year ago|reply
> Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar situations.

Not necessarily. Get a human to solve ARC-AGI if the problems are shown as a string. They'll perform badly. But that doesn't mean that humans can't reason. It means that human reasoning doesn't have access to the non-reasoning building blocks it needs (things like concepts, words, or in this case: spatially local and useful visual representations).

Humans have good resolution-invariant visual perception. For example, take an ARC-AGI problem, and for each square, duplicate it a few times, increasing its resolution from X*X to 2X*2X. To a human, the problem will be almost exactly as difficult. Not for LLMs that have to deal with 4x as much context. Maybe for an LLM if it can somehow reason over the output of a CNN, and if it was trained to do that like how humans are built to do that.
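(A minimal sketch of the transform described above, assuming NumPy and a made-up 3x3 grid; real ARC-AGI tasks use up to 30x30 grids with values 0-9:)

```python
import numpy as np

# Hypothetical 3x3 ARC-style grid; cell values stand for colors.
grid = np.array([
    [0, 1, 0],
    [1, 2, 1],
    [0, 1, 0],
])

def upscale(grid: np.ndarray, k: int = 2) -> np.ndarray:
    """Duplicate every cell k times along both axes, turning an
    X*X grid into a kX*kX grid with identical visual structure."""
    return np.repeat(np.repeat(grid, k, axis=0), k, axis=1)

big = upscale(grid, 2)
print(big.shape)             # (6, 6)
print(big.size // grid.size) # 4 -- 4x the tokens when serialized as text
```

To a human looking at the rendered image the two grids are the same puzzle; to a model consuming the serialized text, the token count quadruples.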

[+] refulgentis|1 year ago|reply
Honestly, after that, I'm tuned out completely on him and ARC-AGI. Nice minor sidestory at one point in time.

He's right that this isn't solving all human-intelligence domain level problems.

But the whole stunt, this whole time, was that this was the ARC-AGI benchmark.

The conceit was the fact LLMs couldn't do well on it proved they weren't intelligent. And real researchers would step up to bench well on that, avoiding the ideological tarpit of LLMs, which could never be intelligent.

It's fine to turn around and say "My AGI benchmark says little about intelligence", but, the level of conversation is decidedly more that of punters at the local stables than rigorous analysis.

[+] 0xDEAFBEAD|1 year ago|reply
I assumed this was about chatbot users committing suicide in order to "join" the bot they are chatting with. It's already happened a couple of times, apparently:

https://futurism.com/teen-suicide-obsessed-ai-chatbot

https://garymarcus.substack.com/p/the-first-known-chatbot-as...

[+] cdev_gl|1 year ago|reply
Yea, I too was not expecting a list of past benchmarks. If not the aforementioned actual human deaths, I had expected either a list of companies whose pivot to AI/LLMs led to their downfall (but I guess we're going to need to wait a year or two for that) or a list of industries (such as audio transcription) that are being killed by AI as we speak.

We really do live in interesting times. Usually I feel pretty confident about predicting how a trend will continue, but as it is the only prediction I can make with confidence for this latest AI research is that it is and will be used by militaries to kill a lot of people. Oh, hey, that's another thing this article could have listed!

Outside of that, all bets are open. Possible wagers include: "Turns out to be mostly useful in specific niche applications and only seemingly useful anywhere else", "Extremely useful for businesses looking to offset responsibility for unpopular decisions", "Ushers in an end to work and a golden age for all mankind", "Ushers in an end to work and a dark age for most of the world", "Combines with profit motives to damage all art, culture, and community", etc etc.

I know many folk have strong opinions one way or the other, but I think it's literally anyone's game at this point, though I will say I'm not leaning optimistic.

[+] nayuki|1 year ago|reply
I thought the title meant that a chatbot gave bad medical, engineering, and/or safety-critical advice that a human ended up following.
[+] sam0x17|1 year ago|reply
Similarly I thought it would be about ML and data projects that have become defunct due to the advent of LLMs.
[+] rasz|1 year ago|reply
Using people with severe mental health problems might be a poor benchmark of performance.
[+] aitchnyu|1 year ago|reply
I thought it was a credible source of actual jobs replaced by LLMs. When I see headlines like this, I ad hominem the source as an unprofitable company CEO, big consulting firm, bootcamp seller, etc.
[+] ultrablack|1 year ago|reply
The tortoise lays on its back, its belly baking in the hot sun, beating its legs trying to turn itself over, but it can't. Not without your help. But you're not helping.
[+] mindcrime|1 year ago|reply
Describe in single words, only the good things that come into your mind about your mother.
[+] matt3210|1 year ago|reply
I read recently that small variations in the tests cause failures by large margins.

If this doesn't show overfitting, I don't know what would.

[+] friend_Fernando|1 year ago|reply
Eventually, all the better AGI tests should have large private evaluation datasets with no possible cheating or feedback loops. We're getting there.
[+] lxgr|1 year ago|reply
Wasn’t that for human tests, i.e. not specifically AI benchmarks? Benchmarks should generally not be game-able by overfitting.
[+] yamrzou|1 year ago|reply
ARC-AGI is not yet killed by LLM. O3 achieved a breakthrough only on ARC-AGI-PUB, which is semi-private. Nothing guarantees that the test data wasn't leaked to OpenAI in previous testing rounds, because the model is not running offline.

See: https://news.ycombinator.com/item?id=42478098

[+] anon373839|1 year ago|reply
I think this should be discussed more. Models that can only be accessed via API cannot be tested without giving their owners access to the test data. You just have to trust that they’ll do the right thing.
[+] anonymoushn|1 year ago|reply
Interesting choice having a little (i) icon in the Turing Test card whose mouseover brings up no text. Or link icons in that card that do nothing when clicked.
[+] fenomas|1 year ago|reply
Looks like a bug - that card has an overlay at a higher z-index that obscures its mouseover and clicks. In the source the (i) links to Turing's original "Imitation Game" paper, and the (?) has this hover text:

> (?) While the Turing Test remains philosophically significant, modern LLMs can consistently pass it, making it no longer effective at measuring the frontier of AI capabilities.

[+] levocardia|1 year ago|reply
I don't really understand why "Killed by: Saturation" is needed - what other options are there?

It would also be nice to see the "unbeaten" list: standardized tests LLMs still fail (for now). e.g. Wozniak's coffee test.

[+] themanmaran|1 year ago|reply
Wozniak's coffee test would be a really fun one to attempt. As long as you could get a capable enough robot, I imagine it's possible. Something like the Spot Arm[1] would be sufficient.

Something like:

- Key the robot controls to a series of tools (move_forward(x), extend_arm(y))

- Add a camera and pass each frame to the AI model along with the task "make a cup of coffee" and the list of available tools it can call.

And it would likely succeed some percentage of the time today!

[1] https://bostondynamics.com/products/spot/arm/
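(A hypothetical sketch of the loop described above; every name here is made up, and `call_model` is a stub standing in for a real multimodal-model API call:)

```python
# Map tool names to robot actions (stubbed with prints here).
TOOLS = {
    "move_forward": lambda x: print(f"moving forward {x}m"),
    "extend_arm":   lambda y: print(f"extending arm {y}cm"),
}

def call_model(frame, task, tool_names):
    """Placeholder for a real vision-language-model API call:
    given a camera frame, the task, and the available tools,
    return the (tool, argument) pair the model chose."""
    return ("move_forward", 0.5)  # stubbed response

def control_loop(get_frame, task="make a cup of coffee", steps=3):
    for _ in range(steps):
        frame = get_frame()                            # camera frame
        tool, arg = call_model(frame, task, list(TOOLS))
        TOOLS[tool](arg)                               # execute the chosen action

control_loop(get_frame=lambda: b"<jpeg bytes>")
```

The open question is less the loop itself than whether the model's per-frame tool choices compound into a successful multi-minute task rather than drifting off course.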

[+] dleavitt|1 year ago|reply
The layer with the radial gradient you're putting in front of the Turing Test card blocks interaction with it - can't click or hover on its links.
[+] sinuhe69|1 year ago|reply
I find that MATH challenge "solved" by AI hard to believe. The reason given was "saturation". Could anybody help explain it a bit?

Also, in my daily encounters, I still find a lot of simple math problems that all the frontier models cannot solve: long logic puzzles, case-by-case reasoning, and particularly geometry problems. I don't know where the 97% number for o1 comes from, but in my experience the results are much lower than that, and math, even elementary math, certainly cannot be considered "solved". As far as I can see, OpenAI has trained their models on all these public problems, so testing on them to record a benchmark is tainted at best, when not outright cheating.
[+] Taek|1 year ago|reply
I've found o1 to be extremely useful at math problems that are beyond my own (admittedly modest) skills. I've had it write full proofs of correctness for me (one shot, verified), I've had it optimize equations to reduce necessary precision, I've had it optimize equations to remove specific expensive operations (making them computationally more efficient), and finally I've had it prove a handful of my conjectures, which was helpful for taking algorithmic shortcuts in a security-sensitive environment.

Mostly all algebra and calculus, but definitely all problems that most undergrads would struggle with.

It's most useful because it has deep knowledge of related and adjacent conjectures that are well understood, even if you've never heard of them. So it can mix and match things with a lot more ease than a tinkering mathematician.

[+] knowaveragejoe|1 year ago|reply
I didn't know ARC-AGI had been "beaten" by o3. What are the next challenges that frontier models like o1/o3 are faced with?
[+] chriscappuccio|1 year ago|reply
o1 did terrible. o3 did well on arc-agi-pub (public training data) but hasn't passed the private test yet.
[+] alganet|1 year ago|reply
A very reliable, very unethical test would be to deploy LLMs on the internet as humans and gauge how other humans react (ignore, call out as an LLM, engage, etc). There isn't much stopping a company from doing that (there should be!).
[+] erichocean|1 year ago|reply
I'm working on operationalizing AI, and our Turing test is if—by watching a screenshare of the AI worker—you can tell an AI worker (vs. a human) did the task.

If you can't, the AI worker passes the test.

[+] j45|1 year ago|reply
I'm not sure if LLMs have beaten the standards so much as they have the information to reply to them as needed.

Last week there was a post where slightly changing one of the tests caused LLMs to drop off drastically.

[+] solarkraft|1 year ago|reply
The page doesn't seem to define what "killed" or "defeated" means. The LLM being better than a human? The LLM having been trained against the benchmark, making it useless?
[+] mrayycombi|1 year ago|reply
Bragging about how LLMs defeated Maginot-line defenses that can be trained around makes us feel warm and fuzzy.

Too bad the real world isn't like that.