7734128 | 24 days ago
This is almost like asking me to invent a pathfinding algorithm when I've been taught Dijkstra's and A*.
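For context on the analogy: Dijkstra's algorithm is exactly the kind of thing that is abundantly represented in training data. A minimal sketch (my own illustration, not from the article):

```python
import heapq

def dijkstra(graph, start):
    """Shortest-path distances from start.
    graph maps node -> list of (neighbor, edge_weight)."""
    dist = {start: 0}
    heap = [(0, start)]  # (distance-so-far, node)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

g = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
print(dijkstra(g, "a"))  # {'a': 0, 'b': 1, 'c': 3}
```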
NitpickLawyer|24 days ago
A pertinent quote from the article (which is a really nice read; I'd recommend reading it in full at least once):
> Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects. My goal with Opus 4.6 was to again test the limits.
lossolo|24 days ago
How many agents did they use with previous Opus? 3?
You've chosen an argument that works against you, because they actually could do that if they were trained to.
Give them the same post-training (recipes/steering) and the same datasets, and voilà, they'll be capable of the same thing. What do you think is happening there? Did Anthropic inject magic ponies?
f311a|23 days ago
And keep in mind, the original creators of the first compilers had to come up with everything: lexical analysis -> parsing -> IR -> codegen -> optimization. LLMs are not yet capable of producing much genuine novelty. There are many areas in compilers that could be optimized right now, but LLMs can't help with that.
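The pipeline stages named above can be sketched end-to-end on a toy language. This is a minimal illustration of my own (not from the article), with an interpreter standing in for real codegen, handling expressions like `1 + 2 * (3 + 4)`:

```python
import re

def lex(src):
    # Lexical analysis: split source into integer and operator tokens.
    return re.findall(r"\d+|[+*()]", src)

def parse(tokens):
    # Recursive-descent parsing: expr := term ('+' term)*, term := atom ('*' atom)*
    pos = [0]
    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None
    def atom():
        tok = tokens[pos[0]]; pos[0] += 1
        if tok == "(":
            node = expr(); pos[0] += 1  # consume ")"
            return node
        return ("num", int(tok))
    def term():
        node = atom()
        while peek() == "*":
            pos[0] += 1
            node = ("*", node, atom())
        return node
    def expr():
        node = term()
        while peek() == "+":
            pos[0] += 1
            node = ("+", node, term())
        return node
    return expr()

def to_ir(ast, out):
    # Lower the AST to a flat stack-machine IR.
    if ast[0] == "num":
        out.append(("push", ast[1]))
    else:
        to_ir(ast[1], out); to_ir(ast[2], out)
        out.append(("add",) if ast[0] == "+" else ("mul",))
    return out

def run(ir):
    # Stand-in for codegen: interpret the IR directly.
    stack = []
    for op in ir:
        if op[0] == "push":
            stack.append(op[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if op[0] == "add" else a * b)
    return stack[0]

ir = to_ir(parse(lex("1 + 2 * (3 + 4)")), [])
print(run(ir))  # 15
```

Each stage here follows a textbook recipe, which is the commenter's point: reproducing a known design is a very different task from inventing one.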
fatherwavelet|24 days ago
Then they start improvising, and the same person counters with "what a bunch of slop, just making things up!"
calebhwin|24 days ago
[deleted]
zephen|24 days ago
They only have to keep reiterating this because people are still pretending the training data doesn't contain the information it plainly does.
> It's not like any LLM could regurgitate millions of LoC 1:1 from any training set... This is not how it works.
Maybe not any old LLM, but Claude gets really close.
https://arxiv.org/pdf/2601.02671v1
lunar_mycroft|24 days ago
(I'm not claiming this is what actually happened here, just pointing out that memorization is a lot more plausible/significant than you say)
[0] https://www.theregister.com/2026/01/09/boffins_probe_commerc...