gjulianm | 21 days ago
I mean, $20k in tokens, plus the supervision by the author to keep things running, plus the number of people that got involved according to the article... doesn't look like "a weekend project".
> Building a C compiler which can correctly compile (maybe not link) the modern linux kernel is damn hard.
Is it correctly compiling it? Several people have pointed out that the compiler will not emit errors for clearly invalid code. What code is it actually generating?
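To make "clearly invalid code" concrete, here is a made-up snippet (not taken from the linked reports) containing constraint violations that a conforming C compiler is required to diagnose; a compiler that accepts this silently and emits *something* cannot be said to be compiling correctly:

```c
int main(void) {
    int x = "hello" + 1.5;  /* invalid: pointer + double, then pointer assigned to int */
    return x;
}
```

GCC and Clang both reject this with a hard error; the question raised in the issue tracker is what code a compiler that accepts it is actually generating.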
> Building a C compiler which can correctly compile sqlite and pass the test suite at any speed is damn hard.
It's even harder to square "correctly compiles SQLite and passes the test suite" with the fact that the resulting SQLite binary fails to execute certain queries (see https://github.com/anthropics/claudes-c-compiler/issues/74).
> which, in comparison with a correct modern C compiler, is far less performance critical, complex, broad, etc.
That code might be less complex for us, but more complex for an LLM if it has to deal with lots of domain-specific context and without a test suite that has been developed for 40 years.
Also, if the end result of the LLM's work has the problem Anthropic concedes here, namely that the project is so fragile that bug fixes or improvements are really hard or nearly impossible, that still matters.
> it really seems that the complaints here aren't about the LLMs themselves, or the agents, but about what people/organizations do with them, which is then a complaint about people, but not the technology
It's a discussion about what LLMs can actually do and how people represent those achievements. We're pointing out that LLMs, without human supervision, generate bad code: code that's hard to change, with modifications made specifically to address failing tests without challenging the underlying assumptions, code that's inconsistent and hard to understand even for the LLMs themselves.
But some people take whatever the LLM outputs at face value, and then claim capabilities for the models that aren't really there. They're still not viable without human supervision, and because the AI labs are focusing on synthetic benchmarks, they're creating models that are better at pushing crappy code through to achieve a goal.