andrewshawcare | 24 days ago

It used the best tests it could find for existing compilers. This is effectively steering Claude to a well-defined solution.

Hard to find fully specified problems like this in the wild.

I think this is more a testament to small, well-written tests than it is to agent teams. I imagine you could do the same thing with any frontier model and a single agent in a linear flow.

I don’t know why people use parallel agents and increase accidental complexity. Isn’t one agent fast enough? Why lose accuracy over +- one week to write a compiler?

> Write extremely high-quality tests

> Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem. Improving the testing harness required finding high-quality compiler test suites, writing verifiers and build scripts for open-source software packages, and watching for mistakes Claude was making, then designing new tests as I identified those failure modes.

> For example, near the end of the project, Claude started to frequently break existing functionality each time it implemented a new feature. To address this, I built a continuous integration pipeline and implemented stricter enforcement that allowed Claude to better test its work so that new commits can’t break existing code.
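
The "new commits can't break existing code" rule in that last quote can be sketched as a regression gate. This is only a minimal sketch of the idea, not the pipeline the author built: the function names and test IDs are hypothetical, and it assumes the harness can produce the set of passing test IDs before and after a commit.

```python
def find_regressions(previously_passing: set[str], now_passing: set[str]) -> list[str]:
    """Return tests that passed before the commit but no longer pass."""
    return sorted(previously_passing - now_passing)


def gate_commit(previously_passing: set[str], now_passing: set[str]) -> tuple[bool, list[str]]:
    """Accept a commit only if it breaks nothing that used to pass.

    Newly passing tests are welcome; tests that were already failing
    do not block the commit.
    """
    broken = find_regressions(previously_passing, now_passing)
    return (len(broken) == 0, broken)


if __name__ == "__main__":
    # Hypothetical test IDs, for illustration only.
    before = {"parse_struct", "codegen_loop", "preprocess_macro"}
    after = {"parse_struct", "preprocess_macro", "inline_asm"}
    accepted, broken = gate_commit(before, after)
    print("accepted" if accepted else f"rejected, regressions: {broken}")
```

The point is that the gate compares against what used to pass rather than requiring the whole suite to be green, so the agent can keep adding features while partial failures remain.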

tantalor|24 days ago

Why didn't Claude realize on its own that it needed a continuous integration pipeline?

Far too much human intervention here.

sublimefire|24 days ago

> Isn’t one agent fast enough? Why lose accuracy over +- one week to write a compiler?

My thinking as well. IMO it is because you would otherwise wait longer for results; you basically want to shorten the loops to improve the system. It hints that most of what we see comes down to the challenge of seeding a good context so the model can succeed over many iterations.

krzat|24 days ago

You know what else is well specified? An LLM improving on itself.

widdershins|24 days ago

I wouldn't describe intelligence as well specified. We can't even agree on what it is.

GalaxyNova|24 days ago

> Hard to find fully specified problems like this in the wild.

This is such a big and obvious cope. This is obviously a very real problem in the wild and there are many, many others like it. Probably most problems are like this honestly or can be made to be like this.

anematode|24 days ago

Impressive, my sarcasm/bait detector almost failed me.