dboreham | 4 days ago
Anyway, I began this project while on vacation (again), then completed it while attending a conference, so the work wasn't 100% duty cycle. That said, it took about a month from the beginning to the current state. You can see in the linked article almost all the LLM sessions that built the project.
LLMs do seem to be a bit narcissistic, as you've alluded to -- confidently declaring they have implemented "PRI PAR", for example, but conveniently not mentioning that they only parsed the keywords and didn't in fact implement the priority semantics. This reminds me of less experienced developers I've managed in the past. Loath to deliver bad news.
This project was all done with Claude. When I began I was given the Opus 4.5 model, but fairly early in the timeline Anthropic enabled the new Opus 4.6 model. This was before its official release, so I'm not sure whether they have a rollout policy that targeted me or my project. Anyway, most of the work was Opus 4.6.
Overall I learned a tremendous amount about what today's frontier models can do: I could probably give 4-5 talks on various things I noticed, or talk for a few hours over beers. My general takeaway was that the experience was uncannily similar to developing software as a human, or running a team of somewhat less experienced humans. A fun time to be alive for sure.
Rochus | 4 days ago
In contrast to my experiences with e.g. Gemini 3 Pro, where the LLM regularly claimed to have reached full feature scope in each iteration only for the result to turn out to be full of stubs, Devin at least doesn't pull my leg and delivers what was agreed. Unfortunately, debugging and fixing takes much more time than generating the initial version (by about a factor of five). But so far I've never run an LLM project over as long a time as you did; it must have cost a fortune.
dboreham | 4 days ago
I find that it's uncannily like running a team of eager but not-too-experienced engineers: those humans would also show up claiming to have "finished". I'd say, "Well, does it run such-and-such test OK?" They'd go away and come back a few days later... The LLM acts much the same. You have to keep it on a short leash, but when it gets cracking on a problem it's amazing to watch. E.g. I saw it write countless test programs on the fly to diagnose a parser hang bug. It would try this and that, binary-chopping on the problematic source file. If I were doing that myself I'd need a few strong coffees before diving in.