top | item 46515862

tannedNerd | 1 month ago

The problem with this is that none of it is production quality. You haven't done edge-case testing for user mistakes, a security audit, or even basic maintainability work.

Yes, Opus 4.5 seems great, but most of the time it tries to vastly overcomplicate a solution. Its answer will be 10x harder to maintain and debug than the simpler solution a human would have created by thinking about the constraints of keeping code working.

structural|1 month ago

Yes, but my junior coworkers don't reliably do edge-case testing for user errors either, unless specifically tasked to do so, likely with a checklist of the specific kinds of user errors they need to check for.

And it turns out the quality of output you get from both the humans and the models is highly correlated with the quality of the specification you write before you start coding.

Letting a model run amok within the constraints of your spec is actually great for specification development! You get instant feedback of what you wrongly specified or underspecified. On top of this, you learn how to write specifications where critical information that needs to be used together isn't spread across thousands of pages - thinking about context windows when writing documentation is useful for both human and AI consumers.

sksishbs|1 month ago

The best specification is code. English is a very poor approximation.

I can’t get past the fact that by the time I write up an adequate spec and review the agent’s code, I probably could have done it myself by hand. It’s not like typing was ever remotely close to the slow part.

AI, agents, etc are insanely useful for enhancing my knowledge and getting me there faster.

ncruces|1 month ago

How will those juniors ever grow up to be seniors now?

pseudosavant|1 month ago

Isn't it though? I've worked with plenty of devs who shipped much lower quality code into production than I see Claude 4.5 or GPT 5.2 write. I find that SOTA models are more likely to: write tests, leave helpful comments, name variables in meaningful ways, check if the build succeeds, etc.

Stuff that seems basic, but that I haven't always been able to count on in my teams' "production" code.

jonas21|1 month ago

I can generally get maintainable results simply by telling Claude "Please keep the code as simple as possible. I plan on extending this later so readability is critical."

tannedNerd|1 month ago

Yeah, some of it is probably related to me primarily using it for SwiftUI, which doesn’t have years of material to scrape. But even then, and even after telling it that iOS 26 exists, it will still claim at least once a session that it doesn’t, so it’s not 100%.

maherbeg|1 month ago

That may be true now, but think about how far we've come in a year alone! This is really impressive, and even if the models don't improve, someone will build skills to attack these specific scenarios.

Over time, I imagine even cloud providers, app stores, etc. could start doing automated security scanning for these types of failure modes, or offer a more restricted version of the experience to ensure safety.

afavour|1 month ago

There's a fallacy in here that is often repeated. We've made it from 0 to 5, so we'll be at 10 any day now! But in reality there are any number of roadblocks that might mean progress halts at 7 for years, if not forever.

usefulposter|1 month ago

This comment addresses none of the concerns raised. It writes off entire fields of research (accessibility, UX, application security) as "just train the models more, bro. Accelerate."

bgirard|1 month ago

It's not from a few prompts, you're right. But if you layer on some follow-up prompts to add proper test suites, run some QA, etc., then the quality gets better.

I predict in 2026 we're going to see agents get better at running their own QA, and also get better at not just disabling failing tests. We'll continue to see advancements that will improve quality.

zamalek|1 month ago

I think someone around here said: LLMs are good at increasing entropy, while experienced developers become good at reducing it. Those follow-up prompts sounded additive, which is exactly where the problem lies. Yes, you might have tests, but no, that doesn't mean your code base is approachable.

cyberpunk|1 month ago

You should try it with BEAM languages and the 'let it crash' style of programming. With pattern matching and a process isolated per request, you basically only need to code the happy path, and if garbage comes in you just let the process crash. Combined with the TDD plugin (a bit of a hidden gem), you can absolutely write production-level services this way.
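The BEAM specifics (cheap isolated processes, supervision trees) don't translate directly to mainstream runtimes, but the shape of the idea can be sketched even in plain Java: one task per request, happy-path parsing that simply throws on garbage, and a loop that notes the crash and keeps serving. A hedged illustration only; all names are made up:

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class HappyPathSupervisor {
    // Happy path only: no validation, garbage input simply throws.
    static int parseAmount(String raw) {
        return Integer.parseInt(raw.strip());
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> requests = List.of("10", "garbage", " 32 ");

        for (String req : requests) {
            // One task per request: a crash is confined to this Future.
            Future<Integer> result = pool.submit(() -> parseAmount(req));
            try {
                System.out.println("ok: " + result.get());
            } catch (ExecutionException e) {
                // The "crash": note it and move on; other requests are untouched.
                System.out.println("crashed: " + e.getCause().getMessage());
            }
        }
        pool.shutdown();
    }
}
```

The point is where the error handling lives: nowhere in the business logic, and exactly once at the boundary that owns the request.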

layer8|1 month ago

Crashing is the good case. What people worry about is silent data corruption, or other quietly incorrect logic, in cases you didn’t explicitly test for.

vbezhenar|1 month ago

You don't need BEAM languages. I'm using Java and I always write my code in "let it crash" style, to spend time on happy paths and avoid spending time on error handling. I think that's the only sane way to write code and it hurts me to see all the useless error handling code people write.
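Concretely, the shape looks something like this (a minimal Java sketch with hypothetical names, not anyone's real production code): the business logic codes only the happy path, and a single top-level boundary converts any crash into an error response for that one request.

```java
import java.util.Map;

public class LetItCrash {
    // Happy path: assume "amount" is present and numeric; if not, this throws.
    static int charge(Map<String, String> params) {
        int amount = Integer.parseInt(params.get("amount"));
        return amount * 2; // stand-in for real business logic
    }

    // The one place errors are handled: a crash never escapes a request.
    static String dispatch(Map<String, String> params) {
        try {
            return "200 OK: " + charge(params);
        } catch (RuntimeException e) {
            return "500 Internal Server Error";
        }
    }

    public static void main(String[] args) {
        System.out.println(dispatch(Map.of("amount", "21")));   // happy path
        System.out.println(dispatch(Map.of("amount", "junk"))); // crashes, contained
        System.out.println(dispatch(Map.of()));                 // missing key crashes too
    }
}
```

Both the malformed value and the missing key surface as a `NumberFormatException`, which the dispatcher turns into a 500 without any defensive code in `charge` itself.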

LatencyKills|1 month ago

Agree... but that is exactly what MVPs are. Humans have been shipping MVPs while calling them production-ready for decades.

adriand|1 month ago

> Its answer will be 10x harder to maintain and debug

Maintained and debugged by whom? It's just going to be Opus 4.5 (and 4.6... and 5... etc.) maintaining and debugging it. And I don't think it minds, and I also think it will be quite good at it.

joelthelion|1 month ago

Depends on the application. In many cases it's good enough.

mikert89|1 month ago

It's so much easier to create production-quality software.