top | item 45577341

thor-rodrigues | 4 months ago

I think what we should really ask ourselves is: “Why do LLM experiences vary so much among developers?”

The simplest explanation would be “You’re using it wrong…”, but I have the impression that this is not the primary reason. (Although, as an AI systems developer myself, you would be surprised by the number of users who simply write “fix this” or “generate the report” and then expect an LLM to correctly produce the complex thing they have in mind.)

It is true that there is an “upper management” hype of trying to push AI into everything as a magic solution for all problems. There is certainly an economic incentive from a business valuation or stock price perspective to do so, and I would say that the general, non-developer public is mostly convinced that AI is actually artificial intelligence, rather than a very sophisticated next-word predictor.

While claiming that an LLM cannot follow a simple instruction sounds, at best, very unlikely, it remains true that these models cannot reliably deliver complex work.

WA|4 months ago

Another theory: you have some spec in your mind, write down most of it and expect the LLM to implement it according to the spec. The result will be objectively a deviation from the spec.

Some developers will either retrospectively change the spec in their head or are basically fine with the slight deviation. Other developers will be disappointed, because the LLM didn't deliver on the spec they clearly hold in their head.

It's a bit like a psychological false-memory effect: you misremember what you specified, and/or some people are more flexible in their expectations and accept "close enough" while others won't.

At least, I noticed both behaviors in myself.

benashford|4 months ago

This is true. But, it's also true of assigning tasks to junior developers. You'll get back something which is a bit like what you asked for, but not done exactly how you would have done it.

Both situations need an iterative process to fix and polish before the task is done.

The notable thing for me was, we crossed a line about six months ago where I'd need to spend less time polishing the LLM output than I used to have to spend working with junior developers. (Disclaimer: at my current place-of-work we don't have any junior developers, so I'm not comparing like-with-like on the same task, so may have some false memories there too.)

But I think this is why some developers have good experiences with LLM-based tools. They're not asking "can this replace me?"; they're asking "can this replace those other people?"

scuff3d|4 months ago

This implies that it executes the spec correctly, just not in a way that's expected. But if you actually look at how these things operate, that's flat out not true.

Mitchell Hashimoto just did a write-up about his process for shipping a new feature for Ghostty using AI. He clearly knows what he's doing and follows all the AI "best practices" as far as I could tell. And while he very clearly enjoyed the process and thinks it made him more productive, the post is also a laundry list of this thing just shitting the bed. It gets confused, can't complete tasks, and architects the code in ways that don't make sense. He clearly had to watch it closely, step in regularly, and in some cases throw the code out entirely and write it himself.

The amount of work I've seen people describe to get "decent" results is absurd, and a lot of people just aren't going to do that. For my money it's far better as a research assistant and something to bounce ideas off of. Or if it is going to write something it needs to be highly structured input with highly structured output and a very narrow scope.
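The "highly structured input, highly structured output, very narrow scope" pattern can be made concrete: agree on the output shape up front and reject anything off-contract instead of accepting free-form prose. A minimal stdlib-only sketch; the schema and field names here are made up for illustration:

```python
import json

# The narrow contract we expect the model's reply to fill in: nothing more.
REQUIRED_FIELDS = {"summary": str, "risk_level": str, "action_items": list}

def validate_output(raw: str) -> dict:
    # Parse the reply and enforce the agreed structure; anything
    # off-contract is rejected instead of silently accepted.
    data = json.loads(raw)
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"missing or malformed field: {field}")
    return data

# A well-formed reply passes...
ok = validate_output('{"summary": "x", "risk_level": "low", "action_items": []}')

# ...while a reply missing a field fails fast rather than flowing downstream.
try:
    validate_output('{"summary": "x"}')
except ValueError as e:
    print(e)  # missing or malformed field: risk_level
```

The point is that the validation step, not the prompt, is what keeps the scope narrow: the model's output either fits the contract or gets retried.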

sussmannbaka|4 months ago

What I want to see at this point are more screencasts, write-ups, anything really, that depict the entire process of how someone expertly wrangles these products to produce non-trivial features. There's AI influencers who make very impressive (and entertaining!) content about building uhhh more AI tooling, hello worlds and CRUD. There's experienced devs presenting code bases supposedly almost entirely generated by AI, who when pressed will admit they basically throw away all code the AI generates and are merely inspired by it. Single-shot prompt to full app (what's being advertised) rapidly turns to "well, it's useful to get motivated when starting from a blank slate" (ok, so is my oblique strategies deck but that one doesn't cost 200 quid a month).

This is just what I observe on HN, I don't doubt there's actual devs (rather than the larping evangelist AI maxis) out there who actually get use out of these things but they are pretty much invisible. If you are enthusiastic about your AI use, please share how the sausage gets made!

structural|4 months ago

These things are amazing for maintenance programming on very large codebases (think 50-100 million lines of code or more; the people who wrote the code no longer work there, and it's not open source, so "just google it or check Stack Overflow" isn't even an option at all).

A huge amount of effort goes into just searching for what relevant APIs are meant to be used without reinventing things that already exist in other parts of the codebase. I can send ten different instantiations of an agent off to go find me patterns already in use in code that should be applied to this spot but aren't yet. It can also search through a bug database quite well and look for the exact kinds of mistakes that the last ten years of people just like me made solving problems just like the one I'm currently working on. And it finds a lot.

Is this better than having the engineer who wrote the code and knows it very well? Hell no. But you don't always have that. And at the largest scale you really can't, because it's too large to fit in any one person's memory. So it certainly does devolve to searching and reading and summarizing for a lot of the time.

ehnto|4 months ago

Well we are all doing different tasks on different codebases too. It's very often not discussed, even though it's an incredibly important detail.

But the other thing is that your expectations normalise, and you will hit its limits more often if you rely on it more. You will inevitably be unimpressed by it the longer you use it.

If I use it here and there, I am usually impressed. If I try to use it for my whole day, I am thoroughly unimpressed by the end, having had to re-do countless things it "should" have been capable of based on my own past experience with it.

mexicocitinluez|4 months ago

> Well we are all doing different tasks on different codebases too. It's very often not discussed, even though it's an incredibly important detail.

Absolutely nuts I had to scroll down this far to find the answer. Totally agree.

Maybe it's the fact that every software development job has different priorities, stakeholders, features, time constraints, programming models, languages, etc. Just a guess lol

lm28469|4 months ago

> The simplest explanation would be...

The simplest explanation is that most of us are code monkeys reinventing the same CRUD wheel over and over again, gluing things together until they kind of work and calling it a day.

"developers" is such a broad term that it basically is meaningless in this discussion

mexicocitinluez|4 months ago

or, and get this, software development is an enormous field with 100s of different kinds of variations and priorities and use cases.

lol.

another option is trying to convince yourself that you have any idea what the other 2,000,000 software devs are doing and think you can make grand, sweeping statements about it.

there is no stronger mark of a junior than the sentiment you're expressing

yoyohello13|4 months ago

I'm convinced it's not the results that are different, it's the expectations.

The SVP of IT for my company is 100% in on AI. He talks about how great it is all the time. I just recently worked on a legacy PHP project he built years ago, and now I know his bar for what quality code looks like is extremely low...

I use LLMs daily to help with my work, but I tweak the output all the time because it doesn't quite get it right.

Bottom line, if your code is below average AI code will look great.

nasmorn|4 months ago

That seems to be the pitch too. I now get google ads where they advertise you can ask your phone things about what it sees. All the examples are so trivial, I can’t believe it. How to make a double espresso. What are these clouds called?

That’s being not a complete idiot as a service.

If it was at least how do I start the decalcification process on this machine so it actually realizes it and turns the service light off.

tovej|4 months ago

I would say they can't reliably deliver simple work. They often can, but reliability, to me, means I can expect it to work every time, or at least as often as any other software tool, with failure rates somewhere in the vicinity of 1 in 10^5 or 1 in 10^6. LLMs fail on the order of 1 in 10 times for simple work, and rarely succeed at complex work.

That is not reliable, that's the opposite of reliable.
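To put numbers on why per-step reliability matters so much: if every step of a task must succeed and each fails independently, whole-task success decays geometrically with the number of steps. A quick sketch using the rates quoted above (the 20-step task length is an arbitrary illustration):

```python
# Probability that an N-step task succeeds when every step must succeed,
# given a fixed per-step failure rate.
def task_success(per_step_failure: float, steps: int) -> float:
    return (1.0 - per_step_failure) ** steps

llm_rate = 0.1    # ~1 in 10 per-step failures (the rate claimed for LLMs)
tool_rate = 1e-5  # ~1 in 10^5 (the rate claimed for conventional tooling)

# For a 20-step task, the gap is dramatic:
print(f"LLM:  {task_success(llm_rate, 20):.3f}")   # ~0.122
print(f"tool: {task_success(tool_rate, 20):.5f}")  # ~0.99980
```

So a per-step rate that sounds merely "ten thousand times worse" turns a near-certain task into a coin flip you lose seven times out of eight.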

krisoft|4 months ago

One has to look at the alternatives. What would I do if not use the LLM to generate the code? The two answers are "coding it myself" and "asking another dev to code it". And neither of those comes anywhere near a 1-in-10^5 failure rate. Not even close.

KronisLV|4 months ago

> I think what we should really ask ourselves is: “Why do LLM experiences vary so much among developers?”

Some possible reasons:

  * different models used by different folks, free vs paid ones, various reasoning effort, quantizations under the hood and other parameters (e.g. samplers and temperature)
  * different tools used, like in my case I've found Continue.dev to be surprisingly bad, Cline to be pretty decent but also RooCode to be really good; also had good experiences with JetBrains Junie, GitHub Copilot is *okay*, but yeah, lots of different options and settings out there
  * different system prompts, various tool use cases (e.g. letting the model run the code tests and fix them itself), as well as everything ranging from simple and straightforward codebases that are a dime a dozen out there (and in the training data) to something genuinely new that would trip up both your average junior dev and the LLMs
  * open ended vs well specified tasks, feeding in the proper context, starting new conversations/tasks when things go badly, offering examples so the model has more to go off of (it can predict something closer to what you actually want), most of my prompts at this point are usually multiple sentences, up to a dozen, alongside code/data examples, alongside prompting the model to ask me questions about what I want before doing the actual implementation
  * also, individual models sometimes produce bad output for specific use cases; I generally rotate between Sonnet 4.5, Gemini Pro 2.5, GPT-5, and also use Qwen 3 Coder 480B running on Cerebras for the simpler tasks I need done quickly

With all of that, my success rate is pretty great, and the statement about the tech not being able to "...barely follow a simple instruction" doesn't hold true for me. Then again, most of my projects are webdev adjacent in mostly mainstream stacks, YMMV.
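On the samplers/temperature point in the first bullet: temperature rescales the model's logits before sampling, so the exact same model can behave near-deterministically or very loosely depending on configuration, which alone explains some of the variance in experiences. A self-contained toy sketch (made-up logits, not a real model):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature before the softmax: low temperature
    # sharpens the distribution (near-greedy decoding), high temperature
    # flattens it (more varied, more surprising output).
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                        # toy next-token scores
cold = softmax_with_temperature(logits, 0.2)    # near-greedy
hot = softmax_with_temperature(logits, 2.0)     # much flatter

print(cold[0] > hot[0])  # True: the top token dominates far more when cold
```

Two people "using the same model" through different tools may effectively be sampling from quite different distributions.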

surgical_fire|4 months ago

> Then again, most of my projects are webdev adjacent in mostly mainstream stacks

This is probably the most significant part of your answer. You are asking it to do things for which there are a ton of examples in the training data. You described narrowing the scope of your requests too, which tends to work better.

impossiblefork|4 months ago

It's true though, they can't. It really depends on what they have to work with.

In the fixed world of mathematics, everything could in principle be great. In software, it can in principle be okay, even though contexts might be longer. But when you give them a context that is like real life yet different (say, a story where nobody can communicate with the main characters because they speak a different language), the models simply can't deal with it, always drifting back to the contexts they're familiar with.

When you give them contexts that are different enough from the kind of texts they've seen, they do indeed fail to follow basic instructions, even though they can follow seemingly much more difficult instructions in other contexts.

lnrd|4 months ago

> I think what we should really ask ourselves is: “Why do LLM experiences vary so much among developers?”

My hypothesis is that developers work on different things, and while these models might work very well for some domains (react components?) they will fail quickly in others (embedded?). So on one side we have developers working on X (LLM good at it) claiming that it will revolutionize development forever, and on the other side we have developers working on Y (LLM bad at it) claiming that it's just a fad.

h33t-l4x0r|4 months ago

I think this is right on, and the things that LLMs excel at (react components was your example) are really the things there's just a ridiculous amount of training data for. This is why LLMs are not likely to get much better at code. They're still useful, don't get me wrong, but the 5x expectations need to get reined in.

wiether|4 months ago

Also the variation is the focus of each person.

Based on my own personal experience:

- on some topics, I get the x100 productivity that is pushed by some devs; for instance, this Saturday I was able to build two features I had been rescheduling for years because, for lack of knowledge, they would have taken me many days to make; a few back-and-forths with an LLM and everything was working as expected; amazing!

- on other topics, no matter how I expose the issue to an LLM, at best it tells me that it's not solvable, at worst it tries to push an answer that doesn't make any sense, and an even worse one when I point that out...

And when people ask me what I think about LLMs, I say: "that's nice and quite impressive, but it still can't be blindly trusted and needs a lot of overhead, so I suggest caution".

I guess it's the classic half empty or half full glass.

logicchains|4 months ago

>I think what we should really ask ourselves is: “Why do LLM experiences vary so much among developers?”

Two of the key skills needed for effective use of LLMs are writing clear specifications (written communication), and management, skills that vary widely among developers.

skydhash|4 months ago

There's no clearer specification than code, and I can manage my toolset just fine (lines of config, aliases, and whatnot to make my job easier). That lets me deliver good results fast without worrying whether it got it right this time.

smt88|4 months ago

I sometimes meet devs who are "using it wrong" with under-baked prompts.

But mostly my experience is that people who regularly get good output from AI coding tools fall into these buckets:

A) Very limited scope (e.g. single, simple method with defined input/output in context)

B) Aren't experienced enough in the target domain to see the problems with the AI's output (let's call this "slop blindness")

C) Use AI to force multiple iterations of the same prompt to "shake out the bugs" automatically instead of using the dev's time

I don't see many cases outside of this.
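Bucket C above is essentially a generate-and-test loop: keep re-running the prompt against a known-good check until something passes, spending machine time instead of dev time. A minimal sketch; `llm_generate` is a hypothetical stub standing in for a real model call, not an actual API:

```python
def llm_generate(prompt: str, attempt: int) -> str:
    # Hypothetical stand-in for a model call; a real loop would feed the
    # failing test output back into the next prompt.
    candidates = [
        "def add(a, b):\n    return a - b",  # buggy first draft
        "def add(a, b):\n    return a + b",  # fixed on a later attempt
    ]
    return candidates[min(attempt, len(candidates) - 1)]

def passes_tests(code: str) -> bool:
    # Run the candidate against a known-good test case.
    ns: dict = {}
    exec(code, ns)
    return ns["add"](2, 3) == 5

def iterate(prompt: str, max_attempts: int = 5):
    # Re-run the same prompt until the output passes the check,
    # instead of having the developer inspect every draft by hand.
    for attempt in range(max_attempts):
        code = llm_generate(prompt, attempt)
        if passes_tests(code):
            return code
    return None  # give up after max_attempts

print(iterate("write an add function"))  # the second candidate passes
```

The catch, of course, is that this only "shakes out" the bugs your test harness can see.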

bsder|4 months ago

> B) Aren't experienced enough in the target domain to see the problems with the AI's output (let's call this "slop blindness")

Oh, boy, this. For example, I often use whatever AI I have to adjust my Nix files because the documentation for Nix is so horrible. Sure, it's slop, but it gets me working again and back to what I'm supposed to be doing instead of farting with Nix.

I would also argue:

D) The fact that an AI can do the task indicates that something with the task is broken.

If an AI can do the task well, there is something fundamentally wrong. Either the abstractions are broken, the documentation is horrible, the task is pure boilerplate, etc.

JamesSwift|4 months ago

D) Understand and use context creatively. They know when to start new conversations, and how to use the filesystem as context storage.

Ianjit|4 months ago

"expect an LLM to correctly produce the complex thing they have in mind"

My guess is that for some types of work people don't know what the complex thing they have in mind is ex ante. The idea forms and is clarified through the process of doing the work. For those types of tasks there is no efficiency gain in using AI to do the work.

danielbln|4 months ago

Why not? Just start iterating in chunks alongside the LLM and change gears/plan/spec as you learn more. You don't have to one-shot everything.

netdevphoenix|4 months ago

> Why do LLM experiences vary so much among developers?

The question assumes that all developers do the same work. The kind of work done by an embedded dev is very different from the work of a front-end dev which is very different from the kind of work a dev at Jane Street does. And even then, devs work on different types of projects: greenfield, brownfield and legacy. Different kind of setups: monorepo, multiple repos. Language diversity: single language, multiple languages, etc.

Devs are not some kind of monolith army working like robots in a factory.

We need to look at these factors before we even consider any sort of ML.

izacus|4 months ago

My experience is that people claiming they use AI exclusively are usually trying to sell an AI product.

Zababa|4 months ago

I've known lots of people who don't know how to properly use Google, and Google has been around for decades. "You're using it wrong" is partially true; I'd put it more like "it's a new tool that changes very quickly, you have to invest a lot of time to learn how to use it properly, most people using it well have been using it a lot over the last two years, and you won't catch up in an afternoon. Even after all that time, it may not be the best tool for every job" (proof of the last point being Karpathy saying he wrote nanochat mostly by hand).

It is getting easier and easier to get good results out of them, partially by the models themselves improving, partially by the scaffolding.

> non-developer public is mostly convinced that AI is actually artificial intelligence, rather than a very sophisticated next-word predictor

This is a false dichotomy that assumes we know way more about intelligence than we actually do, and also assumes than what you need to ship lots of high quality software is "intelligence".

>While claiming that an LLM cannot follow a simple instruction sounds, at best, very unlikely, it remains true that these models cannot reliably deliver complex work.

"reliably" is doing a lot of work here. If it means "without human guidance" it is true (for now), if it means "without scaffolding" it is true (also for now), if it means "at all" it is not true, if it means it can't increase dev productivity so that they ship more at the same level of quality, assuming a learning period, it is not true.

I think those conversations would benefit a lot from being more precise and more focused, but I also realize that's hard because people have vastly different needs, levels of experience, and expectations; there are lots of tools, some similar, some completely different, etc.

To answer your question directly, ie “Why do LLM experiences vary so much among developers?”: because "developer" is a very very very wide category already (MISRA C on a car, web frontend, infra automation, medical software, industry automation are all "developers"), with lots of different domains (both "business domains" as in finance, marketing, education and technical domains like networking, web, mobile, databases, etc), filled with people with very different life paths, very different ways of working, very different knowledge of AIs, very different requirements (some employers forbid everything except a few tools), very different tools that have to be used differently.

nunez|4 months ago

This is very similar to Tesla's FSD adoption in my mind.

For some (me), it's amazing because I use the technology often despite its inaccuracies. Put another way, it's valuable enough to mitigate its flaws.

For many others, it's on a spectrum between "use it sometimes but disengage any time it does something I wouldn't do" and "never use it" depending on how much control they want over their car.

In my case, I'm totally fine handing driving off to AI (more like ML + computer vision) most times but am not okay handing off my brain to AI (LLMs) because it makes too many mistakes and the work I'd need to do to spot-check them is about the same as I'd need to put in to do the thing myself.

theshrike79|4 months ago

It's a personality thing.

I know Car People who refuse to use even lane keeping assist, because it doesn't fit their driving style EXACTLY and it grates them immensely.

I on the other hand DGAF, I love how I don't need to mess with micro adjustments of the steering wheel on long stretches of the road, the car does that for me. I can spend my brainpower checking if that Gray VW is going to merge without an indicator or not.

Same with LLMs: some people have a very specific style of code they want produced, and anything that's not exactly their style is "wrong" and "useless", even if it does exactly what it should.

XenophileJKO|4 months ago

At some point people won't care to convince you and you will be left to adapt or fade away.

danielbln|4 months ago

That's where I stand now. I use LLMs in an agentic coding setup 10h/day to great effect. If someone doesn't see or realize the value, that's their loss.

ninetyninenine|4 months ago

It’s because people are using different tiers of AI and different models. And many people don’t stick with it long enough to get a more nuanced outlook of AI.

Take Joe. Joe sticks with AI and uses it to build an entire project, hundreds of prompts. Versus your average HNer who thinks he's the greatest programmer in the company and doesn't need AI, but tries it anyway. Then the AI fails, confirming his bias, and he never tries it again.