
Everything around LLMs is still magical and wishful thinking

298 points | troupo | 8 months ago | dmitriid.com | reply

356 comments

[+] tasty_freeze|8 months ago|reply
One thing I find frustrating is that management where I work has heard of 10x productivity gains. Some of those claims even come from early adopters at my work.

But that sets expectations way too high. Partly it is due to Amdahl's law: I spend only a portion of my time coding, and far more time thinking and communicating with others who are customers of my code. Even if it does make the coding 10x faster (and it doesn't most of the time), overall my productivity is 10-15% better. That is nothing to sneeze at, but it isn't 10x.
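Amdahl's law makes that arithmetic concrete. A quick sketch, assuming a hypothetical split where coding is 15% of the job and an LLM makes that part 10x faster:

```python
def amdahl_speedup(fraction_accelerated: float, speedup: float) -> float:
    """Overall speedup when only a fraction of the work is accelerated (Amdahl's law)."""
    return 1 / ((1 - fraction_accelerated) + fraction_accelerated / speedup)

# Hypothetical split: coding is 15% of the job, and an LLM makes it 10x faster.
overall = amdahl_speedup(0.15, 10)
print(f"{(overall - 1) * 100:.0f}% overall gain")  # prints "16% overall gain"
```

Even an infinite coding speedup would cap out around 18% overall under that split, which is why the 10-15% figure is plausible while 10x is not.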

[+] TeMPOraL|8 months ago|reply
Maybe it's due to a more R&D-ish nature of my current work, but for me, LLMs are delivering just as much gains in the "thinking" part as in "coding" part (I handle the "communicating" thing myself just fine for now). Using LLMs for "thinking" tasks feels similar to how mastering web search 2+ decades ago felt. Search engines enabled access to information provided you know what you're looking for; now LLMs boost that by helping you figure out what you're looking for in the first place (and then conveniently searching it for you, too). This makes trivial some tasks I previously classified as hard due to effort and uncertainty involved.

At this point I'd say about 1/3 of my web searches are done through ChatGPT o3, and I can't imagine giving it up now.

(There's also a whole psychological angle in how having an LLM help sort and rubber-duck your half-baked thoughts makes many tasks seem much less daunting, and that alone makes a big difference.)

[+] wubrr|8 months ago|reply
> One thing I find frustrating is that management where I work has heard of 10x productivity gains. Some of those claims even come from early adopters at my work.

Similar situation at my work, but all of the productivity claims from internal early adopters I've seen so far are based on very narrow ways of measuring productivity, and very sketchy math, to put it mildly.

[+] thunky|8 months ago|reply
> One thing I find frustrating is that management where I work has heard of 10x productivity gains.

That may also be in part because llms are not as big of an accelerant for junior devs as they are for seniors (juniors don't know what is good and bad as well).

So if you give 1 senior dev a souped up llm workflow I wouldn't be too surprised if they are as productive as 10 pre-llm juniors. Maybe even more, because a bad dev can actually produce negative productivity (stealing from the senior), in which case it's infinityx.

Even a decent junior is mostly limited to doing the low level grunt work, which llms can already do better.

Point is, I can see how jobs could be lost, legitimately.

[+] louthy|8 months ago|reply
> overall my productivity is 10-15% better. That is nothing to sneeze at, but it isn't 10x.

It is something to sneeze at if you are 10-15% more expensive to employ due to the cost of the LLM tools. The total cost of production should always be considered, not just throughput.

[+] datpuz|8 months ago|reply
It's just another tech hype wave. Reality will be somewhere between total doom and boundless utopia. But probably neither of those.

The AI thing kind of reminds me of the big push to outsource software engineers in the early 2000's. There was a ton of hype among executives about it, and it all seemed plausible on paper. But most of those initiatives ended up being huge failures, and nearly all of those jobs came back to the US.

People tend to ignore a lot of the little things that glue it all together that software engineers do. AI lacks a lot of this. Foreigners don't necessarily lack it, but language barriers, time zone differences, cultural differences, and all sorts of other things led to similar issues. Code quality and maintainability took a nosedive and a lot of the stuff produced by those outsourced shops had to be thrown in the trash.

I can already see the AI slop accumulating in the codebases I work in. It's super hard to spot a lot of these things that manage to slip through code review, because they tend to look reasonable when you're looking at a diff. The problem is all the redundant code that you're not seeing, and the weird abstractions that make no sense at all when you look at it from a higher level.

[+] coolKid721|8 months ago|reply
On my personal projects it's easily 10x faster, if not more in some circumstances. At work, where things are planned out months in advance and I'm working with 5 different teams to figure out the right way to do things for requirements that change 8 times during development? Even just stuff with PR review and making sure other people understand it and can access it. idk, sometimes it's probably break-even, or that 10-15%. It just doesn't work well in some environments, and what really makes it flourish (having super high quality architectural planning/designs/standardized patterns etc.) is basically just not viable at anything but the smallest startups and solo projects.

Frankly, even just getting engineers to agree upon those super specific standardized patterns is asking a ton, especially since lots of the things that help AI out are not what they are used to. As soon as you have stuff that starts deviating, it can confuse the AI and makes that 10x no longer accessible. Also, no one would want to review the PRs I'd make for the changes I do on my "10x" local project... Maintaining those standards is already hard enough on my side projects; AI will naturally deviate and create noise, and the challenge is constructing systems to guide it so that nothing deviates (since noise would lead to more noise).

I think it's mostly a rebalancing thing: if you have 1 or a couple of like-minded engineers who intend to do it, they can get that 10x. I do not see that EVER existing in any actual corporate environment, or even once you get more than like 4 people tbh.

Ai for middle management and project planning on the other hand...

[+] mlinsey|8 months ago|reply
I don't disagree with your assessment of the world today, but just 12 months ago (before the current crop of base models and coding agents like Claude Code), even that 10X improvement of writing some-of-the-code wouldn't have been true.
[+] ericmcer|8 months ago|reply
It's great when they use AI to write a small app “without coding at all” over the weekend and then come in on Monday to brag about it and act baffled that tasks take engineers any time at all.
[+] jppope|8 months ago|reply
The reports from analysis of open source projects are that it's something in the range of 10%-15% productivity gains... so it sounds like you're spot on.
[+] doug_durham|8 months ago|reply
How much of the communication and meetings are because traditionally code was very expensive and slow to create? How many of those meetings might be streamlined or entirely disappear in the future? In my experience there is a lot of process around making sure that software stays on schedule and that it's doing what it is supposed to do. I think that the software lifecycle is about to be reinvented.
[+] deadbabe|8 months ago|reply
Wait till they hear about the productivity gains from using vim/neovim.

Your developers still push a mouse around to get work done? Fire them.

[+] tom_m|8 months ago|reply
Expectations are absolutely way too high. It's going to lead to a lot of toxicity and people being fired. It's really going to suck.
[+] ghuntley|8 months ago|reply
Canva has seen a 30% productivity uplift - https://fortune.com/2025/06/25/canva-cto-encourages-all-5000...

AI is the new uplift. Embrace and adapt, as a rift is forming (see my talk at https://ghuntley.com/six-month-recap/), in what employers seek in terms of skills from employees.

I'm happy to answer any questions folks may have. Currently AFK [2] vibecoding a brand new programming language [1].

[1] https://x.com/GeoffreyHuntley/status/1940964118565212606 [2] https://youtu.be/e7i4JEi_8sk?t=29722

[+] abletonlive|8 months ago|reply
I’m a tech lead and I have maybe 5X output now compared to everybody else under me, quantified by scoring tickets at a team level. I also have more responsibilities outside of IC work compared to the people under me. At this point I’m asking my manager to fire people that still think LLMs are just toys, because I’m tired of working with people with this poor mindset. A pragmatic engineer continually reevaluates what they think they know. We are at a tipping point now. I’m done arguing with people that have a poor model of reality. The rest of us are trying to compete and get shit done. This isn’t an opinion or a game. It’s business, with real-life consequences if you fall behind. I’ve offered to share my workflows, prompts, setup. Guess how many of these engineers have taken me up on my offer: 1-2, and the juniors or the ones that are very far behind have not.
[+] hotpotat|8 months ago|reply
I have to say I’m in the exact camp the author is complaining about. I’ve shipped non-trivial greenfield products which I started back when it was only ChatGPT and it was shitty. I started using Claude by copying and pasting back and forth between the web chat and Xcode. Then I discovered Cursor. It left me with a lot of annoying build errors, but my productivity was still at least 3x. Now that agents are better and Claude 4 is out, I barely ever write code, and I don’t mind. I’ve leaned into the Architect/Manager role and direct the agent with my specialized knowledge when I need to.

I started a job at a demanding startup and it’s been several months and I have still not written a single line of code by hand. I audit everything myself before making PRs and test rigorously, but Cursor + Sonnet is just insane with their codebase. I’m convinced I’m their most productive employee, and that’s not by measuring lines of code, which don’t matter; people who are experts in the codebase ask me for help with niche bugs I can narrow in on in 5-30 minutes as someone who’s fresh to their domain. I had to lay off taking work away from the front-end dev (which I’ve avoided my whole career) because I was stepping on his toes, fixing little problems as I saw them thanks to Claude. It’s not vibe coding; there’s a process of research and planning and proceeding in careful steps, and I set the agent up for success. Domain knowledge is necessary. But I’m just so floored how anyone could not be extracting the same utility from it. It feels like there are two articles like this every week now.

[+] martinald|8 months ago|reply
I personally don't really get this.

_So much_ work in the 'services' industries globally comes down to really a human transposing data from one Excel sheet to another (or from a CRM/emails to Excel), manually. Every (or nearly every) enterprise scale company will have hundreds if not thousands of FTEs doing this kind of work day in day out - often with a lot of it outsourced. I would guess that for every 1 software engineer there are 100 people doing this kind of 'manual data pipelining'.

So really for giant value to be created out of LLMs you do not need them to be incredible at OCaml. They just need to ~outperform humans on Excel. Where I do think MCP really helps is that you can connect all these systems together easily, and a lot of the errors in this kind of work came from trying to pass the entire 'task' in context. If you can take an email via MCP, extract some data out and put it into a CRM (again via MCP) a row at a time the hallucination rate is very low IME. I would say at least a junior overworked human level.
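The row-at-a-time pattern can be sketched roughly like this. The schema, the stub model, and the function names here are all hypothetical; a real setup would route the email and the CRM write through MCP or an API client:

```python
import json
from typing import Callable

def email_to_crm_rows(email_body: str, llm: Callable[[str], str]) -> list[dict]:
    """Ask the model for structured rows; each row is then written to the CRM one at a time,
    keeping the context small so the hallucination rate stays low."""
    prompt = (
        'Extract every order line from the email below as a JSON list of '
        '{"customer": str, "product": str, "quantity": int}.\n\n' + email_body
    )
    return json.loads(llm(prompt))

# Deterministic stub standing in for a real model call, just to show the pipeline shape.
fake_llm = lambda prompt: '[{"customer": "Acme", "product": "Widget", "quantity": 3}]'
rows = email_to_crm_rows("Hi, Acme needs 3 widgets by Friday.", fake_llm)
print(rows)
```

Each returned row would then go through a validation step before the CRM write, which is where the human-style quality checks the next paragraph mentions come in.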

Perhaps this was the point of the article, but non-determinism is not an issue for these kind of use cases, given all the humans involved are not deterministic either. We can build systems and processes to help enforce quality on non deterministic (eg: human) systems.

Finally, I've followed crypto closely and also LLMs closely. They do not seem to be similar in terms of utility and adoption. The closest thing I can recall is smartphone adoption. A lot of my non technical friends didn't think/want a smartphone when the iPhone first came out. Within a few years, all of them have them. Similar with LLMs. Virtually all of my non technical friends use it now for incredibly varied use cases.

[+] deepsquirrelnet|8 months ago|reply
Making a comparison to crypto is lazy criticism. It’s not even worth validating. It’s people who want to take the negative vibe from crypto and repurpose it. The two technologies have nothing to do with each other, and therefore there’s clearly no reason to make comparative technical assessments between them.

That said, the social response is a trend of tech worship that I suspect many engineers who have been around the block are weary of. It’s easy to find unrealistic claims, the worst coming from the CEOs of AI companies.

At the same time, a LOT of people are practically computer illiterate. I can only imagine how exciting it must seem to people who have very limited exposure to even basic automation. And the whole “talking computer” we’ve all become accustomed to seeing in science fiction is pretty much becoming reality.

There’s a world of takes in there. It’s wild.

I worked in ML and NLP several years before AI. What’s most striking to me is that this is way more mainstream than anything that has ever happened in the field. And with that comes a lot of inexperience in designing with statistical inference. It’s going to be the Wild West for a while — in opinions, in successful implementation, in learning how to form realistic project ideas.

Look at it this way: now your friend with a novel app idea can be told to do it themselves. That’s at least a win for everyone.

[+] saulpw|8 months ago|reply
Each FTE doing that manual data pipelining work is also validating that work, and they have a quasi-legal responsibility to do their job correctly and on time. They may have substantial emotional investment in the company, whether survival instinct to not be fired, or ambition to overperform, or ethics and sense to report a rogue manager through alternate channels.

An LLM won't call other nodes in the organization to check when it sees that the value is unreasonable for some out-of-context reason, like yesterday was a one-time-only bank holiday and so the value should be 0. *It can absolutely be worth an FTE salary to make sure these numbers are accurate.* And for there to be a person to blame/fire/imprison if they aren't.

[+] marinmania|8 months ago|reply
>I would guess that for every 1 software engineer there are 100 people doing this kind of 'manual data pipelining'.

For what type of company is this true? I really would like someone to just do a census of 500 white-collar jobs and categorize them all. Anything that is truly automatic has already been automated away.

I do think AI will cause a lot of disruption, but very skeptical of the view that most people with white collar jobs are just "email jobs" or data entry. That doesn't fit my experience at all, and I've worked at some large bureaucratic companies that people here would claim are stuck in the past.

[+] lottin|8 months ago|reply
You're vastly underestimating the complexity of these types of jobs.
[+] labrador|8 months ago|reply
I'm a retired programmer. I can't imagine trusting code generated by probabilities for anything mission-critical. If it were close and just needed minor tweaks, I could understand that. But I don't have experience with it.

My comment is mainly to say LLMs are amazing in areas that are not coding, like brainstorming, blue sky thinking, filling in research details, asking questions that make me reflect. I treat the LLM like a thinking partner. It does make mistakes, but those can be caught easily by checking other sources, or even having another LLM review the conclusions.

[+] standardUser|8 months ago|reply
> Like most skeptics and critics, I use these tools daily. And 50% of the time they work 50% of the time.

I use LLMs nearly every day for my job as of about a year ago and they solve my issues about 90% of the time. I have a very hard time deciphering if these types of complaints about AI/LLMs should be taken seriously, or written off as irrational use patterns by some users. For example, I have never fed an LLM a codebase and expected it to work magic. I ask direct, specific questions at the edge of my understanding (not beyond it) and apply the solutions in a deliberate and testable manner.

If you're taking a different approach and complaining about LLMs, I'm inclined to think you're doing it wrong. And missing out on the actual magic, which is small, useful and fairly consistent.

[+] geuis|8 months ago|reply
Hmm. Ok so you're basically quoting the line from The Weatherman "60% of the time, it works all of the time."

I also use gpt and Claude daily via cursor.

GPT o3 is kinda good for general knowledge searches. Claude falls down all the time, but I've noticed that while it's spending tokens to jerk itself off, quite often it happens on the actual issue going on without recognizing it.

Models are dumb and more idiot than idiot savant, but sometimes they hit on relevant items. As long as you personally have an idea of what you need to happen and treat LLMs like rat terriers in a farm field, you can utilize them properly

[+] leptons|8 months ago|reply
Your comment is no better than the comment in the article that the author is calling out.

"90%" also seems a bit suspect.

[+] AbrahamParangi|8 months ago|reply
This reads like the author is mad about imprecision in the discourse which is real but to be quite frank more rampant amongst detractors than promoters, who often have to deal with the flaws and limitations on a day to day basis.

The conclusion that everything around LLMs is magical thinking seems fairly hubristic to me, given that in the last 5 years a set of previously borderline-intractable problems have become completely or near-completely solved: translation, transcription, and code generation (up to some scale), for instance.

[+] troupo|8 months ago|reply
> but to be quite frank more rampant amongst detractors than promoters, who often have to deal with the flaws and limitations on a day to day basis.

"detractors" usually point to actual flaws. "promoters" usually uncritically hail LLMs as miracles capable of solving any problem in one go, without giving any specific details.

[+] DavidPiper|8 months ago|reply
Translation, transcription, and code generation (up to some scale) were borderline intractable problems?

Google Translate, Whisper and Code Generators (up to some scale) have existed for quite some time without using LLMs.

[+] atemerev|8 months ago|reply
"It's crypto all over again"

Crypto is a lifeline for me, as I cannot open a bank account in the country I live in, for reasons I can neither control nor fix. So I am happy if crypto is useless for you. For me and for millions like me, it is a matter of life and death.

As for LLMs — once again, magic for some, reliable deterministic instrument for others (and also magic). Just classified and sorted a few hundred invoices. Yes, magic.

[+] tehjoker|8 months ago|reply
This is basically the only use case for crypto, and one for which it was explicitly designed: censorship resistance. This is why people have so much trouble finding useful things for it to do in the legal economy, it was explicitly designed to facilitate transactions the government doesn't want or can't facilitate. In some cases, there are humanitarian applications, there are also a lot of illicit applications.
[+] harel|8 months ago|reply
Can you elaborate on your situation? Which country are you in? How is crypto used there?
[+] troupo|8 months ago|reply
It's a valid use case in the sea of nonsensical hype where "you are a moron if you don't believe in some true meaning of crypto".

"You had to be there to believe it" https://x.com/0xbags/status/1940774543553146956

AI craze is currently going through a similar period: any criticism is brushed away as being presented by morons who know nothing

[+] foobarchu|8 months ago|reply
I don't think you actually disagree with the author's quip. You seem to want to use crypto as a currency, while OP was most likely referring to the grifting around crypto as an investment. If you're using it as a currency, then the people trying to pump and dump coins and use it as a money-making vehicle are your adversaries. You are best served if it's stable instead of a rollercoaster of booms and busts.
[+] mumbisChungo|8 months ago|reply
Said this in another thread and I'll repeat it here:

It's the same problem that crypto experiences. Almost everyone is propagating lies about the technology, even if a majority of those doing so don't understand enough to realize they're lies (naivety vs malice).

I'd argue there's more intentional lying in crypto and less value to be gained, but in both cases people who might derive real benefit from the hard truth of the matter are turning away before they enter the door due to dishonesty/misrepresentation- and in both cases there are examples of people deriving real value today.

[+] sureglymop|8 months ago|reply
Loosely related, but I find the use of AGI (and sometimes even AI) as terms annoying lately. Especially in scientific papers, where I would imagine everything to be well defined, at least in how it is used in that paper.

So, why can't we just come up with some definition for what AGI is? We could then, say, logically prove that some AI fits that definition. Even if this doesn't seem practically useful, it's theoretically much more useful than just using that term with no meaning.

Instead it kind of feels like it's an escape hatch. On wikipedia we have "a type of ai that would match or surpass human capabilities across virtually all cognitive tasks". How could we measure that? What good is this if we can't prove that a system has this property?

Bit of a rant but I hope it's somewhat legible still.

[+] kgeist|8 months ago|reply
We recently started using LLMs at our company, and the first job I had was to transcribe 20k customer calls and extract the following info:

1) what products we're usually compared to

2) what problems users have with our software

3) what use cases users mention most often

What used to take weeks of research took just a couple of hours. It helped us form a new strategy and brought real business value.
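A pipeline like that can be sketched as follows. The prompt, the stub model, and the function names are hypothetical; a real run would call an LLM API once per transcript and tally the structured results:

```python
import json
from collections import Counter
from typing import Callable

PROMPT = ('From the call transcript below, return JSON of the form '
          '{"competitors": [...], "problems": [...], "use_cases": [...]}.\n\n')

def tally_competitors(transcripts: list[str], llm: Callable[[str], str]) -> Counter:
    """Run the extraction prompt per transcript and count mentioned competitors."""
    counts = Counter()
    for transcript in transcripts:
        data = json.loads(llm(PROMPT + transcript))
        counts.update(data["competitors"])
    return counts

# Deterministic stub in place of a real model call, just to show the shape.
stub = lambda prompt: '{"competitors": ["OtherTool"], "problems": [], "use_cases": []}'
tallies = tally_competitors(["...transcript..."] * 3, stub)
print(tallies.most_common())
```

The same loop with "problems" and "use_cases" keys covers the other two questions; the heavy lifting is just natural language in, structured counts out.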

I see LLMs as just a natural language processing engine, and they're great at that. Some people overhype it, sure, but that doesn't change the fact that it's been genuinely useful for our cases. Not sure what's up with all those "LLM bad" articles. If it doesn't work for you, just move on. Why should anyone have to prove anything to anyone? It's just a tool.

[+] hx8|8 months ago|reply
I think you are underestimating the negative impacts that overhype causes. It's distorting the market, causing over-investment, preemptively slashing departments, and creating an expectation that will never be met. These articles are important for cooling expectations. When people sell LLMs, they usually aren't talking about summarizing customer support calls; they are trying to sell the idea of firing customer support staff.
[+] herbst|8 months ago|reply
This, very much. People who claim there is no real use for LLMs have never faced the problem of processing a lot of data in a kinda reliable way.

For years most of the translations on the web didn't have context. Now they can.

[+] djoldman|8 months ago|reply
Many well-trusted and reasonable tech folks who are known for sober takes on subjects have reported substantial improvements in their programming work by using various forms of generative AI.

What does substantial mean? Somewhere between 5% and 100%. Something NOT insignificant.

At a minimum, it is safe to say that GenAI is or could be a significantly beneficial tool for a significant number of people.

It's not required that folks disclose how many CPUs, lines of code, numbers of bytes processed, or other details for the above to be a reasonable take.

[+] alganet|8 months ago|reply
LLM tech probably will find some legitimate use, but by then, everything will be filled with people misusing it.

Millions of beginner developers running with scissors in their hands, millions in investment going to garbage.

I don't think this can be reversed anymore; companies are all-in and pot-committed.

[+] dcre|8 months ago|reply
Similar argument to https://www.baldurbjarnason.com/2025/trusting-your-own-judge..., but I like this one better because at least it doesn’t try to pull the rhetorical trick of slipping from “we can’t know whether LLMs are helping because we haven’t studied the question systematically” to “actually we do know, and they’re shit”.
[+] yahoozoo|8 months ago|reply
The thing is, the questions such as “are they an expert in the domain” … “are they good at coding to begin with” … and so on only really apply to the folks claiming positive results from LLMs. On the flip side, someone not getting much value - or dare I say, a skeptic - pushes back because they _can see_ what the LLM gave them is wrong. I’m not providing any revelatory comment here, but the simple truth is: people who are shit to begin with think this is all amazing/magic/the future.
[+] geetee|8 months ago|reply
I'm working with product managers that are almost certainly using LLMs to generate product requirement docs, complete with code samples, data type definitions, and diagrams. Everything looks good to the untrained eye, but it's complete and utter bullshit. LLM abuse is going to be the end of so many tech companies.
[+] ibaikov|8 months ago|reply
The crypto and NFT situation happened because of our society, media, and VC/startup landscape, which hype things up a lot for their own reasons. We treat massive technologies as new brands of bottled water. Or, actually, as new hype toys like fidget spinners or pop-it toys. This tech is massively more complex, and you have to invest time to learn about its abilities, limitations, and potential developments. Almost nobody actually does this; it's easier to follow the hype train and put money into something that grows and looks cool without obvious cons. Crypto is cool for some stuff. On the other hand, where's your Stepn (and move-to-earn in general), decentraland cities, Apes that will make a multimedia universe? Where's "you'll be paying using crypto for everything"?

Same for LLMs and AI: it is awesome for some things and absolutely sucks for other things. Curiously tho, it feels like UX was solved by making chats, but it actually still sucks enormously, as with crypto. It is mostly sufficient for doing basic stuff. It is difficult to predict where we'll land on the curve of difficult (or expensive) vs abilities. I'd bet AI will get way more capable, but even now you can't really deny its usefulness.

[+] orbital-decay|8 months ago|reply
The point about non-determinism is moot if you understand how it works. An accurate LLM always gives the same result where the same result is needed, no matter how many times you ask it. Try asking any LLM what is 2x2 on a temperature it's designed for, what are the chances to get 5 in a reply?

In reality, modern LLMs trained with RL have terrible variance and mainly learn 1:1 mapping of ideas to ideas, which is a big issue for creative writing and parallel inference/majority voting techniques, so there's even less meaningful "non-determinism" available than you might think. It's usually either able or not able to give the correct answer, rerolling it doesn't work well. I think even a human has more non-determinism than a modern LLM (it's impossible to measure though).
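The 2x2 point can be illustrated with a toy sampler. The logits here are made up, but they show why a model that concentrates nearly all probability mass on one answer is effectively deterministic at temperature 0, and nearly so even when sampling:

```python
import math
import random

def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Greedy (argmax) decoding at temperature 0; softmax sampling otherwise."""
    if temperature == 0:
        return max(logits, key=logits.get)
    weights = [math.exp(v / temperature) for v in logits.values()]
    return rng.choices(list(logits), weights=weights)[0]

# A trained model puts almost all probability mass on "4" for "what is 2x2":
logits = {"4": 10.0, "5": -5.0}
answers = {sample_token(logits, 0.0, random.Random(i)) for i in range(100)}
print(answers)  # {'4'} — the same result on every run
```

Even at temperature 1, the chance of drawing "5" from these logits is about e^-15, so rerolling almost never changes the answer, which is the low-variance behavior described above.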

[+] ls-a|8 months ago|reply
AI helped me find a bug that was going undetected and also helped me fix it. I was debugging a completely different bug, and decided to get a bit creative with what context i should share. AI pointed out the other bug. What I'm trying to say is people should stop talking about AI making them faster, that is not a good goal, unless your manager sucks. People should talk about concrete cases of how AI helped them instead. Another example I can give is AI helped me understand a part of a complicated protocol without reading the spec. It explained just the part I needed at the moment. AI made me dread my work less. I hope founders start getting creative with AI tools instead of copying each other.
[+] cedws|8 months ago|reply
I still see little conversation about the two fundamental limitations of LLMs right now: context size, and prompt injection.

* Computation scales quadratically with context size in the attention layers, meaning the ‘memory’ of LLMs is limited and gets more expensive as it gets bigger.

* Prompt injection limits the usability of LLMs in the real world. How can you put an LLM in the driving seat if malicious actors can talk it into doing something it’s not supposed to.
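The first point is easy to make concrete with back-of-envelope numbers. The 4096 hidden size is a hypothetical model width, and only the two big attention matrix products are counted:

```python
def attention_flops(context_len: int, d_model: int = 4096) -> int:
    """Rough per-layer self-attention cost: QK^T scores plus the weighted
    sum over values, i.e. O(n^2 * d)."""
    return 2 * context_len ** 2 * d_model

for n in (8_000, 32_000, 128_000):
    print(f"{n:>7} tokens: {attention_flops(n):.2e} FLOPs per layer")

ratio = attention_flops(128_000) / attention_flops(8_000)
print(ratio)  # 256.0 — 16x longer context costs 256x more attention compute
```

Real models complicate this with KV caching, attention variants, and the MLP blocks, but the quadratic term is why long-context requests are priced the way they are.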

Whenever I see a blog post by Anthropic or OpenAI I do a Ctrl+F for “prompt injection.” Never mentioned. They want people to forget this is a problem — because it’s a massive one.

[+] hamilyon2|8 months ago|reply
I am impressed by the speed-of-sound goalpost movement.

A few days ago Google released a very competent summary generator, an interpreter between tens of languages, and a GPT-3-class general-purpose assistant. It works locally on modest hardware: a 5-year-old laptop, no discrete GPU.

It alone potentially saves so much toil, so much stupid work.

We also finally “solved computer vision”: it can read from PDFs, and read diagrams and tables.

Local vision models are much less impressive and need some care to use. Give it 2 years.

I don't know if we can overhype it when it achieves holy-grail level on some important tasks.

[+] assuagering|8 months ago|reply
I have had SOTA models stray from factual content in documents I provided them with within 2-3 prompts.

They haven't solved anything. They are just fast and look good doing what we ask them to do. But they corrupt data with a passion, and to that the hype just responds: "just give us 10x as much money and compute".