What makes 5% of AI agents work in production?

sbierwagen|4 months ago

>This Monday, I moderated a panel in San Francisco with engineers and ML leads from Uber, WisdomAI, EvenUp, and Datastrato. The event, Beyond the Prompt, drew 600+ registrants, mostly founders, engineers, and early AI product builders.

>We weren’t there to rehash prompt engineering tips.

>We talked about context engineering, inference stack design, and what it takes to scale agentic systems inside enterprise environments. If “prompting” is the tip of the iceberg, this panel dove into the cold, complex mass underneath: context selection, semantic layers, memory orchestration, governance, and multi-model routing.

I bet those four people love that the moderator took a couple notes and then asked ChatGPT to write a blog post.

As always, the number one tell of LLM output, besides the tone, is that by default it will never include links in the body of the post.

stingraycharles|4 months ago

Yeah, “here’s the reality check:”, “not because they’re flashy, but because they’re blah blah”.

Why can’t anyone be bothered anymore to write actual content, especially when writing about AI, where your whole audience is probably already exposed to these patterns in content day in, day out?

It comes off as so cheap.

tkgally|4 months ago

I started to suspect a few paragraphs in that this post was written with a lot of AI assistance, but I continued to read to the end because the content was interesting to me. Here's one point that resonated in particular:

"There’s a missing primitive here: a secure, portable memory layer that works across apps, usable by the user, not locked inside the provider. No one’s nailed it yet. One panelist said if he weren’t building his current startup, this would be his next one."

esperent|4 months ago

> the number one tell of LLM output, besides the tone, is that by default it will never include links in the body of the post.

This isn't true. I've been using Gemini 2.5 a lot recently and I can't get it to stop adding links!

I added custom instructions: Do not include links in your output. At the start of every reply say "I have not added any links as requested".

It works for the first couple of responses but then it's back to loads of links again.

donnaoana|4 months ago

thanks for the hate, they did love it indeed, the questions I've asked them, the draft I wrote for them to read, and published only after they read and added comments. I am curious, do you not use AI? isn't the point to polish things and make it more efficient? I am curious if there was anything useful to you in the article or if you have constructive criticism? I was sad to read some of the hate, but overall, I am very happy with the many notes form founders and builders who found it useful.

carimura|4 months ago

the future is now where debates about human vs machine will influence our trust and enjoyment! I read the article wondering how much of it was AI generated (new worry!), but also how biased it was based on the authors startup business interest (old worry!), and concluded that if I learned something about the panel it was worth the 5 minutes. Or maybe 2 minutes if an AI summarized it.

geoffbp|4 months ago

And the Oxford comma

scotty79|4 months ago

It did good enough job for me to skim it.

thisisit|4 months ago

It seems to me that people think AI is somehow magic. Recently I led a product demo. The conversation went something like this:

End users (at my company) - Can your AI system look at numbers and find differences and generate a text description?

Pre-sales - (trying to clarify) For our systems to generate text it will be better if you give it some live examples so that it understands what text to generate.

End users - But there is supporting data (metadata) around the numbers. Can't your AI system just generate text?

Pre-Sales - It can but you need to provide context and examples. Otherwise it is going to generic text like "there is x difference".

End user - You mean I need to write comments manually first? That is too much work.

Now these users have a call with another product - MS Copilot.

beezlebroxxxxxx|4 months ago

Well, you hear a lot about how AI will "empower" employees and generate new "insights" based off of data for analysts and execs. In reality, most executives aren't really interested in that. They'd like it for sure, but really what they want is automation. They want "efficiencies"; they want cost cutting.

Anyone that's been involved in data science roles in corporate environments knows that "the data" is usually forced into an execs pre-existing understanding of a phenomenon. With AI, execs are really excited at "cutting out the middlemen" when the middlemen in the equation are very often their own paid employees. That's all fine and dandy in an abstract economic view, but it's sure something they won't say publicly (at least most won't).

In terms of potential cost cutting, it probably is the most recent "new magic". You used to have to pay a consultant, now you can "ask AI".

nowittyusername|4 months ago

This is a very common sentiment I see everywhere and it really highlights how uneducated most people are about technology in general. Most folks seem to expect things to work magically and perform physics breaking feats and it honestly baffles me. I would expect this attitude from maybe the younger generations who grew up only being users of technology like tablets and smartphones, but I honestly never expected millennials to be in the same camp, but nope they are just as ignorant. And I am thinking to myself, did I grow up different? Were my friends also not using the same Nintendo cartridges, and VCR's and camcorders and all the other tech that you had no choice but to learn at least basic fundamentals to use? Apparently most people never delved deeper then surface level on how to use these things and everything else went right over their head...

alganet|4 months ago

> It seems to me that people think AI is somehow magic.

That's because it is marketed as magic. It's marketed as magic so people will adopt the thing before knowing its shortcomings.

https://pbfcomics.com/comics/the-masculator/

hadlock|4 months ago

The MS Copilot pre-sales person responded "oh, there is metadata? then yes, it will discover that and generate a text description, no problem"

alansaber|4 months ago

TBF synthetic data generation exists for this reason. I do understand why a lot of companies go with the "safe" choice (copilot) even though it's crap.

amenhotep|4 months ago

Pray, Mr Babbage, etc

iagooar|4 months ago

Wow, half of this article deeply resonates with what I am working on.

Text-to-SQL is the funniest example. It seems to be the "hello world" of agentic use in enterprise environments. It looks so easy, so clear, so straight-forward. But just because the concept is easy to grasp (LLMs are great at generating markup or code, so let's have them translate natural language to SQL) doesn't mean it is easy to get right.

I have spent the past 3 months building a solution that actually bridges the stochastic nature of AI agents and the need for deterministic queries. And boy oh boy is that rabbit hole deep.

data-ottawa|4 months ago

SQL is never just the tables and joins, it’s knowing the table grains, the caveats, all the modelling definitions and errors (and your data warehouse almost certainly has modelling errors as business logic in your app drifts), plus the business context to correctly answer questions.

60% of the time I spend writing sql is probably validation. A single hallucinated assumption can blow the whole query. And there are questions that don’t have clear modelling approaches that you have to deal with.

Plus, a lot of the sql training data in LLMs is pretty bad, so I’ve not been impressed yet. Certainly not to let business users run an AI query agent unchecked.

I’m sure AI will get good at this, so I’m building up my warehouse knowledge base and putting together documentation as best I can. It’s just pretty awful today.

jamesblonde|4 months ago

Text2SQL was 75% on bird-bench 6 months ago. Now it's 80%. Humans are still at 90+%. We're not quite there yet. I suspect text-to-sql needs a lot of intermediate state and composition of abstractions, which vanilla attention is not great at.

https://bird-bench.github.io/

juleiie|4 months ago

> building a solution that actually bridges the stochastic nature of AI agents and the need for deterministic queries

Wait but this just sounds unhinged, why oh why

donnaoana|4 months ago

glad it resonates, that was the intention

another_twist|4 months ago

So I have read the MIT paper and the methodology as well as the conclusions are just something else.

For example, the number comes from perceived successes and failures and not actual measurements. The customer conclusions are also - it doesnt improve or it doesnt remember. Literally buying into the hype of recursive self improvement and completely oblivious to the fact that API dont control model weights and such cant do much self improvement besides writing more CRUD layers. The other complaints are about integrations which are totally valid. But in industries which still run windows XYZ without any API platforms so thats not going away in those cases.

Point being, if the paper itself is not very good discourse just a well marketed punditry, why should we discuss on the 5% number. It makes no sense.

monero-xmr|4 months ago

A non-open ended path collapses into a decision tree. Very hard to think of customer support use-cases that do not collapse into decision trees. Most prompt engineering on the SaaS side results in very long prompts to re-invent decision trees and protect against edge cases. Ultimately the AI makes a “decision function call” which hits a decision tree. LLM is very poor replacement for a decision tree.

I use LLM every day of my life to make myself highly productive. But I do not use LLM tools to replace my decision trees.

LPisGood|4 months ago

It just occurred to me that with those massive system files people use we’re basically reinventing expert systems of the past. Time is a flat circle, I suppose.

jongjong|4 months ago

It's interesting because my management philosophy when delegating work has been to always start by telling people what my intent is, so that they don't get too caught up in a specific approach. Many problems require out-of-the-box thinking. This is really about providing context. Context engineering is basically a management skill.

Without context, even the brightest people will not be able to fill in the gaps in your requirements. Context is not just nice-to-have, it's a necessity when dealing with both humans and machines.

I suspect that people who are good engineering managers will also be good at 'vibe coding'.

HardCodedBias|4 months ago

"I suspect that people who are good engineering managers will also be good at 'vibe coding'."

I have observed that those who have both technical and management experience seem to be more adept (or perhaps willing?) to use LLMs in the daily life to good effect.

Of course what really helps, like in all things, is conscientiousness and an obsession for working through problems (if people don't like obsession then tenacity and diligence).

AdieuToLogic|4 months ago

It's funny that what the author identifies as "the reality check":

  Here’s the reality check: One panelist mentioned that 95%
  of AI agent deployments fail in production. Not because the 
  models aren’t smart enough, but because the scaffolding 
  around them, context engineering, security, memory design, 
  isn’t there yet.

Could be a reasonable definition of "understanding the problem to solve."

In other words, everything identified as what "the scaffolding" needs is what qualified people provide when delivering solutions to problems people want solved.

whatever1|4 months ago

They fail because the “scaffolding” is building the complicated expert system that AI promised that one would not have to do.

If I implement myself a strict parser and an output post-processor to guard against hallucinations, I have done 100% of the business related logic. I can skip the LLM in the middle altogether.

mnky9800n|4 months ago

You see, in order to get the AI agent to do it's job, we needed to write a lot of software to provide it with guard rails so that it doesn't lose its mind when doing so.

might as well just write the ai agent part of the software yourself as well.

codyb|4 months ago

At work we're deploying a chat bot to help users with our internal tools and it's just a forcing function to write and mark as deprecated the documentation we never maintained in the first place.

So...

The bot, to its credit, returns some decent results. But my guess is that it will be quite a while before we see it in prod since a lot of these projects go from 0 - 80% in a week and 80% - deployable in several years.

danieltanfh95|4 months ago

It is really just BS. These are just basic DSA stuff. We deployed a real world solution by doing of all of that on our side. It's not magic. It's engineering.

ares623|4 months ago

At some point, say 5 years from now, someone will revisit their AI-powered production workloads and ask the question "how can we optimize this by falling back to non-AI workload?". Where does that leave AI companies when the obvious choice is to do away with their services once their customers reach a threshold?

anonzzzies|4 months ago

A lot of what we encounter is; there is this 'chat' interface which is the 'wow factor': you type something in english and something (like text to sql) falls out, maybe 60-80% of what was needed. But then the frustration (for the user) starts: the finetuning of the result. After a few uses, they always ask for the 'old way' back to do that: just editing the query or give them knobs to turn to finetune the result. Where most want knobs which are, outside the most generic cases (pick a timespan for a datetime column), custom work. So AI is used for the first 10% of the work time (which gives you 60%+ of the solution) until the frustration lands: the last 40% or less are going to take 90% of your time. Still great as overall it will probably take far less time than before.

EdwardDiego|4 months ago

"Huh, turns out we could replace it all with a 4 line Perl script doing linear regression."

marcosdumay|4 months ago

Those 5% that generate revenue on the MIT article do that because the only thing they are used for is creating marketing spam to send to people.

And now we have an entire panel of bullshitters with an article-long theory about how to make LLMs program actually for real this time.

(Oh, and it would be great if journalists actually cited their public sources, instead of pretending they link to the article but actually linking to their review of related content.)

intended|4 months ago

I just refuse to read long AI generated text. Sadly this feels exactly like that.

donnaoana|4 months ago

I am curious, how would you use AI then, if not to make one more productive? The text is not AI generated, I came up with the questions, moderated the discussion, wrote a draft that the speakers red, added comments and the AI polished it, the AI was a custom GPT that I trained on my previous text from that substack. I am curious what would you have done differently or if you would refuse to use AI at all? I wrote the article, so I am genuinly curious. I didn't know someone posted on Hacker News, I knew people like to be negative here because there is no accountablity, I want to learn from all this hate. I am personally happy with the outcome, I gover over 30 notes from people who are building that this was useful to them, and the speakers were happy. So I am curious what could have I done differently from your perspective or what should be my learning from all these people who take time from their day to hate on this piece of writing instead of deciding not to read and moving on.

codyb|4 months ago

I get really frustrated when I see it on PRs cause it's such a time sink, super obvious, and so fluffy.

So you scaffold this up in 30 seconds but want me to read through it carefully? Cool, thanks.

EdwardDiego|4 months ago

> One team suggested that instead of text-to-SQL, we should build semantic business logic layers, “show me Q4 revenue” should map to a verified calculation, not raw SQL generation.

Okay, how would that work though? Verified by who and calculated by what?

I need deets.

meheleventyone|4 months ago

They're saying that someone should implement the CalculateQuarterRevenue(year, quarter) function somewhere in a manner that has been verified (e.g. run it against previous quarters to make sure it works correctly) then rather than using the LLM to generate SQL you use it to decide what domain function should be called. Which to me seems to mean that someone on the panel was gently taking the piss out of the idea. Since if you've done all the hardwork anyway presenting this in a deterministic way with a nice UX is straightforward bit of front end work.

esafak|4 months ago

In other words, there should be a list of predefined queries, or possibly subqueries, that the user can request. This is basically how products used to work before AI. The difference is now you can request which query you want verbally.

edit: I'm serious. I'm just answering the question, not making a value judgement.

dchftcs|4 months ago

On one side, you have an agent calculating the revenue.

On the other side, you have an SQL that calculates the revenue

Compare the two. If the two disagree, get the AI to try again. If the AI is still wrong after 10 tries, just use the SQL output.

tirumaraiselvan|4 months ago

A simple way is perhaps implement a text-to-metrics system where metrics could be defined as SQL functions.

moomoo11|4 months ago

psychedelics

zoeey|4 months ago

I've always felt the real challenge isn't the LLM itself, but managing the context around it. Many people assume that writing a good prompt is enough, but the real work is turning something unpredictable into a tool you can actually rely on.

LogicFailsMe|4 months ago

95% of the talent is being paid top dollar to build ~5% of the applications?

alansaber|4 months ago

Absolutely, when we're talking about infrastructure versus model development (RL/fine tuning, let alone pre-training).

hn_throwaway_99|4 months ago

> Here’s the reality check: One panelist mentioned that 95% of AI agent deployments fail in production. Not because the models aren’t smart enough, but because the scaffolding around them, context engineering, security, memory design, isn’t there yet.

It's a big pet peeve of mine when an author states an opinion, with no evidence, as some kind of axiom. I think there is plenty of evidence that "the models aren't smart enough". Or to put it more accurately, it's an incredibly difficult problem to get a big productivity gain when an automated system is blatantly wrong ~1% of the time but when those wrong answers are inherently designed to look like right answers as much as possible.

janalsncm|4 months ago

> The panel’s consensus: conversation works when it removes a learning curve.

Conversational UIs are controversial but I think there are a good number of websites where a better search could be more centric. Not generating text, but surfacing the most relevant text.

I’m thinking of a lot of library documentation, government info websites, etc. Basically an improvement over deep hierarchical navigation, where their way of organizing info is a leaky abstraction.

Maybe that will be one of the side effects of this AI boom. Who knows.

tirumaraiselvan|4 months ago

This article is getting a lot of hate but honestly it does have good amount of useful content learned through practical experience, although at an abstract level. For example, this section:

``` The teams that succeed don’t just throw SQL schemas at the model. They build:

Business glossaries and term mappings

Query templates with constraints

Validation layers that catch semantic errors before execution ```

Unfortunately, the mixing of fluffy tone and high level ideas is bound to be detested by hands on practitioners.

another_twist|4 months ago

Its weird that this makes the front page and Metas code world model never did.

metadat|4 months ago

First I've heard of it:

https://ai.meta.com/research/publications/cwm-an-open-weight...

CuriouslyC|4 months ago

HN front page dynamics are heavily driven by having readers of /new who are stans for your content.

ath3nd|4 months ago

[deleted]

hshdhdhehd|4 months ago

Base models are the seed, fine tuning is the genetically modified seed. Context is the fertiliser.

handfuloflight|4 months ago

Agents are the oxen pulling the plow through the seasons... turning over ground, following furrows, adapting to terrain. RAG is the irrigation system. Prompts are the farmer's instructions. And the harvest? That depends on how well you understood what you were trying to grow.

ath3nd|4 months ago

[deleted]

121 comments