(no title)
twalkz | 7 months ago
It feels like either finding that 2% that's off (or dealing with 2% error) will be the time consuming part in a lot of cases. I mean, this is nothing new with LLMs, but as these use cases encourage users to input more complex tasks, that are more integrated with our personal data (and at times money, as hinted at by all the "do task X and buy me Y" examples), "almost right" seems like it has the potential to cause a lot of headaches. Especially when the 2% error is subtle and buried in step 3 of 46 of some complex agentic flow.
Aurornis|7 months ago
This is where the AI hype bites people.
A great use of AI in this situation would be to automate the collection and checking of data. Search all of the data sources and aggregate links to them in an easy place. Use AI to search the data sources again and compare against the spreadsheet, flagging any numbers that appear to disagree.
Yet the AI hype train takes this all the way to the extreme conclusion of having AI do all the work for them. The quip about 98% correct should be a red flag for anyone familiar with spreadsheets, because it’s rarely simple to identify which 2% is actually correct or incorrect without reviewing everything.
This same problem extends to code. People who use AI as a force multiplier to do the thing for them and review each step as they go, while also disengaging and working manually when it’s more appropriate, have much better results. The people who YOLO it with prompting cycles until the code passes tests and then submit a PR are causing problems almost as fast as they’re developing new features in non-trivial codebases.
jfarmer|7 months ago
“The fallacy in these versions of the same idea is perhaps the most pervasive of all fallacies in philosophy. So common is it that one questions whether it might not be called the philosophical fallacy. It consists in the supposition that whatever is found true under certain conditions may forthwith be asserted universally or without limits and conditions. Because a thirsty man gets satisfaction in drinking water, bliss consists in being drowned. Because the success of any particular struggle is measured by reaching a point of frictionless action, therefore there is such a thing as an all-inclusive end of effortless smooth activity endlessly maintained.
It is forgotten that success is success of a specific effort, and satisfaction the fulfillment of a specific demand, so that success and satisfaction become meaningless when severed from the wants and struggles whose consummations they are, or when taken universally.”
slg|7 months ago
ivape|7 months ago
This might as well be the new definition of “script kiddie”, and it’s the kids that are literally going to be the ones birthed into this lifestyle. The “craft” of programming may not be carried by these coming generations and possibly will need to be rediscovered at some point in the future. The Lost Art of Programming is a book that’s going to need to be written soon.
lobochrome|7 months ago
I disagree. Receiving a spreadsheet from a junior means I need to check it. If this gives me infinite additional juniors I’m good.
It’s this popular pattern of HN comments - expect AI to behave deterministically correct - while the whole world operates on stochastically correct all the time…
dustingetz|7 months ago
xeonmc|7 months ago
kjkjadksj|7 months ago
Why would you need ai for that though? Pull your sources. Run a diff. Straight to the known truth without the chatgpt subscription. In fact by that point you don’t even need the diff if you pulled from the sources. Just drop into the spreadsheet at that point.
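The "pull your sources, run a diff" step can be sketched in a few lines of Python. This is only an illustration under assumptions: the `flag_disagreements` helper and the figures are hypothetical, and it assumes both the sources and the spreadsheet have already been loaded into label-to-number mappings.

```python
import math
from typing import Optional

Row = tuple[str, Optional[float], Optional[float]]


def flag_disagreements(
    source: dict[str, float],
    sheet: dict[str, float],
    rel_tol: float = 1e-9,
) -> list[Row]:
    """Return (label, source_value, sheet_value) for every label whose
    values differ, or which is missing on one side (shown as None)."""
    flagged: list[Row] = []
    for label in sorted(source.keys() | sheet.keys()):
        expected = source.get(label)
        actual = sheet.get(label)
        if (
            expected is None
            or actual is None
            or not math.isclose(expected, actual, rel_tol=rel_tol)
        ):
            flagged.append((label, expected, actual))
    return flagged


# Hypothetical figures, for illustration only.
source = {"rent": 1200.0, "payroll": 8450.0, "travel": 310.5}
sheet = {"rent": 1200.0, "payroll": 8540.0, "software": 99.0}

for label, expected, actual in flag_disagreements(source, sheet):
    print(f"{label}: source={expected} sheet={actual}")
```

The diff only says where the two sides disagree. Deciding which side is right still means going back to the source.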
casey2|7 months ago
ricardobayes|7 months ago
satvikpendem|7 months ago
— Tom Cargill, Bell Labs
https://en.wikipedia.org/wiki/Ninety%E2%80%93ninety_rule
simantel|7 months ago
Probably because it's just here now? More people take Waymo than Lyft each day in SF.
danny_codes|7 months ago
GenAI is the exciting new tech currently riding the initial hype spike. This will die down into the trough of disillusionment as well, probably sometime next year. Like self-driving, people will continue to innovate in the space and the tech will be developed towards general adoption.
We saw the same during crypto hype, though that could be construed as more of a snake oil type event.
dingnuts|7 months ago
Whenever someone tells me how these models are going to make white collar professions obsolete in five years, I remind them that the people making these predictions 1) said we'd have self driving cars "in a few years" back in 2015 and 2) the predictions about white collar professions started in 2022 so five years from when?
camillomiller|7 months ago
maxlin|7 months ago
A few comparisons:
> Pressing the button: $1
> Knowing which button to press: $9,999
Those 2% copy-paste changes are the $9,999 and might take as long to find as the rest of the work.
Also: SCE to AUX.
hx8|7 months ago
Regardless of if AI generates the spreadsheet or if I generate the spreadsheet, I'm still going to do the same validation steps before I share it with anyone. I might have a 2% error rate on a first draft.
samtp|7 months ago
So then you have to dig into all this overly verbose code to identify the 3-4 subtle flaws with how it transformed/joined the data. And these flaws take as much time to identify and correct as just writing the whole pipeline yourself.
torginus|7 months ago
I used to have a non-technical manager like this - he'd watch out for the words I (and other engineers) said and in what context, and would repeat them back mostly in accurate word contexts. He sounded remarkably like he knew what he was talking about, but would occasionally make a baffling mistake - like mixing up CDN and CSS.
LLMs are like this. I often see Cursor with Claude making the same kind of strange mistake, only to catch itself in the act and fix the code (but what happens when it doesn't?)
MattSayar|7 months ago
nemomarx|7 months ago
But normally you would want a more hands-on back and forth to ensure the requirements actually capture everything, validation that the results are good, layers of review, etc., right?
stpedgwdgfhgdd|7 months ago
Remember the title “attention is all you need”? Well you need to pay a lot of attention to CC during these small steps and have a solid mental model of what it is building.
casey2|7 months ago
sensanaty|7 months ago
And why 98%? Why not 99% right? Or 99.9% right? I know they can't outright say 100% because everyone knows that's a blatant lie, but we're okay with them bullshitting about the 98% number here?
Also there's no universe in which this guy gets to walk his dog while his little pet AI does his work for him, instead his boss is going to hound him into doing quadruple the work because he's now so "efficient" that he's finishing his spreadsheet in an hour instead of 8 or whatever. That, or he just gets fired and the underpaid (or maybe not even paid) intern shoots off the same prompt to the magic little AI and does the same shoddy work instead of him. The latter is definitely what the C-suite is aiming for with this tech anyway.
j_timberlake|7 months ago
This is the part you have wrong. People just won't do that. They'll save the 8 hours and just deal with 2% error in their work (which reduces as AI models get better). This doesn't work with something with a low error tolerance, but most people aren't building the next Golden Gate Bridge. They'll just fix any problems as they crop up.
Some of you will be screaming right now "THAT'S NOT WORTH IT", as if companies don't already do this to consumers constantly, like losing your luggage at the airport or getting your order wrong. Or just selling you something defective, all of that happens >2% of the time, because companies know customers will just deal-with-it.
camdenreslink|7 months ago
lossolo|7 months ago
taf2|7 months ago
hiq|7 months ago
At least with humans you have things like reputation (has this person been reliable) or if you did things yourself, you have some good idea of how diligent you've been.
sebasvisser|7 months ago
maccard|7 months ago
ants_everywhere|7 months ago
The usual estimate you see is that about 2-5% of spreadsheets used for running a business contain errors.
davedx|7 months ago
pyman|7 months ago
eboynyc32|7 months ago
rvz|7 months ago
The last '2%' (and in some benchmarks 20%) could cost as much as $100B+ more to make it perfect consistently without error.
This requirement does not apply to generating art. But for agentic tasks, an error rate of 20% at worst, or 2% at best, may be unacceptable.
As you said, if the agent makes an error in any of the steps in an agentic flow or task, the entire result would be incorrect and you would need to check over the entire work again to spot it.
Most will just throw it away and start over; wasting more tokens, money and time.
And no, it is not "AGI" either.
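To put a rough number on how per-step errors compound, here is a back-of-the-envelope sketch. It assumes every step fails independently with the same 2% chance and that a single bad step spoils the whole run; both are simplifying assumptions, not measurements. The 46-step count is borrowed from the example at the top of the thread, and `chance_all_steps_correct` is a made-up helper name.

```python
def chance_all_steps_correct(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps
    that each succeed with probability `per_step_accuracy`."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


for steps in (1, 10, 46):
    p = chance_all_steps_correct(0.98, steps)
    print(f"{steps:>2} steps at 98% each: {p:.1%} chance of a clean run")
```

Under those assumptions a 46-step flow comes out fully correct only about 39% of the time, even though each individual step is "98% right".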
positron26|7 months ago
Might explain why some people grind up a billion tokens trying to make code work only to have it get worse while others pick apart the bits of truth and quickly fill in their blind spots. The skillsets separating wheat from chaff are things like honest appreciation for corroboration, differentiating subjective from objective problems, and recognizing truth-preserving relationships. If you can find the 0.02 ^ n sub-problems, you can grind them down with AI and they will rapidly converge, leaving the 0.98 ^ n problems to focus human touch on.
travelalberta|7 months ago
"I think it got 98% of the information correct..." how do you know how much is correct without doing the whole thing properly yourself?
The two options are:
- Do the whole thing yourself to validate
- Skim 40% of it, 'seems right to me', accept the slop and send it off to the next sucker to plug into his agent.
I think the funny part is that humans are not exempt from similar mistakes, but a human making those mistakes again and again would get fired. Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
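For a sense of what the "skim 40%" option actually buys, here is a small sketch. It assumes a hypothetical sheet of 100 entries with exactly 2 wrong, and treats the skim as a uniformly random sample that always spots an error when it looks at one. Real skimming is neither random nor perfect, so read the numbers as optimistic.

```python
from math import comb


def prob_errors_caught(total: int, errors: int, checked: int, caught: int) -> float:
    """Hypergeometric probability that a random check of `checked` entries
    contains exactly `caught` of the `errors` wrong entries."""
    return (
        comb(errors, caught)
        * comb(total - errors, checked - caught)
        / comb(total, checked)
    )


# Hypothetical sheet: 100 entries, 2 of them wrong, 40 checked at random.
for caught in range(3):
    p = prob_errors_caught(total=100, errors=2, checked=40, caught=caught)
    print(f"skim finds {caught} of 2 errors: {p:.1%}")
```

With these made-up numbers the skim finds both errors only about 16% of the time and misses both about 36% of the time, which is why "seems right to me" is a weak guarantee.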
tibbar|7 months ago
[0] https://www.jasonwei.net/blog/asymmetry-of-verification-and-...
nlawalker|7 months ago
Well yeah, because the agent is so much cheaper and faster than a human that you can eat the cost of the mistakes and everything that comes with them and still come out way ahead. No, of course that doesn't work in aircraft manufacturing or medicine or coding or many other scenarios that get tossed around on HN, but it does work in a lot of others.
groby_b|7 months ago
Because it's a budget. Verifying them is _much_ cheaper than finding all the entries in a giant PDF in the first place.
> the butterfly effect of dependence on an undependable stochastic system
We've been using stochastic systems for a long time. We know just fine how to deal with them.
> Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
There are very few tasks humans complete at a 98% success rate either. If you think "build spreadsheet from PDF" comes anywhere close to that, you've never done that task. We're barely able to recognize objects in their default orientation at a 98% success rate. (And in many cases, deep networks outperform humans at object recognition)
The task of engineering has always been to manage error rates and risk, not to achieve perfection. "butterfly effect" is a cheap rhetorical distraction, not a criticism.
joshstrange|7 months ago
My rule is that if you submit code/whatever and it has problems you are responsible for them no matter how you "wrote" it. Put another way "The LLM made a mistake" is not a valid excuse nor is "That's what the LLM spit out" a valid response to "why did you write this code this way?".
LLMs are tools, tools used by humans. The human kicking off an agent, or rather submitting the final work, is still on the hook for what they submit.
j_timberlake|7 months ago
You must be really desperate for anti-AI arguments if this is the one you're going with. Employees make mistakes all day every day and they don't get fired. Companies don't give a shit as long as the cost of the mistakes is less than the cost of hiring someone new.
gh0stcat|7 months ago
FridgeSeal|7 months ago
At a certain point, relentlessly checking for whether the model has got everything is more effort in turn than…doing it.
Moreover, is it actually a 4-8 hour job? Or is the person not using the right tool, is the better tool a sql query?
Half these “wow ai” examples feel like “oh my plates are dirty, better just buy more”.
apwell23|7 months ago
thorum|7 months ago
1) The cognitive burden is much lower when the AI can correctly do 90% of the work. Yes, the remaining 10% still takes effort, but your mind has more space for it.
2) For experts who have a clear mental model of the task requirements, it’s generally less effort to fix an almost-correct solution than to invent the entire thing from scratch. The “starting cost” in mental energy to go from a blank page/empty spreadsheet to something useful is significant. (I limit this to experts because I do think you have to have a strong mental framework you can immediately slot the AI output into, in order to be able to quickly spot errors.)
3) Even when the LLM gets it totally wrong, I’ve actually had experiences where a clearly flawed output was still a useful starting point, especially when I’m tired or busy. It nerd-snipes my brain from “I need another cup of coffee before I can even begin thinking about this” to “no you idiot, that’s not how it should be done at all, do this instead…”
Forgeties79|7 months ago
I think their point is that 10%, 1%, whatever %, the type of problem is a huge headache. In something like a complicated spreadsheet it can quickly become hours of looking for needles in the haystack, a search that wouldn't be necessary if AI didn't get it almost right. In fact it's almost better if it just gets some big chunk wholesale wrong - at least you can quickly identify the issue and do that part yourself, which you would have had to in the first place anyway.
Getting something almost right, no matter how close, can often be worse than not doing it at all. Undoing/correcting mistakes can be more costly as well as labor intensive. "Measure twice cut once" and all that.
I think of how in video production (edits specifically) I can often get you 90% of the way there in about half the time it takes to get it 100%. Those last bits can be exponentially more time consuming (such as an intense color grade or audio repair). The thing is, with a spreadsheet like that you can't accept a B+ or A-. If something is broken, the whole thing is broken. It needs to work more or less 100%. Closing that gap can be a huge process.
I'll stop now as I can tell I'm running a bit in circles lol
ytpete|7 months ago
It's a high cognitive burden if you don't know which 10% of the work the AI failed to do / did incorrectly, though.
I think you're picturing a percentage indicating what scope of the work the AI covered, but the parent was thinking about the accuracy of the work it did cover. But maybe what you're saying is if you pick the right 90% subset, you'll get vastly better than 98% accuracy on that scope of work? Maybe we just need to improve our intuition for where LLMs are reliable and where they're not so reliable.
Though as others have pointed out, these are just made-up numbers we're tossing around. Getting 99% accuracy on 90% of the work is very different from getting 75% accuracy on 50% of the work. The real values vary so much by problem domain and user's level of prompting skill, but it will be really interesting as studies start to emerge that might give us a better idea of the typical values in at least some domains.
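The difference between those two made-up scenarios is easy to spell out. A minimal sketch, where "coverage" is the share of the task the AI attempted and "accuracy" is how much of that share it got right (`split_work` is just an illustrative name):

```python
def split_work(coverage: float, accuracy: float) -> dict[str, float]:
    """Split a task into fractions: done correctly, done wrongly, not attempted."""
    return {
        "correct": coverage * accuracy,
        "wrong": coverage * (1 - accuracy),
        "untouched": 1 - coverage,
    }


for coverage, accuracy in ((0.90, 0.99), (0.50, 0.75)):
    parts = split_work(coverage, accuracy)
    summary = ", ".join(f"{name} {share:.1%}" for name, share in parts.items())
    print(f"{accuracy:.0%} accuracy on {coverage:.0%} of the work: {summary}")
```

The first case leaves about 0.9% of the whole task silently wrong, the second 12.5%. The silently wrong share is the part you have to hunt for, since the untouched share at least announces itself.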
vidarh|7 months ago
What error rate this same person would find if reviewing spreadsheets made by other people seems like an inherently critical benchmark before we can even discuss whether this is a problem or an achievement.
mclau157|7 months ago
dimitri-vs|7 months ago
kingnothing|7 months ago
mentalpiracy|7 months ago
"Hello, yes, I would like to pollute my entire data store" is an insane sales pitch. Start backing up your data lakes on physical media, there is going to be an outrageous market for low-background data in the future.
semi-related: How many people are going to get killed because of this?
vidarh|7 months ago
98% might well be disastrous, but I've seen enough awful quality human-produced data that without some benchmarks I'm not confident we know whether this would be better or worse.
LandoCalrissian|7 months ago
casperb|7 months ago
https://www.computerworld.com/article/1561181/excel-error-le...
fsndz|7 months ago
stingraycharles|7 months ago
I once was managing a team of data scientists and my boss kept getting frustrated about some incorrectnesses she discovered, and it was really difficult to explain that this is just human error and it would take lots of resources to ensure 100% correctness.
The same with code.
It’s a cost / benefits balance that needs to be found.
AI just adds another opportunity into this equation.
ncr100|7 months ago
vonneumannstan|7 months ago
People act like this is some new thing, but this is exactly what supervising a more junior coworker is like. These models won't stay performing at Jr. levels for long. That is clear.
LgLasagnaModel|7 months ago
NoboruWataya|7 months ago
chrisgd|7 months ago
anentropic|7 months ago
Fomite|7 months ago
chairmansteve|7 months ago
dkga|7 months ago
mdale|7 months ago
colinnordin|7 months ago
Also, do you really understand what the numbers in that spreadsheet mean if you have not been participating in pulling them together?
d--b|7 months ago
It just makes people quite a bit faster at what they’re already doing.
guluarte|7 months ago
eitally|7 months ago
jstummbillig|7 months ago
exitb|7 months ago
iwontberude|7 months ago