If I take a step back and think back to say a few (or 5) years ago, what LLMs can do is amazing. One has to acknowledge that (or at least, I do). But as a scientist it's been rather interesting to probe the jagged edge and unreliability, including using deep research tools, on any topic I know well.
If I read through the reports and summaries it generates, they seem at first glance correct - the jargon is used correctly, and physical phenomena are referred to mostly accurately. But very quickly I realize that, even with the deep research features and citations, it's making a bunch of incorrect inferences that likely arise from certain concepts (words, really) co-occurring in documents without actually being causally linked or otherwise fundamentally connected. In addition to some strange leading sentences and arguments, this often ends up creating entirely inappropriate topic headings/sections connecting things that really shouldn't be together.
One small example of course, but this type of error (usually multiple errors) shows up in both Gemini and OpenAI models, even with some very specific prompts and multiple turns. And it keeps happening for topics in the fields I work in within the physical sciences and engineering. I'm not sure one could RL hard enough to correct this sort of thing (and it is not likely worth the time and money), but perhaps my imagination is limited.
How do you find they compare?
I think those in the computer science field see passable results of LLM use with respect to software and papers and start assuming other engineering fields should be easy.
They fail to understand that other engineering fields' documentation and processes are awful. Not that computer science is good; it is even less rigorous.
The difference is other fields don’t log every single change they make into source control and have millions of open source projects to pull from. There aren’t billions of books on engineering to pull from like with language. The information is siloed and those with the keys now know what it’s worth.
This is the model conflating correlation with causation. Perhaps with more data spurious correlations would disappear, but the 'right' way is to make the models learn causal world models.
This is a good articulation of what is a real concern around the AI bull thesis.
If a calculator works great 99% of the time you could not use that calculator to build a bridge.
Using AI for more than code generation is still very difficult and requires a human in the loop to verify the results. Sometimes using AI ends up being less productive because you're spending all your time debugging its outputs. It's great, but there are also a lot of questions about whether this technology will ultimately lead to the productivity gains that many think are guaranteed in the next few years. There is a non-zero chance it ends up actually hurting productivity because of all the time wasted trying to get it to produce magic results.
> If a calculator works great 99% of the time you could not use that calculator to build a bridge.
We know for certain that certified lawyers have committed malpractice by using ChatGPT, in part because the made-up citations are relatively easy to spot. Malpractice by engineers might take a little more time to discover.
A pedantic but maybe-not-entirely-pedantic point: It depends on what you mean by 99%.
If the calculator has a little gremlin in it that rolls a random 100-sided die, and gives you the wrong answer every time it rolls a 1, then you certainly can use it to build a bridge. You just need to do each calculation say 10 or 20 times and take the majority answer :)
If the gremlin is clever, it might remember the wrong answers it gave you, and then it might give them to you again if you ask about the same numbers. In that case you might need to buy 10 or 20 calculators that all have different gremlins in them, but otherwise the process is the same.
Of course if all your gremlins consistently lie for certain inputs, you might need to do a lot of work to sample all over your input space and see exactly what sorts of numbers they don't like. Then you can breed a new generation of gremlins that...
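To make the gremlin arithmetic concrete, here is a minimal sketch of the majority-vote idea in TypeScript; unreliableAdd is a made-up stand-in for the flaky calculator, assuming independent 1% errors, not anything described above:
// Hypothetical flaky calculator: returns a wrong sum ~1% of the time.
function unreliableAdd(a: number, b: number): number {
  return Math.random() < 0.01 ? a + b + 1 : a + b;
}
// Ask the same question several times and take the most common answer.
// With independent 1% errors, the majority of 15 trials is essentially never wrong.
function majorityAdd(a: number, b: number, trials = 15): number {
  const counts = new Map<number, number>();
  for (let i = 0; i < trials; i++) {
    const result = unreliableAdd(a, b);
    counts.set(result, (counts.get(result) ?? 0) + 1);
  }
  let best = NaN;
  let bestCount = -1;
  for (const [value, count] of counts) {
    if (count > bestCount) {
      best = value;
      bestCount = count;
    }
  }
  return best;
}
console.log(majorityAdd(2, 2)); // 4, with overwhelming probability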
I believe it absolutely will. I think eventually we'll get to a point where people will be measured on how well they can get the AI to behave and how good they are at keeping cost down.
My boss built an AI workflow that cost over $600 that does the same thing as one I had already given him that cost less than $30. He just wanted to use tools he found and do it his way. Now, this had some value: it got more people in the company exposed to AI and he learned from the experience. It's his prerogative as the owner of the company. Though he also isn't concerned about the cost and will continue to pay much more. For now. I think as time goes on this will be more scrutinized.
This doesn't seem like the first time engineers try to work with something useful that is only partially reliable.
The solution is to play to its strengths and reinforce it with other mediums. You don't build structures with pure concrete. You add rebar. You don't build ships out of only sail and you don't build rail with just iron. You compose materials in a way that makes sense.
LLMs are most useful when the output is immediately verifiable. So let's build frameworks that take that as a core principle. Build everything around verification. And use LLMs for their strengths.
That's happened before with a far higher correctness rate than 99%, and it cost Intel $500M. Reliability and accuracy matter. https://en.wikipedia.org/wiki/Pentium_FDIV_bug
What we are seeing with our customers is that LLM errors are a very manageable problem. End users adapt pretty quickly to the idea that AI systems aren't perfect. In many cases AI products are doing tasks that used to be done by humans and these humans were making mistakes too, so the end user is used to the idea that the task will get accomplished with some non-zero error rate.
You just need to build your products in a manner where the user has the ability to easily double-check the results whenever they like. Then they can audit as they see fit, in order to get used to the accuracy level and to apply additional scrutiny to cases that are very important to their business.
But if the alternative is doing calculations by hand (writing code manually) there is a higher chance of making mistakes.
Just like calculations are double-checked while building bridges, unit tests and code reviews should catch bugs introduced by LLM-written code.
Good article. Agree that general unreliability will continue to be an issue since it's fundamental to how LLMs work. However, it would surprise me if there was still a significant gap between single-turn and multi-turn performance in 18 months. Judging by improvements in the last few frontier model releases, I think the top AI labs have finally figured out how to train for multi-turn and agentic capabilities (likely RL) and just need to scale this up.
Reasoning is just the worst kind of stopgap measure. The state that should emerge internally is instead forced through automated prompts. And you can clearly see this because the models rarely follow their own "reasoning". It's just automated self-prompting.
MongoDB was basically "vibe coding" for RDBMSs. After the hype cycle, there will be a wasteland of unmaintainable vibe-coded products that companies will have to pump unlimited amounts of money into to maintain.
I think we mythologize the relational model a bit too much to call nosql dbs vibe coding. DynamoDB is quite good and you can point to some very large customers using it successfully.
A few months ago I asked ChatGPT to create a max operating depth table for scuba diving based on various PPO2 limits and EAN gas profiles, just to test it on something I know (it's a trivially easy calculation, and the formula is readily available online). It got it wrong…multiple times…even after correction and supplying the correct formula, the table was still repeatedly wrong (it did finally output a correct table). I just tried it again, with the same result. Obviously not something I would stake my life on anyway, but if it's getting something so trivial wrong, I'm not inclined to trust it on more complex topics.
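For what it's worth, the calculation being described is a one-liner; here is a rough sketch in metric units, assuming the standard approximation of 10 m of seawater per atmosphere (the specific limits and mixes below are illustrative, not the commenter's exact prompt):
// Max operating depth (metres of seawater): MOD = (PPO2 / FO2 - 1) * 10,
// using ~10 msw per atmosphere of pressure.
function maxOperatingDepth(ppo2Limit: number, fo2: number): number {
  return (ppo2Limit / fo2 - 1) * 10;
}
// Small example table for a few nitrox mixes at 1.4 and 1.6 bar PPO2 limits.
const mixes = [0.32, 0.36, 0.40]; // EAN32, EAN36, EAN40
for (const fo2 of mixes) {
  const mod14 = maxOperatingDepth(1.4, fo2).toFixed(1);
  const mod16 = maxOperatingDepth(1.6, fo2).toFixed(1);
  console.log(`EAN${Math.round(fo2 * 100)}: ${mod14} m @ 1.4 bar, ${mod16} m @ 1.6 bar`);
}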
There are jobs out there that have always been unreliable.
A classic example is the Travel Agent. This was already a job driven to near-extinction just by Google, but LLMs are a nail in the travel agent coffin.
The job was always fuzzy. It was always unreliable. A travel agent recommendation was never a stamp of quality or guarantee of satisfaction.
But now, I can ask an LLM to compare and contrast two weeks in the Seychelles with two weeks in the Caribbean, have it then come up with sample itineraries and sample budgets.
Is it going to be accurate? No, it'll be messy and inaccurate, but sometimes a vibe check is all you ever wanted to confirm that yeah, you should blow your money on the Seychelles, or to confirm that actually, you were right to pick the Caribbean.
Or that actually, both are twice the amount you'd prefer to spend - so where, dear ChatGPT, would be more suitable?
etc.
When it comes down to the nitty-gritty, does it start hallucinating hotels and prices? Sure, at that point you break out TripAdvisor, etc.
But as a basic "I don't even know where I want to go on holiday ( vacation ), please help?" it's fantastic.
Once they start making deals with the relevant organizations, book rooms, handle insurance, replacement hotels, etc, then they'll replace travel agents. These guys don't just Google a bunch of tickets you know.
Yes, which is why it's slightly confusing that LLMs are being pushed so hard for programming. For things that don't need completely accurate information, sure. But for programming, data, and factual information, it's surprising to see so many people using LLMs.
I have used it on three big family vacations already and it's definitely a place where "AI" shines in usefulness. It did recommend some out-of-business hotels and things but the broad strokes were good enough to save hours of work.
LLMs can't evaluate their own output. LLMs suggest possibilities, but can't evaluate them. Imagine an insane man who is rambling something smart, but doesn't self-reflect. The evaluation is done against some framework of values that are considered true: the rules of a board game, the language syntax or something else. LLMs also can't fabricate evaluation because the latter is a rather rigid and precise model, unlike natural language. Otherwise you could set up two LLMs questioning each other.
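One concrete reading of "evaluation against a rigid framework": pair the generator with an external, non-LLM checker and only accept output that passes it. A minimal sketch, where askLLM is a hypothetical stand-in for any model call and the verifier is just a shape check (it could equally be a compiler, a test suite, or a game's rule engine):
// Stand-in for a real model call; swap in any LLM client here.
async function askLLM(prompt: string): Promise<string> {
  return `{"title": "stub answer to: ${prompt}", "steps": []}`;
}
// The rigid framework: a precise, non-LLM check the output must satisfy.
function verify(output: string): boolean {
  try {
    const parsed = JSON.parse(output);
    return typeof parsed.title === "string" && Array.isArray(parsed.steps);
  } catch {
    return false;
  }
}
// Let the LLM propose; let the rigid framework dispose.
async function generateChecked(prompt: string, maxAttempts = 3): Promise<string> {
  for (let i = 0; i < maxAttempts; i++) {
    const candidate = await askLLM(prompt);
    if (verify(candidate)) return candidate;
  }
  throw new Error("no candidate passed verification");
}
The value comes from the verifier being cheap and deterministic, so the unreliable generator can be retried or rejected without a human in the loop.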
Isn't this kind of the hope/dream of multi-agent systems where one LLM "coordinates" among others or checks the responses? In my experience it works about as well as you're describing.
It's hard to say "never" in technology. History isn't really on your side. However, LLMs have largely proven to be good at things computers are already good at: repetitive tasks, parallel processing, and data analysis. There's nothing magical about an LLM that seems to be defeating the traditional paradigm. Increasingly I lean toward an implosion of the hype cycle for AI.
LLMs are a legitimate technology with legitimate applications. However, in a desperate bid for a new iPhone moment to assure Wall Street that the fantasy of infinite growth in a finite world is possible, they have utterly lost the plot regarding what statistical analysis of words at scale is capable of doing. Useless? Far from it. The basis for a $300 billion company with no meaningful products after almost a decade working on it? I have doubts.
I can't fathom a future where OpenAI doesn't eat dirt, with Anthropic likely not far behind it. Nvidia will likely come out fine, since it still has gamers to disappoint, and the infrastructure build-out that did occur will crater the cost of GPUs at scale for smaller, smarter companies to take advantage of. So the technology will likely still kick around, but as just another technology, not the second coming of Cyber Christ it's been hyped to be.
Or being able to explain the static physical forces in a picture that are keeping a structure from collapsing.
Or recommend me a python library which does X, Y and Z with constraints A, B and C.
But I guess you can file all the above under "data analysis".
Unreliability doesn't matter for some people because their bar was already that low. Unfortunately this is the way of the world and quality has and will continue to suffer. LLMs mostly accelerate this problem... hopefully they get good enough to help solve it.
Has anyone experimented with an ensemble + synthesizer approach for reliability? I'm thinking: make n identical requests to get diverse outputs, then use a separate LLM call to synthesize/reconcile the distinct results into a final answer. Seems like it could help with the consistency issues discussed here by leveraging the natural variance in LLM outputs rather than fighting it. Any experience with this pattern?
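A rough sketch of that pattern, with askLLM as a hypothetical stand-in for the underlying model call (n identical sampling requests, then one synthesis request):
// Stand-in for a model call; in practice this would hit a real LLM API.
async function askLLM(prompt: string): Promise<string> {
  return `one sampled answer to: ${prompt}`;
}
// Ensemble + synthesizer: n independent samples, then one call to reconcile them.
async function ensembleAnswer(question: string, n = 5): Promise<string> {
  const samples = await Promise.all(
    Array.from({ length: n }, () => askLLM(question))
  );
  const synthesisPrompt =
    `Here are ${n} independent answers to the question "${question}":\n` +
    samples.map((s, i) => `${i + 1}. ${s}`).join("\n") +
    `\nReconcile them into a single best answer and flag any disagreements.`;
  return askLLM(synthesisPrompt);
}
This leans on the same sampling variance as the majority-vote trick above, with the synthesizer call doing the "vote".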
The field where LLMs are most successful, software development, is also a place where many software developers are paid to use LLMs. I have colleagues who are reluctant to express their skepticism publicly for just this reason.
Not very hard to understand, except it seems to be.
I think and say this all the time. But people keep saying that AI will take all our jobs and I'm so utterly confused by this.
Sometimes I wonder if I have gone mad or everyone else has.
Well. In an ideal world, LLMs would be used this way, as a tool to help automate the bullshit and let the person driving worry about other stuff.
But I never see them actually used this way. At the big institution end, companies and universities will continue to force AI tools on their employees in heavy handed and poorly thought out ways, and use it as an excuse to fire people whenever budgets get tight (or investors demand higher profits). At the opposite scale, with individual users, it’s really alarming how rapidly people seem to stop thinking with their own brain and offload all critical thinking to an LLM. That’s not “extending your capabilities,” that’s letting all your skills atrophy while you train a machine to be your shitty replacement.
I'm no AI fan, but articles talking about the shortcomings of LLMs seem to amount to complaining that forks aren't good for drinking soup.
Don't use LLMs to do 2 + 2. Don't use LLMs to ask how many r's are in strawberry.
For the love of God. It's not actual intelligence. This isn't hard. It just randomly spits out text. Use it for what it's good at instead. Text.
Instead of hunting for how to do things in programming using an increasingly terrible search engine, I just ask ChatGPT. For example, this is something I've asked ChatGPT in the past:
in typescript, I have a type called IProperty<T>, how do I create a function argument that receives a tuple of IProperty<T> of various T types and returns a tuple of the T types of the IProperty in order received?
This question that's such an edge case that I wasn't even sure how to word properly actually yielded the answer I was looking for.
function extractValues<T extends readonly IProperty<any>[]>(
props: [...T]
): { [K in keyof T]: T[K] extends IProperty<infer U> ? U : never } {
return props.map(p => p.get()) as any;
}
This doesn't look unreliable to me. It actually feels pretty useful. I just need [...T] there and infer there.
The thing is, I have spent the last year being told that I will VERY SOON be able to use a fork to drink soup, and better than any spoon has ever been able to, and in fact pretty soon spoons will be completely outclassed anyway, and I'M the idiot for doubting this.
Articles like this are still very much needed, to push back against that narrative, regularly, until it DOES become as obvious to everyone as it is to you.
But use them to do more important things that require more precision and accuracy?
No thanks.
The problem is exactly how the public will learn "not to ask 2+2". When you have a well-trained professional using an LLM it's all great. They know how to separate hallucination from actually good results, as you do. The problem lies with the general public and new workers who will, no question about it, use the AI-generated results as some sort of truth.
So many times I've asked questions just like this and gotten complete nonsense incorrect answers. In fact, you have no guarantees whatsoever that even the typescript question you asked will always return a sensible answer.
I'm by no means saying that LLMs aren't useful. They're just not reliably useful.
I think I'm settling on a "Gell-Mann Amnesia" explanation of why people are so rabidly committed to the "acceptable veracity" of LLM output. When you don't know the facts, you're easily misled by plausible-sounding analysis, and having been misled -- a certain default prejudice toward existing beliefs takes over. There's a significant asymmetry of effort in belief change vs. acquisition. I think there's also an ego-protection effect here too: if I have to change my belief then I was wrong.
There are Socratically minded people who are more addicted to that moment of belief change, and hence overall vastly more sceptical -- but I think this attitude is extremely marginal. And it probably requires a lot of self-training to be properly inculcated.
In any case, with LLMs, people really seem to hate the idea that their beliefs about AI and their reliance on LLM output could be systematically mistaken. All the while, when shown output in an area of their expertise, realising immediately that it's full of mistakes.
This, of course, makes LLMs a uniquely dangerous force in the health of our social knowledge-conductive processes.
You need to be pushing much more data in than you're getting out. 40k tokens of input can result in 400 actual quality tokens of output. Not giving enough input to work off of will result in regressed output.
It's basically like a funnel, which can also be used the other way around if the user is okay with quirky side effects. It feels like a lot of people are using the funnel the wrong way around and complaining that it's not working.
Bullshit works on lots of people. Seeming to be true, or even just plausible, is enough for most people. This is why powerful bullshit machines are dangerous tools.
> Internally, it uses a sophisticated, multi-path strategy, approximating the sum with one heuristic while precisely determining the final digit with another. Yet, if asked to explain its calculation, the LLM describes the standard 'carry the one' algorithm taught to humans.
So, the LLM isn't just wrong, it also lies...
The LLM has no relevant capacities, either to tell the truth or to lie. It generates "appropriate" text, given a history of cases of appropriate textual structures.
It is the person reading this text as if it were written by a person who imparts these capacities to the machine and treats the text as meaningful. But almost no text the LLM generates could be said to be meaningful, if any.
In the sense that if a two year old were taught to say, "the magnitude of the charge on the electron is the same as the charge on the proton", one would not suppose the two year old meant what was said.
Since the LLM has no interior representational model of the world, only a surface of text tokens laid out as if it did, its generation of text never comes into direct contact with a system of understanding that text. Therefore the LLM has no capacities ever implied by its use of language; it only appears to.
This appearance may be good enough for some use cases, but as an appearance, it's highly fragile.
An LLM can't self-reflect. It doesn't know what happens in its own circuits. If you ask it, it will tell you what it knows (from the articles about LLMs it has ingested), and if it doesn't know, it will hallucinate something, as is often the case.
Since the LLM has no knowledge of how LLMs do addition, it will pick something that seems to make sense, and it picked the "carry the one" algorithm. New generations of LLMs will probably do better now that they have access to a better answer for that specific question, but it doesn't mean that they have become more insightful.
No, because the LLM is a tool without any feeling and consciousness, as the article rightfully points out. It doesn't have the ability to scrutinize its own internals, nor the ability to wonder whether that would be something relevant to do.
Those who lie (possibly even to themselves) are those who pretend that mimicry, if stretched enough, will surpass the actual thing, and who foster deceptive psychological analogies like "hallucinate".
I have been using LLM coding tools to make stuff which I had no chance of making otherwise. They are MVPs, and if anything ever got traction I am very aware that I would need to hire a real dev. For now, I am basically a PM and QA person.
What really concerns me is that the big companies on whose tools we all rely are starting to push a lot of LLM generated code without having increased their QA.
I mean, everybody cut QA teams in recent years. Are they about to make a comeback once big orgs realize that they are pushing out way more bugs?
Am I way off base here?
Hallucinations are essentially the only thing keeping all knowledge workers from being made permanently redundant. If that doesn't make you a little concerned then you are a fool. And the predictions of all the experts in 2010 were that what is currently happening right in front of us could never happen within a hundred years. Why are the predictions of experts more reliable now? Anyone who dismisses the risks is just a sorry fool.
I'm a knowledge worker (electrical engineer) but not one bit worried about being replaced by AI in the foreseeable future. It would not only need to be reliable, but also be able to create, as in create physically working complex systems, for me to be worried. I have not seen anything remotely close to this yet.
I believe AI/ML will eventually get there but definitely not with LLMs or by hoarding the whole internet. Most of the human know-how isn't on the internet!
Oh, I guess I'm a fool.
Large language models reliably produce misinformation that appears plausible only because it mimics human language. They are dangerous toys that cannot be made into tools that are safe to use.
I think this misses some of the core problems, and it suggests there are some more straightforward solutions. We have no solutions to this, and the way we're treating this means we aren't going to come up with solutions.
Problem 1: Training
Using any method like RLHF, DPO, or such guarantees that we train our models to be deceptive.
This is because our metric is the Justice Potter Stewart metric: I know it when I see it. Well, you're assuming that this is accurate. The original case was about defining porn and well... I don't think it is hard to see how people even disagree on this. Go on Reddit and ask if girls in bikinis are safe for work or not. But it gets worse. At times you'll be presented with the choice between two lies. One lie you know is a lie and the other you don't know is a lie. So which do you choose? Obviously the latter! This means we optimize our models to deceive us. This is true too when we come to the choice between a truth and a lie we do not know is a lie. They both look like truths.
This will be true even in completely verifiable domains. The problem comes down to truth not having infinite precision. A lot of truth is contextually dependent. Things often have incredible depth, which is why we have experts. As you get more advanced those nuances matter more and more.
Problem 2: Metrics and Alignment
All metrics are proxies. No ifs, ands, or buts. Every single one. You cannot obtain direct measurements which are perfectly aligned with what you intend to measure.
This can be easily observed with even simple forms of measurements like measuring distance. I studied physics and worked as an (aerospace) engineer prior to coming to computing. I did experimental physics, and boy, is there a fuck ton more complexity to measuring things than you'd guess. I have a lot of rules, calipers, micrometers and other stuff at my house. Guess what, none of them actually agree on measurements. They all are pretty close, but they do differ within their marked precision levels. I'm not talking about my ruler with mm hatch marks being off by <1mm, but rather >1mm. RobertElderSoftware illustrates some of this in this fun video[0]. In engineering, if you send a drawing to a machinist and it doesn't have tolerances, you have actually not provided them measurements.
In physics, you often need to get a hell of a lot more nuanced. If you want to get into that, go find someone that works in an optics lab. Boy does a lot of stuff come up that throws off your measurements. It seems straightforward; you're just measuring distances.
This gets less straightforward once we talk about measuring things that aren't concrete. What's a high fidelity image? What is a well written sentence? What is artistic? What is a good science theory? None of these even have answers and are highly subjective. The result of that is your precision is incredibly low. In other words, you have no idea how you align things. It is fucking hard in well defined practical areas, but the stuff we're talking about isn't even close to well defined. I'm sorry, we need more theory. And we need it fast. Ad hoc methods will get you pretty far, but you'll quickly hit a wall if you aren't pushing the theory alongside it. The theory sits invisible in the background, but it is critical to advancements.
We're not even close to figuring this shit out... We don't even know if it is possible! But we should figure out how to put bounds, because even bounding the measurements to certain levels of error provides huge value. These are certainly possible things to accomplish, but we aren't devoting enough time to them. Frankly, it seems many are dismissive. But you can't discuss alignment without understanding these basic things. It only gets more complicated, and very fast.
[0] https://www.youtube.com/watch?v=EstiCb1gA3U
My experience with LLM-based chat is so different from what the article (and some friends) describe.
I use LLM chat for a wide range of tasks including coding, writing, brainstorming, learning, etc.
It's mostly right enough. And so my usage of it has only increased and expanded. I don't know how much less right it would need to be, or how much more often, for me to reduce my usage.
Honestly, I think it's hard to change habits, and LLM chat, at its most useful, is attempting to replace decades-long habits.
Doesn’t mean quality evaluation is bad. It’s what got us where we are today and what will help us get further.
My experience is anecdotal. But I see this divide in nearly all discussions about LLM usage and adoption.
Honestly this is why your experience is different: your expectations are different (and likely lower). I never find they are "mostly right enough", I find they are "mostly wrong in ways that range from subtle mistakes to extremely incorrect". The more subtly they are wrong, the worse I rate their output actually, because that is what costs me more time when I try to use them
I want tools that save me time. When I use LLMs I have to carefully write the prompts, read and understand, evaluate, and iterate on the output to get "close enough" then fix it up to be actually correct.
By the time I've done all of that, I probably could have just written it from scratch.
The fact is that typing speed has basically never been the bottleneck for developer productivity, and LLMs basically don't offer much except "generate the lines of code more quickly" imo
Rather than a simple difference in expectation (which could explain your positive experience vs others'), it seems to be a "comfort within uncertainty" difference that, from what I can tell, is a personality trait!
You're comfortable with the uncertainty, and accommodate it in your use and expectations. You're left feeling good about the experience, within that uncertainty. Others are repelled by uncertainty, so will have a negative experience, regardless of how well it may work for a subset of tasks they try, because that repulsive uncertainty is always present.
I think it would be interesting (and possibly very useful/profitable for the marketing/UI departments of companies that use AI) to find the relation between perceived AI usefulness and the results of some of the "standard" personality tests.
It's fine if LLMs are used casually, for things that don't affect anyone but the user. But when someone plugs an LLM into Social Security or other governmental bodies to take action on real human beings, then disaster awaits. Nobody is going to care if the LLM got it wrong if you're just chatting with it or writing some wonky code that doesn't matter in the real world, but when your government check is reduced or deleted by an LLM that is hallucinating, then the real problems start. These things should not be trusted with anything but the least consequential actions an individual would use it for.
Charitably, your low expectations are probably the source of your finding them acceptable.
It’s also possible - and you should not take this as an insult, it’s just the way it is - you may not know enough about the subjects of your interactions to really spot how wrong they are.
However the cases you list - brainstorming - don’t really care about wrong answers.
Coding is in the eye of the beholder, but for anything that isn’t junk glue code, scripts or low-complexity web stuff, I find the output of LLMs just short of horrendous.
I really don't understand people who are down on LLMs.
In terms of code output, I have gone from the productivity of a single Sr. Engineer to that of a team with 0.8 of a Sr. Engineer, 5 Jr. Engineers, and one dude solely dedicated to reading/creating documentation.
Unlike a lot of my fellow engineers who are also from traditional CS backgrounds and haven't worked in revenue restricted startup environments, I also have been VERY into interpreted languages like ruby in the past.
Now compiled languages are even better: from a velocity perspective, they are incredibly on par for prototyping and have had their last weakness removed.
It's both exciting and scary. I can't believe how people are still sleepwalking in this environment and don't realize we are in a different world. Once again the human inability to "gut reason" about exponentials is going to screw us all over.
One terribly overlooked thing I've noticed that I think explains the differing takes. Foundation of my position here: https://www.nature.com/articles/s41598-020-60661-8
Within the population that writes code there are a small number of successful people who approach the topic in a ~purely mathematical way, and a small number of successful people who approach writing code in a ~purely linguistic way. Most people fall somewhere in the middle.
Those who are on the MOST extreme end of the mathematical side and are linguistically bereft HATE LLMs and effectively cannot use them.
My guess is that the HN population will tend to show stronger reactions against LLMs because it was heavily seeded with functional programmers, which I think has a concentration of the successful, extremely math-focused. I worked for several years in a purely functional shop and that was my observation: Elixir, Haskell, Ramda.
Just my speculation.
What do you use it for?
In my space, "mostly right enough" isn't useful. Particularly when that means that the errors are subtle and I might miss them. I can't write whitepapers that tell people to do things that would result in major losses.
IMHO it's a great summarizing search engine. I now don't have to click on a link to go to the original source - Gemini just hands me a useful summary. Ask AI to do something specific that requires GI (General Intelligence) and your mileage may vary. So as OpenAI and Google suck in all your content (creators), you are going to find yourself deriving less and less revenue from visits to your site. Just sayin.