FatherOfCurses | 13 days ago

All the people responding "You would never ask a human a question like this" - this question is obviously an extreme example. People regularly ask questions that are structured poorly or contain a lot of ambiguity. The poster's point is that we should be able to expect any LLM to parse the question correctly and respond with "You need to drive your car to the car wash."

People are putting trust in LLMs to answer questions they haven't properly formed, and acting on solutions that the LLMs haven't properly understood.

And please don't tell me that people need to provide better prompts. That's just Steve Jobs saying "You're holding it wrong" during AntennaGate.


jmward01|13 days ago

This reminds me of the old brain-teaser/joke that goes something like 'An airplane crashes on the border of x and y; where do they bury the survivors?' The point being that this exact style of question has real examples where actual people fail to answer it correctly. We mostly learn as kids, through things like brain teasers, to avoid these linguistic traps, but that doesn't mean we don't still fall for them every once in a while.

Retric|13 days ago

That's less a brain teaser than a collision with the error correction people apply to language. That correction is useful when you simply can't hear someone very well, or when the speaker makes a mistake, but it fails when language is intentionally misused.

godelski|13 days ago

I'm actually having a hard time interpreting your meaning.

Are you criticizing LLMs? Highlighting the importance of this training and why we're trained that way even as children? That it is an important part of what we call reasoning?

Or are you giving LLMs the benefit of the doubt, saying that even humans have these failure modes?[0]

Though my point is more that natural language is far more ambiguous than I think people give it credit for. I'm personally always surprised that a bunch of programmers don't understand why programming languages were developed in the first place. The reason they're hard to use is precisely their lack of ambiguity, at least compared to natural languages. And we can see clear trade-offs with how high-level a language is: duck typing is incredibly helpful while also being a major nuisance. It's the same reason even a technical manager often has a hard time communicating instructions. Compression of ideas isn't easy.

[0] I've never fully understood that argument. Wouldn't we call a person stupid for giving a similar answer? How does the existence of stupid mean we can't call LLMs stupid? It's simultaneously anthropomorphising while being mechanistic.

cracki|13 days ago

>bury the *survivors*

I did not catch that in the first pass.

I read it as the casualties, who would be buried wherever the next of kin or the will says they should.

yakbarber|13 days ago

Same as the old "What's heavier, a tonne of coal or a tonne of feathers?" Many, many people will say a tonne of coal...

contravariant|13 days ago

> All the people responding saying "You would never ask a human a question like this"

That's also something people seem to miss about the Turing Test thought experiment. Sure, merely deceiving someone is a thing, but even the simplest chat bot can achieve that. The really interesting implications start when there's genuinely no way to tell a chatbot apart from a human.

TheJoeMan|13 days ago

But it isn't just a brain-teaser. If the LLM is supposed to control, say, Google Maps, then Maps is the one asking "walk or drive" via the API. So if I voice-ask the assistant to take me to the car wash, it should realize it shouldn't show me walking directions.
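
A minimal sketch of that routing decision, as plain logic. Everything here is invented for illustration - the function names, the destination list, and the 800 m walking threshold are assumptions, not the real Google Maps API:

```python
# Hypothetical sketch: picking a travel mode before calling a maps API.
# The destination set and the 800 m walking cutoff are made up for
# illustration purposes.
CAR_REQUIRED = {"car wash", "gas station", "drive-through"}

def choose_mode(destination: str, distance_m: float) -> str:
    # Some destinations only make sense if you bring the car,
    # no matter how close they are.
    if destination in CAR_REQUIRED:
        return "driving"
    return "walking" if distance_m < 800 else "driving"

print(choose_mode("car wash", 50))  # driving, despite being 50 m away
print(choose_mode("bakery", 50))    # walking
```

The point of the sketch is that the mode choice depends on the destination's purpose, not just the distance - exactly the contextual step the assistant is being asked to make.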

jader201|13 days ago

That’s not the problem with this post.

The problem is that most LLMs answer it correctly (see the many other comments in this thread reporting this). OP cherry-picked the few that answered incorrectly, without mentioning any that got it right, implying that 100% of them got it wrong.

thinkling|13 days ago

You can see up-thread that the same model will produce different answers for different people or even from run to run.

That seems problematic for a very basic question.

Yes, models can be harnessed in structures that run a query 100 times and take the "best" answer, and we can claim that if the best answer is right, models therefore "can solve" the problem. But for practical end-user AI use, high error rates are a problem and greatly undermine confidence.
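
The "run it 100x" harness can be sketched in a few lines. Here a random stub stands in for the real model call, and its 70% per-run accuracy is invented purely for illustration:

```python
import random
from collections import Counter

def ask_model(prompt: str) -> str:
    # Stub standing in for a real LLM call: imagine a model that
    # answers this prompt correctly only 70% of the time.
    return "drive" if random.random() < 0.7 else "walk"

def best_of_n(prompt: str, n: int = 101) -> str:
    # Ask the same question n times and take the majority answer.
    votes = Counter(ask_model(prompt) for _ in range(n))
    return votes.most_common(1)[0][0]

random.seed(0)
print(best_of_n("The car wash is 50 m away. Walk or drive?"))
```

With these numbers the majority vote is almost surely "drive" even though individual runs are wrong 30% of the time - which is exactly the gap between "the harness can solve it" and "any single answer can be trusted."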

rluna828|12 days ago

The magic of LLMs is that one LLM can learn everything and then we can clone it. However, if we don't know ahead of time which one will be the best, then we should probably keep a lot of versions with real (mathematically calculated) diversity. Ironically, the DEI peeps were right all along.

serial_dev|13 days ago

My understanding is that it mainly fails in speech mode, because that usually uses the fastest model. Yesterday I tried all the major providers, and they were all correct when I typed my question.

raincole|13 days ago

Naysayers will tell you that OpenAI, Google, and Anthropic all 'monkeypatched' their models (somehow!) after reading this thread, and that's why they answer it correctly now.

You can see those claims in this very thread. Some commenters even believe internal prompts were added for this specific question (as if people aren't attempting to fish out ChatGPT's internal prompts 24/7, and as if there aren't open-weight models that answer this correctly).

You can never win.

jlarocco|13 days ago

Exactly! The problem isn't this toy example. It's all of the more complicated cases where this same type of disconnect is happening, but the users don't have all of the context and understanding to see it.

pvillano|13 days ago

I recently asked an AI a chemistry question that may have an extremely obvious answer; I never studied chemistry, so I can't tell you whether it did. I included as much information about my situation as I could in the prompt. I wouldn't be surprised if the AI's response hinged on a detail that's normally important but didn't apply to my situation, just like the 50 meters.

pvillano|13 days ago

If you're curious or actually knowledgeable about chemistry, here's what happened. My apartment's dishwasher has gaps in the enamel from which rust can drip onto plates and silverware. I tried soaking what I presume to be a stainless steel knife, with a drip of rust on it, in citric acid. The rust turned black and the water turned a dark but translucent blue/purple.

I know nothing about chemistry. My smartest move was to withhold the color and ask what the color might have been. It never guessed blue or purple.

In fact, it first asked me whether this was high school or graduate chemistry. That's not... and it makes me think I'll only get answers to problems that are easily graded, and therefore have only one unambiguous solution.

rluna828|12 days ago

Thanks, excellent catch! Everyone is saying this is a "brain teaser." However, it reminded me of the LLM that thought it was the Golden Gate Bridge; I hadn't been able to say it (or think it) succinctly before. From the Claude writeup: "when we turn up the strength of the 'Golden Gate Bridge' feature, Claude's responses begin to focus on the Golden Gate Bridge. Its replies to most queries start to mention the Golden Gate Bridge, even if it's not directly relevant." Here's the link for those interested: https://www.anthropic.com/news/golden-gate-claude

xdennis|13 days ago

> All the people responding saying "You would never ask a human a question like this"

It would be interesting to actually ask a group of people this question. I'm pretty sure a lot of them would fail.

It feels like one of those puzzles people often fail, e.g.: 'Ten crows are sitting on a power line. You shoot one. How many crows are left to shoot?' People often treat it as a subtraction problem and don't consider that animals flee after gunshots. (BTW, ChatGPT also answers 9.)

dingaling|13 days ago

You assumed gunshots. He could have used a bow and arrow, or a blowpipe.

Loughla|13 days ago

>People regularly ask questions that are structured poorly or have a lot of ambiguity.

The difference between someone who is really good with LLMs and someone who isn't is the same as with technical writing, or with working with other people.

Communication. Clear, concise communication.

And my parents said I would never use my English degree.

CamperBob2|13 days ago

Other leading LLMs do answer the prompt correctly. This is just a meaningless exercise in kicking sand in OpenAI's face. (Well-deserved sand, admittedly.)