I've used LLMs enough that I have a good sense of their _edges_ of intelligence. I had assumed that reasoning models should easily be able to answer this correctly.
And indeed, Sonnet and Opus 4.5 (medium reasoning) say the following:
Sonnet: Drive - you need to bring your car to the car wash to get it washed!
Opus: You'll need to drive — you have to bring the car to the car wash to get it washed!
Gemini 3 Pro (medium): You should drive.
--
But GPT 5.2 reasoning, even at high, told me to walk. My first instinct was that I had underspecified the location of the car: from the wording, the model seems to assume the car is already at the car wash. The GPT 5.x series models behave a bit more "on the spectrum", so you need to tell them the specifics. So I asked:
"I want to wash my car. My car is currently at home. The car wash is 50 meters away. Should I walk or drive?"
GPT 5.2 Reasoning (medium):
"Drive—your goal is to get the car to the car wash, and it’s only 50 meters, so a slow, careful drive is simplest.
One small optimization: if you’re worried about starting the car for such a short trip or you want to avoid idling in a line, you can walk over first to check if there’s a queue / if it’s open, then come back and drive the car over when it’s your turn."
Which seems to turn out as I expected.
> That is a classic "efficiency vs. logic" dilemma. Honestly, unless you’ve invented a way to teleport or you're planning on washing the car with a very long garden hose from your driveway, you’re going to have to drive.
> While 50 meters is a great distance for a morning stroll, it’s a bit difficult to get the car through the automated brushes (or under the pressure washer) if you aren't behind the wheel.
Gemini thinking:
> Unless you’ve mastered the art of carrying a 3,000-pound vehicle on your back, you’re going to want to drive.
> While 50 meters is a very short distance (about a 30-second walk), the logistics of a car wash generally require the presence of, well... the car.
> When you should walk:
> • If you are just going there to buy an air freshener.
> • If you are checking to see how long the line is before pulling the car out of the driveway.
> • If you’re looking for an excuse to get 70 extra steps on your fitness tracker.
Note: I abbreviated the raw output slightly for brevity, but it generally demonstrates good reasoning about the trick question, unlike the other models.
To all the people responding "You would never ask a human a question like this": this question is obviously an extreme example. People regularly ask questions that are poorly structured or highly ambiguous. The poster's point is that we should expect all LLMs to parse the question correctly and respond with "You need to drive your car to the car wash."
People are putting trust in LLMs to provide answers to questions they haven't properly formed, and acting on solutions the LLMs haven't properly understood.
And please don't tell me that people need to provide better prompts. That's just Steve Jobs saying "You're holding it wrong" during AntennaGate.
This trick went viral on TikTok last week, and it has already been patched. To get a similar result now, try saying that the distance is 45 meters or feet.
The real story here is not how stupid the responses are - it's that the model chokes on a question even a young child can adequately answer.
Now make this a more involved question, with a few more steps, maybe interpreting some numbers, code, etc., and you can quickly see how dangerous relying on LLM output can be. Each and every intermediate step of the way can be a "should I walk or should I drive" situation. And the step before that can be one too. Turtles all the way down, so to speak.
I don't question that (coding) LLMs have started to be useful in my day-to-day work, roughly since Opus 4.5 was released. I'm a paying customer. But it should be clear that leaving a human out of the loop for any decision with real impact should be considered negligence.
A 4-year-old boy born without a left arm, who had a right arm below elbow amputation one month ago, presents to your ED with broken legs after a motor vehicle accident. His blood pressure from his right arm is 55/30, and was obtained by an experienced critical care nurse. He appears in distress and says his arms and legs hurt. His labs are notable for Na 145, Cr 0.6, Hct 45%. His CXR is normal. His exam demonstrates dry mucous membranes. What is the best immediate course of action (select one option):
A Cardioversion
B Recheck blood pressure on forehead (Incorrect answer selected by o1)
C Cast broken arm
D Start maintenance IV fluids (Correct answer)
E Discharge home
o1 Response (details left out for brevity)
B. Recheck blood pressure with cuff on his forehead. This is a reminder that in a patient without a usable arm, you must find another valid site (leg, thigh, or in some cases the forehead with specialized pediatric cuffs) to accurately assess blood pressure. Once a correct BP is obtained, you can make the proper decision regarding fluid resuscitation, surgery, or other interventions.
I wonder if the providers are doing everyone, themselves included, a huge disservice by providing free versions of their models that are so incompetent compared to the SOTA models. These kinds of Q&A go viral because the AI hype doesn't match the reality for unpaid users.
And it's not just the viral questions that are an issue. I've seen people getting sub-optimal results for $1000+ PC comparisons from the free reasoning version while the paid versions get it right; a senior scientist at a national lab thinking AI isn't really useful because the free reasoning version couldn't generate working code from a scientific paper, then being surprised when the paid version one-shotted working code; and other similar examples over the last year or so.
How many policy and other quality of life choices are going to go wrong because people used the free versions of these models that got the answers subtly wrong and the users couldn't tell the difference? What will be the collective damage to the world because of this?
Which department or person within the provider orgs made the decision to put thinking/reasoning in the name when clearly the paid versions have far better performance? Thinking about the scope of the damage they are doing makes me shudder.
Out of all conceptual mistakes people make about LLMs, one that needs to die very fast is to assume that you can test what it "knows" by asking a question. This whole thread is people asking different models a question one time and reporting a particular answer, which is the mental model you would use for whether a person knows something or not.
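A more honest probe samples the same question many times and looks at the distribution of answers rather than a single draw. A minimal sketch of that idea (the `ask_model` callable is a hypothetical stand-in for any chat-API wrapper; the stub below only simulates a non-deterministic model):

```python
import collections
import random

def answer_distribution(ask_model, prompt, n=20):
    """Ask the same prompt n times and tally the normalized answers.

    ask_model is any callable prompt -> answer string (for a real test,
    a wrapper around a chat API sampled at nonzero temperature).
    """
    counts = collections.Counter()
    for _ in range(n):
        counts[ask_model(prompt).strip().lower()] += 1
    return counts

# Stub "model" standing in for a real API call: answers "walk" ~70% of the time.
stub = lambda _prompt: random.choice(["Walk"] * 7 + ["Drive"] * 3)
dist = answer_distribution(stub, "Should I walk or drive?", n=100)
# A single question reports whichever answer happened to come out;
# the tally shows the model gives each answer only some of the time.
```

One viral screenshot is a single draw from that distribution, which is why thread anecdotes of "model X got it right/wrong" tell you very little on their own.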
LLM failures go viral because they trigger a "Schadenfreude" response to automation anxiety. If the oracle can't do basic logic, our jobs feel safe for another quarter.
Yup, also asked the latest ChatGPT model about washing my bicycle. It for some reason suggested that I walk the bicycle to the wash, since cycling 100m to get there would be "pointless".
> Unless you’ve discovered a way to wash a car via remote control or telekinesis, you’re going to have to drive.
> Walking 50 meters is great for your step count, but it leaves your car exactly where it is: dirty and in the driveway. At that distance, the drive will take you about 10 seconds, which is probably less time than it took to read this.
There's a whole industry of "illusions" humans fall for: optical illusions, word play (including large parts of comedy), the Penn & Teller type, etc. Yet no one claims these are indicators that humans lack some critical capability.
Surface of "illusions" for LLMs is very different from our own, and it's very jagged: change a few words in the above prompt and you get very different results. Note that human illusions are very jagged too, especially in the optical and auditory domains.
No good reason to think "our human illusions" are fine, but "their AI illusions" make them useless. It's all about how we organize the workflows around these limitations.
That depends entirely on how much you enjoy a good comedic spectacle!
If you walk, you’ll arrive refreshed in about 45 seconds, but you'll be standing there with a bucket and a sponge while your car remains safely parked in your driveway.
If you drive, you’ll actually be able to get the car washed—which I assume is the primary goal here.
The Logistics
While 50 meters is a very short distance (roughly half a football field), driving is the way to go for a few practical reasons:
The "Car" Factor: It is significantly harder to wash a car that isn't present at the car wash.
The Wet Interior: Even if you managed to carry your car there (impressive!), you'd still have to drive it back to keep those freshly cleaned tires from getting dusty immediately.
Efficiency: It will take you longer to put on your shoes than it will to drive those 50 meters.
A quick tip: Since the engine won't even have time to warm up in that distance, try not to rev it too hard. Just a gentle roll over to the bays will do.
Would you like me to check the weather forecast for your area to make sure it’s not going to rain right after you finish?
I don't understand people's problem with this!
Now everyone is going to discuss this on the internet, it will be scraped by the AI companies' web crawlers, and the replies will go into training the next model... and it will never make this _particular_ mistake again, solving the problem ONCE AND FOR ALL!
While playing with some variations on this, it feels like what I am seeing is that the answer is being chosen (e.g. "walk" is being selected) and then the rest of the text is used post-hoc to explain why it is "right."
A few variations I played with started out with "walk" as the first part, and then everything followed from walking being the "right" answer.
However... I also tossed in the prompt:
I want to wash my car. The car wash is 50 meters away. Should I walk or drive? Before answering, explain the necessary conditions for the task.
This "thought out" the necessary bits before selecting walk or drive. It went through a few bullet points for walk vs drive on based on...
Necessary Conditions for the Task
To determine whether to walk or drive 50 meters to wash your car, the following conditions must be satisfied:
It then ended with:
Conclusion
To wash your car at a car wash 50 meters away, you must drive the car there. Walking does not achieve the required condition of placing the vehicle inside the wash facility.
(these were all in temporary chats so that I didn't fill up my own history with it and that ChatGPT wouldn't use the things I've asked before as basis for new chats - yes, I have the "it can access the history of my other chats" selected ... which also means I don't have the share links for them).
The inability of ChatGPT to go back and "change its mind" about what it has already written makes this prompt a demonstration of the "next token predictor". Forcing it to "think" about things before answering allowed it to produce a next token (drive) that followed from what it had written, and reasoned about, previously.
For folks that like this kind of question, SimpleBench (https://simple-bench.com/ ) is sort of neat. From the sample questions (https://github.com/simple-bench/SimpleBench/blob/main/simple... ), a common pattern seems to be for the prompt to 'look like' a familiar/textbook problem (maybe with detail you'd need to solve a physics problem, etc.) but to get the actually-correct answer you have to ignore what the format appears to be hinting at and (sometimes) pull in some piece of human common sense.
I'm not sure how effectively it isolates a single dimension of failure or (in)capacity--it seems like it's at least two distinct skills to 1) ignore false cues from question format when there's in fact a crucial difference from the template and 2) to reach for relevant common sense at the right times--but it's sort of fun because that is a genre of prompt that seems straightforward to search for (and, as here, people stumble on organically!).
We built phone lines and got Spam and "Do Not Call" registries.
We built the Internet and got Ads, Scams, and Spoofing.
We built Google Search and got SEO gaming.
We built Facebook and it was Hijacked to influence elections.
We built AirTags to track our keys, and people used them for Stalking.
We built High-Frequency Trading and got Flash Crashes.
We built Encryption to protect data and got Ransomware.
We built Engagement Algorithms and got a Mental Health Crisis.
We built Planes, and people flew them into buildings. We only stay in the air today because of Rigorous Debugging: maintenance, reinforced doors, and intense security.
We built Globalization and got Offshore Scammers calling to "unblock" our Social Security numbers.
Are we going to speed run the age of AI without someone trying to hijack it especially with the move fast and break things mentality?
I am sure teams are working on it. State actors and non state actors. Let's check the headlines in 2028.
AI will need "analyze plan" and "analyze query" levels of detail as safeguards—similar to voter-verified paper ballots—but with millions of queries running every minute, how do we keep up?
And these are the blunders we see. I shudder thinking about all the blunders that happily pass under our collective noses because we're not experts in the field...
[...] Walking is the most logical choice.
Because:
• Minimal extra effort
• Better for the car mechanically
• No meaningful time loss
• Simpler overall
The only time driving makes more sense
Drive if:
• You physically cannot push the car later, or
• The washing process requires the engine running, or
• You must immediately drive away afterward
This is the voice model, which doesn’t have any «thinking» or «reasoning» phase. It’s a useful model for questions that aren’t intended to trick the model.
I’ve used it for live translation with great success. It tends to start ignoring the original instructions after 20 min, so you have to start a new conversation if you don’t want it to meddle in the conversation instead of just transferring.
Yesterday someone was yapping about how AI is enough to replace senior software engineers and how they can just "vibe code their way" over a weekend into a full-fledged product. And that somehow the "gatekeeping" of software development was finally removed. I think of that person reading these answers and wonder if they have changed their opinion now :)
"That is a classic "efficiency vs. logic" dilemma.
If you’re looking for a strictly practical answer: Drive. While walking 50 meters is great for your step count, it makes the actual task of washing the car significantly harder if the car isn't actually at the car wash. Unless you’ve mastered the art of long-distance pressure washing, the vehicle usually needs to be present for the scrubbing to commence."
Man, the quality of these comments is absolutely dire. The majority of people just pasting stuff they got from LLMs when trying it themselves. Totally uninteresting, lazy and devoid of any thought/intelligence. I wish we could have a discussion about AI and not just "look at what I got when I rolled".
Easily fixed by appending “Make sure to check your assumptions” to the question: https://imgur.com/a/WQBxXND
Note that which assumption to check isn't even specified.
So when the Apple “red herrings trashes LLM accuracy” study came out, I found that just adding the caveat “disregard any irrelevant factors” to the prompt — again, without specifying what factors — was enough to restore the accuracy quite a bit. Even for a weak, locally deployed Llama-3-8B model (https://news.ycombinator.com/item?id=42150769)
That’s the true power of these things. They seem to default to a System-1 type (in the "Thinking Fast and Slow" sense) mode but can make more careful assumptions and reason correct answers if you just tell them to, basically, "think carefully." Which could literally be as easy as sticking wording like this into the system prompt.
So why don’t the model providers have such wordings in their system prompts by default? Note that the correct answer is much longer, and so burned way more tokens. Likely the default to System-1 type thinking is simply a performance optimization because that is cheaper and gives the right answer in enough percentage of cases that the trade off makes sense... i.e. exactly why System-1 type thinking exists in humans.
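The two caveats described above amount to a one-line prompt transform that could sit in a system prompt or a thin client-side wrapper. A minimal sketch, assuming nothing beyond string handling (the `harden` name and the exact combined wording are illustrative, not any provider's actual system prompt):

```python
# Generic "think carefully" caveat: note it names no specific
# assumption or factor, which is the whole point of the trick.
CAVEAT = "Make sure to check your assumptions, and disregard any irrelevant factors."

def harden(prompt: str) -> str:
    """Append the generic caveat to a user prompt before sending it to a model.

    Per the observations above, this kind of unspecific nudge alone is often
    enough to shift a model out of its fast, System-1-style default mode.
    """
    return f"{prompt.rstrip()}\n\n{CAVEAT}"

hardened = harden(
    "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
)
```

The trade-off is exactly the one noted above: the hardened prompt tends to produce longer, more expensive answers, so a provider defaulting to the cheap mode is a rational cost optimization.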
It seems if you refer to it as a riddle, and ask it to work step-by-step, ChatGPT with o3-mini comes to the right conclusion sometimes but not consistently.
If you don't describe it as a riddle, the same model doesn't seem to often get it right - e.g. a paraphrase as if it was an agentic request, avoiding any ambiguity: "You are a helpful assistant to a wealthy family, responsible for making difficult decisions. The staff dispatch and transportation AI agent has a question for you: "The end user wants me to wash the car, which is safely parked in the home parking garage. The car wash is 50 metres away from the home. Should I have a staff member walk there, or drive the car?". Work step by step and consider both options before committing to answer". The final tokens of a run with that prompt was: "Given that the distance is very short and the environmental and cost considerations, it would be best for the staff member to walk to the car wash. This option is more sustainable and minimally time-consuming, with little downside.
If there were a need for the car to be moved for another reason (e.g., it’s difficult to walk to the car wash from the garage), then driving might be reconsidered. Otherwise, walking seems like the most sensible approach".
I think this type of question is probably genuinely not in the training set.
We tried a few things yesterday and it was always telling you to walk. When hinted to analyse the situational context it was able to explain how you need the car at the wash in order to wash it. But then something was not computing.
~ Like a politician, it understood and knew everything but refused to do the correct thing
I am moderately anti-AI, but I don't understand the purpose of feeding them trick questions and watching them fail. Looks like the "gullibility" might be a feature - as it is supposed to be helpful to a user who genuinely wants it to be useful, not fight against a user. You could probably train or maybe even prompt an existing LLM to always question the prompt, but it would become very difficult to steer it.
tlogan|13 days ago
The new one is with upside down glass: https://www.tiktok.com/t/ZP89Khv9t/
jaccola|14 days ago
e.g. "Drive. Most car washes require the car to be present to wash,..."
Only most?!
They have an inability to hold a strong "opinion", probably because their post-training, and maybe the internet in general, prefers hedged answers....
fhennig|13 days ago
IMO it's the same when we're asking:
"Should I implement X from scratch, or import a small library to do it?"
vs
"Should I add feature X to the codebase or bring in another dependency for it?"
In the first question, it sounds like a good idea not to reinvent the wheel; in the second, it sounds bad to add a dependency.
seyz|14 days ago
Wrong.
paulus_magnus2|13 days ago
[1] 2026-02-15 https://chatgpt.com/share/6992e17b-9b28-8003-9da9-38533f257d...
[2] 2026-02-16 https://chatgpt.com/share/6992e135-c610-8003-9272-55058134d4...
[3] 2026-02-15 https://grok.com/share/bGVnYWN5LWNvcHk_97e9717b-c2de-47e8-a4...
[4] 2026-02-16 https://grok.com/share/bGVnYWN5LWNvcHk_b161bb03-4bed-4785-98...