Chain-of-thought can hurt performance on tasks where thinking makes humans worse

[+] mitko|1 year ago|reply

This is so uncannily close to the problems we're encountering at Pioneer, trying to make human+LLM workflows in high stakes / high complexity situations.

Humans are so smart and do so many decisions and calculations on the subconscious/implicit level and take a lot of mental shortcuts, so that as we try to automate this by following exactly what the process is, we bring a lot of the implicit thinking out on the surface, and that slows everything down. So we've had to be creative about how we build LLM workflows.

[+] haccount|1 year ago|reply

Language seems to be confused with logic or common sense.

We've observed it previously in psychiatry(and modern journalism, but here I digress) but LLMs have made it obvious that grammatically correct, naturally flowing language requires a "world" model of the language and close to nothing of reality, spatial understanding? social clues? common sense logic? or mathematical logic? All optional.

I'd suggest we call the LLM language fundament a "Word Model"(not a typo).

Trying to distil a world model out of the word model. A suitable starting point for a modern remake of Plato's cave.

[+] lolinder|1 year ago|reply

This is a regression in the model's accuracy at certain tasks when using COT, not its speed:

> In extensive experiments across all three settings, we find that a diverse collection of state-of-the-art models exhibit significant drop-offs in performance (e.g., up to 36.3% absolute accuracy for OpenAI o1-preview compared to GPT-4o) when using inference-time reasoning compared to zero-shot counterparts.

In other words, the issue they're identifying is that COT is an less effective model for some tasks compared to unmodified chat completion, not just that it slows everything down.

[+] 1317|1 year ago|reply

why are Pioneer doing anything with LLMs? you make AV equipment

[+] gpsx|1 year ago|reply

I saw an LLM having this kind of problem when I was doing some testing a ways back. I asked it to order three fruits from largest to smallest. I think it was orange, blueberry and grapefruit. It could do that easily with a simple prompt. When the prompting included something to the effect of “think step by step”, it would try to talk through the problem and it would usually get it wrong.

[+] spockz|1 year ago|reply

How much does this align with how we learn math? We kind of instinctively learn the answers to simple math questions. We can even at some point develop an intuition for things like integrating and differentials. But the moment we are asked to explain why, or worse provide a proof, things become a lot harder. Even though the initial answer may be correct.

[+] ajuc|1 year ago|reply

It's not thinking, it compressed the internet into a clever, lossy format with nice interface and it retrieves stuff from there.

Chain of thought is like trying to improve JPG quality by re-compressing it several times. If it's not there it's not there.

[+] Terr_|1 year ago|reply

Alternate framing: A powerful autocomplete algorithm is being used to iteratively extend an existing document based on its training set. Sometimes you get a less-desirable end-result when you intervene to change the style of the document away from question-and-answer to something less common.

[+] youoy|1 year ago|reply

That's what one half of HN think. The other half:

Artificial brains in the verge of singularity show another sign of approaching consciousness. The chain of thought of process performance is exactly human, showing yet another proof of the arrival of AGI before 2030.

[+] fiso64|1 year ago|reply

A framing that is longer, far harder to parse, and carries less information.

[+] grain-o-salt|1 year ago|reply

Let me give it a try...um...what about 'Star Trek' vs.: A delivering-service called Galaxyray?galaxyray brings wares and hot tasty meals galaxywide to recipients, even while they are 'traveling' with more-than-lightspeed in hyperspace?

> ..ordered by Imperium just to troll the retros!?

Sounds "less comon"...hu...?! P-:

Ok! Ok! let me try to explain it a bit more, the whole Universe projected as a beam, say... scalable, 100m, placed in a storage depot, a 'parralaxy' ...So delivery agents are grabbing the ordered stuff and...no? Not?

As reasonable like your answer is, do that sound very 'uncommon' while 'phrasing that with many questionmarks'?

??

Enjoying my day off... (-: regards,

[+] wg0|1 year ago|reply

Not to mention that chain of thought is computationally very expensive. Prohibitively expensive for sure to be served free like previous generation of Web 2.0 products.

Seems like repeated promoting can't juice AGI out of token probabilities.

Retrospectively, if you can pin point one paper that led to the bust and pop of the AI bubble, this would be it.

[+] varelse|1 year ago|reply

[deleted]

[+] oatsandsugar|1 year ago|reply

Tasks were thinking makes human worse

> Three such cases are implicit statistical learning, visual recognition, and classifying with patterns containing exceptions.

Fascinating that our lizard brains are better at implicit statistical reasoning

[+] brewii|1 year ago|reply

Think about how fast you’re able to determine the exact trajectory of a ball and location to place your hand to catch a ball using your lizard brain.

[+] Dilettante_|1 year ago|reply

Well, by definition, thinking is always explicit reasoning, no?

And I'd hazard a guess that a well-thought through Fermi Estimation beats lizard-brain eyeballing every time, it's just that in the inbetween space the two interfere unfavourably.

[+] daft_pink|1 year ago|reply

this is exactly what I was looking for. tasks where I should not think and just trust my gut.

[+] cainxinth|1 year ago|reply

This says something fascinating about information processing in both biological and AI systems. Both systems compress information: the brain creates efficient neural patterns through experience and AI develops internal representations through training. Forcing verbalization "decompresses" this efficient encoding, potentially losing subtle patterns. Hence, for a task like visual recognition, which is optimized to occur almost instantly in a parallel process, you will only degrade performance by running it in a serial chain of thought sequence.

[+] ryoshu|1 year ago|reply

95% * 95% = 90.25%

[+] jwpapi|1 year ago|reply

This is so interesting. What are even the tasks where thinking makes humans worse?

[+] XCSme|1 year ago|reply

> What are even the tasks where thinking makes humans worse?

Not really related, but athletes perform A LOT worse when they are thinking about their movements/strategies/tactics. A top performing athlete does best when they are in a flow state, where they don't think about anything and just let their body/muscle memory do the work.

Once you start thinking about micro-adjustments (e.g. I should lift my elbow higher), you start controlling your body in a conscious way, which is a magnitude slower and less coordinated than the automatic/subconscious way.

Also, same happens for creativity/new ideas. If you intentionally think about something, step by step, you won't likely find new, innovative solutions. There is a reason why the "a-ha!" moments come in the shower, your subconscious mind is thinking about the problem instead of trying to force your thinking on a specific path.

I would guess this happens in many other areas, where channelling the thought process through a specific template hinders the ability to use all the available resources/brain power.

[+] sigmoid10|1 year ago|reply

The answer is in the article. One example they give is grammar. Lots of people apparently do worse once they try to verbalize it.

[+] sowbug|1 year ago|reply

I can think myself into forgetting strong passwords if I try to spell each character out in my head. But then I sit at a keyboard, relax, and automatically type it perfectly.

[+] naasking|1 year ago|reply

> What are even the tasks where thinking makes humans worse?

Talking about religion and politics.

[+] Y_Y|1 year ago|reply

Reminds me of a mantra from chess class:

   long think = wrong think

[+] spongebobism|1 year ago|reply

The original by Bent Larsen is "Long variation, wrong variation"

[+] TZubiri|1 year ago|reply

Was that perhaps a speed chess class?

[+] meowster|1 year ago|reply

Think long; think wrong

( Flows off the tongue better ¯\_(ツ)_/¯ )

[+] unknown|1 year ago|reply

[deleted]

[+] TZubiri|1 year ago|reply

So, LLMs face a regression on their latest proposed improvement. It's not surprising considering their functional requirements are:

1) Everything

For the purpose of AGI, LLM are starting to look like a local maximum.

[+] rjbwork|1 year ago|reply

>For the purpose of AGI, LLM are starting to look like a local maximum.

I've been saying it since they started popping off last year and everyone was getting euphoric about them. I'm basically a layman - a pretty good programmer and software engineer, and took a statistics and AI class 13 years ago in university. That said, it just seems so extremely obvious to me that these things are likely not the way to AGI. They're not reasoning systems. They don't work with axioms. They don't model reality. They don't really do anything. They just generate stochastic output from the probabilities of symbols appearing in a particular order in a given corpus.

It continues to astound me how much money is being dumped into these things.

[+] jsheard|1 year ago|reply

> So, LLMs face a regression on their latest proposed improvement.

Arguably a second regression, the first being cost, because COT improves performance by scaling up the amount of compute used at inference time instead of training time. The promise of LLMs was that you do expensive training once and then run the model cheaply forever, but now we're talking about expensive training followed by expensive inference every time you run the model.

[+] pessimizer|1 year ago|reply

> So, LLMs face a regression on their latest proposed improvement.

A regression that humans also face, and we don't say therefore that it is impossible to improve human performance by having them think longer or work together in groups, we say that there are pitfalls. This is a paper saying that LLMs don't exhibit superhuman performance.

[+] idiotsecant|1 year ago|reply

LLMs are a local maximum in the same way that ball bearings can't fly. LLM-like engines will almost certainly be components of an eventual agi-level machine.

[+] alexchantavy|1 year ago|reply

This seems to support how thinking out loud during a coding test might make you do worse.

[+] why-el|1 year ago|reply

I like this analogy a lot. It's possible that forced externalization of thoughts accidentally causes the omission of crucial data. That is, much more goes on in your head, you probably laid out the whole algorithm, but being asked to state it on the spot and in clear, serial words is causing you to bork it by taking shortcuts.

[+] dev1ycan|1 year ago|reply

Stop dumping billions of your own money (if you are a VC) in LLMs, you are going to regret it in the long run. You are funding con-artist's salaries...

[+] nisten|1 year ago|reply

This sounds about right from my experience getting nerdsniped by new samplers along with trying to reproduce the API middleware for the whole reflection thing, and using 4400 questions for a new benchmark is not bad given that even the well-regarded gpqa benchmark is only 3000-something questions.

What's ... mildly infuriating here is the lack of any kind of data, code, 0 mention of github in the paper, and nothing for anyone to reproduce or find any reason in my opinion to even recommend anyone to read this thing at all. If you think that whatever you're doing in the field of LLMs won't be obsolete in 6 months you're being delusional.

Anyway, back to the paper, it says all questions culminated to a yes or no answer... meaning theres a 50/50 chance of getting right, so does that mean the 8% drop in performance you got from testing llama 3 8b this way is more like 4% which would make it statistically insignificant? And given that the only other scientifically usueful & reproducible (non-api walled models which no one knows on how many actual llms and retrieval systems are composing that solution you're testing)models were less than that leads me to the opinion that this whole thing was just useless slop.

So please, if you're writing a paper in LLMs, and want to seem credible, either have some type of demo thing or show the actual god damn trash code and top secret garbage data you wrote for it so people can make some kind of use of it before it goes obsolete otherwise you're just wasting everyones time.

TL:DR. It's trash.

[+] npunt|1 year ago|reply

"Don't overthink it" is sometimes good advice!

[+] marviel|1 year ago|reply

I love backpropagating ideas from ML back into psychology :)

I think it shows great promise as a way to sidestep the ethical concerns (and the reproducibility issues) associated with traditional psychology research.

One idea in this space I think a lot about is from the Google paper on curiosity and procrastination in reinforcement learning: https://research.google/blog/curiosity-and-procrastination-i...

Basically the idea is that you can model curiosity as a reward signal proportional to your prediction error. They do an experiment where they train an ML system to explore a maze using curiosity, and it performs the task more efficiently -- UNTIL they add a "screen" in the maze that shows random images. In this case, the agent maximizes the curiosity reward by just staring at the screen.

Feels a little too relatable sometimes, as a highly curious person with procrastination issues :)

[+] m3kw9|1 year ago|reply

would be slow to use COT on simple requests like 1+1

[+] veryfancy|1 year ago|reply

So like dating?

250 comments