top | item 41460377

(no title)

juxtaposicion | 1 year ago

Like other comments, I was also initially surprised. But I think the gains are both real and easy to understand where the improvements are coming from.

Under the hood Reflection 70B seems to be a Llama-3.1 finetune that encourages the model to add <think>, <reflection> and <output> tokens and corresponding phases. This is an evolution of Chain-of-Thought's "think step by step" -- but instead of being a prompting technique, this fine-tune bakes examples of these phases more directly into the model. So the model starts with an initial draft and 'reflects' on it before issuing a final output.

The extra effort spent on tokens, which effectively let the model 'think more' appears to let it defeat prompts which other strong models (4o, 3.5 Sonnet) appear to fumble. So for example, when asked "which is greater 9.11 or 9.9" the Reflection 70b model initially gets the wrong answer, then <reflects> on it, then spits the right output.

Personally, the comparison to Claude and 4o doesn't quite seem apples-to-apples. If you were to have 4o/Claude take multiple rounds to review and reflect on their initial drafts, would we see similar gains? I suspect they would improve massively as well.

discuss

order

QuantumGood|1 year ago

https://huggingface.co/mattshumer/Reflection-70B says system prompt used is:

   You are a world-class AI system, capable of complex reasoning and reflection. Reason through the query inside <thinking> tags, and then provide your final response inside <output> tags. If you detect that you made a mistake in your reasoning at any point, correct yourself inside <reflection> tags.
Also, only "smarter" models can use this flow, according to https://x.com/mattshumer_/status/1831775436420083753

rgbrgb|1 year ago

> Personally, the comparison to Claude and 4o doesn't quite seem apples-to-apples. If you were to have 4o/Claude take multiple rounds to review and reflect on their initial drafts, would we see similar gains? I suspect they would improve massively as well.

They may already implement this technique, we can't know.

astrange|1 year ago

Claude 3.5 does have some "thinking" ability - I've seen it pause and even say it was thinking before. Presumably this is just some output it decides not to show you.

kgeist|1 year ago

I suspect GPT4o already has training for CoT. I've noticed it often responds by saying something like "let's break it down step by step". Or maybe it's the system prompt.

bluejay2387|1 year ago

I am not sure, but you seem to be implying that the Reflection model is running through multiple rounds? If so, that is not what is happening here. The token generation is still linear next token prediction. It does not require multiple rounds to generate the chain of thought response. It does that in one query pass.

I have been testing the model for the last few hours and it does seem to be an improvement on LLAMA 3.1 upon which it is based. I have not tried to compare it to Claude or GPT4o because I don't expect a 70b model to outperform models of that class no matter how good it is. I would happy to be wrong though...

cedws|1 year ago

I had a similar idea[0], interesting to see that it actually works. The faster LLM workloads can be accelerated, the more ‘thinking’ the LLM can do before it emits a final answer.

[0]: https://news.ycombinator.com/item?id=41377042

HanClinto|1 year ago

Further than that, it feels like we could use constrained generation of outputs [0] to force the model to do X amount of output inside of a <thinking> BEFORE writing an <answer> tag. It might not always produce good results, but I'm curious what sort of effect it might have to convince models that they really should stop and think first.

[0]: https://github.com/ggerganov/llama.cpp/blob/master/grammars/...