bhu8 | 1 year ago
"The model often attempts to use a hallucinated bash tool rather than python despite constant, multi-shot prompting and feedback that this format is incorrect. This resulted in long conversations that likely hurt its performance."
tippytippytango | 1 year ago
eightysixfour | 1 year ago
My experience is that models focused on reasoning improvements tend to be a bit worse at following specific instructions. It is also notable that many third-party fine-tunes of Llama and other models gain on knowledge-based benchmarks while their instruction-following scores drop.
I wonder why there seems to be some sort of trade-off between the two?
arresin | 1 year ago