bhu8 | 1 year ago
"The model often attempts to use a hallucinated bash tool rather than python despite constant, multi-shot prompting and feedback that this format is incorrect. This resulted in long conversations that likely hurt its performance."
tippytippytango | 1 year ago
eightysixfour | 1 year ago
My experience is that models focused on reasoning improvements tend to be a bit worse at following specific instructions. It is also notable that many third-party fine-tunes of Llama and other models gain on knowledge-based benchmarks while their instruction-following scores drop.
I wonder why there seems to be some sort of trade-off between the two?
arresin | 1 year ago