top | item 47006030

spyder | 17 days ago

Yea, LLMs have prompt-, harness-, and even random-seed variability, and it leaves you wondering whether a model could perform better with a better prompt or system instruction. Too bad most benchmarks don't report that variability, because it could reveal that a model only performs well when prompted in the style of its training data and doesn't generalize to unseen prompt styles. It could also explain some of the gap between benchmark and real-world performance.

I remember some papers about earlier models showing around 15% prompt variability, and with different tool use there are sometimes even bigger jumps. If I remember correctly, reasoning models improve on some of this because many of the early prompting tricks are baked into them, like "think step-by-step", "think carefully", and other "magic" methods. Another trick is to ask the model to rephrase the prompt in its own words, because that may produce a prompt that better aligns with its training prompts. The big model developers are surely aware of this and are constantly improving it; I just don't see much discussion or numbers about it.
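Measuring this is straightforward in principle: run the same benchmark under several prompt phrasings and report the spread, not just the best score. A minimal sketch, with an invented `score` stub standing in for an actual benchmark run (the templates and numbers are made up for illustration):

```python
# Sketch: quantify prompt variability by evaluating the same benchmark
# under several prompt phrasings and reporting mean and spread.
from statistics import mean, pstdev

# Hypothetical prompt variants for the same underlying task.
PROMPT_VARIANTS = [
    "Answer the question: {q}",
    "Think step-by-step, then answer: {q}",
    "Rephrase the question in your own words, then answer: {q}",
]

def score(template: str) -> float:
    # Stub: in practice this would run the full benchmark with this
    # template against a model and return accuracy. Numbers invented.
    fake_results = {0: 0.62, 1: 0.71, 2: 0.68}
    return fake_results[PROMPT_VARIANTS.index(template)]

scores = [score(t) for t in PROMPT_VARIANTS]
print(f"mean={mean(scores):.3f} "
      f"stdev={pstdev(scores):.3f} "
      f"spread={max(scores) - min(scores):.3f}")
```

Reporting the spread (here 9 points between best and worst phrasing) alongside the headline number is exactly the kind of variability figure the comment says benchmarks tend to omit.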

andai | 17 days ago

I haven't been able to find it again, but a few years ago I read a paper that found that certain prompts massively improved the performance of some LLMs on benchmarks. But the same prompt massively reduced the performance of some other LLMs. I assume this is still true, though perhaps not as dramatically as before.