buttered_toast | 10 days ago
Whether the model gets it right in any given instance isn't the point. The point is that if the model ever gets it wrong, we can assume there is still some stochasticity in its output, given that a model is essentially static once it is released.
Additionally, they don't learn post-training (except in context, which I think counts as learning to some degree, albeit transient). If, hypothetically, it answers incorrectly in 1 of 50 attempts, and I explain in that one failed attempt why it is wrong, there will still be a 1-in-50 chance it gets it wrong in a new instance.
This differs from humans. Say, for example, I give an average person the "what do you put in a toaster" trick and they fall for it. I can be pretty confident that if I try that trick again 10 years later, they will probably not fall for it. You can't really say that for a given model.
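To put rough numbers on that, here is a minimal sketch. It assumes each fresh instance fails independently with the hypothetical 1-in-50 rate from above; `P_FAIL` and the function names are made up for illustration, and nothing here calls a real model.

```python
import random

# Hypothetical per-attempt failure rate taken from the 1-in-50 example.
P_FAIL = 1 / 50


def fresh_instance_fails(rng: random.Random) -> bool:
    """Simulate one new instance of a static model.

    The weights never change, so a correction given in an earlier
    conversation has no effect on this probability.
    """
    return rng.random() < P_FAIL


def failure_rate(n_trials: int, seed: int = 0) -> float:
    """Empirical failure rate over many independent fresh instances."""
    rng = random.Random(seed)
    fails = sum(fresh_instance_fails(rng) for _ in range(n_trials))
    return fails / n_trials


def p_at_least_one_failure(n_attempts: int) -> float:
    """Chance of seeing at least one failure in n independent attempts."""
    return 1 - (1 - P_FAIL) ** n_attempts


if __name__ == "__main__":
    print(f"empirical failure rate: {failure_rate(100_000):.4f}")
    print(f"P(at least one failure in 50 tries): {p_at_least_one_failure(50):.3f}")
```

Under that independence assumption, the chance of hitting at least one failure in 50 fresh attempts is about 64%, no matter how many times the mistake was explained in earlier sessions.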
energy123|10 days ago
buttered_toast|10 days ago
I think that's why benchmarking is so hard for me to fully get behind, even if we do it over, say, 20 attempts and average it. For a given model, those 20 attempts could have had 5 incredible outcomes and 15 mediocre ones, whereas another model could have 20 consistently decent attempts, and the average score would be roughly the same.
We at least see variance reported in public benchmarks, but in the internal examples that's almost never the case.
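A quick sketch of the 20-attempt example, with made-up scores on a 0-100 scale (95 for "incredible", 55 for "mediocre", 65 for "decent"; all of these numbers are assumptions picked only so the averages match):

```python
import statistics

# Hypothetical scores for 20 attempts per model.
model_a = [95] * 5 + [55] * 15  # 5 incredible outcomes, 15 mediocre ones
model_b = [65] * 20             # 20 consistently decent outcomes


def summarize(scores):
    """Return (mean, population standard deviation) for a list of scores."""
    return statistics.mean(scores), statistics.pstdev(scores)


if __name__ == "__main__":
    for name, scores in (("model A", model_a), ("model B", model_b)):
        mean, spread = summarize(scores)
        print(f"{name}: mean={mean:.1f}, stdev={spread:.2f}")
```

Both hypothetical models average 65, but model A's spread is about 17 points while model B's is zero. Reporting only the mean hides exactly the difference I care about.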