This does feel a bit like an undergrad introduction to statistical analysis, and it's surprising anyone felt the need to explain these things. But I also suspect most AI people out there nowadays have limited math skills, so maybe it’s helpful?
As an ML researcher who started in physics (this seems common among physics/math-turned-ML people, Evan included), I cannot tell you how bad it is... One year at CVPR, when diffusion models hit the scene, I was asking people what their covariance was (I had overestimated the model complexity), and the most common answer I got was "how do I calculate that?" People do not understand things like what "pdf" (probability density function) means. People at top schools! I've been told I'm "gatekeeping" for saying that you should learn math (I say "you don't need math to build good models, but you do to understand why they're wrong"). Not that you need to, but you should. (I guess this explains why Mission Impossible Language Models won best paper...)
I swear, the big reason models are black boxes is that we _want_ them to be. There's a clear sentiment against people doing theory, and the result shows. I remember not too long ago Yi Tay (under @agihippo, but his main is @YiTayML) said "fuck theorists". I guess it's not a surprise DeepMind recently hired him after that "get good" stuff.
Also, I'd like to point out, the author uses "we" but the paper only has one author on it. So may I suggest adding their cat as a coauthor? [0]
Maths also means different things to different people. Your average number theorist or algebraic geometer will most likely not touch statistical techniques day-to-day. Reading this Anthropic article was helpful because I am constantly catching up on my statistical background.
All things considered, although I'm in favor of Anthropic's suggestions, I'm surprised that they're not recommending more (nominally) advanced statistical methods. I wonder if this is because more advanced methods don't have any benefits or if they don't want to overwhelm the ML community.
For one, they could consider using equivalence testing for comparing models, instead of significance testing. I'd be surprised if their significance tests were not significant given 10000 eval questions, and I don't see why they couldn't ask the competing models 10000 eval questions.
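For anyone who hasn't seen equivalence testing, here is a minimal sketch of the two one-sided tests (TOST) procedure on paired per-question scores, using only the Python standard library and a normal approximation. The function name `tost_paired`, the margin of 0.03, and the synthetic accuracies are assumptions for illustration, not anything taken from the Anthropic article.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def tost_paired(scores_a, scores_b, margin, alpha=0.05):
    """Two one-sided tests (TOST) on per-question score differences.

    Uses a normal approximation, which is reasonable for large n.
    `margin` is the largest mean difference still treated as "equivalent";
    it is a judgment call made by the analyst, not something the data provides.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    d_bar = mean(diffs)
    se = stdev(diffs) / math.sqrt(n)
    if se == 0:
        # Degenerate case: every difference is identical.
        p_value = 0.0 if abs(d_bar) < margin else 1.0
        return d_bar, p_value, p_value < alpha
    std_normal = NormalDist()
    # H0: true difference <= -margin (rejected when d_bar is well above -margin)
    p_lower = 1.0 - std_normal.cdf((d_bar + margin) / se)
    # H0: true difference >= +margin (rejected when d_bar is well below +margin)
    p_upper = std_normal.cdf((d_bar - margin) / se)
    p_value = max(p_lower, p_upper)
    return d_bar, p_value, p_value < alpha


# Synthetic example: two hypothetical models with 80.0% and 80.5% true accuracy.
rng = random.Random(0)
n_questions = 10_000
model_a = [1 if rng.random() < 0.800 else 0 for _ in range(n_questions)]
model_b = [1 if rng.random() < 0.805 else 0 for _ in range(n_questions)]

d_bar, p_value, equivalent = tost_paired(model_a, model_b, margin=0.03)
print(f"mean diff = {d_bar:+.4f}, TOST p = {p_value:.4g}, equivalent = {equivalent}")
```

The point of the design: a significance test asks whether the difference is nonzero, which large samples tend to confirm even for tiny differences, whereas TOST asks whether the difference is demonstrably smaller than a margin you commit to up front.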
My intuition is that multilevel modelling could help with the clustered standard errors, but I'll assume that they know what they're doing.
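To make the clustered standard error concrete, here is a rough stdlib-only sketch of the naive versus cluster-robust standard error of a mean eval score. It does not show the multilevel alternative, which would need a mixed-effects library. The function name `naive_and_clustered_se` and the toy data are hypothetical, and no finite-sample correction is applied.

```python
import math
from collections import defaultdict


def naive_and_clustered_se(scores, cluster_ids):
    """Return (mean, naive SE, cluster-robust SE) for a mean eval score.

    No finite-sample correction is applied, so both SEs use the plain
    1/n^2 scaling; with very few clusters a correction factor is advisable.
    """
    n = len(scores)
    s_bar = sum(scores) / n
    residuals = [s - s_bar for s in scores]

    # Naive SE: treats every question as independent.
    naive_se = math.sqrt(sum(e * e for e in residuals)) / n

    # Clustered SE: sum residuals within each cluster first, so that
    # correlated questions in the same cluster are not counted as independent.
    cluster_sums = defaultdict(float)
    for e, c in zip(residuals, cluster_ids):
        cluster_sums[c] += e
    clustered_se = math.sqrt(sum(t * t for t in cluster_sums.values())) / n

    return s_bar, naive_se, clustered_se


# Toy data: 4 passages (clusters) with 3 questions each; answers within a
# passage tend to be right or wrong together.
scores = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1]
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
s_bar, naive_se, clustered_se = naive_and_clustered_se(scores, cluster_ids)
print(f"mean = {s_bar:.3f}, naive SE = {naive_se:.3f}, clustered SE = {clustered_se:.3f}")
```

When every question is its own cluster the two numbers coincide; when scores move together within a cluster, the clustered SE comes out larger, which is the effect the naive calculation hides.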
I have been promoting this and saying it since at least 2018. You can see my publication record as evidence!!!
"Random seed xxx is all you need" was another demonstration of this need.
You actually want a Wilcoxon rank-sum test, as many metrics are not Gaussian, especially as they get to their limits!! I.e. accuracy roughly 99 or 100! Then it becomes highly sub-Gaussian.
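For reference, a minimal stdlib-only sketch of a two-sided Wilcoxon rank-sum (Mann-Whitney U) test using midranks for ties and a tie-corrected normal approximation. The function name `rank_sum_test` and the per-seed accuracies are made up for illustration; with only a handful of seeds an exact test would be the safer choice.

```python
import math
from statistics import NormalDist


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) test, normal approximation.

    Ties get midranks and the variance is tie-corrected. No continuity
    correction is applied. Returns (U statistic for x, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                    # size of this tie group
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t**3 - t
        i = j

    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values tied: no evidence of a shift
    z = (u - mean_u) / math.sqrt(var_u)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return u, p_value


# Hypothetical accuracies over five random seeds, close to the ceiling.
model_a = [0.991, 0.993, 0.992, 0.995, 0.994]
model_b = [0.985, 0.987, 0.990, 0.986, 0.988]
u, p_value = rank_sum_test(model_a, model_b)
print(f"U = {u}, p = {p_value:.4f}")
```

The test only uses the ordering of the values, so it does not care that accuracies bunched up against 100% are skewed and bounded.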
fnordpiglet | 1 year ago
godelski | 1 year ago
[0] https://en.wikipedia.org/wiki/F._D._C._Willard
runeblaze | 1 year ago
lukev | 1 year ago
It is empirically true that none of the industry discourse around leaderboards and benchmarks uses any of the techniques this article discusses.
fsndz | 1 year ago
nov30 | 1 year ago
[deleted]
Unlisted6446 | 1 year ago
phillipcarter | 1 year ago
ipunchghosts | 1 year ago
intended | 1 year ago