Once benchmarks have been around for a while, they become meaningless: even without explicitly training on the test set, human actions (what used to be called "graduate student descent") end up optimizing new models toward overfitting on the benchmark tasks.
Even the random seed can cause a big swing in HumanEval performance, if you know what I mean. And nothing stops you from picking the one checkpoint that looks best on those benchmarks and moving along.
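The cherry-picking effect above can be shown with a toy simulation (all numbers invented): evaluate many equally good checkpoints on a noisy benchmark and report only the best one, and the reported score inflates above the true skill.

```python
import random

random.seed(0)

TRUE_SKILL = 0.60   # hypothetical model's true pass rate
NOISE = 0.05        # run-to-run benchmark noise (seed, sampling, etc.)
N_CKPTS = 20        # checkpoints evaluated; all equally good by construction

def noisy_eval(true_skill):
    """One benchmark run: true skill plus random measurement noise."""
    return true_skill + random.gauss(0, NOISE)

scores = [noisy_eval(TRUE_SKILL) for _ in range(N_CKPTS)]
best = max(scores)
mean = sum(scores) / len(scores)

print(f"mean score:    {mean:.3f}")   # close to the true 0.60
print(f"reported best: {best:.3f}")   # noticeably above 0.60
```

Reporting the max over checkpoints is a multiple-comparisons problem: the more checkpoints you test, the further the "best" one drifts above the model's real skill.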
HumanEval is meaningless regardless; those 164 problems have been overfit to a tee.
Hook these models up to the LLM arena and we'll get a better picture of how powerful they really are.
It's a really funny story that I comment about at least once a week because it drives me nuts.
1. After the ChatGPT release, influencer spam on Twitter claims ChatGPT is one billion parameters and GPT-4 is one trillion.
2. SemiAnalysis publishes a blog post claiming 1.8T, sourced from insiders.
3. The way information diffusion works these days, everyone heard it from someone other than SemiAnalysis.
4. Up until about a month ago, you could confidently say "hey, it's just that one blog post" and work through it with people to trace their first hearing of it back to the post.
5. An NVIDIA press conference sometime in the last month used the rumor as an example with "apparently" attached, and now people will tell you NVIDIA confirmed 1.8 trillion.
My $0.02: I'd bet my life GPT-4 isn't 1.8T, and I very much doubt it's over 1 trillion. Like, lightning-striking-the-same-person-three-times-in-the-same-week unlikely.
I'm not OP, but George Hotz said on the Lex Fridman podcast a while back that it was an MoE of 8 experts at ~250B parameters each. That multiplies out to 2T; subtract the attention weights duplicated across experts and you get something right around 1.8T.
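The arithmetic behind that rumor is easy to check; note the fraction of weights assumed to be duplicated attention is a pure guess, not anything Hotz stated.

```python
# Back-of-envelope for the rumored figures; the shared fraction is an assumption.
n_experts = 8
params_per_expert = 250e9          # rumored ~250B per expert
naive_total = n_experts * params_per_expert            # 2.0T if nothing is shared

shared_attention_fraction = 0.10   # assumed share of each expert that is duplicated attention
deduped = naive_total - (n_experts - 1) * shared_attention_fraction * params_per_expert

print(f"naive total: {naive_total / 1e12:.1f}T")
print(f"after removing duplicated attention: {deduped / 1e12:.2f}T")
```

With a ~10% shared-attention assumption the deduplicated count lands in the low 1.8T range, which is how 8 × 250B and "1.8T" can both be consistent with the same rumor.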
It's a very plausible rumor, but it is misleading in this context, because the rumor also states that it's a mixture of experts model with 8 experts, suggesting that most (perhaps as many as 7/8) of those weights are unused by any particular inference pass.
That might suggest that GPT-4 should be thought of as something like a 250B model. But there's also some selection for the remaining 1/8 of weights that are used by the chosen expert as being the "most useful" weights for that pass (as chosen/defined by the mixture routing), so now it feels like 250B is undercounting the parameter size, whereas 1.8T was overcounting it.
I think it's not really well-defined how to compare parameter counts between an MoE model and a dense one.
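The total-versus-active distinction above can be made concrete with a toy top-1 MoE layer (sizes and routing are illustrative only, nothing to do with GPT-4's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

D, N_EXPERTS = 16, 8   # toy dimensions
experts = [rng.standard_normal((D, D)) for _ in range(N_EXPERTS)]  # one weight matrix per expert
router = rng.standard_normal((D, N_EXPERTS))                       # routing weights

def moe_forward(x):
    """Top-1 routing: each token touches exactly one expert's weights."""
    logits = x @ router
    e = int(np.argmax(logits))
    return experts[e] @ x, e

total_params = N_EXPERTS * D * D + D * N_EXPERTS   # everything stored in memory
active_params = D * D + D * N_EXPERTS              # weights actually used per token

x = rng.standard_normal(D)
y, chosen = moe_forward(x)
print(f"total params: {total_params}, active per token: {active_params} "
      f"({active_params / total_params:.0%})")
```

So "how many parameters does this model have?" splits into two honest answers: what you must hold in memory (total) versus what any one token's forward pass multiplies through (active), and neither number alone tells the dense-equivalent story.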
I actually can't wrap my head around this number, even though I have been working on and off with deep learning for a few years. The biggest models we've ever deployed in production still have less than 1B parameters, and the latency is already hard to manage during rush hours. I have no idea how they serve (multiple?) 1.8T models to tens of millions of users a day.
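Some rough serving math under the rumored size (every number below is an assumption, not a known fact about OpenAI's deployment) shows why the memory side alone is daunting, and why MoE sparsity only helps with compute:

```python
# Back-of-envelope serving footprint for a rumored 1.8T-parameter MoE.
params = 1.8e12
bytes_per_param = 2                        # assuming fp16/bf16 weights
weights_bytes = params * bytes_per_param   # weights alone, no KV cache or activations

gpu_mem = 80e9                             # one 80 GB accelerator
min_gpus = weights_bytes / gpu_mem
print(f"weights: {weights_bytes / 1e12:.1f} TB -> at least "
      f"{min_gpus:.0f} x 80GB GPUs just to hold them")

# MoE sparsity reduces per-token FLOPs, not memory: if ~1/8 of weights
# run per token, compute per token resembles a much smaller dense model.
active = params / 8
print(f"active params per token: ~{active / 1e9:.0f}B")
```

That's dozens of high-end GPUs per model replica before accounting for KV cache, activations, or redundancy, which is presumably why serving at that scale requires heavy model parallelism and batching.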
andy99|1 year ago
acchow|1 year ago
karmasimida|1 year ago
bilbo0s|1 year ago
Ahhh that takes me back!
qeternity|1 year ago
But it's pretty clear GPT4 Turbo is a smaller and heavily quantized model.
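Quantization alone would explain a big chunk of a cheaper-to-serve model; a quick sketch of the memory footprint at different weight precisions (the parameter count here is a made-up example, not a claim about GPT-4 Turbo):

```python
# Weight memory at different precisions for an assumed parameter count.
def weight_gb(n_params, bits):
    """Bytes of weight storage, expressed in GB."""
    return n_params * bits / 8 / 1e9

n = 200e9   # hypothetical "smaller" model size
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_gb(n, bits):.0f} GB")
```

Halving the bits halves the weight memory (and roughly the memory bandwidth per token), which is consistent with the observed speed and price drop even before any change in parameter count.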
IceHegel|1 year ago
oersted|1 year ago
refulgentis|1 year ago
huijzer|1 year ago
In the keynote, Jensen uses 1.8T in an example and suggests that this is roughly the size of GPT-4 (if I remember correctly).
sputknick|1 year ago
cjbprime|1 year ago
anvuong|1 year ago
Simon321|1 year ago