Once benchmarks have been around for a while, they become meaningless: even without explicitly training on the test set, human actions (what used to be called "graduate student descent") end up optimizing new models toward overfitting on the benchmark tasks.
Even the random seed can cause a big swing in HumanEval performance, if you know what I mean. And nothing stops you from picking the one checkpoint that looks best on those benchmarks and moving along.
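The cherry-picking effect above can be shown with a toy simulation (all numbers invented): evaluate many equally good checkpoints on a noisy benchmark and report only the best one, and the reported score inflates above the true skill.

```python
import random

random.seed(0)

TRUE_SKILL = 0.60   # hypothetical model's true pass rate
NOISE = 0.05        # run-to-run benchmark noise (seed, sampling, etc.)
N_CKPTS = 20        # checkpoints evaluated; all equally good by construction

def noisy_eval(true_skill):
    """One benchmark run: true skill plus random measurement noise."""
    return true_skill + random.gauss(0, NOISE)

scores = [noisy_eval(TRUE_SKILL) for _ in range(N_CKPTS)]
best = max(scores)
mean = sum(scores) / len(scores)

print(f"mean score:    {mean:.3f}")   # close to the true 0.60
print(f"reported best: {best:.3f}")   # noticeably above 0.60
```

Reporting the max over checkpoints is a multiple-comparisons problem: the more checkpoints you test, the further the "best" one drifts above the model's real skill.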
HumanEval is meaningless regardless; those 164 problems have been overfit to a tee.
Hook these models up to the LLM arena and we'll get a better picture of how powerful they really are.
It's a really funny story that I comment about at least once a week because it drives me nuts.
1. After the ChatGPT release, influencer spam on Twitter claims ChatGPT is one billion parameters and GPT-4 is one trillion.
2. SemiAnalysis publishes a blog post claiming 1.8T, sourced from insiders.
3. The way information diffusion works these days, everyone heard it from someone other than SemiAnalysis.
4. Up until about a month ago, you could confidently say "hey, it's just that one blog post" and work through it with people to trace their first hearing of it back to the post.
5. An NVIDIA press conference sometime in the last month used the rumor as an example with "apparently" attached, and now people will tell you NVIDIA confirmed 1.8 trillion.
My $0.02: I'd bet my life GPT-4 isn't 1.8T, and I very much doubt it's over 1 trillion. Like, lightning-striking-the-same-person-three-times-in-the-same-week unlikely.
I'm not OP, but George Hotz said on the Lex Fridman podcast a while back that it was an MoE of 8 experts at ~250B parameters each. That multiplies out to 2T; subtract the attention weights duplicated across experts and you get something right around 1.8T.
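The arithmetic behind that rumor is easy to check; note the fraction of weights assumed to be duplicated attention is a pure guess, not anything Hotz stated.

```python
# Back-of-envelope for the rumored figures; the shared fraction is an assumption.
n_experts = 8
params_per_expert = 250e9          # rumored ~250B per expert
naive_total = n_experts * params_per_expert            # 2.0T if nothing is shared

shared_attention_fraction = 0.10   # assumed share of each expert that is duplicated attention
deduped = naive_total - (n_experts - 1) * shared_attention_fraction * params_per_expert

print(f"naive total: {naive_total / 1e12:.1f}T")
print(f"after removing duplicated attention: {deduped / 1e12:.2f}T")
```

With a ~10% shared-attention assumption the deduplicated count lands in the low 1.8T range, which is how 8 × 250B and "1.8T" can both be consistent with the same rumor.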
It's a very plausible rumor, but it is misleading in this context, because the rumor also states that it's a mixture of experts model with 8 experts, suggesting that most (perhaps as many as 7/8) of those weights are unused by any particular inference pass.
That might suggest that GPT-4 should be thought of as something like a 250B model. But there's also some selection for the remaining 1/8 of weights that are used by the chosen expert as being the "most useful" weights for that pass (as chosen/defined by the mixture routing), so now it feels like 250B is undercounting the parameter size, whereas 1.8T was overcounting it.
I think it's not really well-defined how to compare parameter counts between an MoE model and a dense one.
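The total-versus-active distinction above can be made concrete with a toy top-1 MoE layer (sizes and routing are illustrative only, nothing to do with GPT-4's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

D, N_EXPERTS = 16, 8   # toy dimensions
experts = [rng.standard_normal((D, D)) for _ in range(N_EXPERTS)]  # one weight matrix per expert
router = rng.standard_normal((D, N_EXPERTS))                       # routing weights

def moe_forward(x):
    """Top-1 routing: each token touches exactly one expert's weights."""
    logits = x @ router
    e = int(np.argmax(logits))
    return experts[e] @ x, e

total_params = N_EXPERTS * D * D + D * N_EXPERTS   # everything stored in memory
active_params = D * D + D * N_EXPERTS              # weights actually used per token

x = rng.standard_normal(D)
y, chosen = moe_forward(x)
print(f"total params: {total_params}, active per token: {active_params} "
      f"({active_params / total_params:.0%})")
```

So "how many parameters does this model have?" splits into two honest answers: what you must hold in memory (total) versus what any one token's forward pass multiplies through (active), and neither number alone tells the dense-equivalent story.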
I actually can't wrap my head around this number, even though I have been working on and off with deep learning for a few years. The biggest models we've ever deployed in production still have less than 1B parameters, and the latency is already hard to manage during rush hours. I have no idea how they serve (multiple?) 1.8T models to tens of millions of users a day.
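Some rough serving math under the rumored size (every number below is an assumption, not a known fact about OpenAI's deployment) shows why the memory side alone is daunting, and why MoE sparsity only helps with compute:

```python
# Back-of-envelope serving footprint for a rumored 1.8T-parameter MoE.
params = 1.8e12
bytes_per_param = 2                        # assuming fp16/bf16 weights
weights_bytes = params * bytes_per_param   # weights alone, no KV cache or activations

gpu_mem = 80e9                             # one 80 GB accelerator
min_gpus = weights_bytes / gpu_mem
print(f"weights: {weights_bytes / 1e12:.1f} TB -> at least "
      f"{min_gpus:.0f} x 80GB GPUs just to hold them")

# MoE sparsity reduces per-token FLOPs, not memory: if ~1/8 of weights
# run per token, compute per token resembles a much smaller dense model.
active = params / 8
print(f"active params per token: ~{active / 1e9:.0f}B")
```

That's dozens of high-end GPUs per model replica before accounting for KV cache, activations, or redundancy, which is presumably why serving at that scale requires heavy model parallelism and batching.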
andy99|1 year ago
acchow|1 year ago
karmasimida|1 year ago
bilbo0s|1 year ago
Ahhh that takes me back!
qeternity|1 year ago
But it's pretty clear GPT4 Turbo is a smaller and heavily quantized model.
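Quantization alone would explain a big chunk of a cheaper-to-serve model; a quick sketch of the memory footprint at different weight precisions (the parameter count here is a made-up example, not a claim about GPT-4 Turbo):

```python
# Weight memory at different precisions for an assumed parameter count.
def weight_gb(n_params, bits):
    """Bytes of weight storage, expressed in GB."""
    return n_params * bits / 8 / 1e9

n = 200e9   # hypothetical "smaller" model size
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_gb(n, bits):.0f} GB")
```

Halving the bits halves the weight memory (and roughly the memory bandwidth per token), which is consistent with the observed speed and price drop even before any change in parameter count.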
IceHegel|1 year ago
oersted|1 year ago
refulgentis|1 year ago
huijzer|1 year ago
In the keynote, Jensen uses 1.8T in an example and suggests that this is roughly the size of GPT-4 (if I remember correctly).
sputknick|1 year ago
cjbprime|1 year ago
anvuong|1 year ago
Simon321|1 year ago