saurabh20n | 3 years ago
* All variants were trained on 1T - 1.4T tokens, which is a good amount relative to their sizes by the Chinchilla metric. Code is 4.5% of the training data (similar to others). [Table 2]
* They note the GPU hours as 82,432 (7B model) to 1,022,362 (65B model). [Table 15] GPU hour rates will vary, but let's give a range of $1 to $4. The 7B model would have cost ~$82-329k and the 65B something in the range of ~$1-4M. They also note their total time spent for all models: "we used 2048 A100-80GB for a period of approximately 5 months" [sec 6, pg 10]
* 65B model's performance is broadly comparable to PALM-540B. No small feat, but it could also indicate the benefits of good model-vs-token size ratios [Tables 3,4,5,6]. Their conjecture for underperforming on MMLU (multitask language understanding) compared to PALM-540B and Chinchilla-70B is a smaller fraction of books and academic training data.
* Math and code tasks: On math tasks they are substantially worse than Minerva (comparing their 65B to Minerva 62B; they lose hands down against Minerva 540B) [Table 7]. On code tasks they are broadly competitive with PALM-540B (HumanEval and MBPP evals) [Table 8]
* Surprising that instruction fine-tuning takes up such a small part of the paper (sec 4, pg. 7)
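The cost range in the second bullet can be sketched directly; the $1-$4/GPU-hour rates are the comment's assumption, only the GPU-hour counts come from Table 15:

```python
# Back-of-the-envelope training cost from Table 15 GPU-hour figures.
# The $1-$4 per GPU-hour range is assumed, not from the paper.
def cost_range(gpu_hours, low_rate=1.0, high_rate=4.0):
    return gpu_hours * low_rate, gpu_hours * high_rate

low_7b, high_7b = cost_range(82_432)       # 7B model
low_65b, high_65b = cost_range(1_022_362)  # 65B model
print(f"7B:  ${low_7b:,.0f} - ${high_7b:,.0f}")    # ~$82k - $330k
print(f"65B: ${low_65b:,.0f} - ${high_65b:,.0f}")  # ~$1M - $4.1M
```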
machinekob|3 years ago
Just "yes, we trained it for so long" etc., but they never speak about the tens or even hundreds of runs before they finalized the model parameters and architecture -.-
scotty79|3 years ago
323|3 years ago
Do we know how much total energy a human consumes from birth to 20 yo? Something like 2000 calories integrated over 20 years. How does it compare to the GPUs above?
Wolfram Alpha:
- human - 17 MWh ((2000 calories per day) over 20 years, in MWh)
- GPUs - 3000 MWh ((2048 * 400) W over 5 months, in MWh)
We still have the edge.
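The two Wolfram Alpha figures can be reproduced; the 400 W average draw per GPU is the comment's assumption:

```python
# Energy comparison: one human over 20 years vs. the LLaMA training cluster.
HOURS_PER_YEAR = 24 * 365.25

# Human: ~2000 kcal/day, 1 kcal = 1.163 Wh
human_wh = 2000 * 1.163 * 365.25 * 20
# GPUs: 2048 units at an assumed 400 W average, for ~5 months
gpu_wh = 2048 * 400 * (5 / 12) * HOURS_PER_YEAR

print(f"human: {human_wh / 1e6:.0f} MWh")  # ~17 MWh
print(f"GPUs:  {gpu_wh / 1e6:.0f} MWh")    # ~3000 MWh
```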
LOL, I'm being downvoted, I wonder why. Some don't like the question.
zhynn|3 years ago
melling|3 years ago
The trained computer model can be duplicated and used, requiring much less energy.
None of this matters to me, though.
The goal is to build better models. We can worry about the efficiency later.
isoprophlex|3 years ago
Dylan16807|3 years ago
Depends on what you're doing. A human is much smarter than one of these models, but the model has approximate knowledge of orders of magnitude more things. And the energy costs per word of output are a lot closer.
Tepix|3 years ago
Anyway, I remember hearing that the brain uses 60 watts. That's 10.5 MWh over 20 years.
But, we can't transfer/copy that gained knowledge limitlessly.
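The 10.5 MWh figure checks out under the (assumed) 60 W estimate:

```python
# Brain at an assumed 60 W, running continuously for 20 years.
watts = 60
hours = 24 * 365.25 * 20
mwh = watts * hours / 1e6
print(f"{mwh:.1f} MWh")  # ~10.5 MWh
```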
robbiep|3 years ago
programmer_dude|3 years ago
zozbot234|3 years ago
SethTro|3 years ago
That's only 0.08 nines of availability!
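A minimal sketch of the standard "nines" arithmetic (nines = -log10 of the unavailable fraction); reading it backwards, 0.08 nines would correspond to roughly 17% uptime, though whatever uptime figure the joke is based on isn't stated here:

```python
import math

# "Nines" of availability: 99.9% uptime = 3 nines, etc.
def nines(availability: float) -> float:
    return -math.log10(1.0 - availability)

def availability(n: float) -> float:
    """Inverse: availability implied by n nines."""
    return 1.0 - 10 ** (-n)

print(round(nines(0.999), 2))        # 3.0
print(round(availability(0.08), 3))  # 0.168, i.e. ~17% uptime
```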
I remember one of their old guidebooks describing a lot of struggle to keep their 64-machine (512-GPU) cluster running; this was probably 4x the machines and 4x the number of cluster dropouts.
Tepix|3 years ago
foobiekr|3 years ago
woeirua|3 years ago
Also, they kind of prove to me that most companies are totally incapable of making the investments necessary to get much out of this type of AI.
pgt|3 years ago
machinekob|3 years ago
sandGorgon|3 years ago
What do you mean by this? The OpenAI papers talk roughly about model performance scaling with parameters. Does this show the other way?
vishal0123|3 years ago
akomtu|3 years ago
hansvm|3 years ago
Moreover, anything even kind of looking like a hash table in the input/output space is ruled out by the observed facts that the models respond, extremely frequently, to samples crafted not to be in the training set, and that they take into account many long-range dependencies (i.e., the hash table would have to be exponentially larger than it is to match the model's performance).
That said, they are just statistical party tricks. The magic happens because the lookup tables are in a latent space. That's why you can drop in garbage like "uberworldchefinatormichelingodfoodpleasureorgasmmaestro" when asking for recipes and food recommendations and get an experience planets apart from queries excluding the nonsense phrases. The model is just pulling together some token associations, and throwing in the right tokens can take advantage of those in situations where a thinking person would barely be able to parse what you're asking.
Your question feels like it has a motive though. What are you really asking?
make3|3 years ago