saurabh20n | 3 years ago
* All variants were trained on 1T - 1.4T tokens, which is a good amount relative to their sizes by the Chinchilla metric. Code is 4.5% of the training data (similar to others). [Table 2]
* They note the GPU hours as 82,432 (7B model) to 1,022,362 (65B model). [Table 15] GPU hour rates will vary, but let's give a range of $1 to $4. The 7B model would have cost ~$82-329k and the 65B something in the range of ~$1-4M. They also note their total time spent for all models: "we used 2048 A100-80GB for a period of approximately 5 months" [sec 6, pg 10]
* 65B model's performance is broadly comparable to PALM-540B. No small feat, but it could also indicate the benefits of good model-vs-token size ratios [Tables 3,4,5,6]. Their conjecture for underperforming on MMLU (multitask language understanding) compared to PALM-540B and Chinchilla-70B is a smaller fraction of books and academic training data.
* Math and code tasks: On math tasks they are substantially worse than Minerva (comparing their 65B to Minerva 62B; they lose hands down against Minerva 540B) [Table 7]. On code tasks they are broadly competitive with PALM-540B (HumanEval and MBPP evals) [Table 8]
* Surprising that instruction fine-tuning takes up such a small part of the paper (sec 4, pg. 7)
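The cost range in the second bullet can be sketched directly; the $1-$4/GPU-hour rates are the comment's assumption, only the GPU-hour counts come from Table 15:

```python
# Back-of-the-envelope training cost from Table 15 GPU-hour figures.
# The $1-$4 per GPU-hour range is assumed, not from the paper.
def cost_range(gpu_hours, low_rate=1.0, high_rate=4.0):
    return gpu_hours * low_rate, gpu_hours * high_rate

low_7b, high_7b = cost_range(82_432)       # 7B model
low_65b, high_65b = cost_range(1_022_362)  # 65B model
print(f"7B:  ${low_7b:,.0f} - ${high_7b:,.0f}")    # ~$82k - $330k
print(f"65B: ${low_65b:,.0f} - ${high_65b:,.0f}")  # ~$1M - $4.1M
```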
machinekob|3 years ago
Just "yes, we trained it for so long" etc., but they never speak about the tens or even hundreds of runs before they finalized the model parameters and architecture -.-
scotty79|3 years ago
323|3 years ago
Do we know how much total energy a human consumes from birth to 20 yo? Something like 2000 calories integrated over 20 years. How does it compare to the GPUs above?
Wolfram Alpha:
- human - 17 MWh ((2000 calories per day) over 20 years, in MWh)
- GPUs - 3000 MWh ((2048 * 400) W over 5 months, in MWh)
We still have the edge.
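The two Wolfram Alpha figures can be reproduced; the 400 W average draw per GPU is the comment's assumption:

```python
# Energy comparison: one human over 20 years vs. the LLaMA training cluster.
HOURS_PER_YEAR = 24 * 365.25

# Human: ~2000 kcal/day, 1 kcal = 1.163 Wh
human_wh = 2000 * 1.163 * 365.25 * 20
# GPUs: 2048 units at an assumed 400 W average, for ~5 months
gpu_wh = 2048 * 400 * (5 / 12) * HOURS_PER_YEAR

print(f"human: {human_wh / 1e6:.0f} MWh")  # ~17 MWh
print(f"GPUs:  {gpu_wh / 1e6:.0f} MWh")    # ~3000 MWh
```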
LOL, I'm being downvoted, I wonder why. Some don't like the question.
zhynn|3 years ago
melling|3 years ago
The trained computer model can be duplicated and used, requiring much less energy.
None of this matters to me, though.
The goal is to build better models. We can worry about the efficiency later.
isoprophlex|3 years ago
Dylan16807|3 years ago
Depends on what you're doing. A human is much smarter than one of these models, but the model has approximate knowledge of orders of magnitude more things. And the energy costs per word of output are a lot closer.
Tepix|3 years ago
Anyway, I remember hearing that the brain uses 60 watts. That's 10.5 MWh over 20 years.
But, we can't transfer/copy that gained knowledge limitlessly.
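The 10.5 MWh figure checks out under the (assumed) 60 W estimate:

```python
# Brain at an assumed 60 W, running continuously for 20 years.
watts = 60
hours = 24 * 365.25 * 20
mwh = watts * hours / 1e6
print(f"{mwh:.1f} MWh")  # ~10.5 MWh
```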
robbiep|3 years ago
programmer_dude|3 years ago
zozbot234|3 years ago
SethTro|3 years ago
That's only 0.08 nines of availability!
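A minimal sketch of the standard "nines" arithmetic (nines = -log10 of the unavailable fraction); reading it backwards, 0.08 nines would correspond to roughly 17% uptime, though whatever uptime figure the joke is based on isn't stated here:

```python
import math

# "Nines" of availability: 99.9% uptime = 3 nines, etc.
def nines(availability: float) -> float:
    return -math.log10(1.0 - availability)

def availability(n: float) -> float:
    """Inverse: availability implied by n nines."""
    return 1.0 - 10 ** (-n)

print(round(nines(0.999), 2))        # 3.0
print(round(availability(0.08), 3))  # 0.168, i.e. ~17% uptime
```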
I remember one of their old guidebooks describing a lot of struggle to keep their 64-machine (512-GPU) cluster running; this was probably 4x the machines and 4x the number of cluster dropouts.
Tepix|3 years ago
foobiekr|3 years ago
woeirua|3 years ago
Also, they kind of prove to me that most companies are totally incapable of making the investments necessary to get much out of this type of AI.
pgt|3 years ago
machinekob|3 years ago
sandGorgon|3 years ago
What do you mean by this? The OpenAI papers talk roughly about model performance scaling with parameters. Does this show the other way?
vishal0123|3 years ago
akomtu|3 years ago
hansvm|3 years ago
Moreover, anything even kind of looking like a hash table in the input/output space is ruled out by the observed facts that the models respond, extremely frequently, to samples crafted not to be in the training set, and that they take into account many long-range dependencies (i.e., the hash table would have to be exponentially larger than it is to match the model's performance).
That said, they are just statistical party tricks. The magic happens because the lookup tables are in a latent space. That's why you can drop in garbage like "uberworldchefinatormichelingodfoodpleasureorgasmmaestro" when asking for recipes and food recommendations and get an experience planets apart from queries excluding the nonsense phrases. The model is just pulling together some token associations, and throwing in the right tokens can take advantage of those in situations where a thinking person would barely be able to parse what you're asking.
Your question feels like it has a motive though. What are you really asking?
make3|3 years ago