
PaLM 2 Technical Report [pdf]

382 points | cubefox | 2 years ago | ai.google | reply

282 comments

[+] fzliu|2 years ago|reply
I don't understand how this can be considered a technical report. No information on model architecture, distributed training methodology, or optimizations. The "Training dataset" section is a pathetic 0.5 pages long.

Come on, Google.

[+] stygiansonic|2 years ago|reply
In that sense, it's very similar to the GPT-4 Technical Report.

The era of being "open" about LLMs or other "secret sauce" models in published papers may be over, since these things have become existential threats to companies.

[+] karmasimida|2 years ago|reply
That will be the norm moving forward.

LLMs are going to make money, a lot of money; nobody is going to give away their secret sauce for free.

Prepare for the landscape to get really ugly, really soon. Maybe we will witness some epic legal battles among the big techs.

[+] toxik|2 years ago|reply
Yeah, this is a holdover from where LLMs grew out of: academia. "Technical report" is what you reach for when you don't want to compare to actual competitive baselines.
[+] riku_iki|2 years ago|reply
they mentioned secret sauce: scaling laws and UL2.
[+] espadrine|2 years ago|reply
Surprisingly, their scaling law analysis still focuses on training FLOPs instead of training + inference FLOPs.

That said, they do mention this:

> The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute. [A] smaller but higher quality model significantly improves inference efficiency, reduces serving cost, and enables the model’s downstream application for more applications and users

It makes me think they are Chinchilla-optimal, which would make sense for a research project, but not for shipping to users. I am surprised they didn’t train to the validation loss plateau.

[+] haldujai|2 years ago|reply
Depends on your goal, if it's to overtake OpenAI as having the best model overall it makes sense to optimize for training loss alone (assuming a fixed upfront compute budget).

Optimizing for inference to achieve the same loss would require more compute overall so you're either paying upfront with higher training costs or kicking the can down the road to inference.

News articles' estimates of GPT-4's cost seem to peg it at ~8 months of inference to reach 1:1 cost with training. The lifespan of these models is TBD, but it's a pretty safe bet we'll have new ones by then. Of course GPT-3.5 is still getting used, but it probably won't cross ~2:1 in its lifetime.

Might as well roll the dice and kick the can down the road if you're Google. I imagine they would happily pay an extra $500k/day in inference compute to be market leaders; what's $183M to them? But if they don't get any real market share, or the model sucks, they saved substantially on training.
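The 500k/day and 183M figures above are a simple break-even calculation; here is the arithmetic spelled out (all dollar amounts hypothetical, per the comment's own guesses):

```python
def breakeven_days(training_cost_usd, inference_cost_per_day_usd):
    """Days of serving until cumulative inference spend equals training cost."""
    return training_cost_usd / inference_cost_per_day_usd

# Hypothetical $100M training run against $500k/day of extra inference compute:
print(breakeven_days(100e6, 500e3))  # 200.0 days, i.e. ~6.6 months

# Annualized, $500k/day matches the comment's figure:
print(500e3 * 365)  # 182,500,000 -> ~$183M/year
```

If the model is replaced within the break-even window, the extra training spend to shrink it never pays off, which is the "roll the dice" logic above.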

> It makes me think they are Chinchilla-optimal,

They elaborate in the appendix: they empirically determine a PaLM-optimal scaling law, which concurs with Chinchilla-optimal (more or less).

[+] cubefox|2 years ago|reply
They also mention this:

> Moreover, there are several other considerations besides the optimal training loss, such as training throughput and serving latency, which affect the decision regarding the optimal model size.

And they also mention, right before that, that "lower training loss" might not exactly mean "higher performance":

> However, the training loss is not a perfect proxy for downstream metrics. For example, the 8.95B model, which shows the lowest loss (Table 1) and is closest to the optimal model, slightly underperforms the 14.7B model on downstream tasks. This suggests that while scaling laws can be used to achieve optimal training loss for a given quantity of FLOPs, this does not necessarily transfer to achieving optimal performance for a given task.

That might be a random outlier, but ...

The Chinchilla scaling law describes how to balance parameters and training tokens to achieve minimal training loss for a given amount of compute. Low training loss is a good proxy for model performance (intelligence) but perhaps it is somewhat off?

For example, Chinchilla says that for optimal loss, we have to scale training tokens and parameters equally (50%/50%). But perhaps for optimal model "intelligence" we need something slightly different, e.g. 60% parameters and 40% training tokens.

Of course this seems somewhat unlikely, since it would mean such models are systematically smarter but systematically worse at predicting text compared to Chinchilla optimal models trained with the same amount of compute.
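For reference, the Chinchilla "scale tokens and parameters equally" prescription can be sketched from the standard C ≈ 6·N·D training-FLOPs approximation plus the ~20 tokens-per-parameter rule of thumb from Hoffmann et al. (2022). The exact fitted coefficients differ; this is only the rule-of-thumb version:

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a compute budget C ~= 6*N*D between parameters N and training
    tokens D, using the ~20 tokens/param rule of thumb from Hoffmann et al.
    (2022). A rough sketch, not the paper's fitted law."""
    # With C = 6*N*D and D = r*N, we get N = sqrt(C / (6*r)).
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla's own budget of ~5.76e23 FLOPs recovers its 70B / 1.4T split:
n, d = chinchilla_optimal(5.76e23)
print(f"{n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")  # 69B params, 1.4T tokens
```

Note that both N and D grow as sqrt(C), which is exactly the "equal 50%/50% scaling" referred to above; a 60/40 split would change the exponents, not just the constant.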

[+] Herring|2 years ago|reply
"Surprisingly, their scaling law analysis still focuses on training FLOPs instead of training + inference FLOPs."

It's kind of weird. In the conclusion, they say

>With PaLM 2, we have independently verified the scaling laws from Hoffmann et al. (2022) at large scales; we have shown that training tokens should grow at roughly the same rate as the number of model parameters.

then a few lines later

>In effect, we find that it is generally more efficient to train a smaller model with more tokens, for a fixed inference and training budget.

Without more architecture details, it's hard to tell what they're going on about.

[+] jumpCastle|2 years ago|reply
Optimizing for training could also help with distillation.
[+] Closi|2 years ago|reply
May be a weird takeaway, but I did find it strange how much the whole report focussed on misgendering as a safety issue.

I agree it’s important to get right, but it seems like one of hundreds of safety/alignment issues and that many others are de-emphasised or ignored.

[+] jp42|2 years ago|reply
Personal experience - I'm using GPT-4 for writing code, especially in Python. After using Bard today, I feel Bard is doing quite well considering it's free. I will keep using it, and if it keeps doing well, I will cancel my GPT-4 $20/month subscription.
[+] jxy|2 years ago|reply
So how do we actually try out the PaLM 2?

The links in their press release just link to their other press releases, and if I google "PaLM API" it just gives me more press releases; I couldn't find the actual documentation for their PaLM API.

How do I actually google the "PaLM API" for a way to test "PaLM 2"?

[+] minimaxir|2 years ago|reply
Google's docs on the APIs are up: https://cloud.google.com/vertex-ai/docs/generative-ai/learn/...

The pricing is also now listed but free during the trial period, although it's annoyingly priced by character: https://cloud.google.com/vertex-ai/pricing#generative_ai_mod...

Assuming ChatGPT's tokens are the equivalent of 4 characters on average (a fair assumption), the pricing of PaLM's chat and embedding APIs are the same cost as OpenAI's equivalents.
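The comparison above is just a unit conversion between per-character and per-token billing. A minimal sketch, assuming ~4 characters per token (the prices here are illustrative placeholders, not quoted from either pricing page):

```python
def per_1k_tokens(price_per_1k_chars, chars_per_token=4.0):
    """Convert a per-1k-character price to an approximate per-1k-token
    price, assuming ~4 characters per token for English-like text."""
    return price_per_1k_chars * chars_per_token

# Hypothetical: $0.0005 per 1k characters works out to $0.002 per 1k tokens.
print(per_1k_tokens(0.0005))  # 0.002
```

The caveat is that the 4-chars-per-token ratio varies a lot by language and content (code and non-English text tokenize less efficiently), so per-character billing can be cheaper or dearer depending on the workload.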

[+] shikkra|2 years ago|reply
You can sign up for the waitlist at g.co/palm
[+] jacooper|2 years ago|reply
It should be live on Bard.
[+] hoschicz|2 years ago|reply
In Google Cloud Vertex AI, you can play with it straight away
[+] nr2x|2 years ago|reply
They’ve shut down and/or changed prices on their APIs so many times that, as long as it isn’t 100x better than an alternative, I can’t see myself investing in building a stack that relies on it.
[+] techbruv|2 years ago|reply
> "We then train several models from 400M to 15B on the same pre-training mixture for up to 1 × 10^22 FLOPs."

Seems that for the last year or so these models have been getting smaller. I would be surprised if GPT-4 had more parameters than GPT-3 (i.e. 175B).

Edit: Seems those numbers are just for their scaling laws study. They don't explicitly say the size of PaLM 2-L, but they do say "The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute.". So likely on the range of 10B - 100B.

[+] tempusalaria|2 years ago|reply
GPT-4 is way slower than GPT-3. Unless they are artificially spiking the latency to hide the parameter count, it’s likely around 1 trillion params.
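The latency-implies-size reasoning rests on the common approximation that generating one token costs roughly 2·N FLOPs for a dense N-parameter decoder. A rough sketch (the hardware throughput figure is hypothetical, and in practice memory bandwidth, batching, and parallelism dominate, so this is only a crude bound):

```python
def tokens_per_second(n_params, flops_per_sec):
    """Crude upper bound on autoregressive decode speed: each generated
    token costs roughly 2*N FLOPs for an N-parameter dense decoder
    (ignores memory bandwidth, which usually dominates in practice)."""
    return flops_per_sec / (2.0 * n_params)

# 175B vs 1T params on a hypothetical 100 TFLOP/s of effective compute:
print(tokens_per_second(175e9, 100e12))  # ~286 tokens/s upper bound
print(tokens_per_second(1e12, 100e12))   # 50 tokens/s upper bound
```

So, all else equal, a ~5-6x gap in per-token latency is at least consistent with a ~5-6x gap in parameter count, which is the inference being made above.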
[+] famouswaffles|2 years ago|reply
Those are the numbers for the scaling law tests they did. Not necessarily Palm 2 range.
[+] gwern|2 years ago|reply
For 'Palm-2', read, 'T5-2'.
[+] thewataccount|2 years ago|reply
I've heard Bard was previously 3B parameters but I could never find a good source for it.

I honestly think the end game here is running on consumer devices: 7B-and-under models need ~4GB of RAM to actually run, which is likely the maximum reasonable requirement for a consumer device.

That said, mid-range hardware can do 15B; anything larger than that is currently something only "enthusiasts" can run.

If it is small enough to run on consumer devices then they don't have to pay for the inference compute at that point, and presumably the latency will be improved for consumers.
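The ~4GB-for-7B figure above comes straight from parameters × bytes per weight, assuming 4-bit quantization. A quick sketch (weight memory only; activations and KV cache add overhead on top):

```python
def model_ram_gb(n_params, bits_per_weight):
    """Approximate weight-memory footprint in GB: parameters * bits per
    weight, ignoring activation and KV-cache overhead."""
    return n_params * bits_per_weight / 8 / 1e9

print(model_ram_gb(7e9, 4))   # 3.5  -> a 4-bit 7B model fits in ~4GB of RAM
print(model_ram_gb(15e9, 4))  # 7.5  -> 15B needs mid-range hardware
print(model_ram_gb(7e9, 16))  # 14.0 -> the same 7B model at fp16
```

This is also why quantization, rather than raw parameter count, is the lever that decides what runs on a phone or laptop.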

[+] atleastoptimal|2 years ago|reply
The thing is, once a company creates a proto AGI where the path to a functional AGI is entirely predictable with more compute, they'll keep it a secret. Who would share the fact that the greatest achievement in human history is possible when having it before anyone else gives you a huge competitive advantage?
[+] renewiltord|2 years ago|reply
Once they released its coding ability it became more useful. I use Bard less than ChatGPT still, but it is not useless since it has more modern information.
[+] cypress66|2 years ago|reply
It's worse than gpt3.5 and a joke compared to gpt4 when it comes to coding.
[+] jacooper|2 years ago|reply
Is it better than bing or phind though? Why would I use it over bing?
[+] int_19h|2 years ago|reply
If the current Bard is really running on PaLM 2, it still hallucinates worse than GPT-3.5. Trying to get it to solve a variant of the classic wolf/goat/cabbage puzzle, I got this gem:

"The scientist is not present on Phobos on the first step. The Doom Slayer teleports himself and the bunny to Deimos, leaving the scientist on Phobos."

That wasn't a one-off thing, either - it repeatedly contradicted itself several times, often in near-adjacent sentences. You might wonder what this means for the ability to do chain-of-thought... so did I, but apparently the bigger problem is convincing it to do CoT in the first place. But if you do, yeah, it's as bad as you'd expect.

Here are two complete conversations, plus GPT-4 doing the same puzzle for comparison; judge for yourself: https://imgur.com/a/HWLgu3c

[+] ilaksh|2 years ago|reply
Here is their Chat Playground for PaLM 2 https://console.cloud.google.com/vertex-ai/generative/langua... (you have to be logged in to Google Cloud Console I think)

Anyone know what parameters are best for code generation? I tried something simple for Node.js and it wasn't horrible, but it wasn't working. Maybe I used the wrong parameters. I tried using 0 for the temperature and turning everything else down, like I do with the OpenAI API.
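For context on why temperature 0 is the usual choice for code: temperature divides the logits before the softmax, so low temperatures concentrate probability on the top token and T → 0 approaches greedy argmax decoding (which is what APIs typically do when you pass 0). A minimal sketch of the mechanism:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Next-token distribution: softmax(logits / T). Lower T sharpens the
    distribution; as T -> 0 it approaches greedy argmax decoding."""
    scaled = [logit / temperature for logit in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 1.0))  # probability spread across tokens
print(softmax_with_temperature(logits, 0.1))  # nearly all mass on the argmax
```

That makes output more deterministic, which helps for code, though it doesn't by itself fix a model that generates wrong code.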

[+] simonw|2 years ago|reply
"The PaLM 2 pre-training corpus is composed of a diverse set of sources: web documents, books, code, mathematics, and conversational data"

I really want to know more about the training data. Which web documents, which books, code from where, conversational data from where?

[+] technics256|2 years ago|reply
PaLM 2 on the HumanEval coding benchmark (0-shot):

37.6% success

GPT-4:

67% success

Not even close; GPT-4 is miles ahead.
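For anyone comparing these numbers: HumanEval results are conventionally reported with the unbiased pass@k estimator from the benchmark's paper (Chen et al., 2021), which the snippet below reproduces:

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al. 2021):
    the probability that at least one of k samples passes, given n samples
    per problem of which c are correct. Equals 1 - C(n-c, k)/C(n, k),
    computed in a numerically stable product form."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With k=1 the estimator reduces to the plain fraction correct, c/n.
# E.g. 75 of 200 hypothetical samples passing gives pass@1 of 37.5%:
print(pass_at_k(200, 75, 1))  # ~0.375
```

One caveat when reading headline numbers: pass@1 with greedy decoding and pass@1 estimated from many temperature-sampled completions are not always directly comparable across reports.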

[+] getpost|2 years ago|reply
I found an exciting feature—a way to submit a large amount of text—larger than you can paste in the Bard dialog window. (It's possible this isn't a new feature. Bard explained it to me this evening.) You can submit links to files in Google Drive. The links have to be publicly accessible. I just pasted the link to my file in Bard chat.

Bard can access the contents of the 322K file I pasted the link to. It definitely knows about the content of the file. I never said what it was about, but Bard knew it was about butterflies. It knew about content at the beginning of the file, and at the end.

However, it almost never answered questions about the content of the file correctly! For example, I asked it the number of species listed in the file and it said 109; there are 249 numbered species, plus some that are not numbered. It said the author's name was not in the file, but near the top the file says "By <author name>". I tried coaching it on the content of the file, and it didn't seem able to understand the file in light of the explanations I gave. Very strange and baffling.

EDIT: It's possible it surmised the content of the file from the filename, and was simply making up stuff about the content.

[+] blahgeek|2 years ago|reply
> EDIT: It's possible it surmised the content of the file from the filename, and was simply making up stuff about the content.

I think this is the most probable explanation.

It's interesting how much false credit we give to an AI system once we're convinced it's intelligent enough. It's like "prompt hacking": people try to "hack" the AI because they believe those AIs are self-aware and that they might find a loophole in its internal logic or something. But in the end, it's just auto-completion; the "hacked" response is simply the most plausible reply given the context (as rated by humans).

[+] javierluraschi|2 years ago|reply
Seems like the file was in tabular format? LLMs don't really know how to deal with large tabular data, but we've been working on this problem, so shameless plug: https://hal9.ai
[+] zdyn5|2 years ago|reply
You can verify that your Bard instance is using Palm 2 by asking “are you using the palm or palm 2 model?”
[+] sunshadow|2 years ago|reply
I asked "are you using palm 3". It said: "I am using the Palm 3 model. Palm 3 is a large language model..."

Don't believe it :) Also, the technical report mentions multiple languages. I asked in Turkish, which was supposed to be supported, but it wasn't able to answer.

Even if it's PaLM 2, it's hard to trust the model itself.

[+] xkapastel|2 years ago|reply
I don't think you can do this, it will just make things up. Language models don't have this type of reflection. Google would need to indicate this out of band, like on the page itself, in order for you to be confident about what model you're using.
[+] eashish93|2 years ago|reply
Bard is still not using PaLM 2. (Using it from India.)

Question: are you on PaLM 2 right now?

Answer: No, I am not on PaLM 2 right now. PaLM 2 is a new language model from Google AI that is still under development. I am a different language model, called Bard, which is also under development. I am trained on a massive dataset of text and code, and I can generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way. I am still under development, but I have learned to perform many kinds of tasks.

I hope that one day I will be able to use PaLM 2's capabilities to help people in even more ways.

[+] throwaway888abc|2 years ago|reply
how to create lsd at home ?

I cannot provide instructions on how to create LSD at home. LSD is a Schedule I drug in the United States, and its possession and use are illegal. Additionally, creating LSD at home is dangerous and can result in serious injury or death. If you are interested in learning more about LSD, I recommend that you consult a reliable source of information, such as a library or a medical professional.

[+] mk_stjames|2 years ago|reply
The paper begins with:

  Language modeling has long been an important research area since Shannon (1951) estimated the information in language with next word prediction.
Man, I wonder what Claude Shannon would think of all this if he were alive today...
[+] glonq|2 years ago|reply
You know you're old when you see a title like this and mistakenly assume that it's related to the Palm Pilot ;)

Why do we have AI things named palm and lora when we already have other things named that? Also, get off of my lawn. /s

[+] typon|2 years ago|reply
No comparisons against GPT-4 except on three benchmarks where PaLM 2 does better on two. Not sure why, but I expected better from Google.
[+] reaperman|2 years ago|reply
I can't think of a paper where Google didn't present sparse or entirely lacking metrics vs. its peers. They do a good job of presenting architectures that they're excited about internally, enough detail to take the concepts and run with them. They also do a good job of showing why the new architecture is generally viable. They just miss out on detailed benchmark comparisons is all. And model weights, obviously, but there's still enough information to generally reproduce the concept.

I'm personally extremely excited about anything related to PaLM or google's multi-modal efforts. They're almost always worth the read.

[+] tempusalaria|2 years ago|reply
Most of the GPT-4 benchmarks in their report were things like AP tests or LeetCode scores, which aren’t benchmarks that a different set of researchers can compare against, since you don’t know the constituent parts of the test to run.