Cerebras-GPT: A Family of Open, Compute-Efficient, Large Language Models

567 points | asb | 3 years ago | cerebras.net

232 comments

[+] 2bitencryption|3 years ago|reply
This type of article (or press release, or whatever you want to call it) is exactly what makes the future so interesting.

The cat is out of the bag, the genie is out of the bottle, the confetti has left the cannon[0].

It's tempting to see a world dominated by Google Bard, ChatGPT, Bing Search, etc. And no doubt, they will be huge players, with services that are far more powerful than anything that can be run on the edge.

But. BUT. The things that we can do on the edge are incredible now. Just imagine a year from now, or two. These earth-shattering models, which seem to be upending a whole industry, will soon have equivalents that run on the edge. Without services spying on your data. Without censorship on what the model can/cannot say. Because it's all local.

When was the last time this happened? There will be players who publish weights for models that are free to use. The moment that torrent magnet link is published, it's out in the wild. And smart people will package them as "one click installers" for people who aren't tech-savvy. This is already happening.

So every time you're amazed by something chat-gpt4 says, remember that soon this will be in your pocket.

[0] the "confetti" idiom brought to you by chat-gpt4.

[+] simon83|3 years ago|reply
Google: "confetti has left the cannon"

> No results found for "confetti has left the cannon".

I'm amazed that a "stochastic parrot" can come up with such a beautiful idiom.

[+] jazzkingrt|3 years ago|reply
Serious question: is it typical to describe client-side computing as "on the edge"?

I thought running something on the edge referred to running it in close network proximity to the user, rather than users having control and running things themselves.

[+] slowmovintarget|3 years ago|reply
> Without services spying on your data. Without censorship on what the model can/cannot say. Because it's all local...

Wouldn't that be nice? It would also be contrary to all experience of the outcomes and pulls of corporations in modern society. The "local" LLMs will be on the fringe more than at the edge, because the ones that work the best and attract the most money will be the ones controlled by walled-garden "ecosystems."

I really hope it's different. I really hope there are local models. Actual personal assistants actually designed to assist their users and not the people that provide the access.

[+] hiAndrewQuinn|3 years ago|reply
I for one dream of a future without maps. I want to walk through a distant forest to find an ancient, unconnected ESP-32 in the bark of a tree containing a tiny specialized AI that can only tell me about things relevant to the area, how far to walk upstream to the nearest town. And only if I can find it and scan an RFID tag to wake it up.
[+] hintymad|3 years ago|reply
I'd go one step further if it is not happening yet: smaller companies should really pool their resources to train open LLMs. Say, form a consortium and work with the open source community to build a ChatGPT equivalent. Companies would be crazy to assume that they can hand their future to the APIs offered by a handful of companies during this monumental technological paradigm shift in history.

That is, a real OpenAI with an open governing body.

[+] lioeters|3 years ago|reply
Yes, yes, and yes. I'm waiting for an actually open AI that can run on the edge, purely on commodity hardware like our laptops and phones - it's inevitable.

I imagine this "cat out of the bag" situation, the democratization and commodification of powerful technology accessible and affordable to the public, is similar to what's happening with single-board computers and microcontrollers like Raspberry Pi, Arduino, ESP32.

It might be similar to what happened with mobile phones, but there the power was quite restricted. The (mostly) duopoly of iOS and Android, with devices and apps locked down in various ways. Sure we can "jail break" and "root" our phone, but that's not for the general public.

Maybe solar energy production is going through a similar process, with panels and batteries becoming more efficient and affordable every year.

Certainly, it reminds one of the history of personal computers, the way such a powerful general-purpose tool became ubiquitous and local.

[+] cjf101|3 years ago|reply
Yes, this is true. But, I worry about how long it will take for the utility of "GPT-4" on my phone to be close enough to whatever is only possible through models running on large cloud platforms to make that choice relatively drawback free.

Is the curve of what this class of algorithms can provide sigmoid? If so, then yeah, eventually researchers should be able to democratize it sufficiently that the choice to use versions that run on private hardware is rational. But if the utility increases linearly or better over time/scale, the future will belong to whoever owns the biggest datacenters.

[+] xnx|3 years ago|reply
This is a shocking turn of events given there's no edge equivalent of the previous most powerful information tools (web-scale search). It does seem like it will still be a challenge to continuously collect, validate, and train on fresh information. Large orgs like Google/YouTube/TikTok/Microsoft still seem to have a huge advantage there.
[+] binarymax|3 years ago|reply
Here are the zero-shot accuracy numbers posted in the Huggingface evaluations for Cerebras-GPT 13B vs. the results of LLaMa 13B in their paper:

    Model              BoolQ PIQA SIQA HellaSwag WinoGrande ARC-e ARC-c OBQA
    LLaMa 13B          78.1  80.1 50.4 79.2      73         74.8  52.7  56.4
    Cerebras-GPT 13B   -     76.6 -    51.3      64.6       71.4  36.7  28.6
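As a rough aggregate of the table above (a sketch only; a plain average weights all benchmarks equally, which neither paper does, and it skips BoolQ/SIQA since Cerebras-GPT doesn't report them):

```python
# Zero-shot accuracies from the table above, restricted to the six
# tasks both models report (BoolQ and SIQA are missing for Cerebras-GPT).
tasks = ["PIQA", "HellaSwag", "WinoGrande", "ARC-e", "ARC-c", "OBQA"]
llama_13b = [80.1, 79.2, 73.0, 74.8, 52.7, 56.4]
cerebras_13b = [76.6, 51.3, 64.6, 71.4, 36.7, 28.6]

avg_llama = sum(llama_13b) / len(llama_13b)
avg_cerebras = sum(cerebras_13b) / len(cerebras_13b)
print(f"LLaMA 13B avg:        {avg_llama:.1f}")      # ~69.4
print(f"Cerebras-GPT 13B avg: {avg_cerebras:.1f}")   # ~54.9
print(f"Gap:                  {avg_llama - avg_cerebras:.1f} points")
```

About a 14.5-point gap on the shared tasks, consistent with the replies below about training-token counts.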
[+] smaddox|3 years ago|reply
Cerebras is "training compute optimal". Llama appears to be trained far beyond "training compute optimal". The tradeoff is that inference is closer to optimal for Llama, i.e. better performance with a smaller model.
[+] wsgeorge|3 years ago|reply
I guess it's something. It still goes to show how far open models are behind the proprietary SOTA.
[+] riku_iki|3 years ago|reply
Have these models been trained on the same dataset? Otherwise it is apples to oranges comparison.
[+] johnchristopher|3 years ago|reply
OT: I don't know about their scaling strategy for LLM but their scaling strategy for displaying pictures is disappointing.

(it's all blurry)

[+] thewataccount|3 years ago|reply
They're dynamically scaled and something must be broken. If you inspect source you can find the raw images, here's a few:

https://www.cerebras.net/wp-content/uploads/2023/03/Downstre...

https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-...

https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-...

EDIT: Looks like it scores better with less training, up until it matches GPT-J/Pythia/OPT, and doesn't appear to have much benefit beyond that. It maybe scores slightly better than GPT-J, which is pretty "eh"; I'm not sure GPT-J-level performance is really useful for anything? NeoX 20B outperforms it in everything if you don't care about the amount of training needed.

Does the better performance for less training matter if that benefit only applies while it's performing a lot worse than GPT-J? It appears to lose its scaling benefits before the performance is interesting enough to matter.

[+] Kelamir|3 years ago|reply
Last time I viewed it, I believe it wasn't blurry. Perhaps the images are now served at lower quality to cope with the traffic?

But I'm not sure anymore that it wasn't initially blurry... Perhaps I'm hallucinating, like large language models.

Current image displayed is https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-... , will see if it changes.

[+] ricopags|3 years ago|reply
Came here to point this out, though not as pithily :D

Really, really bad mark on whoever is in charge of their web marketing. Images should never look that bad, not even in support, but definitely not in marketing.

edit: to make this post more useful: the above is at 4K resolution, using the Edge browser

[+] mometsi|3 years ago|reply
Summary: This is a company that makes AI accelerator ICs. They reimplemented Chinchilla and released the model weights under a permissive license.
[+] bogwog|3 years ago|reply
In other words, they’re actually incentivized to help make LLMs as accessible as possible, rather than try to keep them locked up to hide them from competitors.

Which makes me wonder if Nvidia is doing anything with LLMs too?

[+] eldenring|3 years ago|reply
> Trained using the Chinchilla formula, these models provide the highest accuracy for a given compute budget.

I'm confused as to why 111 million parameter models are trained with the Chinchilla formula. Why not scale up the training data? If you're training smaller models, surely optimizing performance is better than optimizing total compute.

Seems like a silly misunderstanding of the Chinchilla paper, but I'm sure I'm missing something

[+] gamegoblin|3 years ago|reply
True. There was a good blog post published about this a few weeks ago: https://finbarr.ca/llms-not-trained-enough/

Money quote for those who don't want to read the whole thing:

'''

When people talk about training a Chinchilla-optimal model, this is what they mean: training a model that matches their estimates for optimality. They estimated the optimal model size for a given compute budget, and the optimal number of training tokens for a given compute budget.

However, when we talk about “optimal” here, what is meant is “what is the cheapest way to obtain a given loss level, in FLOPS.” In practice though, we don’t care about the answer! This is exactly the answer you care about if you’re a researcher at DeepMind/FAIR/AWS who is training a model with the goal of reaching the new SOTA so you can publish a paper and get promoted. If you’re training a model with the goal of actually deploying it, the training cost is going to be dominated by the inference cost. This has two implications:

1) there is a strong incentive to train smaller models which fit on single GPUs

2) we’re fine trading off training time efficiency for inference time efficiency (probably to a ridiculous extent).

Chinchilla implicitly assumes that the majority of the total cost of ownership (TCO) for a LLM is the training cost. In practice, this is only the case if you’re a researcher at a research lab who doesn’t support products (e.g. FAIR/Google Brain/DeepMind/MSR). For almost everyone else, the amount of resources spent on inference will dwarf the amount of resources spent during training.

'''
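The Chinchilla recipe the quote refers to works out to roughly 20 training tokens per parameter, with training compute usually approximated as C ≈ 6·N·D FLOPs. A back-of-the-envelope sketch (the 20:1 ratio is the commonly cited approximation, not the paper's exact fitted coefficients):

```python
TOKENS_PER_PARAM = 20  # commonly cited Chinchilla ratio (approximate)

def chinchilla_optimal(n_params):
    """Approximate compute-optimal token count and training FLOPs
    for a model with n_params parameters."""
    d_tokens = TOKENS_PER_PARAM * n_params
    train_flops = 6 * n_params * d_tokens  # standard C ~ 6*N*D estimate
    return d_tokens, train_flops

for n in [111e6, 1.3e9, 13e9]:
    d, c = chinchilla_optimal(n)
    print(f"{n/1e9:5.2f}B params -> {d/1e9:7.1f}B tokens, {c:.2e} train FLOPs")
```

Note the 13B case lands at ~260B tokens, which matches the "<300B tokens" figure mentioned elsewhere in the thread, versus the 1T+ tokens LLaMA 7B saw.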

[+] ftxbro|3 years ago|reply
The point of those smaller models is for the "Cerebras Scaling Law for Compute-Optimal Training" which is the straight line plot in the image at the top of their webpage when you click the link.

They want you to find the extrapolation reasonable: because the line is so straight (on a log FLOPs scale) for so long, it's tempting to project the Pile-loss consequences of continuing compute-optimal training to models larger than their biggest 13B one, with the obvious caveat that the extrapolation can't continue linearly much further, if for no other reason than that the test loss isn't going to go below zero (it will flatten out sooner than that).

If you trained beyond compute-optimality on smaller models, it would mess up their straight line and make it look like we are sooner hitting diminishing returns on test loss.
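The extrapolation being described amounts to fitting a power law L(C) = a·C^(-b), which is a straight line in log-log space. A minimal sketch with synthetic points (illustrative values, not Cerebras's actual Pile losses):

```python
import numpy as np

# Hypothetical (compute, test-loss) pairs lying exactly on a power law
# L = a * C**(-b); real scaling-law points would come from the paper.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = 15.0 * compute ** -0.05

# Fit a degree-1 polynomial in log-log space to recover the exponent.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope

# Extrapolate one decade beyond the largest observed budget. The fit
# never predicts negative loss, but in reality an irreducible-loss term
# bends the line well before it approaches zero.
pred = a * (1e22) ** -b
print(f"fitted exponent b = {b:.3f}, extrapolated loss at 1e22 FLOPs = {pred:.3f}")
```

Training smaller models past compute-optimality would pull their points off this line, which is the distortion being described.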

[+] haldujai|3 years ago|reply
You’re not wrong; the Chinchilla rationale is that it may be more compute-efficient to obtain a given loss using larger model sizes if the budget allows. As another commenter states, this ignores the inference part of the equation.

As an example, the BERT/RoBERTa family was trained for much longer than the Chinchilla recipe would suggest, though you do get diminishing returns.

There is a point of overtraining where downstream performance is impacted but that’s pretty high.

I think part of the answer to this is also that xxx million parameter decoder-only models don’t seem to be that useful so it may not be worthwhile to optimize them for performance?

[+] rnosov|3 years ago|reply
I might be missing something but it looks to me that actually running this "open" model requires special hardware only accessible with a cloud subscription with 60 000 USD / week minimum spend[1]. Can anyone confirm if you can run it on your own hardware? If software is open but hardware is locked I don't see the point.

[1] https://www.hpcwire.com/2021/09/16/cerebras-wafer-scale-engi....

EDIT: Ok, looks like I've missed the hugging face repo. The language they use is a bit confusing.

[+] simonw|3 years ago|reply
The PyTorch model files are already available to download from Hugging Face - the largest one looks to be 52GB. They should run on any hardware that can run regular PyTorch models.
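That 52 GB figure is consistent with 13B parameters stored as fp32. A quick sketch of the arithmetic (assuming dense fp32 weights with no optimizer state, decimal GB):

```python
def checkpoint_gb(n_params, bytes_per_param):
    """Rough on-disk / in-memory size of raw model weights."""
    return n_params * bytes_per_param / 1e9  # decimal gigabytes

n = 13e9
print(f"fp32: {checkpoint_gb(n, 4):.0f} GB")  # ~52 GB, matching the download
print(f"fp16: {checkpoint_gb(n, 2):.0f} GB")  # halved for typical GPU inference
print(f"int8: {checkpoint_gb(n, 1):.0f} GB")  # a common quantization target
```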
[+] bubblethink|3 years ago|reply
You can run inference on GPUs. These are just models and weights.
[+] Garcia98|3 years ago|reply
I've been following open source LLMs for a while, and at first glance this doesn't seem too powerful compared to other open models. Flan-Alpaca[0] is licensed under Apache 2.0, and it seems to perform much better. Although I'm not sure about the legalities of that licensing, since it's basically Flan-T5 fine-tuned on the Alpaca dataset (which is under a non-commercial license).

Nonetheless, it's exciting to see all these open models popping up, and I hope that a LLM equivalent to Stable Diffusion comes sooner than later.

[0]: https://github.com/declare-lab/flan-alpaca

[+] alchemist1e9|3 years ago|reply
Sounds like you might be the right person to ask the “big” question.

For a small organization or individual who is technically competent and wants to try and do self-hosted inference.

What open model is showing the most promise, and how do its results compare to the various OpenAI GPTs?

A simple example problem would be asking for a summary of code. I've found OpenAI's GPT-3.5 and 4 to give pretty impressive English descriptions of code. Running that locally in batch would retain privacy, and even if slow it could just be kept running.

[+] ftxbro|3 years ago|reply
Their goal isn't to make a powerful model. It's to show how well compute-optimal models do on test-loss as a function of increasing model size. This function can be used with some caveats to forecast the test-loss of larger models for which compute-optimality becomes more important.
[+] simonw|3 years ago|reply
Does the chinchilla recipe still hold today? I got the impression that the LLaMA paper proposed a different result where throwing far more tokens at the problem had a very meaningful impact, or did I misunderstand that?
[+] visarga|3 years ago|reply
Of course this is great news, I hope these models can be fine-tuned to be like lighter versions of chatGPT. But I remember reading in the LLaMA paper that a small model can still improve when trained more than the Chinchilla budget.

> For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.

Cerebras says:

> For instance, training a small model with too much data results in diminishing returns and less accuracy gains per FLOP

But this is only of concern when you care about the training cost, such as when you are a budget-limited researcher or a company that doesn't deploy models at scale. When you care about the total cost of deployment, making a small model even better with lots of data is a smart move. In the end it matters more to have the most efficient model in prediction, not the most efficient model in training.
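The total-cost argument can be made concrete with the usual FLOP approximations (training ≈ 6·N·D, inference ≈ 2·N per token); the model/token numbers below come from the quoted Hoffmann et al. comparison, while the 10T-token serving volume is purely hypothetical:

```python
def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Total FLOPs over a model's life, using the standard
    ~6*N*D training and ~2*N-per-token inference approximations."""
    train = 6 * n_params * train_tokens
    inference = 2 * n_params * inference_tokens
    return train + inference

served = 10e12  # hypothetical: 10T tokens served over the deployment

# Chinchilla-optimal 10B model (200B tokens) vs over-trained 7B (1T tokens)
big = lifetime_flops(10e9, 200e9, served)
small = lifetime_flops(7e9, 1e12, served)
print(f"10B @ 200B tokens: {big:.2e} total FLOPs")
print(f" 7B @ 1T tokens:   {small:.2e} total FLOPs")
# Once inference dominates, the over-trained smaller model is cheaper overall,
# despite costing more than 3x as much to train.
```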

[+] antimatter15|3 years ago|reply
Looking at their charts it seems like their 6.7B model is considerably worse than GPT-J which is an existing open 6B model from several years ago.

I wish rather than stopping training early they would have run more data through a small model so we could have something more competitive with LLaMA 7B.

[+] chessgecko|3 years ago|reply
I wonder what led to such a gap between llama 7b and Cerebras 13b. I hope they discuss it in the paper.
[+] ftxbro|3 years ago|reply
This gap makes sense to me. The academic point of the Cerebras paper is to show their nice empirical scaling law for compute-optimal training, whereas the academic point of the LLaMA paper was to show that you can make small models punch above their weight by training them in a way that is deliberately not compute-optimal. Of course both of those publications had other academic and marketing purposes.

From the Cerebras blog post: "Trained using the Chinchilla formula, these models provide the highest accuracy for a given compute budget."

From the LLaMA paper: "The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used."

[+] jumpCastle|3 years ago|reply
Looks like llama 7b was trained on 4 times more tokens.
[+] amilios|3 years ago|reply
Comparing the 13B model here https://huggingface.co/cerebras/Cerebras-GPT-13B to LLaMA-13B https://github.com/facebookresearch/llama/blob/main/MODEL_CA... you can see that in all of the reasoning tasks Cerebras-GPT lags behind. Any reason to use Cerebras instead of LLaMA? Doesn't seem like it.
[+] potatoman22|3 years ago|reply
Can the LLaMA weights be used for commercial products?
[+] option|3 years ago|reply
it lags behind because, according to their blog post, it was trained on <300B tokens. The LLaMAs, as far as I know, were trained on more than a trillion
[+] eternalban|3 years ago|reply
> It takes substantial technical expertise to train very large models on GPUs. In the recently released GPT-4 Technical Report, OpenAI credits over thirty contributors just for compute infrastructure and scaling.

This is called a silver lining for some (in case you were worried about GPT taking your job). Privacy requirements alone will in the near term force major companies to run their own inference (if not training). The expertise required is nearly identical to that of running large-scale distributed computational graphs.

This is an interesting divergence from what happened with the web. The backends started out simple, before map-reduce and before deconstructing databases and processing distributed logs. With ML, we'll jump right into the complex backends, in tandem with easy-pickings early-stage edge applications (which we see daily on HN).

[+] mark_l_watson|3 years ago|reply
Even though I usually use OpenAI's APIs, just because that is the easiest path, I do also use Hugging Face open models (via their APIs, and running locally) and I will check out Cerebras also.

Alternatives are good!

[+] ioulaum|3 years ago|reply
I wonder if they've done some Alpaca style training on it... Granted, what made Alpaca useful was that it was finetuned with GPT-3's instruction following completions as examples.

And, at least officially, OpenAI's outputs can't be used to train other AI models.

Otherwise, if GPT-4 outputs were used to finetune these models, they may become much more interesting.

[+] rbanffy|3 years ago|reply
A tangential question: as chiplets become increasingly common, I wonder what Cerebras will do to keep their technological advantage of wafer-scale integration. What is the bandwidth and latency of the connections between the tiles? Is there such a thing as bandwidth per frontier length?