This type of article (or press release, or whatever you want to call it) is exactly what makes the future so interesting.
The cat is out of the bag, the genie is out of the bottle, the confetti has left the cannon[0].
It's tempting to see a world dominated by Google Bard, ChatGPT, Bing Search, etc. And no doubt, they will be huge players, with services that are far more powerful than anything that can be run on the edge.
But. BUT. The things that we can do on the edge are incredible now. Just imagine a year from now, or two. These earth-shattering models, which seem to be upending a whole industry, will soon have equivalents that run on the edge. Without services spying on your data. Without censorship on what the model can/cannot say. Because it's all local.
When was the last time this happened? There will be players who publish weights for models that are free to use. The moment that torrent magnet link is published, it's out in the wild. And smart people will package them as "one click installers" for people who aren't tech-savvy. This is already happening.
So every time you're amazed by something chat-gpt4 says, remember that soon this will be in your pocket.
[0] the "confetti" idiom brought to you by chat-gpt4.
Serious question: is it typical to describe client-side computing as "on the edge"?
I thought running something on the edge referred to running it in close network proximity to the user, rather than users having control and running things themselves.
> Without services spying on your data. Without censorship on what the model can/cannot say. Because it's all local...
Wouldn't that be nice? It would also be contrary to all experience of the outcomes and pulls of corporations in modern society. The "local" LLMs will be on the fringe more than at the edge, because the ones that work the best and attract the most money will be the ones controlled by walled-garden "ecosystems."
I really hope it's different. I really hope there are local models. Actual personal assistants actually designed to assist their users and not the people that provide the access.
I for one dream of a future without maps. I want to walk through a distant forest and find an ancient, unconnected ESP32 in the bark of a tree, containing a tiny specialized AI that can only tell me about things relevant to the area, like how far to walk upstream to the nearest town. And only if I can find it and scan an RFID tag to wake it up.
I'd go one step further, if it's not happening yet: smaller companies should really pool their resources to train open LLMs. Say, form a consortium and work with the open source community to build a ChatGPT equivalent. Companies would be crazy to assume that they can hand their future to the APIs offered by a handful of companies during this monumental technological paradigm shift.
That is, a real OpenAI, with an open governing body.
Yes, yes, and yes. I'm waiting for an actually open AI that can run on the edge, purely on commodity hardware like our laptops and phones - it's inevitable.
I imagine this "cat out of the bag" situation, the democratization and commodification of powerful technology accessible and affordable to the public, is similar to what's happening with single-board computers and microcontrollers like Raspberry Pi, Arduino, ESP32.
It might be similar to what happened with mobile phones, but there the power was quite restricted. The (mostly) duopoly of iOS and Android, with devices and apps locked down in various ways. Sure we can "jail break" and "root" our phone, but that's not for the general public.
Maybe solar energy production is going through a similar process, with panels and batteries becoming more efficient and affordable every year.
Certainly, it reminds one of the history of personal computers, the way such a powerful general-purpose tool became ubiquitous and local.
Yes, this is true. But I worry about how long it will take for the utility of a "GPT-4" on my phone to get close enough to what's only possible through models running on large cloud platforms that choosing the local option is relatively drawback-free.
Is the curve of what this class of algorithms can provide sigmoid? If so, then yeah, eventually researchers should be able to democratize it sufficiently that choosing versions that can run on private hardware becomes rational. But if the utility increases linearly or better over time/scale, the future will belong to whoever owns the biggest datacenters.
This is a shocking turn of events given there's no edge equivalent of the previous most powerful information tools (web-scale search). It does seem like it will still be a challenge to continuously collect, validate, and train on fresh information. Large orgs like Google/YouTube/TikTok/Microsoft still seem to have a huge advantage there.
Cerebras is "training compute optimal". LLaMA appears to be trained far beyond "training compute optimal". The tradeoff is that inference is closer to optimal for LLaMA, i.e. better performance from a smaller model.
https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...
Each individual "chip" has 40GB of SRAM vs ~76MB for the Nvidia H100, plus networked pools of external RAM, SSDs, and such. That's why the training architecture is so different.
https://www.cerebras.net/wp-content/uploads/2023/03/Downstre...
https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-...
https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-...
EDIT: Looks like it scores better with less training, up until it matches GPT-J/Pythia/OPT, after which there doesn't appear to be much benefit. It maybe scores slightly better than GPT-J, which is pretty "eh"; I'm not sure GPT-J-level performance is really useful for anything. NeoX 20B outperforms it in everything if you don't care about the amount of training needed.
Does the better performance for less training matter if that benefit only applies while it's performing a lot worse than GPT-J? It appears to lose its scaling benefits before the performance is interesting enough to matter.
Came here to point this out, though not as pithily :D
Really, really bad mark on whoever is in charge of their web marketing. Images should never look that bad, not even in support, but definitely not in marketing.
edit: so this post is more useful: 4K resolution, using the Edge browser
In other words, they’re actually incentivized to help make LLMs as accessible as possible, rather than try to keep them locked up to hide them from competitors.
Which makes me wonder if Nvidia is doing anything with LLMs too?
> Trained using the Chinchilla formula, these models provide the highest accuracy for a given compute budget.
I'm confused as to why 111 million parameter models are trained with the Chinchilla formula. Why not scale up the training data? If you're training smaller models, surely optimizing performance is better than optimizing total compute.
Seems like a silly misunderstanding of the Chinchilla paper, but I'm sure I'm missing something
Money quote for those who don't want to read the whole thing:
'''
When people talk about training a Chinchilla-optimal model, this is what they mean: training a model that matches their estimates for optimality. They estimated the optimal model size for a given compute budget, and the optimal number of training tokens for a given compute budget.
However, when we talk about “optimal” here, what is meant is “what is the cheapest way to obtain a given loss level, in FLOPS.” In practice though, we don’t care about the answer! This is exactly the answer you care about if you’re a researcher at DeepMind/FAIR/AWS who is training a model with the goal of reaching the new SOTA so you can publish a paper and get promoted. If you’re training a model with the goal of actually deploying it, the training cost is going to be dominated by the inference cost. This has two implications:
1) there is a strong incentive to train smaller models which fit on single GPUs
2) we’re fine trading off training time efficiency for inference time efficiency (probably to a ridiculous extent).
Chinchilla implicitly assumes that the majority of the total cost of ownership (TCO) for a LLM is the training cost. In practice, this is only the case if you’re a researcher at a research lab who doesn’t support products (e.g. FAIR/Google Brain/DeepMind/MSR). For almost everyone else, the amount of resources spent on inference will dwarf the amount of resources spent during training.
'''
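To make the quoted tradeoff concrete: the Chinchilla prescription works out to roughly 20 training tokens per parameter, and training cost is commonly approximated as ~6·N·D FLOPs. A rough sketch (the 20x and 6ND figures are the standard approximations, not exact constants from the paper):

```python
# Rough sketch of the Chinchilla rule of thumb: compute-optimal training
# uses roughly 20 tokens per parameter. The exact coefficients in
# Hoffmann et al. differ slightly; 20x is the commonly cited approximation.

def chinchilla_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    """Approximate compute-optimal number of training tokens."""
    return n_params * tokens_per_param

def training_flops(n_params: int, n_tokens: int) -> float:
    """Standard ~6*N*D estimate of training FLOPs."""
    return 6.0 * n_params * n_tokens

for n_params in (111_000_000, 1_300_000_000, 13_000_000_000):
    n_tokens = chinchilla_tokens(n_params)
    print(f"{n_params / 1e9:5.2f}B params -> ~{n_tokens / 1e9:6.1f}B tokens, "
          f"~{training_flops(n_params, n_tokens):.1e} FLOPs")
```

Under this heuristic a 111M model would get only ~2.2B tokens, which is part of why the small compute-optimal models stop training so early.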
The point of those smaller models is the "Cerebras Scaling Law for Compute-Optimal Training", the straight-line plot in the image at the top of the linked page.
They want to convince you that, because the line is so straight (on a log-FLOPs scale) for so long, it's reasonable to extrapolate the Pile test loss of compute-optimal training to models larger than their biggest 13B one, with the obvious caveat that the extrapolation can't continue linearly much further, if for no other reason than that the test loss isn't going to go below zero (it will flatten out sooner than that).
If you trained beyond compute-optimality on the smaller models, it would mess up their straight line and make it look like we hit diminishing returns on test loss sooner.
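For illustration, here is a minimal sketch of that kind of extrapolation: fit a power law (a straight line in log-log space) to some made-up (FLOPs, test-loss) points standing in for the chart, then naively extend it one decade. The numbers are invented for the example, not taken from Cerebras' data:

```python
import math

# Hypothetical (FLOPs, test loss) points roughly on a power law
# loss ~= C * flops^a; the "straight line" lives in log-log space.
points = [(1e19, 3.20), (1e20, 2.85), (1e21, 2.54), (1e22, 2.26)]

xs = [math.log(f) for f, _ in points]
ys = [math.log(l) for _, l in points]
n = len(points)
xbar, ybar = sum(xs) / n, sum(ys) / n

# Ordinary least-squares fit of log(loss) = a * log(flops) + b.
a = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar) ** 2 for x in xs)
b = ybar - a * xbar

def extrapolate(flops: float) -> float:
    """Naively extend the fitted line beyond the measured range."""
    return math.exp(a * math.log(flops) + b)

print(round(extrapolate(1e23), 2))  # one decade past the fitted points
```

The fitted line happily predicts ever-lower loss for more compute; nothing in it knows that the loss is bounded below, which is exactly the caveat.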
You’re not wrong: the Chinchilla rationale is that it may be more compute-efficient to obtain a given loss using larger model sizes, if the budget allows. As another commenter notes, this ignores the inference part of the equation.
As an example, the BERT/RoBERTa family was trained for much longer than Chinchilla would prescribe, though you do get diminishing returns.
There is a point of overtraining where downstream performance is impacted, but that threshold is pretty high.
I think part of the answer is also that xxx-million-parameter decoder-only models don’t seem to be that useful, so it may not be worthwhile to optimize them for performance.
I might be missing something, but it looks to me like actually running this "open" model requires special hardware, accessible only through a cloud subscription with a USD 60,000/week minimum spend[1]. Can anyone confirm whether you can run it on your own hardware? If the software is open but the hardware is locked, I don't see the point.
[1] https://www.hpcwire.com/2021/09/16/cerebras-wafer-scale-engi....
EDIT: Ok, looks like I missed the Hugging Face repo. The language they use is a bit confusing.
The PyTorch model files are already available to download from Hugging Face - the largest one looks to be 52GB. They should run on any hardware that can run regular PyTorch models.
I've been following open source LLMs for a while, and at first glance this doesn't seem too powerful compared to other open models. Flan-Alpaca[0] is licensed under Apache 2.0 and seems to perform much better, although I'm not sure about the legalities of that licensing, since it's basically Flan-T5 fine-tuned on the Alpaca dataset (which is under a non-commercial license).
Nonetheless, it's exciting to see all these open models popping up, and I hope that an LLM equivalent of Stable Diffusion comes sooner rather than later.
[0]: https://github.com/declare-lab/flan-alpaca
Sounds like you might be the right person to ask the “big” question.
For a small organization or individual who is technically competent and wants to try self-hosted inference: what open model is showing the most promise, and how do its results compare to the various OpenAI GPTs?
A simple example problem would be asking for a summary of code. I’ve found OpenAI’s GPT-3.5 and 4 to give pretty impressive English descriptions of code. Running that locally in batch would retain privacy, and even if slow it could just be kept running.
Their goal isn't to make a powerful model. It's to show how well compute-optimal models do on test-loss as a function of increasing model size. This function can be used with some caveats to forecast the test-loss of larger models for which compute-optimality becomes more important.
Does the Chinchilla recipe still hold today? I got the impression that the LLaMA paper proposed a different result, where throwing far more tokens at the problem had a very meaningful impact. Or did I misunderstand that?
Of course this is great news; I hope these models can be fine-tuned into lighter versions of ChatGPT. But I remember reading in the LLaMA paper that a small model can still improve when trained beyond the Chinchilla budget.
> For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.
Cerebras says:
> For instance, training a small model with too much data results in diminishing returns and less accuracy gains per FLOP
But this is only a concern when you care about the training cost, such as when you are a budget-limited researcher or a company that doesn't deploy models at scale. When you care about the total cost of deployment, making a small model even better with lots of data is a smart move. In the end it matters more to have the most efficient model at inference, not the most efficient model in training.
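As a back-of-envelope sketch of that tradeoff, using the standard approximations of ~6·N·D FLOPs for training and ~2·N FLOPs per generated token for inference (and ignoring whether the two models reach the same loss). The model sizes echo the thread; the serving volume is an illustrative assumption:

```python
# Back-of-envelope total-cost comparison. All concrete numbers below are
# illustrative assumptions, not measurements.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard ~6*N*D estimate of training FLOPs."""
    return 6 * n_params * n_tokens

def inference_flops(n_params: float, n_tokens_served: float) -> float:
    """Standard ~2*N FLOPs per generated token."""
    return 2 * n_params * n_tokens_served

# A 13B model trained Chinchilla-style (~20 tokens/param = 260B tokens)
# vs a 7B model over-trained on 1T tokens, as in the LLaMA paper.
train_13b = training_flops(13e9, 260e9)
train_7b = training_flops(7e9, 1e12)

served = 5e12  # assumed lifetime tokens served; dominates at deployment scale
total_13b = train_13b + inference_flops(13e9, served)
total_7b = train_7b + inference_flops(7e9, served)

# The over-trained small model costs more to train but less overall:
print(total_7b < total_13b)  # True at this serving volume

# Serving volume at which the 7B's extra training cost pays for itself:
break_even = (train_7b - train_13b) / \
             (inference_flops(13e9, 1) - inference_flops(7e9, 1))
print(f"break-even at ~{break_even:.2e} tokens served")
```

Below the break-even volume the compute-optimal 13B is cheaper overall; past it, the over-trained 7B wins, which is the LLaMA-style bet.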
Looking at their charts it seems like their 6.7B model is considerably worse than GPT-J which is an existing open 6B model from several years ago.
I wish rather than stopping training early they would have run more data through a small model so we could have something more competitive with LLaMA 7B.
This gap makes sense to me. The academic point of the Cerebras paper is to show their nice empirical scaling law for compute-optimal training, whereas the academic point of the LLaMA paper was to show that you can make small models punch above their weight by training them in a way that is deliberately not compute-optimal. Of course both of those publications had other academic and marketing purposes.
From the Cerebras blog post: "Trained using the Chinchilla formula, these models provide the highest accuracy for a given compute budget."
From the LLaMA paper: "The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used."
> It takes substantial technical expertise to train very large models on GPUs. In the recently released GPT-4 Technical Report, OpenAI credits over thirty contributors just for compute infrastructure and scaling.
This is called a silver lining for some (in case you were worried about GPT taking your job). Privacy requirements alone will, in the near term, force major companies to run their own inference (if not training). The expertise required is nearly identical to that of running large-scale distributed computational graphs.
This is an interesting divergence from what happened with the web. The backends started out simple, before map-reduce, before deconstructed databases and distributed log processing. With ML, we'll jump right into complex backends, in tandem with easy-pickings early-stage edge applications (which we see daily on HN).
Even though I usually use OpenAI's APIs, just because that is the easiest path, I do also use Hugging Face open models (via their APIs, and running locally) and I will check out Cerebras also.
I wonder if they've done some Alpaca-style training on it... Granted, what made Alpaca useful was that it was fine-tuned with GPT-3's instruction-following completions as examples.
And, at least officially, OpenAI's outputs can't be used to train other AI models.
Otherwise, if GPT-4 outputs were used to finetune these models, they may become much more interesting.
A tangential question: as chiplets become increasingly common, what will Cerebras do to keep its technological advantage in wafer-scale integration? What are the bandwidth and latency of the connections between the tiles? Is there such a thing as bandwidth per frontier length?
simon83 | 3 years ago
> No results found for "confetti has left the cannon".
I'm amazed that a "stochastic parrot" can come up with such a beautiful idiom.
johnchristopher | 3 years ago
(it's all blurry)
Kelamir | 3 years ago
But I'm not sure anymore that it wasn't initially blurry... Perhaps I'm hallucinating, like large language models.
Current image displayed is https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-... , will see if it changes.
gpm | 3 years ago
Edit: The huggingface page has 0-shot benchmarks which you can compare against the llama paper
https://huggingface.co/cerebras/Cerebras-GPT-13B
https://arxiv.org/pdf/2302.13971.pdf
simonw | 3 years ago
That was the largest that had inference enabled - I'd really like to try this one: https://huggingface.co/cerebras/Cerebras-GPT-13B
mark_l_watson | 3 years ago
Alternatives are good!
JamesCoyne | 3 years ago
I remember seeing news about the enormous chip Cerebras was/is selling (pdf https://f.hubspotusercontent30.net/hubfs/8968533/WSE-2%20Dat...).
Has there been any indication that the LLMs released in the last few months use exotic hardware like this, or is it all "standard" hardware?