Are there any training/ownership models like Folding@Home? People could donate idle GPU resources in exchange for access to the data, and perhaps ownership. Then instead of someone needing to pony up $85k to train a model, a thousand people can train a fraction of the model on their consumer GPU and pool the results, reap the collective rewards.
There is still a very large open problem in how to federate large numbers of loosely coupled computers to speed up training "interesting" models. I've worked in both domains (protein folding via Folding@home/protein folding using supercomputers, and ML training on single nodes/ML training on supercomputers) and at least so far, ML hasn't really been a good match for embarrassingly parallel compute. Even in protein folding, Folding@home has a number of limitations that are much better addressed on supercomputers (for example: if your problem requires making extremely long individual simulations of large proteins).
All that could change, but I think for the time being, interesting/big models need to be trained on tightly coupled GPUs.
Unfortunately, training is not an embarrassingly parallel [0] problem; it would require a new architecture. Current models diverge too fast. By the time you had downloaded the weights and computed your contribution, the model would have descended somewhere else and your delta would no longer be applicable, since it was based on a stale initial state.
It would be great if merge-ability existed. It would likely also apply to efficient/optimal shrinking of models.
Maybe you could dispatch tasks to train on many variations of similar tasks and take the average of the results? It could probably help in some way, but you'd still have a large serialized pipeline to munch through, and you'd likely need some serious hardware on the client side, e.g. dual RTX 4090s.
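A toy sketch of the staleness problem described above: a volunteer computes a delta against a snapshot of the weights, but the server keeps descending in the meantime, so the delta no longer fits the state it arrives at (all numbers here are illustrative only):

```python
# Toy illustration: two parties optimize f(w) = (w - 3)^2, but one delta
# arrives only after the server has already moved on.

def grad(w):
    return 2.0 * (w - 3.0)  # derivative of (w - 3)^2

lr = 0.4
server_w = 10.0

# Volunteer computes a delta against the current snapshot...
stale_delta = -lr * grad(server_w)

# ...but by the time it arrives, the server has taken 3 more steps.
for _ in range(3):
    server_w -= lr * grad(server_w)

fresh_delta = -lr * grad(server_w)  # what the step *should* be now
print(stale_delta, fresh_delta)     # stale delta is ~100x too large
```

Applying the stale delta here would overshoot the optimum badly, which is essentially why loosely synchronized contributions are hard to merge.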
The main reason an arbitrarily distributed set of compute nodes cannot give you good training performance (even with an immodest number of nodes) is that the latency of inter-node communication becomes a massive bottleneck. GPU cloud providers shell out big bucks for ultra-fast intra-datacenter networking via InfiniBand and the like, and the networking gets as much attention as the capabilities of the nodes themselves, sometimes more.
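A back-of-envelope calculation makes the gap concrete. The bandwidth figures below are assumptions, and a real all-reduce is smarter than a naive full exchange, but the orders of magnitude hold:

```python
# Time to exchange one full set of gradients for a 7B-parameter model.
params = 7e9
grad_bytes = params * 2       # fp16 gradients: ~14 GB per sync

home_uplink = 100e6 / 8       # 100 Mbit/s consumer uplink, in bytes/s
infiniband = 400e9 / 8        # 400 Gbit/s InfiniBand, in bytes/s

print(grad_bytes / home_uplink / 60)   # ~18.7 minutes per sync step
print(grad_bytes / infiniband)         # ~0.28 seconds per sync step
```

Training involves hundreds of thousands of such steps, so a ~4000x slowdown per step is fatal for loosely coupled volunteers.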
His estimate is that you could train a LLaMA-7B-scale model for around $82,432 and then fine-tune it for a total of less than $85k. But the fine-tuned LLaMA-like models I've seen were, in my opinion, worse even than GPT-3; more like a GPT-2.5. Not nearly as good as ChatGPT 3.5, and certainly not ChatGPT-beating. Of course, far enough into the future you could certainly run one in the browser for $85k or much less, maybe even $1.
Yeah, you're right. I wrote this a couple of weeks ago at the height of LLaMA hype, but with further experience I don't think the GPT-3 comparisons hold weight.
My biggest problem: I haven't managed to get a great summarization out of a LLaMA derivative that runs on my laptop yet. Maybe I just haven't tried the right model or the right prompt, but that capability feels essential to me for a bunch of different applications.
I still think a LLaMA/Alpaca fine-tuned for the ReAct pattern that can execute additional tools would be a VERY interesting thing to explore.
Yeah, the constant barrage of "THIS IS AS GOOD AS CHATGPT AND IS PRIVATE" screeds from LLaMA-based marketing projects is getting ridiculous. They're not even remotely close in quality. And why would they be?
I want the best LLMs to be open source too, but I'm not delusional enough to make insane claims like the hundreds of GitHub forks out there.
The crazy thing to me is that this means we're approaching being able to have a huge chunk of human knowledge just sitting there locally on your machine. I asked ChatGPT 4 about my old professor and it was able to write a few paragraphs on her including some very specific details. It's like you can fit most of the value of a search engine AND the retrieved pages into a quite small hardware footprint.
I guess companies like OpenAI and Google have no incentive to make models use fewer resources. The compute required, and of course their training data, is their moat.
If you accept that your model knows less about the world (it doesn't have to know about every restaurant in Mexico City or the biography of every soccer player in the world), then you can get away with far fewer parameters and much less training data. You can't query it like an oracle about random things anymore, but you shouldn't do that anyway. It should still be able to do tasks like reformulating texts, judging similarity (by embedding distance), and so on.
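A minimal sketch of the "judging similarity by embedding distance" idea; the vectors below are made-up stand-ins for what a real embedding model would produce:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend embeddings; a real encoder would produce (much longer) vectors:
cat = [0.9, 0.1, 0.2]
kitten = [0.85, 0.15, 0.25]
invoice = [0.1, 0.9, 0.8]

print(cosine_similarity(cat, kitten))   # high, ~0.99: similar concepts
print(cosine_similarity(cat, invoice))  # low, ~0.30: unrelated concepts
```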
As TFA also mentions, you could hook up such a simple language model with something like ReAct to get really good results. I don't see it running in the browser, but a model with a clean license that you can run on-premises on one or two GPUs would be huge for a lot of people!
They intentionally limit the size of the model to reduce inference costs. If deployment were free the models would be much larger. What makes you think they have no incentive?
Hypothesis 1: With better logical thinking (an API call away!), I bet you could train a GPT based on a “small” initial dataset. Why shouldn’t multilingual wikipedia/wiktionary and libgen be enough? That’s what, like less than 10% of the OpenAI training? /s
Hypothesis 2: Data sets of philosophical dialogues could help efficiently develop AI reasoning skills.
Socratic thinking in Plato and Xenophon represented a powerful new mode of critical thinking. Maybe some Student-Teacher-Student template of dialogue could be powerful in developing useful datasets for AI training.
What is the utility of different AI reflective loops for generating training data? (References appreciated if you know any.) One possibility to test is a chain of Analyze, Evaluate, and Apply loops, applied over and over: "analyze the above piece of text, then evaluate it, then apply it to everyday life."
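The loop described above could be sketched roughly like this; `llm` here is a stub standing in for any real chat-completion call, so only the control flow is shown:

```python
# Hypothetical Analyze -> Evaluate -> Apply reflective loop for
# generating dialogue-style training data.

def llm(prompt):
    return f"[model output for: {prompt[:40]}...]"  # stub for a real API

STEPS = [
    "Analyze the following text: {text}",
    "Evaluate this analysis: {text}",
    "Apply the evaluation to everyday life: {text}",
]

def reflective_chain(passage, rounds=2):
    transcript = []
    text = passage
    for _ in range(rounds):
        for template in STEPS:
            text = llm(template.format(text=text))
            transcript.append(text)
    return transcript

dialogue = reflective_chain("No one errs willingly. -- Socrates")
print(len(dialogue))  # 6 generated turns: 2 rounds x 3 steps
```

Each generated turn would then be rated by humans before being kept as training data.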
Now, on HN, many have expressed concern that GPT trained on GPT-to-GPT conversations will result in very misaligned models. Like degradation in a copy machine: do we really want to train the AI on the AI's own output? But, on the other hand, supporting reflective thought may be a good idea in AI (we generally value reflective thought) or a bad idea (maybe the reflection will somehow turn it evil, or at least misaligned).
Design Question: how might we create useful training data through a process of structuring AI-AI dialogue?
“Student-Teacher-Student” conversations seem like a promising mode of dialogue. Previously, I finetuned GPT on the complete works of Plato and was able to generate interesting new dialogues. But the question is whether new dialogues could produce useful data. Perhaps I could use GPT-4 to read a part of Plato and then try to autocomplete another part. Or, as above, use a piece of Platonic dialogue as a target and run an Analyze, Evaluate, Apply chain on it. We could use methods like these over and over again to build a large dataset of philosophical reasoning, with human ratings of the reasonableness of the dialogue output.
If a Socratic structure of thinking could read the complete works of Plato over and over again, commenting, countering and synthesizing— with human oversight (RLHF), perhaps we could develop a small module for philosophical reasoning. It might still need millions of conversations, though. But, perhaps by reflecting philosophically by itself, it could produce a sufficiently large dataset that enabled a sophisticated small model with very open resources.
And, you’d still need the human preference training RLHF to get it to interact well—and I think it also needs some world model.
In any case, I think making smaller and smaller models is a good idea, it sounds fun.
TL;DR
1. AI training has philosophically interesting implications
2. Philosophical reasoning is valuable to develop in AI
3. Good philosophical reasoning might be a key benchmark for small models. These models don’t need to know everything but perhaps they could learn what they don’t know.
4. Reading a lot of Plato over and over could be a great way to train GPT that it doesn’t know a lot.
5. What kind of AI-AI dialogues might produce training data that is useful for training small models?
I have the latter working on a M1 Macbook Air with very good results for what it is. Curious if bloomz.cpp is significantly better or just about the same.
The big problem with AI R&D is that nobody can keep up with the big-money companies, which makes this kind of project feel a bit pointless. Even if you can run a GPT-3 equivalent in a web browser, how many people will bother (except as a stunt) when GPT-4 is available?
If you bought an 8xA100 machine for $140k you would have to run it continuously for over 10,000 hours (about 14 months) to train the 7B model. By that time the value of the A100s you bought would have depreciated substantially; especially because cloud companies will be renting/selling A100s at a discount as they bring H100s online. It might still be worth it, but it's not a home run.
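Checking the arithmetic above (using the thread's assumed figures: ~82,432 A100-hours to train LLaMA-7B, a $140k machine, and the $1/GPU-hour rule of thumb):

```python
gpu_hours = 82_432             # A100-hours reported for LLaMA-7B
machine_hours = gpu_hours / 8  # on a single 8xA100 box

print(machine_hours)                # 10304.0 hours
print(machine_hours / (24 * 30))    # ~14.3 months of continuous use
print(140_000 - gpu_hours * 1.00)   # buying costs ~$57,568 more than
                                    # renting at $1/GPU-hour
```

The purchase only pencils out if you have enough follow-on work to keep the machine busy after the run, which is the point about depreciation.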
A wonderful thing about software development is that there is so much room for creativity that we get huge gaps between cost and value. I'm uncertain whether the average person could do this for $85k, but there is a very significant slice of people who could do it for well under $85k now that the groundwork has been done. This leads to the hilarious paradox where a software based business worth millions could be built on top of code valued around 60k to write.
> This leads to the hilarious paradox where a software based business worth millions could be built on top of code valued around 60k to write.
Or the fact that software based businesses just took a massive hit in value overnight and cannot possibly defend such high valuations anymore.
The value of companies is quickly going to shift from tech moats to brands.
Think Coca-Cola - anyone can create a drink that tastes as good or better than Coke, but it's incredibly hard to compete with the Coca-Cola brand.
Now think what would have happened if Coca-Cola had been super expensive to make, and all of a sudden, in a matter of weeks, it became incredibly cheap.
This is what happened to the saltpeter industry in 1909 when synthetic saltpeter was invented. The whole industry was extinct in a few years.
It's interesting to me that LLaMA-nB models still produce reasonable results after 4-bit quantization of the 32-bit weights. Does this indicate some possibility of reducing the compute required for training?
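For the curious, here is a simplified sketch of what blockwise "absmax" 4-bit quantization does to a group of weights. It's roughly in the spirit of llama.cpp's Q4_0 format, though the real format differs in details; and whether inference-time robustness to low precision transfers to cheaper training is a separate, open question:

```python
def quantize_block(weights):
    # Scale so the largest magnitude maps to the 4-bit range [-7, 7].
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return scale, q            # store one float + 4-bit ints per block

def dequantize_block(scale, q):
    return [scale * v for v in q]

block = [0.12, -0.53, 0.31, 0.02, -0.88, 0.44, 0.09, -0.17]
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(block, restored))

print(q)        # small integers in [-7, 7]
print(max_err)  # reconstruction error bounded by ~scale/2
```

The storage win is what matters: ~4.5 bits per weight (including the per-block scale) instead of 32.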
Alpaca uses knowledge distillation (it's trained on outputs from OpenAI models).
It's something to keep in mind. You're teaching your model to copy another model's outputs.
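A sketch of the core of knowledge distillation: the student is trained against the teacher's temperature-softened output distribution rather than hard labels. The logits below are invented for illustration:

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Cross-entropy of the student against the teacher's softened targets.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

teacher = [4.0, 1.0, 0.5]   # confident teacher
close = [3.5, 1.2, 0.4]     # student that roughly agrees
far = [0.2, 3.0, 2.0]       # student that disagrees

print(distillation_loss(close, teacher))  # lower loss
print(distillation_loss(far, teacher))    # higher loss
```

The temperature exposes the teacher's "dark knowledge" about relative class probabilities, which is why imitating outputs works better than it might seem.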
The WebGPU demo mentioned in this post is insane. It blows any WASM approach out of the water. Unfortunately, that performance isn't available anywhere but Chrome Canary (behind a flag).
I was a bit skeptical about loading a _4GB_ model at first. Then I double-checked: Firefox is using about 5GB of memory for me. My current open tabs are mail, calendar, a couple Google Docs, two Arxiv papers, two blog posts, two Youtube videos, milvus.io documentation, and chat.openai.com.
A lot of applications and developers these days take memory management for granted, so embedding a 4GB model to significantly enhance coding and writing capabilities doesn't seem too far-fetched.
WebGPU is going to be a major component in this. The modern GPUs prevalent in mobile devices, desktops, and laptops are more than enough to do all of this client-side.
> My friends at Replicate told me that a simple rule of thumb for A100 cloud costs is $1/hour.
AWS charges $32/hr for an 8xA100 instance (p4d.24xlarge), which comes out to $4/hour/GPU. Yes, you can get lower pricing with a 3-year reservation, but that's not what this question is asking.
You also need 256 nodes to be colocated on the same fabric -- which AWS will do for you but only if you reserve for years.
AWS certainly isn't the cheapest for this; did they mention using AWS? Lambda Labs is $12/hr for 8xA100s, and others are relatively close to this price on demand. I assume you can get a better deal if you contact them for a large project.
Replicate themselves rent out GPU time so I assume they would definitely know as that's almost certainly the core of their business.
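Scaling the run across the quoted rates (assuming the ~82,432 A100-hours behind the article's $82,432-at-$1/hour figure):

```python
gpu_hours = 82_432
rates = {
    "replicate_rule_of_thumb": 1.00,  # $1/GPU-hour
    "lambda_labs": 12 / 8,            # $12/hr for an 8xA100 machine
    "aws_p4d_on_demand": 32 / 8,      # $32/hr for an 8xA100 machine
}
for name, rate in rates.items():
    print(f"{name}: ${gpu_hours * rate:,.0f}")
```

So the $85k estimate is sensitive to the provider: the same run costs roughly $124k at Lambda Labs' list price and about $330k at AWS on-demand rates.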
Everyone seems to assume that all the “tricks” behind training ChatGPT are known. The only clues are in papers from ClosedAI like the InstructGPT paper. So we assume there is Supervised Fine Tuning, then Reward Modeling and finally RLHF.
But there are most likely other tricks that ClosedAI has not published. These probably took years of R&D to come up with, others trying to replicate ChatGPT would need to come up with these tricks on their own.
Also curiously the app was released in late 2022 while the knowledge cutoff is 2021 — I was curious why that might be, and one hypothesis I had was that it may have been because they wanted to keep the training data fixed while they iterated on numerous methods, hyperparameter tuning etc. All of these are unfortunately a defensive moat that ClosedAI has.
Training a ChatGPT-beating model for much less than $85,000 is entirely feasible. At CentML, we're actively working on model training and inference optimization without affecting accuracy, which can help reduce costs and make such ambitious projects realistic. By maximizing (>90%) GPU and platform hardware utilization, we aim to bring down the expenses associated with large-scale models, making them more accessible for various applications. Additionally, our solutions also have a positive environmental impact, addressing the excess CO2 concerns. If you're interested in learning more about how we are doing it, please reach out via our website: https://centml.ai
What we need is a RETRO-style model where, after the input, you go through a small net that fetches the desired set of weights from a server (serving data without compute is dirt cheap), which is then executed locally. We'll get there eventually.
Can anyone explain, or link some resource on, why none of these big GPT models incorporate anything RETRO-style? I'm only very superficially following ML developments, and I was so hyped by RETRO, yet none of the modern world-changing models apply it.
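A loose toy sketch of the RETRO-style idea: keep knowledge in a cheap external store and fetch only the relevant chunks at inference time. The word-overlap "embedding" below is a stand-in for a trained encoder plus a nearest-neighbour index, and the store contents are invented:

```python
STORE = [
    "The Haber process fixes atmospheric nitrogen into ammonia.",
    "LLaMA-7B was trained on roughly one trillion tokens.",
    "Folding@home distributes protein simulations to volunteers.",
]

def tokens(text):
    return set(text.lower().replace(".", "").replace("?", "").split())

def retrieve(query, k=1):
    # Rank stored chunks by (toy) similarity to the query.
    q = tokens(query)
    return sorted(STORE, key=lambda doc: len(q & tokens(doc)), reverse=True)[:k]

# The retrieved chunk would be prepended to the model's context window:
context = retrieve("how was llama trained?")[0]
print(context)  # -> the LLaMA-7B chunk
```

The appeal is exactly the one stated above: the store is cheap static data, so only the small language model itself needs local compute.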
[0] https://en.wikipedia.org/wiki/Embarrassingly_parallel
https://learning-at-home.github.io/
https://training-transformers-together.github.io/
https://arxiv.org/abs/2002.04013
But I’d love to see more federated/distributed learning platforms.
https://boinc.berkeley.edu/projects.php
[ ReAct: https://til.simonwillison.net/llms/python-react-pattern ]
Also, you can finetune LLaMA-7B on a 3090 for about $3 using LoRA.
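A back-of-the-napkin illustration of why LoRA is so cheap: the full weight matrix W stays frozen, and you train only a low-rank pair B, A. Sizes below are toy values, and this is not any particular library's API:

```python
# LoRA idea: effective weight W' = W + B @ A, with B (d x r), A (r x d),
# rank r << d, so only B and A are trained.

d, r = 6, 1  # toy sizes; real models use d in the thousands, r ~ 8-64

W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
B = [[0.1] for _ in range(d)]  # d x r, trainable
A = [[0.2] * d]                # r x d, trainable

full_params = d * d
lora_params = d * r + r * d
print(full_params, lora_params)  # 36 vs 12 trainable values

# Effective weight seen at inference: W' = W + B @ A
W_eff = [[W[i][j] + sum(B[i][k] * A[k][j] for k in range(r))
          for j in range(d)] for i in range(d)]
print(W_eff[0][0])  # ~1.02 (= 1.0 + 0.1 * 0.2)
```

At realistic sizes (d = 4096, r = 8) that's ~16.8M frozen values versus ~65k trainable ones per matrix, which is why a single consumer GPU suffices for fine-tuning.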
4xA100 is 75k, 8 is 140k https://shop.lambdalabs.com/deep-learning/servers/hyperplane...
Ignoring the operational costs of on-prem hardware is pretty common, but those costs are significant and can greatly change the calculation.
Looks like that choice makes it more difficult to adopt, trust, or collaborate on the new tech.
What are the benefits? Is there more to that than competitive advantage? If not, ClosedAI sounds more accurate.
That could do really well via crowd funding with the right spin/marketing behind it.
Their writeup makes it sound like, net, 2x+ over Alpaca, and that's an early run.
The browser side is interesting too. Browser JS VMs have a memory cap of 1GB, so that may ultimately be the bottleneck here...
Which itself was trained on human outputs to do the same thing.
Very soon it will be full Ouroboros as humans use the model's output to finetune themselves.
That's a time honoured tradition in ML, invented by the father of the field himself, Geoffrey Hinton, in 2015.
> Distilling the Knowledge in a Neural Network
https://arxiv.org/abs/1503.02531
8xA100 @ 40GB for $8/hr.
Replicate friend isn't far off.