This is neat. Model weights are split into their layers and distributed across several machines, which then register themselves in a distributed hash table when they are ready to perform inference or fine-tuning "as a team" over their subset of the layers.
It's early, but I've been working on hosting model weights in a Docker registry for https://github.com/jmorganca/ollama. Mainly for the content addressability (Ollama will verify the correct weights are downloaded every time), and ultimately weights can be fetched by their content instead of by their name or URL (which may change!). Perhaps a good next step might be to split the models by layers and store each layer independently for use cases like this (or even just for downloading + running larger models over several "local" machines).
Ah, is it possible to tone down the self-promotion? I've been seeing your comments for ollama on many LLM-related posts here.
> Please don't use HN primarily for promotion. It's ok to post your own stuff part of the time, but the primary use of the site should be for curiosity.
Surely in this case it would've been possible to comment about OP's work while leaving out the free backlink to your project. Just my 0.02
Fine-tuning 70B is not just hard, it's literally impossible without renting a very expensive cloud instance or buying a PC the price of a house, no matter how long you are willing to wait. I would absolutely contribute to a "llama training horde".
That's true for conventional fine-tuning, but is it the case for parameter-efficient fine-tuning and QLoRA? My understanding is that for an N-billion-parameter model, fine-tuning can occur on a GPU with slightly less than N gigabytes of VRAM.
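As a rough sanity check on that rule of thumb, here's a back-of-envelope sketch (the byte counts and the flat overhead figure are my own assumptions, not numbers from the QLoRA authors):

```python
def qlora_vram_estimate_gb(n_params_billion: float) -> float:
    """Rough VRAM estimate for QLoRA fine-tuning (back-of-envelope only).

    - 4-bit quantized base weights: ~0.5 bytes per parameter
    - LoRA adapters + their optimizer state: a tiny trainable fraction
    - Activations, CUDA context, etc.: a flat allowance
    """
    base_weights = n_params_billion * 0.5        # GB, 4-bit base model
    adapters = n_params_billion * 0.5 * 0.05     # GB, assumed ~5% extra
    overhead = 4.0                               # GB, assumed flat overhead
    return base_weights + adapters + overhead

# A 70B model lands around 40 GB by this estimate -- comfortably under
# the "N gigabytes for N billion parameters" ceiling mentioned above.
print(qlora_vram_estimate_gb(70))
```

By this estimate the rule of thumb is conservative for 4-bit quantization; the "slightly less than N gigabytes" figure corresponds more closely to 8-bit (one byte per parameter) loading.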
Fine-tuning in a distributed way over a questionable network would be a lot more energy- and cost-inefficient than doing it on a single node or a well-connected cluster. Also, you can fine-tune a 70B model on a million tokens for ~$2 on Lambda Cloud or under $10 on Replicate.
What prevents parallel LLM training? If you read book 1 first and then book 2, the resulting update to your knowledge will be the same as if you read the books in the reverse order. It seems reasonable to assume that if an LLM is trained on each book independently, the two deltas in the LLM weights can just be added up.
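For a single gradient step this intuition is exactly right, and it's what data parallelism exploits; what doesn't hold in general is that whole training runs commute, since each step starts from weights that earlier steps already moved. A toy sketch of both halves (my own made-up numbers, a one-parameter least-squares model standing in for an LLM):

```python
def grad(w, shard):
    """Gradient of sum((w*x - y)^2) over one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard)

book1 = [(1.0, 2.0), (2.0, 4.0)]
book2 = [(3.0, 5.0), (4.0, 9.0)]

w = 0.5
# For ONE step, "deltas add": the gradient over everything equals the
# sum of per-book gradients. This part parallelizes perfectly.
g_full = grad(w, book1 + book2)
g_sum = grad(w, book1) + grad(w, book2)
assert g_full == g_sum

# But sequences of steps don't commute: two steps in each order land
# on (slightly) different weights, because step 2 is evaluated at the
# weights step 1 produced.
lr = 0.001
def two_steps(w, first, second):
    w = w - lr * grad(w, first)
    w = w - lr * grad(w, second)
    return w

print(two_steps(0.5, book1, book2), two_steps(0.5, book2, book1))
```

With a non-linear network the order-dependence is much stronger, which is why naively adding up independently trained weight deltas doesn't reproduce joint training.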
Are trained LLMs composable in any way? Like, if you and I trust 99% of the same data but each have 1% where we disagree, must we have two entirely separate models, or can we pool compute in the 99% case (along with the others who agree) and then create a derivative model for ourselves which covers the differences in our trust models?
I have only a rudimentary understanding of neural nets but it doesn't seem crazy that the weights could be manipulated in such a way while preserving the utility of the model.
I ask because I think it would be useful to know which statements two LLMs of equal power agree on and which they disagree on. You could then map that backwards to differences in their training data (only feasible if the differences are small).
If instead two LLMs of equal power represent a missed opportunity to have one of greater power, and the disagreement analysis is prohibitively expensive to do, then that's a bit of a different world.
They're not composable in the sense that you can take these adaptation layers and arbitrarily combine them, but training different models while sharing a common base of weights is a solved problem.
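A hand-rolled sketch of that idea (LoRA-style low-rank deltas over a frozen base, written out by hand here, not any particular library's API):

```python
# Everyone shares one frozen weight matrix W; each fine-tune stores only
# a tiny low-rank delta A @ B that is added on at inference time.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add(a, b):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen shared base (2x2 toy)

# Two independent fine-tunes, each a rank-1 delta: two small factors
# instead of a full matrix. These are all each party has to store/ship.
adapter_you = matmul([[1.0], [0.0]], [[0.0, 0.5]])   # A @ B
adapter_me  = matmul([[0.0], [1.0]], [[0.5, 0.0]])

W_you = add(W, adapter_you)
W_me  = add(W, adapter_me)

x = [[2.0, 3.0]]
print(matmul(x, W_you))  # -> [[2.0, 4.0]]  your model's output
print(matmul(x, W_me))   # -> [[3.5, 3.0]]  mine: same base, different behavior
```

This is what makes the "99% shared" scenario practical: the expensive shared base is trained (and in Petals' case, hosted) once, and each party's 1% disagreement lives in a delta that is orders of magnitude smaller.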
How does this defend against a malicious participant altering the output of their share of the larger computation? Even without some kind of method for e.g. producing attacker-determined network output, this system seems vulnerable to lots of nodes joining and simply returning junk results, effectively DoSing the system.
Hi, a Petals dev here. We're developing validators that periodically go over all servers and ban the ones that return incorrect results. Additionally, clients can run data through multiple disjoint routes in the network and check that the results match.
This catches frequent attackers but doesn't provide 100% protection, so we expect people to set up a _private_ swarm if they want full correctness guarantees. For example, if you don't have enough GPUs to run an LLM yourself but have some hardware owners you trust, you can set up a private Petals swarm and jointly run the LLM on geo-distributed hardware to process your data.
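The disjoint-route check might be sketched like this (a conceptual toy of my own, not Petals' actual code): run the same input through two routes that share no servers; honest routes agree, so any mismatch means at least one server on one route is faulty or malicious.

```python
def run_route(route, x):
    """Each server applies its slice of layers; here a stand-in per server."""
    for server in route:
        x = server(x)
    return x

honest = lambda x: x * 2          # stand-in for a correctly computed shard
malicious = lambda x: x * 2 + 1   # returns subtly wrong results

route_a = [honest, honest]        # two routes over disjoint server sets
route_b = [honest, malicious]

out_a = run_route(route_a, 3)
out_b = run_route(route_b, 3)
if out_a != out_b:
    print("mismatch: at least one route contains a faulty server")
```

Note the check only localizes the fault to a route, not to a server; narrowing it down takes more probes, which is presumably what the periodic validators are for.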
The first question I had was "what are the economics?" From the FAQ:
> Will Petals incentives be based on crypto, blockchain, etc.?
> No, we are working on a centralized incentive system similar to the AI Horde kudos, even though Petals is a fully decentralized system in all other aspects. We do not plan to provide a service to exchange these points for money, so you should see these incentives as "game" points designed to be spent inside our system.
> Petals is an ML-focused project designed for ML researchers and engineers, it does not have anything to do with finance. We decided to make the incentive system centralized because it is much easier to develop and maintain, so we can focus on developing features useful for ML researchers.
Similarly, there have been distributed render farms for graphic design for a long time. There are no incentives other than points; more points means your jobs are prioritized.
>What's the motivation for people to host model layers in the public swarm?
>People who run inference and fine-tuning themselves get a certain speedup if they host a part of the model locally. Some may be also motivated to "give back" to the community helping them to run the model (similarly to how BitTorrent users help others by sharing data they have already downloaded).
>Since it may be not enough for everyone, we are also working on introducing explicit incentives ("bloom points") for people donating their GPU time to the public swarm. Once this system is ready, we will display the top contributors on our website. People who earned these points will be able to spend them on inference/fine-tuning with higher priority or increased security guarantees, or (maybe) exchange them for other rewards.
It does seem like they want a sort of centralized token, however.
The logical conclusion is that they (the models) will eventually be linked to crypto payments though. This is where Lightning becomes important...
Edit: To clarify, I'm not suggesting linking these Petal "tokens" to any payment system. I'm talking about, in general, calls to clusters of machine learning models, decentralized or not, will likely use crypto payments because it gives you auth and a means of payment.
I do think Petal is a good implementation of using decentralized compute for model use and will likely be valuable long term.
You can host your own swarm of servers apparently [0].
I would be curious to see a ballpark estimate of the fine-tuning performance of a "private" Petals cluster.
I’ve always thought crowdsourcing is the future. Crowdsourcing information or compute. The fact is we have the “resources” already. It’s a matter of deployment.
So given that GGML can serve ~100 tok/s on an M2 Max, and this thing advertises 6 tok/s distributed, is this basically for people with lower-end devices?
It's talking about 70B and 160B models. Can GGML run those that fast even heavily quantized? (I'm guessing possibly.) So maybe this is for people who don't have a high-end computer? I have a decent Linux laptop a couple of years old, and there's no way I could run those models that fast. I get a few tokens per second on a quantized 7B model.
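One way to sanity-check these numbers: single-stream decoding is roughly memory-bandwidth-bound, since every generated token has to stream (approximately) all the weights through memory once. A back-of-envelope sketch (the bandwidth and bytes-per-parameter figures are my assumptions):

```python
def tokens_per_sec(bandwidth_gb_s: float, n_params_billion: float,
                   bytes_per_param: float) -> float:
    """Crude bandwidth-bound decode estimate: speed ~ bandwidth / model size.
    Ignores KV-cache traffic and compute, so treat it as an upper bound."""
    model_gb = n_params_billion * bytes_per_param
    return bandwidth_gb_s / model_gb

# M2 Max: ~400 GB/s unified memory bandwidth (assumed).
# ~0.55 bytes/param approximates a 4-bit quant plus overhead (assumed).
print(tokens_per_sec(400, 7, 0.55))    # 7B  -> roughly 100 tok/s
print(tokens_per_sec(400, 70, 0.55))   # 70B -> roughly 10 tok/s
```

By this estimate the "100 tok/s" figure is plausible for a quantized 7B, while a quantized 70B on the same machine would land around 10 tok/s, so 6 tok/s distributed is in the same ballpark as local decoding of the big models, when you can fit them at all.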
Am I the only one that really, really hates pages like Google Colab? I never know what is going on there. Is it free? Is it running on my machine, or is it running on Google's cloud? If the latter, again, is it really free?!
Also, every time I give it a try, I only get some kind of error at the end.
Edit: Here we go. Literally the first line that it wanted to execute: "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 4.24.3 which is incompatible."
I love this direction. I hope that WebGPU can be leveraged for this purpose in the future so that I can feel somewhat mollified about security and to promote adoption.
Cool service. It's worth noting that, with quantization/QLoRA, models as big as llama2-70b can be run on consumer hardware (2x RTX 3090) at acceptable speeds (~20 t/s) using frameworks like llama.cpp. Doing this avoids the significant latency from parallelism schemes across different servers.
P.S. From experience instruct-finetuning falcon180b, it's not worth using over llama2-70b, as it's significantly undertrained.
Hi, a Petals dev here. You're right, there's no point in using Petals if your machine has enough GPU memory to fit the model and you're okay with the quantization quality.
We developed Petals for people who have less GPU memory than needed. Also, there's still a chance of larger open models being released in the future.
AFAIK you cannot train a 70B model on 2x 3090, even with GPTQ/QLoRA.
And the inference is pretty inefficient. Pooling the hardware would achieve much better GPU utilization and (theoretically) faster responses for the host's requests
So how long until "tokens" are used to pay for GPU cycles? People will stop "mining" and just donate their GPU cycles for distributed LLM usage...
In fact, if they did this so that it followed the sun, so that the vast majority of it was powered by daytime solar PV energy, I wouldn't even be upset by that.
Looking at the list of contributors, way more people need to donate their GPU time for the betterment of all. Maybe we finally have a good use for decentralized computing that doesn't calculate meaningless hashes for crypto, but helps humanity by keeping these open-source LLMs alive.
It can cost a lot to run a GPU, especially at full load. The 4090 stock pulls 500 watts under full load [0], which is 12 kWh/day, or just under 4,380 kWh a year, or over $450 a year assuming average residential rates of $0.10-$0.11/kWh. The only variable is whether training requires the same power draw as hitting it with FurMark.
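That arithmetic, spelled out (the wattage and electricity rate are the figures above, and assume the card runs flat-out 24/7):

```python
# Annual electricity cost of a GPU at sustained full load.
watts = 500
kwh_per_day = watts * 24 / 1000    # 12.0 kWh/day
kwh_per_year = kwh_per_day * 365   # 4380.0 kWh/year
cost = kwh_per_year * 0.105        # at $0.105/kWh -> ~$460/year
print(kwh_per_day, kwh_per_year, round(cost))
```

In practice inference workloads are often memory-bound rather than pegged at the FurMark-style worst case, so the real bill likely lands below this ceiling.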
I immediately wanted to contribute, and it's quite difficult to find the link on the homepage! The "contribute" button should not be a tiny text link that says "help hosting" in the footer; it should be a big button next to the Colab button.
https://github.com/bigscience-workshop/petals/wiki/FAQ:-Freq...
What they are referencing, which is super cool and (IMO) criminally underused:
https://lite.koboldai.net/
https://tinybots.net/artbot
https://aihorde.net/
In fact, I can host a 13B-70B finetune in the afternoon if anyone on HN wants to test a particular one out:
https://huggingface.co/models?sort=modified&search=70B+gguf
https://www.sheepit-renderfarm.com/home
[0] https://github.com/bigscience-workshop/petals/wiki/Launch-yo...
The Petals part was abstracted away from me. I had a normal experience writing code.
I don't have the project listed anywhere. Don't really know what happened to it. But, it was mainly some five or so guys spearheading the thing.
0: https://youtu.be/j9vC9NBL8zo?t=983
Human: what is the time?
The time is 12:30 PM.
Human: are you sure?
Yes, I am sure. The time is 12:30 PM.^</s>^<s> I'm a young {...}