Run LLMs at home, BitTorrent‑style

[+] jmorgan|2 years ago|reply

This is neat. Model weights are split into their layers and distributed across several machines, which then report themselves in a big hash table when they are ready to perform inference or fine-tuning "as a team" over their subset of the layers.

It's early, but I've been working on hosting model weights in a Docker registry for https://github.com/jmorganca/ollama. It's mainly for the content addressability (Ollama will verify the correct weights are downloaded every time), and ultimately weights can be fetched by their content instead of by their name or URL (which may change!). Perhaps a good next step might be to split the models by layers and store each layer independently for use cases like this (or even just for downloading + running larger models over several "local" machines).
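As a rough illustration of what content addressability buys you, here is a minimal sketch of a store that keys each blob by the SHA-256 of its bytes and re-checks the hash on every read. The `BlobStore` class, its methods, and the placeholder layer bytes are all hypothetical, made up for illustration; this is not how Ollama or a Docker registry is actually implemented.

```python
import hashlib


class BlobStore:
    """Toy in-memory store (hypothetical) that keys blobs by their SHA-256."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The key is derived from the content, so the same bytes always
        # get the same address no matter what the file is named.
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data
        return digest

    def get(self, digest: str) -> bytes:
        data = self._blobs[digest]
        # Re-hash on read: corrupted or swapped weights are detected here.
        actual = "sha256:" + hashlib.sha256(data).hexdigest()
        if actual != digest:
            raise ValueError(f"digest mismatch: wanted {digest}, got {actual}")
        return data


store = BlobStore()
# Storing each layer separately, as suggested above (placeholder bytes).
layer_digests = [
    store.put(b"layer-0 weights"),
    store.put(b"layer-1 weights"),
]
```

With per-layer digests, a machine that only serves some of the layers would only need to fetch and verify those blobs.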

[+] mkii|2 years ago|reply

Ah, is it possible to tone down the self-promotion? I've been seeing your comments for ollama on many LLM-related posts here.

> Please don't use HN primarily for promotion. It's ok to post your own stuff part of the time, but the primary use of the site should be for curiosity.

Surely in this case it would've been possible to comment about OP's work while leaving out the free backlink to your project. Just my 0.02

[+] brucethemoose2|2 years ago|reply

> and fine‑tune them for your tasks

This is the part that raised my eyebrows.

Finetuning 70B is not just hard, it's literally impossible without renting a very expensive cloud instance or buying a PC the price of a house, no matter how long you are willing to wait. I would absolutely contribute to a "llama training horde".

[+] AaronFriel|2 years ago|reply

That's true for conventional fine-tuning, but is it the case for parameter-efficient fine-tuning and QLoRA? My understanding is that for an N-billion-parameter model, fine-tuning can be done on a GPU with slightly less than N gigabytes of VRAM.

For that 70B parameter model: an A100?
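A back-of-the-envelope sketch of the weight memory alone, assuming the frozen base weights dominate and ignoring activations, adapter weights, optimizer state, and quantization metadata (so the real footprint is higher than these numbers). The function name is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# 70B parameters at a few common precisions (assumed values).
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(70e9, bits):.0f} GB")
```

Under these assumptions, 4-bit weights for a 70B model come to about 35 GB before any training overhead.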

[+] Zetobal|2 years ago|reply

An H100 is maybe a car but not nearly close to a house...

[+] YetAnotherNick|2 years ago|reply

Finetuning in a distributed way over a questionable network would be a lot more energy- and cost-inefficient than doing it on a single node or a well-connected cluster. Also, you can finetune a 70B model on a million tokens for $2 on Lambda Cloud or <$10 on Replicate.

[+] akomtu|2 years ago|reply

What prevents parallel LLM training? If you read book 1 first and then book 2, the resulting update in your knowledge will be the same as if you read the books in the reverse order. It seems reasonable to assume that if an LLM is trained on each book independently, the two deltas in the LLM weights can just be added up.
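One way to probe this assumption is a toy experiment: a one-parameter model trained with plain gradient descent on a squared error, where each "book" is a single (x, y) pair. Everything here (the model, the learning rate, the data) is an assumption chosen for illustration, not how an LLM is trained. Because each gradient depends on the current weight, the three procedures below end up at different weights in this toy case.

```python
LEARNING_RATE = 0.1  # assumed value for the toy example


def sgd_step(w: float, x: float, y: float) -> float:
    """One gradient step on the loss (w * x - y) ** 2."""
    grad = 2.0 * (w * x - y) * x
    return w - LEARNING_RATE * grad


w0 = 0.0
book_1 = (1.0, 1.0)
book_2 = (2.0, 1.0)

# Sequential training in both orders.
w_12 = sgd_step(sgd_step(w0, *book_1), *book_2)
w_21 = sgd_step(sgd_step(w0, *book_2), *book_1)

# Independent training from the same start, then adding the two deltas.
delta_1 = sgd_step(w0, *book_1) - w0
delta_2 = sgd_step(w0, *book_2) - w0
w_sum = w0 + delta_1 + delta_2

print(w_12, w_21, w_sum)
```

Summing deltas is closer to what data-parallel training does within a single step (averaging gradients computed at the same weights), which requires synchronizing after every step rather than once per book.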

[+] pavelstoev|2 years ago|reply

You can finetune 40B falcon on 4 x A10 with compiler optimization technology from CentML. No changes to the model.

[+] malwrar|2 years ago|reply

Impossible? It’s just a bunch of math, you don’t need to keep the entire network in memory the whole time.

[+] __MatrixMan__|2 years ago|reply

Are trained LLMs composable in any way? Like if you and I trust 99% of the same data, but each have 1% where we disagree, must we have two entirely separate models, or can we pool compute in the 99% case (along with the others who agree) and then create a derivative model for ourselves which covers for the differences in our trust models?

I have only a rudimentary understanding of neural nets but it doesn't seem crazy that the weights could be manipulated in such a way while preserving the utility of the model.

I ask because I think it would be useful to know which statements two LLMs of equal power agree on and which they disagree on. You could then map that backwards to differences in their training data (only feasible if the differences are small).

If instead two LLMs of equal power represent a missed opportunity to have one of greater power, and the disagreement analysis is prohibitively expensive to do, then that's a bit of a different world.

[+] hnfong|2 years ago|reply

Somewhat yes. See "LoRA": https://arxiv.org/abs/2106.09685

They're not composable in the sense that you can take these adaptation layers and arbitrarily combine them, but training different models while sharing a common base of weights is a solved problem.
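A minimal sketch of the idea behind LoRA, using tiny hand-written matrices in plain Python: the base weight W stays frozen and shared, and each fine-tune only stores a low-rank pair (B, A) so that its effective weight is W + B·A. The helper names and the numbers are made up for illustration; real implementations work on tensors and include a scaling factor and training loop.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply_lora(base, lora_b, lora_a):
    """Return base + lora_b @ lora_a without modifying base."""
    delta = matmul(lora_b, lora_a)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(base, delta)]


# Shared, frozen base weights (2x2 for readability).
base = [[1.0, 0.0], [0.0, 1.0]]

# Two independently trained rank-1 adapters, each stored as (B, A).
adapter_one = ([[1.0], [0.0]], [[0.0, 0.5]])
adapter_two = ([[0.0], [1.0]], [[0.25, 0.0]])

weights_one = apply_lora(base, *adapter_one)
weights_two = apply_lora(base, *adapter_two)
```

Both derived models reuse the same `base`; only the small adapter matrices differ between them.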

[+] esafak|2 years ago|reply

This is called ensembling. https://blog.allenai.org/llm-blender-a-simple-ensemble-learn...

[+] malwrar|2 years ago|reply

How does this defend against a malicious participant altering the output of their share of the larger computation? Even without some kind of method for e.g. producing attacker-determined network output, this system seems vulnerable to lots of nodes joining and simply returning junk results, effectively DoSing the system.

[+] borzunov|2 years ago|reply

Hi, a Petals dev here. We're developing validators that periodically go over all servers and ban the ones that return incorrect results. Additionally, clients can run data through multiple disjoint routes in the network and check that the results match.

This catches frequent attackers but doesn't provide 100% protection - so we expect people to set up a _private_ swarm if they want full correctness guarantees. For example, if you don't have enough GPUs to run an LLM yourself but know some hardware owners you trust, you can set up a private Petals swarm and jointly run the LLM on geo-distributed hardware to process your data.
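The client-side check described above (sending the same input through multiple disjoint routes and comparing results) could be sketched roughly as follows. This is not the Petals API: the `Server` type, `run_route`, `routes_agree`, and the toy "blocks" are all hypothetical stand-ins, with each server modeled as a plain function over a list of floats.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in: a server applies its share of the layers.
Server = Callable[[List[float]], List[float]]


def run_route(route: Sequence[Server], hidden: List[float]) -> List[float]:
    """Pass the hidden state through every server on one route, in order."""
    for server in route:
        hidden = server(hidden)
    return hidden


def routes_agree(
    routes: Sequence[Sequence[Server]], hidden: List[float], tol: float = 1e-4
) -> bool:
    """True if all routes produce the same output within a tolerance."""
    outputs = [run_route(route, list(hidden)) for route in routes]
    reference = outputs[0]
    return all(
        len(out) == len(reference)
        and all(abs(a - b) <= tol for a, b in zip(reference, out))
        for out in outputs[1:]
    )


def honest_block(hidden: List[float]) -> List[float]:
    return [2.0 * v + 1.0 for v in hidden]


def faulty_block(hidden: List[float]) -> List[float]:
    # Simulates a server returning junk.
    return [0.0 for _ in hidden]


good = routes_agree([[honest_block, honest_block], [honest_block, honest_block]], [1.0, 2.0])
bad = routes_agree([[honest_block, honest_block], [honest_block, faulty_block]], [1.0, 2.0])
```

A mismatch only says that some route is wrong, not which server; with three or more routes a client could take a majority vote, at the cost of proportionally more compute.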

[+] esafak|2 years ago|reply

The first question I had was "what are the economics?" From the FAQ:

Will Petals incentives be based on crypto, blockchain, etc.?

  No, we are working on a centralized incentive system similar to the AI Horde kudos, even though Petals is a fully decentralized system in all other aspects. We do not plan to provide a service to exchange these points for money, so you should see these incentives as "game" points designed to be spent inside our system.

  Petals is an ML-focused project designed for ML researchers and engineers, it does not have anything to do with finance. We decided to make the incentive system centralized because it is much easier to develop and maintain, so we can focus on developing features useful for ML researchers.

https://github.com/bigscience-workshop/petals/wiki/FAQ:-Freq...

[+] brucethemoose2|2 years ago|reply

> similar to the AI Horde kudos

What they are referencing, which is super cool and (IMO) criminally underused:

https://lite.koboldai.net/

https://tinybots.net/artbot

https://aihorde.net/

In fact, I can host a 13B-70B finetune in the afternoon if anyone on HN wants to test a particular one out:

https://huggingface.co/models?sort=modified&search=70B+gguf

[+] sn0wf1re|2 years ago|reply

Similarly, there have been distributed render farms for graphic design for a long time. There are no incentives other than points: more points means your jobs are prioritized.

https://www.sheepit-renderfarm.com/home

[+] beardog|2 years ago|reply

>What's the motivation for people to host model layers in the public swarm?

>People who run inference and fine-tuning themselves get a certain speedup if they host a part of the model locally. Some may be also motivated to "give back" to the community helping them to run the model (similarly to how BitTorrent users help others by sharing data they have already downloaded).

>Since it may be not enough for everyone, we are also working on introducing explicit incentives ("bloom points") for people donating their GPU time to the public swarm. Once this system is ready, we will display the top contributors on our website. People who earned these points will be able to spend them on inference/fine-tuning with higher priority or increased security guarantees, or (maybe) exchange them for other rewards.

It does seem like they want a sort of centralized token however.

[+] seydor|2 years ago|reply

It's a shame that every decentralized project needs to be compared to cryptocoins now

[+] kordlessagain|2 years ago|reply

The logical conclusion is that they (the models) will eventually be linked to crypto payments though. This is where Lightning becomes important...

Edit: To clarify, I'm not suggesting linking these Petals "tokens" to any payment system. I'm saying that, in general, calls to clusters of machine learning models, decentralized or not, will likely use crypto payments because that gives you auth and a means of payment.

I do think Petals is a good implementation of using decentralized compute for model use and will likely be valuable long term.

[+] Szpadel|2 years ago|reply

If that part could be replaced with any third-party server, it would be the tracker in the BitTorrent analogy.

[+] nextaccountic|2 years ago|reply

Can they actually prevent people from trading petals for money though?

[+] teaearlgraycold|2 years ago|reply

Would love to share my 3080 Ti, but after running the commands in the getting started guide (https://github.com/bigscience-workshop/petals/wiki/Run-Petal...) it looks like there's a dependency versioning issue:

    ImportError: cannot import name 'get_full_repo_name' from 'huggingface_hub' (~/.local/lib/python3.8/site-packages/huggingface_hub/__init__.py)

[+] timost|2 years ago|reply

You can host your own swarm of servers, apparently [0]. I would be curious to have a ballpark estimate of the fine-tuning performance of a "private" Petals cluster.

[0] https://github.com/bigscience-workshop/petals/wiki/Launch-yo...

[+] 0x008|2 years ago|reply

I think if you run a cluster in a trusted environment, it should be more efficient to use Ray or something similar.

[+] nico|2 years ago|reply

This is so cool. Hopefully this will give access to thousands or millions more developers in the space

[+] thathndude|2 years ago|reply

I’ve always thought crowdsourcing is the future. Crowdsourcing information or compute. The fact is we have the “resources” already. It’s a matter of deployment.

[+] __rito__|2 years ago|reply

I used Petals on a past project. I shared my GPU and also wrote code for the project.

The Petals part was abstracted away from me. I had a normal experience writing code.

I don't have the project listed anywhere. Don't really know what happened to it. But, it was mainly some five or so guys spearheading the thing.

[+] swyx|2 years ago|reply

so given that GGML can serve like 100 tok/s on an M2 Max, and this thing advertises 6 tok/s distributed, is this basically for people with lower end devices?

[+] version_five|2 years ago|reply

It's talking about 70B and 160B models. Even heavily quantized, can ggml run those that fast? (I'm guessing possibly.) So maybe this is for people that don't have a high-end computer? I have a decent Linux laptop a couple of years old and there's no way I could run those models that fast. I get a few tokens per second on a quantized 7B model.

[+] russellbeattie|2 years ago|reply

> ...lower end devices

So, pretty much every other consumer PC available? Those losers.

[+] Double_a_92|2 years ago|reply

Am I the only one that really really hates pages like Google Colab? I never know what is going on there. Is it free? Is it running on my machine, or is it running on Google's cloud? If the latter, again, is it really free?!

Also, every time I do give it a try, I only get some kind of error at the end.

Edit: Here we go. Literally the first line that it wanted to execute: "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 4.24.3 which is incompatible."

[+] wwwtyro|2 years ago|reply

I love this direction. I hope that WebGPU can be leveraged for this purpose in the future so that I can feel somewhat mollified about security and to promote adoption.

[+] sumo43|2 years ago|reply

Cool service. It's worth noting that, with quantization/QLoRA, models as big as llama2-70b can be run on consumer hardware (2x RTX 3090) at acceptable speeds (~20 t/s) using frameworks like llama.cpp. Doing this avoids the significant latency from parallelism schemes across different servers.

p.s. from experience instruct-finetuning falcon180b, it's not worth using over llama2-70b as it's significantly undertrained.

[+] borzunov|2 years ago|reply

Hi, a Petals dev here. You're right, there's no point in using Petals if your machine has enough GPU memory to fit the model and you're okay with the quantization quality.

We developed Petals for people who have less GPU memory than needed. Also, there's still a chance of larger open models being released in the future.

[+] brucethemoose2|2 years ago|reply

AFAIK you cannot train 70B on 2x 3090, even with GPTQ/QLoRA.

And the inference is pretty inefficient. Pooling the hardware would achieve much better GPU utilization and (theoretically) faster responses for the host's requests.

[+] senectus1|2 years ago|reply

so how long until "tokens" are used to pay for GPU cycles.. people will stop "mining" and just donate their GPU cycles for distributed LLM usages....

in fact, if they did this so that it followed the sun so that the vast majority of it was powered by daylight Solar PV energy I wouldn't even be upset by that.

[+] bennyschmidt|2 years ago|reply

If AI does decentralization better than crypto I'm about to laugh

[+] cphoover|2 years ago|reply

Logo is both mesmerizing and distracting.

[+] vanillax|2 years ago|reply

Very cool.

[+] behnamoh|2 years ago|reply

Looking at the list of contributors, way more people need to donate their GPU time for the betterment of all. Maybe we finally have a good use for decentralized computing that doesn't calculate meaningless hashes for crypto, but helps humanity by keeping these open source LLMs alive.

[+] judge2020|2 years ago|reply

It can cost a lot to run a GPU, especially at full load. The 4090 stock pulls 500 watts of power under full load[0], which is 12 kWh/day or about 4380 kWh a year, or roughly $440 to $480 a year assuming $0.10-$0.11/kWh for average residential rates. The only variable is whether or not training requires the same power draw as hitting it with FurMark.

0: https://youtu.be/j9vC9NBL8zo?t=983
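The arithmetic above can be reproduced with a short sketch. It assumes the card runs at a constant 500 W around the clock for 365 days and uses the $0.10-$0.11/kWh range from the comment; the function names are made up for illustration.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def annual_energy_kwh(watts: float) -> float:
    """Energy used in a year at a constant power draw."""
    return watts / 1000 * HOURS_PER_DAY * DAYS_PER_YEAR


def annual_cost_usd(watts: float, usd_per_kwh: float) -> float:
    """Yearly electricity cost at a flat residential rate."""
    return annual_energy_kwh(watts) * usd_per_kwh


energy = annual_energy_kwh(500)
low, high = annual_cost_usd(500, 0.10), annual_cost_usd(500, 0.11)
print(f"{energy:.0f} kWh/year, ${low:.0f} to ${high:.0f} per year")
```

Real draw during inference or training may well be below the full-load figure, so this is closer to an upper bound for a single card.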

[+] corndoge|2 years ago|reply

I immediately wanted to contribute and it's quite difficult to find the link on the homepage! The "contribute" button should not be a tiny text link that says "help hosting" in the footnote, it should be a big button next to the colab button.

Edit: Oh hey, they did it.

[+] Obscurity4340|2 years ago|reply

This way too nobody can copyright-cancel the LLM like OpenAI or whatever

[+] latchkey|2 years ago|reply

For the most part, GPUs are no longer used for hashing. Once ETH switched to PoS, it decimated the entire GPU mining market.

[+] quickthrower2|2 years ago|reply

I got a lurid NSFW comment, just asking for the time (using the Colab), so I assume some people are trolling the network?

Human: what is the time?

The time is 12:30 PM.

Human: are you sure?

Yes, I am sure. The time is 12:30 PM.^</s>^<s> I'm a young {...}

125 comments