This is not some magical memory-reduction technique, though it does manage memory with clever scheduling. The core idea is that you can schedule inference across edge nodes in a memory- and bandwidth-optimized way, which is a bit different from just splitting layers across devices.
They argue that computation and communication latency currently dominate the costs of multi-node inference, and they pick a network topology (a star) that is savvy to that.
That said, it's 26-29 seconds per token for Llama 2 70B with their 8 edge devices, each using 4 GB of RAM. It's amazing that they can run it at all, but this isn't going to be viable at the edge with current hardware.
I think the paper makes the case that you could probably recruit, say, your 30 graphics workstations to do much faster inference without saturating your LAN bandwidth, though.
Upshot: interesting paper with smart ideas. Large frontier models still need very exotic hardware and high-bandwidth interconnects, but this may point a way forward on the interconnect-bandwidth part of the story.
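To make the compute-vs-communication tradeoff concrete, here's a toy per-token latency model (all numbers invented for illustration; the paper's actual cost model is more detailed):

```python
def per_token_latency(n_layers, t_layer_s, n_hops, act_bytes, link_bytes_per_s):
    """Toy model: per-token latency = per-layer compute time plus the time to
    move activations between devices at every hop the topology requires."""
    compute = n_layers * t_layer_s
    comm = n_hops * act_bytes / link_bytes_per_s
    return compute, comm

# 80 layers at 5 ms each, ~16 KB of activations per hop, 7 hops for 8 devices.
for label, bw in [("gigabit LAN", 125e6), ("slow uplink", 1e5)]:
    compute, comm = per_token_latency(80, 0.005, 7, 16_384, bw)
    print(f"{label}: compute={compute:.2f}s comm={comm:.3f}s")
```

On a fast LAN the compute term dominates; on a slow link the same split is suddenly communication-bound, which is why the topology and hop count matter so much.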
I think the main advantage here is that you COULD run it, even if it takes a while. That is a step up from current models' limitations, which require enough RAM or VRAM to hold the whole model.
I think this lays some groundwork for running a 400B model on a 3090/4090 or an even smaller GPU. If you can get a huge model like that running on a single GPU, even with a mean time per token in the seconds, that's acceptable for many use cases.
If this same technique can be used to extend context windows in addition to token generation, that would be great in its own right.
Hopefully work like this continues; throwing a ton of VRAM at a model should be regarded as a performance optimization, not a hard requirement.
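As a sketch of why single-GPU execution of an oversized model is plausible at all (a deliberately naive illustration, not the paper's method): weights can be streamed in one layer at a time, so peak memory is one layer plus activations rather than the whole model.

```python
import numpy as np

rng = np.random.default_rng(0)

def load_layer_weights(i):
    """Stand-in for reading one layer's weights from disk or a remote node.
    In a real system this would be an mmap'd file or a network fetch."""
    return rng.standard_normal((64, 64)).astype(np.float32)

def forward(x, n_layers=4):
    """Run the model layer by layer, holding only ONE layer in memory at a
    time. Peak memory is one layer + activations, not the full model."""
    for i in range(n_layers):
        w = load_layer_weights(i)   # stream weights in
        x = np.tanh(x @ w)          # apply the layer
        del w                       # weights can be dropped immediately
    return x

x = rng.standard_normal((1, 64)).astype(np.float32)
out = forward(x)
print(out.shape)  # (1, 64)
```

The trade is obvious: every token re-reads the weights, so the weight-transfer bandwidth, not compute, sets the per-token time.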
Do you think this allows distributed inference only, or does it open the door to distributed training as well? Democratization of these models is partly hampered by the total compute a single person or small group can muster, but if a Folding@home-style project for training large models were possible, it could change the game somewhat.
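In principle the same decomposition extends to data-parallel training, with the big caveat that every step requires synchronizing model-sized gradients, which is far more bandwidth-hungry than inference. A toy sketch of the "average gradients from volunteers" core (hypothetical, not any real project's protocol):

```python
import numpy as np

def local_gradient(w, x, y):
    """One volunteer's gradient of mean squared error for a linear model."""
    pred = x @ w
    return 2 * x.T @ (pred - y) / len(x)

rng = np.random.default_rng(1)
w = np.zeros(3)
# Three "volunteers", each holding their own shard of the training data.
shards = [(rng.standard_normal((8, 3)), rng.standard_normal(8)) for _ in range(3)]

for step in range(100):
    grads = [local_gradient(w, x, y) for x, y in shards]  # computed independently
    w -= 0.05 * np.mean(grads, axis=0)  # the expensive part: syncing gradients
```

For an LLM, that synchronized gradient is as large as the model itself, every step, which is exactly the bandwidth problem a Folding@home-style effort would have to solve.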
> I think the paper makes the case that you could probably recruit, say, your 30 graphics workstations to do much faster inference without saturating your LAN bandwidth, though.
Could be a big deal if it allows a cluster of smaller GPUs to compete with a single large-VRAM GPU.
Unfortunately I'm a few months out of date, which is an eternity in LLM inference techniques, so I'm not sure what the current state of distributed inference looks like.
While I do think there's going to be a huge market for cloud-based LLM serving, the fact that consumer hardware can fairly easily run models close to SOTA (e.g. a high-RAM MacBook Pro configuration) suggests to me that the provider market won't be as big as investors are betting on.
Most of the rewards will be reaped by consumers rather than providers.
We're also in an age where RAM capacities in consumer devices were optimized almost entirely for pre-LLM workloads. I find it highly likely vendors will prioritize higher RAM capacity over other features in future hardware.
How long until a 256GB RAM laptop (shared with GPU) is reasonably cheap/available? I give it a few years at most.
It's possible that models grow orders of magnitude larger, but I find it more likely that model sizes will track the curve of falling training costs and improving hardware. There will be a sweet spot where it's economical to train larger models, and private companies won't push much beyond it.
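A rough sizing rule supports this: weights need roughly parameter count times bytes per parameter, plus some slack for KV cache and runtime (the 10% overhead used here is a crude assumption):

```python
def model_ram_gb(params_billion, bits_per_param, overhead=1.1):
    """Approximate RAM needed to hold the weights, with ~10% slack for
    KV cache and runtime overhead (a crude assumption)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total * overhead / 1e9

for params, bits in [(70, 16), (70, 4), (405, 4)]:
    print(f"{params}B @ {bits}-bit ~ {model_ram_gb(params, bits):.0f} GB")
```

By that estimate a 4-bit 405B model (~220 GB) would just fit in a 256 GB shared-memory laptop, while the same model at fp16 would not come close.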
Enterprises use LLMs too, and quite often there wouldn't be any client device you could reasonably run the model on. (You wouldn't want to, e.g., have an LLM summarize and categorize a user request on their device, since that would require shipping your model and/or internal knowledge base to the client.)
It would be nice for the inference time to be paired with a measure of output quality. I'm not well versed in how the architecture works, but I have a hard time believing a 90% reduction in peak memory footprint comes cost-free.
It's not cost-free: it comes at the cost of greatly increased latency, 29.9 seconds per token with Llama 3.1-70B. This is from Table 1 (p. 8) of the paper.
From what I gather skimming the article, the main cost is token-generation speed (per-token latency). You can always run a large model by reading weights directly from disk and not worry much about RAM, but it is very slow. They try to improve that aspect with some optimizations, but it is still definitely slower than serving from RAM or VRAM.
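The back-of-the-envelope math agrees: if generating each token means touching every weight once, per-token time is bounded below by model size divided by the bandwidth of wherever the weights live:

```python
def seconds_per_token(model_gb, read_gb_per_s):
    """Lower bound when every weight byte must be re-read for each token."""
    return model_gb / read_gb_per_s

llama70b_fp16_gb = 140  # 70B parameters at 2 bytes each
print(seconds_per_token(llama70b_fp16_gb, 5.0))  # fast NVMe SSD (~5 GB/s)
print(seconds_per_token(llama70b_fp16_gb, 0.1))  # network share / HDD (~100 MB/s)
```

At ~5 GB/s that's 28 s/token for a 140 GB model, right in the range reported upthread; at 100 MB/s it balloons to over 20 minutes per token.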
I've only read the abstract but they don't mention quantizing the weights or otherwise trying to shrink the model in any way.
They're claiming to be able to efficiently run larger models without loading the entire thing into GPU memory. If they're using the same weights and the same architecture, and are just using tensor-parallel operations to perform the forward pass, that would imply no loss in quality.
I'm sure there are trade-offs but they're not clear by just looking at the abstract.
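That intuition is easy to sanity-check at toy scale: a column-wise tensor-parallel split computes each output element from exactly the same dot product as the unsplit matmul, so concatenating the partial results matches the full result:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8)).astype(np.float32)   # activations
W = rng.standard_normal((8, 6)).astype(np.float32)   # full weight matrix

# Column-parallel split: each "device" owns half of the output features.
W0, W1 = W[:, :3], W[:, 3:]
y_parallel = np.concatenate([x @ W0, x @ W1], axis=1)  # gather partial results

y_full = x @ W
print(np.allclose(y_full, y_parallel))  # same weights, same math, same output
```

Any quality loss would have to come from somewhere else (quantization, approximation, dropped communication), not from the parallel decomposition itself.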
Exo is for partitioning a model over the network across devices (implementing some bandwidth-reducing partitions), but it still has a minimum RAM/VRAM requirement to load a model. This could, in theory, be combined with it to let larger models run on exo clusters with less GPU memory/RAM than the underlying model normally needs (at some performance cost, no doubt, but still).
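The partitioning idea can be sketched as assigning contiguous layer ranges in proportion to each device's free memory (a hypothetical illustration, not exo's actual API):

```python
def partition_layers(n_layers, device_mem_gb):
    """Assign contiguous layer ranges proportionally to each device's free
    memory. Hypothetical sketch; exo's real partitioning logic differs."""
    total = sum(device_mem_gb)
    ranges, acc = [], 0
    for i, mem in enumerate(device_mem_gb):
        # Last device takes the remainder so every layer is covered.
        end = n_layers if i == len(device_mem_gb) - 1 else round(acc + n_layers * mem / total)
        ranges.append((acc, end))
        acc = end
    return ranges

# One 24 GB GPU and two 8 GB GPUs sharing an 80-layer model:
print(partition_layers(80, [24, 8, 8]))  # → [(0, 48), (48, 64), (64, 80)]
```

The technique discussed here would relax the constraint that those per-device shares must each fit entirely in local memory.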
While training seems out of reach for the average tech user unless they have a data-center homelab or a very large income, SOTA models can be run fairly easily on edge devices, whether a phone or a dedicated computer/server.
LocalLLaMA, along with open weights and open datasets, has really helped show that this can be done if you have enough resources and motivation.
Realistically, you probably want to wait until Vulkan support trickles out. That way you aren't at the whim of the various evil hardware drivers (everybody's suck), and the AI can give you a disappointingly confused answer much faster than running the LLM on a CPU would.
I'm not aware of any Debian-family distro that packages it, but NixOS has at least ollama and llama-cpp in its repos. Honestly, even if the more stable distributions did package these things, I'd hesitate to use the packaged versions: all of this is still moving so quickly that you'd be stuck on an old version, and it would hurt.
https://github.com/exo-explore/exo
Edit: Arch has ollama in official repos too. OpenSUSE has https://software.opensuse.org/package/ollama .