Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU
395 points | xaskasdf | 8 days ago | github.com
This is the result of that question itself plus some weekend vibecoding (the linked library repository is in the README as well). It seems to work, even on consumer GPUs, though it should work better on professional ones.
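For readers unfamiliar with the technique, the core idea (streaming each layer's weights off storage just in time for the forward pass, overlapping the next layer's read with the current layer's compute) can be sketched as a toy simulation. This is not the project's actual code: the real thing moves data NVMe-to-GPU via P2P DMA, while here a dict stands in for the disk and a thread for the async read.

```python
import threading
import numpy as np

# Toy layer streaming with a double buffer: while layer i computes,
# a background thread "reads" layer i+1's weights into the other slot.
HIDDEN, LAYERS = 64, 4
rng = np.random.default_rng(0)
disk = {i: rng.standard_normal((HIDDEN, HIDDEN)) * 0.01 for i in range(LAYERS)}

def load_layer(i, slot, buf):
    buf[slot] = disk[i]          # stands in for an NVMe -> VRAM transfer

def forward(x):
    buf = [None, None]           # slot 0/1: one computing, one loading
    load_layer(0, 0, buf)
    for i in range(LAYERS):
        t = None
        if i + 1 < LAYERS:       # overlap the next read with this matmul
            t = threading.Thread(target=load_layer,
                                 args=(i + 1, (i + 1) % 2, buf))
            t.start()
        x = np.tanh(x @ buf[i % 2])
        if t:
            t.join()
    return x

y = forward(np.ones(HIDDEN))
print(y.shape)  # (64,)
```

With only two layer-sized buffers resident at once, peak memory stays near 2x one layer rather than the whole model, which is what makes a 70B model fit next to a 24 GB GPU.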
01100011|8 days ago
I wonder... what if the M.2 storage was actually DRAM? You probably don't need persistence for spilling a model off the GPU. How would it fare vs. just adding more host memory? The M.2 RAM would be less flexible, but it would keep the system RAM free for the CPU.
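One ceiling worth noting for a DRAM-backed M.2 device: the M.2 slot is an x4 link, so even infinitely fast media behind it is capped well below the GPU's x16 link. Rough, illustrative per-direction numbers assuming PCIe 4.0:

```python
# Approximate usable PCIe 4.0 throughput per lane (16 GT/s, 128b/130b
# encoding, ignoring protocol overhead). Illustrative numbers only.
pcie4_per_lane = 1.97            # GB/s per lane
m2_slot = 4 * pcie4_per_lane     # M.2 is x4  -> ~7.9 GB/s
gpu_slot = 16 * pcie4_per_lane   # GPU is x16 -> ~31.5 GB/s
print(f"M.2 x4: {m2_slot:.1f} GB/s, GPU x16: {gpu_slot:.1f} GB/s")
```

So a DRAM M.2 card would beat NAND on latency and endurance, but its bandwidth advantage over a fast Gen4/Gen5 SSD would be modest, since the link itself becomes the bottleneck.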
javchz|8 days ago
lmeyerov|7 days ago
I gave a talk a few years ago at the Dask Summit (conf?) on making the stars align with dask-cudf here. We were helping a customer accelerate log analytics by proving out our stack for nodes that looked roughly like: parallel SSD storage arrays (30 x 3 GB/s?) -> GPUDirect Storage -> 4 x 30 GB/s PCIe (?) -> 8 x A100 GPUs, something like that. It'd be cool to see the same thing now in the LLM world, such as a multi-GPU MoE, or even a single-GPU one for that matter!
ElectricalUnion|8 days ago
bhewes|7 days ago
https://www.servethehome.com/hyper-scalers-are-using-cxl-to-...
randomtoast|8 days ago
xaskasdf|8 days ago
Wuzado|8 days ago
tyfon|8 days ago
But 5 seconds/token is quite slow, yeah. I guess this is for low-RAM machines? I'm pretty sure my 5950X with 128 GB RAM can run this faster on the CPU, with some layers/prefill on the 3060 GPU I have.
I also see that they claim the process is compute bound at 2 seconds/token, but that doesn't seem correct for a 3090?
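A rough way to sanity-check those numbers: if every weight is streamed from NVMe once per token, seconds/token is just model bytes divided by drive bandwidth. Assuming ~8-bit quantization (1 byte/param) and typical sequential-read speeds, the arithmetic lands right around the figures quoted, which suggests the process is bandwidth bound rather than compute bound:

```python
# Back-of-envelope: all weights read from NVMe once per generated token.
# Illustrative numbers; actual quantization and drive speeds vary.
params = 70e9
bytes_per_param = 1.0                       # assume ~8-bit quantization
read_per_token = params * bytes_per_param   # ~70 GB per token
for name, gbps in [("Gen4 NVMe", 7.0), ("Gen5 NVMe", 14.0)]:
    s_per_tok = read_per_token / (gbps * 1e9)
    print(f"{name}: ~{s_per_tok:.1f} s/token")
```

That's ~10 s/token on a Gen4 drive and ~5 s/token on Gen5, so a 3090's compute is unlikely to be the limiting factor at these speeds.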
fluoridation|8 days ago
rl3|8 days ago
One workup indicated it was theoretically possible to modify a piece of SGLang's routing layer to support JIT predict-ahead expert swaps from Gen5 NVMe storage straight into GPU memory.
I'm hoping that proves true. The setup relies on NVIDIA Dynamo, so NIXL primitives are available to support that.
Curious if anyone's tried this already.
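The predict-ahead idea can be simulated in miniature: while layer l computes, guess which experts layer l+1 will route to and pull them into a resident cache in the background; a wrong guess falls back to a synchronous load (the mispredict penalty). This is a hypothetical sketch, not SGLang's or Dynamo's API; a dict stands in for NVMe and a thread for the DMA engine.

```python
import threading
import numpy as np

# Toy MoE with lookahead expert prefetch.
LAYERS, EXPERTS, DIM, TOPK = 4, 8, 16, 2
rng = np.random.default_rng(1)
disk = {(l, e): rng.standard_normal((DIM, DIM)) * 0.1
        for l in range(LAYERS) for e in range(EXPERTS)}
gates = {l: rng.standard_normal((DIM, EXPERTS)) for l in range(LAYERS)}
cache = {}                                   # "resident in VRAM"

def topk_experts(x, layer):                  # cheap router
    return list(np.argsort(x @ gates[layer])[-TOPK:])

def prefetch(layer, experts):                # async NVMe -> VRAM stand-in
    for e in experts:
        cache[(layer, e)] = disk[(layer, e)]

def get_expert(layer, e):
    if (layer, e) not in cache:              # mispredict: synchronous load
        cache[(layer, e)] = disk[(layer, e)]
    return cache[(layer, e)]

def forward(x):
    prefetch(0, topk_experts(x, 0))
    for l in range(LAYERS):
        t = None
        if l + 1 < LAYERS:
            # predict next layer's experts from the *current* hidden state
            guess = topk_experts(x, l + 1)
            t = threading.Thread(target=prefetch, args=(l + 1, guess))
            t.start()
        x = np.tanh(sum(x @ get_expert(l, e) for e in topk_experts(x, l)))
        if t:
            t.join()
    return x

y = forward(np.ones(DIM))
print(y.shape)
```

The interesting engineering question is the hit rate of that `guess`: the more "serially consistent" the router, the more transfers hide behind compute instead of stalling it.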
xaskasdf|8 days ago
jacquesm|8 days ago
And that's good, because it pushes the democratization of AI away from the silos that are being created.
serendip-ml|7 days ago
civicsquid|8 days ago
I know you said you're involved in some retrogaming and were experimenting, but as someone who works in a world where hardware is pretty heavily abstracted away, even if I got into retrogaming I don't know that I'd consider that there might be a systems improvement lying around. Beyond the creative aspect, it feels like some systems and hardware background helped put the idea together (and I'd be interested to go learn some of that systems/hardware knowledge myself).
xaskasdf|7 days ago
The idea was basically to run an LLM on a PS2. Then I ran into some problems, like the 32 MB RAM cap and the 4 MB VRAM cap, so I had to figure out a way to stream layers on the forward pass. Given that the PS2 manages to issue instructions directly to the VRAM, which is capable of 32-bit addresses, it gave an insane amount of tok/s. Then I wondered if I could do the same on my puter.
rustyhancock|7 days ago
Perhaps that's what made them think to try.
Perhaps it was the current batch of smart memory cards, which on the PS2 I believe have quite complex DMA capabilities to stream game data from the SD card.
Wuzado|8 days ago
rao-v|8 days ago
I’ve also wondered why the routers aren’t trained to be serially consistent, so you could predict which layers to swap into VRAM a few layers ahead to maximize available bandwidth.
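How far ahead that prediction needs to run falls out of a simple ratio: if reading one layer takes t_io and computing one takes t_compute, you need roughly ceil(t_io / t_compute) layers of prefetch in flight to keep the GPU fed. With illustrative (assumed) numbers for a 70B, 80-layer model at 8-bit on a Gen4 drive:

```python
import math

# Prefetch depth needed to hide NVMe reads behind compute.
# All numbers below are assumptions for illustration.
layer_bytes = 70e9 / 80          # ~0.875 GB per layer at 1 byte/param
t_io = layer_bytes / 7e9         # Gen4 NVMe ~7 GB/s -> ~0.125 s per layer
t_compute = 0.01                 # assumed per-layer GPU compute time (s)
lookahead = math.ceil(t_io / t_compute)
print(lookahead)                 # layers of prefetch depth needed
```

At that ratio you'd need a dozen-plus layers in flight, which also shows why, with a single drive, the pipeline stays I/O bound no matter how clever the scheduling.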
davideom0414|7 days ago
xaskasdf|7 days ago
throwaway2027|8 days ago
someguy2026|8 days ago
jauntywundrkind|8 days ago
Nice work. PCI-P2P (GPU-Direct (tm)) is such great stuff. Cool to see!
7777777phil|7 days ago
valianteffort|7 days ago
[deleted]
Aurornis|7 days ago
xaskasdf|7 days ago
Maxious|7 days ago
spwa4|7 days ago
xaskasdf|7 days ago
nathan_compton|7 days ago
exabrial|8 days ago
garethsprice|7 days ago
- https://taalas.com/the-path-to-ubiquitous-ai/
- https://chatjimmy.ai/
unknown|8 days ago
[deleted]
stuaxo|7 days ago
sylware|7 days ago
timzaman|7 days ago
xaskasdf|7 days ago
MarcLore|7 days ago
[deleted]
YetAnotherNick|6 days ago
unknown|6 days ago
[deleted]
fabifabulous|7 days ago
3abiton|7 days ago
xaskasdf|7 days ago
umairnadeem123|8 days ago
[deleted]
esquire_900|8 days ago
Great achievement for privacy inference nonetheless.
eleventyseven|7 days ago
ai_hack3r|7 days ago
[deleted]
johnbarron|7 days ago
[deleted]
builderhq_io|7 days ago
[deleted]
dhjjdjjjd|7 days ago
[deleted]
turingsroot|8 days ago
[deleted]
Aurornis|7 days ago
Not to diminish the impressiveness of this overall project, but it says right up front that these were vibe coded and the Opus 4.6 co-author lines are right in the commit messages. Those pieces were adapted from existing work via LLM, which is exactly the right use in a proof of concept project like this.
snovv_crash|7 days ago