Hi HN,
I built OpenGraviton, an open-source AI inference engine that pushes the limits of running extremely large LLMs on consumer hardware. By combining 1.58-bit ternary quantization, dynamic sparsity (Top-K pruning plus MoE routing), and mmap-based layer streaming, OpenGraviton can run models far larger than system RAM, even on a Mac Mini.
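For readers unfamiliar with 1.58-bit ternary quantization: the idea is to map every weight to {-1, 0, +1} times a per-tensor scale, so each weight needs only log2(3) ≈ 1.58 bits of information. Here is a minimal sketch of the general technique (absmean scaling, 5 trits packed per byte) — an illustration of the approach, not OpenGraviton's actual code:

```python
# Sketch of BitNet-style ternary quantization; an assumption about the
# general technique, not OpenGraviton's implementation.
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Map weights to {-1, 0, +1} with a per-tensor absmean scale."""
    scale = float(np.mean(np.abs(w))) + 1e-8
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

def pack_trits(q: np.ndarray) -> np.ndarray:
    """Pack 5 ternary digits per byte (3**5 = 243 <= 256), ~1.6 bits/weight."""
    t = (q.ravel().astype(np.int16) + 1).astype(np.uint8)  # {-1,0,1} -> {0,1,2}
    pad = (-len(t)) % 5
    t = np.concatenate([t, np.zeros(pad, dtype=np.uint8)]).reshape(-1, 5)
    powers = np.array([1, 3, 9, 27, 81], dtype=np.uint16)   # base-3 positions
    return (t * powers).sum(axis=1).astype(np.uint8)        # max 242, fits a byte

w = np.random.randn(64, 64).astype(np.float32)
q, s = ternary_quantize(w)
packed = pack_trits(q)  # 4096 weights -> 820 bytes, vs 8192 bytes in FP16
```

Dequantization is a multiply by one scalar, which is why ternary inference can replace most multiplications with adds and sign flips.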
Early benchmarks:
TinyLlama-1.1B drops from ~2GB (FP16) to ~0.24GB with ternary quantization.
At 140B scale, models that normally require ~280GB fit within ~35GB packed.
Optimized for Apple Silicon with Metal + C++ tensor unpacking, plus speculative decoding for faster generation.
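A quick sanity check on those numbers, counting raw weight bytes only (the gap between the ~28GB raw figure at 140B and the quoted ~35GB presumably covers scales, embeddings, and runtime buffers — my assumption, not a stated breakdown):

```python
# Back-of-the-envelope weight storage at different bit widths.
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

print(weight_gb(1.1e9, 16))    # TinyLlama-1.1B in FP16: 2.2 GB
print(weight_gb(1.1e9, 1.58))  # ternary: ~0.22 GB
print(weight_gb(140e9, 16))    # 140B in FP16: 280 GB
print(weight_gb(140e9, 1.58))  # ternary: ~27.7 GB
```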
Check benchmarks, architecture, and details here: https://opengraviton.github.io
GitHub: https://github.com/opengraviton
This project isn’t just about squeezing massive models onto tiny hardware—it’s about democratizing access to giant LLMs without cloud costs. Feedback, forks, and ideas are very welcome!
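For anyone curious how the mmap-based layer streaming works in principle: map the packed weight file once and take zero-copy per-layer views, so the OS page cache pulls in only the layers currently executing and evicts cold ones under memory pressure. A minimal sketch — the fixed-size-layer file layout here is a made-up illustration, not OpenGraviton's actual format:

```python
# Hypothetical sketch of mmap-based layer streaming. The file layout
# (fixed-size layers, raw uint8 payload) is assumed for illustration.
import os
import tempfile
import numpy as np

LAYER_BYTES = 1024   # assumed packed size per layer
N_LAYERS = 8

def write_dummy_weights(path: str) -> None:
    np.arange(N_LAYERS * LAYER_BYTES, dtype=np.uint8).tofile(path)

def open_weights(path: str) -> np.memmap:
    # mode="r": read-only mapping; no bytes are read until pages are touched
    return np.memmap(path, dtype=np.uint8, mode="r")

def layer_view(weights: np.memmap, i: int) -> np.ndarray:
    # Zero-copy view of one layer's packed bytes, paged in lazily by the OS.
    return weights[i * LAYER_BYTES:(i + 1) * LAYER_BYTES]

path = os.path.join(tempfile.gettempdir(), "graviton_demo_weights.bin")
write_dummy_weights(path)
w = open_weights(path)
for i in range(N_LAYERS):        # stream layers one at a time
    layer = layer_view(w, i)
    # ... unpack ternary weights and run the layer here ...
```

The key property is that peak resident memory tracks the working set of hot layers, not the total file size — which is what lets the weight file exceed physical RAM.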
fatihturker | 20 days ago
Thank you for all the kind and curious comments.
For 72B models, around *36GB of memory works fine*, by the way. I ran the benchmark and shared the results on the website: https://opengraviton.github.io/index.html
While working on this research I realized something important: the way most current models are trained is extremely inefficient. Because of that, I started developing *graviton-native*, which trains AI models from scratch using more efficient architectures.
The idea is to design models that are optimized for efficiency from the beginning. My expectation is that this approach could bring a *~70% efficiency improvement*. Combined with OpenGraviton, I believe this could eventually make it possible to run *trillion-parameter-scale models locally*.
You can find the paper here: https://opengraviton.github.io/paper.html
And the repository here: https://github.com/opengraviton/graviton-native
Right now I’m training a *72B model* using this approach. I’ll share the results soon and update the website.
anentropic | 21 days ago
https://github.com/opengraviton/graviton?tab=readme-ov-file#...
the benchmarks don't show any results for using these larger-than-memory models, only the size difference
it all smells quite sloppy
hu3 | 21 days ago
~19 tok/s for Apple M1 Max (64 GB) with TinyLlama-1.1B-Chat-v1.0
swq115 | 21 days ago
For context, I run a Mac Mini M4 as a homelab server and the memory pressure from even 7B models is noticeable. Curious how this handles sustained inference without thermal throttling.
pcf | 21 days ago
I have a MacBook Pro M1 Max w/64 GB RAM, and a Mac Studio M3 Ultra w/96 GB RAM. What do you think is possible to run on these? Just curious before I really try it out.