top | item 46975340


NiloCK | 18 days ago

> Didn't expect to go back to macOS but they're basically the only feasible consumer option for running large models locally.

I presume here you are referring to running on the device in your lap.

How about a headless linux inference box in the closet / basement?

Return of the home network!


Aurornis|18 days ago

Apple devices have high memory bandwidth necessary to run LLMs at reasonable rates.

It’s possible to build a Linux box that does the same but you’ll be spending a lot more to get there. With Apple, a $500 Mac Mini has memory bandwidth that you just can’t get anywhere else for the price.
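The bandwidth claim is easy to sanity-check: during decode, a dense model effectively reads its (active) weights from memory once per generated token, so bandwidth sets a ceiling on tokens/sec. A minimal sketch, using illustrative numbers (the bandwidth and model-size figures below are assumptions, not measurements):

```python
# Back-of-envelope decode speed: generating one token streams the
# active model weights through memory once, so roughly
#   tokens/sec <= memory_bandwidth / bytes_of_active_weights

def decode_tps(bandwidth_gb_s: float, active_params_b: float,
               bytes_per_param: float) -> float:
    """Upper bound on decode tokens/sec from memory bandwidth alone."""
    weight_gb = active_params_b * bytes_per_param
    return bandwidth_gb_s / weight_gb

# Assumed figures: ~120 GB/s (base M4-class machine), an 8B-parameter
# model quantized to 4 bits (~0.5 bytes/param).
print(round(decode_tps(120, 8, 0.5), 1))  # ceiling of roughly 30 tok/s
```

This is why bandwidth, not raw compute, dominates single-stream generation speed.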

cmrdporcupine|18 days ago

But a $500 Mac Mini has nowhere near the memory capacity to run such a model. You'd need at least two 512GB machines chained together to run this model. Maybe one if you quantized the crap out of it.
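The "two 512GB machines" figure follows from simple weight-memory arithmetic. Assuming a roughly 1T-parameter model (an assumption for illustration), the weights alone need `params * bytes_per_param` of RAM, before any KV cache or runtime overhead:

```python
# Weight-memory footprint, ignoring KV cache and runtime overhead.

def weights_gb(params_b: float, bytes_per_param: float) -> float:
    """GB of RAM needed just to hold the weights."""
    return params_b * bytes_per_param

print(weights_gb(1000, 1.0))  # 8-bit: ~1000 GB -> two 512GB boxes
print(weights_gb(1000, 0.5))  # 4-bit: ~500 GB  -> one box, barely
```

At 4-bit you just squeak into a single 512GB machine, with almost nothing left over for context.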

And Apple completely overcharges for memory, so.

This is a model you use via a cheap API provider like DeepInfra, or get on their coding plan. It's nice that it will be available as open weights, but not practical for mere mortals to run.

But I can see a large corporation that wants to avoid sending code offsite setting up their own private infra to host it.

ingenieroariel|18 days ago

With Apple devices you get very fast generation once it gets going, but they are inferior to Nvidia precisely during prefill (processing the prompt/context) before generation starts.

For our code-assistant use cases, local inference on Macs will tend to favor workflows with a lot of generation and little reading, which is the opposite of how many of us use Claude Code.

Source: I started getting Mac Studios with max ram as soon as the first llama model was released.
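The asymmetry described above can be sketched numerically: prefill processes all prompt tokens in parallel and is compute-bound, while decode emits one token at a time and is bandwidth-bound. All numbers below (chip throughput, bandwidth, model size) are hypothetical, chosen only to show the shape of the trade-off:

```python
# Prefill is compute-bound, decode is bandwidth-bound.

def prefill_seconds(prompt_tokens: int, flops_per_token: float,
                    compute_flops: float) -> float:
    """Time to ingest the prompt, limited by compute throughput."""
    return prompt_tokens * flops_per_token / compute_flops

def decode_seconds(gen_tokens: int, weight_bytes: float,
                   bandwidth_bytes_s: float) -> float:
    """Time to generate, limited by streaming the weights per token."""
    return gen_tokens * weight_bytes / bandwidth_bytes_s

# Hypothetical chip: 30 TFLOPS compute, 800 GB/s bandwidth,
# 35 GB of active weights (~2 FLOPs per active parameter per token).
prompt = prefill_seconds(50_000, 70e9, 30e12)  # 50k-token context
gen = decode_seconds(500, 35e9, 800e9)         # 500 generated tokens
print(round(prompt, 1), round(gen, 1))
```

With a long, read-heavy prompt, the prefill term dwarfs the generation term, which is exactly the agentic-coding pattern described above.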

ac29|18 days ago

> a $500 Mac Mini has memory bandwidth that you just can’t get anywhere else for the price.

The cheapest new mac mini is $600 on Apple's US store.

And it has a 128-bit memory interface using LPDDR5X/7500, nothing exotic. The laptop I bought last year for <$500 has roughly the same memory speed and new machines are even faster.

zozbot234|18 days ago

And then only Apple devices have 512GB of unified memory, which matters when you have to combine larger models (even MoE) with the bigger context/KV caching you need for agentic workflows. You can make do with less, but only by slowing things down a whole lot.
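The KV-cache cost mentioned here also scales linearly with context, on top of the weights. A sizing sketch, using a hypothetical model shape (layer count, KV heads, and head dimension below are assumptions, not any specific model's):

```python
# KV-cache sizing: each token stores one key and one value vector
# per layer, so
#   bytes ~= 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_elem: int = 2) -> float:
    """GB of KV cache for a given context length (fp16 by default)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

# Hypothetical shape: 60 layers, 8 KV heads, head_dim 128,
# at a 128k-token agentic context.
print(round(kv_cache_gb(60, 8, 128, 128_000), 1))
```

Tens of extra gigabytes for a long agentic context is why the 512GB ceiling matters even for MoE models whose active weights are comparatively small.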

pja|18 days ago

Only the M4 Pro Mac Minis have faster RAM than you’ll get in an off-the-shelf Intel/AMD laptop. The M4 Pros start at $1399.

You want the M4 Max (or Ultra) in the Mac Studios to get the real stuff.

jannniii|18 days ago

Indeed, and I've got two words for you:

Strix Halo

SillyUsername|18 days ago

Also, cheaper: X99 + 8x DDR4 + 2696V4 + 4x Tesla P4s running llama.cpp. Total cost about $500 including case and a 650W PSU, excluding RAM. Running TDP is about 200W non-peak, 550W peak (everything slammed, though I've never seen that and I have an AC monitor on the socket). GLM 4.5 Air (60GB Q3-XL), properly tuned, runs at 8.5 to 10 tokens/second with a context size of 8K. Throw in a P100 too and you'll see 11-12.5 t/s (still tuning this one). Performance doesn't drop as much for larger model sizes because internode communication and DDR4-2400 are the limiter, not the GPUs. I've been using this with 4-channel 96GB RAM, recently upgraded to 128GB.

esafak|18 days ago

How much memory does yours have, what are you running on it, with what cache size, and how fast?

mythz|18 days ago

Not feasible for large models: it takes two 512GB M3 Ultras to run the full Kimi K2.5 model at a respectable 24 tok/s. Hopefully the M5 Ultra can improve on that.