I follow the MLX team on Twitter, and they sometimes post about using MLX on two or more Macs joined together to run models that need more than 512GB of RAM.
For a bit more context, those posts are using pipeline parallelism. For N machines, put the first L/N layers on machine 1, the next L/N layers on machine 2, and so on. With pipeline parallelism you don't get a speedup over one machine; it just buys you the ability to use larger models than you can fit on a single machine.
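To make that split concrete, here's a minimal sketch in plain Python (not MLX's actual API; the function name is mine) of assigning contiguous layer ranges to N machines:

    def pipeline_shards(num_layers: int, num_machines: int) -> list[range]:
        """Return the contiguous block of layers each machine owns."""
        per_machine = num_layers // num_machines
        shards, start = [], 0
        for rank in range(num_machines):
            # The last machine absorbs any remainder layers.
            end = num_layers if rank == num_machines - 1 else start + per_machine
            shards.append(range(start, end))
            start = end
        return shards

    # Example: an 80-layer model across 3 machines.
    print(pipeline_shards(80, 3))  # [range(0, 26), range(26, 52), range(52, 80)]

During generation each machine runs only its block and hands the activations to the next one, which is why throughput stays roughly that of a single machine.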
The release in Tahoe 26.2 will enable us to do fast tensor parallelism in MLX. Each layer of the model is sharded across all machines. With this type of parallelism you can get close to N-times faster for N machines. The main challenge is latency since you have to do much more frequent communication.
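As a rough illustration of why latency dominates, here's a hedged sketch of one tensor-parallel (row-parallel) matmul using MLX's distributed all_sum; launch and hostfile setup are omitted, and the helper name is illustrative:

    import mlx.core as mx

    world = mx.distributed.init()  # one process per machine in the cluster

    def row_parallel_matmul(x_shard, w_shard):
        # Each machine multiplies only its slice of the weights, then the
        # partial results are summed across machines. This all_sum happens
        # for every sharded layer, which is why link latency is the limiter.
        partial = x_shard @ w_shard
        return mx.distributed.all_sum(partial)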
I’m hoping this isn’t as attractive as it sounds for non-hobbyists, because the performance won’t scale well to parallel workloads or even context processing, where parallelism can be put to better use.
Hopefully this makes it really nice for people who want to experiment with LLMs and run a local model, but means well-funded companies won’t have any reason to grab them all vs GPUs.
It would be incredibly ironic if, with Apple's supply chain being relatively stable compared to the chaos of the RAM market these days (chaos projected to last for years), Apple compute became known as a cost-effective way to build medium-sized clusters for inference.
Here’s a text edition:
For $50k the inference hardware market forces a trade-off between capacity and throughput:
* Apple M3 Ultra Cluster ($50k): Maximizes capacity (3TB). It is the only option in this price class capable of running 3T+ parameter models (e.g., Kimi K2), albeit at low speeds (~15 t/s).
* NVIDIA RTX 6000 Workstation ($50k): Maximizes throughput (>80 t/s). It is superior for training and inference but is hard-capped at 384GB VRAM, restricting model size to <400B parameters.
To achieve both high capacity (3TB) and high throughput (>100 t/s) requires a ~$270,000 NVIDIA GH200 cluster and data center infrastructure. The Apple cluster provides 87% of that capacity for 18% of the cost.
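Rough arithmetic behind those percentages (the GH200 cluster's capacity isn't stated above, so it's backed out of the 87% figure):

    apple_cost, gh200_cost = 50_000, 270_000
    apple_capacity_tb = 3.0
    print(f"cost ratio: {apple_cost / gh200_cost:.1%}")                  # 18.5%
    print(f"implied GH200 capacity: {apple_capacity_tb / 0.87:.2f} TB")  # 3.45 TB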
For $50K, you could buy 25 Framework desktop motherboards (128GB VRAM each w/ Strix Halo, so over 3TB total). Not sure how you'd cluster all of them, but it might be fun to try. ;)
Are you factoring in the above comment about the as-yet-unimplemented parallel speedup? For on-prem inference without any kind of ASIC, this seems like quite a bargain, relatively speaking.
Apple deploys LPDDR5X for the energy efficiency and cost (lower is better), whereas NVIDIA will always prefer GDDR and HBM for performance and cost (higher is better).
This implies you'd run more than one Mac Studio in a cluster, and I have a few concerns regarding Mac clustering (as someone who's managed a number of tiny clusters, with various hardware):
1. The power button is in an awkward location, meaning rackmounting them (either 10" or 19" rack) is a bit cumbersome (at best)
2. Thunderbolt is great for peripherals, but as a semi-permanent interconnect, I have worries over the port's physical stability... wish they made a Mac with QSFP :)
3. Cabling will be important, as I've had tons of issues with TB4 and TB5 devices with anything but the most expensive Cable Matters and Apple cables I've tested (and even then...)
4. macOS remote management is not nearly as efficient as Linux, at least if you're using open source / built-in tooling
To that last point, I've been trying to figure out a way to, for example, upgrade to macOS 26.2 from 26.1 remotely, without a GUI, but it looks like you _have_ to use something like Screen Sharing or an IP KVM to log into the UI, to click the right buttons to initiate the upgrade.
Trying "sudo softwareupdate -i -a" will install minor updates, but not full OS upgrades, at least AFAICT.
I have no experience with this, but for what it's worth, looks like there's a rack mounting enclosure available which mechanically extends the power switch: https://www.sonnetstore.com/products/rackmac-studio
"... Thunderbolt is great for peripherals, but as a semi-permanent interconnect, I have worries over the port's physical stability ..."
Thunderbolt as a server interconnect displeases me aesthetically but my conclusion is the opposite of yours:
If the systems are locked into place as servers in a rack, the movements and stresses on the cable are much lower than when it is used as a peripheral interconnect for a desktop or laptop, yes?
VNC over SSH tunneling always worked well for me before I had Apple Remote Desktop available, though I don't recall if I ever initiated a connection attempt from anything other than macOS...
erase-install can be run non-interactively when the correct arguments are used. I've only ever used it with an MDM in play, so YMMV: https://github.com/grahampugh/erase-install
With MDM solutions you can not only get software update management, but even full lights-out management (LOM) for models that support it.
There are free and open-source MDMs out there.
It’s been terrible for years/forever. Even Xserves didn’t really meet the needs of a professional data centre. And it’s got worse as a server OS because it’s not a core focus. Don’t understand why anyone tries to bother - apart from this MLX use case or as a ProRes render farm.
Apple should set up their own giant cloud of M chips with tons of VRAM, make Metal as good as possible for AI purposes, then market the cloud as allowing self-hosted models for companies and individuals that care about privacy. They would clean up in all kinds of sectors whose data can't touch the big LLM companies.
The advantages of having a single big memory pool per GPU are not as big in a data center, where you can just shard things between machines and use the very fast interconnect, saturating the much faster compute cores of a non-Apple GPU from Nvidia or AMD.
I've been testing HPL and mpirun a little, not yet with this new RDMA capability (it seems like Ring is currently the supported method)... but it was a little rough around the edges.
Is there any way to connect DGX Sparks to this via USB4? Right now only 10GbE can be used, despite both the Spark and the Mac Studio having vastly faster options.
Sparks are built for this and actually have ConnectX-7 NICs built in! You just need to get the SFPs for them. This means you can natively cluster them at 200Gbps.
I am waiting for the M5 Studio, but given current hardware prices I'm not sure it will be at a level I would call affordable. For now I'm watching the news, and if there's any announcement that prices will go up, I'll probably settle for an M4 Max.
Remember when they enabled eGPU over Thunderbolt and no one cared, because the Thunderbolt housing cost almost as much as your MacBook outright? Yeah. Thunderbolt is a racket. It’s a goddamned cord. Why is it $50?
Maybe Apple should rethink bringing back Mac Pro desktops with pluggable GPUs, like that one in the corner still playing with its Intel and AMD toys, instead of a big box full of air and pro audio cards only.
Very cool. It requires a fully-connected mesh so the scaling limit here would seem to be 6 Mac Studio M3 Ultra, up to 3TB of unified memory to work with.
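A quick back-of-the-envelope for the mesh constraint (assuming one Thunderbolt 5 link per peer and 512GB per M3 Ultra Studio, both as discussed elsewhere in the thread):

    # A full mesh of n nodes needs n-1 ports on every node and n*(n-1)/2 cables.
    for n in (2, 4, 6, 7):
        ports_per_node = n - 1
        cables = n * (n - 1) // 2
        total_memory_tb = n * 512 / 1024
        print(f"{n} nodes: {ports_per_node} ports/node, {cables} cables, {total_memory_tb:.1f} TB")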
Now we need some hardware that is rackmount friendly, and an OS that is not fiddly as hell to manage in a data center or on a headless server, and we are off to the races! And no, custom racks are not 'rackmount friendly'.
Thunderbolt 5's stated "80Gbps" bandwidth comes with some caveats. That figure is either DisplayPort bandwidth on its own or, as more often realized in practice, the data channel (PCIe 4 x4, ~64Gbps) combined with the display channels (up to 80Gbps when used together with the data channel), and it can potentially also do a unidirectional 120Gbps of data for some display output scenarios.
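For reference, the raw arithmetic behind that ~64Gbps PCIe figure (a sketch; real-world throughput also loses a bit to protocol overhead):

    # PCIe 4.0: 16 GT/s per lane with 128b/130b encoding.
    lanes, gt_per_s = 4, 16.0
    encoding = 128 / 130
    usable_gbps = lanes * gt_per_s * encoding  # ~63 Gbps
    print(f"{usable_gbps:.1f} Gbps (~{usable_gbps / 8:.1f} GB/s)")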
If Apple's silicon follows the spec, that means you're most likely limited to PCIe 4 x4 (~64Gbps) of bandwidth per TB port, with a slight latency hit due to the controller. That latency hit is ItDepends(TM), but if you're not using any other I/O on that controller/cable (such as DisplayPort), it's likely to be less than 15% overhead vs native on average; exact figures vary depending on drivers, firmware, configuration, use case, cable length, how Apple implemented TB5, etc. And just like how a 60FPS average doesn't mean every frame is exactly 1/60th of a second long, it's entirely possible that individual packets or niche scenarios could see significantly more latency/overhead.
As a point of reference, Nvidia RTX Pro (formerly known as Quadro) workstation cards of the Ada generation and older, along with most modern consumer graphics cards, are PCIe 4 (or less, depending on how old we're talking), while the new RTX Pro Blackwell cards are PCIe 5. Though comparing, say, a Mac Studio M4 Max to an Nvidia GPU is akin to comparing apples to green oranges.
However, I mention the GPUs not just to acknowledge the 800lb AI compute gorilla in the room, but also to note that while it's possible to pool a pair of 24GB VRAM GPUs into a 48GB VRAM pool (be it through a shared PCIe bus or over NVLink), the performance does not scale linearly due to PCIe/NVLink limitations, to say nothing of the software, configuration, and optimization work that's also a challenge to realizing max throughput in practice.
The same is true for a pair of TB5-equipped Macs with 128GB of memory each: using TB5 to achieve a 256GB pool will take a substantial performance hit compared to an otherwise equivalent single Mac with 256GB (capacities chosen are arbitrary, just to illustrate the point). The exact penalty really depends on the use case and how sensitive it is to the latency overhead of using TB5, as well as the bandwidth limitation.
It's also worth noting that with RDMA solutions (no matter the specifics) it's entirely possible to see worse performance than using a single machine if you haven't properly optimized and configured things. This is not hating on the technology, but a warning from experience for people who may have never dabbled: don't expect things to just "2x", or even beat 1x performance, simply by stringing a cable between two devices.
All that said, glad to see this from Apple. Long overdue in my opinion, as I doubt we'll see them implement an optical network port with anywhere near that bandwidth or RoCEv2 support, much less expose a native (not via TB) PCIe port on anything that's a non-Pro model.
EDIT: Note, many Mac SKUs have multiple TB5 ports, but it's unclear to me what the underlying architecture/topology is there, and thus I can't speculate on what kind of overhead or total capacity any given device supports when attempting to use multiple TB links for more bandwidth/parallelism. If anyone's got an SoC diagram or similar reference data that actually tells us how the TB controller(s) are uplinked to the rest of the SoC, I could go into more depth there. I'm not an Apple silicon/macOS expert. I do, however, have lots of experience with RDMA/RoCE/IB clusters, NVMe-oF deployments, SXM/NVLink'd devices, and generally engineering low-latency/high-performance network fabrics for distributed compute and storage (primarily on the infrastructure/hardware/ops side rather than the software side), so this is my general wheelhouse, but Apple has been a relative blind spot for me because their ecosystem has generally lacked features/support for things like this.
You might ignore this but, for a while, Mac Mini clusters were a thing and they were capex and opex effective. That same setup is kind of making a comeback.
GarageBand DAW + macOS 14.4 Roland Juno-D7 synthesizer, for 8-bit audio complementary compact disc format as AIFF, WAV, or MIDI appliance, in which under SLA-royalties licenses, a binary 44.1 kHz sample rate sets the reproducer for reference level.
[1]: https://www.apple.com/legal/sla/docs/GarageBand.pdf
Can we get proper HDR support first in macOS? If I enable HDR on my LG OLED monitor it looks completely washed out and blacks are grey. Windows 11 HDR works fine.
Really? I thought it had always been that HDR was notorious on Windows, hopeless on Linux, and only really worked in a plug-and-play manner on a Mac, unless your display has an incorrect profile or something?
This doesn’t remotely surprise me, and I can guess Apple’s AI endgame:
* They already cleared the first hurdle to adoption by shoving inference accelerators into their chip designs by default. It’s why Apple is so far ahead of their peers in local device AI compute, and will be for some time.
* I suspect this introduction isn’t just for large clusters, but also a testing ground of sorts to see where the bottlenecks lie for distributed inference in practice.
* Depending on the telemetry they get back from OSes using this feature, my suspicion is they’ll deploy some form of distributed local AI inference system that leverages their devices tied to a given iCloud account or on the LAN to perform inference against larger models, but without bogging down any individual device (or at least the primary device in use)
For the endgame, I’m picturing a dynamically sharded model across local devices that shifts how much of the model is loaded on any given device depending on utilization, essentially creating local-only inferencing for privacy and security of their end users. Throw the same engines into, say, HomePods or AppleTVs, or even a local AI box, and voila, you’re golden.
EDIT: If you're thinking, "but big models need the low latency of Thunderbolt" or "you can't do that over Wi-Fi for such huge models", you're thinking too narrowly. Think about the devices Apple consumers own, their interconnectedness, and the underutilized but standardized hardware within them with predictable OSes. Suddenly you're not jamming existing models onto substandard hardware or networks, but rethinking how to run models effectively over consumer distributed compute. Different set of problems.
The bandwidth of RDMA over Thunderbolt is so much higher (and the latency so much lower) than Apple's system of mostly-wireless devices that I can't see how any learnings here would transfer.
I think you are spot on, and this fits perfectly within my mental model of HomeKit; tasks are distributed to various devices within the network based on capabilities and authentication, and given a very fast bus Apple can scale the heck out of this.
A couple of examples:
Kimi K2 Thinking (1 trillion parameters): https://x.com/awnihannun/status/1986601104130646266
DeepSeek R1 (671B): https://x.com/awnihannun/status/1881915166922863045 - that one came with setup instructions in a Gist: https://gist.github.com/awni/ec071fd27940698edd14a4191855bba...
Note fast sync workaround
That being said, for inference Macs still remain the best, and the M5 Ultra will be an even better value with its better prompt processing (PP).
Wake me up when the situation improves
https://www.owc.com/solutions/thunderbolt-dock
It's a poor imitation of old ports that had screws on the cables, but should help reduce inadvertent port stress.
The screw only works with limited devices (i.e., not the Mac Studio end of the cord), but it can also be adhesive-mounted.
https://eshop.macsales.com/item/OWC/CLINGON1PK/
I think you can do this if you install an MDM profile on the Macs and use some kind of management software like Jamf.
See: https://ml-explore.github.io/mlx/build/html/usage/distribute...
https://x.com/__tinygrad__/status/1980082660920918045
Is this part of Apple’s plan to build out server-side AI support using their own hardware?
If so, they would need more physical data centres.
I’m guessing they too would be constrained by RAM.
(The cord is $50 because it contains two active chips BTW.)
https://en.wikipedia.org/wiki/Xgrid
rdma_ctl enable
I'd have some other uses for RDMA between Macs.
https://github.com/Anemll/mlx-rdma/commit/a901dbd3f9eeefc628...
Don’t get me wrong... It’s super cool, but I fail to understand why money is being spent on this.
Liquid (gl)ass still sucks.
https://www.youtube.com/shorts/sx9TUNv80RE
Not really. llama.cpp was just using the GPU when it took off. Apple's advantage is more VRAM capacity.
"... this introduction isn’t just for large clusters ..."
It doesn't work for large clusters at all; it's limited to 6-7 Macs and most people will probably use just 2 Macs.