2022 Mac Studio (20-core M1 Ultra) Review

[+] microtonal|3 years ago|reply

Above, we’re looking at parallel performance of the NASA USM3D CFD solver as it computes flow over a classic NACA 0012 airfoil section at low speed conditions.

If this solver relies on matrix multiplication and uses the macOS Accelerate framework, you are seeing this speedup because M1 Macs have AMX matrix multiplication co-processors. In single precision GEMM, the M1 is faster than an 8 core Ryzen 3700X and a bit slower than a 12 core Ryzen 5900X. The M1 Pro doubles the GFLOPS of the M1 (due to having AMX co-processors for both performance core clusters). And the M1 Ultra again doubles the GFLOPs (4 performance core clusters, each with an AMX unit).

Single-precision matrix multiplication benchmark results for the Ryzen 3700X/3900X and Apple M1/M1 Pro/M1 Ultra are here:

https://twitter.com/danieldekok/status/1511348597215961093?s...

[+] JohnTHaller|3 years ago|reply

The 5900X can do 24 threads, so why artificially limit it to 16 threads? Unless that's because the M1 Ultra has 16 performance cores. Why include the 3 year old 3700X? Why not include the 32 thread 5950X which can be had in a complete system for 1/3 the price of the Mac Studio M1 Ultra?

[+] snapcore|3 years ago|reply

The scaling curve is different from the other cpus which implies the performance bottleneck is not in matrix multiplication (which would be nearly linear on the older cpus because it is embarrassingly parallel).

[+] bee_rider|3 years ago|reply

Just for anyone as out of touch with the MacOS ecosystem as me: Accelerate includes a BLAS implementation, so at least seems plausible (depending on how this library was compiled) that their special instructions might have been used.

[+] stephencanon|3 years ago|reply

If it were taking advantage of Accelerate, the performance would be much higher, but also the scaling would be quite different. Look at the scaling in the tweet you linked--it's anything but linear in the number of cores used.

[+] staticfloat|3 years ago|reply

This linear scaling per-core doesn't match my experience with using the AMX co-processor. In my experience, on the M1 and the M1 Pro, there are a limited number of AMX co-processors that is independent of the number of cores within the M1 chip. I wrote an SO answer exploring some of the performance implications of this [0], and since I wrote that another more knowledgable poster has added more information [1]. One of the key takeaways is that appears to be one AMX coprocessor per "complex", leading us to hypothesize that the M1 Pro contains 2 AMX co-processors.

This is supported by taking the code found in the gist [2] linked from my SO answer and running it on my M1 Pro. Compiling it, we get `dgesv_accelerate` which uses Accelerate to solve a medium-size linear algebra problem, that typically takes ~8s to finish on my M1 Pro. While running, `htop` reports that the process is pegging two cores (analogous to the result in my original SO answer on the M1 pegging one core; this supports the idea that the M1 Pro contains two AMX co-processors). If we run two `dgesv_accelerate` processes in parallel, we see that they take ~15 seconds to finish. So there is some small speedup, but it's very small. And if we run four processes in parallel, we see that they take ~32 seconds to finish.

All in all, the kind of linear scaling shown in the article doesn't map well to the limited number of AMX co-processors available in Apple hardware, as we would expect the M1 Max to contain maybe 8 co-processors at most. This means we should see parallelism step up in 8 steps, rather than 20 steps as was shown in the graph.

Everything I just said is true assuming that a single processor, running well-optimized code can completely saturate an AMX co-processor. That is consistent with the tests that I've run, and I'm assuming that the CFD solver he's running is well-written and making good use of the hardware (it does seem to be doing so from the shape of his graphs!). If this were not the case, one could argue that increasing the number of threads could allow multiple threads to more effectively share the underlying AMX coprocessor and we could get the kind of scaling seen in the article. However, in my experiments, I have found that Accelerate very nicely saturates the AMX resources and there is none left over for future sharing (as shown in the dgesv examples).

Finally, as a last note on performance, we have found that using OpenBLAS to run numerical workloads directly on the Performance cores (and not using the AMX instructions at all) is competitive on larger linear algebra workloads. So it's not too crazy to assume that these results are independent of the AMX's abilities!

[0] https://stackoverflow.com/a/67590869/230778 [1] https://stackoverflow.com/a/69459361/230778 [2] https://gist.github.com/staticfloat/2ca67593a92f77b1568c03ea...

[+] torginus|3 years ago|reply

But that makes me think, what prevents people from running these calculations on the GPU? Even the memory is shared - the few 100 gflops they get out of the M1 ultra is pocket change in GPU terms.

[+] lupire|3 years ago|reply

That's good info but omg why a tweet of a crop of screenshot of a document.

[+] walrus01|3 years ago|reply

> If we rescale (by double!) the vertical axis to fit here, a pretty amazing picture emerges:

That second chart with the "amazing" redline is comparing it to 2010 through 2017 era CPUs. Now compare it to a current generation zen3 based $599 AMD CPU in a $350 motherboard with $450 of RAM.

Or a $599 Intel 12th-gen "core" series.

For $3999 nevermind $7999 you can build one real beast of a workstation that fits in a normal midtower ATX case.

[+] gigatexal|3 years ago|reply

It's funny to read about folks on HN where the audience is likely in the top eschelons of earning percentiles complain about a bespoke vertically integrated solution like the Apple computers. Sure you can build a bespoke, quiet, watercooled or aircooled, system for a fraction of the Mac Studio but that's not the audience the thing was designed to target. Its existence does not prevent one from doing just that, finding parts, and building a pc oneself. This is for folks that need OSX, that like the build quality, and want ARM performance without the hassle. :Shrug:

[+] pleb_nz|3 years ago|reply

I was running a Lenovo p15v with fedora before changing over to m1 pro MacBook. The hardware is nice, but to be honest it's not much of a speed up over my Lenovo so I'm giving it more time.

Docker is definitely a lot slower

I have read of massive performance improvement with Linux on m1pro though so it might be a swap to that when more distros are available.

Question is, has MacOS become bloated or not had attention to performance to make best use of the new hardware?

[+] jitl|3 years ago|reply

Are you running x86 containers or ARM containers? I’m on a 64GB M1 and my x86 container build takes 15 mins from scratch on the M1, 10 mins from scratch on my 2019 Intel MBP, but the ARM container build takes 6 mins on the M1. With or without the Docker pain the battery life is so worth it. Can spend 2 days working from the couch without plugging it in.

[+] ribit|3 years ago|reply

> Docker is definitely a lot slower

Docker on macOS is busted. I have heard that folks have had much better success with alternative implementation such as nerdctl. M1 virtualisation on itself is very fast and has almost no overhead.

> I have read of massive performance improvement with Linux on m1pro though so it might be a swap to that when more distros are available.

I wouldn't hold my breath. Doubt that M1 distros will ever become more than an impressive tech demo.

> Question is, has MacOS become bloated or not had attention to performance to make best use of the new hardware?

MacOS will always be "more bloated" than a basic Linux installation, it's an opinionated, fully features user OS that runs many more services in the background, e.g. code verification, filesystem event database, full-disk indexing etc. But the CPU/GPU performance is generally excellent. Of course, it boils down to the software you are running, if it is a half-assed port (like Docker on Mac) seems to be, it will eat up any advantage the hardware offers.

[+] electroly|3 years ago|reply

It's worth at least setting up a Linux VM for Docker. The performance improvement is huge and well worth the hassle. This works today; don't necessarily need to switch to running Linux on the metal. I don't think it's really about macOS, I think Docker Desktop just sucks.

[+] smoldesu|3 years ago|reply

> Question is, has MacOS become bloated or not had attention to performance to make best use of the new hardware?

Docker is just... faster on Linux. This has been the case for a while, and it's not just kernel-based stuff causing problems: APFS and MacOS' virtualization APIs play a pretty big role in weighing it down.

I'm kinda in the same boat, though. I got a work-issued Macbook that kinda just sits around, most of the time I'll use my desktop or reach for my T460s if I've got the choice. Mostly because I do sysops/devops/Docker stuff, but also because I never really felt like the Mac workflow was all that great. To each their own, I guess.

[+] jwilliams|3 years ago|reply

Docker is a lot slower on OSX. If you switch to ARM images that helps (especially stability).

A bigger impact is using the virtiofs (beta) file system mapping/sync. The existing one is horrendously slow to the point of being unusable.

[+] Nextgrid|3 years ago|reply

Docker on MacOS has always been a problem. On Linux it's essentially free, on MacOS you're running a Linux VM and have overheads of moving data between that and your main OS.

MacOS is somewhat bloated though. There's insane amounts of garbage running in the background.

[+] shp0ngle|3 years ago|reply

Docker is double slow on M1 mac compared to Intel linux.

First, you run virtual machine vs just running natively on Linux.

Second, you emulate x86; and, unlike other x86 that are emulated directly on M1, you emulate it in software through QEMU, because M1 cannot virtualize and run x86 at the same time

So it’s just really slow.

[+] jillesvangurp|3 years ago|reply

It would be interesting to see a more serious effort by others to put together a high performance SOC system for gaming or workstations.

I'm writing this on a cheap core i5 system with intel iris xe graphics. Mediocrity is the name of the game for this type of laptop. Everything is mediocre. The screen, the performance, the touchpad, etc. The only good thing about this system: no blazing fans. That seems to be a thing with most SOC based PCs/laptops. Mediocre performance. Non SOC based solutions exist of course but they suck up a lot more power to the point where laptops start having cooling issues and have to do thermal throttling. I've experienced that with multiple intel based macs. You spend all that money and the system almost immediately starts overheating if you actually bother to use what you paid for.

I actually used to have a wooden plank that I used to insulate my legs from my mac book pro. It was simply too uncomfortable to actually use on my lap.

[+] ChrisMarshallNY|3 years ago|reply

Theodolite is an awesome app (that I hardly ever need to use).

This guy def knows his math (and working with Apple apps).

[+] ac29|3 years ago|reply

Doesnt this just suggest they are leaving some performance on the table then? The reason the Intel processors scale non-linearly is because they run each core faster when there are less cores under load.

[+] sliken|3 years ago|reply

Dunno, looks like classic memory bottleneck to me. The m1 ultra has 800GB/sec, but I believe a bit more than half of that is available to the CPUs. The rest is for the GPU and various on chip accelerators.

So with about half the cores (16 vs 28) and twice the bandwidth (say 420GB/sec vs 180 GB/sec) it manages twice the performance. Looks pretty impressive to me. Looks like the Apple is significantly less memory bottle necked than the 6 channel Xeon W.

[+] ip26|3 years ago|reply

Only in a certain sense. You can only turn the voltage so high before things melt, and you can only turn the clocks so high until you crash, and then that's basically it, you can't go any faster.

The perfect core could be dialed from fanless to water cooling with linear performance, but it doesn't quite exist today. Intel chips have top end, Apple chips have low end, but an i7 fed 3W isn't going to perform and an M1 can't take any more voltage.

The tension is figuring out how to build very large structures to perform useful work with gobs of power, yet still scale down to low power budgets. Imagine a core that can dynamically morph from P core to E core & back on the fly.

[+] mrjin|3 years ago|reply

So a new discovery of Amdahl's Law?

https://en.wikipedia.org/wiki/Amdahl%27s_law

[+] bodge5000|3 years ago|reply

I need a new computer. The smart move, especially considering the cost involved, is to go for what I know; a decent enough desktop build with 3 monitors, linux/windows dual boot. What I've used for years.

But it must be said, I've been really tempted by the Macs. I'm not sure why, the 3 main things I do with my personal computer (game dev, playing games, watching things) are things that linux/windows can do at least as good if not better than a Mac, and yet here I am, holding off for months just trying to be convinced into the apple ecosystem.

I think its probably just the simplicity of it. I really like the idea of replacing a load of bulk (3 monitors, the vesa mount to hold them, big keyboard, mouse, bulky tower and a metric tonne of cables) with just a single laptop that I can pickup and go with at a moments notice, though Im not sure of the practicalities of that, at least for me. I don't even travel much, it just seems nice.

Edit: I know that a lot of people have a mac and a desktop to fill all needs, but that somewhat defeats the simplicity of it for me. One bulky computer is simpler for me than one bulky computer for some things and another much smaller computer for others.

[+] sofixa|3 years ago|reply

But the screen estate isn't even close between three monitors and a laptop. A nice middle ground is a laptop with a dock with monitors, that way you have the external screens when you need them, and can also be on the move, on the same computer.

Of the three main things you do, i think only watching things is comparable between macOS and Windows/Linux. Gaming is nonexistent on macOS, unless you stream ( either a cloud service like Stadia/GeforceNow or locally from a PC with Steam in-house streaming/Parsec), and i can't imagine doing game dev somewhere you can't even test.

[+] ProllyInfamous|3 years ago|reply

https://en.wikipedia.org/wiki/Gustafson's_law

[+] lupire|3 years ago|reply

That's awesome. Get a law named after yourself by taking an old paper and just saying "nuh-uh, the naive analysis is better".

[+] sjg007|3 years ago|reply

Are these things good for tensorflow and training 50GB AI models? Or is it better to stick with nvidia?

[+] MPSimmons|3 years ago|reply

Can someone please project that curve out and estimate where it starts to flatten?

[+] sliken|3 years ago|reply

That's not very useful. That curve might be a combination of a few different bottlenecks, l1 cache misses, scheduling bubbles, efficient scheduling of instructions within a certain window, l2 cache misses, cache contention in L1 or L2, AMX utilization, memory utilization (latency or bandwidth limited), MMU contention/page misses, etc. etc. etc.

The percentage of all those can change as you change the number of cores, and the different levels of the memory hierarchy get different levels of contention, latency limits, or bandwidth limits. So in an ideal world you can draw a graph and extrapolate, but in the real world you might do significant better or worse. There are cases (admittedly rather rare) where performance increases by more than linear in relation to cores.

[+] supernova87a|3 years ago|reply

Could these processor leaps in performance inadvertently help us stop burning up so much coal in the quest for Bitcoin?

[+] bigcheesegs|3 years ago|reply

No. The economics of Proof of Work mean that increases in compute per watt just lead to an increase in global hash rate. Total energy usage never goes down as long as it's sufficiently profitable to use the energy.

[+] bebort45|3 years ago|reply

I'm curious why Geekbench haven't put the Mac Studio on their Mac leaderboard yet - https://browser.geekbench.com/mac-benchmarks. There are plenty of benchmarks submitted https://browser.geekbench.com/search?page=7&q=Apple+M1+Ultra...

[+] jfpoole|3 years ago|reply

There’s a bug in the Browser that we haven’t been able to track down yet that’s preventing the Mac Studio from appearing on the leaderboard.

[+] 2OEH8eoCRo0|3 years ago|reply

[deleted]

[+] cglong|3 years ago|reply

Editorialized title. Original was "2022 Mac Studio (20-core M1 Ultra) Review".

[+] dang|3 years ago|reply

Changed now. (Submitted title was 'Near-linear speedup for CPU compute on 20-core Mac Studio'.) Thanks!

[+] MBCook|3 years ago|reply

I didn’t submit it but in this case the original title is so generic no one would have looked at it so I’m kind of happy they put the important part in the headline here.

[+] unknown|3 years ago|reply

[deleted]

[+] pvg|3 years ago|reply

email these in

[+] sydthrowaway|3 years ago|reply

I bet you can build an AMD system that beats this handily and costs half as much.

[+] wildrhythms|3 years ago|reply

And the power consumption?

[+] lupire|3 years ago|reply

Can you? how much will you charge to build one for me? Do you offer warranty support?

[+] sliken|3 years ago|reply

Not with 800GB/sec of memory bandwidth.

Or even the 440GB/sec of memory bandwidth available to the CPUs.

Sure if you are cache friendly enough and you get enough Zen3 or Intel cores you can win, but you end up spending a fair chunk of change, getting less memory bandwidth, and for a clear win you often need to spend more, like say getting a Lenovo Threadripper (and they have a exclusive rights to the chip for 6 months or something).

[+] Reason077|3 years ago|reply

> "Nano-texture glass gives up a little bit of the sharp vibrant look you get with a glossy screen, but it’s worth the trade in usability, to be able to see the screen without distractions all day long."

"Nano-texture glass" is pretty much just what all screens were like back in the days of CRTs and pre-glossy flat screens. Now Apple are charging $300 for it!

[+] numpad0|3 years ago|reply

It is a marketing name, but refers to a special procedure used to create the matte surface, not the fact that it’s matte. By the way, CRTs were glossy. We wiped them with wet towels.

[+] astrange|3 years ago|reply

"Nano-texture" is different from matte - matte LCDs don't have reflective glare, but they also have much lower contrast and you can see the grain if you look closely. Nanotexture doesn't have those issues, but it's expensive.

[+] olliej|3 years ago|reply

that was my thought as well ("yay marketing") but apparently it's actually structurally different so that it maintains contrast. I ordered one in early march and if it ever actually arrives I'll try to remember to reply on visible difference :D (not kidding, it was due late march, then slipped to Aripl 22-27, and today moved to May 23-?? )

[+] jjtheblunt|3 years ago|reply

the pixels are WAY smaller nowadays, though; perhaps that constrains the nanotexture fabrication process in an expensive way?

215 comments