top | item 25257932

Why is Apple's M1 chip so fast?

768 points| socialdemocrat | 5 years ago |erik-engheim.medium.com

629 comments

[+] benjaminl|5 years ago|reply

Unlike what has been said on Twitter the answer to why the M1 is fast isn’t due to technical tricks, but due to Apple throwing a lot of hardware at the problem.

The M1 is really wide (8 wide decode) and has a lot of execution units. It has a huge 630 deep reorder buffer to keep them all filled, multiple large caches and a lot of memory bandwidth.

It is just a monster of a chip, well designed balanced and executed.

BTW this isn’t really new. Apple has been making incremental progress year by year on these processor for their A-series chips. Just nobody believed those Geekbench benchmarks showing that in short benchmarks your phone could be faster than your laptop. Well turns out that given the right cooling solution those benchmarks were accurate.

Anandtech has a wonderful deep dive into the processor architecture.

https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...

Edit: I didn’t mean to disparage Apple or the M1 by saying that Apple threw hardware at the problem. That Apple was able to keep power low with such a wide chip is extremely impressive and speaks to how finely tuned the chip is. I was trying to say that Apple got the results they did the hard way by advancing every aspect of the chip.

[+] rayiner|5 years ago|reply

The answer of wide decode and deep reorder buffer gets much closer than the “tricks” mentioned in tweets. That still doesn’t explain how Apple built an 8-wide CPU with such deep OOO that operates on 10-15 watts.

The limit that keeps you from arbitrarily scaling up these numbers isn’t transistor count. It’s delay—how long it takes for complex circuits to settle, which drives the top clock speed. And it’s also power usage. The timing delay of many circuits inside a CPU scare super-linearly with things like decode width. For example, the delay in the decode stage itself scales quadratically with the width of the decoder: ftp://ftp.cs.wisc.edu/sohi/trs/complexity.1328.pdf (p. 15). The delay of the issue queues is quadratic both in the issue width and the depth of the queues. The delay of a full bypass network is quadratic in execution width. Decoding N instructions at a time also requires a register renaming unit that can perform register renaming for that many instructions per cycle, and the register file must have enough ports to be able to feed 2-3 operands to N different instructions per cycle. Additionally, big, multi-ported register files, deep and wide issue queues, and big reorder buffers also tend to be extremely power hungry.

On the flip side, the conventional wisdom is that most code doesn’t have enough inherent parallelism to take advantage of an 8-wide machine: https://www.realworldtech.com/shrinking-cpu/2/ (“The first sign that the party was over was diminishing returns from wider and wider superscalar designs. As CPUs went from being capable of executing 1, to 2, to 4, to even 6 instructions per cycle, the percentage of cycles during which they actually hit their full potential was dropping rapidly as both a function of increasing width and increasing clock rate.”). At the very least, such designs tend to be very application-dependent. Branch-y integer code like compilers tend to perform poorly on such wide and slow designs. The M1 by contrast manages to come close to Zen 3, which is already a high ILP CPU to begin with, despite a large clock speed deficit (3.2 ghz versus 5 ghz). And the performance seems to be robust—doing well on everything from compilation to scientific kernels. That’s really phenomenal and blows a lot of the conventional wisdom out of the water.

An insane amount of good engineering went into this CPU.

[+] libria|5 years ago|reply

Just to quantify your adjectives, per the Anandtech article:

> The M1 is really wide (8 wide decode)

In contrast to x86 CPUs which are 4 wide decode.

> It has a huge 630 deep reorder buffer

By comparison, Intel Sunny/Willow has 352.

[+] sliken|5 years ago|reply

You mention most of the big changes, except one. Assuming a random (but TLB friendly) pattern the M1 manages a latency of around 30-33ns to main memory, about half of what I've seen anywhere else. Impressive.

Maybe motherboards should stop coming with dimms and use the apple approach to get great bandwidth and latency and come in 16, 32, and 64GB varieties by soldering LPDDR4x on the motherboard.

[+] Kolja|5 years ago|reply

> Just nobody believed those Geekbench benchmarks showing that in short benchmarks your phone could be faster than your laptop.

Except for a lot of the Apple-centric journalists and podcasters, who have been imagining for years how fast a desktop built on these already-very-fast-when-passively-cooled chips could be.

Not that that matters very much when experienced and real-world workload performance suffers, but as far as I can tell, the M1 is no slouch in that respect either.

[+] vmchale|5 years ago|reply

> Just nobody believed those Geekbench benchmarks showing that in short benchmarks your phone could be faster than your laptop.

I saw a paper on (I think?) SMT solvers on iPhone. Turned out to be faster than laptops, I kind of brushed over it as irrelevant at the time.

[+] GeekyBear|5 years ago|reply

Another detail that came out today is just how beefy Apple's "little" cores are.

>The performance showcased here roughly matches a 2.2GHz Cortex-A76 which is essentially 4x faster than the performance of any other mobile SoC today which relies on Cortex-A55 cores, all while using roughly the same amount of system power and having 3x the power efficiency.

https://www.anandtech.com/show/16192/the-iphone-12-review/2

[+] gabereiser|5 years ago|reply

Spot on. Exactly this. It’s like pre-iPhone when people just assumed you had a laptop and a cellphone. Then Apple said “phone computer!” and changed the game. Same with iPad just less innovation shock. Meanwhile we continued to have this delineation of computer / phone while under the hood - to a hardware engineer - it’s all the same. Naturally the chips they produced for iOS-land are beasts. My phone is faster than the computer I had 5 years ago. My M1 air is just a freak of nature. On par with high end machines but passively cooled and cheaper. I’m still kinda in awe. Not a big fan of the hush hush on Apple Silicon causing us all to play catch-up for support, but that’s Apple’s track record I guess.

The M1 is all the things they learned from the A1-A12 chips (or whatever the ordering) which is over a decade of tweaking the design for efficiency (phone) while giving it power (iPad).

[+] throwarchitect|5 years ago|reply

> the M1 is fast isn’t due to technical tricks, but due to Apple throwing a lot of hardware at the problem.

Apple threw more hardware at the problem and they lowered the frequency.

By lowering the frequency relative to AMD/Intel parts, they get two great advantages. 1) they use significantly less power and 2) they can do more work per cycle, making use of all of that extra hardware.

[+] skavi|5 years ago|reply

> Unlike what has been said on Twitter the answer to why the M1 is fast isn’t due to technical tricks, but due to Apple throwing a lot of hardware at the problem.

Of course Apple’s advantage is not solely due to technical tricks, but neither is it entirely, or even mostly due to an area advantage. If it were so easy, Samsung’s M1 would have been a good core.

[+] gigatexal|5 years ago|reply

Yeah the article is interesting I just like knowing that Apple will keep iterating and the performance gap between Apple Silicon and x86 will continue to grow. I keep spec'ing out an Apple M1 Mac mini only to not pull the trigger because I am curious what an M2 will hold.

[+] raxxorrax|5 years ago|reply

The problem is that a phone with a lightning fast CPU is rather useless with current ecosystems.

I do think there are "technical tricks" though, especially the compatibility memory mode that makes x86 emulation faster than comparable ARM chips. If you call it a trick or finesse is probably a matter of perspective.

[+] Taniwha|5 years ago|reply

Not mentioned in the article is the downside of having really wide decoders (and why they're not likely to get much larger) - essentially the big issue in all modern CPUs is branch prediction because the cost of misprediction on a big CPU is so high - there's a rule-of-thumb that in real-world instruction streams there's a branch every 5 instructions or so ... that means that if you're decoding 8 instructions each bundle has 1 or 2 branches in it, if any are predicted taken you have to throw away the subsequent instructions - if you're decoding 16 instructions you've got 3 or 4 branches to predict, chances of having to throw something away gets higher as you go .... there's a law of diminishing returns that kicks in, and in fact has probably kicked in at 8

[+] mixmastamyk|5 years ago|reply

The 5nm process is a large factor as well.

[+] cbsmith|5 years ago|reply

...and Apple was able to throw hardware at the problem because they got TSMC's manufacturing process. When everyone else is using 5nm, let's see if any of this other stuff actually matters.

[+] Guthur|5 years ago|reply

M1 is also the only TSMC 5nm chip that is widely available and there is nothing remotely comparable from a process stand point.

[+] PeterisP|5 years ago|reply

Are they really throwing more hardware at the problem?

The die size for the whole M1 SoC is comparable to or even smaller than Intel processors, and the vast majority of that SoC is non-CPU related stuff, the CPU cores/cache/etc seem to be at most 20% of that die - though a more dense die because of the 5nm process. This also seems to imply that the 'budget' of number of transistors for that CPU-part of the SoC is also comparable to previous Intel processors, not a significant increase. (Assuming 20% of the 16b transistors in M1 is CPU part, it would be 3-ish billion transistors, and the Intel does not seem to publish transistor counts but I believe it's more than that for the Intel i9 chips in last year's macbookspro)

Perhaps my estimates are wrong, but it seems that they aren't throwing more hardware, but managing to achieve much more with the same "amount of hardware" because it is substantially different.

[+] ori_b|5 years ago|reply

The next question: What prevents Intel or AMD from doing this on their processors?

[+] davrosthedalek|5 years ago|reply

How does 8 wide decode on ARM RISC compare to 4 wide decode on x64 CISC? If, say, you'd need two RISC ops per CISC op on average, that should be the same, right?

[+] Sparkyte|5 years ago|reply

Thank you for a better synopsis.

Bugs me so much when people don't look at the logical side of things. Tons of mac'n'knights going around downvoting and stating people are wrong that it has something to do with being a RISC processor.

While fundamentally pre 2000s things were more RISC this and CISC that the designs are more similar than ever on x86 and ARM. Just that the components are designed different to handle the different base instruction sets.

Also the article is entirely wrong about SoC Ryzen chips have been SoC since their initial inception. In fact SoC on the First APU. Those carried North Bridge components onto the CPU die.

[+] legulere|5 years ago|reply

It also has shared 12MB of L2 cache for the performance cores which is huge.

[+] baybal2|5 years ago|reply

> But Apple has a crazy 8 decoders. Not only that but the ROB is something like 3x larger. You can basically hold 3x as many instructions. No other mainstream chip maker has that many decoders in their CPUs.

The author completely misses "the baby in the water"

Yes, X86 core are HUGE, the whole of CPU is for them only.

They can afford wider decode, even though at a giant area cost (which itself would be dwarfed area cost of cache system area.)

The thing is, have more decode, and buffer will still not improve X86 perf by much

Modern X86 has good internal register, and pipeline utilisation, it's simply they don't have something to keep all of those registers busy most of the time!

What it lacks is memory, and cache I/O. All X86 chips today are I/O starved at every level. And that starvation also comes as a result of decades old X86 idiosyncrasies about how I/O should be done.

[+] saberience|5 years ago|reply

I find it interesting you use this kind of disparaging tone when discussing Apple Silicon. I also find it interesting that you consider having a wide decoder not as a technical trick but as "throwing hardware at the problem."

However you try and spin it, what it comes down to is this, Apple is somehow designing more performant processors than every other company in the world and we should acknowledge they are beating the "traditional" chip designing companies handily while being new at the game.

If it's as easy as "throwing hardware" at the problem, then Intel and AMD and Samsung etc should have no problem beating Apple right?

[+] ogre_codes|5 years ago|reply

While the M1 is super impressive, I'm kind of wondering how scalable the performance increases they made here are.

They moved high bandwidth RAM to be shared between the CPU & GPU, but they[1] can't just keep expanding the SoC[2] with ever larger amounts of RAM. At some point they will need external RAM in addition to the on chip RAM. Perhaps they will ship 32GB of onboard memory with external RAM being used as swap? This takes a bit away from the super-efficient design they have currently.

Likewise, putting increasingly larger GPUs onto the SoC is going to be a big challenge for higher performance/ Pro setups.

I think Apple really hit the sweet spot with the M1. I suspect the higher end chips will be faster than their Intel/ AMD counterparts, but they won't blow them out of the water the way the current MacBook Air/ MBP/ mini blow away the ~$700-2,000 PC market.

[1] I updated the text here because I'd originally commented here about the price of RAM which is largely irrelevant to the actual limitations.

[2] As has been pointed out below, the memory is not on the die with the CPU, but is in the same package. Leaving the text above intact.

[+] tigen|5 years ago|reply

"Now you got a big problem, because neither Intel, AMD or Nvidia are going to license their intellectual property to Dell or HP for them to make an SoC for their machines."

This isn't quite correct. AMD and Nvidia have made custom SOCs for game consoles for years now; they could create customized SOCs for PCs if this was a critical issue. But I don't believe it is. The PC workloads and use cases follow well-known patterns and market segments and you don't really need a lot of different custom accelerators, or be on a single SOC for that matter. Video-related processing has been part of GPUs for many years.

For that matter, making RISC CPUs is not actually a huge obstacle for these guys. (Nvidia does.) Yes even Intel.

The article also seems to entirely omit the M1's heterogenous core strategy, where 4 of the cores are high performance and the other 4 are optimized for power efficiency. A deeper analysis of this and how software manages them would be more interesting.

[+] spamizbad|5 years ago|reply

Gotta say I'm not a fan of this article. A few points I really disagree with:

1) It puts way too much emphasis on the ISA being a driver for performance. People have been making this claim against x86 forever, going back to the 90s when RISC chips started putting up some very impressive performance numbers. Lots of talk about how the x86 ISA was out of runway. This period existed fo a grand total of about 4 years (94 thru 97) until the Pentium II hit the market, which allowed PCs to start cutting into the workstation market. AMD creating the x86-64 extension handled the addressing limitation well too, letting it flourish in the server space as well.

2) The author seems to think Intel or AMD cannot make SoCs because of their business model. This really isn't the case, and both companies have slowly been moving more "stuff" on package. AMD is already there with the xbox an playstation chips they manufacture. AMD's chiplet approach would be particularly adroit at exploiting this. I get the feeling both AMD and Intel would LOVE it if PC vendors started demanding SoCs. And while the M1 has lots of little speciality processors onboard, both AMD and Intel have more robust processor I/O. That's by design: The M1 isn't designed for the person who needs a gazillion PCI-E lanes for storage and networking (the M1 is likely has half the PCI-E lanes as consumer Zen2/3 cpus and 1/4th to 1/8th compared to Threadripper and EPYC)

3) Glosses over the fact that one of the greatest challenges in computer architecture is coping with the so-called "memory wall": there's a growing latency chasm between cache and memory. Apple cleverly mitigates this by using very fast, low latency memory (LPDDR4x 4266 MT/s vs DDR4 at 3200 MT/s in x86 CPUs). But beyond how many MT/s you're operating with, I would love to see some latency figures. They likely blow your typical consumer DIMMs out of the water. Both AMD and Intel wok around this limitation with larger caches and smarter prefetching, but that only gets you so far. edit: And I would wager much of its impressive performance figures comes from Apple's approach to addressing the memory wall. Memory latency benchmarks seem to strongly favor the M1 over both AMD and Intel CPUs.

[+] AnthonyMouse|5 years ago|reply

The explanation provided by the article is poor. It has very little to do with SoC or specialized instructions -- it's fast on general purpose code.

The main reasons appear to be that it's a solid design, i.e. competitive with what x86-64 processors are offering, and on top of that has some specific characteristics that x86-64 processors don't currently have (but obviously could). It's built on TSMC 5nm, which is possibly the most power efficient process in the world right now. It's using memory which is faster than the DDR4 currently used by the competition. It's a big.LITTLE design, which is more power efficient, because you get strong single thread performance from the big cores and higher efficiency on threaded code from the little cores.

It's not magic. It's engineering.

[+] d33lio|5 years ago|reply

It'll be curious when they inevitably release the "M2" chip with higher memory capacity. If M2 has 32gb support I'll buy one immediately. For now, I'll be sticking to my "never buy gen 1 of any technology" rule. Right along side my, "use tech until it breaks" rule, however if you incentivize yourself to be a "power user" meeting the latter rule is far easier than the former ;) .

[+] FullyFunctional|5 years ago|reply

I'm am very appreciative to Apple for finally closing the endless "The ISA doesn't matter" nonsense. The ISA does matter and it's part of what enabled Apple to go so wide.

However, there the article is slightly misleading. The ROB isn't where instructions are issued from, that would be the schedulers which usually holds a much smaller set of instruction (16 per scheduler is common). The ROB holds everything that was decoded and not yet retired, thus including instruction that haven't yet been issued to schedulers and more importantly, instruction that have been executed for not retired (eg. might be waiting on older instruction to retire first).

[+] addaon|5 years ago|reply

My recollection is that when P. A. Semi was acquired by Apple, a major part of their "secret sauce" was a custom cell library. And then Intrinsity, a few years later, also seemed to be using custom cells on Samsung's process. Is this recollection / understanding correct? Is Apple still using custom cell libraries rather than TSMC's standard library?

[+] cromwellian|5 years ago|reply

Couldn't Intel and AMD implement a "Rosetta2" like strategy? That is, couldn't they ship CPUs that fundamentally are not decoding x86 ops, but some different ISA, and then layer a translation layer on top of it?

The Transmeta Crusoe used to do this, and I think the NVidia Denver cores did too?

I think fundamentally though this analysis highly depends on single-threaded performance being the bottleneck for most apps. A lot of sophisticated workloads are GPU or TPU driven, and so hand-waving away multi-thread performance, and treating single-thread performance as the end-all for client side performance I think is overemphasizing its importance.

Also, the "they don't control the stack" argument is wrong. Effectively, Intel/AMD + Microsoft acted as a duopoly, and if you toss in NVidia, the structure of DirectX and Windows is largely a cooperative effort between these OEMs. If Intel/AMD needed some fundamentally new support for some mechanism to boost Windows performance by 20-30%, Microsoft would work with them to ship it.

[+] raiyu|5 years ago|reply

The whole m1 vs intel and AMD saga reminds me of innovators dilemma and the lesser referenced investors solution by Clayton Christensen.

In the solution book it basically details that everything goes through an ebb and flow of aggregation and disaggregation. This was often pointed out with Craigslist as the original aggregator and how many startups have been founded now to disaggregate it.

Interesting to read this article from that perspective. Intel and AMD disaggregated and found success for decades but now aggregating with SoC looks to yield massively better results.

Interesting to see certain business patterns continue to reappear.

Also shows that in tech no one has dominance if they don’t adapt.

[+] areoform|5 years ago|reply

As I've written elsewhere and the article discusses in excellent detail, something that's missing in the conversation is a discussion of the custom application specific silicon over here, https://areoform.wordpress.com/2020/11/30/tanstaafl-apples-m... (on HN, https://news.ycombinator.com/item?id=25256025 )

I dislike linking to myself, but the take is long and it essentially points out that Apple themselves hint at Application Specific silicon in the M1 in their keynote, and when added with their acquisitions of ASIC specialists, it's not a long-take to say that comparing the M1 against most vanilla X86 processors isn't quite fair. It's a very different beast with its own tradeoffs.

By colocating several different application specific processors under one roof and selling it as a CPU, Apple may have revitalized/invented something new. It's a bit like the iPhone - taking elements that were already there and catalyzing them into a new form. This new not-quite-a-CPU SoC likely has significant implications for the future of computing that should be discussed and detailed. Because,

There Ain't No Such Thing As A Free Lunch.

[+] littlecranky67|5 years ago|reply

I am wondering how the M1 is doing regarding the various side-channel attacks that came to live with the Spectre/Meltdown publications. Is the M1 less prone to side channel attacks by design, or is this just something not yet known?

[+] SoSoRoCoCo|5 years ago|reply

A few years ago, Apple's Portland, Oregon office hired dozens of top architects away from the Intel site 10 miles away (for big $$$). For those of you that don't know Oregon Hosts D1X (Intel's gigantic cutting-edge fab) and several architecture teams, predominantly veterans from the P4 product line.

This probably has something to do with it.

[+] hardwaregeek|5 years ago|reply

Ugh. Any way around the requirement to sign in with Google? Very frustrating that this is required to read a damn blog post.

[+] pkulak|5 years ago|reply

All these specialized co-processors make me a bit sad. I've already got a graphics card that can decode some video formats, but not new ones, and that will never change until it becomes e-waste. Now we get the same thing for image formats, ML, etc? And you're even more locked in to these co-processors since they use space that could have been used to have a dozen general purpose cores. So good luck ever watching a video that doesn't use a codec your hardware supports.

[+] Thaxll|5 years ago|reply

Also isn't Apple has it "easy" designing a chip from scratch with 0 backward compatibility + they manage the HW + SW.

Intel & AMD have to support thousands of different config:

- I can plug my old memory it works

- I can plug my old graphic card it work ( PCI-e )

Apple has 0 constrains about legacy hardware, software ( beside that emulation layer ), so yes it's a great achievments, but it's very different from designing a chip for PC.

[+] majewsky|5 years ago|reply

Question for the chip design crowd here: Would it be feasible for Intel/AMD to design a fast ISA from the ground up and add it as a runtime option to their x86_64 chips? Like, on context switch to userland, the kernel puts the chip into whatever ISA is in the respective process's binary. And the syscall instruction and interrupts put the chip back into the kernel ISA. So you would have backwards compatiblity, but new stuff could use the faster ISA. I guess my main question is if there would be enough die space to support two separate instruction decoders.

[+] KingMachiavelli|5 years ago|reply

Vertical integration seems to have really paid off for Apple as it's really what allowed them to pull this off. While I certainly expect cloud providers to continue adopting ARM CPUs, I'm worried that the enthusiast/DIY desktop market will be irrelevant if x86 continues to lag behind ARM. It certainty doesn't seem like buying an Ampere server is as simple as getting an x86 machine. Perhaps POWER10 or eventually RISC-V offerings will become more accessible & competitive?

[+] minddeep|5 years ago|reply

Great explanation and also a good read on challenges Intel and AMD will face going forward. I am wondering if any of the windows / linux laptop makers will follow suit? Or will it be too hard to use the generic windows / linux and optimize it for custom built hardware?

[+] jonnycomputer|5 years ago|reply

So my 2012 MBP is on its last legs (love that thing). So now we have the M1, and I'm hearing impressive things. But my question is, as a developer who does, lets face it, pretty much the gamut, will I be handicapped in the near term? I use a combination of IntelliJ IDEs and Emacs, with MacPorts, and a large variety of open source and closed source tools. I just don't want the headache of having some part of my workflow cutoff.

[+] TuringNYC|5 years ago|reply

Same question here, except I disagree with all the discussion on other threads about "wow, 16gb RAM is magically sufficient for most use cases." I'm sure it is now sufficient for web browsing + IDE + Slack, but what about anything beyond that?

I realize 16gb RAM goes farther on M1 vs before, but some things just need RAM and there is no way around it. If i'm running a VM or multiple VMs (e.g., for android app development), or multiple docker containers -- I just need more RAM.

I'm hoping that the upcoming 16" MBP upgrade offers more RAM tiers.

[+] jonplackett|5 years ago|reply

In almost the same position (MacBook Pro 2013) and I plan on waiting for the 16". By the time that's out we'll see how many of these things are getting quickly ported - my expectation looking at how excited every is, is that things will get ported FAST. But there's time to wait and see before the one I actually want comes out.

[+] thoughtsimple|5 years ago|reply

MacPorts is working pretty well. About 80% in my testing. I've been able to work around anything that doesn't build. It seems much further along than HomeBrew.

[+] PeterisP|5 years ago|reply

Well, at least Homebrew are claiming that "There won’t be any support for native ARM Homebrew installations for months to come." (https://github.com/Homebrew/brew/issues/7857) so if you rely on a large variety of open source and closed source tools that might be a problem.

[+] greyhair|5 years ago|reply

Just had this discussion at work. We have a mix of MacBook Pros and high end Windows 10 laptops, and every one of them is running a VM hosting Ubuntu 18.04

Is the M1 going to have a VM capable of running a native Ubuntu instance?

[+] jonnycomputer|5 years ago|reply

Based on what people are saying here, it does seem that by the time the 16" comes out, support should be pretty good. That's pretty impressive, in a way.

[+] saagarjha|5 years ago|reply

MacPort isn't doing all that badly right now.

[+] golemiprague|5 years ago|reply

[deleted]