
Undocumented arm64 ISA extension present on the Apple M1

260 points | ytch | 5 years ago | gist.github.com

198 comments

[+] lukeh|5 years ago
I see undocumented instruction extensions as a special case of private API: it buys Apple the freedom to change the underlying implementation in the future. As long as it's done in a manner that isn't anticompetitive, I don't see the problem.
[+] GeekyBear|5 years ago
The point of the Accelerate.framework is that it is the API and also abstracts away hardware differences between the processors of macOS, iOS, tvOS, and watchOS devices.

>Accelerate provides high-performance, energy-efficient computation on the CPU by leveraging its vector-processing capability. The following Accelerate libraries abstract that capability so that code written for them executes appropriate instructions for the processor available at runtime

https://developer.apple.com/documentation/accelerate

[+] dooglius|5 years ago
How do you write a competing BLAS? Presumably, you need to reverse engineer these to be competitive with Apple's solution.
[+] vbezhenar|5 years ago
What’s wrong with releasing documentation clearly stating that those instructions are not guaranteed to be present in the future? Apple does not hesitate to break compatibility anyway.
[+] toxik|5 years ago
Like private APIs, they ultimately transfer power from consumers to Apple. They surely do have documentation internally, they just don’t want to commit to anything.
[+] mlazos|5 years ago
I'm thinking this is related to the machine-learning coprocessor. They likely expose this to developers through their ML API and don't want to have to support it for external customers if it changes. Still, this is a great find!
[+] texse|5 years ago
These instructions are not for the Neural Engine. That is a separate hardware block outside of the CPUs. Apple refers to this feature as "AMX" in their marketing documentation.

While AMX could be used for deep learning in a pinch despite the lack of support for common formats like fp16 and int8, I suspect Apple had some other use cases in mind as well. For example, 64-bit float support is expensive and generally useless for ML. However, 64-bit floats are useful (though not necessarily required) in problems such as bundle adjustment that may appear in the context of frameworks like ARKit.
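A quick NumPy illustration (nothing AMX-specific, runs anywhere) of why 64-bit support matters for optimization problems like bundle adjustment: float32's 24-bit significand silently drops small increments once an accumulator grows large, which is tolerable for ML but not for accumulating millions of residual terms.

```python
import numpy as np

# float32 has a 24-bit significand: at 2**24, adding 1.0 changes nothing.
big32 = np.float32(2**24)
assert big32 + np.float32(1.0) == big32   # the increment is silently lost

# float64's 53-bit significand handles this magnitude with room to spare.
big64 = np.float64(2**24)
assert big64 + 1.0 == 2**24 + 1

# Sequentially accumulating 100_000 ones on top of 2**24:
acc32, acc64 = np.float32(2**24), np.float64(2**24)
for _ in range(100_000):
    acc32 += np.float32(1.0)
    acc64 += 1.0
print(acc32 - 2**24, acc64 - 2**24)   # float32 gained nothing; float64 gained them all
```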

[+] londons_explore|5 years ago
Other vendors' ML hardware is very much "a far-away device over the PCI Express bus that you hand a bunch of work to and later come back to see if it's done".

It looks like Apple has decided to make it very tightly integrated into the CPU. That means the CPU core can't go off and do other useful work while ML operations are running. It also forces all cores to have matching ML hardware (or adds a lot of OS complexity, since certain threads could only run on some cores). The benefit is that the latency to get ML work done drops from microseconds to nanoseconds.

I suspect Apple probably messed up with that tradeoff... Very few applications can't wait a few microseconds for the results of some ML computation...
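The tradeoff can be sketched with a toy cost model (every number below is an illustrative assumption, not a measurement of any real hardware):

```python
# Toy latency model: when is it worth shipping a matrix multiply to a
# far-away accelerator instead of doing it in-core?
PCIE_OVERHEAD_S = 5e-6   # assumed ~5 us round trip to a discrete accelerator
CPU_FLOPS = 1e11         # assumed ~100 GFLOP/s for an in-core matrix unit
ACCEL_FLOPS = 1e12       # assumed ~1 TFLOP/s for the discrete accelerator

def breakeven_flops(overhead_s, slow, fast):
    """Work (in FLOPs) below which the slow-but-local unit wins.
    overhead + w/fast == w/slow  =>  w = overhead / (1/slow - 1/fast)"""
    return overhead_s / (1.0 / slow - 1.0 / fast)

w = breakeven_flops(PCIE_OVERHEAD_S, CPU_FLOPS, ACCEL_FLOPS)
# A dense n x n matmul costs about 2*n**3 FLOPs; solve for n at break-even.
n = (w / 2.0) ** (1.0 / 3.0)
print(f"break-even ~ {w:.2e} FLOPs, i.e. roughly a {n:.0f}x{n:.0f} matmul")
```

Under these made-up numbers, anything smaller than roughly a 65x65 multiply is faster done locally, which is the regime where nanosecond-latency in-core hardware pays off.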

[+] lwhi|5 years ago
I don't understand the significance of this.

Could someone who does add some context?

[+] vegetablepotpie|5 years ago
Apple added custom instructions on top of the ARM instruction set to do matrix operations.

Matrix operations are used a lot in some algorithms, such as in computer graphics and machine learning and these instructions help those operations go faster.

Although the ARM ISA has a multiply-accumulate instruction, computing a matrix product with only standard instructions would take many more of them on a standard ARM core. These extensions let the M1 chip be faster in such cases when developers (or Apple's libraries) use them.
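A minimal sketch of the point above: a matrix product is nothing but repeated multiply-accumulate (MAC) steps, and an AMX-style unit retires a whole tile of them per instruction instead of one at a time.

```python
# Naive matrix multiply, counting the multiply-accumulate steps it takes.
def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    macs = 0
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]   # one MAC per innermost step
                macs += 1
    return C, macs

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C, macs = matmul(A, B)
print(C, macs)   # [[19.0, 22.0], [43.0, 50.0]] after 8 MACs
```

Even this tiny 2x2 product needs 8 MACs; the count grows as n^3, which is why issuing them a tile at a time matters.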

[+] ericlewis|5 years ago
From what I can tell reading this: the Accelerate framework from Apple seems to use custom instructions, possibly unique to the M1 chip but potentially similar to Intel AMX. Accelerate is, as I sort of understand it, a way to do advanced math in a very quick / power-efficient way. It has been touted as deeply integrated with Apple processors from the day it launched and (I think) is even recommended over hand-written SIMD and the sort (I could certainly be very wrong about this). It dovetails with some of the machine-learning work Apple did as well, is portable among Apple systems, and can be assumed to be optimized.

Ninja edit: I’ve not seen how it was achieved until now, but a custom ISA extension isn’t wholly surprising.

Edit 2: as for why this is significant... I'm not sure it is. It is interesting to know how Accelerate pulls off what it does, though. IIRC there are other machine-learning frameworks which take advantage of the Neural Engine, but Accelerate possibly doesn't, despite having machine-learning capabilities.

[+] ChuckMcM|5 years ago
Like Intel, Apple has chosen to add some instructions to the processor which accelerate certain operations (in this case matrix operations). Unlike Intel, Apple has not yet chosen to actually document the M1 architecture. This is where a vertically integrated company like Apple (or IBM back in the day) can leverage "inside knowledge" about their chips to achieve performance that is not readily comparable (or reproducible) on a different instruction set architecture (ISA).

Historically (see IBM vs Memorex) obscuring your interfaces has been a losing proposition in the long term, even while delivering favorable margins in the short term. Unlike Intel which needs to get third parties to write software for their chips, Apple writes their own software and so it means they can hold on to their advantage while third parties (like the author of the post) reverse engineer what is going on.

It will be "historically significant" if Intel adds a feature to their chips in order to stay competitive with this development. The last time they did that was when AMD introduced Opteron. All in all, it's basically just a puzzle that people who are interested in how things work get to work on in their spare time.

[+] jonathonf|5 years ago
From my limited understanding: it looks like a matrix coprocessor, and implies that hardware-accelerated matrix processing has been built into the M1.

Matrix processing comes in useful for AI/ML/CV-type applications. It's something GPUs are generally very good for.

[+] ytch|5 years ago
Theoretically we could find out why the M1 is fast from its instructions. But there is no instruction manual, so we can only reverse engineer it.
[+] mhh__|5 years ago
In the bigger picture Apple aren't really documenting anything for their new toys, which is not the end of the world but is a huge step back for openness. There is - as far as I can see - no long-form documentation for M1 other than patents.

I already refuse to own Apple products so I don't really have any skin in the game, but consider that if Microsoft were like this, a blind eye would not be turned - buy our new product, everything you run is up to us, we won't document it, maybe we'll upstream the compiler, tough shit.

The fact that they can be like this is really proof that their market position is much stronger than most HNers seem to think it is.

[+] joseph_grobbles|5 years ago
Apple added special AMX instructions specifically for matrix operations. They added them back with the A13, and then improved them for the M1. These are primarily focused on machine learning training where you do backpropagation through huge matrix operations.

They provide a very wide coverage library set that works on all supported platforms optimally. You interact with AMX through those libraries, and direct usage is considered unsupported. Apple may completely change it with the next iteration, where they can just change those libraries and every arm's-length application will get those benefits for free.

Apple sells whole systems. They aren't selling CPUs for third parties, and there is no reason they need to encourage their own proprietary extensions. Indeed, it would be more evil if they were asking people to pepper their own code with this, instead of just using libraries that abstract you from the magic.

As a tiny, tiny vendor in the computing space -- as HN often assures us -- I'm not seeing the great evil many are claiming.

(Indeed, of course this is downvoted, but I chuckle that another comment insists that Apple is evil because this will lead to code and binaries that only work on Apple devices. Which is quite literally the opposite of what Apple is doing: forcing use through optimized libraries that abstract away their proprietary extensions. People just love bitching.)

[+] hyperpallium2|5 years ago
After four different CPU ISAs (68000, PowerPC, x86, ARM), an Apple ISA is next.
[+] Teongot|5 years ago
There's a tidy mathematical progression in Apple CPU architectures.

6502 (Apple 1,2,3): 1976

68k (Mac): 1984 (+8 years)

PPC: 1994 (+10 years)

x86: 2006 (+12 years)

ARM: 2020 (+14 years)

If the trend continues, we will see Apple's next architecture in 2036.
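The arithmetic behind that extrapolation, for the curious:

```python
# Gaps between Apple's architecture transitions grow by two years each time.
years = [1976, 1984, 1994, 2006, 2020]   # 6502, 68k, PPC, x86, ARM
gaps = [b - a for a, b in zip(years, years[1:])]
print(gaps)                      # [8, 10, 12, 14]
print(years[-1] + gaps[-1] + 2)  # next gap would be 16 years -> 2036
```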

[+] masklinn|5 years ago
Apple doesn't really need their own ISA, because they have an architectural ARM license, and complete design knowledge in-house.

Meaning they can do anything they want on an ARM base; they're already not beholden to any third-party designer or roadmap, which is what hampered them with both PPC and x86 (I'm not old enough to remember the 68k and to really know why they moved off of it).

The only reason they'd have to move away from ARM is if the ISA ends up preventing them from doing something, somehow.

[+] bitwize|5 years ago
From a third-party developer's perspective the ISA will be Swift.

The actual CPU ISA will be a closely guarded trade secret. It will, however, run Swift code 2-3x faster (on a single-core basis) than any commercially available commodity CPU -- especially since Apple pays silicon fabs to not manufacture 3nm hypercubic gallium-arsenide ICs or whatever the cutting edge is on behalf of competitors.

[+] jhayward|5 years ago
It is really unfortunate that the very commonly used numerical/scientific packages Numpy and Scipy removed[1] Apple’s Accelerate framework compatibility this year due to ‘errors in the framework’. It looks like Python users won’t have access to this capability from the M1 based Macs for some time.

[1] https://github.com/numpy/numpy/issues/15947
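For anyone wanting to check which BLAS/LAPACK backend their own NumPy build is linked against (OpenBLAS, MKL, or Accelerate on older macOS builds), NumPy ships a build-information helper:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries this NumPy build was compiled against.
np.show_config()
```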

[+] stephc_int13|5 years ago
I hate Apple for that. They very often use non-standard tech, and we'll end up with Apple proprietary processors instead of standard x64/ARM64.
[+] princekolt|5 years ago
Competitors can't compete with them even with standard stuff. Good for them to move forward. It's funny how innovation becomes a "bad" thing when a company you don't like innovates "too much".