
ARM64 and You

296 points| zdw | 12 years ago |mikeash.com | reply

106 comments

[+] Tuna-Fish|12 years ago|reply
This is a reasonable, short overview of the programmer-visible side of changes in A64.

For those interested, there's another side. A64 drops all the features of the ISA (inline variable shifts, conditional execution, variable-width instructions) that are hard to implement in a fast, high-power CPU. If a CPU does not need any 32-bit ARM compatibility, there's no reason one couldn't make a 4GHz 4-wide superscalar one based on A64.

[+] ajross|12 years ago|reply
This is true, but that last sentence is a doozy. The AArch64 ISA drops that stuff. The A7 CPU, which must remain compatible with legacy code, must not. So yes, a theoretical CPU would have a much easier design task, but in the real world we'll never see one.

And in any case: there's another reasonably well-known company out there with an even cruftier ISA making 4+GHz 6-wide superscalar cores with backwards 32 (and 16!) bit compatibility modes. Instruction decode is the sort of thing programmers understand, so we tend to get hung up on it when talking about CPUs. It's really not a meaningful design limitation to a CPU core implemented with hundreds of millions of transistors.

[+] mappu|12 years ago|reply
Wait, AArch64 really drops conditional execution?

When I was learning x86 assembly, discovering CMOV was fantastic; it drastically simplified all my code (versus cmp+je+hundreds of extraneous labels, notwithstanding macros). The fact that ARM could do that for almost all instructions was one of the main reasons, in my mind, why ARM was considered a "cleaner" architecture than x86.

EDIT: crisis averted, a comment further down ( https://news.ycombinator.com/item?id=6458457 ) clarifies the change.

[+] btian|12 years ago|reply
There is one - power consumption.
[+] justincormack|12 years ago|reply
Yes, it's interesting to see a really new ISA, much newer than the old high-end contenders like Sparc and Power.
[+] mullr|12 years ago|reply
It never occurred to me that they'd be using tagged pointers for Objective-C runtime stuff. Of course it's obviously a good idea, but only after hearing it does it become so. Objective-C is always more dynamic than you think it is, so taking implementation cues from other dynamic language runtimes makes perfect sense.

It appears that they've been using tagged pointers on the desktop since 10.7, which I never realized: http://objectivistc.tumblr.com/post/7872364181/tagged-pointe...
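
For anyone who hasn't met the idea before, here is a minimal sketch of a tagged pointer, modelling a 64-bit word as a Python integer. The layout is an assumption made up for illustration (low bit set means "the payload lives in the word itself"); it is not the Objective-C runtime's actual encoding, and the function names are hypothetical.

```python
# Toy model of tagged pointers: a 64-bit word is either a real (aligned)
# address or a small integer stored directly in the word.
# Assumption: the lowest bit is the tag; real runtimes choose their own layout.

WORD_MASK = (1 << 64) - 1
TAG_BIT = 0x1        # hypothetical: 1 = inline payload, 0 = real pointer
PAYLOAD_BITS = 63    # everything above the tag bit


def make_tagged_int(value):
    """Pack a small signed integer into a word instead of allocating an object."""
    limit = 1 << (PAYLOAD_BITS - 1)
    if not -limit <= value < limit:
        raise OverflowError("value does not fit in the inline payload")
    return ((value << 1) | TAG_BIT) & WORD_MASK


def is_tagged(word):
    """Aligned heap addresses always have a zero low bit, so the tag is unambiguous."""
    return bool(word & TAG_BIT)


def tagged_int_value(word):
    """Recover the integer: drop the tag bit, then sign-extend the 63-bit payload."""
    payload = word >> 1
    if payload & (1 << (PAYLOAD_BITS - 1)):
        payload -= 1 << PAYLOAD_BITS
    return payload
```

The point of the trick is that `is_tagged` is a single bit test, so a runtime can check it before ever dereferencing the word.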

[+] roskilli|12 years ago|reply
This was one of the most interesting things the post detailed for me, and it makes so much sense. We do a substantial number of allocs and reads for NSDecimalNumbers, etc. in our payment-based app, and I can imagine the heap savings and memory read/write savings we will get as a result would be of some significance. Pretty interesting innovation.
[+] wiredfool|12 years ago|reply
It's a good idea that caused a world of hurt in the 24-bit to 32-bit transition. There were some Macs that didn't have 32-bit clean ROMs, because they were doing things with the pointers that would never have to reference more than 4 megs of RAM.

Interesting that Apple now has the coordination and control to keep it from being a nightmare.

[+] jeltz|12 years ago|reply
I was surprised they did not use them already, since tagged pointers are an ancient idea implemented in many languages. I guess they might not have been useful enough in Objective-C on 32-bit architectures.
[+] simscitizen|12 years ago|reply
One of the biggest impacts of moving to 64-bit is increased memory pressure. While all of Apple's apps and daemons are running 64-bit, most users will be actively using third party apps that are 32-bit-only for a while. This means that on average there is less memory available in the system, because the amount of RAM is unchanged in the 5s, and there will now be code from both 64-bit and 32-bit binaries resident, rather than just 32-bit binaries.

Apple has done some work to alleviate this extra memory pressure at the kernel level. grep for WKdm in the xnu sources if you're interested.

[+] StephenFalken|12 years ago|reply
It is interesting to watch ARM finally adopting many of the great architectural solutions that MIPS used 22 years ago, back in 1991, when it launched the MIPS R4000 family of 64 bit processors. [1]

[1] http://groups.csail.mit.edu/cag/raw/documents/R4400_Uman_boo...

[+] bodyfour|12 years ago|reply
The original ARM ISA felt very VAX-inspired to me, such as the elegant (but ultimately inefficient) use of a general-purpose register for the program counter.

I've only just started looking at AArch64 but I agree that it feels a lot more like MIPS though. I think that's a good thing.

[+] mistercow|12 years ago|reply
>This allows compiling if statements and similar without requiring branching. Intended to increase performance, it must have been causing more trouble than it was worth, as ARM64 eliminates conditional execution.

Probably because so many projects use Thumb (the default for iOS projects in Xcode, for example), which doesn't include most instructions for conditional execution. From what I can tell, it also sounds like compilers weren't making very effective use of those instructions anyway.

Also, these were originally meant to compensate for a lack of branch prediction, which as I understand it, has changed drastically in recent years.

[+] w-m|12 years ago|reply
> With ARM64, there are 32 integer registers, with a dedicated zero register, link register, and frame pointer register. One further register is reserved for the platform, leaving 28 general purpose integer registers.

but http://www.arm.com/files/downloads/ARMv8_Architecture.pdf says:

* 31 general purpose registers accessible at all times

* Improved performance and energy

* General purpose registers are 64-bits wide

* No banking of general purpose registers

* Stack pointer is not a general purpose register

* PC is not a general purpose register

* Additional dedicated zero register available for most instructions

Which one is it?

By the way, the ARMv8 resources are quite interesting overall and a bit more in-depth than the article. http://www.arm.com/products/processors/armv8-architecture.ph...

[+] stephencanon|12 years ago|reply
There are 31 GPRs. The zero register shares an encoding with SP, and is its own separate thing.

Of the 31 GPRs, x30 is the link register and x29 is the frame pointer. x30 can be used for other purposes within a routine, but x29 must always hold a valid frame record in the iOS ABI. Additionally, iOS reserves x18 (“The platform register”) for its own use. So there are really 28 GPRs, or 29 if you include x30/lr, which is something of a hybrid.

[+] mikeash|12 years ago|reply
I'm not seeing the conflict between what I wrote and your quote from the architecture docs. Is it just confusion because the dedicated link register, frame pointer, and platform-reserved register are part of the ABI rather than the ISA?
[+] skylan_q|12 years ago|reply
Thank you for the breakdown of how performance is affected with the new architecture.

I've had a few quibbles about where performance gains would be, and all too often I was told that the performance increases would be solely realized in the larger memory addressing space. That just didn't seem right to me.

I really like the use of the otherwise unused space in the 64-bit pointers.

[+] thepumpkin1979|12 years ago|reply
"On ARM64, 19 bits of the isa field go to holding the object's reference count inline." That's really awesome.
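
A rough sketch of what packing a count into the isa word could look like, using Python integers as 64-bit words. The field positions are assumptions chosen for illustration (class pointer in the low 33 bits, count in the top 19); the real runtime's layout has more fields and is not reproduced here, and all names are hypothetical.

```python
# Hypothetical layout of a 64-bit isa word:
#   bits  0..32  class pointer (33 bits)
#   bits 45..63  inline retain count (19 bits)
# The bits in between are left unused in this sketch.

CLASS_BITS = 33
RC_BITS = 19
RC_SHIFT = 64 - RC_BITS              # count lives in the top 19 bits
CLASS_MASK = (1 << CLASS_BITS) - 1
RC_MAX = (1 << RC_BITS) - 1
RC_ONE = 1 << RC_SHIFT               # adding this bumps the count by one


def pack_isa(class_ptr, count=0):
    if class_ptr & ~CLASS_MASK:
        raise ValueError("class pointer does not fit in 33 bits")
    if not 0 <= count <= RC_MAX:
        raise ValueError("count does not fit in 19 bits")
    return class_ptr | (count << RC_SHIFT)


def class_of(isa):
    return isa & CLASS_MASK          # one mask


def inline_count(isa):
    return isa >> RC_SHIFT           # one shift


def retain(isa):
    if inline_count(isa) == RC_MAX:
        # Assumption: a real runtime would fall back to a side table here.
        raise OverflowError("inline count is full")
    return isa + RC_ONE              # no hash table lookup needed


def release(isa):
    if inline_count(isa) == 0:
        raise ValueError("inline count is already zero")
    return isa - RC_ONE
```

Retain and release touch only the word that is already in hand, which is where the saving over a hash table lookup comes from.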
[+] Scaevolus|12 years ago|reply
I hope by "Perform an atomic store of the new isa value." he means "Perform an atomic compare-and-set of the new isa value."

A64 doesn't eliminate conditional execution completely. It just pares it down to the basics: branch (obviously), add/sub, select, compare (for flattening conditionals like `a && b && c`).

Another thing removed from A32 was the optional shift on operand 2-- which was taking up 7/32 bits for most instructions.

This has a few more that were missed: http://nominolo.blogspot.com/2012/07/arms-new-64-bit-instruc...

[+] mikeash|12 years ago|reply
It's not a compare-and-set. Rather, it uses ARM's atomic instructions where the load creates a reservation on the memory address, and the store succeeds only if the reservation is still present, with any other stores to that address (or nearby addresses) breaking the reservation.

You can use this pattern to implement compare-and-set, but you don't need compare-and-set to use that pattern directly.

Edit: I wasn't sure how to encode this into the steps in the article, so it's a bit vague on that part. Suggestions welcome.
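
One possible way to picture the reservation pattern is the toy simulation below. It is not ARM code and not the runtime's implementation: the class, its single reservation slot, and the `interfere` hook are all assumptions invented to show the retry loop. Real hardware tracks reservations per core and at a coarser granularity.

```python
class ReservedWord:
    """Toy one-word memory with a load-exclusive / store-exclusive pair."""

    def __init__(self, value):
        self.value = value
        self._reserved_by = None     # simplification: only one reservation slot

    def load_exclusive(self, who):
        """Read the word and take out a reservation on it."""
        self._reserved_by = who
        return self.value

    def store_exclusive(self, who, new_value):
        """Write only if the caller's reservation is still intact."""
        if self._reserved_by != who:
            return False
        self.value = new_value
        self._reserved_by = None
        return True

    def plain_store(self, new_value):
        """Any other store breaks the reservation, whatever value it writes."""
        self.value = new_value
        self._reserved_by = None


def atomic_update(word, who, fn, interfere=None):
    """Retry load-exclusive / modify / store-exclusive until the store succeeds.

    `interfere` is a hypothetical hook that simulates another core writing
    between the load and the store on the first attempt.
    """
    attempts = 0
    while True:
        attempts += 1
        old = word.load_exclusive(who)
        new = fn(old)
        if interfere is not None and attempts == 1:
            interfere(word)
        if word.store_exclusive(who, new):
            return new, attempts
```

Note that nothing in the loop compares values: the store fails because the reservation was lost, even if the interfering write stored the same value that was already there.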

[+] matthewmacleod|12 years ago|reply
Great write-up, thanks!

I expect we'll see ARMv8 architectures in the next round of flagship phones. Apple's a little ahead of the curve, but it won't be long till competitors catch up.

In the context of Apple, it's interesting to think about where they're going to take this next. ARM process and architecture improvements are likely to lead to chips with high enough performance to be used in mainstream desktop applications. Is it possible we're going to see something like an ARM/x86 dual-processor MacBook platform that allows ARM's low power consumption to supplement Intel's performance?

[+] glasshead969|12 years ago|reply
Macs with ARM processors don't seem like a possibility. Intel Haswell processors have been shown to have comparable performance per watt, which is expected to get better with Broadwell.

ARM64 Apple chips are a play for iPads. Current iPads lag on performance when compared to something like a Bay Trail Intel tablet or a Haswell-equipped Surface tablet. There is going to be a convergence point for Intel: a tablet with Haswell-level performance, a fanless chassis, and a $500 price. Apple needs to converge there to compete.

[+] Pxtl|12 years ago|reply
The bit about memory-mapped files, considering the fact that these devices aren't using magnetic discs, is something interesting. The conventional file API of seeking and streams suddenly feels a bit anachronistic. Of course, flash memory is often optimized for sequential reads, but still - it's far more amenable to the memory-mapped model than magnetic media ever was.
[+] Peaker|12 years ago|reply
Memory mapped files have a problem (that may be less relevant for iOS) with error reporting.

Explicit APIs can have explicit error codes. Memory accesses don't have much opportunity to report errors, so have to resort to awful signals and such (that nobody handles properly).
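
To make the contrast concrete, here is a small sketch of the two styles in Python. The function names are hypothetical. In the explicit version every call is a point where failure can be reported (as `OSError` here); in the mapped version the read is just a slice of memory, with no per-access place to return an error.

```python
import mmap
import os


def read_with_explicit_api(path, offset, length):
    """Seek-and-read style: each call can fail with an OSError carrying an errno."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        return os.read(fd, length)
    finally:
        os.close(fd)


def read_with_mmap(path, offset, length):
    """Mapped style: after the mapping is set up, reading is plain indexing.

    Assumes a non-empty file; mapping an empty file raises ValueError.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]
```

Both return the same bytes for a healthy file; the difference only shows up when the underlying storage misbehaves after the mapping already exists.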

[+] devx|12 years ago|reply
Why didn't ARM call it ARM64? It's hard to believe the name didn't cross their minds and that they simply decided AArch64 was better, so there could be another reason.
[+] duskwuff|12 years ago|reply
The "ARM123" naming scheme was used to refer to specific ARM cores prior to the "Cortex" naming scheme. While "ARM64" isn't ambiguous in and of itself, it's troublingly close to ARM60, the first ARM CPU with a 32-bit address space.
[+] denim_chicken|12 years ago|reply
I still wonder why in the world Apple went with just 1GB of RAM on the 5s. Even the Nexus 4 that I bought contract-free for $200 comes with 2GB of RAM.
[+] runjake|12 years ago|reply
1) Because the iPhone doesn't need 2 GB RAM.

2) They took the money they saved and devoted it elsewhere (perhaps the Sapphire home button? :)

When it comes down to the bill of materials, every cent really does count when it scales across several million units sold.

[+] r00fus|12 years ago|reply
Power/battery. Extra RAM doesn't come for free power-wise; it costs energy to keep its state.
[+] jswanson|12 years ago|reply
From the article:

  The biggest change is an inline retain count, which eliminates the need to perform a costly hash table lookup for retain and release operations in the common case. Since those operations are so common in most Objective-C code, this is a big win.
[+] bnolsen|12 years ago|reply
Only using 33 bits for memory addressing is troublesome. 33 bits is 8GB of RAM, which is small potatoes for a desktop. Why couldn't they have left it at 38 or even 40 bits? Or is this limitation only part of the Objective-C runtime?
[+] mikeash|12 years ago|reply
It comes down to the OS. Basically, when a new process is created, the OS sets up its address space and decides where it will allow new memory to be mapped. For whatever reason (I'm not entirely clear just yet), iOS 7 in 64-bit mode goes for an 8GB address space.

As far as I know, there's nothing preventing that from being increased on future hardware or even on the 5S with future OS updates. I believe the CPU itself supports a 48-bit virtual address space.
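
For reference, the sizes quoted in this subthread are just powers of two. A quick check (the helper name is made up):

```python
GIB = 1 << 30
TIB = 1 << 40


def address_space_bytes(bits):
    """Number of distinct byte addresses reachable with `bits` address bits."""
    return 1 << bits


size_33 = address_space_bytes(33) // GIB   # 33-bit space, in GiB
size_40 = address_space_bytes(40) // TIB   # 40-bit space, in TiB
size_48 = address_space_bytes(48) // TIB   # 48-bit space, in TiB
```

So 33 bits gives 8 GiB, 40 bits gives 1 TiB, and 48 bits gives 256 TiB.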

[+] revelation|12 years ago|reply
CPython has reference counts as a part of the object in memory. The claims of "large memory consumption" are nonsense, especially since small integer objects and strings are aggressively interned.

And increasing just one aligned integer is certainly cheaper than the bit masking the solution here entails (all of which is neatly hidden away in the 'increment of the correct portion' part).

[+] mikeash|12 years ago|reply
Remember that this decision was made back when the entire system might have 32MB of RAM. Does CPython even fit in that, as a single process, let alone a full multitasking UNIX?

Additional RAM consumption has costs of its own, in terms of cache usage. Adding an extra 8 bytes for every object in the system is not insignificant. Masking and shifting is extremely cheap.

If you've run the benchmarks and can show your approach is better, by all means, please share.

[+] corresation|12 years ago|reply
> First, a note on the name: the official name from ARM is "AArch64", but this is a silly name that pains me to type. Apple calls it ARM64, and that's what I will call it too.

What ARM calls ARM related periphery is canonical, whether you think it's silly or not.

However the overarching entity is called ARMv8, with the 64-bit state called AArch64 (which can be contrasted with the AArch32 state, which is also a part of ARMv8) and the instruction set is actually called A64.

[+] Someone|12 years ago|reply
Not all official names survive a confrontation with reality, where 'easier to remember' and 'easier to pronounce' have value, too.

Do you use the terms IA-32e and EM64T, too (both are/were Intel's official names for what people now typically call x64 or x86-64)?

[+] sanxiyn|12 years ago|reply
I personally use Debian port names. Debian is pretty sensible about it.

Debian uses "amd64" and "arm64".

[+] mikeash|12 years ago|reply
"What ARM calls ARM related periphery is canonical, whether you think it's silly or not."

And why, exactly, should I care?

[+] rsynnott|12 years ago|reply
> What ARM calls ARM related periphery is canonical, whether you think it's silly or not.

Ah, yes, that must be why everyone in the world calls the 64bit x86 instruction set either AMD64 or IA-32E.

[+] justincormack|12 years ago|reply
Is there also a model with the new registers but 32-bit pointers, like x32?