Why is Rosetta 2 fast?

[+] iainmerrick|3 years ago|reply

This is a great writeup. What a clever design!

I remember Apple had a totally different but equally clever solution back in the days of the 68K-to-PowerPC migration. The 68K had 16-bit instruction words, usually with some 16-bit arguments. The emulator’s core loop would read the next instruction and branch directly into a big block of 64K x 8 bytes of PPC code. So each 68K instruction got 2 dedicated PPC instructions, typically one to set up a register and one to branch to common code.

What that solution and Rosetta 2 have in common is that they’re super pragmatic - fast to start up, with fairly regular and predictable performance across most workloads, even if the theoretical peak speed is much lower than a cutting-edge JIT.

Anyone know how they implemented PPC-to-x86 translation?

[+] kijiki|3 years ago|reply

> Anyone know how they implemented PPC-to-x86 translation?

They licensed Transitive's retargettable binary translator, and renamed it Rosetta; very Apple.

It was originally a startup, but had been bought by IBM by the time Apple was interested.

[+] klelatti|3 years ago|reply

That’s really interesting. You might enjoy reading about the VM embedded into the Busicom calculator that used the Intel 4004 [1]

They squeezed a virtual machine with 88 instructions into less than 1k of memory!

[1] https://thechipletter.substack.com/p/bytecode-and-the-busico...

[+] lostgame|3 years ago|reply

From what I understand; they purchased a piece of software that already existed to translate PPC to x86 in some form or another and iterated on it. I believe the software may have already even been called ‘Rosetta’.

My memory is very hazy; though. While I experienced this transition firsthand and was an early Intel adopter, that’s about all I can remember about Rosetta or where it came from.

I remember before Adobe had released the Universal Binary CS3 that running Photoshop on my Intel Mac was a total nightmare. :( I learned to not be an early adopter from that whole debacle.

[+] Asmod4n|3 years ago|reply

I don't know how they did it, but they did it very very slowly. Anything "interactive" was unuseable.

[+] hinkley|3 years ago|reply

I remember years ago when Java adjacent research was all the rage, HP had a problem that was “Rosetta lite” if you will. They had a need to run old binaries on new hardware that wasn’t exactly backward compatible. They made a transpiler that worked on binaries. It might have even been a JIT but that part of the memory is fuzzy.

What made it interesting here was that as a sanity check they made an A->A mode where they took in one architecture and spit out machine code for the same architecture. The output was faster than the input. Meaning that even native code has some room for improvement with JIT technology.

I have been wishing for years that we were in a better place with regard to compilers and NP complete problems where the compilers had a fast mode for code-build-test cycles and a very slow incremental mode for official builds. I recall someone telling me the only thing they liked about the Rational IDE (C and C++?) was that it cached precompiled headers, one of the Amdahl’s Law areas for compilers. If you changed a header, you paid the recompilation cost and everyone else got a copy. I love whenever the person that cares about something gets to pay the consequence instead of externalizing it on others.

And having some CI machines or CPUs that just sit around chewing on Hard Problems all day for that last 10% seems to be to be a really good use case in a world that’s seeing 16 core consumer hardware. Also caching hints from previous runs is a good thing.

[+] hamstergene|3 years ago|reply

Could it be simply because many binaries were produced by much older, outdated optimizers. Or optimized for size.

Also, optimizers usually target “most common denominator” so native binaries rarely use full power of current instruction set.

Jumping from that peculiar finding to praising runtime JIT feels like a longshot. To me it’s more of an argument towards distributing software in intermediate form (like Apple Bitcode) and compiling on install, tailoring for the current processor.

[+] wmf|3 years ago|reply

https://www.hpl.hp.com/techreports/1999/HPL-1999-78.html

[+] hawflakes|3 years ago|reply

People have mentioned the Dynamo project from HP. But I think you're actually thinking of the Aries project (I worked in a directly adjacent project) that allowed you to run PA-RISC binaries on IA-64.

https://nixdoc.net/man-pages/HP-UX/man5/Aries.5.html

[+] mark_undoio|3 years ago|reply

Something that fascinates me about this kind of A -> A translation (which I associate with the original HP Dynamo project on HPPA CPUs) is that it was able to effectively yield the performance effect of one or two increased levels of -O optimization flag.

Right now it's fairly common in software development to have a debug build and a release build with potentially different optimisation levels. So that's two builds to manage - if we could build with lower optimisation and still effectively run at higher levels then that's a whole load of build/test simplification.

Moreover, debugging optimised binaries is fiddly due to information that's discarded. Having the original, unoptimised, version available at all times would give back the fidelity when required (e.g. debugging problems in the field).

Java effectively lives in this world already as it can use high optimisation and then fall back to interpreted mode when debugging is needed. I wish we could have this for C/C++ and other native languages.

[+] freedomben|3 years ago|reply

If JIT-ing a statically compiled input makes it faster, does that mean that JIT-ing itself is superior or does it mean that the static compiler isn't outputting optimal code? (real question. asked another way, does JIT have optimizations it can make that a static compiler can't?)

[+] jeffbee|3 years ago|reply

Post-build optimization of binaries without changing the target CPU is common. See BOLT https://github.com/facebookincubator/BOLT

[+] darzu|3 years ago|reply

Does anyone know the names of the key people behind Rosetta 2?

In my experience, exceptionally well executed tech like this tends to have 1-2 very talented people leading. I'd like to follow their blog or Twitter.

[+] cwzwarich|3 years ago|reply

I am the creator / main author of Rosetta 2. I don't have a blog or a Twitter (beyond lurking).

[+] trollied|3 years ago|reply

The original Rosetta was written by Transitive, which was formed by spinning a Manchester University research group out. See https://www.software.ac.uk/blog/2016-09-30-heroes-software-e...

I know a few of their devs went to ARM, some to Apple & a few to IBM (who bought Transitive). I do know a few of their ex staff (and their twitter handles), but I don’t feel comfortable linking them here.

[+] lunixbochs|3 years ago|reply

> To see ahead-of-time translated Rosetta code, I believe I had to disable SIP, compile a new x86 binary, give it a unique name, run it, and then run otool -tv /var/db/oah///unique-name.aot (or use your tool of choice – it’s just a Mach-O binary). This was done on old version of macOS, so things may have changed and improved since then.

My aotool project uses a trick to extract the AOT binary without root or disabling SIP: https://github.com/lunixbochs/meta/tree/master/utils/aotool

[+] menaerus|3 years ago|reply

> Rosetta 2 translates the entire text segment of the binary from x86 to ARM up-front.

Do I understand correctly that the Rosetta is basically a transpiler from x86-64 machine code to ARM machine code which is run prior to the binary execution? If so, does it affect the application startup times?

[+] johnthuss|3 years ago|reply

"I believe there’s significant room for performance improvement in Rosetta 2... However, this would come at the cost of significantly increased complexity... Engineering is about making the right tradeoffs, and I’d say Rosetta 2 has done exactly that."

[+] saagarjha|3 years ago|reply

One thing that’s interesting to note is that the amount of effort expended here is not actually all that large. Yes, there are smart people working on this, but the performance of Rosetta 2 for the most part is probably the work of a handful of clever people. I wouldn’t be surprised if some of them have an interest in compilers but the actual implementation is fairly straightforward and there isn’t much of the stuff you’d typically see in an optimizing JIT: no complicated type theory or analysis passes. Aside from a handful of hardware bits and some convenient (perhaps intentionally selected) choices in where to make tradeoffs there’s nothing really specifically amazing here. What really makes it special is that anyone (well, any company with a bit of resources) could’ve done it but nobody really did. (But, again, Apple owning the stack and having past experience probably did help them get over the hurdle of actually putting effort into this.)

[+] karmakaze|3 years ago|reply

Vertical integration. My understanding was it's because the Apple silicon ARM has special support to make it fast. Apple has had enough experience to know that some hardware support can go a long way to making the binary emulation situation better.

[+] hinkley|3 years ago|reply

Apple is doing some really interesting but really quiet work in the area of VMs. I feel like we don’t give them enough credit but maybe they’ve put themselves in that position by not bragging enough about what they do.

As a somewhat related aside, I have been watching Bun (low startup time Node-like on top of Safari’s JavaScript engine) with enough interest that I started trying to fix a bug, which is somewhat unusual for me. I mostly contribute small fixes to tools I use at work. I can’t quite grok Zig code yet so I got stuck fairly quickly. The “bug” turned out to be default behavior in a Zig stdlib, rather than in JavaScript code. The rest is fairly tangential but suffice it to say I prefer self hosted languages but this probably falls into the startup speed compromise.

Being low startup overhead makes their VM interesting, but the fact that it benchmarks better than Firefox a lot of the time and occasionally faster than v8 is quite a bit of quiet competence.

[+] kccqzy|3 years ago|reply

> The instructions from FEAT_FlagM2 are AXFLAG and XAFLAG, which convert floating-point condition flags to/from a mysterious “external format”. By some strange coincidence, this format is x86, so these instruction are used when dealing with floating point flags.

This really made me chuckle. They probably don't want to mention Intel by name, but this just sounds funny.

https://developer.arm.com/documentation/100076/0100/A64-Inst...

[+] Vt71fcAqt7|3 years ago|reply

I hope Rosetta is here to stay and continues developement. And I hope what is learned from it can be used to make a RISC-V version of it. translating native ARM to RISC-V should be much easier than x86 to ARM as I understand it, so one could conceivably do x86 -> ARM -> RISC-V.

[+] peatmoss|3 years ago|reply

Not having any particular domain experience here, I've idly wondered whether or not there's any role for neural net models in translating code for other architectures.

We have giant corpuses of source code, compiled x86_64 binaries, and compiled arm64 binaries. I assume the compiled binaries represent approximately our best compiler technology. It seems predicting an arm binary from an x86_64 binary would not be insane?

If someone who actually knows anything here wants to disabuse me of my showerthoughts, I'd appreciate being able to put the idea out of my head :-)

[+] qsort|3 years ago|reply

You would need a hybrid architecture with a NN generating guesses and a "watchdog" shutting down errors.

Neural models are basically universal approximators. Machine code needs to be obscenely precise to work.

Unless you're doing something else in the backend, it's just a turbo SIGILL generator.

[+] brookst|3 years ago|reply

I'm a ML dilletante and hope someone more knowledgeable chimes in, but one thing to consider is the statistics of how many instructions you're translating and the accuracy rate. Binary execution is very unforgiving to minor mistakes in translation. If 0.001% of instructions are translated incorrectly, that program just isn't going to work.

[+] hinkley|3 years ago|reply

I think we are on the cusp of machine aided rules generation via example and counter example. It could be a very cool era of “Moore’s Law for software” (which I’m told software doubles in speed roughly every 18 years).

Property based testing is a bit of a baby step here, possibly in the same way that escape analysis in object allocation was the precursor to borrow checkers which are the precursor to…?

These are my inputs, these are my expectations, ask me some more questions to clarify boundary conditions, and then offer me human readable code that the engine thinks satisfies the criteria. If I say no, ask more questions and iterate.

If anything will ever allow machines to “replace” coders, it will be that, but the scare quotes are because that shifts us more toward information architecture from data munging, which I see as an improvement on the status quo. Many of my work problems can be blamed on structural issues of this sort. A filter that removes people who can’t think about the big picture doesn’t seem like a problem to me.

[+] Someone|3 years ago|reply

> It seems predicting an arm binary from an x86_64 binary would not be insane?

If you start with a couple of megabytes of x64 code, and predict a couple of megabytes of arm code from it, there will be errors even if your model is 99.999% accurate.

How do you find the error(s)?

[+] saagarjha|3 years ago|reply

People have tried doing this, but not typically at the instruction level. Two ways to go about this that I’m aware of are trying to use machine learning to derive high-level semantics about code, then lowering it to the new architecture.

[+] Symmetry|3 years ago|reply

Many branch predictors have traditionally used perceptrons, which are sort of NN like. And I think there's a lot of research into involving incorporating deep learning models into doing chip routings.

[+] musicale|3 years ago|reply

Rosetta 2 is beautiful - I would love it if they kept it as a feature for the long term rather than deprecating it and removing it in the next release of macOS (basically what they did during previous architectural transitions.)

If Apple does drop it, maybe they could open source it so it could live on in Linux and BSD at least. ;-)

Adding a couple of features to ARM to drastically improve translated x86 code execution sounds like a decent idea - and one that could potentially enable better x86 app performance on ARM Windows as well. I don't know the silicon cost but I'd hope it wasn't dropped in the future.'

Thinking a bit larger, I'd also like to see Apple add something like CHERI support to Apple Silicon and macOS to enable efficient memory error checking in hardware. I'd be surprised if they weren't working on something like this already.

[+] pjmlp|3 years ago|reply

Back in the early days of Windows NT everywhere, the Alpha version had a similar JIT emulation.

https://en.m.wikipedia.org/wiki/FX!32

Or for a more technical deep dive,

https://www.usenix.org/publications/library/proceedings/usen...

[+] mosburger|3 years ago|reply

OMG I forgot about FX!32. My first co-op was as a QA tester for the DEC Multia, which they moved from the Alpha processor to Intel midway through. I did a skunkworks project for the dev team attempting to run the newer versions of Multia's software (then Intel-based) on older Alpha Multias using FX!32. IIRC it was still internal use only/beta, but it worked quite well!

[+] hot_gril|3 years ago|reply

Rosetta 2 has become the poster child for "innovation without deprecation" where I work (not Apple).

[+] Tijdreiziger|3 years ago|reply

Apple is the king of deprecation, just look at what happened to Rosetta 1 and 32-bit iOS apps.

[+] webwielder2|3 years ago|reply

(Apologies for the flame war quality to this comment, I’m genuinely just expressing an observation)

It’s ironic that Apple is often backhandedly complimented by hackers as having “good hardware” when their list of software accomplishments is amongst the most impressive in the industry and contrasts sharply with the best efforts of, say, Microsoft, purportedly a “software company.”

[+] manv1|3 years ago|reply

Apple's historically been pretty good at making this stuff. Their first 68k -> PPC emulator (Davidian's) was so good that for some things the PPC Mac was the fastest 68k mac you could buy. The next-gen DR emulator (and SpeedDoubler etc) made things even faster.

I suspect the ppc->x86 stuff was slower because x86 just doesn't have the registers. There's only so much you can do.

[+] dynjo|3 years ago|reply

It is quite astonishing how seamless Apple has managed to make the Intel to ARM transition, there are some seriously smart minds behind Rosetta. I honestly don't think I had a single software issue during the transition!

[+] ChuckNorris89|3 years ago|reply

If that blows your mind, you should see how Microsoft did the emulation of the PowerPC based Xeon chip to X86 so you can play Xbox 360 games on Xbox One.

There's an old pdf from Microsoft researchers with the details but I can't seem to find it right now.

[+] chrchang523|3 years ago|reply

I finally started seriously using a M1 work laptop yesterday, and I'm impressed. More than twice as fast on a compute-intensive job as my personal 2015 MBP, with a binary compiled for x86 and with hand-coded SIMD instructions.

[+] xxpor|3 years ago|reply

They've almost made it too good. I have to run software that ships an x86 version of CPython, and it just deeply offends me on a personal level, even though I can't actually detect any slowdown (probably because lol python in the first place)

[+] esskay|3 years ago|reply

It has been extremely smooth sailing. I moved my own mac over to it about a year ago, swapping a beefed up MPB for a budget friendly M1 Air (which has massively smashed it out the park performance wise, far better than I was expecting). Didn't have a single issue.

My work mac was upgraded to a MBP M1 Pro and again, very smooth. I had one minor issue with a docker container not being happy (it was an x86 instance) but one minor tweak to the docker compose file and I was done.

It does still amaze me how good these new machines are. Its almost enough to redeem apple for the total pile of overheating, underperforming crap that came directly before the transition (aka any mac with a touchbar).

[+] wombat-man|3 years ago|reply

There's an annoying dwarf fortress bug but other than that, same

[+] AnIdiotOnTheNet|3 years ago|reply

It isn't their first rodeo: 68k->PPC->x86_64->ARM.

[+] perardi|3 years ago|reply

I think the end of support for 32-bit applications in 2019 helped, slightly, with the run-up.

Assuming you weren’t already shipping 64-bit applications…which would be weird…updating the application probably required getting everything into a contemporary version of Xcode, cleaning out the cruft, and getting it compiling nice and cleanly. After that, the ARM transition was kind of a “it just works” scenario.

Now, I’m sure Adobe and other high-performance application developers had to do some architecture-specific tweaks, but, gotta think Apple clued them in ahead of time as to what was coming.

[+] js2|3 years ago|reply

I have a single counter-example. Mailplane, a Gmail SSB. It's Intel including its JS engine, making the Gmail UI too sluggish to use.

I've fallen back to using Fluid, an ancient and also Intel-specific SSB, but its web content runs in a separate WebKit ARM process so it's plenty fast.

I've emailed the Mailplane author but they wont release an Universal version of the app since they've EOL'd Mailplane.

I have yet to find a Gmail SSB that I'm happy with under ARM. Fluid is a barely workable solution.

[+] radicaldreamer|3 years ago|reply

Since this is the company's third big arch transition, cross-compilation and compatibility is probably considered a core competency for Apple to maintain internally.

[+] poulpy123|3 years ago|reply

having total control on the hardware and the software didn't hurt for sure

[+] spullara|3 years ago|reply

The first time I ran into this technology was in the early 90s on the DEC Alpha. They had a tool called "MX" that would translate MIPS Ultrix binaries to Alpha on DEC Unix:

https://www.linuxjournal.com/article/1044

Crazy stuff. Rosetta 2 is insanely good. Runs FPS video games even.

[+] lzooz|3 years ago|reply

>The Apple M1 has an undocumented extension that, when enabled, ensures instructions like ADDS, SUBS and CMP compute PF and AF and store them as bits 26 and 27 of NZCV respectively, providing accurate emulation with no performance penalty.

If there is no performance penalty why is it implemented as an optional extension?

358 comments