
Things I learned while writing an x86 emulator (2023)

353 points | fanf2 | 1 year ago | timdbg.com | reply

127 comments

[+] aengelke|1 year ago|reply
Bonus quirk: there's BSF/BSR, for which the Intel SDM states that on zero input, the destination has an undefined value. (AMD documents that the destination is not modified in that case.) And then there's glibc, which happily uses the undocumented fact that the destination is also unmodified on Intel [1]. It took me quite some time to track down the issue in my binary translator. (There's also TZCNT/LZCNT, which is BSF/BSR encoded with F3-prefix -- which is silently ignored on older processors not supporting the extension. So the same code will behave differently on different CPUs. At least, that's documented.)
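For what it's worth, here is how an emulator might model that quirk (a Python sketch; the function names and interface are my own, and "unmodified on zero" is the observed Intel behavior rather than anything the SDM promises):

```python
def bsf(dest_old: int, src: int) -> tuple[int, bool]:
    """Bit Scan Forward: returns (new destination value, ZF).

    Intel SDM: destination undefined on zero input. Observed Intel
    behavior (and AMD's documented behavior): destination unmodified.
    """
    if src == 0:
        return dest_old, True
    return (src & -src).bit_length() - 1, False

def tzcnt(src: int, width: int = 32) -> int:
    """TZCNT is well-defined on zero input: it returns the operand size."""
    if src == 0:
        return width
    return (src & -src).bit_length() - 1
```

On a CPU without BMI1, the F3-prefixed encoding silently decodes as plain BSF, so code compiled to use `tzcnt` can end up with the `bsf` semantics above.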

Encoding: People often complain about prefixes, but IMHO, that's far from the worst part. It is well known and somewhat well documented. There are worse quirks: for example, REX/VEX/EVEX.RXB extension bits are ignored when they do not apply (e.g., MMX registers); except for mask registers (k0-k7), where they trigger #UD -- also fine -- except if the register is encoded in ModRM.rm, in which case the extension bit is ignored again.
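A toy model of that extension-bit rule, just restating the paragraph above as code (the names and the register-class strings are my own invention, not anything from a real decoder):

```python
class UndefinedOpcode(Exception):
    """Models the #UD exception."""

def apply_extension_bit(reg_class: str, reg: int, ext_bit: int,
                        in_modrm_rm: bool) -> int:
    """Combine a 3-bit register field with a REX/VEX/EVEX extension bit."""
    if reg_class in ("gp", "xmm"):
        return (ext_bit << 3) | reg       # bit applies normally
    if reg_class == "mask":
        if ext_bit and not in_modrm_rm:
            raise UndefinedOpcode("#UD: extension bit set for mask register")
        return reg                        # ...ignored again in ModRM.rm
    return reg                            # MMX etc.: bit silently ignored
```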

APX takes the number of quirks to a different level: the REX2 prefix can encode general-purpose registers r16-r31, but not xmm16-xmm31; the EVEX prefix has several opcode-dependent layouts; and the extension bits used for a register depend on the register type (XMM registers use X3:B3:rm and V4:X3:idx; GP registers use B4:B3:rm, X4:X3:idx). I can't give a complete list yet; I still haven't finished my APX decoder after a year...

[1]: https://sourceware.org/bugzilla/show_bug.cgi?id=31748

[+] bonzini|1 year ago|reply
On and off over the last year I have been rewriting QEMU's x86 decoder. It started as a necessary task to incorporate AVX support, but I am now at a point where only a handful of opcodes are left to rewrite, after which it should not be too hard to add APX support. For EVEX my plan is to keep the raw bits until after the opcode has been read (i.e. before immediates and possibly before modrm) and the EVEX class identified.

My decoder is mostly based on the tables in the manual, and the code is mostly okay—not too much indentation and phases mostly easy to separate/identify. Because the output is JITted code, it's ok to not be super efficient and keep the code readable; it's not where most of the time is spent. Nevertheless there are several cases in which the manual is wrong or doesn't say the whole story. And the tables haven't been updated for several years (no K register instructions, for example), so going forward there will be more manual work to do. :(

The top comment explains a bit what's going on: https://github.com/qemu/qemu/blob/59084feb256c617063e0dbe7e6...

(As I said above, there are still a few instructions handled by the old code predating the rewrite, notably BT/BTS/BTR/BTC. I have written the code but not merged it yet.)

[+] torusle|1 year ago|reply
Another bonus quirk, from the 486 and Pentium era...

BSWAP EAX converts from little endian to big endian and vice versa. It was a 32 bit instruction to begin with.

However, we have the 0x66 prefix that switches between 16 and 32 bit mode. If you apply that prefix to BSWAP EAX, undefined, funky things happen.

On some CPUs (this differed between Intel and AMD) the prefix was simply ignored. On others it did something that I call an "inner swap": of the four bytes stored in EAX, bytes 1 and 2 are swapped.

  0x11223344 became 0x11332244.
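Sketching both behaviors in Python (normal BSWAP plus the "inner swap" described above; byte 0 here is the least significant byte):

```python
def bswap32(x: int) -> int:
    """Normal BSWAP: full endianness reversal of a 32-bit value."""
    return int.from_bytes(x.to_bytes(4, "little"), "big")

def inner_swap32(x: int) -> int:
    """The observed 66-prefixed oddity: only the middle two bytes swap."""
    b = bytearray(x.to_bytes(4, "big"))
    b[1], b[2] = b[2], b[1]
    return int.from_bytes(bytes(b), "big")
```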
[+] CoastalCoder|1 year ago|reply
Can you imagine having to make all this logic work faithfully, let alone fast, in silicon?

X86 used to be Intel's moat, but what a nightmarish burden to carry.

[+] mananaysiempre|1 year ago|reply
The semantics of LZCNT combined with its encoding feels like an own goal: it’s encoded as a BSR instruction with a legacy-ignored prefix, but for nonzero inputs its return value is the operand size minus one, minus the return value of the legacy version. Yes, clz() is a function that exists, but the extra subtraction in its implementation feels like a small cost to pay for extra compatibility when LZCNT could’ve just been BSR with different zero-input semantics.
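Concretely, for nonzero 32-bit x the two results are mirror images: LZCNT(x) == 31 - BSR(x). A quick Python sketch (my own helper names):

```python
def bsr(x: int) -> int:
    """Bit Scan Reverse: index of the highest set bit (x must be nonzero)."""
    assert x != 0
    return x.bit_length() - 1

def lzcnt(x: int, width: int = 32) -> int:
    """Count of leading zeros; defined on zero (returns the operand size)."""
    return width if x == 0 else width - x.bit_length()
```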
[+] __turbobrew__|1 year ago|reply
I know nothing about this space, but it would be interesting to hook up a JTAG interface to an x86 CPU and then step instruction by instruction and record all the register values.

You could then use this data to test whether or not your emulator perfectly emulates the hardware by running the same program through the emulated CPU and validate the state is the same at every instruction.

[+] tyfighter|1 year ago|reply
The BSF/BSR quirk is annoying, but I think the reason for it is that they were only thinking about it being used in a loop (or maybe an if), with something like:

  int mask = something;
  ...
  for (int index; _bit_scan_forward(&index, mask); mask ^= 1 << index) {
      ...
  }

Since it sets the ZF on a zero input, they thought that must be all you need. But there are many other uses for (trailing|leading) zero count operations, and it would have been much better for them to just write the register anyway.
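The loop pattern described above, sketched in Python: scan for the lowest set bit, process it, clear it, and stop once the mask is zero (the role ZF plays in the assembly version):

```python
def set_bit_indices(mask: int) -> list[int]:
    """Iterate over set bits, lowest first, via repeated bit-scan-forward."""
    indices = []
    while mask:                                  # ZF clear: a bit was found
        index = (mask & -mask).bit_length() - 1  # bit-scan-forward
        indices.append(index)
        mask ^= 1 << index                       # clear the bit just found
    return indices
```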

[+] sdsd|1 year ago|reply
What a cool person. I really enjoy writing assembly, it feels so simple and I really enjoy the vertical aesthetic quality.

The closest I've ever come to something like OP (which is to say, not close at all) was when I was trying to help my JS friend understand the stack, and we ended up writing a mini vm with its own little ISA: https://gist.github.com/darighost/2d880fe27510e0c90f75680bfe...

This could have gone much deeper; I'd have enjoyed that, but doing so would have detracted from the original educational goal lol. I should contact that friend and see if he still wants to study with me. It's hard since he's making so much money doing fancy web dev that he has no time to go deep into stuff, whereas my unemployed ass is basically an infinite ocean of time and energy.

[+] actionfromafar|1 year ago|reply
You should leverage that into your friend teaching you JS, maybe.
[+] changexd|1 year ago|reply
Thank you for this. Even though it took me some time to understand what's going on, it led me on a cool (and very challenging) assembly journey. As a non-CS major, this code is a really nice entry point for me to start understanding how things work. I will definitely dig deeper.
[+] AstroJetson|1 year ago|reply
Check out Justine Tunney and her emulator. https://justine.lol/blinkenlights/

The docs are an amazing tour of how the cpu works.

[+] trallnag|1 year ago|reply
That name, Tunney. Remember it from around 2014, being homeless, bumming around, and shit posting on Twitter about Occupy lol
[+] trollied|1 year ago|reply
> Writing a CPU emulator is, in my opinion, the best way to REALLY understand how a CPU works

Hard disagree.

The best way is to create a CPU from gate level, like you do on a decent CS course. (I really enjoyed making a cut down ARM from scratch)

[+] timmisiak|1 year ago|reply
I think both are useful, but designing a modern CPU from the gate level is out of reach for most folks, and I think there's a big gap between the sorts of CPUs we designed in college and the sort that run real code. I think creating an emulator of a modern CPU is a somewhat more accessible challenge, while still being very educational even if you only get something partially working.
[+] banish-m4|1 year ago|reply
Seconded. A microcoded, pipelined, superscalar, branch-predicting basic processor with L1 data & instruction caches and write-back L2 cache controller is nontrivial. Most software engineers have an incomplete grasp of data hazards, cache invalidation, or pipeline stalls.
[+] quantified|1 year ago|reply
Well, I think you're both right. It's satisfying as heck to sling 74xx chips together and you get a feel for the electrical side of things and internal tradeoffs.

When you get to doing that for the CPU that you want to do meaningful work with, you start to lose interest in that detail. Then the complexities of the behavior and spec become interesting and the emulator approach is more tractable, can cover more types of behavior.

[+] brailsafe|1 year ago|reply
So far on my journey through Nand2Tetris (since I kind of dropped out of my real CS course) I've worked my way up from the gate level, and just finished the VM emulator chapter, which took an eternity. Now onto compilation.
[+] commandlinefan|1 year ago|reply
OTOH, are you really going to be implementing memory segmenting in your gate-level CPU? I'd say actually creating a working CPU and _then_ emulating a real CPU (warts and all) are both necessary steps to real understanding.
[+] whobre|1 year ago|reply
Reading Petzold’s “Code” comes pretty close, though, and is easier.
[+] snvzz|1 year ago|reply
CPU was a poor choice of words. ISA would have worked.
[+] dmitrygr|1 year ago|reply
I've written fast emulators for a dozen non-toy architectures, and JIT translators for a few of them as well. x86 still gives me PTSD. I have never seen a messier architecture. There is history, and a reason for it, but still ... damn
[+] trealira|1 year ago|reply
Studying the x86 architecture is kind of like studying languages with lots of irregularities and vestigial bits, and with competing grammatical paradigms, e.g. French. Other architectures, like RISC-V and ARMv8, are much more consistent.
[+] jcranmer|1 year ago|reply
> I have never seen a messier architecture.

Itanium. Pretty much every time I open up the manual, I find a new thing that makes me go "what the hell were you guys thinking!?" without even trying to.

[+] Arech|1 year ago|reply
Haha, man, I feel you :DD You probably should have started with it from the very beginning :D
[+] t_sea|1 year ago|reply
> Writing a CPU emulator is, in my opinion, the best way to REALLY understand how a CPU works.

The 68k disassembler we wrote in college was such a Neo “I know kung fu” moment for me. It was the missing link that let me reason about code from high-level language down to transistors and back. I can only imagine writing a full emulator is an order of magnitude more effective. Great article!

[+] astrange|1 year ago|reply
I would say writing an ISA emulator is actually not helpful for understanding how a modern superscalar CPU works, because almost all of it is optimizations that are hidden from you.
[+] jmspring|1 year ago|reply
Apparently my memory is false; I thought the salsa20 variants and machine code were originally on cryp.to, but Dan Berstein's site is https://cr.yp.to/

This was at a startup where we were looking at data-at-rest encryption, streaming encryption, and other such things. Dan had a page with different implementations (cross-compiled from his assembler representation) targeting various chipsets and instruction sets. Using VMs (this was the early/mid 2000s) and such, it was interesting to see which of those instruction sets were supported. In testing, there would be occasional hiccups where an implementation wasn't fully supported even though the VM claimed it was.

[+] saagarjha|1 year ago|reply
You mean Dan Berstain’s site…wait.
[+] boricj|1 year ago|reply
It's funny to me how much grief x86 assembly generates when compared to RISC here, because I have the opposite problem when delinking code back into object files.

For this use case, x86 is really easy to analyze, whereas MIPS has been a nightmare to pull off. This is because mostly all I care about are references to code and data. x86 has pointer-sized immediate constants, while MIPS has split HI16/LO16 relocation pairs, which leads to all sorts of trouble with register usage graphs, code flow, and branch delay instructions.
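For the curious, the HI16/LO16 pairing looks roughly like this (Python sketch; the carry adjustment on the high half is needed because the low half is sign-extended when the address is rebuilt, as with lui+addiu):

```python
def hi_lo_split(addr: int) -> tuple[int, int]:
    """Split a 32-bit address into a MIPS-style %hi/%lo pair."""
    lo = addr & 0xFFFF
    hi = ((addr + 0x8000) >> 16) & 0xFFFF   # carry-adjusted high half
    return hi, lo

def hi_lo_join(hi: int, lo: int) -> int:
    """Rebuild the address: (hi << 16) + sign_extend(lo)."""
    lo_signed = lo - 0x10000 if lo & 0x8000 else lo
    return ((hi << 16) + lo_signed) & 0xFFFFFFFF
```

Note that the relinker sees the hi and lo halves as two separate relocations, often on instructions far apart, which is where the register-tracking pain comes from.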

That should not be construed as praise on my end for x86.

[+] astrange|1 year ago|reply
Yes, x86 is weird but the variable-length instructions are actually nice and easy to understand, once they're unpacked to text form anyway. The problem with them is they're insecure, because you can hide instructions in the middle of other instructions.

I think the biggest thing you learn in x86 assembly vs C is that signed/unsignedness becomes a property of the operation instead of the type.

It would be cool if you could use flags, which is easy/easier on some architectures like PPC/armv7, but x86 overwrites them too easily so it's too hard to use their values.
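The signedness point above can be sketched like this: CMP sets the flags once, and the branch picks the interpretation, with JB reading CF (unsigned below) and JL reading SF != OF (signed less-than). Toy 8-bit model, names my own:

```python
def cmp_flags(a: int, b: int, bits: int = 8):
    """Model of CMP a, b: returns the (CF, SF, OF) flags it would set."""
    mask = (1 << bits) - 1
    result = (a - b) & mask
    cf = (a & mask) < (b & mask)                 # unsigned borrow
    sf = bool(result >> (bits - 1))              # sign of the result
    sign_a = (a >> (bits - 1)) & 1
    sign_b = (b >> (bits - 1)) & 1
    # signed overflow: operand signs differ and result sign != a's sign
    of = sign_a != sign_b and ((result >> (bits - 1)) & 1) != sign_a
    return cf, sf, of

def jb(flags) -> bool:   # unsigned a < b
    cf, _, _ = flags
    return cf

def jl(flags) -> bool:   # signed a < b
    _, sf, of = flags
    return sf != of
```

So 0xFF vs 1 is "above" as unsigned 255 but "less" as signed -1: same flags, different branch.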

[+] ale42|1 year ago|reply
Shouldn't it be (2023) rather than (2013)?
[+] fjfaase|1 year ago|reply
Interesting read. I have a lot of respect for people who develop emulators for x86 processors. It is a complicated processor, and from first-hand experience I know that developing and debugging emulators for CPUs can be very challenging. In the past year, I spent some time developing a very limited i386 emulator [1], including some system calls, for executing the first steps of live-bootstrap [2], primarily to figure out how it works. I learned a lot about system calls and ELF.

[1] https://github.com/FransFaase/Emulator/

[2] https://github.com/fosslinux/live-bootstrap/

[+] banish-m4|1 year ago|reply
Most of the complexities lie in managing the various configurations of total system compatibility emulation, especially for timing, analog oddities, and whether to include bugs or not and for which steppings. If you want precise and accurate emulation, you have to have real hardware to validate behavior against. Then there are the cases of what not to emulate and offering better-than-original alternatives.
[+] SunlitCat|1 year ago|reply
Haha! Writing an x86 emulator! I still remember writing a toy emulator which was able to execute around the first 1000-ish lines of a real BIOS (and then it got stuck or looped when it started to access ports or so; I can't remember, it was too long ago, and I didn't continue it as I got more into DirectX and modern C++).
[+] waynecochran|1 year ago|reply
Intel architecture is loaded with historical artifacts. The switch in how segment registers were used as you went from real mode to protected mode was an incredible hardware hack to keep older software working. I blame Intel for why so many folks avoid assembly language. I programmed in assembly for years using TI's 84010 graphics chips and the design was gorgeous -- simple RISC instruction set, flat address space, and bit addressable! If during the earlier decades folks were programming using chips with more elegant designs, far more folks would be programming in assembly language (or at least would know how to).
[+] hajile|1 year ago|reply
> I blame Intel for why so many folks avoid assembly language.

x86 (the worst assembly of any of the top 50 most popular ISAs by a massive margin) and tricky MIPS branch delay slots trivia questions at university have done more to turn off programmers from learning assembly than anything else and it's not even close.

This is one reason I'm hoping that RISC-V kills off x86. It actually has a chance of once again allowing your average programmer to learn useful assembly.

[+] russdill|1 year ago|reply
What's crazy is that, depending on how deep you want to go, a lot of the information is not available in documents published by Intel. Fortunately, if it matters for emulators, it typically can be (or has been) reverse engineered.
[+] bheadmaster|1 year ago|reply
If Terry A. Davis is to be trusted, as long as you ignore the legacy stuff, x64 assembly is nice to work with.
[+] jecel|1 year ago|reply
Wouldn't that be the 34010?
[+] Sparkyte|1 year ago|reply
The footnotes are glorious. "He was convinced that using a shift would work and didn’t believe me when I said it wasn’t possible."
[+] djaouen|1 year ago|reply
“I don’t believe in emulatores.” - 0x86
[+] Quekid5|1 year ago|reply
Just as an adjacent aside from a random about learning by doing:

Implementing a ThingDoer is a huge learning experience. I remember doing co-op "write-a-compiler" coursework with another person. We were doing great, everything was working and then we got to the oral exam...

"Why is your Stack Pointer growing upwards?"

... I was kinda stunned. I'd never thought about that. We understood most of the things, but sometimes we kind of just bashed at things until they worked... and it turned out upward-growing SP did work (up to a point) on the architecture our toy compiler was targeting.