Bonus quirk: there's BSF/BSR, for which the Intel SDM states that on zero input, the destination has an undefined value. (AMD documents that the destination is not modified in that case.) And then there's glibc, which happily relies on the undocumented fact that the destination is also unmodified on Intel [1]. It took me quite some time to track down the issue in my binary translator. (There's also TZCNT/LZCNT, which is BSF/BSR encoded with an F3 prefix -- a prefix that is silently ignored on older processors not supporting the extension, so the same code behaves differently on different CPUs. At least that's documented.)
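For an emulator, the safe choice is to model the de-facto behavior both vendors share. A minimal sketch (the names and structure are mine, not from any real translator):

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of BSF semantics as a binary translator might model them.
 * Per the Intel SDM the destination is "undefined" for src == 0, but
 * AMD documents -- and Intel silently implements, as glibc relies on --
 * that the destination is left unmodified, so we emulate exactly that.
 * ZF is set when src == 0, cleared otherwise. */
static uint64_t emul_bsf64(uint64_t src, uint64_t old_dst, int *zf)
{
    if (src == 0) {
        *zf = 1;
        return old_dst;          /* keep the previous register value */
    }
    *zf = 0;
    uint64_t idx = 0;
    while (!(src & 1)) {
        src >>= 1;
        idx++;
    }
    return idx;                  /* index of the lowest set bit */
}
```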
Encoding: People often complain about prefixes, but IMHO, that's far from the worst thing. It is well known and somewhat well documented. There are worse quirks: For example, REX/VEX/EVEX.RXB extension bits are ignored when they do not apply (e.g., MMX registers); except for mask registers (k0-k7), where they trigger #UD -- also fine -- except if the register is encoded in ModRM.rm, in which case the extension bit is ignored again.
APX takes the number of quirks to a different level: the REX2 prefix can encode general-purpose registers r16-r31, but not xmm16-xmm31; the EVEX prefix has several opcode-dependent layouts; and which extension bits are used for a register depends on the register type (XMM registers use X3:B3:rm and V4:X3:idx; GP registers use B4:B3:rm, X4:X3:idx). I can't give a complete list yet, I still haven't finished my APX decoder after a year...
On and off over the last year I have been rewriting QEMU's x86 decoder. It started as a necessary task to incorporate AVX support, but I am now at a point where only a handful of opcodes are left to rewrite, after which it should not be too hard to add APX support. For EVEX my plan is to keep the raw bits until after the opcode has been read (i.e. before immediates and possibly before modrm) and the EVEX class identified.
My decoder is mostly based on the tables in the manual, and the code is mostly okay—not too much indentation and phases mostly easy to separate/identify. Because the output is JITted code, it's ok to not be super efficient and keep the code readable; it's not where most of the time is spent. Nevertheless there are several cases in which the manual is wrong or doesn't say the whole story. And the tables haven't been updated for several years (no K register instructions, for example), so going forward there will be more manual work to do. :(
(As I said above, there are still a few instructions handled by the old code predating the rewrite, notably BT/BTS/BTR/BTC. I have written the new code but not merged it yet.)
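The "keep the raw EVEX bits until the opcode is known" plan can be sketched like this (illustrative only, not QEMU's actual structures):

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative sketch: store the EVEX payload bytes uninterpreted at
 * prefix-scan time, and only pull fields out once the opcode byte has
 * identified the EVEX class/layout that applies. */
typedef struct {
    uint8_t p[3];       /* raw payload bytes P0..P2 */
    uint8_t opcode;
} RawEvex;

/* EVEX.vvvv lives in P1 bits 6:3, stored inverted; decode it lazily. */
static int evex_vvvv(const RawEvex *e)
{
    return ((e->p[1] >> 3) & 0xf) ^ 0xf;   /* extract, then un-invert */
}
```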
Another bonus quirk, from the 486 and Pentium era...
BSWAP EAX converts between little endian and big endian. It was a 32-bit instruction to begin with.
However, there is the 0x66 prefix that switches between 16- and 32-bit operand size. If you apply that to BSWAP EAX, undefined, funky things happen.
On some CPUs (Intel vs. AMD differ) the prefix was simply ignored. On others it did something I call an "inner swap": of the four bytes stored in EAX, bytes 1 and 2 are swapped.
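Both observed behaviors can be modeled in a few lines (the "inner swap" byte layout below is my reading of the description above, not vendor documentation):

```c
#include <stdint.h>
#include <assert.h>

/* Normal 32-bit BSWAP: reverse all four bytes. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* The "inner swap" described above: only bytes 1 and 2 trade places,
 * bytes 0 and 3 stay put. */
static uint32_t bswap32_inner(uint32_t x)
{
    return (x & 0xff0000ffu)
         | ((x & 0x0000ff00u) << 8)
         | ((x & 0x00ff0000u) >> 8);
}
```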
The semantics of LZCNT combined with its encoding feel like an own goal: it's encoded as a BSR instruction with a legacy-ignored prefix, but for nonzero inputs its result is the operand size minus one, minus the result of the legacy instruction. Yes, clz() is a function that exists, but the extra subtraction in its implementation feels like a small cost to pay for extra compatibility, when LZCNT could've just been BSR with different zero-input semantics.
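The relationship for nonzero inputs, spelled out as a plain-C model (not the actual intrinsics):

```c
#include <stdint.h>
#include <assert.h>

/* BSR: index of the highest set bit (x must be nonzero). */
static int bsr32(uint32_t x)
{
    int i = 31;
    while (!(x >> i))
        i--;
    return i;
}

/* LZCNT: for nonzero x it equals 31 - bsr32(x); for zero it returns the
 * operand size -- exactly the case BSR leaves undefined. */
static int lzcnt32(uint32_t x)
{
    return x ? 31 - bsr32(x) : 32;
}
```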
I know nothing about this space, but it would be interesting to hook up a JTAG interface to an x86 CPU and then step instruction by instruction, recording all the register values.
You could then use this data to test whether your emulator perfectly emulates the hardware, by running the same program through the emulated CPU and validating that the state is the same at every instruction.
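That validation loop boils down to something like the following (hypothetical types and names, just to show the idea):

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Hypothetical per-instruction register snapshot. */
typedef struct {
    uint64_t rip, rax, rbx, rcx, rdx, rflags;
} RegState;

/* Compare a hardware trace (e.g. captured via JTAG) against an emulator
 * trace; return the index of the first diverging instruction, or -1 if
 * the two runs agree at every step. */
static long first_divergence(const RegState *hw, const RegState *emu, long n)
{
    for (long i = 0; i < n; i++)
        if (memcmp(&hw[i], &emu[i], sizeof(RegState)) != 0)
            return i;
    return -1;
}
```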
The BSF/BSR quirk is annoying, but I think the reason for it is that they were only thinking about it being used in a loop (or maybe an if) with something like:
unsigned long mask = something, index;
...
for (; _BitScanForward(&index, mask); mask ^= 1ul << index) {
    ...
}
Since it sets the ZF on a zero input, they thought that must be all you need. But there are many other uses for (trailing|leading) zero count operations, and it would have been much better for them to just write the register anyway.
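A portable way to write the same iterate-over-set-bits loop today uses the GCC/Clang builtin (which compiles down to TZCNT/BSF) plus the mask &= mask - 1 trick to clear the lowest set bit:

```c
#include <assert.h>

/* Collect the indices of all set bits, lowest first; returns the count.
 * __builtin_ctz is undefined for 0, so the loop guard handles that case. */
static int collect_set_bits(unsigned mask, int *out)
{
    int n = 0;
    while (mask) {
        out[n++] = __builtin_ctz(mask);  /* index of the lowest set bit */
        mask &= mask - 1;                /* clear the lowest set bit */
    }
    return n;
}
```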
What a cool person. I really enjoy writing assembly; it feels so simple, and I love the vertical aesthetic quality.
The closest I've ever come to something like OP (which is to say, not close at all) was when I was trying to help my JS friend understand the stack, and we ended up writing a mini vm with its own little ISA: https://gist.github.com/darighost/2d880fe27510e0c90f75680bfe...
This could have gone much deeper - I'd have enjoyed that, but doing so would have detracted from the original educational goal, lol. I should contact that friend and see if he still wants to study with me. It's hard since he's making so much money doing fancy web dev, he has no time to go deep into stuff, whereas my unemployed ass is basically an infinite ocean of time and energy.
Thank you for this. It took me some time to understand what's going on, which led me on a cool (and very challenging) assembly journey. As a non-CS major, this code is a really nice entry point for me to start understanding how things work. I will definitely dig deeper.
I think both are useful, but designing a modern CPU from the gate level is out of reach for most folks, and I think there's a big gap between the sorts of CPUs we designed in college and the sort that run real code. I think creating an emulator of a modern CPU is a somewhat more accessible challenge, while still being very educational even if you only get something partially working.
Seconded. A microcoded, pipelined, superscalar, branch-predicting basic processor with L1 data & instruction caches and write-back L2 cache controller is nontrivial. Most software engineers have an incomplete grasp of data hazards, cache invalidation, or pipeline stalls.
Well, I think you're both right. It's satisfying as heck to sling 74xx chips together and you get a feel for the electrical side of things and internal tradeoffs.
When you get to doing that for the CPU that you want to do meaningful work with, you start to lose interest in that detail. Then the complexities of the behavior and spec become interesting and the emulator approach is more tractable, can cover more types of behavior.
So far on my journey through Nand2Tetris (since I kind of dropped out of my real CS course) I've enjoyed the entire process of working my way up from the gate level, and I just finished the VM emulator chapter, which took an eternity. Now onto compilation.
OTOH, are you really going to implement memory segmentation in your gate-level CPU? I'd say actually creating a working CPU and _then_ emulating a real CPU (warts and all) are both necessary steps to real understanding.
I've written fast emulators for a dozen non-toy architectures, and JIT translators for a few of them as well. x86 still gives me PTSD. I have never seen a messier architecture. There is history, and a reason for it, but still... damn.
Studying the x86 architecture is kind of like studying languages with lots of irregularities and vestigial bits, and with competing grammatical paradigms, e.g. French. Other architectures, like RISC-V and ARMv8, are much more consistent.
Itanium. Pretty much every time I open up the manual, I find a new thing that makes me go "what the hell were you guys thinking!?" without even trying to.
I recently implemented a good portion of an x86(-64) decoder for a side project [1] and was kind of surprised by how much more complicated it has become in recent years. Sandpile.org [2] was really useful for my purpose.
> Writing a CPU emulator is, in my opinion, the best way to REALLY understand how a CPU works.
The 68k disassembler we wrote in college was such a Neo “I know kung fu” moment for me. It was the missing link that let me reason about code from high-level language down to transistors and back. I can only imagine writing a full emulator is an order of magnitude more effective. Great article!
I would say writing an ISA emulator is actually not helpful for understanding how a modern superscalar CPU works, because almost all of it is optimizations that are hidden from you.
Apparently my memory is false: I thought the Salsa20 variants and machine code were originally on cryp.to, but Dan Bernstein's site is https://cr.yp.to/
While at a startup, we were looking at data-at-rest encryption, streaming encryption, and other such things. Dan had a page with different implementations (cross-compiled from his assembler representation) targeting various chipsets and instruction sets. Using VMs (this was the early/mid 2000s), it was interesting to see which of those instruction sets were supported. In testing, there would be occasional hiccups where an implementation wasn't fully supported even though the VM claimed it was.
It's funny to me how much grief x86 assembly generates when compared to RISC here, because I have the opposite problem when delinking code back into object files.
For this use-case, x86 is really easy to analyze whereas MIPS has been a nightmare to pull off. This is because all I mostly care about are references to code and data. x86 has pointer-sized immediate constants and MIPS has split HI16/LO16 relocation pairs, which leads to all sorts of trouble with register usage graphs, code flow and branch delay instructions.
That should not be construed as praise on my end for x86.
Yes, x86 is weird but the variable-length instructions are actually nice and easy to understand, once they're unpacked to text form anyway. The problem with them is they're insecure, because you can hide instructions in the middle of other instructions.
I think the biggest thing you learn in x86 assembly vs C is that signed/unsignedness becomes a property of the operation instead of the type.
It would be cool if you could use flags, which is easy/easier on some architectures like PPC/armv7, but x86 overwrites them too easily so it's too hard to use their values.
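The signedness-as-a-property-of-the-operation point in miniature: after a CMP, x86 leaves both answers in the flags at once -- JL/JG read SF/OF (signed), JB/JA read CF (unsigned) -- and only the jump you pick decides which interpretation applies. A plain-C model of the two readings:

```c
#include <stdint.h>
#include <assert.h>

/* Interpret the same byte both ways, as the two flag groups do.
 * Both return -1, 0, or 1, like a three-way compare. */
static int cmp_signed(uint8_t a, uint8_t b)    /* what JL/JG would test */
{
    return ((int8_t)a > (int8_t)b) - ((int8_t)a < (int8_t)b);
}

static int cmp_unsigned(uint8_t a, uint8_t b)  /* what JB/JA would test */
{
    return (a > b) - (a < b);
}
```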
Interesting read. I have a lot of respect for people who develop emulators for x86 processors. It is a complicated architecture, and from first-hand experience I know that developing and debugging CPU emulators can be very challenging. In the past year, I spent some time developing a very limited i386 emulator [1], including some system calls, for executing the first steps of live-bootstrap [2], primarily to figure out how it works. I learned a lot about system calls and ELF.
Most of the complexities lie in managing the various configurations of total system compatibility emulation, especially for timing, analog oddities, and whether to include bugs or not and for which steppings. If you want precise and accurate emulation, you have to have real hardware to validate behavior against. Then there are the cases of what not to emulate and offering better-than-original alternatives.
Haha! Writing an x86 emulator! I still remember writing a toy emulator that was able to execute something around the first 1000-ish lines of a real BIOS (and then it got stuck or looped when it started to access ports or so; I can't remember, it was too long ago, and I didn't continue it as I started to get into DirectX and modern C++ more).
Intel architecture is loaded with historical artifacts. The switch in how segment registers were used as you went from real mode to protected mode was an incredible hardware hack to keep older software working. I blame Intel for why so many folks avoid assembly language. I programmed in assembly for years using TI's TMS34010 graphics chips, and the design was gorgeous -- simple RISC instruction set, flat address space, and bit-addressable! If during those earlier decades folks had been programming on chips with more elegant designs, far more folks would be programming in assembly language (or at least would know how to).
> I blame Intel for why so many folks avoid assembly language.
x86 (the worst assembly of any of the top 50 most popular ISAs by a massive margin) and tricky MIPS branch delay slots trivia questions at university have done more to turn off programmers from learning assembly than anything else and it's not even close.
This is one reason I'm hoping that RISC-V kills off x86. It actually has a chance of once again allowing your average programmer to learn useful assembly.
What's crazy is that depending on how deep you want to go, a lot of the information is not available in documents published by Intel. Fortunately, if it matters for emulators it typically can be (or has been) reverse engineered.
Just as an adjacent aside from a random about learning by doing:
Implementing a ThingDoer is a huge learning experience. I remember doing co-op "write-a-compiler" coursework with another person. We were doing great, everything was working and then we got to the oral exam...
"Why is your Stack Pointer growing upwards"?
... I was kinda stunned. I'd never thought about that. We understood most of the things, but sometimes we kind of just bashed at things until they worked... and it turned out upward-growing SP did work (up to a point) on the architecture our toy compiler was targeting.
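An upward-growing stack really does work as long as push and pop agree on the convention; only the target's own conventions (CALL/RET, interrupt frames, etc.) force the usual downward direction. A toy version of the idea:

```c
#include <stdint.h>
#include <assert.h>

/* Toy VM stack that grows toward higher addresses. */
static uint32_t stack[64];
static int sp = 0;                 /* next free slot; grows upward */

static void push(uint32_t v) { stack[sp++] = v; }
static uint32_t pop(void)    { return stack[--sp]; }
```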
[1]: https://sourceware.org/bugzilla/show_bug.cgi?id=31748
The top comment explains a bit what's going on: https://github.com/qemu/qemu/blob/59084feb256c617063e0dbe7e6...
X86 used to be Intel's moat, but what a nightmarish burden to carry.
The docs are an amazing tour of how the CPU works.
Cannot believe it's been 16 months. How time flies.
Hard disagree.
The best way is to create a CPU from the gate level, like you do on a decent CS course. (I really enjoyed making a cut-down ARM from scratch.)
[1] Namely, a version of Fabian Giesen's disfilter for x86-64, for yet another side project which is still not public: https://gist.github.com/lifthrasiir/df47509caac2f065032ef72e...
[2] https://sandpile.org/
[1] https://github.com/FransFaase/Emulator/
[2] https://github.com/fosslinux/live-bootstrap/