
huntie | 6 years ago

(2017)

It's probably worth noting that this is only a runtime assembler, i.e. it does no parsing. It also doesn't support all of the addressing modes for the instructions that it does support. Nonetheless it does show that assemblers aren't that hard. Adding support for the various addressing modes and amd64 does complicate things but not too badly. Moving forward from this you'd probably want a better scheme for handling the various "extra" bytes (SIB, REX, etc.).
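To make the "extra" bytes concrete, here is a minimal sketch of how a REX prefix and a ModRM byte combine for one register-to-register instruction, `mov r/m64, r64` (opcode `0x89`). The helper names (`rex`, `modrm`, `mov_reg_reg`) are made up for illustration and are not taken from the assembler being discussed. Memory operands and SIB bytes are left out entirely.

```python
# Hypothetical helpers: a sketch of REX + ModRM for `mov dst, src`
# between 64-bit registers only (no memory operands, no SIB byte).
REGS = {
    "rax": 0, "rcx": 1, "rdx": 2, "rbx": 3,
    "rsp": 4, "rbp": 5, "rsi": 6, "rdi": 7,
    "r8": 8, "r9": 9, "r10": 10, "r11": 11,
    "r12": 12, "r13": 13, "r14": 14, "r15": 15,
}

def rex(w, r, x, b):
    """Build a REX prefix: 0100WRXB."""
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b

def modrm(mod, reg, rm):
    """Build a ModRM byte from the low three bits of each register number."""
    return (mod << 6) | ((reg & 7) << 3) | (rm & 7)

def mov_reg_reg(dst, src):
    """Encode `mov dst, src` as REX.W + 89 /r with mod=11 (register direct)."""
    d, s = REGS[dst], REGS[src]
    # REX.R carries bit 3 of the reg field (src); REX.B carries bit 3 of r/m (dst).
    return bytes([rex(1, s >> 3, 0, d >> 3), 0x89, modrm(0b11, s, d)])

print(mov_reg_reg("rax", "rbx").hex())  # 4889d8
print(mov_reg_reg("r8", "rax").hex())   # 4989c0
```

The point of splitting out `rex` and `modrm` is that most other instructions reuse the same two helpers with a different opcode byte.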


deckard1|6 years ago

I've written a (mostly) fully-featured x86 assembler.

There are two gotchas that most people starting out probably won't realize until it's too late.

1) Encoding instructions on x86 can be more challenging than it first appears. Understanding how the pieces of an encoding fit together will save you a lot of headaches.

2) Forward jumps are a pain. Any reasonable assembler will provide symbolic labels for jumps and other operations. Because x86 instructions are variable length, you can't simply count up the lines to the label and multiply by some fixed amount. It doesn't work that way. Instructions can range from one byte to 15 bytes. What this means to you is that you'll need to design a multi-pass assembler. You allocate a fixed amount of space for the jump instruction and come back later to "patch" that jump instruction with the correct byte offset. Labels simply become keys in a dictionary that record the byte offset into the assembled binary data.
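A minimal sketch of the patching scheme described above, assuming a toy input format of tuples (`("label", name)`, `("nop",)`, `("jmp", name)`) invented for this example. Every jump is emitted as a near `jmp rel32` (opcode `0xE9`), so its size is always 5 bytes and only the 4-byte displacement needs patching afterwards.

```python
import struct

def assemble(program):
    """Assemble a toy program, backpatching jump displacements at the end."""
    code = bytearray()
    labels = {}   # label name -> byte offset into the output
    fixups = []   # (offset of the rel32 field, label name)

    for op, *args in program:
        if op == "label":
            labels[args[0]] = len(code)
        elif op == "nop":
            code.append(0x90)
        elif op == "jmp":
            code.append(0xE9)                   # near jmp, rel32 follows
            fixups.append((len(code), args[0]))
            code += b"\x00\x00\x00\x00"         # placeholder, patched below
        else:
            raise ValueError(f"unknown op: {op}")

    for pos, name in fixups:
        # The displacement is relative to the end of the jump instruction,
        # which is 4 bytes past the start of the rel32 field.
        rel = labels[name] - (pos + 4)
        struct.pack_into("<i", code, pos, rel)
    return bytes(code)

forward = [("jmp", "end"), ("nop",), ("nop",), ("label", "end"), ("nop",)]
print(assemble(forward).hex())  # e902000000909090
```

A reference to a label that is never defined raises `KeyError` in the patch loop, which a real assembler would turn into a proper diagnostic.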

Other things to make sure you understand: big/little endian, LSB/MSB, how two's-complement works, etc.
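As a quick illustration of why these matter, the sketch below encodes the same value in both byte orders and shows the two's-complement bit pattern of a negative jump displacement. The function names are invented for this example; x86 stores immediates and displacements in little-endian order.

```python
import struct

def le32(value):
    """Encode a signed 32-bit integer as 4 little-endian bytes (x86 order)."""
    return struct.pack("<i", value)

def be32(value):
    """Encode the same integer big-endian, for comparison."""
    return struct.pack(">i", value)

def twos_complement(value, bits):
    """Return the unsigned bit pattern that represents `value` in `bits` bits."""
    return value & ((1 << bits) - 1)

print(le32(0x12345678).hex())        # 78563412 (least significant byte first)
print(be32(0x12345678).hex())        # 12345678
print(hex(twos_complement(-6, 8)))   # 0xfa
print(le32(-6).hex())                # faffffff
```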

One surprising aspect of writing an assembler is that you learn to decode instructions by simply looking at a hex dump of the binary output. You almost feel like Neo in the Matrix, in a more nerdy, less awesome sort of way. You start thinking of the assembler and its macros as merely a convenience over having to write out the hex (or binary) of the instructions you want. And on top of that, you see C as a convenience over having to write assembly.

userbinator|6 years ago

> You allocate a fixed amount of space for the jump instruction and come back later to "patch" that jump instruction with the correct byte offset.

Also worth noting: there are "short" and "near" jumps, which trade a shorter encoding against a longer range. If the target is a forward reference, not yet known when the jump is assembled, then nearly all assemblers will use the long variant, because they don't know yet whether the destination will be close enough. One notable exception is fasm, which starts off "optimistic" about jump sizes and then enlarges only the jumps that didn't quite reach, repeating until all the offsets fit. Here's a series of very detailed posts about that from its author:

https://board.flatassembler.net/topic.php?t=20249
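The "optimistic" idea can be sketched roughly as follows. This is only an illustration of the general approach, not fasm's actual algorithm, and it reuses an invented toy input format of `("label", name)`, `("nop",)` and `("jmp", name)` tuples. Every jump starts as a 2-byte short jump (`0xEB rel8`) and is promoted to a 5-byte near jump (`0xE9 rel32`) only when its displacement does not fit in a signed byte.

```python
import struct

def assemble_relaxed(program):
    """Assemble with short jumps where possible, growing only those that fail."""
    near = set()  # indices of jumps that must use the 5-byte near form
    while True:
        # Lay out the program using the current size guesses.
        offsets, labels, pos = [], {}, 0
        for i, (op, *args) in enumerate(program):
            offsets.append(pos)
            if op == "label":
                labels[args[0]] = pos
            elif op == "nop":
                pos += 1
            elif op == "jmp":
                pos += 5 if i in near else 2
            else:
                raise ValueError(f"unknown op: {op}")

        # Promote any short jump whose displacement does not fit in rel8.
        grew = False
        for i, (op, *args) in enumerate(program):
            if op == "jmp" and i not in near:
                rel = labels[args[0]] - (offsets[i] + 2)
                if not -128 <= rel <= 127:
                    near.add(i)
                    grew = True
        if not grew:
            break  # layout is stable: every jump reaches its target

    # Emit bytes using the final, stable layout.
    code = bytearray()
    for i, (op, *args) in enumerate(program):
        if op == "nop":
            code.append(0x90)
        elif op == "jmp":
            size = 5 if i in near else 2
            rel = labels[args[0]] - (offsets[i] + size)
            if i in near:
                code += b"\xE9" + struct.pack("<i", rel)
            else:
                code += b"\xEB" + struct.pack("<b", rel)
    return bytes(code)

far = [("jmp", "end")] + [("nop",)] * 200 + [("label", "end")]
print(assemble_relaxed(far)[:5].hex())  # e9c8000000
```

In this sketch a jump is only ever promoted and never demoted, so the set of near jumps can only grow and the loop terminates after at most one pass per jump.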

mehrdadn|6 years ago

For multiple passes, is it ever possible that the next pass shrinks the size of the code? Off the top of my head I don't see how this could happen, but I've always wondered, since it would suggest you could end up bouncing back and forth on every pass unless you explicitly avoid it somehow... can that happen?

mhh__|6 years ago

Never having written an assembler for a proper ISA, I was under the impression that assemblers for real CPUs are extremely simple until you start writing them.

kabdib|6 years ago

One of my hobbies in the 1980s was writing assemblers, because most of the existing commercial offerings were pretty bad. I started out writing my own utterly terrible (though fast) assemblers, and after several years and my fifth or sixth try I had one that was fast, very usable and that people liked a lot. It was shipped as a component of our company's devkits and wasn't "commercial" in the sense of a standalone product, but I still count it a commercial success.

Things that make assemblers useful -

- A real macro language. The C-preprocessor does not count.

- Support of the manufacturer's mnemonics and syntax. Unless the manufacturer's official syntax is "AT&T / Unix assembler syntax" then that doesn't count, either. Parsing addressing modes is often painful -- the 68000 was a bear that took several days to get right -- but telling your users "Oh, just use this alternate syntax . . . documented where? Umm..." is lots more difficult.

- Listings and cross-referencing. Maybe this was a function of the era when I was writing these, but none of my earlier assemblers output a listing (addresses and generated bytes along with the program text) and my initial users were reluctant to use them. When I added listings to my last effort -- it took a day or two IIRC -- the tool suddenly became usable in their eyes.

- Speed. No reason these things can't crunch through a million lines a second.

(For my own game carts, I would print out a complete listing every couple of days because these were really helpful in debugging. At the end of a typical project I'd have a couple five-foot-high stacks of fanfold paper, which I'd have shredded. I'll point out that while my floppy-based copies of my games' sources have gone walkabout, I still have the most recent complete assembly listings in binders).
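For readers who haven't seen one, a listing of the kind described above (address, generated bytes, then the source text) can be produced with very little code. The function below is a hypothetical sketch; the column widths and the `(source, encoded_bytes)` input format are assumptions made for this example, not the format of any particular assembler.

```python
def format_listing(lines, origin=0):
    """Render (source_text, encoded_bytes) pairs as an assembly listing."""
    out = []
    addr = origin
    for source, encoded in lines:
        hexbytes = " ".join(f"{b:02X}" for b in encoded)
        # Address, then the bytes padded to a fixed column, then the source.
        out.append(f"{addr:08X}  {hexbytes:<24}{source}")
        addr += len(encoded)
    return "\n".join(out)

print(format_listing([
    ("mov rax, rbx", bytes.fromhex("4889d8")),
    ("nop", b"\x90"),
]))
```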

I've thought about reviving my hobby . . . but modern CPUs are much more complicated than the 68Ks and 8-bit wonders of decades ago, and life is too damned short to be writing assembly anyway.

chrisseaton|6 years ago

Assemblers are one of those things that's very simple for the simple cases, and then when you add more complex cases you start to think it's actually very hard until you go back and add the proper abstractions, which would have seemed needlessly complicated for the simple cases.

So people doing something trivial think they're easy (like this example), people trying to do a bit more think they're really hard, and people doing an entire assembler think they're easy again.

huntie|6 years ago

I wrote a runtime assembler for a small subset of amd64. I had basically zero experience with assembly and I was still able to write what I needed in half a day. The hard part is understanding how the extra bytes work, which I did by running small bits of assembly through the GNU assembler and then calling objdump on the output.

I think a RISC ISA might be more difficult to start with, and supporting things like AVX might be hard. I only needed support for some basic instructions though. The amd64 manual is actually pretty good. Overall, it was much easier than I expected it to be.