For a learning project like this one, this would probably be overkill, but my personal suggestion for instruction decoding is that it really pays in the long term to use a data driven decoder. It's fairly easy to do a handcoded "if bits A..B are 0b1000 and..." decoder for the basic integer parts of the instruction set, but especially as you get into complexities like SIMD and if you need your decoder to be easy to modify to add new instructions later this gets very unwieldy.
QEMU switched to a data driven representation with a python program to autogenerate the "check bit patterns and extract fields" code, and it's one of the better design overhauls we've done: we started using it mostly for new code but went back and converted some of the old handwritten decoders too. It's much easier to add a new instruction when you only need to add a line like
USADA8 ---- 0111 1000 rd:4 ra:4 rm:4 0001 rn:4
and add a function trans_USADA8() that gets called with the field values, compared to trying to find the right place in a big existing set of handcoded switch and if statements to add the extra checks for the insn.
Getting a bit off topic, but I feel like this task is something that ought to have special language support.
It's a kind of serialization/deserialization, or what I think Python and some others call "pickling". Same task. Turn these raw bit patterns into typed values.
Ada probably comes closest of the major languages to pulling it off. It has separation of the abstract/programmer's view of a data type and the implementation / low representation of that type.
Specify a bunch of records like:
for Instruction use record
Condition at 0 range 31 .. 28;
ImmFlag at 0 range 27 .. 27;
Opcode at 0 range 24 .. 21;
CondFlag at 0 range 20 .. 20;
Rn at 0 range 19 .. 16;
Rd at 0 range 15 .. 12;
Operand at 0 range 11 .. 0;
end record;
Then aim a pointer at your instructions and read them as records/structs.
It works particularly cleanly with a nice RISC encoding like ARM. I'm not actually sure if that would work in Ada. The use representation syntax might not be generic enough.
Did you or anyone else figure out how to work with the Aarch64/ARM32 open-source JSON schemas? I could never figure them out and I feel like if I wanted to work with them I'd have to manually write Pydantic models for them or something, because Quicktype chokes on them. Mainly because the schemas are recursive. If there's one thing I like about RISC-V, it's that their riscv-opcodes instruction dictionaries are trivial to work with (although it's tricky for me to auto-generate "this operand is signed, this one isn't" logic since the repo currently doesn't convey that information).
I built an emulator in Python with my students, aged 15 and 16. All it did was have a little chunk of memory, PC, some registers, arithmetic, and branching, but it was really fun and a surprisingly tractable project to work on. Once you have a working “machine” with an assembler you can then start to automate function call and return which is of course a large part of what a compiler does for you.
With hindsight I would have loved to do a proper compiler but that’s undergrad level really. I really recommend it as a toy post-food-coma project for when you’re stuck with the family either next week or at the end of December :)
Nice article and especially so for including the parsing that most people just outsource. What's great about using an emulator is that you can also do fun things with the syscalls like implementing your own "virtual filesystem" instead of just translating directly to the x86_64 equivalent syscall: https://github.com/gamozolabs/fuzz_with_emus/blob/master/src... (not my code but basically something like this)
Looks nice, but the massive performance hit will be from constant parsing of arguments of opcodes. If you are emulating relatively small binaries (i.e. embedded stuff with few megabytes of program), it is better to approach each instruction as an object, parse it once and then save it into RAM into some tree structure so you can quickly find given opcode by its binary representation and then see if it has been parsed or not yet.
pm215|3 months ago
QEMU switched to a data driven representation with a python program to autogenerate the "check bit patterns and extract fields" code, and it's one of the better design overhauls we've done: we started using it mostly for new code but went back and converted some of the old handwritten decoders too. It's much easier to add a new instruction when you only need to add a line like
and add a function trans_USADA8() that gets called with the field values, compared to trying to find the right place in a big existing set of handcoded switch and if statements to add the extra checks for the insn.retrac|3 months ago
It's a kind of serialization/deserialization, or what I think Python and some others call "pickling". Same task. Turn these raw bit patterns into typed values.
Ada probably comes closest of the major languages to pulling it off. It has separation of the abstract/programmer's view of a data type and the implementation / low representation of that type.
Specify a bunch of records like:
Then aim a pointer at your instructions and read them as records/structs.It works particularly cleanly with a nice RISC encoding like ARM. I'm not actually sure if that would work in Ada. The use representation syntax might not be generic enough.
ethin|3 months ago
xnacly|3 months ago
gorgoiler|3 months ago
With hindsight I would have loved to do a proper compiler but that’s undergrad level really. I really recommend it as a toy post-food-coma project for when you’re stuck with the family either next week or at the end of December :)
costco|3 months ago
anthk|3 months ago
https://github.com/pts/pts-mips-emulator
thesnide|3 months ago
general1465|3 months ago
bArray|3 months ago
xnacly|3 months ago
syntex|3 months ago
Jeff-Collins|3 months ago
[deleted]
Will-Reppeto|3 months ago
[deleted]
dosinga|3 months ago
unknown|3 months ago
[deleted]