top | item 44440983

I’ve been trying out various LLMs for working on assembly code in my toy OS kernel for a few months now. It’s mostly low-level device setup and bootstrap code, and I’ve found they’re pretty terrible at it generally. They’ll often generate code that won’t quite assemble, they’ll hallucinate details like hardware registers, and very often they’ll come up with inefficient code. The LLM attempt at an AP bootstrap (real mode to long mode) was almost comical.

All that said, I’ve recently started a RISC-V port, and I’ve found they’re actually quite good at porting bits of low-level init code from x86 (NASM) to RISC-V (GAS) - I guess because it’s largely a simple translation job and they already have the logic to work from.

simonw|8 months ago

> They’ll often generate code that won’t quite assemble

Have you tried using a coding agent that can run the compiler itself and fix any errors in a loop?

The first version I got here didn't compile. Firing up Claude Code and letting it debug in a loop fixed that.
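
For readers who have not seen this kind of loop, here is a minimal sketch of the idea, not how Claude Code or any particular agent actually implements it. The function name `assemble_until_clean`, the `revise` callback (standing in for the model call), and the temp-file layout are all hypothetical; the assembler command is whatever you pass in.

```python
import os
import subprocess
import tempfile
from typing import Callable, Sequence, Tuple


def assemble_until_clean(
    source: str,
    assemble_cmd: Sequence[str],
    revise: Callable[[str, str], str],
    max_attempts: int = 5,
) -> Tuple[bool, str, int]:
    """Assemble `source`; on failure, ask `revise` for a new version and retry.

    `assemble_cmd` is the assembler invocation without the input file; the
    path of a temporary source file is appended as the last argument.
    `revise(source, stderr)` is a hypothetical stand-in for the model call.
    Returns (succeeded, final_source, attempts_used).
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.asm")
        for attempt in range(1, max_attempts + 1):
            with open(path, "w") as handle:
                handle.write(source)
            result = subprocess.run(
                [*assemble_cmd, path], capture_output=True, text=True
            )
            if result.returncode == 0:
                return True, source, attempt
            # Hand the assembler diagnostics back and try the revised source,
            # unless the attempt budget is already spent.
            if attempt < max_attempts:
                source = revise(source, result.stderr)
    return False, source, max_attempts
```

With NASM, the command might be something like `["nasm", "-f", "elf64", "-o", os.devnull]`, though that is only an assumed invocation. The `max_attempts` cap is there so the loop gives up rather than retrying the same mistake indefinitely.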

noone_youknow|8 months ago

I have, and to be fair that has solved the “basically incorrect code” issue with reasonable regularity. Occasionally the error messages don’t seem helpful enough for it, which is understandable, and I’ve had a few occurrences of it getting “stuck” in a loop trying to, e.g., use an invalid addressing mode (it may have gotten itself out of those situations if I were more patient). Generally, though, with one of the Claude 4 models in agent mode in Cursor or Claude Code, I’ve found it’s possible to get reasonably good results in terms of “does it assemble”.

I’m still working on a good way to integrate more feedback for this kind of workflow, e.g. for the attempt it made at AP bootstrap - debugging that is just hard, and giving an agent enough control over the running code and the ability to extract the information it would need to debug the resulting triple fault is an interesting challenge (even if probably not all that generally useful).

I have a bunch of pretty ad-hoc test harnesses and the like that I use for general hosted testing, but that can only get you so far in this kind of low-level code.
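
One possible starting point for the triple-fault feedback problem described above is to run the kernel under QEMU with interrupt logging and hand the exception chain to the agent. The sketch below is an illustration under assumptions, not the harness used here: the function names and the raw disk image argument are hypothetical, and the log-line pattern is what QEMU's TCG x86 backend emits under `-d int` as far as I know, so it may differ between versions.

```python
import re
from typing import List, Tuple

# Assumption: under `-d int`, QEMU (TCG, x86) logs lines shaped like
# "check_exception old: 0xd new 0x8". The format may vary across versions.
_EXCEPTION_LINE = re.compile(
    r"check_exception old: (0x[0-9a-fA-F]+) new (0x[0-9a-fA-F]+)"
)


def qemu_command(disk_image: str, log_path: str, cpus: int = 2) -> List[str]:
    """Build a QEMU invocation that logs interrupts and CPU resets to a file.

    `disk_image` is a hypothetical raw boot image; adjust to your boot setup.
    """
    return [
        "qemu-system-x86_64",
        "-accel", "tcg",          # interrupt logging needs TCG, not KVM
        "-smp", str(cpus),        # more than one CPU, so APs get started
        "-display", "none",
        "-no-reboot",             # stop on reset instead of rebooting
        "-d", "int,cpu_reset",
        "-D", log_path,
        "-drive", f"format=raw,file={disk_image}",
    ]


def exception_chain(log_text: str) -> List[Tuple[int, int]]:
    """Return (previous_vector, new_vector) pairs in the order they were logged."""
    return [
        (int(old, 16), int(new, 16))
        for old, new in _EXCEPTION_LINE.findall(log_text)
    ]


def ended_in_triple_fault(log_text: str) -> bool:
    """True if the log contains the 'Triple fault' marker (assumed wording)."""
    return "Triple fault" in log_text
```

The last few entries of `exception_chain` are usually the interesting part: a general protection fault (0xd) followed by a double fault (0x8) gives an agent far more to work with than "the machine reset".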

vidarh|8 months ago

Similar experience - they seem to generally have a lot more problems with ASM than structured languages. I don't know if this reflects less training data, or difficulty.

73kl4453dz|8 months ago

As far as I can tell they have trouble with sustained satisfaction of multiple constraints, and asm has more of that than higher-level languages. (An old boss once said his record for bug density was in asm: he'd written 3 bugs in a single opcode.)

msgodel|8 months ago

The few times I've messed with it I've noticed they're pretty bad at keeping track of registers as they move between subroutines. They're just not great at coming up with a consistent "sub language" the way human assembly programmers tend to.

LtdJorge|8 months ago

A bit tangential, but I've found Sonnet 4 to be much, much better at SIMD intrinsics (in my case, in Rust) than Sonnet 3.5 and 3.7, which were kind of atrocious. For example, 3.7 would write a scalar for loop and tell you "I've vectorized...", when I had explicitly asked it to do the operations with x86 intrinsics and given it the capabilities of the hardware. Also, telling it that AVX2 was supported would not stop it from using SSE, or it would add conditionals to choose between them, which makes no sense. Claude 4 seems to solve most of that.

Edit: that -> than

noone_youknow|8 months ago

This fits my experience. I’m definitely getting considerably better results with 4 than with previous Claudes. I’d essentially dropped Sonnet from my rotation before 4 became available, but now it’s a go-to for this sort of thing.