I saw the repeating 'A' at the end of the base64 text and thought "it's not even 512 bytes; it's smaller!"
That said, the title is just a little clickbaity --- it's a C-subset compiler, and more accurately a JIT interpreter. There also appears to be no attempt at operator precedence. Nonetheless, it's still an impressive technical achievement and shows the value of questioning common assumptions.
Finally, I feel tempted to offer a small size optimisation:
sub ax,2
is 3 bytes whereas
dec ax
dec ax
is 2 bytes.
You may be able to use single-byte xchg's with ax instead of movs, and the other thing which helps code density a lot in 16-bit code is to take advantage of the addressing modes and LEA to do 3-operand add immediates where possible.
Good tip! Yeah, there’s ~20 bytes unused at the end. I kept finding ways to squeeze out a few more and had to tell myself to stop and just publish it already. You could take this further if you really wanted. But it’s already sufficiently absurd.
This reminded me the idea of compilers bootstrapping (https://news.ycombinator.com/item?id=35714194). That is, now you can code in SectorC some slightly more advanced version of C capable of compiling TCC (https://bellard.org/tcc/), and then with TCC you can go forward to GCC and so on.
Perhaps my favorite thing in all of computing is the authors of Lisp bootstrapping by writing an (academic) interpreter for Lisp in Lisp, and then realizing they could just compile that by hand.
The TCC step seems unnecessary. If you've got a SectorC C compiler sufficient to compile TCC, it can probably compile the bootstrap compiler used in GCC's 3 stage build process. The bigger issue is probably getting the ancillary tools up and running (a shell, make, coreutils, ...)
Now they just need to port something like oneKpaq to 16 bit or maybe something from the extremely tiny decompressor thread [1], just to test compression level to get an idea kpaq on its quickest setting(taking minutes instead of what could be days on its highest) reduced SectorC to 82.81% of its size, of course adding the 128 bit stub knocked it to 677 bytes. It would be interesting to try it on the slowest takes day to bruteforce setting, but I'm not going to attempt that.
Some of the compressors in that forum thread since they are 32 bytes and such, might find it easier to get net gains.
LZ decompressors are tiny, as your second link discusses, but it is unlikely that they'll find much redundancy that could easily be removed from the uncompressed program itself, thus removing the need to use them.
I once tried making a 1kb binary to see what I could fit in, once you've got to the point of considering compression there isn't much ordered data left to compress. I worked it out to be a ~10 byte saving.
This is fascinating, I really did not think it was possible to implement even a tiny subset of C in just 512 bytes of x86 code. Using atoi() as a generic hash function is a brilliantly awful hack!
Hashes are perhaps the holy-grail of computer-science. With a good hash, we can just side-step all the hard problems by trading them for an even harder problem (hash collisions), and then we just ignore that harder problem. Brilliant. (sticks fingers in ears)
I wrote a similar x86-16 assembler in < 512 B of x86-16 assembly, and this seems much more difficult <https://github.com/kvakil/0asm/>. I did find a lot of similar tricks were helpful: using gadgets and hashes. Once trick I don't see in sectorc which shaved quite a bit off of 0asm was self-modifying code, which 0asm uses to "change" to the second-pass of the assembler. (I wrote some other techniques here: <https://kvakil.me/posts/asmkoan.html>.)
I considered self-modifying code, but somehow I kept finding more ways to squeeze bytes out. I’m half convinced that you could condense it another 50 ish bytes and add operator precedence or even local vars. But.. frankly.. I was ready to switch my attention to a new project.
The C Star (C*) language from the selfie project also does not support structs, yet in 12KLOC of code they implemented a C Star compiler that can compile selfie (and outputs ELF files), an emulator that runs RISC-U (RISC-V subset), and a hypervisor.
Not using this, but tangentially related is (full disclosure, i am a maintainer of this project) live-bootstrap, which uses about a KB of binary to do a full "Linux from scratch" style thing - read https://github.com/fosslinux/live-bootstrap/blob/master/part... for all 143 steps you have to go through to get there.
I can't help but wonder if this is written in x86-16 bit mode to implicitly use Real Mode BIOS functions and platform interfaces as well.
There's something to be said for taking advantage of that existing code; but if that's a dependency it should be part of the environment manifest. Everything has (most things have) a context where it might be useful and that should be included in the explanation so a tool isn't misused and so incorrect expectations aren't set.
I started reading the source... and digging for the part that allocates space for variables.... only to realize variable declarations are ignored and unnecessary... wow... what a breathlessly reckless hack! I love it!
It's like using an M18A1 Claymore mine and hoping it actually is aimed (and stays aimed) in the right direction.
> I will call it the Barely C Programming Language
Or BCPL, for short.
> The C programming language was devised in the early 1970s as a system implementation language for the nascent Unix operating system. Derived from the typeless language BCPL, it evolved a type structure; created on a tiny machine as a tool to improve a meager programming environment, it has become one of the dominant languages of today. This paper studies its evolution. [1]
something like this could be interesting for deep-space applications where you only have a bare metal environment with hardened processor and limited memory & of course ping time of days (to earth).
or alternatively for embedding a C compiler inside a LLM to use the LLM as a form of virtual machine.
You mean like for retrofitting an existing satellite? It seems like we're beyond the era of extremely constrained embedded environments, even for space hardware. Several hundred Mhz PowerPC based avionics computers with hundreds of MB of RAM seemed pretty common 15 years ago.
What would a C interpreter do that wouldn't be done better by simply uploading a new compiled binary, though? The source code is most likely less space efficient than machine code.
do you think there are any lessons that can be applied to a "normal" interpreter/compiler written in standard C? i'm always interested in learning how to reduce the size of my interpreter binaries
Hard to say. I’m fairly sure that all of modern software could easily be 2-3 orders of magnitude smaller. But, the world has decided (and I think rightfully so) that it doesn’t matter. We have massive memories and storage systems. Unless you have a very constrained system (power, etc) then I think the big bloated, lots of features approach is the winner (sadly).
[+] [-] userbinator|2 years ago|reply
That said, the title is just a little clickbaity --- it's a C-subset compiler, and more accurately a JIT interpreter. There also appears to be no attempt at operator precedence. Nonetheless, it's still an impressive technical achievement and shows the value of questioning common assumptions.
Finally, I feel tempted to offer a small size optimisation:
is 3 bytes whereas is 2 bytes.You may be able to use single-byte xchg's with ax instead of movs, and the other thing which helps code density a lot in 16-bit code is to take advantage of the addressing modes and LEA to do 3-operand add immediates where possible.
[+] [-] xorvoid|2 years ago|reply
[+] [-] rwmj|2 years ago|reply
[+] [-] kyberias|2 years ago|reply
[+] [-] xonix|2 years ago|reply
[+] [-] endgame|2 years ago|reply
https://guix.gnu.org/blog/2023/the-full-source-bootstrap-bui...
[+] [-] xorvoid|2 years ago|reply
I considered writing a self-hosting compiler. It should be doable without too much work. I was going to do one in BarelyC originally.
SectorC -> TCC -> GCC sounds fun. C all the way down, haha
[+] [-] plagiarist|2 years ago|reply
[+] [-] bragr|2 years ago|reply
https://gcc.gnu.org/install/build.html
[+] [-] agumonkey|2 years ago|reply
[+] [-] molticrystal|2 years ago|reply
Some of the compressors in that forum thread since they are 32 bytes and such, might find it easier to get net gains.
[0] https://github.com/temisu/oneKpaq
[1] https://encode.su/threads/3387-(Extremely)-tiny-decompressor...
[+] [-] userbinator|2 years ago|reply
[+] [-] benj111|2 years ago|reply
[+] [-] kiwidrew|2 years ago|reply
[+] [-] kitd|2 years ago|reply
Hashes are perhaps the holy-grail of computer-science. With a good hash, we can just side-step all the hard problems by trading them for an even harder problem (hash collisions), and then we just ignore that harder problem. Brilliant. (sticks fingers in ears)
[+] [-] kvakil|2 years ago|reply
I wrote a similar x86-16 assembler in < 512 B of x86-16 assembly, and this seems much more difficult <https://github.com/kvakil/0asm/>. I did find a lot of similar tricks were helpful: using gadgets and hashes. Once trick I don't see in sectorc which shaved quite a bit off of 0asm was self-modifying code, which 0asm uses to "change" to the second-pass of the assembler. (I wrote some other techniques here: <https://kvakil.me/posts/asmkoan.html>.)
bootOS (<https://github.com/nanochess/bootOS>) and other tools by the author are also amazing works of assembly golf.
[+] [-] xorvoid|2 years ago|reply
[+] [-] vitiral|2 years ago|reply
I'll point out to any passerby that this C doesn't support structs, so it's unlikely you'd actually want to build anything in it.
[+] [-] NonEUCitizen|2 years ago|reply
[+] [-] HarHarVeryFunny|2 years ago|reply
I think this, from the conclusion, is the real takeaway:
> Things that seem impossible often aren’t and we should Just Do It anyway
I certainly would never have tried to get a C compiler (even a subset) so small since it my instinct would have been that it was not possible.
[+] [-] sylware|2 years ago|reply
[+] [-] gigel82|2 years ago|reply
[+] [-] fosslinux|2 years ago|reply
[+] [-] mjevans|2 years ago|reply
There's something to be said for taking advantage of that existing code; but if that's a dependency it should be part of the environment manifest. Everything has (most things have) a context where it might be useful and that should be included in the explanation so a tool isn't misused and so incorrect expectations aren't set.
[+] [-] vitiral|2 years ago|reply
[+] [-] mikewarot|2 years ago|reply
It's like using an M18A1 Claymore mine and hoping it actually is aimed (and stays aimed) in the right direction.
[+] [-] keyle|2 years ago|reply
[+] [-] naruhodo|2 years ago|reply
Or BCPL, for short.
> The C programming language was devised in the early 1970s as a system implementation language for the nascent Unix operating system. Derived from the typeless language BCPL, it evolved a type structure; created on a tiny machine as a tool to improve a meager programming environment, it has become one of the dominant languages of today. This paper studies its evolution. [1]
[1] https://www.bell-labs.com/usr/dmr/www/chist.html
[+] [-] pgt|2 years ago|reply
[+] [-] wkz|2 years ago|reply
Especially liked this nugget:
> (NOTE: This grammar is 704 bytes in ascii, 38% larger than it's implementation!)
[+] [-] mihaic|2 years ago|reply
I would have wished some explanation on where the function calls like vga_init and vga_set_pixel come from, I'm not a graybeard yet.
[+] [-] tomsmeding|2 years ago|reply
[+] [-] DesiLurker|2 years ago|reply
or alternatively for embedding a C compiler inside a LLM to use the LLM as a form of virtual machine.
[+] [-] sgtnoodle|2 years ago|reply
What would a C interpreter do that wouldn't be done better by simply uploading a new compiled binary, though? The source code is most likely less space efficient than machine code.
[+] [-] junon|2 years ago|reply
[+] [-] scrawl|2 years ago|reply
do you think there are any lessons that can be applied to a "normal" interpreter/compiler written in standard C? i'm always interested in learning how to reduce the size of my interpreter binaries
[+] [-] xorvoid|2 years ago|reply
[+] [-] zoom6628|2 years ago|reply
[+] [-] Snelius|2 years ago|reply
[+] [-] ezekiel68|2 years ago|reply