
How is Ultrassembler so fast?

124 points | netr0ute | 6 months ago | jghuff.com

51 comments


vlovich123|6 months ago

> Additionally, in C++, requesting that heap memory also requires a syscall every time the container geometrically changes size

That is not true - no allocator I know of (and certainly not the default glibc allocator) allocates memory in this way. It only does a syscall when it doesn’t have free userspace memory to hand out but it overallocates that memory and also reuses memory you’ve already freed.

idiomat9000|6 months ago

Wasn't there also a scheme to over-allocate on the first geometric expansion and mark the 2nd as space for likely short-lived objects?

aidenn0|6 months ago

Exceptions in C++ are never zero-overhead. There is a time-space tradeoff for performance of uncaught exceptions, and G++ picks space over time.

mpyne|6 months ago

There's a time-space tradeoff to basically any means of error checking.

Including checking return codes instead of exceptions. It's even possible for exceptions as implemented by g++ in the Itanium ABI to be cheaper than the code that would be used for consistently checking return codes.

netr0ute|6 months ago

> G++ picks space over time

By definition, that's zero-overhead because Ultrassembler doesn't care about space.

netr0ute|6 months ago

Hi everyone, I'm the author of this article.

Feel free to ask me any questions to break the radio silence!

benreesman|6 months ago

Nice work and good writeup. I think most of that is very sound practice.

The codegen switch with the offsets is in everything, first time I saw it was in the Rhino JS bytecode compiler in maybe 2006, written it a dozen times since. Still clever you worked it out from first principles.

There are some modern C++ libraries that do frightening things with SIMD that might give your bytestring stuff a lift on modern stupid-wide high mispredict penalty stuff. Anything by lemire, stringzilla, take a look at zpp_bits for inspiration about theoretical minimum data structure pack/unpack.

But I think you got damn close to what can be done, niiicccee work.

inetknght|6 months ago

Overall, this is a fantastic dive into some of RISC-V's architecture and how to use it. But I do have some comments:

> However, in Chata's case, it needs to access a RISC-V assembler from within its C++ code. The alternative is to use some ugly C function like system() to run external software as if it were a human or script running a command in a terminal.

Have you tried LLVM's C++ API [0]?

To be fair, I do think there's merit in writing your own assembler with your own API. But you don't necessarily have to.

I'm not likely to go back to assembly unless my employer needs that extra level of optimization. But if/when I do, and the target platform is RISC-V, then I'll definitely consider Ultrassembler.

> It's not clear when exactly exceptions are slow. I had to do some research here.

There are plenty of cppcon presentations [1] about exceptions, performance, caveats, blah blah. There are also other C++ conferences with similar presentations (or even almost identical ones, because the presenters go to multiple conferences), though I don't have a link handy because I pretty much only attend cppcon.

[0]: https://stackoverflow.com/questions/10675661/what-exactly-is...

[1]: https://www.youtube.com/results?search_query=cppcon+exceptio...

msla|6 months ago

What's the difference between a Programming Furu and a Programming Guru? Is there a joke I'm missing?

jclarkcom|6 months ago

You might look into using memory mapped IO for reading input and writing your output files. This can save some memory allocations and file read and write times. I did this with a project where I got more than 10x speed up. For many cases file IO is going to be your bottleneck.

IshKebab|6 months ago

Neat, but it's not like assembly is really a bottleneck in any but the most extreme cases. LLVM and GAS are already very fast.

I feel like this might mostly be useful as a reference, because currently RISC-V assembly's specification is mostly "what do GCC/Clang do?"

drob518|6 months ago

Exactly. I don’t know too many assembly language programmers who are griping about slow tools, particularly on today’s hardware. Yeah, Orca/M on my old Apple II with 64k RAM and floppy drives was pretty slow, but since then, not so much. But sure, as a fun challenge to see how fast you can make it run, go for it.

CyberDildonics|6 months ago

ASM should compile at hundreds of MB/s. All the ASM you could write in your entire life will compile instantly. No one in decades has thought their assembler was too slow.

benreesman|6 months ago

ptxas comes to mind.

throwaway81523|6 months ago

I wonder if you thought about perfect hashing instead of that comparison tree. Also, flex (as in flex and bison) can generate what amounts to trees like that, I believe. I haven't benchmarked it compared to a really careful explicit tree though.

netr0ute|6 months ago

I thought about hashing, but found that hashing would be enormously slow to compute compared to a perfectly crafted tree.

Sesse__|6 months ago

You're probably thinking of gperf, not flex and bison.

StilesCrisis|6 months ago

“Here's one weird trick I haven't seen anywhere else.” … describes a simplistic lexer. Hmm.

throwaway81523|6 months ago

I also have to ask where all this assembly code is coming from, that has to be compiled fast. If it's compiler output, maybe you could hack the compiler back end to generate tokenized assembly code, eliminating a lot of scanning and stuff. It would still be human readable through a simple program that converted the tokens back to mnemonics. The tokens could be 4 digit hex numbers or something like that, so it would still be an easily handled text file.

Lots of simple compilers generate object code directly instead of assembly code, so the above is not so bad by comparison.

stuaxo|6 months ago

Nice to have something really fast, maybe we can have something as good as Turbo Pascal and other early Borland tools again.