NobleExpress's comments

NobleExpress | 1 year ago | on: Deep Bug

-server is only a thing for HotSpot. They mention that HotSpot works perfectly fine. There is no "-server" option for Graal native-image, though it does have "-O{0,1}" for turning optimizations off and on respectively.

NobleExpress | 1 year ago | on: Deep Bug

Interesting. Perhaps you can inspect the disassembly of the function in question when using Graal and HotSpot. It is likely related to that.

Another debugging technique we use for heisenbugs is to see if `rr` [1] can reproduce it. If it can, then that's great, as it allows you to go back in time to debug what may have caused the bug. But `rr` is often not great for concurrency bugs, since it emulates a single-core machine. Debugging a VM is generally a nightmare, though. What we desperately need is a debugger that can debug both the VM and the language running on top of it. Usually it's one or the other.

> In general I’d argue you haven’t fixed a bug unless you understand why it happened and why your fix worked, which makes this frustrating, since every indication is that the bug exists within proprietary code that is out of my reach.

Were you using Oracle GraalVM? GraalVM community edition is open source, so maybe it's worth checking if it is reproducible in that.

[1]: https://github.com/rr-debugger/rr

NobleExpress | 1 year ago | on: Wayland breaks your bad software

Honestly, I was never convinced that concentrating all developer effort onto Wayland was the best idea. Wayland seemed like a massive over-correction: its protocol and core are "simple" in comparison to X11, which did everything. Being simple, by itself, is not a virtue -- Wayland basically forced all the essential complexity of the problem onto the compositors (a la the Waterbed Theory of Complexity [1]). Given there are so many compositor implementations, they essentially end up reinventing the wheel all the time. This is a bit better now with wlroots, but unfortunately, the ecosystem is already fragmented. I'm not saying X11 is superior to Wayland or vice-versa, but frankly I would be surprised if we couldn't improve X11 if it got 0.1% of the effort that has collectively gone into Wayland.

I would highly recommend reading through some of these blog posts, which expound upon the problems of Wayland more eloquently than I can [2,3,4].

[1]: https://en.wikipedia.org/wiki/Waterbed_theory

[2]: https://dudemanguy.github.io/blog/posts/2022-06-10-wayland-x...

[3]: https://pointieststick.com/2023/09/17/so-lets-talk-about-thi...

[4]: https://utcc.utoronto.ca/~cks/space/blog/unix/WaylandTechnic...

NobleExpress | 1 year ago | on: Bump Allocation: Up or Down?

Using microbenchmarks to measure allocation performance is a bit misleading. The optimizations that occurred in the microbenchmarks may not actually occur for real-world code. Three branches vs. two also makes no real difference; you are better served optimizing inefficiencies somewhere else.

Also, realloc'ing the most recently allocated object is not a common operation, at least in my experience building memory management systems.

NobleExpress | 1 year ago | on: Bump Allocation: Up or Down?

They're (effectively) the same. Bump allocation is the term in GC literature and is the more general term. Arena allocation can be slightly more nuanced as you can throw an entire arena away after you're done with it (if your problem domain allows for it). But arena allocation is just a kind of bump allocation for all intents and purposes.

NobleExpress | 2 years ago | on: The Garbage Collection Handbook, 2nd Edition

Well, page protection is expensive, which is why the predominant way to implement concurrent copying/compaction until recently was using read barriers (and it still is -- ART seems to be an exception). I will mention that concurrent compacting GCs wouldn't use userfaultfd for write protection, but for read protection (it effectively takes over the job of a read barrier).

I personally don't know too much about userfaultfd and how it works internally, but my best guess is that it bypasses heavyweight kernel data structures and lets the user application handle the page fault. It is obviously better than simply using mprotect, but it is not immediately clear why it would be better than a read barrier (other than code size considerations, which honestly doesn't sound like a big deal, as the code handling userfaultfd also needs to be brought into the instruction cache).

I did find this kernel doc about userfaultfd [1] which might be interesting to read if you're interested (it also mentions that userfaultfd doesn't lock some internal kernel data structures, which gives it better performance than simply using mprotect, but the implementation details are a bit sparse).

[1]: https://docs.kernel.org/admin-guide/mm/userfaultfd.html

NobleExpress | 2 years ago | on: The Garbage Collection Handbook, 2nd Edition

> RC is super cheap. Seriously. You can do about several billion of them per second.

Right. You bump allocate faster, however. RC is an additional operation to the allocation request. Given naive RC can't move objects, it necessarily needs a free-list allocator. Free-list allocators can allocate pretty fast (for the most common object sizes), but can't reach the speeds of bump allocators. Furthermore, bump allocators have better locality of reference than free-list allocators.

Also, I've never questioned that you can do non-atomic increments/decrements efficiently. They are exceptionally fast. However, you are still converting every pointer read operation into a pointer write operation. The rule of thumb is that there are generally 10x more pointer reads in a program than pointer writes. This is why read barriers are also considered more costly than write barriers.

> I did actually provide proof by the way. Apple’s phones use half the RAM as Android and are at least as equally fast even if you discount better HW.

I don't really agree with this statement. There are way too many unknown/unaccounted variables. It could be better hardware, Android could have a terrible architecture, or it could, as you said, be that Swift RC is genuinely better than ART's GC. Point is, it's not a scientific comparison. We don't know if the benefits in iOS come from RC. And we won't be able to know unless we have two systems where all parameters are exactly the same _except_ that one uses RC and the other uses tracing GC, with both systems run on the same hardware and the same set of benchmarks.

> That’s only kind of true. Good allocators seem to mitigate this problem quite effectively

Yes. Note that mimalloc literally has a concept of "deferred frees", effectively emulating GC, as otherwise freeing an object can result in an unbounded recursive free call (for example, dropping a large linked list).

> I’m just saying that good memory management (made fairly easy in Rust) is always going to outperform tracing GC the same way optimized assembly will outperform the compiler.

Sure. The perfect memory manager is omniscient and knows exactly when an object is not required and will free it. But unfortunately we don't have perfect memory managers yet. So yes I agree in principle, but we have to be pragmatic.

> And I can’t belabor this point enough - languages with RC use it rarely as shared ownership is rarely needed and you can typically minimize where you use it.

I feel like that entirely depends on the problem domain? I don't know where you're getting the "shared ownership is rarely needed" from? Maybe this is true for your problem domain but may not be for others. And if it's true for yours, then great! You can use RC and optimize your programs that way.

> A tracing garbage collector still needs to do atomic reads of data structures which potentially requires cross cpu shoot downs.

Sure. GC metadata needs to be accessed and updated atomically. 100% agree with you. The order of magnitude of those operations is likely much less than what you would get with naive RC though.

> as Swift and ObjC demonstrate, it’s generally good enough without any serious performance implications (throughput or otherwise).

In one of my comments about Swift in this thread, the two papers I linked observe up to 80% of benchmark execution time being dominated by ARC operations! I will note that the papers are around 6-7 years old, so the situation might have drastically changed since then, but I haven't personally found many contemporary evaluations of Swift RC.

NobleExpress | 2 years ago | on: The Garbage Collection Handbook, 2nd Edition

I will say that you seem to be stating "smearing that over total execution time tends to give better results" without proof as well. I certainly don't think using atomic operations everywhere and converting every pointer read operation into a write operation is efficient. Yes, RC has smaller minimum heap sizes than tracing GC, but garbage collection is fundamentally a time-space tradeoff, and RC is highly optimized for memory efficiency. Also, most naive RC implementations don't copy objects, so you get heap fragmentation.

Comparing naive RC to tracing GC is non-trivial. In order to have a fair comparison, you'd have to implement naive RC and tracing GC in the same system and then compare them across a set of benchmarks. I personally have never come across a great performance study which compared naive RC to tracing GC.

NobleExpress | 2 years ago | on: The Garbage Collection Handbook, 2nd Edition

I'm aware. Rust isn't a conventional "managed language" which is why I consciously omitted it from the above list. In an increasingly multi-threaded world, you just have to use the synchronized version of RC (aka Arc<T> in Rust).

NobleExpress | 2 years ago | on: The Garbage Collection Handbook, 2nd Edition

Yes and no. LXR is a GC built on highly optimized deferred and coalesced reference counting (among many other optimizations), and it looks nothing like the RC implementations you see in other languages like Python, Nim, Swift, etc. So yes, reference counting can be performant, _but_ only if you use a deferred and coalesced implementation; otherwise the simple implementation requires atomic operations for every pointer read/write (though compilers can optimize to non-atomic operations when they're sure semantics are preserved).

EDIT: The post you're responding to is referring to the simple standard implementation of RC.

NobleExpress | 2 years ago | on: The Garbage Collection Handbook, 2nd Edition

I've always heard of the "Swift elides reference counts" statements but I've never seen it substantiated. I don't claim to be a Swift GC expert by any means, but the impression I get from the two Swift GC papers I've read [1, 2] is that Swift has a very simple implementation of RC. The RC optimization document (albeit the document is incomplete) [3] also doesn't give me the impression that Swift is doing much eliding of reference counts (I'm sure it is doing it for simple cases).

Do you have any links which might explain what kind of eliding Swift is doing?

EDIT: The major RC optimizations I have seen which elide reference-count operations are deferral and coalescing, and I'm fairly certain that Swift does neither.

[1]: https://dl.acm.org/doi/abs/10.1145/3170472.3133843

[2]: https://doi.org/10.1145/3243176.3243195

[3]: https://github.com/apple/swift/blob/main/docs/ARCOptimizatio...

NobleExpress | 2 years ago | on: The Garbage Collection Handbook, 2nd Edition

Real-time GCs do exist, such as IBM's Metronome. Though I'll be honest and say I haven't heard of many real-time GCs other than Metronome. Certainly many new GCs have reduced pause times dramatically, but that's orthogonal to real-time GC (you can make pause times infinitesimally small while not actually letting the mutator progress).

NobleExpress | 2 years ago | on: The Garbage Collection Handbook, 2nd Edition

Certainly interesting, but there are no performance numbers mentioned in the white paper comparing userfaultfd to read barriers. So the actual benefit of switching to userfaultfd is unknown (at least in publicly accessible documents -- I'm sure Google has done internal performance evaluations).

Using page faults (and/or page protection) to perform compaction instead of barriers is a pretty old technique (see the 1988 paper by Appel, Ellis, and Li [1]; see also the Compressor by Kermany and Petrank [2]). But handling page faults was very expensive (at least on Linux) until the recent addition of userfaultfd.

[1]: https://dl.acm.org/doi/10.1145/960116.53992 [2]: https://dl.acm.org/doi/10.1145/1133255.1134023

NobleExpress | 3 years ago | on: JDK 20 G1/Parallel/Serial GC Changes

Ehh, yes and no. Many languages indeed use conservative GCs, but that is because they are significantly easier to implement than precise GCs, which require stack maps etc. I don't think it's exactly a question of languages and static types/generated code; it's more of an implementation detail. Conservative GCs can be fast, but they are inherently less space-efficient, as you retain more objects than a precise GC would. Of course, in some cases a conservative GC _is_ required, for example if your language has heavy interop with C/C++ land.

Also, supporting interior pointers is not too cumbersome even with a "Java-style GC" (whatever that exactly means). It requires an additional bit per smallest aligned allocation unit to denote whether an object starts there. To check if a pointer is an interior pointer to an object, you walk back in the bitmap to find the last bit set to 1 (i.e. the allocated object that starts at or before your pointer), get the size of that object, and check that your pointer falls between the start and end of that object.

EDIT: The big issue you would face with directly porting a Java GC into other languages is the weak reference processing semantics. Each language has their own subtle differences which would make it a pain.
