An interesting story about generational garbage collectors: a couple of versions ago, Lua experimented with a generational GC, but it didn't improve performance, so they kept the old incremental collector. Much later, they looked at it more closely and realised that the actual problem was that the generational collector was moving objects out of the nursery too eagerly. Objects on the stack are live and will survive if the GC runs, but many of them stop being live as soon as their function returns. They fixed the issue by letting stack objects that survive one round of GC stay in the nursery. The new GC is now much faster than the old one, and is one of the big features of the upcoming Lua 5.4.
When using generational GCs it is of the utmost importance to match the policies for moving things between generations to the actual pointer interconnection pattern.
This can go really badly, for example in systems that keep caches/repositories of preallocated things for faster re-use. If they are anchored at global (or near-global) variables, then you need to treat those preallocated things specially. You can't just go and move them every time when you know you will never collect them.
Generally, GC works better if you don't do tricks/optimizations with memory allocation and just let everything flow freely into the heap. If you do have to optimize allocation, you generally have to teach your GC about your hack.
Great article, but I'm curious why automatic reference counting (ARC) and smart pointers never seemed to really catch on outside of Objective-C and Swift:
I'd like to see some real-world studies on what percentage of unclaimed memory is taken up by orphaned circular references, because my gut feeling is that it's below 10%. So that really makes me question why so much work has gone into various garbage collection schemes, nearly all of which suffer from performance problems (or can't be made realtime due to nondeterministic collection costs).
Also I can't prove it, but I think a major source of pain in garbage collection is mutability, which is exacerbated by our fascination with object-oriented programming. I'd like to see a solid comparison of garbage collection overhead between functional and imperative languages.
I feel like if we put the storage overhead of the reference count aside (which becomes less relevant as time goes on), then there should be some mathematical proof for how small the time cost can get with a tracing garbage collector or the cycle collection algorithm of Bacon and Paz.
> They almost "just work", except for circular references
Well, and they're often slower, since they require mutating the reference count every single time a field is stored. You can optimize that using lazy reference counting, or by not updating refcounts for references on the stack and instead scanning the stack before a collection. But at that point... you're halfway to implementing a tracing collector.
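To make that cost concrete, here's a minimal sketch in C of what naive refcounting turns every field store into (the `Obj` type and helpers are made up for illustration, not taken from any particular runtime):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted object, for illustration only. */
typedef struct Obj {
    int refcount;
    struct Obj *field;
} Obj;

static Obj *obj_new(void) {
    Obj *o = calloc(1, sizeof(Obj));
    o->refcount = 1;  /* one reference, held by the creator */
    return o;
}

static void obj_release(Obj *o) {
    if (o && --o->refcount == 0) {
        obj_release(o->field);  /* freeing an object releases what it holds */
        free(o);
    }
}

/* The "write barrier": every single field store pays for an
 * increment and a decrement, even when both objects involved
 * are cold in cache. */
static void obj_set_field(Obj *o, Obj *value) {
    if (value) value->refcount++;  /* retain the new value first... */
    obj_release(o->field);         /* ...then release the old one */
    o->field = value;
}
```

Note the retain-before-release order, which keeps self-assignment safe. Deferred and coalesced schemes exist precisely to get this pair of count updates off the hot path.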
Every refcounting implementation eventually gets "optimized" to the point that it has most of a tracing collector hiding inside of it. I think it's simpler, cleaner, and faster to just start with tracing in the first place.
There's a reason almost no widely used language implementation relies on ref-counting. CPython is the only real exception and they would switch to tracing if they could. They can't because they exposed deterministic finalization to the user which means now their GC strategy is a language "feature" that they're stuck with.
That being said, ref-counting is fine for other kinds of resource management where resources are pretty coarse-grained and don't have cyclic references. For example, I think Delphi uses ref-counting to manage strings, which makes a lot of sense. Many games use ref-counting for tracking resources loaded from disc, and that also works well. In both of those cases, there's nothing to trace through, and the overhead of updating refcounts is fairly low.
ARC is just reference counting; there's nothing in reference counting that requires a runtime to do it for you (that is, just because the compiler inserts the increments/decrements doesn't make it an entirely different thing).
The reason reference counting is not more common is its performance. Naive reference counting works in some cases (e.g. when the counts are not modified often), but doesn't work too well for most programming languages, as the reference counts will be modified frequently.
Deferred and coalesced reference counting are two techniques for improving performance, but they come at a cost: they are quite complex to implement, and require additional memory to keep track of what/when to increment/decrement. You will also end up facing similar nondeterministic behaviour and pauses, as an optimised RC collector may still need to pause the program to scan for cycles. You can handle cycles by supporting weak and strong references, but this puts the burden on the developer, and it's not always clear when writing code whether cycles may appear.
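To see why cycles force that choice on the developer, here's a hedged sketch (hypothetical `Node` type): two nodes that strongly reference each other keep their counts pinned at one forever, and the standard fix is to make one direction weak, i.e. uncounted:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted node type, for illustration only. */
typedef struct Node {
    int refcount;
    struct Node *next;       /* a strong (counted) reference */
    struct Node *back_weak;  /* a weak (uncounted) reference */
} Node;

static Node *node_new(void) {
    Node *n = calloc(1, sizeof(Node));
    n->refcount = 1;
    return n;
}

static void node_retain(Node *n) { n->refcount++; }

static void node_release(Node *n) {
    if (n && --n->refcount == 0) {
        node_release(n->next);  /* weak back-pointers are NOT released */
        free(n);
    }
}

/* Builds a->next = b and b->next = a with strong references.
 * Once the caller drops its own references, both counts are stuck
 * at 1: the cycle keeps itself alive forever. The RC fix is to
 * declare one direction weak (back_weak above), and deciding which
 * direction is exactly the burden pushed onto the developer. */
static void make_strong_cycle(Node *a, Node *b) {
    a->next = b; node_retain(b);
    b->next = a; node_retain(a);
}
```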
Combining this with a good allocator you can achieve performance that is on par with a tracing collector (http://users.cecs.anu.edu.au/~steveb/pubs/papers/rcix-oopsla...), but it's not easy. If you are going down the path of implementing a complex collector, I think you're better off focusing on a real-time or concurrent collector.
> Also I can't prove it, but I think a major source of pain in garbage collection is mutability, which is exacerbated by our fascination with object-oriented programming. I'd like to see a solid comparison of garbage collection overhead between functional and imperative languages.
I can only speak for Haskell's GC, but it's pretty conventional. You might think that GC is simpler for Haskell without mutability but you'd be wrong, because (1) Haskell actually gives you plenty of safe and unsafe ways to have mutability, safe ways such as the ST monad (not to be confused with the State monad), (2) laziness effectively is mutation: evaluating a value is effectively overwriting the closure to compute the value with the value itself. The lazy list is a pretty common sight in most Haskell code. So basically Haskell has a pretty conventional stop-the-world, generational GC not unlike imperative languages.
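The "laziness is mutation" point can be sketched even in C (this thunk layout is a simplification; GHC really overwrites the closure with an indirection to the value, but the effect is the same):

```c
#include <assert.h>

/* A hypothetical thunk representation, for illustration. */
typedef struct Thunk {
    int evaluated;        /* 0 = suspended computation, 1 = value */
    int value;            /* meaningful once evaluated */
    int (*compute)(void); /* the suspended computation */
    int times_computed;   /* lets us observe the sharing below */
} Thunk;

/* Forcing a thunk IS mutation: the heap cell is overwritten in
 * place with its result, so later uses see the value and the
 * computation runs at most once. This is also why a generational
 * GC for a lazy language still needs a write barrier: an old
 * thunk can be updated to point at a young value. */
static int force(Thunk *t) {
    if (!t->evaluated) {
        t->value = t->compute();
        t->evaluated = 1;
        t->times_computed++;
    }
    return t->value;
}

static int expensive_computation(void) { return 6 * 7; }
```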
The issue here is that while GC does take some time to perform, it's easy to make the assumption that removing the GC will reclaim the performance lost by the GC. This is a fallacy.
We can compare memory strategies like so:
Garbage Collection:
- Allocations: Very fast (often a single instruction)
- Ownership transfer: Free
- Pointer release: Zero
- Post-free: Garbage collection phase (this is where the time is spent)
Reference Counting:
- Allocations: Similar to malloc(), which requires at least a binary search for free space
- Ownership transfer: At least one instruction when single-threaded. Multi-threading: requires an atomic operation or a lock.
- Pointer release: Slow. Locking issues, memory housekeeping.
- Post-free: Zero. Good for realtime.
The first point might require some clarification. When you have a compacting garbage collector, the free memory is usually in a single block. This means that allocations merely require a single pointer to be updated. If the pointer hits the end of the free block, you trigger a GC. You don't even have to check for the end of the free block if the subsequent memory page is marked no-access.
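As a sketch (arena size and names invented for the demo), bump allocation looks like this; the bounds check below is the part the no-access guard page can replace:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A fixed demo arena. A real compacting collector would trigger a
 * GC instead of returning NULL when the free block runs out. */
enum { HEAP_SIZE = 1 << 20 };

static uint8_t heap[HEAP_SIZE];
static uint8_t *alloc_ptr = heap;             /* next free byte */
static uint8_t *heap_end  = heap + HEAP_SIZE; /* end of the free block */

static void *gc_alloc(size_t size) {
    size = (size + 7u) & ~(size_t)7u;  /* keep 8-byte alignment */
    if ((size_t)(heap_end - alloc_ptr) < size) {
        return NULL;  /* out of space: a real runtime collects here */
    }
    void *result = alloc_ptr;
    alloc_ptr += size;  /* the entire cost of an allocation */
    return result;
}
```

Because the free memory is one contiguous block, each allocation is one compare and one add, and successive allocations land next to each other, which is also where the locality benefit comes from.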
One can spend a lot of time measuring the performance impact of all these different steps, and I am not going to try to prove that GC is always faster than refcounting, but at least it should be clear that it's not a simple matter of assuming that having no GC means that you will have no memory management overhead.
The reason they are not used is not the circular reference problem; it's performance. Single-threaded ARC is passable, but thread-safe ARC is really, really slow: every retain/release becomes an atomic read-modify-write, often on contended cache lines, which hits some of the worst performance pitfalls of modern CPUs.
What is the difference between Clang's ARC implementation and Rust's std::rc::Rc implementation? Rust's ownership system provides the scaffolding for the Drop implementation that manages reference counting instead of Clang doing it as part of the Swift/ObjC frontend but otherwise, they seem to be very similar.
It's just another tool in Rust's box (no pun intended) but I wouldn't say it hasn't caught on. It's used quite often, along with its multithreaded cousin std::sync::Arc.
Another major user of reference counting is COM and WinRT. A major feature of WinRT projections over just using the raw C APIs is that you get reference counting implemented for you using whatever your higher level language uses for object lifetimes.
It is true that you can get cyclical refs higher up in the language even with functional code though, and at some level you need a cycle checker.
There is another type of managed memory not talked about much, and that's the model used by Composita- RAII but with defined message interfaces between modules, such that you can deterministically allocate memory for that:
Whenever the topic of garbage collection comes up, I am reminded of the following excellent reference, https://www.memorymanagement.org/ which puts various different garbage collection (and memory management) techniques into a wider context. It perhaps doesn’t explain some of the newer tricks (eg using overlapping memory mappings to put the colour bits into the high(ish) bits of pointers and get a hardware write barrier only when necessary without needing to move lots of objects and have forwarding pointers).
The reference is provided by Ravenbrook, a consulting company formed from the ashes of Harlequin (who made LispWorks, a CL implementation and IDE; MLWorks, the same for SML; and ScriptWorks, a PostScript rasteriser, which made them all their money). I don’t know when the reference was created.
Why is this marked 2017? This is the newest chapter of a book that isn't finished yet, and was released literally today. It couldn't be more (2019) if it tried!
Tech moves fast; this morning's bleeding edge is this afternoon's yesteryear, and you don't want to know what happens if you bookmark it to read tomorrow.
Do you know if LLVM got good at garbage collection support recently? My understanding was always that its optimisations would mangle your stack/registers so much that a decent incremental exact GC becomes super hard, and that something like keeping one or two pointers to the tip of the heap in dedicated registers is super hard/annoying too.
I'm a bit surprised to find no discussion of reference counting versus tracing GC, only a couple of passing mentions. On the one hand, I suppose this has already been discussed to death. And if the author felt it would be more fun to write and read about actually implementing mark-and-sweep GC, fair enough. On the other hand, reference counting definitely has its place, and I'd be curious to know more about the author's opinion on it, given his background in game scripting languages. Some scripting languages that have been put to good use in games, such as AngelScript [1] and Squirrel [2], use reference counting.
Edit: My mistake; he discussed it back in chapter 3.
> Tracing and reference counting are uniformly viewed as being fundamentally different approaches to garbage collection that possess very distinct performance properties. We have implemented high-performance collectors of both types, and in the process observed that the more we optimized them, the more similarly they behaved— that they seem to share some deep structure.
> We present a formulation of the two algorithms that shows that they are in fact duals of each other. Intuitively, the difference is that tracing operates on live objects, or “matter”, while reference counting operates on dead objects, or “anti-matter”. For every operation performed by the tracing collector, there is a precisely corresponding anti-operation performed by the reference counting collector.
> Using this framework, we show that all high-performance collectors (for example, deferred reference counting and generational collection) are in fact hybrids of tracing and reference counting. We develop a uniform cost-model for the collectors to quantify the trade-offs that result from choosing different hybridizations of tracing and reference counting. This allows the correct scheme to be selected based on system performance requirements and the expected properties of the target application.
While reading I was expecting to hit a discussion of moving/copying/compacting GC at some point. But no.
IMO, one of GC's benefits is that it can maintain data locality. Otherwise a given GC is not that different from reference-counting memory management (loops in ownership graphs aside): it inherits the same memory-fragmentation problems from malloc/free.
Some memory allocators allocate objects of different sizes from different blocks, which can be far away from each other. This reduces fragmentation, but I'm wondering how much data locality we typically get, even in the best case, for composite objects composed of different-sized parts?
It seems like it will vary a lot depending on details of the language implementation. An array of value types should be laid out linearly, but beyond that it seems hard to tell what's going on?
This is great. Garbage collection is one of those things that tends to scare people, even though a basic garbage collector can be very simple (though a naive approach is usually also horribly slow...). Great to see the amount of illustrations to make it easier to grasp.
> Garbage collection is one of those things that tends to scare people
That's exactly why I was excited to write this chapter. They have a reputation for being hard, but the core algorithm is really just a simple graph traversal.
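As a sketch of that traversal (a hypothetical object layout, not the chapter's actual one): marking is depth-first search from the roots, with the mark bit doubling as the visited set:

```c
#include <assert.h>
#include <stddef.h>

/* Just enough object layout to show the traversal. */
enum { MAX_FIELDS = 2 };

typedef struct Object {
    int marked;
    struct Object *fields[MAX_FIELDS];
} Object;

/* The mark phase really is plain depth-first search over the
 * object graph, starting from the roots. The mark bit is also
 * the "visited" set, which is what makes cycles harmless. */
static void mark(Object *obj) {
    if (obj == NULL || obj->marked) {
        return;
    }
    obj->marked = 1;
    for (int i = 0; i < MAX_FIELDS; i++) {
        mark(obj->fields[i]);
    }
}
```

Sweep is then just a linear pass over every allocated object, freeing whatever is still unmarked and clearing the bits for the next collection.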
Designing and implementing a low-latency concurrent garbage collector is actually at least as difficult as writing an optimizing compiler. Many languages today still struggle with concurrency + GC (see e.g. Python, OCaml). Good to see this chapter on GC, but I think the subject deserves an entire book!
A GCed piece of nontrivial code can also be faster than a malloc/free solution.
The reason being that allocation can be much faster. The GCed code can allocate memory with as little as an increment of a pointer (and that can be done without thread locking, e.g. by giving each thread its own allocation region).
malloc/free always do full function calls, and they might/will descend into dealing with fragmentation (aka finding free space). Likewise, free() isn't free, and cross-thread allocation/deallocation can further complicate things.
I'm definitely a newbie at this topic, but I wonder if we can have some type of static GC? I'm not speaking about a radical change to the model of the programming language (like Rust), but about a compile-time analyzer that can detect when an object will go out of memory (edit: out of scope) and insert code to deal with it, or make some predictions about the likely generation of the object. This would decrease the amount of work the GC has to do at runtime, no?
Isn't a compile-time analyzer that can detect when an object will go out of memory precisely what Rust (and C++ RAII) does? It's just a form of escape analysis.
Imagine creating a Rust program entirely with `Rc`. It's basically a GC'd program at the point where the "roots" are managed by the reference counter. The "list of roots" is only ever messed with whenever a `Rc` is dropped/created, and one can optimize functions to take `&Rc` to reduce "GC" pressure. I do not believe it's possible to automate this process in general because if you could, I have a hunch the solution can be used to decide the Halting Problem.
So sure, in general a GC can perform some heuristics to predict the lifetime of an object, but usually the point of using a GC is that one does NOT know the lifetime or it is insanely complex.
Depending on the language, you can do this for some things. The fundamental challenge is escape analysis: when this function returns, what might have been retained elsewhere?
If you can answer that, then you can optimize accordingly.
E.g you can choose to allocate objects that won't be retained on the stack instead of the heap, or in separate blocks.
You can even potentially benefit from this even if you can't 100% know. If you use a generational collector, an object that almost certainly escapes could bypass the nursery, for example.
The Standard ML implementation called ML Kit added memory tracking to the type system in order to infer the lifetime of objects. I ran into a safe C dialect that built on the work in ML Kit to avoid memory allocation overhead in C as well, but I can't remember the name of the project. The Cyclone project mentioned in the link might have been it, but it has less automation than ML Kit.
The long awaited garbage collection chapter is finally here! I'm not sure the discussion about white/gray/black objects is strictly necessary for a simplest possible garbage collector, but it definitely will help those who want to read more about the topic in the future.
> I'm not sure the discussion about white/gray/black objects is strictly necessary for a simplest possible garbage collector
I wondered about that too. If you aren't doing an incremental collector, it's not really necessary. But I think it helps build a visual intuition for the algorithm (and other graph traversals for that matter), so I felt it was worthwhile to put in there.
Garbage collection solves many problems on virtual machines, such as memory leaks and overwrites of live memory. Many programming languages now implement something like garbage collection in their runtime, so the memory can be managed for the user.
I wonder how they decide which objects are rooted only by the stack. Is it part of the write barrier, or computed at mark time?
https://en.wikipedia.org/wiki/Automatic_Reference_Counting
https://en.wikipedia.org/wiki/Smart_pointer
They almost "just work", except for circular references:
https://en.wikipedia.org/wiki/Reference_count#Dealing_with_r...
Smart pointers have been a central feature of C++ since before they were added to the C++ standard in 2011.
http://losak.sourceforge.net/
http://concurrency.ch/Content/publications/Blaeser_ETH_Diss_...
It looks to be the best way of doing deterministic performant memory management, though you'll need a cycle checker in the language runtime.
https://medium.com/@MartinCracauer/generational-garbage-coll...
And a review of LLVM's GC facilities: https://medium.com/@MartinCracauer/llvms-garbage-collection-...
The overall message is that there are many different ways to get to the destination, and creative solutions are mixed in.
[1]: https://www.angelcode.com/angelscript/
[2]: http://squirrel-lang.org/
"A Unified Theory of Garbage Collection" https://researcher.watson.ibm.com/researcher/files/us-bacon/...
https://journal.stuffwithstuff.com/2013/12/08/babys-first-ga...
I've implemented a couple of scripting languages, but thus far I've ignored GC. It is something I'd like to experiment with in the future though.
https://en.wikipedia.org/wiki/Region-based_memory_management...
Baby’s First Garbage Collector[0] (from the same author) uses this approach.
[0]: https://journal.stuffwithstuff.com/2013/12/08/babys-first-ga...