top | item 24762840

Custom Allocators Demystified

95 points| deafcalculus | 5 years ago |slembcke.github.io | reply

67 comments

[+] saagarjha|5 years ago|reply

Do note that "custom allocator" can often just be "take what malloc gives you and carve it up yourself" rather than "interpose malloc"; there's usually no need to drop the system allocator completely. This lets you experiment with your own custom pools or arenas without having to go all in and make something general-purpose that works for every object in your program.

> Over the years, I’ve wasted many hours debugging memory issues. With just a hash table and a set of linked lists, you can track the history of all your allocations. That makes it pretty easy to track down use after free errors, double free errors, and memory leaks. Using guard pages around your allocations, you can detect overflows.

No, just use address sanitizer–it's meant for this. Odds are your custom allocator might have a few bugs in itself and they will make you life more annoying if you're trying to use it for only this.

[+] klyrs|5 years ago|reply

I just went from "wtf is an address sanitizer" to "what's the difference between this and valgrind" to an ASAN acolyte in about 1 minute.

http://btorpey.github.io/blog/2014/03/27/using-clangs-addres...

Article mentions JVM noise, but my actual experience is with python noise. Good riddance; this is going into my CI test suite today.

[+] gpderetta|5 years ago|reply

Yes! The sanitizers have been a huge game changer for me.

[+] SloopJon|5 years ago|reply

> No, just use address sanitizer

In fairness, the next sentence in the post is:

> Techniques like this can be a nice complement to external tools like Valgrind or AddressSanitizer.

[+] OnlyOneCannolo|5 years ago|reply

You can do interesting things by replacing malloc, like this compacting allocator:

https://github.com/plasma-umass/Mesh

[+] scottlamb|5 years ago|reply

One of these things is not like the others.

Slabs and bump/arena allocators have significant advantages over just using the system allocator. You use them in a constrained way which allows for a more efficient implementation—slabs by having only one size of allocation and bump/arena by freeing things all at once (or maybe in LIFO order if you're fancy). Thus, they make sense in a lot of performance-oriented programs.

A buddy allocator has no such constraints. It's general-purpose. Thus, there's no obvious opportunity to do better than the system allocator. And it's pretty wasteful if your allocations aren't close to sizes of 2 (internal fragmentation => wasted RAM and cache). If you have a system allocator available that uses say size classes instead, I have no idea why you'd prefer to use a buddy allocator. For that matter, if your system allocator is a buddy allocator (or more likely: some kind of hybrid), I have no idea why you'd prefer your own implementation.

One constraint might be that you promise your buddy allocator's usage is single-threaded, and thus no locking is necessary. But a production-quality system allocator likely uses thread-local or CPU-local caches to accomplish much the same thing, so I don't think this constraint buys you much efficiency.

If you're implementing a bare-metal system on a microcontroller and need a very small memory allocator implementation for light usage, implementing your own buddy allocator might make sense. But that's a lot of caveats...

[+] dragontamer|5 years ago|reply

Thread safety, or the lack of it.

The system-allocator needs to be thread-safe in a large variety of situations.

Your custom allocator doesn't have to be: maybe you know only 1 thread is running. Heck, even knowing that only 32-threads or 64-threads will ever run can make gross simplifications to the synchronization code.

System-allocator needs to be thread-safe with an unbounded number of threads (because you may call pthread_create at any time).

Your custom allocator may know more specifics about when, and where, pthreads are created, as well as whether or not they interact with each other.

---------

Case in point: lets say you malloc a 256MB region per thread. You know that only one thread would ever access this 256MB region, so you write it without any synchronization primitives what so ever.

But you still want a general purpose, multi-size supported allocator to split up the data inside the 256MB region.

You call void* fooptr = malloc(256MB), because grabbing from the system allocator should remain thread-safe (who knows how many other threads are calling malloc?). But afterwards, you can make very specific assumptions about the access patterns of that fooptr.

[+] nitrogen|5 years ago|reply

A buddy allocator has no such constraints. It's general-purpose. Thus, there's no obvious opportunity to do better than the system allocator.

The system allocator isn't guaranteed to be optimal for your workload. Way back in college one of the assignments was to write a custom allocator. The fastest allocator got extra credit. Our school lab was running Linux on SGIs with AMD Opteron CPUs. The native malloc would get 3000 allocs per second in the grading program's benchmark. I managed to top 10000/s with a buddy list allocator with coalescing, if I recall correctly.

One of the weird things was that replacing all structs with pointer arithmetic (managed by preprocessor macros) gave a pretty big boost in speed. I didn't dig too much into the assembly because it was already more than fast enough.

[+] dragontamer|5 years ago|reply

Hmm, a few more allocators to mention.

0. Garbage Collection: Reference Counting -- C++, Python, Rust. Reference counting augments any of the allocators discussed in the blogpost with a simple garbage-collection scheme.

1. Garbage Collection: Stop and Copy -- This is the easiest garbage collection to implement. This is closely related to the Linear-Allocator, except "free" compiles into a nop. Only "new" matters. When "new" runs out of space (which it will inevitably do), you scan the entire memory space, and copy all "current live objects" to the 2nd heap. (An identically sized heap).

Stop and copy uses up 50% of the space (If you have 32GB available, you can only have 2x16GB heaps, and can only use up to 16GBs total). Still, the gross simplicity of stop-and-copy is cool.

2. Garbage Collection -- Mark and Sweep: is a bit complicated in my experience. It works well in practice, but its harder to program (especially if you want to do it efficiently, with tri-color invariants and all that).

--------

Because of the gross simplicity of alloc() in stop-and-copy, it can be very fast in some use cases! If you can't figure out how to make free() work with linear allocators, just stop-and-copy the darn thing.

-------

Garbage collection is reserved for when your links start to point to each other in ways far more complicated thank linked lists.

If you have cycles in your linked lists ... or your linked lists act like cons pairs (aka: car / cdr) from Lisp, garbage collection is basically almost inevitable. There's no real way to know when to delete a pointer safely if you're using Lisp-style car/cdr pairs. Even reference counting is bad, because cycles are pretty common in that style of programming.

[+] ece|5 years ago|reply

The newest JVM GCs are of the Stop and Copy type AFAIK, Z GC and Shenandoah. Though I'd really like to see tcmalloc/jemalloc tied into this.

[+] chrchang523|5 years ago|reply

I've gotten a lot of mileage out of the following two extensions to the linear allocator:

1. A "pop-to" operation. If you're ready to free ALL memory allocated at or after allocation X, you can do this by moving the start-of-free-space pointer back to the start of allocation X.

2. An end-of-free-space pointer, with its own push and pop-to operations.

In combination, these are often enough to get maximum value out of a fixed-size memory workspace. A function can allocate all returned data structures on one side of the workspace, and use the other side of the workspace for all temporary allocations. This approach does require tight coupling so there are a lot of settings where it clearly doesn't belong, but I've found it to be very effective for high-performance scientific code.

[+] dbandstra|5 years ago|reply

This is a really nice system. Quake used it back in 1996 to fit the game in 8MB. They called it a "hunk"[0] and it worked pretty much exactly as you said.

The main drawback I find is that you can't use some generic data structure implementations that expect to be able to free memory out of order, unless you're fine with leaking before the final cleanup (if integrated into a generic allocator interface, the "free" method will probably be a no-op). For example, maybe you need a temporary hashmap to help with parsing a text file. It can be interesting to come up with implementations that work without freeing out of order.

Of course, you can always opt to use another, more general purpose allocator, on a block of memory retrieved from the hunk (see Quake's "zones").

[0] https://github.com/id-Software/Quake/blob/master/WinQuake/zo...

[+] snicker7|5 years ago|reply

Would recommend watching the "What's a Memory Allocator, Anyway?" talk from Zig contributor Benjamin Feng:

https://www.youtube.com/watch?v=vHWiDx_l4V0

[+] scottlamb|5 years ago|reply

From TFA:

> The trick is to use the allocations themselves as linked list nodes so you don’t have to waste any extra memory for tracking.

There was an article that I'm struggling to find—posted to hn iirc—recommending against this approach. IIRC there were two reasons stated:

* Efficiency due to CPU cache. Freed allocations might have been unused for a long time and thus out of cache. Writing the pointer to the linked list on free puts it back into cache—a whole cache line when you only need 8 bytes or whatever for the singly-linked list—possibly at the expense of something hotter. (I think there are special instructions on many platforms to store without adding to cache, but that's probably not wise either because a bunch of frees might come together and thus need to access a previous one's book-keeping info. Better to keep all the bookkeeping info together in a dense region so you use more of the cache line.)

* Debugging troubles on buggy applications. They're more likely to overwrite adjacent allocations and thus the memory allocator's book-keeping, resulting in hard-to-debug failures. (I recall not being super convinced of this reason—I think having external tracking is insufficient for making it easy to debug this kind of error. I think you'd want to make it possible for ASAN to work well, thus you'd want to implement its "manual poisoning" interface.)

edit: I found this tcmalloc issue talking about not wanting to use singly-linked lists because of cache effects when moving stuff from the thread cache to the central list. Kind of similar. https://github.com/gperftools/gperftools/issues/951

[+] dragontamer|5 years ago|reply

One more allocator to mention:

* Fixed size allocator

If you don't need multiple sizes (like malloc(size)), then a fixed-size allocator is significantly simpler to implement, to the point of absurdity.

Step 1: Put all valid memory locations into a stack.

Step 2: alloc with stack.pop();

Step 3: free with stack.push();

The end. You get memory locality, cache-friendliness, O(1) operations, zero external fragmentation. The works. Bonus points: stack.push() and stack.pop() can be implemented with SIMD with help of stream-compaction (http://www.cse.chalmers.se/~uffe/streamcompaction.pdf), and can therefore serve as a GPU-malloc as long as one-size is sufficient.

--------------

The only downsize of "fixed size" allocation is the significant amount of internal fragmentation per node. (Ex: if you only use 8-bytes per node, your 64-byte reservation is mostly wasted). But if you're making a custom-allocator, fixed-size is probably one of the easiest to implement in practice.

-------------

You can extend this to multithreaded operations by simply using atomic-swap, atomic-and, and atomic-or across a shared bitset between threads. 1-bit per fixed-size block. (bit == 1 means something has been alloc(). bit == 0 means something is free()). The stack.push() and stack.pop() operations can run thread-local.

I recommend 64-bytes or 128-bytes as your "fixed size", because 64-bytes is the smallest you get before false-sharing becomes a thing.

---------

If you have a fixed-size allocator but need items to extend beyond a fixed-size array, then learn to use linked lists.

[+] gpderetta|5 years ago|reply

Well, if your stack is using a dynamic array as backing store, push is only amortized O(1). You can simply chain all all freed blocks in a singly linked free list without using additional space.

Edit: ... which is what is described in the article which most definitely I had read when I originally made this comment!

[+] hinkley|5 years ago|reply

Isn't this a variant/degenerate case of slab allocation?

[+] malkia|5 years ago|reply

For C++, I walked (halfway) the really long road of replacing std:: with std::pmr:: where I can, safely pluging jemalloc - and solve the performance issues we had with the standard heap allocator (Windows). But std::pmr::string now being 8 bytes bigger caused us slowdowns unexpectedly. Also tons of changes, different types, overall not great.

Then discovered mimalloc, which hooks at low-level malloc/free, and at the same time a coworker also did similar job and now I'm no longer in love with "pmr".

I wish it was success story, but it's not. It probably is, if everyone is on board with it. But hardly this was my case. Also libraries are not - Qt is not, and many others.

pmr is great for things like - try to alloc on the stack, and if not enough continue with heap (under the hood), but the extra 8 bytes, and the fact that do_alloc is virtual call is problem (last one maybe not perf enough, but still hard to sell to performance oriented programmers).

I wish it was designed in way where it could've been sold as such.

[+] gpderetta|5 years ago|reply

Hum, pmr is for type erasing allocators, i.e. you need multiple allocators and you do not want to instantiate your templates more than once.

If you want to use jemalloc everywere, just create your own (stateless) allocator wrapper and instantiate your string with it.

In fact you probably don't even need to do that. Doesnt jemalloc provide a dll interposer mode that transparently replaces malloc/free?

[+] Aardwolf|5 years ago|reply

I know custom allocators are important for tiny platforms and so on, but what's a reason C/C++ can't allocate memory in a satisfactory way by default on those platforms? And how do other languages get away with it?

[+] jeffbee|5 years ago|reply

C++ doesn't come out of the box with an allocator at all. Implementations have to provide it. But this article isn't talking about the difference between, say, jemalloc and mimalloc. It's talking about cases where you want to minimize calls to global operator new, or cases where you want to make a lot of allocations but you don't want to have to delete anything. The latter is often a massive advantage in speed. For example if you need to use a std::set<int> within a scope, and it doesn't escape that scope, it will be much faster to provide an arena allocator that allocates the nodes used by std::set, both because it will minimize the necessary calls to global new -- it may even eliminate them if you can safely use enough stack space -- and especially because there isn't a corresponding deallocation for every allocation. You simply discard the entire arena at the end of the scope.

Fans of Java may rightly point out that garbage collection also has this property, but GC brings other costs. Nothing in Java even remotely approximates the performance of an STL container backed by an arena allocator.

[+] pdpi|5 years ago|reply

"Satisfactory" means different things for different people in different contexts, so a general-purpose malloc might not do as well as you'd like.

The simplest and fastest allocator you could possibly write is something where malloc just increments a pointer into a chunk of memory, and free is a no-op. This is obviously terrible as your system-wide allocator, but you can allocate an arena from the system malloc, then use the simple allocator to chunk out that arena. This is useful for example if you know that a web request has a reasonable upper bound on memory usage. You just allocate that much memory from the system at the start of the request, use the increment allocator for all request-bound objects, then release the arena at the end of the request. One single call to the system's somewhat heavy malloc/free, many calls to a trivially simple malloc and a no-op free.

Other languages don't so much "get away with it" as they're simply not used for workloads where it matters.

[+] jasonzemos|5 years ago|reply

There's no single confluence where every application's needs are provided by a single C/C++ implementation's default allocator. I'll just speak to one example off the top of my head: the default glibc malloc() stores an allocation size near or below the actual allocation data itself. In contrast, an alternative like jemalloc tracks the size in a separate control block. When freeing allocations with the former, that memory has to be touched; with the latter, the control block has to be touched instead. Similarly it can lead to less or more efficient packing of aligned data. All of this yields better or worse performance depending on the application.

[+] gpderetta|5 years ago|reply

It is very hard to beat a good off the shelf allocator in the general case, but it is easy in specialized cases, simply because you can make tradeoffs and rely on knowledge the allocator doesn't have.

[+] skohan|5 years ago|reply

> but what's a reason C/C++ can't allocate memory in a satisfactory way by default on those platforms?

Custom allocators are important when memory performance is relevant. Allocation/deallocation is really slow compared to using memory which is already allocated to a program, so this can often be one of the biggest performance syncs in a program, especially one which churns through a lot of data (e.g. videogames, simulation, ML training etc.)

Custom allocators can also help with memory coherency: you can pack objects you're using together close together in memory, which minimizes the amount of CPU cache misses, which can also be one of the most expensive parts of execution. It may be difficult or impossible for a compiler to design a more optimal memory layout than a programmer with knowlege of how the program will use the data.

> And how do other languages get away with it?

Custom allocators are most relevant in performance-critical contexts, so you're probably already using C++ because a higher-level language was already too slow for your use-case. In other words, other languages pay the same cost for memory management, but if you're programming in Python you're probably working in a domain where the performance difference doesn't matter.

Even programs written in a "fast" GC'd language like GO will have a performance ceiling largely dictated by memory churn.

[+] alfalfasprout|5 years ago|reply

It's not just tiny platforms. For example, you often want allocators with some predictable characteristic. Eg; constant time allocation for a given request size. Or you might want an allocator that keeps certain items contiguous so they'll tend to already be in cache. Custom allocators tend to be used when you have a particular requirement that a general purpose allocator isn't optimal for.

For tiny platforms often it's bare-metal so there's only one running "process" and so something like "malloc" that needs to be aware of memory usage across the system isn't relevant. Moreover, embedded systems tend to care about deterministic performance so an allocator specific to the application can be a better match.

[+] 59nadir|5 years ago|reply

> And how do other languages get away with it?

General purpose allocators are (basically) never the best for speed, fragmentation, and so on. They can at most be good enough. Basic knowledge about how your memory is going to be used that'll inform the use of a few simple custom allocators is likely to give you several times the performance that you'd see with a general purpose allocator. In the best case you'll literally reset an allocation pointer once per loop and just overwrite memory that you know isn't used anymore, for example, making allocation a write into an already existing buffer and "free" the resetting of the pointer. This has nothing to do with platforms, but rather is just basic removal of dumb code that shouldn't be running. There's zero reason a an actual `free`/`delete` should be used in those cases and using it is likely to slow down that loop considerably.

> And how do other languages get away with it?

Much like general purpose allocators are never the best, there is zero evidence to suggest that garbage collection is ever the optimal choice; it can only ever be good enough. There's also a lot of lore around garbage collectors being magic and doing lots of great things for you to ensure less fragmentation and nice cache locality but people are usually just amazed at how it's "not that bad" in the end.

There's no silver bullet: Garbage collectors and general purpose allocators aren't magic pieces of code written by developers who can conjure up the best code for every use case. Like most general code they're "not bad" at everything but not actually very good at anything. Running less code for your allocation and eliminating things based on your knowledge of how things are going to be used is always going to be better.

[+] phkahler|5 years ago|reply

>> I know custom allocators are important for tiny platforms and so on...

For tiny platforms like micro controllers used all over your car, standard practice is to never use dynamic memory allocation. When you have 1K or even 64K of RAM and need to run for a while the only way to be sure you never run out of memory is not to use malloc at all. Fragmentation is a thing. This may mean managing fixed collections of objects much like the allocators in the article, but the linker figures out where to put them at compile time.

[+] imtringued|5 years ago|reply

>And how do other languages get away with it?

They don't get away with it. They suffer the consequences of bad performance. Of course depending on your metric they are "faster" in some situations that are common to that language but C++ often lets you avoid temporary heap allocations by using the stack which further increases performance.

[+] detaro|5 years ago|reply

> And how do other languages get away with it?

They either bring similar mechanisms, bring a selection of their own allocators that covers enough space or just don't allow influencing it much and get away with it because many applications/users do not care.

[+] fsociety|5 years ago|reply

You can allocate pools of memory so that you have a fixed memory footprint. That way you have (almost) no risk of OOMing later on in the process.

This can also give you better control over cache locality of memory.

[+] dnautics|5 years ago|reply

It's not just useful for tiny platforms. Consider the erlang virtual machine, which has something like 11 custom allocators internally, for different performance characteristics.

[+] mac01021|5 years ago|reply

Which other languages do you have in mind?

[+] favorited|5 years ago|reply

Andrei Alexandrescu's 2015 talk about C++'s allocators is great. He's a very entertaining presenter, and he makes case for extremely composable templated allocators (and why std::allocator isn't that).

https://www.youtube.com/watch?v=LIb3L4vKZ7U

[+] chromatin|5 years ago|reply

Great talk.

He and others made these principles manifest in Dlang's std.experimental.allocator [0] which offers composability and is really quite nice, allowing for you to use Dlang's GC, malloc, custom malloc, various strategies for metering out the allocated blocks, and any combination thereof.

The documentation can be hard to navigate and it is easy to miss the submodules listed in the tree on the left hand side, so I would especially draw attention to `building blocks` [1] and `mallocator`, `gc_allocator`, etc.

[0] https://dlang.org/phobos/std_experimental_allocator.html [1] https://dlang.org/phobos/std_experimental_allocator_building...

[+] dtornabene|5 years ago|reply

if people are interested in this I'd recommend Silvio Cesare's new hackerspace blog. He's been publishing what appears to be a running series for a year and change now on heap attacks, highly illuminating. Several of the posts are attacks on embedded and allocators other than the main one in glibc

https://blog.infosectcbr.com.au/

[+] fizixer|5 years ago|reply

Tangential, but I just learned (TIL) that C is platform-independent in the following way, compared to Java:

- C is "write once, compile everywhere, run everywhere"

- Java is "write once, compile once, run everywhere"

[+] jjtheblunt|5 years ago|reply

maybe taking the JIT-compiler inside the JVM into account...

- Java is "write once, partially compile once, finish compiling everywhere, run everywhere"