
Drop millions of allocations by using a linked list (2015)

179 points | ddtaylor | 5 years ago | github.com

160 comments

[+] rsp1984|5 years ago|reply
Lessons from high-performance / embedded development: If you know the size of your data beforehand, or can at least put a bound on it, it's best to pre-allocate your memory and then hand out bits as needed through a custom allocator working on the pre-allocated set. That way you have only a single allocation and all your data is contiguous in memory (huge deal for cache coherency and therefore speed of access). Very easy to put a vector/ArrayList -type interface on top of this.

If you don't know the size or bound, then do the next best thing and pre-allocate in chunks and use a slightly more sophisticated allocator that manages the chunks to hand out bits.

Not sure what mechanisms high-level languages use these days and whether there's any pre-allocation involved under the hood, but plain unoptimized linked lists can fragment your memory pretty badly and they also aren't cache coherent. Use them to store small #s of large objects, instead of large #s of small objects.

[+] simiones|5 years ago|reply
Most GC languages use bump-pointer allocation and/or compacting garbage collection to (try to) ensure that objects are stored in memory in this way regardless of being dynamically allocated.

This means that even naive linked list implementations are typically much more cache efficient than you would expect coming from C, at least after a GC pass. It also means, though, that changes to the dependency graph of your program can have unexpected effects on your cache hit rates, since the compacting heuristics are pretty simple (they typically work by copying "child" objects immediately after the "parent" object, but there is no mechanism to define the "right" parent for an object).

[+] pdpi|5 years ago|reply
I think the term you're looking for is cache locality. Cache coherence is about ensuring that caches local to different cores see the same values for any one given chunk of memory.
[+] kqr|5 years ago|reply
Doesn't even malloc do this under the hood by only occasionally calling sbrk and otherwise handing out blocks from an internally managed contiguous region?

What's the point of doing this again, manually? Or am I mistaken in my memory of the common modern malloc implementations?

[+] as-j|5 years ago|reply
> Lessons from embedded development

Frequently you're working with hardware, devices, etc of which there's a limited number. I don't need a linked list to hold data for 3 peripherals. Even if the hardware is changed and the number changes and there's 10 peripherals I don't need a linked list. Sometimes an array is just fine.

[+] matheusmoreira|5 years ago|reply
Why can't we have this as a general interface in standard libraries? For example, we could have a contextual allocation function:

  // Contiguous array of 1024 objects of type
  context = contextual_allocation(1024, sizeof type);

  // obtain pointer to next free object 
  object = allocate(context);
[+] jedimastert|5 years ago|reply
I remember seeing a sizable astrophysics Fortran program from many decades prior that had one "MEMORY" array declared at the beginning of the program and legit used it for every single byte of memory for the entire program. It was nuts
[+] BunsanSpace|5 years ago|reply
The SOP with C++ is that you overload the new and delete operators and implement your own heap.

It can be very handy especially when you add tags to your allocation so you can track the number of allocations from a specific piece of code, great for finding memory leaks.

[+] captain_price7|5 years ago|reply
I think for C++ vector, when the current underlying array gets filled, the common strategy is to allocate a new array double the size of the current one and copy the contents over.
[+] ciarcode|5 years ago|reply
I think this is the solution used in FreeRTOS if you use heap_1 as the allocator.
[+] junippor|5 years ago|reply
In Julia if you have an estimate of the size (not necessarily a hard bound) you can `sizehint!` to reduce the number of allocations.
[+] mytailorisrich|5 years ago|reply
> it's best to pre-allocate your memory and then hand out bits as needed through a custom allocator working on the pre-allocated set.

Yes. For linked lists this would typically mean creating a pool of objects to be allocated from the pool and returned to the pool as needed for operating the linked list.

[+] dragontamer|5 years ago|reply
On the contrary: small objects mean that fragmentation is NOT a problem, because your small nodes can fit anywhere. Furthermore, if you've got 1000x 8-byte allocations, any of those "holes" can be repurposed for other 8-byte allocations.

Meaning: linked lists innately make most simple allocators more efficient.

Linked lists are a tradeoff between the difficulty of malloc/free (which is hard for arrays) and latency (which is much higher for linked lists).

A ton of 8-byte allocations is better for fragmentation than the 8-byte / 16-byte / 32-byte ... allocation pattern that most array allocators produce.

[+] Chirono|5 years ago|reply
Except it turns out to be "drop millions of allocations by changing an exponential recursive algorithm to a linear scan that does a different (and incorrect) thing". So it seems like the linked list part of it was incidental.
[+] naranha|5 years ago|reply
Unrelated to the article, but perhaps an interesting fact: as a Java programmer you perhaps know that using the class java.util.LinkedList is almost always worse than using plain arrays or ArrayList. Even though there are far fewer allocations, the overhead of maintaining the linked list data structures is immense. And it turns out that computers are pretty fast at allocating memory.

But there are more optimized implementations of linked lists, of course (probably using arrays internally as well (?)).

[+] exDM69|5 years ago|reply
Yes, linked lists have terrible cache behavior and usually the performance of any packed structure with less pointer chasing should beat it. ArrayList, Vector, whatever you call it usually is faster.

But computers aren't fast in allocating memory. In particular, they're not predictably fast in allocating memory, as most allocators will have complex internal data structures and may end up having to call the operating system to give fresh pages of unused memory. The best case may be smoking fast, the amortized cost may be decent but the worst case is pretty bad.

So when reliable and predictable performance is needed, memory allocations are to be avoided. This is relevant if you're working on (soft) real time applications, say audio, games or so.

And linked lists are still mighty fast in addition and removal, joining and splitting. Intrusive linked lists drop one extra pointer chase.

In kernel space, there are tons of use cases for linked lists that are never traversed, or at least not traversed in the fast path. Traversal is the slow part, but that's not necessary for a lot of cases where you add something to a list or a set, and remove items one by one.

There's no other data structure that beats linked lists when there's only addition, removal, joins and splits, but no traversals. Linked lists are almost always the wrong data structure for the job, but in special niche cases (which are not that rare, at least in kernel space) it's the simplest and most effective.

[+] ansible|5 years ago|reply
It is worth remembering that linked lists were invented back in the dawn of computing.

This was when systems... often didn't even have cache memory at all. Accessing different RAM values took the same amount of time, sometimes a single clock cycle. Fast! So jumping around a (large, for the time) linked list in heap memory wasn't nearly as slow, relative to other data structures, as it is today.

[+] twic|5 years ago|reply
There is such a thing as an unrolled linked list, which is a bit like a flat B-tree:

https://en.wikipedia.org/wiki/Unrolled_linked_list

They ought to have quite good properties. In particular, if you are appending one element at a time, they never need to copy, but are also quite dense. Iteration is then fast. Feels like it should be useful for buffering and aggregation type jobs. I have never seen one in a library.

[+] Cthulhu_|5 years ago|reply
This is the thing with algorithms and leetcode and the like; I've got about ten years of professional experience, and in all of my career (CRUD apps, full stack; sounds boring but I've touched many different industries) I've NEVER had to think about what data structure to use.

Java's ArrayList was good enough in all cases, as was HashMap. One interesting thing, though: ConcurrentHashMap turned out to be faster in all my (bad) benchmarks.

I'm doing Go nowadays, it does not have generics except for its built in lists (arrays/slices) and maps, and that covers 99% of my use cases as well.

Anyway, I'll never work at Google or any company that does leetcode interviews because it's so far removed from (my?) reality.

[+] skohan|5 years ago|reply
> it turns out that computers are pretty fast at allocating memory

Isn't this not the case? My understanding was that 90% of HPC is normally about avoiding as many individual allocations/deallocations as possible, because these are massively expensive compared to almost all user-space operations, and they lead to cache-inefficient memory layouts, which can slow your code down by something like a factor of 200.

[+] karmakaze|5 years ago|reply
In Java, arrays or ArrayList usually mean 'of references', so they don't get the full cache-locality benefit until Project Valhalla ever gets done. For primitive types, there are numerous primitive collection libraries that store primitive values and, optionally, object references.
[+] inglor_cz|5 years ago|reply
Once upon a time, there was Symbian OS, with very limited resources and an ugly tendency to fragment the heap. And the heaps were small, given how little memory popular devices like the Nokia E52 had.

I wrote an XML DOM parser that relied heavily on placement new and could recycle most of the nodes among documents. It sped up parsing by over 20 times. Some documents (> 5 MB) that were previously unparseable became parseable.

[+] JoeAltmaier|5 years ago|reply
As an embedded programmer, I often end up using a simple linked list for memory allocations (when I allocate at all). But I hash by rounded-up block sizes. Instead of having hash nodes for every block size, e.g. {...114, 116, 117, 118...}, I'll use {64, 128, 256, 512...}. With eight or nine hash buckets the app will re-use the same memory (after allocating a working set) forever.
[+] ciconia|5 years ago|reply
Can you share the algorithm?
[+] queuebert|5 years ago|reply
On a related note, Rust's Cow (copy on write) type is great for managing a large area of allocated memory where many threads are reading and only a few threads write. You can dole out pointers like candy without worrying about race conditions, because the moment they are written to a new allocation is created and the pointer moved.
[+] elesbao|5 years ago|reply
Nice to see the Ruby folks learning some CS !