Fewer mallocs in curl

[+] nnethercote|9 years ago|reply

There is a little-known Valgrind tool called "DHAT" (short for "Dynamic Heap Analysis Tool") that's designed to help find exactly these sorts of excessive allocations.

Here's an old blog post describing it, by DHAT's author: https://blog.mozilla.org/jseward/2010/12/05/fun-n-games-with...

Here's another blog post in which I describe how I used it to speed up the Rust compiler significantly: https://blog.mozilla.org/nnethercote/2016/10/14/how-to-speed...

And here is the user manual: http://valgrind.org/docs/manual/dh-manual.html

[+] wyldfire|9 years ago|reply

> I previously had -Ccodegen-units=8 in RUSTFLAGS because it speeds up compile times. ... the resulting rustc was about 5–10% slower. So I’ve stopped using it now.

...why is this the case?

[+] tom_mellior|9 years ago|reply

For the benefit of others who found the description in the blog post unclear and can't or don't want to dig through the code changes themselves: "fixing the hash code and the linked list code to not use mallocs" is a bit misleading. Curl now uses the idiom where the linked list data (prev/next pointers) are inlined in the same struct that also holds the payload. So it's one malloc instead of two per dynamically allocated list element. This explains the "down to 80 allocations from the 115" part.

The larger gain is explained better and comes simply from stack allocation of some structures (which live in a simple array, not a linked list or hash table).

[+] userbinator|9 years ago|reply

Curl now uses the idiom where the linked list data (prev/next pointers) are inlined in the same struct that also holds the payload.

I wonder why this wasn't the original design; I've seen the "two-allocate" method a few times in other code and it's always seemed rather silly to allocate that separate little structure just to point to the things you're linking together anyway, so I'm curious how that way of doing it became somewhat commonplace.

[+] dom0|9 years ago|reply

AKA intrusive (sometimes "invasive") containers / lists.

[+] husted|9 years ago|reply

Curl now uses the idiom where the linked list data (prev/next pointers) are inlined in the same struct that also holds the payload

And when you have a memory overwrite then your meta data is corrupted and you loose the complete list. I keep my meta data and payload in two separate list for that very reason.

[+] makerbraker|9 years ago|reply

I think this is fantastic engineering work towards performance, without falling back on the "RAM is cheap" line and instead doing nothing.

It's not every day that you see an example of someone examining and improving old code, that will result in a measurable benefit to direct and indirect users.

[+] avar|9 years ago|reply

RAM is cheap.

What's not cheap is requesting RAM from the OS piecemeal.

In this particular case the curl maintainer ended up saving a bit of RAM as a side-effect, but it's actually much more common for programs that benefit from this class of optimization to waste RAM as a side-effect, and yet be much faster.

For example a common optimization in high-level programming languages is to overallocate memory for strings. E.g. if the user requests "x" you allocate 16 bytes, if you need 17 bytes you allocate 32 etc.

This wastes RAM, a 513 byte string will malloc 1024 bytes, but ends up being much faster on overage exactly because RAM is cheap, but allocations are expensive.

[+] throwanem|9 years ago|reply

Also, when you download FLAC files with it, they'll sound warmer.

[+] aphextron|9 years ago|reply

>It's not every day that you see an example of someone examining and improving old code, that will result in a measurable benefit to direct and indirect users.

It's even more impressive when you think about the absolute ubiquity of cURL in practically every device imaginable. He's probably single handedly saving many kilowatts of power consumption with fixes like this.

[+] srean|9 years ago|reply

Indeed. Roadways are cheap to use but those who behave as if its all theirs are still called dicks for a reason.

[+] ed_balls|9 years ago|reply

This is a completely different scenario than an average app. Curl is used in cars and other embedded devices. This things matter for this scale.

>"RAM is cheap"

Memory is cheap on servers and desktops. SSD can be treated as slow RAM.

[+] lkjalsdkfjasdf|9 years ago|reply

This only shows the naivete of the author. There are plenty of C programmers that aren't malloc abusers and like to avoid any potential syscall usage. Instrusive linked lists is a beginner C topic.

[+] vbezhenar|9 years ago|reply

Underlying problem is that C doesn't have comprehensive standard collections, so many developers reinvent the wheel over and over again, and usually that wheel is far from best in the world. If curl was written with C++, those optimizations would be applied automatically by using STL collections.

[+] tom_mellior|9 years ago|reply

Somewhat true, but many C++ programs implement their own containers because they find the STL "slow" or "bloated" or whatever. (For concreteness, both GCC and LLVM do this.) So I guess the same might eventually happen to curl, in which case we'd get this same kind of article on Hacker News ;-)

[+] dooglius|9 years ago|reply

I don't believe C++ compilers can do optimizations around allocation. A quick test at https://godbolt.org/g/lFQcKf reveals that gcc does not appear to coalesce STL operations, or move the allocation to the stack. The compiler is much dumber than most people think it is.

[+] wolf550e|9 years ago|reply

He turned his containers into intrusive ones. The standard C++ STL containers are not intrusive. Boost comes with intrusive containers, but if you need a third party library, you can do that in C too.

[+] halayli|9 years ago|reply

no that's not the problem. STL containers rely on mallocs internally. also the major saving came from curl_multi_wait:

> At this point I sorted all allocations based on size and checked all the smallest ones. One that stood out was one we made in curl_multi_wait(), a function that is called over and over in a typical curl transfer main loop. I converted it over to use the stack for most typical use cases. Avoiding mallocs in very repeatedly called functions is a good thing.

[+] iamalurker|9 years ago|reply

My problem with excessive allocations is usually what happens in interpreted languages. People think, hey it's already slow ass interpreted so lets not care about allocation at all.

An example which I see all the time, looking at tons of python libraries which in the end do I/O against a TCP socket. Sometimes the representation between what the user passes to the library and what goes out to the socket can be retained as an array of buffers which are to be sent to the socket.

Instead of iterating on the array, and sending each block (if big enough) on it's own to the socket, the library author concat them to one buffer, and then send it over the socket.

When dealing with big data, this adds lots of fragmentation and overhead (measurable), yet some library authors don't care...

Even the basic httplib and requests has this issue when sending via POST a large file (it concats it to the header, instead of sending the header, then the large fiel).

[+] snksnk|9 years ago|reply

Optimization backed by comparative statics. These reads are so satisfying. Thank you for submitting.

[+] faragon|9 years ago|reply

Explicit dynamic memory handling in low level languages hurts in a similar way garbage collectors do in high level languages: hidden and often unpredictable execution costs (malloc/realloc/free internally usually implement O(n log n) algorithms, or worse). So the point for performance, no matter if you work with low level or high level languages is to use preallocated data structures, when possible. That way you'll have low fragmentation and fast execution because not calling the allocator/dealocator in the case of explicit dynamic memory handling, and lower garbage collector pressure because of the same reasons in the case of a garbage-collected languages.

[+] maccard|9 years ago|reply

An allocator doesn't have to be slow. You can implement allocator yourself that asks for memory from the OS up front, and just hands pointer back to the caking application. If you know the order of allocations is the reverse of the deallocstiond (as an example) you can do allocations with a pointer bump!

[+] rumcajz|9 years ago|reply

My rule of thumb is to look at application's design and only ever use malloc where there is 1:N (or N:M) relationship between entities. Everything that's 1:1 should be allocated in a single step.

[+] areyousure|9 years ago|reply

Your heuristic sounds interesting. Can you say a little more about which entities and how to determine their relationship? Thanks.

[+] maccard|9 years ago|reply

Weve had data that was fixed size but wouldn't fit on the stack.

[+] 21|9 years ago|reply

> Doing very small (less than say 32 bytes) allocations is also wasteful just due to the very large amount of data in proportion that will be used just to keep track of that tiny little memory area (within the malloc system). Not to mention fragmentation of the heap.

That not necessarily true. Modern allocators tend to use a bunch of fixed size-buckets.

But given that curl runs on lots of platforms it makes sense to just fix the code.

[+] __s|9 years ago|reply

& often those fixed-size buckets smallest size is 32 bytes. It still has to at least have a freelist

[+] vertex-four|9 years ago|reply

Note that this pattern[0] is essentially "copy-on-write", which can be encapsulated safely as such in a reasonably simple type (in a language with generics) and used elsewhere. I use a similar mechanism pervasively in some low-level web server code to use references into the query string, body and JSON objects directly when possible, and allocated strings when not.

[0] https://github.com/curl/curl/commit/5f1163517e1597339d

[+] Asooka|9 years ago|reply

But why not use alloca instead of always allocating 10 elements on the stack you might not need?

Edit: I would also be tempted to remove the ufds_on_stack variable and just check if the ufds pointer points to the stack or not.

[+] 0xcde4c3db|9 years ago|reply

> The point is rather that curl now uses less CPU per byte transferred, which leaves more CPU over to the rest of the system to perform whatever it needs to do. Or to save battery if the device is a portable one.

Does anyone have a general sense of how these kinds of efficiencies translate to real-world battery life? I understand that the mechanisms (downclocking/sleeping the CPU) are there; I'm just curious as to how much it actually moves the needle in a real system.

[+] exDM69|9 years ago|reply

30℅ less CPU use is roughly 30℅ less energy usage. The CPU will power down when it's done.

Downclocking has negative impact because the CPU needs to be powered up longer and it is leaking current all the time it is not powergated.

[+] 59nadir|9 years ago|reply

In any one single instance it's not that interesting, but considering probably half (if not more) of the stuff you interact with uses libcurl, it adds up to a lot.

[+] maccard|9 years ago|reply

Not sure in hard numbers, but mobile processors are designed to work this way - do a small amount of work at full power and then sleep.

[+] unknown|9 years ago|reply

[deleted]

[+] hota_mazi|9 years ago|reply

> There have been 213 commits in the curl git repo from 7.53.1 till today. There’s a chance one or more other commits than just the pure alloc changes have made a performance impact, even if I can’t think of any.

"I can't think of any" is not a very scientific way to measure optimizations. Actually, this simple fact casts a doubt on whether it's this malloc optimization that led to the speed up or any of the 200+ commits OP is working on top of.

Why not eliminate that doubt by applying the malloc optimizations to the previous official release? I'm a bit skeptical about the speed up myself, since I would expect curl to be primarily IO bound and not CPU bound (much less malloc bound, given how little memory it uses).

[+] dr_zoidberg|9 years ago|reply

I'm not skeptical about it. Last time I did something similar to this optimization, I had a python function that was doing some work over strings to get a similarity metric.

For this function, a list of elements was kept inside each call, which began empty and a few calculations where performed, populating it before the main loop of the function kicked in. In python-land, this was obviously done by declaring an empty list `precalc_values = []` and then appending over it. When we cythonized it, the dev that took it went in with a `cy_malloc(size(int)*elements)` "dynamic array of ints", and called it a day, 70x speedup over plain python.

A few days later I came in and saw that code, and said "why not go with a simple array?", to which I was told "because we don't know the size of the strings beforehand". Did a test run with a small array plus a counter (to know up to where the array held real values, and not just zero-init fields) and got a 100x speedup.

In the end we went with both functions and a wrapper that would check the length of the strings involved and select one or the other, because the array version would massively crash from accessing an array out of bounds if a large string happened to come by.

[+] ape4|9 years ago|reply

> This time the generic linked list functions got

> converted to become malloc-less (the way

> linked list functions should behave, really).

I don't see how a linked list can not use malloc().

[+] rumcajz|9 years ago|reply

Look for intrusive containers.

[+] hota_mazi|9 years ago|reply

Have each node of data contain the `prev` and `next` link.

You are "polluting" your payload with list information, but you gain malloc less lists.

[+] amenghra|9 years ago|reply

You would think curl's perf is bound by the network latency/bandwidth and that intrusive lists wouldn't make a signifiant difference.

[+] __s|9 years ago|reply

> The point here is of course not that it easily can transfer HTTP over 20GB/sec using a single core on my machine

2GB

[+] fastest963|9 years ago|reply

He said gigabit, or at least that's how I read it. 2900MB/sec * 8 is over 20gb/s

130 comments