There is a little-known Valgrind tool called "DHAT" (short for "Dynamic Heap Analysis Tool") that's designed to help find exactly these sorts of excessive allocations.
> I previously had -Ccodegen-units=8 in RUSTFLAGS because it speeds up compile times. ... the resulting rustc was about 5–10% slower. So I’ve stopped using it now.
For the benefit of others who found the description in the blog post unclear and can't or don't want to dig through the code changes themselves: "fixing the hash code and the linked list code to not use mallocs" is a bit misleading. Curl now uses the idiom where the linked list data (prev/next pointers) are inlined in the same struct that also holds the payload. So it's one malloc instead of two per dynamically allocated list element. This explains the "down to 80 allocations from the 115" part.
The larger gain is explained better and comes simply from stack allocation of some structures (which live in a simple array, not a linked list or hash table).
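To make the one-malloc-instead-of-two contrast concrete, here is a minimal C sketch of the two idioms (type and function names are hypothetical, not curl's):

```c
#include <stddef.h>
#include <assert.h>

/* Idiom 1: external node -- two mallocs per element, one for the node
   and one for the separately allocated payload it points to. */
struct ext_node {
    struct ext_node *prev, *next;
    void *payload;
};

/* Idiom 2: intrusive node -- the prev/next links are inlined in the
   struct that also holds the payload, so one malloc covers both. */
struct conn {
    struct conn *prev, *next;   /* list links live inside the element */
    int id;                     /* the payload itself */
};

/* Push onto the head of an intrusive list; returns the new head. */
static struct conn *conn_push(struct conn *head, struct conn *c)
{
    c->prev = NULL;
    c->next = head;
    if (head)
        head->prev = c;
    return c;
}

static size_t conn_count(const struct conn *head)
{
    size_t n = 0;
    for (; head; head = head->next)
        n++;
    return n;
}
```

With the intrusive version, an element can even live on the stack or in an array and still be linked into a list, which is what enables the stack-allocation gains mentioned above.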
> Curl now uses the idiom where the linked list data (prev/next pointers) are inlined in the same struct that also holds the payload.
I wonder why this wasn't the original design. I've seen the "two-allocate" method a few times in other code, and it has always seemed rather silly to allocate that separate little structure just to point to the things you're linking together anyway, so I'm curious how that way of doing it became somewhat commonplace.
> Curl now uses the idiom where the linked list data (prev/next pointers) are inlined in the same struct that also holds the payload
And when you have a memory overwrite, your metadata is corrupted and you lose the complete list.
I keep my metadata and payload in two separate lists for that very reason.
I think this is fantastic engineering work toward performance, without falling back on the "RAM is cheap" line and doing nothing instead.
It's not every day that you see an example of someone examining and improving old code in a way that results in a measurable benefit to direct and indirect users.
What's not cheap is requesting RAM from the OS piecemeal.
In this particular case the curl maintainer ended up saving a bit of RAM as a side-effect, but it's actually much more common for programs that benefit from this class of optimization to waste RAM as a side-effect, and yet be much faster.
For example a common optimization in high-level programming languages is to overallocate memory for strings. E.g. if the user requests "x" you allocate 16 bytes, if you need 17 bytes you allocate 32 etc.
This wastes RAM (a 513-byte string will malloc 1024 bytes) but ends up being much faster on average, exactly because RAM is cheap while allocations are expensive.
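A minimal sketch of that overallocation policy, assuming a 16-byte floor and power-of-two growth as in the examples above:

```c
#include <stddef.h>
#include <assert.h>

/* Round a requested size up to the next power of two, with a floor of
   16 bytes -- the string-overallocation policy described above (a
   sketch of the idea, not any particular runtime's implementation). */
static size_t grow_capacity(size_t need)
{
    size_t cap = 16;
    while (cap < need)
        cap *= 2;
    return cap;
}
```

A string that grows byte by byte then triggers only O(log n) reallocations instead of O(n), which is where the speedup comes from.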
> It's not every day that you see an example of someone examining and improving old code in a way that results in a measurable benefit to direct and indirect users.
It's even more impressive when you consider the absolute ubiquity of cURL in practically every device imaginable. He's probably single-handedly saving many kilowatts of power consumption with fixes like this.
This only shows the naivete of the author. There are plenty of C programmers who aren't malloc abusers and like to avoid any potential syscall usage. Intrusive linked lists are a beginner C topic.
The underlying problem is that C doesn't have comprehensive standard collections, so many developers reinvent the wheel over and over again, and usually that wheel is far from the best in the world. If curl were written in C++, those optimizations would be applied automatically by using STL collections.
Somewhat true, but many C++ programs implement their own containers because they find the STL "slow" or "bloated" or whatever. (For concreteness, both GCC and LLVM do this.) So I guess the same might eventually happen to curl, in which case we'd get this same kind of article on Hacker News ;-)
I don't believe C++ compilers can do optimizations around allocation. A quick test at https://godbolt.org/g/lFQcKf reveals that gcc does not appear to coalesce STL operations, or move the allocation to the stack. The compiler is much dumber than most people think it is.
He turned his containers into intrusive ones. The standard C++ STL containers are not intrusive. Boost comes with intrusive containers, but if you need a third party library, you can do that in C too.
No, that's not the problem: STL containers rely on malloc internally. Also, the major saving came from curl_multi_wait:
> At this point I sorted all allocations based on size and checked all the smallest ones. One that stood out was one we made in curl_multi_wait(), a function that is called over and over in a typical curl transfer main loop. I converted it over to use the stack for most typical use cases. Avoiding mallocs in very repeatedly called functions is a good thing.
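The pattern described in that quote is roughly: reserve a small fixed buffer on the stack and fall back to malloc only when the input exceeds it. A sketch of the idea (the threshold and names here are made up, not curl's actual code):

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

#define SMALL_MAX 8  /* hypothetical threshold; curl's constant differs */

/* Sum an array of n ints after copying it into scratch space, using
   the stack for typical small n and malloc only for large n. */
static long sum_with_scratch(const int *src, size_t n)
{
    int small[SMALL_MAX];       /* covers the common case with no malloc */
    int *buf = small;
    long total = 0;
    size_t i;

    if (n > SMALL_MAX) {
        buf = malloc(n * sizeof(*buf));
        if (!buf)
            return -1;
    }
    memcpy(buf, src, n * sizeof(*buf));
    for (i = 0; i < n; i++)
        total += buf[i];
    if (buf != small)           /* only free what we actually malloc'd */
        free(buf);
    return total;
}
```

In a function called on every loop iteration, the common case then costs zero allocator calls, which is exactly the point of the curl_multi_wait change.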
My problem with excessive allocations is usually what happens in interpreted languages. People think, "hey, it's already slow-ass interpreted, so let's not care about allocation at all."
An example which I see all the time, looking at tons of python libraries which in the end do I/O against a TCP socket. Sometimes the representation between what the user passes to the library and what goes out to the socket can be retained as an array of buffers which are to be sent to the socket.
Instead of iterating over the array and sending each block (if big enough) on its own to the socket, the library author concatenates them into one buffer and then sends it over the socket.
When dealing with big data, this adds lots of fragmentation and overhead (measurable), yet some library authors don't care...
Even the basic httplib and requests have this issue when sending a large file via POST (they concatenate it with the header, instead of sending the header and then the large file).
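The usual fix for this is scatter-gather I/O: hand the kernel the header and the body as separate buffers in one call instead of concatenating them first. A C sketch using POSIX writev (in Python, socket.sendmsg or simply two sendall calls achieve the same thing):

```c
#include <sys/uio.h>   /* writev: POSIX scatter-gather I/O */
#include <string.h>
#include <unistd.h>
#include <assert.h>

/* Send a header and a large body without building a concatenated
   copy first; the kernel gathers both buffers in one syscall. */
static ssize_t send_request(int fd, const char *header, size_t hlen,
                            const char *body, size_t blen)
{
    struct iovec iov[2];
    iov[0].iov_base = (void *)header;
    iov[0].iov_len = hlen;
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = blen;
    return writev(fd, iov, 2);
}
```

For a multi-gigabyte body this avoids both the extra allocation and the extra copy; the caller should still handle partial writes in real code.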
Explicit dynamic memory handling in low-level languages hurts in a way similar to garbage collectors in high-level languages: hidden and often unpredictable execution costs (malloc/realloc/free internally usually implement O(n log n) algorithms, or worse). So the key for performance, whether you work in a low-level or a high-level language, is to use preallocated data structures when possible. That way you get low fragmentation and fast execution, because the allocator/deallocator is never called, in the case of explicit dynamic memory handling, and lower garbage collector pressure for the same reasons in a garbage-collected language.
An allocator doesn't have to be slow. You can implement an allocator yourself that asks for memory from the OS up front and just hands pointers back to the calling application. If you know the order of allocations is the reverse of the deallocations (as an example), you can do allocations with a pointer bump!
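A minimal sketch of such a bump (arena) allocator: one upfront block, allocation by pointer increment, and O(1) release of everything at once:

```c
#include <stddef.h>
#include <assert.h>

/* A bump allocator over a fixed arena. There is no per-allocation
   free; the whole arena is released by resetting the offset. */
struct arena {
    unsigned char *base;
    size_t size;
    size_t used;
};

static void *arena_alloc(struct arena *a, size_t n)
{
    /* keep 8-byte alignment for the next allocation */
    size_t aligned = (n + 7u) & ~(size_t)7u;
    void *p;
    if (a->used + aligned > a->size)
        return NULL;            /* arena exhausted */
    p = a->base + a->used;
    a->used += aligned;         /* the "pointer bump" */
    return p;
}

static void arena_reset(struct arena *a)
{
    a->used = 0;                /* free everything in O(1) */
}
```

The trade-off is that individual objects cannot be freed early, which is why this fits request-scoped or phase-scoped allocations best.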
My rule of thumb is to look at the application's design and only ever use malloc where there is a 1:N (or N:M) relationship between entities. Everything that's 1:1 should be allocated in a single step.
> Doing very small (less than say 32 bytes) allocations is also wasteful just due to the very large amount of data in proportion that will be used just to keep track of that tiny little memory area (within the malloc system). Not to mention fragmentation of the heap.
That's not necessarily true. Modern allocators tend to use a set of fixed-size buckets.
But given that curl runs on lots of platforms it makes sense to just fix the code.
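A sketch of how such bucketing maps a request size to a fixed size class (the class list here is illustrative, not any real allocator's):

```c
#include <stddef.h>
#include <assert.h>

/* Map a request size to a fixed size class, the way many modern
   allocators bucket small allocations: a tiny request is rounded up
   to its class, so per-allocation bookkeeping stays small. */
static size_t size_class(size_t n)
{
    static const size_t classes[] = { 16, 32, 48, 64, 96, 128 };
    size_t i;
    for (i = 0; i < sizeof(classes) / sizeof(classes[0]); i++)
        if (n <= classes[i])
            return classes[i];
    return n;   /* large requests go to the general allocator */
}
```

Because all allocations in a bucket share one size, the allocator needs little or no per-block header and same-size slots keep fragmentation bounded, which is the counterpoint to the quoted claim.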
Note that this pattern[0] is essentially "copy-on-write", which can be encapsulated safely as such in a reasonably simple type (in a language with generics) and used elsewhere. I use a similar mechanism pervasively in some low-level web server code to use references into the query string, body and JSON objects directly when possible, and allocated strings when not.
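Even without generics, that borrow-or-copy idea can be encapsulated in a small tagged view type. A C sketch (field and function names are made up for illustration):

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* A borrowed-or-owned string view: reference an existing buffer when
   possible, and allocate only when an independent copy is needed. */
struct strview {
    const char *ptr;
    size_t len;
    int owned;          /* nonzero if ptr must be freed */
};

static struct strview sv_borrow(const char *s, size_t len)
{
    struct strview v = { s, len, 0 };
    return v;
}

/* Make the view independent of its source buffer (copy on demand). */
static int sv_to_owned(struct strview *v)
{
    char *copy;
    if (v->owned)
        return 0;       /* already owns its storage */
    copy = malloc(v->len + 1);
    if (!copy)
        return -1;
    memcpy(copy, v->ptr, v->len);
    copy[v->len] = '\0';
    v->ptr = copy;
    v->owned = 1;
    return 0;
}

static void sv_release(struct strview *v)
{
    if (v->owned)
        free((void *)v->ptr);
}
```

In the common case (e.g. a query-string field used as-is) no allocation ever happens; only callers that need to mutate or outlive the source pay for a copy.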
> The point is rather that curl now uses less CPU per byte transferred, which leaves more CPU over to the rest of the system to perform whatever it needs to do. Or to save battery if the device is a portable one.
Does anyone have a general sense of how these kinds of efficiencies translate to real-world battery life? I understand that the mechanisms (downclocking/sleeping the CPU) are there; I'm just curious as to how much it actually moves the needle in a real system.
In any one single instance it's not that interesting, but considering probably half (if not more) of the stuff you interact with uses libcurl, it adds up to a lot.
> There have been 213 commits in the curl git repo from 7.53.1 till today. There’s a chance one or more other commits than just the pure alloc changes have made a performance impact, even if I can’t think of any.
"I can't think of any" is not a very scientific way to measure optimizations. Actually, this simple fact casts a doubt on whether it's this malloc optimization that led to the speed up or any of the 200+ commits OP is working on top of.
Why not eliminate that doubt by applying the malloc optimizations to the previous official release?
I'm a bit skeptical about the speed up myself, since I would expect curl to be primarily IO bound and not CPU bound (much less malloc bound, given how little memory it uses).
I'm not skeptical about it. Last time I did something similar to this optimization, I had a python function that was doing some work over strings to get a similarity metric.
For this function, a list of elements was kept inside each call, which began empty; a few calculations were performed, populating it before the main loop of the function kicked in. In Python-land, this was obviously done by declaring an empty list `precalc_values = []` and then appending to it. When we cythonized it, the dev who took it went in with a `cy_malloc(sizeof(int)*elements)` "dynamic array of ints" and called it a day: 70x speedup over plain Python.
A few days later I came in and saw that code, and said "why not go with a simple array?", to which I was told "because we don't know the size of the strings beforehand". Did a test run with a small array plus a counter (to know up to where the array held real values, and not just zero-init fields) and got a 100x speedup.
In the end we went with both functions and a wrapper that would check the length of the strings involved and select one or the other, because the array version would massively crash from accessing an array out of bounds if a large string happened to come by.
nnethercote|9 years ago
Here's an old blog post describing it, by DHAT's author: https://blog.mozilla.org/jseward/2010/12/05/fun-n-games-with...
Here's another blog post in which I describe how I used it to speed up the Rust compiler significantly: https://blog.mozilla.org/nnethercote/2016/10/14/how-to-speed...
And here is the user manual: http://valgrind.org/docs/manual/dh-manual.html
wyldfire|9 years ago
...why is this the case?
tom_mellior|9 years ago
userbinator|9 years ago
dom0|9 years ago
husted|9 years ago
makerbraker|9 years ago
avar|9 years ago
throwanem|9 years ago
aphextron|9 years ago
srean|9 years ago
ed_balls|9 years ago
>"RAM is cheap"
Memory is cheap on servers and desktops. SSD can be treated as slow RAM.
lkjalsdkfjasdf|9 years ago
vbezhenar|9 years ago
tom_mellior|9 years ago
dooglius|9 years ago
wolf550e|9 years ago
halayli|9 years ago
iamalurker|9 years ago
snksnk|9 years ago
faragon|9 years ago
maccard|9 years ago
rumcajz|9 years ago
areyousure|9 years ago
maccard|9 years ago
21|9 years ago
__s|9 years ago
vertex-four|9 years ago
[0] https://github.com/curl/curl/commit/5f1163517e1597339d
Asooka|9 years ago
Edit: I would also be tempted to remove the ufds_on_stack variable and just check if the ufds pointer points to the stack or not.
0xcde4c3db|9 years ago
exDM69|9 years ago
Downclocking has a negative impact because the CPU needs to stay powered up longer, and it leaks current the whole time it is not power-gated.
59nadir|9 years ago
maccard|9 years ago
unknown|9 years ago
[deleted]
hota_mazi|9 years ago
dr_zoidberg|9 years ago
ape4|9 years ago
> converted to become malloc-less (the way
> linked list functions should behave, really).
I don't see how a linked list can not use malloc().
rumcajz|9 years ago
hota_mazi|9 years ago
You are "polluting" your payload with list information, but you gain malloc less lists.
amenghra|9 years ago
__s|9 years ago
2GB
fastest963|9 years ago