top | item 27226382

Extreme HTTP Performance Tuning

976 points | talawahtech | 4 years ago | talawah.io

145 comments

[+] alufers|4 years ago|reply
That is one hell of a comprehensive article. I wonder how much impact such extreme optimizations would have on a real-world application, one which, for example, does DB queries.

This experiment feels similar to people who buy old cars and remove everything from the inside except the engine, which they tune up so that the car runs faster :).

[+] talawahtech|4 years ago|reply
This comprehensive level of extreme tuning is not going to be directly useful to most people; but there are a few things in there like SO_ATTACH_REUSEPORT_CBPF that I hope to see more servers and frameworks adopt. Similarly I think it is good to be aware of the adaptive interrupt capabilities of AWS instances, and the impacts of speculative execution mitigations, even if you stick to the defaults.
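For anyone curious what SO_ATTACH_REUSEPORT_CBPF looks like in practice: the classic filter is just two BPF instructions that return the current CPU id, so each connection is delivered to the reuseport listener pinned to that CPU. A hedged sketch using ctypes (the constants are hard-coded Linux ABI values from `<linux/filter.h>`, since the Python stdlib doesn't export them):

```python
import ctypes
import socket
import struct

# Linux ABI constants, not exported by the Python stdlib.
SO_ATTACH_REUSEPORT_CBPF = 51
BPF_LD, BPF_W, BPF_ABS = 0x00, 0x00, 0x20
BPF_RET, BPF_A = 0x06, 0x10
SKF_AD_OFF, SKF_AD_CPU = -0x1000, 36


def build_cpu_filter() -> bytes:
    """Pack the two-instruction 'return current CPU id' classic BPF program.

    Each sock_filter is (u16 code, u8 jt, u8 jf, u32 k).
    """
    insns = [
        # A = raw_smp_processor_id()
        (BPF_LD | BPF_W | BPF_ABS, 0, 0, (SKF_AD_OFF + SKF_AD_CPU) & 0xFFFFFFFF),
        # return A (used as the index into the reuseport group)
        (BPF_RET | BPF_A, 0, 0, 0),
    ]
    return b"".join(struct.pack("=HBBI", *i) for i in insns)


def attach_cpu_filter(sock: socket.socket) -> None:
    """Attach the filter to one member of an existing SO_REUSEPORT group."""
    prog = build_cpu_filter()
    buf = ctypes.create_string_buffer(prog, len(prog))
    # struct sock_fprog { unsigned short len; struct sock_filter *filter; }
    fprog = struct.pack("@HQ", len(prog) // 8, ctypes.addressof(buf))
    sock.setsockopt(socket.SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF, fprog)
```

In a real server each worker would create its own SO_REUSEPORT listener pinned to a CPU, and one worker would attach this filter once the group exists.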

More importantly it is about the idea of using tools like Flamegraphs (or other profiling tools) to identify and eliminate your bottlenecks. It is also just fun to experiment and share the results (and the CloudFormation template). Plus it establishes a high water mark for what is possible, which also makes it useful for future experiments. At some point I would like to do a modified version of this that includes DB queries.

[+] 101008|4 years ago|reply
Yes, my (admittedly limited) experience is that what makes YouTube or Google or any of those products really impressive is the speed.

YouTube or Google Search suggestion is good, and I think it could be replicated with that amount of data. What is insane is the speed; I can't think how they do it. I am doing something similar for the company I work for and it takes seconds (and the amount of data isn't that much), so I can't wrap my head around it.

The point is that doing only speed is not _that_ complicated, and doing some algorithms alone is not _that_ complicated. What is really hard is to do both.

[+] mkoubaa|4 years ago|reply
Speaking of which, I wonder if anyone has done this to the Linux kernel for a variant that's tuned only for HTTP.
[+] brendangregg|4 years ago|reply
Great work, thanks for sharing! Systems performance at its best. Nice to see the use of the custom palette.map (I forget to do that myself and I often end up hacking in highlights in the Perl code.)

BTW, those disconnected kernel stacks can probably be reconnected with the user stacks by switching out the libc for one with frame pointers; e.g., the new libc6-prof package.

[+] bigredhdl|4 years ago|reply
I really like the "Optimizations That Didn't Work" section. This type of information should be shared more often.
[+] jart|4 years ago|reply
> Disabling [spectre] mitigations gives us a performance boost of around 28%

Every couple months these last several years there always seems to be some bug where the fix only costs us 3% performance. Since those tiny performance hits add up over time, security is sort of like inflation in the compute economy. What I want to know is how high can we make that 28% go? The author could likely build a custom kernel that turns off stuff like pie, aslr, retpoline, etc. which would likely yield another 10%. Can anyone think of anything else?

[+] ronsor|4 years ago|reply
Most of these mitigations are worse than useless in an environment not executing untrusted code. Simply put, if you have a dedicated server and you aren't running user code, you don't need them.
[+] seoaeu|4 years ago|reply
The puzzling thing was that Spectre v2 mitigations were cited as the main culprit. They were responsible by themselves for a 15-20% slowdown, which is about an order of magnitude worse than in my experience. I wonder if the system had IBRS enabled instead of using retpolines as the mitigation strategy?
[+] imhoguy|4 years ago|reply
I am not that deep into SecOps these days and would gladly hear the opinion of an expert:

Can disabling these mitigations bring any risks assuming the server is sending static content to the Internet over port 80/443 and it is practically stateless with read-only file system?

[+] astrange|4 years ago|reply
PIE and ASLR are free on x86-64, unless someone has a bad ABI I don't know of. Spectre mitigations are also free or not needed on new enough hardware.

Many security changes also help you find memory corruption bugs, which is good for developer productivity.

[+] jiggawatts|4 years ago|reply
Does anyone know of a quick & easy PowerShell script I can run on Windows servers to disable Spectre mitigations?

The last time I looked I found a lot of waffle but no simple way I can just turn that stuff off...

[+] volta83|4 years ago|reply
I'm missing one thing from the article, that is commonly missing from performance-related articles.

When you talk about playing whack-a-mole with the optimizations, this is what you are missing:

> What's the best the hardware can do?

You don't say in the article. The article only says that you start at 250k req/s and end at 1.2M req/s.

Is that good? Is your optimization work done? Can you open a beer and celebrate?

The article doesn't say.

If the best the hardware can technically do is 1.3M req/s, then you probably can call it a day.

But if the best the hardware can do is technically 100M req/s, then you just went from very very bad (0.25% of hardware peak) to just very bad (1.2% of hardware peak).

Knowing how many reqs per second should the hardware be able to do is the only way to put things in perspective here.

[+] talawahtech|4 years ago|reply
The answer to that question is not quite as straight-forward as you might think. In many ways, this experiment/post is about figuring out the answer to the question of "what is the best the hardware can do".

I originally started running these tests using the c5.xlarge (not c5n.xlarge) instance type, which is capable of a maximum 1M packets per second. That is an artificial limit set by AWS at the network hardware level. Now mind you, it is not an arbitrary limit, I am sure they used several factors to decide what limits make the most sense based on the instance size, customer use cases, and overall network health. If I had to hazard a guess I would say that 99% of AWS customers don't even begin to approach that limit, and those that do are probably doing high speed routing and/or using UDP.

Virtually no-one would have been hitting 1M req/s with 4 vCPUs doing synchronous HTTP request/response over TCP. Those that did would have been using a kernel bypass solution like DPDK. So this blog post is actually about trying to find "the limit", which is in quotes because it is qualified with multiple conditions: (1) TCP (2) request/response (3) Standard kernel TCP/IP stack.

While working on the post, I actively tried to find a network performance testing tool that would let me determine the upper limit for this TCP request/response use case. I looked at netperf, sockperf and uperf (iPerf doesn't do req/resp). For the TCP request/response case they were *all slower* than wrk+libreactor. So it was up to me to find the limit.

When I realized that I might hit the 1M req/s limit I switched to the c5n.xlarge whose hardware limit is 1.8M pps. Again, this is just a limit set by AWS.
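As a back-of-envelope check of how req/s maps onto that pps cap (assumed numbers, not from the article): if ACKs are fully piggybacked, a synchronous HTTP transaction costs roughly one packet each way, and with occasional ACK-only packets it's closer to 1.5 per direction, which puts 1.2M req/s right against a 1.8M pps budget:

```python
def pps_required(req_per_sec: float, pkts_per_txn: float) -> float:
    """Packets/sec needed in one direction for a given request rate."""
    return req_per_sec * pkts_per_txn


# Assumed packet counts: ~1 packet per direction per transaction in the
# best case (ACKs piggybacked), ~1.5 with some ACK-only packets mixed in.
best_case = pps_required(1.2e6, 1.0)  # 1.2M pps
with_acks = pps_required(1.2e6, 1.5)  # 1.8M pps, right at the cap
print(best_case, with_acks)
```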

Future tests using a Graviton2 instance + io_uring + recompiling the kernel using profile-guided optimizations might allow us to push past the 1.8M pps limit. Future instances from AWS may just raise the pps limit again...

Either way, it should be fun to find out.

[+] slver|4 years ago|reply
TCP is not typically a hardware feature, so how would you know exactly?

Maybe you wanna write a dedicated OS for it? Interesting project but I can’t blame them for not doing it.

[+] londons_explore|4 years ago|reply
Some of these things could be fixed upstream and everyone see real perf gains...

For example, having dhclient (a very popular dhcp client) leave open an AF_PACKET socket causing a 3% slowdown in incoming packet processing for all network packets seems... suboptimal!

Surely it can be patched to not cause a systemwide 3% slowdown (or at least to only do it very briefly while actively refreshing the DHCP lease)?
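A quick way to check whether anything (like dhclient) is holding an AF_PACKET socket open is to look at /proc/net/packet, which lists one row per open packet socket under a header line. A small sketch that just counts the entries:

```python
def packet_socket_count(text: str) -> int:
    """Count AF_PACKET sockets given the contents of /proc/net/packet.

    The first line is a column header; every following non-empty line
    is one open packet socket.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return max(len(lines) - 1, 0)


def open_packet_sockets(path="/proc/net/packet") -> int:
    """Read the live count, or 0 where procfs is unavailable."""
    try:
        with open(path) as f:
            return packet_socket_count(f.read())
    except OSError:
        return 0
```

If this returns non-zero on a tuned box, `ss --packet --processes` (as root) can identify which process owns the socket.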

[+] talawahtech|4 years ago|reply
I would also love to see that dhclient issue resolved upstream, or at least a cleaner way to work around it. But we should also be mindful that for most workloads the impact is probably way, way less.

Some of these things really only show up when you push things to their extremes, so it probably just wasn't on the developer's radar before.

[+] zokier|4 years ago|reply
Specifically on EC2, I don't think you actually need to keep a DHCP client running anyway; afaik EC2 instance IPs are static, so you can just keep using the one you got on boot.
[+] paracyst|4 years ago|reply
I don't have anything to add to the conversation other than to say that this is fantastic technical writing (and content too). Most of the time, when similar articles like this one are posted to company blogs, they bore me to tears and I can't finish them, but this is very engaging and informative. Cheers
[+] talawahtech|4 years ago|reply
Thanks, that actually means a lot. It took a lot of work, not just on the server/code, but also the writing. I asked a lot of people to review it (some multiple times) and made a ton of changes/edits over the last couple months.

Thanks again to my reviewers!

[+] drenvuk|4 years ago|reply
I'm of two minds with regard to this: it is cool, but unless you have no authentication and no data to fetch remotely or on disk, this is really just telling you what the ceiling is for anything you could possibly run.

As for this article, there are so many knobs that you tweaked to get this to run faster it's incredibly informative. Thank you for sharing.

[+] joshka|4 years ago|reply
> this is really just telling you what the ceiling is

That's a useful piece of info to know when performance tuning a real world app with auth / data / etc.

[+] strawberrysauce|4 years ago|reply
Your website is super snappy. I see that it has a perfect lighthouse score too. Can you explain the stack you used and how you set it up?
[+] talawahtech|4 years ago|reply
It is a statically generated site created with vitepress[1] and hosted on Cloudflare Pages[2]. The only dynamic functionality is the contact form which sends a JSON request to a Cloudflare Worker[3], which in turn dispatches the message to me via SNS[4].

It is modeled off of the code used to generate Vue blog[5], but I made a ton of little modifications, including some changes directly to vitepress.

Keep in mind that vitepress is very much an early work in progress, and the blog functionality is just kinda tacked on; the default use case is documentation. It also definitely has bugs and is under heavy development, so I wouldn't recommend it quite yet unless you are actually interested in getting your hands dirty with Vue 3. I am glad I used it because it gave me an excuse to start learning Vue, but unless you are just using the default theme to create a documentation site, it will require some work.

1. https://vitepress.vuejs.org/

2. https://pages.cloudflare.com/

3. https://workers.cloudflare.com/

4. https://aws.amazon.com/sns/

5. https://github.com/vuejs/blog

[+] SaveTheRbtz|4 years ago|reply
The analysis itself is quite impressive: a very systematic top-down approach. We need more people doing stuff like this!

But! Be careful applying tunables from the article "as-is"[1]: some of them would destroy TCP performance:

  net.ipv4.tcp_sack=0
  net.ipv4.tcp_dsack=0
  net.ipv4.tcp_timestamps=0
  net.ipv4.tcp_moderate_rcvbuf=0
  net.ipv4.tcp_congestion_control=reno
  net.core.default_qdisc=noqueue
Not to mention that `gro off` will bump CPU usage by ~10-20% on most real-world workloads, your security team would be really against turning off mitigations, and usage of `-march=native` will cause a lot of core dumps in heterogeneous production environments.

[1] This is usually the case with single-purpose micro-benchmarks: most of the tunables have side effects that may not be captured by a single workload. Always verify how the "tunings" you find on the internet behave in your environment.
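Seconded; before applying tunables like these wholesale, it's worth snapshotting the current values so they can be restored and compared. A minimal sketch that reads /proc/sys directly instead of shelling out to sysctl:

```python
from pathlib import Path


def sysctl_path(key: str) -> Path:
    """Map a sysctl key like 'net.ipv4.tcp_sack' to its /proc/sys file."""
    return Path("/proc/sys") / key.replace(".", "/")


def snapshot(keys):
    """Read current values for the given sysctl keys, skipping missing ones."""
    out = {}
    for key in keys:
        p = sysctl_path(key)
        if p.is_file():
            out[key] = p.read_text().strip()
    return out


if __name__ == "__main__":
    print(snapshot(["net.ipv4.tcp_sack", "net.ipv4.tcp_timestamps"]))
```

The resulting dict can be written to disk and replayed later (echoing each value back into its /proc/sys file) to undo an experiment.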

[+] throwdbaaway|4 years ago|reply
> EC2 X-factor?

> Even after taking all the steps above, I still regularly saw a 5-10% variance in performance across two seemingly identical EC2 server instances

> To work around this variance, I tried to use the same instance consistently across all benchmark runs. If I had to redo a test, I painstakingly stopped/started my server instance until I got an instance that matched the established performance of previous runs.

We noticed similar performance variance when running benchmarks on GCP and Azure. In the worst case, there can be a 20% variance on GCP. On Azure, the variance between identical instances is not as bad, perhaps about 10%, but there is an extra 5% variance between normal hours and off-peak hours, which further complicates things.

It can be very frustrating to stop/start hundreds of times over hours to get back an instance with the same performance characteristics. For now, I use a simple bash for-loop that checks the "CPU MHz" value from lscpu output, and that seems to be reliable enough.
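That loop can be sketched in a few lines; here is a hypothetical version of the same check, parsing the "CPU MHz" line out of lscpu output and comparing against an assumed threshold:

```python
import re
import subprocess


def parse_cpu_mhz(lscpu_output: str):
    """Extract the 'CPU MHz' value from lscpu output, or None if absent."""
    m = re.search(r"^CPU MHz:\s*([0-9.]+)", lscpu_output, re.MULTILINE)
    return float(m.group(1)) if m else None


def instance_fast_enough(min_mhz: float) -> bool:
    """Run lscpu and compare against an expected floor (assumed threshold).

    Note: some lscpu versions omit the 'CPU MHz' line, in which case this
    conservatively returns False.
    """
    out = subprocess.run(["lscpu"], capture_output=True, text=True).stdout
    mhz = parse_cpu_mhz(out)
    return mhz is not None and mhz >= min_mhz
```

The outer stop/start loop would call `instance_fast_enough(...)` after each boot and keep cycling the instance until it passes.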

[+] Matumio|4 years ago|reply
On AWS you can rent ".metal" instances which are probably more stable for benchmarking. I tried this once for fun on a1.metal because I wanted access to all hardware performance counters. For that it worked. My computation was also running slightly faster (something around 5% IIRC). But of course you'll have to pay for all its cores and memory while you use it.
[+] jiggawatts|4 years ago|reply
Why would you expect two different virtual machines to have identical performance?

I would expect that just the cache usage characteristics of "neighbouring" workloads alone would account for at least a 10% variance! Not to mention system bus usage, page table entry churn, etc, etc...

If you need more than 5% accuracy for a benchmark, you absolutely have to use dedicated hosts. Even then, just the temperature of the room would have an effect if you leave Turbo Boost enabled! Not to mention the "silicon lottery" that all overclockers are familiar with...

This feels like those engineering classes where we had to calculate stresses in every truss of a bridge to seven figures, and then multiply by ten for safety.

[+] habibur|4 years ago|reply
That can be done with HTTP. But right now it's all HTTPS, especially when you are serving APIs over the Internet.

And once I switch to HTTPS I see a dramatic drop in throughput, something like 10x.

An HTTP endpoint doing 15k req/sec drops down to 400 req/sec once I start serving it over HTTPS.

I see no solution to it, as everything has to be HTTPS now.

[+] astrange|4 years ago|reply
HTTPS, especially with TLS 1.3, is not slow; x86 has had AES acceleration since 2010.

It might need different tuning, or you might be negotiating a slow cipher.
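One quick sanity check is to pin a client to TLS 1.3 and look at what actually gets negotiated; modern AEAD suites (AES-GCM with AES-NI, or ChaCha20-Poly1305) should come nowhere near a 10x slowdown. A sketch using the stdlib ssl module (the hostname is a placeholder):

```python
import socket
import ssl


def tls13_context() -> ssl.SSLContext:
    """Client context that refuses anything older than TLS 1.3."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx


def negotiated_cipher(host: str, port: int = 443):
    """Connect and return (cipher_name, protocol_version, secret_bits)."""
    with socket.create_connection((host, port), timeout=5) as raw:
        with tls13_context().wrap_socket(raw, server_hostname=host) as tls:
            return tls.cipher()


# Usage (requires network): negotiated_cipher("example.com")
```

If the server refuses the TLS 1.3-only handshake, or the reported cipher is something old and slow, that points at server-side configuration rather than TLS itself.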

[+] injinj|4 years ago|reply
Great work, thanks!

I'm curious whether disabling the slow kernel network features competes with a TCP bypass stack. I did my own wrk benchmark [0], but I did not try to optimize the kernel stack beyond pinning CPUs and busy-polling, because the bypass was about 6 times as fast. I assumed there was no way the kernel stack could compete with that. This article shows that I may be wrong. I will definitely check out SO_ATTACH_REUSEPORT_CBPF in the future.

[0] https://github.com/raitechnology/raids/#using-wrk-httpd-load...

[+] jeffbee|4 years ago|reply
Very nice round-up of techniques. I'd throw out a few that might or might not be worth trying: 1) I always disable C-states deeper than C1E. Waking from C6 takes upwards of 100 microseconds, way too much for a latency-sensitive service, and it doesn't save you any money when you are running on EC2; 2) Try receive flow steering for a possible boost above and beyond what you get from RSS.

Would also be interesting to discuss the impacts of turning off the xmit queue discipline. fq is designed to reduce frame drops at the switch level. Transmitting as fast as possible can cause frame drops which will totally erase all your other tuning work.

[+] talawahtech|4 years ago|reply
Thanks!

> I always disable C-states deeper than C1E

AWS doesn't let you mess with c-states for instances smaller than a c5.9xlarge[1]. I did actually test it out on a 9xlarge just for kicks, but it didn't make a difference. Once this test starts, all CPUs are 99+% Busy for the duration of the test. I think it would factor in more if there were lots of CPUs, and some were idle during the test.

> Try receive flow steering for a possible boost

I think the stuff I do in the "perfect locality" section[2] (particularly SO_ATTACH_REUSEPORT_CBPF) achieves what receive flow steering would be trying to do, but more efficiently.

> Would also be interesting to discuss the impacts of turning off the xmit queue discipline

Yea, noqueue would definitely be a no-go on a constrained network, but when running the (t)wrk benchmark in the cluster placement group I didn't see any evidence of packet drops or retransmits. Drops only happened with the iperf test.

1. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo...

2. https://talawah.io/blog/extreme-http-performance-tuning-one-...

[+] duskwuff|4 years ago|reply
Does C-state tuning even do anything on EC2? My intuition says it probably doesn't pass through to the underlying hardware -- once the VM exits, it's up to the host OS what power state the CPU goes into.
[+] xtacy|4 years ago|reply
I suspect that the web server's CPU usage will be pretty high (almost 100%), so C-state tuning may not matter as much?

EDIT: also, RSS happens on the NIC. RFS happens in the kernel, so it might not be as effective. For a uniform request workload like the one in the article, statically binding flows to a NIC queue should be sufficient. :)

[+] ArtWomb|4 years ago|reply
Wow. Such impressive bpftrace skill! Keeping this article under my pillow ;)

Wonder where the next optimization path leads? Using huge memory pages. io_uring, which was briefly mentioned. Or kernel bypass, which is supported on c5n instances as of late...

[+] diroussel|4 years ago|reply
Did you consider wrk2?

https://github.com/giltene/wrk2

Maybe you duplicated some of these fixes?

[+] talawahtech|4 years ago|reply
Yea, I looked at wrk2, but it was a no-go right out of the gate. From what I recall, the changes to handle coordinated omission use a timer that has a 1ms resolution. So basically things broke immediately because all requests were under 1ms.
[+] ikoveshnik|4 years ago|reply
I really like that wrk2 lets you configure a fixed request rate; latency measurement works much better in that case. But wrk2 itself has bugs that prevent using it in more complicated cases, e.g. Lua scripts don't work properly.
[+] baybal2|4 years ago|reply
Take note: no quick cheat like DPDK was used.

This shows you can make a regular Linux program using the Linux network stack approach something handcoded with DPDK.

[+] zdw|4 years ago|reply
I wonder what the results would be if all the optimizations were applied except for the security-related mitigations, which were left enabled.
[+] Adiqq|4 years ago|reply
Can anyone recommend similar articles/blogs that focus on optimization of networking/computing in Linux/cloud environments? These kinds of articles are very informative, because they cover advanced mechanisms that I either haven't heard about or have never seen in practical use.