I have a hobby project whose goal followed a similar learning path. My one recommendation, if you also work on your own server: don't forget the software side. perf (http://brendangregg.com/perf.html) is a god, not just for the kernel side but for your own software as well. As part of my build I always checked the command below:
I get that it was a hobby project so you could just be doing these optimizations for the heck of it. But if you do have measurements of how much each of these factors contributed (especially the two points about custom kernels), it would be useful.
Please don’t take offense at this my friend, it is genuinely constructive criticism. Slow down a little bit and re-read what you’re typing. I can’t understand half of what you’ve written here because it is so poorly done. It is a shame, because I feel like you’re trying to share interesting information; it’s just extremely hard to parse whatever it is you’re trying to say.
I know it's not exactly the same kind of concern as presented here, but I have recently found that one technique for achieving extremely precise timing of execution is to just sacrifice an entire high priority thread to a busy wait loop that checks timing conditions as fast as the CPU will cycle instructions. This has the most obvious advantage of being trivial to implement, even in high level languages that only expose the most basic of threading primitives.
Thinking strategically about this approach, modern server CPUs expose upwards of 64/128 threads, so if 1 of these had to be sacrificed completely to the gods of time, you are only looking at 1-2% of your overall resources spent for this objective. Then, you could reuse this timing service for sequencing work against the other 98-99% of resources. Going back just a few years, throwing away 12/25/50% of your compute resources for the sake of precise timing would have been a non-starter.
For reference, I find that this achieves timing errors measured in the 100-1000 nanoseconds range in my .NET Core projects when checking a trivial # of events. I have not bothered to optimize for large # of events yet, but I believe this will just be a pre-processing concern with an ordered queue of future events. I have found that this is precise enough timing to avoid the sort of logic you would otherwise need to use to calculate for time error and compensate on future iterations (e.g. in non-deterministic physics simulations or frame timing loops).
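A minimal sketch of that busy-wait timing loop with an ordered queue of future events might look like the following. It is written in Python rather than .NET, so the timing error will be far larger than the 100-1000 ns quoted above; only the structure is the point. The names `SpinTimer`, `schedule` and `run` are made up for illustration.

```python
import heapq
import time


class SpinTimer:
    """Fires callbacks at absolute deadlines by spinning on a monotonic clock."""

    def __init__(self):
        self._events = []  # min-heap of (deadline_ns, seq, callback)
        self._seq = 0      # tie-breaker so callbacks are never compared

    def schedule(self, delay_ns, callback):
        """Queue callback to fire delay_ns nanoseconds from now."""
        deadline = time.perf_counter_ns() + delay_ns
        heapq.heappush(self._events, (deadline, self._seq, callback))
        self._seq += 1

    def run(self):
        """Spin until every queued event has fired; return lateness (ns) per event."""
        errors = []
        while self._events:
            deadline = self._events[0][0]
            # Busy wait: no sleep and no yield, just re-read the clock.
            while True:
                now = time.perf_counter_ns()
                if now >= deadline:
                    break
            _, _, callback = heapq.heappop(self._events)
            errors.append(now - deadline)
            callback()
        return errors


if __name__ == "__main__":
    timer = SpinTimer()
    timer.schedule(2_000_000, lambda: print("second"))
    timer.schedule(1_000_000, lambda: print("first"))
    print("lateness in ns:", timer.run())
```

The thread calling `run()` burns a full core while it waits, which is exactly the trade described above: one hardware thread given up in exchange for not having to compensate for timer error elsewhere.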
Yes, definitely turn off HT/SMT and use a single app thread per core with busy waiting. I'm working on a low latency application design guide exploring this more in depth.
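As a rough illustration of "one app thread per core", a process can pin itself to a single CPU before it enters its busy-wait loop. This sketch assumes Linux, since `os.sched_getaffinity` and `os.sched_setaffinity` are only available there; the helper name `pin_to_one_cpu` is hypothetical.

```python
import os


def pin_to_one_cpu(preferred=None):
    """Pin the caller to one CPU from its allowed set and return that CPU number."""
    allowed = sorted(os.sched_getaffinity(0))  # CPUs we are currently allowed to run on
    cpu = preferred if preferred in allowed else allowed[0]
    os.sched_setaffinity(0, {cpu})             # pid 0 means "the caller"
    return cpu


if __name__ == "__main__":
    cpu = pin_to_one_cpu()
    print("pinned to CPU", cpu, "affinity is now", os.sched_getaffinity(0))
```

Pinning only tells the scheduler where this process may run. It does not keep other tasks off that core, so it is a complement to the isolation settings discussed in the thread, not a replacement for them.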
On application side I recommend using an instrumenting profiler that will let you know down to sub-microseconds what the code is doing. Tracy is a good choice (https://github.com/wolfpld/tracy) but there are others, e.g. Telemetry.
So, I just spent 2 hours checking this (tracy) out and I must say I am impressed. Here's a good video [1] from two years ago that shows its capabilities, and it's had 5 releases since then (check his YouTube channel for more recent vids of new features added).
[1] https://www.youtube.com/watch?v=fB5B46lbapc
ps I once had a different username, but don't login often and forgot my pwd (with no email addr on file) :( I've been on HN for years. Not affiliated in any way with the tracy project or its author.
I can't tell what Tracy does because the documentation is so poor. Check out XRay for an older but still actively developed function tracing tool that generates traces which can be viewed in the Chrome trace viewer.
For lowest latency applications I would avoid using RT priorities. It is better to run each core at 100% with busy waiting, and if you do that with RT priority you can prevent the kernel from running tasks such as vmstat, leading to lockup issues. Out of the box there is currently no way to 100% isolate cores in Linux. There is some ongoing work on that: https://lwn.net/Articles/816298/
One that seems rather important but missing is NIC receive coalesce. The feature delays frames to reduce the number of interrupts, thus increasing latency. Usually you want to turn this down as far as possible, but don’t set it to “1” because many NICs interpret that setting to mean “use black-box adaptive algorithm” and you don’t want that either.
It’s also quite helpful to run your Rx soft interrupts on the core that’s receiving the packet, but flow steering isn’t mentioned.
For the truly lowest latency in a software application you need to avoid all context switches. Interrupt-driven IO adds too much overhead. You need to use polling and busy waiting. I'm working on a guide for this type of application design. If you are indeed using the Linux network stack, then yes, adjusting NIC interrupt coalescing and interrupt affinity is useful.
This is a cool article in the sense that it gives an idea of the tuning that can be done at one extreme. While most applications won't need this level of tuning, and some of it might even hurt if the application isn't CPU bound, it is great to know which options exist.
Does anyone have further articles exploring the topic of os tuning for various types of applications? Maybe also for other OS, BSD/Win?
Using older microcode is just a matter of preventing your OS from uploading newer microcode during the boot process, and not updating the motherboard firmware to a newer version that bundles newer CPU microcode. Rolling back the motherboard firmware is usually not a supported option, and sometimes is actively prohibited by the system.
> Don’t create any file backed writable memory mappings
If you have to create a writable file-backed memory mapping, open it in /dev/shm or /dev/hugepages. You can mount your own hugetlbfs volume with the right permissions.
Creating a big-ass single-writer ring buffer in a series of mapped hugetlb pages is a great way to give downstream code some breathing room. You can have multiple downstream programs, each on its own core, map it read-only, and start and stop them independently. Maintain a map of checkpoints farther back in the ring buffer, and they can pick up again, somewhere back, without missing anything.
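A toy version of such a single-writer ring buffer with resumable readers could look like this. It uses an anonymous `mmap` as a stand-in for a file mapped from /dev/shm or a hugetlbfs mount, has no memory barriers, and runs in a single process, so it only sketches the bookkeeping. The names `Ring`, `publish`, `read` and `resume_point` are invented for the example.

```python
import mmap
import struct

HEADER = struct.Struct("<Q")   # count of records published so far
RECORD = struct.Struct("<Qd")  # (sequence number, payload value)


class Ring:
    """Fixed-capacity ring of records addressed by an ever-increasing sequence number."""

    def __init__(self, buf, capacity):
        self.buf = buf
        self.capacity = capacity

    @staticmethod
    def size_for(capacity):
        return HEADER.size + capacity * RECORD.size

    def published(self):
        return HEADER.unpack_from(self.buf, 0)[0]

    def _offset(self, seq):
        return HEADER.size + (seq % self.capacity) * RECORD.size

    def publish(self, value):
        """Writer side (exactly one writer): store the record, then bump the counter."""
        seq = self.published()
        RECORD.pack_into(self.buf, self._offset(seq), seq, value)
        HEADER.pack_into(self.buf, 0, seq + 1)
        return seq

    def oldest_available(self):
        return max(0, self.published() - self.capacity)

    def resume_point(self, checkpoint):
        """Where a restarted reader can pick up: its checkpoint if it still exists."""
        return max(checkpoint, self.oldest_available())

    def read(self, seq):
        """Return the value at seq, None if not yet published; raise if overwritten."""
        if seq >= self.published():
            return None
        stored_seq, value = RECORD.unpack_from(self.buf, self._offset(seq))
        if stored_seq != seq:
            raise LookupError(f"record {seq} was overwritten")
        return value


if __name__ == "__main__":
    ring = Ring(mmap.mmap(-1, Ring.size_for(4)), capacity=4)
    for i in range(6):
        ring.publish(float(i))
    start = ring.resume_point(checkpoint=3)
    print([ring.read(s) for s in range(start, ring.published())])
```

In a real multi-process setup each reader would open the same backing file and map it with `access=mmap.ACCESS_READ`, keeping its own cursor. The sequence number stored in every slot is what lets a slow or restarted reader detect that it has been lapped.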
I don't consider that file backed, since those mappings pull memory from the same pool as anonymous memory and not from the page cache. The Linux kernel docs make the distinction between file-backed and anonymous memory. I think better terms would probably be page-cache-backed memory vs anonymous memory.
Would love to see a version of this but for ARM64 .. I'm assuming a few of these tips will be applicable, but I bet there's some ARM-specific things to be learned.
This is a good list, but it seems to blur latency and jitter. For example, turbo modes can cause significant variability, threads running on other cores can cause your core to downclock, etc.
Reducing kernel scheduler interrupt rate can cause strange delay effects (presumably due to locking). Running it faster, but only on some cores, has been more beneficial IME. Depends on the latency vs jitter requirement of your computation I guess. If you're using SCHED_FIFO there is a complicated interaction with ticking being sometimes enabled (while theoretically tickless) at 1kHz to let kernel threads run...
Multithreaded apps should consider cache contention and CPU/memory placement. This does not always mean placing all threads on the same socket, because you might need to get max memory bandwidth. Cf. lstopo, numad/numactl, set_mempolicy(2). Making sure no other process can thrash the L3 or use memory bandwidth on your real time socket can also help. Note that numad(8) does page migrations, so it can cause jitter when that happens, but also reduce jitter for steady state.
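To make the latency versus jitter distinction concrete, here is a small sketch that summarizes a set of latency samples. It treats the median as "latency" and the max-min spread as "jitter"; that is one possible definition chosen for illustration, not something the comment above specifies, and `summarize` is a hypothetical helper.

```python
import statistics


def summarize(samples_ns):
    """Summarize latency samples: typical value, tail, and spread."""
    ordered = sorted(samples_ns)
    p99_index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return {
        "median_ns": statistics.median(ordered),
        "p99_ns": ordered[p99_index],
        "max_ns": ordered[-1],
        "jitter_ns": ordered[-1] - ordered[0],  # spread between best and worst case
    }


if __name__ == "__main__":
    steady = [1000] * 100
    spiky = [1000] * 98 + [900, 9000]  # same typical latency, one large outlier
    print(summarize(steady))
    print(summarize(spiky))
```

Both sample sets report the same median, but only the second has a large spread, which is the kind of difference a tuning change such as disabling turbo modes would be expected to affect.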
With the right cooling setup I've been able to get Xeons to run permanently in turbo mode, kind of a back door overclock. You would have to experiment.
For lowest latency applications I would avoid using RT priorities. It is better to run each core at 100% with busy waiting, and if you do that with RT priority you can prevent the kernel from running tasks such as vmstat, leading to lockup issues. Out of the box there is currently no way to 100% isolate cores in Linux. There is some ongoing work on that: https://lwn.net/Articles/816298/
> The term latency in this context refers to ... The time between a request was submitted to a queue and the worker thread finished processing the request.
Since this is a tuning guide, I would have liked to see a separation of 3 attributes:
* Service-time: The actual time taken by the thread/process once it begins processing a request.
* Latency: The time spent by the request waiting to get processed (ex: languishing in a queue on the client/server side). This is when the request was latent.
* Response time: A combination of Service-time + Latency as recorded by the server. From a client's POV, this would additionally include the overhead from the network media etc.
Most performance models seem to isolate these separately to get a deeper sense of where the bottlenecks are. When there's just a single queue for everything then it makes sense to make the service-time as short as possible. But if you have multiple workload-based queues then you can do more interesting things.
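The three attributes listed above fall out of three timestamps per request. A minimal sketch, with a hypothetical `RequestTiming` record and made-up numbers:

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Three timestamps (ns) taken on the server for one request."""

    enqueued_ns: int  # request was put on the queue
    started_ns: int   # worker thread picked it up
    finished_ns: int  # worker thread finished processing

    @property
    def latency_ns(self):
        """Time spent waiting in the queue (the request was 'latent')."""
        return self.started_ns - self.enqueued_ns

    @property
    def service_ns(self):
        """Time the worker actually spent on the request."""
        return self.finished_ns - self.started_ns

    @property
    def response_ns(self):
        """Service time plus latency, as seen by the server."""
        return self.finished_ns - self.enqueued_ns


if __name__ == "__main__":
    t = RequestTiming(enqueued_ns=1_000, started_ns=4_000, finished_ns=4_500)
    print(t.latency_ns, t.service_ns, t.response_ns)
```

In this example the request spends far longer queued than being serviced, so shortening the service time further would barely move the response time. That is the kind of bottleneck the separation is meant to expose.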
This guide pretty much tells you how to make the Linux kernel interfere as little as possible with your application. How to instrument and what to measure would depend on the application.
I agree that measuring queuing delay and processing delay separately makes sense.
This is a great point. For the purposes of queuing theory analysis, some separate out latency from response time in which case response time is just service time + queue time, and latency is transit time before arriving at the queue.
threadirqs cmdline option might also make a difference:
threadirqs Force threading of all interrupt handlers except those marked explicitly IRQF_NO_THREAD.
It helped with my Bluetooth issues and it's recommended for low-latency audio setups, but unfortunately I lack the knowledge about the tradeoffs. You also probably need to assign a higher priority to the threads: https://alsa.opensrc.org/Rtirq - not sure if it's applicable besides audio.
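To check whether a machine was actually booted with `threadirqs`, the kernel command line can be read back from /proc/cmdline on Linux. A small sketch with hypothetical helper names; it splits on whitespace only, so quoted parameter values containing spaces are not handled.

```python
def kernel_params(cmdline):
    """Parse a kernel command line into {name: value}, with True for bare flags."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with (Linux only)."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    params = kernel_params(read_cmdline())
    print("threadirqs enabled:", params.get("threadirqs", False))
```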
You might also want to look into or write about DPDK, which achieves further speed-ups using polling mode drivers (instead of interrupt) and having the application directly process packets from the NIC (bypassing the kernel, which can be a bottleneck).
>"Hyper-threading (HT) or Simultaneous multithreading (SMT) is a technology to maximize processor resource usage for workloads with low instructions per cycle (IPC)"
I had actually not heard this before regarding SMT. What is the significance of low IPC type in regards to the design of SMT? How does one determine if their workload is a low IPC workload?
HT threads share most of the execution units in the core. If your workload stalls a lot due to branch misprediction or memory access (low IPC), these units can be shared effectively. The Linux perf tool can be used to check IPC.
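As a worked example, IPC and the two stall indicators mentioned above can be derived from raw event counts such as those reported by `perf stat`. The counts below are invented for illustration, and `derived_metrics` is a hypothetical helper; what counts as "low" IPC depends on the CPU, so compare runs on the same machine.

```python
def derived_metrics(counts):
    """Compute IPC and miss rates from raw event counts keyed by perf event name."""
    return {
        "ipc": counts["instructions"] / counts["cycles"],
        "branch_miss_rate": counts["branch-misses"] / counts["branches"],
        "cache_miss_rate": counts["cache-misses"] / counts["cache-references"],
    }


if __name__ == "__main__":
    # Made-up counts, not measurements from any real workload.
    sample = {
        "instructions": 1_200_000_000,
        "cycles": 2_000_000_000,
        "branches": 200_000_000,
        "branch-misses": 10_000_000,
        "cache-references": 50_000_000,
        "cache-misses": 20_000_000,
    }
    for name, value in derived_metrics(sample).items():
        print(f"{name}: {value:.2f}")
```

When both `cycles` and `instructions` are counted in the same run, `perf stat` also prints an instructions-per-cycle figure itself, so the arithmetic here is mainly useful for comparing saved counts across runs.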
Many organisations run water cooled overclocked servers in production. I have not yet heard of any production use of sub-ambient cooling, but that would be awesome!
Applications for which you can be sure that no other applications are running on the same box, either:
1. Because it's completely airgapped from the bigger internet and you control everything on it. Think complex embedded systems like radar HW control on military ships. The ships I was sailing on had 6 full racks per radar just for things like track maintenance and missile up/downlink scheduling. At some complexity level it becomes worth it to "lift" apps out of the embedded domain and make use of the facilities a bigger OS like Linux provides, but you often still have fairly tight realtime and performance requirements. A HFT server only connected to an exchange server could also count.
2. You have adequate security measures in other parts of your setup that, after carefully evaluating the risks, you decide to forego defense-in-depth in this part of the system.
There are not all that many fields where this type of microsecond chasing is all that worthwhile though. There are significant costs and risks involved, and most web users won't ever notice a page load increase of a few microseconds. There are way more cost-effective performance improvements available for 99+% of companies out there than CPU pinning and IRQ isolation.
For the majority of applications that follow this guide (or need to), the OS mitigations don't matter anyways:
1. They're running only trusted code.
2. L1 cache attacks aren't relevant if there's only one thread ever running on a given core.
3. Kernel-bypass networking means there are no system calls in the hot path anyways, so the OS mitigations won't even run in the first place.
If you're already doing all this it may be easier/better to look at using FPGAs instead. The advantage of the approach in this guide, by contrast, is that you don't need to procure a card with enough LUTs to house your design, and it allows the Ops team to contribute to performance.
Can someone ELI5 why you'd want to do line rate packet capture in a low-latency way? Wouldn't you risk losing packets because you are trading off processing capacity for latency?
Measuring network delay, or just doing kernel bypass processing. Generally you know you won't lose packets because the rate is much lower than what you can process.
[+] [-] hrgiger|5 years ago|reply
'perf stat -e task-clock,cycles,instructions,cache-references,cache-misses,branches,branch-misses,faults,minor-faults,cs,migrations -r 3 nice taskset 0x01 ./myApplication -j XXX '
Additions that I have benefited from:
* I use the latest kernel, trimmed down: no encryption, no extra device drivers, etc.
* You might want to check out RT kernels; the finance and trading folks are always a good guide if you can find their material.
* I removed all the boilerplate application stack from Linux, or built a small one. I am now even considering getting rid of the network stack for my personal use.
* Disable hyper-threading: I had short-lived workers, and this didn't help in my case. Validate first which setting best suits your needs.
* Also check your CPU's capabilities (e.g. with AVX2 and quad-channel memory I get great improvements) and test them to make sure.
* A system like this gets hot quickly, so watch the temperatures. Short-running tests might give you happy results, but over the long term temperatures easily hit the wall, and the BIOS won't care; it will just throttle.
[+] [-] Arnavion|5 years ago|reply
[+] [-] zarathustreal|5 years ago|reply
[+] [-] bob1029|5 years ago|reply
[+] [-] rigtorp|5 years ago|reply
[+] [-] awild|5 years ago|reply
[+] [-] Torkel|5 years ago|reply
[+] [-] ezekiel68|5 years ago|reply
[+] [-] jeffbee|5 years ago|reply
https://llvm.org/docs/XRayExample.html#debugging-with-xray
[+] [-] sild|5 years ago|reply
[1] https://access.redhat.com/documentation/en-us/red_hat_enterp...
[+] [-] rigtorp|5 years ago|reply
[+] [-] jeffbee|5 years ago|reply
[+] [-] rigtorp|5 years ago|reply
[+] [-] PhDuck|5 years ago|reply
[+] [-] rigtorp|5 years ago|reply
[+] [-] WJW|5 years ago|reply
> Also consider using older CPU microcode without the microcode mitigations for CPU vulnerabilities.
I don't think I even know where to find older microcode for my particular CPU.
[+] [-] wtallis|5 years ago|reply
[+] [-] ncmncm|5 years ago|reply
[+] [-] rigtorp|5 years ago|reply
[+] [-] rigtorp|5 years ago|reply
[+] [-] fit2rule|5 years ago|reply
[+] [-] rigtorp|5 years ago|reply
[+] [-] TooSmugToFail|5 years ago|reply
[+] [-] angry_octet|5 years ago|reply
[+] [-] rigtorp|5 years ago|reply
Oh yeah, the whole NUMA page migration stuff is not well documented. You'll also have proactive compaction to deal with in the future: https://nitingupta.dev/post/proactive-compaction/
[+] [-] fizwhiz|5 years ago|reply
[+] [-] rigtorp|5 years ago|reply
[+] [-] statquontrarian|5 years ago|reply
[+] [-] nisa|5 years ago|reply
[+] [-] mrob|5 years ago|reply
[+] [-] nisa|5 years ago|reply
[+] [-] halz|5 years ago|reply
[+] [-] drudru11|5 years ago|reply
[+] [-] mike00632|5 years ago|reply
https://en.wikipedia.org/wiki/Data_Plane_Development_Kit
[+] [-] bogomipz|5 years ago|reply
[+] [-] rigtorp|5 years ago|reply
[+] [-] annoyingnoob|5 years ago|reply
There is probably buffer tuning you can do in the NIC driver also.
[+] [-] rigtorp|5 years ago|reply
[+] [-] letientai299|5 years ago|reply
This raises a question: for which types of applications is it OK to disable the mitigations?
[+] [-] WJW|5 years ago|reply
[+] [-] steventhedev|5 years ago|reply
[+] [-] jcelerier|5 years ago|reply
[+] [-] nullc|5 years ago|reply
[+] [-] netmonk|5 years ago|reply
[+] [-] fulafel|5 years ago|reply
[+] [-] angry_octet|5 years ago|reply
[+] [-] en4bz|5 years ago|reply
[+] [-] ncmncm|5 years ago|reply
But not everybody is so serious, and ease of deployment often matters. CentOS 7 is very old now. I hope you are not still using 6.