item 24785405

40 Milliseconds of latency that just would not go away

386 points | r4um | 5 years ago | rachelbythebay.com | reply

97 comments

[+] deafcalculus|5 years ago|reply
Was delayed ACKs the problem? Disabling delayed ACKs seems like a better bet than using TCP_NODELAY which turns off Nagle's algorithm.
[+] Animats|5 years ago|reply
I generally say yes. The fixed timer is for the delayed ACK. That was a terrible idea. Both Linux and Windows now have a way to turn delayed ACKs off, but they're still on by default.

TCP_QUICKACK, which turns off delayed ACKs, is in Linux, but the manual page is very confused about what it actually does. Apparently it turns itself off after a while. I wish someone would get that right. I'd disable delayed ACKs by default. It's hard to think of a case today where they're a significant win. As I've written in the past, delayed ACKs were a hack to make remote Telnet character echo work better.

A key point is asymmetry. If you're the one who's doing lots of little writes, you can either set TCP_NODELAY at your end, or turn off delayed ACKs at the other end. If you can. Things doing lots of little writes but not filling up the pipe, typically game clients, can't change the settings at the other end. So it became standard practice to do what you could at your end.
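Both knobs are plain setsockopt calls. Here's a minimal Python sketch of the two sides (illustrative only, not Animats' code; TCP_QUICKACK is Linux-only, and as noted above the kernel may silently re-enable delayed ACKs, so it has to be reapplied):

```python
import socket

def disable_nagle(sock):
    """Sender side: turn off Nagle so small writes go out immediately."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

def enable_quickack(sock):
    """Receiver side (Linux only): disable delayed ACKs for now.
    The kernel can quietly fall back to delayed ACKs later, so this
    typically needs to be re-applied after every recv()."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    disable_nagle(s)
    # The option is readable back even on an unconnected socket.
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
    s.close()
```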

[+] ganzuul|5 years ago|reply
The article is refreshingly short and to the point. Unusually, it does not aim to waste the maximum amount of your time, so why don't you go read it.
[+] lilyball|5 years ago|reply
The problem with disabling delayed ACKs is that it requires controlling the server. If you only control the client, you can't remotely turn off delayed ACKs, so instead you have to disable Nagle's algorithm.
[+] punnerud|5 years ago|reply
From John Nagle:

"(..) Unfortunately, delayed ACKs went in after I got out of networking in 1986, and this was never fixed. Now it's too late."

https://stackoverflow.com/a/16663206/2326672

[+] pedrocr|5 years ago|reply
So the kernel is using a static decision that's really bad sometimes? Would it be too expensive to treat this like a branch predictor and keep some state to have the kernel enable/disable the delayed ACK dynamically depending on how it has won/lost the bet recently?
[+] Flowdalic|5 years ago|reply
I wouldn't be sure that this is authentic, i.e. actually from John Nagle.
[+] sdoering|5 years ago|reply
> Down that path also lies madness.

Ohhhhh so true. I sadly have no such story to tell regarding performance optimization, but figuring out the intricacies of any complex system (for me at least) inevitably leads to knowing arcane stuff that might come in handy some time. On the other hand, in my humble experience, it also leads to knowing a lot of arcane stuff that might plausibly have an impact on a problem but is completely unrelated in the specific case one is dealing with.

Knowing when to discard arcane knowledge and when to jump onto that train of thought imho is crucial.

But on the other hand debugging arcane stuff in complex systems is just so much fun. One learns so much.

[+] ganafagol|5 years ago|reply
People should know about this more, so that the lesson can be learned. Any "little tweak" to an otherwise simple and elegant spec adds complexity that generations will have to deal with, and that complexity oftentimes compounds exponentially. Just look at the interaction between Nagle's algorithm and delayed ACKs: each on its own sounds like a good idea, but the compounded complexity is what kills understanding.

Unfortunately, new generations don't learn the lesson. Modern web dev, for example, has so many layers of complexity and bloat, and they all interact, so you need to know all the layers intimately for any real understanding as the complexity explodes. It does not have to be that way: if every layer had a clean, small abstraction, you wouldn't need to know the details, only the small spec. But that doesn't work if everybody just adds a hack here and there that breaks some separation, but oh well.

And that's why we can't have nice things.

[+] jlokier|5 years ago|reply
This is one ancient problem. I remember dealing with it in 2003.

Writeup from 1997 here (P-HTTP basically means HTTP version 1.1):

https://www.isi.edu/~johnh/PAPERS/Heidemann97a.html

> John Heidemann. Performance Interactions Between P-HTTP and TCP Implementations. ACM Computer Communication Review. 27, 2 (Apr. 1997), 65–73.

> This document describes several performance problems resulting from interactions between implementations of persistent-HTTP (P-HTTP) and TCP. Two of these problems tie P-HTTP performance to TCP delayed-acknowledgments, thus adding up to 200ms to each P-HTTP transaction. A third results in multiple slow-starts per TCP connection. Unresolved, these problems result in P-HTTP transactions which are 14 times slower than standard HTTP and 20 times slower than potential P-HTTP over a 10 Mb/s Ethernet. We describe each problem and potential solutions. After implementing our solutions to two of the problems, we observe that P-HTTP performs better than HTTP on a local Ethernet. Although we observed these problems in specific implementations of HTTP and TCP (Apache-1.1b4 and SunOS 4.1.3, respectively), we believe that these problems occur more widely.

Solutions for efficient batching of HTTP headers + data without delays involve TCP_NODELAY, and MSG_MORE / SPLICE_F_MORE / TCP_CORK / TCP_NOPUSH. Possibly TCP_QUICKACK may come in handy. Same for any protocol really, but HTTP is the one where there tends to be a separate sendmsg() and sendfile() on Linux.
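The TCP_CORK approach can be sketched in a few lines of Python (Linux-only; the header/body split and the loopback demo are made up for illustration, not from the paper):

```python
import socket

def send_corked(sock, header, body):
    """Batch header + body into as few segments as possible (Linux TCP_CORK).
    While corked, the kernel holds back partial frames; clearing the cork
    flushes whatever is still pending."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
    try:
        sock.sendall(header)
        sock.sendall(body)
    finally:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)

if __name__ == "__main__":
    # Loopback demo: server and client on 127.0.0.1.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    cli = socket.create_connection(srv.getsockname())
    conn, _ = srv.accept()
    send_corked(cli, b"Header", b"Body")
    data = b""
    while len(data) < len(b"HeaderBody"):
        data += conn.recv(1024)
    print(data)
    cli.close(); conn.close(); srv.close()
```

The same pattern applies with MSG_MORE on individual send() calls instead of corking the whole socket.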

[+] rsclient|5 years ago|reply
This is exactly why the Socket API in WinRT has Nagle off by default. The old way of dealing with sockets was to treat them like buffered files, or to drive them from a keyboard (so that Nagle is useful). But newer socket programs seem to just make a full chunk of information, and send it at once. Those newer programs either turn off Nagle, or would be improved if they did.

So we bit the bullet, and decided to make Nagle off by default.

[+] ohazi|5 years ago|reply
You don't need to do it this way, though... the general rule is that you shouldn't enable both Nagle's Algorithm and TCP delayed ACK at the same time.
[+] dekhn|5 years ago|reply
I once had to debug the scaling performance of a MPI-based simulation algorithm on cheap linux machines with TCP. I finally collected a TCP trace and showed it to the local expert who said: "hmm, 250ms delay right there.. that's the TCP retransmit timer... you're flooding the ethernet switch with too many packets and the switch is dropping them. Enable <such and such a feature>."

Since then I've always kept various constants like that in human RAM, because it helps with root-causing.

[+] whoisburbansky|5 years ago|reply
What are some examples of other such constants you have hanging around?
[+] scott_s|5 years ago|reply
John Nagle is a commenter here on HN, and has commented on this very thing: https://news.ycombinator.com/item?id=10608356

I have also run into this, but for me it was a periodic latency spike with steady but periodic messages. That latency spike went away when the messages were sent as fast as possible.

[+] Kiro|5 years ago|reply
Animats is John Nagle?! I've seen enough good comments from him that I think "oh, a comment from Animats", but I never realized.
[+] JoeAltmaier|5 years ago|reply
Similar to Nagle, there are reasons to combine packets on a session. Network equipment that fools with every packet can get backed up if the traffic packet count exceeds a limit. By Nagling (or doing something similar in your transmit code) you can increase your message rate through such bottlenecks.

Used to have a server cluster that used some 'hologram' style router on the receiving end, to spread load. It had a hard limit on # packets per second it could handle. I changed our app to combine sends (2ms timer, not 40ms!) and halved our total traffic packet count. Put off the day they had to buy more server-side hardware to handle the load.

Btw if the clients are on wifi networks, then there's no point in aggregating sends past a pretty small size (512 bytes?) because wifi fragments (used to fragment?) packets to that smaller size over the air, and never reassembles them, leaving that to the target server.
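The combine-sends idea above can be sketched as a small application-level coalescing buffer. Everything here (the class name, the 512-byte cap, the 2ms timer from the comment above) is hypothetical, not the actual server code:

```python
import socket
import threading

class CoalescingSender:
    """Merge many small application messages into fewer packets by
    buffering briefly before writing to the socket. Flushes when the
    buffer reaches max_bytes or when the short timer expires."""
    def __init__(self, sock, max_bytes=512, delay=0.002):
        self.sock = sock
        self.max_bytes = max_bytes
        self.delay = delay
        self.buf = bytearray()
        self.lock = threading.Lock()
        self.timer = None

    def send(self, msg):
        with self.lock:
            self.buf += msg
            if len(self.buf) >= self.max_bytes:
                self._flush_locked()
            elif self.timer is None:
                # Arm a one-shot flush timer for the first queued message.
                self.timer = threading.Timer(self.delay, self.flush)
                self.timer.start()

    def flush(self):
        with self.lock:
            self._flush_locked()

    def _flush_locked(self):
        if self.timer is not None:
            self.timer.cancel()
            self.timer = None
        if self.buf:
            self.sock.sendall(bytes(self.buf))
            self.buf.clear()
```

Unlike Nagle, this batches regardless of whether there's unacknowledged data in flight, which is why it also helps with packets-per-second limits in middleboxes.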

[+] oasisbob|5 years ago|reply
> wifi fragments (used to fragment?) packets

802.11n and 802.11ac both don't do this. Quite the opposite: they have several layers of frame aggregation (A-MPDU/A-MSDU). [1]

Learned about this while troubleshooting latency problems on a noisy 10km point-to-point link.

[1] https://arxiv.org/pdf/1704.07015.pdf

[+] unilynx|5 years ago|reply
> Stuff like this just proves that a large part of this job is just remembering a bunch of weird data points and knowing when to match this story to that problem.

I've hit Nagle far in the past, and reading the title I thought 'well that can't be about Nagle because that was a 200ms delay'

Looks like someone tuned it down to 40ms but didn't dare remove it. It would be interesting to know how they arrived at that choice.

[+] JoeAltmaier|5 years ago|reply
I thought the delay was set in the 'delayed ack' setting?
[+] euph0ria|5 years ago|reply
Why not just use tcpdump or Wireshark when troubleshooting network latencies? It usually takes only a minute or two to pinpoint the issue. Then you still need to spend time understanding why the pinpointed behavior is what it is; sometimes it's in the application, sometimes not. I've solved so many issues over the years with tcpdump that it has become one of the most valuable tools I know.
[+] LeonM|5 years ago|reply
Out of curiosity, how would you debug this issue with Wireshark?

Just look for multiple messages in a single TCP packet? Or is there a better way?

[+] sa46|5 years ago|reply
Can anyone recommend a good intro to debugging network problems assuming competent knowledge of most Linux sysadmin tasks?
[+] errantspark|5 years ago|reply
I remember first learning of Nagle's algorithm back in the early WoW days in my endless quest to get lower latency for PvP on my neighbor's cracked WEP. I don't really know if it matters much in 2020, but I still habitually run the *.reg file to disable it on every new windows install.
[+] blibble|5 years ago|reply
Blizzard finally figured out how to call setsockopt with TCP_NODELAY
[+] nh2|5 years ago|reply
For those wondering "so, so how do I do it right?":

I was in that situation 4 years ago and did a short write up on it:

https://gist.github.com/nh2/9def4bcf32891a336485

It explains how to avoid the 40ms delay and still batch data where possible for maximum efficiency. The key part is that you can toggle the TCP options during the lifetime of the connection to force flushes.

Review appreciated.
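For the impatient, the core trick as I understand it is just a pair of setsockopt calls (a Python sketch of the idea, assuming Linux behavior; this is not code from the gist):

```python
import socket

def flush_pending(sock):
    """Force out any data Nagle is holding back, then restore batching.
    On Linux, setting TCP_NODELAY flushes the send buffer immediately;
    clearing it afterwards re-enables Nagle for subsequent writes."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0)
```

So you keep Nagle on for throughput and call this at message boundaries, getting batching without the 40ms stalls.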

[+] lmilcin|5 years ago|reply
To be fair, this can be fixed with well-designed libraries that don't rely on TCP to do the job for them of merging buffers and preventing small writes.

The issue is that the vast majority of libraries treat the problem as if it did not exist: they prefer not to get their hands dirty and just conveniently write a stream of data to the socket, leaving it to the user to correctly configure options on the socket.

But yes, in general, performance is at least in significant part about remembering a huge amount of trivia.

[+] derefr|5 years ago|reply
Well, yes. The point of TCP is that it's an opaque reliable linear-stream abstraction. If you're not treating it like an opaque reliable linear-stream abstraction, you shouldn't be using TCP. If you want to manage your own datagrams, use a datagram transport. (Not necessarily UDP. I'd suggest SCTP, personally. Or maybe QUIC.)
[+] _urga|5 years ago|reply
If your library is already merging buffers and preventing small writes then you would still want to set TCP_NODELAY, to eliminate delays due to send/send/recv patterns where merged buffers are less than the MSS... because you know for sure that you're already doing all you can, and Nagle's algorithm can't help further except introduce delay.
[+] taneq|5 years ago|reply
World of Warcraft had Nagle's algorithm enabled for YEARS. That's one reason that VPN services were so popular and could cut 50-100ms off your ping time, especially if you were playing from Oceania.
[+] jdblair|5 years ago|reply
This isn't explicitly related, but interesting, so I offer it up here.

When I read 40ms, it triggered a memory from tracking down a different 40ms latency bug a few years ago. I work on the Netflix app for set top boxes, and a particular pay TV company had a box based on AOSP L. Testing discovered that after enough times switching the app between foreground and background, playback would start to stutter. The vendor doing the integration blamed Netflix - they showed that in the stutter case, the Netflix app was not feeding video data quickly enough for playback. They stopped their analysis at this point, since as far as they were concerned, they had found the issue and we had to fix the Netflix app.

I doubted the app was the issue, as it ran on millions of other devices without showing this behavior. I instrumented the code and measured 40ms of extra delay from the thread scheduler. The 40ms was there, and was outside of our app's context. Literally, I measured it between the return of the thread handler and the next time the handler was called. So I responded, to paraphrase, it's not us, it's you. Your Android scheduler is broken.

But the onus was on me to prove it by finding the bug. I read the Android code, and learned Android threads are a userspace construct - the Android scheduler uses epoll() as a timer and calls your thread handler based on priority level. I thought, epoll() performance isn't guaranteed, maybe something obscure changed, and this change is adding an additional 40ms in this particular case. So I dove into the kernel, thinking the issue must be somewhere inside epoll().

Lucky for me, another engineer, working for a different vendor on the project, found the smoking gun in this patch in Android M (the next version). It was right there, an extra 40ms explicitly (and mistakenly) added when a thread is created while the app is in the background.

https://android.googlesource.com/platform/system/core/+/4cdc...

  Fix janky navbar ripples -- incorrect timerslack values
  
  If a thread is created while the parent thread is "Background",
  then the default timerslack value gets set to the current
  timerslack value of the parent (40ms). The default value is
  used when transitioning to "Foreground" -- so the effect is that
  the timerslack value becomes 40ms regardless of foreground/background.
  
  This does occur intermittently for systemui when creating its
  render thread (pretty often on hammerhead and has been seen on
  shamu). If this occurs, then some systemui animations like navbar
  ripples can wait for up to 40ms to draw a frame when they intended
  to wait 3ms -- jank.
  
  This fix is to explicitly set the foreground timerslack to 50us.
  
  A consequence of setting timerslack behind the process' back is
  that any custom values for timerslack get lost whenever the thread
  has transition between fg/bg.
  

  --- a/libcutils/sched_policy.c
  +++ b/libcutils/sched_policy.c
  @@ -50,6 +50,7 @@
   
   // timer slack value in nS enforced when the thread moves to background
   #define TIMER_SLACK_BG 40000000
  +#define TIMER_SLACK_FG 50000
   
   static pthread_once_t the_once = PTHREAD_ONCE_INIT;
   
  @@ -356,7 +357,8 @@
                              &param);
       }
   
  -    prctl(PR_SET_TIMERSLACK_PID, policy == SP_BACKGROUND ? TIMER_SLACK_BG : 0, tid);
  +    prctl(PR_SET_TIMERSLACK_PID,
  +          policy == SP_BACKGROUND ? TIMER_SLACK_BG : TIMER_SLACK_FG, tid);
   
       return 0;
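For anyone who wants to poke at this themselves: the same timer slack is settable from userspace via prctl(2). A Python/ctypes sketch (Linux-only; constants are from <linux/prctl.h>, and the values mirror the patch above):

```python
import ctypes

# Linux prctl(2) operation codes from <linux/prctl.h>.
PR_SET_TIMERSLACK = 29
PR_GET_TIMERSLACK = 30

libc = ctypes.CDLL("libc.so.6", use_errno=True)

def set_timerslack_ns(ns):
    """Set the calling thread's timer slack, in nanoseconds."""
    if libc.prctl(PR_SET_TIMERSLACK, ctypes.c_ulong(ns), 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_TIMERSLACK) failed")

def get_timerslack_ns():
    """Return the calling thread's current timer slack, in nanoseconds."""
    return libc.prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0)

if __name__ == "__main__":
    set_timerslack_ns(40_000_000)   # TIMER_SLACK_BG: the buggy 40 ms
    print(get_timerslack_ns())
    set_timerslack_ns(50_000)       # TIMER_SLACK_FG: the fixed 50 us
```

With a 40ms slack, any timed sleep in the thread can be deferred by up to 40ms, which is exactly the stutter the integration team blamed on the app.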
[+] nialv7|5 years ago|reply
My first reaction to the 40ms number is "TCP_NODELAY?".

That number is probably carved into my brain now.

[+] jeffbee|5 years ago|reply
I have the same traumatic association with 5 second delays. 5 second tail latency? Look for SYN retransmits.