* What are the reasons for disabling TCP timestamps by default? (If you can answer) will they eventually be enabled by default? (The reason I'm asking is that Linux uses the TS field as storage for syncookies, and without it Linux will drop the WScale and SACK options, greatly degrading Windows TCP perf in case of a synflood.[1])
* I've noticed "Pacing Profile : off" in the `netsh interface tcp show global` output. Is that the same as TCP pacing in the fq qdisc[2]? (If you can answer) will it eventually be enabled by default?
I have a question: why is it that when you open two sockets on Windows and connect them over TCP, there is about a 40% difference in transfer rate when sending from socket A to B compared to sending from B to A?
I got hit by the exact same issue described in the Fermilab paper, namely packet reordering caused by Intel drivers. It took me several days to diagnose the problem. Interestingly enough, the problem virtually disappeared when running tcpdump, which, after a lot of reading on the innards of the Linux TCP stack and prodding with eBPF, eventually led me to conjecture that it was a scheduling/core-placement issue. Pinning my process clearly made the problem disappear, and then finding the paper nailed it.
Networks are not my specialty (I come from a math background, am self-taught, and had always dismissed them as mere plumbing), but I have to say that I came out of this difficult (for me) investigation with a great appreciation for networking in general, and now enjoy reading anything I can find about them.
It's never too late to learn, and I have yet to find something in software engineering which is not interesting once you take a closer look at it!
I had a similar issue with Windows kernels "recently" (2016~?)...
I don't have the memory or patience to write a long and inspiring blog post, but it comes down to:
Even with IOCP/multiple threads: network traffic is single-threaded in the kernel; even worse, there's a mutex there. That puts the effective PPS limit on Windows at something like 1.1M at 3.0 GHz.
The task of this machine was /basically/ a connection multiplexer with some TLS offloading; so listen on a socket, get an encrypted connection, check your connection pool and forward where appropriate.
Our machine basically sat waiting (in kernel space) for this lock 99.7% of the time; 0.3% was spent on SSL handshaking.
We solved our "issue" by spreading such load over many more machines and gave them low-core-count high-clock-speed Xeons instead of the normal complement of 20vCPU Xeons.
AFAIK that issue persists, I'd be interested to know if someone else managed to coerce windows to do the right thing here.
I did some work optimizing a similar problem, but simpler and on another OS[1]. The basic concept that worked was Receive Side Scaling (RSS), which was developed by Microsoft for Windows Server. Did you come across that? It needs support in the NIC and the driver, but Intel GigE cards do it, so you don't need the really fancy cards. I don't know what the interface is like for Windows, but inbound RSS for FreeBSD is pretty easy, and skimming Windows docs, it seemed like you could do more advanced things there.
The harder part was aligning the outgoing connections; for max performance, you want all of the related connections pinned to the same CPU, so that there's no inter-CPU messaging. For me that meant a frontend connection needs to hash to the same NIC queue as the backend connection; for you, that means all of the demultiplexed connections on the same queue as the multiplexed connection. Windows may have an API to make connections that will hash properly; FreeBSD didn't (doesn't?), so my code had to manage the local source IP and port when connecting to remote servers so that the connection would hash as needed. Assuming a lot of connections, you end up needing to self-manage source IP and port anyway, and at least HAProxy has code for that already, but running the RSS hash to qualify ports was new development, and a bit tricky because bulk calculating it gets costly.
Once I got everything set up well with respect to CPUs, things got a lot better; still had some kernel bottlenecks, though. I wouldn't know how to resolve those for Windows, but there were some easy wins for FreeBSD.
Low core count is the right way to go, though; I think the NICs I used could only do 16-way RSS hashing, so my dual 14-core Xeons (2690v4) weren't a great fit: 12 cores were 100% idle all the time. A power-of-two core count would be best.
Email in profile if you want to continue the discussion off HN (or after it fizzles out here).
[1] Load balancing/proxying, but no TLS and no multiplexing, on FreeBSD.
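The "running the RSS hash to qualify ports" trick described above can be sketched in a few lines. This sketch uses the sample 40-byte key from Microsoft's RSS verification docs (a real NIC's key is programmed by the driver), hashes the tuple in receive order (remote address first, since it's the inbound packets you want steered), and approximates the indirection table with a simple modulo. The helper names (`tcp4_input`, `pick_local_port`) are mine for illustration, not from any real API:

```python
import ipaddress

# Microsoft's sample RSS key from the NDIS "Verifying the RSS Hash
# Calculation" documentation; real NICs use a driver-programmed key.
RSS_KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0"
    "d0ca2bcbae7b30b477cb2da38030f20c"
    "6a42b73bbeac01fa"
)

def toeplitz_hash(data: bytes, key: bytes = RSS_KEY) -> int:
    # For every set bit i of the input, XOR in the 32-bit window of the
    # key that starts at bit offset i.
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    h = 0
    for i in range(len(data) * 8):
        if data[i // 8] & (0x80 >> (i % 8)):
            h ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return h

def tcp4_input(src_ip: str, dst_ip: str, sport: int, dport: int) -> bytes:
    # RSS input for TCP/IPv4: src addr, dst addr, src port, dst port,
    # big-endian, as seen by the receiving NIC.
    return (ipaddress.ip_address(src_ip).packed
            + ipaddress.ip_address(dst_ip).packed
            + sport.to_bytes(2, "big")
            + dport.to_bytes(2, "big"))

def pick_local_port(remote_ip, local_ip, remote_port, want_queue, nqueues,
                    ports=range(32768, 61000)):
    # Scan the ephemeral range for a local port that makes this flow's
    # *inbound* packets hash to the desired queue. (Real NICs index an
    # indirection table with the hash LSBs; modulo is a simplification.)
    for port in ports:
        h = toeplitz_hash(tcp4_input(remote_ip, local_ip, remote_port, port))
        if h % nqueues == want_queue:
            return port
    raise RuntimeError("no suitable local port found")
```

Bulk-calculating this per candidate port is exactly the cost mentioned above; precomputing per-bit partial hashes is the usual optimization.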
If you're on a recent Windows system, you should have pktmon [1] available. I believe it's the "netsh trace" successor and has a much nicer command line. And you no longer need an external tool to convert the trace to .pcapng format.
PktMon is the next-generation tool in newer Windows 10 versions and brings many of the same benefits referred to in this blog - particularly being able to view packet captures and traces together in the same text file.
Cool article, but I'm not impressed by Dropbox's upload speed on my Windows computer, at all.
I just tested right now with Dropbox, Google Drive, and OneDrive, all with their native desktop apps. I simply put a 300 MB file in the folder and let it sync.
DB: 500 KiB/s
GD: 3 MiB/s
OD: 11 MiB/s (my max bandwidth on a 100 Mbps connection)
I don't know what causes the disparity here, but I have been annoyed by this for years, and it's the same across multiple computers I use at different locations.
Another funny thing: if you just use the webpage, both GD and DB can reach 100 Mbps easily.
Edit: should mention Google's DriveFS can reach max speed too, but it's not available for my personal account (which uses the "Backup and sync for Google" app).
Is Google Drive using QUIC? If so, then it's using the same BBR congestion-control algorithm as their TCP stack, and BBR, which does not view loss as congestion, will help a lot.
It would be interesting to re-try the experiment on Linux or FreeBSD using BBR as the TCP stack and see if the results are any better for dropbox.
FWIW, my corp OpenVPN is kinda terrible. My upload speeds via the VPN did not improve at all when I moved and upgraded from 10 Mb/s to 1 Gb/s upstream. When I switched to BBR, my bandwidth went from ~8 Mb/s to ~60 Mb/s, which I think is the limit of the corp VPN endpoint.
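For anyone wanting to reproduce that on Linux: the congestion-control algorithm can be selected per socket with the `TCP_CONGESTION` option (exposed in Python since 3.6). A minimal sketch; it assumes the bbr module is loaded and listed in `net.ipv4.tcp_allowed_congestion_control`:

```python
import socket

def tcp_socket_with_cc(cc: bytes = b"bbr") -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Linux-only: select the congestion control before connect().
    # Unprivileged processes may only pick algorithms listed in
    # net.ipv4.tcp_allowed_congestion_control.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, cc)
    return s
```

Usage would be e.g. `tcp_socket_with_cc().connect(("vpn.example.com", 443))` (hostname made up); the system-wide equivalent is `sysctl -w net.ipv4.tcp_congestion_control=bbr`.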
Strange. Dropbox has no problem hitting mid-50s MiB/s if not more on my gigabit connection. I wonder if it's a routing issue and your path to their datacenters is bad?
Google are migrating Backup and Sync to DriveFS soon [0], but you can upgrade right now. Now, I don't remember how I did it, but I do have Drive FS on my personal account.
Yea, Dropbox on my Macs has continuously been outrageously slow at uploading. Everything else is multiples faster.
Dropbox does at least resume fairly reliably though, so I can generally ignore it the whole time... unless I have something I want to sync ASAP. Then I sometimes use the web UI and cross my fingers that I don't get a connection hiccup ಠ_ಠ
> Edit: should mention Google's DriveFS can reach max speed too, but it's not available for my personal account (which uses the "Backup and sync for Google" app).
That thing is far too aggressive about network bandwidth. It will upload 20 files at the same time and the speed limit setting doesn't work.
Do they? I constantly see Dropbox taking days to sync files that are 30 KB in size. Or doing dumbfounding things like downloading all files, then re-uploading all files, when I set sync to "online only" on a folder if just one of the files is not set to online only.
Maybe they have grand academic visions and papers, but I've been using them for well over a decade and I feel the client quality has gone downhill over the past few years. They keep adding unnecessary stuff like a redundant file browser while the core service suffers.
The real root cause of all that Flow Director mess and core balancing is that there's a huge disconnect between how the hardware works and what the socket API offers by default.
The scaling model of the hardware is rather simple: hash over packet headers and assign a queue based on this. And each queue should be pinned to a core by pinning the interrupts, so you get easy flow-level scaling. That's called RSS. It's simple and effective.
What it means is: the hardware decides which core handles which flow. I wonder why the article doesn't mention RSS at all?
Now the socket API works in a different way: your application decides which core handles which socket and hence which flow. So you get cache misses if you don't take into account how the hardware is hashing your flows. That's bad. You can do some workarounds by using Flow Director to explicitly redirect flows to the cores that handle them, but that's just not really an elegant solution (and the Flow Director lookup tables are small-ish).
I didn't follow kernel development regarding this recently, but there should be some APIs to get a mapping from a connection tuple to the core it gets hashed to on RX (hash function should be standardized to Toeplitz IIRC, the exact details on which fields and how they are put into the function are somewhat hardware- and driver-specific but usually configurable). So you'd need to take this information into account when scheduling your connections to cores. If you do that you don't get any cache misses and don't need to rely on the limited capabilities of explicit per-flow steering.
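The tuple-to-core lookup described above is essentially "hash, mask, table". A toy model of the dispatch path; `zlib.crc32` stands in for the NIC's Toeplitz hash purely to keep the sketch short (the real hash key and indirection table are what `ethtool -x` shows on Linux):

```python
import zlib

def rx_queue(tuple_bytes, indirection_table):
    # NIC model: hash the flow tuple, then use the hash's low bits to
    # index the indirection table, which names the RX queue (and, with
    # IRQs pinned one queue per core, the core).
    h = zlib.crc32(tuple_bytes) & 0xFFFFFFFF
    return indirection_table[h & (len(indirection_table) - 1)]

# A 128-entry table spreading flows round-robin over 4 queues.
table = [i % 4 for i in range(128)]
queue = rx_queue(b"10.0.0.1:12345->10.0.0.2:443", table)
```

A userspace scheduler following this advice would compute the queue for each accepted connection and run it on the matching core.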
Note that this problem will mostly go away once TAPS finally replaces BSD sockets :)
We didn't mention RSS/RPS in the post mostly because they are stable. (Albeit relatively ineffective in terms of L2 cache misses.) Flow Director, OTOH, breaks that stability and causes a lot of migrations, and hence a lot of re-ordering.
Anyways, nice reference for TAPS! For those wanting to dig into it a bit more, consider reading an introductory paper (before a myriad of RFC drafts from the "TAPS Working Group"): https://arxiv.org/pdf/2102.11035.pdf
I appreciate seeing a support and engineering org going this deep to resolve this kind of issue. Normally this is the stuff you waste hours on with a support org only to get told to clear your cookies and cache one more time.
In particular, the collaboration with Microsoft was great. I wonder what it took to make that happen.
Has Dropbox ever experimented with SCTP or other protocols that don't enforce strict ordering of packets? I know some middleboxes struggle with SCTP (they expect TCP or UDP), but in that case you can do SCTP over UDP or have a fallback.
Sadly, middleboxes are a real problem, esp. with our Enterprise customers. We had this problem even with HTTP/2 rollout so there is even a special HTTP/1.1-only mode in the Desktop Client for environments where h2 is disabled.
In the future we are planning on adding HTTP/3 support, which will give us pretty much the same benefits as SCTP with better middlebox compatibility.
Theoretically, UDP would be the best choice if you had the time & money to spend on building a very application-specific layer on top that replicates many of the semantics of TCP. I am not aware of any apps that require 100% of the TCP feature set, so there is always an opportunity to optimize.
You would essentially be saying "I know TCP is great, but we have this one thing we really prefer to do our way so we can justify the cost of developing an in-house mostly-TCP clone and can deal with the caveats of UDP".
If you know your communications channel is very reliable, UDP can be better than TCP.
Now, I am absolutely not advocating that anyone go out and do this. If you are trying to bring a product like Dropbox to market (and you don't have their budget), the last thing you want to do is play games with low-level network abstractions across thousands of potential client device types. TCP is an excellent fit for this use case.
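To give a sense of what even the smallest slice of a "mostly-TCP clone" involves: a toy stop-and-wait retransmission layer over UDP already needs sequence numbers, ACKs, timeouts, and duplicate suppression. A single-process sketch with the receiver serviced inline (all names are made up for illustration):

```python
import socket

def stop_and_wait(payloads, drop_first_tx=False):
    # Toy ARQ over UDP on localhost: a 1-byte alternating sequence number
    # in front of each payload, the receiver echoes the seq as an ACK,
    # and the sender retransmits on timeout.
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.settimeout(0.2)
    dst = rx.getsockname()

    delivered, seq, expected = [], 0, 0
    for payload in payloads:
        first = True
        while True:
            if not (drop_first_tx and first):
                tx.sendto(bytes([seq]) + payload, dst)
                # --- receiver side, serviced inline ---
                data, peer = rx.recvfrom(2048)
                if data[0] == expected:    # new data: deliver exactly once
                    delivered.append(data[1:])
                    expected ^= 1
                rx.sendto(data[:1], peer)  # the ACK carries the seq back
            first = False
            try:
                ack, _ = tx.recvfrom(16)
                if ack[0] == seq:
                    break                  # ACKed, move to the next payload
            except socket.timeout:
                pass                       # treat as lost: retransmit
        seq ^= 1
    rx.close()
    tx.close()
    return delivered
```

This is the trivial case; windowing, RTT estimation, and congestion control are where the real cost of the in-house clone lives.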
It's an ideal application of TCP. Dropbox servers are continually flooded by traffic from clients, so the good congestion behavior from TCP is valuable. There is also less need to implement error detection/correction/retransmission in higher layers.
"Dropbox is used by many creative studios, including video and game productions. These studios’ workflows frequently use offices in different time zones to ensure continuous progress around the clock. "
Honestly I don't understand these orgs that don't go OneDrive/O365 suite. What product value does dropbox have when competing within Microsoft's own ecosystem?
I wonder how the Dropbox developers managed to get in contact with the Windows core TCP team. Maybe I'm too cynical, but I'm surprised that Microsoft would go out of their way to work with a competitor like this.
Even if OneDrive vs Dropbox is important, this is a win for Windows in general. People will switch OSes because the TCP throughput is better on the other side; it's easy to measure and easy to compare and makes a nice item in a pros and cons list.
Fixing something like this can help lots of use cases, but may have been difficult to spot, so I'm sure the Windows TCP team was thrilled to get the detailed, reproducible report.
Interesting. Is the Dropbox client still an obfuscated python app? I'm curious if they spawn new processes for simultaneous uploads since they probably aren't threading.
How come Linux doesn't have this issue? Why did Microsoft have to fix TCP with the RACK-TLP RFC when the Linux and macOS implementations already did fine?
Yeah, I have no idea either why Microsoft would want to remove Message Analyzer completely, even if they could not maintain it. You can still download it through the Internet Archive (links below).
Supposedly Microsoft is working on adding to the existing Windows Performance Analyzer (great GUI tool for ETW performance tracing) to display ETW packet captures, which will succeed Message Analyzer and Network Monitor: https://techcommunity.microsoft.com/t5/networking-blog/intro...
[1] https://elixir.bootlin.com/linux/v5.13-rc2/source/net/ipv4/s... [2] https://man7.org/linux/man-pages/man8/tc-fq.8.html
SaveTheRbtz|4 years ago
> Networks are not my specialty
I wish all network non-specialists were like you!
[1] https://docs.microsoft.com/en-us/windows-server/networking/t...
SaveTheRbtz|4 years ago
PS. One known problem that we have right now is that we use a multiplexed HTTP/2 connection, therefore:
1) We rely on the host's TCP congestion control. (We have not yet switched to HTTP/3 w/ BBR.)
2) We currently use a single TCP connection: it is more fair to the other traffic on the link but can become a bottleneck on large RTTs.
[0]: https://support.google.com/googleone/answer/10309431#zippy=
SaveTheRbtz|4 years ago
PS. We went through most of our low-level web-server optimization for the Edge Network in an old blogpost: https://dropbox.tech/infrastructure/optimizing-web-servers-f...
SaveTheRbtz|4 years ago
TL;DR is that they had RACK (then an RFC draft) implemented as an MVP, but w/o the reordering heuristic.
[1] https://techcommunity.microsoft.com/t5/networking-blog/algor...
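For the curious, the missing heuristic can be sketched: RACK-style detection marks a packet lost only when a later-sent packet has already been delivered *and* a reordering window (starting at min_RTT/4) has elapsed on top of the RTT, instead of counting duplicate ACKs. A toy model, loosely after RFC 8985 and certainly not the Windows implementation:

```python
from dataclasses import dataclass

@dataclass
class Pkt:
    seq: int
    xmit_time: float
    sacked: bool = False   # delivery reported via (S)ACK

def rack_lost(pkts, now, min_rtt, reo_wnd=None):
    # A packet is deemed lost once a packet transmitted *after* it has
    # been delivered, and `now` exceeds its send time by roughly one RTT
    # plus the reordering window. reo_wnd=0 reproduces the bug class in
    # the article: mere reordering gets treated as loss.
    if reo_wnd is None:
        reo_wnd = min_rtt / 4          # RACK's initial reordering window
    acked = [p.xmit_time for p in pkts if p.sacked]
    if not acked:
        return []
    newest_delivered = max(acked)
    return [p.seq for p in pkts
            if not p.sacked
            and p.xmit_time <= newest_delivered
            and now >= p.xmit_time + min_rtt + reo_wnd]
```

With the window in place, a briefly reordered packet gets a grace period before being retransmitted; without it, any SACK for a later packet immediately condemns earlier in-flight ones.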
hyperrail|4 years ago
* 32-bit x86: https://web.archive.org/web/20191104120802/https://download....
* 64-bit x86: https://web.archive.org/web/20190420141924/http://download.m...
(those links via: https://www.reddit.com/r/sysadmin/comments/e4qocq/microsoft_... )
Or use the even older Microsoft utility Network Monitor, which is still available on Microsoft's website: https://www.microsoft.com/en-us/download/details.aspx?id=486...