top | item 27198085

Boosting upload speed and improving Windows' TCP stack

261 points | el_duderino | 4 years ago | dropbox.tech

115 comments

[+] slowstart|4 years ago|reply
I lead the Windows TCP team. We blogged about recent TCP advancements which is very relevant: https://techcommunity.microsoft.com/t5/networking-blog/algor...
[+] SaveTheRbtz|4 years ago|reply
A couple of questions:

* What are the reasons for disabling TCP timestamps by default? (If you can answer) will they be eventually enabled by default? (The reason I'm asking is that Linux uses TS field as storage for syncookies, and without it will drop WScale and SACK options greatly degrading Windows TCP perf in case of a synflood.[1])

* I've noticed "Pacing Profile : off" in the `netsh interface tcp show global` output. Is that the same as tcp pacing in fq qdisc[2]? (If you can answer) will it be eventually enabled by default?

[1] https://elixir.bootlin.com/linux/v5.13-rc2/source/net/ipv4/s... [2] https://man7.org/linux/man-pages/man8/tc-fq.8.html

[+] drummer|4 years ago|reply
I have a question: why, when opening two sockets on Windows and connecting them over TCP, is there about a 40% difference in transfer rate when sending from socket A to B compared to sending from B to A?
[+] the8472|4 years ago|reply
Are equivalents to linux' BQL/AQL, fq_codel, TCP_NOTSENT_LOWAT in the pipeline?
[+] Agingcoder|4 years ago|reply
Excellent article !

I got hit by the exact same issue which is described in the fermilab paper, namely packet reordering caused by intel drivers. It took me several days to diagnose the problem. Interestingly enough, the problem virtually disappeared when running tcpdump, which, after a lot of reading on the innards of the linux TCP stack, and prodding with ebpf, eventually led me to conjecture that it was a scheduling/core placement issue. Pinning my process clearly made the problem disappear, and then finding the paper nailed it.
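For anyone wanting to reproduce the workaround rather than the diagnosis: the pinning step is one line on Linux. A minimal sketch using Python's scheduler API (core 0 is an arbitrary example choice, not what the parent necessarily used):

```python
import os

# Pin this process to CPU 0 so its flows are consistently handled on the
# same core, avoiding the cross-core migration/reordering described above.
# (Linux-only API; pick the core that matches your NIC queue's IRQ affinity.)
os.sched_setaffinity(0, {0})

print(os.sched_getaffinity(0))
```

The same effect can be had externally with `taskset`; doing it in-process just keeps the deployment self-contained.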

Networks are not my specialty (I come from a math background, am self taught, and had always dismissed them as mere plumbing) , but I have to say that I came out of this difficult (for me) investigation with a great appreciation for networking in general, and now enjoy reading anything I can find about them.

It's never too late to learn, and I have yet to find something in software engineering which is not interesting once you take a closer look at it!

[+] SaveTheRbtz|4 years ago|reply
> tcpdump ... linux TCP stack ... ebpf

> Networks are not my specialty

I wish all network non-specialists were like you!

[+] baruch|4 years ago|reply
Networking actually has tons of interesting and complex math. Congestion control is a rabbit hole of math and control theory.
[+] brohee|4 years ago|reply
I think the problem disappearing while running tcpdump is one of the truest instances of a Schrödinbug...
[+] dijit|4 years ago|reply
I had a similar issue with Windows kernels "recently" (2016~?)...

I don't have the memory or patience to write a long and inspiring blog post, but it comes down to:

Even with IOCP/multiple threads, network traffic is single-threaded in the kernel. Even worse, there's a mutex there, putting the effective PPS limit for Windows at something like 1.1M at 3.0 GHz.

The task of this machine was /basically/ a connection multiplexer with some TLS offloading; so listen on a socket, get an encrypted connection, check your connection pool and forward where appropriate.

Our machine basically sat waiting (in kernel space) for this lock 99.7% of the time; the remaining 0.3% was spent on SSL handshaking.

We solved our "issue" by spreading such load over many more machines and gave them low-core-count high-clock-speed Xeons instead of the normal complement of 20vCPU Xeons.

AFAIK that issue persists, I'd be interested to know if someone else managed to coerce windows to do the right thing here.

[+] toast0|4 years ago|reply
I did some work optimizing a similar problem, but simpler and on another OS[1]. The basic concept that worked was Receive Side Scaling (RSS), which was developed by Microsoft for Windows Server. Did you come across that? It needs support in the NIC and the driver, but Intel gigE cards do it, so you don't need the really fancy cards. I don't know what the interface is like for Windows, but inbound RSS for FreeBSD is pretty easy, and skimming Windows docs, it seemed like you could do more advanced things there.

The harder part was aligning the outgoing connections; for max performance, you want all of the related connections pinned to the same CPU, so that there's no inter CPU messaging; for me that meant a frontend connection needs to hash to the same NIC queue as the backend connection; for you, that needs to be all of the demultiplexed connections on the same queue as the multiplexed connection. Windows may have an API to make connections that will hash properly, FreeBSD didn't (doesn't?), so my code had to manage the local source ip and port when connecting to remote servers so that the connection would hash as needed. Assuming a lot of connections, you end up needing to self-manage source ip and port anyway, and at least HAProxy has code for that already, but running the rss hash to qualify ports was new development, and a bit tricky because bulk calculating it gets costly.

Once I got everything setup well with respect to CPUs, things got a lot better; still had some kernel bottlenecks though, I wouldn't know how to resolve that for Windows, but there were some easy wins for FreeBSD.

Low core count is the right way to go though; I think the NICs I used could only do 16 way RSS hashing, so my dual 14 core xeon (2690v4) weren't a great fit; 12 cores were 100% idle all the time; something power of two would be best.

Email in profile if you want to continue the discussion off HN (or after it fizzles out here).

[1] Load balancing/proxying, but no TLS and no multiplexing, on FreeBSD.
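For readers curious what "running the rss hash to qualify ports" involves: the hash Microsoft specified for RSS is the Toeplitz hash, which is short enough to sketch. This is an illustration, not the parent's code; it assumes the standard IPv4/TCP input layout (src addr, dst addr, src port, dst port), and the key and flow below are from Microsoft's published RSS verification suite:

```python
import socket
import struct

# 40-byte secret key from Microsoft's RSS verification suite
RSS_KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0"
    "d0ca2bcbae7b30b477cb2da38030f20c"
    "6a42b73bbeac01fa"
)

def toeplitz_hash(key: bytes, data: bytes) -> int:
    """For each set bit of the input, XOR in the 32-bit key window at that bit offset."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        if data[i // 8] & (0x80 >> (i % 8)):
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result

def rss_hash_ipv4_tcp(src_ip, dst_ip, src_port, dst_port, key=RSS_KEY):
    # Input is src addr + dst addr + src port + dst port, network byte order
    data = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))
    return toeplitz_hash(key, data)

# Published test vector: 66.9.149.187:2794 -> 161.142.100.80:1766
print(hex(rss_hash_ipv4_tcp("66.9.149.187", "161.142.100.80", 2794, 1766)))
```

To qualify a local source port, you would loop candidate ports through `rss_hash_ipv4_tcp` until the hash (masked by the NIC's indirection table size) lands on the queue you want, which is why bulk-calculating it gets costly.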

[+] lowleveldesign|4 years ago|reply
If you're on a recent Windows system, you should have pktmon [1] available. I believe it's the "netsh trace" successor and has a much nicer command line. And you no longer need an external tool to convert the trace to .pcapng format.

[1] https://docs.microsoft.com/en-us/windows-server/networking/t...

[+] slowstart|4 years ago|reply
PktMon is the next generation tool in newer Windows 10 versions and brings many of the same benefits referred to in this blog - particularly being able to view packet captures and traces together in the same text file.
[+] thrdbndndn|4 years ago|reply
Cool article, but I'm not impressed by Dropbox's upload speed on my Windows computer, at all.

I just tested rn with Dropbox, Google Drive, and OneDrive, all with their native desktop apps. I simply put a 300MB file in the folder and let it sync.

    DB: 500 KiB/s
    GD: 3 MiB/s
    OD: 11 MiB/s (my max bandwidth with 100Mbps)

I don't know what causes the disparity here, but I have been annoyed by this for years, and it's the same across multiple computers I use at different locations.

Another funny thing is if you just use the webpage, both GD and DB can reach 100Mbps easily.

Edit: should mention Google's DriveFS can reach max speed too, but it's not available for my personal account (which uses the "Backup and sync for Google" app).

[+] drewg123|4 years ago|reply
Is Google Drive using QUIC? If so, then it's using the same BBR congestion control algorithm as a BBR-enabled TCP stack, and BBR, which does not treat loss as congestion, will help a lot.

It would be interesting to re-try the experiment on Linux or FreeBSD using BBR as the TCP congestion control and see if the results are any better for Dropbox.

FWIW, my corp openvpn is kinda terrible. My upload speeds via the VPN did not improve at all when I moved and upgraded from 10Mb/s to 1Gb/s upstream. When I switched to BBR, my bandwidth went from ~8Mb/s to ~60Mb/s, which I think is the limit of the corp VPN endpoint.
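For anyone wanting to try this: Linux lets you select congestion control per socket via the TCP_CONGESTION option, no system-wide sysctl needed. A sketch (hedged: the `tcp_bbr` module may not be loaded, hence the fallback; the constant 13 is Linux's TCP_CONGESTION value for builds that don't expose it):

```python
import socket

# Per-socket congestion control selection on Linux.
TCP_CONGESTION = getattr(socket, "TCP_CONGESTION", 13)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Ask for BBR on this socket only; requires the tcp_bbr kernel module.
    s.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, b"bbr")
except OSError:
    pass  # bbr unavailable: the socket keeps the system default (often cubic)

# Read back whichever algorithm the socket actually got
current = s.getsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, 16)
print(current.rstrip(b"\x00").decode())
s.close()
```

Setting it system-wide instead is `sysctl net.ipv4.tcp_congestion_control=bbr`, but the per-socket form is handier for A/B testing a single transfer path.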

[+] kevingadd|4 years ago|reply
Strange. Dropbox has no problem hitting mid-50s MiB/s if not more on my gigabit connection. I wonder if it's a routing issue and your path to their datacenters is bad?
[+] SaveTheRbtz|4 years ago|reply
Interesting, can you try disabling the upload limiter in settings? Also, what is your RTT to `nsf-1.dropbox.com`?

PS. One known problem that we have right now is that we use a multiplexed HTTP/2 connection, therefore:

1) We rely on the host's TCP congestion control. (We have not yet switched to HTTP/3 w/ BBR.)

2) We currently use a single TCP connection: it is more fair to the other traffic on the link but can become a bottleneck on large RTTs.
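For anyone asked to report their RTT: timing a TCP connect() gives a decent userspace estimate, since the handshake takes about one round trip. A self-contained sketch against a local listener (swap in a real host and port, e.g. the endpoint above on 443, to measure your actual path):

```python
import socket
import time

def tcp_rtt(host: str, port: int) -> float:
    """Time the TCP three-way handshake; connect() returns after ~one RTT."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5):
        return time.perf_counter() - start

# Local demo listener so the example runs without network access
listener = socket.create_server(("127.0.0.1", 0))
host, port = listener.getsockname()
rtt = tcp_rtt(host, port)
print(f"RTT: {rtt * 1000:.3f} ms")
listener.close()
```

Run it a few times and take the minimum; a single connect can be inflated by scheduling noise or SYN retransmits.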

[+] nailer|4 years ago|reply
Are you using the version of Windows with the fix mentioned in the article?
[+] Groxx|4 years ago|reply
Yea, Dropbox on my Macs has continuously been outrageously slow at uploading. Everything else is multiples faster.

Dropbox does at least resume fairly reliably though, so I can generally ignore it the whole time... unless I have something I want to sync ASAP. Then I sometimes use the web UI and cross my fingers that I don't get a connection hiccup ಠ_ಠ

[+] Dylan16807|4 years ago|reply
> Edit: should mention Google's DriveFS can reach max speed too, but it's not available for my personal account (which uses the "Backup and sync for Google" app).

That thing is far too aggressive about network bandwidth. It will upload 20 files at the same time and the speed limit setting doesn't work.

[+] encryptluks2|4 years ago|reply
Google Drive is a gem. I hope it lasts forever because no one is competing with them.
[+] brundolf|4 years ago|reply
Dropbox always publishes such good technical blog posts. And as a user, it's reassuring to see how much they still care about technical excellence.
[+] whatever_dude|4 years ago|reply
Do they? I constantly see Dropbox taking days to sync files that are 30 KB in size. Or doing dumbfounding things like downloading all files, then re-uploading all files, when I set sync to "online only" on a folder if just one of the files is not set to online only.

Maybe they have grand academic visions and papers, but I've been using them for well over a decade and I feel the client quality has gone downhill over the past few years. They keep adding unnecessary stuff like a redundant file browser while the core service suffers.

[+] emmericp|4 years ago|reply
The real root cause for all that flow director mess and core balancing is that there's a huge disconnect between how the hardware works and what the socket API offers by default.

The scaling model of the hardware is rather simple: hash over packet headers and assign a queue based on the hash. And each queue should be pinned to a core by pinning the interrupts, so you get easy flow-level scaling. That's called RSS. It's simple and effective. What it means is: the hardware decides which core handles which flow. I wonder why the article doesn't mention RSS at all?

Now the socket API works in a different way: your application decides which core handles which socket and hence which flow. So you get cache misses if you don't take into account how the hardware is hashing your flows. That's bad. So you can do some work-arounds by using flow director to explicitly redirect flows to cores that handle things, but that's just not really an elegant solution (and the flow director lookup tables are small-ish).

I didn't follow kernel development regarding this recently, but there should be some APIs to get a mapping from a connection tuple to the core it gets hashed to on RX (hash function should be standardized to Toeplitz IIRC, the exact details on which fields and how they are put into the function are somewhat hardware- and driver-specific but usually configurable). So you'd need to take this information into account when scheduling your connections to cores. If you do that you don't get any cache misses and don't need to rely on the limited capabilities of explicit per-flow steering.

Note that this problem will mostly go away once TAPS finally replaces BSD sockets :)
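A small illustration of the disconnect described above: on Linux an application can at least observe which CPU the kernel's RX path used for a socket, via the SO_INCOMING_CPU option, and compare it with where its own thread runs. A sketch over loopback (hedged: the constant 49 is Linux's value for builds of Python that don't expose the name):

```python
import socket

# SO_INCOMING_CPU reports the CPU that last processed packets for a socket.
# A mismatch with the handling thread's CPU is exactly the cache-miss case
# discussed above. 49 is the Linux constant (Linux >= 3.19).
SO_INCOMING_CPU = getattr(socket, "SO_INCOMING_CPU", 49)

# Loopback connection so the sketch is self-contained
listener = socket.create_server(("127.0.0.1", 0))
client = socket.create_connection(listener.getsockname())
server, _ = listener.accept()
client.sendall(b"ping")
server.recv(4)

rx_cpu = server.getsockopt(socket.SOL_SOCKET, SO_INCOMING_CPU)
print("RX CPU:", rx_cpu)

for s in (client, server, listener):
    s.close()
```

Observing is the easy half; steering the connection so it hashes to the right queue (the Toeplitz exercise mentioned elsewhere in the thread) is the hard half.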

[+] SaveTheRbtz|4 years ago|reply
We didn't mention RSS/RPS in the post mostly because they are stable. (Albeit relatively ineffective in terms of L2 cache misses.) FlowDirector, OTOH, breaks that stability and causes a lot of migrations, and hence a lot of re-ordering.

Anyways, nice reference for TAPS! For those wanting to dig into it a bit more, consider reading an introductory paper (before the myriad of RFC drafts from the "TAPS Working Group"): https://arxiv.org/pdf/2102.11035.pdf

PS. We went through most of our low-level web-server optimization for the Edge Network in an old blogpost: https://dropbox.tech/infrastructure/optimizing-web-servers-f...

[+] tims33|4 years ago|reply
I appreciate seeing a support and engineering org going this deep to resolve this kind of issue. Normally this is the stuff you waste hours on with a support org only to get told to clear your cookies and cache one more time.

In particular, the collaboration with Microsoft was great. I wonder what it took to make that happen.

[+] Seattle3503|4 years ago|reply
Has Dropbox ever experimented with SCTP or other protocols that don't enforce strict ordering of packets? I know some middleboxes struggle with SCTP (they expect TCP or UDP), but in that case you can do SCTP over UDP or have a fallback.
[+] SaveTheRbtz|4 years ago|reply
Sadly, middleboxes are a real problem, esp. with our Enterprise customers. We had this problem even with the HTTP/2 rollout, so there is even a special HTTP/1.1-only mode in the Desktop Client for environments where h2 is disabled.

In the future we are planning on adding HTTP/3 support, which will give us pretty much the same benefits as SCTP with better middlebox compatibility.

[+] mrpippy|4 years ago|reply
API Monitor is really useful, but unfortunately is closed-source and hasn't been updated in a few years.
[+] stephc_int13|4 years ago|reply
Is TCP the best choice? Why not UDP?
[+] bob1029|4 years ago|reply
This is a good question in my opinion.

Theoretically, UDP would be the best choice if you had the time & money to spend on building a very application-specific layer on top that replicates many of the semantics of TCP. I am not aware of any apps that require 100% of the TCP feature set, so there is always an opportunity to optimize.

You would essentially be saying "I know TCP is great, but we have this one thing we really prefer to do our way so we can justify the cost of developing an in-house mostly-TCP clone and can deal with the caveats of UDP".

If you know your communications channel is very reliable, UDP can be better than TCP.

Now, I am absolutely not advocating that anyone go out and do this. If you are trying to bring a product like Dropbox to market (and you don't have their budget), the last thing you want to do is play games with low-level network abstractions across thousands of potential client device types. TCP is an excellent fit for this use case.
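To make the tradeoff concrete, here is roughly what step one of that in-house layer looks like: a toy stop-and-wait sketch over UDP, with sequence numbers and retransmission. This is an illustration only; real protocols (QUIC, SCTP-over-UDP) add windows, congestion control, connection state, and much more:

```python
import socket
import threading

def reliable_send(sock, dest, payloads, timeout=0.2, retries=5):
    """Stop-and-wait: prefix a 1-byte sequence number, retransmit until acked."""
    sock.settimeout(timeout)
    for seq, payload in enumerate(payloads):
        packet = bytes([seq % 256]) + payload
        for _ in range(retries):
            sock.sendto(packet, dest)
            try:
                ack, _ = sock.recvfrom(16)
                if ack and ack[0] == seq % 256:
                    break  # acked; move on to the next payload
            except socket.timeout:
                continue  # datagram or ack lost: retransmit
        else:
            raise TimeoutError(f"no ack for seq {seq}")

# Loopback receiver that acks each datagram with its sequence byte
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv_sock.bind(("127.0.0.1", 0))
send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def acker(n):
    for _ in range(n):
        data, addr = recv_sock.recvfrom(2048)
        recv_sock.sendto(data[:1], addr)

t = threading.Thread(target=acker, args=(2,))
t.start()
reliable_send(send_sock, recv_sock.getsockname(), [b"hello", b"world"])
t.join()
done = True
print("all payloads acked")
```

Even this toy needs timeouts, retries, and duplicate-ack handling; extending it to windowed transfer and congestion control is where the "time & money" goes.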

[+] willis936|4 years ago|reply
It's an ideal application of TCP. Dropbox servers are continually flooded by traffic from clients, so the good congestion behavior from TCP is valuable. There is also less need to implement error detection/correction/retransmission in higher layers.
[+] jandrese|4 years ago|reply
Bulk data transfer is TCP's bread and butter. This is the protocol living the dream.
[+] pansinghkoder|4 years ago|reply
Thumbs up! Most video transmissions happen using protocols written over UDP.
[+] arduinomancer|4 years ago|reply
Sure if you want to re-build TCP yourself on top of UDP
[+] michaelmcmillan|4 years ago|reply
And reimplement TCP on top? Would not recommend.
[+] rootsudo|4 years ago|reply
"Dropbox is used by many creative studios, including video and game productions. These studios’ workflows frequently use offices in different time zones to ensure continuous progress around the clock. "

Honestly I don't understand these orgs that don't go with the OneDrive/O365 suite. What product value does Dropbox have when competing within Microsoft's own ecosystem?

[+] mwcampbell|4 years ago|reply
I wonder how the Dropbox developers managed to get in contact with the Windows core TCP team. Maybe I'm too cynical, but I'm surprised that Microsoft would go out of their way to work with a competitor like this.
[+] toast0|4 years ago|reply
Even if OneDrive vs Dropbox is important, this is a win for Windows in general. People will switch OSes because the TCP throughput is better on the other side; it's easy to measure and easy to compare and makes a nice item in a pros and cons list.

Fixing something like this can help lots of use cases, but may have been difficult to spot, so I'm sure the Windows TCP team was thrilled to get the detailed, reproducible report.

[+] paxys|4 years ago|reply
Microsoft is a massive and highly compartmentalized company. Windows kernel developers have no reason to see Dropbox as a competitor.
[+] tyingq|4 years ago|reply
Interesting. Is the Dropbox client still an obfuscated Python app? I'm curious if they spawn new processes for simultaneous uploads, since they probably aren't threading.
[+] freerk|4 years ago|reply
How come Linux doesn't have this issue? Why did Microsoft have to fix TCP with the RACK-TLP RFC when both the Linux and macOS implementations did fine already?
[+] the8472|4 years ago|reply
The Linux implementation has had rapid acknowledgements and tail loss probe for a long time. I think it was prototyped there by Google.
[+] ovebepari|4 years ago|reply
I would've said "it's not our problem to solve"
[+] chokeartist|4 years ago|reply
I got excited when I saw that fancy Microsoft Message Analyzer tool and wanted to try it out. Sadly it appears to be retired and removed by MSFT? Sad!
[+] hyperrail|4 years ago|reply
Yeah, I have no idea either why Microsoft would want to remove Message Analyzer completely, even if they could not maintain it. You can still download it through the Internet Archive:

* 32-bit x86: https://web.archive.org/web/20191104120802/https://download....

* 64-bit x86: https://web.archive.org/web/20190420141924/http://download.m...

(those links via: https://www.reddit.com/r/sysadmin/comments/e4qocq/microsoft_... )

Or use the even older Microsoft utility Network Monitor, which is still available on Microsoft's website: https://www.microsoft.com/en-us/download/details.aspx?id=486...

Supposedly Microsoft is working on adding to the existing Windows Performance Analyzer (great GUI tool for ETW performance tracing) to display ETW packet captures, which will succeed Message Analyzer and Network Monitor: https://techcommunity.microsoft.com/t5/networking-blog/intro...

[+] jabroni_salad|4 years ago|reply
It's really too bad. I'm happy enough to use Wireshark, but I liked that MMA could filter by PID.