What's missing from all these figures is the resulting latency. It's often the case that vendors show impressive throughput numbers, but the latency at that throughput is terrible.
We do look at latency figures to check how we're doing, and I want to dig into this area more over time. In particular, we don't do classful prioritization right now - the typical tests for this are often focused on multi-flow classification. We also don't set specific congestion control algorithms on our interfaces right now - their availability varies across platforms, as does the cost of using them. As Jordan documents in the post, the tests in the blog were all explicitly run over cubic.
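As a sketch of what pinning a congestion control algorithm looks like on Linux (the availability caveats above are part of why this isn't done by default - the list of algorithms varies by kernel build, and these commands need root):

```shell
# List the congestion control algorithms this kernel offers
sysctl net.ipv4.tcp_available_congestion_control

# Pin cubic as the system default (requires root)
sysctl -w net.ipv4.tcp_congestion_control=cubic

# iperf3 can also select one per test run, without changing the default;
# <server> is a placeholder for the target host
iperf3 -c <server> -C cubic
```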
We increased the sizes of the UDP buffers in the prior round of optimizations. The kernel defaults for UDP buffers are too small to approach the throughput discussed here, and those default sizes were the primary source of dropped packets. I raised them to 7 MB, which seems like an odd number, but it's the largest value you can set on macOS before the kernel rejects it - we'll likely head for a per-platform split eventually. At these speeds a 7 MB buffer represents up to ~5 ms of flow data, though that doesn't mean it creates 5 ms of bufferbloat - only that the enlarged buffer could itself account for that much delay in the worst non-lossy case. On the userspace side Tailscale also has more buffer space now (we read and write lists of packets at a time, not single packets), but the sizing there is more complex.
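For the curious, a sketch of the arithmetic behind that buffer-to-delay figure, plus the Linux sysctl ceilings that correspond to this kind of bump (macOS caps the equivalent via kern.ipc.maxsockbuf; the exact values here are illustrative, not Tailscale's shipped settings):

```shell
# Illustrative Linux ceilings, e.g. in /etc/sysctl.d/ (7 MB = 7340032 bytes):
#   net.core.rmem_max = 7340032   # max UDP receive buffer
#   net.core.wmem_max = 7340032   # max UDP send buffer

# How much flow time 7 MB of buffered data represents at an assumed
# 10 Gbit/s line rate: bytes * 8 / bitrate, in milliseconds
awk 'BEGIN { printf "%.1f ms\n", 7 * 1024 * 1024 * 8 / 10e9 * 1000 }'
```

At higher line rates the same buffer represents proportionally less time, which is why this is a worst-case bound rather than a fixed cost.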
This topic in general is much more complex. In the first throughput post I originally started to dig into it, but we cut that section in editing because it was making the post too dense and there wasn't space to give the topic the attention it deserves. One day we'll cover it properly. Right now we typically add very little latency - low single-digit milliseconds or less - and we actually add more jitter than latency, as any userspace program would. That's still orders of magnitude below the levels that concern even a typical realtime application such as gaming or communications - for example, someone was recently talking about using Tailscale on their Steam Deck while on vacation to play Hogwarts streaming from their PC.
In the meantime, a real world example for you. I have a border router that I built using a relatively cheap piece of hardware (Intel(R) Celeron(R) J4105 CPU @ 1.50GHz). It has NICs that support GRO/GSO, but the CPU is the bottleneck for throughput. The box does 563 Mbits/sec inbound to the LAN over Tailscale (949 Mbits/sec raw). I run this as an exit node for my workstation all the time, even though it's in the same building, for the sake of diagnosing bugs and experiencing the product full time. In my initial test today, under peak load the exit node added 35ms of latency each way. I was surprised by this, so I checked the direct path (not via the exit node) and saw 15ms down and 30ms up of added latency under peak load. It seems Comcast dropped some capacity since I last tuned my uplink!
I then re-tuned CAKE on the router uplink to be more aggressive, resulting in raw bufferbloat of 0ms/0ms, and retested with the Tailscale exit node. With these more aggressive CAKE tunings, Tailscale also stayed at 0ms/0ms. The tuning ate a chunk of throughput capacity, as expected. The specific tuning here is for a Comcast 1000/40 link, with the system CPU-bound at 500 Mbps for forwarding:
+ tc qdisc add dev internet root handle 1: cake docsis ack-filter-aggressive nat bandwidth 40mbit lan
+ ip link add name ifbinternet type ifb
+ tc qdisc add dev internet handle ffff: ingress
+ tc qdisc add dev ifbinternet root cake bandwidth 500mbit lan
+ ip link set ifbinternet up
+ tc filter add dev internet parent ffff: matchall action mirred egress redirect dev ifbinternet
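To check that a shaper set up like this is actually engaging, CAKE exposes its accounting via tc (interface names as in the commands above); the drop and backlog counters should move once traffic flows through the shaped paths:

```shell
# Egress shaper stats (the cake qdisc on the uplink)
tc -s qdisc show dev internet

# Ingress shaper stats (the cake qdisc on the ifb redirect device)
tc -s qdisc show dev ifbinternet
```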
On the LAN side, between the same machines (fq_codel only, default settings), running iperf3 alongside ping:
Under max load ([ 5] 0.00-57.73 sec 3.72 GBytes 554 Mbits/sec receiver):
10 packets transmitted, 10 received, 0% packet loss, time 9013ms
rtt min/avg/max/mdev = 2.625/3.620/4.536/0.646 ms
Zero load:
10 packets transmitted, 10 received, 0% packet loss, time 9014ms
rtt min/avg/max/mdev = 0.648/0.954/1.713/0.306 ms
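For anyone wanting to reproduce this kind of loaded-latency measurement, the basic shape is an iperf3 run saturating the path while ping samples RTT alongside it; the hostname here is a placeholder for the machine under test:

```shell
# Saturate the path in the download direction (-R) for 60 seconds
iperf3 -c borderrouter -t 60 -R &

# Give the transfer a moment to ramp up, then sample RTT under load
sleep 1
ping -c 10 borderrouter

wait   # let the iperf3 run finish
```

Comparing the ping summary from this run against one taken at zero load gives the added-latency-under-load numbers quoted above.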
What do these numbers mean? In practice they mean you'll notice WiFi more than you'll notice Tailscale, but we can and will still do better over time. Here's WiFi from a MacBook to the border router on the same LAN segment (no WireGuard/Tailscale):
10 packets transmitted, 10 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 3.845/11.363/34.152/8.940 ms
This is already long for an HN response, and there's so much more to say, but I hope it helps!
Very curious to learn more about CAKE tuning with Tailscale - would love to see a post someday about how the two interact and when/why it might be needed.
raggi|2 years ago
nikisweeting|2 years ago
dtaht|2 years ago