
Linux Kernel Tuning for C500k

158 points | superjared | 15 years ago | blog.urbanairship.com

25 comments

[+] evgen|15 years ago|reply
Some other tricks that were not touched upon in the article, but which may apply depending on the nature of your traffic:

1) If you have lots of short connections and you want to tune the amount of time that the kernel will keep half-closed connections around then you can play around with changing the values of net.ipv4.tcp_fin_timeout, net.ipv4.tcp_tw_reuse, net.ipv4.tcp_tw_recycle, and net.ipv4.tcp_max_tw_buckets.
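Tip 1 in sysctl.conf form might look like this; the values are illustrative, not recommendations:

```
# /etc/sysctl.d/ fragment -- values are illustrative, tune per workload
net.ipv4.tcp_fin_timeout = 15          # seconds to hold orphaned FIN-WAIT-2 sockets
net.ipv4.tcp_tw_reuse = 1              # reuse TIME_WAIT sockets for new outbound connections
net.ipv4.tcp_max_tw_buckets = 1000000  # cap on TIME_WAIT sockets before the kernel drops them
```

One caution: net.ipv4.tcp_tw_recycle is known to break clients behind NAT, so it deserves extra care before you turn it on.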

2) If you have a modern NIC then you probably need to tweak the txqueuelen in your ifconfig options.
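Tip 2 is a one-liner; the interface name and queue length here are illustrative:

```
# raise the device transmit queue length (eth0 and 10000 are examples)
ip link set dev eth0 txqueuelen 10000
# or, with the ifconfig syntax the comment refers to:
ifconfig eth0 txqueuelen 10000
```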

3) If you get hits from a large number of random browsers then sometimes setting net.ipv4.tcp_no_metrics_save and net.ipv4.tcp_moderate_rcvbuf to turn off caching of flow metrics helps.

4) Increase net.core.somaxconn to increase your listen queue size.

5) If you have a local firewall like iptables in place make sure you increase net.ipv4.ip_conntrack_max, add NOTRACK rules for your high-traffic ports, and play around with all of the various net.ipv4.netfilter.ip_conntrack_tcp_timeout_* settings.
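Tips 4 and 5 as a sysctl.conf sketch. Note that newer kernels expose the conntrack knobs under net.netfilter.nf_conntrack_* rather than the older net.ipv4 names used above; all values are illustrative:

```
net.core.somaxconn = 4096                                 # tip 4: bigger listen queue
net.netfilter.nf_conntrack_max = 1048576                  # tip 5: more tracked connections
net.netfilter.nf_conntrack_tcp_timeout_established = 600  # expire idle tracked flows sooner
```

For the NOTRACK part, a rule like `iptables -t raw -A PREROUTING -p tcp --dport 8080 -j NOTRACK` (the NOTRACK target lives in the raw table; the port is hypothetical) skips connection tracking entirely for that traffic.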

[+] bnoordhuis|15 years ago|reply
Good tips. The only thing I would recommend against is setting net.core.somaxconn too high - a too large backlog at a time when your server is already resource constrained might just push it over the brink.
[+] sophacles|15 years ago|reply
Just a technical note on the 64K myth section. My understanding is that TCP connections are identified by the tuple (remote_host, remote_port, local_host, local_port), so a single client can have 64k unique connections to each port on a remote machine.

If that is actually the case, the document gets its myth correction wrong (by a lot) :)

Can anyone clarify this?

[+] superjared|15 years ago|reply
You are right. The part I didn't really make clear is that we only serve on the single external port. Were we to use multiple, then yes, we could have 64k * 64k per IP pair.
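The arithmetic behind this exchange, spelled out (using the usual rounded 64k port space):

```python
# A TCP connection is identified by the 4-tuple
# (client_ip, client_port, server_ip, server_port).
client_ports = 64 * 1024  # at most ~64k source ports per client IP
server_ports = 1          # the article's setup: one public listening port

# With one server port, each distinct client IP can still hold ~64k
# connections; the 64k limit is per (client IP, server port) pair,
# not a global cap on the server.
per_client_ip = client_ports * server_ports
print(per_client_ip)  # 65536
```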
[+] nwmcsween|15 years ago|reply
This isn't as relevant on newer kernels: since roughly 2.6.26 these settings are sized dynamically based on memory, so the kernel tunes them from usage and there's no need to tweak them by hand. The only real issue is making sure you buy a high-end network card that offloads as much as possible, to avoid that many context switches per second (I don't know what it is exactly with netpoll).
[+] chrisbolt|15 years ago|reply
My, how things have changed since the C10k problem...

http://www.kegel.com/c10k.html

[+] superjared|15 years ago|reply
The C10k solutions are effectively the same as for C500k, those being epoll (Linux), kqueue (BSD), etc. Our Java NIO server utilizes epoll to handle C500k.
[+] metachris|15 years ago|reply
The approaches stayed pretty much the same:

1. Serve many clients with each thread

2. Serve one client with each server thread

3. Build the server code into the kernel
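Approach 1 can be sketched in a few lines of Python; `selectors.DefaultSelector` picks epoll on Linux and kqueue on BSD, and a socketpair stands in for one of the many accepted client connections:

```python
import selectors
import socket

# One thread multiplexing many connections through the kernel's
# readiness API (epoll/kqueue, whichever DefaultSelector chooses).
sel = selectors.DefaultSelector()

# A socketpair stands in for a real accepted client connection.
client, conn = socket.socketpair()
sel.register(conn, selectors.EVENT_READ)

client.sendall(b"ping")

# One pass of the event loop: handle every socket the kernel says is ready.
for key, _ in sel.select(timeout=1):
    data = key.fileobj.recv(4096)
    key.fileobj.sendall(data.upper())  # trivial per-socket "work"

reply = client.recv(4096)
print(reply)  # b'PING'
```

A real server registers thousands of sockets with the one selector and loops forever; the point is that a single thread only pays for sockets that are actually ready.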

[+] metachris|15 years ago|reply
Interesting post, thanks for sharing!

About the suggested sysctl.conf settings: I think you'd also need to adjust net.core.rmem_max and net.core.wmem_max in order for the net.ipv4.tcp_rmem and net.ipv4.tcp_wmem settings to be effective.

Furthermore it couldn't hurt to increase net.core.netdev_max_backlog, which is the maximum number of packets queued on the input side, when the interface receives packets faster than kernel can process them.
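A sysctl.conf sketch of these two suggestions (values illustrative):

```
net.core.rmem_max = 16777216         # ceiling for socket receive buffers (SO_RCVBUF)
net.core.wmem_max = 16777216         # ceiling for socket send buffers (SO_SNDBUF)
net.core.netdev_max_backlog = 30000  # input-side queue when the NIC outpaces the kernel
```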

[+] superjared|15 years ago|reply
Regarding the `net.core` parameters. We do modify those, but my assumption (probably wrong) was that the `net.ipv4` changes would override the core configs. I'll take a look and update the post. Good point about `netdev_max_backlog`, I need to read up on that one too.

Thanks for the feedback!

[+] JoachimSchipper|15 years ago|reply
Linking this with the IPv6 stuff currently on the front page: note that none of this would be necessary if the clients were running IPv6 (or otherwise un-NAT-ed) - the server could simply send them a UDP packet or even open a TCP connection.
[+] ashish01|15 years ago|reply
This is interesting stuff. I jumped into node.js programming a while ago and would like to run similar tests on node.js. Can anyone tell me how a client-side load of 500k long-lived connections is achieved? Is there a standard set of programs for this, or just custom scripts?
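There's no single standard tool; people mostly write small clients that open sockets and just hold them. A minimal Python sketch of the idea, with a hypothetical local listener standing in for the server under test (a real run needs a much larger N, a raised `ulimit -n`, and multiple client machines or source IPs to get past ~64k connections per source address):

```python
import socket

# Hypothetical local listener standing in for the server under test.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(128)
port = listener.getsockname()[1]

# Open N idle connections and simply hold them open.
N = 100  # scale this up for a real C500k-style test
clients = [socket.create_connection(("127.0.0.1", port)) for _ in range(N)]
print(len(clients))  # 100
```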
[+] plainOldText|15 years ago|reply
I'm wondering: what are the major side effects of this? Hmm?
[+] robotadam|15 years ago|reply
A good question. Shrinking TCP buffer sizes can have a negative performance impact when sending large amounts of data; our use case was keeping track of a large number of mostly silent connections, and so we benefit from the smaller memory footprint.
[+] c00p3r|15 years ago|reply
btw, the most common source of high load is (surprise!) disk I/O.

So, moving /var/log (not just /var) onto a separate device connected to a distinct controller port is a big deal.

If you're running, say, a mail server, you should separate /var/spool, /var/log, and /var/db/mysql (if any).

Partitioning, a serious network card (think Broadcom), and big CPU caches are good things to begin with.
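In fstab terms, the split described here might look like this (device names and mount options are hypothetical):

```
# put busy log/spool/database trees on their own spindles/controllers
/dev/sdb1  /var/log       ext4  noatime  0 2
/dev/sdc1  /var/spool     ext4  noatime  0 2
/dev/sdd1  /var/db/mysql  ext4  noatime  0 2
```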

[+] ciupicri|15 years ago|reply
Broadcom on Linux?! I was under the impression that their drivers aren't too good or open source friendly.
[+] c00p3r|15 years ago|reply
LOL! This is called tuning nowadays?

Even Oracle provides much better advice, let alone some individual pros.

Good starting point: http://www.puschitz.com/InstallingOracle10g.shtml

Update: Oh yes, I understand. Newcomers don't know what Oracle is. MySQL = RDBMS, I see. ^_^