item 9693620

How We Designed for Performance and Scale

293 points | fcambus | 10 years ago | nginx.com | reply

62 comments

[+] rdtsc|10 years ago|reply
Btw, there is a new feature in the kernel to help avoid using the shared-memory accept mutex: EPOLLEXCLUSIVE and EPOLLROUNDROBIN.

This should round robin accept in the kernel, and not wake up all the epoll listeners.

https://lwn.net/Articles/632590/

[+] 15155|10 years ago|reply
Isn't this exactly what EPOLLET (edge triggering) does?
[+] nailer|10 years ago|reply
This is a lovely article, but:

> The fundamental basis of any Unix application is the thread or process. (From the Linux OS perspective, threads and processes are mostly identical; the major difference is the degree to which they share memory.)

It's better to be specific in performance discussions, rather than use 'thread' and 'process' interchangeably.

Beyond the memory sharing the article mentions, threads (which are called Lightweight Processes, or LWPs, in Linux's `ps`) are granular.

    ps -eLf
NLWP in the output above is 'number of lightweight processes', i.e. the number of threads.

Processes are not granular: they're one or many threads. IIRC it can be beneficial to assign threads of the same process to the same physical core or same die for cache affinity. There's all kind of performance stuff where 'threads' and 'processes' do not mean the same thing. Being specific is rad.
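The same distinction is visible from inside a program: every thread shares one PID but has its own kernel thread ID, which is exactly what ps counts in NLWP. A quick illustration (assuming Python 3.8+ for `threading.get_native_id`):

```python
import os
import threading

# Three threads: one process, three lightweight processes (LWPs).
tids = []

def record_tid():
    tids.append(threading.get_native_id())  # kernel thread ID (the LWP id)

threads = [threading.Thread(target=record_tid) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(os.getpid())   # a single PID for the whole process
print(sorted(tids))  # three distinct TIDs, one per LWP
```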

[+] alexchamberlain|10 years ago|reply
In Linux, they are both just entries in the process table. Both are created by the `clone` syscall; the "normal" ways of creating them simply share different amounts of resources by _default_.

You're right to say that treating them differently can be beneficial in some situations, but it really depends.
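The default-sharing difference is easy to demonstrate: after a fork, the child writes to its own copy-on-write memory, whereas threads in one process see a single heap. A sketch of the fork half (POSIX only):

```python
import os

# fork() gives the child a copy-on-write view of memory: a write in the
# child does not appear in the parent, unlike threads sharing one heap.
value = [1]
pid = os.fork()
if pid == 0:
    value[0] = 2      # mutates only the child's private copy
    os._exit(0)
os.waitpid(pid, 0)
print(value[0])       # still 1 in the parent
```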

[+] afarrell|10 years ago|reply
Aside: you really nailed the "I think they're saying something significantly incorrect but don't want to be a jerk about it" tone. Kudos.
[+] cshimmin|10 years ago|reply
> You can reload configuration multiple times per second (and many NGINX users do exactly that)

I thought this was an interesting remark. Can anyone clue me in to what these "many users" might be doing, that requires them to reload configuration so frequently?

[+] threeseed|10 years ago|reply
Service discovery. Dynamically generating routing rules so microservices can have pretty URLs. For example, with us, when you deploy a new version of a microservice, it starts on a random port, registers with Consul on startup, and then dynamically regenerates and reloads the Nginx config.
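The regenerate-and-reload step can be as small as templating an `upstream` block and signaling the nginx master, which re-reads its config on SIGHUP. A sketch of that loop; the template, function names, and service list here are illustrative, not Consul's or consul-template's actual API:

```python
import os
import signal

def render_upstreams(services):
    # services: list of (host, port) pairs pulled from the service registry
    lines = "\n".join(f"    server {h}:{p};" for h, p in services)
    return "upstream app {\n" + lines + "\n}\n"

def reload_nginx(conf_path, services, master_pid):
    # Rewrite the generated config, then ask the nginx master to reload it.
    with open(conf_path, "w") as f:
        f.write(render_upstreams(services))
    os.kill(master_pid, signal.SIGHUP)  # nginx re-reads config on SIGHUP

print(render_upstreams([("10.0.0.1", 31400), ("10.0.0.2", 31822)]))
```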
[+] fweespeech|10 years ago|reply
I know of at least one VPS company that built its load balancer SaaS offering on NGINX, and it automatically reloads whenever a node is added/removed.

So I would assume it's that function that causes such rapid reloads. [e.g. if you had a 50-node pool, you might go +/-5 over the span of a second and change the config 5 times in 1s]

[+] earless1|10 years ago|reply
I front my docker fleet with Nginx and I need to reload config for new hosts
[+] Igglyboo|10 years ago|reply
Dynamically generating urls, like tumblr or github pages? (just a guess)
[+] nodesocket|10 years ago|reply
"NGINX’s binary upgrade process achieves the holy grail of high-availability; you can upgrade the software on the fly, without any dropped connections, downtime or interruption in service."

Is this really true? I remember seeing an article[1] recently on using an iptables hack to prevent dropping connections when reloading haproxy. Does nginx actually provide zero-downtime configuration reloads?

[1] https://medium.com/@Drew_Stokes/actual-zero-downtime-with-ha...

[+] ploxiln|10 years ago|reply
I didn't want to be a negative nancy in the comments for that haproxy article... but that is a ridiculous, ugly hack. It's really a lot easier to achieve real, robust zero-downtime upgrades for a simple Unix process.

Remember, fork()ed and exec()ed processes inherit file descriptors (except those marked CLOEXEC), including the listen() fd. Pending connections will queue in the kernel until userspace calls accept() on the listening fd.

So one simple model is to stop calling accept(), cleanly/quickly finish up current connections, set an environment variable to tell the future instance that the listening fd X is already open, and exec your own binary again.

A more complicated one is to fork, have the (identical) child just finish the current connections, while the parent execs itself similar to above. (The client connection fds should be marked CLOEXEC in this case.)

With a more complicated service with more moving parts, libraries, threads, getting the above to work out is more complicated. But that's basically how you want to do it.
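The first (exec-yourself) model above fits in a few lines. In this sketch the `LISTEN_FD` variable name and the bind address are made up for illustration, not anyone's actual mechanism:

```python
import os
import socket
import sys

def get_listener():
    fd = os.environ.get("LISTEN_FD")
    if fd is not None:
        # Post-exec: re-adopt the listening socket the old instance left open.
        lsock = socket.socket(fileno=int(fd))
    else:
        lsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        lsock.bind(("127.0.0.1", 0))  # ephemeral port, for the sketch
        lsock.listen(128)
    lsock.set_inheritable(True)  # clear O_CLOEXEC so the fd survives exec
    return lsock

def upgrade(lsock):
    # Stop accept()ing, drain in-flight connections, then replace this
    # process image; pending connections keep queueing in the kernel
    # against the still-open listening fd until the new image accepts them.
    os.environ["LISTEN_FD"] = str(lsock.fileno())
    os.execv(sys.executable, [sys.executable] + sys.argv)
```

The second model (fork, parent execs, child drains) adds a `fork()` before the `execv` and marks per-client fds close-on-exec so only the listener crosses over.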

[+] gull|10 years ago|reply
The holy grail of high-availability isn't upgrading stateless software. It's upgrading stateful ones.

Like upgrading when a data structure changes between versions. The HTTP protocol nginx serves is stateless and by comparison far simpler. Same goes for Erlang. It offers nothing more than simple function replacement, and that's not enough to handle data structure changes either.

[+] gregham|10 years ago|reply
Yes, it's true. NGINX has done zero-downtime configuration reloads and binary upgrades since 2004, without any dirty hacks.
[+] 0x0|10 years ago|reply
The IRC client irssi has something like this in /upgrade: it spawns a new binary, passing along the active socket connections and their associated IRC room state, I believe.
[+] MCRed|10 years ago|reply
If it's the holy grail, then erlang has had it for a couple decades... and whether it actually works in nginx I don't know, but it does work in erlang.

The way it's handled is that the process where the previous socket was connected remains in place until it terminates but all new sockets connect to the new code.

[+] amelius|10 years ago|reply
One thread per CPU and non-blocking I/O: that sounds like the usual way to approach the problem. I'm surprised it uses state machines to handle the non-blocking I/O, because modern software engineering provides much more pleasant approaches, such as coroutines.
[+] nly|10 years ago|reply
Coroutines have the distinct disadvantage of needing a stack, much like threads. So-called 'stackless' coroutines aren't really so different from computed gotos in a state machine.
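Python generators make that equivalence concrete: a generator is a stackless coroutine the runtime compiles down to exactly the kind of resumable state machine nginx hand-writes in C. A toy parser consuming a length-prefixed message that arrives in arbitrary chunks:

```python
def read_message():
    # State 1: accumulate a 4-byte big-endian length prefix.
    buf = b""
    while len(buf) < 4:
        buf += yield            # suspend until more bytes arrive
    n = int.from_bytes(buf[:4], "big")
    # State 2: accumulate the n-byte body.
    body = buf[4:]
    while len(body) < n:
        body += yield
    return body[:n]

# Drive it with data arriving in awkward pieces, as a socket would deliver.
g = read_message()
next(g)  # advance to the first yield
result = None
for chunk in (b"\x00\x00", b"\x00\x05he", b"llo!"):
    try:
        g.send(chunk)
    except StopIteration as done:
        result = done.value
print(result)
```

Each `yield` is where the hand-written C version would save its state enum and return to the event loop; the generator keeps that bookkeeping implicit without needing a full per-connection stack.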
[+] blt|10 years ago|reply
nginx is written in C though - I know it's possible to do coroutine-type stuff with setjmp/longjmp, but isn't that considered risky?
[+] McElroy|10 years ago|reply
I did as they said at the bottom and gave them my e-mail and other personal details so I could download the eBook they were giving free preview copies of - "Building Microservices". Unfortunately, they sent a link to a PDF only, so it's not usable for me. Just a heads-up to others, so you save yourself the time of discovering that. (I'll just wait until the book is finished and then buy it so I get the ePub. I like O'Reilly and have bought many books there before.)
[+] justincormack|10 years ago|reply
The book is out now from O'Reilly - it's a good book.
[+] simi_|10 years ago|reply
I had a feeling I've read about nginx before: http://aosabook.org/en/nginx.html

The whole book is worth a read, although I found some sections painfully boring (perhaps my limited attention span is to blame).

[+] the_why_of_y|10 years ago|reply
There's now a third volume, "The Performance of Open Source Applications", and it has a chapter on another high-performance HTTP server, Warp:

http://www.aosabook.org/en/posa/warp.html

Interesting what kind of performance one can get out of GHC nowadays. The article says the authors of Warp had to implement a new parallel IO manager for GHC to get there, but that was merged into GHC 7.8.

[+] saurabhtandon|10 years ago|reply
Interesting overview. I wish they had included some comparative data to show the significance and efficiency of this approach vs. other/older approaches.
[+] dschiptsov|10 years ago|reply
They forgot to mention the pool-allocated buffers, zero-copy strings, and a very clean, layered codebase: every syscall was counted.

The original nginx is a rare example of the best in software engineering: deep understanding of principles and an almost obsessive attention to detail (which is obviously good). Its success is justified.