Note that I picked really small messages here--integers--to give node the best possible serialization advantage.
$ time node cluster.js
Finished with 10000000
real 3m30.652s
user 3m17.180s
sys 1m16.113s
Note the high sys time: that's IPC. Node also uses only 75% of each core. Why?
$ pidstat -w | grep node
11:47:47 AM 25258 48.22 2.11 node
11:47:47 AM 25260 48.34 1.99 node
96 context switches per second.
Compare that to a multithreaded Clojure program using a LinkedTransferQueue, which easily eats 97% of each core. Note that the times here include ~3 seconds of compilation and JVM startup.
$ time lein2 run queue
10000000
"Elapsed time: 55696.274802 msecs"
real 0m58.540s
user 1m16.733s
sys 0m6.436s
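The Clojure source is in the gist linked below, but the heart of the queue benchmark is roughly this in plain Java (a sketch with made-up names, using the same LinkedTransferQueue underneath--not the actual benchmark code):

```java
import java.util.concurrent.LinkedTransferQueue;

// Sketch: a producer thread hands n integers to a consumer over a
// LinkedTransferQueue. Everything stays in one address space: no
// serialization, no syscall per message.
public class QueueBench {
    static long run(int n) {
        LinkedTransferQueue<Integer> q = new LinkedTransferQueue<>();
        Thread producer = new Thread(() -> {
            for (int i = 0; i < n; i++) q.put(i); // in-process handoff
        });
        producer.start();
        long received = 0;
        try {
            for (int i = 0; i < n; i++) {
                q.take();                          // blocks in-process
                received++;
            }
            producer.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return received;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        long count = run(10_000_000);
        System.out.println("Finished with " + count + " in "
            + (System.nanoTime() - t0) / 1_000_000 + " ms");
    }
}
```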
Why is this version over 3 times faster? Partly because it requires only 4 context switches per second.
$ pidstat -tw -p 26537
Linux 3.2.0-3-amd64 (azimuth) 07/29/2012 _x86_64_ (2 CPU)
11:52:03 AM TGID TID cswch/s nvcswch/s Command
11:52:03 AM 26537 - 0.00 0.00 java
11:52:03 AM - 26540 0.01 0.00 |__java
11:52:03 AM - 26541 0.01 0.00 |__java
11:52:03 AM - 26544 0.01 0.00 |__java
11:52:03 AM - 26549 0.01 0.00 |__java
11:52:03 AM - 26551 0.01 0.00 |__java
11:52:03 AM - 26552 2.16 4.26 |__java
11:52:03 AM - 26553 2.10 4.33 |__java
And queues are WAY slower than compare-and-set, which involves basically no context switching:
$ time lein2 run atom
10000000
"Elapsed time: 969.599545 msecs"
real 0m3.925s
user 0m5.944s
sys 0m0.252s
$ pidstat -tw -p 26717
Linux 3.2.0-3-amd64 (azimuth) 07/29/2012 _x86_64_ (2 CPU)
11:54:49 AM TGID TID cswch/s nvcswch/s Command
11:54:49 AM 26717 - 0.00 0.00 java
11:54:49 AM - 26720 0.00 0.01 |__java
11:54:49 AM - 26728 0.01 0.00 |__java
11:54:49 AM - 26731 0.00 0.02 |__java
11:54:49 AM - 26732 0.00 0.01 |__java
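Clojure's atom bottoms out in compare-and-set on the JVM. A rough Java sketch of the same pattern (hypothetical names, not the gist's code): two threads race a shared counter to a target with lock-free CAS retry loops.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: lock-free counting via compare-and-set. Contended CAS retries
// burn some CPU, but nothing blocks, so the scheduler almost never has
// to switch anyone out.
public class CasBench {
    static long run(int threads, long target) {
        AtomicLong counter = new AtomicLong();
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            ts[t] = new Thread(() -> {
                while (true) {
                    long v = counter.get();
                    if (v >= target) return;         // done
                    counter.compareAndSet(v, v + 1); // retry on contention
                }
            });
            ts[t].start();
        }
        try {
            for (Thread th : ts) th.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(run(2, 10_000_000));
    }
}
```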
TL;DR: node.js IPC is not a replacement for a real parallel VM. It allows you to solve a particular class of parallel problems (namely, those which require relatively infrequent communication) on multiple cores, but shared state is basically impossible and message passing is slow. It's a suitable tool for problems which are largely independent and where you can defer the problem of shared state to some other component, e.g. a database. Node is great for stateless web heads, but is in no way a high-performance parallel environment.
ryah|13 years ago
Yep, serialized message passing between threads is slower--you didn't have to go through all that work. But it doesn't matter, because that's not the bottleneck for real websites.
Also Node starts up in 35ms and doesn't require all those parentheses - both of which are waaaay more important.
aphyr|13 years ago
Threads and processes both require a context switch, but on POSIX systems the thread switch is considerably less expensive. Why? Mainly because the process switch involves changing the VM address space, which means a TLB flush: all that hard-earned cache has to be fetched from DRAM again. You also pay a higher cost in synchronization: every message shared between processes requires crossing the kernel boundary. So not only do you have higher memory use for shared structures and higher CPU costs for serialization, but also more cache churn and context switching.
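To make the kernel-boundary point concrete, here's a sketch (hypothetical names) pushing integers through an OS pipe. Every write and read is a syscall, unlike the in-process queue handoff in the LinkedTransferQueue benchmark:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;

// Sketch: each message through a pipe crosses the kernel boundary
// (one syscall to write, at least one to read).
public class PipeCost {
    static long run(int n) {
        try {
            Pipe pipe = Pipe.open();
            Thread writer = new Thread(() -> {
                try {
                    ByteBuffer out = ByteBuffer.allocate(4);
                    for (int i = 0; i < n; i++) {
                        out.clear();
                        out.putInt(i);
                        out.flip();
                        while (out.hasRemaining()) pipe.sink().write(out); // syscall
                    }
                    pipe.sink().close();
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            writer.start();
            ByteBuffer in = ByteBuffer.allocate(4);
            long received = 0;
            reading:
            while (true) {
                in.clear();
                while (in.hasRemaining()) {               // reassemble one int
                    if (pipe.source().read(in) == -1) break reading; // syscall
                }
                in.flip();
                in.getInt();
                received++;
            }
            writer.join();
            return received;
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("received " + run(100_000));
    }
}
```

Same machine, same messages--only the channel changes, and the per-message cost jumps.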
> it's all serialization - but that's not a bottleneck for most web servers.
I disagree, especially for a format like JSON. In fact, every web app server I've dug into spends a significant amount of time on parsing and unparsing responses. You certainly aren't going to be doing computationally expensive tasks in Node, so messaging performance is paramount.
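For intuition: even the cheapest possible wire format--integers as strings--costs something per message. A sketch (hypothetical names) of the encode/decode work the IPC path has to do on every single message:

```java
// Sketch: per-message serialization cost in its cheapest form. Each
// message is encoded to a String and parsed back, as string-based IPC
// effectively must do; richer formats like JSON only cost more.
public class SerdeCost {
    static long roundtrip(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            String wire = Integer.toString(i); // "serialize"
            sum += Integer.parseInt(wire);     // "deserialize"
        }
        return sum;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        long sum = roundtrip(10_000_000);
        System.out.println(sum + " in "
            + (System.nanoTime() - t0) / 1_000_000 + " ms");
    }
}
```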
> i'd love to hear your context-switching free multicore solution.
I claimed no such thing: only that multiprocess IPC is more expensive. Modulo syscalls, I think your best bet is gonna be n-1 threads with processor affinities, taking advantage of CAS/memory-fence capabilities on modern hardware.
> this is the sanest and most pragmatic way to serve a web server from multiple threads
bascule|13 years ago
It's more than that. Several processes with small, independently garbage-collected heaps are not as efficient as a single process with a large heap, parallel threads, and a modern concurrent GC (e.g. the JVM's ConcurrentMarkSweep collector).
In addition to that, processes severely inhibit the usefulness of in-process caches. Where threads would allow a single VM to have a large in-process cache, processes generally prevent such collaboration and mean you can only have multiple, duplicated, smaller in-process caches. (Yes, you could use SysV shared memory, but that's also fraught with issues)
The same goes for any type of service you would like to run inside a particular web server that could otherwise be shared among multiple threads.
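A sketch (hypothetical names) of what a single shared in-process cache buys you: with threads, an expensive load runs once per key for all workers, instead of once per key per process.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: workers (threads) share one in-process cache. computeIfAbsent
// is atomic per key, so the "expensive" loader runs exactly once per key
// no matter how many workers ask for it.
public class SharedCache {
    static int loadsFor(int workers, int keys) {
        ConcurrentHashMap<Integer, Integer> cache = new ConcurrentHashMap<>();
        AtomicInteger loads = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                for (int k = 0; k < keys; k++) {
                    cache.computeIfAbsent(k, key -> {
                        loads.incrementAndGet(); // the expensive load
                        return key * 2;
                    });
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return loads.get();
    }

    public static void main(String[] args) {
        // loads happen once per key, shared by all workers
        System.out.println(loadsFor(4, 1000));
    }
}
```

With separate processes, each one would perform its own loads into its own duplicated cache.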
aphyr|13 years ago
Node.js: https://gist.github.com/3200829
Clojure: https://gist.github.com/3200862
What is this I can't even.
rektide|13 years ago
Web serving is OK & all, but I'd love it if node could be an ideal runtime for petri-nets and webworker meshes too.