600k concurrent websocket connections on AWS using Node.js (2015)

[+] sciurus|6 years ago|reply

A few pieces of advice based on running https://github.com/mozilla-services/autopush-rs, which handles tens of millions of concurrent connections across a fleet of small EC2 instances.

1) Consider not running the largest instance you need to handle your workload, but instead distributing it across smaller instances. This allows for progressive rollout to test new versions, reduces the thundering herd when you restart or replace an instance, etc.

2) Don't set up security group rules that limit what addresses can connect to your websocket port. As soon as you do that connection tracking kicks in and you'll hit undocumented hard limits on the number of established connections to an instance. These limits vary based on the instance size and can easily become your bottleneck.

3) Beware of ELBs. Under the hood an ELB is made of multiple load balancers and is supposed to scale out when those load balancers hit capacity. A single load balancer can only handle a certain number of concurrent connections. In my experience ELBs don't automatically scale out when that limit is reached. You need AWS support to manually do that for you. At a certain traffic level, expect support to tell you to create multiple ELBs and distribute traffic across them yourself. ALBs or NLBs may handle this better; I'm not sure. If possible, design your system to distribute connections itself instead of requiring a load balancer.

2 and 3 are frustrating because they happen at a layer of EC2 that you have little visibility into. The best way to avoid problems is to test everything at the expected real user load. In our case, when we were planning a change that would dramatically increase the number of clients connecting and doing real work, we first used our experimentation system to have a set of clients establish a dummy connection, then gradually ramped up that number of clients in the experiment as we worked through issues.
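
One way to read "distribute connections itself" is to have each client choose its own server from a known list. The sketch below uses rendezvous (highest-random-weight) hashing for that. It is an illustration only, not how autopush does it: the function name `pick_endpoint`, the client ID format, and the `example.invalid` hostnames are all assumptions made up for this example.

```python
import hashlib


def pick_endpoint(client_id, endpoints):
    """Pick one endpoint for a client using rendezvous hashing.

    Every (endpoint, client) pair gets a pseudo-random score and the client
    connects to the highest-scoring endpoint. Removing an endpoint only moves
    the clients that were assigned to it.
    """
    def score(endpoint):
        # Hypothetical scoring scheme: hash of "endpoint|client_id".
        return hashlib.sha256(f"{endpoint}|{client_id}".encode()).digest()

    return max(endpoints, key=score)


# Hypothetical hostnames, for illustration only.
ENDPOINTS = [
    "ws-a.example.invalid",
    "ws-b.example.invalid",
    "ws-c.example.invalid",
]

if __name__ == "__main__":
    print(pick_endpoint("client-42", ENDPOINTS))
```

The appeal of this scheme is that replacing one small instance only forces that instance's clients to reconnect, which fits the thundering-herd point in 1).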

[+] Thaxll|6 years ago|reply

If you have a lot of traffic you shouldn't use an ELB in the first place; you should be using an NLB by default. It comes with a 40Gb pipe and supports multiple millions of connections.

https://docs.aws.amazon.com/elasticloadbalancing/latest/netw...

[+] gravypod|6 years ago|reply

> 1) Consider not running the largest instance you need to handle your workload, but instead distributing it across smaller instances. This allows for progressive rollout to test new versions, reduces the thundering herd when you restart or replace an instance, etc.

I came here to say this. Horizontal compute is a miracle.

[+] dylanz|6 years ago|reply

Could you explain or post a link to something about #2? I’ve never heard of that before!

[+] gchamonlive|6 years ago|reply

I recently played around with Athena for load balancer logs, and with SQL it is easy to cross-reference many connection entries from the load balancer. Do you believe this could help in getting more visibility to spot the bottleneck problems stated in 2 and 3?

[+] joaojeronimo|6 years ago|reply

Some dude in 2012 did 1 million active connections with whatever node.js and v8 versions we had at that time :) http://blog.caustik.com/2012/08/19/node-js-w1m-concurrent-co...

Edit: I wasn't aware but there was some build-up to it: - 100k connections: http://blog.caustik.com/2012/04/08/scaling-node-js-to-100k-c... - 250k connections: http://blog.caustik.com/2012/04/10/node-js-w250k-concurrent-...

[+] mping|6 years ago|reply

Yeah, we used nginx-push-stream-module[1] to support 1M connections with lower boxes. Websocket-as-a-service. Really cool module. Was a realtime project for a live TV contest where people could participate with their phones.

[1] https://github.com/wandenberg/nginx-push-stream-module

[+] fasteo|6 years ago|reply

Did not know about this module. Instead, we are using a pretty identical one [1]. Works flawlessly

[1] https://github.com/slact/nchan

[+] cpursley|6 years ago|reply

So 1/4 of what Elixir/Erlang can handle, but more difficult and less reliable:

https://phoenixframework.org/blog/the-road-to-2-million-webs...

[+] SEMW|6 years ago|reply

Gary was using a much more powerful instance (both in CPU and memory) than this post uses when he reached 2 million with Phoenix. But Phoenix is also doing quite a lot more per connection than Node. Basically, for multiple reasons they're incomparable experiments; there's really no basis here to draw conclusions like '1/4 of what Elixir/Erlang can handle' from.

[+] folkhack|6 years ago|reply

"but more difficult" sounds pretty subjective... I can get behind the "less reliable" comment because Erlang is rock solid.

Sorry to be that guy, but I can find a TON more people capable of supporting ECMA vs. Erlang and/or Elixir. I say this as a huge fanboy of both.

I know I'll get rolled on HN for saying this because we all drink the optimization Koolaid but I feel this is worth mentioning.

[+] truth_seeker|6 years ago|reply

Not fair!

You are comparing "4 CPUs and 15GB of memory" for NodeJS with "40 CPUs and 128 GB of memory" for Elixir/Phoenix

[+] conradfr|6 years ago|reply

Both articles from 2015.

[+] truth_seeker|6 years ago|reply

There is a more performant WebSocket implementation than the one mentioned in the blog. It can handle 6x more connections with much less memory.

https://github.com/uNetworking/uWebSockets.js

EDIT:

Note that the blog post is from 2015. Many optimizations (the Ignition and TurboFan pipeline) have been made in V8 since then, especially offloading GC activity to threads separate from the Node.js main thread.

[+] gbuk2013|6 years ago|reply

Unfortunately it is written by someone whose technical ability far exceeds his people skills, which are essential for a library module that developers can safely depend on.

This is doubly unfortunate because I very much share his views on bloated frameworks etc. :(

[+] desireco42|6 years ago|reply

There is that Phoenix thing when they famously did 2 Million.

https://phoenixframework.org/blog/the-road-to-2-million-webs...

I agree with the suggestion that smaller instances that can be scaled are not a bad idea.

[+] anildigital|6 years ago|reply

I don't get the point of using Node.js when compared to something like Elixir. Elixir's Phoenix can handle a larger number of concurrent connections and also provides reliability, better programming abstractions, distribution, and a pretty good language.

[+] athenot|6 years ago|reply

There are a handful of languages that would be suitable for this. The "right one" to use depends on more than just the language features for that task:

- What third party libraries do you need to use? Some languages have very good support for some, and less for others.

- What are the internal integrations you need to support? Can they be over the network or are you calling into code in a particular language?

- What is the pool of skills available to you as a team? Do you go with a language that has a reputation of being really good for this task but of which the team knows very little (and therefore will have a learning curve working out the common pitfalls), or do you go with a better understood language which the team has already mastered, and stretch it to go beyond what mere mortals do with it? Note: there's no right answer here, both options have severe drawbacks.

- Related to the previous: what's your company's culture regarding technical diversity?

[+] Thaxll|6 years ago|reply

This is actually not true. If you take a really optimized C/C++ library for WS, Node.js will crush Elixir by a large margin. Elixir/Erlang is slow and consumes a lot of memory compared to more native languages or C libraries.

Ex: https://github.com/uNetworking/uWebSockets

[+] np_tedious|6 years ago|reply

I don't understand people eating hot dogs. Hamburgers have...

[+] monus21|6 years ago|reply

It's not about which is the best technology. The barrier to entry for Node.js is next to nothing and the size of the ecosystem is incomparable to Elixir. Which means tons of companies will go with Node and feed back to the ecosystem and so on...

[+] runj__|6 years ago|reply

Because javascript is one of the most popular programming languages in the world and the pool of Elixir programmers is almost non-existent?

[+] IggleSniggle|6 years ago|reply

I am totally with you on Elixir/ Phoenix. But using Node.js is about leveraging frontend devs to get productive on the backend fast. At least, I think that’s the idea. Maybe also leveraging Google’s dependence on V8 (and thus all the engineering love it gets), although I don’t think that’s really a great argument compared to the EVM.

[+] z3t4|6 years ago|reply

The difference in performance between languages is very small, only about one order of magnitude, but it's usually possible to get two orders of magnitude better performance in any language by optimizing. E.g., if your manager won't allow you to cut the AWS bill 100x by doing some optimizations, he or she will certainly not allow you to rewrite everything in another language in order to cut bills by 10x.

[+] unknown|6 years ago|reply

[deleted]

[+] tschellenbach|6 years ago|reply

You could make the same argument comparing Elixir and Go, or Go and C++

[+] RossM|6 years ago|reply

Interesting details; it would be nice to see how those ulimit/networking numbers were arrived at.

The title should have [2015].

[+] dgelks|6 years ago|reply

Yep - title should've had [2015], my bad!

[+] jjtheblunt|6 years ago|reply

Genuinely missing the point question: isn't the number of concurrent socket (websocket or otherwise) connections just a function of the underlying OS and number of instances thereof, not a function of Node.js ?

[+] unilynx|6 years ago|reply

Even if the OS can take it, it requires proper engineering in your stack (ie, node.js here).

It wasn't long ago that even managing 10K connections on a server was considered quite a feat - see http://www.kegel.com/c10k.html

[+] nurettin|6 years ago|reply

The number of connections ultimately depends on the OS limits, assuming you have infinite RAM and CPU resources to start your sessions, do the handshakes, serve ping/pong packets, register events to event loops and fire timers.
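
One OS limit that is easy to inspect is the per-process open-file limit, since each socket consumes a file descriptor. The sketch below reads it with Python's standard `resource` module (Unix only) and checks it against a target. The helper names and the `reserved` headroom of 100 descriptors are assumptions for illustration; kernel-wide settings such as the system file table size are not covered here.

```python
import resource


def fd_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def fits(target_connections, soft_limit, reserved=100):
    """Check whether a target connection count fits under the soft limit.

    `reserved` is an assumed allowance for descriptors the process needs for
    other things (log files, listening sockets, pipes).
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return target_connections + reserved <= soft_limit


if __name__ == "__main__":
    soft, hard = fd_limits()
    print(f"soft={soft} hard={hard}")
    print("600k connections fit:", fits(600_000, soft))
```

On a default-configured machine the soft limit is often far below 600k, which is why posts like the one discussed here spend time on ulimit settings.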

[+] nly|6 years ago|reply

Doesn't sound so impressive. I've done close to a million on a single Digital Ocean droplet using nchan[0] before. Latency was reasonable even with that many connections; you just need to set your buffer sizes carefully. Handshakes are also expensive, so it's useful to be able to control the client and build in some smart reconnect/back-off logic.

[0] https://www.nginx.com/resources/wiki/modules/Nchan/
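
A minimal sketch of what "smart reconnect/back-off logic" on the client could look like: exponential backoff with full jitter, so that clients dropped at the same moment do not all handshake again at once. The function name and the `base` and `cap` defaults are assumptions for illustration, not values from the comment above.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    The ceiling doubles with every failed attempt up to `cap`, and the actual
    delay is drawn uniformly from [0, ceiling] to spread clients out in time.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(8):
        print(attempt, round(reconnect_delay(attempt), 2))
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests.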

[+] sixplusone|6 years ago|reply

Would be nice to compare all providers with something like https://www.techempower.com/benchmarks/ (Azure & AWS)

[+] axismundi|6 years ago|reply

Does anyone have a more recent experience? I currently use socket.io 2.2 with node v10.16, no v8 tweaks in a docker container. At ~1000 sockets, sometimes the server receives spikes of 8000 HTTP reqs/sec, which it has to distribute to the websockets, up to 100 msgs/sec, ~1kb/msg to each socket. These spikes are making the server unstable, socket.io switches most of the clients from websockets to xhr polling.

[+] dgelks|6 years ago|reply

I found this article very useful to resolve connections moving to xhr polling on our Node.js 10.16 server https://medium.com/@k1d_bl4ck/a-quick-story-about-node-js-so...

[+] iends|6 years ago|reply

I don't have any experiments to share, but you can go farther if you stop using socket.io, though I guess you need something to deal with long polling.

You should consider tweaking --max_old_space_size, we got a lot of mileage giving node more memory.

[+] truth_seeker|6 years ago|reply

Have you tried "sticky-cluster" ?

https://github.com/uqee/sticky-cluster

[+] mcintyre1994|6 years ago|reply

I thought this was going to be about AWS-managed websockets using API Gateway. I've been using that at a really small scale and it's got a great API but other than almost certainly being much more expensive than the EC2 machine used here I wonder how well it works with that sort of scale.

[+] sankha93|6 years ago|reply

I have written and deployed message queues in Node.js that take data from Redis and push it out on websockets. It is a pain to deal with the GC in weird cases. This was about 5 years ago, so the details might not be accurate.

Things worked fine until some day some customer started sending multi-megabyte strings over the system. It is difficult to actually track down that it is GC that is halting the system and then figuring out ways to fix the issue. We ended up not using JavaScript strings and instead using Node.js buffers to do the transport - I don't recall the Node.js library for Redis supporting that out of the box.

[+] winrid|6 years ago|reply

It's better but still a pain. The GC is so much more fragile than the JVM's.

Would have used nchan for that probably :p

[+] throwaway_bad|6 years ago|reply

Does anyone have experience doing the same on GCP?

In particular right now I am trying to add live reloading to my App Engine Standard app but Standard doesn't support long lived connections (so no websockets) and App Engine Flexible seems like it will be pricy.

I think I can set up a single separate websocket instance which is only responsible for doing listen/notify on postgres and telling the client when it's time to refetch from the main webserver again.

Does this sound approximately workable? Will I actually be able to reach the connection numbers like in this article?

[+] choffee|6 years ago|reply

I wonder how it would compare in terms of cost with doing it via the api-gateway. It would depend on how your app and user base scales and what the sockets are being used for I suppose.

https://docs.aws.amazon.com/apigateway/latest/developerguide...

[+] iends|6 years ago|reply

API Gateway is very expensive once you get sustained load. EC2 is much cheaper.

[+] ArtWomb|6 years ago|reply

Lots of wisdom on this page ;)

Just want to add. Real-world, often the predominant use case is not optimizing for "max-conns". But <100,000 concurrent users, who instead need to be connected for a very long time.

In this instance, I've found Caddy's websocket directive, inspired by Websocketd, to be quite robust and elegant. It's just a process per conn. Handling stdin, stdout style messaging ;)

[+] m3kw9|6 years ago|reply

But once they all start doing stuff then what happens? 600k is more of a function of memory right?
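
A rough back-of-the-envelope check of the memory point, using the "4 CPUs and 15GB of memory" figure quoted elsewhere in this thread. It assumes the 15 GB is 15 GiB and that all of it is available to connections, which overstates the real budget because the OS and the Node.js runtime need memory too. Treat the result as an upper bound.

```python
def per_connection_budget_kib(total_gib, connections):
    """Upper bound on memory available per connection, in KiB."""
    total_kib = total_gib * 1024 * 1024  # GiB -> KiB
    return total_kib / connections


if __name__ == "__main__":
    budget = per_connection_budget_kib(15, 600_000)
    print(f"{budget:.1f} KiB per idle connection at most")
```

That works out to roughly 26 KiB per connection, which leaves little room for per-connection buffers once clients start sending real traffic.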

[+] ArchReaper|6 years ago|reply

Why are websites still hijacking scroll behavior in 2019? I can't even take the article seriously with my scrolling bouncing and glitching all over the place.

[+] winrid|6 years ago|reply

A better solution would be to use nchan+nginx and then your Node API is just a simple stateless REST service. Will scale better and be easier to maintain.

[+] 1drr|6 years ago|reply

How do you test something like this internally?

[+] verttii|6 years ago|reply

Load testing tools like Tsung.

137 comments