The most important part of this article is the concept of back pressure and being able to detect it. It's common in a ton of other engineering disciplines, but especially important when designing fault-tolerant or load-balancing systems at scale.
Basically it is just some type of feedback so that you don't overload subsystems. One of the most common failure modes I see in load balanced systems is when one box goes down the others try to compensate for the additional load. But there is nothing that tells the system overall "hey there is less capacity now because we lost a box". So you overwhelm all the other boxes and then you get this crazy cascade of failures.
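That missing feedback signal can be sketched in a few lines (illustrative Python; `submit` and the queue size are hypothetical, not from the article): a bounded queue rejects work when the consumer falls behind, so the producer learns that capacity shrank instead of piling on.

```python
import queue

# Bounded buffer: its capacity limit *is* the backpressure signal.
jobs = queue.Queue(maxsize=3)

def submit(job):
    """Try to enqueue; fail fast instead of buffering unboundedly."""
    try:
        jobs.put_nowait(job)
        return True
    except queue.Full:
        # Feedback: the caller now knows capacity is exhausted and can
        # back off, shed load, or route elsewhere.
        return False

accepted = [submit(i) for i in range(5)]
print(accepted)  # [True, True, True, False, False]
```

Without that `queue.Full` signal, every producer keeps sending at full rate and the overload cascades exactly as described above.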
This is a good article about overload and back pressure. It also lists some tools in Erlang to solve these sorts of issues. It also mentions genstage (very) briefly.
Backpressure is more common than water across engineering disciplines, and yet it isn't an integrated part of every distributed system out there? Isn't that a bit of an oversight?
Hate to be a party pooper, but I'd like to give people here a more generic mental tool to solve this problem.
Ignoring Elixir and Erlang - when you discover you have a backpressure problem, that is, any kind of throttling on connections or req/sec, you need to immediately tell yourself "I need a queue", and more importantly "I need a queue that has prefetch capabilities". Don't try to build this. Use something that's already solid.
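"Prefetch" here means the broker hands each consumer at most N unacknowledged messages at a time (RabbitMQ exposes this as `basic_qos`). A toy in-memory simulation of the mechanism, not a real broker client:

```python
from collections import deque

class PrefetchQueue:
    """Toy broker: delivers at most `prefetch` unacked messages per consumer."""
    def __init__(self, prefetch):
        self.prefetch = prefetch
        self.pending = deque()
        self.unacked = 0

    def publish(self, msg):
        self.pending.append(msg)

    def deliver(self):
        """Hand out one message, or None if the consumer is at its limit."""
        if self.unacked >= self.prefetch or not self.pending:
            return None
        self.unacked += 1
        return self.pending.popleft()

    def ack(self):
        self.unacked -= 1

q = PrefetchQueue(prefetch=2)
for i in range(5):
    q.publish(i)

# A slow consumer only ever holds 2 messages; the rest stay queued.
first, second = q.deliver(), q.deliver()
stalled = q.deliver()          # None: prefetch window is full
q.ack()                        # consumer finishes one...
third = q.deliver()            # ...and immediately gets the next
print(first, second, stalled, third)  # 0 1 None 2
```

The prefetch window is what converts a fast producer and a slow consumer into a stable system: backlog accumulates in the broker, not in the consumer's memory.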
I solved this problem 3 years ago, pushing 5M msg/minute _reliably_ without loss of messages, and each of these messages was checked against a couple of per-user rules (to not bombard users with messages, to pick the best time to push to a user, etc.), so this adds complexity. Approved messages were later bundled into groups of 1,000 and passed on to GCM HTTP (today, Firebase/FCM).
I've used Java and Storm and RabbitMQ to build a scalable, dynamic, streaming cluster of workers.
You can also do this with Kafka but it'll be less transactional.
After tackling this problem a couple times, I'm completely convinced Discord's solution is suboptimal. Sorry guys, I love what you do, and this article is a good nudge for Elixir.
The second time I solved this, I used XMPP. I knew there were risks, because essentially I was moving from a stateless protocol to a stateful one. Eventually it wasn't worth the effort and I kept using the old system.
I think you misunderstand the problem we are solving here. We are not trying to solve this because our system can't handle the load; we are protecting it from when Firebase decides to slow down in a way that causes data to back up and OOM the system. Since these are push notifications with a time bound on their usefulness, we don't care about dumping to an external persisted queue like RabbitMQ or Kafka (we'd rather deliver newer notifications faster than wait for a backed-up buffer to flush). Firebase also only allows 1,000 concurrent connections per senderId, with 100 in-flight pushes (that have not received an ack) per connection, which means that only 100,000 can be in flight at once. Ultimately, if a remote service is applying backpressure because it is struggling, no amount of auto-scaling on your end is going to help you.
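The "deliver newer, drop older" policy maps naturally onto a bounded ring buffer. In Python, `collections.deque(maxlen=...)` silently evicts the oldest entry on append (illustrative sketch only; the actual service is Elixir/GenStage and the buffer size here is made up):

```python
from collections import deque

# Time-bounded usefulness: keep only the newest N pushes when the
# downstream (e.g. Firebase) applies backpressure and the buffer fills.
buffer = deque(maxlen=3)

for push in ["a", "b", "c", "d", "e"]:
    buffer.append(push)  # "a" and "b" are evicted as newer pushes arrive

print(list(buffer))  # ['c', 'd', 'e']
```

This is the opposite trade-off from a durable external queue, which would preserve "a" and "b" at the cost of delaying "e".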
This service buffers potential pushes for all users being messaged. It watches the presence system to determine if they are on their desktop or mobile (this is millions of presence watchers and tens of millions of buffered messages), users are constantly clearing these buffers by reading on their clients, and finally, when a user is offline or goes offline, we emit their pushes to them (which is what this article talks about). This service evolved from the push system of the game we worked on; when it did pushes only and no other logic, it could push at 1M/sec in batches, but its responsibility has changed.
Anyone have any nice links to describe more about queue prefetching as described in this case? My google skills are failing because of all the CPU related articles.
This worries me a bit. At the moment they are providing the software and hosting it all absolutely free.
I would happily still use Discord if they provided the exact same thing with a monthly fee. Hopefully at some point they throw in some extra features for a pro version and start charging.
I just use Discord for gaming and haven't used Slack a lot, but I think Discord will be great for work as soon as they release search.
If they need ideas for a pro version: I'd probably pay far too much to be able to record each user into individual audio files (recorded locally for each user and combined on my system) for podcasts, let's plays (YouTube videos), remote meetings, etc.
> Discord is always completely free to use with no gotchas. This means you can make as many servers as you want with no slot limitations.
> Wondering how we’ll make money? In the future there will be optional cosmetics like themes, sticker packs, and sound packs available for purchase. We’ll never charge for Discord’s core functionality. [0]
They have an awful lot of information about video gamers in conversation history. They could mine that data for game companies and sell it as a way to help companies build better, more addictive and mechanically pleasing games.
I sure hope one possible end-game is to re-skin it without the gaming focus and price it competitively to Hipchat/Slack.
As soon as screensharing lands it'll be an across-the-board upgrade from Hipchat at my work (the only missing feature I can think of is video calls, but quality over Hipchat has always been a bit sketchy, so we usually fall back to Hangouts).
They're in the gaming market; there's plenty of cash. Even if they just went the advertisement route, which I doubt they will, they should end up doing well.
That's awesome and it just goes to show how simple something can be that would otherwise involve a certain degree of concurrent (and distributed) programming.
"Obviously a few notifications were dropped. If a few notifications weren’t dropped, the system may never have recovered, or the Push Collector might have fallen over."
How many is a few? It looks like the buffer reaches about 50k, does a few mean literally in the single digits or 100s?
Good question. We don't have metrics on the exact number dropped. We're using an earlier version of GenStage that doesn't give any information about dropped events. Once we upgrade we'll have a better idea.
I was wondering the same thing. Dropping an unknown number of requests isn't all that impressive. It seems like a simpler approach would have been to use a Message Queue of some sort with pushers pulling items from the queue.
Not sure why you are getting downvoted; I came here to make the same comment.
QPM is a useless metric. When talking about distributed systems from an engineering point of view, you always want to use QPS. QPM is simply not fine-grained enough to show whether the traffic is bursty or not. For example, in this particular case, when you say 1M QPM that can mean anything - they might be idle for 50s and then get 100k QPS for the next ten seconds, or they might be getting 15k QPS all the time (as is visible on the graph). Distributed systems are designed for the peak workload, not for the average one. Using misleading numbers like QPM leads to bad design and sizing decisions.
The only case where you would use QPM, QPD and similar metrics is when you want to artificially show your numbers bigger than they are (10M transactions a day sounds better than 115 transactions a second). But those should be used by sales, not by engineers.
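The point is easy to see numerically: two traffic shapes with an identical ~1M QPM (the hypothetical numbers from the comment above) differ by roughly 6x in the peak QPS you actually have to size for.

```python
# Two per-second traffic shapes over one minute, both totalling ~1M queries.
steady = [16_667] * 60               # ~16.7k QPS for the whole minute
bursty = [0] * 50 + [100_000] * 10   # idle for 50s, then 100k QPS for 10s

for name, shape in [("steady", steady), ("bursty", bursty)]:
    qpm, peak_qps = sum(shape), max(shape)
    print(f"{name}: {qpm} QPM, peak {peak_qps} QPS")

# Same headline QPM; the bursty system needs ~6x the provisioned capacity.
```

Reporting only the per-minute sum erases exactly the number (peak QPS) that drives capacity planning.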
50k seems like a low bar to start losing messages at. If this was done with Celery and a decently sized RabbitMQ box, I would expect it to get into the millions before problems started happening.
These machines do more than just push. They also buffer messages for each individual user to "potentially" push if they don't read them on the desktop client. This happens before the flow this article talks about.
We currently have 3 machines doing this for millions of concurrent users. At the time this article was written it was 2 machines.
At some point, when a system has been in a failure mode for a while, it makes sense to start shedding load rather than attempting to deliver every single push notification. Also worth mentioning, a minute of downtime is already a million backed-up pushes. Beyond that, it becomes infeasible to attempt to deliver them.
Edit: Also worth mentioning, the 50k buffer is for a single server, we run multiple push servers in the cluster.
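Because a push has a time bound on its usefulness, shedding can also be expressed as an age check at dequeue time rather than a size cap. A sketch under assumed names (`drain`, `max_age`); timestamps are passed in explicitly so the example is deterministic:

```python
def drain(queue_items, now, max_age=60):
    """Deliver items younger than max_age seconds; shed the rest.

    queue_items: list of (enqueued_at, payload) tuples.
    """
    delivered, shed = [], []
    for enqueued_at, payload in queue_items:
        (delivered if now - enqueued_at <= max_age else shed).append(payload)
    return delivered, shed

backlog = [(0, "stale"), (50, "recent"), (90, "fresh")]
delivered, shed = drain(backlog, now=100, max_age=60)
print(delivered, shed)  # ['recent', 'fresh'] ['stale']
```

The nice property of shedding by age is that a long outage costs only `max_age` worth of deliveries, not an hour of queue-draining once the system recovers.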
At 15k notifications per minute, a million notifications would take 1hr to clear before the queue returns to normal. I would imagine they prefer to shed load early so notifications don't get delayed, hence the small buffer.
The issue was not the ability of their servers to handle the load, but the ability of Firebase to ingest the notifications - at least, that's how I read it.
I love Discord, and love Elixir too, so this is a pretty great post.
Unfortunate that the final bottleneck was an upstream provider, though it's good that they documented rate limits. I feel like my last attempt to find documented rate limits for GCM/APNS was fruitless, perhaps Firebase messaging has improved that?
It does, doesn't it? I used to use Ventrilo, but then they screwed our small group out of our server connection. And we happily used Dolby Axon for a while. We tried Google Hangouts... for a while, until it just really didn't work well (it disconnected and crapped out a lot). We tried using the Steam client's chat, but while it was okay for screensharing, it wasn't so great for chat.
But at some point we heard of Discord, which posed itself as a chat/vent replacement, started using it, and it just works. Which is huge, since the other stuff generally didn't (Axon was actually good).
They had a really well defined user-space, marketed at it well, and really nailed the user experience, while still being free for the typical user. There is a lot to love about Discord.
I ended up there for the first time last night and must say that there is a lot to like about it. I found some good communities and integrating media and so on all felt quite streamlined, and the system was snappy.
It's just too bad there are a dozen IMs/voice/video and a dozen slacky/feed companies.
I spend a lot of time in the PCMR Discord, which is pretty lively. The technology seems to be solid, while the UI has issues (notifications from half a day ago are really hard to find for example on mobile devices). Otherwise I'm on Discord every day and love using the service. I miss some slack features, but the VOIP is very good.
16k per second. 83k per second during peak (assuming 80/20 default traffic rule).
- 100 /s = typical limit of a standard web application (python/ruby), per core
- 1,000 /s = typical limit of an application running on a full system
- 10,000 /s = typical limit for fast systems (load balancers, DBs, haproxy, redis, tomcat...)
- Over 10,000 /s you gotta scale horizontally because a single box [shouldn't] can't take it
The difficulty depends on the architecture and what the application has to do (dunno, didn't go through the article). If you make something that can scale by just adding more boxes, then it's trivial: just add more boxes. Well, it's gonna cost money, and that's about it.
So no. Not a big deal at all... if you've done that before and you've got the experience :D
Everything is relative. In this case, it's not so much the actual load itself but rather the ability to throttle to match the upstream provider's throughput and limitations.
It is constant - but iirc, it'd be trivial to make a dynamically scaling pool. At the end of the day, a pusher is just a TCP connection. Keeping a pool of fixed size and planning capacity around scaling horizontally is a perfectly acceptable approach - given you know the potential throughput for each pusher.
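Given a known per-pusher throughput, sizing that fixed pool reduces to simple arithmetic (the rates and headroom factor below are illustrative, not Discord's numbers):

```python
import math

def pool_size(target_rate, per_pusher_rate, headroom=1.25):
    """Pushers needed to sustain target_rate, with some headroom for bursts."""
    return math.ceil(target_rate * headroom / per_pusher_rate)

# e.g. ~16,700 pushes/sec overall, each connection good for ~1,000/sec
print(pool_size(16_700, 1_000))  # 21
```

A dynamic pool would just re-run this calculation against the observed rate and open or close connections to match.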
Just wondering, what is the difference if I use two kinds of [producer, consumer] message queues (say RabbitMQ) instead of this? Does GenStage being an Erlang system make a difference?
RabbitMQ is written in Erlang too, but GenStage you use natively instead of bringing in and configuring a big dependency. It just comes with your language for free, without needing another process, etc.
How does one achieve this in Celery 4? I remember there was a Celery "batch" contrib module that allowed this kind of batching behavior, but I don't see that in 4.
> "Firebase requires that each XMPP connection has no more than 100 pending requests at a time. If you have 100 requests in flight, you must wait for Firebase to acknowledge a request before sending another."
So... get 100 firebase accounts and blast them in parallel.
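However many sender accounts you use, each connection still has to enforce the 100-pending rule locally; a counting semaphore is the usual shape. A sketch with hypothetical names (`send`, `on_ack`, `transport`), shown with a limit of 2 so the example stays small:

```python
import threading

class InFlightLimiter:
    """Cap concurrent unacknowledged requests on one connection."""
    def __init__(self, limit=100):
        self._slots = threading.BoundedSemaphore(limit)

    def send(self, request, transport):
        self._slots.acquire()       # blocks while `limit` requests are pending
        transport(request)

    def on_ack(self):
        self._slots.release()       # an ack frees a slot for the next send

sent = []
limiter = InFlightLimiter(limit=2)
limiter.send("push-1", sent.append)
limiter.send("push-2", sent.append)
limiter.on_ack()                    # without this ack, a third send would block
limiter.send("push-3", sent.append)
print(sent)  # ['push-1', 'push-2', 'push-3']
```

Fanning out across parallel connections multiplies the aggregate in-flight budget, but each one still needs its own limiter.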
Corollary: If you have 2 boxes, each of them has to be able to handle all the traffic, so you can't save money by using smaller boxes :D
Corollary #2: If you have 2 datacenters, each of them has to be able to handle all the traffic, so you burn a lot of money :D
Could you explain why using RabbitMQ is more transactional?
[1] "We've raised over $30,000,000 from top VCs in the valley like Greylock, Benchmark, and Tencent. In other words, we’ll be around for a while."
[0]: www.discordapp.com
GenStage has a lot of uses at scale. Even more so is going to be GenStage Flow (https://hexdocs.pm/gen_stage/Experimental.Flow.html). It will be a game changer for a lot of developers.
Makes me think of the Abraham Simpson quote: "My car gets 40 rods to the hogshead and that's the way I likes it!"
It seems to have totally taken over a space that wasn't even clearly defined before they got there.
The average per minute only gets used because many systems have so little load that the number per second is negligible.
I'd definitely be able to put to use things like flow limiters and queuing and such, but none of my company's projects use Elixir :(