I found the concept of affinity used for bunching messages to be new. The other concepts are not surprising; they are the typical engineering solutions.
One thing I am curious about: how does it compare to how the old postal systems used to handle Christmas and new year loads?
On a lighter note: perhaps you can predictively generate and cache messages at the receiver's end based on their contacts and their style of communication. When a sender actually sends a message, just send one bit across, and the local cache gets flushed and displayed :)
> I found the concept of affinity used for bunching messages to be new.
I'm reasonably sure that UUCP + things like INN were "batching messages per destination" for efficiency a long time ago. Nowhere near the same scale, obviously, but the same kind of concept, no?
Take the 16 most common messages, normalize them a bit, and encode them in 4 bits. I would guess that would cover over 50% of the messages sent, maybe even close to 80%.
It may be a good idea to provide a range of fixed templates for standard messages and send just the selected message ID, along with the recipient name, over the network.
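A minimal sketch of the idea in the two comments above. Everything here is made up for illustration (the codebook contents, the function names); a real system would have to agree on the codebook out of band and version it carefully.

```python
# Hypothetical codebook: the 16 most common (normalized) messages,
# each addressable by a 4-bit template ID. Anything else falls back
# to sending the full text.
CODEBOOK = [
    "happy new year", "merry christmas", "hi", "hello",
    "thanks", "ok", "lol", "good morning",
    "good night", "congrats", "happy birthday", "see you soon",
    "on my way", "call me", "miss you", "love you",
]
CODE = {msg: i for i, msg in enumerate(CODEBOOK)}

def encode(message: str):
    """Return ("template", id) if the normalized message is in the
    codebook (id fits in 4 bits), else ("raw", full_text)."""
    normalized = message.strip().lower().rstrip("!.")
    if normalized in CODE:
        return ("template", CODE[normalized])
    return ("raw", message)

def decode(kind, payload):
    # The receiver reconstructs the message from the shared codebook.
    return CODEBOOK[payload] if kind == "template" else payload
```

So "Happy New Year!" would travel as a 4-bit ID plus addressing, while an uncommon message is sent verbatim; the win depends entirely on how skewed the message distribution actually is.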
I haven't used FB Messenger, but I also hate features where they show if the other person is typing a response. I intentionally disable it in Slack, so the other person simply sees a response when I'm done with it.
Back in the day, when SMS text messages were the way to send messages over the mobile phone network (in the UK), people would jump the gun by sending 'Happy NY!' messages 5 minutes before midnight, because the moment 12am hit, any messages sent could be queued for hours as the mobile networks struggled to cope with the massive uptick in messages being sent at the same time.
I used to use bulk-sending tools (yeah, they existed as J2ME apps) to send 10-50 SMS (it's been a while, I'm not sure how many it was), and some would only arrive on January 1st quite a while into the day. At some point, it changed and everything would arrive just a few minutes after sending. I think that was when we already had iPhone and Android, and I'm not sure if it was because of messengers (were there any back then?) or because the German telcos finally upgraded their infrastructure enough.
As someone who typically works on front-end projects, this was a very interesting read. I particularly loved the discussion of “graceful degradation.” That’s the kind of collaboration across the stack that makes a service like Messenger very pleasant to use.
The photo caption says that's just the infrastructure team. I'd imagine the product team is much larger given how many features are crammed into Messenger.
Afaik the real team name is Messenger Foundation. FB has the concept of foundation teams, whose only goal is keeping a part of the service working no matter what.
/edit: been informed by some of the people in the picture that technically there are 2 teams there: Messenger Infra and Messenger Foundation
I didn't realize message queues were used for this type of task. I'm assuming you would then also use autoscaling pods that respond to the number of messages in the queue. How do you scale pods fast enough for a messaging application or anything else trying for 100ms or less per operation?
I think over-provisioning is a far more common and sane approach that can address the bulk of spikes, versus auto-scaling. Especially if you have big known events (New Year's Day, Black Friday, ...) where you can over-provision (or do controlled auto-scaling, if you will) for a short window.
My guess is they're doing both.
Anecdote time. I worked at a company where one project was over-provisioned on dedicated hardware and another auto-scaled in the cloud. The over-provisioned project was much cheaper, had significantly better response times and was easier to manage. It was load tested to handle over an order of magnitude more traffic than the all-time-peak and even though fully over-provisioned, it was cheaper than the baseline usage (and slower, and harder to manage) cloud solution.
Messaging queues are a core part of a lot of high-scale distributed systems (source: Twitter). You want enough queue space to handle the expected volume and then some. Assuming you have that, you don't need to instantly scale instances out to match the number of messages; you just need to catch up before the queue space runs out.
Message queues (or similar things like Kafka, which isn't quite a proper "message queue") are used for basically everything at this scale. Messages are passed indirectly: an event happens, it gets pushed onto a queue, and then the recipients do something with it.
One way is to over-allocate in the first place. When your spare pool drains below a watermark, you scale out. Hopefully there is enough time for that scale event to complete before the pool drains completely.
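The watermark idea above boils down to a one-line check (the 20% threshold here is invented for illustration): trigger a scale-out while there is still spare capacity left, so new instances have time to come up before the pool empties.

```python
# Hypothetical low-water mark: start scaling out once spare capacity
# drops below 20% of the total. The right threshold depends on how long
# a scale-out takes versus how fast the pool drains.
LOW_WATERMARK = 0.2

def should_scale_out(total_capacity: int, in_use: int) -> bool:
    """True when the spare pool has drained below the watermark."""
    spare_fraction = (total_capacity - in_use) / total_capacity
    return spare_fraction < LOW_WATERMARK
```

For example, with 100 instances provisioned and 85 busy, the spare fraction is 0.15 and a scale-out would be triggered; at 50 busy it would not.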
One thing I find very manipulative about Facebook is how, when someone sends you a message, the email notification has a link to open Messenger and says that Messenger is the only way you can read that message, even if you don't have Messenger installed. They are trying everything they can to get everyone to install that app. Yet of course you can just read and respond directly on their website without any app, but they don't link to that or mention it.
mbasic.facebook.com for anyone reading this and looking for a way to work around Facebook's dark UI pattern on mobile, where a click on the "messages" button wants you to install the app.
Essentially what everyone else does: distributed systems with load balancing, load balancing and more load balancing. And if that goes awry, triage, where they prioritize messages and simply time out and drop the lower-priority ones. Of course, the Messenger team is lucky in that they can drop messages, since your family and friends missing a "Happy New Years" message isn't the end of the world. Other systems (such as finance) aren't so lucky. Drop a few transactions or apply them out of order and it is the end of the world. It was an interesting read, though it would have been nice if there were more specifics, but I guess Facebook wouldn't approve that.
I don't know, a missing "Happy New Years" isn't missing dollars, but it's definitely not cool to drop such a greeting, or any message, in my opinion. It should at least be possible to store these messages and deliver them late. The baseline should be 100% deliverability, and anything less than that should be subject to intense scrutiny. I mean, how big of a Kafka cluster do you need to make this happen?
Actual messages with content are never dropped, only 'meta-messages' like read receipts. It's not critical if, in a group chat with many participants, the state of 'who's seen the last message' is not 100% correct on New Year's Eve.
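That triage policy is easy to sketch (event type names and the shape of the function are hypothetical, not Messenger's actual API): under load, meta-traffic is shed while message content is always kept.

```python
# Hypothetical load-shedding policy: actual messages are always kept;
# "meta" events like read receipts and typing indicators are droppable
# when the system is overloaded.
DROPPABLE = {"read_receipt", "typing_indicator"}

def triage(event_type: str, payload, overloaded: bool):
    """Return the event to deliver, or None if it was shed under load."""
    if overloaded and event_type in DROPPABLE:
        return None              # safe to lose; the UI catches up later
    return (event_type, payload) # always delivered
```

So on New Year's Eve a read receipt might silently disappear, but the "Happy New Year" message itself is always queued for delivery.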
> Of course the Messenger team is lucky in that they can drop messages since your family and friends missing a "Happy New Years" message isn't the end of the world. Other systems (such as finance)
Messenger is (also) a finance system. You can send money via Messenger. You can purchase products directly in Messenger. It's had all that for more than two years now.
From one of my projects (an MMORPG) I've learned that the required accuracy of non-financial transactions is often underestimated, while, on the other hand, financial transactions are often less critical than initially assumed. After all, compensation in financial transactions is often straightforward to calculate and apply. But the damage done by dropped/failed non-financial transactions is often hard to assess, and it's also more involved to find appropriate compensation.
Facebook gets its fair share of bad PR (some of it well deserved), but we shouldn't dismiss amazing engineering work because of that. This is a technical piece that highlights solutions to problems not many out there get to solve.
This is a 2018 internet-connected app, not a 1985 GSM network.
1 billion 100-byte messages adds up to an almost trivial 100 GB. That might be a technical challenge for the neighborhood web admin but not for any real company.
Off-topic: I hate those features to begin with, but I know I'm the product, not the customer, and those features exist to keep people on the app.
https://www.facebook.com/notes/facebook-engineering/chat-sta...
Are you on a web browser? Are any of your extensions mucking things up?