
Postmortem of Service Outage at 3.4M Concurrent Users

414 points| johnnyapol | 8 years ago |epicgames.com

174 comments

[+] mmanfrin|8 years ago|reply
When Bluehole take down PUBG for 5 hours, there's no communication outside of two tweets. When Epic see degraded performance for less than 2 hours, they give a postmortem.

There's a difference in the level of respect each company shows its customers. I play PUBG a lot, but I want to see Epic win in the long run.

[+] Arzh|8 years ago|reply
There might also just be the ability for the developers to understand what actually went wrong. I think the guys at Epic spent a lot of time building their system so they can understand it and I'm not sure that PUBG developers started out with that in mind.
[+] cjsawyer|8 years ago|reply
There isn’t even a custom message for “scheduled server maintenance” vs “our servers are busy” (which I can get multiple times a night anyway). You NEED to check the tweets to find out which it is. I love PUBG, I just wish it was made by more competent developers.
[+] Gonzih|8 years ago|reply
PUBG combines great game design with terrible technical implementation and bad community management. I'm with you on hoping that Epic will succeed in the long run.
[+] hesdeadjim|8 years ago|reply
Why does it need to be a zero sum game with one winner? Fortnite and PUBG are two very different games fundamentally and there is no reason they can't exist side-by-side.
[+] piyh|8 years ago|reply
I want the gunplay of PUBG with the support of Epic.
[+] arca_vorago|8 years ago|reply
A little counter to that narrative, Epic has really been letting me down lately.

When Epic started the UE4 project, they promised to take care of us GNU/Linux users. I bought in the second I could. Immediately they started ignoring us. For the first year and a half or so, most of us Linux people were using a community fork because Epic refused to merge changes. We banded together, were forced out of their IRC, and had to form another channel. I tried to be understanding, at first thinking they were low on resources. As time went on, games like Paragon came along (which I paid $20 for at release and which has now been abandoned; I've been waiting on my refund for over two weeks) and Epic started showing off how well things were going, yet they still basically abandoned us. There is still no marketplace/launcher on Linux, so to retrieve my $500 worth of assets I have to use a Windows computer. Major bugs persist in all branches, and not just for the native editor... games like PUBG would love to ship for Linux if the cross-compile toolchain wasn't an exercise in cryptic puzzle solving (which is why so many UE4 games are Windows-only, and not by choice). All pleas for more resources and love on the forums are met with comments about market share and how unworthy Linux is of their time and resources.

They promised us all this love and then, after many of us spent lots of money on them, they just ignored us. I'm thankful for the attempt at a native Linux editor, but crashes and uncompilable projects have essentially halted dev for me, to the point that I'm having to consider Godot and Blender even though they aren't nearly at feature parity. And due to Epic's licensing, all that money spent on assets is wasted, since those assets can only be used in UE4...

I love UE4 when it works. Blueprints, the animation and rigging system, the camera system are all wonderful to work with. I want to use it... but I'm feeling increasingly taken advantage of by Epic.

Tim, if you're reading this, how about a post mortem of the abandonware that is the Linux editor?

[+] jasonjayr|8 years ago|reply
> We run Fortnite’s dedicated game servers primarily on thousands of c4.8xlarge AWS instances, which scale up and down with our daily peak of players.

That's between $572,000 (500 instances, 30 days) and $2,863,800 (2500 instances, 30 days) per month at current prices, and seems like it's only for one aspect of their infrastructure.
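
The arithmetic behind those two figures can be reproduced with a short sketch. The hourly rate here is an assumption: it is simply the on-demand price implied by the upper figure ($2,863,800 / 2500 instances / 720 hours = $1.591), not a quote from any price list. Under that rate the lower bound comes out to about $572,760, which the comment appears to round to $572,000.

```python
# Assumed rate: the on-demand price implied by the comment's upper figure,
# i.e. 2,863,800 / (2500 instances * 720 hours). Not taken from a price list.
HOURLY_RATE_USD = 1.591
HOURS_PER_MONTH = 24 * 30  # the comment uses a 30-day month


def monthly_cost(instances: int, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost in USD of running `instances` machines around the clock for 30 days."""
    return instances * hourly_rate * HOURS_PER_MONTH


for count in (500, 2500):
    print(f"{count} instances: ${monthly_cost(count):,.0f} per month")
```

This treats every instance as running 24 hours a day, so it is an upper bound for a fleet that scales down off-peak.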

That seems .... excessive? Is that a typical spend with a game server system like this? That does seem to suggest that once this becomes less than profitable, it's all going away ...

[+] 013a|8 years ago|reply
c4.8xlarges are the smallest c4 instance that guarantees 10G networking performance.

Those costs can easily be cut in half with reservations. It's likely there's a lower bound on the number of reserved instances they use as a baseline performance guarantee, and then they use on-demand to scale above that.

No one at their scale actually pays the list price.

Basic math: There are ~100 players in each game. At a 30 tickrate, that's 3000 RPS per game minimum. Each of those requests likely involves a number of 3D math calculations, including hit detection, collision detection, real-time cheat detection, etc. Those updates then need to batch back to the players at 30 tick. All of this needs to happen with as little eventual consistency as possible; a difference of milliseconds degrades the player's confidence that the server is correctly calculating what is happening in-game.

Point being: multiplayer game programming is an entirely different beast than normal web programming. The same rules don't apply. It's an n^2 problem: resource utilization grows quadratically with the number of users in one "lobby", because you need to update every other user's game-state with the actions of each new player. Additionally, Battle Royale style games are the most demanding multiplayer games ever created. Games like WoW have way more players per realm, but servers only need to worry about the interactions of a select few in your surrounding area, and it doesn't have as stringent real-time requirements. Games like CoD only load in 5-20 players.
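
A rough sketch of the n^2 claim, using the 100-player, 30-tick numbers above. It assumes the worst case where every player's state is sent to every other player on every tick; real engines usually cut this down with relevancy or area-of-interest filtering, so treat the output as an upper bound. The function names are made up for illustration.

```python
def inbound_updates_per_second(players: int, tick_rate: int) -> int:
    """Client-to-server updates: one per player per tick."""
    return players * tick_rate


def fanout_updates_per_second(players: int, tick_rate: int) -> int:
    """Worst-case server-to-client state updates: each player's state
    is delivered to every other player on every tick (n * (n - 1))."""
    return players * (players - 1) * tick_rate


for lobby_size in (20, 100):
    print(
        lobby_size,
        inbound_updates_per_second(lobby_size, 30),
        fanout_updates_per_second(lobby_size, 30),
    )
```

Going from a 20-player lobby to a 100-player one multiplies inbound traffic by 5 (600 to 3,000 per second) but worst-case fan-out by about 26 (11,400 to 297,000 per second).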

[+] zemo|8 years ago|reply
I run game servers. Our intra-day swing gives us about 10x traffic during primetime hours compared to off hours. Autoscaling sheds a lot of overhead.
[+] eterm|8 years ago|reply
Perhaps they're doing some creative counting and thousands refers to all the instances spun up over a day rather than the actual concurrent boxes.
[+] tbrock|8 years ago|reply
Shit. Switching to c5 would save them over 100k/month at a minimum and net better performance to boot. I’d get on that.
[+] matt_s|8 years ago|reply
A lot of players of massive online games tend to get hand-wavey when there are problems and act like "dude just get more servers" is the answer.

This clearly shows how complex a system is needed that has to handle 3.4 million concurrent, connected users. I think the connected part compounds any scale problems you have since it is implied they are connected to each other.

[+] rightos|8 years ago|reply
> act like "dude just get more servers" is the answer.

The biggest problem here is that it largely used to be a distributed system. You used to just be able to run your own dedicated server on whatever provider you liked. The dev would just run a single server list. Now many game developers have decided they're the only ones who get to run servers - primarily because they can charge more for microtransactions and private servers this way, as far as I can tell.

It really hurts games - PUBG is the best example I've seen - constant lag issues, complete lack of server side checks for things like shooting through the ground (because hey, that costs CPU and every additional cloud server they need means less profit), etc. It's basically made the game unplayable.

Game developers are unfortunately stuck between immersion in their games and the rage that immersion leaves players with when technical issues occur. The more immersed your players are, the more rage they'll experience when your game crashes or lags at the wrong time.

[+] always_good|8 years ago|reply
Agreed. Gamers also are among the most entitled users I've ever had to deal with.

I'll never accept donations again for a game I've made available for free, given the number of problems someone has caused me after they gave me $5.

[+] pkilgore|8 years ago|reply
Love this because it shows two things 1) competent people are handling problems and 2) they actually care.

A whole lot better than spoon feeding customers bullshit for weeks while hamstringing your product rather than investing in it (looks at EA, mumbles about SimCity).

[+] fhood|8 years ago|reply
Let's not labor under the impression that Epic invested resources in fixing these issues purely out of the goodness of their hearts.

The success or failure of the recent Sim City game was hardly EA's number one concern. Fortnite, however, is probably extremely important to the folks at Epic.

I'm not defending EA here, but it was definitely in Epic's best interests not to piss off Fortnite's player base given that there are some other, ahem, similar titles on the market right now.

[+] JohnTHaller|8 years ago|reply
1. competent people are handling the problems

2. they actually care

3. they enjoy what they are doing

One thing I really miss working solo is working with a team of smart folks to solve a complex problem together. It's so damned fun.

[+] Thaxll|8 years ago|reply
Epic Games is not a publicly traded company, and there are a lot of things you can say when you're private.
[+] tweenagedream|8 years ago|reply
Disclaimer: I work on Google Cloud so I will be speaking from the bias of knowing those products.

They talk a lot about reducing operating complexity and scaling their infrastructure. I wonder what the cost of their current infrastructure + the staff to maintain it might be vs the managed solutions that cloud providers offer now.

For example, using Cloud Datastore, Spanner, or Bigtable as a persistence layer: these managed services can definitely scale to the current need, and I've seen them go much higher as well.

For log ingestion and analysis, BigQuery can be a very powerful tool as well, and with streaming inserts that data can be queried in near real time. For things that are less urgent, batch queries. For other things, Dataflow can help with streaming workloads.

I think one of the problems they alluded to, though, was that at the moment they're on a single provider, and what they're looking for is a multi-cloud strategy, which totally makes sense. A lot of the above products create some kind of lock-in, with some exceptions, like using HBase as an interface to Bigtable or Beam as an interface to Dataflow. Though I don't know what the other providers offer that may have these same interfaces.

Another option is Kubernetes, which I believe all providers are pretty strongly embracing. Having most of the supporting infrastructure be brought up with a few kubectl commands could help them scale across several cloud providers quickly.

[+] matt_s|8 years ago|reply
I think they detailed in the article that the problem isn't their game servers, which are AWS cloud based and can scale up; it is their login/setup/matchmaking (my term) server infrastructure, the first thing users encounter, that is having issues.

Usually there is a cost/scale threshold with managed providers where it is cheaper to DIY than to pay thousands upon thousands per month for, say, log ingestion.

[+] ShakataGaNai|8 years ago|reply
All of the managed products from the different cloud providers are, more or less, great. The problem is they are black boxes. When something goes wrong you're completely at the whim of support. Ever call Dell/Comcast support and want to tear your hair out? Yea... it's like that except neither AWS nor Google have phone numbers to call.

The other problem is that most of these things aren't easy to migrate to. AWS RDS is much easier because it's just a managed version of whatever you're already using. But Cloud Spanner? DynamoDB? You have to completely re-architect your application. Then you have to move your application, and data, to this new system... without massive outages. It's a lot of work and a lot of cost. So until things go HORRIBLY sideways, most companies don't have the spare time/money.

Been there, tried that.

[+] trevyn|8 years ago|reply
Agreed, Cloud Spanner is an impressive piece of technology, but question: If you build your business on Spanner and it does start having problems for whatever reason, what do you do? Obviously at this size you’d get great support from Google, but ultimately you rely on one managed provider and your hands are tied. That’s a tough situation to be in when you’re servicing 3.4M users.
[+] tlynchpin|8 years ago|reply
> .. currently unclear to us and support why our writes are being queued ..

You think GCP offers better support on Spanner et al. when a customer is having performance problems? In this case probably yes, because an Epic-sized monthly spend is highly effective at escalating through support.

It takes low effort to find <cloud persistence horror story> around here so we know cloud is not a special magic that is immune from integration performance problems. But the economic incentives are meaningfully different and especially so at runtime.

[+] victorqhong|8 years ago|reply
Really surprised that they use XMPP. Since you don't really hear anything about XMPP anymore, I think most people assumed that it's dropped off in usage/popularity (or people have moved to some other proprietary solution).

I've always thought that XMPP would be useful for games, just surprised to hear that people are actually doing it.

[+] Dolores12|8 years ago|reply
Riot Games is using XMPP to provide in-game chat for League of Legends.
[+] swaggyBoatswain|8 years ago|reply
I was playing Fortnite on 2-04-18 22:00 UTC during the "Friends Service" outage.

You couldn't see friends lists at all during that time period. So you couldn't queue up for a match with friends or people you knew at all; the only options were either playing solo or using a "filled" team with random players.

I've been playing Fortnite since it had around 60k concurrent users, all the way to the 3.4M, so it's been interesting seeing their load / server issues over time and then reading this (granted, I don't understand everything discussed in their blog). They've done an outstanding job handling their growing traffic.

One thing I've noticed with Fortnite, compared to PUBG or other MMOs, is how large their patch updates are. They're usually several GB, and they come fairly frequently, about once a week.

[+] swaggyBoatswain|8 years ago|reply
Forgot to note that Fortnite had an ongoing "friend list" problem before.

Most notably, whenever you wanted to add someone to your party, you had to do the following:

1. You send the user a friend invite

2. User would have to accept invite

3. User would have to disconnect to refresh your friends list (due to friend-service issue mentioned in blog)

4. User would relog back on (took approximately 3 mins)

5. You could then see them on friends list

6. Send party invite

[+] frenchie14|8 years ago|reply
That has more to do with how they package the game than with how much they changed. Oftentimes large parts of the game are bundled together, so a single small change in one bundle causes you to re-download the whole thing.
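
A toy sketch of that effect, assuming a patcher that compares a hash per bundle and re-downloads any bundle whose hash changed. This is not Epic's actual patching scheme; the bundle and file names are made up.

```python
import hashlib
from typing import Dict

Bundle = Dict[str, bytes]  # file name -> file contents (hypothetical layout)


def bundle_digest(files: Bundle) -> str:
    """Hash every file in the bundle together, in a stable order."""
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[name])
    return digest.hexdigest()


def bytes_to_download(old: Dict[str, Bundle], new: Dict[str, Bundle]) -> int:
    """Total size of bundles that are new or whose digest changed."""
    total = 0
    for bundle_name, files in new.items():
        previous = old.get(bundle_name)
        if previous is None or bundle_digest(previous) != bundle_digest(files):
            total += sum(len(data) for data in files.values())
    return total


old_build = {
    "maps.pak": {"island.umap": b"a" * 1000, "lobby.umap": b"b" * 1000},
    "audio.pak": {"theme.ogg": b"c" * 500},
}
# One byte changes in one file of maps.pak; audio.pak is untouched.
new_build = {
    "maps.pak": {"island.umap": b"a" * 999 + b"!", "lobby.umap": b"b" * 1000},
    "audio.pak": {"theme.ogg": b"c" * 500},
}
print(bytes_to_download(old_build, new_build))
```

A one-byte change forces the whole 2,000-byte maps bundle to be fetched again, while the untouched audio bundle is skipped.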
[+] fokinsean|8 years ago|reply
As an addicted Fortnite player, this is a neat read. However, as an application layer dev, the architecture specifics were slightly over my head. My biggest concern is shipping a working Docker image; all of the architecture is mostly abstracted at our company. This gave me some inspiration to dive deeper into our architecture.
[+] aaossa|8 years ago|reply
Loved the tone of the article. They know they have some problems to work on, they're being transparent about them and they're explicitly saying that they need help with it.
[+] stevenwoo|8 years ago|reply
Look forward to seeing how they fix that MongoDB collection write stalling problem. Vaguely recall that was still a big problem the last time I was looking at MongoDB years ago.
[+] iBotPeaches|8 years ago|reply
That was an incredibly fun read. Makes me curious about the other failures in this industry, if only they could be explained in this detail.
[+] lazyjones|8 years ago|reply
The EVE developers used to post a lot of details about their setup, upgrades, software and about failures / congestion issues...
[+] Thaxll|8 years ago|reply
The video game industry doesn't reveal that much about its tech. You only see a glimpse of what they use during public talks, like at GDC, etc.
[+] eterm|8 years ago|reply
This is an interesting read; it's always interesting to hear why something that ought to be fairly heavily federated or sharded can nevertheless fall over centrally.
[+] SilverSurfer972|8 years ago|reply
> "Along with a number of things mentioned, even small performance changes over N nodes collectively make large impacts for our services and player experience."

I think this is where Stacktical helps with proactively detecting performance regressions at the CI level, before they hit production: https://stacktical.com

Disclaimer: I am Stacktical's CTO

[+] einrealist|8 years ago|reply
Nice read. And nice to see Java running at the backend.

I wonder whether Epic can solve its problems by rearchitecting more into a CQRS-driven system with event sourcing: store events in a more write-optimized DB (e.g. Cassandra) and then process the events for fast reads through whatever is required for the use cases. Maybe they hit the limits of MongoDB's ability to handle both reads and writes at their scale.
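
For readers unfamiliar with the pattern, here is a minimal in-memory sketch of CQRS with event sourcing. Everything in it is hypothetical (the class names, the friends-list example, the list standing in for a write-optimized store like Cassandra); it only illustrates the split between an append-only write side and a derived read side, not how Epic's services work.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List, Set


@dataclass(frozen=True)
class FriendEvent:
    kind: str    # "added" or "removed"
    user: str
    friend: str


class EventStore:
    """Write side: an append-only log (stand-in for a write-optimized DB)."""

    def __init__(self) -> None:
        self._events: List[FriendEvent] = []

    def append(self, event: FriendEvent) -> None:
        self._events.append(event)

    def read_from(self, offset: int) -> List[FriendEvent]:
        return self._events[offset:]


class FriendsProjection:
    """Read side: a view rebuilt by replaying events it has not seen yet."""

    def __init__(self) -> None:
        self._friends: DefaultDict[str, Set[str]] = defaultdict(set)
        self._offset = 0  # number of events already applied

    def catch_up(self, store: EventStore) -> None:
        for event in store.read_from(self._offset):
            if event.kind == "added":
                self._friends[event.user].add(event.friend)
            elif event.kind == "removed":
                self._friends[event.user].discard(event.friend)
            self._offset += 1

    def friends_of(self, user: str) -> Set[str]:
        return set(self._friends.get(user, set()))


store = EventStore()
store.append(FriendEvent("added", "alice", "bob"))
store.append(FriendEvent("added", "alice", "carol"))
store.append(FriendEvent("removed", "alice", "bob"))

view = FriendsProjection()
view.catch_up(store)
print(view.friends_of("alice"))
```

Writes only ever append, and reads only ever hit the projection, so the two sides can be scaled and stored separately. The cost is that the read view lags the log until it catches up.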

[+] tlynchpin|8 years ago|reply
This is a great article, lots of detail, props to Epic team for generally killing it and specifically putting this together.
[+] orliesaurus|8 years ago|reply
I never spent a dime on any of these free-to-play games. I am in awe at how dedicated the team behind Fortnite seems to be when it comes to providing us data (real data?) on what's happening on their side, while I am sitting on my couch logging into one of the matches with my keyboard and mouse.
[+] halflings|8 years ago|reply
Meanwhile, the game is pretty much unplayable [1] on Mac OS while it was heralded as the first game to support Metal (even featured in Apple's keynote).

[1] Getting ~16 FPS on medium settings, with the high-end late 2016 MBP 15".

[+] stats_n_trends|8 years ago|reply
As a counter point I could play quite well on my MBP 17 at medium settings. (45 FPS)

Note: I did tweak some minor settings like AA to get better performance.

[+] aecorredor|8 years ago|reply
How do you get to this level of expertise? What resources do people like these use to learn about these types of scalable systems? Any good books that start from the ground up on these topics?
[+] dom96|8 years ago|reply
Looks like they are still having scaling issues. I just tried creating a new Epic account and was shown an error.
[+] je42|8 years ago|reply
I wonder why they want to do this step:

- Followed by removing Nginx + Memcached couple altogether out of equation.