item 29058573

Roblox has been down for days and it’s not because of Chipotle

243 points | Terretta | 4 years ago | theverge.com

169 comments

[+] jitl|4 years ago|reply
I wonder if they have a major Postgres database that hit transaction ID wraparound? Postgres uses int32 for transaction IDs, and IDs are only reclaimed by a vacuum maintenance process which can fall behind if the DB is under heavy write load. Other companies have been bitten by this before, eg Sentry in 2015 (https://blog.sentry.io/2015/07/23/transaction-id-wraparound-...). Depending on the size of the database, you could be down several days waiting for Postgres to clean things up.
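To make the failure mode concrete, here's a rough sketch (not Postgres source, and the traffic numbers are made up) of the circular 32-bit XID space and how quickly vacuum headroom burns down under heavy write load:

```python
# Hedged sketch of Postgres-style transaction ID (XID) wraparound.
# The comparison logic mirrors the idea of modulo-2^32 XID arithmetic;
# all numeric inputs below are illustrative assumptions, not measurements.

XID_SPACE = 2**32          # XIDs are 32-bit and wrap around
HORIZON = XID_SPACE // 2   # ~2.1B: an XID can only be compared within half the space

def xid_precedes(a: int, b: int) -> bool:
    """Circular comparison: is XID `a` 'in the past' relative to XID `b`?"""
    diff = (b - a) % XID_SPACE
    return 0 < diff < HORIZON

def days_until_wraparound(unfrozen_age: int, tx_per_sec: int) -> float:
    """If vacuum has left rows `unfrozen_age` XIDs old, how many days of
    headroom remain at a sustained rate of `tx_per_sec` transactions?"""
    headroom = HORIZON - unfrozen_age
    return headroom / tx_per_sec / 86_400

# A recently stamped row compares as 'in the past'...
assert xid_precedes(1_000, 2_000)
# ...but ~2^31 transactions later the same comparison flips, which is
# why old rows must be frozen by vacuum before that point:
assert not xid_precedes(1_000, 1_000 + HORIZON + 1)

# Assumed example: rows already 1.5B XIDs old, 20k writes/sec
print(f"{days_until_wraparound(1_500_000_000, 20_000):.2f} days of headroom")
```

At those (invented) numbers the headroom is well under a day, which is why a new write pattern can push a previously healthy database into the danger zone faster than autovacuum can catch up.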

Even though it’s a well-documented issue with Postgres and you have an experienced team keeping an eye on it, a new write pattern can accelerate things into the danger zone quite quickly. At Notion we had a scary close call with this about a year ago that led to us splitting a production DB over the weekend to avoid hard downtime.

Whatever the issue is, I’m wishing the engineers working on it all the best.

[+] mikeklaas|4 years ago|reply
I find it pretty odd to speculate that they are experiencing a very specific failure mode of a particular database. Do you even know whether they use Postgres?
[+] drewbailey|4 years ago|reply
Roblox has 43+ million daily active users. The issue you are wildly speculating about is trivial compared to their size and scale. I guarantee they dealt with that potential issue (if they're even using Postgres) years ago.
[+] mulmen|4 years ago|reply
Do you have any reason to believe this is the case?
[+] hintymad|4 years ago|reply
I'd speculate that it's more likely a data corruption problem. A system that was overwhelmed or misconfigured led to corruption of critical configuration data, which then propagated to a large number of dependencies. Roblox tried to restore its data from backup, a process that was not necessarily rehearsed regularly or rigorously and therefore took longer than expected. All other services would then have to restore their systems in cascaded fashion while sorting out complex dependencies and constraints, which would take days.
[+] crehn|4 years ago|reply
If it's a known issue, is there no way to increase the transaction ID size?

Quite surprising a seemingly battle-tested database can choke in such a manner.

[+] tluyben2|4 years ago|reply
Off-topic: why does something that everyone here recommends all the time need something like vacuum, which is heavy and can fail? Why do people not cry about that as much as the Python GIL? It felt like a hack 15+ years ago and it feels just as weird now. I am curious why that was never changed; it is obviously hard, but is it not regarded as a great pain and a priority to resolve?
[+] xyst|4 years ago|reply
I thought it was a certificate issue? I am looking at roblox.com and the issuing CA is GoDaddy...
[+] api|4 years ago|reply
It amazes me that anything in common use has these kinds of absurd issues. Postgres has a powerful query language and a lot of great features but the engine itself is clunky and hairy and feels like something from the early 90s. The whole idea of a big expensive vacuum process is crazy in 2021, as is the difficulty of clustering and failover in the damn thing.

CockroachDB is an example of what a modern database should be like.

[+] vbg|4 years ago|reply
The latest version of Postgres addresses this issue in various ways; it's not entirely solved, but it should be significantly mitigated.
[+] AznHisoka|4 years ago|reply
That’s not a bad theory. Even the homepage is down so that suggests their entire database was taken down.
[+] 1cvmask|4 years ago|reply
Do any other databases have similar issues? Or is this specific to Postgres?
[+] romanhn|4 years ago|reply
The lack of communication for an outage this big is absolutely shameful. I put this on the leadership, not any of the engineers working round the clock. Having been in the middle of a critical service outage that lasted over 24 hours I totally get the craziness of the situation, but Roblox seriously needs to revisit their incident management and customer update process. Even though kids are the main consumers of the app, the near total silence speaks volumes about their business's lack of preparedness for disaster scenarios. If nothing else, I hope they'll see this as an opportunity to learn and do better next time.
[+] justapassenger|4 years ago|reply
What else would you expect when it comes to communication? They posted status page and said they're working on it. Do you want to be part of their internal chat and see what exactly they're investigating?

Would posting "still working on it" every 2h really have made it better?

[+] CheezeIt|4 years ago|reply
Good grief. It's a game for kids. Nobody needs updates. They could even take the weekend off.
[+] swiley|4 years ago|reply
Meh. This is part of playing a game where all the multiplayer goes through one service.
[+] Matheus28|4 years ago|reply
Fun fact: I operate many web MMO games (or “io games”, as people like to call them), and traffic is up around 20-100% since the Roblox outage started.
[+] mromanuk|4 years ago|reply
Love agar.io and HN for this
[+] schnebbau|4 years ago|reply
That's cool, I love io games. Which ones?
[+] swatkat|4 years ago|reply
https://twitter.com/Bloxy_News/status/1454861081021587456

"STATUS UPDATE: Roblox is incrementally opening the website to groups of users and will continue to open up to more over the course of the day..."

[+] abricot|4 years ago|reply
They write that, but according to various game Discords it's absolutely not true. No one is allowed to log in.
[+] tschellenbach|4 years ago|reply
That's a very long outage, I wonder how this happened.

- Perhaps it involves internal systems they've developed whose creators have since left. So it's not just "fix the thing", but first understand what the thing is doing, then fix it.

- Data recovery can take forever if you run into edge cases with your databases.

Anyone found any articles about their architecture?

[+] kawsper|4 years ago|reply
I bet it is some sort of Vault/Consul shenanigans that's going on.
[+] breakingcups|4 years ago|reply
As long as we're speculating... One of the few things I can think of that can't reasonably be sped up is data integrity recovery. Say some data got into an inconsistent state and now they have to manually restore a whole bunch of financial transactions or something before opening the game up again, because otherwise customers would get very mad at missing stuff they've paid for, traded, etc.

If they were to resume the game before resolving these issues, they would only exacerbate them, with state moving even further from where it was originally.

[+] g123g|4 years ago|reply
Wouldn't it be cheaper to directly compensate such customers rather than keeping the whole website down for 3 days?
[+] tetron|4 years ago|reply
Been checking the #roblox hashtag on Twitter and the two main themes are addicts going through withdrawal and devs saying how they wouldn't have their llama appreciation fan site be down this long let alone your core business.
[+] TekMol|4 years ago|reply
Has any billion dollar company ever been down for 2 days?

Or is this the high score?

[+] xakahnx|4 years ago|reply
Adding to the speculation here, I'm willing to bet some component of their issue is not entirely technical. Regardless of the underlying cause (PKI was mentioned), for downtime to last this long it almost definitely means some persistent data was lost or corrupted. Of course they can recover from a backup (I'm confident they have clean backups) but what does that mean for the business? "We irrecoverably lost 12 hours of data" could have severe implications, for example legal or compliance risks.
[+] pm90|4 years ago|reply
Why are you confident they have clean backups? It’s been my experience that backup infrastructure is usually not given much thought, and engineers infrequently test that recovery from backups works as expected. Not saying that’s what it is, but I'm not sure it can be ruled out.
[+] xyst|4 years ago|reply
I wonder how the market is going to react when it opens tomorrow. I am thinking a quick scalp with weekly $RBLX puts, then when it recovers double up on cheap long call options.
[+] roamerz|4 years ago|reply
Internal cause does not necessarily mean a technical mishap. Read rogue sysadmin or other employee initiated event.
[+] QuercusMax|4 years ago|reply
I've been getting recruiting emails from Roblox recently.... Maybe they really do need my help.
[+] christkv|4 years ago|reply
I feel for the people in the trenches on this one. It’s got to suck bad.
[+] tgsovlerkhgsel|4 years ago|reply
In the immediate short term, yes, obviously - stress, long hours, etc. In the immediate aftermath, probably too: Burdensome mandates from management that isn't always aware of reality to "make sure this never happens again".

But if the organization is functional, in the medium term, this may also mean staffing understaffed teams, hiring SREs, etc. - which can mean less stress, no more 24/7 pager duty, better pay etc.

[+] bastardoperator|4 years ago|reply
Who wants to bet there will be a Kubernetes versus Nomad/Consul debate coming to a Roblox meeting soon? I'd like to hear what Hashicorp has to say here.
[+] amelius|4 years ago|reply
Is there a possibility this is caused by ransomware?
[+] PeterisP|4 years ago|reply
From TFA - “We believe we have identified an underlying internal cause of the outage with no evidence of an external intrusion,” says a Roblox spokesperson.
[+] EastOfTruth|4 years ago|reply
Apparently, according to roblox.com, some players are able to play: "We are incrementally opening to groups of players and will continue rolling out."

https://i.imgur.com/KgDxNsg.png