I wonder if they have a major Postgres database that hit transaction ID wraparound? Postgres uses 32-bit integers for transaction IDs, and IDs are only reclaimed by the vacuum maintenance process, which can fall behind if the DB is under heavy write load. Other companies have been bitten by this before, e.g. Sentry in 2015 (https://blog.sentry.io/2015/07/23/transaction-id-wraparound-...). Depending on the size of the database, you could be down for several days waiting for Postgres to clean things up.
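For intuition, here's a minimal sketch (my own simplification, not actual Postgres internals) of why a 32-bit counter leaves only about 2^31 transactions of headroom: XIDs are compared with modular arithmetic, so once the oldest unfrozen row falls roughly 2 billion transactions behind, it would suddenly look like it came from the future.

```python
# Toy model of Postgres-style 32-bit XID comparison under wraparound.
# Hypothetical simplification: real Postgres reserves special XIDs and
# tracks epochs; this only shows the modular "who is older" ordering.
U32 = 1 << 32
HALF = 1 << 31

def xid_precedes(a: int, b: int) -> bool:
    """True if xid `a` is older than `b` in circular (mod 2**32) order."""
    diff = (a - b) % U32
    # `a` is older iff it is *behind* b by less than 2**31 steps,
    # i.e. the forward distance from a to b is under half the ring.
    return diff >= HALF

assert xid_precedes(100, 200)        # plain case: 100 is older than 200
assert xid_precedes(U32 - 5, 10)     # wrapped: 4294967291 is older than 10
assert not xid_precedes(200, 100)
```

The half-ring comparison is exactly why the danger zone sits at ~2 billion transactions: past that distance, the ordering of old unfrozen rows would flip, which is the disaster vacuum's "freezing" exists to prevent.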
Even though it’s a well-documented issue with Postgres and you have an experienced team keeping an eye on it, a new write pattern can accelerate things into the danger zone quite quickly. At Notion we had a scary close call with this about a year ago that led to us splitting a production DB over the weekend to avoid hard downtime.
Whatever the issue is, I’m wishing the engineers working on it all the best.
I find it pretty odd to speculate that they are experiencing a very specific failure mode of a particular database. Do you even know whether they use Postgres?
Roblox has 43+ million daily active users. The issue you are wildly speculating about is trivial relative to their size and scale. I guarantee they dealt with that potential issue (if they are even using Postgres) years ago.
I'd speculate that it's more likely a data corruption problem. An overwhelmed or misconfigured system corrupted critical configuration data, and that corruption then propagated to a large number of dependencies. Roblox tried to restore its data from backup, a process that was probably not rehearsed regularly or rigorously and therefore took longer than expected. All other services would then have to restore their systems in cascaded fashion while sorting out complex dependencies and constraints, which could take days.
Off-topic: why does something that everyone here recommends all the time need something like vacuum, which is heavyweight and can fail? Why do people not complain about this as much as they do about the Python GIL? It felt like a hack 15+ years ago and it feels just as weird now. I'm curious why it was never changed; it's obviously hard, but is it not regarded as a great pain and a priority to resolve?
It amazes me that anything in common use has these kinds of absurd issues. Postgres has a powerful query language and a lot of great features but the engine itself is clunky and hairy and feels like something from the early 90s. The whole idea of a big expensive vacuum process is crazy in 2021, as is the difficulty of clustering and failover in the damn thing.
CockroachDB is an example of what a modern database should be like.
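To make the vacuum complaint above concrete: Postgres's MVCC never overwrites rows in place, so every UPDATE leaves behind a dead row version that only a vacuum pass reclaims. A toy sketch (a deliberate simplification of mine, not the actual heap format):

```python
# Toy MVCC table: updates append new row versions; superseded versions
# linger as dead tuples ("bloat") until a vacuum pass drops them.
class ToyMvccTable:
    def __init__(self):
        self.versions = []  # append-only list of {key, value, alive}

    def upsert(self, key, value):
        # An UPDATE never rewrites in place: it marks the old version
        # dead and appends a new one, so concurrent readers still see
        # a consistent snapshot of the row.
        for row in self.versions:
            if row["key"] == key and row["alive"]:
                row["alive"] = False
        self.versions.append({"key": key, "value": value, "alive": True})

    def dead_tuples(self):
        return sum(1 for r in self.versions if not r["alive"])

    def vacuum(self):
        # The expensive cleanup pass: physically drop dead versions.
        # Under heavy write load this is what can fall behind.
        self.versions = [r for r in self.versions if r["alive"]]

t = ToyMvccTable()
t.upsert("user:1", "a")
t.upsert("user:1", "b")
t.upsert("user:1", "c")
assert t.dead_tuples() == 2   # two superseded versions linger as bloat
t.vacuum()
assert t.dead_tuples() == 0 and len(t.versions) == 1
```

The trade-off is real: this design gives cheap snapshot reads without read locks, at the cost of a background process that must keep up with the write rate.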
The lack of communication for an outage this big is absolutely shameful. I put this on the leadership, not any of the engineers working round the clock. Having been in the middle of a critical service outage that lasted over 24 hours I totally get the craziness of the situation, but Roblox seriously needs to revisit their incident management and customer update process. Even though kids are the main consumers of the app, the near total silence speaks volumes about their business's lack of preparedness for disaster scenarios. If nothing else, I hope they'll see this as an opportunity to learn and do better next time.
What else would you expect when it comes to communication? They posted status page and said they're working on it. Do you want to be part of their internal chat and see what exactly they're investigating?
Would posting "still working on it" every two hours really have made it better?
Roblox made their developer tooling require you to be always online and signed in, even though it doesn't actually need this to function. This means that all development workflows have been bricked during this outage too. https://twitter.com/RBXStatus/status/1454815143607607300/pho...
This guy's too humble. He's Matheus Valadares, and he's the creator of Agar.io [1]. This is the game that pretty much started the ".io game" genre [2].
[1]: https://en.wikipedia.org/wiki/Agar.io
[2]: "Around 2015 a multiplayer game, Agar.io, spawned many other games with a similar playstyle and .io domain". See https://en.wikipedia.org/wiki/.io
That's a very long outage; I wonder how this happened.
- Perhaps internal systems they've developed, and the people who created them left. So it's not just fix the thing, but first understand what the thing is doing and then fix it.
- Data recovery can take forever if you run into edge cases with your databases
Anyone found any articles about their architecture?
As long as we're speculating: one of the few things I can think of that can't reasonably be sped up is data integrity recovery. Say some data got into an inconsistent state, and now they have to manually restore a whole bunch of financial transactions or something before opening the game up again, because otherwise customers would get very mad at missing stuff they've paid for, traded, etc.
If they were to resume the game before fixing these issues, the problems would only be exacerbated, with state drifting even further from where it was originally.
Been checking the #roblox hashtag on Twitter, and the two main themes are addicts going through withdrawal and devs saying they wouldn't let their llama-appreciation fan site be down this long, let alone their core business.
Adding to the speculation here, I'm willing to bet some component of their issue is not entirely technical. Regardless of the underlying cause (PKI was mentioned), for downtime to last this long it almost definitely means some persistent data was lost or corrupted. Of course they can recover from a backup (I'm confident they have clean backups) but what does that mean for the business? "We irrecoverably lost 12 hours of data" could have severe implications, for example legal or compliance risks.
Why are you confident they have clean backups? In my experience, backup infrastructure is usually not given much thought, and engineers infrequently test that recovery from backups works as expected. Not saying that's what it is, but I'm not sure it can be ruled out.
I wonder how the market is going to react when it opens tomorrow. I am thinking a quick scalp with weekly $RBLX puts, then when it recovers double up on cheap long call options.
In the immediate short term, yes, obviously - stress, long hours, etc. In the immediate aftermath, probably too: Burdensome mandates from management that isn't always aware of reality to "make sure this never happens again".
But if the organization is functional, in the medium term, this may also mean staffing understaffed teams, hiring SREs, etc. - which can mean less stress, no more 24/7 pager duty, better pay etc.
Who wants to bet there will be a Kubernetes versus Nomad/Consul debate coming to a Roblox meeting soon? I'd like to hear what Hashicorp has to say here.
From TFA - “We believe we have identified an underlying internal cause of the outage with no evidence of an external intrusion,” says a Roblox spokesperson.
https://news.ycombinator.com/item?id=29044500
Quite surprising a seemingly battle-tested database can choke in such a manner.
"STATUS UPDATE: Roblox is incrementally opening the website to groups of users and will continue to open up to more over the course of the day..."
https://www.hashicorp.com/case-studies/roblox
Or is this the highscore?
https://i.imgur.com/KgDxNsg.png
https://news.ycombinator.com/item?id=19038198