Great writeup (including the human cost, e.g. loss / lack of sleep, which in my experience has a huge impact on complicated incident resolution).
Here’s what jumped out at me: “The new account was created in our database with a null value in the URI field.”
Almost every time I see a database-related postmortem — and I have seen a lot of them — NULL is lurking somewhere in the vicinity of the crime scene. Even if NULL sometimes turns out not to be the killer, it should always be brought in for questioning.
My advice is: never rely on NULL as a sentinel value, and if possible, don’t allow it into the database at all. Whatever benefits you think you might gain, they will inevitably be offset by a hard-to-find bug, quite possibly years later, where some innocuous-seeming statement expects either NULL or NOT NULL and the results are unexpected (often due to drift in the semantics of the data model).
Although this was a race condition, if the local accounts and the remote accounts were affirmatively distinguished by type, the order of operations may not have mattered (and the account merge code could have been narrowly scoped).
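To make that concrete, here's a minimal sketch (SQLite, with invented table and column names; Mastodon's actual schema differs) of replacing the NULL-as-sentinel pattern with an explicit account type:

```python
import sqlite3

# Hypothetical schema (invented names; not Mastodon's actual tables):
# an explicit "kind" column replaces "uri IS NULL means local".
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE accounts (
        id   INTEGER PRIMARY KEY,
        kind TEXT NOT NULL CHECK (kind IN ('local', 'remote')),
        uri  TEXT,
        -- remote accounts must carry a URI; local ones must not
        CHECK ((kind = 'remote') = (uri IS NOT NULL))
    )
""")
con.execute("INSERT INTO accounts (kind, uri) VALUES "
            "('remote', 'https://example.org/users/alice')")
con.execute("INSERT INTO accounts (kind, uri) VALUES ('local', NULL)")

# Merge logic can now be scoped affirmatively by type...
rows = con.execute("SELECT id FROM accounts WHERE kind = 'remote'").fetchall()
print(rows)  # [(1,)]

# ...and inconsistent writes fail loudly at insert time:
try:
    con.execute("INSERT INTO accounts (kind, uri) VALUES ('remote', NULL)")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

The point is that the merge job filters on `kind = 'remote'` rather than inferring remoteness from a NULL check, and a bad write fails at insert time instead of surfacing years later.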
I finally made an account just to respond to this, I hope you don't find that too aggressive a move.
Null is a perfectly valid value for data, and should be treated as such. A default value (e.g. -1 for a Boolean or an empty string) can make your system appear to work where NULL would introduce a runtime error, but that doesn't mean your system is performing as expected; it just makes it quieter.
I know it's tempting to brush NULL under the rug, but nothing is just as valid a state for data as something, and systems should be written generally to accommodate this.
> the account merge code could have been narrowly scoped
IMO automated merging/deduplication of "similar" records is one of those incredibly hard problems, with edge cases and race conditions galore, that should have a human in the loop whenever possible, and should pass data (especially data consumed asynchronously) as explicitly as possible, with numerous checks to ensure that facts haven't shifted on the ground.
In many cases, it requires the implementors to start by thinking about all the concerns and interactivity requirements that e.g. a Git-style merge conflict would have, and try to make simplifying assumptions based on the problem domain from that starting position.
Looking at the Mastodon source [0], and seeing that there's not even an explicit list of to-merge-from IDs passed from the initiator of the merge request to the asynchronous executor of the merge logic, it seems like it was only a matter of time before something like this happened.
This is not a criticism of Mastodon, by the way! I've personally written, and been bitten by, merge logic with far worse race conditions, and it's frankly incredible that a feature like this even exists for what is effectively [1] a volunteer project! But it is a cautionary tale nonetheless.
NULL is inevitable if you use outer JOINs, simply as a matter of what an outer JOIN is.
More deeply, NULL is inevitable because reality is messy and your database can't decline to deal with it just because it's messy. You want to model titles, with prenomials and postnomials, and then generate full salutations using that data? Well, some people don't have postnomials, at the very least, so even if you never store NULLs you're going to get them as a result of the JOIN you use to make the salutation.
You can remove the specific NULL value, but you can't remove the fact "Not Applicable"/"Unknown" is very often a valid "value" for things in reality, and a database has to deal with that.
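For example (SQLite, with toy tables): even though every column below is declared NOT NULL, the outer join manufactures a NULL for the person with no postnomial:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE postnomials (person_id INTEGER NOT NULL,
                              suffix    TEXT NOT NULL);
    INSERT INTO people VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO postnomials VALUES (1, 'PhD');
""")
rows = con.execute("""
    SELECT p.name, pn.suffix
    FROM people p
    LEFT JOIN postnomials pn ON pn.person_id = p.id
""").fetchall()
print(rows)  # Bob has no postnomial, so his suffix arrives as None (SQL NULL)
```

No NULL was ever stored, yet the salutation-building query still has to handle one.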
"ah yes well we have a full database backup so we can do a full restore", then
"the full restore will be tough and involve downtime and has some side effects," then
"I bet we could be clever and restore only part of the data that are missing", then
doing that by hand, which hits weird errors, then
finally shipping the jury-rigged selective restore and cleaning up the last five missing pieces of data (hoping you didn't miss a sixth)
Happens every time someone practices backup/restore, no matter how hard they've worked in advance. It always ends up being an application-level decision which data to put back from the backup image.
I agree with you. The phrase is you don’t have backups unless you test your backups.
But in this case I don't really get what the issue is. Restore everything from the last good backup and people lose some posts made in the meantime; that sucks, but it's an instant solution instead of hand work and uncertainty.
> To Renaud, Claire, and Eugen of the Mastodon developer team, who went above and beyond all expectations to help us out. You folks were amazing, you took our situation very seriously, and immediately jumped in to help us. I really could not have asked for anything more. Thank you!
I don't know if Vivaldi provides financial support to Mastodon (I couldn't find their name on the sponsors page). If not, I hope this situation causes them (and other companies using Mastodon) to consider sponsorship or a support contract.
We (the Mastodon non-profit) do not offer support contracts at the moment, but this is a good idea, thanks :)
But we do have sponsorships open, and they really make a difference. Having full-time people working on the project is very impactful, but at the moment we only have one full-time developer in addition to Eugen (the founder) and a DevOps person on the technical side.
Items two and three not happening atomically feels like an issue, though I assume there's a reason that it's not trivial to do so (I haven't looked at the code; really should at some point.)
It seems like it was trivial to make it happen atomically.
There just wasn't a need to before, since their not being atomic isn't an issue unless you have a poor configuration, like someone pointing Sidekiq at a stale database server (sorry, a replica), which I see as the primary issue here.
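A toy model of that failure mode (invented names, nothing from Mastodon's actual code): a background job that reads from a lagging replica can observe the world as it was before the request that enqueued it.

```python
# Toy model of the race (invented names; not Mastodon's actual code).
primary, replica, queue = {}, {}, []

def web_request():
    # 1. write the new account on the primary
    primary["account:1"] = {"uri": None}
    # 2. enqueue a background job that references it
    queue.append("account:1")

def run_job(db):
    key = queue.pop(0)
    return db.get(key)  # None if replication hasn't delivered the row yet

web_request()
first = run_job(replica)    # job raced ahead of replication
print(first)                # None

web_request()
replica.update(primary)     # this time replication catches up first
second = run_job(replica)
print(second)               # {'uri': None}
```

Pointing the workers at the primary (or using synchronous replication) removes the window entirely, which is why the configuration is the primary issue.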
I'll never forget the first time I had to restore a massive sql dump and realized that vim actually segfaults trying to read it.
That's when I discovered the magic of split(1), "split a file into pieces". I just split the huge dump into one file per table.
Of course a table can also be massive, but at least each file is now more uniform, which means you can more easily run other tools on it, like sed or awk, to transform queries.
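Note that split(1) itself only cuts by size or line count; a per-table split needs a pattern-aware pass, like csplit(1). A rough Python sketch of the same idea, assuming pg_dump-style plain-SQL section markers (adjust the marker for your dump format):

```python
import re

# Assumed format: pg_dump-style plain-SQL sections beginning with
# "-- Data for Name: <table>". Adjust MARKER for your dump.
MARKER = re.compile(r"^-- Data for Name: (\w+)")

def split_dump(path):
    out, counter = None, 0
    with open(path) as dump:
        for line in dump:
            m = MARKER.match(line)
            if m:
                if out:
                    out.close()
                counter += 1
                out = open(f"{counter:04d}_{m.group(1)}.sql", "w")
            if out:  # lines before the first marker are skipped
                out.write(line)
    if out:
        out.close()
    return counter
```

Streaming line by line like this also sidesteps the editor problem: nothing ever has to hold the whole dump in memory.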
I'm surprised that vim segfaults! I had it slow to open huge files, but I always assumed it could handle anything, through some magic buffering mechanisms. I could be wrong!
That being said, from the point where one has to edit the dump to restore data... something is very wrong in the restore process (the knowledge of which isn't helpful when you're actually faced with the situation, of course).
I once had to administer a system where a particular folder had so many files that things stopped working, even the ls command would not complete. (It was probably on ext3 or ext2.)
The workaround involved writing a python script that handled everything in a gradual manner, moving files into subdirectories based on shared prefixes.
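A rough sketch of that kind of workaround (hypothetical; the original script isn't shown): move a bounded batch per run and re-run until the directory is empty, so no single pass has to hold the whole listing.

```python
import os, shutil

def shard_by_prefix(directory, prefix_len=2, batch=1000):
    # Collect a bounded batch first: mutating a directory while
    # scanning it is asking for trouble on a filesystem this full.
    names = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                names.append(entry.name)
            if len(names) >= batch:
                break
    for name in names:
        bucket = os.path.join(directory, name[:prefix_len])
        os.makedirs(bucket, exist_ok=True)
        shutil.move(os.path.join(directory, name),
                    os.path.join(bucket, name))
    return len(names)  # re-run until this hits zero
```

os.scandir streams directory entries instead of building one giant sorted list, which is exactly why it keeps working where `ls` (which sorts everything first) gives up.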
Did this make anyone else's eyebrows raise sky-high?
> Claire replied, asking for the full stacktraces for the log entries, which I was able to also extract from the logs.
This is either deep voodoo magic, or the code or configuration is turning a Xeon into the equivalent of a 286. How is that not, like, megabytes on every single hit?
This is the default Ruby on Rails behavior. It logs a stacktrace on any 500 or unknown error, and it's just line numbers and file paths.
> megabytes on every single hit
I run a rails app that's very poorly designed.
I just checked, and the stack trace for a single 500 is 5KiB. It doesn't even add up to 1MiB a day since there's only a 500 error about every hour.
> This is either deep voodoo magic, or the code or configuration is turning a Xeon into the equivalent of a 286
Having a call stack handy is actually pretty performant. Java's default exception behavior is to bubble up a stack trace with every exception, whether you print it or not, and Java applications run just fine. You have the call stack anyway, since you have to know how to return, so the only extra information you need handy is the filename and line-number debug symbols, and Ruby needs that info anyway just by the nature of the language.
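To put a rough number on it, here's a quick measurement (Python rather than Ruby, but the shape of a trace is the same: one file/line pair per frame):

```python
import traceback

# A fake request path a few frames deep:
def controller():
    return service()

def service():
    return repository()

def repository():
    raise RuntimeError("simulated 500")

try:
    controller()
except RuntimeError:
    trace = traceback.format_exc()

# Each frame contributes roughly one "File ..., line N, in func" pair,
# so the whole formatted trace is a few hundred bytes, not megabytes.
print(len(trace), "bytes")
```

Even a much deeper real-world stack only multiplies that by tens, which lines up with the ~5KiB figure above.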
> 6 Users with symbols in their usernames couldn’t log in. This turned out to be due to a mistake I’d made in the recovery script, and was very easily fixed.
The bug wouldn't have occurred in a normal mastodon installation since mastodon's recommended configuration is a single postgres database, or at the very least to use synchronous replication.
Also, very typically, fuzzers intentionally use simplified configuration, so it seems even less likely fuzzing would have caught this interaction.
Hm, so a distributed twitter runs into the challenge that each independently managed node is... an independently managed node. Backup problems etc.
Centralized twitter improves its operations for all users over time. But can be purchased by a nutso billionaire on a whim, or subjected to the """"""national security"""""" directives of the US Government.
Database replicas are "distributed", but not in the sense ActivityPub is.
The same error could have happened on any centralized service that had more than one db instance and background cleanup jobs. I don't think Xitter runs entirely off Elon's laptop yet, so they could have had the same kind of error.
Yeah. Speaking in generalities, decentralization increases overall resilience of a network because it isolates the extent to which bad things can spread. Centralization increases efficiency (and efficacy, if the ruler of the centralized power is competent), and the likelihood of a system-wide failure.
Perhaps better is decentralized twitter (Nostr). Your account doesn't live on a server and you send events to multiple servers if you want to. If one server goes down, it hardly impacts you.
kaoD|2 years ago
IMO the problem (at least in this case) is not NULL in the DB, but NULL at the application level.
If NULL is some sort of Maybe monad and you're forced to deal with it, well, you're forced to deal with it, think about it, etc.
Empty string, whatever a NULL string is in your language of choice, or some sort of sentinel value you invent... not much of a difference.
[0] https://github.com/mastodon/mastodon/blob/main/app/workers/a... (note: AGPL)
[1] https://opencollective.com/mastodon
chx|2 years ago
How? NULL = NULL evaluates to NULL (unknown), not TRUE; SQL is a three-valued logic, specifically Kleene's strong three-valued logic, and NULL <any comparison operator> NULL is NULL.
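This is easy to check against a real engine (SQLite here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
eq      = con.execute("SELECT NULL = NULL").fetchone()[0]   # None: unknown
is_null = con.execute("SELECT NULL IS NULL").fetchone()[0]  # 1: IS treats NULL as a value
true_or = con.execute("SELECT 1 = 1 OR NULL").fetchone()[0] # 1: TRUE OR unknown = TRUE
print(eq, is_null, true_or)
```

The last query is what marks this as Kleene's strong logic: TRUE OR unknown is TRUE, whereas under the weak logic any unknown operand would make the whole expression unknown.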
notresidenter|2 years ago
UTF-8 strikes again.