languagehacker | 10 years ago
How would this work for alterations to an existing database?
I don't think it would, at least not for an unsharded relational database.
Anyhow, just something to think about. NoSQL's magic, right?
pkaeding | 10 years ago
You can definitely use feature flags to slowly (and controllably) roll out an alteration to an existing database, but exactly what it would look like depends on the alteration.
If you are going to make a schema migration with no downtime (even if you are not feature-flagging it), you will need to make the code work with both the old and the new schema for a period anyway. If you are feature-flagging it, that period is simply extended, and you roll out the migration slowly. The 'new' thing I introduce with this approach is really just the integrity check (and the gradual rollout, which lets you monitor for errors and performance issues).
For example, if your migration involves:
- adding/removing a field: your code needs to handle the field being absent. This usually means falling back to a reasonable default value.
- restructuring: this is a broad category, so let's work with a concrete example: you currently have a nested (unbounded) repeating sub-object, and you want to break those sub-objects out into their own top-level documents in another table/collection. In this case, you have a DAO function `getChildren(parent)` that uses the feature flag to decide whether it should query the parent and pull out the nested list of children, or load the children from their separate collection. You can save the children in both places and do dual reads as well, comparing the results. When the cutover is complete, you run a script that removes the nested children (instead of decommissioning the old database, as in the example in the blog post).
Ultimately, the concept is the same: you do the work in both places for a period of time and check the results. If there are any discrepancies, you still have the fully functional original data source.
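A minimal sketch of that dual-read DAO, with in-memory dicts standing in for the real database client (the names `parents`, `children_by_parent`, and the flag are all hypothetical, not from the blog post):

```python
# Feature flag: when True, also read from the new top-level collection
# and compare it against the nested (authoritative) copy.
FLAG_READ_FROM_NEW = True

# Old shape: children nested inside the parent document.
parents = {"p1": {"id": "p1", "children": [{"id": "c1"}, {"id": "c2"}]}}

# New shape: children as top-level documents, keyed by parent id.
children_by_parent = {"p1": [{"id": "c1"}, {"id": "c2"}]}

def get_children(parent_id):
    old = parents[parent_id].get("children", [])
    if not FLAG_READ_FROM_NEW:
        return old
    new = children_by_parent.get(parent_id, [])
    # Integrity check: compare both sources and log any discrepancy,
    # but keep serving the old (authoritative) data until cutover.
    if new != old:
        print(f"integrity error for parent {parent_id}: sources disagree")
    return old
```

During rollout you dial the flag up for a growing slice of traffic; any logged discrepancy means the dual writes are not keeping the two shapes in sync.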
drichelson | 10 years ago
pkaeding | 10 years ago
Yeah, at that point, you are doing two writes, and (possibly) two reads. I simplified the code a bit for the blog post, but you can do the read and write pairs concurrently, so you don't have to wait longer to get data to return to the caller.
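One way to sketch the concurrent write pair, assuming a thread pool and placeholder store functions (none of this is the blog post's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

old_store, new_store = {}, {}

def write_old(key, value):
    old_store[key] = value
    return "old-ok"

def write_new(key, value):
    new_store[key] = value
    return "new-ok"

def dual_write(key, value):
    # Issue both writes at once; total latency is roughly the max of the
    # two round trips rather than their sum.
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_old = pool.submit(write_old, key, value)
        f_new = pool.submit(write_new, key, value)
        return f_old.result(), f_new.result()
```

The same shape works for the read pair: fire both reads, wait for both, compare, and return the authoritative one.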
albertmw | 10 years ago
ivansavz | 10 years ago
The new DB is not authoritative until the final step of the switchover, so you could easily scrap it at any point.
pkaeding | 10 years ago
Because we continued writing 100% of the data to Mongo until we were certain that everything was working, it was always safe to stop using Dynamo (or route less traffic to it). In the canary-read phase, we always returned the Mongo data to the caller: we read from both sources, compared the data, logged an error if there was a difference, and discarded the Dynamo data. In this way, the Dynamo data was kind of throw-away until we were confident in everything.
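A sketch of that canary-read shape, with the two reader functions as placeholders for the real Mongo and Dynamo clients (the names here are illustrative, not the actual implementation):

```python
def read_mongo(key):
    # Placeholder for the authoritative (Mongo) read.
    return {"key": key, "v": 1}

def read_dynamo(key):
    # Placeholder for the shadow (Dynamo) read.
    return {"key": key, "v": 1}

def canary_read(key):
    primary = read_mongo(key)
    shadow = read_dynamo(key)
    # Log a mismatch but never let the shadow data reach the caller.
    if shadow != primary:
        print(f"canary mismatch for {key}: {primary!r} != {shadow!r}")
    return primary
```

Because the shadow result is only ever compared and discarded, a broken new store shows up as log noise rather than as bad data served to users.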