Ask HN: How do you roll back production?
Do you roll forward? Flip DNS back to the old deployment? Click the button in Heroku that takes you back to the previous version?
[+] [-] aasasd|6 years ago|reply
For the database, during a migration we didn't synchronize code with one version of the db. Database structure was modified to add new fields or tables, and the data was migrated, all while the site was online. The code expected to find either version of the db, usually signaled by a flag for the shard. If the changes were too difficult to do online, relatively small shards of the db were put on maintenance. Errors were normally caught after a migration of one shard, so switching it back wasn't too painful. Database operations for the migrations were a mix of automatic updates to the list of table fields, and data-moving queries written by hand in the migration scripts—the db structure didn't really look like your regular ORM anyway.
This approach served us quite well for the roughly five years I was there, with a lot of visitors and data, dozens of servers, and multiple deployments a day.
[+] [-] ransom1538|6 years ago|reply
This is the way to go. Have your root web directory be a symlink, e.g. /var/www/app -> /code_[git_hash]/. You can whip through a thousand VMs in less than a second with this method: connect, change the symlink. The other options (pushing out a new code branch, reverting with git, launching new VMs with reverted images, rsyncing with overwriting) are slower, and more dangerous on prod.
"For the database, during a migration"
There is no such thing as a database migration on prod. There is just adding columns. Code should work with new columns added at any point. Altering a column or dropping a column is extremely dangerous.
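The symlink flip described above can be sketched as follows; the paths and hashes are illustrative, and the `mv -T` trick makes the swap a single atomic rename on Linux:

```shell
# Simulate the cutover in a scratch directory; in production the link
# would be /var/www/app and the targets /code_<git_hash>/ (illustrative).
set -eu
root=$(mktemp -d)
mkdir -p "$root/code_abc123" "$root/code_def456"
ln -s "$root/code_abc123" "$root/app"    # current live code

# Deploy or rollback: create the new symlink beside the old one,
# then rename it over the live one in a single atomic rename().
ln -s "$root/code_def456" "$root/app.tmp"
mv -T "$root/app.tmp" "$root/app"

readlink "$root/app"    # now points at code_def456
```

Fanning that one rename out over SSH to the fleet is what makes a thousand-VM rollback take about a second.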
[+] [-] aasasd|6 years ago|reply
Or you could probably just use overlayfs or something like that.
Note that, while containers can give you COW, they shouldn't really be necessary for the files of the app. And with php-fpm, changing the app dir is faster than restarting containers: it's actually done via changing the Nginx config.
[+] [-] robbya|6 years ago|reply
Generally, going back to a known clean state should be easier, safer and relatively quick (DNS flip is fast, redeploy of old code is fast if your automation works well).
In some cases, changes to your data can mean that rolling back causes even more problems. I've seen that happen; we were stuck doing a rapid hotfix for a bug, which was ugly. Afterwards we did a lot more review to ensure we avoided breaking rollback. So I'd advise code review and developer education about that risk.
[+] [-] karka91|6 years ago|reply
Nowadays - flip a toggle in the admin. Deployments and releases are separated.
Made a major blunder? In the Kubernetes world we do "helm rollback". It takes seconds. This allows for a super fast pipeline, and a team of 6 devs pushes out around 50 deployments a day.
Pre-Kubernetes it was an AWS pipeline that would start up servers with old commits. We'd catch most of the problems in the blue/green phase, though. Same team, maybe 10 deployments a day, but I think that was still pretty good for a monolith.
Pre-AWS we used deployment tools like Capistrano. Most of the tools in this category keep multiple releases on the servers and a symlink to the live one. If you make a mistake: run a command to delete the symlink, ln -s the old release, restart the web server. Even though this is the fastest rollback of the bunch, the ecosystem was still young and we'd do 0-2 releases a day.
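The Capistrano-style releases layout can be sketched like this (directory names and the restart command are illustrative):

```shell
set -eu
deploy=$(mktemp -d)    # stands in for something like /var/www/myapp
mkdir -p "$deploy/releases/20190601" "$deploy/releases/20190602"
ln -s "$deploy/releases/20190602" "$deploy/current"    # live release

# Rollback: repoint "current" at the previous release, restart the app.
ln -sfn "$deploy/releases/20190601" "$deploy/current"
# systemctl restart myapp    # would pick up the old code

basename "$(readlink "$deploy/current")"    # -> 20190601
```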
[+] [-] korijn|6 years ago|reply
Would you mind explaining this a little further? How does the separation allow you to flip a switch in the admin?
[+] [-] DoubleGlazing|6 years ago|reply
My last job had an extra layer of security. As a .NET house, all new deployments were sent to Azure in a zip file. We backed those up and maintained FTP access to the Azure app service. If a deployment went really wrong and we couldn't wait the 10-20 minutes for the CI pipeline to process a revert, we'd just switch off the CI process and FTP-upload the contents of the last known good version.
Of course, if there were database migrations to deal with then all hell could break loose. Reverting a DB migration in production is easier said than done, especially if a new table or column has already started being filled with live data.
To be fair though, most of the problems I encountered were usually as the result of penny pinching by management who didn't want to invest in proper deployment infrastructure.
[+] [-] drubenstein|6 years ago|reply
It's more often been the case for us that issues are caused by mistaken configuration / infrastructure updates. We do a lot of IAC (Chef, Cloudformation), so with those, it's usually a straight git revert and then a normal release.
[+] [-] EliRivers|6 years ago|reply
"Production"? Does that mean something that goes to the customers? Very few of our customers keep up with releases so it's generally not a big deal. We can have a release version sitting around for weeks before any customer actually installs it; some customers are happy with a five year old version and occasional custom patches.
I bet it's a bigger problem for those for whom the product is effectively a running website, but those of us operating a different software deployment model have a different set of problems.
[+] [-] ksajadi|6 years ago|reply
We use Cloud 66 Skycap for deployment which gives us a version controlled repository for our Kubernetes configuration files as well as takes care of image tags for each release.
[+] [-] protonimitate|6 years ago|reply
Which.. isn't as bad as I thought it would be (so far).
[+] [-] MaxGabriel|6 years ago|reply
If it’s not urgent we’d just revert with a PR though and let the regular deploy process handle it.
The frontend we deploy with Heroku, so we roll back with the rollback button or the Heroku CLI. Unfortunately we don't have something set up where the frontend checks whether it's on the correct version, so people will get the bad code until they refresh.
[+] [-] adev_|6 years ago|reply
Same for us, but we use nixpkgs directly over CentOS. Nix is perfect for rollback. It can be done on an entire cluster in seconds.
For the DB, We use schemaless DBs with Devs that care about forward and backward compatibility.
[+] [-] folkhack|6 years ago|reply
Code rollbacks are simple as heck: I just keep the previous Docker container(s) up as a potential rollback target, and/or have a symlink-cutover strategy for the webservers. I use GitLab CI/CD for the majority of what I do, so the SCM is not on the server; code is deployed as artifacts (either a clean, tested container and/or a .tar.gz). If I need to roll back it's a manual operation, but I want to keep it that way, because I'm a strong believer in not automating edge cases, which is what running rollbacks through your CI/CD pipeline is.
Also for code I've been known to even cut a hot image of the running server just in case something goes _really_ sideways. Never had to use it though, and I will only go this far if I'm making actual changes to the CI/CD pipeline (usually).
The biggest concern for me is database changes. You may think I'm nuts, but I have been burnt _sooooo_ bad on this (we were all young and dumb at one time, right?), so I have multiple points of "oh %$&%" solutions. The first is good migrations: yeah, yell at me if you wish, but I run things like Laravel for my APIs, and its migration rollbacks can take care of simple things. TEST YOUR ROLLBACK MIGRATIONS! The second solution is that I cut an actual read slave for each and every update of the application and then segregate it, so that I have a "snapshot" that is at most 1-2 hours out of date.
Having redundancy for your redundancy is my motto... and although my deployments take 1-3 hours for big changes (cutting hot images of the running server, building/isolating an independent DB slave, shuffling containers, etc.), I've never had a major "lights out" issue that lasted more than an hour.
[+] [-] dxhdr|6 years ago|reply
I imagine that much larger operations likely do feature flags or a rolling release so that problems can be isolated to a small subset of production before going wide. But still the same principle, redeploy with different code.
[+] [-] thih9|6 years ago|reply
but smaller, routine deploys with unexpected failures could be just as dangerous.
[+] [-] technological|6 years ago|reply
Set up an environment with the previous version of the production code (the one without the issue), then use the load balancer to switch traffic to this new environment.
[+] [-] KaiserPro|6 years ago|reply
The app is Docker, so we have a tag called app-production, plus app-production-1 (up to 5), which are the previous production versions. If anything goes wrong, we can flip over to the last known good version.
We are multi-region, so we don't update all at once.
The dataset is a bit harder. Because it's >100 GB, and for speed purposes it lives on EFS (it's lots of 4 MB files, and we might need to pull in 60 or so files at once; access time is rubbish using S3), manually syncing it takes a couple of hours.
To get around this, we have a copy-on-write system with "dataset-prod" and "dataset-prod-1" up to 6. Changing the symlink of the top-level directory takes minimal time.
[+] [-] jekrb|6 years ago|reply
We were able to do a manual rollback for each deployment from the GitLab UI.
https://docs.gitlab.com/ee/ci/environments.html#retrying-and...
Disclaimer: I work at GitLab now, but my old agency was also using GitLab and their CI/CD offering for client projects for a couple years while I was there.
At that agency they have even open sourced their GitLab CI configs :) https://gitlab.com/digitalsurgeons/gitlab-ci-configs
[+] [-] folkhack|6 years ago|reply
I can't say enough great things about it; solutions like Jenkins and Travis CI just feel antiquated and clunky these days. I always thought it wouldn't be worth running CI/CD on my personal projects, given the complexity of setting these solutions up, until I saw the light: I had a coherent "one-click" deploy set up from scratch within an hour with GitLab.
[+] [-] ryanthedev|6 years ago|reply
It's just doing another deployment. It doesn't matter what version you are deploying.
That's the whole point.
My teams go into their CI/CD platform and just cherry pick which build they want to release.
[+] [-] mooreds|6 years ago|reply
[+] [-] atemerev|6 years ago|reply
There are two identical prod servers/cloud configurations/datacenters: blue and green. Each new version is deployed alternately to the blue and green areas: if version N is on blue, version N-1 is on green, and vice versa. If a critical issue happens, rolling back is just switching the front router/balancer to the other area, which can be done instantly.
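A minimal sketch of that flip, assuming the balancer reads one config file naming the live pool (the file name, pool names, and reload command are all hypothetical):

```shell
set -eu
conf=$(mktemp)
# Version N is live on the blue area:
printf 'upstream app { server blue-pool:8080; }\n' > "$conf"

# Critical issue on blue: point the front balancer at green (version N-1).
printf 'upstream app { server green-pool:8080; }\n' > "$conf"
# nginx -s reload    # would apply the change

grep -o 'green-pool' "$conf"    # the other area is now live
```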
[+] [-] gfodor|6 years ago|reply
Any mechanism for rollbacks that isn't tested continuously is likely to fail during incident response. It's a huge anti-pattern to have 'dark' processes only used during incident response -- same thinking behind why you should also be continually testing your backups, continuously killing servers to verify recovery, etc.
[+] [-] ericol|6 years ago|reply
Whenever we need to roll back something, we just use the corresponding GitHub feature to revert a merge, and that is automatically shoved into production using GH hooks and stuff.
Again, we have a rather easy and ancient deploy system, and it just works.
We do several updates a week if needed. We try to avoid late Friday afternoon merges, but with a couple alerts here and there (Mostly, New Relic) we have a good coverage to find out about problems.
[+] [-] emptysea|6 years ago|reply
For other issues we press the rollback button in the Heroku dashboard.
Heroku has its problems: buildpacks, reliability, cost, etc, but the dashboard deploy setup is pretty nice.
[+] [-] perlgeek|6 years ago|reply
Since the question of database migrations came up: We take care to break up backwards incompatible changes into multiple smaller ones.
For example, instead of introducing a new NOT NULL column directly, we first introduce it as NULLable, wait until we are confident that we won't want to roll back to a software version that leaves the column empty, and only then change it to NOT NULL.
It requires more manual tracking than I would like, but so far, it seems to work quite well.
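Spelled out as the SQL run at each stage (the table and column names are hypothetical, and the ALTER syntax shown is PostgreSQL's; here the stages are just printed, since in practice each one runs via your migration tool at a different point in time):

```shell
# Print the three stages of the expand/contract migration.
cat <<'SQL'
-- 1. Expand: add the column as NULLable; old and new code both work.
ALTER TABLE orders ADD COLUMN currency text;

-- 2. New code always writes the column; backfill the existing rows.
UPDATE orders SET currency = 'USD' WHERE currency IS NULL;

-- 3. Contract: only once no rollback target leaves the column empty.
ALTER TABLE orders ALTER COLUMN currency SET NOT NULL;
SQL
```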