top | item 27562613

The wrong way to switch operating systems on your server

109 points | ohjeez | 4 years ago | figbert.com

73 comments

[+] klodolph|4 years ago|reply
These experiences are relatively common, we just don't see writeups that often. Kudos to the author for writing this up. I do disagree with some of the lessons here.

You want to switch operating systems on your server?

1. Set up monitoring. If you don't have monitoring already, hack something together with simple scripts.

2. Start up a new server.

3. Migrate services and data from your existing server to your new server. Do this at whatever pace you feel is appropriate. Point the monitoring scripts at your new server.

4. Switch DNS records.

5. Wait. You are not in a hurry to turn off the previous server. Why not wait one or two months?

6. Turn off the old server.

7. Wait some more, and then delete the old server.
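
Step 1's "hack something together" can be as small as a cron'd probe. A minimal sketch, with a placeholder URL and a stderr message standing in for real alerting:

```shell
#!/bin/sh
# Minimal monitoring sketch: probe a URL, alert on failure.
# The URL and the alert mechanism are placeholders, not prescriptions.

URL="https://example.com/"

probe() {
    # --fail makes curl exit non-zero on HTTP errors too.
    curl --silent --fail --max-time 10 "$1" >/dev/null
}

alert() {
    # Replace with mail(1), a webhook, or whatever you have handy.
    echo "ALERT: $1" >&2
}

# Dropped into cron every minute, this counts as "monitoring":
#   * * * * * /usr/local/bin/check.sh
if ! probe "$URL"; then
    alert "health check failed for $URL at $(date -u +%FT%TZ)"
fi
```

Point the same probe at the new server in step 3 and you'll notice quickly if the migration broke something.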

The idea here is that steps which might take your site offline are easily reversible. For example, switching DNS records. It's trivial to switch the DNS records back if your migration unexpectedly failed. As much as reasonable, you want the ability to go backwards and undo the steps that you've done to get back to a known good setup.

In particular, I would say that backups are usually not the right tool for migration. This is missing from the lessons at the bottom of the article. The way you get more confidence in your backups is by doing restore tests into a sandbox environment, by adding automated monitoring to your backups, etc. Trying to address a lack of confidence by increasing the backup frequency doesn't make sense. The backup frequency is the most trivial thing to adjust and doesn't address deeper issues, like the fact that you need to dump/restore databases properly and shouldn't copy files from a live database. These issues are discovered through restore testing.

The saying goes, "Nobody wants a backup system, everyone wants a restore system." If you are making backups but not testing restores, you're gonna get bitten. Test the part of backups that you care about--the ability to restore data--and don't test it live. Test it in a sandbox.
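
A restore test like that is easy to script. A minimal sketch, using plain tar as a stand-in for the real backup tool and entirely made-up paths:

```shell
#!/bin/sh
# Restore-test sketch: restore into a throwaway sandbox, then assert
# that the files you actually care about came back intact.
set -eu

work=$(mktemp -d)

# Stand-in for the real backup job (tarsnap, borg, restic, ...).
mkdir -p "$work/live/app"
echo "SECRET_KEY=hunter2" > "$work/live/app/.env"
tar -czf "$work/backup.tar.gz" -C "$work/live" app

# The restore test proper: extract into a sandbox, never over live data.
mkdir "$work/sandbox"
tar -xzf "$work/backup.tar.gz" -C "$work/sandbox"

# Assert the dotfile is present and byte-identical to the original.
cmp "$work/live/app/.env" "$work/sandbox/app/.env"
echo "restore test passed"
```

Run something like this regularly against your newest real archive and the "missing .env" class of surprise shows up the same day, not mid-migration.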

[+] livueta|4 years ago|reply
This approach saved my bacon on a migration just yesterday. I had gotten to testing files after having done a big baseline rsync, stopping file services on the old server, and doing a catch-up rsync incremental. Oh shit - one veracrypt container is corrupted and won't mount. Turns out rsync diff updates and mounted containers being written to don't play nice together.

Since the old machine was still sitting there with all the data accessible, I was able to just blow away the corrupted volume, confirm it was unmounted on the source, then copy the whole thing over. If I just had a one-time copy that I'd thrown at b2 or something, I would have been very sad.

So, yeah, test restoring your backups. Even fancy checksumming filesystems and shit won't save you from bad assumptions about the integrity of your data.

[+] CogitoCogito|4 years ago|reply
Yeah, I was legitimately confused about why there was a problem until I realized the author had installed the new OS on the same server as the old one. I can't understand why that was done. Before this, I wouldn't even have called your method of moving services over one by one while keeping the old setup available a "best practice," because I thought it was so obvious. I hope the original author realizes that this should really be the lesson learned from the story.
[+] neurostimulant|4 years ago|reply
I would also set up a reverse proxy on the old server so it forwards all traffic to the new server, and finally kill the old server once the reverse proxy no longer sees any legitimate requests.
[+] cfn|4 years ago|reply
I do a variation of this when moving to a new developer machine. I virtualize the previous one and keep it around for a long time.
[+] gperciva|4 years ago|reply
Wow, that's much more technically advanced than I was as a teenager! Way to go!

To print progress with tarsnap 1.0.39, send it a SIGUSR1 or SIGINFO. On FreeBSD, you can do this by pressing ctrl-t. On Linux, you have to use the unfortunately-named `kill` or `killall` command, such as

killall -SIGUSR1 tarsnap

https://www.tarsnap.com/tips.html#check-current

(Note that Tarsnap is not responsible for naming the unix `kill` or `killall` commands.)

In the unreleased git version of tarsnap, there's a `--progress-bytes SIZE` option, which prints a progress message after every SIZE bytes are processed.

As a general note: the tarsnap-users mailing list is a great place to ask for tips. As you mentioned in your lessons learned, some of the options could have helped a lot (such as `--recover`) https://www.tarsnap.com/lists.html

(Disclaimer: I'm employed by Tarsnap Backup Inc.)

[+] bigiain|4 years ago|reply
"employed by"

<grin>

[+] pinkythepig|4 years ago|reply
I feel like a lot of this article really should be about how bad tarsnap is. It defaults to no progress updates, has no built-in multithreading, has failures around large files, can't back up symlinks properly, has no built-in way to detect an in-progress restore (so you have to manually tell it to resume), etc.

If tarsnap didn't have such bad UX, this entire article would instead have been 'the time I forgot to back up my .env file'; none of the other issues would have occurred.

[+] Negitivefrags|4 years ago|reply
I agree. Used to use tarsnap but the speed is garbage.

It’s so bad that actually restoring anything of any size is just not viable.

[+] justin_oaks|4 years ago|reply
Don't consider it a backup until you've successfully restored the data from it.

The first thing I do after setting up a new data backup is test a restore of the data. Only after that will I feel confident that the backup procedure works right.

In the article author's case, an attempt to restore would have caught the problem of the missing .env files and the large movie files.

As for the Ctrl-C on both the backup and restore: you should check your I/O (network and disk) before terminating a process. Doing that would have confirmed that the process was still going, and indicated the rate at which it was progressing.
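
Linux exposes exactly the counters needed for that check. A sketch; a real backup's PID would go where $$ is (the script inspects its own shell only so it runs as-is):

```shell
#!/bin/sh
# Is a process still doing I/O? Sample its cumulative byte counters
# from /proc twice and compare. Use the real PID (e.g. pgrep tarsnap)
# instead of $$, which is only here so the sketch is runnable.
pid=$$
grep -E '^(read|write)_bytes' "/proc/$pid/io"
sleep 1
grep -E '^(read|write)_bytes' "/proc/$pid/io"
# If the numbers grew between samples, the process is still working;
# identical numbers mean it really is stalled.
```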

[+] AnimalMuppet|4 years ago|reply
I'd even go a step further. It's not a backup until you've restored it using different hardware. You really want to know that your tape (or whatever) can be read by a different tape reader than the one that wrote it.
[+] geoduck14|4 years ago|reply
> Don't consider it a backup until you've successfully restored the data from it.

Ouch! I back up every 2 hours - should I REALLY restore from each of those?

[+] spsesk117|4 years ago|reply
+1. `strace` can be very helpful here, to see if a process is stuck waiting on something or whether it's just zooming along with no output.
[+] wildmanx|4 years ago|reply
> I woke and the backup was finished! I wiped the VPS

Just this single line made me scream in horror.

Let me get this straight. He launched some backup command, it didn't output anything for hours, he suspected it hadn't done anything, aborted with ctrl-c, and then learned that he aborted it at 90%. Wipes the partial backup, starts again.

After _that_ experience, he blindly trusts the result of _the same tool_, blindly wipes everything? Wtf.

[+] wildmanx|4 years ago|reply
Spoiler: That's not even part of his "lessons learned".

I know whom I won't hire for my company IT or devops or whatnot.

[+] Dylan16807|4 years ago|reply
> After _that_ experience, he blindly trusts the result of _the same tool_, blindly wipes everything? Wtf.

You imply that there's something wrong with the tool because he aborted early the first time? Why?

I would check the backup too but why emphasize "the same tool" like that?

It has a -v flag.

[+] ajnin|4 years ago|reply
My current backup strategy is to back up the whole filesystem. I run services in actual VMs, not Docker containers, with disks mounted from LVM volumes, which allows me to take snapshots and back them up live without needing to shut anything down. I'm using bup to do the actual backups, to a server I keep in my home. I wrote a few custom scripts to back up and restore servers and keep a history of the last x days, y weeks, and z months. That gives me more time to figure out if something's wrong, as it's hobby stuff that I'm not checking every day.

My advice for OP would be to 1/ ditch tarsnap. A backup tool that runs for hours without any feedback? A restore tool that fails if the files are too big? Everything extremely slow? Just forget it. 2/ keep more than 3 days of backups. Three days is too short if you make a mistake; it took three days to recover from this one already. 3/ back up everything. Don't try to pick and choose files: you're likely to forget something, and if not now, then some time later when you create a new file but forget to add it to the list of things to back up.

[+] geofft|4 years ago|reply
> My terminal sat empty for hours. There were no changes – the process was running, but there was no feedback. I was nervous.

> What if it failed silently? How can I check? What should I do?

On Linux, find the process ID and run e.g.

    ls -l /proc/12345/fd
which will show you all the files currently open by the process. For something like a backup of a whole directory, or something generating a lot of output, run it again a few seconds later. If it's opened different files, then you know it's making progress and it's not stuck.

If it's something that operates on a single file, find the number corresponding to that file in the list (the file descriptor), and run e.g.

    cat /proc/12345/fdinfo/3
which will output a "pos" field showing the position in the file, in bytes. Compare it with the actual size of the file, and also run it again a few seconds later to see how fast it's making progress.
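
A runnable illustration of the offset trick; since there's no real backup process handy here, the script opens a scratch file on fd 3 itself:

```shell
#!/bin/sh
# Read the byte offset ("pos") of an open file descriptor. For a real
# process you'd use its PID and fd number; the scratch file and fd 3
# here are only so the demo runs standalone.
tmp=$(mktemp)
head -c 100000 /dev/zero > "$tmp"

exec 3< "$tmp"                  # open the file on fd 3
head -c 12345 <&3 > /dev/null   # consume part of it, like a slow reader

# /proc/PID/fdinfo/FD reports the current offset in bytes.
pos=$(awk '/^pos:/ {print $2}' "/proc/$$/fdinfo/3")
size=$(stat -c %s "$tmp")
echo "reader is at byte $pos of $size"

exec 3<&-
rm -f "$tmp"
```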

(You can also use strace, but that slows down your program and potentially changes how it behaves in extremely unusual cases, so it isn't the first thing I'd reach for unless I really think the program is misbehaving and I want to see what it's doing in more detail. And there are tools like iostat too, but they're systemwide.)

[+] Dylan16807|4 years ago|reply
Also there's the simple method of using iotop or a network equivalent.
[+] Shank|4 years ago|reply
Far and above, the best strategy is to spin up the new server, scp/rsync the data to it directly from the old server, then boot services, and only decommission once you've moved all DNS over and confirmed the new site is working. Using Tarsnap for this is not only time-consuming but needless unless you already have it set up and working.
[+] roywashere|4 years ago|reply
Using tarsnap has one big advantage: it proves that you can recover from your backups. Using this method caused the OP to realize the backups were there but the secrets were missing!

I agree with keeping the old server in place until the new one is working obv

[+] neurostimulant|4 years ago|reply
My strategy is to set up the new server and, once everything is ready, stop the service on the old server, replace it with a reverse proxy to the new server, and THEN finally update the DNS record. When the old server no longer sees any meaningful traffic, I'll decommission it. This way I don't have to deal with the two servers being out of sync.
[+] isatty|4 years ago|reply
Why tarsnap when Wasabi or Backblaze would be significantly cheaper? You can just encrypt it yourself anyway.

Also I run my own personal infra and here’s what I do:

* treat servers as cattle, not pets. This is really important. Have mandatory reboots, never be afraid of reboots.

* preferably do things with an automation method, I use ansible for n=5 but pick whatever you like

* have SOME monitoring. It’s not too hard to throw up prom+grafana so get on it early.

* VPN instead of securing internal services individually. The attack surface is way too high if you have too many services. Just throw them all behind a VPN and expose selectively. I use WireGuard.

* personally: don’t self host critical infrastructure. I can’t afford downtime on email etc so I rather just pay someone to host that. Personal infra is for fun, not a second job (and I’m an SRE).

[+] MiscIdeaMaker99|4 years ago|reply
> Have mandatory reboots, never be afraid of reboots.

Yes! It is not a badge of honor to have a server that's been online for 365+ days -- it just means you haven't applied any kernel security updates for a long, long time. :-p

[+] simonblack|4 years ago|reply
On my server, I have three 'root' partitions: one for general day-to-day use, one as a backup in case something catastrophic happens to my main system, and one for experimentation. The extra disk space taken up by the two extra 'root' partitions is a mere 40-50 gigs.

But I know that I can swap operating systems over almost instantaneously and then back again just as quickly if I did something wrong.

Great peace of mind for practically no cost.

The second error I see in the article is using software for something important that you haven't used before. My first wife had the habit of trying new recipes when we had a dinner party. I kept telling her to try the recipe on us first; then she would have it down pat when she wanted to impress.

The third error, of course, concerns restorable daily backups: having them, and using them to restore the system when need be, together with keeping the system modular.

I back up my whole system daily, but the most important part of that is not backing up the distro itself (we have re-installs for that) but backing up all the config files, all the databases, all the local binaries, and a current list of all the installed distro packages. I can rebuild the whole operating system from a complete wipe-out in less than two hours.
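
The installed-package list is one command to capture. A sketch assuming a Debian-flavored system (other distros have equivalents):

```shell
#!/bin/sh
# Record the installed-package list so a wiped machine can be rebuilt.
# The Debian/apt commands are an assumption; adapt for your distro.
out=$(mktemp)

if command -v dpkg >/dev/null 2>&1; then
    dpkg --get-selections > "$out"
else
    echo "no dpkg here; use rpm -qa, pacman -Qqe, or similar" > "$out"
fi

wc -l "$out"
# Restore on the fresh install with something like:
#   dpkg --set-selections < pkg-selections.txt
#   apt-get dselect-upgrade
```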

I store these backups in a pseudo-exponential policy. I have more recent backups, fewer older backups. Currently I have 15 backups covering 9 years, with five of those covering just the last 3 weeks, and four covering the last five days. To augment this, I have a monthly snapshot backup also stashed away.

The other stuff, personal docs etc, is deliberately kept small. Total daily backup of base system and /home is approximately 12 gigs. That is easily transported on a USB stick.

I don't store music, photos, magazine PDFs, old software, etc. in my /home directory. That stuff all goes in an archive directory that's write-once and stored (practically) forever. It gets rsynced to two external USB drives daily. Most days, practically nothing gets transferred out.

Having several times lost much valuable data, I suppose I am really paranoid, but I still think I haven't been paranoid enough.

[+] siraben|4 years ago|reply
Without breaking a sweat, I recently changed a server from Ubuntu to NixOS with access only via SSH (so mounting an ISO wasn't an option) using [0], so the switch happens live(!) and when you reboot you end up in NixOS. I've also done it on CentOS 7.5, but manually via [1].

Backups weren't necessary in this case, but it's pretty incredible what's possible when the system configuration is declarative and only /nix and /boot are needed by NixOS to boot. I highly recommend that people new to running servers try NixOS.

[0] https://github.com/elitak/nixos-infect

[1] https://nixos.org/manual/nixos/unstable/index.html#sec-insta...

[+] dmuth|4 years ago|reply
Someone tell me if I'm missing something, but isn't the whole point of hosting things in a virtual environment so that when you want to switch/upgrade OSes, you stand up a second server and start migrating apps over one at a time?

I can't understand why that wasn't done here.

[+] plorkyeran|4 years ago|reply
Well the title is "The Wrong Way to Switch Operating Systems on Your Server" so it's unsurprising that it's describing completely the wrong way to do things.

The concerning part is that the author seems to have learned the wrong lessons from doing things the wrong way.

[+] lawrenceduk|4 years ago|reply
This is a bit dumb, and I feel like if as much effort had gone into doing the migration as into writing the blog post, the outcome would have been more positive.
[+] fak3r|4 years ago|reply
For me: rsync > tarsnap

I have a backup rsync script that parses a manifest file listing every path I want backed up. Yes, this includes dotfiles, so the poster's .env file would have been backed up. My script runs locally, backs up to my main (home) server, and then does another rsync to a 'cloud' server. Want to back up a new file or path? Add it to the manifest file. Adding another server or device? Create another manifest file and have rsync write to the same directory on the server; it'll automatically get synced to the cloud server too.

[+] ghostly_s|4 years ago|reply
Spent three days fighting with this and didn't think to try the -v flag on his apparently-hanging process?
[+] rhn_mk1|4 years ago|reply
The author may not have the necessary familiarity with it. Experience is gained via mistakes too.
[+] fak3r|4 years ago|reply
My first thought also. And always run long-running jobs like this in screen or tmux! As it is, it's a good learning post that others should be able to build on (and don't hit Ctrl-C just because it's taking too long!)
[+] FpUser|4 years ago|reply
Every server of mine has a "history" in the form of a script that can rebuild it from scratch on a brand-new clean machine. Every time anything is added or modified, the script is run on a clean machine to test that everything still works as it should. My databases are all automatically backed up to separate backup servers and are restored when needed by the same script.

So for me setting up new server involves following steps:

1) Modify the config portion of the script (basically change default values if needed) and run said script on the new server

2) Test

3) Put the old server into "read only mode"

4) Run the script again with the option to restore the database only

5) Switch DNS

There are/could be slight variations, of course. In practice, I recently migrated a server for a new product from my location in Toronto to Hetzner. It took a "whopping" 15 minutes, of which 13 were taken up by restoring the database from backup.

[+] teekert|4 years ago|reply
What a nightmare! I swear I had something similar with rsync on a Mac once. I was very certain it had finished; I ran it again and it reported it was done. Then I migrated and was missing all these files! It was probably my fault, but I really don't trust rsync anymore... Maybe it had to do with HFS+ and those strange Aperture libraries, but man, it ruined my day (week).

Sure, migrating is 100 times more relaxed when you have the old system running, but sometimes you need to reinstall. Back then I had only one MacBook; now I have only one server.

What you could do in that case is just install a new disk and unplug the old one until the new system is running. It's worth the money and effort.

I'm looking to install NixOS on my home server next week. All my personal infra is in Docker Compose, on Ubuntu 20.04 atm. I only have one M.2 slot in the server and I don't want to buy a second drive just for this... so I'm sweating already. Maybe I should first migrate to my NUC, then back to the new server... hmmmm...

[+] stavros|4 years ago|reply
I had exactly the same problems, and I decided to change not my backup strategy, but my deployment strategy. I wrote a small tool to deploy everything in a single directory, using Docker Compose:

https://gitlab.com/stavros/harbormaster

It allows you to separate important state (data) from non-important state (caches). This way, all you need to do is back up the data directory, and then you can restore the Harbormaster config file (along with the data) on the new server and you're done.

[+] elondaits|4 years ago|reply
The problem with the conclusion is that he implies his error was not being careful, smart, or knowledgeable enough, but his error was not using a fool/error-proof method.
[+] zrav|4 years ago|reply
Moving containers between hosts is not a use case that Docker supports well; the prevailing advice is to delete and recreate instead. While not a replacement for Docker, LXD can move containers between hosts with a single command. In many respects I find LXD's UX more polished and its capabilities more complete, so I often recommend it, even if just to containerize Docker itself to gain some of the missing functionality...