top | item 8305912

golan | 11 years ago

The problem with rsync is that it would have to traverse the entire FS tree and check every single file on both sides for timestamp and size. In a relatively small FS tree that's just fine, but when you start having GBs and GBs and tens of thousands of files, it becomes somewhat impractical.

Then, you need to come up with other creative solutions, like deep syncing inside the FS tree, etc. Fun times.

Add checksumming, and you probably would like to take a holiday whilst it copies the data :-)


lamontcg|11 years ago

if you want to copy everything and there's nothing at the target:

rsync --whole-file --ignore-times

that should turn off the metadata checks and the rsync block checksum algorithm entirely and transfer all of the bits at the source to the dest without any rsync CPU penalty.

Also, for this purpose it looks like -H is required to preserve hard links, which the man page notes:

"Note that -a does not preserve hardlinks, because finding multiply-linked files is expensive. You must separately specify -H."

It'd be mildly interesting to see a speed test between rsync with these options and cp.

there are also utilities out there to "post-process" and re-hardlink everything that is identical, so that a fast copy without preserving hardlinks and then a slow de-duplication step would get you to the same endpoint, but at an obvious expense.
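A quick way to check the -H behaviour on a scratch tree (GNU stat is assumed for the inode query; paths are illustrative):

```shell
# Sketch: -H preserves hard links; without it the two names would
# arrive at the destination as two independent copies.
src=$(mktemp -d) && dst=$(mktemp -d)
echo "data" > "$src/a"
ln "$src/a" "$src/b"            # a and b now share one inode

rsync -aH "$src/" "$dst/"

# Same inode number for both names => the link survived the copy.
stat -c '%i' "$dst/a" "$dst/b"
```

Tools such as `hardlink` or `rdfind` are examples (named here for illustration) of the post-processing de-duplicators mentioned above.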

lunixbochs|11 years ago

I use rsync to replicate 10TB+ volumes without any problems. It's also very fast to catch up on repeat runs, so you can ^C it without being too paranoid.

matt-attack|11 years ago

It sounds like OP had large hard-link counts, since he was using rsnapshot. It's likely that he simply wouldn't have had the space to copy everything over to the new storage without hard-links.

aruggirello|11 years ago

You cannot exaggerate what rsync can do... I use it to backup my almost full 120GB Kubuntu system disk daily, and it goes through 700k files in ~2-3 minutes. Oh, and it does it live, while I'm working.

lambda|11 years ago

Yep, my answer when cp or scp is not working well is always to break out rsync, even if I'm just copying files over to an empty location.

I've had good luck with the -H option, though it is slower than without the option. I have never copied a filesystem with nearly as many files as the OP with -H; the most I've done is probably a couple million, consisting of 20 or so hardlinked snapshots with probably 100,000 or 200,000 files each. rsync -H works fine for that kind of workload.

beagle3|11 years ago

> In a relatively small FS tree that's just fine, but when you start having GBs and GBs and tens of thousands of files, it becomes somewhat impractical.

I regularly rsync machines with millions of files. On a slow portable 2.5" 25MB/sec USB2 connection it's never taken me more than 1hr on completely cold caches to verify that no file needs to be copied. With caches being hot, it's a few minutes. And on faster drives it's faster still.

Unless you are doing something weird and force it to checksum, checksumming mostly kicks in on files that have actually changed and can be transferred efficiently in parts (i.e. network; NOT on local disks). In other cases, it's just a straight copy.

Have you actually used rsync in the setting you describe, or at least read its manual?

lysium|11 years ago

> The problem with rsync is that it would have to traverse the entire FS tree and check every single file on both sides for timestamp and size.

- In this scenario, the receiving side is empty, so there is no need to check every single file.

- Since 3.0, rsync walks the dirtree while copying (unless you use some special options). So, in some sense, rsync is already "deep syncing inside the FS tree", as you put it.

Twirrim|11 years ago

rsync since version 3 has stopped doing the full tree traversal. It's not in RHEL/CentOS 5, but it is in 6, and it's trivial to compile by hand anyway. That's made life a lot easier for me in the past.

Some of the other features in rsync would seem to make it a more logical fit for this task anyway: it's resumable, it supports on-the-fly compression (though the value of that depends on the types of files being transferred), native logging, etc.

KaiserPro|11 years ago

We use 10 rsyncs in parallel to copy 0.5PB in less than 3 days.

CPU is cheap.

spydum|11 years ago

Have done the same; found that rsync doesn't natively parallelize itself, so spread across 20 cores it really screamed.