Show HN: S3HyperSync – Faster S3 sync tool – iterating with up to 100k files/s
49 points| Starofall | 1 year ago |github.com
Feedback and contributions are welcome!
iknownothow|1 year ago
> For uploads, s5cmd is 32x faster than s3cmd and 12x faster than aws-cli. For downloads, s5cmd can saturate a 40Gbps link (~4.3 GB/s), whereas s3cmd and aws-cli can only reach 85 MB/s and 375 MB/s respectively.
[1] https://github.com/peak/s5cmd
Starofall|1 year ago
For an S3->S3 sync using a c6gn.8xlarge instance, I got up to 800 MB/s with 64 workers, but the files averaged only around 50 MB, and the bigger the file, the higher the MB/s.
Also, from my short look into it, s5cmd does not support syncing between S3 providers (S3->Cloudflare).
kapilvt|1 year ago
Starofall|1 year ago
sam_goody|1 year ago
Also, would this work well when there is not a lot of room on the disk it is syncing from? I have had serious issues with the S3 CLI in that scenario.
Also, how would this compare to something like rclone?
Starofall|1 year ago
The good news is that with S3 over HTTP you should not really run into byte flip issues.
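One reason HTTP-level corruption is rarely a worry: S3 returns an ETag with each object, and for single-part uploads that ETag is the hex MD5 of the body, so a client can spot-check integrity after a transfer. A minimal sketch (not from the tool itself; the multipart-ETag caveat is noted in the comment):

```python
import hashlib

def etag_matches(body: bytes, etag: str) -> bool:
    """For single-part uploads, the S3 ETag is the hex MD5 of the body.
    (Multipart ETags use a different 'md5-of-part-md5s-<count>' scheme,
    so this check only applies to single-part objects.)"""
    return hashlib.md5(body).hexdigest() == etag.strip('"')

# Example: verify a downloaded body against the ETag header value
assert etag_matches(b"hello world", '"5eb63bbbe01eeed093cb22bb8f5acdc3"')
```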
The sync server does not need any file-system storage: it processes all uploads in memory and only ever buffers 5 MB per worker for multipart uploads.
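The per-worker buffering described above can be sketched as re-chunking an incoming byte stream into fixed-size multipart parts, holding at most one part in memory at a time (a simplified illustration, not the tool's actual code; 5 MiB is S3's minimum part size):

```python
from typing import Iterator

PART_SIZE = 5 * 1024 * 1024  # S3's minimum multipart part size

def iter_parts(chunks: Iterator[bytes], part_size: int = PART_SIZE) -> Iterator[bytes]:
    """Re-chunk an incoming byte stream into fixed-size multipart parts,
    buffering at most one part's worth of bytes (what one worker holds)."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        while len(buf) >= part_size:
            yield bytes(buf[:part_size])
            del buf[:part_size]
    if buf:  # final, possibly smaller, part
        yield bytes(buf)

# Example with a tiny part size for illustration
parts = list(iter_parts(iter([b"abc", b"defg", b"h"]), part_size=4))
# parts == [b"abcd", b"efgh"]
```

Each part would then be handed to an UploadPart call, so disk space on the sync host is never needed.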
rclone looks like a good alternative, but without the same focus on fast iteration for e.g. daily backups of huge buckets.
toomuchtodo|1 year ago
Starofall|1 year ago
asyncingfeeling|1 year ago
Seemingly not the intended use case, and I might be overlooking something, but some nice-to-have features which the `aws s3 sync` tool has and I'd personally miss:
- profiles
- local sync
Starofall|1 year ago
Local sync is on the idea list, but it's not that simple - local folders do not have the same "paginate all items in lexicographic order as it would look on S3" behaviour ^^
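For comparison, a local tree can be made to look like an S3 listing by building `/`-separated keys and sorting them; the catch is that ListObjectsV2 guarantees this order natively and streams it in pages, while a local walk has to materialize and sort. A rough sketch of the mismatch (illustrative only; S3 sorts raw UTF-8 key bytes, which matches Python string sorting for ASCII keys):

```python
import os
import tempfile

def list_keys_s3_order(root: str) -> list[str]:
    """Yield relative paths under `root` as S3-style keys ('/'-separated),
    sorted the way ListObjectsV2 would return them."""
    keys = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            keys.append(rel.replace(os.sep, "/"))
    return sorted(keys)

# Example: three files, listed in S3 key order
root = tempfile.mkdtemp()
for p in ["b.txt", "a/z.txt", "a/b.txt"]:
    full = os.path.join(root, p)
    os.makedirs(os.path.dirname(full), exist_ok=True)
    open(full, "w").close()
print(list_keys_s3_order(root))  # ['a/b.txt', 'a/z.txt', 'b.txt']
```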
jayzalowitz|1 year ago
Starofall|1 year ago
1) The underlying S3 framework is already super fast: https://pekko.apache.org/docs/pekko-connectors/current/s3.ht...
2) Lots of multithreading, stream buffering and pipelining
3) For fast iteration speed, the "read, parse, ask for next" loop is the main bottleneck - so if you know that your sync source prefix contains e.g. UUIDs, the tool creates a file iterator for each known subfolder prefix. With 16 iterators, it's mainly the CPU, on the XML parsing, that becomes the bottleneck :)
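The prefix-splitting idea in point 3 can be sketched as running one independent lister per known key prefix in parallel and then merging the per-prefix results, each of which is already in lexicographic order. A toy illustration (`list_page` is a hypothetical stand-in for paginated ListObjectsV2 calls under one prefix; not the tool's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def split_by_prefix(list_page, prefixes):
    """Run one lister per known prefix (e.g. '0'-'f' for hex/UUID keys)
    in parallel; since each prefix's listing is already sorted and the
    prefixes are sorted, concatenation yields the full sorted listing."""
    with ThreadPoolExecutor(max_workers=len(prefixes)) as pool:
        per_prefix = pool.map(list_page, sorted(prefixes))
    return [key for chunk in per_prefix for key in chunk]

# Toy example with an in-memory "bucket"
bucket = sorted(["0a/x", "0b/y", "f1/z", "a9/w"])
fake_list = lambda p: [k for k in bucket if k.startswith(p)]
result = split_by_prefix(fake_list, ["0", "a", "f"])
# result == ["0a/x", "0b/y", "a9/w", "f1/z"]
```

With N iterators issuing list pages concurrently, the wall-clock cost of the sequential "read, parse, ask for next" loop is divided across prefixes, which is why the XML parsing ends up CPU-bound.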