item 45359372


crabique | 5 months ago

Is there an open source service designed with HDDs in mind that achieves similar performance? I know none of the big ones work that well with HDDs: MinIO, Swift, Ceph+RadosGW, SeaweedFS; they all suggest flash-only deployments.

Recently I've been looking into Garage and liking the idea of it, but it seems to have a very different design (no EC).


epistasis|5 months ago

I would say that Ceph+RadosGW works well with HDDs, as long as 1) you use SSDs for the index pool, and 2) you are realistic about the number of IOPS you can get out of your pool of HDDs.

And remember that there's a multiplication of IOPS for any individual client IOP, whether you're using triplicate storage or erasure coding. S3 also has IOP multiplication, which they solve with tons of HDDs.

For big object storage that's mostly streaming 4MB chunks, this is no big deal. If you have tons of small random reads and writes across many keys or a single big key, that's when you need to make sure your backing store can keep up.
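A rough sketch of the fan-out described above, with made-up drive numbers (~150 random IOPS per HDD is my assumption, not from the thread):

```python
# How one client write multiplies into backend writes for a replicated
# pool vs an erasure-coded (k data + m parity) pool.

def backend_write_iops(client_iops: int, *, replicas: int = 0,
                       ec_k: int = 0, ec_m: int = 0) -> int:
    """Each client write fans out to every replica, or to all k+m EC shards."""
    if replicas:
        return client_iops * replicas
    return client_iops * (ec_k + ec_m)

# Hypothetical pool: 80 HDDs at ~150 random IOPS each.
pool_iops = 80 * 150

print(backend_write_iops(1000, replicas=3))      # triplicate: 3000
print(backend_write_iops(1000, ec_k=8, ec_m=3))  # 8+3 EC: 11000
print(pool_iops)                                 # backend budget: 12000
```

Either way, 1000 client write IOPS already eats most of an 80-HDD pool's random-IO budget, which is why the streaming-4MB-chunks case is so much more forgiving.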

bayindirh|5 months ago

Lustre and ZFS can do similar speeds.

However, if you need high IOPS, you need flash on MDS for Lustre and some Log SSDs (esp. dedicated write and read ones) for ZFS.

crabique|5 months ago

Thanks, but I forgot to specify that I'm interested in S3-compatible servers only.

Basically, I have a single big server with 80 high-capacity HDDs and 4 high-endurance NVMes, and it's the S3 endpoint that gets a lot of writes.

So yes, for now my best candidate is ZFS + Garage: this way I can get away with using replica=1 and rely on ZFS RAIDz for data safety, and the NVMes can be sliced and diced to act as the fast metadata store for Garage, the "special" device/small-records store for ZFS, the ZIL/SLOG device, and so on.
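Back-of-envelope capacity math for that layout, under assumptions that are mine and not the poster's (8 RAIDz2 vdevs of 10 drives each, 20 TB per drive):

```python
# Usable capacity of RAIDz vdevs, ignoring metadata, slop and padding
# overhead. With Garage at replica=1, parity is the only redundancy,
# so usable space is close to raw minus parity drives.

def raidz_usable(drives_per_vdev: int, vdevs: int, drive_tb: float,
                 parity: int = 2) -> float:
    """Usable TB: each vdev loses `parity` drives' worth of space."""
    return vdevs * (drives_per_vdev - parity) * drive_tb

raw = 80 * 20.0                      # 80 x 20 TB drives (assumed size)
usable = raidz_usable(10, 8, 20.0)   # 8 x RAIDz2(10)
print(raw, usable, usable / raw)     # 1600.0 1280.0 0.8
```

That ~80% efficiency at replica=1 is the draw versus triplicate storage (~33%) or a typical EC profile, at the cost of all redundancy living inside one box.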

Currently it's a bit of a Frankenstein's monster: XFS+OpenCAS as the backing storage for an old version of MinIO (containerized to run as 5 instances). I'm looking to replace it with a simpler design and hopefully get better performance.

elitepleb|5 months ago

Any of them will work just as well, but only with many datacenters' worth of drives, which very few deployments can target.

It's the classic horizontal/vertical scaling trade-off; that's why flash tends to be more space/cost efficient for speedy access.

olavgg|5 months ago

SeaweedFS has evolved a lot over the last few years, with RDMA support and EC.

pickle-wizard|5 months ago

At a past job we had an object store that used SwiftStack. We just used SSDs for the metadata storage but all the objects were stored on regular HDDs. It worked well enough.

giancarlostoro|5 months ago

Doing some light googling aside from Ceph being listed, there's one called Gluster as well. Hypes itself as "using common off-the-shelf hardware you can create large, distributed storage solutions for media streaming, data analysis, and other data- and bandwidth-intensive tasks."

It's open source / free to boot. I have no direct experience with it myself however.

https://www.gluster.org/

mbreese|5 months ago

Gluster has been slowly declining for a while. It used to be sponsored by Red Hat, but that stopped a few years ago. Since then, development has slowed significantly.

I used to keep a large cluster array with Gluster+ZFS (1.5PB), and I can’t say I was ever really that impressed with the performance. That said — I really didn’t have enough horizontal scaling to make it worthwhile from a performance aspect. For us, it was mainly used to make a union file system.

But, I can’t say I’d recommend it for anything new.

epistasis|5 months ago

A decade ago where I worked, we used Gluster for ~200TB of HDD for a shared file system on a SLURM compute cluster, as a much better clustered version of NFS. And we used Ceph for its S3 interface (RadosGW) for tens of petabytes of bulk storage after the high-IO stages of compute were finished. The Ceph was all HDD, though later we added some SSDs for a caching pool.

For single-client performance, Ceph beat the performance I get from S3 today for large file copies. Gluster had difficult-to-characterize performance, but our setup with big fast RAID arrays seems to still outperform what I see of AWS's Lustre as a service today for our use case of long sequential reads and writes.

We would occasionally try CephFS, the POSIX shared network filesystem, but it couldn't match our Gluster performance for our workload. But also, we built the Ceph long-term storage to maximize TB/$, so it was at a disadvantage compared to our Gluster install. Still, I never heard of CephFS being used anywhere, despite it being the original goal in the papers back at UCSC. Keep an eye on CERN for news about one of the bigger Ceph installs with public info.

I love both of these systems, and see Ceph used everywhere today, but am surprised and happy to see that Gluster is still around.

a012|5 months ago

I’ve used GlusterFS before because I had tens of old PCs, and it worked very well for me. It was basically a PoC to see how it works rather than a production deployment, though.

cullenking|5 months ago

We've been running a production Ceph cluster for 11 years now, across three different hardware generations, with only one full scheduled downtime for a major upgrade in all those years. I wouldn't call it easy, but I also wouldn't call it hard.

I used to run it with SSDs for the RadosGW indexes as well as a fast pool for some VMs, and hard drives for bulk object storage. Since I was only running 5 nodes with 10 drives each, I was tired of occasional IOPS issues under heavy recovery, so on the last upgrade I just migrated to 100% NVMe drives. To mitigate the price I just bought used enterprise Micron drives off eBay whenever I saw a good deal pop up. Haven't had any performance issues since then, no matter what we've tossed at it.

I'd recommend it, though I don't have experience with the other options. On paper I think it's still the best option. Stay away from CephFS though: performance is truly atrocious, and you'll footgun yourself for any use in production.

nh2|5 months ago

We've been using CephFS for a couple of years, with some PBs of data on it (HDDs).

What performance issues and footguns do you have in mind?

I also like that CephFS has a performance benefit that doesn't seem to exist anywhere else: automatic transparent Linux buffer caching, so that writes are extremely fast and local until you fsync() or other clients want to read, and repeat reads or read-after-write are served from local RAM.
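That write-back behaviour is plain POSIX semantics, which a generic sketch can illustrate (this is not CephFS-specific code, just the write()/fsync() contract any Linux filesystem client follows):

```python
# write() lands in the kernel page cache and returns immediately;
# the data is only guaranteed durable after fsync(). Until then,
# repeat reads are served from the same cache.
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello")   # fast: buffered in the page cache
    os.fsync(fd)             # slower: forces the data to stable storage
finally:
    os.close(fd)

with open(path, "rb") as f:
    print(f.read())          # b'hello'
os.unlink(path)
```

What's unusual about CephFS, per the comment above, is doing this transparently on a shared network filesystem, revoking the cache only when another client needs to read.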

zenmac|5 months ago

>Recently I've been looking into Garage and liking the idea of it, but it seems to have a very different design (no EC).

What do you mean by no EC?