top | item 45359768

crabique | 5 months ago

Thanks, but I forgot to specify that I'm interested in S3-compatible servers only.

Basically, I have a single big server with 80 high-capacity HDDs and 4 high-endurance NVMes, and it's the S3 endpoint that gets a lot of writes.

So yes, for now my best candidate is ZFS + Garage: this way I can get away with replica=1 and rely on ZFS RAIDz for data safety, while the NVMes can get sliced and diced to act as the fast metadata store for Garage, the "special" device/small-records store for ZFS, the ZIL/SLOG device, and so on.
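The slicing described above can be sketched in zpool terms roughly like this (a hypothetical sketch, not the poster's actual commands: device names are invented placeholders, and only one RAIDz2 vdev is shown for brevity):

```shell
# Hypothetical layout sketch -- adjust device names and vdev widths to the real hardware.
# The "special" vdev holds pool metadata and small blocks; "log" is the SLOG for sync writes.
zpool create tank \
  raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj \
  special mirror nvme0n1p1 nvme1n1p1 \
  log mirror nvme0n1p2 nvme1n1p2

# Route blocks at or below the threshold to the special vdev (tune to the object-size profile;
# must stay below the dataset recordsize):
zfs set special_small_blocks=64K tank
```

The special vdev is mirrored because losing it loses the pool, same as any top-level vdev.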

Currently it's a bit of a Frankenstein's monster: an old version of MinIO (containerized to run as 5 instances) on top of XFS+OpenCAS as the backing storage. I'm looking to replace it with a simpler design and hopefully get better performance.

creiht | 5 months ago

It is probably worth noting that most of the listed storage systems (including S3) are designed to scale not only in hard drives, but horizontally across many servers in a distributed system. They really are not optimized for a single-node use case. There are also other things to consider that can limit performance, like what the storage backplane for those 80 HDDs looks like and how much throughput you can effectively push through it. Then there is the network connectivity, which will also be a limiting factor.

crabique | 5 months ago

It's a very beefy server with 4 NVMe and 20 HDD bays plus a 60-drive external enclosure, and 2 enterprise-grade HBA cards set to multipath round-robin mode; even with 80 drives it's nowhere near the data-path saturation point.

The link is a 10G 9K MTU connection, the server is only accessed via that local link.

Essentially, the drives being HDDs is the only real bottleneck (besides the obvious single-node limitation).

At the moment, all writes are buffered into the NVMes via the OpenCAS write-through cache, so writes are very snappy and are pretty much ingested at whatever rate I can throw data at it. But read/delete operations require at least a metadata read, and due to the very high number of small (many even empty) objects they take a lot more time than I would like.

I'm willing to sacrifice the write-through cache benefits (the write performance is actually overkill for my use case) in order to make it a little more balanced, for better List/Read/DeleteObject performance.

On paper, most "real" writes will be sequential data, so writing that directly to the HDDs should be fine, while metadata write operations will be handled exclusively by the flash storage, thus also taking care of the empty/small objects problem.
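That metadata/data split maps fairly directly onto Garage's configuration; a hypothetical fragment might look like the following (paths are placeholders, and key names should be checked against the reference config for the deployed Garage version):

```toml
# Hypothetical garage.toml fragment -- paths and values are placeholders.
metadata_dir = "/nvme-pool/garage/meta"  # flash-backed zpool: metadata, LMDB/sqlite, small-object overhead
data_dir = "/tank/garage/data"           # RAIDz2 HDD pool: bulk sequential object data
replication_factor = 1                   # single node; ZFS parity provides the redundancy
```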

uroni | 5 months ago

I'm working on something that might be suited for this use-case at https://github.com/uroni/hs5 (not ready for production yet).

It would still need a resilience/cache layer like ZFS, though.

pjdesno | 5 months ago

Ceph's S3 protocol implementation is really good.

Getting Ceph erasure coding set up properly on a big hard disk pool is a pain - you can tell that EC was shoehorned into a system that was totally designed around triple replication.

nh2 | 5 months ago

Could you elaborate on what you mean by the last sentence?

toast0 | 5 months ago

If you can afford it, mirroring in some form is going to give you way better read perf than RAIDz. Using zfs mirrors is probably easiest but least flexible; zfs copies=2 with all devices as top-level vdevs in a single zpool is not very unsafe; and something custom would be a lot of work, but could get you safety and flexibility if done right.

You're basically seek-limited, and a read on a mirror is one seek, whereas a read on a RAIDz is one seek per device in the stripe. (Although if most of your objects are under the chunk size, you end up with something closer to mirroring than striping.)

You lose on capacity though.
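The seek argument can be put in back-of-envelope numbers. This is a sketch, not a benchmark: the ~150 random reads/s per HDD figure is an assumption, and it ignores caching and small-block RAIDz allocation entirely.

```shell
# Assumed per-drive random-read rate (7200rpm ballpark; an assumption, not measured).
IOPS_PER_HDD=150

# 40x 2-way mirrors over 80 drives: each read is one seek and either side of a
# mirror can serve it, so all 80 spindles seek independently.
MIRROR_READS=$((80 * IOPS_PER_HDD))

# 8x 10-wide RAIDz2: a full-stripe read occupies every drive in the vdev,
# so the pool behaves like ~8 independent actuators.
RAIDZ_READS=$((8 * IOPS_PER_HDD))

echo "mirrors: ${MIRROR_READS} reads/s, raidz2: ${RAIDZ_READS} reads/s"
```

Roughly an order of magnitude apart, which is why offloading metadata and small objects to flash matters so much in the RAIDz design.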

crabique | 5 months ago

Yeah, unfortunately mirrors are a no-go due to space-efficiency requirements, but luckily read performance is not that important if I manage to completely offload FS/S3 metadata and small files to flash storage (a separate zpool for Garage metadata, a separate special VDEV for metadata/small files).

I think I'm going to go with 8x RAIDz2 VDEVs of 10 HDDs each, so that the 20 drives in the internal enclosure form 2 separate VDEVs and don't mix with the 60 in the external enclosure.
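The space-efficiency tradeoff behind that choice works out as follows (parity overhead only; this illustrative arithmetic ignores ZFS metadata overhead and slop space):

```shell
TOTAL_DRIVES=80

# 8 vdevs, 10-wide RAIDz2: 2 parity drives per vdev, 8 data drives each.
RAIDZ2_DATA=$((8 * (10 - 2)))

# 2-way mirrors: half the drives hold copies.
MIRROR_DATA=$((TOTAL_DRIVES / 2))

echo "raidz2: $((100 * RAIDZ2_DATA / TOTAL_DRIVES))%  mirrors: $((100 * MIRROR_DATA / TOTAL_DRIVES))%"
```

80% usable capacity versus 50%, at the cost of the read-IOPS penalty discussed above.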

epistasis | 5 months ago

It's great to see other people's working solutions, thanks. Can I ask if you have backups on something like this? In many systems it's possible to retain the data from ingress or after processing, which makes the store rebuildable even if it's not a true backup. I'm not familiar enough to know whether your software layer has off-site backup as part of the system, for example; that would be a great feature.

bayindirh | 5 months ago

It might not be the most ideal solution, but did you consider installing TrueNAS on that thing?

TrueNAS can handle the OpenZFS part (RAID-Z, caches and logs) and you can deploy Garage or any other S3 gateway on top of it.

It can be an interesting experiment, and an 80-disk server is not too big for a TrueNAS installation.

foobarian | 5 months ago

Do you know if some of these systems have components to periodically checksum the data at rest?

bayindirh | 5 months ago

ZFS/OpenZFS can scrub and do block-level recovery. I'm not sure about Lustre, but since petabyte-sized storage is its natural habitat, there should be at least one way to handle that.
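For ZFS specifically, the at-rest verification is the scrub; a typical workflow looks like this (the pool name "tank" is a placeholder):

```shell
# Walk every allocated block, verify checksums, and repair bad copies
# from parity or mirror redundancy:
zpool scrub tank

# Inspect scrub progress and any checksum errors found or repaired:
zpool status tank

# Many distros ship a periodic scrub timer out of the box; otherwise
# schedule one yourself, e.g. monthly via cron:
# 0 3 1 * * /sbin/zpool scrub tank
```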