crabique|5 months ago
Basically, I have a single big server with 80 high-capacity HDDs and 4 high-endurance NVMes, and it acts as an S3 endpoint that gets a lot of writes.
So yes, for now my best candidate is ZFS + Garage: this way I can get away with replica=1 and rely on ZFS RAIDz for data safety, while the NVMes can be sliced and diced to act as the fast metadata store for Garage, the "special" device/small-records store for ZFS, the ZIL/SLOG device, and so on.
Currently it's a bit of a Frankenstein's monster: XFS + OpenCAS as the backing storage for an old version of MinIO (containerized to run as 5 instances). I'm looking to replace it with a simpler design and hopefully get better performance.
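The Garage side of it should only be a few lines of config. A sketch with made-up paths, using the key names from the pre-1.0 garage.toml (newer releases renamed the replication settings, so check the docs for whatever version you deploy):

    # /etc/garage.toml (sketch; rpc_secret, s3_api, etc. omitted)
    metadata_dir = "/nvme/garage/meta"    # slice of the NVMes
    data_dir = "/tank/garage/data"        # dataset on the RAIDz pool
    replication_mode = "none"             # single copy; ZFS provides the redundancy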
crabique|5 months ago
The link is a 10G 9K MTU connection, the server is only accessed via that local link.
Essentially, the fact that the drives are HDDs is the only real bottleneck (besides the obvious single-node limitation).
At the moment, all writes are buffered into the NVMes via OpenCAS write-through cache, so the writes are very snappy and are pretty much ingested at the rate I can throw data at it. But the read/delete operations require at least a metadata read, and due to the very high number of small (most even empty) objects they take a lot more time than I would like.
I'm willing to sacrifice the write-through cache benefits (the write performance is actually overkill for my use case) in order to make things a little more balanced in favor of List/Read/DeleteObject performance.
On paper, most "real" writes will be sequential data, so writing that directly to the HDDs should be fine, while metadata write operations will be handled exclusively by the flash storage, thus also taking care of the empty/small objects problem.
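ZFS should handle that split natively: with a special vdev attached, setting special_small_blocks on a dataset routes every block at or below the threshold (plus all metadata) to flash. Something like this, with a hypothetical pool/dataset name:

    # route metadata and small blocks to the NVMe special vdev
    zfs set special_small_blocks=64K tank/garage
    # the threshold must stay below recordsize, otherwise *all* writes go to flash
    zfs get recordsize,special_small_blocks tank/garage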
uroni|5 months ago
It would still need a resilience/cache layer like ZFS, though.
pjdesno|5 months ago
Getting Ceph erasure coding set up properly on a big hard disk pool is a pain - you can tell that EC was shoehorned into a system that was totally designed around triple replication.
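The minimal incantation looks deceptively short; a sketch from the standard commands (profile/pool names made up), and it's everything around it (PG sizing, CRUSH rules, the extra pools) where the shoehorning shows:

    # define an 8+2 erasure profile, then build a pool on it
    ceph osd erasure-code-profile set hdd_ec k=8 m=2 crush-failure-domain=host
    ceph osd pool create ecpool 128 erasure hdd_ec
    # RBD/CephFS also need overwrite support on EC pools (RGW does not)
    ceph osd pool set ecpool allow_ec_overwrites true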
toast0|5 months ago
You're basically seek limited, and a read on a mirror is one seek, whereas a read on a RAIDz is one seek per device in the stripe. (Although if most of your objects are under the chunk size, it behaves more like mirroring than striping.)
You lose on capacity though.
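Back-of-envelope, assuming ~100 random reads/s per spindle across the 80 drives:

    40x 2-way mirrors: up to 80 spindles serving reads  -> ~4,000-8,000 reads/s
     8x 10-wide RAIDz2: ~1 disk's worth of IOPS per vdev -> ~800 reads/s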
crabique|5 months ago
I think I'm going to go with 8 RAIDz2 vdevs of 10 HDDs each, so that the 20 drives in the internal drive enclosure can form 2 separate vdevs and not mix with the 60 in the external enclosure.
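Spelled out, pool creation would be roughly this (pool name, by-vdev aliases, and NVMe partitions all hypothetical):

    # 2 vdevs from the internal bays, 6 from the external shelf
    zpool create tank \
      raidz2 /dev/disk/by-vdev/int{00..09} \
      raidz2 /dev/disk/by-vdev/int{10..19} \
      raidz2 /dev/disk/by-vdev/ext{00..09} \
      raidz2 /dev/disk/by-vdev/ext{10..19} \
      raidz2 /dev/disk/by-vdev/ext{20..29} \
      raidz2 /dev/disk/by-vdev/ext{30..39} \
      raidz2 /dev/disk/by-vdev/ext{40..49} \
      raidz2 /dev/disk/by-vdev/ext{50..59} \
      special mirror /dev/nvme0n1p1 /dev/nvme1n1p1 \
      log mirror /dev/nvme2n1p1 /dev/nvme3n1p1
    # the special vdev must be mirrored: losing it loses the pool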
bayindirh|5 months ago
TrueNAS can handle the OpenZFS part (RAIDz, caches, and logs), and you can deploy Garage or any other S3 gateway on top of it.
It could be an interesting experiment, and an 80-disk server is not too big for a TrueNAS installation.