IOPS is a really lazy benchmark that we believe can greatly diverge from most real-life workloads, except for truly random I/O in applications such as databases. For example, in machine learning, training usually consists of taking large datasets (sometimes many PBs in scale), randomly shuffling them each epoch, and feeding them into the engine as fast as possible. Because of this, we see storage vendors for ML workloads concentrate on IOPS numbers. The GPUs, however, only really care about throughput. Indeed, we find a great many applications only really care about throughput, and IOPS is only relevant insofar as it helps achieve that throughput.

For ML, we realised that the shuffling isn't actually random - there's no real reason for it to be truly random rather than pseudo-random. And if it's pseudo-random then it's predictable, and if it's predictable then we can exploit that to great effect - yielding a 60x boost in throughput on S3, beating out a number of other solutions. S3 is not going to do great for truly random I/O; however, we find that most scientific, media and finance workloads are actually deterministic or semi-deterministic, and this is where cunoFS, by peering inside each process, can better predict intra-file and inter-file access patterns and so hide the latencies present in S3. At the end of the day, the right benchmark is the one that reflects real-world usage of applications, but that's a lot of effort to document one by one.

I agree that things like dedupe and compression can affect results, so in our large-file benchmarks each file is actually random data. The small-file benchmarks aren't affected by "write bigger blocks" because there's nothing bigger than the file itself. Yes, data consistency can be an issue, and we've had to do all sorts of things to ensure POSIX consistency guarantees beyond what S3 (or compatible) can provide. These come with restrictions (such as on concurrent writes to the same file from multiple nodes), but so does NFS.
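The predictability argument can be sketched in a few lines. This is a minimal illustration, not cunoFS's actual mechanism; the seed scheme (`seed * 100003 + epoch`) is a made-up example of how frameworks commonly derive a per-epoch shuffle:

```python
import random

def epoch_order(num_samples: int, epoch: int, seed: int = 0) -> list[int]:
    # A pseudo-random shuffle: the same (seed, epoch) pair always yields
    # the same permutation, so the epoch's access order is computable in
    # advance - a prefetcher can issue reads for upcoming samples before
    # the training loop asks for them, hiding object-store latency.
    rng = random.Random(seed * 100003 + epoch)
    order = list(range(num_samples))
    rng.shuffle(order)
    return order

# Deterministic: recomputing the order gives an identical permutation.
assert epoch_order(10, epoch=3) == epoch_order(10, epoch=3)
# And it is a genuine permutation of all samples.
assert sorted(epoch_order(10, epoch=3)) == list(range(10))
```

The same idea generalises: as long as the shuffle is driven by a reproducible PRNG rather than a true entropy source, "random" access is really deterministic access that simply hasn't been announced yet.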
In practice, we introduced a cunoFS Fusion mode that relies on a traditional high-IOPS filesystem for such workloads and consistency (automatically migrating data to that tier), and high throughput object for other workloads that don't need it.
rfoo|2 years ago
This is an interesting hack. However, an IOP is an IOP: no matter how well you predict and prefetch it to hide the latency, it's still going to be translated into a GetObject.
I think what you've really exploited here is that even though S3 is built on HDDs (which have very low IOPS per TiB), its scale is so large that even if you milk 1M+ IOPS out of it, AWS still doesn't care and is happy to serve you. But if my back-of-envelope calculation is correct, this isn't going to work well if everyone starts doing it.
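The back-of-envelope referred to above might look something like this. The drive figures are assumptions for illustration (a modern HDD sustains on the order of 200 random IOPS regardless of capacity), not AWS numbers:

```python
# Assumed per-drive figures - not published AWS specs.
HDD_IOPS = 200           # random IOPS one HDD can sustain
HDD_CAPACITY_TIB = 16    # capacity of that same drive

# IOPS available per TiB actually stored is tiny:
iops_per_tib = HDD_IOPS / HDD_CAPACITY_TIB   # 12.5 IOPS/TiB

# So sustaining 1M IOPS means touching spindles that hold vastly more
# data than the tenant stores - it only works while few tenants do it.
target_iops = 1_000_000
drives_needed = target_iops / HDD_IOPS       # 5,000 drives
```

In other words, one tenant's 1M IOPS rides on spindles provisioned for everyone else's cold data; if every tenant exploited the same headroom, the per-drive IOPS budget would be exhausted quickly.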
How do you get around S3's 5,500 GET-per-second-per-prefix limit? If I only have ~200 20 GiB files, can you still get decent IOPS out of it?
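For context on why the prefix limit matters here, a quick sketch of the bounds (5,500 GET/s per prefix is S3's documented baseline; the key layout is an assumption):

```python
GETS_PER_PREFIX = 5_500   # S3's documented baseline GET/HEAD rate per prefix
num_files = 200

# Worst case: all 200 keys share one prefix, so all ranged GETs against
# them hit a single partition's limit.
single_prefix_rate = GETS_PER_PREFIX                 # 5,500 GET/s

# Best case: each file lives under its own key prefix, so the limits add up.
max_aggregate_rate = num_files * GETS_PER_PREFIX     # 1,100,000 GET/s
```

Note that a byte-range read of a large object still counts as one GET against its prefix, so with few large files the per-prefix ceiling, not total data size, bounds small-read IOPS.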
and...
> IOPS is a really lazy benchmark that we believe can greatly diverge from most real life workloads
No, it's not. I have a workload training a DL model on time-series data which demands 600k 8 KiB IOPS per compute instance. None of the things I tested worked well; I had to build a custom solution on bare-metal NVMe drives.
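A quick sanity check of those numbers shows why this workload is IOPS-bound rather than throughput-bound:

```python
# 600k random 8 KiB reads per second is modest raw bandwidth...
iops = 600_000
io_size = 8 * 1024                           # 8 KiB in bytes
throughput_gib_s = iops * io_size / 2**30    # ~4.58 GiB/s

# ...but as 600,000 separate small requests per second per instance,
# it's far beyond what a high-latency object store serves economically,
# while local NVMe handles it comfortably.
print(round(throughput_gib_s, 2))            # prints 4.58
```

The bandwidth alone is within reach of object storage; it's the request rate and per-request latency that force the workload onto local flash.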
cuno|1 year ago
Our aim is to unleash all the potential that S3/Object has to offer for file system workloads. Yes, the scale of AWS S3 helps, as does erasure coding (which enhances flexibility for better load balancing of reads).
Is it suitable for every possible workload? No, which is why we have a mode called cunoFS Fusion where we let people combine a regular high-performance filesystem for IOPS, and Object for throughput, with data automatically migrated between the two according to workload behaviour.

What we find is that most data/workloads need high throughput rather than high IOPS, and this tends to be the bulk of data. So rather than paying for PBs of ultra-high-IOPS storage, they only need to pay for TBs of it instead. Your particular workload might well need high IOPS, but a great many workloads do not. We do have organisations doing large-scale workloads on time-series (market) data using cunoFS with S3 for performance reasons.