top | item 39659324

(no title)

cuno | 2 years ago

We and our customers use S3 as a POSIX filesystem, and we generally find it faster than a local filesystem for many benchmarks. For listing directories we find it faster than Lustre (a real high performance filesystem). Our approach is to first try listing directories with a single ListObjectV2 (which on AWS S3 is in lexicographic order) and if it hasn't made much progress, we start listing with parallel ListObjectV2. Once you start parallelising the ListObjectV2 (rather than sequentially "continuing") you get massive speedups.

discuss

crabbone|2 years ago

> find it faster than a local filesystem for many benchmarks.

What did you measure? How did you compare? This claim seems very contrary to my experience and understanding of how things work...

Let me refine the question: did you measure metadata or data operations? What kind of storage medium is used by the filesystem you use? How much memory (and subsequently the filesystem cache) does your system have?

----

The thing is: you should expect, in the best case, something like 5 ms latency on network calls over the Internet in an ideal case. Within the datacenter, maybe you can achieve sub-ms latency, but that's hard. AWS within region but different zones tends to be around 1 ms latency.

This is while NVMe latency, even on consumer products, is 10-20 micro seconds. I.e. we are talking about roughly 100 times faster than anything going through the network can offer.

cuno|2 years ago

For AWS, we're comparing against filesystems in the datacenter - so EBS, EFS and FSx Lustre. Compared to these, you can see in the graphs where S3 is much faster for workloads with big files and small files: https://cuno.io/technology/

and in even more detail of different types of EBS/EFS/FSx Lustre here: https://cuno.io/blog/making-the-right-choice-comparing-the-c...

YZF|2 years ago

It can't be a POSIX filesystem if it doesn't meet POSIX filesystem guarantees. I worked on an S3 compatible object store in a large storage company and we also had distributed filesystem products. Those are completely different animals due to the different semantics and requirements. We've also built compliant filesystems over object store and the other way around. Certain operations like, write-append, are tricky to simulate over object stores (S3 didn't use to support append, I haven't really stayed up to date, does it now?). At least when I worked on this it wasn't possible to simulate POSIX semantics over S3 at all without needing to add additional object store primitives.

supriyo-biswas|2 years ago

> Once you start parallelising the ListObjectV2 (rather than sequentially "continuing")

How are you "parallelizing" the ListObjectsV2? The continuation token can be only fed in once the previous ListObjectsV2 response has completed, unless you know the name or structure of keys ahead of time, in which listing objects isn't necessary.

cuno|2 years ago

For example, you can do separate parallel ListObjectV2 for files starting a-f and g-k, etc.. covering the whole key space. You can parallelize recursively based on what is found in the first 1000 entries so that it matches the statistics of the keys. Yes there may be pathological cases, but in practice we find this works very well.

johnmaguire|2 years ago

You're right that it won't work for all use cases, but starting two threads with prefixes A and M, for example, is one way you might achieve this.

fijiaarone|2 years ago

If you think s3 is fast, you should try FTP. It’s at least a hundred times faster. And combined with rsync, dozens of times more reliable.

orf|2 years ago

Neither of those are true though? Not sure if this is sarcastic or not, if so make it more clear in the future