
S3 is files, but not a filesystem

571 points | todsacerdoti | 2 years ago | calpaterson.com

430 comments

[+] breckognize|2 years ago|reply
> I haven't heard of people having problems [with S3's Durability] but equally: I've never seen these claims tested. I am at least a bit curious about these claims.

Believe the hype. S3's durability is industry leading and traditional file systems don't compare. It's not just the software - it's the physical infrastructure and safety culture.

AWS' availability zone isolation is better than the other cloud providers. When I worked at S3, customers would beat us up over pricing compared to GCP blob storage, but the comparison was unfair because Google would store your data in the same building (or maybe different rooms of the same building) - not with the separation AWS did.

The entire organization was unbelievably paranoid about data integrity (checksum all the things) and bigger events like natural disasters. S3 even operates at a scale where we could detect "bitrot" - random bit flips caused by gamma rays hitting a hard drive platter (roughly one per second across trillions of objects iirc). We even measured failure rates by hard drive vendor/vintage to minimize the chance of data loss if a batch of disks went bad.

I wouldn't store critical data anywhere else.

Source: I wrote the S3 placement system.

[+] treflop|2 years ago|reply
What’s your experience like at other storage outfits?

I only ask because your post is a bit like singing praises for Cinnabon that they make their own dough.

The things that you mentioned are standard storage company activities.

Checksum-all-the-things is a basic feature of a lot of file systems. If you can already set up your home computer to detect bitrot and alert you, you can bet big storage vendors do it.

Keeping track of hard drive failure rates by vendor is normal. Storage companies publicly publish their own reports. The tiny 6-person IT operation I was in had a spreadsheet. Hell, I toured a friend’s friend’s major data center last year and he managed to find time to talk hard drive vendors. Now you. I get it — y’all make spreadsheets.

There are a lot of smart people working on storage outside AWS and long before AWS existed.

[+] rsync|2 years ago|reply
"AWS' availability zone isolation is better than the other cloud providers."

Not better than all of them.

A geo-redundant rsync.net account exists in two different states (or countries) - for instance, primary in Fremont[1] and secondary in Denver.

"S3 even operates at a scale where we could detect "bitrot""

That is not a function of scale. My personal server running ZFS detects bitrot just fine - and the scale involved is tiny.

[1] he.net headquarters

[+] supriyo-biswas|2 years ago|reply
Checksumming the data is not born of paranoia but is simply a consequence of needing to detect which blocks are unusable in order to run the Reed-Solomon algorithm.

I'd also assume that a sufficient number of these corruption events are used as a signal to "heal" the system by migrating the individual data blocks onto different machines.

Overall, I'd say the things that you mentioned are pretty typical of a storage system, and are not at all specific to S3 :)

[+] medler|2 years ago|reply
> customers would beat us up over pricing compared to GCP blob storage, but the comparison was unfair because Google would store your data in the same building

I don’t think this is true. Per the Google Cloud Storage docs, data is replicated across multiple zones, and each zone maps to a different cluster. https://cloud.google.com/compute/docs/regions-zones/zone-vir...

[+] staunch|2 years ago|reply
> Believe the hype.

I'd rather believe the test results.

Is there a neutral third-party that has validated S3's durability/integrity/consistency? Something as rigorous as Jepsen?

It'd be really neat if someone compared all the S3 compatible cloud storage systems in a really rigorous way. I'm sure we'd discover that there are huge scary problems. Or maybe someone already has?

[+] Veserv|2 years ago|reply
But they asked if the claims were audited by an unbiased third party. Are there such audits?

Alternatively, AWS does publicly provide legally binding availability guarantees, but I have never seen any prominently displayed legally binding durability guarantees. Are these published somewhere less prominently?

[+] tracerbulletx|2 years ago|reply
My first job was at a startup in 2012 where I was expected to build things at a scale way over what I really had the experience to do. Anyways the best choice I ever made was using RDS and S3 (and django).
[+] loeg|2 years ago|reply
Not a public cloud, but storage at Facebook is similar in terms of physical infrastructure, safety culture, and scale.
[+] simonebrunozzi|2 years ago|reply
I also worked at AWS, but not on the S3 team. However, I was a Tech Evangelist and met with literally thousands of customers over my six-year tenure. S3 was one of the hottest topics, and I got a sense of how good and robust it was directly from these customers.

What you say resonates really well with me, and what I've heard during these years.

[+] chupasaurus|2 years ago|reply
> and bigger events like natural disasters

Outdated anecdata: I worked for a company that lost some parts of buckets after the lightning strike incident in 2011, which bumped the paranoia quite a bit. AFAIK the same thing hasn't been possible for more than a decade.

[+] zooq_ai|2 years ago|reply
Google discovered random bit flips caused by gamma rays.
[+] spintin|2 years ago|reply
Correct me if I'm wrong but bitrot only affects spinning rust since NAND uses ECC?

If you see this I wonder if S3 is planning on adding hardlinks?

[+] orf|2 years ago|reply
> And listing files is slow. While the joy of Amazon S3 is that you can read and write at extremely, extremely, high bandwidths, listing out what is there is much much slower. Slower than a slow local filesystem

This misses something critical. Yes, s3 has fast reading and writing, but that’s not really what makes it useful.

What makes it useful is listing. In an unversioned bucket (or one with no delete markers), listing any given prefix is essentially constant time: I can take any given string, in a bucket with 100 billion objects, and say “give me the next 1000 keys alphabetically that come after this random string”.

What’s more, using “/” as a delimiter is just the default - you can use any character you want and get a set of common prefixes. There are no “directories”; “directories” are created out of thin air on demand.

This is super powerful, and it’s the thing that lets you partition your data in various ways, using whatever identifiers you need, without worrying about performance.

If listing was just “slow”, couldn’t list on file prefixes and got slower proportional to the number of keys (I.e a traditional unix file system), then it wouldn’t be useful at all.
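The "next 1000 keys after any string" behaviour can be sketched against an in-memory model of S3's flat, sorted keyspace (the real service does this server-side via ListObjectsV2's `StartAfter` parameter; the keys below are made up for illustration):

```python
import bisect

def list_after(sorted_keys, start_after, max_keys=1000):
    """Return up to max_keys keys lexicographically after start_after.

    With the key index kept sorted, finding the start position is a
    binary search - the cost does not grow with how many keys precede it.
    """
    i = bisect.bisect_right(sorted_keys, start_after)
    return sorted_keys[i:i + max_keys]

# A bucket-like sorted key list: one object per day of the year.
keys = sorted(f"logs/2023/{d:03d}/events.json" for d in range(365))

# "Give me the next 3 keys after this arbitrary string."
page = list_after(keys, "logs/2023/100", max_keys=3)
```

Here `page` is the three keys for days 100-102, regardless of how many keys sit before or after the start string.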

[+] calpaterson|2 years ago|reply
I have to say that I'm not hugely convinced. I don't really think that being able to pull out the keys before or after a prefix is particularly impressive. That is the basis for database indices going back to the 1970s after all.

Perhaps the use-cases you're talking about are very different from mine. That's possible of course.

But for me, often the slow speed of listing the bucket gets in the way. Your bucket doesn't have to get very big before listing the keys takes longer than reading them. I seem to remember that listing operations ran at sub-1 Mbps, but admittedly I don't have a big bucket handy right now to test that.

[+] nh2|2 years ago|reply
The key difference between lexicographically keyed flat hierarchies, and directory-nested filesystem hierarchies, becomes clear based on this example:

    dir1/a/000000
    dir1/a/...
    dir1/a/999999
    dir1/b
On a proper hierarchical file system with directories as interior tree nodes, `ls dir1/` needs to traverse and return only 2 entries ("a" and "b").

A flat string-indexed KV store that supports only lexicographic order, without special handling of delimiters, needs to traverse 1 million entries ("a/000000" through "a/999999") before arriving at "b".

Thus, simple flat hierarchies are much slower at listing the contents of a single dir: O(all recursive children), vs. O(immediate children) on a "proper" filesystem.

Plain lexicographic strings cannot model multi-level tree structures with the same complexity; this may be what gives S3 the reputation that "listing files is slow".

UNLESS you tell the listing algorithm what the delimiter character is (e.g. `/`). Then a lexicographic prefix tree can efficiently skip over all subtrees at the next `/`.

Amazon S3 supports that, with the docs explicitly mentioning "skipping over and summarizing the (possibly millions of) keys nested at deeper levels" in the `CommonPrefixes` field: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-...

I have not tested whether Amazon's implementation actually saves the traversal (or whether it traverses and just returns fewer results), but I'd hope so.
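The delimiter skip described above can be sketched over a sorted flat keyspace: on seeing the common prefix "dir1/a/", a binary search can jump straight past all million keys sharing it, instead of scanning them one by one. This is a toy model of what `CommonPrefixes` enables, not S3's actual implementation:

```python
import bisect

def list_dir(sorted_keys, prefix, delimiter="/"):
    """Return immediate children of `prefix`, collapsing deeper levels
    into common prefixes (like ListObjectsV2's CommonPrefixes field)."""
    results = []
    i = bisect.bisect_left(sorted_keys, prefix)
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        rest = sorted_keys[i][len(prefix):]
        cut = rest.find(delimiter)
        if cut == -1:
            # A key directly under the prefix: emit it, advance by one.
            results.append(rest)
            i += 1
        else:
            # A "subdirectory": emit its common prefix once, then binary
            # search past every key sharing it - O(log n), not O(children).
            common = prefix + rest[:cut + 1]
            results.append(rest[:cut + 1])
            i = bisect.bisect_left(sorted_keys, common + "\uffff", i)
    return results

# nh2's example: a million-ish keys under dir1/a/, plus dir1/b.
keys = sorted([f"dir1/a/{n:06d}" for n in range(1000)] + ["dir1/b"])
```

With the skip, `list_dir(keys, "dir1/")` returns just `["a/", "b"]` after two binary searches rather than walking all the `dir1/a/...` entries.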

[+] adrian_b|2 years ago|reply
For 30 years now (starting with XFS in 1993, which was inspired by HPFS), all the good UNIX file systems have implemented directories as some kind of B-tree.

Therefore they do not get slower in proportion to the number of entries, and listing based on filename prefixes is extremely fast.

[+] foldr|2 years ago|reply
>What makes it useful is listing.

I think 99% of S3 usage just consists of retrieving objects with known keys. It seems odd to me to consider prefix listing as a key feature.

[+] gamache|2 years ago|reply
> ...listing any given prefix is essentially constant time: I can take any given string, in a bucket with 100 billion objects, and say “give me the next 1000 keys alphabetically that come after this random string”.

I'm not sure we agree on the definition of "constant time" here. Just because you get 1000 keys in one network call doesn't imply anything about the complexity of the backend!

[+] aeyes|2 years ago|reply
And if for some reason you need a complete listing along with object sizes and other attributes you can get one every 24 hours with S3 inventory report.

That has always been good enough for me.

[+] tjoff|2 years ago|reply
Is listing really such a key feature that people use it as a database to find objects?

Have not used S3, but that is not how I imagined using it.

[+] jacobsimon|2 years ago|reply
What is it about S3 that enables this speed, and why can’t traditional Unix file systems do the same?
[+] hayd|2 years ago|reply
You can set up CloudWatch events to trigger a Lambda function that stores metadata about the S3 object in a regular database. That way you can index it however you expect to list.

Very effective for our use case.
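A minimal sketch of that pattern: an S3 event notification invokes a Lambda handler, which extracts the object metadata for indexing. The event shape below follows S3's notification message format; `save_record` is a hypothetical stand-in for your own database write (an INSERT into Postgres, a DynamoDB put, etc.):

```python
def handler(event, context=None, save_record=print):
    """Index S3 object metadata from an S3 event notification."""
    rows = []
    for rec in event.get("Records", []):
        obj = rec["s3"]["object"]
        row = {
            "bucket": rec["s3"]["bucket"]["name"],
            "key": obj["key"],
            "size": obj.get("size"),
            "event": rec["eventName"],
        }
        save_record(row)  # stand-in for the real database write
        rows.append(row)
    return rows

# Example invocation with an S3-notification-shaped event:
event = {"Records": [{
    "eventName": "ObjectCreated:Put",
    "s3": {"bucket": {"name": "my-bucket"},
           "object": {"key": "assets/logo.png", "size": 1234}},
}]}
rows = handler(event, save_record=lambda r: None)
```

Once the rows land in a real database, listing becomes an indexed query rather than a paginated walk of the bucket.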

[+] donatj|2 years ago|reply
> And listing files is slow. While the joy of Amazon S3 is that you can read and write at extremely, extremely, high bandwidths, listing out what is there is much much slower. Slower than a slow local filesystem.

I was taken aback by this recently. At my coworkers request, I was putting some work into a script we have to manage assets in S3. It has a cache for the file listing, and my coworker who wrote it sent me his pre-populated cache. My initial thought was “this can’t really be necessary” and started poking.

We have ~100,000 root level directories for our individual assets. Each of those have five or six directories with a handful of files. Probably less than a million files total, maybe 3 levels deep at its deepest.

Recursively listing these files takes literally fifteen minutes. I poked and prodded at suggestions from Stack Overflow and ChatGPT for potential ways to speed up the process and got nothing notable. That’s absurdly slow. Why on earth is it so slow?

Why is this something Amazon has not fixed? From the outside really seems like they could slap some B-trees on the individual buckets and call it a day.

If it is a difficult problem, I’m sure it would be for fascinating reasons I’d love to hear about.

[+] catlifeonmars|2 years ago|reply
S3 is fundamentally a key value store. The fact that you can view objects in “directories” is nothing more than a prefix filter. It is not a file system and has no concept of directories.
[+] electroly|2 years ago|reply
The way that you said "recursively" and spent a lot of time describing "directories" and "levels" worries me. The fastest way to list objects in S3 wouldn't involve recursion at all; you just list all objects under a prefix. If you're using the path delimiter to pretend that S3 keys are a folder structure (they're not) and go "folder by folder", it's going to be way slower. When calling ListObjectsV2, make sure you are NOT passing "delimiter". The "directories" and "levels" have no impact on performance when you're not using the delimiter functionality. Split the one list operation into multiple parallel lists on separate prefixes to attain any total time goal you'd like.
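The "split into parallel lists on separate prefixes" suggestion can be sketched as follows. This runs against an in-memory key list to stay self-contained; with boto3 each shard would be one paginated ListObjectsV2 call with `Prefix=shard` and no `Delimiter`. The hex-prefix sharding assumes your keys are spread across those leading characters:

```python
from concurrent.futures import ThreadPoolExecutor

def shard_prefixes(alphabet="0123456789abcdef"):
    """One disjoint listing shard per leading character."""
    return list(alphabet)

def list_shard(all_keys, prefix):
    # Stand-in for a flat, delimiter-free ListObjectsV2 over one prefix.
    return [k for k in all_keys if k.startswith(prefix)]

# Fake bucket contents: keys whose first character is a hex digit.
keys = [f"{i:x}{i:05d}/asset.png" for i in range(32)]

# List all shards in parallel, then merge.
with ThreadPoolExecutor(max_workers=8) as pool:
    parts = pool.map(lambda p: list_shard(keys, p), shard_prefixes())
    listed = sorted(k for part in parts for k in part)
```

Because the shards are disjoint prefixes, the merged result is the complete listing, and wall-clock time is driven by the slowest shard rather than the sum of all pages.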
[+] jameshart|2 years ago|reply
A fun corollary of this issue:

Deleting an S3 bucket is nontrivial!

You can't delete a bucket with objects in it. And you can't just tell S3 to delete all the objects. You need to send individual API requests to S3 to delete each object. Which means sending requests to S3 to list out the objects, 1000 at a time. Which takes time. And those list calls cost money to execute.

This is a good summary of the situation: https://cloudcasts.io/article/deleting-an-s3-bucket-costs-mo...

The fastest way to quickly dispose of an S3 bucket turns out to be to delete the AWS account it belongs to.
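The emptying loop above can be sketched as a pure batching step plus the API calls. DeleteObjects accepts at most 1000 keys per request, so each listing page maps onto one delete batch; `batches` below is the testable part, and the boto3 calls in the comment show where it would plug in (assuming boto3 and an existing bucket):

```python
def batches(keys, size=1000):
    """Group keys into DeleteObjects-shaped payloads of <= size keys."""
    return [
        {"Objects": [{"Key": k} for k in keys[i:i + size]], "Quiet": True}
        for i in range(0, len(keys), size)
    ]

# In real code, roughly:
#   s3 = boto3.client("s3")
#   paginator = s3.get_paginator("list_objects_v2")
#   for page in paginator.paginate(Bucket=bucket):
#       page_keys = [o["Key"] for o in page.get("Contents", [])]
#       for payload in batches(page_keys):
#           s3.delete_objects(Bucket=bucket, Delete=payload)

payloads = batches([f"key-{i}" for i in range(2500)])
```

For 2500 keys that is three DeleteObjects calls (1000 + 1000 + 500) on top of the list calls, which is where the time and money go.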

[+] somedudetbh|2 years ago|reply
> Amazon S3 is the original cloud technology: it came out in 2006. "Objects" were popular at the time and S3 was labelled an "object store", but everyone really knows that S3 is for files. S3

Alternative theory: everyone who worked on this knew that it was not a filesystem and "object store" is a description intended to describe everything else pointed out in this post.

"Objects were really popular" is about objects as software component that combines executable code with local state. None of the original S3 examples were about "hey you can serialize live objects to this store and then deserialize them into another live process!" It was all like "hey you know how you have all those static assets for your website..." "Objects" was used in this sense in databases at the time in the phrase "binary large object" or "blob". S3 was like "hey, stuff that doesn't fit in your database, you know...objects...this is a store for them."

This is meant to describe precisely things like "listing is slow", because when S3 was designed, the launch use cases assumed an index of contents existed _somewhere else_ - because, yeah, it's not a filesystem. It's an object store.

[+] paulddraper|2 years ago|reply
Yeah, I'm really worried the author is confusing OOP with an object store.

To quote GCP:

> Object storage is a data storage architecture for storing unstructured data, which sections data into units—objects—and stores them in a structurally flat data environment

> https://cloud.google.com/learn/what-is-object-storage

That is (1) unstructured (2) flat organization (3) whole-item operations (read, write)

[+] alphazard|2 years ago|reply
S3 is not even files, and definitely not a filesystem.

The thing I would expect from a file abstraction is mutability. I should be able to edit pieces of a file, grow it, shrink it, read and write at random offsets. I shouldn't have to go back up to the root, or a higher level concept, once I have the file in hand. S3 provides a mutable listing of immutable objects; if I want to do any of the mutability business, I need to make a copy and re-upload. As originally conceived, the file abstraction finds some sectors on disk and presents them to the client as a contiguous buffer. S3 solves a different problem.

Many people misinterpret the Good Idea from UNIX "everything is a file" to mean that everything should look like a contiguous virtual buffer. That's not what the real Good Idea is. Really: everything can be listed in a directory, including directories. There will be base leaves, which could be files, or any object the system wants to present to a process, and there will be recursive trees (which are directories). The directories are what make the filesystem, not the type of a particular leaf. Adding a new type of leaf, like a socket or a frame buffer, or whatever, is almost boring, and doesn't erode the integrity of the real good idea. Adding a different kind of container like a list, would make the structure of the filesystem more complex, and that would erode the conceptual integrity.

S3 doesn't do any of these things, and that's fine. I just want a place to put things that won't fit in the database, and know they won't bitrot when I'm not looking. The desire to make S3 look more like a filesystem comes from client misunderstanding of what it's good at/for, and poor product management indulging that misunderstanding instead of guarding the system from it.

[+] globular-toast|2 years ago|reply
A filesystem is an abstraction built on a block device. A block device just gives you a massive array of bytes and lets you read/write from them in blocks (e.g. write these 300 bytes at position 273041).

A block device itself is an abstraction built on real hardware. "Write these 300 bytes" really means something like "move needle on platter 2 to position 6... etc"

S3 is just a different abstraction that is also built on raw storage somehow. It's a strictly flat key-object store. That's it. I don't know why people have a problem with this. If you need "filesystem stuff" then implement it in your app, or use a filesystem. You only need to append? Use a database to keep track of the chain of appends and store the chunks in S3. Doesn't work for you? Use something else. Need to "copy"? Make a new reference to the same object in your db. Doesn't work for you? Use something else.

S3 works for a lot of people. Stop trying to make it something else.

And stop trying to change the meaning of super well-established names in your field. A filesystem is described in text books everywhere. S3 is not a filesystem and never claimed to be one.

Oh and please study a bit of operating system design. Just a little bit. It really helps and is great fun too.

[+] tison|2 years ago|reply
This was discussed in https://github.com/apache/arrow-rs/issues/3888, which compares object_store in Apache Arrow to the APIs provided by Apache OpenDAL.

Briefly, Apache OpenDAL is a library providing FS-like APIs over multiple storage backends, including S3 and many other cloud storage services.

A few database systems, such as GreptimeDB and Databend, use OpenDAL as a better S3 SDK to access data on cloud storage.

Other solutions exist to manage filesystem-like interfaces over S3, including Alluxio and JuiceFS. Unlike Apache OpenDAL, Alluxio and JuiceFS need to be deployed standalone and have a dedicated internal metadata service.

[+] cynicalsecurity|2 years ago|reply
Backblaze B2 is worth mentioning while we're speaking of S3. I'm absolutely in love with their prices (a third of S3's). (I'm not their representative.)
[+] nickcw|2 years ago|reply
Great article - would have been useful to read before starting out on the journey of making rclone mount (mount your cloud storage via fuse)!

After a lot of iterating we eventually came up with the VFS layer in rclone which adapts S3 (or any other similar storage system like Google Cloud Storage, Azure Blob, Openstack Swift, Oracle Object Storage, etc) into a POSIX-ish file system layer in rclone. The actual rclone mount code is quite a thin layer on top of this.

The VFS layer has various levels of compatibility. The lowest, "off", just does directory caching: in this mode, as the article states, you can't read and write to a file simultaneously, you can't write to the middle of a file, and you can only write files sequentially. Surprisingly, quite a lot of things work OK with these limitations. The next level up is "writes" - this supports nearly all the POSIX features that applications want, like being able to read and write to the same file at the same time, write to the middle of the file, etc. The cost for that, though, is a local copy of the file, which is uploaded asynchronously when it is closed.

Here are some docs for the VFS caching modes - these mirror the limitations in the article nicely!

https://rclone.org/commands/rclone_mount/#vfs-file-caching

By default S3 doesn't have real directories either. This means you can't have a directory with no files in, and directories don't have valid metadata (like modification time). You can create zero length files ending in / which are known as directory markers and a lot of tools (including rclone) support these. Not being able to have empty directories isn't too much of a problem normally as the VFS layer fakes them and most apps then write something into their empty directories pretty quickly.

So it is really quite a lot of work trying to convert something which looks like S3 into something which looks like a POSIX file system. There is a whole lot of smoke and mirrors behind the scene when things like renaming an open file happens and other nasty corner cases like that.

Rclone's lower level move/sync/copy commands don't bother though and use the S3 API pretty much as-is.

If I could change one thing about S3's API I would like an option to read the metadata with the listings. Rclone stores modification times of files as metadata on the object and there isn't a bulk way of reading these, you have to HEAD the object. Or alternatively a way of setting the Last-Modified on an object when you upload it would do too.

[+] throwaway892238|2 years ago|reply
> The "simple" in S3 is a misnomer. S3 is not actually simple. It's deep.

Simple doesn't mean "not deep". It means having the fewest parts needed in order to accomplish your requirements.

If you require a distributed, centralized, replicated, high-availability, high-durability, high-bandwidth, low-latency, strongly-consistent, synchronous, scalable object store with HTTP REST API, you can't get much simpler than S3. Lots of features have been added to AWS S3 over the years, but the basic operation has remained the same.

[+] inkyoto|2 years ago|reply
S3 is a tagged versioned object storage with file like semantics implemented in the AWS SDK (via AWS S3 API's). The S3 object key is the tag.

Files and folders are used to make S3 buckets more approachable to those who either don't know or don't want to know what it actually is, and one day they get a surprise.

[+] d-z-m|2 years ago|reply
> S3 is a cloud filesystem, not an object-whatever. [...]I think the idea that S3 is really "Amazon Cloud Filesystem" is a bit of a load bearing fiction.

Does anyone actually think this? I have never encountered anyone who has described S3 in these terms.

[+] type_Ben_struct|2 years ago|reply
Tools like LucidLink and Weka go some way toward making S3 even more of a “file system”. They break files into smaller chunks (S3 objects), which helps with partial writes, reads and performance, alongside tiering of data from S3 to disk when needed for performance.
[+] svat|2 years ago|reply
It's nice to see Ousterhout's idea of module depth (the main idea from his A Philosophy of Software Design) getting more mainstream — mentioned in this article with attribution only in "Other notes", which suggests the author found it natural enough not to require elaboration. Being obvious-in-hindsight like this is a sign of a good idea. :-)

> The concept of deep vs shallow modules comes from John Ousterhout's excellent book. The book is [effectively] a list of ideas on software design. Some are real hits with me, others not, but well worth reading overall. Praise for making it succinct.

[+] hiAndrewQuinn|2 years ago|reply
I feel like I understand the lasting popularity of the humble FTP fileserver a bit better now. Thank you.
[+] arvindamirtaa|2 years ago|reply
Like how Gmail is email but not IMAP. It's fine. We've seen that these kinds of wrappers work pretty well most of the time, considering the performance and simplicity they bring to building and managing these systems.
[+] ein0p|2 years ago|reply
A bit off topic but also related: I use Minio as a local "S3" to store datasets and model checkpoints for my garage compute. Minio, however, has a bunch of features that I simply don't need. I just want to be able to copy to/from, list prefixes, and delete every now and then. I could use NFS I suppose, but that'd be a bit inconvenient since I also use Minio to store build deps (which Bazel then downloads), and I'd like to be able to comfortably build stuff on my laptop. In particular, one feature I do not need is the constant disk access that Minio does to "protect against bit rot" and whatever. That protection is already provided by periodic scrubs on my raidz6.

So what's the current best (preferably statically linked) self-hosted, single-node option for minimal S3 like "thing" that just lets me CRUD the files and list them?

[+] hn72774|2 years ago|reply
> Filesystem software, especially databases, can't be ported to Amazon S3

Hudi, Delta and Iceberg bridge that gap now. Databricks built a company around it.

Don't try to do relational on object storage on your own. Use one of those libraries. It seems simple but it's not. Late arriving data, deletes, updates, primary key column values changing, etc.

[+] MatthiasPortzel|2 years ago|reply
This article was an epiphany for me because I realized I've been thinking of the Unix filesystem as if it has two functions: read_file and write_file. (And then getting frustrated with the filesystem APIs in programming languages.)
[+] YouWhy|2 years ago|reply
The article is well written, but I am annoyed at the attempt to gatekeep the definition of a filesystem.

Like literally any abstraction out there, filesystems are associated with a multitude of possible approaches with conceptually different semantics. It's a bit sophistic to say that Postgres cannot be run on S3 because S3 is not a filesystem; a better choice would have been to explore the underlying assumptions. (I suspect latency would kill the hypothetical use case of Postgres over S3 even if S3 had incorporated the necessary API semantics; could somebody more knowledgeable chime in?)

A more interesting avenue to pursue would be: what other additions could be made to the S3 API to make it more usable in its own right? For example, why doesn't S3 offer more than one filename per blob (similar to what links do in POSIX)?