top | item 37404549

nyc_pizzadev | 2 years ago

Does anyone have any experience on how this works at scale?

Let’s say I have a directory tree with 100MM files in a nested structure, where the average file is 4+ directories deep. When I `ls` the top few directories, is it fast? How long until I discover updates?

Reading the docs, it looks like it’s using this API for traversal [0]?

What about metadata like creation times, permissions, owner, and group?

Any consistency concerns?

[0] https://cloud.google.com/storage/docs/json_api/v1/objects/li...
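Directories get emulated on top of that objects.list endpoint with the `prefix` and `delimiter` query parameters. A rough sketch (the helper name and example paths are made up, and no request is actually sent) of what "list one directory level" translates to:

```python
def list_dir_params(path: str) -> dict:
    """Build objects.list query params for a single directory level.

    `prefix` restricts results to names under `path`, and
    delimiter="/" collapses anything deeper into the response's
    `prefixes` field (the emulated subdirectories), so only one
    level comes back per request.
    """
    prefix = path.rstrip("/") + "/" if path else ""
    return {"prefix": prefix, "delimiter": "/"}

# e.g. listing the emulated directory "logs/2023":
params = list_dir_params("logs/2023")
# → {"prefix": "logs/2023/", "delimiter": "/"}
```

Pagination (`pageToken`/`maxResults`) sits on top of this, which is where repeated round trips start to hurt for very large levels.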

BrandonY|2 years ago

Hi, Brandon from GCS here. If you're looking for all of the guarantees of a real POSIX filesystem, need fast top-level directory listing for 100MM+ nested files, and care about POSIX permissions/owner/group and other file metadata, Gcsfuse is probably not what you're after. You might want something more like Filestore: https://cloud.google.com/filestore

We've got some additional documentation on the differences and limitations between Gcsfuse and a proper POSIX filesystem: https://cloud.google.com/storage/docs/gcs-fuse#expandable-1

Gcsfuse is a great way to mount Cloud Storage buckets and view them like they're in a filesystem. It scales quite well for all sorts of uses. However, Cloud Storage itself is a flat namespace with no built-in directory support. Listing the few top level directories of a bucket with 100MM files more or less requires scanning over your entire list of objects, which means it's not going to be very fast. Listing objects in a leaf directory will be much faster, though.
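To make the flat-namespace point concrete, here's a toy simulation (not GCS code) of delimiter-based listing: discovering the top-level "directories" means walking every object name, because there is no directory index to consult.

```python
def list_level(objects, prefix="", delimiter="/"):
    """Collapse a flat object namespace into one directory level.

    Returns (files, subdirectory_prefixes). Note the full scan:
    with no directory index, cost is proportional to the total
    object count even when listing the top level.
    """
    files, prefixes = set(), set()
    for name in objects:                  # scans ALL object names
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter in rest:
            # deeper entry: record only its first path component
            prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            files.add(name)
    return files, prefixes

objs = ["a/b/c.txt", "a/d.txt", "top.txt", "a/b/e.txt"]
files, dirs = list_level(objs)
# files == {"top.txt"}, dirs == {"a/"}
```

The real service does this server-side over a sorted index, so it avoids the naive per-name loop, but the top-level listing still has to cover the whole keyspace, which is the cost Brandon describes.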

nyc_pizzadev|2 years ago

Thanks for the reply.

Our theoretical use case is 10+ PB, and we need multiple TB/s of read throughput (maybe a fraction of that for writes). So I don’t think Filestore fits this scale, right?

As for the directory traversals, I guess caching might help here? Top level changes aren’t as frequent as leaf additions.

That being said, I don’t see any (caching) proxy support anywhere other than the Google CDN.
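One client-side approach along those lines (purely illustrative; gcsfuse has its own built-in stat/type caching knobs rather than anything like this code) is to memoize directory listings with a TTL, so rarely-changing top levels are served from memory:

```python
import time

class TTLListingCache:
    """Tiny TTL cache for directory listings (illustrative only).

    `fetch` is whatever actually lists the directory (e.g. an
    objects.list call); its result is reused until the TTL expires.
    """
    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self.fetch = fetch
        self.ttl = ttl_seconds
        self.clock = clock            # injectable clock, for testing
        self._entries = {}            # path -> (expires_at, listing)

    def list(self, path):
        now = self.clock()
        hit = self._entries.get(path)
        if hit and hit[0] > now:
            return hit[1]             # fresh cached listing
        listing = self.fetch(path)    # cold or stale: refetch
        self._entries[path] = (now + self.ttl, listing)
        return listing
```

The tradeoff is exactly the consistency question from upthread: reads can be stale by up to `ttl_seconds`, which is acceptable for slow-changing top-level directories but not for hot leaves.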

milesward|2 years ago

Brandon, I know why this was built, and I agree with your list of viable uses; that said, it strikes me as extremely likely to lead to gnarly support load, grumpy customers, and system instability when it is inevitably misused. What steps across all of the user interfaces is GCP taking to warn users who may not understand their workload characteristics at all as to the narrow utility of this feature?

daviesliu|2 years ago

If you really expect a filesystem experience over GCS, please try JuiceFS [1], which scales to 10 billion files quite well with TiKV or FoundationDB as the metadata engine.

PS: I'm the founder of JuiceFS.

[1] https://github.com/juicedata/juicefs

victor106|2 years ago

The description says S3. Does it also support GCS?

kuchenbecker|2 years ago

Blob stores are O(n) in the number of affected objects when performing a directory operation. To maintain consistency you're forced to serialize/lock while these expensive operations run, which limits the maximum practical size.
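For instance, a "directory rename" over a flat namespace decomposes into a per-object copy-and-delete. A sketch against an in-memory dict standing in for the bucket (the helper is hypothetical, not any blob-store API):

```python
def rename_prefix(bucket, old, new):
    """Emulate `mv old/ new/` on a flat namespace.

    Every object under `old` must be rewritten under `new` and
    removed: O(n) individual operations, any of which can fail
    partway through — hence the need to lock or serialize for
    consistency.
    """
    moved = 0
    # snapshot the matching keys first; the dict is mutated below
    for name in [n for n in bucket if n.startswith(old)]:
        bucket[new + name[len(old):]] = bucket.pop(name)
        moved += 1
    return moved

b = {"a/1": b"x", "a/2": b"y", "c/3": b"z"}
n = rename_prefix(b, "a/", "b/")
# n == 2; keys are now {"b/1", "b/2", "c/3"}
```

A real filesystem makes the same rename a single O(1) metadata update, which is the gap JuiceFS-style external metadata engines are trying to close.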