sudhirj | 1 year ago
So if your keys are 2024/12/03/22-45:24 etc, I would expect the prefix to be the first 7 characters. If your keys are UUIDs I'd assume the first two or three. For ULIDs I'd assume the first 10. I think there's a function that does statistical analysis on key samples to figure out reasonable sharding.
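To make the prefix-length guess concrete, here is a rough sketch of that kind of analysis over a sample of keys. It is not S3's or GCS's actual algorithm (neither is published); `prefix_spread` and `max_len` are made-up names for illustration. For each prefix length it reports how many distinct prefixes exist and what share of the sample falls under the most common one.

```python
from collections import Counter


def prefix_spread(keys, max_len=12):
    """Hypothetical helper: for each prefix length n, return
    (number of distinct prefixes, share of keys under the most common prefix)."""
    if not keys:
        raise ValueError("need at least one key to analyse")
    stats = {}
    for n in range(1, max_len + 1):
        # Count how many sampled keys share each n-character prefix.
        counts = Counter(key[:n] for key in keys)
        top_share = counts.most_common(1)[0][1] / len(keys)
        stats[n] = (len(counts), top_share)
    return stats


sample = ["2024/12/03/a", "2024/12/03/b", "2024/12/04/c", "2024/11/01/d"]
for length, (distinct, top_share) in prefix_spread(sample, max_len=10).items():
    print(length, distinct, top_share)
```

A prefix length where the distinct count starts growing and the top share drops is a plausible place for a split, but that is only a guess at how a real store might reason about it.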
tecleandor|1 year ago
The problem with a date-based key like the one you used (which is very common) is that reads tend to cluster on the same dates. For data analysis, for example, you read all the files from one day or week, not files randomly distributed across dates. All those files share the same prefix and are located in the same shard, which reduces performance until the load is so high that Google splits that index into parts and begins to distribute your data across other shards.
For this reason they recommend planning your key names beforehand and breaking up that prefix with some sort of random hash at a reasonable location in your key:
https://cloud.google.com/storage/docs/request-rate#naming-co...
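A minimal sketch of that idea, assuming you put a short hash in front of the date-based name. The function name `hashed_key`, the 6-character length, and the `_` separator are my own choices for illustration, not anything a storage provider mandates. The hash is derived from the key itself rather than being truly random, so you can recompute the full object name later from the original key.

```python
import hashlib


def hashed_key(key: str, hash_len: int = 6) -> str:
    """Hypothetical helper: prepend a short hash of the key so that
    keys sharing a date prefix no longer sort next to each other."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:hash_len]}_{key}"


# Keys from the same day end up with unrelated leading characters.
for name in ["2024/12/03/22-45-24.json", "2024/12/03/22-45-25.json"]:
    print(hashed_key(name))
```

The trade-off is that you can no longer list "everything from one day" with a single prefix query, so you need some other index of which keys exist for a given date.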
jrochkind1|1 year ago
> Adding a random string after a common prefix still allows auto-scaling to work, but…
No way to know if that's true of S3's algorithm too without them revealing it.
jrochkind1|1 year ago
If you don't have "a lot" of keys, then you probably have only one prefix, maybe? It's hard to say without them documenting the target order of magnitude of their shards.
sudhirj|1 year ago