Ah so its not only me that uses AWS primitives for hackily implementing all sorts of synchronization primitives.
My other favorite pattern is implementing a pool of workers by quering ec2 instances with a certain tag in a stopped state and starting them.
Starting the instance can succeed only once - that means I managed to snatch the machine. If it fails, I try again, grabbing another one.
This is one of those things that I never advertised out of professional shame, but it works, its bulletproof and dead simple and does not require additional infra to work.
My biggest wishlist item for S3 is the ability to enforce that an object is named with a name that matches its hash. (With a modern hash considered secure, not MD5 or SHA1, though it isn't supported for those either.) That would make it much easier to build content-addressible storage.
While it can't be done server-side, this can be done straightforwardly in a signer service, and the signer doesn't need to interact with the payloads being uploaded. In other words, a tiny signer can act as a control plane for massive quantities of uploaded data.
The client sends the request headers (including the x-amz-content-sha256 header) to the signer, and the signer responds with a valid S3 PUT request (minus body). The client takes the signer's response, appends its chosen request payload, and uploads it to S3. With such a system, you can implement a signer in a lambda function, and the lambda function enforces the content-addressed invariant.
Unfortunately it doesn't work natively with multipart: while SigV4+S3 enables you to enforce the SHA256 of each individual part, you can't enforce the SHA256 of the entire object. If you really want, you can invent your own tree hashing format atop SHA256, and enforce content-addressability on that.
I have a blog post [1] that goes into more depth on signers in general.
S3 has supported SHA-256 as a checksum algo since 2022. You can calculate the hash locally and then specify that hash in the PutObject call. S3 will calculate the hash and compare it with the hash in the PutObject call and reject the Put if they differ. The hash and algo are then stored in the object's metadata. You simply also use the SHA-256 hash as the key for the object.
That's interesting. Would you want it to be something like a bucket setting, like "any time an object is uploaded, don't let an object write complete unless S3 verifies that a pre-defined hash function (like SHA256) is called to verify that the object's name matches the object's contents?"
That will probably never happen because of the fundamental nature of blob storage.
Individual objects are split into multiple blocks, each of which can be stored independently on different underlying servers. Each can see its own block, but not any other block.
Calculating a hash like SHA256 would require a sequential scan through all blocks. This could be done with a minimum of network traffic if instead of streaming the bytes to a central server to hash, the hash state is forwarded from block server to block server in sequence. Still though, it would be a very slow serial operation that could be fairly chatty too if there are many tiny blocks.
What could work would be to use a Merkle tree hash construction where some of subdivision boundaries match the block sizes.
Is there any reason you can't enforce that restriction on your side? Or are you saying you want S3 to automatically set the name for you based on the hash?
To avoid any dependencies other than object storage, we've been making use of this in our database (turbopuffer.com) for consensus and concurrency control since day one. Been waiting for this since the day we launched on Google Cloud Storage ~1 year ago. Our bet that S3 would get it in a reasonable time-frame worked out!
Interesting that what’s basically an ad is the top comment - it’s not like this is open source or anything - can’t even use it immediately (you have to apply for access). Totally proprietary. At least elasticsearch is APGL, saying nothing of open search which also supports use of S3
If my memory of parallel algorithms class serves me right, you can build any synchronization algorithm on top of compare-and-swap as an atomic primitive.
As a (horribly inefficient, in case of non-trivial write contention) toy example, you could use S3 as a lock-free concurrent SQLite storage backend: Reads work as expected by fetching the entire database and satisfying the operation locally; writes work like this:
- Download the current database copy
- Perform your write locally
- Upload it back using "Put-If-Match" and the pre-edit copy as the matched object.
- If you get success, consider the transaction successful.
- If you get failure, go back to step 1 and try again.
It is often very important to know, when you write an object, what the previous state was. Say you sold plushies and you had 100 plushies in a warehouse. You create a file "remainingPlushies.txt" that stores "100". If somebody buys a plushie, you read the file, and if it's bigger than 0, you subtract 1, write the new version of the file, and okay the sale.
Without conditional writes, two instances of your application might both read "100", both subtract 1, and both write "99". If they checked the file afterward, both would think everything was fine. But things aren't find because you've actually sold two.
The other cloud storage providers have had these sorts of conditional write features since basically forever, and it's always been really weird that S3 has lacked them.
The short of it is that building a database on top of object storage has generally required a complicated, distributed system for consensus/metadata. CAS makes it possible to build these big data systems without any other dependencies. This is a win for simplicity and reliability.
Noting that Azure Blob storage supports e-tag / optimistic controls as well (via If-Match conditions)[1], how does this differ? Or is it the same feature?
This combined with the read-after-write consistency guarantee is a perfect building block (pun intended) for incremental append only storage atop an object store. It solves the biggest problem with coordinating multiple writers to a WAL.
If the default ETag algorithm for non-encrypted, non-multipart uploads in AWS is a plain MD5 hash, is this subject to failure for object data with MD5 collisions?
I'm thinking of a situation in which an application assumes that different (possibly adversarial) user-provided data will always generate a different ETag.
With Google Cloud Storage, you can solve this by conditionally writing based on the "generation number" of the object, which always increases with each new write, so you can know whether the object has been overwritten regardless of its contents. I think Azure also has an equivalent.
Ironically with this and lambda you could make a serverless sqlite by mapping pages to objects, using http range reads to read the db and lambda to translate queries to the writes in the appropriate pages via cas. Prior to this it would require a server to handle concurrent writers, making the whole thing a nonstarter for “serverless”.
Too bad performance would be terrible without a caching layer (ebs).
s3fs's https://github.com/fsspec/s3fs/pull/917 was in response to the IfNoneMatch feature from the summer. How would people imagine this new feature being surfaced in a filesystem abstraction?
I had no idea people rely on S3 beyond dumb storage. It almost feels like people are trying to build out a distributed OLAP database in the reverse direction.
That's not "strange" to me. Object storage has been a long time coming, and it's still being figured out: the entirely typical process of discovering useful and feasible primitives that expand applicability to more sophisticated problems. This is obviously going occur first in smaller and/or younger, more agile implementations, whereas AWS has the problem of implementing this at pretty much the largest conceivable scale with zero risk. The lag is, therefore, entirely unsurprising.
It's not surprising at all. The scale of AWS, in particular S3, is nearly unfathomable, and the kind of solutions they need for "simple" things are totally different at that size. S3 was doing 1.1million requests a second back in 2013.[1]
I wouldn't be surprised if they saw over 100mil/req/sec globally by now. That's 100 million requests a second that need strong read-your-write consistency and atomicity at global scale. The number of pieces they had to move into place for this to happen is probably quite the engineering tale.
Does this mean, in theory we will be able to manage multiple concurrent writes/updates to s3 without having to use new solutions like Regatta[1] that was recently launched?
Here's how I would think about this. Regatta isn't the best way to add synchronization primitives to S3, if you're already using the S3 API and able to change your code. Regatta is most useful when you need a local disk, or a higher performance version of S3. In this case, the addition of these new primitives actually just makes Regatta work better for our customers -- because we get to achieve even stronger consistency.
It ensures that when you try to upload (or “put”) a new version of a file, the operation only succeeds if the file on the server still has the exact version (ETag) you specify. If someone else has updated the file in the meantime, your upload is blocked to prevent overwriting their changes.
This is especially useful in scenarios where multiple users or processes are working on the same data, as it helps maintain consistency and avoids accidental overwrites.
This is using the same mechanism as HTTP's `If-None-Match` header so it's easier to implement/learn
So...are we closer to getting to use S3 as a...you guessed it...a database? With CAS, we are probably able to get a basic level of atomicity, and S3 itself is pretty durable, now we have to deal with consistency and isolation...although S3 branded itself as "eventually consistent"...
There was a great deal of interest in gossip protocols, eventual consistency, and such at Amazon in the mid oughts. So much so that they hired a certain Cornell professor along with the better part of his grad students to build out those technologies.
I implemented that extension in R2 at launch IIRC. Thanks for catching up & helping move distributed storage applications a meaningful step forward. Intended sincerely. I'm sure adding this was non-trivial for a complex legacy codebase like that.
Now if only you had more control over the ETag, so you could use a sha256 of the total file (even for multi-part uploads), or a version counter, or a global counter from an external system, or a logical hash of the content as opposed to a hash of the bytes.
What is stopping you not doing it now? I know Q is not good (hallucinates, slow, requires sign in) But it's wise to explain what your gripe is about than saying which you can always do.
torginus|1 year ago
My other favorite pattern is implementing a pool of workers by quering ec2 instances with a certain tag in a stopped state and starting them. Starting the instance can succeed only once - that means I managed to snatch the machine. If it fails, I try again, grabbing another one.
This is one of those things that I never advertised out of professional shame, but it works, its bulletproof and dead simple and does not require additional infra to work.
belter|1 year ago
_zoltan_|1 year ago
williamdclt|1 year ago
JoshTriplett|1 year ago
My biggest wishlist item for S3 is the ability to enforce that an object is named with a name that matches its hash. (With a modern hash considered secure, not MD5 or SHA1, though it isn't supported for those either.) That would make it much easier to build content-addressible storage.
josnyder|1 year ago
The client sends the request headers (including the x-amz-content-sha256 header) to the signer, and the signer responds with a valid S3 PUT request (minus body). The client takes the signer's response, appends its chosen request payload, and uploads it to S3. With such a system, you can implement a signer in a lambda function, and the lambda function enforces the content-addressed invariant.
Unfortunately it doesn't work natively with multipart: while SigV4+S3 enables you to enforce the SHA256 of each individual part, you can't enforce the SHA256 of the entire object. If you really want, you can invent your own tree hashing format atop SHA256, and enforce content-addressability on that.
I have a blog post [1] that goes into more depth on signers in general.
[1] https://josnyder.com/blog/2024/patterns_in_s3_data_access.ht...
UltraSane|1 year ago
https://aws.amazon.com/blogs/aws/new-additional-checksum-alg...
texthompson|1 year ago
jiggawatts|1 year ago
Individual objects are split into multiple blocks, each of which can be stored independently on different underlying servers. Each can see its own block, but not any other block.
Calculating a hash like SHA256 would require a sequential scan through all blocks. This could be done with a minimum of network traffic if instead of streaming the bytes to a central server to hash, the hash state is forwarded from block server to block server in sequence. Still though, it would be a very slow serial operation that could be fairly chatty too if there are many tiny blocks.
What could work would be to use a Merkle tree hash construction where some of subdivision boundaries match the block sizes.
cmeacham98|1 year ago
anotheraccount9|1 year ago
Sirupsen|1 year ago
https://turbopuffer.com/blog/turbopuffer
amazingamazing|1 year ago
CobrastanJorji|1 year ago
1a527dd5|1 year ago
Genuinely, we've wanted this for ages and we got half way there with strong consistency.
ncruces|1 year ago
paulddraper|1 year ago
CubsFan1060|1 year ago
lxgr|1 year ago
As a (horribly inefficient, in case of non-trivial write contention) toy example, you could use S3 as a lock-free concurrent SQLite storage backend: Reads work as expected by fetching the entire database and satisfying the operation locally; writes work like this:
- Download the current database copy
- Perform your write locally
- Upload it back using "Put-If-Match" and the pre-edit copy as the matched object.
- If you get success, consider the transaction successful.
- If you get failure, go back to step 1 and try again.
CobrastanJorji|1 year ago
Without conditional writes, two instances of your application might both read "100", both subtract 1, and both write "99". If they checked the file afterward, both would think everything was fine. But things aren't find because you've actually sold two.
The other cloud storage providers have had these sorts of conditional write features since basically forever, and it's always been really weird that S3 has lacked them.
Sirupsen|1 year ago
jayd16|1 year ago
unknown|1 year ago
[deleted]
maglite77|1 year ago
[1]: https://learn.microsoft.com/en-us/azure/storage/blobs/concur...
simonw|1 year ago
koolba|1 year ago
IgorPartola|1 year ago
ncruces|1 year ago
So coordinating writes to multiple objects still requires… creativity.
offmycloud|1 year ago
I'm thinking of a situation in which an application assumes that different (possibly adversarial) user-provided data will always generate a different ETag.
revnode|1 year ago
unknown|1 year ago
[deleted]
UltraSane|1 year ago
CobrastanJorji|1 year ago
unknown|1 year ago
[deleted]
ipython|1 year ago
amazingamazing|1 year ago
Too bad performance would be terrible without a caching layer (ebs).
captn3m0|1 year ago
sillysaurusx|1 year ago
ncruces|1 year ago
Can we have this Google?
…
Please?
seansmccullough|1 year ago
mannyv|1 year ago
m_d_|1 year ago
spprashant|1 year ago
amne|1 year ago
2. glue jobs to partition by some columns reporting uses
3. query with athena
4. ???
5. profit (celebrate reduced cost)
This thing costs couple $ a month for ~500gb of data. Snowflake wanted crazy amounts of money for the same thing.
vytautask|1 year ago
topspin|1 year ago
aseipp|1 year ago
I wouldn't be surprised if they saw over 100mil/req/sec globally by now. That's 100 million requests a second that need strong read-your-write consistency and atomicity at global scale. The number of pieces they had to move into place for this to happen is probably quite the engineering tale.
[1] https://aws.amazon.com/blogs/aws/amazon-s3-two-trillion-obje...
tonymet|1 year ago
akira2501|1 year ago
wanderingmind|1 year ago
https://news.ycombinator.com/item?id=42174204
huntaub|1 year ago
gravitronic|1 year ago
lttlrck|1 year ago
rrr_oh_man|1 year ago
msoad|1 year ago
This is especially useful in scenarios where multiple users or processes are working on the same data, as it helps maintain consistency and avoids accidental overwrites.
This is using the same mechanism as HTTP's `If-None-Match` header so it's easier to implement/learn
stevefan1999|1 year ago
User23|1 year ago
mr_toad|1 year ago
gynther|1 year ago
vlovich123|1 year ago
anonymousDan|1 year ago
dvektor|1 year ago
Finally we can have this with s3 :)
mdaniel|1 year ago
paulsutter|1 year ago
thayne|1 year ago
londons_explore|1 year ago
juggli|1 year ago
throwaway314155|1 year ago
[deleted]
earth2mars|1 year ago
ramon156|1 year ago
grahamj|1 year ago
serbrech|1 year ago