If anyone is wondering why this is huge: it's because of the problems you face when you don't have strong consistency. See this picture for a sense of the kind of problems you can run into with eventual consistency:
I believe this makes Amazon S3 behave more similarly to Azure Blob Storage[1] and Google Cloud Storage[2], which is pretty convenient for folks who are abstracting blob stores across different offerings.
For what it's worth, consistency in S3 was usually pretty good anyway, but in the past I ran into issues where it could vary a bit. If you designed your application with this in mind, of course, it shouldn't be an issue. In my case I believe we just had to add some retry logic. (And of course, that is no longer necessary.)
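That kind of retry logic can be sketched as follows (a minimal illustration, assuming a boto3-style client; the bucket and key names are whatever your application uses):

```python
import time

def wait_until_visible(s3, bucket, key, attempts=5, delay=0.5):
    """Poll HEAD until a freshly written object is visible.

    Under the old eventual-consistency model a new or overwritten object
    could briefly appear missing or stale in some cases; a short
    exponential backoff papered over that window. With strong
    consistency this loop always succeeds on the first try.
    """
    for attempt in range(attempts):
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except s3.exceptions.ClientError:
            time.sleep(delay * 2 ** attempt)
    return False
```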
This is super awesome for customers. I am also beyond excited for all the open-source connectors to finally be simplified so that they don't have to deal with the "oh right, gotta be careful because of list consistency". It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop, and I hope to see the same across the S3-focused connectors.
I'm now much more inclined to make my GCS FUSE rewrite in Rust also support S3 directly :).
> It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop
It's been a few years, but I lost customer data on gs:// due to listing consistency: create a bunch of files, list the dir, rename the files one by one to commit the dir into Hive. The listing missed a file, and the rename step skipped the missing one.
Put in a breakpoint and the bug disappears; run the rename and the listing from the same JVM and the bug disappears.
Consistency issues are hell.
> It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop, and I hope to see the same across the S3-focused connectors.
Yes!
S3Guard was a complete pain to secure: since it was built on top of DynamoDB, and since any client could be a writer for some file (any file), you had to give all clients write access to the entire DynamoDB table, which was quite a security headache.
That's pretty incredible. With no price increase either. Using s3 as a makeshift database is possible in a lot more scenarios now.
I especially like using S3 as a DB when dealing with ETLs, where the state of some data is stored in its key prefix. This means the ETL can stop/resume at any point, with a single source of truth as the database.
The potential drawback is cost, of course: S3 has no true rename, so "moving" an object is really a copy plus a delete, and copies are billed like PUTs. S3's biggest price pain is always its PUTs. In many ETLs this is a non-issue, because you have to do those PUTs regardless: you probably want the data saved at the various stages for future debugging and recovery.
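The prefix-as-state idea can be sketched like this (a minimal illustration; the `pending/` and `done/` prefix names are hypothetical, and `s3` is any boto3-style client). Since S3 has no native rename, a "move" between state prefixes is a copy plus a delete:

```python
def next_state_key(key, new_state):
    """Swap the leading state prefix:
    'pending/2020/batch.csv' -> 'done/2020/batch.csv'."""
    _, _, rest = key.partition("/")
    return f"{new_state}/{rest}"

def advance_state(s3, bucket, key, new_state):
    """Move an object to its next state's prefix (copy + delete,
    i.e. one billed PUT per state transition)."""
    new_key = next_state_key(key, new_state)
    s3.copy_object(Bucket=bucket, Key=new_key,
                   CopySource={"Bucket": bucket, "Key": key})
    s3.delete_object(Bucket=bucket, Key=key)
    return new_key
```

With strong listing consistency, a resumed ETL that lists `pending/` now sees exactly the objects that are actually in that state.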
Yup, me too! Check out this state_doc interface I wrote in Python for building a state machine whose state is stored in S3. I never bothered to solve the edge case of inconsistent reads, because my little state machine slept for hours waiting for work and then took only minutes between waking up and doing the work to verify restores.
With the news of S3's strong consistency, my program is immediately safer against a whole class of defects: by changing the underlying datastore's properties, S3 has made my code's naive (formerly invalid) assumptions true.
This is one reason I have been a big fan of Google Cloud Storage over AWS S3: at a past company AWS consistency was a huge pain, and GCS has had this for years.
Ask HN: I have a question regarding strong consistency. The question applies to all geo-distributed, strongly consistent storage systems, but let me pick Google Spanner as an example.
Spanner has a maximum upper bound of 7 milliseconds of clock offset between nodes, thanks to TrueTime. To achieve external consistency, Spanner simply waits 7 ms before committing a transaction. After that amount of time the window of uncertainty is over, and we can be sure that every transaction is ordered correctly. So far so good, but how does Spanner deal with cross-datacenter network latency? Let's say I commit a transaction to Spanner in Canada and after 7 ms I get my confirmation, but now in Australia someone else makes a conflicting transaction and also gets their confirmation after 7 ms. Spanner, however, bound by the laws of physics, can only see this conflict after a 100+ ms delay due to network latency between the datacenters. How is that problem solved?
The simple answer is that there are round trips required between datacenters when you have a quorum that spans data centers. Additionally, one replica of a split is the leader, and the leader holds write locks. So already you have to talk to the leader replica, even if it's outside the DC you're in. Getting the write lock overlaps with the transaction commit, though. So for your example, if we say the leader is in Canada and the client is in Australia, and we're doing a write to row 'Foo' without first reading it (a so-called blind write):
Client (Australia) -> Leader (Canada): Take a write lock on 'Foo' and try to commit transaction
Leader -> other replicas: Start commit, timestamp 123
Other replicas -> Leader: Ready for commit
Leader waits for majority ready for commit and for timestamp 123 to be before the current truetime interval
Leader -> Other replicas: Commit transaction (and, in parallel, reply to the client).
Of course there are things you can do to mitigate this depending on your needs, but there's no free lunch. If you have a client in Australia and a client in Canada writing to the same data you're going to pay for round trips for transactions.
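The commit-wait step in the sequence above can be sketched as a toy simulation (illustrative only, not Spanner's implementation; the 7 ms bound comes from the question above): the leader only acknowledges a timestamp once even the earliest possible current time is past it.

```python
import time

EPSILON = 0.007  # TrueTime-style uncertainty bound, ~7 ms per the question

def tt_now():
    """A TrueTime-style interval: the true time lies in [earliest, latest]."""
    t = time.monotonic()
    return (t - EPSILON, t + EPSILON)

def commit_wait(commit_ts):
    """Block until commit_ts is definitely in the past, i.e. until even
    the earliest possible current time is after it. Only then can a
    leader acknowledge the commit without risking ordering inversions."""
    while tt_now()[0] <= commit_ts:
        time.sleep(0.0005)
```

A leader would pick `commit_ts = tt_now()[1]` (the latest possible "now"), replicate to a majority, and commit-wait before replying; as the answer notes, the cross-datacenter round trips overlap with this wait.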
Writes are implemented with pessimistic locking. A write in a Canada datacenter may have to acquire a lock in Australia as well. See page 12 of the paper for the F1 numbers (mean of 103.0ms with standard deviation of 50ms for multi-site commits).
For any AWS EMR/Hadoop users out there, this means the end of the EMRFS consistent view.
For the uninitiated: EMRFS recognizes consistency problems but does not fix them. It throws an exception when there's a consistency problem and even gives you some tools to fix it, but those tools may give false positives. The result is that you've got to fix some consistency problems yourself: parse items out of the EMRFS DynamoDB catalog, match them up to S3, then make adjustments where needed. It's an unnecessary chore.
It surprises me that this issue hasn't gotten more attention over the years; thankfully it's now being solved.
Side question -- why is s3fs so slow? I get only about 1 put per 10 seconds with s3fs. One would hope that you could use something like s3fs as a makeshift S3 API, but its performance is really horrible.
One fun fact that I learned recently: a prefix is not strictly path-delimited. I would think of /foo/bar and /foo/baz and /bar/baz as having two prefixes, but it could be anywhere from one to three, depending on how S3 has partitioned your data.
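To make that concrete, here is a self-contained illustration (plain Python, no S3 calls): a prefix is just a leading substring of the key, so '/' gets no special treatment.

```python
def matching_keys(keys, prefix):
    """What prefix matching means in S3: a plain leading-substring
    test on the full key; the '/' character is not special."""
    return sorted(k for k in keys if k.startswith(prefix))

keys = ["foo/bar", "foo/baz", "bar/baz"]
```

For example, `matching_keys(keys, "foo/ba")` matches both `foo/bar` and `foo/baz`, even though `foo/ba` ends mid-"path component"; `foo/`, `foo/b`, and `f` are all equally valid prefixes of the same keys.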
Neat! Would this make it possible in the future for S3 to support CAS-like operations for PutObject? Something like supporting `If-Match` and `If-None-Match` headers against the ETag of an object would be so useful.
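For a sense of the semantics being asked for, here is a hypothetical sketch (not a real S3 feature; emulating it client-side as below leaves a race between the HEAD and the PUT, which is exactly why server-side `If-Match` support would matter):

```python
def cas_put(s3, bucket, key, body, expected_etag):
    """Hypothetical compare-and-swap PUT: write only if the object's
    current ETag equals expected_etag.

    Done client-side like this, another writer can sneak in between the
    HEAD and the PUT; a real If-Match on PutObject would make the check
    atomic on the server.
    """
    head = s3.head_object(Bucket=bucket, Key=key)
    if head["ETag"] != expected_etag:
        return False  # someone else changed the object
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return True
```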
Listing and update consistency is great, but the ACID SQL-on-S3 frameworks (Apache Hudi, Delta Lake) also require atomic rename to support transactions.
HopsFS-S3 solves the ACID SQL-on-S3 problem discussed here on HN last week ( https://news.ycombinator.com/item?id=25149154 ) by adding a metadata layer over S3 and providing a POSIX-like HDFS API to clients. Operations like rename, mv, chown, chmod are metadata operations - rename is not copy and delete.
Once libraries and tools start relying on this, it is going to make life interesting for the S3-compatible players. The API remains the same, but the behavior is quite different.
S3 is almost certainly not fully partition tolerant at the node level and requires some sort of quorum. Other “magical” data stores like Spanner also retain this limitation, they just have very reliable replication strategies.
This is great news for application developers! And probably hard work for the AWS implementors. I will try to update this over the weekend: https://github.com/gaul/are-we-consistent-yet
tutfbhuf|5 years ago
https://i.ibb.co/DtxrRH3/eventual-consistency.png
jchw|5 years ago
[1]: https://docs.microsoft.com/en-us/azure/storage/common/storag...
[2]: https://cloud.google.com/storage/docs/consistency
basicneo|5 years ago
Is this public?
foxhop|5 years ago
https://github.com/remind101/dbsnap/blob/master/dbsnap_verif...
I should likely pull out the StateDoc class into its own module.
kartayyar|5 years ago
https://cloud.google.com/blog/products/gcp/how-google-cloud-...
jz-amz|5 years ago
* Jeff Barr's blog post on this topic has an interesting case study: https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...
* Collaboration with Hadoop/S3A maintainers leading to this release: https://aws.amazon.com/blogs/opensource/community-collaborat...
damon_c|5 years ago
> You can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in an S3 bucket.
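Because those limits are per prefix, a common workaround is to spread hot keys across hashed prefixes, each with its own request-rate budget. A sketch (the two-digit shard scheme here is hypothetical, not an S3 convention):

```python
import hashlib

def sharded_key(key, shards=16):
    """Prepend a stable, hash-derived shard prefix so a hot keyspace is
    spread across `shards` prefixes; each prefix then gets its own
    3,500-PUT / 5,500-GET per-second budget."""
    digest = hashlib.md5(key.encode()).hexdigest()
    shard = int(digest, 16) % shards
    return f"{shard:02d}/{key}"
```

The hash must be stable so readers can recompute the same shard prefix for a given logical key.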
jamesblonde|5 years ago
Disclosure: I work on HopsFS
hoytschermerhrn|5 years ago
[0] https://cloud.google.com/spanner/docs/true-time-external-con...
ultimoo|5 years ago
I wonder whether any companies/businesses solely depended on offering eventual consistency workarounds that probably now need to pivot.
mikaeluman|5 years ago
Where is the mistake?
leakybucket|5 years ago
With the consistency change, those might be useful as the basis for atomic operations.