If anyone is wondering why this is huge: it's because of the problems you face when you don't have strong consistency. See this picture for a sense of the kind of problems you can run into with eventual consistency:
I believe this makes Amazon S3 behave more similarly to Azure Blob Storage[1] and Google Cloud Storage[2], which is pretty convenient for folks who are abstracting blob stores across different offerings.
For what it's worth, consistency in S3 was usually pretty good anyway, but in the past I ran into issues where it could vary a bit. If you designed your application with this in mind, of course, it shouldn't be an issue. In my case I believe we just had to add some retry logic. (And of course, that is no longer necessary.)
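That kind of retry logic can be sketched as follows (a minimal illustration, assuming a boto3-style client; the bucket and key names are whatever your application uses):

```python
import time

def wait_until_visible(s3, bucket, key, attempts=5, delay=0.5):
    """Poll HEAD until a freshly written object is visible.

    Under the old eventual-consistency model a new or overwritten object
    could briefly appear missing or stale in some cases; a short
    exponential backoff papered over that window. With strong
    consistency this loop always succeeds on the first try.
    """
    for attempt in range(attempts):
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except s3.exceptions.ClientError:
            time.sleep(delay * 2 ** attempt)
    return False
```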
This is super awesome for customers. I am also beyond excited for all the open-source connectors to finally be simplified so that they don't have to deal with the "oh right, gotta be careful because of list consistency". It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop, and I hope to see the same across the S3-focused connectors.
I'm now much more inclined to make my GCS FUSE rewrite in Rust also support S3 directly :).
> It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop
It's been a few years, but I lost customer data on gs:// due to listing consistency: create a bunch of files, list the dir, rename the files one by one to commit the dir into Hive. The listing missed a file, and the rename step skipped the missing one.
Put in a breakpoint and the bug disappears; run the rename and the listing from the same JVM and the bug disappears.
Consistency issues are hell.
> It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop, and I hope to see the same across the S3-focused connectors.
Yes!
S3Guard was a complete pain to secure: since it was built on top of DynamoDB, and since any client could be a writer for some file (any file), you had to give all clients write access to the entire DynamoDB table, which was quite a security headache.
That's pretty incredible. With no price increase either. Using s3 as a makeshift database is possible in a lot more scenarios now.
I especially like using S3 as a DB when dealing with ETLs, where the state of some data is stored in its key prefix. This means the ETL can stop/resume at any point, with a single source of truth as the database.
The potential drawback is cost, of course: S3 has no true rename, so "moving" an object is really a copy plus a delete, and copies are billed like PUTs. S3's biggest price pain is always its PUTs. In many ETLs this is a non-issue, because you have to do those PUTs regardless: you probably want the data saved at the various stages for future debugging and recovery.
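The prefix-as-state idea can be sketched like this (a minimal illustration; the `pending/` and `done/` prefix names are hypothetical, and `s3` is any boto3-style client). Since S3 has no native rename, a "move" between state prefixes is a copy plus a delete:

```python
def next_state_key(key, new_state):
    """Swap the leading state prefix:
    'pending/2020/batch.csv' -> 'done/2020/batch.csv'."""
    _, _, rest = key.partition("/")
    return f"{new_state}/{rest}"

def advance_state(s3, bucket, key, new_state):
    """Move an object to its next state's prefix (copy + delete,
    i.e. one billed PUT per state transition)."""
    new_key = next_state_key(key, new_state)
    s3.copy_object(Bucket=bucket, Key=new_key,
                   CopySource={"Bucket": bucket, "Key": key})
    s3.delete_object(Bucket=bucket, Key=key)
    return new_key
```

With strong listing consistency, a resumed ETL that lists `pending/` now sees exactly the objects that are actually in that state.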
Yup, me too! Check out this state_doc interface I wrote in Python for building a state machine whose state is stored in S3. I never bothered to solve the edge case of inconsistent reads, because my little state machine slept for hours waiting for work and then took only minutes between waking up and doing the work to verify restores.
With the news of S3's strong consistency, my program is immediately safer against a whole class of defects: by changing the underlying datastore's properties, S3 has made my code's naive (formerly invalid) assumptions true.
This is one reason I have been a big fan of Google Cloud Storage over AWS S3: at a past company AWS consistency was a huge pain, and GCS has had this for years.
Ask HN: I have a question regarding strong consistency. The question applies to all geo-distributed, strongly consistent storage systems, but let me pick Google Spanner as an example.
Spanner has a maximum upper bound of 7 milliseconds of clock offset between nodes, thanks to TrueTime. To achieve external consistency, Spanner simply waits 7 ms before committing a transaction. After that amount of time the window of uncertainty is over, and we can be sure that every transaction is ordered correctly. So far so good, but how does Spanner deal with cross-datacenter network latency? Let's say I commit a transaction to Spanner in Canada and after 7 ms I get my confirmation, but now in Australia someone else makes a conflicting transaction and also gets their confirmation after 7 ms. Spanner, however, bound by the laws of physics, can only see this conflict after a 100+ ms delay due to network latency between the datacenters. How is that problem solved?
The simple answer is that there are round trips required between datacenters when you have a quorum that spans data centers. Additionally, one replica of a split is the leader, and the leader holds write locks. So already you have to talk to the leader replica, even if it's outside the DC you're in. Getting the write lock overlaps with the transaction commit, though. So for your example, if we say the leader is in Canada and the client is in Australia, and we're doing a write to row 'Foo' without first reading it (a so-called blind write):
Client (Australia) -> Leader (Canada): Take a write lock on 'Foo' and try to commit transaction
Leader -> other replicas: Start commit, timestamp 123
Other replicas -> Leader: Ready for commit
Leader waits for majority ready for commit and for timestamp 123 to be before the current truetime interval
Leader -> Other replicas: Commit transaction (and, in parallel, reply to the client).
Of course there are things you can do to mitigate this depending on your needs, but there's no free lunch. If you have a client in Australia and a client in Canada writing to the same data you're going to pay for round trips for transactions.
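The commit-wait step in the sequence above can be sketched as a toy simulation (illustrative only, not Spanner's implementation; the 7 ms bound comes from the question above): the leader only acknowledges a timestamp once even the earliest possible current time is past it.

```python
import time

EPSILON = 0.007  # TrueTime-style uncertainty bound, ~7 ms per the question

def tt_now():
    """A TrueTime-style interval: the true time lies in [earliest, latest]."""
    t = time.monotonic()
    return (t - EPSILON, t + EPSILON)

def commit_wait(commit_ts):
    """Block until commit_ts is definitely in the past, i.e. until even
    the earliest possible current time is after it. Only then can a
    leader acknowledge the commit without risking ordering inversions."""
    while tt_now()[0] <= commit_ts:
        time.sleep(0.0005)
```

A leader would pick `commit_ts = tt_now()[1]` (the latest possible "now"), replicate to a majority, and commit-wait before replying; as the answer notes, the cross-datacenter round trips overlap with this wait.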
Writes are implemented with pessimistic locking. A write in a Canada datacenter may have to acquire a lock in Australia as well. See page 12 of the paper for the F1 numbers (mean of 103.0ms with standard deviation of 50ms for multi-site commits).
For any AWS EMR/Hadoop users out there, this means the end of the EMRFS consistent view.
For the uninitiated: EMRFS recognizes consistency problems but does not fix them. It throws an exception when there's a consistency problem and even gives you some tools to fix it, but those tools may give false positives. The result is that you've got to fix some consistency problems yourself: parse items out of the EMRFS DynamoDB catalog, match them up to S3, then make adjustments where needed. It's an unnecessary chore.
It surprises me that this issue hasn't gotten more attention over the years; thankfully it's now being solved.
Side question -- why is s3fs so slow? I get only about 1 put per 10 seconds with s3fs. One would hope that you could use something like s3fs as a makeshift S3 API, but its performance is really horrible.
One fun fact that I learned recently: a prefix is not strictly path-delimited. I would think of /foo/bar and /foo/baz and /bar/baz as having two prefixes, but it could be anywhere from one to three, depending on how S3 has partitioned your data.
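To make that concrete, here is a self-contained illustration (plain Python, no S3 calls): a prefix is just a leading substring of the key, so '/' gets no special treatment.

```python
def matching_keys(keys, prefix):
    """What prefix matching means in S3: a plain leading-substring
    test on the full key; the '/' character is not special."""
    return sorted(k for k in keys if k.startswith(prefix))

keys = ["foo/bar", "foo/baz", "bar/baz"]
```

For example, `matching_keys(keys, "foo/ba")` matches both `foo/bar` and `foo/baz`, even though `foo/ba` ends mid-"path component"; `foo/`, `foo/b`, and `f` are all equally valid prefixes of the same keys.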
Neat! Would this make it possible in the future for S3 to support CAS-like operations for PutObject? Something like supporting `If-Match` and `If-None-Match` headers against the ETag of an object would be so useful.
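For a sense of the semantics being asked for, here is a hypothetical sketch (not a real S3 feature; emulating it client-side as below leaves a race between the HEAD and the PUT, which is exactly why server-side `If-Match` support would matter):

```python
def cas_put(s3, bucket, key, body, expected_etag):
    """Hypothetical compare-and-swap PUT: write only if the object's
    current ETag equals expected_etag.

    Done client-side like this, another writer can sneak in between the
    HEAD and the PUT; a real If-Match on PutObject would make the check
    atomic on the server.
    """
    head = s3.head_object(Bucket=bucket, Key=key)
    if head["ETag"] != expected_etag:
        return False  # someone else changed the object
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return True
```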
Listing and update consistency is great, but the ACID SQL-on-S3 frameworks (Apache Hudi, Delta Lake) also require atomic rename to support transactions.
HopsFS-S3 solves the ACID SQL-on-S3 problem discussed here on HN last week ( https://news.ycombinator.com/item?id=25149154 ) by adding a metadata layer over S3 and providing a POSIX-like HDFS API to clients. Operations like rename, mv, chown, chmod are metadata operations - rename is not copy and delete.
Once libraries and tools start relying on this, it is going to make life interesting for the S3-compatible players. The API remains the same, but the behavior is quite different.
S3 is almost certainly not fully partition tolerant at the node level and requires some sort of quorum. Other “magical” data stores like Spanner also retain this limitation, they just have very reliable replication strategies.
This is great news for application developers! And probably hard work for the AWS implementors. I will try to update this over the weekend: https://github.com/gaul/are-we-consistent-yet
tutfbhuf|5 years ago
https://i.ibb.co/DtxrRH3/eventual-consistency.png
jchw|5 years ago
[1]: https://docs.microsoft.com/en-us/azure/storage/common/storag...
[2]: https://cloud.google.com/storage/docs/consistency
basicneo|5 years ago
Is this public?
foxhop|5 years ago
https://github.com/remind101/dbsnap/blob/master/dbsnap_verif...
I should likely pull out the StateDoc class into its own module.
kartayyar|5 years ago
https://cloud.google.com/blog/products/gcp/how-google-cloud-...
jz-amz|5 years ago
* Jeff Barr's blog post on this topic has an interesting case study: https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...
* Collaboration with Hadoop/S3A maintainers leading to this release: https://aws.amazon.com/blogs/opensource/community-collaborat...
damon_c|5 years ago
> You can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in an S3 bucket.
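Because those limits are per prefix, a common workaround is to spread hot keys across hashed prefixes, each with its own request-rate budget. A sketch (the two-digit shard scheme here is hypothetical, not an S3 convention):

```python
import hashlib

def sharded_key(key, shards=16):
    """Prepend a stable, hash-derived shard prefix so a hot keyspace is
    spread across `shards` prefixes; each prefix then gets its own
    3,500-PUT / 5,500-GET per-second budget."""
    digest = hashlib.md5(key.encode()).hexdigest()
    shard = int(digest, 16) % shards
    return f"{shard:02d}/{key}"
```

The hash must be stable so readers can recompute the same shard prefix for a given logical key.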
jamesblonde|5 years ago
Disclosure: I work on HopsFS
hoytschermerhrn|5 years ago
[0] https://cloud.google.com/spanner/docs/true-time-external-con...
ultimoo|5 years ago
I wonder whether any companies/businesses solely depended on offering eventual consistency workarounds that probably now need to pivot.
mikaeluman|5 years ago
Where is the mistake?
leakybucket|5 years ago
With the consistency change, those might be useful as the basis for atomic operations.