top | item 35501051


quanticle | 2 years ago

Is it just me, or does Ed Huang skip over the most important part of database design: actually making sure the database has stored the data?

I read to the end of the article, and while having a database as a serverless collection of microservices deployed to a cloud provider might be useful, it will ultimately be useless if this swarm approach doesn't give me any guarantees about how or if my data actually makes it onto persistent storage at some point. I was expecting a discussion of the challenges and pitfalls involved in ensuring that a cloud of microservices can concurrently access a common data store (whether that's a physical disk on a server or an S3 bucket) without stomping on each other, but that seemed to be entirely missing from the post.

Performance and scalability are fine, but when it comes to databases, they're of secondary importance to ensuring that the developer has a good understanding of when and if their data has been safely stored.
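The "has my data been safely stored?" question usually bottoms out in the write-flush-fsync-rename dance. As a minimal sketch (not from the article; `durable_write` is a hypothetical helper, and the directory-fsync step assumes a POSIX filesystem):

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data so that, on return, it should survive a crash or power loss."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()              # drain Python's userspace buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push the file's blocks to the device
    os.replace(tmp, path)      # atomically swap the new file into place
    # fsync the containing directory so the rename itself is durable (POSIX)
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

Only after the final fsync returns can a database honestly acknowledge a write; a serverless swarm has to provide an equivalent guarantee somewhere, whether against local disks or object storage.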



audioheavy|2 years ago

Excellent point. Many discussions here do not emphasize transactional guarantees enough, and most developers writing front-ends should not have to worry about programming to address high write contention and concurrency to avoid data anomalies.

As an industry, we've progressed quite a bit from accepting data isolation compromises like "eventual consistency" in NoSQL, cloud, and serverless databases. The database I work with (Fauna) implements a distributed transaction engine inspired by the Calvin consensus protocol that guarantees strictly serializable writes over disparate globally deployed replicas. Both Calvin and Spanner implement such guarantees (in significantly different ways), but Fauna is more of a turnkey, low-ops service.

Again, for disclosure: I work for Fauna, but we've proven that you can accomplish this without having to worry about managing clusters, replication, partitioning strategies, etc. In today's serverless world, manually managing database clusters involves a lot of undifferentiated, costly heavy lifting. YMMV.

rockwotj|2 years ago

I agree that reliably persisting data is table stakes for a database, and I assume Ed takes it for granted that this needs to work. There's obviously a lot of non-trivial stuff there, but this post seems to be more about database product direction than the nitty-gritty technical details of fsync, filesystems, etc.

mamcx|2 years ago

Also, most of the "action" in this sphere is aimed at the "super-rich" customer: assume more than one machine, lots of RAM, fast I/O and fast networks, etc. Which means: it runs on AWS or some other "super-rich" environment.

There, you can $$$ your way out of data corruption. You can even afford to lose all the data on a node if you have enough replicas and backups.

Not many are in the game of SQLite.

This is the space I wish to work in more. I think it not only means you can do better than the high end, it's also more practical all around: if you commit to a DB that depends on running in the cloud (mostly to mask that it's not that fast, to mask that it's not that reliable, and to extract more $$$ from customers), then when you NEED to have a portion of that data locally, you're screwed, and then you use SQLite!