item 35491682

Building a Database in the 2020s (2022)

103 points | lawrencechen | 2 years ago | me.0xffff.me

57 comments

[+] carom|2 years ago|reply
Memory safety is something that needs to be mentioned. I was integrating DuckDB into a project and ended up ripping it out after running into a memory corruption issue in practice. Upon investigation, I found a massive number of fuzzer-found bugs on their GitHub. While I am glad they are fuzzing and finding issues, I cannot ship that onto customer systems.

We have a few very good memory-safe programming languages at this point. Please do not start a project in C/C++ unless you are truly exceptional and understand memory management and exploitation inside and out. I switched to SQLite for the project, since it is one of the most heavily fuzzed applications out there that fits the need. The next embeddable database I use (bonus if it works in the cloud) will need to be written in a memory-safe language.

[+] vbezhenar|2 years ago|reply
There is more than one embedded database in Java. They were already mature and well tested 10 years ago. Why people keep overlooking this ecosystem is beyond my comprehension.
[+] tyingq|2 years ago|reply
If you're writing an embeddable database, exposing it as a C library maximizes the potential user base, since most languages have some way to do FFI with C, most platforms have C compilers, etc.
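To make the point concrete, here is a minimal sketch of what such a C surface might look like: an opaque handle plus plain C functions, loosely modeled on SQLite's open/exec/close pattern. All the names (`mydb_open`, `mydb_put`, etc.) and the toy one-slot store are hypothetical, just enough to show the FFI-friendly shape.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical embeddable-database C surface. An opaque handle and
 * plain C functions are trivially bindable from any language with
 * C FFI (Python ctypes, Rust bindgen, Java JNI, ...). */
typedef struct mydb mydb;

struct mydb {                 /* toy backing store: one key/value slot */
    char key[64];
    char val[256];
    int  used;
};

/* Open a database handle; a real engine would open `path` here. */
int mydb_open(const char *path, mydb **out) {
    (void)path;
    *out = calloc(1, sizeof(mydb));
    return *out ? 0 : -1;
}

/* Store a key/value pair (overwrites the single slot in this toy). */
int mydb_put(mydb *db, const char *key, const char *val) {
    strncpy(db->key, key, sizeof db->key - 1);
    strncpy(db->val, val, sizeof db->val - 1);
    db->used = 1;
    return 0;
}

/* Look up a key; returns NULL when not found. */
const char *mydb_get(mydb *db, const char *key) {
    return (db->used && strcmp(db->key, key) == 0) ? db->val : NULL;
}

void mydb_close(mydb *db) { free(db); }
```

The design choice that matters for FFI is that nothing in the signatures requires the caller to know the struct layout: callers only ever hold a `mydb *`, which is exactly why SQLite-style APIs port everywhere.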
[+] KrugerDunnings|2 years ago|reply
I've been building a PostgreSQL extension over the last few months for some functionality that was needed, and have learned a ton about the internal workings of this database. It all sounds very scary and complicated, but I feel privileged to be able to do this, because the things you learn are pure gold. My attitude before this was that of the ideal customer of a cloud database: someone who was scared of SQL and preferred to hide behind the complexity of an ORM. Not anymore; now I write thousands of lines of SQL and laugh to myself like a maniac.
[+] nonethewiser|2 years ago|reply
That’s a big jump, from hiding behind ORMs to learning the inner workings of Postgres.

Do you have an example of something that you previously saw as a black box but now understand? Perhaps something simpler than you anticipated, or just something you never even knew existed before digging into the details?

[+] juancn|2 years ago|reply
Databases are worth the time to understand deeply. Postgres in particular is a magnificent beast.
[+] quanticle|2 years ago|reply
Is it just me, or does Ed Huang skip over the most important part of database design: actually making sure the database has stored the data?

I read to the end of the article, and while having a database as a serverless collection of microservices deployed to a cloud provider might be useful, it is ultimately useless if this swarm approach doesn't give me any guarantees about how or if my data actually makes it onto persistent storage at some point. I was expecting a discussion of the challenges and pitfalls involved in ensuring that a cloud of microservices can concurrently access a common data store (whether that's a physical disk on a server or an S3 bucket) without stomping on each other, but that seemed to be entirely missing from the post.

Performance and scalability are fine, but when it comes to databases, they're of secondary importance to ensuring that the developer has a good understanding of when and if their data has been safely stored.

[+] audioheavy|2 years ago|reply
Excellent point. Many discussions here do not emphasize transactional guarantees enough, and most developers writing front-ends should not have to worry about programming to address high write contention and concurrency to avoid data anomalies.

As an industry, we've progressed quite a bit from accepting data isolation level compromises like "eventual consistency" in NoSQL, cloud, and serverless databases. The database I work with (Fauna) implements a distributed transaction engine inspired by the Calvin consensus protocol that guarantees strictly serializable writes over disparate globally deployed replicas. Both Calvin and Spanner implement such guarantees (in significantly different ways) but Fauna is more of a turn-key, low Ops service.

Again, to disclaim, I work for Fauna, but we've proven that you can accomplish this without having to worry about managing clusters, replication, partitioning strategies, etc. In today's serverless world, spending time managing database clusters manually involves a lot of undifferentiated, costly heavy lifting. YMMV.

[+] rockwotj|2 years ago|reply
I agree that actually persisting data reliably is table stakes for a database, and I assume Ed takes it for granted that this needs to work. Obviously there is lots of non-trivial stuff there, but this post seems to be more about database product direction than the nitty-gritty technical details of fsync, filesystems, etc.
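For anyone curious what that nitty-gritty looks like, a common POSIX pattern is that data is only durably "stored" after fsync() succeeds on the file, and a newly created file's directory entry is only durable after an fsync() on its parent directory. A rough sketch (the function name `durable_append` is made up; error handling is minimal):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Append `len` bytes to `path`, then fsync the file and its parent
 * directory so both the data and the directory entry survive a crash.
 * Returns 0 on success, -1 on any failure. */
int durable_append(const char *dir, const char *path,
                   const void *buf, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return -1;
    if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
    if (fsync(fd) != 0) { close(fd); return -1; }  /* flush file data */
    close(fd);

    int dfd = open(dir, O_RDONLY);                 /* dir fd for fsync */
    if (dfd < 0) return -1;
    int rc = fsync(dfd);                           /* flush dir entry */
    close(dfd);
    return rc;
}
```

Even this sketch hides real pitfalls (fsync failure semantics vary across filesystems and kernels), which is exactly why write-ahead logs and careful fsync ordering are a whole subfield of database engineering.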
[+] vkakade|2 years ago|reply
I would also add that the databases of the 2020s will be written in Rust rather than C/C++. The safety guarantees Rust provides make the development process faster and result in clean code that is easier to understand and extend.
[+] jandrewrogers|2 years ago|reply
Modern database architectures have memory safety models that the Rust compiler currently can’t reason about. Hence why new database kernels are still written in C++. It isn’t because C++ is great (it is not) but because there are proven high-performance safety models in databases that Rust has difficulty expressing without a lot of “unsafe”.

Generally speaking, modern database architectures have minimal dynamic memory allocation, little multi-threaded code to speak of, and few opportunities for buffer overflows (everything internally is built around paged memory). Where are these advantages from Rust supposed to come from? It has a lot of disadvantages for databases today, and I say that as someone who has used Rust in this space. People who build this kind of software are pragmatic, and Rust doesn't deliver for database developers currently.
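To illustrate the "everything is paged memory" point: many engines preallocate a pool up front and hand out fixed-size pages from it, so the hot path does no malloc/free and bounds are enforced per page by construction. A toy sketch (the `page_pool` type and sizes here are invented for illustration, not from any real engine):

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE  4096
#define POOL_PAGES 64

/* Toy fixed-size page pool: all memory is preallocated, pages are
 * handed out by a bump allocator, and every page is exactly
 * PAGE_SIZE bytes, so accesses are bounded by construction. */
typedef struct {
    uint8_t  mem[POOL_PAGES * PAGE_SIZE];
    uint32_t next_free;          /* index of the next unused page */
} page_pool;

void pool_init(page_pool *p) {
    memset(p, 0, sizeof *p);
}

/* Returns a zeroed page, or NULL once the pool is exhausted. */
uint8_t *pool_alloc_page(page_pool *p) {
    if (p->next_free >= POOL_PAGES) return NULL;
    return &p->mem[(size_t)p->next_free++ * PAGE_SIZE];
}
```

This is the shape of allocator where the borrow checker has little to add: lifetimes are tied to the pool, not to individual objects, which is part of why such designs sit awkwardly with safe Rust.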

[+] breck|2 years ago|reply
I have seen some data that makes me think you may be right. I haven't looked at database projects yet but looking at the code bases of other programming languages and big systems (such as Linux), I see the Rust file count going up. That being said, I did recently look at SQLite and that is still all C. It's on my todo list to look this up for all the major open source DBs.
[+] gxt|2 years ago|reply
Working on it. I just wish LLVM were natively written in Rust too; it's easy to segfault when doing something unexpected.
[+] studmuffin650|2 years ago|reply
My 2 cents is that modern C++ (C++11 and greater) removes most of the memory safety problems that people have historically used as examples of the language being dangerous or bad. I would actually predict that modern databases will probably continue to be built in C++, but using a modern version of the language, providing both memory safety and high performance.
[+] vp8989|2 years ago|reply
I don't agree with the premise that running transactional and analytical workloads on the same database is architecturally “simpler”. In my experience this is only true at very low scale, and those contexts are already sufficiently well served by existing database tech.
[+] munchor|2 years ago|reply
Can you elaborate? I am curious :)
[+] breck|2 years ago|reply
I'm working on a new Git-backed, file-based database for knowledge bases. It is not designed for domains where you can confidently predict your schemas ahead of time, but instead for use cases where you have large, complex schemas that are frequently changing.

It simply wasn't possible a few years ago (SSDs weren't fast enough, and Git was too slow for projects with a huge number of files).

I'm having fun with it.

The current version is just written in JavaScript, but if demand hits a higher level I would likely write a version in Rust or Go.

If anyone has any pointers to similar projects I'm all ears.

[+] FridgeSeal|2 years ago|reply
Have you looked at Dolt? Or LakeFS? Both of these are taking the “Git for data” approach too.
[+] refset|2 years ago|reply
Interesting read! I would like to add:

* databases still need to get better at schema management and workload isolation, to enable multiple applications to properly integrate through the database (as traditionally envisioned)

* HTAP seems inevitable but needs built-in support for row-level history/versioning to get maximum benefit

* databases should abstract over raw compute infrastructure efficiently enough that you don't need k8s to run your application logic and APIs elsewhere. The database should be a decent all-in-one place to build & ship stuff

[+] paulddraper|2 years ago|reply
I'm surprised there isn't a "serverless" PostgreSQL. That seems like it would get more bang for the buck than writing a cloud-native DB from scratch.

(Or maybe there is one but I don't know about it.)

AWS made a serverless MySQL.

[+] esperent|2 years ago|reply
A quick Google search suggests there are many serverless Postgres implementations, including an AWS version.
[+] dragonwriter|2 years ago|reply
> I'm surprised there isn't a "serverless" PostgreSQL

There are many.

> AWS made a serverless MySQL.

AWS Aurora Serverless has both MySQL and Postgres flavors.