top | item 16995612

Never Write Your Own Database (2017)

445 points| ahiknsr | 8 years ago |medium.com | reply

154 comments

[+] tlb|8 years ago|reply
Akin's rules for spacecraft design [0] include: "Any exploration program which "just happens" to include a new launch vehicle is, de facto, a launch vehicle program."

By analogy, any software project that includes writing a database is, de facto, a database project.

[0] http://spacecraft.ssl.umd.edu/akins_laws.html

[+] Strom|8 years ago|reply
Another common version of this is: video game project turning into a video game engine project, usually not even reaching the actual content phase.
[+] lukaslalinsky|8 years ago|reply
Any software project that needs a custom database most likely is a database project anyway. In such cases, it's often more realistic to write a new database than it is to hack an existing one. And these days you have many good options for low-level storage.
[+] MichaelGG|8 years ago|reply
Sometimes it's a huge advantage. I wrote a network search engine. On a single 1TB spinning disk, I could handle 5TB of traffic, stored and indexed, per day. That's around 2 billion packets indexed. The key was having a log/merge system with only a couple of bits of overhead per entry, and compressed storage of chunks of packets for the actual data. (This was before LevelDB and Elasticsearch.)

In practice the index overhead per packet was only 2-3 bits. This was accomplished by lossy indexes, using hashes of just the right size to minimise false hits. The trade-off being that an occasional extra lookup is worth the vastly reduced size of compressed indexes.
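
The lossy-index idea can be sketched roughly like this (a toy illustration, not the original system: the class name, fingerprint size, and in-memory dict are all my assumptions; the real index would store compressed sorted fingerprint lists to approach 2-3 bits per entry):

```python
import hashlib

class LossyIndex:
    """Write-once lossy index: maps a small hash fingerprint of each key
    to the offsets of chunks that (probably) contain it. False hits are
    possible and are resolved by decompressing and scanning the chunk."""

    def __init__(self, fingerprint_bits=16):
        self.bits = fingerprint_bits
        self.entries = {}  # fingerprint -> list of chunk offsets

    def _fingerprint(self, key: bytes) -> int:
        h = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")
        return h & ((1 << self.bits) - 1)  # keep only `bits` bits (lossy)

    def add(self, key: bytes, chunk_offset: int) -> None:
        self.entries.setdefault(self._fingerprint(key), []).append(chunk_offset)

    def candidates(self, key: bytes) -> list:
        # May include chunks that don't actually hold the key (false hits);
        # the caller verifies against the chunk contents.
        return self.entries.get(self._fingerprint(key), [])
```

The trade-off is exactly the one described: sizing the fingerprint just large enough that the cost of occasional false lookups stays below the savings from the tiny index.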

To this day, I'm not aware of any general-purpose, lossy, write-once hashtable that gets close to such low overhead.

Competitors would use MySQL and insert per packet. The row overhead was more than my entire index. But it worked out: just toss 50k of hardware at it.

But... It does take over a lot of engineering time writing such bespoke software. Just compressing the hashes (a common info retrieval problem) is a huge area, now with SIMD optimised algorithms and everything.

[+] lmilcin|8 years ago|reply
I had to roll my own transactional database for an embedded product working in a very constrained environment (2MHz ARM, 2MB total memory). I did very extensive search and I found no product with the combination of characteristics I was looking for:

- working in very small but more importantly, constant memory (predictability is a must for reliable embedded app),

- provide provable transactional characteristics with guarantee to roll back to previous state in case transaction is interrupted and still be consistent and able to accept new transactions,

- minimum storage overhead -- the amount of overhead was directly translating to competitiveness of the device at the time as multiple vendors tried to provide their own solutions for the same preexisting platform

- storage write pattern that would provide wear leveling for the underlying flash

I ended up writing my own database (less than 1k LOC total, with encoding/decoding of data using BER-TLV) that met all these characteristics, took a few tens of bytes of stack, and added a few bytes of overhead per write. The database would vacuum itself and coalesce record state automatically. It had some O(n^3) algorithms, but THAT'S OK, since the amount of data could never grow large enough to pose a problem.
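
For readers unfamiliar with BER-TLV, here is a simplified sketch of the encoding (my own illustration, not the poster's code; real BER also has multi-byte tags and constructed types, which this omits):

```python
def tlv_encode(tag: int, value: bytes) -> bytes:
    """Encode one field as tag-length-value: one-byte tag, short-form
    length for values under 128 bytes, BER long-form length otherwise."""
    n = len(value)
    if n < 0x80:
        length = bytes([n])
    else:
        nbytes = (n.bit_length() + 7) // 8
        length = bytes([0x80 | nbytes]) + n.to_bytes(nbytes, "big")
    return bytes([tag]) + length + value

def tlv_decode(buf: bytes, pos: int = 0):
    """Decode one TLV field starting at `pos`; return (tag, value, next_pos)."""
    tag = buf[pos]
    pos += 1
    first = buf[pos]
    pos += 1
    if first < 0x80:
        n = first
    else:
        nbytes = first & 0x7F
        n = int.from_bytes(buf[pos:pos + nbytes], "big")
        pos += nbytes
    return tag, buf[pos:pos + n], pos + n
```

The appeal in a constrained embedded setting is visible here: per-record overhead is the tag plus one or a few length bytes, and decoding needs no heap allocation.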

The project took 2 years to complete. I spent maybe a month designing, implementing and perfecting the database. I wouldn't say Akin's law of spacecraft design applies here. I would probably have spent more than that integrating an existing product, and ended up with an inferior product anyway.

[+] faitswulff|8 years ago|reply
The title is misleading - it's actually about how and why they did end up writing their own db. From the article:

> There’s an old adage in software engineering — “never write your own database”. So why did the OneNote team go ahead and write one as part of upgrading OneNote’s local cache implementation?

[+] unbendable|8 years ago|reply
It's "funny" how you (and many others) didn't read the article and just commented based on the title.
[+] ghoul2|8 years ago|reply
Is this something publicly available? I am currently hunting around for an embedded database which has amazingly similar requirements, and have found nothing. I started my own db, but i'd love to be able to take a look at what you ended up with, if possible.
[+] nerfhammer|8 years ago|reply
Wonder if SQLite would work for something like that, if you can get away from SQL, i.e., strings.
[+] mpweiher|8 years ago|reply
Possibly relevant are these 2 great papers co-authored by Michael Stonebraker[1]:

“One Size Fits All”: An Idea Whose Time Has Come and Gone[2]

The End of an Architectural Era (It’s Time for a Complete Rewrite)[3]

They basically show that classical RDBMSs are inefficient by a factor of at least 10 in every conceivable application. I tend to trade a little of that away for the kinds of dramatically simpler mechanisms discussed in TFA.

[1] https://en.wikipedia.org/wiki/Michael_Stonebraker

[2] https://cs.brown.edu/~ugur/fits_all.pdf

[3] http://cs-www.cs.yale.edu/homes/dna/papers/vldb07hstore.pdf

[+] everybodyknows|8 years ago|reply
Do they identify any cases for which classical RDBMS are inefficient by a super-constant factor?
[+] commandlinefan|8 years ago|reply
"Never write/roll your own" is the counterpoint to NIH, "Not Invented Here" syndrome... however, the older I get, the more reasonable NIH cases I see, such as the one in the article. People will say, for example, "why would you write your own ORM when there are so many available for free? Why reinvent the wheel?" Yet when I start looking at the wheels that are available, they take up a lot of space, take quite a while to learn (while the boss is asking me "is it done yet? is it done yet?"), have their own bugs, and need to be regularly upgraded and maintained. This isn't necessarily a flaw - these "wheels" (ORMs, file storage solutions, web application frameworks and, dare I say it? security solutions!) are designed to solve generic problems rather than the specific targeted problem that I'm focused on.
[+] jmalicki|8 years ago|reply
If your problem is that your boss is asking "is it done yet?" writing your own new ORM does not seem like the most productive antidote to learning an existing hard-to-understand ORM....
[+] abenedic|8 years ago|reply
Over time I have gone back and forth on this. When I was much younger I liked to write my own versions of libraries and such to have more control.

I got older and realized I now had to maintain 8 different projects, including my in house versions of things. Giving up control of a project to another person made me realize I had to eventually trust someone else to implement things correctly.

Nowadays I am back to implementing my own solutions, since I am more worried about efficiency. It would be nice if the compiler could inline the library implementation WELL (it usually sucks at this with static linking, even with -ffunction-sections, etc.). I think there is a need for a language-level facility for controlling what aspects of a third-party library wind up in your final binary. Or more modular types of libraries.

I just wish design decisions made by others weren't so baked in, like datatypes used or tag bits and what not.

[+] hinkley|8 years ago|reply
The other dimension rarely mentioned in these ruminations is that in the NIH scenario, you're robbing your less senior team members of an opportunity to build transferable skills.

I'm not sure whether that doesn't rank very high for people, doesn't even occur to them to be a problem, or they just don't give a shit about anybody else.

[+] smt88|8 years ago|reply
Taking a long time to learn is, in my experience, becoming less common. Library authors are more aware of ergonomics than they were, say, 10 years ago. For every huge ORM, there's an intentionally small alternative built out of frustration.

Having a variety of high-quality, widely-used libraries to choose from is vital, though. That's why I don't use brand new stacks at work, even if they are "better" in some way.

[+] phendrenad2|8 years ago|reply
I would love to write my own version of many things. I am confident that my version would be less bloated because it would be focused on my company’s particular use case (e.g. the MVC framework doesn’t need a table of non-standard HTTP methods unless we’re actually using them, and the ORM doesn’t need to be compatible with Oracle 11g DBs unless we’re using those). It would also have less surface area for vulnerabilities. It also wouldn’t get random updates that break backward compatibility in subtle ways. But I’ll never do it, because it’s more code we have to maintain in-house. Companies naturally accrete huge, overburdened tech stacks, and anything I can do to cut back on or slow that effect is a win.
[+] lallysingh|8 years ago|reply
Yeah but the stuff you roll yourself grows over time, has its own bugs, and gets retargeted for more projects over time.
[+] DrScump|8 years ago|reply
If there are use cases even now where a roll-your-own solution is best, imagine the 1980s.

My first programming roles were in a mainframe shop with a major defense contractor, which had some brilliant mainframe system programmers over the years. Their major unclassified systems (payroll, shop order control, etc.) were internetworked with a homegrown real-time system, and they all had a home-grown disk access system (random and sequential) that was surreal in its speed and reliability, all coded in 370 assembler. On the business applications side, they had a thorough API that was callable from even COBOL programs.

By the mid-1980s, upper management decided they had to "standardize" and began developing replacement systems using IBM's IMS. Performance was unusably bad. I left around that time to join the UNIX RDBMS world, so I don't know if they ever found a solution that could actually be rolled out.

Working with the older stuff was actually fun because things just worked, and the customers of our legacy systems were really happy.

[+] randomsearch|8 years ago|reply
The major cost factor in software development is developer wages.

It’s all about how you spend the time you have. Don’t build stuff that you can get off shelf. You’re not going to write a better database, even if you think you can. You may get great performance for the current system on day one (after spending a huge amount of time and effort on developing it, perhaps with zero value created) but over the lifetime of the database you’ll incur huge costs that you probably can’t even fully foresee. The details here are not clear enough, but working from first principles (“I want to build an Evernote-like app”) I can’t imagine an experienced developer suggesting you should write a database (cache/file system).

The best software development is mundane: glue together what’s there, buy the resources you need to get sufficient performance. Switch a component if you need something to be faster, re-architect key parts, and buy the new components off the shelf too. It’s fun to build new stuff with custom algorithms you work out, but instead of that you can go home at 5pm, try building a database in your spare time and learn why you shouldn’t, and still have space to relax.

As Sam Altman wrote recently, the productivity tip that most people are astoundingly ignorant of is: choose what you do carefully.

It’s telling that this article does not begin: “we had a bottleneck that could not be solved.”

[+] kalleboo|8 years ago|reply
The story people should take away from the article isn't "Microsoft wrote their own database for their note-taking app, so should I for mine!", it's that "Microsoft launched and made wildly popular a note-taking app using standard tools, then learned their problem space including subtle requirements about the deployment environment, and THEN wrote their own database to solve real pain-points they had in practice."
[+] dman|8 years ago|reply
On the flip side - building just enough software to solve the problem at hand can sometimes run circles around assembling things using off the shelf components. This is especially true if something like speed / memory / correctness / optimality is important.
[+] yashap|8 years ago|reply
Thoroughly agree. There are very rare times where you need to write your own low level, high complexity system for a high level application, but ... they’re really, really rare nowadays. Almost always it’s developers practicing resume driven development.

I’ve never worked with anyone who tried to implement their own database, but I have worked with people who implemented their own network protocols, JS SPA frameworks and service discovery layers, and they were all really bad ideas.

It gets even worse when said person leaves the company, their system is still critically important and full of bugs, and has the inevitable little-to-no documentation. Other devs have to maintain this mess, and nobody truly understands it.

[+] rdtsc|8 years ago|reply
I have written my own database. In 3 months. Shipped it and installed it at customer sites, and it's been working great for many years now. So do write your own database, but know exactly how the kernel, libc, language libraries, and sockets work :-)
[+] projektfu|8 years ago|reply
It seems to me that Microsoft frequently writes new embedded databases, leaving them poorly documented and accessible from weird tools. For example, the registry, Windows installer, structured storage.
[+] coldacid|8 years ago|reply
Structured storage isn't embedded database tech. It's actually FAT (or close enough) inside a file, with a bunch of helper COM objects.
[+] Animats|8 years ago|reply
There are two old commonly used programs that have their own database - BIND and Sendmail. Both have problems because their database is crappy.
[+] notacoward|8 years ago|reply
The looser your requirements are, the easier it is to write your own so-called database. Nobody should be writing a full ACID/SQL/whatever database as part of another project. On the other hand, if you want to implement a "database" in the same sense that many implement a "filesystem" which is just an HTTP object layer on top of a real filesystem, knock yourself out. Just don't think you're doing the same thing as the people writing more featureful general-purpose databases.
[+] a1369209993|8 years ago|reply
If you're willing to accept that durability is a lie, to compromise (possibly a lot) on performance, and to trust that the CPU and RAM aren't actively helping the disk to sabotage you, you can make a fairly full-featured ACI database in a weekend using Merkle trees and a single-threaded rendezvous serializer. The hard part is getting those guarantees while also having latency and throughput vaguely comparable to naive file writes. (I'd add something about a better API than assembly language, but database people decided to imitate COBOL, so I can't really criticise assembly with a straight face.)
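
A toy sketch of that snapshot idea (names and structure are mine, not the commenter's design; a real version would hash a tree of nodes so each transaction touches O(log n) of the data, and the single writer stands in for the rendezvous serializer):

```python
import hashlib
import json

class MerkleKV:
    """Toy ACI(-no-D) key-value store: every state is a content-addressed
    snapshot, and a "transaction" builds a new snapshot then atomically
    swaps the root hash. Durability is deliberately out of scope."""

    def __init__(self):
        self.objects = {}  # hash -> serialized snapshot (stand-in for disk)
        self.root = self._put({})

    def _put(self, state: dict) -> str:
        blob = json.dumps(state, sort_keys=True).encode()
        h = hashlib.sha256(blob).hexdigest()
        self.objects[h] = blob
        return h

    def _get(self, h: str) -> dict:
        return json.loads(self.objects[h])

    def transact(self, updates: dict) -> str:
        # Atomicity: build the entire new snapshot first, then flip `root`.
        # A crash mid-build leaves `root` naming the old consistent state.
        state = self._get(self.root)
        state.update(updates)
        self.root = self._put(state)
        return self.root

    def read(self, key):
        return self._get(self.root).get(key)
```

Old roots remain readable for free, which is the Merkle structure earning its keep: consistency comes from never mutating a published snapshot.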
[+] amluto|8 years ago|reply
I wrote my own database once. It’s an “event sourcing” database, but I wrote it way before event sourcing was cool. It can handle over 1M events/sec on one core. I don’t know any off-the-shelf product that could replace it.

I also helped write a log-structured OLTP-ish database that sits in front of MySQL.

I have only two regrets about these systems. First, MySQL was, in retrospect, the wrong backing store. Second, I used Thrift serialization. Thrift is not so great.

[+] dwyerm|8 years ago|reply
Did using MySQL allow you to ship a product on-time and under budget? Or, to put this the other way, if you had used The Right Backing Store would you still have delivered the product in a timely manner at the right price-point?
[+] perl4ever|8 years ago|reply
There's also the (anti)pattern of making your own database in a lesser sense - implementing a set of tables on a SQL database for generic data. I think this is called the entity-attribute-value pattern. My impression is this is often considered deplorable, yet practically every application built on a database that I've encountered at work uses it to some extent.
[+] perlgeek|8 years ago|reply
As soon as you allow some kind of user-defined meta data, you basically end up with the entity-attribute-value pattern.

Every sufficiently advanced system does that, be it a CRM that allows custom fields, or a project management or ticket system, a workflow system etc. And basically all enterprise software contains elements of these systems.
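
A minimal sketch of the entity-attribute-value pattern (table and field names are invented for illustration):

```python
import sqlite3

# One generic triple table instead of a schema column per custom field.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE eav (
    entity_id INTEGER NOT NULL,
    attribute TEXT NOT NULL,
    value     TEXT,
    PRIMARY KEY (entity_id, attribute))""")

# A CRM "custom field" becomes a row, not a schema change:
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", [
    (1, "name", "Acme Corp"),
    (1, "industry", "Manufacturing"),
    (2, "name", "Globex"),
    (2, "preferred_contact_hour", "14"),  # exists for only one entity
])

# Reassembling an entity means pivoting rows back into "columns":
rows = conn.execute(
    "SELECT attribute, value FROM eav WHERE entity_id = ?", (1,)).fetchall()
record = dict(rows)
```

The deplorable part is visible too: everything is stored as text, the database can't enforce types or required fields, and every query pays for the pivot.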

[+] Walkman|8 years ago|reply
"As alluded to above, this does introduce a separate challenge with multi-file consistency. The team took an interesting approach here. The index file points to the location of the root of the page graph and free space map in each page cache file. When persisting a multi-file transaction, all page cache files are flushed to disk before the index file."

That sounds familiar: https://pthree.org/2012/12/14/zfs-administration-part-ix-cop...
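
The write ordering described in the quote can be sketched like this (a simplified illustration, not OneNote's code; function and file names are mine):

```python
import os

def commit_multi_file(page_files, index_path, index_bytes):
    """Make every page cache file durable first, then atomically publish
    the index that points into them. If we crash before the final rename,
    the old index still references the previous, fully-flushed state."""
    for path, data in page_files.items():
        with open(path, "wb") as f:  # simplified: real code appends pages
            f.write(data)
            f.flush()
            os.fsync(f.fileno())     # page data on disk BEFORE the index
    tmp = index_path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(index_bytes)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, index_path)      # atomic pointer flip
```

The resemblance to ZFS is the same flush-then-flip-the-root discipline: data blocks first, überblock last.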

[+] krylon|8 years ago|reply
I used to work at a company where the "database" was literally a raw dump of the in-memory array of structs defined in a C program. Had some fun with that because struct padding depended on compiler flags, so different varieties of the program that were built with different compiler flags could not read each other's "database".

I never got over it, really. On the plus side, it was super fast, that I cannot deny. But it always seemed super gross to me.
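
Python's struct module can demonstrate the same failure mode: native alignment ('@') inserts compiler-style padding that standard packing ('=') does not, so the "same" record gets two incompatible on-disk layouts, just like two builds compiled with different flags:

```python
import struct

# "Database" record: a one-byte flag followed by a 32-bit int.
# '@' uses the platform's native alignment (compiler-style padding);
# '=' uses standard packing with no padding.
native_size = struct.calcsize("@ci")  # typically 8: 3 padding bytes after the char
packed_size = struct.calcsize("=ci")  # always 5: 1 + 4, no padding

record = struct.pack("=ci", b"A", 1234)
flag, ident = struct.unpack("=ci", record)
# A record dumped in one layout can't be parsed with the other: the sizes
# differ, which is exactly why differently-built binaries couldn't share files.
```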

[+] kalleboo|8 years ago|reply
Lots of software used to do this, back when computers were really slow and low on RAM. The old MS Office, Adobe Photoshop formats are notorious for it.
[+] mastax|8 years ago|reply
I was not expecting this article to be about OneNote, but I'm glad that it is. OneNote is a surprisingly good product, but it takes so excruciatingly long to sync what must be 10KB of text. I look forward to what these changes enable.
[+] jchanimal|8 years ago|reply
If you’re writing your own database and you are serious about making it a business, reach out to me via my profile here. I've been there, done that; happy to help.
[+] unilynx|8 years ago|reply
We wrote our own database 15 years ago - needed something embeddable into the application on both windows and linux, i386 and alpha, and the ability to do transactional schema modifications without global locking. If I recall correctly, postgresql was at 7.3 at the time and would happily dump core if you tried something like 'drop table x;rollback'.

And we didn't need full sql semantics, because we partially implemented it into our scripting language anyway and used only cursor APIs. postgresql's mvcc was still a great inspiration for how to design the actual row storage.

Served us well, but 15 years later, a few too many custom features make it hard to switch to a standard database.

[+] progx|8 years ago|reply
If everybody never writes a database, who does it?
[+] dicroce|8 years ago|reply
I wrote a database for my app and it was the best decision we made.
[+] naspinski|8 years ago|reply
Without a comparison of how good the other decisions were, this means nothing :)