Let’s Build a Simple Database (2017)

[+] ccleve|7 years ago|reply

I have spent quite a lot of time building a database in Java. It's not a small undertaking.

I've gone through as many books and papers as I can, and there is only one, one, that gives an overview and actual examples of building a system that processes SQL. Everything else is theoretical or so general that you don't get enough direction to get started.

It's "Database Design and Implementation" by Sciore. It's a little dated, but not by much. It is an underappreciated resource.

[+] charlysl|7 years ago|reply

You may find MIT's Database Systems course helpful, the labs consist of building several db components in Java (and the prof is Stonebraker, of Postgres fame):

https://ocw.mit.edu/courses/electrical-engineering-and-compu...

https://youtu.be/F3XGUPll6Qs

http://blog.marcua.net/post/117671929

[+] fgonzag|7 years ago|reply

You could have read the source to the processor of one of the many open source SQL databases though (off the top of my head: MySQL, PostgreSQL, SQLite, Derby, H2, HSQLDB). Many of them have very clean code, SQLite would be my choice if you know C otherwise one of the java databases (though I've never actually read their source)

[+] zerr|7 years ago|reply

Agree! I've accidentally found this gem. It is really written from the actual hands-on engineering/programming perspective which covers all aspects, from the front-end to the actual storage implementation.

[+] billxreynolds|7 years ago|reply

Sciore is a great teacher! I had him at BC 20 years ago for both database systems and object oriented programming. I didn't know he put out a few books, I'll have to check them out.

[+] HappyJoy|7 years ago|reply

I've been looking for this type of book. Thanks!

[+] ignoramous|7 years ago|reply

Folks interested in this sort of thing might like Anatomy of a Database System by Stonebreaker et al, which I think is the most concise treatment on the subject: http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf

Also, see Redbook for the latest ensemble of research ideas in databases. http://www.redbook.io

[+] alexchamberlain|7 years ago|reply

The Redbook looks like an awesome resource. I think it's a real shame that find resources at the intermediate level of database system design is so hard; feels very much like the "how to draw an owl" meme

[+] barbecue_sauce|7 years ago|reply

Also, Andy Pavlo's CMU Database Group courses on Youtube.

[+] tyingq|7 years ago|reply

Another exercise that's fun is playing with Sqlite virtual tables and/or the Sqlite VFS layer.

Virtual tables: https://www.sqlite.org/draft/vtab.html

A simple example of using VFS to add a useful feature: https://github.com/Metalnem/sqlite-vfsdemo

[+] charlysl|7 years ago|reply

Pavlo's CMU intro db course's labs are actually based on sqlite:

https://youtu.be/vyVGm_2iFwU

https://15445.courses.cs.cmu.edu/fall2018/

https://github.com/Nov11/15-445

[+] titanix2|7 years ago|reply

Thanks a lot. I planned to implement virtual SQL tables for a project and wasn’t aware of existing product. This probably will save me a lot of time.

[+] lmilcin|7 years ago|reply

Writing a database is a good exercise I could recommend to everybody.

Writing a general purpose database is very difficult. On the other hand, if you know what kind of load you can expect and when you have an idea of performance and consistency requirements it may become much simpler.

I have rolled out my own transactional database for a commercial project. This was a decade ago, the device had at most 1MB of memory budget for compiled code, stack, heap, and database storage. The database was to be stored on a non-wear-leveled flash requiring precise control of which bytes and how frequently were erased. We required ability to store small objects (tens to few hundreds of bytes at most) of varying size and be able to record changes to multiple objects at the same time, transactionally (think modifying two accounts). The requirement was that the database stayed consistent no matter what happened, regardless of any application error or power cycles. These devices were operated by hundreds of thousands of customers and power cycled liberally whenever anything didn't work.

The database run in constant memory. It allowed the application to register hooks to calculate deltas between two versions of records and to apply delta to base version to produce result version. The database didn't care how the data was encoded, this was application detail.

The database was stored as transaction log. During operation, all writes were directed to the end of the log. When the log reached half of the available memory, all live records would be coalesced and written to the other half of the space and the original space would be erased.

The read algorithmic complexity was horrible with some algorithms being cubic(!) complexity. This was absolutely ok, since there could never be enough records that would make it a problem. Having extremely simple algorithms allowed for the entire code to be very simple and compact. Everything fit about 200 lines or so lines of code.

I really enjoyed the project.

[+] fizwhiz|7 years ago|reply

How long did it take to implement, and how large was the team?

[+] chirau|7 years ago|reply

I believe the simplest database is this:

    #!/bin/bash
    db_set () {
         echo "$1,$2" >> database
    }
    db_get () {
         grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
    }

[+] pmccarren|7 years ago|reply

Let's optimize!

    #!/bin/bash
    db_set () {
         echo "$1,$2" >> database
    }
    db_get () {
         # read database line by line in reverse, stopping at first match
         tac database | grep -m 1 "^$1," | sed -e "s/^$1,//"
    }

[+] stevoski|7 years ago|reply

Awesome stuff.

Java programmers: if you are curious about how an SQL database works, take a look at H2's source (https://github.com/h2database/h2database). You can even try to fix an outstanding issue. It is illuminating - and much easier than you would have thought.

[+] thomascgalvin|7 years ago|reply

For a while now, I've used H2 as the in-memory database for a bunch of Selenium and Spring Boot integration tests. The primary advantages are 1. speed and 2. simplicity.

Simplicity is the big thing here; I can configure Spring to launch an in-memory H2 database and not have to worry about state, clearing and reseeding, or any of that stuff. The integration tests can just run.

I've been so happy with it that I'm pushing a customer to use H2 in production. I would normally use Postgres, but my customer doesn't really have an IT staff, and an embedded database will be a lot easier for them to keep up and running.

[+] bozoUser|7 years ago|reply

Hey @stevoski can you recommend one of the very basic issues to look at for someone to ramp up on the codebase ? Thanks

[+] mschaef|7 years ago|reply

Really glad to see this... this has often seemed to me to be a nice sort of capstone project for an undergraduate degree program. (Covers real implementation of everything from networking, datastructures, language parsing, etc.)

[+] MuffinFlavored|7 years ago|reply

I can't tell if this is "finished" or not.

It'd be cool if somebody benchmarked this to show just how "fast" SQLite is comparatively, even if they are by no means equal functionality wise.

[+] bshipp|7 years ago|reply

I tend to use postgresql in my daily work now but I will always carry a flame for sqlite, my first DB experience. I still use it for rapidly parsing and normalizing huge text files, as well as the backend for numerous one-off webscrapers.

I contend there are few faster ACID-compliant database options for high-speed input, without moving toward a complex multi-machine distributed solution. Combined with a high-speed SSD or an in-memory database, that single write-worker can often put any high-concurrency single-machine database to shame, as long as some thought is put into the code.

There is something deeply satisfying about writing a short python script and then watching a 5gb text file or a huge directory of log files get chewed up and spit out into a tidy little perfectly normalized and portable database. Bliss.

Just between you and me, sometimes I download a hundred thousand files to process just for fun. It's not right.

[+] blattimwind|7 years ago|reply

The end of the last published chapter:

> The next step should be splitting internal nodes. Until then!

[+] alecco|7 years ago|reply

CMU Andy Pavlo’s lectures on YouTube are great and also shows state of the art. Start there.

[+] zerr|7 years ago|reply

Not sure, I've watched several lectures and I've got an impression that they are too theoretic. As I remember they even don't implement the database system, but some very specific part in some existing system.

[+] eunos|7 years ago|reply

Very interesting! However I wonder whether it's possible to do it on OOP languages such as Java.

[+] bunderbunder|7 years ago|reply

Why wouldn't it be?

[+] p4bl0|7 years ago|reply

It's already almost the case. Juste rename structure X in class X and make every X_xxx functions a xxx method of class X and you're almost set.

[+] jimbokun|7 years ago|reply

https://news.ycombinator.com/item?id=19582150

[+] macintux|7 years ago|reply

The good news is that an OOP language is a superset of imperative languages like C.

The more general answer is that all Turing-complete languages are functionally equivalent. The misery level varies, as does the applicability to any given project, but anything you can do in C you can also do in Java, Python, Lisp, Erlang, Forth, ad nauseum.

[+] mathgladiator|7 years ago|reply

I love this kind of stuff since I started my own database from a different angle. Namely, I code generate everything do all indexing in memory. It's been fun, and rock solid for single host monolith web app.

[+] gigatexal|7 years ago|reply

This is so cool! I’ve been wanting to do just this!

[+] kakaoscott|7 years ago|reply

Hey thanks a lot for making an education program for a simpleton like myself.

[+] parthdesai|7 years ago|reply

does anyone know if there is a similar tutorial in Rust?

[+] nkozyra|7 years ago|reply

Of course[1], but this tutorial isn't about the language, per se, you should be able to translate it pretty trivially.

These sorts of tutorials are more about the inner workings of database operation; I recall doing something like this in grad school but you're reminded quickly how more complex databases are than you think about when all you're doing is using them. And hell, I'm not even that good at using them.

[1] http://nikhilism.com/post/2016/writing-simple-database-in-ru...

[+] MehdiHK|7 years ago|reply

And anything in Go?

[+] cryptokernel|7 years ago|reply

[deleted]

[+] stickmanpagemov|7 years ago|reply

[deleted]

88 comments