I have spent quite a lot of time building a database in Java. It's not a small undertaking.
I've gone through as many books and papers as I can, and there is only one, one, that gives an overview and actual examples of building a system that processes SQL. Everything else is theoretical or so general that you don't get enough direction to get started.
It's "Database Design and Implementation" by Sciore. It's a little dated, but not by much. It is an underappreciated resource.
You may find MIT's Database Systems course helpful, the labs consist of building several db components in Java (and the prof is Stonebraker, of Postgres fame):
You could have read the source to the processor of one of the many open source SQL databases though (off the top of my head: MySQL, PostgreSQL, SQLite, Derby, H2, HSQLDB). Many of them have very clean code, SQLite would be my choice if you know C otherwise one of the java databases (though I've never actually read their source)
Agree! I've accidentally found this gem. It is really written from the actual hands-on engineering/programming perspective which covers all aspects, from the front-end to the actual storage implementation.
Sciore is a great teacher! I had him at BC 20 years ago for both database systems and object oriented programming. I didn't know he put out a few books, I'll have to check them out.
The Redbook looks like an awesome resource. I think it's a real shame that find resources at the intermediate level of database system design is so hard; feels very much like the "how to draw an owl" meme
Writing a database is a good exercise I could recommend to everybody.
Writing a general purpose database is very difficult. On the other hand, if you know what kind of load you can expect and when you have an idea of performance and consistency requirements it may become much simpler.
I have rolled out my own transactional database for a commercial project. This was a decade ago, the device had at most 1MB of memory budget for compiled code, stack, heap, and database storage. The database was to be stored on a non-wear-leveled flash requiring precise control of which bytes and how frequently were erased. We required ability to store small objects (tens to few hundreds of bytes at most) of varying size and be able to record changes to multiple objects at the same time, transactionally (think modifying two accounts). The requirement was that the database stayed consistent no matter what happened, regardless of any application error or power cycles. These devices were operated by hundreds of thousands of customers and power cycled liberally whenever anything didn't work.
The database run in constant memory. It allowed the application to register hooks to calculate deltas between two versions of records and to apply delta to base version to produce result version. The database didn't care how the data was encoded, this was application detail.
The database was stored as transaction log. During operation, all writes were directed to the end of the log. When the log reached half of the available memory, all live records would be coalesced and written to the other half of the space and the original space would be erased.
The read algorithmic complexity was horrible with some algorithms being cubic(!) complexity. This was absolutely ok, since there could never be enough records that would make it a problem. Having extremely simple algorithms allowed for the entire code to be very simple and compact. Everything fit about 200 lines or so lines of code.
#!/bin/bash
db_set () {
echo "$1,$2" >> database
}
db_get () {
# read database line by line in reverse, stopping at first match
tac database | grep -m 1 "^$1," | sed -e "s/^$1,//"
}
Java programmers: if you are curious about how an SQL database works, take a look at H2's source (https://github.com/h2database/h2database). You can even try to fix an outstanding issue. It is illuminating - and much easier than you would have thought.
For a while now, I've used H2 as the in-memory database for a bunch of Selenium and Spring Boot integration tests. The primary advantages are 1. speed and 2. simplicity.
Simplicity is the big thing here; I can configure Spring to launch an in-memory H2 database and not have to worry about state, clearing and reseeding, or any of that stuff. The integration tests can just run.
I've been so happy with it that I'm pushing a customer to use H2 in production. I would normally use Postgres, but my customer doesn't really have an IT staff, and an embedded database will be a lot easier for them to keep up and running.
Really glad to see this... this has often seemed to me to be a nice sort of capstone project for an undergraduate degree program. (Covers real implementation of everything from networking, datastructures, language parsing, etc.)
I tend to use postgresql in my daily work now but I will always carry a flame for sqlite, my first DB experience. I still use it for rapidly parsing and normalizing huge text files, as well as the backend for numerous one-off webscrapers.
I contend there are few faster ACID-compliant database options for high-speed input, without moving toward a complex multi-machine distributed solution. Combined with a high-speed SSD or an in-memory database, that single write-worker can often put any high-concurrency single-machine database to shame, as long as some thought is put into the code.
There is something deeply satisfying about writing a short python script and then watching a 5gb text file or a huge directory of log files get chewed up and spit out into a tidy little perfectly normalized and portable database. Bliss.
Just between you and me, sometimes I download a hundred thousand files to process just for fun. It's not right.
Not sure, I've watched several lectures and I've got an impression that they are too theoretic. As I remember they even don't implement the database system, but some very specific part in some existing system.
The good news is that an OOP language is a superset of imperative languages like C.
The more general answer is that all Turing-complete languages are functionally equivalent. The misery level varies, as does the applicability to any given project, but anything you can do in C you can also do in Java, Python, Lisp, Erlang, Forth, ad nauseum.
I love this kind of stuff since I started my own database from a different angle. Namely, I code generate everything do all indexing in memory. It's been fun, and rock solid for single host monolith web app.
Of course[1], but this tutorial isn't about the language, per se, you should be able to translate it pretty trivially.
These sorts of tutorials are more about the inner workings of database operation; I recall doing something like this in grad school but you're reminded quickly how more complex databases are than you think about when all you're doing is using them. And hell, I'm not even that good at using them.
[+] [-] ccleve|7 years ago|reply
I've gone through as many books and papers as I can, and there is only one, one, that gives an overview and actual examples of building a system that processes SQL. Everything else is theoretical or so general that you don't get enough direction to get started.
It's "Database Design and Implementation" by Sciore. It's a little dated, but not by much. It is an underappreciated resource.
[+] [-] charlysl|7 years ago|reply
https://ocw.mit.edu/courses/electrical-engineering-and-compu...
https://youtu.be/F3XGUPll6Qs
http://blog.marcua.net/post/117671929
[+] [-] fgonzag|7 years ago|reply
[+] [-] zerr|7 years ago|reply
[+] [-] billxreynolds|7 years ago|reply
[+] [-] HappyJoy|7 years ago|reply
[+] [-] ignoramous|7 years ago|reply
Also, see Redbook for the latest ensemble of research ideas in databases. http://www.redbook.io
[+] [-] alexchamberlain|7 years ago|reply
[+] [-] barbecue_sauce|7 years ago|reply
[+] [-] tyingq|7 years ago|reply
Virtual tables: https://www.sqlite.org/draft/vtab.html
A simple example of using VFS to add a useful feature: https://github.com/Metalnem/sqlite-vfsdemo
[+] [-] charlysl|7 years ago|reply
https://youtu.be/vyVGm_2iFwU
https://15445.courses.cs.cmu.edu/fall2018/
https://github.com/Nov11/15-445
[+] [-] titanix2|7 years ago|reply
[+] [-] lmilcin|7 years ago|reply
Writing a general purpose database is very difficult. On the other hand, if you know what kind of load you can expect and when you have an idea of performance and consistency requirements it may become much simpler.
I have rolled out my own transactional database for a commercial project. This was a decade ago, the device had at most 1MB of memory budget for compiled code, stack, heap, and database storage. The database was to be stored on a non-wear-leveled flash requiring precise control of which bytes and how frequently were erased. We required ability to store small objects (tens to few hundreds of bytes at most) of varying size and be able to record changes to multiple objects at the same time, transactionally (think modifying two accounts). The requirement was that the database stayed consistent no matter what happened, regardless of any application error or power cycles. These devices were operated by hundreds of thousands of customers and power cycled liberally whenever anything didn't work.
The database run in constant memory. It allowed the application to register hooks to calculate deltas between two versions of records and to apply delta to base version to produce result version. The database didn't care how the data was encoded, this was application detail.
The database was stored as transaction log. During operation, all writes were directed to the end of the log. When the log reached half of the available memory, all live records would be coalesced and written to the other half of the space and the original space would be erased.
The read algorithmic complexity was horrible with some algorithms being cubic(!) complexity. This was absolutely ok, since there could never be enough records that would make it a problem. Having extremely simple algorithms allowed for the entire code to be very simple and compact. Everything fit about 200 lines or so lines of code.
I really enjoyed the project.
[+] [-] fizwhiz|7 years ago|reply
[+] [-] chirau|7 years ago|reply
[+] [-] pmccarren|7 years ago|reply
[+] [-] stevoski|7 years ago|reply
Java programmers: if you are curious about how an SQL database works, take a look at H2's source (https://github.com/h2database/h2database). You can even try to fix an outstanding issue. It is illuminating - and much easier than you would have thought.
[+] [-] thomascgalvin|7 years ago|reply
Simplicity is the big thing here; I can configure Spring to launch an in-memory H2 database and not have to worry about state, clearing and reseeding, or any of that stuff. The integration tests can just run.
I've been so happy with it that I'm pushing a customer to use H2 in production. I would normally use Postgres, but my customer doesn't really have an IT staff, and an embedded database will be a lot easier for them to keep up and running.
[+] [-] bozoUser|7 years ago|reply
[+] [-] mschaef|7 years ago|reply
[+] [-] MuffinFlavored|7 years ago|reply
It'd be cool if somebody benchmarked this to show just how "fast" SQLite is comparatively, even if they are by no means equal functionality wise.
[+] [-] bshipp|7 years ago|reply
I contend there are few faster ACID-compliant database options for high-speed input, without moving toward a complex multi-machine distributed solution. Combined with a high-speed SSD or an in-memory database, that single write-worker can often put any high-concurrency single-machine database to shame, as long as some thought is put into the code.
There is something deeply satisfying about writing a short python script and then watching a 5gb text file or a huge directory of log files get chewed up and spit out into a tidy little perfectly normalized and portable database. Bliss.
Just between you and me, sometimes I download a hundred thousand files to process just for fun. It's not right.
[+] [-] blattimwind|7 years ago|reply
> The next step should be splitting internal nodes. Until then!
[+] [-] alecco|7 years ago|reply
[+] [-] zerr|7 years ago|reply
[+] [-] eunos|7 years ago|reply
[+] [-] bunderbunder|7 years ago|reply
[+] [-] p4bl0|7 years ago|reply
[+] [-] jimbokun|7 years ago|reply
[+] [-] macintux|7 years ago|reply
The more general answer is that all Turing-complete languages are functionally equivalent. The misery level varies, as does the applicability to any given project, but anything you can do in C you can also do in Java, Python, Lisp, Erlang, Forth, ad nauseum.
[+] [-] mathgladiator|7 years ago|reply
[+] [-] gigatexal|7 years ago|reply
[+] [-] kakaoscott|7 years ago|reply
[+] [-] parthdesai|7 years ago|reply
[+] [-] nkozyra|7 years ago|reply
These sorts of tutorials are more about the inner workings of database operation; I recall doing something like this in grad school but you're reminded quickly how more complex databases are than you think about when all you're doing is using them. And hell, I'm not even that good at using them.
[1] http://nikhilism.com/post/2016/writing-simple-database-in-ru...
[+] [-] MehdiHK|7 years ago|reply
[+] [-] cryptokernel|7 years ago|reply
[deleted]
[+] [-] stickmanpagemov|7 years ago|reply
[deleted]