Ask HN: CTM gave CS the “kernel language”. Is there a “kernel database”?
27 points| bordercases | 10 years ago
Currently I'm diving into databases, and I'm nothing but confused. It seems easy to get lost in the myriad of technologies and their features. I'm having a difficult time negotiating the tradeoffs, particularly without a real, central model of what a database is or what it's design space is. Is there an equivalent to a kernel language in the database world?
The cover graphic of the MIT OCW for databases (labeled 6.830 / 6.814) seems to give away that there is a basic set of problems that databases need to solve. So I think that a "kernel description" of databases can exist. I'm wondering if anyone here has come across any.
[+] [-] nostrademons|10 years ago|reply
Schema. This defines how the raw bytes of a data record are mapped into semantically-relevant pieces of data for your program. The relational model (behind SQL) is the most commonly used schema for commercial database products, but also check out things like XML databases, JSON data-stores, serialization formats like Protobuf or Thrift, etc.
https://en.wikipedia.org/wiki/Data_model
Indexing. This defines what data structures are used to quickly find records on disk. Almost all the major products use B-trees, but these are not the only options: you can also have hash indexes, sorted string tables, bitmaps, bloom filters, etc.
https://en.wikipedia.org/wiki/Database_index
Concurrency control. This defines how the database deals with multiple clients trying to modify the same piece of data at once. Options include row-level locks, table-level locks, MVCC, STM, CRDTs, etc.
https://en.wikipedia.org/wiki/Concurrency_control
The memory hierarchy. This defines where the data is stored, physically. Is it on disk, like most data-warehousing stores or distributed filesystems? Is it in memory, like memcached or Redis? Is it on Flash memory? Is there some sort of caching scheme where parts of the data are in memory and parts on disk?
https://en.wikipedia.org/wiki/Memory_hierarchy
Distribution. How is data split across multiple machines, and then how are failures in the network handled?
https://en.wikipedia.org/wiki/CAP_theorem
Hope this helps. Most data-management solutions (even ones that often aren't considered "databases", like memcached or Redis or flat CSV files or MapReduce or the Google Search indexes) can be mapped onto this space.
[+] [-] bordercases|10 years ago|reply
[+] [-] paperwork|10 years ago|reply
If you are looking for a theoretical model of relational databases, you may look up relational algebra. I'm sure others will chime in about how sql is not a very good representation of relational algebra; however, if you are not aware of it, RA is certainly something to study.
I suspect that you may be referring to a kernel database, the way there are simple implementations of operating systems--written expressly to study how OSes work. I wish something like that existed for databases, but I'm not aware of it.
Finally, there are several text books on relational database management systems. I haven't used one in a while so can't recommend a specific one. However, these books often show you how to implement various parts of a database, such as b-treee for storage, indexes, how to design a pipeline to implement selection/projection/group by, etc. Make sure that the book is showing you how to implement a DBMS, not how to design a datamodel.
[+] [-] bonobo3000|10 years ago|reply
Figure out the problem something solves, try a bunch of simple solutions and see why it fails, ask questions about why things are done a certain way.
Example is much more helpful:
Q.Whats a database?
A. thing to store data
Q.why not use a simple text file?
A.well there is no way to enforce a schema that way. whats to stop me from adding a line "a,b,c,d" when the schema of the database has only 3 fields?
its probably not good for binary data either.
ok so well use some specialized format
Q.whats an index? why do the use indexes?
A.<do research - find they are B-trees that allow you to maintain an order over data>
Q. so what happens if multiple people try to write at the same time
A. we could global lock the whole table. so only one write is allowed at a time.
we could lock on the specific row.
with something like CRDTs, order of writes doesn't matter.
what if we didnt lock at all?
data will be corrupt. this is where consistency models like ACID etc. come in.
So my advice is think deeply about the problem being solved, what the difficult sub-problems are and then see what kinds of trade-offs different databases use to deal with them.
The more questions you answer, the more questions you will have, but about increasingly detailed slices of the DB/framework/etc. This is learning.
[+] [-] shotgun|10 years ago|reply
[1] https://www.youtube.com/watch?v=Cym4TZwTCNU [2] http://www.infoq.com/presentations/datomic-functional-databa...
[+] [-] wcarss|10 years ago|reply
These pretty much describe the underlying set of the things you can do with database systems, and different databases provide different implementations of them with different tradeoffs and abstractions.
[+] [-] cpach|10 years ago|reply
”If you’re not familiar with RDBMS, general introductions to SQL and relational algebra should be useful. Back in uni our textbook was Fundamentals of Database Systems[1]. (It’s okay, but there are probably better books out there.)
http://www.pearsonhighered.com/educator/product/Fundamentals...
[+] [-] frik|10 years ago|reply
[+] [-] bhntr3|10 years ago|reply
I think Foundation of Databases, "The Alice Book", lays out the mathematical basis well. The relational algebra might be the kernel language. I think Ullman's books on databases provide more implementation details. Caveat: I haven't finished either. The Alice book is pretty dense.
I'd describe it this way myself:
A database is a persistent repository of facts about some domain.
The simplest set of facts might be a set of values. You can't do much interesting with that. This might be main memory or a hard disk.
Beyond that, the facts might be tuples (K, V) where K is some name or key and V is some unstructured value. Key value stores and filesystems are like this. You can index K and provide fast lookups
If the tuple has more structure, it might be a relation. Relations might be described as an ordered tuple of values where each value is selected from some domain. (Name, Age, HourlyWage) might be a relation where Name comes from Strings, Age comes from Ints and HourlyWage comes from Floats.
So, we might have a couple facts like: (Bob, 45, 14.50) (Jill, 19, 112.95) (Jim, 7, 1337.0)
Now, the repository of facts can provide a solver that answers questions. You define some constraints like HourlyWage > 100.0 AND Age < 10 and it finds the facts that match the constraints. It's more complicated than that and the Alice book explains the mathematical foundations well.
And the database remains a repository. You can add data to it. The repository may enforce the relations and reject facts that don't fit the domain of their values. Maybe it has uniqueness constraints on certain fields, etc. etc.
Beyond that it's mostly additional features. More complicated relational operators. Better query abstraction. Optimizations to solve queries more efficiently (query planning, etc.) Atomicity, Consistency, Isolation and Durability guarantees. Clustering and networked operations. So on and so forth in the spirit of CTM.
My intuition is that the mathematical foundation is fairly pure. So even "non-relational databases" could be viewed as implementing some non-standard form of the relational model.
[+] [-] brudgers|10 years ago|reply
That said, the bigger abstraction is information storage and retrieval. For some purposes a printed book on a shelf is most appropriate, and it's helpful to think of the continuum of storage technologies as including hard copy simply because it filters out the noise of implementations: key-value stores, versus document stores, versus relational stores, versus column stores, versus CPU caches. The what and why and when come before the how and where.
For traditional relational databases, I found Database Systems: The Complete Book very informative. But my intuition is that RDMS's are increasingly less likely to be the best choice for a lot of common applications.
Good luck.
[+] [-] Animats|10 years ago|reply
[+] [-] bordercases|10 years ago|reply
[+] [-] boyaka|10 years ago|reply
After reading a bit about it I started to think of it as pseudocode for programming languages, and I thought I'd share some results from my searching based on that idea [1].
"In terms of the kernel language itself, it is in fact its own programming language, a sort of "runnable pseudocode" that is executable in the Mozart/Oz platform."
Here's an interesting link I found by searching for database pseudocode [2]. This delves into the topic of "NoSQL", which might be helpful in your database basics research, but it also covers other database basics.
I am also not very skilled nor experienced with databases, but based on your request, one database that comes to mind for me is Berkeley DB. It is the building block for lots of other technologies, and is also considered one of the origins of the NoSQL concept. Here's a recent relevant discussion that probably made me think of it [3] (that and it is the basis for a distributed filesystem I'm studying). Also, check out the Wikipedia for it, and (maybe obvious to you already) searching site:news.ycombinator.com on Google can get you a lot more interesting discussions on any topic.
Here's one more random link I found [4], just because it has a lot of summaries on different types of NoSQL databases.
Again, this is all far from expert advice, which is probably really what you need, but I thought I'd just share with you what I found in my attempt to understand this "kernel language" concept and apply it to databases.
[1] http://michaelrbernste.in/2013/02/23/notes-on-teaching-with-...
[2] http://www.jeffknupp.com/blog/2014/09/01/what-is-a-nosql-dat...
[3] https://news.ycombinator.com/item?id=10148242
[another old one] https://news.ycombinator.com/item?id=3613638
[4] http://bigdata-madesimple.com/a-deep-dive-into-nosql-a-compl...