top | item 10189320

Ask HN: CTM gave CS the “kernel language”. Is there a “kernel database”?

27 points| bordercases | 10 years ago

In "Concepts, Techniques, Methods of Computer Science" (CTM), Peter Van Roy takes the approach of teaching programming not as one paradigm or another, but rather a hierarchy of features for which each new feature set or stratification, maps the whole set to a new paradigm. (He calls this abstract set of features the "kernel language".) Thus you build up something like Object Oriented Programming from simple parts, the lowest of which is Declarative/Functional Programming. Then you can negotiate the tradeoffs effectively.

Currently I'm diving into databases, and I'm nothing but confused. It seems easy to get lost in the myriad of technologies and their features. I'm having a difficult time negotiating the tradeoffs, particularly without a real, central model of what a database is or what it's design space is. Is there an equivalent to a kernel language in the database world?

The cover graphic of the MIT OCW for databases (labeled 6.830 / 6.814) seems to give away that there is a basic set of problems that databases need to solve. So I think that a "kernel description" of databases can exist. I'm wondering if anyone here has come across any.

18 comments

[+] nostrademons|10 years ago|reply

I think you're looking for a set of roughly-orthogonal concepts that together make up a "database", right? Basically, you're trying to understand the different dimensions under which database products might make different trade-offs, defining the design space? If so, take a look at these concepts:

Schema. This defines how the raw bytes of a data record are mapped into semantically-relevant pieces of data for your program. The relational model (behind SQL) is the most commonly used schema for commercial database products, but also check out things like XML databases, JSON data-stores, serialization formats like Protobuf or Thrift, etc.

https://en.wikipedia.org/wiki/Data_model

Indexing. This defines what data structures are used to quickly find records on disk. Almost all the major products use B-trees, but these are not the only options: you can also have hash indexes, sorted string tables, bitmaps, bloom filters, etc.

https://en.wikipedia.org/wiki/Database_index

Concurrency control. This defines how the database deals with multiple clients trying to modify the same piece of data at once. Options include row-level locks, table-level locks, MVCC, STM, CRDTs, etc.

https://en.wikipedia.org/wiki/Concurrency_control

The memory hierarchy. This defines where the data is stored, physically. Is it on disk, like most data-warehousing stores or distributed filesystems? Is it in memory, like memcached or Redis? Is it on Flash memory? Is there some sort of caching scheme where parts of the data are in memory and parts on disk?

https://en.wikipedia.org/wiki/Memory_hierarchy

Distribution. How is data split across multiple machines, and then how are failures in the network handled?

https://en.wikipedia.org/wiki/CAP_theorem

Hope this helps. Most data-management solutions (even ones that often aren't considered "databases", like memcached or Redis or flat CSV files or MapReduce or the Google Search indexes) can be mapped onto this space.

[+] bordercases|10 years ago|reply

Yes, this is close to exactly what I was looking for. Between this and a database management systems book (as per paperwork's suggestion) I think I would get both "a lay of the land" and "a cathedral to build on" when it comes to understanding database systems. Thank you.

[+] paperwork|10 years ago|reply

I've wondered this as well. However, I couple of notes. CTM is indeed an awesome book, but the idea of language kernel is older. I believe academic papers often start with a minimal language, like the lambda calculus, then show how it can be extended with one feature or another. If I recall correctly, I don't think CTM spends any time showing you how to implement these features using a parser/interpreter/compiler/etc.

If you are looking for a theoretical model of relational databases, you may look up relational algebra. I'm sure others will chime in about how sql is not a very good representation of relational algebra; however, if you are not aware of it, RA is certainly something to study.

I suspect that you may be referring to a kernel database, the way there are simple implementations of operating systems--written expressly to study how OSes work. I wish something like that existed for databases, but I'm not aware of it.

Finally, there are several text books on relational database management systems. I haven't used one in a while so can't recommend a specific one. However, these books often show you how to implement various parts of a database, such as b-treee for storage, indexes, how to design a pipeline to implement selection/projection/group by, etc. Make sure that the book is showing you how to implement a DBMS, not how to design a datamodel.

[+] bonobo3000|10 years ago|reply

This may be useless, but here is how I would do it. Its kind of related to the "kernel" idea in that I try to discover the heart of the problem a technology solves, what the traditional problems are and how a given technology overcomes them.

Figure out the problem something solves, try a bunch of simple solutions and see why it fails, ask questions about why things are done a certain way.

Example is much more helpful:

Q.Whats a database?

A. thing to store data

Q.why not use a simple text file?

A.well there is no way to enforce a schema that way. whats to stop me from adding a line "a,b,c,d" when the schema of the database has only 3 fields?

its probably not good for binary data either.

ok so well use some specialized format

Q.whats an index? why do the use indexes?

A.<do research - find they are B-trees that allow you to maintain an order over data>

Q. so what happens if multiple people try to write at the same time

A. we could global lock the whole table. so only one write is allowed at a time.

we could lock on the specific row.

with something like CRDTs, order of writes doesn't matter.

what if we didnt lock at all?

data will be corrupt. this is where consistency models like ACID etc. come in.

So my advice is think deeply about the problem being solved, what the difficult sub-problems are and then see what kinds of trade-offs different databases use to deal with them.

The more questions you answer, the more questions you will have, but about increasingly detailed slices of the DB/framework/etc. This is learning.

[+] shotgun|10 years ago|reply

Although tangential to your question, I feel you might enjoy presentations by Rich Hickey wherein he describes traditional database architectures and presents the whys and wherefores behind Datomic. For example:

[1] https://www.youtube.com/watch?v=Cym4TZwTCNU [2] http://www.infoq.com/presentations/datomic-functional-databa...

[+] wcarss|10 years ago|reply

This is a very naive (and perhaps unsatisfying) answer, and I haven't read CTM so I may misunderstand the real concept of the kernel language, but you may be looking for either the Relational Calculus or Algebra[1][2] or, for less math and more application, straight up database-independent SQL[3].

These pretty much describe the underlying set of the things you can do with database systems, and different databases provide different implementations of them with different tradeoffs and abstractions.

    [1] https://en.wikipedia.org/wiki/Relational_calculus
    [2] https://en.wikipedia.org/wiki/Relational_algebra
    [3] https://en.wikipedia.org/wiki/SQL

[+] cpach|10 years ago|reply

Copy of a comment I made in an earlier thread:

”If you’re not familiar with RDBMS, general introductions to SQL and relational algebra should be useful. Back in uni our textbook was Fundamentals of Database Systems[1]. (It’s okay, but there are probably better books out there.)

http://www.pearsonhighered.com/educator/product/Fundamentals...

[+] [-] frik|10 years ago|reply

BeFS (BeOS) and "Cairo"/WinNT4 NTFS (with Object extension) offered a database API on top of the filesystem. And IBM stored its system files in a database like filestorage too, in its mainframes.

[+] [-] bhntr3|10 years ago|reply

Van Roy's book is an awesome way to explain computer languages and computing itself. I'm not sure there's anything so approachable yet mind blowing for DBs.
I think Foundation of Databases, "The Alice Book", lays out the mathematical basis well. The relational algebra might be the kernel language. I think Ullman's books on databases provide more implementation details. Caveat: I haven't finished either. The Alice book is pretty dense.
I'd describe it this way myself:
A database is a persistent repository of facts about some domain.
The simplest set of facts might be a set of values. You can't do much interesting with that. This might be main memory or a hard disk.
Beyond that, the facts might be tuples (K, V) where K is some name or key and V is some unstructured value. Key value stores and filesystems are like this. You can index K and provide fast lookups
If the tuple has more structure, it might be a relation. Relations might be described as an ordered tuple of values where each value is selected from some domain. (Name, Age, HourlyWage) might be a relation where Name comes from Strings, Age comes from Ints and HourlyWage comes from Floats.
So, we might have a couple facts like: (Bob, 45, 14.50) (Jill, 19, 112.95) (Jim, 7, 1337.0)
Now, the repository of facts can provide a solver that answers questions. You define some constraints like HourlyWage > 100.0 AND Age < 10 and it finds the facts that match the constraints. It's more complicated than that and the Alice book explains the mathematical foundations well.
And the database remains a repository. You can add data to it. The repository may enforce the relations and reject facts that don't fit the domain of their values. Maybe it has uniqueness constraints on certain fields, etc. etc.
Beyond that it's mostly additional features. More complicated relational operators. Better query abstraction. Optimizations to solve queries more efficiently (query planning, etc.) Atomicity, Consistency, Isolation and Durability guarantees. Clustering and networked operations. So on and so forth in the spirit of CTM.
My intuition is that the mathematical foundation is fairly pure. So even "non-relational databases" could be viewed as implementing some non-standard form of the relational model.

[+] [-] brudgers|10 years ago|reply

Off the top of my head, I'd say the kernels of database systems are the CAP on the operational side and ACID on the design side. More abstractly, what really matters is the particular business logic [what problem are we trying to solve?] a database models not the implementation details [what code are we using?].
That said, the bigger abstraction is information storage and retrieval. For some purposes a printed book on a shelf is most appropriate, and it's helpful to think of the continuum of storage technologies as including hard copy simply because it filters out the noise of implementations: key-value stores, versus document stores, versus relational stores, versus column stores, versus CPU caches. The what and why and when come before the how and where.
For traditional relational databases, I found Database Systems: The Complete Book very informative. But my intuition is that RDMS's are increasingly less likely to be the best choice for a lot of common applications.
Good luck.

[+] [-] Animats|10 years ago|reply

I thought this was going to be about why an OS needs a database for its own data, instead of a collection of flat files without locking or atomic updates.

[+] [-] bordercases|10 years ago|reply

This is an interesting idea, I wonder if it has any merit.

[+] [-] boyaka|10 years ago|reply

Warning: I am neither a skilled nor experienced programmer. I would say the kernel of programming languages is more practically something like C, assembly, or machine language. Kernel implies containing the core functionality, the building blocks. Van Roy's kernel language does seem to contain core ideas for ideal programming languages, even though it doesn't seem to translate well to existing hardware / machine language.
After reading a bit about it I started to think of it as pseudocode for programming languages, and I thought I'd share some results from my searching based on that idea [1].
"In terms of the kernel language itself, it is in fact its own programming language, a sort of "runnable pseudocode" that is executable in the Mozart/Oz platform."
Here's an interesting link I found by searching for database pseudocode [2]. This delves into the topic of "NoSQL", which might be helpful in your database basics research, but it also covers other database basics.
I am also not very skilled nor experienced with databases, but based on your request, one database that comes to mind for me is Berkeley DB. It is the building block for lots of other technologies, and is also considered one of the origins of the NoSQL concept. Here's a recent relevant discussion that probably made me think of it [3] (that and it is the basis for a distributed filesystem I'm studying). Also, check out the Wikipedia for it, and (maybe obvious to you already) searching site:news.ycombinator.com on Google can get you a lot more interesting discussions on any topic.
Here's one more random link I found [4], just because it has a lot of summaries on different types of NoSQL databases.
Again, this is all far from expert advice, which is probably really what you need, but I thought I'd just share with you what I found in my attempt to understand this "kernel language" concept and apply it to databases.
[1] http://michaelrbernste.in/2013/02/23/notes-on-teaching-with-...
[2] http://www.jeffknupp.com/blog/2014/09/01/what-is-a-nosql-dat...
[3] https://news.ycombinator.com/item?id=10148242
[another old one] https://news.ycombinator.com/item?id=3613638
[4] http://bigdata-madesimple.com/a-deep-dive-into-nosql-a-compl...