> Relational databases get their name from the fact that relationships can be defined between tables.
This is a widespread misconception. Relational databases get their name from relations in the mathematical sense[1], i.e. sets of tuples containing facts. The basic idea of the relational model is that logical predicates can be used to query data flexibly without having to change the underlying data structures.
The basic paper by Codd[2] is really worth reading and describes, among other things, the problems of hierarchical and network databases that the relational model is meant to solve.
[1] https://en.wikipedia.org/wiki/Finitary_relation
[2] https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
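To make the mathematical sense concrete: a relation is just a set of tuples, and a query is a logical predicate over them. A toy Python sketch (illustrative only, not how any real RDBMS stores data; the names are invented):

```python
# A relation in Codd's sense: a set of tuples of facts.
employees = {
    ("alice", "engineering", 100_000),
    ("bob", "engineering", 90_000),
    ("carol", "sales", 95_000),
}

# A query is a predicate applied to the relation; the underlying
# data structure never has to change to support a new question.
def select(relation, predicate):
    return {t for t in relation if predicate(t)}

engineers = select(employees, lambda t: t[1] == "engineering")
```

Note that nothing about "relationships between tables" appears here; the name comes from the relation itself.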
> Relational databases get their name from the fact that relationships can be defined between tables.
Relational databases get their name from the mathematical concept of a relation, used by the Relational Model, "an approach to managing data using a structure and language consistent with first-order predicate logic, first described in 1969 by English computer scientist Edgar F. Codd, where all data is represented in terms of tuples, grouped into relations." [1][2] (emphasis added)
Recommended Reading: Database In Depth, by Chris Date.
I did a MOOC on relational algebra that made me much more productive in SQL and gave me a better appreciation of what RDBMSes really offer. Understanding relational algebra helps demystify the magic of query planners and makes it easier to grok why they both add and reduce latency depending on the use case.
Codd's 1979 paper "Extending the Relational Model" [1] is really interesting, especially the second half. The first half is about nulls and outer joins, and I think that steals everyone's attention. But the second half basically gives a way to turn your RDBMS into a graph database by (among other things) letting you query the system catalog and dynamically construct your query based on the results. This would never work with today's parse-plan-optimize-execute pipelines, but it's a really cool idea, and I've certainly often wished for something like it. I'd love to know if anyone has followed up on these ideas, either in scholarship or in built tools.
[1] https://gertjans.home.xs4all.nl/usenet/microsoft.public.sqls...
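Codd's proposal predates today's systems, but the flavor of "query the catalog, then construct the next query from the result" can be sketched with SQLite's `sqlite_master` (a loose modern analogy for illustration, not what the paper describes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (name TEXT, owner TEXT)")
conn.execute("INSERT INTO pets VALUES ('rex', 'alice')")

# Step 1: query the system catalog itself.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# Step 2: dynamically construct the next query from the result.
for table in tables:
    cols = [r[1] for r in conn.execute(f"PRAGMA table_info({table})")]
    rows = conn.execute(f"SELECT {', '.join(cols)} FROM {table}").fetchall()
```

As the comment notes, doing this inside a single query would fight today's parse-plan-optimize-execute pipelines, which is why it has to be done in two round trips here.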
The document completely overlooks columnar databases, which are focused on analytics and are much faster than most (though not all) general-purpose DBMSes. See:
https://en.wikipedia.org/wiki/Column-oriented_DBMS
and
https://www.slideshare.net/arangodb/introduction-to-column-o...
or get:
http://www.nowpublishers.com/article/Details/DBS-024
Examples:
* MonetDB
* SAP Hana
* Actian Vector (formerly Vectorwise)
* Oracle In-Memory
A database being columnar is more of an implementation detail regarding the underlying storage mechanism than it is a type of database. It has to do with how the data is physically stored, i.e. by putting a single column's values from many rows adjacent to one another on disk/in memory.
Columnar compression is a really interesting engineering problem
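The storage difference is easy to picture with a toy sketch (real columnar engines add compression, vectorized execution, etc. on top of this layout):

```python
# Row-oriented: the values of one record are adjacent in storage.
rows = [
    ("2019-01-01", "us", 3),
    ("2019-01-02", "us", 5),
    ("2019-01-02", "de", 2),
]

# Column-oriented: the values of one column are adjacent, so an
# analytic scan like SUM(clicks) touches a single contiguous array.
columns = {
    "date":    ["2019-01-01", "2019-01-02", "2019-01-02"],
    "country": ["us", "us", "de"],
    "clicks":  [3, 5, 2],
}

total_row_store = sum(r[2] for r in rows)   # must walk every record
total_col_store = sum(columns["clicks"])    # reads one column only
```

Same data, same answer; only the physical adjacency (and therefore the I/O pattern) differs.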
The discussion of graph DBs completely misses the semantic RDF graph approach and how greatly it differs from the property graph approach (which is discussed). The important point is not needing a custom schema for each application that can't communicate with any other app; instead, standard ontologies provide relationships and classes that are widely known and allow interoperability between systems (Linked Data Platform - Solid).
Do you know of any successful semantic RDF graph databases, I guess with OWL support? Because I personally don't. If not, it is probably rightfully too much of an academic niche to be discussed in the article.
This article comes from the team at Prisma, who are doing some really cool work building "database schema management and design with code" tools. They're working on a new version of their library right now (Prisma 2) and are regularly giving updates to the community and providing test versions.
Most everything they make is open source and really well designed. Would recommend checking it out!
Nikolas from the Prisma team here! Thanks a lot for the endorsement, we're indeed super excited about the current database space and the tooling we see emerging.
For anyone that wants to check out what we're up to, you can find everything you need in this repo: https://github.com/prisma/prisma2
I am curious about Prisma 2 because, as a novice to graph systems, I tried to build a server-side API with v1 and it became an unwieldy nightmare. That's partially my fault for wanting to do it without the SaaS they provide, but building with something that complicated, plus Apollo on the frontend with a skilled FE dev, got me so confused that I put it off.
Just a quick remark on graph DBs: Titan, which is mentioned in the article as an example of a graph DB, is dead. Its successor is JanusGraph (https://github.com/JanusGraph/janusgraph).
https://www.datastax.com/products/datastax-graph
https://venturebeat.com/2015/02/03/datastax-acquires-aureliu...
I am surprised Dgraph isn't mentioned as an example. It is the most-starred graph DB on GitHub, and I think it is the best one in terms of performance and scalability.
I'd like to see "dynamic relational" implemented. It's conceptually very similar to existing RDBMSes and can use SQL (with some minor variations, such as more explicit comparisons). You don't have to throw away your RDBMS experience and start over.
And you can incrementally "lock it down" so that you get RDBMS-like protections as projects mature. For example, you may add required-field constraints (non-blank) and type constraints (must be parsable as a number, for instance). Thus, it's good for prototyping and for gradually migrating to production. It may not be as fast as an RDBMS for large datasets, though. But that's often the price of dynamism. (A fancy version could allow migrating or integrating tables to/with a static system, but let's walk before we run.)
https://stackoverflow.com/questions/66385/dynamic-database-s...
Some smaller university out there can make a name for themselves by implementing it. I've been kicking around doing it myself, but I'd have to retire first.
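A minimal sketch of the "incremental lock-down" idea described above (a hypothetical design with invented names, not any real product): start with no schema at all, then attach constraints as the project matures, validating existing data as each one is added.

```python
class DynamicTable:
    """A table that starts schemaless and gets locked down over time."""

    def __init__(self):
        self.rows = []
        self.constraints = []  # (name, check) pairs, added incrementally

    def add_constraint(self, name, check):
        # New constraints must hold for data already in the table.
        for i, row in enumerate(self.rows):
            if not check(row):
                raise ValueError(f"existing row {i} violates {name!r}")
        self.constraints.append((name, check))

    def insert(self, row):
        for name, check in self.constraints:
            if not check(row):
                raise ValueError(f"row violates {name!r}")
        self.rows.append(row)

t = DynamicTable()
t.insert({"name": "ada"})  # anything goes at first
t.add_constraint("name required", lambda r: r.get("name", "") != "")
```

After the constraint is added, inserting a row with a blank name raises an error, which is the RDBMS-like protection arriving late instead of up front.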
I've done this (commercially!) with PostgreSQL - just start with a single table, with one JSON field, and as you want performance, integrity, etc, add expression indexes, break out frequently used expressions into columns etc. On large tables, obviously there's a cost for this reorganization but you can partition the data first, and only reorg the most recent data (e.g. range partitioning by time).
https://www.google.com/search?q=expression+index+postgres
https://www.postgresql.org/docs/10/ddl-partitioning.html
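The same pattern in miniature, using SQLite instead of the PostgreSQL setup described above (assumes a JSON-enabled SQLite build; Postgres syntax differs slightly, e.g. `jsonb` and `->>`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (doc TEXT)")  # one JSON field
conn.execute("""INSERT INTO events VALUES
    ('{"user": "alice", "action": "login"}'),
    ('{"user": "bob", "action": "logout"}')""")

# Later, when a lookup gets hot, add an expression index...
conn.execute("""CREATE INDEX events_user
                ON events (json_extract(doc, '$.user'))""")

# ...and queries on that same expression can use it.
row = conn.execute("""SELECT doc FROM events
                      WHERE json_extract(doc, '$.user') = 'alice'
                   """).fetchone()
```

The next step in the comment's progression would be breaking `user` out into a real column once it's queried constantly.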
The description of flat-file databases seems too restrictive. In my experience, flat files with fixed record lengths and no delimiters were far more common than variable-length delimited formats like CSV. File sizes were often much larger than the computer's memory, so random read and write access was necessary.
Yes. The origin of the flat-file database is fixed-format unit record equipment¹, predating computers. COBOL is essentially a language designed for processing fixed-format files.
¹ https://en.wikipedia.org/wiki/Unit_record_equipment
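The random-access property mentioned above falls straight out of the fixed record length: record N starts at byte N × length, so you can seek to it without scanning. A sketch (field widths invented for illustration):

```python
# Fixed-format records, COBOL-style: no delimiters, every field at a
# known offset. name: 10 bytes, dept: 6 bytes, salary: 4 bytes.
RECORD_LEN = 20

data = (b"JONES     SALES 0100"
        b"SMITH     ENG   0200")

def read_record(buf, n):
    # Record n lives at a computable offset -- no scanning required,
    # which is what made these files workable when they were far
    # larger than memory.
    rec = buf[n * RECORD_LEN:(n + 1) * RECORD_LEN]
    return (rec[0:10].decode().rstrip(),   # name, space-padded
            rec[10:16].decode().rstrip(),  # dept, space-padded
            int(rec[16:20]))               # salary, zero-padded
```

On a real file you would `seek(n * RECORD_LEN)` instead of slicing a bytes object, and writing a record back is the same arithmetic in reverse.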
Time-series is more of a specific use case, for data that has a primary time component (like sensor metrics). You can store it in any database, although the common choices are usually some sort of key/value or relational store with specific features for time-based queries.
HBase/Bigtable/DynamoDB/Cassandra are key/value. InfluxDB is key/value. Timescale is an extension to Postgres.
Good point. I think it should be mentioned, because those DBs are special in that they are optimized for analyzing data changes over time. They store timestamps efficiently and often allow filtering and aggregation on tags and over time intervals. It is not common to query an RDBMS for a chart of 10-minute averages of latency, histograms, quantiles, etc., or to do things like downsampling from 1-second intervals to 1-minute intervals. Great examples: Prometheus, Graphite, InfluxDB
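Downsampling is mostly bucketing by time interval; a toy version of the 1-second-to-1-minute case (real TSDBs do this incrementally and at scale, with compression on top):

```python
from collections import defaultdict

# (unix_ts, latency) pairs: one sample per second for three minutes.
samples = [(t, float(t % 7)) for t in range(0, 180)]

def downsample(points, interval):
    """Average samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % interval].append(value)  # snap to bucket start
    return {b: sum(vs) / len(vs) for b, vs in sorted(buckets.items())}

minutes = downsample(samples, 60)  # three 1-minute averages
```

The same bucketing trick underlies the 10-minute averages, quantile windows, etc. mentioned above; only the aggregate function changes.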
All of them are in fact graph databases; they just didn't realize it, and ended up elevating the implementation to the status of design for many reasons specific to the context in which they were created. I think we should think more often as mathematicians and a little less as "hackers".
I think this is a mischaracterization. The relational model that motivated relational DBMSs is based on predicate logic. Mappings to graphs are obvious, but they are not the organizing principle. This was one of the strengths of the relational model: it encouraged a more flexible view of the data than graph databases had previously offered. In a complex relational schema, you can discover and work with all kinds of implicit graphs that were not originally intended by the schema design.
If they are all described as graph databases, we lose the usefulness of understanding the differences between them. I think understanding these differences is at least interesting, and possibly useful.
> Legacy database types represent milestones on the path to modern databases. These may still find a foothold in certain specialized environments, but have mostly been replaced by more robust alternatives for production environments.
I didn't notice anything that went into any detail about legacy database types.
Any idea what the author means by a "Legacy Database"?
Can anyone here point to a resource that gives a comprehensive treatment (use cases, pros and cons etc) of all the types (Nosql, NewSQL, relational, timeseries) of databases being used today?
How would you categorize something like ClickHouse or Interana or Druid? Columnar I guess, but then the description of Column-family in the article doesn't match up with my experience of how those work.
Hello, author here. That's a good question and something I had a hard time sorting out as I worked on this.
I think those fall into a different category confusingly sometimes called column-oriented databases. They're primarily used for analytic-focused tasks and get their name from storing data by column instead of by row (all data in a single column is stored to disk together).
I didn't include those as a separate category here because they're basically relational databases with a different underlying storage strategy to allow for easier column-based aggregation and so forth.
My colleague shared this article [1] with me, which definitely helped inform how I distinguished between the two in my head.
[1] http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-m...
Skimmed through it looking for an answer but didn't see it: where are unions in Prisma?! I was looking for some big reveal about an underlying choice that enables what everyone is begging for.
Are you in our Slack? I'm @mattmueller at https://prisma.slack.com.
I'd love to chat with you to better understand your use cases, so we can make sure we're designing it for you.
The column-family databases mentioned (Cassandra, HBase) are both just fancy key-value stores that add semantics for separate tables and cell-level data so you're not rolling it yourself.
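That "fancy key-value store" structure is roughly a map of maps, which is what wide rows amount to (very simplified sketch; real systems add sorted column order, timestamps, tombstones, partitioning, etc.):

```python
# Column-family sketch: row key -> {column name -> cell value}.
store = {}

def put(row, column, value):
    store.setdefault(row, {})[column] = value

def get(row, column=None):
    # Fetch one cell, or the whole wide row if no column is given.
    cells = store.get(row, {})
    return cells if column is None else cells.get(column)

put("user:1", "name", "ada")
put("user:1", "email", "ada@example.com")
```

Different rows can have entirely different column sets, which is the cell-level flexibility the comment refers to.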
GraphQL is a really important use case for Prisma. That is using Prisma as the "data layer" on top of your database when implementing a GraphQL server (e.g. using Apollo Server). However, it's not the only use case since you can effectively use it with any application that needs to access a database (e.g. REST or gRPC APIs). We actually wrote a blog post exactly on this topic: https://www.prisma.io/blog/prisma-and-graphql-mfl5y2r7t49c/
You can also find examples for the various use cases here: https://github.com/prisma/prisma-examples/tree/prisma2
Please let me know if that clarifies it or if you have more questions! :)
Document databases kind of killed them, I would speculate. Since JSON serializes objects so much better than (ugh) XML, the relational impedance mismatch is gone (well, a lot of it).
It should be emphasized that graph databases can do all other types of databases (relational, document, key/value, etc.) as you can see demonstrated in this article (https://gun.eco/docs/Graph-Guide).
This makes graphs a superior data structure.
If you think about the math, any document is a trie, and tables are a matrix. Both trees and matrices can be represented as graphs. But not all graphs can be represented as a tree or matrix.
This gets even more fun when you get into hypergraphs and bigraphs, which are totally possible with property graph databases where nodes have type!
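The tree-to-graph direction of the claim above is mechanical: flatten any nested document into (parent, label, child) edges. The reverse doesn't work in general, since a cyclic graph has no faithful tree form. A sketch:

```python
def to_edges(node, name="root"):
    """Flatten a nested document into (parent, label, child) triples."""
    edges = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{name}.{key}"       # path-based node id
            edges.append((name, key, child))
            edges.extend(to_edges(value, child))
    else:
        edges.append((name, "value", repr(node)))  # leaf node
    return edges

doc = {"user": {"name": "ada", "age": 36}}
edges = to_edges(doc)
```

Every document maps to such an edge list, but an edge list with a cycle (say, two users who follow each other) has no document that round-trips back.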
I'll read this generously and assume you meant to say that graphs are an essential data structure, i.e. we can use a graph to represent the more specific data structures used by various types of databases (e.g. a B-tree is a type of graph).
Whether a graph data store or a more specialized tool (e.g. a relational database, etc.) is superior depends (as I'm sure you agree) on context.
Not to knock graph dbs, but isn’t the reverse also true?
> To store data, you provide a key and the blob of data you wish to save, for example a JSON object, an image, or plain text. To retrieve data, you provide the key and will then be given the blob of data back. The database does not evaluate the data it is storing and allows limited ways of interacting with it.
Definitely not a good description of Redis, even though they cite it as the first example of a Key-Value DB.
Tabular vs Document. Having relations is orthogonal to the shape of your data. There are document databases with relations - RethinkDB was pretty popular. Mongo sadly doesn't have them but will probably eventually get them too.
And since 4.x they have transactions over shards too!
It's a common misconception that it comes from foreign keys.
I am really tired of articles that talk about the different types of databases. People can make a graph database act like a relational database, and vice versa. Computers, in the end, are just Turing machines. Just pay attention that the query you are executing actually takes the optimal approach.
I wish more time would be spent talking about the underlying algorithms that the different query languages use to accomplish the tasks. It is important for developers to understand the execution complexity of queries, and how data is distributed across a cluster.
For example, I am usually surprised when people talk about "web-scale", but they do not understand the difference between a "merge-join" and a "hash-join". Or when people do not realize that a sort requires the whole result set to be materialized and sorted.
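For anyone curious, the two join strategies in miniature (simplified sketches: the merge version assumes at most one match per key per side to stay short):

```python
def hash_join(left, right):
    # O(|L| + |R|): build a hash table on one side, probe with the other.
    # Needs memory for the build side; input order doesn't matter.
    table = {}
    for key, value in left:
        table.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, rv in right for lv in table.get(key, [])]

def merge_join(left, right):
    # O(|L| + |R|) *if* both inputs arrive sorted on the join key;
    # otherwise you pay for the sorts first (done here for clarity).
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            out.append((lk, left[i][1], right[j][1]))
            i += 1
            j += 1
    return out

users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(3, "pen"), (2, "book")]
```

Same result either way; the planner's choice hinges on whether sorted order is already available and whether the build side fits in memory.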
> For example, I am usually surprised when people talk about "web-scale", but they do not understand the difference between a "merge-join" and a "hash-join".
In fairness, I think "web-scale" generally means the serving path of a website with (say) hundreds of millions of active users. In other words, a heavy OLTP workload. The total query volume is too high for a single-machine DBMS, but each operation executed is probably simple. They may not be doing joins at all; many of these websites have gotten away with key/value databases. Where they are joining, most likely at least one side of the join is a small amount of data directly owned by the calling end user. (In social products, the other side might be, say, the users table, to map userids to names, email addresses, profile photos, etc.)
Big joins are more likely to happen in offline paths but likely via something like MapReduce rather than in the database server, and that batch processing framework may use different terminology for similar needs.
In that context, I think it's relatively understandable why someone would be fuzzy on merge-join vs hash-join. There are other skills they might need that are specific to key-value or "NewSQL" databases like Bigtable or Spanner. I wouldn't expect someone who doesn't work on a "web-scale" system to know much about this. These skills aren't simply additive, and "web-scale" isn't necessarily harder, just different.
And then of course there are people who think they have a web-scale website when it's not popular enough to require giving up on single-machine DBMSs. There's just no hard problem there: not the expense of single operations, not the overall expense.
> Computers, in the end, are just a Turing machine.
This isn't a helpful abstraction. Electricity is just electrons after all, why think about how you're going to wire your house when it's all just atoms? Read a bunch of books on quantum physics, then become an electrician /s
> Just pay attention that the query that you are executing is actually doing the optimal solution.
Isn't that the entire reason for articles like this? eg: If you have data with large amounts of relations a no-sql database probably isn't the right approach.
> I wish more time would be spent talking about the underlying algorithms that the different query languages use to accomplish the tasks.
Absolutely, if just for knowledge's sake, but these are largely not needed if you understand the high-level use cases for any given DB. It's cool if you understand B-trees, but it's not strictly needed to use SQL. In fact, many would find this not helpful.
Take a highly nested JSON document and try to implement it in a relational database. You would have an O(1) lookup in MongoDB and an O(n) lookup in MySQL. Or good luck traversing a graph structure in MongoDB when you should've used Neo4j. So in order to have a performant database, you do need to "pay attention" and ensure that your access patterns suit the database type.
Also Web Scale is about any approach used at the bigger web companies. And the type of people who use the term are not the same people who would be sitting there optimising SQL queries. So I wouldn't conflate the two.
Working in a relational-ish database built on top of a poorly-abstracted B-tree object store is among the top ten worst development hells I've personally endured. It was awful.
Yes, you can sort of make any database behave like any other database. But you don't want to. You want a power tool that has query planners built in that can efficiently solve the sort of problems you're dealing with.
Everything you said about algorithms and query complexity is part of that foundation, not a replacement for it.
> I am really tired of articles that talk about the different types of databases.
I read this article and learned things I didn't know.
What would you like to have done differently? The author not write the post? Or someone not post it to HN? I don't understand your complaint. If you'd like a different article, write a different article.
> I am really tired of articles that talk about the different types of databases. People can make a graph databases act like relational databases, and vice-versa.
> I wish more time would be spent talking about the underlying algorithms
That's precisely the difference. How they store data is fundamental to what kinds of operations are fast. A row-based and columnar database can look rather similar (PostgreSQL/Redshift). But the performance characteristics are far different.
Did the article come up short?
Famous last words :)
Details, in the end, really matter.
Check out mm-ADT: http://rredux.com/mm-adt/. The goal of the project is to develop a distributed virtual machine and bytecode specification to make it easier to integrate processors, storage systems, and query languages.
I recently gave a talk on mm-ADT at ApacheCon: https://www.slideshare.net/slidarko/mmadt-a-multimodel-abstr...