Old, Good Database Design

[+] commandlinefan|5 years ago|reply

Some people choose nosql alternatives because they've spent time analyzing the performance of a proper relational model and have determined that an RDBMS will generate too much overhead for their data load and consciously accept the tradeoffs involved in giving up automated referential integrity.

Most people, though, choose nosql alternatives because they're too lazy to learn how to model data.

[+] SamuelAdams|5 years ago|reply

I am forever grateful that I took a full semester of database design in my undergrad. This single skill has stood with me for my entire career so far and has enabled me to figure out the root cause of many production issues. Plus people really like it when you can answer ad-hoc questions like "what inspections are still open and when were they first opened".

If y'all can understand Angular / React / Vue there's no reason to not learn databases.

[+] teh_klev|5 years ago|reply

I speak as someone who's worked for 30+ years on data modelling. Every time I encounter some mongo or other non-relational DB where the company jewels (the data) are stored with no documentation, no data model etc and stuff is just shoved into these stores willy nilly it makes me weep.

Start off with relational, if perf is a problem then look at denormalising, after that then consider other alternatives for special cases. But to see run-of-the-mill apps with no near future scalability issues jumping right into mongo et al from day one makes me want to run away.

[+] Cieplak|5 years ago|reply

> have determined that an RDBMS will generate too much overhead

I think the read/write overhead is mostly a function of schema design, rather than an intrinsic property of an RDBMS. Denormalized schemas have similar performance profiles to document-oriented storage.

Moreover, mainstream SQL databases like Postgres are getting better and better at indexing jsonb fields, indexing time series data with BRIN, rudimentary full-text search, offering a one-size-fits-all storage system that may not be the best at everything, but will be good enough to support a growing business to its next funding round.

[+] hinkley|5 years ago|reply

Those two aren't mutually exclusive.

If you don't know how to put in the hard work to get something like you want 'the right way' and someone keeps advertising that they have 'another right way' that can do all the same things, you can take the low road and by the time you see the trap you've walked into it's difficult to go back, or for some people even admit that mistakes were made.

There are all of these time horizons in Software Engineering where you externalize the costs of things to your successors. Several classes of bad decisions start to fall apart at 18 months, and unless those started at the same time you did, you might be able to grin and bear it until it's time for a new job anyway. I think there are classes of process or architecture missteps that might go 3 years (less confident of that time interval) before it comes off the rails, and for a lot of startups that could be the proximal reason you flame out, even though it's the symptoms that are more visible.

[+] hazz99|5 years ago|reply

Or, like myself, they chose NoSQL because it was much better integrated into their development and production cloud infrastructure than any alternatives.

[+] jrockway|5 years ago|reply

I think relational databases have largely failed developers because they don't provide the features they actually need.

A common question that comes up is how to do zero-downtime schema changes. The answer is that there isn't one. A correct implementation would store each schema version in the database and when an application connects, it would specify which version it's speaking. The developer would supply a mapping on how to make vX data available to a vY program. But no relational database supports such a feature, so people are forced to tread carefully -- look at all the deployment software that exists to attempt to find changes with database migrations and treat them differently. Look at all the software people have written to even apply those migrations. It's staggering, all because in the 70s when these systems were designed, the thought of deploying your code multiple times a day was unheard of.

Another problem that comes up is transactional isolation. Most engineers, and even casual practitioners, "know" that transactions exist for cases where you want to perform multiple operations atomically. But very few of these people are running the transaction with an isolation level that provides those guarantees. They will write their program assuming that transactions are strictly serializable, but in fact they are using "read committed" or some other weak form of isolation that totally breaks their assumptions. Then the database gets into a weird state, and people are baffled as to how that could happen. The actual implementation is so different from the CS assumptions that it's not even something that crosses people's minds, and "read committed" behaving as "read committed" looks like a heisenbug in the rare case they actually notice what's going on. That's super bad.

These underlying problems have nothing to do with SQL or NoSQL, though. NoSQL can smooth over schema incompatibilities a bit (perhaps the schema specifies everything as a "field tag" instead of a name, so you can safely rename columns, or perhaps everything is "optional", so the application can detect that it's reading an old record when it's missing; you can also easily build your own versioning system on top because the data doesn't mean anything to the database engine itself), but you can still get yourself into a lot of trouble. The NoSQL databases also have a horrifying transactional cleanliness record. Postgres may be "read committed" by default but at least you can get real transactions if you ask for them; good luck ever getting them with some NoSQL databases.

I guess where I'm going with this is that database engines are focused on the wrong problems. The relational model is very good. But it's something you can bring yourself once you have a way to transactionally read and write keys with opaque values. You can also add indexing at the application layer, or triggers, or encryption, or auditing, or RBAC... whatever, it's just code. At the end of the day, picking a relational database just gets you a VERY opinionated set of defaults that is unlikely to be what your application needs and nearly impossible to change later... but the defaults are juuuuust good enough that nobody makes a real effort to change them. Meanwhile, we ignore the problems that actually plague developers; schema versioning, unusual data types (time series, large blobs), availability, replication, etc. The operational concerns have been ignored for decades, and it's slowing everyone down.

People are right to be looking for alternatives, even though we know that most of the alternatives have even worse problems. Someday, somewhere, someone will get it right.

[+] rdiddly|5 years ago|reply

All depends what you're doing, right? Is it a small stupid app with a short lifespan? Is it "just" a proof-of-concept? Is the company likely to survive long enough that technical debt matters? (oblig. HN/SV/startup comment) Is there a ridiculous imbalance between data ingestion volume vs. reads/retrieval? All good arguments for alternative/noSQL approaches.

[+] burnstek|5 years ago|reply

I am imagining an industry where half of developers don't know how to design a relational data model. Scary if true.

[+] throwaway894345|5 years ago|reply

If one of the purposes of relational databases is data modeling, I've always wondered why there aren't good semantics for sum types. The real world is full of them, but databases can't express them. When I bring this up, some people respond that this is the purpose of ORMs; however, this implies that we have an arbitrary bifurcation in which some of the processing happens efficiently in SQL and anything that depends on sum types has to get hoisted up and over a network to application code. Further, it allows for different clients to behave differently, possibly having different notions of what any given sum type's variants are (which leads invariably to data corruption). I really wish databases did better here, but maybe I'm missing something.

[+] PhineasRex|5 years ago|reply

There's no real reason you can't have sum types in a relational database, this just goes back to our current crop of RDBMSs being quite old and predating the current emphasis on strong typing. I suspect a modern relational database with an easier way define complex types (not writing a plugin) would be very popular, but the amount of work required to make something like that is immense

[+] yummybrownies|5 years ago|reply

Three techniques for sum types in SQL (absorption, separation, and partition):

https://www.parsonsmatt.org/2019/03/19/sum_types_in_sql.html

Also, here is a relational database system with native support for sum types (and also no "NULL" nonsense, which is also not part of pure relational algebra):

https://github.com/agentm/project-m36

[+] namelosw|5 years ago|reply

I tend to agree. Algebraic data types (product and sum types) is essential for modeling. Missing sum type would be incomplete in terms of modeling.

A lot of obscure OO code I have seen is related to modeling sum types of products (or parameterized enumerated in some language), or the lack of it in mainstream languages.

The problem applies to SQL as well. If I make everything into the same table and leave many columns empty I’m not modeling at all. But if I split them into tables it defeat most basic SQL functionalities like SELECT * FROM my_sum_type.

[+] _y5hn|5 years ago|reply

The traditional relational model is very focused on mutable data and normalization. Different types would be categorized in separate columns. So this idea would run counter to "best practice" and need something foundational behind it, which would be just enough out of scope for a traditionally typed relational datastore. Maybe this is just another way of saying the underlying theories are different, or covering different areas of computation and storage. So the prevailing wisdom is to not expose type complexity explicitly to clients, but just export the inherent problem implicitly.

[+] wisnesky|5 years ago|reply

Sum types are harder to express as data integrity constraints that product types - for example, Horn clause constraints can be used to express that a table is a product (or limit, more generally) of other tables, but to express that a table is a co-product (or colimit, more generally) requires a stronger logic (handling disjunction, for example). This has implications for data processing algorithms such as the chase that depend on the strength underling logic.

[+] alextheparrot|5 years ago|reply

Can you give an example of real world data modeling where you want more expressive sum types over just using enums? Enums are technically a subclass of sum types, but even those are non-trivial to use at a data format level (Try evolving them in an on-the-wire message format like Avro or Protobuf).

[+] jayd16|5 years ago|reply

Hmm, is it that hard?

    Vehicle table -- ID, TypeId, Make, etc.  (123, 456, ...)
    TypeId table -- ID, Type (456, motorcycle)
    Motorcycle table -- ID, HandleBarStyle, etc. (456, Low Rider, ...)
    Automobile table -- ID, TrunkSpace, etc. (789, ...)

You can pretty easily add extra information to any id as long as you know where to look, and that can be a simple enum column to define the concrete type (and thus what data to grab). Its an easy enough join, isn't it?

The data modelling isn't the hard part really. Pulling it into an application in a nice way is probably harder. I don't often use languages with sum types professionally so maybe this is way off base but I don't see an issue.

[+] kccqzy|5 years ago|reply

What's difficult about them? I typically use nullable columns and then a check constraint to specify a custom condition for nullability. Columns belonging to the same alternative in the sum type must be all null or all not null. And then there's check only one active alternative.

[+] geophile|5 years ago|reply

> If one of the purposes of relational databases is data modeling

Huh?

Creating a data model before creating a database, is like writing an outline before writing an essay. It organizes your thoughts and gives structure to what you are about to do. Once you have a data model, you can then implement it using whatever database technology you choose.

If you don't start with a data model, you literally don't know what you are doing.

[+] gopalv|5 years ago|reply

> A well-thought design can save us many hours of coding, testing, and troubleshooting.

That is the very definition of a waterfall design model.

I've turned into a fluid-design advocate over the years, where every design principle follows a next question - "okay, this is good but how would I change it?".

So you start with a unique constraint and four months later, you find out that it is not actually unique (like "two patients with the same email, because the 2nd one is a newborn on day of birth").

Or you normalize a data-set only to find out that your 1:N relationship turns into a 2:N relationship from before/after dates (like "UK" goes from "EU" to "UK").

The lost-time work of a design is usually the "okay, we did it in a world where UK was in EU - but we undo it & here's how" notes.

Having a plan to decommission the nuclear plant you're building is super useful and often more relevant than designing it for efficiency alone.

[+] nightski|5 years ago|reply

Good design as advocated by the author is generally conducive to change and is saying pretty much the same thing as you are.

It's a lot easier to take away unique constraints later on instead of adding them in. It's easier to de-normalize some data for performance than to normalize it later on. The list goes on.

The reason the waterfall method received so much bad press is because of requirements gathering, not the software or data design phases. Requirements are hard to get right the first time and they also change over time. But I'd be surprised to find someone argue that good architecture and design is a bad thing (being defined as the ability to adapt to changes in requirements).

[+] pmontra|5 years ago|reply

Well-thought design saves time even in agile projects. Maybe it's a one day activity instead of three months of it and yet it makes a difference. Basically each activity (or whatever we call it) is a micro waterfall.

Example from today: a developer came back from a week of vacation, listened to the stand up meeting this morning and pointed out that we misunderstood the purpose of a table he worked on time ago. Result: a few hours of last week's coding were useless and we spent a couple of hours together at redesigning the activity. The total impact should be of about one day.

[+] teh_klev|5 years ago|reply

> That is the very definition of a waterfall design model.

I think you're conflating two different things. I've spent the last two days thinking about the impact of adding five new tables to our database for a feature, two of them just lookups. I've thought about the short term benefits and the long term possible problems (for something that isn't even spec'd yet).

This isn't "waterfall" this is just getting a design about right for the current circumstances.

If during the design phase you're already asking "okay, this is good but how would I change it?" then you're already waterfalling your design.

Being thoughtful about your design up front (waterfall as your claim) hopefully solves many a problem down the line when you've suddenly got 5900M+ rows of data in the wrong shape in your production database; because a production database of that size doesn't take kindly to being "agiled" around.

[+] silvestrov|5 years ago|reply

When you need to add info to your database, do you refactor existing tables or do you add a key-value table.

I've seen so many key-value tables that really needed to be refactored into proper normalized tables.

It's just so tempting to stick extra info into KV when you don't know where the project will end up.

[+] hinkley|5 years ago|reply

> "okay, this is good but how would I change it?"

Otherwise known as Reversible Decisions. Any decision that can be undone easily does not require the level of scrutiny, the level of investment. Save that investment for the things you won't be able to change.

When faced with an irreversible decision, it's helpful to develop stalling tactics. Everything from distracting people with other issues to finding a way to get a 'taste' of the change without committing to it. It also helps if you stay on top of release notes for tools you use, and competitors of those tools. New opportunities might arise to use someone else's work to solve your problem better.

[+] pjmorris|5 years ago|reply

Are you presuming that a 'well-thought [out] design' means that all the thoughts, and the design, came only at first? I can see a well-thought out design either as the waterfall you presume, or as an accretion of design decisions made over time as in your 'fluid design.' Either way can save many hours of coding, testing, troubleshooting.

[+] jelnur|5 years ago|reply

> That is the very definition of a waterfall design model.

No it doesn't mean that. When you can rely on your data it's easies to code and test than when you have to account for every bad data combination.

And remove the constraint is easier when you need it, than to add it later.

Thanks for the feedback!

[+] jacques_chester|5 years ago|reply

It's easier to safely loosen constraints than to add them post facto.

[+] dkersten|5 years ago|reply

Agile needs design too. Agile does not mean you don't do any up front work or any design work, it just means you have agility to adapt to change as you progress. A well thought out design often helps you be more agile, not less.

[+] simonw|5 years ago|reply

My least favorite part of database design is the bit where you have to pick lengths for your char columns.

Twenty years in and I'm still picking these pretty much by guessing. And when I guess wrong it causes really annoying problems further down the line.

I love how SQLite doesn't make me do this - it just has a TEXT type which is always unlimited in length.

[+] pizza234|5 years ago|reply

I suppose that in every RDBMs that makes the distinction, it's an optimization matter - VARCHAR being stored on-page, while TEXT off-page (although there can be optimizations for short TEXT values); the latter will cause an extra page seek on access.

Some database [versions] may also be unable to apply certain optimizations in certain cases, when TEXT is used (eg. temp tables on MySQL <= 5.7).

That doesn't prevent one from always using TEXT, and possibly, for most of the use cases (surely, if one uses SQLite, that's the case), the performance impact is not meaningful.

[+] simnim|5 years ago|reply

What's wrong with TEXT type for postgres, mysql, etc? In Postgres you don't need to declare a length for varchar either.

[+] brobinson|5 years ago|reply

Postgres also has a TEXT type like this.

Edit: I see mpolun left a similar comment here, but it looks like he has been mostly shadowbanned for about 8 years.

[+] 60secz|5 years ago|reply

Haven't you seen ghostbusters? "Don't cross the streams. It would be bad." If your data model has lines crossing over, it's the first smell of a bad design. Seriously tho, 5 minutes of untangling would make that data model diagram 100x better.

[+] janvdberg|5 years ago|reply

Reminds me of this great Derek Sivers post: https://sive.rs/pg

If your design is good, you need less code.

[+] markus_zhang|5 years ago|reply

I'm wondering what you guys think about columnar databases and wide tables.

We use Vertica and from senior colleagues and even Vertica developers I got the impression that big wide tables are good because it eliminates the needs of join. Thus we don't use star schema and just wide tables.

However I think data modelling is also about embedding proper business logic and it would be a lot more confusing if two unrelated columns are stuffed in the same table.

For example let's say my event based game has a base for each player and he can do a lot of things in his base. If we use a wide table then we wilk see columns related to factory building side by side with pig feeding, and because each small feature has some unique columns, a lot of those columns are NULL simply because this event json doesn't even have th fields.

I'm wondering if we should use Vertica for a transactional type table and then use say sql server for dwh and build more traditional data modelling. But this could be awfully wrong maybe...

[+] teej|5 years ago|reply

A few comments based on lots and lots of experience:

- Wide tables in columnar DBs can make some analytics queries easier to write and sometimes more performant.

- Wide tables can come at high storage cost and make other queries less performant (like SELECT *)

- How much of these things happen is extremely DB dependent. How does the DB's underlying storage mechanism work? How is the data partitioned and distributed? How sophisticated and storage-aware is the query planner? How large is your data? How fast is your data growing? How fast do you need a new data point to be reflected in your dashboard?

There's no free lunch when it comes to this stuff. A perfect solution doesn't exist, but the technology is getting better all the time. I've personally never used sql server as a data warehouse but plenty of folks do.

The stuff I use that I recommend very highly - Snowflake, TimescaleDB, vanilla Postgres. Also, use dbt.

[+] sixdimensional|5 years ago|reply

No. Vertica excels at being a column store database - a data model fit for analytics and OLAP. You should pretty much never use Vertica nor column stores for OLTP/transactional use cases.

It's not about the columns being wide, per se that is important in Vertica - although that is a benefit. It is about columnar storage and optimized querying, over row storage and optimized writing.

It would be the other way around - SQL Server or other transactional databases for that part. Actually, these days, SQL Server supports column store indexes on top of transactional tables- which gives you HTAP - hybrid transactional/analytical processing. A few other systems do this (Oracle too).

If you do HTAP, in a way you could even avoid using Vertica for data warehouse/OLAP use cases. Or you can build a separate dimensional data model in SQL Server and keep everything in there.

But if you already have Vertica, which is quite fast and good at OLAP queries on dimensional models, use it for that (enterprise data warehouse), and feed data from transactional systems in.

[+] tarkin2|5 years ago|reply

JSONB objects with SQL relations in Postgresql is my happy-medium between the joy of schema-less JSON and the reassurance of SQL relations.

[+] c17r|5 years ago|reply

Would you explain a little more? I'm intrigued. Are you saying SQL relations between fields inside JSONB columns between tables? A field in the JSONB column has foreign key to another table? And you can do a join?

[+] slifin|5 years ago|reply

Most relational databases aren't relational enough

For example in Drupal you can have a node table which is a foreign key relationship to many other custom tables

In SQL I can't say get me all the nodes that have a start date without explicitly left joining to a potentially dynamic number of other tables using a higher level language to modify the query

In Datomic or Datascript or Datahike or Datalevin or Crux this query would be trivial these are not nosql databases they're EAVT datalog engines, that in some cases also support SQL for compatibility

[+] whirlwin|5 years ago|reply

This is useful for most small apps or truly monolithic apps. ...But normalization in a distributed environment with 50+ apps? Really? You need to have some sort of duplication unless you want big bottlenecks, performance penalties, and hot headaches

From the linked MS article: "Redundant data wastes disk space and creates maintenance problems"

Made me laugh well

[+] TheRealSteve|5 years ago|reply

This seems contrary to what I have learned in my career as an application developer on data heavy platforms. Namely the first section that concludes: > Having stressed the importance of good database design...

I'm not in agreement with the author's concept of good design. I don't want other "doors" to edit the data that bypass the application logic. That's the mess SQL enables for DBAs and scripts that think it's okay to change data without executing business logic.

I think it's better to accept your data and application layers are coupled and plan to evolve and refactor them together rather than teach your database to defend itself from the evil business logic in applications.

I also don't think this dictates relational DBs vs nosql. Your data model is probably relational but how you choose to store and access the data depends more on what use cases you are trying to enable.

[+] unnouinceput|5 years ago|reply

When creating a visualization for your database design DO NOT under any circumstance cross relations as they go from one table to another. The client will get confused and frustrated. Looking at this article's visualizations hurt my eyes. Very bad choice for what's suppose to be a "good" design.

[+] roywiggins|5 years ago|reply

> we should keep numeric data in “integer”, “numeric” fields

I end up keeping numeric data as text when I'm ingesting an external data source that I don't trust not to change ID format on me. They're all numeric now, but the format could change, and the actual numeric value of the ID is not important at all.

[+] geebee|5 years ago|reply

Nice link. Nothing controversial, but sometimes simplicity is controversial in our field.

I've slowly come around to seeing proper database design as the most essential foundation of an IT system. I remember reading "your data will outlast your application", and I've been around as a developer long enough to have lived it.

One big anti-pattern I've seen with ORMs is that developers who don't really think in terms of data and relationships use the ORM as a kind of object serialization usable only from the application. Rather than thinking of the database as something useful that could be queried and accessed outside the context of the application, they write objects out to various tables and then re-import and re-construct them once they're needed again in the app, often with dependencies that are in code or even yaml configuration files. The upshot is that you simply can't use the database as anything other than a persistence tier for an app. It really would be no different if they had simply given the objects and id and pickled them to disk. The resulting data store really is that inaccessible and meaningless outside the context of the app.

As a result, if an analyst wants a report, they can't write SQL to get those reports, even though the persistence tier is, if perhaps in name only, a "relational database". And when the app goes away (as it inevitably does), they'll still want to know, say, how many beakers and test tubes were ordered by a lab tech who participated in 100 or more experiments per year with at least one faculty member from radiology between June and December in 2007.

But because they don't really know SQL, they see no value to it, and they're honestly just kinda irritated that they don't have an object database, which is what they understand a database to be - a way to pickle and reimport objects. The application outlasts the developer, and the data outlasts the application.

Yeah, if you're writing an app that will hold potentially useful data, definitely think about how this will be accessed outside the context of the app that perhaps inspired its creation, and think about how you'd get at the data if the application went away. That'll probably lead you to old, good database design.

Now, I actually do think that you can create a decent relational database through an ORM. I saw this go sideways with Rails a bunch of times, but I think that's probably because Rails made it so easy to start developing that a lot of people new to application development skipped the design stage of the data backend. You can construct a pretty robust database with migrations rather than CREATE TABLE statements, as long as you stick to the basic principles - but the technology makes it relatively easy to do the opposite and never really get into the mindset of data.

[+] jrms|5 years ago|reply

Hi, my apologies if it's a bit off topic, but I wonder if you have any advise or can point me in some direction on the way to becoming a DBA, or something similar enough? I've been working as and old school sysadmin for the last 17 years and counting. Even if I can do nowadays devops stuff, I always liked the DBA work, based on my experience as sysadmin for DB servers (mysql, oracle, and friends...), and I'm considering moving my career in that direction. So is there any "DBA certification" for which I could take a course and all that? Thanks!!

[+] rapfaria|5 years ago|reply

Which software is used to make those diagrams? The ones I use don't have good arrows like that

[+] mooneater|5 years ago|reply

Is the diagram a joke? That's total ERD spaghetti.

[+] jelnur|5 years ago|reply

It's included only as a picture :)

165 comments