top | item 22806142

A humble guide to database schema design

414 points| helloiloveyou | 6 years ago |mikealche.com

245 comments

order
[+] earthboundkid|6 years ago|reply
Most of this advice is stuff they tell you in school that doesn’t matter at all in reality.

Real advice:

- Normalize until it hurts, denormalize until it works. (I.e. prefer normal forms but denormal will have better performance.)

- You need nullable dates and nullable numbers. You never need a nullable string. If you must distinguish blank string from unset, add a column called “unset”. Using null for this will just result in bad data entry.

- Names should be stored as display_name, first_name, and last_name. A display name is not the same as just first plus last. They have different purposes, so don’t shove them together.

- When in doubt, just have a column full of JSON and figure it out later.

- Offset pagination is bad for performance. Avoid it at all costs unless you want some spider indexing page 10,432 of your archives to eat all your CPU.

[+] throwaway_pdp09|6 years ago|reply
Fuck me, are you trolling?

Guy with actual DB experience here, so:

> Normalize until it hurts, denormalize until it works. (I.e. prefer normal forms but denormal will have better performance.)

No it won't, it depends entirely on the situation. Denormalised data (besides allowing update discrepancies) can expand the amount of data, thus push otherwise memory resident working sets onto disk. Big slowdown. You are wrong. The correct answer is "it depends" - and denormalising really is a last resort that has to be done judiciously and with care.

> You need nullable dates and nullable numbers. You never need a nullable string.

Utter unjustified garbage. Correct answer (from a practical, not super-normalised form) is: "it depends"

> When in doubt, just have a column full of JSON and figure it out later.

OMG please never let me come near any fool who does this. I was called in to consult on a database problem and they did the same but with XML. It was really bad, and bad for performance and comprehension.

Please nobody listen to such 'advice'.

[+] teddyh|6 years ago|reply
> - Names should be stored as display_name, first_name, and last_name.

No.

“[…] refer people to this post the next time they suggest a genius idea like a database table with a first_name and last_name column.

https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...

You could, conceivably, have a "display_name" column, an "informal_name" column, a "formal_name" column, etc. depending on your needs. But for absolutely most people and purposes, a single "name" column will suffice, and will steer programmers away from making unfounded assumptions about people’s names.

[+] goto11|6 years ago|reply
> Names should be stored as display_name, first_name, and last_name. A display name is not the same as just first plus last. They have different purposes, so don’t shove them together.

It all comes down to the purpose of the database. There is no one-size-fits-all solution. For a hospital which need to register newborns, you can't even count on everyone having a name. For a library database, you need both first name and last name of authors, since books are sorted by authors last name - but having a "display_name" would be redundant and violate normalization.

If you can get away with having a single "name" field, this will say you lots of trouble. Only add additional fields if you need it for some use case.

[+] retreatguru|6 years ago|reply
> “ Offset pagination is bad for performance. Avoid it at all costs unless you want some spider indexing page 10,432 of your archives to eat all your CPU.”

I love this! It’s so true only bots dive that deep and it’s usually a total waste. But what do you recommend instead? Should you just deny any calls past page 10? Or do away with pagination altogether?

[+] goto11|6 years ago|reply
> denormal will have better performance

No it wont! In general it will have worse performance. Deliberate denormalization can be used to optimize for particular queries but it will make other operations more expensive and/or error prone. So it has to be done very carefully and with a full understanding of cost vs benefit.

And many database systems supports materialized queries which give the same optimization without having to compromise the schema.

[+] nerdponx|6 years ago|reply
You never need a nullable string. If you must distinguish blank string from unset, add a column called “unset”. Using null for this will just result in bad data entry.

What's an example of a data entry error that results from this scenario? Wouldn't it entirely depend on the data entry interface?

[+] Aeolun|6 years ago|reply
Your main database should always be 100% normalized. You can add a separate database for lookups later that can be as denormalized as you want.
[+] Cthulhu_|6 years ago|reply
That assumes people who read it actually have been in school; there are a LOT that never had anything about database design.

Your other points make a lot of assumptions and there's a lot of footguns in them but other commenters have pointed them out already.

[+] tasuki|6 years ago|reply
> Never store firstname and lastname in the same field

I disagree. Why would you want to force the "firstname lastname" pattern on people? What if someone has three names? Or only one? Just use a "name" field!

[+] henrikschroder|6 years ago|reply
Scandinavian here, our names don't really fit this pattern very well either.

My last name has an "ö" in it. That doesn't work in 99% of so-called global websites, so I have to transcribe it as "oe" instead, which is just wrong.

My husband has three first names and one last name. However, his actual name is his second first name. If his full name is "A B C D", if you ask him what his name is, he'd say "B D". If you know him you can call him "B", if you don't, you can call him "Mr D" or "Mr B D". Most American companies call him "Mr A" or "Mr A D", which is just wrong.

I have a friend with two first names and two last names, but his second first name is his name, and he prefers his first last name. If his full name is "A B C D", he'd say his name is "B C" if you ask him. Calling him "Mr A" is just wrong, and calling him "Mr A D" is super duper wrong.

(In government systems in our home country, this is of course handled. You can "mark" which first name(s) are preferred, so mail to my husband from either government or private companies will always be correctly addressed to "B D" and nothing else.)

[+] clon|6 years ago|reply
Absolutely! Having 'name_given', 'name_middle', 'name_family' or some nonsense like this is one of the most common mistakes in database (and application) design, usually made by culturally narrow-minded Western developers.

Take names in Myanmar [1] as an example. From Wikipedia:

> Burmese names lack the serial structure of most modern names. The Bamars have no customary patronymic or matronymic system and thus there is no surname at all.

That said, it is hard to push back on a business requirement to sort users A-Z based on family name, just because.

[1] https://en.wikipedia.org/wiki/Burmese_names

[+] Macha|6 years ago|reply
The same also applies for the structured address format. I've seen some stupid decisions in the name of this.

Post codes: 5 years ago, my country brought in post codes. prior to this there were none. It was not for the benefit of our postal system, who stated they didn't really care, but for badly designed web forms (and the occasional rural house with an unhelpful address)

They also may not fit your convention of what addresses in your country are. The obvious examples are stuff like Japan and one or two central European cities that are addressed "Building number, Block Number, City", but the aforementioned unhelpful rural addresses include a former friend of mine who had to give his address as "House name, near village, County". Actually the village was not near but it got the mail to the right post office which knew what to do with it.

As a counterpoint, I've had places that have very strict fields like "Street Name" (20 char limit), City, County, Country. For a few years I lived on a street who's name appeared three times in the same city. The actual form to get it delivered was "Street name, Locality, City", but because of the length limits I had to sneak the locality into the City field, which luckily worked as the City and County shared a name.

[+] thelittleone|6 years ago|reply
Very cool to see mention of "one name only" case. My wife had only one name before we married. She's Indonesian. When we got married in Australia our forms pretty much crashed the Government's Birth's, Death's and Marriages system. We got bounced around between all different people trying to compute the reality that in some parts of the world, people have only one name.
[+] ozim|6 years ago|reply
Yeah, "humble guide..." using never/always in 3 points, seems like author thinks he saw everything but he is up for unpleasant surprise in the future :)
[+] CameronNemo|6 years ago|reply
You seem to be assuming that each field:

1. Could not contain whitespace.

2. Would have a minimum length.

TFA specifies neither of these requirements. You could have someone's first name (which I would recommend calling given name) be "Pablo Emilio" and their surname be "Escobar Gaviria". Or alternatively you could have someone with a given name of "Sukarno" and an empty string for the family name. Patronyms and matronyms are sort of ambiguous, but would probably be categorized as surnames.

I use a single name field, but I don't think a given name and surname split is crazy.

[+] nobrains|6 years ago|reply
The usual pattern in this case is to add first name and then all rest of the names in the first_name field. And last name in the last name field. Users can be made aware of this on the UI.

Another option is to create a middle_names or other_names field, its first_name then other_names then last_name in that order.

[+] orange8|6 years ago|reply
One way to look at it, if you have more than one name, you'll usually have a preferred order of writing them, regardless of culture. The first-name used will have different protocols and meanings in different cultures, but it is still the first-name, literally.
[+] gremlinsinc|6 years ago|reply
What if you're ordering by 'last name'?

Say it's an HR app. Do you just assume the last word in the name is the last name?

[+] ebg13|6 years ago|reply
Design guidance aside, this is not a good article because it doesn't show how to fix the presented problem scenarios.

"This is not first normal form" - OK, how would you fix it?

"This is not second normal form" - OK, how would you fix it?

"This is neither second nor third normal form" - OK, how would you fix it?

If you're going to show problems, show solutions as well.

The rest is a big red flag list of falsehoods that programmers believe about names and addresses.

[+] helloiloveyou|6 years ago|reply
Although others have replied to you, I'll take your comment as constructive criticism. Though the tone could be better.
[+] kbenson|6 years ago|reply
That's exactly what it does. When talking about each normal form, it gives an example and what to change.
[+] awaythrower|6 years ago|reply
A good ER database textbook illustrates normal forms much better than a blog post.
[+] tomlagier|6 years ago|reply
If you're dealing with international addresses, point 3 becomes very challenging. The 'tokens' of an address are called different things everywhere, take different forms, and sometimes don't make much sense to compare. Figuring out a good balance of usability and generality can be really tricky.

Here's nearly 100 pages on the subject (from the perspective of addressing mail for USPS): http://www.columbia.edu/~fdc/postal/

[+] me_bx|6 years ago|reply
Off-topic comment / feedback about the blog post:

The author self-describes himself as "_Trusted_ Consultant" in the blog's subtitle.

The comic strips illustrating the article do not mention about the permission from identified license provider(s), and are cropped, does not help building trust - at least in my case.

Are they used with authorization from their authors?

[+] boffinism|6 years ago|reply
In what way is this guide 'humble'? Seems fairly self-confidently opinionated to me.
[+] LoSboccacc|6 years ago|reply
> Never store firstname and lastname in the same field

funny, where would you store the middle name then? and the second middle name? married name in cultures where the spouse changes or double it? of course in normal form you need a name table, with an order column, and so on and so long.

anyway, this article goes straight to the notions everyone can fetch on any book and skip the interesting and diffiylcult part about database design, which is understanding your domain so that no future expansion is prevented by a wrong cardinality on a relation that was simplified from reality.

because if it's a user database you might get away with just the name, so you can address people during support calls. if it's an accounting database, you might want to know someone name at the time of invoicing, having multiple records with start and end validity times, and if it's a FBI database for protected identities you might have to have multiple name aliases referring to a single person.

[+] atwebb|6 years ago|reply
Since everyone has to have an opinion and provide it, here's mine, maybe a retitle?

>A humble guide to transactional database schema design

There are a number of things that don't apply to analytical database design and I'm seeing the two confused fairly often, then again, this comes entirely from my point of view so take it for what it's worth.

[+] davedx|6 years ago|reply
I work on an e-commerce platform that's live in about 20 countries. One of the hardest records to normalize correctly is the address, because not only do users have very different rules depending on their country (e.g. not all addresses in Turkey have post codes, GB has some very weird building names/numbers...), we also connect with vendor API's that have their own ideas of how to model an address.

My personal opinion on this is something like address_line_1, address_line_2, city, region, post_code, with each field being varchar(max). Anyone else have an opinion on how to do this?

[+] winrid|6 years ago|reply
Recently I ran into a project where everything is stored in around 3k unique fields, with many levels of nesting, in a JSONB column with NO schema or documentation. Generated JSDOC from the data and I'm hoping to create some kind of schema.

You can use JS on the server. Just don't do it this way ...

[+] ghufran_syed|6 years ago|reply
“Beginning database design” By Clare Churcher is a really accessible book on these issues, very practical (starts off with simple spreadsheet tables!) and takes you through step by step to understanding the problems with and the need for the different levels of normalization.
[+] gigatexal|6 years ago|reply
I’m just here with popcorn and for the comments.

Lucky for me SQL just clicked and when I wrote my first schema third normal form is how it came out more or less so I knew I was on to something.

I would say a study of how the internals of your databases of choice works would help in schema design.

[+] DeathArrow|6 years ago|reply
That's one issue I see with NoSql, data is not normalized and consistency goes out of the window if database is big enough and the volume of data is big enough. That might not matter if you work on a search engine, social network, music suggestion app, but for most business apps and for most websites, having inconsistency is really bad.

On the other hand, having some data denormalized can improve performance as you can avoid expensive joins. It's for designer to see if they need to trade consistency for performance and in what measure. I'd avoid it, I'd prefer to add more servers, use load balancing, use caches.

[+] jonathanstrange|6 years ago|reply
In my programming life I've barely used SQL databases so far, and there is something I've never understood about these recommendations. Maybe someone here can explain that to me.

I thought databases are designed to organize the data optimally internally and execute all operations on them as efficiently as possible. Why do you then need to organize databases so stringently? Doesn't that mean that there is something wrong with how the db works?

To me the advice always seemed similar to bit twiddling instead of relying on the compiler to optimize. Shouldn't the database lay out the data optimally internally?

[+] collyw|6 years ago|reply
My humble opinion is that poor choices in database design are the source of a great deal of technical debt.

I used to avoid making schema changes as they were "scary" (until I got a decent understanding of Django Migrations). Just a quick hack to make up for it in the application layer was far easier. Then another, and another until you have real mess on your hands.

And avoid nulls when you can. Unfortunately most of the time I have had a new database design there is existing data in excel sheets that is messy and you can't avoid it.

[+] FpUser|6 years ago|reply
I've got 2 things to say:

1) There is no "good design". It is always a compromise of trying to satisfy various practical constraints. OLAP vs OLTP probably rings a bell here. And this Last / First name thing is laughable. The complexity of handling all possible human names can be insane. Depending on the goal the proper design can be anything starting from single field and up to some relational structure.

2) The article itself looks like let's just write something up to score some points.

[+] yellowapple|6 years ago|reply
> 4) Never store firstname and lastname in the same field

I have the opposite opinion: unless you have a specific reason why you need to know one's "first name" and "last name" (e.g. to interface with some other system that requires names to be split up like that), just use a single field and stop caring about whether or not the name conforms to your preconceived notions of format and structure.

[+] emmelaich|6 years ago|reply
(fwiw, I like the article, and I find it suitably humble)

Just want to hang a thought here ... depending on the precise definition of First Normal Form, it is either impossible to avoid[0] or impossible to enforce[1] without further constraints.

[0] if definition is "must be rectangular", i.e. all columns have the same number of rows and all rows have the same number of columns.

[1] if the definition includes "do not allow duplicate rows"