top | item 1566027

A Simple Guide to Five Normal Forms in Relational Database Theory

135 points| b-man | 15 years ago |bkent.net | reply

50 comments

order
[+] novum|15 years ago|reply
I do a lot of interviews for software developers, and I ask a very simple database normalization question that close to half of the candidates miss completely.

Something like: "You're tracking doctors, hospitals, and patients. Doctors can see multiple patients, and doctors can staff multiple hospitals". All I'm looking for you to say is that you'd put junction tables between the main entities.

It's so, so crucial that developers understand the basics of database design, and normalization is a huge part of that.

[+] Groxx|15 years ago|reply
In this case, the many to many relationships should be obvious, but I feel I must point out:

Based on that text specifically, all that's needed is that the patient and hospital each have a doctor ID column. How many would get it if you specifically pointed out that they need to consider real-world circumstances in this, and / or if you enumerated all the requirements?

Many many many times, when stating a hypothetical question, real-world handling responses are not desirable to the questioners. If the interviewee realizes this, they also realize that they risk losing points no matter how they answer: answering based on the specified requirements fails with you, but answering based on real-world uses fails with people testing against over-engineering / people religious about specific answers to specific questions. And being verbose risks both groups, based pretty much solely on the interviewer, of whom they have effectively no knowledge.

Or have you done this, and it just wasn't included in the example above?

[+] tbrownaw|15 years ago|reply
Nice sanity test buried in there too, how many candidates try to put a DoctorID field on the patient and hospital tables (ie, assume a patient can only see one doctor, or a hospital only has one doctor working there)?
[+] timwiseman|15 years ago|reply
As a senior DBA that has done many interviews for more junior DBA/database developer, I too have been surprised by how many candidates cannot answer a question like that at all.

With that said, many developers (other than database developers) really do not need to understand that. They don't need to understand it because they will normally have a good DBA/database developer that handles that part for them. In fact, large projects may have many specialized DBA roles including Architects focused specifically on schema design.

[+] djb_hackernews|15 years ago|reply
devils advocate, but why are you asking database engineer questions during a software developer interview?
[+] Shorel|15 years ago|reply
It totally depends on the use you plan for the DB.

For a DataWarehouse, you probably want to de-normalize the DB if you expect good performance when calculating those cubes.

[+] kingkilr|15 years ago|reply
Gah, I've been in this business too long (I'm 19, WTF?), my first question is, can doctors be patients as well?
[+] btilly|15 years ago|reply
You understand everything most developers should ever need to know about normalization if you get the following two sentences.

The key, the whole key, and nothing but the key. So help me Codd.

(For those who don't get it, the first sentence describes first, second and third normal forms, and the second sentence names the guy who is responsible for most of the theory behind them.)

[+] sovande|15 years ago|reply
Thanks, nice to get a refresher on the fifth normal form. Though in practice I have yet to see anything behind the third normal form used; "The Key, the whole Key, and nothing but the Key, so help me Codd"
[+] ergo98|15 years ago|reply
Almost everyone honours the 4th normal form without thinking about it. I think the "the 3rd normal form is the practical level" just became a bit of a meme.
[+] slim|15 years ago|reply
Don't take it for granted.

With modern software design and especially OOP the risk of data inconsistency is less of a problem (I do web apps for entreprise and it simply never happened to me).

On the other hand duplication of information is a great way to scale an application. And incidentally offers data security since you can cross check your data in case of corruption.

[+] ergo98|15 years ago|reply
>With modern software design and especially OOP the risk of data inconsistency is less of a problem

This sounds mighty hand-wavy. How does "modern software design" reduce the risk of data inconsistency?

>On the other hand duplication of information is a great way to scale an application

Most relational systems have facilities for duplicating information to scale. Materialized views, for instance, generate vile offenses to all normal forms but perhaps the first, hyper optimized for consumers, but it is guaranteed coherent and consistent, and happens with barely any work.

Seriously, we've been solving these performance issues for years. Every time some, failing a better word, noob writes up their big internet paper on why the relational model fails (with "modern" software, which is chuckleworthy), the world gets just a little bit dumber.

The bit about checking data for corruption is just disturbing.

[+] Another1|15 years ago|reply
If you want to learn the relational model using a fairly slim book, i cant recommend "Practical Issues in Database Management" by Fabian Pascal enough for you!

And always remember, relational theory is one thing, and popular RDBMS is another.

[+] maxdemarzi|15 years ago|reply
Software developers and SQL developers can both usually figure out how to get what they want out of the database...

The problem is the software developer may rely on loops ( and variables, and cursors and temporary tables and dynamically generated sql, the horror) to do his thing while the SQL developer will use set theory to avoid as much of that as possible.

They both rely on what they know, it's just that one persons knowledge is better suited for programming and the other is specific to relational databases.

[+] timwiseman|15 years ago|reply
You are entirely right, but permit me to respectfully add a few points I think are critical.

First, the two domains of knowledge are not mutually exclusive. Almost all SQL experts/DBAs are also able to do many forms of more conventional programming, and do them properly. Similarly, many software developers know SQL reasonably well and can write good set based code when they choose too.

Second, if working in the other domain, you pay a hefty price for not approaching that domain on its own terms. Using loops and cursors in MS SQL Server is enormously less efficient than the same solution done in sets (Oracle has a slightly narrower gap, but still a gap. I suspect the same is true of all RDBMS but I can only speak to those from experience). On the flip side, if a normal database developer is shoe horning data into an RDBMS that is not relational, that will cause complications as well.

[+] grk|15 years ago|reply
I've always thought that 5NF is a bit overkill. Didn't find one decent example that would make me say "oh, so that's why they do it".

I must say though, after the Database Systems course at DTU, when I come up with a DB design it's almost always in 4NF to start with :). Man, I hated this course.

[+] alan-crowe|15 years ago|reply
I know so little about relational databases that I'm worried about being misled by an unreliable guide found on the internet. Reading suspiciously, this guide seems garbled.

Section 3, second and third normal forms, says "Under second and third normal forms, a non-key field must provide a fact about the key, us the whole key, and nothing but the key. In addition, the record must satisfy first normal form."

There seems to be a stray word "us". Ignoring that, this is cute word play that doesn't quite make sense. If your table has non-key fields you are inevitably providing facts about the non-key fields.

Continuing,

"We deal now only with "single-valued" facts. The fact could be a one-to-many relationship, such as the department of an employee, or a one-to-one relationship, such as the spouse of an employee."

This is the wrong way round. Usually there are lots of employees and a few deparments. Each employee works for just one department, but each department has many employees. Thus the "department of an employee" is a many-to-one relation, or function, which takes an employee and yields a department. The one-to-many relationship here is the employee list of a department.

The example for 3.2 seems to be opening the wrong can of worms. Suppose that Mr Strauss, who works for the department of waltz in Vienna, is seconded to the department of piety in Rome, in order to teach them some dance steps. Then we want his row in the database to read

(Straus, Waltz, Rome)

So one can of worms is sticking generic labels on your fields. If you label your fields (Employee, Department, Employee-location) there is no problem. If you label your fields (Employee, Department, Department-location) you have a problem, but it is obvious. If you label your fields (Employee, Department, Location) you are heading for trouble as some users of the database fill in the location of department and other users of the database fill in the location of the employee.

Hmm, second and third normal form are suspiciously similar, differing only because we regard some fields as belonging to the key. Is the article trustworthy?

[+] cmurphycode|15 years ago|reply
I studied this summary before my most recent interview. It's a little abstract (in terms of lacking motivation/examples) for those who don't have experience in the field, but if you read it enough times, you get a decent handle on the idea.
[+] eru|15 years ago|reply
I don't have any experience in the field, but I did not find it too abstract. Rather not abstract enough. (But on the other hand, I am a mathematician, and know about relations in theory.)
[+] dennisgorelik|15 years ago|reply
Did anyone notice how many grammatical mistakes that article has?