I do a lot of interviews for software developers, and I ask a very simple database normalization question that close to half of the candidates miss completely.
Something like: "You're tracking doctors, hospitals, and patients. Doctors can see multiple patients, and doctors can staff multiple hospitals". All I'm looking for you to say is that you'd put junction tables between the main entities.
It's so, so crucial that developers understand the basics of database design, and normalization is a huge part of that.
In this case, the many to many relationships should be obvious, but I feel I must point out:
Based on that text specifically, all that's needed is that the patient and hospital each have a doctor ID column. How many would get it if you specifically pointed out that they need to consider real-world circumstances in this, and / or if you enumerated all the requirements?
Many many many times, when stating a hypothetical question, real-world handling responses are not desirable to the questioners. If the interviewee realizes this, they also realize that they risk losing points no matter how they answer: answering based on the specified requirements fails with you, but answering based on real-world uses fails with people testing against over-engineering / people religious about specific answers to specific questions. And being verbose risks both groups, based pretty much solely on the interviewer, of whom they have effectively no knowledge.
Or have you done this, and it just wasn't included in the example above?
Nice sanity test buried in there too, how many candidates try to put a DoctorID field on the patient and hospital tables (ie, assume a patient can only see one doctor, or a hospital only has one doctor working there)?
As a senior DBA that has done many interviews for more junior DBA/database developer, I too have been surprised by how many candidates cannot answer a question like that at all.
With that said, many developers (other than database developers) really do not need to understand that. They don't need to understand it because they will normally have a good DBA/database developer that handles that part for them. In fact, large projects may have many specialized DBA roles including Architects focused specifically on schema design.
You understand everything most developers should ever need to know about normalization if you get the following two sentences.
The key, the whole key, and nothing but the key. So help me Codd.
(For those who don't get it, the first sentence describes first, second and third normal forms, and the second sentence names the guy who is responsible for most of the theory behind them.)
Thanks, nice to get a refresher on the fifth normal form. Though in practice I have yet to see anything behind the third normal form used;
"The Key, the whole Key, and nothing but the Key, so help me Codd"
Almost everyone honours the 4th normal form without thinking about it. I think the "the 3rd normal form is the practical level" just became a bit of a meme.
With modern software design and especially OOP the risk of data inconsistency is less of a problem (I do web apps for entreprise and it simply never happened to me).
On the other hand duplication of information is a great way to scale an application. And incidentally offers data security since you can cross check your data in case of corruption.
>With modern software design and especially OOP the risk of data inconsistency is less of a problem
This sounds mighty hand-wavy. How does "modern software design" reduce the risk of data inconsistency?
>On the other hand duplication of information is a great way to scale an application
Most relational systems have facilities for duplicating information to scale. Materialized views, for instance, generate vile offenses to all normal forms but perhaps the first, hyper optimized for consumers, but it is guaranteed coherent and consistent, and happens with barely any work.
Seriously, we've been solving these performance issues for years. Every time some, failing a better word, noob writes up their big internet paper on why the relational model fails (with "modern" software, which is chuckleworthy), the world gets just a little bit dumber.
The bit about checking data for corruption is just disturbing.
If you want to learn the relational model using a fairly slim book, i cant recommend "Practical Issues in Database Management" by Fabian Pascal enough for you!
And always remember, relational theory is one thing, and popular RDBMS is another.
Software developers and SQL developers can both usually figure out how to get what they want out of the database...
The problem is the software developer may rely on loops ( and variables, and cursors and temporary tables and dynamically generated sql, the horror) to do his thing while the SQL developer will use set theory to avoid as much of that as possible.
They both rely on what they know, it's just that one persons knowledge is better suited for programming and the other is specific to relational databases.
You are entirely right, but permit me to respectfully add a few points I think are critical.
First, the two domains of knowledge are not mutually exclusive. Almost all SQL experts/DBAs are also able to do many forms of more conventional programming, and do them properly. Similarly, many software developers know SQL reasonably well and can write good set based code when they choose too.
Second, if working in the other domain, you pay a hefty price for not approaching that domain on its own terms. Using loops and cursors in MS SQL Server is enormously less efficient than the same solution done in sets (Oracle has a slightly narrower gap, but still a gap. I suspect the same is true of all RDBMS but I can only speak to those from experience). On the flip side, if a normal database developer is shoe horning data into an RDBMS that is not relational, that will cause complications as well.
I've always thought that 5NF is a bit overkill. Didn't find one decent example that would make me say "oh, so that's why they do it".
I must say though, after the Database Systems course at DTU, when I come up with a DB design it's almost always in 4NF to start with :). Man, I hated this course.
I know so little about relational databases that I'm worried about being misled by an unreliable guide found on the internet. Reading suspiciously, this guide seems garbled.
Section 3, second and third normal forms, says "Under second and third normal forms, a non-key field must provide a fact about the key, us the whole key, and nothing but the key. In addition, the record must satisfy first normal form."
There seems to be a stray word "us". Ignoring that, this is cute word play that doesn't quite make sense. If your table has non-key fields you are inevitably providing facts about the non-key fields.
Continuing,
"We deal now only with "single-valued" facts. The fact could be a one-to-many relationship, such as the department of an employee, or a one-to-one relationship, such as the spouse of an employee."
This is the wrong way round. Usually there are lots of employees and a few deparments. Each employee works for just one department, but each department has many employees. Thus the "department of an employee" is a many-to-one relation, or function, which takes an employee and yields a department. The one-to-many relationship here is the employee list of a department.
The example for 3.2 seems to be opening the wrong can of worms. Suppose that Mr Strauss, who works for the department of waltz in Vienna, is seconded to the department of piety in Rome, in order to teach them some dance steps. Then we want his row in the database to read
(Straus, Waltz, Rome)
So one can of worms is sticking generic labels on your fields. If you label your fields (Employee, Department, Employee-location) there is no problem. If you label your fields (Employee, Department, Department-location) you have a problem, but it is obvious. If you label your fields (Employee, Department, Location) you are heading for trouble as some users of the database fill in the location of department and other users of the database fill in the location of the employee.
Hmm, second and third normal form are suspiciously similar, differing only because we regard some fields as belonging to the key. Is the article trustworthy?
I studied this summary before my most recent interview. It's a little abstract (in terms of lacking motivation/examples) for those who don't have experience in the field, but if you read it enough times, you get a decent handle on the idea.
I don't have any experience in the field, but I did not find it too abstract. Rather not abstract enough. (But on the other hand, I am a mathematician, and know about relations in theory.)
[+] [-] novum|15 years ago|reply
Something like: "You're tracking doctors, hospitals, and patients. Doctors can see multiple patients, and doctors can staff multiple hospitals". All I'm looking for you to say is that you'd put junction tables between the main entities.
It's so, so crucial that developers understand the basics of database design, and normalization is a huge part of that.
[+] [-] Groxx|15 years ago|reply
Based on that text specifically, all that's needed is that the patient and hospital each have a doctor ID column. How many would get it if you specifically pointed out that they need to consider real-world circumstances in this, and / or if you enumerated all the requirements?
Many many many times, when stating a hypothetical question, real-world handling responses are not desirable to the questioners. If the interviewee realizes this, they also realize that they risk losing points no matter how they answer: answering based on the specified requirements fails with you, but answering based on real-world uses fails with people testing against over-engineering / people religious about specific answers to specific questions. And being verbose risks both groups, based pretty much solely on the interviewer, of whom they have effectively no knowledge.
Or have you done this, and it just wasn't included in the example above?
[+] [-] tbrownaw|15 years ago|reply
[+] [-] timwiseman|15 years ago|reply
With that said, many developers (other than database developers) really do not need to understand that. They don't need to understand it because they will normally have a good DBA/database developer that handles that part for them. In fact, large projects may have many specialized DBA roles including Architects focused specifically on schema design.
[+] [-] djb_hackernews|15 years ago|reply
[+] [-] Shorel|15 years ago|reply
For a DataWarehouse, you probably want to de-normalize the DB if you expect good performance when calculating those cubes.
[+] [-] kingkilr|15 years ago|reply
[+] [-] Another1|15 years ago|reply
the relational term would be relation, please stop interviewing ppl for jobs you are not qualified
[+] [-] btilly|15 years ago|reply
The key, the whole key, and nothing but the key. So help me Codd.
(For those who don't get it, the first sentence describes first, second and third normal forms, and the second sentence names the guy who is responsible for most of the theory behind them.)
[+] [-] sovande|15 years ago|reply
[+] [-] ergo98|15 years ago|reply
[+] [-] slim|15 years ago|reply
With modern software design and especially OOP the risk of data inconsistency is less of a problem (I do web apps for entreprise and it simply never happened to me).
On the other hand duplication of information is a great way to scale an application. And incidentally offers data security since you can cross check your data in case of corruption.
[+] [-] ergo98|15 years ago|reply
This sounds mighty hand-wavy. How does "modern software design" reduce the risk of data inconsistency?
>On the other hand duplication of information is a great way to scale an application
Most relational systems have facilities for duplicating information to scale. Materialized views, for instance, generate vile offenses to all normal forms but perhaps the first, hyper optimized for consumers, but it is guaranteed coherent and consistent, and happens with barely any work.
Seriously, we've been solving these performance issues for years. Every time some, failing a better word, noob writes up their big internet paper on why the relational model fails (with "modern" software, which is chuckleworthy), the world gets just a little bit dumber.
The bit about checking data for corruption is just disturbing.
[+] [-] Another1|15 years ago|reply
And always remember, relational theory is one thing, and popular RDBMS is another.
[+] [-] maxdemarzi|15 years ago|reply
The problem is the software developer may rely on loops ( and variables, and cursors and temporary tables and dynamically generated sql, the horror) to do his thing while the SQL developer will use set theory to avoid as much of that as possible.
They both rely on what they know, it's just that one persons knowledge is better suited for programming and the other is specific to relational databases.
[+] [-] timwiseman|15 years ago|reply
First, the two domains of knowledge are not mutually exclusive. Almost all SQL experts/DBAs are also able to do many forms of more conventional programming, and do them properly. Similarly, many software developers know SQL reasonably well and can write good set based code when they choose too.
Second, if working in the other domain, you pay a hefty price for not approaching that domain on its own terms. Using loops and cursors in MS SQL Server is enormously less efficient than the same solution done in sets (Oracle has a slightly narrower gap, but still a gap. I suspect the same is true of all RDBMS but I can only speak to those from experience). On the flip side, if a normal database developer is shoe horning data into an RDBMS that is not relational, that will cause complications as well.
[+] [-] grk|15 years ago|reply
I must say though, after the Database Systems course at DTU, when I come up with a DB design it's almost always in 4NF to start with :). Man, I hated this course.
[+] [-] alan-crowe|15 years ago|reply
Section 3, second and third normal forms, says "Under second and third normal forms, a non-key field must provide a fact about the key, us the whole key, and nothing but the key. In addition, the record must satisfy first normal form."
There seems to be a stray word "us". Ignoring that, this is cute word play that doesn't quite make sense. If your table has non-key fields you are inevitably providing facts about the non-key fields.
Continuing,
"We deal now only with "single-valued" facts. The fact could be a one-to-many relationship, such as the department of an employee, or a one-to-one relationship, such as the spouse of an employee."
This is the wrong way round. Usually there are lots of employees and a few deparments. Each employee works for just one department, but each department has many employees. Thus the "department of an employee" is a many-to-one relation, or function, which takes an employee and yields a department. The one-to-many relationship here is the employee list of a department.
The example for 3.2 seems to be opening the wrong can of worms. Suppose that Mr Strauss, who works for the department of waltz in Vienna, is seconded to the department of piety in Rome, in order to teach them some dance steps. Then we want his row in the database to read
(Straus, Waltz, Rome)
So one can of worms is sticking generic labels on your fields. If you label your fields (Employee, Department, Employee-location) there is no problem. If you label your fields (Employee, Department, Department-location) you have a problem, but it is obvious. If you label your fields (Employee, Department, Location) you are heading for trouble as some users of the database fill in the location of department and other users of the database fill in the location of the employee.
Hmm, second and third normal form are suspiciously similar, differing only because we regard some fields as belonging to the key. Is the article trustworthy?
[+] [-] cmurphycode|15 years ago|reply
[+] [-] eru|15 years ago|reply
[+] [-] dennisgorelik|15 years ago|reply