"Now, for reasons I can’t explain, people just love GUIDs in database designs."
Yeah - they seem kind of silly and meaningless, so instead you use sequential integers. After all, they work nicely with object-relational frameworks like ActiveRecord, they are sequential, and they are more "human readable". And then one day you have a subset of data that you need to migrate from one environment. And when you try, you end up with some clashing records created in the target environment. No big deal, you will just increment all of your sequential integers so that there are no clashes. Oh wait, there are foreign keys - need to update those as well. Which reminds me, what is going to happen when ids in the target system catch up the the ones imported. OK we can change the sequence to cause it to start adding records after the one imported. Hey, this is really kind of complicated. There must be an easier way...
Hmmm... about those GUIDs? They completely eliminate an entire class of problems that occur when migrating data across environments (a very common scenario in some quarters).
If this type of scenario never will arise for you, then GUIDs may not make sense for your application. Hopefully this does explain why some "people just love" them.
I've never worked with big databases so bare with me if this is a really silly question.
Let's say I have database A with sequential IDs, and database B with sequential IDs. Couldn't I just translate the sequential IDs to GUIDs in both databases, merge them to database C and then translate the records from A and B to sequential IDs?
To be fair, he seemed to be referring to non-sequential GUIDs. And I can understand his point. If the only thing a GUID provides is a unique ID, it's not doing much with all that room it takes up.
My favorite part of this writeup is the revelation that the "TIMESTAMP" type in MS SQL Server is not what 99.9% of reasonable, non-insane people would consider it to be – namely, a timestamp. From the MS SQL Server documentation the author links to:
"The timestamp data type is just an incrementing number and does not preserve a date or a time. To record a date or time, use a datetime data type."
Seriously? In what way is that even remotely acceptable? How many PMs and QA engineers saw that and were like, "Cool makes sense SHIP IT."?
If MS SQL Server were a person, he'd be a colossal asshole.
I don't use foreign keys because they've bitten me in the ass too many times. I can maintain data integrity in the app level so I do it there. Adding FKs to the mix means my DB is also doing the work I already have to deal with in the app and that's use slowing things down.
FKs are a kind of business logic and I want none of that in my DB. I want my DB to give me storage, preferable ACID storage, and nothing more. I'll keep my business logic in my app.
I don't see it this way. Foreign keys in your database allow you do define relationships that must be maintained across your data set. Without them it's possible for a bug in your application code to create a logical state that makes no sense (ex: child records with no parent).
Just like data type checks (ex: regex whitelisting for field values) they need to be done at both the front end and backed and yes some times it's a pain but in the long run that's a positive not a negative. If you expose your database to a whole new app, say v2 of your app or an orthogonal admin interface, you can be assured of relational guarantees between your tables.
>I want my DB to give me storage, preferable ACID storage, and nothing more.
Yikes! You are missing the 'C' in ACID at the very least.
If what you say is true, a relational database might not be what you need to be using.
I think it is more likely that a relational database would be helpful to you but you worry about data after you worry about logic. A dangerous approach in many cases IMHO.
This is a pretty uninformed post. You want a consistent database, then add foreign keys. You'll need to ensure that you get the application layer to enforce it also, but you also get the advantage of a constraint at the database so it won't ever become inconsistent.
Keeping your "business logic" in the app as reasoning... You could just say that the business logic is built into the schema and shouldn't be there and just use a key store (many people do this, actually...); or you do all your transactions in the app. To my way of thinking, that's short-sighted.
As for slowing you down, FKs shouldn't, in terms of development (a good ORM handles it with ease), or in terms of physical performance (most databases handle performance constraints at this level via query optimisation and join strategies, at its worst you do bulk updates via a database facility).
Wow, I can't fathom how this works. Where do you store the relationship, then? Say you have posts and comments. I would want a posts table and a comments table, with a post_id foreign key in comments. How would you do it without a foreign key?
All they've done for me (in my experience) is slowed down development and saved me from nothing. The only downside I see to not using them is not being able to use FK's in indexing.
1. INT32 is fairly natural and generally a default goto in most higher level languages that, you know, actually USE the data... anything smaller, or larger than INT32 should be discouraged unless absolutely necessary... Yes, storage should be considered, but you also need to consider the friction such design can cause with different client languages. Beyond this, you can go with Unsigned values, but again, this can cause a lot of friction.
2. This is precisely why I don't like my RDBMS talking directly to each other. In fact, I like my RDBMS to be as stupid as possible. As mentioned above, the most common structure to your programming environment is likely what you should use... for time insensitive date/time I always store/read in UTC offset, same goes across boundaries. For events, I use the local time (no offset specified) and/or with the location or offset. It depends on the use.
3. Most databases that support UUID/GUID fields support a generator that will create an index friendly sequential UUID, where the main portion is sequential, with most of the value still random.
UUIDs are essential for many offline access/creation, synchronization and sharding systems. The biggest mistake a DBA tends to make is to assume that a single server will run everything well enough for everyone that uses a given application.
4. I tend to always have primary and foreign key constraints.. additional indexes should only be considered once you have typical use cases in place. After you have done the necessary profiling to determine which indexes will be the most benefit.
I would state that not understanding how indexes work in your given RDBMS can be a big issue. I think MSSQL tends to be one of the most forgiving in this area, and Oracle least so.
5. (see my response to number 4)
The arguments for 1, 2, 3 are argumentative at best.. there are perfectly good use cases for using a datatype larger than necessary (reducing friction in code is huge), not to mention UUIDs being absolutely essential in many use cases. MSSQL with create a UUID field, and a unique constraint when you enable replication if a table doesn't already have it for example, so it may as well be a primary key, if it makes sense.
As to 4-5, this comes down to understanding index strategies, which is a separate learning issue imho.
> If you know that the only possible values for a certain column are between 0 and 100,000 then you don’t need to slap a BIGINT data type for that column when a INT would do just fine.
What conceivable column could have 100K possible values but not more than 4B? Factoring in gaps in sequence numbers, wasted ids for testing, parallel generation of ids, etc, it's not that hard to crack the 32-bit boundary.
If you're dealing with something you know is of a fixed small size, for example surrogate keys for a constants table that is manually created by you (the app designer) then you can use an int (32-bit), short (16-bit), or even an single byte (8-bit). For anything else though just use 64-bit ids.
> what about those NCHAR(2000) columns that are storing mostly first and last names? How much extra overhead for those?
This makes no sense. Using VARCHAR fields the size is just a max. You should still pick a reasonable max but just because its defined as VARCHAR(256) doesn't mean it's stored as 256 bytes.
CHAR/NCHAR are fixed length, VARCHAR/NVARCHAR are dynamic length... Usually the biggest restrictions to varchar/nvarchar are indexing limits.. generally my indexed fields for n/varchar will be 100 (email, name, etc) non-indexed, xml hashes, json will be NVARCHAR(MAX)/NTEXT when they aren't indexed.
You should have started with the NCHAR(2000) example. Choosing INT over BIGINT is really something that, as you calculated yourself, wouldn't change the fate of a project. Especially not in our 64 bit world ;)
The point you make about timestamps is very clear to everybody. Do you have more examples which people don't think about?
Point 5 makes no sense. The whole point in using a surrogate key is that a natural key is almost always flawed. There are no good reasons for the advise he gives for point 5, in fact it is a mistake to do what he says!
I think his point could have been made clearer, but he's really just saying that there will often be a mutable natural key that is captured in the modelled data, and that it makes sense to enforced constraints over that key if you know that the constraints exist.
> Multiply those rows by 4 bytes and you have 8 million bytes, or roughly 7.8MB of wasted space.
I have a hard time imagining any installation where this isn't a ludicrously small amount of space, particular in relation to the total data size.
As for using too big a type, another factor is to consider is "how easy is it to expand this later?" If it's easy, sure use a small type. If it's hard, err on the side of too large.
> Now, for reasons I can’t explain, people just love GUIDs in database designs.
There are some pretty good (or at least understandable) reasons for liking GUIDs for keys. Firstly you need to understand how the database generates IDs.
The first camp is typified by MySQL with it's auto-incrementing keys. This is super-convenient in many cases and generates ordered and mostly sequential keys but it comes at a cost in that you typically have to round-trip to the DB and insert something to get a key. This generates spurious round trips if you need this key as a foreign key in other entities.
The other school is sequences typified by Oracle. A sequence isn't otherwise tied to a particular table. You still need a round trip to the DB to get an ID but it doesn't require you to insert anything. Sequences are in-memory and generally fast (round trip notwithstanding). Sequences are mostly ordered and partially sequential depending on your setup (this is less true in a clustered environment).
So, GUIDs. They're 128 bit. Collisions are unlikely. OSs tend to come with decent GUID generators (so you're not just using a 32 bit PRNG to generate your supposedly 128 bit keyspace items). They require no DB round trip and they have one other property that can be important depending on your setup and your database: scattering.
If you have a partitioned database then you want to evenly spread your data as much as possible. There are many techniques for this that might or might not be data dependent (eg first character of email address has poor scattering properties but a good hash of an email address will probably be spread much more evenly). GUIDs neatly spread elements in your keyspace for very little effort.
> If you are using a surrogate key (and, ideally, one that is sequential)
I don't see sequential or ordered being strictly better properties for keys.
> You still need a round trip to the DB to get an ID
No you don't. Not sure about MySQL but Oracle supports returning generated keys as part of an insertion[1]. An additional round trip is not required as it comes back in the "success" response of the insert itself.
Okay so what can I say ... from my experience I disagree with his conclusions for 3, 4 and 5.
3. If you are building any kind of table that you expect to horizontally partition, do NOT use sequential things like autoincrementing ids! The mutex lock will be your bottleneck. Instead, use this algorithm to generate ids:
10: id = GENERATE_UNIQUE_STRING(20)
20: IF ( SELECT_ROW_BY_ID(id) ) GOTO 10
30: INSERT_ROW_WITH_ID(id)
(yeah, it's written in BASIC just for fun)
20 lowercase alphanumeric characters gives you 26^20 combinations. And the chance of a clash becomes around 50% only after you have filled 26^10 ids (birthday "paradox").
If you really have to have a sequentially incrementing "ordinal" field, e.g. for messages posted to a stream in a specific order, then use transactions and lock only one row, in the related table, which will contain the max value of the field. For example in MySQL use InnoDB which implements row-level locking, and have stream.message_count field, which you use when you INSERT into the message table.
4. Again, when you are sharding you probably don't want to have foreign key constraints defined in your database layer. If this guy is shocked he must have not really done sharding before.
5. While you can use surrogate keys, that limits your horizontal partition to the way you generate these keys. Oftentimes you will want to look things up in your shards by a certain RANGE, and for that you want to preserve the ordering in your primary key (up to normalization).
[+] [-] EzGraphs|12 years ago|reply
Yeah - they seem kind of silly and meaningless, so instead you use sequential integers. After all, they work nicely with object-relational frameworks like ActiveRecord, they are sequential, and they are more "human readable". And then one day you have a subset of data that you need to migrate from one environment. And when you try, you end up with some clashing records created in the target environment. No big deal, you will just increment all of your sequential integers so that there are no clashes. Oh wait, there are foreign keys - need to update those as well. Which reminds me, what is going to happen when ids in the target system catch up the the ones imported. OK we can change the sequence to cause it to start adding records after the one imported. Hey, this is really kind of complicated. There must be an easier way...
Hmmm... about those GUIDs? They completely eliminate an entire class of problems that occur when migrating data across environments (a very common scenario in some quarters).
If this type of scenario never will arise for you, then GUIDs may not make sense for your application. Hopefully this does explain why some "people just love" them.
[+] [-] SomeRandomUser|12 years ago|reply
Let's say I have database A with sequential IDs, and database B with sequential IDs. Couldn't I just translate the sequential IDs to GUIDs in both databases, merge them to database C and then translate the records from A and B to sequential IDs?
Or modyfying IDs is a big no-no in this scenario?
[+] [-] jasonlotito|12 years ago|reply
[+] [-] venutip|12 years ago|reply
"The timestamp data type is just an incrementing number and does not preserve a date or a time. To record a date or time, use a datetime data type."
Seriously? In what way is that even remotely acceptable? How many PMs and QA engineers saw that and were like, "Cool makes sense SHIP IT."?
If MS SQL Server were a person, he'd be a colossal asshole.
[+] [-] ecopoesis|12 years ago|reply
FKs are a kind of business logic and I want none of that in my DB. I want my DB to give me storage, preferable ACID storage, and nothing more. I'll keep my business logic in my app.
[+] [-] sehrope|12 years ago|reply
Just like data type checks (ex: regex whitelisting for field values) they need to be done at both the front end and backed and yes some times it's a pain but in the long run that's a positive not a negative. If you expose your database to a whole new app, say v2 of your app or an orthogonal admin interface, you can be assured of relational guarantees between your tables.
[+] [-] clavalle|12 years ago|reply
Yikes! You are missing the 'C' in ACID at the very least.
If what you say is true, a relational database might not be what you need to be using.
I think it is more likely that a relational database would be helpful to you but you worry about data after you worry about logic. A dangerous approach in many cases IMHO.
[+] [-] chris_wot|12 years ago|reply
Keeping your "business logic" in the app as reasoning... You could just say that the business logic is built into the schema and shouldn't be there and just use a key store (many people do this, actually...); or you do all your transactions in the app. To my way of thinking, that's short-sighted.
As for slowing you down, FKs shouldn't, in terms of development (a good ORM handles it with ease), or in terms of physical performance (most databases handle performance constraints at this level via query optimisation and join strategies, at its worst you do bulk updates via a database facility).
[+] [-] losvedir|12 years ago|reply
[+] [-] skylan_q|12 years ago|reply
[+] [-] tracker1|12 years ago|reply
2. This is precisely why I don't like my RDBMS talking directly to each other. In fact, I like my RDBMS to be as stupid as possible. As mentioned above, the most common structure to your programming environment is likely what you should use... for time insensitive date/time I always store/read in UTC offset, same goes across boundaries. For events, I use the local time (no offset specified) and/or with the location or offset. It depends on the use.
3. Most databases that support UUID/GUID fields support a generator that will create an index friendly sequential UUID, where the main portion is sequential, with most of the value still random.
UUIDs are essential for many offline access/creation, synchronization and sharding systems. The biggest mistake a DBA tends to make is to assume that a single server will run everything well enough for everyone that uses a given application.
4. I tend to always have primary and foreign key constraints.. additional indexes should only be considered once you have typical use cases in place. After you have done the necessary profiling to determine which indexes will be the most benefit.
I would state that not understanding how indexes work in your given RDBMS can be a big issue. I think MSSQL tends to be one of the most forgiving in this area, and Oracle least so.
5. (see my response to number 4)
The arguments for 1, 2, 3 are argumentative at best.. there are perfectly good use cases for using a datatype larger than necessary (reducing friction in code is huge), not to mention UUIDs being absolutely essential in many use cases. MSSQL with create a UUID field, and a unique constraint when you enable replication if a table doesn't already have it for example, so it may as well be a primary key, if it makes sense.
As to 4-5, this comes down to understanding index strategies, which is a separate learning issue imho.
[+] [-] jrarredondo|12 years ago|reply
[+] [-] jasonlotito|12 years ago|reply
"Bad programmers worry about the code. Good programmers worry about data structures and their relationships."
[+] [-] sehrope|12 years ago|reply
What conceivable column could have 100K possible values but not more than 4B? Factoring in gaps in sequence numbers, wasted ids for testing, parallel generation of ids, etc, it's not that hard to crack the 32-bit boundary.
If you're dealing with something you know is of a fixed small size, for example surrogate keys for a constants table that is manually created by you (the app designer) then you can use an int (32-bit), short (16-bit), or even an single byte (8-bit). For anything else though just use 64-bit ids.
> what about those NCHAR(2000) columns that are storing mostly first and last names? How much extra overhead for those?
This makes no sense. Using VARCHAR fields the size is just a max. You should still pick a reasonable max but just because its defined as VARCHAR(256) doesn't mean it's stored as 256 bytes.
[+] [-] tracker1|12 years ago|reply
[+] [-] dialtone|12 years ago|reply
http://www.depesz.com/2010/03/02/charx-vs-varcharx-vs-varcha...
And 7.8 MB really change practically nothing in the overall equation. 1 is a good point but badly argumented.
[+] [-] akaiser|12 years ago|reply
The point you make about timestamps is very clear to everybody. Do you have more examples which people don't think about?
[+] [-] emilyst|12 years ago|reply
Wow, what the fuck? Was there any reason for that?
[+] [-] unknown|12 years ago|reply
[deleted]
[+] [-] chris_wot|12 years ago|reply
[+] [-] siddboots|12 years ago|reply
[+] [-] cletus|12 years ago|reply
http://stackoverflow.com/questions/621884/database-developme...
To add some points:
> Multiply those rows by 4 bytes and you have 8 million bytes, or roughly 7.8MB of wasted space.
I have a hard time imagining any installation where this isn't a ludicrously small amount of space, particular in relation to the total data size.
As for using too big a type, another factor is to consider is "how easy is it to expand this later?" If it's easy, sure use a small type. If it's hard, err on the side of too large.
> Now, for reasons I can’t explain, people just love GUIDs in database designs.
There are some pretty good (or at least understandable) reasons for liking GUIDs for keys. Firstly you need to understand how the database generates IDs.
The first camp is typified by MySQL with it's auto-incrementing keys. This is super-convenient in many cases and generates ordered and mostly sequential keys but it comes at a cost in that you typically have to round-trip to the DB and insert something to get a key. This generates spurious round trips if you need this key as a foreign key in other entities.
The other school is sequences typified by Oracle. A sequence isn't otherwise tied to a particular table. You still need a round trip to the DB to get an ID but it doesn't require you to insert anything. Sequences are in-memory and generally fast (round trip notwithstanding). Sequences are mostly ordered and partially sequential depending on your setup (this is less true in a clustered environment).
So, GUIDs. They're 128 bit. Collisions are unlikely. OSs tend to come with decent GUID generators (so you're not just using a 32 bit PRNG to generate your supposedly 128 bit keyspace items). They require no DB round trip and they have one other property that can be important depending on your setup and your database: scattering.
If you have a partitioned database then you want to evenly spread your data as much as possible. There are many techniques for this that might or might not be data dependent (eg first character of email address has poor scattering properties but a good hash of an email address will probably be spread much more evenly). GUIDs neatly spread elements in your keyspace for very little effort.
> If you are using a surrogate key (and, ideally, one that is sequential)
I don't see sequential or ordered being strictly better properties for keys.
[+] [-] sehrope|12 years ago|reply
No you don't. Not sure about MySQL but Oracle supports returning generated keys as part of an insertion[1]. An additional round trip is not required as it comes back in the "success" response of the insert itself.
[1]: http://info.michael-simons.eu/2007/10/09/jdbc-get-autogenera...
[+] [-] EGreg|12 years ago|reply
3. If you are building any kind of table that you expect to horizontally partition, do NOT use sequential things like autoincrementing ids! The mutex lock will be your bottleneck. Instead, use this algorithm to generate ids:
20 lowercase alphanumeric characters gives you 26^20 combinations. And the chance of a clash becomes around 50% only after you have filled 26^10 ids (birthday "paradox").If you really have to have a sequentially incrementing "ordinal" field, e.g. for messages posted to a stream in a specific order, then use transactions and lock only one row, in the related table, which will contain the max value of the field. For example in MySQL use InnoDB which implements row-level locking, and have stream.message_count field, which you use when you INSERT into the message table.
4. Again, when you are sharding you probably don't want to have foreign key constraints defined in your database layer. If this guy is shocked he must have not really done sharding before.
5. While you can use surrogate keys, that limits your horizontal partition to the way you generate these keys. Oftentimes you will want to look things up in your shards by a certain RANGE, and for that you want to preserve the ordering in your primary key (up to normalization).
So tehre you have it, it's not that simple :)