After a decade of building large systems on RDBMSs, we now use 64-bit integers for all primary keys, with a global Hi/Lo ID generation scheme (the app reserves a range of numbers on startup to assign to records automatically).
This gives plenty of ID space, maintains rough numeric ordering, allows ID creation without a round trip for every insert, is easily portable across databases, and produces unique IDs for every row in the database, which greatly simplifies everything from caching to replication.
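The Hi/Lo idea can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation; the class and names are invented for the example, and a plain counter stands in for the shared database sequence:

```python
import itertools
import threading

class HiLoAllocator:
    """Minimal Hi/Lo sketch: each app instance reserves a block of
    `block_size` IDs from a shared counter (the "hi" part) and hands
    them out locally (the "lo" part) with no further round trips."""

    def __init__(self, reserve_block, block_size=1000):
        self._reserve_block = reserve_block  # round trip to the shared counter
        self._block_size = block_size
        self._next = 0
        self._limit = 0  # exclusive upper bound of the current block
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            if self._next >= self._limit:           # block exhausted:
                hi = self._reserve_block()          # one round trip...
                self._next = hi * self._block_size  # ...buys block_size IDs
                self._limit = self._next + self._block_size
            self._next += 1
            return self._next - 1

# A database sequence would normally play this role.
hi_counter = itertools.count(1)
alloc = HiLoAllocator(lambda: next(hi_counter), block_size=1000)
ids = [alloc.next_id() for _ in range(3)]  # [1000, 1001, 1002]
```

Multiple app instances sharing the same counter get disjoint blocks, which is where the "unique IDs for every row" property comes from.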
Can you elaborate a bit? Are you using an ORM, e.g. Hibernate? If so, have you considered other strategies like pooled-lo, IDENTITY, or recursive CTEs that offer the same benefits but make the sequence values match the database values and reduce "id loss"?
Literally the first column of almost every SQL table I have written has been an auto-incrementing column called id.
"For instance, a database of hobbyist club members could include uniqueness on the two columns first_name, last_name of members. Duplicates are likely unintentional at this scale, and if necessary the constraint can be dropped. Until an actual conflict happens the key is a reasonable precaution."
Absolutely do not do this.
People have names that are duplicates. A situation where someone is unable to join a club because their name clashes with an existing member is not OK.
Expecting a club administrator to be able to drop a uniqueness key from their database in order to resolve this situation is not a reasonable solution!
That's not the only point of the article I really disagree with:
> Here are some values that often work as natural keys:
> login names
This makes user names static, which would annoy people if they change their name (e.g. marriage, gender change, naturalization (a Chinese name vs an English name)) or simply want to update their online handle.
> email addresses
Same problem as above. What happens if someone wants to change email addresses (e.g. they used a work address and changed jobs, got a new address to combat spam on the old one, or got a new address to reflect a name change (see above))?
> mac address on a network
This makes more sense than the above, but MAC addresses can be spoofed or simply used by a proxy, so you would need to be careful that the information you're storing against the MAC address is actually intended to be unique. Most use cases that immediately spring to my mind wouldn't follow his uniqueness rule, but I'm sure there are some examples that do.
> (lat,lon) for points on the earth
That might work if you also included elevation, but office and apartment blocks (for example) will often have different tenants at the same latitude and longitude coordinates.
> Some people seem to struggle with the choice of “natural” key attributes because they hypothesize situations where a particular key might not be unique in some given population. This misses the point.
Actually, no, it doesn't. That's entirely the point: it demonstrates good planning and future-proofing, rather than setting arbitrary business rules that are hard to rectify if and when you do run into those edge cases.
For extra fun, there was talk on Twitter today about several people who have to file Japanese tax returns but whose names are originally English. There are many ways to transliterate a name, and different systems actually force you to convert it in different ways, with various limitations.
This results in the system believing they are two people, one of whom hasn't filed returns, with no easy way to fix it (apparently the easiest fix may be to create a company?).
This is despite some kind of unique ID being submitted in both systems; that ID is just not currently reconciled.
I'm not 100% sure on the details myself, but I guess the moral of that story is that duplicate checks on the name itself may not even be a guarantee of uniqueness - even assuming there was only one person with that name!
More information in the Twitter threads: https://twitter.com/patio11/status/947821990810869762 https://twitter.com/marcan42/status/947862901725024257
Completely agree. Adding a proper primary key, e.g. from a sequence, involves very little extra overhead and protects against this. Asking them to drop the constraint in the future sounds a bit silly, as that may mean two members get deleted when you intended only one. Plus your foreign keys now have to be maintained: if a member changes their name, you have to update all foreign keys. And if the hobbyist club later builds a website to view members, the URLs can use the primary key, e.g. a long/bigint, rather than names, which avoids further complexity such as having to escape names in URLs. I am all for keeping it simple, but every relation should have a guaranteed unique primary key.
I know the author is talking about a small scale project here, but shortcuts like this (in my experience) just lead to a mess later on; irrespective of the complexity of the problem being solved.
> Expecting a club administrator to be able to drop a uniqueness key from their database in order to resolve this situation is not a reasonable solution!
Especially since, if you have any FK relationships and that was the only candidate key for the members relation (and why have such a bad key if it's not the only one?), you then need to add new attributes to create a new key, populate them for all existing rows, and change all existing FK relationships and associated queries - on top of the easy part, which is dropping the old unique constraint.
I agree with you, but I think you overthought his metaphorical example a little.
I think the point was that for a 40-person club, the odds are very, very low that two people would have the same name. And even if they did, a club manager could still differentiate them by adding a middle name or a nickname (in a 40-person scenario).
Of course, if your neighborhood club expands and you need to manage a lot of people, you'll have to switch to a better technique. His point was that you must fit the solution to your business case. (But hey, some people change their name when they marry, so problems can happen fast - though that's a mutability issue, not a uniqueness one.)
Overall a very good article, but keeping a critical mind as you did is needed.
Ah, this brings me back to the time when I was trying to convince a product manager that it was a bad idea to "validate" email addresses with regular expressions. I failed, and the product was rolled out with the following regex: ^[a-z0-9.-]+@[a-z0-9.-]+\.[a-z]{2,4}$
I quit shortly thereafter.
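For anyone curious why that regex is a problem, here is a quick demonstration of perfectly valid addresses it rejects (the pattern is copied verbatim from the comment above; the sample addresses are made up):

```python
import re

# The regex quoted above, verbatim.
EMAIL_RE = re.compile(r"^[a-z0-9.-]+@[a-z0-9.-]+\.[a-z]{2,4}$")

valid_but_rejected = [
    "user+tag@example.com",  # '+' is legal in the local part
    "User@example.com",      # uppercase letters are legal
    "user@example.museum",   # TLDs longer than 4 letters exist
    "o'brien@example.com",   # apostrophes are legal in the local part
]

for addr in valid_but_rejected:
    assert EMAIL_RE.match(addr) is None  # real addresses, all rejected

assert EMAIL_RE.match("user@example.com")  # only the plain case passes
```

Every one of those rejections is a real user who can't sign up, which is why "send a confirmation email" beats clever regexes.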
The bit about duplicates likely being "unintentional at this scale" is an important part of the use context too though. There are many cases where allowing duplicate names increases errors and confusion for users -- not only because of unintentional duplication, but because users themselves often think of names as unique.
To me, surrogate keys are just about always preferable over natural keys, but it's also important to think about the "naturalness" of natural keys from a user's point of view. In some contexts, using a surrogate key while also enforcing unique names (with the rare possibility that an admin has to go in and do something weird, like add a genuinely identical name with a "2" after it) can be a better trade-off.
At what scale does all this stuff start to actually matter? I have a database with ~100 tables and ~500M rows driving a medium-traffic web app and various back-end systems. We use auto-incrementing integers as primary keys and try not to expose them externally. Indexes are added as necessary to enable specific queries. We don't enforce any other constraints (e.g. not-null or foreign keys) at the database level.
... and it all works and performs just fine? The considerations the author mentions all make some sense to me in theory, but when do they actually matter in practice in a modern system?
You might not have any observable effects until the day you do, at which point you may be faced with an arbitrarily large problem.
What you are doing is passing up on the opportunity of catching various errors (you are also passing up the opportunity for some optimizations, but that is probably a secondary issue.) In particular, you are passing up on some opportunities to catch inconsistencies in how different applications (or different parts of the same application) create and use data.
One argument made against putting these sorts of rules into effect is that they constrain what application developers can do, but that is the wrong way to look at it: any such conflict is an indication of a misunderstanding (not necessarily on the part of the application developers) that has been caught before it could lead to bigger problems, such as a database full of irretrievably inconsistent or incomplete data.
When the problems finally do arise, it is often the case that some sort of workaround is the only practical solution. This, in my experience, is one of the common ways by which systems accrue gratuitous complexity, which in turn has at least two real-world consequences: an increased time to make changes, upgrades and extensions, and an increased frequency of errors, especially WTF-type errors (and probably also efficiency/performance hits.)
I disagree that UUIDs are generally preferable over integers. For one, they take up more space (on disk and in memory). And for something like a key, it is likely that there will be multiple copies of that value stored, since it will exist in the table itself, at least one index (possibly more) and foreign keys. More space means fewer records per page on disk, more I/O and more memory usage (potentially leading to more I/O). I would wager that for most users (including this case), the cost of this additional I/O is far greater than some theoretical scalability limitation on generating IDs.
Most database vendors make sequence generation (whether through explicit SEQUENCE objects or auto-incrementing columns) performant by making a few compromises:
* Numbers may not always be sequential (you might get 1, 3, 2 - in that order)
* There may be gaps (you might get 1, 2, 5, 6)
Since these are supposed to be opaque identifiers, neither compromise should be a concern for most users. In exchange, the sequences can live outside of a transaction (you might grab an ID, roll back, and that ID is gone), and systems with multiple nodes can be allocated a "block" of numbers from which they quickly pull new values, without needing to coordinate with a master node.
As you can see leaking through in a few of the examples, the scenarios where it matters are the ones where non-experts are reading from and writing to the database. If a data entry clerk is just typing values into a form thinly wrapping an INSERT, you'd better make sure any input that would break implicit assumptions is rejected. If you have a legion of Bobs from marketing who "know some SQL", you can't hand-optimize every single query they try to run.
When the database is just serving a webapp you control end to end, the only part that'll really have a huge impact is making sure you can partition by the primary key effectively. Which is good, because complex indexing schemes and foreign key constraints actually scale very badly.
I would certainly look to add not-null and uniqueness constraints, although you might find that by now your data is actually violating some of those constraints!
A lot of people take the view that the application code is sufficient proof against bad data getting into your database, but this ignores (a) bugs and (b) back-door data loading, which in most applications will almost certainly be happening at some point.
The data in your database is the crown jewels (the application that fronts it is by comparison a piece of crap that you can replace at any time), so put guards on that gate ...
No constraints will often be faster, but your data will not be 100% consistent.
Over time you _will_ have child records pointing to nonexistent parents, and the same with other foreign key relationships. These often don't really hurt anything - until they do. Since most testing is done with clean data, most issues caused by bad data show up in production and are reported by end users.
I've worked successfully on systems without any real constraints, and it is usually much better to start with constraints and remove them later as needed. Adding them later is a pain, as you first have to clean up the data and fix the code issues that created the bad data in the first place.
I've seen auto_increment fail to scale on MySQL. Large, long-running insert statements (e.g. using INSERT INTO ... SELECT FROM, or LOAD DATA INFILE) can lock the table resource and block concurrent inserts. Pretty easy to work around: have a table containing the next available key for every table, and use a separate small transaction to bump it before bulk inserts.
Another reason not to use auto-inc is if you need to insert lots of data into several separate tables that have foreign key relationships. If you know what you're inserting up front, and you need to insert millions of rows quickly, you're better off allocating the primary keys ahead of time and supplying them directly with the right keys for the relationship at insert time.
Separately, another argument against compound keys: if you're in a situation where you're trying to optimize joins, sorts and filters over moderately large tables, you want to minimize the number of columns you touch. Every column measurably increases query time; more data to shuffle, sort, compare, bigger indexes, etc. You won't see this if you're doing simple key / row lookups, but you will see it if you're presenting user-interactive slices over million+ row data sets.
You can get some real headaches with auto-incrementing integers if you have to go to a multi-master or some other distributed system for data inserts.
How do you know which integer to insert next for a table that is replicated across systems? One system could do even numbers and one could do odd. One could do every other 100 integers. Or you could use UUIDs and not worry about it.
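The even/odd scheme generalizes to n nodes: node i issues i, i+n, i+2n, and so on. A toy sketch (names and structure invented for illustration):

```python
def node_ids(node, n_nodes, start=0):
    """Yield IDs for one node: node, node+n_nodes, node+2*n_nodes, ...
    Two nodes (n_nodes=2) give the even/odd split described above."""
    i = start * n_nodes + node
    while True:
        yield i
        i += n_nodes

evens = node_ids(0, 2)
odds = node_ids(1, 2)
# Each node can insert without coordinating; the ID sets never collide.
a = [next(evens) for _ in range(4)]  # [0, 2, 4, 6]
b = [next(odds) for _ in range(4)]   # [1, 3, 5, 7]
```

The catch is that the node count is baked in, so adding a node later means reshuffling the scheme - which is part of the appeal of UUIDs.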
My argument is that auto-incrementing integers don't scale indefinitely. They can scale to a certain extent, and then you have to switch key systems.
They don't necessarily matter until you hit Twitter-size for performance purposes, but take care to pay close attention to the idea of natural vs. artificial keys, and heed his warning about turning artificial keys into natural keys.
One day you'll be integrating database systems together and you'll be glad you followed his advice. I've seen some really ugly bizdata schemas.
My background is in financial applications, where an inconsistent state can literally be a multimillion dollar problem, so maybe my perspective is skewed a little. But FKs and other constraints are simply used to outright enforce consistency, in my experience at least. If you rely on your custom business logic to maintain consistency, then you leave yourself open to the possibility of breaking that in some way, and if you do, who says you'll notice right away? What if you don't notice for 6 months, and then realise that you have a billion transactions that have some form of integrity compromise?
Complexity obviously makes this problem more serious, but something like this can easily occur at any scale of complexity or transaction volume. If you try to commit some code that breaks the integrity of an important relationship in some way, or even if you just want to run an arbitrary statement directly against the database, then you want the DB to throw an error.
The most important property of surrogates, which I never see mentioned, is that two tuples, anywhere in the database, have the same surrogate key if and only if they are believed to refer to the same real-world entity. This is a crucial difference with respect to auto-incrementing or randomly generated values that are independent in different relations (which is a common practice). Among other things, with surrogates defined as above you may freely join relations on surrogate keys, sure that you are joining related data.
In his landmark 1979 paper, Codd defined surrogates pretty clearly, and that part of his work was based on an earlier paper entitled On Entities and Relations (I don't recall the authors right now), so this is not exactly news.
Unfortunately, current DBMSs provide little to no support for properly implemented surrogates.
I would argue that the support is there. You can use either random UUIDs or a global shared sequence for this. Nothing requires you to have a sequence per table other than some syntactic sugar in the SQL standard. PostgreSQL also has its own OIDs, but I think that is more of a legacy from before sequences were implemented.
> Unfortunately, current DBMSs provide little to no support for properly implemented surrogates.
UUIDs instead of per-table autoincrement keys seem to be a solution which covers the part of the problem that involves DB support as opposed to data model design, and several DBs have adequate support for UUIDs.
On Postgres I prefer to use a single sequence to generate ids across all tables. This reduces the chance of accidents (eg accidentally deleting the wrong thing with id #123) and reduces information leakage ("oh, I see I'm customer #5, you must only have 4 other customers").
It's an interesting article with interesting ideas.
I'm squarely in the camp of using natural PKs until there is good reason to use surrogate PKs.
I generally disagree with the notion of using a combination of surrogate and natural keys. In SQL, a PK isn't just another unique key: a PK is an important piece of communication.
As a rule of thumb, when a PK is attached to a semantic value, it is saying that this is the identifier of the table. If a PK is on a surrogate key, it is saying that there is no good unique identifying value in the table. When you are dealing with larger data sets, this distinction isn't minor, as it helps you understand the intent of the data when working with it. PKs serve as guideposts in your design, and as guideposts to the person who has to maintain (or fix) your database later on.
I know some disagree with me on that, but there are many undeniably good reasons to use a natural key. A good place is a check-constraint table, where you, say, have a list of US states and want to ensure that "New York" and not "New Yoerk" gets inserted into the state column of an address table. Put a PK on valid_state_names and FK to it from addresses.
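A sketch of that pattern, using SQLite here purely as a runnable illustration (the table and column names are made up; note SQLite needs foreign keys switched on explicitly):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite needs this opt-in
con.execute("CREATE TABLE valid_state_names (name TEXT PRIMARY KEY)")
con.execute("""CREATE TABLE addresses (
    street TEXT,
    state  TEXT NOT NULL REFERENCES valid_state_names(name))""")
con.execute("INSERT INTO valid_state_names VALUES ('New York')")

con.execute("INSERT INTO addresses VALUES ('1 Main St', 'New York')")  # fine

try:
    con.execute("INSERT INTO addresses VALUES ('2 Main St', 'New Yoerk')")
    typo_allowed = True
except sqlite3.IntegrityError:  # FK violation: typo never reaches the table
    typo_allowed = False
assert typo_allowed is False
```

The natural-key FK rejects the misspelling at the database boundary, with no application code involved.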
> One thing to avoid in publicly exposed numbers is sequential order. It allows people to probe for resources (/videos/1.mpeg, /videos/2.mpeg etc) and also leaks cardinality information. Add a Feistel cipher on top of a sequence. This hides the ordering while maintaining uniqueness.
> The PostgreSQL pseudo encrypt wiki gives an example cipher function:
> [...]
I get that they want to leave the key logic in the schema, but practically this kind of thing feels worth pulling out into the storage engine. Just generate a random string as the ID upon row insertion and enforce uniqueness on that column. If insertion fails, generate another random string. (And it should pretty much never fail, because otherwise you're still susceptible to the enumeration attack you're trying to prevent.)
Although, given that Postgres can read from a file, could the above be done in the schema by reading from /dev/urandom? If so, that seems like the better approach.
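For readers who haven't seen a Feistel network: any function of this shape is a bijection regardless of the round function, which is what makes it safe to layer over a sequence. A toy Python analogue of the wiki's pseudo_encrypt (the round constants here are arbitrary, chosen only for illustration):

```python
def feistel32(n, rounds=3):
    """Map a 32-bit integer to another 32-bit integer, bijectively.
    Split into two 16-bit halves and mix; because each round only XORs
    one half with a function of the other, every round is invertible,
    so distinct inputs always give distinct outputs."""
    l, r = (n >> 16) & 0xFFFF, n & 0xFFFF
    for _ in range(rounds):
        # arbitrary round function; any F keeps the network invertible
        l, r = r, l ^ (((r * 0x9E3B) + 0x1234) & 0xFFFF)
    return (l << 16) | r

# Sequential IDs come out scattered but remain unique:
public = [feistel32(i) for i in range(5)]
assert len(set(public)) == 5
```

This hides both the ordering and the cardinality of a sequence while preserving uniqueness, which is exactly the property the article wants for publicly exposed numbers.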
>Modern SQL is supposed to abstract from the physical representation. Tables model relations, and should not expose an implicit order in their rows. However even today SQL Server creates a clustered index by default for primary keys, physically ordering rows in the old tradition.
Which is why the UUID (or GUID in SQL Server speak) can have other drawbacks there in comparison to auto incrementing bigints, namely delays due to the data being ordered randomly on disk. There are obviously ways to counter that like adding extra non-clustered indexes or changing the clustered index to something besides the primary key, but at that point the extra time and overhead might defeat the benefits you gained from going with UUIDs in the first place.
This is well done. And yet many people get this stuff wrong. As a language person, I think about how we could make these choices more natural by rewriting SQL; make them the path of least resistance rather than requiring much pondering and wisdom.
Let's be honest: If you could make these choices more natural then you would have done so by now. If you could make SQL better than it is, you already would have done that as well. In fact, if most people who are interested in SQL could make SQL better than SQL is, we wouldn't be using SQL anymore.
We're not using it because we have some weird tradition we enforce. We're not using it because we like the way it looks, the way it writes, or the way it handles. We're not using it because we've tried to use ORMs and those ORMs have proved conclusively better across use cases than SQL.
The reason we use it is because we don't yet have a better solution. If you can write a better solution than SQL I would love to hear about it: I absolutely loathe SQL, but I use it because out of everything I've seen, it's the simplest and most elegant solution to the problem that I've ever come across.
So yeah, while we could possibly and should definitely make SQL better if we can, I don't think that suggesting that we just re-do it from scratch is the way to go. That way lies madness, and possibly javascript frameworks.
1. Rename the constraints to mimic the classification of keys described here. E.g. require UUIDs for artificial keys.
2. Make joins on foreign artificial keys much lighter syntactically than free-form joins on natural keys. Make joins involving non-unique columns in the WITH (today's syntax) harder still. People should be able to think of foreign keys as references to be dereferenced, separate from full joins.
3. Going further, break apart "tables" to represent the different things people use them for, e.g. key-value maps vs "join tables" (a term I had never heard before but could immediately define from experience) vs canonical data. In math terms, distinguish sets of aggregate data from relations. Having full algebraic data types for the sets of aggregate data, but not for relations, could help. (Maps are a special case of relations.) A final related distinction is tables-as-types vs tables-as-collections. I suppose that relates (no pun :)) back to which joins are allowed.
4. Keep the references pointing in the right direction. For example, à la Rust and capability theory, the owner should point to its uniquely owned children. But in SQL today we instead invert the back-reference, having the children refer to their unique parent. Supporting lists (of children) and unaliased foreign keys directly would make the canonical back-reference an implementation detail.
"There’s no need to manually create indexes on columns already declared unique; doing so would just duplicate the automatically-created index."
Is the last part real? Makes me feel like forking PostgreSQL just to save the world from accidental duplicate expensive indexes.
Programmers who are intrigued by the recommended pg_hashids extension have a wide variety of alternatives for the same key-obfuscation logic outside of a pg extension: http://hashids.org/
[+] [-] manigandham|8 years ago|reply
This means plenty of ID space, maintains rough numeric ordering, allows ID creation without a roundtrip for every insert, is easily portable across different databases, and produces unique IDs for every row in the database which greatly simplifies everything from caching to replication.
[+] [-] walshemj|8 years ago|reply
Literally the first Colum of almost every SQL table I have written has had a column called id with auto increment.
[+] [-] needusername|8 years ago|reply
Can you elaborate a bit? Are you using an ORM, eg. Hibernate? If so have to considered other strategies like pooled-lo, IDENTITY or recursive CTEs that offer the same benefits but make the sequence values match the database values and reduce "id loss"?
[+] [-] dboreham|8 years ago|reply
[+] [-] simonw|8 years ago|reply
Absolutely do not do this.
People have names that are duplicates. A situation where someone is unable to join a club because their name clashes with an existing member is not OK.
Expecting a club administrator to be able to drop a uniqueness key from their database in order to resolve this situation is not a reasonable solution!
[+] [-] laumars|8 years ago|reply
> Here are some values that often work as natural keys:
> login names
This makes user names static. Which would annoy people if they change name (eg marriage, gender change, nationalization change (Chinese name vs English name)) or they just want to update their online handle.
> email addresses
Same problem as above. What happens if someone wants to change email addresses (eg they used a work one and changed jobs, or they got a new email to combat spam on the old one, or a new email address to reflect a name change (see above))
> mac address on a network
This makes more sense than above, but MAC addresses can be spoofed or even just used as a proxy, so you would need to be careful that the information you're uniquely storing against the MAC address is intended to be unique. Most use cases that immediately springs to my mind wouldn't follow his unique rule but I'm sure there will some examples that do.
> (lat,lon) for points on the earth
That might work if you also included an elevation point as well but office and apartment blocks (for example) will often have different tenants at the same latitude and longitude coordinates.
> Some people seem to struggle with the choice of “natural” key attributes because they hypothesize situations where a particular key might not be unique in some given population. This misses the point.
Actually no it doesn't. That's entirely the point as it demonstrates good planning and future-proofing rather than setting arbitrary business rules that are hard to rectify if and when you do run into those particular edge case scenarios.
[+] [-] lathiat|8 years ago|reply
This results in them having the system believe they are two people, one of who hasn't filed returns and no easy way to fix it (apparently the easiest fix is to possibly create a company?).
This despite there being some kind of unique ID submitted in both systems, but that ID is not currently reconciled.
Not sure 100% on the details myself but I guess the moral of that story is, duplicate checks on the name itself may not even be a guarantee of uniqueness - even assuming there was only 1 person with that name!
More information in the Twitter threads: https://twitter.com/patio11/status/947821990810869762 https://twitter.com/marcan42/status/947862901725024257
[+] [-] planetjones|8 years ago|reply
I know the author is talking about a small scale project here, but shortcuts like this (in my experience) just lead to a mess later on; irrespective of the complexity of the problem being solved.
[+] [-] dragonwriter|8 years ago|reply
Especially since if you have any FK relationships, and that was the only candidate key for the members relation (and why have such a bad key if it's not the only one), you then need to add new attributes to create a new key, populate it to all existing roles, change all existing FK relationships and associated queries, on top of the easy part, which is dropping the old unique constraint.
[+] [-] dullgiulio|8 years ago|reply
[+] [-] MarHoff|8 years ago|reply
I think the point was that for a 40 people club odd are very very poor that two people would have the same name. And even if so a club manager could still differentiate by adding a middle name or a nickname (in a 40 people scenario).
Of course IF your neighborhood club expand and you need to manage a lot of people you'll have to switch to a better technique. His point was that you must fit to you business case. (But hey some people change name when they marry so... problems can happen fast, but that's mutability issue not uniqueness).
Overall very good article but keeping critical mind as you did is needed.
[+] [-] tzahola|8 years ago|reply
I quit shortly thereafter.
[+] [-] kdeldycke|8 years ago|reply
[+] [-] olavk|8 years ago|reply
[+] [-] onion2k|8 years ago|reply
[+] [-] surrogatekey|8 years ago|reply
To me, surrogate keys are just about always preferable over natural keys, but it's also important to think about the "naturalness" of natural keys from a user's pov. In some contexts, using a surrogate key while also enforcing unique names (with the rare possibility of a case where an admin has to go in and do something weird like add a genuinely identical name with a "2" after it, say) can be a better trade-off.
[+] [-] foreigner|8 years ago|reply
... and it all works and performs just fine? The considerations the author mentions all make some sense to me in theory, but when do they actually matter in practice in a modern system?
[+] [-] mannykannot|8 years ago|reply
What you are doing is passing up on the opportunity of catching various errors (you are also passing up the opportunity for some optimizations, but that is probably a secondary issue.) In particular, you are passing up on some opportunities to catch inconsistencies in how different applications (or different parts of the same application) create and use data.
One argument made against putting these sort of rules in effect is that they constrain what application developers can do, but that is the wrong way to look at it: any such conflict is an indication of a misunderstanding (not necessarily on the part of the application developers) that has been caught before it can lead to bigger problems, such as a database full of irretrievably inconsistent or incomplete data.
When the problems finally do arise, it is often the case that some sort of workaround is the only practical solution. This, in my experience, is one of the common ways by which systems accrue gratuitous complexity, which in turn has at least two real-world consequences: an increased time to make changes, upgrades and extensions, and an increased frequency of errors, especially WTF-type errors (and probably also efficiency/performance hits.)
[+] [-] adambatkin|8 years ago|reply
Most database vendors make sequence generation (whether through explicit SEQUENCE objects or auto-incrementing columns) performant by making a few compromises:
* Numbers may not always be sequential (you might get 1, 3, 2 - in that order)
* There may be gaps (you might get 1, 2, 5, 6)
But since these are supposed to be opaque identifiers, neither of these compromises should be a concern for most users. But this means that these sequences can live outside of a transaction (you might grab an ID, rollback, and that ID is gone) and that systems with multiple nodes can be allocated a "block" of numbers from which they can quickly pull new values, without needing to coordinate with a master node.
[+] [-] amarkov|8 years ago|reply
When the database is just serving a webapp you control end to end, the only part that'll really have a huge impact is making sure you can partition by the primary key effectively. Which is good, because complex indexing schemes and foreign key constraints actually scale very badly.
[+] [-] tragomaskhalos|8 years ago|reply
A lot of people take the view that the application code is sufficient proof against bad data getting into your database, but this ignores (a) bugs and (b) back-door data loading, which in most applications will almost certainly be happening at some point.
The data in your database is the crown jewels (the application that fronts it is by comparison a piece of crap that you can replace at any time), so put guards on that gate ...
[+] [-] mohaine|8 years ago|reply
Over time you _will_ have child child records pointing to non existing parents and the same with foreign keys. These often don't really hurt anything, until they do. Since most of the testing is with clean data, most issues due to bad data are in production and reported by end users.
I've worked successfully on systems without any real constraints, but it is usually much better to start with constraints and remove them later as needed. Adding them later is a pain, as you first have to clean up the data and fix the code issues that created the bad data in the first place.
[+] [-] barrkel|8 years ago|reply
Another reason not to use auto-inc is if you need to insert lots of data into several separate tables that have foreign key relationships. If you know what you're inserting up front, and you need to insert millions of rows quickly, you're better off allocating the primary keys ahead of time and supplying them directly with the right keys for the relationship at insert time.
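A minimal sketch of pre-allocating keys before a bulk insert (the records, field names, and the reserved starting value are all hypothetical; in practice the range would come from one sequence call or a hi/lo reservation):

```python
# Toy data: children reference parents by list index before any IDs exist.
parents = [{"name": "a"}, {"name": "b"}]
children = [{"parent_idx": 0, "qty": 1}, {"parent_idx": 1, "qty": 2}]

next_id = 1000  # e.g. a range reserved up front with a single round-trip

# Assign primary keys locally, without touching the database per row.
for p in parents:
    p["id"] = next_id
    next_id += 1

# Resolve the foreign keys locally too; now both lists can be
# bulk-inserted with the relationship already wired up.
for c in children:
    c["parent_id"] = parents[c.pop("parent_idx")]["id"]
```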
Separately, another argument against compound keys: if you're in a situation where you're trying to optimize joins, sorts and filters over moderately large tables, you want to minimize the number of columns you touch. Every column measurably increases query time; more data to shuffle, sort, compare, bigger indexes, etc. You won't see this if you're doing simple key / row lookups, but you will see it if you're presenting user-interactive slices over million+ row data sets.
[+] [-] rotten|8 years ago|reply
How do you know which integer to insert next for a table that is replicated across systems? One system could do even numbers and one could do odd. One could do every other 100 integers. Or you could use UUIDs and not worry about it.
My argument is that auto-incrementing integers are inherently not scalable. They can scale to a certain extent, and then you have to switch key systems.
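The even/odd (interleaved-range) scheme mentioned above can be sketched like so; `make_id_generator` is an invented name, and a real system would persist the counter rather than hold it in a closure:

```python
def make_id_generator(node_id, node_count):
    """Each node emits IDs congruent to node_id modulo node_count:
    with two nodes, node 0 takes the evens and node 1 the odds, so
    replicas never hand out the same value without any coordination."""
    next_value = node_id

    def next_id():
        nonlocal next_value
        value = next_value
        next_value += node_count  # stride past the other nodes' slots
        return value

    return next_id

even = make_id_generator(node_id=0, node_count=2)
odd = make_id_generator(node_id=1, node_count=2)
```

The catch the comment is pointing at: the stride is baked in, so adding a third replica later means re-keying or switching schemes, which is exactly the scalability ceiling being described.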
[+] [-] vinceguidry|8 years ago|reply
One day you'll be integrating database systems together and you'll be glad you followed his advice. I've seen some really ugly bizdata schemas.
[+] [-] lulmerchant|8 years ago|reply
Complexity obviously makes this problem more serious, but something like this can easily occur at any scale of complexity or transaction volume. If you try to commit some code that breaks the integrity of an important relationship in some way, or even if you just want to run an arbitrary statement directly against the database, then you want the DB to throw an error.
[+] [-] nerdponx|8 years ago|reply
[+] [-] lifepillar|8 years ago|reply
In his landmark 1979 paper, Codd defined surrogates pretty clearly, and that part of his work was based on an earlier paper entitled On Entities and Relations (I don't recall the authors right now), so this is not exactly news.
Unfortunately, current DBMSs provide little to no support for properly implemented surrogates.
[+] [-] jeltz|8 years ago|reply
[+] [-] dragonwriter|8 years ago|reply
UUIDs instead of per-table autoincrement keys seem to be a solution which covers the part of the problem that involves DB support as opposed to data model design, and several DBs have adequate support for UUIDs.
[+] [-] chmaynard|8 years ago|reply
[+] [-] Ericson2314|8 years ago|reply
[+] [-] stickfigure|8 years ago|reply
[+] [-] dizzystar|8 years ago|reply
I'm squarely in the camp of using natural PKs until there is good reason to use surrogate PKs.
I generally disagree with the notion of using a combination of surrogate and natural keys. In SQL, a PK isn't just another unique key: a PK is an important block of communication.
As a rule of thumb, when a PK is attached to a semantic value, it is saying that this is the identifier of the table. If a PK is on a surrogate key, it is saying that there is no good unique identifying value in the table. When you are dealing with larger data sets, this distinction isn't minor, as it helps to understand the intent of the data when working with it. PKs serve as guideposts in your design along with guideposts to the person who has to maintain (or fix) your database later on.
I know some disagree with me on that, but there are many undeniably good reasons to use a natural key. A good place is a check-constraint table, where you, say, have a list of US states and want to ensure that "New York" and not "New Yoerk" gets inserted into the state column of an address table. Put a PK on valid_state_names and FK to that PK from addresses.
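The check-constraint-table pattern looks roughly like this in SQLite (table and column names taken from the comment; note that SQLite only enforces foreign keys once the pragma is switched on):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when asked
conn.execute("CREATE TABLE valid_state_names (name TEXT PRIMARY KEY)")
conn.execute("""CREATE TABLE addresses (
    id    INTEGER PRIMARY KEY,
    state TEXT NOT NULL REFERENCES valid_state_names(name))""")
conn.execute("INSERT INTO valid_state_names VALUES ('New York')")

# A valid state name is accepted...
conn.execute("INSERT INTO addresses (state) VALUES ('New York')")

# ...and the typo is rejected by the database itself, not application code.
try:
    conn.execute("INSERT INTO addresses (state) VALUES ('New Yoerk')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```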
[+] [-] erpellan|8 years ago|reply
[+] [-] B-Con|8 years ago|reply
> The PostgreSQL pseudo encrypt wiki gives an example cipher function:
> [...]
I get that they want to leave the key value in the schema, but practically this kind of thing feels worth pulling out into the storage engine. Just generate a random string as the ID upon row insertion and force uniqueness on that column. If insertion fails, generate another random string. (And it should pretty much never fail, because otherwise you're still susceptible to the enumeration attack you're trying to prevent.)
Although given that Postgres can read from a file, could the above be done in the schema by reading from /dev/urandom? If so that seems like the better approach.
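One possible sketch of the generate-and-retry idea, using an in-memory dict to stand in for a table with a unique index on the ID column (all names here are invented for the example):

```python
import secrets

def insert_with_random_id(rows, record, attempts=5):
    """Insert `record` under a freshly generated random ID, retrying on
    the (vanishingly rare) collision with an existing key. With 128 bits
    of randomness, hitting the retry path at all suggests a broken RNG."""
    for _ in range(attempts):
        candidate = secrets.token_urlsafe(16)  # 128 bits, URL-safe text
        if candidate not in rows:              # the unique-index check
            rows[candidate] = record
            return candidate
    raise RuntimeError("could not find a free ID; check your RNG")

table = {}
new_id = insert_with_random_id(table, {"name": "example"})
```

Because the IDs are unpredictable, knowing one row's key tells an attacker nothing about any other row, which is the enumeration-resistance property being discussed.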
[+] [-] slg|8 years ago|reply
Which is why the UUID (or GUID in SQL Server speak) can have other drawbacks there in comparison to auto incrementing bigints, namely delays due to the data being ordered randomly on disk. There are obviously ways to counter that like adding extra non-clustered indexes or changing the clustered index to something besides the primary key, but at that point the extra time and overhead might defeat the benefits you gained from going with UUIDs in the first place.
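One common mitigation is a roughly time-ordered UUID, as in UUIDv7 or SQL Server's NEWSEQUENTIALID. A toy version of the idea (not a standards-compliant UUIDv7, just the timestamp-prefix trick):

```python
import os
import time
import uuid

def sequential_uuid():
    """128-bit ID whose first 6 bytes are a millisecond timestamp, so
    new values sort after old ones and inserts cluster at the tail of
    the index instead of landing on random pages."""
    ms = int(time.time() * 1000).to_bytes(6, "big")
    return uuid.UUID(bytes=ms + os.urandom(10))

a = sequential_uuid()
time.sleep(0.002)  # ensure the millisecond prefix advances
b = sequential_uuid()
```

The trailing 10 random bytes keep values unguessable across generators, while the prefix restores the locality that pure-random UUIDs destroy.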
[+] [-] Ericson2314|8 years ago|reply
[+] [-] hug|8 years ago|reply
We're not using it because we have some weird tradition we enforce. We're not using it because we like the way it looks, the way it writes, or the way it handles. We're not using it because we've tried to use ORMs and those ORMs have proved conclusively better across use cases than SQL.
The reason we use it is because we don't yet have a better solution. If you can write a better solution than SQL I would love to hear about it: I absolutely loathe SQL, but I use it because out of everything I've seen, it's the simplest and most elegant solution to the problem that I've ever come across.
So yeah, while we could possibly and should definitely make SQL better if we can, I don't think that suggesting that we just re-do it from scratch is the way to go. That way lies madness, and possibly javascript frameworks.
[+] [-] Ericson2314|8 years ago|reply
1. Rename the constraints to mimic the classification of keys described here. E.g. require UUIDs for artificial keys.
2. Make joins with foreign artificial keys much lighter syntactically than free-form joins on natural keys. Make joins involving non-unique columns in the WITH (today's syntax) harder still. People should be able to think of foreign keys as references to be dereferenced, separate from full joins.
3. Going further, break apart "tables" to reflect the different things people use them for, e.g. key-value maps vs "join tables" (a term I had never heard before but immediately could define from experience) vs canonical data. In math terms, distinguish sets of aggregate data from relations. Having full algebraic data types for the sets of aggregate data but not relations could help. (Maps are a special case of relations.) A final related distinction is tables-as-types vs tables-as-collections. I suppose that relates (no pun :)) back to what joins are allowed.
4. Keep the references pointing in the right direction. For example, à la Rust and capability theory, the owner should point to its uniquely owned children. But in SQL today, we instead invert the back reference, having the children refer to their unique parent. Supporting lists (of children) and unaliased foreign keys directly would make the canonical back-reference an implementation detail.
[+] [-] rusbus|8 years ago|reply
I wouldn't say "unique constraints considered harmful" but I would definitely say they can be surprising and must be used with care.
[1]: https://rcoh.me/posts/postgres-unique-constraints-deadlock/
[+] [-] navium|8 years ago|reply
[+] [-] danielovichdk|8 years ago|reply
[+] [-] LeonM|8 years ago|reply
[+] [-] WorldMaker|8 years ago|reply
[+] [-] Dowwie|8 years ago|reply
[+] [-] txutxu|8 years ago|reply
But MAC addresses cannot be considered unique as soon as software like keepalived is running.
[+] [-] mikst|8 years ago|reply
why use UUIDv1 if you can just use timestamp?
Wouldn't a time-based solution be prone to time misconfiguration across servers? Or does that little shuffle not really matter?
[+] [-] epx|8 years ago|reply
[+] [-] bullen|8 years ago|reply
The difference between centralized, distributed and decentralized primary keys.
And how public/private key encryption should be a natural part of every primary key.
The problem is that all SQL databases use synchronous clients, which makes them completely useless in any distributed or decentralized setting.
The real tension here is that there is no guarantee that large UUIDs are unique, and we have not embraced the random occurrence of a collision.