
Database IDs Have No Place in URIs (2008)

71 points | tosh | 6 years ago | johntopley.com

167 comments

[+] benkuhn|6 years ago|reply
This argument seems backwards.

As far as I know, no database has ever had any of the problems he mentions with IDs not round-tripping properly during backup/restore. (Please correct me if I'm wrong! But why do I think this is true? Because any database that doesn't roundtrip primary keys would break all foreign keys on backup/restore. Databases with broken backup/restore tend to either fix it real fast, or go away.)

On the contrary, it's much easier to keep IDs stable over time than any other field. The problem with using a different field as an identifier (e.g. a URL component) is that almost any property with an objective meaning might change over time or become non-unique. This is why it's a best practice not to use, e.g., a user's email address as a primary key.

For instance, Stack Overflow allows users to edit the title of a post. Right now when that happens, the canonical URL changes (to include the new slug), but the old URL is easily 301'd to the new one because it includes the ID. If the numeric ID were not in the URL, you'd be stuck with either a misleading slug, or having to maintain a list of former slugs/redirects for each post. Plus, you'd have to hack your slug-generation algorithm to ensure that generated slugs are unique, and the easiest way to do that is... by appending an incrementing ID to the end of the slug.
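The ID-plus-slug scheme described above can be sketched like this (the route pattern and the `POSTS` store are illustrative, not Stack Overflow's actual code):

```python
import re

# Hypothetical store mapping numeric ID -> current slug.
POSTS = {13204: "why-doesnt-my-cron-job-run"}

def resolve(path):
    """Return ('ok', canonical), ('301', canonical), or ('404', None)."""
    m = re.match(r"^/questions/(\d+)(?:/([\w-]*))?$", path)
    if not m:
        return ("404", None)
    post_id, slug = int(m.group(1)), m.group(2)
    current = POSTS.get(post_id)
    if current is None:
        return ("404", None)
    canonical = f"/questions/{post_id}/{current}"
    # The ID alone identifies the post; a stale or missing slug
    # just triggers a redirect to the canonical URL.
    if slug != current:
        return ("301", canonical)
    return ("ok", canonical)
```

Because the ID is authoritative, an outdated slug never 404s - it redirects.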

The point about avoiding cruft like .aspx in URLs is well taken, but unfortunately it's in direct tension with the point about keeping them stable given that "meaningful" non-cruft tends to change over time!

(And if you don't want someone to guess your metrics from IDs, you can use a random non-numeric ID instead!)

[+] dexen|6 years ago|reply
>As far as I know, no database has ever had any of the problems he mentions with IDs not round-tripping properly during backup/restore.

Don't know of any such either. A typical SQLite, MySQL, or PostgreSQL dump/backup is just a bunch of SQL statements that re-create the content, including PKs.

You can imagine a non-native (perhaps DB-agnostic?) dump format that would use generated IDs rather than the actual PKs. Why one would do that, I don't know.

Another possible case is using a file (or memory) offset as the PK. This gives certain nice properties: one less level of indirection, and object identity being pointer identity. Perhaps there's a memory-image-based Lisp out there that uses this, or a NoSQL engine somewhere.

>Plus, you'd have to hack your slug-generation algorithm to ensure the generated slugs to be unique

...or you put the moderators to the task of ruthlessly stamping out duplicate questions. There was a popular website like that... oh wait, that's Stack Overflow ;-)

[edit]

I've just remembered that PostgreSQL's ctid and SQLite's ROWID are physical (offset-like) identifiers of records. They are exposed to userspace (to a certain extent) and could be misused as IDs in URLs. Being physical identifiers, they are naturally subject to change upon dump/restore, vacuum, and other maintenance tasks. Clearly they should not be used as end-user-visible IDs in URLs.
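The instability is easy to see with Python's sqlite3, assuming a table with no explicit INTEGER PRIMARY KEY (table and column names here are made up):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)

# No explicit INTEGER PRIMARY KEY, so the implicit ROWID is physical
# and SQLite is free to renumber it on VACUUM.
conn.execute("CREATE TABLE notes (body TEXT)")
conn.executemany("INSERT INTO notes VALUES (?)", [("a",), ("b",), ("c",)])
before = conn.execute("SELECT rowid, body FROM notes ORDER BY rowid").fetchall()

conn.execute("DELETE FROM notes WHERE body = 'b'")
conn.commit()          # VACUUM can't run inside a transaction
conn.execute("VACUUM")
after = conn.execute("SELECT rowid, body FROM notes ORDER BY rowid").fetchall()
# 'c' silently moves from rowid 3 to rowid 2 - any URL built on the
# old rowid now points at the wrong record (or nothing).
```

Declaring `id INTEGER PRIMARY KEY` instead makes the rowid an alias of the column and keeps it stable across VACUUM.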

[+] saltcured|6 years ago|reply
PostgreSQL today is finally deprecating the OID feature they inherited from the original Postgres code. The OID used to be an implicit row ID that would have these problems.

The OID is generated from a sequence shared by all databases in the "cluster", i.e. the PostgreSQL server instance. So it is not only non-deterministic with respect to insert order in your own database, but also affected by inserts into other databases on the same server.

[+] coldtea|6 years ago|reply
>As far as I know, no database has ever had any of the problems he mentions with IDs not round-tripping properly during backup/restore. (Please correct me if I'm wrong! But why do I think this is true? Because any database that doesn't roundtrip primary keys would break all foreign keys on backup/restore. Databases with broken backup/restore tend to either fix it real fast, or go away.)

You might not use foreign keys, but still have an auto-incremented primary key.

Database IDs in the URL are also a security concern.

[+] cryptonector|6 years ago|reply
Pick a primary key (i.e., set of columns) for a table. If you expose this outside the database in stable ways (e.g., in URIs), then do not permit changes to that table's rows' primary keys.

There's NO NEED for a primary key to be an INTEGER or any sort of ROWID.

I agree with TFA -if this is what it says- that one should not put INTEGER PRIMARY KEY values in URIs -- they're useless to users. Just make the "slug" a column, make it NOT NULL and UNIQUE, disallow changes, and use that. You don't have to make the slug the PRIMARY KEY -- NOT NULL and UNIQUE has the same semantics as PRIMARY KEY.
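A minimal sqlite3 sketch of this setup (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE posts (
        id   INTEGER PRIMARY KEY,   -- internal only, never in URIs
        slug TEXT NOT NULL UNIQUE,  -- stable external identifier
        body TEXT
    )
""")
conn.execute("INSERT INTO posts (slug, body) VALUES ('hello-world', 'text')")

# Lookup by slug; the UNIQUE constraint gives it an index for free.
row = conn.execute(
    "SELECT body FROM posts WHERE slug = 'hello-world'"
).fetchone()

# Duplicate slugs are rejected at the database level.
try:
    conn.execute("INSERT INTO posts (slug, body) VALUES ('hello-world', 'dup')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```

Disallowing UPDATEs on the slug column (e.g. via a trigger or application policy) is what actually keeps the URIs stable.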

Surely TFA wouldn't argue to have no DB key at all in URIs, so I won't bother with that possibility. And yes, I should read TFA...

[+] michaelt|6 years ago|reply

  IDs not round-tripping properly during backup/restore
Oracle has a "ROWID pseudocolumn" which uniquely identifies each row.

(I know this because I once had to fix a database table with no primary key, with some rows where every field was identical. The rowid lets you delete one of two identical rows, so it's not an entirely useless feature.)

If an inexperienced developer, with no one reviewing their work, heard that the rowid is the fastest way of looking up a row - faster even than the primary key - and decided to use it in URLs, that would be tremendously inconvenient.

[+] speedplane|6 years ago|reply
> it's much easier to keep IDs stable over time ... [otherwise] you'd have to hack your slug-generation algorithm to ensure the generated slugs to be unique[.] ... [A]voiding cruft ... [is] in direct tension with the point about keeping them stable given that "meaningful" non-cruft tends to change over time!

Many applications have meaningful unique identifiers; creating a unique slug is not always a "hack". Apps that display events that occurred in the past often have natural unique identifiers. Examples include software logs, email, government records, and financial transactions - anything that displays past events.

These records are inherently "write-once", and the primary key can be something more meaningful than an ID. This may be just a subset of the apps out there, but in these cases there is a strong argument for removing IDs from URLs.

[+] yetihehe|6 years ago|reply
> As far as I know, no database has ever had any of the problems he mentions with IDs not round-tripping properly during backup/restore.

A simple example: neo4j. Node IDs are reused after deletion, and you can't create a node with an ID of your choosing. The documentation mentions this. We've just used unique random strings as keys.

[+] jolmg|6 years ago|reply
> If you have to move the site to a different box, can you guarantee that those database IDs will remain the same?

If you don't, then your foreign keys will be broken.

> If you have to restore the site’s data from a backup, can you guarantee that those database IDs will remain the same?

Ditto. If they're not, then that backup seems broken.

> If you have to switch to a different database server (say from Microsoft SQL Server to Oracle), can you guarantee that those database IDs will remain the same?

Ditto.

I've only had one situation where I had to change database IDs (including updating all foreign keys accordingly in all tables): migrating multiple company-hosted intranet instances of a webapp to a unified internet-hosted server. I was merging multiple databases into one, so modifying IDs was necessary so they wouldn't clash.

At least in my case, I don't think any of those companies' employees would have had the expectation for their intranet URLs to work. I mean, even the host/domain part of the URI changed. Why would they have the expectation for the path part to still work?

[+] ehutch79|6 years ago|reply
I don't understand why people are saying this is a security issue.

If user a has private document 1234 at /doc/1234 and user b has private document 1235 at /doc/1235

when user a goes to /dec/1235 IT SHOULD NOT RETURN THAT DOCUMENT.

Who in their right minds considers obscurity to be a replacement for authorization?

If a random anonymous individual can access private data by hitting a url, the problem is not the id used in the url.

[+] ehutch79|6 years ago|reply
Also, what about when your users are all generating multiple entries with the same name/slug? thing-name-1532 isn't any easier than things/1532/thing-name

I have no reason to enforce uniqueness of the name of things, but I MUST do this because some jerk said so in 2008?

[+] JMTQp8lwXL|6 years ago|reply
UUIDs are a good solution to this. Less and less do I see sites using an incrementing primary key as an ID. I'd like to think that using a UUID is more scalable too -- no need to keep track of what the last ID was.
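A minimal sketch of minting such an identifier with Python's uuid module (the function and URL shape are illustrative):

```python
import uuid

def new_public_id():
    # uuid4 is random (122 bits of entropy), so it leaks nothing about
    # row counts or growth rate, and needs no central counter.
    return str(uuid.uuid4())

url = f"/doc/{new_public_id()}"
```

The trade-off is readability: a 36-character UUID is far less pleasant to type or say aloud than a short integer.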
[+] sprt|6 years ago|reply
It's at least a user enumeration issue
[+] ehutch79|6 years ago|reply
the /dec/1235 bit was a typo.

I mean really?

[+] airstrike|6 years ago|reply
> when user a goes to /dec/1235 IT SHOULD NOT RETURN THAT DOCUMENT.

As a matter of fact, it should return 404 NOT FOUND for both users A and B

[+] Jwarder|6 years ago|reply
One idea I like is to have a normal internal database ID, but also have a public UUID column to use for URLs, APIs and that sort of thing.

This way you can change the record's UUID at any time without having to worry about updating a bunch of internal foreign keys. And you don't have to worry about a user making small edits to the ID in the URL to find nearby records (although you still need real access controls if you restrict data by user).
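A sqlite3 sketch of this dual-identifier pattern (the schema is illustrative): foreign keys reference the integer `id`, URLs carry `public_uuid`, and rotating the latter touches nothing internal.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE docs (
        id          INTEGER PRIMARY KEY,   -- internal, used by FKs
        public_uuid TEXT NOT NULL UNIQUE,  -- external, used in URLs
        title       TEXT
    )
""")
old = str(uuid.uuid4())
conn.execute("INSERT INTO docs (public_uuid, title) VALUES (?, ?)",
             (old, "report"))

# Rotating the public identifier leaves the internal id (and every
# foreign key pointing at it) untouched.
new = str(uuid.uuid4())
conn.execute("UPDATE docs SET public_uuid = ? WHERE public_uuid = ?",
             (new, old))
row = conn.execute("SELECT id, title FROM docs WHERE public_uuid = ?",
                   (new,)).fetchone()
```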

[+] einrealist|6 years ago|reply
"URIs should be meaningful, not cluttered with meaningless information."

Actually, URIs are opaque. The parts of a URI are implementation details - even the protocol, domain, and port. To the user, changing database IDs that happen to be part of the URI is no different from changing the domain, port number, or protocol.

Whenever I get into a discussion about "URI design", I state two rules:

1) Care about the URI design as much as you want. But never expect anyone else - especially the client - to make sense of that design.

2) When necessary, handle URI changes in a way that does not break the client (redirects are your friend).

Btw. For this reason, browsers tend to remove the address bar from the UI (as much as possible).

[+] slig|6 years ago|reply
For a content website such as SO, the biggest advantage of using /<id>/<slug>/ is being able to update the `slug` whenever the `title` of your content changes; then it's very easy to do a 301 redirect and you're good to go.
[+] dqybh|6 years ago|reply
If you're afraid of Google bombing, the redirect is not even necessary, just make sure the <link rel=canonical> tag contains the correct URL.
[+] fabian2k|6 years ago|reply
That article seems to immediately contradict itself by saying that the most important aspect of URIs is that they never change, and then suggests using the slug derived from the title as unique identifier. Titles change all the time on Stack Overflow, this would immediately break URLs.
[+] bazzargh|6 years ago|reply
The title and slug are related but not tied together. It's quite common to visit slug-based sites and find the slug is a relic of the original title. In addition, slugs have to be unique when used this way, but titles don't. E.g. a title of "Database IDs Have No Place in URIs (2008)" may appear again next year, but the URL would need to point to a different conversation; your CMS would suggest or generate the slug based on the original title.
[+] GauntletWizard|6 years ago|reply
This guy sounds like he has bad backups and bad database management. Nothing more. That "database ID" is probably an auto-incrementing primary key, and yes, moving it to another box, restoring from backups, and even migration to another server type should keep the same ID. If not, you've done it wrong.

In none of those situations will new entries necessarily follow the same pattern, i.e. there may be huge gaps or it might start filling in deleted questions so entries are "out of order", but those are separate problems and also remedied by using your schema correctly.

In short: this post is very misinformed and I hope that it's presented for that purpose and not taken as serious advice.

[+] aiCeivi9|6 years ago|reply
> this post is very misinformed.

This. The answer to all three questions in the article is always "yes".

> The slug itself can easily be automatically generated when a new question is saved. Then you can simply retrieve a question by its slug.

And that is just pure ignorance. What happens to the slug when the question (or product) is edited?

[+] madrox|6 years ago|reply
I get the vibe that a lot of these critiques are rooted in an academic understanding of the issues rather than real-world practice. In practice, these are all problems that can be addressed, some rather trivially.

The biggest real world concern I've ever had with embedding incremental IDs into URLs is that incremental IDs can leak information, like how many records you have and how quickly that's growing. Depending on your business, you may want to obfuscate that.

[+] adrianmsmith|6 years ago|reply
Also, using a numerical auto-incrementing ID in the URL will expose how many users you have so far. Not a problem for Stack Overflow, but I’ve had clients (startups) who wanted to put IDs in the URL; when you sign up and your URL is /profile/432, it’s quite clear the product doesn’t have many users.
[+] petepete|6 years ago|reply

    ALTER SEQUENCE users_id_seq RESTART WITH 54321;
"Wow, this service is probably mature, well-tested and reliable!"
[+] adjkant|6 years ago|reply
An easy solution to that is simply not to start at zero/one, and instead pick a random number somewhere around 5-6 digits. You can do that at the database level or programmatically.
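One way to do this at the database level, sketched with sqlite3 (seeding the sequence by inserting the first row with an explicit id; the table and names are illustrative):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Seed: give the first row an explicit random 6-digit id.
seed = random.randint(100_000, 999_999)
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (seed, "first"))

# Subsequent inserts continue from max(id) + 1, as SQLite's
# INTEGER PRIMARY KEY picks the next largest rowid.
conn.execute("INSERT INTO users (name) VALUES ('second')")
ids = [r[0] for r in conn.execute("SELECT id FROM users ORDER BY id")]
```

In PostgreSQL the equivalent is the `ALTER SEQUENCE ... RESTART WITH` statement quoted above. Note this hides the starting point but not the growth rate - two sign-ups a month apart still reveal the delta.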
[+] branko_d|6 years ago|reply
Just use GUIDs instead.

And if you want to keep using integer IDs internally, put the GUIDs into an alternate key.

[+] kozak|6 years ago|reply
That can be either good or bad depending on situation. There are plenty of cases where concise sequential identifiers are an advantage.
[+] kevsim|6 years ago|reply
Fully agree. For example, it's very handy that GitHub issues are auto incrementing integers starting at 1, so that I can type "Fixes #123" in a commit message instead of "Fixes #<some-uuid-or-something>".

Of course, that short numeric ID doesn't necessarily need to be part of the DB's primary key.

[+] dexen|6 years ago|reply
>where concise sequential identifiers are an advantage

For example, certain kinds of attacks are helped by sequential IDs. Wasn't there an AT&T "hack" where the attackers iterated URLs over sequential account numbers to grab nearly all the data? It would've been nigh impossible to perform with UUIDs as identifiers.

About the only use case I can think of where sequential IDs are helpful is using MAX(PRIMARY KEY) to roughly estimate the total number of records. And that's only relevant where there's special-cased support for MAX(PRIMARY KEY) - think MySQL, but not PostgreSQL. It's also subject to over-estimation due to holes left in the sequence, whether from deleted records or rolled-back transactions.

[+] LoSboccacc|6 years ago|reply
This seems backwards: as long as the PKs remain meaningless, they're easier to port around, not harder. It's once you give them a meaning that the trouble starts (for example, you can't change the slug down the road if the title changes).

And with integer keys, the worst that can happen is that the sequence is not in the backup, so after a restore you need to have the sequence restart from the previous max ID + 1. Meanwhile, with slugs built from user-entered data, a change in collation can really screw up your day - unless you want to mangle all text into basic ASCII, and SO doesn't: https://stackoverflow.com/questions/41102371/sql-doesnt-diff... (note the ü in the link, more than the question itself)

[+] 3xblah|6 years ago|reply
"URIs in Stack Overflow look like this:

http://stackoverflow.com/questions/13204/why-doesnt-my-cron-...

Article gives only cons. What are the pros?

https://stackoverflow.com/q/13204

Shorter, easier for users to work with. Could be even shorter with a domain hack. Does Stack Overflow have one?

https://en.wikipedia.org/wiki/Domain_hack

http://xona.com/domainhacks/s.cgi?q=stackoverflow.com

[+] geophile|6 years ago|reply
I don't get this argument at all.

If the ID comes from something like a SERIAL column, then yes, you can port those easily. I think every database system has something like SERIAL, as well as the ability to control the initial value, taking care of portability concerns.

SERIAL has other problems, namely predictability, so another possibility is UUIDs. And those should be pretty portable too. I know DBMS UUID implementations are different from OS implementations, but the risk of collision should still be vanishingly small.

And if you don't use IDs, then what's the choice? If the URI contains an article name then what about article names being changed? What about duplicate article names?

[+] steego|6 years ago|reply
I can't believe people have been giving this post this much credit in the first place. All of his contrived examples are solved with trivial methods and it just sounds like he doesn't know how to migrate data from one database system to another.
[+] thecatspaw|6 years ago|reply
The issue I see with searching by slugs is that it's slower than searching by an integer, and it's more likely to be entered incorrectly by users when typing a URL.

You also cannot change the slugging algorithm if you want to keep bookmarks valid.

[+] adrianmsmith|6 years ago|reply
I don’t agree it’s slower in a practical sense. If your service is backed by a relational database like PostgreSQL then you’d have an index on the column you’re looking up. So a request involves network round-trip latency, parsing the request and preparing the response, and searching for the object in the database using the index, most likely a tree with a depth of 3 or 4 levels. I don’t think the number of characters in the key, or the data type, will play a big role here. Happy to see evidence to the contrary though if I’m wrong!

The algorithm flexibility might not be a big problem either. The old slugs are stored in the database. Update the algorithm, new slugs adhere to the new algorithm, old ones unchanged, everything still works.
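The lookup described above can be sketched with two plain mappings (a real app would back these with database tables; all names are illustrative):

```python
# Current slugs resolve directly; retired slugs live in a history
# table and 301 to the current one, so old bookmarks keep working
# even after the slugging algorithm changes.
CURRENT = {"new-improved-title": 7}                  # slug -> post id
HISTORY = {"original-title": "new-improved-title"}   # old slug -> current

def lookup(slug):
    """Return ('ok', post_id), ('301', current_slug), or ('404', None)."""
    if slug in CURRENT:
        return ("ok", CURRENT[slug])
    if slug in HISTORY:
        return ("301", HISTORY[slug])
    return ("404", None)
```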

[+] dusted|6 years ago|reply
When talking about public pieces of information, being able to refer to them by their identity is useful, and any hashing would be counterproductive (just wasting power); also, I'm sure any database can operate on unique incrementing numbers more quickly than on UUIDs. If we're talking about private information, it should be transferred some other way than in the URL in the first place. Yes, the chance of "guessing" a hash is lower than an incrementing number, but there are other ways the URL could be intercepted: a screenshot, or an accidental copy/paste, for example.
[+] michrassena|6 years ago|reply
Say I'm convinced that IDs could leak information, but UUIDs are problematic (e.g. the cost of enforcing uniqueness grows with each added row). Would encrypting the ID with some symmetric key plus a salt be sufficient? The client sees a number from which the ID can be recovered on the server side, keeping the benefits of an easily generated unique identifier while making it challenging to recover the actual ID.
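One shape this idea can take is a small keyed Feistel permutation over the 32-bit id space, sketched below. To be clear, this is obfuscation rather than vetted cryptography, and the key name and round count are arbitrary assumptions; for anything security-sensitive, use a reviewed scheme.

```python
import hashlib

KEY = b"server-side-secret"  # hypothetical secret kept on the server

def _round(half, i):
    # Keyed round function: 16 bits of SHA-256 over (key, round, half).
    digest = hashlib.sha256(KEY + bytes([i]) + half.to_bytes(2, "big")).digest()
    return int.from_bytes(digest[:2], "big")

def encode(n, rounds=4):
    """Map a 32-bit id to an opaque-looking 32-bit number (a bijection)."""
    left, right = n >> 16, n & 0xFFFF
    for i in range(rounds):
        left, right = right, left ^ _round(right, i)
    return (left << 16) | right

def decode(n, rounds=4):
    """Invert encode() by running the rounds backwards."""
    left, right = n >> 16, n & 0xFFFF
    for i in reversed(range(rounds)):
        left, right = right ^ _round(left, i), left
    return (left << 16) | right
```

Because it's a permutation, encode() never collides, so the obfuscated value stays as unique as the underlying ID - the property that makes UUID-style uniqueness checks unnecessary here.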