User IDs probably shouldn't be passed around as ints (2018)

[+] bakugo|3 years ago|reply

More and more I'm noticing this narrative of "if I write awful code with obvious mistakes and somehow nobody else notices it in code review, it's never our fault, the language and its typing system or safety features should've stopped me" gaining popularity in the programming world and I really have to ask, what happened to programmers that actually knew what they were doing instead of expecting the computer to tell them what to do?

[+] bigDinosaur|3 years ago|reply

Because people are realising how much good type systems can help prevent such errors? The well was poisoned for a while with terrible type systems that either provided marginal safety or were just horrible to work with (e.g. complete lack of inference) but that is now seriously changing.

[+] chrsig|3 years ago|reply

Enough of the "hard problems" are solved that developers are expected to use off-the-shelf solutions. As a result, they're not getting hands-on experience solving hard problems, or challenging each other to think more deeply about solutions.

[+] yakshaving_jgt|3 years ago|reply

They never existed. It was all hubris. Mistakes inevitably happened and people lost money and/or died.

[+] kvdveer|3 years ago|reply

Software systems have been able to grow mostly because developers have been able to delegate a lot of diligence to tools.

Of course we could require a developer to "know what they are doing", many work environments do. However, you won't see many posts about that. First of all because it doesn't scale, and secondly because it doesn't make for interesting reading.

[+] ascotan|3 years ago|reply

I learned never to use autoinc IDs in anything, especially not URLs, where they leak DB info, etc. However, I've seen it many times where younger devs do exactly that because they're learning from bad online books or tutorials. Every generation of devs needs to relearn the same lessons.

I also suspect that many devs today are learning platforms first and software skills last. This is the reverse of many devs who came from building it the hard way and then using platform tooling to simplify. Newer devs are looking for the tooling to provide the guardrails that core skills would otherwise supply.

[+] preommr|3 years ago|reply

What happened is we went from making pong to cyberpunk 2077.

[+] nivertech|3 years ago|reply

1. use UUIDs for user IDs

2. use typed schemas for APIs, i.e. GraphQL with a custom scalar type for UserID. Other typed schemas for APIs: OpenAPI, AsyncAPI, protobufs/gRPC, etc.

  // add GraphQL parsing and serialization code
  scalar UserID

3. Make the implicit - explicit. Don't use naked primitive types in your code (similarly as you will not use naked literal constants, i.e. use PI instead of 3.1415...). Use a PL with the string static typing (preferably Algebraic Type System). Define type for UserID which cannot be mixed with the integers, i.e.

  type UserID = UserID int

4. in dynamic PLs you can use tagged tuples or tagged maps/structs, i.e.

  {:user_id, 1234}
  
  or

  {"_type": "UserID", "id": 1234}

5. validate all external inputs (even those comming from the DB or Message Broker):

  validateUserID: int -> UserID

6. The examples above assume integer representation of the user id, here's if we switch to UUIDv4:

  type UUIDv4 = UUIDv4 string
  type UserID = UserID UUIDv4

  validateUserID: string -> UserID
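
A minimal TypeScript sketch of points 3, 5 and 6 above, assuming the UUIDv4 representation. The `__brand` property and the `banAccount` function are hypothetical names used only for illustration; "branding" is one common way to get a type that cannot be mixed with plain strings, not the only one.

```typescript
// A "branded" string: structurally a string, but the phantom property keeps
// plain strings (and other ID types) from being assigned to it by accident.
type UserID = string & { readonly __brand: "UserID" };

// Matches the hyphenated UUIDv4 text form, in either letter case
// (version nibble 4, variant nibble 8, 9, a or b).
const UUID_V4 =
  /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

// The only intended way to obtain a UserID: validate input at the boundary.
function validateUserID(raw: string): UserID {
  if (!UUID_V4.test(raw)) {
    throw new Error(`not a UUIDv4 user ID: ${raw}`);
  }
  return raw.toLowerCase() as UserID;
}

// Hypothetical operation that only accepts validated IDs.
function banAccount(id: UserID): string {
  return `banned ${id}`;
}

const id = validateUserID("3F2504E0-4F89-41D3-9A0C-0305E82C3301");
banAccount(id);
// banAccount("1234");  // compile-time error: string is not assignable to UserID
```

The cast inside `validateUserID` is the single place where a raw string becomes a `UserID`, so every value of that type has passed the check.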

[+] Hermitian909|3 years ago|reply

My one critique of this prescription is that UUIDs are not identifiable as user IDs. IMO you're better off with a format like "user{20 random characters}"

[+] unscaled|3 years ago|reply

UUID also suffers from a serious issue: there is no single standard way to encode them, either as a string or in binary.

In binary, a UUID can be encoded as big-endian, little-endian or even mixed-endian, and while the string encoding is usually lowercase with certain hyphenation rules, I've seen variations of that.

The (hexadecimal) text encoding is also quite inefficient compared to more modern standards like ULID or kSUID.
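
To make the binary ambiguity concrete, here is a small TypeScript sketch that turns the same UUID text into two different byte sequences: the big-endian order in which the hex digits are written, and the mixed-endian layout used by Microsoft GUID structures. The function names are made up for this example.

```typescript
// Parse the 8-4-4-4-12 text form into 16 bytes, in the order the hex
// digits are written (big-endian / network byte order).
function uuidToBytesBigEndian(uuid: string): Uint8Array {
  const hex = uuid.replace(/-/g, "");
  if (!/^[0-9a-fA-F]{32}$/.test(hex)) {
    throw new Error(`not a UUID: ${uuid}`);
  }
  const bytes = new Uint8Array(16);
  for (let i = 0; i < 16; i++) {
    bytes[i] = parseInt(hex.slice(2 * i, 2 * i + 2), 16);
  }
  return bytes;
}

// Mixed-endian layout: the first three fields (4, 2 and 2 bytes) are
// byte-swapped, the last 8 bytes keep their written order.
function uuidToBytesMixedEndian(uuid: string): Uint8Array {
  const bytes = uuidToBytesBigEndian(uuid);
  // subarray() returns views on the same buffer, so reverse() edits in place.
  bytes.subarray(0, 4).reverse();
  bytes.subarray(4, 6).reverse();
  bytes.subarray(6, 8).reverse();
  return bytes;
}

// Helper to print bytes as lowercase hex.
function toHex(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    out += (bytes[i] < 16 ? "0" : "") + bytes[i].toString(16);
  }
  return out;
}

const sample = "00112233-4455-6677-8899-aabbccddeeff";
toHex(uuidToBytesBigEndian(sample));   // "00112233445566778899aabbccddeeff"
toHex(uuidToBytesMixedEndian(sample)); // "33221100554477668899aabbccddeeff"
```

Two systems that disagree on the layout will store different bytes for the same ID, and will print different IDs for the same bytes.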

[+] thangngoc89|3 years ago|reply

Are you familiar with Typescript? If it's possible, could you give me an example of 3 or 4 in Typescript?
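
For point 4 (tagged values), one possible TypeScript rendering is a pair of interfaces with a literal `_type` tag. This is only a sketch: `MessageID`, the constructor helpers and `banAccount` are hypothetical names chosen for the example.

```typescript
// Each ID carries a literal tag, mirroring {"_type": "UserID", "id": 1234}.
interface UserID {
  readonly _type: "UserID";
  readonly id: number;
}
interface MessageID {
  readonly _type: "MessageID";
  readonly id: number;
}

const userID = (id: number): UserID => ({ _type: "UserID", id });
const messageID = (id: number): MessageID => ({ _type: "MessageID", id });

// Hypothetical operation. The runtime check guards callers that bypass the
// type checker (plain JavaScript, `any`, freshly deserialized JSON).
function banAccount(user: UserID): number {
  if (user._type !== "UserID") {
    throw new TypeError("expected a UserID");
  }
  return user.id;
}

banAccount(userID(1234));
// banAccount(messageID(1234));  // compile-time error: the tags differ
// banAccount(1234);             // compile-time error: number is not a UserID
```

Unlike a purely compile-time wrapper, the tag survives serialization, so the same check also works on data coming back from a queue or a database.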

[+] LorenPechtel|3 years ago|reply

And if you use UUIDs for user IDs you're probably using them for other things. The same sort of mistake can happen.

The only real defense here is language-level enforcement. Allow declaration of a subtype that is not assignment-compatible with the parent even though it's identical.

I don't do any web-facing stuff but the only bits of code that know about things like IDs are the database stuff. All the logic works with classes that contain the ID and relevant data--you always pass the class, not the ID.

[+] camgunz|3 years ago|reply

I'll get back on my "tests are better than types" hobby horse and say "or write a single test for this function". The problem here is that an engineer was able to push untested code to prod that irrevocably modifies the database.

You need a test here because types won't tell you that a user was banned. And that test would also have caught this error.
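
A sketch of what such a test might look like in TypeScript, assuming the bug is of the kind where some other integer (an array index or a message ID) gets passed where the sender's user ID belongs. `UserStore`, `Message` and `banSendersOfMessages` are hypothetical stand-ins, not the code from the article.

```typescript
interface Message {
  readonly id: number;
  readonly senderId: number;
}

// Hypothetical in-memory stand-in for the users table.
class UserStore {
  private readonly banned = new Set<number>();
  ban(userId: number): void {
    this.banned.add(userId);
  }
  isBanned(userId: number): boolean {
    return this.banned.has(userId);
  }
}

// Function under test: ban the sender of every message.
function banSendersOfMessages(store: UserStore, messages: Message[]): void {
  for (const message of messages) {
    store.ban(message.senderId);
  }
}

// Sender IDs deliberately differ from both array indexes and message IDs,
// so passing the wrong integer bans the wrong account and fails the test.
function testBansOnlySenders(): void {
  const store = new UserStore();
  banSendersOfMessages(store, [
    { id: 900, senderId: 41 },
    { id: 901, senderId: 57 },
  ]);
  for (const userId of [41, 57]) {
    if (!store.isBanned(userId)) throw new Error(`user ${userId} should be banned`);
  }
  for (const userId of [0, 1, 900, 901]) {
    if (store.isBanned(userId)) throw new Error(`user ${userId} should not be banned`);
  }
}

testBansOnlySenders();
```

The test only earns its keep if the fixture data makes the wrong integers distinguishable from the right ones; with sender IDs 0 and 1 an index bug would pass unnoticed.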

[+] nivertech|3 years ago|reply

s/comming from/coming from/

s/string static typing/strict static typing/

as opposed to the weak static typing in languages like C/C++/etc.

[+] alxmng|3 years ago|reply

Disagree. UUIDs are a waste of space, and they aren’t sorted.

Use 64-bit ints. Prefer not exposing them to users, but it’s not a problem for most software.

If you ever get so big you need to shard, use Snowflake or a 64bit scheme that encodes the shard.

Don’t overcomplicate things. You’re already using either SQLite or PostgreSQL and it already gives you auto-incrementing integer keys by default, without the need to encode/decode UUIDs in whatever software you’re writing to interface with it.
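
For the sharded case, a rough TypeScript sketch of a Snowflake-style 64-bit ID that encodes the shard. It assumes the commonly cited layout of 41 bits of milliseconds, 10 bits of shard and 12 bits of sequence; the epoch value and function names are invented for the example.

```typescript
// Layout assumed here: 41 bits of milliseconds since a custom epoch,
// then 10 bits of shard number, then a 12-bit per-millisecond sequence.
const EPOCH_MS = 1600000000000n; // hypothetical custom epoch
const SHARD_BITS = 10n;
const SEQUENCE_BITS = 12n;

function makeId(timestampMs: bigint, shard: bigint, sequence: bigint): bigint {
  const elapsed = timestampMs - EPOCH_MS;
  if (elapsed < 0n || elapsed >= 1n << 41n) {
    throw new RangeError("timestamp out of range");
  }
  if (shard < 0n || shard >= 1n << SHARD_BITS) {
    throw new RangeError("shard out of range");
  }
  if (sequence < 0n || sequence >= 1n << SEQUENCE_BITS) {
    throw new RangeError("sequence out of range");
  }
  return (elapsed << (SHARD_BITS + SEQUENCE_BITS)) | (shard << SEQUENCE_BITS) | sequence;
}

// Recover the shard from an ID, e.g. to route a lookup.
function shardOf(id: bigint): bigint {
  return (id >> SEQUENCE_BITS) & ((1n << SHARD_BITS) - 1n);
}

const example = makeId(EPOCH_MS + 1n, 5n, 7n);
shardOf(example); // 5n
```

Because the timestamp occupies the high bits, IDs created later compare greater than earlier ones regardless of shard, which keeps them roughly time-sorted in an index.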

[+] tijsvd|3 years ago|reply

Take incrementing int32. Extend to 64 bits. Multiply by large prime number (e.g. fnv32 prime). Mod 1 million. Add 1 million. End up with random looking, 7 digits, nicely sequenced, 32 bit integers. Write an exhaustive test to verify.

When near 1 million users (yagni), reset sequence and do the same with 10 million (or one billion).

Doesn't solve the upside of 128-bit random numbers (ala uuid): the ability to generate remotely and expect no collision.
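
The recipe above can be sketched in TypeScript roughly as follows, including the exhaustive test. `publicId` and `hasNoCollisions` are names made up for the example; the constant is the 32-bit FNV prime.

```typescript
const FNV32_PRIME = 16777619; // 32-bit FNV prime
const SPACE = 1000000;

// Maps sequence numbers 0..999999 onto scattered 7-digit IDs.
// The largest product stays below 2^53, so plain number arithmetic is exact.
function publicId(sequence: number): number {
  if (!Number.isInteger(sequence) || sequence < 0 || sequence >= SPACE) {
    throw new RangeError("sequence exhausted; move to a larger space");
  }
  return ((sequence * FNV32_PRIME) % SPACE) + SPACE;
}

// Exhaustive check that no two sequence numbers share a public ID.
function hasNoCollisions(): boolean {
  const seen = new Uint8Array(SPACE);
  for (let s = 0; s < SPACE; s++) {
    const slot = publicId(s) - SPACE;
    if (seen[slot] === 1) return false;
    seen[slot] = 1;
  }
  return true;
}

publicId(1); // 1777619
publicId(2); // 1555238
hasNoCollisions(); // true
```

The mapping is collision-free because the prime shares no factor with 1,000,000 (it is divisible by neither 2 nor 5), so multiplication modulo 1,000,000 is a permutation. It is obfuscation rather than secrecy: anyone who knows the multiplier can invert it.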

[+] onion2k|3 years ago|reply

... and they aren’t sorted.

I don't imagine there are many valid use cases where you want a list of users sorted by their database record ID though, and if you're suggesting an auto-incremented int then the creation timestamp will give you the same order anyway.

[+] sonicgear1|3 years ago|reply

UUIDv1 can be sorted by time of creation. That said UUIDs are pretty good, they are hard to guess and can be generated anywhere without a coordinator.

[+] fastball|3 years ago|reply

Why does it matter if they're sorted?

[+] _3u10|3 years ago|reply

In practice it works just fine. Lots of really big and small companies do it every day and somehow manage not to collapse.

In theory there are all sorts of negative consequences to using integers that you’ll run into once in a while and that will send you to Google for a solution. However, every day all of your queries and all of your inserts will be faster than with anything else.

Also, it will be widely supported by any 3rd party code you use to build your project which is probably more important than anything else.

[+] Mikhail_Edoshin|3 years ago|reply

The mistake has nothing to do with 'int' lacking something. 'int' is totally fine as a user ID or anything else. Everything in a computer is essentially an 'int'. Every character in this text, for example, is an integer.

The problem is that 'ban_account' changes data in a user record. Hence it should be a method of a 'User' class. And the 'Message' class should have a way to fetch an instance of 'User' for the sender. Here's the right way:

    ban_senders_of_messages(messages) {
      for (i = 0; i < messages.size; ++i)
          messages[i].sender.ban();
    }

[+] actuallyalys|3 years ago|reply

One solution, available in most languages, is to not use manual indexing at all. Use a for-each loop, iterator or similar abstraction. Sometimes using an index or counter is unavoidable, of course, but hopefully seeing an indexed loop, especially combined with a comparatively risky operation like banning, would trigger more scrutiny during a code review, more testing, or both.

Avoiding manual indexing doesn't prevent you from sticking other stray integers in place of user IDs (like specifying message IDs instead of the user ID that sent the message), but most of them are not biased toward the start like indexes are.

Still, I think it's prudent to avoid these kinds of errors at all if you can. Perhaps a good reason to switch to UUIDs for all primary keys, even if the normal concerns about enumerability don't apply.

[+] rahimnathwani|3 years ago|reply

This seems to be the same point about type safety that's made in this recent front page post:

https://news.ycombinator.com/item?id=33844117

[+] makach|3 years ago|reply

Is it just user-ids we should discuss or is it all kinds of ids represented as an int? On one side you have convenience and simplicity, and on the other side you have a set of experience, best practices and security considerations to take into account.

It can be overwhelming because the correct way seems obvious: listen to the experts and do as somebody thinks you should do. But it is never as easy as that. You need to balance your structure. Everything you do has a performance hit in some way or another, and you need to consider the impact a given practice will have in your environment.

[+] Semaphor|3 years ago|reply

> Is it just user-ids we should discuss or is it all kind of ids represented as an int?

Strongly typed IDs are a thing advocated for in general [0], yes.

And they aren’t even inconvenient in some languages. Modern C# for example makes it very easy to use them [1].

[0]: https://andrewlock.net/using-strongly-typed-entity-ids-to-av...

[1]: https://github.com/andrewlock/StronglyTypedId

[+] mardix|3 years ago|reply

UUID4 can be good for a UserID, but sometimes, to avoid having a second index in the DB, ULID is another good option: not only does it offer the uniqueness of UUID4, it is also sortable.

[+] fire|3 years ago|reply

TIL about ULID[0]; It looks interesting, but the spec was last touched in 2019 and I haven't really heard of it before... is it actively used?

Also curious because I don't actually know: If a format spec is GPL, does that encumber implementations of said spec?

0: https://github.com/ulid/spec

[+] nivertech|3 years ago|reply

ULID leaks information about when the user was created (while serial IDs leak the order in which users were created and their cardinality).

I'm using ULIDs for the cases when the entity is publicly ordered by time: any timestamped event, e.g. a chat message, a log entry, or a sensor measurement/metric.

But you need to be careful that the timestamp resolution is fine enough.

Also, while ULID might be good for optimizing RDBMS indexes, it might create hotspots in NoSQL K/V stores (i.e. all entities will be created on the same node in the cluster).
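
To show what the leak looks like in practice, here is a TypeScript sketch that encodes and decodes only the time portion of a ULID: per the ULID spec, the first 10 characters are a 48-bit millisecond timestamp in Crockford base32 and the remaining 16 are randomness. `encodeTime` and `decodeTime` are names chosen for this example, not taken from a particular library.

```typescript
// Crockford base32 alphabet used by the ULID spec (no I, L, O, U).
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// Encode a millisecond timestamp as the 10-character ULID time prefix.
function encodeTime(ms: number): string {
  if (!Number.isInteger(ms) || ms < 0 || ms >= Math.pow(2, 48)) {
    throw new RangeError("timestamp must fit in 48 bits");
  }
  let out = "";
  let rest = ms;
  for (let i = 0; i < 10; i++) {
    out = CROCKFORD.charAt(rest % 32) + out;
    rest = Math.floor(rest / 32);
  }
  return out;
}

// Recover the creation time from any ULID by reading its first 10 characters.
function decodeTime(ulid: string): number {
  const prefix = ulid.slice(0, 10).toUpperCase();
  if (prefix.length !== 10) throw new Error("ULID too short");
  let ms = 0;
  for (let i = 0; i < prefix.length; i++) {
    const digit = CROCKFORD.indexOf(prefix.charAt(i));
    if (digit < 0) throw new Error(`invalid ULID character: ${prefix.charAt(i)}`);
    ms = ms * 32 + digit;
  }
  return ms;
}

encodeTime(1469918176385); // "01ARYZ6S41"
decodeTime("01ARYZ6S41TSV4RRFFQ69G5FAV"); // 1469918176385
```

Anyone holding the ID can run the equivalent of `decodeTime`, which is why the format suits publicly timestamped events better than user IDs. The fixed-width, most-significant-first encoding is also what makes plain string comparison agree with time order.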

[+] suralind|3 years ago|reply

Have a look at UUIDv7 I guess, though databases probably don't have support for generating it.

[+] janaagaard|3 years ago|reply

I think it goes a long way to always use long names for ID properties. So always call it userId, companyId etc. instead of simply id everywhere.

[+] iancarroll|3 years ago|reply

Unpredictable user IDs are also an important security feature. At every tech company, developers inevitably write an endpoint that accidentally operates on user data without authorization checks. It is much easier for an attacker to iterate over integers than to figure out a separate way to leak a list of user UUIDs.

[+] andyp-kw|3 years ago|reply

A hacker knowing the user ID should not be a problem in itself if both authorization and authentication are used by the endpoint.

[+] dang|3 years ago|reply

Discussed at the time:

User IDs probably shouldn't be passed around as ints - https://news.ycombinator.com/item?id=16946557 - April 2018 (84 comments)

[+] shagie|3 years ago|reply

If you're not going to do math on it, don't pass it around as a number.

[+] mjevans|3 years ago|reply

On one hand I agree.

On the other, legal names change, for lots of reasons. It makes a LOT of sense to store a UID the way *NIX systems often do: an explicit indirect lookup table entry.

[+] de6u99er|3 years ago|reply

I consider using (G)UIDs anyways best practice. They are super practical when merging databases.

56 comments