More and more I'm noticing this narrative of "if I write awful code with obvious mistakes and somehow nobody else notices it in code review, it's never our fault, the language and its typing system or safety features should've stopped me" gaining popularity in the programming world and I really have to ask, what happened to programmers that actually knew what they were doing instead of expecting the computer to tell them what to do?
Because people are realising in general how much good type systems can help prevent such errors? The well was poisoned for a while with terrible type systems that either provided marginal safety or were just horrible to work with (e.g. complete lack of inference), but that is now seriously changing.
Enough of the "hard problems" are solved that developers are expected to use off-the-shelf type solutions. As a result, they're not getting hands-on experience solving hard problems, or challenging each other to think more deeply about solutions.
Software systems have been able to grow mostly because developers have been able to delegate a lot of diligence to tools.
Of course we could require a developer to "know what they are doing", many work environments do. However, you won't see many posts about that. First of all because it doesn't scale, and secondly because it doesn't make for interesting reading.
I learned never to use autoinc IDs in anything, especially not URLs, where they leak DB info, etc. However, I've seen it many times where younger devs do exactly that because they're learning from bad online books or tutorials. Every generation of devs needs to relearn the same lessons.
I also suspect that many devs today are learning platforms first and software skills last. This was the reverse for many devs who came from building it the hard way and then using platform tooling to simplify. Newer devs are looking for the tooling to provide the core-skills guardrails.
2. use typed schemas for APIs, i.e. GraphQL with a custom scalar type for UserID. Other typed schemas for APIs: OpenAPI, AsyncAPI, protobufs/gRPC, etc.
// add GraphQL parsing and serialization code
scalar UserID
3. Make the implicit - explicit. Don't use naked primitive types in your code (just as you would not use naked literal constants, i.e. use PI instead of 3.1415...). Use a PL with the string static typing (preferably Algebraic Type System). Define a type for UserID which cannot be mixed with integers, i.e.
type UserID = UserID int
4. in dynamic PLs you can use tagged tuples or tagged maps/structs, i.e.
{:user_id, 1234}
or
{"_type": "UserID", "id": 1234}
5. validate all external inputs (even those coming from the DB or Message Broker):
validateUserID: int -> UserID
6. The examples above assume integer representation of the user id, here's if we switch to UUIDv4:
type UUIDv4 = UUIDv4 string
type UserID = UserID UUIDv4
validateUserID: string -> UserID
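A rough Python sketch of points 3, 5 and 6 above: the user ID gets its own wrapper type, and untrusted input has to pass through a validation function before it becomes one. Python only enforces this at run time (or via an external type checker), so this is weaker than a language with real newtypes. The names `UserID`, `validate_user_id` and `ban_account` are made up for illustration.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class UserID:
    """Wrapper so a user ID cannot be confused with a bare string or int."""
    value: uuid.UUID


def validate_user_id(raw: str) -> UserID:
    """Parse untrusted input into a UserID, rejecting anything but a UUIDv4."""
    parsed = uuid.UUID(raw)  # raises ValueError on malformed input
    if parsed.version != 4:
        raise ValueError(f"expected a UUIDv4, got version {parsed.version}")
    return UserID(parsed)


def ban_account(user_id: UserID) -> str:
    # Hypothetical operation: refuses anything that is not a UserID.
    if not isinstance(user_id, UserID):
        raise TypeError("ban_account requires a UserID")
    return f"banned {user_id.value}"
```

With this in place, passing a loop index or a message ID to `ban_account` fails immediately instead of silently banning whoever happens to own that number.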
My one critique of this prescription is that UUIDs are not identifiable as user IDs. IMO you're better off with a format like "user{20 random characters}"
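Something like the following would produce IDs in that shape. It is only a sketch: the character set, the helper names and the `msg` prefix are assumptions, not part of any standard.

```python
import secrets
import string

# Assumed character set for the random part.
ALPHABET = string.ascii_lowercase + string.digits


def new_prefixed_id(prefix: str, length: int = 20) -> str:
    """Return the prefix followed by `length` random characters."""
    return prefix + "".join(secrets.choice(ALPHABET) for _ in range(length))


def kind_of(identifier: str, known_prefixes=("user", "msg")) -> str:
    """Tell what kind of entity an ID refers to by looking at its prefix."""
    for prefix in known_prefixes:
        if identifier.startswith(prefix):
            return prefix
    raise ValueError(f"unrecognised ID: {identifier!r}")
```

The point is that a user ID pasted into a log line or passed to the wrong function is recognisable at a glance, which a bare UUID is not.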
UUID also suffers from a serious issue: there is no single standard way to encode them as a string or in binary format.
In binary, a UUID can be encoded as big-endian, little-endian or even mixed-endian, and while the string encoding is usually lowercase with certain hyphenation rules, I've seen variations of that.
The (hexadecimal) text encoding is also quite inefficient compared to more modern standards like ULID or kSUID.
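Python's standard `uuid` module happens to expose two of these binary layouts, which makes the ambiguity easy to see. The UUID value below is just an arbitrary example.

```python
import uuid

u = uuid.UUID("12345678-1234-5678-1234-567812345678")

big_endian = u.bytes        # all fields in big-endian order
mixed_endian = u.bytes_le   # first three fields in little-endian order

print(big_endian.hex())
print(mixed_endian.hex())

# The canonical text form spends 36 characters on 16 bytes of data.
print(str(u), len(str(u)))
```

Reading one layout as if it were the other yields a different UUID, so both sides of a binary protocol have to agree on which one is in use.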
And if you use UUIDs for user IDs you're probably using them for other things. The same sort of mistake can happen.
The only real defense here is language-level enforcement. Allow declaration of a subtype that is not assignment-compatible with the parent even though it's identical.
I don't do any web-facing stuff but the only bits of code that know about things like IDs are the database stuff. All the logic works with classes that contain the ID and relevant data--you always pass the class, not the ID.
I'll get back on my "tests are better than types" hobby horse and say "or write a single test for this function". The problem here is that an engineer was able to push untested code to prod that irrevocably modifies the database.
You need a test here because types won't tell you that a user was banned. And that test would also have caught this error.
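For what it's worth, the kind of test meant here could be as small as the sketch below. The `User`, `Message` and `ban_senders` definitions are hypothetical stand-ins, not the code from the original incident.

```python
from dataclasses import dataclass


@dataclass
class User:
    id: int
    banned: bool = False


@dataclass
class Message:
    id: int
    sender_id: int


def ban_senders(messages, users_by_id):
    """Ban the sender of every message in `messages`."""
    for message in messages:
        users_by_id[message.sender_id].banned = True


def test_ban_senders_only_bans_senders():
    users = {uid: User(uid) for uid in (0, 1, 2, 42, 99)}
    spam = [Message(id=1, sender_id=42), Message(id=2, sender_id=99)]

    ban_senders(spam, users)

    banned = {uid for uid, user in users.items() if user.banned}
    # A loop-index or message-ID mix-up would ban users 0, 1 or 2 instead.
    assert banned == {42, 99}
```

The fixture deliberately includes users whose IDs collide with plausible loop indexes and message IDs, so the test fails if either gets passed in place of the sender ID.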
Disagree. UUIDs are a waste of space, and they aren’t sorted.
Use 64-bit ints. Prefer not exposing them to users, but it’s not a problem for most software.
If you ever get so big you need to shard, use Snowflake or a 64bit scheme that encodes the shard.
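A minimal sketch of such a 64-bit scheme, loosely modelled on the commonly described Snowflake layout (41-bit millisecond timestamp, 10-bit shard, 12-bit sequence). The exact bit widths and function names are assumptions for illustration, and the timestamp is taken to be relative to some epoch of your choosing.

```python
TIMESTAMP_BITS = 41
SHARD_BITS = 10
SEQUENCE_BITS = 12


def make_id(timestamp_ms: int, shard: int, sequence: int) -> int:
    """Pack timestamp, shard and per-millisecond sequence into one integer."""
    if not 0 <= timestamp_ms < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp out of range")
    if not 0 <= shard < (1 << SHARD_BITS):
        raise ValueError("shard out of range")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence out of range")
    return (
        (timestamp_ms << (SHARD_BITS + SEQUENCE_BITS))
        | (shard << SEQUENCE_BITS)
        | sequence
    )


def split_id(value: int) -> tuple:
    """Recover (timestamp_ms, shard, sequence) from a packed ID."""
    sequence = value & ((1 << SEQUENCE_BITS) - 1)
    shard = (value >> SEQUENCE_BITS) & ((1 << SHARD_BITS) - 1)
    timestamp_ms = value >> (SHARD_BITS + SEQUENCE_BITS)
    return timestamp_ms, shard, sequence
```

The three fields add up to 63 bits, so the result fits in a signed 64-bit column, and IDs still sort by creation time because the timestamp occupies the high bits.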
Don’t overcomplicate things. You’re already using either SQLite or PostgreSQL and it already gives you auto-incrementing integer keys by default, without the need to encode/decode UUIDs in whatever software you’re writing to interface with it.
Take incrementing int32. Extend to 64 bits. Multiply by large prime number (e.g. fnv32 prime). Mod 1 million. Add 1 million. End up with random looking, 7 digits, nicely sequenced, 32 bit integers. Write an exhaustive test to verify.
When near 1 million users (yagni), reset sequence and do the same with 10 million (or one billion).
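Spelled out in Python, the recipe looks roughly like this. Python integers don't overflow, so the "extend to 64 bits" step is implicit; the function names are made up.

```python
FNV32_PRIME = 16_777_619  # the 32-bit FNV prime
SPACE = 1_000_000


def public_id(sequence_number: int) -> int:
    """Map 0..SPACE-1 onto scrambled 7-digit numbers 1_000_000..1_999_999."""
    if not 0 <= sequence_number < SPACE:
        raise ValueError("sequence exhausted; move to a larger space")
    return (sequence_number * FNV32_PRIME) % SPACE + SPACE


def is_collision_free() -> bool:
    """Exhaustive check: every sequence number maps to a distinct public ID."""
    return len({public_id(n) for n in range(SPACE)}) == SPACE
```

There are no collisions because the prime shares no factor with 1,000,000, so multiplication modulo 1,000,000 is a permutation. Note that the mapping is easy to invert, so this only makes the IDs look random; it does not hide anything from a determined reader.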
Doesn't solve the upside of 128-bit random numbers (ala uuid): the ability to generate remotely and expect no collision.
I don't imagine there are many valid use cases where you want a list of users sorted by their database record ID though, and if you're suggesting an auto-incremented int then the creation timestamp will give you the same order anyway.
In practice it works just fine. Lots of really big and small companies do it every day and somehow manage not to collapse.
In theory there are all sorts of negative consequences to using integers that you’ll run into once in a while and that will require a google for a solution. However, every day all of your queries and all of your inserts will be faster than with anything else.
Also, it will be widely supported by any 3rd party code you use to build your project which is probably more important than anything else.
The mistake has nothing to do with 'int' lacking something. 'int' is totally fine as a user ID or anything else. Everything in a computer is essentially an 'int'. Every character in this text, for example, is an integer.
The problem is that 'ban_account' changes data in a user record. Hence it should be a method of a 'User' class. And the 'Message' class should have a way to fetch an instance of 'User' for the sender. Here's the right way:
ban_senders_of_messages(messages) {
    for (i = 0; i < messages.size; ++i)
        messages[i].sender.ban();
}
One solution, available in most languages, is to not use manual indexing at all. Use a for-each loop, iterator or similar abstraction. Sometimes using an index or counter is unavoidable, of course, but hopefully seeing an indexed loop, especially combined with a comparatively risky operation like banning, would trigger more scrutiny during a code review, more testing, or both.
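To make the difference concrete, here is a small Python sketch with a hypothetical `Message` type. The first function reproduces the kind of slip an indexed loop invites; the second has no index to mix up.

```python
from dataclasses import dataclass


@dataclass
class Message:
    id: int
    sender_id: int


def senders_indexed_buggy(messages):
    """Indexed loop with the slip: collects the index, not the sender."""
    result = []
    for i in range(len(messages)):
        result.append(i)  # bug: should be messages[i].sender_id
    return result


def senders_for_each(messages):
    """For-each loop: there is no stray integer in scope to pass by mistake."""
    result = []
    for message in messages:
        result.append(message.sender_id)
    return result
```

Given two messages sent by users 501 and 502, the buggy version returns `[0, 1]`, which are exactly the low-numbered accounts an index tends to hit.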
Avoiding manual indexing doesn't prevent you from sticking other stray integers in place of user IDs (like specifying message IDs instead of the user ID that sent the message), but most of them are not biased toward the start like indexes are.
Still, I think it's prudent to avoid these kinds of errors at all if you can. Perhaps a good reason to switch to UUIDs for all primary keys, even if the normal concerns about enumerability don't apply.
Is it just user IDs we should discuss, or all kinds of IDs represented as an int? On one side you have convenience and simplicity; on the other you have a body of experience, best practices, and security considerations to take into account.
It can be overwhelming because the correct way seems obvious: listen to the experts and do as somebody thinks you should do. But it is never as easy as that. You need to balance your structure. Everything you do has a performance hit in some way or another, and you need to consider the impact said practice will have in your environment.
UUID4 can be good for UserID. But sometimes, to avoid having a second index in the DB, ULID is another good option: it not only offers the uniqueness of UUID4, it is also sortable.
ULID leaks information about when the user was created (while serial IDs leak the order in which users were created and their cardinality).
I'm using ULIDs for the cases when the entity is publicly ordered by time: any timestamped event, e.g. a chat message, a log entry, or a sensor measurement/metric.
But you need to be careful that the timestamp resolution is fine enough.
Also, while ULID might be good for optimizing RDBMS indexes, it might create hotspots in NoSQL K/V stores (i.e. all entities will be created on the same node in the cluster).
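For anyone who hasn't looked inside one: a ULID is a 48-bit millisecond timestamp followed by 80 random bits, written as 26 Crockford base32 characters. The sketch below is a hand-rolled illustration with made-up function names, not a spec-complete implementation (it ignores the monotonicity rules, for instance), but it shows both the sortability and the timestamp leak.

```python
import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def make_ulid(timestamp_ms=None, randomness=None) -> str:
    """Build a ULID-shaped string from a timestamp and 80 random bits."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if randomness is None:
        randomness = int.from_bytes(os.urandom(10), "big")
    if not 0 <= timestamp_ms < (1 << 48) or not 0 <= randomness < (1 << 80):
        raise ValueError("field out of range")
    value = (timestamp_ms << 80) | randomness
    # 26 characters of 5 bits each, most significant first.
    return "".join(CROCKFORD[(value >> (5 * i)) & 31] for i in reversed(range(26)))


def ulid_timestamp_ms(ulid: str) -> int:
    """Recover the creation time that anyone holding the ID can read."""
    value = 0
    for char in ulid:
        value = value * 32 + CROCKFORD.index(char)
    return value >> 80
```

Because the timestamp sits in the leading characters, plain string comparison orders the IDs by creation time, which is also why writes cluster at one end of the key space.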
Unpredictable user IDs are also an important security feature. At every tech company, developers inevitably write an endpoint that accidentally operates on user data without authorization checks. It is much easier for an attacker to iterate over integers than to figure out a separate way to leak a list of user UUIDs.
On the other hand, legal names change, for lots of reasons. It makes a LOT of sense to store a UID the way *NIX systems often do: an explicit indirect lookup table entry.
[+] [-] bakugo|3 years ago|reply
[+] [-] bigDinosaur|3 years ago|reply
[+] [-] chrsig|3 years ago|reply
[+] [-] yakshaving_jgt|3 years ago|reply
[+] [-] kvdveer|3 years ago|reply
[+] [-] ascotan|3 years ago|reply
[+] [-] preommr|3 years ago|reply
[+] [-] nivertech|3 years ago|reply
[+] [-] Hermitian909|3 years ago|reply
[+] [-] unscaled|3 years ago|reply
[+] [-] thangngoc89|3 years ago|reply
[+] [-] LorenPechtel|3 years ago|reply
[+] [-] camgunz|3 years ago|reply
[+] [-] nivertech|3 years ago|reply
s/string static typing/strict static typing/
as opposed to the weak static typing in languages like C/C++/etc.
[+] [-] alxmng|3 years ago|reply
[+] [-] tijsvd|3 years ago|reply
[+] [-] onion2k|3 years ago|reply
[+] [-] sonicgear1|3 years ago|reply
[+] [-] fastball|3 years ago|reply
[+] [-] _3u10|3 years ago|reply
[+] [-] Mikhail_Edoshin|3 years ago|reply
[+] [-] actuallyalys|3 years ago|reply
[+] [-] rahimnathwani|3 years ago|reply
https://news.ycombinator.com/item?id=33844117
[+] [-] makach|3 years ago|reply
[+] [-] Semaphor|3 years ago|reply
Strongly typed IDs are a thing advocated for in general [0], yes.
And they aren’t even inconvenient in some languages. Modern C# for example makes it very easy to use them [1].
[0]: https://andrewlock.net/using-strongly-typed-entity-ids-to-av...
[1]: https://github.com/andrewlock/StronglyTypedId
[+] [-] mardix|3 years ago|reply
[+] [-] fire|3 years ago|reply
Also curious because I don't actually know: If a format spec is GPL, does that encumber implementations of said spec?
0: https://github.com/ulid/spec
[+] [-] nivertech|3 years ago|reply
[+] [-] suralind|3 years ago|reply
[+] [-] janaagaard|3 years ago|reply
[+] [-] iancarroll|3 years ago|reply
[+] [-] andyp-kw|3 years ago|reply
[+] [-] dang|3 years ago|reply
User IDs probably shouldn't be passed around as ints - https://news.ycombinator.com/item?id=16946557 - April 2018 (84 comments)
[+] [-] shagie|3 years ago|reply
[+] [-] mjevans|3 years ago|reply
[+] [-] de6u99er|3 years ago|reply