Sqids – Generate short unique IDs from numbers

[+] notfed|2 years ago|reply

I appreciate that the author clearly states that security, i.e., output can't be reversed back to the input, is a non-requirement. We can't criticize the author too much for that either, because, as a rule, "random-looking id generator" algorithms will always be either not secure, or not short, or not collision-free. Or they'll be a key-value database.

A secure "random-looking id generator" is called a block cipher. Block ciphers with less than 128 bits of output are widely considered insecure: this corresponds to about 22 base64 characters.

Going further, you probably do want to use a secure algorithm. History is full of people thinking "we don't need this to be secure, it's not used for security-sensitive things" and later regretting it. Some case studies at [1].

[1] https://www.schneier.com/blog/archives/2016/04/security_risk... .

[+] pixel8account|2 years ago|reply

It's good to consider this but... Plenty of sites expose user ID as a regular integer. In some cases you might want to avoid this (leaking user count to competitors etc), but I have never heard about anyone calling this a vulnerability.

[+] geek_at|2 years ago|reply

> i.e., output can't be reversed back to the input

But in the "Not good for" section it states the opposite: "[not good for] User IDs can be decoded, revealing user count" so it can be reversed?

[+] notfed|2 years ago|reply

I realize now the site lists "One-Time Passwords" as a use-case. I'm a positive-vibes-only HN poster though, so will leave it as an exercise to the reader to decide whether that's going to turn out well or not.

[+] 8organicbits|2 years ago|reply

The mention of one-time passcodes seems odd. Those need to be unguessable, but don't need to be unique. If you supply a suitable random source, then I suppose it works, but the "padded with junk" feature makes these look more complex than they really are.

The standard choice of 4 to 8 random digits works well and it's clear what level of security they provide. Digits are easier to understand than case sensitive latin characters, especially when your native language uses a different character set.

[+] pytness|2 years ago|reply

I think you completely misunderstood the article, or you have not read it.

The uniqueness from the system comes from the fact that two different numbers will never have the same id.

The pad only works for things like user ids.

You can also change the alphabet it uses so its not case sensitive; using this alphabet: "ABCDEFGHJKLMNPQRSTUVWXYZ0123456789" and a minimum of 8 digits, it will produce 8 digit ids all the way 4294967295 (0xffffffff).

[+] no_wizard|2 years ago|reply

I like the idea, though I use nanoid with the safe letter dictionary (it excludes letters used for profanity[0])

They should use a similar dictionary approach IMO because I looked at the implementation and it’s hardcoded to look for “bad” words

Otherwise looks real straightforward! I’d love to see some performance test suites for it

[0]: https://github.com/sqids/sqids-javascript/blob/ebca95e114932...

[1]: though with UUID v4 so common to generate and well optimized in most languages I wonder if these userland solutions are really better. You can always generate a UUID and re-encode with base32 or base64 with also is well optimized in most languages

[+] lxgr|2 years ago|reply

> it excludes letters used for profanity

That doesn't seem possible. How would that work?

> I looked at the implementation and it’s hardcoded to look for “bad” words.

If you mean https://github.com/y-gagar1n/nanoid-good, that seems to be doing the same thing.

In general, I'm a bit weary of solutions that "guarantee no bad words" – this is usually highly language-specific: One language's perfectly acceptable name is another language's swear word.

[+] seanhunter|2 years ago|reply

Grepping out naughty words in randomly generated text definitely strictly weakens the information content if you're using it for a secure application but is often necessary.

In the early dotcom era the company I worked for were about to go live and the final step was demoing the end to end flow to the ceo. I had done the back end stuff and hadn't paid much attention to the front-end. The person who did the account creation process wanted to nudge people to generate memorable yet strongish passwords, so when it created your account it would generate with a random password which he did by choosing 2 four letter words at random from the unix dictionary and putting a two digit number between them. He ran that past me as an idea and I thought "yeah, good idea" and didn't think more of it.

However he forgot to first grep out all the naughty words so when we demoed it to the CEO/non-technical founder both of the words in his randomly generated password were swearwords.

[+] tttp|2 years ago|reply

I tried something similar with a fixed alphabet that guarantees no profanity and a checksum (luhn)

https://github.com/tttp/dxid

[+] unknown|2 years ago|reply

[deleted]

[+] parhamn|2 years ago|reply

Side note: there are some business insights you can get from a company using serial ids.

i.e if you sign up and get user id 32588 and make another account a few days later, you can tell the growth rate of the company.

And this is possible with every resource type in the application.

I do wonder how much the url bar junk thing matters these days. I tend to use uulids (waiting on uuid v7 wide adoption), and they're a bit ugly, but most browsers hide most of the urls now anyway. The fact that there is a builtin time component comes in clutch sometimes (e.g. object merging rules).

[+] pacificmint|2 years ago|reply

> you can tell the growth rate of the company.

You can even do this when you don’t know the exact interval by using probabilities. The Allies used this method to estimate German tank production in World War II by analyzing the serial numbers of captured or destroyed tanks.

This is know as the German Tank Problem [1]

[1] https://en.wikipedia.org/wiki/German_tank_problem

[+] raggi|2 years ago|reply

I see so many organizations add weird slowdowns from debts associated with this. I reflect on some of the most successful tech businesses of the last decade and remember that all their APIs exposed this kind of data early on and many still do.

Does anyone have an example they can reference of a business being harmed by this information being out there?

[+] aequitas|2 years ago|reply

At an internship long ago, my boss instructed me to always add a few extra to the auto incremented order ID so customers couldn’t guess how business was going if they happen to order stuff quickly in a row.

[+] diznq|2 years ago|reply

The solution I prefer is to simply just encrypt the data such as IDs.

Instead of giving user an ID in response, user gets hmac(cipher(Data, secret_key), secret_key) + cipher(Data, secret_key) and then some simple pre-request handler just iterates over query params / form data and decrypts them if signature matches.

It also works as a really nice CSRF protection as user ID of currently signed user can be embedded into Data and checked if current user.id == decrypted data.id.

Another nice advantage is that you can deny the request right in the beginning as you know ahead of time that the provided data is not valid (signature doesn't match), saving some DB queries.

The down side is that URL gets pretty long though, but if that's hidden by browser or user doesn't care, it's a non-issue

[+] throwaway47747|2 years ago|reply

Can confirm, when I worked in VC we used this to verify order volume for a number of startups we were evaluating. For one startup, I wrote a bot to place a small order a few times a day, and log the order number.

[+] paulddraper|2 years ago|reply

> most browsers

Not chrome...

Also, links are a thing in chat, etc

[+] taneq|2 years ago|reply

I always take note of invoice numbers for this exact reason, they give you a feel for how busy they are.

[+] hinkley|2 years ago|reply

And this is the stuff you get if you manage to get your access control right.

Get it wrong and we jump from actionable business metadata to actionable business data (like perhaps which of your customer's customers are poachable)

[+] hot_gril|2 years ago|reply

Yes but just using Sqids doesn't fix this. Sqids are decodable. You need to use a random or otherwise unpredictable input.

What's the advantage of uuid7 (sequential + random) vs uuid4 (full random) for this?

[+] unknown|2 years ago|reply

[deleted]

[+] unknown|2 years ago|reply

[deleted]

[+] maxcan|2 years ago|reply

Yes, that's how I know I was roughly the 600,000th person to sign up for thefacebook.com.

[+] cortesoft|2 years ago|reply

Unless they are doing master-master replication so are incrementing by something other than 1

[+] dfc|2 years ago|reply

It's weird under "Get Started" they have links to 40 different languages. You can only get started with 15 of the 40 languages listed, the other 25 are skeleton repos asking for people to start the repo to indicate interest.

[+] LeFever|2 years ago|reply

It’s kinda clever. The people most likely to look at this project are also likely ideal candidates for implementing the library in a new language (Developer, FOSS enthusiast, interested in the project, need the library in a language they’re familiar with that isn’t implemented yet).

Also, the language pills differentiate between those that have been implemented (color logo, dark text, bold) and those that aren’t (grayscale).

[+] vyrotek|2 years ago|reply

The approach definitely works. Some time ago I saw .NET listed but discovered it wasn't complete. I was eager to replace an existing Hashids implementation so I made some comments, shared a starter-snippet, and then someone was excited enough to complete in just a few days. It was great to see how quick the community stepped in. Maybe there was a bit of Cunningham's Law in effect with my contribution, ha.

https://github.com/sqids/sqids-dotnet/issues/2#issuecomment-...

[+] hooverd|2 years ago|reply

Maybe a slam dunk first FOSS contribution?

[+] SebRollen|2 years ago|reply

The second example for each language sample where the generated squid ends up being “B4aajs” essentially reads as “P0ooop” to a Swedish speaker.

Which is fine, they don’t propose to filter “bad” words in other languages, but kind of funny when that’s one of the highlighted examples, right next to the goal of filtering words. Goes to show how hard it is to filter profanity generally for international audiences

[+] sandstrom|2 years ago|reply

Neat library!

We're using randomly generated strings for many things. IDs, password recovery tokens, etc. We've generated millions of them in our system, for various use-cases. Hundreds of thousands of people see them every day.

I've never heard any complaints about a random content-id being "lR8vDick4r" (dick) or whatever.

But nowadays our society is so afraid of offending anyone, that profanity filters has extended all the way to database IDs and password recovery tokens.

(there are some legit cases, like randomly generated IDs for user profiles shared in public URLs, that users have to live with, but even there just make the min length 8 and you're unlikely to have any full-word profanity as the complete ID; put differently, I don't understand why they made the block list an opt-out thing)

[+] progne|2 years ago|reply

In a Ruby app we just convert to a high base, like

  > 1234567890.to_s(36)
  => "kf12oi"

That gets us most of the way there, but Sqid has a Ruby library and lets you set a much higher base, including upper case characters, and I suppose, emoji. We're going to need much bigger numbers before that space savings makes much difference. I like it, but it's hard to know when something like that is worth adding a dependency.

[+] whalesalad|2 years ago|reply

This used to have a totally different name iirc, they used to be called hashids

[+] Schnitz|2 years ago|reply

It would be great to have a quick primer on why this is better than what people typically homebrew, like base62 encoding a random number.

[+] ComputerGuru|2 years ago|reply

I haven’t been able to find a case for this because ids either need to be unique or they’re not going to be large. If they’re unique, I’m using uuid or ulid (uuidv7 of tomorrow) as the sortable primary key type to avoid conflicts without using the db to generate and maintain sequences.

Where do you have unique ids that aren’t the primary key? I would be more interested in a retrospectively unique truncated encoding for extant ulid/uuid; ie given that we’ve passed timestamp foo, we know that (where no external data is merged) we only need a bucketed time granularity of x for the random component of the id to remain unique (for when sortability is no longer needed).

Or just more generally a way to convert a ulid/uuidv7 to a shorter sequence if we are using it for external hash table lookups only and can do without the timestamp component.

[+] its-summertime|2 years ago|reply

For a similar thing, (X bytes to X bytes, no collisions) https://en.wikipedia.org/wiki/Format-preserving_encryption is a good page

[+] waffle_ss|2 years ago|reply

I wrote a Ruby gem to address this problem of hiding sequential primary keys that uses a Feistel network to effectively shuffle int64 IDs: https://github.com/abevoelker/gfc64

So instead of

    /customers/1
    
    /customers/2

You'll get something like

    /customers/4552956331295818987
    
    /customers/3833777695217202560

Kinda similar idea to this library but you're encoding from an integer to another integer (i.e. it's format-preserving encryption). I like keeping the IDs as integers without having to reach for e.g. UUIDs

[+] Alifatisk|2 years ago|reply

On a side note, "Sqids ... is an open-source library that lets you generate YouTube-looking IDs from numbers.", "The main use of Sqids is purely visual."

If the purpose of it is to give a friendlier url / id, who not use something like friendly_id instead? (http://norman.github.io/friendly_id).

The url is readable and searchable through the history.

I would much rather prefer people using "www.website.com/channel/video/a-dog-walking" instead of "www.website.com/channel/video/3cXv8c".

[+] chrismorgan|2 years ago|reply

These are called slugs <https://en.wikipedia.org/wiki/Clean_URL#Slug>. If you’re willing to commit to a persistent slug, please do: they are nicer. But for many applications, e.g. most user-generated content, you can’t be confident they won’t change. There’s a reason for the general advice to not use data as a primary key in databases, but to use a separate ID.

[+] andix|2 years ago|reply

Actually I'm a bit disappointed that it can't format 128 bit integers or byte arrays. That would allow formatting UUIDs. I'm not a huge fan of public facing integer IDs. There is always the risk of leaking some kind of critical information with ascending IDs.

So I will probably keep Base64URL formatting my UUIDs to make them shorter for URLs, QR Codes and so on.

Quick example:

  20b30b32-d421-4cfb-bdbc-9a4e0475abea
  =>
  MguzICHU-0y9vJpOBHWr6g

It's important to keep in mind that there are different ways to convert an UUID to a byte array (little/big endian, order of segments, ..)

[+] wslh|2 years ago|reply

I offered something similar here [1] and it is used by many companies including Philip Morris, and the Argentinian tax agency for the same purposes.

The technique I used (I should publish it as open source) is using a Feistel cipher [2] with a key. The Feistel network could be adjusted to almost any size and the key used in every round is an expansion of a general key using a key derivation function [3] (KDF3 if I remember well).

Basically it is a symmetric cipher of arbitrary size.

[1] https://www.nektra.com/products/secure-coupon-code-generator...

[2] https://en.wikipedia.org/wiki/Feistel_cipher

[3] https://en.wikipedia.org/wiki/Key_derivation_function

[+] rdpintqogeogsaa|2 years ago|reply

Interestingly, it seems there might have been independent invention of the same idea[0]. Have you checked for any patents in the space?

[0] https://bytes.grubhub.com/why-we-use-crypto-when-generating-...

[+] smashed|2 years ago|reply

I'm not a cryptographer and did not understand half the things you said except symmetric crypto.

I think there are 2 problems with this approach:

How do you prevent the double spend problem, for example duplicate entry tickets. You would have to mark the ticket as used in a central database anyway to prevent it

What happens if the secret key material is compromised? Anyone can issue new valid numbers, etc..

Please correct me if I'm wrong.

[+] ZephyrP|2 years ago|reply

Feistel ciphers are a good technique for doing just this but it's also worth noting that if all you are looking for is "produce a pseudorandom permutation of 1..N without actually shuffling a list of numbers" you can also use an LFSR as well.

[+] orf|2 years ago|reply

How do you adjust or evolve the blocklist with this, without making previously generated IDs incorrect?

The ID is simply incremented if it is blacklisted [1]. So the ID is fixed to the blacklist content, and adjusting it in any way invalidates certain segments of previously generated IDs?

1. https://github.com/sqids/sqids-rust/blob/9f987886bc06875d782...

[+] hot_gril|2 years ago|reply

I think the "unique IDs" part of the title throws people off and brings security to mind. "The main use of Sqids is purely visual" is what you need to know. It's not necessarily for IDs, it's just a user-friendly way to encode/decode numbers.

[+] BraverHeart|2 years ago|reply

I see many people in this thread saying that this is a good way to hide insights from ids/numbers, I don't understand, aren't the generated values easily decoded? couldn't I just decode a couple of numbers to get that insight? What am I missing.

[+] dumbo-octopus|2 years ago|reply

Odd design decision in that if you provide your own blocklist, it overwrites their (extensive) default list instead of adding to it.

And in general the algorithm is surprisingly complicated for something that could be replaced with simply base64 encoding, the given example (1,2,3) base64 encodes to a string with just one more letter than this algorithm.

That said I do appreciate the semicolon-free-style. I don't typically see that in libs besides my own.

https://github.com/sqids/sqids-javascript/blob/main/src/sqid...

232 comments