Short, friendly base32 slugs from timestamps

[+] wahern|3 years ago|reply

For my current project I decided to try out 120-bit/15-byte identifiers. 120 is evenly divisible by 4 (hex), 5 (base32), 6 (base64), and 8. Neither base32 nor base64 encodings need padding. You can usually leave padding out regardless, but with 120 bits it's not even a choice you have to worry about, and there are no wasted bits.

[+] cakoose|3 years ago|reply

Having it be evenly divisible by all those bases is neat!

But I wouldn't say "no wasted bits" unless 120 bits is also exactly the right length for a particular use case. For example, if my use case would be served better by 64 bits, I'd prefer losing some bits to padding than extending it to 120 bits.

[+] kijin|3 years ago|reply

If you're going to change the order defined in RFC 4648, you might as well change the character set, too. Since one of the goals is to minimize the use of characters that can be confused for another, I would add back 8 and 9 and remove two of i, l and o.

[+] junon|3 years ago|reply

This is what crockford32 does by the way. It's my preferred base32 alternative when there's a chance humans have to transmit something.

[+] jamesfisher|3 years ago|reply

This removes '1' but not 'I'. But if you want to avoid ambiguity, you need to remove _both_ characters, not just one. Users don't know your IDs never contain a '1', so when they see 'I', they'll still think, "hmm, is that a 1?"

[+] KMag|3 years ago|reply

Crockford's version of base32 [0] (sometimes called crockford32) treats 0, o, and O the same when decoding, likewise for 1, i, I, l and L. It also doesn't use the letter U, to reduce the chances of including English obscenities.

So, you don't need to remove ambiguities if you can collapse the ambiguities into equivalence classes.

If you're going to use some flavor of base32, crockford32 is hard to argue against. I could bikeshed some alternatives (I would use the letter U and include Q in the 0 equivalence class, and not worry too much about obscenities), but of the somewhat common base32 standards, crockford32 is the best, IMNSHO.

[0] https://www.crockford.com/base32.html

[+] dahfizz|3 years ago|reply

If you are taking base32 as user input, you can translate '1' to 'I' so that the input is correct regardless of which the user types.

But, needing users to read / type a base32 ID is always bad UX anyway.

[+] iainmerrick|3 years ago|reply

It’s much clearer with lower-case “i” (which this article uses). Same situation with “o”.

[+] kortex|3 years ago|reply

Another option is to generate a ULID and just take the timestamp part (6 bytes, I forget how many characters). Uses Crockford Base32 already so no need for the custom alphabet.

[+] dahfizz|3 years ago|reply

That is definitely taking the long way around.... Why not just grab a timestamp and base32 it yourself? ULIDs also use millisecond timestamps, and use implementation magic to sort IDs within a millisecond. If you are just using the timestamp portion, you should use micro or nanoseconds.

[+] zlib|3 years ago|reply

I always go for Hashids (https://hashids.org/)

[+] vore|3 years ago|reply

I've never been sure of the use case of hashids. They're not really a security measure since they can be easily defeated; you might as well just expose the sequential ID at that point. If you do need something with better properties, you should probably be using UUIDv6.

[+] hoherd|3 years ago|reply

Fwiw, hashids do not solve OP's use case:

> Hashids is a small open-source library that generates short, unique, non-sequential ids from numbers.

vs the summary in the article:

> So now when I list atoms from my filesystem or in S3, they come back in the same order that I wrote them.

[+] tokumei|3 years ago|reply

Hashids are nice

[+] stavros|3 years ago|reply

I use shortuuid[0] for that stuff, which also omits the capital letter I, and has some other niceties (I wrote the library). It works really well, and I like how small the IDs are.

[0] https://github.com/skorokithakis/shortuuid

[+] bartwe|3 years ago|reply

Careful to not show these to end users as some may generate swearwords and such.

[+] jminter|3 years ago|reply

Nice post!

To me, "math/big" feels like a big hammer for working on an int64; a nice alternative is "encoding/binary":

  func atomSlug(publishedAt time.Time) string {
      b := make([]byte, 0, 8)
      b = binary.BigEndian.AppendUint64(b, uint64(publishedAt.Unix()))
      b = bytes.TrimLeft(b, "\x00")
      return lexicographicBase32Encoding.EncodeToString(b)
  }

[+] pavon|3 years ago|reply

Now that you made me think about it, I think both this approach and the original math.big one will cause more base32 digits to be added early than needed, breaking sorting earlier than needed, since slugs with different number of chars won't sort properly (think 1, 10, 11, ..., 2, 20...).

Right now the time is converted to 4 bytes which then uses 7 base32 characters to encode. When the time rolls over to 33 bits in 2106, it will increase to 5 bytes encoded as 8 base32 chars, even though 7 base32 chars have enough bits (35) to represent the number.

So perhaps a better way would be something like:

    func atomSlug(publishedAt time.Time) string {
        b := make([]byte, 0, 8)
        b = binary.BigEndian.AppendUint64(b, uint64(publishedAt.Unix()))
        s = lexicographicBase32Encoding.EncodeToString(b)
        return strings.TrimLeft(s, '2'); // '2' is all zero bits
    }

This will stick with 7 base32 chars until you exceed 35 bits in 3058.

[+] throwaway81523|3 years ago|reply

This is clever and thoughtful, but it encodes a 32-bit timestamp in seven base32 characters. It would have been simpler and just one more character to use eight hex digits. Alternatively, maybe go to 4 second resolution and use 30 bit timestamps, so they fit in six base32 chars instead of seven.

[+] pavon|3 years ago|reply

Those approaches would both rollover at 2106, breaking sorting, and you'd potentially start getting collisions around 2566. With his approach he is starting with a 63-bit timestamp (Go uses signed int64s), then math.big.Bytes is dropping leading 0 bytes, meaning it will add more characters as needed.

[+] foota|3 years ago|reply

Now I wish the standard for unix timestamps was in hex, thanks!

[+] oittaa|3 years ago|reply

One simple workaround could be to use something like UUID version 7[0], using just the first part that encodes the time, and dropping the rest.

[0]: https://github.com/ietf-wg-uuidrev/rfc4122bis

42 comments