
Show HN: Base24 binary-to-text encoding for humans

116 points| kuon | 6 years ago |kuon.ch | reply

68 comments

[+] wp381640|6 years ago|reply
Microsoft product keys were base-24 with the following alphabet:

> B C D F G H J K M P Q R T V W X Y 2 3 4 6 7 8 9

they were 115 bits encoded in 25 characters

see also human-oriented base32 encoding:

https://philzimmermann.com/docs/human-oriented-base-32-encod...

which includes this nice trick:

> We have permuted the alphabet to make the more commonly occuring characters also be those that we think are easier to read, write, speak, and remember.

edit: to add, an interesting human-readable and memorable base52 alphabet that I've never found a use for is to use playing cards

[+] VMG|6 years ago|reply
See also Bech32 which includes error correction and detection:

https://github.com/bitcoin/bips/blob/master/bip-0173.mediawi...

> Why not use an existing character set like RFC3548 or z-base-32? The character set is chosen to minimize ambiguity according to this visual similarity data, and the ordering is chosen to minimize the number of pairs of similar characters (according to the same data) that differ in more than 1 bit. As the checksum is chosen to maximize detection capabilities for low numbers of bit errors, this choice improves its performance under some error models.

[+] hajimemash|6 years ago|reply
Brings me back to the trusty old FCKGW-RHQQ2-YXRKT-8TG6W-2B7Q8
[+] Fnoord|6 years ago|reply
> We have permuted the alphabet to make the more commonly occuring characters also be those that we think are easier to read, write, speak, and remember.

Basically they removed vowels (except for y, if it counts as one) as non-vowels often include a vowel in their sound. A fact reinforced while teaching my toddler daughter letters, words, and numbers. On top of that, they removed l/1(/i, and also o/0), m/n, s/5, z. Not sure why they removed z. Perhaps because of 2?

I'm not sure this is universal either, sound-wise. I suppose it does count for English. Because 7 ("zeven") and 9 ("negen") in Dutch get confused when spoken, some people say "zeuven" instead of "zeven".

[+] WorldMaker|6 years ago|reply
> edit: to add, an interesting human-readable and memorable base52 alphabet that I've never found a use for is to use playing cards

The most famous example is Schneier's Solitaire [1]. It was a common encoding I liked to play with in HS classes even before reading Cryptonomicon. I still think about it sometimes when I read through a Duplicate Bridge story in that long syndicated newspaper column. (One of these days I will actually learn Bridge, maybe.)

[1] https://www.schneier.com/academic/solitaire/

[+] vsnf|6 years ago|reply
> edit: to add, an interesting human-readable and memorable base52 alphabet that I've never found a use for is to use playing cards

I encountered something similar to this a while ago when watching a multiplayer mod of Ocarina of Time[0]. They use a string of inventory item symbols to denote the identity of the server to connect to. “Hook shot, hook shot, master sword, deku nut” is a whole lot easier to remember than a long string of ASCII.

I guess it’d be something like a Base62 encoding.

[0]https://youtu.be/FLjIiVGPo_0

[+] garganzol|6 years ago|reply
Microsoft's Base24 has a much saner alphabet, as it avoids ambiguities between 5 and S and between 7 and Z, while the alphabet suggested by the OP falls into that pitfall.
[+] PeterisP|6 years ago|reply
It would make all sense to ensure that you don't have B and 8 in the same alphabet. Just as you don't want 1 and I, and 0 and O - pick any one of them, but not both.
[+] directionless|6 years ago|reply
I'm fond of using a base100, made up of 2-letter syllables. It results in a vaguely pronounceable string.

For syllables, I use: syllables: %w[ ba be bi bo bu ca ce ci co cu da de di do du fa fe fi fo fu ga ge gi go gu ha he hi ho hu ja je ji jo ju ka ke ki ko ku la le li lo lu ma me mi mo mu na ne ni no nu pa pe pi po pu ra re ri ro ru sa se si so su ta te ti to tu va ve vi vo vu wa we wi wo wu xa xe xi xo xu ya ye yi yo yu za ze zi zo zu ],

I can dump an implementation somewhere if people are really curious
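For readers who want to try the idea before an implementation is posted, here is a minimal sketch. It is not directionless's code: the function names, the most-significant-first digit order, and the choice to encode plain integers are all assumptions made for illustration. Only the syllable table follows the list above.

```python
# Hypothetical sketch: 20 consonants x 5 vowels = 100 two-letter syllables,
# generated in the same order as the list above (ba, be, bi, bo, bu, ca, ...).
CONSONANTS = "bcdfghjklmnprstvwxyz"
VOWELS = "aeiou"
SYLLABLES = [c + v for c in CONSONANTS for v in VOWELS]


def encode(n: int) -> str:
    """Encode a non-negative integer as base-100 syllables, most significant first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    out = []
    while True:
        n, digit = divmod(n, 100)
        out.append(SYLLABLES[digit])
        if n == 0:
            break
    return "".join(reversed(out))


def decode(s: str) -> int:
    """Inverse of encode: read the string two letters at a time."""
    if len(s) % 2:
        raise ValueError("length must be even")
    n = 0
    for i in range(0, len(s), 2):
        n = n * 100 + SYLLABLES.index(s[i:i + 2])
    return n
```

Each syllable carries one base-100 digit, so two decimal digits map to two letters.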

[+] codyb|6 years ago|reply
This is very similar to Dominic O’Brien’s techniques for remembering numbers where he assigns each number in base 10 a letter, then you go through and make characters and actions for each number 0 - 99.

So 0 - O, 1 - A, 2 -B... (with S for 6, and N for 9).

Then 00 -> Double O holding a pistol

OA - Your friend Oliver Anderson doing whatever Oliver Anderson does

Etc.

Then 0100 becomes your friend Oliver Anderson holding a pistol in your imagination and that’s easier to remember and you can make stories with your characters to remember phone numbers etc.

Once I started this comment I realized it may be just a tad bit more involved, but I wonder if you could combine the two and have characters for every number from 0000 - 9999.

[+] shakna|6 years ago|reply
> I can dump an implementation somewhere if people are really curious

Interested.

I've put together a few encoding libraries for fun when I get bored. (base16, morse, etc.)

This one looks fun, particularly because it _might_ be possible to serialise it to sound and back, if I put in a little bit of effort, which is something I've done [0] once or twice.

[0] https://git.sr.ht/~shakna/soundofsilence

[+] numpad0|6 years ago|reply
That’s just Japanese “Roman” alphabet:

Consonants [(None), k, s, t, n, h, m, y, r, w], followed by,

Vowels [a, i, u, e, o] forming 5x10 matrix,

+ semi-voiced ゜(p replaces h) and voiced ゛(g, z, d, b replace k, s, t, h) signs,

+ silent “nn”,

- wi wu we.

(aka NES Dragon Quest spell of resurrection)

[+] thelazydogsback|6 years ago|reply
And pipe them into TTS for all kinds of fun...
[+] anonsivalley652|6 years ago|reply
That's not "base-100" in terms of symbols or storage, it's functionally-identical to base-10! Be honest about how terrible it is.
[+] appwiz|6 years ago|reply
I’d love to see your implementation.
[+] excitedleigh|6 years ago|reply
Another interesting solution to this problem is that used by plus codes [1]:

> The characters that are used in Open Location Codes were chosen by computing all possible 20 character combinations from 0-9A-Z and scoring them on how well they spell 10,000 words from over 30 languages. This was to avoid, as far as possible, Open Location Codes being generated that included recognisable words. The selected 20 character set is made up of "23456789CFGHJMPQRVWX". [2]

[1]: https://plus.codes [2]: https://github.com/google/open-location-code/blob/master/doc...

[+] cortesoft|6 years ago|reply
Does not help with the ambiguous character problem at all, though
[+] tlhunter|6 years ago|reply
> The final alphabet I came up with is ZAC2B3EF4NH5TKL7P8RS9WXY. As I required 24 characters, I kept G and 6 which are the least ambiguous in the list.

I've read this a dozen times. Isn't OP saying that their character list includes G and 6, which are _not_ present in that list?

Update: It appears to be a typo in the article. Here's the real alphabet (N replaced by G and L replaced by 6): ZAC2B3EF4GH5TK67P8RS9WXY

https://github.com/kuon/java-base24/blob/0c25905414f1598a0ed...

[+] anonsivalley652|6 years ago|reply
S 5 6 G

P R

2 Z

8 B

look similar, depending on the font

It would be better to include some lower case characters which have more visual variability than trying to obsess over an arbitrary, inflexible stylistic "design."

[+] kuon|6 years ago|reply
Oh, yes, my bad, I will correct the article. I did use G and 6 in the end; I copy-pasted one of the other candidates. NL was a try.

Sorry about that.

[+] Groxx|6 years ago|reply
Though clearly there are some advantages with removing ambiguous chars... I feel like it's more of a UI / UX thing-to-polish than a problem. Lack of polish creates the problem, the ambiguous chars themselves are not inherently an issue.

If it's ambiguous, you could accept either and transform it to the correct value (implicitly, or as entered, or whenever makes sense. your users don't ever have to know). Or if you can't do that / the differences matter, do something like 1password does with chars and letters: show them differently https://www.dropbox.com/s/a29g2uiggqujzjl/screen%20shot%2020...
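The "accept either and transform it" approach can be sketched in a few lines. Everything here is an assumption for illustration: the alphabet and the look-alike table are borrowed from the general idea behind Crockford's Base32 (which leaves out I, L, O and U and folds I/L to 1 and O to 0 on input), and the `normalize` name is made up.

```python
# Hypothetical sketch: the alphabet and the look-alike table are assumptions
# chosen for illustration, not taken from any particular product.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # no I, L, O, U

# Map each look-alike the user might type to the symbol the alphabet contains.
LOOKALIKES = str.maketrans({"O": "0", "I": "1", "L": "1"})


def normalize(code: str) -> str:
    """Upper-case, drop separators, and fold look-alikes to canonical symbols."""
    cleaned = code.upper().replace("-", "").replace(" ", "")
    cleaned = cleaned.translate(LOOKALIKES)
    bad = sorted(set(cleaned) - set(ALPHABET))
    if bad:
        raise ValueError(f"unexpected characters: {bad}")
    return cleaned
```

This only works when at most one member of each confusable pair is in the alphabet, which is the same constraint the "remove ambiguous chars" camp starts from.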

[+] oefrha|6 years ago|reply
> do something like 1password does with chars and letters: show them differently

That’s missing the point. You can show them differently, but the point of keys / recovery codes is that they’ll be stored somewhere and later re-entered. Users could store them in any program (including writing them down or printing them out), you can’t control how they are displayed over there. Then when they need to use them, there’s a chance the ambiguous characters can’t be easily discerned.

[+] jansan|6 years ago|reply
With my old shareware product that really did not sell a lot, I got one phone call from a customer who was not able to enter the correct license code. Of course he had mixed up 0 and O. So yes, for some people it solves a problem.
[+] GeertB|6 years ago|reply
I really like the super high efficiency at the important multiple-of-4-byte increments. Using 7 base-24 characters to encode 32 bits is 99.7% efficient. However, I'd recommend using 7 base-24 digits followed by a blank as standard output format. This would allow for efficient 8 character <=> 32 bit conversions. Also, I think padding output to a multiple of 7 characters would be good, for similar reasons that it's good for base-64. Now you can concatenate encoded streams like you could byte streams, and recover on decode. As multiples of 32 bits are so common, padding would be used little in practice. On input, it would be fine to accept unpadded base-24 sequences, but valid base-24 output should always pad to a multiple of 7 chars (excluding the blanks that should be just for readability and not significant otherwise).

However, I strongly dislike the arbitrary mapping between character values and base-24 digits. There is a strong reason for using the order 23456789ABCEFGHKPRSTWXYZ, which is that now encoded values compare the same as the original binary values. I did appreciate the 0x00000000 == ZZZZZZZ equivalence, but consistent ordering is just way more important IMO. Also 2222222 looks a lot like ZZZZZZZ. Just saying.
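The 7-characters-per-32-bits arithmetic can be checked with a short sketch. It uses the alphabet quoted in tlhunter's comment above. The digit order inside a block (most significant first) and the function names are assumptions, so this should not be read as the reference implementation.

```python
import math

# Alphabet as quoted earlier in this thread; the digit order within a block
# (most significant first) is an assumption made for this sketch.
ALPHABET = "ZAC2B3EF4GH5TK67P8RS9WXY"


def encode_u32(value: int) -> str:
    """Encode one 32-bit value as exactly 7 base-24 characters."""
    if not 0 <= value < 2**32:
        raise ValueError("value must fit in 32 bits")
    chars = []
    for _ in range(7):
        value, digit = divmod(value, 24)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def decode_u32(text: str) -> int:
    """Inverse of encode_u32."""
    if len(text) != 7:
        raise ValueError("expected 7 characters")
    value = 0
    for ch in text:
        value = value * 24 + ALPHABET.index(ch)
    if value >= 2**32:
        raise ValueError("value does not fit in 32 bits")
    return value


# Fraction of the 7-character capacity actually used by 32 bits of payload.
efficiency = 32 / (7 * math.log2(24))
```

Because every block has a fixed width, swapping `ALPHABET` for the same 24 symbols in ASCII order would make encoded strings sort in the same order as the values they encode, which is the property argued for above.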

[+] kuon|6 years ago|reply
I thought about the comparison bit, and I wanted to go against it.

Ordered, your snippet looks like the alphabet with a few missing letters, and isn't searchable on Google or anything. I really wanted the alphabet to stand out.

I don't think that it is important that it can be sorted; it is intended for randomly generated keys which, in my experience, you won't be sorting.

[+] tjchear|6 years ago|reply
I think proquints [0] are pretty good at encoding for humans as well. For example, when used to encode IP addresses, they result in pronounceable identifiers like this:

  127.0.0.1       lusab-babad
  63.84.220.193   gutih-tugad
  63.118.7.35     gutuk-bisog

[0] https://arxiv.org/html/0901.4016
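The encoding behind those examples is small enough to sketch: each 16-bit word becomes consonant-vowel-consonant-vowel-consonant, with 4 bits per consonant and 2 bits per vowel. The function names below are made up for illustration.

```python
# Proquint tables: 16 consonants (4 bits each) and 4 vowels (2 bits each).
CONSONANTS = "bdfghjklmnprstvz"
VOWELS = "aiou"


def quint(word: int) -> str:
    """Encode one 16-bit word as consonant-vowel-consonant-vowel-consonant."""
    return (
        CONSONANTS[(word >> 12) & 0xF]
        + VOWELS[(word >> 10) & 0x3]
        + CONSONANTS[(word >> 6) & 0xF]
        + VOWELS[(word >> 4) & 0x3]
        + CONSONANTS[word & 0xF]
    )


def ip_to_proquint(ip: str) -> str:
    """Encode a dotted-quad IPv4 address as two hyphen-separated quints."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return quint((a << 8) | b) + "-" + quint((c << 8) | d)
```

Calling `ip_to_proquint` on the three addresses above reproduces the identifiers listed next to them.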
[+] kortex|6 years ago|reply
Nice! I'd like to implement this in a key-recovery tool I have been working on, Passcrux [1]. I actually started fleshing out a base24 encoding of my own, but the padding/bit shuffling proved to be somewhat cumbersome, and I shifted focus to abc16, which is like hex, but purely alphabetic.

[1] https://github.com/xkortex/passcrux

[+] davidcollantes|6 years ago|reply
Related; how to get Passcrux to compile? I couldn't find instructions on the repository, and would love to try it out. Thanks!
[+] kozak|6 years ago|reply
If you consider the scenario of dictating over the phone, letters can be confusing not just because of their written shape. For many non-English speakers, for example, E can be confused with I, and V can be confused with W, unless both sides use the same way of pronouncing them. Look how Microsoft's base24 alphabet (from the other comment) has neither E nor I.
[+] beojan|6 years ago|reply
That's what the NATO phonetic alphabet is for.
[+] thelazydogsback|6 years ago|reply
> decimal: 49894920630459842177293598641814316632

These 128 bits can also be represented in, let's say, base-50K, by using nine words chosen from a 50,000-word dictionary (each word carries about 15.6 bits). If you also make "this", "This" and "THIS" separate, then you can get away with a 17K-word dictionary. Depending on the language, if you use roots and then vary morphology based on number, tense, etc., then the number of root words (and the choice you have in making them simple) can be reduced. Such "pass phrases" can be easier to remember, transcribe, etc. (Also you will get random, humorous, offensive, etc., phrases...)
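The word count for a given dictionary size is easy to check with exact integer arithmetic. A minimal sketch, with made-up function names:

```python
import math


def words_needed(bits: int, dictionary_size: int) -> int:
    """Smallest word count whose combinations cover every value of `bits` bits."""
    count = 1
    while dictionary_size ** count < 2 ** bits:
        count += 1
    return count


def bits_per_word(dictionary_size: int) -> float:
    """Information carried by one word drawn from the dictionary."""
    return math.log2(dictionary_size)
```

`words_needed(128, 50_000)` and `words_needed(32, 2048)` can be compared against the figures quoted in this thread.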

[+] strags|6 years ago|reply
I recently needed to encode a 32-bit value into something easy for QA folks to remember and report. I opted for 3 words out of an 11-bit (2048 entry) dictionary of commonly used words.

How to build the dictionary? Well, in order to determine the most commonly used English words, I downloaded a bunch of free texts from Project Gutenberg, and did some simple filtering - nothing less than 5 letters, no duplication of singular + plural, etc...

A valuable lesson that I learned during this process is that when your corpus includes older English texts, you should always give your final list a visual once-over and apply some judicious manual filtering. I'm looking at you, "The Adventures of Tom Sawyer". (And, to a lesser extent, Moby Dick).
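The three-words-from-2048 scheme amounts to splitting the value into 11-bit indices. A rough sketch, not strags's code: the placeholder word list, the function names, and the most-significant-first order are all assumptions.

```python
# Placeholder 2048-entry word list; a real one would hold common words.
WORDS = [f"word{i:04d}" for i in range(2048)]
INDEX = {w: i for i, w in enumerate(WORDS)}


def to_words(value: int) -> list[str]:
    """Split a 32-bit value into three 11-bit indices (33 bits, top bit unused)."""
    if not 0 <= value < 2**32:
        raise ValueError("value must fit in 32 bits")
    return [WORDS[(value >> shift) & 0x7FF] for shift in (22, 11, 0)]


def from_words(words: list[str]) -> int:
    """Rebuild the value from three words."""
    if len(words) != 3:
        raise ValueError("expected 3 words")
    value = 0
    for w in words:
        value = (value << 11) | INDEX[w]
    if value >= 2**32:
        raise ValueError("not a valid 32-bit encoding")
    return value
```

Three 11-bit words give 33 bits, so the first word only ever uses half of the list.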

[+] SlowRobotAhead|6 years ago|reply
Not bad, I mean, I’m not lining up to implement it in C tomorrow, but if it gets an RFC I could definitely see using it.

I have an application where I’m using a 32-bit serial in case someone has to read it to sales staff over the phone. I would have liked to use 64 bits and encode some more details into the serial. This would satisfy that.

I like the idea of removing ambiguous chars. I have a Base64 system that prints in a font where I and l look the same (infuriating).

[+] rini17|6 years ago|reply
What about a Base1024 emoji encoding? There are already more than 1024 of them ;)

128 bits can then be represented by just 13 characters, or even fewer with modifiers.

[+] WorldMaker|6 years ago|reply
It's certainly a fun idea. If the goal is human readable, humans are surprisingly bad at differentiating emoji just by looking at them, especially all of the subtly different variation face ones. Describing them over something like a phone call could lead to all sorts of transcription mistakes. Not to mention that there's a variety of different emoji input systems/keyboards, and the amount of user skill in finding/picking emoji for text entry is hugely variable.
[+] Aardwolf|6 years ago|reply
> The data length must be multiple of 32 bits. There is no padding mechanism in the encoder.

Such a padding mechanism should not be necessary, and the padding in standard base64 is not necessary either. If you remove the trailing ='s you can still unambiguously decode it (despite the error some tools will give). URL-safe base64 (RFC 4648 §5) does not require padding and can represent any data length.
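The point about base64 can be demonstrated with Python's standard `base64` module, which is one of the tools that rejects unpadded input unless the padding is restored first. The two helper names are made up for this sketch.

```python
import base64


def encode_unpadded(data: bytes) -> str:
    """URL-safe base64 with the trailing '=' characters stripped."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def decode_unpadded(text: str) -> bytes:
    """Restore the padding implied by the length, then decode."""
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)
```

The amount of padding is fully determined by the length of the text, which is why dropping it loses no information as long as the end of the encoded string is known.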

[+] bograt|6 years ago|reply
In a similar vein, this is an encoding I designed specifically for 256 bit keys; my design includes checksumming and some consideration to consistent verbalization:

https://github.com/tomgibara/keycode

[+] garganzol|6 years ago|reply
An important note to the author: please put test vectors right in the spec!

The classics like "", "f", "fo", "foo", ..., "foobar" would suffice. If the encoding specifically works on numbers, put test vectors for those too.
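For reference, this is what the classic vectors look like for base64 (values from RFC 4648, section 10), wrapped in a small checker. The `check_vectors` helper is made up for this sketch. Since Base24 as described only accepts multiples of 32 bits, its spec would need its own inputs rather than these strings.

```python
import base64

# Base64 test vectors from RFC 4648, section 10.
VECTORS = [
    (b"", ""),
    (b"f", "Zg=="),
    (b"fo", "Zm8="),
    (b"foo", "Zm9v"),
    (b"foob", "Zm9vYg=="),
    (b"fooba", "Zm9vYmE="),
    (b"foobar", "Zm9vYmFy"),
]


def check_vectors(encode, decode) -> None:
    """Assert that an encoder/decoder pair reproduces every vector."""
    for raw, text in VECTORS:
        assert encode(raw) == text, (raw, text)
        assert decode(text) == raw, (raw, text)


check_vectors(
    lambda raw: base64.b64encode(raw).decode("ascii"),
    base64.b64decode,
)
```

A table like `VECTORS` in the spec lets every independent implementation run the same check.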

[+] aabbcc1241|6 years ago|reply
The author mentioned that it's confusing even when only one of a pair of similar characters is used (as in base 10), but the program can automatically resolve such typos (e.g. treat O as zero).

This doesn't require the user to be technical.