It would be easy to do. That 3-bit word just needs to start at 1 (since it encodes fewer than 8 options). Then the fixed 1-bit can instead encode whether another 7-bit segment follows.
The use of code points below 32 (space marks the start of what's usually considered "printable") makes me a bit hesitant. A lot of systems won't preserve those characters. Base85 is a more efficient alternative to base64, and doesn't use that lower range:
Base 85 also has the interesting property that 4 original bytes fit in 5 encoded bytes. Depending on your processor's memory model and the cost of multiplies compared to shifts, this can make it the best performer.
This was true on Vax 8200 hardware back in the day. In the same software, with Huffman decoding of JPEGs it was also fastest to create a finite state machine with an 8 bit symbol size. I suspect that is no longer true since it would kill your L1 cache and be well into your L2 cache on modern x86 machines. It is probably better to take the instruction count hit and process as bits or nibbles.
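The 4-bytes-in-5-characters packing mentioned above can be sketched as follows. This is a minimal illustration (the function name is mine, not any library's API) using the classic Ascii85 alphabet that starts at '!'; short final groups and padding are deliberately omitted:

```javascript
// Pack one group of 4 bytes into 5 base-85 digits, Ascii85-style.
function encodeGroup(bytes) {
  // Pack 4 bytes into one 32-bit big-endian word; >>> 0 keeps it unsigned.
  let n = ((bytes[0] << 24) | (bytes[1] << 16) | (bytes[2] << 8) | bytes[3]) >>> 0;
  const out = new Array(5);
  // Peel off base-85 digits with divides -- this is where the multiply/divide
  // cost discussed above comes from, versus pure shifts and masks in base64.
  for (let i = 4; i >= 0; i--) {
    out[i] = String.fromCharCode(33 + (n % 85)); // '!' is code 33
    n = Math.floor(n / 85);
  }
  return out.join('');
}

console.log(encodeGroup([0x4d, 0x61, 0x6e, 0x20])); // "Man " -> "9jqo^"
```

This works because 85^5 (about 4.44 billion) just exceeds 2^32, which is exactly why 85 is the smallest base that fits 4 bytes into 5 printable characters.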
I was going to bring up base 85 as well; it's a better choice for a variety of reasons. A long time ago I wrote a base encoder class in Java[1], mostly so that we could write a netnews reader in Java, but also because I felt UUEncoding was not robust. The challenges of using unprintable characters are a lot more of a headache than anyone pays attention to initially. Lots (and I mean quite a few here) of systems consider unprintable characters "safe" to re-purpose for random uses. One display vendor had them changing the color of future characters in the display, as an example.
Stick with the characters that nearly everyone assumes could legitimately come up in a document, and you lower your chances of running afoul of some "creative genius" who decided "Hey, it's unprintable so no one will try to print it, but when I do print it I want this thing to happen..."
I've never come across this one, thanks for the tip. I like the way it doesn't use every character in the range 0x20 to 0x7f. That last one in particular (0x7f or 'DEL') has always seemed problematic to me because it's a weird exceptional case. I've always thought someone must have screwed up with that one way back in the day. Using only '!' to 'u' and hence avoiding ' ' feels right for some reason, and I also like the cute trick of getting just a little bit of compression by using 'z' to represent 32 zero bits.
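The 'z' trick mentioned above is easiest to see on the decode side. A hedged sketch (the name is illustrative, not from any particular implementation): a lone 'z' stands in for an entire group of 32 zero bits.

```javascript
// Decode one 5-character Ascii85 group (or the 'z' shorthand) back to 4 bytes.
function decodeGroup(s) {
  if (s === 'z') return [0, 0, 0, 0]; // the cheap compression trick for zero runs
  let n = 0;
  for (const ch of s) {
    const d = ch.charCodeAt(0) - 33; // '!' maps to 0, 'u' maps to 84
    if (d < 0 || d > 84) throw new Error("character outside the '!'..'u' range");
    // 85^5 - 1 fits well inside a double's 2^53 exact-integer range, so this is exact.
    n = n * 85 + d;
  }
  // Split the 32-bit word back into big-endian bytes.
  return [(n >>> 24) & 0xff, (n >>> 16) & 0xff, (n >>> 8) & 0xff, n & 0xff];
}

console.log(decodeGroup('9jqo^')); // -> [0x4d, 0x61, 0x6e, 0x20], i.e. "Man "
```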
A lot of people have experimented with a lot of different ways of encoding binary data as printable text. Wikipedia has a list of different encoding schemes[0].
The most efficient one is yEnc[1]. Still the simplest ones such as base64 or good old hex may actually work better once compression comes into the picture.
It's crucial to evaluate encoding space usage in the context of compression. For instance gzip(base16(data)) is often smaller than gzip(base64(data)) for practical data. Even though base64 is more efficient than base16, it breaks up data across byte boundaries which then makes gzip significantly less efficient.
I appreciate the author building a more efficient alternative to base64, but I had to laugh for a moment at the suggested use case.
Using an alternative to base64 encoded data for web pages and Node is a horrible idea. If I came across a code base using it I would scream. A lot. None of my tools work with it, and now my browser has to run a bunch of JS for something that's normally native and very fast. It's a 1-2 kB savings per page that's going to make somebody jump off a bridge one day.
More terrifying is that the JavaScript community is so fascinated by shiny objects that thousands of people are going to use this. I'm not sure if it's more funny or terrifying.
Anyway, this encoding is still useful if you need to pass data between legacy systems. I've used HTML escapes and ASCII85 many times before to get around annoying old stuff that doesn't handle Unicode.
> None of my tools work with it, and now my browser has to run a bunch of js for something that's normally native and very fast.
I agree with the general idea of your post, but I disagree with this part. If nobody ever tries anything new we'd still be in the stone age. I think base122 is a bit silly, but if it had its uses and adoption became widespread, then browsers, tooling, etc. would follow, just as they now support base64.
My problem with base-122 is simply that it's not an even power of 2.
I'm pretty sure that Bitcoin has dealt with this issue for its base58 encoding. It might be worth checking if their algorithm is generalizable to other radix sizes.
It's an interesting technical exercise, but I don't think this is the right approach for optimizing HTML load times.
If the goal is to reduce latency for small images, wouldn't it make more sense to extend data URIs so the same base64 string can be referenced in multiple places?
Actually, as HTTP/2 can effectively return multiple resources in the answer to one request, do we still need embedded images for latency reduction at all?
Something that isn't really discussed is that the need for base64 arose so that binary data could transit 7-bit email systems.
Base122 does not have that property, as far as I can tell.
Whilst we are all, basically, living in an 8-bit world, I suspect it will be some time before people feel comfortable assuming that an 8-bit transport is viable over email.
Berkeley Unix (or was that SunOS?) used to come with btoa, which used base 85, packing 4 bytes into 5 uucp/mail/Usenet-safe characters.
For some reason, base 64 became the standard, and I don't think there's space for anything else these days. (Most things do just fine with binary now, anyway!)
In UTF-8, these characters are all encoded as two bytes. So this encoding makes the byte count twice as long; it's no more efficient (in bytes) than hex encoding.
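The two-byte cost is easy to confirm in Node.js (the sample characters here are just illustrative): any code point above U+007F takes two or more bytes once the page is actually served as UTF-8.

```javascript
// UTF-8 byte cost per character, measured directly.
console.log(Buffer.byteLength('A', 'utf8'));      // ASCII: 1 byte
console.log(Buffer.byteLength('\u00E9', 'utf8')); // U+00E9 (é): 2 bytes
console.log(Buffer.byteLength('\u20AC', 'utf8')); // U+20AC (€): 3 bytes
```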
If your property string is enclosed in a double quote, then a single quote in the payload is fine. (Otherwise, a lot of inline JS in onclick etc. attributes would break. JS allows both types of quotes on string literals for exactly this reason.)
Still, single quotes are somewhat asking for trouble.
You can probably get most of what you want by simply not using vowels. A few non-English ones will still get through like knnbccb and there's also shlt and so on....
Meanwhile, the Unicode consortium has been hard at work since 1991 to make it possible to encode up to 2.8 MB -- more than enough for most images, short videos, or many PDF files -- in a single character.
Are you referring to UTF-8? If so, this is misleading as you can encode up to 2^21 + 2^16 + 2^11 + 2^7 = 2,164,864 code points, which is not the same as encoding bytes in a single character.
zackmorris|9 years ago
Almost a perfect standard, but the prepended one-byte header is a mistake IMHO. It makes it impossible to encode when the input size is unknown. Better to encode whether the last chunk is one byte or two at the end of the stream.
Please, whoever is involved with this: revise the standard to not have a header and call this existing spec a beta. Otherwise, good work.
Edit: I have opened an issue for this: https://github.com/kevinAlbs/Base122/issues/3#issue-19188159...
dmbarbour|9 years ago
kevinAlbs|9 years ago
userbinator|9 years ago
https://en.wikipedia.org/wiki/Ascii85
jws|9 years ago
ChuckMcM|9 years ago
[1] http://grepcode.com/file/repository.grepcode.com/java/root/j...
billforsternz|9 years ago
throwbsidbdk|9 years ago
poizan42|9 years ago
[0]: https://en.wikipedia.org/wiki/Binary-to-text_encoding
[1]: https://en.wikipedia.org/wiki/YEnc
NelsonMinar|9 years ago
bascule|9 years ago
Besides that, the Z85 encoding is the next runner up as a compact "string safe" encoding: https://rfc.zeromq.org/spec:32/Z85/
aidenn0|9 years ago
throwbsidbdk|9 years ago
RobIII|9 years ago
daenney|9 years ago
> Base-122 was created with the web in mind.
And
> As §3 shows, base-122 is not recommended to be used on gzip compressed pages, which is the majority of served web pages.
Occur just a few lines from each other? I get there are more use cases like email and such, but if you're going to create something for the web that can't be used on the majority of web pages, that seems like a fairly large oversight/caveat.
gayprogrammer|9 years ago
> Base-122 encoded strings contain characters which did not seem to play well with copy-pasting.
A very important part of web development is being able to manipulate text documents. It seems that using UTF-8 in more places can reveal cracks in implementations for browsers/DEs/editors/terminals/etc.
CiPHPerCoder|9 years ago
It's very easy to write a cache-timing-safe version of base{16,32,64} encoding for use in encoding/decoding cryptographic keys in configuration files. To wit: https://github.com/paragonie/constant_time_encoding
Base-122? Not sure if it's even possible.
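For the base16 case, the usual trick looks like the sketch below. This is inspired by, not copied from, the linked library, and JavaScript can't truly guarantee constant time (JIT, string allocation), but it illustrates the core idea: derive each digit arithmetically so there is no secret-dependent table lookup or branch.

```javascript
// Branchless hex encoding: each nibble is mapped to '0'-'9' or 'a'-'f'
// using only arithmetic and masks, never a secret-indexed lookup.
function hexEncode(bytes) {
  let out = '';
  for (const b of bytes) {
    for (const nib of [b >> 4, b & 0x0f]) {
      // 48 + nib covers '0'..'9'. (9 - nib) >> 31 is 0 for nib <= 9 and
      // -1 (all one bits) for nib > 9, so the mask adds 39 to reach 'a'..'f'.
      out += String.fromCharCode(48 + nib + (((9 - nib) >> 31) & 39));
    }
  }
  return out;
}

console.log(hexEncode([0xde, 0xad, 0xbe, 0xef])); // "deadbeef"
```

The same select-by-mask pattern extends to base32 and base64 alphabets; base122's variable-length, UTF-8-aware output is what makes a constant-time version hard.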
kobeya|9 years ago
deckar01|9 years ago
Each byte of base64 produces 6 bits of data, so the byte boundary aligns at 24 bits: LCM(6,8) = (6•8)/2 = 24.
Each byte of base122 produces 7 bits of data, so the byte boundary aligns at 56 bits: LCM(7,8) = (7•8)/1 = 56.
Edit: Due to the variable length encoding, there is no guarantee of byte alignment.
theoh|9 years ago
From the posted article: "This leaves us with 122 legal one-byte UTF-8 characters to use"
Seems legit to me.
xg15|9 years ago
paulddraper|9 years ago
zihotki|9 years ago
sikosmurf|9 years ago
wildfire|9 years ago
faragon|9 years ago
Dylan16807|9 years ago
Also note that when you're feeding in binary data the gzip sizes for raw and base-xx data get a lot closer together.
michaelsbradley|9 years ago
http://base91.sourceforge.net/
lalaithion|9 years ago
jwatte|9 years ago
Rafert|9 years ago
unknown|9 years ago
[deleted]
diafygi|9 years ago
https://github.com/diafygi/Offset248
Pros: simple, no ratio or ending calculation, easy copy/paste, same character length as original bytes
Cons: has to be percent encoded in urls, some fonts might not render all characters
timdierks|9 years ago
randomslob|9 years ago
Base122 requires UTF-8, and while that's pretty common, it's not universal, so base64 can't ever go away in favor of base122.
Compressed base64 is more efficient than base122 (with or without compression).
Conclusion: Big nope.
web007|9 years ago
There is an explicit call-out for "double quote == bad", but single quotes are also valid property delimiters in HTML.
majewsky|9 years ago
kevinAlbs|9 years ago
unknown|9 years ago
[deleted]
Hydraulix989|9 years ago
amelius|9 years ago
jzwinck|9 years ago
voltagex_|9 years ago
logicallee|9 years ago
qwename|9 years ago