It would be easy to do. That 3-bit word just needs to start at 1 (since it encodes fewer than 8 options). Then the fixed 1-bit can instead encode whether another 7-bit segment follows.
The use of code points below 32 (space marks the start of what's usually considered "printable") makes me a bit hesitant. A lot of systems won't preserve those characters. Base85 is a more efficient alternative to base64, and doesn't use that lower range:
Base 85 also has the interesting property that 4 original bytes fit in 5 encoded bytes. Depending on your processor's memory model and the cost of multiplies compared to shifts, this can make it the best performer.
This was true on Vax 8200 hardware back in the day. In the same software, with Huffman decoding of JPEGs it was also fastest to create a finite state machine with an 8 bit symbol size. I suspect that is no longer true since it would kill your L1 cache and be well into your L2 cache on modern x86 machines. It is probably better to take the instruction count hit and process as bits or nibbles.
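The 4-bytes-in-5-characters packing mentioned above can be sketched as follows. This is a minimal illustration (the function name is mine, not any library's API) using the classic Ascii85 alphabet that starts at '!'; short final groups and padding are deliberately omitted:

```javascript
// Pack one group of 4 bytes into 5 base-85 digits, Ascii85-style.
function encodeGroup(bytes) {
  // Pack 4 bytes into one 32-bit big-endian word; >>> 0 keeps it unsigned.
  let n = ((bytes[0] << 24) | (bytes[1] << 16) | (bytes[2] << 8) | bytes[3]) >>> 0;
  const out = new Array(5);
  // Peel off base-85 digits with divides -- this is where the multiply/divide
  // cost discussed above comes from, versus pure shifts and masks in base64.
  for (let i = 4; i >= 0; i--) {
    out[i] = String.fromCharCode(33 + (n % 85)); // '!' is code 33
    n = Math.floor(n / 85);
  }
  return out.join('');
}

console.log(encodeGroup([0x4d, 0x61, 0x6e, 0x20])); // "Man " -> "9jqo^"
```

This works because 85^5 (about 4.44 billion) just exceeds 2^32, which is exactly why 85 is the smallest base that fits 4 bytes into 5 printable characters.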
I was going to bring up base 85 as well; it's a better choice for a variety of reasons. A long time ago I wrote a base encoder class in Java[1], mostly so that we could write a netnews reader in Java, but also because I felt UUEncoding was not robust. The challenges of using unprintable characters are a lot more of a headache than anyone pays attention to initially. Lots (and I mean quite a few here) of systems consider unprintable characters "safe" to re-purpose for random uses. One display vendor had them changing the color of future characters in the display, as an example.
Stick with the characters that nearly everyone assumes could legitimately come up in a document, and you lower your chances of running afoul of some "creative genius" who decided "Hey, it's unprintable so no one will try to print it, but when I do print it I want this thing to happen..."
I've never come across this one, thanks for the tip. I like the way it doesn't use every character in the range 0x20 to 0x7f. That last one in particular (0x7f or 'DEL') has always seemed problematic to me because it's a weird exceptional case. I've always thought someone must have screwed up with that one way back in the day. Using only '!' to 'u' and hence avoiding ' ' feels right for some reason, and I also like the cute trick of getting just a little bit of compression by using 'z' to represent 32 zero bits.
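The 'z' trick mentioned above is easiest to see on the decode side. A hedged sketch (the name is illustrative, not from any particular implementation): a lone 'z' stands in for an entire group of 32 zero bits.

```javascript
// Decode one 5-character Ascii85 group (or the 'z' shorthand) back to 4 bytes.
function decodeGroup(s) {
  if (s === 'z') return [0, 0, 0, 0]; // the cheap compression trick for zero runs
  let n = 0;
  for (const ch of s) {
    const d = ch.charCodeAt(0) - 33; // '!' maps to 0, 'u' maps to 84
    if (d < 0 || d > 84) throw new Error("character outside the '!'..'u' range");
    // 85^5 - 1 fits well inside a double's 2^53 exact-integer range, so this is exact.
    n = n * 85 + d;
  }
  // Split the 32-bit word back into big-endian bytes.
  return [(n >>> 24) & 0xff, (n >>> 16) & 0xff, (n >>> 8) & 0xff, n & 0xff];
}

console.log(decodeGroup('9jqo^')); // -> [0x4d, 0x61, 0x6e, 0x20], i.e. "Man "
```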
A lot of people have experimented with a lot of different ways of encoding binary data as printable text. Wikipedia has a list of different encoding schemes[0].
The most efficient one is yEnc[1]. Still the simplest ones such as base64 or good old hex may actually work better once compression comes into the picture.
It's crucial to evaluate encoding space usage in the context of compression. For instance gzip(base16(data)) is often smaller than gzip(base64(data)) for practical data. Even though base64 is more efficient than base16, it breaks up data across byte boundaries which then makes gzip significantly less efficient.
I appreciate the author building a more efficient alternative to base64, but I had to laugh for a moment at the suggested use case.
Using an alternative to base64 encoded data for web pages and Node is a horrible idea. If I came across a code base using it I would scream. A lot. None of my tools work with it, and now my browser has to run a bunch of JS for something that's normally native and very fast. It's a 1-2 kB savings per page that's going to make somebody jump off a bridge one day.
More terrifying is that the JavaScript community is so fascinated by shiny objects that thousands of people are going to use this. I'm not sure if it's more funny or terrifying.
Anyway, this encoding is still useful if you need to pass data between legacy systems. I've used HTML escapes and ASCII85 many times before to get around annoying old stuff that doesn't handle Unicode.
> None of my tools work with it, and now my browser has to run a bunch of js for something that's normally native and very fast.
I agree with the general idea of your post, but I disagree with this part. If nobody ever tries anything new we'd still be in the stone age. I think base122 is a bit silly, but if it had its uses and adoption became widespread, then browsers, tooling, etc. would follow, just as they now support base64.
My problem with base-122 is simply that it's not an even power of 2.
I'm pretty sure that Bitcoin has dealt with this issue for its base58 encoding. It might be worth checking if their algorithm is generalizable to other radix sizes.
It's an interesting technical exercise, but I don't think this is the right approach for optimizing HTML load times.
If the goal is to reduce latency for small images, wouldn't it make more sense to extend data URIs so the same base64 string can be referenced in multiple places?
Actually, as HTTP/2 can effectively return multiple resources in the answer to one request, do we still need embedded images for latency reduction at all?
Something that isn't really discussed is that the need for base64 arose so that binary data could transit 7-bit email systems.
Base122 does not have that property, as far as I can tell.
Whilst we are all, basically, living in an 8-bit world, I suspect it will be some time before people feel comfortable assuming that an 8-bit transport is viable over email.
Berkeley Unix (or was that SunOS?) used to come with btoa, which used base 85, packing 4 bytes into 5 uucp/mail/Usenet-safe characters.
For some reason, base 64 became the standard, and I don't think there's space for anything else these days. (Most things do just fine with binary now, anyway!)
In UTF-8, these characters are all encoded as two bytes. So this encoding makes the byte count twice as long; it's no more efficient (in bytes) than hex encoding.
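The two-byte cost is easy to confirm in Node.js (the sample characters here are just illustrative): any code point above U+007F takes two or more bytes once the page is actually served as UTF-8.

```javascript
// UTF-8 byte cost per character, measured directly.
console.log(Buffer.byteLength('A', 'utf8'));      // ASCII: 1 byte
console.log(Buffer.byteLength('\u00E9', 'utf8')); // U+00E9 (é): 2 bytes
console.log(Buffer.byteLength('\u20AC', 'utf8')); // U+20AC (€): 3 bytes
```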
If your property string is enclosed in a double quote, then a single quote in the payload is fine. (Otherwise, a lot of inline JS in onclick etc. attributes would break. JS allows both types of quotes on string literals for exactly this reason.)
Still, single quotes are somewhat asking for trouble.
You can probably get most of what you want by simply not using vowels. A few non-English ones will still get through like knnbccb and there's also shlt and so on....
Meanwhile, the Unicode consortium has been hard at work since 1991 to make it possible to encode up to 2.8 MB -- more than enough for most images, short videos, or many PDF files -- in a single character.
Are you referring to UTF-8? If so, this is misleading as you can encode up to 2^21 + 2^16 + 2^11 + 2^7 = 2,164,864 code points, which is not the same as encoding bytes in a single character.
zackmorris|9 years ago
Almost a perfect standard, but the prepended one-byte header is a mistake IMHO. It makes it impossible to encode when the input size is unknown. Better to encode whether the last chunk is one byte or two at the end of the stream.
Please, whoever is involved with this: revise the standard to not have a header and call this existing spec a beta. Otherwise, good work.
Edit: I have opened an issue for this: https://github.com/kevinAlbs/Base122/issues/3#issue-19188159...
dmbarbour|9 years ago
kevinAlbs|9 years ago
userbinator|9 years ago
https://en.wikipedia.org/wiki/Ascii85
jws|9 years ago
ChuckMcM|9 years ago
[1] http://grepcode.com/file/repository.grepcode.com/java/root/j...
billforsternz|9 years ago
throwbsidbdk|9 years ago
poizan42|9 years ago
[0]: https://en.wikipedia.org/wiki/Binary-to-text_encoding
[1]: https://en.wikipedia.org/wiki/YEnc
NelsonMinar|9 years ago
bascule|9 years ago
Besides that, the Z85 encoding is the next runner up as a compact "string safe" encoding: https://rfc.zeromq.org/spec:32/Z85/
aidenn0|9 years ago
throwbsidbdk|9 years ago
RobIII|9 years ago
daenney|9 years ago
> Base-122 was created with the web in mind.
And
> As §3 shows, base-122 is not recommended to be used on gzip compressed pages, which is the majority of served web pages.
Occur just a few lines from each other? I get there are more use cases like email and such, but if you're going to create something for the web that can't be used on the majority of web pages, that seems like a fairly large oversight/caveat.
gayprogrammer|9 years ago
> Base-122 encoded strings contain characters which did not seem to play well with copy-pasting.
A very important part of web development is being able to manipulate text documents. It seems that using UTF-8 in more places can reveal cracks in implementations for browsers/DEs/editors/terminals/etc.
CiPHPerCoder|9 years ago
It's very easy to write a cache-timing-safe version of base{16,32,64} encoding for use in encoding/decoding cryptographic keys in configuration files. To wit: https://github.com/paragonie/constant_time_encoding
Base-122? Not sure if it's even possible.
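For the base16 case, the usual trick looks like the sketch below. This is inspired by, not copied from, the linked library, and JavaScript can't truly guarantee constant time (JIT, string allocation), but it illustrates the core idea: derive each digit arithmetically so there is no secret-dependent table lookup or branch.

```javascript
// Branchless hex encoding: each nibble is mapped to '0'-'9' or 'a'-'f'
// using only arithmetic and masks, never a secret-indexed lookup.
function hexEncode(bytes) {
  let out = '';
  for (const b of bytes) {
    for (const nib of [b >> 4, b & 0x0f]) {
      // 48 + nib covers '0'..'9'. (9 - nib) >> 31 is 0 for nib <= 9 and
      // -1 (all one bits) for nib > 9, so the mask adds 39 to reach 'a'..'f'.
      out += String.fromCharCode(48 + nib + (((9 - nib) >> 31) & 39));
    }
  }
  return out;
}

console.log(hexEncode([0xde, 0xad, 0xbe, 0xef])); // "deadbeef"
```

The same select-by-mask pattern extends to base32 and base64 alphabets; base122's variable-length, UTF-8-aware output is what makes a constant-time version hard.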
kobeya|9 years ago
deckar01|9 years ago
Each byte of base64 produces 6 bits of data, so the byte boundary aligns at 24 bits: LCM(6,8) = (6•8)/2 = 24.
Each byte of base122 produces 7 bits of data, so the byte boundary aligns at 56 bits: LCM(7,8) = (7•8)/1 = 56.
Edit: Due to the variable length encoding, there is no guarantee of byte alignment.
theoh|9 years ago
From the posted article: "This leaves us with 122 legal one-byte UTF-8 characters to use"
Seems legit to me.
xg15|9 years ago
paulddraper|9 years ago
zihotki|9 years ago
sikosmurf|9 years ago
wildfire|9 years ago
faragon|9 years ago
Dylan16807|9 years ago
Also note that when you're feeding in binary data the gzip sizes for raw and base-xx data get a lot closer together.
michaelsbradley|9 years ago
http://base91.sourceforge.net/
lalaithion|9 years ago
jwatte|9 years ago
Rafert|9 years ago
unknown|9 years ago
[deleted]
diafygi|9 years ago
https://github.com/diafygi/Offset248
Pros: simple, no ratio or ending calculation, easy copy/paste, same character length as original bytes
Cons: has to be percent encoded in urls, some fonts might not render all characters
timdierks|9 years ago
randomslob|9 years ago
Base122 requires UTF-8, and while that's pretty common, it's not universal, so base64 can't ever go away in favor of base122.
Compressed base64 is more efficient than base122 (with or without compression).
Conclusion: Big nope.
web007|9 years ago
There is an explicit call-out for "double quote == bad", but single quotes are also valid property delimiters in HTML.
majewsky|9 years ago
kevinAlbs|9 years ago
unknown|9 years ago
[deleted]
Hydraulix989|9 years ago
amelius|9 years ago
jzwinck|9 years ago
voltagex_|9 years ago
logicallee|9 years ago
qwename|9 years ago