The Base45 Data Encoding

Confiks|4 years ago

Note that this encoding isn't of the same efficiency as QR binary mode, as it converts 3 bytes into 2 base45 characters. So it's more like 'base41 using the base45' charset.

I'm still a bit sad that with this standard and the packages available now, the namespace of 'base45' is clobbered with this suboptimal implementation. It can best just be renamed to 'base41'. It's a good tradeoff for the DCC, but not for the rest of possible implementations.

For the Dutch variant of the green pass using unlinkable signatures [1], we need all the space we can get, so we use a base45 encoding that uses the exact same method as base58 [2][3], and which has the exact same efficiency as QR binary mode.

[1] https://github.com/minvws/nl-covid19-coronacheck-app-coordin...

[2] https://gist.github.com/confiks/8fcb480d87a50cf1bb5e40e2f093...

[3] https://github.com/confiks/base45-go/tree/main/base45

mjevans|4 years ago

32 bits raw <==> 33 bits (3 pairs of 11 bits) QR alphanum.

In blocks of 4 bytes this encodes as 6 'base45' (QR alphanum) characters, and uses the same lookup table.

https://en.wikipedia.org/wiki/QR_code#Encoding

The "Alphanumeric character codes" table, at least at a visual glance, is identical to the RFC's lookup table.

pulse7|4 years ago

One could take 10 digits, 26 letters and 5 operands (+ - * / %) for Base41 encoding. 41 * 41 * 41 = 68921 is enough for 2 bytes (65536 combinatons).

radicalbyte|4 years ago

A nice tidbit: this RFC has its roots in the EU Covid Certificate project. The encoding was designed to cut the size of the QR payload (which for DCC is a CBOR - binary encoded - object) :)

The smaller the payload, the better and faster the scanning. Which is important for something that is designed to be used during border crossings and the like.

We have a number of implementations here:

https://github.com/ehn-dcc-development/

lawl|4 years ago

Interesting. The swiss system, which is supposedly compatible with the EU uses JSON + base64 [0].

Seems we also use RSA, and from a quick glance the EU seems to take any x509 certificate authority [1].

Does anyone know if there is a reason elliptic curves weren't been mandated, which should cause smaller signatures than RSA and thus a smaller payload?

[0] https://github.com/admin-ch/CovidCertificate-Apidoc/#respons...

[1] https://github.com/ehn-dcc-development/dgc-java/blob/main/cr...

thehappypm|4 years ago

Honestly feels like premature/unnecessary optimization.

EdSchouten|4 years ago

The idea of this encoding is to store two bytes of data in three characters. To me it's not obvious why you need a base as high as 45 for that.

Assuming you either want to store two bytes, or a trailing one, you have 256*256 + 256 combinations: 65792. Using three base45 characters, you can get up to 45^3=91125 combinations. It looks like base41 would have been sufficient. That way you can get rid of some of those special characters, making it easier to use through different transports.

dolmen|4 years ago

+1

This would allow to avoid space, %, / and + to be more URL friendly, and so allow more usages.

unknown|4 years ago

[deleted]

kstenerud|4 years ago

Unfortunately, the QR code "binary" mode specification defaults to ISO 8859-1 for the encoding (because it was not originally intended to store actual binary data), and there's also no way to indicate what format is actually encoded. So all decoders of course just assume ISO 8859-1 because they have no way of knowing otherwise.

However, we could in theory get around this by using binary data formats that always begin with an invalid text character (such as 0x80-0x9f). This way, an implementation can know that the data is not ISO 8859-1, and try to decode whatever format it discovers through the beginning byte signature.

I've actually put this into Concise Encoding [1]

[1] https://github.com/kstenerud/concise-encoding/blob/master/cb...

nly|4 years ago

For context, it seems this encoding was covered by this discussion a few days ago:

What's Inside the EU Green Pass QR Code?

https://news.ycombinator.com/item?id=27589913

codeflo|4 years ago

So instead of extending QR codes, which are inherently binary, to efficiently handle binary payloads, we invent yet another ASCII-based tunneling scheme. Why ever fix any problem when we can just pile workaround upon workaround upon workaround?

masklinn|4 years ago

Qrcode already has a binary mode. The problem is that qrcode software treats qrcode data as text and fucks up. You’re not going to get every software out there fixed.

This does not extent qrcode at all, instead it defines a binary-to-text encoding designed to fit in qrcode’s existing alphanumeric mode, exactly like base32, base64 or base85 (or base36, base62, binhex, quopri) but fitting the specific constraints of the qrcode medium.

kstenerud|4 years ago

Unfortunately, the QR code "binary" mode specification defaults to ISO 8859-1 for the encoding (because it was not originally intended to store actual binary data), and there's also no way to indicate what format is actually encoded. So all decoders of course just assume ISO 8859-1 because they have no way of knowing otherwise.

However, we could in theory get around this by using binary data formats that always begin with an invalid text character (0x80-0x9f). This way, an implementation can know that the data is not ISO 8859-1, and try to decode whatever format it discovers through the beginning byte signature.

I've actually put this into Concise Encoding [1]

[1] https://github.com/kstenerud/concise-encoding/blob/master/cb...

atoav|4 years ago

Becaus sometimes you need your solution to be used be the public, voluntarily. If they have to install an extra app to read your "looks-like-QR-code-but-not-quite" thing, then you have at least an increased risk of not getting the adoption you want to get.

This is not a realistic choice for many projects.

tlamponi|4 years ago

Because QR code scanners are very widespread already, can be understood by even basic cameras nowadays and some devices are hard to update or would need yet another self-invented thingy, which would make people rather more suspicious if done for something like the use case this initially had – codes for verifying that a person is tested, vaccinated or recovered OK on borders and the like.

I at least find it good that existing solutions still get optimizations, even and are enhanced, vs. just throwing everything away at the slightest issue and redo everything, the that churn just costs a lot of $€£ while never bringing out something mature, i.e., useful for the masses.

mjevans|4 years ago

This could be useful for storing non-ascii armored crypto keys within a printed QR code format that can be stored in a fireproof safe and reasonably OCRed and decoded for use in disaster recovery or other applications.

chrismorgan|4 years ago

So, it’s using a 45-character alphabet which matches the QR code alphanumeric values table, which lets the QR code encoder switch to a more efficient mode that takes less space.

I just tried rendering a QR code of the 692 characters of the introductory paragraph (with lines joined appropriately), and compared it with a QR code of the same text, uppercased and with out-of-range characters `,`, `[` and `]` changed to %. This reduced an 89×89 code down to 77×77, a 24% reduction in area. If this is roughly the ratio, then Base45-encoding binary data by QR code will yield roughly 17% area savings compared with Base64. (Base45 gets 50% bloat, then 24% shrink = multiple of 1.14; Base64 gets 33% bloat = multiple of 1.33; Base45 / Base64 = 1.14 / 1.33 = 0.83.)

[Edit: edflsafoiewq’s figures on 40-L codes at https://news.ycombinator.com/item?id=27627915 come to about 23% savings, markedly more than my 17%.]

I can’t help but wonder if any of the other modes could be more efficient still—numeric mode, kanji mode and byte mode.

Of course, the specs for all this are ISO specs, so I can’t read them without coughing up the moolah.

In case it’s not clear, I am utterly inexpert in this domain.

Further thoughts:

Base45 encoding two octets in three characters is pretty wasteful: 45³ ÷ 256² ≈ 1.39, which is 39% waste. (By contrast, Base64 is 100% efficient with its alphabet: 64⁴ = 256³.) This means that if you were willing to do more complex encoding and decoding, you could shrink your QR code by roughly 39% more—to about 52% of the size of the Base64, rather than 83%. Leaving such a huge gap on the table puzzles me—I’d have thought that either you’d want something simple (where Base64 is well-understood) or want to minimise your QR codes, and Base45 sits in an awkward place in the middle.

For UTF-8, base-128 will be the most efficient you get. That’ll be ~14% inflation (7 bytes in 8 characters). Which… huh, that looks to be within ε of Base45’s 50% bloat and 24% shrinkage. Not sure if that’s a coincidence or not because I don’t know how alphanumeric mode versus byte mode works in QR codes. But this suggests that alphanumeric mode and Base45-but-not-wasteful would be markedly more efficient than byte mode and Base-122. Still leaves numeric and kanji modes open as possibilities. Again, I’m inexpert and don’t know how the encodings are actually done, and that’ll matter.

On edflsafoiewq’s 40-L figures: Base64 gets 2214 bytes, Base45 gets 2864 bytes, optimally-efficient base-45 would get log₂₅₆ 45⁴²⁹⁶ = 2949 bytes, only around 3% more. I think I must have made a mistake somewhere with some of my numbers.

anderskaseorg|4 years ago

The right way to measure the area ratio is using entropy. An optimal encoding would save at most 3% area over Base45:

Base64 in binary: log₂ 64 / 8 = 75.000% efficient

Base45 in alphanumeric: 4 log₂ 256 / 33 = 96.970% efficient

Optimal numeric: 3 log₂ 10 / 10 = 99.657% efficient

Optimal alphanumeric: 2 log₂ 45 / 11 = 99.851% efficient

Optimal binary (ISO 8859-1): log₂ 191 / 8 = 94.718% efficient

Optimal binary (UTF-8, single-byte subset): log₂ 128 / 8 = 87.500% efficient

Optimal binary (UTF-8, full): 1 = 128 / 2^(8α) + 1920 / 2^(16α) + 63488 / 2^(24α) + 1048576 / 2^(32α) ⇒ α = 89.706% efficient

Optimal kanji (JIS X 0208): log₂ 6879 / 13 = 98.061% efficient

The mistake in your 39% calculation is that you forgot to take logarithms before calculating the ratio.

lifthrasiir|4 years ago

> I can’t help but wonder if any of the other modes could be more efficient still—numeric mode, kanji mode and byte mode.

Byte mode is obviously most efficient. Alphanumeric mode uses 5.5 bits per each character, so combined with base45 it uses 5.5 * 3 = 16.5 bits to pack two octets. Base45 is actually not that bad as it seems (3.1% overhead). [EDIT: I've since seen edflsafoiewq's mention that typical QR readers try to interpret binary data as UTF-8, so it is not pointless.]

[MORE EDIT: I've completely missed the last paragraph, so I was actually just confirming what chrismorgan said.]

unknown|4 years ago

[deleted]

eventreduce1|4 years ago

What are the benefits to base58?

Base45 uses chars like backslash. This is super annoying when the encoded string is used in an url.

edflsafoiewq|4 years ago

It's not for avoiding looks-similar-to-a-human chars like Base58. It's for QR codes. The alphabet is exactly the alphabet for Alphanumeric mode QR codes.

A 40-L code allows 4296 characters in Alphanumeric-mode = 2864 bytes after Base45 decoding.

The same code allows 2953 characters in Byte mode. If you use Byte mode to hold Base64, that's 2214 bytes after decoding, so Base45 is more efficient.

The reason it gives for why you can't use Byte mode to directly hold binary data is

> Even in Byte mode a typical QR-code reader tries to interpret a byte sequence as an UTF-8 or ISO/IEC 8859-1 encoded text. Thus QR-codes cannot be used to encode arbitrary binary data directly.

It also says you're not supposed to use it anywhere but QR codes (like URLs)

> If the data is to be sent via some other transport [not stored in a QR-code], a transport encoding suitable for that transport should be used instead of Base45. It is not recommended to first encode data in Base45 and then encode the resulting string in for example Base64 if the data is to be sent via email. Instead the Base45 encoding should be removed, and the data itself should be encoded in Base64.

DocTomoe|4 years ago

The draft specifically tells you not to do that:

"If the data is to be sent via some other transport, a transport encoding suitable for that transport should be used instead of Base45."

Base45 is mainly useful for binary information in QR codes.

zaxomi|4 years ago

slash = /

backslash = \

They are using slash.

And space, which is confusing. Did not see the space in the "Hello!!" example since the space is the last character on the line.

simojk|4 years ago

How does this compare with Data Matrix and C40 encoding?

upofadown|4 years ago

Why would this be a RFC? It seems quite specific to QR codes. Nothing specific to the Internet.

47 comments