top | item 29928282

(no title)

breser | 4 years ago

I know of at least one problem with ASN.1. The string encodings other than UTF-8 are terrible. Most of the string encodings are very limited and weird subsets of ASCII that nobody actually uses anymore. ASN.1 itself doesn't define the encodings and just refers to other standards.

The problem with this is probably most notable with the T.61 encoding which changed over the years and since ASN.1 references other standards nobody is quite sure exactly what you have to support to have T.61 actually work right.

Within X.509 certificates though nobody bothers to actually implement T.61 and just uses the T.61 flag for ISO-8859-1.

There are a bunch of gory details around this mess in this (now quite old) write-up here: https://www.cs.auckland.ac.nz/~pgut001/pubs/x509guide.txt

Since that write up I believe UTF-8 is pretty much the expectation for character encoding for X.509.

I documented some of the quirks around 6 years ago when I took an existing X.509 parser and improved it for use in certificate trust management in Subversion: http://svn.apache.org/viewvc/subversion/trunk/subversion/lib...

Basically ASN.1 wasn't well defined and it only works well when people agreed to only use certain features or to interpret things in a particular way when ambiguous.

It's also notoriously difficult to parse well. It's very easy to have bugs in your parser, even if you're implementing a subset of it that's needed for X.509. Especially if you're doing so in a non-memory safe language.

I can't speak for why Google invented Protobufs, but I can't imagine anyone sane picking up ASN.1 for anything modern and deciding that this is what they want to use.

discuss

tialaramex|4 years ago

For the string encoding thing, however, it does have UTF-8 and you should not use anything else to express arbitrary human text anyway.

PKIX actually leverages the weird encoding restriction to our benefit. It defines two kinds of names which things might have on the Internet (you can and should stop trying to name things which are actually on the Internet some other way), DnsNames and IpAddresses. IpAddresses, since they're either 32-bit or 128-bit arbitrary bit values, are just represented as either 32-bit or 128-bit arbitrary bit values. So you cannot express the erroneous IPv4 address 100.200.300.400 as an IpAddress, which means you can't trip up somebody's parser with that nonsense address. DnsNames use a deliberately sub-ASCII encoding from ASN.1 which can express all the legal DNS names (all A-labels and the ASCII dot . are permissible) but can't express lots of other goofy things including most Unicode. So a certificate issuer, even if they're completely incompetent, cannot write a valid DnsName that expresses some garbage IDN as Unicode. Hopefully they read the documentation and find out they need to use A-labels (Punycode) but if not they're prevented from emitting some ambiguous gibberish.

Even in forums where you'd once have expected pushback, "Just use UTF-8" is becoming more widespread. Microsoft for example, once upon a time you'd get at least some token resistance, today they're likely to agree "Just use UTF-8". So ASN.1 ends up no worse off for a half a dozen bad ways to write text you shouldn't use, compared to say XML, HTML, and so on.

xyzzy123|4 years ago

Agree, although the right thing to do helps in specific applications but not so much in the general case. You're very often stuck with other people's MIBs / specs and encoders, trying to make sense of what a) they're allowed to put on the wire and b) what they actually do and under what circumstances.

pzb|4 years ago

A couple of years ago I ran into the same confusion of the "TeletexString"/"T61String" data type in ASN.1. After going down the rabbit hole of what is T.61 and trying to map it to Unicode, I reread the ASN.1 (X.690) spec and realized that the authors never actually referenced T.61. Ever since the first edition of ASN.1 in 1988, those strings have not used T.61. They use a character set that is easily mapped to Unicode - https://www.itscj-ipsj.jp/ir/102.pdf, a subset of US ASCII.

Not to say the rest of the spec is notably better. If fully implemented, it requires supporting escape codes in strings to change character sets. I've never seen valid escape codes in real world data, but it probably exists.

As the original article shows, ASN.1 has lots of other challenges and complexity. Trying to write a code generator that supports all the complexity is no trivial task and the only open source one I've seen only generates C code. Protobuf has the advantage of having modern language support (including multiple type safe and memory safe languages).

jsmith45|4 years ago

Eh... It does have a transitive normative reference to T.61, but only by way of special restrictions on the use of three characters.

T61String is defined in terms of ISO 2022, with the default C0 Character set set to ISO-IR-102 (as you linked). ISO-IR-102 defines the set of graphical characters, but also places a condition on the use of 3 of them by reference to T.61. It also requires that the control character set C0 be set to ISO-IR-106 by default, and ISO-IR-107 for C1.

The net effect is that the default character set of T61String is almost the T.61 character set, except that to get the T.61 character set, you need to include the escape sequence to set G1 to ISO-IR-103. ESC 2/9 7/6

A conforming T61String implementation does need to support the escape sequences and resulting encodings from ISO-IR-6, ISO-IR-87, ISO-IR-102, ISO-IR-103, ISO-IR-106, ISO-IR-107, ISO-IR-126, ISO-IR-144, ISO-IR-150, ISO-IR-153, ISO-IR-156, ISO-IR-164, ISO-IR-165, ISO-IR-168.

Since the control character sets include shift prefixes etc, properly parsing T61Strings into Unicode is non-trivial.

This is actually a pretty good reflection of the complexity in ASN.1. Technically the ASN.1 spec proper only requires that a T61 string support exactly the set of characters specified in the above registrations. It does not mandate any particular format, for them. It is the BER encoding that requires that ISO2022 be used to encode these. A different encoding could specify that all strings are encoded as UTF-8, and the different types are just various subsets of allowed characters.

cryptonector|4 years ago

Heimdal's ASN.1 compiler generates C code. It also generates bytecode with C bindings. Two options.

Also, I've made it generate JSON dumps of the ASN.1 modules. My goal is to eventually replace the C-coded backends that generate C / bytecode with jq-coded backends that can generate C, Java, Rust, etc.

cryptonector|4 years ago

> Basically ASN.1 wasn't well defined and it only works well when people agreed to only use certain features or to interpret things in a particular way when ambiguous.

ASN.1 has always been as-well- or better-defined than its competition. The ITU-T specs for it are a thing of beauty not often equaled outside the ITU-T.

That said, for a long time the ASN.1 specs were non-free, and that hurt a lot. Also, the BER family of encoding rules stunted development of open source tooling for ASN.1.

AceJohnny2|4 years ago

> I can't imagine anyone sane picking up ASN.1 for anything modern and deciding that this is what they want to use.

Part of my curiosity stems from Apple using it as part of their bootable file-format: https://www.theiphonewiki.com/wiki/IMG4_File_Format

But as you say, I have to assume they're using it in a very constrained way.

memling|4 years ago

> Part of my curiosity stems from Apple using it as part of their bootable file-format: https://www.theiphonewiki.com/wiki/IMG4_File_Format

I could only speculate, but I wonder if part of the reason is that DER is completely unambiguous and therefore suitable for cryptographic services. It's also very easy to decode without a specification (TLV format). Apple are almost certainly using ASN.1 compilers for their mobile devices and security layers (even if they ship FOSS implementations, I'd be surprised if they aren't checking their work with commercial compilers), so there's overlap there. Rolling your own format in that case could be unnecessary and another failure point that could be rolled into a single unit.

cryptonector|4 years ago

> The string encodings other than UTF-8 are terrible.

Well, yes, because ASN.1 predates Unicode.