It wouldn't be fun if you didn't reinvent the wheel, but this time with the features you want instead of the almost-identical ones you didn't invent!
On one of these pages, the claim is made that Protobuf is not self-describing, and therefore cannot be used for "network applications". It seems that "self-describing" here means that the format includes key names, instead of compressing them into numbers the way Protobuf does. I can't understand why having field names is going to make a difference for anyone. Once you are setting up a system to deal with a specific format of data, why not just include a Protobuf schema?
Here's a use case where protobuf is terrible because it isn't self-describing: write a wireshark plugin which parses and pretty-prints protobuf messages for human consumption.
You can't, because such a plugin would have to have a priori knowledge of the schema in use.
We have never claimed that Protobuf could not be used for network applications. We have claimed that Protobuf is a bad choice for messages that have to be routable by intermediaries that do not know the schema of a Protobuf message. Where does one Protobuf message end and another begin?
Additionally, Protobuf is not good at encoding raw bytes - according to their own words.
It's nice to not have to assume the client and server are running the exact same version of the protocol. If you use ordinal numbers instead of names, you can never remove or reorder things without completely breaking backward compatibility. You can only append new fields to the end.
Binary format makes me believe it's not human-readable. How does this compare in size to gzipped JSON? JSON overhead is fairly small (some quotes, colons, brackets and keys) - it's no XML.
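For a rough sense of the numbers, gzip does shrink uniform JSON substantially; a quick stdlib check (the payload and figures are illustrative, not a benchmark):

```python
import gzip
import json

# A repetitive payload, as typical API responses are: many objects
# sharing the same key names.
records = [{"id": i, "name": "user%d" % i, "active": True} for i in range(1000)]
raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw)

# Key-name overhead compresses extremely well, so gzipped JSON is
# usually a small fraction of the raw size for uniform record sets.
print(len(raw), len(compressed))
assert len(compressed) < len(raw) // 4
```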
Yes, binary formats are not easily readable in a text editor. But it is actually possible to convert ION to an XML format and back again without loss of information (we have not implemented this yet). This should make it easier to read messages during debugging - especially because you don't need to know the schema for a given message to convert it to XML.
Regarding gzipped JSON, it is true that gzipped JSON is small. But due to the CRIME and BREACH attacks, it is not recommended to compress data sent over encrypted connections (TLS).
If you look at our performance benchmarks page you can see a list of serialized length comparisons. As you can see, as soon as you send a few objects in an ION table, the difference is big. More than what you normally can gain with GZip (except perhaps for String).
http://tutorials.jenkov.com/iap/ion-performance-benchmarks.h...
Furthermore, GZip only helps with transfer time, and actually slows down parsing time. If you look at our performance benchmarks you will see that ION parsing time is a lot faster than JSON. Additionally, if you really, really want high speed you do not parse ION (or JSON) into Java objects. You process the data directly in its binary form. If you look at our read-and-use benchmark you can see just how big a speed difference that gives. ION is designed for being processed directly. JSON isn't as good for that purpose.
Finally, ION is designed for fast arbitrary hierarchical navigation. JSON is not.
One thing that I like about CBOR is that with very little knowledge it's surprisingly readable in a hex-dump. Low value positive integers are the value of the byte itself, strings all have the form "0x6L [string]" or "0x7X [len] [string]". Arrays and maps are similarly obvious.
Of course, for anything more than a simple construct you're better off using a decoder (e.g. a Wireshark one).
Also, the fact that it's compatible with JSON means that you can use JSON during development, and then switch to CBOR at the end for the reduction in packet size. In Python it's as simple as swapping the json module for a CBOR one.
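To make the hex-dump point concrete, here is a toy encoder for a tiny CBOR subset (short ints, strings, arrays and maps only; a sketch for illustration - real code should use a library such as cbor2):

```python
def cbor_encode(v):
    # CBOR packs a 3-bit major type and a 5-bit length/value into the
    # first byte; for small values that one byte is all you need.
    if isinstance(v, int) and 0 <= v < 24:
        return bytes([v])                       # major type 0: the byte IS the int
    if isinstance(v, str) and len(v) < 24:
        return bytes([0x60 | len(v)]) + v.encode("utf-8")   # major type 3: text
    if isinstance(v, list) and len(v) < 24:
        return bytes([0x80 | len(v)]) + b"".join(map(cbor_encode, v))  # major 4
    if isinstance(v, dict) and len(v) < 24:
        return bytes([0xA0 | len(v)]) + b"".join(
            cbor_encode(k) + cbor_encode(x) for k, x in v.items())     # major 5
    raise ValueError("toy encoder: unsupported value %r" % (v,))

# The structure stays visible in a plain hex dump:
# a2 = map(2), 61 61 = "a", 01 = 1, 61 62 = "b", 82 02 03 = [2, 3]
print(cbor_encode({"a": 1, "b": [2, 3]}).hex())  # a26161016162820203
```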
> As you can see, an ION field can contain values that are up to 2^120 bytes long. If you need to encode larger blocks of data than that, you would need to break it up into multiple fields.
As the author of Protobuf v2 (the version that was open sourced by Google), I object to some of the "no"s in the protobuf column.
(Note: I no longer work on Protobuf, and I did not invent the format. I do work on and did invent Cap'n Proto.)
> Protobuf apparently isn't great at encoding raw bytes either (according to their own website).
Protobuf can handle raw bytes just fine, using the "bytes" type. There is no special encoding done on bytes; parsing and encoding is done by memcpy(). I'm curious to know what part of the web site you interpret as saying otherwise. It's entirely possible that the web site contains confusing language, but a citation would have been a good idea here.
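The wire layout of a bytes field is easy to verify by hand; a minimal sketch (restricted to field numbers below 16 and payloads under 128 bytes, so the tag and length each fit in one byte):

```python
def encode_bytes_field(field_number, payload):
    # Length-delimited wire type is 2; tag = (field_number << 3) | wire_type.
    # This sketch only handles single-byte tags and lengths.
    assert 1 <= field_number < 16 and len(payload) < 128
    return bytes([(field_number << 3) | 2, len(payload)]) + payload

# The payload goes on the wire verbatim - no escaping or re-encoding:
print(encode_bytes_field(1, b"\x00\xff\x10").hex())  # 0a0300ff10
```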
> Schema / Class Id
> Self describing
The Protobuf libraries have extensive support for manipulating dynamic schemas and transmitting schemas over the wire. See the "Descriptor" and "DynamicMessage" APIs. This is mentioned on the web site:
https://developers.google.com/protocol-buffers/docs/techniqu...
> Even if these compact objects do not contain any property names, they are still self describing enough that you can see where fields start and end, plus their data type, without an external schema. You cannot do that with Protobuf (as far as we know).
You absolutely can do that with Protobuf. This is what the "protoc --decode_raw" flag does, and it should be clear enough from reading the encoding.
https://developers.google.com/protocol-buffers/docs/encoding
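A sketch of the idea behind that: tags and lengths alone are enough to walk a message without its schema (this toy parser handles only the varint and length-delimited wire types):

```python
def read_varint(buf, i):
    # Little-endian base-128: low 7 bits per byte, high bit means "more".
    shift = value = 0
    while True:
        b = buf[i]
        i += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, i
        shift += 7

def decode_raw(buf):
    """Yield (field_number, wire_type, value) triples without any schema."""
    i, out = 0, []
    while i < len(buf):
        tag, i = read_varint(buf, i)
        field, wire_type = tag >> 3, tag & 7
        if wire_type == 0:                       # varint
            value, i = read_varint(buf, i)
        elif wire_type == 2:                     # length-delimited
            length, i = read_varint(buf, i)
            value, i = buf[i:i + length], i + length
        else:
            raise ValueError("sketch handles wire types 0 and 2 only")
        out.append((field, wire_type, value))
    return out

# The classic example from the encoding docs: field 1 = varint 150.
print(decode_raw(b"\x08\x96\x01"))  # [(1, 0, 150)]
```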
> Cyclic references

While it's true that Protobuf doesn't support these, I hope you've considered the denial-of-service vulnerabilities they tend to create if the receiver is not expecting them. Please ensure that cyclic references are only allowed in cases where the app has opted into them.
Relatedly, overlapping references / backreferences ("Copy" in your table) potentially lead to an amplification attack, where a small message on the wire turns out to be much, much larger when traversed. If applications cannot defend themselves from huge payloads by setting a message size limit, then you'll need to give them some other way.
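One way to give applications that defense is a decode-side budget: charge every expanded reference against a total-output limit, independent of the wire size. A hedged sketch of the idea (not any particular format's API; the reference representation here is made up for illustration):

```python
class ExpansionBudget:
    """Caps the total expanded size while resolving back-references."""
    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def charge(self, n):
        self.used += n
        if self.used > self.limit:
            raise ValueError("expanded size exceeds limit; possible amplification")

def expand(items, budget):
    # items: literal bytes, or ("ref", index) pointing at an earlier item.
    out = []
    for item in items:
        data = out[item[1]] if isinstance(item, tuple) else item
        budget.charge(len(data))   # count expanded size, not wire size
        out.append(data)
    return out

# A handful of wire bytes can expand to arbitrarily much output via
# references, so the budget, not the wire size, is what bounds memory.
msg = [b"abc"] + [("ref", 0)] * 1000
try:
    expand(msg, ExpansionBudget(100))
except ValueError as e:
    print(e)
```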
> All of the formats (except perhaps Protobuf) supports arbitrary hierarchical navigation of the encoded data, without first converting it to objects.
Protobuf supports this, and in fact should be an unqualified "Yes" rather than "Yes(*)" like the others. Protobuf encoding is very similar to ION's. Sub-messages are length-delimited, which seems to be exactly the advantage you're claiming that ION has.
Note that none of these formats support random access in the way that Cap'n Proto does.
In summary, I believe Protobuf deserves a "yes" in: "Raw bytes", "Good at raw bytes", "Schema / Class Id", "Arbitrary hierarchical navigation", and "Self describing".
If that is really true, then we will of course update the comparison page. However, we have put it together from what we were able to find in Protocol Buffers' own docs + Stack Overflow + googling. It is entirely possible that we made mistakes.
Sending the schema over the wire is not a good solution for anything other than point-to-point communication. An intermediate node would need every single schema transmitted along with every single message, or would need another way to keep the schemas cached. That becomes complicated.
The Protobuf documentation says very clearly that you cannot see where one message ends and another begins. So a Protobuf message is not fully self-describing. This might be easy to add, but Protobuf doesn't have it (according to its own docs).
We have looked at Cap'n Proto - but late in the process, when we had already looked at quite a lot of formats. From what I can see, Cap'n Proto is pretty much just a binary struct. That is pretty close to what we wanted to do with ION, except we wanted it to be compact on the wire too. We have seen that Cap'n Proto has a compaction mechanism, but we have not yet had time to analyze it and compare it to ION's. Cap'n Proto with compaction would be very similar to ION - on a conceptual level.
However, we need to make space for some IAP-specific fields coming later in the process (like cache references, column stores and more). That is why we chose to roll our own encoding in the first place.
I have found that most of my huge JSON data comes from uniform record sets. There's a great JSON-compatible encoder for such cases that stores them in a CSV-esque format: https://github.com/WebReflection/JSONH
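The idea behind that kind of encoder, sketched in plain Python (simplified - this just shows the key-hoisting, not JSONH's exact layout):

```python
import json

def pack(records):
    # Hoist the shared key names out of every record: store them once,
    # then only the row values, all in one flat JSON-compatible array.
    keys = list(records[0])
    flat = [len(keys), *keys]
    for r in records:
        flat.extend(r[k] for k in keys)
    return flat

def unpack(flat):
    n = flat[0]
    keys, values = flat[1:1 + n], flat[1 + n:]
    return [dict(zip(keys, values[i:i + n])) for i in range(0, len(values), n)]

records = [{"id": i, "name": "u%d" % i} for i in range(100)]
packed = pack(records)
assert unpack(packed) == records
# Key names appear once instead of once per record:
print(len(json.dumps(packed)), len(json.dumps(records)))
```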
Not yet. We have been asked to compare ION to Flatbuffers, Cap'n Proto, Thrift, Avro, Transit, BSON and several other encodings. However, writing the benchmarks and going through the features systematically is a lot of work, so we have not yet had the time to go through them all.
I think the most interesting difference from the usual serialization formats might be the Copy and Reference types. I'm a little undecided about whether they are a brilliant idea or not. Supporting Copy puts some extra effort on the serializer and deserializer, but the end result is the same as what you get without a copy-field mechanism. The support for cyclic references is a bigger change, because you can't directly model cycles with tree-structured formats. You might also have trouble using such data structures in some programming languages or libraries (e.g. if you are only using immutable types or want to use only value types). However, for some kinds of data it seems to make sense to support cycles; GraphQL and Falcor have also added support for that.
I also don't see that many use cases for the table structure. I have deployed thousands of RPC APIs into production, and I can't recall having the need for it. And even if you need it, using an object with 2 arrays in it would be just fine.
I also looked through the IAP documentation (btw, bad name => iPod Accessory Protocol) because it's quite related to what I'm working on. I think the basic communication patterns shown are correct, but from the documentation I can't really get a feeling for what I could expect from an IAP library. Would it be a low-level messaging system (like MQTT, ZeroMQ, etc.), or would higher-level communication patterns (request/response, notifications) also be built in? There are no predefined message formats for RPC listed in the documentation that would outline that. The WAMP specification (http://wamp.ws), e.g., makes it clearer what to expect from such a protocol. I'm not sure whether we need a new low-level messaging protocol, or if the work should focus more on adding higher-level semantics on top of it.
E.g., a pattern I really need in my domain is remote object synchronization: the status of an object on the server gets automatically pushed to all interested clients and is continuously updated as it changes (e.g. to build something like Firebase). Of course one can build something like that on top of basic messages by defining subscribe and update messages in the API, but I'm wondering if it's worthwhile to add it directly to the protocol. On the one hand, it is a special case of the subscription pattern that is also listed here; on the other hand, it cannot be implemented directly with the subscription facilities of many message-broker systems, because they won't send you the current state of an object after subscription, but will only forward a message the next time the value changes.
The connection and sequence definitions in IAP look a little redundant to me at first glance. I really think there is a need for message ordering, and you must support it. The question for me is: if you don't need message ordering, why not put the messages into a separate channel and let channels/streams always be ordered (like in HTTP/2)? Overhead for channel creation? Or set up each channel as either ordered or unordered at creation and keep that for the lifetime of the channel?
Matthias, you also don't need a binary protocol. XML would work. JSON too. But binary is faster and more compact. The same goes for the table construct. You can work around not having it, but ION has it built in, so you don't make the mistake of serializing an array of objects inefficiently just because you are busy. The objects are serialized as an ION table - not as a list of ION object fields. If you ever send an array of objects across the wire with ION (IAP Tools), the table mode is used automatically. You save bandwidth and parsing time automatically. Who doesn't want that - even if you don't strictly need it?
Regarding Copy and Reference, the support for them is still not very good (i.e. not automatic). But imagine your service executes an SQL JOIN query, and in the result a lot of objects repeat the same data (e.g. the same zip + city for many objects). The Copy field can be used to include the zip + city fields just once, and to refer back to them later with Copy fields. That is shorter than including them again. These two fields still need some work to be fully supported, but we are working on it.
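The saving is the same one interning gives you; a sketch of the idea in plain Python (illustrative only - this is not ION's actual Copy encoding, and the "copy_of" key is made up):

```python
def dedupe(rows, shared_keys=("zip", "city")):
    # Replace each repeated (zip, city) pair with a back-reference to
    # the first row that carried it, the way a Copy field would.
    seen, out = {}, []
    for row in rows:
        pair = tuple(row[k] for k in shared_keys)
        rest = {k: v for k, v in row.items() if k not in shared_keys}
        if pair in seen:
            out.append({**rest, "copy_of": seen[pair]})  # back-reference
        else:
            seen[pair] = len(out)
            out.append(dict(row))
    return out

rows = [
    {"name": "Anna", "zip": "8000", "city": "Aarhus"},
    {"name": "Bo", "zip": "8000", "city": "Aarhus"},
    {"name": "Cai", "zip": "2100", "city": "Copenhagen"},
]
deduped = dedupe(rows)
print(deduped[1])  # {'name': 'Bo', 'copy_of': 0}
```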
Right now ION is the most well-defined part of IAP. The network protocol itself is still not 100% finalized. But now that we are close to being done with ION (we still have extra fields to add as extended types), we can move forward with the IAP core protocols and semantic protocols. If we do not define a standard semantic protocol for remote object synchronization, IAP will be designed so that you can plug in your own semantic protocol to meet that need.
efaref|10 years ago:
[1] https://tools.ietf.org/html/rfc7049
[2] http://cbor.io/
rix0r|10 years ago:
Hasn't been published as far as I can Google.

klodolph|10 years ago:
Har, har.
kevinSuttle|10 years ago:
See also: https://github.com/edn-format/edn
VStack|10 years ago:
Specification: http://tutorials.jenkov.com/iap/index.html
Benchmarks: http://tutorials.jenkov.com/iap/ion-performance-benchmarks.h...
Tutorials: http://tutorials.jenkov.com/iap-tools-java/index.html
Here is also a recent InfoQ article: http://www.infoq.com/articles/IAP-Fast-HTTP-Alternative?utm_...
dmitrygr|10 years ago:
Problem accessing /iap/message-structure.html. Reason: