
Protobuffers Are Wrong (2018)

244 points | b-man | 6 months ago | reasonablypolymorphic.com | reply

307 comments

[+] lalaithion|6 months ago|reply
Protocol buffers suck but so does everything else. Name another serialization declaration format that both (a) defines which changes can be made backwards-compatibly, and (b) has a linter that enforces backwards-compatible changes.

Just with those two criteria you’re down to, like, six formats at most, of which Protocol Buffers is the most widely used.

And I know the article says no one uses the backwards compatible stuff but that’s bizarre to me – setting up N clients and a server that use protocol buffers to communicate and then being able to add fields to the schema and then deploy the servers and clients in any order is way nicer than it is with some other formats that force you to babysit deployment order.
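
A minimal sketch of what that looks like in practice (message and field names invented for illustration):

    message UserEvent {
      string user_id = 1;
      int64 timestamp = 2;
      // Added later, with a fresh tag number. Old binaries skip tag 3 when
      // they see it; new binaries reading old data get the empty default.
      // Servers and clients can therefore be deployed in any order.
      string session_id = 3;
    }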

The reason why protos suck is because remote procedure calls suck, and protos expose that suckage instead of trying to hide it until you trip on it. I hope the people working on protos, and other alternatives, continue to improve them, but they’re not worse than not using them today.

[+] jitl|6 months ago|reply
Not widely used but I like Typical's approach

https://github.com/stepchowfun/typical

> Typical offers a new solution ("asymmetric" fields) to the classic problem of how to safely add or remove fields in record types without breaking compatibility. The concept of asymmetric fields also solves the dual problem of how to preserve compatibility when adding or removing cases in sum types.

[+] tyleo|6 months ago|reply
We use protocol buffers on a game and we use the back compat stuff all the time.

We include a version number with each release of the game. If we change a proto, we add new fields, deprecate old ones, and increment the version. We use the version number to run a series of steps on each proto to upgrade old fields to new ones.
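
A rough sketch of that upgrade pattern (names invented, not from any actual game):

    message SaveGame {
      uint32 save_version = 1;
      // v1 field, kept so old saves still parse:
      string player_name = 2 [deprecated = true];
      // v2 replacement, populated by the upgrade step whenever a loaded
      // save has save_version < 2:
      PlayerProfile profile = 3;
    }

    message PlayerProfile {
      string display_name = 1;
    }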

[+] jnwatson|6 months ago|reply
ASN.1 implements message versioning in an extremely precise way. Implementing a linter would be trivial.
[+] yearolinuxdsktp|6 months ago|reply
I agree that saying no-one uses the backwards-compatible stuff is bizarre. Rolling deploys, and being able to function with a mixed deployment, are often worth the backwards-compatibility overhead, for many reasons.

In Java, you can accomplish some of this using Jackson JSON serialization of plain objects, where there are several ways to make changes backwards-compatibly (e.g., in recent years, post-deserialization hooks can handle the more complex cases), which satisfies (a). For (b), there’s no automatic linter. In practice, though, I found that writing tests that deserialize the prior release’s serialized objects gets you pretty far in terms of regression protection for major changes. It was also pretty easy to write an automatic round-trip serialization tester to catch mistakes in the ser/deser chain. Finally, if you stay away from non-schemable ser/deser (such as a method that handles any property name), which can be enforced with a linter, you can output the JSON schema of your objects to committed source. Then any time the generated schema changes, you can look for corresponding test coverage in code reviews.

I know that’s not the same as an automatic linter, but it gets you pretty far in practice. It does not absolve you from cross-release/upgrade testing, because serialization backwards-compatibility does not catch all backwards-compatibility bugs.

Additionally, Jackson has many techniques, such as unwrapping objects, which let you execute more complicated refactoring backwards-compatibly, such as extracting a set of fields into a sub-object.

I like that the same schema can be used to interact with your SPA web clients for your domain objects, giving you nice inspectable JSON. Things serialized to unprivileged clients can be filtered with views, such that sensitive fields are never serialized, for example.

You can generate TypeScript objects from this schema or generate clients for other languages (e.g. with Swagger). Granted it won’t port your custom migration deserialization hooks automatically, so you will either have to stay within a subset of backwards-compatible changes, or add custom code for each client.

You can also serialize your RPC comms to a binary format, such as Smile, which uses back-references for property names, should you need to reduce on-the-wire size.

It’s also nice to be able to define Jackson mix-ins to serialize classes from other libraries’ code or code that you can’t modify.
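
For anyone who hasn’t used those Jackson features, a small sketch of views, unwrapping, and mix-ins (all class names are made up):

    import com.fasterxml.jackson.annotation.JsonUnwrapped;
    import com.fasterxml.jackson.annotation.JsonView;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class Views {
      public static class Public {}
      public static class Internal extends Public {}
    }

    public class Address { public String street; public String city; }

    public class Customer {
      public String name;                  // no annotation: present in every view

      @JsonView(Views.Internal.class)      // never serialized for unprivileged clients
      public String taxId;

      @JsonUnwrapped                       // Address fields flattened into Customer's JSON
      public Address address;
    }

    // Serializing with the Public view filters out Internal-only fields:
    ObjectMapper mapper = new ObjectMapper();
    String json = mapper.writerWithView(Views.Public.class).writeValueAsString(customer);

    // Mix-ins attach annotations to classes you can't modify:
    // mapper.addMixIn(SomeLibraryType.class, SomeLibraryTypeMixin.class);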

[+] mattnewton|6 months ago|reply
Exactly, I think of protobuffers like I think of Java or Go - at least they weren’t writing it in C++.

Dragging your org away from using poorly specified json is often worth these papercuts IMO.

[+] tshaddox|6 months ago|reply
> Name another serialization declaration format that both (a) defines which changes can be make backwards-compatibly, and (b) has a linter that enforces backwards compatible changes.

The article covers this in the section "The Lie of Backwards- and Forwards-Compatibility." My experience working with protocol buffers matches what the author describes in this section.

[+] the__alchemist|6 months ago|reply
This is always the thing to look for: "What are the alternatives?" And why aren't there better ones?

I don't understand most use cases of protobufs, including ones that informed their design. I use it for ESP-hosted, to communicate between two MCUs. It is the highest-friction serialization protocol I've seen, and is not very byte-efficient.

Maybe something like the specialized serialization libraries (bincode, postcard etc) would be easier? But I suspect I'm missing something about the abstraction that applies to networked systems, beyond serialization.

[+] tgma|6 months ago|reply
> And I know the article says no one uses the backwards compatible stuff but that’s bizarre to me – setting up N clients and a server that use protocol buffers to communicate and then being able to add fields to the schema and then deploy the servers and clients in any order is way nicer than it is with some other formats that force you to babysit deployment order.

Yet the author has the audacity to call the authors of protobuf (originally Jeff Dean et al) "amateurs."

[+] jcgrillo|6 months ago|reply
As someone who has written many mapreduce jobs over years-old protobufs, I can confidently report that the backwards compatibility is what made it possible at all.
[+] noitpmeder|6 months ago|reply
Not that I love it -- but SBE (Simple Binary Encoding) is a _decent_ solution in the realm of backwards/forwards compatibility.
[+] maximilianburke|6 months ago|reply
Flatbuffers satisfies those requirements and doesn’t have varint shenanigans.
[+] motorest|6 months ago|reply
> Just with those two criteria you’re down to, like, six formats at most, of which Protocol Buffers is the most widely used.

What I dislike the most about blog posts like this is that, although the blogger is very opinionated and critical of many things, the post dates back to 2018, protobuf is still dominant, and apparently in all these years the blogger never put together something they felt solved the problem better. I mean, it's perfectly fine if they feel strongly about a topic. But investing this much energy in criticizing, and even personally attacking, whoever contributed to the project feels pointless: an exercise in self-promotion built on shit-talking. Either put something together that implements your vision and rights some wrongs, or don't go out of your way to put people down. Not cool.

[+] summerlight|6 months ago|reply
https://news.ycombinator.com/item?id=18190005

Just FYI: an obligatory comment from the protobuf v2 designer.

Yeah, protobuf has lots of design mistakes, but this article is written by someone who does not understand the problem space. Most of the complexity of serialization comes from implementation compatibility between different timepoints. This significantly limits the design space.

[+] thethimble|6 months ago|reply
Relatedly, most of the author's concerns are solved by wrapping things in a message.

> oneof fields can’t be repeated.

Wrap the oneof field in a message, which can then be repeated.

> map fields cannot be repeated.

Wrap the map in a message, and repeat that message instead.

> map values cannot be other maps.

Wrap the inner map in a message, which can then be used as a map value.

Perhaps this is slightly inconvenient/un-ergonomic, but the author is positioning these things as "protos fundamentally can't do this".
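
Concretely, the wrapping pattern looks something like this (names are illustrative):

    message Attachment {
      oneof kind {
        string url = 1;
        bytes inline_data = 2;
      }
    }

    message HeaderSet {
      map<string, string> headers = 1;
    }

    message Email {
      // "repeated oneof" via a wrapper message:
      repeated Attachment attachments = 1;
      // "map of maps" via a wrapper message as the value type:
      map<string, HeaderSet> headers_by_section = 2;
      // "repeated map" by repeating the wrapper instead:
      repeated HeaderSet extra_header_sets = 3;
    }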

[+] missinglugnut|6 months ago|reply
>Most of the complexity of serialization comes from implementation compatibility between different timepoints.

The author talks about compatibility a fair bit, specifically the importance of distinguishing a field that wasn't set from one that was intentionally set to a default, and how protobuffs punted on this.

What do you think they don't understand?

[+] xyzzyz|6 months ago|reply
> Granted, on paper it’s a cool feature. But I’ve never once seen an application that will actually preserve that property.

Chances are, the author literally used software that does it as he wrote these words. This feature is critical to how Chrome Sync works. You wouldn’t want to lose synced state if you use an older browser version on another device that doesn’t recognize the unknown fields and silently drops them. This is so important that at some point Chrome literally forked protobuf library so that unknown fields are preserved even if you are using protobuf lite mode.
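
For reference, this is what the behavior looks like with the full (non-lite) Java runtime; OldMsg and NewMsg are invented names for two generations of the same message:

    // NewMsg declares fields 1 and 2; OldMsg was generated before field 2 existed.
    NewMsg sent = NewMsg.newBuilder().setId(7).setAddedLater("synced state").build();

    // An old binary parses it; field 2 lands in the message's UnknownFieldSet
    // instead of being silently dropped...
    OldMsg oldView = OldMsg.parseFrom(sent.toByteArray());

    // ...so when the old binary re-serializes and a newer binary reads it back,
    // nothing has been lost:
    NewMsg roundTripped = NewMsg.parseFrom(oldView.toByteArray());
    // roundTripped.getAddedLater() still returns "synced state"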

[+] xmddmx|6 months ago|reply
I share the author's sentiment. I hate these things.

True story: trying to reverse engineer the macOS Photos.app SQLite database format to extract human-readable location data from an image.

I eventually figured it out, but it was:

A base64-encoded binary plist with one field containing a protobuf, which contained another protobuf, which contained a Unicode string containing improperly encoded data (for example, U+2013 EN DASH was encoded as \342\200\223)

This could have been a simple JSON string.

[+] xg15|6 months ago|reply
I'm starting to wonder if some of those bad design decisions are symptoms of a larger "cultural bias" at Google. Specifically the "No Compositionality" point: It reminds me of similar bad designs in Go, CSS and the web platform at large.

The pattern seems to be that generalized, user-composable solutions are discouraged in favor of a myriad of special constructs that satisfy whatever concrete use cases seem relevant for the designers in the moment.

This works for a while and reduces the complexity of the language upfront, while delivering results - but over time, the design devolves into a rat's nest of hyperspecific features with awkward and unintuitive restrictions.

Eventually, the designers might give up and add more general constructs to the language - but those feel tacked on and have to coexist with specific features that can't be removed anymore.

[+] bithive123|6 months ago|reply
I don't know if the author is right or wrong; I've never dealt with protobufs professionally. But I recently implemented them for a hobby project and it was kind of a game-changer.

At some stage with every ESP or Arduino project, I want to send and receive data, i.e. telemetry and control messages. A lot of people use ad-hoc protocols or HTTP/JSON, but I decided to try the nanopb library. I ended up with a relatively neat solution that just uses UDP packets. For my purposes a single packet has plenty of space, and I can easily extend this approach in the future. I know I'm not the first person to do this but I'll probably keep using protobufs until something better comes along, because the ecosystem exists and I can focus on the stuff I consider to be fun.

[+] tliltocatl|6 months ago|reply
Embedded/constrained UDP is where the protobuf wire format (but not Google's libraries) rocks: IoT over cellular and such, where you need to fit everything into a single datagram (the number of round trips is what determines power consumption). As to those who say "UDP is unreliable": what you do is implement ARQ at the application level. Just like TCP does, except you don't have to waste round trips on the SYN/SYN-ACK/ACK handshake, nor waste bytes sending data that is no longer relevant.

Varints for the win. Send time series as columns of varint arrays - delta or RLE compression becomes quite straightforward. And as a bonus I can implement new fields on the device and deploy right away - the server-side support can wait until we actually need it.
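
Schema-wise, something like this (field names invented; the sender delta-encodes before filling the arrays):

    message Telemetry {
      // Columnar time series. Repeated scalar fields are packed in proto3,
      // so each column is just varints back to back. sint64 is zigzag-
      // encoded, so small positive or negative deltas cost 1-2 bytes each.
      repeated sint64 time_deltas_ms = 1;  // first element absolute, rest deltas
      repeated sint64 temperature_deltas = 2;
    }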

No, flatbuffers/cap'n'proto are unacceptably big because of their fixed layout. No, CBOR is an absolute no-go - why on earth would you waste precious bytes on self-description in every message? No, general-purpose compression like gzip wouldn't do much at such a small size; it would probably make things worse. Yes, ASN.1 is supposed to be the right solution - but there is no full-featured implementation that doesn't cost $$$$, and the whole thing is just too damn bloated.

Kinda fun that it sucks for what it is supposed to do, but actually shines elsewhere.

[+] _zoltan_|6 months ago|reply
And since it's UDP, if it's lost, it's lost. And since it's not standard HTTP/JSON, in a year nobody will have a clue what it is or how to decode it.

To learn and play with, it's fine; otherwise, why complicate life?

[+] ndr|6 months ago|reply
Before the first line even ends, you get "They're clearly written by amateurs".

This is rage bait, not worth the read.

[+] btilly|6 months ago|reply
The reasons for that line get at a fundamental tension. As David Wheeler famously said, "All problems in computer science can be solved by another level of indirection, except for the problem of too many indirections."

Over time we accumulate cleverer and cleverer abstractions. And any abstraction that we've internalized, we stop seeing. It just becomes how we want to do things, and we have no sense of the cost we are imposing on others. Because all abstractions leak. And all abstractions pose a barrier for the maintenance programmer.

All of which leads to the problem that Brian Kernighan warned about with, "Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?" Except that the person who will have to debug it is probably a maintenance programmer who doesn't know your abstractions.

One of the key pieces of wisdom that shows through Google's approach is that our industry's tendency towards abstraction is toxic. As much as any particular abstraction is powerful, allowing too many becomes its own problem. This is why, for example, Go was designed to strongly discourage over-abstraction.

Protobufs do exactly what it says on the tin. As long as you are using them in the straightforward way which they are intended for, they work great. All of his complaints boil down to, "I tried to do some meta-manipulation to generate new abstractions, and the design said I couldn't."

That isn't the result of them being written by amateurs. That's the result of them being written to incorporate a piece of engineering wisdom that most programmers think that they are smart enough to ignore. (My past self was definitely one of those programmers.)

Can the technology be abused? Do people do stupid things with them? Are there things that you might want to do that you can't? Absolutely. But if you KISS, they work great. And the more you keep it simple, the better they work. I consider that an incentive towards creating better engineered designs.

[+] jilles|6 months ago|reply
The best way to get your point across is by starting with ad-hominem attacks to assert your superior intelligence.
[+] BugsJustFindMe|6 months ago|reply
If only the article offered both detailed analyses of the problems and also solutions. Wait, it does! You should try reading it.
[+] jeffbee|6 months ago|reply
Yep, the article opens with a Hall of Fame-grade compound fallacy: a strawman refutation of a hypothetical ad hominem that nobody has argued.

You can kinda see how this author got bounced out of several major tech firms in a year or less each, according to their LinkedIn.

[+] TZubiri|6 months ago|reply
> if (m_foo = null)

Imagine calling Google amateurs, and then the only code you write has a first-year-student error: failing to distinguish the assignment operator from the comparison operator.

There's a class of rant on the internet where programmers complain about increasingly foundational tech instead of admitting skill issues. Go far enough down that hole and you end up rewriting the kernel in Rust.

[+] awalsh128|6 months ago|reply
Yeah, there is a lot of snark in the article which undermines their argument.
[+] IncreasePosts|6 months ago|reply
It's written by amateurs, but solves problems that only Google (one of the biggest/most advanced tech companies in the world) has.
[+] barrkel|6 months ago|reply
I'm afraid that this is a case of someone imagining that there are Platonic ideal concepts that don't evolve over time, that programs are perfectible. But people are not immortal and everything is always changing.

I almost burst out in laughter when the article argued that you should reuse types in preference to inlining definitions. If you've ever felt the pain of needing to split something up, you would not be so eager to reuse. In a codebase with a single process, it's pretty trivial to refactor to split things apart; you can make one CL and be done. In a system with persistence and distribution, it's a lot more awkward.

That whole meaning of data vs representation thing. There's fundamentally a truth in the correspondence. As a program evolves, its understanding of its domain increases, and the fidelity of its internal representations increase too, by becoming more specific, more differentiated, more nuanced. But the old data doesn't go away. You don't get to fill in detail for data that was gathered in older times. Sometimes, the referents don't even exist any more. Everything is optional; what was one field may become two fields in the future, with split responsibilities, increased fidelity to the domain.

[+] iamdelirium|6 months ago|reply
Yeah, oneof fields can't be repeated, but you can just wrap them in a message. It's not as pretty, but I've never had any issues with this.

The fact that the author is arguing for making all fields required means they don't understand the reasoning for why all fields are optional: required fields break systems when protos get out of sync (there are postmortems outlining this).

[+] imtringued|6 months ago|reply
>The solution is as follows:

> * Make all fields in a message required. This makes messages product types.

Meanwhile in the capnproto FAQ:

>How do I make a field “required”, like in Protocol Buffers?

>You don’t. You may find this surprising, but the “required” keyword in Protocol Buffers turned out to be a horrible mistake.

I recommend reading the rest of the FAQ [0], but if you are in a hurry: fixed-schema protocols like protobuf don't let you drop fields the way self-describing formats such as JSON do. Removing a required field, or switching it from required to optional, is an ABI-breaking change. Nobody wants to update all servers and all clients simultaneously; at that point, you would be better off defining a new API endpoint and deprecating the old one.
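
The failure mode, sketched in proto2 (message name invented):

    syntax = "proto2";

    // Once this has shipped, customer_id can never be removed or relaxed
    // to optional: any old binary that receives a message without it will
    // reject the whole message at parse time.
    message Order {
      required string customer_id = 1;
    }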

The Cap'n Proto FAQ also argues that validation should be handled at the application level rather than the ABI level.

[0] https://capnproto.org/faq.html

[+] mountainriver|6 months ago|reply
> Protobuffers correspond to the data you want to send over the wire, which is often related but not identical to the actual data the application would like to work with

This sums up a lot of the issues I’ve seen with protobuf as well. It’s not an expressive enough language to be the core data model, yet people use it that way.

In general, if you don't have extreme network needs, protobuf seems to cause more harm than good. I've watched Go teams spend months implementing proto-based systems with little to no gain over plain REST.

[+] ryukoposting|6 months ago|reply
Protobuf's original sin was failing to distinguish zero/false from undefined/unset/nil. Confusion around the semantics of a zero value is the root of most proto-related bugs I've come across. At the same time, that very characteristic of protobuf makes its on-wire form really efficient in a lot of cases.
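
Worth noting that proto3 did eventually grow an opt-in fix for this (explicit presence came back around protobuf 3.15); a minimal sketch, with invented names:

    message SensorReading {
      // Here 0 could mean "measured zero" or "never set": indistinguishable.
      int32 raw_value = 1;

      // 'optional' in proto3 restores explicit presence, generating a
      // hazzer (e.g. has_calibrated_value()) alongside the getter.
      optional int32 calibrated_value = 2;
    }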

Nearly every other complaint is solved by wrapping things in messages (sorry, product types). I don't get the enum limitation on map keys, though; that complaint is fair.

Protobuf eliminates truckloads of stupid serialization/deserialization code that, in my embedded world, almost always has to be hand-written otherwise. If there was a tool that automatically spat out matching C, Kotlin, and Swift parsers from CDDL, I'd certainly give it a shot.

[+] spectraldrift|6 months ago|reply
I'm not sure why this post gets boosted every few years. Unfortunately (as many have pointed out) the author demonstrates that they do not understand distributed system design, nor how to use protocol buffers. I have found them to be one of the most useful tools in modern software development when used correctly. Not only are they much faster than JSON, they prevent the inevitable redefinition of nearly identical code across a large number of repos (which is what I've seen in 95% of corporate codebases that eschew tooling such as this). Sure, there are alternatives to protocol buffers, but I have not seen them gain widespread adoption yet.
[+] ericpauley|6 months ago|reply
I lost the plot here when the author argued that repeated fields should be implemented as in the pure lambda calculus...

Most of the other issues in the article can be solved by wrapping things in more messages. Not great, not terrible.

As with the tightly-coupled issues with Go, I'll keep waiting for a better approach any decade now. In the meantime, both tools (for their glaring imperfections) work well enough, solve real business use cases, and have a massive ecosystem moat that makes them easy to work with.

[+] BugsJustFindMe|6 months ago|reply
I went into this article expecting to agree with part of it. I came away agreeing with all of it. And I want to point out that Go also shares some of these catastrophic data decisions (automatic struct zero values that silently do the wrong thing by default).
[+] sethammons|6 months ago|reply
We got bit by a default value in a DMS task where the target column didn't exist, so the data wasn't replicated, and the default value meant "this work needs to be done."

This is not pb nor go. A sensible default of invalid state would have caught this. So would an error and crash. Either would have been better than corrupt data.

[+] dano|6 months ago|reply
It is a 7-year-old article that doesn't specify alternatives to an "already solved problem."

So HN, what are the best alternatives available today and why?

[+] allanrbo|6 months ago|reply
Sometimes you are integrating with systems that already use proto, though. I recently wrote a tiny, dependency-free, practical protobuf (proto3) encoder/decoder for those situations where you need just a little bit of protobuf in your project and don't want to bother with the whole proto ecosystem of codegen and deps: https://github.com/allanrbo/pb.py
[+] ants_everywhere|6 months ago|reply
> Maintain a separate type that describes the data you actually want, and ensure that the two evolve simultaneously.

I don't actually want to do this, because then you have N + 1 implementations of each data type, where N = number of programming languages touching the data, and + 1 for the proto implementation.

What I personally want to do is use a language-agnostic IDL to describe the types that my programs use. Within Google you can even do things like just store them in the database.

The practical alternative is to use JSON everywhere, possibly with some additional tooling to generate code from a JSON schema. JSON is IMO not as nice to work with. The fact that it's also slower probably doesn't matter to most codebases.

[+] thinkharderdev|6 months ago|reply
> I don't actually want to do this, because then you have N + 1 implementations of each data type, where N = number of programming languages touching the data, and + 1 for the proto implementation.

I think this is exactly what you end up with when using protobuf. You have an IDL that describes the interface types, but then protoc generates language-specific types that are horrible, so you end up converting the generated types to some internal type that is easier to use.

Ideally, if you have an IDL that is more expressive, the code generator can create more "natural" data structures in the target language. I haven't used it a ton, but when I have used Thrift, the generated code has been 100x better than what protoc generates. I've been able to actually model my domain in the Thrift IDL and end up with types that look like what I would have written by hand, so I don't need to create a parallel set of types as a separate domain model.

[+] danans|6 months ago|reply
> The practical alternative is to use JSON everywhere, possibly with some additional tooling to generate code from a JSON schema.

Protobuf has a bidirectional JSON mapping that works reasonably well for a lot of use cases.

I have used it to skip the protobuf wire format altogether and just use protobuf for the IDL and multi-language binding, both of which IMO are far better than JSON-Schema.
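
The round trip is a couple of lines with the JsonFormat utility from protobuf-java-util (assuming a generated Person message):

    import com.google.protobuf.util.JsonFormat;

    // proto -> canonical JSON
    String json = JsonFormat.printer().print(person);

    // JSON -> proto, tolerating fields this binary doesn't know about yet
    Person.Builder builder = Person.newBuilder();
    JsonFormat.parser().ignoringUnknownFields().merge(json, builder);
    Person parsed = builder.build();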

JSON-Schema is definitely more powerful though, letting you do things like field-level constraints. I'd love to see a tool that paired the best of both.

[+] vander_elst|6 months ago|reply
Always initializing with a default, plus no algebraic types, is an always-loaded footgun. I wonder if the people behind golang took inspiration from this.
[+] wrsh07|6 months ago|reply
The simplest way to understand Go is that it is a language that integrates some of Google's best C++ features (their lightweight threads and other multithreading primitives are the highlights).

Beyond that it is a very simple language. But yes, 100%, for better and for worse, it is deeply inspired by Google's codebase and needs.