Saving Millions by Dumping Java Serialization

[+] hrshtr|9 years ago|reply

Was using Thrift or Protobuf an option?

[+] quest88|9 years ago|reply

I'd like to know this too. As a passerby, those seem to have solved serialization, so I'm curious why you need rowfiles instead of e.g. protobuf.

[+] jnewhouse|9 years ago|reply

We started developing rowfiles around 2005. Thrift wasn't open sourced until 2007. I couldn't find a date for protobuf's release, but I don't think it was standard outside of google at that time. We use protobufs internally, and have a number of Rows whose field values are byte[]s containing protobufs. One big thing our rowfiles gives us is fast indexing. The only other big data format I know of that gives that is Kudu, which uses the same indexing scheme.

[+] pokemon-trainer|9 years ago|reply

What is Thrift? Is it a service?

[+] Cieplak|9 years ago|reply

Some interesting benchmarks of various Java serialization libraries:

https://github.com/eishay/jvm-serializers/wiki

[+] jnewhouse|9 years ago|reply

Author here, let me know if you have any questions/want more details.

[+] whack|9 years ago|reply

Thanks for writing about your experiences. Why not use an existing serialization framework, such as Protobuf, instead of building something in-house?

[+] revscat|9 years ago|reply

1) Can you provide any more details about how Rowfiles are structured and/or implemented? Specifically, how does it handle nested objects? Does it support `transient`? Do `writeObject` and/or `readObject` come into play?

2) Do you feel this is a generic enough solution that you would consider submitting it as a JSR?

[+] user5994461|9 years ago|reply

So... what's quantcast?

[+] cntlzw|9 years ago|reply

I am by no means an expert, but I always wonder why people don't adopt ASN.1 for serialization? I know it is not pretty but writing machine readable stuff never is.

[+] bluecarbuncle|9 years ago|reply

Portable Object Format from 10 years ago? https://docs.oracle.com/cd/E24290_01/coh.371/e22837/api_pof....

[+] MS_Buys_Upvotes|9 years ago|reply

Can someone explain to an amateur why serialization is faster than say passing raw JSON?

It seems like parsing JSON would be faster than the serialize -> deserialize process but with the popularity of things like Protobuff it's clear that JSON is slower.

[+] jerf|9 years ago|reply

Serialization is the process of writing arbitrary data out into a blob of some sort (binary, text, whatever) that can be read in later and processed back into the original data, possibly not by the same system. This should be considered to include even the degenerate case of just writing the content of an expanse of RAM out, as that still raises issues related to serialization.

"JSON Serialization" and "Java Serialization" are two different things that can accomplish that goal. It sounds to me from your question that you think they have some fundamental difference, because your second paragraph implies you believe there is some sort of fundamental difference between Java serialization and JSON serialization, but there isn't. There is a whole host of non-fundamental differences that you always have to consider with a serialization format (speed, what can be represented, circular data structure handling, whether untrusted data can be used), but there's not a fundamental difference.

[+] mnarayan01|9 years ago|reply

JSON is a serialization format, just one that at least nods in the direction of human-readability. Formats which don't worry about human readability (e.g. Protobuf) can gain various degrees of efficiency.

[+] ScottBurson|9 years ago|reply

For one thing, all numbers can be written in binary, saving the lexing and conversion time. For another, strings can be written by first writing the length (in binary, of course), then writing the raw contents; there's no need to scan the input looking for the closing quote, handle backslash escapes, or do UTF-8 conversion.

That's probably most of the gain right there, but more things can be done along those lines.

[+] athenot|9 years ago|reply

Either way, it's serialization: object serialization or JSON serialization.

However, independently of how the data is represented (JSON or one of the many binary formats), the issue is to only encode/decode what you actually care about. From the little I understand about java object serialization, there's a lot of extra stuff that gets encoded, which may not be needed at all for the application at hand.

For an example of efficient serialization techniques, take a look at some of the MPEG formats (the older ones are easier to grok). They have a neat way of representing what is needed and dealing with optional data.

[+] felixgallo|9 years ago|reply

JSON must also be serialized or deserialized. Parsing it is slow and hard and not cache friendly.

Protobuf has the benefit of being extremely compact and, in some important languages, fast and friendly to serialize and deserialize.

[+] coldtea|9 years ago|reply

JSON is a serialization format itself. Even Javascript, which has JSON-like structures as native objects, has to serialize it (JSON.stringify) and deserialize it (JSON.parse) into text.

[+] unknown|9 years ago|reply

[deleted]

[+] Alupis|9 years ago|reply

> Secondly, Java serialization produces very bulky outputs. Each serialization contains all of the data required to deserialize. When you’re writing billions of records at a time, recording the schema in every record massively increases your data size.

Sounds to me like you shouldn't be storing objects in your database.

Why not just write the data into tables, and then create new POJO's when necessary, using the selected data?

[+] jnewhouse|9 years ago|reply

A standard database table isn't large enough to handle our large datasets. For example, the Hercules dataset was over 2 petabytes and even after optimization is almost 1 petabyte. Big data systems like Spark, Impala, Presto, etc. are designed to make the data look like a table, even though it is spread out into many files in a distributed filesystem. This is what we do. It's pretty common to reimplement some database features onto these big data file formats. In our case we have very fast indexes that let us quickly fetch data, similar to an index in a postgresql table.

[+] ghusbands|9 years ago|reply

The article explains that they did, in fact, promote the data into columns of the data store.

[+] jankotek|9 years ago|reply

Perhaps I could share my project which is trying to 'fix' java serialization?

It was originally part of database engine, but was extracted into separate project. It solves things like cyclic reference, non-recursive graph traversal and incremental serialization of large object graphs.

https://github.com/jankotek/elsa/

[+] user123|9 years ago|reply

TLDR: we had shitty code, optimized it, now it runs well. No code examples, nothing.

[+] jnewhouse|9 years ago|reply

If you want more details, we were packing a Row class into a base64 encoded string using an ObjectOutputStream. This is a fine thing for small scale serialization but sucks at scale, because of the reasons mentioned in the post. Sorry we don't have code examples, but it's unclear how useful it'd be given that no one else uses our file format. If you want a bit more detail on how the format works. Each metadata contains a list of typed columns to define the schema of a given part. Our map-reduce framework has a bunch of internal logic that tries to justify the written Row class with the one the Mapper class is asking for. This allows us to do things like ingest different versions of a row with in the context of a single job. I think questions of serialization at the scale are generally interesting, although ymmv. I know of one company using Avro, which doesn't let you cleanly update or track schema. They've ended up storing every schema in an HBase table and reserving the first 8 bytes to do a lookup into this table to know the row's schema.

[+] jlg23|9 years ago|reply

"We had shitty code, optimized it, now it runs [maybe] better" is probably the only fixture in software development.

I'm not sure kind of code samples one would want in the context of an article as abstract as this one.

I did not take anything from the article, but had a lot of "been there, done that" moments when skimming it. The difference between the OP an me: The OP wrote about it so others can learn, something I never did.

45 comments