Show HN: Kim – A Python serialization and marshaling framework

136 points| mikeywaites | 9 years ago |kim.readthedocs.io

57 comments

[+] sandGorgon|9 years ago|reply
We are really looking for serialization libraries that will work with pandas and scikit.

This stuff is really all over the place - PMML, Arrow, Dill, pickle.

Some stuff won't work with one or the other. I will actually pay for consistency versus performance.

There are way too many primitive serialization libraries. Surprisingly none for the higher order ML, etc stuff.

Given the kind of people behind Arrow, I would love a wrapper that uses Arrow to do all of this... But it doesn't matter at the end of the day.

[+] makmanalp|9 years ago|reply
So stuff like this or marshmallow is more for cases when you have some database / ORM objects and you want to serialize them out to a json object, or you want to process form/POST data into a well-structured json or database object.

For your use case, it's more about large amounts of tabular data and efficient (binary / columnar / compressed) serialization and queryability. I'd say the de facto standard for that is HDF5, which PyTables supports (http://www.pytables.org/). This is what pandas uses under the hood for its HDF5 support, and I've been using it with hundreds of millions of rows with no problem.

Arrow is slightly different: it's a specification for the in-memory layout of data that enables faster computation. It is more about what happens when you have data in memory and want to use it with another tool. Serializing, deserializing and munging formats is a waste of time if tools can standardize how they store dataframes in memory and can work on each other's tables. As far as I understand, Feather is not an implementation of Arrow (that would be up to the processing tools like pandas), but it supports a way of saving and loading that in-memory format to and from disk efficiently and in an interoperable way. (https://github.com/wesm/feather)

Also of note is Parquet, which has similar goals to HDF and Feather; the Continuum / dask people have been working on a wrapper for it called fastparquet (https://github.com/dask/fastparquet). In my experience it has a few hitches right now but works darn well, and gives me better performance than HDF. Parquet is also one of the Hadoop ecosystem's de facto standard storage formats, which again is good for interop.

[+] fnord123|9 years ago|reply
McKinney has been hard at work getting parquet and arrow support in pandas.

http://wesmckinney.com/blog/outlook-for-2017/

>Given the kind of people behind Arrow, I would love a wrapper that uses Arrow to do all of this... But it doesn't matter at the end of the day.

pyarrow; pyarrow.parquet (which uses parquet-cpp).

[+] mhneu|9 years ago|reply
Python's data infrastructure has a huge problem: serialization and thus saving data results.

A good serialization library should serialize:

  - classes/objects (best practice: objects for holding data)
  - pandas/numpy objects (must have: minimizing space)
  - namedtuples (currently: a mess, factory implementation)
  - dicts and lists of dicts (must have: space efficiency)

Compare to Matlab: save(f, 'anyobject'); anyobject=load(f)

Python is terrible at this and it limits use in real data analysis environments and limits competition with matlab.
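
For comparison, the closest standard-library analogue to that Matlab one-liner is `pickle`. The sketch below wraps it in hypothetical `save`/`load` helpers (names invented here, not from any library). It round-trips plain Python containers, but it does not address the space-efficiency or interoperability points in the list above, and pickle files should only be loaded from trusted sources.

```python
import os
import pickle
import tempfile


def save(path, obj):
    """Hypothetical helper: write any picklable object to `path`."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load(path):
    """Hypothetical helper: read back an object written by `save`."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Example payload made of built-in types only.
results = {"run": 7, "scores": [0.91, 0.87], "shape": (3, 4)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.pkl")
    save(path, results)
    restored = load(path)
```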

[+] mikeywaites|9 years ago|reply
One of the things we felt very strongly about when developing Kim was that "Simple things should be simple. Complex things should be possible." To that end, the Pipeline system behind the Field objects really does allow anything to be achieved, whether that's producing values from composite fields or handling unique or non-standard data types.

It would be great if you could share some ways that you specifically need serialization to work for something like pandas, or better yet, some ways existing solutions don't work with pandas. We've had some pretty unique requirements ourselves and have not found any blockers yet.

Thanks for the message.

[+] limdauto|9 years ago|reply
I'd like to congratulate the authors on the clever naming. I totally get the Eminem reference.

Disclaimer: Posting this comment because my colleague pointed out that I could get some points.

[+] _e|9 years ago|reply
Kim: A JSON Serialization and Marshaling framework that Mathers
[+] Dowwie|9 years ago|reply
Cool project!

In the case of serialization libraries, unless you are validating as part of your (de)serialization, I'd recommend avoiding schema-driven serialization libraries. Kim-like libraries, such as Marshmallow, introduce quite a bit of overhead. If validation isn't required and performance matters, I recommend choosing a lighter-weight serialization/marshalling alternative, such as that provided by asphalt-serialization: https://github.com/asphalt-framework/asphalt-serialization

Asphalt-serialization supports cbor, msgpack, json, ... and is easy to wire up

This recommendation is based on my own experience using Marshmallow for Yosai, analyzing its performance and then refactoring to a ported version of asphalt-serialization.
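
To make the trade-off concrete, here is a minimal sketch in plain Python. It is not Kim, Marshmallow, or asphalt-serialization code; the `FIELDS` table and both functions are hypothetical. The point is only that a schema-driven path does per-field work that a direct dump skips, and the timings it prints will vary by machine.

```python
import json
import timeit

# Hypothetical schema: field name -> expected type.
FIELDS = {"id": int, "name": str, "active": bool}


def dump_validated(record):
    """Schema-driven path: check every field before serializing."""
    out = {}
    for name, expected in FIELDS.items():
        value = record[name]
        if not isinstance(value, expected):
            raise TypeError(f"{name!r} must be {expected.__name__}")
        out[name] = value
    return json.dumps(out, sort_keys=True)


def dump_plain(record):
    """Schema-less path: hand the dict straight to the encoder."""
    return json.dumps(record, sort_keys=True)


record = {"id": 1, "name": "kim", "active": True}

# Time both paths on the same record.
validated_time = timeit.timeit(lambda: dump_validated(record), number=10_000)
plain_time = timeit.timeit(lambda: dump_plain(record), number=10_000)
print(f"validated: {validated_time:.4f}s  plain: {plain_time:.4f}s")
```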

[+] mikeywaites|9 years ago|reply
Hey Dowwie!

That's a great point and an important distinction to make. As I mentioned in some of the other comments, we have certainly been focussed on features over performance so far but we are actively working on dramatically improving the performance of Kim.

I guess it's always important to pick the right tool for the job. Thanks for sharing the link to asphalt too. I'd not seen that before.

[+] voidfiles|9 years ago|reply
I added Kim to my ongoing set of Python serialization framework benchmarks. Here is how it ranks:

  Library                  Many Objects    One Object
  ---------------------  --------------  ------------
  Custom                      0.0187769    0.00682402
  Strainer                    0.0603201    0.0337129
  serpy                       0.073787     0.038656
  Lollipop                    0.47821      0.231566
  Marshmallow                 1.14844      0.598486
  Django REST Framework       1.94096      1.3277
  kim                         2.28477      1.15237

Comments on how to improve the benchmark are appreciated.

source: https://voidfiles.github.io/python-serialization-benchmark/

[+] makmanalp|9 years ago|reply
This is brilliant, exactly what I was looking for. I profiled some API calls recently and found that 40-50% of the time was being spent on serialization with marshmallow, which I'm looking to drop.
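
A profile like that can be reproduced with the standard library's `cProfile`. The sketch below uses a made-up `handle_request` with stand-in `load_rows` and `serialize` steps (all hypothetical) and prints the five entries with the highest cumulative time, which is usually enough to see how much of a call is serialization.

```python
import cProfile
import io
import json
import pstats


def load_rows(n):
    """Stand-in for a database query (hypothetical)."""
    return [{"id": i, "name": f"user{i}"} for i in range(n)]


def serialize(rows):
    """Stand-in for the serialization layer being measured."""
    return json.dumps(rows)


def handle_request():
    return serialize(load_rows(5_000))


# Profile a single request.
profiler = cProfile.Profile()
profiler.enable()
body = handle_request()
profiler.disable()

# Render the report into a string, sorted by cumulative time.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```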

I'll be doing this stuff for myself, but would you be interested in having:

a) Support for lima: https://lima.readthedocs.io/en/latest/

b) more benchmark cases (serializing a larger list of objects)

[+] RussianCow|9 years ago|reply
Just a minor note: It seems you don't mention anywhere what those numbers actually mean. I'm assuming they are seconds, but I can't know for certain, which makes it really unclear if Kim is the fastest or the slowest.
[+] mikeywaites|9 years ago|reply
Thanks so much for this, Voidfiles. We were under no illusions that we were the most performant library out there (yet).

This is a great start for us understanding where we need to get to! We've got some work to do :)

[+] yeukhon|9 years ago|reply
Nice, but I recommend closing the issues at https://github.com/mikeywaites/kim/issues which have fixes (some of them show 'merge'). It's one thing I, as a user, look at when choosing whether to adopt a project or not.
[+] mikeywaites|9 years ago|reply
Absolutely. I'm a bit annoyed at myself that I hadn't got round to that yet, but thanks for raising it.
[+] sakawa|9 years ago|reply
It does look like marshmallow[1]. How does Kim relate to it?

[1]: https://github.com/marshmallow-code/marshmallow/

[+] jackqu7|9 years ago|reply
(I'm Jack, another developer at OSL.)

We started writing Kim around the same time as the Marshmallow project began as we found it wasn't suitable for our needs at that time, though it has come a long way since then.

They are very similar projects and have similar functionality, but Kim has a focus on making it relatively simple to do unusual or 'advanced' things.

For example, Kim supports polymorphism out of the box: if you have an AnimalMapper subclassed by a CatMapper and a DogMapper, passing a Cat and a Dog to AnimalMapper.many.serialize() will automatically do the right thing, in a similar way to SQLAlchemy polymorphism.
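
To illustrate the idea only, here is a plain-Python sketch of polymorphic dispatch. It is not Kim's API: the `registry`, `handles`, `fields` and `serialize_many` names are invented for this example, and Kim's real mappers are declared differently.

```python
class Animal:
    def __init__(self, name):
        self.name = name


class Cat(Animal):
    def __init__(self, name, indoor):
        super().__init__(name)
        self.indoor = indoor


class Dog(Animal):
    def __init__(self, name, breed):
        super().__init__(name)
        self.breed = breed


class AnimalMapper:
    """Base mapper; each subclass registers the type it handles."""

    registry = {}
    handles = Animal

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        AnimalMapper.registry[cls.handles] = cls

    def fields(self, obj):
        return {"type": "animal", "name": obj.name}

    @classmethod
    def serialize_many(cls, objs):
        # Look up the mapper registered for each object's exact type,
        # falling back to the mapper this method was called on.
        out = []
        for obj in objs:
            mapper_cls = cls.registry.get(type(obj), cls)
            out.append(mapper_cls().fields(obj))
        return out


class CatMapper(AnimalMapper):
    handles = Cat

    def fields(self, obj):
        return {**super().fields(obj), "type": "cat", "indoor": obj.indoor}


class DogMapper(AnimalMapper):
    handles = Dog

    def fields(self, obj):
        return {**super().fields(obj), "type": "dog", "breed": obj.breed}


# Calling the base mapper still yields type-specific output.
result = AnimalMapper.serialize_many([Cat("Tom", True), Dog("Rex", "collie")])
```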

We also have support for complex requirements such as nesting the same object to itself (useful when your JSON representation is nested but your DB representation is flat), serialising multiple object fields to a single JSON field (e.g. full_name consisting of obj.first_name and obj.last_name), a range of security models for marshalling nested objects, and a fairly extensible roles system.
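
As a rough picture of the first two requirements, the sketch below does by hand what a mapper would declare: it collapses two attributes into one `full_name` field and regroups flat columns under a nested key. The `UserRow` class, its `address_*` columns and both functions are hypothetical and are not Kim code.

```python
from dataclasses import dataclass


@dataclass
class UserRow:
    """Flat, database-style record (hypothetical)."""
    first_name: str
    last_name: str
    address_street: str
    address_city: str


def serialize_user(row):
    return {
        # Two object attributes collapse into one output field.
        "full_name": f"{row.first_name} {row.last_name}",
        # Flat columns are regrouped under a nested key.
        "address": {
            "street": row.address_street,
            "city": row.address_city,
        },
    }


def marshal_user(data):
    """Reverse direction: nested dict back to the flat record."""
    first, _, last = data["full_name"].partition(" ")
    return UserRow(
        first_name=first,
        last_name=last,
        address_street=data["address"]["street"],
        address_city=data["address"]["city"],
    )


row = UserRow("Ada", "Lovelace", "12 Main St", "London")
data = serialize_user(row)
```

Splitting `full_name` on the first space is a simplification; real names need a more careful rule.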

In general we've followed the philosophy "Simple things should be simple. Complex things should be possible."

[+] sametmax|9 years ago|reply
I think marshmallow's primary use case is to unserialize to nested dicts/lists, while Kim outputs full classes. Did I understand that right?
[+] siddhant|9 years ago|reply
Cool! Are there any speed comparisons available between this and marshmallow (or other alternatives)?
[+] mikeywaites|9 years ago|reply
Hi Siddhant,

We've not really dug into performance yet, though if you look at the last patch (1.0.2) we gained a 10% speed-up by removing an erroneous try/except block.

We've really focussed on features initially and performance is something we're actively researching now. Perhaps we can get some initial benchmarks together and share them with you this week. They will be useful no doubt as we start to plan a release focussed on speed ups.

Thanks for reaching out!

[+] amelius|9 years ago|reply
Can it serialize cycles?
[+] mikeywaites|9 years ago|reply
Hey Amelius,

Thanks for the message. Gonna be honest, I'm not sure what you mean by cycles. Can you elaborate a bit?

[+] ziikutv|9 years ago|reply
So this takes JSON and maps it to namedtuples?

Silly question, what happens with Unicode?

[+] rat87|9 years ago|reply
I was just looking for something like this or marshmallow.
[+] ff7c11|9 years ago|reply
why not pickle :) :)
[+] BuuQu9hu|9 years ago|reply
Sorry, I must be harsh. No.

This fundamentally doesn't offer much advantage over a .toJSON() instance method and a .fromJSON() class method.
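
The hand-rolled approach referred to here looks roughly like this. It is a minimal sketch with a hypothetical `User` class, using Python-style `to_json`/`from_json` names, and it performs no validation of the incoming payload.

```python
import json


class User:
    def __init__(self, name, email):
        self.name = name
        self.email = email

    def to_json(self):
        """Instance method: object -> JSON string."""
        return json.dumps({"name": self.name, "email": self.email})

    @classmethod
    def from_json(cls, payload):
        """Class method: JSON string -> object, with no validation."""
        data = json.loads(payload)
        return cls(name=data["name"], email=data["email"])


original = User("Kim", "kim@example.com")
copy = User.from_json(original.to_json())
```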

Don't say "security-focused" if you can't handle cyclic object graphs.
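
For readers unfamiliar with the term, a cyclic object graph is one where following references eventually leads back to an object already visited. The sketch below shows the standard-library `json` encoder rejecting such a structure, plus one hypothetical workaround (`break_cycles`, invented here) that replaces back-references with a `$ref` marker. It says nothing about how Kim itself behaves.

```python
import json

# A parent/child pair that reference each other forms a cycle.
parent = {"name": "parent", "children": []}
child = {"name": "child", "parent": parent}
parent["children"].append(child)

try:
    json.dumps(parent)
    cycle_error = None
except ValueError as exc:
    # The stdlib encoder detects the loop and raises.
    cycle_error = str(exc)


def break_cycles(node, path=()):
    """Copy `node`, replacing any dict already on the current path.

    Hypothetical helper: it only tracks dicts and assumes each has a "name".
    """
    if isinstance(node, dict):
        if any(node is seen for seen in path):
            return {"$ref": node.get("name")}
        path = path + (node,)
        return {key: break_cycles(value, path) for key, value in node.items()}
    if isinstance(node, list):
        return [break_cycles(item, path) for item in node]
    return node


safe = json.dumps(break_cycles(parent), sort_keys=True)
```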

[+] mafro|9 years ago|reply
Please elaborate on the reasons for your opinion :)