So stuff like this or marshmallow is more for cases where you have some database / ORM objects and you want to serialize them out to a JSON object, or you want to process form/POST data into a well-structured JSON or database object.
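For instance, here is a minimal stdlib-only sketch of that round trip (the `User` class and the field check are made up for illustration, not any particular library's API):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class User:
    id: int
    name: str
    email: str

# Serialize an ORM-like object out to JSON...
user = User(id=1, name="Ada", email="ada@example.com")
payload = json.dumps(asdict(user))

# ...and marshal incoming POST data back into a typed object,
# doing a little validation on the way in.
def load_user(raw: str) -> User:
    data = json.loads(raw)
    if not isinstance(data.get("id"), int):
        raise ValueError("id must be an integer")
    return User(**data)

restored = load_user(payload)
```

Schema libraries exist to scale this pattern up: declare the fields once, get both directions plus validation for free.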
One of the things we felt very strongly about when developing Kim was that simple things should be simple and complex things should be possible. To that end, the Pipeline system behind the Field objects really does allow anything to be achieved, whether that's producing values from composite fields or handling unique or non-standard data types.
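As a rough illustration of the idea (a hypothetical sketch, not Kim's actual Pipeline API), a field pipeline can be modelled as a chain of small functions the value flows through:

```python
# Hypothetical field "pipeline": the value passes through each stage in order.
def pipeline(*stages):
    def run(value):
        for stage in stages:
            value = stage(value)
        return value
    return run

# A composite field built from two attributes, then normalised.
serialize_full_name = pipeline(
    lambda obj: f"{obj['first_name']} {obj['last_name']}",  # compose
    str.strip,                                              # clean up
    str.title,                                              # normalise
)

result = serialize_full_name({"first_name": "ada", "last_name": "lovelace"})
```

Because each stage is just a callable, unusual data types only need one extra stage rather than a whole custom field class.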
This is brilliant, exactly what I was looking for. I did a profile recently on some API calls and found that 40-50% of the time was being spent on serialization with marshmallow, which I'm looking to drop.
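That kind of measurement is reproducible with the stdlib alone; here is a hedged sketch (the payload is made up) of profiling a serialization-heavy code path:

```python
import cProfile
import io
import json
import pstats

# Hypothetical payload standing in for an API response body.
records = [{"id": i, "name": f"user-{i}", "tags": ["a", "b"]} for i in range(10_000)]

profiler = cProfile.Profile()
profiler.enable()
body = json.dumps(records)  # the serialization step under suspicion
profiler.disable()

stats = pstats.Stats(profiler, stream=io.StringIO())
stats.sort_stats("cumulative")
# stats.print_stats(5) would list the top functions by cumulative time,
# showing what fraction the encoder actually accounts for.
```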
Just a minor note: It seems you don't mention anywhere what those numbers actually mean. I'm assuming they are seconds, but I can't know for certain, which makes it really unclear if Kim is the fastest or the slowest.
Nice, but I recommend closing the issues at https://github.com/mikeywaites/kim/issues that already have fixes (some of them show 'merge'). It's one thing I, as a user, look at when choosing whether to adopt a project.
sandGorgon | 9 years ago
This stuff is really all over the place - PMML, Arrow, Dill, pickle.
Some stuff won't work with one or the other. I would actually pay for consistency over performance.
There are way too many primitive serialization libraries. Surprisingly, there are none for the higher-order ML stuff.
Given the kind of people behind Arrow, I would love a wrapper that uses Arrow to do all of this... but it doesn't matter at the end of the day.
makmanalp | 9 years ago
For your use case, it's more about large amounts of tabular data and efficient (binary / columnar / compressed) serialization and queryability. I'd say the de facto standard for that is HDF5, which PyTables supports (http://www.pytables.org/). This is what pandas uses under the hood, and I've been using it with hundreds of millions of rows with no problem.
Arrow is somewhat different - it's a specification for the in-memory layout of data that enables faster computation. This is more about what happens when you have data in memory and want to use it with another tool - serializing / deserializing and munging formats is a waste of time if tools can standardize how they store dataframes in memory and work on each other's tables. As far as I understand, Feather is not an implementation of Arrow (that would be up to the processing tools like pandas), but a way of saving and loading that in-memory format to and from disk efficiently and interoperably. (https://github.com/wesm/feather)
Also of note is Parquet, which has similar goals to HDF5 and Feather; the Continuum / Dask people have been working on a wrapper for it called fastparquet (https://github.com/dask/fastparquet). In my experience it has a few hitches right now but works darn well, and gives me better performance than HDF5. It is also one of the Hadoop ecosystem's de facto standards for storage formats, which again is good for interop.
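To see why columnar layouts pay off, here is a toy stdlib-only sketch - an illustration of the idea only, not the actual HDF5 or Parquet on-disk format:

```python
import struct
import zlib

# Row-oriented data...
rows = [(1, 2.5), (2, 3.5), (3, 4.5)]

# ...stored column by column: each column packs into one homogeneous
# binary buffer, which compresses and scans far better than interleaving
# heterogeneous values row by row.
ids = struct.pack(f"{len(rows)}i", *(r[0] for r in rows))
values = struct.pack(f"{len(rows)}d", *(r[1] for r in rows))
compressed = zlib.compress(ids + values)

# Reading one column back never has to decode the other column's bytes.
decoded_ids = struct.unpack(f"{len(rows)}i", zlib.decompress(compressed)[:len(ids)])
```

Real columnar formats add chunking, indexing, and per-column codecs on top of this same principle.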
fnord123 | 9 years ago
http://wesmckinney.com/blog/outlook-for-2017/
> Given the kind of people behind Arrow, I would love a wrapper that uses Arrow to do all of this... but it doesn't matter at the end of the day.
pyarrow; pyarrow.parquet (which uses parquet-cpp).
mhneu | 9 years ago
A good serialization library should serialize any object.
Compare to Matlab: save(f, 'anyobject'); anyobject = load(f). Python is terrible at this, and it limits Python's use in real data-analysis environments and its ability to compete with Matlab.
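Python's closest analogue is pickle, which round-trips most objects (the `Model` class below is a made-up stand-in), albeit with well-known caveats around security, code identity, and cross-version stability:

```python
import io
import pickle

# Any object graph, including instances of user-defined classes.
class Model:
    def __init__(self, weights):
        self.weights = weights

# save(f, 'anyobject')
f = io.BytesIO()
pickle.dump({"model": Model([0.1, 0.2]), "meta": {"epochs": 5}}, f)

# anyobject = load(f)
f.seek(0)
anyobject = pickle.load(f)
```

Unlike Matlab's save/load, though, pickle is unsafe on untrusted input and ties the payload to the defining module's import path.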
mikeywaites | 9 years ago
It would be great if you could share some of the ways you specifically need serialization to work for something like pandas, or better yet, some ways existing solutions don't work with pandas. We've had some pretty unique requirements ourselves and have not found any blockers yet.
Thanks for the message.
limdauto | 9 years ago
Disclaimer: Posting this comment because my colleague pointed out that I could get some points.
_e | 9 years ago
Dowwie | 9 years ago
In the case of serialization libraries, unless you are validating as part of your (de)serialization, I'd recommend avoiding schema-driven serialization libraries. These Kim-like libraries, such as marshmallow, introduce quite a bit of overhead. If validation isn't required and performance matters, I recommend choosing a lighter-weight serialization/marshalling alternative, such as the one provided by asphalt-serialization: https://github.com/asphalt-framework/asphalt-serialization
asphalt-serialization supports CBOR, msgpack, JSON, and more, and is easy to wire up.
This recommendation is based on my own experience using Marshmallow for Yosai, analyzing its performance and then refactoring to a ported version of asphalt-serialization.
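The overhead argument can be sketched with the stdlib alone; the "schema-driven" path below is a deliberately simplified stand-in (real libraries like marshmallow do far more per field):

```python
import json
import timeit

record = {"id": 1, "name": "Ada", "email": "ada@example.com"}

def plain():
    # Schema-free: hand the dict straight to the encoder.
    return json.dumps(record)

SCHEMA = {"id": int, "name": str, "email": str}

def validated():
    # Toy "schema-driven" path: per-field type checks before encoding.
    for field, typ in SCHEMA.items():
        if not isinstance(record[field], typ):
            raise TypeError(f"{field} must be {typ.__name__}")
    return json.dumps(record)

t_plain = timeit.timeit(plain, number=50_000)
t_schema = timeit.timeit(validated, number=50_000)
# The validated path does strictly more work per record; when the data
# is already trusted, that work buys nothing.
```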
mikeywaites | 9 years ago
That's a great point and an important distinction to make. As I mentioned in some of the other comments, we have certainly been focussed on features over performance so far but we are actively working on dramatically improving the performance of Kim.
I guess it's always important to pick the right tool for the job. Thanks for sharing the link to asphalt too - I'd not seen that before.
voidfiles | 9 years ago
source: https://voidfiles.github.io/python-serialization-benchmark/
makmanalp | 9 years ago
I'll be doing this stuff for myself, but would you be interested in having:
a) Support for lima: https://lima.readthedocs.io/en/latest/
b) more benchmark cases (serializing a larger list of objects)
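Case (b) might look something like this stdlib sketch (the object shapes are made up), scaling the collection size to see how serialization cost grows:

```python
import json
import timeit

# Hypothetical benchmark case: serialize progressively larger lists
# of objects and watch how cost scales with collection size.
def make_objects(n):
    return [{"id": i, "name": f"obj-{i}"} for i in range(n)]

for n in (10, 1_000, 10_000):
    objs = make_objects(n)
    t = timeit.timeit(lambda: json.dumps(objs), number=10)
    print(f"{n:>6} objects: {t:.4f}s for 10 runs")
```

Large-list cases matter because per-object overhead that is invisible at n=1 dominates at n=10,000.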
RussianCow | 9 years ago
mikeywaites | 9 years ago
This is a great start for us understanding where we need to get to! We've got some work to do :)
yeukhon | 9 years ago
mikeywaites | 9 years ago
sakawa | 9 years ago
[1]: https://github.com/marshmallow-code/marshmallow/
jackqu7 | 9 years ago
We started writing Kim around the same time the Marshmallow project began, as we found it wasn't suitable for our needs at that time, though it has come a long way since then.
They are very similar projects with similar functionality, but Kim focuses on making it relatively simple to do unusual or 'advanced' things.
For example, Kim supports polymorphism out of the box: if you have an AnimalMapper subclassed by a CatMapper and a DogMapper, passing a Cat and a Dog to AnimalMapper.many.serialize() will automatically do the right thing, in a similar way to SQLAlchemy polymorphism.
We also support complex requirements such as nesting an object inside itself (useful when your JSON representation is nested but your DB representation is flat), serialising multiple object fields to a single JSON field (e.g. full_name consisting of obj.first_name and obj.last_name), a range of security models for marshalling nested objects, and a fairly extensible roles system.
In general we've followed the philosophy "Simple things should be simple. Complex things should be possible."
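To make the polymorphism idea concrete, here is a plain-Python sketch of type-based dispatch - illustrative only, the class and mapper names are not Kim's actual API:

```python
# Hypothetical mapper dispatch: pick a serializer by each object's
# concrete class, so mixed collections "just work".
class Animal:
    pass

class Cat(Animal):
    sound = "miaow"

class Dog(Animal):
    sound = "woof"

MAPPERS = {
    Cat: lambda c: {"type": "cat", "sound": c.sound},
    Dog: lambda d: {"type": "dog", "sound": d.sound},
}

def serialize_many(animals):
    # Dispatch on the concrete type, like polymorphic identity in an ORM.
    return [MAPPERS[type(a)](a) for a in animals]

out = serialize_many([Cat(), Dog()])
```

A library offering this "out of the box" essentially maintains the type-to-mapper registry for you via subclassing.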
tinnet | 9 years ago
sametmax | 9 years ago
siddhant | 9 years ago
mikeywaites | 9 years ago
We've not really dug into performance yet, though in the last patch (1.0.2) we gained a 10% speed-up by removing an erroneous try/except block.
We've really focussed on features initially, and performance is something we're actively researching now. Perhaps we can get some initial benchmarks together and share them with you this week. They will no doubt be useful as we start to plan a release focussed on speed-ups.
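For context on why a stray try/except can cost that much: exception-driven control flow is cheap when nothing raises but expensive on every miss. A hedged stdlib sketch (not Kim's actual change):

```python
import timeit

data = {"name": "Ada"}

def via_exception():
    # Raising and catching on every miss is the expensive path.
    try:
        return data["missing"]
    except KeyError:
        return None

def via_check():
    # A direct lookup with a default avoids the exception machinery.
    return data.get("missing")

t_exc = timeit.timeit(via_exception, number=100_000)
t_get = timeit.timeit(via_check, number=100_000)
# When misses are common, the exception path's overhead shows up
# clearly in a profile of a tight serialization loop.
```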
Thanks for reaching out!
amelius | 9 years ago
mikeywaites | 9 years ago
Thanks for the message. Gonna be honest, I'm not sure what you mean by cycles. Can you elaborate a bit?
ziikutv | 9 years ago
Silly question, what happens with Unicode?
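For the stdlib `json` module, at least, the answer depends on `ensure_ascii`:

```python
import json

# Non-ASCII text is escaped by default...
escaped = json.dumps({"name": "héllo"})                    # {"name": "h\u00e9llo"}
# ...or kept as literal UTF-8 text when asked:
literal = json.dumps({"name": "héllo"}, ensure_ascii=False)

# Either way, the value round-trips losslessly.
assert json.loads(escaped)["name"] == "héllo"
```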
rat87 | 9 years ago
unknown | 9 years ago
[deleted]
ff7c11 | 9 years ago
BuuQu9hu | 9 years ago
This fundamentally doesn't offer much advantage over a .toJSON() instance method and a .fromJSON() class method.
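The pattern being described is roughly this (names are illustrative):

```python
import json

class Point:
    """Each class owns its own (de)serialization - no schema library."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def to_json(self):
        return json.dumps({"x": self.x, "y": self.y})

    @classmethod
    def from_json(cls, raw):
        return cls(**json.loads(raw))

p = Point.from_json(Point(1, 2).to_json())
```

Schema libraries earn their keep only when they add something beyond this: validation, nesting, roles, or polymorphism.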
Don't say "security-focused" if you can't handle cyclic object graphs.
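The cycle problem is easy to demonstrate with the stdlib: `json` rejects cyclic object graphs outright, while `pickle` tracks object identity and handles them:

```python
import json
import pickle

# A self-referencing structure:
graph = {"name": "root"}
graph["self"] = graph

# json refuses cyclic graphs...
try:
    json.dumps(graph)
    cycle_rejected = False
except ValueError:
    cycle_rejected = True  # "Circular reference detected"

# ...while pickle memoizes objects by identity, so the cycle survives
# a round trip with the self-reference intact.
restored = pickle.loads(pickle.dumps(graph))
```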
mafro | 9 years ago