Don't Pickle Your Data

[+] AussieWog93|3 years ago|reply

From the conclusion of the article: >Pickle on the other hand is slow, insecure, and can be only parsed in Python. The only real advantage to pickle is that it can serialize arbitrary Python objects

ie, a bunch of drawbacks that don't really matter at all for the average home-made Python script, plus the "minor" advantage of being able to pickle literally anything and have it "just work".

None of the other options out there let you build a foolproof "save button" in 3 lines of code.
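
For reference, a minimal sketch of that "save button", assuming the state is an ordinary picklable object and the file is one you wrote yourself. The `state` contents and the temporary path are made up for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical application state; any picklable object works here.
state = {"user": "alice", "scores": [1, 2, 3], "point": (1.5, 2.5)}

path = Path(tempfile.mkdtemp()) / "state.pkl"

# The "save button": dump the whole object graph to disk.
with open(path, "wb") as f:
    pickle.dump(state, f)

# The "load button": only reasonable for files you produced yourself.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

The three lines that matter are the `open`, the `dump`, and the `load`; everything else is setup.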

[+] henrydark|3 years ago|reply

I'm sure that most python developers who have worked with pickle for more than 3 lines of code can confirm that pickle does not in fact just work

[+] bobbylarrybobby|3 years ago|reply

The real question is why doesn't Python have something like a class decorator `@json.interchangeable` that you can apply to a class -- maybe dataclasses only? -- to have json (de)serialization be only three lines of code (or less).
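
Nothing by that name exists in the standard library, but a rough sketch of the idea is short. `json_interchangeable` below is a hypothetical helper, and it assumes a flat dataclass whose fields are all plain JSON types (no nesting, no datetimes).

```python
import json
from dataclasses import asdict, dataclass, is_dataclass

def json_interchangeable(cls):
    """Hypothetical decorator: adds to_json/from_json to a flat dataclass."""
    if not is_dataclass(cls):
        raise TypeError(f"{cls.__name__} must be a dataclass")

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(klass, text: str):
        # Only handles fields whose values are plain JSON types.
        return klass(**json.loads(text))

    cls.to_json = to_json
    cls.from_json = from_json
    return cls

@json_interchangeable
@dataclass
class Point:
    x: float
    y: float
    label: str = ""

p = Point(1.0, 2.0, "origin-ish")
text = p.to_json()
roundtrip = Point.from_json(text)
```

Nested dataclasses, enums, and dates are where this stops being three lines, which is roughly the gap libraries like pydantic fill.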

[+] thraxil|3 years ago|reply

The problem is that "the average home-made Python script" frequently ends up turning into a critical production system.

[+] meatmanek|3 years ago|reply

They forgot another major problem: You can only reliably unpickle data using the same (or same-enough) code that pickled it. If your class definitions have changed or moved around, unpickling can break.
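
A small sketch of why that happens: the pickle stores the class's module and name, not its code. The module name `shapes_v1` and the `Point` class here are invented stand-ins for "the code that pickled it".

```python
import pickle
import sys
import types

# Stand-in for the original code: a throwaway module named shapes_v1
# (hypothetical) that contains a Point class.
shapes_v1 = types.ModuleType("shapes_v1")
Point = type("Point", (), {"__module__": "shapes_v1"})
shapes_v1.Point = Point
sys.modules["shapes_v1"] = shapes_v1

p = Point()
p.x, p.y = 1, 2
payload = pickle.dumps(p)

# The payload names the class by module and attribute; it carries no code.
stores_names = b"shapes_v1" in payload and b"Point" in payload

# Simulate a refactor that renames or moves the class.
del shapes_v1.Point

try:
    pickle.loads(payload)
    failure = None
except (AttributeError, ImportError) as exc:
    failure = exc
```

After the `del`, loading fails because the unpickler can no longer look up `shapes_v1.Point`, even though the bytes on disk are unchanged.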

[+] kangalioo|3 years ago|reply

Reminds me of C's dumping structs to disk via memcpy

[+] philsnow|3 years ago|reply

I ran into a bug in production unpickling some builtins (dicts or sets or something) that were pickled in 2.2 and unpickled in 2.4 (or 2.4 -> 2.6, it's fuzzy).

Between those two versions, the exposed 'dunder' methods of whichever builtin changed, and this resulted in unpickled dicts being empty, IIRC.

[+] Spivak|3 years ago|reply

If you’re hydrating objects from your JSON you hit the same thing so it’s not as much of a downside as you might think.

In fact this is what the benchmark does.

[+] jleahy|3 years ago|reply

Should be (2014).

More interestingly, as much as numpy and everybody advises against it, I believe that pickling data into a zstd stream is one of the fastest ways of storing sets of large matrices.

The 'recommended' alternatives include numpy.save (uncompressed, which is bad when lz4 is faster than memcpy and you're saving to disk), numpy.savez (uncompressed zip files, even worse), numpy.savez_compressed (zlib zip, awful), hdf5 (one of the world's worst formats and also using zlib), etc. I wish it wasn't the case, but it certainly seems like a good argument for pickle.

[+] a-dub|3 years ago|reply

even though all the metadata is weird and overengineered, i would probably still use hdf5 as it provides for interop with other numerical computing environments (matlab, julia).

also hdf5 is at least securable. pickle streams are not designed for that. it's good to be able to send your data to others.

fwiw. matlab .mat files are hdf5 at their core.

i should also note that json is pretty bad for numerical data. the specification says nothing about how much precision to retain and printf/scanf is ridiculously slow for storing floats.

[+] mr_toad|3 years ago|reply

> Should be (2014).

I was wondering why it didn’t mention Apache Arrow.

[+] RhysU|3 years ago|reply

Why is mmaping out of the running?

[+] chaxor|3 years ago|reply

Last time I checked (i.e. performed several benchmarks upon), parquet with Zstd was about the best way to store compressed data for really fast and small files.

Zstd is quite good, and is now (iirc) in the linux kernel.

People may have some issue with parquet being column based, which can make inserts a little slower for example, but for a large mostly-set database it is a very good choice. A tsv.zst file could be another way to go as well. But like others, I really wish hdf5 had some of these features of compression and wasn't so dang slow.

[+] mistrial9|3 years ago|reply

linux 5.15.0-25-generic on ubuntu 22.04 shows

    $lsmod | grep zstd  
    zstd_compress         229376  1 btrfs

[+] solarkraft|3 years ago|reply

> Pickle is slow

... Python is slow. But "slow" means "plenty fast" nowadays and the development speed advantage is immense.

> unpickling malicious data can cause security issues

Why would I do that?

I can't read the linked page because it seems to be down/the link is broken, so I don't know whether this includes user data that is present before pickling and then turns out to be an issue after pickling. Then I would worry, otherwise ... yeah, I'm not gonna unpickle random data.

> Just use JSON

How do I effortlessly restore objects including their methods from JSON?

[+] marcosdumay|3 years ago|reply

> How do I effortlessly restore objects including their methods from JSON?

The recommendation from the title is usually made instead of something like "deserializing executable data is harmful". That is exactly the one question where the answer is "don't".

It's not exactly the unpickling process that is the problem. It's how you established that the data isn't malicious. It is very hard to use pickle without creating some local privilege escalation possibilities. And at the end of the process, you usually don't get any capability that replicating the code on both sides of the communication channel wouldn't give you.

(The problem isn't specific to Python either. There was a time when that kind of functionality was very hyped in both industry and academia. For example, Java also got something similar that they had to retract. The famous GNU Hurd OS (the one that would never finish) was supposed to do that on the system level.)

[+] vore|3 years ago|reply

One thing that's not mentioned is that pickled data is effectively fossilized once you've pickled it. If you want to change the layout of a class and have objects unpickle correctly, it can be an ordeal, as objects are unpickled by their class name, and you need both the original class and the new class to correctly unpickle and migrate.

If you instead selectively pick what you want to serialize about your data and keep the representations separate, you can change the internal model easily without having a huge impact on the serialized model.
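
As a hedged sketch of that separation: the internal class and the stored record are different things, and the record carries a version so old data can be migrated on load. The `User` model, the field names, and the version 1 layout are all invented for the example.

```python
import json

class User:
    """Internal model; free to change shape between releases."""
    def __init__(self, full_name, email):
        self.full_name = full_name
        self.email = email

def user_to_record(user):
    # The serialized form is chosen explicitly and carries a schema version.
    return {"version": 2, "full_name": user.full_name, "email": user.email}

def user_from_record(record):
    # Hypothetical older layout (version 1) stored first/last separately
    # and had no "version" key at all.
    if record.get("version", 1) == 1:
        full_name = f"{record['first']} {record['last']}"
    else:
        full_name = record["full_name"]
    return User(full_name, record["email"])

old = json.loads('{"first": "Ada", "last": "Lovelace", "email": "ada@example.com"}')
new = json.loads(json.dumps(user_to_record(User("Grace Hopper", "grace@example.com"))))
```

Only `user_from_record` has to know about the old layout; the `User` class itself never does.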

[+] LtWorf|3 years ago|reply

The benchmark is bad. Because after you load a json you can't really use it. Well to use it you must check lists are lists for real, objects are really objects and have the keys you think they should have and so on.

The alternative is using something like typedload (which I wrote) or pydantic in addition to json load, to avoid cluttering the code with the countless and error prone checks one must do to use untrusted json.

In the end dealing with untrusted json directly is terrible.
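
To give a sense of what those checks look like by hand, here is a stdlib-only sketch for one small, made-up payload shape, `{"name": str, "scores": [int, ...]}`. typedload and pydantic exist so you don't write this for every type.

```python
import json

def parse_scores(text):
    """Validate a payload expected to look like {"name": str, "scores": [int, ...]}."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("top level must be an object")
    name = data.get("name")
    if not isinstance(name, str):
        raise ValueError("'name' must be a string")
    scores = data.get("scores")
    if not isinstance(scores, list):
        raise ValueError("'scores' must be a list")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not all(isinstance(s, int) and not isinstance(s, bool) for s in scores):
        raise ValueError("'scores' must contain only integers")
    return name, scores
```

That is four checks for two fields, and a fair benchmark against pickle would include their cost.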

[+] IshKebab|3 years ago|reply

> But "slow" means "plenty fast" nowadays

Not in my experience. "Slow" means "it seems fast enough now and I'm sure we'll have time to rewrite it in a fast language once it's grown to a monster that processes 1000 times the data it does now... right?".

> Why would I do that?

Because you are using someone else's code and make the fairly reasonable assumption that deserialising data doesn't cause arbitrary code execution... But of course it's all your fault because you didn't read their code to see that it's using Pickle!

> How do I effortlessly restore objects including their methods from JSON?

You don't. You shouldn't.

[+] cratermoon|3 years ago|reply

>> unpickling malicious data can cause security issues

> Why would I do that?

If you pickle data from an untrusted source, say a web form submission, and then later unpickle it. See https://cwe.mitre.org/data/definitions/502.html
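
The mechanism behind that CWE is easy to show with a deliberately harmless payload. In this sketch the `Payload` class is invented, and `eval` on a fixed arithmetic string stands in for whatever an attacker would actually call.

```python
import pickle

class Payload:
    # __reduce__ tells pickle how to rebuild the object: "call this
    # callable with these arguments". Nothing restricts which callable.
    def __reduce__(self):
        return (eval, ("6 * 7",))   # harmless here; could be any expression

blob = pickle.dumps(Payload())

# The receiving side never needs the Payload class; loading runs eval().
result = pickle.loads(blob)
```

`pickle.loads` hands back whatever the callable returned rather than a `Payload` object, so the only thing an attacker needs is for their bytes to reach a `loads` call.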

[+] TremendousJudge|3 years ago|reply

There's also the much faster cPickle. It may just be fast enough for your needs. If it isn't, then you start exploring other options.

[+] NotTameAntelope|3 years ago|reply

Instantiate a new object of the class with the JSON as arguments, is one way.

I’ve built a bunch of these systems, keeping your data separate solves a lot of future problems.

[+] ris|3 years ago|reply

Don't Assume Things About Others' Use Cases.

In cases where I'm doing some sort of interactive or exploratory data analysis with structures of complex python objects and want to stash a copy of what I'm working with in case the next thing I do screws them up or, who knows, I lose power - being able to quickly pickle something and have an amount of confidence I'll be able to get it back in a sensible state is very useful.

I've also used it for debug dumps in experimental software so I have a chance of reproducing odd cases it comes across.

[+] hansvm|3 years ago|reply

I made a simple library for just such a purpose if you're interested. You can wrap a whole module (like requests or pandas) and cache every function/coroutine result to disk. https://github.com/hmusgrave/ememo

I mainly use it for web scraping to be polite while I figure out the remote API, but I'm sure somebody could have another use.

[+] northisup|3 years ago|reply

Who is out there using pickle because they think it is a good idea? We use it because it is easy and built into the language and handles datetime by default!

[+] WorldMaker|3 years ago|reply

Good thing JSON is in the standard library now too.
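
The datetime point still stands for the stdlib `json` module, though the workaround is small. A sketch, assuming you are happy to store timestamps as ISO 8601 strings (a common convention, not something JSON itself defines):

```python
import json
import pickle
from datetime import datetime

stamp = {"created": datetime(2014, 5, 1, 12, 30)}

# pickle round-trips datetime with no extra work.
same = pickle.loads(pickle.dumps(stamp)) == stamp

# json refuses unknown types unless given a fallback.
try:
    json.dumps(stamp)
    json_error = None
except TypeError as exc:
    json_error = exc

# Fallback via the default= hook: encode as an ISO 8601 string.
text = json.dumps(stamp, default=lambda o: o.isoformat())
restored = datetime.fromisoformat(json.loads(text)["created"])
```

The catch is on the way back in: the loader has to know which strings are dates, because the JSON itself no longer says so.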

[+] ridiculous_fish|3 years ago|reply

What are some alternatives to pickling which can handle cyclic references?

I've looked into ORMs but these are invasive in terms of needing to annotate your classes and fields.

[+] WorldMaker|3 years ago|reply

There are several approaches to references in JSON. A common Python library I found via StackOverflow mentions is: https://pypi.org/project/jsonref/ (it supports automatic dereferencing at load time, but dumping references is still a slight challenge).
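
For a feel of the problem without any library, here is a stdlib-only sketch: pickle keeps the cycle, plain `json.dumps` rejects it, and one hand-rolled workaround is to store nodes in a table and refer to them by key. The node layout and the `"peer"` field are made up for the example.

```python
import json
import pickle

a = {"name": "a"}
b = {"name": "b", "peer": a}
a["peer"] = b          # a -> b -> a

# pickle memoizes objects, so the cycle survives the round trip.
a2 = pickle.loads(pickle.dumps(a))
cycle_ok = a2["peer"]["peer"] is a2

# json detects the cycle and gives up.
try:
    json.dumps(a)
    json_error = None
except ValueError as exc:
    json_error = exc

# Hand-rolled workaround: store nodes in a table and refer to them by key.
table = {"a": {"name": "a", "peer": "b"}, "b": {"name": "b", "peer": "a"}}
nodes = json.loads(json.dumps(table))
for node in nodes.values():
    node["peer"] = nodes[node["peer"]]   # resolve keys back into references
```

The table approach works, but it is exactly the kind of per-field annotation the question was hoping to avoid.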

[+] UncleEntity|3 years ago|reply

If you make your own C extensions then you certainly have to write code to be able to pickle your classes.

I did it once, don’t remember why, and it wasn’t that hard but I can imagine it would quickly get out of hand if you were changing class structures on a regular basis.

If you were just wrapping some library with simple C++ classes or something it also probably wouldn’t be that hard to automatically generate the pickling code.

[+] jessikat|3 years ago|reply

JSON really is a terrible serialization format. Even JavaScript can't safely deserialize JSON without silent data corruption. I've had to stringify numbers because of JavaScript, and there were no errors. Perhaps that's the fault of JavaScript, but I find the lack of encoding the numerical storage type to be a bug rather than a feature.

[+] windows_sucks|3 years ago|reply

would love to see an example of the data corruption you're talking about
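
One well-known case is integers above 2**53. The sketch below stays in Python and uses `parse_int=float` to emulate a parser that stores every number as an IEEE 754 double, which is what JavaScript's `JSON.parse` does; the `"id"` field is made up.

```python
import json

big = 2**53 + 1                      # 9007199254740993
text = json.dumps({"id": big})

# Python keeps arbitrary-precision ints, so this round-trips exactly.
exact = json.loads(text)["id"]

# Emulate a parser that stores every number as an IEEE 754 double.
as_double = json.loads(text, parse_int=float)["id"]
off_by_one = int(as_double) == big - 1
```

No error is raised anywhere; the value just comes back one lower, which is why 64-bit IDs tend to get sent as strings.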

[+] sidewndr46|3 years ago|reply

The author is mostly correct, except about one thing: pickle can be read from Golang code.

I wrote a library for this years ago: https://github.com/hydrogen18/stalecucumber

[+] 0xbadcafebee|3 years ago|reply

I would rather use YAML than JSON, if only for the schema tags that let you customize the processor to load data into a custom data structure automatically. Saves time and lets you represent complex data structures in a minimal way. Use ruamel.yaml rather than PyYAML.

[+] cratermoon|3 years ago|reply

Much the same can be leveled against Java's serialized objects. The OWASP top 10 from 2017 even had "Insecure Deserialization" at #8. The 2021 update[1] changes it to "Software and Data Integrity Failures", still at #8. It's CWE-502: Deserialization of Untrusted Data[2], where Python and Java are specifically mentioned.

1 https://owasp.org/www-project-top-ten/

2 https://cwe.mitre.org/data/definitions/502.html

[+] marginalia_nu|3 years ago|reply

Does anyone actually use the Java object serialization API for code written this side of 2010? Feels like a vestigial feature that's not made sense for a long time, like to the point where we've already discarded another bad serialization format (XML) before looking at JSON or YAML now.

[+] ksaj|3 years ago|reply

It is worth noting that this article was published in 2014. The accuracy of its comparisons may have changed in the past 8.5 years.

[+] solarkraft|3 years ago|reply

(2014)

[+] ohiovr|3 years ago|reply

I found unpickling a lot slower than json loading.

[+] LtWorf|3 years ago|reply

But then you have to check that the "list" is really a list, that the objects do have the keys, that the strings are strings.

This should be factored in the cost, and it wasn't in the benchmark.

[+] nomel|3 years ago|reply

I've found it to be much faster, with large amounts of data, like numpy arrays. And, some things aren't possible to convert to JSON, without writing a bunch of code to do the serialization/deserialization, which often makes things slow again.

76 comments