From the conclusion of the article:
>Pickle on the other hand is slow, insecure, and can be only parsed in Python. The only real advantage to pickle is that it can serialize arbitrary Python objects
ie, a bunch of drawbacks that don't really matter at all for the average home-made Python script, plus the "minor" advantage of being able to pickle literally anything and have it "just work".
None of the other options out there let you build a foolproof "save button" in 3 lines of code.
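For what it's worth, the "save button" being described looks roughly like this. It's a minimal sketch: the `state` dict and the `savegame.pickle` path are made up for illustration, and it assumes you only ever load files your own program wrote.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical application state; any picklable Python object would do.
state = {"level": 3, "inventory": {"sword", "potion"}, "position": (4.5, -2.0)}

save_path = Path(tempfile.gettempdir()) / "savegame.pickle"  # illustrative location

# The "save button": one call serializes the whole object graph.
save_path.write_bytes(pickle.dumps(state))

# The "load button": only safe for files this same program wrote itself.
restored = pickle.loads(save_path.read_bytes())
```

Note that the set and the tuple come back as a set and a tuple, which plain JSON would not preserve.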
The real question is why Python doesn't have something like a class decorator `@json.interchangeable` that you can apply to a class (maybe dataclasses only?) so that JSON (de)serialization takes three lines of code or fewer.
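Nothing like that exists in the standard library as far as I know, but a rough version can be sketched with `dataclasses` and `json`. Everything here (`json_interchangeable`, `to_json`, `from_json`, the `Settings` class) is a hypothetical name, and the sketch only handles flat dataclasses whose fields are already JSON-friendly.

```python
import dataclasses
import json


def json_interchangeable(cls):
    """Hypothetical decorator: add to_json/from_json to a flat dataclass."""
    if not dataclasses.is_dataclass(cls):
        raise TypeError("json_interchangeable only supports dataclasses")

    def to_json(self) -> str:
        return json.dumps(dataclasses.asdict(self))

    def from_json(klass, text: str):
        data = json.loads(text)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        # Reject keys that are not declared fields instead of silently ignoring them.
        names = {f.name for f in dataclasses.fields(klass)}
        unknown = set(data) - names
        if unknown:
            raise ValueError(f"unexpected fields: {sorted(unknown)}")
        return klass(**data)

    cls.to_json = to_json
    cls.from_json = classmethod(from_json)
    return cls


@json_interchangeable
@dataclasses.dataclass
class Settings:
    name: str
    volume: int = 5


text = Settings("alice", volume=7).to_json()
copy = Settings.from_json(text)
```

Nested dataclasses, datetimes, and sets would need extra handling, which is presumably part of why this hasn't been standardized.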
They forgot another major problem: You can only reliably unpickle data using the same (or same-enough) code that pickled it. If your class definitions have changed or moved around, unpickling can break.
I ran into a bug in production unpickling some builtins (dicts or sets or something) that were pickled in 2.2 and unpickled in 2.4 (or 2.4 -> 2.6, it's fuzzy).
Between those two versions, the exposed 'dunder' methods of whichever builtin changed, and this resulted in unpickled dicts being empty, IIRC.
More interestingly, as much as numpy and everybody advises against it, I believe that pickling data into a zstd stream is one of the fastest ways of storing sets of large matrices.
The 'recommended' alternatives include numpy.save (uncompressed, which is bad when lz4 is faster than memcpy and you're saving to disk), numpy.savez (uncompressed zip files, even worse), numpy.savez_compressed (zlib zip, awful), hdf5 (one of the world's worst formats and also using zlib), etc. I wish it wasn't the case, but it certainly seems like a good argument for pickle.
even though all the metadata is weird and overengineered, i would probably still use hdf5 as it provides for interop with other numerical computing environments (matlab, julia).
also hdf5 is at least securable. pickle streams are not designed for that. it's good to be able to send your data to others.
fwiw. matlab .mat files are hdf5 at their core.
i should also note that json is pretty bad for numerical data. the specification says nothing about how much precision to retain and printf/scanf is ridiculously slow for storing floats.
Last time I checked (i.e. performed several benchmarks upon), parquet with Zstd was about the best way to store compressed data for really fast and small files.
Zstd is quite good, and is now (iirc) in the Linux kernel.
People may have some issue with parquet being column based, which can make inserts a little slower for example, but for a large mostly-set database it is a very good choice. A tsv.zst file could be another way to go as well.
But like others, I really wish hdf5 had some of these compression features and wasn't so dang slow.
... Python is slow. But "slow" means "plenty fast" nowadays and the development speed advantage is immense.
> unpickling malicious data can cause security issues
Why would I do that?
I can't read the linked page because it seems to be down/the link is broken, so I don't know whether this includes user data that is present before pickling and then turns out to be an issue after pickling. Then I would worry, otherwise ... yeah, I'm not gonna unpickle random data.
> Just use JSON
How do I effortlessly restore objects including their methods from JSON?
> How do I effortlessly restore objects including their methods from JSON?
The recommendation from the title is usually made instead of something like "deserializing executable data is harmful". That is exactly the one question where the answer is "don't".
It's not exactly the unpickling process that is the problem. It's how you established that the data isn't malicious. It is very hard to use pickle without creating some local privilege escalation possibilities. And at the end of the process, you usually don't get any capability that replicating the code on both sides of the communication channel wouldn't give you.
(The problem isn't specific to Python either. There was a time when that kind of functionality was very hyped in both industry and academia. For example, Java also got something similar that they had to retract. The famous GNU Hurd OS (the one that would never finish) was supposed to do that on the system level.)
One thing that's not mentioned is that pickled data is effectively fossilized once you've pickled it. If you want to change the layout of a class and have objects unpickle correctly, it can be an ordeal, as objects are unpickled by their class name, and you need both the original class and the new class to correctly unpickle and migrate.
If you instead selectively pick what you want to serialize about your data and keep the representations separate, you can change the internal model easily without having a huge impact on the serialized model.
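A rough sketch of what keeping the representations separate can look like. The `Account` class, its fields, and the two record versions are all invented for illustration; the point is only that the stored record carries a version and the loader knows how to migrate old layouts.

```python
import json


class Account:
    """In-memory model; free to change without touching old files."""

    def __init__(self, owner: str, balance_cents: int):
        self.owner = owner
        self.balance_cents = balance_cents

    def to_record(self) -> dict:
        # Only the fields we choose to persist, plus a format version.
        return {"version": 2, "owner": self.owner, "balance_cents": self.balance_cents}

    @classmethod
    def from_record(cls, record: dict) -> "Account":
        version = record.get("version", 1)
        if version == 1:
            # Hypothetical older layout that stored a float "balance" in dollars.
            return cls(record["owner"], round(record["balance"] * 100))
        if version == 2:
            return cls(record["owner"], record["balance_cents"])
        raise ValueError(f"unsupported record version: {version}")


old_file = json.dumps({"owner": "bob", "balance": 12.5})
migrated = Account.from_record(json.loads(old_file))
round_trip = Account.from_record(json.loads(json.dumps(migrated.to_record())))
```

With pickle, the equivalent migration would need the old class definition to still be importable under its old name.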
The benchmark is bad, because after you load JSON you can't really use it yet. To use it you must check that lists really are lists, that objects really are objects with the keys you think they should have, and so on.
The alternative is using something like typedload (which I wrote) or pydantic in addition to json load, to avoid cluttering the code with the countless and error prone checks one must do to use untrusted json.
In the end dealing with untrusted json directly is terrible.
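To give an idea of the checks in question, here is the hand-written version for one tiny, made-up payload shape (`{"name": str, "tags": [str]}`). The function name `parse_user` is hypothetical; libraries like the ones mentioned above exist to generate this sort of checking from type annotations rather than writing it out by hand.

```python
import json


def parse_user(text: str) -> dict:
    """Hand-written validation for a hypothetical {"name": str, "tags": [str]} payload."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    name = data.get("name")
    if not isinstance(name, str):
        raise ValueError("'name' must be a string")
    tags = data.get("tags", [])
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError("'tags' must be a list of strings")
    return {"name": name, "tags": tags}


good = parse_user('{"name": "carol", "tags": ["admin"]}')

try:
    parse_user('{"name": "carol", "tags": "admin"}')  # a string where a list was expected
    rejected = False
except ValueError:
    rejected = True
```

That is a dozen lines for two fields, and a fair benchmark of "JSON you can actually use" would arguably include that work.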
Not in my experience. "Slow" means "it seems fast enough now and I'm sure we'll have time to rewrite it in a fast language once it's grown to a monster that processes 1000 times the data it does now... right?".
> Why would I do that?
Because you are using someone else's code and make the fairly reasonable assumption that deserialising data doesn't cause arbitrary code execution... But of course it's all your fault because you didn't read their code to see that it's using Pickle!
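For anyone who hasn't seen why loading a pickle amounts to running code, here is a harmless sketch. The `Payload` class stands in for attacker-controlled bytes and only calls `print`, but the callable could be anything importable. The second half follows the "restricting globals" pattern from the pickle documentation; the allow-list shown is an illustrative choice, not a vetted one.

```python
import builtins
import io
import pickle


class Payload:
    """Stand-in for attacker-controlled data: unpickling calls print(...)."""

    def __reduce__(self):
        # pickle stores "call this callable with these args" and replays it on load.
        return (print, ("this ran during pickle.loads",))


blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # prints the message; result is whatever print returned


class RestrictedUnpickler(pickle.Unpickler):
    """Allow-list a few harmless builtins and refuse every other global."""

    SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}

    def find_class(self, module, name):
        if module == "builtins" and name in self.SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


try:
    RestrictedUnpickler(io.BytesIO(blob)).load()
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

The restricted loader helps, but it also gives up the "pickle anything" convenience that was the reason to use pickle in the first place.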
> How do I effortlessly restore objects including their methods from JSON?
In cases where I'm doing some sort of interactive or exploratory data analysis with structures of complex python objects and want to stash a copy of what I'm working with in case the next thing I do screws things up or, who knows, I lose power - being able to quickly pickle something and have an amount of confidence I'll be able to get it back in a sensible state is very useful.
I've also used it for debug dumps in experimental software so I have a chance of reproducing odd cases it comes across.
I made a simple library for just such a purpose if you're interested. You can wrap a whole module (like requests or pandas) and cache every function/coroutine result to disk. https://github.com/hmusgrave/ememo
I mainly use it for web scraping to be polite while I figure out the remote API, but I'm sure somebody could have another use.
Who is out there using pickle because they think it is a good idea?
We use it because it is easy and built into the language and handles datetime by default!
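The datetime point is easy to reproduce. A small sketch, where the `event` dict and the `encode_extra` helper are invented names: plain `json.dumps` refuses datetimes, so you end up supplying a `default` hook and converting back by hand on load.

```python
import json
from datetime import datetime, timezone

event = {"name": "deploy", "at": datetime(2022, 7, 1, 12, 30, tzinfo=timezone.utc)}

try:
    json.dumps(event)
    plain_dump_works = True
except TypeError:
    plain_dump_works = False  # json has no built-in datetime encoding


def encode_extra(obj):
    """Fallback for json.dumps: ISO 8601 strings for datetimes."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"cannot serialize {type(obj).__name__}")


text = json.dumps(event, default=encode_extra)
loaded = json.loads(text)
# The reader has to know which fields hold dates; JSON itself does not say.
restored_at = datetime.fromisoformat(loaded["at"])
```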
There are several approaches to references in JSON. A common Python library I found via StackOverflow mentions is: https://pypi.org/project/jsonref/ (it supports automatic dereferencing at load time, but dumping references is still a slight challenge).
If you make your own C extensions then you certainly have to write code to be able to pickle your classes.
I did it once, don’t remember why, and it wasn’t that hard but I can imagine it would quickly get out of hand if you were changing class structures on a regular basis.
If you were just wrapping some library with simple C++ classes or something it also probably wouldn’t be that hard to automatically generate the pickling code.
JSON really is a terrible serialization format. Even JavaScript can't safely deserialize JSON without silent data corruption. I've had to stringify numbers because of JavaScript, and there were no errors. Perhaps that's the fault of JavaScript, but I find the lack of encoding the numerical storage type to be a bug rather than a feature.
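The silent corruption can be imitated from Python. This sketch assumes the usual situation where the consumer stores every JSON number as an IEEE 754 double, which is how JavaScript's `Number` behaves; `parse_int=float` is used here only to mimic that, and `big_id` is a made-up value.

```python
import json

# 9007199254740993 lies just beyond the range where doubles represent every integer exactly.
big_id = 2**53 + 1

# Python's json module keeps integers exact.
exact = json.loads(json.dumps({"id": big_id}))["id"]

# Mimic a parser that stores every number as a double: the value is rounded, with no error.
as_double = json.loads(json.dumps({"id": big_id}), parse_int=float)["id"]

# The usual workaround: send large integers as strings.
as_string = int(json.loads(json.dumps({"id": str(big_id)}))["id"])
```

Here `as_double` ends up equal to `2**53` rather than `big_id`, which is exactly the kind of off-by-one that never raises an error.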
I would rather use YAML than JSON, if only for the schema tags that let you customize the processor to load data into a custom data structure automatically. Saves time and lets you represent complex data structures in a minimal way. Use ruamel.yaml rather than PyYAML.
Much the same can be leveled against Java's serialized objects. The OWASP top 10 from 2017 even had "Insecure Deserialization" at #8. The 2021 update[1] changes it to "Software and Data Integrity Failures", still at #8. It's CWE-502: Deserialization of Untrusted Data[2], where Python and Java are specifically mentioned.
Does anyone actually use the Java object serialization API for code written this side of 2010? Feels like a vestigial feature that's not made sense for a long time, like to the point where we've already discarded another bad serialization format (XML) before looking at JSON or YAML now.
I've found it to be much faster, with large amounts of data, like numpy arrays. And, some things aren't possible to convert to JSON, without writing a bunch of code to do the serialization/deserialization, which often makes things slow again.
AussieWog93|3 years ago
henrydark|3 years ago
bobbylarrybobby|3 years ago
thraxil|3 years ago
meatmanek|3 years ago
kangalioo|3 years ago
philsnow|3 years ago
Spivak|3 years ago
In fact this is what the benchmark does.
jleahy|3 years ago
a-dub|3 years ago
mr_toad|3 years ago
I was wondering why it didn’t mention Apache Arrow.
RhysU|3 years ago
chaxor|3 years ago
mistrial9|3 years ago
solarkraft|3 years ago
marcosdumay|3 years ago
vore|3 years ago
LtWorf|3 years ago
IshKebab|3 years ago
> How do I effortlessly restore objects including their methods from JSON?
You don't. You shouldn't.
cratermoon|3 years ago
> Why would I do that?
If you pickle data from an untrusted source, say a web form submission and then later unpickle it. See https://cwe.mitre.org/data/definitions/502.html
TremendousJudge|3 years ago
NotTameAntelope|3 years ago
I’ve built a bunch of these systems, keeping your data separate solves a lot of future problems.
ris|3 years ago
hansvm|3 years ago
northisup|3 years ago
WorldMaker|3 years ago
ridiculous_fish|3 years ago
I've looked into ORMs but these are invasive in terms of needing to annotate your classes and fields.
WorldMaker|3 years ago
UncleEntity|3 years ago
jessikat|3 years ago
windows_sucks|3 years ago
sidewndr46|3 years ago
I wrote a library for this years ago: https://github.com/hydrogen18/stalecucumber
0xbadcafebee|3 years ago
cratermoon|3 years ago
[1] https://owasp.org/www-project-top-ten/
[2] https://cwe.mitre.org/data/definitions/502.html
marginalia_nu|3 years ago
ksaj|3 years ago
solarkraft|3 years ago
ohiovr|3 years ago
LtWorf|3 years ago
This should be factored in the cost, and it wasn't in the benchmark.
nomel|3 years ago