zahlman|6 months ago
> This has been done in response to the discovery that the popular installer uv has a different extraction behavior to many Python-based installers that use the ZIP parser implementation provided by the zipfile standard library module.
> For maintainers of installer projects: Ensure that your ZIP implementation follows the ZIP standard and checks the Central Directory before proceeding with decompression. See the CPython zipfile module for a ZIP implementation that implements this logic. Begin checking the RECORD file against ZIP contents and erroring or warning the user that the wheel is incorrectly formatted.
Good to know that I won't need to work around any issues with `zipfile`, and it would be rather absurd for any Python-based installer to use anything else to do the decompression. (Checking RECORD for consistency is straightforward, although of course it takes time.)
... but surely uv got its zip-decompression logic from a crate rather than hand-rolling it? How many other Rust projects out there might have questionable handling of zip files?
> PyPI already implements ZIP and tarball compression-bomb detection as a part of upload processing.
... The implication is that `zipfile` doesn't handle this. But perhaps it can't really? Are there valid uses for zips that work that way? (Or maybe there isn't a clear rule for what counts as a "bomb", and PyPI has to choose a threshold value?)
woodruffw|6 months ago
> and it would be rather absurd for any Python-based installer to use anything else to do the decompression.
You'd reasonably think, but it's difficult to assert this: a lot of people use third-party tooling (uv, but also a lot of hand-rolled stuff), and Python packages aren't always processed in a straight-line-from-the-index manner.
(I think a good reference example of this is security scanners: a scanner might fetch a wheel ZIP and analyze it, and use whatever ZIP implementation it pleases.)
It's also worth noting that one of the differentials here concerns the Central Directory, but the other one is more pernicious: the ZIP APPNOTE[1] isn't really clear about how implementations should key from the EOCDR back to the local file entries, and implementations have (reasonably, IMO) interpreted the language differently. Python's zipfile chooses to do it in one way that I think is justifiable, but it's a "true" differential in the sense that there's no golden answer.
> (Or maybe there isn't a clear rule for what counts as a "bomb", and PyPI has to choose a threshold value?)
Yes, it's this. There are legitimate uses for high-ratio archives (e.g. compressed OS images), but Python package distributions are (generally) not one of them. PyPI has its own compression ratio that's intended to be a sweet spot between "that was compressed really well" and "someone is trying to ZIP-bomb the index."
[1]: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
lexicality|6 months ago
well... https://github.com/astral-sh/rs-async-zip
burnt-resistor|6 months ago
Related to multiple .zip formats: I've found macOS Archive Utility sometimes refuses to extract early pkzip .zips created on MS-DOS, yet Info-ZIP handles them just fine.
And the macOS Archive Utility will complain that a proper .tar.bz2 created using bzip2 is "corrupt".
In general, be liberal in input and be conservative in output. Sometimes, this means using fewer features or certain older formats so that all/most things work without issues.
quietbritishjim|6 months ago
> In general, be liberal in input and be conservative in output.
That is a dangerous maxim in a world with malicious players. In fact this PyPI problem is precisely because zip files are being too readily accepted, even if they have ambiguous meaning. Their fix is (very sensibly) to be less liberal with their input.
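To make "less liberal with input" concrete, here is a minimal sketch of a strict wheel check built on `zipfile`. It assumes the standard wheel RECORD layout (`path,sha256=<urlsafe base64 digest without padding>,size`, with an empty hash for RECORD itself). The function name `check_wheel` and the `MAX_RATIO` value are hypothetical and are not PyPI's actual implementation or threshold.

```python
import base64
import csv
import hashlib
import io
import zipfile

MAX_RATIO = 50  # hypothetical threshold, NOT the value PyPI uses


def check_wheel(data: bytes, max_ratio: float = MAX_RATIO) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        infos = zf.infolist()
        names = [info.filename for info in infos]

        # Duplicate names are ambiguous: parsers may pick different entries.
        for name in sorted({n for n in names if names.count(n) > 1}):
            problems.append(f"duplicate entry: {name}")

        # Crude bomb check using the sizes declared in the Central Directory.
        compressed = sum(info.compress_size for info in infos)
        uncompressed = sum(info.file_size for info in infos)
        if compressed and uncompressed / compressed > max_ratio:
            problems.append("compression ratio above threshold")

        record_names = [n for n in names if n.endswith(".dist-info/RECORD")]
        if len(record_names) != 1:
            problems.append("expected exactly one RECORD file")
            return problems

        # Map each path listed in RECORD to its "algo=digest" field.
        text = zf.read(record_names[0]).decode("utf-8")
        recorded = {}
        for row in csv.reader(io.StringIO(text)):
            if row:
                recorded[row[0]] = row[1] if len(row) > 1 else ""

        files = {n for n in names if not n.endswith("/")}
        for name in sorted(files - set(recorded)):
            problems.append(f"not in RECORD: {name}")
        for name in sorted(set(recorded) - files):
            problems.append(f"missing from archive: {name}")

        for name in sorted(files & set(recorded)):
            spec = recorded[name]
            if not spec:
                continue  # RECORD lists itself without a hash
            algo, _, expected = spec.partition("=")
            if algo not in hashlib.algorithms_guaranteed:
                problems.append(f"unsupported hash: {name}")
                continue
            digest = hashlib.new(algo, zf.read(name)).digest()
            actual = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
            if actual != expected:
                problems.append(f"hash mismatch: {name}")
    return problems
```

Something like `check_wheel(open("some.whl", "rb").read())` would return `[]` for a consistent wheel. An installer could then choose to warn or refuse on anything else.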
captn3m0|6 months ago
calebbrown|6 months ago
Of these, Java is the most interesting, as there are a few JDKs commonly in use.
But I’m also interested in various security scanners that are built in other languages that can be fooled.
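As a rough illustration of how a scanner and an installer can disagree about the same bytes, the sketch below glues two small archives together. A Central Directory reader (`zipfile`) only sees the entries of the last archive, while a naive scan for local file header signatures sees both. This is a simplified stand-in, not the specific differential described in the PyPI post, and the helper names are made up for the example.

```python
import io
import struct
import zipfile

LOCAL_HEADER_SIG = b"PK\x03\x04"  # local file header signature


def make_zip(name: str, payload: bytes) -> bytes:
    """Build a one-entry, uncompressed ZIP in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr(name, payload)
    return buf.getvalue()


def central_directory_names(data: bytes) -> list[str]:
    """Names as seen by a parser that trusts the Central Directory."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()


def local_header_names(data: bytes) -> list[str]:
    """Names as seen by a naive scan for local file headers."""
    names = []
    pos = data.find(LOCAL_HEADER_SIG)
    while pos != -1:
        # Name length and extra-field length sit at offset 26 of the header.
        name_len, _extra_len = struct.unpack_from("<HH", data, pos + 26)
        names.append(data[pos + 30 : pos + 30 + name_len].decode("utf-8"))
        pos = data.find(LOCAL_HEADER_SIG, pos + 4)
    return names


# The first archive's entry is not listed in the final Central Directory.
blob = make_zip("hidden.txt", b"payload A") + make_zip("visible.txt", b"payload B")

print(central_directory_names(blob))  # ['visible.txt']
print(local_header_names(blob))       # ['hidden.txt', 'visible.txt']
```

A scanner built on the second strategy and an installer built on the first would be analyzing and installing different file sets.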
jspiner|6 months ago