item 25616513

Python – Create large ZIP archives without memory inflation

103 points | pythonscripts2 | 5 years ago | github.com

27 comments

[+] rahimiali | 5 years ago
I have questions about the code. Why do you need to say int('0x1', 16) and int('0x2', 16)? Why not just write 0x1 and 0x2? Or just plain 1 and 2?

I'm also perplexed by the goal as this seems to just call zipfile.write under the hood, which already streams to a zip file without accumulating a memory buffer?

[0] https://github.com/BuzonIO/zipfly/blob/master/zipfly/zipfly....
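For reference, the three spellings in that question really are interchangeable: `int()` with an explicit base 16 accepts the `0x` prefix, so the conversions in the linked code are just verbose ways of writing small integer literals.

```python
# int(s, 16) accepts a "0x" prefix, so all three spellings denote the same value
assert int('0x1', 16) == 0x1 == 1
assert int('0x2', 16) == 0x2 == 2
```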

[+] mlyle | 5 years ago
I think the appeal is that it's a generator: if you need to push bytes of the zip over some other transport, you can just naturally ask for a few more each time without having to accumulate them in memory.

Of course, by crafting a special file-like object you could avoid this too, but perhaps a bit less elegantly.
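That file-like-object idea can be sketched roughly like this (hypothetical names, not the library's actual API): a small sink object records what `zipfile` writes, and a generator drains it after each write, so only one chunk of output is ever held at a time.

```python
import io
import os
import zipfile

class _ChunkCollector(io.RawIOBase):
    """Minimal write-only file-like sink: remembers written bytes and offset."""
    def __init__(self):
        self._chunks = []
        self._pos = 0
    def writable(self):
        return True
    def write(self, b):
        self._chunks.append(bytes(b))
        self._pos += len(b)
        return len(b)
    def tell(self):
        return self._pos
    def drain(self):
        chunks, self._chunks = self._chunks, []
        return b"".join(chunks)

def zip_generator(paths, chunk_size=16 * 1024):
    """Yield the bytes of a zip archive containing `paths`, piece by piece."""
    sink = _ChunkCollector()
    # The sink is unseekable, so zipfile falls back to data descriptors.
    with zipfile.ZipFile(sink, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            with open(path, "rb") as src, \
                    zf.open(os.path.basename(path), mode="w") as dst:
                while True:
                    data = src.read(chunk_size)
                    if not data:
                        break
                    dst.write(data)
                    yield sink.drain()
    yield sink.drain()  # the central directory, written when the ZipFile closes
```

A consumer can iterate this directly as an HTTP response body; at no point does the whole archive exist in memory.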

[+] userbinator | 5 years ago
I'm a little perplexed by the "marketing" around this --- all the archivers I know of don't require more memory than the compression state (which AFAIK for ZIP/deflate is not much more than a 64k window), since it is natural that files can be larger than available RAM.
[+] icegreentea2 | 5 years ago
I think it's meant for a pretty narrow use case: serving compressed files through frameworks (as mentioned, for example Django or Flask) that expect to serve file objects, but without writing to disk.

The "usual"/naive solution (if you stay within the Python ecosystem) is to compress the files into a BytesIO or other in-memory file-like object, and then have your framework serve it. The naive solution writes the whole archive to memory before serving (thus memory inflation).

This library just looks like a pretty straightforward way to implement the same idea, but with chunking to bound memory usage. At the bottom, it's doing the same thing, but using generators to yield chunks at a time.

It's a useful utility for that context. Nothing groundbreaking; it's something most intermediate-and-above developers could stitch together in probably a few days (especially if they had to brush up on DEFLATE and the generator protocol), but it's nice to not have to.
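The naive solution described above looks roughly like this (`zip_in_memory` is a hypothetical helper name, not from the library): everything is compressed into one BytesIO, which the framework then serves.

```python
import io
import zipfile

def zip_in_memory(files):
    """Naive approach: files is a mapping of archive name -> bytes.
    The entire compressed archive sits in RAM until it is served."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for arcname, data in files.items():
            zf.writestr(arcname, data)
    buf.seek(0)  # rewind so a framework can read it from the start
    return buf
```

This is fine for small payloads; it's only when archives reach hundreds of megabytes that the chunked/generator variant pays off.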

[+] tyingq | 5 years ago
I believe the comparison is just to the bundled zipfile module and BytesIO, which would be the quick and dirty way to make a zipfile without creating actual files, but would be memory intensive.
[+] cozzyd | 5 years ago
not a safe assumption with Python packages!
[+] MaxBarraclough | 5 years ago
Can someone explain what it's doing? Is it using an algorithm with far superior space complexity than the usual algorithm?

Python seems a curious choice. Compression is computationally intensive.

[+] da_big_ghey | 5 years ago
Looks like it just splits by 16MB chunks, so just standard deflate. Actual compression is handled by the python zipfile module, which is probably C code underneath.
[+] julik | 5 years ago
Appreciate the quotes from the zip_tricks README as well as the resemblances between Buzon and WeTransfer. Glad some of the work we did proved inspirational ;-)
[+] 2bluesc | 5 years ago
I built a streaming zip app using nothing more than the Python stdlib zip implementation and some OS primitives.

It runs on a small embedded device that can stream zip archives many times larger than the disk or system RAM without any issue.

Example Python Falcon Proof of Concept:

https://gist.github.com/kylemanna/1e22bbf31b7e5ae84bbdfa32c6...

Other than what Python's zipfile buffers in memory, my implementation shouldn't use much more than an os.pipe() buffer (typically 64 kB?).
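The os.pipe() approach described above can be sketched like this (my own hedged reconstruction, not the linked gist): a writer thread compresses into the pipe's write end, and the consumer drains chunks from the read end, so memory stays near the pipe buffer size plus zipfile's own state.

```python
import os
import threading
import zipfile

def stream_zip_via_pipe(paths, chunk_size=64 * 1024):
    """Yield zip bytes produced through an os.pipe()."""
    r, w = os.pipe()

    def _writer():
        # Pipes are unseekable, so zipfile writes data descriptors.
        with os.fdopen(w, "wb") as wf:
            with zipfile.ZipFile(wf, "w", zipfile.ZIP_DEFLATED) as zf:
                for path in paths:
                    zf.write(path, arcname=os.path.basename(path))

    t = threading.Thread(target=_writer, daemon=True)
    t.start()
    with os.fdopen(r, "rb") as rf:
        while True:
            chunk = rf.read(chunk_size)
            if not chunk:  # writer closed its end: archive complete
                break
            yield chunk
    t.join()
```

Backpressure comes for free: the writer thread blocks whenever the pipe buffer is full, until the consumer takes another chunk.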

[+] ejwhite | 5 years ago
Interesting.

I need to open a very large CSV file in Python, which is around 25GB in .zip format. Any idea how to do this in a streaming way, i.e. stopping after reading the first few thousand rows?
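One stdlib-only way to do this (a sketch; `head_rows` is a hypothetical helper name): `ZipFile.open()` returns a file-like object that inflates on demand, so wrapping it in a text layer and a csv reader lets you stop after the first few rows without decompressing the rest.

```python
import csv
import io
import zipfile

def head_rows(zip_path, member, n):
    """Return the first n CSV rows of `member` inside the archive,
    decompressing only as much as is needed to produce them."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as raw:  # streams + inflates lazily
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            rows = []
            for row in csv.reader(text):
                rows.append(row)
                if len(rows) >= n:
                    break
            return rows
```

For a 25 GB archive this touches only the compressed bytes needed for the requested rows, so it finishes in milliseconds instead of decompressing everything.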

[+] spockz | 5 years ago
Is there something like this for the JVM? I'm not sure whether https://github.com/srikanth-lingala/zip4j#adding-entries-wit... makes it possible to keep everything in constrained memory.
[+] taeric | 5 years ago
The standard zip tools in Java should be fine. I regularly compress gigs of data in an aws lambda environment. Streaming from and to s3.
[+] the8472 | 5 years ago
Even the JRE-builtin ZipOutputStream would do the job, it's a proper streaming implementation that doesn't keep more state than necessary.
[+] amelius | 5 years ago
Does anyone know of a tar equivalent which performs deduplication?
[+] TheChaplain | 5 years ago
Should not be too complicated in Python. Just calculate the sha1/sha256 on the file before adding it to the tar-archive, skip any duplicates.
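The hash-and-skip suggestion above can be sketched in a few lines (`dedup_tar` is a hypothetical name; this is whole-file dedup by content hash, not block-level dedup like borg or casync):

```python
import hashlib
import tarfile

def dedup_tar(out_path, paths):
    """Add each distinct file content once; later paths whose sha256
    digest was already seen are skipped."""
    seen = set()
    with tarfile.open(out_path, "w") as tar:
        for path in paths:
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Hash in 64 kB blocks so large files never sit in memory.
                for block in iter(lambda: f.read(1 << 16), b""):
                    h.update(block)
            if h.hexdigest() in seen:
                continue
            seen.add(h.hexdigest())
            tar.add(path)
```

One limitation of the skip approach: a restore won't recreate the duplicate paths at all; storing them as hardlink members instead would preserve the file tree.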