I have questions about the code. Why do you need to say int('0x1', 16) and int('0x2', 16)? Why not just write 0x1 and 0x2? Or just plain 1 and 2?
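For what it's worth, a quick check shows all three spellings denote the same integers, so the string round-trip is redundant for literals:

```python
# int(s, 16) accepts an optional '0x' prefix, so these are all equal.
assert int('0x1', 16) == 0x1 == 1
assert int('0x2', 16) == 0x2 == 2

# int(s, 16) only earns its keep when the value arrives as a string at
# runtime, e.g. parsed from a config file or user input.
value = int('ff', 16)  # 255
```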
I'm also perplexed by the goal as this seems to just call zipfile.write under the hood, which already streams to a zip file without accumulating a memory buffer?
I think the appeal is that it's a generator, so if you need to push the bytes of the zip over some other transport you can naturally ask for a few more each time, without having to accumulate the whole thing in memory.

Of course, by crafting a special file-like object you could avoid this too, but perhaps a bit less elegantly.
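To sketch that idea (a hypothetical CallbackSink, not any particular library's API): an unseekable, write-only file object whose write() forwards each chunk to a callback. Because it can't seek, zipfile falls back to its streaming, data-descriptor code path and never buffers the archive itself:

```python
import io
import zipfile

class CallbackSink(io.RawIOBase):
    """Write-only file-like object: every write() is forwarded to a
    callback, so zipfile streams chunks out instead of buffering them.
    Deliberately unseekable, which pushes zipfile onto its streaming
    (data-descriptor) code path."""
    def __init__(self, callback):
        self._callback = callback

    def writable(self):
        return True

    def write(self, b):
        self._callback(bytes(b))
        return len(b)

# Collecting into a list here only for demonstration; in practice the
# callback would hand chunks to a socket, response object, etc.
chunks = []
with zipfile.ZipFile(CallbackSink(chunks.append), "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", b"hello " * 100)

archive = b"".join(chunks)
```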
I'm a little perplexed by the "marketing" around this --- all the archivers I know of don't require more memory than the compression state (which AFAIK for ZIP/deflate is not much more than a 64k window), since it is natural that files can be larger than available RAM.
I think it's meant for a pretty narrow use-case: serving compressed files through frameworks (as mentioned, for example Django or Flask) that expect to serve file objects, but without writing to disk.
The "usual"/naive solution (if you stay within the Python ecosystem) is to compress the files into a BytesIO or other in-memory file-like object, and then have your framework serve it. The naive solution means writing the whole archive to memory before serving (thus memory inflation).
This library just looks like a pretty straightforward way to implement the same idea, but with chunking to bound memory usage. At the bottom, it's doing the same thing, but using generators to yield chunks at a time.
It's a useful utility for that context. Nothing groundbreaking; it's something that most intermediate-and-up developers could stitch together in probably a few days (especially if they had to brush up on DEFLATE and the generator protocol), but it's nice to not have to.
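For contrast, the naive in-memory version described above looks roughly like this (names are illustrative, not any framework's API):

```python
import io
import zipfile

def zip_in_memory(files):
    """Naive approach: the entire archive lives in RAM before serving."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files:
            zf.writestr(name, data)
    buf.seek(0)
    return buf  # a framework can now serve this file object

payload = zip_in_memory([("report.csv", b"a,b\n1,2\n" * 10_000)])
```

The streaming variant replaces the BytesIO with a generator that yields fixed-size chunks, so memory stays bounded regardless of archive size.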
I believe the comparison is just to the bundled zipfile module and BytesIO, which would be the quick and dirty way to make a zipfile without creating actual files, but would be memory intensive.
Looks like it just splits by 16MB chunks, so just standard deflate. Actual compression is handled by the python zipfile module, which is probably C code underneath.
Appreciate the quotes from the zip_tricks README as well as the resemblances between Buzon and WeTransfer. Glad some of the work we did proved inspirational ;-)
I need to open a very large CSV file in Python, which is around 25GB in .zip format. Any idea how to do this in a streaming way, i.e. stopping after reading the first few thousand rows?
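One stdlib-only sketch (paths and member names are placeholders): ZipFile.open returns a file object that decompresses lazily, so you can stop after the first few thousand rows without ever extracting the 25 GB:

```python
import csv
import io
import itertools
import zipfile

def head_rows(zip_path, member, n):
    """Read only the first n CSV rows from one member of a zip archive,
    inflating on demand instead of extracting the whole file."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as raw:  # streams and decompresses lazily
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(itertools.islice(csv.reader(text), n))

# e.g. rows = head_rows("big.zip", "big.csv", 5000)
```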
rahimiali | 5 years ago
[0] https://github.com/BuzonIO/zipfly/blob/master/zipfly/zipfly....
mlyle | 5 years ago
userbinator | 5 years ago
icegreentea2 | 5 years ago
tyingq | 5 years ago
cozzyd | 5 years ago
MaxBarraclough | 5 years ago
Python seems a curious choice. Compression is computationally intensive.
da_big_ghey | 5 years ago
julik | 5 years ago
unknown | 5 years ago
[deleted]
2bluesc | 5 years ago
It runs on a small embedded device that can stream zip archives many times larger than the disk or system RAM without any issue.
Example Python Falcon Proof of Concept:
https://gist.github.com/kylemanna/1e22bbf31b7e5ae84bbdfa32c6...
Other than what Python's zipfile buffers in memory, my implementation shouldn't use much more than an os.pipe()'s buffer (typically 64 kB?).
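The general shape of that pipe-plus-thread approach, as a rough sketch (not the linked gist's actual code), looks like this:

```python
import os
import threading
import zipfile

def stream_zip(files):
    """Yield zip-archive bytes as they are produced. Memory stays bounded
    by the pipe buffer plus zipfile's own compression state. Note: if the
    consumer abandons the generator mid-stream, the producer thread will
    block on a full pipe."""
    r, w = os.pipe()

    def produce():
        # The pipe's write end is unseekable, so zipfile uses its
        # streaming (data-descriptor) code path.
        with os.fdopen(w, "wb") as sink:
            with zipfile.ZipFile(sink, "w", zipfile.ZIP_DEFLATED) as zf:
                for name, data in files:
                    zf.writestr(name, data)

    t = threading.Thread(target=produce)
    t.start()
    with os.fdopen(r, "rb") as source:
        while True:
            chunk = source.read(64 * 1024)
            if not chunk:
                break
            yield chunk
    t.join()

payload = b"".join(stream_zip([("hello.txt", b"hi" * 1000)]))
```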
ejwhite | 5 years ago
leiserfg | 5 years ago
lern_too_spel | 5 years ago
spockz | 5 years ago
taeric | 5 years ago
the8472 | 5 years ago
amelius | 5 years ago
TheChaplain | 5 years ago