> That question took me into the guts of the ZIP format, where I learned there’s a tiny index at the end that points to everything else.
For implementation in a library, you can use HttpRangeReader [1][2] in zip.js [3] (disclaimer: I am the author). It's a solid feature that has been in the library for about 10 years.
Based on your experience, is ZIP the optimal archive format for long-term digital archival in object storage if the use case calls for reading archives over HTTP for scanning and cherry-picking? Or is there a better archive format?
I've been looking at this for gzip files as well. There is a Rust solution that looks interesting, indexed_deflate (https://docs.rs/indexed_deflate/latest/indexed_deflate/). My goal is to be able to index MySQL dump files by table boundaries.
Here's my Python library that does the same [0]. It's also incorporated into VisiData, so you can view a .csv from within a .zip file over HTTP without downloading the whole .zip file.
I wrote a Rust command-line tool to do this for internal use in my SaaS. The motivation was to be able to index the contents of zip files stored on S3 without incurring significant egress charges. Is this something that people would generally find useful if it was open-sourced?
This is really cool! Could also make a useful standalone command line tool.
How common is it in practice today for a server not to support ranges? I remember that back in the early days of broadband (c. 2000), when having a download manager was something most nerds endorsed, most servers supported partial downloads. Aside from toy projects, has anyone encountered a server that didn't allow ranges (unless specifically configured to forbid them)?
7-Zip does this. You can see it if you open (to view) a large ZIP file on a slow network drive. There's no way it is downloading the whole thing. You can also extract single files from the ZIP with only a little traffic.
Would be surprised if that’s not how basically all tools behave, as I expect them all to seek to the central directory and to the referenced offset of individual files when extracting. Doesn’t really make a difference if that’s across a network file system or a local disc.
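A quick way to check that claim for one tool, as a minimal sketch: Python's standard `zipfile` reading from an in-memory file object that counts the bytes it hands out. `CountingReader`, `build_archive` and `read_one_member` are names made up for this illustration, and the result only says something about `zipfile`, not about 7-Zip or other tools.

```python
import io
import zipfile


class CountingReader(io.BytesIO):
    """In-memory stand-in for a slow file; counts bytes returned by read()."""

    def __init__(self, data: bytes):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_archive() -> bytes:
    """Build a test archive: one large stored member and one small one."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("big.bin", bytes(1_000_000))
        zf.writestr("small.txt", "just this one, please")
    return buf.getvalue()


def read_one_member(blob: bytes, name: str):
    """Extract a single member and report how many bytes were actually read."""
    src = CountingReader(blob)
    with zipfile.ZipFile(src) as zf:
        payload = zf.read(name)
    return payload, src.bytes_read
```

Calling `read_one_member(build_archive(), "small.txt")` returns the small file's contents together with a byte count that should be a small fraction of the archive size, since only the end record, the central directory and that one member are read.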
My 16yo son did exactly this over the last week as part of his Rust Minecraft mod manager, using HTTP range requests to get the file length, then the directory, then individual file data.
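Those three steps (length, directory, file data) can be sketched in Python as below. This is not the mod manager's code: `read_range(start, end)` is a hypothetical callable that returns the inclusive byte range, for example backed by an HTTP Range request, and the sketch assumes a plain single-disk archive with no ZIP64 records and no encryption.

```python
import struct
import zlib

EOCD_SIG = b"PK\x05\x06"
EOCD_MIN = 22          # fixed-size part of the end-of-central-directory record
MAX_COMMENT = 65535    # the trailing archive comment can be up to this long


def list_entries(size, read_range):
    """Return {name: (method, compressed_size, local_header_offset)}.

    size is the total archive length (step 1, e.g. from Content-Length).
    """
    # Step 2a: fetch the tail and locate the end-of-central-directory record.
    tail_len = min(size, EOCD_MIN + MAX_COMMENT)
    tail = read_range(size - tail_len, size - 1)
    pos = tail.rfind(EOCD_SIG)
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    total, cd_size, cd_offset = struct.unpack_from("<HII", tail, pos + 10)

    # Step 2b: fetch the central directory and walk its 46-byte headers.
    cd = read_range(cd_offset, cd_offset + cd_size - 1)
    entries = {}
    p = 0
    for _ in range(total):
        fields = struct.unpack_from("<4s6H3I5H2I", cd, p)
        if fields[0] != b"PK\x01\x02":
            raise ValueError("bad central directory entry")
        method, csize = fields[4], fields[8]
        name_len, extra_len, comment_len = fields[10], fields[11], fields[12]
        header_offset = fields[16]
        name = cd[p + 46 : p + 46 + name_len].decode("utf-8", "replace")
        entries[name] = (method, csize, header_offset)
        p += 46 + name_len + extra_len + comment_len
    return entries


def extract(entry, read_range):
    """Step 3: fetch one member's local header, then its data, and decode it."""
    method, csize, header_offset = entry
    local = read_range(header_offset, header_offset + 29)
    name_len, extra_len = struct.unpack_from("<HH", local, 26)
    start = header_offset + 30 + name_len + extra_len
    raw = read_range(start, start + csize - 1) if csize else b""
    if method == 0:      # stored
        return raw
    if method == 8:      # raw deflate stream, no zlib header
        return zlib.decompress(raw, -zlib.MAX_WBITS)
    raise ValueError(f"unsupported compression method {method}")
```

For local testing, `read_range` can simply slice a bytes object: `lambda s, e: blob[s:e + 1]`. The local header has to be fetched separately because its name and extra-field lengths can differ from the copy in the central directory.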
In this blog post, I wrote about the architecture of a ZIP file and how we can leverage HTTP range requests to download files without decompressing the archive, in-browser.
jeffrallen|4 months ago
https://blog.nella.org/2016/01/17/seeking-http/
(Originally written for Advent of Go.)
rtk0|4 months ago
Lammy|4 months ago
Tangential, but any Free Software that uses `shared-mime-info` to identify files (any of your GNOMEs, KDEs, etc.) is unable to correctly identify Zip files by their EOCD, because there is no accepted syntax for defining search patterns based on negative file offsets. Please show your support on this Issue if you would also like to see this resolved: https://gitlab.freedesktop.org/xdg/shared-mime-info/-/issues... (linking to my own comment, so no, this is not brigading)
Anything using `file(1)` does not have this problem: https://github.com/file/file/blob/280e121/magic/Magdir/zip#L...
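To illustrate why the end-of-file check matters, here is a small Python sketch (not how `shared-mime-info` or `file(1)` are implemented). It builds a Zip with a script prepended, the way self-extracting archives are laid out, and compares a leading-bytes test with Python's `zipfile.is_zipfile`, which looks for the EOCD record. The helper names are made up for this example.

```python
import io
import zipfile


def starts_with_zip_magic(data: bytes) -> bool:
    """Leading-bytes check: true only if the data begins with a local file header."""
    return data[:4] == b"PK\x03\x04"


def has_eocd(data: bytes) -> bool:
    """Trailing check: zipfile.is_zipfile searches for the end-of-central-directory record."""
    return zipfile.is_zipfile(io.BytesIO(data))


def make_prefixed_zip() -> bytes:
    """A valid Zip with a shell stub prepended, like a self-extracting archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("hello.txt", "hello")
    return b"#!/bin/sh\nexit 0\n" + buf.getvalue()
```

On the output of `make_prefixed_zip()`, the leading-bytes check fails while the EOCD check succeeds, which is the case a negative-offset pattern would cover.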
gildas|4 months ago
[1] https://gildas-lormeau.github.io/zip.js/api/classes/HttpRang...
[2] https://github.com/gildas-lormeau/zip.js/blob/master/tests/a...
[3] https://github.com/gildas-lormeau/zip.js
toomuchtodo|4 months ago
silasb|4 months ago
saulpw|4 months ago
[0] https://github.com/saulpw/unzip-http/
rtk0|4 months ago
dabinat|4 months ago
rtk0|4 months ago
xg15|4 months ago
I think the general pattern - using the range header + prior knowledge of a file format to only download the parts of a file that are relevant - is still really underutilized.
One small problem I see is that a server that does not support range requests would just try to send you the entire file in the first request, I think.
So maybe doing a preflight HEAD request first to see if the server sends back Accept-Ranges could be useful.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Ran...
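A minimal sketch of that preflight in Python, using only `urllib` from the standard library. The function names are made up for this example. One caveat: `Accept-Ranges` is advisory, so a server may honor ranges without advertising them, and the second function therefore also refuses anything other than a 206 response rather than silently accepting the whole file.

```python
import urllib.request


def advertises_ranges(headers) -> bool:
    """True if the response headers advertise byte-range support."""
    value = headers.get("Accept-Ranges") or ""
    return value.strip().lower() == "bytes"


def preflight(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request and check Accept-Ranges before asking for any bytes."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return advertises_ranges(resp.headers)


def fetch_range(url: str, start: int, end: int, timeout: float = 10.0) -> bytes:
    """Fetch bytes start..end inclusive; fail if the server ignored the Range header."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        if resp.status != 206:
            # A 200 here means the full body is coming; bail out before reading it.
            raise RuntimeError(f"expected 206 Partial Content, got {resp.status}")
        return resp.read()
```

Checking the status inside `fetch_range` covers the case where the HEAD response and the GET response disagree, at the cost of one aborted connection.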
xp84|4 months ago
HPsquared|4 months ago
dividuum|4 months ago
jacknews|4 months ago
I'll dig up a link.
rtk0|4 months ago
aeblyve|4 months ago
dekhn|4 months ago