
rmgraham | 4 years ago

Tar doesn't use any sort of index like zip does, so to extract the specified file the server side would need to parse through possibly the entire file just to see if the requested file is there, and then start streaming it. Requests for files that aren't in the tar archive would be prohibitively expensive.
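A quick way to see the linear scan with Python's standard `tarfile` module (the archive here is a tiny in-memory example):

```python
import io
import tarfile

# Build a small in-memory tar archive for demonstration.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in [("a.txt", b"hello"), ("b.txt", b"world")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

# Extracting one member forces a walk over the 512-byte headers, entry by
# entry, until the name matches -- there is no central index to consult.
with tarfile.open(fileobj=buf, mode="r") as tar:
    member = tar.getmember("b.txt")   # scans past "a.txt" first
    data = tar.extractfile(member).read()
print(data)  # b'world'
```

For a name that isn't in the archive, the same call only raises `KeyError` after scanning every header, which is exactly the worst case described above.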

There are definitely ways to do it without those problems, though. They just wouldn't be quite as simple as the approach done for supporting zip.


remram | 4 years ago

You could pre-index them I suppose. Though even that would only work with a subset of compression methods or no compression.
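For an uncompressed tar, pre-indexing is one linear pass that records each entry's data offset and size; a sketch with Python's `tarfile` (which exposes `TarInfo.offset_data`):

```python
import io
import tarfile

# Build a sample uncompressed tar in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in [("a.txt", b"hello"), ("b.txt", b"world!")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

# One pass over the headers builds the index: name -> (data offset, size).
with tarfile.open(fileobj=buf, mode="r") as tar:
    index = {m.name: (m.offset_data, m.size) for m in tar.getmembers()}

# Afterwards a single seek+read serves any file without re-parsing the tar.
offset, size = index["b.txt"]
buf.seek(offset)
print(buf.read(size))  # b'world!'
```

This only works because the offsets are stable; with whole-archive compression (tar.gz) the byte offsets no longer point anywhere useful, which is the restriction mentioned above.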

klauspost | 4 years ago

We considered TAR, but indexing requires reading back and decompressing the entire archive.

This may be feasible on small TAR files, and for a single PutObject you could index while uploading. However, for multipart objects, parts can arrive in any order, so you are forced to read the archive back. This would lead to unpredictable response times.

Compare that to reading the directory of a zip, which even on big files is a couple of megabytes at most.
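The reason the zip directory is so cheap to read is that it lives at the end of the file: the End of Central Directory record points at the central directory, so a server only has to read the tail of the object. A sketch that parses that record by hand (field layout per the ZIP spec):

```python
import io
import struct
import zipfile

# Build a sample zip in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hello")
    zf.writestr("b.txt", "world")
data = buf.getvalue()

# The End Of Central Directory record (signature PK\x05\x06) sits at the
# end of the file and records the central directory's size and offset --
# enough to learn every member's location from one tail read.
eocd_pos = data.rfind(b"PK\x05\x06")
(sig, disk, cd_disk, n_disk, n_total,
 cd_size, cd_offset, comment_len) = struct.unpack(
    "<IHHHHIIH", data[eocd_pos:eocd_pos + 22])
print(n_total, cd_size, cd_offset)  # 2 entries; directory size and offset
```

In practice a server would issue a ranged read of the last few kilobytes, find this record, then range-read the central directory itself; no pass over the file bodies is needed.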

Add to that that tar.gz requires you to decompress from the start to reach any offset. You could recompress while indexing, but an object store mutating your data is, IMO, a no-no.

danudey | 4 years ago

IIRC gzip can't handle this, but bzip2 can: a guy I know wrote an offline Wikipedia app for the original iPhone and had to crunch things down a lot. He used bzip2 because you can skip ahead to a chunk without having to process the previous or subsequent chunks.

Then he just had to write some code to index article names based on which chunk(s) they were in, and boom, random-access compressed archive.
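The same idea can be approximated without seeking inside a single bzip2 stream: compress fixed-size groups of articles as independent bz2 streams and keep a name-to-chunk index. A sketch (article names, record format, and chunk size are all made up for illustration):

```python
import bz2

# Hypothetical articles; names and bodies are invented for the demo.
articles = {f"Article{i}": f"body of article {i}" for i in range(10)}

CHUNK_SIZE = 4  # articles per chunk (tiny, just for the demo)

chunks, index = [], {}
names = sorted(articles)
for start in range(0, len(names), CHUNK_SIZE):
    group = names[start:start + CHUNK_SIZE]
    # Pack the group as "name\x01body" records separated by \x00.
    payload = "\x00".join(f"{n}\x01{articles[n]}" for n in group)
    chunk_id = len(chunks)
    chunks.append(bz2.compress(payload.encode()))
    for n in group:
        index[n] = chunk_id  # article name -> chunk number

def lookup(name):
    # Decompress only the one chunk that holds the article.
    payload = bz2.decompress(chunks[index[name]]).decode()
    for record in payload.split("\x00"):
        n, _, body = record.partition("\x01")
        if n == name:
            return body

print(lookup("Article7"))  # body of article 7
```

Random access costs one small decompression per lookup instead of a scan of the whole corpus; the chunk size trades compression ratio against per-lookup work.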

blacha | 4 years ago

This is basically exactly what we do: we created a cloud optimised tar (cotar) [1] by building a hash index of the files inside the tar.

I work with serving tiled geospatial data [2] (Mapbox vector tiles) to our users as slippy maps, where we serve millions of small (mostly <100KB) files. Our data only changes weekly, so we precompute all the tiles and store them in a tar file in S3.

We compute an index for the tar file, then use S3 range requests to serve the tiles to our users. This means we can generally fetch a tile from S3 with 2 requests (or 1 if the index is cached), typically in ~20-50ms.
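The two-request flow can be sketched end to end: request one grabs the index from the tail of the object (and is cacheable), request two is a single ranged read of the tile. This is only an illustration, not cotar's actual format: cotar uses a binary hash index, while the sketch appends a JSON index plus an 8-byte length trailer to the tar, and `get_range` stands in for an S3 GetObject with a `Range` header.

```python
import io
import json
import tarfile

def get_range(blob, offset, size):
    # Stand-in for S3 GetObject with "Range: bytes=offset-(offset+size-1)".
    return blob[offset:offset + size]

# Build a tar of tiny "tiles".
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in [("0/0/0.pbf", b"tile-a"), ("1/0/0.pbf", b"tile-b")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
tar_bytes = buf.getvalue()

# Append an index as a trailer: tile path -> [data offset, size],
# followed by the index length, so the whole thing is one object.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    index = {m.name: [m.offset_data, m.size] for m in tar.getmembers()}
index_bytes = json.dumps(index).encode()
obj = tar_bytes + index_bytes + len(index_bytes).to_bytes(8, "little")

# Request 1: read the trailer to recover the index (cacheable client-side).
index_len = int.from_bytes(get_range(obj, len(obj) - 8, 8), "little")
idx = json.loads(get_range(obj, len(obj) - 8 - index_len, index_len))

# Request 2: one ranged read returns the tile itself.
offset, size = idx["1/0/0.pbf"]
print(get_range(obj, offset, size))  # b'tile-b'
```

With the index cached, only the second request remains, matching the 1-request fast path described above.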

To get full coverage of the world with Mapbox vector tiles, it is around 270M tiles and a ~90GB tar file, which can be computed from OpenStreetMap data [3].

> Though even that would only work with a subset of compression methods or no compression.

We compress the individual files as a workaround. There are options for indexing a compressed (gzip) tar file, but the benefits of a compressed tar vs compressed files are small for our use case.
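The workaround is straightforward: gzip each file before adding it, so the tar itself stays uncompressed, the byte offsets stay valid for range requests, and each entry is a standalone gzip stream. A sketch with invented tile names:

```python
import gzip
import io
import tarfile

# Hypothetical tiles (names and contents are made up for the demo).
tiles = {"0/0/0.pbf": b"tile-a" * 10, "1/0/0.pbf": b"tile-b" * 10}

# Gzip each file *before* adding it; the tar container stays uncompressed.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in tiles.items():
        blob = gzip.compress(data)
        info = tarfile.TarInfo(name=name + ".gz")
        info.size = len(blob)
        tar.addfile(info, io.BytesIO(blob))

# A ranged read of one entry yields a self-contained gzip stream.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    m = tar.getmember("1/0/0.pbf.gz")
    offset, size = m.offset_data, m.size
raw = buf.getvalue()[offset:offset + size]
print(gzip.decompress(raw) == tiles["1/0/0.pbf"])  # True
```

The cost is losing cross-file compression (each small file is compressed alone), which is the trade-off the comment calls small for this use case.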

[1] https://github.com/linz/cotar (or wip rust version https://github.com/blacha/cotar-rs) [2] https://github.com/linz/basemaps or https://basemaps.linz.govt.nz [3] https://github.com/onthegomap/planetiler