blacha | 4 years ago
I work on serving tiled geospatial data [2] (Mapbox vector tiles) to our users as slippy maps, which means serving millions of small (mostly <100KB) files. Our data only changes weekly, so we precompute all the tiles and store them in a tar file in S3.
We compute an index for the tar file, then use S3 range requests to serve the tiles to our users. This means we can generally fetch a tile from S3 with 2 requests (or 1 if the index is cached), generally in ~20-50ms.
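As a rough illustration of the idea (not cotar's actual index format), here is a minimal Python sketch that records where each file's bytes live inside an uncompressed tar and turns that into an HTTP `Range` header value. The helper names (`build_index`, `range_header`, `make_demo_tar`) and the tile paths are made up for this example, and slicing an in-memory buffer stands in for the real S3 range request.

```python
import io
import tarfile


def build_index(tar_bytes: bytes) -> dict[str, tuple[int, int]]:
    """Map each regular file in an uncompressed tar to (data offset, size)."""
    index = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if member.isfile():
                # offset_data points at the file's bytes, past its tar header
                index[member.name] = (member.offset_data, member.size)
    return index


def range_header(offset: int, size: int) -> str:
    """HTTP Range header value; both ends of a byte range are inclusive."""
    return f"bytes={offset}-{offset + size - 1}"


def make_demo_tar(files: dict[str, bytes]) -> bytes:
    """Build a small in-memory tar so the sketch runs without S3."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Hypothetical tile names and contents, for illustration only
demo_tar = make_demo_tar({
    "tiles/0/0/0.pbf.gz": b"tile-a",
    "tiles/1/0/1.pbf.gz": b"tile-bb",
})
demo_index = build_index(demo_tar)
offset, size = demo_index["tiles/1/0/1.pbf.gz"]
tile = demo_tar[offset:offset + size]  # stands in for the S3 range request
```

Against real storage, the slice on the last line would be a GET request carrying `Range: bytes=<offset>-<offset + size - 1>`.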
Full coverage of the world with Mapbox vector tiles is around 270M tiles and a ~90GB tar file, which can be computed from OpenStreetMap data [3].
> Though even that would only work with a subset of compression methods or no compression.
We compress the individual files as a workaround. There are options for indexing a compressed (gzip) tar file, but the benefits of a compressed tar vs compressed files are small for our use case.
[1] https://github.com/linz/cotar (or wip rust version https://github.com/blacha/cotar-rs)
[2] https://github.com/linz/basemaps or https://basemaps.linz.govt.nz
[3] https://github.com/onthegomap/planetiler
remram|4 years ago
blacha|4 years ago
Doing S3 PUT requests for 260M files every week would cost around $1300 USD/week, which was too much for our budget.
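For anyone wanting to check that figure, the arithmetic is short. The sketch below assumes a PUT price of $0.005 per 1,000 requests, which is an assumption on my part about S3 Standard list pricing; plug in the current price for your region.

```python
# Assumed price per 1,000 PUT requests in USD; verify against current S3 pricing
PUT_PRICE_PER_1000_USD = 0.005


def weekly_put_cost(file_count: int, price_per_1000: float = PUT_PRICE_PER_1000_USD) -> float:
    """Cost in USD of uploading every file once with its own PUT request."""
    return file_count / 1000 * price_per_1000


cost = weekly_put_cost(260_000_000)
print(f"~${cost:,.0f} USD per weekly upload")
```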
> or in ZIP format?
We looked at zips, but because of the way the header (well, the central file directory) is laid out, finding a specific file inside the zip would require the system to download most of the CFD.
The zip CFD is basically a list of header entries that vary in size (46 fixed bytes + file name length, plus any extra field or comment). To find a specific file you have to iterate the CFD until you find the file you want.
Assuming you have a smallish archive (~1 million files), the CFD for the zip would be somewhere in the order of 50MB+ (depending on filename length).
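A back-of-the-envelope version of that estimate, as a small Python sketch. It takes the fixed part of each central directory entry as 46 bytes (the figure in the ZIP specification), ignores extra fields and comments, and assumes an average file name of 16 characters, which is a guess based on a hypothetical name like `12/2048/1365.pbf`.

```python
# Fixed-size part of one central directory file header, per the ZIP specification
CENTRAL_HEADER_FIXED_BYTES = 46


def central_directory_size(file_count: int, avg_name_len: int,
                           fixed_bytes: int = CENTRAL_HEADER_FIXED_BYTES) -> int:
    """Approximate CFD size in bytes, ignoring extra fields and comments."""
    return file_count * (fixed_bytes + avg_name_len)


# 16 is an assumed average file name length, not a measured one
cfd_bytes = central_directory_size(1_000_000, avg_name_len=16)
print(f"{cfd_bytes / 1_000_000:.0f} MB")
```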
Using a hash index, you know exactly where in the index you need to look for the header entry, so you can use a range request to load just that entry.
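To make the hash index idea concrete, here is a minimal sketch of a fixed-size slot table with linear probing. It is not cotar's actual on-disk layout: the 24-byte record, the FNV-1a hash, and the names `build_hash_index`, `find`, and `read_range` are all assumptions chosen for illustration. `read_range(offset, length)` stands in for a range request against the index.

```python
import struct
from typing import Callable, Optional

# One slot: (hash of file name, data offset, size), 24 bytes, little-endian.
# offset == 0 marks an empty slot; tar data never starts at 0 because a
# 512-byte header always precedes it.
RECORD = struct.Struct("<QQQ")


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a; any stable hash would work for this sketch."""
    h = 0xCBF29CE484222325
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h


def build_hash_index(entries: dict[str, tuple[int, int]], slot_count: int) -> bytes:
    """Place each entry at hash % slot_count, probing forward on collisions."""
    if slot_count <= len(entries):
        raise ValueError("slot_count must exceed the number of entries")
    table = bytearray(RECORD.size * slot_count)
    for name, (offset, size) in entries.items():
        key = fnv1a_64(name.encode())
        slot = key % slot_count
        while RECORD.unpack_from(table, slot * RECORD.size)[1] != 0:
            slot = (slot + 1) % slot_count
        RECORD.pack_into(table, slot * RECORD.size, key, offset, size)
    return bytes(table)


def find(read_range: Callable[[int, int], bytes], name: str,
         slot_count: int) -> Optional[tuple[int, int]]:
    """Look up (offset, size) by reading only the slots that could hold name."""
    key = fnv1a_64(name.encode())
    slot = key % slot_count
    for _ in range(slot_count):
        h, offset, size = RECORD.unpack(read_range(slot * RECORD.size, RECORD.size))
        if offset == 0:
            return None  # hit an empty slot: the name is not in the index
        if h == key:
            return offset, size  # trusts the 64-bit hash; names are not stored
        slot = (slot + 1) % slot_count
    return None


# Hypothetical entries; offsets and sizes would come from scanning the tar
index_bytes = build_hash_index({"0/0/0.pbf": (512, 6), "1/0/1.pbf": (1536, 7)}, slot_count=8)
found = find(lambda off, n: index_bytes[off:off + n], "1/0/1.pbf", slot_count=8)
```

A real reader would probably fetch a block of several slots per request so that a collision does not cost an extra round trip.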
Another file format that has been gaining popularity recently is PMTiles [1], which uses a tree index; however, it is specifically for tiled geospatial data.
[1] https://github.com/protomaps/PMTiles