top | item 37527194

taldo | 2 years ago

A very simple optimization for those complaining about having to fetch a large file every time you need a little datapoint: if they promised the file was append-only, and used HTTP gzip/brotli/whatever compression (as opposed to shipping a zip file), you could use range requests to only get the new data after your last refresh. Throw in an extra checksum header for peace of mind, and you have a pretty efficient, yet extremely simple incremental API.

(Yes, this assumes you keep the state, and you have to pay the price of the first download + state-keeping. Yes, it's also inefficient if you just need to get the EUR/JPY rate from 2007-08-22 a single time.)
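The scheme above can be sketched in a few lines of Python. Here a local `remote` byte string stands in for the server's append-only file, and the slice plays the role of a `Range: bytes=N-` request; the function name and the checksum check are illustrative, not a real API:

```python
import hashlib

def incremental_fetch(remote: bytes, local: bytes) -> bytes:
    """Simulate an incremental refresh of an append-only file.

    `remote` stands in for the server's current file; in real use the
    slice below would be an HTTP request with `Range: bytes=<start>-`.
    """
    start = len(local)
    if start >= len(remote):
        return local                      # nothing new since last refresh
    new_tail = remote[start:]             # the 206 Partial Content body
    updated = local + new_tail
    # peace-of-mind check against a hypothetical checksum header
    assert hashlib.sha256(updated).digest() == hashlib.sha256(remote).digest()
    return updated

# First download pays full price; later refreshes fetch only the tail.
server = b"2007-08-21,EUR/JPY,162.43\n"
mirror = incremental_fetch(server, b"")
server += b"2007-08-22,EUR/JPY,161.97\n"  # server appends a new row
mirror = incremental_fetch(server, mirror)
print(mirror == server)   # True
```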

acqq | 2 years ago

Also, on the topic of range requests: when a server allows range requests for zip files, and the zip files are huge but you only need a few files from them, you can download just the "central directory" and the compressed data of the files you need, without downloading the whole zip file:

https://github.com/gtsystem/python-remotezip
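A rough sketch of what a library like python-remotezip does under the hood: a file-like wrapper that "downloads" bytes only on demand, so `zipfile` ends up touching just the end-of-central-directory record, the central directory, and the one member you read. The `RangeReader` class and the sample archive are made up for illustration; a real implementation would issue HTTP range requests where this one slices a local buffer:

```python
import io
import os
import zipfile

class RangeReader(io.RawIOBase):
    """File-like view of a remote zip that fetches bytes on demand.

    `blob` is a local buffer standing in for the remote file; a real
    implementation would turn each read() into an HTTP request with a
    `Range: bytes=start-end` header. `fetched` counts how many bytes
    were actually "downloaded".
    """
    def __init__(self, blob: bytes):
        super().__init__()
        self.blob = blob
        self.pos = 0
        self.fetched = 0

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        else:                             # io.SEEK_END
            self.pos = len(self.blob) + offset
        return self.pos

    def tell(self):
        return self.pos

    def read(self, size=-1):
        if size is None or size < 0:
            size = len(self.blob) - self.pos
        chunk = self.blob[self.pos:self.pos + size]   # the "range request"
        self.pos += len(chunk)
        self.fetched += len(chunk)
        return chunk

# Build a sample archive: one tiny file next to ~500 kB of ballast.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("rates.csv", "2007-08-22,EUR/JPY,161.97\n")
    zf.writestr("ballast.bin", os.urandom(500_000))

reader = RangeReader(buf.getvalue())
with zipfile.ZipFile(reader) as zf:
    data = zf.read("rates.csv")           # touches only directory + one member

print(len(reader.blob), "byte archive,", reader.fetched, "bytes fetched")
```

Only a few hundred bytes of the half-megabyte archive get transferred, because `zipfile` seeks straight to the directory records and the one member's compressed data.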

GuB-42 | 2 years ago

Or serve a bunch of diff files. Even a single daily patch can drastically reduce the bandwidth required to keep the file up to date on your side.
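For an append-only dataset the daily patch is tiny; a sketch with Python's difflib (filenames and rows invented) shows the size difference:

```python
import difflib

# Hypothetical daily snapshots of the dataset (rows invented).
yesterday = [f"2007-08-22,EUR/JPY,{160 + i % 5}.{i % 10}\n" for i in range(1000)]
today = yesterday + ["2007-08-23,EUR/JPY,161.4\n"]   # one new row appended

# The server would publish this patch once per day.
patch = list(difflib.unified_diff(yesterday, today, "yesterday.csv", "today.csv"))
print(f"full file: {sum(map(len, today))} bytes, "
      f"daily patch: {sum(map(len, patch))} bytes")
```

The patch carries only the hunk header, a few context lines, and the new row, so it stays a couple of hundred bytes however large the full file grows.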

That's if downloading a few hundred kB more per day matters to you. It probably doesn't.