Please consider adding a user agent string, with a link to the repo or some Google-able name, to your curl calls; it helps site operators get in touch with you if your scraper starts to misbehave somehow.
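A minimal sketch of what that looks like in practice, shown with Python's `urllib` rather than curl; the bot name and repo URL are hypothetical placeholders:

```python
import urllib.request

# Hypothetical bot name and repo URL; replace with your own project's.
USER_AGENT = "example-scraper/1.0 (+https://github.com/example/example-scraper)"

def build_request(url: str) -> urllib.request.Request:
    """Attach an identifying User-Agent so site operators can reach us."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

req = build_request("https://example.com/data.json")
```

The curl equivalent is simply `curl -A "example-scraper/1.0 (+https://github.com/example/example-scraper)" <url>`.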
It's tough when a cat-and-mouse game forces you to spoof your UA so you don't get blocked. I wish webmasters had better relationships with scrapers and could accept the reality that their data will be scraped no matter how hard they try to stop it.
If you're scraping any significant amount of data (>500 KB), and depending on the frequency, you might also want to honor ETag/Cache-Control headers (sending If-None-Match on repeat requests) as well as Accept-Encoding, to save server bandwidth.
Collecting 1 kB every minute might not be a big deal, but collecting 1 MB every minute would cost an AWS-hosted service >$40/year in additional data transfer costs.
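The arithmetic behind that estimate, assuming roughly $0.09/GB for AWS internet egress (the long-standing first-tier rate; actual pricing varies by region and volume):

```python
MB_PER_MINUTE = 1
MINUTES_PER_YEAR = 60 * 24 * 365       # 525,600 minutes
EGRESS_PRICE_PER_GB = 0.09             # assumed AWS rate in USD; varies by region/tier

gb_per_year = MB_PER_MINUTE * MINUTES_PER_YEAR / 1000   # ~525.6 GB transferred
cost_per_year = gb_per_year * EGRESS_PRICE_PER_GB       # ~$47/year
```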