top | item 37082878


mr_ndrsn | 2 years ago

This looks very cool!

Please consider adding a user agent string, with a link to the repo or some Google-able name, to your curl call. It can help site operators get in touch with you if the scraper starts to misbehave somehow.
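A minimal sketch of what that can look like with Python's standard library (the project URL and contact address are placeholders, not anything from the original post):

```python
import urllib.request

# Hypothetical project URL and contact address -- replace with your own.
USER_AGENT = (
    "my-scraper/1.0 "
    "(+https://github.com/example/my-scraper; contact: ops@example.com)"
)

def make_request(url: str) -> urllib.request.Request:
    """Build a request that identifies the scraper to site operators."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

req = make_request("https://example.com/data")
```

The same idea with curl itself is just `curl -A "my-scraper/1.0 (+https://github.com/example/my-scraper)" <url>`.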


RockRobotRock | 2 years ago

It's tough when there's a cat-and-mouse game where you have to spoof your UA so you don't get blocked. I wish webmasters had better relationships with scrapers and could accept the reality that your data will be scraped no matter how hard you try to stop it.

hosteur | 2 years ago

IMO, we should really just get rid of the user agent header altogether.

simonw | 2 years ago

Yeah, that's a good idea; I need to add that to my suggestions for how to implement this.

pcthrowaway | 2 years ago

If you're scraping any significant amount of data (>500K), and depending on the frequency, you might also want to send ETag/Cache-Control headers, as well as Accept-Encoding, to save server bandwidth.
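A sketch of a polite re-fetch using Python's standard library, assuming the scraper persists the ETag from the previous run (the function names are illustrative, not from the original post):

```python
import gzip
import urllib.error
import urllib.request

def conditional_headers(etag):
    """Headers for a polite re-fetch: compressed transfer, plus
    revalidation if we saved an ETag from the previous response."""
    headers = {"Accept-Encoding": "gzip"}
    if etag:
        headers["If-None-Match"] = etag  # lets the server answer 304 with no body
    return headers

def fetch_if_changed(url, etag=None):
    """Return (body, new_etag), or (None, etag) on 304 Not Modified."""
    req = urllib.request.Request(url, headers=conditional_headers(etag))
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
            if resp.headers.get("Content-Encoding") == "gzip":
                body = gzip.decompress(body)  # urllib does not decompress for you
            return body, resp.headers.get("ETag")
    except urllib.error.HTTPError as e:
        if e.code == 304:  # unchanged since last fetch: nothing transferred
            return None, etag
        raise
```

On an unchanged resource the server responds 304 with an empty body, so the full payload is only transferred when the content has actually changed.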

Collecting 1 kB every minute might not be a big deal, but collecting 1 MB every minute would cost an AWS-hosted service more than $40/year in additional data transfer costs.
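The back-of-the-envelope arithmetic behind that figure, assuming a typical ~$0.09/GB egress rate (the rate is my assumption, not stated in the comment):

```python
# 1 MB fetched every minute, all year long.
MB_PER_FETCH = 1
MINUTES_PER_YEAR = 60 * 24 * 365                        # 525,600 fetches
gb_per_year = MB_PER_FETCH * MINUTES_PER_YEAR / 1000    # ~525.6 GB/year

COST_PER_GB = 0.09                                      # assumed egress rate, $/GB
annual_cost = gb_per_year * COST_PER_GB                 # ~$47/year
```

About 525 GB of egress per year at that rate comes to roughly $47, consistent with the ">$40/year" figure above.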

lloydatkinson | 2 years ago

It should definitely be optional. I can only imagine some busybody PM insisting they block harmless scrapes.