top | item 30634700

cameroncairns | 4 years ago

Really great techniques listed in this thread! I wanted to point out, though, that it's generally kinder to the website owner if you enable `Accept-Encoding: gzip, deflate`. The difference in bandwidth charges for the site owner is significant, especially if you want to do comprehensive crawls.

Yes, go ahead and disable that header when piping curl's output into `less`; just remember to re-add it when converting the curl request into Python. Pretty much every Python library I've used for web requests will automatically unzip the server's response, so you don't need to futz with the zipping/unzipping logic yourself.
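To make the bandwidth claim concrete, here is a minimal sketch (stdlib only; the payload is made up for illustration) showing how much a typical repetitive HTML page shrinks under gzip:

```python
import gzip

# A repetitive HTML-ish payload, standing in for a typical crawled page.
html = b"<div class='row'><span>item</span></div>\n" * 1000

compressed = gzip.compress(html)
ratio = len(compressed) / len(html)

print(f"raw: {len(html)} bytes, gzipped: {len(compressed)} bytes "
      f"({ratio:.1%} of original)")
```

Markup this repetitive routinely compresses to a few percent of its raw size, which is where the savings on a comprehensive crawl come from.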

Nextgrid | 4 years ago

Your HTTP client library is likely to set that header by itself, to a value it can understand. Setting it manually risks asking for an encoding your library can't actually decode when the response comes back.

Klonoar | 4 years ago

No, some HTTP clients actually require you to set it - you wouldn't set the header directly, sure, but you would enable gzip/etc. Their point is super valid.
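As an illustration of a client where compression is effectively opt-in: Python's stdlib `urllib` neither sends `Accept-Encoding: gzip` for you nor decompresses the response, so a crawler built on it must do both. A sketch, with `fetch` and `decode_body` as hypothetical helper names:

```python
import gzip
import urllib.request

def decode_body(body, content_encoding):
    # Undo the transfer compression ourselves; urllib won't.
    if content_encoding == "gzip":
        return gzip.decompress(body)
    return body

def fetch(url):
    # Unlike higher-level libraries such as requests, urllib does not add
    # Accept-Encoding: gzip on its own, so we ask for compression here and
    # decode it on receipt.
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

This is the situation Klonoar describes: the library does nothing unless you enable gzip yourself.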

1vuio0pswjnm7 | 4 years ago

There have been some very popular websites that ignore Accept-Encoding and only send compressed data. Sometimes I want uncompressed responses. I always have the urge to complain about these websites on HN, but I sense that HN commenters/voters would be unsympathetic. (I use neither curl nor Python.)
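When a server sends gzipped bytes no matter what the request asked for, one defensive option is to sniff the gzip magic number and decompress anyway. A sketch in Python, with `force_plain` as a hypothetical helper:

```python
import gzip

def force_plain(body):
    # gzip streams begin with the magic bytes 0x1f 0x8b; if the server
    # compressed the body despite our Accept-Encoding, undo it here.
    # (A plain body could in principle start with these bytes too, so
    # this is a heuristic, not a guarantee.)
    if body[:2] == b"\x1f\x8b":
        return gzip.decompress(body)
    return body
```

It doesn't excuse the server ignoring content negotiation, but it gets you readable output regardless.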