top | item 42549290

adastral | 1 year ago

> I default to a headless browser

Headless browsers consume orders of magnitude more resources and issue far more requests (e.g. fetching images) than a typical web scraping job requires. Having run web scraping at scale myself, the cost of operating headless browsers meant we used them only as a last resort.

at0mic22|1 year ago

Blocking all image/video/CSS requests is the rule of thumb when working with headless browsers via CDP (Chrome DevTools Protocol).
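
For anyone wondering what that looks like at the protocol level, here is a rough sketch, not a drop-in implementation. It only builds the JSON messages for the CDP `Fetch` domain (`Fetch.enable`, `Fetch.failRequest`, `Fetch.continueRequest`); sending them over the browser's WebSocket is left out. The helper names and the `BLOCKED_TYPES` list are my own assumptions, so adjust them to whatever your target actually needs.

```python
import json

# CDP resource types a scraper usually does not need. This list is an
# assumption: it presumes the data you want lives in the HTML/JSON itself.
BLOCKED_TYPES = ("Image", "Media", "Stylesheet", "Font")


def fetch_enable_message(msg_id: int) -> str:
    """Build a Fetch.enable command that pauses only the blocked resource types."""
    return json.dumps({
        "id": msg_id,
        "method": "Fetch.enable",
        "params": {"patterns": [{"resourceType": t} for t in BLOCKED_TYPES]},
    })


def response_to_paused(event: dict, msg_id: int) -> str:
    """Answer a Fetch.requestPaused event: fail blocked types, continue the rest."""
    params = event["params"]
    if params.get("resourceType") in BLOCKED_TYPES:
        method, extra = "Fetch.failRequest", {"errorReason": "BlockedByClient"}
    else:
        # With the patterns above this branch should rarely trigger, but a
        # paused request must always be answered or the page will hang.
        method, extra = "Fetch.continueRequest", {}
    return json.dumps({
        "id": msg_id,
        "method": method,
        "params": {"requestId": params["requestId"], **extra},
    })
```

Higher-level libraries wrap the same idea in their own request interception hooks, so in practice you would rarely write these messages by hand.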

sangnoir|1 year ago

Speaking as a person who has played on both offense and defense: this is a heuristic that's not used frequently enough by defenders. Clients that load a single HTML/JSON endpoint without loading the CSS or image resources associated with it are likely bots (or user agents with a fully loaded cache, but defenders control what gets cached by legit clients and how). Bot data thriftiness is a huge signal.
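
To make the heuristic concrete, here is a minimal sketch of the defender side, under some loud assumptions: the input is a list of hypothetical `(client_id, path)` pairs pulled from access logs, assets are recognized purely by file suffix, and the `min_pages` threshold is arbitrary. A real deployment would also have to account for caching, as noted above.

```python
from collections import defaultdict

# Suffixes treated as page assets. This is an illustrative assumption; a real
# site would derive the list from its own templates or content types.
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".woff2")


def thrifty_clients(requests, min_pages=3):
    """Return sorted client ids that fetched pages but never any page asset.

    `requests` is an iterable of (client_id, path) pairs. `min_pages` is a
    hypothetical threshold to avoid flagging a client after a single hit.
    """
    pages = defaultdict(int)
    assets = defaultdict(int)
    for client, path in requests:
        # Ignore the query string when deciding whether the path is an asset.
        if path.lower().split("?", 1)[0].endswith(ASSET_SUFFIXES):
            assets[client] += 1
        else:
            pages[client] += 1
    return sorted(c for c, n in pages.items() if n >= min_pages and assets[c] == 0)


log = [
    ("bot", "/item/1"), ("bot", "/item/2"), ("bot", "/api/item/3.json"),
    ("human", "/item/1"), ("human", "/static/site.css"), ("human", "/img/logo.png?v=2"),
    ("human", "/item/2"), ("human", "/item/3"),
]
flagged = thrifty_clients(log)
```

Here `flagged` contains only `"bot"`, since that client requested three pages and no assets, while `"human"` loaded the stylesheet and logo alongside its page views.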

TekMol|1 year ago

So you maintain a table of domains and how to access them?

How do you build that table and keep it up to date? Manually?