stuffoverflow | 11 months ago

There definitely are tools for scraping basically any site by using the browser itself, which makes sure all dynamically loaded content gets intercepted correctly. Browsertrix[0] is probably the best-known and most complete scraper of that kind. They offer it as a paid service for convenient setup, but it's open source and can be self-hosted as well.

0: https://webrecorder.net/browsertrix/

weinzierl | 11 months ago

Interesting, I had never heard of them before. Pricing looks reasonable, except that the time limit is per month. A daily limit sounds much more practical. How do people use that in a useful way?

Does anyone have experience self-hosting this in the cloud? I'd worry about runaway traffic costs, but since ingress is cheap most of the time, maybe this is not a big problem?