top | item 43409439

(no title)

op7 | 11 months ago

This isn't 1998 anymore, so simply downloading the files from modern websites doesn't really work if you're trying to maintain your own private local / re-hosted copy of a site, especially one with dynamically loaded content. Some additional processing is needed to fix the files. I have never been able to find a modern scraping solution that works with most modern websites. I suppose the existence of this sort of tool is in conflict with Big Tech's interests, since it would make creating visually identical phishing sites that much easier.
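The "additional processing" mentioned above usually starts with rewriting absolute URLs in the saved HTML so links resolve against the local copy instead of the original host. A minimal sketch using only the Python standard library (the `example.com` origin and the page snippet are hypothetical; a real mirror would also need to handle `srcset`, CSS `url()` references, and JavaScript-generated links):

```python
from html.parser import HTMLParser


class LinkRewriter(HTMLParser):
    """Rewrite absolute href/src links pointing at a given origin into
    relative paths, so a saved page keeps working when re-hosted locally."""

    def __init__(self, origin):
        # convert_charrefs=False keeps entity references out of handle_data;
        # a production tool would also override handle_entityref/handle_charref.
        super().__init__(convert_charrefs=False)
        self.origin = origin.rstrip("/")
        self.out = []

    def _fix(self, value):
        # Strip the origin prefix, leaving a root-relative path.
        if value.startswith(self.origin):
            return value[len(self.origin):] or "/"
        return value

    def handle_starttag(self, tag, attrs):
        parts = []
        for name, value in attrs:
            if name in ("href", "src") and value is not None:
                value = self._fix(value)
            parts.append(f'{name}="{value}"' if value is not None else name)
        self.out.append(f"<{tag}" + "".join(" " + p for p in parts) + ">")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)


def rewrite(html, origin):
    parser = LinkRewriter(origin)
    parser.feed(html)
    return "".join(parser.out)


page = '<a href="https://example.com/about.html">About</a>'
print(rewrite(page, "https://example.com"))
# -> <a href="/about.html">About</a>
```

This only fixes static markup; content injected at runtime by JavaScript is exactly the part plain downloading misses, which is where the browser-driven crawlers discussed below come in.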

stuffoverflow | 11 months ago

There definitely are tools for scraping basically any site by driving the browser itself, which ensures all dynamically loaded content gets intercepted correctly. Browsertrix[0] is probably the best-known and most complete scraper of that kind. They offer it as a paid service for convenient setup, but it's open source and can be self-hosted as well.

0: https://webrecorder.net/browsertrix/

weinzierl | 11 months ago

Interesting, I had never heard of them before. Pricing looks reasonable except for the time limit being per month; a daily limit would be much more practical. How do people use that in a useful way?

Does anyone have experience self-hosting this in the cloud? I'd worry about runaway traffic costs, but since ingress is cheap most of the time, maybe that's not a big problem?