
FieryMechanic | 2 months ago

When I used to build these scrapers for people, I would usually pretend to be a browser. This normally meant changing the UA and making the headers look like a real browser's. Obviously more advanced bot-detection techniques would still catch this.
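A minimal sketch of that header-spoofing step, using only the standard library; the header values are illustrative copies of what a desktop Chrome might send, not current or guaranteed-correct ones:

```python
import urllib.request

# Headers copied from a typical desktop Chrome session (illustrative values).
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def browser_like_request(url):
    """Build a request that presents browser-like headers instead of
    the default Python-urllib UA."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)

req = browser_like_request("https://example.com")
```

Fingerprinting that looks at TLS handshakes, header ordering, or JavaScript execution will still distinguish this from a real browser, which is what pushes you to the headless-browser fallback.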

Failing that, I would use headless Chrome, PhantomJS, or similar to render the page in a real browser engine.
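One way to do the headless-Chrome fallback without any extra dependencies is Chrome's own `--headless --dump-dom` mode, which prints the rendered DOM to stdout. A hedged sketch (the binary names and URL are assumptions; your system's Chrome may be named differently):

```python
import shutil
import subprocess

def dump_dom_argv(binary, url):
    """Argv for a headless-Chrome DOM dump of `url`."""
    return [binary, "--headless", "--disable-gpu", "--dump-dom", url]

def fetch_rendered_html(url):
    """Return the JS-rendered HTML of `url`, or None if no
    Chrome-family binary is on PATH."""
    binary = next(
        (b for b in ("google-chrome", "chromium", "chromium-browser")
         if shutil.which(b)),
        None,
    )
    if binary is None:
        return None
    result = subprocess.run(
        dump_dom_argv(binary, url), capture_output=True, text=True
    )
    return result.stdout
```

For anything interactive (clicking, scrolling, waiting for XHRs) you would reach for a driver library such as Selenium or Playwright instead, but for "give me the page after JavaScript ran" the flag alone often suffices.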


conartist6 | 2 months ago

I guess my point is that since it's a subtle interference that leaves the explicitly requested code/content fully intact, you could just apply it as a blanket measure for all non-authenticated users. The real benefit is that you don't need to hide that you're doing it or why...

conartist6 | 2 months ago

You could add a feature kind of like "unlocked article sharing" where you can generate a token that lives in a cache. If I'm logged in and want to send you a link to a public page with the links displayed, I'd send you a sharing link that includes a token good for, say, 50 page views with full hyperlink rendering. After that it just degrades to a page without hyperlinks again, and you need someone with an account to generate you a new token (or to make an account yourself).
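The sharing-token scheme described above could be sketched roughly like this; every name here is illustrative, and a real version would put the counter in a shared cache like Redis rather than a process-local dict:

```python
import secrets

# token -> remaining full-hyperlink-rendering views (illustrative in-memory cache)
_token_views = {}

def mint_share_token(views=50):
    """Called by a logged-in user: mint a token worth `views` full renders."""
    token = secrets.token_urlsafe(16)
    _token_views[token] = views
    return token

def render_with_links(token):
    """Called per anonymous page view: True means render full hyperlinks,
    False means degrade to the link-free page."""
    remaining = _token_views.get(token, 0)
    if remaining <= 0:
        return False  # exhausted or unknown token: degrade
    _token_views[token] = remaining - 1
    return True
```

The sharing link itself would just carry the token as a query parameter; once the counter hits zero, the page silently falls back to the degraded rendering.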

Surely someone would write a scraper to get around this, but it couldn't be a completely plain HTTPS scraper, which in theory should help a lot.