bdhcuidbebe | 8 days ago

They are able to scrape paywalled sites at random, so I'm guessing a residential botnet is used.

khannn|7 days ago

It's funny that residential VPN botnets aren't uncommon now. "Free VPN" if you allow your computer/phone to be an exit point.

pingou|8 days ago

But how do they bypass the paywall? They can't just pretend to be Google by changing the user-agent. That wouldn't work all the time, as some websites also check IPs, and others don't even show the full content to Google.
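To make the "some websites also check IPs" point concrete, here is a minimal sketch of a forward-confirmed reverse DNS check, the kind of test Google documents for verifying its crawlers. The function name and the injectable `reverse_lookup` / `forward_lookup` parameters are my own assumptions for illustration, not any particular publisher's code, and the default lookups handle IPv4 only.

```python
import socket

# Hostname suffixes Google documents for its crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True only if `ip` passes a forward-confirmed reverse DNS check.

    The lookup callables are hypothetical hooks so the logic can be
    exercised without network access; by default the socket module is used.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        # Step 1: the IP must reverse-resolve to a Google-owned hostname.
        host = reverse_lookup(ip)
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # Step 2: that hostname must resolve back to the same IP.
        return ip in forward_lookup(host)
    except OSError:
        # Lookup failures (socket.herror, socket.gaierror) count as unverified.
        return False
```

A site doing something like this ignores the User-Agent string entirely, which is why spoofing it alone would not be enough.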

They also cannot hijack data with a residential botnet or buy subscriptions themselves. Otherwise, the saved page would contain information about the logged-in user. It would be hard to remove this information, as the code changes all the time, and it would be easy for the website owner to add an invisible element that identifies the user. I suppose they could have different subscriptions and remove everything that isn't identical between the two, but that wouldn't be foolproof.
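The "remove everything that isn't identical between the two" idea could look roughly like the sketch below. It is purely illustrative: `common_lines` is a hypothetical helper, it compares line by line, and nothing here is known to be what archive.today actually does.

```python
from difflib import SequenceMatcher


def common_lines(snapshot_a, snapshot_b):
    """Keep only the lines that appear, in order, in both snapshots."""
    a = snapshot_a.splitlines()
    b = snapshot_b.splitlines()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    kept = []
    # Each matching block is a run of lines identical in both snapshots;
    # the final block is a zero-length sentinel and adds nothing.
    for block in matcher.get_matching_blocks():
        kept.extend(a[block.a:block.a + block.size])
    return "\n".join(kept)


page_a = "<h1>Title</h1>\n<p>Welcome, alice</p>\n<p>Body text</p>"
page_b = "<h1>Title</h1>\n<p>Welcome, bob</p>\n<p>Body text</p>"
cleaned = common_lines(page_a, page_b)
```

This also shows why it wouldn't be foolproof: if the site embeds a per-user marker on the same line as real content, that content is dropped along with the marker.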

wbmva|8 days ago

On the network layer, I don't know. But on the WWW layer, archive.today operates accounts that are used to log into websites when they are snapshotted. IIRC, archive.today manipulates the snapshots to hide the fact that someone is logged in, but sometimes fails miserably:

https://megalodon.jp/2026-0221-0304-51/https://d914s229qk4kj...

https://archive.is/Y7z4E

The second shows volth's GitHub notifications. Volth was a major nixpkgs contributor, but his GitHub account disappeared.

https://github.com/orgs/community/discussions/58164

seanhly|8 days ago

There are some pretty robust browser addons for bypassing article paywalls, notably https://gitflic.ru/project/magnolia1234/bypass-paywalls-fire...

This particular addon is blocked on most western git servers, but can still be installed from Russian git servers. It includes custom paywall-bypassing code for pretty much every news website you could reasonably imagine, or at least those sites that use conditional paywalls (paywalls for humans, no paywalls for big search engines). It won't work on sites like Substack that use proper authenticated content pages, but these sorts of pages don't get picked up by archive.today either.

My guess would be that archive.today loads such an addon with its headless browser and thus bypasses paywalls that way. Even if publishers find a way to detect headless browsers, crawlers can also be written to operate with traditional web browsers where lots of anti-paywall addons can be installed.

rkagerer|8 days ago

I thought saved pages sometimes do contain users' IPs?

https://www.reddit.com/r/Advice/comments/5rbla4/comment/dd5x...

The way I (loosely) understand it, when you archive a page they send your IP in the X-Forwarded-For header. Some paywall operators render that into the page content served up, which then causes it to be visible to anyone who clicks your archived link and Views Source.
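As a rough sketch of how that leak could happen, assume a publisher-side template that echoes the forwarded address into the served HTML. `render_paywall_page` and the comment format are hypothetical; the only fact relied on is that X-Forwarded-For is a comma-separated list whose leftmost entry is the original client.

```python
from html import escape


def render_paywall_page(headers, article_html):
    """Hypothetical publisher template that embeds the forwarded client IP."""
    # The leftmost X-Forwarded-For entry is the original client address.
    client_ip = headers.get("X-Forwarded-For", "").split(",")[0].strip()
    # Anything written into the response ends up in the archived snapshot.
    return f"<!-- served to {escape(client_ip)} -->\n{article_html}"


# The archiver forwards the submitting user's IP (documentation-range address).
snapshot = render_paywall_page(
    {"X-Forwarded-For": "198.51.100.7, 10.0.0.1"},
    "<p>Article</p>",
)
```

Anyone viewing the source of the stored `snapshot` would then see the submitter's address, even though the visible article looks unchanged.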

bdhcuidbebe|8 days ago

> But how do they bypass the paywall?

I'm guessing by using a residential botnet and automating the browsers of unknowing "victims" to reuse their existing credentials.

> Otherwise, the saved page would contain information about the logged-in user.

If you read this article, there's plenty of evidence they are manipulating the scraped data.

But I’m just speculating here…