But how do they bypass the paywall? They can't just pretend to be Google by changing the user-agent, this wouldn't work all the time, as some websites also check IPs, and others don't even show the full content to Google.
They also cannot hijack data with a residential botnet or buy subscriptions themselves. Otherwise, the saved page would contain information about the logged-in user. It would be hard to remove this information, as the code changes all the time, and it would be easy for the website owner to add an invisible element that identifies the user. I suppose they could have different subscriptions and remove everything that isn't identical between the two, but that wouldn't be foolproof.
On the network layer, I don't know. But on the WWW layer, archive.today operates accounts that are used to log into websites when they are snapshotted. IIRC, the archive.today manipulates the snapshots to hide the fact that someone is logged in, but sometimes fails miserably:
This particular addon is blocked on most western git servers, but can still be installed from Russian git servers. It includes custom paywall-bypassing code for pretty much every news websites you could reasonably imagine, or at least those sites that use conditional paywalls (paywalls for humans, no paywalls for big search engines). It won't work on sites like Substack that use proper authenticated content pages, but these sorts of pages don't get picked up by archive.today either.
My guess would be that archive.today loads such an addon with its headless browser and thus bypasses paywalls that way. Even if publishers find a way to detect headless browsers, crawlers can also be written to operate with traditional web browsers where lots of anti-paywall addons can be installed.
The way I (loosely) understand it, when you archive a page they send your IP in the X-Forwarded-For header. Some paywall operators render that into the page content served up, which then causes it to be visible to anyone who clicks your archived link and Views Source.
khannn|7 days ago
pingou|8 days ago
They also cannot hijack data with a residential botnet or buy subscriptions themselves. Otherwise, the saved page would contain information about the logged-in user. It would be hard to remove this information, as the code changes all the time, and it would be easy for the website owner to add an invisible element that identifies the user. I suppose they could have different subscriptions and remove everything that isn't identical between the two, but that wouldn't be foolproof.
wbmva|8 days ago
https://megalodon.jp/2026-0221-0304-51/https://d914s229qk4kj...
https://archive.is/Y7z4E
The second shows volth's Github notifications. Volth was a major nix-pkgs contributor, but his Github account disappeared.
https://github.com/orgs/community/discussions/58164
seanhly|8 days ago
This particular addon is blocked on most western git servers, but can still be installed from Russian git servers. It includes custom paywall-bypassing code for pretty much every news websites you could reasonably imagine, or at least those sites that use conditional paywalls (paywalls for humans, no paywalls for big search engines). It won't work on sites like Substack that use proper authenticated content pages, but these sorts of pages don't get picked up by archive.today either.
My guess would be that archive.today loads such an addon with its headless browser and thus bypasses paywalls that way. Even if publishers find a way to detect headless browsers, crawlers can also be written to operate with traditional web browsers where lots of anti-paywall addons can be installed.
rkagerer|8 days ago
https://www.reddit.com/r/Advice/comments/5rbla4/comment/dd5x...
The way I (loosely) understand it, when you archive a page they send your IP in the X-Forwarded-For header. Some paywall operators render that into the page content served up, which then causes it to be visible to anyone who clicks your archived link and Views Source.
bdhcuidbebe|8 days ago
I’m guessing by using a residential botnet and using existing credentials by unknowingly ”victims” by automating their browsers.
> Otherwise, the saved page would contain information about the logged-in user.
If you read this article, theres plenty of evidence they are manipulating the scraped data.
But I’m just speculating here…