Ask HN: Weird archive.today behavior?

140 points| rabinovich | 1 month ago

archive.today's CAPTCHA page has recently (I noticed this about 3 days ago) started automatically making requests to someone's personal blog. Here's a screenshot of what I'm talking about: https://files.catbox.moe/20jsle.png

The relevant JS is:

   setInterval(function() {
     fetch("https://gyrovague.com/?s=" + Math.round(new Date().getTime() % 10000000), {
       referrerPolicy: "no-referrer",
       mode: "no-cors"
     });
   }, 300);
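For a sense of scale, here is a rough sketch of what that loop amounts to per open CAPTCHA tab. The helper names (`searchTermAt`, `requestsPer`) are made up for illustration, and the numbers assume the timer really fires every 300 ms, which browsers don't guarantee (background tabs get throttled).

```javascript
// Hypothetical helpers; only the 300 ms interval and the modulus come from the snippet above.
const INTERVAL_MS = 300;
const MODULUS = 10000000;

// Reproduces the value appended to "?s=" for a given millisecond timestamp.
// The operand is already an integer, so Math.round changes nothing.
function searchTermAt(timestampMs) {
  return Math.round(timestampMs % MODULUS);
}

// Upper bound on requests issued by one tab over a period, assuming no timer throttling.
function requestsPer(periodMs) {
  return Math.floor(periodMs / INTERVAL_MS);
}

console.log(requestsPer(60 * 1000));       // requests per minute
console.log(requestsPer(60 * 60 * 1000));  // requests per hour
console.log(searchTermAt(1700000123456));  // a sample "?s=" value
console.log(MODULUS / 1000);               // seconds before the values start repeating
```

Under those assumptions that is about 200 requests a minute per tab, with consecutive search terms about 300 apart and no repeats for 10,000 seconds (a little under three hours).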
Looking at this blog, there seems to be exactly one article mentioning archive.today - "archive.today: On the trail of the mysterious guerrilla archivist of the Internet" (https://gyrovague.com/2023/08/05/archive-today-on-the-trail-...), where the person running the blog digs up some information about archive's owner.

So perhaps this is some kind of revenge: a DoS attempt, or deliberately wasting their bandwidth, in response to this article? Maybe an attempt to silence them and force them to delete their article? But if it is, then I have so many questions. Like, why would the owner of the archive do that 2.5 years after the article was published? Or why would they even do that in the first place? Do they not know about the Streisand effect?

I'm confused.

69 comments

mastermedo|1 month ago

What my pattern-matching eyes immediately spotted is that the hn username that posted this is rabinovich. The linked article speaks about Masha Rabinovich. Maybe a coincidence.

> in a 2012 F-Secure forum post, a “masharabinovich” complains about “my website http://archive.is/” being blacklisted. They pop up on Wikipedia as well getting told off for adding too many links to archive.is, including a mention that they’re using the Czech ISP fiber.cz

gghffguhvc|1 month ago

Wild idea: Could be a symbolic dead man switch.

There were reports of the FBI going hard after archive.today around the time the HN account was set up and posted an archive.today competitor. Then come the pings on the investigative article, then a post to HN saying “3 days ago”, which could indicate when the FBI succeeded.

The only comment by the poster on this article is a sharp clarification of what doxxing is and isn’t.

Perhaps this is just an unusual way of slowly stepping out from behind the curtain on your own quirky terms after a fantastically long tenure.

dunder_cat|1 month ago

Hmm. If it is an attempt at DDoS attacks, it's probably not very fruitful:

  >$ resolvectl query gyrovague.com

  gyrovague.com: 192.0.78.25                     -- link: eno1
                 192.0.78.24                     -- link: eno1
Viewing the first IP address on https://bgp.he.net/ip/192.0.78.25 shows AS2635 (https://bgp.he.net/AS2635) is announcing 192.0.78.0/24. AS2635 is owned by https://automattic.com aka wordpress.com. I assume that for a managed environment at their scale, this is just another Wednesday for them.
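If you want to double-check that both resolved addresses really fall inside the announced prefix, a few lines will do it. This is just a sketch; `ipv4ToInt` and `inCidr` are names I made up, and it only handles IPv4.

```javascript
// Convert dotted-quad IPv4 text to an unsigned 32-bit integer.
function ipv4ToInt(ip) {
  const parts = ip.split(".");
  const valid =
    parts.length === 4 &&
    parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
  if (!valid) {
    throw new Error("not an IPv4 address: " + ip);
  }
  const [a, b, c, d] = parts.map(Number);
  // ">>> 0" turns the signed 32-bit result back into an unsigned value.
  return ((a << 24) | (b << 16) | (c << 8) | d) >>> 0;
}

// True if the address lies inside the prefix, e.g. "192.0.78.0/24".
function inCidr(ip, cidr) {
  const [base, bitsText] = cidr.split("/");
  const bits = Number(bitsText);
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) {
    throw new Error("bad prefix length: " + cidr);
  }
  // A shift by 32 is a no-op in JavaScript, so /0 needs its own case.
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

console.log(inCidr("192.0.78.25", "192.0.78.0/24"));
console.log(inCidr("192.0.78.24", "192.0.78.0/24"));
```

Both calls should print `true` for the two addresses in the resolvectl output above.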

arcfour|1 month ago

I believe they're probably trying to get the blog suspended (automatically?) hence the cache busting; chewing through higher than normal resources all of a sudden might do the trick even if it doesn't actually take it offline.

mike_d|1 month ago

It is using the ?s= parameter which causes WordPress to initiate a search for a random string. This can result in high CPU usage, which I believe is one of the DoS vectors that works on hosted WordPress.

dunder_cat|1 month ago

It occurred to me while reading the article that I could also just have checked the TLS cert. The cert I was given presents "Common Name tls.automattic.com". However, maybe someone will discover bgp.he.net via this :-)

fhub|1 month ago

This feels like the start of a treasure-hunt-like game, between the username rabinovich (as others have pointed out) and the prior submission by rabinovich of an archive.today-like tool 3 months ago: https://ghostarchive.org/. When you click into the search query examples for ghostarchive, such as https://ghostarchive.org/search?term=https://docs.google.com, many of the documents are very weird indeed.

jijijijij|1 month ago

> This feels like the start of a treasure-hunt-like game, between the username rabinovich (as others have pointed out) and the prior submission by rabinovich of an archive.today-like tool 3 months ago: https://ghostarchive.org/. When you click into the search query examples for ghostarchive, such as https://ghostarchive.org/search?term=https://docs.google.com, many of the documents are very weird indeed.

This is what someone trying to start a treasure-hunt-like game would say....

Mom! Am I an NPC? Mom! Am I real???

eli|1 month ago

Well that is a very silly way to punish the author of an article you don’t want people to know about.

blorg|1 month ago

I never would have read the article had archive.today not gone into a CAPTCHA loop on me and then I see in developer tools it's pinging this other site. Talk about Streisand effect.

rafram|1 month ago

Remember when Archive.is/today used to send Cloudflare DNS users into an endless captcha loop because the creator had some kind of philosophical disagreement with Cloudflare? Not the first time they’ve done something petty like this.

stavros|1 month ago

It wasn't a philosophical disagreement, they needed some geo info from the DNS server to route requests so they could prevent spam and Cloudflare wasn't providing it citing privacy reasons. The admin decided to block Cloudflare rather than deal with the spam.

AndroTux|1 month ago

That's still a thing. Happens to me as we speak.

1vuio0pswjnm7|1 month ago

Irony:

The author of the personal blog post claimed he works for Google, who has arguably the world's most complete web archive and uses it for commercial purposes

This archive used to be publicly accessible, at least in part, at webcache.googleusercontent.com^1

The blog post compares the size of archive.today with archive.org (about 1:40, according to the author)

But it does not include a comparison to webcache.googleusercontent.com

1. Bing, another Google competitor, also offered part of their own archive at cc.bingj.com during that time

aendruk|1 month ago

OP frames this like they just stumbled across the blog post but they created an account matching the name discussed within it three months ago?

I’m confused.

333c|1 month ago

Sometimes HN admins revive quality posts that didn't get much traction when they were first posted. When this happens, the timestamps are updated to make the post look new.

I can't say for sure whether this is what happened here, but it is a possible explanation.

gyrovague|1 month ago

Gyrovague here, author of the targeted blog post:

https://gyrovague.com/2023/08/05/archive-today-on-the-trail-...

In the past week or so, I have received a GDPR takedown attempt of the archive.today blog post (which my hosting provider rightly rejected), a politely worded request to take it down (which was sadly eaten by my spam filter), and now this (thanks to the HN reader who tipped me off).

Given that the proverbial cat has been out of the bag for 2.5 years at this point, I'm genuinely puzzled as to what they're hoping to achieve, but this does not seem like a very good way of going about it.

opengrass|1 month ago

Sockpuppet/troll unless you link the HN thread in the blog. rabinovich OP while the article talks about "Masha Rabinovich." I suspect it's all a ruse for the FBI.

g-b-r|1 month ago

Great article, is the attack affecting you in any way?

Do you know when it began?

And what do you think of the account reporting this being named rabinovich, and having been created months ago?

notmysql_|1 month ago

What did the politely worded request say, was it from the creator?

sbdaman|1 month ago

Given it's set to generate random pages on the site, is there even any possible explanation for this that isn't sketchy?

mediumdeviation|1 month ago

It's not random, setting the query string to a new value on every fetch is a cache busting technique - it's trying to prevent the browser from caching the page, presumably to increase bandwidth usage.
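To illustrate the cache-busting point, here's a toy simulation. `UrlCache` and `simulate` are hypothetical names, example.com is a placeholder, and the cache is a big simplification: real browser and CDN caches have many more rules, but they generally key on the full URL including the query string, which is all this sketch assumes.

```javascript
// Toy cache keyed by the full URL, standing in for a browser or CDN cache.
class UrlCache {
  constructor() {
    this.store = new Map();
    this.hits = 0;
    this.misses = 0;
  }

  // Return the cached body if present; otherwise ask the origin and remember the answer.
  get(url, origin) {
    if (this.store.has(url)) {
      this.hits++;
      return this.store.get(url);
    }
    this.misses++;
    const body = origin(url);
    this.store.set(url, body);
    return body;
  }
}

// Issue `count` requests through a fresh cache and count how many reach the origin.
function simulate(makeUrl, count) {
  let originCalls = 0;
  const cache = new UrlCache();
  for (let i = 0; i < count; i++) {
    cache.get(makeUrl(i), () => {
      originCalls++;
      return "page";
    });
  }
  return { originCalls, hits: cache.hits, misses: cache.misses };
}

// Same URL every time: only the first request reaches the origin.
const fixed = simulate(() => "https://example.com/", 100);
// A new "?s=" value every time: every request reaches the origin.
const busted = simulate((i) => "https://example.com/?s=" + i * 300, 100);

console.log(fixed);
console.log(busted);
```

With a fixed URL, 99 of the 100 requests are served from the cache. With a changing query string, none are, so the origin does the work every time.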

internetter|1 month ago

There's really no interpretation of this which isn't malicious, although, not to defend this behaviour whatsoever, I'm not entirely surprised by it. The only real value of archive.is is its paywall bypassing abilities and, presumably, large swaths of residential proxies that allow it to archive sites that archive.org can't. Only somebody with some degree of lawlessness would operate such a project.

jijijijij|1 month ago

Not excusing this malicious behavior, but I have to say, the mentioned blog post is a major dick move, too. Got quite the impression of a passive aggressive undertone, and there is clearly bittersweet irony in collecting and "archiving" an archiver's personal information from long ago traces. Maybe it's all some feud between two dicks, some backstory untold. Maybe the blog author wanted some information gone from archive.today, but was denied.

Brybry|1 month ago

It's not just for paywall bypassing. Sometimes there are archive.today snapshots that aren't in the Wayback Machine (though I think your overall point about lawlessness still stands).

For example, there was some NASA debris that hit a guy's house in Florida and it was in the news. [1] Some news sites linked to a Twitter post he made with the images but he later deleted the post. [2]

The Wayback Machine has a ton of snapshots of the Twitter post but none of them render for me. [3]

But archive.today's snapshot works great. [4]

[1] https://www.bbc.com/news/articles/c9www02e49zo

[2] https://xcancel.com/Alejandro0tero/status/176872903149342722...

[3] https://web.archive.org/web/20240715000000*/https://twitter....

[4] https://archive.md/obuWr

ycombinator_acc|1 month ago

What's the alternative? At least they don't comply with takedown requests, which can't be said about archive.org who remove everything even semi-controversial.

mediumdeviation|1 month ago

Pretty sure that blog is hosted on Wordpress.com infrastructure so it's not like the blog owner would even notice unless it generates so much traffic that WP itself notices.

That said, I don't think there are many non-malicious explanations for this. I would suggest writing to HN (hn@ycombinator.com) to see about blocking submissions from the domain.

nativeit|1 month ago

I just tried in my browser (Firefox on Ubuntu) and got the same result. Deeply curious.

russian_archive|1 month ago

While many people here on HN seem to be pro archive.today, please remember that it's a website managed by pro-Kremlin people who, among other things, selectively choose which content to erase and track visitors and archivers in a few sneaky ways (look at the HTTP / DNS requests when you visit / archive pages).

One has to wonder why all this tracking from administrator(s) that want to stay anonymous?

You can't trust anything hosted on archive.today because you can't trust that the content hasn't been altered in some way in the pursuit of their agenda.

ventegus|1 month ago

Hm, a pro-Kremlin website, banned by the Russian state firewall while actively used by Myrotvorets and many gov.ua sites....

self_awareness|1 month ago

And that's how advertising works, folks. If someone wants a website dead, I want to know more about it.

Barbing|1 month ago

Worth blocking the URL for users of that Archive site then, avoid extra burden?

aendruk|1 month ago

How would you determine who is a user of the archive site?

ventegus|1 month ago

They might need to tweak a single word. Streisand readers won’t have a clue which.

Save the page now and compare a week later.