Is it possible to write a program that connects to an HTTPS server, archives content, and keeps track of the session keys and the encrypted data coming from the server, then records all that session traffic in a file? Replaying the file would allow anyone to observe that the data truly did come from that specific server, because it's signed with the server's certificate.
In other words, is it possible to get any HTTPS website to give you what is essentially a digitally signed copy of content you want to prove originated with that site? And is it true that that digital signature is easily verified to belong to the original website?
Unfortunately it's not possible, because TLS negotiates a symmetric key which is then used to encrypt and authenticate the rest of the session. If you post the transcript of a TLS session in an attempt to "prove" that you retrieved a specific document, a third party can verify that you did in fact negotiate a symmetric key with the correct server; but since it's a symmetric key, anyone with knowledge of the key can arbitrarily modify the transcript of the session [well, the part of the session where the HTTP request and response happen]. This obviously includes the original prover, and so a TLS transcript proves nothing at all.
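To make the symmetric-key objection concrete, here's a minimal Python sketch using the stdlib `hmac` module with a made-up key. TLS actually protects records with AEAD ciphers rather than bare HMAC, but the property is the same: anyone who knows the symmetric session key can produce a record that authenticates correctly, so a transcript can't pin the content to the server.

```python
import hmac
import hashlib

# Hypothetical symmetric key, standing in for the key negotiated in the handshake.
# Both endpoints (and anyone they share it with) know this key.
key = b"negotiated-session-key"

original = b"HTTP/1.1 200 OK\r\n\r\nwhat the server really sent"
forged = b"HTTP/1.1 200 OK\r\n\r\nwhatever the prover wants to claim"

# The "prover" can compute a valid authentication tag for either message.
tag_original = hmac.new(key, original, hashlib.sha256).digest()
tag_forged = hmac.new(key, forged, hashlib.sha256).digest()

# Both tags verify under the same key, so a third party inspecting the
# transcript cannot tell which record the server actually produced.
assert hmac.compare_digest(
    tag_original, hmac.new(key, original, hashlib.sha256).digest())
assert hmac.compare_digest(
    tag_forged, hmac.new(key, forged, hashlib.sha256).digest())
```

Contrast this with a digital signature, where only the holder of the private key can produce a valid tag; that asymmetry is exactly what a TLS transcript lacks for the application data.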
A further problem: by the time you try to verify the data down the line, the original certificate (up to and including the root cert, or even the authority itself) will have expired, so it won't be possible to trust it.
Volunteers run Docker containers or VirtualBox VMs at home so that the traffic looks residential and doesn't get banned. For example: https://imgur.com/a/QXrhudA
Most useful content gets packaged by ArchiveTeam and sent to the Internet Archive (no affiliation between the two).
I'm also not totally familiar with what's going on, but I discovered it on /r/datahoarder, and I think it's because redditors are scared that content will start vanishing now that Reddit has filed for an IPO.
ArchiveTeam usually sends all the data they collect to the Internet Archive. The two are independent, unrelated organizations, but IA is pretty open about accepting hoards of this type.
A 2018-era article suggests the Internet Archive held 46PB of content. It’s probably much more now.
This warrior has extracted ~880TB so far. I wouldn’t be surprised if the result occupies a material proportion of the IA’s capacity, at significant cost.
Still, better than letting it all get burnt by the shareholders a few years down the line.
>Still, better than letting it all get burnt by the shareholders
You wish. By the looks of it, it will be burned down by yet another "redesign" when they inevitably shut down the sane UI of https://old.reddit.com because it's not pushing ads and "social" "features" strongly enough.
(If you don't use https://old.reddit.com/ instead of the "new", aka default, reddit, treat yourself to some sanity. Imagine if Reddit was more like HackerNews. Wait, you don't need to imagine, that's what the link goes to.)
I really like how they are pushing so hard for video, while struggling to display more than a thousand text comments or even links.
Feels like Reddit is digging its own grave with these moves, and the shareholders' money with it.
I really hope to revisit this comment in 10 years and say how utterly stupid and wrong I was, because I've got an incredible amount out of the communities on Reddit, and poured a lot into them too. Particularly support groups.
I wonder how much google stores. My dumb butt has close to 100TB of…uhh…Linux isos encrypted and hooked up to plex for $20 a month. Seems like a loss leader for them. But I always think of the multiple spinning discs I’m taking over there for content and residency since it can’t be deduped
This is so very necessary. Reddit has banned great content from many subs whose ideology they didn’t agree with. Unfortunately it isn’t even possible to know the URLs of posts from banned subreddits to look them up in the first place.
Does anyone know if there are backups of banned subreddits already?
In some cases it is positively ridiculous. I got banned from /r/coronavirus for posting a scientific article suggesting breakthrough cases in fully vaccinated people were a possibility (this was fairly early, Mar 2021 maybe). The mod denounced me as an antivaxxer, which I certainly am not. Lo and behold, breakthrough cases are a real thing.
Where does ArchiveTeam find all the reddit posts and comments to archive? Do they have a script automatically going through the "New" section or are they finding posts through Google or link crawling?
In general, ArchiveTeam has scripts which hit random links to see if there is any content. They have coordination servers which share info on which slugs have been checked before to avoid duplicate effort.
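This isn't ArchiveTeam's actual protocol, but the coordination idea — probe random slugs while a shared server records what's already been claimed — can be sketched in a few lines of Python (the slug format and the coordinator state are made up for illustration):

```python
import random
import string

# Hypothetical coordinator state: slugs already claimed by some worker.
# In a real deployment this would live on a shared tracker server.
claimed = set()


def random_slug(length=6):
    """Generate a random candidate post slug to probe."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


def claim_next():
    """Return an unclaimed slug, recording it so no other worker repeats it."""
    while True:
        slug = random_slug()
        if slug not in claimed:
            claimed.add(slug)
            return slug


# Each worker asks the coordinator for work; no slug is handed out twice.
batch = [claim_next() for _ in range(5)]
assert len(set(batch)) == 5  # no duplicate effort
```

A worker would then fetch each claimed slug, report back whether any content was found, and upload whatever it retrieved.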
There are several sites that let you view deleted reddit comments. I wonder if they have complete text backups, or if they get the comments from somewhere else.
What a waste of the Internet Archive's space. Just because something exists doesn't mean it should be backed up; I would be surprised if anyone actually needed something from Usenet, for example. Things like this are going to kill the Archive eventually.
> I would be surprised if anyone actually needed something from Usenet for example
Totally disagree. Usenet is the only place recording the history of a huge number of influential projects from the 80s and 90s. That history deserves to be recorded.
Some of the most valuable and insightful anthropological artifacts are merely shop ledgers and discourse on bathroom walls -- and we've never been better equipped to document/store/search/access the entirety of the saved artifacts from the modern age -- and presumably our mastery over information technology as a domain will only improve and make things easier.
The 20 Newsgroups dataset is a collection of ~20k newsgroup documents and is super popular for experimentation in text applications of various machine learning techniques. Without Usenet (and archives of Usenet) that probably wouldn't exist.
You never know what will be useful to the future. To give an example, the field of papyrology is largely built around trying to construct a view of the past using scraps of texts excavated from ancient dumps.
This is why individual people should never be left to make such decisions alone. They're likely to throw away things of enormous value for parochial reasons.
It's a pretty high bar to say that something should not be archived and it's a waste.
I am not even going to state my opinion on the issue now, I am just disappointed about the level of discourse comments like this create. No justification, no logic.
TL;DR is that you don't like reddit. This is not useful to anyone.
It was posted here a while ago: https://news.ycombinator.com/item?id=29090604
Anyway, here's a further description: http://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior
And reddit is very liberal with their banhammers.
/r/DataHoarder disagrees. It's pretty crazy what people will back up these days.