Is it possible to write a program that connects to an HTTPS server, archives content, and keeps track of the session keys and the encrypted data coming from the server, then records all that session traffic in a file? Replaying the file would allow anyone to observe that the data truly did come from that specific server, because it's signed with the server's certificate.
In other words, is it possible to get any HTTPS website to give you what is essentially a digitally signed copy of content you want to prove originated with that site? And is it true that that digital signature is easily verified to belong to the original website?
Unfortunately it's not possible, because TLS negotiates a symmetric key which is then used to encrypt and authenticate the rest of the session. If you post the transcript of a TLS session in an attempt to "prove" that you retrieved a specific document, a third party can verify that you did in fact negotiate a symmetric key with the correct server; but since it's a symmetric key, anyone with knowledge of the key can arbitrarily modify the transcript of the session [well, the part of the session where the HTTP request and response happen]. This obviously includes the original prover, and so a TLS transcript proves nothing at all.
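To make the symmetric-key objection concrete, here's a minimal Python sketch using the stdlib `hmac` module with a made-up key. TLS actually protects records with AEAD ciphers rather than bare HMAC, but the property is the same: anyone who knows the symmetric session key can produce a record that authenticates correctly, so a transcript can't pin the content to the server.

```python
import hmac
import hashlib

# Hypothetical symmetric key, standing in for the key negotiated in the handshake.
# Both endpoints (and anyone they share it with) know this key.
key = b"negotiated-session-key"

original = b"HTTP/1.1 200 OK\r\n\r\nwhat the server really sent"
forged = b"HTTP/1.1 200 OK\r\n\r\nwhatever the prover wants to claim"

# The "prover" can compute a valid authentication tag for either message.
tag_original = hmac.new(key, original, hashlib.sha256).digest()
tag_forged = hmac.new(key, forged, hashlib.sha256).digest()

# Both tags verify under the same key, so a third party inspecting the
# transcript cannot tell which record the server actually produced.
assert hmac.compare_digest(
    tag_original, hmac.new(key, original, hashlib.sha256).digest())
assert hmac.compare_digest(
    tag_forged, hmac.new(key, forged, hashlib.sha256).digest())
```

Contrast this with a digital signature, where only the holder of the private key can produce a valid tag; that asymmetry is exactly what a TLS transcript lacks for the application data.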
A further problem: by the time you try to verify the data down the line, the original certificate (up to and including the root cert, or even the authority itself) will have expired, so it won't be possible to trust it.
Volunteers run Docker containers or VirtualBox VMs at home so that the traffic looks residential and doesn't get banned. For example: https://imgur.com/a/QXrhudA
Most useful content gets packaged by ArchiveTeam and sent to the Internet Archive (no affiliation between the two).
I'm also not totally familiar with what's going on, but I discovered it on /r/datahoarder, and I think it's because redditors are scared that content will start vanishing now that Reddit has filed for an IPO.
ArchiveTeam usually sends all the data they collect to the Internet Archive. The two are independent, unrelated organizations, but IA is pretty open about accepting hoards of this type.
A 2018-era article suggests the Internet Archive held 46PB of content. It’s probably much more now.
This warrior has extracted ~880TB so far. I wouldn’t be surprised if the result occupies a material proportion of the IA’s capacity, at significant cost.
Still, better than letting it all get burnt by the shareholders a few years down the line.
>Still, better than letting it all get burnt by the shareholders
You wish. By the looks of it, it will be burned down by yet another "redesign" when they inevitably shut down the sane UI of https://old.reddit.com because it's not pushing ads and "social" "features" strongly enough.
(If you don't use https://old.reddit.com/ instead of the "new", aka default, reddit, treat yourself to some sanity. Imagine if Reddit was more like HackerNews. Wait, you don't need to imagine, that's what the link goes to.)
I really like how they are pushing so hard for video, while struggling to display more than a thousand text comments or even links.
Feels like Reddit is digging its own grave with these moves, and the shareholders' money with it.
I really hope to revisit this comment in 10 years and say how utterly stupid and wrong I was, because I've got an incredible amount out of the communities on Reddit, and poured a lot into them too. Particularly support groups.
I wonder how much google stores. My dumb butt has close to 100TB of…uhh…Linux isos encrypted and hooked up to plex for $20 a month. Seems like a loss leader for them. But I always think of the multiple spinning discs I’m taking over there for content and residency since it can’t be deduped
This is so very necessary. Reddit has banned great content from many subs whose ideology they didn’t agree with. Unfortunately it isn’t even possible to know the URLs of posts from banned subreddits to look them up in the first place.
Does anyone know if there are backups of banned subreddits already?
In some cases it is positively ridiculous. I got banned from /r/coronavirus for posting a scientific article suggesting breakthrough cases in fully vaccinated people were a possibility (this was fairly early, Mar 2021 maybe). The mod denounced me as an antivaxxer, which I certainly am not. Lo and behold, breakthrough cases are a real thing.
Where does ArchiveTeam find all the reddit posts and comments to archive? Do they have a script automatically going through the "New" section or are they finding posts through Google or link crawling?
In general, ArchiveTeam has scripts which hit random links to see if there is any content. They have coordination servers which share info on which slugs have been checked before to avoid duplicate effort.
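This isn't ArchiveTeam's actual protocol, but the coordination idea — probe random slugs while a shared server records what's already been claimed — can be sketched in a few lines of Python (the slug format and the coordinator state are made up for illustration):

```python
import random
import string

# Hypothetical coordinator state: slugs already claimed by some worker.
# In a real deployment this would live on a shared tracker server.
claimed = set()


def random_slug(length=6):
    """Generate a random candidate post slug to probe."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


def claim_next():
    """Return an unclaimed slug, recording it so no other worker repeats it."""
    while True:
        slug = random_slug()
        if slug not in claimed:
            claimed.add(slug)
            return slug


# Each worker asks the coordinator for work; no slug is handed out twice.
batch = [claim_next() for _ in range(5)]
assert len(set(batch)) == 5  # no duplicate effort
```

A worker would then fetch each claimed slug, report back whether any content was found, and upload whatever it retrieved.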
There are several sites that let you view deleted reddit comments. I wonder if they have complete text backups, or if they get the comments from somewhere else.
What a waste of the Internet Archive's space. Just because something exists doesn't mean it should be backed up; I would be surprised if anyone actually needed something from Usenet, for example. Things like this are going to kill the Archive eventually.
> I would be surprised if anyone actually needed something from Usenet for example
Totally disagree. Usenet is the only place recording the history of a huge number of influential projects from the 80s and 90s. That history deserves to be recorded.
Some of the most valuable and insightful anthropological artifacts are merely shop ledgers and discourse on bathroom walls -- and we've never been better equipped to document/store/search/access the entirety of the saved artifacts from the modern age -- and presumably our mastery over information technology as a domain will only improve and make things easier.
The 20 Newsgroups dataset is a collection of ~20k newsgroup documents and is super popular for experimentation in text applications of various machine learning techniques. Without Usenet (and archives of Usenet) that probably wouldn't exist.
You never know what will be useful to the future. To give an example, the field of papyrology is largely built around trying to construct a view of the past using scraps of texts excavated from ancient dumps.
This is why individual people should never be left to make such decisions alone. They're likely to throw away things of enormous value for parochial reasons.
It's a pretty high bar to say that something should not be archived and it's a waste.
I am not even going to state my opinion on the issue now, I am just disappointed about the level of discourse comments like this create. No justification, no logic.
TL;DR is that you don't like reddit. This is not useful to anyone.
It was posted here a while ago: https://news.ycombinator.com/item?id=29090604
Anyway, here's a further description: http://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior
And reddit is very liberal with their banhammers.
/r/DataHoarder disagrees. It's pretty crazy what people will back up these days.