
ArchiveTeam has saved over 11.2B Reddit links

549 points| susanthenerd | 2 years ago |old.reddit.com

159 comments

[+] alboaie|2 years ago|reply
From the discussions, it seems the problem is that many subreddits will be going private to protest the recent Reddit API cost changes. Some will not come back unless the change is reverted. If it is never reverted, they will be gone forever, and this project is trying to save old posts so they can still be seen even while the subreddits are private. Not sure how useful it will be... but it is interesting as an example of how people react to policy changes on centralised social networks.
[+] pjc50|2 years ago|reply
Reddit were threatening to de-private the subs and replace the mods, which they can do, but for small communities that will likely kill the community anyway. Perhaps big ones are anonymous enough for that to work.
[+] reaperman|2 years ago|reply
https://www.reddit.com/r/ltsc/ did that quite a while ago; I forget in response to what. I lost a lot of really great info on how to use Windows 10 LTSC / Server / Enterprise / IoT when r/LTSC went permanently private.
[+] debacle|2 years ago|reply
It's also not uncommon for users to nuke their accounts when they close them, deleting all posts and comments.
[+] agilob|2 years ago|reply
Live leaderboard of archived links https://tracker.archiveteam.org/reddit/

I've been contributing to this project for ~2 years now and I've never seen it running so fast

[+] iruoy|2 years ago|reply
That's because it's the ArchiveTeam selected project now. Everybody that has their warrior set to auto will now work on this project.
[+] mozman|2 years ago|reply
How can I help?
[+] black_puppydog|2 years ago|reply
Funnily, my main beef with reddit has always been that it is "yet another data silo" and now that they seem hell-bent on proving my point, this project might actually change that to an extent. Of course, the move will still kill the platform or at least gut it of its actual value (the many niche communities built on it) but at least the data will be free. :)
[+] poutrathor|2 years ago|reply
How will ArchiveTeam handle GDPR compliance? Do they remove all personal information while scraping?
[+] guraf|2 years ago|reply
I have read the linked post and they seem to be only saving links, possibly titles, but not comments or text posts?

Reddit's value is in the discussions (the links are usually shared on all social platforms, so that's the only differentiator).

So what value is there in a collection of links going back years (meaning most of them are likely already broken)?

[+] doctoboggan|2 years ago|reply
The OP said this:

> By Reddit links I mean posts/comments/images, I should’ve been a bit clearer.

[+] gammarays_|2 years ago|reply
The archive includes all information in the webpage, including text posts, comments, and images, and is uploaded to archive.org.
[+] nXqd|2 years ago|reply
A "link" here means everything at that link, including the post, comments, likes ...
[+] arthurcolle|2 years ago|reply
You are misinterpreting across multiple categories
[+] armchairhacker|2 years ago|reply
I plan to stop using Reddit for social media. But I also add “reddit” to a lot of my searches, using Reddit posts for “Buy it for Life” products and obscure knowledge among other things, and I don’t want that to go away. So I’m glad ArchiveTeam managed to archive almost all of them.

I also use Reddit for tech updates and discussions from subreddits like r/rust and r/ProgrammingLanguages, but most of these already have alternative sites (e.g. discourse, Discord, SE), and I’m more hopeful that most of their new posts will migrate to one of these than I am about other subreddits, or Reddit in general, migrating.

[+] lvncelot|2 years ago|reply
> but most of these already have alternative sites (e.g. discourse, Discord, SE)

I've seen Discord used as a platform for project-based discussion, and I just can't get into it. It just seems so fundamentally not built for long-term threads and the aggregation of information (like a wiki, for instance), and it kind of frustrates me to see some projects that are hell-bent on keeping all related discussion on Discord.

Granted, Reddit is not the greatest place for what I'm talking about either, since it also prioritizes newer self-posts/links, but I feel the clunkiness is even worse in Discord.

[+] londons_explore|2 years ago|reply
Google used to keep a mirror of all reddit posts and comments as a demo for their cloud bigquery product:

    SELECT * FROM fh-bigquery.reddit.subreddits
Unfortunately they stopped updating it in 2016.
[+] IanCal|2 years ago|reply
Comments data is available until the end of 2019 in fh-bigquery.reddit_comments

There are torrents available it seems for comments that cover later times than that.

[+] Rhaomi|2 years ago|reply
I really hope they're defaulting to Old Reddit, because I seem to recall Archive.org choking on the redesign and not actually showing anything readable for archived pages.

(Also, is this including Reddit-hosted images/video?)

[+] amelius|2 years ago|reply
We need someone like SciHub to say the ownership of those Reddit posts does not belong to Reddit, and simply fork the entire thing.
[+] social_ism|2 years ago|reply
First some facts and then some news.

The first rule of social networks is you can't touch them or they will die.

Proof: AOL, MySpace, Twitter, Facebook (Zuckerberg won't touch it now), Reddit

This is obvious except to billionaires.

Now for the news: the only content you see on social networks is designed to reinforce your supposedly persistent self and sell ads.

There is no other purpose for social media.

All social media content on the internet since the internet began will be deleted, lost and forgotten as there will be no profit motive to do otherwise.

[+] jannes|2 years ago|reply
It seems that the ArchiveTeam servers are overloaded. I constantly get errors like these:

    @ERROR: max connections (-1) reached -- try again later
    rsync error: error starting client-server protocol (code 5) at main.c(1817) [sender=3.2.3]
[+] capableweb|2 years ago|reply
This is common and just means that the rsync server your warrior tried to upload to was too busy. It'll retry and try another upload host if you leave it to do its thing.
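
As a rough illustration of that retry behaviour (a sketch only, not the warrior's actual code): `upload_with_failover`, `HostBusyError`, the host names in the usage note, and the backoff values are all made up for this example, and `upload` stands in for the real rsync call.

```python
import itertools
import time


class HostBusyError(Exception):
    """Raised by an upload attempt when the target reports it is full."""


def upload_with_failover(upload, hosts, max_attempts=6, base_delay=1.0,
                         sleep=time.sleep):
    """Try each host in turn, backing off after every busy response.

    `upload` is any callable taking a host name; it is a stand-in for the
    real rsync invocation, which this sketch does not model.
    """
    # Cycle through the hosts so a busy one is skipped on the next attempt.
    for attempt, host in zip(range(max_attempts), itertools.cycle(hosts)):
        try:
            return upload(host)
        except HostBusyError:
            # Exponential backoff, capped so we never wait more than a minute.
            sleep(min(base_delay * 2 ** attempt, 60.0))
    raise RuntimeError(f"all {max_attempts} attempts hit a busy host")
```

Calling `upload_with_failover(my_upload, ["host-a", "host-b"])` would try `host-a`, wait, try `host-b`, wait longer, and so on until one accepts the upload or the attempts run out.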
[+] doommius|2 years ago|reply
I set up the agent a few days ago just for fun, but it seems to have stalled and isn't getting new jobs.
[+] capableweb|2 years ago|reply
Are you running the watchtower (for automatic updates) as well? Otherwise, a restart should update it.
[+] lopkeny12ko|2 years ago|reply
Looks like someone configured their server connection limit incorrectly ;)

    @ERROR: max connections (-1) reached -- try again later
    rsync error: error starting client-server protocol (code 5) at main.c(1817) [sender=3.2.3]

[+] capableweb|2 years ago|reply
Seemingly, Internet Archive is overloaded with upload requests from Archive Team. That error is hinting that the upload slots are all currently used.
[+] mirages|2 years ago|reply
Is it just me, or is the data they collect neither searchable nor indexed?
[+] capableweb|2 years ago|reply
Usually it goes something like this:

- Grab the data in a raw format

- Upload to Internet Archive

- Figure out how to extract structured data from raw dump

- Upload structured data to IA
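
The "extract structured data" step could look something like the sketch below. This is purely illustrative: the `<div class="comment" data-author="...">` markup, `CommentExtractor`, and `extract_comments` are invented for the example and are not the real Reddit page layout or any ArchiveTeam tooling. It also assumes comment divs are not nested.

```python
from html.parser import HTMLParser


class CommentExtractor(HTMLParser):
    """Collect {author, text} records from <div class="comment"> elements.

    The markup is invented for this example; real archived pages differ.
    """

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record being filled while inside a comment div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "comment":
            self._current = {"author": attrs.get("data-author"), "text": ""}

    def handle_data(self, data):
        # Only keep text that appears inside a comment div.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "div" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.records.append(self._current)
            self._current = None


def extract_comments(raw_html):
    """Turn one raw HTML page into a list of structured comment records."""
    parser = CommentExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.records
```

The resulting list of dicts is the kind of thing that can then be serialised and indexed, which the raw dump on its own cannot.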