ArchiveTeam has saved over 11.2B Reddit links

[+] alboaie|2 years ago|reply

From the discussions, it seems the problem is that many subreddits will be going private to protest the recent Reddit API cost changes. Some will not come back unless the change is reverted. If the change is never reverted, they will be gone forever, and this project is trying to save old posts so they can still be seen even though the subreddits are private. Not sure how useful it will be... but it is interesting as an example of a reaction to policy changes at centralised social networks.

[+] pjc50|2 years ago|reply

Reddit were threatening to de-private the subs and replace the mods. Which they can do, but it will likely kill the community anyway for small communities. Perhaps big ones are anonymous enough for that to work.

[+] reaperman|2 years ago|reply

https://www.reddit.com/r/ltsc/ Did that quite a while ago, I forget in response to what. I lost a lot of really great info on how to use Windows 10 LTSC / Server / Enterprise / IoT when r/LTSC went permanently private.

[+] debacle|2 years ago|reply

It's also not uncommon for users to nuke their accounts when they close them, deleting all posts and comments.

[+] agilob|2 years ago|reply

Live leaderboard of archived links https://tracker.archiveteam.org/reddit/

I've been contributing to this project for ~2 years now and I've never seen it running so fast

[+] iruoy|2 years ago|reply

That's because it's the ArchiveTeam selected project now. Everybody that has their warrior set to auto will now work on this project.

[+] mozman|2 years ago|reply

How can I help?

[+] black_puppydog|2 years ago|reply

Funnily, my main beef with reddit has always been that it is "yet another data silo" and now that they seem hell-bent on proving my point, this project might actually change that to an extent. Of course, the move will still kill the platform or at least gut it of its actual value (the many niche communities built on it) but at least the data will be free. :)

[+] poutrathor|2 years ago|reply

How will ArchiveTeam handle compliance with GDPR (RGPD)? Do they remove all personal information while scraping?

[+] guraf|2 years ago|reply

I have read the linked post and they seem to be only saving links, possibly titles, but not comments or text posts?

Reddit's value is in the discussions (the links are usually shared on all social platforms, so that's the only differentiator).

So what value is there in a collection of links going back years (meaning most of them are likely already broken)?

[+] doctoboggan|2 years ago|reply

The OP said this:

> By Reddit links I mean posts/comments/images, I should’ve been a bit clearer.

[+] gammarays_|2 years ago|reply

The archive includes all information in the webpage, including text posts, comments, and images, and is uploaded to archive.org.

[+] nXqd|2 years ago|reply

The "link" here means everything at that link, including the post, comments, likes ...

[+] unknown|2 years ago|reply

[deleted]

[+] arthurcolle|2 years ago|reply

You are misinterpreting across multiple categories

[+] armchairhacker|2 years ago|reply

I plan to stop using Reddit for social media. But I also add “reddit” to a lot of my searches, using Reddit posts for “Buy it for Life” products and obscure knowledge among other things, and I don’t want that to go away. So I’m glad ArchiveTeam managed to archive almost all of them.

I also use Reddit for tech updates and discussions from subreddits like r/rust and r/ProgrammingLanguages, but most of these already have alternative sites (e.g. discourse, Discord, SE), and I’m more hopeful that most of their new posts will migrate to one of these than I am about other subreddits, or Reddit in general, migrating.

[+] lvncelot|2 years ago|reply

> but most of these already have alternative sites (e.g. discourse, Discord, SE)

I've seen Discord used as a platform for project-based discussion, and I just can't get into it. It just seems so fundamentally not built for long-term threads and the aggregation of information (like a wiki, for instance), and it kind of frustrates me to see some projects that are hell-bent on keeping all related discussion on Discord.

Granted, Reddit is not the greatest place for what I'm talking about either, since it also prioritizes newer self-posts/links, but I feel the clunkiness is even worse in Discord.

[+] londons_explore|2 years ago|reply

Google used to keep a mirror of all reddit posts and comments as a demo for their cloud bigquery product:

    SELECT * FROM fh-bigquery.reddit.subreddits

Unfortunately they stopped updating it in 2016.

[+] IanCal|2 years ago|reply

Comments data is available until the end of 2019 in fh-bigquery.reddit_comments

There also seem to be torrents available for comments covering later periods than that.

[+] Rhaomi|2 years ago|reply

I really hope they're defaulting to Old Reddit, because I seem to recall Archive.org choking on the redesign and not actually showing anything readable for archived pages.

(Also, is this including Reddit-hosted images/video?)

[+] zX41ZdbW|2 years ago|reply

Here is an example of analyzing 20+ billion Reddit comments in ClickHouse: https://clickhouse.com/docs/en/getting-started/example-datas...

[+] amelius|2 years ago|reply

We need someone like SciHub to say the ownership of those Reddit posts does not belong to Reddit, and simply fork the entire thing.

[+] social_ism|2 years ago|reply

First some facts and then some news.

The first rule of social networks is you can't touch them or they will die.

Proof: AOL, MySpace, Twitter, Facebook (Zuckerberg won't touch it now), Reddit

This is obvious except to billionaires.

Now for the news: the only content you see on social networks is designed to reinforce your supposedly persistent self and sell ads.

There is no other purpose for social media.

All social media content on the internet since the internet began will be deleted, lost and forgotten as there will be no profit motive to do otherwise.

[+] malermeister|2 years ago|reply

Could one import all of this into, say, a Lemmy instance to kickstart a reddit alternative?

[+] PythagoRascal|2 years ago|reply

Haven't tried it, but this comment on /r/DataHoarder mentioned these two repos:

https://github.com/rileynull/RedditLemmyImporter

https://github.com/LemmyNet/lemmy

[+] rkwasny|2 years ago|reply

How can I download the data they archived? Asking for a friendly AGI :)

[+] iruoy|2 years ago|reply

https://archive.org/details/archiveteam_reddit
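
A rough sketch of how one might enumerate that collection programmatically, assuming archive.org's public `advancedsearch.php` and `metadata` endpoints. The helper names `build_search_url` and `metadata_url` are made up for illustration, and the code only builds URLs; it does not fetch anything or say how the items themselves are packaged.

```python
from urllib.parse import urlencode, quote

# Collection identifier taken from the link above.
COLLECTION = "archiveteam_reddit"


def build_search_url(collection: str = COLLECTION, rows: int = 50, page: int = 1) -> str:
    """Build a search URL that lists item identifiers in a collection (hypothetical helper)."""
    params = [
        ("q", f"collection:{collection}"),  # restrict results to this collection
        ("fl[]", "identifier"),             # only return each item's identifier
        ("rows", str(rows)),                # results per page
        ("page", str(page)),                # 1-based page number
        ("output", "json"),                 # ask for a JSON response
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


def metadata_url(identifier: str) -> str:
    """Build the URL of the JSON metadata record (including file list) for one item."""
    return "https://archive.org/metadata/" + quote(identifier, safe="")


if __name__ == "__main__":
    print(build_search_url())
    print(metadata_url("example_item"))  # "example_item" is a placeholder, not a real item
```

Fetching either URL (for example with `urllib.request.urlopen`) and paging through the results is left out on purpose; the collection is very large, so anything real would need rate limiting and a lot of disk.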

[+] jannes|2 years ago|reply

It seems that the ArchiveTeam servers are overloaded. I constantly get errors like these:

    @ERROR: max connections (-1) reached -- try again later
    rsync error: error starting client-server protocol (code 5) at main.c(1817) [sender=3.2.3]

[+] capableweb|2 years ago|reply

This is common and just means that the rsync server your warrior tried to upload to was too busy. It'll retry and try another upload host if you leave it to do its thing.
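
To illustrate the general idea (this is not the warrior's actual code, just a minimal sketch under my own assumptions): treat a "max connections" rejection as temporary, move on to the next upload host, and back off before going around again. `upload_with_fallback`, `is_busy_error`, and the `try_upload` callback are all hypothetical names.

```python
import time
from typing import Callable, Optional, Sequence

# Assumption: a "server full" rejection can be recognised by this phrase,
# as in the rsync error quoted above.
BUSY_MARKER = "max connections"


def is_busy_error(message: str) -> bool:
    """Return True if the error text looks like a temporary 'server busy' rejection."""
    return BUSY_MARKER in message.lower()


def upload_with_fallback(
    hosts: Sequence[str],
    try_upload: Callable[[str], Optional[str]],
    rounds: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each host in turn; return the first host that accepts the upload.

    try_upload(host) must return None on success or an error message on failure.
    Busy errors move on to the next host; any other error is raised immediately.
    After every full pass over the hosts, wait with exponential backoff.
    """
    for attempt in range(rounds):
        for host in hosts:
            error = try_upload(host)
            if error is None:
                return host
            if not is_busy_error(error):
                raise RuntimeError(f"upload to {host} failed: {error}")
        sleep(base_delay * 2 ** attempt)
    raise RuntimeError("all upload hosts stayed busy")
```

Passing `sleep` as a parameter is only there so the backoff can be swapped out in tests; in normal use the default `time.sleep` applies.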

[+] doommius|2 years ago|reply

I set up the agent a few days ago just for fun; however, it seems to have stalled and is not getting new jobs.

[+] capableweb|2 years ago|reply

Are you running the watchtower (for automatic updates) as well? Otherwise, a restart should update it.

[+] unknown|2 years ago|reply

[deleted]

[+] lopkeny12ko|2 years ago|reply

Looks like someone configured their server connection limit incorrectly ;)

    @ERROR: max connections (-1) reached -- try again later
    rsync error: error starting client-server protocol (code 5) at main.c(1817) [sender=3.2.3]

[+] capableweb|2 years ago|reply

Seemingly, Internet Archive is overloaded with upload requests from Archive Team. That error is hinting that the upload slots are all currently used.

159 comments