Help preserve the internet with Archiveteam's warrior

170 points | neoglow | 4 years ago | selfhostedheaven.com

51 comments

Yes!

Archiving is important. We have already seen so much online history go down the drain or survive only by accident.

Large institutions like the Internet Archive are doing an admirable job, but there is a lot of content that they cannot and will not cover. So we will definitely (also) need volunteer-based archival for the foreseeable future.

18TB drives are ~$300 apiece right now; go buy one and help our collective memory!

smarx007|4 years ago

ArchiveTeam sends archives to Internet Archive but the two are not related. I don't think you confused the two but I mention this every time just in case.

The Warrior is a small Docker image that downloads files via your ISP connection and forwards them to the AT servers. No need for large drives.

For my personal use, I have a home server install of https://github.com/ArchiveBox/ArchiveBox and for that one you may want to get some storage, though I prefer to host its data on the SSD for performance reasons (my archive grows approx. 5000 items or 150GB per year). It's like a private Internet Archive on your home network.

prox|4 years ago

I kind of wonder how we can make it searchable again. Is this included in this archiving effort?

In any case wonderful work.

Thorentis|4 years ago

The Internet Archive has a huge noise to signal ratio, very much in favour of noise. I admire the effort and regularly make use of the quality archives. However, I wonder if much like Bitcoin, tremendous energy and amounts of resources are being put towards very little of value.

DoingIsLearning|4 years ago

I disagree, the unfiltered high noise is what makes it valuable. Curation is a bias.

If someone wants to dive into any topic in the archive 30 years from now they will have access to everything, not access to what some of us deem 'worthy' of curating.

I agree that it makes it harder to find things but I also see the value of IA as a time capsule.

djokkataja|4 years ago

Storing data is cheap and gets cheaper all the time. This isn't a super comparison, but the Internet Archive's 2019 revenue is listed as $36.7 mil on Wikipedia (https://en.m.wikipedia.org/wiki/Internet_Archive).

Hard to compare Bitcoin directly, but its market cap was around $1 billion in 2013 and cleared $1 trillion for the first time a little over a year ago.

I get that this article is about people using their personal computers to help archive things, but I don't think the Internet Archive is ever going to be using resources even remotely as aggressively as cryptocurrencies unless they somehow turn all their archiving into cryptocurrency.

prox|4 years ago

Value is really hard to predict, but as someone who researches a lot in archives, there is no such thing as too little information. Especially if you want the views of several parties or organizations. In anthropology and history research this work (archiving) can be of tremendous value.

Usually it’s hard to say if it’s valuable now; only time can tell.

uniqueuid|4 years ago

Just to point this out, on a technical level, the Internet Archive has very (!) little overhead.

Crawled data is de-duplicated on the request level and response payloads can be individually gzipped as well as having per-archive-file compression. [1]

[1] https://www.iso.org/standard/68004.html
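
To make the compression point concrete, here is a minimal sketch of the two ideas mentioned above. It is not the Internet Archive's actual code. It assumes each record is compressed as its own gzip member and that duplicates are detected by a SHA-1 digest of the payload; the function names (`pack_records`, `read_record`, `dedupe`) and the `"response"`/`"revisit"` labels are hypothetical and chosen only for illustration.

```python
import gzip
import hashlib


def pack_records(records):
    """Compress each record as its own gzip member and concatenate them.

    Returns the combined blob and the byte offset where each member starts.
    """
    blob = bytearray()
    offsets = []
    for record in records:
        offsets.append(len(blob))
        blob += gzip.compress(record)
    return bytes(blob), offsets


def read_record(blob, offsets, index):
    """Decompress a single record without touching the others."""
    start = offsets[index]
    end = offsets[index + 1] if index + 1 < len(offsets) else len(blob)
    return gzip.decompress(blob[start:end])


def dedupe(payloads):
    """Keep the first copy of each payload; replace repeats with a digest reference.

    Assumption: duplicates are identified by the SHA-1 of the payload bytes.
    """
    seen = set()
    out = []
    for payload in payloads:
        digest = hashlib.sha1(payload).hexdigest()
        if digest in seen:
            out.append(("revisit", digest))
        else:
            seen.add(digest)
            out.append(("response", payload))
    return out
```

Because concatenated gzip members still form a valid gzip stream, the whole file can be decompressed in one pass, while the stored offsets allow random access to a single record.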

qiskit|4 years ago

> tremendous energy and amounts of resources are being put towards very little of value.

I doubt it takes tremendous energy or resources. What percentage of the overall internet energy/resources is used by IA? An insignificant, minuscule amount.

The problem with IA is that they are constantly attacked by institutions, corporations, etc. to remove content.

cyber_kinetist|4 years ago

I think the real problem is a bit deeper: Unorganized raw data itself is of very low value, but it becomes much more valuable when humans process, categorize, and interpret it via a higher-level system of reason. We're doing a lot of the former but not the latter: we have so much data but have no idea what it all means as a whole.

Libraries aren't just "a bunch of books piled up in shelves"; they're a historical invention built and perfected over centuries, in which books are extensively coded and catalogued via a complex hierarchical system. As we are dealing with far more data than in the past (not just books but posts and comments from all over the world, as well as new kinds of media such as images and videos), and also have new kinds of conceptual and technological inventions that previous librarians didn't have access to (hyperlinks, databases, graph theory, machine learning, etc.), the current state of data management begs for a major overhaul. (For example, the best we currently have for querying and searching massive data is Google, and it is incredibly primitive! And even then we lament that its quality has decreased in favor of SEO-maximizing content.) So much raw data is created every day, and we seem to fail to understand and interpret almost all of it; I see this as one of the major historical crises we face today. Instead of just storing data, we must find radical new methodologies and tools to search, filter, and explore data, and this poses both a philosophical problem (of semiotics, linguistics, and hermeneutics) and a technological one.

sandgiant|4 years ago

Can you provide some details on this? I'm curious how noise and signal are defined and measured in this case.

stjohnswarts|4 years ago

I would argue that archive.org and saving the legacy of the internet are a far more important use of energy than making up imaginary digital currency pyramid schemes.

textfiles|4 years ago

I've been having fun with this post all day, but now I kind of need to know: Can you give examples of noise on the Archive?

janandonly|4 years ago

Unlike the Archive, the "value" of Bitcoin can be measured: Today's market cap of BTC is $839.5B

RNAlfons|4 years ago

Make it an easily installable/runnable Windows application and it will spread like wildfire.

capableweb|4 years ago

If only it were that easy. To make distributed archiving as high quality as possible, you need environments that are as reproducible as possible, which is why the "official" way of participating is to run virtual machines instead of running directly on the host.

Not sure why this 3rd party is the submission site rather than the official page, which is this: https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior

Has a couple of different installation methods as well.

cxr|4 years ago

Why even require that? If the data in question is available over HTTP, it should be as easy as opening a page from the relevant origin in a browser tab, optionally opening a second tab for a "Warrior Dashboard", then invoking a bookmarklet on the former to slurp up data by XHR &tc. (If it's necessary to cross origins as the thing roves around, the dashboard can alert you to this while it continues doing what it can with the first origin. Just have the human return to the dashboard from time to time and repeat the second step to run as many in parallel as they want.)

causi|4 years ago

Warrior is great for the community effort, but I wish someone would put some work into a modern local site archiver. HTTRACK just doesn't cut it anymore.

myself248|4 years ago

Oh jeez yeah. I've been going through https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-... the last few days and I've concluded that none of 'em are appropriate for someone with my level of software ineptitude.

TheTechRobo|4 years ago

There's github.com/ArchiveTeam/grab-site, but unfortunately it's not maintained very well.

iforgotpassword|4 years ago

How likely is it you end up downloading child porn on behalf of them? In other words, how well curated or specific is the list of download jobs your node gets assigned? If it's something like "just grab everything from this blog platform" I guess chances are not zero.

mhitza|4 years ago

I think you would be more likely to win the lottery without playing.

That type of content has long since moved from the clearnet to the darknet. I would be extremely surprised if that type of content could be found on the clearnet. But I could still be wrong.

However, if you're in the US, loli hentai is going to be a risk and a legal headache for sure: https://www.shouselaw.com/ca/blog/is-loli-illegal-in-the-uni...

As far as I'm aware, maybe excepting Australia (?) as well, in the rest of the world that type of content is not something they'll classify as child pornography, you'll just get a few sketchy looks.

smarx007|4 years ago

My experience has shown that list to be extremely well-curated. See https://wiki.archiveteam.org/#Warrior-based_projects for the current list.

Though if you join the Reddit archival project, all bets may be off, but that's not AT's fault, I guess.