Running ArchiveTeam's Warrior in Kubernetes

WildGreenLeave|1 year ago

The first thing I set up when I started to manage my own Kubernetes cluster more than a year ago was this Warrior. I completely forgot about it until this post.

It has been active for over a year, steadily working on the recommended project. It downloaded over 3TB in 6 days (the node rebooted, so the pod was restarted and stats are not persistent). So a rough extrapolation is about 180TB for the year. Happy to help the good cause of the ArchiveTeam!

Edit: typo
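
The extrapolation above is plain rate arithmetic. A minimal sketch, assuming the 3TB-in-6-days rate held steady over a 365-day year (the function name is made up for illustration):

```python
def extrapolate_tb(observed_tb: float, observed_days: float, target_days: float = 365) -> float:
    """Scale an observed download volume linearly to a longer period."""
    if observed_days <= 0:
        raise ValueError("observed_days must be positive")
    return observed_tb / observed_days * target_days


# 3 TB over 6 days is 0.5 TB/day, so a 365-day year comes to 182.5 TB,
# which is consistent with the "about 180TB" estimate.
yearly_tb = extrapolate_tb(3, 6)
print(f"{yearly_tb:.1f} TB per year")
```

The real total could differ a fair bit, since throughput depends on which project is active and how much work is available.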

ch71r22|1 year ago

For anyone else interested in running this, it only took a couple seconds to launch their docker-compose.yml

https://github.com/ArchiveTeam/warrior-dockerfile/blob/maste...

NortySpock|1 year ago

I noticed from the docker overlay filesystem that the container was spraying files all over the disk. (Ephemeral, destroyed on container shutdown, sure, but I wanted to reduce write-wear on my ssd...)

I tried setting it up with /tmp as a tmpfs (ramdisk) but it then refused to start...

Anyone know any broad-spectrum docker incantations to force all overlay writes to RAM, for a container?
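
One direction that may be worth trying: `docker run` accepts `--read-only`, which makes the container's root filesystem read-only, and `--tmpfs`, which mounts a RAM-backed filesystem at a given path. As far as I know, Docker's tmpfs mounts default to `noexec`, which could be one reason a container refuses to start with /tmp on tmpfs if it executes files from there. That is a guess and not confirmed for the Warrior image. The sketch below only builds the command line. The image name is assumed from the URL mentioned in this thread, and the paths and sizes are placeholders:

```python
import shlex


def build_docker_run(image: str, tmpfs_paths: dict[str, str], read_only: bool = True) -> list[str]:
    """Build a `docker run` argv that keeps writable paths on RAM-backed tmpfs.

    tmpfs_paths maps a container path to tmpfs mount options
    (an empty string means "use Docker's defaults").
    """
    argv = ["docker", "run", "--detach"]
    if read_only:
        # Root filesystem becomes read-only, so writes outside the tmpfs
        # mounts fail instead of landing in the on-disk overlay layer.
        argv.append("--read-only")
    for path, options in tmpfs_paths.items():
        argv += ["--tmpfs", f"{path}:{options}" if options else path]
    argv.append(image)
    return argv


# Assumptions: the image name comes from the URL mentioned in this thread,
# and the paths and sizes are placeholders, not the Warrior's real layout.
command = build_docker_run(
    "atdr.meo.ws/archiveteam/warrior-dockerfile",
    {"/tmp": "rw,exec,size=1g", "/run": "rw,size=64m"},
)
print(shlex.join(command))
```

With `--read-only`, any write to a path not covered by a tmpfs mount should surface as an error in the container logs, which helps find the remaining paths to add. Anything stored in tmpfs occupies memory, so the sizes need to fit the host.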

crtasm|1 year ago

How can I inspect what that will install before I run it? Every path on the site returns 'nope': https://atdr.meo.ws/archiveteam/warrior-dockerfile

Havoc|1 year ago

Isn't there substantial risk involved in having who knows what scraped from your IP?

tech234a|1 year ago

Yes, but many projects are restricted to specific websites. A few projects, such as the URLs project, are generally unrestricted.

honestSysAdmin|1 year ago

[deleted]

badlibrarian|1 year ago

Many of these sites are already captured and archived by proper entities as required by federal law. More is better, I guess, except when it isn't. Duplication of effort is a huge problem in the humanities in general and with archiving in particular.

The whole concept needs to be rethought. Captures from these tools show up under "ArchiveTeam" which is currently pumping thousands of copies of the Google Home Page into the Wayback Machine every week. Or at least trying to.

https://web.archive.org/web/20250122000033/www.google.com

Like so many things about archive.org, when you dig in you start to find wonder and craziness at every turn.

myself248|1 year ago

> by proper entities as required by federal law.

What federal law do you suppose is guiding the mass deletions? That doesn't look like archiving to me. Now that the foxes are running the henhouse, how reliable do you suppose their own archives are?

homebrewer|1 year ago

How do I as a non-US citizen get access to information from those "proper entities"? Is it even possible for US citizens? This is often a surprise for some visitors of this fine website, but there's a large world outside the US where "federal law" does not apply.

jfkrrorj|1 year ago

[deleted]

36 comments