Inside Wayback Machine, the internet’s time capsule

[+] pronoiac|7 years ago|reply

If you're in San Francisco this Wednesday, check out their annual bash: https://blog.archive.org/2018/08/20/save-the-date-building-a...

[+] branweb|7 years ago|reply

Always good to pause and reflect on the ephemeral nature of knowledge on the www. I've always admired the Internet Archive's Sisyphean mission to preserve some piece of it.

ps those statues of internet saints occupying the old benches of the former church/current IA hq are neat and kinda disturbing.

[+] Fnoord|7 years ago|reply

At the same time, the past is now fully documented. The knife cuts both ways.

[+] tannhaeuser|7 years ago|reply

I very much appreciate Wayback Machine's work and would like to support them by offering our SGML software for free (see contact info on http://sgml.io or PM).

SGML can be used as a Swiss Army knife for all kinds of difficult HTML parsing, manipulation, and preservation tasks, since it uses classic DTD grammars for whatever HTML flavor is at hand rather than having a particular HTML grammar hardcoded. For example, see our HTML 5.1 DTD at [1] (which can be used with any SGML software freely anyway).

In today's dark age of the web, we're losing content daily as classic web sites are shutting down.

[1]: http://sgmljs.net/docs/html5.html

[+] rmason|7 years ago|reply

Does anyone know if they take kindly to visitors? I'm always looking for things to do when I'm in SF with a few hours to spare, and this interests me.

[+] jonah-archive|7 years ago|reply

We have public tours nearly every Friday at 1pm! Ping us beforehand to let us know you're coming: https://archive.org/about/contact.php

[+] celerity|7 years ago|reply

I wonder if the Wayback Machine people are using a (potentially more modern) version of the AOPIC algorithm to decide what to archive. I wrote an article about that algorithm [1] (which is similar to the original PageRank, but simpler IMO), and stated that a service "like the Wayback Machine would probably use something like AOPIC." It would be nice to remove that first "like" from the sentence!

[1] https://intoli.com/blog/aopic-algorithm/
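
For anyone who hasn't met the algorithm, here is a rough sketch of the OPIC-style "cash" bookkeeping that AOPIC builds on. It is deliberately simplified: it assumes a small, fully known link graph and leaves out the adaptive parts entirely. The function name `opic_crawl_order` and the toy graph are made up for illustration and do not come from the article or from any crawler.

```python
def opic_crawl_order(links, steps):
    """Greedy OPIC-style visiting order over a known link graph.

    links: dict mapping each page to the list of pages it links to.
    Returns (visit order, remaining cash per page, accumulated history).
    """
    pages = sorted(links)
    # Every page starts with an equal share of one unit of cash.
    cash = {p: 1.0 / len(pages) for p in pages}
    history = {p: 0.0 for p in pages}
    order = []
    for _ in range(steps):
        # Visit the page holding the most cash (ties go to the first name).
        page = max(pages, key=lambda p: cash[p])
        order.append(page)
        # Record the cash the page had collected before spending it.
        history[page] += cash[page]
        # Assumption: a page with no known outlinks pays every page equally.
        targets = [t for t in links[page] if t in cash] or pages
        share = cash[page] / len(targets)
        cash[page] = 0.0
        for target in targets:
            cash[target] += share
    return order, cash, history


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    order, cash, history = opic_crawl_order(toy_graph, steps=6)
    print(order)
    print(history)
```

Pages that keep collecting cash from many inbound links get visited more often, and the accumulated history serves as the importance estimate.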

[+] greglindahl|7 years ago|reply

No, Heritrix doesn't really do any ranking as it crawls; it's all up to a cleverly chosen seed.

AOPIC looks to me like it's roughly the same as Yahoo's iterative pagerank algorithm, but I didn't look at it that carefully.
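
To make the contrast concrete, a crawl with no ranking at all can be sketched as a plain breadth-first frontier fed from the seed list. This is only an illustration of the idea, not Heritrix's actual code; `crawl_from_seeds` and `fetch_links` are hypothetical names, and the link fetcher is passed in so the sketch runs without network access.

```python
from collections import deque


def crawl_from_seeds(seeds, fetch_links, max_pages):
    """Visit pages breadth-first from the seeds, with no importance ranking.

    fetch_links: callable taking a URL and returning the URLs it links to.
    """
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicates, keep order
    frontier = deque(unique_seeds)
    seen = set(unique_seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            # Newly discovered links simply join the back of the queue.
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


if __name__ == "__main__":
    toy_web = {"seed": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
    print(crawl_from_seeds(["seed"], lambda u: toy_web.get(u, []), max_pages=10))
```

With this scheme, what gets archived depends entirely on which seeds go in and how far the crawl is allowed to run.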

[+] bane|7 years ago|reply

Anybody know of something like this that I can use for personal archiving?

[+] xj9|7 years ago|reply

https://www.archiveteam.org/index.php?title=Wget_with_WARC_o...

https://github.com/oduwsdl/ipwb

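  #!/bin/sh
  # Mirror one domain into a WARC file, then index it for IPFS replay with ipwb.

  # Stop if the archive directory is missing rather than writing elsewhere.
  cd "${HOME}/data/archive" || exit 1

  user_agent="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"

  # The domain to archive is the first argument; abort if it is missing.
  domain="${1:?usage: $0 DOMAIN}"

  wget \
    --mirror \
    --page-requisites \
    --restrict-file-names=nocontrol \
    --timeout 60 \
    --tries 5 \
    --wait 1 \
    --waitretry 5 \
    --warc-cdx \
    --warc-file="${domain}" \
    --warc-header "operator: Wayback Mesh" \
    -U "${user_agent}" \
    -e robots=off \
    "http://${domain}"

  # wget gzips WARC output by default, so the file on disk is ${domain}.warc.gz.
  ipwb index "${domain}.warc.gz" > "${domain}.ipfs.cdxj"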

[+] telotortium|7 years ago|reply

https://www.gwern.net/Archiving-URLs

[+] fake-name|7 years ago|reply

https://github.com/fake-name/ReadableWebProxy

Disclaimer - It's a project of mine, specifically written for my interests, and is very much not production ready.

It does have a full blown spider in it, which supports distributed fetching and stuff via another project [1].

It does both rewritten archiving (basically overwriting the page style) and raw archiving.

I also have a bunch of miscellaneous stuff like a custom python interface for chromium so I can handle jerberscript shitshows (cloudflare "protection" [2], reCaptcha [3], etc...). People seem really dead-set on ruining the internet.

I keep meaning to implement WARC proxying to feed into the IA as a parallel output stream for the spider, but it's a lot of work and I have too many projects already.

[1]: https://github.com/fake-name/AutoTriever
[2]: https://github.com/fake-name/ChromeController
[3]: https://github.com/fake-name/WebRequest

[+] PeterMikhailov|7 years ago|reply

https://wallabag.org/en

If you run your own, make sure you turn on "Download images"

[+] nikisweeting|7 years ago|reply

The self-hosted Wayback Machine!

Bookmark Archiver: https://github.com/pirate/bookmark-archiver

It's modeled on the Internet Archive, except it uses headless Chrome in addition to wget, so you get snapshots of the page after JS executes too.

[+] datavirtue|7 years ago|reply

Love the Wayback Machine. An exploit of MODX recently resulted in losing a website that I maintain. Remembered the Wayback Machine... all content, plus my ass, saved.

[+] cyborgx7|7 years ago|reply

The Wayback Machine is an archive, not a time capsule. Very different things.

17 comments