
Inside Wayback Machine, the internet’s time capsule

175 points| rmason | 7 years ago |thehustle.co

17 comments

[+] branweb|7 years ago|reply
Always good to pause and reflect on the ephemeral nature of knowledge on the www. I've always admired the Internet Archive's Sisyphean mission to preserve some piece of it.

ps those statues of internet saints occupying the old benches of the former church/current IA hq are neat and kinda disturbing.

[+] Fnoord|7 years ago|reply
At the same time, the past is now fully documented. The knife cuts both ways.
[+] tannhaeuser|7 years ago|reply
I very much appreciate Wayback Machine's work and would like to support them by offering our SGML software for free (see contact info on http://sgml.io or PM).

SGML can be used as a Swiss Army knife for all kinds of difficult HTML parsing, manipulation, and preservation tasks, since it uses classic DTD grammars for the HTML flavor at hand rather than having a particular HTML grammar hardcoded. For example, see our HTML 5.1 DTD at [1] (which can be used freely with any SGML software anyway).

In today's dark age of the web, we're losing content daily as classic web sites shut down.

[1]: http://sgmljs.net/docs/html5.html

[+] rmason|7 years ago|reply
Does anyone know if they take kindly to visitors? I'm always looking for things to do when I'm in SF with a few hours to spare, and this interests me.
[+] celerity|7 years ago|reply
I wonder if the Wayback Machine people are using a (potentially more modern) version of the AOPIC algorithm to decide what to archive. I wrote an article about that algorithm [1] (which is similar to the original PageRank, but simpler IMO), and stated that a service "like the Wayback Machine would probably use something like AOPIC." It would be nice to remove that first "like" from the sentence!

[1] https://intoli.com/blog/aopic-algorithm/
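For readers who haven't met it, here is a rough Python sketch of the cash-passing idea that OPIC-style algorithms are built on. It is not taken from the linked article and not anything the Internet Archive is known to run. The function name `opic_crawl`, the toy link graph, and the dead-end handling (spreading cash over all pages instead of using the usual virtual page) are all assumptions made for illustration, and the "adaptive" history window of AOPIC is left out.

```python
from fractions import Fraction


def opic_crawl(links, steps):
    """Greedy OPIC-style crawl over a fixed toy link graph.

    links: dict mapping page -> list of pages it links to (hypothetical data).
    Returns the cash each page has accumulated when crawled, which serves
    as a rough importance score.
    """
    pages = list(links)
    # Every page starts with an equal share of one unit of cash.
    cash = {p: Fraction(1, len(pages)) for p in pages}
    history = {p: Fraction(0) for p in pages}

    for _ in range(steps):
        # Crawl the page currently holding the most cash (ties: first listed).
        page = max(pages, key=lambda p: cash[p])
        amount = cash[page]
        history[page] += amount
        cash[page] = Fraction(0)
        # Split the cash evenly among outgoing links. Simplification: a page
        # with no links spreads its cash over all pages.
        targets = links[page] or pages
        for target in targets:
            cash[target] += amount / len(targets)
    return history


# Hypothetical three-page site: "c" is linked from both other pages.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = opic_crawl(toy_links, steps=6)
for page, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(page, score)
```

The total cash in the system stays constant, so the crawler keeps returning to pages that many other pages point at, which is where the resemblance to PageRank comes from.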

[+] greglindahl|7 years ago|reply
No, Heritrix doesn't really do any ranking as it crawls; it's all up to a cleverly chosen seed.

AOPIC looks to me like it's roughly the same as Yahoo's iterative pagerank algorithm, but I didn't look at it that carefully.

[+] bane|7 years ago|reply
Anybody know of something like this that I can use for personal archiving?
[+] xj9|7 years ago|reply
https://www.archiveteam.org/index.php?title=Wget_with_WARC_o...

https://github.com/oduwsdl/ipwb

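  #!/bin/sh
  # Mirror a site into a WARC file with wget, then index it for IPFS with ipwb.
  # Usage: pass the domain to archive as the first argument.

  cd "${HOME}/data/archive" || exit 1

  user_agent="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"

  # Abort with a message if no domain was given.
  domain="${1:?usage: $0 <domain>}"

  # --no-warc-compression makes wget write ${domain}.warc rather than
  # ${domain}.warc.gz, which is the file name the ipwb step below expects.
  wget \
    --mirror \
    --page-requisites \
    --restrict-file-names=nocontrol \
    --timeout 60 \
    --tries 5 \
    --wait 1 \
    --waitretry 5 \
    --warc-cdx \
    --warc-file="${domain}" \
    --no-warc-compression \
    --warc-header "operator: Wayback Mesh" \
    -U "${user_agent}" \
    -e robots=off \
    "http://${domain}"

  ipwb index "${domain}.warc" > "${domain}.ipfs.cdxj"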
[+] fake-name|7 years ago|reply
https://github.com/fake-name/ReadableWebProxy

Disclaimer - It's a project of mine, specifically written for my interests, and is very much not production ready.

It does have a full blown spider in it, which supports distributed fetching and stuff via another project [1].

It does both rewritten archiving (basically overwriting the page style) and raw archiving.

I also have a bunch of miscellaneous stuff like a custom python interface for chromium so I can handle jerberscript shitshows (cloudflare "protection" [2], reCaptcha [3], etc...). People seem really dead-set on ruining the internet.

I keep meaning to implement WARC proxying to feed into the IA as a parallel output stream for the spider, but it's a lot of work and I have too many projects already.

1: https://github.com/fake-name/AutoTriever
2: https://github.com/fake-name/ChromeController
3: https://github.com/fake-name/WebRequest

[+] datavirtue|7 years ago|reply
Love the Wayback Machine. An exploit of MODX recently cost me a website that I maintain. Then I remembered the Wayback Machine... all the content, plus my ass, saved.
[+] cyborgx7|7 years ago|reply
The Wayback Machine is an archive, not a time capsule. Very different things.