Always good to pause and reflect on the ephemeral nature of knowledge on the www. I've always admired the Internet Archive's Sisyphean mission to preserve some piece of it.
I very much appreciate the Wayback Machine's work and would like to support them by offering our SGML software for free (see contact info on http://sgml.io or PM).
Does anyone know if they take kindly to visitors? I'm always looking for things to do when I'm in SF and have a few hours to spare, and this interests me.
I wonder if the Wayback Machine people are using a (potentially more modern) version of the AOPIC algorithm to decide what to archive. I wrote an article about that algorithm (which is similar to the original PageRank, but simpler IMO), and stated that a service "like the Wayback Machine would probably use something like AOPIC." It would be nice to remove that first "like" from the sentence!
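For anyone curious what AOPIC actually does: each page holds "cash" that gets banked into its history when crawled and redistributed to its outlinks, and long-run importance is the fraction of all cash a page has handled. A minimal sketch (my own toy version, not the Wayback Machine's or the article's code; the greedy crawl policy and the spread-to-all handling of dangling pages are simplifying assumptions):

```python
def aopic_scores(graph, steps):
    """Estimate page importance with a minimal AOPIC loop.

    graph: {page: [outlinked pages]} -- all pages must appear as keys.
    Each page starts with equal cash; crawling a page banks its cash
    into its history and splits it among its outlinks. Importance is
    (history + remaining cash) as a fraction of all cash in play.
    """
    n = len(graph)
    cash = {p: 1.0 / n for p in graph}
    history = {p: 0.0 for p in graph}
    for _ in range(steps):
        # Greedy policy: crawl whichever page currently holds the most cash.
        page = max(cash, key=cash.get)
        amount = cash[page]
        history[page] += amount
        cash[page] = 0.0
        # Dangling pages (no outlinks) spread their cash over all pages,
        # a stand-in for AOPIC's "virtual page" trick.
        links = graph[page] or list(graph)
        share = amount / len(links)
        for out in links:
            cash[out] += share
    total = sum(history.values()) + sum(cash.values())
    return {p: (history[p] + cash[p]) / total for p in graph}
```

One nice property versus batch PageRank: the estimates improve continuously as you crawl, so the crawler can use them on-line to prioritize what to fetch next.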
Love the Wayback Machine. An exploit of MODX recently resulted in the loss of a website I maintain. Remembered the Wayback Machine... all content recovered, plus my ass saved.
pronoiac | 7 years ago
branweb | 7 years ago
ps those statues of internet saints occupying the old benches of the former church/current IA hq are neat and kinda disturbing.
Fnoord | 7 years ago
tannhaeuser | 7 years ago
SGML can be used as a Swiss Army knife for all kinds of difficult HTML parsing, manipulation, and preservation tasks, since it uses classic DTD grammars for whatever HTML flavor is at hand rather than hardcoding a particular HTML grammar. For example, see our HTML 5.1 DTD at [1] (which can be used with any SGML software freely anyway).
In today's dark age of the web, we're losing content daily as classic web sites shut down.
[1]: http://sgmljs.net/docs/html5.html
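To illustrate the point about grammars-as-data (this is a toy sketch of the idea, not the sgmljs API, and the content-model table is a made-up fragment): in DTD-driven parsing, which children an element may contain is looked up in a grammar table rather than baked into the parser, so swapping the table swaps the HTML flavor being enforced.

```python
# Toy DTD-style content model: the grammar is plain data.
# Swap this table for a different HTML flavor -- the checking
# code below never changes. (Hypothetical subset for illustration.)
CONTENT_MODEL = {
    "ul": {"li"},
    "tr": {"td", "th"},
    "p":  {"a", "em", "strong", "#text"},
}

def children_valid(tag, children):
    """Check a tag's children against the content model.

    Elements missing from the grammar are accepted as-is;
    a real DTD would declare every element.
    """
    allowed = CONTENT_MODEL.get(tag)
    if allowed is None:
        return True
    return all(child in allowed for child in children)
```

A real SGML processor also uses the grammar the other way around: to *infer* omitted tags (e.g. an implied `</li>` before the next `<li>`), which is exactly what makes archived tag-soup HTML tractable.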
rmason | 7 years ago
jonah-archive | 7 years ago
celerity | 7 years ago
[1] https://intoli.com/blog/aopic-algorithm/
greglindahl | 7 years ago
AOPIC looks to me like it's roughly the same as Yahoo's iterative PageRank algorithm, but I didn't look at it that carefully.
bane | 7 years ago
xj9 | 7 years ago
https://github.com/oduwsdl/ipwb
telotortium | 7 years ago
fake-name | 7 years ago
Disclaimer: it's a project of mine, written specifically for my own interests, and very much not production ready.
It does have a full-blown spider in it, which supports distributed fetching (among other things) via another project [1].
It does both rewritten archiving (basically overwriting the page style) and raw archiving.
I also have a bunch of miscellaneous stuff, like a custom Python interface for Chromium so I can handle jerberscript shitshows (Cloudflare "protection" [2], reCAPTCHA [3], etc.). People seem really dead-set on ruining the internet.
I keep meaning to implement WARC proxying to feed into the IA as a parallel output stream for the spider, but it's a lot of work and I have too many projects already.
[1]: https://github.com/fake-name/AutoTriever
[2]: https://github.com/fake-name/ChromeController
[3]: https://github.com/fake-name/WebRequest
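On the WARC output idea: the record layout itself is simple enough that a sketch fits in a comment. Below is a minimal stdlib-only builder for a single WARC/1.0 "response" record (in practice you'd reach for a maintained library like warcio rather than hand-rolling this; field set here is the bare minimum):

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(uri, http_payload):
    """Build one minimal WARC/1.0 'response' record as bytes.

    http_payload is the raw HTTP response (status line, headers, body)
    as captured off the wire. Records are CRLF-delimited and end with
    a blank-line pair, so they can simply be concatenated into a .warc.
    """
    body = http_payload if isinstance(http_payload, bytes) else http_payload.encode()
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Target-URI", uri),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(body))),
    ]
    head = b"WARC/1.0\r\n" + b"".join(
        ("%s: %s\r\n" % (k, v)).encode() for k, v in headers
    )
    return head + b"\r\n" + body + b"\r\n\r\n"
```

Feeding records like these to the IA (or serving them back) would still need request records, dedup via revisit records, and gzip-per-record framing, which is where the "it's a lot of work" part comes in.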
PeterMikhailov | 7 years ago
If you run your own, make sure you turn on "Download images"
nikisweeting | 7 years ago
Bookmark Archiver: https://github.com/pirate/bookmark-archiver
It's modeled on the Internet Archive, except it uses headless Chrome in addition to wget, so you get snapshots of the page after JS executes too.
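The post-JS snapshot trick is available to anyone with Chrome installed: headless mode can print the rendered DOM after scripts run. A small helper that builds the command (the binary name varies by platform — `chromium`, `google-chrome`, etc. — so it's a parameter here):

```python
def dump_dom_cmd(url, chrome="chromium"):
    """Command line that prints a page's DOM after JS executes,
    using Chrome's headless mode."""
    return [chrome, "--headless", "--disable-gpu", "--dump-dom", url]

# Usage sketch:
#   import subprocess
#   html = subprocess.run(dump_dom_cmd("https://example.com"),
#                         capture_output=True, text=True).stdout
```

This captures what a reader actually saw, which is the whole point versus a wget-style raw fetch of the unexecuted source.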
datavirtue | 7 years ago
cyborgx7 | 7 years ago