Ask HN: How to Resurrect a Site from Archive.org?
91 points | rrr_oh_man | 1 year ago
Is there a way I can "revive" it from archive.org in a more or less automated fashion? Have you ever encountered anything like it? I am familiar with web scraping, but archive.org has its peculiarities.
I really, really love the content on it.
It's a very niche site, but I would love for it to live on.
duskwuff|1 year ago|reply
Buying a domain name does not award you ownership of the content it previously hosted. If you have not come to some agreement with the previous owner, you should not proceed.
ksec|1 year ago|reply
I recently learned CGTalk was completely shut down and ALL the information shared over the past 20 years is gone. It never received the attention that DPreview did. There are plenty of other examples where a forum owner no longer wants the burden of owning it.
It really is a sad state of things.
Is there a site or exchange somewhere where owners could sell their site, or at least put up a whole archive as an asset?
ulrischa|1 year ago|reply
1. Collect a list of archived URLs (via archive.org’s CDX endpoints).
2. Download each page and all related assets.
3. Rewrite all links that currently point to `web.archive.org` so they point to your domain or your local file paths.
The tricky part is the Wayback Machine’s directory structure—every file is wrapped in these time-stamped URLs. You’ll need to remove those prefixes, leaving just the original directory layout. There’s no perfect, purely automated solution, because sometimes assets are missing or broken. Be prepared for some manual cleanup.
Beyond that, the process is basically: gather everything, clean up links, restore the original hierarchy, and then host it on your server. Tools exist that partially automate this (for example, some people have written scripts to do the CDX fetching and rewriting), but if you’re comfortable with web scraping logic, you can handle it with a few careful passes. In the end, you’ll have a mostly faithful static snapshot of the old site running under your revived domain.
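A minimal sketch of steps 1 and 2 in Python, assuming `requests` is installed and `example.com` stands in for the dead site's domain; this is illustrative, not a turnkey tool:

```python
# Step 1: list one good capture per unique URL via the CDX API.
# Step 2: fetch each capture with the "id_" flag to get the original bytes.
# (Step 3, link rewriting, is a separate pass over the saved files.)
import os
from urllib.parse import urlparse

import requests

DOMAIN = "example.com"  # placeholder for the dead site

resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": f"{DOMAIN}/*",
        "output": "json",
        "fl": "timestamp,original",
        "filter": "statuscode:200",
        "collapse": "urlkey",  # one row per unique URL
    },
    timeout=60,
)
rows = resp.json()[1:]  # row 0 is the field-name header

for timestamp, original in rows:
    # "id_" after the timestamp skips the Wayback toolbar and link rewriting.
    snapshot = f"https://web.archive.org/web/{timestamp}id_/{original}"
    local = urlparse(original).path.lstrip("/") or "index.html"
    if local.endswith("/"):
        local += "index.html"
    os.makedirs(os.path.dirname(local) or ".", exist_ok=True)
    page = requests.get(snapshot, timeout=60)
    if page.ok:
        with open(local, "wb") as f:
            f.write(page.content)
```

Expect missing assets and the occasional path collision; as the comment says, plan on a manual cleanup pass afterwards.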
Gualdrapo|1 year ago|reply
I scraped its contents (blog posts, pages, etcetera) with Python's BeautifulSoup and redid its styling "by hand", which was not something otherworldly (the site was from 2010 or so), and had the chance to add some improvements.
The thing with the scraping was that the connection kept dropping after a while and it was really, really slow, so I had to keep a record in memory of the last successfully scraped post/page/whatever and, if something happened, restart from it as a starting point.
Got pennies for it, mostly because I lowballed myself, but got to learn a thing or two.
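One way to implement that resume-from-last-success bookkeeping, with a checkpoint file instead of an in-memory register so it survives a crash (file name and URL list are illustrative; the parsing step is elided):

```python
# Persist the set of successfully scraped URLs so an interrupted run can
# resume where it left off instead of starting over on a flaky connection.
import json
import time

import requests

CHECKPOINT = "progress.json"  # illustrative file name

def load_done() -> set:
    try:
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()

def scrape_all(urls):
    done = load_done()
    for url in urls:
        if url in done:
            continue  # fetched on a previous run
        resp = requests.get(url, timeout=60)
        resp.raise_for_status()
        # ... parse with BeautifulSoup and write to disk here ...
        done.add(url)
        with open(CHECKPOINT, "w") as f:
            json.dump(sorted(done), f)  # checkpoint after every success
        time.sleep(1)  # be gentle with a slow server
```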
janesvilleseo|1 year ago|reply
Anyway, there are tools out there. I haven't used them myself, but a tool like
https://www6.waybackmachinedownloader.com/website-downloader...
or
https://websitedownloader.com/
should do the trick. Depending on the size of the site, a small cost is involved.
They can even package everything into usable files.
latexr|1 year ago|reply
https://superuser.com/questions/828907/how-to-download-a-web...
aspenmayer|1 year ago|reply
https://wiki.archiveteam.org/index.php?title=Restoring
which mentions
https://github.com/hartator/wayback-machine-downloader
and also this tip:
> This is undocumented, but if you retrieve a page with id_ after the datecode, you will get the unmodified original document without all the Wayback scripts, header stuff, and link rewriting. This is useful when restoring one page at a time or when writing a tool to retrieve a site:
> http://web.archive.org/web/20051001001126id_/http://www.arch...
Per the downloader's issue tracker, you may need to use this forked version if you encounter errors:
https://github.com/hartator/wayback-machine-downloader/issue...
https://github.com/ShiftaDeband/wayback-machine-downloader
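A quick illustration of that id_ tip in Python (the timestamp and target URL below are placeholders, since the quoted example is truncated):

```python
# Fetch a capture with "id_" after the datecode: the response is the page
# as originally served, with no injected Wayback toolbar, header scripts,
# or rewritten links.
import requests

ts, url = "20051001001126", "http://example.com/"  # placeholders
raw = requests.get(f"https://web.archive.org/web/{ts}id_/{url}", timeout=60)
print(raw.text[:200])  # original markup, no Wayback chrome
```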
01jonny01|1 year ago|reply
1) Download HTTrack if it's a large website with a lot of pages.
2) Download a search-and-replace program; there are many of them.
3) Use the search-and-replace program to remove the appended web archive URL from the pages in bulk (see the sketch below).
4) Upload to your host.
5) Run the site through a bulk link checker that tests for broken links. There are plenty of them online.
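A hedged sketch of step 3 in Python, assuming the downloaded pages live under `site/` and `old.example.com` stands in for the original domain:

```python
# Strip Wayback prefixes from downloaded HTML in bulk, e.g. turning
# "https://web.archive.org/web/20100101000000/http://old.example.com/a/b"
# back into the root-relative "/a/b".
import re
from pathlib import Path

OLD_DOMAIN = "old.example.com"  # placeholder for the original site
WAYBACK = re.compile(
    r"https?://web\.archive\.org/web/\d{14}(?:id_|im_|js_|cs_)?/"
    r"https?://(?:www\.)?" + re.escape(OLD_DOMAIN)
)

for path in Path("site").rglob("*.html"):
    html = path.read_text(encoding="utf-8", errors="replace")
    path.write_text(WAYBACK.sub("", html), encoding="utf-8")
```

A dedicated search-and-replace tool does the same job; the point is one consistent rule applied across every file before the link-checking pass.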
aspenmayer|1 year ago|reply
None of my ire is directed at you, as I don't assume you knew any of this. I just wanted to let you know, in case you were misled as to what the site does by its ad copy.
https://archivarix.com/en/affiliate/
https://archivarix.com/en/#show-prices-wbm
bagpuss|1 year ago|reply
posters, enhance your calm
- bagpuss, fat furry cat puss
toast0|1 year ago|reply
I pulled each page off the Internet Archive and saved it as an archive, then did some minor tidying up: setting viewports for mobile, updating the linkback HTML snippet to point to my URL instead of the old dead one, changing the snippet to not suggest hotlinking the link image, cropping the dead URL out of the link image, running pngcrush on the images, and putting it on cheap hosting for static pages.
I did a bit of poking around trying to find a way to contact the owner, but had no luck. If they come back and want it down, I'll take it down. Copyright notices are intact. I'm clearly violating the author's copyrights, and I accept that.
gopher_space|1 year ago|reply
I'm looking at combining several old message boards into something useful, and I'd like to be proactive regarding copyright. My approach so far:
- I'm assuming that everyone owns their own post/comment.
- I'm assuming that submitting content meant they intended to grant rights to community members.
- I'm assuming that work done in support of the original community would be welcomed by members.
- And I'm assuming this all changes if I want money.
So I'm preserving attributions when I can, but treating content like it's CC or similar as long as I'm operating within the original author's area of concern. Anything that actually gets released will be as open as possible... and will probably start with telling you how to download files. Entirely walling off my code makes sense, but then it's no longer a fun little project; it's a framework.
Sysreq2|1 year ago|reply
https://registry.opendata.aws/commoncrawl/
aoipoa|1 year ago|reply
https://hn.algolia.com/?q=ask+hn+resurrect+site+archive
Very odd.
Even the times of the comments have changed; this is what the post looked like yesterday:
https://web.archive.org/web/20241205054108/https://news.ycom...
denotational|1 year ago|reply
That's deliberate: to avoid a dupe, this mechanism post-dates the original post.
alsetmusic|1 year ago|reply
For anyone who may be curious, wayback machine has an archive: fuckthesouth.com
aspenmayer|1 year ago|reply
> HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility.
Available on Windows, Mac, Linux, and Android.
pabs3|1 year ago|reply
Depending on the site, you would use different tools; e.g. for MediaWiki/DokuWiki sites, you would import the latest database dump from archive.org.
I have used wayback-machine-downloader before for completely static sites:
https://github.com/hartator/wayback-machine-downloader/
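For the wiki-dump case, a sketch using the `internetarchive` Python package (`pip install internetarchive`); the item identifier is a placeholder you'd look up on archive.org:

```python
# Download just the XML dump files from an archive.org item; the dump can
# then be imported into a fresh wiki install (for MediaWiki, via its
# maintenance/importDump.php script).
from internetarchive import download

download(
    "some-wiki-dump-item",   # placeholder identifier
    glob_pattern="*.xml*",   # grab only the dump files
    destdir="dumps",
    verbose=True,
)
```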