top | item 27789910

HTTrack Website Copier – Offline browser

131 points | evo_9 | 4 years ago | httrack.com | reply

35 comments

[+] boulos|4 years ago|reply
Ahh, the good ole days. I used httrack nearly 20 years ago to make the CD copies of the osha.gov site (e.g., [1]). Back then, for ADA and internet access compliance, government websites had to also be made available for offline use if possible.

I haven’t followed httrack since, but it seems like scrapy and similar are much better replacements.

[1] https://forum.httrack.com/readmsg/3556/index.html

[+] xroche|4 years ago|reply
Glad the project helped a few people a bit :) Unfortunately I don't have much time to enhance the engine nowadays, and the code is dirty and broken beyond any repair. Yet I'm still puzzled to see how many people are still using the project today.

You'll probably find better approaches, and while I never tried scrapy, it seems to be using a javascript engine for hard cases, which was something I thought about (but this was way above my skills at that time).

The hard part remains, however, if you want a functional site: you need to rewrite links or use an external proxy-like mechanism. Having a fully functional offline, file-based site is the real tricky part. Some cases will remain unsolvable, as the site's own code logic can produce arbitrary external links and resources based on randomness, time, etc.

The approach in httrack was both ugly and pragmatic: attempting to recognize link/files patterns within javascript and fetch/replace what can be replaced with local links. Javascript producing html will typically be analyzed with really dumb - yet sometimes effective - js parsers. (parental advisory: don't look at the parsers code, your eyes would melt)

And obviously this is not going to solve all cases, and it will even break pages with tricky js.
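The link rewriting described above is the crux of offline mirroring. Here is a minimal sketch of the idea in Python, assuming only simple absolute `href` attributes (real mirroring must also handle CSS `url()`, `srcset`, JS-built URLs, query strings, and more; the function names are illustrative, not HTTrack's actual code):

```python
import re

def rewrite_links(html: str, site_root: str) -> str:
    """Rewrite absolute links under site_root to relative local paths.
    Toy sketch: only handles plain href="..." attributes."""
    def to_local(match):
        path = (match.group(1) or "").lstrip("/") or "index.html"
        # Map extensionless / directory-style paths to a local .html file
        if "." not in path.rsplit("/", 1)[-1]:
            path = path.rstrip("/") + "/index.html"
        return 'href="%s"' % path
    pattern = re.compile(r'href="%s(/[^"]*)?"' % re.escape(site_root))
    return pattern.sub(to_local, html)

print(rewrite_links('<a href="https://example.com/docs/">Docs</a>',
                    "https://example.com"))
# -> <a href="docs/index.html">Docs</a>
```

Even this toy version shows why the problem is hard: the URL-to-filename mapping has to be decided per link, and anything generated at runtime by JavaScript never appears in the static HTML at all.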

[+] fouc|4 years ago|reply
The good ole days when web pages were web pages!
[+] wilsonfiifi|4 years ago|reply
I’ve recently been using Monolith[0] and I find its creation of a single HTML file much more convenient. It’s also written in Rust, so I’m sure that will make the source code a bit more accessible for some.

[0] https://github.com/Y2Z/monolith

[+] kybernetikos|4 years ago|reply
It doesn't look like it follows links though, so it's much more of an alternative to "Save As -> mhtml" than to HTTrack.
[+] chowderman|4 years ago|reply
There is also a similar program called HyperFiler[0]* that bundles web pages into single HTML files with a few more features, such as a headless Chromium transport option, built-in minifiers, page sanitizers, and an option to grayscale the output pages. It's TypeScript-based and has a programmatic API to customize the bundling process as well.

[0] https://github.com/chowderman/hyperfiler

* disclaimer: I created HyperFiler

[+] AnyTimeTraveler|4 years ago|reply
Cool! Thanks for sharing! I've been on the lookout for replacements for archiving recipes on cooking sites and this tool works great.
[+] Minor49er|4 years ago|reply
I used to use HTTrack pretty often to save entire sites. But I learned that wget can take care of my copying needs even more easily most of the time. Something like this usually does the trick:

  wget -E -r -k -p --span-hosts http://mycoolhomepage.com
[+] mattowen_uk|4 years ago|reply
Wow, I didn't think people still use this. I have a copy on my PC, and wheel it out every so often when I find a great small site that has quite obviously been abandoned by its owner, so it could vanish any moment, and Wayback doesn't have a full copy. I could switch to something better, but I know how HTTrack works, and it works well enough for me.
[+] slumdev|4 years ago|reply
You know what might be a great add-on?

A tool to crawl all of the links within a website and submit each one of them to Wayback...
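The Wayback Machine exposes a Save Page Now endpoint at `https://web.archive.org/save/<url>`, so a minimal version of this idea fits in a few lines of Python. A sketch, with only the link extraction exercised here (the submission loop is rate-limited in practice, and the Internet Archive also offers a richer authenticated API for bulk saving):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collect href targets from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def submit_to_wayback(page_url: str, html: str):
    """Ask the Wayback Machine to snapshot every link on a page
    via its Save Page Now endpoint."""
    parser = LinkCollector()
    parser.feed(html)
    for link in parser.links:
        target = urljoin(page_url, link)
        urlopen("https://web.archive.org/save/" + target)
```

A real tool would deduplicate URLs, stay within one domain, and throttle submissions to respect the archive's limits.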

[+] notRobot|4 years ago|reply
HTTrack is one of my favourite pieces of software - it makes it super easy to create offline mirrors of websites and browse them later. It's sorta like wget on steroids for that use case.
[+] mosselman|4 years ago|reply
Wget has the --mirror flag, which makes it much like httrack for minor scrapes. Httrack is faster because it can work in parallel.
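The parallelism difference can be approximated in a few lines: a sketch of HTTrack-style concurrent fetching using a thread pool (illustrative code, not HTTrack's actual mechanism; `fetch_all` accepts an injectable fetch function so it can be exercised without a network):

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url: str) -> bytes:
    """Download one URL (no error handling in this sketch)."""
    with urlopen(url) as resp:
        return resp.read()

def fetch_all(urls, fetch=fetch, workers: int = 8):
    """Fetch several URLs concurrently, HTTrack-style, instead of
    one at a time; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

Since downloads are I/O-bound, even a handful of workers gives a large speedup over a strictly sequential crawl.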
[+] icythere|4 years ago|reply
I used httrack to transform the public version of my wordpress blogs into a static site. It often crashed, but as long as I had a copy of its local data(base) it was fine to restart it.

I really like the tool. I doubt it is as helpful today, because of the rise of the JavaScript stuff...

[+] daniel_iversen|4 years ago|reply
I looked into it the other day but AFAIR it easily breaks down if a site is using Cloudflare to protect itself from abuse, which I imagine could be quite a fair few sites these days.
[+] Scoundreller|4 years ago|reply
I wonder if there’s a way to integrate it with my own browser so it can mirror a website over days/weeks by using my regular behaviour to avoid suspicion.
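The "spread it over days" part of this idea is easy to sketch even without browser integration: randomize long delays between requests and send a browser-like User-Agent header. Illustrative Python only; the function names are made up, and this does nothing about cookies, JS challenges, or session state:

```python
import random
import time
from urllib.request import Request, urlopen

def crawl_schedule(urls, min_delay=30.0, max_delay=300.0):
    """Pair each URL with a randomized delay, so requests spread out
    over hours or days instead of arriving in a burst."""
    for url in urls:
        yield random.uniform(min_delay, max_delay), url

def slow_mirror(urls):
    """Fetch pages one by one with long random pauses and a
    browser-like User-Agent header."""
    for delay, url in crawl_schedule(urls):
        time.sleep(delay)
        req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(req) as resp:
            resp.read()  # a real mirror would save this to disk
```

Driving the fetches from actual browsing, as suggested, would need a browser extension or a recording proxy in front of the browser.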
[+] klkvsk|4 years ago|reply
I used to use HTTrack, but then I found Cryotek WebCopy[1]. It's basically same, but with more user friendly GUI and some more features useful for modern web. Like for example, it can fetch resources by URLs contained in JS code or from special attributes like data-src. And it's free, but available only for Windows, though.

[1] https://www.cyotek.com/cyotek-webcopy

[+] cubano|4 years ago|reply
HTTrack is a powerful package, but unless it can now handle JS-embedded links properly, it still can't take you all the way to perfect local site mirroring.
[+] gengear|4 years ago|reply
Used it in undergrad to mirror courses. Very good. Doesn't work with JS links or SPAs, but a nice-to-have tool if you are into self-hosting.
[+] cyberge99|4 years ago|reply
How does this differ from Save As in a browser?
[+] okareaman|4 years ago|reply
There's no comparison. It's way more powerful than that, but be careful what you point it at, because it will download gigabytes of everything and might annoy the website owner.
[+] judge2020|4 years ago|reply
Seems to be more powerful, eg. it can spider an entire website and save all linked pages and assets.