Funny seeing this here now, as I _just_ finished archiving an old MyBB PHP forum. I used `wget`, though; it took 2 weeks and 260GB of uncompressed disk space (12GB compressed with zstd), the process was not interruptible, and I had to start over each time my hard drive got full. Maybe I should have given HTTrack a shot to see how it compares.
Also, if anyone has experience archiving similar websites with HTTrack and maybe know how it compares to wget for my use case, I'd love to hear about it!
I've tried both to archive EOL websites, and I've had better luck with wget: it seems to recognize more links/resources and does a better job, so it was probably not a bad choice.
> it took 2 weeks and 260GB of uncompressed disk space
Is most of that data because there are a zillion different views and sortings of the same posts? That’s been the main difficulty for me when wanting to crawl some sites. There’s an effectively infinite number of URL permutations, because every page has a bunch of links with auto-generated URL parameters for various things, which results in retrieving the same data over and over again throughout a crawl. And sometimes URL parameters are needed and sometimes not, so you can’t just strip all URL parameters either.
So then you start adding things to your crawler, like starting with the shortest URLs first, and then maybe making it so that whenever you pick the next URL to visit, it takes the one most different from what you’ve seen so far. And after that you start adding super-specific rules for different paths of a specific site.
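The heuristics above (strip noisy parameters, dedup on a canonical form, visit shortest URLs first) can be sketched roughly like this. This is only an illustration, not anyone's actual crawler; the `TRACKING_PARAMS` blocklist is a made-up example and, as the comment notes, would need tuning per site:

```python
import heapq
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters that usually don't change the content. This list is a guess
# and is exactly the part that needs per-site rules in practice.
TRACKING_PARAMS = {"sort", "order", "sid", "highlight", "utm_source", "utm_medium"}

def canonicalize(url: str) -> str:
    """Drop known-noise query parameters and sort the rest, so that
    different permutations of the same view map to one frontier entry."""
    parts = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    params.sort()
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(params), ""))

class Frontier:
    """Shortest-URL-first priority queue with dedup on the canonical form."""
    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, url: str) -> None:
        canon = canonicalize(url)
        if canon not in self._seen:
            self._seen.add(canon)
            heapq.heappush(self._heap, (len(canon), canon))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[1]
```

With this, `?id=42&sort=desc` and `?sort=asc&id=42` collapse to a single `?id=42` entry, and the site root is visited before deep thread URLs.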
Is there a friendly way to do this? I'd feel bad burning through hundreds of gigabytes of bandwidth for a non-corporate site. Would a database snapshot be as useful?
One time I was trying to create an offline backup of a botanical medicine site for my studies. Somehow I turned off the link-depth limit and made it follow offsite links, then forgot about it. A few days later the machine crashed with a full disk, from trying to cram as much of the WWW onto it as it could.
This saved me a ton back in college in rural India in 2015, when I had no Internet. I would download whole websites at a nearby library and read them at home.
Oh wow, that brings back memories. I used HTTrack in the late '90s and early 2000s to mirror interesting websites from the early internet, over a modem connection (and early DSL).
Good to know it's still around. However, now that the web is much more dynamic, I guess it's not as useful anymore as it was back then.
> now that the web is much more dynamic I guess it's not as useful anymore as it was back then
Also less useful because the web is so easy to access now. I remember using it back then to pull things down over the university link for reference in my room (1st year, no network access in rooms at all) or house (on per-minute-billed modem access).
Of course sites can still vanish easily these days, so having a local copy can be a bonus, but they're just as likely to go out of date or get replaced, and if not, they're usually archived elsewhere already.
I tried the Windows version two years ago. The site I copied was our on-prem issue tracker (FogBugz), which we were replacing.
HTTrack did not work because of too much JavaScript rendering, and I could not figure out how to make it log in.
What I ended up doing was embedding a browser (WebView2) in a C# desktop app. You can intercept all the images/CSS, and after the JavaScript rendering is complete, write out the DOM content to an HTML file.
Also nice is that you can log in by hand if needed, and you can generate all the URLs from code.
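The link-rewriting half of that approach (map each intercepted image/CSS URL to a local file, and rewrite the saved DOM to point at the local copies) can be sketched in Python. This is a stdlib illustration of the idea, not the C#/WebView2 code itself; the attribute regex and `assets/` naming scheme are assumptions, and a real download step would then fetch each remote URL in the returned map:

```python
import hashlib
import re
from pathlib import PurePosixPath
from urllib.parse import urlsplit

def local_name(url: str) -> str:
    """Map a remote resource URL to a flat local filename, keeping the
    extension and adding a short hash of the URL to avoid collisions."""
    path = PurePosixPath(urlsplit(url).path)
    digest = hashlib.sha1(url.encode()).hexdigest()[:8]
    suffix = path.suffix or ".bin"
    return f"assets/{path.stem or 'resource'}-{digest}{suffix}"

def rewrite_resources(dom_html: str) -> tuple[str, dict]:
    """Rewrite src/href attributes that point at http(s) resources so the
    saved page references local copies. Returns the rewritten HTML plus a
    map of remote URL -> local path for a later download step."""
    mapping = {}

    def repl(match):
        attr, url = match.group(1), match.group(2)
        mapping.setdefault(url, local_name(url))
        return f'{attr}="{mapping[url]}"'

    rewritten = re.sub(r'(src|href)="(https?://[^"]+)"', repl, dom_html)
    return rewritten, mapping
```

A regex over attributes is deliberately crude here; the embedded-browser approach sidesteps most of the fragility because it sees the resource requests directly instead of parsing HTML.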
I use it to download sites with layouts that I like and want to use for landing pages and static pages for random projects. I strip all the copy and keep the skeleton to fill in my own content. Most recently link.com, column.com and increase.com. I don't have the time nor the youth to get into all the JavaScript & React stuff.
Felk|1 year ago
If anyone wanna know the specifics on how I used wget, I wrote it down here: https://github.com/SpeedcubeDE/speedcube.de-forum-archive
suriya-ganesh|1 year ago
I've read PY4E, OSTEP, and PG's essays using this.
I am who I am because of httrack. Thank you
jregmail|1 year ago
Real copy of the netlify.com website for demonstration: https://crawler.siteone.io/examples-exports/netlify.com/
Sample analysis of the netlify.com website, which this tool can also provide: https://crawler.siteone.io/html/2024-08-23/forever/x2-vuvb0o...
oriettaxx|1 year ago
So, did the developer of the GitHub repo take over and keep updating/upgrading it? Very good!