benhoff | 11 months ago

I used this recently to download websites, stuffed them into a SQLite db, processed them with Mozilla's Readability library, and then used the result with an LLM to ask questions about each page.

It was helpful to take each step in chunks, as I didn't have a complete processing pipeline when I started.
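For anyone wanting to try the same staged approach, here is a minimal sketch of how the "store first, process later" steps could look. The table layout and the function names (`open_db`, `store_page`, `process_pending`) are assumptions for illustration, and `extract_text` is only a crude stdlib stand-in for Readability, not the real library.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Crude stand-in for a Readability-style extractor: keeps visible text only."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def open_db(path=":memory:"):
    # Hypothetical schema: one row per page, text filled in by a later step.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, html TEXT NOT NULL, text TEXT)"
    )
    return db


def store_page(db, url, html):
    # Step 1: save the raw HTML; text stays NULL until it is processed.
    db.execute(
        "INSERT OR REPLACE INTO pages (url, html, text) VALUES (?, ?, NULL)",
        (url, html),
    )
    db.commit()


def process_pending(db):
    # Step 2: fill in text for rows not yet processed; safe to re-run.
    rows = db.execute("SELECT url, html FROM pages WHERE text IS NULL").fetchall()
    for url, html in rows:
        db.execute("UPDATE pages SET text = ? WHERE url = ?", (extract_text(html), url))
    db.commit()
    return len(rows)
```

Because unprocessed rows are just the ones where `text` is NULL, each step can be written and re-run on its own, which is what makes working in chunks practical.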

I had wondered if there was an easier or better way to do this, as I probably would have preferred to get the sitemap, pass it to an LLM, and then download only the selected HTML pages instead of the entire website.
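A rough sketch of that sitemap-first idea, assuming the site publishes a standard sitemaps.org `urlset`. The `choose` callable is a hypothetical placeholder for the LLM step; here it is just any function that returns True for URLs worth downloading.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(sitemap_xml):
    """Return every <loc> value from a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def select_urls(urls, choose):
    """Keep only the URLs that `choose` (the hypothetical LLM step) accepts."""
    return [u for u in urls if choose(u)]
```

In practice `choose` would wrap a model call, e.g. `select_urls(urls, lambda u: "/docs/" in u)` as a trivial stand-in while testing.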

gtirloni|11 months ago

But the sitemap could be incomplete, couldn't it?

benhoff|11 months ago

True, I guess that's the advantage of HTTrack.

For my use case, it would be better to reuse the parsing that HTTrack does to collect all the URLs, and then pass those to an LLM to selectively grab files.
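Something like the following could approximate the "collect all the URLs" part without a sitemap. To be clear, this is not HTTrack's parser (which handles far more than anchor tags); it is a small stdlib sketch, and `LinkCollector` and `extract_links` are names made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from <a href=...> tags, without duplicates."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")) and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(base_url, html):
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

The resulting list can then be handed to whatever does the selecting before any file is actually fetched.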