benhoff | 11 months ago
It was helpful to take each step in chunks, as I didn't have a complete processing pipeline when I started.
I had wondered whether there was an easier or better way to do this. I probably would have preferred to get the sitemap, pass it to an LLM, and then download only the selected HTML pages rather than the entire website.
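For what it's worth, here is a rough sketch of that sitemap-first flow in Python. It assumes the site publishes a standard sitemaps.org XML sitemap. The names `parse_sitemap`, `select_urls`, and `download` are made up for illustration, and the `choose` callback is only a stand-in for whatever LLM call would do the picking.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import Callable, Iterable

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def select_urls(urls: Iterable[str], choose: Callable[[str], bool]) -> list[str]:
    """Keep only the URLs the chooser accepts.

    `choose` is a hypothetical placeholder: in practice it could wrap an
    LLM call that decides whether a page is worth downloading.
    """
    return [url for url in urls if choose(url)]


def download(urls: Iterable[str]) -> dict[str, bytes]:
    """Fetch each selected URL and return its raw body keyed by URL."""
    pages = {}
    for url in urls:
        with urllib.request.urlopen(url) as response:
            pages[url] = response.read()
    return pages
```

Usage would be something like `download(select_urls(parse_sitemap(xml_text), choose))`, where `xml_text` is the fetched sitemap. A sitemap index that points to other sitemaps would need one more level of fetching, which this sketch does not handle.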
gtirloni|11 months ago
benhoff|11 months ago
I guess for my use case, it would be better to take the parsing that HTTrack does, collect all the URLs, and pass them to some intelligence (an LLM, say) that selectively grabs files.
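To make the "get all the URLs" part concrete, here is a minimal sketch using only the Python standard library. This is not how HTTrack does it internally, just a simple stand-in that pulls `<a href>` links out of one page. `LinkCollector` and `extract_links` are names invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute, de-duplicated link targets from <a href="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html: str, base_url: str) -> list[str]:
    """Return the links found in `html`, in order of first appearance."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

The resulting list could then be handed to whatever does the selecting, so only the chosen files get downloaded.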