I strongly second trafilatura. Sending just the extracted text to the LLM, instead of the full HTML, will save a huge amount of money.
I used it on this recent project (shameless plug): https://github.com/philippe2803/contentmap. It's a simple Python library that creates a vector store for any website, using the domain's XML sitemap as a starting point. The challenge was that each domain has its own HTML structure, and to create a vector store we need the actual content, with the HTML tags etc. removed. Trafilatura basically does that for any URL, in just a few lines of code.
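For anyone curious what "a few lines" looks like, here is a rough sketch of the sitemap-to-text step. It is not how contentmap is actually implemented; `sitemap_urls` and `page_text` are names I made up for illustration, and it assumes a plain `<urlset>` sitemap rather than a sitemap index. The only trafilatura calls are `fetch_url` and `extract`.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> List[str]:
    """Return every <loc> value found in a sitemap document (hypothetical helper).

    Note: for a sitemap index, these would be child sitemaps, not pages.
    """
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def page_text(url: str) -> Optional[str]:
    """Download one page and return its main text, or None on failure (hypothetical helper)."""
    import trafilatura  # third-party: pip install trafilatura

    downloaded = trafilatura.fetch_url(url)  # None if the download failed
    if downloaded is None:
        return None
    return trafilatura.extract(downloaded)  # None if no main content was found
```

Usage would be something like `texts = [page_text(u) for u in sitemap_urls(xml)]`, after which the non-empty results can be chunked and embedded by whatever vector store you prefer.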
abhgh|1 year ago
I also forgot to mention another interesting scraper, this one an LLM-based service. A quick search here tells me it was mentioned once by simonw, but I think it should be better known just for the convenience! Prepend "r.jina.ai/" to any URL to extract its text. For example, check out [2] or [3].
[1] https://aclanthology.org/2021.acl-demo.15.pdf
[2] https://r.jina.ai/news.ycombinator.com/
[3] (this discussion) https://r.jina.ai/news.ycombinator.com/item?id=41428274
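If you want to script the prefix trick, a minimal sketch using only the standard library could look like this. The helper names `reader_url` and `fetch_text` are mine, not part of any official client, and the prefix form simply mirrors links [2] and [3]; I am also assuming the response comes back as UTF-8 text.

```python
from urllib.request import Request, urlopen

# Prefix taken from the example links above.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(url: str) -> str:
    """Build the reader URL by prepending the prefix (hypothetical helper)."""
    return READER_PREFIX + url


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Fetch the extracted text for a page through the reader service (hypothetical helper)."""
    request = Request(reader_url(url), headers={"User-Agent": "reader-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Assumption: the body is UTF-8; undecodable bytes are replaced, not fatal.
        return response.read().decode("utf-8", errors="replace")
```

For instance, `fetch_text("news.ycombinator.com/")` would request the same address as [2].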