What I'd love to see is a scraper builder that uses LLMs/'magic' to generate optimised scraping rules for any page, i.e. CSS selectors and processing rules mapped to output keys, so you can run the scraping itself at low cost and high performance.
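A minimal sketch of what those generated rules might look like: the LLM runs once per site to emit key-to-selector mappings, and the cheap extraction step just replays them with no model call per page. The rule keys, paths, and page below are invented for illustration (stdlib ElementTree with XPath-style paths stands in for a real CSS-selector engine):

```python
# Illustrative "generated rules": output key -> XPath-style path,
# as an LLM might emit them for one site. Replaying them is cheap.
import xml.etree.ElementTree as ET

RULES = {
    "title": ".//h1",
    "price": ".//span[@class='price']",
}

def extract(html: str, rules: dict) -> dict:
    """Apply pre-generated rules to one page; no LLM involved here."""
    root = ET.fromstring(html)  # assumes well-formed markup for the sketch
    return {key: root.find(path).text for key, path in rules.items()}

page = """<html><body>
  <h1>Blue Widget</h1>
  <span class="price">$9.99</span>
</body></html>"""

print(extract(page, RULES))  # {'title': 'Blue Widget', 'price': '$9.99'}
```

A real pipeline would regenerate the rules only when extraction starts returning empty or malformed values, which is where the cost savings come from.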
Apify's Website Content Crawler[0] does a decent job of this for most websites in my experience. It allows you to "extract" content via different built-in methods (e.g. Extractus [1]).
We currently use this at Magic Loops[2] and it works _most_ of the time.
The long-tail is difficult though, and it's not uncommon for users to back out to raw HTML, and then have our tool write some custom logic to parse the content they want from the scraped results (fun fact: before GPT-4 Turbo, the HTML page was often too large for the context window... and sometimes it still is!).
Would love a dedicated tool for this. I know the folks at Reworkd[3] are working on something similar, but not sure how much is public yet.
This is essentially what we're building at https://reworkd.ai (YC S23). We had thousands of users try using AgentGPT (our previous product) for scraping and we learned that using LLMs for web data extraction fundamentally does not work unless you generate code.
Here's a project that uses an LLM to generate crawling rules and then applies them to capture content, though it looks like it's still at an early research stage.
Most of the top LLMs already do this very well, both because they've been trained on web data and because they're being used for precisely this task internally to grab data.
The complicated part of scraping ops is running headless browsers, rotating IP ranges, bypassing bot detection, filling captchas, observability, and keeping selectors up to date. There are a ton of SaaS services that do that part for you.
It also seems obvious that one would want to simply drag a box around the content you want, and have the tool provide some examples to help you refine the rule set.
Ad blockers have had something very close to this for some time, without any sparkly AI buttons.
I’m sure someone is working on a subscription-based version using corporate models in the backend, but it’s something that could easily be implemented with a very small model.
That's an interesting take. I've been experimenting with reducing the overall rendered HTML down to just structure and content and using the LLM to extract from that. It works quite well, but I think your approach might be more efficient and faster.
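One way the "structure and content only" reduction described above can be sketched with the standard library: drop scripts, styles, and attributes, keeping just tag names and text, so far fewer tokens reach the LLM. The tag allowlist and example markup are my own assumptions:

```python
# Reduce rendered HTML to a bare skeleton of tags and text.
from html.parser import HTMLParser

class Skeleton(HTMLParser):
    SKIP = {"script", "style"}  # subtrees with no extractable content

    def __init__(self):
        super().__init__()
        self.out = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1
        else:
            self.out.append(f"<{tag}>")  # attributes dropped entirely

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skipping -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping and data.strip():
            self.out.append(data.strip())

def reduce_html(html: str) -> str:
    p = Skeleton()
    p.feed(html)
    return "".join(p.out)

print(reduce_html('<div class="x"><script>var a=1;</script><p>Hello</p></div>'))
# <div><p>Hello</p></div>
```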
One fun mechanism I've been using for reducing HTML size is diffing (with some leniency) pages from the same domain to exclude common parts (i.e. headers/footers). That preprocessing can be useful for any parsing mechanism.
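The diffing mechanism can be sketched with `difflib` from the standard library: compare a target page against a sibling page from the same domain line by line, and keep only the lines unique to the target. This is the bare mechanism without the "leniency" part; real pages would need normalization first, and the example pages are invented:

```python
# Strip boilerplate shared between two pages from the same domain.
import difflib

def strip_common(target: str, sibling: str) -> str:
    """Keep only target's lines that don't also appear in sibling."""
    t_lines = target.splitlines()
    sm = difflib.SequenceMatcher(None, sibling.splitlines(), t_lines)
    kept = []
    for tag, _i1, _i2, j1, j2 in sm.get_opcodes():
        if tag in ("replace", "insert"):  # lines unique to the target page
            kept.extend(t_lines[j1:j2])
    return "\n".join(kept)

page_a = "<header>Site</header>\n<p>Article one</p>\n<footer>(c) Site</footer>"
page_b = "<header>Site</header>\n<p>Article two</p>\n<footer>(c) Site</footer>"
print(strip_common(page_a, page_b))  # <p>Article one</p>
```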
Parsing HTML is a solved and, frankly, not very interesting problem. Writing up XPath/CSS selectors or JSON parsers (for when data is in script variables) is not much of a challenge for anyone.
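The script-variable case mentioned above is often just a regex plus `json.loads` rather than DOM traversal, since many pages ship their data as a JSON blob assigned to a JS variable. The variable name and page below are invented for the example:

```python
# Pull structured data out of an inline <script> state variable.
import json
import re

html = """<script>window.__INITIAL_STATE__ = {"product": {"name": "Widget", "price": 9.99}};</script>"""

match = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\});", html)
state = json.loads(match.group(1))
print(state["product"]["name"])  # Widget
```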
The more interesting issue is being able to parse data from the whole page content stack, which includes XHRs and their triggers. In that case an LLM driver would control an indistinguishable web browser to perform all the steps needed to retrieve the data as a full package. Though this is still a low-value proposition, as the models get fumbled by harder tasks and easier tasks can be performed by a human in a couple of hours.
LLM use in web scraping is still purely educational and assistive, as the biggest problem in scraping is not scraping itself but scraper scaling and blocking, which is becoming extremely common.
[0] https://apify.com/apify/website-content-crawler
[1] https://github.com/extractus/article-extractor
[2] https://magicloops.dev/
[3] https://reworkd.ai/
spxneo|1 year ago
the statistics are not in its favour
longgui0318|1 year ago
https://github.com/EZ-hwh/AutoCrawler