What I'd love to see is a scraper builder that uses LLMs/'magic' to generate optimised scraping rules for any page, i.e. CSS selectors and processing rules mapped to output keys, so you can run the scraping itself at low cost and high performance.
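A minimal sketch of what those generated rules might look like: the LLM runs once per site to emit key-to-selector mappings, and the cheap extraction step just replays them with no model call per page. The rule keys, paths, and page below are invented for illustration (stdlib ElementTree with XPath-style paths stands in for a real CSS-selector engine):

```python
# Illustrative "generated rules": output key -> XPath-style path,
# as an LLM might emit them for one site. Replaying them is cheap.
import xml.etree.ElementTree as ET

RULES = {
    "title": ".//h1",
    "price": ".//span[@class='price']",
}

def extract(html: str, rules: dict) -> dict:
    """Apply pre-generated rules to one page; no LLM involved here."""
    root = ET.fromstring(html)  # assumes well-formed markup for the sketch
    return {key: root.find(path).text for key, path in rules.items()}

page = """<html><body>
  <h1>Blue Widget</h1>
  <span class="price">$9.99</span>
</body></html>"""

print(extract(page, RULES))  # {'title': 'Blue Widget', 'price': '$9.99'}
```

A real pipeline would regenerate the rules only when extraction starts returning empty or malformed values, which is where the cost savings come from.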
Apify's Website Content Crawler[0] does a decent job of this for most websites in my experience. It allows you to "extract" content via different built-in methods (e.g. Extractus [1]).
We currently use this at Magic Loops[2] and it works _most_ of the time.
The long-tail is difficult though, and it's not uncommon for users to back out to raw HTML, and then have our tool write some custom logic to parse the content they want from the scraped results (fun fact: before GPT-4 Turbo, the HTML page was often too large for the context window... and sometimes it still is!).
Would love a dedicated tool for this. I know the folks at Reworkd[3] are working on something similar, but not sure how much is public yet.
This is essentially what we're building at https://reworkd.ai (YC S23). We had thousands of users try using AgentGPT (our previous product) for scraping and we learned that using LLMs for web data extraction fundamentally does not work unless you generate code.
Here's a project that uses an LLM to generate crawling rules and then applies them to capture content, though it looks like it's still at an early research stage.
Most of the top LLMs already do this very well, both because they've been trained on web data and because they're being used for precisely this task internally to grab data.
The complicated part of scraping ops is running headless browsers, rotating IP ranges, bypassing bot detection, filling captchas, observability, and keeping selectors up to date. There are a ton of SaaS services that do that part for you.
It also seems obvious that one would want to simply drag a box around the content you want, and have the tool provide some examples to help you refine the rule set.
Ad blockers have had something very close to this for some time, without any sparkly AI buttons.
I’m sure someone is working on a subscription-based version using corporate models in the backend, but it’s something that could easily be implemented with a very small model.
That's an interesting take. I've been experimenting with reducing the overall rendered HTML down to just structure and content and using the LLM to extract from that. It works quite well, but I think your approach might be more efficient and faster.
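One way the "structure and content only" reduction described above can be sketched with the standard library: drop scripts, styles, and attributes, keeping just tag names and text, so far fewer tokens reach the LLM. The tag allowlist and example markup are my own assumptions:

```python
# Reduce rendered HTML to a bare skeleton of tags and text.
from html.parser import HTMLParser

class Skeleton(HTMLParser):
    SKIP = {"script", "style"}  # subtrees with no extractable content

    def __init__(self):
        super().__init__()
        self.out = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1
        else:
            self.out.append(f"<{tag}>")  # attributes dropped entirely

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skipping -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping and data.strip():
            self.out.append(data.strip())

def reduce_html(html: str) -> str:
    p = Skeleton()
    p.feed(html)
    return "".join(p.out)

print(reduce_html('<div class="x"><script>var a=1;</script><p>Hello</p></div>'))
# <div><p>Hello</p></div>
```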
One fun mechanism I've been using for reducing HTML size is diffing (with some leniency) pages from the same domain to exclude common parts (i.e. headers/footers). That preprocessing can be useful for any parsing mechanism.
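The diffing mechanism can be sketched with `difflib` from the standard library: compare a target page against a sibling page from the same domain line by line, and keep only the lines unique to the target. This is the bare mechanism without the "leniency" part; real pages would need normalization first, and the example pages are invented:

```python
# Strip boilerplate shared between two pages from the same domain.
import difflib

def strip_common(target: str, sibling: str) -> str:
    """Keep only target's lines that don't also appear in sibling."""
    t_lines = target.splitlines()
    sm = difflib.SequenceMatcher(None, sibling.splitlines(), t_lines)
    kept = []
    for tag, _i1, _i2, j1, j2 in sm.get_opcodes():
        if tag in ("replace", "insert"):  # lines unique to the target page
            kept.extend(t_lines[j1:j2])
    return "\n".join(kept)

page_a = "<header>Site</header>\n<p>Article one</p>\n<footer>(c) Site</footer>"
page_b = "<header>Site</header>\n<p>Article two</p>\n<footer>(c) Site</footer>"
print(strip_common(page_a, page_b))  # <p>Article one</p>
```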
Parsing HTML is a solved and, frankly, not very interesting problem. Writing up XPath/CSS selectors or JSON parsers (for when data is in script variables) is not much of a challenge for anyone.
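The script-variable case mentioned above is often just a regex plus `json.loads` rather than DOM traversal, since many pages ship their data as a JSON blob assigned to a JS variable. The variable name and page below are invented for the example:

```python
# Pull structured data out of an inline <script> state variable.
import json
import re

html = """<script>window.__INITIAL_STATE__ = {"product": {"name": "Widget", "price": 9.99}};</script>"""

match = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\});", html)
state = json.loads(match.group(1))
print(state["product"]["name"])  # Widget
```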
The more interesting issue is being able to parse data from the whole page content stack, which includes XHRs and their triggers. In that case an LLM driver would control an indistinguishable web browser to perform all the steps needed to retrieve the data as a full package. Though this is still a low-value proposition, as the models get fumbled by harder tasks and easier tasks can be performed by a human in a couple of hours.
LLM use in web scraping is still purely educational and assistive, as the biggest problem in scraping is not scraping itself but scraper scaling and blocking, which is becoming extremely common.
[0] https://apify.com/apify/website-content-crawler
[1] https://github.com/extractus/article-extractor
[2] https://magicloops.dev/
[3] https://reworkd.ai/
spxneo|1 year ago
the statistics are not in its favour
longgui0318|1 year ago
https://github.com/EZ-hwh/AutoCrawler