top | item 41429220

(no title)

luigi23 | 1 year ago

Why are scrapers so popular nowadays?

discuss

order

adamtaylor_13|1 year ago

There’s a lot of data that we should have programmatic access to that we don’t.

The fact that I can’t get my own receipt data from online retailers is unacceptable. I built a CLI Puppeteer scraper to scrape sites like Target, Amazon, Walmart, and Kroger for precisely this reason.

Any website that has my data and doesn’t give me access to it is a great target for scraping.

drusepth|1 year ago

I'd say scrapers have always been popular, but I imagine they're even more popular nowadays with all the tools (AI but also non-AI) readily available to do cool stuff on a lot of data.

bongodongobob|1 year ago

Bingo. During the pandemic, I started a project to keep myself busy by trying to scrape stock market ticker data and then do some analysis and make some pretty graphs out of it. I know there are paid services for this, but I wanted to pull it from various websites for free. It took me a couple months to get it right. There are so many corner cases to deal with if the pages aren't exactly the same each time you load them. Now with the help of AI, you can slap together a scraping program in a couple of hours.

rietta|1 year ago

Because publishers don’t push structured data or APIs enough to satisfy demand for the data.

luigi23|1 year ago

Got it, but why is it booming now and often it’s a showcase of llm model? Is there some secret market/ usecase for it?

CSMastermind|1 year ago

There's been a large push to do server-side rendering for web pages which means that companies no longer have a publicly facing API to fetch the data they display on their websites.

Parsing the rendered HTML is the only way to extract the data you need.

kordlessagain|1 year ago

I've had good success running Playwright screenshots through EasyOCR, so parsing the DOM isn't the only way to do it. Granted, tables end up pretty messy...

nsonha|1 year ago

What do you think all these LLM stuff will evolve into? Of course it's moving on from chitchat on stale information and now onto "automate the web" kinda phase, like it or not.