There’s a lot of data that we should have programmatic access to that we don’t.
The fact that I can’t get my own receipt data from online retailers is unacceptable. I built a CLI Puppeteer scraper to scrape sites like Target, Amazon, Walmart, and Kroger for precisely this reason.
Any website that has my data and doesn’t give me access to it is a great target for scraping.
I'd say scrapers have always been popular, but I imagine they're even more popular nowadays with all the tools (AI but also non-AI) readily available to do cool stuff on a lot of data.
Bingo. During the pandemic, I started a project to keep myself busy by trying to scrape stock market ticker data and then do some analysis and make some pretty graphs out of it. I know there are paid services for this, but I wanted to pull it from various websites for free. It took me a couple months to get it right. There are so many corner cases to deal with if the pages aren't exactly the same each time you load them. Now with the help of AI, you can slap together a scraping program in a couple of hours.
There's been a large push toward server-side rendering for web pages, which means that companies no longer have a publicly facing API to fetch the data they display on their websites.
Parsing the rendered HTML is the only way to extract the data you need.
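For what it's worth, here is a minimal sketch of what "parsing the rendered HTML" can look like, using only Python's standard-library `html.parser`. The markup in `SAMPLE_HTML`, the `order-items` id, and the names `TableExtractor` and `extract_rows` are made up for illustration; a real retailer's page will be structured differently and will usually need a headless browser to render first.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return table rows found in an HTML string as lists of cell text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical server-rendered fragment, not taken from any real site.
SAMPLE_HTML = """
<table id="order-items">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Coffee beans</td><td>$12.99</td></tr>
  <tr><td>Paper towels</td><td>$8.49</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

The corner cases mentioned upthread show up right here: if the page nests tables or changes its markup between loads, a parser this simple needs extra handling.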
I've had good success running Playwright screenshots through EasyOCR, so parsing the DOM isn't the only way to do it. Granted, tables end up pretty messy...
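On the messy tables: one way to tidy OCR output is to group the detected text boxes into rows by vertical position. The sketch below is not the setup described above, just an illustration. It assumes the input has the shape EasyOCR's `readtext` returns by default, a list of `(bbox, text, confidence)` tuples where `bbox` is four `[x, y]` corner points. The function name `group_into_rows`, the `y_tolerance` value, and the sample boxes are all hypothetical.

```python
def group_into_rows(ocr_results, y_tolerance=10):
    """Group OCR detections into table-like rows.

    ocr_results: list of (bbox, text, confidence) tuples, where bbox is a
    list of four [x, y] corner points (assumed EasyOCR-style output).
    Returns a list of rows, each a list of strings ordered left to right.
    """
    items = []
    for bbox, text, _confidence in ocr_results:
        xs = [point[0] for point in bbox]
        ys = [point[1] for point in bbox]
        y_center = sum(ys) / len(ys)
        items.append((y_center, min(xs), text))

    # Walk the boxes top to bottom; start a new row whenever a box sits
    # more than y_tolerance pixels below the first box of the current row.
    items.sort()
    rows = []
    row_y = None
    for y_center, x_left, text in items:
        if row_y is None or abs(y_center - row_y) > y_tolerance:
            rows.append([])
            row_y = y_center
        rows[-1].append((x_left, text))

    # Within each row, order the cells by their left edge.
    return [[text for _, text in sorted(row)] for row in rows]


# Made-up detections standing in for real OCR output of a receipt table.
sample = [
    ([[200, 10], [260, 10], [260, 30], [200, 30]], "$12.99", 0.98),
    ([[10, 12], [120, 12], [120, 32], [10, 32]], "Coffee beans", 0.95),
    ([[10, 52], [130, 52], [130, 72], [10, 72]], "Paper towels", 0.93),
    ([[200, 50], [250, 50], [250, 70], [200, 70]], "$8.49", 0.97),
]

table = group_into_rows(sample)
for row in table:
    print(row)
```

A fixed pixel tolerance is a crude choice; it would likely need tuning per screenshot resolution and font size.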
What do you think all this LLM stuff will evolve into? Like it or not, it's moving on from chitchat over stale information and into an "automate the web" kind of phase.
adamtaylor_13|1 year ago
drusepth|1 year ago
bongodongobob|1 year ago
rietta|1 year ago
luigi23|1 year ago
CSMastermind|1 year ago
kordlessagain|1 year ago
nsonha|1 year ago