
Turn any website into a live, structured data feed

36 points | chadwebscraper | 25 days ago | meter.sh

32 comments


arm32|25 days ago

Residential proxies are sketchy at best. How can you guarantee that your service's infrastructure doesn't hinge on an illicit botnet?

chadwebscraper|25 days ago

This is a good callout - I’ve tried my best so far to limit proxy use to cases where it's absolutely necessary, and to stick with reputable providers (even though they're a bit pricier).

Definitely going to give this more thought though, thank you for the comment

dewey|25 days ago

There's a lot of variety in the residential proxy market. Some are sourced from bandwidth sharing SDKs for games with user consent, some are "mislabeled" IPs from ISPs that offer that as a product and then there's a long tail of "hacked" devices. Labeling them generally as sketchy seems wrong.

groby_b|25 days ago

"AntiBot bypass".

I see we continue to aim for high ethical standards throughout the industry.

tglobs|24 days ago

I had my ebike stolen today, a few hours after seeing this, and immediately made an account to watch Craigslist for bike thieves trying to sell it.

If you had asked for $60/month to run it, I would've paid it.

6 attempts later, it's failed every time. I love that it's so easy to throw together things like this, but we need better ways of testing vibe-coded apps.

chadwebscraper|24 days ago

First off, really sorry to hear that it didn't work.

Edit: it looks like you hit an edge case that I didn't see in testing. Happy to explain more, but the extraction was being skipped because of a pre-processing check that Craigslist was failing when it shouldn't have.

Would love if you want to try the tool again, but completely understand if not :)

golfer|25 days ago

As a site owner, how does one opt out of this, since it obviously ignores robots.txt?

chadwebscraper|25 days ago

Shoot me your site and I can blacklist it

cyanydeez|25 days ago

I recommend a pivot: take your structured data approach and build a browser plugin that allows users to pin forums, wiki edits and adverts on any web content they like.

chadwebscraper|25 days ago

This is actually a really interesting thought - like an embedder with live data?

arjunchint|25 days ago

So what happens when the website layout updates - does the monitoring job fail silently?

chadwebscraper|25 days ago

So with APIs, it adjusts. For HTML layouts, it looks at the previous diffs to catch potential errors and then re-indexes.
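
A minimal sketch of that "look at the previous diffs" idea - the field names and the missing-field threshold here are assumptions for illustration, not meter.sh's actual logic:

    # Compare the latest extraction against the last good one; if fields that
    # used to have values suddenly come back empty, treat it as a likely
    # layout change and re-index instead of emitting a webhook.
    def looks_like_layout_change(previous: dict, current: dict,
                                 missing_limit: int = 1) -> bool:
        missing = [
            key for key, old in previous.items()
            if old not in (None, "", []) and not current.get(key)
        ]
        return len(missing) > missing_limit

    previous_run = {"title": "Used ebike - $400", "price": "$400", "posted": "2024-05-01"}
    current_run = {"title": "", "price": None, "posted": "2024-05-02"}

    if looks_like_layout_change(previous_run, current_run):
        print("Layout probably changed - re-index instead of sending a webhook")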

chadwebscraper|25 days ago

Here’s how it works:

1. Paste a URL in, describe what you want

2. Define an interval to monitor

3. Get real-time webhooks of any changes in JSON

Lots of customers are using this across different domains to get consistent, repeatable JSON out of sites and monitor changes.

Supports API + HTML extraction - never write a scraper again!
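
For step 3, a minimal webhook consumer could look something like this - the payload shape (url / changed_at / data) is assumed for illustration, not meter.sh's documented format:

    # Tiny Flask endpoint that receives change notifications as JSON.
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/meter-webhook", methods=["POST"])
    def meter_webhook():
        payload = request.get_json(force=True)
        # e.g. {"url": "...", "changed_at": "...", "data": {...}}  (assumed shape)
        print(f"Change on {payload.get('url')} at {payload.get('changed_at')}")
        print(payload.get("data"))
        return "", 204

    if __name__ == "__main__":
        app.run(port=8000)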

codingdave|25 days ago

Writing a scraper isn't the hard part; that's actually fairly trivial at this point in time. Pulling content into JSON from your scrape is also fairly trivial - libraries exist that handle it well.

The harder parts are things like playing nicely so your bot doesn't get banned by sysadmins, detecting changes downstream from your URL, handling dynamically loading content, and keeping that JSON structure consistent even as your sites change their content, their designs, etc. Also, scalability. One customer I'm talking to could use a product like this, but they have 100K URLs to track, and that is more than I currently want to deal with.

I absolutely can see the use case for consistent change data from a URL, I'm just not seeing enough content in your marketing to know whether you really have something here, or if you vibe coded a scraper and are throwing it against the wall to see if it sticks.
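
One of those "harder parts" - playing nicely so the bot doesn't get banned - boils down to something like this standard-library sketch; the user agent and the fallback delay are illustrative assumptions:

    # Check robots.txt and honor a crawl delay before fetching a page.
    import time
    import urllib.robotparser
    import urllib.request

    USER_AGENT = "example-monitor-bot/0.1"   # assumed name, not a real bot
    DEFAULT_DELAY = 5                        # assumed fallback, in seconds

    def polite_fetch(url: str) -> bytes | None:
        root = "/".join(url.split("/", 3)[:3])
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(root + "/robots.txt")
        rp.read()
        if not rp.can_fetch(USER_AGENT, url):
            return None  # the site opted out; skip it
        time.sleep(rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY)
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read()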

tmaly|25 days ago

This must wreck their Google Analytics stats

the_arun|25 days ago

What is a strategy? You need to elaborate on that in the pricing.

chadwebscraper|25 days ago

Thank you for the feedback - agreed.

It’s an extraction pattern for a certain site, so you can reuse it. Think of a pattern that extracts all forum posts, then using it on different pages with the same format - like the "show new", "show", and "new" post pages on HN.
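
As a rough sketch of what a reusable pattern like that could look like - the selectors are guesses at HN's current markup, not meter.sh's actual strategy format:

    # One set of selectors, reused across HN listing pages with the same layout.
    import requests
    from bs4 import BeautifulSoup

    HN_POST_STRATEGY = {
        "row": "tr.athing",
        "title": "span.titleline > a",
    }

    def extract_posts(url: str, strategy: dict) -> list[dict]:
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        posts = []
        for row in soup.select(strategy["row"]):
            link = row.select_one(strategy["title"])
            if link:
                posts.append({"title": link.get_text(strip=True), "url": link.get("href")})
        return posts

    # Same strategy, different pages with the same format.
    for page in ("https://news.ycombinator.com/newest", "https://news.ycombinator.com/show"):
        print(page, len(extract_posts(page, HN_POST_STRATEGY)))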