The Internet isn’t possible without scraping. For all the sentiment against scraping public data, it remains legal and essential to many of the services we use every day. I think the right move is to set guidelines and shape the web to reduce friction around fair usage, rather than turning the issue political.
r_singh|4 months ago
I run an e-commerce-specific scraping API that helps developers access SERP, PDP, and reviews data. I've noticed the web already has unspoken balances: certain traffic patterns and techniques are tolerated, others clearly aren’t. Most sites handle reasonable, well-behaved crawlers just fine.
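For concreteness, a "well-behaved" crawler usually means one that checks robots.txt before fetching and rate-limits itself. A minimal sketch using Python's standard `urllib.robotparser`; the bot name, URLs, and inline robots.txt are illustrative placeholders, not from any real site:

```python
import time
from urllib import robotparser

USER_AGENT = "example-polite-bot"  # hypothetical crawler name

# In practice you'd fetch https://example.com/robots.txt over HTTP;
# here we parse an inline copy so the sketch is self-contained.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def polite_fetch(url: str) -> bool:
    """Consult robots.txt before fetching and honor the crawl delay."""
    if not rp.can_fetch(USER_AGENT, url):
        return False  # disallowed path: a well-behaved crawler skips it
    delay = rp.crawl_delay(USER_AGENT) or 1
    time.sleep(delay)  # rate-limit so we don't hammer the site
    # ... the actual HTTP GET would go here ...
    return True

print(polite_fetch("https://example.com/private/page"))  # False: disallowed
print(polite_fetch("https://example.com/products/1"))    # True: allowed
```

Respecting `Crawl-delay` (or an equivalent self-imposed rate limit) is usually what separates tolerated traffic from the patterns that get blocked.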
Platforms claim ownership of UGC and public data through dark patterns and narrative control. The current guidelines reflect supplier convenience, and there are several cases where absolutely fundamental web services, run by the largest companies in the world, themselves breach those guidelines (including services funded by the fund running this site). We need standards that treat public data as a shared resource with predictable, ethical access for everyone, not just those with scale or lobbying power.
Cthulhu_|4 months ago
Even if there is legislation or whatever, you can sue an OpenAI or a Microsoft, but it's trivial to start a new company that scrapes and sells the data on to the highest bidder.
r_singh|4 months ago
And for the record, large companies regularly ignore robots.txt themselves: LinkedIn, Google, OpenAI, and plenty of others.
The reality is that it’s the big players who behave like the aggressors, shaping the rules and breaking them when convenient. Smaller developers aren’t the problem; they’re just easier to punish.
georgefrowny|4 months ago
Today there are far too many people scraping stuff that isn't intended to be scraped, for profit, and doing it in a heavy-handed way that does have a real, ongoing impact on the victim's capacity. That ranges from AI services too lazy or otherwise unwilling to cache, to companies exfiltrating some kind of data for their own commercial purposes.