The Internet isn’t possible without scraping. For all the sentiment against scraping public data, it remains legal and essential to many of the services we use every day. I think the right move is to set guidelines and shape the web to reduce friction around fair usage, rather than turning the issue political.
r_singh|4 months ago
I run an e-commerce-specific scraping API that helps developers access SERP, PDP, and reviews data. I've noticed the web already has unspoken balances: certain traffic patterns and techniques are tolerated, others clearly aren’t. Most sites handle reasonable, well-behaved crawlers just fine.
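For concreteness, a "well-behaved" crawler usually means one that checks robots.txt before fetching and rate-limits itself. A minimal sketch using Python's standard `urllib.robotparser`; the bot name, URLs, and inline robots.txt are illustrative placeholders, not from any real site:

```python
import time
from urllib import robotparser

USER_AGENT = "example-polite-bot"  # hypothetical crawler name

# In practice you'd fetch https://example.com/robots.txt over HTTP;
# here we parse an inline copy so the sketch is self-contained.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def polite_fetch(url: str) -> bool:
    """Consult robots.txt before fetching and honor the crawl delay."""
    if not rp.can_fetch(USER_AGENT, url):
        return False  # disallowed path: a well-behaved crawler skips it
    delay = rp.crawl_delay(USER_AGENT) or 1
    time.sleep(delay)  # rate-limit so we don't hammer the site
    # ... the actual HTTP GET would go here ...
    return True

print(polite_fetch("https://example.com/private/page"))  # False: disallowed
print(polite_fetch("https://example.com/products/1"))    # True: allowed
```

Respecting `Crawl-delay` (or an equivalent self-imposed rate limit) is usually what separates tolerated traffic from the patterns that get blocked.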
Platforms claim ownership of UGC and public data through dark patterns and narrative control. The current guidelines reflect supplier convenience, and there are several cases where absolutely fundamental web services, run by the largest companies in the world, themselves breach those guidelines (including services funded by the fund running this site). We need standards that treat public data as a shared resource with predictable, ethical access for everyone, not just those with scale or lobbying power.
Cthulhu_|4 months ago
Even if there is legislation or whatever, you can sue an OpenAI or a Microsoft, but it's trivial to start a new company that scrapes and sells the data on to the highest bidder.
r_singh|4 months ago
And for the record, large companies regularly ignore robots.txt themselves: LinkedIn, Google, OpenAI, and plenty of others.
The reality is that it’s the big players who behave like the aggressors, shaping the rules and breaking them when convenient. Smaller developers aren’t the problem; they’re just easier to punish.
georgefrowny|4 months ago
Today there are far too many people scraping stuff that isn't intended to be scraped, for profit, and doing it in a heavy-handed way that does have a real, ongoing impact on the victim's capacity. That ranges from AI services too lazy or otherwise unwilling to cache, to companies exfiltrating some kind of data for their own commercial purposes.