top | item 41110456

jackienotchan|1 year ago

Affected companies are becoming increasingly frustrated with the army of AI crawlers out there, as they won't stick to any scraping best practices (respect robots.txt, use public APIs, avoid generating peak load). It's not necessarily about copyright; the heavy scraping traffic also leads to increased infra costs.

What's the endgame here? AI can already solve captchas, so the arms race for bot protection is pretty much lost.

hibikir|1 year ago

The idea is not to make scraping impossible, but to make it expensive. A human doesn't make requests as fast as a bot, so the pretend human is still rate limited. Eventually you need an account, tracking of that also happens, accounts matching specific patterns get purged, and so on. This will not stop scraping, but the point is not to stop it, only to make it expensive and slow. Eventually it becomes expensive enough that it's better to not pretend to be a human, pay for a license instead, and then the arms race goes away.

Can defenses be good enough that it's better to not even try to fight? That's a far harder question than whether a random bot can make a dozen requests pretending to be human.

amiga386|1 year ago

I liked the analogy to Gabe Newell's "piracy is a service problem" adage, embodied in Virgin API consumer vs Chad third-party scraper https://x.com/gf_256/status/1514131084702797827

Make it easier to get the data, put up fewer roadblocks for legitimate access, and you'll find fewer scrapers. Even if you make scraping _very_ hard, people will still prefer scraping if legitimate use is even more cumbersome, or if you refuse to offer a legitimate option at all.

Admittedly, we are talking here because some people are scraping OSM when they could get the entire dataset for free... but I'm hoping these people are outliers, and most consume the non-profit org's data in the way they ask.

thomasahle|1 year ago

The only way I can see to make scraping truly expensive is to build JavaScript bitcoin mining into every request.

kjkjadksj|1 year ago

Seems to me we might eventually hit a point where stuff like API access is whitelisted. You will have to build a real relationship with a real human at the company to validate you aren't a bot. This might include an in-person meeting, as anything else could be spoofed. Back to the 1960s business world we go. Thanks, technologists, for pulling the rug out from under us all.

bunderbunder|1 year ago

Scraping implies they're not using an API - they're accessing the site as a user agent. And whitelisting access to the actual web pages isn't a tenable option for many websites. Humans generally hate being forced to sign up for an account before they can see a page they found in a Google search.

tedivm|1 year ago

Scraping often uses the same APIs that the website itself does, so to make that work a lot of sites will have to put their content behind authentication of some sort.

For example, I have a project that crawls the SCP Wiki (following best practices, rate limiting, etc.). If they were to restrict the API that I use, it would break the website for people, so if they do want to limit access, they have no choice but to put it behind credentials they can trace back to a user, eliminating the public site itself. For a lot of sites that's just not reasonable.
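The best-practice crawling described here (check robots.txt, rate-limit yourself) can be sketched roughly like this; the user-agent name, robots.txt rules, and one-request-per-second interval are made up for illustration, not taken from the actual SCP Wiki project:

```python
import time
from urllib.robotparser import RobotFileParser

class PoliteCrawler:
    """Crawler that honors robots.txt and rate-limits its own requests."""

    def __init__(self, user_agent, robots_lines, min_interval=1.0):
        self.user_agent = user_agent
        self.min_interval = min_interval  # seconds between requests
        self.last_request = 0.0
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)

    def allowed(self, url):
        # Respect the site's robots.txt rules for our user agent.
        return self.robots.can_fetch(self.user_agent, url)

    def wait_turn(self, now=None):
        # Compute how long to sleep so we make at most one request
        # per min_interval; the caller would time.sleep() this value.
        now = time.monotonic() if now is None else now
        delay = max(0.0, self.last_request + self.min_interval - now)
        self.last_request = now + delay
        return delay

# Hypothetical robots.txt content for illustration:
robots = ["User-agent: *", "Disallow: /forum/"]
crawler = PoliteCrawler("my-scp-crawler", robots, min_interval=1.0)
```

The point of the sketch is that politeness lives entirely on the client; nothing stops an impolite crawler from skipping both checks, which is the whole problem the thread is about.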

smt88|1 year ago

You can't whitelist and also have a consumer-facing service. There is no reliable way to differentiate between a legitimate user and the AI company's scraper.

disqard|1 year ago

Yep, it reminds me of the Ferrari almost-scam that was thwarted because the target thought to verify by asking about something that was only shared in-person.

brightball|1 year ago

I could definitely see this. I worked for a company that had a few popular free inspector tools on their website. The constant traffic load of bots was nuts.

__MatrixMan__|1 year ago

I don't know if the AIs have an endgame in mind. As for the humans, I think it's an internet built for a dark forest. We'll stop assuming that everything is benign except for the malicious parts, which we track and block. Instead we'll assume that everything is malicious except for the parts our explicitly trusted circle of peers has endorsed. When we get burned, we'll prune the trust relationship that misled us, and we'll find ways to incentivize the kind of trust hygiene necessary to make that work.

When I compare that to our current internet the first thought is "but that won't scale to the whole planet". But the thing is, it doesn't need to. All of the problems I need computers to solve are local problems anyway.

bunderbunder|1 year ago

Arguably, trying to scale everything to the whole planet is the root cause of most of these problems. So "that won't scale to the whole planet" might, in the long view, be a feature and not a bug.

MattDaEskimo|1 year ago

API-based interactions w/ Authentication.

Websites previously would have their own in-house API to freely deliver content to anyone who requests it.

Now, a website should be a simple interface for the user: it communicates with an external API and displays the result. It's the user's responsibility to have access to the API.

Any information worth taking should be locked away behind authentication - which has become stupid simple using OAuth with the major providers.

So these people trying to extract content by paying someone or using a paid service would be better off using the API, which packages it for them and is fairly priced.

Lastly, robots.txt should be enforced by law. There is no difference between stealing something from a store and stealing content from a website.

AI (and greed) has killed the open freedoms of the Internet.

candiddevmike|1 year ago

Invite only authenticated islands based on trust. Which seems like the end result of the rampant centralization of the internet.

zeroCalories|1 year ago

The open web is on a crash course. I don't necessarily believe in copyright claims, but I think it makes sense to aggressively prosecute scrapers for DDOSing.

tempfile|1 year ago

An optimistic outcome would be that public content becomes fully peer-to-peer. If you want to download an article, you must seed at least the same amount of bandwidth to serve another copy. You still have to deal with leechers, I guess.

mahdi7d1|1 year ago

There is no point in protecting against bots with regular captchas (it seems I'm weaker than your average bot at passing those). Brave Search has a proof-of-work captcha, and every time I face it I'm glad it's not Google's choose-the-bicycle one. Having a captcha be a heavy process run for a couple of seconds might be a nuisance to me, who needs to complete it once a day, but for someone who has to do it many times for scraping, the costs add up rather quickly. And its fundamental mechanism makes its effectiveness independent of how much progress AI has made.

Also, maybe the recent rise in captcha difficulty is not companies making them harder to prevent bots, but rather bots skewing the right answer. As I understand it, captchas work based on other users' answers, so if a huge portion of those other users are bots, they can fool the algorithm into thinking their wrong answer is the right one.

MattGaiser|1 year ago

Feed bad data to heavy users. Instead of blocking, use poison.

tempfile|1 year ago

Presumes you can distinguish the heavy users. If you knew who the heavy users were, you could just block them.

bgorman|1 year ago

Web Attestation, cryptography to the rescue.

rs999gti|1 year ago

How? Watermark everything with a hash?

yifanl|1 year ago

You can rather easily set up semi-hard rate limiting with a proof-of-work scheme. It will barely affect human users, while bot spammers have to eat the cost of a million hash puzzles per hour or whatever.
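A minimal sketch of such a proof-of-work scheme, in the hashcash style (the hash function, challenge format, and difficulty here are illustrative assumptions): the server issues a random challenge, the client brute-forces a nonce whose hash has a required number of leading zero hex digits, and the server verifies the answer with a single hash.

```python
import hashlib
import secrets

DIFFICULTY = 4  # leading zero hex digits required; tune to target solve time

def make_challenge():
    # Server side: a fresh random challenge per request or session.
    return secrets.token_hex(16)

def solve(challenge, difficulty=DIFFICULTY):
    # Client side: brute-force a nonce. Expected cost grows ~16**difficulty hashes.
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce
        nonce += 1

def verify(challenge, nonce, difficulty=DIFFICULTY):
    # Server side: a single hash to check, no matter how hard solving was.
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

The asymmetry is the point: verification is one hash, solving is thousands to millions. A human amortizes one solve per page or session; a scraper making millions of requests pays the full cost every time, regardless of how clever its AI is.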

dartos|1 year ago

Yep. That works well enough for password hashing algorithms to deter brute force attackers.

This is a similar situation.

zkid18|1 year ago

Many would oppose the idea, but if any service (e.g. eBay, LinkedIn, Facebook) were to dump a snapshot to S3 every month, that could be a solution. You can't prevent scraping anyway.

Firefishy|1 year ago

We publish a live stream of minutely-updated OpenStreetMap data in ready-to-digest form on https://planet.openstreetmap.org/ and S3. Scraping of our data still happens.

Our S3 bucket is thankfully supported by the AWS Open Data Sponsorship Program.

dorgo|1 year ago

Would the snapshot contain the same info (beyond any doubt) that an actual user would see if they opened LinkedIn/Facebook/some service from Canada on an iPhone on a Saturday morning, for example? If not, the snapshot is useless for some use cases and we are back to scraping.

glitchc|1 year ago

Data from S3 isn't free, though; it still costs money and has a limit based on the tier you purchase.

Scoundreller|1 year ago

Yeah, you can get dumps of Wikipedia and stackoverflow/stackexchange that way.

(Not sure if created by the admins or a 3rd party, but done once for many is better than overlapping individual efforts).

MisterBastahrd|1 year ago

How long before companies start putting AI restrictions on new account creation simply because of the sheer amount of noise and storage issues associated with bot spam?

zild3d|1 year ago

isn't the answer just rate limiting unauthenticated requests to a level that's reasonable/expected for a human?
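The simplest version of this is a token bucket per unauthenticated client. A rough sketch, with the one-request-per-second rate and burst size as illustrative numbers (and keying by IP is exactly the weakness the reply below points out):

```python
import time

class TokenBucket:
    """Per-client rate limiter: allows short bursts, caps the sustained rate."""

    def __init__(self, rate=1.0, burst=10, now=None):
        self.rate = rate    # tokens refilled per second
        self.burst = burst  # maximum bucket size (burst allowance)
        self.tokens = float(burst)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # request allowed
        return False      # request rejected (e.g. respond with HTTP 429)

# One bucket per unauthenticated client, keyed e.g. by IP address:
buckets = {}

def check(ip, now=None):
    bucket = buckets.setdefault(ip, TokenBucket(rate=1.0, burst=10, now=now))
    return bucket.allow(now)
```

The `now` parameter exists only to make the sketch testable; in production you would let it default to the monotonic clock.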

thomasahle|1 year ago

No, the scrapers can just spread over lots of IPs.

agilob|1 year ago

lower max upload speed for certain IPs to 5kb/s

jgalt212|1 year ago

> What's the endgame here?

We've had good success with

- Cloudflare Turnstile

- Rate Limiting (be careful here, as some of these scrapers use large numbers of IP addresses and User Agents)

londons_explore|1 year ago

> AI can already solve captchas, so the arms race for bot protection is pretty much lost.

Require login, then verify the user account is associated with an email address at least 10 yrs old. Pretty much eliminates bots. Eliminates a few real users too, but not many.

tempfile|1 year ago

> require login

this is not a solution if you want a public internet (and sites that don't care about the public internet already don't have a problem)

_heimdall|1 year ago

I must be an outlier here, but I don't keep email addresses that long. After a couple of years they're on too many spam lists. I'll wind those addresses down, use them only for short interactions I expect spam from, and ultimately close them down completely the next cycle.

At best any email I have is 4 or 5 years old.

azemetre|1 year ago

How does one find the age of an email account?

mcherm|1 year ago

This is about OpenStreetMap, so you are proposing that my minor daughter not be allowed to read a map?