top | item 45775259

bakql | 4 months ago

>These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.

"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.

Yes, I know about weev. That was a travesty.

jraph|4 months ago

When I open an HTTP server to the public web, I expect and welcome GET requests in general.

However,

(1) there's a difference between (a) a regular user browsing my websites and (b) robots DDoSing them. It was never okay to hammer a webserver. This is not new; it's for this reason that curl has long had options to throttle repeated requests to servers. In real life, there are many instances of things being offered for free, and it's usually not okay to take it all. Yes, that would be abuse. And no, the correct answer to such a situation would not be "but it was free, don't offer it for free if you don't want it to be taken for free". Same thing here.
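To illustrate the throttling point, here's a minimal Python sketch of the kind of pacing a polite client applies between requests to the same host (the `Throttle` class is hypothetical, invented for this example; curl's `--limit-rate` and wget's `--wait` exist for the same reason):

```python
import time

class Throttle:
    """Enforce a minimum delay between successive requests to one host.

    Hypothetical helper, just for illustration: polite crawlers apply
    this kind of pacing so they never hammer a server.
    """
    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self.last_request = 0.0  # monotonic timestamp of the previous request

    def wait(self):
        # Sleep just long enough that requests end up at least
        # min_interval seconds apart, then record this request.
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()
```

A client would call `wait()` before each GET; the point is simply that pacing has been a solved problem on the client side for decades.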

(2) there's a difference between (a) a regular user reading my website or even copying and redistributing my content as long as the license of this work / the fair use or related laws are respected, and (b) a robot counterfeiting it (yeah, I agree with another commenter, theft is not the right word, let's call a spade a spade)

(3) well-behaved robots are expected to respect robots.txt. This is not the law, this is about being respectful. It is only fair that badly-behaved robots get called out.
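For reference, checking a robots.txt rule before fetching is a few lines with Python's standard library (the rules below are made up for the example):

```python
import urllib.robotparser

# Made-up robots.txt content, just for the example.
rules = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# A well-behaved robot checks before fetching:
print(rp.can_fetch("SomeBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("SomeBot", "https://example.com/blog/post"))     # True
```

The check costs essentially nothing, which is part of why ignoring it reads as disrespect rather than oversight.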

Well-behaved robots do not usually use millions of residential IPs through shady apps to "perform a GET request to an open HTTP server".

Cervisia|4 months ago

> robots.txt. This is not the law

In Germany, it is the law. § 44b UrhG says (translated):

(1) Text and data mining is the automated analysis of one or more digital or digitized works to obtain information, in particular about patterns, trends, and correlations.

(2) Reproductions of lawfully accessible works for text and data mining are permitted. These reproductions must be deleted when they are no longer needed for text and data mining.

(3) Uses pursuant to paragraph 2, sentence 1, are only permitted if the rights holder has not reserved these rights. A reservation of rights for works accessible online is only effective if it is in machine-readable form.

Aloisius|4 months ago

> Well behaved robots do not usually use millions of residential IPs

Some antivirus and parental control software will scan links sent to someone from their machine (or from access points/routers).

Even some antivirus services will fetch links from residential IPs in order to detect malware from sites configured to serve malware only to residential IPs.

Actually, I'm not entirely sure how one would tell the difference between user software scanning links to detect adult content/malware/etc., randos crawling the web searching for personal information/vulnerable sites/etc., and these supposed "AI crawlers", just from access logs.

While I'm certainly not going to dismiss the idea that these are poorly configured crawlers at some major AI company, I haven't seen much in the way of evidence that this is the case.

Razengan|4 months ago

> And no, the correct answer to such a situation would not be "but it was free, don't offer it for free if you don't want it to be taken for free".

The answer to THAT could be: "It is free, but leave some for others, you greedy fuck"

grayhatter|4 months ago

If you're lying in the requests you send, to trick my server into returning the content you want, instead of what I would want to return to webscrapers, that's non-consensual.

You don't need my permission to send a GET request, I completely agree. In fact, by having a publicly accessible webserver, there's implied consent that I'm willing to accept reasonable, and valid GET requests.

But I have configured my server to spend server resources the way I want; you don't like how my server works, so you configure your bot to lie. If you get what you want only because you're willing to lie, where's the implied consent?

batch12|4 months ago

Browser user agents have a history of being lies from the earliest days of the web. Official browsers lied about what they were, and still do.

wqaatwt|4 months ago

Somebody concealing or obfuscating various information a browser would send by standard for privacy or other reasons is also “lying” by that standard? Or someone using a VPN?

Calavar|4 months ago

I agree. It always surprises me when people are indignant about scrapers ignoring robots.txt and throw around words like "theft" and "abuse."

robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.

It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch. It's fine for low stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center. Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.

kelnos|4 months ago

> robots.txt is a polite request to please not scrape these pages

People who ignore polite requests are assholes, and we are well within our rights to complain about them.

I agree that "theft" is too strong (though I think you might be presenting a straw man there), but "abuse" can be perfectly apt: a crawler hammering a server, requesting the same pages over and over, absolutely is abuse.

> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.

That's a shitty world that we shouldn't have to live in.

smsm42|4 months ago

"Theft" may be wrong, but "abuse" certainly is not. Human interactions in general, and the web in particular, are built on a certain set of conventions and common behaviors. One of them is that most sites are for consuming information at human paces and volumes, not for downloading their content wholesale. There are specialized sites that are fine with that, but they say so upfront. The average site, especially a hobbyist one, is not that. People who do not abide by this are certainly abusing it.

> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.

Yes, and if the rule of not dumping a ton of manure on your driveway is so important to you, you should live in a gated community and hire round-the-clock security. Some people do, but living in a society where the only way to not wake up with a ton of manure in your driveway is to spend excessive resources on security is not the world that I would prefer to live in. And I don't see why people would spend time to prove this is the only possible and normal world - it's certainly not the case, we can do better.

watwut|4 months ago

If you ignore a polite request, then it is perfectly OK to give you as much false data as possible. You have shown yourself not to be interested in good-faith cooperation, which means other people can and should treat you as a jerk.

bigbuppo|4 months ago

Seriously. Did you see what that web server was wearing? I mean, sure it said "don't touch me" and started screaming for help and blocked 99.9% of our IP space, but we got more and they didn't block that so clearly they weren't serious. They were asking for it. It's their fault. They're not really victims.

bigiain|4 months ago

> robots.txt is a polite request to please not scrape these pages

At the same time, an HTTP GET request is a polite request to respond with the expected content. There is no binding agreement that my webserver sends you the webpage you asked for. I am at liberty to enforce my no-scraping rules however I see fit.

I get to choose whether I'm prepared to accept the consequences of a "real user" tripping my web-scraping detection thresholds and getting firewalled, served nonsense, or zipbombed (or whatever countermeasure I choose). Perhaps that'll drive away a reader (or customer) who opens 50 tabs to my site all at once; perhaps Google will send a badly behaved bot and miss indexing some of my pages, or even deindex my site. For my personal site I'm 100% OK with those consequences. For work's website I still use countermeasures but set the thresholds significantly more conservatively. For production webapps I use different but still strict thresholds and different countermeasures.

Anybody who doesn't consider typical AI company's webscraping behaviour over the last few years to qualify as "abuse" has probably never been responsible for a website with any volume of vaguely interesting text or any reasonable number of backlinks from popular/respected sites.

hsbauauvhabzb|4 months ago

How else do you tell the bot you do not wish to be scraped? Your analogy is lacking - you didn’t order a package, you never wanted a package, and the postman is taking something, not leaving it, and you’ve explicitly left a sign saying ‘you are not welcome here’.

grayhatter|4 months ago

> I agree. It always surprises me when people are indignant about scrapers ignoring robots.txt and throw around words like "theft" and "abuse."

This feels like the kind of argument some would make as to why they aren't required to return their shopping cart to the bay.

> robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.

Well, no. That's an overly simplistic description which fits your argument, but doesn't accurately represent reality. Yes, robots.txt was created as a hint for robots, but it was never expected to be treated as non-binding. The important detail, the one that explains why it's called robots.txt, is that the web server exists to serve the requests of humans. Robots are welcome too, but please follow these rules.

You can tell your description is completely inaccurate and unrepresentative of the expectations of the web as a whole, because every popular LLM scraper goes out of its way to both follow robots.txt and announce that it does so.

> It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch.

It's nothing like that, it's more like a note that says no soliciting, or please knock quietly because the baby is sleeping.

> It's fine for low stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center.

Or, people could not be assholes? Yes, I get it: in the reality we live in, there are assholes. But the problem as I see it is not just the assholes, but the people who act as apologists for this clearly deviant behavior.

> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.

Because it's your fault if you don't, right? That's victim blaming. I want to be able to host free, easy to access content for humans, but someone with more money, and more compute resources than I have, gets to overwhelm my server because they don't care... And that's my fault, right?

I guess that's a take...

There's a huge difference between suggesting mitigations for dealing with someone abusing resources, and excusing the abuse of resources, or implying that I should expect my server to be abused, instead of frustrated about the abuse.

mxkopy|4 months ago

The metaphor doesn’t work. It’s not the security of the package that’s in question, but something like whether the delivery person is getting paid enough or whether you’re supporting them getting replaced by a robot. The issue is in the context, not the protocol.

whimsicalism|4 months ago

There's an evolving morality around the internet that is very, very different from the pseudo-libertarian rule of the jungle I was raised with. Interesting to see things change.

smsm42|4 months ago

You are still trying to pretend that accessing an HTTP server once and burying it under an avalanche of never-stopping bot crawlers are the same thing? And spam is the same as "sending an email" and should be treated the same? I thought in this day and age we were past that.

1gn15|4 months ago

If you're trying to say DDoS, just say that.

malfist|4 months ago

If I set out a bowl of candy for trick-or-treaters, I wouldn't be okay with the first adult strolling by and taking everything.

righthand|4 months ago

Then cutting up the candy and taping pieces together in the most statistically pleasing way, and finally selling all of the stolen Frankenstein's-monster candy as innovative new candy and the future of humanity.

dylan604|4 months ago

and if they do, you have no recourse, just like with scrapers. with the candy example, you spend your time sitting near the candy bowl supervising. for servers, we have various anti-bot supervisors. however, some asshat with no scruples can still just walk right up to your bowl, empty the contents into their bag, and walk away even with you sitting right there. Unless you're willing to commit violence, there's nothing stopping them. now you're the assailant and the asshat is the victim. you still lose.

sdenton4|4 months ago

The problem is that serving content costs money. LLM scraping is essentially DDoSing content meant for human consumption. DDoSing sucks.

dylan604|4 months ago

running the scraping bots costs money too.

2OEH8eoCRo0|4 months ago

Scraping is legal. DDoSing isn't.

We should start suing these bad actors. Why do techies forget that the legal system exists?

j2kun|4 months ago

You should not have to ask for permission, but you should have to honestly set your user-agent. (In my opinion, this should be the law and it should be enforced)

gkbrk|4 months ago

> In my opinion, this should be the law and it should be enforced

You think people should go to prison if they go to their browser settings and change their user agent?

arccy|4 months ago

yeah, all open HTTP servers are fair game for DDoS because, well, they're open, right?

XenophileJKO|4 months ago

What about people using an LLM as their web client? Are you now saying the website owner should be able to dictate what client I use and how it must behave?

aDyslecticCrow|4 months ago

> Are you now saying the website owner should be able to dictate what client I use and how it must behave?

Already pretty well established with ad-blocking, actually. It's a pretty similar case, even: AIs don't click ads, so why should we accept their traffic? If it's disproportionately loading the server without contributing to the funding of the site, it gets blocked.

The server can set whatever rules it wants. If the maintainer hates google and wants to block all chrome users, it can do so.
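That kind of server-side rule is trivial to express. As a sketch, here's a minimal WSGI middleware (names and the banned token are invented for the example) that refuses requests from a given User-Agent:

```python
def block_user_agents(app, banned=("BadBot",)):
    """Illustrative WSGI middleware: refuse any request whose
    User-Agent header contains a banned token."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(token in ua for token in banned):
            # The server owes nobody a 200; it can answer however it likes.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

Real deployments usually do this at the reverse proxy instead, but the principle is the same: the operator decides which clients get served.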

grayhatter|4 months ago

Yes? I'd suggest that you understand that's not an unreasonable expectation either.

Your browser has a bug: if you leave my webpage open in a tab, because of that bug it's going to close the connection, reconnect (new TLS handshake and everything), and re-request that page without any cache tag, every second, every day, for as long as you have the tab open.

That feels kinda problematic, right?

Web servers block well-formed clients all the time, and I agree with you, that's dumb. But servers should be allowed to serve only the traffic they wish. If you want to use some LLM client, but the way that client behaves puts undue strain on my server, what should I do? Just accept that your client, and by proxy you, are being an asshole?

You shouldn't put your rules on my webserver, exactly as much as my webserver shouldn't put its rules on yours. But I believe that ethically, we should both attempt to respect and follow the rules of the other, blocking traffic when it starts to behave abusively. It's not complex: just try to be nice and help the other as much as you reasonably can.

munk-a|4 months ago

I think there's a massive shift in what the letter of the law needs to be to match the intent. The letter hasn't changed, and this is all still quite legal, but there is a significant difference between what webscraping was doing to impact creative lives five years ago and today. It was always possible for artists to have their content stolen and for creative works to be reposted, but there were enough IP laws around image sharing (which AI disingenuously steps around), and other creative work wasn't monetarily efficient to scrape.

I think there is a really different intent between reading something someone created (which is often a form of marketing) and reproducing-but-modifying someone's creative output (which competes against and starves the creative of income).

The world changed really quickly and our legal systems haven't kept up. It is hurting real people who used to have small side businesses.

codyb|4 months ago

The sign on the door said "no scrapers", which as far as I know is not a protected class.

anon10484810573|4 months ago

This mindset really baffles me. Just because it is not illegal doesn't mean one should do it. And for anything truly innovative there are bound to be gaps in the current law.

It's pretty obvious that there is an asymmetry in benefit between those creating the models and those creating the content. If that doesn't bother you consider the fact that this currently undermines the economic and social model for open content creation on the internet.

What happens when the content significantly decreases?

Should those who create content not have some say in how their content is used?

davesque|4 months ago

I mean, it costs money to host content. If you are hosting content for bots, fine; but if the money you're paying to host it is meant to benefit human users (the reason for robots.txt), then yeah, you ought to ask permission. Content might also be copyrighted. Honestly, I don't even know why I'm bothering to mention these things, because it just feels obvious. LLM scrapers obviously want as much data as they can get, whether or not they act like assholes (ignoring robots.txt) or criminals (ignoring copyright) to get it.

isodev|4 months ago

Ah yes, the “it’s ok because I can” school of thought. As if that was ever true.

Lionga|4 months ago

So if a house is not locked, I can take whatever I want?

Ylpertnodi|4 months ago

Yes, but you may get caught, and then suffer 'consequences'. I can drive well over 220 km/h on the autobahn (Germany, Europe), and also in France (also in Europe). One is acceptable; the other will get me Royale-e fucked. If they can catch me.