bakql|4 months ago
"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.
Yes, I know about weev. That was a travesty.
jraph|4 months ago
However,
(1) there's a difference between (a) a regular user browsing my websites and (b) robots DDoSing them. It was never okay to hammer a webserver. This is not new, and it's for this reason that curl has had options to throttle repeated requests to servers forever. In real life, there are many instances of things being offered for free, it's usually not okay to take it all. Yes, this would be abuse. And no, the correct answer to such a situation would not be "but it was free, don't offer it for free if you don't want it to be taken for free". Same thing here.
(2) there's a difference between (a) a regular user reading my website or even copying and redistributing my content as long as the license of this work / the fair use or related laws are respected, and (b) a robot counterfeiting it (yeah, I agree with another commenter, theft is not the right word, let's call a spade a spade)
(3) well-behaved robots are expected to respect robots.txt. This is not the law, this is about being respectful. It is only fair that badly behaved robots get called out.
Well-behaved robots do not usually use millions of residential IPs through shady apps to "perform a GET request to an open HTTP server".
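For what it's worth, honoring robots.txt (including a Crawl-delay, which also addresses the throttling point above) takes only a few lines of Python's standard library. A minimal sketch; the rules and URLs are made up for illustration:

```python
# Sketch of how a well-behaved robot might consult robots.txt before
# fetching. Uses only the standard library; this robots.txt is invented.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("MyBot/1.0", "https://example.com/private/page"))  # False
print(parser.can_fetch("MyBot/1.0", "https://example.com/public/page"))   # True
print(parser.crawl_delay("MyBot/1.0"))  # 10 (seconds between requests)
```

A crawler that checks `can_fetch` and sleeps for `crawl_delay` between requests is the "respectful" behavior the comment describes; nothing enforces it, which is exactly the point.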
Cervisia|4 months ago
In Germany, it is the law. § 44b UrhG says (translated):
(1) Text and data mining is the automated analysis of one or more digital or digitized works to obtain information, in particular about patterns, trends, and correlations.
(2) Reproductions of lawfully accessible works for text and data mining are permitted. These reproductions must be deleted when they are no longer needed for text and data mining.
(3) Uses pursuant to paragraph 2, sentence 1, are only permitted if the rights holder has not reserved these rights. A reservation of rights for works accessible online is only effective if it is in machine-readable form.
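Paragraph 3's "machine-readable form" requirement can in practice be signaled by things like robots.txt rules or an HTTP header; the W3C's proposed TDM Reservation Protocol uses a `tdm-reservation` header. A hedged sketch of how a crawler might check for such a signal (whether any given signal actually satisfies § 44b is a legal question this code cannot decide):

```python
# Illustrative only: check response headers for a machine-readable
# text-and-data-mining opt-out. The "tdm-reservation" header comes from
# the W3C TDM Reservation Protocol proposal; this is not legal advice.
def rights_reserved(headers):
    """Return True if the headers signal a TDM rights reservation."""
    return headers.get("tdm-reservation") == "1"

print(rights_reserved({"tdm-reservation": "1"}))       # True
print(rights_reserved({"content-type": "text/html"}))  # False
```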
Aloisius|4 months ago
Some antivirus and parental control software will scan links sent to someone from their machine (or from access points/routers).
Even some antivirus services will fetch links from residential IPs in order to detect malware from sites configured to serve malware only to residential IPs.
Actually, I'm not entirely sure how one would tell the difference, just from access logs, between user software scanning links to detect adult content/malware/etc., randos crawling the web searching for personal information/vulnerable sites/etc., and these supposed "AI crawlers".
While I'm certainly not going to dismiss the idea that these are poorly configured crawlers at some major AI company, I haven't seen much in the way of evidence that this is the case.
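This is the level of heuristic that access logs actually support: counting requests per client. An AV link scanner, a vulnerability scanner, and an AI crawler can all look identical here, which is the commenter's point. The log format and threshold below are invented:

```python
# Sketch of a per-IP request-count heuristic over an access log.
# The log lines and the threshold of 3 are made up for illustration.
from collections import Counter

log_lines = [
    "203.0.113.5 GET /post/1",
    "203.0.113.5 GET /post/2",
    "203.0.113.5 GET /post/3",
    "198.51.100.7 GET /post/1",
]

# First whitespace-separated field is the client IP in this toy format.
requests_per_ip = Counter(line.split()[0] for line in log_lines)
heavy_hitters = {ip for ip, n in requests_per_ip.items() if n >= 3}
print(heavy_hitters)  # {'203.0.113.5'}
```

Rate tells you a client is heavy; it tells you nothing about whether it is an AV scanner, a rando, or an AI company.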
Razengan|4 months ago
The answer to THAT could be: "It's free, but leave some for others, you greedy fuck"
grayhatter|4 months ago
You don't need my permission to send a GET request, I completely agree. In fact, by having a publicly accessible webserver, there's implied consent that I'm willing to accept reasonable and valid GET requests.
But I have configured my server to spend its resources the way I want. You don't like how my server works, so you configure your bot to lie. If you only get what you want because you're willing to lie, where's the implied consent?
Calavar|4 months ago
robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.
It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch. It's fine for low-stakes situations, but if package security is of utmost importance to you, you should arrange for certified delivery or pick the package up at the delivery center. Likewise, if enforcing a rule of no scraping is of utmost importance, you need to require an API token or some other form of authentication before you serve the pages.
kelnos|4 months ago
People who ignore polite requests are assholes, and we are well within our rights to complain about them.
I agree that "theft" is too strong (though I think you might be presenting a straw man there), but "abuse" can be perfectly apt: a crawler hammering a server, requesting the same pages over and over, absolutely is abuse.
> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.
That's a shitty world that we shouldn't have to live in.
smsm42|4 months ago
> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.
Yes, and if the rule of not dumping a ton of manure on your driveway is so important to you, you should live in a gated community and hire round-the-clock security. Some people do, but living in a society where the only way to not wake up with a ton of manure in your driveway is to spend excessive resources on security is not the world that I would prefer to live in. And I don't see why people would spend time to prove this is the only possible and normal world - it's certainly not the case, we can do better.
bigiain|4 months ago
At the same time, an HTTP GET request is a polite request to respond with the expected content. There is no binding agreement that my webserver sends you the webpage you asked for. I am at liberty to enforce my no-scraping rules however I see fit.
I get to choose whether I'm prepared to accept the consequences of a "real user" tripping my scraping-detection thresholds and getting firewalled, served nonsense, or zip-bombed (or whatever countermeasure I choose). Perhaps that'll drive away a reader (or customer) who opens 50 tabs to my site all at once; perhaps Google will send a badly behaved bot, miss indexing some of my pages, or even deindex my site. For my personal site I'm 100% OK with those consequences. For work's website I still use countermeasures but set the thresholds significantly more conservatively. For production webapps I use different but still strict thresholds and different countermeasures.
Anybody who doesn't consider typical AI companies' webscraping behaviour over the last few years to qualify as "abuse" has probably never been responsible for a website with any volume of vaguely interesting text or any reasonable number of backlinks from popular/respected sites.
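The per-client thresholds described above are commonly implemented as a token bucket: each client gets a refill rate and a burst allowance, and requests beyond that trip the countermeasure. A minimal sketch with an injected clock so the logic is testable; the parameter values are invented:

```python
# Token-bucket sketch of a per-client request threshold. The clock is
# injected for deterministic testing; rate and burst values are invented.
class TokenBucket:
    def __init__(self, rate, burst, now):
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # maximum bucket size
        self.now = now        # injected clock function
        self.tokens = burst
        self.updated = now()

    def allow(self):
        """Spend one token if available; False means apply a countermeasure."""
        t = self.now()
        self.tokens = min(self.burst, self.tokens + (t - self.updated) * self.rate)
        self.updated = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over threshold: firewall, nonsense, or zip bomb

clock = [0.0]
bucket = TokenBucket(rate=1.0, burst=2, now=lambda: clock[0])
results = [bucket.allow() for _ in range(3)]
print(results)  # [True, True, False]: burst exhausted on the third request
clock[0] += 1.0
refill_allowed = bucket.allow()
print(refill_allowed)  # True: one token refilled after a second
```

Tuning `rate` and `burst` per deployment is exactly the "conservative thresholds for work, strict for production" trade-off the comment describes.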
grayhatter|4 months ago
This feels like the kind of argument some would make as to why they aren't required to return their shopping cart to the bay.
> robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.
Well, no. That's an overly simplistic description which fits your argument, but it doesn't accurately represent reality. Yes, robots.txt was created as a hint for robots, but one that was always expected to be honored. The important detail, the one that explains why it's called robots.txt, is that the web server exists to serve the requests of humans. Robots are welcome too, but please follow these rules.
You can tell your description is inaccurate and unrepresentative of the expectations of the web as a whole, because every popular LLM scraper goes out of its way to both follow robots.txt and announce that it follows it.
> It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch.
It's nothing like that, it's more like a note that says no soliciting, or please knock quietly because the baby is sleeping.
> It's fine for low stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center.
Or, people could not be assholes? Yes, I get it: in the reality we live in, there are assholes. But the problem as I see it is not just the assholes, but the people who act as apologists for this clearly deviant behavior.
> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.
Because it's your fault if you don't, right? That's victim blaming. I want to be able to host free, easy to access content for humans, but someone with more money, and more compute resources than I have, gets to overwhelm my server because they don't care... And that's my fault, right?
I guess that's a take...
There's a huge difference between suggesting mitigations for dealing with someone abusing resources, and excusing the abuse of resources, or implying that I should expect my server to be abused, instead of frustrated about the abuse.
2OEH8eoCRo0|4 months ago
We should start suing these bad actors. Why do techies forget that the legal system exists?
gkbrk|4 months ago
You think people should go to prison if they go to their browser settings and change their user agent?
aDyslecticCrow|4 months ago
Already pretty well established with ad blocking, actually. It's a pretty similar case, even: AIs don't click ads, so why should we accept their traffic? If it's disproportionately loading the server without contributing to the funding of the site, it gets blocked.
The server can set whatever rules it wants. If the maintainer hates Google and wants to block all Chrome users, it can do so.
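Mechanically, "block all Chrome users" is just a server-side match on the User-Agent header. A sketch; the blocklist entries and request shape are illustrative, not a real framework API:

```python
# Sketch of an operator-chosen user-agent blocklist. Substring matching
# on User-Agent is trivially evaded by lying clients, which is exactly
# the tension discussed in this thread. Entries here are illustrative.
BLOCKED_AGENT_SUBSTRINGS = ["BadBot", "Chrome"]  # the operator's choice, however petty

def should_block(user_agent):
    """Return True if the request's User-Agent matches the blocklist."""
    ua = user_agent or ""
    return any(s in ua for s in BLOCKED_AGENT_SUBSTRINGS)

print(should_block("Mozilla/5.0 (X11; Linux) Chrome/120.0"))   # True
print(should_block("Mozilla/5.0 (X11; Linux) Firefox/121.0"))  # False
```

Note this only filters honest clients: a crawler that spoofs a browser User-Agent sails through, which is why operators fall back on behavioral thresholds.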
grayhatter|4 months ago
Your browser has a bug: if you leave my webpage open in a tab, because of that bug, it's going to close the connection, reconnect (new TLS handshake and everything), and re-request that page without any cache tag, every second, every day, for as long as you have the tab open.
That feels kinda problematic, right?
Web servers block well-formed clients all the time, and I agree with you, that's dumb. But servers should be allowed to serve only the traffic they wish. If you want to use some LLM client, but the way that client behaves puts undue strain on my server, what should I do? Just accept that your client, and by proxy you, are an asshole?
You shouldn't put your rules on my webserver, exactly as much as my webserver shouldn't put its rules on yours. But I believe that, ethically, we should both attempt to respect and follow the rules of the other, blocking traffic when it starts to behave abusively. It's not complex: just try to be nice and help the other as much as you reasonably can.
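The "re-request without any cache tag" scenario above is what HTTP conditional requests exist to soften: a client that echoes the server's ETag back in If-None-Match gets a tiny 304 instead of the full body. A hand-rolled sketch of the server side, for illustration only:

```python
# Sketch of conditional-GET handling: a client that sends the ETag back
# gets a cheap 304 Not Modified instead of the full body. Hand-rolled
# for illustration; real servers and frameworks do this for you.
import hashlib

def respond(body, if_none_match=None):
    """Return (status, etag, body) for a GET, honoring If-None-Match."""
    etag = '"%s"' % hashlib.sha256(body.encode()).hexdigest()[:16]
    if if_none_match == etag:
        return 304, etag, ""   # client's copy is current: send nothing
    return 200, etag, body     # full response with a validator

status, etag, body = respond("<html>hello</html>")
print(status)  # 200: first request gets the full body
status2, _, body2 = respond("<html>hello</html>", if_none_match=etag)
print(status2, repr(body2))  # 304 '': revalidation costs almost nothing
```

A client that does this can poll every second and still cost the server almost nothing; one that doesn't forces a full render and transfer each time, which is the asymmetry the comment is pointing at.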
munk-a|4 months ago
I think there is a really different intent between an action that reads something someone created (which is often a form of marketing) and one that reproduces-but-modifies someone's creative output (which competes against the creator and starves them of income).
The world changed really quickly and our legal systems haven't kept up. It is hurting real people who used to have small side businesses.
anon10484810573|4 months ago
It's pretty obvious that there is an asymmetry in benefit between those creating the models and those creating the content. If that doesn't bother you consider the fact that this currently undermines the economic and social model for open content creation on the internet.
What happens when the content significantly decreases?
Should those who create content not have some say in how their content is used?