dannyobrien|1 month ago
There's something important here in that a public good like Metabrainz would be fine with the AI bots picking up their content -- they're just doing it in a frustratingly inefficient way.
It's a co-ordination problem: Metabrainz assumes good intent from bots, and has to lock down when they violate that trust. The bots have a different model -- they assume that the website is adversarially "hiding" its content. They won't believe a random site when it says "Look, stop hitting our API, you can pick up all of this data in one go, over in this gzipped tar file."
Or better still, this torrent file, where the bots would briefly end up improving the shareability of the data.
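For illustration, here's a minimal sketch of what such a one-shot dump could look like, using only Python's stdlib; the record layout and filenames are invented for this example, not anything Metabrainz actually serves:

```python
import io
import json
import tarfile

def make_bulk_dump(records):
    """Pack a site's records into one gzipped tar, so a well-behaved
    bot can fetch everything in a single request instead of millions.
    `records` maps a name to a JSON-serializable record (assumed layout)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, record in records.items():
            data = json.dumps(record).encode()
            info = tarfile.TarInfo(name=f"{name}.json")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

A crawler that trusted the site could download this one file instead of walking every page.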
fartfeatures|1 month ago
What mechanism does a site have for doing that? I don't see anything in the robots.txt standard about being able to set priority, but I could be missing something.
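As far as I know the core robots.txt standard (RFC 9309) indeed has no priority or "bulk dump" directive; the closest widely-recognized extension is the non-standard Crawl-delay line, which Python's stdlib parser happens to expose:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt that asks bots to slow down and skip the API.
rp = RobotFileParser()
rp.parse("""User-agent: *
Crawl-delay: 10
Disallow: /api/
""".splitlines())

rp.crawl_delay("*")                             # seconds between requests
rp.can_fetch("*", "https://example.com/api/x")  # whether this path is allowed
```

That lets a site say "slow down" and "stay out", but there is still no standard way to say "get it all over here instead".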
gloflo|1 month ago
I mean, that's what this technology is capable of, right? Especially when one asks it nicely and with emphasis.
hamdingers|1 month ago
I'm not sure why you're personifying what is almost certainly a script that fetches documents, parses all the links in them, and then recursively fetches all of those.
When we say "AI scraper" we're describing a crawler controlled by an AI company indiscriminately crawling the web, not a literal AI reading and reasoning about each page... I'm surprised this needs to be said.
yardstick|1 month ago
Depends on whether they wrote their own BitTorrent client or not. It’s possible to write a client that doesn’t share, and even reports false/inflated sharing stats back to the tracker.
A decade or more ago I modified my client to inflate my share stats so I wouldn’t get kicked out of a private tracker whose share-ratio requirements conflicted with my crappy data plan.
toofy|1 month ago
this should give us pause. if a bot considers this adversarial and refuses to respect the site owner's wishes, that's a big part of the problem.
a bot should not consider that “adversarial”
chii|1 month ago
should a site owner be able to discriminate between a bot visitor and a human visitor? Most do, and hence the bots treat it as a hostile environment.
Of course, bots that behave badly have created this problem themselves. That's why, if you create a bot to scrape, you should make sure it takes up no more resources than a typical browser-based visitor.
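One simple way to stay under a typical visitor's footprint is a minimum delay between requests; a minimal sketch, where the 2-second default is an arbitrary assumption, not a standard:

```python
import time

def seconds_to_wait(last_request, now, min_delay):
    """How long to sleep so requests are at least min_delay apart."""
    return max(0.0, min_delay - (now - last_request))

class PoliteFetcher:
    """Rate-limits any fetch callable to roughly human browsing speed."""
    def __init__(self, fetch, min_delay=2.0):
        self.fetch = fetch
        self.min_delay = min_delay
        self.last_request = float("-inf")  # first request never waits

    def get(self, url):
        time.sleep(seconds_to_wait(self.last_request, time.monotonic(),
                                   self.min_delay))
        self.last_request = time.monotonic()
        return self.fetch(url)
```

Wrapping whatever HTTP function the crawler uses in `PoliteFetcher` keeps it to a request every couple of seconds per site, which is far below what most origin servers would even notice.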
zzo38computer|1 month ago
Is there a mechanism to indicate this? The "a" command in the Scorpion crawling policy file is meant for this purpose, but that is not for use with WWW. (The Scorpion crawling policy file also has several other commands that would be helpful, but also are not for use with WWW.)
There is also the question of knowing at what interval the downloadable archives are updated; for data that changes often, you will not regenerate them every time. The same consideration applies to torrents, since a new hash will be needed for each new version of the file.
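On the torrent point: BitTorrent identifies a torrent by the SHA-1 hash of its info dictionary, so any change to the dump's contents produces a different info-hash and, from the swarm's perspective, a brand-new torrent. A toy illustration (the byte strings are stand-ins; a real info-hash is taken over the bencoded info dictionary, not the raw payload):

```python
import hashlib

# Two versions of "the same" dump get unrelated identifiers.
v1 = hashlib.sha1(b"musicbrainz-dump-2024-01").hexdigest()
v2 = hashlib.sha1(b"musicbrainz-dump-2024-02").hexdigest()
```

So each new release of frequently-changing data starts its swarm from scratch, losing the seeders of the previous version.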
m463|1 month ago
that is an amazing thought.