sippeangelo | 1 month ago

With all respect to Mozilla, "respects robots.txt" makes this effectively DoA. AI agents are a form of user agent like any other when initiated by a human, no matter the personal opinion of the content publisher (unlike the egregious automated /scraping/ done for model training).

MrTravisB|1 month ago

This is a valid perspective. Since this is an emerging space, we are still figuring out how to show up in a healthy way for the open web.

We recognize that the balance between content owners and the users or developers accessing that content is delicate. Because of that, our initial stance is to default to respecting websites as much as possible.

That said, to be clear on our implementation: we currently only respond to explicit blocks directed at the Tabstack user agent. You can read more about how this works here: https://docs.tabstack.ai/trust/controlling-access
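To make the distinction concrete, here is a minimal sketch of what "only respond to explicit blocks directed at one user agent" could look like. It is not Tabstack's actual implementation: the agent token `Tabstack`, the helper names, and the simple prefix matching are all assumptions for illustration. The idea is that `User-agent: *` groups are skipped and only groups that name the agent count.

```python
def explicit_disallows(robots_txt: str, agent: str) -> list[str]:
    """Collect Disallow prefixes from groups that name `agent` explicitly.

    Groups addressed only to '*' are ignored on purpose.
    """
    prefixes: list[str] = []
    group_agents: list[str] = []
    in_rules = False  # True once the current group has started listing rules
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                group_agents = []
                in_rules = False
            group_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value and agent.lower() in group_agents:
                prefixes.append(value)
    return prefixes


def is_blocked(robots_txt: str, agent: str, path: str) -> bool:
    """Simplified prefix match; real parsers also handle Allow and wildcards."""
    return any(path.startswith(p) for p in explicit_disallows(robots_txt, agent))


# Hypothetical robots.txt: a blanket wildcard block plus one explicit group.
SAMPLE = """
User-agent: *
Disallow: /

User-agent: Tabstack
Disallow: /private/
"""
```

Under these assumptions, `is_blocked(SAMPLE, "Tabstack", "/private/report")` is true because of the explicit group, while `is_blocked(SAMPLE, "Tabstack", "/blog")` is false even though the wildcard group disallows everything.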

x3haloed|1 month ago

This tension is so close to a fundamental question we’re all dealing with, I think: “Who is the web for? Humans or machines?”

I think too often people fall completely on one side of this question or the other. I think it’s really complicated, and deserves a lot of nuance. I think it mostly comes down to having a right to exert control over how our data should be used, and I think most of it’s currently shaped by Section 230.

Generally speaking, platforms consider data to be owned by the platform. GDPR and CCPA/CPRA try to be the counter to that, but those are also too crude a tool.

Let’s take an example: Reddit. Let’s say a user is asking for help and I post a solution that I’m proud of. In that act, I’m generally expecting to help the original person who asked the question, and since I’m aware that the post is public, I’m expecting it to help whoever comes next with the same question.

Now (correct me if I’m wrong, but) GDPR considers my public post to be my data. I’m allowed to request that Reddit return it to me or remove it from the website. But then with Reddit’s recent API policies, that data is also Reddit’s product. They’re selling access to it for … whatever purposes they outline in the use policy there. That’s pretty far outside what a user is thinking when they post on Reddit. And the other side of it as well — was my answer used to train a model that benefits from my writing and converts it into money for a model maker? (To name just an example).

I think ultimately, platforms have too much control, and users have too little specificity in declaring who should be allowed to use their content and for what purposes.

Findecanor|1 month ago

There is still a difference between "fetch this page for me and summarise" and "go find pages for me, and cross-reference". And what makes you think that all AI agents using Tabstack would be directly controlled in real time with a 1:1 correspondence between human and agent, and not in some automated way?

I'm afraid that Tabstack would be powerful enough to bypass some existing countermeasures against scrapers and, once allowed in its lightweight mode, could be used to scrape data it is not supposed to access. I'd bet that someone will at least try.

Then there is the issue of which actions an agent is allowed to take on behalf of a user. Many sites have in their Terms of Service that all actions must be done directly by a human, or that all submitted content be human-generated and not from a bot. I'd suppose that an AI agent could find and interpret the ToS, but that is error-prone and not the proper level to do it at. Some kind of formal declaration of what is allowed is necessary: robots.txt is such a formal declaration, but very coarsely grained.
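As a rough illustration of that coarseness, the sketch below uses Python's standard `urllib.robotparser`. The agent names `ExampleTrainingBot` and `ExampleAgent`, the `example.com` URLs, and the `allowed` helper are made up for the example. The file can say which agent may fetch which path, but it has no way to say what the fetched content may be used for.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is blocked everywhere,
# everyone else is only kept out of /admin/.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def allowed(agent: str, url: str) -> bool:
    """Return whether `agent` may fetch `url` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


# The only inputs to the decision are the agent name and the URL path.
# Nothing here distinguishes "summarise this page for one user" from
# "store this page for later training" when both use the same agent name.
print(allowed("ExampleTrainingBot", "https://example.com/post/1"))
print(allowed("ExampleAgent", "https://example.com/post/1"))
print(allowed("ExampleAgent", "https://example.com/admin/settings"))
```

A purpose-aware format would need an extra dimension beyond agent and path, which is what the proposals mentioned next are trying to add.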

There have been several disparate proposals for formats and protocols that are "robots.txt but for AI". I've seen that at least one of them allows different rules for AI agents and machine learning. But these are too disparate, not widely known ... and completely ignored by scrapers anyway, so why bother.

mossTechnician|1 month ago

I agree with you in spirit, but I find it hard to explain that distinction. What's the difference between mass web scraping and an automated tool using this agent? The biggest differences I assume would be scope and intent... But because this API is open for general development, it's difficult to judge the intent and scope of how it could be used.

jakelazaroff|1 month ago

What's difficult to explain? If you're having an agent crawl a handful of pages to answer a targeted query, that's clearly not mass scraping. If you're pulling down entire websites and storing their contents, that's clearly not normal use. Sure, there's a gray area, but I bet almost everyone who doesn't work for an AI company would be able to agree whether any given activity was "mass scraping" or "normal use".

observationist|1 month ago

Exactly. robots.txt with regard to AI is not a standard and should be treated like the performative, politicized, ideologically incoherent virtue signalling that it is.

There are technical improvements to web standards that can and should be made that don't favor adtech and exploitative commercial interests over the functionality, freedom, and technically sound operation of the internet.