lolinder | 8 months ago

https://www.robotstxt.org/faq/what.html

> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

There's nothing recursive about "summarize all the cooking recipes linked on this page". That's a single-level iterative loop.

I will grant that I should alter my original statement: if OP wanted to respect robots.txt when the tool receives a request that should be interpreted as an instruction to recursively fetch pages, then I'd consider that an appropriate use of robots.txt, because that's not materially different from implementing a web crawler by hand in code.

But that represents a tiny subset of the queries that will go through a tool like this, and respecting robots.txt for non-recursive requests would lead to silly outcomes like the browser refusing to load reddit.com [0].

[0] https://www.reddit.com/robots.txt
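
To make the distinction concrete, here's a rough sketch of a tool that consults robots.txt only for recursive requests. Everything in it is an assumption for illustration: the `may_fetch` helper, the `recursive` flag, the `example-agent` name, and the blanket-disallow file are all hypothetical, and deciding whether a plain-English request is actually recursive is the hard part that this skips entirely. The parsing itself uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body that disallows everything for every agent.
# It is parsed from a string so the sketch needs no network access.
BLANKET_DISALLOW = """\
User-agent: *
Disallow: /
"""


def may_fetch(url: str, recursive: bool, robots_txt: str,
              agent: str = "example-agent") -> bool:
    """Consult robots.txt only when the request is a recursive crawl."""
    if not recursive:
        # Single-level, user-directed fetch: treated like a normal page load.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# "Summarize the recipes linked on this page": one level, so robots.txt is not consulted.
single_level = may_fetch("https://example.com/recipes", False, BLANKET_DISALLOW)

# "Follow every link you find, and every link on those pages": crawler behaviour.
crawl = may_fetch("https://example.com/recipes", True, BLANKET_DISALLOW)
```

With a file like that, the single-level request still goes through while the recursive one is refused, which is the split I'm arguing for.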

mattigames|8 months ago

The concept of robots.txt was created in a different time, when nobody envisioned that users would one day interact with websites (including across multiple pages) through commands written in plain English sentences. So the discussion about whether AI browsers should respect it or not is senseless. Instead, if this kind of usage takes off, it would probably make more sense to have a new standard for such use cases, something like AI-browsers.txt, to make clear the intent of blocking (or not blocking) AI browsing capabilities.

lolinder|8 months ago

Alright, I think we can agree on that. I'll see you over in that new standardization discussion fighting fiercely for protections to make sure companies don't abuse it to compromise the open web.