top | item 45942060

(no title)

SquareWheel | 3 months ago

That may work for blocking bad automated crawlers, but an agent acting on behalf of a user wouldn't follow robots.txt. They'd run the risk of hitting the bad URL when trying to understand the page.

discuss

order

klodolph|3 months ago

That sounds like the desired outcome here. Your agent should respect robots.txt, OR it should be designed to not follow links.

varenc|3 months ago

An agent acting on my behalf, following my specific and narrowly scoped instructions, should not obey robots.txt because it's not a robot/crawler. Just like how a single cURL request shouldn't follow robots.txt. (It also shouldn't generate any more traffic than a regular browser user)

Unfortunately "mass scraping the internet for training data" and an "LLM powered user agent" get lumped together too much as "AI Crawlers". The user agent shouldn't actually be crawling.