top | item 44146978

(no title)

ThePinion | 9 months ago

Can you further elaborate on this robots.txt? I was under the impression most AI just completely ignores anything to do with robots.txt so you may just be hitting the ones that are maybe attempting to obey it?

I'm not against the idea like others here seem to be, I'm more curious about implementing it without harming good actors.

discuss

order

kevindamm|9 months ago

If your robots.txt has a line specifying, for example

   Disallow: /private/beware.zip
and you have no links to that file from elsewhere on the site, then if you get a request for that URL it was because someone/something read the robots.txt and explicitly violated it, then you can send it a zipbomb or ban the source IP or whatever.

But in my experience it isn't the robots.txt violations being so flagrant (half the requests are probably humans who were curious what you're hiding, and most bots written specifically for LLMs don't even check the robots.txt). The real abuse is the crawler that hits an expensive and frequently-changing URL more often than reasonable, and the card-testers hitting payment endpoints, sometimes with excessive chargebacks. And port-scanners, but those are a minor annoyance if your network setup is decent. And email spoofers who bring your server's reputation down if you don't set things up correctly early on and whenever changing hosts.