top | item 44796117

harvie | 6 months ago

Maybe we can just configure webservers to block anyone who requests robots.txt. Regular browsers don't fetch it, but robots do, to get a list of URLs to crawl (while ignoring the rules). Just create a simple PHP/CGI script that adds the client IP address to iptables once /robots.txt is accessed.
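The idea above could be sketched roughly like this (Python rather than PHP, but the same CGI mechanism; `REMOTE_ADDR` is the standard CGI environment variable, and the iptables call is shown as a dry run by default, since actually inserting firewall rules requires root — how you grant that to the web server is up to you):

```python
#!/usr/bin/env python3
# Hypothetical CGI handler mapped to /robots.txt: serves a robots.txt
# body while recording the requesting client's IP for firewalling.
import os
import subprocess
import sys

def block_ip(ip, dry_run=True):
    # Build the iptables rule that would drop all traffic from `ip`.
    # With dry_run=True we only return the command for inspection;
    # running it for real needs root privileges on the host.
    cmd = ["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd

if __name__ == "__main__":
    ip = os.environ.get("REMOTE_ADDR", "0.0.0.0")
    block_ip(ip)  # dry run: compute the rule without touching iptables
    # Still answer with a valid robots.txt so the request looks normal.
    sys.stdout.write("Content-Type: text/plain\r\n\r\n")
    sys.stdout.write("User-agent: *\nDisallow: /\n")
```

One obvious caveat: this also bans well-behaved crawlers and any curious human who opens /robots.txt in a browser, so in practice you'd want an allowlist before the block.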

Trung0246 | 6 months ago

One easy way to bypass this is to let external services that fetch robots.txt (archive.org, GitHub Actions, etc.) cache it, and then expose it to the actual scrape server through a separate API, webhook, or manual download.

The robots.txt file is usually small, so fetching it would not raise any flags at the external service.