top | item 43425958


geekrax | 11 months ago

I have replaced all robots.txt rules with simple WAF rules, which are cheaper to maintain than dealing with offending bots.
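A minimal sketch of the kind of rule a WAF applies, assuming a simple User-Agent denylist (the bot names and function are illustrative, not any particular WAF's syntax):

```python
import re

# Illustrative denylist of crawler UA substrings; a real WAF rule set
# would be longer and maintained as bots change their identifiers.
BLOCKED_BOTS = re.compile(r"(GPTBot|CCBot|Bytespider|PetalBot)", re.IGNORECASE)

def waf_decision(user_agent: str) -> int:
    """Return the HTTP status to serve: 403 for a blocked bot, else 200."""
    if BLOCKED_BOTS.search(user_agent):
        return 403
    return 200
```

The appeal over robots.txt is that the rule is enforced rather than requested: a 403 is returned whether or not the crawler chooses to cooperate.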


claudiulodro | 11 months ago

I do essentially both: robots.txt backed by actual server-level enforcement of the rules in robots.txt. You'd think there would be zero hits on the server-level blocking since crawlers are supposed to read and respect robots.txt, but unsurprisingly they don't always. I don't know why this isn't a standard feature in web hosting.
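A sketch of that server-level enforcement using Python's stdlib `urllib.robotparser` to apply the same Disallow rules the robots.txt advertises (the rules shown are illustrative):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; in practice this would be the same
# file the server publishes at /robots.txt.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(user_agent: str, path: str) -> bool:
    """True if robots.txt permits this user agent to fetch this path."""
    return parser.can_fetch(user_agent, path)
```

Hooking a check like this into request handling turns the advisory robots.txt into a hard rule, and any hits it blocks are exactly the crawlers that read the file and ignored it.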

Joe_Cool | 11 months ago

For my personal stuff I also included a Nepenthes tarpit. Works great and slows the bots down while feeding them garbage. Not my fault when they consume stuff robots.txt says they shouldn't.

I'm just not sure if legal would love me doing that on our corporate servers...

rustc | 11 months ago

Does the WAF rule match on the User-Agent header? Perplexity is known to send generic browser user agents to bypass exactly that.
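The limitation is easy to demonstrate: a UA-based rule catches a crawler that identifies itself, but the same crawler sending a stock browser UA string sails through (the pattern list here is illustrative):

```python
import re

# Illustrative denylist keyed on self-declared crawler names.
BOT_PATTERN = re.compile(r"(PerplexityBot|GPTBot|CCBot)", re.IGNORECASE)

def blocked_by_ua_rule(user_agent: str) -> bool:
    """True if the naive UA rule would block this request."""
    return bool(BOT_PATTERN.search(user_agent))
```

This is why UA matching usually gets combined with other signals (IP ranges, request rate, behavior) rather than used alone.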