
buro9 | 1 year ago

Nginx. It's nothing special; it's just my load balancer:

    if ($http_user_agent ~* (list|of|case|insensitive|things|to|block)) {
        return 403;
    }

l1n | 1 year ago

403 is generally a bad way to get crawlers to go away; https://developers.google.com/search/blog/2023/02/dont-404-m... suggests a 500, 503, or 429 HTTP status code instead.
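For anyone who does want the rate-limit route, here's a minimal nginx sketch using the limit_req module. The zone name, rate, and UA pattern are placeholders, not anything from the comment above:

    # map matched user agents to a rate-limit key; an empty key
    # is exempt from the limit, so normal traffic is untouched
    map $http_user_agent $limit_key {
        default            "";
        "~*(list|of|bots)" $binary_remote_addr;
    }

    limit_req_zone $limit_key zone=crawlers:10m rate=1r/s;

    server {
        location / {
            limit_req zone=crawlers burst=5 nodelay;
            limit_req_status 429;  # answer 429 instead of the default 503
        }
    }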

buro9 | 1 year ago

> 403 is generally a bad way to get crawlers to go away

Hardly... the linked article says that a 403 will cause Google to stop crawling and remove content. That's the desired outcome.

I'm not trying to rate limit, I'm telling them to go away.

vultour | 1 year ago

That article describes the exact behaviour you want from the AI crawlers. If you let them know they're rate limited, they'll just change IPs or user agents.

gs17 | 1 year ago

From the article:

> If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really).

It would be interesting to see any data you have on this, since you seem well placed to notice which crawlers behave "better" and which try every trick to get around blocks.

Libcat99 | 1 year ago

Serving them wrong but inexpensive data might be preferable to blocking them outright.

I've used this approach with VoIP scanners.
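In nginx terms, a minimal sketch of that idea (the UA pattern and response body here are placeholders, not anything from this thread):

    # send matched crawlers a cheap canned response instead of a 403
    if ($http_user_agent ~* "(list|of|bots)") {
        return 200 "nothing interesting here\n";
    }

The appeal is that a 200 with junk is nearly free to serve and doesn't signal a block, so the crawler has less reason to rotate IPs or user agents the way it would after a 403.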