Copying my comment here for additional discussion:
Worth noting that most of these bots are 'good bots' (i.e. they will obey robots.txt). So you can avoid the nginx resource usage entirely by adding suitable robots.txt entries.
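For example, a robots.txt along these lines keeps well-behaved crawlers away from the whole site while leaving it open to everyone else (the bot names here are just common examples, not a vetted list; check each crawler's documentation for its actual token):

```text
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow:
```

An empty `Disallow:` in the catch-all group means "allow everything" for all other crawlers.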
I think using nginx tests like this could have negative effects on showing OpenGraph metadata (including images).
If choosing this approach, however, I would respond with a 403 Forbidden rather than a 5xx, since bots are more likely to keep retrying if they think the server will come back online.
I advise 402 Payment Required with a `Location: mailto:payment-offers@yourdomain` header. It's still a 4xx, so they'll probably know not to hit it repeatedly, and if they reach out to negotiate payment, you have a choice whether to accept or not.
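A minimal nginx sketch of that idea (the user-agent regex and the mailbox are placeholders, not a vetted bot list):

```nginx
location / {
    # match a few common self-identifying bot tokens; tune to taste
    if ($http_user_agent ~* "(bot|crawler|spider)") {
        # "always" makes nginx emit the header on 4xx responses too
        add_header Location "mailto:payment-offers@yourdomain" always;
        return 402;
    }
    # ... normal content handling ...
}
```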
If you're playing around with blocking bots, here is one approach that will block some legit and friendly bots and some bad ones too. It will do nothing against headless Chrome acting as a bot. Only use this on silly hobby sites; do not use it in production, even though most load balancers can do this and much more.
In the main site config, redirect anyone not using HTTP/2.0. GoogleBot still doesn't use HTTP/2.0, so this will block Google; Bing is OK though. One could instead use variables to make this multi-condition and add exceptions for their CIDR blocks. Point an "auth." DNS record to the same IP and ensure you have a cert for it, or a wildcard cert.
# in main TLS site config:
# replace apex with your domain and tld with its tld.
if ($server_protocol != "HTTP/2.0") { return 302 https://auth.apex.tld$request_uri; }
Then in your "auth" domain, use the same config as the main site minus the redirect, but add basic authentication. Anyone not using HTTP/2.0 can still access the site if they know the right username/password. If you get a lot of bots, have an init script copy the password file into /dev/shm and reference it from there in nginx to avoid the disk reads.
# then in the auth.apex.tld config.
# optionally give a hint replacing i_heart_bots with name_blah_pass_blah
auth_delay 2s;
location / {
    auth_basic "i_heart_bots";
    auth_basic_user_file /etc/nginx/.pw;
}
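The /dev/shm trick mentioned above could look something like this at boot (paths, permissions, and the `www-data` group are assumptions; match them to however nginx runs on your system):

```shell
# boot-time step: keep the htpasswd file in tmpfs so auth checks avoid disk
SRC=/etc/nginx/.pw   # assumed location of your htpasswd file
DST=/dev/shm/.pw
if [ -f "$SRC" ]; then
    cp "$SRC" "$DST"
    chmod 640 "$DST"
fi
```

Then point nginx at it with `auth_basic_user_file /dev/shm/.pw;` instead of the on-disk path.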
This will block some API command-line tools, most bots good or bad, and some scanning tools. Some bots will give up before the 2-second delay elapses, so you will see status 499 instead of 401 in the access logs. Only do this on silly hobby sites. Do not use it in production. Only people wearing a T-shirt like this one [1] may do this in production.
One may be surprised to find that most bots use old libraries that are not HTTP/2.0 capable. When they catch up, we can replace this logic using HTTP/3.0 and UDP. Beyond that, we can force people to win a game of tic-tac-toe or Doom over JavaScript.
The most permanent and effective solutions (in terms of minimizing adversarial activity over time and destroying the value of what is harvested) involve serving fake content (poison!), making site failures sporadic (forcing them to maintain state), and making some of those errors look like they're upstream not something you're doing on a specific machine (really bad luck mate!).
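One way to sketch the "sporadic, upstream-looking failures" idea in nginx alone (the 5% rate and the bare 502 are illustrative; a real setup would also vary timing and scope this to suspected bots):

```nginx
# hash the per-request ID so roughly 5% of requests land in the failure bucket
split_clients "${request_id}" $sporadic_fail {
    5%  1;
    *   "";
}

server {
    # ...
    location / {
        # a 502 reads as a flaky upstream, not a deliberate block
        if ($sporadic_fail) {
            return 502;
        }
        # ... normal content handling ...
    }
}
```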
The deadenders who felt it was worth it will keep trying for at least a while; the new exploiters will tend to give up sooner. robots.txt is a courtesy. Not everybody puts stuff on the internet with a working theory that your experience is more important than theirs.
Why would you want to block Meta and Twitter? I think rich object previews on social networks are pretty important, and they are only shown if you let the social networks' crawlers visit your site.
robots.txt is only honoured by good bots; bad bots and scrapers ignore it. So I agree with you that this nginx configuration is pretty useless and doesn't even solve the problem on the other side!
[1] - https://www.amazon.com/Dont-Always-Test-Production-Shirt/dp/...
Mandatum | 2 years ago:
Once it’s online, it’s online.
danradunchev | 2 years ago:
- I don't want my content on those sites in any form, and I don't want my content to feed their algorithms. So I do not care about OpenGraph or previews.
- Using robots.txt assumes they will obey it, but they may choose not to. It's not mandatory in any way.
- Yes, they can fake the UA. That does not mean I should not take any measures to block them just because they can fake it.
xgbi | 2 years ago:
and check your logs to see who is not complying
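For instance, with nginx's default combined log format, something like this surfaces the user agents still fetching paths your robots.txt disallows (the /private path and the log location are placeholders; adjust them to your own Disallow rules and setup):

```shell
# fields split on double quotes: $2 = request line, $6 = user agent
awk -F'"' '$2 ~ /GET \/private/ {print $6}' /var/log/nginx/access.log \
    | sort | uniq -c | sort -rn
```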