top | item 36476243

Using Nginx to block Meta, Twitter and ChatGPT access to your sites

104 points | danradunchev | 2 years ago | gist.github.com | reply

25 comments

[+] kisamoto|2 years ago|reply
Copying my comment here for additional discussion:

Worth noting that most of these bots are 'good bots' (i.e. they will obey robots.txt). So you can avoid the nginx resource usage entirely by adding suitable robots.txt entries.
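
For reference, a minimal robots.txt along those lines (these user-agent tokens are the ones the vendors document for their crawlers; check their current docs before relying on them):

    User-agent: GPTBot
    Disallow: /

    User-agent: FacebookBot
    Disallow: /

    User-agent: Twitterbot
    Disallow: /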

I think using nginx tests like this could have negative effects on showing OpenGraph metadata (including images).

If choosing this approach, however, I would respond with a 403 Forbidden to mark the resource as off-limits, as bots are more likely to keep retrying if they think the server will come back online.
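
A sketch of that 403 approach in nginx (the map goes in the http block; the user-agent patterns here are illustrative, not a complete list):

    map $http_user_agent $blocked_bot {
        default                0;
        ~*GPTBot               1;
        ~*facebookexternalhit  1;
        ~*Twitterbot           1;
    }
    # then inside the server block:
    if ($blocked_bot) { return 403; }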

[+] altairprime|2 years ago|reply
I advise 402 Payment Required \n Location: mailto:payment-offers@yourdomain. It’s still 4xx so they’ll probably know not to hit it repeatedly, and if they reach out to negotiate payment, you have a choice whether to accept or not.
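
A hypothetical nginx sketch (assuming a $blocked_bot variable set from a User-Agent map; the mailto address is a placeholder):

    # sketch: answer blocked bots with 402 and a payment contact
    location / {
        if ($blocked_bot) {
            add_header Location "mailto:payment-offers@yourdomain" always;
            return 402;
        }
    }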
[+] LinuxBender|2 years ago|reply
If you're playing around with blocking bots, here is a trick that will block some legitimate, friendly bots and some bad ones too. It will do nothing against headless Chrome acting as a bot. Only use this on silly hobby sites; do not use it in production, even though most load balancers can do this and much more.

In the main site config, redirect anyone not using HTTP/2.0. GoogleBot still doesn't use HTTP/2.0, so this will block Google; Bing is OK though. One could instead use variables to make this multi-condition and add exceptions for their CIDR blocks. Point an "auth." DNS record to the same IP and ensure you have a cert for it, or a wildcard cert.

    # in main TLS site config:
    # replace apex.tld with your domain and TLD.
    if ($server_protocol != "HTTP/2.0") {
        return 302 https://auth.apex.tld$request_uri;
    }
Then in your "auth" domain, use the same config as the main site minus the redirect, but add basic authentication. Anyone not using HTTP/2.0 can still access the site if they know the right username/password. If you get a lot of bots, have an init script copy the password file into /dev/shm and reference it from there in nginx to avoid the disk reads.

    # then in the auth.apex.tld config.
    # optionally give a hint by replacing i_heart_bots with name_blah_pass_blah
    auth_delay 2s;
    location / {
        auth_basic "i_heart_bots";
        auth_basic_user_file /etc/nginx/.pw;
    }
This will block some API command-line tools, most bots good or bad, and some scanning tools. Some bots will give up before the 2-second delay, so you will see status 499 instead of 401 in the access logs. Only do this on silly hobby sites. Do not use it in production. Only people wearing a T-shirt like this one [1] may do this in production.

One may be surprised to find that most bots use old libraries that are not HTTP/2.0 enabled. When they catch up, we can replace this logic with HTTP/3.0 and UDP. Beyond that, we can force people to win a game of tic-tac-toe or Doom over JavaScript.

[1] - https://www.amazon.com/Dont-Always-Test-Production-Shirt/dp/...

[+] Mandatum|2 years ago|reply
That’s fine until SEO scrapers eat your site and regurgitate the content somewhere else.

Once it’s online, it’s online.

[+] m3047|2 years ago|reply
The most permanent and effective solutions (in terms of minimizing adversarial activity over time and destroying the value of what is harvested) involve serving fake content (poison!), making site failures sporadic (forcing them to maintain state), and making some of those errors look like they're upstream not something you're doing on a specific machine (really bad luck mate!).

The deadenders who felt it was worth it will keep trying for at least a while; the new exploiters will tend to give up sooner. robots.txt is a courtesy. Not everybody puts stuff on the internet with a working theory that your experience is more important than theirs.

[+] super256|2 years ago|reply
Why would you want to block Meta and Twitter? I think the rich objects on social networks are pretty important, and they are only shown if you let the social networks visit your site.
[+] danradunchev|2 years ago|reply
For everyone who missed the point:

- I don't want my content on those sites in any form, and I don't want my content feeding their algorithms. So I do not care about OpenGraph or previews.

- Using robots.txt assumes they will 'obey' it, but they may choose not to. It's not mandatory in any way.

- Yes, they can fake the UA. That does not mean I should not take any measures to block them just because they can.

[+] therealmarv|2 years ago|reply
And allow your data to be trained on by Google and Bing (bots) etc. ;) You cannot do SEO while excluding AI nowadays!
[+] subsequentmask|2 years ago|reply
Hmm, I use NPM (Nginx Proxy Manager) to manage my nginx; I think I'll look into ways to implement this there.
[+] brainzap|2 years ago|reply
why not use robots.txt
[+] tjpnz|2 years ago|reply
That assumes they'll honour it, either now or in the future.
[+] xgbi|2 years ago|reply
why not both?

and check your logs to see who is not complying

[+] sMarsIntruder|2 years ago|reply
robots.txt is only ignored by bad bots and scrapers, so I agree with you: this nginx configuration is pretty useless, and it doesn't even solve the problem on the bad-bot side!
[+] funOtter|2 years ago|reply
can you do this with htaccess (for shared hosting)?
[+] altairprime|2 years ago|reply
Typically, yes, but you'll need Apache's syntax, since .htaccess is Apache rather than nginx.
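
In Apache terms, a sketch with mod_rewrite (user-agent tokens illustrative; mod_rewrite must be enabled on the host):

    # .htaccess sketch: 403 for matching User-Agents
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (GPTBot|facebookexternalhit|Twitterbot) [NC]
    RewriteRule ^ - [F]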