buro9 | 1 year ago
I have data... 7d from a single platform with about 30 forums on this instance.
4.8M hits from Claude, 390k from Amazon, 261k from DataForSEO, and 148k from ChatGPT.
That Claude one! Wowser.
Bots that match this (which is also the list I block on some other forums that are fully private by default):
(?i).*(AhrefsBot|AI2Bot|AliyunSecBot|Amazonbot|Applebot|Awario|axios|Baiduspider|barkrowler|bingbot|BitSightBot|BLEXBot|Buck|Bytespider|CCBot|CensysInspect|ChatGPT-User|ClaudeBot|coccocbot|cohere-ai|DataForSeoBot|Diffbot|DotBot|ev-crawler|Expanse|FacebookBot|facebookexternalhit|FriendlyCrawler|Googlebot|GoogleOther|GPTBot|HeadlessChrome|ICC-Crawler|imagesift|img2dataset|InternetMeasurement|ISSCyberRiskCrawler|istellabot|magpie-crawler|Mediatoolkitbot|Meltwater|Meta-External|MJ12bot|moatbot|ModatScanner|MojeekBot|OAI-SearchBot|Odin|omgili|panscient|PanguBot|peer39_crawler|Perplexity|PetalBot|Pinterestbot|PiplBot|Protopage|scoop|Scrapy|Screaming|SeekportBot|Seekr|SemrushBot|SeznamBot|Sidetrade|Sogou|SurdotlyBot|Timpibot|trendictionbot|VelenPublicWebCrawler|WhatsApp|wpbot|xfa1|Yandex|Yeti|YouBot|zgrab|ZoominfoBot).*
I am moving to just blocking them all; it's ridiculous.
Everything on this list got itself there by being abusive (either ignoring robots.txt, or not backing off when latency increased).
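The blocklist above can be applied as a single case-insensitive match against the User-Agent header. A minimal sketch (Python, with only an excerpt of the full token list for brevity):

```python
import re

# Excerpt of the blocklist above; the full pattern has the same shape,
# one alternation of bot name tokens, matched case-insensitively.
BLOCKLIST = re.compile(
    r"(AhrefsBot|Amazonbot|Bytespider|CCBot|ChatGPT-User|ClaudeBot"
    r"|DataForSeoBot|GPTBot|Perplexity|PetalBot)",
    re.IGNORECASE,
)

def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent contains any blocked bot token."""
    return BLOCKLIST.search(user_agent) is not None

print(is_blocked("Mozilla/5.0 (compatible; ClaudeBot/1.0)"))                 # True
print(is_blocked("Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Firefox/121.0"))  # False
```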
vunderba|1 year ago
https://github.com/ai-robots-txt/ai.robots.txt
Pooge|1 year ago
After some digging, I also found a great way to surprise bots that don't respect robots.txt[1] :)
[1]: https://melkat.blog/p/unsafe-pricing
buro9|1 year ago
1. A proxy that looks at HTTP Headers and TLS cipher choices
2. An allowlist that records which browsers send which headers and selects which ciphers
3. A dynamic loading of the allowlist into the proxy at some given interval
New browser versions or updates to OSs would need the allowlist updating, but I'm not sure it's that inconvenient and could be done via GitHub so people could submit new combinations.
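The three steps above might be sketched like this (Python; the allowlist schema and the JA3-style fingerprint hashes are invented placeholders, not from the original post):

```python
# Step 2: record which TLS fingerprints (e.g. JA3 hashes) genuine builds of
# each browser family are known to present. Hashes below are made up.
ALLOWLIST = {
    "Firefox": {"b20b45ab1fde3e7c", "cd08e31494f9531f"},
    "Chrome": {"773906b0efdefa24"},
}

def is_real_browser(ua_family: str, tls_fingerprint: str) -> bool:
    """Step 1: the proxy extracts the UA family and TLS fingerprint per
    request; admit it only if the pair appears in the allowlist."""
    return tls_fingerprint in ALLOWLIST.get(ua_family, set())

# Step 3 (periodic reload) would replace ALLOWLIST on a timer with a fresh
# copy fetched from, say, a community-maintained file on GitHub.
```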
I'd rather just say "I trust real browsers" and dump the rest.
Also, I noticed a far simpler block: just deny almost every request whose UA claims to be "compatible".
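The "compatible" heuristic works because the token survives almost exclusively in crawler UAs (e.g. "Mozilla/5.0 (compatible; bingbot/2.0; ...)"), while current Firefox/Chrome/Safari strings no longer carry it. A one-function sketch:

```python
def claims_compatible(user_agent: str) -> bool:
    # Crawlers still announce themselves as "Mozilla/5.0 (compatible; ...)";
    # modern real browsers dropped the "compatible" token years ago.
    return "compatible" in user_agent.lower()
```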
jprete|1 year ago
That could also be a user login, maybe, with per-user rate limits. I expect that bot runners could find a way to break that, but at least it's extra engineering effort on their part, and they may not bother until enough sites force the issue.
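Per-user rate limiting is commonly implemented as a token bucket per account. A self-contained sketch (parameter values are illustrative, not from the comment):

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate: float       # tokens refilled per second
    capacity: float   # burst size
    tokens: float = 0.0
    last: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def allow_request(user_id: str, rate: float = 2.0, burst: float = 10.0) -> bool:
    """One bucket per logged-in user; unauthenticated traffic would get none."""
    bucket = buckets.setdefault(user_id, TokenBucket(rate, burst, tokens=burst))
    return bucket.allow()
```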
Dilettante_|1 year ago
Wait, that seems disturbingly conceivable with the way things are going right now. *shudder*
buro9|1 year ago
If a more specific UA hasn't been set, and the library doesn't force people to do so, then the library that has been the source of abusive behaviour is blocked.
No loss to me.
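Catching requests that still carry an HTTP library's default UA is a simple substring check. A sketch (the token list is illustrative, not exhaustive):

```python
# Default User-Agents sent by common HTTP libraries when the caller never
# bothers to set one; blocking these catches scrapers that didn't identify
# themselves. Tokens are examples, not a complete list.
DEFAULT_LIBRARY_UAS = ("python-requests", "python-urllib", "axios", "go-http-client", "curl")

def is_default_library_ua(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(token in ua for token in DEFAULT_LIBRARY_UAS)
```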
EVa5I7bHFq9mnYK|1 year ago
If you are an online shop, for example, isn't it beneficial that ChatGPT can recommend your products? Especially given that people now often consult ChatGPT instead of searching on Google.
rchaud|1 year ago
ChatGPT won't 'recommend' anything that wasn't already recommended in a Reddit post, or on an Amazon page with 5000 reviews.
You have, however, correctly spotted the market opportunity. Future versions of ChatGPT will offer the ability to "promote" your eshop in responses, in exchange for money.
buro9|1 year ago
if ($http_user_agent ~* "(list|of|case|insensitive|things|to|block)") { return 403; }
iLoveOncall|1 year ago
The fact that you choose to host 30 websites on the same instance is irrelevant; those AI bots scan websites, not servers.
This has been a recurring pattern I've seen in people complaining about AI bots crawling their website: huge number of requests but actually a low TPS once you dive a bit deeper.
buro9|1 year ago
In fact, 2M requests arrived on December 23rd from Claude alone, for a single site.
An average of ~25 qps is definitely an issue; these are all long-tail dynamic pages.
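The back-of-the-envelope check: 2M requests spread evenly over one day is a sustained rate of roughly 23 requests/second, in line with the ~25 qps figure cited (the crawl would not have been perfectly uniform):

```python
# 2M requests in one day from a single crawler, expressed as an average rate.
requests = 2_000_000
seconds_per_day = 24 * 60 * 60  # 86,400
avg_qps = requests / seconds_per_day
print(round(avg_qps, 1))  # 23.1
```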