chlorion | 1 month ago

I self-host a small static website and a cgit instance on an e2-micro VPS from Google Cloud, and I have received around 8.5 million requests combined from openai and claude over roughly 160 days. They crawl the cgit pages endlessly unless I block them!

    (1) root@gentoo-server ~ # egrep 'openai|claude' -c /var/log/lighttpd/access.log
    8537094
So I have lighttpd set up to match "claude|openai" in the user agent string and return a 403 if it matches, and an nftables firewall set up to rate limit spammers, and this seems to help a lot.
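
For anyone who wants to try the same matching logic outside of lighttpd, here is a minimal sketch in Python. It is not the actual server config; it only mirrors the two ideas above: counting log lines that match the pattern (like the egrep -c) and picking a 403 for a matching user agent. The function names and the sample lines are made up for illustration, and the match is case-sensitive, as with grep's default.

```python
import re

# Same alternation as the egrep above. Case-sensitive, like grep's default.
BOT_PATTERN = re.compile(r"openai|claude")


def count_bot_lines(lines):
    """Count lines containing a match, similar to `grep -E -c`."""
    return sum(1 for line in lines if BOT_PATTERN.search(line))


def status_for(user_agent):
    """Return 403 for a matching user agent, else 200 (illustrative only)."""
    return 403 if BOT_PATTERN.search(user_agent) else 200


if __name__ == "__main__":
    # Hypothetical log lines, not taken from a real access.log.
    sample = [
        '203.0.113.5 "GET /cgit/ HTTP/1.1" 200 "ExampleBot (+https://openai.com/bot)"',
        '203.0.113.9 "GET / HTTP/1.1" 200 "Mozilla/5.0 (X11; Linux x86_64)"',
        '203.0.113.7 "GET /cgit/log/ HTTP/1.1" 200 "ExampleBot (+bot@claude.example)"',
    ]
    print(count_bot_lines(sample))
    print(status_for("Mozilla/5.0 (X11; Linux x86_64)"))
```

To run it against a real log, pass an open file object to count_bot_lines, since iterating a file yields one line at a time.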

dang|1 month ago

And those are the good actors! We're under a crawlocalypse from botnets, er, residential proxies.

"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36", anyone?

zerocrates|1 month ago

Yeah the flood of these Chrome UAs with every version number under the sun, and a really large portion being *.0.0.0 version numbers, that's what I've tended to experience lately. Also just kind of every browser user agent ever:

Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12 (.NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; .NET CLR 3.5.21022)

There were waves of big and sometimes intrusive traffic admitting to being from Amazon, Anthropic, Google, Meta, etc., but those are easy to block or throttle and aren't that big a deal in the scheme of things.

zahlman|1 month ago

The third-party hit-counting service I use implies that I'm not getting any of this bot scraping on my GitHub blog.

Is Microsoft doing something to prevent it? Or am I so uncool that even bots don't want to read my content :(

lelanthran|1 month ago

I'm interested in that service and how it works. Link?