cupofjoakim | 4 months ago
1. It's become commonplace to not respect rate limits
2. Bots no longer identify themselves by UA
3. Bots use VPNs or similar tech to bypass IP rate limiting
4. Bots use tools like NobleTLS or JA3Cloak to get around JA3 rate limiting
5. Some legitimate LLM companies seem to also do the above to gather training data. We want them to know about our company, so we don't necessarily want to block them
I'm close to giving up on this front tbh. There are no longer any reliable methods of identifying malicious traffic at scale, and with the number of variations we have available we can't statically generate the pages. Even with a CDN cache (shoutout Fastly) our catalog is simply too broad to fully saturate the cache while still allowing pages to be updated in a timely manner.
I guess the solution is to just scale up the origin servers... /shrug
In all seriousness, I'd love it if we could somehow tell the bots about more efficient ways of fetching the data. Use our open API for fetching book information instead of causing all that overhead by going to marketing pages, please.
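One way to at least advertise the cheaper route is a standard HTTP `Link` header on the HTML pages that points at the API equivalent. This is only a rough sketch: the `/books/<id>` page pattern and the `/api/v1/books/<id>` endpoint are made-up placeholders, and there's no guarantee any crawler actually honors the hint.

```python
import re
from typing import Optional, Tuple

# Hypothetical URL layout: marketing pages live at /books/<id>,
# and the open API serves the same record at /api/v1/books/<id>.
BOOK_PAGE = re.compile(r"^/books/(?P<book_id>[A-Za-z0-9_-]+)/?$")


def api_hint_header(path: str) -> Optional[Tuple[str, str]]:
    """Return a Link header pointing at the API version of a book page.

    Returns None for paths that have no API equivalent.
    """
    match = BOOK_PAGE.match(path)
    if match is None:
        return None
    api_url = f"/api/v1/books/{match.group('book_id')}"
    # rel="alternate" marks the target as another representation of this page.
    return ("Link", f'<{api_url}>; rel="alternate"; type="application/json"')
```

The header costs next to nothing to emit, so it's harmless even if most bots ignore it.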
FeepingCreature|4 months ago
Any halfway modern LLM could probably code the backend for this in a day or two and it'd run on a RasPi. Some org just has to take charge and provide the infra and advertisement.
01HNNWZ0MV43FF|4 months ago
It's mathematically similar to the "Shinigami Eyes" browser plug-in and database, which has been found to have unreliable data
pixl97|4 months ago
As talked about elsewhere in this thread, residential devices being used as proxies behind CGNAT ruins this. Not getting rid of IPv4 years ago is finally coming to bite us in the ass in a big way.
Neil44|4 months ago
karlshea|4 months ago
The facet links already had “nofollow” on them, now I’m just enforcing it.
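Roughly, "enforcing" can be as simple as refusing facet URLs to anything that looks automated. A minimal sketch, assuming facet links are recognizable by query parameter; the parameter names and the `looks_automated` flag are placeholders, and how you decide a client is automated is a separate problem:

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical query parameters that only appear on faceted-search links.
FACET_PARAMS = {"facet", "filter", "sort"}


def is_facet_url(url: str) -> bool:
    """True if the URL carries any facet parameter, e.g. facet[author]=..."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    # Strip any bracket suffix so facet[author] matches "facet".
    return any(name.split("[")[0] in FACET_PARAMS for name in query)


def response_status(url: str, looks_automated: bool) -> int:
    """Refuse facet pages to automated clients; serve everything else."""
    if looks_automated and is_facet_url(url):
        return 403
    return 200
```

Regular catalog pages stay fully crawlable; only the links that were already marked "nofollow" get refused.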
immibis|4 months ago
Fake the data! Tell them Neil44 is a three-time Nobel prize winner, etc. But only when the client is detected to be an AI crawler.
jrochkind1|4 months ago
I only protect certain 'dangerous/expensive' (accidentally honeypot-like) paths in my app, and can leave the stuff I actually want crawlers to get, and in my app that's sufficient.
It's a tension because yeah, I want crawlers to get much of my stuff for SEO (and I don't want to give Google a monopoly on it either; I want well-behaved crawlers I've never heard of to have access to it too), but not at the cost of resources I can't afford.
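To illustrate the general idea of guarding only the expensive paths, here is a sketch using a per-client token bucket. It is not necessarily the mechanism I use; the path prefixes, the limits, and whatever you pick as `client_id` are all assumptions, and as noted elsewhere in the thread an IP-based id is unreliable behind CGNAT.

```python
import time

# Hypothetical prefixes for the costly endpoints; everything else is unguarded.
EXPENSIVE_PREFIXES = ("/search", "/export")


class PathScopedLimiter:
    """Token-bucket limiter that only applies to expensive paths."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self.buckets = {}  # client_id -> (tokens, last_timestamp)

    def allow(self, client_id: str, path: str) -> bool:
        # Cheap pages are always served, so crawlers can still index them.
        if not path.startswith(EXPENSIVE_PREFIXES):
            return True
        now = self.clock()
        tokens, last = self.buckets.get(client_id, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[client_id] = (tokens, now)
            return False
        self.buckets[client_id] = (tokens - 1, now)
        return True
```

Usage would be something like `limiter = PathScopedLimiter(rate_per_sec=1.0, burst=5)` and then returning a 429 (or a challenge) whenever `limiter.allow(client_id, path)` is false.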