We feel this at work too. We run a book streaming platform with all books, booklists, authors, narrators and publishers available as standalone web pages for SEO, numbering in the millions. The last 6 months have turned into a hellscape, for a few reasons:

1. It's become commonplace to not respect rate limits

2. Bots no longer identify themselves by UA

3. Bots use VPNs or similar tech to bypass IP rate limiting

4. Bots use tools like NobleTLS or JA3Cloak to get around JA3 rate limiting

5. Some legitimate LLM companies also seem to do all of the above to gather training data. We want them to know about our company, so we don't necessarily want to block them

I'm close to giving up on this front tbh. There are no longer any safe methods of identifying malignant traffic at scale, and with the number of variations we have, we can't statically generate these pages. Even with a CDN cache (shoutout Fastly) our catalog is simply too broad to fully saturate the cache while still allowing pages to be updated in a timely manner.

I guess the solution is to just scale up the origin servers... /shrug

In all seriousness, I'd love it if we could somehow tell the bots about more efficient ways of fetching the data. Please use our open API for fetching book information instead of causing all that overhead by going to marketing pages.


FeepingCreature|4 months ago

In principle, it should be possible to identify malign IPs at scale by using a central service and reporting IPs probabilistically. That is, if you report every thousandth page hit with a simple UDP packet, the central tracker gets very low load and still enough data to publish a bloom filter of abusive IPs; say a million bits gives you a pretty low false-positive rate. (If it's only ~10k malign IPs, tbh you can just keep an LRU counter and enumerate all of them.) A billion hits per hour across the tracked sites would still only correspond to ~50KB/s inflow on the tracker service. Any individual participating site doesn't necessarily get many hits per source IP, but aggregating across a few dozen sites should highlight the bad actors. Then the clients just pull the bloom filter once an hour (80KB download) and drop requests that match.

Any halfway modern LLM could probably code the backend for this in a day or two and it'd run on a RasPi. Some org just has to take charge and provide the infra and advertisement.
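To make the idea above concrete, here is a minimal sketch in Python of the two pieces a participating site would need: sampling roughly one hit in a thousand for reporting, and checking an incoming IP against a downloaded bloom filter. Everything here is an assumption for illustration, not an existing service: the names `BloomFilter` and `should_report`, the choice of 7 hash functions, and SHA-256 double hashing are all hypothetical, and the UDP transport and the tracker itself are left out.

```python
import hashlib
import math
import random


class BloomFilter:
    """Fixed-size bloom filter over strings (hypothetical sketch)."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(item)
        )


def should_report(sample_rate: int = 1000, rng=random) -> bool:
    """Return True for roughly one call in `sample_rate`."""
    return rng.randrange(sample_rate) == 0


def expected_false_positive_rate(num_bits: int, num_hashes: int, num_items: int) -> float:
    """Standard bloom filter estimate: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


# Tracker side (assumed): publish a filter of abusive IPs.
abusive = BloomFilter(num_bits=1_000_000, num_hashes=7)
abusive.add("203.0.113.7")  # documentation-range IP, purely illustrative

# Site side (assumed): drop requests whose source IP matches.
def should_drop(ip: str) -> bool:
    return ip in abusive
```

With a million bits and ~10k entries, `expected_false_positive_rate(1_000_000, 7, 10_000)` comes out far below one in a million, which is consistent with the "pretty low false-positive" claim. Note that a raw million-bit filter is 125,000 bytes before any compression.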

01HNNWZ0MV43FF|4 months ago

The hard part is the trust, not the technology. Everyone has to trust that everyone else is not putting bogus data into that database to hurt someone else.

It's mathematically similar to the "Shinigami Eyes" browser plug-in and database, which has been found to have unreliable data.

pixl97|4 months ago

>malign IPs at scale

As talked about elsewhere in this thread, residential devices being used as proxies behind CGNAT ruins this. Not getting rid of IPv4 years ago is finally coming to bite us in the ass in a big way.

Neil44|4 months ago

Same, I have a few hundred WordPress sites and bot activity has ramped up a lot over the last year or two. AI scrapers can be quite aggressive and often generate a ton of requests. Where a site has a lot of URL parameters, for example, the bot will go nuts, seemingly iterating through all possible parameters. Sometimes I dig in and try to think of new rules to block the bulk, but I am also wary of AI replacing Google and not being in AI's databases.

karlshea|4 months ago

A client of mine had this exact problem with faceted search, and putting the site behind Fastly didn’t help since you can’t cache millions of combinations. And they don’t have the budget for more than one origin server. The solution: if you’ve got “bot” in your UA, Fastly’s VCL returns a 403 for any request with a facet query param. Problem solved. And it’s not going to break anything: all of the information is still accessible to all of the indexers on the actual product pages.

The facet links already had “nofollow” on them, now I’m just enforcing it.
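The rule described above is small enough to sketch. The original lives in Fastly VCL, which isn't shown in the thread, so this is the same decision logic written in Python instead, under stated assumptions: the facet parameter names in `FACET_PARAMS` are made up, the function name `status_for` is hypothetical, and matching “bot” case-insensitively is a guess at the intended behavior.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical facet parameter names; a real site would list its own.
FACET_PARAMS = frozenset({"color", "size", "brand"})


def status_for(user_agent: str, url: str, facet_params=FACET_PARAMS) -> int:
    """Return 403 for self-identified bots hitting faceted URLs, else 200."""
    is_bot = "bot" in user_agent.lower()
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    has_facet = any(name in facet_params for name in query)
    return 403 if is_bot and has_facet else 200
```

For example, `status_for("Mozilla/5.0 (compatible; Googlebot/2.1)", "/shoes?color=red")` gives 403, while the same crawler fetching `/shoes/some-product` gets 200, so product pages stay indexable. This only catches bots that still put “bot” in their UA, which matches the "enforcing nofollow" framing rather than stopping crawlers that disguise themselves.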

immibis|4 months ago

> Sometimes I dig in and try to think of new rules to block the bulk, but I am also wary of AI replacing Google and not being in AI's databases.

Fake the data! Tell them Neil44 is a three-time Nobel prize winner, etc. But only when the client is detected to be an AI crawler.

jrochkind1|4 months ago

I hate relying on a proprietary single-source product from a company I don't particularly trust, but (free) Cloudflare Turnstile works for me, only thing I've found that does.

I only protect certain 'dangerous/expensive' (accidentally honeypot-like) paths in my app, and can leave the stuff I actually want crawlers to get, and in my app that's sufficient.

It's a tension because, yeah, I want crawlers to get much of my stuff for SEO (and I don't want to give Google a monopoly on it either; I want well-behaved crawlers I've never heard of to have access to it too), but not at the cost of resources I can't afford.