> What's wild is that these scrapers rotate through thousands of IP addresses during their scrapes, which leads me to suspect that the requests are being tunnelled through apps on mobile devices, since the ASNs tend to be cellular networks. I'm still speculating here, but I think app developers have found another way to monetise their apps by offering them for free, and selling tunnel access to scrapers.
Wild indeed, and potentially horrific for the owners of the affected devices also! Any corroboration for that out there?
This is actually well known at this point. There are many services now that sell “residential proxies”, which are always mobile IP addresses. Since mobile IPs sit behind CGNAT, blocking one isn't great either: it can be like blocking an entire city or town. Some examples are Oxylabs, IPRoyal, Bright Data, etc.
Recently I filed an abuse complaint directly with Bright Data because I was getting hit with 1000s of requests from their bots. The funny part is they didn't even stop after acknowledging the complaint.
https://hola.org/legal/sdk
https://hola.org/legal/sla
> How is it free?
>
> In return for free usage of Hola Free VPN Proxy, Hola Fake GPS location and Hola Video Accelerator, you may be a peer on the Bright Data network. By doing so you agree to have read and accepted the terms of service of the Bright Data SDK SLA (https://bright-sdk.com/eula). You may opt out by becoming a Premium user.
This "VPN" is what powers these residential proxies: https://brightdata.com/
I'm sure there are many other companies like this.
If you have a moderately successful app, SDK, or browser extension, you will get hit up to add things like this to it. I think most free VPN services also lease out your bandwidth to make their money.
They use a mixture of colo (M247, Datacamp, HostRoyale, Oxylabs, etc.) and international residential IPs. I suspect the latter is where those residential app proxies come into play (Bright SDK, etc.). Oxylabs is also a well-known proxy provider, which makes me think they're the gateway into all of these IPs.
Definitely interesting times to try and host a web server!
You can get paid a few dollars (not many) to let them use your connection. I would like Cloudflare's business model (blocking datacenter IPs) to be worthless, so I do it. Haven't tried a withdrawal yet so it could well be a scam. This is not illegal (unless it's a scam).
I had a website earlier this year running on Hetzner. It was purely for experimenting with some ASP.NET stuff, but when looking at the logs, I noticed a shit-load of attempts at various WordPress-related endpoints.
I then read something about a guy who deliberately put a honeypot in his robots.txt file, pointing to a completely bogus endpoint. The theory was: humans won't read robots.txt, so there's no danger, but bots and the like often will (at least to figure out what you have... they'll ignore the "Disallow" for the most part!), and if one requests that fake endpoint you can be 100% sure (well, as close as possible) that it's not a human, and you can ban it.
So I tried that.
I auto-generated the robots.txt file on the fly. It was cached for 60 seconds or so, as I didn't want to expend too many resources on it: when you asked for it, you either got the cached one or I created a new one. The CPU usage was negligible.
I changed the "Disallow" endpoint each time I built the file in case the baddies cached it; it still routed to the same ASP.NET controller method, though. Anything that hit it was sent a 10GB zip bomb and its IP was automatically added to the firewall block list.
It was quite simple: anyone who hit that endpoint MUST be dodgy... I believe I even had comments in the file for any humans who stumbled across it, letting them know that going to this endpoint in their browser meant an automatic addition to the firewall blocklist.
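The whole thing in ASP.NET Core terms could look roughly like this; a minimal sketch, not my actual code (route names and the 60-second rotation are illustrative, it's single-process and not strictly thread-safe, and the zip-bomb response is left out):

    // Minimal sketch: rotating robots.txt trap + auto-ban, ASP.NET Core minimal APIs.
    using System.Collections.Concurrent;
    using System.Security.Cryptography;

    var app = WebApplication.CreateBuilder(args).Build();

    var banned = new ConcurrentDictionary<string, DateTime>(); // IP -> time banned
    string trapToken = NewToken();
    var tokenBuilt = DateTime.UtcNow;

    static string NewToken() => Convert.ToHexString(RandomNumberGenerator.GetBytes(8));

    // Reject banned clients before anything else runs.
    app.Use(async (ctx, next) =>
    {
        var ip = ctx.Connection.RemoteIpAddress?.ToString() ?? "";
        if (banned.ContainsKey(ip)) { ctx.Response.StatusCode = 403; return; }
        await next();
    });

    app.MapGet("/robots.txt", () =>
    {
        // Rebuild at most once a minute, i.e. the "cached for 60 seconds" part.
        if ((DateTime.UtcNow - tokenBuilt).TotalSeconds > 60)
        {
            trapToken = NewToken();
            tokenBuilt = DateTime.UtcNow;
        }
        return Results.Text(
            "# Humans: requesting any /trap-* URL adds your IP to the firewall blocklist.\n" +
            $"User-agent: *\nDisallow: /trap-{trapToken}/\n");
    });

    // Every generated trap path, current or stale, routes to this one handler.
    app.MapGet("/trap-{token}/{*rest}", (HttpContext ctx) =>
    {
        var ip = ctx.Connection.RemoteIpAddress?.ToString() ?? "";
        banned.TryAdd(ip, DateTime.UtcNow);
        // The original served a 10GB zip bomb here and pushed the IP to the
        // OS firewall; a plain 403 is the tame placeholder.
        return Results.StatusCode(403);
    });

    app.Run();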
Anyway... at first I caught a shit-load of bad guys. There were thousands at first, and then the numbers dropped and dropped to only tens per day.
This is a single data point, but for me it worked... I have no regrets about the zip bomb either :)
I have another site that I'm working on, so I may evolve it a bit: you get banned for a short time, and if you come back to the dodgy endpoint then I know you're a bot, so into the abyss with you! It's not perfect, but it worked for me anyway.
This is approximately my approach minus the zip bomb. I use a piece of middleware in my AspNetCore pipeline that tracks logical resource consumption rates per IPv4. If a client trips any of the limits, their IP goes into a HashSet for a period of time. If a client has an IP in this set, they get a simple UTF-8 constant string in the response body: "You have exceeded resource limits, please try again later".
The other aspect of my strategy is to use AspNetCore (Kestrel). It is so fast that you can mostly ignore the noise as long as things are configured properly and you make reasonable attempts to address the edge case of an asshole trying to break your particular system on purpose. A HashSet<int> as the very first piece of middleware rejecting bad clients is exceedingly efficient. We aren't even into URL routing at this point.
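For reference, the front-of-pipeline check can be as small as this; a minimal sketch of what I described above, with the rate tracker itself left out and all names illustrative:

    // Minimal sketch: reject banned IPv4s as the very first middleware.
    // The logic that detects tripped limits and populates the set lives elsewhere.
    using System.Net;
    using System.Text;

    var app = WebApplication.CreateBuilder(args).Build();

    var bannedIps = new HashSet<int>(); // IPv4 packed into an int
    var gate = new object();            // HashSet<T> is not thread-safe by itself
    byte[] rejection = Encoding.UTF8.GetBytes(
        "You have exceeded resource limits, please try again later");

    static int PackIpv4(IPAddress ip) =>
        BitConverter.ToInt32(ip.MapToIPv4().GetAddressBytes(), 0);

    app.Use(async (ctx, next) =>
    {
        var remote = ctx.Connection.RemoteIpAddress;
        if (remote is not null)
        {
            bool isBanned;
            lock (gate) isBanned = bannedIps.Contains(PackIpv4(remote));
            if (isBanned)
            {
                // Constant pre-encoded body; we never even reach URL routing.
                ctx.Response.StatusCode = 429;
                await ctx.Response.Body.WriteAsync(rejection);
                return;
            }
        }
        await next();
    });

    // A rate-tracking middleware would ban an offender with:
    //   lock (gate) bannedIps.Add(PackIpv4(remote));

    app.MapGet("/", () => "hello");
    app.Run();

In practice you would also evict entries once the ban period expires, so the set stays small.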
I have found that attempting to catalog and record all of the naughty behavior my web server sees is the biggest DDoS risk so far. Logging lines like "banned client rejected" every time they try to come in the door is shooting yourself in the foot with regard to disk wear, IO utilization, et al. There is no reason you should be logging all of that background radiation to disk or even thinking about it. If your web server can't handle direct exposure to the hard vacuum of space, it can be placed behind a proxy/CDN (i.e., another web server that doesn't suck).
It's interesting to study, right? This is the Internet equivalent of background radiation. Harmless in most cases. Exploit scanners aren't new to the LLM age and shouldn't overload your server - unless you're vulnerable to the exploit.
Fun fact: Some people learn about new exploits by watching their incoming requests.
We feel this at work too. We run a book streaming platform with all books, booklists, authors, narrators and publishers available as standalone web pages for SEO, numbering in the multiple millions. The last 6 months have turned into a hellscape, for a few reasons:
1. It's become commonplace to not respect rate limits
2. Bots no longer identify themselves by UA
3. Bots use VPNs or similar tech to bypass IP rate limiting
4. Bots use tools like NobleTLS or JA3Cloak to get around JA3 rate limiting
5. Some legitimate LLM companies seem to follow the above as well to gather training data. We want them to know about our company, so we don't necessarily want to block them
I'm close to giving up on this front, tbh. There are no longer safe methods of identifying malicious traffic at scale, and with the number of variations we have, we can't statically generate these pages. Even with a CDN cache (shoutout Fastly), our catalog is simply too broad to fully saturate the cache while still allowing pages to be updated in a timely manner.
I guess the solution is to just scale up the origin servers... /shrug
In all seriousness, I'd love it if we could somehow tell the bots about more efficient ways of fetching the data. Use our open API for fetching book information instead of causing all that overhead by going through the marketing pages, please.
In principle, it should be possible to identify malign IPs at scale by using a central service and reporting IPs probabilistically. That is, if you report every thousandth page hit with a simple UDP packet, the central tracker gets very low load and still enough data to publish a Bloom filter of abusive IPs; say a million bits gives you a pretty low false-positive rate. (If it's only ~10k malign IPs, tbh you can just keep an LRU counter and enumerate all of them.) A billion hits per hour across the tracked sites would still only correspond to ~50KB/s inflow on the tracker service. Any individual participating site doesn't necessarily get many hits per source IP, but aggregating across a few dozen should highlight the bad actors. Then the clients just pull the Bloom filter once an hour (an 80KB download) and drop requests that match.
Any halfway modern LLM could probably code the backend for this in a day or two and it'd run on a RasPi. Some org just has to take charge and provide the infra and advertisement.
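The filter itself is tiny; a sketch with made-up constants, assuming roughly 100k tracked IPs in the million-bit table (a raw million bits is ~122KB, so the ~80KB figure above assumes some compression):

    // Minimal sketch: million-bit Bloom filter over IPv4s, double hashing.
    // Constants are illustrative, not from any deployed service.
    using System.Collections;
    using System.Net;

    sealed class IpBloomFilter
    {
        const int Bits = 1_000_000; // ~122KB raw
        const int Hashes = 7;       // near-optimal k for ~100k entries in 1M bits

        readonly BitArray bits = new(Bits);

        static (uint, uint) Hash(uint ip)
        {
            // Two cheap integer mixes to derive all k indexes (double hashing).
            uint h1 = ip * 0x9E3779B1u; h1 ^= h1 >> 16;
            uint h2 = ip * 0x85EBCA77u; h2 ^= h2 >> 13; h2 |= 1;
            return (h1, h2);
        }

        public void Add(uint ip)
        {
            var (h1, h2) = Hash(ip);
            for (uint i = 0; i < Hashes; i++)
                bits[(int)((h1 + i * h2) % Bits)] = true;
        }

        public bool MightContain(uint ip)
        {
            var (h1, h2) = Hash(ip);
            for (uint i = 0; i < Hashes; i++)
                if (!bits[(int)((h1 + i * h2) % Bits)]) return false;
            return true; // may be a false positive, by design (~1% at these sizes)
        }

        public static uint Pack(IPAddress a) =>
            BitConverter.ToUInt32(a.MapToIPv4().GetAddressBytes(), 0);
    }

A sampled UDP reporter plus this filter on each participating site is the whole client side.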
Same, I have a few hundred WordPress sites, and bot activity has ramped up a lot over the last year or two. AI scrapers can be quite aggressive and often generate a ton of requests: where, for example, a site has a lot of URL parameters, the bot will go nuts, seemingly iterating through all possible combinations. Sometimes I dig in and try to think of new rules to block the bulk of it, but I am also wary of AI replacing Google and my sites not being in AI's databases.
I hate relying on a proprietary single-source product from a company I don't particularly trust, but (free) Cloudflare Turnstile works for me, only thing I've found that does.
I only protect certain 'dangerous/expensive' (accidentally honeypot-like) paths in my app, and can leave the stuff I actually want crawlers to get, and in my app that's sufficient.
It's a tension because, yeah, I want crawlers to get much of my stuff for SEO (and I don't want to give Google a monopoly on it either; I want well-behaved crawlers I've never heard of to have access to it too. But not at the cost of resources I can't afford).
You may want to take a look at Pingoo (https://github.com/pingooio/pingoo), a reverse proxy with automatic TLS that can also block bots with advanced rules that go beyond simple IP blocking.
> Auto-restart the reverse-proxy if bandwidth usage drops to zero for more than 2 minutes
It's understandable in your case, as you have traffic coming in constantly, but the first thing that came to my mind is a loop of constant restarts; again, very unlikely in your case. Sometimes such blanket rules hit me for the most unexpected reasons, like the proxy somehow failing to start serving traffic within the given timeframe.
Though I completely appreciate and agree with the 'ship something that works now' approach!
Every open port of every IP is continuously scanned for exploits.
The Internet isn't possible without scraping. For all the sentiment against scraping public data, doing so remains legal and essential to a lot of the services we use every day. I think setting guidelines and shaping the web for reduced friction aimed at fair usage, rather than turning it political, would be the right thing to do.
Well sure, but these guidelines exist: robots.txt has been an industry-led, self-governing / self-restrictive standard. But newer bots ignore it. It'll take years for legislation to catch up, and even then it would be by country or region, not something global, because that's not how the internet works.
Even if there is legislation or whatever, you can sue an OpenAI or a Microsoft, but starting a new company that does scraping and sells it on to the highest bidder is trivial.
Not sure if that's satire or not, but how would you even identify the party to sue? What do you do if they're based in a country where you can't sue them over relatively trivial matters like this?
I bet it's free VPN apps.
Do we shift over everything to le Dark Web and let the corpos use this one for selling their shit to consumers? These toys don’t want to play nice and there’s no real way to stop them without bringing in things like Real ID and other verifications that infringe on anonymity.
It's wild. Data is very valuable. This manifests on two fronts simultaneously: whoever has the data tightly controls who sees it and under what circumstances, while on the other side, everyone else scrapes it as hard as they can.
I think he should consider getting out of the indie blog hosting business. It's only going to get worse as the internet continues to decay and he can't be making all that much off the service.
Indie blog businesses are great for the health of the human internet, and I don't think surrendering preemptively will help things get better.
No way. People deserve expression and a place that's THEIRS where they can foster a community. Much is learned. Playing battle bots at the sysadmin level is fun (for me; maybe not so much for others), but to have a place where people express themselves, THEIR place outside of walled gardens such as social media, AND to protect it from the bots?
That's the battle, and expression, people, their interests, and their communities are worth fighting for. _ESPECIALLY_ in this day and age where botnets/scrapers are using things such as Infatica to mask themselves as residential IP addresses, and mimicking human behaviors to better avoid bot detection.
There's a war on authenticity: on people's authentic works, and in the reverse direction, on determining whether a user is authentic nowadays.
His persistent efforts are the reason I pay for Bear Blog. I think he should fight for the chance to come out on the other side of whatever future we’re heading towards.