top | item 46970902

(no title)

krick | 19 days ago

So, what's up with these bots, why am I hearing about that so often lately? I mean, DDoS atacks aren't a new thing, and, honestly, this is pretty much the reason why Cloudflare even exists, but I'd expect OpenAI bots (or whatever this is now) to be a little bit easier to deal with, no? Like, simply having resonable aggressive fail2ban policy? Or do they really behave like a botnet, where each request comes from different IP from a different network? How? Why? What is this thing?

discuss

order

esseph|19 days ago

The dirty secret is a lot of them come through "residential proxies", aka backdoored home routers, iot devices with shitty security, etc. Basically the scrapers who are often also third party, go to these "companies" and buy access to these "residential proxies". Some are more... considerate than others.

Why? Data. Every bit of it is it might be valuable. And not to sound tin foil hatty, but we are getting closer to a post-quantum time (if we aren't already ).

tigerlily|19 days ago

How can I detect if my router is backdoored, or being used as a residential proxy?

the_biot|19 days ago

Has this actually been investigated and proven to be true? I see allegations, but no facts really.

It seems to me to be just as likely that people are installing LLM chatbot apps that do the occasional bit of scraping work on the sly, covered by some agreed EULA.

karel-3d|19 days ago

it isn't that hard to just buy a bunch of sim cards and put them in a modem and use that. it's good enough as a residential proxy. source: I did that before, when I worked on plaid-like thing.

recursivecaveat|19 days ago

I doubt it's OpenAI. Maaaybe somebody who sells to OpenAI, but probably not. I think they're big enough to do this mostly in-house and properly. Before AI only big players would want a scrape of the entire internet, they could write quality bots, cooperate, behave themselves, etc. Now every 3rd tier lab wants that data and a billion startups want to sell it, so it's a wild west of bad behavior and bad implementations. They do use residential IP sets as well.

reppap|19 days ago

Stop just making up excuses for these companies. Other comments on this story have showed the bots are using openai user agents and making requests from openai owned ip ranges.

mikepavone|18 days ago

As someone with a self-hosted Mercurial instance dealing with this, I will say that the big names (OpenAI included, but not exclusively them) generally at least use proper user-agents and respect robots.txt, but they are still needlessly aggressive compared to traditional search indexers.

There are also scrapers that are hiding behind normal browser user agents. When I looked at IP ranges, at least some of them seemed to be coming from data centers in China.

wseqyrku|19 days ago

> this is pretty much the reason why Cloudflare even exists,

You said it yourself. If you're selling a cure, you might as well start a plague.