thethingundone | 2 months ago

I own a forum which currently has 23k online users, all of them bots. The last new post in that forum is from _2019_. Its topic is also very niche. Why are so many bots there? This site should have basically been scraped a million times by now, yet those bots seem to fetch the stuff live, on the fly? I don’t get it.

sethops1|2 months ago

I have a site with a complete and accurate sitemap.xml describing when each of its ~6k pages was last updated (on average, maybe weekly or monthly). What do the bots do? They scrape every page continuously 24/7, because of course they do. The amount of waste going into this AI craze is just obscene. It's not even good content.
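For illustration, here is a minimal sketch of what a crawler that respected the sitemap could do: read each `<lastmod>` and refetch only pages changed since its previous visit. The function names and the `last_crawl` parameter are hypothetical, and the sketch assumes a plain `urlset` sitemap in the standard sitemaps.org namespace (no sitemap index files).

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Standard sitemap namespace from sitemaps.org.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(text):
    """Parse a <lastmod> value (date-only or full timestamp) as an aware datetime."""
    text = text.strip()
    if text.endswith("Z"):
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def urls_needing_refetch(sitemap_xml, last_crawl):
    """Return the URLs whose lastmod is newer than last_crawl.

    Entries without a <lastmod> are refetched, since nothing is known about them.
    """
    root = ET.fromstring(sitemap_xml)
    stale = []
    for entry in root.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        if loc is None:
            continue
        if lastmod is None or parse_lastmod(lastmod) > last_crawl:
            stale.append(loc.strip())
    return stale
```

With a weekly or monthly update cadence, a crawler doing this would touch only a small fraction of the ~6k pages on each pass.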

n1xis10t|2 months ago

It would be interesting if someone made a map that depicts the locations of the ip addresses that are sending so many requests, over the course of a day maybe.

thisislife2|2 months ago

If you are in the US, have you considered suing them for robots.txt / copyright violation? AI companies are currently flush with cash from VCs and there may be a few big law firms willing to fight a lawsuit against them on your behalf. AI companies have already lost some copyright cases.

tokioyoyo|2 months ago

Large-scale scraping tech is not as sophisticated as you'd think. A significant chunk of it is "get as much as possible, categorize and clean up later". Man, I really want the real web of the 2000s back, when things felt "real" more or less... How can we even get back there?

idiotsecant|2 months ago

Have you ever listened to the 'high water mark' monologue from Fear and Loathing? It's pretty much just that. It was a unique time and it was neat that we got to see it, but it can't possibly happen again.

https://www.youtube.com/watch?v=vUgs2O7Okqc

tmnvix|2 months ago

A curated web directory. Kind of like Yahoo had. The internet according to the Dewey system, with pages somehow rated for quality by actual humans (maybe something to learn from Wikipedia's approach here?)

n1xis10t|2 months ago

If people start making search engines again and there is more competition for Google, I think things would be pretty sweet.

thethingundone|2 months ago

I would understand that, but it seems they don’t store the stuff and instead re-collect the same content every hour.

thethingundone|2 months ago

The bots are identifying themselves as Google, Bing and Yandex. I can’t verify whether that attribution is based on IP address or whether the forum simply trusts their user agent. It could basically be anyone.
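One way to check a claimed crawler identity is forward-confirmed reverse DNS: resolve the IP to a hostname, check that the hostname belongs to the operator's domain, then resolve that hostname back and confirm it returns the same IP. A rough sketch follows. The suffix table is an assumption based on what the operators have published and should be checked against their current documentation; `verify_crawler` and its parameters are hypothetical names, and the default forward lookup only handles IPv4.

```python
import socket

# Assumed hostname suffixes per crawler; verify against each operator's docs.
CRAWLER_SUFFIXES = {
    "google": (".googlebot.com", ".google.com"),
    "bing": (".search.msn.com",),
    "yandex": (".yandex.ru", ".yandex.net", ".yandex.com"),
}


def _reverse_lookup(ip):
    """Return the PTR hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname):
    """Return the IPv4 addresses a hostname resolves to."""
    return socket.gethostbyname_ex(hostname)[2]


def verify_crawler(ip, claimed, reverse=_reverse_lookup, forward=_forward_lookup):
    """Return True only if ip passes forward-confirmed reverse DNS for `claimed`.

    The resolvers are injectable so the logic can be exercised without network access.
    """
    suffixes = CRAWLER_SUFFIXES.get(claimed)
    if not suffixes:
        return False
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(suffixes):
            return False
        # The hostname must resolve back to the original IP, otherwise
        # anyone controlling their own PTR record could spoof it.
        return ip in forward(hostname)
    except OSError:
        # Covers socket.herror and socket.gaierror (lookup failures).
        return False
```

Running the IPs from the access log through something like this would show whether the traffic really comes from those search engines or from something borrowing their user agent strings.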

n1xis10t|2 months ago

Interesting. When it was just normal search engines I didn’t hear of people having this problem, so this either means that there are a bunch of people pretending to be Bing, Google and Yandex, or those companies have gotten a lot more aggressive.

danpalmer|2 months ago

How do you define a user, and how do you define online?

If the forum considers unique cookies to be a user and creates a new cookie for any new cookie-less request, and if it considers a user to be online for 1 hour after their last request, then actually this may be one scraper making ~6 requests per second. That may be a pain in its own way, but it's far from 23k online bots.
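The arithmetic behind that estimate, as a small sketch. It assumes the worst case described above: every request arrives without a cookie and is therefore counted as a new "user" until the online window expires. The function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def implied_request_rate(online_count, window_seconds):
    """Requests per second needed to sustain `online_count` sessions,
    if each request creates one session that lives for `window_seconds`."""
    return online_count / window_seconds


# 23k "online users" with a one-hour online window.
rate = implied_request_rate(23_000, 60 * 60)
requests_per_day = rate * SECONDS_PER_DAY

print(f"{rate:.2f} requests/second")      # 6.39 requests/second
print(f"{requests_per_day:,.0f} per day")  # 552,000 per day
```

The result scales inversely with the window: the shorter the forum's online window, the more requests per second it takes to keep 23k sessions alive.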

crote|2 months ago

That's still 518,400 requests per day. For static content. And it's a niche forum, so it's not exactly going to have millions of pages.

Either there are indeed hundreds or thousands of AI bots DDoSing the entire internet, or a couple of bots are needlessly hammering it over and over and over again. I'm not sure which option is worse.

thethingundone|2 months ago

AFAIK it keeps a user counted as online for 5 or 15 minutes (I think 5). It’s a Woltlab Burning Board.

Edit: it’s 15 minutes.

mrweasel|2 months ago

Why pay for storage when you do it for them?

stevage|2 months ago

I'd love to know the answer to this question. AI scrapers wanting everything on the internet makes sense to me. But I don't understand how that leads to every site being hit hundreds of thousands of times per day.

GaryBluto|2 months ago

Why do you keep it operating? Is it the aquarium value?

andrepd|2 months ago

When you have trillions of dollars being poured into your company by the financial system, and when furthermore there are no repercussions for behaving however you please, you tend not to care about that sort of "waste".

csomar|2 months ago

Sure you do by now. You are the hard drive.

sandblast|2 months ago

Are you sure the counter is not broken?

thethingundone|2 months ago

Yes, it’s been running on Woltlab Burning Board since forever.