f33d5173|15 days ago
I've sometimes dreamed of a web where every resource is tied to a hash and can be rehosted by third parties, making archival transparent. This would also make it trivial to stand up a small website without worrying about it getting hug-of-deathed, since others would rehost your content for you. Shame IPFS never went anywhere.
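A minimal sketch of that model in Python: fetch a blob by its hash from whichever mirror answers, then verify it locally, so no individual host needs to be trusted or even stay up. The mirror URLs here are hypothetical.

    # Sketch of content-addressed fetching: ask any mirror for a blob by its
    # SHA-256, then verify the bytes locally. Mirror URLs are made up.
    import hashlib
    import urllib.request

    MIRRORS = [
        "https://mirror-a.example/blob/",   # hypothetical rehosts
        "https://mirror-b.example/blob/",
    ]

    def fetch_by_hash(sha256_hex: str) -> bytes:
        for base in MIRRORS:
            try:
                data = urllib.request.urlopen(base + sha256_hex, timeout=10).read()
            except OSError:
                continue  # mirror down: a hug of death only kills one copy
            if hashlib.sha256(data).hexdigest() == sha256_hex:
                return data  # the hash, not the host, is what you trust
        raise LookupError("no mirror returned matching content")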
CqtGLRGcukpy|15 days ago
This matches my experience running a personal website: AI companies keep coming back even when everything is the same.
giancarlostoro|15 days ago
This also goes back to something I said long ago: AI companies are relearning software engineering, poorly. I can think of so many ways to speed up AI crawlers; I'm surprised someone being paid 5x my salary cannot.
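One of the obvious speedups is conditional GET: revisiting a page you already hold costs the server a 304 and almost no bytes instead of a full re-download. A sketch using only the Python standard library:

    # Basic courtesy many crawlers skip: send the ETag/Last-Modified from the
    # previous fetch and reuse the cached body when the server answers 304.
    import urllib.request
    import urllib.error

    cache = {}  # url -> (etag, last_modified, body)

    def polite_fetch(url: str) -> bytes:
        etag, last_mod, body = cache.get(url, (None, None, b""))
        req = urllib.request.Request(url)
        if etag:
            req.add_header("If-None-Match", etag)
        if last_mod:
            req.add_header("If-Modified-Since", last_mod)
        try:
            resp = urllib.request.urlopen(req, timeout=10)
        except urllib.error.HTTPError as e:
            if e.code == 304:
                return body  # unchanged since the last visit: reuse it
            raise
        body = resp.read()
        cache[url] = (resp.headers.get("ETag"),
                      resp.headers.get("Last-Modified"), body)
        return body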
zmmmmm|15 days ago
Just because you have infinite money to spend on training doesn't mean you should saturate the internet with bots looking for content with no constraints - even if that is a rounding error in your costs.
We just put heavy constraints on our public sites, blocking AI access. Not because we mind AI having access - but because we can't accept the abusive way they execute that access.
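A minimal sketch of that kind of constraint, assuming a stdlib Python server; the User-Agent token list is illustrative, not exhaustive, and real deployments usually do this at the CDN or reverse proxy instead:

    # Refuse requests whose User-Agent matches known AI crawler tokens.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    AI_BOTS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "PerplexityBot")

    class BlockingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ua = self.headers.get("User-Agent", "")
            if any(bot in ua for bot in AI_BOTS):
                self.send_error(403, "AI crawling not permitted")
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<h1>hello</h1>")

    HTTPServer(("", 8080), BlockingHandler).serve_forever()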
dawnerd|14 days ago
nickpsecurity|14 days ago
It helps to not have images, etc., that would drive up bandwidth costs. Serving HTML is just pennies a month with BunnyCDN. If I had heavier content, I might have to block the crawlers or restrict them to specific pages once per day. Maybe just block the heavy content, like the images.
Btw, has anyone tried just blocking things like images to see if scraping bandwidth dropped to acceptable levels?
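One way to answer that empirically: total up bytes served per file extension from an access log before deciding what to block. A rough sketch, assuming a combined-format log at a hypothetical access.log:

    # Sum bytes served per file type so you can see how much of the scraping
    # bandwidth is actually images. Log path and format are assumptions.
    import re
    from collections import Counter

    # combined log format: ... "GET /path HTTP/1.1" status bytes ...
    LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]*" \d{3} (?P<bytes>\d+)')

    bytes_by_ext = Counter()
    with open("access.log") as log:
        for line in log:
            m = LINE.search(line)
            if not m:
                continue
            path = m.group("path").split("?")[0]
            ext = path.rsplit(".", 1)[-1].lower() if "." in path else "(none)"
            bytes_by_ext[ext] += int(m.group("bytes"))

    for ext, total in bytes_by_ext.most_common(10):
        print(f"{ext:>8}  {total / 1e9:.2f} GB")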
iririririr|15 days ago
Maybe they vibecoded the crawlers. I wish I were joking.
anonnon|15 days ago
Why, though? Especially if the pages are new; aren't they concerned about ingesting AI-generated content?
fartfeatures|15 days ago
lukeasch21|15 days ago
Seattle3503|15 days ago
Operyl|15 days ago
shark_laser|15 days ago
You've just described Nostr: content that is tied to a hash (so its origin and authenticity can be verified) and hosted by third parties (or by yourself, if you want).
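For the curious, this is roughly how Nostr ties content to a hash under NIP-01: the event id is the SHA-256 of a canonical JSON serialization, which is what lets any relay rehost an event verifiably. A sketch:

    # Compute a Nostr event id per NIP-01: SHA-256 of the canonical
    # serialization [0, pubkey, created_at, kind, tags, content].
    import hashlib
    import json

    def nostr_event_id(pubkey: str, created_at: int, kind: int,
                       tags: list, content: str) -> str:
        serialized = json.dumps(
            [0, pubkey, created_at, kind, tags, content],
            separators=(",", ":"), ensure_ascii=False)
        return hashlib.sha256(serialized.encode("utf-8")).hexdigest()

    # Any relay (or you) can host the event; the id proves it wasn't altered.
    print(nostr_event_id("a" * 64, 1700000000, 1, [], "hello nostr"))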
Hendrikto|14 days ago
demetris|15 days ago
Also, I always wonder about Common Crawl:
Is there something wrong with it? Is it badly designed? What is it that all the trainers cannot find there, so they need to crawl our sites over and over for the exact same stuff, each on their own?
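For reference, pulling a page out of Common Crawl instead of re-crawling the origin is an index lookup plus a byte-range fetch. A sketch against the public CDX API; the crawl id below is a placeholder, and current ids are listed at index.commoncrawl.org:

    # Look a URL up in Common Crawl's CDX index, then fetch just that
    # record's byte range from the public WARC file.
    import gzip
    import json
    import urllib.parse
    import urllib.request

    CRAWL = "CC-MAIN-2024-33"  # placeholder: pick a current crawl id

    def fetch_from_commoncrawl(url: str) -> bytes:
        q = urllib.parse.urlencode({"url": url, "output": "json", "limit": 1})
        index = f"https://index.commoncrawl.org/{CRAWL}-index?{q}"
        rec = json.loads(urllib.request.urlopen(index, timeout=30).readline())
        start = int(rec["offset"])
        end = start + int(rec["length"]) - 1
        req = urllib.request.Request(
            "https://data.commoncrawl.org/" + rec["filename"],
            headers={"Range": f"bytes={start}-{end}"})
        return gzip.decompress(urllib.request.urlopen(req, timeout=30).read())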
ccgreg|15 days ago
The folks who crawl the most appear to mostly be doing grounding or RAG, along with AI companies who think they can build a better foundational model by going big. We recommend that all of these folks respect robots.txt and rate limits.
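What those two courtesies look like in practice, sketched with Python's stdlib robotparser; the crawler name and the fallback delay are assumptions:

    # Check robots.txt before fetching and honor its Crawl-delay,
    # falling back to a conservative pause between hits to a host.
    import time
    import urllib.parse
    import urllib.request
    import urllib.robotparser

    UA = "ExampleBot"  # hypothetical crawler name

    def crawl(urls):
        rp = urllib.robotparser.RobotFileParser()
        for url in urls:  # a real crawler would cache robots.txt per host
            root = urllib.parse.urlparse(url)
            rp.set_url(f"{root.scheme}://{root.netloc}/robots.txt")
            rp.read()
            if not rp.can_fetch(UA, url):
                continue  # the site said no; that should be the end of it
            delay = rp.crawl_delay(UA) or 5  # seconds between requests
            yield urllib.request.urlopen(url, timeout=10).read()
            time.sleep(delay)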
Denatonium|14 days ago
WalterBright|14 days ago
I wrote a short paper on that 25 years ago, but it went nowhere. I still think it is a great idea!
j45|14 days ago
Kind of sucks, because news is an important part of that kind of archive.
raincole|15 days ago
pigggg|15 days ago
golem14|15 days ago
jeron|14 days ago
Definitely, this is going to hurt those over at /r/datahoarder.
toomuchtodo|15 days ago
Aurornis|15 days ago
News websites aren’t like those labyrinthine cgit-hosted websites that get crushed under scrapers. If 1,000 different AI scrapers hit a news website every hour, it wouldn’t even make a blip in the traffic logs.
Also, AI companies are already scraping these websites directly in their own architecture. It’s how they try to stay relevant and fresh.
dawnerd|14 days ago
terminalshort|15 days ago
nerdponx|15 days ago
lxgr|15 days ago
zaphirplane|15 days ago