Common Crawl is shown in their screenshot of "Providers" alongside OpenAI and Anthropic. The challenge is that Common Crawl is used for a lot of things that are not AI training. For example, it's a major source of content for the Wayback Machine.
In fact, that's the entire point of the Common Crawl project. Instead of dozens of companies writing and running their own (often poorly designed) crawlers and hammering everyone's site, Common Crawl runs once and exposes the data in industry-standard formats like WARC for other consumers. Their crawler is quite well behaved (exponential backoff, obeys Crawl-Delay, uses sitemap.xml to know when to revisit, follows robots.txt, etc.).
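For a sense of what "consume Common Crawl instead of crawling yourself" looks like in practice, here's a minimal sketch using the third-party warcio library to iterate the response records in a downloaded WARC segment (the filename is a placeholder for any segment fetched from commoncrawl.org):

    from warcio.archiveiterator import ArchiveIterator

    # Iterate one downloaded Common Crawl segment; warcio handles the
    # gzip framing transparently.
    with open("CC-MAIN-example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                body = record.content_stream().read()
                print(url, len(body))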
There are significant knock-on effects if Cloudflare starts (literally) gatekeeping content. This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access, separating those who pay from those who don't, whether they are bots or people.
> gatekeep access, separating those who pay from those who don't, whether they are bots or people.
I'm already constantly being classified as a bot. Just today:
To check if something is included in a subscription that we already pay for, I opened some product page on the Microsoft website this morning. Full-page error: "We are currently experiencing high demand. Please try again later." It's static content but it's not available to me. Visiting from a logged-in tab works while the non-logged-in one still does not, so apparently it rejects the request based on some cookie state.
Just now I was trying to book a hotel room for a conference in Grenoble. Looking in the browser dev tools, it seems that Visa is trying to run some bot detection (the payment provider redirects to their site for the verification code, but Visa automatically redirects me back with an error status) and blocks the payment. There are no other payment methods. Using Google Chrome works, but Firefox with uBlock Origin (a very niche setup, I'll admit) locks you out of this part of the internet.
Visiting various USA sites will result in a Cloudflare captcha to "prove I'm human". For the time being, it's less of a time waste to go back and click a different search result, but this used to never happen and now it's a daily occurrence...
I think this is a temporary problem. In a few years many AI companies will run out of VC money, others will be only after "low-background" content made before AI spam. Maybe one day nature will heal.
> Common Crawl runs once and exposes the data in industry-standard formats like WARC for other consumers
And what stops companies from using this data for model training? Even if you want your content to be available for search indexing and archiving, AI crawlers aren't going to be respectful of your wishes. Hence the need for restrictive gatekeeping.
Already, sites like Perplexity have been completely blocked by Cloudflare due to some meta signal and can't even load. This will just become more common: sites blocking everything and everyone that isn't, say, a high-paying iOS user on a Verizon cell connection in San Francisco moving the DOM slowly.
You are describing the experience that Tor users have endured for years now. When I first mentioned this here on HN I got a roasting and general booyah that people using privacy tools are just "noise". Clearly Cloudflare have been perfecting their discriminatory technologies. I guess what goes around comes around. "First they came for the..." etc etc.
Anyway, I see a potential upside to this, so we might be optimistic. Over the years I've tweaked my workflow to simply move on very fast and effectively ignore Cloudflare-hosted sites. I know... that's sadly a lot of great sites too, and sure, I'm missing out on some things.
On the other hand, it seems to cut out a vast amount of rubbish. Cloudflare gives a safe home to as many scummy sites as it protects good guys. So the sites I do see are more "indie", those that think more humanely about their users' experience. Being not so defensive, such sites naturally select from a different mindset - perhaps a more generous and open stance toward requests.
So what effect will this have on AI training?
Maybe a good one. Maybe tragic. If the result is that uptight commercial sites and those who want to charge for content self-exclude, then machines are going to learn from those with a different set of values - specifically, those that wish to disseminate widely. That will include propaganda and disinformation for sure. It will also tend to filter out well-curated, good journalism. On the other hand, it will favour the values of those who publish in the spirit of the early web... just to put their own thing up there for the world.
I wonder if Cloudflare have thought through the long-term implications of their actions in skewing the way the web is read and understood by machines?
> This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access, separating those who pay from those who don't
... and that future has been a long time coming. People who want an alternative to advertising-supported online content? This is what that alternative looks like. Very few content providers are going to roll their own infrastructure to standardize accepting payments (the legally hard part) or to build the technology for gating content (the technically hard part); they just want to be paid for putting content online.
This seems like a gimmick. Isn't preventing crawling a Sisyphean task? The only real difference this will make is further entrenching big players who have already crawled a ton of data. And if this feature comes at the cost of false positives and overbearing captchas, it will start to affect users.
Companies have been trying and failing to prevent large-scale crawling for 25 years. It's a constant arms race, and the scrapers always win.
The people who lose are the honest individuals running a simple scraper from their laptop for personal or research purposes. Or, as you pointed out, any new AI startup that can't compete with the low cost of data acquisition the incumbents already benefited from.
My website contains millions of pages. It's not hard to notice the difference between a bot (or network) that wants to access all pages and a regular user.
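A rough sketch of how simple that distinction can be (the window and threshold are made-up numbers, not any real product's logic): count distinct pages per client per time window, and flag anyone touching far more unique URLs than a human plausibly reads.

    import time
    from collections import defaultdict

    WINDOW = 3600      # seconds
    THRESHOLD = 500    # distinct pages per window; tune per site

    seen = defaultdict(set)       # client IP -> URLs seen this window
    window_start = time.time()

    def is_probably_bot(ip: str, url: str) -> bool:
        global window_start
        if time.time() - window_start > WINDOW:
            seen.clear()
            window_start = time.time()
        seen[ip].add(url)
        return len(seen[ip]) > THRESHOLD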
> The only real difference this will make is further entrenching big players
It's the opposite. Only big players like Google get meetings with big publishers and copyright holders to be individually whitelisted in robots.txt. A marketplace, by contrast, is accessible to any startup or university.
As an actual content provider, I see this as an opportunity. We pay our journalists real money to write real stories. If AI results haven't started affecting our search traffic yet, they will soon. Up until now we've had two choices: block AI-based crawlers and fall completely out of that market, or continue to let AI companies train off of our hard-won content and take it as a loss that still generates a little bit of traffic. Cloudflare now offers a third option, if we can figure out how to use it.
Dissing on Cloudflare is the new thing, and I get it. They're big and powerful and they influence a massive amount of the traffic on the web. Like the saying goes though, don't blame the player, blame the game. Ask yourself if you'd rather have Alphabet, Microsoft, Amazon or Apple in their place, because probably one of them would be.
I distinctly remember Cloudflare being accused here of hosting spammers and selling protection against them a decade ago. Then suddenly the name became associated with positive things only, and the whole affair was memory-holed.
> Website owners can block all web scrapers using AI Audit, or let certain web scrapers through if they have deals or find their scraping beneficial.
You don't have to make any deals, or participate in the marketplace, "block all" is right there.
And if you are not using Cloudflare, you are going to be abused. This is a sad fact, but I have no idea why you are blaming Cloudflare and not AI companies.
Well, as long as Cloudflare pays you to be "abused" (by which we mean spending more money on bandwidth), it should be no problem for many of the site owners.
The term "abuse" in this description is both confused and confusing. Websites are trying to meter out a public resource, which is something they're unable to do by themselves. Cloudflare is offering to help them, for a fee. Once the practice is metered, it isn't abuse anymore. It's just using the public service, which the website owner deliberately operates.
I was recently speaking with people from OpenFoodFacts and OpenStreetMap, and I guess Wikipedia has the same issue. They are under constant DDoS by bots scraping everything, even though the full dataset can be downloaded for free with a single HTTP request. They said this useless traffic was a huge cost for them.
This is not about copyright, just about bots being stupid and the people behind them not caring at all. We definitely need a solution to this. Keeping a system online nowadays means not only that they get your data, but that you pay for it!
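For comparison, the polite version is a single streaming download of the published dump. A minimal sketch; the URL is the standard location of the English Wikipedia dump as of this writing.

    import shutil
    import urllib.request

    DUMP = ("https://dumps.wikimedia.org/enwiki/latest/"
            "enwiki-latest-pages-articles.xml.bz2")

    # One HTTP request, streamed to disk with constant memory use.
    with urllib.request.urlopen(DUMP) as resp, open("enwiki.xml.bz2", "wb") as out:
        shutil.copyfileobj(resp, out)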
To be fair, some 20 years ago when I wanted to do something with Wikipedia data, I scraped them too, after having tried quite a bit to use the dumps.
- dump availability was shaky at best back then (could see months go by without successful dumps)
- you had to fiddle with it to actually process the dumps
- you'd get the full Wikipedia content, but you didn't have the exact Wikipedia MediaWiki setup, so a bunch of things were not rendered
- you couldn't get their exact version of MediaWiki, because they added more than what was released openly
Now, I'm not saying that they were wrong to do that back then, and I assume things have improved. Their mission wasn't to provide an easy way to download & import the data, so it wasn't a focus topic, and they probably ran more bleeding-edge versions of MediaWiki and plugins that they didn't deem stable enough for general public consumption. But it made it very hard to do "the right thing", and just whipping up a script to fetch the URLs I cared about (it was in Perl back then!) was orders of magnitude faster.
At least for me, had they offered an easy way to set up a local mirror, I would've done that. I assume this is similar for many scrapers: they're extremely experienced at building scrapers, but they have no idea how to set up some software and how to import dumps that may or may not be easy to manage, so to them the cost of writing a scraper is much smaller. If you shift that imbalance, you probably won't stop everyone from hitting your live servers, but you'll stop some because it's easier for them not to and instead get the same data from a way that you provided them.
I've just taken to blocking entire swaths of cloud services' IP networks.
I don't care what the intentions are; my personal sites don't have the infinite bandwidth to put up with thousands of poorly written spiders.
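A minimal sketch of that blunt approach (the CIDR ranges below are illustrative placeholders; the real lists are published by AWS, GCP, etc. as JSON and need periodic refreshing):

    import ipaddress

    BLOCKED_NETS = [ipaddress.ip_network(c) for c in (
        "3.0.0.0/9",      # placeholder for an AWS-style range
        "34.64.0.0/10",   # placeholder for a GCP-style range
    )]

    def is_blocked(client_ip: str) -> bool:
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in BLOCKED_NETS)

    print(is_blocked("3.5.1.2"))   # True: inside the first placeholder range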
I don't care whether you're OpenAI, Amazon, Meta, or some unknown startup. As soon as you generate a noticeable load on any of the servers I keep my eyes on, you'll get a blank 403 from all of the servers, permanently.
I might allow a few select bots once there is clear evidence that they help bring revenue-generating visitors, like a major search engine does. Until then, if you want training data for your LLM, you're going to buy it with your own money, not my AWS bill.
The AI scrapers have failed to discover something old-style search engines have been doing for decades: respecting a host and not putting too much load on it. I'd say you did a good job banning those that generate noticeable load.
How long does the world-wide-web have left? It's always felt like it would be around forever, but at some point it will fade into obscurity like IRC has done. The golden age, I feel, has been gone a while, but "AI" seems like the beginning of the end.
"AI" is the beginning of the end the same way as spam, malware and bot content were perceived in the past. To every action there is a reaction and "AI" won't be an exception.
> A demo of AI Audit shared with TechCrunch showed how website owners can use the tool to see how AI models are scraping their sites. Cloudflare’s tool is able to see where each scraper that visits your site comes from, and offers selective windows to see how many times scrapers from OpenAI, Meta, Amazon, and other AI model providers are visiting your site.
And if I didn't authorize the freeloading copyright-laundering service companies to pound my server and take my content, then I need a really good lawyer, with big teeth and claws.
Here's a look at my AI Audit on Bingeclock for anyone who's curious (interesting drop in the last 48 hours, given that it coincided with Cloudflare's announcement):
https://www.bingeclock.com/blog/img/ai-audit-cloudflare-0923...
The payment program sounds intriguing, I suppose. I can't imagine it will do much to move the needle for websites that will become unviable due to traffic drain. Without a doubt, AI scrapers will (quite rationally from their POV) avoid anything but nominal payments until they're forced to do otherwise.
> If you don’t compensate creators one way or another, then they stop creating, and that’s the bit which has to get solved
I'm not sure this is true. Maybe they stop creating commercial stuff for sale, and go do something else for money, but generally creative people don't stop creating just because they can't get paid for it.
I'm pretty interested in how companies are exploring ways to properly monetize or compensate for scraped content to help keep a strong ecosystem of quality content. I'd love to see more efforts like this.
With the Cloudflare stuff, it just seems like an excuse to sell Cloudflare services (and continue to force everyone to use it) as opposed to just figuring out a standard way of using what is already built to provide access for some type of micropayment.
To the extent quality content does exist online: what isn't either already behind a paywall, or created by someone other than who will be compensated under such a scheme?
This won't work. If you are doing an AI startup, you will want to pass your crawler off as Googlebot, and that will bypass this.
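The point being that a User-Agent string proves nothing. A minimal sketch with the stdlib; example.com is a placeholder:

    import urllib.request

    # Any client can claim to be Googlebot in its User-Agent header.
    req = urllib.request.Request(
        "https://example.com/",
        headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                               "+http://www.google.com/bot.html)"},
    )
    print(urllib.request.urlopen(req).status)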
Not too much of a loss, since the only quality content is already behind paywalls or on diverse wiki-style sites. Anything served with ads for commercial reasons is automatically drivel, based on my experience. There simply isn't a business in making it better.
Edit: updated comment to not be needlessly divisive.
Most likely it's aspiring AI startups gathering as much data as they can before the jaws of regulation snap shut around them, cutting off the bloodstream.
In this AI race (hype), data is finally the ultimate gold. And at the rate information everywhere is being polluted by GenAI junk, any remnant of real data is a holy grail.
It is indeed a huge waste to scrape the same whole site over and over for changes and new content. If Cloudflare is capable of maintaining an overview of changes and updates, it could save a lot of resources.
The site could tell Cloudflare directly what changed, and Cloudflare could tell the AI. The AI buys the changes, and Cloudflare pays the site, keeping a margin.
The sitemap.xml spec already has fields for indicating the last time a page was changed and how often it's expected to change in the future, so that search engines can optimize their updates accordingly, but AI scrapers tend to disregard that and just download the same unchanged page 10,000 times for the hell of it.
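A minimal sketch of what the spec already enables, assuming the standard sitemaps.org schema (the sitemap URL and last-crawl cutoff are placeholders): read the <lastmod> field and skip anything unchanged since the last visit.

    import urllib.request
    import xml.etree.ElementTree as ET
    from datetime import datetime, timezone

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    last_crawl = datetime(2024, 9, 1, tzinfo=timezone.utc)

    with urllib.request.urlopen("https://example.com/sitemap.xml") as resp:
        tree = ET.parse(resp)

    for entry in tree.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        if lastmod:
            # W3C datetime, e.g. 2024-09-20 or 2024-09-20T14:30:00Z
            when = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
            if when.tzinfo is None:          # date-only form, assume UTC
                when = when.replace(tzinfo=timezone.utc)
            if when <= last_crawl:
                continue                     # unchanged: don't re-fetch
        print("re-fetch:", loc)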
I guess with a marketplace like this, if webmasters are happy and the AI agents are also happy, we'll see quite a few services come up with similar solutions.
The end goal will be a shift from search engine optimization to something like LLM optimization or prompt-engine optimization.
Any recommendations for a simple WAF tool that will stop the majority of the abuse without having to use Cloudflare? I use Cloudflare just to keep that noise out of my logs, but I'm not super keen on being dependent on them.
Maybe they could solve some of the core issues instead. It's as if CF lost the source code and is just pushing more or less useless new features all the time. That said, I think this is a fair change.
I guess Web3 will exist after all. In a microtransaction-per-webpage-utilized sense. No way websites don't start charging real people when there's money to be made.
One minor, tedious thing that I've become so tired of lately is showcased very plainly in the screenshot in this article: That the Cloudflare admin dashboard has now prominently placed "AI Audit (ALPHA)" as a top-level navigation menu item at the very top of the list of a Cloudflare Account's products. Everyone is doing this, for AI products or whatever came before them, and it genuinely pushes me away from paying for Cloudflare, as I get the distinct sense that they aren't building the things or fixing the problems that I feel are important to me.
I would greatly appreciate the ability to customize the items and ordering of those items in this sidebar.
What's wrong with AI agents accessing website content?
We seem to have been happy with Google doing that for ages in exchange for displaying the website in search results.
The website owner chooses. They can say "nope" in robots.txt. Not everyone respects this, but Google does. Google can choose not to show that site as a result, if they want to.
This adds a third option besides yes and no, which is "here's my price". Also, because cloudflare is involved, bots that just ignore a "nope" might find their lives a bit harder.
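For the mechanics of the "nope", a minimal sketch with the stdlib parser (the URL and bot name are placeholders):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # A well-behaved crawler checks both the allow rules and Crawl-delay.
    print(rp.can_fetch("ExampleAIBot", "https://example.com/articles/1"))
    print(rp.crawl_delay("ExampleAIBot"))   # None if no Crawl-delay is set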
The thing people have been doing for ages is a trade: I let you scrape me and in return you send me relevant traffic. The new choice isn't about a trade, so it's different.
Yeah, there's a lot of confusion between AI training and AI agent access, and it's dangerous.
Training embeds the data into the model and has copyright implications that aren't yet fully resolved. But an AI agent using a website to do something for a user is not substantially different than any other application doing the same. Why does it matter to you, the company, if I use a local LLaMA to process your website vs an algorithm I wrote by hand? And if there is no difference, are we really comfortable saying that website owners get a say in what kinds of algorithms a user can run to preprocess their content?
For traditional search indexing, the interests of the aggregator and the content creator were aligned. AIs, on the other hand, are adversarial to the interests of content creators: a sufficiently advanced AI can replace the creator of the content it was trained on.
Props to Cloudflare for referring to it as "scanning your data", which is probably the most technically accurate way to describe what AI training bots are doing.
Boy I'm sick of clicking "Verify you are human" on everything from GitLab to banking apps running Cloudflare.
Sick enough that I hope someone prominent at the EFF or similar takes Cloudflare to court over it.
One company shouldn't be allowed to police access to the internet. And certainly shouldn't be allowed to start gatekeeping what is viewable by discriminating against the person or software doing the viewing.
I worry that Cloudflare will keep escalating this unless they're sent a strong signal that it's not supported by the tech community. If you work there, it might be time to consider getting a different job. If you own stock, maybe divest. If you're connected, perhaps your associates can buy from competitors. That's probably the only way to get the board and CEO replaced these days.
>Sick enough that I hope someone prominent at the EFF or similar takes Cloudflare to court over it.
On what basis? It sucks that you can't visit those sites without going through an interstitial, but at the end of the day, those sites are essentially private property and the owners can impose whatever requirements they want on visitors. It's not any different than sites that have registration walls, for instance.
Cloudflare is just one of many products blocking unwanted network traffic. They're the biggest, for sure, but hardly the only one. If Cloudflare disappeared tomorrow, another would pop up instantly.
The problem isn't Cloudflare, it's that the internet is filled with ill-willed bots, and those bots seem to have infected your network or your ISPs network as well.
If ISPs did a better job taking action against infected IoT crap and spam farms, you wouldn't need to click so many CAPTCHAs.
Without Cloudflare, you'd just see a page saying "blocked because of suspicious network activity", or nothing at all, or a redirect to a shock site if the site admin is feeling particularly spicy. If anything, Cloudflare CAPTCHAs are doing you a service by being a cheap and effective alternative to mass IP range blocks.
Something I never considered: I wonder how clicking to prove you're human works for people with disabilities. There have got to be accessibility features there, and I bet bots are abusing them.
> I worry that Cloudflare will keep escalating this unless they're sent a strong signal that it's not supported by the tech community.
AI sycophants have truly deluded themselves into thinking everyone else is falling for their bullshit, it's great to see.
This feature wouldn't exist if "the tech community" didn't support it. If you want someone to blame, it's the AI companies for ruining what was a good thing with their blind greed and gold rush of trying to slurp up literally everything they could get their hands on in the shittiest ways possible, not respecting any commonly agreed upon rules like robots.txt
> I worry that Cloudflare will keep escalating this unless they're sent a strong signal that it's not supported by the tech community.
I don't think that it's not supported by the tech community. Much of that community is on the receiving end of the bad actors. I know that depending on the day I, for one, have muttered under my breath "This would be much easier if everyone were using the same damn web browser."
Wow, a big tech company thinking about creators, not about how to extract all it can, but about how to give back. That has become so uncommon nowadays. Cloudflare deserves its exponential growth. Kudos to them.
I really love Cloudflare. They're always up to something interesting and different. I hope we see more companies rise up like Cloudflare. I almost want to say Cloudflare is everything we hoped Google would be, but Google became another corporate cog machine that innovates and then scraps things in one fell swoop. I don't recall the last time I heard of Cloudflare spinning something up just to wind it back down. I don't think it's impossible for them to make a bad choice, but they typically seem to think their projects through.
My biggest problem with AI is that once it starts getting legislated, it will be limited in how it can function and be built; we are going to lock in existing LLMs like ChatGPT as the leaders and stop anyone from competing, since newcomers won't be able to train on the same data.
My other big problem with "AI", or really LLMs, which is what everyone's hyped about, is the lack of offline-first capabilities.
Cloudflare bet big on NFTs (https://blog.cloudflare.com/cloudflare-stream-now-supports-n...), Web3 (https://blog.cloudflare.com/get-started-web3/), and proof of stake (https://blog.cloudflare.com/next-gen-web3-network/). In fact they "bet on blockchain" way back in 2017 (https://blog.cloudflare.com/betting-on-blockchain/), but it's telling that they haven't published anything on it in the last couple of years (since Nov 2022). Since then the only crypto-related content on blog.cloudflare.com is real cryptography, like data encryption.
I'm not criticising. I'm just saying they're part of an industry that thought web3 was the Next Big Thing from 2017 to 2022 and then pivoted when ChatGPT was released in Nov 2022. Now AI is the Next Big Thing.
I wouldn't be surprised if a lot of the blockchain stuff got sunset over the next few years. Can't run those in perpetuity, especially if there aren't any takers.
Someone somewhere outside of your country's legal entities can still do all the things your country doesn't like and there's little to stop them.
Governments might limit legal or commercial usage but it doesn't mean it won't exist.
account42|1 year ago
And I'm sure Buttflare will be more than happy to sell those products.
andyp-kw|1 year ago
The big players might just pay the fee because they might one day need to prove where they got the data from.
neilv|1 year ago
This time, Cloudflare has formed a "marketplace" for the abuse from which they're protecting you, partnering with the abusers.
And requiring you to use Cloudflare's service, or the abusers will just keep abusing you, without even a token payment.
I'd need to ask the lawyer how close this is to technically being a protection racket, or other no-no.
jsheard|1 year ago
Wait 'til you find out how many of the DDoS-for-hire services that Cloudflare offers to protect you from are themselves protected by Cloudflare.
loceng|1 year ago
How much effort Cloudflare then puts into tracking bot networks' circumvention efforts is another question.
h8hawk|1 year ago
I've been making crawlers for a living! Thanks for informing me that I'm a parasite.
Mistletoe|1 year ago
https://www.reddit.com/r/webscraping/comments/w1ve97/virgin_...
kordlessagain|1 year ago
Then there's a Lightning Network protocol for it: https://docs.lightning.engineering/the-lightning-network/l40...
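Roughly, the shape of that flow (this is a loose sketch of a generic HTTP 402 paywall, not the exact L402 handshake; see the linked docs for the real protocol, and pay_invoice here is a hypothetical callback):

    import urllib.request
    from urllib.error import HTTPError

    def fetch_paid(url: str, pay_invoice) -> bytes:
        try:
            return urllib.request.urlopen(url).read()
        except HTTPError as e:
            if e.code != 402:                # 402 Payment Required
                raise
            invoice = e.headers.get("WWW-Authenticate", "")
            token = pay_invoice(invoice)     # settle out of band, get a receipt
            req = urllib.request.Request(url, headers={"Authorization": token})
            return urllib.request.urlopen(req).read()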
hedora|1 year ago
Each time they do, we see more consolidation of the media, and lower pay for the people that produce the content.
I don’t see why this particular effort will turn out differently.
dangoodmanUT|1 year ago
But if they are only tracking the bot via the user agent, then can't I piggyback on that user agent? No AI scraper is going to include an auth header when accessing your website...
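For context: the usual counter to user-agent piggybacking is a forward-confirmed reverse DNS check on the connecting IP, which is what Google documents for verifying real Googlebot traffic. A minimal stdlib sketch:

    import socket

    def is_real_googlebot(ip: str) -> bool:
        try:
            host = socket.gethostbyaddr(ip)[0]       # reverse lookup
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward-confirm: the claimed name must resolve back to the IP.
        return ip in socket.gethostbyname_ex(host)[2]

    print(is_real_googlebot("66.249.66.1"))   # a known Googlebot address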
xyzzy_plugh|1 year ago
Cloudflare is obviously right here. AI has changed things so an open web is no longer possible. /s
What absolute garbage.