top | item 44443480

Cloudflare Introduces Default Blocking of A.I. Data Scrapers

429 points | stephendause | 8 months ago | nytimes.com

331 comments

abalashov|8 months ago

Few people realise that virtually everything we do online has, until this point, been free training to make OpenAI, Anthropic, etc. richer while cutting humans--the ones who produced the value--out of the loop.

It might be too little, too late, at this juncture, and this particular solution doesn't seem too innovative. However, it is directionally 100% correct, and let's hope for massively more innovation in defending against AI parasitism.

andy99|8 months ago

It's Cloudflare and parasites like them that will make the internet un-free. It's already happening: I'm either blocked or back to 1998 load times because of "checking your browser". They are destroying the internet and will make it so only people who do approved things on approved browsers (meaning: let advertising companies monetize their online activity) will get real access.

Cloudflare isn't solving a problem, they are just inserting themselves as an intermediary to extract a profit, and making everything worse.

jefftk|8 months ago

I write online (comments here, open source software, blogging, etc) because I have ideas I want to share. Whether it's "I did a thing and here's how" or "we should change policy in this specific way" or "does anyone know how to X" I'm happy for this to go into training models just like I'm happy for it to go into humans reading.

bawolff|8 months ago

I think it's 100% ok to freely train on public internet data.

What is absolutely not ok is to crawl at such an excessive speed that it makes it difficult to host small scale websites.

Truly a tragedy of the commons.

godelski|8 months ago

Including your comment, including this comment.

HN itself is routinely scraped. What makes me most uncomfortable is deanonymization via speech analysis. It's something we can already do, but it is hard to do at scale. This is the ultimate tool for authoritarians. There are no hidden identities, because your speech is your identifier. It is without borders. It doesn't matter if your government is good; a bad-acting government (or even a large corporate entity) has the power to blackmail individuals in other countries.

We really are quickly heading towards a dystopia. It could result in the entire destruction of the internet or an unprecedented level of self-censorship. We already have algospeak because of platform censorship [0]. But this would be a different type of censorship. Much more invasive, much more personal. There are things worse than the dark forest.

[0] literally yesterday YouTube gave me, a person in the 25-60 age bracket, a content warning because there was a video about a person that got removed from a plane because they wore a shirt saying "End veteran suicide".

[0.1] Even as I type this I'm censored! Apple will allow me to swipe the word suicidal but not suicide! Jesus fuck guys! You don't reduce the mental health crisis by preventing people from even being able to discuss their problems, you only make it worse!

visarga|8 months ago

> everything we do online has, until this point, been free training to make OpenAI, Anthropic, etc. richer while cutting humans--the ones who produced the value--out of the loop

I think, on the contrary, whoever sets the prompts stands to benefit: the AI provider gets a flat fee, and authors get nothing except the same AI tools as anyone else. That is natural, since the users bring the problem to the AI; of course they have the lion's share here.

AI is useless until applied to a specific task owned by a person or company. Within such a task there is opportunity for AI to generate value. AI does not generate its own opportunities, users do.

Because users are distributed across society, benefits follow the same curve. They don't flow to the center but mainly remain at the edge. In this sense LLMs are like Linux: they serve every user in their specific way, but the contributors to the open source code don't get directly compensated.

jowea|8 months ago

Is it even possible that Cloudflare could manage to block all AI data scraping? I think this measure is just going to make it harder and more expensive, which will stop AI scrapers from hitting every single page every single day and creating expenses for publishers, but not actually stop their data from ending up in a few datasets.

cmeacham98|8 months ago

Cutting humans out of what loop? What jobs or opportunities were people posting Reddit comments or whatever getting that are now going to AI?

Kostic|8 months ago

This would be true if not for open-weights (and even some open source) LLMs that exist today. Not everything should be done for profit.

Dig1t|8 months ago

That was always the cost of free and open exchange of ideas though. The idea of the internet in the first place was to allow people to communicate in the open and publish ideas freely. There was never any stipulation that using the published ideas to make money was off limits.

Technology has advanced and now reading the sum total of the freely exchanged ideas has become particularly valuable. But who cares? The internet still exists and is still usable to freely exchange ideas the way it’s always been.

The value that one website provides is a minuscule amount, the value of one individual poster on Reddit is minuscule. Are we asking that each poster on Reddit be paid 1 penny (that’s probably what your posts are worth) for their individual contribution? My websites were used to train these models probably, but the value that each contributed is so small that I wouldn’t even expect a few cents for it.

The person who’s going to profit here is Cloudflare or the owners of Reddit, or any other gatekeeper site that is already profiting from other people’s contributions.

The “parasitism” here just feels like normal competition between giant companies who have special access to information.

lofaszvanitt|8 months ago

Cyberpunk aged well. "You better not be on the unprotected internet". Too many hazards out there. Rogue AIs and other shit...

Cloudflare is here to protecc you from all those evils. Just come under our umbrella.

risyachka|8 months ago

Maybe so, but I'll take Cloudflare over OpenAI and Meta every time.

rramon|8 months ago

Isn't there a possibility that model makers retaliate by erasing them and their frameworks from memory, hurting CF adoption by devs?

az226|8 months ago

That’s the irony. Doing it now is just hampering competition and making it better for the incumbents.

mathiaspoint|8 months ago

This has been going on even since early social media. I think most of the users actually prefer it.

giancarlostoro|8 months ago

There's a reason reddit started charging for API usage.

dwoldrich|8 months ago

I think the parasitism goes quite a bit further than AI. We're being digested not parasitized.

nektro|8 months ago

it brings me so much joy that this is the top comment on this post

k__|8 months ago

Is anyone suing to make the models and their weights open source?

tcdent|8 months ago

[deleted]

jasonthorsness|8 months ago

I turned this on and it adjusts the robots.txt automatically; not sure what else it is doing.

    # NOTICE: The collection of content and other data on this
    # site through automated means, including any device, tool,
    # or process designed to data mine or scrape content, is
    # prohibited except (1) for the purpose of search engine indexing or
    # artificial intelligence retrieval augmented generation or (2) with express
    # written permission from this site’s operator.

    # To request permission to license our intellectual
    # property and/or other materials, please contact this
    # site’s operator directly.

    # BEGIN Cloudflare Managed content

    User-agent: Amazonbot
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /

    User-agent: Bytespider
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: GPTBot
    Disallow: /

    User-agent: meta-externalagent
    Disallow: /

    # END Cloudflare Managed Content

    User-agent: *
    Disallow: /*
    Allow: /$
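For a crawler that does honor these rules, the effect can be checked with Python's stdlib parser. A sketch, with one caveat: `urllib.robotparser` implements only the classic prefix rules, not the `/*` and `/$` wildcard extension used in the catch-all block, so only per-agent entries are shown here.

```python
from urllib import robotparser

# A minimal excerpt of the managed rules above (per-agent entries only).
rules = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# GPTBot is barred from everything; an agent with no entry is unrestricted here.
print(rp.can_fetch("GPTBot", "https://example.com/post"))      # False
print(rp.can_fetch("SomeBrowser", "https://example.com/post"))  # True
```

Of course, this only constrains crawlers that choose to call `can_fetch` at all, which is the crux of the thread's debate.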

1vuio0pswjnm7|8 months ago

"User-agent: CCBot disallow: /"

Is Common Crawl exclusively for "AI"

CCBot was already in so many robots.txt prior to this

How is CC supposed to know or control how people use the archive contents

What if CC is relying on fair use

   # To request permission to license our intellectual
   # property and/or other materials, please contact this
   # site's operator directly
If the operator has no intellectual property rights in the material, then do they need permission from the rights holders to license such materials for use in creating LLMs and to collect licensing fees?

Is it common for website terms and conditions to permit site operators to sublicense other people's ("users") work for use in creating LLMs for a fee?

Is this fee shared with the rights holders?

postalcoder|8 months ago

This is interesting. The reasoning and response don't line up.

  > Cloudflare is making the change to protect original content on the internet, Mr. Prince said. If A.I. companies freely use data from various websites without permission or payment, people will be discouraged from creating new digital content, he said

  >  prohibited except for the purpose of [..] artificial intelligence retrieval augmented generation 

This seems to be targeted at taxing training of language models, but why an exclusion for the RAG stuff? That seems like it has a much greater immediate impact for online content creators, for whom the bots are obviating a click.

bee_rider|8 months ago

I wonder… Google scrapes for indexing and for AI, right? I wonder if they will eventually say: ok, you can have me or not, if you don’t want to help train my AI you won’t get my searches either. That’s a tough deal but it is sort of self-consistent.

Bender|8 months ago

For my silly hobby sites I just return status 444 (close the connection) for anything that has a case-insensitive "bot" in the UA requesting anything other than robots.txt, humans.txt, favicon.ico, etc. This would also drop search engines, but I blackhole-route most of their CIDR blocks anyway. I'm probably the only one here who would do this.
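A hypothetical nginx sketch of what Bender describes (not their actual config; the location patterns are illustrative). Status 444 is nginx's non-standard code that closes the connection without sending any response:

```nginx
# Flag any User-Agent containing "bot", case-insensitively.
map $http_user_agent $is_bot {
    default  0;
    "~*bot"  1;
}

server {
    listen 80;

    # Flagged clients may still fetch these well-known files.
    location ~ ^/(robots\.txt|humans\.txt|favicon\.ico)$ {
        try_files $uri =404;
    }

    location / {
        if ($is_bot) {
            return 444;  # drop the connection, no HTTP response at all
        }
        try_files $uri $uri/ =404;
    }
}
```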

lxgr|8 months ago

That's at least a more reasonable default than what I've seen at least one newspaper do, which is to explicitly block both LLM scrapers and things like ChatGPT's search feature.

slenk|8 months ago

I thought I saw cloudflare insert noindex links?

swyx|8 months ago

what actually are the consequences of ignoring robots.txt (apart from DDOS)? have any of these cases ended up in court at all?

btown|8 months ago

The headline is somewhat misleading: sites using Cloudflare now have an opt-in option to quickly block all AI bots, but it won't be turned on by default for sites using Cloudflare.

The idea that Cloudflare could do the latter at the sole discretion of its leadership, though, is indicative of the level of power Cloudflare holds.

bitpush|8 months ago

It is now an adversarial relationship between AI bots and websites, and Cloudflare is merely reacting to it.

Would you say the same for ddos protection? Isn't that the same as well?

TechDebtDevin|8 months ago

They can't do anything other than bog down the internet. I haven't found a single CF-provided challenge I haven't been able to get past in under half a day.

This is simply just the first step in them implementing a marketplace and trying to get into LLM SEO. They don't care about your site or protecting it. They are gearing up to start taking a cut in the middle between scrapers and publishers. Why wouldn't I go DIRECTLY to the publisher and make a deal? So dumb, I hate CF so much.

The only thing Cloudflare knows how to do is MITM attacks.

DeusExMachina|7 months ago

I would expect these features to be opt-in. Even though I agree with it, I would be pretty upset if they just turned it on automatically on my website.

sct202|8 months ago

My data served by Cloudflare has increased to 100 GB/month, compared to under 20 GB about two years ago, and they're all fairly static hobby sites. Actual people traffic is down by about half in the same time frame, so I imagine a lot of this is probably about cost savings for Cloudflare, reducing their resource usage.

Apofis|8 months ago

Makes total sense, bandwidth on this scale is expensive.

alganet|8 months ago

> If A.I. companies freely use data from various websites without permission or payment, people will be discouraged from creating new digital content

I don't see a way out of this happening. AI fundamentally discourages other forms of digital interaction as it grows.

Its mechanism of growing is killing other kinds of digital content. It will eventually kill the web, which is, ironically, its main source of food.

spwa4|8 months ago

Yes: what everyone wants to do with AI (generate entertainment and interactions with humans, including economic ones) will need to happen, or AI will starve.

fennecfoxy|8 months ago

Additionally, ad blocker usage is apparently at 30%. So it's a redundant or more nuanced argument, really.

Meekro|8 months ago

I've heard lots of people on HN complaining about bot traffic bogging down their websites, and as a website operator myself I'm honestly puzzled. If you're already using Cloudflare, some basic cache configuration should guarantee that most bot traffic hits the cache and doesn't bog down your servers. And even if you don't want to do that, bandwidth and CPU are so cheap these days that it shouldn't make a difference. Why is everyone so upset?

noodle|8 months ago

As someone who had some outages due to AI traffic and is now using CloudFlare's tools:

Most of my site is cached in multiple different layers. But some things that I surface to the unauthenticated public can't be cached while still being functional. Hammering those endpoints has taken my app down.

Additionally, even though there are multiple layers, things that are expensive to generate can still slip through the cracks. My site has millions of public-facing pages, and a batch of misses that happen at the same time on heavier pages to regenerate can back up requests, which leads to errors, and errors don't result in caches successfully being filled. So the AI traffic keeps hitting those endpoints, they keep not getting cached and keep throwing errors. And it spirals from there.
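The miss-spiral described here is the classic cache-stampede problem. One common mitigation (a hedged sketch, not this commenter's actual stack) is per-key locking, so that concurrent misses wait for a single regeneration instead of all hitting the expensive backend:

```python
import threading

cache = {}        # rendered pages, keyed by URL or page id
locks = {}        # one lock per cold key
locks_guard = threading.Lock()

def get_page(key, render):
    """Return the cached page, regenerating a cold key at most once."""
    if key in cache:
        return cache[key]
    with locks_guard:
        lock = locks.setdefault(key, threading.Lock())
    with lock:
        if key not in cache:          # double-check after acquiring the lock
            cache[key] = render(key)  # expensive regeneration runs only once
    return cache[key]
```

With this shape, a burst of crawler hits on the same uncached page costs one render instead of one per request; errors during `render` still propagate, so failed regenerations don't poison the cache.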

Symbiote|8 months ago

That's a pretty big assumption.

The largest site I work on has 100,000s of pages, each in around 10 languages — that's already millions of pages.

It generally works fine. Yesterday it served just under 1000 RPS over the day.

AI crawlers have brought it down when a single crawler has added 100, 200 or more RPS distributed over a wide range of IPs — it's not so much the number of extra requests, though it's very disproportionate for one "user", but they can end up hitting an expensive endpoint excluded by robots.txt and protected by other rate-limiting measures, which didn't anticipate a DDoS.

conductr|8 months ago

The presumption that I'm already using Cloudflare is a start. Is this a requirement for maintaining a simple website now?

jtolmar|8 months ago

The stories I've heard have been mostly about scraper bots finding APIs like "get all posts in date range" and then hammering that with every combo of start/end date.

x0x0|8 months ago

It's not complex. I worked on a big site. We did not have the compute or i/o (most particularly db iops) to live generate the site. Massive crawls both generated cold pages / objects (cpu + iops) and yanked them into cache, dramatically worsening cache hit rates. This could easily take down the site.

Cache is expensive at scale. So permitting big or frequent crawls by stupid crawlers either require significant investments in cache or slow down and worsen the site for all users. For whom we, you know, built the site, not to provide training data for companies.

As others have mentioned, Google is significantly more competent than 99.9% of the others. They are very careful to not take your site down and provide, or used to provide, traffic via their search. So it was a trade, not a taking.

Not to mention I prefer not to do business with Cloudflare, because I don't like companies that don't publish quotas. If going over X means I need an enterprise account that starts at $10k/mo, I need to know the X. Cloudflare's business practice appears to be letting customers exceed that quota and then aggressively demanding they pay or be kicked off the service nearly immediately.

jauntywundrkind|8 months ago

I too am a bit confused / mystified at the strong reaction. But I do expect a lot of badly optimized sites that just want out.

I struggle to think of a web-related library that has spread faster than the Anubis checker. It's everywhere now! https://github.com/TecharoHQ/anubis

I'm surprised we don't see more efforts to rate limit. I assume many of these are distributed crawlers, but it feels like there have got to be pools of activity spinning up on a handful of IPs, and that they would be clearly time-correlated with each other. Maybe that's not true. But it feels like the web, more than anything else, needs some open source software to add a lot more 420 Enhance Your Calm responses. https://http.dev/420
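The per-IP rate limiting being suggested is often sketched as a token bucket. An illustrative Python version (the rates and the mapping of a refusal to HTTP 429/420 are assumptions, not anything Cloudflare or Anubis actually ships):

```python
import time

class TokenBucket:
    """Allow roughly `rate` requests/second, with bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # one bucket per client IP

def should_serve(ip):
    # A False here would map to an HTTP 429 (or the joke 420) response.
    bucket = buckets.setdefault(ip, TokenBucket(rate=5, capacity=10))
    return bucket.allow()
```

The capacity term is what distinguishes a human clicking around (short bursts, long gaps) from a crawler sustaining hundreds of requests per second from one pool of IPs.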

deepsiml|8 months ago

Not much into that kind of DevOps. What is a good basic caching in this instance?

postalcoder|8 months ago

  > When you enable this feature via a pre-configured managed rule, Cloudflare can detect and block verified AI bots that comply with robots.txt and respect crawl rates, and do not hide their behavior from your website. The rule has also been expanded to include more signatures of AI bots that do not follow the rules.
We already know companies like Perplexity are masking their traffic. I'm sure there's more than meets the eye, but taking this at face value, doesn't punishing respectful and transparent bots only incentivize obfuscation?

edit: This link[0], posted in a comment elsewhere, addresses this question. tldr, obfuscation doesn't work.

  > We leverage Cloudflare global signals to calculate our Bot Score, which for AI bots like the one above, reflects that we correctly identify and score them as a “likely bot.”

  > When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint. For every fingerprint we see, we use Cloudflare’s network, which sees over 57 million requests per second on average, to understand how much we should trust this fingerprint. To power our models, we compute global aggregates across many signals. Based on these signals, our models were able to appropriately flag traffic from evasive AI bots, like the example mentioned above, as bots.

[0] https://blog.cloudflare.com/declaring-your-aindependence-blo...

jerf|8 months ago

"doesn't punishing respectful and transparent bots only incentivize obfuscation?"

Sure, but we crossed that bridge over 20 years ago. It's not creating an arms race where there wasn't already one.

Which is my generic response to everyone bringing similar ideas up. "But the bots could just...", yeah, they've been doing it for 20+ years and people have been fighting it for just as long. Not a new problem, not a new set of solutions, no prospect of the arms race ending any time soon, none of this is new.

hombre_fatal|8 months ago

Next line:

> The rule has also been expanded to include more signatures of AI bots that do not follow the rules.

The Block AI Bots rule on the Super Bot Fight Mode page does filter out most bot traffic. I was getting 10x the traffic from bots than I was from users.

It definitely doesn't rely on robots.txt or user agent. I had to write a page rule bypass just to let my own tooling work on my website after enabling it.

fluidcruft|8 months ago

Cloudflare already knows how to make the web hell for people they don't like.

I read the robots.txt entries as those AI bots that will not be marked as "malicious" and that will have the opportunity to be allowed by websites. The rest will be given the Cloudflare special.

colechristensen|8 months ago

>doesn't punishing respectful and transparent bots only incentivize obfuscation?

They're cloudflare and it's not like it's particularly easy to hide a bot that is scraping large chunks of the Internet from them. On top of the fact that they can fingerprint any of your sneaky usage, large companies have to work with them so I can only assume there are channels of communication where cloudflare can have a little talk with you about your bad behavior. I don't know how often lawyers are involved but I would expect them to be.

Sol-|8 months ago

Do the major AI companies actually honor robots.txt? Even if some of their publicly known crawlers might do it, surely they have surreptitious campaigns where they do some hidden crawling, just like how they illegally pirate books, images and user data to train on.

chasd00|8 months ago

My thought too, honoring robots.txt is just a convention. There's no requirement to follow robots.txt, or at least certainly no technical requirement. I don't think there's any automatic legal requirement either.

Maybe sites could add "you must honor policies set in robots.txt" to something like a terms of service but I have no idea if that would have enough teeth for a crawler to give up.

px43|8 months ago

There's a lack of clarity, but it seems likely to me that a majority of this traffic is actually people asking questions to the AI, and the AI going out and researching for answers. When the AI tools are being used like a web browser to do research, should they still be adhering to robots.txt, or is that only intended for search indexing?

deepsun|8 months ago

Hard to tell, because minor crawlers mimic major companies to avoid getting banned.

mschuster91|8 months ago

Cloudflare, for all I hate their role as a gatekeeper these days, actually has the leverage to force the AI companies to bend.

blakesterz|8 months ago

The list of bots is pretty short right now:

https://developers.cloudflare.com/bots/concepts/bot/#ai-bots

JimDabell|8 months ago

> AI bots

> You can opt into a managed rule that will block bots that we categorize as artificial intelligence (AI) crawlers (“AI Bots”) from visiting your website. Customers may choose to do this to prevent AI-related usage of their content, such as training large language models (LLM).

> CCBot (Common Crawl)

Common Crawl is not an AI bot:

https://commoncrawl.org

hennell|8 months ago

Cloudflare sees a lot of the web traffic. I assume these are the biggest bots they're seeing right now, and any new contenders would be added as they find them. Probably impossible to really block everything, but they've got the web-coverage to detect more than most.

ZiiS|8 months ago

Enough to more than halve the traffic to most sites, if the blocks hold.

zargath|8 months ago

Sounds very basic, sadly.

Anybody know why these web crawling/bot standards are not evolving? I believe robots.txt was invented in 1994 (thx ChatGPT). People have tried with sitemaps, RSS, and IndexNow, but it's like huge-$$ organizations are depending on HelloWorld.bas tech to control their entire platform.

I want to spin up endpoints/mcp/etc. and let intelligent bots communicate with my services. Let them ask for access, ask for content, pay for content, etc. I want to offer solutions for bots to consume my content, instead of having to choose between full or no access.

I am all for AI, but please try to do better. Right now the internet is about to be eaten up by stupid bot farms and served into chat screens. They don't want to refer back to their source, and when they do it's with insane error rates.

stereolambda|8 months ago

> I believe robots.txt was invented in 1994(thx chatgpt).

Not to pick on you, but I find it quicker to open new tab and do "!w robots.txt" (for search engines supporting the bang notation) or "wiki robots.txt"<click> (for Google I guess). The answer is right there, no need to explain to LLM what I want or verify [1].

[1] Ok, Wikipedia can be wrong, but at least it is a commonly accessible source of wrong I can point people to if they call me out. Plus my predictive model of Wikipedia wrongness gives me pretty low likelihood for something like this, while for ChatGPT it is more random.

reaperducer|8 months ago

robots.txt was invented in 1994(thx chatgpt)

Thought of and discussed as a possibility in 1994.

Proposed as a standard in 2019.

Adopted as a standard in 2022.

Thanks, IETF.

TechDebtDevin|8 months ago

This comment seems like it comes from a Cloudflare employee.

This is clearly the first step in CF building out a marketplace where they will try (and fail) to be the middleman in a useless market between crawlers and publishers.

j45|8 months ago

This is interesting. I'm a fan of Cloudflare, and appreciate all the free tiers they put out there for many.

Today I see this article about Cloudflare blocking scrapers. There are useful and legitimate cases where I ask Claude to go research something for me. I'm not sure if Cloudflare discerns legitimate search/research traffic from an AI client vs. scraping. The sites blocked by default will include content from small creators (unless they're on a major platform with a deal), while the big guys who have something to sell, like an Amazon, will likely be able to facilitate and afford a deal to show up more in the results.

A few days ago, Cloudflare also announced that it is looking to charge AI companies to scrape content, which is cached copies of other people's content. I'm guessing it will involve paying the owners of the data at some point as well. Being able to exclude content from this purpose (selling/licensing it, or scraping) would be a useful lever.

Putting those two stories together:

- Is this a new form of showing up in AISEO (search-everywhere optimization), showing up in an AI's corpus or its ability to search the web, or paying licensing fees instead of advertising fees? These could be new business models, which are interesting, but I'm trying to see where these steps may vector ahead, and what to think about today.

- With training data being the most valuable thing for AI companies, and this is another avenue for revenue for Cloudflare, this can look like a solution which helps with content licensing as a service.

I'd like to see where abstracting this out further ends up going.

Maybe I'm missing something, is anyone else seeing it this way, or another way that's illuminating to them? Is anyone thinking about rolling their own service for whatever parts of Cloudflare they're using?

ec109685|8 months ago

It seems like search access is more valuable these days since reasoning requires realtime access to site data.

dirkc|8 months ago

I assume they will "protect original content online" by blocking LLM clients from ingesting data as context?

I'm not optimistic that you can effectively block your original content from ending up in training sets by simply blocking the bots. For now I just assume that anything I put online will end up being used to train some LLM

dougb5|8 months ago

> Cloudflare can detect and block verified AI bots that comply with robots.txt and respect crawl rates, and do not hide their behavior from your website

It's the bots that do hide their behavior -- via residential proxy services -- that are causing most of the burden, for my site anyway. Not these large commercial AI vendors.

maximilianburke|8 months ago

Every evolution of the web, from Web 2 giving us walled gardens to Web 3 giving us, well, nothing, to what we have now is taking us further from a network of communities and personal repositories of knowledge.

Sure, fidelity has gotten better but so much has been lost.

hmate9|8 months ago

Isn’t this only useful for blogs, news sites, or forums? Why would I want an AI to know less about my product? I want it to understand it, talk about it, and ideally recommend it. Should be default off.

dawnerd|8 months ago

I've been using this for a while on my Mastodon server, and after a few tweaks to make sure it wasn't blocking legit traffic, it's been working really great. Between Microsoft and Meta, they were hitting my services more than all other traffic combined, which says a lot if you know how noisy Mastodon can be. Server load went down dramatically.

It also completely put a stop to perplexity as far as I can tell.

And the robots file meant nothing; they'd still request it hundreds of thousands of times instead of caching it. Every request, they'd hit it first and then hit their intended URL.

danielspace23|8 months ago

Have you considered Anubis? I know it's harder to install, but personally, I think the point of Mastodon is trying to avoid centralization where possible, and Cloudflare is one of the corporations keeping the internet centralized.

TechDebtDevin|8 months ago

This does nothing, dude. Literally nothing. OpenAI or whoever are just going to hire people like me who don't get caught. Stop ruining the experience of users and allowing CF to fill the internet with more bloated JavaScript challenge pages and privacy-invading fingerprinting. Stop making CF the police of the internet. We're literally handing the internet to this company on a silver platter to do MITM attacks on our privacy and god knows what else. Fucking wild.

gazpacho|8 months ago

From an open source project's perspective, we'd want to disable this on our docs sites. We actually want those to be very discoverable by LLMs, during training or online usage.

account42|8 months ago

Yay, looking forward to more CAPTCHAs as a regular user.

nemild|8 months ago

Think this is the future, as the AI Web takes over the human web.

At Coinbase, we've been building tools to make the blockchain the ideal payment rails for use cases like this with our x402 protocol:

https://www.x402.org/

Ping if you're interested in joining our open source community.

sneak|8 months ago

This idea that you can publish data for people to download and read but not for people to download and store, or print, or think about, or train on is a doomed one.

If you don’t want people reading your data, don’t put it on the web.

The concept that copyright extends to “human eyeballs only” is a silly one.

t1001|8 months ago

With the problem being bots hammering the site en masse, it feels like the better analog is "allowing free replicator use without having someone ruin the fun by requesting ten tons of food be produced in their quarters every minute".

cratermoon|8 months ago

I'm still not sure this is going to be very effective, as so many of the worst offenders don't identify themselves as bots, and often change their user agent. Has Cloudflare said anything about identifying the bad actors?

chasd00|8 months ago

I've mentioned this in a couple of replies, so maybe I'm wrong, but it's up to the client to obey robots.txt. Why would they not just ignore it? Unless there's some legal consequence for not complying with robots.txt, why even follow it? There's no technical enforcement of the policies in the file; it's up to the client to honor them.

NullCascade|8 months ago

How would you do the opposite of this? Optimize your content to be more likely crawled by AI bots? I know traditional Google-focused SEO is not enough because these AI bots often use other web search/indexing APIs.

TechDebtDevin|8 months ago

There are script tags you can put in your site from LLM SEO companies if you want your content to be indexed by Perplexity or OpenAI. They're kind of too new for me to recommend.

ssijak|8 months ago

I don't want this by default. I want my website to end up in AI chatbots, for SEO.

YPPH|8 months ago

This is great. But my concerns about Cloudflare's power remain. Today it's blocking AI crawlers, tomorrow will it be blocking all browsers that fail hardware-attestation checks?

nsoonhui|8 months ago

But how is this effective against Gemini and OpenAI, which can simply rely on the Google and Bing crawlers, respectively, to crawl the content?

zackmorris|8 months ago

As usual, this is the wrong approach.

The open web is akin to the commons, public domain and public land. So this is like putting a spy cam on a freeway billboard, detecting autonomous vehicles, and shining a spotlight at their camera to block them from seeing the ad. To what end?

Eventually these questions will need to be decided in court:

1) Do netizens have the right to anonymity? If not, then we'll have to disclose whether we're humans or artificial beings. Spying on us and blocking us on a whim because our behavior doesn't match social norms will amount to an invasion of privacy (eventually devolving into papers please).

2) Is blocking access to certain users discrimination? If not, then a state-sanctioned market of civil rights abuse will grow around toll roads (think whites-only drinking fountains).

3) Is downloading copyrighted material for learning purposes, by AI or humans, the same as pirating it and selling it for profit? If so, then we will repeat the everyone-is-a-criminal torrenting era of the 2000s and 2010s, when "making available" was treated the same as profiting from piracy, and relive the abuses by HBO, the RIAA/MPAA, and other organizations that shut off users' internet connections through threats of legal action, like suing under the DMCA (which should not have been made law in the first place).

I'm sure there are more. If we want to live in a free society, then we must be resolute in our opposition of draconian censorship practices by private industry. Gatekeeping by large, monopolistic companies like Cloudflare simply cannot be tolerated.

I hope that everyone who reads this finds alternatives to Cloudflare and tells their friends. If they insist on pursuing this attack on our civil rights for profit, then I hope we build a countermovement by organizing with the EFF and our elected officials to eventually bring Cloudflare up on antitrust charges.

Cloudflare has shown that they lack the judgement to know better. Which casts doubt on their technical merits and overall vision for how the internet operates. By pursuing this course of action, they have lost face like Google did when it removed its "don't be evil" slogan from its code of conduct so it could implement censorship and operate in China (among other ensh@ttification-related goals).

Edit: just wanted to add that I realize this may be an opt-in feature. But that's not the point - what I'm saying is that this starts a bad precedent and an unnecessary arms race, when we should be questioning whether spidering and training AI on copyrighted materials are threats in the first place.

kristoff200512|8 months ago

AI bots will endlessly crawl my website, quickly exhausting the egress quota of my Supabase free plan, but Cloudflare can stop all of this.

aunty_helen|8 months ago

I saw yesterday that they were going to allow websites to charge per scrape.

Looks like cloudflare just invented the new App Store.

Roark66|8 months ago

This is a bit silly. Slowing them down, yes, but blocking? People who *really* want that content will find a way, and this will instead hit everyone else, who will have to solve silly riddles before following every link, or run crypto mining for them, before being shown the content.

I recently went to a big local auction site on which I buy frequently, and I got one of those "we detected unusual traffic from your network" messages, and "prove you're human". Which was followed by "you completed the captcha in 0.4s; your IP is banned". Really? Am I supposed to slow down my browsing now? I tried a different browser, a different OS, logging on, clearing cookies, etc. Same result when I tried a search. It took 4 hours after contacting their customer service to unblock it. And the explanation was "you're clicking too fast".

At some point it just becomes a farce and the hassle is not worth the content. Also, while my story doesn't involve any bots perhaps a time will come when local LLMs will be good enough that I'll be able to tell one "reorder my cat food" and it will go and do it. Why are they so determined to "stop it" (spoiler, they can't).

For anyone who says LLMs are already capable of ordering cat food, I say not so fast. First, the cat food has to be on sale/offer (sometimes combined with extras). Second, it is supposed to be healthy (no grains), and third, the taste needs to be to my cat's liking. So far I'm not going to trust an LLM with this.

picohernandez|8 months ago

I was chatting with my sister last weekend. As a hobby, she creates and sells wedding invitation and other designs at an online marketplace site called Zazzle. She was telling me all about how that site implemented some automatic bot detection and it sounded like it was a total disaster. Real content creators were getting wrongly flagged as bots and then getting blocked from using the site just for using the most fundamental site functionality, and to make it worse, then it was apparently impossible for them to get past the captcha challenge or whatever it showed next. She forwarded me a link to some support forum discussion about it and it was mindboggling the troubles that some of the content creators there had to go through:

https://community.zazzle.com/t5/technical-issues/bot-test-wo...

My sister said that her sales figures are way down compared to what they used to be and she didn't know if this bot flagger was disrupting real paying customers too. She said it had flagged her a couple of times, although she was luckily able to get past the bot challenge. She has pretty much given up on making and uploading new designs because of what was happening to other content creators there. She's now scared to use the site because she doesn't want to get wrongly locked out of her account.

e38383|8 months ago

Why is every second article about this claiming that it's automatic? It needs to be turned on; at least, there was no mention of "automatic" in the original blog post.

I really hope that we can continue training AI the same way we train humans – basically for free.

userbinator|8 months ago

s/A.I. Data Scrapers/non-sanctioned browsers running on non-sanctioned platforms/

They've been trying to do this for years. Now "AI" gives a convenient excuse.

lucasyvas|8 months ago

I fail to see how this won’t just result in UA string or other obfuscation.

kube-system|8 months ago

Cloudflare’s filtering is already way more sophisticated than just looking at UA string or other voluntary reporting. They’re almost certainly using fingerprinting and behavioral analytics.
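The reason voluntary reporting is worthless on its own: the User-Agent header is whatever the client claims it is. A minimal sketch (URL illustrative; the request is only constructed, never sent):

```python
import urllib.request

# Any client can claim to be any browser; the User-Agent header is
# purely self-reported, which is why it can't be trusted for blocking.
req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
print(req.get_header("User-agent"))  # the spoofed string, verbatim
```

Hence the shift to signals the client can't trivially forge: TLS/HTTP fingerprints, IP reputation, and behavioral patterns across many requests.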

chasd00|8 months ago

A crawler doesn't have to change anything; it can just ignore the robots.txt file. It's up to the client to read robots.txt and follow its directives, but there's no technical reason why the client cannot simply ignore everything in the file, period.

deadbabe|8 months ago

No one else can really do this except Cloudflare.

jjangkke|8 months ago

So, TL;DR: it adjusts your robots.txt and relies on Cloudflare to catch bot behavior, and it doesn't actually do anything about sophisticated residential-proxy setups or the common bypass methods that work on Cloudflare Turnstile. Do I have this correct?

This just pushes AI agents "underground": they'll adopt the behavior of a full-blown, stealth-focused scraper, which makes them harder to detect.

Spivak|8 months ago

Poor ChatGPT-User, nobody understands you. Blocking a real user because of the, admittedly odd, browser they're using misses the point.

bgwalter|8 months ago

The destruction of the Web and IP theft needs to be addressed legally. The opinion of a single judge notwithstanding, "AI" scraping already violates copyright. This needs to be made explicit in law and scrapers must get the same treatment as Western governments gave to thousands of individuals who were bankrupted or jailed for copyright infringement.

We are in the Napster phase of Web content stealing.

rorylaitila|8 months ago

Unfortunately I think this is pissing into the wind. Information websites are all but dead. AI contains all published human information. If you have positioned your website as an answer to a question, it won't survive that way.

"Information" is dead but content is not. Stories, empathy, community, connection, products, services. Content of this variety is exploding.

The big challenge is discoverability. Before, information arbitrage was one pathway to get your content discovered, or to skim a profit. This is over with AI. New means of discovery are necessary, largely network and community based. AI will throw you a few bones, but it will be 10% of what SEO did.

fennecfoxy|8 months ago

>AI contains all published human information

No, it most certainly does not. It was certainly trained on large swathes of human knowledge/interactions.

A model that consists of a perfect representation/compression of all this info is a zip file, not a model file.

ozgrakkurt|8 months ago

You are assuming LLMs will replace search engines. Why is this the case?

To me it seems like there has to be so much optimization for this to happen that it is not likely. LLM answers are slow and unreliable. Even using something like Perplexity doesn't give much value over using a regular search engine, in my experience.