jaybna | 1 year ago

25% of the top 1000 websites are blocking OpenAI from crawling: https://originality.ai/ai-bot-blocking

I am betting that hundreds of thousands, rising to millions, of little sites will start blocking/gating this year. AI companies might license from big sources (you can see the blocking percentage went down), but they will be missing the long tail, where a lot of great novel training data lives. And then the big sites will realize the money they got was trivial, as agents start to crush their businesses.

Bill Gross correctly calls this phase of AI "shoplifting." I call it the Napster-of-Everything (because I am old). I am also betting that the courts won't buy the "fair use" interpretation of scraping, given the revenues AI companies generate. That means a potential stalling of new models until some mechanism is worked out to pay knowledge creators. (And maybe nothing we know of now will work for media: https://om.co/2024/12/21/dark-musings-on-media-ai/)

Oh, and yes, I love generative AI and would be willing to pay 100x to access it...

P.S. Hope is not a strategy, but I'm hoping something like ProRata.ai and/or TollBits can help make this self-sustaining for everyone in the chain

jpablo|1 year ago

They aren't blocking anything. They are just asking nicely not to be crawled. Given that AI companies haven't cared a single bit about ripping off other people's data, I don't see why they would care now.
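For context, the "blocking" being measured is usually just a robots.txt entry like the one below (GPTBot is OpenAI's published crawler user agent). Nothing technically enforces it; compliant crawlers honor it voluntarily, which is the whole point:

```
User-agent: GPTBot
Disallow: /
```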

wing-_-nuts|1 year ago

A number of sites have started outright blocking any traffic that looks remotely suspicious. This has made browsing with a vpn a bit of a pain.

EVa5I7bHFq9mnYK|1 year ago

In their attempt to block OpenAI, they block me. Many sites that were accessible just two years ago now require a login/captcha/rectal exam just to read the content.

kjkjadksj|1 year ago

They block plenty, and they do it crudely. I get suspicious-traffic bans from Reddit all the time. It's trivial to route around by switching user agents, however, which goes to show that any crawling-bot writer worth their salt already routes around Reddit's and most other sites' BS by now. I'm just the one getting the occasional headache, because I use Firefox and block ads and site tracking, I guess.

njovin|1 year ago

Wouldn't it be somewhat trivial to set up honeypots?

jaybna|1 year ago

Yeah, probably right. If you want a great rabbit hole, look up "Common Crawl" and see how a great academic project was absolutely hijacked for pennies on the dollar to grab training data - the foundation for every LLM out there right now.

cshores|1 year ago

It ultimately doesn't matter, because a fairly current snapshot of all of the world's information is already housed in their data lakes. The next stage for AI training is to generate synthetic data, either by other AI or by simulations, to train on further, since human-generated content can only go so far.

pphysch|1 year ago

How is synthetic data supposed to work? Broadly speaking, ML is about extracting signal from noisy data and learning the subtle patterns.

If there is untapped signal in existing datasets, then learning processes should be improved. It does not follow that there should be a separate economic step where someone produces "synthetic data" from the real data, and then we treat the fake data as real data. From a scientific perspective, that last part sounds really bad.

Creating derivative data from real data sounds, for the purpose of machine learning, like a scam by the data broker industry. What is the theory behind it, if not fleecing unsophisticated "AI" companies? Is it just myopia, Goodhart's Law applied to LLM scaling curves? Some MBA took the "data is the new oil" comment a little too seriously and inferred that data is as fungible as refined petroleum?

aftbit|1 year ago

IMO this is an underappreciated advantage for Google. Nobody wants to block the GoogleBot, so they can continue to scrape for AI data long after AI-specific companies get blocked.

Gemini is currently embarrassingly bad given it came from the shop that:

1. invented the Transformer architecture

2. has (one of) the largest compute clusters on the planet

3. can scrape every website thanks to a long-standing whitelist

Art9681|1 year ago

The new Gemini Experimental models are the best general-purpose models out right now. I have been comparing them with o1 Pro, and I prefer Gemini Experimental 1206 due to its context, speed, and accuracy. Google came out with a lot of new stuff last week if you haven't been following. They seem to have the best models across the board, including image and video.

kibwen|1 year ago

> Nobody wants to block the GoogleBot

This only remains true as long as website operators think that Google Search is useful as a driver of traffic. In tech circles Google Search is already considered a flaming dumpster heap, so let's take bets on when that sentiment percolates out into the mainstream.

jameslk|1 year ago

For OpenAI, they could lean on their relationship with Microsoft for Bing crawler access

Websites won’t be blocking the search engine crawlers until they stop sending back traffic, even if they’re sending back less and less traffic

tartuffe78|1 year ago

Wonder if OpenAI is considering building a search engine for this reason... Imagine if we get a functional search engine again from some company just trying to feed its next model generation...

thiagowfx|1 year ago

There are two crawlers to distinguish: "Googlebot" (search indexing) and "Google-Extended" (the token that controls AI training use).
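The distinction shows up directly in robots.txt: a site can stay in Google Search while opting out of AI training. A sketch using the user-agent tokens Google documents:

```
# Stay indexed in Search:
User-agent: Googlebot
Allow: /

# Opt out of Gemini training and grounding:
User-agent: Google-Extended
Disallow: /
```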

heavyset_go|1 year ago

> I am betting hundreds of thousands, rising to millions more little sites, will start blocking/gating this year. AI companies might license from big sources (you can see the blocking percentage went down), but they will be missing the long tail, where a lot of great novel training data lives.

This is where I'm at. I write content when I run into problems that I don't see solved anywhere else, so my sites host novel content and niche solutions to problems that don't exist elsewhere, and if they do, they are cited as sources in other publications, or are outright plagiarized.

Right now, LLMs can't answer questions that my content addresses.

If it ever gets to the point where LLMs are sufficiently trained on my data, I'm done writing and publishing content online for good.

zifpanachr23|1 year ago

I don't think it is at all selfish to want to get some credit for going to the trouble of publishing novel content and not have it all stolen via an AI scraping your site. I'm totally on your side and I think people that don't see this as a problem are massively out of touch.

I work in a pretty niche field and feel the same way. I don't mind sharing my writing with individuals (even if they don't directly cite me) because then they see my name and know who came up with it, so I still get some credit. You could call this "clout farming" or something derogatory, but this is how a lot of experts genuinely get work...by being known as "the <something> guy who gave us that great tip on a blog once".

With AI snooping around, I feel like becoming one of those old mathematicians that would hold back publicizing new results to keep them all for themselves. That doesn't seem selfish to me, humans have a right to protect ourselves and survive and maintain the value of our expertise when OpenAI isn't offering any money.

I honestly think we should just be done with writing content online now, before it's too late. I've thought a lot about it lately and I'm leaning more towards that option.

glenstein|1 year ago

>Bill Gross correctly calls this phase of AI shoplifting. I call it the Napster-of-Everything (because I am old). I am also betting that the courts won't buy the "fair use" interpretation of scraping, given the revenues AI companies generate. That means a potential stalling of new models until some mechanism is worked out to pay knowledge creators.

To your point, I have wondered whatever became of that massive initiative from Google to scan books, and whether that might be looked at as a potential training source, given that Google has run into legal limitations on other forms of usage.

ben_w|1 year ago

> To your point, I have wondered whatever became of that massive initiative from Google to scan books, and whether that might be looked at as a potential training source, given that Google has run into legal limitations on other forms of usage.

Still around, doing fine: https://en.wikipedia.org/wiki/Google_Books and https://books.google.com/intl/en/googlebooks/about/index.htm...

Given the timing, I suspect it was started as simple indexing, in keeping with the mission statement "Organize the world's information and make it universally accessible and useful".

There was also reCAPTCHA v1 (books) and v2 (street view), each of which improved OCR AI until state-of-the-art AI was able to defeat them in their role as CAPTCHA systems.

pncnmnp|1 year ago

> I have wondered whatever became of that massive initiative from Google to scan books, and whether that might be looked at as a potential training source, given that Google has run into legal limitations on other forms of usage.

A few months ago, there was an interesting submission on HN about this - The Tragedy of Google Books (2017) (https://news.ycombinator.com/item?id=41917016).

Kostchei|1 year ago

Using the real world (as in vision, 3D orientation, and physical sensors) and building training regimes that augment language models to be multidimensional and check that perception: that is the next step.

And there is no shortage of data and experience in the actual world, as opposed to just the text internet. Can the current AI companies pivot to that? Or do you need to be worldlabs, or v2 of worldlabs?

shanusmagnus|1 year ago

Ironically, if it plays out this way, it will be the biggest boon to actual AGI development there could be; intelligence via text tokenization will otherwise be a limiting factor, imo.

Tossrock|1 year ago

Some can. Google owns Waymo and runs Streetview, they're collecting massive amounts of spatial data all the time. It would be harder for the MS/OpenAI centaur.

code51|1 year ago

Given the current state of the legal system, a real challenge can only happen around 10 years from now. By then, AI players will have gathered immense power over the law.

lxgr|1 year ago

If you're willing to believe the narrative that there's some sort of existential "race to AGI" going on at the moment (I'm ambivalent myself, but my opinion doesn't really matter; if enough people believe it to be true, it becomes true), I don't think that'll realistically stop anyone.

Not sure how exactly the Library of Congress is structured, but the equivalent in several countries can request a free copy of everything published.

Extending that to the web (if it's not already legally, if not practically, the case) and then allowing US companies to crawl the resulting dataset as a matter of national security, seems like a step I could see within the next few years.

zifpanachr23|1 year ago

I agree with you about the fair use argument. Seems like it doesn't meet a lot of the criteria for fair use based on my lay understanding of how those factors are generally applied.

See https://fairuse.stanford.edu/overview/fair-use/four-factors/

I think in particular it fails the "Amount and substantiality of the portion taken" and "Effect of the use on the potential market" extremely egregiously.

cedws|1 year ago

Cloudflare has a toggle for blocking AI scrapers. I don’t think it’s default, but it’s there.

kyledrake|1 year ago

This just feels like mystery meat to me. My guess is that a lot of legitimate users and VPNs are being blocked from viewing sites, which numerous users in this discussion have confirmed.

This seems like a very bad way to approach this, and ironically their model quite possibly also uses some sort of machine learning to work.

A few web hosting platforms are using the Cloudflare blocker, and I think it's incredibly unethical. They're inevitably blocking millions of legitimate users from viewing content on other people's sites and then pretending it's "anti AI". To paraphrase Theo de Raadt: they saw something on the shelf, and it has all sorts of pretty colours, and they bought it.

input_sh|1 year ago

It's not much smarter than just adding user agents to robots.txt manually.

jaybna|1 year ago

They might get into the micro-licensing game too. More power to them.

jasondigitized|1 year ago

The amount of content coming off of YouTube every minute puts Google in a very enviable position.

vidarh|1 year ago

All the big players are pouring a fortune into manually curated and created training data.

As it stands, OpenAI has a market cap large enough to buy a major international media conglomerate or two. They'll get data no matter how blocked they get.

Workaccount2|1 year ago

Doing basic copyright analysis on model outputs is all that is needed: check whether the output contains copyrighted material, and block it if it does.

Transformers aren't zettabyte-sized archives with a smart search algorithm, running around the web stuffing everything they can into datacenter-sized storage. They are typically a few dozen GB in size, if that. They don't copy data; they move vectors in a high-dimensional space based on data.

Sometimes (note: sometimes) they can recreate copyrighted work, never perfectly, but close enough to raise alarm and in a way that a court would rule a violation of copyright. Thankfully, though, we have a simple fix for this, developed over 30 years of people sharing content on the internet: automatic copyright filters.
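A minimal sketch of what such an output-side filter could look like, using verbatim word n-gram overlap against a rights-holder corpus. Function names and the threshold are illustrative, not any deployed system:

```python
# Hypothetical output-side copyright filter: flag generated text that shares
# long verbatim word n-grams with a protected reference corpus.

def ngrams(text: str, n: int = 8) -> set:
    """All n-word shingles in the text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(corpus: list[str], n: int = 8) -> set:
    """Union of shingles across all protected documents."""
    index = set()
    for doc in corpus:
        index |= ngrams(doc, n)
    return index

def looks_infringing(output: str, index: set, n: int = 8,
                     threshold: float = 0.2) -> bool:
    """Block if too many of the output's shingles appear verbatim in the corpus."""
    grams = ngrams(output, n)
    if not grams:
        return False
    return len(grams & index) / len(grams) >= threshold
```

Production systems (Content ID being the obvious precedent) are far more robust to paraphrase and near-duplicates, but the shape is the same: match outputs against a rights-holder index before serving them.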

parineum|1 year ago

It's not even close to that simple. Nobody is really questioning whether the model contains the copyrighted information; we know that to be true in enough cases to bankrupt OpenAI. The question is what analogy the courts should use as a basis to determine whether it's infringement.

"It read many works but can't duplicate them exactly" sounds a lot like what I've done, to be honest. I can give you a few memorable lines from a few songs, but can only really come close to reciting my favorites completely. The LLMs are similar, but their favorites are the favorites of the training data. A line from a pop song quoted a billion times is likely reproducible; the lyrics to the next track on the album, not so much.

IMO, any infringement that might have happened would be in acquiring the data in the first place, but copyright law cares more about illegal reproduction than illegal acquisition.

EricMausler|1 year ago

No comment on whether output analysis is all that is needed, though it makes sense to me. Just wanted to note that using file-size differences as an argument may simply imply that transformers are a form of (either very lossy or very efficient) compression.
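To put rough numbers on that intuition (all figures below are illustrative assumptions, not measurements of any specific model):

```python
# Back-of-the-envelope: how much smaller is a model than its training data?
params = 70e9            # assume a 70B-parameter model
bytes_per_param = 2      # fp16 weights
model_bytes = params * bytes_per_param        # ~140 GB of weights

tokens = 10e12           # assume ~10T training tokens
bytes_per_token = 4      # rough rule of thumb: ~4 bytes of text per token
data_bytes = tokens * bytes_per_token         # ~40 TB of text

ratio = data_bytes / model_bytes
print(f"model: {model_bytes/1e9:.0f} GB, data: {data_bytes/1e12:.0f} TB, "
      f"ratio ~{ratio:.0f}:1")
```

At a few hundred to one, the weights plainly can't store the corpus verbatim, which is exactly the "very lossy compression" framing.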

jaybna|1 year ago

So scraped copyrighted content is not needed for training, then? Guess I missed AGI suddenly appearing and reasoning things out all by itself.

cma|1 year ago

People upload lots from those sites to chatgpt asking to summarize.

devsda|1 year ago

That's still manual and minuscule compared to the amount they can gather by scraping.

If blocking really becomes a problem, they can take a page out of Google's playbook[1] and develop a browser extension that scrapes page content and, in exchange, offers some free ChatGPT credits or a summarizer-type tool. There won't be a shortage of users.

1. https://en.wikipedia.org/wiki/Google_Toolbar