(no title)
jaybna|1 year ago
I am betting that hundreds of thousands of little sites, rising to millions, will start blocking/gating this year. AI companies might license from big sources (you can see the blocking percentage went down), but they will be missing the long tail, where a lot of great novel training data lives. And then the big sites will realize the money they got was trivial as agents start to crush their businesses.
Bill Gross correctly calls this phase of AI shoplifting. I call it the Napster-of-Everything (because I am old). I am also betting that the courts won't buy the "fair use" interpretation of scraping, given the revenues AI companies generate. That means a potential stalling of new models until some mechanism is worked out to pay knowledge creators. (And maybe nothing we know of now will work for media: https://om.co/2024/12/21/dark-musings-on-media-ai/)
Oh, and yes, I love generative AI and would be willing to pay 100x to access it...
P.S. Hope is not a strategy, but I am hoping something like ProRata.ai and/or TollBits can help make this self-sustaining for everyone in the chain.
jpablo|1 year ago
wing-_-nuts|1 year ago
EVa5I7bHFq9mnYK|1 year ago
kjkjadksj|1 year ago
njovin|1 year ago
jaybna|1 year ago
cshores|1 year ago
pphysch|1 year ago
If there is untapped signal in existing datasets, then learning processes should be improved. It does not follow that there should be a separate economic step where someone produces "synthetic data" from the real data, and then we treat the fake data as real data. From a scientific perspective, that last part sounds really bad.
Creating derivative data from real data sounds, for the purpose of machine learning, like a scam by the data broker industry. What is the theory behind it, if not fleecing unsophisticated "AI" companies? Is it just myopia, Goodhart's Law applied to LLM scaling curves? Some MBA took the "data is the new oil" comment a little too seriously and inferred that data is as fungible as refined petroleum?
jaybna|1 year ago
aftbit|1 year ago
Gemini is currently embarrassingly bad given it came from the shop that:
1. invented the Transformer architecture
2. has (one of) the largest compute clusters on the planet
3. can scrape every website thanks to a long-standing whitelist
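The "whitelist" here is, in practice, largely a robots.txt convention: sites leave Googlebot in to keep their search traffic and (try to) keep AI-training crawlers out. A minimal sketch using crawler tokens these operators publicly document (GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended for Google's AI-training opt-out) — and note that compliance is entirely voluntary on the crawler's side:

```
# Allow Google's search crawler
User-agent: Googlebot
Disallow:

# Block known AI-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Because Google-Extended is a separate token from Googlebot, a site can opt out of AI training without giving up search indexing — which is exactly the asymmetry being complained about.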
Art9681|1 year ago
kibwen|1 year ago
This only remains true as long as website operators think that Google Search is useful as a driver of traffic. In tech circles Google Search is already considered a flaming dumpster heap, so let's take bets on when that sentiment percolates out into the mainstream.
jameslk|1 year ago
Websites won’t be blocking the search engine crawlers until those crawlers stop sending back traffic, even if they’re sending back less and less of it.
tartuffe78|1 year ago
thiagowfx|1 year ago
heavyset_go|1 year ago
This is where I'm at. I write content when I run into problems that I don't see solved anywhere else, so my sites host novel content and niche solutions to problems that don't exist elsewhere, and if they do, they are cited as sources in other publications, or are outright plagiarized.
Right now, LLMs can't answer questions that my content addresses.
If it ever gets to the point where LLMs are sufficiently trained on my data, I'm done writing and publishing content online for good.
zifpanachr23|1 year ago
I work in a pretty niche field and feel the same way. I don't mind sharing my writing with individuals (even if they don't directly cite me) because then they see my name and know who came up with it, so I still get some credit. You could call this "clout farming" or something derogatory, but this is how a lot of experts genuinely get work...by being known as "the <something> guy who gave us that great tip on a blog once".
With AI snooping around, I feel like becoming one of those old mathematicians who would hold back publishing new results to keep them for themselves. That doesn't seem selfish to me; humans have a right to protect ourselves, survive, and maintain the value of our expertise when OpenAI isn't offering any money.
I honestly think we should just be done with writing content online now, before it's too late. I've thought a lot about it lately and I'm leaning more towards that option.
glenstein|1 year ago
To your point, I have wondered whatever became of that massive initiative from Google to scan books, and whether that might be looked at as a potential training source, given that Google has run into legal limitations on other forms of usage.
ben_w|1 year ago
Still around, doing fine: https://en.wikipedia.org/wiki/Google_Books and https://books.google.com/intl/en/googlebooks/about/index.htm...
Given the timing, I suspect it was started as simple indexing, in keeping with the mission statement "Organize the world's information and make it universally accessible and useful".
There was also reCAPTCHA v1 (books) and v2 (street view), each of which improved OCR models until state-of-the-art AI could defeat them in their role as CAPTCHA systems.
pncnmnp|1 year ago
A few months ago, there was an interesting submission on HN about this - The Tragedy of Google Books (2017) (https://news.ycombinator.com/item?id=41917016).
Kostchei|1 year ago
And there is no shortage of data and experience in the actual world, as opposed to just the text internet. Can the current AI companies pivot to that? Or do you need to be worldlabs, or v2 of worldlabs?
shanusmagnus|1 year ago
Tossrock|1 year ago
code51|1 year ago
lxgr|1 year ago
Not sure how exactly the Library of Congress is structured, but the equivalent in several countries can request a free copy of everything published.
Extending that to the web (if it isn't already the case legally, if not practically) and then allowing US companies to crawl the resulting dataset as a matter of national security seems like a step I could see within the next few years.
zifpanachr23|1 year ago
See https://fairuse.stanford.edu/overview/fair-use/four-factors/
I think in particular it fails the "Amount and substantiality of the portion taken" and "Effect of the use on the potential market" extremely egregiously.
cedws|1 year ago
kyledrake|1 year ago
This seems like a very bad way to approach this, and ironically their model quite possibly also uses some sort of machine learning to work.
A few web hosting platforms are using the Cloudflare blocker and I think it's incredibly unethical. They're inevitably blocking millions of legitimate users from viewing content on other people's sites and then pretending it's "anti AI". To paraphrase Theo de Raadt, they saw something on the shelf, and it has all sorts of pretty colours, and they bought it.
input_sh|1 year ago
jaybna|1 year ago
1vuio0pswjnm7|1 year ago
https://twitter.com/Bill_Gross/status/1859999138836025808
https://pdl-iphone-cnbc-com.akamaized.net/VCPS/Y2024/M11D20/...
He appears to be criticising "AI" only to solicit support for his own company.
jasondigitized|1 year ago
vidarh|1 year ago
As it stands, OpenAI has a market cap large enough to buy a major international media conglomerate or two. They'll get data no matter how blocked they get.
Workaccount2|1 year ago
Transformers aren't zettabyte-sized archives with a smart searching algo, running around the web stuffing everything they can into datacenter-sized storage. They are typically a few dozen GB in size, if that. They don't copy data; they move vectors in a high-dimensional space based on data.
Sometimes (note: sometimes) they can recreate copyrighted work — never perfectly, but close enough to raise alarm, and in a way that a court would rule a violation of copyright. Thankfully, though, we have a simple fix for this, developed over 30 years of people sharing content on the internet: automatic copyright filters.
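The "automatic copyright filter" idea can be sketched as verbatim n-gram matching on model output: flag any generation that shares a long enough word run with a protected text. This is a toy illustration under my own assumptions (the 6-word threshold and tiny corpus are made up; production systems like YouTube's Content ID are far more sophisticated):

```python
# Toy "copyright filter": flag generated text that shares a long verbatim
# word n-gram with a protected reference text. Threshold is an assumption.

def ngrams(text: str, n: int) -> set:
    """All word-level n-grams of text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_copied(candidate: str, protected: str, n: int = 6) -> bool:
    """True if candidate shares any n-word run verbatim with protected."""
    return bool(ngrams(candidate, n) & ngrams(protected, n))

lyric = "is this the real life is this just fantasy caught in a landslide"
paraphrase = "the song asks whether life is real or merely a fantasy"
```

A near-verbatim reproduction trips the filter, while a paraphrase passes — which matches the comment's point that only near-copies, not learned abstractions, are the problem.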
parineum|1 year ago
"It read many works but can't duplicate them exactly" sounds a lot like what I've done, to be honest. I can give you a few memorable lines from a few songs, but can only really come close to reciting my favorites completely. LLMs are similar, but their favorites are the favorites of the training data: a line from a pop song quoted a billion times is likely reproducible; the lyrics to the next track on the album, not so much.
IMO, any infringement that might have happened would be in acquiring the data in the first place, but copyright law cares more about illegal reproduction than illegal acquisition.
EricMausler|1 year ago
jaybna|1 year ago
cma|1 year ago
devsda|1 year ago
If blocking really becomes a problem, they can take a page out of Google's playbook[1] and develop a browser extension that scrapes page content and, in exchange, offers some free ChatGPT credits or a summarizer-type tool. There won't be a shortage of users.
1. https://en.wikipedia.org/wiki/Google_Toolbar