top | item 39942104

Big Tech's underground race to buy AI training data

152 points| twilightzone | 1 year ago |reuters.com

146 comments

htrp|1 year ago

>Rates vary by buyer and content type, but Braga said companies are generally willing to pay $1 to $2 per image, $2 to $4 per short-form video and $100 to $300 per hour of longer films. The market rate for text is $0.001 per word, she added.

This is high enough that there should be a market to compensate the end users who created this content.

mitthrowaway2|1 year ago

> The market rate for text is $0.001 per word, she added.

I'm astonished that a picture turns out to be worth a thousand words.
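The quip checks out against the quoted rates. A quick back-of-envelope, using only the numbers from the article:

```python
# Rates quoted in the article (USD).
image_rate_low, image_rate_high = 1.00, 2.00  # per image
text_rate_per_word = 0.001                    # per word

# How many words of text cost the same as one image?
words_low = image_rate_low / text_rate_per_word
words_high = image_rate_high / text_rate_per_word
print(f"one image ~ {words_low:.0f} to {words_high:.0f} words")  # 1000 to 2000
```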

paxys|1 year ago

There is a market...to compensate the platforms where creators uploaded the data for free.

golergka|1 year ago

If you give something away when it's worthless, don't come back for more when it's discovered to be worth more.

Users of these sites have had license agreements and privacy policies for a long time, and freely gave away their content because free web hosting was worth it to them. Why would they be entitled to anything more now that this content has found new value?

altdataseller|1 year ago

Are certain types of textual content more valuable than others? For instance, conversations vs. long-form content vs. short-form (i.e., tweets)?

m3kw9|1 year ago

Are counterfeit words then AI-generated? Just like with money, you need a very good “press”, and it has to be hard to detect.

miki123211|1 year ago

With what's happening in the EU with the GDPR on one hand and with the DMA on the other, I wouldn't be surprised if this becomes the new business model for social media companies.

neolefty|1 year ago

This market is troubling. But I have a different question:

What does the long game look like for raw training data? How will AIs maintain the quality of their diet?

To compare, web search started — in the early days of Google — as a huge win because so much valuable information that was scattered around became findable. But over time it has become whac-a-mole with spam and AI copypasta, and now it's a struggle to keep returning good results, for any search engine.

__MatrixMan__|1 year ago

Just like how ads have integrated into everything, trying to get us to click away from the happy path, AI will be in everything, trying to get us to do things that it is not yet good at so that it can learn from us. Which would be fine if the newfound efficiencies were properly democratized.

bilsbie|1 year ago

I wonder if they’ve considered hiring people to write. A lot of people might do it for cheap just to have their imprint on AI.

Or, another twist: pay people to submit ten years of emails (upload the backup file), or just pay small amounts for works they’ve already made: college essays, journals, etc.
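At the article's $0.001/word text rate, the email idea prices out to real money. A sketch, where the per-user volume figures are assumptions purely for illustration:

```python
# Text rate quoted in the article.
rate_per_word = 0.001  # USD

# Assumed volume for one user (illustrative, not from the article).
words_per_email = 100
emails_per_day = 5
years = 10

total_words = words_per_email * emails_per_day * 365 * years
payout = total_words * rate_per_word
print(f"{total_words:,} words -> ${payout:,.2f}")  # 1,825,000 words -> $1,825.00
```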

tracerbulletx|1 year ago

I have to imagine the valuable training data is domain specific stuff like sales call recordings for specific industries and technical materials about specific topics owned by companies. Surely there is enough public or copyright free general purpose material.

passion__desire|1 year ago

This won't be necessary for future AIs. As AIs start aligning tokens from all the rich modalities of audio, video, and 3D with text so that they can express complex ideas, they will bootstrap proper language generation on their own.

I don't think college essays, etc. would contain anything novel. Future techniques could interpolate smoothly, creating ever-new word-mud.

slyall|1 year ago

Turnitin will have millions of essays written by students. No doubt they are already looking at these deals (or getting ready to update their license if it doesn't currently permit them).

cdme|1 year ago

They're more interested in eliminating jobs than creating them.

SnowflakeOnIce|1 year ago

This already happens. I have seen recruiters trying to get domain experts in various fields to write articles for AI training.

laborcontract|1 year ago

Most companies are hiring for the role of AI Tutor. Some of that is definitely happening.

EVa5I7bHFq9mnYK|1 year ago

People will just use ai to write those essays and emails!

layer8|1 year ago

This will be a fun reminiscence once we find out how humans are able to learn with just a tiny fraction of that data volume.

mattgreenrocks|1 year ago

Despite all the hoopla around AGI, the sheer amount of data required really makes human learning all the more impressive.

Gödel probably consumed a minuscule fraction of what these systems have seen. And look what he came up with!

dkasper|1 year ago

Not sure this is a good conjecture. The main reasons are: 1) AIs are expected to have an incredible range that the average human does not; 2) humans actually do take in enormous amounts of data, but it happens over the course of many years, and most of it is audio/visual/tactile/experiential.

We already see that if you want to focus on a narrow skillset you can use a much smaller model and training set. But right now it is a race because everyone wants to be the one true generalized intelligence model.

sigmoid10|1 year ago

The data volume is actually not that different once you account for all senses and how many years it takes for a human to become useful. The interesting thing would be how the human brain filters out the unimportant information as it develops.
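That claim can be made concrete with a very rough estimate. Every number below is an assumption chosen for illustration (training-token counts for frontier models aren't public, and the ~1e7 bits/s optic-nerve figure is just a common ballpark), so treat this as order-of-magnitude only:

```python
# LLM side: assume ~10 trillion training tokens, ~4 bytes of text per token.
llm_bytes = 10e12 * 4

# Human side: assume ~1e7 bits/s of visual input,
# 16 waking hours/day over 18 years.
seconds_awake = 16 * 3600 * 365 * 18
human_visual_bytes = 1e7 * seconds_awake / 8

print(f"LLM text intake:     ~{llm_bytes:.0e} bytes")
print(f"Human visual intake: ~{human_visual_bytes:.0e} bytes")
# By this estimate the human's raw sensory intake is roughly 10x the LLM's text.
```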

gdsimoes|1 year ago

Because we actually think? I'm not just trying to guess the next word, and I understand causal relationships.

quietbritishjim|1 year ago

The real difference with human learning is feedback: when young humans learn, at least some of the time they are interacting with intelligent agents that are able to give them focused feedback on their recent inputs and initial reactions to them.

weregiraffe|1 year ago

Tiny fraction... if you ignore the learning data processed by a billion years of evolution.

cdme|1 year ago

All the more reason for comprehensive privacy/data protection legislation and a refusal to provide data to these companies wherever possible.

Shawnj2|1 year ago

The fact that ChatGPT isn’t deemed copyright infringement is absurd. You can’t take the entire internet, use it to train your software, and claim you’re not violating the copyright of thousands of people.

1024core|1 year ago

Companies like Quest Diagnostics (a lab testing firm) are sitting on a goldmine of clean data. It's only a matter of time before a firm like Amazon (who already bought One Medical) gobbles them up.

Disclaimer: Long on $DGX

Shrezzing|1 year ago

>in talks with multiple tech companies to license Photobucket's 13 billion photos and videos

>Photobucket declined to identify its prospective buyers, citing commercial confidentiality.

>tech companies are also quietly paying for content locked behind paywalls and login screens, giving rise to a hidden trade in everything from chat logs to long forgotten personal photos from faded social media apps

In this market, ethics seem to exist when it comes to corporate clients, but not when it comes to end-users.

It's immediately and self-evidently obvious that no end-user in 2007 consented to photos of their 2007 era teenage self being used to train an AI how to identify an emo kid.

Centigonal|1 year ago

Photobucket is a morally bankrupt shell of its former self. They send constant emails with extremely urgent subject lines threatening to delete your photos unless you sign up for a $5/mo plan. They do this even if your account doesn't contain any photos.

bonton89|1 year ago

> It's immediately and self-evidently obvious that no end-user in 2007 consented to photos of their 2007 era teenage self being used to train an AI how to identify an emo kid.

I can think of worse things than that which might be hidden away for public scraping.

5040|1 year ago

>In former times it was maintained that ownership of landed property extends from heaven all the way down to the center of the Earth, but this doctrine is obsolete, as evidenced by the flight of airplanes.

soulofmischief|1 year ago

I vividly remember consenting to all variety of terms agreements as a 13-year old on the web in 2007. I also remember explicitly licensing all of my output as CC and embracing copyleft. It's never been a secret that even captchas contribute to the improvement of models designed to ultimately sell ads to eyeballs.

A lot of people just were not paying attention to the game being played, and so now they're getting played themselves.

bilbo0s|1 year ago

> It's immediately and self-evidently obvious that no end-user in 2007 consented to photos of their 2007 era teenage self being used to train an AI how to identify an emo kid

Unfortunately, they did actually. It's more accurate to say that they were presented a EULA and Terms of Service that no reasonable teenager would have had any hope of understanding. But since they're over 13, they're held to the terms of those agreements in any case.

These companies are slimy. Make no mistake, this will get worse in the future.

nico|1 year ago

They talk about voice samples, but they don’t mention prices for them

Would it be attractive for a company like Twilio or Aircall to offer free phone calls and sell anonymized recordings?

1024core|1 year ago

Funnily, this is how Google improved their voice recognition.

Remember a decade or so ago, you could call a 1-800 number and look up phone numbers using your voice? It was backed by Google, and once Google was done collecting the data, they shut it down.

tomschwiha|1 year ago

It would solve all government budget issues if the three-letter agencies would start selling all their data.

Cthulhu_|1 year ago

No, that's a gross violation of privacy; there's no such thing as anonymized recordings.

asattarmd|1 year ago

Google having so many private photos in Google Photos must be a goldmine for them.

Melting_Harps|1 year ago

> Google having so many private photos in Google Photos must be a goldmine for them.

While true, it's META who won that arms race long ago in my view; hell, they just disclosed in a lawsuit that they gave Netflix private access to DMs [0].

If you don't think they are training their own models on this data across all their platforms, you have to be a complete idiot: Facebook, Instagram, WhatsApp.

That is a much larger treasure trove given the sheer scale of people on those platforms. Google is limited mainly to Android users and those who use its suite on PC (relatively small compared to social media users), which excludes most Mac users.

The thing they don't tell you about this dark underbelly of AI is that, just like the (meta)data for sale to third parties, it has a tiered price structure wherein Mac users are often the premium tier due to their more 'affluent' status and likelihood of impulsive in-app purchases.

This is why I think META has already won the AI race: they open-source Llama and have a massive treasure trove of data to refine and train on once they see what the OSS community creates that is of actual value. ChatGPT/DALL-E runs at a loss for MS/OpenAI, but if anyone can monetize this gold rush, it will be META.

And perhaps more critically, from an infrastructure POV, Llama now runs better on CPU [1] rather than GPU, which means they won't be constrained or price-pinched on GPUs like Microsoft, Google, and Amazon likely will be due to demand constraints from Nvidia (see the ETH mining craze during COVID). They can focus on optimizing their data centers with more free cash flow, which means they can have a bigger footprint for when they finally figure out how to properly monetize this AI bubble, because it is a bubble, from now until then.

I think Zuck learned from Libra that staying out of the limelight during a bubble is critical if he wants to undo the Metaverse money-pit/losses.

0: https://www.movieguide.org/news-articles/facebook-allowed-ne...

1: https://news.ycombinator.com/item?id=39890262

altdataseller|1 year ago

As well as emails, documents, reviews…

JohnFen|1 year ago

I am incredibly thankful that I never used any of those services. I'm angry enough at the thought that my own websites may have been scraped to train LLMs, but at least I could remove that content. I'd be beside myself if I couldn't do at least that much.

xnx|1 year ago

I assume some of the more shady/no-name dashcam units with WiFi capability are uploading their video and internal microphone recordings. Distributed surveillance: the Panopticar.

digging|1 year ago

Any modern car is likely to already be transmitting that data and more, such as your weight, metadata about your doctor visits, etc. Cars are a privacy nightmare.

flir|1 year ago

I've wondered about crowdsourcing that. Sousveillance. Don't think enough people would be interested, though.

spxneo|1 year ago

Nobody's going to mention Worldcoin?

angryasian|1 year ago

I still speculate PG's golden boy was fired for unethically sourced training data for gpt 4 but we'll likely never get the real story.

sylware|1 year ago

I wonder when one of the richest corps will manage to get exclusive access to such data and lock out the others.

bilbo0s|1 year ago

Never.

Because no one will sell them an exclusive license to the data.

The companies selling this data are slimy. They're borderline crimelords. Picture a pirate captain with a hostage that he is ransoming. Now imagine he gets his ransom, but before he releases the hostage he makes a copy of her. Then ransoms the copy to another interested party. But before he releases the copy, he makes another copy and... you get the idea.

It's pirate thinking.

"If one hostage is good? Then two are better! And three? Well, that's just good business!!!" -Hondo Ohnaka

ganzuul|1 year ago

GDPR covered data should be worth a lot less.

outside1234|1 year ago

Ha - I love your optimism that they are even considering GDPR

mostlysimilar|1 year ago

Who could have guessed giving away all of our data to corporations wholly focused on profit would be a bad thing?

dnissley|1 year ago

If the end result is AI chat agents that anyone in the world can access for free, that seems like an absolutely wonderful thing.