(no title)
Oarch | 2 months ago
Then it dawned on me how many companies are deeply integrating Copilot into their everyday workflows. It's the perfect Trojan Horse.
torginus|2 months ago
For example, in RL you have a train set and a test set; the model never sees the test set, but it is used to validate the model. Why not put proprietary data in the test set?
I'm pretty sure 99% of ML engineers would say this would constitute training on your data, but this is an argument you could drag out in courts forever.
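A rough sketch of how that could look, purely illustrative, with made-up data and scikit-learn: the "proprietary" set never receives a gradient update, yet it still decides which model ships.

```python
# Illustrative only: data used purely as a held-out evaluation set still
# steers the final model, even though no training step ever touches it.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Public data we are allowed to train on.
X_train = rng.normal(size=(1000, 20))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# "Proprietary" data: never trained on, only used to pick the winner.
X_eval = rng.normal(size=(200, 20))
y_eval = (X_eval[:, 0] > 0).astype(int)

best_model, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:            # hyperparameter sweep
    model = LogisticRegression(C=C).fit(X_train, y_train)
    score = model.score(X_eval, y_eval)     # selection driven by the held-out set
    if score > best_score:
        best_model, best_score = model, score

# best_model was "never trained on" X_eval, yet X_eval chose which
# checkpoint ships - the proprietary data shaped the product anyway.
print(f"selected C={best_model.C}, eval accuracy={best_score:.3f}")
```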
Or alternatively - it's easier to ask for forgiveness than permission.
I've recently had an apocalyptic vision: that one day we'll wake up and find that AI companies have produced an AI copy of every piece of software in existence - AI Windows, AI Office, AI Photoshop, etc.
Oarch|2 months ago
There may very well be clever techniques that don't require directly training on the users' data. Perhaps generating a parallel paraphrased corpus as they serve user queries - one which they CAN train on legally.
The amount of value unlocked by stealing practically ~everyone's lunch makes me not want to put that past anyone who's capable of implementing such a technology.
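For what it's worth, the pipeline being speculated about would not have to be complicated. A hypothetical sketch, where paraphrase() stands in for whatever model call would rewrite the text; this is not based on any vendor's actual practice.

```python
# Hypothetical sketch of a "parallel paraphrased corpus" pipeline, as
# speculated above. paraphrase() is a stand-in for a model call that
# rewrites text; no claim that any vendor actually does this.
import hashlib
import json

def paraphrase(text: str) -> str:
    # Placeholder: a real pipeline would call a language model here.
    return f"[reworded] {text}"

def log_for_training(user_query: str, corpus_path: str = "parallel_corpus.jsonl") -> None:
    record = {
        # Only a derived, reworded version plus a hash of the original is
        # stored, so the literal customer text never enters the corpus.
        "source_hash": hashlib.sha256(user_query.encode()).hexdigest(),
        "paraphrase": paraphrase(user_query),
    }
    with open(corpus_path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_for_training("Summarise our Q3 acquisition plan for the board.")
```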
GCUMstlyHarmls|2 months ago
Also, I wonder whether the ToS distinguishes "queries & interaction" from "uploaded data" - I could imagine some tricky language in there saying they won't use your Word document, but may at some point use the queries you ran against it: not as a raw corpus, but as a second layer for examining which tools and workflows to expand or exploit.
Aurornis|2 months ago
There are claims all through this thread that “AI companies” are probably doing bad things with enterprise customer data but nobody has provided a single source for the claim.
This has been a theme on HN. There was a thread a few weeks back where someone confidently claimed up and down the thread that Gemini’s terms of service allowed them to train on your company’s customer data, even though 30 seconds of searching leads to the exact docs that say otherwise. There is a lot of hearsay being spread as fact, but nobody actually linking to ToS or citing sections they’re talking about.
Oarch|2 months ago
Many businesses simply couldn't afford to operate without such an edge.
Aurornis|2 months ago
None of the mainstream paid services ingest operating data into their training sets. You will find a lot of conspiracy theories claiming that companies are saying one thing but secretly stealing your data, of course.
Retric|2 months ago
“How can I control whether my data is used for model training?
If you are logged into Copilot with a Microsoft Account or other third-party authentication, you can control whether your conversations are used for training the generative AI models used in Copilot. Opting out will exclude your past, present, and future conversations from being used for training these AI models, unless you choose to opt back in. If you opt out, that change will be reflected throughout our systems within 30 days.” https://support.microsoft.com/en-us/topic/privacy-faq-for-mi...
At this point, suggesting it has never happened and never will is wildly optimistic.
leptons|2 months ago
Nothing is really preventing this though. AI companies have already proven they will ignore copyright and any other legal nuisance so they can train models.
lwhi|2 months ago
While this isn't used specifically for LLM training, it can involve aggregating insights from customer behaviour.
fzeroracer|2 months ago
It's not really a conspiracy when we have multiple examples of high-profile companies doing exactly this. And it keeps happening. Granted, I'm unaware of cases of this occurring currently with professional AI services, but it's basic security 101 that you should never give anything even a remote opportunity to ingest data unless you don't care about that data.
popalchemist|2 months ago
Many of the top AI services use human feedback to continuously apply "reinforcement learning" after the initial deployment of a pre-trained model.
https://en.wikipedia.org/wiki/Reinforcement_learning_from_hu...
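At its core, that feedback loop means fitting a reward model to human preference pairs and then nudging the policy toward high-reward outputs. A toy sketch, assuming a linear reward model and made-up feature vectors, just to show where user feedback enters:

```python
# Toy sketch of the preference-learning step behind RLHF: fit a reward
# model so that human-preferred responses score higher than rejected ones
# (Bradley-Terry / logistic loss). Features and data are made up.
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Each pair: feature vector of the response a human preferred vs. rejected.
chosen = rng.normal(size=(500, dim)) + 0.5   # preferred responses
rejected = rng.normal(size=(500, dim))       # rejected responses

w = np.zeros(dim)                            # linear reward model r(x) = w.x
lr = 0.1
for _ in range(200):
    margin = (chosen - rejected) @ w         # r(chosen) - r(rejected)
    grad = -((1 - 1 / (1 + np.exp(-margin)))[:, None] * (chosen - rejected)).mean(axis=0)
    w -= lr * grad                           # maximise log sigmoid(margin)

# The reward model then scores new candidate responses; an RL step
# (e.g. PPO) would push the policy toward high-reward outputs.
accuracy = ((chosen - rejected) @ w > 0).mean()
print(f"reward model prefers the human-chosen response {accuracy:.1%} of the time")
```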