doctorpangloss | 11 days ago

All the vendors paraphrase user data, then use the paraphrased data for training. This is what their terms of service say.

They have significant experience in this. Microsoft software since 2014, for the most part, is also paraphrased from other people's code they find lying around online.

benterix|11 days ago

> All the vendors paraphrase user data, then use the paraphrased data for training. This is what their terms of service say.

It depends. E.g. OpenAI says: "By default, we do not train on any inputs or outputs from our products for business users, including ChatGPT Team, ChatGPT Enterprise, and the API."[0]

[0] https://openai.com/policies/how-your-data-is-used-to-improve...

shakna|11 days ago

"By default" is a fantastic escape hatch in the language used there. So... what are the exceptions?

simonw|11 days ago

Why would they want to train on random garbage proprietary emails?

If their models ever spit out obviously confidential information belonging to their paying customers, they'll lose those customers to their competitors, and probably face significant legal costs as well.

Your random confidential corporate email really isn't that valuable for training. I'd argue it's more like toxic waste that should be avoided at all costs.

doctorpangloss|10 days ago

Your opinion seems a little unimaginative. To me, since email is the primary work output of millions of Americans, including all of its leaders, there is a lot of opportunity there.

moritzwarhier|11 days ago

> Microsoft software since 2014, for the most part, is also paraphrased from other people's code they find lying around online.

That was pretty funny and explains a lot.

I wish I could do more :(

Instead, I always break things when I paraphrase code without the GeniusParaphrasingTool.

nyrikki|11 days ago

This is exactly why I moved to self-hosted code in 2017.

While I couldn’t have predicted the future, even classic data mining posed a risk.

It is just reality that if you give a third party access to your data, you should expect them to use it.

It is just too tempting a value stream, and the legislation just isn’t there to prevent the EULA trap.

I was targeting a market where a fraction of a percentage point of advantage mattered, which is what drove what was labeled paranoia at the time.