Stack Overflow simply bans folks who don't want their advice used to train AI

[+] ChrisArchitect|1 year ago|reply

[dupe]

Lots more discussion:

https://news.ycombinator.com/item?id=40297027

https://news.ycombinator.com/item?id=40302792

[+] lolinder|1 year ago|reply

Discussed the other day in a consolidated thread from a few duplicate submissions (311 comments total):

https://news.ycombinator.com/item?id=40302792

[+] sethhochberg|1 year ago|reply

> Stack Overflow has been banning users wholesale who have attempted to delete or deface their own posts on the site

It's not just people who are upset getting banned for being upset; it's people who are attempting to burn it all down on their way out the door in protest.

I get where the protests are coming from, but the cardinal rule of online communities is "once you post it, it's out there". Other people have reacted to it, replied to it, quoted it. You break not just your own content but entire discussions if you mass-delete your contributions. They stopped being exclusively yours to take back once you contributed them to a broader conversation.

[+] creativeSlumber|1 year ago|reply

> but the cardinal rule of online communities is "once you post it, it's out there"

That's your opinion, not a cardinal rule. For example, Reddit comments and posts are deleted by users all the time. Same for Twitter, where users delete their posts and reactions all the time.

[+] manuelabeledo|1 year ago|reply

> You break not just your own content but entire discussions if you mass-delete your contributions.

If they allow users to delete or edit their own content, then what's the issue? That just sounds like a technical problem StackExchange didn't think of.

On the other hand, why is this behaviour deemed unacceptable, and StackExchange double dipping on user generated content is not?

[+] beefnugs|1 year ago|reply

"Once you post it its out there" sure... but if you piss off huge portions of the platform, then they have just as much right to now become your enemy and flood your system with worse trash than the AI is capable of.

Great, so how did these companies expect to stab so many people in the back and then continue as if it didn't happen again?

[+] add-sub-mul-div|1 year ago|reply

I wrote a script to delete over 10,000 reddit comments from my accounts there when they enshittified past the point of no return last year. Yes, the whole point is to devalue the site so that old search results are broken and people stop considering it a worthwhile place. The remaining users who are docile enough to remain on a shit site are holding the internet back from evolving through better sites.

[+] wokkel|1 year ago|reply

Hard no. There is such a thing as the right to be forgotten. At least in Europe. Stackoverflow likes to take that right away as it earns them more money, but that is never a good reason.

[+] 7e|1 year ago|reply

A right to one's own expression/speech is a basic human right.

[+] ToucanLoucan|1 year ago|reply

> They stopped being exclusively yours to take back once you contributed them to a broader conversation.

Says who, exactly? Stack Overflow, previously, did not say this. Reddit threads get deleted all the time. Facebook posts slide into the ephemeral "past" and are never seen again. Old forums die entirely, with nothing but (hopefully) a copy on the wayback machine.

And that's just regular old rot. In this case, Stack Overflow is explicitly changing the rules on what they can do with writing provided by volunteers under a previously understood agreement, to a new agreement, with no opportunity for negotiation and no engagement with their community. Simply an edict, issued from On High: "We can now sell your contributions to AI companies." And I'm sure plenty of people don't give a shit, but clearly some do, and so we get what HN is constantly clamoring for, individuals making decisions about their own creations and their own speech, but now it's suddenly bad because it's the Wrong decision.

[+] lolc|1 year ago|reply

It's not clear to me what the deal is supposed to be about. Isn't it expected that Stackoverflow questions and answers can be used by anybody including to train models?

If Stackoverflow is trying to make exclusive deals with OpenAI, that is against the collaborative spirit of the platform, and I will stop contributing. After all, OpenAI is charging people for its service. If OpenAI is the only one given access, Stackoverflow becomes a gatekeeper, peddling my contributions. It'll beget a fork.

[+] coldpie|1 year ago|reply

> Isn't it expected that Stackoverflow questions and answers can be used by anybody including to train models?

I don't think so. Content contributed to SO gets licensed under CC-BY-SA, no? How do these AI models respect the "BY" portion of that license? How do we enforce the "SA" portion of the license on people who use the output of the model?

The answer is they don't, but the big companies decided violating copyright is OK so long as they do it all at once, because there's a lot of money to be made if you ignore copyright.

Personally I'd be OK with a compromise where any entity that creates or uses an AI trained on copyrighted data forfeits copyright on all of their own works (not just those created by AI, all of their works).

[+] throwaway290|1 year ago|reply

> Isn't it expected that Stackoverflow questions and answers can be used by anybody including to train models?

CC ShareAlike requires attribution + derivative works to be shared under the same license. ClosedAI is absolutely in the wrong here.

[+] add-sub-mul-div|1 year ago|reply

> It's not clear to me what the deal is supposed to be about.

AI is unpopular and makes this an unpopular move, enough so to cause this drama. It's really not any more complicated than that.

[+] mort96|1 year ago|reply

It's shitty to take content under an attribution license like the CC-BY-SA license SO uses and pour it into a soup of linear algebra where the content remains but all attribution is lost.

[+] fer|1 year ago|reply

Content contributed on StackOverflow (and all the StackExchange sites I believe) is CC, so it's really a losing battle.

[+] schlauerfox|1 year ago|reply

My question is how an LLM's use of CC-BY-SA 4.0 content is going to be attributed to satisfy the license conditions. Every answer by the model contains the work as part of its token weights and can regurgitate it.

[+] trueismywork|1 year ago|reply

It's CC-BY-SA, so OpenAI has to release the weights obtained by training.

[+] jessriedel|1 year ago|reply

I don't get it.

Like, it's one thing if a fact-collecting business like the NYTimes doesn't want its stories to train an LLM. I think under current law they don't have much of a case, because facts aren't copyrightable, but there's a reasonable argument that the law should be updated somehow in light of technological change.

But the work produced by all StackExchange users is explicitly released under a CC BY-SA license. The whole point is to collect and publish facts/ideas/understanding for anyone to see and use for any purpose, including running a business. Yes, the "SA" (share alike) part means if you want to use and modify the words then you need to release them under license that is at least as permissive, but LLMs aren't using the words; they are clearly digesting the facts and expressing them in their own words. And, unlike the NYTimes, there is no issue of "couldn't new tech undermine society's current method of economically incentivizing fact-collection?". The StackExchange users are not being paid, and the fact that the license is not NC (non-commercial) explicitly means that using their hard work to make money is allowed (and encouraged!).

[+] add-sub-mul-div|1 year ago|reply

> I don't get it.

It's not about legality, it's that sites like SO and Reddit have turned a corner and are no longer sites or businesses that people want to support. The contract of user generated content has been disrupted, we'd have happily given them content to profit from forever if they didn't turn shitty. And now people are lashing out even if it's futile because the content has already been sold. These sites failed a generational marshmallow test, taking short term profit at the expense of souring people on the idea of providing free content going forward.

[+] progval|1 year ago|reply

> Yes, the "SA" (share alike) part means if you want to use and modify the words then you need to release them under license that is at least as permissive, but LLMs aren't using the words; they are clearly digesting the facts and expressing them in their own words

The CC BY-SA text <https://creativecommons.org/licenses/by-sa/4.0/legalcode.en> says nothing about the words. The words "word" and "words" do not even appear in the legal text. What it says, however, is:

> In addition to the conditions in Section 3(a) , if You Share Adapted Material You produce, the following conditions also apply.

> 1. The Adapter’s License You apply must be a Creative Commons license with the same License Elements, this version or later, or a BY-SA Compatible License.

> 2. You must include the text of, or the URI or hyperlink to, the Adapter's License You apply. You may satisfy this condition in any reasonable manner based on the medium, means, and context in which You Share Adapted Material.

where "Adapted Material" is defined as:

> material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image.

So if you want to argue that LLMs don't have to follow the SA clause, then you have to argue either that:

1. LLMs aren't an arrangement or transformation of their input, or

2. the license doesn't apply at all, eg. because it's fair use in whatever jurisdiction they fall under.

OpenAI is using argument 2 <https://news.ycombinator.com/item?id=37780199>.

[+] dongobread|1 year ago|reply

Legality aside, I think the "payment" people get from posting free knowledge on the Internet is the human connection, and the satisfaction of knowing that other people are reading and appreciating it directly.

Injecting an LLM middleman between your post and the end user changes this dynamic quite a bit - without the human component, the feeling is that you're just doing unpaid labor for a profit-oriented company (OpenAI).

[+] trueismywork|1 year ago|reply

It's not clear whether LLM outputs are derivative works or not. The only way they would not be derivative works is if there's specific logic in the implementation to check for all copyrighted works used in training and prevent an infringing output from such a work.

[+] elforce002|1 year ago|reply

StackOverflow is destroying its brand one step at a time. ChatGPT is good for boilerplate and generic errors. I still use SO for complex errors that are still getting updated with new ways to solve that said error, etc...

Without the community posting and answering questions, ChatGPT won't work anymore. They (SO) have to block any scraping, wait until GPU prices go down, or use open-source LLMs with their data and figure out how to monetize that service (maybe giving some percentage to devs, etc...) or even check if that approach (add a chatbot service) makes sense.

They just caved to the FOMO mentality without considering that tech is always evolving and they need engineers, devs, etc... to keep finding bugs, writing about their experiences on how they solved those errors, etc...

We'll see if a different contender figures this out and comes out with that solution if SO doesn't change its course.

[+] renewiltord|1 year ago|reply

Fascinating. Perhaps Stallman's greatest innovation is copyleft. The idea of required reciprocation seems to tie deeply into people's views. For the little OSS I have I picked it by default but would gladly use BSD or some safe PD license on the other hand.

The idea of Pillaging the Commons is interesting. I wonder when the mainstream opinion started shifting. It wasn't quite sudden, but it seems to me that even ten years ago the dominant Internet-visible position was that much of copyright was bogus: information wants to be free / if a pirate copies your stuff, you still have it.

But now there's a stronger sense of "this information is ours". Perhaps that subculture moved somewhere and this one came here or perhaps I moved from where the former was to where the latter is.

I find it an interesting sociological phenomenon.

[+] throwaway290|1 year ago|reply

Ultimately they are digging their own grave (who needs SO if you can ask ChatGPT and shell out a few bucks to Microsoft eh?)

If anyone else thinks it's a bad move, the most efficient way to boycott it is by adding senseless questions and answers and upvoting them. Us people is how they got big, us people is how they go down.

[+] elforce002|1 year ago|reply

If people stop posting, even chatGPT will be useless since it can't keep up with current bugs, new ways to solve them, etc...

[+] amelius|1 year ago|reply

And let me guess, meanwhile they ask a big fee from AI companies for using the data.

[+] lxgr|1 year ago|reply

Can they even do that?

I know that SO content is licensed under creative commons; is there an additional dual license that allows them to commercially use it under different (e.g. non-attribution) terms?

[+] smilespray|1 year ago|reply

No need to guess — that's what the protests are about.

[+] cosmin800|1 year ago|reply

Stackoverflow made millions, people got badges.

[+] lxgr|1 year ago|reply

People also get a freely accessible database dump licensed under CC-BY-SA. That's more than almost every other content-oriented platform out there gives back.

[+] SirMaster|1 year ago|reply

People also got answers to questions from other professionals that probably helped them personally financially and professionally in some way.

87 comments