top | item 40303035


mixedmath | 1 year ago

About 5 years ago, StackOverflow messed up and declared that they were making all content submitted by users available under CC-BY-SA 4.0 [1]. The error here is that the user-content agreement said all users' contributions are made available under CC-BY-SA 3.0, with no "or later" clause. In the middle there were also some confusing licensing problems concerning code vs non-code content.

I remember thinking that if any of the super answerers really wanted, they could have tried to sue for illegally making their answers available under a different license. But without any damages, I thought this wasn't likely to succeed.

But now I wonder whether making all content available to AI scrapers, and OpenAI in particular, might be enough to base a case on. As far as I can tell, StackOverflow continued being duplicitous about which license applies to which content for the second half of 2018 and the first few months of 2019. Their current licensing suggests CC-BY-SA 3.0 for contributions before May 5 2018, and CC-BY-SA 4.0 for those after. Sometime in early 2019 (if memory serves, after the meta post I link to), they made users log in again and accept a new license agreement relicensing their content. But those middle months are murky.

I should emphasize that I know nothing.

[1]: https://meta.stackexchange.com/q/333089/205676


frognumber|1 year ago

My understanding of licensing law is that something like 3.0 -> 4.0 is very unlikely to be a winnable case in the US.

Programmers think like machines. Lawyers don't. A lot of confusion comes from this. To be clear, there are areas where the law is machine-like, but I believe licensing is not one of them.

If two licenses are substantively equivalent, a court is likely to rule that the change is fine. One would most likely need to show a substantive difference to have a case.

IANAL, but this is based on a conversation with a law professor specializing in this stuff, so it's not completely uninformed. And it matches up with what you wrote: if your history is right, the 2019 change is where there would be a case.

The joyful part is that there are 200 countries in the world, and in many of them the 3.0 -> 4.0 change would be a valid complaint. I suspect it would not fly in most common law jurisdictions (the former British Empire), but it might well succeed in many civil law ones (e.g. France). In the internet age, you can be sued anywhere!

lifthrasiir|1 year ago

> If two licenses are substantively equivalent, a court is likely to rule that it's a-okay. One would most likely need to show a substantive difference to have a case.

Such a difference does exist and can affect the ruling. Notably, CC didn't grant sui generis database rights until 4.0, and I'm aware of at least one case in South Korea where this could have mattered: the plaintiff argued that these rights were never granted to the defendant and were thus violated. Ultimately it was found that the plaintiff didn't have database rights anyway, but it could have gone otherwise.

9991|1 year ago

If there wasn’t a substantive difference, then there’s no need to make the change.

reddalo|1 year ago

The very fact that programmers keep insisting on writing "IANAL" is perhaps an example of that.

A court would probably not agree that writing "IANAL", rather than the full sentence, is a sufficient disclaimer.

sidewndr46|1 year ago

It is worth remembering that law professors have a vested interest in making sure the system works as you described. If contract law were straightforward, they'd be out of a job.

kragen|1 year ago

> if any of the super answerers really wanted, they could have tried to sue for illegally making their answers available under a different license.

they can plausibly sue people other than stackoverflow who attempt to reuse the answers under a different license. but i think it's very difficult to find a use that 4.0 permits and 3.0 doesn't

trueismywork|1 year ago

If it is indeed CC-BY-SA, then OpenAI needs to publish their weights under the same license.

drivingmenuts|1 year ago

People put their content on the site for the public to use, and now the public is using it; it's just that "the public" includes AIs. Admittedly a non-human public, but a public nonetheless...

_xivi|1 year ago

The problem is that LLMs don't provide attribution/credit, which directly violates the license [0].

Search engines, by contrast, were already a "non-human public" that scraped the site, but they linked directly to the answers, which was great. They didn't claim the work as their own, as these models do. The problem isn't human vs non-human. LLMs aren't magic; they don't create content out of thin air. What they're doing is simply content laundering.

[0] https://creativecommons.org/licenses/by-sa/4.0/#ref-appropri...

postepowanieadm|1 year ago

You had to agree to how your work may be used; no one expected it would be sold for AI training.