Provenance matters. An LLM cannot certify a Developer Certificate of Origin (https://en.wikipedia.org/wiki/Developer_Certificate_of_Origi...) and a developer of integrity cannot certify the DCO for code emitted by an LLM, certainly not an LLM trained on code of unknown provenance. It is well-known that LLMs sometimes produce verbatim or near-verbatim copies of their training data, most of which cannot be used without attribution (and may have more onerous license requirements). It is also well-known that they don't "understand" semantics: they never make changes for the right reason.We don't yet know how courts will rule on cases like Does v Github (https://githubcopilotlitigation.com/case-updates.html). LLM-based systems are not even capable of practicing clean-room design (https://en.wikipedia.org/wiki/Clean_room_design). For a maintainer to accept code generated by an LLM is to put the entire community at risk, as well as to endorse a power structure that mocks consent.
raggi|6 months ago
This is similar to the ruling by Alsup in the Anthropic books case that the training is “exceedingly transformative”. I would expect a reinterpretation or disagreement on this front from another case to be both problematic and likely eventually overturned.
I don’t actually think provenance is a problem on the axis you suggest if Alsups ruling holds. That said I don’t think that’s the only copyright issue afoot - the copyright office writing on copyrightability of outputs from the machine essentially requires that the output fails the Feist tests for human copyrightability.
More interesting to me is how this might realign the notion of copyrightability of human works further as time goes on, moving from every trivial derivative bit of trash potentially being copyrightable to some stronger notion of, to follow the feist test, independence and creativity. Further it raises a fairly immediate question in an open source setting if many individual small patch contributions themselves actually even pass those tests - they may well not, although the general guidance is to set the bar low - but is a typo fix either? There is so far to go on this rabbit hole.
snickerbockers|6 months ago
I swear the ML community is able to rapidly change their mind as to whether "training" an AI is comparable to human cognition based on whichever one is beneficial to them at any given instant.
j4coh|6 months ago
strogonoff|6 months ago
On that note, I am not sure why creators in so many industries are sitting around while they are being more or less ripped off by massive corporations, when music has got it right.
— Do you want to make a cover song? Go ahead. You can even copyright it! The original composer still gets paid.
— Do you want to make a transformative derivative work (change the composition, really alter the style, edit the lyrics)? Go ahead, just damn better make sure you license it first. …and you can copyright your derivative work, too. …and the original composer still gets credit in your copyright.
The current wave of LLM-induced AI hype really made the tech crowd bend itself in knots trying to paint this as an unsolvable problem that requires IP abuse, or not a problem because it’s all mostly “derivative bits of trash” (at least the bits they don’t like, anyway), argue in courts how it’s transformative, etc., while the most straightforward solution keeps staring them in the face. The only problem is that this solution does not scale, and if there’s anything the industry in which “Do Things That Don’t Scale” is the title of a hit essay hates then that would be doing things that don’t scale.
[0] It should be clarified that if art is considered (as I do) fundamentally a mechanism of self-expression then there is, of course, no trash and the whole point is moot.
camgunz|6 months ago
We don't need all this (seemingly pretty good) analysis. We already know what everyone thinks: no relevant AI company has had their codebase or other IP scraped by AI bots they don't control, and there's no way they'd allow that to happen, because they don't want an AI bot they don't control to reproduce their IP without constraint. But they'll turn right around and be like, "for the sake of the future, we have to ingest all data... except no one can ingest our data, of course". :rolleyes:
rovr138|6 months ago
> Contributed Code
> In order to keep SQLite completely free and unencumbered by copyright, the project does not accept patches. If you would like to suggest a change and you include a patch as a proof-of-concept, that would be great. However, please do not be offended if we rewrite your patch from scratch.
source, https://www.sqlite.org/copyright.html
jojobas|6 months ago
Borealid|6 months ago
An LLM trained on the Internet-at-large is also presumably suitable for a clean room design if it can be shown that its training completed prior to the existence of the work being duplicated, and thus could not have been contaminated.
This doesn't detract from the core of your point, that LLM output may be copyright-contaminated by LLM training data. Yes, but that doesn't necessarily mean that an LLM output cannot be a valid clean-room reverse engineer.
account42|6 months ago
This is assuming that you are only concerned with a particular work when you need to be sure that you are not copying any work that might be copyrighted without making sure to have a valid license that you are abiding by.
Aeolun|6 months ago
We didn't have this whole issue 20 years ago because nobody gave a shit. If your code was public, and on the internet, it was free for everyone to use by definition.