ealexhudson | 1 year ago
Having read MS code and then writing new code that is heavily inspired by it: sure, that's not copyright infringement. But if you had memorized a bunch of code (and this is within human capability; people can recite many works of literature of varying length with total accuracy, given sufficient study) and reproduced it, that would be copyright infringement once the amount of code was non-trivial. The test in copyright is whether the copying is literal, not how the copying was done or whether it passed through a human brain.
This scenario rarely comes up because humans are, generally, an awful medium for accurate repetition. However, it hasn't really been shown that LLMs share that limitation: in fact, Copilot claims (at least in its Enterprise agreements) to check that its output _does not_ parrot existing code identically. The specific commitment they made in their blog post is/was, "We have incorporated filters and other technologies that are designed to reduce the likelihood that Copilots return infringing content". To be clear, they only propose to reduce the possibility, not remove it.
LLMs rely on a form of lossy compression that can sometimes give back verbatim content. I think it's pretty clear and unarguable that reproducing that content is copyright infringement.