avgd | 11 months ago
Literally all of them are trained on Wikipedia and Stack Overflow. But /none/ of them are /only/ trained on Wikipedia and Stack Overflow. They need much more than that.
> Also, no human needs to read millions of pirated books to become intelligent.
Obviously, LLM architectures inspired by GPT-2/3 do not learn like humans.
There has never been anything remotely good in the world of LLMs that could be said to have been trained on a moderate, more human-scoped amount of data; they're all trained on trillions of tokens.
Models trained on less than 1T tokens are experimental jokes with no real use to offer.
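To make the scale gap concrete, here's a rough back-of-envelope sketch. Every constant below is a loose assumption of mine (reading speed, hours per day, the tokens-per-word rule of thumb), not a measured figure:

```python
# Back-of-envelope: how many tokens could a dedicated human reader
# consume in a lifetime, vs. a 1T-token training run?
# All constants are rough illustrative assumptions.
WORDS_PER_MINUTE = 250   # assumed typical adult reading speed
HOURS_PER_DAY = 2        # assumed daily dedicated reading time
YEARS = 60               # assumed reading lifespan
WORDS_PER_TOKEN = 0.75   # common rule of thumb: ~0.75 words per token

words = WORDS_PER_MINUTE * 60 * HOURS_PER_DAY * 365 * YEARS
tokens = words / WORDS_PER_TOKEN

print(f"lifetime reading: ~{tokens / 1e9:.1f}B tokens")
print(f"a 1T-token run is ~{1e12 / tokens:.0f}x that")
```

Under these assumptions a lifetime of heavy reading comes out to well under a billion tokens, roughly three orders of magnitude short of even the smallest "serious" training runs.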
You'll notice that even so-called "open data" LLMs like OLMo are, in fact, also trained on copyrighted data; datasets like Common Crawl claim fair use over anything that can be accessed from a web browser.
And then there's the whole issue of data laundered by training on synthetic data generated by another LLM. All the so-called "open" LLMs include a very significant amount of LLM-generated data. If you accept the notion that LLMs trained on copyrighted work are a form of IP infringement rather than fair use, then training on their output is just data laundering and doesn't fix the issue.
Dylan16807 | 11 months ago
It's fuzzy. I could imagine a situation where a primary LLM trained on copyrighted material is a big hazard and can't be released, but carefully monitored and filtered output could be declared copyright-safe, and then used to make a copyright-safe secondary LLM.