top | item 44669636

(no title)

tcldr | 7 months ago

Exactly. This is the argument that I find lacking from today's discourse: AI companies are already extracting generations worth of human intellectual data into their models. If they want to argue that this is 'fair use' then model distillation is, too. Can't have it both ways.

discuss

an0malous|7 months ago

You can when the laws exist to serve the investor class instead of fairness and justice. There is a ludicrous amount of money in AI now, it has become a central initiative of the current administration and defense industry. The large AI companies will get whatever they want now.

GuB-42|7 months ago

It is complicated, and culture and legal systems will have to adapt.

But you can have it both way. Often, a distinction between fair and unfair is if are competing against the authors directly.

Take Ghibli memes for instance. While obviously the result of training on studio Ghibli content without permission, it doesn't compete against Studio Ghibli directly. Studio Ghibli doesn't draw memes and ChatGPT doesn't make feature films or copy official artwork, I don't think Studio Ghibli lost anything to the meme, they are not in the same business. So it could be considered fair use.

Training a LLM on data from a law firm to make a search engine directly competing against the search engine of said law firm is not fair use, and there is a legal precedent (Thomson Reuters vs Ross). Training your model from another model to compete against them would be the same kind of thing.

There are plenty of nuance, like how transformative it is. But it is possible that extracting massive amount of data is fair use but distillation is not. There are plenty of people at work on the question right now.

miki123211|7 months ago

Open AI is transforming those works, Deepseek is not.

OpenAI takes in code, books and articles and produces a model. This model can be used for novel tasks, like paraphrasing your own writing, translating your text to a different language, writing code according to a provided specification etc, even if there was nothing in the original corpus that exactly solved your problem.

To produce this model, you need four ingredients. The data, the compute, research effort and a lot of tedious RLHF work. While OpenAI uses the first one without providing author compensation (and it has no other option here), the latter three it provides entirely on its own.

People distilling from OpenAI do not create transformative works. They take Open AI's model and make a model of their own. Both models can do very similar things and are suitable for very similar purposes.

Distillation is just a particularly easy way of making an inexact copy of the model weights. The values of those weights will be very different, just as the values of each pixel in an illicit camera recording of a movie at a cinema are very different from those in the original version, but the net result is the same.

tcldr|7 months ago

Just because we're unable to compensate many millions, perhaps billions of people, for using their work without a) permission, or b) remuneration, doesn't justify giving a blanket license to use it without some form of *serious* compensation that reflects the gravity of what is being created.

The current winner-takes-all approach to the outcome is wholly inappropriate. AI companies right now are riding atop the shoulders of giants. Data, mathematics and science that humanity has painstakingly assembled discovered, developed and shared over millennia. Now, we're saying the companies that tip the point of discovery over into a new era should be our new intellectual overlords?

Not cool.

It's clear that model creators and owners should receive some level of reward for their work, but to discount the intellectual labour of generations as worthless is clearly problematic. Especially given the implications for the workforce and society.

Ultimately we'll need to find a more equitable deal.

Until then, forgive me if I don't have much sympathy for a company that's had its latest model distilled.

AdamConwayIE|7 months ago

People always forget that back when OpenAI accused DeepSeek of distillation, o1's reasoning process was locked down, with only short sentences shared with the user as it "thought." There was a paper published in November 2024 from Shanghai Jiao Tong University that outlined how one would distill information from o1[1], and it even says that they used "tens of thousands" of o1 distilled chains. Given that the primary evidence given for distillation, according to Bloomberg[2], was that a lot of data was sent from OpenAI developer accounts in China in late 2024, it's not impossible that this (and other projects like it) could also have been the cause of that.

The thing is, given the other advances that were outlined in the DeepSeek R1 paper, it's not as if DeepSeek needed to coast on OpenAI's work. The use of GRPO RL, not to mention the training time and resources that were required, is still incredibly impressive, no matter the source of the data. There's a lot that DeepSeek R1 can be credited with in the LLM space today, and it really did signify a number of breakthroughs all at once. Even their identification of naturally emergent CoT through RL was incredibly impressive, and led to it becoming commonplace across LLMs these days.[3]

It's clear that there are many talented researchers on their team (their approach to MoE with its expert segmentation and expert isolation is quite interesting), so it would seem strange that with all of that talent, they'd resort to distillation for knowledge gathering. I'm not saying that it didn't happen, it absolutely could have, but a lot of the accusations that came from OpenAI/Microsoft at the time seemed more like panic given the stock market's reaction rather than genuine accusations with evidence behind them... especially given we've not heard anything since then.

https://github.com/GAIR-NLP/O1-Journey https://www.bloomberg.com/news/articles/2025-01-29/microsoft... https://github.com/hkust-nlp/simpleRL-reason

LearnYouALisp|7 months ago

YOu mean making something sound like it was either written on Reddit or in a paper mill and requires effort to quickly find the material of value like a reading a machine-translation