emcq|3 years ago

Be wary of using this model - its licensing seems sketchy. Several of the datasets used for training, like WSJ and TED-LIUM, have clear non-commercial clauses. I'm not a lawyer, but releasing the model as "MIT" seems dubious, and hopefully OpenAI paid for the appropriate licenses during training, as they are no longer a research-only non-profit.

jefftk|3 years ago

This is a big dispute right now: OpenAI and other AI companies generally take the position that a model learning from data does not make the model's output a derivative work of that data. For example, GitHub Copilot trains on all publicly available GitHub code regardless of license, and DALL-E 2, Stable Diffusion, etc. use lots of non-free images. I don't think this has been challenged in court yet, and I'm very curious to see what happens when it is.

petercooper|3 years ago

I think it might be even less problematic with something like Whisper than with DALL-E/SD. Merely consuming data to train a system or create an index is not usually contrary to the law (otherwise Google wouldn't exist); it's the publication of copyrighted content that's thorny (and is something you can begin to achieve with outputs from visual models that include the Getty Images watermark, etc.).

I think it'd be a lot harder to argue that an accurate audio-to-text transcription violates the copyright of any of the training material in the way a visual output could.

bscphil|3 years ago

> models learning from data does not make the output of the models a derivative work of that data

Most of the debate seems to be over whether everything produced by models trained on copyrighted work is a derivative work. I'd argue that at the very least some of it is, so the claim attributed to the AI companies (see the quote above) is clearly false.

We're in a weird place now where AI is able to generate "near verbatim" work in a lot of cases, but I don't see an obvious case for treating this any differently than a human reproducing IP with slight modifications. (I am not a lawyer.)

For example, copyright law currently prevents you from selling a T-shirt with the character Spider-Man on it. But plenty of AI models can give you excellent depictions of Spider-Man that you could put on a T-shirt and try to sell. It's quite silly to think that any judge is going to take you seriously when you argue that your model, which was trained on a dataset that included pictures of Spider-Man and was then asked to output images with "Spider-Man" in the prompt, has magically circumvented copyright law.

(I think there's a valid question about whether models represent "derivative work" in the GPL sense specifically, but I'm using the idea more generally here.)

emcq|3 years ago

This is even more direct: access to the WSJ data requires paying the LDC for the download, and the pricing varies by institution and license type. The cost may be a drop in the bucket compared to compute, but I don't know that these licenses transfer to the end product. We might be a couple of court cases away from finding out, but I wouldn't want to be the one inviting those cases :)

nshm|3 years ago

I think they didn't use WSJ for training, only for evaluation. The paper lists WSJ under "Evaluation datasets".

pabs3|3 years ago

Are there any AI/ML models that don't use sketchily licensed datasets? Everything seems to be either "downloaded from the internet, no license" or more explicitly proprietary. The only exception I can think of would be coqui/DeepSpeech?