In Chinese, it always added something like "For study/research purposes only. Please delete after 48 hours." That's the disclaimer volunteer fansubbers add to the subtitles of (pirated) movies/shows.
There is so much damning evidence that AI companies have committed absolutely shocking amounts of piracy, yet nothing is being done.
It only highlights how the world really works. If you have money you get to do whatever the fuck you want. If you're just a normal person you get to spend years in jail or worse.
They can. I don't think anyone has been prosecuted for using an illegal streaming site or downloading from Sci-Hub, for instance. What people do get sued for is seeding, which counts as distribution. If anything, AI companies are being pursued more aggressively than "ordinary people", presumably because of their scale. In a recent lawsuit, Anthropic won on the question of whether training AI on books is fair use, but lost on the part where it used pirated books.
IANAL, but having read a bit on this topic: the part of copyright law relevant to AI isn't an academic exemption, it's transformative use. A model trained on copyrighted material transforms it so much that the result is no longer the original protected work (collage and sampling are the analogous transformations in the visual arts and music industries).
As for actually gathering the copyrighted material: I believe the jury hasn't even been empaneled for that yet (in the OpenAI case), but the latest ruling from the court is that copyright may have been violated in the creation of their training corpus.
That is not the case here - I never encountered this with whisper-large-v3 or similar ASR models. Part of the reason, I guess, is that those subs are burned into the movie, which makes them hard to extract. Standalone subs would have to be matched against the corresponding video to align audio and text. So nothing beats YouTube videos, where the two are already aligned.
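The "alignment" here just means each subtitle cue gives a (start, end, text) triple, so slicing the waveform at those boundaries yields paired (audio, transcript) training examples. A toy sketch of the index arithmetic (the sample rate and function name are illustrative; real pipelines also resample, pad, and filter):

```python
# Hypothetical sketch: turn a time-stamped subtitle cue into waveform
# sample indices for slicing out an (audio, text) training pair.
SAMPLE_RATE = 16_000  # Hz; a common rate for ASR models

def cue_to_slice(start_s: float, end_s: float, sample_rate: int = SAMPLE_RATE):
    """Convert cue times in seconds to [lo, hi) sample indices."""
    return int(start_s * sample_rate), int(end_s * sample_rate)

# A cue running from 1.5s to 3.0s covers samples [24000, 48000)
lo, hi = cue_to_slice(1.5, 3.0)
print(lo, hi)  # 24000 48000
```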
At least for English, those "fansubs" aren't typically burnt into the movie*; they ride along in the video container (MP4/MKV) as separate subtitle streams. They can usually be extracted as SRT files (plain text with sentence-level timestamps).
*Although it used to be more common for AVI files in the olden days.
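For reference, SRT really is just plain text: a numeric cue index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timestamp line, then the subtitle text, separated by blank lines. A minimal Python sketch of parsing such cues (the regex and sample data are illustrative, not a complete SRT implementation):

```python
import re

# One SRT cue: index, "start --> end" timestamp line, then text
# up to the next blank line (or end of file).
CUE_RE = re.compile(
    r"(\d+)\s*\n"                        # cue index
    r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> "  # start time
    r"(\d{2}:\d{2}:\d{2}),(\d{3})\s*\n"  # end time
    r"(.*?)(?:\n\n|\Z)",                 # subtitle text
    re.DOTALL,
)

def to_seconds(hms: str, millis: str) -> float:
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s + int(millis) / 1000

def parse_srt(text: str):
    """Yield (start_seconds, end_seconds, subtitle_text) per cue."""
    for m in CUE_RE.finditer(text):
        yield (to_seconds(m.group(2), m.group(3)),
               to_seconds(m.group(4), m.group(5)),
               m.group(6).strip())

sample = """1
00:00:01,500 --> 00:00:03,000
Hello there.

2
00:00:04,000 --> 00:00:06,250
For study purposes only.
"""

cues = list(parse_srt(sample))
print(cues[0])  # (1.5, 3.0, 'Hello there.')
```

Pulling the stream out of the container in the first place is typically done with a tool like ffmpeg (e.g. `ffmpeg -i movie.mkv -map 0:s:0 subs.srt` for the first subtitle stream).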
Reminds me of https://www.youtube.com/watch?v=8GptobqPsvg