napier | 2 years ago

Training on copyrighted data is arguably fair use in quite a few jurisdictions, to varying extents and with varying levels of precedent, and it is entirely legal for entities based in Japan.

benxh|2 years ago

Yes, but the acquisition of that data itself is illegal in almost all jurisdictions, since libgen is treated as a piracy website. Now, if there were a pipeline to access books from Amazon or the Google Books project for training, it would be a different story.

Still, for certain languages, only libgen and public piracy websites contain any scientific or fiction material in digital formats. E.g. my native language doesn't have easily accessible e-books at all, unless you go through illegal means.

I hope somebody takes the steps necessary to train on the entirety of libgen. The number of high-quality tokens in libgen should be substantial.

fragmede|2 years ago

Google has the resources to train on Google Books, Google Scholar, and their crawled copy of the whole Internet. No clue what Bard is/isn't trained on tho.