I think a distributed OCR project is needed. The problem is that a lot of books are PDF scans that are missing raw text. OCRmyPDF does a pretty good job of it, but it's CPU-intensive.
I'd wager that there are several players in the AI market who have already scraped and OCR'd every book and magazine on zlib and libgen to feed into training models. Google has almost certainly piped everything it has in Google Books into its models, before some future legal case says it can't. It won't take long before the open community starts doing the same.
greggsy|2 years ago