top | item 39674177

mellutussa | 2 years ago

I think a distributed OCR project is needed. The problem is that a lot of books are PDF scans that lack raw text. OCRmyPDF does a pretty good job of it, but it's CPU-intensive.

greggsy|2 years ago

I'd wager that there are several players in the AI market who have already scraped and OCR'd every book and magazine on zlib and libgen to feed into model training. Google has almost certainly piped everything they have in Google Books into their models, before some future legal case says they can't. It won't take long before the open community starts doing the same.