sidebute | 7 months ago
Thank you, that's really helpful.
I hadn't considered content reordering, but it makes perfect sense given that the internal character ordering can be anything, as long as the page renders correctly. That suggests an interesting comp-sci homework project: given a document represented by an unordered list of tuples [ (pageNum, x, y, char) ], quickly determine whether the doc contains a given search string.
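A rough sketch of one way to attack that homework problem, under some loud assumptions that are mine and not from the thread: characters on the same text line have y values within a small tolerance (`y_tol`, a made-up parameter), x grows left to right, spaces are present as ordinary characters, and a match never spans a line break. The function names are hypothetical.

```python
from collections import defaultdict


def _join_by_x(cluster):
    """Join (x, char) pairs into a string, ordered left to right."""
    return "".join(ch for _, ch in sorted(cluster, key=lambda t: t[0]))


def reconstruct_lines(chars, y_tol=2.0):
    """Rebuild text lines from unordered (page, x, y, char) tuples.

    Assumption: characters whose y lies within y_tol of the first
    character seen for a line belong to that line.
    """
    pages = defaultdict(list)
    for page, x, y, ch in chars:
        pages[page].append((y, x, ch))

    lines = []
    for page in sorted(pages):
        cluster, base_y = [], None
        # Sort by y only, so each line's characters end up adjacent.
        for y, x, ch in sorted(pages[page], key=lambda t: t[0]):
            if base_y is not None and y - base_y > y_tol:
                lines.append(_join_by_x(cluster))
                cluster, base_y = [], None
            if base_y is None:
                base_y = y
            cluster.append((x, ch))
        if cluster:
            lines.append(_join_by_x(cluster))
    return lines


def contains(chars, needle, y_tol=2.0):
    """True if needle occurs within any single reconstructed line."""
    return any(needle in line for line in reconstruct_lines(chars, y_tol))


# Usage: one line of text, stored in reverse order.
doc = [(1, 10.0 * i, 100.0, c) for i, c in enumerate("hello world")]
doc.reverse()
print(contains(doc, "lo wo"))
```

The cost is dominated by the sorts, so roughly O(n log n) in the number of characters. Rotated text, multi-column layouts, and matches that wrap across lines would all need more work than this.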
Sometimes I need to search PDFs for a regex, and I use pdfgrep for that. It builds on poppler/xpdf, which extracts text more than 2x slower than mupdf (https://documentation.help/pymupdf/app1.html#part-2-text-ext..., fitz vs xpdf). Prompted by this discussion, I'm now writing my own pdfgrep that builds on mupdf.
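For illustration only, a minimal sketch of what a mupdf-backed pdfgrep might look like using the PyMuPDF bindings. This is not the tool mentioned above; the function names `grep_pages` and `pdf_pages` are hypothetical, and it assumes a PyMuPDF version where `import fitz` and `page.get_text()` are available. Matching is done page by page, so a match that spans a page break would be missed.

```python
import re


def grep_pages(pages, pattern, flags=0):
    """Yield (page_number, matched_text) for each regex match.

    `pages` is any iterable of (page_number, text) pairs, which keeps
    the matching logic independent of the PDF library.
    """
    rx = re.compile(pattern, flags)
    for number, text in pages:
        for match in rx.finditer(text):
            yield number, match.group(0)


def pdf_pages(path):
    """Yield (1-based page number, plain text) for each page of a PDF.

    Assumption: PyMuPDF is installed; it is imported lazily so that
    grep_pages can be used and tested without it.
    """
    import fitz  # PyMuPDF

    doc = fitz.open(path)
    try:
        for page in doc:
            yield page.number + 1, page.get_text()
    finally:
        doc.close()


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python pdfgrep_sketch.py PATTERN FILE.pdf
    if len(sys.argv) == 3:
        for number, hit in grep_pages(pdf_pages(sys.argv[2]), sys.argv[1]):
            print(f"{sys.argv[2]}:{number}: {hit}")
```

Keeping text extraction separate from matching means the extraction backend can be swapped or benchmarked without touching the regex code.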