top | item 44788066

(no title)

sidebute | 7 months ago

> From what i can tell digging into the code that's available, ..., how much time they take to put the text back in content order... The slowest are doing letter by letter. The fastest are not.

Thank you, that's really helpful.

I hadn't considered content reordering but it makes perfect sense given that the internal character ordering can be anything, as long as the page renders correctly. There's an interesting comp-sci homework project: Given a document represented by an unordered list of tuples [ (pageNum, x, y, char) ], quickly determine whether the doc contains a given search string.

Sometimes I need to search PDFs for a regex and use pdfgrep. That builds on poppler/xpdf, which extracts text >2x slower than mupdf (https://documentation.help/pymupdf/app1.html#part-2-text-ext..., fitz vs xpdf). From this discussion, I'm now writing my own pdfgrep that builds on mupdf.

discuss

No comments yet.