top | item 44780927

(no title)

UglyToad | 7 months ago

You're right, this was a fairly common failure state seen in the sample set. The previous reference or one in the reference chain would point to offset of 0 or outside the bounds of the file, or just be plain wrong.

What prompted this post was trying to rewrite the initial parse logic for my project PdfPig[0]. I had originally ported the Java PDFBox code but felt like it should be 'simple' to rewrite more performantly. The new logic falls back to a brute-force scan of the entire file if a single xref table or stream is missed and just relies on those offsets in the recovery path.

However it is considerably slower than the code before it and it's hard to have confidence in the changes. I'm currently running through a 10,000 file test-set trying to identify edge-cases.

[0]: https://github.com/UglyToad/PdfPig/pull/1102

discuss

farkin88|7 months ago

That robustness-vs-throughput trade-off is such a staple of PDF parsing. My guess is that the new path is slower because the recovery scan now always walks the whole byte range and has to inflate any object streams it meets before it can trust the offsets even when the first startxref would have been fine.

The 10k-file test set sounds great for confidence-building. Are the failures clustering around certain producer apps like Word, InDesign, scanners, etc.? Or is it just long-tail randomness?

Reading the PR, I like the recovery-first mindset. If the common real-world case is that offsets lie, treating salvage as the default is arguably the most spec-conformant thing you can do. Slow-and-correct beats fast-and-brittle for PDFs any day.