How to compare two PDF documents

[+] btown|4 years ago|reply

https://draftable.com/compare is by far the best solution I've found for this, and it's a shame it's not more widely known. It's not open-source, and their offline app is Windows only, but its ability to handle multi-page relayouts is far above Acrobat's diff functionality (as the OP laments), and there's a free online version that's reasonably secure so long as you don't share the secret URL around. I've used it many times to obtain readable redlines when only given successive "baked" versions of a document, and it's a really useful tool for any B2B startup founder.

[+] phiresky|4 years ago|reply

There's a free and open-source program called `diffpdf` that can compare both visually and by text. It has a GUI and works great, though it doesn't specially handle layout changes. It's included in the normal package sources in most Linux distros:

    apt install diffpdf
    pacman -S diffpdf

https://gitlab.com/eang/diffpdf

[+] pronoiac|4 years ago|reply

I recently read a blog post on putting the Crafting Interpreters book together, and there was an interesting tidbit about visual pdf diffs (under Proofreading the proof) - https://journal.stuffwithstuff.com/2021/07/29/640-pages-in-1...

[+] upofadown|4 years ago|reply

It's logically the same issue as with signing documents. You have to decide what aspect of the document you want to certify and ignore the rest. If you sign something in a complex document format you don't even know exactly what you are signing. Much of what you are signing is not even visible.

Things like legal documents should be restricted to plain text... and stuff like line endings should be standardized for this purpose.
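
To make the line-ending point concrete, here is a minimal Python sketch, assuming the thing being certified is plain text and that "standardized" just means converting every line ending to LF before hashing. The function name `canonical_digest` is made up for illustration; it is not part of any signing standard.

```python
import hashlib


def canonical_digest(text: str) -> str:
    """Hash text after converting CRLF and bare CR line endings to LF."""
    # Assumption: line endings are the only thing we agree to ignore.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()
```

With this, the same contract saved on Windows and on Linux yields the same digest, while any change to the wording still changes it.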

[+] wolverine876|4 years ago|reply

> line endings should be standardized

I don't see that happening! ;)

[+] mckmk|4 years ago|reply

Just to throw in another solution for anyone looking. Abbyy FineReader has a comparison module that is excellent. https://pdf.abbyy.com/how-to/compare-documents/

[+] fmos|4 years ago|reply

Second that. This is THE solution for comparing any two PDFs (image or not). I’ve been using it for years almost on a daily basis. Part of its use is certainly derived from the excellent OCR engine it relies on. Also, this runs fully local, which is critical for legal purposes. (edit for context: I still use v14)

[+] redman25|4 years ago|reply

For visually comparing PDFs instead of textually comparing them, I use https://parepdf.com (I work in publishing, so comparing printer proof PDFs is something we do regularly).

[+] seesawtron|4 years ago|reply

It just seems to overlay two PDFs' text and images in two colors. Looks kind of ugly, but at least it's a free service.

[+] spdegabrielle|4 years ago|reply

This is perfect. What a great tool!

[+] Toutouxc|4 years ago|reply

A guy I work with did his PDF diffing (testing whether invoices generated by new code match those generated by old code) by running the two invoices through ImageMagick and subtracting one from the other, looking for pixels that differ (maybe by more than $threshold), and then looking at the visual diff with the changed pixels brightly colored. I thought that was really elegant.
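
A rough Python sketch of that pixel-subtraction idea, assuming both pages have already been rasterised to same-sized grayscale grids (by ImageMagick, Ghostscript, or similar; that step is not shown). The names `diff_pixels` and `highlight` and the default threshold are invented for illustration, not what the colleague actually ran.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of grayscale values, 0-255


def diff_pixels(old: Image, new: Image, threshold: int = 16) -> List[Tuple[int, int]]:
    """Return (row, col) for every pixel whose brightness differs by more than threshold."""
    if len(old) != len(new) or any(len(a) != len(b) for a, b in zip(old, new)):
        raise ValueError("pages must be rasterised to the same size")
    return [
        (r, c)
        for r, (row_old, row_new) in enumerate(zip(old, new))
        for c, (a, b) in enumerate(zip(row_old, row_new))
        if abs(a - b) > threshold
    ]


def highlight(new: Image, changed: List[Tuple[int, int]]) -> List[List[Tuple[int, int, int]]]:
    """Build an RGB copy of the new page with the changed pixels painted bright red."""
    rgb = [[(v, v, v) for v in row] for row in new]
    for r, c in changed:
        rgb[r][c] = (255, 0, 0)
    return rgb
```

An empty result from `diff_pixels` means the two renderings match within the threshold, which is the pass condition for a regression test on generated invoices.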

[+] yardshop|4 years ago|reply

BeyondCompare from Scooter Software [1] does a good job of comparing PDFs, although it does only compare the extracted text. But that is just one of the many many things it can do, so $60 (for the pro version or $30 for standard) gets you a lot more than just PDF comparison.

[1] https://www.scootersoftware.com

[+] riobard|4 years ago|reply

Ignoring all the bad things about Microsoft Word, it really nails revision management for non-tech people. I've worked thru many legal contracts, collaborating with lawyers and other parties. I wish there'd be better alternatives, tho.

[+] umvi|4 years ago|reply

Store documents as HTML or Markdown under the hood, which is diffable, and only generate the PDF at the last second.

[+] wolverine876|4 years ago|reply

Nice if you are the only user. If you distribute documents to others, or receive them from others, it's less practical.

Also, Markdown won't nearly represent all the formatting possibilities of a PDF.

Also, I'm not sure HTML is much easier to compare than PDF. Both are markup languages, essentially.

But following along your idea, it would be nice if the document itself contained separate content and presentation layers, so that we could manipulate and analyze (and compare) the content.

[+] spdegabrielle|4 years ago|reply

PDF is an abomination & a relic.

[+] wolverine876|4 years ago|reply

OK, so what do you recommend for readability and also for stable presentation across platforms?

Reading a book on a screen, I find PDF far superior to HTML or Word or text.

[+] sbmthakur|4 years ago|reply

What alternative do you suggest?

[+] Communitivity|4 years ago|reply

It's been a while, but if I remember right PDF is nothing more than PostScript with extensions by Adobe and a PDF preamble. Essentially, it is structured text based on PostScript with the possibility of added embedded binary streams for things such as forms. Given this, implementing a good compare tool should not be too difficult.

The hardest parts would probably be scoping to figure out what Adobe extensions to support, and acquiring or reverse engineering the Adobe extension formats.

Simple PDF example:

  %PDF-1.3
  %‚„œ”

  1 0 obj
  <<
  /Type /Catalog
  /Outlines 2 0 R
  /Pages 3 0 R
  >>
  endobj

  2 0 obj
  <<
  /Type /Outlines
  /Count 0
  >>
  endobj

  3 0 obj
  <<
  /Type /Pages
  /Count 2
  /Kids [ 4 0 R 6 0 R ] 
  >>
  endobj

  4 0 obj
  <<
  /Type /Page
  /Parent 3 0 R
  /Resources <<
  /Font <<
  /F1 9 0 R 
  >>
  /ProcSet 8 0 R
  >>
  /MediaBox [0 0 612.0000 792.0000]
  /Contents 5 0 R
  >>
  endobj

  5 0 obj
  << /Length 1074 >>
  stream
  2 J
  BT
  0 0 0 rg
  /F1 0027 Tf
  57.3750 722.2800 Td
  ( A Simple PDF File ) Tj
  ET
  BT
  /F1 0010 Tf
  69.2500 688.6080 Td
  ( This is a small demonstration .pdf file - ) Tj
  ET
  BT
  /F1 0010 Tf
  69.2500 664.7040 Td
  ( just for use in the Virtual Mechanics tutorials. More 
  text. And more ) Tj
  ET
  BT
  /F1 0010 Tf
  69.2500 652.7520 Td
  ( text. And more text. And more text. And more text. ) Tj
  ET
  BT
  /F1 0010 Tf 
  69.2500 628.8480 Td
  ( And more text. And more text. And more text. And more 
  text. And more ) Tj
  ET
  BT
  /F1 0010 Tf
  69.2500 616.8960 Td
  ( text. And more text. Boring, zzzzz. And more text. And 
  more text. And ) Tj
  ET
  BT
  /F1 0010 Tf
  69.2500 604.9440 Td
  ( more text. And more text. And more text. And more text. 
  And more text. ) Tj
  ET
  BT
  /F1 0010 Tf
  69.2500 592.9920 Td 
  ( And more text. And more text. ) Tj
  ET
  BT
  /F1 0010 Tf
  69.2500 569.0880 Td
  ( And more text. And more text. And more text. And more 
  text. And more ) Tj
  ET
  BT
  /F1 0010 Tf
  69.2500 557.1360 Td
  ( text. And more text. And more text. Even more. Continued 
  on page 2 ...) Tj
  ET
  endstream
  endobj

  6 0 obj
  <<
  /Type /Page
  /Parent 3 0 R
  /Resources <<
  /Font <<
  /F1 9 0 R 
  >>
  /ProcSet 8 0 R
  >>
  /MediaBox [0 0 612.0000 792.0000]
  /Contents 7 0 R
  >>
  endobj

  7 0 obj
  << /Length 676 >>
  stream
  2 J
  BT
  0 0 0 rg
  /F1 0027 Tf
  57.3750 722.2800 Td
  ( Simple PDF File 2 ) Tj
  ET
  BT
  /F1 0010 Tf
  69.2500 688.6080 Td
  ( ...continued from page 1. Yet more text. And more text. 
  And more text. ) Tj
  ET
  BT
  /F1 0010 Tf
  69.2500 676.6560 Td
  ( And more text. And more text. And more text. And more 
  text. And more ) Tj
  ET
  BT
  /F1 0010 Tf
  69.2500 664.7040 Td
  ( text. Oh, how boring typing this stuff. But not as boring as watching ) Tj
  ET
  BT
  /F1 0010 Tf
  69.2500 652.7520 Td
  ( paint dry. And more text. And more text. And more text. 
  And more text. ) Tj
  ET
  BT
  /F1 0010 Tf
  69.2500 640.8000 Td
  ( Boring.  More, a little more text. The end, and just as 
  well. ) Tj
  ET
  endstream
  endobj

  8 0 obj
  [/PDF /Text]
  endobj

  9 0 obj
  <<
  /Type /Font
  /Subtype /Type1
  /Name /F1
  /BaseFont /Helvetica
  /Encoding /WinAnsiEncoding
  >>
  endobj
 
  10 0 obj
  <<
  /Creator (Rave \(http://www.nevrona.com/rave\))
  /Producer (Nevrona Designs)
  /CreationDate (D:20060301072826)
  >> 
  endobj

  xref
  0 11
  0000000000 65535 f
  0000000019 00000 n
  0000000093 00000 n
  0000000147 00000 n
  0000000222 00000 n
  0000000390 00000 n
  0000001522 00000 n
  0000001690 00000 n
  0000002423 00000 n
  0000002456 00000 n
  0000002574 00000 n

  trailer
  <<
  /Size 11
  /Root 1 0 R
  /Info 10 0 R
  >>

  startxref
  2714
  %%EOF
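
For a file as simple as the one above, pulling out the text to compare could be sketched in Python like this. It assumes the content streams are uncompressed and that all text is shown with literal strings and the `Tj` operator, as in the example. Real-world PDFs usually compress their streams and often use other text operators and font encodings, so treat this as an illustration only; the function name `extract_text_runs` is made up.

```python
import re
from typing import List

# Body of each "stream ... endstream" section (uncompressed streams only).
STREAM_PATTERN = re.compile(r"stream\r?\n(.*?)endstream", re.DOTALL)
# A literal string "( ... )" followed by Tj; backslash escapes are allowed inside.
TJ_PATTERN = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*Tj", re.DOTALL)


def extract_text_runs(pdf_source: str) -> List[str]:
    """Return the text of every '( ... ) Tj' operation, in file order."""
    runs = []
    for stream in STREAM_PATTERN.findall(pdf_source):
        for raw in TJ_PATTERN.findall(stream):
            text = re.sub(r"\\([()\\])", r"\1", raw)  # unescape \( \) and \\
            runs.append(" ".join(text.split()))  # collapse wrapped lines
    return runs
```

Running it on two versions of a file and feeding the two lists to any line diff gives a crude text comparison. File order is not guaranteed to be reading order, so even this simple case can mislead.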

[+] untoxicness|4 years ago|reply

  #! /bin/bash

  pdf_one="$1"
  pdf_two="$2"

  text_one=$(mktemp)
  text_two=$(mktemp)

  pdftotext "$pdf_one" "$text_one"
  pdftotext "$pdf_two" "$text_two"

  diff "$text_one" "$text_two"

[+] halostatue|4 years ago|reply

It is entirely possible for that to present unexpected differences because of the way that the PDF format works. One can have two different PDFs that encode the same content in two different ways and unless `pdftotext` does virtual layout and then OCR-like extraction, you might end up with jumbled text or text in different orders.

[+] Hello71|4 years ago|reply

Not only is this the first solution presented in the article, you've not even bothered to pass -layout to pdftotext or -u/-y to diff, which would make this marginally workable. A space after the shebang also doesn't always work (e.g. in qemu), you don't clean up the temporary files, and the temporary files have unhelpful names.
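
For what it's worth, a hedged sketch of a version that addresses those points, written in Python rather than bash. It assumes `pdftotext` from poppler-utils is on the PATH, passes `-layout`, reads the text from stdout (an output name of `-`) so there are no temporary files to clean up, and labels the unified diff with the original PDF names. The helper names are invented for this example.

```python
import difflib
import subprocess
import sys
from typing import List


def pdf_to_lines(path: str) -> List[str]:
    """Extract text with pdftotext -layout; the '-' output name sends it to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


def unified(old: List[str], new: List[str], old_name: str, new_name: str) -> str:
    """Unified diff labelled with the PDF names instead of temp-file names."""
    return "\n".join(
        difflib.unified_diff(old, new, fromfile=old_name, tofile=new_name, lineterm="")
    )


if __name__ == "__main__" and len(sys.argv) == 3:
    a, b = sys.argv[1], sys.argv[2]
    print(unified(pdf_to_lines(a), pdf_to_lines(b), a, b))
```

It would be run with two PDF paths as arguments. It still inherits the limits of text extraction mentioned elsewhere in the thread.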

[+] seoaeu|4 years ago|reply

Won't that fail terribly if the text has been re-flowed?
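
One way to soften the re-flow problem, sketched in Python under the assumption that the extracted text is at least in reading order: compare word sequences instead of lines, so that moving a line break is not reported as a change. The function name `word_changes` is hypothetical. Hyphenation at line ends and reordered text blocks would still show up as differences.

```python
import difflib
from typing import List, Tuple


def word_changes(old_text: str, new_text: str) -> List[Tuple[str, str, str]]:
    """Return (operation, old words, new words) for each changed run, ignoring line breaks."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    return [
        (tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

A purely re-flowed paragraph produces an empty list, while a changed word is reported on its own.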
