https://draftable.com/compare is by far the best solution I've found for this, and it's a shame it's not more widely known. It's not open-source, and their offline app is Windows-only, but its ability to handle multi-page relayouts is far ahead of Acrobat's diff functionality (as the OP laments), and there's a free online version that's reasonably secure so long as you don't share the secret URL around. I've used it many times to obtain readable redlines when only given successive "baked" versions of a document, and it's a really useful tool for any B2B startup founder.
There's a free and open-source program called `diffpdf` that can compare both visually and by text. It has a GUI and works great, though it doesn't specially handle layout changes. It's included in the normal package sources in most Linux distros.
It's logically the same issue as with signing documents. You have to decide what aspect of the document you want to certify and ignore the rest. If you sign something in a complex document format you don't even know exactly what you are signing. Much of what you are signing is not even visible.
Things like legal documents should be restricted to plain text... and stuff like line endings should be standardized for this purpose.
Second that. This is THE solution for comparing any two PDFs (image or not). I’ve been using it for years almost on a daily basis. Part of its use is certainly derived from the excellent OCR engine it relies on. Also, this runs fully local, which is critical for legal purposes. (edit for context: I still use v14)
For visually comparing PDFs instead of textually comparing them, I use https://parepdf.com. I work in publishing, so comparing printer proof PDFs is something we do regularly.
A guy I work with did his PDF diffing (basically testing whether invoices generated by new code match those generated by old code) by running the two invoices through ImageMagick and subtracting them from each other, basically looking for different pixels (maybe > $threshold) and then looking at the visual diff with missing pixels brightly colored. I thought that was really elegant.
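For anyone curious what that subtraction amounts to, here's a minimal sketch in plain Python. It is not the ImageMagick pipeline described above; it assumes both pages have already been rasterized to same-sized grayscale grids (lists of rows of 0-255 values), and the function name `pixel_diff` and the default threshold are made up for illustration.

```python
def pixel_diff(old, new, threshold=16):
    """Compare two grayscale rasters of identical size.

    Returns (mask, changed): mask[y][x] is True where the absolute
    difference between the two pixels exceeds the threshold, and
    changed is the number of such pixels.
    """
    if len(old) != len(new) or any(len(a) != len(b) for a, b in zip(old, new)):
        raise ValueError("images must have the same dimensions")
    mask = [
        [abs(a - b) > threshold for a, b in zip(row_old, row_new)]
        for row_old, row_new in zip(old, new)
    ]
    changed = sum(cell for row in mask for cell in row)
    return mask, changed


if __name__ == "__main__":
    old_page = [[255, 255], [0, 0]]
    new_page = [[255, 250], [0, 200]]  # one small change, one large change
    mask, changed = pixel_diff(old_page, new_page)
    print(mask, changed)
```

The threshold is what keeps anti-aliasing noise from showing up as a difference; the mask is what you would then paint in a bright color over the page.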
BeyondCompare from Scooter Software [1] does a good job of comparing PDFs, although it only compares the extracted text. But that is just one of the many, many things it can do, so $60 for the Pro version (or $30 for Standard) gets you a lot more than just PDF comparison.
Ignoring all the bad things about Microsoft Word, it really nails revision management for non-tech people. I've worked through many legal contracts, collaborating with lawyers and other parties. I wish there were better alternatives, though.
Nice if you are the only user. If you distribute documents to others, or receive them from others, it's less practical.
Also, Markdown won't nearly represent all the formatting possibilities of a PDF.
Also, I'm not sure HTML is much easier to compare than PDF. Both are markup languages, essentially.
But following along your idea, it would be nice if the document itself contained separate content and presentation layers, so that we could manipulate and analyze (and compare) the content.
It's been a while, but if I remember right PDF is nothing more than PostScript with extensions by Adobe and a PDF preamble. Essentially, it is structured text based on PostScript with the possibility of added embedded binary streams for things such as forms. Given this, implementing a good compare tool should not be too difficult.
The hardest parts would probably be scoping to figure out what Adobe extensions to support, and acquiring or reverse engineering the Adobe extension formats.
Simple PDF example:
%PDF-1.3
%‚„œ”
1 0 obj
<<
/Type /Catalog
/Outlines 2 0 R
/Pages 3 0 R
>>
endobj
2 0 obj
<<
/Type /Outlines
/Count 0
>>
endobj
3 0 obj
<<
/Type /Pages
/Count 2
/Kids [ 4 0 R 6 0 R ]
>>
endobj
4 0 obj
<<
/Type /Page
/Parent 3 0 R
/Resources <<
/Font <<
/F1 9 0 R
>>
/ProcSet 8 0 R
>>
/MediaBox [0 0 612.0000 792.0000]
/Contents 5 0 R
>>
endobj
5 0 obj
<< /Length 1074 >>
stream
2 J
BT
0 0 0 rg
/F1 0027 Tf
57.3750 722.2800 Td
( A Simple PDF File ) Tj
ET
BT
/F1 0010 Tf
69.2500 688.6080 Td
( This is a small demonstration .pdf file - ) Tj
ET
BT
/F1 0010 Tf
69.2500 664.7040 Td
( just for use in the Virtual Mechanics tutorials. More
text. And more ) Tj
ET
BT
/F1 0010 Tf
69.2500 652.7520 Td
( text. And more text. And more text. And more text. ) Tj
ET
BT
/F1 0010 Tf
69.2500 628.8480 Td
( And more text. And more text. And more text. And more
text. And more ) Tj
ET
BT
/F1 0010 Tf
69.2500 616.8960 Td
( text. And more text. Boring, zzzzz. And more text. And
more text. And ) Tj
ET
BT
/F1 0010 Tf
69.2500 604.9440 Td
( more text. And more text. And more text. And more text.
And more text. ) Tj
ET
BT
/F1 0010 Tf
69.2500 592.9920 Td
( And more text. And more text. ) Tj
ET
BT
/F1 0010 Tf
69.2500 569.0880 Td
( And more text. And more text. And more text. And more
text. And more ) Tj
ET
BT
/F1 0010 Tf
69.2500 557.1360 Td
( text. And more text. And more text. Even more. Continued
on page 2 ...) Tj
ET
endstream
endobj
6 0 obj
<<
/Type /Page
/Parent 3 0 R
/Resources <<
/Font <<
/F1 9 0 R
>>
/ProcSet 8 0 R
>>
/MediaBox [0 0 612.0000 792.0000]
/Contents 7 0 R
>>
endobj
7 0 obj
<< /Length 676 >>
stream
2 J
BT
0 0 0 rg
/F1 0027 Tf
57.3750 722.2800 Td
( Simple PDF File 2 ) Tj
ET
BT
/F1 0010 Tf
69.2500 688.6080 Td
( ...continued from page 1. Yet more text. And more text.
And more text. ) Tj
ET
BT
/F1 0010 Tf
69.2500 676.6560 Td
( And more text. And more text. And more text. And more
text. And more ) Tj
ET
BT
/F1 0010 Tf
69.2500 664.7040 Td
( text. Oh, how boring typing this stuff. But not as boring as watching ) Tj
ET
BT
/F1 0010 Tf
69.2500 652.7520 Td
( paint dry. And more text. And more text. And more text.
And more text. ) Tj
ET
BT
/F1 0010 Tf
69.2500 640.8000 Td
( Boring. More, a little more text. The end, and just as
well. ) Tj
ET
endstream
endobj
8 0 obj
[/PDF /Text]
endobj
9 0 obj
<<
/Type /Font
/Subtype /Type1
/Name /F1
/BaseFont /Helvetica
/Encoding /WinAnsiEncoding
>>
endobj
10 0 obj
<<
/Creator (Rave \(http://www.nevrona.com/rave\))
/Producer (Nevrona Designs)
/CreationDate (D:20060301072826)
>>
endobj
xref
0 11
0000000000 65535 f
0000000019 00000 n
0000000093 00000 n
0000000147 00000 n
0000000222 00000 n
0000000390 00000 n
0000001522 00000 n
0000001690 00000 n
0000002423 00000 n
0000002456 00000 n
0000002574 00000 n
trailer
<<
/Size 11
/Root 1 0 R
/Info 10 0 R
>>
startxref
2714
%%EOF
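As a rough sketch of what a text-level compare could look like for a file this simple: the snippet below pulls the literal strings passed to the `Tj` (show text) operator out of an uncompressed content stream and diffs them with Python's `difflib`. The function names are made up for illustration, and the big assumption is that the streams look like the ones above. Real-world PDFs usually compress their streams and also use hex strings, `TJ` arrays, and custom font encodings, none of which this handles.

```python
import difflib
import re

# A literal string "( ... )" followed by the Tj operator.
# Handles backslash escapes, but not nested unescaped parentheses.
TJ_PATTERN = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*Tj", re.DOTALL)


def extract_shown_text(content_stream: str) -> list[str]:
    """Return the strings shown by Tj operators, in stream order."""
    texts = []
    for match in TJ_PATTERN.finditer(content_stream):
        raw = match.group(1)
        # Undo the \( \) and \\ escapes, then collapse runs of whitespace
        # (including line breaks inside the string).
        unescaped = re.sub(r"\\([()\\])", r"\1", raw)
        texts.append(" ".join(unescaped.split()))
    return texts


def diff_streams(old_stream: str, new_stream: str) -> list[str]:
    """Unified diff of the text shown by two content streams."""
    return list(
        difflib.unified_diff(
            extract_shown_text(old_stream),
            extract_shown_text(new_stream),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


if __name__ == "__main__":
    old = "BT\n/F1 0027 Tf\n57.3750 722.2800 Td\n( A Simple PDF File ) Tj\nET\n"
    new = "BT\n/F1 0027 Tf\n57.3750 722.2800 Td\n( Simple PDF File 2 ) Tj\nET\n"
    print("\n".join(diff_streams(old, new)))
```

Note that this compares strings in the order they appear in the stream, which is not necessarily the order they appear on the page.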
It is entirely possible for that to present unexpected differences because of the way that the PDF format works. One can have two different PDFs that encode the same content in two different ways and unless `pdftotext` does virtual layout and then OCR-like extraction, you might end up with jumbled text or text in different orders.
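To make the ordering problem concrete, here's a small sketch of the "virtual layout" idea: treat each piece of text as a fragment with a position, and sort by position instead of trusting the order in the file. The `Fragment` type, the `reading_order` function, and the line tolerance are all hypothetical, and this says nothing about how `pdftotext` does it internally.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # horizontal position of the fragment's start
    y: float   # baseline position; PDF y coordinates grow upward
    text: str


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, top to bottom, left to right."""
    lines = []
    for frag in sorted(fragments, key=lambda f: -f.y):
        # Fragments whose baselines are within the tolerance share a line.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    ]


if __name__ == "__main__":
    # The same page content, emitted in two different orders.
    a = [
        Fragment(57.375, 722.28, "A Simple PDF File"),
        Fragment(69.25, 688.608, "This is a small"),
        Fragment(160.0, 688.608, "demonstration"),
    ]
    b = list(reversed(a))
    print(reading_order(a) == reading_order(b))
```

Two files that draw the same fragments in a different order come out identical after this step, which is what a naive stream-order extraction can't guarantee. Multi-column pages and rotated text would need more than a sort.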
Not only is this the first solution presented in the article, you've not even bothered to pass -layout to pdftotext or -u/-y to diff, which would make this marginally workable. A space after the shebang also doesn't always work (e.g. in qemu), you don't clean up the temporary files, and the temporary files have unhelpful names.
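For what it's worth, here is one way those points could be addressed, sketched in Python rather than shell. It assumes `pdftotext` (from poppler or xpdf) is on the PATH and passes it `-layout`; instead of shelling out to `diff -u` it uses `difflib.unified_diff`, which produces the same style of output. The function names and temp-file naming scheme are just illustrative.

```python
import difflib
import subprocess
import tempfile
from pathlib import Path


def pdftotext_command(pdf_path, txt_path):
    """Build the pdftotext invocation; -layout keeps the physical layout."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def unified_text_diff(old_text, new_text, old_name="old.pdf", new_name="new.pdf"):
    """Unified diff (like `diff -u`) of two extracted texts, as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )


def diff_pdfs(old_pdf, new_pdf):
    """Extract both PDFs to named temp files, diff them, and clean up."""
    texts = []
    # The directory and everything in it is removed when the block exits.
    with tempfile.TemporaryDirectory(prefix="pdfdiff-") as tmp:
        for label, pdf in (("old", old_pdf), ("new", new_pdf)):
            out = Path(tmp) / f"{label}-{Path(pdf).stem}.txt"
            subprocess.run(pdftotext_command(pdf, out), check=True)
            texts.append(out.read_text(encoding="utf-8", errors="replace"))
    return unified_text_diff(texts[0], texts[1], str(old_pdf), str(new_pdf))
```

Usage would be something like `print("\n".join(diff_pdfs("v1.pdf", "v2.pdf")))`. It still inherits every limitation of text extraction mentioned elsewhere in the thread.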
[+] [-] btown|4 years ago|reply
[+] [-] phiresky|4 years ago|reply
[+] [-] pronoiac|4 years ago|reply
[+] [-] upofadown|4 years ago|reply
[+] [-] wolverine876|4 years ago|reply
I don't see that happening! ;)
[+] [-] mckmk|4 years ago|reply
[+] [-] fmos|4 years ago|reply
[+] [-] redman25|4 years ago|reply
[+] [-] seesawtron|4 years ago|reply
[+] [-] spdegabrielle|4 years ago|reply
[+] [-] Toutouxc|4 years ago|reply
[+] [-] yardshop|4 years ago|reply
[1] https://www.scootersoftware.com
[+] [-] riobard|4 years ago|reply
[+] [-] umvi|4 years ago|reply
[+] [-] wolverine876|4 years ago|reply
[+] [-] spdegabrielle|4 years ago|reply
[+] [-] wolverine876|4 years ago|reply
Reading a book on a screen, I find PDF far superior to HTML or Word or text.
[+] [-] sbmthakur|4 years ago|reply
[+] [-] Communitivity|4 years ago|reply
[+] [-] untoxicness|4 years ago|reply
[+] [-] halostatue|4 years ago|reply
[+] [-] Hello71|4 years ago|reply
[+] [-] seoaeu|4 years ago|reply