top | item 40856614

ck_one | 1 year ago

Can anyone recommend a method to deduplicate PDFs? The hash is often different, but the content and metadata are 99.99% the same.

pixelmonkey|1 year ago

You might want to strip metadata before doing a comparison, using exiftool. Even though exiftool was originally written for EXIF metadata on JPGs, these days it supports a lot of metadata standards, including PDF. This command will do it, assuming you set filename=`basename your.pdf .pdf`:

    exiftool -all= -o "${filename}.stripped.pdf" "${filename}.pdf"

That won't help you with small differences in the contents, but might help with small differences in metadata. Running `md5sum` on the stripped PDF should give more reliable dedupe results.
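Once the stripped copies exist, grouping them by hash could look roughly like the sketch below. The function name `find_dupes` is made up for illustration, and it assumes GNU `md5sum` and file names without whitespace.

```shell
# Print groups of files with identical content, one group per line.
# Assumption: the files passed in are already metadata-stripped copies,
# and their names contain no whitespace (awk splits on it).
find_dupes() {
    md5sum "$@" \
        | sort \
        | awk '{
            # Same hash as the previous line: extend the current group.
            if ($1 == prev) { group = group " " $2 }
            # New hash: emit the finished group if it had 2+ files.
            else { if (n > 1) print group; group = $2; n = 0 }
            prev = $1; n++
          }
          END { if (n > 1) print group }'
}
```

Calling `find_dupes *.stripped.pdf` would then print each set of byte-identical stripped files on its own line; files with no duplicate are not listed.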

I was recently working on a similar problem for JPG, RAW, and MP4 files (photo/video backup) so it is fresh in my mind.

bob1029|1 year ago

I would consider rasterizing the PDFs and then hashing the resulting bitmaps.
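A rough sketch of that idea, assuming `pdftoppm` from poppler-utils as the rasterizer (any renderer would do) and `sha256sum` for the hash. The name `raster_hash` and the 50 DPI resolution are arbitrary choices for illustration.

```shell
# Print a hash of the rendered pages of one PDF.
# Assumptions: pdftoppm (poppler-utils) and sha256sum are installed.
raster_hash() (
    tmp=$(mktemp -d) || exit 1
    # Render every page to a raw PPM bitmap at a low, fixed resolution.
    if pdftoppm -r 50 "$1" "$tmp/page"; then
        # Hash all page bitmaps together; pdftoppm zero-pads the page
        # numbers, so the glob expands in page order.
        cat "$tmp"/page-*.ppm | sha256sum | cut -d ' ' -f 1
        status=0
    else
        status=1
    fi
    rm -rf "$tmp"
    exit "$status"
)
```

Two PDFs whose pages render to the same pixels should print the same hash regardless of metadata, as long as both are rendered with the same tool version and resolution. Any visible difference in the content will still change the hash.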

strangus|1 year ago

cp?