pogue | 18 days ago

I'd be very curious what works well with historical FOIA documents that were scanned by hand and redacted with markers, etc.

chaps | 18 days ago

I like to use textual anchors such as "line starts with", "line ends with", or "file ends with", combined with Levenshtein distance and some normalization (merging adjacent strings in various patterns to account for OCR wonkiness). It turns into building lists of anchors that you can keep building on. Of everything I've tried, including image hashing and the like, it's been the most effective general-purpose "tool".
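To make the anchor idea concrete, here is a minimal Python sketch, not the exact setup described above. The function names (`normalize`, `starts_with`, `ends_with`), the `max_dist` threshold, and the choice to join adjacent tokens without spaces are all assumptions made for illustration; a real pipeline would tune these per document set.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and turn runs of punctuation into spaces."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def _fuzzy_edge(line: str, anchor: str, max_dist: int, from_end: bool) -> bool:
    """Compare the anchor against growing runs of tokens at one edge of the line.

    Adjacent tokens are joined without spaces (an assumed normalization) so
    that OCR splitting or merging words does not count against the match.
    """
    target = normalize(anchor).replace(" ", "")
    tokens = normalize(line).split()
    for n in range(1, len(tokens) + 1):
        chunk = tokens[-n:] if from_end else tokens[:n]
        candidate = "".join(chunk)
        if levenshtein(candidate, target) <= max_dist:
            return True
        if len(candidate) > len(target) + max_dist:
            break  # longer runs can only be further away
    return False


def starts_with(line: str, anchor: str, max_dist: int = 2) -> bool:
    """'Line starts with' anchor, tolerant of up to max_dist edits."""
    return _fuzzy_edge(line, anchor, max_dist, from_end=False)


def ends_with(line: str, anchor: str, max_dist: int = 2) -> bool:
    """'Line ends with' anchor, tolerant of up to max_dist edits."""
    return _fuzzy_edge(line, anchor, max_dist, from_end=True)
```

For example, `starts_with("DEPART MENT OF JUSTlCE  Memo", "Department of Justice")` matches despite the split word and the `l`/`i` swap. A "file ends with" anchor could be approximated by applying `ends_with` to the last non-empty line of a file.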

That said, I firmly believe it's important to actually read the documents being scanned. Seen that way, OCR tends to be more of a procedural step than anything.

Really, it ultimately depends on your goals.