top | item 46978491

chaps | 18 days ago

Documents that come from FOIA, so some are scanned and some aren't. Lots of forms, and lots of handwriting added to capture info that the form format doesn't accommodate. Lots of repeated documents, but also lots of one-off documents that have high signal.

pogue | 18 days ago

I'd be very curious what works well with historical FOIA documents that were scanned by hand and have redactions made with markers, etc.

chaps | 17 days ago

I like to use textual anchors such as "line starts with", "line ends with", or "file ends with", combined with Levenshtein distance and some normalization (combining adjacent strings in various patterns to account for OCR wonkiness). It turns into building up lists of anchors that later ones can build on. Of everything I've tried, including image hashing and the like, it's been the most effective generalized "tool".
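For anyone who wants something concrete, here is a rough sketch of how that kind of anchor matching could look in Python. Everything in it is an assumption made for illustration: the function names, the 0.2 edit-distance ratio, the normalization rules, and the choice to merge only a line with the one right after it. It is not taken from any particular pipeline.

```python
import re

def normalize(text):
    """Lowercase, turn punctuation into spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def levenshtein(a, b):
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def _close(fragment, anchor, max_ratio):
    # Allowed edits scale with anchor length (hypothetical threshold).
    return levenshtein(fragment, anchor) <= max_ratio * len(anchor)

def line_starts_with(line, anchor, max_ratio=0.2):
    n_line, n_anchor = normalize(line), normalize(anchor)
    if not n_anchor:
        return False
    return _close(n_line[:len(n_anchor)], n_anchor, max_ratio)

def line_ends_with(line, anchor, max_ratio=0.2):
    n_line, n_anchor = normalize(line), normalize(anchor)
    if not n_anchor:
        return False
    return _close(n_line[-len(n_anchor):], n_anchor, max_ratio)

def variants(lines, i):
    """Line i alone, then merged with the next line (with and without a
    space), to cope with OCR splitting one string across two lines."""
    yield lines[i]
    if i + 1 < len(lines):
        yield lines[i] + " " + lines[i + 1]
        yield lines[i] + lines[i + 1]

def find_anchor(lines, anchor, matcher=line_starts_with):
    """Index of the first line where any variant matches, else None."""
    for i in range(len(lines)):
        if any(matcher(v, anchor) for v in variants(lines, i)):
            return i
    return None

def file_ends_with(lines, anchor, max_ratio=0.2):
    """Check the last non-blank line of the file against the anchor."""
    non_blank = [ln for ln in lines if ln.strip()]
    return bool(non_blank) and line_ends_with(non_blank[-1], anchor, max_ratio)

# Made-up OCR output: "0" for "O", "l" for "i", and a phrase split mid-word.
page = [
    "U.S. DEPARTMENT 0F JUSTlCE",
    "Freedom of Inform",
    "ation Act Request",
    "Page l of 3",
]
header_at = find_anchor(page, "U.S. Department of Justice")
request_at = find_anchor(page, "Freedom of Information Act")
```

With the made-up page above, the header anchor matches line 0 despite the two OCR substitutions, and the split "Freedom of Information Act" only matches once line 1 is merged with line 2. In practice the threshold and the merge patterns would need tuning per document set, and short anchors need a stricter threshold than long ones to avoid false positives.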

But also, I hold the strong philosophy that it's important to actually read the documents that are being scanned. In that way, OCR tends to be more of a procedural step than anything.

Really, it ultimately depends on your goals.