Fizzz | 3 years ago

How do you currently extract info from the FOIA request pages? What kind of info do you look for? Just thinking about how you could standardise this.

chaps | 3 years ago

Mostly Tesseract, or uploading to DocumentCloud and running manual searches. I combine data analysis with many hours of reading through documents. Sometimes I use Unix tools like grep and awk, sometimes SQL. If the PDF isn't scanned, I use Tabula for CSV extraction, but if it's scanned it becomes a silly ordeal.

Mind you, I'm not exactly looking for advice here. It's a supremely difficult problem, and gut ideas more often than not don't pan out.
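
For readers following along, here is a rough sketch of the scanned vs. not-scanned split and the grep-style search described above. It is not chaps's actual tooling. The function names, the `min_chars` threshold, and the idea of checking a page's embedded text to guess whether it was scanned are all assumptions made for illustration. The OCR step shells out to the `tesseract` command-line tool, which has to be installed separately.

```python
from __future__ import annotations

import re
import shutil
import subprocess
from pathlib import Path


def has_text_layer(page_text: str, min_chars: int = 20) -> bool:
    """Guess whether a page is 'not scanned'.

    Assumption: a page whose embedded text has at least `min_chars`
    non-whitespace characters already has a usable text layer.
    """
    return len("".join(page_text.split())) >= min_chars


def choose_tool(page_text: str) -> str:
    """Pick an extraction route: Tabula for text PDFs, Tesseract for scans."""
    return "tabula" if has_text_layer(page_text) else "tesseract"


def ocr_image(image_path: Path) -> str:
    """Run the tesseract CLI on one page image and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],  # "stdout" prints the text
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def grep_pages(pages: dict[int, str], pattern: str) -> list[tuple[int, str]]:
    """Case-insensitive, grep-like search over per-page text.

    Returns (page_number, matching_line) pairs in page order.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for page_no in sorted(pages):
        for line in pages[page_no].splitlines():
            if rx.search(line):
                hits.append((page_no, line.strip()))
    return hits
```

Something like `grep_pages({1: ocr_image(Path("page1.png"))}, r"badge")` would then search one OCR'd page, where `page1.png` is a placeholder file name. None of this addresses the hard part, which is that OCR output from scanned records is noisy enough that simple pattern matching often misses things.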