top | item 32613835

Fizzz | 3 years ago

How do you currently extract info from the FOIA request pages? What kind of info do you look for? Just thinking about how you could standardise this.

chaps|3 years ago

Mostly Tesseract, or uploading the documents to DocumentCloud and running manual searches. I combine data analysis with many hours of reading through documents. Sometimes I use Unix tools like grep/awk/etc., sometimes I use SQL. If the PDF isn't scanned, I use Tabula for CSV extraction, but if it's scanned it becomes a silly ordeal.

Mind you, I'm not exactly looking for advice here. It's a supremely difficult problem, and gut ideas more often than not don't pan out.
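For readers who haven't seen this kind of workflow, here is a rough Python sketch of the scanned-vs-not-scanned split described above. It is not the commenter's actual setup, just an illustration under a few assumptions: Poppler's `pdftotext` and `pdftoppm` and the `tesseract` CLI are installed and on the PATH, and the function names and the 20-character threshold are made up for the example. Table extraction with Tabula is left out.

```python
import re
import subprocess
import tempfile
from pathlib import Path


def looks_scanned(extracted_text: str, min_chars: int = 20) -> bool:
    """Heuristic (assumed threshold): almost no extractable text suggests a scan."""
    return len(extracted_text.strip()) < min_chars


def embedded_text(pdf: Path) -> str:
    """Read the PDF's text layer with Poppler's pdftotext ('-' writes to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_text(pdf: Path, dpi: int = 300) -> str:
    """Rasterise pages with pdftoppm, then OCR each page image with Tesseract."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(Path(tmp) / "page")],
            check=True,
        )
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                ["tesseract", str(image), "stdout"],
                capture_output=True, text=True, check=True,
            )
            pages.append(result.stdout)
    # Join pages with form feeds, the same page separator pdftotext emits.
    return "\f".join(pages)


def extract_text(pdf: Path) -> str:
    """Use the text layer when there is one; otherwise fall back to OCR."""
    text = embedded_text(pdf)
    return ocr_text(pdf) if looks_scanned(text) else text


def grep_lines(text: str, pattern: str) -> list[tuple[int, int, str]]:
    """grep-like search: (page, line number, line) for each case-insensitive match."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for page_no, page in enumerate(text.split("\f"), start=1):
        for line_no, line in enumerate(page.splitlines(), start=1):
            if rx.search(line):
                hits.append((page_no, line_no, line.strip()))
    return hits
```

Something like `grep_lines(extract_text(Path("response.pdf")), r"overtime")` would then list matching lines by page (the file name is hypothetical). OCR output on real scans is noisy, so exact-match searches like this will miss things, which is a large part of why the scanned case is such an ordeal.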