shbooms | 2 months ago

Oftentimes you will have requirements that the documents you release be digitally searchable, so in those cases this would not be an option.

pottertheotter|2 months ago

This made me think of something I came across recently that’s almost the opposite problem of requiring PDFs to be searchable. A local government would publish PDFs where the text is clearly readable on screen, but the selectable text layer is intentionally scrambled, so copy/paste or search returns garbage. It's a very hostile thing to do, especially with public data!

2ICofafireteam|2 months ago

I have encountered PDFs that would exhibit this behavior in one browser but not in another.

One fun thing I encountered from local government is files released at potato-quality resolution with no consideration for the page size.

I had a FOI request that returned mainly Arch D sized drawings but they were in a 94 DPI PDF rendered as letter sized. It was a fun conversation trying to explain to an annoyed city employee that putting those large drawings in a 94 DPI letter size page effectively made it 30-ish DPI.
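For anyone wanting to check the "30-ish DPI" figure, here is a rough sketch of the arithmetic. It assumes Arch D is 24 x 36 inches, letter is 8.5 x 11 inches, and the drawing was scaled to fit the page with no margins and its aspect ratio preserved; the function name `effective_dpi` is made up for illustration.

```python
# Nominal sheet sizes in inches (assumed standard Arch D and US letter).
ARCH_D_IN = (24.0, 36.0)
LETTER_IN = (8.5, 11.0)


def effective_dpi(page_dpi, original_in, page_in):
    """Resolution relative to the original sheet after shrinking it onto a smaller page.

    Assumes the drawing is scaled uniformly to fit the page with no margins.
    """
    orig_short, orig_long = sorted(original_in)
    page_short, page_long = sorted(page_in)
    # A uniform fit is limited by whichever dimension shrinks the most.
    scale = min(page_short / orig_short, page_long / orig_long)
    return page_dpi * scale


print(round(effective_dpi(94, ARCH_D_IN, LETTER_IN), 1))  # 28.7
```

The long edge is the limiting one here (11/36 is about 0.31), so each inch of the original drawing gets roughly 29 pixels; real margins would push that lower still.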

eviks|2 months ago

Hostile indeed, and also happens in user-facing documents like product manuals!

8note|2 months ago

Run some OCR on them afterward to recreate the text layer?
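One possible way to do that, as a sketch only: it assumes the OCRmyPDF command-line tool (`ocrmypdf`) and Tesseract are installed, and uses the `--force-ocr` option, which rasterizes each page and runs OCR on the image so the existing (scrambled) text layer is discarded. The helper names and file paths are hypothetical.

```python
import subprocess


def build_ocr_command(src_pdf, dst_pdf):
    """Build an ocrmypdf invocation that replaces any existing text layer."""
    # --force-ocr: OCR every page even if it already contains text.
    return ["ocrmypdf", "--force-ocr", src_pdf, dst_pdf]


def recreate_text_layer(src_pdf, dst_pdf):
    """Run OCR and write a new PDF; raises CalledProcessError on failure."""
    subprocess.run(build_ocr_command(src_pdf, dst_pdf), check=True)
```

Calling `recreate_text_layer("scrambled.pdf", "searchable.pdf")` would write a new file whose text layer comes from what is visibly rendered on the page. The result is only as good as the OCR, so low-resolution scans will still produce errors.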

albert_e|2 months ago

With the aggressive push of LLMs and generative AI, I am expecting a lot of OCR features to become "smarter" by default, namely to go beyond mechanical OCR and start inserting hallucinations and semantically/contextually "more correct" information into the OCR output.

It's not hard to imagine some powerful LLMs being able to undo light redactions that are deducible from context.