top | item 22475583

(no title)

robinhowlett | 6 years ago

Thanks for the links - agree about the (x,y,text) callout but other metadata like font size can be useful too.

Regexes have limitations but I was able them to leverage them sufficiently for PDFs from a single source.

I parsed over 1 million PDFs that had a fairly complex layout using Apache PDFBox and wrote about it here: https://www.robinhowlett.com/blog/2019/11/29/parsing-structu...

discuss

order

Defenestresque|6 years ago

I thoroughly enjoyed both the blog post (as an accessible but thorough explanation of your experience with PDF data extraction) and the linked news article [0] as an all-too-familiar story of a company realizing that a creative person is using their freely-available data in novel and exciting ways and immediately requesting that they shut it down, because faced with the perceived dichotomy of maintaining control versus encouraging progress they will often play on the safe side.

[0] https://www.thoroughbreddailynews.com/getting-from-cease-and...

giovannibonetti|6 years ago

Oh, yeah, pdf2json returns font sizes as well. I forgot to mention that.

pierre|6 years ago

pdf2json font name can be uncorrect sometime as it does only extract them based on a pre-set collection of fonts. I suggest using this fork that fix it :

https://github.com/AXATechLab/pdf2json

Bounding box also can be off with pdf2json. Pdf.js do a better job but have a tendency to no handling some ligature/glyph well, transforming word like finish to "f nish" sometime (eating the i in this case). pdfminer (python) is the best solution yet but a thousand time slower....