robinhowlett | 6 years ago
Regexes have limitations, but I was able to leverage them sufficiently for PDFs from a single source.
I parsed over 1 million PDFs that had a fairly complex layout using Apache PDFBox and wrote about it here: https://www.robinhowlett.com/blog/2019/11/29/parsing-structu...
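As a rough illustration of why regexes can be enough when every PDF comes from one source: once the text has been extracted, each line tends to follow a predictable layout. The sketch below is in Python rather than Java/PDFBox, and the line format, the field names (`position`, `name`, `odds`) and the `parse_lines` helper are all hypothetical, invented for this example and not taken from the linked post.

```python
import re

# Hypothetical line layout, invented for illustration only:
#   <position>  <name>  <decimal odds>
LINE_RE = re.compile(
    r"^(?P<position>\d+)\s+"      # leading integer
    r"(?P<name>[A-Za-z' ]+?)\s+"  # name, matched lazily so trailing spaces are excluded
    r"(?P<odds>\d+\.\d{2})$"      # number with exactly two decimals
)


def parse_lines(text):
    """Return one dict per line that matches the assumed layout; skip the rest."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


# Stand-in for text already extracted from a PDF.
sample = """Some header text
1  Fast Runner   3.50
2  Slow Poke    12.00
"""

rows = parse_lines(sample)
```

Lines that do not match are silently skipped here; with a large corpus it is probably worth logging them instead, since an unmatched line is often the first sign that the layout has changed.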
Defenestresque|6 years ago
[0] https://www.thoroughbreddailynews.com/getting-from-cease-and...
giovannibonetti|6 years ago
pierre|6 years ago
https://github.com/AXATechLab/pdf2json
Bounding boxes can also be off with pdf2json. PDF.js does a better job, but it tends to mishandle some ligatures/glyphs, sometimes turning a word like "finish" into "f nish" (eating the "i" in this case). pdfminer (Python) is the best solution so far, but it is a thousand times slower...
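A related, partial mitigation for the ligature problem, sketched in Python: when an extractor emits the ligature as a single Unicode code point (for example U+FB01, "ﬁ"), compatibility normalization expands it back into plain letters. This is an assumption about what the extractor outputs. It does not help when the glyph has been dropped entirely, as in the "f nish" case above. The `expand_ligatures` name is made up for this example.

```python
import unicodedata


def expand_ligatures(text):
    """Expand compatibility characters such as the 'fi' ligature (U+FB01).

    NFKC also rewrites other compatibility characters (full-width forms,
    superscripts, and so on), so apply it only where that is acceptable.
    """
    return unicodedata.normalize("NFKC", text)


# "\ufb01" is the single-code-point "fi" ligature.
cleaned = expand_ligatures("\ufb01nish")
```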