Depending on how much structure you want to extract before passing the pdf contents to the next step in your pipeline, this paper[1] might be helpful in surfacing more options. It's a review/benchmark of numerous tools applied to the information extraction of academic documents. I haven't been through to evaluate the solutions they examined, but it's how I discovered GROBID and IMO lays out the strengths of each approach clearly.[1] https://arxiv.org/pdf/2303.09957
No comments yet.