top | item 41107022

(no title)

My use case is research papers. That means very clear text, combined with graphs of varying form and quality and finally occasional formulas.

Two approaches I had most, but not full, success with are: 1) converting to image with pdf2image, then reading with pytesseract 2) throwing whole pdfs into pypdf 3) experimental multimodal models

You can get more if you make content more predictable (if you know this part is going to be pure text just put it in pypdf, if you know this is going to be a math formula explain the field to the model and have it read it back for high accessibility needs audience) the better it will go, but it continues to be a nightmare and a bottleneck.

discuss

freethejazz|1 year ago

Depending on how much structure you want to extract before passing the pdf contents to the next step in your pipeline, this paper[1] might be helpful in surfacing more options. It's a review/benchmark of numerous tools applied to the information extraction of academic documents. I haven't been through to evaluate the solutions they examined, but it's how I discovered GROBID and IMO lays out the strengths of each approach clearly.

[1] https://arxiv.org/pdf/2303.09957

authorfly|1 year ago

I have great news I wish someone delivered to me when I was in your shoes - try "GROBID". It parses papers into objects with abstract/body/figures! It will help you out a great deal. It is designed for papers and can extract the text almost flawlessly, but also give information on graphs for separate processing. I have several years experience with academic text processing (including presentations) working with an Academic Publisher if I could be helpful to anything?

Teleoflexuous|1 year ago

I have no idea how did I miss them last time I was looking around, unless they grew significantly over last half a year or so. I'll check it out when I get back to this project, thanks.

I wish I was hiring, if that's what you're asking ;) Otherwise, if you have any ideas for processing formulas (even just for reading them out, but any extra steps towards expressing what they mean - ' 'sum divided by count' is 'mean'/'average' value ' being the most simple example I can think of) I'd love to hear them. Novel ideas in technical papers are often expressed with formulas which aren't that complicated conceptually, but are critical to understanding the whole paper and that was another piece I was having very mixed results with.

siamese_puff|1 year ago

Check out appjsonify for research papers