(no title)
Teleoflexuous | 1 year ago
Two approaches I had most, but not full, success with are: 1) converting to image with pdf2image, then reading with pytesseract 2) throwing whole pdfs into pypdf 3) experimental multimodal models
You can get more if you make content more predictable (if you know this part is going to be pure text just put it in pypdf, if you know this is going to be a math formula explain the field to the model and have it read it back for high accessibility needs audience) the better it will go, but it continues to be a nightmare and a bottleneck.
freethejazz|1 year ago
[1] https://arxiv.org/pdf/2303.09957
authorfly|1 year ago
Teleoflexuous|1 year ago
I wish I was hiring, if that's what you're asking ;) Otherwise, if you have any ideas for processing formulas (even just for reading them out, but any extra steps towards expressing what they mean - ' 'sum divided by count' is 'mean'/'average' value ' being the most simple example I can think of) I'd love to hear them. Novel ideas in technical papers are often expressed with formulas which aren't that complicated conceptually, but are critical to understanding the whole paper and that was another piece I was having very mixed results with.
siamese_puff|1 year ago