Clueed | 2 years ago

A while ago I looked into the available options for parsing PDFs, including pypdf, which is what is being used here, and the results were not good. While I haven't tested equations specifically, I think it's fair to assume that the results will be subpar, especially for complex ones.

I guess this could be an application of the agent model. I've recently seen multiple LLMs trained specifically on LaTeX parsing. One model would recognize from the parsed PDF garbage that there is probably an equation there and call a different one to parse it.
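A rough sketch of that routing idea, with a crude heuristic standing in for the first model. Everything here is an assumption for illustration: the names `looks_like_equation` and `route_lines`, the symbol set, and the 0.3 threshold are made up, and `parse_equation` is a placeholder for whatever second model would do the actual LaTeX parsing (in practice it would probably need the page region as an image rather than the garbled text).

```python
from typing import Callable, Iterable, List

# Characters that often survive when an equation is flattened to plain text.
# This set is a guess, not something tuned on real extractor output.
MATH_CHARS = set("=+-*/^_<>∑∫∏√≤≥≠±∂∞∈()[]{}|")


def looks_like_equation(line: str, threshold: float = 0.3) -> bool:
    """Crude heuristic: many math symbols and many one-character tokens."""
    tokens = line.split()
    if len(tokens) < 3:
        return False
    if not any(ch in MATH_CHARS for ch in line):
        return False
    symbol_share = sum(ch in MATH_CHARS for ch in line) / len(line)
    short_share = sum(len(t) == 1 for t in tokens) / len(tokens)
    # Average the two signals; the threshold is arbitrary.
    return (symbol_share + short_share) / 2 >= threshold


def route_lines(
    lines: Iterable[str],
    parse_equation: Callable[[str], str],
) -> List[str]:
    """Send suspected equation lines to a second parser, keep the rest as-is."""
    routed = []
    for line in lines:
        if looks_like_equation(line):
            routed.append(parse_equation(line))
        else:
            routed.append(line)
    return routed
```

Usage would be something like `route_lines(extracted_text.splitlines(), my_equation_model)`, where `my_equation_model` is hypothetical. A real detector would likely be a model rather than a symbol count, but the control flow would look much the same.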

Loic | 2 years ago

Thank you for the idea of recognizing the garbage and then using a different flow for the image of the equation from the PDF. That still leaves an image-to-LaTeX problem, but maybe the state of the art has improved in recent years.