(no title)
kwon-young | 7 months ago
I think the reason this method is not popular is that there are still many ways to encode a semantic object graphically. A sentence can be broken down into words or letters. Table lines can be formed from multiple smaller lines, etc. But, as mentioned by the parent, rule based systems works reasonably well for reasonably focused problems. But you will never have a general purpose extractor since rules needs to be written by humans.
[1] https://poppler.freedesktop.org/ [2] https://gitlab.freedesktop.org/poppler/poppler/-/blob/master...
throwaway4496|7 months ago
Rastering and OCR'ing PDF is like using regex to parse XHTML. My eyes are starting to bleed out, I am done here.
ramraj07|7 months ago
But im a guy who's in the market for a pdf parser service, im happy to pay pretty penny per page processed. I just want a service that works without me thinking for a second about any of the problems you guys are all discussing. What service do I use? Do I care if it uses AI in the lamest way possible? The only thing that matters are the results. There are two people including you in this thread ramming with pdf parsing gyan but from reading it all, it doesn't look like I can do things the right way without spending months fully immersed in this problem alone. If you or anyone has a non blunt AI service that I can use Ill be glad to check it out.
anthk|7 months ago