yfontana | 9 months ago
The full extract is projected to run for several days on a GPU cluster, at a cost of roughly 20-30k (I can't remember the exact number, but it's in that ballpark). When you can afford this kind of compute, text extraction from PDFs isn't quite a fully solved problem, but we're most of the way there.
As far as I understand, the article in the OP is trying to do something somewhat different: use much simpler heuristics to get acceptable results more cheaply and quickly. That is definitely still an open problem.