top | item 40370571

shekhar101 | 1 year ago

Tangential - I just want a decent (financial transaction) table-to-text conversion that retains the table structure well enough (e.g. merged cells). I have tried everything under the sun short of fine-tuning my own model, including all the multimodal LLMs. None of them work very well without a lot of prompt engineering on a case-by-case basis. Can this help? How can I set it up with a large number of PDFs that are sorted by type and extract the tabular information? Any other suggestions?
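For the batch-processing part of the question, one non-LLM route worth trying is pdfplumber. A minimal sketch (the directory layout, function names, and `min_gap`-style defaults are my own assumptions; note pdfplumber reports cells with no text, including spanned/merged cells, as None rather than as true merged spans):

```python
from pathlib import Path

def table_to_tsv(table):
    """Serialize one extracted table (a list of rows; empty or merged
    cells come back as None) to tab-separated text."""
    return "\n".join(
        "\t".join("" if cell is None else str(cell).strip() for cell in row)
        for row in table
    )

def extract_tables(root):
    """Walk a directory of PDFs sorted into per-type subfolders and
    yield (doc_type, filename, tsv) for every table found."""
    import pdfplumber  # lazy import; `pip install pdfplumber`
    for pdf_path in sorted(Path(root).rglob("*.pdf")):
        doc_type = pdf_path.parent.name  # assumes one subfolder per type
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                for table in page.extract_tables():
                    yield doc_type, pdf_path.name, table_to_tsv(table)
```

This gets you a uniform text form per document type that you can then feed to an LLM, which tends to need far less prompt engineering than handing it raw page images.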


derefr | 1 year ago

Or how about the opposite? Give me a CLI tool to pipe implicitly-tabular space-padded text into — a smart cut(1) — where I can say "give me column 3" and it understands how to analyze the document as a whole (or at least a running sample of a dozen lines or so), to model the correct column boundaries, to extract the contents of that column. (Which would also include trimming off any space-padding from the content. I want the data, not a fixed-width field containing it!)

For that matter, give me a CLI tool that takes in an entire such table, and lets me say "give me rows 4-6 of column Foo" — and it reads the table's header (even through fancy box-drawing line-art) to determine which column is Foo, ignores any horizontal dividing lines, etc.

I'm not sure whether these tasks actually require full-on ML — probably just a pile of heuristics would work. Anything would be better than the low-level tools we have today.
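The heuristic version really is just a pile of rules. A minimal sketch of the smart cut(1) idea (function names and the `min_gap` threshold are invented for illustration): analyze the whole input, treat runs of character positions that are blank on every line as column separators, then slice and trim:

```python
def detect_columns(lines, min_gap=2):
    """Find column separators in space-padded text: runs of at least
    min_gap character positions that are blank in every line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    blank = [all(row[i] == " " for row in padded) for i in range(width)]
    spans, start = [], None
    for i, b in enumerate(blank + [False]):  # sentinel closes a trailing run
        if b and start is None:
            start = i
        elif not b and start is not None:
            if i - start >= min_gap:
                spans.append((start, i))
            start = None
    return spans

def cut_column(lines, n, min_gap=2):
    """Extract the n-th (1-based) column, trimmed of its space padding."""
    spans = detect_columns(lines, min_gap)
    starts = [0] + [end for _, end in spans]
    start = starts[n - 1]
    end = spans[n - 1][0] if n - 1 < len(spans) else None
    return [line[start:end].strip() for line in lines]
```

Wrapped in a few lines of argparse and `sys.stdin`, this already behaves like a cut(1) that models column boundaries from the document as a whole; the header-aware, box-drawing-tolerant version would layer more heuristics (named columns, divider-line filtering) on the same skeleton.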

KhoomeiK | 1 year ago

That's an interesting problem—Tarsier probably isn't the best solution here since it's focused on webpage perception rather than any kind of OCR. But one could try adapting the `format_text` function in tarsier/text_format.py to convert any set of OCR annotations to a whitespace-structured string. Curious to see if that works.
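The general shape of that conversion is simple enough to sketch. This is a generic illustration, not Tarsier's actual `format_text` implementation, and the annotation fields (`text`, `x`, `y`) plus the cell-size constants are assumptions: snap each OCR box onto a character grid by its pixel coordinates, so the original layout survives as whitespace.

```python
def annotations_to_text(annotations, char_w=8, line_h=16):
    """Place OCR annotations (text plus top-left pixel coordinates)
    onto a character grid; first annotation wins where boxes overlap."""
    rows = {}
    for a in annotations:
        row = round(a["y"] / line_h)
        col = round(a["x"] / char_w)
        cells = rows.setdefault(row, {})
        for i, ch in enumerate(a["text"]):
            cells.setdefault(col + i, ch)
    out = []
    for r in range(min(rows), max(rows) + 1):
        cells = rows.get(r, {})
        line = "".join(cells.get(c, " ") for c in range(max(cells) + 1)) if cells else ""
        out.append(line.rstrip())
    return "\n".join(out)
```

With output like this, tables in a scanned page come out as the same implicitly-tabular space-padded text the parent comment wants to cut columns from.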

davedx | 1 year ago

I'm having decent success with GPT-4o on this. Have you given it a try? Results probably vary from one table structure to another.