top | item 41292193

(no title)

guiomie | 1 year ago

Interesting and fun article! I've been experimenting with various LLMs/GenAI solutions to extract tabular data from PDFs with underwhelming results. It seems like they are good at extracting strings of text and summarizing (e.g what was the total price? when was this printed?) but extracting reliably into a CSV has a decent margin of error.

discuss

abhi_p|1 year ago

Disclosure: I'm an employee.

Give the Aryn partitioning service a shot: https://www.aryn.ai/post/announcing-the-aryn-partitioning-se...

We recently released it and we've a few examples here: https://sycamore.readthedocs.io/en/stable/aryn_cloud/get_sta... that show you how to turn the tabular data from the pdf into a pandas dataframe(which you can then turn into csv).