For those curious about how it works (this isn't mentioned in the README): it uses pymupdf and a precise mapping of every piece of information to area coordinates, so the document layout is hard-coded.
When the layout changes, this breaks, but layout changes on this sort of document do not happen often (I think). Also, the code is very clean and it seems straightforward to fix.
Maybe this kind of code is something that could be generated by an LLM/agent? (It would be easy to write checks.)
Besides the practical value for those who might need it, I think this approach could be interesting for others to look at.
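To make the coordinate-mapping idea concrete, here is a minimal sketch. It is not the project's actual code: the field names, the rectangles in `LAYOUT` and the sample words are all made up for illustration. The word tuples follow the shape that pymupdf's `page.get_text("words")` returns (`x0, y0, x1, y1, text, block, line, word`), so the snippet runs without a PDF at hand.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


# Hypothetical layout: field name -> area on the page, in PDF points.
# A real statement would need coordinates measured from the actual document.
LAYOUT = {
    "account_holder": Rect(50, 100, 300, 120),
    "closing_balance": Rect(400, 100, 550, 120),
}


def words_in(rect, words):
    """Join the words whose centre falls inside rect, in reading order."""
    hits = []
    for x0, y0, x1, y1, text, *_ in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if rect.x0 <= cx <= rect.x1 and rect.y0 <= cy <= rect.y1:
            hits.append((round(y0), x0, text))
    hits.sort()  # top-to-bottom, then left-to-right
    return " ".join(text for _, _, text in hits)


def extract_fields(words, layout=LAYOUT):
    """Apply the hard-coded layout to one page worth of words."""
    return {name: words_in(rect, words) for name, rect in layout.items()}


# Stand-in for the output of page.get_text("words") on one page.
SAMPLE_WORDS = [
    (60, 104, 95, 116, "Mario", 0, 0, 0),
    (100, 104, 135, 116, "Rossi", 0, 0, 1),
    (410, 104, 470, 116, "1.234,56", 1, 0, 0),
    (60, 300, 120, 312, "ignored", 2, 0, 0),
]

print(extract_fields(SAMPLE_WORDS))
```

If the bank moves a field outside its rectangle, that field simply comes back empty, which is the failure mode described above and also an easy thing to check for.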
Off-topic: stay away from Poste Italiane at all costs. The worst bank I have ever dealt with in my entire life. I'm glad that I don't have to deal with them anymore. Terrible service and endless waits at their branches. They are extremely incompetent.
I would love to have something more generic (and have already tried to build it), but parsing tables and bank statements, even from digital PDFs (as in, those that really contain tables and not a picture), is still very difficult, especially when the bank changes layouts from one month to the next.
I would love to be proven wrong, but everything I have tried so far is... subpar.
Nowadays there's probably a solution based on LLMs, but I don't trust them with this kind of data.
> Nowadays there's probably a solution based on LLMs, but I don't trust them with this kind of data
In practice, the flow from my perspective looks like LLM parser -> normalizer -> validator. So you only save one step (the parser), and given the stochastic nature of LLM output, the normalizer and validator can be trickier to write than the ones used for an old-fashioned rules-based parser. But each situation is different, so YMMV.
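For illustration, the normalizer and validator stages might look roughly like the sketch below. Everything here is an assumption, not anyone's real pipeline: the row shape, the US-style date and amount formats, and the two checks. The balance check (opening balance plus transactions equals closing balance) is just one plausible check.

```python
from datetime import datetime
from decimal import Decimal


def normalize_amount(raw: str) -> Decimal:
    """Turn strings like '1,200.00' or '$ -4.50' into a Decimal (US-style separators assumed)."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned)


def normalize_row(row: dict) -> dict:
    """Convert one raw parser row (all strings) into typed values."""
    return {
        "date": datetime.strptime(row["date"].strip(), "%m/%d/%Y").date(),
        "description": " ".join(row["description"].split()),  # collapse whitespace
        "amount": normalize_amount(row["amount"]),
    }


def validate(rows, opening: Decimal, closing: Decimal) -> list:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    total = sum((r["amount"] for r in rows), Decimal("0"))
    if opening + total != closing:
        problems.append(f"balance mismatch: {opening} + {total} != {closing}")
    dates = [r["date"] for r in rows]
    if dates != sorted(dates):
        problems.append("transactions are not in date order")
    return problems


# Hypothetical parser output, whether it came from an LLM or from rules.
raw_rows = [
    {"date": "01/03/2025", "description": "COFFEE   SHOP", "amount": "-4.50"},
    {"date": "01/05/2025", "description": "PAYROLL", "amount": "1,200.00"},
]
rows = [normalize_row(r) for r in raw_rows]
print(validate(rows, Decimal("100.00"), Decimal("1295.50")))
```

With an LLM in front, the normalizer usually has to accept more input variants than this, which is where the extra effort goes.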
pietroppeter|7 months ago
Neat project, thanks for sharing!
genbs|7 months ago
liendolucas|7 months ago
sebtron|7 months ago
nkjoep|7 months ago
genbs|7 months ago
denysvitali|7 months ago
vdm|7 months ago
dimitri-vs|7 months ago
I just tried it on a fairly ugly TD Bank statement PDF I have and the markdown of the whole PDF (tables and all) is very accurate. Here is the config I use:
marker_single --format_lines --use_llm --llm_service marker.services.gemini.GoogleGeminiService --gemini_model_name gemini-2.5-flash --disable_image_extraction --output_format markdown --output_dir "$OutDir" "$In"
You might be able to tell the LLM to output the data directly in CSV format (granted, it will still be in a .md file) using the `--block_correction_prompt` option, which apparently is "useful for custom formatting or logic that you want to apply to the output".
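If the prompt route doesn't pan out, an alternative is to post-process the markdown yourself. The sketch below is a rough, assumed approach rather than anything marker provides: it pulls pipe-delimited table rows out of a markdown string and writes them as CSV. The sample statement is invented, and escaped pipes inside cells are not handled.

```python
import csv
import io


def markdown_tables_to_csv(markdown: str) -> str:
    """Collect every pipe-table row in the markdown and emit it as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in markdown.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # not a table row
        cells = [c.strip() for c in line[1:-1].split("|")]
        if all(c and set(c) <= set("-:") for c in cells):
            continue  # header separator such as |---|:--:|
        writer.writerow(cells)
    return out.getvalue()


# Made-up markdown in the shape a PDF-to-markdown tool might produce.
SAMPLE_MD = """# Statement

| Date | Description | Amount |
|------|-------------|-------:|
| 01/03 | COFFEE SHOP | -4.50 |
| 01/05 | PAYROLL, ACME | 1,200.00 |
"""

print(markdown_tables_to_csv(SAMPLE_MD))
```

Note that all tables in the file end up in one CSV stream here; splitting per table would need a bit more bookkeeping.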
jgalt212|7 months ago
simonebrunozzi|7 months ago
The usual, amazing irony of us Italians. Love it.
genbs|7 months ago
amadeuspagel|7 months ago
Tox46|7 months ago
brightbeige|7 months ago
translates to:
“to the unfortunate ones who have a postal account”
unknown|7 months ago
[deleted]
rcastellotti|7 months ago
genbs|7 months ago