vikp | 1 year ago
Across 375 samples with an LLM as judge, Mistral scores 4.32 and marker scores 4.41. Marker can process between 20 and 120 pages per second on an H100.
You can see the samples here - https://huggingface.co/datasets/datalab-to/marker_comparison...
The code for the benchmark is here - https://github.com/VikParuchuri/marker/tree/master/benchmark... I'll run a full benchmark soon.
Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs.
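For illustration, here's a minimal sketch of what rubric-based LLM judging can look like (assuming the google-generativeai client; the rubric prompt, helper names, and score parsing are simplified stand-ins, not marker's actual benchmark code):

    # Sketch: score each OCR'd page 1-5 against a rubric with an LLM judge.
    import re
    import statistics

    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="YOUR_API_KEY")
    judge = genai.GenerativeModel("gemini-2.0-flash-001")

    RUBRIC = (
        "Compare the OCR output to the page image and rate its "
        "faithfulness from 1 (unusable) to 5 (perfect). "
        "Reply with only the integer score.\n\nOCR output:\n{ocr}"
    )

    def judge_page(image_path: str, ocr_text: str) -> int:
        # Send the page image plus OCR text, then parse the 1-5 reply.
        response = judge.generate_content(
            [RUBRIC.format(ocr=ocr_text), Image.open(image_path)]
        )
        match = re.search(r"[1-5]", response.text)
        if match is None:
            raise ValueError(f"unparseable judge reply: {response.text!r}")
        return int(match.group())

    def benchmark(samples: list[tuple[str, str]]) -> float:
        # Mean judge score over (image_path, ocr_text) samples.
        return statistics.mean(judge_page(img, txt) for img, txt in samples)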
lolinder | 1 year ago
For anyone else interested, the prompt is here [0]. The model used was gemini-2.0-flash-001.
Benchmarks are hard, and I understand the appeal of having something that seems vaguely deterministic rather than having a human in the loop, but I have a very hard time accepting any LLM-judged benchmarks at face value. This is doubly true when we're talking about something like OCR which, as you say, is a very hard problem for computers of any sort.
I'm assuming you've given this some thought—how did you arrive at using an LLM to benchmark OCR vs other LLMs? What limitations with your benchmark have you seen/are you aware of?
[0] https://github.com/VikParuchuri/marker/blob/master/benchmark...
themanmaran | 1 year ago
- Every document has ground truth text, a JSON schema, and the ground truth JSON.
- Run OCR on each document and pass the result to GPT-4o along with the JSON schema.
- Compare the predicted JSON against the ground truth JSON for accuracy.
In our benchmark, ground truth text => GPT-4o scored 99.7%+ accuracy, meaning that whenever GPT-4o was given the correct text, it extracted the structured JSON values correctly ~100% of the time. So if we pass in the OCR text from Mistral and it scores 70%, the inaccuracies are isolated to OCR errors.
https://github.com/getomni-ai/benchmark
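A minimal sketch of the comparison step (the flattening scheme and example fields are illustrative; the real harness is in the repo above):

    # Sketch: field-level accuracy of predicted JSON vs. ground truth JSON.
    from typing import Any

    def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
        # Flatten nested dicts/lists into dotted-path keys.
        flat: dict[str, Any] = {}
        if isinstance(obj, dict):
            for key, value in obj.items():
                flat.update(flatten(value, f"{prefix}{key}."))
        elif isinstance(obj, list):
            for i, value in enumerate(obj):
                flat.update(flatten(value, f"{prefix}{i}."))
        else:
            flat[prefix.rstrip(".")] = obj
        return flat

    def json_accuracy(predicted: dict, ground_truth: dict) -> float:
        # Fraction of ground-truth fields the prediction matched exactly.
        truth, pred = flatten(ground_truth), flatten(predicted)
        correct = sum(1 for k, v in truth.items() if pred.get(k) == v)
        return correct / len(truth) if truth else 1.0

    # One wrong field out of three -> ~0.667.
    gt = {"invoice": {"total": 1250, "currency": "USD"}, "pages": 2}
    pr = {"invoice": {"total": 1250, "currency": "usd"}, "pages": 2}
    print(f"accuracy = {json_accuracy(pr, gt):.3f}")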
vikp | 1 year ago
I think blockwise edit distance is better than full-page (find the ground truth blocks, then infer each block separately and compare), but many providers only do well on full pages, so judging them blockwise wouldn't be fair.
There are a few different benchmark types in the marker repo. None of them is perfect, but an LLM judging against a rubric has matched visual inspection the best so far.
I'll continue to iterate on the benchmarks. It may be possible to do a TEDS-like metric for markdown. Training a model on the output and then benchmarking it could also be interesting, but that gets away from measuring pure extraction quality (a model benchmarking better is only somewhat correlated with better parse quality). I haven't seen any great benchmarking of markdown quality, even at research labs - it's an open problem.
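A rough sketch of the two scoring modes (stdlib difflib stands in for a proper edit-distance ratio, and blocks are assumed to be pre-aligned 1:1 with ground truth):

    # Sketch: full-page vs. blockwise similarity scoring for OCR output.
    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        # Normalized similarity in [0, 1]; 1.0 means identical text.
        return SequenceMatcher(None, a, b).ratio()

    def full_page_score(pred_page: str, truth_page: str) -> float:
        # One ratio over the whole page; a shifted block hurts everything.
        return similarity(pred_page, truth_page)

    def blockwise_score(pred_blocks: list[str], truth_blocks: list[str]) -> float:
        # Average per-block ratio; errors stay localized to their block.
        pairs = zip(pred_blocks, truth_blocks)
        return sum(similarity(p, t) for p, t in pairs) / len(truth_blocks)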
arthurcolle | 1 year ago
to extract real data from unstructured text (like that produced by an LLM) to make benchmarks slightly easier if you have a schema
netdevphoenix | 1 year ago
Isn't that a potential issue? You're assuming the LLM judge is reliable. What evidence do you have to assure yourself and/or others that this is a reasonable assumption?
DeathArrow | 1 year ago
To fight hallucinations, can't we use more LLMs and pick blocks where the majority of LLMs agree?
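A minimal sketch of that voting idea (the normalization and example outputs are illustrative; exact-match voting on free text is brittle, so a real system would likely vote on normalized or fuzzy-matched blocks):

    # Sketch: keep a block's text only when a strict majority of models agree.
    from collections import Counter

    def normalize(text: str) -> str:
        # Collapse whitespace so trivial formatting doesn't split the vote.
        return " ".join(text.split())

    def majority_block(candidates: list[str]) -> str | None:
        # Return the reading a strict majority produced, else None.
        votes = Counter(normalize(c) for c in candidates)
        text, count = votes.most_common(1)[0]
        return text if count > len(candidates) / 2 else None

    # Two of three models agree, so their reading wins.
    outputs = ["Total: $1,250", "Total: $1,250", "Total: $1,2SO"]
    print(majority_block(outputs))  # -> "Total: $1,250"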