

You just need a robust benchmark. As long as you understand your benchmark, you can trust the results.

We have a hard OCR problem.

It's easy to build a high-confidence benchmark for an OCR problem: type out the ground truth by hand, then score model output against it with metrics like accuracy and token F1. That makes the benchmark easy to trust. And I'm talking about highly complex OCR that requires a heavyweight model.
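For anyone who wants something concrete, here is a rough sketch of what scoring against hand-typed ground truth could look like. This is not our actual harness. The function names (`token_f1`, `exact_match_accuracy`, `mean_token_f1`), the whitespace tokenization, and the choice of exact match as "accuracy" are all assumptions made for illustration.

```python
from collections import Counter


def token_f1(prediction: str, truth: str) -> float:
    """Token-level F1 between OCR output and hand-typed ground truth.

    Assumption: tokens are whitespace-separated and compared case-sensitively.
    """
    pred_tokens = prediction.split()
    true_tokens = truth.split()
    # Two empty strings agree perfectly; one empty side means no overlap.
    if not pred_tokens and not true_tokens:
        return 1.0
    if not pred_tokens or not true_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match_accuracy(predictions: list[str], truths: list[str]) -> float:
    """Fraction of documents whose output matches the ground truth exactly.

    Assumption: "accuracy" means exact match after stripping outer whitespace.
    """
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        return 0.0
    matches = sum(p.strip() == t.strip() for p, t in zip(predictions, truths))
    return matches / len(truths)


def mean_token_f1(predictions: list[str], truths: list[str]) -> float:
    """Average token F1 over the whole benchmark set."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        return 0.0
    return sum(token_f1(p, t) for p, t in zip(predictions, truths)) / len(truths)
```

Run each candidate model over the same set of images, feed its outputs and the hand-typed truths into `exact_match_accuracy` and `mean_token_f1`, and compare the numbers side by side. Exact match is strict, so token F1 is the more forgiving signal when a model gets most of a page right.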

On that benchmark, Scout (Meta), a very small/weak model, is outperforming Gemini Flash. This is highly unexpected and a huge cost saving.

Some problems aren't so easily benchmarked.
