item 43190115

hashta | 1 year ago

An effective way to increase accuracy is to use an ensemble of capable models that were trained independently (e.g., gemini, gpt-4o, qwen). If more than x% of them produce the same output, accept it; otherwise reject it and review manually.
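The voting scheme described above can be sketched in a few lines. This is a minimal illustration, not any particular production pipeline; the `threshold` value and the example strings are assumptions.

```python
from collections import Counter

def ensemble_accept(outputs, threshold=0.5):
    """Accept a transcription only if more than `threshold` (as a
    fraction) of the independently trained models agree on it;
    otherwise flag the page for manual review.

    `outputs` is one string per model (e.g. from gemini, gpt-4o,
    qwen). The threshold of 0.5 here is illustrative.
    """
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes / len(outputs) > threshold:
        return winner, "accepted"
    return None, "manual review"

# Two of three models agree: accepted.
print(ensemble_accept(["Lorem ipsum", "Lorem ipsum", "Lorem ipsvm"]))
# All three disagree: sent to manual review.
print(ensemble_accept(["Lorem", "Larem", "Lorern"]))
```

Exact string equality is a deliberately strict agreement test; in practice you might normalize whitespace or compare per-line before voting.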


rafram | 1 year ago

There’s a very low chance that three separate models will come up with the same result. There are always going to be errors, small or large. Even if you find a way around that, running the process three times on every page is going to be prohibitively expensive, especially if you want to fine-tune.

vintermann | 1 year ago

No, running it two or three times for every page isn't prohibitive. In fact, one of the arguments for using modern general-purpose multimodal models for historical HTR is that it is cheaper and faster than Transkribus.

What you can do, for instance, is ask one model for a transcription, then ask a second model to compare that transcription against the image and correct any errors it finds. You actually have a lot of budget to try things like this if the alternative is fine-tuning your own model.
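The transcribe-then-proofread scheme above is just a two-stage pipeline. A minimal sketch, with stand-in functions where the real multimodal model calls would go (the stub models and their fixed outputs are purely illustrative):

```python
def two_pass_transcription(image, transcribe, proofread):
    """Two-pass scheme: one model produces a draft transcription,
    a second model compares the draft against the image and returns
    a corrected version. `transcribe` and `proofread` are
    placeholders for calls to two different multimodal models."""
    draft = transcribe(image)
    return proofread(image, draft)

# Stub "models" for illustration: the proofreader catches one
# character-level confusion (v mistaken for u) in the draft.
first_model = lambda img: "Qvick brown fox"
second_model = lambda img, draft: draft.replace("Qvick", "Quick")

print(two_pass_transcription(b"<page image bytes>", first_model, second_model))
```

Because the second pass sees both the image and the draft, it is a cheaper call than a full independent transcription, which is part of why the budget argument works out.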

jjk166 | 1 year ago

The odds of them getting the same result for any given patch should be very high if it is the correct result and the models aren't garbage. The only cases where they don't get the same result are when at least one of them has made a mistake, and the odds of three different models making the same mistake should be low (unless it's something genuinely ambiguous, like 0 vs O in a random alphanumeric string).

Best 2 out of 3 should be far more reliable than any model on its own. You could even weight their responses for different types of content: say model B is consistently better on serif fonts, so its vote counts for 1.5 times as much as the votes of models A and C.
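The weighted vote suggested above might look like this. The model names, weights, and the 0-vs-O example are illustrative assumptions, not measured values:

```python
from collections import defaultdict

def weighted_vote(outputs, weights):
    """Weighted ensemble vote: each model's answer counts with its
    weight (default 1.0). `outputs` maps model name -> answer;
    `weights` maps model name -> vote weight, e.g. 1.5 for a model
    known to do better on the current font."""
    scores = defaultdict(float)
    for model, answer in outputs.items():
        scores[answer] += weights.get(model, 1.0)
    return max(scores, key=scores.get)

# Two models disagree on the ambiguous 0-vs-O case.
votes = {"A": "0", "B": "O"}
# With equal weights this is a tie; boosting model B to 1.5
# breaks the tie in its favor.
print(weighted_vote(votes, {"B": 1.5}))
```

When two unweighted models agree (total weight 2.0), a single 1.5-weight model can't override them, so the boost only decides otherwise-close calls, which is usually what you want.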