(no title)
fcoury
|
6 months ago
I really wanted this to be good. Unfortunately it converted a page that contained a table that is usually very hard for converters to properly convert and I got a full page with "! Picture 1:" and nothing else. On top of that, it hung at page 17 of a 25 page document and never resumed.
nawazgafar|6 months ago
threeducks|6 months ago
Here is the corresponding GitHub issue for your default model (Qwen2.5-VL):
https://github.com/QwenLM/Qwen2.5-VL/issues/241
You can mitigate the fallout of this repetition issue to some degree by chopping up each page into smaller pieces (paragraphs, tables, images, etc.) with a page layout model. Then at least only part of the text is broken instead of the entire page.
A better solution might be to train a model to estimate a heat map of character density for a page of text. Then, condition the vision-language model on character density by feeding the density to the vision encoder. Also output character coordinates, which can be used with the heat map to adjust token probabilities.