cle | 4 months ago

That will not work with many of the world's most important documents because of information density. Examples include dense tables, tables with lots of row/col spans, complex forms with checkboxes, and complex real-world formatting and features like strikethroughs.

To solve this generally you need to chunk not by page, but by semantic chunks that don't exceed the information density threshold of the model, given the task.
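A minimal sketch of that idea, assuming the document has already been parsed into semantic elements. The `Element` type, the `KIND_WEIGHT` table, and the `density` proxy are all hypothetical stand-ins; a real density measure would depend on the model and the task.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str  # e.g. "paragraph", "table", "form" (hypothetical labels)
    text: str


# Assumed weights: structured elements count as denser than plain prose.
KIND_WEIGHT = {"paragraph": 1.0, "table": 3.0, "form": 2.0}


def density(el: Element) -> float:
    """Crude proxy: whitespace-delimited token count scaled by element kind."""
    return len(el.text.split()) * KIND_WEIGHT.get(el.kind, 1.0)


def chunk_by_density(elements, budget):
    """Greedily pack consecutive elements so each chunk stays within budget.

    An element that exceeds the budget on its own is emitted as a
    single-element chunk and also returned in `oversized`, since this
    sketch does not know how to split it.
    """
    chunks, oversized = [], []
    current, used = [], 0.0
    for el in elements:
        cost = density(el)
        if cost > budget:
            # Flush whatever is pending, then isolate the oversized element.
            if current:
                chunks.append(current)
                current, used = [], 0.0
            chunks.append([el])
            oversized.append(el)
            continue
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], 0.0
        current.append(el)
        used += cost
    if current:
        chunks.append(current)
    return chunks, oversized
```

Page boundaries play no role here: a chunk ends only when adding the next element would exceed the budget. Anything that lands in `oversized` is the hard case.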

This is not a trivial problem at all. And sometimes there is no naive way to chunk documents so that every element can fit within the information density limit. A really simple example is a table that spans hundreds of pages. Solving that generally is an open problem.
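For illustration only, here is the naive approach for the long-table case: cut the table into row groups and repeat the header in each. `split_table` is a hypothetical helper, and it assumes a flat table with one header row and no row/col spans, which is exactly the assumption that breaks on real documents.

```python
def split_table(header, rows, max_rows):
    """Split a flat table into pieces of at most max_rows rows each,
    repeating the header row at the top of every piece.

    Assumes no cell spans multiple rows; a row span crossing a cut
    would lose its meaning in the later piece.
    """
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    return [
        [header] + rows[i:i + max_rows]
        for i in range(0, len(rows), max_rows)
    ]
```

Even when this applies, each piece loses whatever context depended on rows in other pieces (running totals, grouped sections, footnotes), so it is a partial workaround rather than a general solution.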
