ArnavAgrawal03 | 7 months ago

That's an interesting point. We've found that for most use cases, more than 5 pages of context is overkill. A small LLM conversion layer on top of images also works well: instead of direct OCR, pass batches of 5 images (if you really need that many) to smaller vision models and have them extract the most important points from the document.
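
A minimal sketch of that conversion layer, assuming a `call_vision_model` function as a hypothetical stand-in for whatever small vision model you'd actually call (the batch size of 5 is the one mentioned above):

```python
def batch_pages(pages, batch_size=5):
    """Split a list of page images into fixed-size batches."""
    return [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]

def call_vision_model(batch):
    # Placeholder: in practice this would send the image batch to a small
    # vision-language model and return the key points it extracts.
    return f"key points for {len(batch)} pages"

def condense_document(pages, batch_size=5):
    """Replace raw page images with per-batch key-point summaries,
    which then go into the main LLM prompt instead of the images."""
    return [call_vision_model(b) for b in batch_pages(pages, batch_size)]

# 12 page images become 3 summaries (batches of 5, 5, and 2 pages)
summaries = condense_document([f"page-{i}" for i in range(12)])
```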

We're currently researching surgery on the KV cache or attention maps to make larger batches of images work better in LLMs. Sliding-window attention or Infinite Retrieval seem like promising directions to go in.
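
As a toy illustration of the sliding-window idea (not the actual attention-map surgery being researched, which is far more involved): bound the KV cache to the most recent `window` entries, so arbitrarily many image tokens can stream through without the cache growing unboundedly.

```python
from collections import deque

class SlidingKVCache:
    """Keep only the most recent `window` key/value entries;
    older entries are evicted automatically."""

    def __init__(self, window):
        self.window = window
        self.entries = deque(maxlen=window)  # oldest evicted first

    def append(self, kv):
        self.entries.append(kv)

    def __len__(self):
        return len(self.entries)

cache = SlidingKVCache(window=4)
for token_kv in range(10):   # stream 10 token entries through
    cache.append(token_kv)
# only the 4 most recent entries remain in the cache
```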

Also - and this is speculation - I think that the jump in multimodal capabilities that we're seeing from models is only going to increase, meaning long-context for images is probably not going to be a huge blocker as models improve.

themanmaran | 7 months ago

This just depends a lot on how well you can pare down the context before passing it to an LLM.

Ex: reading contracts or legal documents. These are usually 50-page documents you can't effectively cherry-pick from, since different clauses or sections are referenced multiple times across the full document.

In these scenarios, it's almost always better to pass the full document into the LLM rather than running RAG. And if you're passing the full document, it's better to pass it as text rather than as images.
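
A rough sketch of that decision: if the whole document fits in the model's context window, send it as plain text; only fall back to retrieval when it doesn't. The 4-characters-per-token estimate and the 128k window are illustrative assumptions, not figures from the comment.

```python
def estimate_tokens(text):
    return len(text) // 4  # crude heuristic: ~4 characters per token

def choose_strategy(document_text, context_window=128_000, reserve=8_000):
    """Return 'full-document' when the text fits alongside the rest of
    the prompt (leaving `reserve` tokens spare), otherwise 'rag'."""
    if estimate_tokens(document_text) <= context_window - reserve:
        return "full-document"
    return "rag"

# A ~50-page contract (~150k chars, ~37k tokens) fits comfortably:
print(choose_strategy("x" * 150_000))  # full-document
```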