(no title)
ArnavAgrawal03 | 7 months ago
We're currently researching surgery on the KV cache or attention maps so that LLMs handle larger batches of images better. Sliding-window attention or Infinite Retrieval seem like promising directions to explore.
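For anyone unfamiliar, here's a rough sketch of the sliding-window idea (a toy illustration, not our actual implementation): each query token only attends to the last `window` key positions, so attention cost and cache growth stay bounded as the sequence gets long.

    # Toy sliding-window attention sketch: causal mask restricted to
    # the last `window` positions per query token.
    import torch

    def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
        # True where attention is allowed: causal AND within the window.
        idx = torch.arange(seq_len)
        causal = idx[None, :] <= idx[:, None]        # key index <= query index
        near = idx[:, None] - idx[None, :] < window  # key within window of query
        return causal & near

    def sliding_window_attention(q, k, v, window: int):
        # q, k, v: (seq_len, d). Standard scaled dot-product attention,
        # with disallowed positions masked to -inf before the softmax.
        scores = q @ k.T / (q.shape[-1] ** 0.5)
        mask = sliding_window_mask(q.shape[0], window)
        scores = scores.masked_fill(~mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

    # Example: 8 tokens, window of 3 -- each token sees at most 3 past positions.
    q = k = v = torch.randn(8, 16)
    out = sliding_window_attention(q, k, v, window=3)
    print(out.shape)  # torch.Size([8, 16])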
Also - and this is speculation - I think the jumps in multimodal capability we're seeing from models are only going to keep coming, meaning long context for images probably won't be a huge blocker as models improve.
themanmaran | 7 months ago
Ex: reading contracts or legal documents. Usually it's a 50-page document that you can't effectively cherry-pick from, since different clauses and sections are referenced multiple times across the full document.
In these scenarios, it's almost always better to pass the full document to the LLM rather than running RAG. And if you're passing the full document, it's better as text than as images.
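As a toy illustration of that decision (token counting via tiktoken; the 128k budget is just an assumed context limit, not any particular model's): if the whole contract fits in the context window, send it as plain text so cross-references between clauses resolve naturally; only fall back to retrieval when it doesn't fit.

    # Toy sketch of the full-context-vs-RAG decision described above.
    import tiktoken

    MAX_CONTEXT_TOKENS = 128_000  # assumed context budget for the target model

    def choose_strategy(document: str) -> str:
        enc = tiktoken.get_encoding("cl100k_base")
        n_tokens = len(enc.encode(document))
        # Full context preserves cross-references between clauses; RAG risks
        # retrieving a clause without the sections that reference it.
        return "full-context" if n_tokens <= MAX_CONTEXT_TOKENS else "rag"

    print(choose_strategy("This Agreement is made between ... " * 1000))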