(no title)
olup | 1 year ago
Having developed a couple of page to markdown myself, I think the bigger challenge is to make sense of so many pages that rely on spacial organisation of information that only makes sense to human, or even presence of images. One way to do it is to render the page as an image and extract data with a vision llm. But you do need heuristic on when to do classic extraction and when to use vision, plus get rid of cookie banner and overlays. This is more complex and costly, but have real business value, for the one that can pull it off.
sachou|1 year ago
Right now adding chunk size, model for embedding, what else?
Image is a great challenge with OCR can be solve as you mentioned
olup|1 year ago
If I were in your shoes I would not think embedding and inserting in a vector store would be my responsibility, especially since there are so many different stores on the market.