evrydayhustling | 1 year ago
> ... the vectors truly capture the semantic content contained in the screenshots. This robustness is due to the model’s unique approach of processing all input modalities through the same backbone.
With that said, I think this benchmark is a pretty narrow way of thinking about multi-modal embedding. Having text embed close to images of related text is cool and convenient, but it doesn't necessarily extend to other notions of related visual expression (e.g., the word "rabbit" vs. a photo of a rabbit). And on the narrow goal of indexing document images, I suspect other techniques could also work quite well.
This seems like a great opportunity for a new benchmark dataset with multi-modal concept representations beyond media-of-text.
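To make the rabbit example concrete, here's a rough sketch of the two notions of similarity using the open clip-ViT-B-32 checkpoint via sentence-transformers (my choice, not whatever model the article describes; the photo path is a placeholder). It compares a text query against (a) a rendered image of the word itself, the "media-of-text" case the benchmark rewards, and (b) an actual photo, the harder cross-modal case:

```python
from PIL import Image, ImageDraw
from sentence_transformers import SentenceTransformer, util

# Shared text/image backbone (assumed model choice, not from the article)
model = SentenceTransformer("clip-ViT-B-32")

# (a) "media-of-text": render the word itself into an image
text_img = Image.new("RGB", (224, 224), "white")
ImageDraw.Draw(text_img).text((60, 100), "rabbit", fill="black")

# (b) an actual photo of a rabbit (placeholder path)
photo = Image.open("rabbit_photo.jpg")

query = model.encode("a rabbit", convert_to_tensor=True)
emb_text_img = model.encode(text_img, convert_to_tensor=True)
emb_photo = model.encode(photo, convert_to_tensor=True)

print("text vs rendered word:", util.cos_sim(query, emb_text_img).item())
print("text vs photo:        ", util.cos_sim(query, emb_photo).item())
```

A benchmark built only from document screenshots measures the first score; a broader multi-modal benchmark would need ground truth for the second kind of pair.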