
Show HN: Multimodal Search over the National Gallery of Art

2 points | Beefin | 2 months ago | mxp.co

We indexed 120K images from the National Gallery of Art for visual search. Text queries, image uploads, and "find similar" all in one retriever, fused with RRF.

Demo: https://mxp.co/r/nga

Stack: SigLIP (768-dim embeddings), Ray on 2× L4 GPUs, Qdrant. ~2 hours to process, <100ms queries.
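
To make the indexing side concrete, here's a minimal single-process sketch (not the author's Ray pipeline). It assumes the Hugging Face google/siglip-base-patch16-224 checkpoint, which happens to produce 768-dim embeddings like the post describes, and a local Qdrant instance; the collection name "nga" and payload fields are made up:

    import torch
    import torch.nn.functional as F
    from PIL import Image
    from transformers import AutoModel, AutoProcessor
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams, PointStruct

    model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
    processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

    client = QdrantClient("localhost", port=6333)
    client.recreate_collection(
        collection_name="nga",
        vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    )

    def index_image(doc_id: int, path: str) -> None:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            emb = model.get_image_features(**inputs)   # shape (1, 768)
        emb = F.normalize(emb, dim=-1)[0].tolist()     # unit-normalize for cosine kNN
        client.upsert(
            collection_name="nga",
            points=[PointStruct(id=doc_id, vector=emb, payload={"path": path})],
        )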

Why SigLIP over CLIP: SigLIP's sigmoid loss scores each image-text pair independently, where CLIP's softmax normalizes over the batch. That puts embeddings in a globally consistent semantic space: similarity scores stay comparable at scale instead of being batch-relative.
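
A toy torch comparison of the two losses (not from the post; the 1/0.07 temperature and SigLIP's t=10, b=-10 are just the papers' initialization values, used here for illustration):

    import torch
    import torch.nn.functional as F

    img = F.normalize(torch.randn(8, 768), dim=-1)   # a batch of 8 image embeddings
    txt = F.normalize(torch.randn(8, 768), dim=-1)   # their paired text embeddings
    logits = img @ txt.T                             # pairwise similarities

    # CLIP-style: softmax over the batch, so a pair's probability shifts
    # whenever any other item in the batch changes.
    clip_loss = F.cross_entropy(logits / 0.07, torch.arange(8))

    # SigLIP-style: every (i, j) pair is an independent binary decision,
    # so a pair's score doesn't depend on the rest of the batch.
    labels = 2 * torch.eye(8) - 1                    # +1 matched pair, -1 otherwise
    siglip_loss = -F.logsigmoid(labels * (10.0 * logits - 10.0)).mean()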

The interesting part is the retriever. One stage, three optional inputs:

- text → encode → kNN
- image → encode → kNN
- document_id → lookup stored embedding → kNN

Pass any combination. If more than one input is given, fuse the ranked lists with reciprocal rank fusion (RRF). No score normalization needed, since RRF only cares about rank position.
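
A sketch of the one-stage retriever plus RRF, reusing the client/model/processor from the indexing sketch above. Function names are assumptions, and k=60 is the constant from the original RRF paper, not necessarily what the author uses:

    from collections import defaultdict
    import torch
    import torch.nn.functional as F

    def rrf_fuse(ranked_lists, k=60):
        # score(d) = sum over lists of 1 / (k + rank of d), rank is 1-based
        scores = defaultdict(float)
        for ranked in ranked_lists:
            for rank, doc_id in enumerate(ranked, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    def knn(vector, limit=20):
        hits = client.search(collection_name="nga", query_vector=vector, limit=limit)
        return [h.id for h in hits]

    def retrieve(text=None, image=None, document_id=None):
        lists = []
        with torch.no_grad():
            if text is not None:        # text -> encode -> kNN
                t = processor(text=[text], padding="max_length", return_tensors="pt")
                lists.append(knn(F.normalize(model.get_text_features(**t), dim=-1)[0].tolist()))
            if image is not None:       # image -> encode -> kNN
                i = processor(images=image, return_tensors="pt")
                lists.append(knn(F.normalize(model.get_image_features(**i), dim=-1)[0].tolist()))
        if document_id is not None:     # lookup stored embedding -> kNN
            point = client.retrieve("nga", ids=[document_id], with_vectors=True)[0]
            lists.append(knn(point.vector))
        return lists[0] if len(lists) == 1 else rrf_fuse(lists)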

Killer query: pass a document_id + text like "but wearing blue." RRF combines structural similarity with the text constraint.
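
With the sketch above, that query is a single call (the id 42 is hypothetical):

    # rank by visual similarity to painting 42, fused with the text constraint
    results = retrieve(document_id=42, text="but wearing blue")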

Blog with full config: https://mixpeek.com/blog/visual-search-rrf/
