iDon | 2 years ago

Extending that idea to the web, or at least to the blogosphere and information / knowledge websites, seems useful. I wonder if there is a web service which has calculated vector embeddings for some of the web and supports vector search, e.g. given a URL, find URLs with similar embeddings. Inverting that, websites could annotate their web pages with embeddings via JSON-LD, which search engines could utilise. Both these ideas might be impractical, e.g. the cost of an HTTP GET of the vector might be similar to the cost of calculating the embedding, and the embedding would only be comparable with embeddings from the same model (which would be recorded in the JSON-LD), so it would age quickly. It would also be subject to SEO gaming, like meta tags.
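To make the two ideas concrete, here is a minimal sketch in Python. It assumes a made-up JSON-LD vocabulary (the `emb:` prefix and the example.org URLs are hypothetical; I'm not aware of a standard term for this) and uses tiny 3-dimensional vectors in place of real embeddings. The search skips any annotation whose recorded model differs from the query's, since those vectors would not be comparable.

```python
import json
import math

# Hypothetical vocabulary: the "emb:" terms are invented for illustration.
EMBEDDING_CONTEXT = {"emb": "https://example.org/embedding-vocab#"}


def annotate(url, model, vector):
    """Build a JSON-LD-style annotation recording the vector and its model."""
    return {
        "@context": EMBEDDING_CONTEXT,
        "@id": url,
        "emb:model": model,
        "emb:vector": list(vector),
    }


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_urls(query, annotations, top_k=3):
    """Rank other pages by similarity, ignoring embeddings from other models."""
    scored = [
        (cosine(query["emb:vector"], other["emb:vector"]), other["@id"])
        for other in annotations
        if other["@id"] != query["@id"]
        and other["emb:model"] == query["emb:model"]
        and len(other["emb:vector"]) == len(query["emb:vector"])
    ]
    scored.sort(reverse=True)
    return [url for _, url in scored[:top_k]]


# Toy data: page "d" was embedded with a different (hypothetical) model.
pages = [
    annotate("https://example.org/a", "model-v1", [1.0, 0.0, 0.0]),
    annotate("https://example.org/b", "model-v1", [0.9, 0.1, 0.0]),
    annotate("https://example.org/c", "model-v1", [0.0, 1.0, 0.0]),
    annotate("https://example.org/d", "model-v2", [1.0, 0.0, 0.0]),
]

print(json.dumps(pages[0], indent=2))
print(similar_urls(pages[0], pages))
```

Page "d" never shows up in the results even though its numbers match "a" exactly, which is the "ages quickly" problem in miniature: once the model changes, the old annotations are dead weight.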

A quick search didn't find either of these; the closest was this paper, which used JSON-LD to record a vector reduced to 2 dimensions using t-SNE: https://hajirajabeen.github.io/publications/Metadata_for_Eme... ("Metadata standards for the FAIR sharing of vector embeddings in Biomedicine", Şenay Kafkas et al.)

tomhazledine|2 years ago

Yeah - it's a great idea. The size of the embeddings is the big restricting factor IMO. Even with my approach of embedding the entire article, my embeddings index was about the same size as my "regular" search index.

Once you start increasing the granularity of what you're embedding (either by paragraph or sentence) then the old-fashioned search index has a big advantage.
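A rough back-of-the-envelope sketch of why granularity hurts, in Python. All the numbers are assumptions for illustration (100 articles, 1536-dimensional float32 vectors, 20 paragraphs per article, 5 sentences per paragraph), not measurements from my site, and it only counts the raw vectors, not any index structure built on top of them.

```python
def index_bytes(n_vectors, dims, bytes_per_value=4):
    """Raw storage for n_vectors embeddings of `dims` values each (float32 by default)."""
    return n_vectors * dims * bytes_per_value


# Assumed corpus shape, purely illustrative.
ARTICLES = 100
DIMS = 1536
PARAGRAPHS_PER_ARTICLE = 20
SENTENCES_PER_PARAGRAPH = 5

granularities = {
    "per article": ARTICLES,
    "per paragraph": ARTICLES * PARAGRAPHS_PER_ARTICLE,
    "per sentence": ARTICLES * PARAGRAPHS_PER_ARTICLE * SENTENCES_PER_PARAGRAPH,
}

for label, n_vectors in granularities.items():
    megabytes = index_bytes(n_vectors, DIMS) / 1e6
    print(f"{label}: {n_vectors} vectors, {megabytes:.1f} MB")
```

The raw vector storage scales linearly with the number of chunks, while the text being indexed stays the same size, which is where a conventional search index pulls ahead.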

Might be worth it in some scenarios because of the quality of the results. I bet there are places where an embedding search would be more effective by orders of magnitude.

jasonjmcghee|2 years ago

I’m very interested in this. Specifically, I don’t want it to be a web service, I want it to be in a CDN.

Took a crack at building “vector-store-in-a-cdn” last weekend.

https://github.com/jasonjmcghee/portable-hnsw

flir|2 years ago

> It would also be subject to SEO gaming, like meta tags.

Would be easy to mark a vector-providing site as a bad actor, though? Re-run a few of their pages; if you come up with different answers, don't trust them.
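Something like this spot check, sketched in Python. The names are hypothetical: `published` is the set of vectors the site claims, and `recompute` stands in for fetching a page and running it through the same model yourself. I'm assuming a cosine-similarity threshold rather than exact equality, since recomputed vectors may differ slightly from the published ones; the 0.99 cut-off is an arbitrary illustration.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def looks_honest(published, recompute, sample_size=3, min_similarity=0.99, seed=0):
    """Re-embed a random sample of pages and compare with the site's claimed vectors.

    published: dict mapping URL -> vector the site published.
    recompute: callable taking a URL and returning the vector we compute ourselves.
    """
    rng = random.Random(seed)
    urls = sorted(published)
    sample = rng.sample(urls, min(sample_size, len(urls)))
    return all(
        cosine(published[url], recompute(url)) >= min_similarity for url in sample
    )


# Toy stand-in for "run the page through the model ourselves".
true_vectors = {
    "https://example.org/a": [1.0, 0.0],
    "https://example.org/b": [0.0, 1.0],
}
recompute = true_vectors.__getitem__

honest_site = dict(true_vectors)
# This site claims page "b" is about the same thing as page "a".
gaming_site = {**true_vectors, "https://example.org/b": [1.0, 0.0]}

print(looks_honest(honest_site, recompute, sample_size=2))
print(looks_honest(gaming_site, recompute, sample_size=2))
```

The catch is that the checker still pays the embedding cost for every sampled page, so this only helps if a small sample is enough to catch cheaters.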