xfalcox | 3 months ago

I was taken aback when I saw what was basically zero recall loss in the real-world task of finding related topics, by doing the same thing you described: we over-capture with binary embeddings and only use the full (or half) precision vectors on that subset.

Making the index 32 times smaller to store is what lets us offer this at scale without worrying too much about the overhead.
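
For anyone who wants to try the over-capture-then-rescore idea, here is a rough sketch rather than anyone's production setup. It assumes float32 vectors, sign-at-zero binarization, Hamming distance for the first pass, and a dot product for the rescore. The names `binarize`, `hamming`, `search` and the `oversample` factor are made up for illustration, and the data is random, not a related-topics workload.

```python
import numpy as np

def binarize(vectors):
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    return np.packbits(vectors > 0, axis=-1)

def hamming(query_bits, index_bits):
    """Hamming distance from one packed query to every packed row."""
    xor = np.bitwise_xor(index_bits, query_bits)
    return np.unpackbits(xor, axis=-1).sum(axis=-1)

def search(query, full, bits, k=10, oversample=4):
    """Over-capture k * oversample candidates by Hamming distance, then rescore."""
    n_candidates = min(k * oversample, len(full))
    dists = hamming(binarize(query), bits)
    # Cheap first pass on the 1-bit index (candidates come back unordered).
    candidates = np.argpartition(dists, n_candidates - 1)[:n_candidates]
    # Second pass: full-precision dot product on the small subset only.
    scores = full[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]

# Toy data: 2000 unit-length float32 vectors with 256 dimensions.
rng = np.random.default_rng(0)
full = rng.standard_normal((2000, 256)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)
bits = binarize(full)

# float32 spends 32 bits per dimension, the binary index spends 1.
storage_ratio = full.nbytes / bits.nbytes

query = full[0]
top = search(query, full, bits, k=5)
```

Only the packed `bits` array needs to be in the hot index; the full-precision rows are fetched for the handful of candidates that survive the first pass. How large `oversample` has to be to keep recall is something to measure on your own data.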

Someone|3 months ago

> I was taken aback when I saw what was basically zero recall loss in the real-world task of finding related topics

By moving the values to a single bit, you’re lumping stuff together that was different before, so I don’t think recall loss would be expected.

Also: even if your vector is only 100-dimensional, there already are 2^100 different bit vectors. That’s over 10^30.

If your dataset isn’t gigantic and has documents that are even moderately dispersed in that space, the likelihood of having many with the same bit vector isn’t large.
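
To put rough numbers on that, here is a back-of-the-envelope check. It assumes every bit is an independent fair coin flip, which real embeddings will not satisfy, so treat it as the best case for dispersion. The helper `expected_colliding_pairs` is just an illustrative name.

```python
from fractions import Fraction

def expected_colliding_pairs(n_docs, n_bits):
    """Expected number of document pairs that share the exact same bit vector,
    assuming each of the 2**n_bits vectors is equally likely."""
    pairs = n_docs * (n_docs - 1) // 2
    return Fraction(pairs, 2 ** n_bits)

# 2^100 really is over 10^30.
space_is_over_1e30 = 2 ** 100 > 10 ** 30

# Even a billion documents give far less than one expected exact collision.
collisions_at_1e9 = expected_colliding_pairs(10 ** 9, 100)
print(space_is_over_1e30, float(collisions_at_1e9))
```

Correlated dimensions shrink the effective number of bits, so the real collision rate depends on how dispersed the documents actually are.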

barrkel|3 months ago

And if dispersion isn't good, it would be worthwhile running the vectors through another model trained to disperse them.