top | item 37567909

If I understand correctly, storing only 4096 bytes of data per term means that a query with multiple terms will often intersect to few or no results. The purpose seems to be to cut cost at the expense of completeness.

marginalia_nu|2 years ago

Yeah. That seems like a design decision that will scale poorly. For reference, even in my dinky 100M index I have individual terms with several gigabytes of associated document references.

In general, hash-table-based index designs tend not to be very efficient. If you use a skip list or something similar, you can calculate the intersection between sets in sublinear time.
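To make the sublinear claim concrete, here is a rough sketch. It is not a real skip list: it assumes each posting list is a sorted Python list of document ids and uses galloping (exponential) search as the "something similar". The names `gallop` and `intersect` are made up for illustration.

```python
from bisect import bisect_left


def gallop(postings, target, lo):
    """Return the first index >= lo whose value is >= target.

    Assumes `postings` is sorted ascending and everything before `lo`
    is already known to be smaller than `target`.
    """
    step = 1
    hi = lo
    # Jump ahead in doubling steps until we overshoot the target.
    while hi < len(postings) and postings[hi] < target:
        lo = hi + 1
        hi += step
        step *= 2
    # Binary search inside the last window we jumped over.
    return bisect_left(postings, target, lo, min(hi, len(postings)))


def intersect(a, b):
    """Intersect two sorted lists of doc ids, probing the longer one."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    result = []
    pos = 0
    for doc_id in short:
        pos = gallop(long_, doc_id, pos)
        if pos == len(long_):
            break  # nothing left in the long list
        if long_[pos] == doc_id:
            result.append(doc_id)
            pos += 1
    return result


rare = [3, 50, 99, 998, 999]
common = list(range(0, 1000, 3))  # every multiple of 3 below 1000
print(intersect(rare, common))
```

When one list is much shorter than the other, the work is roughly proportional to the short list times a logarithmic factor, so most of the long list is never touched.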

daoudc|2 years ago

We actually just take the union and then re-rank. Because the lists are all small, this is cheap.
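A minimal sketch of what "union and then re-rank" could look like. The index layout (term mapped to a short list of `(doc_id, weight)` pairs) and the scoring rule (sum of weights) are assumptions made for illustration, not the actual ranker; `union_and_rerank` is a hypothetical name.

```python
from collections import defaultdict


def union_and_rerank(query_terms, index, top_k=10):
    """Union the posting lists of all query terms, then sort by score.

    `index` maps a term to a small list of (doc_id, weight) pairs.
    The score here is simply the sum of weights, which is an assumption.
    """
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            scores[doc_id] += weight
    # Highest score first; ties broken by doc id for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


index = {
    "rust": [(1, 0.5), (2, 0.2)],
    "search": [(2, 0.4), (3, 0.3)],
}
for doc_id, score in union_and_rerank(["rust", "search"], index):
    print(doc_id, round(score, 2))
```

With short lists the cost is linear in the total number of postings touched, and a document that matches only one term still shows up, just lower in the ranking.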

daoudc|2 years ago

Yes, you're correct on the purpose. We mitigate it a little by also indexing on bigrams.
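For anyone wondering what indexing on bigrams might mean in practice, here is a small sketch. The tokenization (lowercase, split on whitespace) and the helper name `index_keys` are assumptions; the real tokenizer is not described here.

```python
def index_keys(text):
    """Return the keys a document or query would be indexed under:
    every single token plus every pair of adjacent tokens."""
    tokens = text.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


print(index_keys("New York pizza"))
```

A query like "new york" then gets its own posting list, so it does not rely on the truncated lists for "new" and "york" happening to overlap.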