(no title)
krka | 2 years ago
The idea is that every entry has a natural ideal position in the table, and the displacement is how far from that position it is actually placed. If two entries are competing for the same slot, the logic will prefer displacements of (1, 2) over (0, 3), so if you observe an entry with displacement 5 after already having visited 5 entries you can already conclude you won't find what you are looking for. However, I am not sure this quick exit is actually implemented.
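Roughly, the quick-exit lookup could look like this (a sketch of the idea, not Sparkey's actual code; `table`, `lookup`, and `hash_fn` are my names):

```python
# Robin Hood lookup with early exit. `table` is a list of
# (key, value) pairs or None for empty slots.
def lookup(table, key, hash_fn):
    size = len(table)
    home = hash_fn(key) % size  # the key's natural slot
    for probes in range(size):
        slot = (home + probes) % size
        entry = table[slot]
        if entry is None:
            return None  # empty slot: the key was never inserted
        k, v = entry
        if k == key:
            return v
        # Displacement of the occupant from ITS natural slot.
        occupant_disp = (slot - hash_fn(k) % size) % size
        if occupant_disp < probes:
            # Robin Hood invariant: had our key been inserted, it
            # would have evicted this less-displaced occupant.
            return None
    return None
```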
I think the index size is not a power of two. That was something I considered, but it would greatly affect the size. At the time of building it, spinning disks were still common, so optimizing for space compactness and minimizing page faults/IOPS was more important. I think it would be at most one division per lookup, and the memory/disk reads would dominate the work anyway.
Deletions are implemented as a special tombstone entry in the log file which then causes a removal from the index slot as well. There is also a naive implementation of a compaction flow (go through the log, build the index, regenerate a new log by looking at the index, then rewrite the index).
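In pseudocode, the tombstone replay plus compaction flow amounts to something like this (a minimal sketch under the assumption that the log is a sequence of key/value records, with `None` standing in for a tombstone; not Sparkey's actual on-disk format):

```python
# Compact an append-only log: replay it into an index, then emit
# a fresh log containing only the live entries.
def compact(log):
    # Pass 1: build the in-memory index from the log.
    index = {}
    for key, value in log:
        if value is None:
            index.pop(key, None)  # tombstone removes the entry
        else:
            index[key] = value    # later writes overwrite earlier ones
    # Pass 2: regenerate a new log by walking the index.
    return [(k, v) for k, v in index.items()]
```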
teo_zero | 2 years ago
> if you observe an entry with displacement 5 after already having visited 5 entries you can already conclude you won't find what you are looking for.
Exactly. If I have already visited 5 slots without finding what I was looking for, and I'm examining the 6th one, where my target would then be "off by 6" from its natural place, and I find it filled with something that is off by 5 or less from ITS natural place, then I can abort the search. My target can never have been inserted: if it had been inserted and arrived at this slot, the algorithm would have evicted the former occupant (a smaller displacement means wealthier in the Robin Hood metaphor, and so prone to being robbed of the slot).
To summarize, I can abort the search when I find an entry with a displacement lower than my target would have in the same position. But the docs say higher. Hence my puzzlement.
> I think it would be at most one division per lookup
It's one division per visited slot, because in each slot I need to calculate the displacement of the entry that I find there. And to calculate the displacement I need to know the "natural" slot, which I calculate as hash % size.
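Concretely, that per-slot calculation is (the function name is mine, just to make the arithmetic explicit; the wrap-around `% size` handles entries pushed past the end of the table):

```python
# The one division (modulo) needed at each visited slot: how far
# the occupant at `slot` sits from its natural slot, hash % size.
def displacement(slot, key_hash, size):
    return (slot - key_hash % size) % size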
> Deletions are implemented as a special tombstone entry in the log file
Agreed.
> which then causes a removal from the index slot as well.
That's the part that I'm missing. If you simply remove it without rearranging the neighboring slots, the nice properties we discussed before become void.
krka | 2 years ago
For deletions, neighbour entries would bubble up into their correct slots to preserve that property.
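That bubbling is often called backward-shift deletion. A sketch of the idea (my names, not Sparkey's code), using the same `(key, value)`-or-`None` table layout as above:

```python
# Delete the entry at `slot`, then shift displaced successors back
# by one so the Robin Hood invariant is preserved.
def delete(table, slot, hash_fn):
    size = len(table)
    table[slot] = None
    nxt = (slot + 1) % size
    while table[nxt] is not None:
        k, v = table[nxt]
        if (nxt - hash_fn(k) % size) % size == 0:
            break  # entry already in its natural slot; stop shifting
        # Move the entry one slot closer to its natural position.
        table[slot] = (k, v)
        table[nxt] = None
        slot, nxt = nxt, (nxt + 1) % size
```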