Something I've always wondered is if there is any benefit in tuning these tree-based indices for SSDs. Would you make them deeper and narrower? Or do you usually just aim to load your indices in main mem anyway?
There is, but it isn’t what I would call intuitive. Traditional tree-based indexing algorithms embed a diverse set of assumptions about the nature of the hardware that are no longer true in modern systems. SSDs are also quasi-sequential devices in important ways that need to be respected for performance reasons. The design problem is that storage density and bandwidth have increased massively, and traditional indexing algorithms are poorly suited to that environment.
The relative memory wastefulness of many indexing algorithms also means that you can’t load the indexes into memory on machines with high storage density, they literally won’t fit. This leads to a page fault cascade in practice that causes a very rapid degradation of performance beyond a relatively ordinary scale.
Modern indexing algorithms (not the ones in the article) give up a small amount of selectivity and balance in exchange for reducing the memory footprint of indexes by multiple orders of magnitude and greatly increasing write scalability. Given the storage bandwidth available in modern systems, that is a big win in terms of throughput. Balanced trees are generally an anachronism in current designs; it doesn’t matter how unbalanced an index tree is if it fits in CPU cache while searching countless terabytes of data.
20 years ago, I was using all of the indexing structures in the article for high-performance systems. They started going out of style about 10 years ago and I honestly can’t remember the last time I worked on a new system that implemented them. That transition happened slowly and then all at once.
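For concreteness, one family of designs in this spirit is the block-range / zone-map index (what Postgres calls BRIN) — a hedged illustration of the tradeoff described above, not necessarily the specific algorithms the parent has in mind. It keeps only a (min, max) summary per block of rows, so the whole index is tiny enough to stay cache-resident, at the cost of coarser selectivity:

```python
class ZoneMapIndex:
    """Block-range ("zone map") index sketch: one (min, max) per block.

    Trades selectivity for size: for N rows and block size B the index
    holds only N/B entries, so it can stay resident in CPU cache even
    when the underlying data is vastly larger.
    """

    def __init__(self, block_size=8192):
        self.block_size = block_size
        self.zones = []       # (min_key, max_key) summary per sealed block
        self.blocks = []      # the row blocks themselves (on disk in reality)
        self._current = []    # block currently being filled

    def append(self, key):
        # Append-friendly: no rebalancing, just seal blocks as they fill.
        self._current.append(key)
        if len(self._current) == self.block_size:
            self._seal()

    def _seal(self):
        if self._current:
            self.zones.append((min(self._current), max(self._current)))
            self.blocks.append(self._current)
            self._current = []

    def candidate_blocks(self, lo, hi):
        # Coarse lookup: block indices whose key range overlaps [lo, hi].
        # Every qualifying row is in one of these blocks, but some blocks
        # may contain no matches -- the selectivity that was given up.
        self._seal()
        return [i for i, (zmin, zmax) in enumerate(self.zones)
                if zmin <= hi and zmax >= lo]
```

A range query scans only the candidate blocks and filters rows there; the unbalanced-tree point above applies the same way — the summary structure is what must fit in cache, not the data.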
I've wondered similarly. I don't have answers for you, but to add more questions: Are there tree structures that are more amenable to SSDs and the associated write amplification and other perf penalties? Could "hard-linking" in the SSD facilitate covered indexes? Could bitwise operations be performed on the SSD itself?
renewiltord|4 years ago
jandrewrogers|4 years ago
phamilton|4 years ago
Edit: I found some exploration: https://www.usenix.org/conference/osdi14/technical-sessions/...
kilodeca|4 years ago
> How many engineers does it take to make subscripting work?
My question: how many engineers does it take to get a horizontal scrollbar on overflow?
RonaldOlzheim|4 years ago
I also noticed that 'width tech' is the only term here with no verb before 'tech' and no specific description.
In short:
I think we should call 'width tech' the ability to change pace, change quality, and change the amount of code written.
Why: because issues like these make or break tech companies.
What do you think?
bob1029|4 years ago
binarybanana|4 years ago
Oh yes!
Probably my favorite data structure. In the past I've used them instead of arrays for performance (latency) reasons when constructing ordered sets from randomly distributed elements that came in multiple batches. Insertion into a splay tree meant that no sorting step at the end was necessary and the overhead was lower than first collecting everything into an array and only sorting after all elements were inserted.
Their amortized performance characteristics in general are close to ideal and the self-tuning character is intriguing. The only downside is the possibility of getting into a degenerate state, but that can be avoided by adding some randomness. Some paper I read even suggested that splaying only occasionally actually improves performance.
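The batch-insert pattern described above can be sketched with a classic top-down splay tree (a minimal Python sketch in the style of Sleator and Tarjan's algorithm, not the poster's actual code): each insert splays the new key to the root, and a final in-order walk emits the whole set already sorted, so no separate sort pass is needed.

```python
class Node:
    __slots__ = ("key", "left", "right")

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def splay(root, key):
    # Top-down splay: brings the node with `key` (or the last node
    # touched on the search path) to the root.
    if root is None:
        return None
    header = Node(None)        # temporary header; its left/right hold the split trees
    left = right = header
    t = root
    while True:
        if key < t.key:
            if t.left is None:
                break
            if key < t.left.key:          # zig-zig: rotate right
                y = t.left
                t.left = y.right
                y.right = t
                t = y
                if t.left is None:
                    break
            right.left = t                # link right
            right = t
            t = t.left
        elif key > t.key:
            if t.right is None:
                break
            if key > t.right.key:         # zag-zag: rotate left
                y = t.right
                t.right = y.left
                y.left = t
                t = y
                if t.right is None:
                    break
            left.right = t                # link left
            left = t
            t = t.right
        else:
            break
    left.right = t.left                   # reassemble the three trees
    right.left = t.right
    t.left = header.right
    t.right = header.left
    return t

def insert(root, key):
    # Splay `key`'s neighborhood to the root, then split around it.
    if root is None:
        return Node(key)
    root = splay(root, key)
    if key == root.key:
        return root                       # ignore duplicates
    node = Node(key)
    if key < root.key:
        node.left = root.left
        node.right = root
        root.left = None
    else:
        node.right = root.right
        node.left = root
        root.right = None
    return node

def inorder(root):
    # Iterative in-order walk: yields keys in sorted order, so no
    # separate sort step is needed after the batch inserts.
    stack, out = [], []
    while stack or root:
        while root:
            stack.append(root)
            root = root.left
        root = stack.pop()
        out.append(root.key)
        root = root.right
    return out
```

Recently-inserted keys sit near the root, which is exactly the self-tuning behavior mentioned above: batches with any locality splay cheaply, and the sorted result falls out of the traversal for free.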