item 9808871

The Ubiquitous B-Tree (1979) [pdf]

66 points | robinson_k | 10 years ago | wwwold.cs.umd.edu

10 comments


craigching|10 years ago

Both times before this was posted it garnered 0 comments:

https://news.ycombinator.com/item?id=4317545

https://news.ycombinator.com/item?id=7243568

When I read this I immediately recognized it as a PDF that I have in GoodReader (where I store all my technical PDFs). So I'm curious if it will start a good conversation about b-trees this time :)

To get the conversation started: a couple of years ago I decided to write a b-tree data structure to refresh my skills a bit. I wrote it in C since I'd been doing a lot of Java for the last 10 years or so. I only got as far as an in-memory b-tree and had started working on storing it to disk before I abandoned the project (I think I was working on it over the Christmas holidays, and the holidays ended).

I subsequently read that in-memory b-trees can actually outperform other tree data structures like rb-trees due to cache coherency. I never bothered to performance test my b-tree vs. an rb-tree I also wrote for fun.
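The cache argument comes down to node layout: a b-tree node packs many contiguous keys into a few cache lines, while a red-black tree node carries one key plus several pointers. A minimal sketch (hypothetical fanout and field names, not anyone's actual implementation) of the cache-friendly search:

```cpp
#include <cassert>
#include <cstdint>

// A red-black tree node: one key, three pointers, one color bit.
struct RBNode { int64_t key; RBNode *left, *right, *parent; bool red; };

// A B-tree node: up to FANOUT-1 contiguous keys. FANOUT = 8 is an
// illustrative choice; real implementations tune it to cache lines/pages.
constexpr int FANOUT = 8;
struct BTreeNode {
    int n = 0;                      // number of keys currently stored
    int64_t keys[FANOUT - 1];
    BTreeNode* child[FANOUT] = {};  // all null in a leaf
};

// Search touches one node per level; the scan inside a node is over
// contiguous memory, so it costs few cache misses.
bool contains(const BTreeNode* node, int64_t k) {
    while (node) {
        int i = 0;
        while (i < node->n && node->keys[i] < k) ++i;
        if (i < node->n && node->keys[i] == k) return true;
        node = node->child[i];
    }
    return false;
}
```

A binary tree of the same size would chase one pointer per key comparison, each a potential cache miss.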

Anyway, that's my story about b-trees!

detrino|10 years ago

Cache coherency is something else; what you're describing is cache locality.

But ya, B-Trees will (sometimes significantly) outperform BSTs for small keys because they incur fewer cache misses. Also interesting: the idea that RB-Trees are faster than AVL trees for insert/delete may be outdated as well. I've seen benchmarks of AVL trees beating std::set in both libstdc++ and libc++ for both of those operations. Again, this is probably due to fewer cache misses.

My B-Tree story is actually a B+Tree story. B-Trees (and B+Trees) can actually be used for more than just ordered sets/maps: I've implemented a C++ sequence container[1] that allows O(log n) random-access insertion and deletion. It is _much_ faster than a similar container[2] implemented using AVL trees, but at the cost of iterator stability.

[1] https://github.com/det/segmented_tree_seq

[2] http://avl-array.sourceforge.net
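The trick behind such sequence containers is to augment each internal node with the element count of every child subtree, so a positional lookup descends by index rather than by key. A minimal sketch of that descent (hypothetical node layout, not the linked library's actual code; insertion would also update the counts along the path, keeping it O(log n)):

```cpp
#include <cassert>

// A toy node: internal nodes hold children plus per-child subtree sizes;
// leaves hold the elements themselves. Fanout 4 is arbitrary.
struct SeqNode {
    bool leaf;
    int n;              // number of children (internal) or elements (leaf)
    SeqNode* child[4];  // used by internal nodes
    int count[4];       // size of each child subtree
    int values[4];      // used by leaves
};

// Positional access: subtract child sizes until the index falls
// inside one child, then recurse into it.
int at(const SeqNode* node, int i) {
    while (!node->leaf) {
        int c = 0;
        while (i >= node->count[c]) { i -= node->count[c]; ++c; }
        node = node->child[c];
    }
    return node->values[i];
}
```

Iterator instability follows from the same design: elements move between nodes on insert/delete, so iterators can't pin a stable address.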

bbrazil|10 years ago

I did my final year project of my degree on B-Trees (Cache Conscious Optimisation of STL Data Structures).

The short version is that you want to tune the node size to the page size; inserts and reads tend to be faster than with RB-Trees, but deletes are slower. This was true across a variety of chips and CPU architectures.
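Tuning to page size is mostly arithmetic: pick the fanout so one node fills one page. A back-of-the-envelope sketch (the 4 KiB page and 16-byte header are assumptions to illustrate, not values from the project):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPageSize = 4096;          // assumed OS page size
constexpr std::size_t kHeader   = 16;            // key count, flags, etc.
constexpr std::size_t kKeySize  = sizeof(int64_t);
constexpr std::size_t kPtrSize  = sizeof(void*);

// Each key slot pairs with one child pointer (plus one extra pointer,
// ignored here for simplicity), so fanout ~ (page - header) / (key + ptr).
constexpr std::size_t fanout() {
    return (kPageSize - kHeader) / (kKeySize + kPtrSize);
}
```

With 8-byte keys and pointers this gives a fanout in the hundreds, which is why B-tree depth stays tiny even for millions of entries.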

srean|10 years ago

Then this library from Google would interest you https://code.google.com/p/cpp-btree/

It exposes the standard STL container interface. Although it was primarily intended to be used as a map or a set, if I remember correctly they also had B-tree-backed large vectors that outperformed std::vector in random reads. I seem to have lost the URL for that post/comment.

flgr|10 years ago

If you are interested in an overview of B-Trees, similar data structures, and how well they work on modern hardware, you may also find the survey section of my master's thesis interesting:

See here: https://www.researchgate.net/profile/Florian_Gross/publicati....

Along those lines:

* CSS-Trees: pointerless B-Trees with a layout optimized for cache lines (http://www.vldb.org/conf/1999/P7.pdf)

* Intel & Oracle's fast architecture-sensitive tree search (combines huge pages, cache line blocking, and SIMD in an optimal layout): http://www.researchgate.net/profile/Jatin_Chhugani/publicati...

* Adaptive radix trees (http://codematch.muehe.org/~leis/papers/ART.pdf)

tomcam|10 years ago

A lucid, enjoyable paper that allowed me to write a very fast B-Tree implementation in nascent ANSI C back in the day. Comer's books always struck just the right balance between scholarly completeness and clarity, at least to me.

rizzin|10 years ago

Why does the root have only two children and not d, like every other non-leaf node?

detrino|10 years ago

Because when you split the root, it can only have 2 children.
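Concretely: when the root overflows, its keys are divided around the median, which moves up into a brand-new root, and the two halves become that root's only children. A minimal sketch of the leaf-root case (vector-based toy code, not a full B-tree; an internal root would also need its child pointers divided):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Node {
    std::vector<int> keys;
    std::vector<Node*> children;
};

// Split a full root: the median key moves up into a new root, which
// therefore starts with exactly two children regardless of the order d.
Node* split_root(Node* root) {
    std::size_t mid = root->keys.size() / 2;
    Node* right = new Node;
    right->keys.assign(root->keys.begin() + mid + 1, root->keys.end());
    Node* parent = new Node;
    parent->keys = {root->keys[mid]};
    root->keys.resize(mid);  // the old root becomes the left child
    parent->children = {root, right};
    return parent;
}
```

Only later splits of non-root nodes push extra keys into the root, which is how it eventually grows toward d children like the rest of the tree.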

tomcam|10 years ago

[deleted]