top | item 5423594

wandermatt | 13 years ago

"This is a very nice theoretically clean and powerful sequential data structure, but in practice it is very very very very very very very very very very very slow... very very slow. No one uses finger trees because they're so slow." -Daniel Spiewak

From his 2011 NEScala talk on functional data structures, covering practical considerations like locality of references and caching. Available on vimeo http://vimeo.com/20262239

fusiongyro|13 years ago

Haskell's Data.Sequence is based on finger trees. I'd be surprised if the Real World Haskell guys would recommend it so strongly if it performed "very"^N badly. They're well-known for being performance-obsessed.

"Both Haskell's built-in list type and the DList type that we defined above have poor performance characteristics under some circumstances. The Data.Sequence module defines a Seq container type that gives good performance for a wider variety of operations." -- http://book.realworldhaskell.org/read/data-structures.html#d...
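
A quick sketch of the Data.Sequence API being referenced, from GHC's containers package (the values here are arbitrary, just to show the operations):

```haskell
import Data.Sequence (Seq, (<|), (|>), index)
import qualified Data.Sequence as Seq

main :: IO ()
main = do
  -- (<|) and (|>) add to either end in amortized O(1),
  -- which plain lists only give you at the front.
  let s = (0 <| Seq.fromList [1, 2, 3]) |> 4 :: Seq Int
  print (Seq.length s)                -- 5
  -- Random access is O(log(min(i, n - i))).
  print (index s 2)                   -- 2
  -- Splitting anywhere is logarithmic too.
  let (l, r) = Seq.splitAt 2 s
  print (Seq.length l, Seq.length r)  -- (2,3)
```

This is exactly the workload finger trees are good at: cheap work at the ends, acceptable work in the middle.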

carterschonwald|13 years ago

Indeed. As one of those performance-obsessed Haskellers, a golden rule is "choose your data structures based upon your workload". That means not just the time/space complexity of the operations, but also whether you want persistence or shared mutable state, and so on. Finger trees are nice for workloads that are heavy on adding/removing values at the ends, with some random access; sequential scans are OK too.
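
The persistence point can be illustrated with Data.Sequence, which is backed by a finger tree (the values below are arbitrary):

```haskell
import Data.Sequence (Seq, index, update)
import qualified Data.Sequence as Seq

main :: IO ()
main = do
  let v1 = Seq.fromList [1, 2, 3] :: Seq Int
      -- update returns a new version; v1 is left untouched
      v2 = update 0 99 v1
  -- Both versions coexist, sharing most of their structure internally.
  print (index v1 0, index v2 0)   -- (1,99)
```

With a mutable shared-state structure you would get one version or the other, not both; whether you need that cheap versioning is part of the workload question.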

I may wind up using this or a similar data structure in a number of current commercial and open source projects.

dons|13 years ago

The data structure has great asymptotic complexity, and is closely related to hash array mapped tries and ropes, which have good implementations.

Individual implementations might suck. Others are just fine (http://hackage.haskell.org/packages/archive/fingertree/0.0/d...) - though even this could be optimized (for constant factors).

marssaxman|13 years ago

Slow compared to what? You'd certainly be wasting your time to use a finger tree as a simple queue, but if you need an immutable / persistent container with fast access to the ends and not-terrible random lookup, I can't think what else you would use.

AaronFriel|13 years ago

NEScala -- was he implementing finger trees as Java objects? Java objects have extraordinary overhead, and finger trees are only efficient when the per-entry overhead is low. So naturally he prefers array-mapped tries, because arrays have substantially less per-object overhead in Java.

See this overview of the difficulty of building efficient data structures in Java: http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/mem...

AaronFriel|13 years ago

Some further exposition.

Some takeaways from this slide deck: a TreeMap<Double, Double> has a minimum of 82% overhead. If you allocate 100 megabytes of memory to that tree map, only about one fifth of that memory is your data; the rest is object overhead. An array of primitive doubles, on the other hand, needs only 16 bytes per key/value pair, and thus stores close to five times as much data per megabyte. Even if you implement some strange accounting scheme on top of your array to make it act like a tree, you're probably already reaping performance benefits from greater data locality, the likelihood of having data in cache lines, etc.
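
As a rough back-of-envelope check of that 82% figure (the per-object sizes below are assumptions for a typical 64-bit HotSpot JVM with compressed references, not measurements):

```haskell
-- Assumed sizes: a TreeMap.Entry carries an object header plus
-- references to key, value, left, right, parent and a color flag,
-- roughly 40 bytes; each boxed Double is roughly 24 bytes.
entryObject, boxedDouble, payload :: Int
entryObject = 40
boxedDouble = 24
payload     = 16   -- two raw 8-byte doubles of actual data

main :: IO ()
main = do
  let total    = entryObject + 2 * boxedDouble   -- ~88 bytes per mapping
      overhead = fromIntegral (total - payload)
               / fromIntegral total :: Double
  print (round (overhead * 100) :: Int)          -- 82 (percent)
```

72 of ~88 bytes per mapping being bookkeeping lines up with the slide deck's minimum-82%-overhead claim.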

I'm reminded of when I worked with libraries for Judy arrays in 2006 or so. Judy arrays are a radix-tree-like data structure whose performance is tuned around cache line sizes and the number of dereferences at every node of the structure. Pointers to leaves are tagged in their lowest 3 bits to indicate which of 8 node types they refer to. A very full node will usually be allocated as a plain array, while a node with fewer entries than roughly 3/4 of a cache line's worth of pointers instead holds two adjacent arrays: an array of key bytes and an array of pointers to child nodes. The byte array is searched first, and on a hit, the matching index is used to find the pointer to the child in the second array.

All of these optimizations are really fantastic. Judy arrays not only have more "features" than hash tables, such as very efficiently stored arbitrary-length keys and the ability to access the first, last, and random elements of the tree very efficiently, but a Judy array can also be traversed in lexicographic order, which cannot be done with a conventional hash table. On top of this, in practice they outperform naive hash tables outright, thanks to their eye toward optimizing for real-world processors' cache line sizes.

All of this is fantastic, except if you implemented this in Java it would be, ahem, "in practice it is very very very very very very very very very very very slow... very very slow".

wandermatt|13 years ago

Yes, this was on the JVM. No, that was not the issue.