top | item 6445929

(no title)

JeanPierre | 12 years ago

Author here: I think you talk about the persistent hashmaps, not the persistent vectors. I've been looking for papers explaining Clojure's persistent collections, and Bagwell seems to cover the hash maps and hash sets quite well. However, I've not seen a paper on the persistent vectors, which was quite a bummer, and that was the reason I started explaining them in the first place.

If you have a reference to a paper explaining something similar (or the actual implementation), I'd love to put it in the post for others.

discuss

jasonwatkinspdx|12 years ago

https://github.com/clojure/clojure/blob/c6756a8bab137128c811...

Looking at the source the persistent vectors are virtually identical to Bagwell's paper. Rich did add a couple tweaks, namely moving the bitvector that indicates what slots of a node are occupied from being a word in the node object to being embedded in the 64bit integers stored in each node slot. When a node is filled enough to span 2 cache lines, around 9 slots on typical hardware with 64 byte lines, and the next desired index fragment is the 9th slot or higher, this avoids touching the first cache line, potentially saving a cache miss. This is why the nodes are 32 way: 32bits for the bitvector and 32bits for the offset in the underlying storage array fit in one 64bit word which can be written atomically (inside a transient obviously). Rich goes through this in one of his talks but I don't recall which.

The modification to go from mutable to immutable isn't an invention either. Anyone who's read any of the functional data structure literature will be familiar with path copying being one of the two general ways of making any data structure persistent.

From the perspective of these data structures there's little difference between a vector with integer indexes and a hashmap. The hashmap just requires a preliminary step of hashing the key to an integer.

swannodette|12 years ago

From the paper I cited below:

The immutable vector data structure as pioneered by the programming language Clojure [4] strikes a good balance between read and write performance and supports many commonly used programming patterns in an effi- cient manner. In Clojure, immutable vectors are an essential part of the language implementation design. Ideal Hash Tries (HAMTs) [1] were used as a basis for immutable hash maps and the same structure, 32-way branching trees, was used for immutable vectors.

I'm pretty sure they picked the word pioneered for a reason. If Rich Hickey didn't invent them, then Tiark & Bagwell didn't invent RRB-Trees.

JeanPierre|12 years ago

I do agree that the structure is influenced by the paper, so I've added that in. I don't agree that they are "virtually identical" however, because there are significant differences leading to a lot of performance improvements (never any hash collisions, insertion and removal only at the end, resulting in perfect balancing, etc) and, as swanodette mentioned, Bagwell himself mentioned that they were pioneered by Hickey.

> Rich goes through this in one of his talks but I don't recall which.

If you figure out which, I'd love to know! :)

cgrand-net|12 years ago

There's no bitvector in clojure persistent vectors. The difference between vectors and a hashmap without the hashing stage is that a persistent vector can't be sparse!

e12e|12 years ago

Thanks for the link to the source. I had no idea the clojure source code was so readable :-)

swannodette|12 years ago

Tiark & Bagwell cover Rich Hickey's modifications and come up with some pretty cool improvements of their own in their paper on RRB-Trees, http://infoscience.epfl.ch/record/169879/files/RMTrees.pdf

jasonwatkinspdx|12 years ago

The relevant quote from section 1:

"In Clojure, immutable vectors are an essential part of the language implementation design. Ideal Hash Tries (HAMTs) [1] were used as a basis for immutable hash maps and the same structure, 32-way branching trees, was used for immutable vectors."