Yes, you really do need resistance to collision attacks in your standard hash table implementation. You know why? Because we've known for a long time that insecure defaults get broken. I'm glad Rust makes the correct choice here.
Moreover, despite the author's continued invocation of the term "cryptographic hash", SipHash was designed to be a default hash-table hash. Very few things use it as a standalone cryptographic hash; you don't see a lot of SipHash-based KDFs.
It's not the fastest general purpose hash --- but it's not supposed to be. It's supposed to be performant enough to be used as a default, and in being a default, it's supposed to take a pretty common attack off the table.
It's the right default. In the rare situations where container hashing winds up high on the profile, the faster hash should be the optional hash.
Nonsense. Rust made a non-technical, fear-based choice. Robin Hood hashing is enough protection; SipHash only adds slowness. There are plenty of hash functions twice as fast with fewer collisions, and with Robin Hood you are fine with FNV-1.
For concurrency, leapfrog probing would be better.
No, a cryptographic hash has an additional property: it is one-way. Given a hash, it is hard to derive the original input.
Collision resistance is a separate property, which can be fulfilled without the one-way property.
Maybe you want hash values that are not predictable by an attacker, but SipHash is not salted either. Besides, salting does not require a cryptographic hash.
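For what it's worth, Rust's standard HashMap does randomize its SipHash keys per map through `RandomState`, which is what makes the hash values unpredictable to an attacker. A minimal sketch using only std's public API (the helper name is mine, not from the thread):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

// Hash one value under a given per-map key ("salt").
fn hash_with(state: &RandomState, value: &str) -> u64 {
    let mut hasher = state.build_hasher();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Each RandomState carries fresh random SipHash keys, so two maps
    // hash the same input to different values: collisions precomputed
    // against one map are useless against another.
    let s1 = RandomState::new();
    let s2 = RandomState::new();
    println!("{:x} vs {:x}", hash_with(&s1, "key"), hash_with(&s2, "key"));
}
```

With overwhelming probability the two printed values differ, while hashing the same input twice under one `RandomState` is stable, which is all a HashMap needs.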
Does Rust define collection traits for abstract interfaces like Set, Map, and Queue like Java's Collections interfaces? I don't see any in the std::collections documentation. Standardizing those APIs in the std lib seems as (or more!) important than providing concrete implementations like BTreeMap. I see some 2015 discussion about missing collections traits:

https://internals.rust-lang.org/t/collection-traits-take-2/1...

https://internals.rust-lang.org/t/traits-that-should-be-in-s...

For comparison, here are Java's core collection interfaces: Collection, Set, SortedSet, List, Queue, Deque, Map, and SortedMap.

http://docs.oracle.com/javase/tutorial/collections/interface...
A lot of this article might be a matter of opinion, but the discussion on HashMaps seems to be outright false. The author has set up some kind of double-hashing strawman, when the actual implementation uses linear probing (which is the only kind of Robin Hood hash table anyone would implement today).
> And, ignoring that point for a moment, The idea that your code is ‘secure by default’ is a dangerous one and promotes ignorance about security. You code is not secure by default
Bah. This argument can be used to show anything. I could argue that C++ is better than Rust with the same argument.
I'm not entirely sure I get all of these arguments. It sounds like someone complaining that the language is young and the early implementations aren't perfect.
For not being perfect, it's amazing how awesome the std lib is!
Also, the concurrent collections didn't show up in Java until 1.5, I believe. I think that was ten years into the life of the language, which was after the 1.3/4 releases which actually made the JVM substantially more performant.
Rust is young, concurrent data models are hard, but it is being worked on, and you can lend a hand! There are good points in this article, and some of them are being discussed on the Rust forums. This is a great time to be involved in helping shape a language and its libraries!
I strongly believe that concurrent data structures should be kept out of the stdlib. There are just way too many tradeoffs to worry about here, and these tradeoffs matter far more in the concurrent case.
Take a concurrent hashmap for example. Do you want a "vec of lock free lists" kind which is lock free and infinite-capacity (no need to copy-reallocate), or do you want a probing one which has fine-grained locks and needs to reallocate (but has better cache behavior)? Or something in between? If it matters enough that `Mutex<HashMap<K,V>>` isn't good enough, then these tradeoffs matter too. Rust could have a rogues' gallery of concurrent hashmaps, but that feels like bloat.
Some already exist in the ecosystem, though, and you can always use those. But I'd love to see more work here!
Having a semi-official/nursery library containing a lot of concurrent data structures would be nice, though. crossbeam provides better primitives for writing such things, but not the actual data structures.
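As a baseline for the tradeoff discussion above, here is roughly what the `Mutex<HashMap<K, V>>` option looks like; only if this one coarse lock shows up in profiles do the fancier concurrent designs start to matter (a sketch, not a recommendation):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // One lock around the whole map: simple and correct, at the cost
    // of serializing all access from all threads.
    let map = Arc::new(Mutex::new(HashMap::new()));

    let handles: Vec<_> = (0..4)
        .map(|t| {
            let map = Arc::clone(&map);
            thread::spawn(move || {
                for i in 0..100 {
                    map.lock().unwrap().insert((t, i), t * 100 + i);
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    // All 4 * 100 distinct keys made it in.
    assert_eq!(map.lock().unwrap().len(), 400);
}
```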
It's true that Java took a long time to develop (good) concurrent collections, but I will suggest that they're more important in 2016 than they were in 1995. Core counts are increasing, and the competition from other languages is better.
I'm not saying we should be disgusted if there aren't immediately awesome concurrent data structures, but you can't be satisfied with waiting until 2024 (the equivalent of Java 1.5, I believe).
> If the allocation is hidden to the programmer, she might not realize the expensive operations behind the scenes.
I am a green Rust programmer, and I currently have little insight into the allocation costs of the various things I'm doing, and I haven't yet found a strategy for gaining it. I feel like the actual mechanisms for memory management are so deeply buried that it would take lots and lots of spelunking to get an idea of what's going on behind the scenes. Even something as simple as a "This function allocates <x> memory" tag with stdlib functions would help me immensely.
I agree that we should add some more documentation in this area (though we always have to be careful that we don't document things so thoroughly that we de-facto stabilize the internals and make it impossible for us to improve them in the future). But in the meantime, if you're just learning the language, I would recommend not being overly concerned with the performance details of collections until you start seeing them at the top of your performance profiles.
EDIT: And of course, I do encourage you to check out the documentation that we have in the std::collections module that goes into at least a little detail about each of the standard collections, along with tips for choosing when to use each: https://doc.rust-lang.org/std/collections/#when-should-you-u...
I am interested in improving this. It's tough because as an experienced Rust programmer, I feel like it's obvious, but have a hard time figuring out how to put it into words. It should be something that can be intuitive, rather than needing explicit docs, though those can help too.
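One low-tech way to get that visibility, assuming a toolchain new enough to have `#[global_allocator]` (stabilized in Rust 1.28, i.e. after this thread): wrap the system allocator and count the bytes requested. A sketch:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Forwards to the system allocator, tallying every byte requested.
struct CountingAlloc;

static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static A: CountingAlloc = CountingAlloc;

fn main() {
    let before = ALLOCATED.load(Ordering::Relaxed);
    let v: Vec<u64> = (0..1000).collect();
    let after = ALLOCATED.load(Ordering::Relaxed);
    // 1000 u64s need at least 8000 bytes; `collect` from a sized
    // iterator does it in a single allocation.
    println!("building the Vec allocated {} bytes", after - before);
    drop(v);
}
```

Bracketing any suspect stdlib call with two reads of the counter answers "does this allocate, and how much?" empirically, without spelunking the internals.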
He says BinaryHeap is superfluous, but then complains there's not a priority queue. Huh?
A binary heap is a priority queue. (Specifically, a binary heap is an implementation of a priority queue, just as a linked list is an implementation of a list.)
Even after the "they're fundamentally very different" update, he's still wrong. Wanna know what the (non-synchronized) priority queue is called in Python? heapq (https://docs.python.org/3.5/library/heapq.html). Critique that next, please.
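Concretely, std's `BinaryHeap` is used exactly like a priority queue; wrapping items in `Reverse` flips it from max-first to min-first:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn main() {
    // Max-priority queue: pop() always returns the largest remaining item.
    let mut pq = BinaryHeap::new();
    for &x in &[3, 1, 4, 1, 5] {
        pq.push(x);
    }
    assert_eq!(pq.pop(), Some(5));
    assert_eq!(pq.pop(), Some(4));

    // Min-priority queue: Reverse inverts the ordering.
    let mut min_pq = BinaryHeap::new();
    for &x in &[3, 1, 4, 1, 5] {
        min_pq.push(Reverse(x));
    }
    assert_eq!(min_pq.pop(), Some(Reverse(1)));
}
```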
This variant of the Robin Hood hash table isn't a double-hashing scheme like the one in the original research paper. It's closer to linear probing with a collision-resolution trick to reduce the worst case in lookups.
The article suggests using quadratic probing instead, but that has two weaknesses. Cache locality is worse than with linear probing, and the deal breaker is that the hash table must have a prime-number size for correctness. This makes the table grow quite quickly once its size reaches the tens of thousands, and finding primes is another issue.
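To make the "linear probing with a trick" point concrete, here is a toy Robin Hood insert (fixed power-of-two capacity, no resizing or deletion, and it assumes the table never fills; all names are mine, not Rust's actual implementation):

```rust
const CAP: usize = 8; // power of two; `% CAP` stands in for the hash

// How far a slot is from an entry's preferred ("home") slot.
fn probe_distance(home: usize, slot: usize) -> usize {
    (slot + CAP - home) % CAP
}

fn insert(table: &mut [Option<(usize, u64)>; CAP], key: usize, val: u64) {
    let mut entry = (key, val);
    let mut slot = entry.0 % CAP;
    let mut dist = 0;
    loop {
        match table[slot] {
            None => {
                table[slot] = Some(entry);
                return;
            }
            Some(resident) => {
                let resident_dist = probe_distance(resident.0 % CAP, slot);
                // Robin Hood rule: if the resident is closer to home
                // than we are, evict it and keep probing with it.
                // This evens out probe lengths across the table.
                if resident_dist < dist {
                    table[slot] = Some(entry);
                    entry = resident;
                    dist = resident_dist;
                }
            }
        }
        slot = (slot + 1) % CAP;
        dist += 1;
    }
}

fn main() {
    let mut table: [Option<(usize, u64)>; CAP] = [None; CAP];
    // Keys 0, 8, 16 all have home slot 0; they settle into slots 0, 1, 2.
    for &k in &[0usize, 8, 16] {
        insert(&mut table, k, k as u64);
    }
    println!("{:?}", table);
}
```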
In all fairness, I have yet to see a language with an awesome built-in collections library that meets everyone's needs (and I use JDK collections all the time... it took a long time to get where it is, and it's still only sort of OK). If you know of a language that does, please chime in.
The big thing I missed in my limited playing around with Rust is more types of queues. This is in large part because I was attacking Rust in a similar way to Java: because it lacks concurrency models other than threads, I was resorting to queues (I'm sure there are more "reactive-like" libraries now).
Overall I can't decide whether a better approach might be to let library writers work on adding more structures instead of a batteries-included approach. Maybe just have the traits included and that's it.
> The other popular self-balancing trees are good candidates as well (AVL and LLRB/Red-black). While they do essentially the same, they can have very different performance characteristics, and switching can influence the program’s performance vastly.
In what scenario would you prefer an RB-tree (or AVL tree) to a B-tree?
My understanding is that the performance characteristics of RB/AVL trees can mostly be described as "worse." Maybe insert speed? But in that case, you'd really prefer an LSM tree.
The author is also very confused about how hash table DoS attacks are protected. And what a cryptographic hash is. There are very fast non-cryptographic hash functions like xxHash (in fact, I thought siphash was much faster than 1GB/s, but I could be misremembering).
GB/s is a fairly useless metric for hash functions designed to be used in hash tables, because hash table keys are usually very short and keyed hashes can have long setup times. I don't know SipHash's speed on bulk data, but I wouldn't be surprised if VMAC beats it for large inputs; VMAC, however, has a long setup time (on the order of 5,000 cycles IIRC, versus around 100 for SipHash-2-4).
Heck, for large inputs, Python's randomized hash isn't much faster than HMAC-MD5, but for 8-byte inputs it's faster than SipHash-2-4.
The better measurement is X hashes per second at key size N, taken at a few different key sizes.
[edit]
I finally found a comparison that does this and includes SipHash:

http://cglab.ca/~abeinges/blah/hash-rs/

Note that SipHash is competitive with everything except FNV for small inputs, but FNV blows it away up to about 60 bytes.
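A toy version of that per-key-size measurement, using std's `DefaultHasher` (which wraps SipHash at the time of writing); there's no warmup or black-boxing here, so treat the numbers as rough:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::time::Instant;

// Rough hashes-per-second for keys of a fixed length.
fn hashes_per_sec(key_len: usize, iters: u32) -> f64 {
    let key = vec![0xa5u8; key_len];
    let mut acc = 0u64; // consume the digests so the loop isn't optimized away
    let start = Instant::now();
    for i in 0..iters {
        let mut h = DefaultHasher::new();
        h.write_u32(i); // vary the input slightly each iteration
        h.write(&key);
        acc ^= h.finish();
    }
    let secs = start.elapsed().as_secs_f64().max(1e-9);
    // Observe acc so the compiler must keep the hashing work.
    if acc == u64::MAX {
        println!("unlikely");
    }
    iters as f64 / secs
}

fn main() {
    for &len in &[8, 64, 1024] {
        println!("{:5}-byte keys: {:.0} hashes/sec", len, hashes_per_sec(len, 100_000));
    }
}
```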
In such implementations (where the node is embedded in the object) the RB tree will be faster. The reason to prefer a B-tree is to minimize the number of cache misses incurred when loading objects _through_ separately allocated nodes. But if the node is part of the object, you've already eliminated that cost entirely, and so a B-tree is strictly inferior.
<sys/tree.h> is written in the same vein as <sys/queue.h>, a collection of list implementations originally included with BSD and even provided by glibc. The tree and especially list routines are used for most of the data structures in *BSD kernels and utilities, and the Linux kernel and other systems often use very similar libraries.
The downside to using predefined, preallocated nodes is that you can't mix-and-match types in the same collection. And an object can only be inserted into a single collection unless you define multiple nodes.
In practice this isn't a concern. Especially in systems programming, it's uncommon to need or even want to stash random objects in a collection, or for objects to exist in multiple trees or lists simultaneously.
And it's a very useful property to be able to rely on the fact that insertion can _never_ fail (always for lists; qualified never for trees). That dramatically simplifies code dealing with complex data structures and that must remain robust in the face of resource exhaustion. It means that once you've allocated a new object, you can guarantee insertion into any number of collections. That also simplifies backing out of a transaction because you can make guarantees about consistency--i.e. that an object is definitely a member of this list or that tree. This benefit is often overlooked in Rust and elsewhere because idiomatic Rust code is expected to exit on memory exhaustion anyhow.
I can't remember the last time I actually used a hash table in C that wasn't specialized (e.g. perfect hash function). The fact that a dynamic hash table may need to be resized introduces an undesirable failure mode; it may happen only once in a million insertions, but that's significantly more often than never. For general use I default to a red-black tree, which usefully can serve as a sorted container, too (e.g. for priority queues). When performance matters (i.e. a proven bottleneck) I'll invariably use a more specialized data structure and implementation suited to the particular problem, perhaps also refactoring other aspects of the code as appropriate.
IMHO, all this concern about the fastest general purpose algorithm and implementation seems like bike shedding. When performance _truly_, _manifestly_ matters, a general purpose implementation is unlikely to suffice.
I appreciate reading thoughtful, constructive criticisms like this. But ultimately, I didn't find any of his points particularly compelling. Maybe the author can point to a language that really gets this right?
I think std means standard collections, not a custom ("special"-case) collection library. People may look for that kind of specialization outside std or create their own library. Or you could be the first one to create this extended collections library. CMIIW