top | item 19780387

Topics in Advanced Data Structures [pdf]

931 points| htiek | 6 years ago |web.stanford.edu

83 comments

order
[+] jhayward|6 years ago|reply
I was going to roll my eyes about stuff no one will ever hear of, much less use in their day job, but there are some really relevant structures here. Finger trees, cache-oblivious structures, R-trees, etc., just to name a couple from a random page or two. The "why they're worth studying" summaries are gold.

Thanks!

[+] bhaavan|6 years ago|reply
A doctor has to mostly always treat common cold, and influenza. But that's no excuse for her / him to not know the purpose of the left pulmonary vein. The more basics they know, the better doctors they are. Sorry if this analogy is a bit extreme, but I want to make a point.
[+] jwr|6 years ago|reply
> much less use in their day job

Yes, some are indeed very relevant. In one of my previous businesses, we built a search engine where we used DAWGs (Directed Acyclic Word Graphs) [I didn't do this part]. And they worked really, really well, in fact they work well to this day.

[+] mcguire|6 years ago|reply
When was the last time you wrote a finger tree? At work?

[I'm just responding to your first clause. :-)]

[+] amelius|6 years ago|reply
> Traditional data structures assume a single-threaded execution model and break if multiple operations canbe performed at once. (Just imagine how awful it would be if you tried to access a splay tree with multiplethreads.) Can you design data structures that work safely in a parallel model – or, better yet, take maxi-mum advantage of parallelism? In many cases, the answer is yes, but the data structures look nothing liketheir single-threaded counterparts.

Makes me wonder which data structures have "parallel" versions besides the two mentioned.

[+] bogdanoff_2|6 years ago|reply
This kind of stuff fascinates me. General software engineering seems comparatively boring. I was considering going back to university to do a phd in cs and this is making me realize that research in algorithms/data structure would actually be viable (still lots of stuff to discover).

Does anyone here know what it is like doing research in these areas? Any general advice?

[+] xtracto|6 years ago|reply
I've got a PhD in CS, and these structures fascinate me as well. The Bloomier Filter reads amazing.

Unfortunately, day to day job has almost nothing to do with these algorithms, but mainly software architecture and maintainability.

I think the majority of these algorithms are not really something you will find implementing in day to day work. And at most I can see myself using a library with one of those. Even if you do research, it will have to be very focused on data structures and algorithms to really get deep into some of these.

Still, very enjoyable.

[+] lame88|6 years ago|reply
Does anyone in their work find that they are able to employ data structures like this, and if so, what do you work on? I've almost always had to delegate all my state to a database using default indexes, etc., which is productive, yet a little disappointing, because I'm always applying my brain power instead toward more mundane tasks.
[+] sempron64|6 years ago|reply
When working on a high volume web application I found data sketches (specifically the count-min sketch) very useful for finding and logging unique query patterns (Check if we've seen it before, how often have we seen it, if it's new, log it). Using a sketch bounded how much memory we used and eliminated database trips. In theory a hashmap data structure might have worked as well, but its size is unbounded given too many unique queries, while all we needed was an estimate.

Probabilistic data structures like bloom filters and sketches are really useful for gathering statistics on data sets that are too large to manage or would have a negative performance impact by having an extra database trip. So something to do with internal application diagnostics/debugging/logging is a great low-risk place to use these sorts of algorithms even when it makes sense to keep all business logic in the database.

[+] jcranmer|6 years ago|reply
There's a general class of problems to be solved with data structures, which is that you have a collection of records of data, and you're trying to find [1] a record, or collection of records, given some criteria. In essence, you have a database of some kind, so it should not surprise you to learn that databases (and things that are databases but not normally thought of as such--filesystems, for example) rely very heavily on these more "exotic" data structures internally. Indeed, a database is basically a heavily-optimized data structure that usually optimizes for the bursty nature of disk access [2]. And given that implementing these data structures tends to be a rich source of bugs, if you can offload the work to a database implementation, you probably should.

I will also point out that when you view the problem of data structures as an implementation that solves various queries, you can start building tools that automate the implementation of complex data structures given the queries you want to ask the data structure. I fully expect to see these tools start to become available over the next decade or so, given the current progress of program synthesis I see in the academic literature.

[1] Or insert, delete, or update. But usually you already have the record at that point, and you just want to update the data structure's representation of that record.

[2] You can apply the same techniques to cache access and even vector register usage for purely in-memory data structures. Same problem, just with different sizes of disparity.

[+] petschge|6 years ago|reply
I do plasma simulations and recently had the problem of finding the distance to the nearest neighbor for every of the particles in the simulation. Doing that naively is O(n^2) and took hours even for small test problems. Building an R-tree once and using if for nearest-neighbor look-ups brought that down to 5 minutes.

libspatialindex lacks documentation, but worked really nicely. The rtree interface in python is much friendlier.

[+] robmaister|6 years ago|reply
Gamedev, but even then actually implementing the data structures is rare. And nothing outside of geometric/parallel data structures listed towards the end.

I've had to use a concurrent priority queue once (locking a regular priority queue was too slow, dropped in a concurrent implementation), implemented kd trees a few times, and I've used half-edge once (a cousin of quadedges). Mainly doing graphics work.

Often the naive data structures work better because there's less indirection and it's easier to vectorize. In my case, depending on how much GPU budget you have available, you can also just scrap the CPU implementation and throw it in a compute shader.

[+] 01100011|6 years ago|reply
I think the reality is that >80% of SWEs will never need any of this, because most SWEs are high-level integrators and consumers of these algorithms via libraries. For a lot of people, their time is better spent learning how to architect and use available tools to quickly solve problems while writing bug-free code.

But then life changes and you may find yourself in need of this knowledge. I used to laugh at graph algorithm questions on interviews because I never directly used a graph in 20 years of SWE. Then I got a job where I am on a team maintaining a graph-based API. Jokes on me, now those 'silly' algorithms are very relevant.

[+] jwr|6 years ago|reply
I managed to use DAWGs (Directed Acyclic Word Graphs) in a search engine, and later HyperLogLog+ and a Bloom Filter for a scalable distributed Multi-Armed Bandit implementation.

Otherwise I find that the built-in Clojure data structures are so good for all-round use that it's difficult to justify the time to use anything else, except in extreme cases.

[+] atombender|6 years ago|reply
I recently had to store large numbers of geospatial points (lat/lng pairs with data) in RAM for querying. I used a basic quad tree, which is similar to R-trees, but much simpler. Getting a subset of the points based on a bounding box is very fast and simple.
[+] chrisseaton|6 years ago|reply
You've answered your own question - the person writing the database you're using, for example.
[+] segmondy|6 years ago|reply
computing is so cheap that most people can delegate to a database. at some point tho, if money is a problem and you don't have the computing resources, you can squeeze more out of that hardware by using more specialized DBs, (graph, doc, column), something like redis, etc, but then at some point those might not map very nice to your problem and in which case custom data structures will get you far. Hence the reason FAANGs are big on algorithm & DS questions during interviews.
[+] urbanslug|6 years ago|reply
I'm a Bioinformatics student, so it's not for work, but I have been looking into Suffix Arrays, BWTs and FM indexes for my project.

I've seen them applied to some real world next gen genomic tools for example this one https://github.com/vgteam/odgi implements a multi string BWT. Technically they're in academia and not making money afaik but it's interesting stuff.

[+] winrid|6 years ago|reply
Yes, but not often!

Recently used nested R-Trees for in memory geospacial querying for a game (millions of atomic updates a second which would be a pain to manage sharding as items move around the world).

For example as a tree gets too big I split it at the hot spot and run the job processing for each tree on separate threads.

Few years ago implimented a simple tree for a survey product I built. The pathing in the survey is stored in the tree and I use the tree to find loops (which are invalid) in the survey.

[+] holy_city|6 years ago|reply
Pro multimedia software here. We deal with fundamentally concurrent architectures with a variety of not-so-soft realtime requirements. So thread safety, lock/wait free guarantees, and cache performance are my key concerns.

Most off-the-shelf data structures are poorly suited for that environment so I need to roll my own or modify existing structures. I'm no data structure expert, so this kind of stuff is really helpful.

[+] Groxx|6 years ago|reply
Though my use of stranger data structures is relatively rare in my line of work, I enjoy learning them because they keep revealing new ways to approach problems. It's less "I use CRDTs" and more "CRDTs have changed what I think is feasible", similarly for stuff like bloom filters (no-false-negatives can be very powerful) and locality-sensitive hashing (don't search for all X every time, pre-compute a simpler version so you only check promising items). Concurrent data structures have also been pretty fruitful, since they're also often "how can we guarantee X in our distributed system, and what primitives do we need" strategies / solutions.

(obviously some of this is "I learned a tangential thing while learning about X", but that's kinda the point. you may never use a rope, but you might be introduced to a new idea in the process of learning about it)

[+] kevinventullo|6 years ago|reply
It's hard to beat an R-Tree for doing in-memory geospatial queries.
[+] vsef|6 years ago|reply
Work in speech recognition, from this list use lots of advanced automata algorithms, bloomier filters, interesting types of hashing, fancy priority queues.
[+] oplav|6 years ago|reply
It's not as interesting as some of the structures in the PDF, but we are working on a problem where we have lots of timestamped data and need to quickly and frequently access the nearest data point from a given input timestamp. We ended up using a tree-like structure but that said, we ended up using whatever tree-like structure was available in the language.
[+] tmh79|6 years ago|reply
I work on a mapping team. We implemented our own R tree for spatial search as existing libraries had some unoptimized hot paths for our use case.
[+] ivalm|6 years ago|reply
FM-index is super useful for large string collections.
[+] sidcool|6 years ago|reply
I love learning Algorithms and Data Structures. The issue is that I don't get to use these frequently, not even basic DS. Most of what I need exists in the language or some framework, and if I am to implement it from scratch, I am sure I will do worse. The only time I really use this knowledge is during interviews.
[+] lichtenberger|6 years ago|reply
Hm, where else are Lowest Common Ancestors in tree-structures useful? Storing for instance ORDPATH/DeweyIDs allows to simply check for a common prefix (they are hierarchical node labels). I think maybe for locking in a storage system with multiple read/write transactions or to determine which of two given nodes is the firs one in preorder, which is useful for XQuery/XPath processing (to determine the so caslled document order). Can anyone think of other usages? Or for having hierarchical node labels in trees?
[+] htiek|6 years ago|reply
You can use lowest common ancestor queries in conjunction with suffix trees to solve a lot of interesting string problems. For example, take two indices within a string, find their corresponding suffixes in the suffix tree, and then take their LCA. That gives you an internal node corresponding to the longest string that appears starting at both indices (this is called their "longest common extension.") You can use this as a subroutine in a bunch of genomics applications.
[+] bondant|6 years ago|reply
Does anyone know books which explore in depth recent development in advanced data structure (or just not well known advanced data structure) ?
[+] winrid|6 years ago|reply
This is awesome, thank you!

Used nested R-Trees in a personal project recently (sharded in memory spacial db for a game) which is not something I thought I'd ever have to do.

[+] criddell|6 years ago|reply
What font are they using? I find it very unpleasant to read on a 4k screen set to 125% scale. Or maybe it's Firefox.
[+] criddell|6 years ago|reply
I loaded the page on Chrome and the text is definitely heavier (and more readable) but also less sharp.
[+] vaibhavsagar|6 years ago|reply
My favourite somewhat obscure data structure that I didn't see on the list: Hash Array Mapped Tries.
[+] mountainofdeath|6 years ago|reply
Great! More fodder for interview questions /s.
[+] twoquestions|6 years ago|reply
That's probably it, any use of these algorithms in business software would need to be justified for the increased training your fresh-out-of-school replacement would need to maintain this.
[+] pizza|6 years ago|reply
This is quite cool, thanks.
[+] maimeowmeow|6 years ago|reply
Please provide the solutions in git repo. Thanks.