For those reading along, the post describes a vector-like datastructure that has O(1) access time and O(sqrt(N)) insert and deletion time at an arbitrary index. The idea is pretty clever. The datastructure maintains sqrt(N) circular arrays, each of size sqrt(N). For reference, indexing into a circular array has O(1) access time, and insertion and deletion at the ends take O(1).
For accesses, you can in constant time determine which circular array contains the element at the given index (it's simply i / sqrt(N), and i % sqrt(N) gives the offset within it) and then in constant time access the element from the underlying array. For inserts and deletions, you find the circular array that contains the location you want to insert into. First you make sure there is space in the array to insert. You do this by moving one element from each circular array to the next one. Since deleting and inserting at the ends of a circular array takes O(1) and there are O(sqrt(N)) arrays, this takes a total of O(sqrt(N)) time. Then you insert the new element into the middle of the designated circular array, which is of size sqrt(N), so it takes in the worst case O(sqrt(N)). This means insertions take a total of O(sqrt(N)) time.
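The bucket arithmetic and the ripple insert are easy to sketch. Here's a toy 2-tier structure along those lines, using a std::deque per bucket so both ends pop/push in O(1) — this is an illustration of the scheme, not the IgushArray's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Toy 2-tier vector: ~sqrt(N)-sized buckets, each a deque. Access is two
// O(1) lookups; insert ripples one overflow element per later bucket.
class TwoTierVector {
    std::size_t bucket_size_;              // ~sqrt(N), fixed for this sketch
    std::vector<std::deque<int>> buckets_;
    std::size_t size_ = 0;

public:
    explicit TwoTierVector(std::size_t bucket_size) : bucket_size_(bucket_size) {}

    int& at(std::size_t i) {
        // O(1): bucket is i / B, offset within it is i % B.
        return buckets_[i / bucket_size_][i % bucket_size_];
    }

    void insert(std::size_t i, int value) {
        std::size_t b = i / bucket_size_;
        if (b == buckets_.size()) buckets_.emplace_back();
        // O(sqrt(N)) shift inside the target bucket, then ripple the overflow
        // forward: one O(1) pop_back/push_front per subsequent bucket.
        buckets_[b].insert(buckets_[b].begin() + (i % bucket_size_), value);
        for (std::size_t j = b; buckets_[j].size() > bucket_size_; ++j) {
            int overflow = buckets_[j].back();
            buckets_[j].pop_back();
            if (j + 1 == buckets_.size()) buckets_.emplace_back();
            buckets_[j + 1].push_front(overflow);
        }
        ++size_;
    }

    std::size_t size() const { return size_; }
};
```

All bucket sizes except possibly the last stay exactly B, which is what keeps the i / B, i % B arithmetic valid.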
As immawizard pointed out, there is a generalized version of this idea called tiered vectors[0] that supports an arbitrary level of nesting. A 1-tiered vector is a circular array. A k-tiered vector is an array of n^(1/k) tiered vectors of tier (k-1). You can show that for a k-tiered vector, access time is O(k), while insertion and deletion have a runtime of O(n^(1/k)). The datastructure mentioned in the post can be considered a 2-tiered vector.
The post includes benchmarks comparing the datastructure to std::vector. I would be interested in seeing benchmarks vs a binary search tree. Even though the datastructure has O(sqrt(N)) performance, that's still a lot slower than O(log(N)). The square root of a million is 1000, while the log base 2 of a million is only ~20.
One nitpick is that the author names the datastructure after themselves. Naming things after yourself is typically a faux pas.
I don't want to detract from the cleverness here, but I believe your benchmarks could use some work. Here are a few suggestions:
1. Simply testing something 1000 times and (presumably) presenting the arithmetic mean is not very informative. Looking at the detailed reported benchmark times (in the output file in tests), it looks like many of the timing outcomes have high variance. Rather than running the tests 1000 times and taking the mean, you might consider running 10 batches of 100 tests (or 1000, if you can) and presenting the mean and variance of the resulting distribution. In general, k sample groups each of size p will provide more reliable information about the underlying distribution than one sample group of size k*p (for reasonable k and p, obviously).
2. Related to that, the results of the "inserting a number of elements" and "deleting a number of elements" tests are significantly worse for the tiered vector vs the std::vector than the "insert/delete a single element" tests. You don't mention this in the readme, but thinking about why it is might be informative. Thrashing seems like a possible explanation, and one you might be able to mitigate.
3. Are you making sure your cache is warm before starting to measure performance? (Pardon, I didn't look through every line of your tests.) Particularly for std::vector, and likely your intermediate deques too, this will have a big effect on timing.
4. Finally, it looks like you're primarily testing using ints (?). It would probably be a good idea to see if your results hold for a different payload size.
I don't know whether these will improve or worsen your comparison against std::vector, but they will make your claims more robust.
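Suggestion 1 is mostly bookkeeping. A sketch of the batching, with a placeholder trial function since the real benchmark harness isn't shown here:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Run k batches of p trials each, then report the mean and sample variance
// of the per-batch means instead of a single grand mean.
struct BatchStats {
    double mean;      // mean of the batch means
    double variance;  // sample variance of the batch means
};

template <typename TrialFn>
BatchStats run_batched(TrialFn time_one_trial, std::size_t k, std::size_t p) {
    std::vector<double> batch_means;
    batch_means.reserve(k);
    for (std::size_t b = 0; b < k; ++b) {
        double total = 0.0;
        for (std::size_t t = 0; t < p; ++t) total += time_one_trial();
        batch_means.push_back(total / static_cast<double>(p));
    }
    double mean = 0.0;
    for (double m : batch_means) mean += m;
    mean /= static_cast<double>(k);
    double var = 0.0;
    for (double m : batch_means) var += (m - mean) * (m - mean);
    var /= static_cast<double>(k - 1);  // k >= 2 assumed
    return {mean, var};
}
```

In a real benchmark, `time_one_trial` would wrap the operation under test in std::chrono timing calls.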
He's going to have cache issues since he requires one extra memory lookup.
His lookup should be roughly twice as expensive as a regular array when cache is cold - it will be dominated by the cache misses two vs one.
With a hot cache, the array lookup becomes a single load op of a few cycles, and even if he gets two cache hits, his will probably be about 6-10 ops with 2 loads, a div, and a few more ops to get the index of the sub bucket.
On one cache hit and a possible miss on the other (eg his top level index is in cache, but the bucket might not be), he's going to be getting high variance in his lookup timings.
OP: to help the iterator access (sum), you might be able to prefetch (eg force a read of) the following bucket before you need it. When you move over to bucket 2, cause a fetch of bucket 3 at the same time. You might be able to hide some of that cache miss latency then.
Also, with the DEQs you are potentially starting at the middle of a slab and having to roll around its front. That isn't going to prefetch well, so you might want to fetch both the next slab's base address and the address it logically starts at, since those can be different. Just try to hide as much cache miss latency as you can, because that is probably where you are getting killed in the iterator access.
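A sketch of that prefetch idea — `__builtin_prefetch` is the GCC/Clang builtin, and the bucket layout here is illustrative rather than the actual IgushArray iterator:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// While summing bucket j, issue a prefetch for the base of bucket j+1 so
// its cache miss overlaps with useful work on the current bucket.
long long sum_buckets(const std::vector<std::vector<int>>& buckets) {
    long long total = 0;
    for (std::size_t j = 0; j < buckets.size(); ++j) {
#if defined(__GNUC__) || defined(__clang__)
        if (j + 1 < buckets.size() && !buckets[j + 1].empty())
            __builtin_prefetch(buckets[j + 1].data());  // hint only; no-op if wrong
#endif
        for (int v : buckets[j]) total += v;
    }
    return total;
}
```

For an actual DEQ slab you'd want two prefetches per slab, as suggested above: one at the slab base and one at its logical start.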
Correct me if I am wrong, but I think the insertion/deletion time complexity is incorrect. It should be O(N). If each DEQ is implemented using an array instead of a list, wouldn't the insert/delete operation of IgushArray still take linear time? Popping or pushing (pick one) in an array-backed DEQ implementation still takes linear time; that means, if insertion is at position N/2, elements N/2 ... N-1 still need to be moved to the right by one. Popping/pushing is only constant time if the DEQ is implemented with a linked list, but that would mean linear access time instead of constant.
Update: answer is to implement DEQ with circular buffer for O(1) pop/push.
You can implement push/pop on both ends of an array-backed deq in O(1) time for all. Store the backing array (know the size as well) and then store two indexes, the front of the deq and the back of the deq. You update the front/back the obvious way on push/pop, except you do modular arithmetic if it would take you out of bounds.
You have a limited size of the deq of course, but here that's fine, it's actually constant size for all of them except the final possibly smaller one.
So now, what work do you have to do on insert? For the destination deq, you need to do one pop, one copy-in and up to n^1/2 moves. For every deq after that, up to about n^1/2 of them, you need to do one pop and one push, each of which costs O(1).
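That ring-buffer scheme in code — a minimal sketch, not the actual DEQ implementation from the repo:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-capacity circular-buffer deque: a backing array plus a head index
// and a count. Push/pop at either end is O(1) via modular arithmetic.
class RingDeque {
    std::vector<int> buf_;
    std::size_t head_ = 0;  // index of the front element
    std::size_t size_ = 0;

public:
    explicit RingDeque(std::size_t capacity) : buf_(capacity) {}

    void push_front(int v) {
        head_ = (head_ + buf_.size() - 1) % buf_.size();  // step back, wrapping
        buf_[head_] = v;
        ++size_;
    }
    void push_back(int v) {
        buf_[(head_ + size_) % buf_.size()] = v;
        ++size_;
    }
    int pop_front() {
        int v = buf_[head_];
        head_ = (head_ + 1) % buf_.size();
        --size_;
        return v;
    }
    int pop_back() {
        --size_;
        return buf_[(head_ + size_) % buf_.size()];
    }
    std::size_t size() const { return size_; }
};
```

The caller is responsible for not overflowing the fixed capacity, which is fine here since every deq except the last has a constant size.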
Insertion and deletion from a DEQ is indeed linear, but of all the DEQs in the structure only one (!) needs a middle insertion; the rest just pop/push at their front/back.
Interesting that some people found an article about tiered vectors; I didn't find it back then. I implemented this structure many, many years ago and also published it in the Bauman Moscow State Technical University journal back in 2012:
http://engjournal.ru/articles/101/101.pdf
(in Russian)
Tiered vectors have been around for a while. The paper for them came out in 1998. They've just always been rather underappreciated. I only found out about them through this thread too.
I don't think any of us mean to imply that you copied off another implementation; independent discovery happens all the time! It's great you've made this a usable C++ library and have contributed your own thoughts on the structure!
It looks like you're having difficulty with some of the resizing. The two array approach described here https://stackoverflow.com/questions/10478260/looking-for-a-d... makes resizing a lot easier. You just double the size of the main array and increase the size of the offset array by a factor of the square root of two, without needing to fiddle around with anything else. You also get unchanged amortized time bounds.
This is very similar to what std::deque does under the covers already. It's a chunked array. The main difference AFAICT from std::deque is that this offers n^1/2 insertion and deletion within the body of the array. This guarantee is provided by allowing for the possibility of significant overallocation in the chunks themselves, if many modifications happen in the middle of the array. On the other hand, std::deque does not overallocate by as much -- just a chunk at the beginning and a chunk at the end. But as a consequence it has to move trailing elements when an element is inserted or deleted in the middle.
I believe this case is handled very badly in this data structure implementation. It seems to be saying that you should occasionally manually recreate the DS as it grows bigger, since the deqs won't automatically resize. So you'll just get linear behavior in that case, with extra constant factors because the deq logic is pure overhead at that point.
I suspect an amortized data structure could automatically and internally do this rebalancing operation, but I'd have to work out the scheme and the analysis. At a guess, something like normal dynamic arrays do, where you multiply the max size by a constant when it fills up, would give you decent amortized bounds.
In C++ the std::vector type handles this by doubling the capacity when full. I think this approach also works for this data structure. When the capacity has been reached simply double the capacity and rebuild from scratch. Perhaps quadruple because then each double-ended queue would double in size, avoiding that pesky sqrt(2). The amortized time should be the same.
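A sketch of that rebuild step, assuming the buckets are deques: quadrupling the total capacity means the bucket size and the bucket count each double, so buckets stay ~sqrt(N), and the O(N) rebuild amortizes away just as with std::vector's doubling.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Redistribute all elements into buckets of twice the old bucket size.
// Called when the structure hits capacity; total capacity quadruples.
std::vector<std::deque<int>> rebuild_quadrupled(
        const std::vector<std::deque<int>>& old_buckets,
        std::size_t old_bucket_size) {
    std::size_t new_bucket_size = old_bucket_size * 2;
    std::vector<std::deque<int>> fresh(1);  // start with one empty bucket
    for (const auto& bucket : old_buckets)
        for (int v : bucket) {
            if (fresh.back().size() == new_bucket_size) fresh.emplace_back();
            fresh.back().push_back(v);
        }
    return fresh;
}
```

Since the rebuild costs O(N) and happens only after Θ(N) insertions, the amortized cost per insertion stays O(sqrt(N)).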
As described in https://arxiv.org/pdf/1711.00275.pdf, tiered vectors can be improved further: "We present a highly optimized implementation of tiered vectors, a data structure for maintaining a sequence of n elements supporting access in time O(1) and insertion and deletion in time O(n^ε) for ε > 0 while using o(n) extra space."
(it basically trades a little more work on indexing for a little looser structure)
I like the use of the DE queue, but having to keep all the sub vectors packed to ensure constant time lookup might not be completely needed if you are willing to keep a little more information around or have two candidate buckets. Not sure if it would be a win or not though, since then you're probably just better off using a VeB tree (https://en.wikipedia.org/wiki/Van_Emde_Boas_tree).
e.g., I don't think you always need to compress everything back to the front if you can make sure all the buckets are within 1 of each other, but then you can hit cases where you have to look in two buckets, which isn't that difficult since you know the high and low of each bucket at the head and tail of each DEQ.
But then you would need to keep some local coloring info to keep the buckets perfectly balanced, as everything should be.
Doesn't really seem much better than just a balanced size-tagged binary tree. You get O(log n) access, insertion, and deletion. Yes, access is slower, but insertion and deletion are much faster. Remember that even for one billion the logarithm is merely 30. That's hardly anything, whereas the square root is more than 30 thousand.
It says in bold type at the beginning "The IgushArray class fully implements std::vector interface" but it's clear it cannot provide an equivalent of std::vector::data.
Side-question: is there any way to explicitly tell github whether a file is C or C++? I have this same issue, where C code is marked as C++ and vice-versa. Not a huge issue, but a bit annoying. It seems a bit weird that linguist (github's tool for doing language recognition) marks a file with namespaces in it as C, which is obviously not supported.
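For what it's worth, linguist's detection can be overridden per path with a `.gitattributes` entry (the glob below is just an example):

```gitattributes
# Tell GitHub's linguist these headers are C++, not C (path is illustrative)
include/*.h linguist-language=C++
```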
That being said, the overload of a complex data structure might multiply the running time by a factor up to ~10 (mostly because of cache misses), so simpler structures will be more performant for short inputs.
There is a more general pattern in most tree data structures where you can transform the log(n) recursive operations on the tree into sqrt(n) operations on a simpler structure with 2 levels.
I happen to have described the solution to a problem where you must use a sqrt-decomposition datastructure to have updates in O(1) and queries in O(sqrt(n)) because O(log(n)) everywhere is not good enough as there are a lot of updates to do (https://tryalgo.org/en/2017/09/01/path-statistics/).
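That O(1)-update / O(sqrt(n))-query trade-off is easy to sketch with block sums — a generic illustration, not the code from the linked post:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sqrt decomposition with the balance tilted toward updates: a point update
// touches one element and its block total (O(1)); a prefix-sum query adds
// whole blocks plus a remainder (O(sqrt(n))). The opposite balance of a
// Fenwick tree's O(log n)/O(log n).
class SqrtPrefixSum {
    std::size_t block_;
    std::vector<long long> a_, block_sum_;

public:
    SqrtPrefixSum(std::size_t n, std::size_t block)
        : block_(block), a_(n, 0), block_sum_((n + block - 1) / block, 0) {}

    void add(std::size_t i, long long delta) {  // O(1) point update
        a_[i] += delta;
        block_sum_[i / block_] += delta;
    }

    long long prefix_sum(std::size_t end) const {  // sum of a_[0..end)
        long long total = 0;
        std::size_t b = 0;
        for (; (b + 1) * block_ <= end; ++b) total += block_sum_[b];  // whole blocks
        for (std::size_t i = b * block_; i < end; ++i) total += a_[i];  // remainder
        return total;
    }
};
```

Choosing block ~ sqrt(n) minimizes the query cost; any other split just shifts work between the two loops.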
The performance numbers would be most interesting if he benchmarked against std::vector, std::list, std::multiset and std::deque just to see how it compares to all the major collections. Using different items, say std::string, int64_t and a big POD struct might fill in a few things too.
If the deq has to be fixed at some length k beforehand, then with many insertions k can end up much smaller than n^0.5, and I think you might end up with a much worse worst case for insertion: something on the order of O(n) rather than O(n^0.5). Or there has to be a readjustment of k. Can somebody point out my thinking error?
The Push Front time could be improved to O(1)* by making it circular. Assuming "*" means amortized. But as-is it does provide a better comparison to an array. A circular IgushArray would be better compared to a circular array (which an std::vector isn't).
You could do this with O(N^1/K) for arbitrary K, using K layers of indirection, no?
This is similar to how you can do O(logN) for insertion / deletion & access using an indexed B-tree, except you're using a fixed height instead of a fixed branching factor.
True. If you go k levels deep you get indexing complexity of O(k) (because you do a constant time operation for each level) and insertion/removal complexity of O(k + n^(1/k)) (because that’s indexing plus a linear insert/remove operation for the array at the bottom). If you push k towards log n you get O(log n) for both. (I’m ignoring rebalancing issues here but these can be dealt with too)
The article did mention that you can push k to values larger than 2 in the Generalization section. I think it would’ve been nice if they had explored such values when doing the performance tests, but either way I found it clear and nicely written.
[0] https://www.ics.uci.edu/~goodrich/pubs/wads99.pdf
detrino | 6 years ago:
Basically, it has O(log n) everything like the binary search tree you suggested, but also very good constant factors.
By default, leaves hold 1024/sizeof(T) elements and branches hold 47 elements, so it can access up to 13,289,344 8 byte elements with only 4 levels.
dwohnitmok | 6 years ago:
There are also some benchmarks for another implementation of the same idea linked to in the README here: https://github.com/mettienne/tiered-vector/blob/master/READM.... That implementation also just punted on growing the maximum capacity of the array.
juliusmusseau | 6 years ago:
It's a neat restructuring of the vector into a SQRT(N) * SQRT(N) arrangement.
michaelrpeskin | 6 years ago:
https://www.codeproject.com/Articles/5425/An-In-Depth-Study-...
saagarjha | 6 years ago:
I hope your mother taught you the value of using generics ;)
meuk | 6 years ago:
Interesting alternatives that also support insertion, deletion, and lookup (is there a name for this interface?) are AVL trees and hash tables.
carl8 | 6 years ago:
https://en.wikipedia.org/wiki/Judy_array http://judy.sourceforge.net
To install the Judy C lib:
Then compile with -lJudy and #include <Judy.h>. Example: http://judy.sourceforge.net/doc/JudyL_3x.htm
proverbialbunny | 6 years ago:
An rrb-tree is effectively O(1) for every operation category. Scala's Vector type implements one.
If interested about the data structure: https://youtu.be/sPhpelUfu8Q
Scala bigO doc: https://docs.scala-lang.org/overviews/collections-2.13/perfo... Scala's doc on it: https://github.com/nicolasstucki/scala-rrb-vector/blob/maste...