josefcullhed | 4 years ago
>> It's hard for me to see how that could be done much faster unless you find a way to parallelize the process
We actually do parallelize the process. We split the URLs across three different servers and index them separately. Then we run each search on all three servers and merge the resulting URLs.
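To make the split-and-merge idea concrete, here is a rough sketch in Python. It is not the project's actual code: the `Shard` class, the `shard_for` routing rule (CRC32 of the URL modulo the shard count), and the in-process "servers" are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 3  # the comment above mentions three servers


def shard_for(url: str) -> int:
    """Pick a shard from a stable hash of the URL (hypothetical routing rule)."""
    return zlib.crc32(url.encode("utf-8")) % NUM_SHARDS


class Shard:
    """Stands in for one server that holds its own independent index."""

    def __init__(self) -> None:
        self.index = defaultdict(set)  # word -> set of URLs stored on this shard

    def add(self, url: str, text: str) -> None:
        for word in text.lower().split():
            self.index[word].add(url)

    def search(self, query: str) -> set:
        """Return the URLs on this shard that contain every query word."""
        words = query.lower().split()
        if not words:
            return set()
        result = set(self.index.get(words[0], set()))
        for word in words[1:]:
            result &= self.index.get(word, set())
        return result


def build_shards(documents: dict) -> list:
    """Route each URL to exactly one shard and index it there."""
    shards = [Shard() for _ in range(NUM_SHARDS)]
    for url, text in documents.items():
        shards[shard_for(url)].add(url, text)
    return shards


def search_all(shards: list, query: str) -> list:
    """Run the query on every shard and merge the per-shard results."""
    merged = set()
    for shard in shards:
        merged |= shard.search(query)
    return sorted(merged)
```

Because each URL lives on exactly one shard, the merge step is a plain union; in a real deployment the per-shard searches would run concurrently on separate machines rather than in a loop.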
>> I haven't read your code yet, obviously, but could you give us a hint as to what kind of data structure you use for indexing?
It is not very complicated; we use hashes a lot to simplify things. The index is basically a really large hash table mapping word_hash -> [list of url hashes]. Then if you search for "The lazy fox", we just take the intersection of the three lists of url hashes to get all the URLs that contain every word. This is the basic idea that is implemented right now, but we will of course try to improve it.
Details are here: https://github.com/alexandria-org/alexandria/blob/main/src/i...
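A minimal sketch of the word_hash -> [url hashes] idea in Python, for readers who want to see the shape of it. The `h64` helper (a 64-bit BLAKE2b digest), the `HashIndex` class, and whitespace tokenization are assumptions for illustration; the real project may hash and tokenize differently.

```python
import hashlib
from collections import defaultdict


def h64(text: str) -> int:
    """Hypothetical 64-bit hash; the real project's hash function may differ."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class HashIndex:
    def __init__(self) -> None:
        self.postings = defaultdict(set)  # word_hash -> set of url hashes
        self.urls = {}                    # url_hash -> original URL

    def add(self, url: str, text: str) -> None:
        url_hash = h64(url)
        self.urls[url_hash] = url
        for word in text.lower().split():
            self.postings[h64(word)].add(url_hash)

    def search(self, query: str) -> list:
        """Intersect the posting sets of all query words."""
        sets = [self.postings.get(h64(w), set()) for w in query.lower().split()]
        if not sets:
            return []
        sets.sort(key=len)  # start from the smallest set to keep work low
        hits = set(sets[0])
        for other in sets[1:]:
            hits &= other
        return sorted(self.urls[u] for u in hits)


index = HashIndex()
index.add("https://example.org/a", "The lazy fox sleeps")
index.add("https://example.org/b", "The quick fox runs")
print(index.search("The lazy fox"))
```

Only the first document contains all three words, so it is the only URL that survives the intersection.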
kreeben|4 years ago
josefcullhed|4 years ago
I actually don't know what roaring bitmaps are, please enlighten me :)
kreeben|4 years ago
There are algorithms optimized for intersect, union, and remove (AND, OR, NOT) that work extremely well on sorted lists. The usual problem is how to efficiently sort the lists you want to perform boolean operations on, so that you can then apply the roaring bitmap algorithms to them.
https://roaringbitmap.org/
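To illustrate why sorted input matters, here is a plain two-pointer sketch of the three boolean operations in Python. This is not the Roaring implementation, only the basic merge-style technique that sorted lists make possible; the function names are made up, and the inputs are assumed to be sorted and free of duplicates.

```python
def intersect_sorted(a: list, b: list) -> list:
    """AND: values present in both sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union_sorted(a: list, b: list) -> list:
    """OR: values present in either sorted list, still sorted."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out


def difference_sorted(a: list, b: list) -> list:
    """NOT: values in a that do not appear in b."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            j += 1
    out.extend(a[i:])
    return out
```

Each operation makes a single pass over both lists, so the cost is linear in their combined length once the lists are sorted.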