
lor_louis | 1 year ago

Yeah, I tend to work with string data in the 15 to 30 character range, so I'm also sceptical of German Strings when it comes to raw memory usage. What really interests me is that, in theory, an SSTable built from German Strings that point into a larger text buffer further down could result in fewer page faults during a binary search? Maybe?
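A minimal sketch of the idea, assuming the 16-byte German String layout (length, inline 4-byte prefix, offset into a shared text buffer); all names and layout details here are illustrative, not any particular engine's. The point is that the binary search only dereferences into the buffer when the inline prefixes tie:

```rust
// Hypothetical 16-byte handle: length + inline 4-byte prefix + offset
// into a shared text buffer (a sketch, not a real engine's layout).
#[derive(Clone, Copy)]
struct GermanStr {
    len: u32,        // total string length in bytes
    prefix: [u8; 4], // first 4 bytes, zero-padded, always inline
    offset: u64,     // where the full string lives in the buffer
}

fn make(buf: &mut Vec<u8>, s: &str) -> GermanStr {
    let bytes = s.as_bytes();
    let mut prefix = [0u8; 4];
    let n = bytes.len().min(4);
    prefix[..n].copy_from_slice(&bytes[..n]);
    let offset = buf.len() as u64;
    buf.extend_from_slice(bytes); // tail lives in the shared buffer
    GermanStr { len: bytes.len() as u32, prefix, offset }
}

// Compare a handle against a probe key, touching `buf` only when the
// 4-byte prefixes tie -- this is where the page-fault savings would come from.
fn cmp(buf: &[u8], g: &GermanStr, key: &[u8]) -> std::cmp::Ordering {
    let mut kp = [0u8; 4];
    let n = key.len().min(4);
    kp[..n].copy_from_slice(&key[..n]);
    match g.prefix.cmp(&kp) {
        std::cmp::Ordering::Equal => {
            let s = &buf[g.offset as usize..(g.offset as usize + g.len as usize)];
            s.cmp(key)
        }
        ord => ord,
    }
}

fn main() {
    let mut buf = Vec::new();
    let keys: Vec<GermanStr> =
        ["apple", "banana", "cherry"].iter().map(|s| make(&mut buf, s)).collect();
    // Most probes resolve on the inline prefix and never fault in `buf`.
    assert_eq!(keys.binary_search_by(|g| cmp(&buf, g, b"banana")), Ok(1));
}
```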


o11c | 1 year ago

I mean, if you're doing binary searches, the first thing to do is to switch to Eytzinger searches (at least for the prefixes; it's unclear whether Eytzinger-sorting large strings matters), which are much more cache-friendly. (Incidentally, Eytzinger was Austrian, not German - same language though!)
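For reference, a small sketch of an Eytzinger search over fixed-size key prefixes (here `u32`s standing in for 4-byte string prefixes, which is an assumption of mine, not something specified above). The sorted array is rewritten in BFS order of the implicit binary tree, so a search walks one cache-friendly path from the root:

```rust
// Lay out a sorted slice in Eytzinger (BFS) order, 1-indexed with
// slot 0 unused, via an in-order fill of the implicit tree.
fn eytzinger(sorted: &[u32]) -> Vec<u32> {
    let mut eytz = vec![0u32; sorted.len() + 1];
    fill(sorted, &mut eytz, &mut 0, 1);
    eytz
}

fn fill(sorted: &[u32], eytz: &mut [u32], i: &mut usize, k: usize) {
    if k < eytz.len() {
        fill(sorted, eytz, i, 2 * k);     // left subtree
        eytz[k] = sorted[*i];
        *i += 1;
        fill(sorted, eytz, i, 2 * k + 1); // right subtree
    }
}

// Branchless-style descent: children of node k are 2k and 2k+1.
// The final shift by (trailing ones + 1) recovers the lower bound.
fn lower_bound(eytz: &[u32], x: u32) -> Option<u32> {
    let mut k = 1usize;
    while k < eytz.len() {
        k = 2 * k + (eytz[k] < x) as usize;
    }
    k >>= k.trailing_ones() + 1;
    if k == 0 { None } else { Some(eytz[k]) }
}

fn main() {
    let eytz = eytzinger(&[10, 20, 30, 40, 50]);
    assert_eq!(lower_bound(&eytz, 35), Some(40));
}
```

Compared to a plain binary search, the hot path touches a predictable, prefetchable sequence of indices instead of jumping around the sorted array.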

Second, I'd probably switch to struct-of-arrays instead of array-of-structs, since the search part only needs the (head of the) key. Possibly some trie-like logic would also help (that is, avoid duplicating the common prefix and instead point to an array of continuations rather than a single one), but beware of costly indirections.
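A sketch of that struct-of-arrays point, under the assumption that the search key head is a 4-byte prefix and prefixes are unique (both simplifications of mine): the binary search streams only the dense prefix array, and the other columns are touched once, after the position is known.

```rust
// Array-of-structs: each probe drags the whole 16-byte entry through cache.
#[allow(dead_code)]
struct AosEntry {
    prefix: u32,
    len: u32,
    offset: u64,
}

// Struct-of-arrays: the search reads only `prefixes`; the parallel
// columns are indexed once the position is found.
struct SoaTable {
    prefixes: Vec<u32>, // sorted; this is all the binary search touches
    lens: Vec<u32>,
    offsets: Vec<u64>,
}

impl SoaTable {
    fn find(&self, prefix: u32) -> Option<(u32, u64)> {
        let i = self.prefixes.binary_search(&prefix).ok()?;
        Some((self.lens[i], self.offsets[i]))
    }
}

fn main() {
    let t = SoaTable {
        prefixes: vec![1, 5, 9],
        lens: vec![4, 10, 7],
        offsets: vec![0, 4, 14],
    };
    assert_eq!(t.find(5), Some((10, 4)));
}
```

With 4-byte prefixes, a 64-byte cache line now holds 16 candidate keys instead of 4 full entries.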

Third, I would make sure the user is able to specify the SSO limit, since different domains will have different requirements.
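One way to expose a user-specified SSO limit (a hypothetical API, not one from any library mentioned here) is a const-generic inline capacity, so each domain picks its own short-string cutoff at the type level:

```rust
// Hypothetical small-string type with a caller-chosen inline capacity N.
// Assumes N <= 255 so the inline length fits in a u8.
enum SmallStr<const N: usize> {
    Inline { len: u8, buf: [u8; N] },
    Heap(String),
}

impl<const N: usize> SmallStr<N> {
    fn new(s: &str) -> Self {
        if s.len() <= N {
            let mut buf = [0u8; N];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SmallStr::Inline { len: s.len() as u8, buf }
        } else {
            SmallStr::Heap(s.to_string())
        }
    }

    fn is_inline(&self) -> bool {
        matches!(self, SmallStr::Inline { .. })
    }

    fn as_str(&self) -> &str {
        match self {
            SmallStr::Inline { len, buf } => {
                std::str::from_utf8(&buf[..*len as usize]).unwrap()
            }
            SmallStr::Heap(s) => s,
        }
    }
}

fn main() {
    // A 15-30 char workload might pick a larger cutoff than the usual 12.
    assert!(SmallStr::<24>::new("a twenty-char string").is_inline());
    assert!(!SmallStr::<12>::new("a twenty-char string").is_inline());
}
```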

Fourth ... if you're column-oriented (or for simple data in the first place), consider implicit union of different `Set` (or whatever) implementations. For example, store all short strings in one array, and all long strings in a different array.
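A sketch of that implicit-union idea, with an 8-byte cutoff picked arbitrarily for illustration: one set type, two backing arrays, and the probe's length selects which array to search. (Zero-padding the short keys assumes the strings contain no NUL bytes.)

```rust
// One logical set, physically split by length: short strings live as
// fixed-size, zero-padded arrays; long strings as owned Strings.
struct SplitSet {
    shorts: Vec<[u8; 8]>, // sorted, len <= 8, zero-padded
    longs: Vec<String>,   // sorted
}

impl SplitSet {
    fn new(items: &[&str]) -> Self {
        let mut shorts = Vec::new();
        let mut longs = Vec::new();
        for s in items {
            if s.len() <= 8 {
                let mut k = [0u8; 8];
                k[..s.len()].copy_from_slice(s.as_bytes());
                shorts.push(k);
            } else {
                longs.push(s.to_string());
            }
        }
        shorts.sort();
        longs.sort();
        SplitSet { shorts, longs }
    }

    // The probe's length decides which array to search -- the "union"
    // is implicit in the dispatch, not stored per element.
    fn contains(&self, s: &str) -> bool {
        if s.len() <= 8 {
            let mut k = [0u8; 8];
            k[..s.len()].copy_from_slice(s.as_bytes());
            self.shorts.binary_search(&k).is_ok()
        } else {
            self.longs.binary_search_by(|x| x.as_str().cmp(s)).is_ok()
        }
    }
}

fn main() {
    let set = SplitSet::new(&["id", "tiny", "a considerably longer string"]);
    assert!(set.contains("tiny"));
    assert!(!set.contains("missing"));
}
```

The short-string array gets the dense, pointer-free layout, and the long-string array pays the indirection cost only for the elements that actually need it.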

lor_louis | 1 year ago

I did not know about Eytzinger-sorting; I'll look into it after my vacation, thanks! And yeah, our current system is column-oriented, and I've already tried most of your recommendations (including tries), but the biggest limitation we face is that different kinds of queries are better served by different kinds of data structures, and you can only store a few of them on disk before the cost of indexing becomes too big.