(no title)
lor_louis | 1 year ago
<https://cedardb.com/blog/german_strings/>
You could probably write a C++ implementation based on the article.
Note that it's not all that useful if you don't plan on searching for text based on their prefix. From my understanding, this is mostly a better way to store SStables in RAM/partially on disk if you mmap.
o11c|1 year ago
That said, note that there are a lot of strings around 20 bytes (stringified 64-bit integers or floats), so pushing to 24-1 is a reasonable choice too.
I'd use 3 size classes:
* 0-15 bytes, stored in-line. If you need C-string compatibility (which is fairly likely somewhere in your application), ensure that size 15 is encoded as a zero at the last byte.
* up to 3.75GB (if my mental math is right), stored with a mangled 32-bit size. Alternatively you could use a simpler cutoff if it makes the mangling easier. Another possibility would be to have a 16-bit size class.
* add support for very large strings (likely with a completely different allocation method) too; a 4GB limit sucks and is easily exceeded. If need be this can use an extra indirection.
Honestly, with a 16-byte budget, I'd consider spending more of that on the prefix - you can get 8 bytes with enough contortion elsewhere.
Duplicating the prefix is probably worth it in more cases than you might think, since it does speed up comparisons. Just remember, you have to check the size to distinguish "a\0" from "a" too, not just from "a\0\0\0\0".
NikkiA|1 year ago
As for short strings, they don't have a prefix at all - the 12 bytes simply follow the 4 bytes of the length.
Long strings will need the pointer 64bit aligned, so a 4 byte length means you'd have 4 bytes wasted to the pointer alignment anyway, and you fill those with the preview 'prefix'. dword length + 4 bytes + 64bit address = 16 bytes, again. They both occupy the same space in the cache, and only the data at the other end of a pointer on long strings gets pulled into cache if you decide the prefix matches and you need to follow it.
lor_louis|1 year ago
twic|1 year ago
danvonk|1 year ago