top | item 46556185


mac3n | 1 month ago

see https://gitlab.com/mac3n/ksip - binary search on mmap'd sorted text files, no index needed



agavra | 1 month ago

this is pretty different but reminds me of https://en.wikipedia.org/wiki/Bitcask - if you're storing it all in memory why not just use a hash index?

mac3n | 1 month ago

The trick is, it's not all in memory - it's a memory-mapped file. If you look at the cache (with `fincore` or similar) you'll see that the binary search only loads the pages it examines, roughly logarithmic in the file size.

And a text file is the most useful general format - easy to write, easy to process with standard tools.

I've used this in the past on data sets of hundreds of millions of lines, maybe billions.

It's also true that you could use a memory-mapped indexed file for faster searches - I've used sqlite for this.
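The mmap'd binary-search trick can be sketched roughly like this (this is not the actual ksip code; the tab-separated key/value layout and the file contents are made up for illustration, and the file is assumed to end with a newline):

```python
import mmap
import os
import tempfile

def find_ge(mm, key):
    """Return the offset of the first line >= key in a sorted,
    newline-terminated, memory-mapped file (len(mm) if none).
    Each probe only touches the page(s) around `mid`, so roughly
    log2(filesize) pages get faulted in."""
    lo, hi = 0, len(mm)  # invariant: lo and hi are line starts (or EOF)
    while lo < hi:
        mid = (lo + hi) // 2
        start = mm.rfind(b"\n", 0, mid) + 1  # start of line containing mid
        end = mm.find(b"\n", start)          # end of that line
        if mm[start:end] < key:
            lo = end + 1   # answer starts after this line
        else:
            hi = start     # this line could be the answer
    return lo

def lookup(mm, key):
    """Exact-match lookup: the line whose first tab-separated field is key."""
    off = find_ge(mm, key)
    if off < len(mm):
        line = mm[off:mm.find(b"\n", off)]
        if line == key or line.startswith(key + b"\t"):
            return line
    return None

# Demo on a throwaway sorted file (made-up data).
path = os.path.join(tempfile.mkdtemp(), "sorted.txt")
with open(path, "wb") as f:
    f.write(b"apple\t1\nbanana\t2\ncherry\t3\n")
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
```

The point about `fincore` falls out of the structure: only the lines the probes land on get read, so the page cache ends up holding O(log n) pages rather than the whole file.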

mac3n | 1 month ago

what this could really use is a compression format that compresses a variable amount of text into fixed-size blocks. with that, it could binary-search compressed text

agavra | 1 month ago

RocksDB actually does something somewhat similar with its prefix compression. It prefix-compresses keys and then "resets" the prefix compression every N records, storing a mapping of reset point -> offset so you can skip across compressed records. It's pretty neat
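A rough sketch of that restart-point scheme (not RocksDB's actual wire format - the single-byte length fields assume keys under 256 bytes, and all names and data here are made up): binary-search the restart offsets, whose keys are stored in full, then decode forward from the nearest one.

```python
def common_prefix_len(a, b):
    """Length of the shared prefix of two byte strings."""
    i, n = 0, min(len(a), len(b))
    while i < n and a[i] == b[i]:
        i += 1
    return i

def encode_block(keys, restart_interval=3):
    """Prefix-compress sorted keys. Every `restart_interval`-th key is
    stored in full (a restart point) and its offset recorded."""
    out, restarts, prev = bytearray(), [], b""
    for i, key in enumerate(keys):
        if i % restart_interval == 0:
            restarts.append(len(out))
            shared = 0                        # restart: store the full key
        else:
            shared = common_prefix_len(prev, key)
        unshared = key[shared:]
        out += bytes([shared, len(unshared)]) + unshared
        prev = key
    return bytes(out), restarts

def restart_key(data, off):
    ulen = data[off + 1]                      # shared length is 0 at a restart
    return data[off + 2:off + 2 + ulen]

def block_contains(data, restarts, target):
    """Binary-search restart points, then decode forward linearly."""
    lo, hi = 0, len(restarts)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if restart_key(data, restarts[mid]) <= target:
            lo = mid
        else:
            hi = mid
    off, prev = restarts[lo], b""
    while off < len(data):
        shared, ulen = data[off], data[off + 1]
        key = prev[:shared] + data[off + 2:off + 2 + ulen]
        if key == target:
            return True
        if key > target:      # sorted, so we've gone past it
            return False
        prev, off = key, off + 2 + ulen
    return False

data, restarts = encode_block(
    [b"apple", b"applet", b"apply", b"banana", b"band", b"bandana", b"cherry"])
```

The restart array is what makes the block seekable: without it, a single bad byte of shared-prefix state would force decoding from the very first key, and binary search over the compressed data would be impossible.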