jlokier | 28 days ago
Data was compressed in interior and exterior trees, where interior trees were the data structure inside blocks (similar to B-tree block contents), and exterior trees were the structure between blocks (similar to B-tree block pointers, though it wasn't actually a B-tree but something more exotic, for performance).
Each node provided compression context for its child nodes, while also being compressed itself using its parent's context.
As you can imagine, the compression contexts had to be tiny because they were everywhere. But you can compress compression contexts very well :-)
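The original structure isn't shown, so as a rough illustration of the parent-as-context idea only, here's a sketch using zlib preset dictionaries: each node's bytes serve as the "compression context" for compressing its children. The node contents and helper names are made up; the real design was presumably far more compact than a general-purpose deflate dictionary.

```python
import zlib

def compress_with_context(data: bytes, context: bytes) -> bytes:
    # The parent node's bytes act as a preset dictionary, so a child
    # that resembles its parent compresses to very few bytes.
    c = zlib.compressobj(level=9, zdict=context)
    return c.compress(data) + c.flush()

def decompress_with_context(blob: bytes, context: bytes) -> bytes:
    # Decompression needs the same context, i.e. you walk down from
    # the root, decompressing each node with its parent's bytes.
    d = zlib.decompressobj(zdict=context)
    return d.decompress(blob) + d.flush()

# Hypothetical node contents: a child record nearly identical to its parent.
parent = b"timestamp=2021-01-01T00:00:00Z key=sensor/7 value=0"
child = b"timestamp=2021-01-01T00:00:01Z key=sensor/7 value=42"
packed_child = compress_with_context(child, parent)
```

With the parent as context, the child compresses to a fraction of what it would cost standalone, which is the effect being described: the contexts are everywhere, but they pay for themselves.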
Using compression in this way removed a few special cases that are typically required. For example, there was no need for separate large-BLOB storage, special handling of long keys, or even fixed-size integers, because they all fell out naturally from the compression scheme instead.
The compression algorithms I explored had some interesting emergent properties that weren't explicitly coded, although they were expected by design. Some values encoded in close to zero bits, so, for example, a million rows could take less than a kilobyte if the pattern in the data allowed. Sequence processing behaved similarly to loops over fast run-length encodings, without that actually being coded, and without any lengths in the stored representation. Fields and metadata could be added to records and ranges anywhere without taking space when not used, not even a single bit for a flag, which meant adding any number of rarely-used fields and metadata was effectively free, and it could be done near-instantly to a whole database.
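The "not even a single bit" claim sounds impossible with flag bits, but it's ordinary behavior for entropy coding: an event with probability near 1 costs almost nothing. A back-of-the-envelope check, assuming (my assumption, not the author's stated model) an adaptive coder that gives "this optional field is absent" probability 1 - 10^-6:

```python
import math

def bits(p: float) -> float:
    # Information content of an event with probability p.
    return -math.log2(p)

p_absent = 1 - 1e-6                        # field present in ~1 row per million
cost_per_row = bits(p_absent)              # bits to encode "not here" for one row
total_bytes = cost_per_row * 1_000_000 / 8 # cost across a million rows
```

That works out to well under one byte for the entire million-row table, which is how a rarely-used field can be "effectively free" without any flag bit in the stored representation.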
Snapshots were also nearly free, with deltas emerging naturally from the compression context relationships, allowing fast history and branches. Finally, it was able to bulk-delete large time ranges and branches of historical data near instantly.
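The original snapshot mechanism isn't described in detail, but the cheap-snapshots and near-instant bulk-delete properties are what you get from copy-on-write trees with structural sharing. A minimal sketch (my stand-in names, not the original design):

```python
class Node:
    # Immutable tree node: a snapshot is just a root pointer.
    __slots__ = ("key", "value", "children")

    def __init__(self, key, value, children=None):
        self.key, self.value = key, value
        self.children = children or {}

def set_path(root, path, value):
    """Return a new root reflecting the write; every subtree not on
    the path is shared, by identity, with the old version."""
    if not path:
        return Node(root.key, value, root.children)
    head, rest = path[0], path[1:]
    child = root.children.get(head, Node(head, None))
    new_children = dict(root.children)
    new_children[head] = set_path(child, rest, value)
    return Node(root.key, root.value, new_children)

# Each write produces a new version; old roots remain valid snapshots.
v1 = Node("root", None)
v2 = set_path(v1, ("a", "b"), 1)
v3 = set_path(v2, ("a", "c"), 2)
```

Only the nodes along the written path are copied, so a snapshot costs a handful of nodes, and dropping a branch or time range is just unlinking one subtree pointer and letting the rest stay shared.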
The design had a lot of things going for it, including IOPS performance for random and sequential access, fast writes as well as reads, and excellent compression.
I'd like to revive the idea when I have time. I'm thinking of adding a neural network this time to see how much it might improve compression efficiency, or perhaps implementing a filesystem with it to see how it behaves under those conditions.