mwlon's comments

mwlon | 2 years ago

Numerical data is full of rich patterns, but the general-purpose compressors we've historically used on it (e.g. snappy, gzip, zstd) are designed for unstructured, string-like data. Pcodec (or pco) is a new approach for numerical sequences that gets better compression ratio and decompression speed than the alternatives. It usually improves compression ratio substantially, given the same compression time. Plus, it's built to perform on all common CPU architectures, decompressing at around 1-4 GB/s.

You might have seen me post about Quantile Compression in previous years. Pco is its successor! Pco gets slightly better compression ratio, robustly handles more types of data, and (most importantly) decompresses much faster.

If you're interested in using it, there's a Rust API, Python (PyO3) API, and a CLI.
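For a feel of the Rust API, here's a minimal round-trip sketch. It assumes the crate's standalone module with simple_compress / simple_decompress and a default ChunkConfig; exact names and signatures can vary between versions, so treat it as illustrative and check docs.rs/pco.

```rust
// Minimal sketch of compressing and decompressing a slice with pco.
// Assumes the standalone module's simple_compress / simple_decompress
// helpers and ChunkConfig; verify signatures against the crate docs.
use pco::standalone::{simple_compress, simple_decompress};
use pco::ChunkConfig;

fn main() -> pco::errors::PcoResult<()> {
    // Numerical data with structure: a noisy ramp of i64s.
    let nums: Vec<i64> = (0..100_000).map(|i| i * 1_000 + (i % 7)).collect();

    // Compress the whole slice into a standalone pco byte buffer.
    let compressed = simple_compress(&nums, &ChunkConfig::default())?;
    println!("{} numbers -> {} bytes", nums.len(), compressed.len());

    // Decompress back into a Vec<i64> and confirm the round trip is lossless.
    let recovered: Vec<i64> = simple_decompress(&compressed)?;
    assert_eq!(recovered, nums);
    Ok(())
}
```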

mwlon | 4 years ago | on: PancakeDB Is Now Free

PancakeDB is an event ingestion solution, an important part of most tech companies' data stacks. Write to it one event at a time, and process billions at a time with the Spark connector or other tools.

I've released it under BSL so that any company can run it on their own servers for free.

mwlon | 4 years ago | on: PancakeDB offers columnar reads 30% faster than Parquet

It is a new startup I'm building.

It's a new type of database that ingests streaming data with fast (~10ms) response times and serves batch reads at high throughput. To do that, it uses a new columnar file format and compression algorithm, which together make its columnar files 30-50% smaller than Parquet's under most circumstances while decoding just as quickly. That means storage costs are lower, and reads are 30+% faster assuming the same network bandwidth is used to transfer the data for all columns. And that's a pessimistic scenario, since most queries have a `select column_0, column_1, ...` clause that PancakeDB can leverage better than Parquet, transferring only the exact columns needed!

You can find edge cases (e.g. very long strings of uniformly random bytes) where it's only a few % faster instead of 30%, but in every real-world-resembling scenario I've tried, the advantage is much greater.

mwlon | 4 years ago | on: New, better compression for columns of numerical data

I made this open source compression algorithm as part of a database I'm creating. It typically compresses columns of numerical data to ~25% smaller than alternatives (think .snappy.parquet or .gzip.parquet) at similar or cheaper compute cost. It decompresses 15-100 million 64-bit numbers per second on a single i5 CPU.

I also made a blog post that introduces the idea more from the math perspective: https://graphallthethings.com/posts/quantile-compression
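For a rough idea of usage, here's a minimal round-trip sketch against the Rust crate (published as q_compress). The auto_compress / auto_decompress helpers and DEFAULT_COMPRESSION_LEVEL shown are my best recollection of its API and may not match every version, so treat the names as assumptions.

```rust
// Illustrative sketch of the q_compress crate's high-level helpers;
// names and signatures are assumed and may differ by version.
use q_compress::{auto_compress, auto_decompress, DEFAULT_COMPRESSION_LEVEL};

fn main() {
    // A column with a strong pattern: evenly spaced timestamp-like integers.
    let column: Vec<i64> = (0..1_000_000).map(|i| 1_600_000_000 + i * 5).collect();

    // Compress at the default level (higher levels trade compute for ratio).
    let bytes = auto_compress(&column, DEFAULT_COMPRESSION_LEVEL);
    println!("compressed {} i64s into {} bytes", column.len(), bytes.len());

    // Decompress and verify the round trip is lossless.
    let recovered = auto_decompress::<i64>(&bytes).expect("invalid bytes");
    assert_eq!(recovered, column);
}
```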
