This optimization should provide dramatic speed-ups when taking random samples from massive data sets, especially when the wanted columns can contain large values. That's because the basic SQL recipe relies on a LIMIT clause to determine which rows are in the sample (see query below), and this new optimization promises to defer reading the big columns until the LIMIT clause has filtered the data set down to a tiny number of lucky rows.
SELECT *
FROM Population
WHERE weight > 0
ORDER BY -LN(1.0 - RANDOM()) / weight
LIMIT 100 -- Sample size.
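Why that key gives a weighted sample: -LN(1 - U)/weight is an exponential variate with rate equal to the row's weight, so the k rows with the smallest keys form a weight-proportional sample without replacement (the same trick as the Efraimidis-Spirakis keys). A rough Python sketch of the recipe's logic - mine, not anything from ClickHouse:

import math, random

def weighted_sample(rows, k):
    # rows: iterable of (item, weight) pairs with weight > 0
    # key = -ln(1 - U) / weight  ~  Exponential(rate=weight), like the ORDER BY above
    keyed = [(-math.log(1.0 - random.random()) / w, item) for item, w in rows]
    keyed.sort()                              # ORDER BY key ascending
    return [item for _, item in keyed[:k]]    # LIMIT k

# Heavier rows are picked proportionally more often:
print(weighted_sample([("a", 1.0), ("b", 5.0), ("c", 0.1)], k=2))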
Can anyone from ClickHouse verify that the lazy-materialization optimization speeds up queries like this one? (I want to make sure the randomization in the ORDER BY clause doesn't prevent the optimization.)
EXPLAIN plan actions = 1
SELECT *
FROM amazon.amazon_reviews
WHERE helpful_votes > 0
ORDER BY -log(1 - (rand() / 4294967296.0)) / helpful_votes
LIMIT 3
Note that there is a setting query_plan_max_limit_for_lazy_materialization (default value 10) that controls the max n for which lazy materialization kicks in for LIMIT n.
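Since the sampling query above uses LIMIT 100, the default threshold of 10 would presumably keep lazy materialization from kicking in unless that setting is raised. A hedged sketch of what that could look like from Python - assuming the clickhouse_connect driver and the amazon_reviews table from the EXPLAIN above; untested:

import clickhouse_connect

# Connection details are placeholders.
client = clickhouse_connect.get_client(host="localhost", username="default", password="")

sample_sql = """
SELECT *
FROM amazon.amazon_reviews
WHERE helpful_votes > 0
ORDER BY -log(1 - (rand() / 4294967296.0)) / helpful_votes
LIMIT 100
"""

# Raise the threshold so a LIMIT 100 still qualifies for lazy materialization
# (per the note above, the default cutoff is 10).
result = client.query(
    sample_sql,
    settings={"query_plan_max_limit_for_lazy_materialization": 100},
)
print(len(result.result_rows))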
The optimization should work well for your sampling query since the ORDER BY and LIMIT operations would happen before materializing the large columns, but the randomization function might force early evaluation - worth benchmarking both approaches.
I really like Clickhouse. Discovered it recently, and man, it's such a breath of fresh air compared to suboptimal solutions I used for analytics. It's so fast and the CLI is also a joy to work with.
I always dismissed ClickHouse because it's all super low level. Building a reliable system out of it requires a lot of internal knowledge. This is the only DB I know where you will have to deal with actual files on disk in case of problems.
However, I managed to look past that, and oh my god it is so fast. It's like the tool is optimized for raw speed and whatever you do with it is up to you.
Same here. I come from a strong Postgres and Microsoft SQL Server background and I was able to get up to speed with it, ingesting real data from text files, in an afternoon. I was really impressed with the docs as well as the performance of the software.
I remember a few years ago when the view on Clickhouse was that it was somewhat "legacy", "bulky", and used by "the big guys", and there wasn't much discussion or many opinions of it in spaces like this. Seems like it's come a long way.
Unrelated to the new materialization option, this caught my eye:
"this query sorts all 150 million values in the helpful_votes column (which isn’t part of the table’s sort key) and returns the top 3, in just 70 milliseconds cold (with the OS filesystem cache cleared beforehand) and a processing throughput of 2.15 billion rows/s"
I clearly need to update my mental model of what might be a slow query against modern hardware and software. Looks like that's so fast because in a columnar database it only has to load that 150 million value column. I guess sorting 150 million integers in 70ms shouldn't be surprising.
(Also "Peak memory usage: 3.59 MiB" for that? Nice.)
This is a really great article - very clearly explained, good diagrams, I learned a bunch from it.
> I guess sorting 150 million integers in 70ms shouldn't be surprising.
I find sorting 150M integers at all to be surprising. The query asks for finding the top 3 elements and returning those elements, sorted. This can be done trivially by keeping the best three found so far and scanning the list. This should operate at nearly the speed of memory and use effectively zero additional storage. I don’t know whether Clickhouse does this optimization, but I didn’t see it mentioned.
Generically, one can find the kth best of n elements in time O(n):
https://en.m.wikipedia.org/wiki/Selection_algorithm
And one can scan again to find the top k, plus some extra if the kth best wasn’t unique, but that issue is manageable and, I think, adds at most a factor of 2 overhead if one is careful (collect up to k elements that compare equal to the kth best and collect up to k that are better than it). Total complexity is O(n) if you don’t need the result sorted or O(n + k log k) if you do.
If you’re not allowed to mutate the input (which probably applies to Clickhouse-style massive streaming reads), you can collect the top k in a separate data structure, and straightforward implementations are O(n log k). I wouldn’t be surprised if using a fancy heap or taking advantage of the data being integers with smallish numbers of bits does better, but I haven’t tried to find a solution or disprove the existence of one.
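A minimal sketch of that streaming top-k approach, just to make the O(n log k) version concrete (my illustration; no idea whether ClickHouse's implementation looks anything like this):

import heapq

def top_k(values, k=3):
    # Single pass, O(k) extra space: keep a min-heap of the best k seen so far.
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)    # evict the smallest of the current top k
    return sorted(heap, reverse=True)     # O(k log k) to hand them back sorted

print(top_k([7, 41, 1995, 12, 358, 358, 2], k=3))  # [1995, 358, 358]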
Let's do a back of the envelope calculation. 150M u32 integers are 600MB. Modern SSD can do 14,000MB/s sequential read [1]. So reading 600MB takes about 600MB / 14,000MB/s = 43ms.
Memory like DDR4 can do 25GB/s [2]. It can go over 600MB in 600MB / 25,000MB/s = 24ms.
L1/L2 can do 1TB/s [3]. There are 32 CPUs, so that's roughly 32TB/s of L1/L2 bandwidth. 600MB can be processed at 32TB/s in 0.018ms. With a 3ms budget, they can process the 600MB of data 166 times.
The rank selection algorithms like QuickSelect and Floyd-Rivest have O(N) complexity. It's entirely possible to process 600MB in 70ms.
[1] https://www.tomshardware.com/features/ssd-benchmarks-hierarc...
[2] https://www.transcend-info.com/Support/FAQ-292
[3] https://www.intel.com/content/www/us/en/developer/articles/t...
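The same arithmetic in a few lines of Python, for anyone who wants to poke at the assumptions:

# Back-of-envelope check using the figures above: 600MB of u32s, 14,000MB/s SSD,
# 25,000MB/s DDR4, and 32 cores at roughly 1TB/s of L1/L2 each.
data_mb = 150_000_000 * 4 / 1_000_000                        # 150M u32 values -> 600 MB
print(f"SSD read:    {data_mb / 14_000 * 1000:.1f} ms")      # ~42.9 ms
print(f"DRAM pass:   {data_mb / 25_000 * 1000:.1f} ms")      # ~24.0 ms
print(f"L1/L2 pass:  {data_mb / 32_000_000 * 1000:.4f} ms")  # ~0.0188 ms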
Strong and up to date intuition on "slow vs. fast" queries is an underrated software engineering skill. Reading blogs like this one is worth it just for that alone.
Slow VMs on overprovisioned cloud hosts which cost as much per month as a dedicated box per year have broken a generation of engineers.
You could host so much from your macbook. The average HN startup could be hosted on a $200 minipc from a closet for the first couple of years, if not more - and I'm talking expensive here, for the extra RAM you want so you don't have to restart every hour when you have a memory leak.
Same thing with columnar/vectorized execution. It has been known for a long time that it's the "correct" way to process data for OLAP workflows, but it only became "mainstream" in the last few years (mostly due to Arrow).
It's awesome that clickhouse is adopting it now, but a shame that it's not standard on anything that does analytics processing.
IMHO if ClickHouse had a native Windows release that does not need WSL or a Linux virtual machine, it would be more popular than DuckDB. I remember MySQL being way more popular than PostgreSQL for years, one of the reasons being that MySQL had a Windows installer.
>Despite the airport drama, I’m still set on that beach holiday, and that means loading my eReader with only the best.
What a nice touch. Technical information and diagrams in this were top notch, but the fact there was also some kind of narrative threaded in really put it over the top for me.
Has anyone compared ClickHouse and StarRocks[0]? Join performance seemed a lot better on StarRocks a few months ago, but I'm not sure if that still holds true.
[0] https://www.starrocks.io/
It's quite amazing how a db like this shows that all of those row-based dbs are doing something wrong; they can't even approach these speeds with btree index structures. I know they like transactions more than Clickhouse, but it's just amazing to see how fast modern machines are, billions of rows per second.
They're not "doing something wrong". They are designed differently for different target workloads.
Row-based -> OLTP -> "Fetch the entire records from order table where user_id = XYZ"
Column-based -> OLAP -> "Compute the total amount of orders from the order table grouped by month/year"
But credit where it is due, obviously clickhouse is an industry leader.
I'm pretty sure they did not even bother to properly compress the dataset; with some tweaking, it could probably have been much smaller than 30GB. The speed shows that reading the data is slower than decompressing it.
Reminds me of that Cloudflare article where they had a similar idea about encryption being free (slower to read than to decrypt) and finding a bug that, when fixed, materialized this behavior.
Obvious solutions are often hard to do right. I bet the code that was needed to pull this off is either very complex or took a long time to write (and test). Or both.
This is a well-known class of optimization and the literature term is “late materialization”. It is a large set of strategies including this one. Late materialization is about as old as column stores themselves.
https://dspace.mit.edu/bitstream/handle/1721.1/34929/MIT-CSA...
God, clickhouse is such great software, if only it was as ergonomic as duckdb, and management wasn't doing some questionable things (deleting references to competitors in GH issues, weird legal letters, etc.)
The CH contributors are really stellar, from multiple companies (Altinity, Tinybird, Cloudflare, ClickHouse)
Contribution is strong and varied enough that I think we're good for the long term.
chdb and clickhouse-local are nearly as ergonomic as duckdb with all of the features of ch.
https://clickhouse.com/blog/chdb-embedded-clickhouse-rocket-...
The compute engine (chdb) is a wonder to use.
duckdb has unfortunately been leaning away from pure oss - the ui they released is entirely hosted on motherduck's servers (which, while an awesome project, makes me feel like the project will be cannibalized by proprietary extensions.)
We adopted ClickHouse ~4 years ago. We COULD have stayed on just Postgres. With a lot of bells, whistles, aggregation, denormalisation, aggressive retention limits and job queues etc. we could have gotten acceptable response times for our interactive dashboard.
But we chose ClickHouse and now we just pump in data with little to no optimization.
We migrated some analytics workloads from postgres to clickhouse last year, it's crazy how fast it is. It feels like alien technology from the future in comparison.
I imagine with Postgres there's also an option of using a plugin like Greenplum or something else, which may help to bridge the gap, but probably not to the level of ClickHouse.
This is how we consume Langfuse traces!