sdairs|4 months ago
Pretty big caveat: 5 seconds AFTER all data has been loaded into memory; over 2 minutes if you also factor in reading the files from S3 and loading them into memory. So to get this performance you will need to run hot: 4000 CPUs and 30TB of memory running 24/7.
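The numbers quoted above imply some striking throughput figures. A quick back-of-envelope sketch, assuming the One Trillion Row Challenge dataset of 1e12 rows (the 5-second and 2-minute timings come from the comment above):

```python
# Back-of-envelope throughput for the figures quoted above.
# Assumes the 1TRC dataset: one trillion rows.
ROWS = 1_000_000_000_000
CPUS = 4000

hot_seconds = 5     # query time once data is already in memory
cold_seconds = 120  # "over 2 minutes" including the S3 read/load

rows_per_sec_hot = ROWS / hot_seconds        # rows scanned per second, hot
rows_per_sec_cold = ROWS / cold_seconds      # effective rate including load
per_cpu_hot = rows_per_sec_hot / CPUS        # per-core scan rate, hot

print(f"hot:  {rows_per_sec_hot:.1e} rows/s ({per_cpu_hot:.1e} per CPU)")
print(f"cold: {rows_per_sec_cold:.1e} rows/s")
```

So the "hot" number works out to roughly 2e11 rows/s overall, or about 5e7 rows/s per CPU, while the end-to-end cold number drops to under 1e10 rows/s.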
philbe77|4 months ago
It is true you would need to run the instance(s) 24/7 to get that performance all day, and a startup time of over a couple of minutes is not ideal. We have a lot of work to do on the engine, but it has been a fun learning experience...
otterley|4 months ago
It’s important to know what you are benchmarking before you start and to control for extrinsic factors as explicitly as possible.
sdairs|4 months ago
Are you looking at distributed queries directly over S3? We did this in ClickHouse and can do instant virtual sharding over large data sets in S3. We call it parallel replicas: https://clickhouse.com/blog/clickhouse-parallel-replicas
jamesblonde|4 months ago
You can find those disks on Hetzner. Not AWS, though.
CaptainOfCoit|4 months ago
For background, here is the initial ideation of the "One Trillion Row Challenge" challenge this submission originally aimed to participate in: https://docs.coiled.io/blog/1trc.html
mey|4 months ago
In theory a Zen 5 / Epyc Turin box can have up to 4TB of RAM. So how would a more traditional non-clustered DB stand up?
1000 k8s pods, each with 30GB of RAM; there has to be a fair bit of overhead/wastage going on.
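The arithmetic in the two comments above can be checked quickly: 1000 pods at 30GB apiece matches the 30TB total quoted upthread, and a small sketch shows how many 4TB Epyc Turin hosts the same memory would consolidate onto (the consolidation comparison is an illustration, not a claim from the thread):

```python
import math

# 1000 k8s pods x 30 GB each, versus hypothetical 4 TB single hosts.
pods = 1000
gb_per_pod = 30

cluster_tb = pods * gb_per_pod / 1000     # total cluster memory in TB
hosts_needed = math.ceil(cluster_tb / 4)  # 4 TB of RAM per Epyc Turin host

print(cluster_tb, hosts_needed)
```

That comes out to 30TB total, or eight 4TB hosts, before accounting for any per-pod overhead.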
mulmen|4 months ago
BigQuery is comparable to DuckDB. I’m curious how the various Redshift flavors (provisioned, serverless, spectrum) and Spark compare.
I don’t have a lot of experience with DuckDB but it seems like Spark is the most comparable.