hi sdairs, we did store the data on the worker nodes for the challenge, but not in memory. We wrote the data to the local NVMe SSD storage on the node. Linux may cache the filesystem data, but we didn't load the data directly into memory. We like to preserve the memory for aggregations, joins, etc. as much as possible...It is true you would need to run the instance(s) 24/7 to get the performance all day, the startup time over a couple minutes is not ideal. We have a lot of work to do on the engine, but it has been a fun learning experience...
otterley|4 months ago
It’s important to know what you are benchmarking before you start and to control for extrinsic factors as explicitly as possible.
sdairs|4 months ago
Are you looking at distributed queries directly over S3? We did this in ClickHouse and can do instant virtual sharding over large data sets S3. We call it parallel replicas https://clickhouse.com/blog/clickhouse-parallel-replicas
tanelpoder|4 months ago
jamesblonde|4 months ago
You can find those disks on Hetzner. Not AWS, though.
jiggawatts|4 months ago
Not to mention that Azure now exposes local drives as raw NVMe devices mapped straight through to the guest with no virtualisation overheads.