top | item 38380307

(no title)

adren123 | 2 years ago

An initial import with DuckDB from all the 15 files takes only 36 seconds on a regular (6 years old) desktop computer with 32GB of RAM and 26 seconds (5 times quicker than QuestDB) on a Dell PowerEdge 450 with 20 cores Intel Xeon and 256GB of RAM.

Here is the command to input the files:

CREATE TABLE ecommerce_sample AS SELECT * from read_csv_auto('ecommerce_*.csv');

discuss

order

supercoco9|2 years ago

Thank you! (original article writer here)

DuckDB is awesome. A couple of comments here. First of all, this is totally my fault, as I didn't explain it properly.

I am trying to simulate performance for streaming ingestion, which is the typical use case for QuestDB, Clickhouse, and Timescale. The three of them can do batch processing as well, but they shine when data is coming at high throughput in real time. So, while the data is presented on CSV (for compatibility reasons), I am reading line after line, and sending the data from a python script using the streaming-ingestion API exposed by those databases.

I guess the equivalent in DuckDB would be writing data via INSERTS in batches of 10K records, which I am sure would still be very performant!

The three databases on the article have more efficient methods for ingesting batch data (in QuestDB's case, the COPY keyword, or even importing directly from a Pandas dataframe using our Python client, would be faster than ingesting streaming data). I know Clickhouse and Timescale can also efficiently ingest CSV data way faster than sending streaming inserts.

But that's not the typical use case, or the point of the article, as removing duplicates on batch is way easier than on streaming. I should have made that clearer and will probably update it, so thank you for your feedback.

Other than that, I ran the batch experiment you mention on the box I used from the article (had already done it in the past actually) and the performance I am getting is 37 seconds, which is slower than your numbers. The reason here is that we are using a cloud instance using an EBS drive, and those are slower than a local SSD on your laptop.

You can use local drives on AWS and other cloud providers, but those are way more expensive and have no persistence or snapshots, so not ideal for a database (we use them sometimes for large one-off imports to store the original CSV and speeding up reads while writing to the EBS drive).

Actually one of the claims in DuckDB's entourage is that "big data is dead". With the power in a developer's laptop today you can do things faster than with cloud at unprecedented scale. DuckDB is designed to run locally on a data scientist machine, rather than on a remote server (of course you have Motherduck if you want to do remote, but you are now adding latency and a proprietary layer on top).

Once again thank you for your feedback!