adren123 | 2 years ago
Here is the command to input the files:
CREATE TABLE ecommerce_sample AS SELECT * FROM read_csv_auto('ecommerce_*.csv');
supercoco9 | 2 years ago
DuckDB is awesome. A couple of comments here. First of all, this is totally my fault, as I didn't explain it properly.
I am trying to simulate performance for streaming ingestion, which is the typical use case for QuestDB, Clickhouse, and Timescale. All three can do batch processing as well, but they shine when data arrives at high throughput in real time. So, while the data is presented as CSV (for compatibility reasons), I am reading it line by line and sending it from a Python script using the streaming-ingestion API exposed by those databases.
I guess the equivalent in DuckDB would be writing data via INSERTS in batches of 10K records, which I am sure would still be very performant!
The three databases in the article have more efficient methods for ingesting batch data (in QuestDB's case, the COPY keyword, or even importing directly from a Pandas dataframe using our Python client, would be faster than streaming ingestion). I know Clickhouse and Timescale can also ingest CSV data way faster than by sending streaming inserts.
But that's not the typical use case, or the point of the article, as removing duplicates from a batch is way easier than from a stream. I should have made that clearer and will probably update the article, so thank you for your feedback.
Other than that, I ran the batch experiment you mention on the box I used for the article (I had actually already done it in the past), and the performance I am getting is 37 seconds, which is slower than your numbers. The reason is that we are using a cloud instance with an EBS drive, and those are slower than a local SSD on your laptop.
You can use local drives on AWS and other cloud providers, but those are way more expensive and have no persistence or snapshots, so they are not ideal for a database (we sometimes use them for large one-off imports, to store the original CSV and speed up reads while writing to the EBS drive).
Actually, one of the claims from DuckDB's entourage is that "big data is dead": with the power in a developer's laptop today, you can do things faster than in the cloud, at unprecedented scale. DuckDB is designed to run locally on a data scientist's machine rather than on a remote server (of course there is MotherDuck if you want to go remote, but then you are adding latency and a proprietary layer on top).
Once again thank you for your feedback!