I’ve not used duckdb before nor do I do much data analysis so I am curious about this one aspect of processing medium sized json/csv with it: the data are not indexed, so any non-trivial query would require a full scan. Is duckdb so fast that this is never really a problem for most folks?
RobinL|1 month ago
The first reason is simply raw speed: DuckDB has one of the best CSV readers around, for example, and it's parallelised.
Next, engines like DuckDB are optimised for aggregate analysis, where your single query processes a lot of rows (often a significant % of all rows). That means that a full scan is not necessarily as big a problem as it first appears. It's not like a transactional database where often you need to quickly locate and update a single row out of millions.
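To make that concrete, here's a minimal pure-Python sketch (toy data, not DuckDB internals) of an analytical aggregate. Because the query touches essentially every row anyway, an index on the key column wouldn't reduce the work; what matters is how fast you can scan:

```python
# Toy aggregate over synthetic "order" rows. An analytical query like
# this reads (nearly) every row, so a B-tree index on order_id would
# not help; a fast parallel scan is what an engine like DuckDB optimises.
rows = [{"order_id": i, "region": "EU" if i % 2 else "US", "amount": i % 100}
        for i in range(1_000_000)]

# Equivalent of: SELECT region, SUM(amount) FROM orders GROUP BY region
totals = {}
for r in rows:
    totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
```

A real engine vectorises this loop over columnar chunks, but the shape of the work is the same: one pass over the data, no point lookups.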
In addition, engines like DuckDB support predicate pushdown: if your data is stored in Parquet format, you do not need to scan every row, because Parquet files carry min/max statistics for each row group, letting the engine skip whole chunks that cannot match the filter.
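A toy sketch of the idea (the names here are illustrative, not DuckDB or Parquet internals): each chunk of rows carries min/max metadata, and the scanner skips any chunk whose range cannot satisfy the predicate.

```python
# Simulated "row groups" with min/max statistics, like Parquet stores.
row_groups = [
    {"min": 0,   "max": 99,  "values": list(range(0, 100))},
    {"min": 100, "max": 199, "values": list(range(100, 200))},
    {"min": 200, "max": 299, "values": list(range(200, 300))},
]

def scan_where_greater_than(groups, threshold):
    """Return values > threshold, counting how many groups were read."""
    groups_read = 0
    out = []
    for g in groups:
        if g["max"] <= threshold:   # whole group cannot match: skip it
            continue
        groups_read += 1
        out.extend(v for v in g["values"] if v > threshold)
    return out, groups_read

matches, read = scan_where_greater_than(row_groups, 250)
# Only the last row group is actually decoded and scanned.
```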
Finally, Parquet is a columnar format, so the engine only needs to scan the columns your query actually touches, rather than reading whole rows when you may only be interested in one or two of the columns.
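A back-of-the-envelope sketch of why that matters (toy layouts, not real storage code): for a 30-column table, summing one column from a columnar layout touches 1/30th of the cells a row-oriented layout would force you to read.

```python
# Compare cells touched by "SELECT SUM(price) FROM items" under
# row-oriented vs columnar storage. Values are dummies; the point
# is the amount of data each layout forces the scan to read.
n_rows, n_cols = 10_000, 30

# Row store: one tuple per row, all 30 columns interleaved on disk,
# so reading any column means reading every cell of every row.
row_store = [tuple(range(n_cols)) for _ in range(n_rows)]

# Column store: one array per column; the other 29 are never loaded.
col_store = {"price": [0] * n_rows}

cells_touched_row = sum(len(r) for r in row_store)
cells_touched_col = len(col_store["price"])
```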
mr_toad|1 month ago
However, you wouldn’t want to use either for transaction processing; the lack of indexes would really hurt.
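A minimal illustration of why (toy in-memory data, not any engine's internals): an OLTP-style point lookup without an index degenerates into a full scan, while an index answers it in a single probe.

```python
# "UPDATE accounts SET ... WHERE id = 987654" style workload.
rows = [{"id": i, "balance": 0} for i in range(1_000_000)]

def lookup_scan(rows, key):
    """Unindexed point lookup: O(n) scan, counting rows examined."""
    steps = 0
    for r in rows:
        steps += 1
        if r["id"] == key:
            return r, steps
    return None, steps

# What a transactional index buys you: O(1)/O(log n) access by key.
index = {r["id"]: r for r in rows}

_, scan_steps = lookup_scan(rows, 987_654)   # examines ~1M rows
indexed_row = index[987_654]                 # single probe
```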
akhundelar|1 month ago
Depends on your definition of medium-sized, but for tables of hundreds of thousands of rows and ~30 columns, these tools are fast enough to run queries instantly or near-instantly, even on laptop CPUs.