I’ve not used duckdb before nor do I do much data analysis so I am curious about this one aspect of processing medium sized json/csv with it: the data are not indexed, so any non-trivial query would require a full scan. Is duckdb so fast that this is never really a problem for most folks?
RobinL|1 month ago
The first reason is simply raw speed: DuckDB has one of the best CSV readers around, for example, and it's parallelised.
Next, engines like DuckDB are optimised for aggregate analysis, where your single query processes a lot of rows (often a significant % of all rows). That means that a full scan is not necessarily as big a problem as it first appears. It's not like a transactional database where often you need to quickly locate and update a single row out of millions.
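To make that concrete, here's a minimal pure-Python sketch (toy data, not DuckDB internals) of an analytical aggregate. Because the query touches essentially every row anyway, an index on the key column wouldn't reduce the work; what matters is how fast you can scan:

```python
# Toy aggregate over synthetic "order" rows. An analytical query like
# this reads (nearly) every row, so a B-tree index on order_id would
# not help; a fast parallel scan is what an engine like DuckDB optimises.
rows = [{"order_id": i, "region": "EU" if i % 2 else "US", "amount": i % 100}
        for i in range(1_000_000)]

# Equivalent of: SELECT region, SUM(amount) FROM orders GROUP BY region
totals = {}
for r in rows:
    totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
```

A real engine vectorises this loop over columnar chunks, but the shape of the work is the same: one pass over the data, no point lookups.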
In addition, engines like DuckDB support predicate pushdown: if your data is stored in Parquet format, you do not need to scan every row, because Parquet files carry min/max statistics for each row group, letting the engine skip whole chunks that cannot match the filter.
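A toy sketch of the idea (the names here are illustrative, not DuckDB or Parquet internals): each chunk of rows carries min/max metadata, and the scanner skips any chunk whose range cannot satisfy the predicate.

```python
# Simulated "row groups" with min/max statistics, like Parquet stores.
row_groups = [
    {"min": 0,   "max": 99,  "values": list(range(0, 100))},
    {"min": 100, "max": 199, "values": list(range(100, 200))},
    {"min": 200, "max": 299, "values": list(range(200, 300))},
]

def scan_where_greater_than(groups, threshold):
    """Return values > threshold, counting how many groups were read."""
    groups_read = 0
    out = []
    for g in groups:
        if g["max"] <= threshold:   # whole group cannot match: skip it
            continue
        groups_read += 1
        out.extend(v for v in g["values"] if v > threshold)
    return out, groups_read

matches, read = scan_where_greater_than(row_groups, 250)
# Only the last row group is actually decoded and scanned.
```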
Finally, Parquet is a columnar format, so the engine only needs to scan the columns your query actually touches, rather than reading whole rows when you may only be interested in one or two of the columns.
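A back-of-the-envelope sketch of why that matters (toy layouts, not real storage code): for a 30-column table, summing one column from a columnar layout touches 1/30th of the cells a row-oriented layout would force you to read.

```python
# Compare cells touched by "SELECT SUM(price) FROM items" under
# row-oriented vs columnar storage. Values are dummies; the point
# is the amount of data each layout forces the scan to read.
n_rows, n_cols = 10_000, 30

# Row store: one tuple per row, all 30 columns interleaved on disk,
# so reading any column means reading every cell of every row.
row_store = [tuple(range(n_cols)) for _ in range(n_rows)]

# Column store: one array per column; the other 29 are never loaded.
col_store = {"price": [0] * n_rows}

cells_touched_row = sum(len(r) for r in row_store)
cells_touched_col = len(col_store["price"])
```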
mr_toad|1 month ago
However, you wouldn’t want to use either for transaction processing; the lack of indexes would really hurt.
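A minimal illustration of why (toy in-memory data, not any engine's internals): an OLTP-style point lookup without an index degenerates into a full scan, while an index answers it in a single probe.

```python
# "UPDATE accounts SET ... WHERE id = 987654" style workload.
rows = [{"id": i, "balance": 0} for i in range(1_000_000)]

def lookup_scan(rows, key):
    """Unindexed point lookup: O(n) scan, counting rows examined."""
    steps = 0
    for r in rows:
        steps += 1
        if r["id"] == key:
            return r, steps
    return None, steps

# What a transactional index buys you: O(1)/O(log n) access by key.
index = {r["id"]: r for r in rows}

_, scan_steps = lookup_scan(rows, 987_654)   # examines ~1M rows
indexed_row = index[987_654]                 # single probe
```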
akhundelar|1 month ago
Depends on your definition of medium-sized, but for tables of hundreds of thousands of rows and ~30 columns, these tools are fast enough to run queries instantly or near-instantly, even on laptop CPUs.