jtigani | 2 years ago

The article mentioned that DuckDB keeps improving very quickly. The next couple of months of DuckDB are all about stabilization, with no new features getting added. Once it is robust enough it will be declared "1.0". My guess is that will be in late April.

You mentioned OOMs. This has been a focus for a while and has gotten steadily better over the past few releases. 0.9 added spill to disk to prevent most OOMs, and 0.10, released a couple of weeks ago, fixes a bunch more memory usage problems. The storage format, which another commenter brought up, is now fully backwards compatible.

I'd suggest giving it another try, especially once 1.0 comes out.
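
For anyone retrying, here is a minimal sketch of the settings that govern the spill-to-disk behavior mentioned above. The 4GB limit and the spill directory path are illustrative assumptions, not recommendations; pick values that fit your machine.

```sql
-- Cap DuckDB's memory use; '4GB' is an arbitrary illustrative value.
SET memory_limit = '4GB';

-- Directory where larger-than-memory intermediates can spill.
-- '/tmp/duckdb_spill' is a hypothetical path.
SET temp_directory = '/tmp/duckdb_spill';

-- Confirm the limit that is actually in effect.
SELECT current_setting('memory_limit') AS memory_limit;
```

Not every operator can spill, so a query may still fail under a tight limit.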

LunaSea|2 years ago

It might be getting better, but the examples are currently so egregious that it's tough to keep giving DuckDB a chance.

Example of a query that should never, ever run out of memory, but absolutely will in the latest DuckDB:

  COPY
    (
      SELECT
        rs.my_int,
        rs.my_bigint
      FROM
        READ_PARQUET('s3://some/folder/my-large-files-*.parquet')
        AS rs
    )
  TO
    '/my/home/folder/my-large-file.parquet'
    (
      FORMAT PARQUET,
      ROW_GROUP_SIZE 100000,
      COMPRESSION 'ZSTD'
    )
  ;

This query should simply read the two selected columns, using the Parquet metadata to locate them, and then stream the data to disk.

And yet it will try to load the data into memory before crashing.
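
One thing that may be worth trying for this kind of export is turning off insertion-order preservation, which DuckDB documents as a way to reduce buffering when the output row order does not matter. Whether it avoids the crash in this particular case is an assumption, not something confirmed in this thread; the memory limit below is likewise an illustrative value.

```sql
-- Allow DuckDB to write rows in any order, so it does not have to
-- buffer data to keep the output in the same order as the input.
SET preserve_insertion_order = false;

-- Illustrative cap; an assumption, adjust to the machine.
SET memory_limit = '4GB';

-- Same export as above, unchanged.
COPY
  (
    SELECT
      rs.my_int,
      rs.my_bigint
    FROM
      READ_PARQUET('s3://some/folder/my-large-files-*.parquet')
      AS rs
  )
TO
  '/my/home/folder/my-large-file.parquet'
  (
    FORMAT PARQUET,
    ROW_GROUP_SIZE 100000,
    COMPRESSION 'ZSTD'
  )
;
```

The trade-off is that the rows in the output file are no longer guaranteed to follow the order of the input files.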

cmollis|2 years ago

I've been testing DuckDB's ability to scan multi-TB Parquet datasets in S3, and I have to say I've been pretty impressed with it. I've run some pretty hairy SQL (window functions, multi-table joins, etc.), stuff that takes less time in Athena, but not by that much. Coupled with its ability to pull that data and join it with information in relational databases like MySQL, that makes it a really compelling tool. Strangely, the least performant operations were the MySQL lookups (I had to run SET GLOBAL mysql_experimental_filter_pushdown=true;). Anyway, it's definitely worth another look. I'm using v 9.2
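
A rough sketch of the setup described above, joining Parquet data in S3 against a MySQL table through DuckDB's mysql extension. The connection string, the mysqldb alias, and the lookup table with its id and region columns are hypothetical, and S3 credentials are assumed to be configured already.

```sql
-- Install and load the MySQL extension.
INSTALL mysql;
LOAD mysql;

-- Attach a MySQL database; host, user, port, and database name
-- are placeholders for illustration.
ATTACH 'host=localhost user=root port=3306 database=mydb'
  AS mysqldb (TYPE MYSQL);

-- The experimental setting mentioned above: push filters down to
-- MySQL instead of fetching whole tables and filtering in DuckDB.
SET GLOBAL mysql_experimental_filter_pushdown = true;

-- Join the Parquet files with a (hypothetical) MySQL lookup table.
SELECT
  l.region,
  count(*) AS row_count
FROM
  READ_PARQUET('s3://some/folder/my-large-files-*.parquet') AS rs
  JOIN mysqldb.lookup AS l
    ON l.id = rs.my_int
WHERE
  l.region = 'EU'
GROUP BY
  l.region
;
```

With pushdown enabled, the region filter can be evaluated by MySQL, which should cut the amount of data pulled over the connection.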