chrisjc | 1 year ago

Really awesome project. This would be even better if some of the differential storage functionality made its way into duckdb.

https://motherduck.com/blog/differential-storage-building-bl...

Unsure if it's even possible, but if there were a way to write out (simulate) the streams as WAL files, you might be able to accomplish the same thing without having to consolidate the duckdb file every time.
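In the meantime, a poor man's version of that incremental apply works in plain duckdb SQL: keep the consolidated file and apply each batch of stream output as delete-then-insert keyed on the primary key (since streams emit updates as DELETE+INSERT pairs, this covers inserts, updates, and deletes). Table and column names here are illustrative, and this assumes the Snowflake stream's metadata columns were preserved when loading the batch into a staging table:

    -- apply one staged batch of stream output (x_changes) to the target table
    delete from x
    where id in (select id from x_changes);

    insert into x
    select * exclude ("METADATA$ACTION", "METADATA$ISUPDATE", "METADATA$ROW_ID")
    from x_changes
    where "METADATA$ACTION" = 'INSERT';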

A couple of other ideas that may or may not diverge from your project's purpose or simplicity...

It's also too bad that duckdb isn't included in the out-of-the-box Snowpark packages; if it were, you could potentially run all of this in a Snowflake procedure/DAG and persist the duckdb file to object storage. You could of course achieve this with Snowflake containers. (But this would probably ruin any plans to support Redshift/BigQuery/etc. as sources.)

If the source tables were Iceberg, then you could have an even simpler duckdb file to produce/maintain that would just wrap views around the Iceberg tables using the duckdb iceberg extension.

    install iceberg;
    load iceberg;

    create view x as
    select *
    from iceberg_scan('metadata_file');
If you're going to enable change tracking on tables, then perhaps another solution is to use CHANGES to select the data from the table instead of reading streams. That way you could use the same timestamp across all the selects and get better consistency between your materializations to duckdb.

    set ts = {last_materialization_ts};
    select * from X changes(information => default) at(timestamp => $ts);
    select * from Y changes(information => default) at(timestamp => $ts);

    -- triggered task to kick off materialization to duckdb
    create task M
    when
        SYSTEM$STREAM_HAS_DATA('X_STREAM') AND
        SYSTEM$STREAM_HAS_DATA('Y_STREAM') AND ...
    as
        CALL MY_MATERIALIZE(); -- send SNS, invoke ext-func, etc.
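One hypothetical shape for that MY_MATERIALIZE procedure body (the stage and table names are illustrative, not from the project) is to unload each table's change set to object storage as Parquet, where the duckdb side can pick it up and apply it:

    -- hypothetical: unload one table's change set for the duckdb consumer
    copy into @my_stage/x_changes/
    from (
        select * from X changes(information => default) at(timestamp => $ts)
    )
    file_format = (type = parquet);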

Here's another similar (but different) project that I've been tracking: https://github.com/buremba/universql
