(no title)
pgguru | 3 months ago
Once those tables exist, queries against them are able to either push down entirely to the remote tables and uses a Custom Scan to execute and pull results back into postgres, or we transform/extract the pieces that can be executed remotely using a FDW and then treat it as a tuple source.
In both cases, the user does not need to know any of the details and just runs queries inside postgres as they always have.
spenczar5|3 months ago
pgguru|3 months ago
For instance, you could compute a `SELECT COUNT(*) FROM mytable WHERE first_name = 'David'` by querying all the rows from `mytable` on the DuckDB side, returning all the rows, and letting Postgres itself count the number of results, but this is extremely inefficient, since that same value can be computed remotely.
In a simple query like this with well-defined semantics that match between Postgres and DuckDB, you can run the query entirely on the remote side, just using Postgres as a go-between.
Not all functions and operators work in the same way between the two systems, so you cannot just push things down unconditionally; `pg_lake` does some analysis to see what can run on the DuckDB side and what needs to stick around on the Postgres side.
There is only a single "executor" from the perspective of pg_lake, but the pgduck_server embeds a multi-threaded duckdb instance.
How DuckDB executes the portion of the query it gets is up to it; it often will involve parallelism, and it can use metadata about the files it is querying to speed up its own processing without even needing to visit every file. For instance, it can look at the `first_name` in the incoming query and just skip any files which do not have a min_value/max_value that would contain that.