top | item 42991248

kipukun | 1 year ago

For your Wimsey library, using “pipe” to validate the contracts seems like it would drastically slow down a Polars query, because the UDF pushes the query out of Rust and into Python. I think a cool direction would be a “compiler” that takes in a contract and spits out native queries for a variety of dataframe libraries (pandas/Polars/PySpark). It becomes harder to define how a contract test should raise its error, but that can be the secret sauce.
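To make the "compiler" idea concrete, here is a very rough sketch. Everything in it is hypothetical: `MaxCheck` and `compile_check` are made-up names, not part of Wimsey or any other library, and emitting source text is only for illustration (a real tool would more likely build expression objects). The emitted snippets assume a frame named `df`, with `pl` bound to `polars` and `F` to `pyspark.sql.functions`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaxCheck:
    """Hypothetical contract: the maximum of `column` must not exceed `limit`."""

    column: str
    limit: float


def compile_check(check: MaxCheck, backend: str) -> str:
    """Return source text for a native boolean check against a frame named `df`."""
    col, limit = check.column, check.limit
    if backend == "pandas":
        return f"df[{col!r}].max() <= {limit}"
    if backend == "polars":
        # Assumes an eager polars DataFrame and `import polars as pl`.
        return f"df.select(pl.col({col!r}).max() <= {limit}).item()"
    if backend == "pyspark":
        # Assumes `from pyspark.sql import functions as F`.
        return f"df.agg(F.max({col!r}) <= {limit}).first()[0]"
    raise ValueError(f"unsupported backend: {backend}")


check = MaxCheck(column="age", limit=120)
for backend in ("pandas", "polars", "pyspark"):
    print(backend, "->", compile_check(check, backend))
```

The hard part mentioned above is untouched here: each snippet only yields a boolean, and deciding how and when that turns into an error is left open.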

benrutter | 1 year ago

Actually, you're almost 100% describing how Wimsey works! It uses native dataframe code rather than a UDF of any kind. Under the hood it uses Narwhals, which converts Polars-style expressions into native pandas/Polars/Spark/Dask code with minimal overhead.

If you're using a lazy dataframe (via Polars, Spark, etc.), Wimsey will force collection, so that can have speed implications. The reason is that I haven't yet found a cross-language way of embedding assertions that fail later down the line.