dmitrykoval | 4 years ago

Following similar observations, I wondered whether one could execute SQL queries inside a Python process, with access to native Python functions and NumPy as UDFs. Thanks to Apache Arrow, one can combine a DataFrame API with SQL in data analysis workflows without copying the data or writing operators in a mix of C++ and Python, all within the same Python process.

So I implemented Vinum, which executes queries that can invoke NumPy or Python functions available to the interpreter as UDFs. For example: "SELECT value, np.log(value) FROM t WHERE ..".

https://github.com/dmitrykoval/vinum

Finally, DuckDB is making great progress integrating pandas DataFrames into its API, with UDF support coming soon. I would certainly recommend giving it a shot for OLAP workflows.

justsomeuser|4 years ago

Also, I think SQLite lets you call Python functions from SQL.
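
For reference, here is a minimal sketch of what that looks like with the standard-library sqlite3 module, which exposes `Connection.create_function` for registering a Python callable as a scalar SQL function. The table `t`, the column `value`, and the SQL-side name `py_log` are made up for illustration and are not from the thread.

```python
import math
import sqlite3

# In-memory database; the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")

# Register math.log as a one-argument scalar SQL function named py_log.
conn.create_function("py_log", 1, math.log)

conn.execute("CREATE TABLE t (value REAL)")
conn.executemany("INSERT INTO t VALUES (?)", [(1.0,), (math.e,), (10.0,)])

# SQLite calls back into Python once per row that passes the filter.
rows = conn.execute(
    "SELECT value, py_log(value) FROM t WHERE value > 0 ORDER BY value"
).fetchall()

for value, logged in rows:
    print(value, logged)
```

Each row is handed to the Python function as an individual value, which is fine for small result sets but adds a per-row call cost on large scans.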

dmitrykoval|4 years ago

That's correct, but SQLite has to serialize and deserialize the data passed to the Python function (from C to Python and back), while Arrow lets you get a "view" of the same data without making a copy. That's probably not an issue in OLTP workloads, but it may become more noticeable in OLAP.