DataFusion and Polars are like two sides of the same Rust coin: DataFusion is built for distributed, SQL-based analytics at scale, serving as the backbone for data systems and enabling complex query execution across clusters. Polars, on the other hand, is laser-focused on blazing-fast, single-node data manipulation, offering a Python-like DataFrame API that feels intuitive for exploratory analysis and in-memory processing.
donor20|1 year ago
You can do dual AMD 192-core CPUs (384 cores / 768 threads) with 9 TB of memory and a 24-disk SSD array in a 2U box.
spratzt|1 year ago
Spark and its modern counterpart Databricks are essentially obsolete for these organizations. Whatever justification they may have had in the past no longer holds.
I’ve recently closed down several in-house Spark clusters and replaced them with single nodes.
In addition to the simpler design and the reduction in cost, there was a massive increase in performance. I expect this will become more common in the future, leaving distributed architecture to a small and increasingly niche group.
elasticventures|1 year ago
lidavidm|1 year ago
I think the difference is more that DataFusion is built as a library so you can plug it into the product you're building (e.g. Comet, which plugs it into Spark, or pg_lakehouse, which plugs it into Postgres). Polars could be used that way, but it's also a functional package you can pip install and use as a Pandas alternative right now.