In short: compatible with existing Spark jobs, but it executes them much faster. Benchmarks in the README and docs [1] show improvements of up to 3x, even though not all operations are implemented yet (if an operation isn't available in Comet, it falls back to Spark), so there is room for further gains. Across all TPC-H queries the total speedup is currently 1.5x; the docs state that, based on DataFusion's standalone performance, 2x-4x is a realistic goal [1].
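The per-operator fallback described above can be made visible at runtime. A hedged sketch, assuming the `spark.comet.explainFallback.enabled` property from the Comet docs (verify the name against your Comet version):

```shell
# Sketch: ask Comet to report which operators fall back to Spark.
# Property names follow the Comet docs at the time of writing; the
# table and query are placeholders for illustration only.
spark-sql \
  --jars "$COMET_JAR" \
  --conf spark.plugins=org.apache.comet.CometPlugin \
  --conf spark.comet.enabled=true \
  --conf spark.comet.explainFallback.enabled=true \
  -e "EXPLAIN SELECT l_returnflag, sum(l_quantity) FROM lineitem GROUP BY l_returnflag"
# Operators Comet runs natively appear as Comet* nodes in the plan
# (e.g. a Comet scan); anything else is executed by vanilla Spark.
```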
Because it's a drop-in replacement that lets you (theoretically) spend O(1) development effort on speeding up your Spark jobs instead of O(N).
Databricks has its own implementation, Photon [1], which is closed source (last time I checked) and built around a C++ engine.
OutOfHere|1 year ago
simicd|1 year ago
Haven't seen any memory consumption benchmarks, but I suspect it's lower than Spark's for the same jobs, since DataFusion is designed from the ground up to be columnar-first.
For companies spending 100s of thousands if not millions on compute this would mean substantial savings with little effort.
[1] https://datafusion.apache.org/comet/contributor-guide/benchm...
MrPowers|1 year ago
Ballista is much less mature than Spark and needs a lot of work. It's awesome they're making Spark faster with Comet.
orthoxerox|1 year ago
I say theoretically, because I have no idea how Comet works with the memory limits on Spark executors. If you have to rebalance the memory between regular memory and memory overhead or provision some off-heap memory for Comet, then the migration won't be so simple.
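For what it's worth, Comet's install docs do call for exactly this kind of rebalancing: the native engine allocates from Spark's off-heap pool, so you shrink the on-heap executor memory and grant the difference to off-heap. A hedged sketch — the `spark.comet.*` property names follow the Comet docs and the sizes are purely illustrative:

```shell
# Sketch: enabling Comet while rebalancing executor memory.
# Before: a plain JVM executor, e.g. spark.executor.memory=16g.
# After: shrink the heap and hand the difference to off-heap memory,
# which Comet's native engine allocates from. Sizes are illustrative;
# your_job.py is a placeholder.
spark-submit \
  --jars "$COMET_JAR" \
  --conf spark.plugins=org.apache.comet.CometPlugin \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.enabled=true \
  --conf spark.executor.memory=8g \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=8g \
  your_job.py
# The total container footprint stays roughly the same (8g heap + 8g
# off-heap), but the split is no longer something you can ignore --
# which is the migration caveat raised above.
```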
necubi|1 year ago
threeseed|1 year ago
I want to be able to interact with the full set of services from GCS, Azure, AWS, OpenAI, etc., none of which DataFusion supports.
As well as use libraries such as SynapseML, SparkNLP etc.
And do all of this with full support from my cloud provider.
pitah1|1 year ago
How does it compare to Blaze[1] and Gluten[2]?
I'm interested in running some benchmarks soon against all three for my project to see how they all go.
[1] https://github.com/kwai/blaze
[2] https://github.com/apache/incubator-gluten
ed_elliott_asc|1 year ago
scirob|1 year ago
I live in a dream world :)
nevi-me|1 year ago
Databricks' terms prevent(ed?) publishing benchmarks; it would be interesting to see how Comet performs relative to it over time.
Photon comes at a higher cost, so one big advantage of Comet is being able to deploy it on a standard Databricks cluster that doesn't have Photon, at a lower running cost.
[1] https://www.databricks.com/product/photon