top | item 29235016


drawturkey | 4 years ago

Photon, which was used in their benchmark, is not open source. Don't be fooled by DB.


hack_daddy | 4 years ago

Apache Spark is an open API. You can build your ETL with it and run it on an open source Spark cluster, an AWS EMR cluster, or a Databricks cluster. It will work across all three (and others) because the API is open.

Vendors can implement that API with their own optimizations. EMR makes optimizations in their implementation and so does Databricks. Photon is a new engine, but it implements the Apache Spark API for better performance. There's nothing to stop EMR or any other Apache Spark vendor from undertaking the same strategy.

This openness has allowed Hortonworks and Cloudera customers to migrate their workloads to the cloud more easily than if they had to refactor from something completely different, like Oracle PL/SQL routines.

Snowflake does not have an open ETL API. If you write stored procedures in Snowflake, you can only run them on Snowflake. This is one of the reasons people choose to use dbt with Snowflake. It gives them an open ETL layer to provide future optionality.

There's no reason why you couldn't use Snowflake as the datastore and Spark as the ETL. However, it would be prohibitively expensive to do so. You would need to pay not only for the Spark cluster but also for a Snowflake cluster to export and import the data. Exporting a handful of terabytes from Snowflake can also take hours depending on your cluster configuration.

By storing your data on S3 in an open format, like Apache Parquet or Delta Lake, you can just use a different engine on it without needing to export / import it. In addition to Spark, Presto & Trino are popular engines to use when querying a data lake.

This optionality is ultimately good for customers. If Apache Spark is best for your use case, then you can choose to host Spark yourself, EMR, Databricks, Cloudera, etc. If Presto is best for your use case, you can choose AWS Athena, Starburst, Ahana, etc. Once you pick the best tech for your use case, you have several vendors to compare against for the best deal.

If I want to move off Snowflake to Firebolt or some other data warehouse, I need to pay both vendors to get my data out and get my data in. Snowflake wasn't around 10 years ago, and if they are no longer a good option 10 years from now, I don't want to have to pay them for the privilege of exporting my data. I could rectify that by keeping all my data in a data lake, but then I'm paying to store the data twice.

Open APIs enable an open ecosystem, which encourages competition.