turk- | 4 years ago
With the lakehouse, you can use Python, R, and Scala (not just SQL) to interface with your data. You can use multiple compute engines (Spark, Databricks, Presto), so you are not locked into one compute engine.
I recall being a junior programmer and wishing I could talk to my MySQL database in Python code to do some processing that was difficult to express in SQL. That day is finally here.
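A toy illustration of the kind of logic that is natural in Python but awkward in plain SQL: sessionizing a user's click timestamps by splitting on long gaps. The data and the 30-minute threshold are invented; in a lakehouse you would run the same function over rows an engine like Spark reads for you.

```python
# Sessionize click timestamps: start a new session whenever the gap
# between consecutive events exceeds a threshold. Trivial as a loop,
# clumsy in SQL without window-function gymnastics.

def sessionize(timestamps, gap_seconds=1800):
    """Group sorted epoch timestamps into sessions."""
    sessions = []
    current = []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > gap_seconds:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

clicks = [0, 60, 120, 4000, 4060, 9000]  # made-up epoch seconds
print(sessionize(clicks))  # [[0, 60, 120], [4000, 4060], [9000]]
```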
adeelk93|4 years ago
https://cloud.google.com/bigquery-ml/docs/introduction
glogla|4 years ago
You can access data in Snowflake or BigQuery using JDBC or Python clients. You do pay for the compute that reads the data for you. You cannot access the data in storage directly.
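The access pattern described here — SQL sent over a vendor connection, with the warehouse's own compute doing the read — looks roughly like this. Stdlib sqlite3 stands in for the Snowflake/BigQuery Python clients (the real ones expose a similar DB-API cursor interface; the table and data below are invented):

```python
import sqlite3

# sqlite3 stands in for snowflake.connector.connect(...) or a BigQuery
# client. In the real services, the query below is executed by warehouse
# compute you pay for -- you never touch the storage layer directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 10.0), ("a", 5.0), ("b", 7.5)])

cur = conn.execute(
    "SELECT user_id, SUM(amount) FROM events GROUP BY user_id ORDER BY user_id")
rows = cur.fetchall()
print(rows)  # [('a', 15.0), ('b', 7.5)]
```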
You can access data in lakehouse directly, by going to cloud storage. That has two major challenges:
Lakehouse formats aren't easy to deal with. You need a smart engine (like Spark) to do that, but those engines are pretty heavy. Starting a Spark cluster to update 100 records in a table is wasteful.
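Part of why small updates are heavy: lakehouse tables are sets of immutable data files, so changing even one record means rewriting the whole file that contains it and tracking the swap in metadata. A pure-Python caricature of that copy-on-write step, with made-up files and rows:

```python
# A lakehouse table as immutable "files": updating one record means
# writing a replacement for every file touched, never editing in place.

table_files = {
    "part-000.parquet": [{"id": 1, "v": "old"}, {"id": 2, "v": "x"}],
    "part-001.parquet": [{"id": 3, "v": "y"}],
}

def update(files, record_id, new_value):
    """Copy-on-write: rewrite each file containing the record."""
    new_files = {}
    rows_rewritten = 0
    for name, rows in files.items():
        if any(r["id"] == record_id for r in rows):
            rows = [dict(r, v=new_value) if r["id"] == record_id else r
                    for r in rows]
            name = name.replace("part", "rewrite")  # a brand-new file
            rows_rewritten += len(rows)
        new_files[name] = rows
    return new_files, rows_rewritten

new_table, rewritten = update(table_files, 1, "new")
print(rewritten)  # 2 -- changing one record forced rewriting 2 rows
```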
The bigger challenge is security. Cloud storage can't give you granular access control: it only sees files, not tables and columns. So if you need column- or row-based security or data masking, you're out of luck. Cloud storage also makes it hard to assign even the non-granular access. Not sure about other clouds, but AWS IAM roles are hard to manage and don't scale to a large number of users/groups.
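The granularity gap in concrete terms: object storage can only allow or deny whole files, while column masking and row filtering have to be enforced by an engine that understands the table. A minimal sketch of such engine-side enforcement (the role names and the policy itself are invented):

```python
# Engine-side policy enforcement: storage ACLs say yes/no to a whole
# file, so masking and row filtering must happen in the query engine.

ROW_POLICY = {"analyst": lambda row: row["region"] == "EU"}  # row filter
MASKED_COLUMNS = {"analyst": {"email"}}                      # column mask

def secure_read(rows, role):
    allowed = ROW_POLICY.get(role, lambda r: True)
    masked = MASKED_COLUMNS.get(role, set())
    out = []
    for row in rows:
        if not allowed(row):
            continue  # row-level security: drop rows the role can't see
        out.append({k: ("***" if k in masked else v)
                    for k, v in row.items()})
    return out

data = [
    {"email": "a@x.com", "region": "EU", "spend": 10},
    {"email": "b@x.com", "region": "US", "spend": 20},
]
print(secure_read(data, "analyst"))
# [{'email': '***', 'region': 'EU', 'spend': 10}]
```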
You can sidestep this by using a long-running engine (like Trino) and applying security there. Then you don't need to start Spark to change or query a few records. But it means you're basically implementing your own cloud warehouse.
Which honestly can be the way if that's what you want! You can also use multiple engines if you are ok with implementing security multiple times. To me, that doesn't seem to be worth it.
In the end, I don't see data that's one SELECT away as much more proprietary and "outsourced" than data that is one Spark/Trino cluster and then a SELECT away, just because you can read the S3 it sits on.
turk-|4 years ago
If you want simplicity, you can limit your engine to Databricks. You can also use JDBC/ODBC with Databricks if you want to use other tools that don't support the Delta format/Parquet, but piping data over JDBC/ODBC doesn't scale to large datasets with any tool. Databricks has all the capabilities of BigQuery/Snowflake/Redshift, but none of those tools support Python/R/Scala; their engines would need to be rewritten from the ground up to do so.
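The scaling problem with piping data over a connector is that every row crosses the wire, whereas pushing the work into the engine moves only the answer. A caricature of the two patterns, again with stdlib sqlite3 standing in for a JDBC/ODBC-style connection (row counts here stand in for network transfer cost):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10_000)])

# Piping: every row travels over the connection (10,000 rows here).
cur = conn.execute("SELECT x FROM t")
pulled = 0
while True:
    batch = cur.fetchmany(1000)  # JDBC/ODBC-style batched fetch
    if not batch:
        break
    pulled += len(batch)

# Pushdown: the engine aggregates; one row crosses the connection.
(total,) = conn.execute("SELECT SUM(x) FROM t").fetchone()

print(pulled, total)  # 10000 49995000
```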