Have you ever tried to train models on large data sets over JDBC/ODBC? It's terrible, even with parallelism. Having direct access to the underlying storage, instead of sucking a lot of data through a small straw, is a game changer. That is one advantage that Spark and Databricks have over Snowflake.

glogla|4 years ago

Have you tried to implement row- and column-based security on direct access to cloud storage? It flat out does not work.

Sadly, those things are mutually exclusive at the moment, and with the way things are deployed here (large multi-tenant platforms), security has to take priority.

But if that's not your situation, then obviously it makes sense to make use of that!

saj1th|4 years ago

> Have you tried to implement row- and column-based security on direct access to cloud storage? It flat out does not work.

It is a solved problem. Essentially you need a central place (with decentralized ownership, for the data mesh fans) to specify the ACLs (row-based, column-based, attribute-based, etc.) and an enforcement layer that understands these ACLs. There are many solutions, including the ones from Databricks. Data discovery, lineage, data quality, etc. go hand in glove with this.

Security is front and centre for almost all organizations now.
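To make the "central ACLs plus an enforcement layer" idea concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical (the `Policy` class, the `POLICIES` registry, the role names, the `read_table` function); it is not how Databricks or any other product implements this, only an illustration of the shape: policies live in one place, and every read goes through a layer that applies the row filter and the column allow-list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Row = Dict[str, object]


@dataclass(frozen=True)
class Policy:
    """Hypothetical ACL entry: which columns a role may see, and which rows."""
    allowed_columns: Set[str]
    row_filter: Callable[[Row], bool]


# Hypothetical central policy registry, keyed by role name.
POLICIES: Dict[str, Policy] = {
    "analyst_eu": Policy(
        allowed_columns={"order_id", "region", "amount"},
        row_filter=lambda row: row["region"] == "EU",
    ),
    "ml_engineer": Policy(
        allowed_columns={"order_id", "amount"},
        row_filter=lambda row: True,
    ),
}


def read_table(rows: List[Row], role: str) -> List[Row]:
    """Enforcement layer: apply the role's row filter, then drop hidden columns."""
    policy = POLICIES.get(role)
    if policy is None:
        # Deny by default when no policy is registered for the role.
        raise PermissionError(f"no policy registered for role {role!r}")
    return [
        {name: value for name, value in row.items() if name in policy.allowed_columns}
        for row in rows
        if policy.row_filter(row)
    ]


# Made-up sample data for illustration only.
ORDERS: List[Row] = [
    {"order_id": 1, "region": "EU", "amount": 10.0, "email": "a@example.com"},
    {"order_id": 2, "region": "US", "amount": 20.0, "email": "b@example.com"},
]
```

The catch glogla raises still applies to a sketch like this: it only works if readers cannot reach the underlying files without going through `read_table` (or whatever plays that role).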

feqgmmr2|4 years ago

This is exactly what FAANGs do with their data platforms. There are literally hundreds of groups within these companies with very strict data isolation requirements between them. Pretty sure something like that is either already possible or will be very soon; there's just too much prior art here.

buzzscale|4 years ago

That's where Databricks comes in though: you can implement row/column-based security on your data on cloud object storage and use it for all your downstream use cases (not just BI/SQL but also AI/ML, without piping data over JDBC/ODBC).

jeltz|4 years ago

I have not, but I do not see why it would be much slower than direct access to the storage. Databases are quite good at streaming rows.

saj1th|4 years ago

> I do not see why it would be much slower than direct access to the storage.

Implementations of protocols like ODBC/JDBC generally use their own custom on-wire binary formats, and the data must be marshalled to and from the client library, so performance varies a lot from one implementation to another. We are seeing a lot of improvements in this space though, especially with the adoption of Arrow.
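A toy illustration of the marshalling point, using only the Python standard library. The `ROW_FORMAT` layout and the function names are made up for this sketch; this is neither a real ODBC/JDBC wire protocol nor the Arrow format. It only shows the difference in shape: a row-oriented encoding packs and unpacks every row individually, while a columnar encoding moves each column as one contiguous buffer.

```python
import struct
from array import array
from typing import List, Sequence, Tuple

# Hypothetical row layout: little-endian int64 id followed by float64 value.
ROW_FORMAT = "<qd"
ROW_SIZE = struct.calcsize(ROW_FORMAT)


def encode_rows(ids: Sequence[int], values: Sequence[float]) -> bytes:
    """Row-oriented: one pack call per row."""
    return b"".join(struct.pack(ROW_FORMAT, i, v) for i, v in zip(ids, values))


def decode_rows(payload: bytes) -> List[Tuple[int, float]]:
    """Row-oriented: one unpack call per row on the receiving side."""
    return [
        struct.unpack_from(ROW_FORMAT, payload, offset)
        for offset in range(0, len(payload), ROW_SIZE)
    ]


def encode_columns(ids: Sequence[int], values: Sequence[float]) -> Tuple[bytes, bytes]:
    """Columnar: each column becomes a single contiguous typed buffer."""
    return array("q", ids).tobytes(), array("d", values).tobytes()


def decode_columns(id_bytes: bytes, value_bytes: bytes) -> Tuple[array, array]:
    """Columnar: each column is loaded with one bulk copy, no per-row work."""
    ids = array("q")
    ids.frombytes(id_bytes)
    values = array("d")
    values.frombytes(value_bytes)
    return ids, values
```

The per-row work in `decode_rows` is the kind of cost that grows with row count; the columnar path does a fixed number of bulk operations per column regardless of how many rows there are.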

There is also the question of computing for ML. Data scientists today use several tools/frameworks ranging from scikit-learn/XGBoost to PyTorch/Keras/TensorFlow - to name a few. Enabling data scientists to use these frameworks against near-realtime data without worrying about provisioning infrastructure or managing dependencies or adding an additional export-to-cloud-storage hop is a game changer IMO.