(no title)
rxin | 4 years ago
I don't think your argument holds here at all. It's a common misconception that high performance requires tight coupling of storage and query processing.
"Think columnar storage, high compression, vectorized query, materialized views, etc." All of those are possible in Lakehouse, and all but one (materialized views) are fully implemented on Databricks. And the remaining one isn't far away either (materialized views is really just incremental query processing + view selection, and neither problem has much to do with storage).
hodgesrm|4 years ago
In fact the Lakehouse paper seems to be setting up a strawman. Here are three examples.
* The new low-latency SQL data warehouses are open source. They are not locking data in proprietary formats. We're not Snowflake.
* SQL data warehouses are already headed toward support for object storage for the same reason everyone else is: costs and durability in large datasets. Here's just one sample of many: https://altinity.com/blog/tips-for-high-performance-clickhou...
* Not everyone cares about ML and data warehouse integration. From my experience working on ClickHouse, only a small percentage of users integrate ML. By contrast, 100% of our users care about efficient visualization and keeping data pipelines as short as possible, hence the benefit of a tightly integrated server.
I think there's actually a bifurcation of the market into low-latency use cases driven by event streams versus much larger datasets containing unstructured/semi-structured data stored in low-cost object storage. Lakehouse addresses the latter. SQL data warehouses are focused on the former. I don't see one "winning"--both markets are growing.
hodgesrm|4 years ago
I was already thinking it would be great to get a lakehouse presentation. If you are interested please submit a proposal!!