top | item 28746712


rwhaling | 4 years ago

Always an interesting read. A ton of insights, but I always find it a bit hard to make sense of, and it feels so disconnected from a modern distributed systems context.

In retrospect, I think a lot of the Spark SQL Dataframe workflow comes pretty close to what D/Tutorial D aspire to - static typing, functions on relations, imperative style but fundamentally unordered, etc.; however, it's only a processing system, not a storage system.
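The "functions on relations, imperative style but fundamentally unordered" idea can be sketched in a few lines of plain Python (a toy illustration only, not Spark or Tutorial D syntax; all names here are made up): relations as frozensets of typed rows, operators as pure functions on them, so results carry no order and no duplicates.

```python
# Toy sketch of set-oriented relational operators (hypothetical names,
# plain Python): a relation is a frozenset of named-tuple rows, so the
# result of every operator is inherently unordered and duplicate-free.
from typing import NamedTuple

class Emp(NamedTuple):
    name: str
    dept: str
    salary: int

emps = frozenset({
    Emp("alice", "eng", 120),
    Emp("bob", "eng", 100),
    Emp("carol", "sales", 90),
})

def restrict(rel, pred):
    """WHERE: keep only rows satisfying the predicate."""
    return frozenset(r for r in rel if pred(r))

def project(rel, *attrs):
    """Attribute projection; duplicate rows collapse automatically."""
    return frozenset(tuple(getattr(r, a) for a in attrs) for r in rel)

# Composable pipeline over immutable relations, no row order anywhere.
eng_names = project(restrict(emps, lambda r: r.dept == "eng"), "name")
# eng_names == frozenset({("alice",), ("bob",)})
```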

I have kept my distance from the "data lake" buzzword circles, but maybe a transactional, Spark-based data lake does approximate what Darwen/Date are going for? The only thing really missing might be nested relations?


darksaints | 4 years ago

I haven't read it but I understand where you're coming from here.

Does this doc talk about the problems with nullability / ternary logic? What about algebraic sum types? Those have always been some of the most difficult aspects of relational data modeling, at least with respect to SQL.
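For what it's worth, the ternary-logic problems mentioned here are easy to demonstrate with nothing but the stdlib `sqlite3` module: `NULL = NULL` is not true, and `NOT IN` over a column containing a NULL silently matches nothing.

```python
# Demo of SQL three-valued logic surprises, using sqlite3 from the stdlib.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

# x = NULL evaluates to UNKNOWN for every row, never TRUE, so count is 0.
eq_null = conn.execute(
    "SELECT COUNT(*) FROM t WHERE x = NULL").fetchone()[0]

# 3 NOT IN (1, 2, NULL) is UNKNOWN (3 might equal the NULL), so the
# NOT IN filter rejects the row even though 3 is plainly absent.
not_in = conn.execute(
    "SELECT COUNT(*) FROM (SELECT 3 AS y) "
    "WHERE y NOT IN (SELECT x FROM t)").fetchone()[0]

# eq_null == 0 and not_in == 0: both queries return no rows.
```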

overkalix | 4 years ago

Again, what's the difference between a data warehouse and a data lake? Honest question.

lrobinovitch | 4 years ago

As I understand it, a data lake is a storage space for unstructured data. A data warehouse is a storage + compute layer, usually with data sourced from a data lake, that is ready for querying. This understanding comes from the description in this paper.[1]

> To solve these problems, the second generation data analytics platforms started offloading all the raw data into data lakes: low-cost storage systems with a file API that hold data in generic and usually open file formats, such as Apache Parquet and ORC [8, 9]. This approach started with the Apache Hadoop movement [5], using the Hadoop File System (HDFS) for cheap storage. The data lake was a schema-on-read architecture that enabled the agility of storing any data at low cost, but on the other hand, punted the problem of data quality and governance downstream. In this architecture, a small subset of data in the lake would later be ETLed to a downstream data warehouse (such as Teradata) for the most important decision support and BI applications.

[1] http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf

cdcarter | 4 years ago

In my experience, a data warehouse usually has an ETL process at the beginning. Data comes in from disparate sources on a regular basis and is ETL'd into a shape that is ready to use by the business. A data lake, on the other hand, slurps in all the data as soon as it is available, in whatever form it is in. You have to process it into the business-consumable form when you query/egress it, but you don't have to know your dream schema up front.
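The schema-on-write vs. schema-on-read distinction can be caricatured in a few lines of stdlib Python (all names and data here are hypothetical, purely for illustration): the warehouse path validates and shapes at load time, while the lake path stores raw lines untouched and applies the schema only when a query reads them.

```python
# Caricature of warehouse (schema-on-write) vs. lake (schema-on-read).
# Hypothetical names and toy data; stdlib only.
import json

raw_events = [
    '{"user": "a", "amount": "12.5"}',
    '{"user": "b", "amount": "bad-data", "extra": true}',
]

def etl_load(lines):
    """Warehouse-style: enforce the schema up front; bad rows are
    rejected (or dead-lettered) before anyone ever queries."""
    rows = []
    for line in lines:
        rec = json.loads(line)
        try:
            rows.append({"user": rec["user"], "amount": float(rec["amount"])})
        except (KeyError, ValueError):
            pass  # route to a dead-letter queue in real life
    return rows

def query_lake(lines):
    """Lake-style: raw lines were stored as-is; the schema (and the
    data-quality problem) is applied here, at read time."""
    for line in lines:
        rec = json.loads(line)
        try:
            yield float(rec["amount"])
        except (KeyError, ValueError):
            continue

warehouse = etl_load(raw_events)             # [{'user': 'a', 'amount': 12.5}]
lake_amounts = list(query_lake(raw_events))  # [12.5]
```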

snidane | 4 years ago

Data lake: you own your files. Data warehouse: database owns your files.