rwhaling | 4 years ago
In retrospect, I think a lot of the Spark SQL Dataframe workflow comes pretty close to what D/Tutorial D aspire to - static typing, functions on relations, imperative style but fundamentally unordered, etc.; however, it's only a processing system, not a storage system.
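The resemblance can be sketched without Spark at all: the typed, order-independent "functions on relations" style that the Dataset/Dataframe API encourages looks roughly like the following, using plain Scala collections in place of Spark (the case class and data are hypothetical, purely for illustration).

```scala
// Hedged sketch: mimics the statically typed, unordered relational style of
// the Spark Dataset API using plain Scala collections -- no Spark dependency.
// Employee and the sample rows are made-up examples, not from the thread.
case class Employee(name: String, dept: String, salary: Double)

object RelationalSketch {
  val employees: Seq[Employee] = Seq(
    Employee("ada", "eng", 120000.0),
    Employee("bob", "eng", 95000.0),
    Employee("cy",  "ops", 80000.0)
  )

  // A "function on a relation": group + aggregate. The result does not
  // depend on row order, which is the Tutorial D-like property being
  // claimed for the Dataframe workflow.
  def avgSalaryByDept(rel: Seq[Employee]): Map[String, Double] =
    rel.groupBy(_.dept)
       .map { case (d, rows) => d -> rows.map(_.salary).sum / rows.size }

  def main(args: Array[String]): Unit =
    println(avgSalaryByDept(employees))
}
```

In real Spark the same shape would be `ds.groupByKey(_.dept).agg(avg(...))` over a `Dataset[Employee]`, with the type checking done at compile time; but as the comment notes, Spark only processes the relation, it does not store it.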
I have kept my distance from the "data lake" buzzword circles, but maybe a transactional, Spark-based data lake does approximate what Darwen/Date are going for? The only thing really missing might be nested relations?
darksaints | 4 years ago
Does this doc talk about the problems with nullability / ternary logic? What about algebraic sum types? Those have always been some of the most difficult aspects of relational data modeling, at least with respect to SQL.
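The ternary-logic pain point can be made concrete with a small sketch. SQL's NULL is an "unknown", so predicates evaluate in three-valued logic, which is where most of the surprises come from (`x = NULL` never matching, for instance). A minimal model in Scala, using `Option[Boolean]` with `None` standing in for UNKNOWN (all names here are ours, not from any library):

```scala
// Hedged sketch of SQL's three-valued logic, with None playing NULL/UNKNOWN.
// Illustrative only; TriBool, and, eq are hypothetical helper names.
object TernaryLogic {
  type TriBool = Option[Boolean]
  val Unknown: TriBool = None

  // SQL AND: FALSE dominates, TRUE AND TRUE is TRUE, anything else is UNKNOWN.
  def and(a: TriBool, b: TriBool): TriBool = (a, b) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => Unknown
  }

  // SQL equality: comparing anything with NULL yields UNKNOWN, not TRUE --
  // the reason WHERE x = NULL filters out every row and IS NULL exists.
  def eq(a: Option[Int], b: Option[Int]): TriBool = (a, b) match {
    case (Some(x), Some(y)) => Some(x == y)
    case _                  => Unknown
  }
}
```

The sum-type half of the complaint is the dual problem: languages with sealed traits or tagged unions express "one of these variants" directly, while SQL typically forces a nullable-columns-plus-check-constraints encoding.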
overkalix | 4 years ago
lrobinovitch | 4 years ago
> To solve these problems, the second generation data analytics platforms started offloading all the raw data into data lakes: low-cost storage systems with a file API that hold data in generic and usually open file formats, such as Apache Parquet and ORC [8, 9]. This approach started with the Apache Hadoop movement [5], using the Hadoop File System (HDFS) for cheap storage. The data lake was a schema-on-read architecture that enabled the agility of storing any data at low cost, but on the other hand, punted the problem of data quality and governance downstream. In this architecture, a small subset of data in the lake would later be ETLed to a downstream data warehouse (such as Teradata) for the most important decision support and BI applications.
[1] http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
cdcarter | 4 years ago
snidane | 4 years ago