asavinov | 4 years ago

Most of the self-service or no-code BI, ETL, and data wrangling tools I am aware of (like Airtable, Fieldbook, RowShare, Power BI, etc.) were conceived as a replacement for Excel: working with tables should be as easy as working with spreadsheets. Within a single table this problem can be solved: by defining columns in terms of other columns, such as

  ColumnA = ColumnB + ColumnC, ColumnD = ColumnA * ColumnE

we get a graph of column computations similar to the graph of cell dependencies in spreadsheets.
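
To make the spreadsheet analogy concrete, here is a minimal sketch in plain Python (standard library only, Python 3.9+). It is not taken from any of the tools mentioned: the column-wise `table` layout, the `definitions` mapping, and the `evaluate` helper are all assumptions made for illustration. It builds the dependency graph from the two formulas above and evaluates the derived columns in topological order, the way a spreadsheet recalculates cells.

```python
from graphlib import TopologicalSorter

# Hypothetical table stored column-wise: column name -> list of values.
table = {
    "ColumnB": [1, 2, 3],
    "ColumnC": [10, 20, 30],
    "ColumnE": [2, 2, 2],
}

# Derived columns: name -> (input column names, row-level function).
# Deliberately listed out of order; the graph decides evaluation order.
definitions = {
    "ColumnD": (["ColumnA", "ColumnE"], lambda a, e: a * e),
    "ColumnA": (["ColumnB", "ColumnC"], lambda b, c: b + c),
}

def evaluate(table, definitions):
    """Compute all derived columns, respecting their dependencies."""
    # Each derived column depends on its input columns.
    graph = {name: set(inputs) for name, (inputs, _) in definitions.items()}
    result = dict(table)
    for name in TopologicalSorter(graph).static_order():
        if name in definitions:  # source columns are already present
            inputs, func = definitions[name]
            rows = zip(*(result[i] for i in inputs))
            result[name] = [func(*row) for row in rows]
    return result

result = evaluate(table, definitions)
```

Here `ColumnA` is computed before `ColumnD` even though it is declared second, because the order comes from the graph and not from the order of the definitions.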

Yet the main problem is in working with multiple tables: how can we define a column in one table in terms of columns in other tables? For example:

  Table1::ColumnA = FUNCTION(Table2::ColumnB, Table3::ColumnC)

Different systems provide different answers to this question, but all of them are highly specific and rather limited.

Why is it difficult to define new columns in terms of columns in other tables? The short answer is that working with columns is not the relational approach. The relational model works with sets (rows of tables) and not with columns.

One generic approach to working with columns in multiple tables is provided by the concept-oriented model of data, which treats mathematical functions as first-class elements of the model. Previously it was implemented in a data wrangling tool called Data Commander. But then I decided to implement this model in the Prosto data processing toolkit, which is an alternative to map-reduce and SQL:

https://github.com/asavinov/prosto

It defines data transformations as operations on columns in multiple tables. Since we use mathematical functions, no joins and no groupby operations are needed, which significantly simplifies data transformations and makes them more natural.
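
As a rough illustration of the idea (not Prosto's actual API), here is a sketch in plain Python. The tables and the helper names `link_column` and `aggregate_column` are hypothetical. A column is treated as a function from rows to values: one column maps each order row to a customer row, and another column in the customers table is defined by folding values along that mapping. No joined table and no grouped intermediate result is ever materialized.

```python
# Hypothetical data, stored column-wise: orders reference customers by name.
customers = {"name": ["alice", "bob"]}
orders = {"customer": ["alice", "bob", "alice"], "amount": [10.0, 5.0, 7.5]}

def link_column(fact_keys, target_keys):
    """For each fact row, return the index of the matching target row (or None)."""
    index = {key: i for i, key in enumerate(target_keys)}
    return [index.get(key) for key in fact_keys]

def aggregate_column(link, values, target_size, func, initial):
    """Define a target-table column by folding fact values along a link column."""
    out = [initial] * target_size
    for target_row, value in zip(link, values):
        if target_row is not None:  # skip fact rows with no matching target
            out[target_row] = func(out[target_row], value)
    return out

# A column of orders whose values are rows of customers (instead of a join).
orders["customer_row"] = link_column(orders["customer"], customers["name"])

# A column of customers defined via columns of orders (instead of a groupby).
customers["total_amount"] = aggregate_column(
    orders["customer_row"],
    orders["amount"],
    len(customers["name"]),
    func=lambda acc, v: acc + v,
    initial=0.0,
)
```

Both results are ordinary columns attached to existing tables, so they can be used as inputs to further column definitions just like the single-table case.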

Moreover, Prosto now provides Column-SQL, which makes it even easier to define new columns in terms of other columns:

https://github.com/asavinov/prosto/blob/master/notebooks/col...
