wesm | 5 years ago
* BigQuery: no
* Redshift: no
* Spark SQL: no
* Snowflake: no
* Clickhouse: no
* Dremio: no
* Impala: no
* Presto: no
... list continues
We've invited developers to add the extension types for tensor data, but no one has contributed them yet. I'm not seeing a lot of tabular data with embedded tensors out in the wild.
aldanor|5 years ago
E.g., as of right now, having to concatenate hundreds of columns manually just to pass them to some ML library in a contiguous format is always a pain and often doubles the peak RAM requirement.
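To make the cost concrete, here is a minimal NumPy sketch of that manual concatenation. The column names, row count, and dtype are made up for illustration; the only point is that stacking allocates a second, contiguous copy while the original columns are still alive.

```python
import numpy as np

# Hypothetical table: 200 separate float32 columns of 1,000 rows each
n_rows, n_cols = 1_000, 200
columns = {f"c{i}": np.full(n_rows, i, dtype=np.float32) for i in range(n_cols)}

# Stacking builds a brand-new (n_rows, n_cols) array for the ML library
matrix = np.column_stack(list(columns.values()))

# The new array is as large as all source columns combined ...
column_bytes = sum(col.nbytes for col in columns.values())
same_size = matrix.nbytes == column_bytes

# ... and shares no memory with them, so both copies coexist until the
# source columns are released
shares = np.shares_memory(matrix, columns["c0"])
```

Until `columns` is dropped, peak usage is roughly twice the size of the data, which is the doubling described above.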
lmeyerov|5 years ago
```
import pyarrow as pa
# Struct type with nine int8 fields (f_0_0 ... f_2_2), one per cell of a 3x3 matrix
my_col_of_3x3s = pa.struct([(f'f_{x}_{y}', pa.int8()) for x in range(3) for y in range(3)])
```
If using ndarrays, I think our helpers are another ~4 lines each. Interop with C is even easier: just cast. You can now pass this data through any Arrow-compatible compute stack / DB and not lose the value types. We do this for streaming into WebGL's packed formats, for example.
What you don't get is a hint to the downstream systems that the data is multidimensional. Tableau would just let you do individual bar charts, not, say, a heatmap, assuming it supports rank-2 data at all. To convert, you'd need to do that zero-copy cast to whatever they do support. I agree a targetable standard would avoid the need for that manual conversion and increase the likelihood that they use the same data representation.
Native support would also avoid some header bloat from using structs. However, we find that's fine in practice, since it's only metadata. E.g., our streaming code reads the schema at the beginning and then passes it along, so the actual payloads are pure data and skip resending the metadata.
ahachete|5 years ago
mkl|5 years ago
* BigQuery: no
* Redshift: no
* Spark SQL: no
* Snowflake: no
* Clickhouse: no
* Dremio: no
* Impala: no
* Presto: no
jhgb|5 years ago
est|5 years ago
https://github.com/ClickHouse/ClickHouse/issues/12284
zX41ZdbW|5 years ago