wesm | 5 years ago

Almost no database systems support multidimensional arrays. So they are not appropriate for many use cases?

* BigQuery: no * Redshift: no * Spark SQL: no * Snowflake: no * Clickhouse: no * Dremio: no * Impala: no * Presto: no ... list continues

We've invited developers to add the extension types for tensor data, but no one has contributed them yet. I'm not seeing a lot of tabular data with embedded tensors out in the wild.

aldanor|5 years ago

I think that implementing good ndim=2 support would already be a huge leap forward; it doesn't have to be something super generic. Most of classic machine learning takes 2-dimensional data (samples x features) as input, so this is a very common use case.

E.g., as of right now, having to concatenate hundreds of columns manually just in order to pass them to some ml library in a contiguous format is always a pain and often doubles the max ram requirement.
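To make the memory cost concrete, here is a minimal sketch using NumPy only. The column names and sizes are made up for illustration; the point is just that stacking separately stored columns into one (samples x features) matrix allocates a second buffer on top of the originals.

```python
import numpy as np

n_samples = 1_000

# Hypothetical feature columns, each in its own buffer, as a columnar store holds them.
columns = {
    f"feature_{i}": np.arange(n_samples, dtype=np.float64) + i
    for i in range(3)
}

# Building the contiguous (samples x features) matrix copies every column.
X = np.column_stack(list(columns.values()))

# None of the original column buffers is reused by X.
copied = not any(np.shares_memory(X, col) for col in columns.values())

# Extra memory held while both the columns and X are alive.
extra_bytes = X.nbytes
```

While both `columns` and `X` are alive, the data is held twice, which is where the roughly doubled peak RAM comes from.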

lmeyerov|5 years ago

This may help you get zero-copy handling for a column of multidimensional values without losing the value types; the struct is just an encoding of the multidimensional shape. This example is for values that are 3x3 grids of int8:

```
import pyarrow as pa

# Struct type with nine int8 fields (f_0_0 .. f_2_2), one per cell of a 3x3 value.
my_col_of_3x3s = pa.struct([(f'f_{x}_{y}', pa.int8()) for x in range(3) for y in range(3)])
```

If using ndarrays, I think our helpers are another ~4 lines each. Interop with C is even easier, just cast. You can now pass this data through any Arrow-compatible compute stack / DB and not lose the value types. We do this for streaming into webgl's packed formats, for example.

What you don't get is a hint to the downstream systems that it is multidimensional. Tableau would just let you do individual bar charts, not say a heatmap, assuming they support rank 2's. To convert, you'd need to do that zero-copy cast to whatever they do support. I agree a targetable standard would avoid the need for that manual conversion, and would increase the likelihood that they use the same data representation.

Native support would also avoid some header bloat from using structs. However, we find that's fine in practice, it's metadata. E.g., our streaming code reads the schema at the beginning and then passes it along, so actual payloads are pure data, and skip resending metadata.

ahachete|5 years ago

Postgres: https://www.postgresql.org/docs/current/arrays.html

mkl|5 years ago

If you put a blank line between your bullet points, they'll display properly:

* BigQuery: no

* Redshift: no

* Spark SQL: no

* Snowflake: no

* Clickhouse: no

* Dremio: no

* Impala: no

* Presto: no

jhgb|5 years ago

I suspect that AllegroCache accepts arrays with rank>=2, although I never got around to trying it out. (At the very least its documentation has nothing to say about any limitations on what kinds of arrays can be stored, so I'm assuming it stores all of them.)

est|5 years ago

On a side note, Clickhouse had some Arrow support

https://github.com/ClickHouse/ClickHouse/issues/12284

zX41ZdbW|5 years ago

ClickHouse has support for multidimensional arrays.