sixdimensional|17 days ago
I still don't understand what happened to using Apache Avro [1] for row-oriented, fast-write use cases.
I think by now a lot of people know you can write to Avro and compact to Parquet, and that is a key area of development. I'm not sure there's a great solution yet.
Apache Iceberg tables can sit on top of Avro files as one of the storage engines/formats, in addition to Parquet or even the old ORC format.
Apache Hudi [2] was looking into HTAP capabilities: writing to a row store, then compacting or merging on read into a column store in the background, so you can get the best of both worlds. I don't know where they've ended up.
[1] https://avro.apache.org/
[2] https://hudi.apache.org/
yencabulator|17 days ago
You basically can't do row by row appends to any columnar format stored in a single file. You could kludge around it by allocating arenas inside the file, but that's still huge write amplification: instead of writing a row in a single block, you'd have to write a block per column.
amluto|17 days ago
You can do row by row appends to a Feather file (Arrow IPC; the naming is confusing). It works fine. The main problem is that the per-append overhead is kind of silly: it costs over 300 bytes (IIRC) per append.
I wish there was an industry standard format, schema-compatible with Parquet, that was actually optimized for this use case.
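For a rough sense of the write-amplification point made upthread, here is a toy block-count model. It is only a sketch: the 4096-byte block size, the "one arena per column" layout, and the function names are assumptions for illustration, not a description of Parquet, Arrow IPC, or any real file format.

```python
from typing import Sequence, Tuple

BLOCK_SIZE = 4096  # assumed storage block size in bytes (hypothetical)


def _blocks(n_bytes: int) -> int:
    """Number of whole blocks needed to hold n_bytes (at least one)."""
    return max(1, -(-n_bytes // BLOCK_SIZE))  # ceiling division


def blocks_per_append(column_value_bytes: Sequence[int]) -> Tuple[int, int]:
    """Return (row_layout_blocks, column_arena_blocks) for appending one row.

    Row layout: all values of the row are written contiguously.
    Column arenas: each value goes to its own column's arena, so every
    column dirties at least one block of its own.
    """
    row_blocks = _blocks(sum(column_value_bytes))
    column_blocks = sum(_blocks(n) for n in column_value_bytes)
    return row_blocks, column_blocks


# One row with 20 columns of 8 bytes each (160 bytes of payload).
row_blocks, column_blocks = blocks_per_append([8] * 20)
```

Under these assumptions the row layout dirties a single block while the arena layout dirties one block per column, which is the "block per column" cost described above. The model ignores buffering and batching, which is how real writers usually amortize that cost.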
gregw2|17 days ago
There is room still for an open source HTAP storage format to be designed and built. :-)
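To make the "write rows, compact to columns, merge on read" idea from this thread concrete, here is a minimal in-memory sketch. Everything in it (the `TinyHybridTable` class, its methods, the JSON-encoded row log) is hypothetical and stdlib-only; it does not use or imitate the Avro, Parquet, Hudi, or Iceberg APIs.

```python
import json
from typing import Any, Dict, List, Sequence


class TinyHybridTable:
    """Toy table: a row-oriented append log plus a compacted column store."""

    def __init__(self, columns: Sequence[str]) -> None:
        self.columns = list(columns)
        self.row_log: List[str] = []  # one encoded record per append
        self.base: Dict[str, List[Any]] = {c: [] for c in self.columns}

    def append(self, row: Dict[str, Any]) -> None:
        # A write touches only the row log: one small sequential record.
        self.row_log.append(json.dumps([row[c] for c in self.columns]))

    def compact(self) -> None:
        # Background step: pivot logged rows into per-column lists,
        # then clear the log.
        for line in self.row_log:
            for name, value in zip(self.columns, json.loads(line)):
                self.base[name].append(value)
        self.row_log.clear()

    def scan(self, column: str) -> List[Any]:
        # Merge on read: compacted values first, then rows still in the log.
        idx = self.columns.index(column)
        tail = [json.loads(line)[idx] for line in self.row_log]
        return self.base[column] + tail


table = TinyHybridTable(["id", "temp"])
table.append({"id": 1, "temp": 20.5})
table.append({"id": 2, "temp": 21.0})
table.compact()                        # first two rows become columnar
table.append({"id": 3, "temp": 19.0})  # third row stays in the row log
temps = table.scan("temp")             # reads across both stores
```

The point of the split is that appends stay cheap and sequential while scans mostly hit the columnar base. A real format would also need durability, schema evolution, and concurrent compaction, none of which this sketch attempts.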