Memory Efficient Data Streaming to Parquet Files

28 points | danthelion | 1 year ago | estuary.dev

2 comments

LatexWriter|1 year ago

Your article does not mention how much runtime improvement you have observed. Can you share those numbers?

danthelion|1 year ago

With the 2-pass strategy, we can write arbitrary row group sizes while using a fixed amount of memory, with probably 100-200 MiB of overhead for the Parquet file processing, depending on how large the metadata is for the scratch file. Without the 2-pass strategy, the amount of memory is proportional to the size of the row group.
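
To make the memory claim concrete, here is a minimal sketch of one way a 2-pass writer could work. It is a toy under stated assumptions, not the actual implementation: it assumes pass 1 spills small column-wise batches to a scratch file and pass 2 copies them, one column at a time, into a single large row group. It uses pickle rather than the Parquet format, and the names two_pass_write, batch_rows, and chunk_offsets are hypothetical.

```python
import io
import pickle
import tempfile


def two_pass_write(rows, num_columns, batch_rows, out_file):
    """Toy two-pass writer; returns the peak number of values held in memory.

    Assumption: this only mimics the shape of a 2-pass strategy. It does not
    produce Parquet; chunks are pickled lists of column values.
    """
    peak = 0
    # Per-column scratch-file offsets. This is the only state that grows with
    # the input, loosely analogous to scratch-file metadata.
    chunk_offsets = [[] for _ in range(num_columns)]

    with tempfile.TemporaryFile() as scratch:
        batch = []

        def spill():
            # Write the buffered rows to the scratch file, column by column.
            for c in range(num_columns):
                chunk_offsets[c].append(scratch.tell())
                pickle.dump([row[c] for row in batch], scratch)
            batch.clear()

        # Pass 1: never buffer more than batch_rows rows at once.
        for row in rows:
            batch.append(row)
            peak = max(peak, len(batch) * num_columns)
            if len(batch) == batch_rows:
                spill()
        if batch:
            spill()

        # Pass 2: build one large "row group" by streaming each column's
        # small chunks in order, holding a single chunk at a time.
        for c in range(num_columns):
            for offset in chunk_offsets[c]:
                scratch.seek(offset)
                chunk = pickle.load(scratch)
                peak = max(peak, len(chunk))
                pickle.dump(chunk, out_file)
    return peak


# Usage: 10 rows, 2 columns, spilled in batches of 3 rows.
out = io.BytesIO()
peak = two_pass_write(((i, i * 10) for i in range(10)),
                      num_columns=2, batch_rows=3, out_file=out)
print(peak)  # bounded by batch_rows * num_columns, not by the row count
```

In this toy, the peak depends only on batch_rows and the number of columns, so the output row group can be arbitrarily large. A single-pass writer that buffers the whole row group before writing would instead hold every row of the group in memory.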