lurker458 | 5 years ago

I've also been looking for that. In an ideal world there would be a small, fast, standalone CLI tool that converts CSV to Parquet. There is a (sadly unfinished) Parquet writer Rust library in the Arrow repository that looks promising. Every approach I've tried so far (Spark, PyArrow, Drill, ...) pulls in everything and the kitchen sink. For now I've settled on a Java CLI tool that uses Jackson + org.apache.parquet internally, but it's CPU-bound and has a huge number of Maven dependencies.

meritt|5 years ago

pandas + fastparquet is fairly lightweight. But yes, I would love to see a simple C++/Go binary that's just a single csv2parq call.

MrPowers|5 years ago

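Newer versions of pandas don't even need fastparquet anymore: to_parquet uses pyarrow by default when it is installed, so one of the two engines still has to be present. This code works: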

  import pandas as pd

  df = pd.read_csv('data/us_presidents.csv')
  df.to_parquet('tmp/us_presidents.parquet')