top | item 41122866

100pctremote | 1 year ago

Most people don't directly query or otherwise operate on raw CSV, though. Large source datasets in CSV format still reign in many enterprises, but these are typically read into a dataframe, manipulated and stored as Parquet and the like, then operated upon by DuckDB, Polars, etc., or modeled (e.g., dbt) and pushed to an OLAP target.

wenc|1 year ago

There are folks who still directly query CSV formats in a data lake using a query engine like Athena or Spark or Redshift Spectrum — which ends up being much slower and consuming more resources than is necessary due to full table scans.

CSV is only good for append-only workloads.

But so is Parquet, and if you can write Parquet from the get-go, you save on storage as well as have a directly queryable column store from the start.

CSV still exists because of legacy data-generating processes and a dearth of Parquet familiarity among many software engineers. CSV is simple to generate and easy to troubleshoot without specialized tools (compared to Parquet, which requires tools like VisiData). But you pay for it elsewhere.

fragmede|1 year ago

how about using Sqlite database files as an interchange format?
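For what the suggestion above looks like in practice, here is a minimal sketch using only the Python standard library; the table and columns are made up. The appeal is that, unlike CSV, the single file carries types, constraints, and indexes, and any language with a SQLite driver can consume it.

```python
import sqlite3

# Producer side: write data into a single-file database.
with sqlite3.connect("exchange.db") as con:
    con.execute("CREATE TABLE t (k TEXT, v INTEGER)")
    con.executemany("INSERT INTO t VALUES (?, ?)", [("a", 1), ("b", 2)])

# Consumer side: any SQLite client, in any language, queries the same
# file directly -- no parsing, no type guessing.
with sqlite3.connect("exchange.db") as con:
    rows = con.execute("SELECT k, v FROM t ORDER BY k").fetchall()
```

The trade-off versus Parquet: SQLite is row-oriented, so it interchanges well but analytical scans over wide tables stay slower than a column store.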

cmollis|1 year ago

exactly.. parquet is good for append only.. stream mods to parquet in new partitions.. compact, repeat.