item 33937007

jointpdf | 3 years ago

We data scientists are well-known for our exclusive mastery of data-wrangling arcana, like…

  import pandas

  df = pandas.read_parquet('foo.parquet')
  df.to_csv('foo.csv')
  df.to_json('foo.json')
(No sarcasm.) How could it be simpler than that? What problems have you encountered that make it unusable?


scrollaway | 3 years ago

Arrow and pandas are massive dependencies.

wenc | 3 years ago

Not really. It depends on your use case, but most of the time you're trading disk space for a specialized, efficient library.

Pandas and Arrow are dependencies like any other. Pandas is like a DSL for working with tabular data, much as numpy is a DSL for working with arrays and numerical algebra. No one working with linear algebra insists on using only the Python standard library built-ins.
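To make the DSL point concrete: a group-by aggregation that is one expression in pandas takes a manual accumulation loop with the standard library alone. A minimal sketch (the data and column names are made up for illustration):

```python
import pandas as pd

rows = [
    {"city": "Oslo", "sales": 10},
    {"city": "Oslo", "sales": 5},
    {"city": "Bergen", "sales": 7},
]

# pandas: one expression
totals = pd.DataFrame(rows).groupby("city")["sales"].sum().to_dict()

# stdlib equivalent: accumulate by hand
manual = {}
for row in rows:
    manual[row["city"]] = manual.get(row["city"], 0) + row["sales"]

assert totals == manual  # {"Oslo": 15, "Bergen": 7}
```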

If you're distributing a smallish Python app that only needs to read and manipulate smallish amounts of data, then I agree there are simpler solutions like SQLite.
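For that smallish-app case, here's a sketch of the SQLite route using only the standard library (the table, columns, and file name are hypothetical):

```python
import csv
import sqlite3

# In-memory database; use a file path for persistence
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (name TEXT, value REAL)")
conn.executemany("INSERT INTO foo VALUES (?, ?)",
                 [("a", 1.0), ("b", 2.5)])

# Query and dump to CSV without pandas
with open("foo.csv", "w", newline="") as f:
    writer = csv.writer(f)
    cur = conn.execute("SELECT name, value FROM foo")
    writer.writerow([col[0] for col in cur.description])  # header row
    writer.writerows(cur)
```

No third-party dependency at all, and SQL is available for the querying.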

But if you're doing consulting work, dealing with large tabular datasets, and need SQL-style window functions and aggregations, then Parquet is a better fit, and the disk space required for adding a Pandas dependency is trivial. If one is using Anaconda, Pandas is batteries-included. It really depends on what is being optimized for.
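The SQL-style window functions mentioned map directly onto pandas group-by operations. A sketch, with an inline DataFrame standing in for a Parquet read (the file name and columns are hypothetical):

```python
import pandas as pd

# In practice this would be: df = pd.read_parquet("sales.parquet")
df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "month": [1, 2, 1, 2],
    "sales": [100, 150, 80, 120],
})

# Equivalent of SUM(sales) OVER (PARTITION BY region)
df["region_total"] = df.groupby("region")["sales"].transform("sum")

# Equivalent of a running total ordered by month within each region
df = df.sort_values(["region", "month"])
df["running"] = df.groupby("region")["sales"].cumsum()
```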