talolard|3 years ago
For example, I helped an Israeli NGO analyze retailer pricing data (supermarkets must publish prices every day by law). Pandas chokes on data that large; Postgres can handle it, but aggregations are very slow. DuckDB is lightning fast.
The traditional alternative I’m familiar with is Spark, but it’s such a hassle to set up, expensive to run, and not as fast on these kinds of use cases.
I will note that familiarity with Parquet and with how columnar engines work is helpful. I have gotten tremendous performance increases by storing the data sorted inside the Parquet file, though that adds ETL overhead.
Still, it’s a very powerful and convenient tool for working with large datasets locally.
RyEgswuCsn|3 years ago
Think of it as a SQL engine for ad-hoc querying larger-than-memory datasets.
talolard|3 years ago
But you (or at least I) wouldn’t use it like a standard DB where data is constantly being written in; it’s more a tool for effectively analyzing data that already lives somewhere.