top | item 45924036

throwaway-aws9 | 3 months ago

650GB? Your data is small; it fits on my phone. Dump the hyped tooling and just use GNU tools.

Here's an oldie on the topic: https://adamdrake.com/command-line-tools-can-be-235x-faster-...


faizshah | 3 months ago

This isn’t true anymore; we are way beyond 2014-era Hadoop (which is what the blog post is about) at this point.

Go try doing an aggregation of 650 GB of JSON data using normal CLI tools vs DuckDB or ClickHouse. These tools pipeline and parallelize in a way that isn’t easy to replicate with just GNU Parallel (trust me, I’ve tried).
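To make the comparison concrete, here is a minimal sketch of the two approaches on a tiny stand-in dataset. The file name `posts.ndjson` and its `date` field are hypothetical, just to illustrate the shape of the aggregation being discussed:

```shell
# Hypothetical stand-in for a large newline-delimited JSON dump.
printf '%s\n' \
  '{"date":"2024-01-01","title":"a"}' \
  '{"date":"2024-01-01","title":"b"}' \
  '{"date":"2024-01-02","title":"c"}' > posts.ndjson

# The classic GNU-tools aggregation: count posts per date.
# Single-threaded sort is the bottleneck at hundreds of GB.
grep -o '"date":"[0-9-]*"' posts.ndjson | sort | uniq -c

# The DuckDB equivalent (needs duckdb on PATH), which parallelizes
# and pipelines the scan, parse, and group-by internally:
# duckdb -c "SELECT date, count(*) FROM read_json_auto('posts.ndjson') GROUP BY date"
```

The GNU pipeline materializes and sorts every extracted key before counting, which is exactly the part that engines like DuckDB avoid with hash aggregation over parallel scans.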

Demiurge | 3 months ago

What if it was 650 TB? This article is obviously a microbenchmark. I work with much larger datasets, and neither awk nor DuckDB would make a difference to the overall architecture. You need a data catalog, and you need clusters of jobs at scale, regardless of the data-format library or libraries.

CraigJPerry | 3 months ago

At 650 TB it's not a memory-bound problem:

working memory requirements

    1. Assume a date is 8 bytes
    2. Assume 64-bit counters
So for each distinct date in the dataset we need 16 bytes to accumulate the result.

That's ~180 years' worth of daily post counts per MB of RAM - but the dataset in the post was just 1 year.
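A quick sanity check of that arithmetic, using the 16-bytes-per-distinct-date assumption stated above:

```shell
# 8-byte date + 8-byte 64-bit counter = 16 bytes per distinct date.
# One MB of accumulator state covers roughly 179 years of daily counts.
awk 'BEGIN { per_day = 16; mb = 1024 * 1024; printf "%.0f years per MB\n", mb / per_day / 365.25 }'
```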

This problem should be mostly network-limited in the OP's context; decompressing Snappy-compressed Parquet should run at circa 1 GB/s. The "work" of parsing a string to a date and accumulating isn't expensive compared to Snappy decompression.
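Taking the ~1 GB/s decompression figure above at face value (an assumption, and per core), the wall-clock floor for the 650 GB dataset works out to roughly eleven minutes on a single stream:

```shell
# Assumed numbers: 650 GB dataset, ~1 GB/s single-stream Snappy decompression.
awk 'BEGIN { gb = 650; rate = 1.0; printf "%.1f minutes\n", gb / rate / 60 }'
```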

I don't have an explanation for the ~33% runtime difference between DuckDB and Polars here.

adammarples | 3 months ago

I think the entire point of the article (reading a bit further through the linked posts about Redshift files) is that almost nobody in the world uses datasets bigger than 100 TB, that when they do, they use a small subset anyway, and that 650 GB is a pretty reasonable approximation of the entire dataset most companies are even working with. Certainly in my experience as a data engineer, they're not often in the many terabytes. It's good to know that OOTB DuckDB can replace Snowflake et al. in these situations, especially given how expensive they are.