top | item 42324764

mattewong | 1 year ago

This is misleading. First, as other comments have noted, it compares a multi-threaded/parallelized run against a single-threaded one, and its total CPU time is much longer than wc's. Second, it suggests there is something special going on when there is not. I'm pretty confident that just breaking the file into parts and running wc -l on each part -- or even running a CSV parser that is much more versatile than DuckDB's -- would perform significantly faster than this showing. Bets, anyone?
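The split-and-count approach described above can be sketched in a few lines of shell. This is a minimal illustration, not the benchmark from the post: the file name, size, and chunk count are made up here, and it uses GNU split plus xargs rather than GNU Parallel.

```shell
# Hypothetical input file standing in for the CSV from the post.
seq 1 100000 > data.txt

# Split into 4 line-aligned chunks (GNU split's -n l/4), count each
# chunk in parallel with 4 workers, then sum the per-chunk counts.
split -n l/4 data.txt chunk_
ls chunk_* | xargs -P4 -n1 sh -c 'wc -l < "$0"' | awk '{s+=$1} END {print s}'
# prints 100000, matching a plain `wc -l < data.txt`
```

The sum of the per-chunk counts equals the single-threaded `wc -l` result, but the chunks are counted concurrently.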

szarnyasg | 1 year ago

I am the author of the original post, and I also wrote a follow-up blog post on it yesterday: https://szarnyasg.org/posts/duckdb-vs-coreutils/

Yes, if you break the file into parts with GNU Parallel, you can easily beat DuckDB as I show in the blog post.

That said, I maintain that it's surprising that DuckDB outperforms wc (and grep) on many common setups, e.g., on a MacBook. This is not something many databases can do, and the ones that can usually don't run on a laptop.

mattewong | 1 year ago

Your follow-up post is helpful and appreciated!

Re the original analysis, my own opinion is that the outcome is only surprising when the critical detail of how the two runs differ is omitted. It seems very unsurprising once rephrased to include that detail: "DuckDB, executed multi-threaded and parallelized, is 2.5x faster than single-threaded wc, even though in doing so DuckDB used 9.3x more CPU".

In fact, to me, the only thing that seems surprising about that is how poorly DuckDB does compared to wc -- over 9x more CPU for only a 2.5x improvement.
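Those two ratios imply a parallel efficiency that is easy to check with quick arithmetic (the 2.5x and 9.3x figures are the ones quoted above):

```shell
# Parallel efficiency: wall-clock speedup divided by the CPU-time ratio,
# i.e. what fraction of the extra CPU actually became speedup.
awk 'BEGIN { printf "%.0f%%\n", 100 * 2.5 / 9.3 }'
# prints 27%
```

So roughly a quarter of the additional CPU time translated into wall-clock improvement.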

But it's an interesting analysis regardless of the takeaways -- thank you.