It's a great question: fundamentally, Parquet is a columnar format. For datasets like these, there's some research[0] indicating that this is a preferable way of storing and querying WARC data.
Good question! As a disclaimer, I work for DuckDB Labs.
DuckDB has SQLite semantics but is natively built around columnar formats (Parquet, in-memory Arrow) and strong types (including dates). It also supports very complex SQL.
wahnfrieden | 3 years ago
llambda | 3 years ago
DuckDB, like SQLite, is serverless: it runs in-process rather than as a separate database server. DuckDB has a leg up on SQLite when it comes to Parquet, though: it reads Parquet files directly, which makes dealing with these datasets a breeze.
[0] https://www.researchgate.net/figure/Comparing-WARC-CDX-Parqu...
1egg0myegg0 | 3 years ago
There are two big benefits to working with Parquet files in DuckDB, and both relate to speed!
DuckDB can query Parquet right where it sits, so there is no need to insert it into the database first. This is typically much faster. Also, DuckDB's engine is columnar (SQLite is row-based), so it can run analytical queries faster on that format. I have seen 20-100x speed improvements over SQLite in analytical workloads.
Happy to answer any questions!
wenc | 3 years ago
SQLite is a row store built around row-based transactional workloads. DuckDB is built around analytics workloads (lots of filtering, aggregations and transformations), and for these workloads DuckDB is just way, way faster. Source: personal experience.
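As a rough illustration of why layout matters for aggregations, here is a toy sketch in plain Python. It is not how SQLite or DuckDB are implemented; the records and field names are invented, and the only point is which data each approach has to touch.

```python
# Row layout: each record is stored together, as in a row store.
rows = [
    ("a.example", 200, 1200),
    ("b.example", 404, 300),
    ("a.example", 200, 800),
    ("c.example", 500, 50),
]

# Columnar layout: the values of each column are stored together.
columns = {
    "host": [r[0] for r in rows],
    "status": [r[1] for r in rows],
    "content_length": [r[2] for r in rows],
}


def total_length_row_store(records):
    # Walks every full record even though only one field is needed.
    return sum(record[2] for record in records)


def total_length_column_store(cols):
    # Reads only the one column; "host" and "status" are never touched.
    return sum(cols["content_length"])


row_total = total_length_row_store(rows)
col_total = total_length_column_store(columns)
print(row_total, col_total)
```

Both functions return the same total. The difference is that the columnar version scans one contiguous list, which is the access pattern that filtering and aggregation queries tend to benefit from.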
infogulch | 3 years ago
1vuio0pswjnm7 | 3 years ago
duckdb is 41MB
(q/k, another columnar SQL database, is less than a MB)
mritchie712 | 3 years ago