
Show HN: Easily Convert WARC (Web Archive) into Parquet, Then Query with DuckDB

116 points | llambda | 3 years ago | github.com

15 comments

wahnfrieden | 3 years ago
How does this compare with SQLite approaches shared recently?
llambda | 3 years ago
Great question. Fundamentally, the Parquet format is column-oriented, and there is some research [0] indicating that this is a preferable way to store and query datasets like WARC.

DuckDB, like SQLite, is serverless (it runs in-process). DuckDB has a leg up on SQLite when it comes to Parquet, though: it reads Parquet files directly, which makes dealing with these datasets a breeze.

[0] https://www.researchgate.net/figure/Comparing-WARC-CDX-Parqu...

1egg0myegg0 | 3 years ago
Good question! As a disclaimer, I work for DuckDB Labs.

There are two big benefits to working with Parquet files in DuckDB, and both relate to speed!

First, DuckDB can query Parquet right where it sits, so there is no need to insert it into the database first. This is typically much faster than loading the data beforehand. Second, DuckDB's engine is columnar (SQLite is row-based), so it can run analytical queries faster on that format. I have seen 20-100x speed improvements over SQLite in analytical workloads.

Happy to answer any questions!

wenc | 3 years ago
DuckDB has SQLite semantics but is natively built around columnar formats (Parquet, in-memory Arrow) and strong types (including dates). It also supports very complex SQL.

SQLite is a row store built around row-based transactional workloads. DuckDB is built around analytics workloads (lots of filtering, aggregations, and transformations), and for these workloads DuckDB is much faster. Source: personal experience.
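
As a rough illustration of the row store versus column store distinction, here is a toy sketch in plain Python. It is not how either engine is implemented; the record fields (`url`, `status`, `length`) and both function names are made up for the example. It only shows why an aggregate over one field touches less data when each field is stored contiguously.

```python
# Row-oriented layout: one tuple per record, all fields stored together.
rows = [
    ("https://example.com/", 200, 5120),
    ("https://example.com/missing", 404, 312),
    ("https://example.org/", 200, 2048),
]

# Column-oriented layout: one list per field.
columns = {
    "url": [r[0] for r in rows],
    "status": [r[1] for r in rows],
    "length": [r[2] for r in rows],
}


def total_length_row_store(records):
    # Every record is visited in full even though only one field is needed.
    return sum(record[2] for record in records)


def total_length_column_store(cols):
    # Only the "length" column is read; "url" and "status" are never touched.
    return sum(cols["length"])


row_total = total_length_row_store(rows)
col_total = total_length_column_store(columns)
print(row_total, col_total)
```

Both layouts give the same answer; the difference is how much data has to be scanned to get it, which matters more as the number of columns and rows grows.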

1vuio0pswjnm7 | 3 years ago
SQLite3 is 1.6MB

DuckDB is 41MB

(q/k, another columnar SQL database, is less than a MB)

mritchie712 | 3 years ago
Nice! I've been considering using DuckDB for our product (to speed up joins and aggregates of in-memory data). It's an incredible technology.