mrtimo | 1 month ago
-- Support for .parquet, .json, .csv (note: Spotify listening history comes in multiple .json files, something fun to play with).
-- Support for glob reading, like: select * from 'tsa20*.csv' - so you can read hundreds of files (any type of file!) as if they were one file.
-- If the files don't have the same schema, union_by_name is amazing.
-- The .csv parser is amazing. Auto assigns types well.
-- It's small! The WebAssembly version is 2 MB! The CLI is 16 MB.
-- Because it is small you can add DuckDB directly to your product, like Malloy has done: https://www.malloydata.dev/ - I think of Malloy as a technical person's alternative to PowerBI and Tableau, but it uses a semantic model that helps AI write amazing queries on your data. Edit: Malloy makes SQL 10x easier to write because of its semantic nature. Malloy transpiles to SQL, like TypeScript transpiles to JavaScript.
skeeter2020|1 month ago
Their csv support coupled with lots of functions and fast & easy iterative data discovery has totally changed how I approach investigation problems. I used to focus a significant amount of time on understanding the underlying schema of the problem space first, and often there really wasn't one - but you didn't find out easily. Now I start with pulling in data, writing exploratory queries to validate my assumptions, then cleaning & transforming data and creating new tables from that state; rinse and repeat. Aside from getting much deeper much quicker, you also hit dead ends sooner, saving a lot of otherwise wasted time.
There's an interesting paper out there on how the CSV parser works, and some ideas for future enhancements. I couldn't seem to find it but maybe someone else can?
tosh|1 month ago
HowardStark|1 month ago
One of my favorite features is `SELECT ... FROM s3Cluster('<ch cluster>', 'https://...<s3 url>.../data//.json', ..., 'JSON')`[0] which lets you wildcard ingest from an S3 bucket and distributes the processing across nodes in your configured cluster. Also, I think it works with `schema_inference_mode` (mentioned below) though I haven't tried it. Very cool time for databases / DB tooling.
(I actually wasn't familiar with `union_by_name`, but it looks like ClickHouse has implemented that as well [1,2]. Neat feature in either case!)
[0] https://clickhouse.com/docs/sql-reference/table-functions/s3...
[1] https://clickhouse.com/docs/interfaces/schema-inference
[2] https://github.com/ClickHouse/ClickHouse/pull/55892
oulipo2|1 month ago
jorin|1 month ago
https://github.com/taleshape-com/shaper
freakynit|1 month ago
arjie|1 month ago
exographicskip|1 month ago
mrtimo|1 month ago
falconroar|1 month ago
prometheon1|1 month ago
hk1337|1 month ago
newusertoday|1 month ago
It is also harder to customize than SQLite; for example, if you want to use your own CSV parser, that becomes hard.
But yes, it provides a lot of convenience out of the box, as you have already listed.