My company tried DuckDB-WASM + parquet + S3 a few months ago, but we ended up stripping it all out and replacing it with a boring REST API.
On paper it seemed like a great fit, but it turned out the WASM build doesn't have feature parity with the "normal" native variant, so features that led us to pick it, like Parquet compression support and lazy loading, weren't available.
So it ended up not performing well while introducing a lot of complexity, and it was also terrible for first-page load time because of the large WASM blob it needs. Build pipeline complexity was inherently higher too, due to the extra dependency and the data packaging required.
Just something to be aware of if you're thinking of using it. Our conclusion was that it wasn't worth it for most use cases, which is a shame because it seems like such a cool tech.
> WASM build doesn't have feature-parity with the "normal" variant
It's a good point, but the wasm docs do state that feature parity isn't there - yet. They could certainly be more detailed, but it seems strange that your company would do all this work without first checking the feature coverage / specs.
> WebAssembly is basically an additional platform, and there might be platform-specific limitations that make some extensions not able to match their native capabilities or to perform them in a different way.
OK, this is really neat:
- S3 is really cheap static storage for files.
- DuckDB is a database that can query files stored on S3 directly.
- WASM lets you run binary (non-JS) code in your browser.
- DuckDB-Wasm allows you to run a database in your browser.
Put all of that together, and you get a website that queries S3 with no backend at all. Amazing.
S3 might be relatively cheap for storing files, but with bandwidth you could easily be paying $230/mo. If you make it public-facing and want to use their cloud reporting, metrics, etc. to keep people from running up your bandwidth, your "really cheap" static hosting could easily cost you more than $500/mo.
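For a rough sense of where numbers like that come from, here's a back-of-envelope sketch. The ~$0.09/GB figure is the standard S3 internet egress rate at the time of writing — an assumption to check against current AWS pricing, since tiers and regions vary:

```python
# Back-of-envelope S3 egress cost. The rate below is the standard
# US-East internet data-transfer-out tier (an assumption; verify
# against current AWS pricing before relying on it).
EGRESS_USD_PER_GB = 0.09

def monthly_egress_cost(gb_per_month: float) -> float:
    """Approximate monthly S3 data-transfer-out cost in USD."""
    return gb_per_month * EGRESS_USD_PER_GB

# ~2.5 TB of monthly downloads already lands in the $230/mo range:
print(round(monthly_egress_cost(2560), 2))  # ≈ 230.4
```

Storage itself stays cheap; it's the egress that scales with traffic, which is why a "free" static dataset can quietly become an expensive one.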
S3 is doing quite a lot of sophisticated lifting to qualify as no backend at all.
But yeah - this is pretty neat. It does seem like the future of static datasets should wind up in something like this. Just data, with some well-chosen indices.
I set up something similar at work. But it was before the DuckLake format was available, so it just uses manually generated Parquet files saved to a bucket and a light DuckDB catalog that uses views to expose the parquet files. This lets us update the Parquet files using our ETL process and just refresh the catalog when there is a schema change.
We didn't find the frozen DuckLake setup useful for our use case, mostly because a frozen catalog doesn't really fit the DuckLake philosophy and the cost-benefit wasn't there compared to a regular DuckDB catalog. It also made updates cumbersome, because you need to pull the DuckLake catalog, commit the changes, and re-upload the catalog (instead of just directly updating the Parquet files). I get that we're missing DuckLake's time travel, but that's not critical for us; if it becomes important, we would just roll out a PostgreSQL database to manage the catalog.
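The "light catalog of views over Parquet files" setup described above can be sketched as a script that regenerates the catalog's view definitions. The bucket path and table names here are hypothetical, and in the real setup the emitted statements would be run against a DuckDB catalog file whenever the schema changes:

```python
# Sketch of the "views over Parquet" catalog pattern: the catalog holds
# no data, just views pointing at Parquet files in a bucket. Table
# names and bucket path are made up for illustration.
TABLES = ["orders", "customers"]
BUCKET = "s3://example-bucket/warehouse"

def catalog_sql(tables, bucket):
    """Emit CREATE VIEW statements exposing Parquet files as tables."""
    return [
        f"CREATE OR REPLACE VIEW {t} AS "
        f"SELECT * FROM read_parquet('{bucket}/{t}/*.parquet');"
        for t in tables
    ]

for stmt in catalog_sql(TABLES, BUCKET):
    print(stmt)
```

Because the views only reference paths, the ETL process can rewrite the Parquet files freely; the catalog only needs regenerating on a schema change, which matches the workflow described above.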
My initial thought is: why query 1 TB of data in a browser? Maybe I'm the wrong target audience for this, but it seems to push everything into the browser rather than using the appropriate tools.
Browsers are now the write-once, runs-everywhere target. Where Java failed, many hope browsers will succeed. WASM is definitely a key to that, particularly because it can be output by toolchains like Rust's, so browsers can also be the appropriate tools.
I doubt they are querying 1 TB of data in the browser. DuckDB-WASM issues HTTP range requests on behalf of the client to fetch only the bytes required, which is especially handy with Parquet files (a columnar format), since it can skip columns you don't even need.
But the article is a little light on technical details. In some cases it might make sense to bring the entire file client-side.
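The range-request mechanism can be illustrated without any DuckDB at all. This sketch simulates the server side with an in-memory blob; `range_header` and `serve_range` are hypothetical helpers standing in for the HTTP client and S3, which honors the standard `Range` header:

```python
# Minimal sketch of the byte-range mechanism DuckDB-WASM relies on:
# instead of downloading a whole Parquet file, the client requests
# specific byte windows (footer metadata first, then only the column
# chunks a query touches).
def range_header(offset: int, length: int) -> dict:
    """Build the HTTP Range header for `length` bytes at `offset`."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}

def serve_range(blob: bytes, header: dict) -> bytes:
    """Simulate a server answering a Range request (HTTP 206)."""
    spec = header["Range"].removeprefix("bytes=")
    start, end = (int(x) for x in spec.split("-"))
    return blob[start:end + 1]

file_on_s3 = bytes(range(256))  # stand-in for a huge Parquet file
# Fetch 4 bytes at offset 100 without touching the other 252:
chunk = serve_range(file_on_s3, range_header(100, 4))
print(list(chunk))  # → [100, 101, 102, 103]
```

This is why a columnar format matters here: the column chunks a query needs map to a handful of contiguous byte ranges, so most of the file never crosses the wire.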
But, if you'd like to instead read the article, you'll see that they qualify the reasoning in the first section of the article, titled, "Rethinking the Old Trade-Off: Cost, Complexity, and Access".
We have been doing this for quite some time in our product to bring real-time system observability with eBPF to the browser, and we have even found other techniques to really max it out beyond what you get off the shelf.
I built something on top of DuckDB last year but it never got deployed. They wanted to trust Postgres.
I didn't use the in browser WASM but I did expose an api endpoint that passed data exploration queries directly to the backend like a knock off of what new relic does. I also use that same endpoint for all the graphs and metrics in the UI.
DuckDB is phenomenal tech and I love to use it with data ponds instead of data lakes although it is very capable of large sets as well.
Cool thing about DuckDB is it can be embedded. We have a data pipeline that produces a DuckDB file and puts it on S3. The app periodically checks that asset's ETag and pulls it down when it changes. Most of our DB interactions use PSQL, but we have one module that leverages DuckDB and this file for reads. So it's definitely not all-or-nothing.
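That polling pattern can be sketched as a small watcher. `head_object` and `download` are injected stand-ins (in practice, an S3 HEAD request and a GET, e.g. via boto3), so the logic runs without AWS:

```python
# Sketch of the ETag-polling pattern described above: re-download the
# S3 asset only when its ETag changes. The two callables are
# hypothetical stand-ins for real S3 calls.
class EtagWatcher:
    def __init__(self, head_object, download):
        self.head_object = head_object  # returns the asset's current ETag
        self.download = download        # fetches the asset body
        self.last_etag = None

    def poll(self):
        """Return the asset if it changed since the last poll, else None."""
        etag = self.head_object()
        if etag == self.last_etag:
            return None
        self.last_etag = etag
        return self.download()

etags = iter(["v1", "v1", "v2"])  # simulated sequence of HEAD responses
watcher = EtagWatcher(head_object=lambda: next(etags),
                      download=lambda: "fresh duckdb file")
print([watcher.poll() for _ in range(3)])
# → ['fresh duckdb file', None, 'fresh duckdb file']
```

A HEAD request is cheap compared to re-downloading the whole database file, which is what makes frequent polling viable.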
I can recommend earlyoom (https://github.com/rfjakob/earlyoom). Instead of letting your system freeze or crash, this tool kills the memory-eating process (in this case duckdb) just in time. That lets you retry with smaller chunks of the dataset until it fits into memory.
+1, this was my experience trying it out as well. I find that for getting started and for simple use cases it works amazingly well, but I have quite a few concerns about how it scales to more complex and esoteric workloads.
Non-deterministic OOMs in particular are among the worst things to have in the sort of tools I'd want to use DuckDB in, and as you say, I found them to be more common than I'd like.
A similar procedure is used on joblist.today (https://github.com/joblisttoday) to fetch hiring companies and their jobs, store them in SQLite and DuckDB, and retrieve them on the client side with their WASM modules. The databases are generated with a daily GitHub workflow and hosted as artifacts on a GitHub page.
Where do I learn how to set up this sort of stuff? Trial and error? I kinda never need it for personal projects (so far), which always leads me to forget this stuff in between jobs kinda quickly. Is there a decent book?
If you want to learn it, the best way is probably to come up with a personal project idea that requires it specifically. Idk how much you'd get out of a book, but you could always do a side project with the specific goal of learning a particular stack.
The UI element is a scrollable table with a fixed size viewport window, memory shouldn't be a problem since they just have to retrieve and cache a reasonable area around that window. Old data can just be discarded.
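A minimal sketch of that windowed cache, with `fetch_rows` standing in for a paginated DuckDB-WASM query (the margin size and names are made up for illustration):

```python
# Sketch of the viewport cache described above: keep only rows near the
# visible window and discard the rest when the user scrolls far away.
class ViewportCache:
    def __init__(self, fetch_rows, margin=100):
        self.fetch_rows = fetch_rows  # fetch_rows(start, stop) -> list of rows
        self.margin = margin          # extra rows cached on each side
        self.start = self.stop = 0    # cached row range [start, stop)
        self.rows = []

    def visible(self, first, last):
        """Rows [first, last); refetch (and evict) on a cache miss."""
        if first < self.start or last > self.stop:
            self.start = max(0, first - self.margin)
            self.stop = last + self.margin
            self.rows = self.fetch_rows(self.start, self.stop)  # old rows dropped
        return self.rows[first - self.start:last - self.start]

table = list(range(1_000_000))  # pretend backing table
cache = ViewportCache(lambda a, b: table[a:b], margin=100)
print(cache.visible(5000, 5020)[:3])  # → [5000, 5001, 5002]
print(len(cache.rows))  # cached window is 220 rows, not 1,000,000
```

Small scrolls stay inside the cached margin and hit memory; only a long jump triggers a new query, so memory stays bounded regardless of table size.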
This is brilliant, guys, omg this is brilliant. If you think about it, freely available data always suffers under this burden: "But we don't make money, all this stuff is public data by law, and the government doesn't give us a budget." This solves that "can't afford it" spirit of public agencies.
ludicrousdispla|4 months ago
How large was your WASM build? I'm using the standard duckdb-wasm, along with JS functions to form the SQL queries, and not seeing onerous load times.
mentalgear|4 months ago
https://duckdb.org/docs/stable/clients/wasm/extensions
majormajor|4 months ago
(Does not seem like a realistic scenario to me for many uses, for RAM among other resource reasons.)
r3tr0|4 months ago
https://yeet.cx
victor106|4 months ago
What are data ponds? Never heard the term before
wewewedxfgdf|4 months ago
But I found it to be a real hassle to help it understand the right number of threads and the amount of memory to use.
This led to lots of crashes. If you look at the project's GitHub issues you will see many OOM (out of memory) errors.
And then there was some indexing bug that caused crashes seemingly unrelated to memory.
Life is too short for crashy database software so I reluctantly dropped it. I was disappointed because it was exactly what I was looking for.
thenaturalist|4 months ago
DuckDB introduced spilling to disk and some other memory-management tweaks about a year ago: https://duckdb.org/2024/07/09/memory-management