top | item 40115155

Parquet-WASM: Rust-based WebAssembly bindings to read and write Parquet data

179 points| kylebarron | 1 year ago |github.com

14 comments


m_d_|1 year ago

I'd like to point out that fastparquet has been built for WASM (Pyodide/PyScript) for some time and works fine, producing pandas DataFrames. Unfortunately, the thread/socket/async nature of fsspec means you have to get the files into the "local filesystem" (meaning: the WASM sandbox) yourself. (I am the fastparquet author)

jasonjmcghee|1 year ago

Seeing as the popular alternative here would be DuckDB-WASM, which (last time I checked) is on the order of 50MB, this is comparatively super lightweight.

leeoniya|1 year ago

i think duckdb-wasm is closer to 6MB over wire, but ~36MB once decompressed. (see net panel when loading https://shell.duckdb.org/)

the decompressed size should be okay since it's not the same as parsing and JITing 36MB of JS.

leeoniya|1 year ago

in my [albeit outdated] experience ArrowJS is quite a bit slower than using native JS types. i feel like crossing the WASM<>JS boundary is very expensive, especially for anything other than numbers/typed arrays.

what are people's experiences with this?

kylebarron|1 year ago

Arrow JS is just ArrayBuffers underneath. You do want to amortize some operations to avoid unnecessary conversions. For example, Arrow JS stores strings as UTF-8, while native JS strings are UTF-16, I believe.
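[A sketch of the string-conversion cost mentioned above, using only the standard TextEncoder/TextDecoder APIs; this models what any UTF-8-backed columnar format pays when handing a string to JS, not Arrow JS's actual internals.]

```javascript
// Arrow-style string columns hold UTF-8 bytes; reading a value into a
// JS (UTF-16) string requires a decode. TextEncoder/TextDecoder model
// that round trip.
const encoder = new TextEncoder(); // JS string (UTF-16) -> UTF-8 bytes
const decoder = new TextDecoder(); // UTF-8 bytes -> JS string

const utf8 = encoder.encode("héllo"); // how the bytes sit in the buffer
console.log(utf8.length);             // 6 bytes: "é" takes 2 bytes in UTF-8

const str = decoder.decode(utf8);     // the per-value decode paid on access
console.log(str.length);              // 5 UTF-16 code units
```

This is why amortizing helps: decoding a whole column once is much cheaper than decoding each value on every access.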

Arrow is especially powerful across the WASM <--> JS boundary! In fact, I wrote a library to interpret Arrow from Wasm memory into JS without any copies [0]. (Motivating blog post [1])

[0]: https://github.com/kylebarron/arrow-js-ffi

[1]: https://observablehq.com/@kylebarron/zero-copy-apache-arrow-...
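[A minimal sketch of the zero-copy idea behind the linked library; the names and offsets here are illustrative, not arrow-js-ffi's actual API. A Wasm module's linear memory is a single ArrayBuffer, so JS can view bytes the Wasm side wrote without copying them.]

```javascript
// One 64 KiB page of Wasm linear memory, visible to JS as an ArrayBuffer.
const memory = new WebAssembly.Memory({ initial: 1 });

// Pretend the Wasm side wrote eight float64 values at byte offset 128
// and passed (offset, length) across the FFI boundary.
const writer = new Float64Array(memory.buffer, 128, 8);
writer.set([0, 1, 2, 3, 4, 5, 6, 7]);

// "Zero-copy": this view aliases the same bytes; no data is moved.
const view = new Float64Array(memory.buffer, 128, 8);
console.log(view[3]);                       // 3
console.log(view.buffer === memory.buffer); // true
```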

domoritz|1 year ago

One of the ArrowJS committers here. We have fixed a few significant performance bottlenecks over the last few versions, so it's worth trying again. Also, I'm always curious to see specific use cases that are slow so we can make ArrowJS even better. Some limitations are fundamental, and you may be better off converting to the corresponding JS types (which should be fast).

ingenieroariel|1 year ago

I'll let Kyle chime in but I tested it a few months ago with millions of polygons on an M2 16GB of RAM laptop and it worked very well.

There is a library by the same author called lonboard that provides the JS bits inside JupyterLab. https://github.com/developmentseed/lonboard

<speculation>I think it is based on the Kepler.gl / Deck.gl data loaders that go straight to GPU from network.</speculation>

FridgeSeal|1 year ago

@dang we have a mass spam incursion in this comment thread.

rubenvanwyk|1 year ago

Can this read and write Parquet files to S3-compatible storage?

kylebarron|1 year ago

It can read from HTTP URLs, but you'd need to manage signing the URLs yourself. On the writing side, it currently writes to an ArrayBuffer, which you could then upload to a server or save to the user's machine.
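[A sketch of the upload half, assuming you already have a pre-signed S3 PUT URL from elsewhere; the byte contents here are a placeholder (just the "PAR1" Parquet magic), not real writer output.]

```javascript
// Fake "writer output": real parquet-wasm output would be the full file
// bytes. Parquet files begin with the 4-byte magic "PAR1".
const parquetBytes = new Uint8Array([0x50, 0x41, 0x52, 0x31]);

// Wrap the bytes for upload.
const blob = new Blob([parquetBytes], { type: "application/octet-stream" });
console.log(blob.size); // 4

// Then PUT it to a pre-signed URL (placeholder; needs a valid signed URL):
// await fetch(signedUrl, { method: "PUT", body: blob });
```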
