top | item 31531718


redredrobot | 3 years ago

Datasette is pretty cool.

But AFAICT, it just doesn’t scale whatsoever. That SQLite db is both the dataset index and the dataset content combined, right? So you're limited by how big that SQLite db can realistically be. The docs say "share data of any shape or any size", but AFAICT it can't handle large datasets containing large unstructured data like images and video, and multi-billion data point datasets are hard to store in a single machine/file.

Not really a criticism, but more wondering if there are scale optimizations in Datasette I'm not aware of since the docs do say any shape or size.


simonw | 3 years ago

You're right, Datasette isn't the right tool for sharing billion point datasets (actually low-billions might be OK if each row is small enough).

I think of Datasette as a tool for working with "small data" - where I define small data as data that will fit on a USB stick, or on my phone.

My iPhone has a TB of storage these days, so small data can get you a very long way!

Using it for unstructured image and video would work fine using the pattern where those binary files live somewhere like S3 and the Datasette instance exposes URLs to them. I should find somewhere in the documentation to talk about that.
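The pattern described above can be sketched with plain sqlite3: the database stores only pointers (URLs) to binaries hosted elsewhere, so the file Datasette serves stays small. Table, column names, and the bucket URL here are all hypothetical illustrations, not anything from Datasette's actual docs.

```python
# Sketch: keep large binaries in object storage (e.g. S3) and store only
# their URLs in the SQLite file that Datasette serves.
# All names and URLs below are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE photos (
        id INTEGER PRIMARY KEY,
        title TEXT,
        s3_url TEXT  -- pointer to the binary object, not the bytes themselves
    )
    """
)
conn.execute(
    "INSERT INTO photos (title, s3_url) VALUES (?, ?)",
    ("sunset", "https://example-bucket.s3.amazonaws.com/sunset.jpg"),
)
conn.commit()

# The row Datasette would render is just text; the 5 MB JPEG never
# enters the database file.
row = conn.execute("SELECT title, s3_url FROM photos").fetchone()
print(row)
```

Datasette can then render that `s3_url` column as a link or inline image, while the SQLite file itself only grows by a few dozen bytes per row.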

But yes, I should probably take "of any size" off the homepage, it does give a misleading impression.

samwillis | 3 years ago

Not quite the scale you are suggesting, but I used it with a 7 GB, 20M row dataset and it worked incredibly well.

redredrobot | 3 years ago

Yeah - it’s probably unfair of me to say it doesn’t scale at all. But between large data and two extra orders of magnitude of rows, the single SQLite file approach quickly breaks down, even if you don’t store the large content in-db.

wswope | 3 years ago

> AFAICT it can't handle large datasets containing large unstructured data like images and video and multi-billion data point datasets are hard to store in a single machine/file

Images and videos can easily be yeeted in as binary blobs (same as with any other standard DB), and SQLite DBs scale into the hundreds of TB range as a single file. Are you comparing the single file strategy to something like a sharded cluster of DBs, or is your thought that a DB that stores objects as independent files is somehow superior?
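Storing a binary directly in SQLite, as suggested above, is a one-liner with a BLOB column; here is a minimal sketch using stand-in bytes rather than a real image (the table name and fake PNG header are illustrative only):

```python
# Sketch: images/videos stored directly as BLOBs in SQLite.
# The bytes here are a stand-in, not a real image file.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE media (id INTEGER PRIMARY KEY, name TEXT, data BLOB)"
)

fake_image = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16  # illustrative bytes
conn.execute(
    "INSERT INTO media (name, data) VALUES (?, ?)",
    ("pic.png", fake_image),
)
conn.commit()

# Read the blob back out, byte-for-byte identical.
roundtrip = conn.execute(
    "SELECT data FROM media WHERE name = ?", ("pic.png",)
).fetchone()[0]
print(roundtrip == fake_image)
```

The tradeoff is that every blob inflates the single database file, which is exactly the tension the thread is debating: SQLite handles it, but the file stops being "small data" quickly.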