theboat | 2 years ago
I work with large text datasets, and I typically have to go through hundreds of samples to evaluate a dataset's quality and determine if any cleaning or processing needs to be done.
A tool that lets me sample and explore a dataset living in cloud storage, and then share it with others, would be incredibly valuable, but I haven't seen any tools that support long-form non-tabular text data well.
simonw | 2 years ago
SQLite is great at JSON - so I often dump JSON structures in a TEXT column and query them using https://www.sqlite.org/json1.html
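The JSON1 functions work out of the box from Python's stdlib `sqlite3` module. A minimal sketch of the pattern described above (the table and sample data here are invented for illustration):

```python
import json
import sqlite3

# Store JSON blobs in a plain TEXT column, then query them with JSON1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute(
    "INSERT INTO docs (body) VALUES (?)",
    (json.dumps({"title": "Hello", "tags": ["a", "b"]}),),
)
conn.commit()

# json_extract() pulls a value out of the stored JSON text
title = conn.execute(
    "SELECT json_extract(body, '$.title') FROM docs"
).fetchone()[0]
print(title)  # Hello

# json_each() expands a JSON array into one row per element
tags = [
    row[0]
    for row in conn.execute(
        "SELECT value FROM docs, json_each(docs.body, '$.tags')"
    )
]
print(tags)  # ['a', 'b']
```

JSON1 has been compiled into the SQLite bundled with CPython for years, so no extra dependencies are needed for this.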
I also have plugins for running jq() functions directly in SQL queries - https://datasette.io/plugins/datasette-jq and https://github.com/simonw/sqlite-utils-jq
SQLite's FTS search is surprisingly decent, and I have tools for quickly turning that on both from a CLI: https://sqlite-utils.datasette.io/en/stable/cli.html#configu... and as a Datasette Plugin (available in Datasette Cloud): https://datasette.io/plugins/datasette-configure-fts
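The underlying feature those tools configure is SQLite's built-in FTS5. A bare-bones sketch, with invented table names and rows, of what enabling full-text search looks like at the SQL level:

```python
import sqlite3

# Create an FTS5 virtual table and index a couple of documents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(title, body)")
conn.executemany(
    "INSERT INTO notes (title, body) VALUES (?, ?)",
    [
        ("Cleaning data", "Normalise messy columns before import"),
        ("Full-text search", "FTS5 supports ranked phrase queries"),
    ],
)
conn.commit()

# MATCH runs the full-text query; bm25() orders hits by relevance
hits = conn.execute(
    "SELECT title FROM notes WHERE notes MATCH ? ORDER BY bm25(notes)",
    ("phrase queries",),
).fetchall()
print([t for (t,) in hits])  # ['Full-text search']
```

The linked CLI command and plugin essentially automate creating this virtual table and keeping it in sync with the source table.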
I've been trying to drive down the cost of turning semi-structured data into structured SQL tables as much as possible with https://sqlite-utils.datasette.io - see this tutorial for more: https://datasette.io/tutorials/clean-data
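The core move sqlite-utils makes is inferring a table schema from the data itself. A hand-rolled stdlib sketch of that idea (sample records invented; the real library handles nulls, mixed types, nested data, and much more):

```python
import sqlite3

# Invented sample records standing in for messy semi-structured input
records = [
    {"id": 1, "name": "alpha", "score": 9.5},
    {"id": 2, "name": "beta", "score": 7.0},
]

# Infer column types from the first record, as a simplification
TYPES = {int: "INTEGER", float: "REAL", str: "TEXT"}
columns = {k: TYPES[type(v)] for k, v in records[0].items()}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ({})".format(
        ", ".join(f"{name} {sqltype}" for name, sqltype in columns.items())
    )
)
conn.executemany(
    "INSERT INTO items VALUES ({})".format(", ".join("?" * len(columns))),
    [tuple(r[c] for c in columns) for r in records],
)
conn.commit()

rows = conn.execute("SELECT name, score FROM items ORDER BY id").fetchall()
print(rows)  # [('alpha', 9.5), ('beta', 7.0)]
```

With sqlite-utils itself this whole sketch collapses to roughly one call (or one `sqlite-utils insert` CLI invocation) per JSON file.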
This is also an area that I'm starting to explore with LLMs. I love the idea that you could take a bunch of messy data, tell Datasette Cloud "I want this imported into a table with this schema"... and it does that.
I have a prototype of this working now, I hope to turn it into an open source plugin (and Datasette Cloud feature) pretty soon. It's using this trick: https://til.simonwillison.net/gpt3/openai-python-functions-d...