Show HN: Sketch – AI code-writing assistant that understands data content
252 points | bluecoconut | 3 years ago | github.com | reply
I’m excited to share Sketch: a tool to help anyone who uses Python and pandas quickly iterate and get answers to their data questions.
Sketch installs as a pandas extension that offers utility functions operating on natural-language prompts. Using the `ask` interface you can get answers in natural language. Using the `howto` interface you can get Python and pandas code directly. The primary benefit of this over Copilot and ChatGPT is that Sketch adds data-content-based context, so the generated answers are much more accurate and relevant to the data problem at hand.
Check out the demo video[1] and try it out using the colab notebook (on github)!
[1] https://user-images.githubusercontent.com/916073/212602281-4...
[+] [-] jonwinstanley|3 years ago|reply
Regarding the choice of name, presumably you already know about Sketch, the popular image editing software.
I wonder whether the image-editing folks will incorporate AI functionality in the future too, which might make "Googling" for your product difficult for your potential customers.
[+] [-] Jugglerofworlds|3 years ago|reply
[+] [-] Arelius|3 years ago|reply
I just googled "sketch application" and "sketch image editing software". I found Sketchpad; alternatively, I found sketch.com, which seems to be design software?
I mean, it's a vague enough name that I don't think it's a good name anyways, but I'm not sure it should be obvious that this is a taken name in the way that "Photoshop" would be.
[+] [-] blakeburch|3 years ago|reply
We were just talking last week about how we should create a feature that lets you describe the transformations you want in natural language and have them compiled to pandas/SQL, with the input being everything associated with the original file/dataframe.
Visual transformation tools are typically limited and non-reproducible. If you could switch it around to be code-compiled but description-driven, that would open up new possibilities.
I'd love to chat if you're open to it. Email in bio.
[+] [-] jadbox|3 years ago|reply
Sadly, to be honest, I don't think I'd pay a subscription for such a service. I would prefer to pay a one-time tooling fee and just run a trained model locally in the IDE.
[+] [-] swyx|3 years ago|reply
I'm wondering about the UX of this vs Copilot. Is this basically just a way to get around the fact that you don't have Copilot inside of notebooks? What else am I missing about this experience?
[+] [-] bluecoconut|3 years ago|reply
That is definitely a big part of it: getting Copilot-style answers without having to install any plugins in the IDE (so getting to use this directly in Colab or Jupyter notebooks feels great).
That said, I use both Copilot and Sketch in my VS Code notebooks, and find that they have slightly different feelings to the iteration loop.
Sketch offers a more "local" data context (pinning the text/prompt to the specific dataframe) which increases the quality of the suggestions (since more relevant information is within the token limit).
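For illustration, "pinning the prompt to the specific dataframe" could look something like the toy helper below, which builds a compact text summary (shape, dtypes, sample values) to prepend to an LLM prompt. This is purely illustrative; the function name and summary format are assumptions, not Sketch's actual context builder.

```python
import pandas as pd

def summarize_for_prompt(df: pd.DataFrame, n_samples: int = 3) -> str:
    """Build a compact text summary of a DataFrame to prepend to an LLM prompt.
    Illustrative only -- not Sketch's actual context format."""
    lines = [f"DataFrame with {len(df)} rows and {len(df.columns)} columns."]
    for col in df.columns:
        # A few sample values per column give the model concrete data content.
        samples = df[col].dropna().unique()[:n_samples]
        sample_str = ", ".join(repr(s) for s in samples)
        lines.append(f"- {col} ({df[col].dtype}): e.g. {sample_str}")
    return "\n".join(lines)

df = pd.DataFrame({"city": ["NYC", "LA", "NYC"], "sales": [100, 250, 80]})
context = summarize_for_prompt(df)
print(context)
```

Because the summary stays small, it fits inside the token limit alongside the user's question.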
[+] [-] SomewhatLikely|3 years ago|reply
[+] [-] ibestvina|3 years ago|reply
[1] https://github.com/ibestvina/datasloth
[+] [-] ethanwillis|3 years ago|reply
In this line, https://github.com/approximatelabs/sketch/blob/9d567ec161015...
I think you can accidentally end up marking control characters as "UNKNOWN" by assuming that dictionary.items() always returns items in a consistent order across all contexts/environments. That isn't always true.
edit: actually with the way the code is written if you have any overlapping ranges at all you'll end up double/triple/etc. counting a character into multiple categories.
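The double-counting can be reproduced with a toy version of range-based categorization (the range table here is made up, deliberately overlapping; it is not the actual table from the repo):

```python
# Toy reproduction of the overlapping-ranges bug: a character falling into
# two ranges is counted in both categories, so totals exceed the input length.
CATEGORIES = {
    "ascii_letters": range(0x41, 0x7B),  # A-Z ... a-z (wide; overlaps the next)
    "lowercase":     range(0x61, 0x7B),  # a-z
}

def count_categories(text: str) -> dict:
    counts = {name: 0 for name in CATEGORIES}
    for ch in text:
        # No `break` after a match, so overlapping ranges double-count.
        for name, rng in CATEGORIES.items():
            if ord(ch) in rng:
                counts[name] += 1
    return counts

counts = count_categories("abc")
print(counts, "total:", sum(counts.values()), "vs len:", 3)
# every lowercase letter lands in both ranges, so the total is 6, not 3
```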
[+] [-] jerpint|3 years ago|reply
[+] [-] abrichr|3 years ago|reply
Upon closer inspection, it looks like https://github.com/approximatelabs/sketch interfaces with the model via https://github.com/approximatelabs/lambdaprompt, which is made by the same organization. This suggests to me that the former may be a toy demonstration of the latter.
Interesting how as of the time of writing this, most of the comments here (i.e. dozens) are praising this as a legitimate use case. Maybe I'm missing something obvious, but it seems clear to me that uploading data to a third party to verify whether that data contains PII is a non-starter for any serious application.
[+] [-] teaearlgraycold|3 years ago|reply
"Yes, and you just shared it all with Microsoft :D"
[+] [-] marcosfelt|3 years ago|reply
[+] [-] tdebroc|3 years ago|reply
[+] [-] bluecoconut|3 years ago|reply
so:

```python
print(data_pd.sketch.ask("Is there any PII in this dataset?", call_display=False))
```

should work for you. Right now, by default, it assumes it's in an IPython context that can display HTML objects.
[+] [-] marcosfelt|3 years ago|reply
One cool feature would be some sort of chaining, where you could anchor a new query to a previous one.
For example, on the sales data demo, I started with the howto query "Plot the sales per month in a bar chart using plotly."
However, I got a bug since "Order Date" wasn't a datetime, so I added "Make sure to make 'Order Date' a date column." The new code worked, but gave months as integers 1-12.
When I added "Include month name on x-axis (e.g., Jan, Feb, ...).", the model sort of gave up and spit out some buggy code that didn't make a bar plot.
In this example, it would be great to be able to chain the howto commands, so the previous result is used as context for the new one.
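For reference, the code those chained prompts were converging on is roughly this (the data below is made up to stand in for the demo dataset; the plotly call is left commented since the aggregation is the part the model kept getting wrong):

```python
import pandas as pd

# Toy sales data; "Order Date" arrives as strings, as in the demo.
df = pd.DataFrame({
    "Order Date": ["2019-01-15", "2019-01-20", "2019-02-03", "2019-03-11"],
    "Sales": [100.0, 50.0, 200.0, 75.0],
})

# The step the model kept missing: parse the column before extracting months.
df["Order Date"] = pd.to_datetime(df["Order Date"])

# Group by abbreviated month name; sort=False keeps chronological appearance
# order instead of alphabetical ("Feb" before "Jan").
monthly = df.groupby(df["Order Date"].dt.strftime("%b"), sort=False)["Sales"].sum()
print(monthly)

# With plotly installed, the bar chart would follow, e.g.:
# import plotly.express as px
# px.bar(monthly.reset_index(), x="Order Date", y="Sales").show()
```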
[+] [-] hgarg|3 years ago|reply
But your approach is much better. Pandas is used a lot. Build a tool on top of pandas. This is awesome.
[+] [-] javierluraschi|3 years ago|reply
[+] [-] mmaia|3 years ago|reply
Otherwise, there's room for other solutions, such as AirOps Sidekick [1], which uses browser extensions to embed itself in other data tools.
1- https://www.airops.com/
[+] [-] jamal-kumar|3 years ago|reply
Are there any plans to get this to work outside the Python/pandas ecosystem, or is it intrinsically tied to that environment?
[+] [-] drcongo|3 years ago|reply
[0] https://www.tabnine.com
[+] [-] pfd1986|3 years ago|reply
[+] [-] bluecoconut|3 years ago|reply
Right now this is running off of GPT-3 (`text-davinci-003`), and via a small code change it can run on Codex (`code-davinci-002`), but the quality only improves a little with that change.
That said, this is the first version to show that the interface is viable; we are currently working on training our own foundation model on a hybrid tokenization of data and word tokens. I hope to improve this same toolkit in the future with these new models of our own that we are training.
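To give a flavor of what "tokenizing data alongside words" could mean, here is a purely speculative toy that flattens a DataFrame into an interleaved stream of structural markers, column names, and cell values. Every marker and the serialization scheme itself are invented for illustration; nothing here reflects their actual tokenizer.

```python
import pandas as pd

def toy_serialize(df: pd.DataFrame) -> list:
    """Speculative toy: interleave structural markers, column names, and cell
    values into one token stream. Not the authors' actual scheme."""
    tokens = []
    for _, row in df.iterrows():
        tokens.append("<row>")
        for col, val in row.items():
            tokens += ["<col>", str(col), "<val>", str(val)]
        tokens.append("</row>")
    return tokens

df = pd.DataFrame({"name": ["Ada"], "age": [36]})
tokens = toy_serialize(df)
print(tokens)
```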
[+] [-] swyx|3 years ago|reply
this seems to be using their GPT-3 framework: https://github.com/approximatelabs/lambdaprompt
which uses text-davinci-003 by default https://github.com/approximatelabs/lambdaprompt/blob/main/la...
[+] [-] daveguy|3 years ago|reply
[+] [-] harvey9|3 years ago|reply
[+] [-] gcatalfamo|3 years ago|reply
[+] [-] irthomasthomas|3 years ago|reply
And TIL you can embed mp4s in a GitHub readme. Is that new?
[+] [-] allisdust|3 years ago|reply
[+] [-] harvey9|3 years ago|reply
[+] [-] ldh0011|3 years ago|reply
[+] [-] mmaia|3 years ago|reply
[+] [-] bufferoverflow|3 years ago|reply
[+] [-] localhost|3 years ago|reply
[+] [-] sean_the_geek|3 years ago|reply
[+] [-] pklee|3 years ago|reply