top | item 34400443

Show HN: Sketch – AI code-writing assistant that understands data content

252 points| bluecoconut | 3 years ago |github.com | reply

Hey HN!

I’m excited to share sketch: a tool to help anyone who uses python and pandas quickly iterate and get to answers for their data questions.

Sketch installs as a pandas extension that offers utility functions that operate on natural language prompts. Using the `ask` interface you can get answers in natural language. Using the `howto` interface you can get get python and pandas code directly. The primary benefit of this over copilot and chatGPT is that this adds data-content based context so that the generated answers are much more accurate and relevant to the data problem at hand.

Check out the demo video[1] and try it out using the colab notebook (on github)!

[1] https://user-images.githubusercontent.com/916073/212602281-4...

49 comments

[+] jonwinstanley|3 years ago|reply

Very cool demo!

Regarding the choice of name, presumably you already know about Sketch, the popular image editing software.

I wonder if the image editing guys will in the future incorporate AI functionality too? Which might make "Googling" for your product difficult for your potential customers?

[+] Jugglerofworlds|3 years ago|reply

There's also a program synthesis project called Sketch, which is much closer to the domain of what the user posted: https://people.csail.mit.edu/asolar/

[+] Arelius|3 years ago|reply

That seems to be a presumption...

I just googled "sketch application" or "sketch image editing software" I found sketchpad... alternatively, I found sketch.com, which seems to be design software?

I mean, it's a vague enough name that I don't think it's a good name anyways, but I'm not sure it should be obvious that this is a taken name in the way that "Photoshop" would be.

[+] blakeburch|3 years ago|reply

This is fantastic and exactly where our team at Shipyard is expecting the data space to go. Context aware, AI driven. Great work on this!

We were just talking last week about how we should create a feature to describe transformations you want in Natural Language that get compiled to pandas/SQL. Input data is everything associated with the original file/dataframe.

Visual transformation tools are typically limited and non-reproducible. If you could switch it around to be code-compiled but description-driven, that would open up new possibilities.

I'd love to chat if you're open to it. Email in bio.

[+] jadbox|3 years ago|reply

I'd love something like a standalone SQL IDE where I can ask an AI to generate queries or migration scripts.

Sadly to be honest, I don't think I'd pay a subscription for such a service. I would prefer to pay a one time tooling fee and just run trained model in the IDE locally.

[+] swyx|3 years ago|reply

This is a great demo, OP.

I'm wondering about the UX of this vs Copilot. is this basically just a way to get around the fact that you dont have Copilot inside of notebooks? what else am I missing about this experience?

[+] bluecoconut|3 years ago|reply

Thanks!

That is definitely a big part of it, getting to use copilot style answers without having to install any plugins to the IDE (so getting to use this in colab or jupyter notebooks directly feels great).

That said, I use both copilot and sketch in my VScode notebooks, and find that they have slightly different feelings to the iteration loop.

Sketch offers a more "local" data context (pinning the text/prompt to the specific dataframe) which increases the quality of the suggestions (since more relevant information is within the token limit).

[+] SomewhatLikely|3 years ago|reply

Copilot does work in notebooks, at least inside pycharm and vs code. I ask it for pandas solutions ask the time.

[+] ibestvina|3 years ago|reply

Great work, and a really interesting application of GPT3. Some time ago I developed Datasloth [1] which might be a nice complementary feature to Sketch. Ping me if you're interested to bounce ideas :)

[1] https://github.com/ibestvina/datasloth

[+] ethanwillis|3 years ago|reply

Well, I'm locked out of my github account right now and don't feel like going through all those hoops right now but I wanted to point something minor out.

In this line, https://github.com/approximatelabs/sketch/blob/9d567ec161015...

I think you can end up marking control characters as "UNKNOWN" characters by accident by assuming that in all contexts/environments that dictionary.items() always returns items in a consistent order. This isn't always true.

edit: actually with the way the code is written if you have any overlapping ranges at all you'll end up double/triple/etc. counting a character into multiple categories.

[+] jerpint|3 years ago|reply

Does using this mean sending all of your potentially private data via an api call to openAI?

[+] abrichr|3 years ago|reply

From https://github.com/approximatelabs/sketch/blob/main/sketch/p... it appears that this library is calling a remote API, which obviates the utility of the demonstrated use case.

Upon closer inspection, it looks like https://github.com/approximatelabs/sketch interfaces with the model via https://github.com/approximatelabs/lambdaprompt, which is made by the same organization. This suggests to me that the former may be a toy demonstration of the latter.

Interesting how as of the time of writing this, most of the comments here (i.e. dozens) are praising this as a legitimate use case. Maybe I'm missing something obvious, but it seems clear to me that uploading data to a third party to verify whether that data contains PII is a non-starter for any serious application.

[+] teaearlgraycold|3 years ago|reply

"Does this data contain PII?"

"Yes, and you just shared it all with Microsoft :D"

[+] marcosfelt|3 years ago|reply

Right now, it sends the first five rows of the dataframe: https://github.com/approximatelabs/sketch/blob/9d567ec161015...

[+] tdebroc|3 years ago|reply

Looks really nice, but I tried it:

  import sketch
  import pandas as pd

  data_pd = pd.read_csv("input.csv", sep=';')
  print(data_pd)
  print(data_pd.sketch.ask("Is there any PII in this dataset ?"))
  print(data_pd.sketch.ask("Which columns are integer type?"))

With this input.csv:

  name;age;address;phone
  Bob;34;106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD;1-541-754-3010
  Anna;34;694 Short Street, Austin, Texas;001-541-754-3010

And I have no results (and no runtime error as well) :-( Here is the console output:

     name  age                                         address             phone
  0   Bob   34  106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD    1-541-754-3010
  1  Anna   34                 694 Short Street, Austin, Texas  001-541-754-3010
  <IPython.core.display.HTML object>
  None
  <IPython.core.display.HTML object>
  None

Am I missing something ? The "ask" interface doesn't seems to need external OpenAI credentials right ?

[+] bluecoconut|3 years ago|reply

to get the strings of the results back out, add the kwarg `call_display=False` to the functions.

so: ``` print(data_pd.sketch.ask("Is there any PII in this dataset ?", call_display=False)) ``` should work for you.

Right now it by default assumes its in an ipython context that can display HTML objects.

[+] marcosfelt|3 years ago|reply

Just played around with this and I think I'll be using it on some research projects!

One cool feature would be some sort of chaining, where you could anchor a new query to a previous one.

For example, on the sales data demo, I started with the howto query "Plot the sales per month in a bar chart using plotly."

However, I got a bug since "Order Date" wasn't a datetime, so I added "Make sure to make 'Order Date' a date column." The new code worked, but gave months as integers 1-12.

When I added "Include month name on x-axis (e.g., Jan, Feb, ...).", the model sort of gave up and spit out some buggy code that didn't make a bar plot.

In this example, it would be great to be able to chain the howto commands, so the previous result is used as context for the new one.

[+] hgarg|3 years ago|reply

I spent few weeks last year building a text to sql tool using codex model to do something like this but for all kinds of data sources. We pivoted away to something else for various reasons.

But your approach is much better. Pandas is used a lot. Build a tool on top of pandas. This is awesome.

[+] javierluraschi|3 years ago|reply

https://hal9.com is focused on building data apps with LLMs, would love to explore integrating and contributing to Sketch. If this sounds interesting I’m at javier at hal9.ai

[+] mmaia|3 years ago|reply

Very promising. I believe the uses of OpenAI that will stick in the long term are like this, and other tools should be experimenting with this kind of integration.

Otherwise, there's room for other solutions, as airops sidekick [1] that uses browser extensions to embed itself in other data tools.

1- https://www.airops.com/

[+] jamal-kumar|3 years ago|reply

Damn, this looks pretty useful. I was finding that github copilot was really good at reading a CSV file and writing all the imports from that into migrations for DB import, but this looks like it does these data transformations even more robustly.

Is there any plans on getting this to work outside of the python/pandas ecosystem or is it intrinsically tied to that environment?

[+] drcongo|3 years ago|reply

I use TabNine [0] for local context aware AI suggestions, and I find it spookily good at guessing what I'm half way through typing. Sadly they've left the Sublime plugin to rot and it's mostly a hinderance in ST4.

[0] https://www.tabnine.com

[+] pfd1986|3 years ago|reply

Hi, cool stuff! Which LLM is being used in the background? I may have missed that info in the readme. Thanks!

[+] bluecoconut|3 years ago|reply

Thanks!

Right now this is running off of GPT-3 (`text-davinci-003`) and via a small code change can run on codex (`code-davinci-002`) but the quality only improves a little bit with that change.

That said, this is the first version to show that the interface is viable; we are currently working on training our own foundation model on a hybrid tokenization of data and word tokens. I hope to improve this same toolkit in the future with these new models of our own that we are training.

[+] swyx|3 years ago|reply

digging thru the code https://github.com/approximatelabs/sketch/blob/9d567ec161015...

this seems to be using their gpt3 frameowrk: https://github.com/approximatelabs/lambdaprompt

which uses text-davinci-003 by default https://github.com/approximatelabs/lambdaprompt/blob/main/la...

[+] daveguy|3 years ago|reply

This is very cool. A useful case for gpt. One question / concern: isn't a person's address considered PII? Is the system flexible enough to add pre-statements such as "treat an address as PII"?

[+] harvey9|3 years ago|reply

Related question: is this done on my machine or do I end up sending possible pii to a cloud service for evaluation?

[+] gcatalfamo|3 years ago|reply

Cool project, although the name kinda clashes with the well-known https://www.sketch.com/ in the UI/UX design space

[+] irthomasthomas|3 years ago|reply

This is very cool! I've literally today been noodling with ideas to use probabilistic data structures in LLMs.

And TIL you can embed mp4s in a GitHub readme. Is that new?

[+] allisdust|3 years ago|reply

I don't have any experience with pandas. Can this directly connect to a db and run queries there (video seems to load a csv file).

[+] harvey9|3 years ago|reply

If you can already write SQL to return a data set then you can get that set to pandas with pyodbc.

[+] ldh0011|3 years ago|reply

So... Microsoft bought 48 or 49% of OpenAI right? Integrating this into Excel would make everyone an excel power user.

[+] mmaia|3 years ago|reply

A lot of people already uses excelformulabot. The impact of something integrated into Excel would be pretty big.

[+] bufferoverflow|3 years ago|reply

But if it makes a logical mistake, it would take a real power user to notice it.

[+] localhost|3 years ago|reply

But wouldn't you need to integrate Python into Excel for this to work?

[+] sean_the_geek|3 years ago|reply

Really cool and helpful. Is there anything similar for R?

[+] pklee|3 years ago|reply

GPT3 model generates a SQL. You can sqldf on top of your data.table. We will be demo'ing at one of the events shortly. BTW, you could do somewhat similar with other LLMs such as GPTJ and GPT NEOX if you have worked with them