I don't think tabular data of any sort is a particularly good fit for LLMs at the moment. What are you trying to do with it?
If you want to answer questions like "how many students does Everglade High School have?" and you have a spreadsheet of schools where one of the columns is "number of students" I guess you could feed that into an LLM, but it doesn't feel like a great tool for the job.
I'd instead use systems like ChatGPT Code Interpreter where the LLM gets to load up that data programmatically and answer questions by running code against it. Text-to-SQL systems could work well for that too.
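A rough sketch of the text-to-SQL idea, using Python's built-in sqlite3 module. The table name, the column names, the second school and the student counts are all invented for illustration, and the query is hand-written to stand in for whatever a real system would generate.

```python
import sqlite3

# Hypothetical data: the student counts are made up for illustration.
rows = [
    ("Everglade High School", 1450),
    ("Cypress Middle School", 820),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (name TEXT, number_of_students INTEGER)")
conn.executemany("INSERT INTO schools VALUES (?, ?)", rows)

# Stand-in for the SQL a text-to-SQL system might produce for the question
# "how many students does Everglade High School have?"
generated_sql = "SELECT number_of_students FROM schools WHERE name = ?"
answer = conn.execute(generated_sql, ("Everglade High School",)).fetchone()[0]
print(answer)
```

The point is that the number comes from the database engine and not from the model reading a grid of text, so there is no column or row for it to misread.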
For me personally, a lot of times it's for table augmentation purposes. Appending additional columns to a dataset, such as a cleaned/standardized version of another field, extracting a value from another field, or appending categorization attributes (sometimes pre-seeded and sometimes just giving it general direction).
Or sometimes I'll manually curate a field like that, and then ask it to generate an Excel function that can be used to produce as similar a result as possible for automated categorization in the future.
So in most cases I both want to provide it with tabular data, and also want tabular data back out. In general I've gotten decent results for these sorts of use cases, but when it falls down it's almost always addressable by tinkering with the formatting-related instructions – sometimes by tweaking the input and sometimes by tweaking the instructions for the desired output.
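One way to make the "tabular data back out" step less fragile, sketched here as an assumption and not as anything described in the thread: ask for the new column back as CSV keyed by a row id, then check that every id came back before merging. The function name, the `id` and `category` fields and the sample data are all hypothetical.

```python
import csv
import io

def merge_augmented_column(original_rows, returned_csv, id_field="id", new_field="category"):
    """Attach a model-generated column to the original rows, matching on id_field."""
    reader = csv.DictReader(io.StringIO(returned_csv))
    returned = {row[id_field]: row[new_field] for row in reader}
    # Fail loudly if the model dropped rows instead of silently misaligning them.
    missing = [row[id_field] for row in original_rows if row[id_field] not in returned]
    if missing:
        raise ValueError(f"model output is missing ids: {missing}")
    return [{**row, new_field: returned[row[id_field]]} for row in original_rows]

# Hypothetical input rows and a hypothetical model response.
original_rows = [
    {"id": "1", "vendor": "ACME Corp."},
    {"id": "2", "vendor": "acme corporation"},
]
model_output = "id,category\n1,Acme\n2,Acme\n"
augmented = merge_augmented_column(original_rows, model_output)
print(augmented)
```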
Give it the data as separate columns. For each cell give it the row index and the data.
That way it's just working with lists, but it can easily tell that, e.g., all of this data is in row 3. Tell it to correlate data by the first value in each pair.
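A minimal sketch of that layout, assuming 1-based row indexes and a plain-text rendering. Both function names and the exact output format are my own guesses at what this could look like.

```python
def table_to_indexed_columns(header, rows):
    """Return {column_name: [(row_index, value), ...]} using 1-based row indexes."""
    columns = {name: [] for name in header}
    for row_index, row in enumerate(rows, start=1):
        for name, value in zip(header, row):
            columns[name].append((row_index, value))
    return columns

def render_columns(columns):
    """Render each column as its own list of 'row_index: value' lines."""
    lines = []
    for name, pairs in columns.items():
        lines.append(f"Column: {name}")
        lines.extend(f"{row_index}: {value}" for row_index, value in pairs)
        lines.append("")
    return "\n".join(lines).strip()

# Hypothetical table contents, for illustration only.
header = ["name", "number_of_students"]
rows = [["Everglade High School", 1450], ["Cypress Middle School", 820]]
columns = table_to_indexed_columns(header, rows)
prompt_text = render_columns(columns)
print(prompt_text)
```

Each value carries its own row index, so the model never has to count positions across a line to work out which row a cell belongs to.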
> I say "decent" because most of the available training data for Pandas does things in a naive way.
They're around the level of the median user, which is pretty bad as pandas is a big and complicated API with many different approaches available (as is base R, in case people think I'm just hating on pandas).
I've seen enough examples of an LLM misinterpreting a column or row - resulting in returning the incorrect answer to a question because it was off by one in one of the directions - that I'm nervous about trusting them for this.
JSON objects are different - there the key/value relationship is closer in the set of tokens which usually makes it more reliable.
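To illustrate, here is a sketch that turns CSV rows into one JSON object per line, so every value sits right next to its key. The sample data is invented, and one object per line is just one possible layout.

```python
import csv
import io
import json

# Hypothetical CSV input; the student counts are made up.
csv_text = """name,number_of_students
Everglade High School,1450
Cypress Middle School,820
"""

# csv.DictReader yields one dict per row, keyed by the header names.
records = list(csv.DictReader(io.StringIO(csv_text)))

# One JSON object per line: each value appears directly after its own key.
json_lines = "\n".join(json.dumps(record) for record in records)
print(json_lines)
```

Note that csv.DictReader leaves every value as a string; convert the numeric columns first if the distinction matters for the question being asked.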
The only reason I'm not immediately answering is because I need to check whether it's a trade secret. We do our own thing that I haven't seen anywhere else and works super well. Sorry for being mysterious, I'll try to get an OK to share.
simonw|1 year ago
btown|1 year ago
cosmie|1 year ago
nprateem|1 year ago
__mharrison__|1 year ago
I say "decent" because most of the available training data for Pandas does things in a naive way.
OTOH, they are horrible at Polars. (I figure this is mostly a lack of training data.)
disgruntledphd2|1 year ago
danielmarkbruce|1 year ago
simonw|1 year ago
layer8|1 year ago
irskep|1 year ago
Edit: yeah I can't talk about it, sorry