

dimask | 1 year ago

Thanks for putting in all this work and sharing it in such detail! Data extraction/structuring data is the only serious application of LLMs I have actually engaged in for real work and found useful. I had to extract data from experience sampling reports that I could not share online, so ChatGPT etc. was out of the question. The reports contained sentences describing the onsets and offsets of events, along with descriptions of what went on. I ran models through llama.cpp to turn these into CSV format with 4 columns (onset, offset, description, plus one for whether a specific condition was met in that event, which had to be interpreted from the description). Giving a few examples in the prompt of how I wanted it all structured was enough for many different models to get it right. Mixtral 8x7b was my favourite because it ran the fastest at that quality level on my laptop.
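For anyone who wants to try the same approach, here is a minimal sketch of the few-shot prompt and the CSV check, not the exact setup described above. The example reports, the column names, and the helpers `build_prompt` and `parse_rows` are all invented for illustration. The call to the model is left out on purpose: pass the prompt to whatever local runner you use (llama.cpp or otherwise) and feed its text output to `parse_rows`.

```python
import csv
import io

# Column layout from the comment above; the names themselves are assumptions.
COLUMNS = ["onset", "offset", "description", "condition_met"]

# Hypothetical few-shot examples (the real reports were private).
EXAMPLES = [
    ("From 9:05 to 9:40 I was answering email and felt rushed.",
     "09:05,09:40,answering email,yes"),
    ("I had lunch with a colleague between 12:00 and 12:30.",
     "12:00,12:30,lunch with a colleague,no"),
]


def build_prompt(report):
    """Build a few-shot prompt that ends where the model should continue."""
    lines = [
        "Convert each report into CSV rows with the columns: "
        + ",".join(COLUMNS) + ".",
        "condition_met must be yes or no.",
        "",
    ]
    for text, row in EXAMPLES:
        lines += ["Report: " + text, "CSV: " + row, ""]
    lines += ["Report: " + report, "CSV:"]
    return "\n".join(lines)


def parse_rows(model_output):
    """Keep only well-formed rows: 4 fields and a yes/no last column."""
    rows = []
    for fields in csv.reader(io.StringIO(model_output.strip())):
        if len(fields) != len(COLUMNS):
            continue  # wrong number of fields: skip the line
        row = dict(zip(COLUMNS, (f.strip() for f in fields)))
        if row["condition_met"].lower() not in {"yes", "no"}:
            continue  # last column is not a yes/no judgement: skip
        rows.append(row)
    return rows
```

Validating the output this way makes it cheap to swap models and compare how often each one produces usable rows.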

I am pretty sure that a finetuned smaller model would be better and faster for this task. It would be great to start finetuning and sharing such smaller models: they do not have to be better than the commercial LLMs that run online, as long as they are at least no worse. They are already much faster and cheaper, which is a big advantage for this purpose. There is already a need to run these tasks offline when one cannot share the data with OpenAI and the like. Higher speed and lower cost also allow for more experimentation with more specific finetuning and prompts, with less worry about prompt token lengths and cost. This is an application where smaller, locally run, finetunable models can shine.


hubraumhugo|1 year ago

> Data extraction/structuring data is the only serious application of LLMs

I fully agree. I realized this early on when experimenting with GPT-3 for web data extraction. After posting the first prototype on Reddit and HN, we started seeing a lot of demand for automating rule-based web scraping stacks (lots of maintenance, hard to scale). This eventually led to the creation of our startup (https://kadoa.com) focused on automating this "boring and hard" problem.

It comes down to such relatively unexciting use cases where AI adds the most value.

AI won't eliminate our jobs, but it will automate tedious, repetitive work such as web scraping, form filling, and data entry.

furyofantares|1 year ago

The way you cut that quote turns it into an assertion that doesn't exist in the parent post.

They didn't make the (incorrect) statement that no other serious, useful application exists.

But that's how it reads when you cut off before "I have actually engaged in for real work and found useful"

strickvl|1 year ago

Thanks! Yes, one 'next step' that I'd like to take (probably as part of the work on deployment / inference that I'm turning to now) will be to see just how small I can get the model. spaCy has been pushing this kind of workflow (models on the order of tens of MB) for years and it's nice that it's getting a bit more attention. As you say, ideally I'd want lots of these tiny models that are super specialists at what they do, small in size and speedy at inference time. As I hinted towards the end of the post, however, keeping all of that updated starts to get unwieldy at a certain point if you don't set it all up in the right way.