We built this huge system with tons of regexes, custom parsers, word lists, ontologies, etc. It was a huge effort to get somewhat acceptable accuracy. It is humbling to see that these days a 100-line Python script can do the same thing, but better: AI has basically taken over my first job.
I can see this being true of a lot of old jobs, like my brother's first job, which was basically to transcribe audio tapes. Whisper can do it in no time; that's crazy.
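A minimal sketch of that kind of transcription with the open-source openai-whisper package; the model size and file name are placeholders, not anything anyone in the thread actually used:

```python
# Transcribe an old tape dump with openai-whisper (pip install openai-whisper; needs ffmpeg).
import whisper

model = whisper.load_model("base")            # bigger checkpoints trade speed for accuracy
result = model.transcribe("tape_side_a.mp3")  # hypothetical file name
print(result["text"])
```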
I've had a similar experience extracting transactions from my PDF bank statements [1]. GPT-4o and GPT-4o-mini perform as well as the janky regex parser I wrote a few years ago. The fact that they can zero-shot the problem makes me think there are a lot of bank statements in the training data.
[1] https://dandavis.dev/pnc-virtual-wallet-statement-parser.htm...
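For readers who haven't tried this, a rough sketch of that kind of zero-shot pass, assuming pdfplumber for the text layer and the OpenAI Python SDK; the prompt and field names are illustrative, not the parent's actual parser:

```python
import json
import pdfplumber
from openai import OpenAI

client = OpenAI()

# Pull the text layer out of the statement PDF.
with pdfplumber.open("statement.pdf") as pdf:
    statement_text = "\n".join(page.extract_text() or "" for page in pdf.pages)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # free-form JSON mode, no schema required
    messages=[
        {"role": "system", "content": 'Extract every transaction. Respond with JSON like {"transactions": [{"date": "...", "description": "...", "amount": 0.0}]}'},
        {"role": "user", "content": statement_text},
    ],
)
transactions = json.loads(resp.choices[0].message.content)["transactions"]
```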
Well, your first job today would be writing that 100-line Python script and then doing something 100x more interesting with the events than writing truckloads of regexes?
I've had pretty dismal results doing the same with spreadsheets: even with the data nicely tagged (and the numbers directly adjacent to their labels), GPT-4o would completely make up figures to satisfy the JSON schema passed to it. YMMV.
I wonder if an adversarial model that looks at the user input and the LLM output could predict whether the output is accurate, and maybe point out what is not. This worked pretty well for image generation.
On the flip side, I have had a lot of success parsing spreadsheets and other tables into a markdown or similar representation and pulling data out of that quite accurately.
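A minimal sketch of that approach, assuming pandas (plus tabulate for to_markdown); the file name and columns are made up:

```python
import pandas as pd

# Flatten the sheet into a markdown table; models tend to handle this better
# than raw cell coordinates or CSV with merged headers.
df = pd.read_excel("quarterly_report.xlsx", sheet_name=0)
table_md = df.to_markdown(index=False)

prompt = (
    "From the table below, extract revenue and operating cost per quarter as JSON.\n\n"
    + table_md
)
# `prompt` then goes to whichever model you're using.
```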
Data extraction is definitely one of the most useful functions of LLMs; however, in my experience a large model is necessary for reliable extraction - I tested smaller, open-weights models and the performance was not sufficient.
I wonder, did anyone try to fine-tune a model specifically for general formatted data extraction? My naive thinking is that this should be pretty doable - after all, it's basically just restructuring the content using mostly the same tokens as input.
The reason why this would be useful (in my case) is because while large LLMs are perfectly capable of extraction, I often need to run it on millions of texts, which would be too costly. That's the reason I usually end up creating a custom small model, which is faster and cheaper. But a general small extraction-focused LLM would solve this.
I thought about fine-tuning Llama3-1B or Qwen models on larger models outputs, but my focus is currently elsewhere.
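One way such a model could be bootstrapped, sketched under the assumption that a large model already does the extraction well: have it label a sample of your texts and dump prompt/completion pairs to JSONL for whatever fine-tuning framework you use (extract_with_large_model is a hypothetical helper):

```python
import json

def build_distillation_set(texts, out_path="extraction_train.jsonl"):
    """Label raw texts with the large model and save fine-tuning pairs."""
    with open(out_path, "w") as f:
        for text in texts:
            extracted = extract_with_large_model(text)  # hypothetical: returns a dict
            record = {
                "prompt": f"Extract the entities from the text as JSON:\n{text}",
                "completion": json.dumps(extracted, ensure_ascii=False),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```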
How do you know the output has anything to do with the input? Hint: you don't. You are building a castle on quicksand. As always, the only thing LLMs are usable for: https://hachyderm.io/@inthehands/112006855076082650
> You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.
> Alas, that does not remotely resemble how people are pitching this technology.
We used GPT-4o for more or less the same stuff. Got a boatload of scanned bills we had to digitize, and GPT really nailed the task. Made a schema, and just fed the model all the bills. It worked better than any OCR we tried.
If money is involved and the LLM produces hallucination errors, how do you handle the monetary impact of such errors? How does that approach scale financially?
What a sad state for humanity that we have to resort to this sort of OCR/scraping instead of the original data being released in a machine-readable format in the first place.
1) There's plenty of old data out there: newspaper scans from the days before computers, or from before the newspaper process was digitalized. Or the original files simply got lost, so manually scanned pages are all you have.
2) There could be policies about making the data public, but in a way that discourages data scraping.
3) The providers of the data simply don't have the resources or incentives to develop a working API.
And many more.
What is even sadder is that this data (especially the more recent data) is entered first in machine-readable formats, then sliced and diced and spat out in a non-machine-readable format.
I'd like to see financial transactions and purchases abide by some JSON format standard: metadata and a list of items with full product name, quantity purchased, total unit volume/amount of product, price, and unit price.
A true Scotsman engineer knows the tagged data should go in at the other end. But I guess that doesn't align with OpenAI's target audience and business goals. I guess that would be fine for cleaning new training data... but then you risk extrapolating hallucinations.
Cool work! Correct me if I'm wrong, but I believe that to use the new, more reliable OpenAI Structured Outputs, the response_format should be "json_schema" instead of "json_object". It's been a lot more robust for me.
I may be reading the documentation wrong [0], but I think if you specify `json_schema`, you actually have to provide a schema; I get an error when I pass `response_format={"type": "json_schema"}` without one.
I hadn't used OpenAI for data extraction before the announcement of Structured Outputs, so I'm not sure if `type: json_object` did something different before. But supplying only it as the response format seems to be the (low-effort) way to have the API infer the structure on its own.
[0] https://platform.openai.com/docs/guides/structured-outputs/s...
Huge benefit that you can lock down model performance as you fine-tune your prompt or extend to new use cases. I wrote about it on my blog, where I replaced a project's prompt with Structured Outputs using Pydantic models: https://amberwilliams.io/blogs/474b0361-cbc1-4fa5-b047-c042f...
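For reference, a minimal sketch of the schema-carrying variant discussed above; the schema and document_text are illustrative, but the json_schema/strict shape follows the Structured Outputs docs linked in [0]:

```python
from openai import OpenAI

client = OpenAI()

schema = {
    "type": "object",
    "properties": {
        "filer": {"type": "string"},
        "transactions": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "asset": {"type": "string"},
                    "amount": {"type": "string"},
                },
                "required": ["asset", "amount"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["filer", "transactions"],
    "additionalProperties": False,
}

# document_text stands in for whatever text you pulled out of the filing upstream.
resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Extract the filer and transactions:\n" + document_text}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "disclosure", "strict": True, "schema": schema},
    },
)
```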
Stuff like this shows how much better the commercial models are than local models. I've been playing around with fairly simple structured information extraction from news articles and fail to get any kind of consistent behavior from llama3.1:8b. Claude and ChatGPT do exactly what I want without fail.
OpenAI stopped releasing information about their models after gpt-3, which was 175b, but the leaks and rumours that gpt-4 is an 8x220 billion parameter model are most certainly correct. 4o is likely a distilled 220b model. Other commercial offerings are going to be in the same ballpark. Comparing these to llama 3 8b is like comparing a bicycle or a car to a train or cruise ship when you need to transport a few dozen passengers at best. There are local models in the 70-240b range that are more than capable of competing with commercial offerings if you're willing to look at anything that isn't bleeding edge state of the art.
Your problem isn't that you're using a local model. It's that you're using an 8b model. The stuff you're comparing it to is two orders of magnitude larger.
<< Stuff like this shows how much better the commercial models are than local models.
I did not reach the same conclusion, so I would be curious if you could provide the rationale/basis for your assessment in the link. I am playing with the humble Llama 3 8B here, and the results for Federal Register-type stuff (without going into details) were good for something I was expecting to be... not great.
edit: Since you mentioned Llama explicitly, could you talk a little about the data/source you are using for your results? You got me curious and I want to dig a little deeper.
Llama isn't on there, but a few fine-tunes of it (Hermes) are OSS.
https://lamini-ai.github.io/inference/json_output
Most of these models can read. If the relevant facts are in the prompt, they can almost always extract them correctly.
Of course, bigger models do better on more complex tasks and reasoning unless you use fine-tuning or memory tuning.
The financial disclosures example was meant to be a toy example; with the way U.S. House members file their disclosure reports now, everything should be in a relatively predictable PDF with underlying text [0], but that wasn't always the case [1]. I think this API would've been pretty helpful to orgs like OpenSecrets, who in the past had to record and enter this data manually.
(I wouldn't trust the API alone, but combine it with human readers/validators, i.e., let OpenAI do the data entry part and have humans do the proofreading.)
[0] https://disclosures-clerk.house.gov/public_disc/financial-pd...
[1] https://disclosures-clerk.house.gov/public_disc/financial-pd...
Made a small project to help extract structure from documents (pdf,jpg,etc -> JSON or CSV): https://datasqueeze.ai/
There's 10 free pages to extract if anyone wants to give it a try. I've found that just sending a pdf to models doesn't extract it properly especially with longer documents. Have tried to incorporate all best practices into this tool. It's a pet project for now. Lmk if you find it helpful!
> Note that this example simply passes a PNG screenshot of the PDF to OpenAI's API — results may be different/more efficient if you send it the actual PDF.
Or convert the PDF to an image and send that. We've done it for things that Textract completely mangled but Sonnet has no problem with, especially tables built out of text characters from very old systems.
OpenAI's API only accepts images: https://platform.openai.com/docs/guides/vision
To my knowledge, all the LLM services that take in PDF input do their own text extraction of the PDF before feeding it to an LLM.
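A sketch of that PDF-to-image route, assuming pdf2image (which needs poppler installed) and the chat completions vision format; the page selection and prompt are illustrative:

```python
import base64
import io

from openai import OpenAI
from pdf2image import convert_from_path

client = OpenAI()

page = convert_from_path("filing.pdf", dpi=200)[0]  # first page as a PIL image
buf = io.BytesIO()
page.save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the table on this page as CSV."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```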
The SEC's EDGAR database (which is for SEC filings) is another nightmare that's ready to end. Extracting individual sections from a filing is, afaik, impossible to do programmatically.
I tried making two parsers, https://github.com/MegaManSec/SEC-Feed-Parser and https://github.com/MegaManSec/SEC-sec-incident-notifier, but they're just hacks.
Then just link it up to your automated investment platform and you're ready to go!
Would you not want to read the XBRL from the filing? I thought those are now mandatory.
This is one of those interesting areas where it's hard to innovate, because the data is already available from most/all data vendors and it's cheap and accurate enough that nobody is going to reinvent those processes, but also too expensive for an individual to purchase.
https://jdsemrau.substack.com/p/mem0-building-a-sec-10k-anal...
Fine-tuning smaller models specifically for data extraction could indeed save costs for large-scale tasks; I've found tools like FetchFox helpful for efficiently extracting data from websites using AI.
MIT license. It's just one line of code to get started: `fox.run("get data from example.com")`
Is there an automated way to check results and reduce hallucinations? Would it help to do a second pass with another LLM as a sanity check to see if numbers match?
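One cheap sanity check, sketched here as an assumption rather than anything battle-tested: run the extraction twice (or with two different models), flag fields that disagree, and check that every extracted number actually appears in the source text; anything flagged goes to a human.

```python
def disagreeing_fields(first: dict, second: dict) -> list[str]:
    """Fields where two independent extraction passes don't match."""
    return [k for k in first.keys() | second.keys() if first.get(k) != second.get(k)]

def unsupported_numbers(extracted: dict, source_text: str) -> list[str]:
    """Extracted numeric values that never appear verbatim in the source (a naive check)."""
    return [
        f"{key}={value}"
        for key, value in extracted.items()
        if isinstance(value, (int, float)) and str(value) not in source_text
    ]

# run_a / run_b are the two extraction passes, raw_text the original document.
needs_review = disagreeing_fields(run_a, run_b) or unsupported_numbers(run_a, raw_text)
```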