item 41803457

Extracting financial disclosure and police reports with OpenAI Structured Output

254 points | danso | 1 year ago | gist.github.com

89 comments

[+] synthc|1 year ago|reply
My first job (around 2010) was to extract events from financial news and police reports.

We built this huge system with tons of regexes, custom parsers, word lists, ontologies etc. It was a huge effort to get somewhat acceptable accuracy.

It is humbling to see that these days a 100 line Python script can do the same thing but better: AI has basically taken over my first job.
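To make the "100-line script" point concrete, here's a minimal sketch of that kind of pipeline against the Structured Outputs API. The event schema and function names are illustrative, not the commenter's actual system, and the API call itself requires an `openai` client and key:

```python
import json

# Hypothetical schema for an "event" pulled from a news item or police report.
# With strict mode, OpenAI requires every property to be listed in "required".
EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "event_type": {"type": "string"},   # e.g. "arrest", "acquisition"
        "date": {"type": "string"},         # as it appears in the text
        "entities": {"type": "array", "items": {"type": "string"}},
        "summary": {"type": "string"},
    },
    "required": ["event_type", "date", "entities", "summary"],
    "additionalProperties": False,
}

def extract_event(client, text: str) -> dict:
    """Ask the model for one structured event (needs an OpenAI client)."""
    resp = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "Extract the main event as JSON."},
            {"role": "user", "content": text},
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "event", "strict": True,
                            "schema": EVENT_SCHEMA},
        },
    )
    return json.loads(resp.choices[0].message.content)
```

Compared to the regex-and-ontology era, nearly all of the effort moves into defining the schema.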

[+] dataguy_|1 year ago|reply
I can see this being true of a lot of old jobs, like my brother's first job, which was basically transcribing audio tapes. Whisper can do it in no time. That's crazy.
[+] morkalork|1 year ago|reply
Well, your first job today would be writing that 100-line Python script, then doing something 100x more interesting with the events than writing truckloads of regexes.
[+] rcarmo|1 year ago|reply
I’ve had pretty dismal results doing the same with spreadsheets—even with the data nicely tagged (and numbers directly adjacent to the labels) GPT-4o would completely make up figures to satisfy the JSON schema passed to it. YMMV.
[+] TrainedMonkey|1 year ago|reply
I wonder if an adversarial model that looks at the user input and LLM output could predict whether the output is accurate, and maybe point out what isn't accurate. That approach worked pretty well for image generation.
[+] infecto|1 year ago|reply
On the flip side I have had a lot of success parsing spreadsheets and other tables into their markdown or similar representation and pulling data out of that quite accurately.
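A sketch of that preprocessing step; the helper name is mine and the markdown layout is just one of several that work, but tables rendered this way tend to survive tokenization better than raw CSV/XLSX dumps:

```python
def table_to_markdown(rows):
    """Render a list-of-lists (first row = header) as a markdown table."""
    header, *body = [[str(cell) for cell in row] for row in rows]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)

# Usage: feed the rendered table to the model instead of the raw spreadsheet.
md = table_to_markdown([["item", "price"], ["apple", "1.20"]])
```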
[+] druskacik|1 year ago|reply
Data extraction is definitely one of the most useful applications of LLMs. However, in my experience a large model is necessary for reliable extraction - I tested smaller, open-weights models and the performance was not sufficient.

I wonder, did anyone try to fine-tune a model specifically for general formatted data extraction? My naive thinking is that this should be pretty doable - after all, it's basically just restructuring the content using mostly the same tokens as input.

The reason why this would be useful (in my case) is because while large LLMs are perfectly capable of extraction, I often need to run it on millions of texts, which would be too costly. That's the reason I usually end up creating a custom small model, which is faster and cheaper. But a general small extraction-focused LLM would solve this.

I thought about fine-tuning Llama3-1B or Qwen models on larger models outputs, but my focus is currently elsewhere.
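For what it's worth, the data-preparation side of that distillation idea is straightforward: pair each input text with the larger model's extraction and write it out in OpenAI's chat fine-tuning JSONL shape. A hedged sketch (function names are mine):

```python
import json

def to_finetune_record(text, extraction,
                       system_prompt="Extract the entities as JSON."):
    """Format one (input text, large-model output) pair as a chat-style
    fine-tuning record. The larger model's output serves as the label."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
            {"role": "assistant", "content": json.dumps(extraction)},
        ]
    }

def write_jsonl(pairs, path):
    """pairs: iterable of (text, extraction_dict); writes one record per line."""
    with open(path, "w") as f:
        for text, extraction in pairs:
            f.write(json.dumps(to_finetune_record(text, extraction)) + "\n")
```

The open question is still whether a 1B-3B student generalizes across document formats it never saw during tuning.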

[+] chx|1 year ago|reply
How do you know the output has anything to do with the input? Hint: you don't. You are building a castle on quicksand. As always, the only thing LLMs are usable for:

https://hachyderm.io/@inthehands/112006855076082650

> You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.

> Alas, that does not remotely resemble how people are pitching this technology.

[+] TrackerFF|1 year ago|reply
We used GPT 4o for more or less the same stuff. Got a boatload of scanned bills we had to digitize, and GPT really nailed the task. Made a schema, and just fed the model all the bills.

Worked better than any OCR we tried.

[+] thenaturalist|1 year ago|reply
How are you going to find (not even talking about correcting) hallucinated errors?

If money is involved and the LLM produces hallucination errors, how do you handle monetary impacts of such errors?

How does that approach scale financially?

[+] gloosx|1 year ago|reply
Did you finally balance out lol? If you didn't, would you approach finding a mistake by going through each bill manually?
[+] marcell|1 year ago|reply
I’m making a free open source library for this, check it out at http://github.com/fetchfox/fetchfox

MIT license. It’s just one line of code to get started: `fox.run("get data from example.com")`

[+] thenaturalist|1 year ago|reply
How do you plan to address prompt injection/ poisoned data for a method that simply vacuums unchecked inputs into an LLM?
[+] 4ad|1 year ago|reply
What a sad state for humanity that we have to resort to this sort of OCR/scraping instead of the original data being released in a machine readable format in the first place.
[+] TrackerFF|1 year ago|reply
To be fair, there are some considerations here:

1) There's plenty of old data out there. Newspaper scans from the days before computers, or before digitization of the newspaper process. Or the original files simply got lost, so manually scanned pages are all you have.

2) There could be policies about making the data public, but in a way that discourages data scraping.

3) The providers of the data simply don't have the resources or incentives to develop a working API.

And many more.

[+] blitzar|1 year ago|reply
What is even sadder is that this data (especially the more recent data) is entered first in machine readable formats then sliced and diced and spat out in a non-machine readable format.
[+] jxramos|1 year ago|reply
I'd like to see financial transactions and purchases abide by some json format standard, metadata and a list of items with full product name, quantity purchased, total unit volume/amount of product, price, and unit price.
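No such standard exists yet, but a sketch of what one line-item record might look like, plus the kind of consistency check it would enable (field names here are hypothetical):

```python
RECEIPT = {
    "merchant": "Example Grocer",
    "currency": "USD",
    "items": [
        {"name": "Oat Milk 1L", "quantity": 2, "unit_size": "1 L",
         "unit_price": 3.50, "total": 7.00},
        {"name": "Rye Bread", "quantity": 1, "unit_size": "500 g",
         "unit_price": 4.25, "total": 4.25},
    ],
}

def check_totals(receipt, tol=0.01):
    """Cross-check each line: quantity * unit_price should equal total.
    Machine-readable receipts would make this kind of audit trivial."""
    return all(abs(i["quantity"] * i["unit_price"] - i["total"]) <= tol
               for i in receipt["items"])
```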
[+] DrillShopper|1 year ago|reply
Yeah, wow, humanity is so stupid for not distributing the machine readable format for the local newspaper in 1920. Gosh we're just so dumb
[+] tpswa|1 year ago|reply
Cool work! Correct me if I'm wrong, but I believe to use the new OpenAI structured output that's more reliable, the response_format should be "json_schema" instead of "json_object". It's been a lot more robust for me.
[+] danso|1 year ago|reply
I may be reading the documentation wrong [0], but I think if you specify `json_schema`, you actually have to provide a schema. I get this error when I do `response_format={"type": "json_schema"}`:

     openai.BadRequestError: Error code: 400 - {'error': {'message': "Missing required parameter: 'response_format.json_schema'.", 'type': 'invalid_request_error', 'param': 'response_format.json_schema', 'code': 'missing_required_parameter'}}
I hadn't used OpenAI for data extraction before the announcement of Structured Outputs, so I'm not sure if `type: json_object` did something different before. But supplying only `json_object` as the response format seems to be the (low-effort) way to have the API infer the structure on its own

[0] https://platform.openai.com/docs/guides/structured-outputs/s...
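For reference, a minimal sketch of the payload the API expects once you do specify `json_schema` (the field names follow the Structured Outputs docs; the schema contents are illustrative):

```python
# The error above goes away once response_format.json_schema is supplied:
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "disclosure",
        "strict": True,  # strict mode: every property must be in "required"
        "schema": {
            "type": "object",
            "properties": {
                "asset": {"type": "string"},
                "value_range": {"type": "string"},
            },
            "required": ["asset", "value_range"],
            "additionalProperties": False,
        },
    },
}
```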

[+] ec109685|1 year ago|reply
I’ve been using jsonschema since forever with function calling. Does structured output just formalize things?
[+] philipwhiuk|1 year ago|reply
I'm deeply worried by the impact of hallucinations in this sort of tool.
[+] beoberha|1 year ago|reply
Stuff like this shows how much better the commercial models are than local models. I’ve been playing around with fairly simple structured information extraction from news articles and fail to get any kind of consistent behavior from llama3.1:8b. Claude and chatGPT do exactly what I want without fail.
[+] 0tfoaij|1 year ago|reply
OpenAI stopped releasing information about their models after gpt-3, which was 175b, but the leaks and rumours that gpt-4 is an 8x220 billion parameter model are most certainly correct. 4o is likely a distilled 220b model. Other commercial offerings are going to be in the same ballpark. Comparing these to llama 3 8b is like comparing a bicycle or a car to a train or cruise ship when you need to transport a few dozen passengers at best. There are local models in the 70-240b range that are more than capable of competing with commercial offerings if you're willing to look at anything that isn't bleeding edge state of the art.
[+] int_19h|1 year ago|reply
Your problem isn't that you're using a local model. It's that you're using an 8b model. The stuff you're comparing it to is two orders of magnitude larger.
[+] gdiamos|1 year ago|reply
I usually come to a different conclusion using the JSON output on Lamini, e.g. even with Llama 3.2 3B

https://lamini-ai.github.io/inference/json_output

Most of these models can read. If the relevant facts are in the prompt, they can almost always extract them correctly.

Of course bigger models do better on more complex tasks and reasoning unless you use finetuning or memory tuning.

[+] A4ET8a8uTh0|1 year ago|reply
<< Stuff like this shows how much better the commercial models are than local models.

I did not reach the same conclusion, so I would be curious if you could provide the rationale/basis for your assessment in the link. I am playing with a humble Llama 3 8B here and the results for Federal Register-type stuff (without going into details) were good, when I was expecting them to be.. not great.

edit: Since you mentioned Llama explicitly, could you talk a little about the data/source you are using for your results? You got me curious and I want to dig a little deeper.

[+] kgeist|1 year ago|reply
In my tests, Llama 3.1 8b was way worse than Llama 2 13b or Solar 13b.
[+] tpm|1 year ago|reply
In my experience the Qwen2-VL models are great at this.
[+] 1oooqooq|1 year ago|reply
if you're "parsing" structured or even semi-structured data with an LLM.... sigh.

A true Scotsman of an engineer knows tagged data goes in at the other end. But I guess that doesn't align with OpenAI's target audience and business goals.

I guess it would be fine for cleaning new training data... but then you risk extrapolating hallucinations

[+] danso|1 year ago|reply
The financial disclosures example was meant to be a toy example; with the way U.S. House members file their disclosure reports now, everything should be in a relatively predictable PDF with underlying text [0], but that wasn't always the case [1]. I think this API would've been pretty helpful to orgs like OpenSecrets who in the past had to record and enter this data manually.

(I wouldn't trust the API alone, but combine it with human readers/validators, i.e., let OpenAI do the data entry part, and have humans do the proofreading)

[0] https://disclosures-clerk.house.gov/public_disc/financial-pd...

[1] https://disclosures-clerk.house.gov/public_disc/financial-pd...

[+] Zaheer|1 year ago|reply
Made a small project to help extract structure from documents (pdf,jpg,etc -> JSON or CSV): https://datasqueeze.ai/

There are 10 free pages to extract if anyone wants to give it a try. I've found that just sending a PDF to models doesn't extract it properly, especially with longer documents. I've tried to incorporate all best practices into this tool. It's a pet project for now. Lmk if you find it helpful!

[+] matchagaucho|1 year ago|reply
Similarly I've found old-school OCR is needed for more reliability.
[+] hackernewds|1 year ago|reply
Is this simply the OCR bits to feed to openai structured output?
[+] minimaxir|1 year ago|reply
> Note that this example simply passes a PNG screenshot of the PDF to OpenAI's API — results may be different/more efficient if you send it the actual PDF.

OpenAI's API only accepts images: https://platform.openai.com/docs/guides/vision

To my knowledge, all the LLM services that take in PDF input do their own text extraction of the PDF before feeding it to an LLM.

[+] tyre|1 year ago|reply
or convert PDF to image and send that. We’ve done it for things that textract completely mangled, but sonnet has no problem. Especially tables built out of text characters from very old systems
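A sketch of that conversion step. The third-party `pdf2image` package (a poppler wrapper) is one common choice and is an assumption here, not something either commenter named; the base64 data-URL shape is what vision endpoints accept for inline images:

```python
import base64
import io

def image_to_data_url(png_bytes: bytes) -> str:
    """Base64-encode PNG bytes as a data URL for a vision-model message."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{b64}"

def pdf_pages_to_data_urls(pdf_path: str) -> list[str]:
    """Render each PDF page to a PNG, then to a data URL.
    Assumes pdf2image (and poppler) is installed."""
    from pdf2image import convert_from_path  # assumed dependency
    urls = []
    for page in convert_from_path(pdf_path, dpi=200):
        buf = io.BytesIO()
        page.save(buf, format="PNG")
        urls.append(image_to_data_url(buf.getvalue()))
    return urls
```

Each data URL then goes into a message as `{"type": "image_url", "image_url": {"url": ...}}`.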
[+] ec109685|1 year ago|reply
I don’t think it does OCR. It’s able to use the structure of the PDF to guide the parsing.
[+] mmsc|1 year ago|reply
Adding to the list of "now try it with"....

The SEC's EDGAR database (which is for SEC filings) is another nightmare ready to end. Extracting individual sections from a filing is, AFAIK, practically impossible.

I tried making two parsers: https://github.com/MegaManSec/SEC-Feed-Parser and https://github.com/MegaManSec/SEC-sec-incident-notifier but they're just hacks.

Then just link it up to your automated investment platform and you're ready to go!

[+] infecto|1 year ago|reply
Would you not want to read the XBRL from the filing? I thought those are now mandatory.

This is one of those interesting areas where it's hard to innovate, because the data is already available from most/all data vendors, and it's cheap and accurate enough that nobody is going to reinvent those processes, but also too expensive for an individual to purchase.

[+] kiakiaa|1 year ago|reply
Fine-tuning smaller models specifically for data extraction could indeed save costs for large-scale tasks; I've found tools like FetchFox helpful for efficiently extracting data from websites using AI.
[+] _1tem|1 year ago|reply
Is there an automated way to check results and reduce hallucinations? Would it help to do a second pass with another LLM as a sanity check to see if numbers match?
[+] thibaut_barrere|1 year ago|reply
This is what I am implementing at the moment (together with sampling for errors).
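One cheap automated check along these lines: verify that every number the model emitted actually occurs somewhere in the source text. It catches invented figures, though not misattributed ones. A sketch (helper names are mine):

```python
import re

def numbers_in(text: str) -> set[str]:
    """All digit runs with thousands separators stripped,
    e.g. '1,200.50' -> '1200.50'."""
    return {m.replace(",", "")
            for m in re.findall(r"\d[\d,]*(?:\.\d+)?", text)}

def unsupported_numbers(source: str, extracted: str) -> set[str]:
    """Numbers in the extraction that never appear in the source
    document -- candidates for hallucination review."""
    return numbers_in(extracted) - numbers_in(source)
```

Anything this flags can be routed to a human or to a second-pass LLM, as suggested above.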
[+] Imanari|1 year ago|reply
Does anybody have experience with Azure Document Intelligence? How does it compare to OAIs extraction capabilities?