We built this huge system with tons of regexes, custom parsers, word lists, ontologies, etc. It was a huge effort to get somewhat acceptable accuracy. It is humbling to see that these days a 100-line Python script can do the same thing, but better: AI has basically taken over my first job.
I can see this being true of a lot of old jobs, like my brother's first job, which was basically to transcribe audio tapes. Whisper can do it in no time; that's crazy.
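A minimal sketch of that kind of transcription with the open-source openai-whisper package; the model size and file name are placeholders, not anything anyone in the thread actually used:

```python
# Transcribe an old tape dump with openai-whisper (pip install openai-whisper; needs ffmpeg).
import whisper

model = whisper.load_model("base")            # bigger checkpoints trade speed for accuracy
result = model.transcribe("tape_side_a.mp3")  # hypothetical file name
print(result["text"])
```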
I've had a similar experience extracting transactions from my PDF bank statements [1]. GPT-4o and GPT-4o-mini perform as well as the janky regex parser I wrote a few years ago. The fact that they can zero-shot the problem makes me think there are a lot of bank statements in the training data.
[1] https://dandavis.dev/pnc-virtual-wallet-statement-parser.htm...
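For readers who haven't tried this, a rough sketch of that kind of zero-shot pass, assuming pdfplumber for the text layer and the OpenAI Python SDK; the prompt and field names are illustrative, not the parent's actual parser:

```python
import json
import pdfplumber
from openai import OpenAI

client = OpenAI()

# Pull the text layer out of the statement PDF.
with pdfplumber.open("statement.pdf") as pdf:
    statement_text = "\n".join(page.extract_text() or "" for page in pdf.pages)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # free-form JSON mode, no schema required
    messages=[
        {"role": "system", "content": 'Extract every transaction. Respond with JSON like {"transactions": [{"date": "...", "description": "...", "amount": 0.0}]}'},
        {"role": "user", "content": statement_text},
    ],
)
transactions = json.loads(resp.choices[0].message.content)["transactions"]
```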
Well, your first job today would be writing that 100-line Python script and then doing something 100x more interesting with the events than writing truckloads of regexes?
I've had pretty dismal results doing the same with spreadsheets: even with the data nicely tagged (and the numbers directly adjacent to their labels), GPT-4o would completely make up figures to satisfy the JSON schema passed to it. YMMV.
I wonder if an adversarial model that looks at the user input and the LLM output could predict whether the output is accurate, and maybe point out what is not. This worked pretty well for image generation.
On the flip side, I have had a lot of success parsing spreadsheets and other tables into a markdown or similar representation and pulling data out of that quite accurately.
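A minimal sketch of that approach, assuming pandas (plus tabulate for to_markdown); the file name and columns are made up:

```python
import pandas as pd

# Flatten the sheet into a markdown table; models tend to handle this better
# than raw cell coordinates or CSV with merged headers.
df = pd.read_excel("quarterly_report.xlsx", sheet_name=0)
table_md = df.to_markdown(index=False)

prompt = (
    "From the table below, extract revenue and operating cost per quarter as JSON.\n\n"
    + table_md
)
# `prompt` then goes to whichever model you're using.
```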
Data extraction is definitely one of the most useful functions of LLMs; however, in my experience a large model is necessary for reliable extraction - I tested smaller, open-weights models and the performance was not sufficient.
I wonder, did anyone try to fine-tune a model specifically for general formatted data extraction? My naive thinking is that this should be pretty doable - after all, it's basically just restructuring the content using mostly the same tokens as input.
The reason why this would be useful (in my case) is because while large LLMs are perfectly capable of extraction, I often need to run it on millions of texts, which would be too costly. That's the reason I usually end up creating a custom small model, which is faster and cheaper. But a general small extraction-focused LLM would solve this.
I thought about fine-tuning Llama3-1B or Qwen models on larger models outputs, but my focus is currently elsewhere.
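One way such a model could be bootstrapped, sketched under the assumption that a large model already does the extraction well: have it label a sample of your texts and dump prompt/completion pairs to JSONL for whatever fine-tuning framework you use (extract_with_large_model is a hypothetical helper):

```python
import json

def build_distillation_set(texts, out_path="extraction_train.jsonl"):
    """Label raw texts with the large model and save fine-tuning pairs."""
    with open(out_path, "w") as f:
        for text in texts:
            extracted = extract_with_large_model(text)  # hypothetical: returns a dict
            record = {
                "prompt": f"Extract the entities from the text as JSON:\n{text}",
                "completion": json.dumps(extracted, ensure_ascii=False),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```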
How do you know the output has anything to do with the input? Hint: you don't. You are building a castle on quicksand. As always, the only thing LLMs are usable for: https://hachyderm.io/@inthehands/112006855076082650
> You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.
> Alas, that does not remotely resemble how people are pitching this technology.
We used GPT-4o for more or less the same stuff. Got a boatload of scanned bills we had to digitize, and GPT really nailed the task. Made a schema, and just fed the model all the bills. It worked better than any OCR we tried.
If money is involved and the LLM produces hallucination errors, how do you handle the monetary impact of such errors? How does that approach scale financially?
What a sad state for humanity that we have to resort to this sort of OCR/scraping instead of the original data being released in a machine-readable format in the first place.
1) There's plenty of old data out there: newspaper scans from the days before computers, or from before the newspaper process was digitalized. Or the original files simply got lost, so manually scanned pages are all you have.
2) There could be policies about making the data public, but in a way that discourages data scraping.
3) The providers of the data simply don't have the resources or incentives to develop a working API.
And many more.
What is even sadder is that this data (especially the more recent data) is entered first in machine-readable formats, then sliced and diced and spat out in a non-machine-readable format.
I'd like to see financial transactions and purchases abide by some JSON format standard: metadata and a list of items with full product name, quantity purchased, total unit volume/amount of product, price, and unit price.
A true Scotsman engineer knows the tagged data should go in at the other end. But I guess that doesn't align with OpenAI's target audience and business goals. I guess that would be fine for cleaning new training data... but then you risk extrapolating hallucinations.
Cool work! Correct me if I'm wrong, but I believe that to use the new, more reliable OpenAI Structured Outputs, the response_format should be "json_schema" instead of "json_object". It's been a lot more robust for me.
I may be reading the documentation wrong [0], but I think if you specify `json_schema`, you actually have to provide a schema; I get an error when I pass `response_format={"type": "json_schema"}` without one.
I hadn't used OpenAI for data extraction before the announcement of Structured Outputs, so I'm not sure if `type: json_object` did something different before. But supplying only it as the response format seems to be the (low-effort) way to have the API infer the structure on its own.
[0] https://platform.openai.com/docs/guides/structured-outputs/s...
Huge benefit that you can lock down model performance as you fine-tune your prompt or extend to new use cases. I wrote about it on my blog, where I replaced a project's prompt with Structured Outputs using Pydantic models: https://amberwilliams.io/blogs/474b0361-cbc1-4fa5-b047-c042f...
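For reference, a minimal sketch of the schema-carrying variant discussed above; the schema and document_text are illustrative, but the json_schema/strict shape follows the Structured Outputs docs linked in [0]:

```python
from openai import OpenAI

client = OpenAI()

schema = {
    "type": "object",
    "properties": {
        "filer": {"type": "string"},
        "transactions": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "asset": {"type": "string"},
                    "amount": {"type": "string"},
                },
                "required": ["asset", "amount"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["filer", "transactions"],
    "additionalProperties": False,
}

# document_text stands in for whatever text you pulled out of the filing upstream.
resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Extract the filer and transactions:\n" + document_text}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "disclosure", "strict": True, "schema": schema},
    },
)
```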
Stuff like this shows how much better the commercial models are than local models. I've been playing around with fairly simple structured information extraction from news articles and fail to get any kind of consistent behavior from llama3.1:8b. Claude and ChatGPT do exactly what I want without fail.
OpenAI stopped releasing information about their models after gpt-3, which was 175b, but the leaks and rumours that gpt-4 is an 8x220 billion parameter model are most certainly correct. 4o is likely a distilled 220b model. Other commercial offerings are going to be in the same ballpark. Comparing these to llama 3 8b is like comparing a bicycle or a car to a train or cruise ship when you need to transport a few dozen passengers at best. There are local models in the 70-240b range that are more than capable of competing with commercial offerings if you're willing to look at anything that isn't bleeding edge state of the art.
Your problem isn't that you're using a local model. It's that you're using an 8b model. The stuff you're comparing it to is two orders of magnitude larger.
<< Stuff like this shows how much better the commercial models are than local models.
I did not reach the same conclusion, so I would be curious if you could provide the rationale/basis for your assessment in the link. I am playing with the humble Llama 3 8B here, and the results for Federal Register-type stuff (without going into details) were good for something I was expecting to be... not great.
edit: Since you mentioned Llama explicitly, could you talk a little about the data/source you are using for your results? You got me curious and I want to dig a little deeper.
Llama isn't on there, but a few fine-tunes of it (Hermes) are OSS.
https://lamini-ai.github.io/inference/json_output
Most of these models can read. If the relevant facts are in the prompt, they can almost always extract them correctly.
Of course, bigger models do better on more complex tasks and reasoning unless you use fine-tuning or memory tuning.
The financial disclosures example was meant to be a toy example; with the way U.S. House members file their disclosure reports now, everything should be in a relatively predictable PDF with underlying text [0], but that wasn't always the case [1]. I think this API would've been pretty helpful to orgs like OpenSecrets, who in the past had to record and enter this data manually.
(I wouldn't trust the API alone, but combine it with human readers/validators, i.e., let OpenAI do the data entry part and have humans do the proofreading.)
[0] https://disclosures-clerk.house.gov/public_disc/financial-pd...
[1] https://disclosures-clerk.house.gov/public_disc/financial-pd...
Made a small project to help extract structure from documents (pdf,jpg,etc -> JSON or CSV): https://datasqueeze.ai/
There's 10 free pages to extract if anyone wants to give it a try. I've found that just sending a pdf to models doesn't extract it properly especially with longer documents. Have tried to incorporate all best practices into this tool. It's a pet project for now. Lmk if you find it helpful!
> Note that this example simply passes a PNG screenshot of the PDF to OpenAI's API — results may be different/more efficient if you send it the actual PDF.
Or convert the PDF to an image and send that. We've done it for things that Textract completely mangled but Sonnet has no problem with, especially tables built out of text characters from very old systems.
OpenAI's API only accepts images: https://platform.openai.com/docs/guides/vision
To my knowledge, all the LLM services that take in PDF input do their own text extraction of the PDF before feeding it to an LLM.
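A sketch of that PDF-to-image route, assuming pdf2image (which needs poppler installed) and the chat completions vision format; the page selection and prompt are illustrative:

```python
import base64
import io

from openai import OpenAI
from pdf2image import convert_from_path

client = OpenAI()

page = convert_from_path("filing.pdf", dpi=200)[0]  # first page as a PIL image
buf = io.BytesIO()
page.save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the table on this page as CSV."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```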
The SEC's EDGAR database (which is for SEC filings) is another nightmare that's ready to end. Extracting individual sections from a filing is, afaik, impossible to do programmatically.
I tried making two parsers, https://github.com/MegaManSec/SEC-Feed-Parser and https://github.com/MegaManSec/SEC-sec-incident-notifier, but they're just hacks.
Then just link it up to your automated investment platform and you're ready to go!
Would you not want to read the XBRL from the filing? I thought those are now mandatory.
This is one of those interesting areas where it's hard to innovate, because the data is already available from most/all data vendors and it's cheap and accurate enough that nobody is going to reinvent those processes, but also too expensive for an individual to purchase.
https://jdsemrau.substack.com/p/mem0-building-a-sec-10k-anal...
Fine-tuning smaller models specifically for data extraction could indeed save costs for large-scale tasks; I've found tools like FetchFox helpful for efficiently extracting data from websites using AI.
MIT license. It's just one line of code to get started: `fox.run("get data from example.com")`
Is there an automated way to check results and reduce hallucinations? Would it help to do a second pass with another LLM as a sanity check to see if numbers match?
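One cheap sanity check, sketched here as an assumption rather than anything battle-tested: run the extraction twice (or with two different models), flag fields that disagree, and check that every extracted number actually appears in the source text; anything flagged goes to a human.

```python
def disagreeing_fields(first: dict, second: dict) -> list[str]:
    """Fields where two independent extraction passes don't match."""
    return [k for k in first.keys() | second.keys() if first.get(k) != second.get(k)]

def unsupported_numbers(extracted: dict, source_text: str) -> list[str]:
    """Extracted numeric values that never appear verbatim in the source (a naive check)."""
    return [
        f"{key}={value}"
        for key, value in extracted.items()
        if isinstance(value, (int, float)) and str(value) not in source_text
    ]

# run_a / run_b are the two extraction passes, raw_text the original document.
needs_review = disagreeing_fields(run_a, run_b) or unsupported_numbers(run_a, raw_text)
```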