top | item 36976333

Show HN: Using Llama2 to Correct OCR Errors

54 points | eigenvalue | 2 years ago | github.com

I've been disappointed by the very poor results I generally get when running OCR on older scanned documents, especially ones that are typewritten or otherwise have unusual or irregular typography. I recently had the idea of using Llama2 to apply common-sense reasoning and subject-level expertise to correct transcription errors in a "smart" way-- basically doing what a human proofreader who is familiar with the topic might do.

I came up with the linked script that takes a PDF as input, runs Tesseract on it to get an initial text extraction, and then feeds this sentence-by-sentence to Llama2, first to correct mistakes, and then again on the corrected text to format it as markdown where possible. This was easier than I initially expected thanks to the very nice tooling now available in libraries such as llama-cpp-python, langchain, and pytesseract. But the big issue I encountered was that Llama2 wasn't just correcting the text it was given-- it was also hallucinating a LOT of totally new sentences that didn't appear in the original text at all (some of these new sentences used words that never appeared anywhere in the original text).
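
To give a feel for the shape of the loop, here's a minimal sketch (not the actual script: the sentence splitter is naive, and `correction_prompt`, `run_llm`, and the prompt wording are my own placeholders-- in the real script the call goes through llama-cpp-python):

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter on sentence-ending punctuation; the real script
    # can use something more robust.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def correction_prompt(sentence: str) -> str:
    # Hypothetical prompt wording -- the actual script's prompts may differ.
    return (
        "The following sentence came from OCR of an old typewritten document "
        "and may contain transcription errors. Rewrite it with the errors "
        f"corrected, changing nothing else:\n\n{sentence}"
    )

def correct_text(raw_ocr_text: str, run_llm) -> str:
    # run_llm is a stand-in for a llama-cpp-python completion call.
    corrected = [run_llm(correction_prompt(s)) for s in split_sentences(raw_ocr_text)]
    return " ".join(corrected)

# Toy "LLM" that fixes one classic OCR confusion (0 -> o), just to
# exercise the loop end to end.
fake_llm = lambda prompt: prompt.rsplit("\n\n", 1)[1].replace("0", "o")
print(correct_text("The c0rn harvest failed. It was bad.", fake_llm))
# -> The corn harvest failed. It was bad.
```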

I figured this would be pretty simple to filter out using fuzzy string matching-- basically, check every sentence in the LLM-corrected text and drop the ones that are very different from every sentence in the original OCRed text. To my surprise, this approach worked very poorly, and so did lots of other similar tweaks, including bag-of-words comparisons and the spacy NLP library used in various ways (spacy worked very poorly in everything I tried).
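
For reference, the fuzzy-matching filter I mean looked roughly like this (a sketch using stdlib difflib rather than a dedicated fuzzy-matching library; the 0.6 threshold is arbitrary)-- this is the approach that turned out to work poorly in practice:

```python
from difflib import SequenceMatcher

def is_probably_hallucinated(sentence, original_sentences, threshold=0.6):
    # Keep a corrected sentence only if it closely resembles *some*
    # sentence in the raw OCR text; otherwise flag it as hallucinated.
    best = max(
        (SequenceMatcher(None, sentence.lower(), o.lower()).ratio()
         for o in original_sentences),
        default=0.0,
    )
    return best < threshold

ocr = ["Thc quick brown fox jumps ovcr the lazy dog."]
print(is_probably_hallucinated("The quick brown fox jumps over the lazy dog.", ocr))  # False
print(is_probably_hallucinated("Meanwhile, the senate convened in Rome.", ocr))       # True
```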

Finally I realized that I had a good solution staring me in the face: Llama2 itself. I could get sentence-level vector embeddings straight from Llama2 using langchain. So I did that, computing embeddings for each sentence in both the raw OCRed text and the LLM-corrected text, and then computed the cosine similarity of each sentence in the LLM-corrected text against every sentence in the raw OCRed text. If no sentence in the raw OCRed text matches, there's a good chance that sentence was hallucinated.
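
The check itself reduces to a max-over-cosine-similarities. A minimal sketch with toy 3-d vectors (in the real script the vectors come from Llama2 via langchain's embedding interface, so they're much higher-dimensional):

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def max_similarity(corrected_vec, ocr_vecs):
    # A corrected sentence is suspect if it isn't close to *any*
    # sentence in the raw OCR text.
    return max(cosine_similarity(corrected_vec, v) for v in ocr_vecs)

# Toy vectors standing in for per-sentence Llama2 embeddings.
ocr_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
faithful = [0.9, 0.2, 0.0]      # close to the first OCR sentence
hallucinated = [0.0, 0.1, 1.0]  # close to nothing in the OCR text

print(max_similarity(faithful, ocr_vecs) > 0.9)      # True
print(max_similarity(hallucinated, ocr_vecs) > 0.9)  # False
```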

To save the user from having to experiment with various thresholds, I cache the computed embeddings in an SQLite database so they only have to be computed once, and then try several thresholds automatically, comparing the length of the filtered LLM-corrected text to that of the raw OCRed text; if things worked right, these texts should be roughly the same length. So as soon as the filtered length dips below the raw OCRed text length, the script backtracks and uses the previous threshold as the final one.
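
Once the similarities are cached, the threshold search is simple: sweep strictness upward and back off the moment the filtered text gets shorter than the raw OCR text. A sketch of that logic on precomputed scores (the SQLite caching is omitted, and all names here are mine, not the script's):

```python
def filtered_length(scores_and_lengths, threshold):
    # Each item: (max cosine similarity vs. the OCR text, sentence length).
    return sum(length for score, length in scores_and_lengths if score >= threshold)

def pick_threshold(scores_and_lengths, raw_length, thresholds):
    # Sweep thresholds from lenient to strict; return the last threshold
    # at which the filtered text was still at least as long as the raw text.
    previous = thresholds[0]
    for t in thresholds:
        if filtered_length(scores_and_lengths, t) < raw_length:
            return previous
        previous = t
    return previous

# Three corrected sentences; the low-similarity one looks hallucinated.
sentences = [(0.98, 40), (0.95, 35), (0.40, 50)]
raw_length = 75  # the raw OCR text covers only the two faithful sentences
print(pick_threshold(sentences, raw_length, [0.3, 0.5, 0.9, 0.97]))  # -> 0.9
```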

Anyway, if you have some very old scanned documents lying around, you might try it out and see how well it works for you. Do note that it's extremely slow, but you can leave it running overnight and maybe the next day you'll have your finished text, which is better than nothing! I feel like this could be useful for sites like the Internet Archive-- I've found their OCR results to be extremely poor for older documents.

I'm very open to any ideas or suggestions you might have. I threw this together in a couple of days and know it can certainly be improved in various ways. One idea that might be fun would be to make this work with a Ray cluster, sending a different page of the document to each worker in the cluster so they're all processed at the same time.
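
The per-page parallelism doesn't depend on Ray specifically-- here's the shape of it with stdlib concurrent.futures as a stand-in (with Ray, `correct_page` would become an `@ray.remote` task; the function body here is a trivial placeholder, not the real pipeline):

```python
from concurrent.futures import ThreadPoolExecutor

def correct_page(page_text: str) -> str:
    # Placeholder for the full OCR-correction pipeline run on one page.
    return page_text.upper()

def correct_document(pages: list[str], workers: int = 4) -> list[str]:
    # map() preserves page order even if pages finish out of order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(correct_page, pages))

print(correct_document(["page one", "page two"]))  # -> ['PAGE ONE', 'PAGE TWO']
```

For the real workload a process pool (or Ray workers on separate machines) would make more sense than threads, since the LLM calls are CPU/GPU-bound.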

10 comments

AaronNewcomer | 2 years ago
I’ve been doing something similar recently that impressed me. I have been taking handwritten manuscripts from the early 1800s, feeding them into AWS Textract, and then feeding the raw OCR results into Claude2 or GPT4 to have it make sense of the horrible OCR from the handwriting.

I was even more impressed feeding it handwritten French documents, like patents from the same time period. AWS Textract only works with English, so its ML OCR was trying to make English words out of the French handwriting-- but the output was still workable once I told the LLM it was being fed OCR’d French text, even though it all kind of looked like gibberish to me.

eigenvalue | 2 years ago
Cool. I'm sure that would give better results, but then you have to pay per request, and I'm sure it could get pretty expensive for long documents. It's nice not having to think about the price of anything, and also having full control over how it all works.
version_five | 2 years ago
Very cool. I think this is an interesting benchmarking task for a language model (as well as having practical uses). I tried the same thing some time ago, just on a random snippet of Tesseract OCR. I had a Vicuna model (I forget which) that failed miserably, while ChatGPT did it flawlessly. I did not have any hallucination problem with ChatGPT.

It sounds from your writeup like Llama2 (which one?) doesn't work well enough without some guardrails, but that it's possible to make it work? How would you rate the performance overall?

eigenvalue | 2 years ago
I’d say that it does work pretty well. It could simply be that I’m sampling too aggressively from the LLM, which causes it to hallucinate more than it should-- hence all the time I had to spend filtering out the hallucinations. But I think the risk of “made up” content in what’s supposed to be an accurate representation of a scanned document is big enough that you might always want a filtering step like that for quality control, just to be on the safe side.

I used the Llama2 13B Chat model ggml weights from TheBloke on Huggingface.

Reubend | 2 years ago
Is it possible that with different hyperparameters, Llama2 would be more faithful to the original? I'm pretty surprised that it would hallucinate anything majorly different from the source material, since that task seems pretty "easy" for LLMs these days.
eigenvalue | 2 years ago
Yes, I'm sure there are ways to reduce the hallucination, though I don't know that these would all be accessible via the Python bindings. I think I might also be sampling too much from the model which is leading to it making up more stuff than normal. It's certainly an area I want to focus on more.
eigenvalue | 2 years ago
I just realized that a very similar approach could also be useful for correcting and reformatting automatic speech recognition transcripts. I made another tool recently that makes YouTube's automatic transcripts more pleasant to read:

https://news.ycombinator.com/item?id=36777836

But even that is much worse than a professionally edited transcript like you might see in a magazine. I just tried uploading a raw transcript to ChatGPT Code Interpreter, and it does an amazing job of reformatting it into nice-looking markdown, with speaker identification (done automatically from context) and quotation marks. For example:

---

*Lee Kuan Yew*: But when they said, "We are honest, sincere, coming socialists, just like you," I then revealed, onion by onion, I peeled off, broadcast, and I described how I met the plan chief who said he was going to work with me and so on. And it was convincing.

When I met the plan, he said, "We'll work together. You win, you release our people from jail." I said, "Yes, I have to do that." So I said, "How do I know you are the boss?" He said, "You have to take my word for it." I said, "Well, I believe this city councilor is one of your men, and he has penetrated David Marshall, who was a Safadi Jew, very bright fellow, a lawyer, non-Communist. And he formed the Workers' Party."

I said, "You get him to resign, then I will believe you are the boss." Three weeks later, when I was in London conferencing with the British, I opened the Straits Times, and the child had resigned. God, here was a man whom I had, when I became prime minister, I went to the Special Branch and looked up the files, and there was his picture, wanted on site, and deep in the underground, chased by the police. He could send a message to this fellow he did not know, and the man resigned.

So when I disclosed all this, credibility was established. The fight was on. You, in three different languages, when I finish each broadcast, the director of the station couldn't see me, I went into the room and found me lying on the floor, trying to recover my breath. So each speech was interesting.

*David Gergen*: Oh, yes. English, Malay, Mandarin, and you try that.

---

That suggests to me that I should be able to do something similar using Llama2, although probably not quite as well. This is very exciting to me, though, since otherwise you would either have to sit there manually telling ChatGPT to keep doing each section and reassemble the results yourself, or spend a fortune on OpenAI API credits to automate it. But with a local LLM you can just leave it running on a spare machine for free.