Show HN: Using LLama2 to Correct OCR Errors
54 points | eigenvalue | 2 years ago | github.com
I came up with the linked script that takes a PDF as input, runs Tesseract on it to get an initial text extraction, and then feeds the result sentence by sentence to Llama2: first to correct mistakes, and then again on the corrected text to format it as markdown where possible. This was easier than I initially expected thanks to the very nice tooling now available in libraries such as llama-cpp-python, langchain, and pytesseract. But the big issue I kept running into was that Llama2 wasn't just correcting the text it was given-- it was also hallucinating a LOT of totally new sentences that didn't appear in the original text at all (some of these new sentences used words that never appeared anywhere in the original text).
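The per-sentence loop looks roughly like this -- a simplified sketch with a naive sentence splitter and a stand-in `llm` callable; in the real script the completion comes from llama-cpp-python, the OCR text from pytesseract, and the prompt wording differs:

```python
import re

def split_sentences(text):
    # Naive sentence splitter for illustration; the real script leans
    # on proper NLP tooling, but this is enough to show the loop.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Hypothetical prompt, not the script's actual wording.
CORRECTION_PROMPT = (
    "Correct any OCR errors in the following sentence, changing nothing else:\n"
    "{sentence}"
)

def correct_text(ocr_text, llm):
    # llm is any callable mapping a prompt string to a completion string,
    # e.g. a llama-cpp-python Llama instance wrapped in a small function.
    corrected = [
        llm(CORRECTION_PROMPT.format(sentence=s)).strip()
        for s in split_sentences(ocr_text)
    ]
    return " ".join(corrected)
```

The markdown-formatting pass is the same loop run a second time over the corrected output with a different prompt.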
I figured this would be pretty simple to filter out using fuzzy string matching-- basically check all the sentences in the LLM-corrected text and drop any sentence that is very different from every sentence in the original OCRed text. To my surprise, this approach worked very poorly. So did lots of other similar tweaks, including bag-of-words comparisons and various uses of the spacy NLP library (spacy worked very poorly in everything I tried).
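For reference, the kind of fuzzy filter that underperformed looks roughly like this (a minimal difflib sketch of the idea, not the exact code I tried -- function name and threshold are illustrative):

```python
import difflib

def fuzzy_filter(corrected_sentences, original_sentences, threshold=0.6):
    # Keep a corrected sentence only if it closely resembles *some*
    # sentence in the raw OCR output (difflib ratio >= threshold).
    kept = []
    for cand in corrected_sentences:
        best = max(
            (difflib.SequenceMatcher(None, cand.lower(), orig.lower()).ratio()
             for orig in original_sentences),
            default=0.0,
        )
        if best >= threshold:
            kept.append(cand)
    return kept
```

The trouble is that character-level edit distance penalizes exactly the rewrites a good correction makes, while some hallucinated sentences reuse enough of the source vocabulary to sneak past.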
Finally I realized that I had a good solution staring me in the face: Llama2 itself. I could get sentence-level vector embeddings straight from Llama2 using langchain. So I did that, getting embeddings for each sentence in the raw OCRed text and in the LLM-corrected text, and then computed the cosine similarity of each sentence in the LLM-corrected text against every sentence in the raw OCRed text. If no sentence in the raw OCRed text is a close match, then that corrected sentence has a good chance of being hallucinated.
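A minimal sketch of that similarity check, assuming the per-sentence embeddings have already been pulled from Llama2 (e.g. via langchain's LlamaCppEmbeddings); the function names and the 0.8 cutoff here are my own illustration:

```python
import math

def cosine_sim(a, b):
    # Plain cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def flag_hallucinations(corrected_embs, original_embs, threshold=0.8):
    # For each corrected-sentence embedding, find its best match among
    # the raw OCR sentence embeddings; a sentence whose best match falls
    # below the threshold is a likely hallucination.
    return [
        max(cosine_sim(c, o) for o in original_embs) < threshold
        for c in corrected_embs
    ]
```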
To save the user from having to experiment with various thresholds by hand, I cache the computed embeddings in an SQLite database so they only have to be computed once, and then automatically try several thresholds, comparing the length of the filtered LLM-corrected text to the length of the raw OCRed text; if things worked right, the two should be roughly the same length. As soon as the filtered length dips below the raw OCRed length, the script backtracks and uses the previous threshold as the final selected threshold.
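The threshold sweep can be sketched like this (function name, threshold grid, and the precomputed best-similarity list are illustrative, not the script's actual code):

```python
def pick_threshold(sentences, best_sims, raw_len,
                   thresholds=(0.6, 0.7, 0.8, 0.9, 0.95)):
    # best_sims[i] is sentence i's best cosine similarity against the
    # raw OCR sentences -- computed once (and cached, e.g. in SQLite),
    # so sweeping many thresholds costs almost nothing.
    chosen = thresholds[0]
    for t in thresholds:
        kept_len = sum(len(s) for s, sim in zip(sentences, best_sims)
                       if sim >= t)
        if kept_len < raw_len:
            break  # over-filtered: fall back to the previous threshold
        chosen = t
    return chosen
```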
Anyway, if you have some very old scanned documents lying around, you might try them out and see how well it works for you. Do note that it's extremely slow, but you can leave it running overnight and maybe the next day you'll have your finished text, which is better than nothing! I feel like this could be useful for sites like the Internet Archive-- I've found their OCR results to be extremely poor for older documents.
I'm very open to any ideas or suggestions you might have. I threw this together in a couple of days and know that it can certainly be improved in various ways. One idea that I thought might be fun would be to make this work with a Ray cluster, sending a different page of the document to each worker so all the pages are processed at the same time.
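That fan-out can be sketched with the stdlib's ThreadPoolExecutor as a stand-in; with Ray, the hypothetical process_page below would become an @ray.remote task and the map a list of futures gathered with ray.get:

```python
from concurrent.futures import ThreadPoolExecutor

def process_document(pages, process_page, max_workers=4):
    # Fan each page out to a worker and reassemble results in page
    # order; process_page is whatever per-page OCR-plus-correction
    # pipeline you already have.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_page, pages))
```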
AaronNewcomer | 2 years ago
I was even more impressed feeding it handwritten French documents, like patents from the same time period. AWS Textract only works with English, so its ML OCR was essentially trying to force English words onto the French handwriting, and the raw output looked like gibberish to me. Even so, once I told the LLM that it was being fed OCR'd French text, the results were still workable.
eigenvalue | 2 years ago
version_five | 2 years ago
From your writeup it sounds like llama2 (which one?) doesn't work well enough without some guardrails, but it's possible to make it work? How would you rate the performance overall?
eigenvalue | 2 years ago
I used the Llama2 13B Chat model ggml weights from TheBloke on Huggingface.
Reubend | 2 years ago
eigenvalue | 2 years ago
https://news.ycombinator.com/item?id=36777836
But even that is much worse than a professionally edited transcript like you might see in a magazine. I just tried uploading the raw transcript to ChatGPT Code Interpreter, and it does an amazing job of reformatting it in nice looking markdown, with speaker identification (automatically done from context) and quotation marks. For example:
---
*Lee Kuan Yew*: But when they said, "We are honest, sincere, coming socialists, just like you," I then revealed, onion by onion, I peeled off, broadcast, and I described how I met the plan chief who said he was going to work with me and so on. And it was convincing.
When I met the plan, he said, "We'll work together. You win, you release our people from jail." I said, "Yes, I have to do that." So I said, "How do I know you are the boss?" He said, "You have to take my word for it." I said, "Well, I believe this city councilor is one of your men, and he has penetrated David Marshall, who was a Safadi Jew, very bright fellow, a lawyer, non-Communist. And he formed the Workers' Party."
I said, "You get him to resign, then I will believe you are the boss." Three weeks later, when I was in London conferencing with the British, I opened the Straits Times, and the child had resigned. God, here was a man whom I had, when I became prime minister, I went to the Special Branch and looked up the files, and there was his picture, wanted on site, and deep in the underground, chased by the police. He could send a message to this fellow he did not know, and the man resigned.
So when I disclosed all this, credibility was established. The fight was on. You, in three different languages, when I finish each broadcast, the director of the station couldn't see me, I went into the room and found me lying on the floor, trying to recover my breath. So each speech was interesting.
*David Gergen*: Oh, yes. English, Malay, Mandarin, and you try that.
---
That suggests to me that I should be able to do something similar using Llama2, although probably not quite as good. This is very exciting to me, though, since otherwise you would either have to sit there manually telling ChatGPT to keep doing each section and then reassemble the output yourself, or spend a fortune on OpenAI API credits to automate it. But with a local LLM you can just leave it running on a spare machine for free.