The big, complicated segmentation pipeline is a legacy from a few years ago, when you had to do it that way. It's error-prone, and even at its best it robs the model of valuable context. You need that context if you want to take the step to handwriting. If you ask a group of human experts to help you decipher historical handwriting, the first thing they will tell you is that they need the whole document for context, not just the line or word you're interested in.
We need to do end to end text recognition. Not "character recognition", it's not the characters we care about. Evaluating models with CER (character error rate) is also a bad idea. It frustrates me so much that text recognition is remaking all the mistakes of machine translation from 15+ years ago.
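To make the CER complaint concrete, here is a minimal, dependency-free sketch (my own illustration, not from the thread) of how character error rate is typically computed, and why a near-perfect CER can still be a fatal transcription error on historical material:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert/delete/substitute) turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

# One substituted digit in a date: the CER looks excellent (about 0.11),
# but the transcribed document now says the wrong century.
ref, hyp = "anno 1548", "anno 1848"
error = cer(ref, hyp)
```

A model optimised to minimise CER is rewarded equally for fixing a smudged "s" and for this kind of semantically catastrophic digit swap, which is the core of the objection.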
> We need to do end to end text recognition. Not "character recognition", it's not the characters we care about.
Arbitrary, nonsensical text requires character recognition. Sure, even a license plate bears some semantics bounding expectations of what text it contains, but text that has no coherence may remain an application domain for character rather than text recognition.
Of course there are new models coming out every month. It's feeling like the 90s when you could just wait a year and your computer got twice as fast. Now you can wait a year and whatever problem you have will be better solved by a generally capable AI.
The issue with that is that some writing is not word-based. People use acronyms and temporal, personalized, industrial, and global jargon. At the beginning of the year, there were some HN posts about moving from dictionary-word to character encoding for LLMs, because of the highly varied nature of writing.
I myself have used symbols with different meanings in shorthand form when constructing an idea.
I see it the same way as laws. Their word definitions are anchored in time to the common dictionaries of the era. Grammar, spelling, and meanings all change through time. LLMs would require time-scoped information to properly parse content from 1400 versus 1900. An LLM would be for trying to take meaning out of the content, versus retaining the work as written.
Character-based OCR ignores the rules, spelling, and meaning of words and outputs what is most likely there. This retains any spelling and grammar errors, whether true positives or false positives by the rules of their day.
Could you dumb this down a bit (a lot) for dimmer readers, like myself? The way I am understanding the problem you are getting at is something like:
> The way person_1 in 1850 wrote a lowercase letter "l" will look consistently like a lowercase letter "l" throughout a document.
> The way person_2 in 1550 wrote a lowercase letter "l" may look more like an uppercase "K" in some parts, and more of a lowercase "l" in others, and the number "0" in other areas, depending on the context of the sentence within that document.
I don't get why you would need to see the entire document in order to gauge some of the details of those things. Does it have something to do with how language has changed over the centuries, or is it something more obvious that we can relate to fairly easily today? From my naive position, I feel like if I see a bunch of letters in modern English (assuming they are legible) I know what they are and what they mean, even if I just see them as individual characters. My assumption is that you are saying that there is something deeper in terms of linguistic context / linguistic evolution that I'm not aware of. What is that..."X factor"?
I will say, if nothing else, I can understand certain physical considerations. For example:
A person who is right-handed and writing near the right edge of a page may start to slant, because of the physical issue of the paper being high and the hand losing its grip. By comparison, someone who is left-handed might have very smudged letters, because their hand naturally presses against fresh ink, or alternatively very "light" letters, because they hover their hand over the paper while the ink dries.
In those sorts of physical considerations, I can understand why it would matter to be able to see the entire page, because the manner in which they write could change depending on where they were in the page...but wouldn't the individual characters still look approximately the same? That's the bit I'm not understanding.
The problem is that paying experts to properly train a model is expensive, doubly so when you want larger context.
It's almost like we need a shared commons to benefit society, but we're surrounded by hoarders who think they can just strip-mine society to automatically bootstrap intelligence.
Surprise: garbage CEOs in, garbage intelligence out.
> OCR4all is a software which is primarily geared towards the digital text recovery and recognition of early modern prints, whose elaborate printing types and mostly uneven layout challenge the abilities of most standard text recognition software.
Looks like it's built on https://github.com/Calamari-OCR/calamari
Looks like a great project, and I don't want to nitpick, but...
https://www.ocr4all.org/about/ocr4all
> Due to its comprehensible and intuitive handling OCR4all explicitly addresses the needs of non-technical users.
https://www.ocr4all.org/guide/setup-guide/quickstart
> Quickstart > Open a terminal of your choice and enter the following command if you're running Linux (followed by a 6 line docker command).
How is that addressing the needs of non-technical users?
Any end-user application that requires Docker is not an end-user application. It does not matter whether the end user knows how to use Docker or not. End-user applications should be delivered as SaaS/web UI or as a local binary (GUI or CLI). Period.
Application installation isn't a user-level task. The application being ready for a user to use and being easy to install are separate things. You get your IT-literate helper to install it for you; then, if the program is easy to use, you're golden.
"Silicate chemistry is second nature to us geochemists, so it's easy to forget that the average person probably only knows the formulas for olivine and one or two feldspars."
A little secret: Apple’s Vision Framework has an absurdly fast text recognition library with accuracy that beats Tesseract. It consumes almost any image format you can think of, including PDFs.
I wrote a simple CLI tool and a more featured Python wrapper for it: https://github.com/fny/swiftocr
This has been one of my favorite features Apple added. When I’m in a call and someone shares a page I need the link to, rather than interrupt the speaker and ask them to share the link it’s often faster to screengrab the url and let Apple OCR the address and take me to the page/post it in chat.
After getting an iPhone and exploring some of their API documentation after being really impressed with system provided features, I'm blown away by the stuff that's available. My app experience on iOS vs Android is night and day. The vision features alone have been insane, but their text recognition is just fantastic. Any image and even my god awful handwriting gets picked up without issue.
That said, I do love me a free and open source option for this kind of thing. I can't use it much since I'm not using Apple products for my desktop computing. Good on Apple though - they're providing some serious software value.
I basically wrapped this in a simple iOS app that can take a PDF, turn it into images, and apply the native OCR to the images. It works shockingly well: https://apps.apple.com/us/app/super-pdf-ocr/id6479674248
I probably should have just made it a free app so it would have gotten very popular, but oh well.
How does it work with tables and diagrams? I have scanned pages with mixed media, where some regions are diagrams. I want to be able to extract the text, but also be told where the diagrams are in the image, with coordinates.
I wonder if it's possible to reverse engineer that, rip it out, and put it on Linux. Would love to have that feature without having to use Apple hardware.
> How is this different from tesseract and friends?
The workflow is for digitizing historical printed documents. Think conserving old announcements in blackletter typesetting, not extracting info from typewritten business documents.
I didn't have good results with tesseract, so I hope this is really different ;)
I was surprised that even text scraped from the screen did not work 100% flawlessly in tesseract. Maybe it was not made for that, but still, I also had a lot of problems with high-resolution photos. I did not try scanned documents, though.
Tangentially related, but does someone know a resource for high-quality scans of documents in blackletter/Fraktur typesetting? I'm trying to make documents look Fraktur-like in LaTeX and would like any and all examples I can lay my hands on.
> Create complex OCR workflows through the UI without the need of interacting with code or command line interfaces.
[...] https://www.ocr4all.org/guide/setup-guide/windows
------------------
I'm sorry, I suppose this is great, but an .exe file is designed for usability. A docker container may be nice for techy people, but it is not "4all" this way. I do understand that the usability starts after you've gone through all the command-line parts, but those are just extra steps compared to other OCR programs, which work out of the box.
I think the current sweet spot for speed/efficiency/accuracy is to use Tesseract in combination with an LLM to fix any errors and to improve formatting, as in my open-source project, which has been shared before as a Show HN: https://github.com/Dicklesworthstone/llm_aided_ocr
This process also makes it extremely easy to tweak/customize simply by editing the English language prompt texts to prioritize aspects specific to your set of input documents.
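The Tesseract-then-LLM idea can be sketched roughly as below. This is not the linked project's actual code; `pytesseract` is assumed to be installed, `llm_complete` stands in for whatever LLM call you use, and the prompt text is the part you would edit to tune behaviour per document set:

```python
import textwrap

# The editable English-language prompt is the whole customization surface:
# tweak these instructions to prioritize aspects of your input documents.
PROMPT_TEMPLATE = textwrap.dedent("""\
    You are correcting raw OCR output. Fix obvious character-level OCR errors
    (for example 'rn' misread as 'm', or '1' misread as 'l'), restore paragraph
    breaks, and keep the original wording. Do not paraphrase or modernise spelling.

    Raw OCR text:
    {ocr_text}
""")

def build_correction_prompt(ocr_text: str) -> str:
    """Wrap raw OCR output in the correction instructions."""
    return PROMPT_TEMPLATE.format(ocr_text=ocr_text)

def ocr_then_correct(image_path: str, llm_complete) -> str:
    """llm_complete: any callable str -> str wrapping your LLM of choice."""
    import pytesseract            # assumed installed, along with Tesseract itself
    from PIL import Image
    raw = pytesseract.image_to_string(Image.open(image_path))
    return llm_complete(build_correction_prompt(raw))
```

The appeal of this design is that the "model" you tune is a plain-English prompt rather than training data, so adapting it to a new corpus is an edit, not a retraining run.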
What kind of accuracy have you reached with this pipeline of Tesseract+LLM? I imagine there would be a hard limit on how far the LLM could improve the OCR-extracted text from Tesseract, since it's far from perfect itself.
Haven't seen many people mention it, but I've just been using the PaddleOCR library on its own, and it has been very good for me, often achieving better quality/accuracy than some of the best vision LLMs, and generally much better quality than other open-source OCR tools I've tried, like Tesseract.
That being said, my use case is definitely focused primarily on digital text, so if you're working with handwritten text, take this with a grain of salt.
https://github.com/PaddlePaddle/PaddleOCR/blob/main/README_e...
https://huggingface.co/spaces/echo840/ocrbench-leaderboard
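For the curious, minimal PaddleOCR usage looks roughly like this. The class and method names follow the PaddleOCR README, but the nested result layout has changed between versions, so treat the parsing helper as an assumption to check against your installed version:

```python
def extract_text(result) -> str:
    """Flatten PaddleOCR's nested per-page [box, (text, confidence)] output
    into plain newline-joined text, discarding boxes and confidences."""
    lines = []
    for page in result:
        for _box, (text, _confidence) in page:
            lines.append(text)
    return "\n".join(lines)

def run_ocr(image_path: str) -> str:
    """End-to-end sketch; requires `pip install paddleocr` and its models."""
    from paddleocr import PaddleOCR
    ocr = PaddleOCR(use_angle_cls=True, lang="en")  # angle classifier helps rotated text
    return extract_text(ocr.ocr(image_path))
```

The boxes that `extract_text` throws away are exactly what you would keep if, like the commenter above asking about diagrams, you need coordinates for non-text regions.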
What is this? A new SOTA OCR engine (which would be very interesting to me) or just a tool that uses other known engines (which would be much less interesting to me).
A movement? A socio-political statement?
If only landing pages could be clearer about wtf it actually is ...
"OCR4all combines various open-source solutions to provide a fully automated workflow for automatic text recognition of historical printed (OCR) and handwritten (HTR) material."
It seems to be based on OCR-D, which itself is based on
- https://github.com/tesseract-ocr/tesseract
- https://kraken.re/main/index.html
- https://github.com/ocropus-archive/DUP-ocropy
- https://github.com/Calamari-OCR/calamari
See
- https://ocr-d.de/en/models
It seems to be an open-source alternative to https://www.transkribus.org/ (which uses, amongst others, https://atr.pages.teklia.com/pylaia/pylaia/)
Another alternative is https://escriptorium.inria.fr/ (which uses kraken)
OCR is all well and good; I thought it was mostly solved with tesseract, so what does this bring? But what I'm looking for is a reasonable library or usable implementation of MRC compression for the resulting PDFs. Nothing I have tried comes anywhere near the commercial offerings, which cost $$$$. It seems to be a tricky problem to solve: detecting and separating the layers of the image to compress separately, and then binding them back together into a compatible PDF.
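The layer separation at the heart of MRC can be sketched in a toy way. This is pure NumPy and my own simplification: a real MRC encoder does adaptive thresholding, hole-filling, and per-layer encoding (JBIG2/CCITT for the 1-bit mask, low-resolution JPEG for the color layers) before recombining them in the PDF:

```python
import numpy as np

def split_mrc_layers(gray: np.ndarray, threshold: int = 128):
    """Split an HxW uint8 grayscale scan into the three MRC layers:
    a 1-bit text mask, an ink-only foreground, and a smooth background."""
    mask = gray < threshold                   # True where ink is
    foreground = np.where(mask, gray, 0)      # ink pixels only; compress at full resolution
    background = np.where(mask, 255, gray)    # page with ink punched out; downsample + JPEG
    return mask, foreground, background

# Tiny synthetic "scan": light paper with a dark 2x2 text blob.
page = np.full((4, 4), 230, dtype=np.uint8)
page[1:3, 1:3] = 20
mask, fg, bg = split_mrc_layers(page)
```

The payoff is that each layer compresses well with a codec suited to it, which is hard to match with a single general-purpose image codec; the hard part the commenter mentions is making the split robust on real scans.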
> OCR is all well and good; I thought it was mostly solved with tesseract, so what does this bring?
This is specifically for historic documents that tesseract will handle poorly. It also provides a good interface for retraining models on a specific document set, which will help for documents that are different from the training set.
Tesseract is nice, but not good enough that there is no opportunity for another, better solution.
Wow. Setup took 12 GB of my disk. First impression: nice UI, but no idea what to do with it or how to create a project. Tells me "session expired" no matter what I try to do. Definitely not batteries-included kind of stuff, will need to explore later.
I've been looking for a project that would have an easy free/extremely cheap way to do OCR/image recognition for generating ALT text automatically for social media. Some sort of embedded implementation that looks at an image and is either able to transcribe the text, or (preferably) transcribe the text AND do some brief image recognition.
I generally do this manually with Claude and it's able to do it lightning fast, but a small dev making a third-party Bluesky/Mastodon/etc. client doesn't have the resources to pay for an AI API.
https://blog.mozilla.org/en/mozilla/ai/help-us-improve-our-a...
Such an approach moves the cost of accessibility to each user individually. It is not bad as a fallback mechanism, but I hope that those who publish won't decide that AI absolves them of the need to post accessible content. After all, if they generate the alt text on their side, they can do it only once and it would be accessible to everyone while saving multiple executions of the same recognition task on the other end. Additionally, they have more control how the image would be interpreted and I hope that this really would matter.
They lost me when they suggested I install docker.
Now, I wouldn't mind if they suggested that as an _option_ for people whose system might exhibit compatibility problems, but - come on! How lazy can you get? You can't be bothered to cater to anything other than your own development environment, which you want us to reproduce? Then maybe call yourself "OCR4me", not "OCR4all".
I don't wish to speak out of turn, but it doesn't look like this project has been active for about 1 year. I checked GitHub and the last update was in Feb 2024. Their last post to X was 25 OCT 2023. :(
This looks promising; not sure how it stacks up against Transkribus, which seems to be the leader in the space since it supports handwriting and trainable ML models for your dataset.
My setup combines Tesseract (for images) and Poppler-utils (for PDFs); a local open-source LLM extracts document segments intelligently.
It can also easily be extended to use one or more vision LLM models.
Finally, it packages the entire AI-agent API into a Dockerized container.
(It looks like the project started in 2022, so maybe it wasn't obvious at the time.)