Not sure what “your own” in the title is supposed to mean if you are running a model that you didn’t train using a framework that you didn’t write on a server that you don’t own.
I think in this case "your own" means under your control, rather than a service or license you pay for: "your own" as in ownership of artefacts, not as in being the creator.
Consider the source of the idiom: rolling your own cigarettes.
That involves taking some rolling papers, a pouch of loose tobacco (or whatever), and perhaps a little device if you're rich. Unlike with manufactured cigarettes, you're just doing some manual assembly of the end product.
You don't need to cultivate the plants or pulp any trees to roll your own.
Not sure what "baking your own bread" means if you are using wheat grown by someone else in an oven that you didn't build that is run with electricity you didn't create from your muscles' force. You haven't even contributed to the nuclear fusion which created the oxygen for the water molecules you've been using! How dare you, standing on the shoulders of giants!
Wouldn't "Serverless OCR" mean something like running tesseract locally on your computer, rather than creating an AI framework and running it on a server?
hi. i run "ocr" with dmenu on linux, that triggers maim where i make a visual selection. a push notification shows the body (nice indicator of a whiff), but also it's on my clipboard
I am working on a client project, originally built using Google Vision APIs, and then I realized Tesseract is so good. Like really good. Also, if PDF text is available, then pdftotext tools are awesome.
My client's use case was specific to scanning medical reports, but since there are thousands of labs in India with slightly different formats, I built an LLM agent that runs only after the PDF/image-to-text step, to double-check the medical terminology. Even then, it runs only if our code cannot already process a text line through simple string/regex matches.
There are probably extremely efficient tools for much of the work where we currently throw the problem at LLMs.
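To make the "regex first, LLM only as a fallback" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `RESULT_LINE` pattern, the record fields, and the `llm_fallback` callable are hypothetical and not taken from the project described above.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern for lines such as "Hemoglobin 13.5 g/dL".
# Real lab reports would need many more patterns than this one.
RESULT_LINE = re.compile(
    r"^(?P<test>[A-Za-z][A-Za-z \-]*?)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>\S+)?$"
)


def parse_line(line: str) -> Optional[dict]:
    """Try the cheap path: return a record if the regex matches, else None."""
    m = RESULT_LINE.match(line.strip())
    if m is None:
        return None
    return {"test": m["test"].strip(), "value": float(m["value"]), "unit": m["unit"]}


def process_report(lines: list[str], llm_fallback: Callable[[str], dict]):
    """Parse each non-empty line; call the (expensive) fallback only on misses."""
    parsed, llm_calls = [], 0
    for line in lines:
        if not line.strip():
            continue
        record = parse_line(line)
        if record is None:
            record = llm_fallback(line)  # assumed to return a dict as well
            llm_calls += 1
        parsed.append(record)
    return parsed, llm_calls
```

Counting the fallback calls is a cheap way to see how much of a corpus the plain string matching already covers before paying for a model.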
HathiTrust (https://en.wikipedia.org/wiki/HathiTrust) has 6.7 million volumes in the public domain, in PDF from what I understand. That would be around a billion pages, if we assume a volume is ~200 pages: roughly 5000 days to go through that with a single A100-40G at 200k pages a day. That is one way to read what they say as being legal. I don't have any information on what happens at DeepSeek so I can't say if it's true or not.
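The back-of-the-envelope numbers can be checked in a few lines. The 200 pages per volume figure is the same assumption as in the comment, and the `gpus` parameter is a hypothetical knob added for illustration.

```python
VOLUMES = 6_700_000        # public-domain volumes, as quoted above
PAGES_PER_VOLUME = 200     # assumption: rough average volume length
PAGES_PER_DAY = 200_000    # quoted throughput for one A100-40G


def days_needed(total_pages: int, pages_per_day: int = PAGES_PER_DAY, gpus: int = 1) -> int:
    """Whole days needed to process total_pages, rounding up."""
    per_day = pages_per_day * gpus
    return -(-total_pages // per_day)  # integer ceiling division


total_pages = VOLUMES * PAGES_PER_VOLUME
print(total_pages)                   # 1340000000
print(days_needed(1_000_000_000))    # 5000 (rounding down to "a billion pages")
print(days_needed(total_pages))      # 6700 (without rounding the page count)
```

Either way it is well over a decade on a single GPU, so the order of magnitude in the estimate holds.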
Tried adding a receipt itemization feature into an app using OpenAI. It does 95% right but the remaining 5% are a mess. Mostly it mixes prices between items (Olive oil 0.99 while Banana 7.99). Is there some lightweight open source lib that can do this better?
So I'm trying to OCR thousands of pages of old French dictionaries from the 1700s. Has anything popped up that doesn't cost an arm and a leg and works pretty decently?
I use Gemini for that. Split the PDF into 50-page chunks, throw each chunk into AI Studio, and ask it to convert it. A couple thousand pages can be done on the free tier.
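For the chunking step, a small sketch of how the page ranges could be computed. It only works out which pages go in which chunk. Actually writing the chunk files needs a PDF library, which is left out here. The function name and the zero-based, end-exclusive convention are assumptions.

```python
def chunk_page_ranges(total_pages: int, chunk_size: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) page ranges, zero-based with end exclusive."""
    if total_pages < 0 or chunk_size <= 0:
        raise ValueError("total_pages must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


print(chunk_page_ranges(120))  # [(0, 50), (50, 100), (100, 120)]
```

The last chunk is simply shorter when the page count is not a multiple of 50.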
Different tools for different jobs. Tesseract is free, runs on CPU, and handles clean printed text well. For standard documents with simple layouts, it's hard to beat.
Where it falls apart is complex pages. Multi-column layouts, tables, equations, handwriting. Tesseract works line-by-line with no understanding of page structure, so a two-column paper gets garbled into interleaved text. VLM-based models like DeepSeek treat the page as an image and infer structure visually, which handles those cases much better.
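A toy illustration of what "interleaved" means here. This is not how Tesseract is implemented and it does not model its page segmentation. It only contrasts reading physical rows straight across a two-column page with reading one column at a time. The sample text and function names are made up.

```python
# Two columns of a hypothetical page, one string per physical line.
LEFT = ["Tesseract is free", "and runs on CPU."]
RIGHT = ["VLMs read the page", "as a whole image."]


def read_across(left: list[str], right: list[str]) -> str:
    """Naive row-wise reading: each physical line spans both columns."""
    return " ".join(f"{l} {r}" for l, r in zip(left, right))


def read_by_column(left: list[str], right: list[str]) -> str:
    """Layout-aware reading: finish the left column, then the right one."""
    return " ".join(left + right)


print(read_across(LEFT, RIGHT))
# Tesseract is free VLMs read the page and runs on CPU. as a whole image.
print(read_by_column(LEFT, RIGHT))
# Tesseract is free and runs on CPU. VLMs read the page as a whole image.
```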
For this specific use case (stats textbook with heavy math), Tesseract would really struggle with the equations. LaTeX-rendered math has unusual character spacing and stacked symbols that confuse traditional OCR engines. The author chose DeepSeek specifically because it outputs markdown with math notation intact.
The tradeoff is cost and infrastructure. Tesseract runs on your laptop for free. The author spent $2 on A100 GPU time for 600 pages. For a one-off textbook that's nothing, but at scale the difference between "free on CPU" and "$0.003/page on GPU" matters. Newer alternatives like dots and olmOCR (mentioned upthread by kbyatnal) are also worth comparing if accuracy on complex layouts is the priority.
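The per-page figure follows directly from the two numbers quoted above. A quick sketch of the arithmetic and how it extrapolates, assuming (hypothetically) that the cost scales linearly with page count:

```python
TOTAL_COST_USD = 2.00   # quoted A100 spend
PAGES = 600             # quoted page count

cost_per_page = TOTAL_COST_USD / PAGES


def gpu_cost(pages: int) -> float:
    """Estimated GPU cost in USD, assuming cost scales linearly with pages."""
    return pages * cost_per_page


print(round(cost_per_page, 4))          # 0.0033
print(round(gpu_cost(1_000_000), 2))    # 3333.33
```

So the same rate that is negligible for one textbook comes to a few thousand dollars per million pages.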
Question for the crowd: with autoscaling, when a new pod is created, will it still download the model straight from Hugging Face?
I like to push as much as I can into the image. So in the Modal image, I would run a command that triggers the model download. Then in the app, just point to the locally downloaded model. The image gets bigger, but nothing needs to be re-downloaded on startup.
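The app-side half of that pattern can be sketched in plain Python. This is only an illustration under stated assumptions: `resolve_model_dir` and `fake_download` are hypothetical names, and the real download call and image build step depend on the platform and model hub in use. The idea is that the build step populates the directory once, so startup finds the files and skips the download.

```python
import tempfile
from pathlib import Path
from typing import Callable


def resolve_model_dir(baked_dir: Path, download: Callable[[Path], None]) -> Path:
    """Return a directory with model files, downloading only if it is missing or empty."""
    if baked_dir.is_dir() and any(baked_dir.iterdir()):
        return baked_dir  # already baked into the image: no network needed
    baked_dir.mkdir(parents=True, exist_ok=True)
    download(baked_dir)
    return baked_dir


# Demo with a fake downloader standing in for a real model download.
download_calls = []


def fake_download(target: Path) -> None:
    download_calls.append(target)
    (target / "weights.bin").write_bytes(b"\x00")


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "model"
    resolve_model_dir(model_dir, fake_download)  # "build time": downloads once
    resolve_model_dir(model_dir, fake_download)  # "startup": reuses the files
    print(len(download_calls))  # 1
```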
eapriv|15 days ago
ddevnyc|14 days ago
RupertSalt|13 days ago
ckrapu|15 days ago
jen20|13 days ago
self_awareness|14 days ago
croes|14 days ago
nkmnz|14 days ago
voidUpdate|15 days ago
cachius|15 days ago
normie3000|15 days ago
esafak|14 days ago
'Serverless' has become a term of art: https://en.wikipedia.org/wiki/Serverless_computing
spockz|15 days ago
But this caught me for a bit as well. :-)
mindslight|14 days ago
kbyatnal|15 days ago
ocrarena.ai maintains a leaderboard, and a number of other open source options like dots [1] or olmOCR [2] rank higher.
[1] https://www.ocrarena.ai/compare/dots-ocr/deepseek-ocr
[2] https://www.ocrarena.ai/compare/olmocr-2/deepseek-ocr
ckrapu|15 days ago
segmondy|15 days ago
tclancy|15 days ago
vovavili|14 days ago
grimgrin|15 days ago
brainless|15 days ago
coolness|15 days ago
> In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G).
That... doesn't sound legal
Zababa|15 days ago
Bishonen88|15 days ago
lkm0|15 days ago
ks2048|14 days ago
grumbel|14 days ago
speedgoose|15 days ago
ddtaylor|15 days ago
newzino|15 days ago
apwheele|15 days ago
bovinejoni|15 days ago
velcrovan|15 days ago
ckrapu|15 days ago
sails|15 days ago
jbs789|14 days ago
smw|14 days ago
fzysingularity|14 days ago
StackTopherFlow|14 days ago
PlatoIsADisease|14 days ago
I have 4 of these now, some are better than others. But all worked great.
zeroq|15 days ago