alister | 8 years ago
I have an observation about scanning documents that results in good quality and smaller files, but I can't satisfactorily explain why it works. Consider these two cases:
(1) Scan document at very high resolution as a JPG and then use a third-party program (like Photoshop or whatever) to re-encode the JPG at your preferred low resolution.
(2) Scan document at your preferred low resolution as a JPG straight away. Don't re-encode afterward.
Intuition says that the results of #1 vs #2 should be identical, or that #1 should be worse because you're doing two passes on source material. But I always get better results with case #1 (i.e., high-res scan and re-encoding afterward) regardless of the type or model of scanner, or whether the scanner does the JPG encoding on-board the device itself or through a Windows/Linux/Mac driver bundled with the scanner.
My theory is that scanner manufacturers are deliberately choosing the JPG encoding profile that gets them the fastest result. They want to brag about pages per minute which is an easily measured metric. Quality of JPG encoding and file size take effort to compare, but everyone understands pages per minute.
If anyone has contrary experience I'd like to hear it. I've been seeing this for years with different document scanners and flatbed scanners -- regardless of how I tweak the scanner's settings, I can always get good quality in a small file by re-encoding afterward.
discreditable | 8 years ago
In addition to some other points, the downscaling step in #1 may also smooth out some noise in the source image. Less noise yields more compressible data.
In my own scanning workflow, I scan at 600 or 1200 dpi to PNG, deskew, downscale, and apply a black/white threshold. This is all done with ImageMagick:

    mogrify -deskew 40% -scale 25% -threshold 50%
If I want a PDF after that I'll use img2pdf.
sp332 | 8 years ago
If you're scanning at a lower resolution, the scanner has fewer samples to work with when trying to make a visual representation of your document. If you scan at a higher resolution, the algorithm could at the very least average together nearby samples. It could also detect sharp lines vs fuzzy borders and decide whether the low-res version has a sharp transition or an averaged color between areas.
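A toy illustration of that averaging, as a NumPy sketch (not from the original comment; the array is a stand-in for one scan line containing a sharp pen stroke):

    import numpy as np

    # One high-resolution scan line: a sharp dark stroke on white paper.
    hi_res = np.array([255, 255, 255, 0, 0, 255, 255, 255], dtype=float)

    # "Scan at low resolution" ~ take every Nth sample, no averaging:
    point_sampled = hi_res[::2]    # [255. 255.   0. 255.]

    # "Scan high, then downscale" ~ average each block of N samples:
    block_averaged = hi_res.reshape(-1, 2).mean(axis=1)
    # [255.  127.5 127.5 255. ] -- the edge becomes a smooth transition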
dtech | 8 years ago
> My theory is that scanner manufacturers are deliberately choosing the JPG encoding profile that gets them the fastest result.
This is more-or-less correct. The chips in the printers have a lot less power than your CPU, and the algorithms are a lot worse than those in Photoshop.
ComodoHacker | 8 years ago
My guess is slightly different: manufacturers are deliberately choosing the JPG encoding profile that gets them good quality in the worst cases, which also happens to mean fast encoding and bigger files. Their motivation is simple: one case of negative user experience outweighs a thousand positive ones and hurts their reputation hard.
Benjaminsen | 8 years ago
While the JPEG encoder in Photoshop is likely a lot better than what's in your scanner, I am, similarly to others here, fairly convinced that the majority of the difference is in the sample rate by the scanner.
I did some vanilla JS that simulates scaling from high DPI vs sampling directly at a specific DPI, and the output does resemble a low-resolution scan.
http://chrisbenjaminsen.com/shared/samplerate/
I am aware optics does not work exactly like this, but it's a reasonable approximation.
kazinator | 8 years ago
Here is why: if you go with (2), then (1) is still done: by the crap firmware in your printer or its driver. It scans at high resolution and then downsamples in some way over which you have no control. It might not even be done with floating-point math.
(1) is somewhat like getting the raw image from a camera: a higher-quality source for your own processing.
unknown | 8 years ago
[deleted]
nayuki | 8 years ago
On the top image, I see that the back side of the page has clearly leaked through. In my experience with scanning paper, I found a trick that essentially eliminates any visible backside content: using a flatbed scanner, I would scan with the lid open and the room darkened.
The worst thing to do is to scan with the lid closed, with a lid that has a white background. This would increase the reflection from the backside of the page.
MagerValp | 8 years ago
You can achieve the same effect with black paper on top of the document you're scanning, or in between the pages if it's a book. As a bonus you can leave the light on :)
pjc50 | 8 years ago
Nice to see a bit of k-means clustering. I was worried that this might attempt to be "smart" by converting to symbols, replicating the "Xerox changes numbers in copied documents" bug, but it's pure pixel image processing.
Very clean results. In some ways it's a smarter version of the "posterize" feature.
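For the curious, the k-means color quantization being praised here can be sketched in a few lines of Python. This is a hedged approximation of the general technique, not Matt Zucker's actual code; it assumes Pillow, NumPy, and scikit-learn, and the file names and cluster count are illustrative:

    import numpy as np
    from PIL import Image
    from sklearn.cluster import KMeans

    img = np.asarray(Image.open("scan.png").convert("RGB"))
    pixels = img.reshape(-1, 3)

    # Fit a small color palette on a random sample of pixels for speed
    # (assumes the image has at least 5000 pixels).
    sample = pixels[np.random.choice(len(pixels), 5000, replace=False)]
    kmeans = KMeans(n_clusters=8, n_init=4, random_state=0).fit(sample)

    # Snap every pixel to its nearest palette color: a "smart posterize".
    palette = kmeans.cluster_centers_.astype(np.uint8)
    quantized = palette[kmeans.predict(pixels)].reshape(img.shape)
    Image.fromarray(quantized).save("scan_posterized.png")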
wmu | 8 years ago
Using a blurred version of the whole image as the background is probably better than what the OP is doing; as I understand it, he's treating the background as a single fixed color.
unknown | 8 years ago
[deleted]
kazinator | 8 years ago
Here is a casual job on the first image:
https://i.imgur.com/Sy2rvsU.png
The steps:
1. Duplicate the layer.
2. Gaussian-blur the top layer with big radius, 30+.
3. Put the top layer in "Divide" mode. Now the image is level.
4. Merge the layers together into one.
5. Use Color->Curves to clean away the writing bleeding through from the opposite side of the paper.
6. To approximate the blurred look of Matt Zucker's result, apply Gaussian blur with r=0.8.
Notes:
The unblurred image before step 6 is here: https://i.imgur.com/RbWSUnD.png
Here is approximately the curve used in step 5: https://i.imgur.com/lvfqCNK.png
I suspect Matt worked at a higher resolution; i.e. the posted images are not the original resolution scans or snapshots.
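For those who would rather script this than click through GIMP, here is a minimal Python translation of steps 1-4 (my sketch, assuming Pillow and SciPy; the sigma, the epsilon, and the crude stretch standing in for step 5 are all illustrative):

    import numpy as np
    from PIL import Image
    from scipy.ndimage import gaussian_filter

    img = np.asarray(Image.open("page.png").convert("L"), dtype=float)

    # Steps 1-3: a heavily blurred copy estimates the uneven paper
    # background; dividing by it levels the illumination. The epsilon
    # guards against near-zero divisors.
    background = gaussian_filter(img, sigma=30)
    leveled = img / (background + 1e-6)

    # Step 5, crudely: stretch so faint bleed-through clips to white.
    out = np.clip((leveled - 0.6) / 0.4, 0.0, 1.0) * 255
    Image.fromarray(out.astype(np.uint8)).save("page_leveled.png")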
fouc | 8 years ago
BTW I'm curious how you'd fare on the graph one. I didn't like his results for it. https://github.com/mzucker/noteshrink/blob/master/examples/g...
remram | 8 years ago
In particular my use case is cleaning up pictures of whiteboards, where the brightness from the room is not as constant as a scan, and the approach wouldn't work at all.
Probably an easy PR for that repo though?
keenerd | 8 years ago
Really? Do you care to explain? What is the dividend and what is the divisor? Why can dividing an image by its low-pass-filtered version (or vice versa) be used to "clean up" the image, i.e. subtract the background, find main colors, and cluster similar colors with k-means? What if the divisor has pixels near zero?
tkp | 8 years ago
donquichotte | 8 years ago
I like the idea, but DjVu seems to be very proprietary / single-vendor and not in widespread use. This has made me reluctant to use it for archival purposes (vs. say PDF, which has its own issues, but feels slightly more future-proof to me).
trurl42 | 8 years ago
It's a great file format for space-efficient archiving of scans like that, with a bit of scripted preprocessing.
[1]: https://en.wikipedia.org/wiki/DjVu
dunham | 8 years ago
I think PDF can cover pretty much the same ground with JBIG2 and JPEG 2000. (And I believe archive.org is doing that.) But I don't know of any open-source code to do the segmentation / encoding. (You have to split the bitmap from the background for JBIG2 / JPEG encoding.)
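To make the segmentation idea concrete, here is a naive sketch of the split (mine, not archive.org's pipeline; a real mixed-raster encoder is far smarter, and the 128 threshold is illustrative). The bilevel mask would go to a JBIG2-style coder and the remaining layer to a JPEG-style coder:

    import numpy as np
    from PIL import Image

    img = np.asarray(Image.open("page.png").convert("L"))

    # Naive mask: dark pixels count as text / line art.
    mask = img < 128
    Image.fromarray(np.where(mask, 0, 255).astype(np.uint8)).save("fg_mask.png")

    # Background layer with the text painted out.
    bg = img.copy()
    bg[mask] = 255
    Image.fromarray(bg).save("bg_layer.png")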
Softcadbury | 8 years ago
I never really understood why there were so many lines...
[1]: https://images-na.ssl-images-amazon.com/images/I/815WQQdAHBL...
John_KZ | 8 years ago
Is this really standard writing paper? I assume it would be useful for calligraphy or learning how to write (as you can use the subdivisions to draw letters at the correct height), but I find it weird for it to be standard-issue paper.
l9k | 8 years ago
It helps children learning how to write.
Lowercase letters go from the thicker line to the first thinner line above it.
Uppercase letters and taller lowercase letters like "t" or "d" go to the second line.
And the tails of letters like "g" or "y" go to the first line below.
haikuginger | 8 years ago
eltoozero | 8 years ago
For anyone having issues getting this to work on macOS with Homebrew dependencies, I was able to get it to work after finally getting an old version of numpy installed using the following command.
If you don't use numpy==1.9.0 you'll get the 1.14.2 version, which is also broken.
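The command itself didn't survive the copy; it was presumably something along these lines (the exact flag set is my guess, not the original):

    pip install --user --ignore-installed numpy==1.9.0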
The rest of the options allow pip to soft-override the macOS built-in numpy 1.8.0 which is immutable in the /System/ directory.
Anyway, after I did all that I was able to start playing with the app. I had previously been using a kludge workflow to get nice black-and-white output: use ImageMagick's convert with the -shave option to remove the scanned edges of images, then -depth 1 to force the bit depth down (which only works well on really clean scans), then -trim to clear the framing white pixels, and re-center with -gravity center -extent 5100x6600 to frame the contents centered inside a 600dpi image.
Rough, but it works. I had been struggling to isolate "spot colors" for another thing, but this might actually do the trick!!!
Myrmornis | 8 years ago
This is awesome, and a depressingly large factor better than any blog post I'll ever write.
I totally identify with the need for this. I also want to archive images of notes and whiteboards, and they must be kept small as so far my life fits in google drive and github.
Currently I use Evernote to do this. I don’t use any other functionality in Evernote but the “take photo” action does processing and size reduction very like the blog post.
reaperducer | 8 years ago
Great job with that. I've only just started taking notes by hand once again, after being keyboard-only for many years.
In your scenario, since you have assigned "scribes" taking the notes, you might be able to streamline the process with a "smart pen."
There are several on the market. The one I got as a hand-me-down from a family member lets you write dozens of pages of notes, then Bluetooth them to a smartphone app that exports to PDF, Box, Google Drive, etc... Or it can actually copy the notes to the app in real time. Combined with a projector, this might be useful for the other students during class.
It's supposed to be able to OCR the notes too, but I haven't bothered to figure out how. And there's a cool little envelope icon in the corner of each notebook page: put a checkmark on it and the page is automatically e-mailed to a pre-designated address.
Again, there are several models on the market. Mine retails for about $100. Notebooks come in about 15 different sizes and cost about the same as a regular quality notebook.
Just some thoughts.
inetknght | 8 years ago
I have found that my Galaxy Note 2014 is pretty much hands-down the best note-taking tablet. It's better than the crap that Microsoft and Apple are trying to hawk. It doesn't have as many fancy apps but for _strictly_ note taking, sharing notes via email, and book reading, it's pretty awesome.
I just wish its price would come down. It's still full price from four years ago :| and even getting more expensive because it's so old
[+] [-] WalterGR|8 years ago|reply
dracodoc | 8 years ago
I used to use a free program called "ComicEnhancerPro" (the author is Chinese; there is an English version, but it may not be easy to find a reliable download site), specially designed to enhance scanned comics.
You can remove the background very effectively by dragging a curve with preview.
You almost always need to preview and adjust some parameters, unless you have a template for similar cases.
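A rough scripted equivalent of such a curve (my sketch, not ComicEnhancerPro's algorithm; assumes Pillow, and the black/white points are illustrative, normally tuned against a preview):

    from PIL import Image

    # Levels-style curve: everything lighter than `white` becomes
    # paper-white, everything darker than `black` becomes ink-black,
    # linear in between.
    def curve(v, black=60, white=200):
        if v <= black:
            return 0
        if v >= white:
            return 255
        return int(255 * (v - black) / (white - black))

    img = Image.open("page.png").convert("L")
    img.point(curve).save("page_clean.png")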
goerz | 8 years ago
In terms of compression for scanned notes, I haven't found anything that comes close to what even an older version of Adobe Acrobat yields, due to the use of the JBIG2 codec. Has anybody found any way to compress PDF files with JBIG2 on Linux/Mac? It's pretty much the only reason I have to find a Windows machine with Acrobat installed a couple of times a year, to postprocess a batch of scanned PDFs.
ramses0 | 8 years ago
BlackLotus89 | 8 years ago
Anyway, very nice writeup; I will add it to my arsenal and give it a closer look later. Could be useful for my document archive+ocr solution.
Edit: too bad it seems like it didn't see any activity in the last year.
eadmund | 8 years ago
That's not necessarily bad: sometimes a piece of software can be done, or nearly so.
mkjmkumar | 8 years ago
I have made some progress on this as a home project using the same compression and scanning. I call it DFA (digital file analytics): data/images/scanned documents are sent remotely using Kafka to Hadoop, then OCR is run to extract text, followed by compression. If the document is more than 10MB it goes to HBase, otherwise HDFS. Near-real-time streaming using Spark and Flink is done too. Visualization using a Banana dashboard is not so cool, as it shows word counts, storage location, images and tags. Next I would like to do analytics on top of the extracted data using ML.
More you can find at https://medium.com/@mukeshkumar_46704/digital-files-ingestio...
krick | 8 years ago
Could you expand on your archive+ocr? I have long wanted to start doing something like this, but never got to it. I guess reading about others' experience can be useful.
andimai | 8 years ago
https://docs.opencv.org/3.4.0/d7/d4d/tutorial_py_thresholdin...
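The link presumably points at OpenCV's adaptive thresholding, which computes a local threshold per neighborhood and so copes with uneven lighting; a minimal sketch (file names and parameters are illustrative):

    import cv2

    img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

    # Threshold each pixel against a Gaussian-weighted mean of its
    # 11x11 neighborhood, minus a small constant.
    bw = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 11, 2)
    cv2.imwrite("scan_bw.png", bw)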
amai | 8 years ago
jonathanyc | 8 years ago
It’s also interesting for me to think about how this is a generalization of converting a scan to black and white for clarity :)
anc84 | 8 years ago
Is there much room for improvement? Looks pretty good to me.
jerf | 8 years ago
It seems to me that the inaccuracies/inefficiencies/errors/whatever you like in using RGB are basically truncated out of existence by the very, very harsh binning that is occurring. I wouldn't expect any visible differences to emerge from any alternate color space.
krsree | 8 years ago
The link says that the generated PDF is a container for the PNG or JPG image.
Is it possible to get a true PDF from the scan? Specifically, so that I can search inside the PDF.
davidzweig | 8 years ago