
Xerox scanners and photocopiers randomly alter numbers in scanned documents

570 points | sxp | 12 years ago | dkriesel.com | reply

112 comments

[+] agl|12 years ago|reply
This class of error is called (by me, at least) a "contoot" because, long ago, when I was writing the JBIG2 compressor for Google Books PDFs, the first example was on the contents page of a book. The title, "Contents", was set in very heavy type which happened to be an unexpected edge case in the classifier and it matched the "o" with the "e" and "n" and output "Contoots".

The classifier was adjusted and these errors mostly went away. It certainly seems that Xerox have configured things incorrectly here.
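A minimal, hypothetical sketch of how a mismatch like this can arise (this is not the actual Google Books code): a JBIG2-style symbol classifier that matches glyph bitmaps by counting differing pixels. If the mismatch threshold is set too loosely, visually distinct glyphs get merged into one symbol class.

```python
# Hypothetical sketch of JBIG2-style symbol classification: glyph
# bitmaps are grouped into classes, and each class is stored once.
# A too-loose mismatch threshold merges distinct glyphs ("o" vs "e").

def mismatch(a, b):
    """Fraction of pixels that differ between two same-sized bitmaps."""
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / (len(a) * len(a[0]))

def classify(glyphs, threshold):
    """Greedily assign each glyph to the first class within the threshold."""
    classes = []  # representative bitmap per class
    labels = []
    for g in glyphs:
        for i, rep in enumerate(classes):
            if mismatch(g, rep) <= threshold:
                labels.append(i)
                break
        else:
            classes.append(g)
            labels.append(len(classes) - 1)
    return labels

# Two tiny 3x3 "glyphs" that differ in a single pixel:
o = [[1, 1, 1],
     [1, 0, 1],
     [1, 1, 1]]
e = [[1, 1, 1],
     [1, 1, 1],   # one extra pixel set
     [1, 1, 1]]

print(classify([o, e], threshold=0.05))  # strict: kept apart -> [0, 1]
print(classify([o, e], threshold=0.2))   # loose: merged -> [0, 0]
```

The one-pixel difference here stands in for the subtle stroke differences that separate an "o" from an "e" in heavy type.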

Also, with Google Books, we held the hi-res original images. It's not like the PDF downloads were copies of record. We could also tweak the classification and regenerate all the PDFs from the originals.

For a scanner, I don't think that symbol compression should be used at all for this reason. For a single page, JBIG2 generic region encoding is generally just as good as symbol compression.

More than you want to know about this topic can be found here: https://www.imperialviolet.org/binary/google-books-pdf.pdf

[+] gngeal|12 years ago|reply
How would one handle the case with the tiny boxes? It seems to me that these ought to be treated more like line drawings and not unify them as symbols at all if you can't properly decompose them into lines of Latin alphabet glyphs. JBIG2 of course cleverly doesn't tell you how to do the "smart" segmentation...
[+] gngeal|12 years ago|reply
It just occurred to me...

> The title, "Contents", was set in very heavy type which happened to be an unexpected edge case in the classifier and it matched the "o" with the "e" and "n" and output "Contoots".

Wouldn't it be a good idea to perform OCR - using a language model, the works - before you start classifying the JBIG2 symbols? That way, you'd have additional contextual information to say "Aha, 'contoots' is probably not what it reads here" at least in some of the cases.

Although, I realize that on "Google scale", such a complex solution could be a problem.

[+] unknown|12 years ago|reply

[deleted]

[+] linohh|12 years ago|reply
This was predictable. JBIG2 is in no way secure for document processing, archiving or whatsoever. The image is sliced into small areas and a probabilistic matcher finds other areas that are similar. This way similar areas only have to be stored once.

Yeah right, you get it, don't you? They are similar, not equal. Whenever there's a probability less than 1, there's a complementary event with a probability larger than 0.
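A toy illustration of the failure mode described above (an assumed sketch, not any vendor's real code): once the matcher has put two merely similar regions into the same class, the decoder re-stamps a single shared representative everywhere, and the information that distinguished them is gone.

```python
# Sketch of the "stored once" reconstruction step: similar regions
# share one representative bitmap, so the decoded page is identical
# wherever any member of the class appeared -- even if the originals
# were different digits.

def encode(regions, labels):
    """Keep one representative per class (here: the first member seen)."""
    reps = {}
    for region, label in zip(regions, labels):
        reps.setdefault(label, region)
    return reps, labels

def decode(reps, labels):
    return [reps[l] for l in labels]

# Three regions from the original page; a too-loose matcher decided
# the "6" was similar enough to the "8"s and put all three in class 0.
regions = ["glyph-8-variant-a", "glyph-8-variant-b", "glyph-6"]
labels = [0, 0, 0]

reps, labels = encode(regions, labels)
decoded = decode(reps, labels)
print(decoded)  # every region decodes as "glyph-8-variant-a"
```

The round trip is self-consistent and looks clean, which is exactly why the substitution is so hard to spot by eye.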

I wonder which prize idiot had the idea of using this algorithm in a copier. JBIG2 can only be used where mistakes won't mean the world is going to end. A photocopier is expected to copy. If the machines were used for digital document archiving, some companies will face a lot of trouble when the next tax audit is due.

Digital archives using this kind of lossy compression are not only worthless, they are dangerous. As the paper trail is usually shredded after successful redundant storage of the images, there will be no way of determining correctness of archived data.

This will make lawsuits a lot of fun in the future.

[+] rayiner|12 years ago|reply
Thinking about how often I use scan to PDF and e-mail with important documents, this article gives me the shivers. This is an epic fuck-up. Nothing less than grossly negligent.
[+] ams6110|12 years ago|reply
> This will make lawsuits a lot of fun in the future.

Given the way the algorithm works, it would seem to me that "fine print" would be the most vulnerable to the bug (well not really a bug, it's the behavior of JBIG2). I wonder if there will be a clear dividing line, e.g. "smaller than 10pt type is subject to reasonable doubt if a Xerox copier was used"

[+] tjoff|12 years ago|reply
"The image is sliced into small areas and a probabilistic matcher finds other areas that are similar."

"Whenever there's a probability less than 1, there's a complementary event with a probability larger than 0."

If that alone is the reason why JBIG2 is in no way secure for document processing, archiving or whatsoever - then I've got some bad news for you. Because if that's the case you really shouldn't be using a computer for, well, anything.

[+] nsxwolf|12 years ago|reply
Truly surprising. I would never have imagined this to be in the domain of possible problems one would expect to encounter scanning or photocopying a document.

It is like taking a picture of my wife with a digital camera and her face being replaced with that of some other person.

[+] eksith|12 years ago|reply
With personal video recording (a la Google Glass and friends) it won't be long before we're subjected to this sort of thing. It's amazing how close we're getting to Ghost in the Shell, and I'm sure it won't be long before live video feeds can be hacked in real time to show something contrary to what's actually happening.
[+] harrytuttle|12 years ago|reply
This should be on the computer risks digest.

There is virtually no reason whatsoever for this problem to exist. This is the domain of "making a problem more risky and complicated than it needs to be" and royally screwing people in the process.

Might as well throw the paperwork in a bin and set fire to it.

[+] candeira|12 years ago|reply
Sufficiently advanced bugs are indistinguishable from sabotage.
[+] ElliotH|12 years ago|reply
I can't quite see the reason why you would lossily compress something when your machine's purpose is to duplicate things.

Anyone got a reasonable reason for doing this?

[+] rly_ItsMe|12 years ago|reply
In the good old days of analog copiers this would be impossible - the scanner sends the light through a system of mirrors to the drum, the drum gets statically charged, the toner is pulled onto the charged parts and transferred to the transfer belt, where the paper has the opposite charge and pulls the toner off the transfer belt; it then goes through the fusing unit, where the toner is 'burned' onto the paper. End of story.

On a modern copier the scanner transfers the data first to RAM and then usually to a hard disk (most people do not even know that the "copy machine" has one and saves the scanned material to it). From that hard disk the data is transmitted via laser to the drum.

Tadaaa - there's your reason for data being compressed on a modern copier.

[+] Someone|12 years ago|reply
Others have pointed out a credible explanation: to have the document take less space on their hard disk.

However, it does not have to be compression, per se. Modern copiers want to correct all kinds of errors such as creases and staples. They also want to optimize the colors. To do that, they have logic for detecting what areas of the page are full-color and which are black and white, which are half-tone printed, which are text, line art, photograph, whether the paper might have aged, etc.

I don't know what tricks they use, but I do not rule out that they will replace 'looks somewhat dirty' patches with an 'obviously higher quality version' of them, and use too aggressive parameters in some of those heuristics.

[+] simonster|12 years ago|reply
If you're scanning a long document to a PDF, compression makes a lot of sense. It's the difference between being able to email the PDF as an attachment and having to find a place to put the file online.
[+] dietrichepp|12 years ago|reply
The article has been updated with the probable cause for this error.
[+] wahnfrieden|12 years ago|reply
Cheaper components, maybe? (If it lets them get by with less memory for example.)
[+] hga|12 years ago|reply
Good point. Looking at a product page (http://www.office.xerox.com/multifunction-printer/color-mult...), I see that the first model mentioned is multifunction, it can "Copy, email, fax, print, [and] scan".

So it sounds like there's one code path and it's seriously broken. I looked at the first settings page, and while it's in German I can see it's 200 DPI. There's no excuse for default lossy compression when you're at 200 DPI and doing office-sized paper. We didn't do that in 1991; we got CCITT Group 4 lossless compression of around 50KB per image (plus or, more generally, minus) for 8.5x11 inch paper, although we did do things like noise reduction and straightening documents (that makes them compress better, among other things).
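The arithmetic behind those figures is easy to check: an 8.5x11" page at 200 DPI and 1 bit per pixel is under half a megabyte raw, and a roughly 10:1 lossless ratio (typical for clean office text under Group 4) lands right around the quoted 50KB.

```python
# Back-of-the-envelope check of the 200 DPI / ~50KB-per-page figures.
# The 10:1 ratio is an assumed ballpark for G4 on clean text, not a
# measured number.

width_px = int(8.5 * 200)    # 1700 pixels across
height_px = 11 * 200         # 2200 pixels down
raw_bytes = width_px * height_px // 8   # 1 bit per pixel

print(raw_bytes)        # 467500 bytes, roughly 457 KB raw
print(raw_bytes // 10)  # ~46 KB at an assumed 10:1 lossless ratio
```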

[+] lifeformed|12 years ago|reply
Geeze. This could result in some catastrophic errors. An order for 900 servers instead of 200. $7M loss instead of $1M in your quarterly earnings. Pricing your product at $3 instead of $8. Makes you realize you need some redundancy and double-checks for important communications.
[+] ams6110|12 years ago|reply
Especially considering that faxes, copies, and scans of documents are legally the same as the originals, at least for ordinary business purposes.
[+] scrumper|12 years ago|reply
I don't think it's necessarily an issue of inexcusable incompetence: it seems like one of those faults which is obvious in retrospect but very difficult to predict. Why shouldn't Xerox use a standard compression algorithm in their scanner/copiers? That would seem to be a safer choice than writing a lossy compression algorithm from scratch. QA testing probably was on the order of 'picture looks right'; after all, why bother testing that the semantics of the copied content match the original when what you're building is a bitmap duplicator? (Of course, the OCR stuff would be tested more rigorously, but this explicitly bypasses that piece). It's not hard to see the chain of individually reasonable decisions that could lead to something like this.

The real failure is probably something more cultural: there was nobody with the discipline, experience, and power to write an engineering policy prohibiting the use of lossy compression in duplication equipment. I have no idea about Xerox's corporate history, but the evisceration of engineering departments in US giants and the concomitant decline in what one might call 'standards' or 'rigor' is an established concept.

[+] rdtsc|12 years ago|reply
> Why shouldn't Xerox use a standard compression algorithm in their scanner/copiers?

I have never heard of JBIG2. I have implemented JPEG2000 codecs and arithmetic coding compression from scratch, and I had never heard of JBIG2. And here they are using it, and others are claiming it is just a standard, run-of-the-mill thing.

> That would seem to be a safer choice than writing a lossy compression algorithm from scratch.

Going out on a limb here, wouldn't the safest be to just not use a lossy codec at all or use something like JPEG?

> QA testing probably was on the order of 'picture looks right';

Sorry. This is the company whose name is synonymous with the verb "to copy". If plugging in an obscure codec from some place and checking that one picture looks "OK" is their idea of QA, then they deserve all the ridicule and lawsuits stemming from this.

[+] rayiner|12 years ago|reply
> Why shouldn't Xerox use a standard compression algorithm in their scanner/copiers?

The problem isn't using a standard compression algorithm. It's failing to consider the properties of the algorithm used in relation to the problem domain.

A classic mistake engineering students make is to try and use familiar equations anywhere that the units work out. As a result, engineering professors hammer in the idea that before using any equation, you have to ask yourself: what are the assumptions underlying this equation, and do those assumptions hold for my specific problem? Similarly, if you're writing software for copiers, you should ask the basic question of whether a particular compression algorithm was appropriate for the particular types of images being compressed. It's incredibly basic.

I can totally see why this error happened. It was the equivalent of the engineering student blithely applying any equation where the units work out. Uncompressed pixels go in, compressed data comes out. Compression algorithms are substitutable... except when they're not.

[+] linohh|12 years ago|reply
JBIG2 compression is in no way a standard compression algorithm, as the standard only describes decompression. The compression depends on the implementation. And this is where incompetence comes back into the game.
[+] micheljansen|12 years ago|reply
Ouch, imagine this happens in a hospital with a prescription or something. It could really have some serious implications.
[+] hga|12 years ago|reply
Indeed, I keep a copy of my lab results for the last N years because they sometimes get lost, once through no real fault of the doctor (http://en.wikipedia.org/wiki/2011_Joplin_tornado).

Grrr, I'm now going to have to view every lab report that's not an original with suspicion, and make sure my doctors aren't making recommendations due to screwed up copies.

Lossy compression is not an acceptable default for a general purpose device.

[+] model-m|12 years ago|reply
If I were a sentient network and wanted to cause panic among the humans, as a prelude to full-blown warfare, this is how I'd start. Let's send all those Xerox copiers to Guantanamo, they are obviously terrorists.
[+] D9u|12 years ago|reply
My first thought was, "I wonder if this has anything to do with copy protections related to anti counterfeiting?"

Not that I have any valid reasons to consider this.

[+] ChuckMcM|12 years ago|reply
Given the challenges of JBIG2 it seems one should be able to construct a 'test' page which, when scanned, will test the algorithm's accuracy.

Once you have that, you can turn it into a sales tool for folks selling Multi-function Printers, such that there are "good" printers and "bad" printers, and then everyone will be forced to pass the test or be labeled a 'bad' printer.
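A sketch of what such a test page could look like (the confusable pairs and layout are illustrative assumptions): seed the page with digit pairs a symbol coder might merge, laid out so any substitution after a scan round trip breaks the pattern.

```python
# Sketch of a JBIG2 "torture test" page: rows of alternating confusable
# glyphs. After printing, scanning, and OCR'ing the result, any merged
# symbol class shows up as a diff against the reference text.

CONFUSABLE_PAIRS = [("6", "8"), ("3", "8"), ("1", "7"), ("0", "o")]

def test_page_text(repeats=5):
    """One row per confusable pair, alternating the two glyphs."""
    return "\n".join((a + b + " ") * repeats for a, b in CONFUSABLE_PAIRS)

def check(scanned, reference):
    """Return character positions where scanned text disagrees with the reference."""
    return [i for i, (s, r) in enumerate(zip(scanned, reference)) if s != r]

ref = test_page_text()
# Simulate a copier that merged the "6" and "8" symbol classes:
scanned = ref.replace("6", "8")
print(check(scanned, ref))  # positions of every mangled character
```

A clean round trip returns an empty list; a copier with this bug fails loudly on the very first row.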

[+] gmac|12 years ago|reply
Wow, how terrifically and fundamentally negligent. Let's hope nobody dies — the potential hazards seem almost endless.
[+] tingletech|12 years ago|reply
Humm, I use one of these to create PDFs of receipts to attach to my expense reports.
[+] noonespecial|12 years ago|reply
That's one hell of an error. It is literally better for these machines never to have existed at all.
[+] tudorconstantin|12 years ago|reply
Now that's a bug I wouldn't like being responsible for
[+] akleen|12 years ago|reply
I don't think the programmer who coded it is to blame. The manager who (very likely) cut the QA needed to find it, in order to save a few bucks, is.
[+] randomfool|12 years ago|reply
This is a massive error - on the order of Intel's FDIV bug.
[+] w_t_payne|12 years ago|reply
Wow. I cannot imagine how much chaos this could cause.