This class of error is called (by me, at least) a "contoot" because, long ago, when I was writing the JBIG2 compressor for Google Books PDFs, the first example was on the contents page of a book. The title, "Contents", was set in very heavy type, which happened to be an unexpected edge case for the classifier: it matched the "o" with the "e" and "n" and output "Contoots".
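At its core, a JBIG2 symbol coder declares two glyph bitmaps "the same symbol" when their pixelwise difference falls under some threshold. A toy sketch of that failure mode (hypothetical code, not Google's or Xerox's actual classifier; the bitmaps and threshold are invented):

```python
# Two glyph bitmaps are treated as one symbol when few pixels differ.
# With a threshold that's too loose for the font at hand, distinct
# letters collapse into one symbol: the "contoot" failure mode.

def glyph_distance(a, b):
    """Fraction of pixels that differ between two equal-sized 0/1 bitmaps."""
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / (len(a) * len(a[0]))

def same_symbol(a, b, threshold=0.15):
    # Illustrative threshold; real encoders use more elaborate criteria.
    return glyph_distance(a, b) <= threshold

# Toy 5x4 bitmaps: in very heavy type, "o" and a filled-in "e" can
# differ in only a few pixels.
o = [[0,1,1,0],
     [1,0,0,1],
     [1,0,0,1],
     [1,0,0,1],
     [0,1,1,0]]
e = [[0,1,1,0],
     [1,0,0,1],
     [1,1,1,1],
     [1,0,0,0],
     [0,1,1,0]]
```

Here `glyph_distance(o, e)` is 3/20 = 0.15, so with the loose threshold the two glyphs are classified as the same symbol; tightening it to 0.1 keeps them apart.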
The classifier was adjusted and these errors mostly went away. It certainly seems that Xerox have configured things incorrectly here.
Also, with Google Books, we held the hi-res original images. It's not like the PDF downloads were copies of record. We could also tweak the classification and regenerate all the PDFs from the originals.
For a scanner, I don't think that symbol compression should be used at all for this reason. For a single page, JBIG2 generic region encoding is generally just as good as symbol compression.
More than you want to know about this topic can be found here: https://www.imperialviolet.org/binary/google-books-pdf.pdf
How would one handle the case with the tiny boxes? It seems to me that these ought to be treated more like line drawings, and not unified as symbols at all, if you can't properly decompose them into lines of Latin-alphabet glyphs. JBIG2, of course, cleverly doesn't tell you how to do the "smart" segmentation...
> The title, "Contents", was set in very heavy type which happened to be an unexpected edge case in the classifier and it matched the "o" with the "e" and "n" and output "Contoots".
Wouldn't it be a good idea to perform OCR - using a language model, the works - before you start classifying the JBIG2 symbols? That way, you'd have additional contextual information to say "Aha, 'contoots' is probably not what it reads here" at least in some of the cases.
Although, I realize that on "Google scale", such a complex solution could be a problem.
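As a sketch of that idea (a toy wordlist standing in for a real language model; the names and wordlist are purely illustrative):

```python
# After symbol classification, OCR the reconstructed text and flag words
# that the symbol merging turned into non-words; those pages are
# candidates for re-encoding with a stricter matcher.
WORDS = {"contents", "invoice", "total"}  # stand-in for a real dictionary/LM

def suspicious_merges(ocr_words):
    """Return the words that don't look like real words after merging."""
    return [w for w in ocr_words if w.lower() not in WORDS]

# suspicious_merges(["Contents"]) -> []
# suspicious_merges(["Contoots"]) -> ["Contoots"]
```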
This was predictable. JBIG2 is in no way safe for document processing, archiving, or anything of the sort. The image is sliced into small areas and a probabilistic matcher finds other areas that are similar. This way similar areas only have to be stored once.
Yeah right, you get it, don't you? They are similar, not equal. Whenever there's a probability less than 1, there's a complementary event with a probability larger than 0.
I wonder which prize idiot had the idea of using this algorithm in a copier. JBIG2 can only be used where mistakes won't mean the world is going to end. A photocopier is expected to copy. If the machines were used for digital document archiving, some companies will face a lot of trouble when the next tax audit is due.
Digital archives using this kind of lossy compression are not only worthless, they are dangerous. As the paper trail is usually shredded after successful redundant storage of the images, there will be no way of determining correctness of archived data.
This will make lawsuits a lot of fun in the future.
Thinking about how often I use scan-to-PDF-and-email with important documents, this article gives me the shivers. This is an epic fuck-up. Nothing less than grossly negligent.
Given the way the algorithm works, it would seem to me that "fine print" would be the most vulnerable to the bug (well not really a bug, it's the behavior of JBIG2). I wonder if there will be a clear dividing line, e.g. "smaller than 10pt type is subject to reasonable doubt if a Xerox copier was used"
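The arithmetic behind that worry: an n-point glyph is roughly n/72 inch tall, so at a scan resolution of r DPI the matcher only gets about n/72 × r pixels of glyph height to tell similar characters apart:

```python
# Rough numbers behind the "fine print" vulnerability. Small type leaves
# the symbol matcher very few pixels in which similar glyphs can differ.
def glyph_px(points: float, dpi: float) -> float:
    """Approximate glyph height in pixels: 1 point = 1/72 inch."""
    return points / 72 * dpi

# At 200 DPI: 10pt type is ~28 px tall, 6pt type only ~17 px.
```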
"The image is sliced into small areas and a probabilistic matcher finds other areas that are similar."
"Whenever there's a probability less than 1, there's a complementary event with a probability larger than 0."
If that alone is the reason why JBIG2 is in no way safe for document processing or archiving, then I've got some bad news for you: by that standard, you really shouldn't be using a computer for, well, anything.
Truly surprising. I would never have imagined this to be in the domain of possible problems one would expect to encounter scanning or photocopying a document.
It is like taking a picture of my wife with a digital camera and her face being replaced with that of some other person.
I can imagine someone turning the technique into a novel form of image compression, maybe for surveillance databases or something.
With personal video recording (a la Google Glass and friends) it won't be long before we're subjected to this sort of thing. It's amazing how close we're getting to Ghost in the Shell, and I'm sure it won't be long before live video feeds can be hacked in real time to show something contrary to what's actually happening.
There is virtually no reason whatsoever for this problem to exist. This is the domain of "making a problem more risky and complicated than it needs to be" and royally screwing people in the process.
Might as well throw the paperwork in a bin and set fire to it.
Anyone got a reasonable reason for doing this?
In the good old days of analog copiers this would have been impossible: the scanner sent the light through a system of mirrors to the drum, the drum was statically charged, the toner was pulled onto the charged parts and transferred to the transfer belt, where the paper, carrying the opposite charge, pulled the toner off the belt; the paper then went through the fusing unit, where the toner was 'burned' onto it. End of story.
On a modern copier, the scanner transfers the data first to RAM and then usually to a hard disk (most people do not even know that the "copy machine" has one and saves the scanned material to it). From that hard disk, the data is then written to the drum via laser.
Tadaaa: there's your reason for data being compressed on a modern copier.
Others have pointed out a credible explanation: to have the document take less space on their hard disk.
However, it does not have to be compression, per se. Modern copiers want to correct all kinds of errors such as creases and staples. They also want to optimize the colors. To do that, they have logic for detecting what areas of the page are full-color and which are black and white, which are half-tone printed, which are text, line art, photograph, whether the paper might have aged, etc.
I don't know what tricks they use, but I do not rule out that they will replace 'looks somewhat dirty' patches with an 'obviously higher quality version' of them, and use too aggressive parameters in some of those heuristics.
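A purely illustrative sketch of that kind of page-segmentation heuristic (the thresholds are invented; real copiers use far more elaborate logic):

```python
# Classify fixed-size tiles of a page by ink coverage and local variance:
# text and line art have hard black/white transitions (high variance),
# photos and halftones vary smoothly, blank areas have almost no ink.
def classify_tile(tile):
    """tile: 2D list of grayscale values in [0, 255]."""
    flat = [p for row in tile for p in row]
    mean = sum(flat) / len(flat)
    var = sum((p - mean) ** 2 for p in flat) / len(flat)
    dark = sum(p < 128 for p in flat) / len(flat)
    if dark < 0.02:
        return "blank"
    if var > 5000:
        return "text-or-line-art"   # hard black/white transitions
    return "photo-or-halftone"      # smoother tonal variation
```

A misjudged threshold at this stage is exactly how a "looks somewhat dirty" patch could get routed into the wrong cleanup or compression path.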
If you're scanning a long document to a PDF, compression makes a lot of sense. It's the difference between being able to email the PDF as an attachment and having to find a place to put the file online.
So it sounds like there's one code path and it's seriously broken. I looked at the first settings page, and while it's in German I can see it's 200 DPI. There's no excuse for defaulting to lossy compression when you're at 200 DPI on office-sized paper. We didn't do that in 1991: we got CCITT Group 4 lossless compression of around 50KB per image (plus or, more often, minus) for 8.5x11 inch paper, although we did do things like noise reduction and straightening documents (which makes them compress better, among other things).
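The ~50KB figure checks out on a back-of-the-envelope basis, assuming a typical (not guaranteed) ~10:1 Group 4 ratio on clean text pages:

```python
# Rough size check for lossless CCITT Group 4 on office paper.
dpi = 200
width_in, height_in = 8.5, 11
pixels = (width_in * dpi) * (height_in * dpi)   # 1700 x 2200 = 3.74M px
raw_bytes = pixels / 8                          # bilevel: 1 bit per pixel
typical_ratio = 10                              # assumed typical G4 ratio

print(raw_bytes / 1024)                   # ~456 KiB uncompressed
print(raw_bytes / typical_ratio / 1024)   # ~46 KiB, in line with ~50KB/page
```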
Geeze. This could result in some catastrophic errors. An order for 900 servers instead of 200. $7M loss instead of $1M in your quarterly earnings. Pricing your product at $3 instead of $8. Makes you realize you need some redundancy and double-checks for important communications.
I don't think it's necessarily an issue of inexcusable incompetence: it seems like one of those faults which is obvious in retrospect but very difficult to predict. Why shouldn't Xerox use a standard compression algorithm in their scanner/copiers? That would seem to be a safer choice than writing a lossy compression algorithm from scratch. QA testing probably was on the order of 'picture looks right'; after all, why bother testing that the semantics of the copied content match the original when what you're building is a bitmap duplicator? (Of course, the OCR stuff would be tested more rigorously, but this explicitly bypasses that piece). It's not hard to see the chain of individually reasonable decisions that could lead to something like this.
The real failure is probably something more cultural: there was nobody with the discipline, experience, and power to write an engineering policy prohibiting the use of lossy compression in duplication equipment. I have no idea about Xerox's corporate history, but the evisceration of engineering departments in US giants and the concomitant decline in what one might call 'standards' or 'rigor' is an established concept.
> Why shouldn't Xerox use a standard compression algorithm in their scanner/copiers?
I had never heard of JBIG2. I have implemented JPEG2000 codecs and arithmetic-coding compression from scratch, and I had still never heard of JBIG2. And here they are using it, and others are claiming it's just a standard, run-of-the-mill thing.
> That would seem to be a safer choice than writing a lossy compression algorithm from scratch.
Going out on a limb here, wouldn't the safest be to just not use a lossy codec at all or use something like JPEG?
> QA testing probably was on the order of 'picture looks right';
Sorry. This is the company whose name became the verb "to copy". If plugging in an obscure codec from somewhere and checking that one picture looks "OK" is their idea of QA, then they deserve all the ridicule and lawsuits stemming from this.
> Why shouldn't Xerox use a standard compression algorithm in their scanner/copiers?
The problem isn't using a standard compression algorithm. It's failing to consider the properties of the algorithm used in relation to the problem domain.
A classic mistake engineering students make is to try and use familiar equations anywhere that the units work out. As a result, engineering professors hammer in the idea that before using any equation, you have to ask yourself: what are the assumptions underlying this equation, and do those assumptions hold for my specific problem? Similarly, if you're writing software for copiers, you should ask the basic question of whether a particular compression algorithm was appropriate for the particular types of images being compressed. It's incredibly basic.
I can totally see why this error happened. It was the equivalent of the engineering student blithely applying any equation where the units work out. Uncompressed pixels go in, compressed data comes out. Compression algorithms are substitutable... except when they're not.
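The "check your assumptions first" step, sketched as code with illustrative names (this is not any real product's API): pick the codec from the image's properties and the use case, rather than treating codecs as interchangeable black boxes.

```python
# Codec selection that encodes its assumptions instead of ignoring them.
def choose_codec(bilevel: bool, is_text: bool, archival: bool) -> str:
    if not bilevel:
        return "JPEG"           # continuous-tone photos: lossy is acceptable
    if archival or is_text:
        return "CCITT-G4"       # lossless: exact glyphs matter
    return "JBIG2-generic"      # per-pixel bilevel coding, no symbol dictionary
```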
JBIG2 compression is in no way a standard compression algorithm, because the standard only describes decompression; the compression side is entirely up to the implementation. And this is where incompetence comes back into the game.
Grrr, I'm now going to have to view every lab report that's not an original with suspicion, and make sure my doctors aren't making recommendations due to screwed up copies.
Lossy compression is not an acceptable default for a general purpose device.
If I were a sentient network and wanted to cause panic among the humans, as a prelude to full-blown warfare, this is how I'd start. Let's send all those Xerox copiers to Guantanamo, they are obviously terrorists.
Edit: the last section of the article now sketches what the reasons for the issue may be, on the basis of several emails I got.
Given the challenges of JBIG2 it seems one should be able to construct a 'test' page which, when scanned, will test the algorithm's accuracy.
Once you have that, you can turn it into a sales tool for folks selling multi-function printers, such that there are "good" printers and "bad" printers, and then everyone will be forced to pass the test or be labeled a "bad" printer.
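A hypothetical sketch of such a test page: rows of easily confused glyph groups repeated at decreasing point sizes. Print it, copy it, OCR the copy, and diff against the ground truth to catch symbol substitution. The glyph groups and sizes below are illustrative guesses, not a validated test suite.

```python
# Generate the text content of a JBIG2 "torture test" page.
CONFUSABLE = ["6 8", "e o c", "1 l I", "5 S", "0 O D"]  # illustrative pairs
SIZES_PT = [12, 10, 8, 6]                               # down into fine print

def test_page_lines():
    """One line per (size, glyph group), each group repeated for coverage."""
    lines = []
    for size in SIZES_PT:
        for group in CONFUSABLE:
            row = " ".join([group] * 5)
            lines.append(f"{size}pt: {row}")
    return lines
```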
Just an update: the author states on Twitter that he already had notified Xerox a week ago [1]. Apparently, Xerox has only now contacted him because they thought it was a joke [2] ...
[1] https://twitter.com/davidkriesel/status/364345036407709697
[2] https://twitter.com/davidkriesel/status/364329334300880896
"Digital Photocopiers Loaded With Secrets" http://www.youtube.com/watch?v=Wa0akU8bsOQ