The best way to verify the accuracy of a JBIG2 implementation is to run it on a set of files and visually inspect the results. Wherever the JBIG2-compressed files differ from the original images, the difference should ideally be an improvement in image quality; at a minimum, you should insist that compression introduces no degradation. While there is no substitute for looking at the compressed images and judging whether they appear acceptable, doing so is time-consuming. As a result, many people are interested in verification methods that can be automated.
For this reason, we recommend using an OCR engine to compare the quality of the image before and after compression. The words in the text file that OCR produces from each image can be programmatically looked up in a dictionary to see whether they are valid. This yields an easily measurable score for each document. A good JBIG2 implementation should produce a compressed file that scores about as well as, if not better than, the original image.
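The dictionary lookup described above can be sketched in a few lines of Python. This is only an illustration, not the validation tool referred to in this paper: the function name `dictionary_score`, the tiny word list, and the sample strings are all hypothetical, and the OCR step itself is assumed to have already produced plain text.

```python
import re

def dictionary_score(ocr_text, dictionary):
    """Count the words in ocr_text that appear in dictionary (case-insensitive)."""
    # Treat runs of ASCII letters as words; a real tool may tokenize differently.
    words = re.findall(r"[A-Za-z]+", ocr_text)
    return sum(1 for word in words if word.lower() in dictionary)

# Hypothetical word list; a real run would load a full dictionary file.
DICTIONARY = {"the", "quick", "brown", "fox", "jumps"}

# Hypothetical OCR output for the same page before and after compression.
original_text = "The quick brown fox jumps"
compressed_text = "The quick brovvn fox jumps"  # simulated character mismatch

original_score = dictionary_score(original_text, DICTIONARY)
compressed_score = dictionary_score(compressed_text, DICTIONARY)
print(original_score, compressed_score)
```

A lower score for the compressed text than for the original, as in this simulated case, is the kind of signal the comparison is meant to surface.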
For example, when measured by an OCR Validation tool, the original document shown in the previous section had a score of 469 (i.e. 469 words in the OCRed text file had a match in a standard dictionary) while the CVISION compressed file had a score of 472. On the other hand, the badly mismatched document produced by another vendor had an OCR score of 449.
No OCR engine is 100% accurate. They all miss an occasional word that is clear to a human reader. Since there is an element of chance in even the best OCR engines, you can’t be certain that a JBIG2 implementation degrades quality by testing it against the original on just a few files. Over a large set of files, however, the OCR comparison can be a very good measure of image quality. Within a small margin of error, you want the JBIG2 files to have OCR recognition rates about the same as or even better than those of the original files.
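One way to express the "small margin of error" check is to total the scores across all files and compare the relative change against a tolerance. The sketch below is a hedged illustration: the function name `within_margin` and the 1% default tolerance are assumptions, not values taken from this paper. The sample numbers are the single-document scores quoted earlier, used only to show the arithmetic.

```python
def within_margin(original_scores, compressed_scores, tolerance=0.01):
    """Return True if total OCR hits after compression dropped by no more
    than `tolerance` (as a fraction of the original total)."""
    total_original = sum(original_scores)
    total_compressed = sum(compressed_scores)
    if total_original == 0:
        # Nothing was recognized in the originals, so nothing can be lost.
        return True
    relative_change = (total_compressed - total_original) / total_original
    return relative_change >= -tolerance

# Scores quoted earlier: original 469, one compressed file 472, another 449.
print(within_margin([469], [472]))  # compressed file scored slightly higher
print(within_margin([469], [449]))  # about a 4.3% drop, beyond the assumed 1%
```

In practice the two lists would hold one score per file across a large collection, so that chance OCR misses on individual pages average out.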
Shown below is a chart for a 186-page file in which the version compressed with one JBIG2 encoder (CVISION PdfCompressor) had a few more OCR hits than the original. This is a good indication that the compression preserves image quality for this type of document, i.e., that there is no OCR-based information loss.