Reliable OCR for Degraded Documents

By Chris

Question: We have documents that have been degraded due to undersampling and JPEG compression. I have tried several OCR engines, none of which is picking up the text reliably. What should I do? We need to search these documents!

Answer: The old maxim “garbage in, garbage out” applies here: the quality of the input documents really does matter. There is a gap between human readability and machine readability. Just because you can read the document does not mean that your OCR program can.

There are several steps that can be taken to improve your OCR results in this kind of situation:

1. Verify that the resolution (dpi) recorded for each image is correct. OCR engines are calibrated based on the dpi value, which is typically stored in the image file header. If this value is incorrect, the OCR results will degrade.

2. Once the dpi is set correctly, upsample to a reasonably high resolution; 300 dpi is typically a good target. The upsampling method matters: use bicubic spline interpolation.

3. OCR engines usually perform better on bitonal documents that are thresholded correctly than on the original color files. Of course, if the threshold is poorly chosen, the OCR engine is better off with the original color or grayscale image file. So if possible, threshold each upsampled image file manually so that the text is most readable.
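The three steps above can be sketched in a few lines of Python. This is only an illustration, assuming the Pillow imaging library is available; the function name prepare_for_ocr, its parameters, and the default threshold of 128 are hypothetical choices made for this example, not part of any OCR product.

```python
from PIL import Image

TARGET_DPI = 300  # the target resolution suggested in step 2


def prepare_for_ocr(img, true_dpi, threshold=128, target_dpi=TARGET_DPI):
    """Return a bitonal copy of `img` upsampled to `target_dpi`.

    true_dpi  -- the resolution you have verified for the scan (step 1);
                 it deliberately overrides whatever the file header claims.
    threshold -- gray level (0-255) chosen by inspection (step 3);
                 128 is only a placeholder default.
    """
    gray = img.convert("L")

    # Step 2: upsample with bicubic interpolation; never downsample.
    scale = max(1.0, target_dpi / true_dpi)
    new_size = (round(gray.width * scale), round(gray.height * scale))
    upsampled = gray.resize(new_size, Image.BICUBIC)

    # Step 3: pixels at or above the threshold become white, the rest black.
    bitonal = upsampled.point(lambda p: 255 if p >= threshold else 0, mode="1")

    # Record the resolution that the resampled pixels now correspond to.
    bitonal.info["dpi"] = (true_dpi * scale, true_dpi * scale)
    return bitonal
```

To check step 1, compare the header value (for formats that store one, Pillow exposes it as img.info.get("dpi") after Image.open) against the physical page size you expect. When saving the result, pass the dpi explicitly, for example save(path, dpi=result.info["dpi"]), so the next tool in the chain reads the corrected value.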

Obviously, in a batch production environment, the manual thresholding in step 3 would not be practical. CVISION’s PdfCompressor OCR engine (http://www.cvisiontech.com/pdf_compressor_31.html) does try to mimic steps 1-3 for degraded image documents.
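For readers scripting their own batch pipeline, one common stand-in for a hand-picked threshold is Otsu's method, which chooses the gray level that best separates dark and light pixels. The sketch below is a plain-Python illustration of that general technique, not a description of how PdfCompressor or any other engine works; the function name otsu_threshold is hypothetical.

```python
def otsu_threshold(histogram):
    """Return the gray level t that best splits a grayscale histogram.

    `histogram[i]` is the number of pixels with gray value i. Pixels with
    value <= t form the dark class and pixels > t the light class; t is
    chosen to maximise the between-class variance (Otsu's method).
    """
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_t, best_variance = 0, -1.0
    weight_low = 0  # number of pixels at or below the candidate level
    sum_low = 0     # sum of the gray values of those pixels

    for t, count in enumerate(histogram):
        weight_low += count
        sum_low += t * count
        if weight_low == 0:
            continue  # no dark pixels yet, nothing to split
        weight_high = total - weight_low
        if weight_high == 0:
            break  # every pixel is already in the dark class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_t, best_variance = t, variance
    return best_t
```

The input is a 256-entry histogram of a grayscale page, and pixels above the returned level would be treated as white background. An automatic threshold can still fail on unevenly lit or heavily compressed pages, so it is worth spot-checking a sample of the output before trusting it across a whole batch.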
