Texture Patterns and Small Fonts OCR
How Texture Patterns Relate to OCR
Many OCR and image-processing methods make assumptions about the image background, most often that it is constant. A texture can be defined as a tessellated, approximately repeating pattern in an image. The texture may be real (e.g., wood paneling) or synthetic (e.g., the screening pattern a color printer uses to represent a constant-color background). For screened backgrounds, it is sometimes beneficial to descreen the image before thresholding or further image processing.
Understanding textured regions can be very complex, but it is sometimes necessary for proper separation of foreground and background.
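As an illustration of descreening, the sketch below applies a plain box (mean) filter whose window matches the period of the screen, which averages the screening pattern back toward the constant tone it represents. This is only one simple low-pass approach, not a specific production descreener. The function name `descreen_box`, the assumption that the screen period is already known, and the choice to return only the fully covered ("valid") region are all illustrative.

```python
def descreen_box(gray, period):
    """Average every period x period window of a greyscale image.

    gray is a list of rows of pixel values. Only windows that fit entirely
    inside the image are evaluated, so the result is smaller than the input
    by (period - 1) in each dimension.
    """
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height - period + 1):
        row = []
        for x in range(width - period + 1):
            total = 0
            for dy in range(period):
                for dx in range(period):
                    total += gray[y + dy][x + dx]
            row.append(total / (period * period))
        out.append(row)
    return out


# Synthetic screen: a 0/255 checkerboard standing in for a mid-grey background.
screen = [[255 * ((x + y) % 2) for x in range(6)] for y in range(6)]
flat = descreen_box(screen, period=2)
```

On this synthetic input every window contains two dark and two light pixels, so the filtered result is a flat mid-grey. On real scans the screen period and angle would have to be estimated first, and a box filter will also soften text edges, which is why descreening is usually restricted to background regions.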
Solving for texture patterns is very helpful in segmentation and in MRC (Mixed Raster Content) coding. Effective compression of scanned documents and reliable OCR output both require accurate background/foreground discrimination. When the foreground is lifted during segmentation or MRC coding, the original background pattern, or some facsimile of it, must be reconstituted. There are several ways to do this. One is to build a Markov model of the original texture and use this statistical model to regenerate the background regions that need to be filled in. Alternatively, one can find a tessellation element and replace the lifted text region with a “pure” background region.
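The tessellation-element approach can be sketched as follows, under strong simplifying assumptions: the background repeats exactly with a known period aligned to the pixel grid, and a mask marking the lifted foreground is already available. The names `estimate_tile` and `fill_lifted_region` are hypothetical, and the Markov-model alternative is not shown.

```python
def estimate_tile(image, mask, period_h, period_w):
    """Estimate one tessellation element from the unmasked (background) pixels.

    Each tile entry is the mean of all background pixels that share the same
    phase, i.e. the same (y % period_h, x % period_w).
    """
    sums = [[0.0] * period_w for _ in range(period_h)]
    counts = [[0] * period_w for _ in range(period_h)]
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if not mask[y][x]:
                sums[y % period_h][x % period_w] += value
                counts[y % period_h][x % period_w] += 1
    tile = []
    for i in range(period_h):
        tile_row = []
        for j in range(period_w):
            if counts[i][j] == 0:
                raise ValueError("no background sample for tile phase (%d, %d)" % (i, j))
            tile_row.append(sums[i][j] / counts[i][j])
        tile.append(tile_row)
    return tile


def fill_lifted_region(image, mask, period_h, period_w):
    """Replace masked (lifted foreground) pixels with the matching tile value."""
    tile = estimate_tile(image, mask, period_h, period_w)
    return [
        [tile[y % period_h][x % period_w] if mask[y][x] else value
         for x, value in enumerate(row)]
        for y, row in enumerate(image)
    ]


# Demo: a 2 x 2 repeating background with a block of "text" (value 0) on top.
pattern = [[10, 200], [200, 10]]
clean = [[pattern[y % 2][x % 2] for x in range(6)] for y in range(6)]
mask = [[2 <= y <= 3 and 1 <= x <= 4 for x in range(6)] for y in range(6)]
scanned = [[0 if mask[y][x] else clean[y][x] for x in range(6)] for y in range(6)]
restored = fill_lifted_region(scanned, mask, 2, 2)
```

Averaging per phase, rather than copying a single tile from one location, makes the estimate less sensitive to noise in any one repetition. Real textures repeat only approximately, so a practical implementation would also need to estimate the period and tolerate drift.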
Small Fonts & OCR
One of the problem areas in reliable OCR and data extraction is small fonts. When the fonts are large enough, almost any threshold will preserve the topology and geometry of the underlying characters. As the font size gets small, it becomes hard to find a threshold that preserves the readability of the text regions. It is important to identify small-font regions before thresholding, rather than after. Depending on the font size, different methods are recommended for converting the greyscale or color text region to bitonal before running OCR on the region.
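To make "font size" concrete in pixel terms, the sketch below converts a point size and scan resolution into an em height in pixels, using the fact that one typographic point is 1/72 inch, and then picks a binarization strategy. The function names, the strategy labels, and in particular the 20-pixel cutoff are hypothetical placeholders chosen for illustration; the text does not give a numeric cutoff.

```python
POINTS_PER_INCH = 72.0


def em_height_px(point_size, dpi):
    """Nominal em height in pixels for a font of point_size scanned at dpi."""
    return point_size / POINTS_PER_INCH * dpi


def choose_binarization(point_size, dpi, small_cutoff_px=20.0):
    """Pick a strategy label from the em height.

    small_cutoff_px is an assumed, tunable value, not a published constant.
    """
    if em_height_px(point_size, dpi) < small_cutoff_px:
        return "upsample_or_direct_ocr"
    return "semi_static_threshold"


# 10-point text at 300 dpi is roughly 41.7 pixels tall.
ten_point = em_height_px(10, 300)
strategy_large = choose_binarization(10, 300)
strategy_small = choose_binarization(4, 200)
```

In practice the font size of a scanned region is not known in advance and would be estimated from the image, for example from connected-component or text-line heights, before a rule like this could be applied.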
Certain font sizes (e.g., 10-point text scanned at 300 dpi) will still support a semi-static region threshold and preserve topology. It is also important that, to some extent, the geometric properties of the characters be preserved, because the characters must still be recognized by OCR and may also need to support JBIG2-style font matching. So for certain font sizes and text region backgrounds, a semi-static threshold will do. If the background has a lot of texture, some preprocessing before thresholding may be necessary even when the font size is reasonably large. For very small fonts, whose detail falls below the Nyquist sampling rate of the scan, standard thresholding is not adequate. These fonts must be handled very carefully or readability will be lost. Such small-font text regions can either be given directly to the OCR engine as greyscale or upsampled to a higher resolution and then thresholded. It may be preferable to sharpen the greyscale text region before thresholding.
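The upsample-then-threshold option can be sketched as below. Bilinear interpolation, the upsampling factor of 2, and the fixed threshold level of 128 are all assumptions made for illustration; the text does not prescribe an interpolation method or threshold, and the optional sharpening step is omitted.

```python
def upsample_bilinear(gray, factor):
    """Upsample a greyscale image (list of rows) by an integer factor."""
    height, width = len(gray), len(gray[0])
    out = []
    for oy in range(height * factor):
        # Map the output pixel centre back into source coordinates.
        sy = min(max((oy + 0.5) / factor - 0.5, 0.0), height - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, height - 1)
        fy = sy - y0
        row = []
        for ox in range(width * factor):
            sx = min(max((ox + 0.5) / factor - 0.5, 0.0), width - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, width - 1)
            fx = sx - x0
            top = gray[y0][x0] * (1 - fx) + gray[y0][x1] * fx
            bottom = gray[y1][x0] * (1 - fx) + gray[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def threshold(gray, level):
    """Return a bitonal image: 1 for dark (text) pixels, 0 for background."""
    return [[1 if value < level else 0 for value in row] for row in gray]


# A one-pixel-wide dark vertical stroke on a white background.
stroke = [[255, 0, 255] for _ in range(3)]
binary = threshold(upsample_bilinear(stroke, 2), 128)
```

Thresholding in the upsampled space lets the bitonal boundary fall between the original pixel positions, so thin strokes are less likely to break or merge than when the threshold is applied at the scan resolution.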