Small Fonts & OCR
Small fonts are one of the main problem areas for reliable OCR and data extraction. When the characters are large enough, almost any threshold will preserve their topology and geometry. As the font size shrinks, it becomes hard to find a threshold that preserves the readability of the text region. It is therefore important to identify small-font regions before thresholding, rather than after. The recommended method for converting a greyscale or color text region to bitonal before OCR depends on the font size.
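As a rough illustration of choosing a method by font size, the sketch below converts a nominal point size and scan resolution into a pixel height (one point is 1/72 inch) and picks a strategy from it. The function names and the small_font_px cutoff are hypothetical and are not taken from this text. In practice the pixel height would more likely be estimated from the image itself, for example from text line or connected component heights.

```python
def font_height_px(point_size: float, dpi: float) -> float:
    """Nominal em height in pixels (1 point = 1/72 inch)."""
    return point_size * dpi / 72.0


def choose_binarization(point_size: float, dpi: float,
                        small_font_px: float = 16.0) -> str:
    """Pick a binarization strategy from the nominal pixel height.

    small_font_px is a hypothetical cutoff chosen for illustration only;
    a real system would tune it against its own scans and OCR engine.
    """
    if font_height_px(point_size, dpi) >= small_font_px:
        return "region_threshold"
    return "upsample_then_threshold"
```

For example, font_height_px(10, 300) is about 41.7 pixels, which is comfortably above the illustrative cutoff.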
Certain font sizes (e.g., 10-point text scanned at 300 dpi) will still support a semi-static region threshold that preserves topology. The geometric properties of the characters should also be preserved to some extent, because the characters must still be recognized by the OCR engine and may also need to support JBIG2-style font matching. So for certain font sizes and text region backgrounds, a semi-static threshold is sufficient. If the background is heavily textured, some preprocessing before thresholding may be necessary even when the font is reasonably large.
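The text does not define "semi-static" precisely. The sketch below assumes it means a single threshold computed once per text region and applied to every pixel in that region, and it uses Otsu's method as a stand-in for however that threshold is actually chosen. The function names and the 0 = ink, 1 = background convention are assumptions made for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0..255) that maximizes between-class
    variance when pixels <= t form one class and pixels > t the other."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below t
    sum0 = 0    # sum of grey levels at or below t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize_region(region):
    """Binarize one text region with a single region-wide threshold.

    region is a list of rows of integer grey levels (0..255).
    Returns rows of 0 (ink) and 1 (background).
    """
    t = otsu_threshold(p for row in region for p in row)
    return [[0 if p <= t else 1 for p in row] for row in region]
```

A single region-wide threshold like this is only appropriate when the background is fairly uniform; a textured background would need the preprocessing mentioned above first.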
For very small fonts, where the scan resolution falls below the Nyquist rate needed to resolve the character strokes, standard thresholding is not adequate. These fonts must be handled very carefully or readability will be lost. Small-font text regions can either be passed directly to the OCR engine as greyscale or upsampled to a higher resolution and then thresholded. It may be preferable to sharpen the greyscale text region before thresholding.
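The upsample-then-threshold path can be sketched as follows. This is a minimal pure-Python illustration, not a production pipeline. Bilinear interpolation, a 3x3 unsharp mask, the fixed threshold of 128, and sharpening before upsampling are all assumptions chosen for simplicity, and every function name here is hypothetical.

```python
def upsample_bilinear(img, factor):
    """Upsample a greyscale image (list of rows, values 0..255)
    by an integer factor using bilinear interpolation."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h * factor):
        # Map the output pixel centre back into source coordinates.
        sy = min(max((y + 0.5) / factor - 0.5, 0.0), h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(w * factor):
            sx = min(max((x + 0.5) / factor - 0.5, 0.0), w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bot * fy)
        out.append(row)
    return out


def sharpen(img, amount=1.0):
    """Unsharp mask using a 3x3 box blur; border pixels are replicated."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += img[yy][xx]
            blur = acc / 9.0
            val = img[y][x] + amount * (img[y][x] - blur)
            row.append(min(max(val, 0.0), 255.0))  # clamp to 0..255
        out.append(row)
    return out


def binarize_small_font(img, factor=2, amount=1.0, threshold=128):
    """Sharpen, upsample, then threshold: 0 = ink, 1 = background."""
    up = upsample_bilinear(sharpen(img, amount), factor)
    return [[0 if p < threshold else 1 for p in row] for row in up]
```

Thresholding in the upsampled space lets a one-pixel stroke become a stroke several pixels wide with interpolated edges, which gives the threshold more room to preserve thin features than it has at the original resolution.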





