OCR & Novel Fonts, Multidirectional and Undersampled Text
OCR & Novel Fonts
In classical OCR, recognition systems were trained on a very specific set of fonts. If those fonts varied in any material way, recognition rates fell accordingly. Today’s systems are much more robust and can handle the myriad novel fonts used in publishing and available on the Web. What becomes more relevant for modern OCR systems is adaptability. If the shapes of characters in a new font are largely unpredictable, what can be relied upon? It would be nice if, at the very least, topological properties, e.g., the Euler number, were preserved. But often even this property is not invariant, either because novel fonts modify basic character topology or because scanning noise introduces or eliminates holes.
As a result, “shape-free” OCR has become more prevalent in recent OCR technology. These algorithms seek to find the appropriate mapping between learned font symbols and the symbol alphabet, relying heavily on order statistics. Among the methods used are numbered strings that exploit word structure to limit, or uniquely identify, the correct mapping. Obviously, the longer the document being analyzed, the more reliable its statistics (such as k-tuples) will be.
It would seem that shape-only OCR systems are somewhat limited in applicability. Such systems attempt to solve the OCR puzzle strictly from the shape of a component image. This approach can also be called context-free, since no neighboring context is required to solve for the correct ASCII mapping. Similarly, OCR methods that are highly statistical can be thought of as context-sensitive: these methods first compute order statistics, or k-tuples, and only then infer the ASCII mapping. A combination of context-free and context-sensitive methods, incorporating the geometric and topological properties of each component in conjunction with shape-free statistical methods, is most likely to yield accurate OCR results.
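The context-sensitive idea above can be sketched in a few lines of Python. This is an illustrative toy, not any vendor's algorithm: it assumes a shape-free front end has already grouped identical glyphs into cluster IDs, and it scores candidate cluster-to-letter mappings by how common their bigrams (2-tuples) are in a hypothetical reference corpus.

```python
from collections import Counter

# Hypothetical corpus used to estimate order statistics (bigram counts).
CORPUS = "the quick brown fox jumps over the lazy dog the end the best"

def bigram_counts(text):
    """Count adjacent-letter pairs (2-tuples) in the text."""
    return Counter(zip(text, text[1:]))

def score(decoding, stats):
    """Score a candidate decoding by how common its bigrams are."""
    return sum(stats.get(pair, 0) for pair in zip(decoding, decoding[1:]))

# A shape-free front end clusters identical glyphs, so an unknown word
# might arrive as the cluster-ID pattern [0, 1, 2, 0, 1, 2].  Candidate
# mappings of clusters to letters are then ranked statistically.
stats = bigram_counts(CORPUS.replace(" ", ""))
candidates = {
    "thethe": score("thethe", stats),  # mapping A: 0->t, 1->h, 2->e
    "fogfog": score("fogfog", stats),  # mapping B: 0->f, 1->o, 2->g
}
best = max(candidates, key=candidates.get)
```

On this tiny corpus the frequent bigrams "th" and "he" make mapping A win; a real system would use much richer statistics (k-tuples over long documents) and combine them with per-component geometric and topological evidence.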
Locating Multidirectional Text with OCR
Multidirectional text is one area where current commercial OCR systems fail. In most
commercial systems, a dominant text direction is found. Text is then OCRed along this
dominant direction, but not along any subdominant directions. If text on the same page
occurs in multiple directions, such as horizontal and vertical text occurring at the same
time, it is generally recognized only in the predominant direction. Many documents, e.g.,
patent litigation, have important text running in multiple directions. The ability to detect
these text regions, in any direction, and OCR them is a good litmus test for any commercial
OCR system. CVISION OCR (PdfCompressor) passes this test; most commercial systems do not.
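One simple way to detect a text block's direction, sketched below under assumed inputs (a binarized page block as a NumPy array), is to compare projection profiles: horizontal lines of text create strong variation across row sums (ink-heavy rows alternating with blank gaps), while vertical text does the same for column sums. This is an illustrative heuristic, not CVISION's actual method.

```python
import numpy as np

def text_direction(bitmap):
    """Guess the reading direction of a binary text block.

    A full system would segment the page into blocks, run this per
    block, and then OCR each block along its own direction (rotating
    vertical blocks 90 degrees before recognition), so that
    subdominant directions are not lost.
    """
    rows = bitmap.sum(axis=1).astype(float)  # ink per row
    cols = bitmap.sum(axis=0).astype(float)  # ink per column
    return "horizontal" if rows.var() > cols.var() else "vertical"

# Synthetic example: three horizontal "text lines" of ink on a page.
page = np.zeros((30, 30), dtype=np.uint8)
page[2:6, :] = 1
page[12:16, :] = 1
page[22:26, :] = 1
```

Transposing the same bitmap (simulating vertical text) flips the answer, which is exactly the per-block decision a multidirectional OCR pipeline needs.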
OCR & Undersampled Text
Undersampled text is a serious problem for OCR engines. Although people have little trouble reading typical undersampled text, machines do: most OCR engines do NOT handle undersampled text well, and this is currently an area of considerable disparity between human and machine recognition rates.
If there is some control over the document capture environment, it is highly advisable to scan at 300 dpi. With the new compression formats available (JBIG2, JBIG2 PDF, MRC-coded PDF, JPEG2000), there is very little reason not to scan at higher resolution. TIFF G4 file size increases linearly with scanning resolution, so a 300 dpi scan is about 2x the size of a 150 dpi scan. With JBIG2-encoded perceptually lossless PDF, however, a 300 dpi scan is actually smaller than a 150 dpi scan, because the font library is minimal, with no topologically false connections or disconnections.
In any event, if there is control over the document capture process, current OCR methods being what they are, a higher dpi (e.g., 300 dpi) is strongly recommended for attaining the most accurate OCR results. The file size will not go up (using JBIG2), and recognition rates will be significantly better than at lower scanning resolutions.
If documents have already been scanned at low resolution, proper post-processing of these files is necessary to achieve good recognition results. First, if the files are currently in greyscale or color formats (e.g., JPEG), we strongly suggest not thresholding them prior to OCR. Rather, these documents should be upsampled to 300 dpi, preferably using bicubic splines, and the upsampled greyscale or color images then presented directly to the OCR engine.
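The upsample-before-OCR step can be sketched with Pillow, using bicubic interpolation as a reasonable stand-in for the bicubic splines suggested above. The function name and the 150 dpi source resolution are assumptions for illustration; the point is that the image stays greyscale (unthresholded) until the OCR engine sees it.

```python
from PIL import Image

def upsample_for_ocr(img, src_dpi, target_dpi=300):
    """Upsample a low-resolution greyscale scan before OCR.

    Bicubic interpolation smooths character edges, so the OCR
    engine's own binarization sees cleaner strokes than it would
    get from a bitmap thresholded at the original low resolution.
    """
    scale = target_dpi / src_dpi
    new_size = (round(img.width * scale), round(img.height * scale))
    return img.resize(new_size, resample=Image.BICUBIC)

# A blank stand-in for an 8.5x11 inch page scanned at 150 dpi;
# it is doubled to 300 dpi and handed, still greyscale, to OCR.
scan = Image.new("L", (1275, 1650), color=255)
upsampled = upsample_for_ocr(scan, src_dpi=150)
```

Thresholding, if needed at all, is then left to the OCR engine at the higher resolution rather than baked into the low-resolution file.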