OCR & Undersampled Text
Undersampled text is a serious problem for OCR engines. People read typical undersampled text without difficulty, but most OCR engines do not handle it well, and this remains an area with a considerable gap between human and machine recognition rates.
If there is some control over the document capture environment, it is highly advisable to scan at 300 dpi. With the newer compression formats available (JBIG2, JBIG2 PDF, MRC-coded PDF, JPEG2000), there is very little reason not to scan at a higher resolution. The size of a TIFF G4 file increases roughly linearly with the scanning resolution, so a 300 dpi scan is about 2x the size of a 150 dpi scan. With JBIG2-encoded perceptually lossless PDF, however, a 300 dpi scan is actually smaller than a 150 dpi scan. This is because at 300 dpi the font library is minimal, with no topologically false connections or disconnections.
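To make the size arithmetic concrete, the following is a minimal sketch, not a measurement of any real encoder. It computes the uncompressed 1-bit size of a page at two resolutions and compares that ratio with the linear rule of thumb for TIFF G4 quoted above. The function names and the US Letter page size (8.5 x 11 inches) are assumptions chosen for illustration.

```python
def raw_bilevel_bytes(width_in, height_in, dpi):
    """Uncompressed size in bytes of a 1-bit-per-pixel page scanned at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px + 7) // 8  # each scan line is padded to a whole byte
    return row_bytes * height_px


def linear_size_ratio(low_dpi, high_dpi):
    """Size ratio under the rule of thumb that file size grows linearly with dpi."""
    return high_dpi / low_dpi


# Hypothetical page size: US Letter, 8.5 x 11 inches.
low = raw_bilevel_bytes(8.5, 11.0, 150)
high = raw_bilevel_bytes(8.5, 11.0, 300)

print("raw 150 dpi bytes:", low)
print("raw 300 dpi bytes:", high)
print("raw ratio:", high / low)  # close to 4: pixel count grows with dpi squared
print("linear rule-of-thumb ratio:", linear_size_ratio(150, 300))
```

The raw bitmap roughly quadruples when the resolution doubles, so a compressed file that only doubles is already a relative gain at the higher resolution.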
In any event, if there is control over the document capture process, then given the current state of OCR methods, a higher resolution (e.g., 300 dpi) is strongly recommended to attain the most accurate OCR results. The file size will not go up (using JBIG2), and the recognition rates will be significantly better than at a lower scanning resolution.
If one has documents that were already scanned at low resolution, proper (post)processing of these files is also necessary to achieve good recognition results. First, if these files are currently in greyscale or color formats (e.g., JPEG), we strongly suggest not thresholding them before OCR. Rather, these documents should be upsampled to 300 dpi, preferably using bicubic splines. The upsampled greyscale or color images should then be presented directly to the OCR engine.
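The upsampling step described above might look like the following sketch. It assumes the Pillow imaging library is installed and that the original scanning resolution is known. The function names `upsample_size` and `upsample_for_ocr` are hypothetical. Pillow's `BICUBIC` filter is a bicubic convolution resampler, used here as a readily available stand-in for the bicubic splines suggested above.

```python
def upsample_size(width_px, height_px, source_dpi, target_dpi=300):
    """Pixel dimensions after resampling from source_dpi to target_dpi."""
    new_width = round(width_px * target_dpi / source_dpi)
    new_height = round(height_px * target_dpi / source_dpi)
    return new_width, new_height


def upsample_for_ocr(src_path, dst_path, source_dpi, target_dpi=300):
    """Upsample a greyscale or color scan with bicubic resampling, no thresholding."""
    # Pillow is imported here so that upsample_size works without it installed.
    from PIL import Image

    with Image.open(src_path) as img:
        # Bicubic filtering needs continuous-tone data, so palette and other
        # modes are converted to RGB; greyscale ("L") and RGB are kept as-is.
        if img.mode not in ("L", "RGB"):
            img = img.convert("RGB")
        new_size = upsample_size(img.width, img.height, source_dpi, target_dpi)
        upsampled = img.resize(new_size, resample=Image.BICUBIC)
        # Record the new resolution so downstream tools see 300 dpi.
        upsampled.save(dst_path, dpi=(target_dpi, target_dpi))


# Example: a 150 dpi letter-size scan doubles in each dimension.
print(upsample_size(1275, 1650, 150))
```

A call such as `upsample_for_ocr("page_150dpi.jpg", "page_300dpi.png", source_dpi=150)` (hypothetical file names) would write a greyscale or color image that can be passed to the OCR engine without prior thresholding.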