Question: What accuracy rate can I expect when OCR’ing my corporate documents – mostly accounting and revenue earnings statements? Is OCR accuracy affected by the scanning resolution? Do I gain any significant advantage in using the most accurate setting the OCR product supports?
Answer: OCR accuracy can never be guaranteed, and the phrase "garbage in, garbage out" definitely applies to OCR. If the source documents are clean originals, the OCR engine will probably yield considerably more accurate results than if they are third-generation scanned files with some image skew. The higher the scanning resolution, at least up to about 300 dpi, the more accurate the typical OCR results will be; above that, the gains usually taper off.
On a perfectly clean, well-behaved scan there is often little, if any, accuracy difference between the "fastest" and "most accurate" OCR modes. This means that if the most accurate mode runs 3x-6x slower than the fast mode, it is often not cost-justified: the end user will do virtually as well in fast mode, with a much faster workflow processing rate. If the source files are messy (not originals, considerable skew, many different fonts, texture noise, or other complicating factors), then running the most accurate OCR setting may indeed be necessary to obtain acceptable results. The best way to find the optimal setting for a given OCR engine is to test it on a representative sample of your own documents, measuring both recognition accuracy and processing time for each available setting.
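The test loop described above can be sketched in Python. This is a minimal, hypothetical harness: the two lambda "engines" are stand-ins for real OCR calls at different settings (in practice you would invoke an actual OCR library, e.g. Tesseract in its fast vs. best modes, on page images), and character-level accuracy is approximated here with the standard-library difflib similarity ratio against known ground-truth text.

```python
import difflib
import time

def char_accuracy(ocr_output: str, ground_truth: str) -> float:
    """Character-level accuracy: similarity ratio of OCR output vs. known text."""
    return difflib.SequenceMatcher(None, ocr_output, ground_truth).ratio()

def benchmark(ocr_fn, pages, truths):
    """Run one OCR setting over a test set; return (mean accuracy, seconds)."""
    start = time.perf_counter()
    outputs = [ocr_fn(page) for page in pages]
    elapsed = time.perf_counter() - start
    mean_acc = sum(char_accuracy(o, t) for o, t in zip(outputs, truths)) / len(truths)
    return mean_acc, elapsed

# Hypothetical "engines": stand-ins for one OCR library run at two settings.
# The fast mode simulates a classic OCR confusion ("rn" read as "m").
fast_mode = lambda page: page.replace("rn", "m")
accurate_mode = lambda page: page

# In a real test these would be scanned page images and their verified transcripts.
truths = ["Revenue earnings statement, fiscal year 2023"]
pages = list(truths)

for name, engine in [("fast", fast_mode), ("accurate", accurate_mode)]:
    acc, secs = benchmark(engine, pages, truths)
    print(f"{name:>8} mode: accuracy={acc:.3f}, time={secs:.4f}s")
```

With results like these in hand for each setting, you can judge directly whether the slower mode's accuracy gain on your documents justifies its cost in throughput.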