Question: Can I OCR my files and guarantee that each document has been OCRed correctly with a given confidence, e.g., 99.5%?
Answer: Yes, sort of. OCR verification is really a semi-automated process. What can be expected from the OCR system is that it (i) correctly determines how reliable each OCR ASCII assignment is, and (ii) flags for human intervention every word in the document that falls below the pre-assigned confidence level.
Getting an accurate OCR confidence measure is non-trivial. Most OCR packages return a confidence value for each word, but that measure is often unreliable, so it is important to run your files on a system whose confidence measure is reasonably reliable. Such measures often consider attributes like “is the word returned by the OCR engine in the language dictionary?” and “does this word have a reasonable intra-document frequency?”. Many other indicators can also help in obtaining an accurate confidence measure for each word.
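As a rough illustration of how the two attributes mentioned above could feed into a per-word confidence, here is a minimal sketch. The function name, the bonus and penalty weights, and the input format (a list of word/confidence pairs with confidences in [0, 1]) are all assumptions made for this example, not the behavior of any particular OCR package.

```python
from collections import Counter

def adjusted_confidence(words, dictionary, dict_bonus=0.10, rare_penalty=0.15):
    """Adjust raw engine confidences using two simple indicators.

    words      : list of (text, engine_confidence) pairs, confidence in [0, 1]
    dictionary : set of known lowercase words (hypothetical language dictionary)
    The weights are illustrative assumptions, not tuned values.
    """
    # Intra-document frequency: how often each word form occurs in this document.
    counts = Counter(text.lower() for text, _ in words)
    scored = []
    for text, conf in words:
        key = text.lower()
        score = conf
        if key in dictionary:
            # Dictionary words are more likely to be correct.
            score += dict_bonus
        elif counts[key] == 1:
            # Unknown words that occur only once are more suspicious.
            score -= rare_penalty
        # Clamp the result back into [0, 1].
        scored.append((text, min(1.0, max(0.0, score))))
    return scored

if __name__ == "__main__":
    sample = [("invoice", 0.80), ("invoice", 0.85), ("t0tal", 0.70)]
    for text, score in adjusted_confidence(sample, {"invoice", "total"}):
        print(f"{text}: {score:.2f}")
```

A real system would likely combine many more indicators and calibrate the result against manually verified text; the point here is only that the engine's raw number is one input among several.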
At some point, any such OCR verification system needs to be semi-automated, with a human in the loop. If a document requires a recognition rate of 99.5%, that rate is measured with respect to human recognition, not machine recognition. For example, if a paragraph in the document is completely unreadable to any human (e.g., a third-generation scan of text set in a very small font), then the words in that paragraph should not be counted as unrecognized: the text is beyond readability, and in an information-theoretic sense the information is already lost, through no fault of the OCR system. On the other hand, if a paragraph is in a small font and very difficult to read but still clearly human readable, and the OCR engine does not pick it up, then those words must be counted as unrecognized.
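The counting rule described above can be sketched as follows. The record layout (dictionaries with `human_readable` and `ocr_correct` flags, both set by a human reviewer) is a hypothetical format chosen for this example.

```python
def recognition_rate(words):
    """Recognition rate measured against human readability.

    words : iterable of dicts with two boolean keys (assumed format):
            'human_readable' - a human could read the word in the scan
            'ocr_correct'    - the OCR output matches what the human reads
    Words no human can read are excluded from the count entirely.
    Returns None when the document has no human-readable words.
    """
    readable = [w for w in words if w["human_readable"]]
    if not readable:
        return None
    correct = sum(1 for w in readable if w["ocr_correct"])
    return correct / len(readable)

if __name__ == "__main__":
    sample = [
        {"human_readable": True, "ocr_correct": True},
        {"human_readable": True, "ocr_correct": True},
        {"human_readable": True, "ocr_correct": False},   # hard to read, OCR missed it
        {"human_readable": False, "ocr_correct": False},  # unreadable, not counted
    ]
    print(recognition_rate(sample))
```

Under this rule the unreadable word affects neither the numerator nor the denominator, while the difficult but readable word that the OCR engine missed lowers the rate.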
To guarantee a certain minimum OCR accuracy, then, all document pages that fall below the minimum OCR recognition threshold must be shown to a human, who determines whether the low-confidence words are human readable. If so, the human can manually correct any words with incorrect text assignments. The task of any OCR verification system is to semi-automate the recognition process so that, with minimal human intervention, a certain minimum OCR confidence level can be established for a collection of documents.
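The flagging step might look something like the sketch below. It assumes each page is a list of (word, confidence) pairs and that a single per-word threshold is used to decide what a human should see; both the data layout and the use of 0.995 as a per-word cutoff are assumptions for illustration, and a per-word cutoff is not the same thing as a guaranteed document-level accuracy.

```python
def pages_needing_review(pages, threshold=0.995):
    """Build a human review queue from per-word confidences.

    pages     : dict mapping page id -> list of (word, confidence) pairs
    threshold : per-word confidence below which a word is flagged (assumed value)
    Returns a dict mapping page id -> list of flagged words, containing
    only the pages that have at least one flagged word.
    """
    queue = {}
    for page_id, words in pages.items():
        flagged = [text for text, conf in words if conf < threshold]
        if flagged:
            queue[page_id] = flagged
    return queue

if __name__ == "__main__":
    document = {
        "page-1": [("Total", 0.999), ("due", 0.998)],
        "page-2": [("arnount", 0.62), ("paid", 0.999)],
    }
    for page_id, flagged in pages_needing_review(document).items():
        print(page_id, flagged)
```

Pages with no flagged words skip manual review altogether, which is where the savings in human effort come from.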