Optical Character Recognition

Question: What factors need to be considered in purchasing an OCR (optical character recognition) system for our company’s document production system?

Answer: The five primary factors to consider in acquiring an optical character recognition system are (1) speed, (2) accuracy, (3) functionality, (4) control, and (5) pricing. Let me briefly describe each of these:

1. Speed: An optical character recognition system must often run in sync with other processes, such as a scanner or MFP device. In that case, the OCR processing rate may be crucial in deciding which system is appropriate. Sometimes OCR is done as a post-process rather than in real time, and then the processing rate is much less of a factor. Generally speaking, there is a speed vs. accuracy tradeoff in OCR: the faster the optical character recognition system, the less accurate the recognition tends to be.
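
As a rough illustration of the "in sync" requirement, the sketch below checks whether an OCR engine can keep pace with a scanner. The function name and all of the figures are made up for the example; real rates have to be measured on your own documents and hardware.

```python
def ocr_keeps_up(scanner_pages_per_minute: float,
                 ocr_seconds_per_page: float,
                 parallel_workers: int = 1) -> bool:
    """Return True if OCR throughput is at least the scanner's page rate.

    Assumes each worker processes pages independently at the same speed,
    which is a simplification.
    """
    if ocr_seconds_per_page <= 0 or parallel_workers < 1:
        raise ValueError("need a positive page time and at least one worker")
    ocr_pages_per_minute = parallel_workers * 60.0 / ocr_seconds_per_page
    return ocr_pages_per_minute >= scanner_pages_per_minute


# Hypothetical figures: a 40 ppm scanner and an engine taking 2 s per page.
print(ocr_keeps_up(40, 2.0))                       # one worker: 30 ppm
print(ocr_keeps_up(40, 2.0, parallel_workers=2))   # two workers: 60 ppm
```

If the check fails, the usual options are a faster (and typically less accurate) OCR mode, more parallel workers, or moving OCR to a post-process.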

2. Accuracy: The accuracy of the recognition system is often the deciding factor in determining which optical character recognition system is appropriate. Many OCR systems let the user select the desired accuracy level. There are at least two aspects of OCR accuracy to consider in evaluating a system. The first is how accurate the recognized text is with respect to the original document. The second is how accurate the page segmentation is. In other words, are column headers, multi-column text, pictures, graphs, and so on correctly detected?
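
One common way to put a number on the first aspect (text accuracy) is to compare the OCR output against a hand-checked transcription of the same page and count character-level edits. The following is a minimal sketch of that idea, not the scoring method of any particular product; the function names are chosen for this example.

```python
def edit_distance(reference: str, recognized: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, rec_char in enumerate(recognized, start=1):
            cost = 0 if ref_char == rec_char else 1
            current.append(min(previous[j] + 1,          # drop a reference char
                               current[j - 1] + 1,       # extra recognized char
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def character_accuracy(reference: str, recognized: str) -> float:
    """Fraction of reference characters recovered, clamped to [0, 1]."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


# One substituted character ("l" read as "1") in a 16-character string.
print(character_accuracy("Enron litigation", "Enron 1itigation"))  # 0.9375
```

This measures text accuracy only. It says nothing about whether columns, headers, or graphics were segmented correctly, so the second aspect has to be evaluated separately.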

3. Functionality: The two aspects of optical character recognition accuracy described in item 2 relate directly to how functional the OCR output is going to be. The first aspect of functionality is searchability. For example, an optical character recognition system with very accurate text recognition but poor page decomposition is perfectly adequate in an indexing environment, since we just need to find all pages in the database that contain a given string, e.g., “Enron litigation”, and be able to highlight the occurrences of the string on any page in which it occurs. For archiving in a records management environment, with documents of record, this is probably the kind of OCR accuracy that is required. A correct logical page decomposition is not (strictly) required.
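
To make the indexing use case concrete, here is a small sketch that finds every page containing a phrase and records the character offsets a viewer could use for highlighting. The page store is a plain dictionary of page number to OCR text, which is an assumption for the example; a production system would use a real search index.

```python
from typing import Dict, List


def find_occurrences(pages: Dict[int, str], query: str) -> Dict[int, List[int]]:
    """Map each matching page number to the start offsets of the query.

    Matching is case-insensitive via str.lower(); offsets are valid for
    text whose lowercase form has the same length (true for ASCII).
    """
    needle = query.lower()
    hits: Dict[int, List[int]] = {}
    if not needle:
        return hits
    for page_number, text in pages.items():
        haystack = text.lower()
        offsets = []
        start = haystack.find(needle)
        while start != -1:
            offsets.append(start)
            start = haystack.find(needle, start + 1)
        if offsets:
            hits[page_number] = offsets
    return hits


pages = {
    1: "The Enron litigation began. Enron litigation costs rose.",
    2: "Nothing relevant on this page.",
}
print(find_occurrences(pages, "enron litigation"))  # {1: [4, 28]}
```

Note that nothing here depends on page layout: only the recognized text matters, which is why poor page decomposition is tolerable for this function.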

The second aspect of optical character recognition functionality is reusability. This means taking part of a scanned document, such as a graph or a paragraph of text, and reusing it in a new document that is being generated. This is generally not the reason for OCRing at the company level, as reusability quickly gets into copyright issues. But it is one of the main reasons for OCRing in academia, as in citing a previous paper or its results. When reusability is the intended function, the measure of OCR accuracy used needs to consider page decomposition. The more accurate the page segmentation or decomposition, the more accurate the recovered OCR page will be.

4. Control: Control is an important factor in optical character recognition selection. How does the user control the optical character recognition process? In particular, does a given OCR system allow for multi-page processing? (Most systems do.) Does it support batch control? Does it have a watched folder mode? Can it be operated from the command line or via an SDK? In short, how much OCR control is afforded to the user?
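
For readers unfamiliar with the term, a "watched folder" mode means the system picks up files as they land in a directory and processes them without user action. The sketch below shows the idea with simple polling. The `run_ocr` callable is a hypothetical stand-in for whatever command-line tool or SDK call a given product exposes, and the file extensions are assumptions.

```python
import time
from pathlib import Path
from typing import Callable, Set, Tuple


def process_new_files(watched: Path,
                      seen: Set[Path],
                      run_ocr: Callable[[Path], None],
                      suffixes: Tuple[str, ...] = (".tif", ".tiff", ".pdf")) -> int:
    """Run OCR once on each not-yet-seen image file; return how many ran."""
    processed = 0
    for path in sorted(watched.iterdir()):
        if path.is_file() and path.suffix.lower() in suffixes and path not in seen:
            run_ocr(path)   # hypothetical hook into the OCR engine
            seen.add(path)
            processed += 1
    return processed


def watch_folder(watched: Path,
                 run_ocr: Callable[[Path], None],
                 poll_seconds: float = 5.0) -> None:
    """Poll the folder forever, handing new files to run_ocr."""
    seen: Set[Path] = set()
    while True:
        process_new_files(watched, seen, run_ocr)
        time.sleep(poll_seconds)
```

A real deployment would also need to handle files that are still being written and OCR failures. When evaluating a product, it is worth asking whether its built-in watched folder mode covers those cases.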

5. Pricing: Finally, of course, there is pricing. Optical character recognition systems vary greatly in price, from $99 up into the hundreds of thousands of dollars. To some extent, you get what you pay for. Pricing is often determined by several criteria, including: Is the optical character recognition software an out-of-the-box product? Is it for manual use or automated production use? Does it have any unique features that make it compelling for a certain application? How accurate can the system be when run in its most accurate mode? These are among the criteria used in determining optical character recognition pricing.