Data Extraction within OCR
The data extraction problem is very closely coupled with form recognition. Usually, when a company needs data extraction, it is in the context of form recognition: one cannot extract meaningful data without first recognizing the form type. This is distinct from the general OCR problem, which is to extract as much meaningful text from an image document as possible. General OCR assumes no prior knowledge about the document, other than perhaps what language it is in. General OCR is therefore well suited to full-text search, where a database index needs to be constructed to allow for arbitrary text-based queries. But it is not ideal for field coding, where certain fields need to be precisely coded into a database record and must be entered correctly, or there may be no way to find the document later.
For a reliable automated or semi-automated data extraction / field coding system to work, characteristics of the application need to be known ahead of time. These are aspects the system needs to train on to achieve effective recognition rates. For example, if the system being constructed or maintained is a university database, then the fields necessary for each student record must be known a priori. In addition, whatever constraints are available for each record field must either be explicitly entered, e.g., in an XML-based file, or learned by the system during training. So a data extraction system that codes a Social Security number field for each student should know that the field is numeric, consisting of exactly 9 digits (with possible embedded hyphens).
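The Social Security number constraint described above can be sketched as a small validation routine. This is only an illustration: the function name normalize_ssn and the choice to allow at most one hyphen between adjacent digits are assumptions, not part of any particular extraction product.

```python
import re

# Assumed constraint: exactly 9 digits, optionally separated by single
# embedded hyphens (no leading or trailing hyphen).
SSN_PATTERN = re.compile(r"[0-9](?:-?[0-9]){8}")


def normalize_ssn(raw):
    """Return the 9-digit string if the OCR output satisfies the field
    constraint, otherwise None so the field can be flagged for review."""
    candidate = raw.strip()
    if SSN_PATTERN.fullmatch(candidate) is None:
        return None
    # Drop the hyphens so the stored value has one canonical form.
    return candidate.replace("-", "")


if __name__ == "__main__":
    for text in ["123-45-6789", "123456789", "123-45-678O", "12345-678"]:
        print(repr(text), "->", normalize_ssn(text))
```

A value that fails the constraint, such as a reading where the letter O was recognized in place of a zero, is rejected rather than silently stored.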
Forms, Data Constraints and Redundancies: Check Deposit
Many factors contribute to solving the data extraction problem correctly. Among them are form-based constraints, data constraints, and data redundancies. For example, these three factors are all very useful in accurately coding check deposit information. When you drop your checks off for deposit, there are usually some checks and a deposit slip. The deposit slip is usually handwritten, though company-related information such as the account number may already be printed on it. The checks themselves are either handwritten or machine printed. Routing and bank branch information is encoded on the bottom of each check using special numeric symbols that are easily recognizable.
One of the issues in solving this problem reliably is that the check deposit process usually involves handwritten data, and handwritten data recognition is still considered largely an unsolved problem. However, there are data redundancies and form and data constraints that make the problem largely solvable. In particular, the deposit slip, which is basically a form, provides a box for each numeric character. This prevents the user from writing the dollar amount in unconstrained cursive. It also handles the difficult segmentation problem, as each numeric character has already been isolated. In addition, the accuracy of handwritten numeric character recognition is significantly higher than that of unconstrained handwritten character recognition. Furthermore, the deposit slip asks for each check amount, even though it is already on the check, and the deposit total is requested twice. So there is redundancy with respect to each check amount, redundancy with respect to the total deposit, and additional redundancy in that the sum of all the checks (and other deposits) must add up to the total deposit amount. The semi-automated check deposit system in place at many large banks takes advantage of all these constraints and redundancies and, as a result, processes the average check at considerably lower cost than 10 years ago, i.e., pre-automation.
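The three redundancies described above can be expressed as simple cross-checks. The following is a minimal sketch, assuming the recognized amounts have already been converted to Decimal values; the function name find_inconsistencies and its parameters are hypothetical, and other deposits such as cash are left out for brevity.

```python
from decimal import Decimal


def find_inconsistencies(check_amounts, slip_amounts, total_first, total_second):
    """Compare the redundant readings of a deposit and describe any mismatch.

    check_amounts: amounts read from the checks themselves
    slip_amounts:  the same amounts as listed on the deposit slip
    total_first, total_second: the two readings of the deposit total
    """
    problems = []
    # Redundancy 1: each check amount appears on the check and on the slip.
    # Sorting avoids assuming the checks were listed in scan order.
    if sorted(check_amounts) != sorted(slip_amounts):
        problems.append("check amounts differ from slip entries")
    # Redundancy 2: the total is written twice.
    if total_first != total_second:
        problems.append("the two totals differ")
    # Redundancy 3: the listed amounts must add up to the total.
    if sum(slip_amounts, Decimal("0")) != total_first:
        problems.append("slip entries do not add up to the total")
    return problems


if __name__ == "__main__":
    checks = [Decimal("120.00"), Decimal("35.50")]
    slip = [Decimal("35.50"), Decimal("120.00")]
    total = Decimal("155.50")
    print(find_inconsistencies(checks, slip, total, total))
```

An empty result means every redundant reading agrees; any entry in the list identifies which redundancy was violated.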
I Know that I Don't Know that I Know
What is very important in the design and implementation of any field coding / form learning / data extraction system is to know what you know, and what you don't. The reason for this is simple: if an automated system performs correctly, the company saves money and sees a return on investment (ROI). If the automated system makes mistakes that go undetected, even occasionally, correcting the situation could cost the company a lot more than the automation saved.
Going back to the semi-automated check deposit example: if the system correctly recognizes all the dollar amounts, on both the checks and the deposit slip, 90% of the time, is this a win for the bank or not? The answer depends entirely on whether the system knows what it knows. That is, if the system knows when a numeric value may be incorrect, because the heavily redundant numeric information is not in sync, then any such case can be shown to a human operator without penalty, and the automation is a win for the bank. If, however, the system does not have the controls in place to verify the correctness of the extracted data, then it is probably not commercially viable unless every transaction is shown to a human for verification.
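The idea of a system that knows what it knows can be sketched as a routing decision: a transaction is accepted automatically only when every field has redundant readings that agree, and is otherwise passed to a human operator. The function name route_transaction, the field names, and the two outcome labels are assumptions made for illustration.

```python
def route_transaction(readings):
    """Decide whether a transaction can be accepted without human review.

    readings maps a field name to the list of independent readings of that
    field (for example, an amount read from the check and from the slip).
    A reading of None means the recognizer could not produce a value.
    """
    disputed = []
    for name, values in readings.items():
        unverifiable = len(values) < 2      # no redundancy to check against
        unreadable = None in values         # recognizer gave up on a reading
        out_of_sync = len(set(values)) != 1  # redundant readings disagree
        if unverifiable or unreadable or out_of_sync:
            disputed.append(name)
    if disputed:
        return ("human_review", disputed)
    return ("auto_accept", [])


if __name__ == "__main__":
    print(route_transaction({
        "check_1": ["120.00", "120.00"],
        "total": ["155.50", "155.50"],
    }))
    print(route_transaction({
        "check_1": ["120.00", "128.00"],
        "total": ["155.50", "155.50"],
    }))
```

Fields with only a single reading are sent to review as well, on the assumption that a value that cannot be cross-checked cannot be trusted automatically.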





