Even OCR needs a helping hand – Quality Assurance

Aug 04

Let’s face it: OCR is not 100% accurate 100% of the time. Accuracy depends heavily on document type, scan quality, and document makeup. The reason OCR is so powerful is because it’s not. So how do we give OCR the best chance to succeed? There are many ways; the one I would like to talk about now is quality assurance.

Quality assurance is usually the final step in any OCR process. In it, a human reviews uncertainties and the results of business rules applied to the OCR output. An uncertainty is a character that the software flags because it did not satisfy a confidence threshold during recognition. The process is a balancing act between the desire to spend as little human time as possible and the need to see every possible error, but no more.

Let’s start with the review of uncertainties. Here an operator looks at just those characters, words, or sentences that are uncertain. Which ones those are is determined by the OCR product, which will have some indicator for them. In full-page OCR, spell checking is often used. In Data Capture, a field is usually reviewed character by character, and the operator does not see the rest of the results. Some organizations set critical fields to always be reviewed, no matter the accuracy. Others may decide that a field is useful but does not need to be 100% accurate. Each package has its own variation of “verification mode”. It’s important to know its settings, and the levels of uncertainty your documents are showing, to plan your quality assurance.
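To make the idea concrete, here is a minimal sketch of how a review queue could be built from per-character confidence scores. Everything in it is an assumption for illustration: the `Char` structure, the 0.0 to 1.0 confidence scale, the 0.80 threshold, and the field names are hypothetical, and real packages expose uncertainty in their own way.

```python
from dataclasses import dataclass


@dataclass
class Char:
    value: str
    confidence: float  # hypothetical engine score, 0.0 (no idea) to 1.0 (certain)


def fields_to_review(fields, threshold=0.80, always_review=()):
    """Map each field needing an operator to the indexes of its uncertain characters."""
    queue = {}
    for name, chars in fields.items():
        uncertain = [i for i, c in enumerate(chars) if c.confidence < threshold]
        # Critical fields are queued even when every character passed the threshold.
        if uncertain or name in always_review:
            queue[name] = uncertain
    return queue


fields = {
    "invoice_total": [Char("1", 0.99), Char("0", 0.62), Char("5", 0.97)],
    "customer_name": [Char("A", 0.95), Char("l", 0.91)],
    "account_id": [Char("7", 0.98), Char("3", 0.96)],
}
queue = fields_to_review(fields, threshold=0.80, always_review={"account_id"})
```

Raising the threshold sends more characters to the operator, and lowering it sends fewer, which is the balancing act described above.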

After the characters and words have been checked in Data Capture, there is an additional step in quality assurance: business rules. In this process, the software applies arbitrary rules that the organization creates and checks them against the fields. A good example might be “don’t enter anyone in the system whose birth year is earlier than 1984”. If such a document is found, it is flagged for an operator to check. These rules can be endless, and packages today make it very easy to create custom ones. The goal is to first deploy the business rules you already have in place in the manual operation, and then augment them with rules that enhance accuracy based on the raw OCR results you are seeing.
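The birth year rule above could be sketched roughly like this. The record layout, the `birth_year` field name, and the helper names are assumptions made for the example; a real package would have its own rule editor or scripting hooks.

```python
def birth_year_ok(record):
    """Pass only records whose birth year is readable and not earlier than 1984."""
    try:
        return int(record["birth_year"]) >= 1984
    except (KeyError, TypeError, ValueError):
        # A missing or unreadable year (for example "19B4") also goes to an operator.
        return False


# Each rule is a (label, check) pair; more rules can simply be appended.
RULES = [("birth year earlier than 1984 or unreadable", birth_year_ok)]


def failed_rules(record, rules=RULES):
    """Return the labels of every rule the record fails; empty means no flag."""
    return [label for label, check in rules if not check(record)]


flags = failed_rules({"name": "J. Smith", "birth_year": "1979"})
```

A document with a non-empty result would be routed to an operator instead of going straight into the system.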

In some more advanced integrations, a database or body of knowledge is used as a first round of quality assurance that is itself still automated.
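As a rough illustration of such a lookup, the sketch below compares an OCR result against a list of known values using Python’s standard `difflib` module. The vendor list, the 0.85 cutoff, and the three status labels are assumptions for the example, and whether a close match may be corrected automatically or should still be shown to an operator is a decision for each organization.

```python
import difflib

# Hypothetical body of knowledge: values the organization already trusts.
KNOWN_VENDORS = ["Acme Supply", "Globex Corporation", "Initech"]


def check_against_known(value, known=KNOWN_VENDORS, cutoff=0.85):
    """Return (value, status), where status is 'exact', 'corrected' or 'needs_review'."""
    if value in known:
        return value, "exact"
    # Look for one sufficiently similar known value, e.g. an "l" read as "1".
    close = difflib.get_close_matches(value, known, n=1, cutoff=cutoff)
    if close:
        return close[0], "corrected"
    return value, "needs_review"


result = check_against_known("Acme Supp1y")
```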

Combined, these two quality assurance steps should give any company a chance to achieve the accuracy it is seeking. Companies that fail to recognize or plan for this step are usually the ones that have the biggest challenges using OCR and Data Capture technology.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Expectations bite the dust

Mar 10

Just this morning, I was reminded of why market education is so important. I received an email from a customer who has been exposed to data capture technology for many years. This customer owns a semi-structured data capture solution that is capable of locating fields on forms that change from variation to variation. In an attempt to help my understanding, we started a conversation about their expectations. Very wisely, the customer broke down their expectations into three categories: OCR accuracy (field level), field location accuracy, and the amount of time to process each document. This is a step more advanced than a typical user, who will clump all of this into one category. In addition to this, there should be a minimum template matching accuracy. In any case, they expect an OCR accuracy of 90%, which is reasonable considering the documents they are working with are pixel perfect. They expect a 20-page document to be processed in 4 minutes, which is also reasonable and right on the line. Finally, they expect field location to be 100%. RED FLAG!

This is not the first time I have seen the assumption that you can locate fields on a semi-structured form with 100% accuracy, 100% of the time. To my dismay, as people learn more about the technology, this seems to be the next class of common fallacy. And because the organization did not specify a template matching accuracy, it must also be assuming that templates match 100% of the time in order to get 100% field location accuracy. Trouble.

It’s clear why 100% field location accuracy is important to them: basic QA processes are capable of checking only recognition results (OCR accuracy), not the locations of fields. Instead of modifying its QA process, the organization’s first thought was to eliminate the problems that QA might face. But 100% accuracy is not possible no matter what is done, including straight text parsing. In this case, the reason is that even in a pixel-perfect document, there are situations where a field might be located partially, located in excess, or not located at all. The scenario that occurs most often in pixel-perfect documents is that text is sometimes seen as a graphic because it’s so clean, and text that is too close to lines is ignored. So in these types of documents, a field error is usually a partially located field. Most QA systems can be set up so that rules check the data structure of fields, and if the data contained in a field is faulty, an operator can check the field and expand it if necessary. But this is only possible if the QA system is tied to data capture.
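A rule that checks the data structure of a field can be as simple as a pattern match. The sketch below is only an illustration: the field names and the expected formats (a `MM/DD/YYYY` date and a US ZIP code) are assumptions, not something taken from the customer’s forms.

```python
import re

# Hypothetical expected structure for each captured field.
FIELD_PATTERNS = {
    "invoice_date": re.compile(r"\d{2}/\d{2}/\d{4}"),
    "zip_code": re.compile(r"\d{5}(-\d{4})?"),
}


def suspect_fields(extracted):
    """Return the names of fields whose text does not fit the expected structure."""
    suspects = []
    for name, pattern in FIELD_PATTERNS.items():
        value = (extracted.get(name) or "").strip()
        # A partially located field ("03/14/20") or a missing one fails the full match.
        if not pattern.fullmatch(value):
            suspects.append(name)
    return suspects


suspects = suspect_fields({"invoice_date": "03/14/20", "zip_code": "94107"})
```

Here the truncated date is flagged even if every character in it was recognized with high confidence, which is exactly the kind of error a recognition-only QA step would miss.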

After further conversation, it became clear that the data capture solution is being forced to fit into an existing QA model. There are various reasons why this may happen: license cost, a pre-existing QA process, or a misunderstanding of QA possibilities. This is very common in organizations and very often problematic. Quality assurance is a far more trivial process to implement than data capture. It is more important to focus on the functionality of the data capture system and then develop a QA process that makes the best use of its output.

Again, a case of expectations and assumptions.
