Know your accuracy before you even test

Feb 25
2013

One of the natural abilities that develops as you see millions of sample images and their associated recognition results, is you begin to notice patterns and instantly indentify if a document will read well for both full-page document conversion and for field level. It has more or less become a natural ability of mine, but I can identify its components.

First is initial image quality. Without yourself identifying any objects on the page, look objectively at the document as a collection of questionable objects and see if you think the image quality is good. This is determined by coherence of each object. Are object borders tight and determinable? Are there objects interfering with other objects? Is the background of the image significantly different than all objects?

Second am identification of objects. Find text, graphics, lines, paragraphs, etc. Are their borders far enough apart? Is their type clear? This is most important for text. Is their printing consistent? For example does text go from one background color to another, this would make it inconsistent. Or another example does the straightness of lines change throughout the document? And can one object be confused for another?

And third, now that you know the objects, how easy is it to determine their value. Is the value obvious? Do you have to look at it for a while to figure it out?

Essentially the three above steps are exactly what the conversion ( OCR, ICR, OMR ) product does in order to read a document. With field level recognition it’s a bit more elaborate, but the core is the same. By identifying early on what the anticipated accuracy is of a document, you can then adjust your scan, or input settings accordingly even before looking at any technology. Doing this will give the best chance for success.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Machine Based vs. Software Based OCR

Mar 04
2012

Well they are all software, perhaps the better comparison is in-line OCR vs. PC based OCR. In any case, there is an important differences between the two. What I’m talking about are two very different modes of OCR ( Optical Character Recognition ) . One is done at a very fast rate on documents or images as they are scanned, the other is done after the scanning and achieved by PCs or Servers.  When someone is referring to OCR,  they are most likely discussing the software that is installed on a PC to convert image to text. Let’s look at the differences between the two.

In-line OCR is used primarily for mail-room processing on high speed high volume scanners, or on manufacturing assembly lines. Both scenarios need data from the input asset quickly. The benefit’s of in-line OCR is it’s the fastest OCR around. Usually the OCR is apart of firmware, and optimized for speed. If you imagine an assembly line of bottles, the bottles pass the camera at millisecond time. To wait for OCR would be a huge bottleneck in the quality control and inventory process. The downside to in-line OCR is accuracy. Usually in the case of the assembly line the engine has been so tuned that it is extremely accurate for a single image type. Where accuracy is proven to be less, is when it comes to document scanning, the digital mail-room. In the digital mail-room the in-line OCR, in order to be as fast as it is, must be an engine that is reduced in complexity, namely removing document analysis and reading of complex fonts. Because of this, when documents are scanned the accuracy cannot compare to that of PC based OCR.

PC based OCR has the benefit of scalability. It can work on the widest range of document types. Also because it’s using the PC, it has the latest and greatest technologies that work on degraded documents and complex documents. The downside of PC based OCR is that it’s not as fast as in-line. 99.9% it is fast enough. Many times PC based OCR is used at document scanner rated speed of 60 pages a minute. This is plenty fast for those who’s primary concern is quality. It is not fast enough for machine to machine hand-off’s, but this is not its primary use.

You may never encounter in-line OCR, but knowing about the technology helps understand the world of recognition and applications of such technology.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Even OCR needs a helping hand – Quality Assurance

Jun 05
2010

Let’s face it. OCR is not 100% accurate 100% of the time. Accuracy is highly dependent on document type, quality of scan, and document makeup. The reason OCR is so powerful is because it’s not. How do we give OCR the best chance to succeed? There are many ways, what I would like to talk about now is quality assurance.

Quality assurance is usually the final step in any OCR process where a human reviews uncertainties, and business rules based on the OCR result. An uncertainty is a character that the software flags that did not during recognition satisfy a threshold. This process is a balancing act between a desire to limit as much human time as possible and a need to see every possible error but not more.

Starting with review of uncertainties. Here an operator will look at just those characters, words, sentences, that are uncertain. This is determined by the OCR product which will have some indicator of what they are. In full page OCR, often spell checking is used. In Data Capture, usually a review character-by-character of a field is done and you don’t see the rest of the results. Some organizations will set critical fields to be reviewed always no matter the accuracy. Others may decide that a field is useful but does not need to be 100%. Each package has its own variation of “verification mode”. It’s important to know their settings and the levels of uncertainty your documents are showing to plan your quality assurance.

After the characters and words have been checked in Data Capture, there is an additional step in quality assurance, business rules. In this process, the software will apply arbitrary rules the organization creates and check them against the fields, a good example might be “don’t enter anyone in the system who’s birth year is earlier than 1984”. If such a document is found, it is flagged for an operator to check. These rules can be endless and packages today make it very easy to create custom rules. The goal would be to first deploy business rules you have already in place in the manual operation and augment it with rules to enhance accuracy based on the raw OCR results you are seeing.

In some more advanced integrations, the use of a database or body of knowledge is deployed as first round quality assurance that is also still automated.

These two quality assurance steps combined should give any company a chance to achieve the accuracy they are seeking. Companies who fail to recognize or plan for this step are usually the ones that have the biggest challenges using OCR and Data Capture technology.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

“You vote engines! Of course it’s better” – Reality of voting

May 15
2010

The trend of companies promoting OCR voting has become less common, but you will still occasionally find products that promote their accuracy by saying they don’t just use one engine, they use many and vote them together. The presumption of this approach is that of course they are more accurate then single engine solutions. This would seem to be the case, but it’s not that easy.

All the OCR engines have a system of voting internally already. This is how OCR technologies have made their advances throughout the years. They take algorithms that are expert in one particular way to interpret text, such as trigrams, words, fonts, etc. and vote their character guesses against each other for the final guess. This works great. This is very different from the voting that is often promoted of taking several engines and voting their result together. When you take two separate OCR engines and vote them together, it would seem you are getting the best of what’s available, but there is one major problem. Voting requires that each engine guess the same way, and this is not the case. For example Engine A might report a confidence on the letter “c” at 98% that it’s actually an “e” while Engine B might report with a 78% confidence that I is a “c”. When you vote these two, Engine A will win even though it’s wrong. This is typically how it goes, one engine in a voting scenario will win most of the time right or wrong, just because of how it reports its confidence levels.

This blog is not in combat with voting. Voting is a great tool, it’s used internally in the engines, and it can be used externally as well. How? Vote Engine A settings A against Engine A settings B. The same engine voted against itself just with different settings. This is a tremendous tool especially when dealing with varied documents, or highly degraded documents. By doing so you are comparing apples-to-apples confidence levels and not apples-to-elephants.

So next time you are turned on by voting, take a second look and see if it’s a marketed or real value.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Measuring Document Automation Efficiency

Dec 31
2009

The two most common question when organizations ask when they are seeking document automation technology is “how fast is it?” and “how accurate is it?”. Many don’t realize that the two are at opposition to each other most of the time. The more accurate a system, the slower it is, and the faster it is, the less accurate. But there is one fatal mistake in all these calculations, and that mistake is how efficiency is calculated.

Most companies who trial data capture, calculate performance on the slowest step which is optical character recognition (OCR). Literally, companies will hit the “read” button and immediately start timing until the read is complete. This is what is considered the speed of the document automation system. This is incorrect.

There is no question that OCR can be a tremendous bottleneck in the entire entry process, but poor OCR could create an even greater bottleneck. Imagine an OCR engine that reads a document with 100 characters in 1 second as compared to an engine that reads the same 100 characters in 3 seconds. Your initial thought is that the first engine would be better, but consider that the first engine may be 60% accurate leaving 40 characters to be manually entered, and the other engine 98% accurate leaving 2 characters to be manually entered or correct. If you consider an average entry speed of 1.6 characters per second then it will take the 40 characters an additional 25 seconds to enter for a total entry time of 26 seconds for the faster engine. For the slower engine it will take an additional 1.25 seconds to enter or edit 2 wrong characters thus a total entry time of 4.25 seconds. This means that end-to-end, the slower engine is 6 times faster in the document automation process then the slower engine.

This simple calculation illustrates the folly in assuming that the slower OCR time makes for a slower overall process. Usually focusing on accuracy has the greatest benefit for an organization unless you are improving the speed of a slower engine with hardware, or two engines are too close to see a benefit.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

The trick of the inverted text

Dec 25
2009

The search for greater accuracy when it comes to document automation, never stops. It’s true that with every new release, OCR technology has become so advanced that the jumps in accuracy are not what they were 10 years ago. Now, new versions of OCR engines contain enhancements for low quality documents and vertical document types but general OCR can’t get much better. Because of this, modern integrations need to find new tricks. This blog is full of them, but I’m about to explain just one more. OCRing inverted text.

OCRing inverted text is nothing new. Many document types have regions where white text is printed on a black background. The modern engines have an ability to read this text. Typically it’s not as accurate as black text on white background OCR, but it has its unique benefits. Especially with complex document types such as EOBs and drivers licenses.

There is a trick in using inverted text OCR to increase overall OCR accuracy. The method is to first OCR a document normally, then using imaging technology to invert the image. When you invert the image, the black text on white background switches to white text on a black background. Once the inversion is done, run OCR again. By comparing the two OCR results, you have essentially voted the same engine with little effort.

Large volume processing environments can deploy this trick without re-loading a new OCR engine, and applying different settings. It’s important to note that when using this technique, how you compare the two results is as important as the process itself. Typically you will assign more weight to the original version of the document then the inverted one. There you have it, one more tool in increasing the OCR accuracy of the engine you already use.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Expectations bite the dust

Dec 09
2009

Just this morning, I was reminded of why market education is so important. I received an email in the morning from a customer who has been exposed to data capture technology for many years. This customer owns a semi-structured data capture solution that is capable of locating fields on forms that changes from variation to variation. In an attempt to help my understanding, we started a conversation about their expectations. Very wisely, the customer broke down their expectations into three categories: OCR accuracy ( field level ), field location accuracy, and amount of time to process per document. This is a step more advanced than a typical user who will clump all of this into one category. In addition to this, there should be a minimum template matching accuracy. In any case, they expect an OCR accuracy of 90%, which is reasonable considering the document they are working with are pixel perfect. They expect a 20 page document to be processed in 4 minuets which is also reasonable and right on the line. Finally, they expect field location to be 100%, RED FLAG!

This is not the first time that there is an assumption that you can locate fields on a semi-structured form with 100% accuracy, 100% of the time. To my dismay, as people seem to be learning more about the technology, this is the next class of common fallacy. And because the organization did not specify template matching accuracy, it means they must also assume templates match 100% of the time to get 100% field location accuracy. Trouble.

It’s clear as to why 100% field accuracy is important for them.  That is because, basic QA processes are capable of only checking recognition results ( OCR Accuracy ), and not locations of fields. Instead of modifying QA processes, an organization’s first thought was how to eliminate the problems that QA might face. 100% accuracy is not possible no matter what is done, including straight text parsing. In this case, the reason it’s not possible is that even in a pixel perfect document, there are situations where a field might be located partially, located in excess, or not located at all. The scenario that most often occurs in pixel perfect documents is that text may sometimes be seen as a graphic because it’s so clean, and text that is too close to lines are ignored. So typically in these types of documents, any field error is usually a field located partial error. Most QA systems can be setup such that rules are applied to check data structure of fields, and if the data contained in them is faulty, an operator can check the field and expand it if necessary. But this is only possible if the QA system is tied with data capture.

After further conversation, it became clear that the data capture solution is being forced to fit in a QA model. There are various reason as to why this may happen: license cost, pre-existing QA, or miss-understanding of QA possibilities. This is very common for organizations and very often problematic. Quality assurance is a far more trivial processes to implement than data capture. When it comes to data capture it would be more important to focus on the functionality of the data capture system and develop a QA that makes it’s output most efficient.

Again, a case of expectations and assumptions.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Check-mark accuracy, all or none

Nov 20
2009

Check-mark processing ( OMR ) is one of the most accurate recognition technologies. Companies who properly utilize OMR are  able to process documents quickly and accurately. But for the same reason OMR is accurate, it can also be very inaccurate, when not used properly.

For the most part, OMR is an all or nothing technology. Unlike the varying degrees of accuracy and uncertainty in OCR, with OMR, a field is checked or not. Where accuracy and uncertainty come into play is when you deal with collections of check-marks where the technology will compare the results of all to see whichever ones are most likely checked. The three areas where organizations make the mistake when using OMR is: improper OMR type, poor thresholds, and bad rules.

Many think of OMR fields as the traditional bubble on school tests. But there are several types of OMR fields. Rectangle, Round, Automatic, and White Field. Unlike text recognition, the wrong field type selection in OMR results in 100% incorrect results, most of the time.

Rectangle and round are the traditional fields that comes to mind when thinking of check-marks. The technology used to processes these, also includes a way to tell if a field has been corrected ( slashed out, and answer changed ). For these fields, the borders of the field are detected and when a high enough amount of black pixels is found within the border, the field is considered checked. The only time this will not be the case is when a field has been detected as having a correction.

Automatic field types are for those forms that have non-traditional border types for their fields, OR have some sort of text already existing in the field. For example, if you scan a Scantron form as a black and white image without dropout, you will get for each field a round circle with some letter or number printed in the middle. In this case you would have to use the automatic field type. What happens is that the software compares an EMPTY form to the form being processed. If for example, a field has the letter “A” printed in the middle, the software will count how many pixels in the field the A consist of and use that as a baseline. For a field to be checked, it will have to contain some number of black pixels OVER the baseline. If in this case, you used a rectangle or round check-mark type field, all fields would be considered checked because no baseline was established. Now finally are white fields.

White fields are check-mark fields that have no border. The are most often forms that have dropout scanning or sometimes fields used for unique and cool cases such as detecting signatures. These are a useful type of checkmark that simply expects there to be no border and no printed text in the field area. If there is a small amount of black pixels in the field area it’s considered checked. If you use a white field on a rectangle OMR field it will always be considered checked because of the borders. The biggest challenge for white fields is that the size of the field directly impacts it’s accuracy so proper sizes must be chosen. All check-marks have degrees of thresholds assigned to them.

A threshold is the setting that determines the amount of pixels (as a percent ) that is required before a field is considered checked. Organizations usually never need to toggle the default thresholds, and this is one of the biggest mistakes that is made. Most OMR processing packages have default thresholds for all field types. These vendors have done the research to know what the optimum field threshold is for both accuracy and avoiding false positives. Companies, when they pick the wrong threshold, get fields considered checked when they are not and the other way around. The problem is most of these are never reviewed, because they never get flagged due to custom thresholds which creates a false positive, the worse possible outcome of any exception.

As with all data capture and forms processing tools, there is usually a step of validation and rules. For whatever reason, organizations tend to over-think the rules associated with check-marks. The most common rule is that for any given collection of check-marks associated with a single question, only one or combination of ones can be checked. So for example, for a multiple choice question that asks for one answer, if the software sees two checked it will flag both fields. These rules are very useful but when improperly implemented result in either too much verification of fields, which is OK just a time waster, or like the threshold false positives. Sometimes the rules are applied during recognition and thus effect recognition results. For example, a question that has no answer but one is expected, is forced an answer. It’s easy to blame the software, but most of the time it’s just a bad rule.

OMR is a great tool when used right because it’s extremely fast and accurate, but when it’s used wrong, it’s still fast but just extremely inaccurate.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Multi-Pass document recognition

Nov 04
2009

When accuracy is the primary concern in document recognition, the best technique is multiple passes of the OCR or recognition process. Similar to how you would have a document manually entered two to three times, why not have an OCR engine convert it 3, 4, or even 5 times all with different settings?

The important thing to note in multiple pass recognition is that you NEVER use a different engine for the same process. Reconciling results from two separate engines is self-defeating. This is often called voting and does not work because of the fact that each engine represents the confidence of characters differently, so you might end up always picking one engine that is less accurate just because it told you it was more confident than the more accurate engine. But using the same engine multiple times with different settings is consistent and a good idea.

An example of a scenario where this is being used successfully is with documents that have both machine and hand-printed text. A first read can be done with an OCR engine with settings A and a second read can be done with the same OCR engine but with settings B. In the areas where both produce just garbage text might indicate that in that area is hand print. Now you can use ICR ( hand-print engine ) in that region to pickup additional information. That is 3 total passes of recognition. The results are combined to make the final document.

At minimum, 3 runs of the same engine would be ideal as the statistical chance of two different settings producing the same error reduce drastically and the final output is nearly as good as it’s going to be. Some document types lend themselves to multiple pass recognition over others. Sometimes its determined by the environment, for example, environments that have a lot of traditional documents mixed with invoice looking documents would benefit from having a full-page read with standard settings on every page and a full-page read with special document analysis designed for documents with lines and tables.

While multiple pass OCR slows down the entire process, it’s still faster and more accurate than manual entry most of the time. I recommend this approach for any organization where accuracy is the primary concern.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Data Capture – Problem Fields

Oct 21
2009

The difference often between easy data capture projects and more complex ones has to do with the type of data being collected. For both hand-print and machine print forms, certain fields are easy to capture while others pose challenges. This post is to discuss those “problem fields” and how to address them.

In general fields that are not easily constrained and don’t have a limited character set are problem fields. Fields that are usually very accurate and easy to configure are number fields, dates, phone numbers, etc. Then there are the middle ground fields such as dollar amounts and invoice numbers for example. The problem fields are addresses, proper names, items.

Address fields are for most people surprisingly complex. Many would like to believe that address fields are easy. The only way to very easily capture address fields would be to have for example in the US the entire USPS data base of addresses that they themselves use in their data capture. It is possible to buy this data base. If you don’t have this data base the key to addresses is less constraint. Many think that you should specify a data type for address fields that starts with numbers and ends with text. While this might be great for 60% of the addresses out there, by doing so you made all exception address 0%. It’s best to let it read what it’s going to read and only support it with an existing data base of addresses if you have it.

Proper names is next in complexity to address. Proper names can be a persons name or company names It is possible to constrain the amount of characters and eliminate for the most part numbers, but the structure of many names makes the recognition of them complex. If you have an existing data base of names that would be in the form you will excel at this field. Like addresses, it would not be prudent to create a data type constraining the structure of a name.

Items consist of inventory items, item descriptions, and item codes. Items can either be a breeze or very difficult, and it comes down to the organizations understanding of their structure and if they have supporting data. For example if a company knows exactly how item codes are formed then it’s very easy to accurately process them with an associated data type. The best trick for items is again a data base with supporting data.

As you can see, the common trend is finding a data base with existing supporting data. Knowing the problem fields focuses companies and helps them with a plan of attack to creating very accurate data capture.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.