Know your accuracy before you even test

Feb 25
2013

One of the natural abilities that develops as you see millions of sample images and their associated recognition results is that you begin to notice patterns and can instantly identify whether a document will read well, both for full-page document conversion and for field-level recognition. It has more or less become second nature for me, but I can identify its components.

First is initial image quality. Without trying to identify any objects on the page yourself, look objectively at the document as a collection of unidentified objects and see if you think the image quality is good. This is determined by the coherence of each object. Are object borders tight and determinable? Are there objects interfering with other objects? Is the background of the image significantly different from all objects?

Second is identification of objects. Find text, graphics, lines, paragraphs, etc. Are their borders far enough apart? Is their type clear? This is most important for text. Is their printing consistent? For example, if text goes from one background color to another, that makes it inconsistent. As another example, does the straightness of lines change throughout the document? And can one object be confused for another?

And third, now that you know the objects, how easy is it to determine their value? Is the value obvious? Do you have to look at it for a while to figure it out?

Essentially, the three steps above are exactly what the conversion (OCR, ICR, OMR) product does in order to read a document. With field-level recognition it’s a bit more elaborate, but the core is the same. By identifying the anticipated accuracy of a document early on, you can adjust your scan or input settings accordingly, even before looking at any technology. Doing this will give you the best chance of success.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Even OCR needs a helping hand – Quality Assurance

Jun 05
2010

Let’s face it. OCR is not 100% accurate 100% of the time. Accuracy is highly dependent on document type, quality of scan, and document makeup. The reason OCR is so powerful is because it’s not. How do we give OCR the best chance to succeed? There are many ways; the one I would like to talk about now is quality assurance.

Quality assurance is usually the final step in any OCR process, where a human reviews uncertainties and business rules based on the OCR result. An uncertainty is a character the software flags because its recognition confidence did not satisfy a threshold. This process is a balancing act between a desire to limit human time as much as possible and a need to see every possible error, but no more.

Let’s start with the review of uncertainties. Here an operator looks at just those characters, words, and sentences that are uncertain. These are determined by the OCR product, which will have some indicator of what they are. In full-page OCR, spell checking is often used. In Data Capture, a field is usually reviewed character by character and you don’t see the rest of the results. Some organizations will set critical fields to always be reviewed no matter the accuracy. Others may decide that a field is useful but does not need to be 100% accurate. Each package has its own variation of “verification mode”. It’s important to know its settings and the levels of uncertainty your documents are showing in order to plan your quality assurance.
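As a rough illustration of threshold-based review, here is a minimal Python sketch. The `Char` record, the 0.80 threshold, and the field names are all assumptions made for the example; real packages expose confidence and verification settings in their own ways.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical engine


def uncertain_positions(chars, threshold=0.80):
    """Return the indexes of characters whose confidence is below the threshold."""
    return [i for i, c in enumerate(chars) if c.confidence < threshold]


def needs_review(field_name, chars, threshold=0.80, always_review=()):
    """A field goes to an operator if it is marked critical or has any uncertain character."""
    return field_name in always_review or bool(uncertain_positions(chars, threshold))


# Hypothetical field read by the engine: the third character is uncertain.
field = [Char("1", 0.99), Char("9", 0.97), Char("B", 0.42), Char("4", 0.95)]
print(uncertain_positions(field))           # indexes the operator should look at
print(needs_review("birth_year", field))    # reviewed because of the uncertain character
print(needs_review("ssn", [Char("7", 0.99)], always_review={"ssn"}))  # critical field
```

Raising the threshold sends more characters to the operator; lowering it saves time but lets more errors through, which is the balancing act described above.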

After the characters and words have been checked in Data Capture, there is an additional step in quality assurance: business rules. In this process, the software applies arbitrary rules the organization creates and checks them against the fields. A good example might be “don’t enter anyone in the system whose birth year is earlier than 1984”. If such a document is found, it is flagged for an operator to check. These rules can be endless, and packages today make it very easy to create custom rules. The goal is to first deploy the business rules you already have in place in the manual operation, then augment them with rules that enhance accuracy based on the raw OCR results you are seeing.
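A business rule can be pictured as a small function that looks at the captured fields and returns a reason to flag the document. The sketch below is only an illustration: the record layout and the `check_record` helper are hypothetical, and the single rule is the birth-year example from the text.

```python
def birth_year_rule(record):
    """Example rule from the text: flag anyone born earlier than 1984."""
    year = record.get("birth_year", "")
    if not year.isdigit():
        return "birth_year is not a number"
    if int(year) < 1984:
        return "birth_year is earlier than 1984"
    return None


# An organization would list all of its rules here.
RULES = [birth_year_rule]


def check_record(record, rules=RULES):
    """Return the list of rule violations; an empty list means no operator is needed."""
    return [msg for rule in rules if (msg := rule(record)) is not None]


print(check_record({"birth_year": "1979"}))  # flagged for an operator
print(check_record({"birth_year": "1990"}))  # passes
print(check_record({"birth_year": "19B4"}))  # a misread also gets caught
```

Note that the same rule catches both a genuine business exception and an OCR misread, which is why rules are useful for accuracy as well as policy.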

In some more advanced integrations, the use of a database or body of knowledge is deployed as first round quality assurance that is also still automated.

These two quality assurance steps combined should give any company a chance to achieve the accuracy it is seeking. Companies that fail to recognize or plan for this step are usually the ones that have the biggest challenges using OCR and Data Capture technology.

Chris Riley – About


“You vote engines! Of course it’s better” – Reality of voting

May 15
2010

The trend of companies promoting OCR voting has become less common, but you will still occasionally find products that promote their accuracy by saying they don’t just use one engine, they use many and vote them together. The presumption of this approach is that of course they are more accurate than single-engine solutions. This would seem to be the case, but it’s not that easy.

All the OCR engines already have a system of voting internally. This is how OCR technologies have made their advances throughout the years. They take algorithms that are expert in one particular way of interpreting text, such as trigrams, words, fonts, etc., and vote their character guesses against each other for the final guess. This works great. It is very different from the voting that is often promoted, which takes several engines and votes their results together. When you take two separate OCR engines and vote them together, it would seem you are getting the best of what’s available, but there is one major problem. Voting requires that each engine report its guesses the same way, and this is not the case. For example, Engine A might report with 98% confidence that the letter “c” is actually an “e”, while Engine B might report with 78% confidence that it is a “c”. When you vote these two, Engine A will win even though it’s wrong. This is typically how it goes: one engine in a voting scenario will win most of the time, right or wrong, just because of how it reports its confidence levels.

This blog is not in combat with voting. Voting is a great tool, it’s used internally in the engines, and it can be used externally as well. How? Vote Engine A settings A against Engine A settings B. The same engine voted against itself just with different settings. This is a tremendous tool especially when dealing with varied documents, or highly degraded documents. By doing so you are comparing apples-to-apples confidence levels and not apples-to-elephants.
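To make the difference concrete, here is a small Python sketch of confidence-based voting. The `vote` function and all the confidence values other than the 98% and 78% from the example above are made up for illustration; no real engine's API is being shown.

```python
from collections import defaultdict


def vote(guesses):
    """Pick the character with the highest summed confidence.

    guesses: list of (character, confidence) pairs, one pair per pass.
    """
    totals = defaultdict(float)
    for char, conf in guesses:
        totals[char] += conf
    return max(totals, key=totals.get)


# Cross-engine vote from the text: Engine A is wrong but reports 98%,
# Engine B is right but reports only 78%. The louder engine wins.
print(vote([("e", 0.98), ("c", 0.78)]))

# Same engine, three hypothetical settings: the confidences are on the
# same scale, so adding them up is a fair comparison.
print(vote([("c", 0.81), ("e", 0.64), ("c", 0.77)]))
```

The arithmetic is identical in both cases; what changes is whether the numbers being compared mean the same thing.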

So next time you are turned on by voting, take a second look and see if it’s a marketed or real value.

Chris Riley – About


The trick of the inverted text

Dec 25
2009

The search for greater accuracy when it comes to document automation never stops. It’s true that with every new release, OCR technology has become so advanced that the jumps in accuracy are not what they were 10 years ago. Now, new versions of OCR engines contain enhancements for low-quality documents and vertical document types, but general OCR can’t get much better. Because of this, modern integrations need to find new tricks. This blog is full of them, but I’m about to explain just one more: OCRing inverted text.

OCRing inverted text is nothing new. Many document types have regions where white text is printed on a black background. The modern engines have the ability to read this text. Typically it’s not as accurate as OCR of black text on a white background, but it has its unique benefits, especially with complex document types such as EOBs and driver’s licenses.

There is a trick in using inverted text OCR to increase overall OCR accuracy. The method is to first OCR a document normally, then use imaging technology to invert the image. When you invert the image, the black text on a white background switches to white text on a black background. Once the inversion is done, run OCR again. By comparing the two OCR results, you have essentially voted the same engine against itself with little effort.

Large-volume processing environments can deploy this trick without loading a new OCR engine or applying different settings. It’s important to note that when using this technique, how you compare the two results is as important as the process itself. Typically you will assign more weight to the original version of the document than to the inverted one. There you have it, one more tool for increasing the OCR accuracy of the engine you already use.
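Here is a minimal sketch of the two pieces involved: inverting the image and comparing the two reads with more weight on the original. It assumes an 8-bit grayscale image stored as plain rows of pixel values and OCR results that are already aligned character by character; the `merge_results` helper and the 0.6 weight are hypothetical choices, not a recommended setting.

```python
def invert(image):
    """Invert an 8-bit grayscale image given as rows of pixel values (0-255)."""
    return [[255 - p for p in row] for row in image]


def merge_results(original, inverted, original_weight=0.6):
    """Merge two aligned reads, each a list of (char, confidence) pairs.

    Where the reads disagree, the weighted confidence decides, with the
    original image weighted higher than the inverted one.
    """
    inverted_weight = 1.0 - original_weight
    merged = []
    for (c1, conf1), (c2, conf2) in zip(original, inverted):
        if c1 == c2 or conf1 * original_weight >= conf2 * inverted_weight:
            merged.append(c1)
        else:
            merged.append(c2)
    return "".join(merged)


print(invert([[0, 255], [128, 10]]))

# Hypothetical reads of the same word from the original and inverted image.
normal_read = [("T", 0.90), ("o", 0.50), ("l", 0.30)]
inverted_read = [("T", 0.90), ("0", 0.60), ("1", 0.95)]
print(merge_results(normal_read, inverted_read))
```

In this made-up example the original read keeps the "o" despite a slightly lower raw confidence, while the inverted read wins the last character because its confidence is far higher.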

Chris Riley – About


What you OCR is what you get

Dec 02
2009

Often the purpose of doing Optical Character Recognition (OCR) for individuals and companies is to get a digital version of a document that the individual intends to edit and/or repurpose. This is not the most common use of the technology, but it is a use that requires specific attention.

In order to convert a document so that it is printable later on, it’s important to not only get the text from the document but also the format of the text. This includes layout as well as things such as graphics, and font colors. To do this, the OCR product must be able to recognize colors (requires color scanning), recognize font styles, and very importantly, recognize document structure.

Engines that support advanced document analysis have this. Document analysis (DA) is the process that happens before any text is read on a page. Document analysis makes sense of a document in order to improve recognition as well as to get the formatting required for a formatted export. First, document analysis finds document structure, i.e. columns, tables, text, paragraphs, and lines. Once this is done, it identifies colors in text and graphics. After document analysis has done its job, the recognition can begin. During recognition, the style of fonts is detected: bold, italic, underlined. All of this is put together into a result formatted as closely as possible to the input document.

For those individuals who are concerned about repurposing their documents, a straight text OCR engine will not work. Basic OCR engines get the text on the document in digital form and nothing more. For these individuals, it’s important to find a solution that has good document analysis.

Chris Riley – About


Playing tricks with images, up-sampling

Nov 05
2009

Often organizations have no control over the images they receive. Images can come via fax, which has a varied range of resolutions, or they can come as poor scans. Neither is good for data capture and OCR processes. Fortunately, there are a lot of imaging tools and tricks out there to help. None of these tools replaces a good scan, but some get close. One of the tools not often thought about is up-sampling.

Up-sampling is the process of taking an image at a lower resolution and increasing it to a higher resolution. The technology basically increases the resolution of an image, then fills in the new empty pixels with values predicted from the original image. For data capture and OCR, up-sampling is usually done from 150 DPI and 200 DPI to 300 DPI. Up-sampling technologies have become very impressive and useful. Often I will recommend up-sampling over working with the source that has lower resolution. But let’s talk about the facts and how and when you should consider up-sampling.
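To show what "filling in new pixels with predicted values" can look like, here is a minimal Python sketch using bilinear interpolation on a grayscale image stored as rows of pixel values. Bilinear interpolation is just one simple prediction method chosen for the example; commercial up-sampling tools may well use more sophisticated techniques, and the `upsample` function is not taken from any product.

```python
def upsample(image, src_dpi, dst_dpi):
    """Up-sample a grayscale image (rows of 0-255 values) with bilinear interpolation."""
    scale = dst_dpi / src_dpi
    src_h, src_w = len(image), len(image[0])
    dst_h, dst_w = round(src_h * scale), round(src_w * scale)
    out = []
    for y in range(dst_h):
        # Map each output pixel back to a fractional position in the source.
        sy = min(y / scale, src_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, src_h - 1)
        fy = sy - y0
        row = []
        for x in range(dst_w):
            sx = min(x / scale, src_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, src_w - 1)
            fx = sx - x0
            # Blend the four surrounding source pixels by distance.
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(round(top * (1 - fy) + bottom * fy))
        out.append(row)
    return out


tiny = [[0, 100],
        [100, 200]]
for row in upsample(tiny, 150, 300):
    print(row)
```

Every new pixel is a guess built from its neighbors, which is why the technique cannot recover detail that was never captured in the original scan.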

Up-sampling should be considered for documents that have little noise, such as watermarks, spills, stains, stamps, or speckling. Essentially, these are documents of good quality and scan but low resolution. You should also avoid up-sampling documents with closely spaced elements and crowded text. For noisy or crowded documents, it’s better to work with the source image as-is and work around the problems.

The bigger the gap between the source resolution and the desired resolution, the greater the risk of artifacts after up-sampling. For example, 150 DPI to 300 DPI will not yield the quality that 150 DPI to 200 DPI will. This is why going crazy and up-sampling to the highest possible resolution is not a good idea. It’s like taking a very small image and trying to zoom in as far as you can to get details that you probably won’t. Trying to trick the system will only hurt you. Up-sampling from 150 DPI to 200 DPI and then again to 300 DPI would not be better than just converting from 150 DPI to 300 DPI. In fact this would be a pretty big mistake. Essentially, you are magnifying the mistakes created during up-sampling, as they now get propagated twice. This will likely decrease your quality and can result in such things as bloated characters, fuzzy characters, or an abundance of speckling. The goal is to do as few conversions on the document as possible.

I will always defer to a proper scan over any image techniques, but when you do not have control of the image scan, one of the image tools to consider is up-sampling. Uneducated use of the technology is unsafe as is true with all advanced technologies, but if you stick with the facts, and pick a great technology you will be successful.

Chris Riley – About


Multi-Pass document recognition

Nov 04
2009

When accuracy is the primary concern in document recognition, the best technique is multiple passes of the OCR or recognition process. Similar to how you would have a document manually entered two to three times, why not have an OCR engine convert it 3, 4, or even 5 times all with different settings?

The important thing to note in multiple-pass recognition is that you NEVER use a different engine for the same process. Reconciling results from two separate engines is self-defeating. This is often called voting, and it does not work because each engine represents the confidence of characters differently, so you might end up always picking the less accurate engine just because it told you it was more confident than the more accurate one. But using the same engine multiple times with different settings is consistent and a good idea.

An example of a scenario where this is being used successfully is with documents that have both machine and hand-printed text. A first read can be done with an OCR engine with settings A, and a second read can be done with the same OCR engine but with settings B. Areas where both reads produce just garbage text might indicate hand print. Now you can use ICR (a hand-print engine) in that region to pick up additional information. That is 3 total passes of recognition. The results are combined to make the final document.

At minimum, 3 runs of the same engine would be ideal, as the statistical chance of two different settings producing the same error drops drastically and the final output is nearly as good as it’s going to be. Some document types lend themselves to multiple-pass recognition more than others. Sometimes it’s determined by the environment. For example, environments that have a lot of traditional documents mixed with invoice-like documents would benefit from having a full-page read with standard settings on every page and a full-page read with special document analysis designed for documents with lines and tables.
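One simple way to reconcile several passes of the same engine is a per-character majority vote. The sketch below assumes the passes return strings of equal length that line up character by character, which real output often will not without an alignment step; the `majority_vote` helper and the sample reads are hypothetical.

```python
from collections import Counter


def majority_vote(passes, unknown="?"):
    """Combine equal-length reads of the same text, character by character.

    A character is accepted only if more than half of the passes agree on it;
    otherwise the position is marked for review with the `unknown` marker.
    """
    result = []
    for chars in zip(*passes):
        char, count = Counter(chars).most_common(1)[0]
        result.append(char if count > len(passes) / 2 else unknown)
    return "".join(result)


# Three hypothetical passes of one engine with different settings.
reads = ["Invoice 1O23", "Invoice 1023", "lnvoice 1023"]
print(majority_vote(reads))
```

Each pass here makes a different mistake, so the other two outvote it. With only two passes a disagreement cannot be settled this way, which is one reason three runs is a sensible minimum.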

While multiple pass OCR slows down the entire process, it’s still faster and more accurate than manual entry most of the time. I recommend this approach for any organization where accuracy is the primary concern.

Chris Riley – About


Data Capture – Problem Fields

Oct 21
2009

The difference between easy data capture projects and more complex ones often has to do with the type of data being collected. For both hand-print and machine-print forms, certain fields are easy to capture while others pose challenges. This post discusses those “problem fields” and how to address them.

In general, fields that are not easily constrained and don’t have a limited character set are problem fields. Fields that are usually very accurate and easy to configure are number fields, dates, phone numbers, etc. Then there are the middle-ground fields, such as dollar amounts and invoice numbers. The problem fields are addresses, proper names, and items.

Address fields are, to most people, surprisingly complex. Many would like to believe that address fields are easy. The only way to capture address fields very easily would be to have, for example in the US, the entire USPS database of addresses that they themselves use in their data capture. It is possible to buy this database. If you don’t have this database, the key to addresses is less constraint. Many think that you should specify a data type for address fields that starts with numbers and ends with text. While this might be great for 60% of the addresses out there, by doing so you have reduced the accuracy on every exception address to 0%. It’s best to let the engine read what it’s going to read and only support it with an existing database of addresses if you have one.
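Supporting an unconstrained read with a database usually means fuzzy matching the OCR result against known values. Here is a minimal sketch using Python's standard `difflib`; the address list, the 0.8 cutoff, and the `match_address` helper are all assumptions for illustration, and a real address database would be far larger and need faster lookup.

```python
import difflib

# Hypothetical stand-in for a purchased or in-house address database.
KNOWN_ADDRESSES = [
    "12 Main Street",
    "One Market Plaza",
    "PO Box 450",
]


def match_address(ocr_text, known=KNOWN_ADDRESSES, cutoff=0.8):
    """Return the closest known address, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(ocr_text, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(match_address("12 Main Streel"))    # one misread character
print(match_address("0ne Market P1aza"))  # does not start with a number, still matches
print(match_address("zzzz"))              # nothing close: leave for an operator
```

Notice that the second address would have been rejected outright by a "numbers first, then text" data type, yet the lookup handles it without any structural constraint.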

Proper names are next in complexity to addresses. Proper names can be a person’s name or a company name. It is possible to constrain the number of characters and, for the most part, eliminate numbers, but the structure of many names makes recognizing them complex. If you have an existing database of the names that would appear on the form, you will excel at this field. As with addresses, it would not be prudent to create a data type constraining the structure of a name.

Items consist of inventory items, item descriptions, and item codes. Items can either be a breeze or very difficult, and it comes down to the organization’s understanding of their structure and whether it has supporting data. For example, if a company knows exactly how item codes are formed, then it’s very easy to process them accurately with an associated data type. The best trick for items is again a database with supporting data.

As you can see, the common trend is finding a database with existing supporting data. Knowing the problem fields focuses companies and helps them form a plan of attack for creating very accurate data capture.
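When the structure of an item code is known, a data type can be expressed as a pattern. The sketch below invents a format (three uppercase letters, a dash, four digits) purely for illustration, and the table of letter-to-digit fixes is likewise an assumption about common confusions rather than something any particular package does.

```python
import re

# Hypothetical item code format: three uppercase letters, a dash, four digits.
ITEM_CODE = re.compile(r"[A-Z]{3}-\d{4}")

# Letters that are often confused with digits, applied only where digits are expected.
DIGIT_FIXES = str.maketrans({"O": "0", "I": "1", "S": "5", "B": "8"})


def normalize_item_code(text):
    """Repair digit positions using the known structure; return None if it still does not fit."""
    prefix, sep, digits = text.partition("-")
    candidate = prefix + sep + digits.translate(DIGIT_FIXES)
    return candidate if ITEM_CODE.fullmatch(candidate) else None


print(normalize_item_code("ABC-1O2S"))  # letters in digit positions are repaired
print(normalize_item_code("ABC-1234"))  # already valid
print(normalize_item_code("AB-1234"))   # wrong structure: send to an operator
```

Because the format says where digits must appear, an "O" in those positions can safely be read as a zero, something that is not possible for an unconstrained field such as a name.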

Chris Riley – About


Guarantees, Guarantees, Guarantees

Oct 09
2009

One of the most popular questions organizations ask when purchasing data capture or OCR software is “what accuracy can you guarantee?” If you have ever asked a vendor this question, you got one of two responses: the first was a percentage of accuracy, the second was a long explanation of why they can’t guarantee anything. If the vendor gave you a percentage you should probably run, because it’s the start of a bad relationship.

Why? It’s not really possible for a vendor to tell you how accurate your recognition will be on your documents. Vendors can estimate accuracy based on samples, and they can give you an idea of range, but because of the nature of the technology there is no way to guarantee anything. The first fact of OCR is that you can ALWAYS find a document that breaks the norms of recognition and accuracy. Because of this possibility, it’s hard to know how exception documents will affect the accuracy of the entire system. So let’s talk about what is reasonable.

It is reasonable to provide a sample set of documents and expect an average accuracy level as a percentage on the samples. Because they are a discrete subset of documents, this is something that can actually be measured. It is the job of the organization to pick samples that most closely represent production. It would be wise to include bad, average, and good documents in the sample set so as to cover the entire range of possibilities.

What organizations often forget is that even if only 50% of the documents are automated, there is a cost savings compared to manual entry. The industry standard for accuracy is 85%; however, this changes heavily based on document type and the organization’s perception of accuracy. The ideal way to measure accuracy is to compare recognition results to truth data. If truth data is not available, the next best thing is to count not accuracy but the level of uncertainty on the document. If a document is 5% uncertain according to the OCR engine, then it is 95% certain, and this should be your measure.
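Both measures are easy to compute on a sample set. The following Python sketch is one possible approach, not an industry-standard formula: `character_accuracy` approximates accuracy by counting characters that line up between the OCR result and the truth data using `difflib`, and `certainty` applies the 5% uncertain, 95% certain arithmetic from above. Both function names are made up for the example.

```python
import difflib


def character_accuracy(ocr_text, truth_text):
    """Approximate share of characters that agree between OCR output and truth data."""
    if not truth_text and not ocr_text:
        return 1.0
    matcher = difflib.SequenceMatcher(None, truth_text, ocr_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # Dividing by the longer text penalizes both missing and extra characters.
    return matched / max(len(truth_text), len(ocr_text))


def certainty(total_chars, uncertain_chars):
    """Fallback measure when no truth data exists: 1 minus the uncertain share."""
    return 1.0 - uncertain_chars / total_chars


print(f"{character_accuracy('he1lo world', 'hello world'):.1%}")
print(f"{certainty(200, 10):.0%}")
```

Keep in mind that the second number only reflects what the engine itself doubts, so it will miss errors the engine was confident about.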

Next time a vendor is faced with the question of “how accurate are you?” or “what accuracy do you guarantee” I hope they issue the proper response of “how accurate will your process allow us to be?”. It’s a fair question to ask when you are not familiar with the technology, but hopefully the above gives you the proper approach to measuring a solution.

Chris Riley – About


You can read the fine-print

Oct 08
2009

As fonts get smaller, the challenge of reading them with OCR software increases. However, there are some key things that organizations should be aware of when reading the fine print.

OCR technology today is capable of reading fonts as small as 8 pt or even 6 pt very accurately. It used to be the case that unless you had a 12 pt font you stood no chance. Because of the increased quality of scans and more advanced OCR engines, reading small fonts will not be a problem if the right approaches are used.

Small fonts have a higher sensitivity to image quality and to degradation of the document. For this reason, original source images scanned at 300 DPI or higher are necessary. For normal fonts there is seldom reason to scan higher than 300 DPI, but for small fonts the goal is to get them to appear more or less the same as the regular fonts, so scanning them at 400 to 600 DPI is useful. Additionally, it is very important that documents are “clean”. A smudge or spill on a document impacts smaller fonts many times more than a larger font because of the closeness of lines. Once you have good image quality you can start the conversion.
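The 400 to 600 DPI range follows from simple arithmetic: a point is 1/72 of an inch, so the nominal height of text in pixels is the point size times the DPI divided by 72. The short sketch below works this out; the function names are made up, and the point size is only the nominal font height, so actual glyphs will be somewhat smaller.

```python
POINTS_PER_INCH = 72


def text_height_px(font_pt, dpi):
    """Nominal height of a font in pixels at a given scan resolution."""
    return font_pt * dpi / POINTS_PER_INCH


def dpi_to_match(small_pt, reference_pt=12, reference_dpi=300):
    """Scan resolution at which a small font appears as tall as the reference font."""
    return reference_pt * reference_dpi / small_pt


print(text_height_px(12, 300))  # 12 pt at 300 DPI
print(text_height_px(6, 300))   # 6 pt at the same resolution is half as tall
print(dpi_to_match(8))          # resolution needed for 8 pt
print(dpi_to_match(6))          # resolution needed for 6 pt
```

Under these assumptions, 8 pt text needs about 450 DPI and 6 pt text about 600 DPI to look like 12 pt text at 300 DPI.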

The next biggest benefit for small fonts comes from zoning them separately. Zoning is the process of rubber-banding the region where the text exists. When small fonts are grouped in the same zone with normal-sized fonts, the OCR software assumes that they should be the same size, and the confidence and accuracy go down. If you zone the small fonts separately, you increase the OCR engine’s ability to use experts just for small fonts and increase the accuracy on them.

Next time someone tells you to read the small print, tell them you won’t read it, you will scan and OCR it.
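Zoning is normally done by an operator or by the product's own layout tools, but the idea of keeping small text apart can be sketched in a few lines of Python. The example assumes text lines have already been detected and measured; the `split_zones` helper, the line heights, and the 0.75 ratio are all hypothetical.

```python
def split_zones(lines, ratio=0.75):
    """Group consecutive text lines into zones of similar height.

    lines: list of (text, height_in_pixels) pairs in reading order.
    A new zone starts when a line's height differs too much from the
    first line of the current zone.
    """
    zones = []
    for text, height in lines:
        if zones:
            reference = zones[-1][0][1]
            if min(height, reference) / max(height, reference) >= ratio:
                zones[-1].append((text, height))
                continue
        zones.append([(text, height)])
    return zones


page = [("Terms of Service", 50), ("Please read carefully", 48),
        ("fees may apply", 25), ("see reverse for details", 24)]
for zone in split_zones(page):
    print([text for text, _ in zone])
```

Each resulting zone can then be sent to the engine on its own, so the fine print is not judged against the size of the headline above it.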

Chris Riley – About
