Check-mark accuracy, all or none

Nov 20
2009

Check-mark processing (OMR) is one of the most accurate recognition technologies. Companies that properly utilize OMR are able to process documents quickly and accurately. But for the same reason OMR is accurate, it can also be very inaccurate when not used properly.

For the most part, OMR is an all-or-nothing technology. Unlike OCR, with its varying degrees of accuracy and uncertainty, in OMR a field is either checked or not. Accuracy and uncertainty come into play when you deal with collections of check-marks, where the technology compares the results of all of them to see which ones are most likely checked. The three areas where organizations make mistakes when using OMR are: improper OMR type, poor thresholds, and bad rules.

Many think of OMR fields as the traditional bubble on school tests. But there are several types of OMR fields: Rectangle, Round, Automatic, and White Field. Unlike text recognition, the wrong field type selection in OMR results in 100% incorrect results, most of the time.

Rectangle and round are the traditional fields that come to mind when thinking of check-marks. The technology used to process these also includes a way to tell if a field has been corrected (slashed out, and the answer changed). For these fields, the borders of the field are detected, and when a high enough number of black pixels is found within the border, the field is considered checked. The only time this will not be the case is when a field has been detected as having a correction.
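To make the "high enough number of black pixels within the border" idea concrete, here is a minimal sketch. It assumes the border has already been detected and removed, so each field is just its interior as a grid of 1 (black) and 0 (white). The names `fill_ratio` and `is_checked` and the 25% threshold are made up for illustration, not taken from any OMR product, and correction detection is left out.

```python
def fill_ratio(field):
    """Fraction of black pixels in a field interior (1 = black, 0 = white)."""
    total = sum(len(row) for row in field)
    black = sum(sum(row) for row in field)
    return black / total if total else 0.0


def is_checked(field, threshold=0.25):
    """A field counts as checked once its fill ratio reaches the threshold.

    The 0.25 default is a hypothetical value for this sketch only.
    """
    return fill_ratio(field) >= threshold


# Interior of an empty box: no black pixels inside the border.
empty_box = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# Interior of a box with an "X" drawn in it: 8 of 16 pixels are black.
marked_box = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]

print(is_checked(empty_box))   # False
print(is_checked(marked_box))  # True
```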

Automatic field types are for those forms that have non-traditional border types for their fields, OR have some sort of text already existing in the field. For example, if you scan a Scantron form as a black and white image without dropout, you will get for each field a round circle with some letter or number printed in the middle. In this case you would have to use the automatic field type. What happens is that the software compares an EMPTY form to the form being processed. If, for example, a field has the letter “A” printed in the middle, the software will count how many pixels in the field the A consists of and use that as a baseline. For a field to be checked, it will have to contain some number of black pixels OVER the baseline. If in this case you used a rectangle or round check-mark type field, all fields would be considered checked because no baseline was established. Finally, there are white fields.
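The baseline comparison can be sketched in a few lines. This is only an illustration under simple assumptions: the empty form and the processed form are perfectly aligned, fields are grids of 1 (black) and 0 (white), and the function names and the `min_extra` value are hypothetical.

```python
def black_pixels(field):
    """Count black pixels in a field (1 = black, 0 = white)."""
    return sum(sum(row) for row in field)


def is_checked_automatic(blank_field, processed_field, min_extra=3):
    """Checked only if the processed field has enough black pixels OVER
    the baseline taken from the same field on the empty form.

    min_extra is a made-up value for this sketch.
    """
    baseline = black_pixels(blank_field)
    return black_pixels(processed_field) - baseline >= min_extra


# Field on the EMPTY form: a printed letter "A" already uses 10 pixels.
blank_field = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
]

# Same field on a processed form, filled in over the letter: 14 pixels.
marked_field = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
]

print(is_checked_automatic(blank_field, blank_field))   # False: nothing over the baseline
print(is_checked_automatic(blank_field, marked_field))  # True: 4 pixels over the baseline
```

Without the baseline, the printed letter alone fills more than half of this field, which is why a plain rectangle or round field would call every one of them checked.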

White fields are check-mark fields that have no border. They are most often found on forms that use dropout scanning, or sometimes they are used for unique and cool cases such as detecting signatures. These are a useful type of check-mark that simply expects there to be no border and no printed text in the field area. If there is a small number of black pixels in the field area, it’s considered checked. If you use a white field on a rectangle OMR field it will always be considered checked because of the borders. The biggest challenge for white fields is that the size of the field directly impacts its accuracy, so proper sizes must be chosen. All check-marks have thresholds assigned to them.
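A bit of arithmetic shows why the size of a white field matters so much. The numbers below are invented for illustration: the same 6-pixel speck is measured against a small and a large field area.

```python
def percent_black(black_pixel_count, width, height):
    """Black pixels as a percentage of the whole field area."""
    return 100.0 * black_pixel_count / (width * height)


speck = 6  # hypothetical stray mark, in pixels

small_field = percent_black(speck, 10, 10)  # 6 of 100 pixels
large_field = percent_black(speck, 40, 40)  # 6 of 1600 pixels

print(small_field)  # 6.0
print(large_field)  # 0.375
```

With a percentage threshold somewhere between those two values, the speck would check the small field and be ignored in the large one, while a light but genuine mark could be missed in the large field for the same reason.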

A threshold is the setting that determines the number of pixels (as a percent) required before a field is considered checked. Organizations almost never need to change the default thresholds, and changing them is one of the biggest mistakes that is made. Most OMR processing packages have default thresholds for all field types. These vendors have done the research to know what the optimum field threshold is for both accuracy and avoiding false positives. When companies pick the wrong threshold, fields get considered checked when they are not, and the other way around. The problem is that most of these are never reviewed, because the custom thresholds keep them from being flagged. That creates a false positive, the worst possible outcome of any exception.

As with all data capture and forms processing tools, there is usually a step of validation and rules. For whatever reason, organizations tend to over-think the rules associated with check-marks. The most common rule is that for any given collection of check-marks associated with a single question, only one, or only certain combinations, can be checked. So for example, for a multiple choice question that asks for one answer, if the software sees two checked it will flag both fields. These rules are very useful, but when improperly implemented they result in either too much verification of fields, which is OK but a time waster, or false positives like the ones bad thresholds create. Sometimes the rules are applied during recognition and thus affect recognition results. For example, a question that has no answer, but where one is expected, is forced to have an answer. It’s easy to blame the software, but most of the time it’s just a bad rule.
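Here is a rough sketch of the single-answer rule described above. The function name and return shape are my own assumptions. The one design choice worth noting is that an unanswered question is flagged for review rather than forced to an answer.

```python
def validate_single_choice(marks):
    """Apply a "one answer only" rule to one question.

    marks maps each option label to True (checked) or False.
    Returns (answer, flagged_fields).
    """
    checked = [label for label, is_on in marks.items() if is_on]
    if len(checked) == 1:
        return checked[0], []
    if not checked:
        # No answer found: flag the whole question instead of forcing one.
        return None, list(marks)
    # More than one answer: flag only the conflicting fields.
    return None, checked


print(validate_single_choice({"A": False, "B": True, "C": False}))
# ('B', [])
print(validate_single_choice({"A": True, "B": True, "C": False}))
# (None, ['A', 'B'])
print(validate_single_choice({"A": False, "B": False, "C": False}))
# (None, ['A', 'B', 'C'])
```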

OMR is a great tool when used right because it’s extremely fast and accurate, but when it’s used wrong, it’s still fast but just extremely inaccurate.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Don’t over think business card scanning

Nov 19
2009

I am amused at the detail of thought and complexity people put into business card scanning, or BCR. In my time in the enterprise content management industry, I’ve scanned hundreds of thousands of business cards. Yes, I know that is a lot, and no, I’m not that popular. The vast majority of the scanning was for testing business card scanners and the OCR/BCR technology that reads them. But the cards I do want to keep are scanned and stored for later retrieval. I DO NOT use a specific card scanner, nor do I use a special card scanning application. I truly keep it simple, and in my experience this is the right way to do it.

Card scanners are inexpensive and useful when all you are scanning is business cards, but why not scan all your documents? ADF document scanners today support feed trays that dial down to the size of business cards. They are also able to scan stacks of business cards, front and back, without a problem. So all I need is one scanner for all my documents. In my extensive testing, the image quality difference is negligible, and the top three leading desktop scanners actually produce higher quality.

Now that you have the image, how do you get the data? Most people want to use a dedicated BCR technology to extract each data element from the card. I too am very amused with this and have had fun setting up systems to do it. But as a practicality, it does not make much sense to me. BCR that extracts separate fields such as name, email, etc. can be very accurate. But when it’s not accurate, it’s a problem. It takes a lot of time to correct a problem if it occurs, but more often than not you don’t even know the problem occurred, so you get, for example, part of an address in the phone number field and miss the phone number completely. You will only know this is the case when it’s time to follow up with people. My second practicality complaint is that you are adding one more program to operate and use regularly. We all have our favorite email client or CRM where we keep all our contacts. Most BCR applications promote their ability to export to the most popular email clients, but as soon as there is an update to your database application you also have to buy a new version or might be stuck. Some BCR applications do not even allow the export of data, so you have to manually copy and paste anyway. Is this really necessary?

OK, it’s only fair now for me to tell you how I do it. I scan my business cards with my ADF document scanner to a hot-folder. This hot-folder is watched by a full-page OCR system and the cards are automatically converted to searchable PDFs. I am not getting field data and I’m not saving into a separate application. I’m making searchable PDFs just like all my other documents. This has worked very well for me for the past 4 years. When it comes time for me to find someone, all I really want is to see the card and the info. Most likely, if it’s really important, I’ve already emailed them and captured their information that way. With my system I can search for people, websites, company names, even topics to find the cards of the people I want. I don’t have to worry about searching in a special UI field by field. Nor do I have to worry about missing data, as full-text is not at the mercy of field extraction; it gets everything that is readable. In the areas of business where card scanning is used for reading medical insurance cards and driver’s licenses, the technology is very useful and necessary. I’m speaking only of personal business card scanning.
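To show how little is needed on the search side, here is a toy full-text search. It assumes the OCR system has already produced the text layer of each searchable PDF and that the text is available as a plain string per file. The file names, the card contents, and the `search_cards` helper are all invented for illustration.

```python
def search_cards(card_texts, query):
    """Return the cards whose full text contains every word of the query.

    card_texts maps a file name to the OCR text of that card.
    Matching is case-insensitive and not tied to any field.
    """
    terms = query.lower().split()
    return sorted(
        name
        for name, text in card_texts.items()
        if all(term in text.lower() for term in terms)
    )


# Hypothetical OCR text from two scanned cards.
cards = {
    "card_001.pdf": "Jane Example  Acme Imaging  Director of Capture  jane@example.com",
    "card_002.pdf": "John Sample  Example Scanners Inc.  OCR and data capture sales",
}

print(search_cards(cards, "acme"))       # ['card_001.pdf']
print(search_cards(cards, "ocr sales"))  # ['card_002.pdf']
print(search_cards(cards, "capture"))    # ['card_001.pdf', 'card_002.pdf']
```

A name, a company, or a topic all go through the same search, because nothing depends on the text having landed in the right field.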

In my experience, most users of card scanners and BCR applications use them very actively for a month or two, and soon the use dies to nothing and they revert to manual entry. I’m not doing manual entry; I’m using the latest technology in one unified process with full-text search. I’m just keeping it simple and practical.

Chris Riley – About


Why buy what I already own!?

Nov 18
2009

Many people inherit full-page Optical Character Recognition (OCR) technology by simply purchasing a scanner or a multi-function (MFP) device. All these pieces of hardware include various software packages and OCR is one of the most common. Often the software is never used or the use isn’t always clear. Other times, the bundle is a tight integration with the hardware and the OCR is a part of configuration of the scanner and is used during scanning unknown to the user.

Bundled OCR technology is the easiest way to learn through use and to get the technology for a low price. Bundled software has contributed a great deal to market education and understanding of the advanced technologies. All the top OCR engines have a consumer product bundled with a document scanner or multi-function device. But because it’s already there, it leaves many wondering why you would ever purchase the software directly.

For many, the bundled OCR is sufficient. Their documents are clean, and they have no need for advanced options. But others just need more. This is why more advanced versions exist. Bundled OCR, even from the best vendors, is a limited or older version of the product. Some of the vendors make a special “bundle only version”, while others choose to incorporate non-current versions. Buying the software directly gets you the latest technology with the best features, but the biggest driver to purchase is a greater, more specific need for OCR functionality. This could be because you are scanning old documents or degraded documents, or because you need special settings such as compression and PDF/A functionality that are simply not found in bundled versions.

Vendors don’t make any money on bundled OCR other than to cover costs. Because vendors for the most part use bundled versions as marketing, they don’t incorporate the latest, greatest, and most advanced features. For those to whom the document conversion process is very important, there is a clear benefit in quality OCR packages.

Chris Riley – About


Outsourcing document recognition

Nov 16
2009

It’s common for organizations to outsource their scanning and document conversion. Organizations sometimes find that the skill required, the convenience factor, and the liability are worth the additional cost. Other organizations that have one-time backlog conversions save money by using an outsourcing company vs. bringing the software in-house. In recent years, service bureaus and business process outsourcing companies have dramatically improved their use of recognition technology, and prices have dropped substantially. However, as an organization that chooses to outsource, you are giving up the responsibility of picking the document conversion technology. Shouldn’t you want to know what technology your service bureau is using?

YOU SHOULD! Absolutely you should be concerned about the OCR and Data Capture technology that your outsourcing company is using. It’s just as important as if you were bringing the technology in-house. It’s your job to make sure your vendor is using the best technology, and using it in the best way. The education level differs between outsourcing companies, and each often specializes in one document type or one type of processing. Proper evaluation of a service bureau will include reviews of sample results. You should have your prospective service bureau or BPO run a good number of your production documents and provide you with results. Make sure the technology they used to produce the results is the same one that will be used in production. Don’t be afraid to ask the vendor what engine or engines are being used, and even what version. Make sure you understand how your vendor handles exceptions.

While it’s easy to overlook these items when you are looking at a service instead of a technology, it’s still important that you are educated. Service bureaus make money based on how much they save. This can occasionally create motives to use poor technology to gain greater margins. Some outsourcing companies put customers into categories by volume, and those with the greater volume get the best technology. Most of the outsourcing companies out there are very good at ensuring their document quality, and many will even go as far as to give you a guarantee on quality. But the nature of production environments is such that you cannot always check everything. It’s about the relationship. Sometimes paying a higher price per page for a better solution is worth it!

Chris Riley – About


On the fly OCR – Click-Entry and Rubber-band OCR

Nov 11
2009

If you were to put the degrees of automation on a scale, you would have first no automation, then semi-automation, and then the varying degrees of full automation, which depend on system accuracy. No automation is of course manual entry of documents into an organization. Full automation is an attempt to collect all data automatically from the document, using manual labor only when required for exceptions and quality assurance steps. The degree of automation here depends on accuracy: the lower the accuracy, the more exceptions there will be and the fewer documents in quality assurance.

Semi-automated data capture and OCR has not been thought about much. The primary reason for this is that when document automation technology was introduced, people wanted to go full force. It was a combination of poor market education and grand dreams. Semi-automated is an intermediary step where the operator will see every image, but their time spent per image is far less than with manual entry. It allows organizations to start using the technology with less risk, more control, and lower cost. The challenge with the adoption of semi-automated data capture is that it’s hard to change from or upgrade. Some packages out there allow you a seamless move into full automation, but then you are stuck with that solution. Now that you know what it is, how does it work?

Semi-automated data capture is very basic. When an image is scanned, it is displayed for the operator in as much screen real estate as possible. If it’s a click-entry solution, then a full-page OCR read has already happened, and if it’s a rubber-band solution, then it’s just the image. In both scenarios, the operator has a field list on some other portion of the screen, which they fill by locating information on the page. With click-entry, since the OCR is already done, they highlight the word or words on the document they want to put in the field and they click. When they click, the text is transferred to the next unpopulated field. In rubber-band OCR, all the fields are rubber-banded first, then a “read” button is clicked and the text is populated into each field.
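The click-entry behavior ("the text is transferred to the next unpopulated field") can be sketched like this. The field names and the `click_entry` helper are hypothetical, and the OCR text is assumed to be available already for whatever the operator clicks.

```python
def click_entry(fields, clicked_text):
    """Put the clicked OCR text into the next unpopulated field.

    fields is an ordered dict of field name -> value (None when empty).
    Returns the name of the field that was filled, or None if all are full.
    """
    for name, value in fields.items():
        if value is None:
            fields[name] = clicked_text
            return name
    return None


# Hypothetical field list shown next to the image.
fields = {"invoice_number": None, "invoice_date": None, "total": None}

print(click_entry(fields, "INV-1001"))    # invoice_number
print(click_entry(fields, "11/11/2009"))  # invoice_date
print(fields)
# {'invoice_number': 'INV-1001', 'invoice_date': '11/11/2009', 'total': None}
```

A rubber-band solution would differ only in when the OCR happens: the operator would draw the regions first, and the read would fill every field at once.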

Semi-automated data capture is becoming more popular with organizations that are limited by budget or scared of adopting full automation and, surprisingly, with companies that adopted full automation but did not do it well. I very much believe in full document automation, but semi-automated data capture has a necessary place in the spectrum of document automation.

Chris Riley – About


Re-OCR, Lessons learned

Nov 09
2009

To my surprise, I still receive requests from companies needing to start over on their OCR processes. These are companies that used the technology, did not plan, and now find themselves in a situation where they have to repeat OCR efforts. These companies fall into two categories.

The first category is companies that find they have processed large volumes of paper and the accuracy was not what they expected. This can be discovered in a relatively short time-frame or long after initial integration of the technology. The remedy can be as easy as fixing bad settings for a particular document type or as bad as replacing a poorly chosen software solution.

For companies in category one, it’s truly a lesson-learned scenario. I will work with these companies to evaluate proper OCR settings and to test prospective engines. My hope is that the company at least scanned their documents at a high enough quality that the already scanned images can be used for backlog conversion, instead of a re-scan, if a re-scan is even possible.

The second category is companies that discover they were collecting too little data from their documents. This usually happens in data capture environments where companies configure capture of 3 key fields only to find later that 2 additional fields were required for downstream processes. Depending on the severity, it’s often better to do day-forward processing with proper settings on new documents and to key in the missing fields for the old ones. The reason for this is that the work of getting the additional fields and reconciling old documents sometimes takes away from day-forward production and may not be worth the additional cost it imposes. The other common practice is to run the backlog documents from scratch through the new process.

The trend in both categories is due to improper planning by the organization before evaluating technology. It’s important for companies to take the time to plan for capture technology. A part of this planning is looking ahead at the future need for the data. One of the best tricks for exposing the requirements is to involve ALL constituents that create, use, and benefit from extracted data. Plan, Plan, Plan.

Chris Riley – About


OCR and Paste

Nov 09
2009

You probably use the copy and paste functionality on your computer daily. I too use copy and paste on a regular basis, but I also use OCR and paste nearly as much. OCR and paste is what I call the process of selecting a region on your computer screen and using OCR to read that region as a screen-shot and convert it to text. Even to my surprise, it has become quite the habit and one of my favorite ways to move data from one location on my computer to another. Many wonder why this might be the case, as most information on the screen is available as text anyway. The reasons are that it’s more efficient than copying and pasting into a program, it maintains the structure of information using document analysis, and there are times when the information I want is not in text form but in an image only.

I have actually taken it one step further and used the technology to automate the extraction of data from web pages that are scroll heavy. Instead of scrolling forever for information on a web page, I can use the tool to take a screenshot of the entire web page and convert it to text for me. You can imagine how the technology could be used maliciously, but in this case, it’s just to get information.

The ability of OCR to read screen-shots is quite impressive. Though screen-shots usually come out at a low 72 or 96 DPI resolution, which is traditionally not optimal for OCR, the text, including text in images, is what is called pixel perfect, so it makes an excellent candidate for conversion. Also, by leveraging the document analysis technologies built into OCR, I can grab a table and have it exported as a table, versus having to copy and paste text and manipulate it back into its original form later.

When you become an expert in OCR, you find yourself using the technology in the oddest places, but this is one case where my productivity has increased because of the tool, and I think it’s worth sharing. I suspect that OCR of screen-shots is only going to increase in the future. Because of this, and because of malicious uses, so will counter-malware technologies. It is also a very easy way to convert data from a locked-down legacy system to a new one.

Chris Riley – About


Turning off the latest technology

Nov 06
2009

Our culture is built on the belief that newer and more mean better. In the advanced technologies that exist, this is for the most part true, but people are always surprised when I tell them that disabling some of the newer technology will actually produce a better result. I am going to give you three examples of where technology demands time travel to older approaches for higher accuracy.

In data capture and OCR, there is a component of the technology called document analysis. Document analysis, prior to any collection of data, determines the structure of a page, including columns, rows, tables, pictures, paragraphs, lines, etc. It’s the biggest contributor to modern day OCR accuracy. Document analysis is really designed for documents that are more traditional, such as an article, a book page, or a letter. Document analysis (although there have been special ones) does not excel at form type documents. One of the most difficult documents in the world is an Explanation of Benefits (EOB). This document typically has its own structure per variant. Surprisingly, the best way to process such a document is to turn off document analysis and default to a basic full-page read of the text. The reason for this is that document analysis has an overwhelming bias for tables that no EOB will match.

It is the same case when reading text from photographs. Reading text from license plates and product plates (serial number plates welded or stuck to many products) during assembly is best done with engines that do not have document analysis. In this case, the document analysis is trying too hard to find information. Because of the nature of these images, what ends up happening is that characters in the photo are split into multiple lines and characters. Without document analysis, the engine sees the whole image as one text block and just reads it, thus creating better results. The license-plate readers that snap pictures of your license plate at toll booths are all using older, antiquated OCR technology. By turning off document analysis they can use the newer engines.

Finally, there is mobility. This one makes a lot of people uncomfortable. Our society wants to believe a cell phone can do anything. Just today I was wondering why my cell phone did not brush my teeth for me. You can have your cell phone do OCR, sure, but it requires older, smaller, and limited OCR engines to do so. I prefer to send an image to a server and use more advanced OCR, but many demand OCR on the phone, though in practice it’s usually slower. The reason for this is that OCR requires specific processing power and specific types of processing. Chips in phones today, and likely for a very long time to come, will not compete with the power of a computer, nor, most importantly, will they include the proper math operators it takes for efficient, math-heavy modern OCR. Cell phones cannot adopt proper chips because we demand long-lasting batteries, small size, and low cost. Intense math is simply not important for 99.9% of mobile applications.

There you have it: modern OCR taken down a few notches to solve current day problems. The best engines that exist today allow you to turn on and off all the various functionality you need, making it possible to purchase the latest OCR technology and limit it however you need. Most organizations don’t understand why anyone would want to turn off the new, but today I’ve shown that new is not always better!

Chris Riley – About


Playing tricks with images, up-sampling

Nov 05
2009

Often organizations have no control over the images they receive. Images can come via fax, which has a varied range of resolutions, or they can come as poor scans. Neither is good for data capture and OCR processes. Fortunately, there are a lot of imaging tools and tricks out there to help. None of these tools replace a good scan, but some get close. One of the tools not often thought about is up-sampling.

Up-sampling is the process of taking an image at a lower resolution and increasing it to a higher resolution. The technology basically increases the resolution of an image, then fills in the new empty pixels with values predicted from the original image. For data capture and OCR, up-sampling is usually done from 150 DPI or 200 DPI to 300 DPI. Up-sampling technologies have become very impressive and useful. Often I will recommend up-sampling over working with the lower-resolution source. But let’s talk about the facts and how and when you should consider up-sampling.
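For the curious, here is one simple way the "fill in new pixels with predicted values" step can work. This sketch uses plain bilinear interpolation on a grayscale image stored as a list of rows. That is my choice for illustration, and commercial up-sampling tools may well use something more sophisticated. The `upsample` name and the tiny test image are made up.

```python
def upsample(image, src_dpi, dst_dpi):
    """Bilinear up-sampling of a grayscale image (list of rows, 0-255)."""
    scale = dst_dpi / src_dpi
    src_h, src_w = len(image), len(image[0])
    dst_h, dst_w = round(src_h * scale), round(src_w * scale)
    out = []
    for y in range(dst_h):
        # Map the output row back to a fractional row in the source.
        sy = y * (src_h - 1) / (dst_h - 1) if dst_h > 1 else 0.0
        y0 = int(sy)
        y1 = min(y0 + 1, src_h - 1)
        fy = sy - y0
        row = []
        for x in range(dst_w):
            sx = x * (src_w - 1) / (dst_w - 1) if dst_w > 1 else 0.0
            x0 = int(sx)
            x1 = min(x0 + 1, src_w - 1)
            fx = sx - x0
            # Blend the four surrounding source pixels.
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# A tiny 2x2 "scan" at 150 DPI: black (0) and white (255) pixels.
tiny = [
    [0, 255],
    [255, 0],
]

bigger = upsample(tiny, 150, 300)
print(len(bigger), len(bigger[0]))  # 4 4
```

The new pixels between the original black and white ones come out as shades of gray that were never on the paper. Once the image is converted back to black and white, that is where fuzzy or bloated characters can come from.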

Up-sampling should be considered only on documents that have little noise, such as watermarks, spills, stains, stamps, or speckling: essentially, documents that are good quality scans but low resolution. You should also avoid up-sampling documents with close spacing of elements and text crowding. In these two scenarios (noisy documents and crowded documents) it’s better to work with the source image as-is and work around the problems.

The bigger the gap between the source resolution and the desired resolution, the greater the risk that fragments exist after up-sampling. For example, 150 DPI to 300 DPI will not yield the quality that 150 DPI to 200 DPI will. This is why going crazy and up-sampling to the highest possible resolution is not a good idea. It’s like taking a very small image and trying to zoom in as far as you can to get details that you probably won’t get. Trying to trick the system will only hurt you. Up-sampling from 150 DPI to 200 DPI and then again to 300 DPI would not be better than just converting from 150 DPI to 300 DPI. In fact, this would be a pretty big mistake. Essentially, what you are doing is magnifying the mistakes created during up-sampling, as they now get propagated two times. These will likely decrease your quality and can result in such things as bloated characters, fuzzy characters, or an abundance of speckling. The goal is to do as few conversions on the document as possible.

I will always defer to a proper scan over any image techniques, but when you do not have control of the image scan, one of the image tools to consider is up-sampling. Uneducated use of the technology is unsafe as is true with all advanced technologies, but if you stick with the facts, and pick a great technology you will be successful.

Chris Riley – About


Multi-Pass document recognition

Nov 04
2009

When accuracy is the primary concern in document recognition, the best technique is multiple passes of the OCR or recognition process. Similar to how you would have a document manually entered two to three times, why not have an OCR engine convert it 3, 4, or even 5 times all with different settings?

The important thing to note in multiple pass recognition is that you NEVER use a different engine for the same process. Reconciling results from two separate engines is self-defeating. This is often called voting, and it does not work because each engine represents the confidence of characters differently, so you might end up always picking the less accurate engine just because it told you it was more confident than the more accurate one. But using the same engine multiple times with different settings is consistent and a good idea.

An example of a scenario where this is being used successfully is with documents that have both machine and hand-printed text. A first read can be done with an OCR engine with settings A, and a second read can be done with the same OCR engine but with settings B. Areas where both reads produce just garbage text might indicate hand print in that area. Now you can use ICR (a hand-print engine) in that region to pick up additional information. That is 3 total passes of recognition. The results are combined to make the final document.
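As a rough illustration of the "both reads produce garbage, so try ICR there" step, the sketch below flags regions where both passes look like garbage. The garbage test is a crude heuristic I made up (share of non-alphanumeric characters), and the region names, sample text, and `max_symbol_ratio` value are all hypothetical.

```python
def looks_like_garbage(text, max_symbol_ratio=0.4):
    """Crude test: too many non-alphanumeric characters means garbage."""
    stripped = text.replace(" ", "")
    if not stripped:
        return True
    symbols = sum(1 for ch in stripped if not ch.isalnum())
    return symbols / len(stripped) > max_symbol_ratio


def regions_for_icr(pass_a, pass_b):
    """Regions where BOTH OCR passes produced garbage are ICR candidates."""
    return [
        region
        for region in pass_a
        if looks_like_garbage(pass_a[region])
        and looks_like_garbage(pass_b.get(region, ""))
    ]


# Hypothetical output of the same engine run with settings A and settings B.
pass_a = {"header": "Invoice 1234", "notes": "~|/\\_ ;'"}
pass_b = {"header": "Invoice 1234", "notes": "}{ |l ^~"}

print(regions_for_icr(pass_a, pass_b))  # ['notes']
```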

At minimum, 3 runs of the same engine would be ideal, as the statistical chance of two different settings producing the same error drops drastically and the final output is nearly as good as it’s going to be. Some document types lend themselves to multiple pass recognition more than others. Sometimes it’s determined by the environment. For example, environments that have a lot of traditional documents mixed with invoice-looking documents would benefit from having a full-page read with standard settings on every page and a full-page read with special document analysis designed for documents with lines and tables.
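One simple way to combine three runs of the same engine is a character-by-character majority vote. This is only a sketch of the idea, not how any particular product reconciles passes. It assumes the three outputs are already aligned and of equal length, which real output often is not, and the sample strings are invented.

```python
from collections import Counter


def reconcile(passes):
    """Majority vote per character over equal-length OCR outputs.

    Returns the combined text and the positions with no clear majority.
    """
    if len({len(p) for p in passes}) != 1:
        raise ValueError("this sketch assumes outputs of equal length")
    result = []
    uncertain = []
    for position, chars in enumerate(zip(*passes)):
        best, votes = Counter(chars).most_common(1)[0]
        result.append(best)
        if votes <= len(passes) / 2:
            uncertain.append(position)
    return "".join(result), uncertain


# Hypothetical output of one engine run with three different settings.
runs = ["INV0ICE 1234", "INVOICE 1Z34", "INVOICE 1234"]

print(reconcile(runs))  # ('INVOICE 1234', [])
```

Each run makes a different mistake, so the other two outvote it. Positions where no character wins a majority are returned separately so they can go to an operator.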

While multiple pass OCR slows down the entire process, it’s still faster and more accurate than manual entry most of the time. I recommend this approach for any organization where accuracy is the primary concern.

Chris Riley – About
