Read signatures? Maybe not. Make sure a doc is signed? Easy!

May 19

A lot of the documents we encounter require a signature. In data capture, these documents add complexity: an operator, either before or after data capture, has to make sure each document is signed. When a document is not signed, it very often has to go down a different path of approvals. Organizations will often ask OCR vendors to read the signature on a form. Recognizing signatures is very expensive and requires a database of pre-existing signatures, so it is often not feasible. But finding a signature and confirming its presence is not difficult at all.

Documents with a signature line almost always have to be checked to assure a signature is there, which is an additional processing step. However, companies often don't realize that the data capture software they are already using to get all the fields off the document can also accurately check whether a signature is present. By doing so they remove the extra step and can flag only the documents that do not report a signature.

Using OMR (optical mark recognition) technology, you can determine if a signature is present. In its simplest form, OMR checks whether there is a substantial amount of black pixels in a white space. At a certain threshold of black, the field is considered checked. If, in a data capture setup, you put an OMR field in the location where a signature should be, then you will know that if it reports checked, a signature is present, and if unchecked, there likely is no signature.
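The threshold idea above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: it assumes the page has already been binarized into a 2D list of pixels (1 = black, 0 = white), and the zone coordinates and 5% threshold are invented values.

```python
# Sketch of an OMR-style signature check over a binarized image,
# where image[row][col] is 1 for a black pixel and 0 for white.

def black_pixel_ratio(image, top, left, height, width):
    """Fraction of black pixels inside the rectangular OMR zone."""
    total = height * width
    black = sum(
        image[row][col]
        for row in range(top, top + height)
        for col in range(left, left + width)
    )
    return black / total

def signature_present(image, zone, threshold=0.05):
    """Report the zone as 'checked' once black density crosses the threshold."""
    top, left, height, width = zone
    return black_pixel_ratio(image, top, left, height, width) >= threshold
```

A tuned threshold matters: too low and stray marks or lines report false signatures; too high and light pens go unnoticed.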

Although you are not reading the signature, OMR is a fast and accurate way to see if signatures are present and avoid the additional manual step of checking for signed documents.

Chris Riley – About

Find much more about document technologies at

Convert now Export later

Apr 14

It’s not surprising that an organization’s focus in any sort of document automation is the export format and the data coming out of the system. But sometimes this focus has organizations choosing poor data capture and OCR products just for an ideal export format. This occurs most often in healthcare and accounting, where industry-specific repositories expect a particular format and the vendors of those repositories are unwilling to change. This post is to assure you that the accuracy and features of your data capture and OCR product are more important than the file format it creates.

By focusing on file export format, organizations limit their possible solutions and perhaps lock themselves into a more expensive proposition than they should. Industry-specific applications are able to charge a premium for connectors and their products because they understand where the focus is. However, the most accurate data capture and OCR systems out there are general. Some data capture applications have connectors to, say, a specific accounting system, but even without specific connectors, all data capture systems can export data in such a way that it can be converted to ANY desired format.

Data capture applications support CSV, XML, ODBC, or text exports that can be molded into any required format. Often, because they support ODBC, there is an opportunity to export directly to any application that also supports it. Because a conversion utility or a custom connector takes weeks to create, versus the man-years it takes to create data capture and OCR engines, the focus should be given to the accuracy and capability of the OCR and data capture system before its export functionality.
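To make the "weeks, not man-years" point concrete, here is a sketch of such a conversion utility, assuming a generic CSV export and a hypothetical vendor XML layout (the `Invoices`/`Invoice` element names and the `Vendor`/`Total` fields are invented for illustration):

```python
# Convert a generic CSV export from a data capture system into a
# vendor-specific XML payload. Field and element names are hypothetical.

import csv
import io
import xml.etree.ElementTree as ET

def csv_export_to_xml(csv_text):
    root = ET.Element("Invoices")
    for row in csv.DictReader(io.StringIO(csv_text)):
        invoice = ET.SubElement(root, "Invoice")
        for field, value in row.items():
            # Each CSV column becomes a child element of the invoice.
            ET.SubElement(invoice, field).text = value
    return ET.tostring(root, encoding="unicode")
```

The whole "connector" is a short script; the hard part, accurate recognition, already happened upstream.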

While it would be ideal to find a data capture application that has the accuracy, the features, and the export you desire, I urge organizations not to limit themselves to that. Picking a poor data capture and OCR system will be far more costly than creating even a custom export from scratch.


Expectations bite the dust

Mar 10

Just this morning, I was reminded of why market education is so important. I received an email from a customer who has been exposed to data capture technology for many years. This customer owns a semi-structured data capture solution that is capable of locating fields on forms that change from variation to variation. To help my understanding, we started a conversation about their expectations. Very wisely, the customer broke down their expectations into three categories: OCR accuracy (field level), field location accuracy, and processing time per document. This is a step more advanced than a typical user, who will clump all of this into one category. In addition to these, there should be a minimum template matching accuracy. In any case, they expect an OCR accuracy of 90%, which is reasonable considering the documents they are working with are pixel perfect. They expect a 20-page document to be processed in 4 minutes, which is also reasonable and right on the line. Finally, they expect field location to be 100%. RED FLAG!

This is not the first time I have seen the assumption that you can locate fields on a semi-structured form with 100% accuracy, 100% of the time. To my dismay, as people learn more about the technology, this is becoming the next class of common fallacy. And because the organization did not specify a template matching accuracy, they must also be assuming that templates match 100% of the time in order to get 100% field location accuracy. Trouble.

It’s clear why 100% field accuracy is important to them: basic QA processes are capable of checking only recognition results (OCR accuracy), not the locations of fields. Instead of modifying QA processes, the organization’s first thought was how to eliminate the problems that QA might face. 100% accuracy is not possible no matter what is done, including straight text parsing. In this case, the reason it’s not possible is that even in a pixel-perfect document, there are situations where a field might be located partially, located in excess, or not located at all. What most often happens in pixel-perfect documents is that text is sometimes seen as a graphic because it’s so clean, and text that is too close to lines is ignored. So in these types of documents, any field error is usually a partial-location error. Most QA systems can be set up so that rules check the data structure of fields, and if the data contained in them is faulty, an operator can check the field and expand it if necessary. But this is only possible if the QA system is tied to data capture.

After further conversation, it became clear that the data capture solution is being forced to fit into a QA model. There are various reasons why this may happen: license cost, a pre-existing QA process, or a misunderstanding of QA possibilities. This is very common for organizations and very often problematic. Quality assurance is a far more trivial process to implement than data capture. It would be better to focus on the functionality of the data capture system and then develop a QA process that makes its output most efficient.

Again, a case of expectations and assumptions.


Data Capture – Problem Fields

Feb 10

The difference between easy data capture projects and more complex ones often comes down to the type of data being collected. For both hand-print and machine-print forms, certain fields are easy to capture while others pose challenges. This post discusses those “problem fields” and how to address them.

In general, fields that are not easily constrained and don’t have a limited character set are problem fields. Fields that are usually very accurate and easy to configure are number fields, dates, phone numbers, etc. Then there are the middle-ground fields, such as dollar amounts and invoice numbers. The problem fields are addresses, proper names, and items.

Address fields are, for most people, surprisingly complex. Many would like to believe that address fields are easy. The only way to capture addresses very easily would be to have, for example in the US, the entire USPS database of addresses that the postal service itself uses in its data capture. It is possible to buy this database. If you don’t have it, the key to addresses is less constraint. Many think you should specify a data type for address fields that starts with numbers and ends with text. While this might work for 60% of the addresses out there, by doing so you have made accuracy on all the exception addresses 0%. It’s best to let the software read what it’s going to read, and only support it with an existing database of addresses if you have one.

Proper names are next in complexity after addresses. A proper name can be a person’s name or a company name. It is possible to constrain the number of characters and, for the most part, eliminate numbers, but the structure of many names makes recognizing them complex. If you have an existing database of the names that will appear on the form, you will excel at this field. As with addresses, it would not be prudent to create a data type constraining the structure of a name.

Items consist of inventory items, item descriptions, and item codes. Items can be either a breeze or very difficult, and it comes down to the organization’s understanding of their structure and whether it has supporting data. For example, if a company knows exactly how item codes are formed, then it’s very easy to process them accurately with an associated data type. The best trick for items is, again, a database with supporting data.
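A sketch of the database-backed approach for a problem field such as a company name, using a loose edit-distance match so that a single-character OCR error still finds its database record. The name list and the 0.8 cutoff are illustrative assumptions, not values from any product:

```python
# Validate a "problem field" against an existing database of known values,
# tolerating small OCR errors via a fuzzy match.

from difflib import get_close_matches

KNOWN_NAMES = ["Acme Corporation", "Globex Inc", "Initech LLC"]

def validate_name(ocr_value, known=KNOWN_NAMES, cutoff=0.8):
    """Return the best database match, or None to flag for operator review."""
    matches = get_close_matches(ocr_value, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Note the design choice: an unmatched value is not rejected outright but routed to a person, which keeps the loose constraint the post recommends.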

As you can see, the common trend is finding a database with existing supporting data. Knowing the problem fields focuses companies and helps them form a plan of attack for creating very accurate data capture.


Invisible characters

Jan 27

Exceptions in OCR and data capture are usually thought of as mis-recognized characters only, but in reality several other types of exceptions exist. One of these is the “high confidence blank”: the software looked in a particular region for a character, but no text was identified or read. In data capture, high confidence blanks usually occur for entire fields or just the first character; in full-page OCR they are less common but can occur sporadically throughout the text of the document, or across the entire text. This type of exception is elusive and hard to detect. Obviously, if entire fields or passages are missed where you expect text, it is easy to spot, but the one-off missing characters are tough. With full-page OCR, detection is done with spell-check: missing characters in a word will surely flag the word as misspelled. In data capture it’s much trickier, and the best thing to do is take steps to avoid high confidence blanks in the first place.

1.) The first thing you can do to avoid high confidence blanks in data capture is to NOT overuse image clean-up. If characters are regenerated or cleaned too much, they look to the OCR engine like a graphic rather than a typographic character, and are thus skipped.
2.) Second, if you have control of the form design, make sure text is not printed close to lines; this is one of the biggest generators of high confidence blanks.
3.) If text is close to lines, then you should be able to establish a rule in the software indicating, for example, that if the first character in a field is more than X pixels away from the border, then most likely a character (or characters) was missed.
4.) If at all possible, use dictionaries and data types that state the structure of the information that should be present in a field. If a character is missing, this data type will likely be broken.
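Rules 3 and 4 above can be sketched as one combined check. This is an illustration only: the 40-pixel tolerance is an invented value, and the date data type stands in for whatever structure your field actually has.

```python
# Flag a field for review when either the first recognized character sits
# suspiciously far from the field's left border (a leading character may
# have been dropped) or the value breaks its declared data type.

import re

DATE_TYPE = re.compile(r"\d{2}/\d{2}/\d{4}")  # stand-in data type

def field_suspicious(field_left, first_char_left, text, max_gap=40):
    """True when the field should be routed to an operator."""
    gap_too_large = (first_char_left - field_left) > max_gap
    type_broken = not DATE_TYPE.fullmatch(text)
    return gap_too_large or type_broken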

This type of exception leads to hidden downstream problems when organizations don’t realize it can happen. Being aware and taking the proper steps to avoid high confidence blanks is the solution.


Not that you want to pay that invoice any faster

Dec 09

But you can, and you can at a lower cost, perhaps even taking advantage of net discounts. With data capture and OCR technology you can automate the entry and routing of commercial invoices. The reality for organizations that receive many invoices a day is that the accounting department is paying high salaries and taking time away from other activities to key in paper invoices. Using recognition technology to replace this process has been a tremendous benefit to many organizations. There are a few keys to success.

Start out simple: don’t try to tackle the entire paper world with your solution. First identify the process and where the opportunities for saving are. Usually the biggest opportunity is going to be in the entry of data into some accounting system. To automate this you will need data capture and scanning capabilities. Starting out simple does not mean overlooking all the possibilities; find the technology that will fit your wildest dreams of automation, but start out slow with it. More specifically with invoices, start by scanning, then by getting vendor, invoice number, and total due using recognition technology, and build from there.

Wait for an ROI before you make a major change: these technologies, if implemented correctly, can provide a great return on investment. Sometimes organizations make the mistake of not waiting for an ROI before making another major change. The change will likely have positive results, but it requires another round of effort and can be problematic. It does not let you see when the value of the technology starts kicking in, and could have you repeating effort. Wait until you succeed at a basic implementation before you seek even more cost savings. Saving money is addictive, but let each phase actualize itself.

Never forget your business process is boss: organizations have processes that are set in stone. Staff understand how to execute them, technology is set up to facilitate them, and other processes feed or are fed by them. Sometimes new technology is so exciting that it pushes you to change what you are doing the moment you acquire it. Often organizations don’t realize the upstream and downstream impact of dramatically changing business processes. A technology should give you the option to keep doing what you are doing, only faster, or to change things if you choose. At first, keep things as consistent as possible with the AP business processes already in place; look for process improvement later.

No, maybe you don’t want to pay that invoice faster, but you do want to reduce the cost of working with it. With data capture and OCR you can save a ton, as long as you prepare yourself and do your homework.


Exceptional exceptions – Key to winning with Data Capture

Dec 02

Exceptions happen! When working with advanced technologies like data capture and forms processing, you will always have exceptions. It’s how companies choose to deal with those exceptions that often makes or breaks an integration. Too often exception handling is not considered in data capture projects, but it’s important. Exceptions help organizations find areas for improvement, increase the accuracy of the overall process, and, when properly prepared for, keep return on investment (ROI) stable.

There are two phases of exceptions: those that make it to the operator-driven quality assurance step, and those that are thrown out of the system. It would take some time to list all the possible causes of these exceptions, but that is not the point here; the point is how best to manage them.

Exceptions that make it to the quality assurance (QA) process have a manual labor cost associated with them, so the goal is to make the checking as fast as possible. The best first step is to use database lookup for fields. If you have pre-existing data in a database, link your fields to this data as a first round of checking and verification. Next is to choose proper data types. Data types are formats for fields. For example, a numeric date will only have numbers and forward slashes, in the format NN”/”NN”/”NNNN. By allowing only these characters, you make sure you catch exceptions and can either give the data capture software enough information to correct the error (if you see a “g”, it’s probably a “6”) or point the verification operator to exactly where the problem is. The majority of your exceptions will fall into the quality assurance phase. But there are some documents the software is not confident about at all, and they end up in an exception bucket.
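The date data type above, including the "g is probably a 6" style of correction, can be sketched like this. The confusion map is an illustrative assumption; a real engine derives such corrections from character confidence, not a fixed table.

```python
# First-round automated check for a date field: apply known OCR confusions,
# then test the result against the NN"/"NN"/"NNNN data type.

import re

DATE_TYPE = re.compile(r"\d{2}/\d{2}/\d{4}")
CONFUSIONS = str.maketrans({"g": "6", "O": "0", "I": "1", "l": "1"})

def check_date_field(raw):
    """Return a corrected date, or None to route the field to an operator."""
    candidate = raw.translate(CONFUSIONS)
    return candidate if DATE_TYPE.fullmatch(candidate) else None
```

Fields that come back `None` are exactly the ones worth an operator's time; everything else passed the electronic check.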

Whole exception documents that are kicked out of the system are the most costly, and, if not planned for, can be the killer of ROI. The most common cause of these exceptions is a document type or variation that has not been set up. It’s not the fault of the technology; in fact, because the software kicked the document out rather than trying to process it incorrectly, it’s doing a great job! The mistake companies make is giving every document that falls into this category the same attention, and thus additional fine-tuning cost. But what happens if that document type never appears again? The company just reduced its ROI for nothing. The key to these exceptions, whether they are whole document types or just portions of one particular document type, is to set a standard: an exact problem has to repeat X times (based on volume) before it’s given any sort of fine-tuning effort.

Only with an exceptional exception handling process will you have an exceptional data capture system and ROI.


Data Type, Dictionary, Database Lookup = First Verification

Oct 28

After viewing the power of data capture technology, I’ve yet to see an organization unimpressed, until the conversation explores quality assurance steps. Though the technology is extremely powerful, there will always be some level of quality checking needed to get 100% accurate results. Think of it this way: if you were to spill coffee on a perfectly printed document and scan it soon after (the rollers making a nice smudge), you likely would be unable to read the text yourself, so how can the software? In this scenario, QA would be required for the smudged fields. This seems obvious, but it illustrates the point. I have good news, however: if you provide the right tools, you can use a computer to do the first pass of verification.

It’s just like a human verifying a document, but much faster and less expensive. Organizations that deploy these methods can eliminate a large percentage of verification, but the caveat is that they must first know their documents. If, after data capture has happened, you combine data types with a dictionary or database lookup, you have created an electronic verifier.

A data type tells the software what structure a field should have. A data type can be used to confirm a field result OR to modify uncertain results based on the knowledge it contains. Take a date field: after data capture, the field is recognized as 1O/13/8I. We see two errors, an “O” instead of a “0” and an “I” instead of a “1”. Suppose you deploy a date data type that simply says there will always be a number 1-12, followed by a “/”, followed by a number 1-31, followed by a “/”, followed by two numbers. Then the date is automatically converted to 10/13/81, which is correct. Some data types are universal, such as date and time; others are specific to a document type, and an organization that knows ALL of them stands to benefit greatly.
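Here is the 1O/13/8I example worked through as code: substitute common look-alike characters, then confirm the ranges the data type declares (1-12, then 1-31, then a two-digit year). The look-alike table is an assumption for illustration.

```python
# Apply a date data type: fix look-alike characters, then enforce
# month 1-12, day 1-31, and a two-digit year.

LOOKALIKES = str.maketrans({"O": "0", "I": "1", "l": "1", "S": "5"})

def apply_date_type(raw):
    """Return the corrected date, or None if the data type is broken."""
    candidate = raw.translate(LOOKALIKES)
    parts = candidate.split("/")
    if len(parts) != 3:
        return None
    month, day, year = parts
    if not (month.isdigit() and day.isdigit() and year.isdigit()):
        return None
    if 1 <= int(month) <= 12 and 1 <= int(day) <= 31 and len(year) == 2:
        return candidate
    return None
```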

Dictionaries and database lookup functions are essentially the same, with a slight variation. The purpose of both is to validate what was extracted via data capture against pre-existing acceptable results. The simplest example is existing customer names. If you are processing a form distributed to existing customers that contains first name and last name, then, because you already know the customers exist, you should be able to look up the customer in a database and confirm the results. If no match is found, there is likely a problem with the form. Dictionaries provide the same value but are more static, and are often used for fields such as product type or rate type that have one set of possibilities that rarely changes. The point is that organizations should look at the database and dictionary assets they already have to augment the data capture process and make it more accurate.
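A dictionary check is even simpler than a database lookup, since the set of possibilities is small and static. The rate-type values below are invented for illustration:

```python
# Dictionary check for a field with one small, stable set of possibilities:
# the extracted value either matches the dictionary or the form is flagged.

RATE_TYPES = {"fixed", "variable", "adjustable"}

def dictionary_check(field_value, dictionary=RATE_TYPES):
    """True when the extracted value is an acceptable dictionary entry."""
    return field_value.strip().lower() in dictionary
```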

There will always be quality assurance steps with any technology that involves interpretation of data. Organizations wanting to deny these steps either do not understand the technology, do not understand their own processes, or were misled by a vendor. Quality assurance is the place where much effort should be spent on streamlining, and one of the ways to do that is by leveraging the data types, dictionaries, and databases that already exist.


If it’s not semi-structured, why fix it? Know your form’s class

Sep 24

There are two major classes of data capture technology: fixed-form and semi-structured. When processing a form, it’s critical that the right class is chosen. To complicate things, there is a population of forms out there that can be automated with either, but there is always a definite benefit of one over the other. In my experience, organizations have a very hard time figuring out whether their form is fixed or not. The most common misdiagnosis comes from forms where fields are in the same location and each has an allotted white space for data to be entered. To most this seems fixed, but it’s actually not. Text in these boxes can move around substantially, and in addition, the boxes themselves, while in the same location relative to each other, can move because of copying, variations in printing, etc. There are two very easy steps to determine whether your form is fixed or not.

1.) Does your form have corner stones? Corner stones, sometimes referred to as registration marks (clearly defined registration marks have been known to replace corner stones), are printed objects, usually squares, in each corner of the form. They must all be at 90-degree angles from their neighbors. Corner stones allow the software to match the scanned or input document to the original template, theoretically lining up all fields and all static elements on the form, and removing any shifts, skews, etc.

2.) Does your form have pre-defined fields? A pre-defined field is more than a location on the form. A pre-defined field has a set width, height, and location, and, most importantly, a set number of characters. You know these fields most commonly from forms where you have a box for each letter. There are variations in how the characters are separated, but they all share these attributes. This is called mono-spaced text.

If your form does not have the above two items, it is not a fixed form, which indicates that semi-structured forms processing technology would be the best fit. For those forms that are commonly confused for fixed, there are ways to make them process well with a fixed-form solution: isolate the input type (fax, email, scan) and use the proper arrangement of registration marks.
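The two-question test above reduces to a tiny decision helper, shown here only to make the rule explicit:

```python
# A form is "fixed" only when it has BOTH corner stones and
# pre-defined (mono-spaced) fields; otherwise treat it as semi-structured.

def form_class(has_corner_stones, has_predefined_fields):
    if has_corner_stones and has_predefined_fields:
        return "fixed"
    return "semi-structured"
```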


Black belt in data capture processes an EOB

Sep 14

Explanations of Benefits (EOBs), next to student transcripts, are without a doubt the most difficult documents to automate. The value of automating these documents, however, is tremendously high, as they are very expensive to data enter. Three years ago, the fad in automating these documents was to use semi-structured data capture to locate information no matter the variation. Companies buying into this fad quickly found themselves in an expensive and deep data capture implementation. This is where I get to tout the power of simplicity and beat down the over-complicators.

Just as a sensei would practice meditation before a bout to calm the nerves, so should an implementer of data capture when facing the bloody battle with EOB documents. Simplicity is key when processing EOBs. Organizations should:

1.) Consider processing first those EOBs that are clear. Clarity is a vague term and includes document structure and scanning quality. But because of the variation across EOB types, it’s best for an organization to focus on automating the best-quality EOBs, the ones they know will provide the highest accuracy, and move on to the rest once they have succeeded.

2.) Consider classification as a primary step. If you can classify EOBs by type very accurately, then you don’t need to use semi-structured technology across all EOBs. You simply need to isolate each type and use a combination of coordinate-based and semi-structured field location. Because you are working with a single type, you will be far more accurate in locating the fields and reading them.

3.) Ignore document structure. Very often EOBs don’t follow their own document structure, especially when it comes to tables. Often EOBs have tables within tables, or data in tables that does not align to table headings. Additionally, EOBs have patients that span pages, and totals for items on previous pages. An EOB should be thought of as a collection of lines that start with a header (easy to collect the data from) and end with a footer (also easy to collect data from). Your job then is to classify lines and extract data per line.

4.) Extract the data, then convert it. In EOB processing, many items contained within the EOB have to be converted to another format prior to reconciliation. If you focus on the conversions while trying to extract data, they often muddy up the extraction process. First, very accurately get the data from the paper; then convert it to the desired format.
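The line-oriented approach in step 3 can be sketched as a line classifier. The patterns below are invented for illustration and do not come from any real EOB layout; a real project would derive them per payer type after the classification step.

```python
# Treat an EOB page as a list of lines: classify each as header, detail,
# or footer, then extract fields per line. Patterns are hypothetical.

import re

HEADER = re.compile(r"^PATIENT:")
FOOTER = re.compile(r"^TOTALS?\b")
DETAIL = re.compile(r"^\d{2}/\d{2}/\d{2}\s")  # service lines start with a date

def classify_line(line):
    if HEADER.match(line):
        return "header"
    if FOOTER.match(line):
        return "footer"
    if DETAIL.match(line):
        return "detail"
    return "other"
```

Once every line has a class, patients spanning pages and carried-forward totals stop being special cases: they are just more header and footer lines in the stream.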

For those who are currently processing EOBs and receiving the great value that automation can provide: you truly are black belts of data capture and have mastered the nuances of document automation. For those of you wanting to process EOBs, it’s very possible; just keep it simple.
