Line-Items: Picking the correct field type

Feb 22

Documents containing tables carry the majority of their information inside those tables, so the demand to collect this data is very high. In data capture, organizations choose between three scenarios for collecting data from these documents: ignore the table; capture the header, footer, and just a portion of the table; or capture it all. Ideally organizations prefer the last option, but there are some strategic decisions that have to be made prior to any integration using tables. One of those decisions is whether to capture the data in the table as a large body of individual fields or as a single table block. Let's explore the benefits and downsides of both.

Why would you ever perform data capture of a table with a large collection of individual fields when you can collect it as a single table field? Accuracy. Theoretically it will always be more accurate to collect every cell of a table as its own individual field. The reason is that each field is accurately located, you remove the risk of partially collected cells or cells where the baseline is cut, and you remove white space or lines from fields. In some data capture solutions this is your only choice, so many have made it very easy to duplicate fields and make small changes, shrinking the time it takes to create so many fields. This is a great tool, because the downside to tables as a collection of individual fields is the time it takes to create all the fields, and that time may be too great to justify the increase in accuracy.

If your data capture application can collect data as an individual table block, you are able to set up any one document type very quickly. Table blocks require document analysis that can identify table structures in a document. The table block relies heavily on the identified tables and then applies column names per the logic in your definition. This is what creates its simplicity, but also its problems. Sometimes document analysis finds tables incorrectly, more often partially. This can cause missing columns, missing rows, and in the worst case, rows where the text is split vertically between two cells or columns cut in half horizontally.
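To make the table-block idea concrete, here is a minimal sketch of the kind of row reconstruction a table block depends on. It assumes the OCR engine returns words as hypothetical `(text, x, y)` tuples and groups them into rows by vertical position; real document analysis is far more involved.

```python
# Hypothetical sketch: grouping OCR word boxes into table rows by
# vertical position. Assumes each word is (text, x, y); words whose
# y coordinates fall within a tolerance are treated as one row.

def group_into_rows(words, tolerance=5):
    """words: list of (text, x, y) tuples; returns rows, top to bottom."""
    rows = []
    for text, x, y in sorted(words, key=lambda w: w[2]):
        if rows and abs(rows[-1][0] - y) <= tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    # Sort each row left to right and drop the y bookkeeping value
    return [[t for _, t in sorted(cells)] for _, cells in rows]

words = [("Widget", 10, 100), ("2", 200, 102), ("$5.00", 300, 101),
         ("Gadget", 10, 130), ("1", 200, 131), ("$9.00", 300, 129)]
print(group_into_rows(words))
# [['Widget', '2', '$5.00'], ['Gadget', '1', '$9.00']]
```

When a table is only partially detected, rows like these come back merged or split, which is exactly the failure mode described above.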

Tables vary in complexity, and this is most often the deciding factor in which approach to take. Almost as often, the accuracy required, and the amount of integration time needed to obtain that accuracy, determines the approach. For organizations that want line-items but do not require them, table blocks are ideal. For organizations needing high accuracy and processing high volume, individual fields are ideal. In any case, it's something that needs to be decided prior to any integration work.

Chris Riley – About

Find much more about document technologies at

Fixed, Semi-structured, UNSTRUCTURED!?

Jan 13

I find myself educating even industry peers on the topic of document structure more and more recently. Often the conversation starts with one of them telling me that unstructured document processing exists, OR that a particular form is fixed when it is not. Understanding what is meant when talking about document structure is very important.

First, let's start by defining a document. A document is a collection of one or many pages that has a business process associated with it. Documents of a single type can vary in length, but the content contained within, or the possibility of it existing, is constrained. When data capture technology works, it works on pages, so each page of a document is processed as a separate entity, and this, it seems, is the source of the confusion.

Often someone will say a document is unstructured. What they mean is that the order of pages is unstructured, which is more or less accurate; however, the pages within this unstructured document are either fixed or semi-structured. The only truly unstructured documents that exist are contracts and agreements. The test is this: if at any moment you can pull a page from the document and state what that page is and what information it would have, then it IS NOT unstructured.

The ability to process agreements and contracts is very limited, existing only in very concrete scenarios where contract variants are non-existent, which essentially means they are no longer unstructured. In general, the ability to process unstructured documents does not exist. Now to explore the difference between semi-structured and fixed.

It’s actually very easy, because 80% of the documents that exist are semi-structured. Even if a field appears in the same general location on every page of a particular type, that does not make it fixed. For example, a tax form always has the same general location to print the company name. The printer has to print within a specified range, but they can print more to the left or more to the top, and the length will vary with every input name. This makes it semi-structured, and additionally, when this document is scanned it will shift left, right, up, and down by small amounts. A document is ONLY truly a fixed form when it has registration marks and fields of fixed location and length. Registration marks are how the software matches every image to the same set of coordinates, making it more or less identical to the template.

There again the confusion is exposed. When having conversations about data capture, it’s very important to understand the true definitions of the lingo that is used. I task you: if you catch someone using the lingo incorrectly, correct it; it will help both you and them.


Tax Return OCR

Jun 09

If you are thinking about using data capture to read text from tax returns, it’s time to start thinking about the steps to accomplish this. Reading typographic tax returns from current and previous years has proven to be very accurate and a great use of data capture and OCR technology. Tax returns fall into the medium-complexity category for automation. There are a few things that make tax returns unique.

Checkmarks: Tax returns have two types of checkmarks. The first are standard checkmarks printed in the body of the document; these can be handled like all other common checkmark types. The other type is unique to tax forms and typically appears on the right side of the document: boxes that can be filled with a character or a checkmark symbol. With these checkmarks, the best approach is to create a field the entire size of the area where the checkmark can be printed and set the checkmark type to “white field”. In this case the software will expect there to be only white space, and the presence of enough black pixels will mark the field as checked.
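The "white field" logic can be sketched in a few lines. This is an illustrative toy, not any vendor's implementation: the region is a grid of grayscale pixel values (0 black, 255 white), and the 2% dark-pixel ratio is an assumed threshold that would be tuned per document set.

```python
# Hypothetical sketch of a "white field" check: the field region is
# expected to be blank, so any significant share of dark pixels means
# something (a character or a check symbol) was printed there.

def white_field_is_checked(region, dark_threshold=128, min_dark_ratio=0.02):
    """region: nested list of grayscale pixels, 0 (black) to 255 (white)."""
    total = sum(len(row) for row in region)
    dark = sum(1 for row in region for px in row if px < dark_threshold)
    return total > 0 and dark / total >= min_dark_ratio

blank = [[255] * 20 for _ in range(10)]     # all white -> unchecked
marked = [row[:] for row in blank]
for i in range(10):                         # draw a diagonal stroke
    marked[i][i] = 0
print(white_field_is_checked(blank), white_field_is_checked(marked))
# False True
```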

Tabular Data: Much of the data in a tax form is presented as a table. When considering capturing data from a table, organizations have to decide whether they want to capture each cell of the table as its own field OR capture the data in the table as a table field that must later be parsed. This can dramatically affect the exported results, so knowing beforehand is very important.

Delivery Type: Tax forms usually come either as eFile, a pixel-perfect document that is never printed and never scanned, or as a scanned document, received first on paper and then scanned. For the most part the eFile version of the tax form will be more accurate; however, the eFile version of the form has non-traditional checkmarks that could cause a problem. Organizations need to decide if they are going to process all delivery types together as a single type or separate them. There are advantages to both: by combining them, integration time is less; by separating them, accuracy is higher.

I would much rather OCR a tax return than file one. Because of this, the skills I’ve developed in processing tax returns are better than my skills in creating them, and I hope today I imparted some of that knowledge.


When you got it, design it – Form Design

Jun 02

It is not too often that companies using Data Capture technology have the chance to change their form design or even create new forms. If you have this ability, USE IT! A properly designed form is the first step to success in automating that form. There are many things you can do to make sure your form is as machine readable as possible. Typically the forms we are talking about are hand-written, but occasionally they are also machine filled. I will highlight the major points.

  1. Corner stones. Make sure your form has corner stones in each corner of the page. The corner stones should be at 90 degree angles to each neighboring one, and the ideal type is a black 5 mm square.

  2. Form title. A clear title in 24 pt or larger print, in a non-stylized font.

  3. Completion guide. This is optional, but it is sometimes useful to print a guide at the top of the form on how best to fill in the types of fields you use.

  4. Mono-spaced fields. For the fields to be completed, it’s best to use field types with character-by-character separation. Each character block should be 4 mm x 5 mm, and blocks should be separated by 2 mm or more. The best types of fields to use, in order, are: letters separated by a dotted frame, letters separated by a drop-out color frame, and letters separated by complete square frames.

  5. Segmented fields by data type. For certain fields, it is important to segment the field into portions to enhance ICR accuracy. The best example is a date: instead of having one field for the complete date, split it into 3 separate parts, with the first being a month field, the next a day field, and the last a year field. The same is often done for numbers, codes, and phone numbers.

  6. Separate fields. Separate each field by 3 mm or more.

  7. Consistent fields. Make sure the form consistently uses the field types stated in point 4.

  8. Form breaks. It’s OK to break the form up into sections and separate those sections with solid lines. This often helps template matching.

  9. Placement of field text. For the text that indicates what a field is (“first name”, “last name”), it is best to place it left justified, to the left of the field, at a distance of 5 mm or more. DO NOT put the field descriptor in drop-out color inside the field itself.

  10. Barcode. Barcode form identifiers are useful for form identification. Use a unique ID per form page and place the barcode at the bottom of the page, at least 10 mm from any field.
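Rules like these lend themselves to being checked mechanically before a form goes to print. Here is a small sketch under assumed conventions: fields are hypothetical `(name, x_mm, width_mm)` tuples on one line, and the 3 mm minimum gap is rule 6 above; the field names and coordinates are invented for illustration.

```python
# Hypothetical sketch: checking one of the layout rules above (rule 6,
# fields separated by 3 mm or more) against a simple form spec.

MIN_FIELD_GAP_MM = 3

def check_field_gaps(fields):
    """fields: list of (name, x_mm, width_mm) on one line, left to right."""
    problems = []
    ordered = sorted(fields, key=lambda f: f[1])
    for (name_a, x_a, w_a), (name_b, x_b, _) in zip(ordered, ordered[1:]):
        gap = x_b - (x_a + w_a)            # white space between the two fields
        if gap < MIN_FIELD_GAP_MM:
            problems.append(f"{name_a}->{name_b}: gap {gap} mm")
    return problems

fields = [("first_name", 10, 40), ("last_name", 55, 35), ("date", 92, 30)]
print(check_field_gaps(fields))
# ['last_name->date: gap 2 mm']
```

The same pattern extends naturally to the other measurable rules (character block size, corner-stone placement, barcode clearance).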


Read signatures? Maybe not. Make sure a doc is signed? Easy!

May 19

A lot of the documents we encounter require a signature. In data capture, these documents add complexity, as an operator, either before or after data capture, has to make sure each document is signed. When a document is not signed, it very often has to take a different path of approvals. Often organizations will ask OCR vendors to read the signature on a form. Recognizing signatures is very expensive and requires a database of pre-existing signatures, so it is often not feasible. But finding a signature and confirming its presence is not difficult at all.

Documents with a signature line almost always have to be checked to assure a signature is there, and this is an additional step of processing. However, companies often don’t realize that the data capture software they are using can get all the fields off of the document and accurately check whether a signature is present. By doing so they remove any additional steps and can flag only the documents that are not reporting a signature.

Using OMR, optical mark recognition, technology, you can determine if a signature is present. In its simplest form, OMR checks to see if there is a substantial amount of black pixels in a white space. At a certain threshold of black, the field is considered checked. If, in a data capture setup, you put an OMR field in the location where a signature should be, then you will know that if it reports checked, there is a signature present, and if unchecked, there likely is no signature.
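The whole flow can be sketched as follows. This is an illustrative toy, not a vendor API: the signature-line region is a grid of grayscale pixels, the 3% ink ratio is an assumed threshold (tuned in practice so stray specks don't count as a signature), and the document IDs are made up.

```python
# Hypothetical sketch of the OMR signature check described above: count
# dark pixels in the region where the signature should be, then flag
# documents whose signature line is still blank.

def signature_present(region, dark_threshold=128, min_ink_ratio=0.03):
    """region: nested list of grayscale pixels, 0 (black) to 255 (white)."""
    total = sum(len(row) for row in region)
    ink = sum(1 for row in region for px in row if px < dark_threshold)
    return total > 0 and ink / total >= min_ink_ratio

def flag_unsigned(documents):
    """documents: dict of doc id -> signature-line region (pixel grid)."""
    return [doc_id for doc_id, region in documents.items()
            if not signature_present(region)]

signed = [[0 if (r + c) % 7 == 0 else 255 for c in range(50)] for r in range(10)]
unsigned = [[255] * 50 for _ in range(10)]
print(flag_unsigned({"doc-1": signed, "doc-2": unsigned}))
# ['doc-2']
```

Only the flagged documents would then need an operator's attention, which is the manual step this technique removes.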

Although you are not reading the signature, OMR is a fast and accurate way to see if signatures are present and avoid the additional manual step of checking for signed documents.


Expectations bite the dust

Mar 10

Just this morning, I was reminded of why market education is so important. I received an email from a customer who has been exposed to data capture technology for many years. This customer owns a semi-structured data capture solution that is capable of locating fields on forms that change from variation to variation. In an attempt to help my understanding, we started a conversation about their expectations. Very wisely, the customer broke down their expectations into three categories: OCR accuracy (field level), field location accuracy, and amount of time to process per document. This is a step more advanced than a typical user, who will clump all of this into one category. In addition to these, there should be a minimum template matching accuracy. In any case, they expect an OCR accuracy of 90%, which is reasonable considering the documents they are working with are pixel perfect. They expect a 20 page document to be processed in 4 minutes, which is also reasonable and right on the line. Finally, they expect field location to be 100%. RED FLAG!

This is not the first time I have seen the assumption that you can locate fields on a semi-structured form with 100% accuracy, 100% of the time. To my dismay, as people learn more about the technology, this is becoming the next class of common fallacy. And because the organization did not specify template matching accuracy, they must also be assuming templates match 100% of the time to get 100% field location accuracy. Trouble.

It’s clear why 100% field location accuracy is important for them: basic QA processes are capable of checking only recognition results (OCR accuracy), not the locations of fields. Instead of modifying QA processes, the organization’s first thought was how to eliminate the problems that QA might face. 100% accuracy is not possible no matter what is done, including straight text parsing. In this case, the reason it’s not possible is that even in a pixel-perfect document, there are situations where a field might be located partially, located in excess, or not located at all. What most often occurs in pixel-perfect documents is that text may sometimes be seen as a graphic because it’s so clean, and text that is too close to lines is ignored. So typically in these types of documents, any field error is usually a partially located field. Most QA systems can be set up so that rules are applied to check the data structure of fields, and if the data contained in them is faulty, an operator can check the field and expand it if necessary. But this is only possible if the QA system is tied to data capture.

After further conversation, it became clear that the data capture solution is being forced to fit into a QA model. There are various reasons why this may happen: license cost, pre-existing QA, or misunderstanding of QA possibilities. This is very common for organizations and very often problematic. Quality assurance is a far simpler process to implement than data capture. It would be better to focus on the functionality of the data capture system and develop a QA process that makes its output most efficient.

Again, a case of expectations and assumptions.


Data Capture – Problem Fields

Feb 10

The difference between easy data capture projects and more complex ones often comes down to the type of data being collected. For both hand-print and machine-print forms, certain fields are easy to capture while others pose challenges. This post discusses those “problem fields” and how to address them.

In general, fields that are not easily constrained and don’t have a limited character set are problem fields. Fields that are usually very accurate and easy to configure are number fields, dates, phone numbers, etc. Then there are the middle-ground fields, such as dollar amounts and invoice numbers. The problem fields are addresses, proper names, and items.

Address fields are surprisingly complex to most people. Many would like to believe that address fields are easy. The only way to capture address fields very easily would be to have, for example in the US, the entire USPS database of addresses that USPS itself uses in its data capture. It is possible to buy this database. If you don’t have it, the key to addresses is less constraint. Many think that you should specify a data type for address fields that starts with numbers and ends with text. While this might be great for 60% of the addresses out there, by doing so you have made accuracy on all exception addresses 0%. It’s best to let the software read what it’s going to read and only support it with an existing database of addresses if you have one.

Proper names are next in complexity after addresses. A proper name can be a person’s name or a company name. It is possible to constrain the number of characters and, for the most part, eliminate numbers, but the structure of many names makes recognizing them complex. If you have an existing database of the names that might appear on the form, you will excel at this field. As with addresses, it would not be prudent to create a data type constraining the structure of a name.

Items consist of inventory items, item descriptions, and item codes. Items can either be a breeze or very difficult, and it comes down to the organization’s understanding of their structure and whether they have supporting data. For example, if a company knows exactly how item codes are formed, it’s very easy to accurately process them with an associated data type. The best trick for items is, again, a database with supporting data.

As you can see, the common trend is having a database of existing supporting data. Knowing the problem fields focuses companies and helps them form a plan of attack for creating very accurate data capture.


Exceptional exceptions – Key to winning with Data Capture

Dec 02

Exceptions happen! When working with advanced technologies in Data Capture and forms processing, you will always have exceptions. It’s how companies choose to deal with those exceptions that often makes or breaks an integration. Too often exception handling is not considered for data capture projects, but it’s important. Exceptions help organizations find areas for improvement, increase the accuracy of the overall process, and, when properly prepared for, keep return on investment (ROI) stable.

There are two phases of exceptions: those that make it to the operator-driven quality assurance step, and those that are thrown out of the system. It would take some time to list all the possible causes of these exceptions, but that is not the point here; the point is how best to manage them.

Exceptions that make it to the quality assurance (QA) process have a manual labor cost associated with them, so the goal is to make the checking as fast as possible. The best first step is to use database lookup for fields. If you have pre-existing data in a database, link your fields to this data as a first round of checking and verification. Next is to choose proper data types. Data types are formats for fields. For example, a date in numbers will only have numbers and forward slashes, in the format NN”/”NN”/”NNNN. By only allowing these characters, you make sure you catch exceptions, and you can either give the data capture software enough information to correct the result (if you see a g, it’s probably a 6) or point the verification operator at exactly where the problem is. The majority of your exceptions will fall into the quality assurance phase. There are also some exception documents that the software is not confident about at all, which will end up in an exception bucket.
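The data type idea above can be sketched in a few lines. This is a simplified illustration, not a vendor feature: the confusion map (the "a g is probably a 6" rule) is an assumed table, whereas real engines draw on per-character confidence from the recognizer.

```python
# Hypothetical sketch of a data type for the NN"/"NN"/"NNNN date format,
# plus look-alike character correction of the "g is probably a 6" kind.

import re

DATE_PATTERN = re.compile(r"^\d{2}/\d{2}/\d{4}$")
CONFUSIONS = {"g": "6", "O": "0", "o": "0", "I": "1", "l": "1", "S": "5"}

def apply_data_type(raw):
    """Return (value, ok): corrected text and whether it now matches."""
    corrected = "".join(CONFUSIONS.get(ch, ch) for ch in raw)
    return corrected, bool(DATE_PATTERN.match(corrected))

print(apply_data_type("0g/23/2011"))   # g corrected to 6 -> passes
print(apply_data_type("June 2011"))    # cannot be corrected -> exception
```

A result that still fails the pattern after correction is exactly what gets routed to the verification operator, with the problem already localized.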

Whole exception documents that are kicked out of a system are the most costly and, if not planned for, can be the killer of ROI. The most common cause of these exceptions is a document type or variation that has not been set up. It’s not the fault of the technology; as a matter of fact, because the software kicked the document out and did not try to process it incorrectly, it’s doing a great job! The mistake companies make is giving every document that falls into this category the same attention, and thus additional fine-tuning cost. But what happens if that document type never appears again? The company has just reduced its ROI for nothing. The key to these exceptions, whether they are whole document types or just portions of one particular document type, is to set a standard that requires an exact problem to repeat X times (based on volume) before it’s given any sort of fine-tuning effort.

Only with an exceptional exception handling process will you have an exceptional data capture system and ROI.


Data Type, Dictionary, Database Lookup = First Verification

Oct 28

After viewing the power of data capture technology, I’ve yet to see an organization unimpressed, until the conversation explores quality assurance steps. Though the technology is extremely powerful, there will always be some level of quality checking to get 100% accurate results. Think of it this way: if you were to spill coffee on a perfectly printed document and scan it soon after (the rollers making a nice smudge), you likely would be unable to read the text yourself, so how can the software? In this scenario, QA would be required for the smudged fields. This seems obvious, but it illustrates the fact. I have good news, however: if you provide the right tools, you can use a computer to do the first pass of verification.

It’s just like a human verifying a document, but much faster and less expensive. Organizations that deploy these methods can eliminate a large percentage of verification, but the caveat is they must first know their documents. After data capture has happened, if you combine data types with a dictionary or database lookup, you have created an electronic verifier.

A data type tells the software what structure a field should have. A data type can be used to confirm a field result OR to modify uncertain results based on the knowledge it contains. For example, take a date field. After data capture, the field is recognized as 1O/13/8I. We see there are two errors: an “O” instead of a “0” and an “I” instead of a “1”. Deploy a date data type that simply says you will always have a number 1-12, followed by a “/”, followed by a number 1-31, followed by a “/”, followed by two numbers. The date is then automatically converted to 10/13/81, which is correct. Some data types are universal, such as date and time; others are specific to a document type, and an organization that knows ALL of them stands to benefit greatly.
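That 1O/13/8I example can be worked through in code. This is a minimal sketch under stated assumptions: the look-alike substitution table is invented for illustration, and the range checks (month 1-12, day 1-31, two-digit year) are exactly the ones described above.

```python
# Hypothetical sketch of the date data type described above: substitute
# common look-alike characters, then check the month/day ranges.

LOOKALIKES = {"O": "0", "I": "1", "l": "1", "B": "8", "S": "5"}

def verify_date(raw):
    """Return the corrected date string, or None if it fails the data type."""
    parts = raw.split("/")
    if len(parts) != 3:
        return None
    fixed = ["".join(LOOKALIKES.get(ch, ch) for ch in p) for p in parts]
    month, day, year = fixed
    if not (month.isdigit() and day.isdigit() and year.isdigit()):
        return None
    if 1 <= int(month) <= 12 and 1 <= int(day) <= 31 and len(year) == 2:
        return "/".join(fixed)
    return None

print(verify_date("1O/13/8I"))   # the example above -> "10/13/81"
```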

Dictionaries and database lookup functions are essentially the same, with a slight variation. The purpose of both is to validate what was extracted via data capture against pre-existing acceptable results. The simplest example to consider is existing customer names. If you are processing a form distributed to existing customers that contains first name and last name, then, because you already know the customers exist, you should be able to look each one up in a database and confirm the results. If no match is found, there is likely a problem with the form. Dictionaries provide the same value but are more static, and they are often used for fields such as product type or rate type that have one set of possibilities that rarely changes. The point is that organizations should look at the database and dictionary assets they already have to augment the data capture process and make it more accurate.
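Both checks can be sketched together. This is an illustrative toy: the customer names and rate types are invented, and a Python set stands in for what would really be a database query or a vendor dictionary feature.

```python
# Hypothetical sketch of dictionary/database lookup verification: a
# captured record passes only if its values match pre-existing data.

KNOWN_CUSTOMERS = {("Jane", "Doe"), ("John", "Smith")}   # "database"
RATE_TYPES = {"fixed", "variable"}                       # "dictionary"

def verify_record(first, last, rate_type):
    """Return a list of issues; an empty list means first verification passed."""
    issues = []
    if (first, last) not in KNOWN_CUSTOMERS:
        issues.append("unknown customer")
    if rate_type not in RATE_TYPES:
        issues.append("bad rate type")
    return issues

print(verify_record("Jane", "Doe", "fixed"))    # passes -> []
print(verify_record("Jana", "Doe", "flxed"))    # OCR errors flagged
```

Records that come back with an empty issue list can skip the operator entirely; only the flagged ones go to manual verification.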

There will always be quality assurance steps with any technology that involves interpretation of data. Organizations wanting to deny these steps either do not understand the technology, do not understand their own processes, or were misled by a vendor. Quality assurance is the place where much effort should be spent on streamlining, and one of the ways to do that is by leveraging data types, dictionaries, and databases that already exist.


If it’s not semi-structured, why fix it? Know your form’s class

Sep 24

There are two major classes of Data Capture technology: fixed and semi-structured. When processing a form, it’s critical that the right class is chosen. To complicate things, there is a population of forms out there that can be automated with either, but there is always a definite benefit of one over the other. In my experience, organizations have a very hard time figuring out if their form is fixed or not. The most common misdiagnosis comes from forms where fields are in the same location and each possesses an allotted white space for data to be entered. To most this seems fixed, but it’s actually not. Text in these boxes can move around substantially, and in addition, the boxes themselves, while they are in the same location relative to each other, can move because of copying, variations in printing, etc. There are two very easy steps to determine if your form is fixed or not.

1.) Does your form have corner stones? Corner stones, sometimes referred to as registration marks (registration marks have been known to replace corner stones when they are very clearly defined), are printed objects, usually squares, in each corner of the form. They must all be at 90 degree angles from their neighbors. Corner stones allow the software to match the scanned or input document to the original template, theoretically lining up all fields and all static elements on the form, removing any shifts, skews, etc.

2.) Does your form have pre-defined fields? A pre-defined field is more than a location on the form. A pre-defined field has a set width, height, and location, and, most importantly, a set number of characters. You know these fields most commonly from forms where you fill in a box for each letter. There are variations in how the characters are separated, but they all share these attributes. This is called mono-spaced text.

If your form does not have the above two items, it is not a fixed form, which indicates that a semi-structured forms processing technology would be the best fit. Those forms that are commonly confused for fixed can be made to process well with a fixed-form solution by isolating the input type (fax, email, scan) and using the proper arrangement of registration marks.
