Fixed, Semi-structured, UNSTRUCTURED!?

Jan 13

Lately I find myself educating even industry peers on document structure more and more often. The conversation usually starts with one of them telling me that unstructured document processing exists, or that a particular form is fixed when it is not. Understanding what is meant when talking about document structure is very important.

First, let’s start by defining a document. A document is a collection of one or more pages that has a business process associated with it. Documents of a single type can vary in length, but the content they contain, or can possibly contain, is constrained. When data capture technology works, it works on pages, so each page of a document is processed as a separate entity, and this, it seems, is the meat of the confusion.

Often someone will say a document is unstructured. What they are thinking of is that the order of the pages is unstructured. This is more or less accurate; however, the pages within this unstructured document are themselves either fixed or semi-structured. The only truly unstructured documents that exist are contracts and agreements. The test is simple: if at any moment you can pull a page from the document and state what that page is and what information it would have, then it IS NOT unstructured.

The ability to process agreements and contracts is limited to very concrete scenarios where contract variants are non-existent, which essentially means those contracts are no longer unstructured. In general, the ability to process unstructured documents does not exist. Now to explore the difference between semi-structured and fixed.

It’s actually very easy, because 80% of the documents that exist are semi-structured. Even if a field appears in the same general location on every page of a particular type, that does not make it fixed. For example, a tax form always has the same general location for printing the company name. The printer has to print within a specified range, but it can print more to the left or more to the top, and the length will vary with every input name. This makes the form semi-structured. In addition, when the document is scanned it will shift left, right, up, and down by small amounts. A document is ONLY truly a fixed form when it has registration marks and fields of fixed location and length. Registration marks are how the software matches every image to the same set of coordinates, making it more or less identical to the template.
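To make the role of registration marks concrete, here is a minimal sketch of how software might line a scan up with its template. It assumes the marks have already been detected and that the scan is only shifted, not skewed or scaled. The function names, coordinates, and field box are all made up for illustration and do not come from any particular product.

```python
from statistics import mean

def estimate_shift(template_marks, scanned_marks):
    """Average (dx, dy) offset between matching registration marks.

    Assumes both lists hold (x, y) pixel positions in the same order.
    """
    dx = mean(s[0] - t[0] for t, s in zip(template_marks, scanned_marks))
    dy = mean(s[1] - t[1] for t, s in zip(template_marks, scanned_marks))
    return dx, dy

def locate_field(template_box, shift):
    """Move a template field box (left, top, width, height) onto the scan."""
    left, top, width, height = template_box
    dx, dy = shift
    return (left + dx, top + dy, width, height)

# Hypothetical positions of four corner marks on the template and on a scan
# that landed 12 pixels to the right and 5 pixels higher than the template.
template_marks = [(50, 50), (2500, 50), (50, 3250), (2500, 3250)]
scanned_marks = [(62, 45), (2512, 45), (62, 3245), (2512, 3245)]

shift = estimate_shift(template_marks, scanned_marks)

# Hypothetical "company name" field defined once on the template.
company_name_box = locate_field((300, 400, 900, 60), shift)
print(shift, company_name_box)
```

Without marks to measure against, there is nothing reliable to compute the shift from, which is why a form without them has to be treated as semi-structured.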

There again the confusion is exposed. When having conversations about data capture, it’s very important to understand the true definitions of the lingo being used. I task you: if you catch someone using the lingo incorrectly, correct it. It will help both you and them.

Chris Riley – About

Find much more about document technologies at

Clock is ticking

Jun 30

When considering the ROI on a data capture integration, setup time is one of the most important and most often miscalculated factors. This means not just the setup time for the initial integration, but also the time spent on fine-tuning and optimization, which can sometimes postpone production.

The difference in setup time between a fixed data capture environment, where coordinate-based fields are used, and a rules-based semi-structured environment is substantial. It’s usually not the fixed environments that pose the biggest challenge in calculating or predicting ROI. It takes an administrator, on average, between 15 and 45 seconds to create and fine-tune a fixed form field. In semi-structured processing, setup for a single field can take anywhere from 60 seconds to hours, depending on the complexity of the document and the logic being deployed. It’s this large gap that throws a wrench into some ROI calculations.
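A rough worked example shows how wide that gap can get. The sketch below uses the 15 to 45 second range for fixed fields and the 60 second floor for semi-structured fields quoted above. The form size (40 fields) and the per-field estimates for the semi-structured case are hypothetical numbers chosen only to illustrate the arithmetic.

```python
FIXED_FIELD_SECONDS = (15, 45)   # per-field range for fixed forms, from the text

def fixed_setup_range(field_count):
    """Best-case and worst-case setup time, in seconds, for a fixed form."""
    low, high = FIXED_FIELD_SECONDS
    return field_count * low, field_count * high

def semi_structured_setup(per_field_seconds):
    """Total setup time, in seconds, from one estimate per field."""
    return sum(per_field_seconds)

# Hypothetical 40-field form processed as a fixed form.
fixed_low, fixed_high = fixed_setup_range(40)

# The same 40 fields processed as semi-structured (assumed estimates):
# 30 easy fields at 60 s, 8 harder fields at 10 min, 2 fields at 2 hours.
estimates = [60] * 30 + [600] * 8 + [7200] * 2
semi_total = semi_structured_setup(estimates)

print(f"fixed: {fixed_low / 60:.0f} to {fixed_high / 60:.0f} minutes")
print(f"semi-structured: {semi_total / 3600:.1f} hours")
```

Under these assumptions the fixed form is set up in 10 to 30 minutes, while the semi-structured version takes close to six hours, and almost all of that comes from the two hardest fields.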

For experienced integrators, putting a document and its associated fields into complexity classes is usually pretty easy. After doing so, gauging the average amount of time to set up each field, and thus all fields, should be accurate. There is always a field or two that requires extra fine-tuning. The key is a complete understanding of the document. Sometimes document variations are obvious; other times they sneak up on you, and you have no idea a variation exists until you start working with it. Knowing all variations is the easiest way to understand the additional time any field will take to set up. Variants are the biggest contributor of time in semi-structured data capture setup. Second is odd field types, such as fields that span one to many lines, fields that continue across two separate lines, and tables. The third and final major contributor to setup time is poor document quality. This means the administrator has to be more general when creating fields and likely has to deploy several rules per field to locate information in several possible ways.

When calculating the ROI on your data capture project, make sure to be aware of these sometimes sneaky factors that can eat into integration time. Bottom line: know your documents, and know the technology, before any work is done. If you are unsure, seek professional assistance.

If it’s not semi-structured why fix it – know your form’s class?

Sep 24

There are two major classes of Data Capture technology: fixed and semi-structured. When processing a form, it’s critical that the right class is chosen. To complicate things, there is a population of forms out there that can be automated with either, but there is always a definite benefit of one over the other. In my experience, organizations have a very hard time figuring out whether their form is fixed or not. The most common misdiagnosis comes from forms where fields are in the same location and each possesses an allotted white space for data to be entered. To most this seems fixed, but it’s actually not. Text in these boxes can move around substantially and, in addition, the boxes themselves, while they stay in the same location relative to each other, can move because of copying, variations in printing, etc. There are two very easy steps to determine if your form is fixed or not.

1.) Does your form have corner stones? Corner stones, sometimes referred to as registration marks (registration marks have been known to replace corner stones when they are very clearly defined), are printed objects, usually squares, in each corner of the form. They must all be at 90-degree angles from their neighbors. Corner stones allow the software to match the scanned or input document to the original template, theoretically lining up all fields and all static elements on the form and removing any shifts, skews, etc.

2.) Does your form have pre-defined fields? A pre-defined field is more than a location on the form. It has a set width, height, and location and, most importantly, a set number of characters. You most commonly see these fields on forms that give you a box for each letter. There are variations in how the characters are separated, but they all share these attributes. This is called mono-spaced text.

If your form does not have both of the above items, it is not a fixed form. This would indicate that a semi-structured forms processing technology would be the best fit. For those forms that are commonly confused for fixed, there are ways to make them process well with a fixed form solution by isolating the input type (fax, email, scan) and using the proper arrangement of registration marks.
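The two-step check above is simple enough to write down as a rule. The sketch below is only an illustration of that rule. The `FormTraits` class and `form_class` function are hypothetical names, and answering the two questions for a real form is still a judgment an administrator has to make by looking at it.

```python
from dataclasses import dataclass

@dataclass
class FormTraits:
    """Answers to the two questions for a single form (hypothetical structure)."""
    has_corner_stones: bool      # step 1: corner stones / registration marks
    has_predefined_fields: bool  # step 2: set size, location, character count

def form_class(traits):
    """Return 'fixed' only when both checks pass, otherwise 'semi-structured'."""
    if traits.has_corner_stones and traits.has_predefined_fields:
        return "fixed"
    return "semi-structured"

# A form with boxes for each letter but no corner stones: the common misdiagnosis.
boxed_but_unmarked = FormTraits(has_corner_stones=False, has_predefined_fields=True)
print(form_class(boxed_but_unmarked))
```

Both answers must be yes for the form to count as fixed. Fields that merely sit in the same place on every copy do not pass step 2 on their own.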
