Line-Items : Picking the correct field type

Feb 22

Documents that contain tables typically carry the majority of their information in those tables, so the demand to capture this data is very high. In data capture, organizations choose one of three approaches for these documents: ignore the table, capture the header and footer and just a portion of the table, or capture it all. Ideally organizations prefer the last option, but some strategic decisions have to be made before any integration using tables. One of those decisions is whether to capture the data in the table as a large body of individual fields or as a single table block. Let's explore the benefits and downsides of both.

Why would you ever capture a table as a large collection of individual fields when you can collect it as a single table field? Accuracy. In theory it will always be more accurate to collect every cell of a table as its own individual field. The reasons are that each field is precisely located, you remove the risk of partially collected cells or cells where the baseline is cut, and you keep white space and table lines out of the fields. In some data capture solutions this is your only choice, so many have made it very easy to duplicate fields and make small changes, reducing the time it takes to create so many fields. This is a great tool, because the downside of treating a table as a collection of individual fields is the time it takes to create all those fields, and that time may be too great to justify the increase in accuracy.
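
The field-duplication workflow described above can be sketched as a small script. The `Field` structure and coordinates here are hypothetical illustrations, not tied to any particular data capture product; the point is that one row of cell fields can be cloned down the page instead of being drawn by hand:

```python
# Sketch: generating per-cell field definitions for a fixed-layout table.
# The Field structure and pixel coordinates are hypothetical examples.
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    x: int  # left edge of the cell region, in pixels
    y: int  # top edge of the cell region, in pixels
    w: int  # cell width
    h: int  # cell height

def duplicate_row_fields(columns, first_row_y, row_height, num_rows):
    """Clone one row of cell fields down the page, one field per cell."""
    fields = []
    for r in range(num_rows):
        y = first_row_y + r * row_height
        for name, x, w in columns:
            fields.append(Field(f"{name}_{r + 1}", x, y, w, row_height))
    return fields

# Three columns, ten rows: 30 individual fields from one row definition.
columns = [("Qty", 50, 80), ("Description", 140, 300), ("Amount", 450, 100)]
fields = duplicate_row_fields(columns, first_row_y=400, row_height=30, num_rows=10)
print(len(fields))  # 30
```

Even with tooling like this, every generated field is still a separate object to verify and maintain, which is exactly the setup cost the individual-field approach trades for accuracy.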

If your data capture application can collect data as a single table block, you can set up any one document type very quickly. Table blocks require document analysis that can identify table structures in a document. The table block relies heavily on the identified tables and then applies column names according to the logic in your definition. This is what creates its simplicity, but also its problems. Sometimes document analysis finds tables incorrectly, more often partially. This can cause missing columns, missing rows, and, in the worst-case scenario, rows where the text is split vertically between two cells or columns cut in half horizontally.
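
The "applies column names per the logic in your definition" step can be illustrated with a simplified sketch. The detected headers are assumed to come from document analysis; the keyword-matching definition below is a hypothetical stand-in for whatever logic a real product uses:

```python
# Sketch: mapping a detected table's header columns to the field names in a
# table-block definition. The matching logic is a simplified illustration.
def map_columns(detected_headers, definition):
    """definition: {field_name: [keywords]} — assign each detected header
    column index to the first unassigned field whose keywords match it."""
    mapping = {}
    for idx, header in enumerate(detected_headers):
        text = header.lower()
        for field_name, keywords in definition.items():
            if field_name not in mapping and any(k in text for k in keywords):
                mapping[field_name] = idx
                break
    return mapping

definition = {
    "quantity": ["qty", "quantity"],
    "description": ["description", "item"],
    "total": ["amount", "total"],
}
headers = ["Item Description", "QTY", "Ext. Amount"]
print(map_columns(headers, definition))
# {'description': 0, 'quantity': 1, 'total': 2}
```

Notice that the mapping is only as good as the detected headers: if document analysis finds the table partially, columns simply go unmapped, which is the failure mode described above.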

Tables vary widely in complexity, and this is most often the deciding factor in which approach to take. The required accuracy, and the amount of integration time needed to reach it, also frequently determine the approach. For organizations that want line-items but do not strictly require them, table blocks are ideal. For organizations needing high accuracy and processing high volume, individual fields are ideal. Either way, it's a decision that needs to be made before any integration work.

Chris Riley – About

Find much more about document technologies at

Mysterious tables

Jun 11

In the world of data capture, the one document element that easily doubles complexity, increases software cost, and is all-around mysterious is tables. In invoices, table data is the line-item detail; in bills of lading, it is the shipping detail. Many commonly used business documents contain tables. Extracting data from tables starts with a clear understanding of table structure.

Most tables follow the typical structure: a header with column names, one to many rows of data below that align to the column names, and a footer which may contain summation data. This structure is ideal. The first added element of complexity occurs when column names do not align with the data. This can happen intentionally or due to shifts in scanning. If this is a constant or common enough occurrence, it's necessary in the data capture setup to ignore table headers completely. The next level of complexity is multi-level headers. Tables with multi-level headers amount to tables within tables: there are two or more levels of headers, the first being the parent, with subsequent levels providing additional detail, usually for a lesser number of items. The levels are usually indicated by additional indentation per level. This is most commonly found in EOBs, and it is what makes EOBs so complex. In this case, you have to capture multiple copies of the same table over and over, rather than attempting to collect the whole thing as one table. In the most complex documents with this structure, the table data capture element is not used at all; instead, a basic field-by-field approach is taken.
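
The indentation-based grouping described above can be sketched as follows. The input format is hypothetical: `(indent_level, text)` pairs as they might come from OCR, where deeper indents are detail rows belonging to the parent row above them:

```python
# Sketch: grouping EOB-style multi-level rows by indentation, rather than
# treating the whole region as one flat table. Input format is hypothetical.
def group_by_indent(rows):
    """Collect each parent row (indent 0) together with its indented
    detail rows, yielding one small 'table' per parent."""
    groups = []
    for indent, text in rows:
        if indent == 0:
            groups.append({"parent": text, "details": []})
        elif groups:
            groups[-1]["details"].append(text)
    return groups

rows = [
    (0, "Claim 1001  John Doe"),
    (1, "Office visit   120.00"),
    (1, "Lab work        45.50"),
    (0, "Claim 1002  Jane Roe"),
    (1, "X-ray          210.00"),
]
groups = group_by_indent(rows)
print(len(groups))  # 2 parent claims, each with its own detail rows
```

Each group is then captured as its own small table, which is the "multiple copies of the same table over and over" approach rather than one monolithic table definition.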

One of the biggest mistakes integrators make is assuming a certain data capture table approach will work for all their tables on all documents. The only way to know for sure is testing. The ability of data capture software to find table structures is based on a process called Document Analysis. Document Analysis tells the data capture software where ALL the tables on the document are located, allowing it to choose the best one. In the case of tables within tables, this very often results in a single table that cuts data cells in half. Document Analysis is built on probability: if the borders of cells in one column have a high location average, then that border is selected, right or wrong. The more data in a table, the greater the chance of this probability being wrong.
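
The "high location average, right or wrong" behavior can be illustrated with a toy sketch. This is a simplified stand-in for whatever statistics a real Document Analysis engine uses: detected border x-positions from each row are bucketed, and the most popular bucket wins, even when outlier rows disagree:

```python
# Sketch: choosing a column border by the most common detected position,
# a simplified illustration of probability-based border selection.
from collections import Counter

def pick_border(positions, tolerance=3):
    """Bucket detected x-positions to the nearest `tolerance` pixels and
    return the most frequent bucket's center — selected right or wrong."""
    buckets = Counter(round(p / tolerance) for p in positions)
    best_bucket, _ = buckets.most_common(1)[0]
    return best_bucket * tolerance

# Most rows put the border near x=300, but two wrapped cells shift it right.
positions = [299, 301, 300, 302, 298, 340, 341]
print(pick_border(positions))  # 300
```

The outlier rows at x≈340 lose the vote, so their text ends up split across the chosen border — and the more rows a table has, the more opportunities there are for such disagreements to tip the vote the wrong way.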

It's best to use tables on concrete document types, e.g. a single variation of a vendor invoice, or a class of vendor invoices that all share the same table type. If you prepare this way, you will not be let down by misplaced expectations; instead, you will be impressed with your table extraction.
