Capture Products, Data Capture Products, confused?

Jun 16

All technology markets are guilty of coming up with at least one or two confusing terms. In the document imaging world, the culprits are terms with very similar-sounding names. The products behind them are technically similar, but strictly different.

One of the most confusing things in the imaging world is the difference between Image Capture software, often just called Capture, and Data Capture software. Not only are the names confusing, but technically there is a lot of overlap. All data capture products have imaging capabilities, and all capture products have basic data capture. The risk of the confusion is substituting one product for the other. For example, organizations that attempt to use the data capture functionality built into a capture application for a full-blown data capture project end up with little success and a lot of frustration. Let me explain where each fits.

Capture products have the primary function of delivering quality images in a proper document structure. They often feature image clean-up, review, and page-splitting tools that are more advanced than the scanning features found in data capture applications. Most demonstrate what is called rubber-band OCR, the reading of a specific coordinate region on a page. Some go as far as creating templates where coordinate zones are saved. This is where the solutions get confused with data capture. Until there is registration of documents and a proper forms processing approach, it is not data capture. The risk of such basic templates is low accuracy and zones that do not always collect data.
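To make the risk of unregistered coordinate templates concrete, here is a minimal sketch. It treats a page as a grid of characters rather than pixels, and every name in it (`read_zone`, `apply_template`, `TEMPLATE`, the sample page) is hypothetical and not taken from any product. The same saved zones are applied to an aligned page and to a copy shifted three columns to the right, as might happen when paper feeds slightly off.

```python
from typing import Dict, List, Tuple

# A zone is (top, left, height, width) in page coordinates.
# The page is a character grid standing in for a scanned image (an assumption
# made to keep the sketch dependency-free).
Zone = Tuple[int, int, int, int]


def read_zone(page: List[str], zone: Zone) -> str:
    """Return whatever text falls inside the fixed coordinates of a zone."""
    top, left, height, width = zone
    rows = page[top:top + height]
    return " ".join(row[left:left + width].strip() for row in rows).strip()


def apply_template(page: List[str], template: Dict[str, Zone]) -> Dict[str, str]:
    """Read every saved zone of a template, with no registration step."""
    return {name: read_zone(page, zone) for name, zone in template.items()}


def shift_right(page: List[str], columns: int) -> List[str]:
    """Simulate a page that was fed slightly off by padding each row."""
    return [" " * columns + row for row in page]


PAGE = [
    "Name: Jane Doe      ",
    "Date: 2001-02-03    ",
]
TEMPLATE = {"name": (0, 6, 1, 8), "date": (1, 6, 1, 10)}

aligned = apply_template(PAGE, TEMPLATE)
shifted = apply_template(shift_right(PAGE, 3), TEMPLATE)

print(aligned)  # zones land on the field values
print(shifted)  # zones now cut through the labels and truncate the values
```

On the shifted page the zones still return text, just the wrong text, which is why a registration step that lines the page up with the template before reading zones matters so much.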

Data capture products need images to function, so it was an obvious choice to add scanning to the solutions. These solutions, however, are better fed by a full capture application that has the performance and additional features, such as batch naming, annotations, and page splitting, that the organization may require in the resulting image files. In data capture products, the purpose of image capture is only to get at the data, so they sometimes neglect the features that are important for image storage and archival.

In the end, both solutions are improving in the other’s territory. Eventually the lines will blur to the point where feature-wise they will be identical, and the benefit of one over the other will be rooted in the vendor’s expertise, either capture or data capture. If your primary requirement is quality images, the capture vendor’s solution is the better choice, but if it’s data extraction, then solutions rooted in data capture are better.

Chris Riley – About

Find much more about document technologies at

Dropout, all or none

Jan 20

Color or greyscale dropout is a great tool for increasing the accuracy of extracting data from forms. But a bad dropout is far worse than no dropout. Partially dropped-out forms can confuse data capture technology. These are commonly called “Zebra” forms: in some portions of the form the dropout was performed correctly, while in other portions the fields are now outlined in black. If you have control of the scanning and this is the situation, you are better off turning dropout off or improving its use.

It used to be that the only way to drop out a form was scanner-driven dropout. This approach was limited in the colors that could be removed. Essentially, the scanner would be equipped with colored lamps, usually red. During scanning, the lamp would be turned on, canceling out the red in the form. Because of this, it was important that printed forms used a certain type of red. If you have ever had experience with color matching, you know it’s quite frustrating, especially because the colors you see on the screen are not usually what is printed. Things have improved. Now even scanners use software dropout, where images initially arrive in color and algorithms then remove pixels of a certain color range from the document. With some scanners and software packages, this has the added benefit of being able to drop out any color, and multiple colors at a time. There are even some packages where you can drop out things like colored lines.
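As a rough illustration of software dropout, the sketch below whitens every pixel that falls within a color range around one or more target colors. It is a simplified model, not how any particular scanner or package implements it: the image is a plain list of RGB tuples, and the names `drop_out`, `color_distance`, and the `tolerance` parameter are assumptions made for this example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255
WHITE: Pixel = (255, 255, 255)


def color_distance(a: Pixel, b: Pixel) -> int:
    """Largest per-channel difference between two colors."""
    return max(abs(x - y) for x, y in zip(a, b))


def drop_out(image: List[List[Pixel]], targets: List[Pixel],
             tolerance: int) -> List[List[Pixel]]:
    """Replace pixels close to any target color with white."""
    return [
        [
            WHITE if any(color_distance(p, t) <= tolerance for t in targets) else p
            for p in row
        ]
        for row in image
    ]


RED: Pixel = (220, 40, 40)    # form lines printed in red
BLUE: Pixel = (40, 40, 220)   # a second form color
BLACK: Pixel = (0, 0, 0)      # the filled-in data

form = [
    [RED, RED, RED, RED],
    [RED, BLACK, BLACK, RED],
    [BLUE, BLUE, BLUE, BLUE],
]

# Dropping two colors in one pass, something lamp-based dropout could not do.
cleaned = drop_out(form, targets=[RED, BLUE], tolerance=30)
```

After the call, only the black data pixels remain and both form colors are gone.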

Dropout with any technology becomes difficult when there are gradations of color on the form because of bad printing, color wear, sun, or other damage. Because the software looks for a consistent color, it will skip colors that don’t match the norm. This is often seen when the first half of a form is dropped out and the second is not, because of a color change mid-document. There are tools that allow you to specify a threshold that can assist with this. The threshold can be very low when dealing with documents that have one form color and black text, but on more complex documents a low threshold can lose important data.
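The trade-off can be sketched with a toy form whose bottom half has faded. This is only an illustration under assumptions: the colors are made up, and the `tolerance` parameter here means "how far a pixel may stray from the target color and still be removed", which may run in the opposite direction from the threshold setting in a given tool.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
WHITE: Pixel = (255, 255, 255)


def drop_out(image: List[List[Pixel]], target: Pixel,
             tolerance: int) -> List[List[Pixel]]:
    """Whiten pixels whose largest channel difference from target is within tolerance."""
    def close(p: Pixel) -> bool:
        return max(abs(x - y) for x, y in zip(p, target)) <= tolerance
    return [[WHITE if close(p) else p for p in row] for row in image]


NOMINAL_RED: Pixel = (220, 40, 40)   # form color as printed
FADED_RED: Pixel = (235, 130, 130)   # same ink after sun damage (made-up value)
STAMP: Pixel = (150, 60, 60)         # a dark red stamp that is real data
BLACK: Pixel = (0, 0, 0)

form = [
    [NOMINAL_RED, BLACK, NOMINAL_RED],  # top half, printed color intact
    [FADED_RED, STAMP, FADED_RED],      # bottom half, faded
]

strict = drop_out(form, NOMINAL_RED, tolerance=30)
loose = drop_out(form, NOMINAL_RED, tolerance=100)

# strict: the top half is dropped out but the faded lines stay, a "Zebra" form.
# loose: all form lines are gone, but the stamp was removed along with them.
```

Neither setting is right for this form, which is the situation where turning dropout off entirely becomes the safer choice.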

The biggest key to proper dropout, assuming good form printing, is to scan the document as soon as possible, leaving less time for damage to take place. Dropout is a great tool, but if you find that forms are partially dropped out, it is better for data capture accuracy to turn dropout off and deal with the black and white form than to keep it.


Don’t over clean – the effects of image clean-up on accuracy

Dec 06

There is always some way to modify a scanned image to improve its recognition results if it’s not already perfect. But there are also ways to modify an image to destroy recognition results. Not all image cleanup is good for OCR for several reasons.

There are two types of image clean-up. The first is image clean-up for view-ability. These are the tricks that make images look even prettier on the screen, where the goal is what is called pixel perfect: the image looks like it was electronically generated. The second is image clean-up for OCR or Data Capture. These are the tricks that help the image get better recognition results. All image clean-up for OCR and Data Capture is good for view-ability, but not all image clean-up for view-ability is good for recognition. The reason is that engines were built and trained at a time when many image clean-up technologies were not available, and because recognition technologies interpret pixels, it’s possible to remove useful ones.

Here are some tips. Stick to certain types of clean-up for recognition when this is the primary purpose. Some products and scanners will even allow what is called “dual stream” where one scanned image produces two results that can go separate paths. If you have this function, use settings for one of the images that are best for OCR and settings for the other that are best for view-ability. Good for OCR is:

1.)Despeckle ( unless dot-matrix font )
2.)Line Straightening
3.)Basic Thresholding
4.)Background removal
5.)Correction of Linear Distortion
6.)Line Removal ( sometimes )
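Two of the items above, basic thresholding and despeckle, can be sketched together with the dual stream idea. This is a simplified illustration, not a description of any product: the greyscale image is a plain list of integers, the cutoff of 128 is an arbitrary assumption, and the function names are hypothetical.

```python
from typing import List, Tuple

Gray = List[List[int]]    # 0 = black, 255 = white
Binary = List[List[int]]  # 1 = ink, 0 = background


def basic_threshold(image: Gray, cutoff: int = 128) -> Binary:
    """One global cutoff for the whole page: darker than cutoff becomes ink."""
    return [[1 if value < cutoff else 0 for value in row] for row in image]


def despeckle(image: Binary) -> Binary:
    """Remove ink pixels that have no ink among their eight neighbors."""
    height, width = len(image), len(image[0])

    def has_ink_neighbor(r: int, c: int) -> bool:
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (dr or dc) and 0 <= rr < height and 0 <= cc < width and image[rr][cc]:
                    return True
        return False

    return [
        [1 if image[r][c] and has_ink_neighbor(r, c) else 0 for c in range(width)]
        for r in range(height)
    ]


def dual_stream(scan: Gray) -> Tuple[Binary, Gray]:
    """One scan, two outputs: a cleaned binary image for OCR, and an
    untouched greyscale copy for viewing or archival."""
    ocr_stream = despeckle(basic_threshold(scan))
    view_stream = [row[:] for row in scan]
    return ocr_stream, view_stream


scan = [
    [255, 255, 255, 255, 255],
    [255,  20,  30, 255, 255],
    [255,  25, 255, 255,  10],  # the 10 at the far right is an isolated speck
    [255, 255, 255, 255, 255],
]

ocr_image, view_image = dual_stream(scan)
```

The speck disappears from the OCR stream while the connected stroke survives. The same rule would also erase the isolated dots that make up dot-matrix characters, which is why despeckle is listed with that exception.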

Bad for OCR is:

1.)Adaptive Thresholding: Often causes a condition called “Fuzzy Characters”, where “c”s are read as “e”s. For hand-print, you often remove portions of characters.
2.)Character Regeneration: Removes critical information important to OCR and ICR processes. If you use it in OCR ( Machine-Print ) you will notice more “high confidence blanks”: the characters are so perfect they look like images to the OCR engine and are ignored. In ICR ( Hand-Print ) you will damage the hand stroke of the characters, confusing the ICR algorithms and reducing training’s ability to understand the subject, which ultimately reduces accuracy.
3.)Line Removal: Bad line removal makes bad OCR. Line fragments really interfere with OCR and ICR processes.

When using image clean-up for OCR and Data Capture processes, use only the techniques that improve recognition rates, not those that destroy them.
