Know your accuracy before you even test

Feb 25

One of the natural abilities that develops as you see millions of sample images and their associated recognition results is that you begin to notice patterns and can instantly identify whether a document will read well, both for full-page document conversion and for field-level recognition. It has more or less become second nature to me, but I can identify its components.

First is initial image quality. Without identifying any specific objects on the page yourself, look objectively at the document as a collection of unidentified objects and judge whether the image quality is good. This is determined by the coherence of each object. Are object borders tight and determinable? Are there objects interfering with other objects? Is the background of the image significantly different from all objects?

Second is identification of objects. Find text, graphics, lines, paragraphs, etc. Are their borders far enough apart? Is their type clear? This is most important for text. Is their printing consistent? For example, if text goes from one background color to another, that makes it inconsistent. As another example, does the straightness of lines change throughout the document? And can one object be confused for another?

And third, now that you know the objects, how easy is it to determine their value? Is the value obvious? Do you have to look at it for a while to figure it out?

Essentially, the three steps above are exactly what the conversion (OCR, ICR, OMR) product does in order to read a document. With field-level recognition it’s a bit more elaborate, but the core is the same. By identifying early on what the anticipated accuracy of a document is, you can adjust your scan or input settings accordingly, even before looking at any technology. Doing this will give you the best chance for success.
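One of the first-step questions, whether the background differs significantly from the objects, can be roughed out in code. The following is only a minimal sketch, not how any OCR product measures quality: it assumes the page is already available as rows of 8-bit grayscale values, and the function name `background_contrast` and the cutoff of 128 are hypothetical choices made for illustration.

```python
def background_contrast(pixels, threshold=128):
    """Return the gap between mean background and mean object brightness.

    pixels: rows of 8-bit grayscale values (0 = black, 255 = white).
    threshold: assumed cutoff separating dark objects from light background.
    """
    flat = [value for row in pixels for value in row]
    dark = [v for v in flat if v < threshold]
    light = [v for v in flat if v >= threshold]
    if not dark or not light:
        # Nothing to separate: a blank page or a solid block.
        return 0.0
    return sum(light) / len(light) - sum(dark) / len(dark)


# Two tiny made-up samples: crisp black-on-white versus a faded copy.
clean = [[255, 255, 0, 255], [255, 0, 0, 255]]
faded = [[200, 200, 120, 200], [200, 120, 120, 200]]

clean_gap = background_contrast(clean)
faded_gap = background_contrast(faded)
print(clean_gap, faded_gap)
```

A larger gap suggests objects that stand out clearly from the background. A small gap is a hint to adjust scan settings before testing any recognition technology.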

Chris Riley – About

Find much more about document technologies at

Imprint vs. Annotate

Jan 28

Large-volume scanning environments often need to imprint (herein “stamp”) information, usually the date of scan, on each and every page that is processed. This requirement exists for tracking purposes and sometimes for compliance. Many service bureaus require more than just a date; they require batch IDs and other important tracking information. The question becomes how to do this in the best way. There are several options.

Pre-Scan Imprint

Pre-scan imprint, the most common option, allows an organization to have the stamp on both the physical paper copy and the scan. Scanners capable of pre-scan imprint print the data in the proper location before the page reaches the scanner's lamps, so the stamp also becomes part of the scan. This is the most common option because there are times when a scanned image needs to be compared with a physical document, and a stamp on both is what makes that possible. Scanners with the imprint feature come at a premium and require more maintenance.

Post-Scan Imprint

If the organization only needs the data or tracking mechanism on the physical paper, they can imprint after the scan. Some scanners support post-scan imprinting; otherwise organizations feed the paper through an additional printing process. Usually the purpose of this operation is simply to mark whether a page has been processed or not. Scanners with the post-scan imprinting feature run nearly the same price as those with pre-scan imprint and are gradually being phased out in favor of them.

Software Annotation

If the organization only needs the data or tracking mechanism on the scanned image, they may elect to do software annotation. Software annotation gives the greatest flexibility of all three options, as any combination or sequence of data can be printed anywhere on the image. It does require an additional piece of software. Very often organizations will choose software annotation instead of paying the premium for imprinting scanners, but they sacrifice the physical imprint. The application that provides the annotation needs to be automated and batch driven.
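To make "automated and batch driven" concrete, here is a minimal sketch of how the stamp text for each page of a batch might be generated before it is drawn onto the images. Everything in it is an assumption for illustration: the function name `build_stamps`, the batch ID format, and the layout of the stamp string are hypothetical, and actually drawing the text onto an image is left to whatever annotation software is in use.

```python
from datetime import date


def build_stamps(batch_id, scan_date, page_count):
    """Build one stamp string per page: scan date, batch ID, page sequence."""
    return [
        f"{scan_date.isoformat()} | {batch_id} | page {number} of {page_count}"
        for number in range(1, page_count + 1)
    ]


# Hypothetical batch of three pages.
stamps = build_stamps("B-0042", date(2024, 1, 28), 3)
for stamp in stamps:
    print(stamp)
```

Because the stamp is just data, the same routine can combine or reorder fields freely, which is the flexibility a physical imprinter cannot match.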

The alternative to the three methods above is manual stamping. Manual stamping is tedious, time-consuming, and often inaccurate. It’s up to the organization to review the three options and pick the best fit for their production and budget.

Playing tricks with images, up-sampling

Nov 05

Often organizations have no control over the images they receive. Images can come via fax, which has a varied range of resolutions, or they can come as poor scans. Neither is good for data capture and OCR processes. Fortunately, there are a lot of imaging tools and tricks out there to help. None of these tools replace a good scan, but some get close. One of the tools not often thought about is up-sampling.

Up-sampling is the process of taking an image at a lower resolution and increasing it to a higher resolution. The technology basically increases the resolution of an image, then fills in the new empty pixels with values predicted from the original image. For data capture and OCR, up-sampling is usually done from 150 DPI or 200 DPI to 300 DPI. Up-sampling technologies have become very impressive and useful. Often I will recommend up-sampling over working with a source that has lower resolution. But let's talk about the facts and how and when you should consider up-sampling.
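The "fill in new pixels with predicted values" idea can be sketched in a few lines. This is a minimal illustration only: it assumes bilinear interpolation as the prediction method (commercial up-sampling tools use their own, usually more sophisticated, techniques), a grayscale image held as rows of 8-bit values, and a hypothetical function name `upsample_bilinear`.

```python
def upsample_bilinear(pixels, source_dpi, target_dpi):
    """Up-sample a grayscale image (rows of 0-255 values) by target/source DPI."""
    src_h = len(pixels)
    src_w = len(pixels[0])
    scale = target_dpi / source_dpi
    dst_h = round(src_h * scale)
    dst_w = round(src_w * scale)
    result = []
    for y in range(dst_h):
        # Map the output row back to a fractional row in the source image.
        sy = min(y / scale, src_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, src_h - 1)
        fy = sy - y0
        row = []
        for x in range(dst_w):
            sx = min(x / scale, src_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, src_w - 1)
            fx = sx - x0
            # Blend the four nearest source pixels by distance.
            top = pixels[y0][x0] * (1 - fx) + pixels[y0][x1] * fx
            bottom = pixels[y1][x0] * (1 - fx) + pixels[y1][x1] * fx
            row.append(round(top * (1 - fy) + bottom * fy))
        result.append(row)
    return result


# A made-up 2x2 patch taken from 150 DPI to 300 DPI becomes 4x4.
patch = [[0, 100], [100, 200]]
bigger = upsample_bilinear(patch, 150, 300)
for line in bigger:
    print(line)
```

Every value that was not in the original patch is a guess derived from its neighbors, which is why the result can never carry more real detail than the source did.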

Up-sampling should be considered for documents that have a low amount of noise, such as watermarks, spills, stains, stamps, or speckling: essentially, documents that are of good quality and scanned well but at low resolution. You should also avoid up-sampling documents with close spacing of elements and text crowding. In those two scenarios, noisy documents and crowded documents, it’s better to work with the source image as-is and work around the problems.

The bigger the gap between the source resolution and the desired resolution, the greater the risk of fragments after up-sampling. For example, 150 DPI to 300 DPI will not yield the quality that 150 DPI to 200 DPI will. This is why going crazy and up-sampling to the highest possible resolution is not a good idea. It’s like taking a very small image and trying to zoom in as far as you can to get details that you probably won't get. Trying to trick the system will only hurt you. Up-sampling from 150 DPI to 200 DPI and then again to 300 DPI would not be better than converting directly from 150 DPI to 300 DPI. In fact, it would be a pretty big mistake. Essentially, what you are doing is magnifying the mistakes created during up-sampling, as they now get propagated twice. This will likely decrease your quality and can result in such things as bloated characters, fuzzy characters, or an abundance of speckling. The goal is to do as few conversions on the document as possible.
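As a small worked example of doing the conversion once, the target pixel dimensions can be computed directly from the source and desired DPI, so the image is resampled in a single pass. This is a sketch under stated assumptions: the function name `plan_upsample` is hypothetical, and the sample page is assumed to be a letter-size sheet scanned at 150 DPI.

```python
def plan_upsample(width_px, height_px, source_dpi, target_dpi):
    """Return (scale, new_width, new_height) for one direct up-sampling pass."""
    if target_dpi <= source_dpi:
        raise ValueError("target DPI must be higher than source DPI")
    scale = target_dpi / source_dpi
    return scale, round(width_px * scale), round(height_px * scale)


# Assumed letter-size page (8.5 x 11 in) at 150 DPI is 1275 x 1650 pixels.
plan = plan_upsample(1275, 1650, 150, 300)
print(plan)
```

Feeding these dimensions to the resampling tool in one step avoids the intermediate 200 DPI image and the second round of predicted pixels that would come with it.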

I will always defer to a proper scan over any image techniques, but when you do not have control of the image scan, one of the image tools to consider is up-sampling. Uneducated use of the technology is unsafe, as is true of all advanced technologies, but if you stick with the facts and pick a great technology, you will be successful.
