It’s hard for people to accept the possibility of over cleaning a scanned image. I myself would love to believe you can clean-up an image so much that it does not matter what OCR technology you use, it will always be 100% accurate. The fact is however, that OCR engines don’t work this way. There are particular ways to improve the quality of a document, and there are ways that image clean-up hurts your OCR accuracy. I am going to talk about two such phenomenon. Fuzzy characters, and characters with legs.
In data capture, a commonly sought after imaging technique to use is line-removal. Line-removal attempts to find all lines in a form and make them disappear. Especially when considering forms where text is filled in fields where each character and the field itself is bounded in lines. Most forms processing tools have actually advanced in a way that they incorporate the lines in the algorithm and anticipate them being there. They can thus recognize the characters even with lines. What often happens when a line-removal algorithm is used, you get characters with legs. Like the name sounds, these are characters where on the top and or bottom of the character a portion of the line remains where it touches the character. The result is the character no longer looks like its original self. For most characters they become un-recognizable, for others they become another character for example an H becomes an A and an I becomes a T. For this reason, line-removal is no longer a recommended image clean-up tool for data capture.
The next imaging technique is both extremely beneficial to data capture or detrimental. It all has to do with the form itself. I’m talking about despeckle. Despeckle is the algorithm that removes annoying dots on the document and enhances both the read of characters as well as the removal garbage that might be recognized as characters. Despeckle is usually beneficial to data capture, especially hand-print forms where the dots can interfere with the ICR algorithm. Where despeckle hurts data capture and forms processing is when the dots touch characters. Similar to line-removal, if the dots are touching the characters, the segmentation tool believes it’s a part of the character so leaves it. Thus you get fuzzy characters. Fuzzy characters are very difficult for OCR engines to read. It’s a simple test, look at your form and notice weather or not the dots on the form touch the characters or not. If they do, you are better off working with the dots.
These two examples demonstrate huge differences in OCR accuracy and are simply choices made on the image itself not including setup or the software you use.
Chris Riley – AboutFind much more about document technologies at www.cvisiontech.com.