Set it and forget it OCR

Sep 22

My office is a paper monster. Paper comes in and never leaves intact. The scary part is how fast this happens: paper in hand, review its contents and assess its value, scan it, shred it, usually within minutes of its arrival. The value of set it and forget it OCR is tremendous, but you have to be comfortable trusting search to find everything later.

Set it and forget it OCR is where you take your OCR product and configure it to automatically process any images that appear in a certain folder. For my office, I scan to an “input” folder and all the resulting compressed and OCR’ed PDF files end up in the “File Cabinet” folder. My strategy will not work for the timid, because I’m relying solely on the power of OCR text and search to retrieve documents when I need them. Most people would rather configure their ADF scanner with a setting or folder for each particular class of documents. Most document scanners these days let you program as few as 9 and as many as 99 destinations. You can set each destination as its own input folder with its own OCR settings and its own output folder.
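To make the folder idea concrete, here is a minimal sketch of one pass over an input folder, written in Python. It is an illustration under assumptions, not how any particular OCR product works: `process_input_folder` and `placeholder_ocr` are hypothetical names, and `placeholder_ocr` only copies the file so the example runs without an OCR engine installed. In practice you would swap in a call to your own OCR product and run the pass on a schedule.

```python
import shutil
from pathlib import Path
from typing import Callable, List

# File types treated as scans; adjust to whatever your scanner produces (assumption).
IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg", ".jpeg", ".pdf"}


def process_input_folder(
    input_dir: Path,
    output_dir: Path,
    ocr: Callable[[Path, Path], None],
) -> List[Path]:
    """Run `ocr` on every scan in input_dir and return the files written to output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    produced = []
    for src in sorted(input_dir.iterdir()):
        if not src.is_file() or src.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # leave anything that is not a scan untouched
        dst = output_dir / (src.stem + ".pdf")
        ocr(src, dst)   # hypothetical hook: your OCR product goes here
        src.unlink()    # the input folder is emptied as files are processed
        produced.append(dst)
    return produced


def placeholder_ocr(src: Path, dst: Path) -> None:
    """Stand-in for a real OCR call: just copies the scan to the destination."""
    shutil.copyfile(src, dst)
```

A scanner with several destinations would simply call `process_input_folder` once per destination, each with its own input folder, output folder, and OCR hook.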

I know I can do this because I know what settings it takes to get OCR good enough to leave at least one usable keyword on each document for search. And after all, I’m an expert in OCR, so not using it every day would be crazy in its own right. I’ve yet to be proven wrong: my “File Cabinet” abyss has always given me the information I need at the time I asked for it, and sometimes even new information I did not realize I had.

Now for you records management folks shaking your heads, I understand your complaint. It should not be about my approach but about what I do with the final paper product. Items that your taxonomy deems a record for legal or business reasons should be filed as such, perhaps scanned again as a record, and for heaven’s sake, if you are not supposed to destroy it, don’t!

The purpose of my madness is to touch paper as little as possible and get information only when I need it. I am an extremist, but I assure you there is serious value, and a little fun, in the set it and forget it OCR technique.

Chris Riley – About

Find much more about document technologies at

You can read the fine-print

Apr 28

As fonts get smaller, the challenge of reading them with OCR software increases. However, there are some key things organizations should be aware of when reading the fine print.

OCR technology today is capable of reading fonts as small as 8 pt or even 6 pt very accurately. It used to be the case that unless you had a 12 pt font you stood no chance. Thanks to higher-quality scans and more advanced OCR engines, reading small fonts will not be a problem if the right approaches are used.

Small fonts are more sensitive to image quality and to degradation of the document. For this reason, original source images scanned at 300 DPI or higher are necessary. For normal fonts there is seldom reason to scan higher than 300 DPI, but for small fonts the goal is to get them to appear more or less the same as the regular fonts, so scanning them at 400 to 600 DPI is useful. Additionally, “clean” documents are very important. A smudge or spill on a document impacts a smaller font many times more than a larger font because the lines are closer together. Once you have good image quality you can start the conversion.
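The arithmetic behind those DPI numbers is simple enough to sketch. A point is 1/72 of an inch, so the nominal height of a font in pixels is its point size times the DPI, divided by 72. The function names below are hypothetical, and point size is only a rough proxy for the actual height of the printed characters.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def nominal_height_px(font_pt: float, dpi: float) -> float:
    """Nominal height of a font in pixels when scanned at the given DPI."""
    return font_pt * dpi / POINTS_PER_INCH


def dpi_to_match(small_pt: float, reference_pt: float = 12.0,
                 reference_dpi: float = 300.0) -> float:
    """DPI at which a small font covers as many pixels as the reference font."""
    return reference_pt * reference_dpi / small_pt


print(nominal_height_px(12, 300))  # 50.0 pixels for 12 pt at 300 DPI
print(nominal_height_px(6, 300))   # 25.0 pixels: half the detail to work with
print(dpi_to_match(8))             # 450.0 DPI makes 8 pt look like 12 pt at 300
print(dpi_to_match(6))             # 600.0 DPI makes 6 pt look like 12 pt at 300
```

That is where the 400 to 600 DPI range comes from: it gives 8 pt and 6 pt text roughly the same number of pixels that 12 pt text gets at 300 DPI.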

The next biggest benefit for small fonts comes from zoning them separately. Zoning is the process of rubber-banding the region where the text exists. When small fonts are grouped in the same zone with normal-sized fonts, the OCR software assumes they should all be the same size, and the confidence and accuracy go down. If you zone the small fonts separately, you increase the OCR engine’s ability to use experts just for small fonts and increase the accuracy on them.
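Zoning is usually done by hand or by the OCR product's own layout analysis, but the idea of keeping small text apart can be sketched in a few lines. The example below is an illustration only: the `Line` type, `zone_by_height`, and the 0.75 ratio are all assumptions, and it presumes you already have the position and height of each text line in pixels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    top: int     # vertical position of the text line on the page, in pixels
    height: int  # height of the text line in pixels (assumed positive)


def zone_by_height(lines: List[Line], ratio: float = 0.75) -> List[List[Line]]:
    """Group lines top to bottom, starting a new zone when the text size changes.

    A line joins the current zone only if its height is within `ratio` of the
    height of the first line in that zone.
    """
    zones: List[List[Line]] = []
    for line in sorted(lines, key=lambda l: l.top):
        if zones:
            reference = zones[-1][0].height
            small, large = sorted((reference, line.height))
            if small / large >= ratio:
                zones[-1].append(line)
                continue
        zones.append([line])
    return zones


page = [Line(100, 50), Line(160, 48), Line(900, 25), Line(930, 24)]
zones = zone_by_height(page)
print(len(zones))  # 2: body text in one zone, fine print in another
```

Each resulting zone can then be sent to the OCR engine on its own, so the fine print is never judged against the body text around it.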

Next time someone tells you to read the small print, tell them you won’t read it; you will scan and OCR it.

Multi-Pass document recognition

Nov 04

When accuracy is the primary concern in document recognition, the best technique is multiple passes of the OCR or recognition process. Similar to how you would have a document manually entered two to three times, why not have an OCR engine convert it 3, 4, or even 5 times all with different settings?

The important thing to note in multiple pass recognition is that you NEVER use a different engine for the same process. Reconciling results from two separate engines is self-defeating. This approach is often called voting, and it does not work because each engine reports character confidence differently, so you might end up always picking the less accurate engine just because it claimed more confidence than the more accurate one. But using the same engine multiple times with different settings is consistent and a good idea.

One scenario where this is being used successfully is with documents that have both machine and hand-printed text. A first read can be done with an OCR engine using settings A, and a second read can be done with the same OCR engine using settings B. Areas where both reads produce nothing but garbage text may well contain hand print. Now you can use ICR (a hand-print engine) in those regions to pick up additional information. That is 3 total passes of recognition. The results are combined to make the final document.
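Here is a rough sketch of that "both reads produced garbage" check. Everything in it is an assumption made for illustration: the results of each pass are represented as a plain dictionary from region name to recognized text, and garbage is judged by the share of characters that fall outside a small allowed set. A real engine would give you better signals, such as its own confidence values.

```python
import string
from typing import Dict, List

# Characters expected in ordinary business text (assumption; extend as needed).
ALLOWED = set(string.ascii_letters + string.digits + " .,;:'\"-()/$%&")


def looks_like_garbage(text: str, threshold: float = 0.3) -> bool:
    """True if the text is empty or mostly made of unexpected characters."""
    stripped = text.strip()
    if not stripped:
        return True
    bad = sum(1 for ch in stripped if ch not in ALLOWED)
    return bad / len(stripped) > threshold


def regions_for_icr(pass_a: Dict[str, str], pass_b: Dict[str, str]) -> List[str]:
    """Regions where both OCR passes produced garbage, so hand print is likely."""
    return [
        region
        for region in pass_a
        if region in pass_b
        and looks_like_garbage(pass_a[region])
        and looks_like_garbage(pass_b[region])
    ]


settings_a = {"header": "Invoice 1001", "notes": "~|}{^^ ~`|"}
settings_b = {"header": "Invoice 1OO1", "notes": "}{|\\~ ^`"}
print(regions_for_icr(settings_a, settings_b))  # ['notes']
```

Only the regions returned here would go on to the third pass with the ICR engine.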

At minimum, 3 runs of the same engine would be ideal, as the statistical chance of different settings producing the same error drops drastically and the final output is nearly as good as it’s going to be. Some document types lend themselves to multiple pass recognition more than others. Sometimes it’s determined by the environment. For example, environments that have a lot of traditional documents mixed with invoice-like documents would benefit from a full-page read with standard settings on every page plus a full-page read with special document analysis designed for documents with lines and tables.
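One simple way to combine three runs of the same engine is to keep a value only when a majority of the runs agree on it. The sketch below assumes each run's output has already been reduced to a dictionary of field names and values, and `reconcile` is a hypothetical name. It relies on agreement alone rather than confidence scores, and anything without a majority is left as `None` for a person to review.

```python
from collections import Counter
from typing import Dict, List, Optional


def reconcile(passes: List[Dict[str, str]]) -> Dict[str, Optional[str]]:
    """Keep each field's value only if more than half of the passes agree on it."""
    fields = {name for result in passes for name in result}
    merged: Dict[str, Optional[str]] = {}
    for name in sorted(fields):
        votes = Counter(result.get(name, "") for result in passes)
        value, count = votes.most_common(1)[0]
        # An empty string means a pass found nothing; never accept that as a value.
        merged[name] = value if value and count > len(passes) / 2 else None
    return merged


runs = [
    {"total": "104.50", "date": "11/04/2009"},  # settings A
    {"total": "104.50", "date": "11/O4/2009"},  # settings B
    {"total": "1O4.50", "date": "1l/04/2009"},  # settings C
]
print(reconcile(runs))  # {'date': None, 'total': '104.50'}
```

With only two runs a disagreement can never be settled this way, which is one practical reason to prefer three or more.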

While multiple pass OCR slows down the entire process, it’s still faster and more accurate than manual entry most of the time. I recommend this approach for any organization where accuracy is the primary concern.
