Set it and forget it OCR

Sep 22

My office is a paper monster. Paper comes in and never leaves intact. The scary part is how fast this happens: paper in hand, I review its contents and assess its value, scan it, and shred it, usually within minutes of its arrival. The value of set it and forget it OCR is tremendous, but you have to be comfortable with it.

Set it and forget it OCR is where you take your OCR product and configure it to automatically process any images that appear in a certain folder. For my office, I scan to an “input” folder and all the resulting compressed and OCR’ed PDF files end up in the “File Cabinet” folder. My strategy will not work for the timid because I’m relying solely on the power of OCR text and search to retrieve documents when I need them. Most people would rather configure their ADF scanner to have a setting or folder for each particular class of documents. Most document scanners today have as few as 9 and as many as 99 destinations you can program. You can set each destination as its own input folder with its own OCR settings and its own output folder.
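To make the hot folder idea concrete, here is a minimal Python sketch of one pass over an “input” folder. It is an illustration under stated assumptions, not how any particular OCR product works: `process_hot_folder` and `fake_ocr` are hypothetical names, and `fake_ocr` only copies bytes where a real engine would produce a searchable PDF.

```python
import shutil
import tempfile
from pathlib import Path


def process_hot_folder(input_dir: Path, output_dir: Path, convert) -> list[Path]:
    """Process every file currently waiting in input_dir and return the outputs.

    `convert` is a placeholder for the real OCR step (an assumption): it
    receives the source path and the destination path and writes the result.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    produced = []
    for source in sorted(input_dir.iterdir()):
        if not source.is_file():
            continue
        target = output_dir / (source.stem + ".pdf")
        convert(source, target)
        source.unlink()  # the input folder is only a staging area
        produced.append(target)
    return produced


def fake_ocr(source: Path, target: Path) -> None:
    # Stand-in for an OCR engine: copies the bytes so the flow can be run.
    shutil.copyfile(source, target)


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    inbox = root / "input"
    inbox.mkdir()
    (inbox / "scan_0001.tif").write_bytes(b"fake image data")
    results = process_hot_folder(inbox, root / "File Cabinet", fake_ocr)
    print([path.name for path in results])
```

A real setup would call `process_hot_folder` repeatedly, for example every few seconds, so that new scans are picked up shortly after they land in the folder.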

I know I can do this because I know what settings it takes to get OCR quality good enough to yield at least one usable keyword per document for search. And after all, I’m an expert in OCR, so not using it every day would be crazy in its own right. I’ve yet to be proven wrong: my “File Cabinet” abyss has always given me the information I need at the time I asked for it, and sometimes even new information I did not realize I had.

Now for you records management folks shaking your heads, I understand your complaint. It should not be about my approach but about what I do with the final paper product. Items that your taxonomy deems a record for legal or business reasons should be filed as such, perhaps scanned again as a record, and for heaven’s sake, if you are not supposed to destroy it, don’t!

The purpose of my madness is to touch paper as little as possible, and get information only when I need it. I am an extremist, but I assure you there is serious value, and a little fun in the set it and forget it OCR technique.

Chris Riley – About

Find much more about document technologies at

Replacement for fax right under our noses

Jul 12

How does a technology first invented in 1843 and executed in 1924 still exist as a primary function in our working lives? I’m talking about fax. Fax technology is old and outdated. I personally avoid fax simply on principle. But my principle alone will not make big changes in adoption. What people don’t understand is that we have a fax replacement right under our noses, one that is both green and just as easy to use.

The combination of a document scanner, imaging software, and email software is a complete fax replacement solution. Instead of typing in phone numbers, users can type in email addresses. With fax you double the amount of paper that exists: paper in, paper out. With the document scanning approach you reduce paper consumption: paper in, email out. Most document scanners today even ship with a pre-configured “Scan to Email” option. On a production level, systems can be set up in offices, your local Kinkos, wherever, to allow multiple users to access the same document scanner and scan to any email address with a basic step-by-step wizard.
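For the technically curious, the “paper in, email out” step can be sketched with Python’s standard `email` package. This is only an illustration: the function name `build_scan_email` and the example addresses are my assumptions, and actually sending the message (for instance with `smtplib`) depends on a mail server that is not shown here.

```python
from email.message import EmailMessage


def build_scan_email(sender: str, recipient: str, filename: str, pdf_bytes: bytes) -> EmailMessage:
    """Wrap a scanned PDF in an email, the way a Scan to Email button might."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"Scanned document: {filename}"
    message.set_content("The scanned document is attached.")
    # Attach the scan as a PDF; the bytes would come from the scanner output.
    message.add_attachment(
        pdf_bytes, maintype="application", subtype="pdf", filename=filename
    )
    return message


if __name__ == "__main__":
    email = build_scan_email(
        "scanner@example.com", "recipient@example.com", "invoice.pdf", b"%PDF-1.4 placeholder"
    )
    print(email["Subject"])
```

The recipient’s email address takes the place of the fax number, and nothing is printed on the receiving end unless the recipient chooses to print it.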

Not only does scan to email save trees, it also increases efficiency, and when combined with workflow, document imaging, OCR, and data capture, it adds much greater value to that single piece of paper.

These systems do in fact exist in small corners of the world, and I have participated in the development and setup of them. Adoption is still very low. What it comes down to is fear of change. People understand paper to paper. Many users of fax don’t even know what email is. There are two ways this can be solved: time and forced adoption. While I would hope for the second, which would be a campaign of replacing all fax machines with scanners, it’s very unlikely and requires unity among multiple competing entities.

No, I do not like fax, but I understand it. And I hope that sooner rather than later people see that a fax replacement that saves trees and increases efficiency has existed for many years.

What you OCR is what you get

May 04

Often the purpose of doing Optical Character Recognition (OCR) for individuals and companies is to get a digital version of a document that they intend to edit and/or re-purpose. This is not the most common use of the technology, but it is a use that requires specific attention.

In order to convert a document so that it is printable later on, it’s important to not only get the text from the document but also the format of the text. This includes layout as well as things such as graphics, and font colors. To do this, the OCR product must be able to recognize colors (requires color scanning), recognize font styles, and very importantly, recognize document structure.

Engines that support advanced document analysis have this. Document analysis (DA) is the process that happens before any text is read on a page. Document analysis makes sense of a document in order to improve recognition as well as capture the formatting required for a formatted export. First, document analysis finds document structure, i.e. columns, tables, text, paragraphs, and lines. Once this is done, it identifies colors in text and graphics. After document analysis has done its job, recognition can begin. During recognition, the style of fonts is detected: bold, italic, underlined. All of this is put together into a result formatted as close as possible to the input document.

For those individuals who are concerned about re-purposing their documents, a straight text OCR engine will not work. Basic OCR engines get the text on the document in digital form and nothing more. For these individuals, it’s important to find a solution that has good document analysis.

Let the OCR do the talking for you

Feb 08

I’ve covered various interesting and non-conventional uses of OCR. I would like to talk about a new one: OCR to speech. The blind community is familiar with the technology, and it assists them in their everyday lives. The key to OCR to speech is simplicity. When the concept was first developed, it required a very elaborate combination of software and hardware; now it’s possible to take the latest and greatest OCR technology and make it talk for you with a simple configuration.

It requires a document scanner with an easy physical button interface, programmed to scan an image at 300 DPI to a folder on a machine. Traditional documents work very well for OCR to speech, whereas documents that have a lot of graphics and non-traditional formats may be more challenging. It’s important that the technology is able to omit garbage. To do this, the OCR process should be driven by a dictionary. The words recognized must be in this dictionary or they will not show up in the final results. The reason for this is that a lot of time can be wasted if bad recognition results are spoken.
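The dictionary idea can be sketched in a few lines of Python. This is a simplified illustration of filtering after recognition, not how any specific OCR engine applies its dictionary internally; the function name `filter_with_dictionary` and the tiny word list are assumptions for the example.

```python
import re


def filter_with_dictionary(ocr_text: str, dictionary: set[str]) -> str:
    """Keep only the tokens whose letters form a word found in the dictionary."""
    kept = []
    for token in ocr_text.split():
        # Compare on letters only, so "invoice," still matches "invoice".
        word = re.sub(r"[^a-z']", "", token.lower())
        if word and word in dictionary:
            kept.append(token)
    return " ".join(kept)


if __name__ == "__main__":
    words = {"please", "pay", "this", "invoice"}
    # "th1s" and "~#@" stand in for typical recognition garbage.
    print(filter_with_dictionary("Please pay th1s invoice, ~#@", words))
```

Anything that fails the dictionary check is simply never handed to the text to speech step, so the listener does not have to sit through spoken garbage.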

Once the OCR engine has done its job of accurately and automatically converting an image to text, the ASCII text results from OCR are saved into a directory. Now it’s time to automatically turn the text into speech. There are many text to speech applications out there, some free, some for pay. The goal is to find one that also reads results from a directory and automatically speaks the text over computer speakers.

It can be that easy! Some users of such technologies spend more time trying to find an acceptable digital voice than actually configuring the solution. I assure you the packages exist and, when configured correctly, are very accurate. One scanner, one hot-folder-driven OCR application, and one text to speech application that is also hot-folder driven will give you a robust OCR to speech solution that can be set up in minutes.

Dropout, all or none

Jan 20

Color or greyscale dropout is a great tool for increasing the accuracy of extracting data from forms. But a bad dropout is far worse than no dropout. Partially dropped out forms can confuse data capture technology. These are commonly called “Zebra” forms: in some portions of the form dropout is performed correctly, while in other portions the fields are now outlined in black. If you have control of the scanning and this is the situation, you are better off turning off dropout or improving its use.

It used to be that the only way to drop out a form was to use scanner driven dropout. This approach was limited in the colors that could be removed. Essentially, the scanner would be equipped with colored lamps, usually red. During scanning, the lamp would be turned on, thus canceling out the red in the form. Because of this, it was important that printed forms used a certain type of red. If you have ever had experience with color matching, you know it’s quite frustrating, especially because the colors you see on the screen are not usually what is printed. Things have improved: now even scanners are using software dropout, where images initially arrive in color and algorithms then remove pixels of a certain color range from the document. This has created the added benefit, with some scanners and software packages, of being able to drop out any color, and multiple colors at a time. There are even some packages out there where you can drop out things like colored lines.
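Software dropout is easy to picture in code. The sketch below is a toy version working on a grid of RGB tuples rather than a real image file, and it is not the algorithm of any particular scanner or package; `drop_out_color` and its `tolerance` parameter are names I am assuming for the illustration.

```python
Pixel = tuple[int, int, int]
WHITE: Pixel = (255, 255, 255)


def drop_out_color(image: list[list[Pixel]], target: Pixel, tolerance: int) -> list[list[Pixel]]:
    """Return a copy of the image with pixels near the target color turned white.

    A pixel counts as "near" when each of its red, green, and blue values is
    within `tolerance` of the target color (a simplifying assumption).
    """
    def matches(pixel: Pixel) -> bool:
        return all(abs(value - wanted) <= tolerance for value, wanted in zip(pixel, target))

    return [[WHITE if matches(pixel) else pixel for pixel in row] for row in image]


if __name__ == "__main__":
    form = [
        [(200, 30, 30), (0, 0, 0)],        # fresh red form line, black text
        [(230, 60, 50), (255, 255, 255)],  # faded red form line, blank paper
    ]
    print(drop_out_color(form, target=(210, 40, 40), tolerance=25))
```

With a tolerance of 25 both red pixels in the example are removed and the black text survives. Shrink the tolerance to 5 and neither red is close enough to the target, so the form lines stay on the page.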

Dropout with any technology becomes difficult when there are gradations on the form because of bad printing, color wear, sun, or other damage. Because the software is looking for consistency with any dropout, it will skip colors that don’t match the norm. This is often seen when the first half of a form is dropped out and not the second because of a color change mid-document. There are tools that allow you to specify a threshold that can assist with this. A very low threshold can work when dealing with documents that have only one color and black text, but more complex documents with a low threshold can lose important data.

The biggest key to proper dropout, assuming good form printing, is to scan the document as quickly as possible, removing the time for damage to take place. Dropout is a great tool, but if you find that forms are partially dropped out, it is better for data capture accuracy to turn dropout off and deal with the black and white form than to keep it.

Path to simple yet robust document routing

Dec 30

When it comes to the input path that documents follow, for many it’s as simple as scan, convert, save, but others require more complex work-flows. The good news is there are tools out there to perform even the most advanced work-flows you could imagine. The bad news, they are expensive. I’m here to tell you about a way of combining your scanner and data capture, OCR, and document conversion software to make more complex work-flows without the premium.

By using settings that come with most document scanners and the ability of most data capture, OCR, and document conversion products to utilize hot-folders (watch folders), you can create robust multi-step work-flows out of the box. What you need is a scanner that supports multiple destinations, usually 9 or more. This is indicated by an LED on your document scanner which, at the point of a batch scan, allows you to pick a destination number. Second, you will need all the software required to perform the conversions needed for the final result. In our example we will want to be able to OCR, data capture, compress, and archive.

Basically, the task is to create a funnel for your documents, with the end result saved where you want the final destination to be. If your scanner supports what is called dual-stream, then you can be working with two funnels simultaneously, making your work-flow all the more robust. The first part of the funnel is identifying the document type. Each of the 9 destinations on your scanner should be configured for one document type (you may want it to be one destination per business process instead). The configuration would include the scan settings, 300 DPI of course, and what folder the document will go in. This is just the staging folder for the next step. Let’s assume that we set up destination 1 for invoices and our scanner supports dual-stream. When it’s all said and done, we want one copy of each invoice saved in a search-able directory, where the file is both compressed and in PDF/A format. Then we want another copy of the same invoice to be data captured and put in a working directory for someone to review. Let’s put it all together.

Destination one on the scanner is configured for invoices. The first copy of any invoice will be saved to a hot-folder that the PDF conversion utility is watching; the second copy will be scanned into a hot-folder that the data capture product is watching. Because these are hot folders, both copies are picked up instantly and processed by each application. Our requirement for the second copy was only that it be data captured and exported to a working directory, so we have now completed its task. For the first copy we have more conversions to do. The PDF conversion utility saves the OCRed search-able PDF to a hot-folder for the compression utility, the compression utility compresses the PDF and saves it to a hot-folder for the archive utility, and FINALLY the archive utility saves the result in our final destination for all invoices. Below is a basic diagram of the work-flow we created for invoices (destination 1).

Scan > PDF Creation > Compression > Archive > Final Result
     > Data Capture > Final Result
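For those who think better in code, the same routing can be written down as plain data. This is a minimal Python sketch, not a feature of any scanner or capture product: the `WORKFLOWS` table and `describe_workflow` function are hypothetical names, and each step name stands for a hot-folder that the next application is watching.

```python
# One entry per scanner destination; each inner list is one stream of steps.
# Destination 1 (invoices) uses two streams because the scanner is dual-stream.
WORKFLOWS = {
    1: [
        ["PDF Creation", "Compression", "Archive", "Final Result"],
        ["Data Capture", "Final Result"],
    ],
}


def describe_workflow(destination: int) -> list[str]:
    """Return one readable path per stream for the given scanner destination."""
    streams = WORKFLOWS.get(destination)
    if streams is None:
        raise KeyError(f"destination {destination} is not configured")
    return [" > ".join(["Scan", *stream]) for stream in streams]


if __name__ == "__main__":
    for path in describe_workflow(1):
        print(path)
```

Adding another document type is then a matter of adding another numbered entry with its own list of steps.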

Although it may have been slightly difficult to read, hopefully it’s clear that the above is just one work-flow getting the most out of the tools offered by both the document scanner and conversion software packages. Now you can proceed to program each of the other destinations with different document types and their associated work-flows. Programmers and tech savvy individuals will be able to easily envision ways to add scripts to make the process even more robust, with email notifications, etc. This approach is not a replacement for advanced work-flows but a middle ground between no work-flow and very pricey work-flows.

File Format Buffet! – choosing your output format

Dec 09

Today, OCR and data capture solutions give you a broad selection of output formats for the result. Until the advent of layered file formats, your only choices were text formats such as Word .doc or plain text .txt. But now the formats themselves come with a ton of options, leaving people to decide first what export format to use and then what variation of that format.

It seems that for the most part OCR is exported in one of two primary formats: Word .doc or Portable Document Format .pdf. So we will use these as our staples.

Word is more or less a text only format. Scanning and converting a document to Word is useful when you want to make edits to the text, reformat, add graphics, and then re-create the document, or borrow its contents. Some of the options this format offers in relation to OCR and data capture are keeping formatting, keeping graphics, and encoding. It’s fairly easy to decide which of these options would be most useful to your process. The text formats from document conversion are usually limited to immediate consumption and not distribution, while the layered formats are for distribution and storage.

There are actually many layered file formats. There are even formats of JPEG and TIFF that permit a text layer. In the last few years, Microsoft released its own “layered” format called XPS, which has yet to catch on. PDF is still the winner in this area. PDF comes with a salad bar of options, and sometimes it’s hard to pick what is best. When used in conjunction with data capture and OCR, the most common variation of PDF is a PDF with search-able text under the page image. What this means is that the visible layer of the PDF is the scanned image, and underneath it, with matching coordinates, is the text from OCR or data capture. The purpose is that by searching the text you will find the contents of your search on the image. Because PDF is for the most part a locked down format, it’s important to decide what variation you want before even creating one. Other common settings are tagging, password protection, PDF/A for archiving, and bookmarks. When used with data capture and OCR, you will frequently see PDF/A for long term archiving of documents, and password protection. The tagging and bookmark settings usually require an additional manual step unless the data capture program supports filling in this metadata. If you keep the quality of the image layer for any layered format high enough, you can OCR it again if you make a mistake in your format.
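The “text under page image” idea boils down to each recognized word carrying the coordinates of where it sits on the scan. The Python sketch below models only that relationship; it does not read or write real PDF files, and the `Word` class, `find_on_image` function, and the sample coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognized word and the box it occupies on the page image."""
    text: str
    x: int
    y: int
    width: int
    height: int


def find_on_image(text_layer: list[Word], query: str) -> list[tuple[int, int, int, int]]:
    """Return the image rectangles (x, y, width, height) of every matching word."""
    needle = query.lower()
    return [
        (word.x, word.y, word.width, word.height)
        for word in text_layer
        if word.text.lower() == needle
    ]


if __name__ == "__main__":
    layer = [
        Word("Invoice", 120, 80, 210, 40),
        Word("Total", 120, 900, 130, 36),
    ]
    # A viewer would highlight this rectangle on top of the scanned image.
    print(find_on_image(layer, "total"))
```

The search runs against the hidden text, but what the reader sees highlighted is the matching region of the original scan.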

The upshot is that, though you have a lot of options, you should be able to very easily find the best practice or norm for your space. Many of the choices are used only in special scenarios, and if you are not privy to the scenario then you probably don’t need them.

Down and dirty paperless office

Jul 28

In my office, paper comes in, is reviewed for value, gets scanned, and is shredded or filed. I have set up a system that allows me to very efficiently scan documents to my “digital file cabinet”. Here is a quick guide on how I do it!

What you will need:

  1. An unused computer attached to your network

  2. Google Desktop Search with network browsing enabled

  3. A document scanner

  4. A server based automatic OCR product

  5. A file compression product ( optional but recommended )

Now to put it all together. My system is an inexpensive desktop computer with Windows XP installed. Once all the applications are installed you don’t even need a monitor attached to this computer. The computer is visible on the network and has one shared folder, the “File Cabinet” folder in my case. This computer is my standalone digital file cabinet. Attached to it is a document scanner with a 30 page feeder. I have the scanner configured to scan to an “input” directory on the machine.

The automatic OCR processing product is configured to pick up images as soon as they arrive in the input folder (the “hot folder”), OCR them using specific index level OCR settings, and create a PDF with a hidden search-able layer. The resulting PDF is put into another hot folder that the PDF compression tool is watching. As soon as a PDF arrives in this folder it is instantly compressed and the compressed PDF is moved to the “File Cabinet” folder.

Because Google Desktop Search is enabled to index all files in the “File Cabinet” folder, the PDFs very quickly become a part of the index. Configure Google Desktop Search to enable network searches so that any machine on the network can open a browser, go to a URL located on the digital file cabinet machine, and locate documents with a search.
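If you wonder why a single good keyword is enough to find a document again, a toy keyword index shows the principle. This Python sketch is only an illustration of indexing OCR text by word; it says nothing about how Google Desktop Search works internally, and `build_index`, `search`, and the sample file names are assumptions.

```python
import re
from collections import defaultdict


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map every word in the OCR text to the set of files that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the files that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


if __name__ == "__main__":
    cabinet = {
        "scan_0001.pdf": "Invoice from Acme Supply, total due 120.00",
        "scan_0002.pdf": "Mortgage statement for the month of June",
    }
    print(search(build_index(cabinet), "acme invoice"))
```

As long as the OCR gets at least one distinctive word right, the document can be pulled back out of the “File Cabinet” by searching for that word.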

Once it’s set up, it’s simply a matter of putting paper in the scanner and pressing the scan button, and you’re done. It’s that easy, and extremely useful!

Document Preparation

Jul 21

In some organizations, document preparation prior to scanning is the largest time cost in their document entry process. In all organizations, it’s an important consideration. Document preparation is the process of sorting, organizing, and preparing documents for the most successful document scan and the best chance at accuracy in downstream software processes. Document preparation can be as simple as dividing pages into stacks small enough for a document scanner to handle, or as complex as staple removal, envelope opening, and document separation using page separators.

As recognition technology advances, the need for document preparation diminishes. New technologies are allowing for automatic document separation based on templates or keywords, automatic document rotation, annotation, sorting, etc. The challenge for organizations becomes picking what document preparation step to use technology on versus manual labor. This has been a challenging question and as new technologies surface, it becomes even more challenging.

If an organization keeps its focus on return on investment, the path should become clear. A complete evaluation of the technologies will show the accuracy and percentage of automation that can be accomplished with technology, and the amount of time and cost it will save. The tricky part of the evaluation is really in understanding the environment. Doing a study of how document preparation is currently done, and of all document preparation required for document entry, should be fairly straightforward. Listing the document preparation steps that can be handled by software, and the products that handle them, is a little more complex and requires an organization to spend dedicated time on it. The process of separating documents and barcoding documents tends to be the biggest cost and the low hanging fruit to seek automation for. OCR software can determine where a document starts and ends using keywords, instead of a person manually placing separator pages or barcodes on the document.
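Keyword based separation can be sketched very simply. The Python below assumes the OCR text of each scanned page is already available as a string, and that a new document always begins on a page containing a chosen keyword; real products use richer rules such as templates, so treat `split_by_keyword` as a hypothetical illustration.

```python
def split_by_keyword(pages: list[str], start_keyword: str) -> list[list[int]]:
    """Group page numbers into documents, starting a new one at each keyword page.

    `pages` holds the OCR text of each page in scan order. The first page
    always opens a document, even if the keyword was not recognized on it.
    """
    keyword = start_keyword.lower()
    documents: list[list[int]] = []
    for number, text in enumerate(pages):
        if keyword in text.lower() or not documents:
            documents.append([number])
        else:
            documents[-1].append(number)
    return documents


if __name__ == "__main__":
    batch = [
        "INVOICE No. 1001 Acme Supply",
        "Terms and conditions continued",
        "Invoice No. 1002 Acme Supply",
        "Line items continued",
        "Remittance slip",
    ]
    print(split_by_keyword(batch, "invoice"))
```

Each inner list is one document, so no one has to insert a separator sheet between the two invoices before scanning.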

For most organizations the result is a combination of manual and automatic. The ultimate goal would be to automate every step in document preparation that can be automated and leave those that have to be manual such as placing documents in a scanner.

Why OCR is for everyone

Jul 07

You may come to this site looking for OCR software, PDF compression tools, or maybe it was a StumbleUpon. Maybe a friend said they used OCR and loved it, and you just had to Google it to find out what IT was. Unfortunately, tech industries have the habit of making great technology visible only to those who know the acronyms and have a good idea of the benefits it can provide. Everyone can benefit from Optical Character Recognition. So let’s break the barrier.

What is most important about the technology is not how it works, but the result it produces. Sometimes when people who are unfamiliar with scanners see the slew of document scanners I have, they ask “why do you have so many printers?” Barrier one: scanning. To OCR documents, they need to come via email or some digital transfer as images, or more likely they are paper that needs to be scanned. We all get mail; some mail is junk, some is useful. We all also have paper documents sitting around and in cabinets that we need to keep for a rainy day. At the same time, we use our computers more every year and are creating many files on them. So at the very least, wouldn’t it be nice to take the useful mail and other useful documents you have around (mortgage documents, nice letters, business cards, etc.) and get them in with all your other digital files? To do so you scan them, hopefully using a document scanner, as it’s more efficient than a flatbed. Consumers are very used to the idea of scanning photos; scanning documents is no different except for the fact that you have more of them. A document scanner, which is not a printer but looks like one, allows you to batch documents and scan them to a folder on your computer without doing it one-by-one, one side at a time, like a flatbed scanner. Now that you are scanning, you have an image representation of your files on your computer, right by all the other digital files you have. Now what? Now it’s time to get the data out and make them just as useful as all your other files.

Barrier number two: OCR. It’s an acronym that stands for Optical Character Recognition. This does not tell you much, so forget about it and use it only to reference the process. Simply put, it’s a helpful technology that gets text from images and converts it into a format you can use. OCR converts the image into usable text, so you can search for that nice letter, or you can edit that party invite and print it again. The result can be PDF, DOC, TEXT, pretty much any format you can imagine.

Now, coming full circle, that good mail and those useful documents you have are not sitting somewhere cluttering up desks and drawers; they are with all your other files on your computer, ready to use. OCR is useful to everyone. You just have to clear your mind of the techie talk and understand its value.
