Set it and forget it OCR

Sep 22

My office is a paper monster. Paper comes in and never leaves intact. The scary part is how fast this happens: paper in hand, I review its contents and assess its value, scan it, and shred it, usually within minutes of its arrival. The value of set it and forget it OCR is tremendous, but you have to be comfortable relying on it.

Set it and forget it OCR is where you take your OCR product and configure it to automatically process any images that appear in a certain folder. For my office, I scan to an “input” folder and all the resulting compressed and OCR’ed PDF files end up in the “File Cabinet” folder. My strategy will not work for the timid because I’m relying solely on the power of OCR text and search to retrieve documents when I need them. Most people would rather configure their ADF scanner to have a setting or folder for each particular class of documents. Most document scanners today offer as few as 9 and as many as 99 programmable destinations. You can set each destination as its own input folder with its own OCR settings and its own output folder.
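To make the idea of a scanner destination concrete, here is a minimal Python sketch, not any particular product's configuration. The `Destination` class, the `sweep` function, and the `ocr` callable are hypothetical names; the real OCR engine is whatever product you run, so it is passed in as a function.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Destination:
    """One scanner destination: its input folder, output folder, and OCR settings."""
    input_dir: Path
    output_dir: Path
    ocr_settings: dict


def sweep(dest: Destination, ocr: Callable[[Path, dict], bytes]) -> list[Path]:
    """Process every file currently waiting in one destination's input folder.

    `ocr` is a stand-in for the real OCR product (an assumption): it takes an
    image path plus settings and returns the bytes of a searchable PDF.
    """
    dest.output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image in sorted(p for p in dest.input_dir.iterdir() if p.is_file()):
        pdf_bytes = ocr(image, dest.ocr_settings)
        target = dest.output_dir / (image.stem + ".pdf")
        target.write_bytes(pdf_bytes)
        image.unlink()  # the input folder only ever holds unprocessed work
        written.append(target)
    return written
```

With a single destination this is the one-folder approach described above; the per-class approach is the same function called once for each configured `Destination`.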

I know I can do this because I know what settings it takes to get OCR good enough to yield at least one usable keyword per document for search. And after all, I’m an expert in OCR, so not using it every day would be crazy in its own right. I’ve yet to be proven wrong: my “File Cabinet” abyss has always given me the information I need at the time I asked for it, and sometimes even new information I did not realize I had.

Now for you records management folks shaking your heads, I understand your complaint. It should not be about my approach but about what I do with the final paper product. Items that your taxonomy deems records for legal or business reasons should be filed as such, perhaps scanned again as a record, and for heaven’s sake, if you are not supposed to destroy it, don’t!

The purpose of my madness is to touch paper as little as possible, and get information only when I need it. I am an extremist, but I assure you there is serious value, and a little fun in the set it and forget it OCR technique.

Chris Riley – About

Find much more about document technologies at

Space Age Optical Character Recognition

Aug 24

There are a lot of technologists out there who believe that optical character recognition has its days numbered and is an aged technology. The belief is that soon paper will go away. This post is for those who believe OCR technology is going away.

The reality is that paper consumption has not really decreased. In some areas paper has been replaced with electronic data interchange (EDI), but in other areas it has actually increased. Studies have also shown that because documents are being scanned more often, there is also an increase in printing when the documents need to be shared or repurposed. But I’m not here to argue that paper is not going away and that document conversion technologies are required to convert it. I’m here to point out a few futuristic uses of the technology that technologists already like to talk about and that involve OCR.

Data Security

The first futuristic use of the technology that I would like to discuss is the use of OCR in data security. Text strings sent over the Internet are far easier to sniff and unlock than a compressed JPEG image. What if you were to convert the text into a JPEG during transmission, and the person on the receiving end would OCR it to get the data? By doing so, the data has been masked in a more efficient and secretive way. For added security, proprietary image formats could be devised.

File Compression

Storing ASCII text takes up far less space than an image or video file. As part of the future of compression technologies, expect that OCR will be used to extract the text from an image and save it as an ASCII file. Viewers will convert the text back to an image during viewing. This removes the image portion of the text and significantly reduces file size.


How else do you expect future robots to read text? OCR, of course. The eyes of the robot are essentially a camera that rapidly takes pictures. When the robot is faced with the comprehension of text, the image will be converted using OCR and fed through an engine to gain meaning from the text and act on it.

So there you have it, three really cool and cutting edge ways OCR is and will be used in the future. Paper is not going away, but even if it were, just look at the other cool uses of OCR technology.

There is OCR and then there is Formatting

Jul 26

What is the greatest difference between the most accurate Optical Character Recognition (OCR) products and the least? It might not be what you think. The greatest improvements in OCR in the last 10 years have not been so much in character-level recognition; they have been in how the engines understand the structure of documents. This is called document analysis. Theoretically, if you were to compare two engines that had identical character recognition, but engine A had document analysis and engine B did not, engine A would win.

Document analysis is first how the engine breaks apart components of a document such as paragraphs, lines, columns, graphics, etc. Without this, the engine is OCRing blind, and its assumption is that every object it encounters is text. This sometimes leads to clumping of lines, or OCR of graphics. The second aspect of document analysis is the delivery of formatting in the export that matches the formatting in the document. This can also include font style and color.

With traditional documents you can expect that products with document analysis will get the formatting spot on. This is very important, not only for editing and repurposing, but also for keeping the readability of a document. Another aspect of document analysis is determining reading order. For example, if you have a multi-column, multi-paragraph page, the software has to decide in what order the paragraphs are read. This is useful during recognition, but also when a formatted document is converted to a flatter file structure such as a TXT file, where the order stands a chance of being confused.
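As a rough illustration of the reading-order decision, here is a minimal Python sketch. It is not how any commercial engine works; it assumes the layout step has already produced text blocks with pixel coordinates, and the `Block` class and the `column_gap` threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block found by layout analysis (coordinates in pixels)."""
    left: int
    top: int
    text: str


def reading_order(blocks: list[Block], column_gap: int = 50) -> list[Block]:
    """Group blocks into columns by left edge, then read each column top to bottom.

    Blocks whose left edges are within `column_gap` pixels of a column's
    leftmost block are treated as the same column (an assumed heuristic).
    """
    columns: list[list[Block]] = []
    for block in sorted(blocks, key=lambda b: b.left):
        if columns and block.left - columns[-1][0].left <= column_gap:
            columns[-1].append(block)
        else:
            columns.append([block])
    ordered: list[Block] = []
    for column in columns:  # columns are already in left-to-right order
        ordered.extend(sorted(column, key=lambda b: b.top))
    return ordered
```

Without this step, a flat TXT export that simply read blocks top to bottom across the page would interleave the two columns.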

The reality is that for clean documents character-level recognition is not getting any better; it is already amazingly accurate today. The opportunity to improve is in document analysis and language morphology, but that is another post.

Outsourcing document recognition

Apr 28

It’s common for organizations to outsource their scanning and document conversion. Organizations sometimes find that the skill required, the convenience factor, and the reduced liability are worth the additional cost. Other organizations with one-time backlog conversions save money by using an outsourcing company instead of bringing the software in-house. In recent years, service bureaus and business process outsourcing companies have dramatically improved their use of recognition technology, and prices have dropped substantially. But as an organization that chooses to outsource, you are giving up the responsibility of picking document conversion technology. Shouldn’t you want to know what technology your service bureau is using?

YOU SHOULD! Absolutely you should be concerned about the OCR and Data Capture technology that your outsourcing company is using. It’s just as important as if you were bringing the technology in-house. It’s your job to make sure your vendor is using not only the best technology but also using it in the best way. The level of expertise differs between outsourcing companies, and each often specializes in one document type or one type of processing. Proper evaluation of a service bureau will include reviews of sample results. You should have your prospective service bureau or BPO run a good number of your production documents and provide you with results. Make sure the technology they used to produce the results is the same that will be used in production. Don’t be afraid to ask the vendor what engine or engines are being used and even what version. Make sure you understand how your vendor handles exceptions.

While it’s easy to overlook these items when you are looking at a service instead of a technology, it’s still important that you are educated. Service bureaus make money based on how much they save. This can occasionally create motives to use poor technology to gain greater margins. Some outsourcing companies put customers into categories by volume, and those with the greater volume get the best technology. Most of the outsourcing companies out there are very good at ensuring their document quality, and many will even go as far as to give you a guarantee on quality. But the nature of production environments is such that you cannot check everything all the time. It’s about the relationship. Sometimes paying a higher price per page for a better solution is worth it!

OCR and Paste

Apr 13

You probably use the copy and paste functionality on your computer daily. I too use copy and paste on a regular basis, but I also use OCR and paste nearly as much. OCR and paste is what I call the process of selecting a region on your computer screen and using OCR to read that region as a screenshot and convert it to text. Even to my surprise, it has become quite the habit and one of my favorite ways to move data from one location on my computer to another. Many wonder why this might be the case, as most information on the screen is available as text anyway. The reasons are that it’s more efficient than copying and pasting into a program, that it maintains the structure of the information using document analysis, and that there are times when the information I want is not in text form but in an image only.

I have actually taken it one step further and used the technology to automate the extraction of data from web pages that are scroll heavy. Instead of scrolling forever for information on a web page, I can use the tool to take a screenshot of the entire web page and convert it to text for me. You can imagine how the technology could be used maliciously, but in this case, it’s just to get information.

The ability of OCR to read screenshots is quite impressive. Though screenshots usually come out at a low DPI resolution, which is traditionally not optimal for OCR, the text, including text inside images, is what is called pixel perfect, so it is an excellent candidate for conversion. Also, by leveraging the document analysis technologies built into OCR, I can grab a table and have it exported as a table, rather than copying and pasting text and manipulating it back to its original form later.

When you become an expert in OCR, you find yourself using the technology in the oddest places, but this is one case where my productivity has increased because of the tool, and I think it’s worth sharing. I suspect that OCR of screenshots is only going to increase in the future. Because of this, and because of its malicious uses, so will counter-malware technologies. It is also a very easy way to move data from a locked-down legacy system to a new one.

eDiscovery and OCR

Mar 23

I have touched on this topic a little in one of my previous posts, but because of eDiscovery’s popularity I thought it was fitting to look at OCR’s role in eDiscovery preparedness. Organizations that are not ready for audits and court orders to deliver documents are spending tremendous amounts of money to undo bad document processes. Because of this, preparing yourself for possible future legal events is critical and a long-term cost saver.

The purpose of OCR technology in eDiscovery readiness is based on the principle of having as much data at your fingertips as possible. Being ready depends heavily on records management policies and a good taxonomy that is strictly followed. Because of this, OCR is sometimes overlooked as a tool. With those practices in place, it should be possible to pull up any document at any time. However, OCR should be viewed as an insurance policy, because OCRing every document you have gives you even more information than you would have otherwise, and information is the key to success in these situations.

eDiscovery also includes other types of data, email being one of the most popular. But what about the data contained in email attachments that are PDF, TIFF, or JPEG files? OCR is the only tool to extract the data from the images in these formats. Surprisingly, products that provide eDiscovery tools just for email still do not heavily deploy OCR technology, but the information contained in these attachments is often as valuable as the emails themselves.
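As a small illustration of the first step, finding those attachments, here is a Python sketch using the standard library `email` package. The function name `attachments_needing_ocr` and the list of content types are assumptions for this example; note that a PDF may already carry a text layer, so this only flags candidates for OCR.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser

# Attachment types that may hold text only as pixels (assumed list for this sketch).
IMAGE_ONLY_TYPES = {"application/pdf", "image/tiff", "image/jpeg"}


def attachments_needing_ocr(raw_message: bytes) -> list[str]:
    """Return the filenames of attachments that are candidates for OCR."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_message)
    names = []
    for part in msg.iter_attachments():
        if part.get_content_type() in IMAGE_ONLY_TYPES:
            names.append(part.get_filename() or "(unnamed)")
    return names


# Usage: build a small message in memory and see which attachments are flagged.
example = EmailMessage()
example["Subject"] = "Signed contract"
example.set_content("See attached.")
example.add_attachment(b"%PDF-1.4 fake", maintype="application",
                       subtype="pdf", filename="contract.pdf")
example.add_attachment("plain notes", filename="notes.txt")
print(attachments_needing_ocr(example.as_bytes()))
```

Each flagged attachment would then be handed to the OCR engine and its text indexed alongside the email body.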

In addition to all the traditional proper records management practices and eDiscovery tools, OCR should be considered a must-have for organizations preparing themselves for audits or court orders. It helps them know what to deliver and, sometimes even more importantly, what to omit.

Let the OCR do the talking for you

Feb 08

I’ve covered various interesting and non-conventional uses of OCR. I would like to talk about a new one, OCR to speech. The blind community is familiar with this technology, and it assists them in their everyday lives. The key to OCR to speech is simplicity. When the concept was first developed, it required a very elaborate combination of software and hardware. Now it’s possible to take the latest and greatest OCR technology and make it talk for you with a simple configuration.

It requires a document scanner with an easy physical button interface, programmed to scan an image at 300 DPI to a folder on a machine. Traditional documents work very well for OCR to speech, whereas documents that have a lot of graphics and non-traditional formats may be more challenging. It’s important that the technology is able to omit garbage. To do this, the OCR process should be driven by a dictionary. The words recognized must be in this dictionary or they will not show up in the final results. The reason for this is that a lot of time can be wasted if bad recognition results are spoken.
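The dictionary rule can be sketched in a few lines of Python. This is a simplified post-processing filter, assuming plain text output from the OCR step; real engines apply the dictionary during recognition, and the function name `keep_dictionary_words` is hypothetical.

```python
import string


def keep_dictionary_words(ocr_text: str, dictionary: set[str]) -> str:
    """Drop every token that is not a dictionary word, so garbage is never spoken.

    `dictionary` is assumed to hold lowercase words.
    """
    kept = []
    for token in ocr_text.split():
        word = token.strip(string.punctuation)  # "week." -> "week", "~~||" -> ""
        if word and word.lower() in dictionary:
            kept.append(word)
    return " ".join(kept)


words = {"the", "quick", "brown", "fox", "jumps", "over"}
print(keep_dictionary_words("The qu1ck brown f0x ~~|| jumps over.", words))
```

The trade-off is that misread real words are silently dropped along with the noise, which for speech is usually preferable to hearing them read aloud.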

Once the OCR engine has done its job of accurately and automatically converting an image to text, the ASCII text results will be saved into a directory. Now it’s time to automatically turn the text into speech. There are many text-to-speech applications out there, some free, some paid. The goal is to find one that also reads results from a directory and automatically speaks the text over the computer speakers.

It can be that easy! Some users of such technologies spend more time trying to find an acceptable digital voice than actually configuring the solution. I assure you the packages exist and, when configured correctly, are very accurate. One scanner, one hot-folder-driven OCR application, and one hot-folder-driven text-to-speech application will give you a robust OCR to speech solution that can be set up in minutes.

Not all Documents are Equal – OCRing Newspapers

Dec 23

There are several document types out there, both for full-page OCR and for data capture, that require special attention and configuration. For full-page OCR (extraction of all the text on a document), newspapers are one of these and pose some interesting challenges. When considering the OCR of these types of documents, you need to change how you think about the document itself.

When you open up a page of any newspaper, you are likely considering the document as a whole while your brain is picking apart the pieces. This is the key to OCRing newspapers. The biggest challenge facing companies wanting to convert newspapers to text using OCR is their layout. Although the font in newspapers is usually pretty small, it can often be scanned at a quality where raw OCR accuracy is very high. Newspapers have their own structure: they have page headings, section headings, article titles, article subtitles, bylines, articles, and then footers. Not only that, but articles can span pages.

When converting a newspaper, the most effort should be spent on proper zoning. Because document analysis tools built into OCR engines are tuned to the average document (newspapers are not average), they will accurately find columns and paragraphs, but the key is to find the titles and bylines and to be able to separate articles. Most large service bureaus processing newspapers at high volumes have a manual zoning process and then a single OCR read, which produces very accurate results because the zoning was done properly. Others have devised a two-pass OCR system that essentially zones documents twice, narrowing the focus on each pass and increasing zoning accuracy and thus OCR accuracy. This solves read accuracy but not page continuations.

Page continuations are most often handled after OCR with a business rule applied to the OCR result. Metadata from the OCR results should indicate which page the text came from, so by finding the words “continues on” at the bottom of any given article you can concatenate its continuation to it for final presentation. As part of this rule, keep an article count and an article portion count; by the end you should have 0 portions and only articles. If you have low confidence in the merging of articles, you can simply merge what you can, review the remaining portions, and your accuracy will then increase.
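Here is one way the continuation rule could be sketched in Python. It is a simplification under stated assumptions: each zoned portion carries its page number and a title, the marker reads “continues on page N”, and a continuation is matched to its start by title. The `Portion` class, the regular expression, and the title matching are all hypothetical choices for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Portion:
    """One zoned piece of an article, with page metadata from the OCR result."""
    title: str
    page: int
    text: str


# Marker assumed to appear at the very end of a portion that jumps pages.
CONTINUES = re.compile(r"\s*continues on page (\d+)\s*$", re.IGNORECASE)


def merge_continuations(portions: list[Portion]) -> tuple[list[dict], list[str]]:
    """Join portions into articles; also return titles still missing a continuation."""
    awaiting = {}   # (title, page) -> article that expects more text on that page
    articles = []
    for portion in sorted(portions, key=lambda p: p.page):
        article = awaiting.pop((portion.title, portion.page), None)
        if article is None:  # nothing pointed here, so this starts a new article
            article = {"title": portion.title, "text": ""}
            articles.append(article)
        text = portion.text
        match = CONTINUES.search(text)
        if match:
            text = text[:match.start()]  # drop the marker from the final text
            awaiting[(portion.title, int(match.group(1)))] = article
        article["text"] = (article["text"] + " " + text).strip()
    unresolved = [a["title"] for a in awaiting.values()]
    return articles, unresolved
```

An empty `unresolved` list corresponds to ending with 0 portions; anything left in it is what goes to manual review.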

OCRing newspapers has its challenges, not to mention the difficulty of scanning them, but it is possible and can be very accurate if you are in the right state of mind and use the right approaches.

Down and dirty paperless office

Jul 28

In my office, paper comes in, is reviewed for value, gets scanned, and is then shredded or filed. I have set up a system that allows me to very efficiently scan documents to my “digital file cabinet”. Here is a quick guide on how I do it!

What you will need:

  1. An unused computer attached to your network

  2. Google Desktop Search with network browsing enabled

  3. A document scanner

  4. A server based automatic OCR product

  5. A file compression product ( optional but recommended )

Now to put it all together. My system is an inexpensive desktop computer with Windows XP installed. Once all the applications are installed, you don’t even need a monitor attached to this computer. The computer is visible on the network and has one shared folder, the “File Cabinet” folder in my case. This computer is my standalone digital file cabinet. Attached to it is a document scanner with a 30-page feeder. I have the scanner configured to scan to an “input” directory on the machine.

The automatic OCR processing product is configured to pick up images as soon as they arrive in the input “hot folder”, OCR them using specific index-level OCR settings, and create a PDF with a hidden searchable text layer. The resulting PDF is put into another hot folder that the PDF compression tool is watching. As soon as a PDF arrives in this folder it is compressed, and the compressed PDF is moved to the “File Cabinet” folder.
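The chain of hot folders can be pictured with a short Python sketch. In the setup described here the commercial OCR and compression products do this work themselves; the sketch only shows the hand-off between folders. The functions `run_stage` and `run_pipeline`, the intermediate folder name `ocr_out`, and the two transform callables are hypothetical stand-ins.

```python
from pathlib import Path
from typing import Callable

Transform = Callable[[bytes], bytes]


def run_stage(src: Path, dst: Path, transform: Transform, suffix: str) -> int:
    """Move every file waiting in `src` through `transform` and into `dst`."""
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for item in sorted(p for p in src.iterdir() if p.is_file()):
        (dst / (item.stem + suffix)).write_bytes(transform(item.read_bytes()))
        item.unlink()  # each hot folder is emptied as its files are consumed
        count += 1
    return count


def run_pipeline(root: Path, ocr_to_pdf: Transform, compress_pdf: Transform) -> int:
    """One pass over both hot folders: input -> ocr_out -> File Cabinet."""
    scanned = run_stage(root / "input", root / "ocr_out", ocr_to_pdf, ".pdf")
    run_stage(root / "ocr_out", root / "File Cabinet", compress_pdf, ".pdf")
    return scanned
```

Calling `run_pipeline` on a timer gives the "as soon as they arrive" behavior; because each stage only watches its own folder, either tool can be swapped out without touching the other.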

Because Google Desktop Search is enabled to index all files in the “File Cabinet” folder, the PDFs very quickly become a part of the index. Configure Google Desktop Search to enable network searches so that any machine on the network can open a browser, go to a URL hosted on the digital file cabinet machine, and locate documents with a search.

Once it’s set up, it’s simply a matter of putting paper in the scanner and pressing the scan button, and you’re done. It’s that easy, and extremely useful!

Why OCR is for everyone

Jul 07

You may have come to this site looking for OCR software or PDF compression tools, or maybe it was a StumbleUpon. Maybe a friend said they used OCR and loved it, and you just had to Google it to find out what it was. Unfortunately, tech industries have the habit of making great technology visible only to those who know the acronyms and have a good idea of the benefits it can provide. Everyone can benefit from Optical Character Recognition. So let’s break the barrier.

What is most important about the technology is not how it works, but the result it produces. Sometimes when people who are unfamiliar with scanners see the slew of document scanners I have, they ask, “Why do you have so many printers?” Barrier one: scanning. To be OCRed, documents need to arrive via email or some other digital transfer as images, or more likely they are paper that needs to be scanned. We all get mail; some mail is junk, some is useful. We all also have paper documents sitting around and in cabinets that we need to keep for a rainy day. At the same time, we use our computers more every year and create many files on them. So at the very least, wouldn’t it be nice to take the useful mail and other useful documents you have around (mortgage documents, nice letters, business cards, etc.) and get them in with all your other digital files?

To do so you scan them, hopefully using a document scanner, as it’s more efficient than a flatbed. Consumers are very used to the idea of scanning photos; scanning documents is no different except for the fact that you have more of them. A document scanner, which is not a printer but looks like one, allows you to batch documents and scan them to a folder on your computer without doing it one by one, one side at a time, as with a flatbed scanner. Now that you are scanning, you have an image representation of your files on your computer, right next to all the other digital files you have. Now what? Now it’s time to get the data out and make them just as useful as all your other files.

Barrier number two: OCR. It’s an acronym that stands for Optical Character Recognition. This does not tell you much, so forget about it and use it only to refer to the process. Simply put, it’s a helpful technology that gets text from images and converts it into a format you can use. OCR converts the image into usable text, so you can search for that nice letter, or you can edit that party invite and print it again. The result can be PDF, DOC, TEXT, or pretty much any format you can imagine.

Now, coming full circle, the good mail and useful documents you have are not sitting somewhere cluttering up desks and drawers; they are with all your other files on your computer, ready to use. OCR is useful to everyone. You just have to clear your mind of the techie talk and understand its value.