Set it and forget it OCR

Sep 22

My office is a paper monster. Paper comes in and never leaves intact. The scary part is how fast this happens. With paper in hand, I review its contents and assess its value, scan it, and shred it, usually within minutes of its existence. The value of set it and forget it OCR is tremendous, but you have to be comfortable with it.

Set it and forget it OCR is where you take your OCR product and configure it to automatically process any images that appear in a certain folder. For my office, I scan to an “input” folder and all the resulting compressed and OCR’ed PDF files end up in the “File Cabinet” folder. My strategy will not work for the timid, because I’m relying solely on the power of OCR text and search to retrieve documents when I need them. Most people would rather configure their ADF scanner to have a setting or folder for each particular class of documents. Most document scanners these days have as few as 9 and as many as 99 destinations you can program. You can set each destination as its own input folder with its own OCR settings and its own output folder.
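To make the hot-folder idea concrete, here is a minimal sketch of one pass over an input folder. It is not how any particular OCR product works: `process_hot_folder` and the `ocr_to_pdf` callable are hypothetical names, and the actual OCR step is whatever your own product or library provides.

```python
from pathlib import Path
from typing import Callable

# Assumed list of scan formats; adjust to whatever your scanner produces.
SCAN_EXTENSIONS = (".tif", ".tiff", ".jpg", ".jpeg", ".png", ".pdf")


def process_hot_folder(
    input_dir: Path,
    output_dir: Path,
    ocr_to_pdf: Callable[[Path, Path], None],
) -> list[Path]:
    """Run one pass over the input folder and return the PDFs written.

    ocr_to_pdf is a hypothetical stand-in for your OCR product: it takes
    a source scan and a target path and writes a searchable PDF there.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for scan in sorted(input_dir.iterdir()):
        # Ignore sub-folders and anything that is not a scan.
        if not scan.is_file() or scan.suffix.lower() not in SCAN_EXTENSIONS:
            continue
        target = output_dir / (scan.stem + ".pdf")
        ocr_to_pdf(scan, target)
        scan.unlink()  # the input folder only ever holds unprocessed scans
        written.append(target)
    return written
```

Calling this in a loop with a short sleep between passes gives the "forget it" part. A scanner with several programmed destinations would simply get one input folder, one OCR callable, and one output folder per destination.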

I know I can do this because I know what settings it takes to get the quality of OCR I need to have at least one usable keyword on each document for search. And after all, I’m an expert in OCR, so to not use it every day would be crazy in its own right. I’ve yet to be proven wrong: my “File Cabinet” abyss has always given me the information I need at the time I asked for it, and sometimes even new information I did not realize I had.

Now for you records management folks shaking your heads, I understand your complaint. It should not be about my approach but about what I do with the final paper product. Items that are deemed a record by your taxonomy for legal or business reasons should be filed as such, perhaps scanned again as a record, and for heaven’s sake, if you are not supposed to, don’t destroy them!

The purpose of my madness is to touch paper as little as possible, and get information only when I need it. I am an extremist, but I assure you there is serious value, and a little fun in the set it and forget it OCR technique.

Chris Riley – About

Find much more about document technologies at

Not all Documents are Equal – OCRing Newspapers

Dec 23

There are several document types out there, both for full-page OCR and for data capture, that require special attention and configuration. For full-page OCR (extraction of all the text on a document), newspapers are one of these and pose some interesting challenges. When considering the OCR of these types of documents, you need to change how you think about the document itself.

When you open up a page of any newspaper, you likely consider the document as a whole while your brain picks apart the pieces. This is the key to OCRing newspapers. The biggest challenge facing companies wanting to convert newspapers to text using OCR is their layout. Although the font in newspapers is usually pretty small, it can often be scanned at a quality where the raw OCR read accuracy is very high. Newspapers have their own structure: they have page headings, section headings, article titles, article sub-titles, bylines, articles, and then footers. Not only that, but articles can span pages.

When converting a newspaper, the most effort should be spent on proper zoning. Because the document analysis tools built into OCR engines are tuned to the average document (which newspapers are not), they will accurately find columns and paragraphs, but the key is to find the titles and bylines and be able to separate articles. Most large service bureaus processing newspapers at high volumes have a manual zoning process and then a single OCR read, which produces very accurate results, all because the zoning was done properly. Others have devised a two-pass OCR system that essentially zones documents twice, narrowing the focus on each step and increasing zoning accuracy and thus OCR accuracy. This solves the read accuracy but not page continuations.

Page continuations are handled most often post-OCR with a business rule applied to the OCR result. Meta-data from the OCR results should indicate which page the text came from; thus, by finding the words “continues on” at the bottom of any given article, you can concatenate its continuation to it for final presentation. As a part of this rule, keep an article count and an article portion count; by the end you should have 0 portions and only articles. If you have low confidence in the merging of articles, you can simply merge the result and review the remaining portions, and your accuracy will then increase.
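The continuation rule can be sketched roughly as follows. This is only an illustration under some loud assumptions: the marker reads exactly “continues on page N” at the end of a portion, the continuation carries the same title as the piece it continues, and no two portions share both a page and a title. `Portion` and `merge_continuations` are hypothetical names, not part of any OCR product.

```python
import re
from dataclasses import dataclass

# Assumed marker format; real newspapers vary ("see page", "cont. on", ...).
CONTINUES = re.compile(r"continues on page (\d+)\s*$", re.IGNORECASE)


@dataclass
class Portion:
    page: int   # page number taken from the OCR meta-data
    title: str  # article title found during zoning
    text: str   # OCR text of this zone


def merge_continuations(portions):
    """Return (articles, leftovers): complete articles and unresolved portions."""
    def key(page, title):
        return (page, title.lower())

    index = {key(p.page, p.title): p for p in portions}
    # Every (page, title) pair that some portion points to is a continuation.
    targets = set()
    for p in portions:
        match = CONTINUES.search(p.text)
        if match:
            targets.add(key(int(match.group(1)), p.title))

    articles, leftovers, used = [], [], set()
    for start in portions:
        if key(start.page, start.title) in targets:
            continue  # reached through the portion that points at it
        pieces, current, complete = [], start, True
        while True:
            used.add(key(current.page, current.title))
            match = CONTINUES.search(current.text)
            if not match:
                pieces.append(current.text.strip())
                break
            pieces.append(current.text[:match.start()].strip())
            nxt = index.get(key(int(match.group(1)), current.title))
            if nxt is None or key(nxt.page, nxt.title) in used:
                complete = False  # dangling or circular pointer: needs review
                break
            current = nxt
        merged = Portion(start.page, start.title, " ".join(pieces))
        (articles if complete else leftovers).append(merged)

    # Anything never reached (for example a loop of pointers) also needs review.
    leftovers.extend(p for p in portions if key(p.page, p.title) not in used)
    return articles, leftovers
```

The two returned lists correspond to the article count and the portion count: a clean run ends with an empty `leftovers` list, and anything left in it is what a person would review by hand. A continuation whose title was misread by OCR would show up as its own short "article", which is one reason a review step is still worth having.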

OCRing newspapers has its challenges, not to mention the difficulty in scanning them, but it’s possible and can be very accurate if you are in the right state of mind and use the right approaches.

Rich Media OCR

Oct 15

I often speak of unique uses of OCR, and here is yet another. OCRing video files! But why? Part of the management of rich media assets is indexing these files. Technologies such as speech recognition and optical character recognition give a greater index and search value to rich media.

By using OCR technology to find and extract text from video frames, the data can be stored as meta-data. In the simplest scenario, this is a text file that accompanies the video file. More complex environments will even tell you the minute and second at which the text occurs. Because this is not a traditional use of the technology, some special consideration must take place.

First is converting and separating frames into individual image files. For the OCR to be effective, it needs to work on a series of images. Although a video is only a sequence of images shown at a high rate of speed, it’s still somewhat of a challenge to convert video files such as MPEG to a series of images. Not only that, but dealing with the motion blur that might occur in some frames will also be a problem.

The second challenge is dealing with frames that are repeats. Essentially, because there are so many similar images that are only slightly different from each other, the text on a series of frames might not change. Better OCR results will account for this and not repeat text as the frames would.
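Here is a small sketch of how repeated frame text might be collapsed while keeping the minute and second of first appearance. It assumes the frames have already been OCR’ed into one string per frame, in order, and that the frame rate is known. The function name and the 0.9 similarity threshold are illustrative choices of mine, not values from any product.

```python
from difflib import SequenceMatcher


def collapse_repeated_text(frame_texts, fps, threshold=0.9):
    """Return (mm:ss, text) pairs, one per distinct run of on-screen text.

    frame_texts: OCR result for each frame, in playback order.
    fps: frames per second of the source video.
    threshold: similarity ratio above which a frame counts as a repeat
               of the last recorded text (an assumed value).
    """
    entries = []
    previous = ""
    for index, raw in enumerate(frame_texts):
        text = " ".join(raw.split())  # normalise whitespace
        if not text:
            previous = ""  # text left the screen; record it again if it returns
            continue
        if SequenceMatcher(None, previous, text).ratio() >= threshold:
            continue  # same text as before, possibly with a small OCR slip
        seconds = int(index / fps)
        entries.append((f"{seconds // 60:02d}:{seconds % 60:02d}", text))
        previous = text
    return entries
```

Comparing by similarity rather than exact equality is a deliberate choice: the same caption can OCR slightly differently from frame to frame, and an exact comparison would record every one of those slips as new text.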

And finally, there is dealing with the variation in fonts and their often small sizes. This requires an OCR engine with specific settings for specialized OCR, and one that is very accurate on complex, low-quality documents.

I expect that in the future, this technique in conjunction with speech recognition will be used in eDiscovery, content management, and robust search of rich media files.

Don’t overthink business card scanning

Nov 19

I am amused at the detail of thought and complexity people put into business card scanning, or BCR. In my time in the enterprise content management industry, I’ve scanned hundreds of thousands of business cards. Yes, I know that is a lot, and no, I’m not that popular. The vast majority of the scanning was for testing business card scanners and the OCR/BCR technology that reads them. But the cards I do want to keep are scanned and stored for later retrieval. I DO NOT use a specific card scanner, nor do I use a special card scanning application. I truly keep it simple, and in my experience this is the right way to do it.

Card scanners are inexpensive and useful when all you are scanning is business cards, but why not scan all your documents? ADF-fed document scanners today support feed trays that dial down to the size of business cards. They are also able to scan stacks of business cards, front and back, without a problem. So all I need is one scanner for all my documents. After extensive testing, I found the image quality difference is negligible, and the top 3 are leading desktop document scanners, whose image quality is actually higher.

Now that you have the image, how do you get the data? Most people want to use a dedicated BCR technology to extract each data element from the card. I too am very amused by this and have had fun setting up systems to do it. But as a practical matter, it does not make much sense to me. BCR that extracts separate fields such as name, email, etc. can be very accurate. But when it’s not accurate, it’s a problem. It takes a lot of time to correct a problem when it occurs, but more often than not you don’t even know the problem occurred, so you get, for example, part of an address in the phone number field and miss the phone number completely. You will only know this is the case when it’s time to follow up with people. My second practical complaint is that you are adding one more program to operate and use regularly. We all have our favorite email client or CRM where we keep all our contacts. Most BCR applications promote their ability to export to the most popular email clients, but as soon as there is an update to your database application, you also have to buy a new version or you might be stuck. Some BCR applications do not even allow the export of data, so you have to manually copy and paste anyway. Is this really necessary?

OK, it’s only fair now for me to tell you how I do it. I scan my business cards with my ADF-fed document scanner to a hot-folder. This hot-folder is watched by a full-page OCR system, and the cards are automatically converted to searchable PDFs. I am not getting field data and I’m not saving into a separate application. I’m making searchable PDFs just like all my other documents. This has worked very well for me for the past 4 years. When it comes time for me to find someone, all I really want is to see the card and the info. Most likely, if it’s really important, I’ve already emailed them and captured their information that way. With my system I can search for people, websites, company names, even topics to find the cards of the people I want. I don’t have to worry about searching in a special UI field by field. Nor do I have to worry about missing data, as full-text search is not at the mercy of field extraction; it gets everything that is readable. In the areas of business where card scanning is used for reading medical insurance cards and driver’s licenses, the technology is very useful and necessary. I’m speaking only of personal business card scanning.
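As a rough illustration of why full text is enough for retrieval, here is a minimal search sketch. It assumes the OCR text of each card has already been pulled out of the searchable PDFs into a simple mapping of file name to text; in practice a desktop search tool does this part for you. `search_cards` and the sample file names are hypothetical.

```python
def search_cards(card_texts: dict[str, str], query: str) -> list[str]:
    """Return the card files whose full OCR text contains every query term.

    card_texts maps a PDF file name to all the text OCR found on that card.
    There are no fields: a name, a company, a website, or a topic are all
    just words somewhere in the text.
    """
    terms = query.lower().split()
    hits = []
    for name, text in card_texts.items():
        haystack = text.lower()
        if all(term in haystack for term in terms):
            hits.append(name)
    return sorted(hits)


cards = {
    "card_001.pdf": "Jane Doe  Director of Imaging  Example Capture Inc  www.example.com",
    "card_002.pdf": "John Roe  Records Manager  Sample Archives LLC",
}
matches = search_cards(cards, "imaging example")
```

Because nothing is assigned to a field, a phone number can never land in the address field; the worst case is a misread word that one particular search term does not match.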

In my experience, most users of card scanners and BCR applications use them very actively for a month or two, and soon the use dies to nothing and they revert to manual entry. I’m not doing manual entry; I’m using the latest technology in one unified process with full-text search. I’m just keeping it simple and practical.
