There is OCR and then there is Formatting

Jul 26

What is the greatest difference between the most accurate Optical Character Recognition (OCR) products and the least accurate? It might not be what you think. The greatest improvements in OCR in the last 10 years have not been so much in character-level recognition; they have been in how the engines understand the structure of documents. This is called document analysis. Theoretically, if you were to compare two engines that had identical character recognition, but engine A had document analysis and engine B did not, engine A would win.

Document analysis is, first, how the engine breaks a document apart into components such as paragraphs, lines, columns, and graphics. Without this, the engine is OCRing blind, and its assumption is that every object it encounters is text. This sometimes leads to clumping of lines, or to OCR of graphics. The second aspect of document analysis is delivering formatting in the export that matches the formatting in the document. This can also include font style and color.

With traditional documents you can expect that products with document analysis will get the formatting spot on. This is very important, not only for editing and re-purposing, but also for preserving the readability of a document. Another aspect of document analysis is determining reading order. For example, if you have a multi-column, multi-paragraph page, the software has to decide in what order the paragraphs are read. This is useful during recognition, and also when a formatted document is converted to a flatter file structure such as a TXT file, where the order stands a chance of being confused.
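
To make the reading-order problem concrete, here is a minimal Python sketch. It is not how any particular OCR engine does it: the `Block` type, the `reading_order` function, and the `min_gap` threshold are all hypothetical names invented for illustration, and the approach assumes simple, non-overlapping columns (a full-width headline would defeat it).

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A recognized text block with its bounding box in pixels (hypothetical type)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def reading_order(blocks, min_gap=10):
    """Order blocks column by column (left to right), top to bottom within each column.

    Assumption: a new column starts when a block's left edge sits at least
    `min_gap` pixels to the right of everything already in the current column.
    """
    columns = []
    for block in sorted(blocks, key=lambda b: b.left):
        if columns:
            column_right = max(b.right for b in columns[-1])
            if block.left - column_right < min_gap:
                columns[-1].append(block)
                continue
        columns.append([block])

    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.top))
    return ordered


page = [
    Block("B1", 120, 0, 220, 50),
    Block("A2", 0, 60, 100, 120),
    Block("B2", 120, 60, 220, 120),
    Block("A1", 0, 0, 100, 50),
]
print([b.text for b in reading_order(page)])  # ['A1', 'A2', 'B1', 'B2']
```

A naive sort by vertical position alone would read A1, B1, A2, B2, jumping across the gutter on every paragraph. That is exactly the confusion that shows up in a flat TXT export when reading order is not determined first.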

The reality is that for clean documents character-level recognition is not getting any better; it’s already amazingly accurate today. The opportunity to improve is in document analysis and language morphology, but that is another post.

Chris Riley – About

Find much more about document technologies at

Not all Documents are Equal – OCRing Newspapers

Dec 23

There are several document types out there, both for full-page OCR and for data capture, that require special attention and configuration. For full-page OCR (extraction of all the text on a document), newspapers are one of these, and they pose some interesting challenges. When considering the OCR of these types of documents you need to change how you think about the document itself.

When you open up a page of any newspaper you likely consider the document as a whole, while your brain is picking apart the pieces. This is the key to OCRing newspapers. The biggest challenge facing companies wanting to convert newspapers to text using OCR is their layout. Although the font in newspapers is usually pretty small, it can often be scanned at a quality where the raw OCR read accuracy is very high. Newspapers have their own structure: they have page headings, section headings, article titles, article sub-titles, bylines, articles, and then footers. Not only that, but articles can span pages.

When converting a newspaper, the most effort should be spent on proper zoning. Because the document analysis tools built into OCR engines are tuned to the average document (which newspapers are not), they will accurately find columns and paragraphs, but the key is to find the titles and bylines and to be able to separate articles. Most large service bureaus processing newspapers at high volumes have a manual zoning process followed by a single OCR read, which produces very accurate results because the zoning was done properly. Others have devised a two-pass OCR system that essentially zones documents twice, narrowing the focus on each pass and increasing zoning accuracy and thus OCR accuracy. This solves read accuracy but not page continuations.

Page continuations are most often handled after OCR, with a business rule applied to the OCR result. Metadata from the OCR results should indicate which page the text came from; thus, by finding the words “continues on” at the bottom of any given article, you can concatenate its continuation to it for final presentation. A part of this rule is an article count and an article portion count: by the end you should have 0 portions and only articles. If you have low confidence in the merging of articles, you can simply merge what you can and review the remaining portions, and your accuracy will then increase.
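
As a rough illustration of such a business rule, the Python sketch below merges article portions after OCR. Everything in it is an assumption made for the example: the `Portion` type, the `merge_continuations` function, the exact marker wording (“continues on page N” at the end of a portion, “continued from page N” at the start of its continuation), and the idea that a continuation repeats the article title. Real newspapers vary, so the patterns would need tuning per publication.

```python
import re
from dataclasses import dataclass

# Assumed marker wording; adjust the patterns to the publication being processed.
CONTINUES_ON = re.compile(r"\s*continues on page (\d+)\.?\s*$", re.IGNORECASE)
CONTINUED_FROM = re.compile(r"^\s*continued from page (\d+)\.?\s*", re.IGNORECASE)


@dataclass
class Portion:
    """One zoned piece of an article (hypothetical type)."""
    page: int   # page number taken from the OCR metadata
    title: str
    text: str


def merge_continuations(portions):
    """Join each article to its continuations.

    Returns (articles, leftover_portions). Leftovers are continuations that
    could not be matched and should go to manual review.
    """
    pending = {}  # (page it sits on, page it came from, title) -> portion
    starts = []
    for portion in portions:
        match = CONTINUED_FROM.match(portion.text)
        if match:
            key = (portion.page, int(match.group(1)), portion.title.lower())
            pending[key] = portion
        else:
            starts.append(portion)

    articles = []
    for start in starts:
        page, text = start.page, start.text
        while True:
            match = CONTINUES_ON.search(text)
            if not match:
                break
            target = int(match.group(1))
            nxt = pending.pop((target, page, start.title.lower()), None)
            if nxt is None:
                break  # continuation not found; keep the marker for review
            text = text[:match.start()] + " " + CONTINUED_FROM.sub("", nxt.text)
            page = target
        articles.append(Portion(start.page, start.title, text))
    return articles, list(pending.values())
```

The second return value plays the role of the portion count: when it is empty, every portion has been folded into an article, and anything left in it is the set to review by hand.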

OCRing newspapers has its challenges, not to mention the difficulty of scanning them, but it’s possible and can be very accurate if you are in the right state of mind and use the right approaches.

“No text left behind” – Color’s Impact on OCR

Jul 24

OCR technology has come a long way since its creation. On clean, 300 DPI, letter-type documents the technology has arrived and there is not much room for improvement. But what about the rest of the documents out there? How is OCR improving on them? When comparing that perfect letter document to, say, a not-so-perfect article or newspaper, the big difference is text placement and configuration. One of the keys to getting even better OCR is to improve your ability to identify what is graphics and what is text. Within the text, you have to identify columns, paragraphs, sentences, words, and finally characters. Only then can the OCR take a whack at interpreting the text. This is called Document Analysis. Sometimes OCR accuracy is lower not because of the actual read of the text but because the OCR software tries to read things that are not text, or because some of the text in the document is simply ignored because it was never found.

In the last few years, text identification (Document Analysis) has been one of the areas of greatest improvement, and that will continue. Many of the new products have been leveraging color as one more tool for not leaving any text behind. With color, locating the different parts of a document is easier and more accurate, and thus the overall OCR is more accurate. The most obvious benefit of color is the ability to locate graphics. Sometimes index-level OCR requires that even text within graphics be read to enhance the searchability of a document. With color detection, modern engines are advancing to locate text in pictures and ignore the rest. Very stylized documents pose the greatest challenge to Document Analysis, and color is one of the best tools to attack them. Expect to see a continued focus on Document Analysis and the pursuit of no text left behind.

Turning off the latest technology

Mar 03

Our culture is built on the belief that newer and more means better. In advanced technologies this is true for the most part, but people are always surprised when I tell them that disabling some of the newer technology will actually produce a better result. I am going to give you three examples where the technology demands time travel to older approaches for higher accuracy.

In data capture and OCR, there is a component of the technology called document analysis. Document analysis, performed before any data is collected, determines the structure of a page, including columns, rows, tables, pictures, paragraphs, lines, etc. It’s the biggest contributor to modern-day OCR accuracy. Document analysis is really designed for more traditional documents such as an article, a book page, or a letter. Document analysis (although there have been specialized versions) does not excel at form-type documents. One of the most difficult documents in the world is an Explanation of Benefits (EOB). This document typically has its own structure per variant. Surprisingly, the best way to process such a document is to turn off document analysis and default to a basic full-page read of the text. The reason is that document analysis brings an overwhelming bias toward tables that no EOB will match.

It is the same case when reading text from photographs. Reading text from license plates and product plates (serial number plates welded or stuck to many products) during assembly is best done with engines that do not have document analysis. In this case, the document analysis is trying too hard to find information. Because of the nature of these images, characters in the photo end up being split into multiple lines and characters. Without document analysis, the engine sees the whole image as one text block and just reads it, thus creating better results. The license-plate readers that snap pictures of your license plate at toll booths all use older, antiquated OCR technology. By turning off document analysis they can use the newer engines.

Finally, there is mobility. This one makes a lot of people uncomfortable. Our society wants to believe a cell phone can do anything. Just today I was wondering why my cell phone did not brush my teeth for me. You can have your cell phone do OCR, sure, but it requires older, smaller, and more limited OCR engines to do so. I prefer to send an image to a server and use more advanced OCR, but many demand OCR on the phone even though in practice it’s usually slower. The reason is that OCR requires a specific amount and type of processing power. Chips in phones today, and likely for a very long time to come, will not compete with the power of a computer, nor, most importantly, will they include the math operations that efficient, math-heavy modern OCR needs. Cell phones cannot adopt such chips because we demand long-lasting batteries, small size, and low cost. Intense math is simply not important for 99.9% of mobile applications.

There you have it: modern OCR taken down a few notches to solve current-day problems. The best engines that exist today allow you to turn the various functions on and off as needed, making it possible to purchase the latest OCR technology and limit it however you need. Most organizations don’t understand why anyone would want to turn off the new, but today I’ve proven that new is not always better!
