Jul 26

What is the greatest difference between the most accurate Optical Character Recognition (OCR) products and the least accurate? It might not be what you think. The greatest improvements in OCR over the last 10 years have come less from character-level recognition and more from how the engines understand the structure of documents. This is called document analysis. In theory, if you compared two engines with identical character recognition, where engine A had document analysis and engine B did not, engine A would win.

The first aspect of document analysis is how the engine breaks a document apart into components such as paragraphs, lines, columns, and graphics. Without this, the engine is OCRing blind and assumes that every object it encounters is text. This sometimes leads to lines being clumped together, or to graphics being run through OCR as if they were text. The second aspect of document analysis is delivering formatting in the export that matches the formatting in the original document. This can also include font style and color.
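To make the "breaking apart" step concrete, here is a minimal sketch of one classic way to split a page into text lines: a horizontal projection profile, which looks for rows of pixels that contain no ink. This is an illustration only, not a description of how any particular OCR engine works. The function name `split_lines` and the 0/1 image format are assumptions made for the example.

```python
def split_lines(image):
    """Split a binarized page into text lines.

    `image` is a list of pixel rows; each row is a list of 0/1 values,
    where 1 means ink. Returns (start_row, end_row) pairs, with
    end_row exclusive. Hypothetical helper for illustration only.
    """
    lines = []
    start = None  # row index where the current line began, if any
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # first inked row: a line begins
        elif not has_ink and start is not None:
            lines.append((start, y))       # blank row: the line has ended
            start = None
    if start is not None:                  # page ended in the middle of a line
        lines.append((start, len(image)))
    return lines


# A tiny 6-row "page" with two bands of ink separated by a blank row.
page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(split_lines(page))
```

An engine that skips this kind of step has no notion of where one line stops and the next starts, which is how lines end up clumped. Real pages also need skew correction, column detection, and a way to tell graphics from text, none of which this sketch attempts.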

With traditional documents, you can expect products with document analysis to get the formatting spot on. This is very important, not only for editing and re-purposing, but also for preserving the readability of a document. Another aspect of document analysis is determining reading order. For example, if you have a multi-column, multi-paragraph page, the software has to decide in what order the paragraphs are read. This is useful during recognition, and also when a formatted document is converted to a flatter file structure such as a TXT file, where the order stands a chance of being confused.
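The reading-order problem is easy to see in a small sketch. Assume the engine has already found paragraph blocks with bounding boxes (the `Block` type and both functions below are hypothetical, made up for this example). Sorting blocks purely top to bottom interleaves the columns, while grouping blocks into columns first gives the order a person would read them in.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A paragraph block with its bounding box (hypothetical type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def naive_order(blocks):
    """Top to bottom, then left to right: ignores column structure."""
    return sorted(blocks, key=lambda b: (b.y0, b.x0))


def reading_order(blocks):
    """Group blocks into columns by horizontal overlap, then read
    each column top to bottom, columns left to right."""
    columns = []
    for block in sorted(blocks, key=lambda b: b.x0):
        for column in columns:
            col_x0 = min(b.x0 for b in column)
            col_x1 = max(b.x1 for b in column)
            # The block joins a column if their horizontal extents overlap.
            if block.x0 < col_x1 and block.x1 > col_x0:
                column.append(block)
                break
        else:
            columns.append([block])        # no overlap: start a new column
    columns.sort(key=lambda col: min(b.x0 for b in col))
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.y0))
    return ordered


# Two columns, two paragraphs each. A and C are on the left, B and D on the right.
page = [
    Block("A", 0, 0, 100, 100),
    Block("B", 120, 5, 220, 100),
    Block("C", 0, 110, 100, 200),
    Block("D", 120, 110, 220, 200),
]
print([b.text for b in naive_order(page)])    # ['A', 'B', 'C', 'D']
print([b.text for b in reading_order(page)])  # ['A', 'C', 'B', 'D']
```

The naive order is what a flat TXT export can look like when the engine has no notion of columns. This sketch only handles simple side-by-side columns. Headlines that span columns, sidebars, and tables need more careful rules.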

The reality is that for clean documents, character-level recognition is not getting any better; it is already amazingly accurate today. The opportunity to improve is in document analysis and language morphology, but that is another post.

