
OCR & Novel Fonts

In classical OCR, recognition systems were trained on a very specific set of fonts. If the fonts in a document varied from that set in any material way, recognition rates would fall off accordingly. Today's systems are much more robust and can handle the myriad novel fonts that are used in publishing and available on the Web. What becomes more relevant for modern OCR systems is adaptability. If the shapes of characters in a new font are fairly unpredictable, what can be relied upon? It would be nice if, at the very least, topological properties such as the Euler number (the number of connected components minus the number of holes) were preserved. But often even this property is not invariant, either because novel fonts modify basic character topology or because scanning noise introduces or eliminates holes.
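The fragility of the Euler number can be illustrated with a small sketch. The code below is not taken from any CVISION product; it is a minimal illustration that assumes glyphs are given as rows of "#" (ink) and "." (background), treats ink as 8-connected and background as 4-connected, and counts holes as background regions that do not reach the image border. The function and glyph names are hypothetical.

```python
from collections import deque

N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_components(grid, target, neighbors):
    """Count connected regions of cells equal to `target` using BFS."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != target or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in neighbors:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == target):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def euler_number(glyph):
    """Euler number = ink components minus holes (assumed conventions above)."""
    rows = [[1 if ch == "#" else 0 for ch in line] for line in glyph]
    width = len(rows[0])
    # Pad with background so the outside forms exactly one background region.
    padded = ([[0] * (width + 2)]
              + [[0] + row + [0] for row in rows]
              + [[0] * (width + 2)])
    components = count_components(padded, 1, N8)
    holes = count_components(padded, 0, N4) - 1  # subtract the outside region
    return components - holes


LETTER_O = [".###.",
            "#...#",
            "#...#",
            ".###."]

# The same glyph with one ink pixel lost to scanning noise: the hole opens up.
NOISY_O = [".###.",
           "#...#",
           "#....",
           ".###."]

LETTER_B = ["###.",
            "#..#",
            "###.",
            "#..#",
            "###."]

print(euler_number(LETTER_O), euler_number(NOISY_O), euler_number(LETTER_B))
```

A single dropped pixel changes the Euler number of the "O" from 0 to 1, which is the kind of instability described above.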

As a result, what has become more prevalent in recent OCR technology is "shape-free" OCR. These algorithms seek to find the appropriate mapping between the symbols learned from a font and the symbol alphabet, and they rely heavily on order statistics to do so. Among the methods used are numbered strings, which make use of word structure to limit or uniquely identify the correct mapping. Obviously, the longer the document being analyzed, the more relevant the document statistics (such as k-tuples) will be.
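One way to picture the numbered-string idea is as a substitution puzzle. The sketch below is an illustration under stated assumptions, not a description of a specific product: glyphs are assumed to have already been grouped into clusters with arbitrary integer IDs, each cluster is assumed to stand for exactly one letter, and the tiny lexicon is made up for the example. All function names are hypothetical.

```python
def numbered_string(symbols):
    """Replace each symbol by the index of its first occurrence.

    "that" and the cluster sequence (7, 3, 9, 7) both become (0, 1, 2, 0).
    """
    first_seen = {}
    return tuple(first_seen.setdefault(s, len(first_seen)) for s in symbols)


def candidates(cluster_word, lexicon):
    """Lexicon words whose repetition pattern matches the cluster word."""
    pattern = numbered_string(cluster_word)
    return [w for w in lexicon if numbered_string(w) == pattern]


def solve(words, lexicon, mapping=None):
    """Yield every cluster-to-letter mapping consistent with all words."""
    mapping = mapping or {}
    if not words:
        yield dict(mapping)
        return
    head, rest = words[0], words[1:]
    for candidate in candidates(head, lexicon):
        trial = dict(mapping)
        used = set(trial.values())
        consistent = True
        for cluster, letter in zip(head, candidate):
            if cluster in trial:
                if trial[cluster] != letter:
                    consistent = False
                    break
            elif letter in used:  # two clusters may not share a letter
                consistent = False
                break
            else:
                trial[cluster] = letter
                used.add(letter)
        if consistent:
            yield from solve(rest, lexicon, trial)


# Hypothetical lexicon and two "words" given only as glyph-cluster IDs.
LEXICON = ["that", "this", "hat", "his", "tot", "dad", "noon"]
CLUSTER_WORDS = [(7, 3, 9, 7), (3, 9, 7)]

SOLUTIONS = list(solve(CLUSTER_WORDS, LEXICON))
print(SOLUTIONS)
```

In this toy case the first word matches only "that", and that mapping in turn rules out "his" for the second word, so the mapping is pinned down without looking at any glyph shape.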

It would seem that shape-only OCR systems are somewhat limited in applicability. Such systems try to solve the OCR puzzle strictly from the shape of a component image. This method can also be referred to as context-free, since no neighboring context is required to solve for the correct ASCII mapping. Similarly, OCR methods that are highly statistical can be thought of as context-sensitive, as these methods first compute order statistics, or k-tuples, and only then infer the ASCII mapping. A combination of context-free and context-sensitive methods, incorporating geometric and topological properties of each component in conjunction with shape-free statistical methods, is most likely to yield accurate OCR results.
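A rough sketch of such a combination follows. It is only an illustration: the per-glyph shape probabilities are invented, the context model is a simple 2-tuple (bigram) count over a made-up reference text, the smoothing is ad hoc, and the exhaustive search is practical only for very short words. The names `bigram_counts`, `best_reading`, and `weight` are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def bigram_counts(text):
    """Count adjacent character pairs (2-tuples) in a reference text."""
    return Counter(zip(text, text[1:]))


def best_reading(shape_scores, bigrams, weight=1.0):
    """Pick the letter sequence with the best combined score.

    shape_scores: one dict per glyph, mapping candidate letter -> probability
                  from a context-free (shape-only) classifier.
    weight:       how much the context-sensitive bigram term counts;
                  0.0 reduces this to shape-only recognition.
    """
    total = sum(bigrams.values())
    best, best_score = None, -math.inf
    for letters in product(*(s.keys() for s in shape_scores)):
        score = sum(math.log(s[ch]) for s, ch in zip(shape_scores, letters))
        for pair in zip(letters, letters[1:]):
            # Crude add-one smoothing: unseen pairs are penalized, not forbidden.
            score += weight * math.log((bigrams[pair] + 1) / (total + 1))
        if score > best_score:
            best, best_score = "".join(letters), score
    return best


# Invented data: the middle glyph looks slightly more like "b" than "h".
SHAPE = [
    {"t": 0.9, "f": 0.1},
    {"h": 0.45, "b": 0.55},
    {"e": 0.8, "c": 0.2},
]
BIGRAMS = bigram_counts("the cat sat on the mat then the hen ate")

print(best_reading(SHAPE, BIGRAMS, weight=0.0))  # shape only
print(best_reading(SHAPE, BIGRAMS))              # shape plus context
```

With shape alone the ambiguous middle glyph is read as "b"; once the bigram term is included, the frequent pairs "th" and "he" tip the reading to "the".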


 
 
   
 


Copyright (c) 1998-2007 CVISION Technologies, Inc.
CVISION, CVista, CBatch, and the CVISION logo are registered
trademarks of CVISION Technologies, Inc.

 