OCR Support within PDF Format

Typical OCR conversion of documents leaves users stuck between two limited formats. First, there is the OCR'ed text file, which is easy to search and simple to copy into an editor such as Microsoft Word. Second, there is the original document, which contains formatting, graphics and other information not usually found in a text file, and which an OCR program cannot reliably reproduce.

Viewing an OCR'ed text file in tandem with the original TIFF or JPEG can be quite tedious. Fortunately, PDF solves this problem by embedding a hidden text layer in an image PDF. This gives the user a fully searchable document that still contains all the visual information available in the original. Searching for a word can then take you to the exact location in the image where that word appears. An example of this image + text PDF is shown in the figure below, where the search is for the text string "determine."

 

Figure 6. A hidden text word is highlighted in Adobe Reader.
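One way to picture the hidden text layer is as ordinary PDF text drawn in text rendering mode 3, which neither fills nor strokes the glyphs, so the text is invisible but still searchable and selectable. The sketch below builds the content-stream operators for a single word. It is an illustration only, not the output of any particular OCR product: the function names, the font resource name F1, and the coordinates are assumptions, and a real text layer would also scale each word to match its width in the scanned image.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return (
        text.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )


def hidden_word_ops(word: str, x: float, y: float, font_size: float = 10.0) -> str:
    """Return content-stream operators placing one invisible word at (x, y).

    Hypothetical helper: assumes the page declares a font resource named F1.
    """
    return (
        "BT\n"                        # begin a text object
        f"/F1 {font_size:g} Tf\n"    # select the (assumed) font and size
        "3 Tr\n"                      # rendering mode 3: neither fill nor stroke
        f"1 0 0 1 {x:g} {y:g} Tm\n"  # move to the word's position on the page
        f"({escape_pdf_string(word)}) Tj\n"  # show the OCR'ed word
        "ET"                          # end the text object
    )


if __name__ == "__main__":
    # Coordinates are made up; an OCR engine would supply the real ones.
    print(hidden_word_ops("determine", 72, 700))
```

Because the word is positioned with the coordinates the OCR engine found for it, a viewer that matches the search string in the text layer can highlight the corresponding region of the page image.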

One possible drawback to this embedded OCR approach is that it increases the file size: in addition to the original page image, the file must also store a hidden text layer. JBIG2 compensates by greatly decreasing the size of the image layer, so that the JBIG2-compressed PDF, including its text layer, will generally be much smaller than the original file. You get more information, presented in a more useful manner, in a much smaller file. For example, the original TIFF for US Patent 6122633 was 849,372 bytes, while the JBIG2-wrapped PDF with OCR (produced with CVision PdfCompressor) is 120,648 bytes, less than 15% of the original size.
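The percentage quoted for the patent example can be checked with a few lines of arithmetic. This sketch assumes the two figures are whole byte counts (849,372 and 120,648, reading the separator as a thousands mark); the function and variable names are illustrative.

```python
def size_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Return the compressed size as a fraction of the original size."""
    if original_bytes <= 0:
        raise ValueError("original size must be positive")
    return compressed_bytes / original_bytes


# Figures from the patent example, read as whole byte counts (an assumption).
original_tiff = 849_372
jbig2_pdf_with_ocr = 120_648

ratio = size_ratio(original_tiff, jbig2_pdf_with_ocr)
print(f"compressed file is {ratio:.1%} of the original")
print(f"reduction factor: {1 / ratio:.2f}x")
```

Under that reading the compressed file is roughly one seventh the size of the TIFF, which is consistent with the "less than 15%" figure.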

Because even the most accurate OCR engines frequently produce errors, it is always a good idea to have easy access to the original document in case the text seems inaccurate.

 