OCR to Create Searchable PDFs
In the new "paperless" office, there is no more paper. Any file, contract, or set of meeting notes you need can be found instantly with Google-style searching. To make this happen, paper must be made searchable. Pure OCR is usually not the answer, for a variety of reasons. Chief among them is that pure paper-to-electronic conversion is not a reliable process. For most corporate documents, the accuracy of the conversion must be guaranteed before searchability is even a consideration. Any process that can modify the look and feel of the document is not acceptable.
Paper-to-electronic format conversion is not reliable enough for records management. The OCR engine attempts to understand the document logically and then reconstruct it electronically. Any error the OCR engine makes in interpreting the document shows up as a visible error in the electronic reconstruction. Instead, the document management industry has come to rely heavily on searchable image documents. Among these, the most popular format is PDF image + hidden text. In this format, what you "see" is just the scanned image of the document. What you search, however, is the fully OCRed text of the file. Since the OCR layer is hidden, recognition mistakes and substitution errors are not visible on the document.
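As a rough illustration of the image + hidden text idea, the sketch below builds the content stream for one PDF page: it first paints the scanned image across the page, then writes each OCRed word in text rendering mode 3 (the PDF "invisible" mode), so the text can be searched but is never drawn. The names OcrWord and build_page_content, the resource names Im0 and F1, and the word coordinates are assumptions made for this example. This is only the page content, not a complete PDF writer.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One word reported by an OCR engine (hypothetical structure)."""
    text: str    # recognized text, possibly containing OCR errors
    x: float     # left edge in PDF points, origin at bottom-left of page
    y: float     # text baseline in PDF points
    size: float  # font size in points


def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def build_page_content(page_w, page_h, words, image_name="Im0", font_name="F1"):
    """Return content-stream operators for an image page with hidden text.

    Assumes the page resources define an image XObject called image_name
    and a font called font_name (both names are placeholders).
    """
    ops = [
        "q",                                        # save graphics state
        f"{page_w:.2f} 0 0 {page_h:.2f} 0 0 cm",    # scale unit square to page
        f"/{image_name} Do",                        # paint the scanned image
        "Q",                                        # restore graphics state
        "BT",                                       # begin text object
        "3 Tr",                                     # rendering mode 3: invisible
    ]
    for word in words:
        ops.append(f"/{font_name} {word.size:.2f} Tf")
        ops.append(f"1 0 0 1 {word.x:.2f} {word.y:.2f} Tm")  # position the word
        ops.append(f"({escape_pdf_string(word.text)}) Tj")    # hidden text
    ops.append("ET")                                # end text object
    return "\n".join(ops)


if __name__ == "__main__":
    sample = [OcrWord("Contract", 72, 700, 12), OcrWord("(draft)", 130, 700, 12)]
    print(build_page_content(612, 792, sample))
```

Because the viewer only ever paints the image, a misread word in the hidden layer changes nothing about how the page looks. A production tool would also need to embed the image data, declare the font, and handle non-ASCII text, all of which this sketch leaves out.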
Searchable PDF is an ideal format for several reasons. One of the main ones is database portability. When porting documents between databases, it is very good practice for them to be self-contained: the OCRed text and any document-related metadata should be available within the document itself. With PDF this is easy to achieve. With other formats, such as TIFF, it is virtually impossible. Other information, including headers, footers, Bates stamping, and security settings, can also easily be added to a PDF file.
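To make "self-contained" a little more concrete, here is a minimal sketch of how document metadata can travel inside the PDF itself, using the document information dictionary. The function name pdf_info_dictionary and the sample values are assumptions for illustration. The sketch handles plain ASCII values only and assumes every key is a valid PDF name (no spaces or delimiter characters).

```python
def pdf_info_dictionary(metadata):
    """Serialize metadata as a PDF document information dictionary.

    Illustrative only: ASCII values, keys assumed to be valid PDF names.
    """
    def escape(value):
        # Backslash and parentheses are special in PDF literal strings.
        return value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

    entries = " ".join(f"/{key} ({escape(value)})" for key, value in metadata.items())
    return f"<< {entries} >>"


if __name__ == "__main__":
    # Title and Keywords are standard Info dictionary keys; values are made up.
    print(pdf_info_dictionary({"Title": "Lease (2019)", "Keywords": "lease, tenant"}))
```

Since the metadata is stored in the file rather than in a database row beside it, moving the document to another system carries the metadata along with no separate export step.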





