Searchable PDF using OCR
In the new “paperless” office, there is no more paper. Any file, contract, or set of meeting notes you need can be found instantly with Google-style searching. To make this happen, paper documents must be made searchable. Pure OCR is usually not the answer, for a variety of reasons. Chief among them, pure paper-to-electronic conversion is not a reliable process. For most corporate documents, the accuracy of the conversion must be guaranteed before searchability even becomes a factor. Any process that can modify the look and feel of the document is not acceptable.
Searchable Image Documents
Paper-to-electronic format conversion is not reliable for purposes of records management. The OCR engine attempts to understand the document logically and then reconstruct it electronically. Any error the OCR engine makes in interpreting the document results in visible errors in the electronic reconstruction. Instead, the document management industry has come to rely heavily on searchable image documents. Among these, the most popular format is PDF image + hidden text. In this format, what you “see” is just the scanned image of the document. What you search, however, is the fully OCRed text of the file. Since the OCR layer is hidden, mistakes and substitution errors at the OCR level are not visible on the document.
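To make the image + hidden text idea concrete, here is a minimal sketch in Python that writes a one-page PDF by hand. It relies on text rendering mode 3 from the PDF specification, which draws text that is neither filled nor stroked, so it is invisible but still present for searching. Everything else is an assumption made for illustration: the function name build_searchable_pdf is hypothetical, the "scan" is a single gray pixel stretched over the page, the text must fit in Latin-1, and all of it is placed as one string at a fixed position. A production system would normally use a PDF library, embed the real scanned image, and position each OCRed word over its location in the image.

```python
def build_searchable_pdf(ocr_text: str, width: int = 612, height: int = 792) -> bytes:
    """Return a one-page PDF: a page-sized image with invisible text on top."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = ocr_text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

    # Page content: draw the image scaled to the page, then write the text in
    # rendering mode 3 ("3 Tr"), which makes it invisible. Assumes Latin-1 text.
    content = (
        f"q {width} 0 0 {height} 0 0 cm /Im0 Do Q\n"
        f"BT 3 Tr /F1 12 Tf 72 720 Td ({escaped}) Tj ET"
    ).encode("latin-1")

    # Placeholder for the scan (assumption): one light-gray pixel.
    pixel = b"\xee"

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {width} {height}] "
         f"/Resources << /XObject << /Im0 4 0 R >> /Font << /F1 5 0 R >> >> "
         f"/Contents 6 0 R >>").encode("latin-1"),
        b"<< /Type /XObject /Subtype /Image /Width 1 /Height 1 "
        b"/ColorSpace /DeviceGray /BitsPerComponent 8 /Length 1 >>\n"
        b"stream\n" + pixel + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length " + str(len(content)).encode() + b" >>\n"
        b"stream\n" + content + b"\nendstream",
    ]

    # Write the objects, remembering each byte offset for the xref table.
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))
        out += f"{number} 0 obj\n".encode() + body + b"\nendobj\n"

    # Cross-reference table: one 20-byte entry per object, plus the free entry.
    xref_start = len(out)
    out += f"xref\n0 {len(objects) + 1}\n".encode()
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += f"{offset:010d} 00000 n \n".encode()
    out += (f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n"
            f"startxref\n{xref_start}\n%%EOF\n").encode()
    return bytes(out)
```

Writing the returned bytes to a file gives a page that shows only the image, while a viewer's text search can still find the OCR text. An OCR substitution error changes what can be found, but not what the page looks like.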
Benefits of Searchable PDF
Searchable PDF is an ideal format for several reasons. One of the main ones is database portability. When porting documents between databases, it is good practice for each document to be self-contained: the OCRed text and any document-related metadata should be available within the document itself. With PDF this is easy to achieve. With other formats, such as TIFF, it is virtually impossible. In addition, other information such as headers, footers, Bates stamping, and security can easily be added to a PDF file.
Many OCR users want only basic search from an OCR engine: they want to find the needle in the haystack. They need to search a database and find all files that contain a certain expression. This type of OCR does not depend on a logical decomposition of the document image. It is sufficient to get back all the text associated with each page of a document image and feed that text to a full-text search database engine. The database then indexes the full text and supports general text-based queries, e.g., proximity search.
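As a rough illustration of this kind of basic search, the following Python sketch builds a small positional index over per-page OCR text and answers both a plain word lookup and a simple proximity query. It is a toy stand-in for a real full-text database engine, not a description of any particular product. The class name PageIndex, its methods, and the sample file names are all hypothetical.

```python
import re
from collections import defaultdict


class PageIndex:
    """Toy full-text index over OCR output, keyed by (file, page)."""

    def __init__(self):
        # word -> {(file, page): [positions of the word on that page]}
        self.postings = defaultdict(lambda: defaultdict(list))

    def add_page(self, file_name, page_number, ocr_text):
        # No layout analysis needed: just the flat text of the page.
        words = re.findall(r"\w+", ocr_text.lower())
        for position, word in enumerate(words):
            self.postings[word][(file_name, page_number)].append(position)

    def search(self, word):
        """Return every (file, page) whose OCR text contains the word."""
        return sorted(self.postings.get(word.lower(), {}))

    def near(self, first, second, max_distance=5):
        """Return pages where the two words occur within max_distance words."""
        first_pages = self.postings.get(first.lower(), {})
        second_pages = self.postings.get(second.lower(), {})
        hits = []
        for page in first_pages.keys() & second_pages.keys():
            if any(abs(a - b) <= max_distance
                   for a in first_pages[page] for b in second_pages[page]):
                hits.append(page)
        return sorted(hits)


index = PageIndex()
index.add_page("contract.pdf", 1,
               "This lease agreement is made between the landlord and the tenant.")
index.add_page("notes.pdf", 3,
               "The tenant called. Much later in the meeting, after many other "
               "items were discussed, the landlord replied.")
print(index.search("tenant"))
print(index.near("landlord", "tenant", max_distance=3))
```

The index only needs the words of each page and their order, which is why word-level OCR accuracy is the figure that matters for this use case.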
There are times, however, when a logical decomposition of the document is required. This happens when part of the document is to be reused in composing another document. In this case, the document needs to be logically understood, including word reading order, tables, and graphs, so that an excerpt can be used as part of another document. For applications that need this logical decomposition, OCR word accuracy alone is not sufficient for evaluating OCR systems.
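One small piece of logical decomposition, word reading order, can be sketched as follows. The example assumes the OCR engine returns each word with a bounding box, and that the page is a single column of horizontal text. The Word type, the reading_order function, and the line_tolerance parameter are hypothetical names chosen for illustration. Multi-column pages, tables, and graphs need real layout analysis and are beyond this sketch.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float       # left edge of the bounding box
    y: float       # top edge, measured downward from the top of the page
    height: float  # height of the bounding box


def reading_order(words, line_tolerance=0.5):
    """Group words into lines by vertical position, then read left to right."""
    lines = []
    for word in sorted(words, key=lambda w: w.y):
        # A word joins the current line when its top edge is within a
        # fraction of its own height of the line's first word.
        if lines and abs(word.y - lines[-1][0].y) <= line_tolerance * word.height:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order the words by their left edge.
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x))
            for line in lines]
```

Two engines with identical word accuracy can still differ here: if the words come back in the wrong order, an excerpt pasted into another document will be scrambled even though every word was recognized correctly.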