OCR in Robust File Conversion

By Chris

Question: I want to convert company legacy files from both image and electronic formats to PDF. Is the process different for image formats, such as TIFF, than for electronic formats, such as Word? Do I need to run an OCR process on the converted electronic files or only on the image files?

Answer: Converting image files to a standardized format, such as PDF, is usually done in just one way: converting from one image format to another, which is a routine operation, and then optionally adding OCR for text search and document-related metadata. Converting electronic files to a canonical format, such as PDF, can usually be done in one of two ways: (i) electronic conversion or (ii) image conversion.

Image conversion of electronic documents, say Word to TIFF or G4-based PDF, for long-term archiving is generally considered a safe practice, almost like microfiching a dataset. Very little can go wrong during the conversion process. On the other hand, the image files are apt to be larger than the electronic files, usually by a factor of 10 to 20 or more, and they need to be OCRed before being attached to the database if they are to be text searchable. State-of-the-art compression methods can, however, reduce the image files back down to the size of the original electronic source document (see http://www.cvisiontech.com/pdf_compressor_31.html).
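To make the distinction concrete, here is a minimal sketch of how a batch job might decide which steps each legacy file needs. The extension lists, the function name plan_conversion, and the step labels are all hypothetical and chosen only for illustration; they are not taken from any particular product.

```python
from pathlib import Path

# Hypothetical extension lists, for illustration only; adjust to your archive.
IMAGE_EXTS = {".tif", ".tiff", ".png", ".jpg", ".jpeg", ".bmp"}
ELECTRONIC_EXTS = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".rtf", ".txt"}


def plan_conversion(filename, image_route_for_electronic=False):
    """Return the ordered steps needed to end up with a searchable PDF."""
    ext = Path(filename).suffix.lower()
    if ext in IMAGE_EXTS:
        # Image sources carry no text layer, so OCR is required for search.
        return ["convert image to PDF", "run OCR"]
    if ext in ELECTRONIC_EXTS:
        if image_route_for_electronic:
            # Rendering to an image discards the text, so OCR is needed again.
            return ["render to image-based PDF", "run OCR"]
        # Electronic conversion keeps the text; check the result instead.
        return ["convert electronically to PDF", "verify output"]
    raise ValueError(f"unrecognized file type: {filename}")
```

For example, plan_conversion("scan.tiff") includes an OCR step, while plan_conversion("report.docx") does not unless the image route is chosen.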

Electronic file conversion is simple in that the output file is already searchable and in electronic form. For example, converting a Word file with a standard method (e.g., Distiller) should produce an equivalent PDF that is electronic and already searchable, so no OCR step is needed. One thing to watch for is that important graphic, tabular, or mathematical information is sometimes lost or modified during electronic conversion. Verifying the converted output is therefore generally recommended.
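One simple form of verification is to compare the text of the source document with the text of the converted PDF. The sketch below assumes both texts have already been extracted by some other tool (extraction is not shown), and the function names and the 0.98 threshold are hypothetical. It only catches missing or altered text; lost graphics or a damaged table layout would still need a visual check.

```python
import difflib
import re


def normalize(text):
    """Collapse whitespace and case so layout differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def text_similarity(source_text, converted_text):
    """Return a similarity ratio between 0.0 and 1.0 for the two texts."""
    matcher = difflib.SequenceMatcher(
        None, normalize(source_text), normalize(converted_text), autojunk=False
    )
    return matcher.ratio()


def conversion_looks_ok(source_text, converted_text, threshold=0.98):
    """Flag a conversion as suspect when too much text differs.

    The threshold is an assumed value; tune it on known-good conversions.
    """
    return text_similarity(source_text, converted_text) >= threshold
```

Files that fall below the threshold can be queued for manual review rather than rejected outright.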
