OCR Conversions: Image with Hidden Text vs. Electronic Formats

By Chris

Question: When making captured documents text searchable, is it preferable to generate the output in strictly electronic format (e.g., Word) or is it better to generate it in PDF image + hidden text format?

Answer: The right output format depends on what the resulting document will be used for. If the output is needed only for purposes that tolerate some text errors, e.g., document indexing or “borrowing” text to build part of a new document, then an electronic output file (e.g., txt or Word) is usually adequate. OCR accuracy is rarely 100%, however, so the user should expect some mismatches between the electronic file and the original. Also, because the OCR engine must interpret the scanned page rather than reproduce it, the OCRed file will sometimes look very different from the captured original. For documents that are badly degraded, or scanned at low dpi, the engine may produce no OCR text at all even though the document is entirely readable to a person.
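To make the "mismatch rate" concrete, here is a minimal sketch of one common way to measure it: the character error rate, i.e., the edit distance between a hand-checked reference text and the OCR output, divided by the length of the reference. The function names and the sample strings are illustrative assumptions, not part of any particular OCR product.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_text: str) -> float:
    """Edit distance divided by the reference length (0.0 means a perfect match)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_text) / len(reference)


# Hypothetical sample: the OCR engine confused "l" and "1" twice.
reference = "Total due: 100"
ocr_text = "Tota1 due: l00"
print(f"character error rate: {character_error_rate(reference, ocr_text):.1%}")
```

A rate like this is only as good as the reference text it is compared against, so it is best computed on a small, manually verified sample of pages.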

The advantage of keeping a scanned document in PDF image + hidden text format is that the file, though electronic, displays the scanned page image itself, so it looks exactly like the captured original, while still offering the searchability and indexing capability of a Word or text file. A PDF image is essentially the same as a scan saved to TIFF, except that other compression filters, including JBIG2, can be used to compress it. As a result, a PDF image generated with the right compression technology is typically significantly smaller than the corresponding TIFF or G4-encoded PDF. The hidden text layer is ideal for searching, because query results are highlighted on the image using the bounding boxes stored in the hidden text layer. For document archiving and records management applications, PDF image with a hidden text layer is an ideal file format.
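The highlighting behavior described above can be sketched roughly as follows. This assumes the hidden text layer has already been read into a list of words, each with a bounding box in page coordinates; the `HiddenWord` type, the `find_highlights` function, and the sample coordinates are hypothetical and do not correspond to any specific PDF library.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x0, y0, x1, y1) in page coordinates; the units are an assumption.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class HiddenWord:
    text: str  # word as recognized by the OCR engine
    box: Box   # where that word sits on the page image


def normalize(token: str) -> str:
    """Lowercase and trim surrounding punctuation so 'Management,' matches 'management'."""
    return token.strip(".,;:!?()\"'").lower()


def find_highlights(words: List[HiddenWord], query: str) -> List[List[Box]]:
    """Return one list of boxes per hit; each hit is a run of consecutive
    hidden-layer words that matches the query terms in order."""
    terms = [t for t in (normalize(q) for q in query.split()) if t]
    if not terms:
        return []
    hits = []
    for start in range(len(words) - len(terms) + 1):
        window = words[start:start + len(terms)]
        if all(normalize(w.text) == t for w, t in zip(window, terms)):
            hits.append([w.box for w in window])
    return hits


# Hypothetical hidden text layer for part of one page.
page = [
    HiddenWord("Records", (72, 700, 130, 712)),
    HiddenWord("management,", (134, 700, 210, 712)),
    HiddenWord("records", (72, 680, 125, 692)),
]
for hit in find_highlights(page, "records management"):
    print("highlight boxes:", hit)
```

A viewer would then draw a translucent rectangle over each returned box, which is why the highlight lands on the image even though the visible page contains no real text.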
