In today's digital world, searchable text for all important documents is becoming a must-have feature. This can be achieved with OCR (Optical Character Recognition), which converts scanned images into a document format that search indexers can already read.
Nowadays, several OCR applications are available that can convert scanned images to plain text, Word, HTML or searchable PDF.
The important characteristics that distinguish OCR applications are:
Character recognition accuracy
Support for searchable PDF output
Page layout reconstruction accuracy
Support for various languages
User interface
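Character recognition accuracy can be measured when comparing OCR applications. The following is a minimal sketch, assuming a hand-typed reference transcription of the page is available. The function names edit_distance and character_accuracy are illustrative and do not come from any particular OCR product; the metric shown (one minus the edit distance divided by the reference length) is one common convention, not the only one.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Fraction of reference characters recognised correctly (assumed metric)."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # The OCR engine misread the letter "l" as the digit "1".
    print(character_accuracy("searchable", "searchab1e"))
```

Running the same reference pages through two OCR applications and comparing the scores gives a rough, like-for-like view of their accuracy.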
In a searchable PDF, the original scanned image is kept as is, so anyone can read the document exactly as it was scanned. The text extracted by OCR is placed behind the image, where search indexers can see it and where Acrobat Reader lets us select it as text.
Generating a searchable PDF from a scanned document is simple. It involves the following steps:
Decompressing the image
Preprocessing the image to make OCR more accurate
Running OCR on the image to extract the text
Re-encoding the image in a format chosen to give the smallest possible file size
Building a PDF from the image and the extracted text, with each word positioned exactly behind its place in the image
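The last step, positioning each word behind its place in the image, comes down to a coordinate conversion. The sketch below assumes the OCR engine reports each word as a pixel box (left, top, right, bottom) with the origin at the top-left of the image, and that the page is drawn in PDF units of 1/72 inch with the origin at the bottom-left. The function names are hypothetical and are not taken from any specific OCR or PDF library.

```python
def px_to_pt(pixels: float, dpi: float) -> float:
    """Convert a length in scan pixels to PDF points (1 point = 1/72 inch)."""
    return pixels * 72.0 / dpi


def pixel_box_to_pdf_points(box, image_height_px, dpi):
    """Map an OCR word box to PDF coordinates.

    box is (left, top, right, bottom) in pixels, origin at the top-left.
    Returns (x, y, width, height) in points, origin at the bottom-left,
    where (x, y) is the lower-left corner of the word.
    """
    left, top, right, bottom = box
    x = px_to_pt(left, dpi)
    # PDF's y axis points up, so measure from the bottom edge of the image.
    y = px_to_pt(image_height_px - bottom, dpi)
    width = px_to_pt(right - left, dpi)
    height = px_to_pt(bottom - top, dpi)
    return (x, y, width, height)


if __name__ == "__main__":
    # A word found on a US Letter page scanned at 300 dpi (2550 x 3300 pixels).
    print(pixel_box_to_pdf_points((300, 600, 900, 675), 3300, 300))
```

A PDF writer would then place the word's text at (x, y) and scale it to fit the computed width, so that selecting text in the viewer lines up with the visible scan.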
Creating the scanned image from a paper document plays a crucial role in producing a searchable PDF. The following points are important when scanning paper documents to PDF.
Always choose the 'Text-Under-Image' option: While scanning a document, choose the option that applies OCR to make the document text searchable.
Get the right resolution: The scan resolution should ideally be 300 dpi (dots per inch). OCR quality can suffer at lower scan resolutions.
Scan to B&W, grayscale or colour: For the purpose of OCR, grayscale is generally the safest choice, as more information is retained for the OCR to work with.
Watch the other settings: Scanners usually have other settings that can help improve scan quality.
Always get a quality OCR programme.
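The resolution and grayscale points above can be checked with a little arithmetic before a scan is sent to OCR. This is a minimal sketch: the function names are hypothetical, the 300 dpi threshold is the figure recommended above, and the grayscale weights are the ITU-R BT.601 luma weights, which is one common way to reduce colour to gray (a given scanner or OCR tool may use a different conversion).

```python
def effective_dpi(pixels: int, inches: float) -> float:
    """Resolution implied by an image dimension and the paper dimension."""
    return pixels / inches


def meets_ocr_resolution(width_px, height_px, page_width_in, page_height_in,
                         minimum_dpi=300.0):
    """True if both axes of the scan reach the minimum resolution."""
    horizontal = effective_dpi(width_px, page_width_in)
    vertical = effective_dpi(height_px, page_height_in)
    return min(horizontal, vertical) >= minimum_dpi


def rgb_to_gray(r: int, g: int, b: int) -> int:
    """Convert one 8-bit RGB pixel to an 8-bit gray level.

    Uses BT.601 luma weights (0.299, 0.587, 0.114) in integer arithmetic,
    rounding to the nearest gray level.
    """
    return (299 * r + 587 * g + 114 * b + 500) // 1000


if __name__ == "__main__":
    # US Letter (8.5 x 11 inches) scanned to 2550 x 3300 pixels.
    print(meets_ocr_resolution(2550, 3300, 8.5, 11.0))
    # The same page scanned to half as many pixels in each direction.
    print(meets_ocr_resolution(1275, 1650, 8.5, 11.0))
    # Gray level of a pure red pixel.
    print(rgb_to_gray(255, 0, 0))
```

Unlike a pure black-and-white scan, each grayscale pixel keeps one of 256 levels, which is the extra information the OCR has to work with.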
Thus, by keeping all of the above points in mind, we can create PDF files that are searchable with OCR.