OCR stands for Optical Character Recognition. It is a technology that lets a computer scan printed as well as handwritten text and convert it into an editable electronic document. Such editable documents can be opened and used in your desktop publishing software, word processor, or other text editor. Almost all paper documents can be scanned by OCR software and turned into editable files such as Word or Excel documents. Today's OCR software packages also include advanced support for multiple languages, PDF and HTML output, and format retention.
A typical OCR system consists of three logical components: an image scanner, OCR software and hardware, and an output interface. The image scanner optically captures the text images, which are then processed by the OCR software and hardware. This processing involves three operations:
- Document analysis (extracting the individual character images)
- Recognition of these images (according to their shape and size)
- Post-processing (correcting misclassifications made by the recognition algorithm, or limiting the recognition choices)
The output interface communicates the OCR system results to the outside world.
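The post-processing operation described above can be illustrated with a small Python sketch. It is only one possible approach, not the method of any particular OCR product: it limits the recognition choices to a word list and replaces a misread word with its closest match. The word list, the function names, and the 0.8 similarity cutoff are all assumptions made for this example.

```python
import difflib

# Hypothetical word list; a real system would use a full dictionary
# or a vocabulary specific to the documents being scanned.
LEXICON = ["optical", "character", "recognition", "document", "scanner"]


def correct_word(word, lexicon=LEXICON, cutoff=0.8):
    """Return the closest lexicon entry for a word, or the word unchanged."""
    lowered = word.lower()
    if lowered in lexicon:
        return word
    # get_close_matches ranks candidates by similarity ratio (0.0 to 1.0)
    # and keeps only those at or above the cutoff.
    matches = difflib.get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def postprocess(text):
    """Apply word-level correction to whitespace-separated OCR output."""
    return " ".join(correct_word(w) for w in text.split())


if __name__ == "__main__":
    # "0" (zero) misread in place of the letter "o" is a common OCR confusion.
    print(postprocess("0ptical character recogniti0n"))
```

Words that have no sufficiently close match in the list are left untouched, so the sketch never forces an unrelated replacement.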
Most available OCR software is accurate to around 90%, so some proofreading or editing is always required. The higher the accuracy rate, the better the OCR software, since it saves you time in proofreading and manual correction.
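One common way to put a number on accuracy is to compare the OCR output with a hand-corrected reference text and count the character edits needed to turn one into the other. The Python sketch below assumes this character-level definition (one minus edit distance divided by reference length); vendors may measure accuracy differently, and the function names are made up for this example.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference, ocr_output):
    """Fraction of the reference text recovered correctly (0.0 to 1.0)."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = levenshtein(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference))


if __name__ == "__main__":
    reference = "abcdefghij"
    ocr_output = "abcdefghiX"   # one wrong character out of ten
    print(f"{character_accuracy(reference, ocr_output):.0%}")
```

At 90% character accuracy, roughly one character in ten needs correcting, which is why the proofreading step matters.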
The important factors in choosing OCR software are:
- The speed of scanning and conversion
- Compatibility with popular PC applications such as Microsoft Word and Excel
- The ability to adjust the quality of scanned documents (i.e. page layout reconstruction capability)
- Wide-ranging language support
OCR software is thus extremely helpful in document management. With suitable OCR software, literally piles of paper can be converted within a short period of time into compressed PDF files for indexing or archiving. Similarly, OCR scanning can be used for internet publishing, as data on paper can be easily transformed into HTML files. Hence OCR software is now increasingly used by libraries, government agencies, and other enterprises to make lengthy documents quickly available electronically.
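As a rough illustration of the HTML publishing step, the Python sketch below wraps already-recognized text in a minimal HTML page. It assumes the OCR stage has produced plain text with blank lines between paragraphs; the function name and the default title are hypothetical, and a real OCR package would also try to preserve layout and formatting.

```python
import html


def text_to_html(text, title="Scanned document"):
    """Wrap plain OCR text in a minimal HTML page, one <p> per paragraph."""
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # html.escape converts &, < and > so the text cannot break the markup.
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body>\n{body}\n</body></html>\n"
    )


if __name__ == "__main__":
    recognized = "Profit & loss\n\nA < B"
    print(text_to_html(recognized))
```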