Question: Do we need to OCR our company documents? We already field-code them on 21 fields, such as author, date, amount, and state. These are mortgage files that range from 50 to 400 pages. Most pages are scanned in black and white, with a few scanned in color. What are the benefits of using OCR? Is it worth using OCR for this sort of project?
Answer: There is clearly some overlap between field coding and OCR (optical character recognition). When a document is scanned or otherwise entered into the system, key information about it, such as author, date, and last-modified date, usually has to be entered as well. The limitation of field coding is that it is hard to predict in advance every field of a document that will be relevant to a later search.
If the documents are reasonably unstructured, as in legal discovery, then OCR is usually applied to every file considered even tangentially relevant, since any one of them might be the needle in the haystack that we’re looking for. For very structured forms, general OCR might not be required: the fields can be coded precisely when the document is entered into the database, or specialized form recognition software can be used to extract only the zones that contain the fields of interest.
The less structured a document is, the more helpful OCR is going to be. OCR gives scanned documents the same full-text search that is available in fully electronic files like Word and Excel. To search across an entire corporate database, including scanned documents, the scanned files must be OCRed. The same applies to making all incoming corporate material searchable, including faxes, digital mailroom items, and MFP-captured documents.
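The difference between a field-coded search and a full-text search can be sketched in a few lines. This is only an illustration: the ScannedFile class, the field names, and the two sample files are made up for the example, and a real system would use a proper search index rather than a substring scan.

```python
from dataclasses import dataclass


@dataclass
class ScannedFile:
    name: str
    coded_fields: dict   # values keyed in by hand at intake (hypothetical fields)
    ocr_text: str = ""   # page text produced by an OCR pass, if one was run


def search_fields(files, **criteria):
    """Return names of files whose coded fields match every criterion exactly."""
    return [f.name for f in files
            if all(f.coded_fields.get(k) == v for k, v in criteria.items())]


def search_text(files, phrase):
    """Return names of files whose OCR text contains the phrase (case-insensitive)."""
    needle = phrase.lower()
    return [f.name for f in files if needle in f.ocr_text.lower()]


def sample_files():
    """Two made-up mortgage files, for illustration only."""
    return [
        ScannedFile("loan-001", {"state": "NY", "amount": 250000},
                    "Rider: borrower agrees to a balloon payment due in year seven."),
        ScannedFile("loan-002", {"state": "NJ", "amount": 310000},
                    "Standard fixed rate note with no prepayment penalty."),
    ]


if __name__ == "__main__":
    files = sample_files()
    print(search_fields(files, state="NY"))        # answered by field coding
    print(search_text(files, "balloon payment"))   # only answerable with OCR text
```

A question about a coded field such as state is answered without OCR, while a question about a clause that nobody thought to code can only be answered from the OCR text.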
OCR products differ considerably in both speed and accuracy. For black and white documents, expect processing rates anywhere from 2 pages per second to 30 seconds per page, with an average of about 3 seconds per page at standard accuracy on a machine with a 3 GHz processor. Color OCR tends to run much slower: expect between 3 and 90 seconds per page, depending on page complexity and on the software and settings used, with an average of about 10 seconds per page.
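To put those averages in terms of the mortgage files in the question, here is a rough back-of-the-envelope estimate. The 5% color share is an assumption chosen only for illustration, and the function name is hypothetical; actual rates depend heavily on the software, settings, and hardware.

```python
# Rough OCR time estimate using the average per-page rates quoted above.
BW_SECONDS_PER_PAGE = 3.0      # average black-and-white rate from the text
COLOR_SECONDS_PER_PAGE = 10.0  # average color rate from the text


def estimate_ocr_seconds(pages: int, color_fraction: float = 0.0) -> float:
    """Return the estimated single-processor OCR time in seconds for one file."""
    if pages < 0 or not 0.0 <= color_fraction <= 1.0:
        raise ValueError("pages must be >= 0 and color_fraction within [0, 1]")
    color_pages = pages * color_fraction
    bw_pages = pages - color_pages
    return bw_pages * BW_SECONDS_PER_PAGE + color_pages * COLOR_SECONDS_PER_PAGE


if __name__ == "__main__":
    # 50 and 400 pages are the file sizes mentioned in the question;
    # the 5% color share is an illustrative assumption.
    for pages in (50, 400):
        seconds = estimate_ocr_seconds(pages, color_fraction=0.05)
        print(f"{pages} pages: about {seconds / 60:.1f} minutes")
```

At these average rates a 50-page file takes a few minutes and a 400-page file a little over 20 minutes on a single processor, so the total cost for a collection scales with the number of files and how many machines run in parallel.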
OCR accuracy can be improved in several ways. For color files, avoid heavy compression, particularly JPEG-based compression, before the OCR process. For black and white files, scan at a higher resolution (e.g., 300 DPI) if at all possible. Thresholding problems may still arise when the background is textured. In that case, high-resolution scanning coupled with appropriately tuned Gaussian smoothing prior to OCR is recommended. There are many other causes of poor OCR accuracy, and these typically need to be considered on a case-by-case basis.
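The effect of smoothing before thresholding can be shown with a small pure-Python sketch. Everything here is an assumption made for illustration: the synthetic 12x12 page, the sigma of 1.0, and the fixed threshold of 128. A production pipeline would use an image-processing library and tune these values per document type.

```python
import math


def gaussian_kernel(sigma: float) -> list:
    """Build a normalized 1-D Gaussian kernel covering about +/- 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve each row with the kernel, clamping indices at the edges."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        last = len(row) - 1
        out.append([
            sum(k * row[min(max(x + j - radius, 0), last)]
                for j, k in enumerate(kernel))
            for x in range(len(row))
        ])
    return out


def transpose(image):
    return [list(col) for col in zip(*image)]


def gaussian_smooth(image, sigma=1.0):
    """Separable Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    horizontal = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))


def binarize(image, threshold=128):
    """Map pixels to 0 (ink) or 255 (paper) with a fixed global threshold."""
    return [[0 if p < threshold else 255 for p in row] for row in image]


def make_textured_page(size=12):
    """Synthetic grayscale page: checkered background with a dark vertical stroke."""
    return [[20 if 5 <= x <= 7 else (100 if (x + y) % 2 == 0 else 240)
             for x in range(size)] for y in range(size)]


if __name__ == "__main__":
    page = make_textured_page()
    for label, img in (("raw", binarize(page)),
                       ("smoothed", binarize(gaussian_smooth(page, sigma=1.0)))):
        print(label)
        for row in img:
            print("".join("#" if p == 0 else "." for p in row))
```

Thresholding the raw page turns the darker half of the background texture into speckle, while thresholding the smoothed page leaves a clean background around the stroke. The trade-off is that smoothing also thickens the stroke slightly, which is why the amount of smoothing has to be matched to the scan resolution.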
To see other Blog entries discussing optical character recognition, the advantages of OCR, and how and when to use OCR, click on this link: http://www.cvisiontech.com/wordpress/?cat=5