Question: How do I automate the OCR process? I want our files OCRed, but in the background, not in real-time.
Answer: One problem with creating searchable, captured corporate documents is that the "capture" step and the "make searchable" (OCR) step run at quite different speeds. Capturing a single page image generally takes about 1-3 secs on an average device, and a high-speed device may reach 2 pgs./sec. OCR processing time ranges from 1-2 secs/page in fast mode to 30 secs/page (or more) for complex documents run with OCR in accurate mode. So the OCR process can run 10x slower than the scanning process, or worse. This suggests that in many cases these 2 processes should be decoupled: one running in real time (image capture), and the second (OCR) running offline.
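To make the gap concrete, here is a small back-of-the-envelope sketch in Python. The batch size of 1000 pages is an assumption chosen for illustration; the per-page times are the slow ends of the ranges quoted above (3 secs to scan, 30 secs to OCR in accurate mode).

```python
def batch_times(pages: int, scan_secs_per_page: float, ocr_secs_per_page: float):
    """Return (total scan seconds, total OCR seconds, OCR/scan ratio) for one batch."""
    scan_total = pages * scan_secs_per_page
    ocr_total = pages * ocr_secs_per_page
    return scan_total, ocr_total, ocr_total / scan_total


# Hypothetical batch: 1000 pages, 3 s/page to scan, 30 s/page to OCR.
scan_s, ocr_s, ratio = batch_times(1000, 3.0, 30.0)
print(f"scan: {scan_s / 60:.0f} min, OCR: {ocr_s / 3600:.1f} h, ratio: {ratio:.0f}x")
# prints "scan: 50 min, OCR: 8.3 h, ratio: 10x"
```

Under these assumptions, a scanner that waited for OCR on every page would spend most of a working day on a batch it could otherwise capture in under an hour.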
What is meant here by offline? Offline means that OCR runs at its own, slower rate and is not tied directly into the scanning workflow. This way the scanner, MFP, or other capture device can process documents at its maximum speed without waiting for the OCR to finish. One easy way to implement this decoupling is with a watched folder; see, for example, http://www.cvisiontech.com/pdf_compressor_31.html.
Using a watched folder, the capture device puts all captured files into a folder that the OCR process watches. The OCR process OCRs each document dropped into that folder and puts the processed (i.e., searchable) file into a watched outfolder. Files in the watched outfolder can be assumed to be searchable, and can either be processed further (e.g., Bates stamping) or indexed and inserted directly into a database system.
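The watched-folder idea can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not how any particular product implements it: the function names, the `.part` suffix, and the "settle" delay are all assumptions, and `ocr` is a placeholder for whatever OCR engine or command-line tool you actually run.

```python
import time
from pathlib import Path
from typing import Callable, List


def process_pending(in_dir: Path, out_dir: Path,
                    ocr: Callable[[Path, Path], None],
                    settle_seconds: float = 2.0) -> List[Path]:
    """OCR each settled file in in_dir once; return the paths written to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    done = []
    now = time.time()
    for src in sorted(in_dir.iterdir()):
        if not src.is_file():
            continue
        # Skip files modified very recently: the scanner may still be writing them.
        if now - src.stat().st_mtime < settle_seconds:
            continue
        # Write under a temporary name first so the outfolder never exposes
        # a half-written file under its final name.
        tmp = out_dir / (src.name + ".part")
        ocr(src, tmp)              # hypothetical OCR callable: reads src, writes tmp
        final = out_dir / src.name
        tmp.replace(final)         # rename within the same folder
        src.unlink()               # remove the input so it is not processed twice
        done.append(final)
    return done


def watch(in_dir: Path, out_dir: Path,
          ocr: Callable[[Path, Path], None],
          poll_seconds: float = 5.0) -> None:
    """Poll in_dir forever, OCRing whatever the capture device drops into it."""
    while True:
        process_pending(in_dir, out_dir, ocr)
        time.sleep(poll_seconds)
```

To use it, you would call something like `watch(Path("scans/in"), Path("scans/out"), my_ocr)` from a background service, where `my_ocr` wraps your OCR engine. Polling is the simplest approach and works on any filesystem, including network shares; whatever watches the outfolder should ignore names ending in `.part`.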