Question: I work in a hospital. We are planning to scan very old files into our computer. What we want to do is to get specific data from certain parts of the files, so that we can put this in our database. Is this possible?
Answer: Yes, this is possible. PdfCompressor has a feature called zone OCR. Once you define the zones where OCR should run, the recognized data is written to a Rich Text Format (RTF) file, which you can then import into your database. To get the best zone OCR results from very old files, follow these steps:
1. Verify that the recorded resolution (dpi) is correct. OCR engines are calibrated based on the dpi value that is typically stored in the image file header. If this value is incorrect, the OCR results will degrade.
2. Once the dpi is set correctly, upsample to a reasonably high resolution. Typically, 300 dpi is a good target. The upsampling method does matter: use bicubic spline interpolation.
3. OCR engines usually perform better on bitonal (black-and-white) documents that are thresholded correctly than on the original color files. If the threshold is poorly chosen, however, the OCR engine is better off with the original color or grayscale image. So, if possible, threshold each upsampled image manually so that the text is as readable as possible.
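The arithmetic behind the three steps above can be sketched in a few lines of Python. This is only an illustration and not part of PdfCompressor: the function names, the letter-size page assumption (8.5 x 11 inches), and the 15% tolerance are all hypothetical choices. Step 3 recommends picking the threshold by hand; the sketch uses Otsu's method instead, purely to suggest a starting value that you would then adjust by eye.

```python
from typing import Sequence, Tuple


def dpi_looks_plausible(width_px: int, height_px: int, recorded_dpi: float,
                        page_inches: Tuple[float, float] = (8.5, 11.0),
                        tolerance: float = 0.15) -> bool:
    """Step 1: does the recorded dpi imply a believable physical page size?

    Assumes the originals are roughly letter-size; change page_inches otherwise.
    """
    if recorded_dpi <= 0:
        return False
    # Compare short side with short side and long side with long side,
    # so portrait and landscape scans are treated alike.
    short_in, long_in = sorted((width_px / recorded_dpi, height_px / recorded_dpi))
    exp_short, exp_long = sorted(page_inches)
    return (abs(short_in - exp_short) <= tolerance * exp_short
            and abs(long_in - exp_long) <= tolerance * exp_long)


def upsampled_size(width_px: int, height_px: int, current_dpi: float,
                   target_dpi: float = 300.0) -> Tuple[int, int]:
    """Step 2: pixel dimensions after resampling to target_dpi (never shrinks)."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    scale = max(1.0, target_dpi / current_dpi)
    return round(width_px * scale), round(height_px * scale)


def otsu_threshold(histogram: Sequence[int]) -> int:
    """Step 3 (starting point only): Otsu's threshold for a grayscale histogram.

    Pixels with a value <= the returned level are treated as dark (text).
    """
    total = sum(histogram)
    if total == 0:
        raise ValueError("histogram is empty")
    weighted_total = sum(level * count for level, count in enumerate(histogram))
    best_level, best_variance = 0, -1.0
    dark_count = 0
    dark_sum = 0
    for level, count in enumerate(histogram):
        dark_count += count
        dark_sum += level * count
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue  # one class is empty, so there is no split to score
        dark_mean = dark_sum / dark_count
        light_mean = (weighted_total - dark_sum) / light_count
        # Between-class variance, up to a constant factor of total**2.
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


if __name__ == "__main__":
    # A letter-size page scanned at 150 dpi is 1275 x 1650 pixels.
    print(dpi_looks_plausible(1275, 1650, 150))  # header value is consistent
    print(dpi_looks_plausible(1275, 1650, 72))   # header value is suspicious
    print(upsampled_size(1275, 1650, 150))       # size to request at 300 dpi
```

The sketch only computes the target size and a candidate threshold; the resampling itself would be done by an imaging tool. With the Pillow library, for instance, the computed size could be passed to `Image.resize` together with its bicubic filter, and the threshold applied to the grayscale result before zone OCR is run.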
You can try out the PdfCompressor’s zone OCR below: