CVISION Technologies

Document Imaging, Information, and Tech Support

Archive for the 'All' Category

OCR Scanner

June 3rd, 2008 by Chris

Question: Does your software OCR scanned documents?

Answer: Our software optimizes scanned documents. When your PDFs are the best they can be, everyone who uses your files saves time and you save money. For scanned documents, this means advanced image pre-processing, the best available OCR, and the latest image compression. CVista PdfCompressor 3.1 delivers these things, and much more, via an intuitive Windows application and a comprehensive command-line interface. PdfCompressor can make your documents fully searchable with CVISION’s super-accurate OCR. Intelligent pre-processing makes our OCR far more accurate and 2-3 times faster than Adobe Capture or Acrobat. Plus, PdfCompressor can OCR in over 60 languages.

To download a free 30-day trial version of CVISION’s PdfCompressor software with OCR, go to http://www.cvisiontech.com/download_main.html

Category: All, OCR Accuracy | No Comments »

OCR in Robust File Conversion

June 3rd, 2008 by Chris

Question: I want to convert company legacy files from both image and electronic formats to PDF. Is the process different for image formats, such as TIFF vs. electronic formats, such as Word? Do I need to run an OCR process on the converted electronic files or only on the image files?

Answer: Converting image files to a standardized format, such as PDF, is usually done in just one way. It involves converting from one image format to another, which is pretty standard, and then optionally adding OCR for text search and document-related meta-data. Converting electronic files to a canonical format, such as PDF, can usually be done in one of two ways: i. electronic conversion, or ii. image conversion.

Image conversion of electronic documents, say Word to TIFF or G4-based PDF, for long-term archiving is generally considered a safe practice, almost like microfiching a dataset. Very little can go wrong during the conversion process. On the other hand, the image files are apt to be larger than the electronic files, usually by a factor of 10x-20x or more. These files need to be OCRed before being attached to the database if they are to be text searchable. In addition, state-of-the-art compression methods can reduce their image size back down to the original source document’s electronic size (see http://www.cvisiontech.com/pdf_compressor_31.html).

Electronic file conversion is simple in that it generates an output file that is already searchable and in electronic form. So, for example, converting a Word file with a standard conversion method (e.g., Distiller) should produce an equivalent PDF file that is electronic and already searchable. One thing to watch out for is that important graphic, tabular, or mathematical information is sometimes lost or modified during electronic conversion. Verifying the converted output against the source is generally recommended.

Category: All, Batch PDF OCR, Convert PDF, OCR, PDF Conversion, Tiff | No Comments »

OCR

June 2nd, 2008 by Chris

Question: Do we need to OCR our Company documents? We already field code them on 21 fields, like author, date, amount, State, etc. These are mortgage files that range in size from 50 to 400 pages. Most pages are scanned to black and white, with a few being scanned to color. What are the benefits of using OCR? Is it worth utilizing OCR for this sort of project?

Answer: There is clearly some overlap between field coding and OCR (optical character recognition). When a document is scanned, or entered into the system, key information about the document also needs to be entered into the system. This would often include author, date, last modified, etc. It is often hard to predict in advance all the fields of a document that would be relevant to a search later.

If the documents are reasonably unstructured, as in legal discovery, then OCR is usually applied to all the files considered even tangentially relevant as any one of these might be the needle in the haystack that we’re looking for. In very structured forms, general OCR (optical character recognition) might not be required since the fields can be precisely coded when the document is entered into the database or specialized form recognition software can be used to zone extract only the fields of interest.

The less structured a document is, the more helpful OCR is going to be. OCR gives scanned documents the same full-text search available in fully electronic files like Word and Excel. To search across an entire corporate database, including scanned documents, OCRing scanned files is necessary. To make all incoming corporate matter searchable, including faxes, digital mailroom, and MFP-captured documents, OCR is required.

OCR speed and accuracy vary considerably between products and settings. For black and white documents, expect processing rates that vary from 2 pages per second to 30 seconds per page, with an average rate of about 3 seconds per page using standard accuracy on a 3 GHz processor machine. Color OCR tends to run much slower. Expect processing rates between 3 seconds per page and 90 seconds per page, depending on page complexity and the software and settings used, with an average processing time of about 10 seconds per page.

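
To get a feel for what these rates mean for a batch, here is a minimal sketch that turns the per-page figures quoted above into a time estimate for one file. The function name and the 380/20 page split are assumptions chosen for illustration, not measurements.

```python
# Per-page OCR times in seconds, taken from the ranges quoted above,
# ordered as (fastest, average, slowest).
BW_SECONDS_PER_PAGE = (0.5, 3.0, 30.0)      # 2 pages per second = 0.5 s per page
COLOR_SECONDS_PER_PAGE = (3.0, 10.0, 90.0)


def estimate_ocr_seconds(bw_pages, color_pages):
    """Return (best, typical, worst) total OCR time in seconds for one file."""
    return tuple(
        bw_pages * bw + color_pages * color
        for bw, color in zip(BW_SECONDS_PER_PAGE, COLOR_SECONDS_PER_PAGE)
    )


# Hypothetical 400-page file: 380 black and white pages, 20 color pages.
best, typical, worst = estimate_ocr_seconds(bw_pages=380, color_pages=20)
print(f"best {best / 60:.1f} min, typical {typical / 60:.1f} min, worst {worst / 60:.1f} min")
```

For this hypothetical file the estimate runs from roughly 4 minutes at best to about 22 minutes at the average rates, and well over 3 hours at the slowest rates, which is why the choice of settings matters for large batches.
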
OCR recognition rates can be enhanced through various methods. Typically, for color files avoid any serious compression, particularly JPEG-based, prior to the OCR process. For black and white files, scan at a higher DPI (e.g., 300) if at all possible. There may still be thresholding problems when the background is textured. In this case, high resolution scanning coupled with the right Gaussian smoothing prior to OCR is recommended. There are many other causes of poor OCR recognition rates, and these typically need to be considered on a case-by-case basis.
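
The effect of smoothing before thresholding can be shown with a small sketch. It uses plain Python lists as a grayscale image, and the sigma, radius, and cutoff values are assumptions picked for this toy page; a real workflow would tune them per dataset and would normally use an imaging library.

```python
import math


def gaussian_kernel(sigma, radius):
    """Build a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_rows(image, kernel):
    """Convolve each row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [
        [
            sum(k * row[min(max(x + i - radius, 0), width - 1)] for i, k in enumerate(kernel))
            for x in range(width)
        ]
        for row in image
    ]


def transpose(image):
    """Swap rows and columns."""
    return [list(column) for column in zip(*image)]


def gaussian_smooth(image, sigma=1.0, radius=2):
    """Separable Gaussian blur: smooth along rows, then along columns."""
    kernel = gaussian_kernel(sigma, radius)
    horizontal = smooth_rows(image, kernel)
    return transpose(smooth_rows(transpose(horizontal), kernel))


def threshold(image, cutoff=128):
    """Binarize: 1 = ink (dark pixel), 0 = background (light pixel)."""
    return [[1 if pixel < cutoff else 0 for pixel in row] for row in image]


# Toy page: light background (200), a 3-pixel-wide dark stroke (0),
# and one dark speckle (100) standing in for background texture.
page = [[200] * 9 for _ in range(7)]
for row in page:
    row[5] = row[6] = row[7] = 0
page[3][1] = 100

raw = threshold(page)                      # speckle is kept as ink
clean = threshold(gaussian_smooth(page))   # speckle is averaged away
print(raw[3])
print(clean[3])
```

On this toy page the isolated speckle survives direct thresholding but disappears once the image is smoothed first, while the stroke keeps its width. Too much smoothing would start to erode thin strokes, so the kernel size is a trade-off.
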

To see other Blog entries discussing optical character recognition, the advantages of OCR, and how and when to use OCR, click on this link: http://www.cvisiontech.com/wordpress/?cat=5

Category: All, Batch PDF OCR, OCR Download, OCR PDF | No Comments »

Document Capture & OCR

June 2nd, 2008 by Chris

Document capture, or scanning documents, is the first step in the OCR process. Common capture devices include scanners, digital copiers, MFPs, and fax machines. Technically, the capture process is usually a conversion of light (photonic flux) into an electronic signal.

The method by which a document is captured affects the subsequent usefulness of the document. Consider a faxed document. Although usually human readable, these documents are often not very machine readable. This is usually directly related to the fax capture process. Because fax machines typically communicate over phone lines, fax scanning resolutions are set low to keep the transmitted file as small as possible. So, for example, normal fax mode is 203×98 dpi, which means that the vertical sampling rate is less than 100 dpi. This low sampling rate yields a smaller CCITT file to encode and transmit. The fax-scanned file also transfers faster and is still human readable on the receiving end. However, since the file was captured at very low resolution, there is a high probability that machine readability, that is, the OCR recognition rate, will not be very high.
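
A little arithmetic shows how few pixels normal fax mode leaves for each character. This is a rough sketch: the function names are made up for illustration, and treating a character as a box whose height equals its point size is a simplifying assumption.

```python
def scan_pixels(width_in, height_in, dpi_x, dpi_y):
    """Return (columns, rows) of pixels for a page scanned at the given resolution."""
    return round(width_in * dpi_x), round(height_in * dpi_y)


def char_pixels(point_size, dpi_x, dpi_y):
    """Approximate (width, height) in pixels of a box one point size on each side."""
    inches = point_size / 72.0  # 72 points per inch
    return inches * dpi_x, inches * dpi_y


# Letter-size page, normal fax mode versus a 300 dpi scan.
print(scan_pixels(8.5, 11, 203, 98))
print(scan_pixels(8.5, 11, 300, 300))

# Vertical pixel rows available for 10-point text in each case.
print(f"{char_pixels(10, 203, 98)[1]:.1f} rows at 98 dpi")
print(f"{char_pixels(10, 300, 300)[1]:.1f} rows at 300 dpi")
```

Under these assumptions, 10-point text gets fewer than 14 pixel rows in normal fax mode against more than 41 at 300 dpi, so the OCR engine has about a third of the vertical detail to work with.
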

Category: All, OCR, OCR PDF, OCR Software, OCR with Application to the Digital Mailroom | No Comments »

PDF OCR for NARA compliance

June 1st, 2008 by Chris

Retaining documents is, in a sense, easier than it has ever been. Many documents are already in electronic form, and paper documents can easily be processed through a scanner or MFP and converted to electronic form. For serious-minded IT managers, however, this is only where the problem begins.

There are issues of retention that must be resolved. Will these documents open and be readable 5, 10, 30 years from now? While microfiche is fast becoming an outdated technology, it is also tried and true: little can or will go wrong when archiving a Company’s database to microfilm. The same cannot easily be said for electronic archiving.

Will the operating system 5 years from now, say Vista 2012, be able to read the PDFs being archived right now? Are these archived records readable across all machines in the Company, even overseas? Do the electronic files satisfy NARA archival requirements? Are we converting to PDF/A?

So although document archiving and records management issues are in some sense simplifying and streamlining, as the corporate fileroom becomes a thing of the past, many aspects of archiving and RM are becoming more complex. What constitutes a legal record? Is a facsimile acceptable?

NARA has put out a set of guidelines that can be helpful to IT departments trying to sort all this out:

http://www.archives.gov/records-mgmt/initiatives/pdf-records.html

http://www.archives.gov/records-mgmt/initiatives/scanned-textual.html

OCR is generally an important component with respect to scanned documents. When converting paper to electronic documents for archiving, it is often important to have full-text search capability for these files. It is important to understand, particularly if archiving to PDF, what the various formats for scanned documents are and which are acceptable from a NARA archival perspective.

In particular, many OCR products by default convert OCR’ed documents to an electronic form such as text or Word. Any such conversion to a non-image format is not acceptable for archiving (at least from a NARA perspective), since guesses are made in the OCR process and there is a certain error rate as scanned characters are misinterpreted by the OCR system.

Also not acceptable for record retention purposes is PDF normal, which is a hybrid of image and electronic PDF blended together. PDF normal also has problems with OCR mismatches and font substitutions. In addition, the combination of electronic and image characters can detract from document readability.

What is acceptable for compliance with NARA, PDF/A, and other archival specifications? The basic requirement for compliance is that captured paper documents remain in some image format, e.g., PDF image. They must look perceptually identical to the documents at the time of capture. To achieve this, a purely image PDF format needs to be selected (e.g., JPEG or JBIG2-based). This rules out conversion to electronic or normal PDF formats.

It is also OK, from a NARA perspective, to add a hidden text layer to an image PDF document. The hidden text layer does not change the appearance of the PDF in any way, but it does allow for full-text search and indexing of the source document.

Category: All, OCR, PDF OCR, PDF/A | No Comments »

PDF Conversion

May 31st, 2008 by Chris

Question: We’re really a “TIFF shop” right now, with all our processes and workflow geared up for TIFF processing. We are thinking of converting our files to PDF. Why convert to PDF right now? An entire PDF conversion seems costly, time-consuming, and risky.

Answer: All change in the industry happens for a reason, and PDF conversion is no different. There is a large migration currently in progress across many industries from TIFF to PDF. Assuming that all these IT directors are not just wasting time and money at their respective firms, there must be some compelling reasons to convert from TIFF to PDF.

The advantages of conversion to PDF include a hidden OCR text layer, meta-data insertion, web-optimization, compression, and portability. These are all reasonably important considerations from an IT director’s perspective.

Hidden text OCR means the OCR layer is embedded directly into the PDF document, not stored as a separate file. It is not visible on the page, but it can be searched. Meta-data insertion is very useful for keeping important document information such as creator, author, etc. directly embedded inside the document. Web-optimization allows rapid, constant-time access into the middle of a large document without streaming the entire file. It’s ideal for keeping large files on a web-based database server. Document compression is often the key to efficient file uploading and downloading in a distributed, web environment. PDF supports both bitonal and color compression, so transmission and storage requirements are reduced 5x-10x. The fact that annotations, OCR text, and meta-data are directly embedded into the PDF file makes portability between databases much more straightforward.

There is never an ideal time at the corporate level to migrate between file formats, but the advantages and ROI of migration to PDF are certainly evident. Please feel free to contact us with any other questions concerning PDF conversion.

Category: Adobe PDF Conversion, All, Convert PDF | No Comments »

Document Compression: Lossy vs. Lossless Compression

May 30th, 2008 by Chris

Question: What is the difference between lossy and lossless compression? How can I be sure that the converted documents look the same as the originals?

Answer: Lossless compression does not change any of the original pixel values as captured by the scanning or MFP device. Lossless color JPEG is easily 10x-100x larger than standard JPEG. In fact, no hospital we’re aware of stores its brain MRIs and CAT scans using a lossless format (such as lossless JPEG). For all real applications, certainly in the greyscale and color domains, some modification of original pixel values is accepted. The general rule is that compression and dpi reduction are allowed separately, or in conjunction, as long as the output image appears identical to the original. So this condition of “appears identical” seems to be key. Clearly, appearing identical is also a function of the device, application, practitioner, and other factors.

With perceptually lossless compression, pixel values are allowed to change provided the output image looks like the input image. With perceptually lossless compression, there should be no loss in readability. With effective perceptually lossless compression, recognition rates after compression should be identical to recognition rates before compression.
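
The distinction can be made concrete by comparing pixel values before and after compression. The sketch below is only a crude numeric proxy: the function names and the tolerance of 2 grey levels are assumptions for illustration, and a small numeric difference does not by itself prove that two images look identical to a reader.

```python
def max_pixel_change(original, compressed):
    """Largest absolute difference between corresponding greyscale pixel values."""
    same_shape = len(original) == len(compressed) and all(
        len(a) == len(b) for a, b in zip(original, compressed)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    return max(
        abs(a - b)
        for row_a, row_b in zip(original, compressed)
        for a, b in zip(row_a, row_b)
    )


def classify(original, compressed, tolerance=2):
    """Label a compressed image by how far its pixels moved from the original."""
    change = max_pixel_change(original, compressed)
    if change == 0:
        return "lossless"            # every pixel value is unchanged
    if change <= tolerance:
        return "within tolerance"    # candidate for perceptually lossless
    return "needs review"            # large changes, check readability by eye


scan = [[255, 250, 10], [252, 12, 8]]
exact_copy = [row[:] for row in scan]
slightly_changed = [[254, 251, 10], [252, 11, 9]]
heavily_changed = [[255, 200, 60], [252, 12, 8]]

print(classify(scan, exact_copy))
print(classify(scan, slightly_changed))
print(classify(scan, heavily_changed))
```

A check like this can flag files for closer inspection, but the readability and recognition-rate criteria described above remain the real test.
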

Checking the documents manually is still the best way to be sure there are no differences between the original files and the converted ones. CVISION ICert is an automated accuracy checking system to verify that each output file accurately corresponds to its input file.

Category: All, Document Compression, File Compression, JBIG2 Compression | No Comments »

Multipage Output

May 29th, 2008 by Chris

Question: We have been testing the OCR option ‘output Text’ as a Word RTF file, and PdfCompressor is outputting multiple ‘single page’ RTF files instead of a single document (as per the multipage PDF which is outputted). Is there a way we can output a multipage RTF file?

Answer: If the user has Office 97 or 2000 on the machine running PdfCompressor, the Word RTF files will be merged into a single document.

Category: All | No Comments »

OCR Accuracy and Full-Text Search

May 27th, 2008 by Chris

Question: What accuracy rate can I expect when OCR’ing my corporate documents - mostly accounting and revenue earnings statements? Is OCR accuracy affected by the scanning resolution? Do I gain any significant advantage in using the most accurate setting the OCR product supports?

Answer: OCR accuracy can never be guaranteed, and the phrase “garbage in - garbage out” definitely applies to OCR. If the source input documents are clean originals, the OCR engine will probably yield considerably more accurate results than if the source documents are 3rd generation scanned files with some image skew. The higher the scanning resolution, at least up to 300 dpi, the more accurate the typical OCR results will be.

On a perfectly clean, well-behaved scan there is often little if any accuracy difference between the “fastest” and “most accurate” OCR modes. This means that if the most accurate OCR mode runs 3x-6x slower than the fast mode, it is often not cost-justified, as the end-user will do virtually as well in fast mode with a much faster workflow processing rate. If the source files are messy, are not original files, or have considerable skew, many different fonts, texture noise, or other complicating factors, then running the most accurate OCR setting may indeed be necessary to obtain accurate OCR results. The best way to determine the optimal setting for a given OCR engine is to test it on a representative dataset, trying each available OCR setting and measuring both recognition accuracy and timing.
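
Such a test can be sketched as a small harness. Everything here is hypothetical: `engine` stands for whatever call runs your OCR product at a given setting, the `fake_engine` below merely simulates one, and the `difflib` similarity ratio is a rough stand-in for a proper character accuracy measure against hand-checked transcripts.

```python
import difflib
import time


def character_accuracy(recognized, ground_truth):
    """Similarity ratio in [0, 1] between OCR output and a hand-checked transcript."""
    matcher = difflib.SequenceMatcher(None, ground_truth, recognized, autojunk=False)
    return matcher.ratio()


def benchmark(engine, pages, settings):
    """Run engine(image, setting) over (image, transcript) pairs for each setting."""
    results = {}
    for setting in settings:
        start = time.perf_counter()
        outputs = [engine(image, setting) for image, _ in pages]
        elapsed = time.perf_counter() - start
        scores = [character_accuracy(out, truth) for out, (_, truth) in zip(outputs, pages)]
        results[setting] = {
            "mean_accuracy": sum(scores) / len(scores),
            "seconds_per_page": elapsed / len(pages),
        }
    return results


def fake_engine(image, setting):
    """Stand-in for a real OCR call; here the 'image' is simply its true text."""
    # Simulate a fast mode that misreads the digit 1 as the letter l.
    return image.replace("1", "l") if setting == "fast" else image


pages = [("Revenue 2008: 1,250", "Revenue 2008: 1,250"),
         ("Net earnings: 310", "Net earnings: 310")]
results = benchmark(fake_engine, pages, settings=["fast", "accurate"])
for setting, stats in results.items():
    print(setting, f"{stats['mean_accuracy']:.3f}", f"{stats['seconds_per_page']:.6f} s/page")
```

With a real engine plugged in, comparing the accuracy gain of the slower setting against its extra seconds per page shows whether it is worth the cost for that dataset.
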

Category: All, OCR, OCR Accuracy | No Comments »
