CVISION Technologies

Document Imaging, Information, and Tech Support

Archive for June, 2008

Scanned Document Compression

June 24th, 2008 by Chris

Question: Our office recently began scanning the paper documents that come in. We have noticed that the scanned files are large, which makes them inefficient to work with. I have been reading your website regarding your solution for scanned document compression. What else can you tell me about your PDF Compressor product?

Answer: Thanks for the inquiry; CVISION is considered the world leader in scanned document compression. If you are interested in learning more about our compression product for your scanned documents, I would suggest you try our free 30-day trial of PdfCompressor. This way you can test the compression results on your own scanned documents. In addition to compression, PdfCompressor also makes scanned documents text-searchable with OCR.

Here is the link to download our PDF compression product:

www.cvisiontech.com/download_main.html

Category: Uncategorized

Batch scanning software

June 22nd, 2008 by admin

Question: Do you offer batch scanning software? We are currently scanning a large volume of documents.

Answer: I would need clarification on what you are looking for. We offer software to optimize scanned documents after they leave the scanner. We can batch OCR the documents to make them text-searchable, we can batch convert the documents into PDF files, and we can batch compress the file sizes; we offer PDF compression of scanned documents with results of up to 100:1 compression. So we certainly do offer batch software, but for specific tasks on documents that have already been scanned.

If you are interested in downloading PdfCompressor, which offers batch features, click the link below:

http://www.cvisiontech.com/index.php?option=com_docman&task=cat_view&gid=45&&Itemid=206

Category: Batch PDF OCR, Compress File, Convert PDF, OCR Software

Convert PDF

June 6th, 2008 by Chris

Question: From an IT perspective, what are the pros and cons of converting all our documents of record into PDF format?

Answer: The process of maintaining files over time, otherwise known as archiving, is complex. There are many factors that argue for having a uniform format for long-term document storage, and yet also some factors that militate against it. Nevertheless, converting to PDF is likely a worthwhile initiative.

Among the reasons to convert all documents of record to one format is that it's easier for the IT group to maintain. In the long term, there is only one format to maintain with respect to both the viewer and the operating system. It is easy to convert both electronic and image formats into PDF, since the PDF specs have direct support for both of these format types. It is also easy to support metadata, web optimization, headers/footers, and security (view/print).

So these are among the advantages of standardizing on PDF and converting all database files to that format. The main reason not to standardize/convert is that any document conversion carries some risk of modifying the original source document. Although the risk is very small per document, taken over millions of documents the risk is non-negligible. Of course, it is also very hard to modify a source document once it is no longer in its native format, but this is probably a plus with respect to long-term archiving of documents of record that should no longer be changing.

PDF/A is a recently introduced version of Adobe PDF that is specifically designed for long-term archiving. JavaScript, non-embedded fonts, and encryption are all disallowed within the PDF/A specifications. Widespread adoption of PDF/A within industry appears likely.

If you have any further questions concerning converting to PDF, please email support@cvisiontech.com

If you would like to download a free trial of our software, click the link below:

http://www.cvisiontech.com/download_main.html

Category: All, Convert PDF, PDF Conversion

OCR Text Dictionary

June 6th, 2008 by Chris

Question: How does the TXT file have to be formed within PdfCompressor? Does it need the dictionary words separated on a single line?

Answer: A text dictionary doesn’t require any special format. It can be a plain text file. Dictionary words don’t each need to be on their own line. They can be separated by spaces, and each dictionary word shouldn’t have more than 64 letters.
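
As a rough illustration of the rules above, the following Python sketch splits a plain-text word list on any whitespace and separates out entries longer than 64 characters. The function name and the sample words are hypothetical, and treating "letters" as characters is an assumption; this is not part of PdfCompressor, only one way to sanity-check a dictionary file before using it.

```python
MAX_WORD_LENGTH = 64  # limit stated in the answer above


def check_dictionary_text(text):
    """Split dictionary text on whitespace and separate valid from too-long words."""
    words = text.split()  # spaces, tabs, and newlines all count as separators
    valid = [w for w in words if len(w) <= MAX_WORD_LENGTH]
    too_long = [w for w in words if len(w) > MAX_WORD_LENGTH]
    return valid, too_long


if __name__ == "__main__":
    # Hypothetical sample: three words on one line, one on the next, one too long.
    sample = "escrow amortization lienholder\nmortgagee " + "x" * 70
    valid, too_long = check_dictionary_text(sample)
    print("usable words:", valid)
    print("rejected (over 64 characters):", len(too_long))
```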

Category: Batch PDF OCR, OCR Text Dictionary

OCR & JBIG2

June 5th, 2008 by Chris

There is a clear connection between OCR and the new ITU bitonal JBIG2 standard. In particular, an important aspect of JBIG2 is font learning. The previous CCITT G4 TIFF image specifications had no notion of fonts or font learning, whereas font learning is a very important part of the JBIG2 compression specs and one of the main reasons that JBIG2 compression rates are as high as 10:1 with respect to TIFF G4 compression.

Of course, font learning is important for OCR performance as well. When a font is “learned” it imposes constraints on all the connected components that map to that font character. One aspect of JBIG2 is font models, another is global models, and a third is composite models. Each of these is useful not only for compression purposes, but also for effective OCR rates. Font models, assuming a perfect font matcher, impose intra-page node constraints, but do not impose any constraints between nodes on different pages. Global models impose inter-page constraints on nodes linked to the same global font model. Composites impose n-gram constraints between groups of n consecutive nodes.

Most OCR engines deal with recognition a page at a time. Thus, there is no constraint satisfaction across different pages of the same document. JBIG2 compression can allow a system to see multiple inter-page constraints, all at the same time. Through the use of model-based propagation, the OCR process can be sped up considerably in this way.
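
A toy Python sketch of this idea follows. It is not the JBIG2 codec or CVISION's implementation: glyph bitmaps are stood in for by short strings, "font matching" is plain equality, and the `recognize` argument is a hypothetical placeholder for an expensive OCR call. It only shows how a table of global models lets one recognition result propagate to every matching glyph on every page.

```python
def build_global_models(pages):
    """Map each distinct glyph shape to every (page, position) where it occurs."""
    models = {}
    for page_index, glyphs in enumerate(pages):
        for position, shape in enumerate(glyphs):
            models.setdefault(shape, []).append((page_index, position))
    return models


def ocr_with_propagation(pages, recognize):
    """Recognize each global model once, then copy the label to all its instances."""
    models = build_global_models(pages)
    text = [[None] * len(glyphs) for glyphs in pages]
    calls = 0
    for shape, instances in models.items():
        label = recognize(shape)  # one call per model, not one per glyph
        calls += 1
        for page_index, position in instances:
            text[page_index][position] = label
    return ["".join(labels) for labels in text], calls


if __name__ == "__main__":
    # Hypothetical glyph shapes: each string stands for a bitmap.
    pages = [["#a", "#b", "#a"], ["#b", "#a", "#c"]]
    fake_ocr = lambda shape: shape[1]  # placeholder recognizer
    result, calls = ocr_with_propagation(pages, fake_ocr)
    print(result, "recognizer calls:", calls)
```

In this example six glyphs across two pages need only three recognizer calls, because glyphs that share a model share a label.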

If you are interested in learning more about PdfCompressor with OCR and testing our free 30-day trial, click the link below:
http://www.cvisiontech.com/pdf_compressor_31.html

Category: All, JBIG2 Compression, OCR

OCR Verification and Confidence

June 4th, 2008 by Chris

Question: Can I OCR my files and guarantee that each document has been OCRed correctly with a given confidence, e.g., 99.5% ?

Answer: Yes, sort of. OCR verification is really a semi-automated process. What can be expected from the OCR system is to i. correctly determine the reliability of each OCR ASCII assignment, and ii. flag for human intervention all words in the document below the pre-assigned confidence level.

Getting an accurate OCR confidence measure is non-trivial. Most OCR packages return a confidence assignment to each word, but that measure is often unreliable. So it is important to run your files on a system with a somewhat reliable confidence measure. These measures often consider attributes that include “is this word returned by the OCR engine in the language dictionary?” and “does this word have a reasonable intra-document frequency?”. There are many other indicators that can be useful in obtaining an accurate confidence measure for each word.
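
The two indicators quoted above can be combined with the engine's own score in many ways. The Python sketch below is one illustrative heuristic, not the measure used by any particular OCR product; the bonus weights (0.15 and 0.10), the repeat count, and the function name are all assumptions chosen only for the example.

```python
from collections import Counter


def adjusted_confidence(words, engine_scores, dictionary, min_repeats=2):
    """Nudge each engine score up for dictionary words and for words that repeat."""
    counts = Counter(w.lower() for w in words)
    adjusted = []
    for word, score in zip(words, engine_scores):
        key = word.lower()
        if key in dictionary:
            score += 0.15  # assumed bonus: word is in the language dictionary
        if counts[key] >= min_repeats:
            score += 0.10  # assumed bonus: word recurs within this document
        adjusted.append(min(score, 1.0))  # keep the result in the 0..1 range
    return adjusted


if __name__ == "__main__":
    # Hypothetical OCR output; "Ioan" is a misread of "loan".
    words = ["loan", "Ioan", "loan", "escrow"]
    scores = [0.80, 0.80, 0.70, 0.60]
    print(adjusted_confidence(words, scores, {"loan", "escrow"}))
```

Here the misread word receives neither bonus, so it ends up ranked below the correctly read words even though the engine scored it the same as the first "loan".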

At some point, any such OCR verification system needs to be semi-automated, with a human in the loop. If a document requires a recognition rate of 99.5%, this recognition rate is with respect to human recognition, not machine recognition. For example, if a paragraph in the document is completely unreadable to any human (e.g., a third-generation scan using very small fonts), then the words in this paragraph should not be counted as unrecognized. This text is beyond readability, and in an information-theoretic sense the information is already lost, through no fault of the OCR system. On the other hand, if a paragraph is in a small font and very difficult to read but still clearly human readable, and the OCR engine does not pick it up, then these words must be counted as unrecognized.

To guarantee a certain minimum OCR accuracy, then, all document pages below the minimum OCR recognition threshold must be shown to a human to determine whether the low-confidence words on them are human readable. If so, then the human can manually correct any words with incorrect text assignments. The task of any OCR verification system is to semi-automate the recognition process such that, with minimal human intervention, a certain minimal OCR confidence level can be established for a collection of documents.
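
The workflow described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: per-word confidences are already available as numbers between 0 and 1, the 0.995 cutoff simply reuses the 99.5% figure from the question as a per-word threshold, and the data layout and function name are hypothetical rather than taken from any OCR product.

```python
def flag_for_review(pages, threshold=0.995):
    """Return {page_number: [words below threshold]} for pages needing a human."""
    review_queue = {}
    for page_number, words in enumerate(pages, start=1):
        doubtful = [word for word, conf in words if conf < threshold]
        if doubtful:  # pages with no doubtful words skip human review
            review_queue[page_number] = doubtful
    return review_queue


if __name__ == "__main__":
    # Hypothetical (word, confidence) pairs for a two-page document.
    pages = [
        [("Borrower", 0.999), ("agrees", 0.998)],
        [("pr1ncipal", 0.62), ("balance", 0.999)],
    ]
    for page_number, doubtful in flag_for_review(pages).items():
        print(f"page {page_number}: show to reviewer -> {doubtful}")
```

Only the flagged pages go to the reviewer, who then decides whether each doubtful word is human readable and corrects it if so.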

Category: All, OCR, OCR Accuracy, OCR Verification and Confidence

OCR Scanner

June 3rd, 2008 by Chris

Question: Does your software OCR scanned documents?

Answer: Our software optimizes scanned documents. When your PDFs are the best they can be, everyone who uses your files saves time and you save money. For scanned documents, this means advanced image pre-processing, the best available OCR, and the latest image compression. CVista PdfCompressor 3.1 delivers these things, and much more via an intuitive Windows application and comprehensive command-line interface. PdfCompressor can make your documents fully searchable with CVISION’s super-accurate OCR. Intelligent pre-processing makes our OCR far more accurate and 2-3 times faster than Adobe Capture or Acrobat. Plus, PdfCompressor can OCR in over 60 languages.

To download a free 30-day trial version of CVISION’s PdfCompressor software with OCR, go to http://www.cvisiontech.com/download_main.html

Category: All, OCR Accuracy

OCR in Robust File Conversion

June 3rd, 2008 by Chris

Question: I want to convert company legacy files from both image and electronic formats to PDF. Is the process different for image formats, such as TIFF vs. electronic formats, such as Word? Do I need to run an OCR process on the converted electronic files or only on the image files?

Answer: Converting image files to a standardized format, such as PDF, is usually done in just one way. It involves converting from one image format to another, which is pretty standard, and then optionally adding OCR for text search and document-related meta-data. Converting electronic files to a canonical format, such as PDF, can usually be done in one of two ways: i. electronic conversion, or ii. image conversion.

Image conversion of electronic documents, say Word to TIFF or G4-based PDF, for long-term archiving is generally considered a safe practice, almost like microfiching a dataset. Very little can go wrong during the conversion process. On the other hand, the image files are apt to be larger than the electronic files, usually by a factor of 10x-20x or more. These files need to be OCRed before being attached to the database if we want them to be text searchable. In addition, state-of-the-art compression methods can reduce their image size back down to the original source document’s electronic size (see http://www.cvisiontech.com/pdf_compressor_31.html).

Electronic file conversion is simple in that it generates an output file that is already searchable and in electronic form. So, for example, starting with a Word file and using some standard conversion method (e.g., Distiller) should, hopefully, produce an equivalent PDF file that is electronic and already searchable. One thing to watch out for is that important graphic, tabular, or mathematical information is sometimes lost or modified during electronic conversion. A verification process (of the conversion process) is generally recommended.

Category: All, Batch PDF OCR, Convert PDF, OCR, PDF Conversion, Tiff

OCR

June 2nd, 2008 by Chris

Question: Do we need to OCR our Company documents? We already field code them on 21 fields, like author, date, amount, State, etc. These are mortgage files that range in size from 50 to 400 pages. Most pages are scanned to black and white, with a few being scanned to color. What are the benefits of using OCR? Is it worth utilizing OCR for this sort of project?

Answer: There is clearly some overlap between field coding and OCR (optical character recognition). When a document is scanned, or entered into the system, key information about the document also needs to be entered into the system. This would often include author, date, last modified, etc. It is often hard to predict in advance all the fields of a document that would be relevant to a search later.

If the documents are reasonably unstructured, as in legal discovery, then OCR is usually applied to all the files considered even tangentially relevant as any one of these might be the needle in the haystack that we’re looking for. In very structured forms, general OCR (optical character recognition) might not be required since the fields can be precisely coded when the document is entered into the database or specialized form recognition software can be used to zone extract only the fields of interest.

The less structured a document is, the more OCR is going to be helpful. OCR gives scanned documents the same full text search available in fully electronic files like Word and Excel. To search across an entire corporate database, including scanned documents, OCRing scanned files is necessary. To scan all incoming corporate matter, including faxes, digital mailroom, and MFP-captured documents, OCR is required.

There are very considerable differences in both OCR speed and accuracy. For black and white documents, expect processing rates that vary from 2 pages per second to 30 seconds per page, with an average rate of about 3 seconds per page using standard accuracy on a 3 GHz processor machine. Color OCR tends to run much slower. Expect processing rates between 3 seconds per page and 90 seconds per page, depending on page complexity and the software and settings used, with an average processing time of about 10 seconds per page.

OCR recognition rates can be enhanced through various methods. Typically, for color files, avoid any serious compression, particularly JPEG-based, prior to the OCR process. For black and white files, scan at a higher DPI (e.g., 300) if at all possible. There may still be thresholding problems introduced when the background is textured. In this case, high-resolution scanning coupled with the right Gaussian smoothing prior to OCR is recommended. There are many other causes for poor OCR recognition rates, and these typically need to be considered on a case-by-case basis.
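
To put the average processing rates quoted above in perspective, here is a back-of-the-envelope Python estimate for a single mortgage file. The 400-page size comes from the question; the split of 390 black and white pages and 10 color pages is an assumption made only for the example.

```python
BW_SECONDS_PER_PAGE = 3      # average black and white rate quoted above
COLOR_SECONDS_PER_PAGE = 10  # average color rate quoted above


def estimate_ocr_seconds(bw_pages, color_pages):
    """Estimate total OCR time from the average per-page rates."""
    return bw_pages * BW_SECONDS_PER_PAGE + color_pages * COLOR_SECONDS_PER_PAGE


if __name__ == "__main__":
    # Assumed page mix for one large file: mostly black and white, a few color.
    seconds = estimate_ocr_seconds(bw_pages=390, color_pages=10)
    print(f"about {seconds / 60:.1f} minutes for one 400-page file")
```

Actual times can differ widely from this average, since the per-page rates above span more than an order of magnitude.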

To see other Blog entries discussing optical character recognition, the advantages of OCR, and how and when to use OCR, click on this link: http://www.cvisiontech.com/wordpress/?cat=5

Category: All, Batch PDF OCR, OCR Download, OCR PDF

Document Capture & OCR

June 2nd, 2008 by Chris

Document capture, or scanning documents, is the first step in the OCR process. Common capture devices include scanners, digital copiers, MFPs, and fax machines. Technically, the capture process is usually a conversion of photonic flux to electronic flux.

The method in which a document is captured affects the subsequent usefulness of the document. Consider a faxed document. Although usually human readable, these documents are often not very machine readable. This is usually directly related to the fax capture process. Because fax machines typically communicate over phone lines, fax scanning resolutions are set low to keep the transmitted file size as small as possible. So, for example, normal fax mode is 203×98 dpi, which means that the vertical sampling rate is less than 100 dpi. This low scan resolution results in a smaller CCITT file to encode and transmit. The fax-scanned file also transfers faster and may still be human readable on the receiving fax end. However, since this file was captured under less than ideal scanning conditions, at very low resolution, there is a high probability that machine text readability, i.e., the OCR recognition rate, will not be very high.
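
A quick calculation shows why such a low vertical resolution hurts OCR. The Python sketch below converts a font size in points to a pixel height at a given resolution, using the standard definition of 72 points per inch; the 10-point font size is an assumption chosen for illustration.

```python
POINTS_PER_INCH = 72  # standard typographic point


def glyph_height_pixels(font_points, dpi):
    """Approximate pixel height of a font's em box at the given resolution."""
    return font_points / POINTS_PER_INCH * dpi


if __name__ == "__main__":
    # Compare the vertical resolution of normal fax mode with a 300 dpi scan.
    for label, dpi in [("normal fax, vertical", 98), ("300 dpi scan", 300)]:
        print(f"{label}: {glyph_height_pixels(10, dpi):.1f} pixels tall")
```

At 98 dpi a 10-point em box is under 14 pixels tall, roughly a third of what a 300 dpi scan provides, which leaves the OCR engine far less detail per character.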

Category: All, OCR, OCR PDF, OCR Software, OCR with Application to the Digital Mailroom