CVISION Technologies

Document Imaging, Information, and Tech Support

Archive for May, 2008

PDF Conversion

May 31st, 2008 by Chris

Question: We’re really a “TIFF shop” right now, with all our processes and workflow geared up for TIFF processing. We are thinking of converting our files to PDF. Why convert to PDF right now? It seems costly, time-consuming, and involves taking some risk to undergo an entire PDF Conversion?

Answer: All change in the industry happens for a reason, and PDF conversion is no different. There is a large migration currently in progress across many industries from TIFF to PDF. Assuming that all these IT directors are not just wasting time and money at their respective firms, there must be some compelling reasons to convert from TIFF to PDF.

The advantages of converting to PDF include a hidden OCR text layer, meta-data insertion, web-optimization, compression, and portability. These are all reasonably important considerations from an IT director’s perspective.

Hidden text OCR means the OCR layer is embedded directly in the PDF document rather than stored as a separate file. It is not visible on the page, but it can be searched. Meta-data insertion is very useful for keeping important document information, such as creator and author, directly embedded inside the document. Web-optimization allows rapid, constant-time access into the middle of a large document without streaming the entire file. It’s ideal for keeping large files on a web-based database server. Document compression is often the key to efficient file uploading and downloading in a distributed web environment. PDF supports both bitonal and color compression, so transmission and storage requirements can be reduced by 5x-10x. The fact that annotations, OCR text, and meta-data are directly embedded in the PDF file makes portability between databases much more straightforward.

There is never an ideal time at the corporate level to migrate between file formats, but the advantages and ROI of migration to PDF are certainly evident. Please feel free to contact us with any other questions concerning PDF conversion.

Category: Adobe PDF Conversion, All, Convert PDF

Document Compression: Lossy vs. Lossless Compression

May 30th, 2008 by Chris

Question: What is the difference between lossy and lossless compression? How can I be sure that the converted documents look the same as the originals?

Answer: Lossless compression does not change any of the original pixel values as captured by the scanning or MFP device. Lossless color JPEG is easily 10x-100x larger than standard JPEG. In fact, no hospital we’re aware of stores its brain MRIs and CAT scans using a lossless format (such as lossless JPEG). For all real applications, certainly in the greyscale and color domains, some modification of original pixel values is accepted. The general rule is that compression and dpi reduction are allowed separately, or in conjunction, as long as the output image appears identical to the original. So this condition of “appears identical” seems to be key. Clearly, appearing identical is also a function of the device, application, practitioner, and other factors.

With perceptually lossless compression, pixel values are allowed to change provided the output image looks like the input image. With perceptually lossless compression, there should be no loss in readability. With effective perceptually lossless compression, recognition rates after compression should be identical to recognition rates before compression.
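
To make the distinction concrete, here is a minimal Python sketch using only the standard library. The pixel data and the `quantize` helper are made up for illustration; rounding pixel values is a crude stand-in for a lossy codec, not how JPEG or any CVISION product works. The lossless path returns every original pixel value, while the lossy path changes values by a small, bounded amount.

```python
import zlib

# A tiny synthetic 8-bit greyscale "scan" (illustrative data, not a real image).
pixels = bytes([12, 13, 12, 200, 201, 199, 12, 13] * 32)

# Lossless: compress, decompress, and recover every original pixel value.
packed = zlib.compress(pixels, 9)
restored = zlib.decompress(packed)

def quantize(data: bytes, step: int) -> bytes:
    """Lossy step: round each pixel to the nearest multiple of `step`."""
    return bytes(min(255, (v + step // 2) // step * step) for v in data)

# Lossy: pixel values are allowed to change, here by at most step // 2.
lossy = quantize(pixels, 8)
lossy_packed = zlib.compress(lossy, 9)
max_error = max(abs(a - b) for a, b in zip(pixels, lossy))

print("lossless round trip identical:", restored == pixels)
print("largest pixel change after lossy step:", max_error)
print("sizes (raw, lossless, lossy):", len(pixels), len(packed), len(lossy_packed))
```

Whether a given amount of pixel change still "appears identical" depends on the viewing device and application, as noted above, so a numeric bound like `max_error` is only a starting point.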

Checking the documents manually is still the best way to be sure there are no differences between the original files and the converted ones. CVISION ICert is an automated accuracy checking system to verify that each output file accurately corresponds to its input file.

Category: All, Document Compression, File Compression, JBIG2 Compression

Multipage Output

May 29th, 2008 by Chris

Question: We have been testing the OCR option ‘output Text’ as a Word RTF file, and PdfCompressor is outputting multiple ‘single page’ RTF files instead of a single document (as per the multipage PDF which is outputted). Is there a way we can output a multipage RTF file?

Answer: If Office 97 or 2000 is installed on the machine running PdfCompressor, the Word RTF files will be merged into a single multipage file.

Category: All

OCR Accuracy and Full-Text Search

May 27th, 2008 by Chris

Question: What accuracy rate can I expect when OCR’ing my corporate documents - mostly accounting and revenue earnings statements? Is OCR accuracy affected by the scanning resolution? Do I gain any significant advantage in using the most accurate setting the OCR product supports?

Answer: OCR accuracy can never be guaranteed, and the phrase “garbage in - garbage out” definitely applies to OCR. If the source input documents are clean originals, the OCR engine will probably yield considerably more accurate results than if the source documents are 3rd generation scanned files with some image skew. The higher the scanning resolution, at least up to 300 dpi, the more accurate the typical OCR results will be.

On a perfectly clean, well-behaved scan there is often little if any accuracy difference between the “fastest” and “most accurate” OCR modes. This means that if the most accurate OCR mode runs 3x-6x slower than the fast mode, it is often not cost-justified, as the end-user will do virtually as well in fast mode with a much faster workflow processing rate. If the source files are messy, are not originals, or have considerable skew, many different fonts, texture noise, or other complicating factors, then running the most accurate OCR setting may indeed be necessary to obtain accurate OCR results. The best way to determine the optimal setting for a given OCR engine is to test it on a representative dataset, trying each available OCR setting and measuring both recognition accuracy and timing.
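
The kind of test described above can be sketched in a few lines of Python. Everything here is an assumption for illustration: `fake_fast` and `fake_accurate` are hypothetical stand-ins for a real engine's modes, and the similarity ratio from `difflib` is used as a simple proxy for character accuracy against hand-checked ground truth.

```python
import time
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

def char_accuracy(truth: str, ocr_text: str) -> float:
    """Similarity ratio in [0, 1] between ground truth and OCR output."""
    return SequenceMatcher(None, truth, ocr_text, autojunk=False).ratio()

def benchmark(engines: Dict[str, Callable[[str], str]],
              samples: List[Tuple[str, str]]) -> Dict[str, Tuple[float, float]]:
    """Return {setting: (mean accuracy, seconds per page)} for (page, truth) samples."""
    results = {}
    for name, run_ocr in engines.items():
        start = time.perf_counter()
        outputs = [run_ocr(page) for page, _ in samples]
        elapsed = time.perf_counter() - start
        scores = [char_accuracy(truth, out)
                  for (_, truth), out in zip(samples, outputs)]
        results[name] = (sum(scores) / len(scores), elapsed / len(samples))
    return results

# Hypothetical stand-ins for an engine's "fast" and "most accurate" modes.
def fake_fast(page: str) -> str:
    return page.replace("8", "B")  # simulates a digit/letter confusion

def fake_accurate(page: str) -> str:
    return page

samples = [("Revenue 2008: 1,850", "Revenue 2008: 1,850")]
report = benchmark({"fast": fake_fast, "accurate": fake_accurate}, samples)
for setting, (accuracy, seconds) in report.items():
    print(f"{setting}: accuracy={accuracy:.3f}, seconds/page={seconds:.6f}")
```

In a real test, each entry in `engines` would call the OCR product with one of its settings, and `samples` would hold scanned pages paired with verified transcriptions.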

Category: All, OCR, OCR Accuracy

OCR

May 27th, 2008 by Chris

Question: Do we need to OCR our Company documents? We already field code them on 21 fields, like author, date, amount, State, etc. These are mortgage files that range in size from 50 to 400 pages. Most pages are scanned to black and white, with a few being scanned to color. What are the benefits of using OCR? Is it worth utilizing OCR for this sort of project?

Answer: There is clearly some overlap between field coding and OCR (optical character recognition). When a document is scanned, or entered into the system, key information about the document also needs to be entered into the system. This would often include author, date, last modified, etc. It is often hard to predict in advance all the fields of a document that would be relevant to a search later.

If the documents are reasonably unstructured, as in legal discovery, then OCR is usually applied to all the files considered even tangentially relevant as any one of these might be the needle in the haystack that we’re looking for. In very structured forms, general OCR (optical character recognition) might not be required since the fields can be precisely coded when the document is entered into the database or specialized form recognition software can be used to zone extract only the fields of interest.

The less structured a document is, the more helpful OCR is going to be. OCR gives scanned documents the same full-text search available in fully electronic files like Word and Excel. To search across an entire corporate database, including scanned documents, OCRing the scanned files is necessary. To make all incoming corporate matter searchable, including faxes, digital mailroom items, and MFP-captured documents, OCR is required.
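
To illustrate what full-text search over OCR output adds beyond coded fields, here is a minimal sketch of an inverted index in Python. The file names and page text are invented, and the OCR text is assumed to be already extracted per page; a production search system would do far more (stemming, ranking, handling OCR errors).

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Location = Tuple[str, int]  # (file name, page number)

def build_index(ocr_pages: Dict[Location, str]) -> Dict[str, Set[Location]]:
    """Map each lowercased word to the set of pages where it occurs."""
    index: Dict[str, Set[Location]] = defaultdict(set)
    for location, text in ocr_pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(location)
    return index

def search(index: Dict[str, Set[Location]], query: str) -> List[Location]:
    """Return pages containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

# Invented OCR text for three pages of two mortgage files.
ocr_pages = {
    ("loan_0417.pdf", 3): "Escrow waiver signed by borrower",
    ("loan_0417.pdf", 12): "Flood zone determination",
    ("loan_0522.pdf", 7): "Escrow account disclosure",
}
index = build_index(ocr_pages)
print(search(index, "escrow"))
print(search(index, "escrow waiver"))
```

A term like "escrow waiver" is unlikely to be one of the 21 coded fields, which is exactly the case where the OCR text layer pays off.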

There are very considerable differences in both OCR speed and accuracy. For black and white documents expect processing rates that vary from 2 pages per second to 30 seconds per page, with an average rate of about 3 seconds per page using standard accuracy on a machine with a 3 GHz processor. Color OCR tends to run much slower. Expect processing rates between 3 seconds per page and 90 seconds per page, depending on page complexity and the software and settings used, with an average processing time of about 10 seconds per page.

OCR recognition rates can be enhanced through various methods. Typically, for color files avoid any serious compression, particularly JPEG-based, prior to the OCR process. For black and white files, scan at a higher DPI (e.g., 300) if at all possible. There may still be thresholding problems introduced when the background is textured. In this case, high resolution scanning coupled with the right Gaussian smoothing prior to OCR is recommended. There are many other causes of poor OCR recognition rates, and these typically need to be considered on a case by case basis.
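
As a rough illustration of why smoothing helps with textured backgrounds, the following pure-Python sketch blurs a tiny greyscale image with a separable Gaussian kernel before thresholding. The image, `sigma`, `radius`, and `cutoff` values are all assumptions chosen for the demonstration; real pipelines would use an imaging library and tune these per dataset.

```python
import math
from typing import List

Image = List[List[float]]

def gaussian_kernel(sigma: float, radius: int) -> List[float]:
    """Normalized 1-D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(img: Image, kernel: List[float]) -> Image:
    """Convolve each row with the kernel, clamping at the image edges."""
    radius = len(kernel) // 2
    out = []
    for row in img:
        n = len(row)
        out.append([
            sum(k * row[min(max(x + i - radius, 0), n - 1)]
                for i, k in enumerate(kernel))
            for x in range(n)
        ])
    return out

def transpose(img: Image) -> Image:
    return [list(col) for col in zip(*img)]

def gaussian_smooth(img: Image, sigma: float = 1.0, radius: int = 2) -> Image:
    """Separable 2-D Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma, radius)
    return transpose(blur_rows(transpose(blur_rows(img, kernel)), kernel))

def threshold(img: Image, cutoff: float = 128) -> List[List[int]]:
    """Binarize: 1 marks ink (dark pixel), 0 marks background."""
    return [[1 if v < cutoff else 0 for v in row] for row in img]

# Light background with a single dark speckle of texture noise in the middle.
page = [[220.0] * 5 for _ in range(5)]
page[2][2] = 0.0

raw_ink = sum(map(sum, threshold(page)))
smooth_ink = sum(map(sum, threshold(gaussian_smooth(page))))
print("ink pixels without smoothing:", raw_ink)
print("ink pixels with smoothing:", smooth_ink)
```

The isolated speckle survives thresholding on its own but is averaged into the background after smoothing. Too much smoothing would also erode thin strokes, which is why the amount has to be chosen with care.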

To see other Blog entries discussing optical character recognition, the advantages of OCR, and how and when to use OCR, click on this link: http://www.cvisiontech.com/wordpress/?cat=5

Category: All, Batch PDF OCR, OCR Download, OCR PDF

Compress File

May 25th, 2008 by Chris

Question: I would like to compress all my files, electronic and scanned, in various formats to PDF. Will the output files always be significantly smaller than the input files?

Answer: Generally, most files are somewhat bloated and can be significantly compressed before storing in a database or uploading to the web. This is especially true for image files, which can often be reduced by a factor of 10x or more for black and white and 100x or more for color. However, there is no guarantee that all files will compress using CVISION PdfCompressor.

In particular, files that are already in electronic format may compress dramatically or not at all. For example, generated business invoices may already be as small as possible, entirely electronic, with no redundant font information. On the other hand, sometimes generated invoices contain corporate logos which might account for 90% of the invoice file size. In such a case, a reduction of 10x or more is still possible using a compression module such as CVISION PdfCompressor by (i) compressing the image streams and (ii) sharing these image objects across the multi-page PDF document.
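
The second idea, sharing one image object across many pages, can be sketched as content-addressed storage. This is a simplified illustration in Python, not PdfCompressor's implementation or actual PDF object syntax; the `logo` bytes are a made-up stand-in for an embedded image stream.

```python
import hashlib
from typing import Dict, List, Tuple

def share_images(page_images: List[bytes]) -> Tuple[Dict[str, bytes], List[str]]:
    """Store each distinct image once; pages keep only a reference (its hash)."""
    store: Dict[str, bytes] = {}
    refs: List[str] = []
    for data in page_images:
        key = hashlib.sha256(data).hexdigest()
        store.setdefault(key, data)  # first copy is kept, repeats are skipped
        refs.append(key)
    return store, refs

# Stand-in for a 10,000-byte logo repeated on each page of a 50-page invoice run.
logo = b"\x89LOGO" * 2000
pages = [logo] * 50

store, refs = share_images(pages)
before = sum(len(p) for p in pages)
after = sum(len(d) for d in store.values())
print(f"image bytes before: {before}, after sharing: {after}")
```

Sharing alone removes the repeated copies; compressing the one remaining image stream then reduces the file further.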

In short, reducing file size for web-based applications is often non-trivial, but may well be a crucial part of getting the application to run efficiently. Trying various compression options to see what works is a good idea. Alternatively, submitting your data so that the “experts” can take a look at it is highly recommended.

If you are interested in submitting files for compression, please email: support@cvisiontech.com. We will gladly look at them, and inform you of the necessary settings to compress the files.

Category: Uncategorized

OCR’ing PDF and TIFF files

May 24th, 2008 by Chris

Question: I am interested in OCRing PDF and TIFF files that are located in folders and subfolders. Is your product able to do this? At what speed, and without affecting daily workflow? Also, can it troll anything new that has been added to the folders and OCR it in real time? Please answer these questions, as I am interested in buying a product that can OCR.

Answer: Our product PdfCompressor Professional can compress and OCR PDF, TIFF, JPG, and other file types. PdfCompressor can process a folder and its subfolders and replicate the tree structure in the output for users’ convenience. If you have not done so already, please use the link below to download a free evaluation of PdfCompressor. This should help you see our results as far as timing and precision (these results vary based on your system specs and file quality).
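
For readers curious what "replicating the tree structure" amounts to, here is a small Python sketch that walks an input folder and plans a mirrored output path for each supported file. The function name, the extension list, and the choice to name every output `.pdf` are assumptions for illustration; this is not PdfCompressor's code or command-line interface.

```python
from pathlib import Path
from typing import Iterator, Tuple

# File types to pick up (an assumed list for this sketch).
EXTENSIONS = {".pdf", ".tif", ".tiff", ".jpg", ".jpeg"}

def plan_outputs(input_root: Path, output_root: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (source, destination) pairs, mirroring subfolders under output_root."""
    for src in sorted(input_root.rglob("*")):
        if src.is_file() and src.suffix.lower() in EXTENSIONS:
            relative = src.relative_to(input_root)
            yield src, (output_root / relative).with_suffix(".pdf")

# Example usage (paths are placeholders):
# for src, dst in plan_outputs(Path("scans"), Path("ocr_output")):
#     dst.parent.mkdir(parents=True, exist_ok=True)
#     ...  # run compression and OCR on src, writing the result to dst
```

Picking up newly added files could be handled by re-running such a scan periodically and skipping sources whose destination already exists.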

To download the trial software, click OCR PDF

Category: All, OCR, OCR Download, OCR PDF, Optical Character Recognition, PDF OCR, PDF Optimize

OCR, ICR, and Bar Code Readers

May 23rd, 2008 by Chris

Question: Do I need just OCR for my files or do I also need to be concerned with ICR and bar code readers?

Answer: Most OCR (optical character recognition) engines do not automatically handle either handwritten character recognition (ICR) or bar codes. If these are important to your document workflow and indexing, then the correct modules need to be installed to find and recognize these instances.

Generally, ICR recognition rates are significantly lower than OCR recognition rates. ICR rates are higher where the text is hand printed and letters are topologically distinct, i.e., not allowed to touch each other. In analyzing general cursive script, with letters connected to each other, recognition rates are significantly lower.

If the ICR fields are known to be of a certain type, say numeric (e.g., social security numbers), then the ICR engine can be calibrated to expect this data type, with recognition rates going up considerably.
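
The effect of telling the engine what data type to expect can be shown with a toy example in Python. The candidate scores below are invented, and `best_reading` is a hypothetical helper; real ICR engines expose this kind of constraint through their own configuration.

```python
import string
from typing import Dict, List

def best_reading(candidates: List[Dict[str, float]], allowed: str) -> str:
    """For each character position, pick the highest-scoring allowed character."""
    result = []
    for scores in candidates:
        permitted = {ch: s for ch, s in scores.items() if ch in allowed}
        result.append(max(permitted, key=permitted.get) if permitted else "?")
    return "".join(result)

# Invented per-character confidence scores for three handwritten symbols.
candidates = [
    {"l": 0.48, "1": 0.45, "I": 0.07},
    {"O": 0.51, "0": 0.44, "Q": 0.05},
    {"S": 0.40, "5": 0.38, "8": 0.22},
]

unconstrained = best_reading(candidates, string.ascii_letters + string.digits)
numeric = best_reading(candidates, string.digits)
print("any character allowed:", unconstrained)
print("digits only:", numeric)
```

With any character allowed, look-alike letters win narrowly; once the field is declared numeric, those letters are no longer candidates and the digits are chosen.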

Bar code recognition is generally an accurate process, even when the input scan is degraded (e.g., a rescan). Bar codes allow for a certain degree of redundancy, even when the scan quality is poor. However, there are many classes of bar codes and the bar code class must be defined correctly for the recognition process to work properly. If the wrong bar code classification is used during the recognition process, then an incorrect ASCII string will be returned.
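
One simple form of the redundancy mentioned above is a check digit. The sketch below uses EAN-13 purely as an example symbology (the post does not name one): the 13th digit is computed from the first 12, so many misreads can be detected.

```python
def ean13_check_digit(first12: str) -> int:
    """Check digit for EAN-13: weights alternate 1, 3 from the leftmost digit."""
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10

def is_valid_ean13(code: str) -> bool:
    """True if the code is 13 ASCII digits and its last digit matches the checksum."""
    return (len(code) == 13 and code.isascii() and code.isdigit()
            and ean13_check_digit(code[:12]) == int(code[12]))

print(is_valid_ean13("4006381333931"))  # a correctly read code
print(is_valid_ean13("4006381333981"))  # same code with one digit misread
```

A check like this catches a misread within the right symbology. It does not help if the reader is configured for the wrong bar code class in the first place, which is why the class must be set correctly.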

Category: All, ICR, OCR, and Bar Code Readers

PDF Optimize

May 22nd, 2008 by Chris

Question: I want to optimize my PDFs for web viewing. What is the best way to optimize PDF documents?

Answer: The major issues for handling large files on the Web would seem to be: i. compression, ii. web-optimization, iii. search, iv. chunking. We’ll briefly review each of these issues.

Compression: Just because a file has lots and lots of pages does not mean that the file size must necessarily also be large. Compare, for example, a 1,000 page electronic file with a 1,000 page scanned color TIFF file. The scanned file can easily be 100x larger than the electronic file (e.g., 1 GB vs. 10 MB). So compression can be a key factor in making sure your documents are amenable to web-hosting. Compression is particularly a factor when dealing with scanned image documents. Compression can yield reductions of up to 10x for black and white image documents and up to 100x for color image documents.

See, for example, http://www.cvisiontech.com/pdf_compressor_31.html.

Web-optimization: If files are large, it’s unlikely that someone specifically wants to view the first page of the document. More than likely, they want to get to some page in the middle of the document. Web-optimization is a feature that allows a document viewer to view any page in an arbitrarily large document in constant time (e.g., 1-2 seconds) by requesting that the server jump to the byte boundary where the page starts. This allows for efficient web browsing of a file. PDF format has native support for web-optimization.
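
The byte-boundary idea can be sketched in Python with a plain offset table. This is a deliberately simplified model: the page contents are fake, and real web-optimized PDF relies on structures defined in the PDF specification rather than the toy index used here. The point is only that knowing where a page starts lets a viewer request that byte range directly.

```python
import io
from typing import BinaryIO, List, Tuple

PageIndex = List[Tuple[int, int]]  # (byte offset, length) for each page

def build_file(pages: List[bytes]) -> Tuple[bytes, PageIndex]:
    """Concatenate pages and record where each one starts."""
    blob = bytearray()
    index: PageIndex = []
    for page in pages:
        index.append((len(blob), len(page)))
        blob += page
    return bytes(blob), index

def range_header(index: PageIndex, page_number: int) -> str:
    """HTTP Range header value a viewer could send to fetch just one page."""
    offset, length = index[page_number]
    return f"bytes={offset}-{offset + length - 1}"

def read_page(stream: BinaryIO, index: PageIndex, page_number: int) -> bytes:
    """Seek straight to the page instead of reading everything before it."""
    offset, length = index[page_number]
    stream.seek(offset)
    return stream.read(length)

# A fake 1,000-page document where each "page" is 10 bytes.
pages = [f"page {i:04d};".encode("ascii") for i in range(1000)]
blob, index = build_file(pages)

header = range_header(index, 500)
page = read_page(io.BytesIO(blob), index, 500)
print(header, page)
```

The cost of fetching page 500 is the same as fetching page 5, which is what makes access time independent of document length.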

Search: The larger a file is, the more likely you’ll need text search capability. If a file is only one or two pages then perhaps you can find what you’re looking for simply by perusing the document. If a file is large, say over 30 pages, then it is very difficult to find what you’re looking for without text search capability. Although most electronic files are already searchable, some are not (e.g., vector graphics). For scanned files without OCR (unsearchable), finding what you want in the file is akin to finding a needle in a haystack. So make sure all your large web-hosted files are searchable. For scanned documents, this means running the files through an OCR process.

Chunking: Another problem with large files in a Web-based environment, even if they’re web-optimized, is that just downloading the file in your viewer (e.g., Adobe Reader) may tie up all your computer’s memory resources, especially when file sizes run into the hundreds of megabytes. Even in a web-optimized viewer, the file being viewed will continue to stream and consume available machine RAM. One solution to this problem is chunking, meaning that a very large file is divided into subfiles, none of which exceeds a maximum byte size. For example, if we select 50 MB as a reasonable chunking size then a very large PDF file would be chunked so that no single PDF subfile exceeds 50 MB. Now the total memory consumption on a document search is bounded.
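
A minimal sketch of the chunking rule in Python, assuming the per-page byte sizes are known and ignoring per-file overhead. The greedy strategy and the `chunk_pages` name are illustrative choices, not a description of any particular product.

```python
from typing import List

MB = 1_000_000

def chunk_pages(page_sizes: List[int], max_bytes: int) -> List[List[int]]:
    """Group consecutive page numbers so that no group exceeds max_bytes."""
    chunks: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for page, size in enumerate(page_sizes):
        if size > max_bytes:
            raise ValueError(f"page {page} alone exceeds the chunk limit")
        if current and current_size + size > max_bytes:
            chunks.append(current)  # close the current subfile
            current, current_size = [], 0
        current.append(page)
        current_size += size
    if current:
        chunks.append(current)
    return chunks

# A 160 MB scan: 40 pages of 4 MB each, split into subfiles of at most 50 MB.
page_sizes = [4 * MB] * 40
chunks = chunk_pages(page_sizes, 50 * MB)
for i, chunk in enumerate(chunks):
    total = sum(page_sizes[p] for p in chunk)
    print(f"subfile {i}: pages {chunk[0]}-{chunk[-1]}, {total // MB} MB")
```

Keeping pages in order means each subfile covers a contiguous page range, so a viewer or index can map a page number to its subfile easily.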

Adobe PDF is recommended for web hosting document databases. The PDF format has native support for 3 of the 4 features we listed as desirable when web-hosting files, namely, compression, web-optimization, and searching (i.e., hidden text layer). As such, there is very little “engineering” required on the IT side when implementing a web-hosted database that is already in PDF format.

Category: All, Batch PDF OCR, OCR PDF, PDF Optimize

JBIG2 and PDF

May 21st, 2008 by Chris

Question: Does bitonal compression of scanned documents to JBIG2 format make sense on its own or should such conversion be done as part of a general conversion to PDF format?

Answer: JBIG2 is a new ITU-approved, international standard for compression of scanned black and white files. The effectiveness of JBIG2 compression versus the previous ITU TIFF G4 standard is very much dependent on the JBIG2 compression software used. The quality of the scanned document is also a function of the JBIG2 software used, since the decompression specs for JBIG2 are open but the individual JBIG2 compression algorithms used are proprietary.
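
One reason JBIG2 can beat G4 on text pages is that it can store each distinct character shape once in a symbol dictionary and then record only where that shape appears. The Python sketch below is a toy version of that idea using exact matching on made-up glyph bitmaps; it is an assumption-laden illustration, not any vendor's encoder, and it leaves out the actual bitstream coding entirely.

```python
from typing import Dict, List, Tuple

Glyph = Tuple[str, ...]              # a tiny bitmap, one string per pixel row
Placed = Tuple[Glyph, int, int]      # bitmap plus its x, y position on the page

def encode_page(glyphs: List[Placed]) -> Tuple[List[Glyph], List[Tuple[int, int, int]]]:
    """Store each distinct bitmap once; the page keeps (symbol id, x, y) entries."""
    dictionary: List[Glyph] = []
    ids: Dict[Glyph, int] = {}
    placements = []
    for bitmap, x, y in glyphs:
        if bitmap not in ids:
            ids[bitmap] = len(dictionary)
            dictionary.append(bitmap)
        placements.append((ids[bitmap], x, y))
    return dictionary, placements

def decode_page(dictionary: List[Glyph],
                placements: List[Tuple[int, int, int]]) -> List[Placed]:
    """Rebuild the page exactly from the dictionary and the placements."""
    return [(dictionary[i], x, y) for i, x, y in placements]

# Two made-up 3x3 glyph shapes repeated across a line of text.
glyph_a = ("###", "#.#", "###")
glyph_b = (".#.", "###", ".#.")
page = [(glyph_a, 0, 0), (glyph_b, 4, 0), (glyph_a, 8, 0), (glyph_a, 12, 0)]

dictionary, placements = encode_page(page)
print("distinct symbols stored:", len(dictionary), "for", len(page), "glyphs")
print("exact round trip:", decode_page(dictionary, placements) == page)
```

How aggressively an encoder decides that two scanned shapes are "the same" is one of the proprietary choices mentioned above, and it affects both file size and image quality.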

Using the right JBIG2 compression software can result in compression rates where the JBIG2 files are 5x-10x smaller than TIFF G4 and G4 PDF, with no loss of image quality.

Although JBIG2 is an ITU-approved format, it is still very new to the industry. The assumption that a typical client or system user has a pre-installed JBIG2 viewer is probably false. The advantages of using JBIG2-compressed PDF as the document format are severalfold. First, PDF fully supports JBIG2 so that the compression advantages of JBIG2 can be fully utilized within the PDF specs. Second, PDF Reader 5.0 and up can handle JBIG2-compressed files, so that your user base most likely has a JBIG2 PDF Reader pre-installed on their computer. Third, adding OCR searchability to your JBIG2-compressed file is very easy within the PDF specs using a hidden text layer. And finally, for multipage files that need to be web-hosted and viewed remotely, JBIG2 files that are made to fit the PDF specs (i.e., given a PDF wrapper) can take full advantage of the web-optimization feature supported by PDF and Adobe Reader, which means that large multipage files will open and display quickly on the Web.

So, in short, there are serious advantages in converting scanned documents into JBIG2 format. But having decided to convert a database to JBIG2, there are additional features available and more file control when the files are converted to JBIG2-compressed PDF format.

Category: All, JBIG2 Compression, JBIG2 and PDF