Path to simple yet robust document routing

Dec 30
2015

When it comes to the input path that documents follow, for many it’s as simple as scan, convert, save, but others require more complex workflows. The good news is that there are tools out there to perform even the most advanced workflows you could imagine. The bad news is that they are expensive. I’m here to tell you about a way of combining your scanner with your data capture, OCR, and document conversion software to build more complex workflows without the premium.

By using settings that come with most document scanners, together with the ability of most data capture, OCR, and document conversion products to use hot folders (watch folders), you can create robust multi-step workflows out of the box. First, you need a scanner that supports multiple destinations, usually 9 or more. This is indicated by an LED on your document scanner, which lets you pick a destination number at the start of a batch scan. Second, you will need all the software required to perform the conversions needed for the final result. In our example, we want to be able to OCR, data capture, compress, and archive.

Basically, the task is to create a funnel for your documents, with the end result saved in your chosen final destination. If your scanner supports what is called dual-stream, you can work with two funnels simultaneously, making your workflow all the more robust. The first part of the funnel is identifying the document type. Each of the 9 destinations on your scanner should be configured for one document type (you may want one destination per business process instead). The configuration includes the scan settings (300 DPI, of course) and the folder the document will go in. This is just the staging folder for the next step. Let’s assume that we set up destination 1 for invoices and that our scanner supports dual-stream. When it’s all said and done, we want one copy of each invoice saved in a searchable directory, where the file is both compressed and in PDF/A format. Then we want another copy of the same invoice to be data captured and put in a working directory for someone to review. Let’s put it all together.

Destination 1 on the scanner is configured for invoices. The first copy of any invoice is saved to a hot folder that the PDF conversion utility is watching; the second copy is scanned into a hot folder that the data capture product is watching. Because these are hot folders, both copies are picked up and processed by each application as soon as they arrive. Our requirement for the second copy was only that it be data captured and exported to a working directory, so its task is now complete. For the first copy, we have more conversions to do. The PDF conversion utility saves the OCRed searchable PDF to a hot folder for the compression utility, the compression utility compresses the PDF and saves it to a hot folder for the archive utility, and finally the archive utility saves the result in our final destination for all invoices. Below is a basic diagram of the workflow we created for invoices (destination 1):

Scan > PDF Creation > Compression > Archive > Final Result
     > Data Capture > Final Result
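
For readers who like to see the idea in code, here is a minimal sketch of what a chain of hot folders does, written in Python. It is not how any particular product works: the function names and the placeholder conversion step are assumptions made for illustration, and a real stage would call your OCR, compression, or archive tool instead of copying the file.

```python
import shutil
from pathlib import Path


def placeholder_convert(source: Path, target: Path) -> None:
    # Stand-in for a real OCR / compression / archive step (assumption):
    # it simply copies the file unchanged.
    shutil.copyfile(source, target)


def run_stage(inbox: Path, outbox: Path, convert=placeholder_convert) -> list:
    """Process every file waiting in `inbox` once and write results to `outbox`."""
    outbox.mkdir(parents=True, exist_ok=True)
    produced = []
    for source in sorted(inbox.iterdir()):
        if not source.is_file():
            continue
        target = outbox / source.name
        convert(source, target)
        source.unlink()  # remove the input so it is not processed twice
        produced.append(target)
    return produced


def run_chain(folders, convert=placeholder_convert) -> list:
    """Run one pass of each stage: folders[0] -> folders[1] -> ... -> folders[-1]."""
    results = []
    for inbox, outbox in zip(folders, folders[1:]):
        results = run_stage(inbox, outbox, convert)
    return results
```

A real watcher would call `run_stage` in a loop with a short pause between passes, and would need to make sure a file has finished being written before picking it up. The second stream (data capture) would simply be another chain with its own folders.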

Although it may have been slightly difficult to follow, hopefully it’s clear that the above is just one workflow that gets the most out of the tools offered by both the document scanner and the conversion software packages. Now you can proceed to program each of the other destinations with different document types and their associated workflows. Programmers and tech-savvy individuals will easily envision ways to add scripts that make the process even more robust, with email notifications and so on. This approach is not a replacement for advanced workflows but a middle ground between no workflow and very pricey workflows.
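
If you plan to script around the workflow, it can help to write the destination setup down as a small table first. The Python sketch below is only an illustration: the `Destination` class, the folder paths, and the "contracts" entry are hypothetical, and only destination 1 (invoices, 300 DPI, two streams) comes from the example above. Real scanners keep these settings in their own driver or profile software.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Tuple


@dataclass(frozen=True)
class Destination:
    number: int                        # number picked on the scanner panel
    document_type: str                 # one document type per destination
    dpi: int                           # scan resolution
    staging_folders: Tuple[Path, ...]  # one hot folder per stream


# Hypothetical values; only destination 1 mirrors the invoice example.
DESTINATIONS = {
    1: Destination(1, "invoices", 300,
                   (Path("hot/invoices/pdf"), Path("hot/invoices/capture"))),
    2: Destination(2, "contracts", 300, (Path("hot/contracts/pdf"),)),
}


def staging_folders_for(number: int) -> Tuple[Path, ...]:
    """Return the hot folders that a scan to `number` should be written to."""
    return DESTINATIONS[number].staging_folders
```

Here a dual-stream destination has two staging folders, one per funnel, while a single-stream destination has just one.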

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

File Format Buffet! – choosing your output format

Dec 09
2015

Nowadays, OCR and data capture solutions give you a broad selection of output formats for the result. Until the advent of layered file formats, your only choices were text formats such as Word (.doc) or plain text (.txt). But now the formats themselves come with a ton of options, leaving people to decide first which export format to use and then which variation of that format.

For the most part, it seems OCR is exported in one of two primary formats: Word (.doc) or Portable Document Format (.pdf). So we will use these as our staples.

Word is more or less a text-only format. Scanning and converting a document to Word is useful when you want to edit the text, reformat, add graphics, and then re-create the document, or borrow its contents. Some of the options this format offers in relation to OCR and data capture are keeping formatting, keeping graphics, and encoding. It’s fairly easy to decide which of these options would be most useful to your process. The text formats from document conversion are usually limited to immediate consumption rather than distribution, while the layered formats are for distribution and storage.

There are actually many layered file formats. There are even formats of JPEG and TIFF that permit a text layer. In the last few years, Microsoft released their own “layered” format called XPS, whose popularity has yet to take off. PDF is still the winner in this area. PDF comes with a salad bar of options, and sometimes it’s hard to pick what is best. When used in conjunction with data capture and OCR, the most common variation of PDF is a PDF with searchable text under the page image. What this means is that the visible layer of the PDF is the scanned image, and underneath it, with matching coordinates, is the text from OCR or data capture. The purpose is that when you search the text, you find the matching content on the image. Because PDF is for the most part a locked-down format, it’s important to decide which variation you want before even creating one. Other common settings are tagging, password protection, PDF/A for archiving, and bookmarks. When used with data capture and OCR, you will frequently see PDF/A for long-term archiving of documents, and password protection. The tagging and bookmark settings usually require an additional manual step unless the data capture program supports filling in this metadata. If you keep the quality of the image layer for any layered format high enough, you can OCR it again if you make a mistake in your choice of format.
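
To make the "text under page image" idea concrete, here is a small Python sketch of the concept only. It does not read or write real PDFs; the `Word` class, the coordinates, and the sample page are invented for illustration. The point is that each recognized word carries the position of the matching pixels, so a text search can be turned into a highlight on the image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str      # what OCR recognized
    x: float       # position and size of the word on the page image
    y: float
    width: float
    height: float


def find_on_page(words, query):
    """Return the boxes of every word matching `query`, ignoring case."""
    needle = query.lower()
    return [(w.x, w.y, w.width, w.height)
            for w in words if w.text.lower() == needle]


# Hypothetical OCR result for one scanned page.
page = [
    Word("Invoice", 72, 700, 60, 12),
    Word("Total", 72, 120, 40, 12),
    Word("total", 300, 400, 38, 12),
]

boxes = find_on_page(page, "TOTAL")  # two boxes a viewer could highlight
```

A PDF viewer does the equivalent when it highlights a search hit on a scanned page: the hit comes from the hidden text layer, and the highlight is drawn at the same coordinates on the visible image.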

The upshot is that, though you have a lot of options, you should be able to find the best practice or norm for your space very easily. Many of the choices are used only in special scenarios, and if you are not privy to the scenario, then you probably don’t need the option.

Chris Riley – About


PDFs that need to lose weight

Sep 22
2014

The cost of hard-drive space has decreased dramatically over the years, but the amount of data being created is keeping pace. It’s important to find ways to manage the space you have, and one way to do that is to consider file compression. PDF files are most times a great opportunity to save space. PDFs consist of layers. If you have a PDF converted using OCR, it will most likely have an image layer and a text layer. There are several ways to compress PDFs: during scan, after scan, or with a compressed file format.

Compressing PDFs during scan is the fastest way to ensure files are the size you expect. The downside is that you never start with an uncompressed file, so quality is out of your control. Most advanced compression tools are not lossless, so the file can be compromised; if you never have a chance to view the file uncompressed, there is no chance of undoing any issues. During-scan compression essentially compresses the TIFF or RAW image file prior to PDF creation, so the other downside is that it’s not the highest compression that can be achieved.

Compressing PDFs after scan allows you to leverage the latest technology and ensure the greatest compression. There are tools that, instead of compressing an image before PDF creation, work specifically on the PDF format. The benefit is that they can leverage features within the PDF itself to create a compressed file. This usually results in the smallest file with the greatest residual quality.

The most common tool for compression is a compressed file format such as RAR or ZIP. These tools can very nicely compress many formats into a single file. The challenge is that files that need to be viewed regularly require a decompression step first. This is time consuming and increases the risk of file loss. This type of compression is useful for storing files that are not regularly accessed. Because it can compress many formats, it is not as advanced in any one format as specialized file compression tools are.
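
As a rough illustration of the archive approach, the Python sketch below uses the standard `zipfile` module to pack files into one ZIP, check the archive before anyone deletes the originals, and read a file back out. The function names are made up for this example, and it is a sketch rather than a recommended backup procedure.

```python
import zipfile
from pathlib import Path


def archive_files(files, archive_path):
    """Write `files` into a ZIP archive and verify it before reporting success."""
    with zipfile.ZipFile(archive_path, "w",
                         compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Store each file under its bare name, without the folder path.
            archive.write(path, arcname=Path(path).name)
    with zipfile.ZipFile(archive_path) as archive:
        bad_member = archive.testzip()  # None when every member checks out
    if bad_member is not None:
        raise ValueError(f"corrupt archive member: {bad_member}")
    return archive_path


def read_member(archive_path, name):
    """The extra decompression step needed every time a file is viewed."""
    with zipfile.ZipFile(archive_path) as archive:
        return archive.read(name)
```

Verifying the archive before removing the originals addresses the file-loss risk mentioned above, and `read_member` shows the extra step that makes this approach awkward for files that are opened regularly.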

People commonly overlook the importance of compression. Because compressed files often replace the originals, you only have one chance to get it right for the life of that file. Companies will use various forms of compression. Because PDF files usually contain important information, please consider carefully how you wish to store and compress them.

Chris Riley – About


Attachment Emailing Master

Jan 19
2010

Very often in business, email correspondence is accompanied by a file attachment. While it’s possible to attach any file format to an email (some are not preferred by email clients), the most common type is a document, and the most common format is either Word or PDF. This post contains some advice on the best way to deliver documents via email.

When emailing documents, you have to be concerned about size, readability, and security. If the attachment is too large, you may not be able to email it at all. If the document is not readable, there is no point in sending it anyway. Finally, if it’s not secure, it might be re-purposed or stolen. When your document starts out in paper form, the challenges increase.

There is an ideal format and set of conversion settings to use when sending documents via email. Ideally, you would scan your document in color for visual readability. This is not the only type of readability; you also want to make sure the documents remain accessible for long periods of time. You would use optical character recognition (OCR) so that the document can be indexed by a search utility. You would use a compression tool to convert that initially large color image into one that is manageable without degrading the quality, and finally you would use the PDF format to get whatever level of security you choose.

The combination of a searchable, compressed, color PDF is the ideal method for emailing documents as attachments and ensuring their effectiveness and long-term usage.
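
For those who send such documents from a script, here is a minimal Python sketch using the standard `email` package. It only builds the message and checks the attachment size; it does not send anything. The 10 MB limit is an assumption chosen for the example, since real limits vary by mail provider, and the function name is hypothetical.

```python
from email.message import EmailMessage

# Assumed limit for illustration; real limits vary by mail provider.
MAX_ATTACHMENT_BYTES = 10 * 1024 * 1024


def build_message(sender, recipient, subject, body, pdf_bytes, filename):
    """Build an email with one PDF attachment, refusing oversized files."""
    if len(pdf_bytes) > MAX_ATTACHMENT_BYTES:
        raise ValueError("attachment too large to email; compress it further")
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    message.add_attachment(pdf_bytes, maintype="application",
                           subtype="pdf", filename=filename)
    return message
```

The size check reflects the first concern above: if the compressed PDF is still too large, it is better to find out before the mail server rejects it.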

Chris Riley – About
