Squeeze those files

Sep 14

Compression is a great tool for saving hard drive space. You may not currently be thinking about file compression, but you should. It’s very likely that data is being created on your machines at an increasing rate, and that your free hard drive space is shrinking at the same pace. Organizations and individuals often consider file compression only when there is far too little space left on their hard drives, or when the low-space warnings start appearing. Waiting that long is a big risk.

As we create, access, move, and modify files on our computers, we fragment the drive. Overly fragmented drives slow machines down and increase the risk of damage and corruption, and the more files you have, the worse it gets. Real-time file compression helps because each file is compressed as soon as it is generated: less space is used, and there is no need to compress later. Backlog compression (compressing all of your existing files in bulk) requires a lot of hard drive activity and increases fragmentation. The other risk of bulk compression is that you only have one chance to get it right.

Bad compression is not just an irritation; it’s a risk. Usually when you compress a file, you remove the original. The whole purpose is to save space, not to use more by keeping both copies. Yet you also need to be sure the file was compressed correctly, and keeping both files around just to be safe wastes a lot of space. With day-forward or real-time compression, it’s easy to check files as they come through and confirm at initial setup that everything is good. With bulk compression, a single mistake could ruin a large library of files.
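
One way to reduce that risk is to verify every compressed file before its original is deleted. The sketch below is only an illustration: it uses Python's general-purpose gzip module as a stand-in for whatever compression product you actually use, and the function names are my own. The pattern is compress, decompress, compare, and only then remove the original.

```python
import gzip
import hashlib
import os
import shutil


def sha256_of(stream) -> str:
    """Return the SHA-256 hex digest of everything readable from stream."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(1 << 20), b""):
        digest.update(chunk)
    return digest.hexdigest()


def compress_and_verify(path: str) -> str:
    """Gzip `path`, confirm the archive decompresses to identical bytes,
    and only then delete the original. Returns the path of the .gz file."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Compare a hash of the original with a hash of the decompressed copy.
    with open(path, "rb") as src:
        original_hash = sha256_of(src)
    with gzip.open(gz_path, "rb") as packed:
        restored_hash = sha256_of(packed)

    if original_hash != restored_hash:
        os.remove(gz_path)  # keep the original, discard the bad archive
        raise ValueError(f"verification failed for {path}")

    os.remove(path)  # safe: the compressed copy is known to be good
    return gz_path
```

Because gzip is lossless, a byte-for-byte comparison works here. A lossy, format-specific compressor would need a different check, such as opening the result and inspecting it.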

I firmly believe in file compression, but I know firsthand the risk of doing it incorrectly. I now compress files as they are created, and I no longer have to worry about data piling up faster than I can find ways to save space.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Space Age Optical Character Recognition

Aug 24

There are a lot of technologists out there who believe that optical character recognition (OCR) is an aging technology whose days are numbered. The belief is that paper will soon go away. This post is for those who believe OCR technology is going away.

The reality is that paper consumption has not really decreased. In some areas paper has been replaced with electronic data interchange (EDI), but in other areas it has actually increased. Studies have also shown that because documents are being scanned more often, there is also an increase in printing when the documents need to be shared or re-purposed. But I’m not here to argue that paper is not going away and that document conversion technologies are required to convert it. I’m here to point out a few futuristic uses of the technology that technologists already like to talk about and that involve OCR.

Data Security

The first futuristic use of the technology that I would like to discuss is OCR in data security. Text strings sent over the Internet are far easier to sniff and unlock than a compressed JPEG image. What if you were to convert the text into a JPEG for transmission, and the person on the receiving end were to OCR it to get the data back? The data would then be masked in a more efficient and secretive way. For added security, proprietary image formats could be devised.

File Compression

Storing ASCII text takes up far less space than an image or video file. As a part of the future of compression technologies, expect that OCR will be used to extract the text from an image and save it as an ASCII file. Viewers will convert the text back to an image at viewing time. This removes the image of the text from the file and significantly reduces file size.


How else do you expect future robots to read text? OCR, of course. The eyes of the robot are essentially a camera that rapidly takes pictures. When the robot needs to comprehend text, the image will be converted using OCR and fed through an engine that extracts meaning from the text so the robot can act on it.

So there you have it: three really cool and cutting-edge ways OCR is and will be used in the future. Paper is not going away, but even if it were, just look at the other cool uses of OCR technology.

Path to simple yet robust document routing

Dec 30

When it comes to the input path that documents follow, for many it’s as simple as scan, convert, save, but others require more complex workflows. The good news is that there are tools out there to perform even the most advanced workflows you could imagine. The bad news is that they are expensive. I’m here to tell you about a way of combining your scanner with your data capture, OCR, and document conversion software to build more complex workflows without the premium.

By using settings that come with most document scanners, and the ability of most data capture, OCR, and document conversion products to use hot folders (watch folders), you can create robust multi-step workflows out of the box. First, you need a scanner that supports multiple destinations, usually 9 or more. This is indicated by an LED on your document scanner that lets you pick a destination number when you start a batch scan. Second, you will need all the software required to perform the conversions needed for the final result. In our example we want to be able to OCR, data capture, compress, and archive.

Basically, the task is to create a funnel for your documents whose end result is saved where you want the final destination to be. If your scanner supports what is called dual-stream, you can work with two funnels simultaneously, making your workflow all the more robust. The first part of the funnel is identifying the document type. Each of the 9 destinations on your scanner should be configured for one document type (you may want one destination per business process instead). The configuration includes the scan settings, 300 DPI of course, and the folder the document will go into. This is just the staging folder for the next step. Let’s assume that we set up destination 1 for invoices and that our scanner supports dual-stream. When it’s all said and done, we want one copy of each invoice saved in a searchable directory, where the file is both compressed and in PDF/A format. Then we want another copy of the same invoice to be data captured and put in a working directory for someone to review. Let’s put it all together.

Destination 1 on the scanner is configured for invoices. The first copy of any invoice is saved to a hot folder that the PDF conversion utility is watching; the second copy is scanned into a hot folder that the data capture product is watching. Because these are hot folders, both copies are picked up instantly and processed by each application. Our requirement for the second copy was only that it be data captured and exported to a working directory, so its task is now complete. For the first copy we have more conversions to do. The PDF conversion utility saves the OCRed searchable PDF to a hot folder for the compression utility, the compression utility compresses the PDF and saves it to a hot folder for the archive utility, and finally the archive utility saves the result in our final destination for all invoices. Below is a basic diagram of the workflow we created for invoices (destination 1):

Scan > PDF Creation > Compression > Archive > Final Result
Scan > Data Capture > Final Result

Although it may have been slightly difficult to read, hopefully it’s clear that the above is just one workflow that gets the most out of the tools offered by both the document scanner and the conversion software packages. Now you can proceed to program each of the other destinations with different document types and their associated workflows. Programmers and tech-savvy individuals will easily be able to envision ways to add scripts that make the process even more robust, with email notifications and so on. This approach is not a replacement for advanced workflows, but a middle ground between no workflow and very pricey workflows.
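
For those who want to script part of the funnel themselves, here is a minimal sketch of the hot folder idea. It is an illustration under assumptions, not how any particular product works: the folder layout and function names are hypothetical, and copy_step is a placeholder where a real stage would call your OCR, compression, or archive tool.

```python
import shutil
import time
from pathlib import Path
from typing import Callable, List, Optional, Tuple

# A step takes (input file, output file) and writes the output file.
Step = Callable[[Path, Path], None]
Stage = Tuple[Path, Path, Step]  # (folder to watch, folder to write to, step)


def copy_step(src: Path, dst: Path) -> None:
    """Hypothetical placeholder; a real stage would run a conversion here."""
    shutil.copyfile(src, dst)


def process_hot_folder(watch: Path, out: Path, step: Step) -> List[Path]:
    """Run `step` on every file currently in `watch`, writing into `out`."""
    watch.mkdir(parents=True, exist_ok=True)
    out.mkdir(parents=True, exist_ok=True)
    produced = []
    for src in sorted(p for p in watch.iterdir() if p.is_file()):
        dst = out / src.name
        step(src, dst)
        src.unlink()  # remove the input so it is not processed twice
        produced.append(dst)
    return produced


def run_pipeline(stages: List[Stage], poll_seconds: float = 2.0,
                 cycles: Optional[int] = None) -> None:
    """Poll every stage in order; cycles=None means run until interrupted."""
    completed = 0
    while cycles is None or completed < cycles:
        for watch, out, step in stages:
            process_hot_folder(watch, out, step)
        completed += 1
        if cycles is None or completed < cycles:
            time.sleep(poll_seconds)
```

Chaining stages so that one stage's output folder is the next stage's watch folder reproduces the funnel described above. A production script would also need to make sure a file has finished being written before picking it up, which this sketch does not attempt.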

File Format Buffet! – choosing your output format

Dec 09

Nowadays, OCR and data capture solutions give you a broad selection of output formats for the result. Until the advent of layered file formats, your only choices were text formats such as Word (.doc) or plain text (.txt). But now the formats themselves come with a ton of options, leaving people to decide first what export format to use, and then what variation of that format.

For the most part, it seems OCR is exported in one of two primary formats: Word (.doc) or Portable Document Format (.pdf). So we will use these as our staples.

Word is more or less a text-only format. Scanning and converting a document to Word is useful when you want to edit the text, reformat it, add graphics, and then re-create the document, or when you want to borrow its contents. The options this format offers in relation to OCR and data capture include keeping formatting, keeping graphics, and encoding. It’s fairly easy to decide which of these options would be most useful to your process. The text formats from document conversion are usually limited to immediate consumption rather than distribution, while the layered formats are for distribution and storage.

There are actually many layered file formats. There are even variants of JPEG and TIFF that permit a text layer. In the last few years, Microsoft released its own “layered” format called XPS, which has yet to catch on. PDF is still the winner in this area. PDF comes with a salad bar of options, and sometimes it’s hard to pick what is best. When used in conjunction with data capture and OCR, the most common variation of PDF is a PDF with searchable text under the page image. This means that the visible layer of the PDF is the scanned image, and underneath it, with matching coordinates, is the text from OCR or data capture. The purpose is that when you search the text, the match is shown at the right place on the image. Because PDF is for the most part a locked-down format, it’s important to decide what variation you want before you create one. Other common settings are tagging, password protection, PDF/A for archiving, and bookmarks. When PDF is used with data capture and OCR, you will frequently see PDF/A for long-term archiving of documents, and password protection. Tagging and bookmarks usually require an additional manual step unless the data capture program supports filling in this metadata. If you keep the quality of the image layer high enough in any layered format, you can OCR it again if you make a mistake in your format.

The upshot is that, though you have a lot of options, you should be able to find the best practice or norm for your space very easily. Many of the choices are used only in special scenarios, and if you are not privy to the scenario then you probably don’t need them.

Compression: Save space, AND MONEY

Aug 04

Yes, compression saves valuable hard drive space, but as the technology world becomes more and more hosted, it’s just as important for saving money. Previously I have explored various types of compression, both general and file-type specific. I have also explored various drivers for compression: archiving, and saving space on regularly consumed files. What I have not talked about in detail is how compression is becoming more and more popular as a way to save money on hosted storage services.

Hosted software products are being created at a faster rate than installed ones. Many of these hosted solutions are content driven, such as content management, eDiscovery, accounts payable, and off-site storage, and they are all rooted in storing data. The preferred business model for the companies producing these solutions is to charge per megabyte of usage, or a combination of megabyte usage and a monthly service charge. For this reason, it’s important to consider how much storage is being used, not only for cost control, but also to make sure the system is being used for useful data and not garbage.

Often organizations purchase an allotment of storage that they pay for monthly; their goal is to stay under that storage limit so they do not have to upgrade to the next level. And often, particularly with content management services, documents are uploaded but never used within the system, and are purely space wasters.

For these reasons, compression is a great tool to reduce the size of the files on your hosted service. The type of compression used for hosted services would need to be file specific. Hosted applications understand specific file formats and how to consume them; compression formats such as zip would not be useful for that reason. Instead, compression for particular formats such as PDF compression must be used. In this way, you are still working with a compatible and consumable PDF, but at a much smaller size. The driver for the compression must be compression for regular consumption. There are hosted archival systems, but in this case I’m discussing hosted products where the data contained in them are used on a frequent to semi-frequent basis.

By compressing documents, a company can store more data for a lower storage fee. As hosted software products become more common, you will see people seeking better and better ways to make their files smaller while maintaining quality.
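
To make the savings concrete, here is a small back-of-the-envelope calculation. Every number in it (document count, average size, price per megabyte, and the 5:1 compression ratio) is a made-up assumption for illustration only; substitute your own provider's pricing and the ratio you actually measure.

```python
def monthly_storage_cost(total_mb: float, price_per_mb: float,
                         base_fee: float = 0.0) -> float:
    """Monthly bill for a per-megabyte plan with an optional service charge."""
    return base_fee + total_mb * price_per_mb


def compressed_size_mb(total_mb: float, compression_ratio: float) -> float:
    """Size after compression; a ratio of 5 means files shrink to 1/5."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return total_mb / compression_ratio


if __name__ == "__main__":
    # Hypothetical figures, not taken from any real service.
    documents = 50_000
    average_mb = 2.0
    price_per_mb = 0.01
    ratio = 5.0

    before_mb = documents * average_mb
    after_mb = compressed_size_mb(before_mb, ratio)
    print(f"before: {monthly_storage_cost(before_mb, price_per_mb):.2f} per month")
    print(f"after:  {monthly_storage_cost(after_mb, price_per_mb):.2f} per month")
```

On a plan with a fixed allotment instead of per-megabyte billing, the same size calculation tells you how long you can stay under the limit before upgrading.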

Down and dirty paperless office

Jul 28

In my office, paper comes in, is reviewed for value, gets scanned, and is then shredded or filed. I have set up a system that allows me to scan documents very efficiently to my “digital file cabinet”. Here is a quick guide on how I do it!

What you will need:

  1. An unused computer attached to your network

  2. Google Desktop Search with network browsing enabled

  3. A document scanner

  4. A server based automatic OCR product

  5. A file compression product ( optional but recommended )

Now to put it all together. My system is set up on an inexpensive desktop computer with Windows XP installed. Once all the applications are installed, you don’t even need a monitor attached to this computer. The computer is visible on the network and has one shared folder, the “File Cabinet” folder in my case. This computer is my standalone digital file cabinet. Attached to it is a document scanner with a 30-page feeder. I have the scanner configured to scan to an “input” directory on the machine.

The automatic OCR processing product is configured to pick up images as soon as they arrive in the input folder (the “hot folder”), OCR them using specific index-level OCR settings, and create a PDF with a hidden searchable layer. The resulting PDF is put into another hot folder that the PDF compression tool is watching. As soon as a PDF arrives in this folder it is compressed, and the compressed PDF is moved to the “File Cabinet” folder.

Because Google Desktop Search is set to index all files in the “File Cabinet” folder, the PDFs very quickly become a part of the index. Configure Google Desktop Search to enable network searches so that any machine on the network can open a browser, go to a URL hosted on the digital file cabinet machine, and locate documents with a search.

Once it’s set up, it’s simply a matter of putting paper in the scanner and pressing the scan button, and you’re done. It’s that easy, and extremely useful!

Zip, Compress, Tar, Rar ….. confuse me?

May 05

Hard drives keep getting cheaper, but people collect information so quickly that they fill up hard drive real estate faster than they can add storage. One trick for saving space is to use compression technologies. Most people, when they think about compression, now think of either a Zip file or that little check-box in the Windows settings that enables compression on a physical drive. What is often overlooked is the ability to compress a single file with a file compression tool built for a specific format.

Choosing the right way to compress files and save space depends on several things: how often you will access the file again, what compression ratio you are getting, and what the long-term impacts of the compression are. When you use Zip, Tar, or RAR, you are usually combining multiple files together and don’t plan to access them soon. These tools take multiple files and combine them into a single archive file. This means that opening any one individual file in that archive takes additional time and effort. With this approach you can combine many different formats. Some formats will have a compression ratio of 0% and others a compression ratio of 60%. Rarely, but occasionally, a zip operation is not successful and the result is file corruption. I always suggest checking that you can un-zip a zip after it’s created. People who need to access their files regularly, or who need to be able to search their content at any given time, can still benefit from compression tools that are specific to a format and that work on one file at a time, in batch.
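
Both of those points, checking the archive after it is created and seeing how much each file actually shrank, can be scripted. The following is a sketch using Python's standard zipfile module; the function names are my own, and the check only covers Zip archives.

```python
import zipfile
from pathlib import Path
from typing import Dict, List


def make_zip(zip_path: Path, files: List[Path]) -> None:
    """Combine `files` into one deflate-compressed zip archive."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            zf.write(f, arcname=f.name)


def verify_zip(zip_path: Path) -> bool:
    """True if every member can be read back and passes its CRC check."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.testzip() is None  # testzip() returns the first bad name


def savings_by_member(zip_path: Path) -> Dict[str, float]:
    """Percentage of space saved for each member (0.0 for empty files)."""
    with zipfile.ZipFile(zip_path) as zf:
        return {
            info.filename: (
                100.0 * (1 - info.compress_size / info.file_size)
                if info.file_size else 0.0
            )
            for info in zf.infolist()
        }
```

Running savings_by_member on a mixed archive usually shows the spread described above: plain text shrinks a lot, while formats that are already compressed, such as JPEG, barely shrink at all.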

The most common file format that people use for search and retrieval, and that is generated by data capture and OCR, is PDF. PDFs usually get good compression in a Zip, Tar, or RAR tool, but there are specific things that can be done just for a PDF to compress it even further. PDFs often have a text layer that is searchable and an image layer for viewing. The bulk of the file size is the image layer, so a specific image compression can be applied to just this layer, and a separate text compression to the text layer. The result is a PDF that opens just like any other PDF but takes up much less space. The benefit is that you can access your PDF at any time, it’s still indexed by your search utility, and you are saving space!

Compression is almost always a good choice when considering saving space. Compression technologies have come a long way in the last 4 years. It’s good to know what your purpose is in compression and the frequency you want access to your files. Don’t be afraid to scout out compression tools for specific file formats and give them a try.

File size, get over it – When to consider file size, when not

Feb 17

There are times when an organization’s focus actually stabs it in the back. When it comes to data capture, file size is one of these common focuses. The most common mistake companies make is not that they are concerned about file size, but when they are concerned about it. Many companies heavily investigate the size of a file at input, tweaking and tuning to get a smaller input file for their data capture solution and, in their minds, for final storage. But at what cost? Companies often overlook that file size can be changed at any point, and that the best point is not at input but after data capture has been run.

When you assign anyone a task, or teach anyone anything, you expect to give them the proper tools to get the job done as best they can. If they are missing some tools, you can expect quality to go down. Data capture is the same way. Scanning at 150 DPI instead of 300 DPI, or in black and white instead of grayscale or color, limits the tools available to data capture. Yes, these choices all dramatically reduce the file size, but they also reduce your quality. Give your data capture the best chance at success, then worry about file size.
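
To put rough numbers on that trade-off, the sketch below estimates the uncompressed size of a single scanned page at different settings. It assumes a US Letter page (8.5 x 11 inches) and the usual 1, 8, and 24 bits per pixel for black and white, grayscale, and color. Real files will be smaller once the scanner's own compression is applied, so treat these as upper bounds for comparison only.

```python
LETTER_INCHES = (8.5, 11.0)  # assumed page size: US Letter
BITS_PER_PIXEL = {"black and white": 1, "grayscale": 8, "color": 24}


def raw_scan_bytes(dpi: int, bits_per_pixel: int,
                   page_inches=LETTER_INCHES) -> int:
    """Uncompressed size in bytes of one page, ignoring headers and padding."""
    width_px = round(page_inches[0] * dpi)
    height_px = round(page_inches[1] * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    for dpi in (150, 300):
        for mode, bits in BITS_PER_PIXEL.items():
            size_mb = raw_scan_bytes(dpi, bits) / 1_000_000
            print(f"{dpi} DPI, {mode}: {size_mb:.2f} MB")
```

Halving the resolution cuts the pixel count to a quarter, and dropping from color to black and white cuts the bits per pixel by a factor of 24, which is why these settings are so tempting and why they cost so much recognition quality.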

The proper way to address file size is at the point just before the file is stored in a file system or content management system. At this point you can downsample, reduce bit depth, or, even better for preserving the ability to re-purpose the file, use reliable compression technology to get the job done. I say compression is best because it stays truest to the input image, and that will be very important any time you consider printing, re-purposing, or even another pass through data capture.

So, while file size is important, delay the concern until after OCR or Data Capture is done.

PDFs that need to lose weight

Sep 22

The cost of hard drive space has decreased dramatically over the years, but the amount of data being created is keeping up. It’s important to find ways to manage the space you have, and one way to do that is to consider file compression. PDF files are, most of the time, a great opportunity to save space. PDFs consist of layers. If you have a PDF that was converted using OCR, it will most likely have an image layer and a text layer. There are several ways to compress PDFs: during the scan, after the scan, or by putting them in a compressed file format.

Compressing PDFs during the scan is the fastest way to ensure files are the size you expect. The downside is that you never start with an uncompressed file, so quality is out of your control. Most advanced compression tools are not lossless, so the file can be compromised, and if you never have a chance to view the file uncompressed, there is no way to undo any issues. Compression during the scan essentially compresses the TIFF or RAW image file prior to PDF creation, so the other downside is that it does not achieve the highest compression possible.

Compressing PDFs after the scan allows you to leverage the latest technology and ensure the greatest compression. There are tools that, instead of compressing an image before PDF creation, work specifically on the PDF format. The benefit is that they can use features specific to PDF to create a compressed file. This usually results in the smallest file with the greatest residual quality.

The most common tool for compression is a compressed file format such as RAR or ZIP. These tools can very nicely compress many formats into a single file. The challenge is that files that need to be viewed regularly require a decompression step first. This is time consuming and increases the risk of file loss. This type of compression is useful for storing files that are not regularly accessed. Because it can compress many formats, it is not as advanced for any one format as specialized file compression tools are.

People commonly overlook the importance of compression. Because compressed files often replace originals, you only have one chance to get it right for the life of that file. Companies will use various forms of compression. Because PDF files usually contain important information, please consider carefully how you wish to store and compress them.

Cross-Platform document conversion

Sep 20

It’s not a secret that the latest and greatest document conversion technologies all exist on Windows machines. For some this might be very frustrating. The OCR, imaging, and compression packages found for Mac, Linux, and Unix are very often ports of older versions of their Windows equivalents. On average, the Windows equivalent will be 3 or more versions ahead. This means big differences in accuracy, stability, and core functionality. The reason is simple: the initial development of these applications (engines) was on Windows, and the vast majority of the demand is also on Windows.

So what happens in an environment that demands accurate document conversion but is not Windows based? Not all is lost. While in a perfect world all the latest technology would be on your platform of choice, sometimes you have to make exceptions, and this is not a big one to make. Because document conversion and compression products are all designed to have a mode where they run unattended, it is possible to run the technology on a Windows machine but drive it from ANY other platform. Once configured, a dedicated document conversion machine is very stable. It requires little maintenance and very little interaction. Simply by sharing folders on the network for all other machines to see, no matter the platform, you can transfer images to your document conversion machine from any network device and download the results.
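
From the non-Windows side, driving such a machine can be as simple as copying a file into the shared input folder and waiting for the result to appear. Here is a minimal sketch of that idea. The folder paths, the function name, and the assumption that the result keeps the input's name with a .pdf extension are all hypothetical; your conversion product's naming rules may differ.

```python
import shutil
import time
from pathlib import Path


def submit_and_wait(image: Path, inbox: Path, outbox: Path,
                    result_suffix: str = ".pdf",
                    timeout: float = 300.0, poll: float = 2.0) -> Path:
    """Copy `image` into the shared input folder, then poll the shared
    output folder until the converted file shows up or `timeout` expires."""
    shutil.copy(image, inbox / image.name)
    # Assumption: the result is named after the input, with a new suffix.
    expected = outbox / (image.stem + result_suffix)
    deadline = time.monotonic() + timeout
    while True:
        if expected.exists():
            return expected
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result for {image.name} after {timeout} s")
        time.sleep(poll)
```

Here inbox and outbox would be the mounted network shares of the conversion machine, for example an SMB mount on a Mac or Linux client.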

OCR itself takes about 50 man-years to develop, so I don’t foresee technology on other platforms reaching the level of Windows machines in the near future. But what I do know is that there is no reason NOT to leverage the most advanced technology by using set-it-and-forget-it automated document conversion machines.
