Squeeze those files

Sep 14

Compression is a great tool for saving hard drive space. You may not currently be thinking about file compression, but you should. It’s very likely that data is being created on your machines at an increasing rate, and your free hard drive space is shrinking at the same pace. Organizations and individuals often consider file compression only when there is far too little space left on their hard drives or when the low-space warnings start appearing. Waiting that long is a big risk.

As we create files on our computers, access them, move them, and modify them, we fragment the drive. Overly fragmented drives slow down machines and increase the risk of damage and corruption. The more files you have, the more this multiplies. Real-time file compression helps because each file is compressed as soon as it is generated. Less space is used, and the need to compress in the future is gone. Backlog compression (compressing all your files in bulk) requires a lot of activity on the hard drive and increases fragmentation. The other risk of bulk compression is that you only have one chance to get it right.

Bad compression is not just an irritation, it’s a risk. Usually when you compress a file, you remove the original. The whole purpose is to save space, not to use up more by keeping both copies. Keeping both files just to make sure the compression worked wastes a lot of space. With day-forward or real-time compression, it’s easy to check the first files as they come across and confirm at initial setup that everything is good. With bulk compression, a single mistake could ruin a large library of files.
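One way to lower that risk is to verify each compressed file before the original is removed. The sketch below is only an illustration of the idea: it uses Python's general-purpose gzip module as a stand-in for whatever compression product you actually run, and the function names `sha256_of` and `compress_and_verify` are my own, not part of any product.

```python
import gzip
import hashlib
import os
import shutil


def sha256_of(stream):
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(1 << 20), b""):
        digest.update(chunk)
    return digest.hexdigest()


def compress_and_verify(path):
    """Gzip `path`, confirm the round trip, and only then remove the original.

    Returns the path of the compressed file. Raises ValueError (and keeps
    the original) if the decompressed data does not match the source.
    """
    compressed_path = path + ".gz"
    with open(path, "rb") as src:
        original_hash = sha256_of(src)
    with open(path, "rb") as src, gzip.open(compressed_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Decompress what was just written and compare it with the original.
    with gzip.open(compressed_path, "rb") as check:
        roundtrip_hash = sha256_of(check)
    if roundtrip_hash != original_hash:
        os.remove(compressed_path)
        raise ValueError("verification failed, original kept: " + path)
    os.remove(path)
    return compressed_path
```

Because the original is deleted only after the round trip matches, a bad compression run costs you some time rather than the file itself.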

I firmly believe in file compression, but I know firsthand the risk of doing it incorrectly. I now compress files as they are created and no longer have to think about data piling up faster than I can find ways to save space.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Space Age Optical Character Recognition

Aug 24

There are a lot of technologists out there who believe that optical character recognition has its days numbered and is an aged technology. The belief is that soon paper will go away. This post is for those who believe OCR technology is going away.

The reality is that paper consumption has not really decreased. In some areas paper has been replaced with electronic data interchange (EDI), but in other areas it has actually increased. Studies have also shown that because documents are being scanned more often, there is also an increase in printing when the documents need to be shared or re-purposed. But I’m not here to argue that paper is not going away and that document conversion technologies are required to convert it. I’m here to point out a few futuristic uses of the technology that technologists already like to talk about and that involve OCR.

Data Security

The first futuristic use of the technology that I would like to discuss is the use of OCR in data security. Text strings sent over the Internet are far easier to sniff and unlock than a compressed JPEG image. What if you were to convert the text into a JPEG for transmission, and the person on the receiving end were to OCR it to get the data? By doing so, the data has been masked in a more efficient and secretive way. For added security, proprietary image formats could be devised.

File Compression

Storing ASCII text takes up far less space than an image or video file. As a part of the future of compression technologies, expect that OCR will be used to extract the text from an image and save it as an ASCII file. Viewers will convert the text back to an image during viewing. This removes the image portion of the text and significantly reduces file size.


How else do you expect future robots to read text? OCR, of course. The eyes of the robot are essentially a camera that takes pictures rapidly. When the robot needs to comprehend text, the image will be converted using OCR and fed through an engine to gain meaning from the text and act on it.

So there you have it, three really cool and cutting-edge ways OCR is and will be used in the future. Paper is not going away, but even if it were, just look at the other cool uses of OCR technology.

Chris Riley – About


Compression: Save space, AND MONEY

Aug 04

Yes, compression saves valuable hard drive space, but as the technology world becomes more and more hosted, it’s just as important for saving money. Previously I have explored various types of compression, both general and file-type specific. I have also explored various drivers for compression: archiving, and saving space on regularly consumed files. What I have not talked about in detail is how compression is becoming more and more popular as a way to save money on hosted storage services.

Hosted software products are being created at a faster rate than installed ones. Many of these hosted solutions are content driven, such as content management, eDiscovery, accounts payable, and off-site storage, and they are all rooted in storing data. The preferred business model for the companies producing these solutions is to charge per megabyte of usage, or a combination of megabyte usage and a monthly service charge. For this reason, it’s important to consider how much storage is being used, not only for cost control but also to make sure the system is being used for useful data and not garbage.

Often organizations purchase an allotment of storage that they pay for monthly; their goal is to stay under their storage limit so they don’t have to upgrade to the next level. With content management services in particular, documents are often uploaded but never used within the system, and they are purely space wasters.

For these reasons, compression is a great tool to reduce the size of the files on your hosted service. The type of compression used for hosted services needs to be file specific. Hosted applications understand specific file formats and how to consume them, so archive formats such as zip would not be useful. Instead, compression for a particular format, such as PDF compression, must be used. In this way, you are still working with a compatible and consumable PDF, but at a much smaller size. The compression must be the kind meant for regular consumption. There are hosted archival systems, but here I’m discussing hosted products where the data they contain is used on a frequent to semi-frequent basis.

By compressing documents, a company can store more data for a lower storage fee. As hosted software products become more common, you will see people seeking better and better ways to make their files smaller while maintaining quality.
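To put rough numbers on this, here is a small sketch of the per-megabyte billing model described above. The prices and the compressed-size fraction are made-up inputs for illustration only; plug in your own provider's rates.

```python
def monthly_storage_cost(total_mb, price_per_mb, service_fee=0.0):
    """Monthly bill: per-megabyte usage charge plus an optional flat fee."""
    return total_mb * price_per_mb + service_fee


def monthly_savings(total_mb, compressed_fraction, price_per_mb):
    """Usage charge saved each month after compression.

    compressed_fraction is compressed size divided by original size,
    so 0.25 means the files shrink to a quarter of their original size.
    """
    before = monthly_storage_cost(total_mb, price_per_mb)
    after = monthly_storage_cost(total_mb * compressed_fraction, price_per_mb)
    return before - after


# Hypothetical example: 1,000 MB of PDFs at 0.50 per MB, shrunk to 25%.
print(monthly_savings(1000, 0.25, 0.50))  # 375.0
```

The flat service fee does not change with compression, which is why only the usage portion appears in the savings.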

Chris Riley – About


Down and dirty paperless office

Jul 28

In my office, paper comes in, is reviewed for value, gets scanned, and is then shredded or filed. I have set up a system that allows me to very efficiently scan documents to my “digital file cabinet”. Here is a quick guide on how I do it!

What you will need:

  1. An unused computer attached to your network

  2. Google Desktop Search with network browsing enabled

  3. A document scanner

  4. A server based automatic OCR product

  5. A file compression product ( optional but recommended )

Now to put it all together. My system is an inexpensive desktop computer with Windows XP installed. Once all the applications are installed, you don’t even need a monitor attached to this computer. The computer is visible on the network and has one shared folder, the “File Cabinet” folder in my case. This computer is my stand-alone digital file cabinet. Attached to it is a document scanner with a 30-page feeder. I have the scanner configured to scan to an “input” directory on the machine.

The automatic OCR processing product is configured to pick up images as soon as they arrive in the input folder (the “hot folder”), OCR them using specific index-level OCR settings, and create a PDF with a hidden searchable layer. The resulting PDF is put into another hot folder that the PDF compression tool is watching. As soon as a PDF arrives in this folder, it is compressed and the compressed PDF is moved to the “File Cabinet” folder.
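If you are curious what a hot folder does under the hood, the chain above can be sketched with simple polling. This is not how any particular OCR or compression product works internally; `placeholder_stage` is a hypothetical stand-in that just copies the file, and the folder arguments are whatever directories you choose.

```python
import os
import shutil
import time


def process_hot_folder(input_dir, output_dir, stage):
    """Run `stage` on every file waiting in input_dir, then clear it out.

    `stage` takes a source path and a destination path and writes the
    destination file. Returns the list of file names handled in this pass.
    """
    handled = []
    for name in sorted(os.listdir(input_dir)):
        src = os.path.join(input_dir, name)
        if not os.path.isfile(src):
            continue
        stage(src, os.path.join(output_dir, name))
        os.remove(src)  # the input copy is no longer needed
        handled.append(name)
    return handled


def placeholder_stage(src, dst):
    """Hypothetical stand-in for the OCR or PDF compression product."""
    shutil.copyfile(src, dst)


def watch(input_dir, ocr_out_dir, cabinet_dir, poll_seconds=5):
    """Poll both hot folders forever: scan input -> OCR -> compress -> cabinet."""
    while True:
        process_hot_folder(input_dir, ocr_out_dir, placeholder_stage)
        process_hot_folder(ocr_out_dir, cabinet_dir, placeholder_stage)
        time.sleep(poll_seconds)
```

A real watcher would also need to wait until the scanner has finished writing a file before picking it up, which this sketch does not attempt.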

Because Google Desktop Search is set to index all files in the “File Cabinet” folder, the PDFs very quickly become a part of the index. Configure Google Desktop Search to enable network searches so that any machine on the network can open a browser, go to a URL on the digital file cabinet machine, and locate documents with a search.

Once it’s set up, it’s simply a matter of putting paper in the scanner and pressing the scan button, and you’re done. It’s that easy, and extremely useful!

Chris Riley – About


Zip, Compress, Tar, Rar ….. confuse me?

May 05

Hard drives continually get cheaper and cheaper, but the rate that people collect information is still filling up hard drive real-estate faster than we can get more storage. One trick to saving space is various compression technologies. Most people when they think about compression now think of either a Zip file or that little check-box on Windows settings to enable compression on a physical drive. What is often overlooked is the ability to compress a single file utilizing a file compression tool for a specific format.

Choosing the right way to compress files and save space depends on several things: how often you will access the file again, what compression ratio you are getting, and what the long-term impacts of the compression are. When you use technologies such as Zip, Tar, and RAR, you are usually combining multiple files and don’t plan to access them soon. These tools take multiple files and combine them into a single archive file. This means that opening any one individual file in that archive takes additional time and effort. With this approach you can combine many different formats. Some formats will have a compression ratio of 0% and others a compression ratio of 60%. Rarely, when a zip is not successful, the result can be file corruption. I always suggest checking that you can unzip an archive after it’s created. People who need to access their files regularly, or need to be able to search their content at any given time, can still benefit from compression tools that are specific to a format and that work one file at a time, in batch.
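Both the check and the ratio comparison are easy to automate. The sketch below uses Python's standard zipfile module; `create_and_check_zip` is my own illustrative helper, not a standard function. It builds the archive, asks zipfile to test every member, and reports how much each file actually shrank.

```python
import os
import zipfile


def create_and_check_zip(zip_path, files):
    """Zip `files`, test the archive, and return per-file space savings.

    Returns a dict mapping each member name to the percent saved
    (0.0 means the file did not shrink at all).
    """
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in files:
            # Store each file under its base name only.
            zf.write(path, arcname=os.path.basename(path))
    with zipfile.ZipFile(zip_path) as zf:
        bad_member = zf.testzip()  # None when every member reads back cleanly
        if bad_member is not None:
            raise ValueError("corrupt member in archive: " + bad_member)
        savings = {}
        for info in zf.infolist():
            if info.file_size == 0:
                savings[info.filename] = 0.0
            else:
                saved = 1 - info.compress_size / info.file_size
                savings[info.filename] = round(max(saved, 0.0) * 100, 1)
    return savings
```

Plain text typically shows large savings here, while files that are already compressed (JPEGs, many PDFs) tend to land near zero, which is the case for format-specific tools.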

The most common file format that people use for search and retrieval, and that is generated by Data Capture and OCR, is PDF. PDFs usually get good compression in a Zip, Tar, or RAR tool, but there are specific things that can be done just for a PDF to compress it even further. PDFs often have a text layer that is searchable and an image layer for viewing. The bulk of the file size is almost always the image layer, so a specific image compression can be applied to just this layer, and a separate text compression to the text layer. The result is a PDF that opens just like any other file but takes up much less space. The benefit is that you can access your PDF at any time, it’s still indexed by your search utility, and you are saving space!

Compression is almost always a good choice when you are looking to save space. Compression technologies have come a long way in the last 4 years. It’s good to know what your purpose is in compressing and how often you want access to your files. Don’t be afraid to scout out compression tools for specific file formats and give them a try.

Chris Riley – About


File size, get over it – When to consider file size, when not

Feb 17

There are times when an organization’s focus actually stabs it in the back. When it comes to Data Capture, file size is one of these common focuses. The most common mistake companies make is not being concerned about file size, but when they are concerned about it. Many companies will heavily investigate the size of a file at input, tweaking and tuning to get a smaller input file for their data capture solution and, in their minds, for final storage. But at what cost? Companies often overlook that file size can be changed at any point, and the best point is not input, but after Data Capture has been run.

When you assign anyone a task, or teach anyone anything, you expect to give them the proper tools to get the job done as best they can. If they are missing some tools, you can expect quality to go down. Data Capture is the same way. Scanning at 150 DPI instead of 300 DPI, or in black and white instead of grayscale or color, limits the tools of Data Capture. Yes, these choices all dramatically reduce the file size, but they also reduce your quality. Give your Data Capture the best chance at success, then worry about file size.
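For a sense of how much these settings move the file size, here is the arithmetic for an uncompressed scan. The function name is my own, and the calculation ignores file headers and row padding, so treat the results as estimates.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate size in bytes of an uncompressed scan.

    bits_per_pixel: 1 for black and white, 8 for grayscale, 24 for color.
    """
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# A US Letter page (8.5 x 11 inches):
print(uncompressed_scan_bytes(8.5, 11, 300, 24))  # 25245000 (color, 300 DPI)
print(uncompressed_scan_bytes(8.5, 11, 300, 1))   # 1051875 (black and white, 300 DPI)
print(uncompressed_scan_bytes(8.5, 11, 150, 24))  # 6311250 (color, 150 DPI)
```

Halving the resolution cuts the pixel count to a quarter, and dropping from color to black and white cuts it by a factor of 24, which is exactly why these settings are so tempting at input.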

The proper way to address file size is at the point just before the file is stored in a file system or content management system. At this point you can downsample, reduce bit depth, or, even better for keeping re-purposing integrity, use reliable compression technology to get the job done. I say compression is best because it stays truest to the input image, and any time you consider printing, re-purposing, or even another pass through data capture, this will be very important.

So, while file size is important, delay the concern until after OCR or Data Capture is done.

Chris Riley – About


PDFs that need to lose weight

Sep 22

The cost of hard drive space has decreased dramatically over the years, but the amount of data being created has kept pace. It’s important to find ways to manage the space you have, and one way to do that is to consider file compression. PDF files are, most of the time, a great opportunity to save space. PDFs consist of layers. If you have a PDF converted using OCR, it will most likely have an image layer and a text layer. There are several ways to compress PDFs: during scan, after scan, or with a compressed file format.

Compressing PDFs during scan is the fastest way to ensure files are the size you expect. The downside is that you never start with an uncompressed file, so quality is out of your control. Most advanced compression tools are not lossless, so the file can be compromised. If you never have a chance to view the file uncompressed, there is no chance to undo any issues. During-scan compression essentially compresses the TIFF or RAW image file prior to PDF creation, so the other downside is that it’s not the highest compression that can be achieved.

Compressing PDFs after scan allows you to leverage the latest technology and ensure the greatest compression. There are tools that, instead of compressing an image before PDF creation, work specifically on the PDF format. The benefit is that they can take advantage of features within the PDF itself to create a compressed file. This usually results in the smallest file with the greatest residual quality.

The most common tool for compression is a compressed file format such as RAR or ZIP. These tools can very nicely compress many formats into a single file. The challenge is that files that need to be viewed regularly require a decompression step first. This is time consuming and increases the risk of file loss. This type of compression is useful for storing files that are not regularly accessed. Because it can compress many formats, it is not as advanced for any one format as specialized file compression tools are.

People commonly overlook the importance of compression. Because compressed files often replace originals, you only have one chance to get it right for the life of that file. Companies will use various forms of compression. Because PDF files usually contain important information, please consider carefully how you wish to store and compress them.

Chris Riley – About


Cross-Platform document conversion

Sep 20

It’s not a secret that when it comes to the latest and greatest document conversion technologies, they all exist on Windows machines. For some this might be very frustrating. The OCR, imaging, and compression packages found for Mac, Linux, and Unix are very often ports of older versions of their Windows equivalents. On average, a Windows equivalent will be 3 or more versions ahead. This means big differences in accuracy, stability, and core functionality. The reason this happens is simple: the initial development of these applications (engines) was on Windows, and the vast majority of the demand is also on Windows.

So what happens in an environment that demands accurate document conversion but is not a Windows-based system? Not all is lost. While in a perfect world all the latest technology would be on your platform of choice, sometimes you have to make exceptions, and this is not a big one to make. Because document conversion and compression products are all designed to have a mode where they run unattended, it is possible to run the technology on a Windows machine but drive it from ANY other platform. Once configured, a dedicated document conversion machine is very stable. It requires little maintenance and very little interaction. Simply by sharing folders for all other machines to see, no matter the platform, you can transfer images to your document conversion machine from any network device and download the results.

OCR itself takes about 50 man-years to develop, so I don’t foresee technology on other platforms reaching the level of Windows machines in the near future. But what I do know is that there is no reason NOT to leverage the most advanced technology with “set it and forget it” automated document conversion machines.

Chris Riley – About


Document longevity

Feb 09

One of the biggest risks in document scanning is doing it wrong. A document that is scanned improperly and stored improperly, with the original paper destroyed, could be a very serious situation for an individual or organization. Sometimes it’s just too hard to anticipate or know what settings to use. For example, while your scanning today may be for the purpose of regular consumption via search and retrieval, tomorrow the same document could be required and printed for a lawsuit.

Fortunately, technologies are advancing such that scanning the “Golden Document” is practical and possible. The “Golden Document” is a document scanned with all the best settings for quality, without taking into consideration file storage or performance, the two biggest drivers of reduced scan quality. The settings for the “Golden Document” are a resolution of 300 DPI, a color bit depth, and a file format of uncompressed TIFF. If the “Golden Document” is the optimum, one must justify why to ever deviate from it.

With advances in document scanners, compression, and file formats, the need for that justification becomes less and less. Document scanners can now scan a color image at nearly the speed of a black-and-white one. For this reason, there is little reason to use black-and-white or grayscale scans. A color document gives you the ability to convert, re-purpose, and print. Scanning at 300 DPI is a setting that should never be compromised. Now that you have the golden scan, you have created a rather large file. Ideally you could compress this file to a more regularly consumed format and not lose quality. Compression technology advances substantially every year. The ideal file format for storage, quality, and so on is arguably searchable PDF. This format has the functionality of a regularly consumed document and the configuration for sustainability. Alternatively, some may choose to create both a PDF and a Word document for the additional ability to re-purpose.

While you may not be scanning the “Golden Document” today, now is the time to revisit why not, and to look at ways to get there.

Chris Riley – About
