When clean is clean enough

Feb 25
2010

It’s hard for people to accept that a scanned image can be over-cleaned. I myself would love to believe you can clean up an image so much that it does not matter what OCR technology you use; it will always be 100% accurate. The fact is, however, that OCR engines don’t work this way. There are particular ways to improve the quality of a document, and there are ways that image clean-up hurts your OCR accuracy. I am going to talk about two such phenomena: fuzzy characters and characters with legs.

In data capture, a commonly sought-after imaging technique is line removal. Line removal attempts to find all lines in a form and make them disappear. It is especially tempting on forms where text is filled into fields and each character, and the field itself, is bounded by lines. Most forms processing tools have actually advanced to the point that they incorporate the lines in the algorithm and anticipate them being there, so they can recognize the characters even with lines present. What often happens when a line-removal algorithm is used is that you get characters with legs. As the name suggests, these are characters where a portion of the line remains at the top and/or bottom of the character, where the line touched it. The result is that the character no longer looks like its original self. Most characters become unrecognizable; others become another character, for example an H becomes an A and an I becomes a T. For this reason, line removal is no longer a recommended image clean-up tool for data capture.

The next imaging technique can be either extremely beneficial or detrimental to data capture. It all has to do with the form itself. I’m talking about despeckle. Despeckle is the algorithm that removes annoying dots on the document; it both improves the read of characters and removes garbage that might be recognized as characters. Despeckle is usually beneficial to data capture, especially on hand-print forms where the dots can interfere with the ICR algorithm. Where despeckle hurts data capture and forms processing is when the dots touch characters. Similar to line removal, if the dots are touching the characters, the segmentation tool believes they are part of the character and leaves them. Thus you get fuzzy characters. Fuzzy characters are very difficult for OCR engines to read. It’s a simple test: look at your form and notice whether or not the dots touch the characters. If they do, you are better off working with the dots.
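To make the touching-dots problem concrete, here is a minimal sketch of one common way to despeckle: group ink pixels into connected components and erase the components that are too small to be a character. This is an illustration only, not how any particular product does it. The `despeckle` function, the `max_speck_size` threshold, and the tiny text "image" (where `#` is ink and `.` is background) are all assumptions made for the example.

```python
from collections import deque


def despeckle(image, max_speck_size=2):
    """Erase connected groups of ink pixels no larger than max_speck_size.

    image is a list of equal-length strings: '#' is ink, '.' is background.
    """
    rows, cols = len(image), len(image[0])
    grid = [list(row) for row in image]
    seen = [[False] * cols for _ in range(rows)]

    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != "#" or seen[r][c]:
                continue
            # Flood fill to collect one 8-connected component of ink.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == "#" and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Small components are treated as specks and erased.
            if len(component) <= max_speck_size:
                for y, x in component:
                    grid[y][x] = "."

    return ["".join(row) for row in grid]


# Two isolated dots, plus a vertical stroke with one dot touching it.
page = [
    "#.......",
    "...#..#.",
    "...##...",
    "...#....",
    "...#....",
]
cleaned = despeckle(page)
print("\n".join(cleaned))
```

The two isolated dots disappear, but the dot touching the stroke belongs to the same connected component as the stroke, so it survives. That leftover bump is the start of a fuzzy character.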

These two examples demonstrate huge differences in OCR accuracy that come simply from choices made about the image itself, before setup or the software you use is even considered.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Cross-Platform document conversion

Feb 23
2010

It’s no secret that the latest and greatest document conversion technologies all exist on Windows machines. For some this might be very frustrating. The OCR, imaging, and compression packages found for Mac, Linux, and Unix are very often ports of older versions of their Windows equivalents. On average, the Windows equivalent will be three or more versions ahead. This means big differences in accuracy, stability, and core functionality. The reason this happens is simple: the initial development of these applications (engines) was on Windows, and the vast majority of the demand is also on Windows.

So what happens in an environment that demands accurate document conversion but is not a Windows-based system? Not all is lost. While in a perfect world all the latest technology would be on your platform of choice, sometimes you have to make exceptions, and this is not a big one to make. Because document conversion and compression products are all designed to have a mode where they run unattended, it is possible to run the technology on a Windows machine but drive it from ANY other platform. Once configured, a dedicated document conversion machine is very stable. It requires little maintenance and very little interaction. Simply by sharing folders on the network for all other machines to see, no matter the platform, you can transfer images to your document conversion machine from any network device and download the results.
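As a rough sketch of the shared-folder approach from the client side, the Python below copies an image into a watched input folder and polls an output folder for the result. Everything specific here is an assumption for illustration: the function name `convert_via_shared_folder`, the idea that the converter writes a PDF with the same base name as the input, and the timeout and polling values. Your conversion product will have its own naming rules, so check its watched-folder settings.

```python
import shutil
import time
from pathlib import Path


def convert_via_shared_folder(image_path, inbox, outbox, timeout_s=600, poll_s=5):
    """Drop an image into the shared inbox, then wait for the result to appear.

    inbox and outbox are network folders shared by the conversion machine and
    mounted on this machine (hypothetical layout).
    """
    image_path = Path(image_path)
    inbox, outbox = Path(inbox), Path(outbox)

    # Hand the image to the conversion machine.
    shutil.copy2(image_path, inbox / image_path.name)

    # Assumption: the converter writes <same base name>.pdf into the outbox.
    expected = outbox / (image_path.stem + ".pdf")

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if expected.exists():
            return expected
        time.sleep(poll_s)
    raise TimeoutError(f"no result for {image_path.name} after {timeout_s}s")
```

Because this only touches files on a mounted share, the same script runs unchanged on Mac, Linux, or Unix. One caveat: a file can show up in the output folder before the converter has finished writing it, so a real setup would want some completion signal, such as the converter writing to a temporary name and renaming when done.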

OCR itself takes about 50 man-years to develop, so I don’t foresee technology on other platforms reaching the level of Windows machines in the near future. But what I do know is that there is no reason NOT to leverage the most advanced technology through set-it-and-forget-it automated document conversion machines.

Disoriented Images

Feb 18
2010

It’s impossible to avoid: scanning documents upside down is going to happen in all medium- to high-volume scanning scenarios. Fortunately, the technology exists to very accurately rotate images to the proper orientation. Let’s take a detailed look at how image auto-rotation and linear distortion correction work.

When an image is scanned via a document scanner in batch, it’s not uncommon to have pages flipped the wrong direction. It’s also not uncommon to have pages where there is a vertical shift from the top of the document to the bottom. In order to leverage the best data capture technology, or even for long term storage of a document, the pages need to be right-side up and without skews or distortion.

There are two phases in which correction of images occurs. The first is during the scan or just after it. These algorithms work on the image only, to determine proper direction and eliminate skew. They are very fast and are sometimes built into the scanner driver itself. Image-based auto-rotate is not as accurate as contextual auto-rotate, but its speed makes up for it. Additionally, because an inaccuracy in rotation usually just means a page that was not rotated when it should have been, the risk is not high. Very rarely will image-based auto-rotate mis-rotate an image, i.e. take an image that was in the correct orientation and turn it upside down. Once rotation is correct, the distortions can be checked. The time when a page is in image format is the ONLY opportunity to correct linear distortion. The algorithms that perform this function work at the pixel level to determine the base alignment of the document vertically and horizontally, find portions of the document that do not match the base alignment, and make the proper shifts.

Phase two of document correction is contextual auto-rotate. Using a full-page OCR read at several orientations, the software can determine at which orientation the quality of the read is best. This is the most accurate way to rotate a document. Documents with little text, or with text at various angles, are the only risky documents. In these cases, the software chooses the orientation of the MOST readable text.
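The contextual approach boils down to "try each orientation, keep the one that reads best." Here is a minimal sketch of that loop. The `read_quality` argument is a stand-in for a real full-page OCR read and is supplied by the caller; the toy scorer below, which just counts rows spelling "OCR", is purely an assumption so the example can run without an OCR engine.

```python
def rotate_90_clockwise(page):
    """Rotate a page, given as a list of rows, a quarter turn clockwise."""
    return [list(row) for row in zip(*page[::-1])]


def best_orientation(page, read_quality):
    """Score the page at 0, 90, 180 and 270 degrees and keep the best read.

    read_quality is a caller-supplied function (hypothetical stand-in for an
    OCR engine) that returns a higher number for a better read.
    Returns (degrees rotated clockwise, rotated page).
    """
    best = None
    candidate = [list(row) for row in page]
    for degrees in (0, 90, 180, 270):
        score = read_quality(candidate)
        # Strict comparison: on a tie, the earlier orientation wins.
        if best is None or score > best[0]:
            best = (score, degrees, candidate)
        candidate = rotate_90_clockwise(candidate)
    return best[1], best[2]


def toy_read_quality(page):
    """Toy scorer: count the rows that read 'OCR' left to right."""
    return sum(1 for row in page if "".join(row) == "OCR")


# A page that was fed into the scanner upside down.
upside_down = [list("..."), list("RCO")]
degrees, fixed = best_orientation(upside_down, toy_read_quality)
print(degrees, ["".join(row) for row in fixed])
```

With a real engine, `read_quality` might return something like average character confidence. A page with very little text gives all four orientations similar scores, which is exactly why those pages are the risky ones.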

Auto-rotate and deskew are a must when scanning documents for the purpose of OCR and data capture. The technologies are very accurate and are sometimes used purely for the purpose of accurate document scanning and image storage.

Where do the images go?

Feb 16
2010

Document imaging and scanning are facilitated in large part by various software applications. Often some of the greatest appeal, for those not too familiar with document imaging, is the functionality contained within the software that is bundled with a document scanner. Many of the vendors, while they are selling document scanners, put all the focus on the applications that are married to the scanner and how they handle the images.

Recently at MacWorld 2010, this was borne out by the various scanner vendors, who had more to say about their personal content management applications than about their actual scanners. What surprised me is how little end users were concerned about where and how the images are stored.

Knowing how your personal content management application stores images is critical for your future retention and use of those images. To give you an example, if you are now scanning to an application that converts images to a proprietary format and saves them in an SQL Express database you don’t have access to, migrating from this application will be as difficult as re-scanning each and every piece of paper. What if you no longer have the originals?

Many of the sexy software applications out there make it very difficult to get to your data files directly, for use in other applications or for the purpose of migration. I would expect this to be a common question asked of vendors, but it was not. Only once did I see a vendor explain how you can still get to the files that are contained in their application. Indeed you could, following some non-obvious steps. And once you found all the image files, they were bizarrely named, not given the names assigned within the software. It is good to know they are there and accessible, but what a tremendous amount of work to get there.

You own the information so make sure you know where the images go, how they are stored, and how you can get to them if at all. If a particular solution is locked down or requires some hacking, it’s not a personal content management system for you.

Document longevity

Feb 09
2010

One of the biggest risks in document scanning is doing it wrong. A document that is scanned improperly and stored improperly, with the original paper destroyed, could be a very serious situation for an individual or organization. Sometimes it’s just too hard to anticipate or know what settings to use. For example, while your scanning today may be for the purpose of regular consumption via search and retrieval, tomorrow the document could be required and printed for a lawsuit.

Fortunately, technologies are advancing such that scanning the “Golden Document” is practical and possible. The “Golden Document” is a document scanned with all the best settings for quality, not taking into consideration file storage or performance, the two biggest drivers of reduced scan quality. The settings for the “Golden Document” are a resolution of 300 DPI, a color bit depth, and a file format of uncompressed TIFF. If the “Golden Document” is the optimum, one must rationalize why to ever deviate from it.

With advances in document scanners, compression, and file formats, the need for rationalization becomes less and less. Document scanners can now scan a color image at nearly the speed of a black-and-white one. For this reason, there is little reason to use black-and-white or grayscale scans. A color document gives you the ability to convert, re-purpose, and print. Scanning at 300 DPI is a setting that should never be compromised. Now that you have the golden scan, you have created a rather large file. Ideally you could compress this file to a more regularly consumed format and not lose quality. Compression technology advances substantially every year. The ideal file format for storage, quality, etc. is arguably searchable PDF. This format has the functionality of a regularly consumed document and the configuration for sustainability. Alternatively, some may choose to create both a PDF and a Word document for the additional ability to re-purpose.
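To put a number on "rather large," here is a back-of-the-envelope calculation of the raw pixel data in one uncompressed page. It assumes a US Letter page (8.5 x 11 inches) and 24-bit color; neither figure comes from the settings above, which only call for "a color bit depth," so treat both as illustrative. TIFF headers and row padding are ignored.

```python
def uncompressed_size_bytes(width_in, height_in, dpi=300, bits_per_pixel=24):
    """Raw pixel data for one page, ignoring file headers and row padding."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page: US Letter, 24-bit color, 300 DPI.
page_bytes = uncompressed_size_bytes(8.5, 11)
print(page_bytes)                            # 25245000
print(round(page_bytes / (1024 * 1024), 1))  # 24.1 (MiB)
```

That is roughly 24 MiB for a single page, or 24 times the pixel data of a 1-bit black-and-white scan at the same resolution, which is why compression matters so much once you commit to golden scans.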

While you may not be scanning the “Golden Document” today, now is the time to revisit why, and to look at ways to get there.

Compression: Save space, AND MONEY

Feb 05
2010

Yes, compression saves valuable hard-drive space, but as the technology world becomes more and more hosted, it’s just as important for saving money. Previously I have explored various types of compression, both general and file-type specific. I have also explored various drivers for compression: archiving, and saving space on regularly consumed files. But what I have not talked about in detail is how compression is becoming more and more popular for saving money on hosted storage services.

Hosted software products are being created at a faster rate than installed ones. Many of these hosted solutions are content driven, such as content management, eDiscovery, accounts payable, and off-site storage, and they are all rooted in storing data. The preferred business model for the companies producing these solutions is to charge per megabyte of usage, or a combination of megabyte usage and a monthly service charge. For this reason, it’s important to consider how much storage is being used up, not only for cost control but also to make sure the system is being used for useful data and not garbage.

Often organizations purchase an allotment of storage that they pay for monthly; their goal is to not exceed their storage limit and have to upgrade to the next level. Often with content management services, documents in particular are uploaded but never used within the system and are purely space wasters.

For these reasons, compression is a great tool to reduce the size of the files on your hosted service. The type of compression used for hosted services would need to be file specific. Hosted applications understand specific file formats and how to consume them; compression formats such as zip would not be useful for that reason. Instead, compression for particular formats such as PDF compression must be used. In this way, you are still working with a compatible and consumable PDF, but at a much smaller size. The driver for the compression must be compression for regular consumption. There are hosted archival systems, but in this case I’m discussing hosted products where the data contained in them are used on a frequent to semi-frequent basis.

By compressing documents, a company can store more data for a lower storage fee. As hosted software products become more common, you will see people seeking better and better ways to make their files smaller while maintaining quality.
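The arithmetic behind that claim is simple enough to sketch. The price per megabyte, the service fee, and the compression ratio below are made-up numbers for illustration only; plug in your own vendor’s pricing and the ratio you actually measure on your documents.

```python
def monthly_storage_cost(total_megabytes, price_per_megabyte, service_fee=0.0):
    """Monthly bill under a per-megabyte plan with an optional flat fee."""
    return service_fee + total_megabytes * price_per_megabyte


def monthly_savings(total_megabytes, compression_ratio, price_per_megabyte):
    """Savings per month; compression_ratio is original size / compressed size."""
    compressed_megabytes = total_megabytes / compression_ratio
    return (total_megabytes - compressed_megabytes) * price_per_megabyte


# Hypothetical account: 100,000 MB of PDFs at $0.01 per MB plus a $50 fee,
# with PDF compression that makes files 5x smaller.
before = monthly_storage_cost(100_000, 0.01, service_fee=50.0)
after = monthly_storage_cost(100_000 / 5, 0.01, service_fee=50.0)
saved = monthly_savings(100_000, 5, 0.01)
print(round(before, 2), round(after, 2), round(saved, 2))
```

Note that the flat service fee is untouched by compression; only the per-megabyte portion of the bill shrinks.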

Document Conversion and Law

Feb 04
2010

Both CVISION Technologies and I had the pleasure of attending LegalTech 2010 this year in New York. I was quite impressed with the show and especially with how engaged the attendees were. Where do document conversion and compression technologies fit in the legal space? Here is a brief review of how the technologies are used in this vertical market.

File security:

Let’s start with the most popular buzzword: PDF. PDF is the most popular file format in legal because it can be secured and, with the right compression tools, made very small. Security is fairly obvious, but compression not so much. Because many of the legal case management platforms, eDiscovery engines, and content management systems are billed by the megabyte of space, keeping files small but usable is critical. The trend is for fewer of these applications to be installed products and more to be hosted. The hosted products usually have a monthly service fee and charge by the amount of storage. Keeping the content valuable but small then becomes a real concern, especially when dealing with the hundreds of thousands of pages a case might contain.

Search-ability:

Lawyers work with a lot of paper, and getting at the right information is tough. That is why before a document can be loaded into any case management or eDiscovery system, it must be OCRed and made searchable. Good OCR is essential, as is the ability to get the documents converted quickly. Without OCR, eDiscovery simply cannot work on paper. Surprisingly, OCR was commonly an afterthought, yet poor OCR was a large complaint about products. Some organizations simply put the paper or image files aside, risking loss of valuable information. Others did not concern themselves with OCR accuracy and just assumed it was good enough. Both are mistakes, and I hope a dying trend in this particular market, as these organizations are only hurting themselves. Garbage in, garbage out.

Translation:

The number of translation companies at the show was large. Why? Because very often the documents in a lawsuit form a large collection that contains several languages. In order for eDiscovery to work well, the data must be normalized, i.e. translated. The first challenge is to find the languages. It is a tremendous effort to go through a large collection of documents and identify each page on which a particular language occurs. The second challenge, with paper documents, is getting the data into a digital format so manual or software-based translation can occur. OCR can facilitate both: it converts paper to digital, and language detection happens internally in the OCR engine during the read. Again, just as above, the quality of the OCR is imperative, so law firms have every right to be concerned about what OCR engine they or their translation company uses.

If you did not attend, I recommend you keep it on your radar for next year, or look at the West Coast version. While document conversion is not the favorite topic in legal, it finds its way into each step of case management, eDiscovery, and billing.
