The Magic of 300DPI

Jun 02
2010

Many OCR users don’t realize what impact resolution and bit-depth have, or even what they are. In OCR, more is usually better: more resolution, more bit-depth. It’s more information the OCR engine can use to interpret text. But as with many things, there is a point of diminishing returns, and for image resolution those diminishing returns are very interesting.

You will often hear that 300 DPI is the best resolution at which to scan an image for OCR. But why? 300 DPI is the magic number where you gain the most accuracy without sacrificing speed and file size. Suppose you put the resolutions on a progressive line starting at 96 DPI and test OCR accuracy, scanning speed, OCR speed, and file size at each one. You will notice something very interesting: the improvement gap between a 200 DPI scan and a 300 DPI scan is at least twice the improvement gap between any other pair of resolutions. Look at the same line between 300 DPI and 400 DPI and the improvement gap is nearly absent, but still there. This simple study is the reason 300 DPI is the ideal resolution for OCR scanning. Now let’s look at why.
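To put numbers on the file size side of that trade-off, here is a minimal sketch that computes the raw, uncompressed size of a scanned page. The function name and the US Letter page size are my own assumptions for illustration, and the figure ignores file headers and row padding.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw image size for a page scanned at the given resolution and bit depth."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# US Letter page (8.5 x 11 inches) in 24-bit color at several resolutions.
for dpi in (96, 200, 300, 400):
    size_mb = uncompressed_size_bytes(8.5, 11, dpi, 24) / 1_000_000
    print(f"{dpi} DPI: {size_mb:.1f} MB")
```

Because the pixel count grows with the square of the resolution, doubling the DPI quadruples the raw size, which is why going past 300 DPI costs so much for so little gain.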

Besides its reasonable scan speed and file size, the biggest reason 300 DPI is optimal is that the engine cores were all initially trained on this resolution. Some engines, no matter what resolution you give them, will actually sample up or down to get to 300 DPI. The image pre-processing/cleanup engines are set up similarly.
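As a rough illustration of what sampling up or down to 300 DPI means in pixels, the sketch below computes the dimensions an image would have after such a resample. The function name is hypothetical, and it says nothing about how any particular engine actually interpolates.

```python
def resampled_size(width_px, height_px, source_dpi, target_dpi=300):
    """Pixel dimensions after resampling so the page is at target_dpi."""
    scale = target_dpi / source_dpi
    return round(width_px * scale), round(height_px * scale)


# A letter page scanned at 200 DPI is sampled up; one at 400 DPI is sampled down.
print(resampled_size(1700, 2200, source_dpi=200))
print(resampled_size(3400, 4400, source_dpi=400))
```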

There are always exceptions, and they are usually hand-printed forms ( ICR ) or documents with small print.

The beauty of 300 DPI as a best practice is that it’s one of the few things in OCR and Data Capture that is consistent across document types. You have been told to use 300 DPI, and now you know the reason behind it.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Know your accuracy before you even test

Mar 25
2010

One of the natural abilities that develops as you see millions of sample images and their associated recognition results is that you begin to notice patterns and can instantly identify whether a document will read well, both for full-page document conversion and at the field level. It has more or less become second nature to me, but I can identify its components.

First is initial image quality. Without yourself identifying any objects on the page, look objectively at the document as a collection of questionable objects and see if you think the image quality is good. This is determined by coherence of each object. Are object borders tight and determinable? Are there objects interfering with other objects? Is the background of the image significantly different than all objects?
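The last question, whether the background differs enough from the objects, can be roughed out in code. This is only a sketch under my own assumptions: the page is a flat list of gray-scale values from 0 to 255, the fixed threshold of 128 separates objects from background, and the function name is hypothetical.

```python
def background_contrast(pixels, threshold=128):
    """Difference between mean light and mean dark pixel, scaled to 0..1."""
    dark = [p for p in pixels if p < threshold]
    light = [p for p in pixels if p >= threshold]
    if not dark or not light:
        return 0.0  # nothing to separate: blank page or solid block
    return (sum(light) / len(light) - sum(dark) / len(dark)) / 255


crisp = [0, 0, 255, 255]         # black text on white paper
washed = [120, 120, 130, 130]    # faded text on a gray background
print(background_contrast(crisp), background_contrast(washed))
```

A value near 1 suggests objects that stand clear of the background, while a value near 0 suggests the kind of image that deserves a second look before you expect good accuracy.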

Second is identification of objects. Find text, graphics, lines, paragraphs, etc. Are their borders far enough apart? Is their type clear? This is most important for text. Is their printing consistent? For example, if text goes from one background color to another, that makes it inconsistent. As another example, does the straightness of lines change throughout the document? And can one object be confused for another?

And third, now that you know the objects, how easy is it to determine their value? Is the value obvious? Do you have to look at it for a while to figure it out?

Essentially the three above steps are exactly what the conversion ( OCR, ICR, OMR ) product does in order to read a document. With field level recognition it’s a bit more elaborate, but the core is the same. By identifying early on what the anticipated accuracy is of a document, you can then adjust your scan, or input settings accordingly even before looking at any technology. Doing this will give the best chance for success.


Document longevity

Feb 09
2010

One of the biggest risks in document scanning is doing it wrong. A document that is scanned improperly and stored improperly, with the original paper destroyed, could be a very serious situation for an individual or organization. Sometimes it’s just too hard to anticipate or know what settings to use. For example, while your scanning today may be for regular consumption via search and retrieval, tomorrow the document could be required and printed for a lawsuit.

Fortunately, technologies are advancing such that scanning the “Golden Document” is practical and possible. The “Golden Document” is a document scanned with all the best settings for quality, without regard to file storage or performance, the two biggest drivers of reduced scan quality. The settings for the “Golden Document” are a resolution of 300 DPI, a color bit-depth, and a file format of uncompressed TIFF. If the “Golden Document” is the optimum, one must rationalize any decision to deviate from it.
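The three settings can be written down as a profile and compared against whatever you scan with today. This is a minimal sketch; `ScanProfile`, `GOLDEN`, and `deviations` are names I made up for illustration, not part of any scanner software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanProfile:
    dpi: int
    color_mode: str    # "color", "grayscale", or "bw"
    file_format: str   # e.g. "tiff"
    compressed: bool


# The "Golden Document" settings: 300 DPI, color, uncompressed TIFF.
GOLDEN = ScanProfile(dpi=300, color_mode="color", file_format="tiff", compressed=False)


def deviations(profile, golden=GOLDEN):
    """List the settings where profile falls short of the golden profile."""
    issues = []
    if profile.dpi < golden.dpi:
        issues.append(f"resolution {profile.dpi} DPI is below {golden.dpi} DPI")
    if profile.color_mode != golden.color_mode:
        issues.append(f"bit-depth is {profile.color_mode}, not {golden.color_mode}")
    if profile.file_format != golden.file_format or profile.compressed != golden.compressed:
        issues.append("file format is not uncompressed TIFF")
    return issues


print(deviations(ScanProfile(dpi=200, color_mode="bw", file_format="tiff", compressed=True)))
```

Each item in the returned list is a deviation you should be able to give a reason for.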

With advances in document scanners, compression, and file formats, the need for rationalization becomes less and less. Document scanners can now scan a color image at nearly the speed of a black-and-white one. For this reason, there is little reason to use black-and-white or gray-scale scans. A color document gives you the ability to convert, re-purpose, and print. Scanning at 300 DPI is a setting that should never be compromised. Now that you have the golden scan, you have created a rather large file. Ideally you could compress this file to a more regularly consumed format and not lose quality. Compression technology advances substantially every year. The ideal file format for storage, quality, etc. is arguably searchable PDF. This format has the functionality of a regularly consumed document and the configuration for sustainability. Alternatively, some may choose to create both a PDF and a Word document for the additional ability to re-purpose.

While you may not be scanning the “Golden Document” today, now is the time to revisit why not and to look at ways to get there.


OCRing Magazines

Dec 03
2009

Oftentimes when I receive printed periodicals, my preference is to OCR them to a digital searchable format and read the articles I’m interested in on my computer, just like my online periodicals. One of these printed documents might be a magazine. Magazines are either very easy to OCR or very difficult, and usually both cases exist in a single magazine. It all has to do with the graphical elements that are often incorporated in magazines.

Text printed on graphics. Very often articles will have text printed over related graphics. If entire paragraphs are printed over a single graphic, it’s less challenging; but when text overlaps both a graphic and white space, it’s problematic because a single word will change from colored text to normal black text in order to contrast with the image.

Annotated images. Many magazines, including my favorite scientific one, include text as part of the diagrams in their articles. To many this text may be irrelevant, but to me it provides important search words at the very least. These annotations tend to be in a small font and are often hard for the OCR engine to identify because of their close proximity to images.

The good news is that for the most part the purpose of OCRing any magazine is to make its text searchable. Anything more would probably be illegal. The other good news is that there are tricks to deal with each of these problems. First, a magazine that is being OCRed must be scanned in color. The additional information provided by the color scan will help the OCR engine distinguish graphics from text on graphics. Second, enable full recognition in the engine, along with any settings geared to small fonts. Third, turn off document analysis or enable only limited document analysis. This is the less obvious setting. By disabling document analysis, you don’t allow the OCR engine to get confused by strange structure, text printed on graphics, and annotated images. You are forcing it to read all possible text.

Because text search is the greatest benefit of OCRing my periodicals, I have opted for the OCR settings that produce the most text and the least structure. If you are converting similar documents, I recommend doing the same.


What you OCR is what you get

Dec 02
2009

Often the purpose of doing Optical Character Recognition ( OCR ) for individuals and companies is to get a digital version of a document that they intend to edit and/or re-purpose. This is not the most common use of the technology, but it is a use that requires specific attention.

In order to convert a document so that it is printable later on, it’s important to not only get the text from the document but also the format of the text. This includes layout as well as things such as graphics, and font colors. To do this, the OCR product must be able to recognize colors (requires color scanning), recognize font styles, and very importantly, recognize document structure.

Engines that support advanced document analysis have this. Document analysis ( DA ) is the process that happens before any text is read on a page. Document analysis makes sense of a document in order to improve recognition as well as to get the formatting required for a formatted export. First, document analysis finds document structure, i.e. columns, tables, text, paragraphs, and lines. Once this is done, it identifies colors in text and graphics. After document analysis has done its job, the recognition can begin. During recognition, the style of fonts is detected: bold, italic, underlined. All of this is put together into a result formatted as close as possible to the input document.

For those individuals who are concerned about re-purposing their documents, a straight text OCR engine will not work. Basic OCR engines get the text on the document into digital form and nothing more. For these individuals, it’s important to find a solution that has good document analysis.


Why buy what I already own!?

Nov 18
2009

Many people inherit full-page Optical Character Recognition (OCR) technology by simply purchasing a scanner or a multi-function (MFP) device. All these pieces of hardware include various software packages and OCR is one of the most common. Often the software is never used or the use isn’t always clear. Other times, the bundle is a tight integration with the hardware and the OCR is a part of configuration of the scanner and is used during scanning unknown to the user.

Bundled OCR technology is the easiest way to learn through use and to get the technology for a low price. Bundled software has contributed a great deal to market education and understanding of advanced technologies. All the top OCR engines have a consumer product bundled with a document scanner or multi-function device. But because it’s already there, it leaves many wondering why you would ever purchase the software directly.

For many, the bundled OCR is sufficient. Their documents are clean, and they have no demand for advanced options. But others just need more. This is why more advanced versions exist. Bundled OCR, even from the best vendors, is a limited or older version of the product. Some vendors make a special “bundle only version”, while others choose to incorporate non-current versions. Buying the software directly not only gets you the latest technology with the best features; the biggest driver to purchase is a greater, more specific need for OCR functionality. This could be because you are scanning old documents or degraded documents, or because you need special settings such as compression and PDF/A functionality that are simply not found in bundled versions.

Vendors don’t make any money on bundled OCR other than to cover costs. Because vendors mostly use bundled versions as marketing, they don’t incorporate the latest, greatest, and most advanced features. For those to whom the document conversion process is very important, there is a clear benefit in quality OCR packages.


Dropout, all or none

Oct 16
2009

Color or grey-scale dropout is a great tool for increasing the accuracy of extracting data from forms. But a bad dropout is far worse than no dropout. Partially dropped out forms can confuse data capture technology. These forms are commonly called “Zebra” forms: portions of the form have dropout performed correctly, and other portions have the fields now outlined in black. If you have control of the scanning and this is the situation, you are better off turning off dropout or improving its use.

It used to be that the only way to drop out a form was scanner-driven dropout. This approach was limited in the colors that could be removed. Essentially, the scanner would be equipped with colored lamps, usually red. During scanning, the lamp would be turned on, canceling out the red in the form. Because of this, it was important that printed forms used a certain type of red. If you have ever had experience with color matching you know it’s quite frustrating, especially because the colors you see on the screen are not usually what is printed. Things have improved. Now even scanners are using software dropout, where images initially arrive as color and algorithms then remove pixels of a certain color range from the document. With some scanners and software packages this has the added benefit of being able to drop out any color, and multiple colors at a time. There are even some packages out there where you can drop out things like colored lines.

Dropout with any technology becomes difficult when there are gradations on the form because of bad printing, color wear, sun, or other damage. Because the software is looking for consistency with any dropout, it will avoid colors that don’t match the norm. This is often seen when the first half of a form is dropped out and not the second because of a color change mid-document. There are tools that allow you to specify a threshold that can assist with this. This can be a very low threshold when dealing with documents that have one color and black text, but more complex documents with a low threshold can lose important data.
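Software dropout of a color range can be sketched in a few lines. This is an illustration under my own assumptions, not how any particular scanner or package does it: pixels are RGB tuples, `tolerance` is a hypothetical setting where a larger value removes a wider range of colors, and dropped pixels simply become white.

```python
WHITE = (255, 255, 255)


def drop_out(pixels, target, tolerance):
    """Replace every RGB pixel within `tolerance` of `target` with white."""
    def close(pixel):
        return all(abs(c - t) <= tolerance for c, t in zip(pixel, target))
    return [WHITE if close(p) else p for p in pixels]


form_red = (200, 30, 30)
row = [form_red, (230, 120, 120), (0, 0, 0)]   # clean red, sun-faded red, black text

print(drop_out(row, form_red, tolerance=20))    # faded red survives: a "Zebra" form
print(drop_out(row, form_red, tolerance=100))   # both reds removed, black text kept
```

The tight setting is what produces a partially dropped out form when the color has worn. The loose setting catches the faded red, but on a more complex document it could just as easily remove colored data you wanted to keep.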

The biggest key to proper dropout, assuming good form printing, is to scan the document as quickly as possible, leaving less time for damage to take place. Dropout is a great tool, but if you find that forms are partially dropped out, it is better for data capture accuracy to turn dropout off and deal with the black-and-white form than to keep it.


Path to simple yet robust document routing

Oct 15
2009

When it comes to the input path that documents follow, for many it’s as simple as scan, convert, save, but others require more complex work-flows. The good news is there are tools out there to perform even the most advanced work-flows you could imagine. The bad news, they are expensive. I’m here to tell you about a way of combining your scanner and data capture, OCR, and document conversion software to make more complex work-flows without the premium.

By using settings that come with most document scanners and the ability of most data capture, OCR, and document conversion products to utilize hot-folders ( watch folders ), you can create robust multi-step work-flows out of the box. First, you need a scanner that supports multiple destinations, usually 9 or more. This is indicated by an LED on your document scanner which, at the point of a batch scan, allows you to pick a destination number. Second, you will need all the software required to perform the conversions needed for the final result. In our example we will want to be able to OCR, data capture, compress, and archive.

Basically the task is to create a funnel for your documents, with the end result saved where you want the final destination to be. If your scanner supports what is called dual-stream, then you can be working with two funnels simultaneously, making your work-flow all the more robust. The first part of the funnel is identifying the document type. Each of the 9 destinations on your scanner should be configured for one document type ( you may want it to be one destination per business process instead ). The configuration would include the scan settings, 300 DPI of course, and what folder the document will go in. This is just the staging folder for the next step. Let’s assume that we set up destination 1 for invoices and our scanner supports dual-stream. When it’s all said and done, we want one copy of each invoice to be saved in a searchable directory, where the file is both compressed and in PDF/A format. Then we want another copy of the same invoice to be data captured and put in a working directory for someone to review. Let’s put it all together.

Destination one on the scanner is configured for invoices. The first copy of any invoice will be saved to a hot-folder that the PDF conversion utility is watching; the second copy will be scanned into a hot-folder that the data capture product is watching. Because these are hot-folders, both copies are picked up instantly and processed by each application. Our requirement for the second copy was only to be data captured and exported to a working directory, so we have now completed its task. For the first copy we have more conversions to do. The PDF conversion utility saves the OCRed searchable PDF to a hot-folder for the compression utility, the compression utility compresses the PDF and saves it to a hot-folder for the archive utility, and finally the archive utility saves the result in our final destination for all invoices. Below is a basic diagram of the work-flow we created for invoices ( destination 1 ).

Scan > PDF Creation > Compression > Archive > Final Result
     > Data Capture > Final Result

Although it may have been slightly difficult to read, hopefully it’s clear that the above is just one work-flow that gets the most out of the tools offered by both the document scanner and the conversion software packages. Now you can proceed to program each of the other destinations with different document types and their associated work-flows. Programmers and tech-savvy individuals will easily be able to envision ways to add scripts to make the process even more robust, with email notifications etc. This approach is not a replacement for advanced work-flows but a middle ground between no work-flow and very pricey work-flows.
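For those who do want to script it, the funnel for the first copy can be mimicked in a few lines. This is only a sketch under my own assumptions: the folder names and `drain_hot_folder` are hypothetical, `shutil.copyfile` stands in for the real conversion, compression, and archive utilities, and it makes a single pass where a real hot-folder product watches continuously.

```python
import shutil
from pathlib import Path


def drain_hot_folder(hot_folder, out_folder, step):
    """Run `step` on every file waiting in hot_folder and hand the result to out_folder."""
    out_folder.mkdir(parents=True, exist_ok=True)
    produced = []
    for item in sorted(hot_folder.iterdir()):
        if item.is_file():
            target = out_folder / item.name
            step(item, target)   # stand-in for OCR, compression, or archiving
            item.unlink()        # the file has left this stage of the funnel
            produced.append(target)
    return produced


def run_invoice_funnel(root):
    """Scan > PDF Creation > Compression > Archive > Final Result, one pass."""
    stages = ["pdf_in", "compress_in", "archive_in", "final_invoices"]
    result = []
    for source, destination in zip(stages, stages[1:]):
        (root / source).mkdir(parents=True, exist_ok=True)
        result = drain_hot_folder(root / source, root / destination, shutil.copyfile)
    return result
```

Calling `run_invoice_funnel(Path("C:/scans/destination1"))`, with whatever staging path your scanner destination uses, would move each waiting invoice through the three steps. The data capture copy would get its own, shorter funnel.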


Check your check scanning

Sep 16
2009

Check scanners are fast and have very accurate MICR reading. They get the job done when the only job is to get MICR from a check. But as OCR of checks, reconciliation of check data with remittances, and check images for future verification and reference gain importance and demand, check scanning runs into some complications.

The typical check scanner has two very key features:

1.) Auto endorsement
2.) MICR reading

Often people think that check scanners read MICR with OCR. This is incorrect: MICR is printed with magnetic ink that is read via a very specific magnetic reading and conversion process. When companies intend to augment their check scanning with OCR and Data Capture processes, there is something major they need to consider. Check scanners are great at what they do, but they are not great at producing high quality images. Most check scanners cannot scan past 200 DPI, which, as you will see in my previous articles, is less than optimum for OCR. Additionally, the lamps used to produce the image are fast but not the greatest quality.

So. Here are the options:

1.) Scan checks with a document scanner and a check scanner. The hard part here is the additional time it takes to perform two scans and merge the two data streams. In this scenario you get the best of both worlds: a great image for storing, OCR, and data capture from the document scanner, and great MICR and endorsement speed from the check scanner.

2.) Replace the check scanner with a document scanner. You can actually read the MICR using OCR, but it’s not quite as accurate as magnetic reading. This might be OK, as the extraction quality of the rest of the information on the check will be higher with the better image. Sometimes it’s also better because an ADF allows you to scan many checks at one time, which is a new time savings. The biggest killer of this approach is that auto endorsement is such a tremendous time saver that it’s impossible to part with.

3.) And finally option three, the most common: just use a check scanner. This option may be the most common, but it is not necessarily the best. With this option the company must make sure they get good image preparation and clean-up software that will enhance the OCR and Data Capture process, and likely up-sample the images to 300 or 400 DPI. Up-sampling does not produce the same quality as scanning at these resolutions, but products that excel in up-sampling can get close.
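To show what up-sampling does mechanically, here is a minimal sketch using nearest-neighbor resampling, the crudest method there is. The function name and the image as a list of pixel rows are my own assumptions, and products that excel at up-sampling will use far more sophisticated interpolation than this.

```python
def upsample_nearest(image, source_dpi, target_dpi):
    """Nearest-neighbor resample of a list of pixel rows from source_dpi to target_dpi."""
    height, width = len(image), len(image[0])
    new_height = height * target_dpi // source_dpi
    new_width = width * target_dpi // source_dpi
    return [
        [image[y * source_dpi // target_dpi][x * source_dpi // target_dpi]
         for x in range(new_width)]
        for y in range(new_height)
    ]


check_200dpi = [[0, 255],
                [255, 0]]
for row in upsample_nearest(check_200dpi, source_dpi=200, target_dpi=400):
    print(row)
```

Notice that every new pixel is just a copy of an existing one. No new detail is created, which is why up-sampling can get close to a true 300 or 400 DPI scan but never match it.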

Check scanning is being augmented more and more with OCR and Data Capture processes. Companies should not assume that a check scanner will produce the image quality of a document scanner, so the considerations above are important.


“eBooks for Reading” By: Oc R.

Sep 04
2009

As the popularity of reading eBooks increases, so does the demand to convert books to eBooks. Legality aside, the promise of using OCR technology to create eBooks is very high, and it is not too difficult. There are a few things to remember when using OCR to create an eBook. Getting a digital file into the eBook format is relatively easy, but creating the content for that format is the challenge. Enter Optical Character Recognition ( OCR ). There are several steps to successfully creating an eBook with OCR.

1.) How you scan
2.) How you optimize the image
3.) How you OCR the image

There are two common ways to scan a book. If you are lucky enough to have a book scanner, this is the desired approach, as it does not require the destruction of the book. These scanners are very pricey, but do a great job. The resulting image with a book scanner is one image for every two pages. We will get to this in a moment. The other way to scan is to remove the binding of the book and use a typical document scanner to produce an image file for each page. In this approach the quality is high, sometimes even higher than with a book scanner, but it is less convenient. It’s important in this approach to keep the book’s page order correct, as oftentimes you have to scan in batches and it’s easy to get pages mixed up. Scanning should be done at 300 DPI as a grey-scale TIFF. ( Group 4 compression applies only to black-and-white images, so it is an option only if you scan bitonal. ) This will produce the ideal image. Unless the book has significantly small fonts, these settings will do the trick. Scanning in color is only required if your book has color photographs.

Once scanning is done and you have image files, it’s time to apply imaging. For the most part, scans made with the binding removed will not require imaging, perhaps only line straightening, and deskew in case of crooked scans. For books scanned with a book scanner, there are two critical imaging tools that will always be applied. First is page separation. This is the imaging that separates the left side and the right side of the image into two separate pages of the book. The result is two separate image files. Next, line straightening is required on each of these image files. Because the binding of a book causes pages to curve inward, this curve appears as curved lines in a book scan. Line straightening finds the base-line for each line on the page and makes every portion of every line follow it.
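Page separation, at its simplest, is a cut down the middle of the two-page image. The sketch below assumes the image is a list of pixel rows and that the binding sits exactly at the center, which is my simplification; real imaging tools locate the binding before they cut. The function name is hypothetical.

```python
def split_spread(image):
    """Split a two-page scan (rows of pixels) down the middle into left and right pages."""
    middle = len(image[0]) // 2
    left = [row[:middle] for row in image]
    right = [row[middle:] for row in image]
    return left, right


spread = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
left_page, right_page = split_spread(spread)
print(left_page)
print(right_page)
```

Saving the left page before the right page for every spread also keeps the page order of the book intact.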

Now the magic of OCR can take place. Following these steps for 90% of the books out there will create an accurate eBook. There are many utilities that will then take Text, Doc, XML, etc. and convert it into the desired eBook format. Some tagging may be required for chapters etc. to gain all of the functionality in eBook readers.

