What you OCR is what you get

May 04

Individuals and companies often use Optical Character Recognition (OCR) to get a digital version of a document that they intend to edit or re-purpose. This is not the most common use of the technology, but it is one that requires specific attention.

To convert a document so that it is printable later on, it’s important to get not only the text from the document but also its formatting. This includes layout as well as things such as graphics and font colors. To do this, the OCR product must be able to recognize colors (which requires color scanning), recognize font styles, and, very importantly, recognize document structure.

Engines that support advanced document analysis can do this. Document analysis (DA) is the process that happens before any text is read on a page. It makes sense of a document in order to improve recognition and to capture the formatting required for a formatted export. First, document analysis finds the document structure, i.e. columns, tables, text, paragraphs, and lines. Once this is done, it identifies colors in text and graphics. After document analysis has done its job, recognition can begin. During recognition, the style of fonts is detected: bold, italic, underlined. All of this is put together into a result formatted as closely as possible to the input document.

For individuals who are concerned about re-purposing their documents, a straight text OCR engine will not work. Basic OCR engines get the text on the document into digital form and nothing more. For these individuals, it’s important to find a solution that has good document analysis.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Dropout, all or none

Jan 20

Color or greyscale dropout is a great tool for increasing the accuracy of data extraction from forms. But a bad dropout is far worse than no dropout. Partially dropped-out forms can confuse data capture technology. These are commonly called “Zebra” forms: in some portions of the form the dropout was performed correctly, while in other portions the fields are now outlined in black. If you have control of the scanning and this is the situation, you are better off turning dropout off or improving its use.

It used to be that the only way to drop out a form was scanner-driven dropout. This approach was limited in the colors that could be removed. Essentially, the scanner would be equipped with colored lamps, usually red. During scanning, the lamp would be turned on, canceling out the red in the form. Because of this, it was important that printed forms used a certain type of red. If you have ever had experience with color matching, you know it’s quite frustrating, especially because the colors you see on the screen are not usually what is printed. Things have improved: now even scanners use software dropout, where images initially arrive in color and algorithms then remove pixels within a certain color range from the document. With some scanners and software packages this has the added benefit of being able to drop out any color, and multiple colors at a time. There are even some packages that can drop out things like colored lines.

Dropout with any technology becomes difficult when there are gradations on the form because of bad printing, color wear, sun, or other damage. Because the software looks for a consistent color, it will skip colors that don’t match the norm. This is often seen when the first half of a form is dropped out and the second is not because of a color change mid-document. There are tools that let you specify a threshold to assist with this. The threshold can be very low when a document has just one form color plus black text, but on more complex documents a low threshold can lose important data.
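To make the trade-off concrete, here is a minimal sketch of software dropout in Python. It is not how any particular scanner or package works: the function name, the sample colors, and the `tolerance` parameter are all assumptions made for illustration. Note that `tolerance` here means "how far a pixel may differ from the form color and still be removed", so a larger value is more aggressive; a real product's threshold setting may run in the opposite direction.

```python
def drop_out_color(pixels, target, tolerance):
    """Return a copy of `pixels` with every pixel close to `target` turned white.

    pixels    -- rows of (r, g, b) tuples, 0-255 per channel
    target    -- the (r, g, b) form color to remove
    tolerance -- largest per-channel difference still treated as the form color
    """
    white = (255, 255, 255)
    result = []
    for row in pixels:
        new_row = []
        for pixel in row:
            # Distance is the biggest difference on any single channel.
            distance = max(abs(c - t) for c, t in zip(pixel, target))
            new_row.append(white if distance <= tolerance else pixel)
        result.append(new_row)
    return result


# Hypothetical sample colors, chosen only for this illustration.
form_red = (220, 40, 40)     # form ink as printed
faded_red = (235, 120, 110)  # the same ink after wear or sun damage
black_ink = (20, 20, 20)     # filled-in data that must survive

row = [[form_red, faded_red, black_ink]]
tight = drop_out_color(row, form_red, tolerance=30)
loose = drop_out_color(row, form_red, tolerance=100)
```

With the tight setting only the clean red is removed and the faded red stays behind, which is exactly the partial, “Zebra” result. The loose setting removes both reds and still keeps the black data, but on a form with more colors the same loose setting could start removing things you need.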

Assuming good form printing, the biggest key to proper dropout is to scan the document as soon as possible, leaving less time for damage to take place. Dropout is a great tool, but if you find that forms are partially dropped out, it is better for data capture accuracy to turn dropout off and deal with the black and white form than to keep using it.


Path to simple yet robust document routing

Dec 30

When it comes to the input path that documents follow, for many it’s as simple as scan, convert, save, but others require more complex work-flows. The good news is that there are tools out there to perform even the most advanced work-flows you could imagine. The bad news: they are expensive. I’m here to tell you about a way of combining your scanner with your data capture, OCR, and document conversion software to build more complex work-flows without the premium.

By using settings that come with most document scanners and the ability of most data capture, OCR, and document conversion products to use hot-folders (watch folders), you can create robust multi-step work-flows out of the box. First, you need a scanner that supports multiple destinations, usually 9 or more. This is indicated by an LED on your document scanner, which lets you pick a destination number when you start a batch scan. Second, you will need all the software required to perform the conversions needed for the final result. In our example we want to be able to OCR, data capture, compress, and archive.

Basically the task is to create a funnel for your documents, with the end result saved in your chosen final destination. If your scanner supports what is called dual-stream, then you can work with two funnels simultaneously, making your work-flow all the more robust. The first part of the funnel is identifying the document type. Each of the 9 destinations on your scanner should be configured for one document type (you may want one destination per business process instead). The configuration would include the scan settings, 300 DPI of course, and the folder the document will go in. This is just the staging folder for the next step. Let’s assume that we set up destination 1 for invoices and our scanner supports dual-stream. When it’s all said and done, we want one copy of each invoice saved in a search-able directory, where the file is both compressed and in PDF/A format. Then we want another copy of the same invoice to be data captured and put in a working directory for someone to review. Let’s put it all together.

Destination one on the scanner is configured for invoices. The first copy of any invoice will be saved to a hot-folder that the PDF conversion utility is watching, and the second copy will be scanned into a hot-folder that the data capture product is watching. Because these are hot-folders, both copies are picked up instantly and processed by each application. Our requirement for the second copy was only that it be data captured and exported to a working directory, so its task is now complete. For the first copy we have more conversions to do. The PDF conversion utility saves the OCRed search-able PDF to a hot-folder for the compression utility, the compression utility compresses the PDF and saves it to a hot-folder for the archive utility, and finally the archive utility saves the result in our final destination for all invoices. Below is a basic diagram of the work-flow we created for invoices (destination 1).

Scan > PDF Creation > Compression > Archive > Final Result
     > Data Capture > Final Result

Although the diagram may be slightly difficult to read, hopefully it’s clear that this is just one work-flow that gets the most out of the tools offered by both the document scanner and the conversion software packages. Now you can proceed to program each of the other destinations with different document types and their associated work-flows. Programmers and tech-savvy individuals will easily be able to envision ways to add scripts that make the process even more robust, with email notifications etc. This approach is not a replacement for advanced work-flows but a middle ground between no work-flow and very pricey work-flows.
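For the programmers, the chain of hot-folders can be sketched in a few lines of Python. This is only an illustration of the idea, not a replacement for the watch-folder features of real products: the folder names, the `run_step` helper, and the placeholder conversion (which simply copies the file) are all made up for the example, and a real script would poll the folders in a loop rather than make one pass.

```python
import shutil
import tempfile
from pathlib import Path


def run_step(hot_folder, next_folder, convert):
    """Process every file waiting in hot_folder and drop the result in next_folder."""
    next_folder.mkdir(parents=True, exist_ok=True)
    processed = []
    for src in sorted(hot_folder.iterdir()):
        if not src.is_file():
            continue
        dst = next_folder / src.name
        convert(src, dst)   # stand-in for OCR, compression, archiving...
        src.unlink()        # the hot-folder is only a staging area
        processed.append(dst.name)
    return processed


def placeholder_convert(src, dst):
    """Hypothetical conversion step: a real one would call your conversion software."""
    shutil.copyfile(src, dst)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # One folder per stage of the invoice funnel (names are assumptions).
    stages = [root / name for name in ("pdf_in", "compress_in", "archive_in", "final")]
    for folder in stages:
        folder.mkdir()

    # Pretend the scanner just saved an invoice to destination 1.
    (stages[0] / "invoice_0001.tif").write_bytes(b"scanned image bytes")

    # Each step watches one folder and feeds the next.
    for hot, nxt in zip(stages, stages[1:]):
        run_step(hot, nxt, placeholder_convert)

    final_files = [p.name for p in stages[-1].iterdir()]
    leftover = [p.name for p in stages[0].iterdir()]
```

After the three steps the invoice sits in the final folder and every staging folder is empty again, which is the funnel behaviour described above. The second, data capture stream would be another short chain built the same way.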


The Magic of 300DPI

Apr 21

Many users of OCR don’t realize what impact resolution and bit-depth have, or even what they are. Usually in the case of OCR, more is better: more resolution, more bit-depth. It’s more information the OCR engine can use to interpret text. But as with many things, there is a point of diminishing returns, and for image resolution the diminishing returns are very interesting.

You will hear a lot that 300 DPI is the best resolution at which to scan an image for OCR. But why? 300 DPI is the magic number where you gain the most accuracy without sacrificing speed and file size. If you put the resolutions on a progressive line starting with 96 DPI and test OCR accuracy, scanning speed, OCR speed, and file size at each one, you will notice something very interesting: the improvement gap between a 200 DPI scan and a 300 DPI scan will be at least 2 times the improvement gap between any other pair of resolutions. If you look at the same line between 300 DPI and 400 DPI, the improvement gap is still there but nearly absent. This simple study is the reason 300 DPI is the ideal resolution for OCR scanning. Now let’s look at why.
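The file size side of that trade-off is simple arithmetic, sketched below in Python. The page size (US Letter, 8.5 x 11 inches) and the 8-bit greyscale depth are assumptions chosen for the example, and the numbers are raw, uncompressed sizes; the sketch says nothing about accuracy, which you can only measure by testing your own documents.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw image size in bytes for a page scanned at the given resolution."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page: US Letter, 8-bit greyscale.
sizes = {dpi: uncompressed_bytes(8.5, 11, dpi, 8) for dpi in (96, 200, 300, 400)}

for dpi, size in sizes.items():
    print(f"{dpi:>3} DPI: {size / 1_000_000:.2f} MB uncompressed")
```

Under these assumptions a 300 DPI page is about 8.4 MB raw and a 400 DPI page about 15 MB. Pixel count grows with the square of the resolution, so every step past 300 DPI costs a lot of storage and processing for the nearly absent accuracy gain described above.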

Besides the reasonable scan speed and file size, the biggest reason 300 DPI is optimal is that the engine cores were all initially trained on this resolution. Some engines, no matter what resolution you give them, will actually sample up or down to get to 300 DPI. The image pre-processing and clean-up engines are set up similarly.

There are always exceptions, and they are usually hand-printed forms (ICR) or documents with small print.

The beauty of 300 DPI as a best practice is that it’s one of the few things in the area of OCR and Data Capture that is consistent across document types. You have been told to use 300 DPI, and now you know the reason behind it.


“eBooks for Reading” By: Oc R.

Mar 17

As the popularity of reading eBooks increases, so does the demand and need to convert books to eBooks. Legality aside, the promise of using OCR technology to create eBooks is very high, and doing so is not too difficult. There are a few things to remember when using OCR to create an eBook. Getting a digital file into the eBook format is relatively easy, but creating the content for that format is the challenge. Enter Optical Character Recognition (OCR). There are several steps to successfully creating an eBook with OCR.

1.)How you scan
2.)How you optimize the image
3.)How you OCR the image

There are two common ways to scan a book. If you are lucky enough to have a book scanner, this is the preferred approach, as it does not require the destruction of the book. These scanners are very pricey, but do a great job. The result from a book scanner is one image for every two pages. We will get to this in a moment. The other way is to remove the binding of the book and use a typical document scanner to produce an image file for each page. With this approach the quality is high, sometimes even higher than with a book scanner, but it is less convenient. It’s important to keep the page order correct, as you often have to scan in batches and it’s easy to get pages mixed up. Scanning should be done at 300 DPI and saved as TIFF, either Group 4 (black and white) or grey-scale; note that Group 4 compression itself only applies to black-and-white images. This will produce the ideal image. Unless the book has very small fonts, these settings will do the trick. Scanning in color is only required if your book has color photographs.

Once scanning is done and you have image files, it’s time to apply imaging. For the most part, scans made with the binding removed will not require imaging, perhaps only line straightening and deskew in case of crooked scans. For books scanned with a book scanner there are two critical imaging tools that will always be applied. The first is page separation. This is the imaging that separates the left side and the right side of the image into two separate pages of the book. The result is two separate image files. Next, line straightening is required on each of these image files. Because the binding of a book causes pages to curve inward, this curve appears as curved lines in a book scan. Line straightening finds the base-line for each line on the page and makes every portion of every line follow it.
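Page separation is easy to picture in code. The sketch below is only an illustration in Python, working on a greyscale image stored as rows of pixel values (0 is black, 255 is white). The idea of locating the gutter as the darkest column near the middle of the spread is my assumption for the example; real imaging packages use their own, more robust methods, and the function names are made up.

```python
def find_gutter(image):
    """Guess the binding position: the darkest column in the middle third."""
    width = len(image[0])
    start, end = width // 3, width - width // 3
    column_sums = {x: sum(row[x] for row in image) for x in range(start, end)}
    return min(column_sums, key=column_sums.get)


def split_spread(image):
    """Split a two-page book scan into separate left and right page images."""
    gutter = find_gutter(image)
    left = [row[:gutter] for row in image]
    right = [row[gutter:] for row in image]
    return left, right


# A tiny 2-row "scan": bright pages with a dark binding shadow in the middle.
spread = [
    [255, 200, 90, 40, 210, 255],
    [255, 200, 90, 40, 210, 255],
]
left_page, right_page = split_spread(spread)
```

Each half would then be saved as its own image file and passed on to line straightening before OCR.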

Now the magic of OCR can take place. Following these steps for 90% of the books out there will create an accurate eBook. There are many utilities that will then take Text, Doc, XML, etc. and convert it into the desired eBook format. Some tagging may be required for chapters etc. to gain all of the functionality in eBook readers.


Check your check scanning

Jan 13

Check scanners are fast and have very accurate MICR reading. They get the job done when the only job is to get the MICR from a check. But as OCR of checks, reconciliation of check data with remittances, and check images for future verification and reference gain importance and demand, check scanning runs into some complications.

The typical check scanner has two very key features:

1.) Auto endorsement
2.) MICR reading

Often people think that check scanners read MICR with OCR. This is incorrect. MICR is printed with magnetic ink that is read via a very specific magnetic reading and conversion process. When companies intend to augment their check scanning with OCR and Data Capture processes, there is something major they must not overlook. Check scanners are great at what they do, but they are not great at producing high quality images. Most check scanners cannot scan above 200 DPI, which, as you will see in my previous articles, is less than optimal for OCR. Additionally, the lamps used to produce the image are fast but not of the greatest quality.

So. Here are the options:

1.)Scan checks with a document scanner and a check scanner. The hard part here is the additional time it takes to perform two scans and merge the two data streams. In this scenario you get the best of both worlds: a great image for storage, OCR, and data capture from the document scanner, and great MICR and endorsement speed from the check scanner.

2.)Replace the check scanner with a document scanner. You can actually read the MICR using OCR, but it’s not quite as accurate as magnetic reading. This might be OK, as the extraction quality for the rest of the information on the check will be higher with the better image. Sometimes it’s also better because an ADF allows you to scan many checks at one time, which is an additional time saving. The biggest killer of this approach is that auto endorsement is such a tremendous time saver that it’s impossible to part with.

3.)And finally option three, the most common: just use a check scanner. This option may be the most common but is not necessarily the best. Here the company must make sure it gets good image preparation and clean-up software that will enhance the OCR and Data Capture process and will likely up-sample the images to 300 or 400 DPI. Up-sampling does not produce the same quality as scanning at these resolutions, but products that excel at up-sampling can get close.
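To show what up-sampling means at the pixel level, here is a deliberately simple Python sketch that takes an image from 200 to 300 DPI by repeating the nearest source pixel. This is the crudest possible method and is an assumption for illustration only; the products that "get close" use far better interpolation than this, and the function name is made up.

```python
def upsample_nearest(image, src_dpi, dst_dpi):
    """Resize rows of pixel values from src_dpi to dst_dpi by nearest-neighbor."""
    src_h, src_w = len(image), len(image[0])
    dst_h = src_h * dst_dpi // src_dpi
    dst_w = src_w * dst_dpi // src_dpi
    return [
        # Map each output pixel back to the source pixel it falls on.
        [image[y * src_dpi // dst_dpi][x * src_dpi // dst_dpi] for x in range(dst_w)]
        for y in range(dst_h)
    ]


small = [
    [0, 255],
    [255, 0],
]
large = upsample_nearest(small, src_dpi=200, dst_dpi=300)
```

The 2 x 2 image becomes 3 x 3, but no new detail appears: some pixels are simply repeated. That is why an up-sampled 200 DPI check image does not match a true 300 DPI scan.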

Check scanning is being augmented more and more with OCR and Data Capture processes. Companies should not assume that a check scanner will produce the image quality of a document scanner, so the considerations above are important.


Documents say “Cheese” – Digital Photo OCR

Nov 11

Contrary to popular belief, it will be many years before a digital photograph of a document comes close to the accuracy of a document scan. Yes, there are document scanners today that are based on a mounted digital camera. That is true, but it’s not what I’m referring to. I’m talking about photographing documents with your cell phone or digital camera. One would assume that a photograph of a document taken at the highest possible resolution would eventually replace document scanning, but that is not the whole story. Even your 12 mega-pixel digital camera will not beat a 300 DPI document scan when it comes to document imaging. While it is possible to get better and better digital photographs of documents, there is one major problem in converting them with OCR: the engine has to account for many more variable elements, the most complicated being layers.

When you take a photograph of a document there are potentially several different focal points: a table, a finger, the floor. Some of these focal points can easily be mistaken for the flat surface of a document. The OCR engine has to determine which layer or focal point is the actual document and where its borders are. The way they do this is primarily color detection. In a document scan there is only one focal point, as the document is the entirety of the image, so the OCR engine does not need to guess or modify the image to find it. This increases the accuracy of both document analysis and character reading. The next challenge is perspective.

A digital photograph of a document should be taken head on. Think of the LCD screen on your camera as being on the same plane as the piece of paper. Any variation from this causes distortion where, for example, the top of the document measures a shorter distance from left edge to right edge than the bottom. There are some capture applications for the iPhone and other mobile devices that force you to line the document up within brackets. This forces the capture to focus only on the document and, by virtue of the guide, to know where the borders are, but lining it up is very time consuming. That gets to the final point: time.
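The distortion described above can be put into a number. In the Python sketch below, the four corners of the page in the photograph are given as pixel coordinates, and the function compares the length of the top edge with the bottom edge. The corner values and the function name are made up for illustration, and finding the corners in the first place is the hard part that the capture applications handle.

```python
import math


def edge_ratio(top_left, top_right, bottom_right, bottom_left):
    """Ratio of top edge length to bottom edge length; 1.0 means head on."""
    top = math.hypot(top_right[0] - top_left[0], top_right[1] - top_left[1])
    bottom = math.hypot(bottom_right[0] - bottom_left[0], bottom_right[1] - bottom_left[1])
    return top / bottom


# Camera tilted back: the far (top) edge of the page looks shorter.
tilted = edge_ratio((100, 0), (500, 0), (600, 800), (0, 800))

# Camera parallel to the page: both edges measure the same.
head_on = edge_ratio((0, 0), (600, 0), (600, 800), (0, 800))
```

In the tilted example the top edge is only two thirds as long as the bottom edge, so characters near the top of the page are noticeably smaller than those at the bottom, and the image needs correcting before OCR.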

It would actually take you much more time to capture a 10-page document with a digital camera than with an ADF or sheet-fed document scanner. Because the quality of the photo is so important when running OCR on a digital photograph, it requires a lot of conscious effort: not shaking, lining up the document, and placing the document on a surface that does not contain many layers or focal points. Because of this additional effort, it’s actually not saving any time.

I am a fan of emerging technology as well, but for acquiring paper images and converting them, there is no better way than a portable or traditional document scanner. In time, digital photographs of documents will become a popular way to capture single-page documents for one-off processing, but as long as paper exists, so will document scanners.


OCRing Magazines

Aug 04

Often times when I receive printed periodicals, my preference is to OCR them to a digital search-able format and read the articles I’m interested in on my computer, just like my online periodicals. One of these printed documents might be a magazine. Magazines are either very easy to OCR or very difficult, and usually both cases exist in a single magazine. It all has to do with the graphical elements that are often incorporated in magazines.

Text printed on graphics. Very often articles will have text printed over related graphics. If entire paragraphs are printed over a single graphic, it’s less challenging; but when text overlaps both graphic and white-space, it’s problematic, because a single word will change from colored text to normal black text in order to contrast with the images.

Annotated images. Many magazines, including my favorite scientific one, include text as part of the diagrams in their articles. To many this text may be irrelevant, but to me it provides important search words at the very least. These annotations tend to be in a small font and are often hard for the OCR engine to identify because of their close proximity to images.

The good news is that for the most part the purpose of OCRing any magazine is to make its text searchable. Anything more would probably be illegal. The other good news is that there are tricks to deal with each of these problems. First, a magazine that is being OCRed must be scanned in color. The additional information provided by the color scan will help the OCR engine distinguish graphics from text on graphics. Second, enable full recognition in the engine, along with any settings geared to small fonts. Third, turn off document analysis or enable only limited document analysis. This is the less obvious setting. By disabling document analysis, you don’t allow the OCR engine to get confused by strange structure, text printed on graphics, and annotated images. You are forcing it to read all possible text.

Being that text-searchable is the greatest benefit to OCRing my periodicals, I have opted for the OCR settings that produce the most text and the least structure. If you are converting similar documents, I recommend doing the same.


Why buy what I already own!?

Feb 21

Many people inherit full-page Optical Character Recognition (OCR) technology simply by purchasing a scanner or a multi-function device (MFP). These pieces of hardware all include various software packages, and OCR is one of the most common. Often the software is never used, or its use isn’t clear. Other times, the bundle is tightly integrated with the hardware: the OCR is part of the scanner’s configuration and is used during scanning without the user knowing.

Bundled OCR technology is the easiest way to learn through use and to get the technology for a low price. Bundled software has contributed a great deal to market education and understanding of the advanced technologies. All the top OCR engines have a consumer product bundled with a document scanner or multi-function device. But because it’s already there, it leaves many wondering why you would ever purchase the software directly.

For many, the bundled OCR is sufficient: their documents are clean and they have no need for advanced options. But others just need more, which is why more advanced versions exist. Bundled OCR, even from the best vendors, is either limited or an older version of the product. Some vendors make a special “bundle only version”, while others choose to include non-current versions. Buying the software directly not only gets you the latest technology with the best features; the biggest driver to purchase is a greater, more specific need for OCR functionality. This could be because you are scanning old or degraded documents, or because you need special settings such as compression and PDF/A functionality that are simply not found in bundled versions.

Vendors don’t make any money on bundled OCR beyond covering costs. Because vendors mostly use bundled versions as marketing, they don’t incorporate the latest, greatest, and most advanced features. For those to whom the document conversion process is very important, there is a clear benefit in buying a quality OCR package.


Don’t over clean – the effects of image clean-up on accuracy

Dec 06

There is always some way to modify a scanned image to improve its recognition results if it’s not already perfect. But there are also ways to modify an image to destroy recognition results. Not all image cleanup is good for OCR for several reasons.

There are two types of image clean-up. The first is image clean-up for view-ability. These are the clean-up tricks that make images look even prettier on the screen, where the goal is what is called pixel perfect: the image looks like it was electronically generated. The second is image clean-up for OCR or Data Capture. These are the tricks that give the image better recognition results. All image clean-up for OCR and Data Capture is good for view-ability, but not all image clean-up for view-ability is good for recognition. The reason is that engines were built and trained at a time when many image clean-up technologies were not available, and because recognition technologies interpret pixels, it’s possible to remove useful ones.

Here are some tips. Stick to certain types of clean-up for recognition when this is the primary purpose. Some products and scanners will even allow what is called “dual stream” where one scanned image produces two results that can go separate paths. If you have this function, use settings for one of the images that are best for OCR and settings for the other that are best for view-ability. Good for OCR is:

1.)Despeckle ( unless dot-matrix font )
2.)Line Straightening
3.)Basic Thresholding
4.)Background removal
5.)Correction of Linear Distortion
6.)Line Removal ( sometimes )
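As an example of what the first item does, here is a minimal despeckle sketch in Python on a black and white image, where 1 is a black pixel and 0 is white. It removes any black pixel that has no black neighbors. This rule and the function name are assumptions for illustration; commercial clean-up tools use more sophisticated filters. It also shows why the dot-matrix warning matters: if the dots of a dot-matrix character do not touch each other, a filter like this would erase the text.

```python
def despeckle(image):
    """Remove black pixels (1) that have no black pixel among their 8 neighbors."""
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_neighbor = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbor:
                cleaned[y][x] = 0  # isolated speck: treat as noise
    return cleaned


noisy = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # lone speck
    [0, 0, 0, 1, 1],  # part of a character stroke
    [0, 0, 0, 1, 0],
]
clean = despeckle(noisy)
```

The lone pixel is removed while the connected stroke is left alone, so the recognition engine sees less noise without losing character information.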

Bad for OCR is:

1.)Adaptive Thresholding: Often causes a condition called “Fuzzy Characters”, where “c”s will be read as “e”s. For hand-print you often remove portions of characters.
2.)Character Regeneration: Removes critical information important to OCR and ICR processes. If you use it in OCR ( Machine-Print ) you will notice more “high confidence blanks”: the characters are so perfect that they look like images to the OCR engine and are ignored. In ICR ( Hand-Print ) you will damage the hand stroke of the characters, confusing the ICR algorithms and reducing the training’s ability to understand the subject, which ultimately reduces accuracy.
3.)Line Removal: Bad line removal makes bad OCR. Line fragments really interfere with OCR and ICR processes.

When using imaging for OCR and Data Capture processes, use only the tools that improve recognition rates, not those that destroy them.
