Set it and forget it OCR

Jul 26
2010

My office is a paper monster. Paper comes in and never leaves intact. The scary part is how fast this happens. With paper in hand, I review its contents and assess its value, scan it, and shred it, usually within minutes of its arrival. The value of set it and forget it OCR is tremendous, but you have to be comfortable with it.

Set it and forget it OCR is where you take your OCR product and configure it to automatically process any images that appear in a certain folder. For my office, I scan to an “input” folder, and all the resulting compressed and OCR’ed PDF files end up in the “File Cabinet” folder. My strategy will not work for the timid, because I’m relying solely on the power of OCR text and search to retrieve documents when I need them. Most would rather configure their ADF scanner to have a setting or folder for each particular class of documents. Most document scanners today have as few as 9 and as many as 99 destinations you can program. You can set each destination as its own input folder, with its own OCR settings and its own output folder.
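The one-destination-per-document-class setup can be pictured as a small routing table. Here is a minimal sketch in Python. The folder names, profile names, and the `Destination` type are all made up for illustration and do not come from any real scanner or OCR product.

```python
from dataclasses import dataclass
from pathlib import PurePath


@dataclass(frozen=True)
class Destination:
    """One programmable scanner destination (all values are illustrative)."""
    input_folder: str    # where the scanner drops images
    ocr_profile: str     # hypothetical name of an OCR settings profile
    output_folder: str   # where the searchable PDF should end up


# Hypothetical routing table: scanner destination number -> configuration.
DESTINATIONS = {
    1: Destination("scans/input", "index-level", "File Cabinet"),
    2: Destination("scans/invoices", "high-accuracy", "File Cabinet/Invoices"),
}


def destination_for(image_path: str) -> Destination:
    """Find the destination whose input folder contains the scanned image."""
    parent = PurePath(image_path).parent
    for dest in DESTINATIONS.values():
        if PurePath(dest.input_folder) == parent:
            return dest
    raise LookupError(f"no destination configured for {image_path}")
```

My own setup would be the degenerate case of this table: a single entry that sends everything to the “File Cabinet”.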

I know I can do this because I know what settings it takes to get the quality of OCR I need to have at least one usable keyword on the document for search. And after all, I’m an expert in OCR, so not using it every day would be crazy in its own right. I’ve yet to be proven wrong: my “File Cabinet” abyss has always given me the information I need at the time I asked for it, and sometimes even new information I did not realize I had.

Now, for you records management folks shaking your heads, I understand your complaint. It should not be about my approach but about what I do with the final paper product. Items that your taxonomy deems a record for legal or business reasons should be filed as such, perhaps scanned again as a record, and for heaven’s sake, if you are not supposed to destroy it, don’t!

The purpose of my madness is to touch paper as little as possible, and get information only when I need it. I am an extremist, but I assure you there is serious value, and a little fun in the set it and forget it OCR technique.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Down and dirty paperless office

Jul 11
2010

In my office, paper comes in, is reviewed for value, gets scanned, and is shredded or filed. I have set up a system that allows me to very efficiently scan documents to my “digital file cabinet”. Here is a quick guide on how I do it!

What you will need:

  1. An unused computer attached to your network

  2. Google Desktop Search with network browsing enabled

  3. A document scanner

  4. A server based automatic OCR product

  5. A file compression product ( optional but recommended )

Now to put it all together. My system is an inexpensive desktop computer with Windows XP installed. Once all the applications are installed, you don’t even need a monitor attached to this computer. The computer is visible on the network and has one shared folder, the “File Cabinet” folder in my case. This computer is my standalone digital file cabinet. Attached to it is a document scanner with a 30-page feeder. I have the scanner configured to scan to an “input” directory on the machine.

The automatic OCR processing product is configured to pick up images as soon as they arrive in the input folder (the “hot folder”), OCR them using specific index-level OCR settings, and create a PDF with a hidden searchable text layer. The resulting PDF is put into another hot folder that the PDF compression tool is watching. As soon as a PDF arrives in this folder, it is compressed and the compressed PDF is moved to the “File Cabinet” folder.
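The chain of hot folders described above can be sketched in a few lines of Python. This is only an illustration of the hand-off between folders: `run_ocr` and `compress_pdf` are hypothetical placeholders that copy and move files, standing in for whatever OCR and compression products you actually use, and the `.tif` extension is an assumption about what the scanner produces.

```python
import shutil
import time
from pathlib import Path
from typing import List


def run_ocr(image: Path, out_dir: Path) -> Path:
    """Placeholder for the OCR product: here it only copies the file."""
    target = out_dir / (image.stem + ".pdf")
    shutil.copy(image, target)
    return target


def compress_pdf(pdf: Path, out_dir: Path) -> Path:
    """Placeholder for the compression product: here it only moves the file."""
    target = out_dir / pdf.name
    shutil.move(str(pdf), str(target))
    return target


def process_once(input_dir: Path, ocr_out: Path, cabinet: Path) -> List[Path]:
    """Run one pass over both hot folders and return the files that were filed."""
    for image in sorted(input_dir.glob("*.tif")):
        run_ocr(image, ocr_out)
        image.unlink()                      # the scanned image has been consumed
    filed = []
    for pdf in sorted(ocr_out.glob("*.pdf")):
        filed.append(compress_pdf(pdf, cabinet))
    return filed


def watch(input_dir: Path, ocr_out: Path, cabinet: Path, interval: float = 5.0) -> None:
    """Poll the folders forever; a real product would run this as a service."""
    while True:
        process_once(input_dir, ocr_out, cabinet)
        time.sleep(interval)
```

Polling on a timer is the simplest way to watch a folder. Commercial products may react to file system events instead, but the folder-to-folder flow is the same.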

Because Google Desktop Search is set to index all files in the “File Cabinet” folder, the PDFs very quickly become part of the index. Configure Google Desktop Search to enable network searches so that any machine on the network can open a browser, go to a URL hosted on the digital file cabinet machine, and locate documents with a search.

Once it’s set up, it’s simply a matter of putting paper in the scanner and pressing the scan button, and you’re done. It’s that easy, and extremely useful!

Workflow, super-charge with OCR

Jun 21
2010

Document workflow can be as easy as saving a file to a single location or as complex as decision-tree document routing rules. Throw some paper into the mix and the problem intensifies slightly. Getting your paper documents to fit your already accepted digital document workflow can be challenging. Some organizations choose to keep the paper and digital workflows separate. Others unite them but create separate rules for each. For most, however, it would be ideal to have a single workflow engine or product supporting digital, image, and paper documents alike.

To do so with the greatest value, you need not only document conversion using Optical Character Recognition ( OCR ), but also some other advanced imaging and recognition tools. In the digital document world, you don’t have only the data contained in the document; you also have various metadata items such as file name, file location ( taxonomy ), tags, etc. In order to marry paper with digital, the same metadata has to be created for the paper document, and this has to occur at the time of document processing. This could be a manual or an automated process, and depending on your paper volume, doing it manually may be OK. To compete with the efficiency of digital documents, however, automatic is the way to go.

Using OCR, image-based and contextual-based classification, paper or image documents that enter the workflow can obtain the same value as digital documents. The OCR is responsible for getting all the content from the document. The purpose of this content is for search, indexing, auto-filing, as well as generation of keywords ( tags ) associated with a taxonomy. In order to determine where the document fits into a taxonomy, you must first classify it.

For classification to be most effective, it happens on two levels. Image-based classification looks at what the document looks like: it classifies documents based on their physical structure, which is a good indicator of type, and it is very fast. Contextual classification looks at what words are contained in the document: it goes one level deeper and looks for the keywords that would make a document one type rather than another. For some environments, image-based classification can do the job entirely. Once the classification is known, a classification engine can place the document in the correct spot in an existing taxonomy. Once an ID or classification is determined, it is no challenge to apply tags, file naming, and file location to a document.
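To make the two levels concrete, here is a minimal Python sketch. It assumes an image-based classifier has already reduced the page to a layout label, and it uses a bare keyword count for the contextual level. The layout labels, keyword lists, and folder naming are invented for illustration; real classification engines are far more sophisticated.

```python
from typing import Dict, Optional, Set

# Hypothetical layout signatures an image-based classifier might report.
LAYOUT_TO_TYPE: Dict[str, str] = {
    "table-with-totals": "invoice",
    "letterhead-single-column": "letter",
}

# Hypothetical keywords that tip a document toward one type.
KEYWORDS: Dict[str, Set[str]] = {
    "invoice": {"invoice", "amount", "due"},
    "contract": {"agreement", "party", "term"},
}


def classify(layout: Optional[str], ocr_text: str) -> str:
    """Try image-based classification first, then fall back to keywords."""
    if layout in LAYOUT_TO_TYPE:
        return LAYOUT_TO_TYPE[layout]       # fast path: the look of the page
    words = set(ocr_text.lower().split())
    scores = {doc_type: len(words & keys) for doc_type, keys in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"


def file_location(doc_type: str) -> str:
    """Map a classification to a folder in an (assumed) taxonomy."""
    return f"File Cabinet/{doc_type}"
```

Once `classify` has returned a type, tags, file name, and location all follow from a simple lookup like `file_location`.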

Workflow can stand alone, but injected with the power of OCR and document classification, it becomes a powerhouse that does not know the difference between paper and digital.

Even OCR needs a helping hand – Quality Assurance

Jun 05
2010

Let’s face it. OCR is not 100% accurate 100% of the time. Accuracy is highly dependent on document type, quality of scan, and document makeup. The reason OCR is so powerful is because it’s not. How do we give OCR the best chance to succeed? There are many ways; the one I would like to talk about now is quality assurance.

Quality assurance is usually the final step in any OCR process, where a human reviews uncertainties and business-rule violations based on the OCR result. An uncertainty is a character that the software flags because its recognition confidence did not satisfy a threshold. This process is a balancing act between the desire to limit human time as much as possible and the need to see every possible error, but no more.

We start with the review of uncertainties. Here an operator looks at just those characters, words, or sentences that are uncertain. These are determined by the OCR product, which will have some indicator of what they are. In full-page OCR, spell checking is often used. In Data Capture, a field is usually reviewed character by character, and you don’t see the rest of the results. Some organizations will set critical fields to always be reviewed, no matter the accuracy. Others may decide that a field is useful but does not need to be 100% accurate. Each package has its own variation of “verification mode”. It’s important to know these settings and the levels of uncertainty your documents are showing in order to plan your quality assurance.
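The threshold idea can be shown with a short Python sketch. It assumes the OCR product hands back each character with a confidence between 0 and 1. Real packages report confidence in their own ways, so the data shape, the 0.90 default, and the function names here are all assumptions.

```python
from typing import List, Tuple

# Each recognized character paired with a confidence between 0 and 1.
OcrChar = Tuple[str, float]


def uncertain_positions(chars: List[OcrChar], threshold: float = 0.90) -> List[int]:
    """Return the indexes of characters whose confidence is below the threshold."""
    return [i for i, (_, conf) in enumerate(chars) if conf < threshold]


def needs_review(chars: List[OcrChar], threshold: float = 0.90,
                 always_review: bool = False) -> bool:
    """A critical field is always reviewed; others only when something is uncertain."""
    return always_review or bool(uncertain_positions(chars, threshold))


# A made-up four-character field where the third character was hard to read.
field = [("1", 0.99), ("9", 0.97), ("8", 0.62), ("4", 0.95)]
print(uncertain_positions(field))   # [2]
```

Raising the threshold sends more characters to the operator, and lowering it sends fewer, which is exactly the balancing act between human time and missed errors.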

After the characters and words have been checked in Data Capture, there is an additional step in quality assurance: business rules. In this process, the software applies rules that the organization creates and checks them against the fields. A good example might be “don’t enter anyone in the system whose birth year is earlier than 1984”. If such a document is found, it is flagged for an operator to check. These rules can be endless, and packages today make it very easy to create custom rules. The goal would be to first deploy the business rules you already have in place in the manual operation, and then augment them with rules that enhance accuracy based on the raw OCR results you are seeing.
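A business rule is just a named check that a captured record either passes or fails. Here is a minimal Python sketch using the birth year example from above. The field names and the second rule are invented for illustration, and real Data Capture packages have their own rule editors.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Rule = Callable[[Record], bool]   # returns True when the record passes

RULES: Dict[str, Rule] = {
    # The example rule from the text: flag anyone born before 1984.
    "birth year is 1984 or later": lambda r: int(r["birth_year"]) >= 1984,
    # A second, made-up rule for illustration.
    "last name is present": lambda r: bool(r.get("last_name", "").strip()),
}


def failed_rules(record: Record) -> List[str]:
    """Return the names of all rules the record fails, for operator review."""
    failures = []
    for name, rule in RULES.items():
        try:
            passed = rule(record)
        except (KeyError, ValueError):
            passed = False        # a missing or unreadable field also gets flagged
        if not passed:
            failures.append(name)
    return failures
```

Notice that a misread such as “19B4” fails the rule as well, so rules like this catch raw OCR errors in addition to genuine business exceptions.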

In some more advanced integrations, a database or body of knowledge is used as a first round of quality assurance that is still automated.

These two quality assurance steps combined should give any company a chance to achieve the accuracy they are seeking. Companies who fail to recognize or plan for this step are usually the ones that have the biggest challenges using OCR and Data Capture technology.

The Magic of 300DPI

Jun 02
2010

Many users of OCR don’t realize what the impact of resolution and bit-depth is or even what they are. Usually in the case of OCR, more is better. More resolution, more bit-depth. It’s more information the OCR engine can use to interpret text. But as with many things, there is a point of diminishing returns and when relating to image resolution, diminishing returns are very interesting.

You will hear a lot that 300 DPI is the best resolution at which to scan an image for OCR. But why? 300 DPI is that magic number where you gain the most accuracy without sacrificing speed and file size. If you were to put the resolutions on a progressive line starting with 96 DPI and run tests of OCR accuracy, scanning speed, OCR speed, and file size, you would notice something very interesting: the improvement gap between a 200 DPI scan and a 300 DPI scan will be at least 2 times the improvement gap between any other pair of resolutions. Now if you look at the same line between 300 DPI and 400 DPI, the improvement gap is nearly absent, but still there. This simple study is the reason 300 DPI is the ideal resolution for OCR scanning. Now let’s look at why.
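The cost side of that trade-off is easy to work out yourself. The Python sketch below only computes pixel counts and uncompressed sizes for a US letter page. It says nothing about accuracy, which you would have to measure on your own documents. The page size and the 1-bit default are assumptions chosen for the example.

```python
def page_pixels(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> int:
    """Number of pixels in a scanned page at the given resolution."""
    return round(width_in * dpi) * round(height_in * dpi)


def raw_size_bytes(dpi: int, bits_per_pixel: int = 1) -> int:
    """Uncompressed image size; 1 bit per pixel is a black and white scan."""
    return page_pixels(dpi) * bits_per_pixel // 8


for dpi in (96, 200, 300, 400, 600):
    print(dpi, page_pixels(dpi), raw_size_bytes(dpi))
```

Because the pixel count grows with the square of the resolution, going from 300 DPI to 600 DPI quadruples the data the scanner and the OCR engine have to handle.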

Besides the reasonable scan speed and file size, the biggest reason 300 DPI is optimal is that the engine cores were all initially trained at this resolution. Some engines, no matter what resolution you give them, will actually sample up or down to get to 300 DPI. The image pre-processing/cleanup engines are set up similarly.

There are always exceptions, and the exceptions are usually hand-printed forms ( ICR ) or documents with small print.

The beauty of 300 DPI as a best practice is that it’s one of the few things in the area of OCR and Data Capture that is consistent across document types. You have been told to use 300 DPI, and now you know the reason behind it.

“You vote engines! Of course it’s better” – Reality of voting

May 15
2010

The trend of companies promoting OCR voting has become less common, but you will still occasionally find products that promote their accuracy by saying they don’t just use one engine, they use many and vote them together. The presumption of this approach is that of course they are more accurate than single-engine solutions. This would seem to be the case, but it’s not that easy.

All the OCR engines have a system of voting internally already. This is how OCR technologies have made their advances throughout the years. They take algorithms that are expert in one particular way of interpreting text, such as trigrams, words, fonts, etc., and vote their character guesses against each other for the final guess. This works great. It is very different from the voting that is often promoted, which takes several engines and votes their results together. When you take two separate OCR engines and vote them together, it would seem you are getting the best of what’s available, but there is one major problem. Voting requires that each engine guess and report confidence the same way, and this is not the case. For example, Engine A might report with 98% confidence that the letter “c” is actually an “e”, while Engine B might report with 78% confidence that it is a “c”. When you vote these two, Engine A will win even though it’s wrong. This is typically how it goes: one engine in a voting scenario will win most of the time, right or wrong, just because of how it reports its confidence levels.
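The example is easy to reproduce. This Python sketch uses the simplest possible vote, where the highest self-reported confidence wins. That is an assumption about how a naive multi-engine product might combine results, not a description of any particular one.

```python
from typing import List, Tuple

Guess = Tuple[str, float]   # (character, self-reported confidence from 0 to 1)


def naive_vote(guesses: List[Guess]) -> str:
    """Pick the guess with the highest self-reported confidence."""
    return max(guesses, key=lambda g: g[1])[0]


# The example from the text: the printed letter is really a "c".
engine_a = ("e", 0.98)   # wrong, but very sure of itself
engine_b = ("c", 0.78)   # right, but reports a lower confidence
print(naive_vote([engine_a, engine_b]))   # e
```

The two confidence numbers are on different scales, so comparing them directly rewards the engine that reports the most optimistic figures rather than the one that is correct.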

This blog is not in combat with voting. Voting is a great tool, it’s used internally in the engines, and it can be used externally as well. How? Vote Engine A settings A against Engine A settings B. The same engine voted against itself just with different settings. This is a tremendous tool especially when dealing with varied documents, or highly degraded documents. By doing so you are comparing apples-to-apples confidence levels and not apples-to-elephants.

So next time you are turned on by voting, take a second look and see if it’s a marketed value or a real one.

Getting used to document scanners

Apr 15
2010

When individuals or companies first get involved in document scanning, the first major lesson is in scanners themselves. In the past when you talked about scanners, the visual was of flatbed photo scanners. More and more, document scanning trumps the volumes of photo scanning, and it is the most common use of scanning. Therefore, document scanners were developed instead of flatbed photo scanners.

The differences between a document scanner and a flatbed scanner are substantial. The first big difference is that you are now scanning multiple pages, and the pages are placed upright. This increases the efficiency of doing a scan. Documents are placed in an Automatic Document Feeder (ADF). ADFs hold 15 or more pages at a time. The next major difference with a document scanner is the use of dual lamps. Although you can purchase document scanners that have one lamp and can scan only one side of a document, most scanners are duplex, which means they have two lamps and can scan both sides of the document at the same time.

Document scanning comes with a much larger set of features. Just as with a photo scanner, you can select the resolution, which for documents should not be less than 300 DPI and not more than 600. But document scanners also have additional features, the most common of which are auto-rotation of upside-down pages and blank page removal. Blank page removal is especially useful when you always scan in duplex, so that you can omit any pages that have no content.
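To give a feel for how blank page removal can work, here is a minimal Python sketch. It treats a page as a grid of grayscale values and calls it blank when almost no pixels are dark. Scanner firmware and drivers use their own methods, so the threshold values and the function names are assumptions for illustration only.

```python
from typing import List

Page = List[List[int]]   # grayscale rows, 0 = black and 255 = white


def is_blank(page: Page, dark_below: int = 128, max_dark_ratio: float = 0.005) -> bool:
    """Treat a page as blank when almost none of its pixels are dark."""
    total = sum(len(row) for row in page)
    if total == 0:
        return True
    dark = sum(1 for row in page for px in row if px < dark_below)
    return dark / total <= max_dark_ratio


def drop_blank_pages(pages: List[Page]) -> List[Page]:
    """Keep only the pages that appear to have content."""
    return [page for page in pages if not is_blank(page)]
```

The small tolerance in `max_dark_ratio` lets a speck of dust or a punch hole shadow pass as blank, while a page with even a single line of text is kept.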

There is a wide range of document scanners, from the desktop on up to high volume. The prices of the tiers (desktop, departmental, and production) vary substantially. They will all demonstrate the basic functionality of document scanning, but as you move up the tiers more and more functionality is added, as well as capacity and speed.

I can’t imagine not having a document scanner now, as I use it daily, and I believe that in the not so distant future it will be the same in most environments, from the home office to the office.

Compression – Not for saving for optimizing

Apr 01
2010

The first thing people think of when investigating compression technologies is, “How can I save space?” For advanced users, and some companies, compression is not necessarily for saving space, but for optimizing it. If you calculate the amount of time spent waiting for emails to download, opening large files, and searching, you will start to realize that compression plays a big role in workers’ efficiency.

The type of compression that I’m discussing here is file specific compression.  These are compression technologies that operate on single file types, and have special algorithms to reduce the size of those file types.  The two most common examples are JPEG image files and PDF files.  Using type specific compression has the benefit of being able to manipulate the files as you would normally.  The opposite of type specific is compression technologies such as Zip or Tar.  Here you have to uncompress the files before utilizing them.
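The difference is easy to see with Python’s standard `zipfile` module. The sketch below puts a stand-in PDF (just a PDF header followed by filler bytes, not a real document) into a Zip archive. The archive no longer looks like a PDF to anything that checks the file signature, and the original bytes only come back after extraction.

```python
import io
import zipfile

pdf_bytes = b"%PDF-1.4\n" + b"0" * 2000   # stand-in for a real PDF file

# Archive-style compression: wrap the file in a Zip container (in memory here).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("scan.pdf", pdf_bytes)
zipped = buffer.getvalue()

# A viewer or indexer looking for the PDF signature will not find it.
print(zipped.startswith(b"%PDF"))   # False

# The file has to be extracted before anything can use it as a PDF.
with zipfile.ZipFile(io.BytesIO(zipped)) as archive:
    restored = archive.read("scan.pdf")
print(restored == pdf_bytes)        # True
```

A type specific compressor, by contrast, writes out a file that is still a valid PDF, so the extraction step never exists.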

Because the file types are left intact with type specific compression, it means that you can email the files after compression, search engines can index them, and they can be opened in your typical viewer.  The reality is that hard drive space is cheap and adding more is relatively easy.  So for some, compression is more about efficiency.  With proper compression, emails are sent and received faster, search engines crawl faster and indexes are smaller, and opening large files takes less time.

This is not to diminish the use of compression to save space in an ever increasing data collection world.  The purpose of this article is to highlight the other and substantial benefits of type specific file compression.  The trick now becomes finding the right compression tools that create high quality compressed files that are compatible with typical file browsers.

Know your accuracy before you even test

Mar 25
2010

One of the natural abilities that develops as you see millions of sample images and their associated recognition results is that you begin to notice patterns and can instantly identify whether a document will read well, both for full-page document conversion and for field-level recognition. It has more or less become a natural ability of mine, but I can identify its components.

First is initial image quality. Without yourself identifying any objects on the page, look objectively at the document as a collection of questionable objects and see if you think the image quality is good. This is determined by coherence of each object. Are object borders tight and determinable? Are there objects interfering with other objects? Is the background of the image significantly different than all objects?

Second is identification of objects. Find text, graphics, lines, paragraphs, etc. Are their borders far enough apart? Is their type clear? This is most important for text. Is their printing consistent? For example, if text goes from one background color to another, that makes it inconsistent. As another example, does the straightness of lines change throughout the document? And can one object be confused for another?

And third, now that you know the objects, how easy is it to determine their value? Is the value obvious? Do you have to look at it for a while to figure it out?

Essentially the three above steps are exactly what the conversion ( OCR, ICR, OMR ) product does in order to read a document. With field level recognition it’s a bit more elaborate, but the core is the same. By identifying early on what the anticipated accuracy is of a document, you can then adjust your scan, or input settings accordingly even before looking at any technology. Doing this will give the best chance for success.

There is OCR and then there is Formatting

Mar 18
2010

What is the greatest difference between the most accurate Optical Character Recognition ( OCR ) products and the least? It might not be what you think. The greatest improvements in OCR in the last 10 years have not been so much in character-level recognition; they have been more about how the engines understand the structure of documents. This is called document analysis. Theoretically, if you were to compare two engines that had identical character recognition, but engine A had document analysis and engine B did not, engine A would win.

Document analysis is first how the engine breaks apart components of a document such as paragraphs, lines, columns, graphics, etc. Without this, the engine is OCRing blind, and its assumption is that every object it encounters is text. This sometimes leads to clumping of lines, or OCR of graphics. The second aspect of document analysis is the delivery of formatting in the export that matches the formatting in the document. This can also include font style and color.

With traditional documents you can expect that products with document analysis will get the formatting spot on. This is very important, not only for editing and re-purposing, but also for keeping the readability of a document. Another aspect of document analysis is to determine reading order. For example, if you have a multi-column, multi-paragraph page, the software has to decide in what order the paragraphs are read. This is useful during recognition, but also in case a formatted document is converted to a flatter file structure such as a TXT file, where the order stands a chance of being confused.
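Reading order is a good candidate for a small illustration. The Python sketch below assumes the engine has already found text blocks and knows the position of each one. It groups blocks into columns by their left edge and then reads each column from top to bottom. The `Block` type and the 50 pixel column tolerance are assumptions, and commercial document analysis handles far messier layouts than this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    text: str
    left: int   # x position of the block's left edge, in pixels
    top: int    # y position of the block's top edge, in pixels


def reading_order(blocks: List[Block], column_gap: int = 50) -> List[Block]:
    """Order blocks column by column (left to right), top to bottom in each."""
    columns: List[List[Block]] = []
    for block in sorted(blocks, key=lambda b: b.left):
        if columns and block.left - columns[-1][0].left <= column_gap:
            columns[-1].append(block)       # close enough: same column
        else:
            columns.append([block])         # starts a new column
    return [b for col in columns for b in sorted(col, key=lambda b: b.top)]
```

Sorting the same blocks only by their top edge would interleave the two columns line by line, which is exactly the kind of confusion that shows up in a flat TXT export.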

The reality is that for clean documents, character-level recognition is not getting any better; it’s already amazingly accurate today. The opportunity to improve is in document analysis and language morphology, but that is another post.
