eDiscovery and OCR

Mar 23

I touched on this topic briefly in one of my previous posts, but because of eDiscovery’s popularity I thought it was fitting to look at how OCR interacts with eDiscovery preparedness. Organizations that are not ready for audits and court orders to deliver documents are spending tremendous amounts of money to undo bad document processes. Because of this, preparing for possible future legal events is critical and a long-term cost saver.

The purpose of OCR technology in eDiscovery readiness rests on the principle of having as much data at your fingertips as possible. Being ready depends heavily on records management policies and a good taxonomy that is strictly followed. Because of this, OCR is sometimes overlooked as a tool. With those practices in place, it should be possible to pull up any document at any time. However, OCR should be viewed as an insurance policy: OCRing every document you have gives you even more information than you would have otherwise, and information is the key to success in these situations.

eDiscovery also covers other types of data, email being one of the most popular. But what about the data contained in email attachments that are PDF, TIFF, or JPEG files? OCR is the only tool that can extract the data from the images in these formats. Surprisingly, products that provide eDiscovery tools just for email still do not heavily deploy OCR technology, even though the information contained in these attachments is often as valuable as the emails themselves.
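
As a rough illustration of the attachment point, here is a minimal sketch that uses Python's standard `email` package to list the attachments in a message that would need OCR before their contents become searchable. The content-type list and the sample message are my own assumptions for the example, and a PDF may of course already contain real text.

```python
from email.message import EmailMessage

# Attachment types whose text may exist only as pixels (assumed list).
IMAGE_TYPES = {"application/pdf", "image/tiff", "image/jpeg"}

def attachments_needing_ocr(message):
    """Return file names of attachments whose content type suggests OCR is needed."""
    names = []
    for part in message.iter_attachments():
        if part.get_content_type() in IMAGE_TYPES:
            names.append(part.get_filename())
    return names

# Build a small sample message to exercise the function (made-up data).
msg = EmailMessage()
msg["Subject"] = "Signed contract"
msg.set_content("See attached.")
msg.add_attachment(b"%PDF-1.4", maintype="application", subtype="pdf",
                   filename="contract.pdf")
msg.add_attachment(b"II*\x00", maintype="image", subtype="tiff",
                   filename="scan.tif")
msg.add_attachment("notes", subtype="plain", filename="notes.txt")

print(attachments_needing_ocr(msg))
# prints: ['contract.pdf', 'scan.tif']
```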

In addition to all the traditional records management practices and eDiscovery tools, OCR should be considered a must-have for organizations preparing themselves for audits or court orders. Sometimes even more importantly, it helps them know what to omit.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Let the OCR do the talking for you

Feb 08

I’ve covered various interesting and non-conventional uses of OCR. I would like to talk about a new one: OCR to speech. The blind community is familiar with this technology, and it assists them in their everyday lives. The key to OCR to speech is simplicity. When the concept was first developed, it required an elaborate combination of software and hardware. Now it’s possible to take the latest and greatest OCR technology and make it talk for you with a simple configuration.

It requires a document scanner with an easy physical button interface, programmed to scan an image at 300 DPI to a folder on a machine. Traditional documents work very well for OCR to speech, whereas documents that have a lot of graphics and nontraditional formats may be more challenging. It’s important that the technology is able to omit garbage. To do this, the OCR process should be driven by a dictionary: the words recognized must be in this dictionary or they will not show up in the final results. The reason is that a lot of time can be wasted if bad recognition results are spoken.
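
The dictionary idea can be sketched in a few lines of Python. This is only an illustration applied to text after recognition. Commercial OCR engines typically apply their dictionaries inside the recognition step, and the function name, word list, and sample text here are made up.

```python
import re

def keep_dictionary_words(ocr_text, dictionary):
    """Drop every token from the OCR output that is not a known word."""
    kept = []
    for token in ocr_text.split():
        # Compare on letters only so trailing punctuation does not cause a miss.
        core = re.sub(r"[^a-z']", "", token.lower())
        if core and core in dictionary:
            kept.append(token)
    return " ".join(kept)

dictionary = {"the", "invoice", "is", "due", "on", "friday"}
raw = "The inv0ice invoice is ~~|| due on Fr1day Friday."
print(keep_dictionary_words(raw, dictionary))
# prints: The invoice is due on Friday.
```

Only the surviving words would be handed to the text-to-speech step, so misreads such as "inv0ice" are never spoken.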

Once the OCR engine has done its job of accurately and automatically converting an image to text, the ASCII text results are saved into a directory. Now it’s time to automatically turn the text into speech. There are many text-to-speech applications out there, some free, some paid. The goal is to find one that reads results from a directory and automatically speaks the text over the computer speakers.

It can be that easy! Some users of such technologies spend more time trying to find an acceptable digital voice than actually configuring the solution. I assure you the packages exist and, when configured correctly, are very accurate. One scanner, one hot-folder-driven OCR application, and one hot-folder-driven text-to-speech application will give you a robust OCR-to-speech solution that can be set up in minutes.

Digital Ink – it’s not OCR or ICR

Feb 03

Digital ink is the approach of having a touch screen device monitor a user’s movements with a stylus on the screen to determine which character was written. This is not OCR or, more specifically, ICR. Very often companies have asked for OCR technology when they meant digital ink, and vice versa. OCR and digital ink overlap, but not always. There are cases where you simply cannot do away with paper, not to mention that digital ink does not process typed text.

Many people first saw the technology when Apple released the Newton, one of the first PDAs with a touchscreen and stylus. Palm’s handhelds later made it mainstream. At that time you had to re-learn how to write characters according to a guide. The characters were specifically structured to provide the best recognition and had to be completed in a single stroke. When mastered, the recognition was very good. Now any tablet PC has a basic version of digital ink software. Digital ink competes with ICR (intelligent character recognition, or hand-print recognition). Whereas ICR technology looks at an image of the written characters, digital ink monitors hand strokes as the character is being written.

The accuracy difference between the two is an argument that either side can very easily lose. There are times when digital ink is far more accurate and times when ICR of paper forms is more accurate. The key really is the business process that the technology fits into. Both have their place. Digital ink is usually combined with an elaborate data entry and content management process. Most often digital ink is not about getting a substantial amount of text from the operator but about the operator answering quick, simple questions, usually requiring no writing at all. The number of characters entered in a digital ink scenario is many times smaller than in an ICR form scenario. You will not see tablet PCs sent out in the mail to survey a customer base.

The biggest place digital ink is used today is in health care, where the drive is to increase its adoption even more. The purpose of the technology in this space is to rapidly populate medical records at the point of examination. However, health care remains one of the top paper-generating industries requiring OCR and ICR. This shows that the two technologies satisfy very different needs and should not be confused with each other.

Not quite as fun as the DMV

Jan 25

Understanding the different licensing available for data capture and OCR products can sometimes be difficult, but I assure you that the complexities involved will not be as painful as a trip to your local department of motor vehicles. There are a few aspects of licenses that trip up some users, namely license type (dongle or serial number), the activation process, and page counts.

License type can be very important but is not often clearly explained. The most common license type out there is the “software license”. This is a license file tied to a specific machine. The benefit of such a license is that it’s more efficient and easier to install on servers and hardware that are not local. The downside is that because it is tied to a machine, if the machine dies you may have downtime while waiting for a replacement and proving destruction, or you may have to purchase a new license. Another very common license type is a hardware dongle. Dongles now are most often USB devices very similar to the USB thumb drives we are all used to. The benefit of this type of license is that the software can be installed on every machine in the organization, but only the machine with the dongle plugged in can run it. This means that if something happens to one machine, it is very easy to switch to another. The downside is that the dongle can be lost, and it’s not the most efficient approach. Whatever license type you have, you will need to go through the activation process.

Activation can be troublesome for some products and very simple for others. The difference is usually the installer’s effort in understanding the activation process BEFORE any installation. For many of these products activation has as many as 3 steps, usually in the form of sending an activation request, receiving an activation file, and installing the activation file. Web activation is becoming more popular, but because of the premium on some advanced data capture products, these steps are still required. Now, with an activated license, the most important question: what does a license give you?

Licenses usually set general operation rights, purchased add-ons if they exist, and very commonly a page count. Page count is the biggest point of contention for almost any purchaser. Because of this, almost all vendors offer an unlimited page-count license for a premium. In the end, most companies end up with a page-count license and are quite happy. The argument I would like to pose is that a piece of hardware inherently has a page count, as each piece of hardware can only physically process a certain number of pages a day, month, or year. For this reason page count is actually quite reasonable, but it is a slowly dying trend. In the future I expect to see far fewer page-count licenses. For most businesses pages are counted on a monthly basis, but some seasonal companies may elect an annual or pure page count.
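
To put numbers on the hardware argument, here is a small back-of-the-envelope sketch. The scanner speed, working hours, and license sizes are invented for illustration, so plug in your own figures.

```python
def hardware_page_ceiling(pages_per_minute, hours_per_day, days_per_month):
    """Upper bound on pages one scanner can physically feed in a month."""
    return pages_per_minute * 60 * hours_per_day * days_per_month

def license_is_binding(license_pages_per_month, ceiling):
    """True when the license, not the hardware, is the real limit."""
    return license_pages_per_month < ceiling

# Assumed example: a 30 page-per-minute scanner run 6 hours a day, 21 days a month.
ceiling = hardware_page_ceiling(pages_per_minute=30, hours_per_day=6,
                                days_per_month=21)
print(ceiling)                               # 226800
print(license_is_binding(250_000, ceiling))  # False
print(license_is_binding(100_000, ceiling))  # True
```

In this made-up case a 250,000 page monthly license could never be exhausted by the one scanner, while a 100,000 page license would be the real constraint.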

License structure is important to ALL organizations and I encourage companies to spend the time during the discovery phases of technology acquisition to investigate the structures that are available from each vendor and how that may work in your environment.

Fixed, Semi-structured, UNSTRUCTURED!?

Jan 13

I find myself educating even industry peers on the topic of document type structure more and more recently. Often the conversation starts with one of them telling me about how unstructured document processing exists, OR the fact that a particular form is fixed when it is not. Understanding what is meant when talking about document structure is very important.

First, let’s start by defining a document. A document is a collection of one or many pages that has a business process associated with it. Documents of a single type can vary in length, but the content contained within, or the possibility of it existing, is constrained. When data capture technology works, it works on pages, so each page of a document is processed as a separate entity, and this, it seems, is the meat of the confusion.

Often someone will say a document is unstructured. What they are thinking of is that the order of pages is unstructured. This is more or less accurate, but the pages within this “unstructured” document are either fixed or semi-structured. The only truly unstructured documents that exist are contracts and agreements. Here is how you can tell: if at any moment you can pull a page from the document and state what that page is and what information it should contain, then it IS NOT unstructured.

The ability to process agreements and contracts is limited to very concrete scenarios where contract variants are non-existent, which essentially means they are no longer unstructured. In general, the ability to process unstructured documents does not exist. Now let’s explore the difference between semi-structured and fixed.

It’s actually very easy, because 80% of the documents that exist are semi-structured. Even if a field appears in the same general location on every page of a particular type, that does not make it fixed. For example, a tax form always has the same general location to print the company name. The printer has to print within a specified range, but it can print more to the left or more to the top, and the length will vary with every name entered. This makes it semi-structured, and in addition the document will shift left, right, up, or down by small amounts when it is scanned. A document is ONLY truly a fixed form when it has registration marks and fields of fixed location and length. Registration marks are how the software matches every image to the same set of coordinates, making it more or less identical to the template.
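
To make the registration idea concrete, here is a minimal sketch that handles a pure shift only. It assumes a single registration mark whose position is known on the template and has been detected on the scan. Real products also correct rotation and scale using several marks, and all coordinates below are invented.

```python
def registration_offset(template_mark, detected_mark):
    """Shift (dx, dy) that moves the scanned image onto the template."""
    return (template_mark[0] - detected_mark[0],
            template_mark[1] - detected_mark[1])

def to_image_coords(field_box, offset):
    """Map a template field box (left, top, right, bottom) into the scanned image."""
    dx, dy = offset
    left, top, right, bottom = field_box
    return (left - dx, top - dy, right - dx, bottom - dy)

# The mark sits at (50, 50) on the template but was found at (57, 46) on the
# scan, so the page shifted 7 pixels right and 4 pixels up during scanning.
offset = registration_offset((50, 50), (57, 46))
print(offset)                                         # (-7, 4)
print(to_image_coords((100, 200, 400, 240), offset))  # (107, 196, 407, 236)
```

On a fixed form that shifted box is all the software needs to read the field. On a semi-structured page no such fixed box exists, which is the distinction made above.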

There again the confusion is exposed. It’s very important to understand when having conversations about data capture to understand the true definitions of the lingo that is used. I task you, if you catch someone using the lingo incorrectly, it will help you and them to correct it.

The little secrets to OCRing large maps and drawings

Jan 06

Occasionally the need to convert large documents such as maps and engineering documents comes along. Many times the OCR requirement is limited to a small subset of fields and clearly defined, but when it comes to converting the entire document to get as much text as possible there are many things you need to consider.

First, if you already have the ability to scan large format drawings, or are receiving images of them, congratulations, as this can be one of the biggest challenges. Scanning large format documents requires either a large format scanner or stitching of partial scans (less preferred). Because these documents have small fonts, it’s important to scan at 300 to 400 DPI. For maps, because of the amount of graphics, drop-out of all colors would be ideal, or a thresholded black and white scan that leaves mostly text in the image.

The purpose of OCR for most of these documents is indexing and searchability, so the goal is to get as much text as you can. For maps, with a good scan you should be able to get the majority of the text except for names printed on a curve. Running line straightening on these might work, but it will more likely hurt recognition of the rest of the map, so I would recommend avoiding it. Prior to OCR, set your OCR engine to disable auto-rotate, because there are a lot of things on these documents that can cause a mis-rotation, namely text printed in every direction.

Now to the secret: it has to do with rotation. Depending on the layout of the drawing or map, if you OCR the document at every 90 degrees, then after completing a full 360 degrees you will have the majority of the text. That is right, I’m suggesting that you OCR the document 4 times, hopefully in an automated fashion. Now this might leave you thinking that you will end up with a lot of garbage, and you are right. But you can simply run the final OCR result against a dictionary to remove all garbage text.
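
The four-pass idea can be sketched as follows. The `rotate` and `ocr` arguments are placeholders for whatever your imaging toolkit and OCR engine API provide. The fake versions below exist only so the example runs on its own, and the readings are invented.

```python
def ocr_all_rotations(image, rotate, ocr, dictionary):
    """OCR the image at 0, 90, 180 and 270 degrees and keep dictionary words only."""
    found = set()
    for angle in (0, 90, 180, 270):
        text = ocr(rotate(image, angle))
        for token in text.split():
            word = token.strip(".,;:()").lower()
            if word in dictionary:
                found.add(word)
    return sorted(found)

# Stand-ins for a real rotation routine and OCR engine (hypothetical).
def fake_rotate(image, angle):
    return (image, angle)

def fake_ocr(rotated):
    _, angle = rotated
    readings = {0: "Main Street xq#", 90: "River l|l Road",
                180: "jj~ tqq", 270: "Elm Avenue"}
    return readings[angle]

dictionary = {"main", "street", "river", "road", "elm", "avenue"}
print(ocr_all_rotations("map.tif", fake_rotate, fake_ocr, dictionary))
# prints: ['avenue', 'elm', 'main', 'river', 'road', 'street']
```

Collecting the words into a set means a label read correctly in more than one pass is indexed only once.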

The end result is a map or drawing with the greatest amount of index-level text possible. I admit that I made it sound a little easier than it is, and most likely you will require an API to get the full job done, but the possibility exists and it has proven successful.

Not all Documents are Equal – OCRing Newspapers

Dec 23

There are several document types, both for full-page OCR and for data capture, that require special attention and configuration. For full-page OCR (extraction of all the text on a document), newspapers are one of these and pose some interesting challenges. When considering the OCR of these types of documents, you need to change how you think about the document itself.

When you open a page of any newspaper you likely consider the document as a whole, while your brain is picking apart the pieces. This is the key to OCRing newspapers. The biggest challenge facing companies that want to convert newspapers to text using OCR is the layout. Although the font in newspapers is usually pretty small, it can often be scanned at a quality where the raw OCR accuracy is very high. Newspapers have their own structure: page headings, section headings, article titles, article sub-titles, bylines, articles, and then footers. Not only that, but articles can span pages.

When converting a newspaper, most of the effort should be spent on proper zoning. Because the document analysis tools built into OCR engines are tuned to the average document (which newspapers are not), they will accurately find columns and paragraphs, but the key is to find the titles and bylines and to be able to separate articles. Most large service bureaus processing newspapers at high volumes have a manual zoning process followed by a single OCR read, which produces very accurate results, all because the zoning was done properly. Others have devised a two-pass OCR system that essentially zones documents twice, narrowing the focus at each step and increasing zoning accuracy and thus OCR accuracy. This solves read accuracy but not page continuations.

Page continuations are most often handled after OCR with a business rule applied to the OCR result. Metadata from the OCR results should indicate which page the text came from, so by finding the words “continues on” at the bottom of any given article you can concatenate its continuation to it for final presentation. A part of this rule is an article count and an article portion count: by the end you should have 0 portions and only articles. If you have low confidence in the merging of articles, you can simply merge the result and review the remaining portions, and your accuracy will increase.
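
Here is one way such a business rule might look, as a rough sketch only. It assumes zoning has already produced article portions tagged with a page number and a headline key, that continuations are flagged as such, and that the marker reads “continues on page N”. All of these are assumptions for the example, and only a single jump per article is handled.

```python
import re
from dataclasses import dataclass

CONTINUES = re.compile(r"continues on page (\d+)\s*$", re.IGNORECASE)

@dataclass
class Portion:
    page: int
    headline: str       # key used to match a start with its continuation
    text: str
    is_continuation: bool = False

def merge_continuations(portions):
    """Return (articles, leftover_portions) after joining continued articles."""
    # Index continuation portions by the page they sit on and their headline key.
    pending = {(p.page, p.headline): p for p in portions if p.is_continuation}
    articles = []
    for p in portions:
        if p.is_continuation:
            continue
        text = p.text
        match = CONTINUES.search(text)
        if match:
            target = pending.pop((int(match.group(1)), p.headline), None)
            if target is not None:
                # Drop the marker and append the rest of the article.
                text = text[:match.start()].rstrip() + " " + target.text
        articles.append((p.headline, text))
    return articles, list(pending.values())

portions = [
    Portion(1, "budget", "The council met Tuesday. Continues on page 4"),
    Portion(1, "weather", "Rain is expected."),
    Portion(4, "budget", "The vote passed 5 to 2.", is_continuation=True),
]
articles, leftovers = merge_continuations(portions)
print(len(articles), len(leftovers))  # 2 0
```

Anything left in `leftovers` is a portion that could not be matched to an article, which is the pile you would send to manual review.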

OCRing newspapers has its challenges, not to mention the difficulty of scanning them, but it’s possible and can be very accurate with the right state of mind and the right approaches.

It learns right? – The misconception about recognition learning

Dec 16

Because of the way the market has come to understand OCR (typographic recognition) and ICR (hand-print recognition), it is no surprise that some of the most common questions and expectations about the technology seem to come straight from a tarot card. I previously talked about one of these questions, “How accurate is it?”, and how its premise is completely off and can come to no good. Here is a similar one: “It learns, right?” It is quite a loaded question, so let’s explore.

Learning is the process of retaining knowledge for subsequent use. Learning is based in the realm of fact: following the same exact steps creates the same exact results. OCR and ICR arguably learn every time they are used. For example, engines will do one read and then go back and re-read characters with low confidence values, using patterns and similarities they identified on that single page. This happens at the page level, and after the page is processed the knowledge is gone. This is where the common question comes in. What people expect is that when the OCR engine makes an error on a degraded character and the error is later corrected, that character will never be misread again. If this were true, you would expect the solution to reach 100% accuracy once all the possible errors had been seen.

WRONG! The technology does not remember between sessions, and this is also the reason it works so well. Imagine, for example, a forms processing system that processes surveys all generated by a single individual (the same applies to OCR), often enough that it has learned all possible errors and reached 100%. Then you start processing a form generated by a new individual. Your results on both the first form type and the new one will likely be horrendous, not because of the recognition capability but because of the supposed “learning”. In this case learning killed your accuracy as soon as any variation was introduced.

What most people don’t realize is that characters change. They change based on paper, printer, humidity, handling conditions, etc. In the area of ICR this is exaggerated, as characters from a single individual change by the minute based on mood and fatigue. So learning is a misnomer, as what you are learning is only one page, one printer, one time, and one paper that will likely never repeat. A successful production environment allows as much variation as possible at the highest accuracy, and this is not achieved with this type of learning.

Things that can be learned: as I said before, a single pass of a page can be followed by a second pass over low-confidence characters using patterns learned on that page. In the world of data capture, field locations can be learned, and field types can be learned as well. In the world of classification, document types are learned based on content; this in fact is what classification is.

While the idea of errors never repeating again is attractive, people need to understand this technology is so powerful because of the huge range of document types and text that can be processed, and this is only possible by allowing variance.

Down and dirty paperless office

Jul 28

In my office, paper comes in, is reviewed for value, gets scanned, and is then shredded or filed. I have set up a system that allows me to very efficiently scan documents to my “digital file cabinet”. Here is a quick guide on how I do it!

What you will need:

  1. An unused computer attached to your network

  2. Google Desktop Search with network browsing enabled

  3. A document scanner

  4. A server based automatic OCR product

  5. A file compression product ( optional but recommended )

Now to put it all together. My system is an inexpensive desktop computer with Windows XP installed. Once all the applications are installed you don’t even need a monitor attached to this computer. The computer is visible on the network and has one shared folder, the “File Cabinet” folder in my case. This computer is my standalone digital file cabinet. Attached to it is a document scanner with a 30-page feeder. I have the scanner configured to scan to an “input” directory on the machine.

The automatic OCR product is configured to pick up images as soon as they arrive in the input folder (the “hot folder”), OCR them using specific index-level OCR settings, and create a PDF with a hidden searchable layer. The resulting PDF is put into another hot folder that the PDF compression tool is watching. As soon as a PDF arrives in this folder it is compressed, and the compressed PDF is moved to the “File Cabinet” folder.
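
If you are curious what the hot folder chain amounts to, here is a minimal sketch in Python. The two `fake_` functions stand in for the OCR product and the PDF compressor, which in my setup are off-the-shelf applications. The folder names other than “input” and “File Cabinet” are invented. A real watcher would call `run_once` in a loop with a short sleep and would wait until a file has finished being written before touching it.

```python
import shutil
from pathlib import Path

def sweep_hot_folder(inbox, outbox, convert):
    """Run convert on every file waiting in inbox, then remove the input."""
    inbox, outbox = Path(inbox), Path(outbox)
    outbox.mkdir(parents=True, exist_ok=True)
    produced = []
    for src in sorted(p for p in inbox.iterdir() if p.is_file()):
        produced.append(convert(src, outbox))
        src.unlink()  # the input has been consumed
    return produced

# Hypothetical stand-ins for the OCR product and the PDF compressor.
def fake_ocr_to_pdf(src, outbox):
    dst = outbox / (src.stem + ".pdf")
    dst.write_bytes(b"searchable:" + src.read_bytes())
    return dst

def fake_compress(src, outbox):
    dst = outbox / src.name
    shutil.copyfile(src, dst)
    return dst

def run_once(root):
    """One pass over both hot folders: input -> to_compress -> File Cabinet."""
    root = Path(root)
    sweep_hot_folder(root / "input", root / "to_compress", fake_ocr_to_pdf)
    return sweep_hot_folder(root / "to_compress", root / "File Cabinet",
                            fake_compress)
```

Each stage only knows about its own inbox and outbox, which is why products from different vendors can be chained this way without any integration work.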

Because Google Desktop Search is set to index all files in the “File Cabinet” folder, the PDFs very quickly become part of the index. Configure Google Desktop Search to enable network searches so that anyone on the network can open a browser, go to a URL hosted on the digital file cabinet machine, and locate documents with a search.

Once it’s set up, it’s simply a matter of putting paper in the scanner and pressing the scan button, and you’re done. It’s that easy, and extremely useful!

Document Preparation

Jul 21

In some organizations, document preparation prior to scanning is the largest time cost in the document entry process. In all organizations, it’s an important consideration. Document preparation is the process of sorting, organizing, and preparing documents for the most successful scan and the best chance at accuracy in downstream software processes. It ranges from something as simple as dividing pages into stacks small enough for a document scanner to handle, to something as complex as staple removal, envelope opening, and document separation using page separators.

As recognition technology advances, the need for document preparation diminishes. New technologies are allowing for automatic document separation based on templates or keywords, automatic document rotation, annotation, sorting, etc. The challenge for organizations becomes picking what document preparation step to use technology on versus manual labor. This has been a challenging question and as new technologies surface, it becomes even more challenging.

If an organization keeps its focus on return on investment, the path should become clear. A complete evaluation of the technologies will show the accuracy and percentage of automation that can be achieved, and the amount of time and cost it will save. The tricky part of the evaluation is really in understanding the environment. Doing a study of how document preparation is currently done, and of all the preparation steps required for document entry, should be fairly straightforward. Listing the document preparation features that can be handled by software, and the products that have them, is a little more complex and requires an organization to spend dedicated time on it. Separating documents and barcoding documents tend to be the biggest costs and the low-hanging fruit for automation. OCR software can determine where a document starts and ends using keywords, instead of a person manually placing separator pages or barcodes on the documents.
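
Keyword-based separation can be pictured with a short sketch like the one below. It assumes each page has already been OCRed to text and that a document always begins on a page containing one of a known list of phrases. The phrases and page texts are invented, and real products usually offer richer rules, such as templates.

```python
def split_by_start_keyword(page_texts, start_keywords):
    """Group consecutive pages into documents; a keyword page starts a new one."""
    documents = []
    for number, text in enumerate(page_texts, start=1):
        lowered = text.lower()
        starts_new = any(k.lower() in lowered for k in start_keywords)
        if starts_new or not documents:
            documents.append([])
        documents[-1].append(number)
    return documents

pages = [
    "INVOICE No. 1001 ...",
    "line items ...",
    "Invoice No. 1002 ...",
    "PURCHASE ORDER 77 ...",
    "terms ...",
]
print(split_by_start_keyword(pages, ["invoice no.", "purchase order"]))
# prints: [[1, 2], [3], [4, 5]]
```

Each inner list is the set of page numbers for one document, which is exactly the information a separator sheet or barcode would otherwise have provided.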

For most organizations the result is a combination of manual and automatic. The ultimate goal would be to automate every step in document preparation that can be automated and leave those that have to be manual such as placing documents in a scanner.
