Digital Ink – it’s not OCR or ICR

Feb 03
2016

Digital ink is the approach of having a touch-screen device monitor a user's movements with a stylus on the screen to determine which character was written. This is not OCR or, more specifically, ICR. Very often companies have asked for OCR technology when they meant digital ink, and vice versa. OCR and digital ink overlap, but not always: there are cases where you simply cannot do away with paper, and digital ink does not process typed text at all.

The technology was first seen when Apple released the Newton, the first PDA with a touchscreen and stylus. Apple later sold off the Newton, which went on to become Palm Computing. At that time you had to re-learn how to write characters according to a guide: the characters were specifically structured to provide the best recognition and had to be completed in a single hand-stroke. When mastered, the recognition was very good. Now any tablet PC has a basic version of digital ink software. Digital ink competes with ICR (intelligent character recognition), also known as hand-print recognition. Whereas ICR looks at an image of characters after they are written, digital ink monitors hand strokes as the character is being written.
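
To make the distinction concrete, here is a minimal sketch (in Python, with made-up sample data) of what each technology actually receives as input: digital ink gets time-ordered stroke points, while ICR gets only a finished raster image.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Stroke:
    # One continuous pen-down movement: time-ordered (x, y) points.
    points: List[Tuple[float, float]]

# Digital ink input: the recognizer sees the strokes in the order they were
# drawn, so it knows direction, sequence, and pen lifts.
ink_sample: List[Stroke] = [
    Stroke(points=[(10, 10), (10, 40)]),   # vertical stem, drawn first
    Stroke(points=[(0, 10), (20, 10)]),    # crossbar, drawn second
]

# ICR input: the recognizer sees only a finished raster image of the
# character, a grid of pixels with no stroke order or timing.
icr_sample = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
```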

The accuracy argument between the two can easily be lost by either side. There are times when digital ink is far more accurate and times when ICR of paper forms is more accurate. The key is really the business process the technology fits into; both have their place. Digital ink is usually combined with an elaborate data entry and content management process. Most often digital ink is not about getting a substantial amount of text from the operator but about answering quick, simple questions, usually requiring no writing at all. The number of characters entered in a digital ink scenario is many times less than in an ICR forms scenario. You will not see tablet PCs sent out in the mail to survey a customer base.

The biggest place digital ink is used today is health care, where the drive is to increase its adoption even more. The purpose of the technology in this space is to rapidly populate medical records at the point of examination. However, health care still remains one of the top paper-generating industries requiring OCR and ICR. This shows that the two technologies satisfy very different needs and should not be confused with each other.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Fixed, Semi-structured, UNSTRUCTURED!?

Jan 13
2016

I find myself educating even industry peers on the topic of document structure more and more recently. Often the conversation starts with one of them telling me that unstructured document processing exists, or that a particular form is fixed when it is not. Understanding what is meant when talking about document structure is very important.

First, let's start by defining a document. A document is a collection of one or many pages that has a business process associated with it. Documents of a single type can vary in length, but the content contained within, or the possibility of it existing, is constrained. Data capture technology works on pages, so each page of a document is processed as a separate entity, and this, it seems, is the meat of the confusion.

Often someone will say a document is unstructured. What they are thinking of is that the order of pages is unstructured, which is more or less accurate; however, the pages within this "unstructured" document are either fixed or semi-structured. The only truly unstructured documents that exist are contracts and agreements. The test is this: if at any moment you can pull a page from the document and state what that page is and what information it should contain, then it is NOT unstructured.

The ability to process agreements and contracts is limited to very concrete scenarios where contract variants are non-existent, which essentially means they are no longer truly unstructured. In general, the ability to process unstructured documents does not exist. Now to explore the difference between semi-structured and fixed.

It's actually very easy, because 80% of the documents that exist are semi-structured. Even if a field appears in the same general location on every page of a particular type, that does not make the form fixed. For example, a tax form always has the same general location to print the company name. The printer has to print within a specified range, but it can print more to the left or more to the top, and the length will vary with every input name. This makes the form semi-structured; additionally, when the document is scanned it will shift left, right, up, and down by small amounts. A document is ONLY truly a fixed form when it has registration marks and fields of fixed location and length. Registration marks are how the software matches every image to the same set of coordinates, making it more or less identical to the template.
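
As a rough illustration of what registration marks buy you, here is a minimal Python sketch (the mark coordinates and field box are invented for the example) that estimates a scan's shift from the detected marks and translates a template-defined field onto the scanned page. Real fixed-form software also corrects rotation and scale, not just shift.

```python
# Template and detected registration mark positions, in pixels (illustrative values).
TEMPLATE_MARKS = [(50, 50), (2450, 50), (50, 3250), (2450, 3250)]
detected_marks = [(58, 44), (2458, 44), (58, 3244), (2458, 3244)]

# Average offset between the detected marks and the template marks.
dx = sum(d[0] - t[0] for d, t in zip(detected_marks, TEMPLATE_MARKS)) / len(TEMPLATE_MARKS)
dy = sum(d[1] - t[1] for d, t in zip(detected_marks, TEMPLATE_MARKS)) / len(TEMPLATE_MARKS)

def to_scan_coords(template_box):
    """Translate a field box defined on the template onto the scanned image."""
    left, top, width, height = template_box
    return (left + dx, top + dy, width, height)

# A fixed field defined once on the template can now be located on any scan.
company_name_box = to_scan_coords((300, 400, 900, 60))
```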

There again the confusion is exposed. When having conversations about data capture, it's very important to understand the true definitions of the lingo being used. I challenge you: if you catch someone using the lingo incorrectly, correct it; it will help both you and them.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

It learns right? – The misconception about recognition learning

Dec 16
2015

Because of the way the market has come to understand OCR (typographic recognition) and ICR (hand-print recognition), it is no surprise that some of the most common questions and expectations about the technology seem to be read from a tarot card rather than from fact. Earlier I talked about one of these questions, "How accurate is it?", and how the basis of that question is completely off and can come to no good. Here is a similar one, "It learns, right?", which is quite a loaded question, so let's explore.

Learning is the process of retaining knowledge for subsequent use. Learning is based in the realm of fact: following the exact same steps creates the exact same results. OCR and ICR arguably learn every time they are used; for example, engines will do one read and then go back and re-read characters with low confidence values using patterns and similarities they identified on that single page. This happens at the page level, and after that page is processed the knowledge is gone. This is where the common question comes in. What people expect is that the OCR engine will make an error on a degraded character, the error will later be corrected, and now that it has been corrected once that character will never cause an error again. If that were true, you would believe that at some point, once all the possible errors have been seen, the solution will be 100% accurate.

WRONG! The technology does not remember sessions, and this is also the reason it works so well. Imagine, for example, a forms processing system that processed all the surveys generated by a single individual (the same applies to OCR), and the processing happened enough that it learned all possible errors and became 100% accurate. Then you start processing a form generated by a new individual, and your results on the new forms will likely be horrendous, not because of the recognition capability, but entirely because of the supposed "learning". In this case learning killed your accuracy as soon as any variation was introduced.

What most people don't realize is that characters change: they change based on paper, printer, humidity, handling conditions, etc. In the area of ICR the effect is exaggerated, as the characters of a single individual change by the minute based on mood and fatigue. So learning is a misnomer, because what you are learning is only one page, one printer, one time, one sheet of paper that will likely never repeat again. A successful production environment allows as much variation as possible at the highest accuracy, and that is not achieved with this type of learning.

Things that can be learned: as I said before, a single pass of a page can be followed by a second pass over low-confidence characters using patterns learned on that page. In the world of data capture, field locations can be learned, and field types can also be learned. In the world of classification, documents are learned based on their content; this, in fact, is what classification is.
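
As a rough sketch of that within-page second pass, and assuming a purely hypothetical engine interface (first_pass, collect_patterns, and reread_with_page_patterns are stand-in names, not any real product's API), the flow looks something like this:

```python
# Within-page "learning": re-read only the characters the first pass was
# unsure about, using patterns gathered from that same page. Nothing is
# persisted, which is exactly why the next page starts fresh.

CONFIDENCE_THRESHOLD = 0.80

def recognize_page(page_image, engine):
    results = engine.first_pass(page_image)            # [(char, confidence, box), ...]
    page_patterns = engine.collect_patterns(results)   # shapes seen on this page only

    final = []
    for char, conf, box in results:
        if conf < CONFIDENCE_THRESHOLD:
            char, conf = engine.reread_with_page_patterns(page_image, box, page_patterns)
        final.append((char, conf, box))

    # page_patterns goes out of scope here; no knowledge carries to the next page.
    return final
```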

While the idea of errors never repeating is attractive, people need to understand that this technology is so powerful because of the huge range of document types and text it can process, and that is only possible by allowing variance.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

The Magic of 300DPI

Apr 21
2015

Many users of OCR don't realize what the impact of resolution and bit-depth is, or even what they are. Usually in the case of OCR, more is better: more resolution and more bit-depth mean more information the OCR engine can use to interpret text. But as with many things, there is a point of diminishing returns, and when it comes to image resolution, the diminishing returns are very interesting.

You will hear a lot that 300 DPI is the best resolution to scan an image for OCR. But why? 300 DPI is the magic number where you gain the most accuracy without sacrificing speed and file size. If you were to put the resolutions on a progressive line starting at 96 DPI and run tests of OCR accuracy, scanning speed, OCR speed, and file size, you would notice something very interesting: the improvement gap between a 200 DPI scan and a 300 DPI scan is at least two times the improvement gap between any other pair of resolutions. If you then look at the same line between 300 DPI and 400 DPI, the improvement gap is nearly absent, but still there. This simple study is the reason 300 DPI is the ideal resolution for OCR scanning. Now let's look at why.

Beyond the fact that it offers reasonable scan speed and reasonable file size, there is one major reason 300 DPI is optimal: the engine cores were all initially trained on this resolution. Some engines, no matter what resolution you give them, will actually resample up or down to get to 300 DPI. The image pre-processing and clean-up engines are set up similarly.
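
If you want to normalize resolution yourself before handing images to an engine, a minimal sketch with Pillow might look like the following. It assumes the scan carries DPI metadata; if it doesn't, you would need the physical page size to work out the true resolution.

```python
from PIL import Image

TARGET_DPI = 300

def resample_to_300dpi(path):
    """Resample a scanned image to 300 DPI before OCR (assumes DPI metadata exists)."""
    img = Image.open(path)
    current_dpi = img.info.get("dpi", (TARGET_DPI, TARGET_DPI))[0]
    if current_dpi == TARGET_DPI:
        return img
    scale = TARGET_DPI / current_dpi
    new_size = (round(img.width * scale), round(img.height * scale))
    resized = img.resize(new_size, Image.LANCZOS)
    # Save with resized.save(out_path, dpi=(300, 300)) to keep the metadata.
    return resized
```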

There are always exceptions, and they usually involve hand-printed forms (ICR) or documents with small print.

The beauty of 300 DPI as a best practice is that it's one of the few things in the area of OCR and data capture that is consistent across document types. You have been told to use 300 DPI, and now you know the reason behind it.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

If it’s not semi-structured why fix it – know your form’s class?

Sep 24
2014

There are two major classes of data capture technology: fixed and semi-structured. When processing a form, it's critical that the right class is chosen. To complicate things, there is a population of forms out there that can be automated with either, but there is always a definite benefit of one over the other. In my experience, organizations have a very hard time figuring out whether their form is fixed or not. The most common misdiagnosis comes from forms where the fields are in the same location and each has an allotted white space for data to be entered. To most this seems fixed, but it's actually not: text in these boxes can move around substantially, and the boxes themselves, while in the same location relative to each other, can move because of copying, variations in printing, etc. There are two very easy checks to determine whether your form is fixed.

1.) Does your form have corner stones? Corner stones, sometimes referred to as registration marks (registration marks have been known to replace corner stones when they are very clearly defined), are printed objects, usually squares, in each corner of the form. They must all be at 90-degree angles from their neighbors. Corner stones allow the software to match the scanned or input document to the original template, theoretically lining up all fields and all static elements on the form and removing any shifts, skews, etc.

2.) Does your form have pre-defined fields? A pre-defined field is more than a location on the form. A pre-defined field has a set width, height, location, and, finally and most importantly, a set number of characters. You know these fields most commonly from forms where you fill in a box for each letter. There are variations in how the characters are separated, but they all share these attributes. This is called mono-spaced text.

If your form does not have the above two items, it is not a fixed form, which indicates that a semi-structured forms processing technology would be the best fit. For those forms that are commonly confused for fixed, there are ways to make them process well with a fixed-form solution by isolating the input type (fax, email, scan) and using the proper arrangement of registration marks.
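
Boiled down to code, the decision really is just those two questions. A minimal sketch (the example calls are made up for illustration, and real projects obviously involve more judgment):

```python
def form_class(has_corner_stones: bool, has_predefined_fields: bool) -> str:
    """Fixed only when BOTH criteria hold; otherwise treat it as semi-structured."""
    if has_corner_stones and has_predefined_fields:
        return "fixed"
    return "semi-structured"

# A survey with corner squares and one box per character:
print(form_class(True, True))    # -> fixed
# An invoice-style form with labelled fields but no registration marks:
print(form_class(False, True))   # -> semi-structured
```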

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

It’s CAPTCHA for a reason – Why you can’t OCR CAPTCHA

May 21
2014

I've been surprised recently by the number of project requests and Twitter conversations insisting that OCR can be used to read CAPTCHAs. A CAPTCHA is that crazy set of letters and numbers most websites ask you to enter when completing a web form. The purpose of a CAPTCHA is to prevent web bots from creating accounts on websites for use in spamming or other malicious activities. It's surprising how many organizations, both private and public, want people to solve the problem of reading CAPTCHAs for them. Almost all of these companies ask for OCR technology to do so.

I'm sorry, but the answer is that it's not possible with OCR, because CAPTCHA is not an OCR problem. It would be more logical to call it ICR (hand-print), but even that is a stretch. OCR is Optical Character Recognition, which is the reading of typographic text, and CAPTCHA fonts are clearly not typographic. To be typographic, they would have to have the same baseline (bottom border), the same font height for each character in the same class, etc. CAPTCHA fonts more closely resemble hand-print, which is ICR processing. However, even ICR technology expects some consistency. For the most part, on a given day and at a given time you will write the word "CVision" pretty much the same way across a form. This allows ICR to understand the subject's hand strokes in creating each character. This level of consistency is simply not present in CAPTCHAs. CAPTCHAs deploy backgrounds and ever-moving lines to prevent consistency in their already bizarre fonts. For the most part, each CAPTCHA system at any given moment will produce a different variation of every possible character.

While the idea of processing CAPTCHAs is technically enticing, actually wanting to do it has obvious malicious intent. Converting CAPTCHAs would require a combination of varying recognition technologies, adaptive pattern training, and imaging techniques. I'm not convinced that the effort of creating such an approach is fiscally feasible, especially when the average project is offering fifteen dollars to complete it. My job today is to set the record straight and let the world know that CAPTCHA processing is not a job for OCR or ICR technology, period.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Hand-print or Handwriting, makes a big difference

Jan 14
2014

When it comes to forms processing and data capture, whether documents contain hand-print or handwriting makes a huge difference in accuracy and validity. Sometimes the difference between the two is not so clear. So how do you tell if your form is hand-print, handwriting, or, better yet, both?

ICR (Intelligent Character Recognition) is the algorithm used in place of OCR for characters generated by a human hand. The algorithm is more dynamic, as a person's hand-print changes slightly by the minute. It's possible to be very accurate when processing hand-print forms if the form is designed correctly. With this type of forms processing you will always have quality assurance steps, but you can get close to the accuracy of any OCR process. Very often, forms that were not created with data capture or automatic extraction in mind will contain handwriting, because hand-print is usually guided by the form itself. Forms without hand-print cannot be expected to be processed at high accuracy. So what makes hand-print, hand-print?

Mono-spaced text: This means that each character, as it's filled out, is the same distance apart from all the other characters. In handwriting you will very often have characters that connect; in the extreme form this is cursive. When characters touch or are not spread out equally you get improper segmentation, and characters get clumped together as one or split in half during recognition. Mono-spaced text is usually achieved using boxes on the form that guide the user to fill within them.

Uniform height and width: Similar to mono-spaced text, the text as it is filled in should have a more or less uniform height and width. This keeps the person completing the form from introducing as many variable elements as they would in straight handwriting, and it increases accuracy. This is also accomplished using boxes on the form that keep users within boundaries.

Stable baseline: This aspect of hand-print is the least thought about, but it is very important. Text must always sit on the same horizontal baseline. What typically happens in handwriting is that the writer drifts up and down on an invisible baseline; you may have noticed that when you write, the end of a line is sometimes lower than the beginning. Baselines are important for OCR and ICR to get proper character segmentation and to recognize a few key characters such as "q" and "p", the "tail" characters.

Sans-serif: The last element is keeping characters sans-serif. The reason is that the extra tails on characters can cause confusion between certain characters, like "o" vs. "q" and "c" vs. "e". The way to achieve this is less obvious: put a guide at the top of the form that shows a good character and a bad character.
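
Three of these four traits can be checked directly from the character bounding boxes your segmentation step produces. Here is a rough sketch of such a check; the tolerances are invented for illustration, not taken from any real engine, and tail characters like "q" and "p" would need a more forgiving baseline test than this.

```python
def looks_like_hand_print(boxes, spacing_tol=0.25, size_tol=0.30, baseline_tol=0.20):
    """boxes: list of (left, top, width, height) for one line of characters."""
    if len(boxes) < 2:
        return True

    # Mono-spacing: gaps between neighbouring characters should be similar.
    gaps = [b[0] - (a[0] + a[2]) for a, b in zip(boxes, boxes[1:])]
    mean_gap = sum(gaps) / len(gaps)
    mono_spaced = all(abs(g - mean_gap) <= spacing_tol * max(mean_gap, 1) for g in gaps)

    # Uniform height and width.
    heights = [b[3] for b in boxes]
    widths = [b[2] for b in boxes]
    mean_h, mean_w = sum(heights) / len(heights), sum(widths) / len(widths)
    uniform = (all(abs(h - mean_h) <= size_tol * mean_h for h in heights)
               and all(abs(w - mean_w) <= size_tol * mean_w for w in widths))

    # Stable baseline: the bottoms of the characters should line up.
    bottoms = [b[1] + b[3] for b in boxes]
    mean_b = sum(bottoms) / len(bottoms)
    stable_baseline = all(abs(b - mean_b) <= baseline_tol * mean_h for b in bottoms)

    return mono_spaced and uniform and stable_baseline
```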

ICR is a technology for hand-print recognition and can be very accurate with the proper guides. Today, handwriting and cursive automation is not complete and is usually only successful when augmented with other technologies such as database look-up, CAR, and LAR. Sometimes the difference between the two is unclear, but the above four elements provide a clear definition of hand-print. The best hand-print around comes from the highly trained creators of engineering drawings, whose print is so perfect it closely resembles typographic text.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

When clean is clean enough

Oct 01
2013

It's hard for people to accept the possibility of over-cleaning a scanned image. I myself would love to believe you can clean up an image so much that it does not matter what OCR technology you use, it will always be 100% accurate. The fact is, however, that OCR engines don't work this way. There are particular ways to improve the quality of a document, and there are ways that image clean-up hurts your OCR accuracy. I am going to talk about two such phenomena: fuzzy characters and characters with legs.

In data capture, a commonly sought-after imaging technique is line removal. Line removal attempts to find all the lines in a form and make them disappear, especially on forms where text is filled into fields in which each character, and the field itself, is bounded by lines. Most forms processing tools have actually advanced to the point where they incorporate the lines into the algorithm and anticipate them being there; they can thus recognize the characters even with the lines present. What often happens when a line-removal algorithm is used is that you get characters with legs. As the name suggests, these are characters where a portion of the line remains on the top and/or bottom of the character where it touches. The result is that the character no longer looks like its original self. Most such characters become unrecognizable; others become another character entirely, for example an H becomes an A or an I becomes a T. For this reason, line removal is no longer a recommended image clean-up tool for data capture.

The next imaging technique can be either extremely beneficial or detrimental to data capture; it all has to do with the form itself. I'm talking about despeckle. Despeckle is the algorithm that removes annoying dots from the document, improving the reading of characters and removing garbage that might otherwise be recognized as characters. Despeckle is usually beneficial to data capture, especially on hand-print forms where the dots can interfere with the ICR algorithm. Where despeckle hurts data capture and forms processing is when the dots touch characters. Similar to line removal, if a dot is touching a character the segmentation tool believes it's part of the character and leaves it, and thus you get fuzzy characters. Fuzzy characters are very difficult for OCR engines to read. It's a simple test: look at your form and notice whether or not the dots touch the characters. If they do, you are better off leaving the dots alone.
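
Here is a hedged sketch of a conservative despeckle using OpenCV connected components. Isolated speckles below an area threshold are erased; a speckle that touches a character is fused into that character's component and therefore left alone, which lines up with the advice above to live with dots that touch the text. The threshold is illustrative and would need tuning for your scan resolution.

```python
import cv2

MAX_SPECKLE_AREA = 12   # pixels; tune for your scan resolution

def despeckle(binary):
    """binary: black text (0) on white background (255), dtype uint8."""
    inverted = cv2.bitwise_not(binary)   # connected components must be white
    count, labels, stats, _ = cv2.connectedComponentsWithStats(inverted, connectivity=8)
    cleaned = binary.copy()
    for label in range(1, count):        # label 0 is the background
        if stats[label, cv2.CC_STAT_AREA] <= MAX_SPECKLE_AREA:
            cleaned[labels == label] = 255   # erase the isolated speckle
    return cleaned
```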

These two examples demonstrate huge differences in OCR accuracy, and they come down to choices made about the image itself, independent of setup or the software you use.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Multi-Pass document recognition

Nov 04
2009

When accuracy is the primary concern in document recognition, the best technique is multiple passes of the OCR or recognition process. Just as you might have a document manually keyed two or three times, why not have an OCR engine convert it three, four, or even five times, each with different settings?

The important thing to note in multiple-pass recognition is that you NEVER use a different engine for the same process. Reconciling results from two separate engines, often called voting, is self-defeating: each engine represents character confidence differently, so you might end up always picking the less accurate engine simply because it told you it was more confident than the more accurate one. Using the same engine multiple times with different settings, however, is consistent and a good idea.

An example of a scenario where this is used successfully is documents that contain both machine-printed and hand-printed text. A first read can be done with an OCR engine using settings A, and a second read with the same OCR engine using settings B. Areas where both passes produce nothing but garbage text might indicate hand-print, so you can then run an ICR (hand-print) engine on those regions to pick up additional information. That is three total passes of recognition, and the results are combined to make the final document.
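
As a rough sketch of the same-engine, different-settings idea, here is what two Tesseract passes (via pytesseract) with different page-segmentation modes could look like, keeping the higher-confidence word from each pass. It naively pairs words by index; a real multi-pass setup would align results by position and would route persistently low-confidence regions to an ICR engine as described above.

```python
import pytesseract
from PIL import Image

def two_pass_ocr(path):
    img = Image.open(path)
    # Same engine, two different settings (page segmentation modes).
    pass_a = pytesseract.image_to_data(img, config="--psm 6",
                                       output_type=pytesseract.Output.DICT)
    pass_b = pytesseract.image_to_data(img, config="--psm 4",
                                       output_type=pytesseract.Output.DICT)

    merged = []
    for word_a, conf_a, word_b, conf_b in zip(pass_a["text"], pass_a["conf"],
                                              pass_b["text"], pass_b["conf"]):
        if not word_a.strip() and not word_b.strip():
            continue
        # Same engine, so the confidence scales are comparable between passes.
        merged.append(word_a if float(conf_a) >= float(conf_b) else word_b)
    return " ".join(merged)
```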

At a minimum, three runs of the same engine would be ideal, as the statistical chance of two different settings producing the same error drops drastically and the final output is nearly as good as it's going to get. Some document types lend themselves to multiple-pass recognition more than others. Sometimes it's determined by the environment; for example, environments with a lot of traditional documents mixed with invoice-like documents would benefit from a full-page read with standard settings on every page plus a full-page read with special document analysis designed for documents with lines and tables.

While multiple pass OCR slows down the entire process, it’s still faster and more accurate than manual entry most of the time. I recommend this approach for any organization where accuracy is the primary concern.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Ok Chris, you talk the talk, but what is it?

Sep 18
2009

The readers of this blog are varied. Some know what OCR and data capture are, some do not. Some know they need it but not necessarily how to use it. Others know how powerful it is and have a good understanding of what is out there, but not the best practices. So, taking a step back, let me tell you what it's all about: saving money and reducing the costs associated with paper-based operations.

OCR is commonly used as an umbrella term for all of the recognition technologies out there. It specifically stands for Optical Character Recognition: the process of taking an image, scanned or digitally received, and converting it from an image to text. While OCR is sometimes used loosely to mean ICR, OCR, data capture, OMR, and barcode processing, it is really the process of extracting ALL of the typographic text from an image document and converting it to a digital format. ICR is hand-print extraction, OMR is filled-in-bubble extraction, and barcode is, well, barcode extraction. These latter recognition technologies make up data capture.

Data capture is the process of extracting field-data pairs to be exported in a structured format. It does not necessarily have to get all the information on a document, and it is highly dictated by business processes. Data capture incorporates ICR, OMR, barcode, and OCR to extract the data from fields. Fixed-form data capture deals with forms that don't change from page to page and are usually hand-printed. Semi-structured forms make up 80% of the documents someone sees. Data capture is usually a more complex technology than full-page OCR.
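
To make the distinction tangible, here is a tiny illustration (with made-up values) of the difference in output: full-page OCR hands you flat text, while data capture hands you named fields ready for export to a downstream system.

```python
import json

# Full-page OCR output: all the typographic text, as flat text.
full_page_ocr_output = "INVOICE\nAcme Corp\nInvoice No: 10482\nTotal Due: $1,250.00"

# Data capture output: only the fields the business process needs, structured.
data_capture_output = {
    "vendor_name": "Acme Corp",
    "invoice_number": "10482",
    "total_due": "1250.00",
}

# Structured data can go straight into a downstream system.
print(json.dumps(data_capture_output, indent=2))
```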

So there you have it; this is why you are reading this blog: to learn about the specifics, nuances, and best practices of these technologies.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.