Check your check scanning

Sep 16
2009

Check scanners are fast and read MICR very accurately. They get the job done when the only job is to get the MICR line from a check. But as OCR of checks, reconciliation of check data with remittances, and check images kept for future verification and reference gain importance and demand, check scanning runs into some complications.

The typical check scanner has two very key features:

1.) Auto endorsement
2.) MICR reading

People often think that check scanners read MICR with OCR. This is incorrect. MICR is printed with magnetic ink that is read via a very specific magnetic reading and conversion process. When companies intend to augment their check scanning with OCR and Data Capture processes, there is something major they must not overlook. Check scanners are great at what they do, but they are not great at producing high-quality images. Most check scanners cannot scan above 200 DPI, which, as you will see in my previous articles, is less than optimal for OCR. Additionally, the lamps used to produce the image are fast but not of the greatest quality.

So. Here are the options:

1.) Scan checks with both a document scanner and a check scanner. The hard part here is the additional time it takes to perform two scans and merge the two data streams. In this scenario you get the best of both worlds: a great image for storage, OCR and data capture from the document scanner, and great MICR and endorsement speed from the check scanner.

2.) Replace the check scanner with a document scanner. You can actually read the MICR using OCR, but it’s not quite as accurate as magnetic reading. This might be OK, as extraction of the rest of the information on the check will be more accurate with the better image. Sometimes it’s also better because an automatic document feeder (ADF) allows you to scan many checks at one time, which is a new time savings. The biggest killer of this approach is that auto endorsement is such a tremendous time saver that it’s hard to part with.

3.) And finally option three, the most common: just use a check scanner. This option may be the most common, but it is not necessarily the best. Here the company must make sure it gets good image preparation and clean-up software that will enhance the OCR and Data Capture process and will likely up-sample the images to 300 or 400 DPI. Up-sampling does not produce the same quality as scanning at these resolutions, but products that excel at up-sampling can get close.
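
To make the up-sampling step in option three concrete, here is a minimal sketch in Python of going from 200 to 300 DPI. It assumes a gray-scale image held as a list of rows of 0-255 values and uses plain bilinear interpolation. The function name `upsample` is made up for illustration, and commercial clean-up products use far more sophisticated methods than this.

```python
def upsample(pixels, src_dpi=200, dst_dpi=300):
    """Resample a gray-scale image (list of rows of 0-255 ints) with bilinear interpolation."""
    src_h, src_w = len(pixels), len(pixels[0])
    scale = dst_dpi / src_dpi
    dst_h, dst_w = round(src_h * scale), round(src_w * scale)
    out = []
    for y in range(dst_h):
        # Map the destination row back to a (fractional) source row.
        sy = min(y / scale, src_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, src_h - 1)
        fy = sy - y0
        row = []
        for x in range(dst_w):
            sx = min(x / scale, src_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, src_w - 1)
            fx = sx - x0
            # Blend the four surrounding source pixels.
            top = pixels[y0][x0] * (1 - fx) + pixels[y0][x1] * fx
            bottom = pixels[y1][x0] * (1 - fx) + pixels[y1][x1] * fx
            row.append(round(top * (1 - fy) + bottom * fy))
        out.append(row)
    return out
```

Every new pixel is an estimate blended from its neighbors, which is why up-sampling can get close to a true 300 DPI scan but cannot match it.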

Check scanning is being augmented more and more with OCR and Data Capture processes. Companies should not assume that a check scanner will produce the image quality of a document scanner, so the considerations above are important.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

It learns right? – The misconception about recognition learning

Sep 14
2009

Because of the way the market has come to understand OCR ( typographic recognition ) and ICR ( hand-print recognition ), it is no surprise that some of the most common questions and expectations about the technology appear to come from a tarot card. Earlier I talked about one of these questions, “How accurate is it?”, and how the basis of that question is completely off and can come to no good. Here is a similar one: “It learns, right?” It is quite a loaded question, so let’s explore.

Learning is the process of retaining knowledge for subsequent use. Learning is based in the realm of fact: following the same exact steps creates the same exact results. OCR and ICR arguably learn every time they are used. For example, engines will do one read and then go back and re-read characters with low confidence values, using patterns and similarities they identified on that single page. This happens at the page level, and after the page is processed the knowledge is gone. This is where the common question comes in. What people expect is that when the OCR engine makes an error on a degraded character and that error is later corrected, the character will never be misread again. If that were true, you would believe that at some point, once all the possible errors have been seen, the solution would be 100% accurate.

WRONG! The technology does not remember sessions, and this is also the reason it works so well. Imagine, for example, a forms processing system ( the same is true for OCR ) that processed all the surveys generated by a single individual, often enough that it learned all possible errors and reached 100%. Then you start processing a form generated by a new individual. Your results on the first form type and the new one will likely be horrendous, not because of the recognition capability, but because of the supposed “learning”. In this case learning killed your accuracy as soon as any variation was introduced.

What most people don’t realize is that characters change. They change based on paper, printer, humidity, handling conditions, etc. In ICR this is exaggerated, as the characters of a single individual change by the minute, based on mood and fatigue. So learning is a misnomer, as what you are learning is only one page, one printer, one time, one paper that will likely never repeat. A successful production environment allows as much variation as possible at the highest accuracy, and this is not achieved with this type of learning.

Things that can be learned: as I said before, a single pass of a page can be followed by a second pass over low-confidence characters using patterns learned on that page. In the world of Data Capture, field locations can be learned, and field types can be learned too. In the world of classification, documents are learned based on their content; this in fact is what classification is.
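
The page-level second pass can be sketched roughly as follows. This is only an illustration under simple assumptions: each character is a tiny bitmap stored as a tuple of 0/1 pixels, the first pass has already produced a guess and a confidence for each one, and similarity is a plain pixel-difference count. The names `hamming` and `second_pass` are made up here; real engines are far more elaborate.

```python
def hamming(a, b):
    """Count differing pixels between two equal-length glyph bitmaps."""
    return sum(p != q for p, q in zip(a, b))


def second_pass(page, threshold=0.8):
    """Re-read low-confidence characters using patterns found on this page only.

    `page` is a list of (bitmap, guess, confidence) tuples from a first pass.
    """
    # Patterns learned from this page: confident glyphs and their labels.
    templates = [(bitmap, guess) for bitmap, guess, conf in page if conf >= threshold]
    result = []
    for bitmap, guess, conf in page:
        if conf < threshold and templates:
            # Borrow the label of the most similar confident glyph.
            _, guess = min(templates, key=lambda t: hamming(t[0], bitmap))
        result.append(guess)
    # `templates` goes out of scope here: nothing carries over to the next page.
    return "".join(result)
```

Note that the learned patterns live only inside the function call. That is the point of the argument: the knowledge helps on this page and is deliberately thrown away before the next one.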

While the idea of errors never repeating again is attractive, people need to understand this technology is so powerful because of the huge range of document types and text that can be processed, and this is only possible by allowing variance.


Not even your monitor is safe from OCR

Sep 11
2009

I’ve talked about various non-conventional uses of OCR: anti-virus, CAPTCHA ( though this does not work ), and now it’s time for a new one: screen scraping. OCR technology is not widely used to extract text from a user’s active screens, and the predominant use has been of the sneaky kind. However, I suspect that screen scraping will become more popular for data validation, user identification similar to CAPTCHA, user automation, and even extreme content management. I myself have used screen scraping to convert an on-line address book from one email account to an importable format for another email account, where the initial account did not have the option to export!

Essentially, screen scraping takes a screenshot of the active window, or of the entire current session, and reads the text in it with OCR. Although screenshot resolution is very low, 96 dpi, the text it contains is what is called “pixel perfect”, and does not come with the distortions, dithering, and splotches that can appear in scans. This makes reading the text itself relatively easy; the hard part is getting to the text.

Look at your screen now. It’s probably filled with various graphics, and text everywhere. For screen scraping, you cannot rely on traditional document analysis to discern where the text is and which text is valuable. The most successful screen scraping focuses on one particular portion of the screen. The next biggest challenge, for screen scraping that is continuous, is the rate at which the screen changes. For example, if you are typing a document as I am now, you may scroll up and down very rapidly at times. Deciding when and where to capture data on an active screen can be tricky.
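
Here is one way the “when and where” decision could be sketched. It assumes each captured frame is simply a list of rows of pixel values and that the region of interest is a fixed box; how the frames are grabbed is left out. The helper names (`crop`, `fingerprint`, `stable_regions`) are invented for this example and do not come from any particular product.

```python
import hashlib


def crop(frame, left, top, width, height):
    """Cut one rectangular region out of a frame (list of rows of pixel values)."""
    return [row[left:left + width] for row in frame[top:top + height]]


def fingerprint(region):
    """Hash a region so that two captures can be compared cheaply."""
    return hashlib.sha256(repr(region).encode()).hexdigest()


def stable_regions(frames, box, settle=2):
    """Yield the region each time it has stayed unchanged for `settle` consecutive frames."""
    last, run = None, 0
    for frame in frames:
        region = crop(frame, *box)
        fp = fingerprint(region)
        run = run + 1 if fp == last else 1
        last = fp
        if run == settle:
            yield region  # the region has stopped moving: safe to hand to OCR
```

Only the chosen box is compared, so changes elsewhere on the screen are ignored, and a region is passed on once per quiet period rather than on every frame.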

It may be hard for you to imagine why screen scraping is useful, especially you techies who realize that the text on the screen already exists somewhere in digital format. Screen scraping is extremely valuable when your application has to obtain data from another application. Developing connectors between applications can be very time consuming, and often a major waste of time. You have to learn the other product’s API, and if a new version comes out, you now have to support it. But with screen scraping, you can write one way to get data off the screen of ANY active application window, search for the relevant content, and presto, you never have to do it again. In the areas of enterprise content management and conversion from a legacy system to a new one, screen scraping using OCR can be the most amazing tool.


File Format Buffet! – choosing your output format

Sep 10
2009

These days OCR and data capture solutions give you a broad selection of output formats for the result. Until the advent of layered file formats, your only choices were text formats such as Word .doc or plain text .txt. But now the formats themselves come with a ton of options, leaving people to decide first what export format to use and then what variation of that format.

For the most part, it seems, OCR is exported in one of two primary formats: Word .doc or Portable Document Format .pdf. So we will use these as our staples.

Word is more or less a text-only format. Scanning and converting a document to Word is useful when you want to make edits to the text, reformat, add graphics, and then re-create the document, or borrow its contents. Some of the options this format offers in relation to OCR and Data Capture are keeping formatting, keeping graphics, and encoding. It’s fairly easy to decide which of these options would be most useful to your process. The text formats from document conversion are usually limited to immediate consumption rather than distribution, while the layered formats are for distribution and storage.

There are actually many layered file formats. There are even variants of JPEG and TIFF that permit a text layer. In the last few years, Microsoft released its own “layered” format called XPS, whose popularity has yet to catch on. PDF is still the winner in this area. PDF comes with a salad bar of options, and sometimes it’s hard to pick what is best. When used in conjunction with data capture and OCR, the most common variation is a PDF with searchable text under the page image. This means that the visible layer of the PDF is the scanned image, and underneath it, with matching coordinates, is the text from OCR or Data Capture. The purpose is that when you search the text, the hits are located on the image. Because PDF is for the most part a locked-down format, it’s important to decide what variation you want before even creating one. Other common settings are tagging, password protection, PDF/A for archiving, and bookmarks. With Data Capture and OCR you will frequently see PDF/A, for long-term archiving of documents, and password protection. Tagging and bookmarks usually require an additional manual step unless the Data Capture program supports filling in this metadata. If you keep the quality of the image layer of any layered format high enough, you can OCR it again if you make a mistake in your choice of format.
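
The “matching coordinates” part is mostly arithmetic, and a small sketch may help. It assumes the OCR engine reports each word as a box in image pixels with the origin at the top-left of the scan, while PDF default user space measures in points (1/72 inch) from the bottom-left of the page. The function name `pixel_box_to_pdf` and the box layout are assumptions made for this example.

```python
def pixel_box_to_pdf(box, dpi, page_height_px):
    """Convert an OCR word box from image pixels to PDF points.

    `box` is (left, top, width, height) in pixels, origin at the top-left.
    The result is (x, y, width, height) in points, origin at the bottom-left,
    so the vertical axis has to be flipped.
    """
    left, top, width, height = box

    def to_points(pixels):
        return pixels * 72.0 / dpi

    x = to_points(left)
    y = to_points(page_height_px - (top + height))  # bottom edge of the word
    return (x, y, to_points(width), to_points(height))
```

For example, on a 300 DPI scan that is 3300 pixels tall, a word whose box starts 300 pixels from the left and 300 pixels from the top, 600 pixels wide and 150 pixels high, would be placed at about (72, 684) with a size of 144 by 36 points.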

The upshot is that, though you have a lot of options, you should be able to very easily find the best practice or norm for your space. Many of the choices are used only in special scenarios, and if you are not privy to the scenario then you probably don’t need it.


Beefy servers don’t always make faster OCR

Sep 09
2009

IT departments like the new, latest, and greatest computer technology, and why shouldn’t they? Usually when shopping for a machine, MORE = BETTER. But in the case of OCR, organizations are surprised when a desktop test machine outperforms their new beefy server. With OCR, there are very specific things that increase processing performance. Many desktop-grade machines will do an amazing job at OCR if you just hit the right points.

1.) Bus speed. If you consider that OCR moves images around in memory and on the hard drive very rapidly and does it a lot, you will quickly realize that the time it takes to move from point A to point B could be one of your biggest bottlenecks. Let’s try an analogy. San Francisco and New York are two very large cities. They have quite an amazing capacity for people and things. Let’s say San Francisco is computer memory and New York is a hard drive. If I and 200 of my friends want to move from San Francisco to New York with all our stuff, driving 100 or so VW Beetles cross country would take a LONG time. But if we all loaded onto a jumbo jet, we would be there in a matter of hours. This is how the bus works: the slower the bus speed between memory, hard drive, and CPU, the longer the delay for these image files to be written. Servers often have fast bus speeds but carry a tremendous amount of overhead that gets in the way.

2.) OCR is a CPU HOG. It will take 99% of any single thread when it is running, so putting energy into a more powerful CPU with more threads is not a bad idea. However, assuming that a server-grade CPU such as the Xeon is better than a desktop CPU such as the Duo might be a mistake. The reason is simple and twofold. First, servers again have more overhead, which can get in the way of processes that do a lot of moving from one place to another. Second, and most important, the chipset of the older established CPUs is just that: older. They may run at the same speed, but they don’t employ some of the faster math processing that is very good for OCR and is found in the newer chipsets.

3.) Hard-drive speed is the same story as bus speed. You want your hard drives to write quickly, because images are serialized very often during OCR. Not only do you want the drive to be fast, you want its connection to the motherboard to be fast. Serial ATA is so far the fastest proven way. Servers tend to implement SCSI, which is great for redundancy but not a promoter of speed because of the overhead.

4.) Memory is important, but the amount of memory is less important than the memory speed. 4 GB should be sufficient for most activity any machine can handle. The difference between 266 MHz and 666 MHz memory is huge.
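
If you want a rough feel for how a desktop and a server compare on the image-writing point above, a crude test like the following sketch can help. It is only an assumption-laden illustration: it writes random data standing in for page images to a temporary folder, the function name `write_throughput_mb_s` is made up, and operating system caching and drive state will sway the numbers, so treat the result as a ballpark and not a benchmark.

```python
import os
import tempfile
import time


def write_throughput_mb_s(num_files=20, size_bytes=1_000_000):
    """Write image-sized files to a temp folder and return throughput in MB/s."""
    payload = os.urandom(size_bytes)  # stands in for one scanned page image
    with tempfile.TemporaryDirectory() as folder:
        start = time.perf_counter()
        for i in range(num_files):
            path = os.path.join(folder, f"page_{i:04d}.tif")
            with open(path, "wb") as handle:
                handle.write(payload)
                handle.flush()
                os.fsync(handle.fileno())  # force the data onto the drive
        elapsed = time.perf_counter() - start
    return (num_files * size_bytes / 1_000_000) / elapsed
```

Run it on both machines with the same settings, a few times each, and compare the figures side by side.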

If you keep it simple and focus on those tools that REALLY increase OCR performance, you may be surprised that you have to pay less to get more in this case.


Exceptional exceptions – Key to winning with Data Capture

Sep 08
2009

Exceptions happen! When working with advanced technologies in Data Capture and forms processing, you will always have exceptions. It’s how companies choose to deal with those exceptions that often makes or breaks an integration. Too often exception handling is not considered in data capture projects, but it’s important. Exceptions help organizations find areas for improvement, increase the accuracy of the overall process, and, when properly prepared for, keep return on investment (ROI) stable.

There are two phases of exceptions: those that make it to the operator-driven quality assurance step, and those that are thrown out of the system. It would take some time to list all the possible causes of these exceptions, but that is not the point here; the point is how best to manage them.

Exceptions that make it to the quality assurance ( QA ) process have a manual labor cost associated with them, so the goal is to make the checking as fast as possible. The best first step is to use database lookup for fields. If you have pre-existing data in a database, link your fields to this data as a first round of checking and verification. Next, choose proper data types. Data types define the formatting of fields. For example, a numeric date will contain only digits and forward slashes, in the format NN”/”NN”/”NNNN. By allowing only these characters, you make sure you catch exceptions and can either give the data capture software enough information to correct them ( if you see a g it’s probably a 6 ) or show the verification operator exactly where the problem is. The majority of your exceptions will fall into the quality assurance phase. There are also some exception documents that the software is not confident about at all, and these end up in an exception bucket.
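
The date example can be sketched in a few lines. This is a minimal illustration and not how any particular data capture product does it: the look-alike table is a hypothetical sample of letters that are commonly confused with digits, and the function name `check_date_field` is made up.

```python
import re

DATE_PATTERN = re.compile(r"[0-9]{2}/[0-9]{2}/[0-9]{4}")

# Hypothetical look-alike table: letters that are often misread digits.
LOOKALIKES = str.maketrans({"g": "6", "O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})


def check_date_field(raw):
    """Return (value, status, positions) for a field typed as NN/NN/NNNN.

    status is 'ok', 'corrected' or 'exception'; positions lists the
    character indexes that violated the data type.
    """
    bad = [i for i, ch in enumerate(raw) if ch not in "0123456789/"]
    if DATE_PATTERN.fullmatch(raw):
        return raw, "ok", []
    fixed = raw.translate(LOOKALIKES)
    if DATE_PATTERN.fullmatch(fixed):
        return fixed, "corrected", bad  # e.g. a g that was really a 6
    # Could not repair it: tell the operator exactly where to look.
    return raw, "exception", bad
```

With this sketch, `09/1g/2009` comes back as `09/16/2009` marked as corrected, while `09/1#/2009` is passed to the operator with position 4 flagged.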

Whole exception documents that are kicked out of a system are the most costly and, if not planned for, can be the killer of ROI. The most frequent cause of these exceptions is a document type or variation that has not been set up. It’s not the fault of the technology. As a matter of fact, because the software kicked the document out and did not try to process it incorrectly, it’s doing a great job! The mistake companies make is giving every document that falls into this category the same attention, and thus incurring additional fine-tuning cost. But if that document type never appears again, the company has just reduced its ROI for nothing. The key to these exceptions, whether they are whole document types or just portions of one particular document type, is to set a standard: an exact problem has to repeat X times ( based on volume ) before it’s given any sort of fine-tuning effort.
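
The “repeat X times” standard is easy to picture as a simple counter. The sketch below is only one possible shape for it: the class name `ExceptionTracker` and the cause labels are invented, and the threshold is whatever X your volume justifies.

```python
from collections import Counter


class ExceptionTracker:
    """Count rejected documents by cause and flag only the repeat offenders."""

    def __init__(self, threshold):
        self.threshold = threshold  # the "X times" standard, set per volume
        self.counts = Counter()

    def record(self, cause):
        """Log one rejected document; return True once the cause earns fine-tuning."""
        self.counts[cause] += 1
        return self.counts[cause] >= self.threshold

    def worth_tuning(self):
        """List causes that have repeated at least `threshold` times."""
        return sorted(c for c, n in self.counts.items() if n >= self.threshold)
```

A one-off document type never reaches the threshold and so never costs any fine-tuning time, which is exactly the ROI protection described above.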

Only with an exceptional exception handling process will you have an exceptional data capture system and ROI.


OCR has your back!

Sep 07
2009

There are a few niche uses of OCR technology that many people don’t realize exist, and many of them will never impact the average user. But there are two mainstream uses of the technology that impact everyone. OCR can be, and is, used to thwart spammers and even to detect viruses. How, you ask?

Spammers have realized for years that by embedding images containing text in their messages, they avoid the text analysis that detects the keywords that give spammers away. But there is a way to get around this. By OCRing the images, the same text analysis can be run on their text and the spammer caught! This is deployed in some anti-spam applications, and its usage will get even more popular as the technology becomes more and more of a commodity. Today it is used primarily in server-side anti-spam detection rather than in client-side applications. In the future I expect all anti-spam applications to include OCR technology. This trick seems obvious when you think about it, but how does OCR prevent viruses?
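
In code, the idea is simply to feed the OCR result into the same scoring you already run on the message body. The sketch below assumes a toy keyword list and takes the OCR engine as a plain callable named `ocr`; both are placeholders for illustration, and real anti-spam filters score messages in far richer ways.

```python
SPAM_KEYWORDS = {"viagra", "lottery", "winner"}  # illustrative list only


def spam_score(text):
    """Count how many known spam keywords appear in a piece of text."""
    words = {w.strip(".,!?:;").lower() for w in text.split()}
    return len(words & SPAM_KEYWORDS)


def message_score(body_text, images, ocr):
    """Score the body plus the text recovered from every embedded image.

    `ocr` is any callable that turns image bytes into a string; it is a
    placeholder for whichever OCR engine is in use.
    """
    score = spam_score(body_text)
    for image in images:
        score += spam_score(ocr(image))  # same analysis, now on the hidden text
    return score
```

A message with an innocent body and a keyword-laden image now scores the same as if the spammer had typed the words directly.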

If you are familiar with how viruses work, you know that occasionally a virus comes to your machine as the invited friend of a malware application already installed there. Occasionally, seemingly harmless malware applications are just the first step in getting malicious viruses onto a machine. This works because already installed applications are granted greater access to machine resources than applications that are not yet installed. Now here is where it gets even trickier. Usually the virus portion of the attack, the “payload”, is received from a website or silently downloaded at a certain time. Virus protection applications are very good at spotting both the malware and the payload when it comes across as a text stream. But when the payload comes across as an image containing the code for the payload, it’s harder to spot. The attacker is banking on the image passing the virus check, after which the malware converts the image to text using OCR, then compiles and runs it secretly. Anti-virus engines are now getting wise to this process and can OCR the image first to see whether there is any code in it, stopping the payload before it even has a chance.

Attackers are tricky, but so are the makers of protection software. Often times makers of viruses give away the solution to prevent any attack, in this case OCR.


“eBooks for Reading” By: Oc R.

Sep 04
2009

As the popularity of reading eBooks increases, so does the demand for converting books to eBooks. Legality aside, the promise of using OCR technology to create eBooks is very high, and it is not too difficult. There are a few things to remember when using OCR to create an eBook. Getting a digital file into the eBook format is relatively easy; creating the content for that format is the challenge. Enter Optical Character Recognition ( OCR ). There are several steps to successfully creating an eBook with OCR.

1.) How you scan
2.) How you optimize the image
3.) How you OCR the image

There are two common ways to scan a book. If you are lucky enough to have a book scanner, this is the preferred approach, as it does not require destroying the book. These scanners are very pricey, but do a great job. The result from a book scanner is one image for every two pages; we will get to this in a moment. The other way is to remove the binding of the book and use a typical document scanner to produce an image file for each page. With this approach the quality is high, sometimes even higher than with a book scanner, but it is less convenient. It’s important here to keep the pages in the correct order, as you often have to scan in batches and it’s easy to get pages mixed up. Scanning should be done at 300 DPI and saved as TIFF, either gray-scale or black and white with Group 4 compression ( Group 4 applies only to black-and-white images ). This will produce the ideal image. Unless the book has significantly small fonts, these settings will do the trick. Scanning in color is only required if your book has color photographs.

Once scanning is done and you have image files, it’s time to apply imaging. For the most part, scans made with the binding removed will not need imaging, perhaps only line straightening and deskew in case of crooked scans. For books scanned with a book scanner, there are two critical imaging tools that will always be applied. The first is page separation. This is the imaging that separates the left side and the right side of the image into the two separate pages of the book. The result is two separate image files. Next, line straightening is required on each of these image files. Because the binding of a book causes pages to curve inward, this curve shows up as curved lines of text in a book scan. Line straightening finds the baseline for each line on the page and makes every portion of the line follow it.
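
Page separation can be sketched very simply if you are willing to assume that the shadow of the binding makes the gutter the darkest column near the middle of the scan. That assumption, the gray-scale list-of-rows image layout, and the names `find_gutter` and `split_pages` are all choices made for this illustration; real imaging tools are more careful than this.

```python
def find_gutter(spread, search=0.2):
    """Return the index of the darkest column near the middle of the spread.

    `spread` is a gray-scale image as a list of rows (0 = black, 255 = white).
    Only the middle `search` fraction of the width is examined.
    """
    width = len(spread[0])
    lo = int(width * (0.5 - search / 2))
    hi = int(width * (0.5 + search / 2))

    def column_sum(x):
        return sum(row[x] for row in spread)

    return min(range(lo, hi + 1), key=column_sum)


def split_pages(spread):
    """Split a two-page book scan into (left_page, right_page) at the gutter."""
    gutter = find_gutter(spread)
    left = [row[:gutter] for row in spread]
    right = [row[gutter + 1:] for row in spread]
    return left, right
```

Each half would then go through line straightening on its own, as described above.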

Now the magic of OCR can take place. Following these steps for 90% of the books out there will create an accurate eBook. There are many utilities that will then take Text, Doc, XML, etc. and convert it into the desired eBook format. Some tagging may be required for chapters etc. to gain all of the functionality in eBook readers.


It’s CAPTCHA for a reason – Why you can’t OCR CAPTCHA

Sep 03
2009

I’ve been surprised recently by the number of project requests and Twitter conversations insisting that OCR can be used to read CAPTCHA. A CAPTCHA is that crazy set of letters and numbers most websites ask you to enter when completing a web form. The purpose of a CAPTCHA is to prevent web bots from creating accounts on websites for use in spamming or other malicious activities. It’s surprising how many organizations, both private and public, want people to solve the problem of reading CAPTCHA for them. Almost all of these companies ask for OCR technology to do so.

I’m sorry, but the answer is that it’s not possible with OCR, because CAPTCHA is not an OCR problem. It would be more logical to call it ICR ( hand print ), but this is still a stretch. OCR is Optical Character Recognition, which is the reading of typographic text. CAPTCHA fonts are clearly not typographic. To be typographic, they would have to have the same baseline ( bottom border ), the same font height for each character in the same class, etc. CAPTCHA fonts more closely resemble hand print, which is ICR territory. However, even ICR technology expects some consistency. For the most part, on a given day and time you will write the word “CVision” pretty much the same way across a form. This allows ICR to understand the subject’s hand strokes etc. in forming each character. This level of consistency is simply not present in CAPTCHAs. CAPTCHAs deploy backgrounds and ever-moving lines to break the consistency of even their already bizarre fonts. For the most part, each CAPTCHA system will produce a different variation of each possible character at any given moment.

While the idea of processing CAPTCHAs is technically enticing, actually wanting to do it has obvious malicious intent. Converting CAPTCHAs would require a combination of varying recognition technologies, adaptive pattern training, and imaging techniques. I’m not convinced that the effort of creating such an approach is fiscally feasible, especially when the average project is offering fifteen dollars to complete it. My job today is to set the record straight and let the world know that CAPTCHA processing is not a job for OCR and ICR technology, period.


You gotta spend money to save money

Sep 02
2009

I’ve been surprised by the nature of the economy and its impact on technology that saves money. OCR and Data Capture have a clear benefit for companies that process even an average number of documents a day. Paper cost is high, and manual entry is slow. Often companies don’t even realize they are paying non-data-entry salaries for employees to do data entry work, as documents are a big factor in many higher-paid jobs.

Even though Data Capture and OCR save companies money, companies today are in spending freezes. But you have to spend money to save money in a poor economy. The trick in this economy, to get the best bang for the buck and start saving faster, is to take baby steps. Start automating slowly, at a low volume. This keeps the cost down and allows the organization to introduce the technology faster. At the same time, the organization is building a data capture and OCR infrastructure. Automate the easy documents first, and take it in steps. The other trick is to pick Data Capture and OCR packages that are robust and provide the general functionality that you would otherwise have to find in several packages. For Data Capture and OCR this would include image capture, image clean-up, archive, compression, and export to repositories. Each of these could potentially be a separate software product you would need to purchase, but there are many applications out there that contain them all.

Organizations need to free up budgets that would allow them to save money in a month’s time.
