CAPTCHA: Human and Machine Readability & OCR
There is a gap between human and machine readability. What does this mean exactly? Consider the websites that rely on “CAPTCHA” to distinguish between humans and bots. These websites rely on the fact that some images contain text that humans can read but machines cannot.
What is CAPTCHA? A CAPTCHA is a type of challenge-response test used in computing to determine whether the user is human or not. “CAPTCHA” is an acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart”, trademarked by Carnegie Mellon University. A CAPTCHA involves one computer asking a user to complete a test. While the computer is able to generate and grade the test, it is not able to solve the test on its own.
Because computers are unable to solve the CAPTCHA, any user entering a correct solution is presumed to be human. The term CAPTCHA was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper (all of Carnegie Mellon University), and John Langford (of IBM). A common type of CAPTCHA asks the user to type in the letters of a distorted image.
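The generate-and-grade flow described above can be sketched in a few lines. This is only a minimal illustration, not how any particular CAPTCHA service works: the names generate_challenge, grade_response, and ALPHABET are hypothetical, the step that renders the answer as a distorted image is omitted, and case-insensitive grading is an assumption.

```python
import hashlib
import hmac
import secrets
import string

# Hypothetical alphabet for the challenge text (an assumption, not from the article).
ALPHABET = string.ascii_uppercase + string.digits


def generate_challenge(length=6):
    """Create a random answer and a grading token.

    The answer would be rendered as a distorted image (omitted here);
    the server keeps only the token, a salt plus a salted SHA-256 digest.
    """
    answer = "".join(secrets.choice(ALPHABET) for _ in range(length))
    salt = secrets.token_hex(8)
    digest = hashlib.sha256((salt + answer).encode()).hexdigest()
    return answer, (salt, digest)


def grade_response(response, token):
    """Return True if the typed response matches the stored challenge."""
    salt, digest = token
    normalized = response.strip().upper()  # assumption: grading ignores case
    candidate = hashlib.sha256((salt + normalized).encode()).hexdigest()
    return hmac.compare_digest(candidate, digest)


answer, token = generate_challenge()
print(grade_response(answer, token))        # a correct reading is accepted
print(grade_response(answer + "X", token))  # a wrong reading is rejected
```

The point of the sketch is the asymmetry: generating and grading need only the stored answer, while solving requires reading the distorted image.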
Human vs. Machine Character Recognition
Typical document scanning takes place in the 200-300 dpi range. In that range, basic topological and geometric properties are preserved, even after thresholding (i.e., converting the scanned document to black and white). At low scan resolutions, however, machine OCR systems run into trouble. Part of the reason for this disparity is that humans are adept at reconstructing the shapes of characters even when multiple characters share a pixel. For computers, if distinct characters are not separated in the image after thresholding, there is often a sharp decrease in recognition rates. Usually, an image is adequately sampled if the strokes of each letter are at least two pixels thick; the same applies to the white space between them. When sampling falls below the Nyquist rate, so that this constraint is clearly not satisfied, machine recognition fails entirely, while human recognition remains intact down to perhaps 25 dpi.
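The effect of undersampling on character separation can be illustrated with a toy example. The sketch below is a hypothetical setup, not a real OCR pipeline: it builds a tiny synthetic page with two strokes separated by a two-pixel gap, simulates a lower scan resolution by block averaging, thresholds at an assumed cutoff of 128, and counts connected components of ink. All function names are illustrative.

```python
def make_page(height=8, width=16):
    """Two dark 6-pixel-wide strokes with a 2-pixel white gap (0 = ink, 255 = paper)."""
    page = [[255] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in list(range(1, 7)) + list(range(9, 15)):
            page[y][x] = 0
    return page


def downsample(image, factor):
    """Simulate a lower scan resolution by averaging factor x factor blocks."""
    h, w = len(image) // factor, len(image[0]) // factor
    area = factor * factor
    return [[sum(image[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) // area
             for x in range(w)]
            for y in range(h)]


def threshold(image, cutoff=128):
    """Convert grayscale to black and white: 1 = ink, 0 = paper."""
    return [[1 if px < cutoff else 0 for px in row] for row in image]


def count_components(binary):
    """Count 4-connected ink regions with an iterative flood fill."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                count += 1
                seen[y][x] = True
                stack = [(y, x)]
                while stack:
                    cy, cx = stack.pop()
                    for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count


page = make_page()
print(count_components(threshold(page)))                 # 2: the strokes stay separate
print(count_components(threshold(downsample(page, 4))))  # 1: the strokes merge
```

At full resolution the gap survives thresholding and the two strokes form separate components. After 4x downsampling, every block is mostly ink, so the gap disappears and a segmenter that relies on separated characters would see a single blob.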
Why does the difference between human and machine readability matter? Consider the explosion in cell phone use worldwide, with the number of units expected to exceed one billion by the end of 2008. Many of these users will be able to capture images, including images of documents. For OCR to work effectively at cell phone capture resolutions, which for documents are well below 50 dpi, there need to be fundamental improvements in OCR technology.
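To put the resolutions mentioned here in perspective, the following sketch converts a stroke width into pixels at several scan resolutions and compares the result with the two-pixel guideline. The stroke width of 0.3 mm is an assumed, illustrative value for body text, not a figure from the article, and the function names are hypothetical.

```python
MM_PER_INCH = 25.4

# Assumption: a body-text stroke is roughly 0.3 mm wide (illustrative only).
ASSUMED_STROKE_MM = 0.3


def stroke_pixels(stroke_mm, dpi):
    """Number of pixels that span one stroke at the given resolution."""
    return stroke_mm / MM_PER_INCH * dpi


def min_dpi(stroke_mm, pixels_needed=2):
    """Lowest resolution at which a stroke covers the required number of pixels."""
    return pixels_needed * MM_PER_INCH / stroke_mm


for dpi in (300, 200, 50, 25):
    px = stroke_pixels(ASSUMED_STROKE_MM, dpi)
    status = "adequate" if px >= 2 else "undersampled"
    print(f"{dpi:>3} dpi: {px:.2f} px per stroke ({status})")

print(f"minimum for 2 px per stroke: {min_dpi(ASSUMED_STROKE_MM):.0f} dpi")
```

Under this assumption, a stroke spans about 3.5 pixels at 300 dpi and about 2.4 pixels at 200 dpi, but well under one pixel at 50 dpi, which is why capture at cell phone resolutions falls so far outside the range where thresholding preserves character shapes.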