# OCR, Cryptorithms, Cryptograms and Substitution Ciphers

## Cryptorithms

**Cryptorithms** are puzzles in which the digits of an arithmetic computation are replaced with letters, each letter standing for a different digit. The puzzle is presented with the letters, and the object is to find the corresponding digits. A famous example is:

S E N D +

M O R E

_________

M O N E Y

Another, somewhat simpler example is given by:

PYX +

PYX

_______

YYP

Here the trick, as in a crossword puzzle, is to start where the puzzle is “easiest” to break. In this example, the second column gives Y + Y + {0,1} = {0,1}Y, where the {0,1} on the left is a possible carry from the rightmost column and the {0,1} on the right is a possible carry into the leftmost column. Y cannot be 0, since it is the leading digit of the sum YYP. The column X + X must therefore produce a carry of 1, since otherwise Y + Y + 0 would have to end in Y, which is impossible for non-zero Y. So Y + Y + 1 ends in Y. Because Y + Y + 1 is always odd, Y must be odd, and only Y = 9 works, since 9 + 9 + 1 = 19.

With Y = 9, the second column carries 1 into the leftmost column, which gives P + P + 1 = 9, so P = 4. Finally, X + X produces a carry, so X > 4, and its last digit must be P = 4. Hence X + X = 14 and X = 7, and the solution is 497 + 497 = 994. This is how a simple cryptorithm is solved.
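The reasoning above can be cross-checked by exhaustive search. The following is a minimal sketch in Python, assuming the usual conventions that distinct letters stand for distinct digits and that leading digits are non-zero. The function name `solve_pyx` is made up for this illustration.

```python
from itertools import permutations


def solve_pyx():
    """Return every (P, Y, X) digit assignment with PYX + PYX = YYP."""
    solutions = []
    # permutations() guarantees that P, Y and X are three different digits.
    for p, y, x in permutations(range(10), 3):
        if p == 0 or y == 0:  # leading digits of PYX and YYP must be non-zero
            continue
        pyx = 100 * p + 10 * y + x
        yyp = 100 * y + 10 * y + p
        if pyx + pyx == yyp:
            solutions.append((p, y, x))
    return solutions


print(solve_pyx())  # [(4, 9, 7)], i.e. 497 + 497 = 994
```

The search finds exactly one assignment, which agrees with the column-by-column analysis.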

Of course, they can get more complicated. Try the first cryptorithm problem given above. The analysis there again starts from the easiest letter to break: M is the carry out of the leftmost column, and the sum of two four-digit numbers is less than 20000, so M = 1.
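For readers who want to check their answer, here is a small Python sketch that fixes M = 1, as concluded above, and searches the remaining letters by brute force. It assumes distinct digits for distinct letters and a non-zero leading digit for SEND. The function name `solve_send_more_money` is hypothetical.

```python
from itertools import permutations


def solve_send_more_money():
    """Search S, E, N, D, O, R, Y with M fixed to 1; return (SEND, MORE, MONEY)."""
    m = 1
    remaining = [digit for digit in range(10) if digit != m]
    # Assign seven different digits (none equal to M) to the other letters.
    for s, e, n, d, o, r, y in permutations(remaining, 7):
        if s == 0:  # SEND must not start with zero
            continue
        send = 1000 * s + 100 * e + 10 * n + d
        more = 1000 * m + 100 * o + 10 * r + e
        money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
        if send + more == money:
            return send, more, money
    return None


if __name__ == "__main__":
    print(solve_send_more_money())
```

Fixing M first shrinks the search from about 1.8 million candidate assignments to about 180,000, which shows how one deduction made by hand pays off even in a brute-force approach.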

## Cryptograms and OCR

In **cryptography**, a substitution cipher is a method of encryption by which units of plaintext
are substituted with ciphertext according to a regular system; the “units” may be single
letters (the most common), pairs of letters, triplets of letters, mixtures of the above, and so
forth. The receiver deciphers the text by performing an inverse substitution. A cryptogram
is defined as a short piece of text encrypted with a simple substitution cipher in which each
letter is replaced by a different letter. To solve the puzzle, one must recover the original
lettering.

Here is a simple example:

**CAEEAEEAOOA Answer: Mississippi**
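One way to see why this answer fits is to compare repetition patterns: a simple substitution cipher changes the letters but not the positions at which letters repeat. The sketch below is illustrative only. The helper names `letter_pattern` and `recover_key` are invented for this example, and the plaintext is written in upper case for convenience.

```python
def letter_pattern(text):
    """Number each letter by the order in which it first appears in text."""
    seen = {}
    return [seen.setdefault(ch, len(seen)) for ch in text]


def recover_key(ciphertext, plaintext):
    """Return the cipher-to-plain letter map, or None if the texts are inconsistent."""
    if len(ciphertext) != len(plaintext):
        return None
    key, used = {}, {}
    for c, p in zip(ciphertext, plaintext):
        # Each cipher letter must map to one plain letter, and vice versa.
        if key.setdefault(c, p) != p or used.setdefault(p, c) != c:
            return None
    return key


print(letter_pattern("CAEEAEEAOOA") == letter_pattern("MISSISSIPPI"))  # True
print(recover_key("CAEEAEEAOOA", "MISSISSIPPI"))
# {'C': 'M', 'A': 'I', 'E': 'S', 'O': 'P'}
```

Matching patterns only make a word a candidate. In a longer cryptogram the recovered key would still have to be consistent across every word.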

We can, quite naturally, view certain OCR problems in a similar vein. Of course, in analyzing scanned documents we cannot always assume that each connected component in the image corresponds to a symbol in the alphabet. We have to deal with oversegmented and undersegmented images.

In the oversegmented case, more than one model is required to make up certain letters in the alphabet. This can happen if the document is not thresholded correctly, or if composite topological structures, such as “i” and “j”, are not combined into single models. In the undersegmented case, one component comprises more than one symbol in the alphabet. This happens often with certain letter pairs such as “fi” and “th”. For best OCR results, these undersegmented components need to be broken apart.
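The “i” case can be illustrated with a toy connected-component count. The sketch below uses hand-made bitmaps (`LETTER_I` and `LETTER_L` are hypothetical stand-ins for thresholded scans) and a plain flood fill with 8-connectivity. A real OCR pipeline would normally rely on an image-processing library instead.

```python
def count_components(bitmap):
    """Count 8-connected groups of '#' pixels in a list of equal-length strings."""
    rows, cols = len(bitmap), len(bitmap[0])
    seen = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != "#" or (r, c) in seen:
                continue
            count += 1  # found a pixel of a component not visited yet
            seen.add((r, c))
            stack = [(r, c)]
            while stack:  # flood fill over the eight neighbours
                y, x = stack.pop()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and bitmap[ny][nx] == "#"
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
    return count


LETTER_I = [".#.", "...", ".#.", ".#.", ".#."]  # dot, gap, stem
LETTER_L = [".#.", ".#.", ".#.", ".#.", ".#."]  # a single stroke

print(count_components(LETTER_I))  # 2: the dot and the stem are separate
print(count_components(LETTER_L))  # 1
```

A recognizer that treats every component as one symbol would see the dot and the stem of “i” as two symbols, which is why such composite structures have to be merged before matching.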
