Beefy servers don’t always make faster OCR

Sep 09

IT departments like the latest and greatest computer technology, and why shouldn’t they? When shopping for a machine, it’s usually true that MORE = BETTER. But organizations are often surprised when a desktop test machine outperforms their new beefy server at OCR. With OCR, a few very specific things drive processing performance, and many desktop-grade machines will do an amazing job if you just hit the right points.

1.)Bus speed. OCR moves images between memory and the hard drive very rapidly and very often, so the time it takes to get from point A to point B can be one of your biggest bottlenecks. Let’s try an analogy. San Francisco and New York are two very large cities with an amazing capacity for people and things. Let’s say San Francisco is computer memory and New York is a hard drive. If 200 of my friends and I want to move from San Francisco to New York with all our stuff, driving 100 or so VW Beetles cross country would take a LONG time. But if we all loaded onto a jumbo jet, we would be there in a matter of hours. This is how the bus works: the slower the bus speed between memory, hard drive, and CPU, the longer these image files take to write. Servers often have fast bus speeds but also a tremendous amount of overhead that gets in the way.

2.)OCR is a CPU HOG. It will take 99% of any single thread while it is running, so investing in a more powerful CPU with more threads is not a bad idea. However, assuming that a server-grade CPU such as the Xeon is better than a desktop CPU such as the Duo might be a mistake. The reason is twofold. First, servers again have more overhead, which can get in the way of processes that move a lot of data from one place to another. Second, and more important, the chipsets of the older, established CPUs are just that: older. They may run at the same speed, but they lack some of the faster math processing that is very good for OCR and found in the newer chipsets.

3.)Hard-drive speed is the same story as bus speed. You want your hard drives to write quickly, because OCR serializes images very often. Not only do you want the drive itself to be fast, you also want its connection to the motherboard to be fast. Serial ATA is so far the fastest proven way. Servers tend to implement SCSI, which is great for redundancy but, because of the overhead, does nothing for speed.

4.)Memory is important, but the amount of memory matters less than the memory speed. 4 GB should be sufficient for most activity any machine can handle. The difference between 266 MHz and 666 MHz memory, however, is huge.

If you keep it simple and focus on the components that REALLY increase OCR performance, you may be surprised to find that in this case you pay less to get more.

Chris Riley – About

Find much more about document technologies at

Exceptional exceptions – Key to winning with Data Capture

Sep 08

Exceptions happen! When working with advanced technologies in Data Capture and forms processing, you will always have exceptions. How a company chooses to deal with those exceptions often makes or breaks an integration. Too often exception handling is not considered in data capture projects, but it’s important. Exceptions help organizations find areas for improvement, increase the accuracy of the overall process, and, when properly prepared for, keep return on investment (ROI) stable.

There are two kinds of exceptions: those that make it to the operator-driven quality assurance step, and those that are thrown out of the system. It would take some time to list all the possible causes of these exceptions, but that is not the point here. The point is how best to manage them.

Exceptions that make it to the quality assurance (QA) process have a manual labor cost associated with them, so the goal is to make the checking as fast as possible. The best first step is to use database lookups for fields. If you have pre-existing data in a database, link your fields to this data as a first round of checking and verification. Next, choose proper data types. A data type defines the formatting a field allows. For example, a numeric date will contain only digits and forward slashes, in the format NN/NN/NNNN. By allowing only these characters, you make sure you catch exceptions, and you either give the data capture software enough information to correct the value itself (if it sees a g, it’s probably a 6) or show the verification operator exactly where the problem is. The majority of your exceptions will fall into the quality assurance phase. A few documents, however, are ones the software is not confident about at all, and these end up in an exception bucket.
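The date example above can be sketched in a few lines of Python. This is only an illustration: the confusion table, the function name `check_date_field`, and the three status values are assumptions made for the example, not part of any particular data capture product.

```python
import re

# Hypothetical letter-for-digit confusions; tune this table for your own OCR engine.
CONFUSIONS = {"g": "6", "O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}

# Data type for a numeric date: NN/NN/NNNN, digits and forward slashes only.
DATE_PATTERN = re.compile(r"[0-9]{2}/[0-9]{2}/[0-9]{4}")


def check_date_field(raw):
    """Return (value, status, bad_positions) for one recognized date field.

    status is "ok", "corrected" (fixed automatically), or "exception"
    (send to the verification operator).
    """
    if DATE_PATTERN.fullmatch(raw):
        return raw, "ok", []

    # Try the likely substitutions before bothering an operator.
    fixed = "".join(CONFUSIONS.get(ch, ch) for ch in raw)
    if DATE_PATTERN.fullmatch(fixed):
        return fixed, "corrected", []

    # Point the operator at the characters the data type does not allow.
    bad = [i for i, ch in enumerate(fixed) if ch not in "0123456789/"]
    return raw, "exception", bad


print(check_date_field("09/0g/2009"))
print(check_date_field("09/X9/2009"))
```

The first call is corrected without an operator; the second is flagged along with the position of the character that broke the format, which is what keeps the manual check fast.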

Whole documents that are kicked out of the system are the most costly exceptions and, if not planned for, can be the killer of ROI. The most common cause of this type of exception is a document type or variation that the system has not been set up for. It’s not the fault of the technology. As a matter of fact, because the software kicked the document out rather than trying to process it incorrectly, it’s doing a great job! The mistake companies make is giving every document that falls into this category the same attention, and thus paying for additional fine-tuning each time. If that document type never appears again, the company has just reduced its ROI for nothing. The key to these exceptions, whether they are whole document types or just portions of one particular document type, is to set a standard: an exact problem has to repeat X times (with X based on volume) before it’s given any sort of fine-tuning effort.
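The "repeat X times" standard is easy to enforce with a counter. The sketch below is a minimal illustration; the class name `ExceptionTracker`, the problem labels, and the threshold of 3 are all made up for the example, and the right threshold depends on your own volume.

```python
from collections import Counter


class ExceptionTracker:
    """Counts rejected documents per problem and flags the ones worth fine-tuning."""

    def __init__(self, threshold):
        self.threshold = threshold  # the "X" from the standard, chosen from volume
        self.counts = Counter()

    def record(self, problem):
        """Log one rejected document; return True once the problem has repeated enough."""
        self.counts[problem] += 1
        return self.counts[problem] >= self.threshold

    def worth_tuning(self):
        """List every problem that has reached the threshold."""
        return sorted(p for p, n in self.counts.items() if n >= self.threshold)


tracker = ExceptionTracker(threshold=3)
for problem in ["unknown invoice layout", "unknown invoice layout",
                "one-off fax cover sheet", "unknown invoice layout"]:
    tracker.record(problem)

print(tracker.worth_tuning())
```

Here only the invoice layout earns fine-tuning effort; the one-off cover sheet is handled manually and forgotten.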

Only with an exceptional exception handling process will you have an exceptional data capture system and ROI.

OCR has your back!

Sep 07

There are a few niche uses of OCR technology that many people don’t realize exist, and many of them will never affect the average user. But there are two mainstream uses of the technology that affect everyone. OCR can be, and is, used to thwart spammers and even to detect viruses. How, you ask?

Spammers have realized for years that by embedding images of text in their messages, they avoid the text analysis that detects the keywords that give spam away. But there is a way around this. By OCRing the images, the same text analysis can be run on the recognized text and the spammer caught! This is deployed in some anti-spam applications, and its usage will become more popular as the technology becomes more and more of a commodity. Today it is used primarily in server-side anti-spam detection rather than in client-side applications. I expect that in the future all anti-spam applications will include OCR technology. This trick seems obvious when you think about it, but how does OCR prevent viruses?
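In code, the idea is simply to run the existing keyword check over the OCR output as well as the message body. The sketch below assumes nothing about any real anti-spam product: the keyword list is invented, and `ocr` is a placeholder for whatever OCR engine you have, passed in as a function that takes image bytes and returns text.

```python
import re

# Hypothetical keyword list, for illustration only.
SPAM_KEYWORDS = {"lottery", "winner", "prize"}


def keyword_hits(text):
    """Return the spam keywords found in a piece of text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words & SPAM_KEYWORDS


def is_spam(body_text, images, ocr):
    """Run the same text analysis on the body and on every OCRed image.

    `ocr` is any callable that turns image bytes into text; it stands in
    for a real OCR engine here.
    """
    texts = [body_text] + [ocr(image) for image in images]
    return any(keyword_hits(text) for text in texts)


# Stand-in "OCR" for the demonstration: pretend the image bytes are the text.
def fake_ocr(image_bytes):
    return image_bytes.decode("ascii")


print(is_spam("Hi, see attached.", [b"You are a LOTTERY winner"], fake_ocr))
```

The body alone looks innocent; it is only the text recovered from the image that trips the filter.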

If you are familiar with how viruses work, you know that a virus occasionally arrives as the invited guest of a malware application already installed on your machine. Seemingly harmless malware is sometimes just the first step in getting a malicious virus onto a machine. This works because installed applications are granted greater access to machine resources than applications that are not yet installed. Now here is where it gets even trickier. Usually the virus portion of the attack, the “payload”, is received from a website or silently downloaded at a certain time. Virus protection applications are very good at spotting both the malware and the payload when it comes across as a text stream. But when the payload arrives as an image containing the payload’s code, detection is harder. The attacker is banking on the image passing the virus check, after which the malware converts the image to text using OCR, then compiles and runs it secretly. Anti-virus engines are now getting wise to this process and can OCR the image first to see if there is any code in it, stopping the payload before it even has a chance.

Attackers are tricky, but so are the makers of protection software. Often the makers of viruses give away the very tool needed to stop their attack, in this case OCR.

“eBooks for Reading” By: Oc R.

Sep 04

As the popularity of reading eBooks increases, so does the demand to convert books to eBooks. Legality aside, the promise of using OCR technology to create eBooks is very high, and doing so is not too difficult. There are a few things to remember when using OCR to create an eBook. Getting a digital file into the eBook format is relatively easy; creating the content for that format is the challenge. Enter Optical Character Recognition (OCR). Successfully creating an eBook with OCR comes down to three things:

1.)How you scan
2.)How you optimize the image
3.)How you OCR the image

There are two common ways to scan a book. If you are lucky enough to have a book scanner, this is the preferred approach, as it does not require destroying the book. These scanners are very pricey but do a great job. A book scanner produces one image for every two pages; we will get to this in a moment. The other way is to remove the binding of the book and use a typical document scanner to produce an image file for each page. With this approach the quality is high, sometimes even higher than with a book scanner, but it is less convenient. It’s important here to keep the pages in the correct order, as you often have to scan in batches and it’s easy to get pages mixed up. Scanning should be done at 300 DPI and saved as TIFF, either black and white with Group 4 compression or grey-scale (Group 4 compression applies only to black-and-white images, so a grey-scale TIFF needs a different compression). Unless the book has significantly small fonts, these settings will do the trick. Scanning in color is only required if your book has color photographs.

Once scanning is done and you have image files, it’s time to apply imaging. For the most part, pages scanned with the binding removed will not need imaging, except perhaps line straightening and deskew in case of crooked scans. For books scanned with a book scanner, there are two critical imaging tools that always have to be applied. The first is page separation, which splits the left side and the right side of the image into the two separate pages of the book. The result is two separate image files. Next, each of these image files needs line straightening. Because the binding of a book causes pages to curve inward, the text appears as curved lines in a book scan. Line straightening finds the baseline of each line on the page and makes every portion of the line follow it.
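Page separation can be sketched as follows. This is a deliberately simple illustration, not how any particular imaging package does it: the image is a plain list of rows of grey values (0 = black, 255 = white), and the split point is assumed to be the darkest column in the middle third of the scan, where the shadow of the binding usually falls.

```python
def split_spread(pixels):
    """Split a two-page book scan into (left_page, right_page).

    `pixels` is a list of rows, each a list of grey values (0 = black).
    Assumption: the binding shadow is the darkest column in the middle
    third of the image. That column itself is dropped from both pages.
    """
    width = len(pixels[0])
    lo, hi = width // 3, 2 * width // 3

    # Column brightness = sum of its grey values; the darkest one is the gutter.
    gutter = min(range(lo, hi + 1), key=lambda x: sum(row[x] for row in pixels))

    left = [row[:gutter] for row in pixels]
    right = [row[gutter + 1:] for row in pixels]
    return left, right


# Tiny 2 x 7 "scan": white pages with a black binding shadow in column 3.
scan = [[255, 255, 255, 0, 255, 255, 255],
        [255, 255, 255, 0, 255, 255, 255]]
left, right = split_spread(scan)
print(len(left[0]), len(right[0]))
```

Each half would then be saved as its own image file and passed on to line straightening.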

Now the magic of OCR can take place. For 90% of the books out there, following these steps will create an accurate eBook. There are many utilities that will then take Text, Doc, XML, etc. and convert it into the desired eBook format. Some tagging, for chapters and the like, may be required to get all of the functionality that eBook readers offer.

You gotta spend money to save money

Sep 02

I’ve been surprised by the economy’s impact on technology that saves money. OCR and Data Capture have a clear benefit for companies that process even an average number of documents a day. Paper cost is high, and manual entry is slow. Often companies don’t even realize they are paying non-data-entry salaries for employees to do data entry work, since documents are a big part of many higher-paid jobs.

Even though Data Capture and OCR save companies money, companies today are in spending freezes. Yet you have to spend money to save money in a poor economy. The trick in this economy, to get the best bang for the buck and to start saving sooner, is to take baby steps. Start automating slowly, at a low volume. This keeps the cost down and allows the organization to introduce the technology faster. At the same time, the organization is building a data capture and OCR infrastructure. Automate the easy documents first and take it in steps. The other trick is to pick Data Capture and OCR packages that are robust and provide, in one place, the general functionality you would otherwise have to find in several packages. For Data Capture and OCR this includes image capture, image clean-up, archive, compression, and export to repositories. Each of these could be a separate software product you would need to purchase, but there are many applications out there that contain them all.

Organizations need to free up budgets that would allow them to save money in a month’s time.
