Measuring Document Automation Efficiency

Jun 29

The two most common questions organizations ask when they are seeking document automation technology are “how fast is it?” and “how accurate is it?” Many don’t realize that the two are in opposition to each other most of the time. The more accurate a system is, the slower it tends to be, and the faster it is, the less accurate. But there is one fatal mistake in all these calculations, and that mistake is how efficiency is measured.

Most companies that trial data capture calculate performance on the slowest step, which is optical character recognition (OCR). Literally, they hit the “read” button, start timing immediately, and stop when the read is complete. That number is then treated as the speed of the document automation system. This is incorrect.

There is no question that OCR can be a tremendous bottleneck in the entire entry process, but poor OCR can create an even greater bottleneck. Imagine an OCR engine that reads a document with 100 characters in 1 second, compared to an engine that reads the same 100 characters in 3 seconds. Your initial thought is that the first engine would be better, but consider that the first engine may be 60% accurate, leaving 40 characters to be manually entered, while the other engine is 98% accurate, leaving 2 characters to be manually entered or corrected. If you assume an average entry speed of 1.6 characters per second, the 40 characters take an additional 25 seconds to enter, for a total entry time of 26 seconds with the faster engine. With the slower engine it takes an additional 1.25 seconds to enter or edit the 2 wrong characters, for a total entry time of 4.25 seconds. This means that end-to-end, the slower engine is about 6 times faster in the document automation process than the faster engine.
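The arithmetic above can be sketched in a few lines of Python. The function name `total_entry_time` and its parameters are made up for this illustration, and the model is simply the one used in the example: OCR time plus the time to key in whatever the engine missed.

```python
def total_entry_time(chars, ocr_seconds, accuracy, keying_rate=1.6):
    """Seconds to capture `chars` characters: OCR time plus manual correction.

    keying_rate is the manual entry speed in characters per second
    (1.6 is the figure assumed in the example above).
    """
    missed = chars * (1 - accuracy)          # characters left for a person to key
    return ocr_seconds + missed / keying_rate


fast = total_entry_time(100, ocr_seconds=1, accuracy=0.60)
slow = total_entry_time(100, ocr_seconds=3, accuracy=0.98)

print(f"fast engine: {fast:.2f} s")   # fast engine: 26.00 s
print(f"slow engine: {slow:.2f} s")   # slow engine: 4.25 s
print(f"ratio: {fast / slow:.1f}x")   # ratio: 6.1x
```

Swapping in your own page size, read times, and measured accuracy gives a rough end-to-end comparison for the engines you are trialing.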

This simple calculation illustrates the folly in assuming that a slower OCR time makes for a slower overall process. Focusing on accuracy usually has the greatest benefit for an organization, unless you are speeding up a slower engine with hardware, or the two engines are too close to each other to see a benefit.
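One way to judge whether two engines are “too close” is to ask what accuracy the faster engine would need in order to match the slower one end-to-end. The sketch below uses the same simple model as the example (OCR time plus manual keying at 1.6 characters per second); the function name `break_even_accuracy` is hypothetical.

```python
def break_even_accuracy(chars, ocr_seconds, target_seconds, keying_rate=1.6):
    """Accuracy an engine needs so that OCR time plus manual correction
    adds up to exactly target_seconds."""
    correction_budget = target_seconds - ocr_seconds   # seconds left for keying
    max_missed = correction_budget * keying_rate       # characters a person can fix in that time
    return 1 - max_missed / chars


# The 1-second engine has to match the 4.25-second total of the 98% engine.
needed = break_even_accuracy(100, ocr_seconds=1, target_seconds=4.25)
print(f"{needed:.1%}")   # 94.8%
```

Under these assumptions, the 1-second engine would have to reach roughly 94.8% accuracy before its speed advantage pays off. At 60% it is nowhere near.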

Chris Riley – About

Find much more about document technologies at

Beefy servers don’t always make faster OCR

Nov 18

IT departments like the newest, latest, and greatest computer technology, and why shouldn’t they? When shopping for a machine, it is usually true that MORE = BETTER. But organizations are often surprised when a desktop test machine outperforms their new beefy server at OCR. Very specific things increase OCR processing performance, and many desktop-grade machines will do an amazing job at OCR if you just hit the right points.

1.) Bus speed. If you consider that OCR moves images around in memory and on the hard drive very rapidly, and does it a lot, you will quickly realize that the time it takes to move from point A to point B could be one of your biggest bottlenecks. Let’s try an analogy. San Francisco and New York are two very large cities. They have quite an amazing capacity for people and things. Let’s say San Francisco is computer memory, and New York is a hard drive. If I and 200 of my friends want to move from San Francisco to New York with all our stuff, driving 100 or so VW Beetles cross country would take a LONG time. But if we all loaded onto a jumbo jet, we would be there in a matter of hours. This is how the bus works: the slower the bus speed between memory, hard drive, and CPU, the longer these image files take to write. Servers often have fast bus speeds but carry a tremendous amount of overhead that gets in the way.

2.) OCR is a CPU HOG. It will take 99% of any single thread when it is running, so putting energy into a more powerful CPU with more threads is not a bad idea. However, assuming that a server-grade CPU such as the Xeon is better than a desktop CPU such as the Duo might be a mistake. The reason is simple and twofold. First, servers again have more overhead, which can get in the way of processes that do a lot of moving from one place to another. Second, and more important, the chipset of the older, established CPUs is just that: older. The chips may run at the same speed, but they don’t employ some of the faster math processing that is very good for OCR and found in the newer chipsets.

3.) Hard drive speed is the same story as bus speed. You want your hard drives to write quickly, because images are serialized very often during OCR. Not only do you want the drive itself to be fast, you also want its connection to the motherboard to be fast. Serial ATA is so far the fastest proven way. Servers tend to implement SCSI, which is great for redundancy but, because of the overhead, not a promoter of speed.

4.) Memory is important, but the amount of memory is less important than the memory speed. 4 GB should be sufficient for most activity any machine can handle. The difference between 266 MHz and 666 MHz memory is huge.

If you keep it simple and focus on the things that REALLY increase OCR performance, you may be surprised to find that you can pay less and get more in this case.
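Before buying hardware, it can help to see where the time actually goes on a candidate machine. The following is a minimal timing sketch, not a benchmark suite: `time_stages` is a made-up helper, and `recognize` is a hypothetical stand-in for whatever call your OCR engine exposes. It separates the time spent reading image files from disk from the time spent processing them.

```python
import time
from pathlib import Path


def time_stages(paths, recognize):
    """Return (read_seconds, recognize_seconds) summed over all files.

    `recognize` is a placeholder: pass in any callable that takes the raw
    bytes of one image and runs your OCR engine on it.
    """
    read_s = 0.0
    recog_s = 0.0
    for p in paths:
        t0 = time.perf_counter()
        data = Path(p).read_bytes()      # disk and bus time
        t1 = time.perf_counter()
        recognize(data)                  # CPU-bound recognition time
        t2 = time.perf_counter()
        read_s += t1 - t0
        recog_s += t2 - t1
    return read_s, recog_s
```

Running the same set of images through this on the desktop and on the server gives a like-for-like comparison. Keep in mind that the operating system caches files, so a second pass over the same images will show much lower read times than the first.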
