File Format Buffet! – choosing your output format

Sep 10
2009

These days, OCR and data capture solutions give you a broad selection of output formats for your results. Until the advent of layered file formats, your only choices were text formats such as Word (.doc) or plain text (.txt). But now the formats themselves come with a ton of options, leaving people to decide first which export format to use and then which variation of that format.

For the most part, OCR output is exported in one of two primary formats: Word (.doc) or Portable Document Format (.pdf). So we will use these as our staples.

Word is more or less a text-only format. Scanning and converting a document to Word is useful when you want to edit the text, reformat it, add graphics, and re-create the document, or borrow its contents. The options this format offers in relation to OCR and data capture are keeping formatting, keeping graphics, and text encoding, and it's fairly easy to decide which of these would be most useful to your process. Text formats from document conversion are usually limited to immediate consumption rather than distribution, while the layered formats are for distribution and storage.
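
To make the Word route concrete, here is a minimal sketch of pushing OCR output into an editable Word file. The toolchain (Tesseract via pytesseract, plus python-docx, which writes the newer .docx rather than the old .doc) is my assumption for illustration, not something the post prescribes:

```python
# Sketch: OCR a scanned page and save the text as an editable Word document.
# Assumes pytesseract (a Tesseract wrapper) and python-docx are installed;
# these specific tools are illustrative stand-ins.
import pytesseract
from PIL import Image
from docx import Document

def scan_to_word(image_path: str, docx_path: str) -> None:
    text = pytesseract.image_to_string(Image.open(image_path))
    doc = Document()
    for paragraph in text.split("\n\n"):       # rough paragraph splitting
        if paragraph.strip():
            doc.add_paragraph(paragraph.strip())
    doc.save(docx_path)                        # text-only, ready for editing

scan_to_word("page.png", "page.docx")
```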

There are actually many layered file formats; there are even variants of JPEG and TIFF that permit a text layer. In the last few years, Microsoft released its own layered format called XPS, whose popularity has yet to catch on. PDF is still the winner in this area. PDF comes with a salad bar of options, and sometimes it's hard to pick what is best. When used in conjunction with data capture and OCR, the most common variation is a PDF with searchable text under the page image. This means the visible layer of the PDF is the scanned image, and underneath it, with matching coordinates, is the text from OCR or data capture; by searching the text, you can locate the contents of your search on the image. Because PDF is for the most part a locked-down format, it's important to decide what variation you want before even creating one. Other common settings are tagging, password protection, PDF/A for archiving, and bookmarks. With data capture and OCR you will frequently see PDF/A used for long-term archiving of documents, along with password protection. Tagging and bookmarks usually require an additional manual step unless the data capture program supports filling in this metadata. If you keep the quality of the image layer high enough in any layered format, you can OCR it again if you make a mistake in your format choice.
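
As a rough illustration of the text-under-image variation, here is a minimal sketch that produces a searchable PDF, assuming Tesseract and its pytesseract wrapper (neither is named above):

```python
# Sketch: create a searchable PDF -- visible layer is the scan,
# hidden layer (matching coordinates) is the OCR text.
import pytesseract

pdf_bytes = pytesseract.image_to_pdf_or_hocr("scan.png", extension="pdf")
with open("searchable.pdf", "wb") as f:
    f.write(pdf_bytes)
```

Utilities such as OCRmyPDF follow the same idea and can also emit PDF/A output for archiving.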

The upshot is that, although you have a lot of options, you should be able to find the best practice or norm for your space very easily. Many of the choices are used only in specialty scenarios, and if you are not in one of those scenarios, you probably don't need them.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Beefy servers don’t always make faster OCR

Sep 09
2009

IT departments like the latest and greatest computer technology, and why shouldn't they? When shopping for a machine, it's usually true that MORE = BETTER. But in the case of OCR, organizations are surprised when a desktop testing machine outperforms their beefy new server. With OCR, there are very specific things that increase processing performance, and many desktop-grade machines will do an amazing job if you just hit the right points.

1.) Bus speed. If you consider that OCR is moving images between memory and the hard drive very rapidly, and doing it a lot, you will quickly realize that the time it takes to move data from point A to point B could be one of your biggest bottlenecks. Let's try an analogy. San Francisco and New York are two very large cities with an amazing capacity for people and things. Say San Francisco is computer memory, and New York is a hard drive. If I and 200 of my friends want to move from San Francisco to New York with all our stuff, driving 100 or so VW Beetles cross-country would take a LONG time. But if we all loaded onto a jumbo jet, we would be there in a matter of hours. This is how the bus works: the slower the bus speed of memory, hard drive, and CPU, the longer the delay in writing these image files. Servers often have fast bus speeds but carry a tremendous amount of overhead that gets in the way.

2.) OCR is a CPU hog. It will take 99% of any single thread while it is running, so putting energy into a more powerful CPU with more threads is not a bad idea (a sketch of fanning pages out across cores follows this list). However, assuming that a server-grade CPU such as the Xeon is better than a desktop CPU such as the Core 2 Duo might be a mistake. The reason is simple and twofold. Again, servers have more overhead, which can get in the way of processes that move a lot of data from one place to another. Most importantly, the chipsets of the older, established server CPUs are just that: older. They may run at the same clock speed, but they lack some of the faster math processing that is very good for OCR and found in the newer chipsets.

3.) Hard-drive speed is the same story as bus speed. You want your hard drives to write quickly, because images are serialized very often during OCR. Not only do you want the drive itself to be fast, you also want its connection to the motherboard to be fast; Serial ATA is so far the fastest proven way. Servers tend to use SCSI, which is great for redundancy but, because of its overhead, not a promoter of speed.

4.) Memory is important, but the amount of memory matters less than the memory speed. 4 GB should be sufficient for most activity any machine can handle, while the difference between 266 MHz and 666 MHz memory is huge.
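
As promised above, here is a minimal sketch of getting more out of a multi-core desktop: since each OCR job pins a single thread, fan pages out across worker processes so every core is busy. The pytesseract engine here is an assumption for illustration:

```python
# Sketch: parallel OCR across all CPU cores, one page per worker process.
import glob
from multiprocessing import Pool

import pytesseract
from PIL import Image

def ocr_page(path: str) -> str:
    return pytesseract.image_to_string(Image.open(path))

if __name__ == "__main__":
    pages = sorted(glob.glob("pages/*.tif"))
    with Pool() as pool:                 # defaults to one worker per core
        texts = pool.map(ocr_page, pages)
    print(f"OCRed {len(texts)} pages")
```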

If you keep it simple and focus on the things that REALLY increase OCR performance, you may be surprised to find that, in this case, you pay less to get more.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Exceptional exceptions – Key to winning with Data Capture

Sep 08
2009

Exceptions happen! When working with advanced technologies like data capture and forms processing, you will always have exceptions. How companies choose to deal with those exceptions often makes or breaks an integration. Too often exception handling is not considered in data capture projects, but it's important: exceptions help organizations find areas for improvement, increase the accuracy of the overall process, and, when properly prepared for, keep return on investment (ROI) stable.

There are two types of exceptions: those that make it to the operator-driven quality assurance step, and those that are thrown out of the system. It would take some time to list all the possible causes of these exceptions, but that is not the point here; the point is how best to manage them.

Exceptions that make it to the quality assurance (QA) process have a manual labor cost associated with them, so the goal is to make the checking as fast as possible. The best first step is to use database lookup for fields: if you have pre-existing data in a database, link your fields to this data as a first round of checking and verification. Next, choose proper data types. Data types are formats for fields; for example, a numeric date will contain only digits and forward slashes, in the format NN"/"NN"/"NNNN. By allowing only these characters, you make sure you catch exceptions and can either give the data capture software enough information to correct them (if you see a "g", it's probably a "6") or point the verification operator to exactly where the problem is. The majority of your exceptions will fall into this QA category. But there are some exception documents that the software is not confident about at all, and those end up in an exception bucket.
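
Here is a toy sketch of that data-type idea for the date example, including a simple confusion fix (a "g" is probably a "6"). The pattern and substitution table are illustrative assumptions, not any vendor's actual rules:

```python
# Sketch: a date "data type" for a capture field -- only digits and
# slashes in NN/NN/NNNN, with a few common OCR confusion fixes.
import re

DATE_PATTERN = re.compile(r"^\d{2}/\d{2}/\d{4}$")
OCR_FIXES = str.maketrans({"g": "6", "O": "0", "l": "1"})  # assumed table

def validate_date_field(raw: str) -> tuple[str, bool]:
    """Return (corrected value, passes validation)."""
    corrected = raw.strip().translate(OCR_FIXES)
    return corrected, bool(DATE_PATTERN.match(corrected))

print(validate_date_field("0g/12/2009"))  # ('06/12/2009', True)
print(validate_date_field("6/12/2009"))   # ('6/12/2009', False) -> send to QA
```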

Whole exception documents that are kicked out of the system are the most costly and can, if not planned for, be the killer of ROI. The most common cause of these exceptions is a document type or variation that the system has not been set up for. It's not the fault of the technology; in fact, because the software kicked the document out rather than trying to process it incorrectly, it's doing a great job! The mistake companies make is giving every document in this category the same attention, and thus additional fine-tuning cost. But if that document type never appears again, the company has just reduced its ROI for nothing. The key to these exceptions, whether they are whole document types or just portions of one particular document type, is to set a standard: an exact problem has to repeat X times (based on volume) before it's given any sort of fine-tuning effort.
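
A minimal sketch of that standard might look like the following; the threshold value and the way problems are keyed are assumptions you would tune to your own volume:

```python
# Sketch: only escalate an exception for fine-tuning once the same
# problem has repeated X times.
from collections import Counter

FINE_TUNE_THRESHOLD = 5   # "X times", tuned to your volume

exception_counts: Counter[str] = Counter()

def record_exception(doc_type: str, problem: str) -> bool:
    """Log an exception; return True once it deserves fine-tuning effort."""
    key = f"{doc_type}:{problem}"
    exception_counts[key] += 1
    return exception_counts[key] >= FINE_TUNE_THRESHOLD

if record_exception("invoice", "unrecognized layout"):
    print("schedule fine-tuning for this variation")
```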

Only with an exceptional exception handling process will you have an exceptional data capture system and ROI.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

OCR has your back!

Sep 07
2009

There are a few niche uses of OCR technology that many people don't realize exist, and most will never impact the average user. But there are two mainstream uses of the technology that affect everyone: OCR can be, and is, used to thwart spammers and even to detect viruses. How, you ask?

Spammers realized years ago that by embedding their text in images, their messages avoid the text analysis that detects the keywords which give spammers away. But there is a way around this: by OCRing the images, the same text analysis can be run and the spammers caught! This is deployed in some anti-spam applications, and its usage will only grow as the technology becomes more and more of a commodity. Today it is primarily done in server-side anti-spam detection rather than client-side applications, but I expect all anti-spam applications to eventually include OCR technology. This trick seems obvious when you think about it, but how does OCR prevent viruses?
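
A bare-bones sketch of the idea: OCR the image, then run the same keyword check a filter would run on the message body. The keyword list and OCR engine (pytesseract) are stand-ins, not what any real anti-spam product uses:

```python
# Sketch: apply a body-text keyword filter to OCRed image text.
import pytesseract
from PIL import Image

SPAM_KEYWORDS = {"viagra", "lottery", "wire transfer"}  # toy list

def image_looks_spammy(image_path: str) -> bool:
    text = pytesseract.image_to_string(Image.open(image_path)).lower()
    return any(keyword in text for keyword in SPAM_KEYWORDS)

print(image_looks_spammy("attachment.png"))
```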

If you are familiar with how viruses work, you know that viruses sometimes arrive on your machine as an invited friend of a malware application that is already installed. Occasionally, harmless malware applications are just the first step in getting malicious viruses onto a machine. This works because already-installed applications are granted greater access to machine resources than applications that are not yet installed. Now here is where it gets even trickier. Usually the virus portion of the attack, the "payload," is received from a website or silently downloaded at a certain time. Virus protection applications are very good at spotting both the malware and the payload when it comes across as a text stream, but when the payload comes across as an image containing the code, it's a little trickier. The attacker is banking on the image passing the virus check; the malware then converts the image to text using OCR, compiles it, and runs it secretly. Anti-virus engines are now getting privy to this process and can OCR the image first to see if there is any code in it, stopping the payload before it even has a chance.
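
Purely as a toy illustration of that last step (real anti-virus engines are far more sophisticated), the check might look like OCRing the image and flagging recovered text that resembles code rather than prose:

```python
# Toy illustration only: flag an image whose OCRed text contains
# code-like tokens. The marker list is an invented example.
import re

import pytesseract
from PIL import Image

CODE_MARKERS = re.compile(r"\b(eval|exec|CreateProcess|powershell|#include)\b")

def image_hides_code(image_path: str) -> bool:
    text = pytesseract.image_to_string(Image.open(image_path))
    return bool(CODE_MARKERS.search(text))
```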

Attackers are tricky, but so are the makers of protection software. Oftentimes the makers of viruses give away the very solution that prevents the attack; in this case, OCR.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

“eBooks for Reading” By: Oc R.

Sep 04
2009

As the popularity of reading eBooks increases, so does the demand to convert printed books into eBooks. Legality aside, the promise of using OCR technology to create eBooks is very high, and it's not too difficult. There are a few things to remember when you want to use OCR to create an eBook. Getting a digital file into the eBook format is relatively easy; creating the content for that format is the challenge. Enter Optical Character Recognition (OCR). There are several steps to successfully creating an eBook with OCR:

1.) How you scan
2.) How you optimize the image
3.) How you OCR the image

There are two common ways to scan a book. If you are lucky enough to have a book scanner, this is the preferred approach, as it does not require the destruction of the book. These scanners are very pricey, but they do a great job; the resulting image from a book scanner is one image for every two pages (we will get to this in a moment). The other way is to remove the binding of the book and use a typical document scanner to produce an image file for each page. With this approach the quality is high, sometimes even higher than with a book scanner, but it is less convenient. It's important here to keep the book's page order correct, as you often have to scan in batches and it's easy to get pages mixed up. Scanning should be done at 300 DPI, grayscale, TIFF Group 4; this will produce the ideal image. Unless the book has significantly small fonts, these settings will do the trick. Scanning in color is only required if your book has color photographs.

Once scanning is done and you have image files, it's time to apply image cleanup. For the most part, scans made with the binding removed will not require imaging work, perhaps only line straightening and deskew in the case of crooked scans. For books scanned with a book scanner, there are two critical imaging tools that will always be applied. The first is page separation: the step that splits the left side and the right side of the image into two separate pages of the book, producing two separate image files (a rough sketch follows below). Next, line straightening is required on each of these files. Because the binding of a book causes pages to curve inward, this curve appears as curved text lines in a book scan; line straightening finds the baseline for each line on the page and makes every portion of the line follow it.
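
For the page-separation step, a naive sketch with Pillow might simply split the spread down the middle; real book-scanning tools locate the gutter rather than assuming it sits at the exact midpoint:

```python
# Sketch: split a two-page book-scanner image into left and right pages.
from PIL import Image

def split_spread(path: str, left_out: str, right_out: str) -> None:
    spread = Image.open(path)
    width, height = spread.size
    spread.crop((0, 0, width // 2, height)).save(left_out)    # left page
    spread.crop((width // 2, 0, width, height)).save(right_out)  # right page

split_spread("spread_001.tif", "page_001.tif", "page_002.tif")
```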

Now the magic of OCR can take place. For 90% of the books out there, following these steps will produce an accurate eBook. There are many utilities that will then take text, DOC, XML, etc. and convert it into the desired eBook format. Some tagging may be required for chapters and the like to gain all of the functionality of eBook readers.
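
A minimal sketch of that OCR pass, assuming pytesseract as the engine: OCR the cleaned page images in order and concatenate the text, ready for a conversion utility.

```python
# Sketch: OCR pages in order into one text file for eBook conversion.
import glob

import pytesseract
from PIL import Image

with open("book.txt", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("pages/page_*.tif")):  # keep page order!
        out.write(pytesseract.image_to_string(Image.open(path)))
        out.write("\n")
```

A converter such as Calibre's ebook-convert or pandoc can then turn book.txt into EPUB or another eBook format.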

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

You gotta spend money to save money

Sep 02
2009

I've been surprised by the nature of the economy and its impact on technology that saves money. OCR and data capture have a clear benefit for companies that process even an average number of documents a day. Paper costs are high, and manual data entry is slow. Oftentimes companies don't even realize they are paying non-data-entry salaries for employees to do data entry work, as documents are a big factor in many higher-paid jobs.

Even though data capture and OCR save companies money, companies today are in spending freezes. You have to spend money to save money in a poor economy. The trick in this economy, to get the best bang for the buck and start saving sooner, is to take baby steps. Start automating slowly, at low volume. This keeps costs down and lets the organization introduce the technology faster, while building a data capture and OCR infrastructure. Automate the easy documents first; take it in steps. The other trick is to pick data capture and OCR packages that are robust and provide functionality you would otherwise have to assemble from several products. For data capture and OCR this would include image capture, image cleanup, archiving, compression, and export to repositories. Each of these could potentially be a separate software product you would need to purchase, but there are many applications out there that contain them all.

Organizations need to free up the budget that would allow them to start saving money within a month's time.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.