OCR has your back!

Sep 07
2009

There are a few niche uses of OCR technology that many people don’t realize exist, and many of them will never affect the average user. But there are two mainstream uses of the technology that affect everyone. OCR can be, and is, used to thwart spammers and even to detect viruses. How, you ask?

For years, spammers have realized that by embedding images of text in their messages they avoid the text analysis that detects the keywords that give spam away. But there is a way to get around this. By OCRing the images, the same text analysis can be run on the recognized text and the spammer caught! This is deployed in some anti-spam applications, and its usage will become even more popular as the technology becomes more and more of a commodity. Today it is used primarily in server-side anti-spam detection rather than in client-side applications. I expect that in the future all anti-spam applications will include OCR technology. This trick seems obvious when you think about it, but how does OCR prevent viruses?
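The image-spam check described above can be sketched in Python. This is only an illustration under stated assumptions: the keyword list, the threshold, and the `ocr` callable are all hypothetical, and a real filter would plug in an actual OCR engine and a far richer text model.

```python
import re
from typing import Callable, Iterable

# Hypothetical keyword list, for illustration only.
SPAM_KEYWORDS = {"viagra", "lottery", "winner", "free", "prize"}


def spam_score(text: str, keywords: Iterable[str] = SPAM_KEYWORDS) -> int:
    """Count how many distinct spam keywords appear in the text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(words & set(keywords))


def is_spam(
    body: str,
    images: Iterable[bytes],
    ocr: Callable[[bytes], str],
    threshold: int = 2,
) -> bool:
    """Run the same keyword analysis over the body plus text OCRed from images.

    `ocr` is an assumed callable that turns image bytes into text.
    """
    combined = " ".join([body, *(ocr(img) for img in images)])
    return spam_score(combined) >= threshold
```

The point of the design is that the OCR step only feeds extra text into the analysis that already exists; nothing about the keyword check itself has to change.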

If you are familiar with how viruses work, you know that viruses occasionally come to your machine as the invited friend of a malware application that is already installed. Seemingly harmless malware applications are sometimes just the first step in getting malicious viruses onto a machine. This works because installed applications are granted greater access to machine resources than applications that are not yet installed. Now here is where it gets even trickier. Usually the virus portion of the attack, or the “payload”, is received from a website or silently downloaded at a certain time. Virus protection applications are very good at spotting both the malware and the payload when it comes across as a text stream. But when the payload comes across as an image containing the code for the payload, it’s a little harder. The attacker is banking on the image passing the virus check, after which the malware converts the image to text using OCR, then compiles and runs it secretly. Now anti-virus engines are getting wise to this process and can OCR the image first to see if there is any code in it, and stop the payload before it even has a chance.
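As a rough sketch of the check described above, the Python below OCRs an image and asks whether the recognized text looks like source code. The patterns, the 50% threshold, and the `ocr` callable are assumptions made for illustration; this is not how any particular anti-virus engine works.

```python
import re
from typing import Callable

# Hypothetical indicators of source code, chosen for illustration only.
CODE_PATTERNS = [
    r"\b(import|include|def|function|eval|exec)\b",  # common code keywords
    r"[{};]\s*$",                                    # line ends in brace or semicolon
    r"\w+\s*\([^)]*\)",                              # call-like syntax: name(args)
]


def looks_like_code(text: str, min_ratio: float = 0.5) -> bool:
    """Return True if at least `min_ratio` of non-empty lines look like code."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    hits = sum(1 for ln in lines if any(re.search(p, ln) for p in CODE_PATTERNS))
    return hits / len(lines) >= min_ratio


def should_block_image(image: bytes, ocr: Callable[[bytes], str]) -> bool:
    """OCR the image first, then flag it if the text resembles code.

    `ocr` is an assumed callable that turns image bytes into text.
    """
    return looks_like_code(ocr(image))
```

A heuristic this simple will misfire on ordinary prose that happens to contain parentheses or words like "include", so it should be read as a starting point only.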

Attackers are tricky, but so are the makers of protection software. Oftentimes the makers of viruses give away the very solution for preventing their attack: in this case, OCR.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

“eBooks for Reading” By: Oc R.

Sep 04
2009

As the popularity of reading eBooks increases, so does the demand to convert books to eBooks. Legality aside, the promise of using OCR technology to create eBooks is very high, and doing so is not too difficult. There are a few things to remember when using OCR to create an eBook. Getting a digital file into the eBook format is relatively easy; creating the content for that format is the challenge. Enter Optical Character Recognition (OCR). There are several steps to successfully creating an eBook with OCR.

1.) How you scan
2.) How you optimize the image
3.) How you OCR the image

There are two common ways to scan a book. If you are lucky enough to have a book scanner, this is the preferred approach, as it does not require destroying the book. These scanners are very pricey but do a great job. A book scanner produces one image for every two pages; we will get to this in a moment. The other way is to remove the binding of the book and use a typical document scanner to produce an image file for each page. With this approach the quality is high, sometimes even higher than with a book scanner, but it is less convenient. It’s important to keep the pages in the correct order, as you often have to scan in batches and it’s easy to get pages mixed up. Scanning should be done at 300 DPI and saved as TIFF, either in grey-scale or in black and white with Group 4 compression (Group 4 applies only to black-and-white images). This will produce the ideal image. Unless the book has significantly small fonts, these settings will do the trick. Scanning in color is only required if your book has color photographs.

Once scanning is done and you have image files, it’s time to apply image processing. For the most part, pages scanned with the binding removed will not require it, except perhaps line straightening and deskew in case of crooked scans. For books scanned with a book scanner, there are two critical imaging tools that will always be applied. The first is page separation, which splits the left side and the right side of the image into the two separate pages of the book. The result is two separate image files. Next, line straightening is required on each of these image files. Because the binding of a book causes pages to curve inward, text appears as curved lines in a book scan. Line straightening finds the baseline for each line on the page and makes every portion of the line follow it.
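Page separation can be sketched in a few lines of Python. The sketch assumes the scan is a grey-scale image held as rows of pixel values (0 = black, 255 = white) and that the binding shows up as the darkest column in the middle third of the image. Both are assumptions made for illustration; real imaging tools use more robust methods.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of grey-scale pixels, 0 = black, 255 = white


def find_gutter(image: Image) -> int:
    """Return the index of the darkest column in the middle third of the image.

    Assumes the binding shadow is the darkest column near the centre.
    """
    width = len(image[0])
    start, end = width // 3, 2 * width // 3

    def column_sum(x: int) -> int:
        return sum(row[x] for row in image)

    # max(...) guards against an empty search range on very narrow images.
    return min(range(start, max(end, start + 1)), key=column_sum)


def split_pages(image: Image) -> Tuple[Image, Image]:
    """Split a two-page scan into left and right pages, dropping the gutter column."""
    gutter = find_gutter(image)
    left = [row[:gutter] for row in image]
    right = [row[gutter + 1:] for row in image]
    return left, right
```

Searching only the middle third keeps a dark illustration near the page edge from being mistaken for the binding.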

Now the magic of OCR can take place. For 90% of the books out there, following these steps will create an accurate eBook. There are many utilities that will then take text, DOC, XML, and similar output and convert it into the desired eBook format. Some tagging may be required, for chapters and so on, to gain all of the functionality in eBook readers.
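The chapter tagging mentioned above might look something like the Python sketch below. It takes plain OCR text and produces simple HTML, marking short paragraphs that start with the word "Chapter" as headings. The heading rule, the 60-character limit, and the choice of HTML as the intermediate format are all assumptions for illustration; the right tags depend on the eBook format and conversion utility you use.

```python
import re
from html import escape

# Assumed heading rule: a short paragraph beginning with "Chapter <something>".
CHAPTER_RE = re.compile(r"^chapter\s+\w+", re.IGNORECASE)
MAX_HEADING_LENGTH = 60  # hypothetical cut-off to avoid tagging body text


def text_to_html(text: str, title: str = "Untitled") -> str:
    """Convert blank-line-separated OCR text into HTML with chapter headings."""
    parts = []
    for block in re.split(r"\n\s*\n", text.strip()):
        line = " ".join(block.split())  # rejoin lines broken by the page layout
        if not line:
            continue
        is_heading = bool(CHAPTER_RE.match(line)) and len(line) <= MAX_HEADING_LENGTH
        tag = "h1" if is_heading else "p"
        parts.append(f"<{tag}>{escape(line)}</{tag}>")
    body = "\n".join(parts)
    return (
        "<html><head><title>" + escape(title) + "</title></head>\n"
        "<body>\n" + body + "\n</body></html>"
    )
```

With headings marked this way, a conversion utility has something to build a table of contents from, which is the kind of reader functionality the tagging is meant to unlock.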

Chris Riley – About


You gotta spend money to save money

Sep 02
2009

I’ve been surprised by the state of the economy and its impact on technology that saves money. OCR and Data Capture have a clear benefit for companies that process even an average number of documents a day. Paper costs are high, and manual entry is slow. Oftentimes companies don’t even realize they are paying non-data-entry salaries to employees who do data entry work, as documents are a big part of many higher-paid jobs.

Even though Data Capture and OCR save companies money, companies today are in spending freezes. You have to spend money to save money in a poor economy. The trick in this economy, to get the best bang for the buck and start saving sooner, is to take baby steps. Start automating slowly, at a low volume. This keeps the cost down and allows the organization to introduce the technology faster. At the same time, the organization is building a data capture and OCR infrastructure. Automate the easy documents first and take it in steps. The other trick is to pick Data Capture and OCR packages that are robust and provide, in one product, the general functionality you would otherwise have to find in several packages. For Data Capture and OCR this would include image capture, image clean-up, archive, compression, and export to repositories. Each of these could potentially be a separate software product you would need to purchase, but there are many applications out there that contain them all.

Organizations need to free up budgets that would allow them to save money in a month’s time.

Chris Riley – About
