Why OCR is for everyone

Feb 10, 2014

You may have come to this site looking for OCR software or PDF compression tools, or maybe you arrived via StumbleUpon. Maybe a friend said they used OCR and loved it, and you just had to Google it to find out what IT was. Unfortunately, the tech industry has a habit of making great technology visible only to those who know the acronyms and already have a good idea of the benefits it can provide. Everyone can benefit from Optical Character Recognition, so let’s break the barrier.

What is most important about the technology is not how it works, but the result it produces. Sometimes when people who are unfamiliar with scanners see the slew of document scanners I have, they ask, “Why do you have so many printers?” That is barrier one: scanning. To OCR documents, they need to arrive as images via email or some other digital transfer or, more likely, they are paper that needs to be scanned. We all get mail; some is junk, some is useful. We all also have paper documents sitting around in cabinets that we need to keep for a rainy day. At the same time, we use our computers more every year and create many files on them. So at the very least, wouldn’t it be nice to take the useful mail and the other useful documents you have around (mortgage documents, nice letters, business cards, etc.) and keep them with all your other digital files?

To do so you scan them, ideally with a document scanner, as it’s more efficient than a flatbed. Consumers are very used to the idea of scanning photos; scanning documents is no different, except that you have more of them. A document scanner, which looks like a printer but is not one, lets you batch documents and scan them to a folder on your computer, rather than one by one, one side at a time, as with a flatbed scanner. Once you are scanning, you have an image representation of your documents on your computer, right beside all your other digital files. Now what? Now it’s time to get the data out and make them just as useful as all your other files.

Barrier number two: OCR. It’s an acronym that stands for Optical Character Recognition. The name does not tell you much, so forget about it and use it only to refer to the process. Put simply, it’s a helpful technology that gets text from images and converts it into a format you can use. OCR converts the image into usable text, so you can search for that nice letter, or edit that party invite and print it again. The result can be PDF, DOC, plain text, or pretty much any format you can imagine.

Now, coming full circle: that good mail and those useful documents are no longer sitting somewhere cluttering up desks and drawers; they are with all your other files on your computer, ready to use. OCR is useful to everyone. You just have to clear your mind of the techie talk and understand its value.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Attachment Emailing Master

Jan 19, 2010

Very often in business, email correspondence is accompanied by a file attachment. While it’s possible to attach almost any file format to an email (some are not preferred by email clients), the most common type is a document, and the most common format is either Word or PDF. This post contains some advice on the best way to deliver documents via email.

When emailing documents, you have to be concerned about size, readability, and security. If the attachment is too large, you may not be able to email it at all. If the document is not readable, there is no point in sending it anyway. Finally, if it’s not secure, it might be re-purposed or stolen. When your document starts out in paper form, the challenges increase.

There is an ideal format, and there are ideal conversion settings, to use when sending documents via email. Ideally you would scan your document in color for visual readability. This is not the only type of readability: you also want to make sure the documents remain accessible for a long time. You would use optical character recognition (OCR) so that the document can be indexed by a search utility. You would use a compression tool to convert that initially large color image into one of manageable size without degrading its quality, and finally you would use the PDF format to get whatever level of security you choose.

The combination of a searchable, compressed, color PDF is the ideal method for emailing documents as attachments and ensuring their effectiveness and long-term usage.
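As a rough illustration of the size concern, here is a minimal Python sketch that builds an email with a PDF attachment and refuses to do so when the file is too large. It uses only the standard library's `email.message.EmailMessage`. The 10 MB limit, the function name `build_message`, and the addresses and file name used with it are assumptions made for the example; real limits vary by mail provider, and the sketch does not scan, OCR, or compress anything.

```python
from email.message import EmailMessage

# Assumed limit for illustration only; real limits vary by mail provider.
MAX_ATTACHMENT_BYTES = 10 * 1024 * 1024


def build_message(sender, recipient, subject, body, pdf_bytes, filename):
    """Build an email with one PDF attachment, refusing oversized files."""
    if len(pdf_bytes) > MAX_ATTACHMENT_BYTES:
        raise ValueError(
            f"{filename} is {len(pdf_bytes)} bytes; compress it before sending"
        )
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    # Attach the PDF bytes with an explicit MIME type and file name.
    msg.add_attachment(
        pdf_bytes, maintype="application", subtype="pdf", filename=filename
    )
    return msg
```

Called with the bytes of a compressed, searchable PDF, the function returns a message ready to hand to whatever mail-sending code you already use; called with a file over the assumed limit, it raises an error instead of producing a message that may bounce.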

Chris Riley – About


File Format Buffet! – choosing your output format

Sep 10, 2009

Nowadays, OCR and data capture solutions give you a broad selection of output formats for the result. Until the advent of layered file formats, your only choices were text formats such as Word (.doc) or plain text (.txt). But now the formats themselves come with a ton of options, leaving people to decide first which export format to use and then which variation of that format.

For the most part, it seems OCR results are exported in one of two primary formats: Word (.doc) or Portable Document Format (.pdf). So we will use these as our staples.

Word is more or less a text-only format. Scanning and converting a document to Word is useful when you want to edit the text, reformat, add graphics, and then re-create the document, or borrow its contents. Some of the options for this format that relate to OCR and data capture are keeping formatting, keeping graphics, and text encoding. It’s fairly easy to decide which of these options would be most useful to your process. The text formats from document conversion are usually limited to immediate consumption rather than distribution, while the layered formats are for distribution and storage.

There are actually many layered file formats. There are even variants of JPEG and TIFF that permit a text layer. In the last few years, Microsoft released its own “layered” format called XPS, whose popularity has yet to catch on. PDF is still the winner in this area. PDF comes with a salad bar of options, and sometimes it’s hard to pick what is best. When used in conjunction with data capture and OCR, the most common variation is a PDF with searchable text under the page image. This means that the visible layer of the PDF is the scanned image, and underneath it, with matching coordinates, is the text from OCR or data capture. The purpose is that when you search the text, the match is located on the image. Because PDF is for the most part a locked-down format, it’s important to decide which variation you want before creating one. Other common settings are tagging, password protection, PDF/A for archiving, and bookmarks. With data capture and OCR you will frequently see PDF/A, for long-term archiving of documents, and password protection. Tagging and bookmarks usually require an additional manual step unless the data capture program supports filling in this metadata. If you keep the quality of the image layer high enough in any layered format, you can OCR it again if you make a mistake in your choice of format.
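To make the idea of text under the page image more concrete, here is a small Python sketch of the concept. It is not a PDF reader or writer: the `Word` and `Page` classes, the file name, and the coordinates are all hypothetical, made up for illustration. Each recognized word carries the position it occupies on the scanned image, so a text search can answer with a location on the page.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word and where it sits on the page image."""
    text: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class Page:
    """A scanned page: the visible image plus the hidden text layer."""
    image_file: str
    words: list

    def find(self, query):
        """Return the words whose text contains the query, ignoring case."""
        q = query.lower()
        return [w for w in self.words if q in w.text.lower()]


# Hypothetical page with made-up coordinates.
page = Page(
    image_file="letter_page1.tif",
    words=[
        Word("Dear", 72, 90, 40, 12),
        Word("Chris,", 118, 90, 48, 12),
        Word("Thank", 72, 120, 50, 12),
    ],
)

hits = page.find("chris")
for w in hits:
    print(f"'{w.text}' found at x={w.x}, y={w.y} on {page.image_file}")
```

A viewer built on this idea would use the returned coordinates to highlight the matching region of the image, which is what makes a scanned document feel searchable even though the reader only ever sees the picture.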

The upshot is that, though you have a lot of options, you should be able to find the best practice or norm for your space very easily. Many of the choices are used only in special scenarios, and if you are not privy to the scenario then you probably don’t need them.

Chris Riley – About
