eDiscovery and OCR

Mar 23, 2016

I touched on this topic in a previous post, but given eDiscovery’s popularity I thought it fitting to look at how OCR interacts with eDiscovery preparedness. Organizations that are not ready for audits and court orders to deliver documents spend tremendous amounts of money undoing bad document processes. Preparing for possible future legal events is therefore critical, and a long-term cost saver.

The purpose of OCR technology in eDiscovery readiness rests on the principle of having as much data at your fingertips as possible. Being ready depends heavily on records management policies and a good taxonomy that is strictly followed, and because of this OCR is sometimes overlooked as a tool. With those practices in place, it should be possible to pull up any document at any time. However, OCR should be viewed as an insurance policy: OCRing every document you have gives you even more information than you would have otherwise, and information is the key to success in these situations.

eDiscovery also covers other types of data, email being one of the most popular. But what about the data contained in email attachments that are PDF, TIFF, or JPEG files? OCR is the only tool that can extract text from the images in these formats. Surprisingly, products that provide eDiscovery tools just for email still do not make heavy use of OCR technology, yet the information contained in these attachments is often as valuable as the emails themselves.
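To make the attachment point concrete, here is a minimal sketch of the first step: finding which attachments in a message would need to go through OCR. It uses only Python's standard email package. The function name, the extension list, and the sample message are my own assumptions for illustration, and the OCR step itself is left out because it depends on whichever engine you use.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from pathlib import PurePath

# Assumed list of image-style attachment types that carry text only as pixels.
# (A PDF may already have a text layer; checking for that is not shown here.)
OCR_EXTENSIONS = {".pdf", ".tif", ".tiff", ".jpg", ".jpeg"}


def attachments_needing_ocr(raw_message: bytes) -> list[tuple[str, bytes]]:
    """Return (filename, content) pairs for attachments that should be OCRed."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    found = []
    for part in msg.iter_attachments():
        name = part.get_filename() or ""
        if PurePath(name).suffix.lower() in OCR_EXTENSIONS:
            # decode=True undoes the base64 transfer encoding and yields bytes.
            found.append((name, part.get_payload(decode=True)))
    return found


# Usage sketch: build a sample message in memory instead of reading a mailbox.
sample = EmailMessage()
sample["Subject"] = "Signed contract"
sample.set_content("See attached scan.")
sample.add_attachment(b"II*\x00", maintype="image", subtype="tiff",
                      filename="contract.TIF")
sample.add_attachment("plain notes", filename="notes.txt")

for name, data in attachments_needing_ocr(sample.as_bytes()):
    print(name, len(data), "bytes")
```

Each returned pair could then be handed to an OCR engine and the resulting text indexed alongside the email body.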

In addition to all the traditional records management practices and eDiscovery tools, OCR should be considered a must-have for organizations preparing themselves for audits or court orders, and, sometimes even more importantly, for knowing what to omit.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

File Format Buffet! – choosing your output format

Dec 09, 2015

These days, OCR and data capture solutions give you a broad selection of output formats for your results. Until the advent of layered file formats, your only choices were text formats such as Word (.doc) or plain text (.txt). But now the formats themselves come with a ton of options, leaving people to decide first which export format to use and then which variation of that format.

For the most part, it seems OCR is exported in one of two primary formats: Word (.doc) or Portable Document Format (.pdf). So we will use these as our staples.

Word is more or less a text-only format. Scanning and converting a document to Word is useful when you want to edit the text, reformat, add graphics, and then re-create the document, or borrow its contents. The options this format offers in relation to OCR and data capture include keeping formatting, keeping graphics, and choosing the encoding. It’s fairly easy to decide which of these options would be most useful to your process. The text formats from document conversion are usually limited to immediate consumption rather than distribution, while the layered formats are for distribution and storage.

There are actually many layered file formats. There are even variants of JPEG and TIFF that permit a text layer. In the last few years, Microsoft released its own “layered” format called XPS, whose popularity has yet to catch on. PDF is still the winner in this area. PDF comes with a salad bar of options, and sometimes it’s hard to pick what is best. When used in conjunction with data capture and OCR, the most common variation of PDF is a PDF with searchable text under the page image. What this means is that the visible layer of the PDF is the scanned image, and underneath it, at matching coordinates, is the text from OCR or data capture. The purpose is that when you search the text, the hit lands on the spot in the image where that text appears. Because PDF is for the most part a locked-down format, it’s important to decide which variation you want before you even create one. Other common settings are tagging, password protection, PDF/A for archiving, and bookmarks. When PDF is used with data capture and OCR, you will frequently see PDF/A for long-term archiving of documents, and password protection. Tagging and bookmarks usually require an additional manual step unless the data capture program supports filling in this metadata. If you keep the quality of the image layer high enough for any layered format, you can OCR it again if you make a mistake in your format.
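The "matching coordinates" part is mostly arithmetic, so a small sketch may help. OCR engines typically report word boxes in image pixels measured from the top-left corner, while PDF's default coordinate space uses 72 units per inch measured from the bottom-left. The WordBox type and the to_pdf_rect function below are hypothetical names of my own, not part of any particular OCR or PDF product.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # default PDF user space: 72 units per inch


@dataclass
class WordBox:
    """A recognized word and its box in image pixels (origin at top-left)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def to_pdf_rect(box: WordBox, dpi: float, page_height_px: int):
    """Convert a pixel box to a PDF rectangle (x0, y0, x1, y1) in points.

    PDF measures y upward from the bottom of the page, so the vertical
    axis is flipped as well as scaled.
    """
    scale = POINTS_PER_INCH / dpi
    x0 = box.left * scale
    x1 = (box.left + box.width) * scale
    y1 = (page_height_px - box.top) * scale               # top edge of the word
    y0 = (page_height_px - box.top - box.height) * scale  # bottom edge
    return (x0, y0, x1, y1)


# Usage sketch: a word found on a letter-size page scanned at 300 dpi
# (2550 x 3300 pixels).
word = WordBox("Invoice", left=300, top=150, width=600, height=75)
print(to_pdf_rect(word, dpi=300, page_height_px=3300))
```

A PDF writer would place the invisible text for each word inside the rectangle this returns, so that a search hit highlights the right spot on the scanned image.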

The upshot is that, although you have a lot of options, you should be able to find the best practice or norm for your space very easily. Many of the choices are used only in special scenarios, and if you are not privy to the scenario then you probably don’t need the option.

Chris Riley – About


Not even your monitor is safe from OCR

Apr 07, 2015

I’ve talked about various uses of OCR that are non-conventional: anti-virus, CAPTCHA (though this does not work), and now it’s time for a new one: screen scraping. OCR technology is not widely used to extract text from a user’s active screen, and the predominant use has been of the sneaky kind. However, I suspect that screen scraping will become more popular for data validation, user identification similar to CAPTCHA, user automation, and even extreme content management. I myself have used screen scraping to convert an online address book from one email account into an importable format for another email account, where the initial account did not have an export option!

Essentially, screen scraping takes a screenshot of the active window, or of the entire current session, and reads the text in it with OCR. Although screenshot resolution is very low, 96 dpi, the text contained in it is what is called “pixel perfect” and is free of the distortions, dithering, and splotches that can appear in scans. This makes reading the text itself relatively easy. The hard part is getting to the text.

Look at your screen now. It’s probably filled with various graphics, and text everywhere. For screen scraping, you cannot rely on traditional document analysis to discern where the text is and which text is valuable. The most successful screen scraping is that which is focused on one particular portion of the screen. The next biggest challenge, for screen scraping that is continuous, is the rate at which the screen changes. For example, if you are typing a document as I am now, you may scroll up and down very rapidly at times. Deciding when and where to capture data in an active screen can be tricky.
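One simple way to handle the "when" question is to wait until the region you care about stops changing before sending it to OCR. The sketch below is only one possible approach, not how any particular product does it. It assumes a capture function that you supply, returning the raw bytes of the chosen screen region; the function name and its parameters are hypothetical.

```python
import hashlib
import time
from typing import Callable, Optional


def wait_for_stable_frame(
    capture: Callable[[], bytes],
    stable_reads: int = 3,
    interval: float = 0.2,
    max_reads: int = 50,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Return a frame once `stable_reads` consecutive captures are identical.

    `capture` is assumed to return the raw pixels of one fixed screen region.
    Returns None if the region never settles within `max_reads` captures.
    """
    last_digest = None
    identical = 0
    for _ in range(max_reads):
        frame = capture()
        digest = hashlib.sha256(frame).digest()
        # Count how many captures in a row have had the same content.
        identical = identical + 1 if digest == last_digest else 1
        last_digest = digest
        if identical >= stable_reads:
            return frame  # the region has settled; safe to OCR
        sleep(interval)
    return None
```

The returned frame is the one to pass to OCR. While the user is scrolling or typing, the digests keep changing and nothing is captured.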

It may be hard for you to imagine why screen scraping is useful, especially you techies who realize that the text on the screen already exists in digital form somewhere. Where screen scraping is extremely valuable is when your application has to obtain data from another application. Developing connectors between applications can be very time consuming, and often a major waste of time. You have to learn the other product’s API, and if the vendor comes out with a new version, you now have to support it. But with screen scraping, you can write one way to get data off the screen of ANY active application window, search for the relevant content, and presto, you never have to do it again. In the areas of enterprise content management and conversion from a legacy system to a new one, screen scraping using OCR can be the most amazing tool.

Chris Riley – About


Why hot folders are so HOT

Aug 21, 2014

We are all guilty of overcomplicating things. In technology products, overcomplication results in more features than you will ever use and less money that you could use. Other times, overcomplication creates new problems in business processes. End users, vendors, and technologists all commonly try to add too many elements to automation projects. One of the areas where overcomplication occurs the most in data capture and OCR integrations is in passing images and results from one step to another.

When it comes to passing images from a capture application to a data capture application, most organizations ask for a connector written specifically to tie the chosen imaging application’s API to the chosen data capture application’s API. Similarly, when considering export from OCR and data capture processes, most organizations want a special connector to their repository or ECM product. I’m not sure what to blame: the warm and fuzzies that come from the realization that an OCR vendor has spent specific effort to develop these connectors, or the faith that connectors are somehow more efficient. What I do know is that in almost all cases connectors are overkill and simply not necessary. Why? Because there are hot folders, and they are amazingly powerful and simple.

A hot folder (sometimes called a watch folder) is a directory, virtual or real, that is set up as a staging area or queue for applications to put data in and take data from in real time. The best thing about hot folders is they are free! Almost all imaging, data capture, and content management applications support hot folders. If they don’t, you have every right to ask why. When an image capture application scans documents, it can scan those documents to a directory. The data capture application can automatically read images as soon as they appear in this directory and process them. Data capture and OCR results can be automatically exported to another directory that a content management application can automatically pick up from. That is two folders vs. two pricey connectors.
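To show how little is involved, here is a minimal sketch of a hot folder using nothing but Python's standard library. It polls an inbox directory, waits until a file's size has stopped changing between passes (so a half-written scan is not picked up), writes the result to an outbox, and moves the original to an archive directory. All of the names here are hypothetical, and the process function stands in for whatever OCR or data capture step you actually run.

```python
import shutil
import time
from pathlib import Path
from typing import Callable


def sweep_hot_folder(
    inbox: Path,
    outbox: Path,
    archive: Path,
    process: Callable[[Path], str],
    sizes_seen: dict[str, int],
) -> list[Path]:
    """Run one polling pass over the inbox and return the result files written.

    A file is processed only when its size matches the size recorded on the
    previous pass, which is a simple guard against files still being written.
    """
    outbox.mkdir(parents=True, exist_ok=True)
    archive.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(inbox.iterdir()):
        if not path.is_file():
            continue
        size = path.stat().st_size
        if sizes_seen.get(path.name) != size:
            # New or still growing: remember the size and look again next pass.
            sizes_seen[path.name] = size
            continue
        result = outbox / (path.stem + ".txt")
        result.write_text(process(path), encoding="utf-8")
        shutil.move(str(path), str(archive / path.name))
        sizes_seen.pop(path.name, None)
        written.append(result)
    return written


def watch(inbox: Path, outbox: Path, archive: Path,
          process: Callable[[Path], str], interval: float = 2.0) -> None:
    """Poll the inbox forever, sweeping it every `interval` seconds."""
    sizes_seen: dict[str, int] = {}
    while True:
        sweep_hot_folder(inbox, outbox, archive, process, sizes_seen)
        time.sleep(interval)
```

The scanning application only needs to save into the inbox, and the content management application only needs to pick up from the outbox. Neither has to know anything about the other's API.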

You may think that you are losing functionality such as tracking and security, but there are numerous ways in Windows to monitor folder activity and protect folder security. You might be surprised to learn that many “connectors” out there are actually just a hot folder with a settings dialog. It’s a hot folder in disguise.

So when it comes to deciding how to get files from one application process to another, first consider hot folders and try your best to disprove their validity. If you can’t, you just saved a bundle of money and probably picked the most efficient method for your OCR solution.

Chris Riley – About
