Compression – Not for saving, for optimizing

May 20
2011

The first thing people think of when investigating compression technologies is, “How can I save space?” For advanced users, and some companies, compression is not necessarily about saving space, but about optimizing it. If you calculate the amount of time spent waiting for emails to download, opening large files, and searching, you will start to realize that compression plays a big role in worker efficiency.

The type of compression that I’m discussing here is type-specific file compression. These are compression technologies that operate on a single file type, and have special algorithms to reduce the size of that file type. The two most common examples are JPEG image files and PDF files. Type-specific compression has the benefit that you can keep working with the files as you normally would. The opposite of type-specific compression is archive compression, such as Zip or compressed Tar archives. There you have to uncompress the files before using them.

Because the file types are left intact with type specific compression, it means that you can email the files after compression, search engines can index them, and they can be opened in your typical viewer.  The reality is that hard drive space is cheap and adding more is relatively easy.  So for some, compression is more about efficiency.  With proper compression, emails are sent and received faster, search engines crawl faster and indexes are smaller, and opening large files takes less time.
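To make the contrast with archive formats concrete, here is a minimal sketch using Python's standard `zipfile` module. The document text, the file name `notes.txt`, and the "naive indexer" (a plain byte search) are all assumptions made up for illustration; a real search engine is far more sophisticated, but the point is the same: the keyword is hidden until the archive is unpacked.

```python
import io
import zipfile

# Hypothetical document text; "invoice" is the keyword a search tool would look for.
text = ("invoice 2011 paid in full\n" * 200).encode("ascii")

# Pack the text into an in-memory Zip archive, the way a generic archiver would.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("notes.txt", text)
archive_bytes = buffer.getvalue()

# A naive indexer scanning the raw archive bytes cannot see the keyword...
found_in_archive = b"invoice" in archive_bytes

# ...it only becomes visible again after the archive is unpacked.
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    unpacked = archive.read("notes.txt")
found_after_unpack = b"invoice" in unpacked

print(len(text), len(archive_bytes), found_in_archive, found_after_unpack)
```

A type-specific compressed PDF, by contrast, is still a PDF, so the viewer and the indexer can use it directly without this extra unpacking step.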

This is not to diminish the use of compression to save space in a world that collects ever more data. The purpose of this article is to highlight the other, substantial benefits of type-specific file compression. The trick now becomes finding the right compression tools, ones that create high-quality compressed files that remain compatible with typical file viewers.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Set it and forget it OCR

Mar 08
2011

My office is a paper monster. Paper comes in and never leaves intact. The scary part is how fast this happens. With paper in hand, I review its contents and assess its value, scan it, and shred it, usually within minutes of its arrival. The value of set it and forget it OCR is tremendous, but you have to be comfortable with it.

Set it and forget it OCR is where you take your OCR product and configure it to automatically process any images that appear in a certain folder. For my office, I scan to an “input” folder and all the resulting compressed and OCR’ed PDF files end up in the “File Cabinet” folder. My strategy will not work for the timid, because I’m relying solely on the power of OCR text and search to retrieve documents when I need them. Most people would rather configure their ADF scanner to have a setting or folder for each particular class of documents. Most document scanners these days have as few as 9 and as many as 99 destinations you can program. You can set each destination as its own input folder with its own OCR settings and its own output folder.
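As a rough illustration of the hot folder idea, the sketch below processes whatever is waiting in an input folder and moves the results to an output folder. It is not how any particular OCR product works: `ocr_to_pdf` is a hypothetical placeholder that only copies the file, and the folder and file names are assumptions for the demo.

```python
import shutil
import tempfile
from pathlib import Path

def ocr_to_pdf(image_path: Path, output_dir: Path) -> Path:
    """Placeholder for the OCR product: here it only copies the file."""
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / (image_path.stem + ".pdf")
    shutil.copyfile(image_path, target)
    return target

def process_hot_folder(input_dir: Path, output_dir: Path) -> list[Path]:
    """Process every file currently waiting in input_dir, then remove it."""
    results = []
    for image_path in sorted(input_dir.iterdir()):
        if image_path.is_file():
            results.append(ocr_to_pdf(image_path, output_dir))
            image_path.unlink()  # the input folder is empty again afterwards
    return results

# Demo with temporary folders standing in for "input" and "File Cabinet".
root = Path(tempfile.mkdtemp())
input_dir = root / "input"
cabinet_dir = root / "File Cabinet"
input_dir.mkdir()
(input_dir / "scan_0001.tif").write_bytes(b"fake image data")

processed = process_hot_folder(input_dir, cabinet_dir)
print([p.name for p in processed])
```

In a real setup this function would run on a timer or in a loop with a short sleep, and one such input/output pair could be configured per scanner destination.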

I know I can do this because I know what settings it takes to get OCR quality good enough to leave at least one usable keyword on each document for search. And after all, I’m an expert in OCR, so not using it every day would be crazy in its own right. I’ve yet to be proven wrong: my “File Cabinet” abyss has always given me the information I need at the time I asked for it, and sometimes even new information I did not realize I had.

Now for you records management folks shaking your heads, I understand your complaint. It should not be about my approach, but about what I do with the final paper product. Items that your taxonomy deems a record for legal or business reasons should be filed as such, perhaps scanned again as a record, and for heaven’s sake, if you are not supposed to destroy it, don’t!

The purpose of my madness is to touch paper as little as possible, and get information only when I need it. I am an extremist, but I assure you there is serious value, and a little fun in the set it and forget it OCR technique.

Squeeze those files

Mar 04
2011

Compression is a great tool for saving hard drive space. You may not currently be thinking about file compression, but you should. It’s very likely that on your machines data is being created at an increasing rate, and your hard-drive space is decreasing at the same fast pace. Organizations and individuals often only consider file compression when there is far too little space left on their hard-drives, or when the warning messages about too little space start appearing. This is a big risk.

As we create files on our computer, access them, move them, and modify them, we are fragmenting the drive. Overly fragmented drives slow down machines and increase the risk of damage and corruption. The more files you have, the more this multiplies. Real-time file compression helps with this because as soon as a file is generated, it’s compressed. There is less space being used, and the need to compress in the future is gone. Back-log compression ( compressing all your files in bulk ) requires a lot of activity on the hard drive and increases the fragmentation. The other risk of bulk compression is the fact that you only have one chance to get it right.

Bad compression is not just an irritation, it’s a risk. Usually when you compress a file, you are removing the original. The whole purpose is to save space, not use up more by keeping both copies. Keeping both files just to make sure the compression was done correctly wastes a lot of space. With day-forward or real-time compression, it’s easy to check the files as they come across at initial setup to make sure everything is good, but if you do bulk compression and make a mistake, you could ruin a large library of files.
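One way to reduce that risk is to verify each compressed file before the original is removed. The sketch below is only an illustration under stated assumptions: it uses lossless gzip from Python's standard library as a stand-in for whatever compression tool is actually in use, and the function name `compress_and_verify` and the sample file are made up. For lossy, type-specific compression (JPEG, PDF), the check would have to be a quality check rather than a byte-for-byte comparison.

```python
import gzip
import hashlib
import tempfile
from pathlib import Path

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def compress_and_verify(original: Path) -> Path:
    """Compress a file, and delete the original only if the copy checks out."""
    compressed = original.with_name(original.name + ".gz")
    data = original.read_bytes()
    compressed.write_bytes(gzip.compress(data))

    # Read the compressed copy back and compare it with the original bytes.
    restored = gzip.decompress(compressed.read_bytes())
    if sha256_of(restored) != sha256_of(data):
        compressed.unlink()
        raise ValueError(f"verification failed for {original}")

    original.unlink()  # safe to remove: the round trip was verified
    return compressed

# Demo on a throwaway file.
folder = Path(tempfile.mkdtemp())
sample = folder / "report.txt"
sample.write_bytes(b"quarterly numbers\n" * 500)
result = compress_and_verify(sample)
print(result.name, sample.exists())
```

Running a check like this per file, as files arrive, keeps a mistake confined to one file instead of a whole library.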

I firmly believe in file compression, but I know firsthand the risk of doing it incorrectly. I now compress files as they are created, and no longer have to think about data piling up faster than I can find ways to save space.

OCR makes old systems new

Feb 10
2011

One of the biggest challenges in the IT space is migration from legacy systems, often mainframes, to modern-day operating systems and applications. Legacy systems still exist today in the form of classic green-screen UNIX systems. Their life has been extended due to the critical nature of the data they contain. Modern-day standards have been put into place hoping to avoid this problem in the future. However, the applications for which conforming to standards seems most critical, such as hospital medical records systems, airline systems, and government systems, still do not conform to any. The vendors who make these systems have every intention of making them very hard to migrate from. But there is a way, and it works very well: OCR.

You may have seen a previous post where I alluded to the possibilities of using OCR to scrape screen-shots. This is one of the best real examples of why the technology is so useful. When you don’t have XML and ODBC or any of the other great standards that allow the exchange of data from one system to another, you always have what you can see, and if you can see it you can OCR it. If you can view the data on the screen, you can move it to a new system.

Using OCR to either programmatically or manually read portions of a screen where the legacy system window is displaying data, copy it to memory, and paste it into the new system is one of the most ingenious ways to ensure the neutrality of your data. Vendor lock-in attempts or old technology should not prevent you from getting to what you own: the information.
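The programmatic route has two halves: OCR the screen-shot, then pull fields out of the recognized text. The sketch below covers only the second half and assumes the OCR step has already produced plain text. The screen layout, the field labels, and the sample record are all invented for illustration; a real legacy screen would need its own pattern.

```python
import re

# Hypothetical OCR output from a screenshot of a legacy record screen.
ocr_text = """
PATIENT ID: 004512        STATUS: ACTIVE
NAME: DOE, JANE
DOB: 1975-03-14           WARD: 7B
"""

# Labels we expect on this (made-up) screen layout. A value ends at a run of
# two or more whitespace characters or at the end of the line.
FIELD_PATTERN = re.compile(
    r"(PATIENT ID|STATUS|NAME|DOB|WARD):\s*(.*?)(?=\s{2,}|$)", re.MULTILINE
)

def extract_fields(text: str) -> dict[str, str]:
    """Turn 'LABEL: value' pairs found in OCR text into a dictionary."""
    return {label: value.strip() for label, value in FIELD_PATTERN.findall(text)}

record = extract_fields(ocr_text)
print(record)
```

Once the data is in a dictionary like this, writing it into the new system is an ordinary import job.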

Whether it’s a manual process or a programmatic one, the ability to OCR screen-shots and to migrate data is the hidden secret to crack any proprietary software safe.

Space Age Optical Character Recognition

Aug 19
2010

There are a lot of technologists out there who believe that optical character recognition has its days numbered and is an aged technology. The belief is that soon paper will go away. This post is for those who believe OCR technology is going away.

The reality is that paper consumption has not really decreased. In some areas paper has been replaced with electronic data interchange ( EDI ), but in other areas it has actually increased. Studies have also shown that because documents are being scanned more often, there is also an increase in printing when the documents need to be shared or re-purposed. But I’m not here to argue that paper is not going away and that document conversion technologies are required to convert it. I’m here to point out a few futuristic uses of the technology that technologists already like to talk about and that involve OCR.

Data Security

The first futuristic use of the technology that I would like to discuss is the use of OCR in data security. Text strings sent over the Internet are far easier to sniff and unlock than a compressed JPEG image. What if you were to convert the text into a JPEG during transmission, and the person on the receiving end were to OCR it to get the data? By doing so, the data has been masked in a more efficient and secretive way. For added security, proprietary image formats could be devised.

File Compression

Storing ASCII text takes up far less space than an image or video file. As a part of the future of compression technologies, expect that OCR will be used to extract the text from an image and save it as an ASCII file. Viewers will convert the text back to an image during viewing. This then removes the image portion of the text and significantly reduces file size.

Robots

How else do you expect future robots to read text? OCR of course. The eyes of the robot are essentially a camera that takes pictures rapidly. When the robot is faced with the comprehension of text, the image will be converted using OCR and fed through an engine to gain meaning from the text and act on it.

So there you have it, three really cool and cutting edge ways OCR is and will be used in the future. Paper is not going away, but even if it were, just look at the other cool uses of OCR technology.

Down and dirty paperless office

Jul 11
2010

In my office, paper comes in, is reviewed for value, gets scanned, and is shredded or filed. I have set up a system that allows me to very efficiently scan documents to my “digital file cabinet”. Here is a quick guide on how I do it!

What you will need:

  1. An unused computer attached to your network

  2. Google Desktop Search with network browsing enabled

  3. A document scanner

  4. A server based automatic OCR product

  5. A file compression product ( optional but recommended )

Now to put it all together. My system is set up on an inexpensive desktop computer with Windows XP installed. Once all the applications are installed, you don’t even need a monitor attached to this computer. The computer is visible on the network and has one shared folder, the “File Cabinet” folder in my case. This computer is my standalone digital file cabinet. Attached to it is a document scanner with a 30-page feeder. I have the scanner configured to scan to an “input” directory on the machine.

The automatic OCR processing product is configured to pick up images as soon as they arrive in the input folder ( the “hot folder” ), OCR them using specific index-level OCR settings, and create a PDF with a hidden searchable text layer. The resulting PDF is put into another hot folder that the PDF compression tool is watching. As soon as a PDF arrives in this folder, it is compressed and the compressed PDF is moved to the “File Cabinet” folder.

Because Google Desktop Search is set to index all files in the “File Cabinet” folder, the PDFs very quickly become a part of the index. Configure Google Desktop Search to enable network searches so that any machine on the network can open a browser, go to a URL hosted on the digital file cabinet machine, and locate documents with a search.
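To show why the hidden text layer is what makes the file cabinet searchable, here is a toy keyword index. This is not how Google Desktop Search works internally; it is a minimal sketch that assumes the OCR text of each PDF is already available as a string, and the file names and contents are made up.

```python
import re
from collections import defaultdict

# Hypothetical OCR text for PDFs sitting in the "File Cabinet" folder.
ocr_text_by_file = {
    "scan_0001.pdf": "Invoice from Acme Hardware, total due 120.00",
    "scan_0002.pdf": "Lease agreement for office space, signed March 2010",
    "scan_0003.pdf": "Acme Hardware warranty card for document scanner",
}

def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the set of files that contain it."""
    index = defaultdict(set)
    for filename, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(filename)
    return index

def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return files containing every word of the query, sorted by name."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)

index = build_index(ocr_text_by_file)
print(search(index, "acme hardware"))
print(search(index, "lease"))
```

A single correctly recognized keyword per document is enough for a search like this to find it, which is why index-level OCR settings can be sufficient for a personal file cabinet.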

Once it’s set up, it’s simply a matter of putting paper in the scanner and pressing the scan button, and you’re done. It’s that easy, and extremely useful!

Workflow, super-charge with OCR

Jun 21
2010

Document workflow can range from something as simple as saving a file to a single location to something as complex as decision-tree document routing rules. Throw some paper into the mix and the problem intensifies slightly. Getting your paper documents to fit your already accepted digital document workflow can be challenging. Some organizations choose to keep the paper and digital workflows separate. Others unite them but create separate rules for each. For most, however, it would be ideal to have a single workflow engine or product supporting digital, image, and paper documents alike.

To do so with the greatest value, you need not only document conversion using Optical Character Recognition ( OCR ), but some other advanced imaging and recognition tools as well. In the digital document world, you don’t have only the data contained in the document; you have various other metadata items such as file name, file location ( taxonomy ), tags, etc. In order to marry paper with digital, the same metadata has to be created for the paper document, and this has to occur at the time of document processing. This could be a manual or an automated process, and depending on your paper volume, doing it manually may be OK. To compete with the efficiency of digital documents, however, automatic is the way to go.

Using OCR, image-based and contextual-based classification, paper or image documents that enter the workflow can obtain the same value as digital documents. The OCR is responsible for getting all the content from the document. The purpose of this content is for search, indexing, auto-filing, as well as generation of keywords ( tags ) associated with a taxonomy. In order to determine where the document fits into a taxonomy, you must first classify it.

For classification to be most effective, it happens on two levels. Image-based classification looks at what the document looks like: it classifies documents based on their physical structure, which is a good indicator of type and very fast to evaluate. Contextual classification looks at what words are contained in the document: it goes one level deeper and looks for the keywords that would make a document one type over another. For some environments, image-based classification can do the job entirely. Once the classification is known, a classification engine can place the document in the correct spot in an existing taxonomy. Once an ID or classification is determined, it is no challenge to apply tags, file-naming, and file location to a document.

Workflow can stand alone, but injected with the power of OCR and document classification, it becomes a powerhouse that does not know the difference between paper and digital.
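Contextual classification and the filing step that follows it can be sketched in a few lines. Everything in this example is hypothetical: the two document classes, their keywords, folders, and tags are invented, and real classification engines use far richer models than simple keyword counts.

```python
# Hypothetical taxonomy: document class -> (keywords, folder, tags).
TAXONOMY = {
    "invoice": ({"invoice", "total", "due"}, "Finance/Invoices", ["finance", "payable"]),
    "contract": ({"agreement", "party", "signed"}, "Legal/Contracts", ["legal"]),
}

def classify(ocr_text: str) -> str:
    """Pick the class whose keywords appear most often in the OCR text."""
    words = ocr_text.lower().split()
    scores = {
        name: sum(words.count(keyword) for keyword in keywords)
        for name, (keywords, _folder, _tags) in TAXONOMY.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"

def file_document(ocr_text: str, document_id: str) -> dict:
    """Derive file location, file name, and tags from the classification."""
    doc_class = classify(ocr_text)
    if doc_class == "unclassified":
        return {"path": f"Unsorted/{document_id}.pdf", "tags": []}
    _keywords, folder, tags = TAXONOMY[doc_class]
    return {"path": f"{folder}/{doc_class}_{document_id}.pdf", "tags": tags}

print(file_document("Invoice 4471 total due 120.00", "0001"))
```

The same structure would apply if the class came from image-based classification instead; only the `classify` step changes.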

Even OCR needs a helping hand – Quality Assurance

Jun 05
2010

Let’s face it. OCR is not 100% accurate 100% of the time. Accuracy is highly dependent on document type, quality of scan, and document makeup. The reason OCR is so powerful is because it’s not. How do we give OCR the best chance to succeed? There are many ways; what I would like to talk about now is quality assurance.

Quality assurance is usually the final step in any OCR process, where a human reviews uncertainties and business rule violations based on the OCR result. An uncertainty is a character that the software flags because it did not satisfy a confidence threshold during recognition. This process is a balancing act between a desire to limit human time as much as possible and a need to see every possible error, but no more.

Let’s start with the review of uncertainties. Here an operator will look at just those characters, words, or sentences that are uncertain. This is determined by the OCR product, which will have some indicator of what they are. In full page OCR, often spell checking is used. In Data Capture, usually a character-by-character review of a field is done and you don’t see the rest of the results. Some organizations will set critical fields to be reviewed always, no matter the accuracy. Others may decide that a field is useful but does not need to be 100%. Each package has its own variation of “verification mode”. It’s important to know their settings and the levels of uncertainty your documents are showing to plan your quality assurance.

After the characters and words have been checked in Data Capture, there is an additional step in quality assurance: business rules. In this process, the software applies arbitrary rules the organization creates and checks them against the fields. A good example might be “don’t enter anyone in the system whose birth year is earlier than 1984”. If such a document is found, it is flagged for an operator to check. These rules can be endless, and packages today make it very easy to create custom rules. The goal would be to first deploy the business rules you already have in place in the manual operation, and then augment them with rules that enhance accuracy based on the raw OCR results you are seeing.
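The threshold idea behind uncertainty review can be sketched as follows. The `RecognizedChar` structure, the 0.90 threshold, and the confidence values are assumptions for illustration; each OCR package reports confidence in its own way and on its own scale.

```python
from dataclasses import dataclass

@dataclass
class RecognizedChar:
    char: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical engine

def needs_review(chars, threshold=0.90, always_review=False):
    """Return positions an operator should check in one recognized field."""
    if always_review:
        # Critical fields are reviewed in full, no matter the confidence.
        return list(range(len(chars)))
    return [i for i, c in enumerate(chars) if c.confidence < threshold]

# Made-up result for a field whose true value is "1984".
field = [
    RecognizedChar("1", 0.99),
    RecognizedChar("9", 0.97),
    RecognizedChar("B", 0.62),  # an "8" misread as "B", flagged by low confidence
    RecognizedChar("4", 0.95),
]

print(needs_review(field))
print(needs_review(field, always_review=True))
```

Raising the threshold sends more characters to the operator; lowering it saves operator time at the cost of missed errors, which is the balancing act described above.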
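A business rule check can be as small as a list of named predicates run against each captured record. The sketch below uses the birth year rule from the example above; the second rule, the field names, and the sample records are made up for illustration and do not reflect any particular Data Capture package.

```python
def birth_year_rule(record: dict) -> bool:
    """The article's example rule: birth year must not be earlier than 1984."""
    value = record.get("birth_year", "")
    return value.isdigit() and int(value) >= 1984

def name_present_rule(record: dict) -> bool:
    """A second, made-up rule: the name field must not be empty."""
    return bool(record.get("name", "").strip())

RULES = [
    ("birth year is 1984 or later", birth_year_rule),
    ("name is present", name_present_rule),
]

def failed_rules(record: dict) -> list[str]:
    """Return the descriptions of every rule the record violates."""
    return [description for description, rule in RULES if not rule(record)]

records = [
    {"name": "Jane Doe", "birth_year": "1990"},
    {"name": "John Roe", "birth_year": "1979"},
    {"name": "", "birth_year": "19B4"},  # OCR misread in the year
]

for record in records:
    problems = failed_rules(record)
    print(record, "-> flag for operator" if problems else "-> accept", problems)
```

Note that the third record fails the birth year rule because of an OCR misread, not because of the person's age, which is how business rules double as an accuracy check.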

In some more advanced integrations, the use of a database or body of knowledge is deployed as first round quality assurance that is also still automated.

These two quality assurance steps combined should give any company a chance to achieve the accuracy they are seeking. Companies who fail to recognize or plan for this step are usually the ones that have the biggest challenges using OCR and Data Capture technology.

The Magic of 300DPI

Jun 02
2010

Many users of OCR don’t realize what the impact of resolution and bit-depth is or even what they are. Usually in the case of OCR, more is better. More resolution, more bit-depth. It’s more information the OCR engine can use to interpret text. But as with many things, there is a point of diminishing returns and when relating to image resolution, diminishing returns are very interesting.

You will hear a lot that 300 DPI is the best resolution to scan an image for OCR. But why? 300 DPI is that magic number where you gain the most accuracy without sacrificing speed and file size. If you were to put the resolutions on a progressive line starting with 96 DPI and run tests of OCR accuracy, scanning speed, OCR speed, and file size, you would notice something very interesting: the improvement gap between a 200 DPI scan and a 300 DPI scan will be at least 2 times the improvement gap between any other resolutions. Now if you look at the same line between 300 DPI and 400 DPI, the improvement gap is nearly absent, but still there. This simple study is the reason 300 DPI is the ideal resolution for OCR scanning. Now let’s look at why.

Besides its reasonable scan speed and reasonable file size, the biggest reason 300 DPI is optimal is that the engine cores were all initially trained on this resolution. Some engines, no matter what resolution you give them, will actually sample up or down to get to 300 DPI. The image pre-processing/cleanup engines are similarly set up.

There are always exceptions, and the exceptions are usually hand-printed forms ( ICR ) or documents with small print.

The beauty of 300 DPI as a best practice is that it’s one of the few things in the area of OCR and Data Capture that is consistent across document types. You have been told to use 300 DPI, and now you know the reason behind it.
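The file size side of that trade-off is simple arithmetic: pixel count grows with the square of the resolution and is multiplied by the bit-depth. The sketch below computes uncompressed sizes only, assuming a US Letter page (8.5 by 11 inches); real scans are compressed, so actual files are smaller, and the accuracy side of the curve has to be measured on your own documents.

```python
def raw_image_bytes(dpi: int, bit_depth: int,
                    width_in: float = 8.5, height_in: float = 11.0) -> int:
    """Uncompressed size of a scanned page: pixels times bits per pixel."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bit_depth // 8

for dpi in (96, 200, 300, 400):
    bitonal = raw_image_bytes(dpi, 1)
    grayscale = raw_image_bytes(dpi, 8)
    print(f"{dpi} DPI: {bitonal:>10,} bytes at 1-bit, {grayscale:>11,} bytes at 8-bit")
```

Going from 300 to 400 DPI increases the raw pixel data by a factor of about 1.78, which is a steep price if the accuracy gain over 300 DPI is small.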

“You vote engines! Of course it’s better” – Reality of voting

May 15
2010

The trend of companies promoting OCR voting has become less common, but you will still occasionally find products that promote their accuracy by saying they don’t just use one engine, they use many and vote them together. The presumption of this approach is that of course they are more accurate than single engine solutions. This would seem to be the case, but it’s not that easy.

All the OCR engines have a system of voting internally already. This is how OCR technologies have made their advances throughout the years. They take algorithms that are expert in one particular way to interpret text, such as trigrams, words, fonts, etc. and vote their character guesses against each other for the final guess. This works great. This is very different from the voting that is often promoted, of taking several engines and voting their results together. When you take two separate OCR engines and vote them together, it would seem you are getting the best of what’s available, but there is one major problem. Voting requires that each engine report its guesses on the same confidence scale, and this is not the case. For example, on a letter that is really a “c”, Engine A might report with 98% confidence that it’s an “e”, while Engine B might report with 78% confidence that it is a “c”. When you vote these two, Engine A will win even though it’s wrong. This is typically how it goes: one engine in a voting scenario will win most of the time, right or wrong, just because of how it reports its confidence levels.

This blog is not in combat with voting. Voting is a great tool, it’s used internally in the engines, and it can be used externally as well. How? Vote Engine A settings A against Engine A settings B. The same engine voted against itself just with different settings. This is a tremendous tool especially when dealing with varied documents, or highly degraded documents. By doing so you are comparing apples-to-apples confidence levels and not apples-to-elephants.
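The difference between the two kinds of voting shows up even in a toy example. The sketch below uses the simplest possible vote, picking the guess with the highest reported confidence, which is an assumption and not how any specific product votes. The first case uses the 98% and 78% figures from the example above; the numbers in the second case are made up.

```python
def vote(guesses):
    """Pick the guess with the highest reported confidence.

    Each guess is (engine_name, character, confidence).
    """
    return max(guesses, key=lambda guess: guess[2])

# The article's example: the true character is "c".
cross_engine = [
    ("Engine A", "e", 0.98),  # wrong, but reports a high confidence
    ("Engine B", "c", 0.78),  # right, but reports more cautiously
]
print(vote(cross_engine))

# Same engine, two hypothetical settings: confidences share one scale.
same_engine = [
    ("Engine A, settings A", "e", 0.64),
    ("Engine A, settings B", "c", 0.91),
]
print(vote(same_engine))
```

In the cross-engine case the wrong guess wins purely because of how Engine A reports confidence. In the same-engine case the comparison is meaningful, because both numbers come from the same scale.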

So next time you are turned on by voting, take a second look and see if it’s marketing or real value.
