Already digital but still OCRed

May 19

I’ve faced some unique projects in the last four years, and in a few of them the best approach seemed to contradict my better logic. The projects I’m talking about are ones where the data we were working with was already in a digital format, namely PDF files that were created digitally. This meant that all the text in the PDF was available and 100% accurate. So why, to accomplish the project’s goals, did we use OCR to read the already digital files as images?

I had intended in all these projects to do a logical parsing of the already digital content to get what I wanted. The problem is that even though the internal structure of a PDF follows a logical standard, by my estimate most PDF-generating applications do not use it logically about 90% of the time. PDF has a built-in tolerance for mistakes that allows organizations to deviate quite drastically from the standard. This means that the content of each PDF is unique not only to the company that generates it, but also to each application able to create it. Variations on top of variations make logical parsing very difficult. This becomes most obvious when the documents contain tables. Because of this, the only way to text-parse the PDFs properly would be to flatten their internal logic so that they consist of nothing but text, but by doing so you lose some of the information pointing to where tables are and how they are structured.

You may have guessed by now that all my projects were to parse tables from PDFs. Not just any tables, but specific tables in PDFs where each was in a unique format. As I said before, my preference would have been to use the 100% accurate data already in the PDF. What I ended up doing was OCRing the PDFs. Because they were what is called “pixel perfect”, the OCR accuracy was very high. With OCR, I was able to first recognize an entire document and remove everything that my OCR document analysis determined was not a table. Then I was able to use keywords to find the specific table that I wanted. Each project took me about 3 weeks of work, and the result was higher accuracy in table finding and only slightly lower accuracy in the text values than direct parsing would have given.
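As a rough illustration of the two steps described above (drop everything that is not a table, then pick the table by keyword), here is a minimal Python sketch. The `Block` structure, its `kind` labels, and the sample page are assumptions made for this example; they stand in for whatever a given OCR engine's document analysis actually returns.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str  # "table", "text", ... as labeled by a hypothetical OCR layout analysis
    text: str  # recognized text of the block


def find_tables(blocks, keywords):
    """Keep only table blocks, then return those containing every keyword."""
    tables = [b for b in blocks if b.kind == "table"]
    wanted = [k.lower() for k in keywords]
    return [t for t in tables if all(k in t.text.lower() for k in wanted)]


# Hypothetical OCR result for one page: one text block and two tables.
page = [
    Block("text", "Annual report: balance sheet discussion"),
    Block("table", "Item Qty Price\nWidget 2 9.99"),
    Block("table", "Balance Sheet\nAssets 100\nLiabilities 40"),
]

matches = find_tables(page, ["balance sheet", "assets"])
for table in matches:
    print(table.text)
```

Note that the text block mentioning "balance sheet" is ignored because it was not classified as a table, which is the point of filtering by layout first and by keyword second.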

While it seemed most logical to do the parsing, in the end I saved over 5 man-months of work by using OCR.

Chris Riley – About

Find much more about document technologies at

File Format Buffet! – choosing your output format

Dec 09

These days, OCR and data capture solutions give you a broad selection of output formats for the result. Until the advent of layered file formats, your only choices were text formats such as Word .doc or plain text .txt. But now the formats themselves come with a ton of options, leaving people to decide first which export format to use and then which variation of that format.

For the most part, it seems OCR results are exported in one of two primary formats: Word .doc or Portable Document Format .pdf. So we will use these as our staples.

Word is more or less a text-only format. Scanning and converting a document to Word is useful when you want to edit the text, reformat it, add graphics, and then re-create the document, or borrow its contents. The options for this format that relate to OCR and data capture include keeping formatting, keeping graphics, and choosing the encoding. It’s fairly easy to decide which of these options would be most useful to your process. The text formats from document conversion are usually limited to immediate consumption rather than distribution, while the layered formats are for distribution and storage.

There are actually many layered file formats. There are even variants of JPEG and TIFF that permit a text layer. In the last few years, Microsoft released its own “layered” format called XPS, whose popularity has yet to catch on. PDF is still the winner in this area. PDF comes with a salad bar of options, and sometimes it’s hard to pick what is best. When used in conjunction with data capture and OCR, the most common variation is a PDF with searchable text under the page image. This means that the visible layer of the PDF is the scanned image, and underneath it, with matching coordinates, is the text from OCR or data capture. The purpose is that when you search the text, the matching content is located on the image. Because PDF is for the most part a locked-down format, it’s important to decide which variation you want before creating one. Other common settings are tagging, password protection, PDF/A for archiving, and bookmarks. When PDF is used with data capture and OCR, you will frequently see PDF/A for long-term archiving of documents, and password protection. Tagging and bookmarks usually require an additional manual step unless the data capture program supports filling in this metadata. If you keep the quality of the image layer high enough in any layered format, you can OCR it again if you make a mistake in your format choices.

The upshot is that although you have a lot of options, you should be able to find the best practice or norm for your space very easily. Many of the choices are used only in special scenarios, and if you are not privy to the scenario then you probably don’t need the option.
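To make the idea of "text under the page image with matching coordinates" concrete, here is a minimal Python sketch. The `Word` record and the sample coordinates are assumptions made for this example, not the internal layout of a real PDF; the point is only that each recognized word carries the rectangle it occupies on the scanned image, so a text hit can be highlighted on the picture.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: int       # left edge on the page image, in pixels
    y: int       # top edge on the page image, in pixels
    width: int
    height: int


def search_layer(words, query):
    """Return the image rectangles of every word matching the query (case-insensitive)."""
    q = query.lower()
    return [(w.x, w.y, w.width, w.height) for w in words if w.text.lower() == q]


# Hypothetical hidden text layer for one scanned page.
layer = [
    Word("Invoice", 40, 30, 120, 24),
    Word("Total", 40, 400, 80, 24),
    Word("total", 300, 520, 78, 24),
]

hits = search_layer(layer, "TOTAL")
print(hits)  # rectangles a viewer could highlight on the image layer
```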

Down and dirty paperless office

Jul 28

In my office, paper comes in, is reviewed for value, gets scanned, and is then shredded or filed. I have set up a system that allows me to very efficiently scan documents to my “digital file cabinet”. Here is a quick guide on how I do it!

What you will need:

  1. An unused computer attached to your network

  2. Google Desktop Search with network browsing enabled

  3. A document scanner

  4. A server based automatic OCR product

  5. A file compression product ( optional but recommended )

Now to put it all together. My system is an inexpensive desktop computer with Windows XP installed. Once all the applications are installed, you don’t even need a monitor attached to this computer. The computer is visible on the network and has one shared folder, the “File Cabinet” folder in my case. This computer is my stand-alone digital file cabinet. Attached to it is a document scanner with a 30-page feeder. I have the scanner configured to scan to an “input” directory on the machine.

The automatic OCR processing product is configured to pick up images as soon as they arrive in the input folder (a “hot folder”), OCR them using specific index-level OCR settings, and create a PDF with a hidden searchable layer. The resulting PDF is put into another hot folder that the PDF compression tool is watching. As soon as a PDF arrives in this folder it is compressed, and the compressed PDF is moved to the “File Cabinet” folder.

Because Google Desktop Search is set to index all files in the “File Cabinet” folder, the PDFs very quickly become part of the index. Configure Google Desktop Search to enable network searches so that any machine on the network can open a browser, go to a URL hosted on the digital file cabinet machine, and locate documents with a search.

Once it’s set up, it’s simply a matter of putting paper in the scanner and pressing the scan button, and you’re done. It’s that easy, and extremely useful!
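The hot-folder chain described above can be sketched in a few lines of Python. This is only an illustration of the folder-to-folder hand-off: the `sweep` function, the folder names, and `placeholder_step` are assumptions made for this example, and the placeholder simply copies the file where a real OCR or compression product would write a new PDF.

```python
import shutil
import tempfile
from pathlib import Path


def sweep(hot_folder, next_folder, process):
    """Run `process` on each file waiting in hot_folder, placing results in next_folder."""
    hot_folder, next_folder = Path(hot_folder), Path(next_folder)
    next_folder.mkdir(parents=True, exist_ok=True)
    handled = []
    for src in sorted(hot_folder.iterdir()):
        if not src.is_file():
            continue
        dst = next_folder / src.name
        process(src, dst)  # the step writes dst from src
        src.unlink()       # the hot folder is emptied once the file is handled
        handled.append(dst.name)
    return handled


def placeholder_step(src, dst):
    # Stand-in for the OCR or compression product; it only copies the file.
    shutil.copyfile(src, dst)


# Demo in a temporary directory: input -> ocr_out -> File Cabinet.
root = Path(tempfile.mkdtemp())
(root / "input").mkdir()
(root / "input" / "scan001.tif").write_bytes(b"fake image data")

sweep(root / "input", root / "ocr_out", placeholder_step)
sweep(root / "ocr_out", root / "File Cabinet", placeholder_step)
print(sorted(p.name for p in (root / "File Cabinet").iterdir()))
```

A real setup would run each sweep repeatedly (for example in a loop with a short sleep, or as a scheduled task) so that files are picked up shortly after they arrive.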

Why OCR is for everyone

Jul 07

You may have come to this site looking for OCR software or PDF compression tools, or maybe you arrived via StumbleUpon. Maybe a friend said they used OCR and loved it, and you just had to Google it to find out what IT was. Unfortunately, tech industries have a habit of making great technology visible only to those who know the acronyms and have a good idea of the benefits it can provide. Everyone can benefit from Optical Character Recognition. So let’s break the barrier.

What is most important about the technology is not how it works, but the result it produces. Sometimes when people who are unfamiliar with scanners see the slew of document scanners I have, they ask “why do you have so many printers?” Barrier one: scanning. To be OCRed, documents need to arrive via email or some other digital transfer as images, or more likely they are paper that needs to be scanned. We all get mail; some is junk and some is useful. We all also have paper documents sitting around and in cabinets that we need to keep for a rainy day. At the same time, we use our computers more every year and create many files on them. So at the very least, wouldn’t it be nice to take the useful mail and the other useful documents you have around (mortgage documents, nice letters, business cards, etc.) and get them in with all your other digital files? To do so you scan them, hopefully using a document scanner, as it’s more efficient than a flatbed. Consumers are very used to the idea of scanning photos; scanning documents is no different except that you have more of them. A document scanner, which is not a printer but looks like one, allows you to batch documents and scan them to a folder on your computer without doing it one by one, one side at a time, as with a flatbed scanner. Now that you are scanning, you have an image representation of your documents on your computer, right beside all your other digital files. Now what? Now it’s time to get the data out and make them just as useful as all your other files.

Barrier number two: OCR. It’s an acronym that stands for Optical Character Recognition. That does not tell you much, so forget about it and use it only to refer to the process. Simply put, it’s a helpful technology that gets text from images and converts it into a format you can use. OCR converts the image into usable text, so you can search for that nice letter, or you can edit that party invite and print it again. The result can be PDF, DOC, TEXT, or pretty much any format you can imagine.

Now, coming full circle, that good mail and those useful documents are no longer sitting somewhere cluttering up desks and drawers; they are with all your other files on your computer, ready to use. OCR is useful to everyone. You just have to clear your mind of the techie talk and understand its value.

Print to OCR?

Jun 16

When I talk to people about the unique technique of printing text documents to image just for the purpose of running optical character recognition (OCR) or data capture on them, they are rightfully confused and think I’m a little nuts.

Why would you ever convert an already digital document back to image? I promise it’s not because I’m so fond of OCR; it actually has its purpose.

Language Detection: By converting a document to image for OCR, I can check the language of each word in the document. While I would much prefer to use a language detection tool on a digital file, I know of no robust tool that does this at volume. The unique aspect of OCR engines is that they contain morphology and dictionaries. This is where OCR has improved its accuracy in the past 5 years. OCR engines attempt to identify the language of text in order to better read the document. Because this mechanism is already built into the engine, if I convert a digital file to image and OCR it, I can tell you what languages exist in that document. Additionally, while the font is a clear indicator of language, if it is not accompanied by the proper language encoding it will not tell a digital process which language it is, whereas OCR has no need for such an encoding.

Normalization of digital formats: While a PDF created in Acrobat and a PDF created in a third-party tool look identical to the viewer, internally these PDF files are very different. In order to parse a PDF file digitally with accuracy, you need a standard format that is actually used. If you do not have one, you are dealing with variations both in how the document looks and in its internal structure. This becomes an overwhelming number of variations. For example, a collection of invoices has as many variations as there are invoice layouts, multiplied by the number of PDF-generating applications that exist. However, if you OCR the PDF in order to parse it, rather than parsing it digitally, you are dealing only with the number of variants that exist in the invoices themselves.

However crazy it sounds, the above two are real scenarios, and there are many more. I doubt that these problems will always exist, but they make you think twice about seemingly crazy statements such as printing a digital document to image just so you can OCR it.
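The arithmetic behind that claim can be written out in a short Python sketch. The figures below (50 invoice layouts, 12 PDF-generating applications) are hypothetical numbers chosen only to show the multiplication, and `variants_to_handle` is a name invented for this example.

```python
def variants_to_handle(layouts, generators, via_ocr):
    """Count the document variants a parser must cope with.

    Digital parsing sees every layout as produced by every generator;
    OCR sees only the rendered page, so the generator no longer matters.
    """
    return layouts if via_ocr else layouts * generators


layouts = 50      # hypothetical number of distinct invoice layouts
generators = 12   # hypothetical number of PDF-generating applications

print(variants_to_handle(layouts, generators, via_ocr=False))  # digital parsing
print(variants_to_handle(layouts, generators, via_ocr=True))   # OCR-based parsing
```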

Zip, Compress, Tar, Rar ….. confuse me?

May 05

Hard drives continually get cheaper, but the rate at which people collect information still fills up hard drive real estate faster than we can add more storage. One trick for saving space is to use one of the various compression technologies. When most people think about compression, they now think of either a Zip file or that little check-box in Windows settings that enables compression on a physical drive. What is often overlooked is the ability to compress a single file with a compression tool built for that specific format.

Choosing the right way to compress files and save space depends on several things: how often you will access the file again, what compression ratio you are getting, and what the long-term impacts of the compression are. When you use technologies such as Zip, Tar, and RAR, you are usually combining multiple files together and don’t have plans to access them soon. These tools take multiple files and combine them into a single archive file. This means that accessing any one individual file in that archive takes additional time and effort. With this approach you can combine many different formats. Some formats will have a compression ratio of 0% and others a compression ratio of 60%. Occasionally, though rarely, a zip operation is not successful and the result is file corruption. I always suggest checking that you can un-zip a zip after it’s created. People who need to access their files regularly, or need to be able to search their content at any given time, can still benefit from compression tools that are specific to a format and that work one file at a time in batch.

The most common file format that people use for search and retrieval, and that is generated by data capture and OCR, is PDF. PDFs usually compress well in a Zip, Tar, or RAR tool, but there are specific things that can be done just for a PDF to compress it even further. PDFs often have a text layer that is searchable and an image layer for viewing. The bulk of the file size is usually the image layer, so a specific image compression can be applied to just this layer, and a separate text compression to the text layer. The result is a PDF that opens just like any other file but takes up much less space. The benefit of this is that you can access your PDF at any time, it’s still indexed by your search utility, and you are saving space!
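Both points above, that compression ratios vary widely by content and that an archive should be checked after it is created, can be tried with Python's standard `zipfile` module. This is a minimal sketch: `zip_and_verify` and the sample data are assumptions made for this example, and the archive is built in memory rather than on disk to keep the example self-contained.

```python
import io
import os
import zipfile


def zip_and_verify(files):
    """Zip a dict of name -> bytes, verify the archive, and report space saved per file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)

    with zipfile.ZipFile(buf) as zf:
        bad = zf.testzip()  # reads every member and checks its CRC
        if bad is not None:
            raise ValueError(f"corrupt member in archive: {bad}")
        saved = {
            info.filename: 1 - info.compress_size / info.file_size
            for info in zf.infolist()
            if info.file_size
        }
    return buf.getvalue(), saved


sample = {
    "text.txt": b"the same sentence over and over. " * 200,  # highly repetitive
    "random.bin": os.urandom(3000),                           # essentially incompressible
}

archive, saved = zip_and_verify(sample)
for name, fraction in saved.items():
    print(f"{name}: {fraction:.0%} space saved")
```

The repetitive text shrinks dramatically, while the random bytes save essentially nothing and can even grow slightly, which is the spread of ratios the paragraph above describes.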

Compression is almost always a good choice when considering saving space. Compression technologies have come a long way in the last 4 years. It’s good to know what your purpose is in compression and the frequency you want access to your files. Don’t be afraid to scout out compression tools for specific file formats and give them a try.

PDFs that need to lose weight

Sep 22

The cost of hard-drive space has decreased dramatically over the years, but the amount of data being created is keeping pace. It’s important to find ways to manage the space you have, and one way to do that is to consider file compression. PDF files are often a great opportunity to save space. PDFs consist of layers. If you have a PDF converted using OCR, it will most likely have an image layer and a text layer. There are several ways to compress PDFs: during scan, after scan, or with a compressed archive format.

Compressing PDFs during scan is the fastest way to ensure files are the size you expect. The downside is that you never start with an uncompressed file, so quality is out of your control. Most advanced compression tools are not lossless, so the file can be compromised, and if you never have a chance to view the file uncompressed there is no way to undo any issues. During-scan compression essentially compresses the TIFF or RAW image file prior to PDF creation, so the other downside is that it does not achieve the highest possible compression.

Compressing PDFs after scan allows you to leverage the latest technology and get the greatest compression. There are tools that, instead of compressing an image before PDF creation, work specifically on the PDF format. The benefit is that they can use features within the PDF itself to create a compressed file. This usually results in the smallest file with the greatest residual quality.

The most common tool for compression is a compressed archive format such as RAR, ZIP, etc. These tools can very nicely compress many formats into a single file. The challenge is that files that need to be viewed regularly require a decompression step. This is time consuming and increases the risk of file loss. This type of compression is useful for storing files that are not regularly accessed. Because it can compress many formats, it is not as advanced for any one format as specialized file compression tools are.

People commonly overlook the importance of compression. Because compressed files often replace the originals, you only have one chance to get it right for the life of that file. Companies will use various forms of compression. Because PDF files usually contain important information, please consider carefully how you wish to store and compress them.

Attachment Emailing Master

Jan 19

Very often in business, email correspondence is accompanied by a file attachment. While it’s possible to attach a file of any format to an email (some are not preferred by email clients), the most common type is a document and the most common format is either Word or PDF. This post contains some advice on the best way to deliver documents via email.

When emailing documents, you have to be concerned about size, readability, and security. If the attachment is too large, you may not be able to email it at all. If the document is not readable, there is no point in sending it anyway. Finally, if it’s not secure, it might be re-purposed or stolen. When your document starts out in paper form, the challenges increase.

There is an ideal format and set of conversion settings to use when sending documents via email. Ideally you would scan your document in color for visual readability. This is not the only type of readability; you also want to make sure the documents remain accessible for long periods of time. You would use optical character recognition (OCR) so that the document can be indexed by a search utility. You would use a compression tool to convert that initially large color image into one that is manageable without degrading the quality, and finally you would use the PDF format to get whatever level of security you choose.

The combination of a searchable, compressed, color PDF is the ideal method for emailing documents as attachments and ensuring their effectiveness and long-term usage.
