Imprint vs. Annotate

Jan 28, 2010

High-volume scanning environments often need to imprint (herein “stamp”) information, usually the date of scan, on every page that is processed. This requirement exists for tracking purposes and sometimes for compliance. Many service bureaus require more than just a date; they require batch IDs and other important tracking information. The question becomes how best to do this. There are several options.

Pre-Scan Imprint

Pre-scan imprint, the most common option, allows an organization to have the stamp on both the physical paper copy and the scan. Scanners capable of pre-scan imprint print the data in the proper location before the page reaches the scanner’s lamps, so the stamp also becomes part of the scan. This option is the most common because there are times when a scanned image needs to be compared with the physical document, and a shared stamp is what makes that possible. Scanners with the imprint feature come at a premium and require more maintenance.

Post-Scan Imprint

If the organization only needs the data or tracking mechanism on the physical paper, it can imprint after the scan. Some scanners support post-scan imprinting; otherwise organizations feed the paper through an additional printing process. Usually the purpose of this operation is simply to mark whether a page has been processed or not. Scanners with the post-scan imprinting feature run nearly the same price as those with pre-scan imprint and are gradually being phased out in favor of them.

Software Annotation

If the organization only needs the data or tracking mechanism on the scanned image, it may elect to do software annotation. Software annotation gives the greatest flexibility of the three options, as any combination or sequence of data can be printed anywhere on the image. It does require an additional piece of software. Very often organizations choose software annotation to avoid the premium for imprinting scanners, sacrificing the physical imprint. The application that provides the annotation needs to be automated and batch driven.
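As a minimal sketch of what "automated and batch driven" could mean in practice, the Python below builds the stamp text for every page in a batch. The stamp layout, the function names, and the batch ID format are all assumptions made for illustration; drawing the text onto the image is left to whatever imaging tool the annotation software uses.

```python
from datetime import date


def build_stamp(scan_date: date, batch_id: str, page_number: int) -> str:
    """Return the text to annotate on one page image (hypothetical layout)."""
    return f"{scan_date.isoformat()} | BATCH {batch_id} | PAGE {page_number:04d}"


def stamps_for_batch(scan_date: date, batch_id: str, page_count: int) -> list[str]:
    """Generate one stamp per page so a whole batch can run unattended."""
    return [build_stamp(scan_date, batch_id, n) for n in range(1, page_count + 1)]


if __name__ == "__main__":
    # "B0042" is a made-up batch ID used only for this example.
    for stamp in stamps_for_batch(date(2010, 1, 28), "B0042", 3):
        print(stamp)
```

Because the stamp is just a string assembled from fields, any combination or sequence of tracking data can be added without touching the scanner.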

The alternative to the above three methods is manual stamping. Manual stamping is tedious, time-consuming, and often inaccurate. It’s up to the organization to review the three options and pick the best fit for its production and budget.

Chris Riley – About

Find much more about document technologies at www.cvisiontech.com.

Translating images

Jan 26, 2010

Text translation services come in a variety of forms, from individuals who make a good living translating documents from one language to another, to large firms using many individuals or purely software. No matter the form, they all face a challenge when the text they need to translate is on physical paper or in an image file.

Today, translation is facilitated with the use of word processing systems. Word processors give the translator the ability to be more efficient and manage the translation process over many sessions. But in order to use the capabilities of a word processing system, it’s necessary to get the text into a digital format. That is where Optical Character Recognition comes in. OCR is one of the greatest tools in a translator’s bag of tricks. It allows the individual to convert the image files and physical paper to digital text which can be consumed and translated.

The great thing about modern OCR is the sheer number of languages that are supported. Not only can OCR convert a document in one language to digital text, but when a document contains multiple languages, it’s smart enough to know where one language begins and the other ends. If you imagine the risk to a translator working from text with OCR errors, you will see why scanning documents at optimum quality is an important consideration. Modern OCR engines will tell the operator exactly where any confusion might have occurred and give them the opportunity to correct it. Documents scanned at 300 DPI and saved as black and white TIFF Group 4 work especially well.
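The idea of an engine pointing the operator at uncertain spots can be sketched as follows. This is an illustration only: the `OcrWord` structure, the 0.0 to 1.0 confidence scale, and the 0.80 threshold are assumptions, not the output format of any particular OCR engine.

```python
from typing import NamedTuple


class OcrWord(NamedTuple):
    text: str
    confidence: float  # assumed scale: 0.0 (no idea) to 1.0 (certain)


def words_to_review(words: list[OcrWord], threshold: float = 0.80) -> list[tuple[int, str]]:
    """Return (position, text) for every word the engine was unsure about."""
    return [(i, w.text) for i, w in enumerate(words) if w.confidence < threshold]


if __name__ == "__main__":
    page = [OcrWord("Guten", 0.98), OcrWord("M0rgen", 0.41), OcrWord("Welt", 0.95)]
    for position, text in words_to_review(page):
        print(f"check word {position}: {text}")
```

A translator would correct only the flagged words before starting the translation, rather than proofreading the whole page.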

Without OCR, a translator’s job becomes more of a data entry task than what they are truly skilled at, which is translation.

Mysterious tables

Jan 21, 2010

In the world of data capture, the one document element that easily doubles the complexity, increases software cost, and is all-around mysterious is the table. In invoices, the table data is the line-item details; in bills of lading, it is the shipping details. Many commonly used business documents contain tables. Extracting data from tables starts with a clear understanding of table structure.

Most tables follow the typical structure: a header with column names, one to many rows of data below that align with the column names, and a footer that may contain summation data. This structure is ideal. The first added element of complexity occurs when column names do not align with the data. This can happen intentionally or because of shifts in scanning. If it happens always, or often enough, then it’s necessary in data capture setup to ignore table headers completely. The next level of complexity is multi-level headers. Tables with multi-level headers amount to tables within tables. The first level of header is the parent, and the subsequent levels provide additional details, usually for a lesser number of items. The levels are usually indicated by deeper indentation at each level. This is most commonly found in EOBs, and it is what makes EOBs so complex. In this case, you have to capture multiple copies of the same table over and over rather than attempt to collect all the details as one table. In the most complex documents with this structure, the table data capture element is not used at all; a basic field-by-field approach is used instead.

One of the biggest mistakes integrators make is assuming that a certain data capture table approach will work for all their tables on all documents. The only way to know for sure is testing. The ability of data capture software to find table structures is based on a process called Document Analysis. Document Analysis tells the data capture software where all tables on the document are located, allowing it to choose the best one. In the case of tables within tables, this very often results in a single table that cuts data cells in half. Document Analysis is built on probability, so if the cell borders for one column fall at the same location often enough, then that border is selected, right or wrong. The more data in a table, the greater the chance of this probability being wrong.
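A toy model makes the "right or wrong" part easier to see. The sketch below is not how any real Document Analysis engine works; it simply assumes that each row reports an x-position for a column border, that the position most rows agree on wins, and that a cell is "cut" when the winning border falls inside it. All names and pixel values are hypothetical.

```python
from collections import Counter


def pick_border(border_positions: list[int], bucket: int = 5) -> tuple[int, float]:
    """Pick the border position most rows agree on, plus its share of votes.

    Positions are snapped to `bucket`-pixel bins so small scan shifts
    still count as the same border.
    """
    bins = Counter(round(x / bucket) * bucket for x in border_positions)
    position, votes = bins.most_common(1)[0]
    return position, votes / len(border_positions)


def cells_cut_by(border: int, cells: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return the (left, right) cell spans that the chosen border slices through."""
    return [(left, right) for left, right in cells if left < border < right]


if __name__ == "__main__":
    # Four parent rows put the border near x=200; two nested rows put it at x=260.
    border, share = pick_border([200, 201, 199, 200, 260, 260])
    print(border, round(share, 2))
    # A nested-row cell spanning x=180..250 is cut in half by the winning border.
    print(cells_cut_by(border, [(20, 190), (180, 250)]))
```

The majority border is correct for the parent rows and wrong for the nested rows, which is the cell-splitting behavior described above.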

It’s best to use tables on concrete document types, i.e., a single variation of vendor invoice or a class of vendor invoices that all share the same table type. If you prepare, you will not be let down by bad expectations; instead, you will be impressed with your table extraction.

Attachment Emailing Master

Jan 19, 2010

Very often in business, email correspondence is accompanied by a file attachment. While it’s possible to attach any file format to an email (some are not preferred by email clients), the most common type is a document, and the most common format is either Word or PDF. This post contains some advice on the best way to deliver documents via email.

When emailing documents, you have to be concerned about size, readability, and security. If the attachment is too large, you may not be able to email it at all. If the document is not readable, there is no point in sending it anyway. Finally, if it’s not secure, it might be re-purposed or stolen. When your document starts out in paper form, the challenges increase.

There is an ideal format and set of conversion settings to use when sending documents via email. Ideally you would scan your document in color for visual readability. This is not the only type of readability; you also want to make sure the documents remain accessible for long periods of time. You would use optical character recognition (OCR) so the document can be indexed by a search utility. You would use a compression tool to convert that initially large color image into one that is manageable without degrading its quality. Finally, you would use the PDF format to get whatever level of security you choose.

The combination of a searchable, compressed, color PDF is the ideal method for emailing documents as attachments and ensuring their effectiveness and long-term usage.

Replacement for fax right under our noses

Jan 14, 2010

How does a technology first invented in 1843 and executed in 1924 still exist as a primary function in our working lives? I’m talking about fax. Fax technology is old and outdated. I personally avoid fax simply on principle. But my principle alone will not make big changes in adoption. What people don’t understand is that we have a fax replacement right under our noses, one that is both green and just as easy to use.

The combination of a document scanner, imaging software, and email software is a complete fax replacement solution. Instead of typing in phone numbers, users can type in email addresses. With fax you double the amount of paper that exists: paper in, paper out. With the document scanning approach, you reduce paper consumption: paper in, email out. Most document scanners today even ship with a pre-configured “Scan to Email” option. On a production level, systems can be set up in offices, your local Kinkos, wherever, to allow multiple users to access the same document scanner and scan to any email address with a basic step-by-step wizard.
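The "paper in, email out" step is small enough to sketch with Python's standard library. This is a rough illustration, not the way any scanner's “Scan to Email” feature is implemented; the addresses, the subject line, and the mail server host are placeholders.

```python
import smtplib
from email.message import EmailMessage


def build_scan_email(sender: str, recipient: str, pdf_bytes: bytes,
                     filename: str = "scan.pdf") -> EmailMessage:
    """Wrap a scanned PDF in an email, the way a fax wraps a page in a phone call."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Scanned document"
    msg.set_content("The scanned document is attached.")
    msg.add_attachment(pdf_bytes, maintype="application", subtype="pdf",
                       filename=filename)
    return msg


def send_scan_email(msg: EmailMessage, host: str) -> None:
    """Hand the message to an SMTP server; `host` is whatever the office uses."""
    with smtplib.SMTP(host) as server:
        server.send_message(msg)


if __name__ == "__main__":
    # Placeholder addresses and bytes; a real system would read the scanner output.
    message = build_scan_email("scanner@example.com", "clerk@example.com",
                               b"%PDF-1.4 placeholder")
    print(message["To"], len(list(message.iter_attachments())))
```

The only thing the user has to supply is the recipient address, which is the same amount of effort as dialing a fax number.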

Not only is fax to email saving trees, it is also increasing efficiency and when combined with workflow, document imaging, OCR, and data capture, it adds much greater value for that single piece of paper.

These systems do in fact exist in small corners of the world, and I have participated in the development and setup of them. The adoption is still very low. What it comes down to is fear of change. People understand paper to paper. Many users of fax don’t even know what email is. There are two ways this can be solved, time and forced adoption. While I would hope for the second which would be a campaign of replacing all fax machines with scanners, it’s very unlikely and requires unity of multiple competing entities.

No, I do not like fax, but I understand it. And I hope that sooner rather than later people see that there is a solution to replace fax, one that saves trees, increases efficiency, and has existed for many years.

Dual Stream Scanning – Have your cake and eat it too

Jan 12, 2010

The benefit of drop-out forms is that they are very accurate in data capture. The downside is that after they are scanned, they aren’t much to look at. Companies want the best of both drop-out and black and white forms. They do this in various ways, the most common being to just deal with the images they have. Some will scan a document twice, which can be very time-consuming. Others will use an overlay utility that stamps the original form fields and labels back onto an already processed drop-out image. These utilities are accurate, but not as accurate as the original, and often result in lines stamped on text. The best solution for scanning a form efficiently so that it is optimal for both data capture and viewing is dual stream scanning.

Dual stream scanning is usually a feature on high-end scanners. The technology is slowly moving down to workgroup and desktop scanners. The feature allows a single scan to produce both a drop-out image and a black and white image. The scan speed is the same as if you were scanning in color. When configured, the drop-out image goes down one path and the black and white image down another. By doing so, a company can use the drop-out image only for data capture, and the black and white image will marry with the data capture results in the database or file system.

On average, data capture from a drop-out form is 15% more accurate than from a black and white scanned form, and the difference is often much higher. The reason is that the OCR in data capture is not disrupted by form lines printed on or too close to text. Additionally, the logic to locate fields can be simplified, as field labels are often in a small font and hard to detect.

With its simplicity and the greatest accuracy of any solution, dual stream is a great tool.

Print to OCR?

Jan 07, 2010

When I talk to people about the unique technique of printing text documents to image just for the purpose of running optical character recognition (OCR) or data capture on them, they are rightfully confused and think I’m a little nutz.

Why would you ever convert an already digital document back to image? I promise it’s not because I’m so fond of OCR; it actually has its purpose.

Language Detection: By converting a document to image for OCR, I can check the language of each word in the document. While I would much prefer to use a language detection tool on a digital file, there is no robust tool that exists to do this at volume. The unique aspect of OCR engines is that they contain morphology and dictionaries. This is where OCR has improved its accuracy in the past 5 years. OCR engines attempt to identify the language of text in order to better read the document. Because this mechanism is already built into the engine, if I convert a digital file to image and OCR it, I can tell you what languages exist in that document. Additionally, while font is a clear indicator of language, a font that is not accompanied by the proper language encoding will not tell a digital process what the language is. In OCR there is no need for such an encoding.

Normalization of digital formats: While a PDF created in Acrobat and a PDF created in a third-party tool look identical to the viewer, internally these PDF files are very different. In order to parse a PDF file digitally with accuracy, you need a standard format. Without one, you are dealing with variations both in the document’s visual layout and in its internal structure. This becomes an overwhelming number of variations. For example, a collection of invoices has as many variations as there are invoice layouts times the number of PDF-generating applications in use. However, if you were to OCR the PDF and parse the result, instead of parsing the digital file directly, you would be dealing only with the variations that exist in the invoices themselves.
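The arithmetic behind that comparison is simple enough to write out. The counts below (40 invoice layouts, 6 PDF-generating applications) are invented purely to make the multiplication concrete.

```python
def variants_when_parsing_digitally(invoice_layouts: int, pdf_generators: int) -> int:
    """Every layout can arrive in every generator's internal PDF structure."""
    return invoice_layouts * pdf_generators


def variants_when_parsing_ocr_output(invoice_layouts: int) -> int:
    """After printing to image and running OCR, only the visual layout remains."""
    return invoice_layouts


if __name__ == "__main__":
    layouts, generators = 40, 6  # hypothetical counts
    print(variants_when_parsing_digitally(layouts, generators))
    print(variants_when_parsing_ocr_output(layouts))
```

Printing to image removes the second factor from the product, which is why the parsing rules only have to cover the invoice layouts themselves.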

However crazy it sounds, the above two are real scenarios, and there are many more. I doubt that these problems will always exist, but it makes you think twice about crazy statements such as printing a digital document to image just so you can OCR it.

Document Preparation

Jan 05, 2010

In some organizations, document preparation prior to scanning is the largest time cost in their document entry process. In all organizations, it’s an important consideration. Document preparation is the process of sorting, organizing, and preparing documents for the most successful document scan and the best chance at accuracy in downstream software processes. Document preparation can be as simple as dividing pages into stacks small enough for a document scanner to handle, or as complex as removing staples, opening envelopes, and separating documents with page separators.

As recognition technology advances, the need for document preparation diminishes. New technologies allow for automatic document separation based on templates or keywords, automatic document rotation, annotation, sorting, etc. The challenge for organizations becomes deciding which document preparation steps to hand to technology and which to leave to manual labor. This has been a challenging question, and as new technologies surface, it becomes even more challenging.

If an organization keeps its focus on return on investment, the path should become clear. A complete evaluation of the technologies will show the accuracy and percentage of automation that can be achieved, and the amount of time and cost it will save. The tricky part of the evaluation is understanding the environment. Studying how document preparation is currently done, and which preparation steps document entry requires, should be fairly straightforward. Listing the document preparation features that software can handle, and the products that have them, is a little more complex and requires an organization to spend dedicated time on it. Separating and barcoding documents tends to be the biggest cost and the low-hanging fruit for automation. OCR software can determine where a document starts and ends from keywords, instead of a person manually placing separator pages or barcodes on the document.
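Keyword-based separation can be sketched in a few lines. The example assumes the OCR text of each scanned page is already available as a string, and that a page containing one of the start keywords begins a new document; the keywords and page texts are invented for illustration.

```python
def split_documents(pages: list[str], start_keywords: list[str]) -> list[list[str]]:
    """Group consecutive pages into documents, starting a new one at each keyword hit."""
    keywords = [k.lower() for k in start_keywords]
    documents: list[list[str]] = []
    for text in pages:
        starts_new = any(k in text.lower() for k in keywords)
        # The very first page always opens a document, keyword or not.
        if starts_new or not documents:
            documents.append([])
        documents[-1].append(text)
    return documents


if __name__ == "__main__":
    scanned = [
        "INVOICE No. 1001 Acme Supply",
        "line items continued",
        "Invoice No. 1002 Acme Supply",
        "Bill of Lading 77",
        "shipping details continued",
    ]
    for doc in split_documents(scanned, ["invoice no", "bill of lading"]):
        print(len(doc), "page(s):", doc[0])
```

Each keyword hit replaces one separator sheet or barcode that a person would otherwise have placed by hand, which is where the time savings come from.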

For most organizations, the result is a combination of manual and automatic. The ultimate goal would be to automate every step in document preparation that can be automated and leave manual only those that have to be, such as placing documents in a scanner.
