PDF OCR for NARA compliance

In All, Archived, OCR, PDF OCR, PDF/A by Chris

Retaining documents is, in a sense, easier than it has ever been. Many documents are already in electronic form, and paper documents can easily be processed through a scanner or MFP and converted to electronic form. For serious-minded IT managers, however, this is only where the problem begins.

There are issues of retention that must be resolved. Will these documents open and be readable 5, 10, or 30 years from now? While microfiche is fast becoming an outdated technology, it is also tried and true: little can go wrong when archiving a company's database to microfilm. The same cannot easily be said for electronic archiving.

Will the operating system five years from now, say Vista 2012, be able to read the PDFs being archived right now? Are these archived records readable across all machines in the company, even overseas? Do the electronic files satisfy NARA archival requirements? Are we converting to PDF/A?

So although document archiving and records management (RM) are in some sense becoming simpler and more streamlined as the corporate file room becomes a thing of the past, many aspects of archiving and RM are becoming more complex. What constitutes a legal record? Is a facsimile acceptable?

NARA has put out a set of guidelines that can be helpful to IT departments trying to sort all this out:

http://www.archives.gov/records-mgmt/initiatives/pdf-records.html

http://www.archives.gov/records-mgmt/initiatives/scanned-textual.html

OCR is generally an important component when working with scanned documents. Once paper has been converted to electronic documents for archiving, it is often important to have full-text search capability over these files. It is also important to understand, particularly when archiving to PDF, what the various formats for scanned documents are and which are acceptable from a NARA archival perspective.

In particular, many OCR products by default convert OCR'ed documents to an electronic form such as text or Word. Any such conversion to a non-image format is not acceptable for archiving (at least from a NARA perspective), since the OCR process makes guesses and there is a certain error rate as scanned characters are misinterpreted by the OCR system.
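To make the "certain error rate" concrete, here is a minimal sketch of how a character error rate could be measured against a hand-verified transcription. The function names and the sample strings are illustrative assumptions, not taken from NARA guidance or from any particular OCR product.

```python
def edit_distance(truth: str, ocr: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(ocr) + 1))
    for i, t in enumerate(truth, start=1):
        curr = [i]
        for j, o in enumerate(ocr, start=1):
            cost = 0 if t == o else 1
            curr.append(min(prev[j] + 1,          # delete a character from truth
                            curr[j - 1] + 1,      # insert a character from ocr
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_error_rate(truth: str, ocr: str) -> float:
    """Edit distance divided by the length of the verified text."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(truth, ocr) / len(truth)


# Hypothetical example: typical confusions between I, l and 1.
truth = "Invoice total: $1,010.00"
ocr = "lnvoice tota1: $l,0l0.00"
print(f"CER: {character_error_rate(truth, ocr):.1%}")
```

Even a handful of misread characters can change an amount or a name, which is why a text-only conversion cannot stand in for the scanned image as the record.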

Also not acceptable for record retention purposes is PDF normal, which is a hybrid in which image and electronic PDF content are blended together. PDF normal also has problems with OCR mismatches and font substitutions. In addition, the combination of electronic and image characters can detract from document readability.

What is acceptable for compliance with NARA, PDF/A, and other archival specifications? The basic requirement is that captured paper documents remain in some image format, e.g., PDF image. They must look perceptually identical to the documents at the time of capture. To achieve this, a purely image-based PDF format needs to be selected (e.g., JPEG or JBIG2-based). This rules out conversion to electronic or normal PDF formats.
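The rules so far can be summarized as a small lookup that a scanning workflow might consult before committing a file to the archive. This is only a sketch of the reasoning in this post: the enum and function names are hypothetical, and the acceptable set reflects the post's reading of the guidelines rather than an official NARA list.

```python
from enum import Enum


class ScanOutput(Enum):
    """Output formats discussed in this post (names are illustrative)."""
    TEXT = "text"
    WORD = "word"
    PDF_NORMAL = "pdf_normal"
    PDF_IMAGE = "pdf_image"


# Assumption: only image-preserving output keeps the page perceptually
# identical to the paper original, so only it is treated as archival.
ARCHIVAL_FORMATS = {ScanOutput.PDF_IMAGE}


def is_archival(fmt: ScanOutput) -> bool:
    """Return True if the format keeps the scanned page as an image."""
    return fmt in ARCHIVAL_FORMATS


for fmt in ScanOutput:
    verdict = "acceptable" if is_archival(fmt) else "not acceptable"
    print(f"{fmt.value}: {verdict}")
```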

It is also OK, from a NARA perspective, to add a hidden text layer to an image PDF document. The hidden text layer does not change the appearance of the PDF in any way, but it does allow for full-text search and indexing of the source document.
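In PDF terms, a hidden text layer is usually text drawn with text rendering mode 3 (the `3 Tr` operator), which neither fills nor strokes the glyphs. The sketch below builds a simplified page content stream that paints the scanned image and then adds invisible OCR text on top. The resource names `Im0` and `F1`, the page size, and the single text position are assumptions for illustration; real OCR tools place each word over its location in the image, and this fragment is not a complete PDF file.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def page_content_stream(image_name: str, page_width: int, page_height: int,
                        ocr_text: str, font_name: str = "F1",
                        font_size: int = 10) -> str:
    """Build a content stream: full-page image plus invisible ASCII text."""
    operators = [
        "q",                                      # save graphics state
        f"{page_width} 0 0 {page_height} 0 0 cm", # scale image to the page
        f"/{image_name} Do",                      # paint the scanned image
        "Q",                                      # restore graphics state
        "BT",                                     # begin text object
        f"/{font_name} {font_size} Tf",           # select font and size
        "3 Tr",                                   # rendering mode 3: invisible
        f"72 {page_height - 72} Td",              # simplified: one position
        f"({escape_pdf_string(ocr_text)}) Tj",    # searchable OCR text
        "ET",                                     # end text object
    ]
    return "\n".join(operators)


# Hypothetical US Letter page (612 x 792 points) with one line of OCR text.
print(page_content_stream("Im0", 612, 792, "Total (net): 1,010.00"))
```

Because the text is never painted, the page still looks exactly like the scan, while search and indexing tools can read the text operators.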
