Indexing the OCR output of PDF files

In Archived, Uncategorized by Chris

Question: My database system supports full-text indexing and search. Yet the OCRed, searchable PDFs that I drop into the database do not show up when I search for terms I know the OCR text contains. Why is this?

Answer: PDF is a relatively new format, and until a few years ago many full-text search engines did not index PDF files at all. Google, in fact, has only been indexing PDF files for the last 12 months or so. Even now that search engines and databases have started to index PDFs, many complex PDFs are still indexed only partially or not at all. One type that is not yet reliably indexable across text-search engines is the PDF with hidden text, the kind OCR produces: the hidden text is often skipped by the indexer. We have seen this with Hummingbird, Google, and FileNet installations. Sometimes, as in the Hummingbird case, the failure to index an OCRed PDF with hidden text was directly attributable to the company not having the latest build or patch of its search engine.
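To make "hidden text" concrete: OCR tools commonly place the recognized words in an invisible layer behind the page image, and in the PDF specification invisible text is text rendering mode 3, set with the `Tr` operator. A text filter that ignores that layer would hand the indexer no words, which is one plausible way the problem above arises. The Python sketch below is only an illustration under stated assumptions: the content stream has already been decompressed, strings are simple literals shown with `Tj`, and graphics-state save/restore is ignored. The function name `split_text_by_visibility` and the sample stream are hypothetical, not taken from any of the products mentioned.

```python
import re

# Matches either a text rendering mode change ("<n> Tr") or a simple
# literal string shown with Tj ("(text) Tj"). Escaped parentheses,
# TJ arrays, and hex strings are deliberately not handled in this sketch.
TOKEN = re.compile(rb"(?P<mode>\d)\s+Tr\b|\((?P<text>[^()\\]*)\)\s*Tj\b")


def split_text_by_visibility(content: bytes):
    """Return (visible, hidden) lists of strings from a decoded content stream."""
    mode = 0  # PDF default text rendering mode: fill glyphs (visible)
    visible, hidden = [], []
    for m in TOKEN.finditer(content):
        if m.group("mode") is not None:
            mode = int(m.group("mode"))
        else:
            text = m.group("text").decode("latin-1")
            # Mode 3 means "neither fill nor stroke", i.e. invisible text.
            (hidden if mode == 3 else visible).append(text)
    return visible, hidden


# Hypothetical content stream: one visible caption, then an OCR-style hidden layer.
sample = (
    b"BT /F1 12 Tf 72 720 Td (Scanned by ACME) Tj ET\n"
    b"BT 3 Tr /F1 10 Tf 72 700 Td (invoice) Tj (total) Tj ET\n"
)
visible, hidden = split_text_by_visibility(sample)
print("visible:", visible)
print("hidden:", hidden)
```

If every word you expect to find lands in the hidden list, the OCR text is present in the file, and the thing to check is whether your search engine's PDF filter, at its current build or patch level, reads invisible text.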
