Handling Large Documents on the Web

Question: What are the major issues in handling very large files on the Web? Is one file format preferable to another?

Answer: The major issues in handling large files on the Web are: (i) compression, (ii) web-optimization, (iii) search, and (iv) chunking. We'll briefly review each of these below.

Compression: Just because a file has many pages does not mean its file size must also be large. Compare, for example, a 1,000-page electronic file with a 1,000-page scanned color TIFF of the same document. The scanned file can easily be 100x larger than the electronic file (e.g., 1 GB vs. 10 MB). So compression can be a key factor in making sure your documents are amenable to web hosting, particularly when dealing with scanned image documents. Compression can yield reductions of up to 10x for black-and-white image documents and up to 100x for color image documents. See, for example, http://www.cvisiontech.com/pdf_compressor_31.html.
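To make the ratio concrete, here is a minimal sketch using the Pillow imaging library (an assumption on our part; any toolkit with CCITT Group 4 support would do) that saves the same black-and-white scan uncompressed and with Group 4 compression. The file name scan.tif is a placeholder:

```python
import os
from PIL import Image

# Load a scanned page and force it to 1-bit black and white, the mode
# that Group 4 (fax) compression applies to.
page = Image.open("scan.tif").convert("1")

page.save("raw.tif")                        # uncompressed by default
page.save("g4.tif", compression="group4")   # CCITT Group 4 compression

raw, g4 = os.path.getsize("raw.tif"), os.path.getsize("g4.tif")
print(f"uncompressed: {raw:,} bytes")
print(f"group4:       {g4:,} bytes  ({raw / g4:.0f}x smaller)")
```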

Web-optimization: If a file is large, it's unlikely that a reader specifically wants to view the first page; more than likely, they want to get to some page in the middle of the document. Web-optimization is a feature that lets a document viewer display any page of an arbitrarily large document in roughly constant time (e.g., 1-2 seconds), by requesting from the server only the byte range where that page's data begins. This allows for efficient web browsing of a file. The PDF format has native support for web-optimization (also called linearization).
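Under the hood, a web-optimized (linearized) PDF relies on ordinary HTTP byte-range requests: the viewer asks the server for just the slice of the file it needs. A minimal sketch using Python's requests library; the URL and byte offsets are placeholders, and the server must honor the Range header:

```python
import requests

url = "https://example.com/large-document.pdf"  # placeholder URL

# Fetch only the byte range where (say) page 42's data begins, instead
# of downloading the entire file.
resp = requests.get(url, headers={"Range": "bytes=500000-565535"})

print(resp.status_code)   # 206 Partial Content when ranges are supported
print(len(resp.content))  # just the requested slice, not the whole file
```

A 206 Partial Content response confirms the server transferred only the requested slice rather than the whole file.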

Search: The larger a file is, the more likely you'll need text search capability. If a file is only one or two pages, you can perhaps find what you're looking for simply by paging through it. If a file is large, say over 30 pages, it becomes very difficult to find anything without text search. Although most electronic files are already searchable, some are not (e.g., pure vector graphics). For scanned files without OCR, finding what you want in the file is akin to finding a needle in a haystack. So make sure all your large web-hosted files are searchable; for scanned documents, this means running the files through an OCR process.
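For scanned pages, the OCR step can be as simple as the following sketch, which assumes the pytesseract wrapper around the Tesseract OCR engine; page.png and page.pdf are placeholder names:

```python
import pytesseract
from PIL import Image

image = Image.open("page.png")  # placeholder scanned page

# Tesseract re-renders the image as a PDF page and overlays invisible,
# searchable text at the recognized word positions.
pdf_bytes = pytesseract.image_to_pdf_or_hocr(image, extension="pdf")

with open("page.pdf", "wb") as f:
    f.write(pdf_bytes)
```

The resulting PDF looks identical to the original image but carries a hidden text layer, so viewer and server-side search both work.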

Chunking: Another problem with large files in a Web-based environment, even web-optimized ones, is that simply opening the file in your viewer (e.g., Adobe Reader) may tie up your computer's memory resources, especially when file sizes run into the hundreds of megabytes. Even in a web-optimized viewer, the file being viewed continues to stream in and consume available RAM as you browse. One solution to this problem is chunking, meaning that a very large file is divided into subfiles, none of which exceeds a maximum byte size. For example, if we select 50 MB as a reasonable chunk size, then a very large PDF file would be split so that no single PDF subfile exceeds 50 MB. The total memory consumed when viewing or searching any one chunk is then bounded.
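A minimal sketch of greedy chunking, assuming the pypdf library; the file names and the 50 MB cap are placeholders, and re-serializing the chunk on each page to measure its size trades speed for simplicity:

```python
import io
from pypdf import PdfReader, PdfWriter

MAX_BYTES = 50 * 1024 * 1024  # 50 MB cap per chunk (placeholder value)

def serialize(pages):
    """Write the given pages to an in-memory buffer and return the bytes."""
    writer = PdfWriter()
    for p in pages:
        writer.add_page(p)
    buf = io.BytesIO()
    writer.write(buf)
    return buf.getvalue()

reader = PdfReader("large.pdf")  # placeholder file name
current, chunk_no = [], 1

for page in reader.pages:
    # If adding this page would push the chunk over budget, flush first.
    if current and len(serialize(current + [page])) > MAX_BYTES:
        with open(f"large_part{chunk_no:03d}.pdf", "wb") as f:
            f.write(serialize(current))
        chunk_no += 1
        current = []
    current.append(page)

if current:  # flush the final, partially filled chunk
    with open(f"large_part{chunk_no:03d}.pdf", "wb") as f:
        f.write(serialize(current))
```

A production splitter would also account for resources shared across pages (fonts, images), which this page-by-page approach can duplicate into each chunk.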

Adobe PDF is recommended for web-hosting document databases. The PDF format has native support for three of the four features listed above as desirable when web-hosting files, namely, compression, web-optimization, and search (via a hidden text layer). As such, very little engineering is required on the IT side when implementing a web-hosted database that is already in PDF format.
