Compressing PDF Files
Compression reduces the size of a file in order to save storage space and transmission time, both of which depend largely on file size. A compression program does this by applying algorithms that re-encode the data in a more compact form.
When people talk about compressing PDFs, they usually mean image compression. In a sense, PDFs are already compressed: they are smaller than their PostScript equivalents. Applying the right algorithm can shrink a file even further, so the main goal of this article is to give a brief overview of the different types of image compression algorithms.
The following is a list of various image compression algorithms:
- LZW
- Flate
- JPEG & JPEG 2000
- CCITT
- JBIG2
- RLE
- ZIP
LZW also known as Lempel-Ziv-Welch
LZW compression is best described as a table lookup. It is most commonly used in the GIF and TIFF image formats, and it can also be applied to text files. As LZW reads the input, it adds an entry to a table for each new sequence it encounters, pairing the original pattern with a shorter code that can stand in for it. When a pattern that is already in the table appears again later in the document, the short code is written instead of the longer sequence, which reduces the size of the file.
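The table lookup described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names are made up for this example, and it leaves out details that real file formats add, such as reserved control codes and packing the codes into variable-width bit fields.

```python
def lzw_encode(data: bytes) -> list[int]:
    """Encode bytes as a list of integer codes using a basic LZW table."""
    # The table starts with one entry per possible single byte (codes 0-255).
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            # Known pattern: keep extending it.
            current = candidate
        else:
            # Emit the code for the longest known pattern and
            # record the new, longer pattern in the table.
            codes.append(table[current])
            table[candidate] = next_code
            next_code += 1
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_decode(codes: list[int]) -> bytes:
    """Rebuild the original bytes; the table is reconstructed on the fly."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # The code refers to the entry that is about to be created.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        output.append(entry)
        table[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(output)


codes = lzw_encode(b"ABABABAB")
print(codes)               # [65, 66, 256, 258, 66]
print(lzw_decode(codes))   # b'ABABABAB'
```

The eight input bytes become five codes, because the repeated "AB" and "ABA" patterns are replaced by table entries 256 and 258.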
The Flate method is based on a public-domain compression method and is similar to LZW. Both LZW and Flate accept either binary data or ASCII text as input, but their output is always binary data, even when the input is text. Flate finds and exploits patterns in the input data regardless of whether it is text or images. Its output is usually more compact than LZW's because it also applies Huffman coding, although its encoding speed is slower. Overall, Flate produces better results than LZW.
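For a quick feel of how Flate behaves, Python's standard zlib module can be used, since it implements the same deflate method. The sketch below is only an illustration; the helper name and the sample text are made up for this example.

```python
import zlib


def flate_round_trip(data: bytes, level: int = 9) -> tuple[bytes, bytes]:
    """Compress data with zlib (deflate) and decompress it again."""
    compressed = zlib.compress(data, level)
    restored = zlib.decompress(compressed)
    return compressed, restored


# Highly repetitive ASCII text: the kind of input with patterns to exploit.
text = b"the quick brown fox jumps over the lazy dog. " * 200
compressed, restored = flate_round_trip(text)

print(len(text), "bytes before,", len(compressed), "bytes after")
print("lossless:", restored == text)
```

The compressed result is binary data even though the input was plain ASCII text, and decompressing it gives back exactly the original bytes.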
JPEG is a compression algorithm designed for true-color images, such as photographs. It aims to shrink the file as far as possible with little visible change to the image. People notice small changes in brightness more readily than small changes in color, and JPEG exploits this aspect of human perception to reduce file size. JPEG compression is lossy, however: it discards image data, so image quality can suffer even though the algorithm tries to minimize the visible loss. As a result, JPEG can achieve considerably smaller file sizes than the other options, but at the expense of losing data.
Acrobat offers several JPEG quality levels, ranging from Maximum, which compresses the least but retains the most data, to Minimum, which compresses the most but loses the most data relative to the original. Most people look for a balance between the two: Maximum usually shrinks the file too little to make a noticeable difference, while Minimum can produce a distorted, blocky image that looks too different from the original.
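One common way JPEG encoders take advantage of the brightness-versus-color difference is to separate each pixel into a brightness value and two color values, then store the color values at a lower resolution. The sketch below illustrates only that step, under the assumption of 2x2 averaging and even image dimensions; the function names are made up for this example, and the rest of the JPEG pipeline is not shown.

```python
def rgb_to_ycbcr(r: float, g: float, b: float) -> tuple[float, float, float]:
    """Split an RGB pixel into brightness (Y) and two color channels (Cb, Cr)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def subsample_2x2(plane: list[list[float]]) -> list[list[float]]:
    """Average each 2x2 block of a color plane (width and height assumed even)."""
    height, width = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


def sample_counts(width: int, height: int) -> tuple[int, int]:
    """Stored samples with full-resolution color vs. 2x2-subsampled color."""
    full = 3 * width * height
    subsampled = width * height + 2 * (width // 2) * (height // 2)
    return full, subsampled


print(sample_counts(640, 480))  # (921600, 460800)
```

Brightness is kept at full resolution, yet the number of stored samples is halved before any further compression takes place. The averaged color values cannot be recovered exactly, which is one reason the method is lossy.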
JPEG 2000 is a relatively new compression algorithm compared to the others. It is supported in PDF version 1.5 and later. Although it is more efficient than JPEG, many companies still do not use it because of its high overhead and its compatibility problems with older systems.
CCITT compression is most appropriate for black-and-white images. It is similar to how fax machines compress pages, and it leaves the quality of the images unaffected. Acrobat provides Group 4 and Group 3 compression options, both of which work well for monochrome images; Group 3 is the scheme that fax machines use.
JBIG2 is a compression method that is considerably better than CCITT at handling monochrome images. It works in a way similar to LZW, building a table of unique short codes that stand for different sequences within the file. As it reads the file, the short codes replace any repeats of sequences that have already occurred. JBIG2 is also capable of compressing the entire table. For a page full of text, its compression ratio can range anywhere from 20:1 to 50:1.
RLE (run-length encoding) is a lossless algorithm, which means the quality of the images will be unaffected. It works best on images that contain large areas of solid black or white.
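The idea can be shown with a generic run-length sketch in Python. This is not the exact byte format that PDF files use; it simply stores each run as a (value, count) pair, and the function names are made up for this example.

```python
from itertools import groupby


def rle_encode(pixels: list[int]) -> list[tuple[int, int]]:
    """Collapse consecutive identical values into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(pixels)]


def rle_decode(runs: list[tuple[int, int]]) -> list[int]:
    """Expand (value, run length) pairs back into the original values."""
    return [value for value, count in runs for _ in range(count)]


# One row of a black-and-white image: 0 = white, 1 = black.
row = [0] * 12 + [1] * 3 + [0] * 9
runs = rle_encode(row)
print(runs)                     # [(0, 12), (1, 3), (0, 9)]
print(rle_decode(runs) == row)  # True
```

The 24 pixels shrink to three pairs, and decoding restores the row exactly. An image with few long runs, such as a photograph, would gain little or even grow.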
ZIP compression is familiar from PC applications such as WinZip. It works well on images that have repeating patterns, for example a large area of a single color. ZIP is also a lossless algorithm, meaning the images are not altered during the compression process. Acrobat offers 4-bit and 8-bit options, and the results are lossless as long as the option you choose has a bit depth equal to or higher than that of your image. In other words, the only way to lose data is to compress an 8-bit image with the 4-bit ZIP option.
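The following Python sketch shows why squeezing 8-bit samples into 4 bits loses data. It assumes, purely for illustration, that the reduction keeps only the top four bits of each sample; how Acrobat actually performs the reduction is not described here, and the function names are hypothetical.

```python
def to_4bit(sample: int) -> int:
    """Keep only the top four bits of an 8-bit sample (assumed reduction)."""
    return sample >> 4


def from_4bit(level: int) -> int:
    """Spread a 4-bit level (0-15) back over the 8-bit range (0-255)."""
    return level * 17


original = 200
restored = from_4bit(to_4bit(original))
print(original, "->", restored)  # 200 -> 204

# All 256 possible 8-bit values collapse onto just 16 levels.
print(len({to_4bit(v) for v in range(256)}))  # 16
```

Only 16 of the 256 possible values survive the round trip unchanged, so an image that uses the full 8-bit range cannot be restored exactly from 4 bits.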