The process of removing redundant information from data so that it takes up less space is called data compression. Besides saving disk space, compressing data such as e-mail attachments can make data communications faster.
Compression methods generally begin with the realization that not all characters occur with equal frequency in text. For example, in English, letters such as e and s are found much more frequently than letters such as j or x. By assigning the shortest bit codes to the most common characters and longer codes to the least common characters, the number of bits needed to encode the text can be minimized.
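As a rough illustration, the following Python sketch compares a fixed 8-bit-per-character encoding with a made-up variable-length code (the frequencies and bit codes shown are hypothetical, not those of any real standard):

```python
# Hypothetical letter counts from a sample text and illustrative
# variable-length bit codes (shorter codes for the more common letters).
freq = {"e": 127, "s": 63, "j": 2, "x": 1}
code = {"e": "01", "s": "100", "j": "111110", "x": "111111"}

fixed_bits = sum(freq.values()) * 8                          # 8 bits per character
variable_bits = sum(freq[c] * len(code[c]) for c in freq)    # length-weighted codes

print(fixed_bits, variable_bits)   # the variable-length code needs far fewer bits
```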
Huffman coding, first developed in 1952, is an algorithm that builds a tree by repeatedly linking the two least probable (that is, least common) characters or subtrees, then the next least probable pair, and so on until the tree is complete.
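A minimal Huffman construction can be sketched in Python as follows (the sample text and tie-breaking details are illustrative; practical implementations also keep the tree itself so the codes can be decoded):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return a {character: bit-string} code built by repeatedly merging
    the two least frequent subtrees, as described above."""
    counts = Counter(text)
    # Each heap entry: [frequency, tie-breaker, list of (char, code-so-far)].
    heap = [[freq, i, [(ch, "")]] for i, (ch, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                        # degenerate single-symbol case
        return {heap[0][2][0][0]: "0"}
    tie = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)              # least probable subtree
        hi = heapq.heappop(heap)              # next least probable subtree
        # Prefix "0" to codes in one subtree and "1" to codes in the other.
        merged = [(ch, "0" + c) for ch, c in lo[2]] + \
                 [(ch, "1" + c) for ch, c in hi[2]]
        heapq.heappush(heap, [lo[0] + hi[0], tie, merged])
        tie += 1
    return dict(heap[0][2])

codes = huffman_codes("this is an example of huffman coding")
# Common characters (such as the space) get short codes, rare ones longer codes.
print(sorted(codes.items(), key=lambda kv: len(kv[1])))
```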
Another coding method, arithmetic coding, matches characters’ probabilities to bits in such a way that the same bit can represent parts of more than one encoded character. This is even more efficient than Huffman coding, but the necessary calculations make the method somewhat slower to use.
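The interval-narrowing idea behind arithmetic coding can be sketched with a toy encoder that uses floating-point arithmetic (production coders use scaled integer arithmetic to avoid precision loss; the symbol probabilities below are assumed):

```python
def arithmetic_encode(text, probs):
    """Toy arithmetic encoder: narrow the interval [low, high) for each
    symbol; any number in the final interval identifies the whole text."""
    # Cumulative probability ranges, e.g. a -> [0, 0.5), b -> [0.5, 0.8), ...
    cum, ranges = 0.0, {}
    for ch, p in probs.items():
        ranges[ch] = (cum, cum + p)
        cum += p

    low, high = 0.0, 1.0
    for ch in text:
        span = high - low
        lo_frac, hi_frac = ranges[ch]
        # Shrink the interval to the sub-range belonging to this symbol.
        high = low + span * hi_frac
        low = low + span * lo_frac
    return (low + high) / 2

probs = {"a": 0.5, "b": 0.3, "c": 0.2}      # assumed symbol probabilities
print(arithmetic_encode("abac", probs))      # a single number encodes the string
```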
Another approach to compression is to look for words (or more generally, character strings) that match those found in a dictionary file. The matching strings are replaced by numbers. Since a number is much shorter than a whole word or phrase, this compression method can greatly reduce the size of most text files. (It would not be suitable for files that contain numerical rather than text data, since such data, when interpreted as characters, would look like a random jumble.)
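A toy version of this dictionary substitution might look like the following (the dictionary contents are invented for the example):

```python
# Hypothetical external dictionary: each known word is replaced by its index.
dictionary = ["the", "compression", "data", "of", "method"]
index = {word: i for i, word in enumerate(dictionary)}

def compress(text):
    """Replace dictionary words with their index; leave other words as-is."""
    return [index.get(word, word) for word in text.split()]

def decompress(tokens):
    return " ".join(dictionary[t] if isinstance(t, int) else t for t in tokens)

tokens = compress("the compression of data")
print(tokens)                  # [0, 1, 3, 2] -- numbers are shorter than the words
print(decompress(tokens))      # "the compression of data"
```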
The Lempel-Ziv (LZ) compression method does not use an external dictionary. Instead, it scans the file itself for text strings. Whenever it finds a string that occurred earlier in the text, it replaces the later occurrence with an offset, or count of the number of bytes separating the occurrences. This means that not only common words but common prefixes and suffixes can be replaced by numbers. A variant of this scheme does not use offsets into the file itself, but compiles repeated strings into a dictionary and replaces them in the text with an index to their position in the dictionary.
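A highly simplified LZ-style matcher might be sketched as follows (the window size, minimum match length, and output format are arbitrary choices made for illustration):

```python
def lz_compress(data, window=255):
    """Emit either a literal character or an (offset, length) pair that
    points back at an earlier occurrence of the same string."""
    out, i = [], 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Search the recent window for the longest string that repeats here.
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= 3:                      # only encode worthwhile matches
            out.append((best_off, best_len))
            i += best_len
        else:
            out.append(data[i])
            i += 1
    return out

print(lz_compress("banana bandana banana"))
```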
Graphics files can often be greatly compressed by replacing large areas that represent the same color (such as a blue sky) with a number indicating the count of pixels with that value. However, some graphics file formats such as GIF are already compressed, so further compression will not shrink them much.
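This run-length approach can be sketched as follows (the pixel values are symbolic stand-ins for color codes):

```python
def run_length_encode(pixels):
    """Replace each run of identical pixel values with a [count, value] pair."""
    runs = []
    for value in pixels:
        if runs and runs[-1][1] == value:
            runs[-1][0] += 1             # extend the current run
        else:
            runs.append([1, value])      # start a new run
    return runs

# A scan line that is mostly "sky blue" compresses to just a few pairs.
scan_line = ["blue"] * 200 + ["white"] * 5 + ["blue"] * 50
print(run_length_encode(scan_line))      # [[200, 'blue'], [5, 'white'], [50, 'blue']]
```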
More exotic compression schemes for graphics can use fractals or other iterative mathematical functions to encode patterns in the data. Most such schemes are “lossy” in that some of the information (and thus image texture) is lost, but the loss may be acceptable for a given application. Lossy compression schemes are not used for binary (numeric data or program code) files because errors introduced in a program file are likely to affect the program’s performance (if not “break” it completely). Though they may have less serious consequences, errors in text are also generally considered unacceptable.
Trends
There are a variety of compression programs used on Unix systems, but variants of the Zip program are now the overwhelming favorite on Windows-based systems. Zip combines compression and archiving. Archiving, or the bundling together of many files into a single file, contributes a further reduction in file size. This is because files in most file systems must use a whole number of disk sectors, even if that means wasting most of a sector. Combining files into one file means that at most a bit less than one sector will be wasted.
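For example, Python's standard zipfile module both compresses files and bundles them into a single archive (the file names used here are hypothetical):

```python
import zipfile

# Bundle several files into one compressed Zip archive.
with zipfile.ZipFile("reports.zip", "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name in ["january.txt", "february.txt", "march.txt"]:
        archive.write(name)
```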