The magazine of the Melbourne PC User Group

Data Compression
Major Keary
majkeary@netspace.com.au

Data compression is an essential aspect of connectivity; modems have inbuilt compression facilities, it is used in digital telephony, and without compression many multimedia and graphics files - especially video - would be too big for practical use.

There are two distinct forms of compression, lossless and lossy. The first is necessary for data that has to be restored to its exact original form (for example, text, spreadsheet, and database); lossy compression is used for data that can shed some of its detail without materially affecting the restored file (for example, images, video, and audio).

Can Compression Get Better?

Lossy compression techniques have undergone remarkable improvements. New algorithms enable surprising degrees of effective guessing, which means that when a compressed file is being restored the software can replace discarded data with something very close to the original. Early lossy systems simply discarded some data before compressing a file; the discarded data was lost forever. Now there are systems that do a very good job of replicating the original with the help of some complex mathematics.

Lossless compression is another matter; serious discussion of possible significant improvements has been left to the habitues of specialist Usenet groups. Every now and then a new concept is put forward, such as reported in The Age of 2 August 1994 following a visit to Australia by Phil Katz. The Age report said: "Now that we have the computing power he (Phil) is excited about the prospect of entering a higher level of data compression, using the second order of Markov models, which will shrink a 1MB text file to about 150KB, compared to between 300 to 40OKB in the latest version of PKZIP"

That exciting prospect made me sufficiently curious as to e-mail Phil to ask which Markov he was talking about, and was such a quantum leap in lossless compression in fact imminent?

A response from PKWare indicated that "Phil was not sure about which Markov",but someone would get back tome. There the correspondence rested until, some time later, I raised the subject again seeking news of the spectacular Markov comet that promised to light up our compression skies.

Markov Chains

The term, Markov models, suggested the work of Andreii Andrecvich Markov 1856-1922 (his given names are sometimes spelled, Andrey Andreyevich) a Russian mathematician known for his contribution to the development of the theory of stochastic processes, some of which-Markov chains-carry his name. Stochastic process is usually associated with probability, and probability is important to information theory from which data compression derives.

Markov chains were applied to early work in speech recognition, and speech and music synthesis, a connection that led me to think Phil may have been talking about lossy audio compression rather than data compression. My assumption, however, was wrong; Steve Burg, Vice President, Engineering, of PKWare Inc. responded to my follow-up enquiry:

"The Markov models that Phil was referring [to are] based on the work of Andreii Markov in speech recognition. Most popular compression algorithms at the present time use no information from the current context to improve compression efficiency. Markov modelling allows improvements by providing predicted future content. As a simplified example, a person can often predict the end of a sentence based on the start of the sentence and previous context."

It seems there had been some progress, but there was a problem with achieving the goal of 100 per cent accuracy, as well as writing an application sufficiently robust and stable for commercial release. Lossless compression has to restore a file to its exact original form: 99.9 percent is not good enough. It is not surprising that the project has not been further mentioned.

Is It a Comet, Or Is It a Scam?

A number of compression scams have attracted the attention of the comp.compression newsgroup and it is worth reading the FAQ for accounts of what jean-Loup Gailly calls, "The WEB 16:1 compressor, OWS and other illusions". OWS (the initial letters of the program's author) does not compress data, but "remembers the clusters which contained the data ... which ... can be recovered only if those clusters are not used again for another file".

Another, WEB 16:1, had its claims published in Byte (June 1992 Vol 17 No 6): " ... the compression algorithm used... is not subject to the laws of information theory . ... ". As the laws of information theory have not yet been repealed, the program simply did not work and has not been heard of again.

On a more realistic note, the The Age of 9 May 2000 published a comparison between WiriZIP/PKZIP and RAR. The author of RAR does not disclose the algorithm(s) employed, but clearly uses several, among which an LZ77 variant and Huffman are sure to be present. The program offers an option, solid, that adopts the same technique used by hard disk compression utilities such as Drive Space.

Depending on the operating system and file management arrangements files do not generally stop at the point where the user closes them. Under DOS (which is the underlying operating system for Win95 and Win98) a group of files can occupy space much larger than the sum of the data they contain. Lots of small files can chew up big chunks of hard disk space. There are ways of minimising such loss, but that is another subject.

In order to overcome the problem RAR offers a solid compression option. It concatenates the files selected for archiving, thus eliminating the accumulated trailing space, and requires just one header for the archive. Savings on unused space will not make much difference to the size of an archive, regardless of the utility employed.

RAR seems to make its saving in two ways. First, it eliminates multiple headers; if each file added to an archive is kept as a discrete, stand-alone, item then each one requires its own header. Second, if files are of alike kind they are likely to contain repeated patterns of data that can be replaced by tokens. For example, if a group of MS Word files are each saved with font information, and they all use the same fonts, then large blocks of data will be repeated. If the files are concatenated that represents a significant opportunity for a dramatic compression ratio.

There is quite a body of literature on data compression, but much of it is in professional journals. Two books in particular are well worth looking at if you are interested in how the systems work.

Mark Nelson and Jean-Loup Gailly: The Data Compression Book 2/e (ISBN 1-55851-434-1, RRP $69.95) is written for programmers. However, while the extensive illustrative code is written in C, one does not have to be a programmer to follow the excellent descriptions of how various systems work. Published in 1966, the book stops short of recent developments. Even so, for developers or experimenters who want to write compression routines into a program this is an essential text. It covers both lossless and lossy techniques. A book that ordinary readers can tackle with confidence.

Khalid Sayood: Introduction to Data Compression (ISBN 1-55860-346-8, RRP $130) was written to provide a single text for an introductory course on data compression. It covers lossless and lossy systems and provides photographic images to illustrate the original and end product from various lossy systems.

This is a professional title and assumes some familiarity with linear algebra. Anyone seeking detailed mathematical descriptions and explanations of compression algorithms will find this a valuable resource. It includes a brief introduction to information theory and describes various models (physical, probability Markov, and composite source). There is also an interesting discussion of the human visual systems and auditory perception.

Chapters deal with Huffman coding, arithmetic coding, dictionary techniques (LZ77 and LZ78), lossless image compression, scalar quantization, vector quantization, differential encoding, sub-band coding, transform coding, speech compression, and video compression.

Teachers and students of computer science courses should Introduction to Data Compression well worth adding to their reading lists. If your linear algebra is a bit rusty, the author provides a good refresher tutorial. A copy is in the library for those would like to visit the engine room of data compression. 

Passing of Two Notable Compressionists

This year has seen the sad passing of two people whose names have become part of the vocabulary of data compression: Phil Katz and David Albert Huffman. Phil Katz was an implementer and David Huffman a deviser of compression systems. Without the algorithms invented by people like David Huffman there would be nothing for people like Phil Katz to implement, which is not be taken as lessening the importance of their contribution.

There is another side to the coin. When Abraham Limpet and Jacob Ziv published their two famous algorithms, LZ77 and LZ78, they had nothing to say about how their work could be implemented; it is one thing to devise and describe an algorithm, but quite another matter to put it to work.

It was sad to hear of their passing; Phil was in his prime and David was certainly too young to have his life cut short.

Reprinted from the June 2000 issue of PC Update, the magazine of Melbourne PC User Group, Australia