Comm Corner
Data Compression: An Overview
by John Woody


We have all used some form of data compression as we have progressed in the use of our computers, from the time we loaded our first application or logged onto a BBS with a modem. FAX machines all use data compression. Data compression may be a new term, but we are all familiar with zipping, compressing, packing, squeezing, squashing, or crunching data files.

Archiving is another term we see in the literature, used as either a noun or a verb. As a verb, archiving can take the place of any of the terms mentioned previously; as a noun, an archive is one compressed file or a collection of compressed files. Data compression shows itself in nearly every data movement to and from our computers. Additionally, most of us have discovered that we did not have enough hard disk space for the neat programs we bought or downloaded. Data compression programs such as Stacker or Microsoft DoubleSpace have been developed to fit more file data on smaller hard disks. We are also familiar with PKWARE's PKZIP utilities for compressing individual files or whole directories.

Data compression has been used in most analog and digital modem protocols from the 2400 bps generation onward. The reason for modem compression is that a streamlined file means fewer bits to send, so the transfer finishes sooner and is less exposed to line errors. The specifications for most high speed analog modems contain the V.42bis and MNP5 data compression protocols along with the V.34 speed standard. Data compression is also used in the ISDN digital protocols.

Data compression first came into use as application programs grew to fill one or more disks. It is an efficient method of loading large applications onto fewer disks, keeping the size of the marketed program package reasonable. Microsoft Office 4.3, for example, ships on 20 1.44MB disks (about 28.8MB) even with compression. If data compression were not used, considerably more floppy disk space would be necessary to store the original program files.

Data compression helps to reduce the amount of storage space data requires on hard disks as well. Data compression programs can decrease file storage by 50 to 90 percent. This is especially useful where program files are archived for later use, such as on a BBS hard drive. The compressed file is also much quicker to download or transfer than the uncompressed version. Most of us have encountered a FILENAME.ZIP upon downloading a file from a BBS and have had to install the PK204G.ZIP compression/decompression utility in order to see what is in the file.
 
 

Data Compression Theory

Data compression theory is based on one primary fact: most data is inherently redundant or repetitious. Compression encodes this repetition more compactly, reducing the size of the data file for transmission or storage. The type of compression depends on the type of data being compressed. Graphics or other largely uniform data may use a type of compression which simply identifies which piece of data is repeated and records it once along with the number of repetitions. Text data, which contains mixed bit codes, requires algorithms which compare all of the American Standard Code for Information Interchange (ASCII) characters in the data to determine if there is repetition. The 256 character codes of the extended (8-bit) ASCII set include all 26 letters of the alphabet, ten numerals, punctuation marks, foreign characters, graphic symbols, and invisible printer control characters.
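
The repeated-data idea described for graphics is commonly called run-length encoding. What follows is only a minimal sketch in Python, not the code of any actual packer; the function names rle_encode and rle_decode and the sample row of pixel bytes are made up for illustration.

```python
from itertools import groupby


def rle_encode(data: bytes) -> list[tuple[int, int]]:
    """Collapse each run of identical bytes into a (count, byte value) pair."""
    return [(len(list(run)), value) for value, run in groupby(data)]


def rle_decode(pairs: list[tuple[int, int]]) -> bytes:
    """Expand (count, byte value) pairs back into the original bytes."""
    return b"".join(bytes([value]) * count for count, value in pairs)


# Hypothetical row of a simple black-and-white image: 20 bytes, 3 runs.
row = b"\x00" * 12 + b"\xff" * 3 + b"\x00" * 5
packed = rle_encode(row)
print(packed)  # [(12, 0), (3, 255), (5, 0)]
assert rle_decode(packed) == row
```

Twenty bytes shrink to three pairs here because the data is one long run after another. On mixed text, where runs are rare, the same scheme would make the data larger, which is why text calls for a different method.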

One method of consolidating redundant code is to assign short code sequences to frequently used characters and longer codes to characters rarely used. This is known as the Huffman algorithm, which was developed in the 1950s. The algorithm performs two functions. The first is to determine the number of times (frequency) each character appears in a file. The second is to create an encoding scheme based on the frequency of each character, so that the most frequently used characters get short codes and the least frequently used characters get long codes. This is known as squeezing the data. Huffman coding, usually combined with other techniques, is used in some form by most of the popular data compression programs such as PKZIP, ARJ, LHA, and ZOO. Text files are the easiest to compress: these programs are able to compress text files to about 27 percent of their original size. In general, binary files do not compress as tightly.
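
The two functions of the Huffman algorithm can be sketched in a few lines of Python. This is an illustration under simple assumptions (one code per character, the code table kept in memory), not the way any particular packer implements it; the name huffman_codes and the sample sentence are ours.

```python
import heapq
from collections import Counter


def huffman_codes(text: str) -> dict[str, str]:
    """Return a prefix code (character -> bit string) built from frequencies."""
    freq = Counter(text)  # function 1: count how often each character appears
    if not freq:
        return {}
    if len(freq) == 1:  # a single distinct character still needs one bit
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie breaker, {character: code so far}).
    heap = [(weight, order, {char: ""})
            for order, (char, weight) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_order = len(heap)
    while len(heap) > 1:  # function 2: repeatedly merge the two rarest groups
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {char: "0" + code for char, code in left.items()}
        merged.update({char: "1" + code for char, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next_order, merged))
        next_order += 1
    return heap[0][2]


text = "this is an example of squeezing a text file"
codes = huffman_codes(text)
encoded = "".join(codes[char] for char in text)
print(len(text) * 8, "bits before,", len(encoded), "bits after")
```

Characters merged late in the loop sit near the top of the tree and receive the shortest codes, which is exactly the frequent-gets-short rule. A real packer must also store the code table in the archive so the file can be unsqueezed.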

DOS programs compress to about one half of their original size. Windows programs reduce to less than half of their original size. Data compression can go too far, however. Repeatedly compressing a file is not a good idea: the first pass removes most of the redundancy, so a second pass has little left to work with and usually gains nothing or even makes the file slightly larger. In nearly every test case, PKZIP is by far the fastest compression program.
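
The effect of compressing twice is easy to try. The sketch below uses Python's zlib module as a stand-in for a packer and a made-up sample of word-like text generated from a fixed seed; the exact byte counts will depend on the zlib build, so none are quoted here.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the sample text is repeatable
words = ["data", "file", "modem", "archive", "compress", "disk", "program", "header"]
text = " ".join(rng.choice(words) for _ in range(5000)).encode("ascii")

once = zlib.compress(text)   # first pass removes most of the redundancy
twice = zlib.compress(once)  # second pass has almost nothing left to remove

print("original:", len(text), "once:", len(once), "twice:", len(twice))

# Both passes are lossless: undoing them in reverse order restores the text.
assert zlib.decompress(zlib.decompress(twice)) == text
```

Running it should show a large drop on the first pass and essentially no further gain on the second, at the cost of having to decompress twice to get the file back.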
 
 

Elements of Data Compression

All data compression programs are driven by a command line built from four elements, which must be entered in the form the program expects (its syntax). Some of the syntax entries are optional and some must be entered for the program to work. Most of us do not fully use these programs because we do not fully understand that syntax. An example from the PK204G.ZIP command set:
 

PKZIP [-b[path]] [options] Zipfile [@list] [files...]

As we can see, for the most part we do not use every option when we use PKZIP. The general components are the Program (PKZIP) followed by the four elements: Command (-a), Switch (-p) (optional), Archive name (FILENAME.ZIP), and Files (.doc) (optional). We mainly get around the complicated command components by using the basic structure and staying with it each time. Each compression program is a little different in its command elements. Finally, each program is organized differently; for example, some programs such as PKZIP separate the compressing module from the decompressing module.

The Command element is required in every program. It specifies the type of task the program is to undertake, e.g., PKZIP -a FILENAME.ZIP FILENAME.TXT. The -a is the command in this example. What -a does is defined by the PKZIP program, and its meaning must be looked up within that program. Help is available in nearly every mainline data compression program.

 Switches control the execution of the command and specify deviations from the standard procedure.

Archive Names must appear in the command and are usually identified by the filename extension, e.g., .ARC or .ZIP. The compression program usually assigns this extension automatically.

Files to be compressed are usually listed last. Most of the data compression programs allow more than one file to be compressed at one time.
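
To make the elements concrete, here is a small Python sketch that splits a command line into Program, Command, Switches, Archive name, and Files. It is not how PKZIP itself reads its arguments. The rule used (first dashed token is the command, later dashed tokens are switches, first plain token is the archive) is our simplifying assumption, and the names PackerCommand and split_command_line are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PackerCommand:
    program: str
    command: str
    archive: str
    switches: list[str] = field(default_factory=list)
    files: list[str] = field(default_factory=list)


def split_command_line(line: str) -> PackerCommand:
    """Split a packer command line into the elements described above.

    Assumption for this sketch only: the first token starting with '-' is
    the command, later '-' tokens are switches, the first remaining token
    is the archive name, and anything after it is a file specification.
    """
    program, *rest = line.split()
    dashed = [token for token in rest if token.startswith("-")]
    plain = [token for token in rest if not token.startswith("-")]
    if not dashed or not plain:
        raise ValueError("a command and an archive name are required")
    return PackerCommand(program, dashed[0], plain[0], dashed[1:], plain[1:])


parsed = split_command_line("PKZIP -a -p FILENAME.ZIP *.DOC")
print(parsed)
```

The two optional elements, switches and files, simply come back as empty lists when they are left off the command line.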
 
 

Archives

File extensions like .ZIP, .LHA, .ARC, and .ARJ indicate archive files. Archives are the computer version of physical records storage. Think of them as a place where important documents are stored. Files are placed in this separate location when they are compressed, and they cannot be processed until they are retrieved with a data decompressing program.

Data compression programs all place header information in the compressed file for location purposes. The archive holds local data for each archived file plus information about the archive as a whole. The local data belongs to an individual archived file and is divided into a local header and the compressed data. The local header also contains the information needed to decompress that file. This makes recovering from errors easier, in that the entire archive is not usually lost if an individual archived file is corrupted. Damage is usually limited to the archived file containing the error. Local headers also make it easy to make changes to archived files. Archives are updated by comparing local header information. Changing file names is also easy for archived files and is done by making minor changes to the header.

Every archive also contains a global header. Statistical data that is not specific to individual files is maintained here in the archive header, and it is very important to the overall structure of the archive. The archive header contains information such as the central header signature, packer version, required version for unpacking, general information, date of last change, time of last change, 32 bit CRC code, compressed size, and normal size. This header information is required to decompress the data back into its original form.
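
Several of these header fields can be inspected with Python's standard zipfile module, which reads the ZIP format. The sketch below builds a tiny archive in memory and prints the fields for one member; the file name README.TXT and its contents are made up for the example.

```python
import io
import zipfile
import zlib

payload = b"Data compression overview. " * 40

# Build a small ZIP archive in memory with one compressed member.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("README.TXT", payload)

# Re-open the archive and read the header fields recorded for that member.
with zipfile.ZipFile(buffer) as archive:
    info = archive.getinfo("README.TXT")
    print("name            :", info.filename)
    print("packer version  :", info.create_version)
    print("version needed  :", info.extract_version)
    print("last changed    :", info.date_time)
    print("32 bit CRC      :", format(info.CRC, "08x"))
    print("compressed size :", info.compress_size)
    print("normal size     :", info.file_size)

# The stored CRC is computed over the uncompressed data.
assert info.CRC == zlib.crc32(payload)
```

The CRC is what lets an unpacker detect a damaged member: after decompressing, it recomputes the value and compares it with the one in the header.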
 
 

Self Extracting Archives

Self extracting archives are compressed files which contain an additional program that decompresses the archive upon execution. The archive unpacks itself when it is run. This is very helpful when a new program is being installed, or if you have no experience with compression programs. The application program is extracted from its compressed state by itself. The only difference between a self extracting archive and a regular archive is the program portion which automatically unpacks it.

Decompressing Data

Compressing and decompressing data is accomplished by different data compression programs in different ways. Some are integrated and contain both functions, with only a single command difference to compress or decompress. LHARC is an example of an integrated program in that the same packer program does both, with only a command letter to differentiate them, e.g., LHA a FILENAME compresses the file and LHA e FILENAME decompresses it. The a and e commands are the only difference. Other programs such as PKZIP/PKUNZIP use separate program utilities for the two functions. Both types of compression programs contain optional commands to manage the archived files. Programs with separate utilities usually include the management commands only in the compression utility; the decompression utility is needed only to decompress the archive.

In an archive with multiple files, the decompression program or utility decompresses every file in the archive unless control commands are included in the decompression command line. These control commands can list, delete, repair, or convert data files, among other functions.

The decompression programs or utilities also provide controls for placing the decompressed files in sub-directories other than the current sub-directory if required. Target sub-directories are indicated by path arguments. Path information is usually placed after the packer command and archive name, e.g., ARJ e ARC C:\EXAMPLE\ *.TXT. In this case, the target directory is C:\EXAMPLE\, where all of the .TXT files will be placed. Additionally, new sub-directories may be created from the decompression program or utility. Remember that each program has different commands for all of these functions.
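
The same idea, extracting only the .TXT files into a chosen target directory, can be sketched with Python's zipfile module. This mirrors the ARJ example in spirit only; the helper name extract_matching, the member names, and the EXAMPLE directory inside a temporary folder are all made up for the illustration.

```python
import fnmatch
import io
import tempfile
import zipfile
from pathlib import Path

# Build a small archive in memory so the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("NOTES.TXT", "first text file")
    archive.writestr("TODO.TXT", "second text file")
    archive.writestr("PROGRAM.EXE", b"MZ not a real program")


def extract_matching(archive_file, pattern: str, target_dir: Path) -> list[str]:
    """Extract only the members whose names match pattern into target_dir."""
    with zipfile.ZipFile(archive_file) as archive:
        wanted = [name for name in archive.namelist()
                  if fnmatch.fnmatchcase(name, pattern)]
        archive.extractall(path=target_dir, members=wanted)
    return wanted


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "EXAMPLE"  # does not exist yet; created on extraction
    extracted = extract_matching(buffer, "*.TXT", target)
    on_disk = sorted(path.name for path in target.iterdir())
    print(extracted)  # ['NOTES.TXT', 'TODO.TXT']
    print(on_disk)    # ['NOTES.TXT', 'TODO.TXT']
```

Note that the target sub-directory is created as part of the extraction, and PROGRAM.EXE stays in the archive because it does not match the pattern.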

Existing files are protected automatically as the decompression program or utility unpacks the archive. Files in the target directory are not overwritten until some action is taken. The decompression utility responds by skipping the file, or by indicating that a file of the same name exists and giving you the chance to overwrite it. Some programs compare the ages of files, overwriting older files. Some programs also let you specify another file name if necessary. All of the programs have safety prompts built in.
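
The simplest of these protections, skipping any file that already exists, might look like the following in Python. It is a sketch of the policy, not the behavior of any particular unpacker; extract_without_overwrite and the file names are hypothetical.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def extract_without_overwrite(archive_file, target_dir: Path) -> dict[str, str]:
    """Extract every member, skipping files that already exist in target_dir."""
    results = {}
    with zipfile.ZipFile(archive_file) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            destination = target_dir / info.filename
            if destination.exists():
                results[info.filename] = "skipped (already exists)"
            else:
                archive.extract(info, path=target_dir)
                results[info.filename] = "extracted"
    return results


# Hypothetical archive with one file that clashes with a file on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("REPORT.TXT", "version from the archive")
    archive.writestr("NEW.TXT", "only in the archive")

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp)
    (target / "REPORT.TXT").write_text("version already on disk")
    outcome = extract_without_overwrite(buffer, target)
    kept = (target / "REPORT.TXT").read_text()
    print(outcome)
    print(kept)  # version already on disk
```

A prompt, an age comparison, or a rename could be substituted at the point where the sketch records the file as skipped.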

 Some of the data compression programs decompress entire directories. Individual files can be decompressed with or without directory or path information. ARJ automatically incorporates the directory and path information into the archive. LHA, ICE, and PKZIP each require special commands to include the path names during decompression. Directory structures stored in archives require that we understand relative and absolute path information. Relative path positions are the same as giving directions from where you stand at the time, and absolute path positions are referenced from a fixed point such as the water tower. The root directory is the equivalent of the water tower. One can always get back to the root directory.
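
The water tower comparison can be shown with Python's pathlib module, which can reason about DOS-style paths on any machine. The directory and file names below are made up for the example.

```python
from pathlib import PureWindowsPath

from_root = PureWindowsPath(r"C:\EXAMPLE\DOCS\LETTER.TXT")  # absolute: starts at the root
from_here = PureWindowsPath(r"DOCS\LETTER.TXT")             # relative: depends on where we stand

print(from_root.is_absolute())  # True
print(from_here.is_absolute())  # False

# A relative path only identifies a file once a starting directory is known.
current_dir = PureWindowsPath(r"C:\EXAMPLE")
print(current_dir / from_here)             # C:\EXAMPLE\DOCS\LETTER.TXT
print(from_root.relative_to(current_dir))  # DOCS\LETTER.TXT
```

An archive that stores relative paths can be unpacked under any target directory, while one that stores absolute paths always rebuilds the tree from the root.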
 
 

Compressing Data

Compressing individual files requires only that the data compression packer program is executed, e.g., PKZIP FILENAME. All of the data compression programs work in the same general way. Each has a list of commands and switches which make it function properly. These commands and switches may be listed by typing the packer program name and pressing Enter, e.g., PKZIP Enter. Files may be added, moved, updated, or freshened using the command switches in all of the programs. Some of the data compression programs can compress whole directories. PKZIP, ARJ, and LHA have the capability to store directory structures.
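
Compressing a whole directory while keeping its structure can be sketched with Python's zipfile module as follows. This is an illustration, not the PKZIP, ARJ, or LHA implementation; zip_directory and the PROJECT directory tree are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_directory(source_dir: Path, archive_path: Path) -> list[str]:
    """Compress every file under source_dir, storing paths relative to it."""
    stored = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                name = path.relative_to(source_dir).as_posix()
                archive.write(path, arcname=name)
                stored.append(name)
    return stored


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "PROJECT"
    (source / "DOCS").mkdir(parents=True)
    (source / "README.TXT").write_text("top-level file")
    (source / "DOCS" / "MANUAL.TXT").write_text("file in a sub-directory")
    stored_names = zip_directory(source, Path(tmp) / "PROJECT.ZIP")
    print(stored_names)  # ['DOCS/MANUAL.TXT', 'README.TXT']
```

Storing each name relative to the source directory is a design choice: it lets the archive be unpacked anywhere and still recreate the DOCS sub-directory.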
 
 

Conclusion

This article is intended to provide a summary of the theory of data compression. This subject is central to being able to really use data from outside one's own computer. The self extracting archive programs are used nearly every time we load new program software. We need some form of decompressing utility to use files downloaded from a BBS. Downloads from the Internet are nearly always compressed and require decompressing to use.

It is just good computer practice to obtain and learn at least one data compression program. The PKWARE PK204G.ZIP utilities are widely used. This utility can be downloaded as Shareware from most BBSs, as can most of the other programs. There are Windows 3.11 and Windows 95 versions of most of them. We all need to learn at least one of them well.
 
 

John Woody is a telecommunications consultant specializing in small business communications networks and Internet business training.