Comm Corner
An Overview by John Woody
Archiving is another term we see in the literature, used as either a noun or a verb. It can stand in for any of the other terms mentioned previously, or it can mean a collection of compressed files or a single compressed file. Data compression shows up in nearly every data movement to and from our computers. Additionally, most of us have discovered that we did not have enough hard disk space for the neat programs we bought or downloaded. Data compression programs such as Stacker and Microsoft DoubleSpace were developed to fit more file data on smaller hard disks. We are also familiar with PKWARE's PKZIP utilities for compressing individual files or whole directories.
Data compression has been used in most analog and digital modem protocols from the 2400 bps protocols to date. The reason for modem compression is that a streamlined file means less data to send, which makes an error-free transfer much easier to handle. The specifications for most high-speed analog modems contain the V.42bis/MNP5 data compression protocols along with the V.34 speed standard. Data compression is also used in the ISDN digital protocols.
Data compression first came into use as application programs grew to one or more disks. It is an efficient method of loading large applications onto fewer disks, keeping the size of the marketed program package reasonable. Microsoft Office 4.3, for example, ships on 20 1.44MB disks as it is. If data compression were not used, 613KB of floppy disk space would be necessary to store the original program files.
Data compression also helps to reduce the amount of storage space data
requires on hard disks. Data compression programs can decrease file
storage by 50 to 90 percent. This is especially useful where program files
are archived for later use, such as on a BBS hard drive. A compressed file
is also much easier to download or transfer than the decompressed version.
Most of us have encountered a FILENAME.ZIP upon downloading a file from
the BBS and have had to install the PK204G.ZIP compression/decompression
utility in order to see what is in the file.
One method of consolidating redundant code is to assign short code sequences to frequently used characters and longer codes to rarely used characters. This is known as the Huffman algorithm, which was developed in the 1950s. The algorithm performs two functions. The first is to determine the number of times (frequency) each character appears in a file. The second is to create an encoding scheme based on the frequency of each character. This is known as squeezing the data. The most frequently used characters get short codes, and the least frequently used characters get long codes. Huffman coding is one of the methods used by most of the popular data compression programs such as PKZIP, ARJ, LHA, and ZOO. Text files are the easiest to compress; these programs are able to compress text files to about 27 percent of their original size. In general, binary files do not compress as tightly.
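The two functions described above (count the characters, then build codes from the counts) can be sketched in a few lines of Python. This is only an illustration of the Huffman idea, not the code used by PKZIP or any other program named here; the function names `huffman_codes`, `encode`, and `decode` and the sample text are made up for the example, and the "bits" are kept as a string of 0s and 1s for readability.

```python
import heapq
from collections import Counter


def huffman_codes(data):
    """Return a {byte_value: bit_string} table built from byte frequencies."""
    freq = Counter(data)  # first function: count how often each byte appears
    if len(freq) == 1:
        # Only one distinct byte: it still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Each heap entry is (weight, tie_breaker, {byte: code_so_far}).
    heap = [(weight, i, {sym: ""}) for i, (sym, weight) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Second function: keep merging the two rarest groups. Rare bytes are
        # merged more often, so they collect more bits than frequent bytes.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2] if heap else {}


def encode(data, codes):
    """Replace every byte with its bit string."""
    return "".join(codes[byte] for byte in data)


def decode(bits, codes):
    """Walk the bit string, emitting a byte whenever a complete code is seen."""
    reverse = {code: sym for sym, code in codes.items()}
    out = bytearray()
    current = ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return bytes(out)


# Hypothetical sample text, chosen only for the demonstration.
text = b"this is an example of a huffman tree"
codes = huffman_codes(text)
bits = encode(text, codes)
restored = decode(bits, codes)
print(f"{len(text) * 8} bits before, {len(bits)} bits after")
```

The space character is the most common byte in this sample, so it ends up with one of the shortest codes. A real program would also have to store the code table in the archive so the file can be decoded later.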
DOS programs compress to about one half their original size. Windows
programs reduce to less than half of their original size. Data compression
can go too far, however. Repeatedly compressing a program is not a good
idea: a compressed file has little redundancy left, so another pass gains
next to nothing and can even make the file slightly larger. In nearly
every test case, PKZIP is by far the fastest compression program.
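To see what repeated compression does to file size, here is a small experiment sketched in Python. It uses the standard `zlib` module as a stand-in compressor (an assumption made for the example; it is not PKZIP itself), and the word list and the helper name `compress_repeatedly` are invented for the demonstration.

```python
import random
import zlib


def compress_repeatedly(data, passes):
    """Compress data several times over; return the size after each pass."""
    sizes = [len(data)]
    for _ in range(passes):
        data = zlib.compress(data, 9)  # level 9 = maximum compression
        sizes.append(len(data))
    return sizes, data


# Build a repetitive, text-like sample from a small made-up vocabulary.
rng = random.Random(0)
words = ["data", "compression", "modem", "archive", "file", "disk", "header", "program"]
sample = " ".join(rng.choice(words) for _ in range(4000)).encode("ascii")

sizes, packed = compress_repeatedly(sample, 3)

# Unpacking takes one decompression pass for every compression pass.
restored = packed
for _ in range(3):
    restored = zlib.decompress(restored)

print("size after each pass:", sizes)
```

On a sample like this the first pass does nearly all of the work. Later passes typically leave the size about the same or slightly larger, and the file then needs just as many decompression passes to get back to the original.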
PKZIP [-b[path]] [options] Zipfile [@list] [files...]
As we can see, we do not for the most part use every option when we use PKZIP. The general components include Program (PKZIP), Command (-a), Switch (-p) (optional), Archive name (FILENAME.ZIP), and Files (.doc) (optional). We mainly get around the complicated command components by using the basic structure and staying with it each time. Also, each compression program is a little different in its command elements. Finally, each program divides up its functions differently, e.g., some programs such as PKZIP separate the compressing module from the decompressing module.
The Command element is required in every program. It specifies the type of task the program is to undertake, e.g., PKZIP -a FILENAME.ZIP FILENAME.TXT. The -a is the command in this example. What -a does is defined by the PKZIP program, so its meaning has to be looked up for each program. Help is available in nearly every mainline data compression program.
Switches control the execution of the command and specify deviations from the standard procedure.
Archive Names must appear in the command and are usually identified by the filename extension, e.g., .ARC or .ZIP. The compression program usually assigns this extension automatically.
The files to be compressed are usually listed in the Files element. Most
of the data compression programs allow more than one file to be compressed
at one time.
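The same elements (a command, an optional switch, an archive name, and a list of files) show up when an archive is built from a script. The sketch below uses Python's standard `zipfile` module rather than PKZIP, so the mapping in the comments is only an analogy; the function name `create_archive` and the file names are hypothetical.

```python
import os
import tempfile
import zipfile


def create_archive(archive_name, file_paths, level=6):
    """Pack several files into one ZIP archive and return the archive name."""
    # Mode "w" plays the part of the command (create an archive) and
    # compresslevel plays the part of a switch that tunes how it runs.
    with zipfile.ZipFile(archive_name, "w", compression=zipfile.ZIP_DEFLATED,
                         compresslevel=level) as zf:
        for path in file_paths:
            # arcname stores only the file name, without directory information.
            zf.write(path, arcname=os.path.basename(path))
    return archive_name


# Demonstration in a throwaway directory with two made-up text files.
with tempfile.TemporaryDirectory() as work:
    sources = []
    for name in ("LETTER.TXT", "REPORT.TXT"):
        path = os.path.join(work, name)
        with open(path, "w") as handle:
            handle.write("Data compression saves disk space.\n" * 50)
        sources.append(path)

    archive = create_archive(os.path.join(work, "FILENAME.ZIP"), sources)
    with zipfile.ZipFile(archive) as zf:
        stored_names = zf.namelist()
    archive_size = os.path.getsize(archive)
    original_size = sum(os.path.getsize(p) for p in sources)

print(stored_names, original_size, "bytes ->", archive_size, "bytes")
```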
Data compression programs all place header information in the compressed file for location purposes. The header contains data from each archived file, local data, and information about the total archive. The local data belongs to the individually archived file and is divided into a local header and the compressed data. The header also contains important information on how to decompress the file. This makes getting around errors easier in that the entire archive is not usually lost if an individual archived file is corrupted. Damage is usually limited to the local archived file containing the error. Local headers also make it easy to make changes to archived files. Updating archives is done by comparing local header information. Changing file names is also easy for archived files and is done by making minor changes to the header.
Every archive also contains a global header. Statistical data
not important to individual files is maintained here in the archive header.
This statistical data is very important to the overall structure of the
archive. The archive header contains information such as central header
signature, packer version, required version for unpacking, general information,
date of last change, time of last change, 32 bit CRC code, and compressed
size and normal size. This header information is required to decompress
the data back into its original form.
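Most of the header fields listed above can be inspected directly. As a hedged illustration, the sketch below reads them from a ZIP archive with Python's standard `zipfile` module; the helper names `describe_archive` and `crc_matches` and the file name NOTES.TXT are invented for the example, and other archive formats lay out their headers differently.

```python
import io
import zipfile
import zlib


def describe_archive(archive):
    """Return the main header fields for every file stored in a ZIP archive."""
    rows = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            rows.append({
                "name": info.filename,
                "packer_version": info.create_version,    # version that packed it
                "version_needed": info.extract_version,   # version needed to unpack
                "last_change": info.date_time,            # (year, month, day, h, m, s)
                "crc32": f"{info.CRC:08x}",               # 32-bit CRC of the original data
                "compressed_size": info.compress_size,
                "normal_size": info.file_size,
            })
    return rows


def crc_matches(archive, name):
    """Recompute the CRC of an unpacked file and compare it with the header."""
    with zipfile.ZipFile(archive) as zf:
        # zf.read() already raises an error on a bad CRC; the explicit
        # comparison is here only to show what the field is for.
        return zlib.crc32(zf.read(name)) == zf.getinfo(name).CRC


# Build a small archive in memory so the example leaves no files behind.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("NOTES.TXT", b"header information " * 100)

rows = describe_archive(buffer)
for row in rows:
    print(row)
```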
In an archive with multiple files, the decompression program or utility decompresses every file in the archive unless control commands are included in the decompression function. These control commands can list, delete, repair, or convert data files, among other functions.
The decompression programs or utilities also provide controls for placing the decompressed files in sub-directories other than the current sub-directory if required. Target sub-directories are indicated by the path commands. Path information is usually placed right after the packer command, e.g., ARJ e ARC C:\EXAMPLE\ *.TXT. In this case, the target directory is C:\EXAMPLE\, where all of the .TXT files will be placed. Additionally, new sub-directories may be created from the decompression program or utility. Remember that each program has different commands for all of these functions.
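The idea of unpacking only the .TXT files into a chosen target directory can be sketched with Python's standard `zipfile` module. This is an analogy to the ARJ command above, not a reproduction of it; the function name `extract_matching`, the archive name ARC.ZIP, and the file names are hypothetical.

```python
import fnmatch
import os
import tempfile
import zipfile


def extract_matching(archive, pattern, target_dir):
    """Unpack only the files whose names match pattern into target_dir."""
    os.makedirs(target_dir, exist_ok=True)  # create the target sub-directory if needed
    extracted = []
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            # Compare in upper case so *.TXT also matches lower-case names.
            if fnmatch.fnmatchcase(name.upper(), pattern.upper()):
                zf.extract(name, path=target_dir)
                extracted.append(name)
    return extracted


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as work:
    archive = os.path.join(work, "ARC.ZIP")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("README.TXT", "read me")
        zf.writestr("NOTES.TXT", "notes")
        zf.writestr("PROGRAM.EXE", "binary")

    target = os.path.join(work, "EXAMPLE")
    extracted = extract_matching(archive, "*.TXT", target)
    unpacked = sorted(os.listdir(target))

print(extracted, unpacked)
```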
Existing files are protected automatically as the decompression program or utility unpacks the archive. Files in the target directory are not overwritten until some action is taken. The decompression utility responds either by skipping the file or by indicating that a file of the same name exists and giving you the chance to overwrite it. Some programs compare the ages of files, overwriting older files. Some programs also let you specify another file name if necessary. All of the programs have safety prompts built in.
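A script that unpacks archives has to supply this protection itself. The sketch below shows one way to do it with Python's standard `zipfile` module, under the assumption that skipping is the default and that overwriting an older file is allowed only on request; the function name `safe_extract`, its `overwrite_older` parameter, and the file names are made up for the example.

```python
import os
import tempfile
import time
import zipfile


def safe_extract(archive, target_dir, overwrite_older=False):
    """Unpack an archive without clobbering files that already exist.

    Returns two lists: the names written and the names skipped.
    """
    written, skipped = [], []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Simplification: assumes plain file names with no odd path parts.
            dest = os.path.join(target_dir, info.filename)
            if os.path.exists(dest):
                # Compare ages: archived timestamp versus the file on disk.
                archived_time = time.mktime(info.date_time + (0, 0, -1))
                is_older = os.path.getmtime(dest) < archived_time
                if not (overwrite_older and is_older):
                    skipped.append(info.filename)
                    continue
            zf.extract(info, path=target_dir)
            written.append(info.filename)
    return written, skipped


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as work:
    archive = os.path.join(work, "ARC.ZIP")
    stamp = (2020, 1, 1, 0, 0, 0)
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr(zipfile.ZipInfo("KEEP.TXT", date_time=stamp), "archived copy")
        zf.writestr(zipfile.ZipInfo("FRESH.TXT", date_time=stamp), "fresh")

    target = os.path.join(work, "EXAMPLE")
    os.makedirs(target)
    existing = os.path.join(target, "KEEP.TXT")
    with open(existing, "w") as handle:
        handle.write("my edits")

    first = safe_extract(archive, target)  # KEEP.TXT exists, so it is skipped
    with open(existing) as handle:
        after_first = handle.read()

    # Pretend the file on disk is older than the archived copy.
    old = time.mktime((2019, 1, 1, 0, 0, 0, 0, 0, -1))
    os.utime(existing, (old, old))
    second = safe_extract(archive, target, overwrite_older=True)
    with open(existing) as handle:
        after_second = handle.read()

print(first, second)
```

An interactive program would ask before overwriting; the flag here simply stands in for the answer to that prompt.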
Some of the data compression programs decompress entire directories.
Individual files can be decompressed with or without directory or path
information. ARJ automatically incorporates the directory and path information
into the archive. LHA, ICE, and PKZIP each require special commands to
include the path names during decompression. Directory structures stored
in archives require that we understand relative and absolute path information.
Relative path positions are the same as giving directions from where you
stand at the time, and absolute path positions are referenced from a fixed
point such as the water tower. The root directory is the equivalent of
the water tower. One can always get back to the root directory.
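The difference between the two kinds of path is easy to demonstrate. In the sketch below, Python's standard `pathlib` module resolves a stored path against the directory we are "standing in"; the directory names and the helper name `resolve` are hypothetical and chosen only for the illustration.

```python
from pathlib import PureWindowsPath


def resolve(current_dir, stored_path):
    """Work out where a path stored in an archive would land on disk."""
    stored = PureWindowsPath(stored_path)
    if stored.is_absolute():
        # Absolute: measured from the root (the water tower), so the
        # current directory makes no difference.
        return stored
    # Relative: directions from wherever we are standing right now.
    return PureWindowsPath(current_dir) / stored


relative = "DOCS/LETTER.TXT"
absolute = "C:/ARCHIVE/DOCS/LETTER.TXT"

from_example = resolve("C:/EXAMPLE", relative)
from_other = resolve("C:/OTHER", relative)
fixed_a = resolve("C:/EXAMPLE", absolute)
fixed_b = resolve("C:/OTHER", absolute)

print(from_example)
print(from_other)
print(fixed_a)
```

The relative path ends up in a different place depending on the starting directory, while the absolute path always leads to the same spot.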
It is just good computer practice to obtain and learn at least
one data compression program. The PKWARE PK204G.ZIP utilities are widely
used. They can be downloaded as Shareware from most BBSs, as can
most of the other programs. There are Windows 3.11 and Windows 95 versions
of most of them. We all need to learn at least one of them well.
John Woody is a telecommunications
consultant specializing in small business communications networks and Internet
business training.