Lempel-Ziv-Welch (LZW) Algorithm

Next: Entropy Encoding Summary Up: Lossless Compression Algorithms (Entropy Previous: Arithmetic Coding

Lempel-Ziv-Welch (LZW) Algorithm

The LZW algorithm is a very common compression technique.

Suppose we want to encode the Oxford Concise English dictionary which contains about 159,000 entries. Why not just transmit each word as an 18 bit number?

Problems:

Too many bits,
everyone needs a dictionary,
only works for English text.
Solution: Find a way to build the dictionary adaptively.
Original methods due to Ziv and Lempel in 1977 and 1978. Terry Welch improved the scheme in 1984 (called LZW compression).
It is used in UNIX compress -- 1D token stream (similar to below)
It used in GIF comprerssion -- 2D window tokens (treat image as with Huffman Coding Above).

Reference: Terry A. Welch, "A Technique for High Performance Data Compression", IEEE Computer, Vol. 17, No. 6, 1984, pp. 8-19.

The LZW Compression Algorithm can summarised as follows:

   w = NIL;
   while ( read a character k )
       {
         if wk exists in the dictionary
          w = wk;
         else
           add wk to the dictionary;
           output the code for w;
           w = k;
       }

Original LZW used dictionary with 4K entries, first 256 (0-255) are ASCII codes.

Example:

Input string is "^WED^WE^WEE^WEB^WET". 


             w    k    output    index    symbol
          -----------------------------------------
            NIL   ^               
             ^    W      ^        256       ^W
             W    E      W        257       WE
             E    D      E        258       ED
             D    ^      D        259       D^
             ^    W       
            ^W    E     256       260      ^WE
             E    ^      E        261       E^
             ^    W
            ^W    E
           ^WE    E     260       262     ^WEE
             E    ^
            E^    W     261       263      E^W
             W    E
            WE    B     257       264      WEB
             B    ^      B        265       B^
             ^    W
            ^W    E
           ^WE    T     260       266     ^WET
             T   EOF     T

A 19-symbol input has been reduced to 7-symbol plus 5-code output. Each code/symbol will need more than 8 bits, say 9 bits.
Usually, compression doesn't start until a large number of bytes (e.g., > 100) are read in.

The LZW Decompression Algorithm is as follows:

 read a character k;
   output k;
   w = k;
   while ( read a character k )    
  /* k could be a character or a code. */
       {
         entry = dictionary entry for k;
         output entry;
         add w + entry[0] to dictionary;
         w = entry;
       }

Example (continued):

Input string is "^WED<256>E<260><261><257>B<260>T". 


             w      k    output    index    symbol
          -------------------------------------------
                    ^       ^
             ^      W       W       256       ^W
             W      E       E       257       WE
             E      D       D       258       ED
             D    <256>    ^W       259       D^
           <256>    E       E       260      ^WE
             E    <260>   ^WE       261       E^
           <260>  <261>    E^       262     ^WEE
           <261>  <257>    WE       263      E^W
           <257>    B       B       264      WEB
             B    <260>   ^WE       265       B^
           <260>    T       T       266     ^WET

Problem: What if we run out of dictionary space?
- Solution 1: Keep track of unused entries and use LRU
- Solution 2: Monitor compression performance and flush dictionary when performance is poor.
Implementation Note: LZW can be made really fast; it grabs a fixed number of bits from input stream, so bit parsing is very easy. Table lookup is automatic.

Next: Entropy Encoding Summary Up: Lossless Compression Algorithms (Entropy Previous: Arithmetic Coding

Dave Marshall
10/4/2001