comp.compression Frequently Asked Questions (part 2/3)Section - [70] Introduction to data compression (long)

Top Document: comp.compression Frequently Asked Questions (part 2/3)
Previous Document: News Headers
Next Document: [71] Introduction to MPEG (long)

See reader questions & answers on this topic! - Help others by sharing your knowledge

Written by Peter Gutmann <pgut1@cs.aukuni.ac.nz>.

 Huffman and Related Compression Techniques
 ------------------------------------------

  *Huffman compression* is a statistical data compression technique which 
gives a reduction in the average code length used to represent the symbols of 
a alphabet.  The Huffman code is an example of a code which is optimal in the 
case where all symbols probabilities are integral powers of 1/2.  A Huffman 
code can be built in the following manner:

  (1) Rank all symbols in order of probability of occurrence.
    
  (2) Successively combine the two symbols of the lowest probability to form
      a new composite symbol; eventually we will build a binary tree where
      each node is the probability of all nodes beneath it.

  (3) Trace a path to each leaf, noticing the direction at each node.

  For a given frequency distribution, there are many possible Huffman codes,
but the total compressed length will be the same. It is possible to
define a 'canonical' Huffman tree, that is, pick one of these alternative
trees. Such a canonical tree can then be represented very compactly, by
transmitting only the bit length of each code. This technique is used
in most archivers (pkzip, lha, zoo, arj, ...).


  A technique related to Huffman coding is *Shannon-Fano coding*, which
works as follows:

  (1) Divide the set of symbols into two equal or almost equal subsets
      based on the probability of occurrence of characters in each
      subset.  The first subset is assigned a binary zero, the second
      a binary one.

  (2) Repeat step (1) until all subsets have a single element.

The algorithm used to create the Huffman codes is bottom-up, and the
one for the Shannon-Fano codes is top-down. Huffman encoding always
generates optimal codes, Shannon-Fano sometimes uses a few more bits.

[See also "Practical Huffman coding" http://www.compressconsult.com/huffman/ ]


 Arithmetic Coding
 -----------------

  It would appear that Huffman or Shannon-Fano coding is the perfect
means of compressing data.  However, this is *not* the case.  As
mentioned above, these coding methods are optimal when and only when
the symbol probabilities are integral powers of 1/2, which is usually
not the case.

  The technique of *arithmetic coding* does not have this restriction:
It achieves the same effect as treating the message as one single unit
(a technique which would, for Huffman coding, require enumeration of
every single possible message), and thus attains the theoretical
entropy bound to compression efficiency for any source.

  Arithmetic coding works by representing a number by an interval of real 
numbers between 0 and 1.  As the message becomes longer, the interval needed 
to represent it becomes smaller and smaller, and the number of bits needed to 
specify that interval increases.  Successive symbols in the message reduce 
this interval in accordance with the probability of that symbol. The more
likely symbols reduce the range by less, and thus add fewer bits to the   
message.

     1                                             Codewords
    +-----------+-----------+-----------+           /-----\
    |           |8/9 YY     |  Detail   |<- 31/32    .11111
    |           +-----------+-----------+<- 15/16    .1111
    |    Y      |           | too small |<- 14/16    .1110
    |2/3        |    YX     | for text  |<- 6/8      .110
    +-----------+-----------+-----------+
    |           |           |16/27 XYY  |<- 10/16    .1010
    |           |           +-----------+
    |           |    XY     |           |
    |           |           |   XYX     |<- 4/8      .100
    |           |4/9        |           |
    |           +-----------+-----------+
    |           |           |           |
    |    X      |           |   XXY     |<- 3/8      .011
    |           |           |8/27       |
    |           |           +-----------+
    |           |    XX     |           |
    |           |           |           |<- 1/4      .01
    |           |           |   XXX     |
    |           |           |           |
    |0          |           |           |
    +-----------+-----------+-----------+

  As an example of arithmetic coding, lets consider the example of two
symbols X and Y, of probabilities 0.66 and 0.33. To encode this message, we
examine the first symbol: If it is a X, we choose the lower partition; if
it is a Y, we choose the upper partition.  Continuing in this manner for
three symbols, we get the codewords shown to the right of the diagram above
- they can be found by simply taking an appropriate location in the
interval for that particular set of symbols and turning it into a binary
fraction. In practice, it is also necessary to add a special end-of-data
symbol, which is not represented in this simpe example.
        
  In this case the arithmetic code is not completely efficient, which is due 
to the shortness of the message - with longer messages the coding efficiency 
does indeed approach 100%.

  Now that we have an efficient encoding technique, what can we do with it? 
What we need is a technique for building a model of the data which we can 
then use with the encoder.  The simplest model is a fixed one, for example a 
table of standard letter frequencies for English text which we can then use 
to get letter probabilities.  An improvement on this technique is to use an 
*adaptive model*, in other words a model which adjusts itself to the data 
which is being compressed as the data is compressed.  We can convert the 
fixed model into an adaptive one by adjusting the symbol frequencies after 
each new symbol is encoded, allowing the model to track the data being 
transmitted.  However, we can do much better than that.

Using the symbol probabilities by themselves is not a particularly good
estimate of the true entropy of the data: We can take into account
intersymbol probabilities as well.  The best compressors available today
take this approach: DMC (Dynamic Markov Coding) starts with a zero-order
Markov model and gradually extends this initial model as compression
progresses; PPM (Prediction by Partial Matching) looks for a match of the
text to be compressed in an order-n context.  If no match is found, it
drops to an order n-1 context, until it reaches order 0.  Both these
techniques thus obtain a much better model of the data to be compressed,
which, combined with the use of arithmetic coding, results in superior
compression performance.

  So if arithmetic coding-based compressors are so powerful, why are they not 
used universally?  Apart from the fact that they are relatively new and 
haven't come into general use too much yet, there is also one major concern:  
The fact that they consume rather large amounts of computing resources, both 
in terms of CPU power and memory.  The building of sophisticated models for 
the compression can chew through a fair amount of memory (especially in the 
case of DMC, where the model can grow without bounds); and the arithmetic 
coding itself involves a fair amount of number crunching.
There is however an alternative approach, a class of compressors generally 
referred to as *substitutional* or *dictionary-based compressors*.

 Substitutional Compressors
 --------------------------

  The basic idea behind a substitutional compressor is to replace an 
occurrence of a particular phrase or group of bytes in a piece of data with a 
reference to a previous occurrence of that phrase.  There are two main 
classes of schemes, named after Jakob Ziv and Abraham Lempel, who first 
proposed them in 1977 and 1978.

<The LZ78 family of compressors>

  LZ78-based schemes work by entering phrases into a *dictionary* and then, 
when a repeat occurrence of that particular phrase is found, outputting the 
dictionary index instead of the phrase.  There exist several compression 
algorithms based on this principle, differing mainly in the manner in which 
they manage the dictionary.  The most well-known scheme (in fact the most 
well-known of all the Lempel-Ziv compressors, the one which is generally (and 
mistakenly) referred to as "Lempel-Ziv Compression"), is Terry Welch's LZW 
scheme, which he designed in 1984 for implementation in hardware for high- 
performance disk controllers.

Input string: /WED/WE/WEE/WEB

Character input:    Code output:    New code value and associated string:
    /W                  /                   256 = /W
    E                   W                   257 = WE
    D                   E                   258 = ED
    /                   D                   259 = D/
    WE                  256                 260 = /WE
    /                   E                   261 = E/
    WEE                 260                 262 = /WEE
    /W                  261                 263 = E/W
    EB                  257                 264 = WEB
    <END>               B
    
  LZW starts with a 4K dictionary, of which entries 0-255 refer to individual 
bytes, and entries 256-4095 refer to substrings.  Each time a new code is 
generated it means a new string has been parsed.  New strings are generated 
by appending the current character K to the end of an existing string w.  The 
algorithm for LZW compression is as follows:

  set w = NIL
  loop
      read a character K
      if wK exists in the dictionary
          w = wK
      else
          output the code for w
          add wK to the string table
          w = K
  endloop

  A sample run of LZW over a (highly redundant) input string can be seen in 
the diagram above.  The strings are built up character-by-character starting 
with a code value of 256.  LZW decompression takes the stream of codes and 
uses it to exactly recreate the original input data.  Just like the 
compression algorithm, the decompressor adds a new string to the dictionary 
each time it reads in a new code.  All it needs to do in addition is to 
translate each incoming code into a string and send it to the output.  A 
sample run of the LZW decompressor is shown in below.

Input code: /WED<256>E<260><261><257>B

Input code:        Output string:     New code value and associated string:
    /                  /            
    W                  W                      256 = /W
    E                  E                      257 = WE
    D                  D                      258 = ED
    256                /W                     259 = D/
    E                  E                      260 = /WE
    260                /WE                    261 = E/
    261                E/                     262 = /WEE
    257                WE                     263 = E/W
    B                  B                      264 = WEB
           
  The most remarkable feature of this type of compression is that the entire 
dictionary has been transmitted to the decoder without actually explicitly 
transmitting the dictionary.  At the end of the run, the decoder will have a 
dictionary identical to the one the encoder has, built up entirely as part of 
the decoding process.
    LZW is more commonly encountered today in a variant known as LZC, after 
its use in the UNIX "compress" program.  In this variant, pointers do not 
have a fixed length.  Rather, they start with a length of 9 bits, and then 
slowly grow to their maximum possible length once all the pointers of a 
particular size have been used up.  Furthermore, the dictionary is not frozen 
once it is full as for LZW - the program continually monitors compression 
performance, and once this starts decreasing the entire dictionary is 
discarded and rebuilt from scratch.  More recent schemes use some sort of 
least-recently-used algorithm to discard little-used phrases once the 
dictionary becomes full rather than throwing away the entire dictionary.  

Finally, not all schemes build up the dictionary by adding a single new 
character to the end of the current phrase. An alternative technique is to 
concatenate the previous two phrases (LZMW), which results in a faster 
buildup of longer phrases than the character-by-character buildup of the 
other methods.  The disadvantage of this method is that a more sophisticated 
data structure is needed to handle the dictionary.

[A good introduction to LZW, MW, AP and Y coding is given in the yabba
package. For ftp information, see question 2 in part one, file type .Y]


<The LZ77 family of compressors>

  LZ77-based schemes keep track of the last n bytes of data seen, and when a 
phrase is encountered that has already been seen, they output a pair of 
values corresponding to the position of the phrase in the previously-seen 
buffer of data, and the length of the phrase.  In effect the compressor moves 
a fixed-size *window* over the data (generally referred to as a *sliding 
window*), with the position part of the (position, length) pair referring to 
the position of the phrase within the window.  The most commonly used 
algorithms are derived from the LZSS scheme described by James Storer and 
Thomas Szymanski in 1982.  In this the compressor maintains a window of size 
N bytes and a *lookahead buffer* the contents of which it tries to find a 
match for in the window:

  while( lookAheadBuffer not empty )
      {
      get a pointer ( position, match ) to the longest match in the window
          for the lookahead buffer;

      if( length > MINIMUM_MATCH_LENGTH )
          {
          output a ( position, length ) pair;
          shift the window length characters along;
          }
      else
          {
          output the first character in the lookahead buffer;
          shift the window 1 character along;
          }
      }
        
  Decompression is simple and fast:  Whenever a ( position, length ) pair is 
encountered, go to that ( position ) in the window and copy ( length ) bytes 
to the output.

  Sliding-window-based schemes can be simplified by numbering the input text
characters mod N, in effect creating a circular buffer.  The sliding window
approach automatically creates the LRU effect which must be done explicitly in
LZ78 schemes.  Variants of this method apply additional compression to the
output of the LZSS compressor, which include a simple variable-length code
(LZB), dynamic Huffman coding (LZH), and Shannon-Fano coding (ZIP 1.x)), all
of which result in a certain degree of improvement over the basic scheme,
especially when the data are rather random and the LZSS compressor has little
effect.
  Recently an algorithm was developed which combines the ideas behind LZ77 and
LZ78 to produce a hybrid called LZFG.  LZFG uses the standard sliding window,
but stores the data in a modified trie data structure and produces as output
the position of the text in the trie.  Since LZFG only inserts complete
*phrases* into the dictionary, it should run faster than other LZ77-based
compressors.

All popular archivers (arj, lha, zip, zoo) are variations on the LZ77 theme.


[A tutorial on some compression algorithms is available at
http://www.cs.sfu.ca/cs/CC/365/li/squeeze/ ]

User Contributions:

charmdatescamreviewsgmd

⚠

The Major Advantages of uniform dating

online dating services is not a newest platform but it has taken a sudden rise after the advancement in the technologies. This has revolutionised the way singles meet. With many struggles of dating in today sphere, More people these days are turning their heads towards the internet.

Even for arranged your marriage, Older dating and stuff like that, People are changing themselves and adopting the platform with all open heart and mind. Beside several advantages these platforms possess, Location and religion are the factors that set it apart. You can find your life partner from everywhere and out of any religion. These religions and castes barely matters when it comes to online dating sites.

really fast, Easy and convenient: At the first, Dating platforms might seem to be a daunting process, [url=https://charmdatescamreviews.wordpress.com/2018/03/19/charmdate-review-why-should-i-date-ukrainian-girls-how-charmdate-protects-me/]online dating ukraine[/url] But in real it is a simple and efficient process to register and connect with people all around the globe. The only thing you need to do is creating an eye catchy and appealing profile mentioning all needed details about you, Your hobbies and your fascinates. Its speedy and convenient access makes it a must have platform for those busy corporate out there.

Less pressing: This is one of the best platforms especially for the people who are shy or nervous as they can connect with people they find interesting via chats unless they get familiar enough with them to either have a verbal talk or start with dating in person. It gives you relaxed atmosphere, Where you can take out efficient time to think to understand and what you want to say to proceed ahead with the conversation.

Meet lots more people: This platform gives user a numerous choices to select from and also it is possible that you can connect with many people altogether and you can find sometimes a person who can be a great friend and a person who can be eligible to be your partner in the future time. This platform will allow you to pick the best one out of many with whom you think you can share your interests with.

add on a Deeper Level: This online site help you know a person back to front. The only appearance you have of your mate is his profile picture, Else you have to know it with the help of the chat you have with him. this can help you evaluate a person behind his face, numerous experts judge who these persons are truly are. Such dating platforms leave you unbiased to be love someone you share similar interests with.

Full Disclosure: The online dating sites allow you to specify whatever is your expectation and intention, Right right from the start so that you can find people looking for the same things and interests as of yours. The major benefit of all such platform is that it helps preventing misunderstandings and disappointments.

fee: Last but not the cheapest, Cost saving is the most appealing benefit of online dating because real life dates are expensive. You need to hangout with your partner every weekend and you further have to spend money either on food or night-life or both.

These platforms therefore gives you enable you to get to know the person well in advance through these online portals and thereafter you should spend money on real dates. different, On the very first meet even you must spend money with no surety that you will like the person as a partner or not.

Isasymn

⚠

Hey ... Im looking a man..
I love sex. Here are my erotic photos - is.gd/bPOmnQ

Tildacg

⚠

Hello !! Im looking a lover...
I dream of hard sex! Write me - tinyurl.com/yjefe8mm

Ellais

⚠

Hey baby.. my name is Kelly...
I love oral sex! Write me - is.gd/sTu2T8

Lauren0v

⚠

Fill out the form and win a FREE $500 or $1000 voucher! - tinyurl.com/ygvk79dz

Donnaj2

⚠

Hello dear!!! Im looking a man!
I dream of hard sex! Write me - is.gd/gPWpcf

Anja8m

⚠

Hi baby... my name Diane!!!
I love sex. Here are my erotic photos - tinyurl.com/yghegmr2

Janet2g

⚠

Hey dear.. Im looking a man...
I want sex! Write me - is.gd/MvyEhY

Sophie8e

⚠

hallo baby!! Im looking a lover!!
I dream of hard sex! Write me - tinyurl.com/yjs64g36

Judyhb

⚠

Hello dear! my name is Evelyn..
If you want to meet me, I'm here - tinyurl.com/ydun3hp3

Emilyfa

⚠

hey baby. my name Lauren!!!
I love oral sex! Write me - tinyurl.com/ygqxenap

Ellaim

⚠

hallo baby.. my name Mary!
Do you want to see a beautiful female body? Here are my erotic photos - tinyurl.com/yhtxq5lg

Dorism9

⚠

Hi dear!! Im looking a man!
I want sex! Here are my photos - is.gd/M7qBFG

Andrea9h

⚠

hallo !! my name Megan!
I love oral sex! Write me - chilp.it/d870a00

Milau6

⚠

Hey baby!!! Im looking a man!
I love sex. Write me - u.to/j-bKGw

LeDoprano

⚠

I can agree with Judyhb in this topic. Her name may really be Evelyn, but she choosed another nickname out of shame. Kind of sad, Evelyn is a nice name..
Also, I want sex! Don't have any photos though :/

Aaronvew

⚠

Sep 3, 2022 @ 7:19 pm

Good write ups. With thanks! https://definitionessays.com/ degree thesis

Alfredo Maddox

⚠

May 7, 2023 @ 4:16 pm

How Are You

Hello, I wish for to subscribe for this webpage to get most recent updates, thus where can i do it please help out.

https://cutt.ly/x55XjOQ

Best Regards

Juliaqf

⚠

Sep 22, 2023 @ 11:11 am

hallo baby! Im looking a lover!
If you want to meet me, I'm here - is.gd/CZZI2g

Pan Afuki

⚠

Dec 25, 2023 @ 12:12 pm

Matt Mahoney is the trashcan matt or trashcan man. "MY LIFE FOR YOU FLAGG!". We will do the FAGGY PANTS and follow it up with RLE. A close second is Albert Reddit's GAY GZIP, or GAY ZIP, or simply ARZ. We do the faggy pants and it go like this and like that and do the RLE and there we are.

RanSteSuGMindel

⚠

Nov 21, 2024 @ 3:15 pm

re LeDoprano:
nice to see another human around here!

Comment about this article, ask questions, or add new information about this topic:

Archived related questions and answers

Top Document: comp.compression Frequently Asked Questions (part 2/3)
Previous Document: News Headers
Next Document: [71] Introduction to MPEG (long)

Part1 - Part2 - Part3 - Single Page

[ Usenet FAQs | Web FAQs | Documents | RFC Index ]

Send corrections/additions to the FAQ Maintainer:
jloup@gzip.OmitThis.org

Last Update March 27 2014 @ 02:11 PM

comp.compression Frequently Asked Questions (part 2/3)
Section - [70] Introduction to data compression (long)

Search the FAQ Archives

comp.compression Frequently Asked Questions (part 2/3)
Section - [70] Introduction to data compression (long)

User Contributions:

Comment about this article, ask questions, or add new information about this topic: