
comp.compression Frequently Asked Questions (part 2/3)
Section - [78] The Burrows-Wheeler block sorting algorithm (long)


A high-quality implementation of the Burrows-Wheeler
block-sorting-based lossless compression algorithm is available at
http://www.cs.man.ac.uk/arch/people/j-seward/bzip-0.21.tar.gz

Mark Nelson wrote an excellent article "Data Compression with the
Burrows-Wheeler Transform" for Dr. Dobb's Journal, September 1996.  A copy
of the article is at http://www.dogma.net/markn/articles/bwt/bwt.htm

Another introduction written by Sampo Syreeni <tmaaedu@nexus.edu.lahti.fi>:

    The Burrows-Wheeler block sorting compression algorithm is described in
"A Block-sorting Lossless Data Compression Algorithm" by M. Burrows and D.J.
Wheeler, dated May 10, 1994. A PostScript copy of this paper has been made
available by Digital on the Systems Research Center (SRC) FTP site at
ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.Z

    The method was originally discovered by one of the authors (Wheeler) back
in 1983, but had not been published before. As such, the method is fairly new
and has not yet gained wide popularity.

    The method described in the original paper is really a composite of three
different algorithms: the block sorting main engine (a lossless, very slightly
expansive preprocessor), the move-to-front coder (a simple, fast, locally
adaptive, noncompressive byte-for-byte coder) and a simple statistical
compressor (first-order Huffman coding is mentioned as a candidate) which
eventually does the actual compression. Of these three methods only the first
two are discussed here, as they constitute the heart of the algorithm.
Combined, they form a completely reversible (lossless) transformation that -
with typical input - skews the first order symbol distributions to make the
data more compressible with simple methods. Intuitively speaking, the method
transforms slack in the higher order probabilities of the input block (thus
making them more even, whitening them) into slack in the lower order
statistics. This effect is visible in the histogram of the resulting symbol
data.

    The block sorting preprocessor operates purely on a block basis. One way
to understand the idea is to think of the input block arranged as a circular
array where, for every symbol, the succeeding symbols are used as a predictor.
This predictor is then used to group together symbols with similar right-hand
contexts. The predictor is realized (conceptually) as a two phase process.
The first phase forms all cyclic shifts of the input block (whose size is
usually a power of two). Note here that the original string is always present
intact on some row of the resulting matrix. If the block length is n, then
there exist n rotations of the original string (to the left). These rotations
are now viewed as the rows of an n x n matrix of symbols. The second phase
consists of sorting this conceptual matrix, which brings the rows into order
based on their first few symbols. If there is some commonly repeated string
in the input block (the original paper gives "the" as an example), the
sorting phase brings all those rotations that start with a part of this
string very close to each other. The symbol preceding each occurrence of the
common string is then found in the last column of the sorted matrix. In this
way, common strings cause short bursts of just a few distinct characters to
form in the last column of the matrix. The last column is what is output from
the second phase. One further piece of information is derived from the input
data: an integer of ceil(log_2(n)) bits recording the row at which the
original input block ended up in the sorted matrix. This integer always
results in a slight expansion of the data, but it is necessary to be able to
decompress the string. The absolute amount of overhead grows only as the
logarithm of the input block size, so its share of the output becomes
negligible with useful block sizes anyway.
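
    The two conceptual phases map directly onto code. Below is a minimal
sketch in Python (illustrative, not part of the original FAQ) that builds the
conceptual matrix explicitly, exactly as described above:

    def bwt_forward(block: bytes):
        n = len(block)
        # Phase 1: form all n cyclic shifts (rotations) of the block.
        rotations = [block[i:] + block[:i] for i in range(n)]
        # Phase 2: sort the rows of the conceptual n x n matrix.
        rotations.sort()
        # Output the last column plus the row at which the original
        # block ended up; the index is needed to invert the transform.
        last_column = bytes(row[-1] for row in rotations)
        index = rotations.index(block)
        return last_column, index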

    The characteristics of the transformation process make the output from the
sort ideal for certain kinds of further manipulation. The extreme local
fluctuations in the first order statistics of the output string lead one to
use a transformation that converts this local skew into a global one. The
best example (and, of course, the one given in the original paper) is
move-to-front coding. This coder codes a symbol as the number of distinct
symbols seen since the symbol's last occurrence. Basically this means that
the coder outputs the index of an input symbol in a dynamic stack and then
updates the stack by moving the symbol to the top. This is easy and efficient
to implement and results in fast local adaptation. As just a few common
symbols will (locally) dominate the input to the coder, these symbols are
kept near the top of the stack and the output consists mainly of low numbers.
This makes it highly amenable to first order statistical compression methods,
which are, in turn, easy and efficient to implement.
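
    A minimal sketch of such a move-to-front coder (again illustrative
Python, assuming the stack starts as the byte alphabet in increasing order;
conventions for the initial stack vary between implementations):

    def mtf_encode(data: bytes):
        stack = list(range(256))     # initial stack: the ordered byte alphabet
        codes = []
        for symbol in data:
            # The position equals the number of distinct symbols
            # seen since this symbol's last occurrence.
            position = stack.index(symbol)
            codes.append(position)
            # Move the symbol to the top of the stack.
            stack.pop(position)
            stack.insert(0, symbol)
        return codes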

    The conceptual transform matrix described above would require enormous
amounts of storage and would not, as such, yield a usable algorithm. The
method can, however, be realized very efficiently with suffix-sorting and
quicksort techniques. The whole transformation, together with the eventual
simple compression engine, is thus extremely fast yet still achieves
impressive compression on typical input data. When implemented well, the
speeds achieved can be on the order of pure LZ methods while the compression
ratios approach those of state-of-the-art Markov modeling coders. The engine
also responds well to increasing block sizes - the longer the input block,
the more room there is for patterns to form and the more similar input
strings it will contain. This results in almost monotonically increasing
compression ratios even as the block length goes well into the megabyte
range.
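
    One simple way to avoid materializing the matrix (an illustration of the
idea only; real implementations use cleverer suffix-sorting machinery) is to
sort the rotation start positions against a doubled copy of the block, so
that each conceptual row exists only as an index:

    def bwt_forward_compact(block: bytes):
        n = len(block)
        doubled = block + block        # rotation i is doubled[i:i+n]
        order = sorted(range(n), key=lambda i: doubled[i:i + n])
        # block[i-1] is the symbol cyclically preceding rotation i,
        # i.e. the last-column entry of that row.
        last_column = bytes(block[i - 1] for i in order)
        index = order.index(0)         # row where rotation 0 (the block) landed
        return last_column, index
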
    The decompression cascade is basically just the compression cascade
backwards. More logic is needed to reverse the main sorting stage, however.
This logic involves reasoning about the relationship between the first and
the last columns of the conceptual coding matrix. The reader is referred to
the original paper for an in-depth treatment of the subject, including a more
thorough discussion of why the method works and how to implement it.
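
    One standard way to reverse the sort (a sketch in the same illustrative
Python, not code from the FAQ): stably sorting the last column recovers the
first column of the matrix and simultaneously links each row to the row
whose last symbol is that row's first symbol; following these links from the
recorded row index spells out the original block one symbol at a time:

    def bwt_inverse(last_column: bytes, index: int):
        n = len(last_column)
        # Stable sort of (symbol, position) pairs: links[j] is the row
        # whose last symbol equals row j's first symbol.
        links = sorted(range(n), key=lambda i: (last_column[i], i))
        out = bytearray()
        row = index
        for _ in range(n):
            row = links[row]
            out.append(last_column[row])
        return bytes(out)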

    And now a little demonstration. The original block to be compressed is
chosen to be the (rather pathological) string "good, jolly good". This was
taken as an example because it is highly redundant and exactly 16 bytes long.
The first picture shows the cyclic shifts (rotations) of the input string.
The second shows the matrix after sorting. Note that the last column now
contains many doubled characters, and that the original string has ended up
in the 6th row (index 5). The third picture shows the output of the block
sort. The index integer has been packed into a full byte, although 4 bits
would suffice in this case (log_2(16) = 4). The fourth and fifth pictures
show the transformed string after move-to-front coding. The sixth picture
shows the statistical distribution of the characters in the output string.
Notice the disproportionately large number of ones and zeros, even with a
string as short as this. This is the output that is then routed through the
simple statistical encoder. It should compress very well, as the distribution
of the characters is now very uneven.


      0 1 2 3 4 5 6 7 8 9 A B C D E F        0 1 2 3 4 5 6 7 8 9 A B C D E F
      -------------------------------        -------------------------------
  0 | g o o d ,   j o l l y   g o o d    0 |   g o o d g o o d ,   j o l l y
  1 | o o d ,   j o l l y   g o o d g    1 |   j o l l y   g o o d g o o d ,
  2 | o d ,   j o l l y   g o o d g o    2 | ,   j o l l y   g o o d g o o d
  3 | d ,   j o l l y   g o o d g o o    3 | d ,   j o l l y   g o o d g o o
  4 | ,   j o l l y   g o o d g o o d    4 | d g o o d ,   j o l l y   g o o
  5 |   j o l l y   g o o d g o o d ,    5 | g o o d ,   j o l l y   g o o d
  6 | j o l l y   g o o d g o o d ,      6 | g o o d g o o d ,   j o l l y
  7 | o l l y   g o o d g o o d ,   j    7 | j o l l y   g o o d g o o d ,
  8 | l l y   g o o d g o o d ,   j o    8 | l l y   g o o d g o o d ,   j o
  9 | l y   g o o d g o o d ,   j o l    9 | l y   g o o d g o o d ,   j o l
  A | y   g o o d g o o d ,   j o l l    A | o d ,   j o l l y   g o o d g o
  B |   g o o d g o o d ,   j o l l y    B | o d g o o d ,   j o l l y   g o
  C | g o o d g o o d ,   j o l l y      C | o l l y   g o o d g o o d ,   j
  D | o o d g o o d ,   j o l l y   g    D | o o d ,   j o l l y   g o o d g
  E | o d g o o d ,   j o l l y   g o    E | o o d g o o d ,   j o l l y   g
  F | d g o o d ,   j o l l y   g o o    F | y   g o o d g o o d ,   j o l l

             1. The shifts                    2. In lexicographic order


                                                121,45,102,114,0,1,36,0,
         "y,dood  oloojggl",5                   1,113,1,0,112,110,0,3,5

     3. The output from block sort          4. After move-to-front-coding


                                             00: 4;  01: 3;  03: 1;  05: 1;
          79,2D,66,72,0,1,24,0,              24: 1;  2D: 1;  66: 1;  6E: 1;
          1,71,1,0,70,6E,0,3,5               70: 1;  71: 1;  72: 1;  79: 1

           5. In hexadecimal                      6. The statistics
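
    For completeness, the sketches given earlier in this section reproduce
picture 3 directly (the move-to-front values in pictures 4 and 5 depend on
the coder's exact conventions for previously unseen symbols, so a
straightforward coder like the one sketched above will produce slightly
different numbers):

    block = b"good, jolly good"
    last_column, index = bwt_forward(block)
    print(last_column, index)            # b'y,dood  oloojggl' 5  (picture 3)
    print(mtf_encode(last_column))       # cf. picture 4 (conventions differ)
    assert bwt_inverse(last_column, index) == block   # fully reversible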

------------------------------------------------------------------------------

          End of part 2 of the comp.compression faq.


Send corrections/additions to the FAQ Maintainer:
jloup@gzip.OmitThis.org




