Hi, tool author here.

Huffman coding is a static minimum-redundancy code. What this means is that it finds an optimal assignment of prefix-free bit sequences to the letters of the input alphabet (commonly US-ASCII or extensions), based only on how often each letter occurs. This, however, means that Huffman coding cannot exploit redundancy that stems from the concrete sequence of characters. For example, you could easily predict that an `e` comes after `Th`, but Huffman coding cannot know that.
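To make the "cannot see the sequence" point concrete, here is a minimal Python sketch (a textbook Huffman construction, not code from any particular compressor). The function names `huffman_code_lengths` and `huffman_cost_bits` are made up for this illustration. It only computes code lengths, which is enough to show that the encoded size depends on byte counts and nothing else.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data: bytes) -> dict[int, int]:
    """Return the Huffman code length in bits for each byte value in data."""
    counts = Counter(data)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap items: (weight, tiebreaker, symbols in this subtree).
    # The tiebreaker is the smallest symbol in the subtree, which is unique
    # per subtree, so the symbol lists themselves are never compared.
    heap = [(weight, sym, [sym]) for sym, weight in counts.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    while len(heap) > 1:
        w1, t1, s1 = heapq.heappop(heap)
        w2, t2, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, min(t1, t2), s1 + s2))
    return lengths


def huffman_cost_bits(data: bytes) -> int:
    """Total payload size in bits if data were Huffman coded (table excluded)."""
    lengths = huffman_code_lengths(data)
    return sum(lengths[b] for b in data)


text = b"The theme: then the thesis."
shuffled = bytes(sorted(text))  # same byte counts, all ordering destroyed
print(huffman_cost_bits(text), huffman_cost_bits(shuffled))  # identical costs
```

Sorting the bytes throws away all the `Th` -> `e` structure, yet the Huffman cost does not change, because the code only ever looked at the counts.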

Hence, after applying the Burrows-Wheeler transform you need some sort of higher-order transform (i.e. a transform that considers more than just individual bytes) that takes advantage of the changed symbol distribution in the transform's output. But we will get to that in a second.

The joke here is that the Burrows-Wheeler transform is closely related to suffix trees and suffix arrays, which are often used in bioinformatics and HPC for full-text search. If you want to find a pattern of length `p` in a text of length `n` and you already have a suffix tree of the text, the search is linear in the length /of the pattern/, i.e. O(p). The suffix tree stores all suffixes of a string in a compressed manner: it has linear space overhead, O(n), with a constant factor of approximately 20 as given by Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press. You can therefore search for a word by simply traversing from the root node to an internal or leaf node, following the sequence of bytes that make up the word.
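As a rough illustration of index-based search, here is a sketch that uses a suffix array instead of a suffix tree, since it fits in a few lines. That swap matters: binary search over a suffix array costs O(p log n) per query rather than the O(p) of a suffix tree, and the naive construction below is far slower than real suffix sorting. `build_suffix_array` and `find_occurrences` are names invented for this example.

```python
def build_suffix_array(text: bytes) -> list[int]:
    """Start offsets of all suffixes of text, in lexicographic order.

    Naive construction for illustration only; production code would use a
    dedicated suffix-sorting algorithm.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: bytes, sa: list[int], pattern: bytes) -> list[int]:
    """Return every offset where pattern occurs in text, in ascending order."""
    p = len(pattern)

    # Lower bound: first suffix whose first p bytes are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first p bytes are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes that start with pattern sit in one contiguous block.
    return sorted(sa[start:lo])


text = b"banana"
sa = build_suffix_array(text)
print(sa)                                  # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, b"ana"))  # [1, 3]
```

The key property is the same one the suffix tree relies on: every occurrence of the pattern is a prefix of some suffix, and those suffixes are stored next to each other.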

As such, a suffix tree (and equivalently a suffix array, and the BWT, which is trivially computed from a suffix array) forms something that can be thought of as a static PPM model. Notably, real-world implementations of PPM (e.g. PPMd) use suffix trees as part of their main storage data structure. What this all means is that, given a suffix tree, we can very cheaply produce the probability distribution for the next byte that follows a given fixed-order sequence of bytes. This is nice, because then e.g. an order-2 predictor would be able to tell that `Th` is followed by `e` once enough data has been gathered.
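To show what "the distribution for the next byte after a fixed-order context" looks like, here is a small sketch of a static order-k model. It uses a plain dictionary of counters rather than a suffix tree, which is an assumption made to keep the example short, and `build_order_k_model` and `next_byte_distribution` are hypothetical names.

```python
from collections import Counter, defaultdict


def build_order_k_model(data: bytes, k: int) -> dict[bytes, Counter]:
    """Count which byte follows each k-byte context in data."""
    model: dict[bytes, Counter] = defaultdict(Counter)
    for i in range(k, len(data)):
        context = data[i - k:i]
        model[context][data[i]] += 1
    return model


def next_byte_distribution(model: dict[bytes, Counter], context: bytes) -> dict[int, float]:
    """Empirical probability of each next byte after context (empty if unseen)."""
    counts = model.get(context)
    if not counts:
        return {}
    total = sum(counts.values())
    return {sym: count / total for sym, count in counts.items()}


data = b"The cat. The dog. Then the end."
model = build_order_k_model(data, k=2)
print(next_byte_distribution(model, b"Th"))  # {101: 1.0}, i.e. always 'e'
```

A real PPM implementation additionally handles unseen contexts (escape symbols, falling back to shorter orders) and updates its counts adaptively; none of that is modelled here.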

As you can probably guess, the more preceding bytes you know, the better your estimate of the most likely next byte will be. But the larger your context, the more expensive the searches and computations become, due to pointer chasing in the suffix tree.

So how do we remedy this? We notice that the Burrows-Wheeler transform essentially clusters similar contexts together, meaning that a low-order predictor (= faster, simpler) on the BWT output compresses as well as a high-order predictor (= slow, complicated) on the original data, at the cost of an extra transformation. This is viable because the Burrows-Wheeler transform can be computed quickly, and there have been recent advancements in running it on the GPU. So bzip3 uses BWT + a low-order predictor with an arithmetic coder to encode the bytes, which means it can make use of high-order statistics for compression and performs comparably at a faster speed.
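For readers who have not seen the transform itself, here is the textbook rotation-sorting version in Python. This is a sketch for illustration only and not how bzip3 computes it: it assumes a sentinel byte (0 by default) that does not occur in the input, and it is quadratic or worse in time and memory. The names `bwt`, `inverse_bwt` and `count_runs` are made up for the example.

```python
def bwt(data: bytes, sentinel: int = 0) -> bytes:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if sentinel in data:
        raise ValueError("sentinel byte must not occur in the input")
    s = data + bytes([sentinel])
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return bytes(rotation[-1] for rotation in rotations)


def inverse_bwt(last: bytes, sentinel: int = 0) -> bytes:
    """Naive inverse: rebuild the sorted rotation table one column at a time."""
    n = len(last)
    table = [b""] * n
    for _ in range(n):
        # Prepend the last column to every row, then re-sort.
        table = sorted(bytes([last[i]]) + table[i] for i in range(n))
    # The original text is the rotation that ends with the sentinel.
    row = next(r for r in table if r.endswith(bytes([sentinel])))
    return row[:-1]


def count_runs(data: bytes) -> int:
    """Number of maximal runs of identical bytes."""
    return sum(1 for i, b in enumerate(data) if i == 0 or b != data[i - 1])


transformed = bwt(b"banana")
print(transformed)               # b'annb\x00aa'
print(inverse_bwt(transformed))  # b'banana'
```

Each output byte is the one that precedes a sorted suffix, so bytes that occur before similar contexts end up next to each other (note the `nn` and `aa` pairs even in this tiny input). That grouping is what lets a low-order model on the output stand in for a high-order model on the input.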