3.2. General Aspects
Compressing a file goes through four main stages: preprocessing, model prediction, context mixing, and probability refining. An optional pre-training phase can be activated via command line parameters. The pipeline for image compression is shown schematically in Figure 1.
The preprocessing phase is itself split into three parts. First, the file to be compressed is searched for known stream types; based on these types, different models are activated for the second stage of compression. For example, it searches for image (1 bpp, 4 bpp, 8 bpp, 8 bpp grayscale, 24 bpp, 32 bpp, png 8 bpp, png grayscale 8 bpp, png 24 bpp, png 32 bpp), jpeg, gif, text, audio (8 and 16-bit mono and stereo), exe, base64, zlib streams, file containers, and others. After this stage, an optional transform phase is applied for certain stream types, such as text, where an end-of-line transform can be applied, or EXE, where certain instructions are replaced with others. The transform is then applied in reverse and, if the result matches the original stream, the transform is kept.
In the case of images, the preprocessing phase extracts the file header, which is compressed separately, and the byte stream containing the pixel values of the image. The width and the bit depth of the image are extracted from the header and the width is passed on to the image model selected by bit depth.
The model prediction and context mixing phases happen consecutively. Probabilities of individual bits from the input stream are predicted by many specialized models. All the probabilities are combined into one via the context mixing algorithm. The output probability is refined using a network of adaptive probability maps. The final prediction is used to encode the bit from the stream using a binary arithmetic coder. The algorithm is symmetrical, meaning that the coder and the decoder perform the same operations and end up with the same final probability. The decoder uses that probability to decode the bit from the compressed stream.
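To make the symmetry concrete, the following C++ sketch (with hypothetical class and function names, not the actual PAQ8PX interfaces) shows how compression and decompression drive the same predictor:

```cpp
// Hypothetical interfaces; the real PAQ8PX classes are more elaborate.
struct Predictor {
    int pr = 2048;                          // 12-bit probability, 2048 = 0.5
    int p() const { return pr; }
    void update(int bit) { pr += ((bit << 12) - pr) >> 5; } // simple adaptation
};
struct ArithmeticCoder {
    virtual void encode(int bit, int p12) = 0; // narrow the range using the prediction
    virtual int  decode(int p12) = 0;          // recover the bit using the same prediction
    virtual ~ArithmeticCoder() = default;
};

// Compression and decompression drive the identical model; only the coder
// call differs, so both sides stay bit-for-bit synchronized.
void compressBit(Predictor& m, ArithmeticCoder& c, int bit) {
    c.encode(bit, m.p());
    m.update(bit);
}
int decompressBit(Predictor& m, ArithmeticCoder& c) {
    int bit = c.decode(m.p());
    m.update(bit);
    return bit;
}
```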
3.3. Modeling
The term model is used with two meanings throughout the compressor. First, it denotes the unit of the algorithm that outputs a probability to participate in the mixing phase; one can interpret this as an “elementary” model. The second meaning is the collection of units that model a given type of data; the output of such a model is, evidently, a collection of probabilities. Example models are TextModel (for language-specific stemming and word modeling), MatchModel (for repeatable long matches of data), RecordModel (for data structured in records), SparseModel, JpegModel (for specific jpeg data), WavModel, ExeModel, DmcForest (a collection of dynamic Markov coding models), XmlModel, PpmModel (various-order prediction by partial matching), ImageModels (for image data of different bit depths), and many more. One or more models of this kind are selected according to the input stream type and compression parameters.
It is outside the scope of this paper to explain models unrelated to image compression; they are better suited to a paper on general-purpose compression.
3.4. Image Compression
In the case of image streams, the match model can be optionally activated and can bypass the image model if a long match is found. Our focus, however, is the 8 bpp image model. The output of this model contains predictions for four types of input streams: 8 bpp indexed color or grayscale, and 8 bpp png indexed or grayscale. If the stream is png, part of the filtering scheme used is undone in order to obtain the true pixel value.
Depending on the type of image, different correlations can be expected and, thus, exploited by specific modeling. Before describing the specific contexts, we should describe which types of operations are possible with them. Three major types of models can be identified: direct, indirect, and least squares modeling. All of these models expect byte-level context values (since data from a file comes in byte chunks) and can output direct probabilities, stretched probabilities, or both. The context mixing stage expects probabilities in the logistic domain (stretched probabilities), so different operations are applied to the probabilities to fit or skew them into this domain.
3.4.1. Direct Modeling
Direct modeling is implemented with the use of stationary context maps. This type of map takes as input a context value and outputs a weighted stretched probability and a weighted probability centered around zero (skewing). It is implemented using a direct lookup table where each entry stores a probability (which is then stretched and skewed) and a hit counter. In the update phase, an error is computed as the difference between the stored probability and the value of the bit. The error is weighted by a value dependent on the hit counter: fewer hits on the context value mean a more rapid update rate. This is implemented via a lookup table containing the values of an inverse linear function of the hit count.
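A simplified floating-point sketch of such a map follows (PAQ8PX itself works in fixed point, as discussed in Section 3.7, and also produces the skewed output, omitted here):

```cpp
#include <vector>
#include <cmath>
#include <algorithm>

// Simplified stationary context map: direct lookup from a context value
// to an adaptive probability with a hit-count-driven learning rate.
class StationaryMap {
    struct Cell { float p = 0.5f; unsigned hits = 0; };
    std::vector<Cell> table;
public:
    explicit StationaryMap(size_t contexts) : table(contexts) {}

    // stretch(p) = ln(p / (1 - p)); the mixer works in this logistic domain
    float stretchedPrediction(unsigned ctx) const {
        float p = table[ctx].p;
        return std::log(p / (1.0f - p));
    }

    void update(unsigned ctx, int bit) {
        Cell& c = table[ctx];
        // Fewer hits -> larger rate -> faster adaptation; PAQ8PX reads the
        // rate from a precomputed table of an inverse linear function.
        float rate = 1.0f / (c.hits + 1.5f);
        c.p += (bit - c.p) * rate;
        c.hits = std::min(c.hits + 1u, 1023u);
    }
};
```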
For each context that requires direct modeling, a new map must be created. This prevents the contexts from colliding with each other.
3.4.2. Indirect Modeling
Unlike direct modeling, which updates the probability based on the last probability predicted, indirect modeling tries to learn the answer based on a similar sequence from the past.
Indirect modeling is implemented with the use of indirect context maps, which use two-step mapping. An optional run context map is also included, which is used for modeling runs of bits.
The first mapping is between a context value and a bit history called a state. The state is modeled as an 8-bit value with the following meaning: a zero value means the context value was never seen before; states from 1 to 30 map all the possible 4-bit histories; the remaining states represent counts of zeroes and ones, or an approximation of their ratio once the number of previously seen bits exceeds 16. The states are used as indexes into a state table which contains transitions to the next state depending on the value of the next bit. The states were empirically chosen to model non-stationarity, and different state maps have been proposed in other compression programs [22].
The states are kept in a hash map implemented as a table with 64-byte entries to fit in a cache line. The entries contain checksums for the context value to prevent collisions and up to seven state values. Since the map expects byte data, at bits 0, 2, and 5 the bucket for a context value is recomputed via a dispersion function. The seven state values can hold information about no bits known (one value), one bit known (two values), and two bits known (four values). At bit zero, only three states are needed and, as an optimization, the next four bytes implement a run map that predicts the last byte seen in the same context, logarithmically weighted by the length of the run. The hash map implements a “least frequently used” eviction policy and a “priority eviction” based on the state of the first element in the bucket. States are indexed by the total number of bits seen, so entries holding more information are favored.
The next mapping is between the state and one or more probabilities. This is done in a similar manner to direct modeling, by using a state map. For each input, four probabilities are returned: one stretched, one skewed, and two depending on the counts of zeroes and ones for that state. The fifth probability out of the indirect context map comes from the run map.
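The following sketch illustrates the two-step mapping in a heavily simplified form; the stand-in state transition and learning rate are ours and do not reproduce the actual 256-state machine:

```cpp
#include <array>
#include <unordered_map>
#include <cstdint>

// Greatly simplified indirect model: context -> bit-history state ->
// adaptive probability. The real PAQ8PX state machine has 256 states
// encoding run lengths and zero/one counts; here a state is just a
// sliding window over the last eight bits.
class IndirectModelSketch {
    std::unordered_map<uint64_t, uint8_t> history; // context -> state
    std::array<float, 256> stateToP;               // state map
public:
    IndirectModelSketch() { stateToP.fill(0.5f); }

    float predict(uint64_t ctx) const {
        auto it = history.find(ctx);
        return it == history.end() ? 0.5f : stateToP[it->second];
    }

    void update(uint64_t ctx, int bit) {
        uint8_t& s = history[ctx];
        stateToP[s] += (bit - stateToP[s]) * 0.02f; // adapt the state's probability
        s = (uint8_t)((s << 1) | bit);              // stand-in transition: slide the bit in
    }
};
```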
Unlike stationary maps, indirect maps can host several contexts, meaning that the contexts share the same memory space and are identified by an index. Each context has its own state map, accessed by that index. Modeling states as 8-bit values makes them more memory-efficient than the 32-bit representation used in stationary maps.
3.4.3. Least Squares Modeling
Ordinary least squares modeling is used to predict the value of the next pixel (not bit prediction) based on a given set of context values and acts as a maximum likelihood estimator. The prediction is a linear combination of the regressors, which are the explanatory variables. The update phase tries to minimize the sum of squared differences between the true pixel value and the predicted value. The weight vector is found online via the method of normal equations, using a Cholesky decomposition that factors the (symmetric) normal-equations matrix into an n by n lower triangular matrix, where n is the number of regressors. The factor is then used to solve analytically for the weight values. The bias vector and the covariance matrix are updated using parametrized momentum.
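A compact sketch of the core computation follows; it accumulates the normal equations with exponential forgetting and solves them through a Cholesky factorization. The forgetting factor and regularization floor are illustrative, and the momentum-based updates of PAQ8PX are omitted:

```cpp
#include <vector>
#include <cmath>

// Online ordinary least squares: accumulate the normal equations
// A = sum(x x^T), b = sum(x y) with exponential forgetting, then solve
// A w = b through a Cholesky factorization A = L L^T.
class OLSPredictor {
    int n;
    double lambda;                       // forgetting factor
    std::vector<double> A, b, L, w;      // n*n, n, n*n, n
public:
    OLSPredictor(int regressors, double forget = 0.998)
        : n(regressors), lambda(forget),
          A(n * n, 0.0), b(n, 0.0), L(n * n, 0.0), w(n, 0.0) {}

    double predict(const std::vector<double>& x) const {
        double y = 0.0;
        for (int i = 0; i < n; i++) y += w[i] * x[i];
        return y;
    }

    void update(const std::vector<double>& x, double y) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) A[i * n + j] = lambda * A[i * n + j] + x[i] * x[j];
            b[i] = lambda * b[i] + x[i] * y;
        }
        // Cholesky: A = L L^T, L lower triangular
        for (int i = 0; i < n; i++) {
            for (int j = 0; j <= i; j++) {
                double s = A[i * n + j];
                for (int k = 0; k < j; k++) s -= L[i * n + k] * L[j * n + k];
                if (i == j) L[i * n + i] = std::sqrt(s > 1e-9 ? s : 1e-9);
                else        L[i * n + j] = s / L[j * n + j];
            }
        }
        // Forward substitution L z = b, then back substitution L^T w = z
        std::vector<double> z(n);
        for (int i = 0; i < n; i++) {
            double s = b[i];
            for (int k = 0; k < i; k++) s -= L[i * n + k] * z[k];
            z[i] = s / L[i * n + i];
        }
        for (int i = n - 1; i >= 0; i--) {
            double s = z[i];
            for (int k = i + 1; k < n; k++) s -= L[k * n + i] * w[k];
            w[i] = s / L[i * n + i];
        }
    }
};
```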
The value of the prediction is not used directly, but is used in combination with the known bits of the current byte and the bit position in the byte as a key into a stationary context map.
3.4.4. Correlations
Different types of correlations are exploited for each type of image supported, since we have varying expectations of what the byte values from the input stream represent in the image. It is difficult to describe all the operations used, so only a minimal description is provided. This section does not cover png modeling.
The neighboring pixels are the best estimators for searching correlations; they form the causal pixel neighborhood. Various notations are used for representing the positions of the pixels. A simple and meaningful representation is obtained by using the cardinal points on a compass (see Figure 2).
Each time a cardinal point is mentioned, a step of the size of one pixel is taken in that direction relative to the pixel being predicted.
Palette color-indexed images, as the name suggests, use the byte value to index the true RGB color in the palette table. This means that the direct values cannot be used with linear predictors, because a linear combination will also be an index and might end up suggesting a completely different color. Another problem is that quantizing the values will also produce different indexes that do not match the expected texture in the image. Moreover, since we know that we have 8-bit indexes, we expect that only a small portion of the entire color space is used. This makes indirect context maps useful, and context values can be computed, for example, by hashing the W, N, and NW values together.
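For illustration, such a context key could be computed along these lines (the hash constants and mixing steps are ours, not the exact PAQ8PX ones):

```cpp
#include <cstdint>

// Illustrative context hash over the W, N, NW neighbor indexes; the actual
// PAQ8PX hash function and mixing constants differ.
inline uint64_t hashContext(uint8_t W, uint8_t N, uint8_t NW) {
    uint64_t h = 0x9E3779B97F4A7C15ull;       // golden-ratio seed
    h = (h ^ W)  * 0xFF51AFD7ED558CCDull;
    h = (h ^ N)  * 0xC4CEB9FE1A85EC53ull;
    h = (h ^ NW) * 0xFF51AFD7ED558CCDull;
    return h ^ (h >> 33);                     // final avalanche
}
```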
Grayscale images or individual color planes in color images require different modeling that is dependent on what the content of the image represents. If the source of the image is artificial (meaning computer generated, renders, drawings or screenshots), hard edges and continuous tone regions may be expected. Photographic images may present noise, which makes the process of prediction more cumbersome.
Of course, like for palette images, texture tracking via indirect context maps is useful. Contexts can now also be computed by quantizing the values, or by computing intensity magnitude levels using logarithms of direct values or of the difference or quotient of two values.
Additionally, modeling of the expected pixel value is needed. The results are used as keys into stationary maps. Various prediction techniques work along many directions, including horizontal, vertical, and diagonal.
Inspired by video compression schemes, half-pixel, quarter-pixel, and n-th pixel interpolation and extrapolation provide predictions that can be combined with other predictions by averaging gradients and other interpolation techniques.
Linear pixel value combinations are used, such as averages or gradients. For example, if the pixel above (N) has value 50 and the one above it (NN) has value 60, a gradient of the form N*2 - NN will output 40, while an average of the form (N + NN)/2 will output 55. Another type of combination is a Lagrange polynomial used for extrapolation, such as N*3 - NN*3 + NNN. Extrapolated values from different directions are then combined linearly into new predictions. The result of a prediction can be negative or above the maximum value of 255, and, therefore, two functions are applied to the result. The clip function restricts the value to the [0, 255] interval. The clamp function is similar to the strategy employed by the LOCO predictor, keeping the prediction in the same plane as the neighboring pixel values, which are also passed as parameters to the function.
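The two functions can be sketched as follows; the function names and the clamping-to-neighbor-range interpretation are ours:

```cpp
#include <algorithm>
#include <initializer_list>

// Restrict a linear prediction to the valid sample range.
inline int clip(int v) { return std::min(std::max(v, 0), 255); }

// Keep a prediction inside the range spanned by its causal neighbors,
// in the spirit of the LOCO-I (JPEG-LS) predictor.
inline int clamp(int prediction, std::initializer_list<int> neighbors) {
    int lo = 255, hi = 0;
    for (int p : neighbors) { lo = std::min(lo, p); hi = std::max(hi, p); }
    return std::min(std::max(prediction, lo), hi);
}

// Example: extrapolate along the vertical direction and constrain.
// N = 50, NN = 60 -> gradient N*2 - NN = 40, clamped to [50, 60] -> 50.
inline int verticalGradient(int N, int NN) {
    return clamp(clip(N * 2 - NN), {N, NN});
}
```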
Color images exploit the same correlations as grayscale images, but add modeling for the spectral correlation of the color planes. This means that an increased gradient in the red color plane may also mean increased gradients in the other planes. The magnitude of the change in a previous plane can be used to make predictions in the current plane, or a prediction in the current plane can be refined based on the residual of the prediction in the previous plane.
3.4.5. Grayscale 8 bpp
In the analyzed version of PAQ8PX, 62 stationary maps are used for grayscale images. Five of them are used in conjunction with OLS modeling, in order to model quadrants of the causal pixel neighborhood of different lengths. The others accept as keys various clipped and clamped predictions. An indirect context map is also used, which accepts 27 keys computed as hashed predictions. This means that the estimated number of probabilities output by the image model for grayscale images is 62 * 2 + 27 * 5 = 259.
3.5. Context Mixing
Encoding a bit needs only one probability and the bit itself. Modeling produces many probabilities that need to be combined into a final probability. One option would be a linear combination of the probabilities, adjusting the weights after the true value becomes available.
The solution in the PAQ8 family of compressors is to use a gated linear network (GLN); context mixing is one implementation of such a network. GLNs are described in detail in [23], which also includes the mathematical proof of the convergence guarantee. The description of the network is split into three parts: geometric mixing, gated geometric mixing, and gated linear networks.
Geometric mixing is an adaptive online ensemble that has been analyzed in depth; its properties are described in [24,25,26]. The main difference from linear mixing, which weights the probabilities directly, is that the probabilities are first transformed into the logistic domain using the logit function (sometimes referred to as stretch in this paper).
The stretched probabilities are then linearly combined, and the result is transformed back into a probability using a sigmoid function (sometimes referred to as squash in this paper).
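In floating point, the two transforms are simply the following (PAQ8PX replaces them with fixed-point lookup tables, see Section 3.7):

```cpp
#include <cmath>

// logit: map a probability in (0, 1) into the logistic domain
inline float stretch(float p) { return std::log(p / (1.0f - p)); }

// sigmoid: map a logistic-domain value back to a probability
inline float squash(float x)  { return 1.0f / (1.0f + std::exp(-x)); }
```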
The weights are updated using online gradient descent together with a logarithmic loss. In this way, a weighted redundancy minimization in the coding space can be achieved (minimized Kullback–Leibler divergence) [25].
An advantage of this method over regular probability weighting is that the weights do not need to be normalized or clipped to the positive domain.
Gated geometric mixing means adding a context selector. So far, we have a neuron that takes stretched probabilities as input and has weights associated with the input. If, instead, we had a set of weight vectors from which one is selected based on an index, we would create a gate. The index can be computed as a function of a context or of additional information. We can now say that the neuron has specialized weights.
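A single such neuron can be sketched as follows (floating point for clarity; the learning rate is illustrative):

```cpp
#include <vector>
#include <cmath>

// One gated geometric mixing neuron: a gating context selects a weight
// set, the stretched inputs are combined linearly, and the result is
// squashed back into a probability.
class GatedNeuron {
    std::vector<std::vector<float>> w;   // one weight set per context
    float lr;                            // learning rate
public:
    GatedNeuron(int contexts, int inputs, float rate = 0.02f)
        : w(contexts, std::vector<float>(inputs, 0.0f)), lr(rate) {}

    float predict(int ctx, const std::vector<float>& stretched) const {
        float dot = 0.0f;
        for (size_t i = 0; i < stretched.size(); i++) dot += w[ctx][i] * stretched[i];
        return 1.0f / (1.0f + std::exp(-dot));   // squash
    }

    // Online gradient descent on the logarithmic loss: the gradient of
    // -log P(bit) with respect to each weight is -(bit - p) * x_i.
    void update(int ctx, const std::vector<float>& stretched, float p, int bit) {
        float err = bit - p;
        for (size_t i = 0; i < stretched.size(); i++) w[ctx][i] += lr * err * stretched[i];
    }
};
```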
Gated linear networks are networks of stacked gated geometric mixing layers. The output of a gated geometric mixing neuron is a probability. A set of neurons that work on the same input forms a layer, and the set of outputs of one layer forms the input of the next. A final probability is obtained when a layer contains only one neuron. At first glance, the network looks similar to a multi-layer perceptron but, in this case, learning is not done via backpropagation. Instead, each neuron output tries to approximate the final probability and, since each layer builds on the output of the previous one, it further improves the result.
Some important considerations follow. The loss function is convex, which implies simplified training of a deep network. The network rapidly adapts to the input, making it a perfect candidate for online learning. Weights can be initialized in several ways, and random assignment is not necessary because of the convexity of the loss function. The PAQ8 compressors initialize all the weights to zero, implying that no predicting model has any importance in the beginning and allowing a rapid update towards selecting the best specialist. Weight clipping and regularization techniques are also presented in [23], but are not used in PAQ.
PAQ8PX uses a network with two layers for image compression. The first layer has seven neurons and uses functions of the immediate pixels and column information as contexts, meaning that the pixel position in the image is taken into account.
3.6. Adaptive Probability Maps
Adaptive probability maps (APM), sometimes referred to as secondary symbol estimation, take a probability and a context value as inputs and output a probability. The context value serves as an index into a set of transfer functions. Once a function is selected, a set of interpolation points is available; in the initial state, they should map the input probability to the same value. The input probability is quantized between two points of the set, and the output value is the linear interpolation of the values of those points, weighted by the distance from them. In the update phase, the two bracketing values are updated so that the output probability moves closer to the value of the predicted bit.
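A simplified sketch of the interpolation and update steps follows; the number of interpolation points and the update rate are illustrative:

```cpp
#include <vector>

// Simplified adaptive probability map: per-context piecewise-linear
// transfer function over a set of interpolation points, refined online.
class APM {
    int points;
    std::vector<float> t;                // contexts * points
public:
    APM(int contexts, int n = 33) : points(n), t(contexts * n) {
        for (int c = 0; c < contexts; c++)
            for (int i = 0; i < n; i++)
                t[c * n + i] = (float)i / (n - 1);  // identity mapping at start
    }

    // p in [0, 1]; returns the refined probability and remembers the segment.
    float refine(int ctx, float p, int& lo, float& frac) {
        float pos = p * (points - 1);
        lo = (int)pos; if (lo >= points - 1) lo = points - 2;
        frac = pos - lo;
        float* seg = &t[ctx * points + lo];
        return seg[0] * (1 - frac) + seg[1] * frac;
    }

    // Move the two bracketing points toward the observed bit.
    void update(int ctx, int lo, float frac, int bit, float rate = 0.02f) {
        float* seg = &t[ctx * points + lo];
        seg[0] += (bit - seg[0]) * rate * (1 - frac);
        seg[1] += (bit - seg[1]) * rate * frac;
    }
};
```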
There are variations of the APM. One of them, used in PAQ8, takes a stretched probability as input, with the benefit of having more interpolation points towards the zero and one probabilities, where compression benefits from fine-tuning. Other compression programs use APMs with two quantized predictions as inputs and a 2D interpolation plane.
It is not necessary to use a single APM, since they can be connected in a network. PAQ8PX uses different architectures based on the type of stream detected. For 8 bpp grayscale images, three APMs are used. Two of them take the output of the context mixing phase as input and use functions as contexts, including the currently known bits of the byte, the number of mispredictions in the past, and whether the prediction falls in a neighborhood plane or not. The output of the first APM is refined again, and the final prediction is a fixed-weight linear combination of the three probabilities.
3.7. Other Considerations
Predictions need to be perfectly identical when compressing and decompressing because, otherwise, the decoder will rely on false data. Floating point operations cannot guarantee this hard constraint across compilers, processors, and operating systems. Having fixed rules became even more important when support for streaming instructions like SSE and AVX was added. It was decided that fixed-point arithmetic would be used across all operations. Even setting initial values for lookup tables such as stretch, squash, or logarithm was done by interpolation of initial integer values or by numerical integration. The components described here use fixed-point values with varying point positions; the representation can be 16-bit or 32-bit integers. For example, representing the weights of the context mixing algorithm in 16 bits is useful when using vector instructions, since more values fit into the operands. Some exceptions to this rule were made for the sake of maximum compression in the wav model and the ordinary least squares algorithm used in image compression.
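As a small example of the style, the common adaptive probability update can be written entirely in integer arithmetic (a 12-bit representation is assumed here; PAQ8PX uses several widths):

```cpp
#include <cstdint>

// 12-bit fixed-point probability: 0 .. 4095 represents [0, 1).
// The update p += (target - p) / 2^rate needs no floating point,
// so it is bit-exact across compilers, processors, and platforms.
inline void updateProbability(uint16_t& p, int bit, int rate = 5) {
    int target = bit << 12;                  // 0 or 4096
    p = (uint16_t)(p + ((target - p) >> rate));
}
```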
Another unintuitive aspect is that the update step of each model takes place right before the prediction step. The first prediction is 0.5 by default, since it relies on no information. Afterward, each time the predictor is queried, it first updates with the known bit and then computes a prediction. This is done as an optimization: the memory locations accessed during the update might still be loaded in the cache, and the prediction might need the same locations.
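The pattern can be sketched as follows (hypothetical interface):

```cpp
// Update-then-predict: a single call per bit keeps the freshly touched
// memory in cache for the prediction that immediately follows.
struct SharedState { int y; /* last coded bit, plus other shared context */ };

class ContextModel {
public:
    // Called once per bit: first learn from the previous bit, then
    // predict the next one using the same, still-cached, structures.
    int mix(SharedState& s) {
        update(s.y);
        return predict();
    }
private:
    void update(int bit) { /* adjust counters/weights with the known bit */ }
    int  predict()       { return 2048; /* 12-bit probability, 0.5 default */ }
};
```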