Extended Multi WLS Method for Lossless Image Coding

In this paper, a lossless image coding method is presented that is, from the data compaction point of view, the most efficient reported to date. Although computationally complex, the algorithm is still more time efficient than its main competitors. The presented cascaded method is based on the Weighted Least Squares (WLS) technique, with many improvements introduced: its main stage is followed by a two-step NLMS predictor and ends with Context-Dependent Constant Component Removal. The prediction error is coded by a highly efficient binary context arithmetic coder. The performance of the new algorithm is compared to that of other coders on a set of widely used benchmark images.


Introduction
Image compression methods can be divided into lossy and lossless. Important applications of lossless image and video compression techniques include archiving of 2D, 3D, and 4D (3D video sequences) medical images [1][2][3][4][5], as well as astronomical and satellite ones [6,7]. Lossless mode is often required at some stage of the graphic processing of photographs, advertising materials, TV productions, and films (post-production [8]), etc. In such cases, lossy versions of compression methods, such as JPEG and JPEG2000 (for static images) [9], or MPEG2 and MPEG4 [10] (for video sequences), cannot be used. Although these standards have appropriate lossless modes, they are not particularly efficient.
Samples of multimedia signals are usually strongly correlated, which means that simple entropy coders do not compress them effectively. Correlation can be significantly reduced by signal prediction, but then another problem emerges: changes in signal statistical properties. In the 1990s, several solutions were proposed that used linear and non-linear prediction for lossless image compression. The first image coder that solved the above problems relatively well was the context coder CALIC [11]. At that time, it was considered too complex for practical purposes; nevertheless, the standardized JPEG-LS was a context coder too [12]. Then followed more efficient but also more complex techniques: TMW Lego [13], Multi-WLS [14], and MRP 0.5 [15]. The best more recent algorithms are Blend-20 [16] and the improved versions of MRP 0.5: GPR-BP [17] and MRP-SSP [18]. Further analysis of currently existing solutions is presented in Section 1.2.
In this paper, the currently most efficient lossless image coding algorithm, extended multi weighted least squares (EM-WLS; Section 3.2), is presented, together with its simplified version, locally adaptive ordinary least squares (LA-OLS; Section 3.1). The algorithms share the cascade structure proposed in Section 2; their first stage is either an EM-WLS or an LA-OLS predictor. Section 6.4 shows that the new method is much better than any older one, including the newest GPR-BP and MRP-SSP. EM-WLS is an extended version of the WLS technique, which is also the basis of the very good multi-WLS algorithm [14].
The first version of AVE-WLS was introduced in [19]; the current version, using the cascade prediction system, has been significantly improved (Section 2). Several new elements are added, i.e., additional NLMS stages (Section 3.3), an enhanced cumulated prediction error cancellation stage (Section 4), and a new arithmetic coder (Section 5).

Basics of Prediction Coding
Optimization of data encoding consists of minimizing the average number of bits per single symbol generated by a source S (in the case of lossless image compression, symbols are pixel color values; in this paper, we work with 8-bit luminance).
Sources can be divided into those without memory (discrete memoryless source (DMS)) and sources with memory (conditional source model (CSM)) [20]. Considering this classification from the Markov model point of view, in the first case, the lower limit on the bit average is the unconditional entropy H(S_DMS), also named zero-order entropy. In the second case, we deal with the conditional entropy of the kth order, defined as H(S|C^(k)), where, in the case of images, the context C^(k) may be defined, e.g., by k neighboring pixels (Figure 1).
In general, for the source entropy H(S), the following relation holds: H(S) ≤ H(S|C^(k)) ≤ H(S_DMS). Since image samples take on many values (at least 256), it is impractical to determine and apply a Markov model for them. It is usually assumed that removal of inter-dependencies between pixels is possible by using predictive techniques; prediction errors are then coded. A linear predictor of order r is used to estimate the value of a sample:

x̂(0) = Σ_{i=1}^{r} w(i) · P(i),   (1)

where P(i) is a pixel from the neighborhood of the currently coded pixel x(0) = P(0) (Figure 1), and w(i) is a predictor coefficient from the vector w = [w(1), w(2), ..., w(r)] [21]. To simplify notation in this work, indices of the currently coded pixel are usually omitted, while indices i of pixels P(i) and prediction errors e(i) show their distances from the processed element P(0) or e(0) (Figure 1). The closest pixels provide the most information about the coded one, so the neighboring pixels are ordered in accordance with their Euclidean distances from P(0) (Figure 1). The numbering of pixels with the same distance from P(0) is usually determined clockwise (this is discussed in more detail in Section 3.2.3). The prediction model can be a linear model of the kth order or a more complex nonlinear solution, but we reduce it to a predictor of one value x̂(0), which is then subtracted from the current P(0); see Formula (2). In this way, we create a first-order Markov model that is not explicitly defined. As a result, a sequence of prediction errors can be expected to form a source for which the first-order entropy value is noticeably smaller than the zero-order value. The estimated pixel value (expected value rounded to the closest integer) is subtracted from the real pixel value:

e(0) = x(0) − round(x̂(0)),   (2)

and the difference (prediction error e(0)) is coded. In this way, we obtain a differential image in which the probability distribution of the samples is close to the Laplace distribution [21,22].
This allows efficient coding of prediction errors using one of the static or adaptive entropy methods, among which arithmetic coding is most effective (Section 5).
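The prediction and residual computation of Formulas (1) and (2) can be sketched as follows; the neighbor values and the uniform averaging weights below are illustrative only:

```python
def prediction_error(x0, neighbors, weights):
    """x_hat(0) = sum_i w(i) * P(i)  (Formula (1));
    e(0) = x(0) - round(x_hat(0))    (Formula (2))."""
    x_hat = sum(w * p for w, p in zip(weights, neighbors))
    # Python's round() rounds halves to even: round(100.5) == 100
    return x0 - int(round(x_hat))

# toy example: four causal neighbors, uniform averaging weights (illustrative)
e = prediction_error(101, [100, 102, 101, 99], [0.25] * 4)
```

On a flat image area, the residual stays small, which is exactly what makes the differential image cheaper to entropy-code.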
Notably, it is extremely difficult to determine the image entropy H(S) because it is hard to define a Markov model that considers all interdependencies between pixels. Namely, the dependencies may extend quite far. An example is the use of linear prediction of order r_2 = 106 in the second stage of the cascade structure proposed here (Section 3.3). The number of Markov model states grows exponentially with its order; here, if we assume that symbols are 8-bit pixel values, the number is at least 256^106. Therefore, only the bit average, i.e., the average number of bits necessary for coding a pixel, is used as the basic benchmark for testing lossless codecs (Section 6).

Adaptive Predictive Techniques
The effectiveness of image compression depends on the correct choice of one or more prediction models. Linear prediction is most often used (Formula (1)), where a vector w containing r predictor coefficients w(i) is usually determined by minimizing the mean square error (MMSE). This vector can be determined once for the whole image by a forward adaptation method or individually for each coded pixel using a backward adaptation technique. The advantage of backward adaptation is the possibility of using relatively high prediction orders (there is no need to provide prediction coefficients to the decoder), which allows for high compression efficiency. However, the disadvantage of this solution is the necessity of updating prediction coefficients in both the encoder and decoder for each pixel, which is a time-symmetric approach.
In methods with forward adaptation, it is reasonable to divide images into blocks (e.g., 8 × 8 or 16 × 16 pixels) and define an individual predictive model for each. One of the first solutions of this type was the method presented by Memon [23], where the one of eight fixed models producing the smallest absolute error was assigned to each block of 8 × 8 pixels. As such, the header information associated with a block required only three bits, forming the model number.
Subsequent methods used the mean square error for determining the best prediction coefficients. Unfortunately, this was associated with very large header information, as predictor coefficients required large numbers of bits. To reduce the size of the header, the blocks with similar characteristics were grouped into clusters, and all were associated with a common prediction model [24]. Through using vector quantization techniques (as well as fuzzy clustering [25]), optimized sets of, e.g., 16 prediction models were created, so that even for a large prediction order, the header size did not significantly increase the bit average. Matsuda et al. [15] used a technique of combining adjacent blocks belonging to the same category into groups (associated with the same predictor) to create larger blocks. Next, a map of the blocks with variable sizes was saved using an effective technique of encoding quadtrees.
For the above-mentioned block techniques, it is easy to show that lower values of first-order entropy can be reached than with the MMSE method [15]. In [26], it was proposed to replace the MMSE criterion with the minimum mean absolute error (MMAE) criterion, which improved the results for a block method. The problem of the discrepancy between the MMSE and first-order entropy criteria was discussed in [27].
In the case of backward adaptation in local training windows, the MMSE criterion is closer to the optimal one than in forward adaptation approaches. An analysis of this issue is presented in Section 6.1. This type of solution is used in the lossless image codec proposed in this work.

Cascade Prediction Model
Considering the features of adaptive methods described in Section 1.2, to achieve the highest possible compression efficiency, we decided to use a backward adaptation approach and developed a highly effective cascade prediction model, whose final high-efficiency form is discussed in this section. A similar approach was used earlier in lossless audio compression solutions [28]. The cascade is calculated as follows. In the main stage, image pixels are processed by the predictor:

x̂_MP(0) = Σ_{j=1}^{r_1} w_MP(j) · P(j),   (3)

where the coefficients of the predictor form a vector w_MP of order r_1. They are computed using the OLS or WLS methods described in Sections 3.1 and 3.2, respectively. In the following stages, prediction errors from the previous stages are transformed, each stage estimating the error left by the preceding one. The final prediction error is obtained at the output of the cascade structure shown in Figure 2. It is given by:

e(0) = x(0) − round(y_1(0) + y_2(0) + y_3(0) + C_mix),   (7)

where y_1(0) and y_2/3(0) are signal estimates provided by the Main Predictor (the locally adaptive Ordinary Least-Squares method, LA-OLS, Section 3.1, or the extended multi WLS method, EM-WLS, Section 3.2), and then by the Normalized Least Mean Square predictors (NLMS+, Section 3.3), respectively. Finally, C_mix is the cumulated prediction error correction defined in Section 4 (Context-Dependent Constant Component Removal (CDCCR)). The last two blocks (Golomb code and Context Adaptive Binary Arithmetic Coder (CABAC)) are used for effective compression of prediction errors. Details are provided in Section 5. The proposed solution can be considered a model offering the highest efficiency. To reduce implementation complexity (Section 6.4), individual blocks can easily be modified (e.g., by reducing prediction orders in the first 3 blocks) or even removed. For example, block Predictor 1 may contain one of two versions of the Main Predictor: EM-WLS or the faster LA-OLS. The blocks are described in the following sections, and the effects of block removal are discussed in more detail in Section 6.
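The stage-by-stage flow of the cascade can be illustrated with a minimal sketch. The ConstantStage class with its running-mean predictor is invented for illustration only; the real stages are the OLS/WLS and NLMS predictors described in Section 3:

```python
class ConstantStage:
    """Toy refinement stage that predicts the running mean of the residuals it
    has seen; invented for illustration (the real stages are the OLS/WLS and
    NLMS predictors of Section 3)."""
    def __init__(self):
        self.total, self.count = 0.0, 0
    def predict(self):
        return self.total / self.count if self.count else 0.0
    def update(self, residual):
        self.total += residual
        self.count += 1

def cascade_error(x0, main_estimate, stages):
    """Final error = x(0) minus the main estimate minus each stage's correction;
    encoder and decoder perform identical backward-adaptive updates."""
    e = x0 - main_estimate
    for stage in stages:
        correction = stage.predict()
        stage.update(e)          # learn from the residual entering this stage
        e = e - correction
    return e

stages = [ConstantStage()]
errors = [cascade_error(105, 100, stages) for _ in range(3)]  # constant bias of 5
```

After one pixel, the toy stage has learned the constant bias and the final error drops to zero, mimicking how later cascade stages soak up structure the main predictor misses.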

Stages of Adaptive Predictive Cascade
In this section, we present the blocks in the first three stages of the cascade structure from Figure 2. In particular, two versions of the Main Predictor are defined: complex but more efficient EM-WLS and simplified LA-OLS.

Locally Adaptive OLS Method
In Section 1.2, we list examples of codecs using predictive methods with forward adaptation. Among them, the most effective are those that adapt the prediction model to local features of an image, e.g., for 8 × 8 or 16 × 16 pixel blocks. In the case of backward adaptation, similarly sized windows Q are used. The main difference is that the currently encoded pixel P(0) does not belong to the window Q.
In the OLS method, prediction coefficients are calculated for each coded pixel by minimizing the mean square prediction error in a certain limited area Q. The vector of OLS predictor coefficients w_MP (Main Predictor) is computed from the following formula [29]:

w_MP = R^(−1) · p,   (8)

where R is the experimental signal autocovariance matrix:

R = Σ_{(y,x)∈Q} Ψ_(y,x) · P_(y,x) · P_(y,x)^T,   (9)

and the vector p is:

p = Σ_{(y,x)∈Q} Ψ_(y,x) · P_(y,x) · x_(y,x),   (10)

where Ψ is a weighting function, which takes a constant value of 1 for classic OLS; the pixel vectors P are taken from a training window Q around the coded pixel located at position (y, x) (Figure 3). The window consists of W image sub-rows of size 2W + 1 preceding the estimated pixel and W pixels preceding it in the current row (Q is an area extending W pixels to the left and up from the coded pixel P(0), and to the right in the rows preceding it). The best results were obtained for W = 10, r = 18 (Section 6.2). The default vector w_MP = [0.620, 0.625, −0.125, 0.125, −0.125, −0.125] is used in border areas when R is ill-conditioned. With the Q training window defined in this way, and coding of successive pixels row by row using a fast sliding window procedure, adaptation of the matrix R and vector p is performed relatively fast in comparison to the WLS method discussed in Section 3.2. It involves deleting the data obtained from the extreme left column of the Q area and including the new extreme right column of the training window [30].
In this paper, an improved version of this scheme is analyzed. Firstly, a more robust version of Formula (8) is used [31,32]:

w_MP = (R + u_bias · I)^(−1) · p,   (11)

where u_bias = 800 (the term u_bias · I guarantees the non-singularity of (9)). Secondly, the predictor coefficients obtained from (11) are weighted by coefficients (12) based on d_j = 1/√((∆x_j)² + (∆y_j)²), the inverse Euclidean distance of pixel P(j) from the currently coded pixel.
The new coefficients substitute w(j) in (3), which is a novel idea introduced in this algorithm. Another improvement proposed in this work is the introduction of a weighting function (13) promoting pixels for which the prediction errors at their positions in window Q are small. In this case, u_bias = 100. As a result, the weighted sum Σ_{(y,x)∈Q} Ψ_(y,x) · e²_(y,x)(0) of squared errors is minimized, similarly to the WLS method described in Section 3.2.
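A minimal sketch of the ridge-regularized coefficient computation of Formula (11), assuming the window data are arranged as a design matrix; the function name and the normalization of the distance reweighting are illustrative, not the paper's exact definitions:

```python
import numpy as np

def laols_coefficients(X, y, d, u_bias=800.0):
    """Sketch of the LA-OLS coefficient computation under stated assumptions.

    X : (n, r) matrix; row k holds the r causal neighbors of training pixel k
        from the window Q
    y : (n,) actual values of those training pixels
    d : (r,) inverse Euclidean distances of the r neighbors from P(0)
    Solves w = (R + u_bias * I)^(-1) p with R = X^T X and p = X^T y; the
    subsequent distance reweighting and its normalization are illustrative.
    """
    r = X.shape[1]
    R = X.T @ X + u_bias * np.eye(r)     # u_bias * I guarantees invertibility
    p = X.T @ y
    w = np.linalg.solve(R, p)
    w_weighted = w * d / d.mean()        # emphasize coefficients of nearby pixels
    return w, w_weighted

# synthetic check: with u_bias = 0 and uniform distances, plain OLS is recovered
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * 50.0
true_w = np.array([0.5, 0.3, 0.2])
w, w_weighted = laols_coefficients(X, X @ true_w, np.ones(3), u_bias=0.0)
```

With the paper's u_bias = 800 the solution is shrunk slightly toward zero, which stabilizes ill-conditioned windows at the cost of a small bias.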

Multi WLS Method
The use of a weighting function more complex than Formula (13) requires a significant increase in implementation complexity at the stage of determining the matrix R and vector p. EM-WLS is based on the WLS concept, so this section begins with its description.

Weighted Least-Squares (WLS)
In general, the vector of WLS predictor coefficients w_MP is computed from Formulas (9)-(11) [21]. The weighting function Ψ reflects the similarity between the neighborhoods of size m around the coded pixel P(0) and some other pixel P_off(0) [14,19] (Figure 4). The initial form of the function was relatively simple [14].
Figure 4. Neighborhoods of pixels P(0) and P_off(0) for m = 5 and W = 3.

AVE-WLS Method
The previous paper on AVE-WLS [19] followed two suggestions from [33,34] concerning the optimal form of LS predictors, and so does this one. The first was the statement that for each coded pixel, an optimal predictor order exists [33]. It is not known a priori, so in [19], instead of searching for it, we proposed computing the averaged value of the WLS predictor vectors w_MP(j) for orders from j = r_min to j = r_max:

w_AVE-MP = (1 / (r_max − r_min + 1)) · Σ_{j=r_min}^{r_max} w_MP(j).   (15)

The predictor orders implemented here range from r_min = 4 to r_max = 24; additionally, W = 14 (Figure 3). Vectors w_MP(j) shorter than r_max are extended to r_max by zero-padding.
In this paper, a complex weighting function is proposed. Equation (16) is an enhanced version of the formula from [19]; it contains the inverse of the Euclidean distance between pixels P(k) and P(0), the neighborhood size m = r_max (Formula (15)), and γ = 1.18. The expression appearing as a power of the number 0.8 determines the Euclidean distance between pixels P(0) and P_off(0). The scaling factor α depends on two threshold values calculated by Algorithm 1. The second suggestion concerning LS predictor optimization [34] was the proposition to use not all, but only the r_max most similar (correlated) pixels taken from a range of r_ext pixels, r_ext > r_max, in the basic prediction Formula (1). In this paper, we propose setting r_max = 24 and r_ext = 48, but similarity is tested using only 10 pixels with indices from 15 to 48 located in Q (Figure 4). The similarity measure here (Formula (17)) is the smallest cumulative absolute distance of pixels from the Q region. Apart from the chosen 10 pixels, the remaining ones are those closest to the coded one. The minimization rule (17) does not apply to the 14 nearest pixels; they are included by default. In this way, a vector Z = {z(1), z(2), ..., z(r_max)} is calculated from r_max = 24 pixels P(i) from the neighborhood of size r_ext = 48, and Formula (3) takes the form:

x̂(0) = Σ_{i=1}^{r_max} w(i) · z(i).   (18)

Finally, u_bias is also optimized in this paper using ridge regression [35]. Firstly, the initial vector w_AVE-MP is calculated for u_bias = 0; then a correction term (19) is computed, where e(0) is defined in (2) and w_AVE-MP(i) are the coefficients of the above-mentioned initial vector w_AVE-MP. The constant c can be evaluated as 4 · r_max / W² ≈ 0.5.
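The order-averaging of Formula (15), including the zero-padding of shorter coefficient vectors, can be sketched as follows (function name illustrative):

```python
import numpy as np

def average_predictor(w_by_order, r_max):
    """Formula (15): average the WLS coefficient vectors of orders
    r_min..r_max, zero-padding shorter vectors to length r_max first."""
    padded = [np.pad(np.asarray(w, dtype=float), (0, r_max - len(w)))
              for w in w_by_order]
    return np.mean(padded, axis=0)

# two toy models of orders 1 and 2 instead of the paper's range 4..24
w_avg = average_predictor([[1.0], [0.5, 0.5]], r_max=2)
```

Zero-padding means a low-order model simply abstains on the far coefficients rather than biasing them, so the averaged vector blends all orders cleanly.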

Extended Multi WLS Method
Another idea introduced in this work, apart from the proper selection of the neighborhood (Section 6.3), is the replacement of the arithmetic mean of 22 prediction models for successive orders from r_min = 3 to r_max = 24 (Formula (15)) by a weighted mean whose weights are determined adaptively after the coding of each pixel using an improved ALCM+ method. This approach of weighted combination of predictive models (for calculation of the Main Predictor) is called the extended multi WLS method (EM-WLS).
The original version of the Activity Level Classification Model (ALCM) technique [36] was developed in the 1990s for image coding purposes. In comparison to the classic LMS solution, it is characterized by a lower, though similar, computational complexity. Originally, the method operated on linear predictors of the fifth or sixth order. In each iteration, only two predictor coefficients were modified, and only by adding/subtracting a constant (µ = 1/256 for 8-bit samples). Here, the number of coefficients is 22, and up to six properly selected coefficients are modified, hence the name ALCM+.
In the proposed solution, Formula (15) is extended to a weighted mean:

w_EM-MP = Σ_{j=r_min}^{r_max} β_j · w_MP(j),

where the weights β_j are initialized with 1/(r_max − r_min + 1), so that the sum of the weights is 1; Formula (18) takes the corresponding weighted form. To determine the updated β_j, it is necessary to calculate the predictions x̂_MP(j)(0) given by the WLS models for predictor orders from r_min = 3 to r_max = 24; they form a 22-element vector g. After coding the current pixel P(0), six weights β_j are adapted. For this purpose, the three highest and the three lowest values are taken from the vector g. Let us denote the smallest elements as g(q(1)), g(q(2)), g(q(3)) and the largest as g(p(1)), g(p(2)), g(p(3)). If the condition g(q(1)) < g(p(1)) is true, then the adaptation coefficients are applied as in Algorithm 2, with the modifying value µ(k) scaled by α_j = {1; 0.75; 0.5}.
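A hedged sketch of an ALCM+-style update of the combination weights; the definition of the per-model score vector g, the non-negativity clipping, and the renormalization are simplifying assumptions (the paper's exact scores and adaptation conditions from Algorithm 2 are more involved):

```python
import numpy as np

def alcm_plus_update(beta, g, mu=1/256, alphas=(1.0, 0.75, 0.5)):
    """Shift weight toward the three best models (smallest |g|) and away from
    the three worst, with steps scaled by alphas, then renormalize to sum 1.
    g is assumed to hold per-model error magnitudes for the current pixel."""
    order = np.argsort(np.abs(g))
    best, worst = order[:3], order[::-1][:3]
    new = beta.copy()
    for k in range(3):
        new[best[k]] += alphas[k] * mu
        new[worst[k]] -= alphas[k] * mu
    new = np.clip(new, 0.0, None)   # keep weights non-negative (assumption)
    return new / new.sum()

beta = np.full(6, 1 / 6)            # 6 models for brevity (the paper uses 22)
g = np.array([0.1, 5.0, 3.0, 0.2, 0.05, 1.0])
beta_new = alcm_plus_update(beta, g)
```

As in the original ALCM, each step changes only a few weights by a small constant, so the combination drifts toward the locally best model orders without a full gradient computation.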

Normalized Least Mean Square (NLMS) Method
Similar to [28,37], in this paper, we use the cascaded NLMS method for tuning the results obtained from the Main Predictor. This approach allows the introduction of high prediction orders while maintaining relatively low implementation complexity. Here, we have two cascaded NLMS filters with orders r_3 < r_2. The values of the prediction coefficients are initially set to 0, i.e., w_j^(0) = 0, where j is the prediction stage (for NLMS, j = {2, 3}) of the cascade prediction model shown in Figure 2. The general coefficient update formula for the NLMS method is [21]:

w_j^(n+1) = w_j^(n) + µ_j · ē^(n) · x_j^(n),

where ē^(n) is the bounded prediction error. The experimentally found bound is ϕ = 14.
In the NLMS approach, the learning coefficient µ_j(i) adapts to the signal. Here, we propose an enhanced formula for it [31], in which e_j^(n)(k) is the kth sample (Figure 1) of the prediction error signal from the preceding stage of the algorithm, and σ̄² is the weighted average value of all variances σ², computed with m = 10, p̄ = (1/δ) · Σ_{k=1}^{m} d_k · P(k), and δ = Σ_{k=1}^{m} d_k.
The orders of the NLMS predictors used in this paper are r_2 = 96 and r_3 = 30 for LA-OLS, and r_2 = 106 and r_3 = 42 for EM-WLS.
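The NLMS update can be sketched with a single toy stage; mu, eps, and the tiny order are illustrative values, and the paper's error bounding (ϕ = 14) and signal-adaptive µ_j(i) are omitted:

```python
import numpy as np

class NLMSStage:
    """One NLMS stage of the cascade (sketch). Coefficients start at zero,
    as in the paper; mu and eps are illustrative constants."""
    def __init__(self, order, mu=0.5, eps=1e-3):
        self.w = np.zeros(order)
        self.buf = np.zeros(order)          # last `order` input samples
        self.mu, self.eps = mu, eps
    def step(self, x_new):
        y = float(self.w @ self.buf)        # prediction of the incoming sample
        e = x_new - y                       # residual passed to the next stage
        norm = self.eps + float(self.buf @ self.buf)
        self.w += (self.mu * e / norm) * self.buf   # normalized LMS update
        self.buf = np.roll(self.buf, 1)
        self.buf[0] = x_new
        return e

stage = NLMSStage(order=4)
errors = [stage.step(10.0) for _ in range(50)]   # constant input: residual decays
```

The normalization by the input energy keeps the effective step size stable regardless of signal amplitude, which is what makes high orders such as r_2 = 106 usable.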

Context-Dependent Constant Component Removal
It was observed for the first time in [11] that predictors tend to produce a DC error component, which diminishes their efficiency. An algorithm for computing a correction value cancelling this component was then constructed. According to Formula (7), in our algorithm, the final pixel estimate and its rounded version are:

x̂(0) = y_1(0) + y_2(0) + y_3(0) + C_mix,   ẋ(0) = round(x̂(0)),

where the components come from the Main Predictor, NLMS1, NLMS2, and the cumulative error correction stages. The latter is computed using an extended formula (33) from our previous works [38], in which the C_j components are computed using the context-based error bias correction methods mentioned in [38]: there are 4 context definitions for each technique, with j = {9, 10, 11, 12} for the slightly modified median method, and S_map(i_k) being current sums of prediction errors e_3(0) (outputs of the NLMS2 stage), where i_k is fixed and indicates the number of the context chosen for coding the current pixel for the kth technique.
Counters N_map(j) consist of the current numbers of appearances of context number i_k. For j = {9, 10, 11, 12}, C_j is the median C_Median(i_k), obtained as the middle value of a vector V_MED(i_k) containing up to 128 sorted prediction errors e_3(0) for each context number i_k. From a statistical point of view, careful averaging of uncertain measurements leads to improved measurement results; this is why the constant component is removed as in Formula (33) in this paper. At the same time, the approach reduces variations in histogram shapes, which may manifest in their strong asymmetry: for a given context i, there may be several maximum values not positioned at its center (the Laplace distribution is symmetric and has one maximum) [39].
As in [38], four indices pointing at four context systems are applied to each method; they are described in the next subsection. A novelty is that Formula (33) is adaptive, with weights given by (34)-(35), where θ_map(j) is the current cumulated final squared prediction error for index j (θ_map = [θ_JPEG-LS(i_1), ...]), updated as in (36). The experimentally found weights were ω_j = {0.275; 0; 0.4; 0.15; 0.2; 0.3; 0.1; 0.35; 0.2; 0.2; 0.325; 0.2}; optimization of the weights for a particular image is also possible. When a count N_map(j) reaches 128, it is halved, similarly to S_map(j); in this case, the denominator of (35) is scaled accordingly (37). In the case of median updating, when halving the 128-element vector V_MED(i_k) (containing sorted prediction errors e_3(0) for context number i_k), the greatest 32 and the smallest 32 values are rejected.
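The sum/count bookkeeping behind the S_map/N_map corrections, including the halving at 128, can be sketched as follows (class name illustrative; the paper combines twelve such correctors with the weights ω_j):

```python
class BiasCorrector:
    """Per-context running sum S and count N of prediction errors; the
    correction is S/N, and both are halved when N reaches 128, mirroring
    the S_map/N_map bookkeeping described above."""
    def __init__(self, n_contexts):
        self.S = [0.0] * n_contexts
        self.N = [0] * n_contexts
    def correction(self, ctx):
        return self.S[ctx] / self.N[ctx] if self.N[ctx] else 0.0
    def update(self, ctx, error):
        self.S[ctx] += error
        self.N[ctx] += 1
        if self.N[ctx] >= 128:      # halving keeps the estimate locally adaptive
            self.S[ctx] /= 2.0
            self.N[ctx] //= 2

bc = BiasCorrector(n_contexts=4)
for _ in range(10):
    bc.update(0, 2.0)               # context 0 sees a constant bias of 2
```

The halving acts as an exponential forgetting factor: old errors never dominate, so the correction tracks slow drifts of the DC component.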

Contexts for Correcting Prediction Error Bias
Three of the four context definitions for the error bias correction methods are similar to those described in [38]. The context definitions in [11] use the values of t samples surrounding the coded pixel P(0) and their mean. For t = 8, the samples are P(1), P(2), P(3), P(4), P(5), P(6), GradNorth = 2P(2) − P(6), and GradWest = 2P(1) − P(5). The numbering is the same as in Figure 1. We take an 8-bit word and set each bit to 1 if the corresponding sample value is greater than the mean of the t samples, and to 0 in the opposite case. As such, 1 of 256 context numbers is defined. Additionally, we can quantize the pixel standard deviation into q values; the number of contexts then grows to q · 2^t. In this paper, t = 8 and q = 4 are used, which produces 1024 contexts i_1. The arithmetic mean is replaced by the expectation value ŷ_3(0) = y_1(0) + y_2(0) + y_3(0). The variance (multiplied by 8) is given by Formula (38). The quantization levels were set by 3 experimentally found thresholds for σ̂²(0): 300, 2000, 8000. A similar idea was presented in [40].
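The construction of the texture part of context i_1 and the variance quantization can be sketched as follows; a reduced t = 4 is used for brevity (the paper uses t = 8, giving 4 · 2^8 = 1024 contexts):

```python
def texture_context(samples, reference):
    """CALIC-style texture context: one bit per sample, set when the sample
    exceeds the reference (here the expectation estimate replaces the mean)."""
    ctx = 0
    for bit, s in enumerate(samples):
        if s > reference:
            ctx |= 1 << bit
    return ctx

def quantize_variance(var, thresholds=(300, 2000, 8000)):
    """Quantize the local variance to q = 4 levels using the paper's thresholds."""
    return sum(var > t for t in thresholds)

# t = 4 samples for brevity; values are illustrative
bits = texture_context([10, 200, 10, 200], reference=100)
level = quantize_variance(500.0)
context_i1 = level * (1 << 4) + bits     # combine activity level and texture bits
```

The texture bits capture the local edge pattern while the variance level separates smooth from busy regions, so each context accumulates errors with similar statistics.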

Approach for k = 3
The third approach is based on a simplified vector quantization method [41]. Initial data analysis resulted in defining 16 centroid vectors, each containing 3 pixels from the coded pixel's neighborhood: V = {P(1), P(2), P(4)}. The centroids are initialized as shown in Algorithm 3.
The simplified adaptive method for updating centroids is based on the Euclidean distances of the current vector V = {P(1), P(2), P(4)} from all 16 centroids. The 4-bit (b_0 b_1 b_2 b_3) label l of the closest centroid is concatenated with the other 6 bits (defined below) into a 10-bit context number i_3 (i.e., there are 1024 contexts i_3). The lth centroid is then adapted, and the counter count(l) is increased by 1.
Bit b_8 carries information on whether the current ŷ_3(0) is greater than the average of the pixels coded up to that moment. Finally, bit b_9 is defined on the basis of the following majority rule: if at least 5 pixels from the set {P(3), P(4), P(5), P(6), P(7), P(8), P(9)} are greater than ŷ_3(0), then the bit is zero.

Approach for k = 4
This approach is completely new. We start by reordering the values {P(1), P(2), ŷ_3(0)} in ascending order; a new set {p_r1 ≤ p_r2 ≤ p_r3} is obtained. Then, (non-negative) differences are defined: d_1 = p_r2 − p_r1 and d_2 = p_r3 − p_r2. The differences are quantized using two thresholds: 5 and 18. There are 6 possible orderings of the set and 3² quantization levels for the differences, which produces 54 combinations. After appending a five-bit number, 54 · 2^5 = 1728 contexts i_4 are obtained. Bit zero of the number (b_0) is set if p_r2 (the median value) is greater than the average of the pixels coded up to that moment. The next bit (b_1) is the sign of the error e(1). The two following bits (b_2 b_3) answer two questions linked with pixel P(4): whether it is lower than ŷ_3(0), and whether it differs substantially from ŷ_3(0), i.e., |ŷ_3(0) − P(4)| ≥ q_3. The experimentally found q_3 value was 20. Finally, the fifth bit (b_4) is one if |P(1) − P(5)| ≥ q_3.
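The ordering-and-gaps part of this context can be sketched as follows; whether the thresholds are compared with ≥ or >, and the enumeration order of the 6 permutations, are our assumptions, and the five extra bits b_0..b_4 are omitted:

```python
import itertools

def context_k4_base(p1, p2, y3, thresholds=(5, 18)):
    """Sketch of the ordering part of the k = 4 context: 6 possible orderings
    of {P(1), P(2), y3(0)} times 3 x 3 quantized gap levels = 54 combinations."""
    values = [p1, p2, y3]
    order = tuple(sorted(range(3), key=values.__getitem__))
    pr = sorted(values)
    d1, d2 = pr[1] - pr[0], pr[2] - pr[1]
    quant = lambda d: sum(d >= t for t in thresholds)            # 0..2 per gap
    base = list(itertools.permutations(range(3))).index(order)   # 0..5
    return (base * 3 + quant(d1)) * 3 + quant(d2)                # 0..53
```

A flat neighborhood such as (10, 10, 10) lands in combination 0, while a strong gradient selects one of the higher combinations, separating smooth and edge regions.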

Context Adaptive Binary Arithmetic Coder (CABAC)
Among the practical entropy coders for prediction errors, the most effective are the adaptive arithmetic ones, although various variants of the Huffman code are also used, including the Rice and Golomb codes [21]. By measuring the characteristics of prediction errors, it is possible to determine the approximate type of distribution of the currently encoded value e(0) quite well. Based on this assumption, a contextual multi-value arithmetic encoder is usually constructed using not one but t probability distributions associated with context numbers from 0 to t − 1. Theoretically, as the number of contexts increases, an improvement in compression efficiency is expected. Initially, we do not know the probability distributions; their shapes emerge as incoming data are collected. Hence, for large t, the problem of context dilution arises, i.e., the shapes of some probability distributions remain unformed for too long, whereas a quick determination of their approximate target form is needed. Therefore, a compromise should be found between the number of contexts and the speed of probability distribution adaptation. Most often, 8 [11], 16, or even 20 contexts are used [42]. An analysis of the influence of the number of contexts on the bit average was presented in [43].
Another approach consists of assuming initial knowledge of the approximate probability distributions for each context. In this case, an arithmetic encoder should be built that can determine which histogram is updated by the coded pixel. This is a complex problem, as it requires finding the best possible mathematical description of the distributions matching a given image or class of images. This approach was analyzed based on the Laplace, Gaussian, Student's t, and Generalized Gaussian (GGD) distributions [22]. The conclusions were followed in [44], where a GGD was used. A similar solution using Student's t distribution was also used in the TMW method [45], where the principle of blending probability distributions was introduced. However, further research showed that an important issue is the influence of the implemented prediction models on the type of distributions, whose parameters must describe the actual distributions of prediction errors sufficiently well [46].
A less-often-used approach consists of omitting the prediction stage and coding adaptively pixels instead of prediction errors [47].
Among the most effective codecs, the most often used is an adaptive version of a multi-value arithmetic coder with a reduced number of coding symbols [48] obtained using a non-linear quantization stage. A reduced number of symbols results in an increase in adaptation speed (there are fewer contexts). Even faster probability distribution adaptation can be achieved when using the binary version of the arithmetic encoder. Therefore, our proposed entropy coder is a completely new context adaptive binary arithmetic coder (CABAC). The absolute error value |e(0)| is first mapped by an adaptive Golomb code and then compressed by two arithmetic coders. Additionally, if the error is non-zero, a third coder for sign compression is activated. This two-stage approach significantly improves the coder's adaptability to the quickly changing properties of image prediction errors.

Medium-Term Estimation of Probability Distribution
The Golomb code is particularly well suited for coding data with a geometric distribution [49]. The parameter m_G is chosen so that, for p, the parameter of the geometric probability distribution, p^(m_G) ≈ 1/2. The value of m_G is searched for each coded |e(0)| among 6 probability distributions of the form G(i) = (1 − p) · p^i. They are defined by the m_G values m = {1, 1, 2, 3, 4, 12}. The current parameter p is calculated as p = (K − 1)/K, where K = ω̄² for m = 48 (Formula (40)). Then, m_G is evaluated [49,50] from Formula (41). According to a previous observation [51], m_G ≈ ln(2) · K. The value of ln(2) · K is then quantized using the thresholds {0.01, 1.5, 3.6, 11.0, 16.0}; the obtained index b_Golomb, with values {0, 1, ..., 5}, is used to select an element of the set m. The value b_Golomb is part of a context number and is constant while coding the bits of the Golomb word representing the current |e(0)|. The Golomb word consists of the unary-coded group number u_G = ⌊|e(0)|/m_G⌋ and, for m_G > 1, the group element number v_G = |e(0)| − u_G · m_G (the remainder of division by m_G), coded using a phased-in binary code, which is a variant of the Huffman code for sources with m_G equally probable symbols [49]. Specifying the parameter k = ⌈log₂ m_G⌉ means that in each group, the first l = 2^k − m_G elements v_G are coded using k − 1 bits, and the remaining m_G − l elements are coded as the number v_G + l using k bits [21]. The value |e(0)| is transformed into two bitstreams representing u_G and v_G in the Golomb code block (Figure 2).
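The Golomb mapping with the phased-in (truncated) binary remainder can be sketched as follows; the 0/1 convention of the unary part is an assumption:

```python
import math

def golomb_encode(n, m_g):
    """Golomb codeword for a non-negative n = |e(0)|: unary group number
    u_G = n // m_G, then remainder v_G = n % m_G in a phased-in (truncated)
    binary code. Returns the codeword as a bit string."""
    u, v = divmod(n, m_g)
    unary = "1" * u + "0"              # unary part; 0/1 convention assumed
    if m_g == 1:
        return unary                   # no remainder bits needed
    k = math.ceil(math.log2(m_g))
    short = (1 << k) - m_g             # l = 2**k - m_G codewords of k - 1 bits
    if v < short:
        return unary + format(v, "b").zfill(k - 1)
    return unary + format(v + short, "b").zfill(k)
```

For m_G = 3 the remainders 0, 1, 2 map to "0", "10", "11", so the shorter codeword goes to the more probable small remainder.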

Context Number Calculation
Binary sequences u_G and v_G are coded by separate binary arithmetic coders. The context number ctx_u for u_G is computed from Formula (45), where b_unary is in the range {0, 1, ..., 5} and denotes the number of currently coded u_G bits (starting with the most significant one); if there are more than six bits, then b_unary = 5. Hence, there are 576 ctx_u contexts. The number of contexts for v_G is 192 (Formula (46)), where b_unary2 = min{b_unary, 3}, b_binary is the most significant bit of v_G, b_ω is a one-bit number obtained by quantizing ω̄ using the threshold 49, and b_phased-in is 0 for the first coded bit of v_G and 1 otherwise. If b_phased-in = 0, then b_binary = 0.

Long-Term Adaptation of Probability Distribution
Each context number is associated with counters of its zeros and ones: n(0) and n(1). The counter values cannot grow infinitely; when the sum n(0) + n(1) reaches a value N_max, both counts are halved. For u_G, N_max = 2^10, and the counters' initial values are n(0) = n(1) = 1. For v_G, N_max = 2^11, and initially, n(0) = n(1) = 16.
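The counter maintenance for one context can be sketched as follows; the max(1, ...) floor after halving is our assumption to keep the probabilities non-degenerate:

```python
class BinaryModel:
    """Zero/one counters for one context, halved when their sum reaches N_max
    (values for the u_G stream: N_max = 2**10, counters initialized to 1)."""
    def __init__(self, n_max=1 << 10, init=1):
        self.n = [init, init]
        self.n_max = n_max
    def prob_one(self):
        return self.n[1] / (self.n[0] + self.n[1])
    def update(self, bit):
        self.n[bit] += 1
        if self.n[0] + self.n[1] >= self.n_max:
            self.n[0] = max(1, self.n[0] // 2)   # floor of 1 is an assumption
            self.n[1] = max(1, self.n[1] // 2)

model = BinaryModel()
for _ in range(6):
    model.update(1)     # six 1-bits shift the estimate toward 1
```

The periodic halving makes the probability estimate an exponentially weighted one, so each context tracks local statistics rather than the whole image.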

Sign Coding
Separate coding of the sign of the error e(0) is rather uncommon in binary arithmetic coders; hence, it is a distinctive feature of the coder presented in this paper. There are 32 contexts for coding sgn(e(0)) (a 5-bit context number). The bits of the context number are: the signs of the neighbor errors, sgn(e(1)) and sgn(e(2)) (Figure 1); then the bit b_ω (see the comment to (46)); finally, the last two bits are obtained from the four-state quantization of |e(0)| using the thresholds {1, 3, 16}. The initial counts of zeros and ones are set to n(0) = n(1) = 2. For the sign coder, N_max = 2^10.
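A sketch of the 5-bit sign-context computation (the text specifies the fields but not their bit order or the exact threshold comparisons, so those details are our assumptions):

```python
def sign_context(e0_abs, e1, e2, omega):
    """Build the 5-bit context number for coding sgn(e(0)).
    Fields per the text: sgn(e(1)), sgn(e(2)), b_omega (threshold 49),
    and a 2-bit quantization of |e(0)| with thresholds {1, 3, 16}.
    The bit layout is an assumption."""
    b1 = 1 if e1 < 0 else 0
    b2 = 1 if e2 < 0 else 0
    b_omega = 1 if omega > 49 else 0
    thresholds = (1, 3, 16)
    q = sum(1 for t in thresholds if e0_abs > t)   # four states: 0..3
    return (b1 << 4) | (b2 << 3) | (b_omega << 2) | q
```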

Generalized Criterion for Minimizing the Bit Average of a Coder
The authors of [26] observed that minimization of the mean square error is not equivalent to minimization of the first-order entropy or of the average bitrate of a coded image. This observation prompted us to search for global static predictors that are more accurate from this point of view for each of the 45 test images, using the Minkowski vector distance criterion [52]: D_M = Σ_{n∈Q} |x^(n)(0) − x̂^(n)(0)|^M, where x^(n)(0) and x̂^(n)(0) are the reference and actual vector elements, e.g., actual samples and their estimates (1), respectively, and Q is the data training area for predictive model learning. It appeared that the best predictors were obtained for M values between 0.6 and 0.9 (compromise M = 0.75). For comparison, the MMAE criterion of [26] corresponds to M = 1; for M = 2, the criterion is equivalent to MMSE. In [26], some improved results were obtained in comparison to MMSE; however, the improvements were not as evident for the final bitrate as for the prediction error entropy. This was due to the extremely difficult task of jointly optimizing the modelling and entropy coding stages, at least for online methods. In the case of advanced offline techniques such as MRP [15], this optimization simply resulted in the iterative nature of such algorithms and in their high computational complexity (forward adaptation). Replacing MMSE with some version of the Minkowski criterion led to an important increase in the computational complexity of the potentially enhanced method, making it impractical. The results of [52] were so striking that we decided to continue experiments with the Minkowski criterion. It appeared that the optimum value of M can vary strongly. We started with OLS prediction followed by an adaptive arithmetic coder, and the value jumped from approximately 0.75 [52] to 1.3, partly due to the locality of the OLS predictors. After adding a form of cumulated predictor error removal (compare with Section 4), the value increased to 1.5.
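The criterion is compact enough to state directly in code; a minimal sketch (the function name and toy data are ours, only the formula comes from the text):

```python
def minkowski_cost(x, x_hat, M):
    """D_M = sum over the training area Q of |x(0) - x_hat(0)|^M.
    M = 2 reproduces MMSE, M = 1 reproduces MMAE; intermediate M
    (e.g., 0.75 or 1.9 as quoted in the text) weights large and
    small errors between those two extremes."""
    return sum(abs(a - b) ** M for a, b in zip(x, x_hat))
```

A predictor search would then minimize `minkowski_cost` over candidate coefficient sets instead of the usual squared-error sum.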
The 3ST-OLS technique [31] has an additional NLMS+ stage (compare with Section 3.3), for which the optimal M is 1.7. Finally, for the proposed EM-WLS version, the best M value is 1.9 (Figure 5), and the gain in output data entropy with respect to the MMSE criterion is very small and unlikely to be exploited in practice (accordingly, in this paper, we present the algorithm based on MMSE). In summary, the structures of the most effective data compacting algorithms seem to be collections of ad-hoc concepts that work; testing how close to 2 the optimum M value of the Minkowski criterion for predictor optimization is seems to validate these concepts, or not. For example, our findings highlight a non-obvious fact: multi-stage multimedia lossless coders are often more accurate than one-stage coders, in accordance with what is known about linear predictors. The added stages could even deteriorate the result; when data are scarce, the combined total linear predictor length may be prolonged beyond the limits given by length criteria such as Akaike, MDL, etc. However, as in the examples above, the addition of a stage to a coder may push the optimum M value towards 2, and this seems to be the effect that matters.
In line with the findings presented here, the cascaded EM-WLS algorithm is close to optimal with respect to the Minkowski criterion (Figure 6), providing the advantages of an online approach (backward adaptation) and excellent performance (see Section 3.2). Figure 6 shows the bit average for a set of 45 test images as a function of the LA-OLS predictor order.

Performance Analysis of Algorithm with LA-OLS in Main Predictor Stage
Testing of this method started with checking whether each of the novel aspects of the method improved overall performance. Table 1 compares the average bitrates per pixel of the LA-OLS method (last column) with versions of the algorithm in which one feature was omitted. In Test 1, Formula (8) was implemented instead of (11); in Test 2, predictor coefficients were not weighted (12); in Test 3, there were no NLMS stages; and in Test 4, there was no error bias cancelling stage. As can be seen, the results for the full method were superior. The situation was similar when using EM-WLS as the Main Predictor. Figure 6 shows the bit average for a set of 45 test images as a function of the LA-OLS predictor order. Table 2 presents results for LA-OLS predictor orders from 6 to 20, the size of the optimal training window W, and the execution times of the versions; the times are for coding the Lennagrey image on an i5 3.4 GHz PC. As can be seen, the best parameters are r = 18 and W = 10. For this case, the total execution time is 5.86 s, which is very close to that of the GLICBAWLS method [32]. The additional time needed for the NLMS and bias correction stages is considerable when these stages are appended to the CoBALP ultra algorithm described in [53]: its execution time increases from 2.07 s to 2.62 s.

Effects of Neighborhood Selection in Linear Prediction
The most accurate result is obtained when the closest possible neighborhood of the coded pixel (in terms of Euclidean distance) is selected, consisting of the r pixels P(i) in (1), where r is the predictor order. This was confirmed by experiments. Notably, there are groups of pixels equidistant from the coded pixel P(0). For r ≤ 30, these are the pixels with numbers {4, 6, 10, 12, 14, 18, 22, 24, 28, 30} shown in Figure 7a; within such a group, pixel numbering is irrelevant as long as the rule of Euclidean distance minimization is maintained. Examples can be found in previous works for r = 6 [54], r = 10 [30], r = 12 [33,55-57], and r = 18 [29]. A problem arises for other prediction orders; e.g., when r = 5, we have to decide which pixel, P(5) or P(6), completes the set of four nearest pixels P(1), P(2), P(3), and P(4). For example, in [58], the fifth neighbor was pixel P(5) (Figure 7a). This choice seems justified by the fact that it minimizes the number of image rows to which the procedure calculating the predicted value must have access. This is important for optimizing execution time or resource usage in hardware solutions.
In many cases, the speed of calculations has high priority. In such cases, low prediction orders are implemented as a compromise, and some simplifications are even introduced to reduce the complexity of computing Equation (8) to close to that of the model of order r = 5, while maintaining compression efficiency close to that offered by order r = 6 [54]. Some solutions in the literature prefer the Manhattan distance (l_1 norm) over the Euclidean distance. For example, in [32], the neighborhood for distance l_1 ≤ 3 (r = 12) was used, and in [44,46], the distances were l_1 ≤ 4 (r = 20) and l_1 ≤ 5 (r = 30), respectively.
In our opinion, the best approach to the numbering of neighborhood pixels is to minimize the Euclidean distance. In the case of equidistant pixels, counterclockwise numbering should be used (Figure 7b). Our experiments confirmed the advantage of this approach over clockwise numbering. This advantage is most noticeable when using linear predictors of orders 3, 5, and 7. This is particularly important when using multiple prediction models of different orders together, as in Section 3.2.3.
An intuitive explanation is that, assuming the same level of correlation of equidistant neighbors with pixel P(0), the pixels on the right side of P(0) should be selected first. This is because the neighbors to the left of P(0) predominate: for example, for r = 14, there are seven of them, another three are located directly above the coded pixel, and only four lie to its right. The dominance of left-side neighbors introduces some imbalance of information; therefore, counterclockwise numbering, at least for some prediction orders, reduces this disproportion.
The authors of [26] went one step further in this direction by relaxing the Euclidean distance criterion for a model of order r = 14. They decided to use only three rows, keeping the six neighbors to the left of the coded pixel P(0) (two columns in each of three rows) equal in number to those to its right, together with two pixels directly above it (four columns in each of two rows).
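The ordering rule discussed in this subsection (Euclidean distance first, counterclockwise among equidistant causal pixels, starting from the right of P(0)) can be sketched as follows; the search span and the exact tie-break convention are our assumptions:

```python
import math

def causal_neighbors(r, span=6):
    """Return the first r causal neighbors of the coded pixel as offsets
    (dx, dy), where dy > 0 means rows above and dy == 0, dx < 0 means
    pixels to the left in the same row.  Primary sort key: Euclidean
    distance; tie-break: counterclockwise angle measured from the right
    of P(0), so right-side equidistant pixels come first."""
    cand = []
    for dy in range(span):
        for dx in range(-span, span + 1):
            if dy == 0 and dx >= 0:
                continue            # same row: only left pixels are causal
            cand.append((dx, dy))
    # all causal angles lie in (0, pi], so plain atan2 orders them
    # counterclockwise starting from the right
    cand.sort(key=lambda p: (math.hypot(p[0], p[1]),
                             math.atan2(p[1], p[0])))
    return cand[:r]
```

With this tie-break, the distance-√2 pair is taken right-before-left, matching the intuition above about compensating the left-side dominance.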

Performance Analysis of New Algorithms
Section 6.2 confirms the usefulness of all blocks of the proposed cascade model (see Figure 2) for maximizing the total compression efficiency of the coder. Additionally, the dependence of the coding time on the parameters r and W is shown when LA-OLS is used as the Main Predictor. The highest efficiency was obtained for r = 18 and W = 10. For these settings, the encoding time of the Lennagrey image (512 × 512 pixels) is 5.86 s.
When EM-WLS was used in the first predictive block, the coding time increased significantly, up to 107.4 s (the decoding time is similar because the decoder uses the same process for calculating predictions of coded pixels). The time does not depend on image content; it is linearly proportional to the number of pixels in the encoded image. Table 3 presents the contributions of the EM-WLS cascade model coding stages to the whole coding time. It shows that the total time contribution of the last three stages is less than 1% (the two last entries of Table 3); the same holds for the total processing time of both NLMS blocks. Among the most complex operations is the determination of the predictive model, which requires the solution of 22 matrix equations and is performed twice due to the use of ridge regression (Formulas (11) and (19)). This is done using Cholesky decomposition, which takes 13.21% of the coding time. However, the most complex calculation is filling the R matrix and P vector (see Formulas (9) and (10)). The high computational complexity of this step follows from the fact that, for each coded pixel, as many as W · (W + 1) · r · (r + 5) multiplication and addition operations must be performed, despite the symmetry of the R matrix. For r = 24 and W = 14, this necessitates 146,160 multiplication and addition operations. The execution times of CALIC, JPEG-LS, JPEG2000, and BMF are below one second [59,60]. The methods in Table 4 are faster than EM-WLS but perform worse: Blend-20 is three times faster, and LA-OLS codes the Lennagrey image in 5.86 s (Pentium i5 3.4 GHz), in contrast to 52.5 s for Vanilic WLS-D. The coding time of CoBALP ultra2, GLICBAWLS, 3ST-OLS, SWAP, and RALP is a few seconds.
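The per-pixel cost quoted above is easy to verify; a one-line sketch (the formula is taken directly from the text, the function name is ours):

```python
def rp_fill_ops(r, W):
    """Multiply-and-add operations per coded pixel needed to fill the
    R matrix and P vector, per the formula in the text:
    W * (W + 1) * r * (r + 5)."""
    return W * (W + 1) * r * (r + 5)
```

For r = 24 and W = 14 this evaluates to 146,160 operations per pixel, matching the figure quoted above; for the faster LA-OLS setting r = 18, W = 10, the analogous count drops to 45,540.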
Among the methods with the most complex implementations, EM-WLS stands out with an execution time of 107.4 s. This is still 3.9 times less than for MRP 0.5; GPR-BP, MRP-SSP, and TMW Lego are even more computationally complex. As can be seen, while being less complex, EM-WLS is, on average, closer to the optimum than these algorithms. However, the decoders of MRP 0.5 and, hence, of GPR-BP and MRP-SSP, are relatively time efficient due to the use of prediction with forward adaptation, whereas the complexity of the EM-WLS decoder is similar to that of the coder. EM-WLS files are, on average, 11.98% shorter than those of JPEG-LS. Table 4. Bitrate comparison of some state-of-the-art algorithms for the first image database [59].

Conclusions
In this work, different approaches to lossless compression of images, such as forward and backward adaptation, were analyzed. The paper contains remarks on neighborhood selection in pixel prediction (Section 6.3) and notes on the relationships between the Minkowski distance, the first-order entropy of the final prediction error, and the eventual average data rate of the coder (Section 6.4). These observations influenced the construction of the proposed EM-WLS algorithm. Compared to the best algorithms, the proposed EM-WLS lossless image coding technique is currently, on average, the most efficient in terms of data compaction. The algorithm is less computationally complex than its main competitors. It is based on the AVE-WLS approach, an expanded version of WLS, and has a cascade form in which the EM-WLS predictor is followed by a two-stage NLMS section and a final Context-Dependent Constant Component Removing stage. The new sophisticated binary context arithmetic coder is much less computationally complex than the preceding data modelling stage; hence, it can also be used in other image compression methods. In the proposed universal cascade architecture (Figure 2), the Main Predictor module can be replaced by an LA-OLS coder with lower implementation complexity (while maintaining high compression efficiency) due to the simpler prediction value calculation technique, as shown in Section 3.1.