Lossless Image Coding Using Non-MMSE Algorithms to Calculate Linear Prediction Coefficients

This paper presents a lossless image compression method with a fast decoding time and flexible adjustment of coder parameters affecting its implementation complexity. A comparison of several approaches for computing non-MMSE prediction coefficients with different levels of complexity was made. The data modeling stage of the proposed codec was based on linear (calculated by the non-MMSE method) and non-linear (complemented by a context-dependent constant component removal block) predictions. Prediction error coding uses a two-stage compression: an adaptive Golomb code and a binary arithmetic code. The proposed solution results in 30% shorter decoding times and a lower bit average than competing solutions (by 7.9% relative to the popular JPEG-LS codec).


Introduction
At present, the processing of images and video sequences has a fully digital structure, and high memory requirements related to storing multimedia data are therefore a significant problem. Reducing memory requirements is possible due to lossy and lossless compression, and the latter type is the subject of this work. The main applications of lossless images and video compression include archiving 2D, 3D, and 4D medical images (three-dimensional video sequences) [1][2][3][4][5] and astronomical images, as well as satellite image compression [6,7]. In addition, the lossless mode is often required during the graphic processing of photos or in advertising materials in the production of television broadcasts and film post-production [8], etc. In the case of the lossy compression of a video sequence, where the division into a group of pictures (GoP) is introduced, the first frame (type I) is coded in an intra-frame mode, whereas the remaining (N GoP − 1) frames are coded in an inter-frame mode. The image is divided into small squares in such types of codecs (similar to the JPEG format used in digital cameras). Further encoding is usually based on the DCT transform. In the type I frame within each square, the DC coefficient, as the first of the DCT-transformed components, determines the arithmetic average of the pixel values in each square. This means that the diminished version of the selected frame is used, coded with lossless compression methods. For example, for a video sequence of 4K quality (3840 × 2160 resolution) and squares of 8 × 8 pixels, we obtain a diminished resolution frame of 480 × 270.
Two steps are usually used in modern compression methods: data decomposition and compression by efficient entropy methods, the most effective being arithmetic coding and Huffman coding [9]. The decomposition is designed to significantly reduce data redundancy resulting from the high level of correlation between neighboring pixels. At this stage, wavelet methods (e.g., JPEG2000 [10], SPIHT [11], ICER [12]) as well as predictive methods (PRDC [13], JPEG-LS [14], CALIC [15]) are used. The most common definition of nonlinear predictors used for lossless image coding is that the prediction function is a linear combination of nearest-neighbor pixel values using a switching scheme between several prediction models associated with particular contexts. Such nonlinear predictors are sometimes called multichannel predictive coders [16]. Two such proposals will be discussed in Section 3.
In contrast, for example, predictive models based on neural networks are characterized by full nonlinearity. In practical applications, an adaptive method is used in which the network weights are modified on the fly after each successive pixel is encoded (methods with backward adaptation). The papers [17][18][19] used an adaptive neural network (AdNN), the design of which was based on a Multilayer Perceptron Network (MLP), whereas the paper [20] used a cellular neural network (CNN). The feature of backward adaptability refers to the time symmetry of encoding and decoding. In cases where neural networks are used, it usually means long process times for both encoding and decoding. When neural networks with forward adaptation (pre-learned using an image database) and lossy-tolossless coding mode are used, especially when using the latest GPU parallelization and acceleration technologies, the problem of excessive complexity does not exist.
Among the methods with high implementation complexity, there are other timesymmetric solutions with higher efficiency than neural networks. In such cases, the methods were based on linear prediction models with backward adaptation, using mechanisms known from the literature such as RLS [21], OLS [22][23][24], or WLS [25,26], where the coding of each successive pixel is also accompanied by a procedure for adapting or redetermining the coefficients of the linear predictor. The cascaded combination of predictors allows us to obtain the highest compression efficiency, as shown in [25], but it is associated with too long a decoding time (see Section 4). One-step coding (without the step pre-transforming the data) has similarly long encoding and decoding times. [27].
Therefore, this work focuses on time-asymmetric methods (with forward adaptation) with a fast decoding mechanism since the encoding operation is most often performed once, and the decoding operation is performed many times. Backward-adaptive methods for determining prediction coefficients only have pixels already available as decoded on the decoder side (details in Section 2.1) , hence learning the prediction model on the fly on both the encoding and decoding sides. In contrast, in methods with forward adaptation, we can already more precisely "tune" the prediction model at the encoding stage thanks to complete information about all image pixels. The paper will discuss different approaches to determining prediction coefficients, including comparing the classical method based on minimizing the mean-square error and the proprietary algorithms presented in Sections 2.2 and 3.2, which use non-MMSE solutions with different levels of implementation complexity at the coding stage. Section 2.2 discusses the advantages and disadvantages of approaches with minimized prediction errors based on the generalized Minkowski distance. Section 2.3 presents the fast prediction methods using context switching, which are then used in the final solution discussed in Section 3, whereas maintaining the low implementation complexity of the decoder, the proposed solution offers a 7.9% lower bit average relative to the popular JPEG-LS codec. In Section 4, we present a comparison of the effectiveness of the proposed solution with competing methods of similar complexity (methods with forward adaptation). The decoding time of the proposed solution is 30% shorter than the relatively fast Multi-ctx method [28].

Practical Aspects of Lossless Image Coding
Data modeling for two-dimensional signals such as images boils down to removing as much mutual information between adjacent pixels (spatial correlation) as possible. To do this, the predicted value (rounded to an integer value) of the encoded pixelx n is determined. Then, the differences between them are encoded, which are referred to as prediction errors, which are most often small values oscillating near zero: In this way (i.e., by encoding successive rows of the image from top to bottom and within a row and consecutive pixels from left to right, which leads to a linear complexity of decoding time as a function of the number of image pixels), we obtain a differential image in which the errors distribution e n is close to the Laplace distribution, which allows us to encode them efficiently using one of the entropy methods. With an eight-bit input data scale, the prediction error value is an integer in the interval e n ∈ −255; 255. One proposal for determining the predicted value is to use linear prediction with an appropriate selection of neighboring pixels and the prediction order. To obtain a one-dimensional data vector from the neighborhood, it is necessary to number the neighbors of the encoded pixel, for example, according to the increasing Euclidean distance (∆x j ) 2 + (∆y j ) 2 of their centers. The numbering of pixels with the same distance is then determined clockwise. This allows the apparent one-dimensional domain of the signal, which facilitates the mathematical notation of many formulas and the use of known formulas themselves, concerning onedimensional signals. We discussed the analysis of other numbering methods in paper [25]. Figure 1 illustrates the 46 numbered nearest neighbors of a coded pixel x n , where the given j-th number points to a pixel with a value P(j). 46   The linear predictor of order r has the form: where the elements b j are the prediction coefficients that make up the B vector to be passed in the file header to the decoder [9]. It is widely accepted that, for images, the determination of prediction coefficients by minimizing the mean squared error (MMSE) gives very good results [9,25]. To ensure that the header size not too large and assuming that the prediction coefficient is between −1.999 and 1.999, we can write each coefficient using a small number of bits, that is, (N + 2) bits. Each bit is used to store the sign of a number and an integer value, respectively. The remaining N bits are the precision of the fractional part. To achieve this, the following operation is performed: It is assumed that for images, the sum of the coefficients is equal to 1. Suppose the sum of the prediction coefficients (for example, determined by the MMSE method) is slightly different from 1 (which can be affected by the rounding used in formula (3)). In that case, we can modify the first coefficient b 1 by adding a value of 1 − ∑ r j=1 b j . When using a single r-order predictor, the cost of the header is only (r − 1) · (N + 2) bits. For example, for an image with a resolution of 512×512 pixels, with r = 24 and N = 12, the cost of the header is only L head = 0.00123 bits per pixel. Therefore, in this work, we can make some simplifications, such as omitting the header part from the formula for the bit average L avg and using the entropy function: L avg = L err + L head ≈ L err ≈ H, where H denotes the first-order entropy value of the differential image data (after predictive coding): where p i is the probability of occurrence of a symbol from the prediction error alphabet with index i; in this case, the error alphabet can be considered a set of integers with values between e min and e max . With this assumption, the MDL (Minimum Description Length) algorithm, which minimizes the bit average L avg , is reduced to minimize the entropy function of the differential image data.

Determination of Prediction Coefficients Using the Non-MSE Method
In determining linear prediction coefficients, the MMSE criterion [9] is not synonymous with obtaining a model that allows the smallest possible value of entropy H, nor the smallest bit average L avg received after considering header data, for example, adaptive arithmetic coding of prediction errors. The paper [29] proposed a minimization of the mean absolute error (MMAE), which, when the image was divided into squares of size 8×8, yielded better results compared to the use of MMSE. A broader justification for the suboptimal effect of the MMSE on the entropy value was presented in the work of [30]. In the paper [25], we showed that depending on the prediction scheme used (cascade prediction) and the size of the learning area Q, it makes sense to use different values of the power of M (from 0.6 to 2) in the minimization criterion L M , which uses the generalized Minkowski distance formula (for M = 2 we obtain the Euclidean distance (MSE), and for M = 1, the Manhattan distance (MAE)): For methods with iterative optimization of the prediction model, it is more convenient to use the substitute of the objective function L M in place of the bit average L avg , even when the simplified assumption of L avg ≈ H is made. In the approach proposed in this work, the area Q is the entire differential image, for which good compression efficiency is obtained at M = 0.75. Figure 2 shows the dependence of entropy as a function of the parameter M; for example, the Noisesquare image was encoded using a linear predictor of order r = 8. Beyond the integer values M = 1 and M = 2 (especially with M < 1 and b j coefficients ranging from −1.999 to 1.999), finding an optimal linear predictor is challenging. In the paper [31], we presented a suboptimal algorithm for determining prediction coefficients using a selective search (referred to here as the Iterative Search Algorithm, ISA), which allows us to minimize the value of Formula (5) simplified to the form: where the predicted value is determined from Equation (2) with the limitations above imposed on the range of coefficients b j . Figure 3 shows One of the advantages of the ISA algorithm is the flexibility of using an internal (substitute) objective function. Although in the case of compression, we are usually concerned with minimizing the entropy function in the final coding step, deciding to modify the prediction coefficients in subsequent iterations based on the entropy drop does not guarantee that the smallest entropy value will be obtained at the end of the selective search. Comparing the results of using two different internal target functions, in about half of the cases, using the function (6) at M = 0.75 gave a slightly better result than using the function (4). Our target proposal is an experimentally selected compromise that combines both of these criteria into a single internal (substitute) target function: This is due, among other things, to the fact that the ISA quickly takes into account the effects of an additional block of removal of the context-dependent constant component when minimizing errors, which improves the predicted value and which is assumed by the target solution scheme of the encoder proposed here (see Section 3.1).
The ISA algorithm introduces the condition ∑ r j=1 b j = 1 and a predetermined precision for writing coefficients to N fractional bits. First, we set the original vector of coefficients as B = [1, 0, . . . , 0] and the objective function that will be minimized. Either Formulas (4), (6), or (7) can be used. Each t-th iteration consists of checking the objective function after encoding the image using a predictor B to which a specific scaled modifier vector ∆B has been added: If a reduction in the value of the objective function is obtained, then this vector is stored as the new value of B. For the sum of the prediction coefficients to remain constant, the sum of the elements in the modification vector ∆B must be equal to 0. A well-performing restriction of the set of ∆B = [∆b 1 , ∆b 2 , . . . , ∆b r ] vectors to those with two non-zero {−1, 1} elements, whereas the remaining ∆b j elements are 0, was adopted. The number of modifying vectors is N ∆B = r · (r − 1).
The ISA involves selective iterative searching, in which only a rough fit of the predictor to the data is obtained initially. Only subsequent steps allow each prediction coefficient to be refined with increasingly smaller bits after the decimal point, so the i parameter in Formula (8) is changed sequentially from 0 to N. For each fixed value of i, we perform encoding and reading of the objective function for successive vectors ∆B. Going through the cycle of the entire set of N ∆B modifying vectors, the process is repeated if there is a finding of at least one better vector B.
Usually, there are several such cycles, and we can limit the number of cycles to two for i < 4 and to eight for larger values of i. i denotes the number of test image encodings N ISA ≤ (4 · 2 + (N − 3) · 8) · r · (r − 1) = 8 · (N − 2) · r · (r − 1), which is an upper limit, although, in practice, about half of this value is sufficient. Test coding should be a learning process (improving prediction efficiency in subsequent iterations). Thus, the second step related to proper prediction error compression is not performed at this stage. For this reason, the coding time is negligible compared to the preliminary stage of determining the predictive model. Only using the final prediction model leads to an entropy coding procedure of prediction errors.
For example, with N = 7, r = 14, we have a maximum number of 7280 test image encodings. However, based on experiments for 45 test images, the actual number of executions of the encoding procedure was less and averaged 3746. This is still low in complexity compared to a brute force complete search method that checks every possible state of vector B under imposed constraints. In this case, we are dealing with exponential complexity since the maximum number of test encodings is N BF = 2 (N+2)·(r−1) . Table 1 presents the value of N BF and the reduced number (after considering the ∑ r j=1 b j = 1 condition) for small N = 5 and r ≤ 5 values.
In the 27 experiments performed (using the nine test images from Table 1 sequentially for r = 2, 3, 4), only in one case did ISA obtain a slightly higher (increase of fewer than 0.001 bits/pixel) entropy value relative to the entire search. It should be noted that for higher orders of prediction, such high efficiency of suboptimal ISA is not preserved. However, the main advantage of using this algorithm is obtaining good results with a low number of fractional bits (N < 7) used to store prediction coefficients, as confirmed by the results of an experiment comparing the average entropy value (for a base of 45 test images at r = 14) obtained by the two methods ( Figure 4). In the first case (dashed line), the classical minimization of the mean square error was used. Then, the accuracy of the prediction coefficients was reduced using Formula (3), and in the second case (solid line), the ISA implementation was used with the goal minimization function (6) at M = 2. On the other hand, the disadvantages of ISA include the low efficiency of improving entropy minimization for values N > 7 and the long time to determine prediction coefficients for large orders.

Prediction with Context Switching
Although most linear prediction methods with a backward adaptation of the predictive model have relatively high implementation complexity, there are other fast methods that use context-switching of the fixed predictive model. The context is a set of features that characterize the type of the closest neighborhood of the encoded pixel. Using contextual partitioning by detecting different types of neighborhoods (classes with distinct characteristics), we can individualize predictive models well-matched to neighborhood characteristics, increasing compression efficiency. In the most straightforward solutions, the predictors associated with a given context are fixed, and there is no need to determine them separately for each image. An example is the median adaptive predictor (MAP) proposed in the JPEG-LS algorithm based on the median edge detector (MED) context switching technique [14,32]. Several developments of this idea have emerged, including MED + , presented in the paper [33]. Still, the resulting improvement was insignificant compared to methods with more context, the principles of which will be shown in the following sections.

Gradient-Adjusted Predictor
The context-dependent prediction method with fixed coefficients of each model has been proposed as a primary prediction method in the CALIC algorithm [15]. It uses several neighboring pixels to determine the predicted value and allows the selection of one of seven contexts. The decision to choose the appropriate context is made based on the number d GAP = d h − d v , where d h and d v are defined as the levels of the neighborhood gradients [15] d The range of d GAP values is divided into seven fields based on three threshold values T 1 = 8, T 2 = 32, T 3 = 80 (research conducted in the paper [34] suggests changing these thresholds to T 1 = 6, T 2 = 25, T 3 = 78). The following is presented as a pseudo-code Algorithm 1 for determining the k-th context number [15]. A modified version of the method with slightly higher efficiency, denoted hereafter as GAP + , was presented in the paper [35]. Each of the seven contexts was assigned an individual fixed linear predictor using two to five of the six neighboring pixels. Table 2 shows the prediction coefficients for each of the seven contexts.

Prediction Method with Gradient Weights
The gradient-based selection and weighting pixel predictor (GBSW) method presented in the paper [36] is based on directional gradients determined similarly to the GAP + method. After some of our improvements, four values are determined: Predictors P(1), P(2), P(3), and P(4) are associated with these values, respectively. Then, the two with the smallest values are determined from among these four gradients, which become the weighting coefficients of the predictive model. This is a linear combination of two of the four nearest neighbors, with the weights associated with the predictors via the cross method. For example, if the two smallest values are d w and d n , the prediction value is determined as follows:x We developed this method into GBSW + by adding a fifth gradient d GAP , the arithmetic mean of the four described by Formula (10). The fifth associated predictor is determined as the value of predictor GAP + . When the denominator in Formula (11) is 0, the predicted value is obtained from model GAP + .

Prediction Method with Multi-Context Switching
In addition to the previously mentioned methods, new proposals continue to emerge with varying numbers of contexts [16,[37][38][39][40][41][42]. As a generalization of this concept, it is possible to design a fast function that determines a much larger number of contexts (e.g., 2048). For each, an individual predictor is calculated based on a learning image database. In the paper [28], we proposed this type of solution (Multi-ctx) with five different context determination functions built in parallel.
Thus, the predicted value of the encoded pixel x n was determined as a weighted average using five linear predictors of order 12 and the two fast predictors mentioned earlier with GAP + and MED + context switching: with the value of β = 0.025 is chosen experimentally. After considering the adaptive arithmetic encoder (described in the paper [21]), both the Multi-ctx encoder and decoder offer relatively low complexity, similar to the CALIC method. Lennagrey's image decoding time (512 × 512 pixels, using a 3.4 GHz i5 processor and an unoptimized version of Multi-ctx code) is 0.35 s. Table 3 presents a comparison of entropy values for several fast prediction methods.

Scheme of the Proposed Solution
The reference point for designing the new solution is the aforementioned Multi-ctx method. Section 3.1 presents the proposed improvements over Multi-ctx, resulting in a reduction in decoding time. In contrast, Section 3.2 discusses the proposal of a fast algorithm for determining prediction coefficients using the non-MMSE method, which offers some trade-off between speed of operation and bit-average level.

Decoder Complexity Reduction
Obtaining further acceleration of the Multi-ctx decoder requires simplifying the Formula (12). In place of the MED + predictor, we propose the much more efficient GBSW + . At the same time, we omit the constant weight β = 0.025, thus reducing the formula for the predicted value to a simple linear combination: Given the possibility of using a hardware implementation of the decoder with low power consumption and hardware resources, it is worth noting that this formula is feasible using fixed-point computing. The decoding process requires access to only three image rows (the currently decoded and two previous rows) in the case of r = 14 (we use the 12 closest pixels of P(j)) and up to four image rows at r = 24.
Since the proposed codec assumes the presence of time asymmetry between the encoder and decoder, a flexible approach is possible in determining the prediction order and in how to choose the prediction coefficients, which are placed in the file header. One approach may be to use ISA, but in most cases, the encoding time may not be acceptable. It is also possible to use the determination of prediction coefficients with the fast classic MMSE method or Algorithm 2 described in Section 3.2, which offers some compromise between the speed of operation and the bit-average level L avg .
Unchanged from Multi-ctx is the context-dependent correction of the cumulative prediction error (module CDCCR-Context-Dependent Constant Component Removingin Figure 5 showing the block coding scheme proposed in this work). For a detailed description of CDCCR, see paper [28]. Similar solutions were previously used in codecs, such as JPEG-LS and CALIC. Removing the constant C mix component associated with one of the 1024 contexts requires a slight modification of Formula (1) to the following form: Another improvement over Multi-ctx is using a newer prediction error encoder, a combination of an adaptive variation of the Golomb code and a binary arithmetic encoder (see paper [25] for details). The use of these modifications, despite the lack of code optimizations (such as the introduction of fixed-point operations in place of floating-point operations or the conversion of the code from C++ to assembler), made it possible to reduce the decoding time of a Lennagrey image (with dimensions of 512 × 512 pixels) using an i5 3.4 GHz processor from 0.35 s to 0.245 s (with r = 24) for a Multi-ctx decoder. This represents a 30% reduction in time. At the same time, the bit average decreased, even when the prediction order in the encoder was reduced to r = 14. The fast classical MMSE approach was used to determine the prediction coefficients instead of ISA or Algorithm 2. A detailed comparison of bit averages with other codecs known from the literature will be discussed in Section 4.

Algorithm for Fast Determination of Prediction Coefficients Using the Non-MMSE Method
As discussed in Section 2.2, Algorithm ISA is characterized by a too-long coding time at high prediction orders. In the literature on determining prediction coefficients using non-MMSE methods, one also encounters attempts to use, for example, genetic algorithms.
However, for instance, in works such as [46][47][48], such algorithms were designed for loworder prediction models (r = 3 and r = 4). Our experiments indicate that it is more difficult with a high prediction order (r > 10) to obtain a significantly faster convergence of linear prediction model construction in this way than is the case when using ISA. Thus, there was a need to design a compromising solution that could determine linear predictor coefficients quickly, allowing lower entropy than the classic MMSE approach with forward adaptation. To this end, to select the final prediction model, it was necessary to take advantage of the possibilities offered by methods with backward transformation, which can iteratively improve compression efficiency.
The most common prediction coefficient adaptation methods use least mean square (LMS). Regarding image coding, it is among the simplest but, at the same time, the least efficient methods, even in the normalized version of LMS (NLMS). This is due to two reasons. First, a problematic issue is the selection of the right learning rate µ with different levels of data input variability. In the few works using LMS in lossless image compression, for this reason, the value of µ is set relatively low, such as 2 −23 -2 −22 [49], and this results in slow convergence to the expected results. Second, most of the NLMS improvement methods known from the literature, including the selection of a locally variable µ value, are optimized for time series in which we have a one-dimensional input string. This also applies to more computationally complex methods such as RLS.
In contrast, images offer correlations between adjacent pixels in two dimensions. One can largely overcome these difficulties by introducing transformations that use, for example, differential data of neighboring pixel values [50]. This reduces the input signal's dynamics while increasing the predictive model's learning speed. Various input data transformations are often used in the literature to achieve faster convergence of adaptive methods [51]. For this purpose, a vector is introduced: where T is the transformation matrix, the vector X = [P(1), P(2), . . . , P(r)] T contains the neighborhood pixels of the currently encoded pixel x n . The transformed data vector Y is used to calculate the predicted value based on the prediction model included in vector A: Typically, the matrix T is a square matrix of size r × r, where r is the most commonly defined power of two; for example, the Walsh-Hadamard matrix at r = 2 2 looks like this: In this case (as in DCT), the first element of vector Y is proportional to the arithmetic mean of vector X, and the sum of the values in the other rows of matrix T is zero. In the case of lossless image coding, this first row can be dispensed with, and at the same time, it can be noted that matrix T does not have to be square. Indeed, there are more rows composed of, for example, the numbers {−1, 0, 1}, in which the sum is zero. The following formula determines their number: where the parameter k determines in such a matrix T the number of pairs of numbers {−1, 1} and (r − 2k) is the number of zeros in a row of the matrix. For example, with r = 4, we have 19 rows. Omitting the row of the matrix T consisting of only zeros, we obtain the following matrix: (19) in which a certain symmetry can be observed, and the other half can thus be rejected for this reason. Then, after using Formula (15), we obtain a vector Y = [y 1 , y 2 , . . . , y 9 ] T of the data after transformation with nine elements (in general with r A elements) and, consequently, also a vector of nine prediction coefficients A = [a 1 , a 2 , . . . , a 9 ]. By using the relationships between the data in the Y vector and the data in the X vector, there is a chance of faster convergence by the adaptive LMS-type methods. At the same time, it should be noted that after determining the final prediction model A, it is possible to obtain a prediction model B of order r using an inverse transformation, as will be shown later in this section with a practical example. Due to the rapidly increasing number of rows of matrix T along with the prediction row, in the practical implementation, we decided to use only the rows of matrix T in which there is only one pair of non-zero values {−1, 1}. This choice boils down to the fact that elements y j consist of differences between two pixels. In practice, we reduce this even further to pixels that are immediately adjacent to each other (and available in the decoding procedure); for example, the nearest neighbors of pixel P(1) are P(2), P(3), P(5), and P (7). Then, through experiments, we select (as in work [50]) a subset of these differences that work well with the prediction coefficient adaptation method. Taking into account the prediction model proposed in Formula (13), we obtain an input data vector of the form X = [x GBSW + ,x GAP + , P(1), P(2), . . . , P(r − 2)] T . Assuming that we are initializing the vector A = [0, 0, . . . , 0], modify Formula (16) to the form: We designed two sets of transformations for orders r = 14 (r A = 27) and r = 24 (r A = 43), respectively. Table 4 shows the experimentally selected values of the Y vector for r = 14 (j ≤ 27) and for r = 24 (j ≤ 43).
In designing Algorithm 2, we used a modified form of the activity-level classification model (ALCM + ), which is closer to MAE minimizing than to MSE, as is the case with NLMS, as a means of adaptively determining the prediction coefficients of A. In the original, ALCM operated on fifth-or sixth-order linear prediction models [52]. After encoding the next pixel, only a select two of these several coefficients were adapted (by a constant value of µ = 1/256). In the solution proposed here (see Algorithm 2), as many as eight of the r A of the a j coefficients of vector A are adapted for each successively encoded pixel. For this purpose, four of the lowest and highest values from the Y vector are determined. We denote them as the four smallest values, P(q 1 ) ≤ P(q 2 ) ≤ P(q 3 ) ≤ P(q 4 ), and the four largest values, P(p 4 ) ≤ P(p 3 ) ≤ P(p 2 ) ≤ P(p 1 ), respectively. Then, assuming P(q 1 ) < P(p 2 ), the adaptation presented in Algorithm 2 is performed. ifx n < x n then 3: a p k (n + 1) ← a p k (n) + µ k , for k = {1, 2, 3, 4} 4: a q k (n + 1) ← a q k (n) − µ k , for k = {1, 2, 3, 4} 5: else ifx n > x n then 6: a p k (n + 1) ← a p k (n) − µ k , for k = {1, 2, 3, 4} 7: a q k (n + 1) ← a q k (n) + µ k , for k = {1, 2, 3, 4} 8: end if 9: end for  Learning coefficients are determined as follows: where where α k = {1, 0.75, 0.5, 0.5}, and the values of γ(t) and δ(t) change in successive iterations. This is because the entire image is scanned five times, and in each successive iteration, the influence of the learning rate should be reduced (according to the principle from coarse to fine-tuning). Experimentally selected parameters are γ(t) = {10, 12, 14, 15, 17}, δ(t) = {250, 1200, 5000, 30000, 100000}, respectively, for t = {1, 2, . . . , 5}. Thus, we obtain (regardless of the prediction order) the final form of vector A after only five initial image encodings, which we transform into vector B. This requires Table 5 for r = 14 or Table 6 for r = 24, which contains a set of transformations of coefficients from the domain of vector A to vector B. Thus, after the adaptation stage, we can use the target Formula (13) for the predicted value in the encoder and in the decoder. The final step is to reduce, using Formula (3), the accuracy of the prediction coefficients' notation to N = 12 bits after the decimal point, after which we perform the full final encoding according to the scheme shown in Figure 5. In the case of the Algorithm 2 implementation, the total time to determine the prediction coefficients and encode the Lennagrey image is 1.515 s with r = 14 and 1.925 s with r = 24 (decoding times are 0.24 and 0.245 s, respectively).

Features of the Proposed Solution
In Figure 5, we have shown a block diagram of the encoder proposed here (with four separate stages), in which the data processing process proceeds as follows (Algorithm 3):

Algorithm 3
Steps of coder data processing 1: For each successively encoded pixel x n : 2: Calculate the predicted value (Formula (13)-stage 1) and the prediction error e n after considering the context-dependent component of the constant C mix (Formula (14)stage 2); 3: Convert the prediction error e n to a bit stream using an adaptive Golomb encoder (stage 3); 4: Encode the bit stream from stage 3 using an adaptive binary arithmetic encoder; 5: Return to step 2 if there are still pixels to code.
The decoding process, in turn, is described by Algorithm 4.

Algorithm 4
Steps of decoder data processing 1: For each successively encoded pixel x n : 2: Decode the bit sequence from the input using an adaptive binary arithmetic encoder, obtaining the Golomb code word and prediction error e n after considering the contextdependent component of the constant C mix (Formula (14)-stage 2); 3: Convert the Golomb word to a prediction error value e n ; 4: Calculate the predicted value (Formula (13)) and C mix , and then add these values to e n , obtaining the decoded pixel value x n ; 5: Return to step 2 if there are still pixels to decode.
In this work, we focus on analyzing the details of stage 1 (stages 2-4 are discussed in the paper [25,28]). Section 2.2 showed that non-MMSE minimization in a simple predictive model achieves higher compression efficiency than the classical approach. The coding and decoding system was designed to allow decisions on which approach to determining prediction coefficients to use (MMSE, ISA, or type ALCM + adaptation). As a compromise solution in the final version of the encoder, we proposed a predictor of order r = 24 and the calculation of prediction coefficients in an adaptive manner (ALCM + ), where, in five iterations of the initial encoding, an online predictive model is tuned (see Algorithm 2). It should be noted here that there is no need to use entropy coding stages at the initial stage (stages 3 and 4 in Figure 5). In addition, the time periods for calculating prediction coefficients (in the ALCM + version) and encoding and decoding times are linearly dependent on the number of image pixels. Table 7 compares bit averages (for two sets of standard test images, the first consisting of 9 images and the second of 45) obtained by the proposed method using three different approaches to determining prediction coefficients. The first set is the most commonly used in papers on lossless image compression ( Figure 6). The second set includes the first set-the whole dataset is available at [53]. The final compression method proposed in this work was determined to be the one in which Algorithm 2 was used to determine the prediction coefficients with a prediction order of r = 24.   Table 8 compares the bit averages (for nine standard test images) of the proposed solution with those obtained by several methods with fast encoding and decoding times known from the literature. The proposed solution offers a 7.9% lower bit average relative to the popular JPEG-LS codec. It also presents the results of the high-performance LA-OLS method [25], characterized by symmetric, significantly longer encoding and decoding times than the other solutions. Although the bit average for the LA-OLS method is lower than when using the technique presented in this work (using Algorithm 2), the Lennagrey image encoding time of 5.86 s (at 512 × 512 pixel resolution using i5 3.4 GHz processor) using LA-OLS is more than three times longer. The decoding time is as much as 23.9 times longer. In addition, compared to the relatively fast Multi-ctx method, the solution proposed here offers a 30% shorter decoding time (amounting to 0.245 s with r = 24) while having a lower bit average. The decoding time of the solutions compared here depends linearly on the number of pixels of the image being decoded, reflecting the implementation complexity of the decoding process. These results refer to a non-optimized implementation in C++ (without parallelization and use of GPU resources). Therefore, decoding time, although shorter than LA-OLS and Multi-ctx, is inferior to fully optimized solutions (such as CALIC and JPEG-LS), where decoding time is measured in milliseconds. It should be noted here that the decoding algorithm, due to its simplicity, lends itself to easy hardware implementation.

Conclusions
The analysis of solutions for lossless image coding presented in the introduction leads to the conclusion that the choice of compression method depends on user preferences. Based on the assumption that we usually encode an image once and decode it multiple times, we should care about a relatively fast decoding process. For this purpose, one should use time-asymmetric predictive methods with forward adaptation with adjustable implementation complexity in the encoding process. To achieve high compression efficiency, use linear or nonlinear prediction at the data modeling stage and, for example, arithmetic coding of prediction errors.
In this work, a method for the lossless compression of images with a fast decoding time and flexible selection of encoder parameters is presented, proposing, in addition to the classical method of determining prediction coefficients based on minimizing the mean square error, two other algorithms of the non-MMSE type. The proposed authors' algorithm (based on ALCM + type adaptation) for determining prediction coefficients by the non-MMSE method is characterized by a complexity linearly dependent on the number of pixels of the encoded image (unlike, for example, the simplex method, where in extreme cases exponential complexity can be encountered). In the proposed solution, the data modeling stage was based on linear and nonlinear prediction, and a simple block for removing the context-dependent constant component was used. The prediction errors produced after applying the prediction are subjected to two-stage compression using an adaptive Golomb code and a binary arithmetic code.  Data Availability Statement: The datasets for this study are available upon request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.