Stochastic Model of Block Segmentation Based on Improper Quadtree and Optimal Code under the Bayes Criterion

Most previous studies on lossless image compression have focused on improving preprocessing functions to reduce the redundancy of pixel values in real images. In contrast, we assume stochastic generative models directly on the pixel values and focus on achieving the theoretical limit of the assumed models. In this study, we propose a stochastic model based on improper quadtrees. We theoretically derive the optimal code for the proposed model under the Bayes criterion. In general, the Bayes-optimal code requires a calculation of exponential order with respect to the data length. However, we propose an algorithm that requires only a calculation of polynomial order without losing optimality, by assuming a novel prior distribution.


Introduction
There are two approaches to lossless image compression. (These two approaches are detailed in Section 1 of our previous study [1].) Most previous studies (e.g., [2][3][4]) adopted an approach in which they construct a preprocessing function f : v^{t−1} → p that outputs a code length assignment vector p from the past pixel values v^{t−1}. p determines the code length of the next pixel value v_t, or typically of a value v′_t equivalent to v_t in the sense that there exists a one-to-one mapping (v′_1, v′_2, ..., v′_t) = g(v_1, v_2, ..., v_t) computable by both the encoder and the decoder. Then, v′_t and p are passed to a subsequent entropy coding process such as [5,6]. In this approach, the elements p_i of the code length assignment vector p satisfy ∑_i p_i = 1; therefore, p superficially resembles a probability distribution. However, it does not directly govern the stochastic generation of the original pixel value v_t. Hence, we cannot define the entropy of the source of the pixel value v_t, and we cannot discuss the theoretical optimality of the preprocessing function f(v^{t−1}) and the one-to-one mapping g(v_1, v_2, ..., v_t).
In contrast, we adopted an approach in which we estimate a stochastic generative model p(v_t | v^{t−1}, θ_m, m) with an unknown parameter θ_m and a model variable m, which is directly and explicitly assumed on the original pixel value v_t [1,7,8,9]. Therefore, we can discuss the theoretical optimality of the entire algorithm with respect to the entropy defined from the assumed stochastic model p(v_t | v^{t−1}, θ_m, m). In particular, we can achieve the theoretically optimal coding under the Bayes criterion in statistical decision theory (see, e.g., [10]) by assuming prior distributions p(θ_m | m) and p(m) on the unknown parameter θ_m and model variable m. Such codes are known as Bayes codes [11] in information theory. It is known that the Bayes code asymptotically achieves the entropy of the true stochastic model, and its convergence speed achieves the theoretical limit [12]. The Bayes codes have shown remarkable performance in text compression (e.g., [13]). Therefore, we adopt this approach.
We assume that the target image herein has non-stationarity; that is, the properties of the pixel values differ across positions in the image. For such images, researchers have performed quadtree block segmentation as a component of the preprocessing f(v^{t−1}) and the one-to-one mapping g(v_1, v_2, ..., v_t) in the former approach, and its practical efficiency has been reported in many previous studies (e.g., [4,14]). In the latter approach, we previously proposed a stochastic generative model p(v_t | v^{t−1}, θ_m, m) that contains a quadtree as a model variable m. By assuming a prior distribution p(m) on it, we derived the optimal code under the Bayes criterion, and we constructed a polynomial order algorithm to calculate it without loss of optimality [1]. However, in all these studies [1,4,14], the class of quadtrees is restricted to that of proper trees, whose inner nodes have exactly four children.
In this paper, we propose a stochastic generative model p(v t |v t−1 , θ m , m) based on an improper quadtree m and derive the code optimal under the Bayes criterion. In general, the codes optimal under the Bayes criterion require a summation that takes an exponential order calculation for the data length. However, we herein construct an algorithm that only requires a polynomial order calculation without losing optimality by applying a theory of probability distribution for general rooted trees [15] to the improper quadtree representing the block segmentation.

Proposed Stochastic Generative Model
Let V denote the set of possible values of a pixel. For example, we have V = {0, 1} for binary images and V = {0, 1, ..., 255} for grayscale images. Let h ∈ N and w ∈ N denote the height and the width of an image, respectively. Although our model can represent any rectangular image, we assume that h = w = 2^{d_max} for d_max ∈ N in the following for simplicity of notation. Then, let V_t denote the random variable of the t-th pixel value in raster scan order, and let v_t ∈ V denote its realization. Note that V_t is at the x(t)-th row and y(t)-th column, where x(t) is the quotient and y(t) the remainder of t divided by w. In addition, let V^t denote the sequence of pixel values V_0, V_1, ..., V_t. Note that all the indices start from zero herein.
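The raster-scan indexing above can be sketched as follows (a minimal illustration; the function name is ours):

```python
def raster_position(t: int, w: int) -> tuple[int, int]:
    """Return (row x(t), column y(t)) of the t-th pixel in raster-scan order.

    x(t) is the quotient and y(t) the remainder of t divided by the width w.
    """
    return t // w, t % w

# For a 4-pixel-wide image, pixel t = 6 lies at row 1, column 2.
print(raster_position(6, 4))  # (1, 2)
```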
We assume V t is generated from a probability distribution p(v t |v t−1 , θ m , m) depending on an unknown model m ∈ M and unknown parameters θ m ∈ Θ m . (For t = 0, we assume V 0 follows p(v 0 |θ m , m).) We define m and θ m in the following.
Definition 1.
Let s_λ denote the block consisting of all pixel indices {(x, y) | 0 ≤ x < h, 0 ≤ y < w}, and for each block, let s_{(00)}, s_{(01)}, s_{(10)}, and s_{(11)} denote its four quadrant sub-blocks; for example, s_{(01)} represents the indices of the upper right region. In a similar manner, s_{(01)(11)} denotes a quadrant of s_{(01)}, and so on recursively, down to depth d_max. Let S denote the set of all such blocks. It should be noted that the cardinality |s| for each s ∈ S represents the number of pixels in the block.
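The pixel set of a block can be sketched as follows (an illustrative sketch with our own encoding of the quadrant labels; in particular, the convention that the first bit of a label selects the vertical half and the second bit the horizontal half is our assumption):

```python
def block_pixels(labels, d_max):
    """Pixel-index set of the block s_{labels} in a 2^d_max x 2^d_max image.

    `labels` is a sequence of quadrant labels (x_i, y_i) in {0,1}^2; we assume
    (illustratively) that x_i selects the vertical half and y_i the horizontal
    half at each depth.  s_lambda (labels = ()) is the whole image.
    """
    size = 2 ** d_max
    x0 = y0 = 0
    for (xi, yi) in labels:
        size //= 2           # each quadrant halves the side length
        x0 += xi * size
        y0 += yi * size
    return {(x0 + i, y0 + j) for i in range(size) for j in range(size)}
```

Under this encoding, |s_λ| = 2^{2 d_max} and each quadrant has a quarter of its parent's pixels, matching the remark on |s| above.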

Definition 2.
We define the model m as a quadtree whose nodes are elements of S. Let M denote the set of the models. Let S m ⊂ S, L m ⊂ S and I m ⊂ S denote the set of the nodes, the leaf nodes and the inner nodes of m ∈ M, respectively. Let U m ⊂ S m denote the set of nodes that have less than four children. Then, U m corresponds to a pattern of variable block size segmentation, as shown in Figure 1.

Definition 3.
Each node s ∈ U_m of the model m has a parameter θ^m_s whose parameter space is Θ^m_s. We define θ_m as the tuple of parameters {θ^m_s}_{s∈U_m}, and let Θ_m denote its space.
Notably, the improper tree requires fewer parameters than an equivalent model represented by a proper tree with added dummy child nodes. See the following example.

Example 2.
For d_max = 2, consider a model represented by the left-hand image in Figure 2. It has three parameters: θ_{s_λ}, θ_{s_{(00)}}, and θ_{s_{(10)}}. An equivalent model can be represented by the proper quadtree shown on the right-hand side of Figure 2 if θ_{s_{(01)}} = θ_{s_{(11)}} happens to hold. However, it requires four parameters: θ_{s_{(00)}}, θ_{s_{(01)}}, θ_{s_{(10)}}, and θ_{s_{(11)}}. Therefore, it leads to inefficient learning.

Under the model m ∈ M and the parameters θ_m ∈ Θ_m, we assume that the t-th pixel value V_t is generated as follows.

Assumption 1.
We assume that

p(v_t | v^{t−1}, θ_m, m) = p(v_t | v^{t−1}, θ^m_s),

where s is the minimal block that satisfies (x(t), y(t)) ∈ s ∈ U_m (in other words, s is the deepest node of m that contains (x(t), y(t))). For t = 0, we assume the similar condition p(v_0 | θ_m, m) = p(v_0 | θ^m_s).
Thus, the pixel value V t given the past sequence V t−1 depends only on the parameter of the minimal block s that contains V t . Note that we do not assume a specific form of p(v t |v t−1 , θ m s ) at this point. For example, we can assume the Bernoulli distribution for V = {0, 1} and also the Gaussian distribution (with an appropriate normalization and quantization) for V = {0, 1, . . . , 255}.
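The lookup of the minimal block s in Assumption 1 can be sketched as follows (a sketch under our own illustrative encoding of nodes as tuples of quadrant labels; `U_m` is simply a set of such tuples here):

```python
def minimal_block(x, y, d_max, U_m):
    """Return the deepest node s in U_m whose block contains pixel (x, y).

    Nodes are identified by tuples of quadrant labels (our own illustrative
    encoding); U_m is a set of such tuples.
    """
    # Label path from the root to the depth-d_max leaf containing (x, y).
    path = [()]
    size = 2 ** d_max
    x0 = y0 = 0
    for _ in range(d_max):
        size //= 2
        xi, yi = (x - x0) // size, (y - y0) // size
        x0, y0 = x0 + xi * size, y0 + yi * size
        path.append(path[-1] + ((xi, yi),))
    # The minimal (deepest) block is the longest prefix present in U_m.
    for s in reversed(path):
        if s in U_m:
            return s
    raise ValueError("U_m does not cover pixel (x, y)")
```

For example, with U_m = {s_λ, s_{(00)}}, pixels inside the upper-left quadrant resolve to s_{(00)} and all other pixels resolve to the root block s_λ.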

The Bayes Code for Proposed Model
Since the true m and θ_m are unknown, we assume prior distributions p(m) and p(θ_m | m). Then, we estimate the true generative probability p(v_t | v^{t−1}, θ_m, m) by q(v_t | v^{t−1}) under the Bayes criterion in statistical decision theory (see, e.g., [10]). Subsequently, we use q(v_t | v^{t−1}) as the coding probability of an entropy code such as [16]. Such codes are known as Bayes codes [11] in information theory. The expected code length of the Bayes code converges to the entropy of p(v_t | v^{t−1}, θ_m, m) for sufficiently large data lengths, and its convergence speed achieves the theoretical limit [12]. The Bayes code has shown remarkable performance in text compression (e.g., [13]).
The optimal coding probability of the Bayes code for v_t is derived as follows, according to the general formula in [11].

Proposition 1.
The optimal coding probability q*(v_t | v^{t−1}) under the Bayes criterion is given by

q*(v_t | v^{t−1}) = ∑_{m∈M} p(m | v^{t−1}) ∫_{Θ_m} p(v_t | v^{t−1}, θ_m, m) p(θ_m | v^{t−1}, m) dθ_m.    (4)

We call q*(v_t | v^{t−1}) the Bayes-optimal coding probability.
Proposition 1 implies that we should use a coding probability that is a weighted mixture of p(v_t | v^{t−1}, θ_m, m) over every block segmentation pattern m and parameters θ_m, weighted according to the posteriors p(m | v^{t−1}) and p(θ_m | v^{t−1}, m). (For t = 0, p(v_0 | θ_m, m) is mixed with weights according to the priors p(m) and p(θ_m | m), which corresponds to the initialization of the algorithm.) Notably, although (4) has a similar form to Formula (5) in [1], M is now generalized from the set of proper quadtrees to the set of improper quadtrees.

Polynomial Order Algorithm to Calculate Bayes-Optimal Coding Probability
Unfortunately, the Bayes-optimal coding probability (4) contains a computationally hard calculation. (Herein, we assume that the integral with respect to θ_m in (4) can be computed feasibly; examples of feasible settings will be described in the next section.) The summation cost for m increases exponentially with respect to d_max. Therefore, we propose a polynomial order algorithm to calculate (4) without loss of optimality by applying a theory of probability distribution for general rooted trees [15] to the improper quadtree m. In this section, we focus on the procedure of the constructed algorithm. Its validity is described in Appendix A. First, we assume the following prior distributions as p(m) and p(θ_m | m).

Assumption 2.
Let z^m_s = (z^m_{s s′})_{s′∈Ch(s)} ∈ {0,1}^4 denote the division pattern of a node s ∈ S_m, where z^m_{s s′} = 1 if and only if the child s′ is divided. Then, we assume

p(m) = ∏_{s∈S_m} η_s(z^m_s),

where the η_s(·) are given hyperparameters. Intuitively, η_s(z^m_s) represents the conditional probability that s has the block division pattern z^m_s under the condition that s ∈ S_m. The above prior actually satisfies the condition ∑_{m∈M} p(m) = 1. Although this is proved for any rooted tree in [15], we briefly describe a proof restricted to our model in Appendix A to make this paper self-contained. Note that the above assumption does not restrict the expressive capability of the general prior in the sense that every model m ∈ M can still be assigned a non-zero probability p(m) > 0.
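Sampling a model from this prior can be sketched as follows (a sketch under our own node encoding; the uniform η_s(z) = 1/2^4 matches the setting later used in the experiments, but `eta` could be any per-node distribution):

```python
import random

def sample_model(s=(), depth=0, d_max=2):
    """Sample an improper quadtree m from p(m) = prod_{s in S_m} eta_s(z_s).

    Each node draws its division pattern z_s in {0,1}^4, here uniformly
    (eta_s(z) = 1/2^4); each divided child is recursed into.  Nodes at depth
    d_max are leaves of the perfect quadtree and cannot be divided further.
    Returns the node set S_m as a set of label tuples (our own encoding).
    """
    nodes = {s}
    if depth == d_max:
        return nodes
    z = tuple(random.randint(0, 1) for _ in range(4))   # z_s ~ eta_s
    for bit, label in zip(z, [(0, 0), (0, 1), (1, 0), (1, 1)]):
        if bit:  # child `label` is divided: it becomes a node of m
            nodes |= sample_model(s + (label,), depth + 1, d_max)
    return nodes
```

Any node with at least one undivided child (or no children) belongs to U_m and carries a parameter, which is how the sampled tree encodes a variable block size segmentation.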
Assumption 3.
Moreover, for any m, m′ ∈ M, s ∈ U_m ∩ U_{m′}, and θ_s ∈ Θ_s, we assume that

p(θ_m | m) = ∏_{s∈U_m} p(θ^m_s | m)    and    p(θ_s | m) = p(θ_s | m′) =: p(θ_s).

Therefore, each element θ^m_s of the parameters θ_m depends only on s, and it is independent of both the other elements and the model m.
From Assumptions 1 and 3, the following lemma holds.

Lemma 1.
For any m, m′ ∈ M, let s_t ∈ U_m and s′_t ∈ U_{m′} denote the minimal nodes that satisfy (x(t), y(t)) ∈ s_t ∈ U_m and (x(t), y(t)) ∈ s′_t ∈ U_{m′}, respectively. If s_t = s′_t =: s and z^m_{s_t} = z^{m′}_{s′_t} =: z_s, that is, they are the same block and their division patterns are also the same, then

∫ p(v_t | v^{t−1}, θ_m, m) p(θ_m | v^{t−1}, m) dθ_m = ∫ p(v_t | v^{t−1}, θ_{m′}, m′) p(θ_{m′} | v^{t−1}, m′) dθ_{m′}.

Hence, we represent this quantity by q̃(v_t | v^{t−1}, s, z_s) because it does not depend on m but only on (s, z_s). Lemma 1 means that the optimal coding probability for v_t depends on the minimal block s that contains v_t and its division pattern z_s. Therefore, (4) can be calculated as a mixture of q̃(v_t | v^{t−1}, s, z_s) weighted by the posterior probability of each (s, z_s). Finally, the Bayes-optimal coding probability q*(v_t | v^{t−1}) can be calculated by a recursive function for nodes on a path of the perfect quadtree on S. The definition of the path is the same as that in [1].
Let S_t denote the set of nodes which contain (x(t), y(t)). They construct a path from the leaf node s_{(x_1 y_1)(x_2 y_2)···(x_{d_max} y_{d_max})} = {(x(t), y(t))} to the root node s_λ on the perfect quadtree of depth d_max on S, as shown in Figure 4. In addition, let s_ch ∈ S_t denote the child node of s ∈ S_t on that path.

Definition 6.
We define the following recursive function q(v_t | v^{t−1}, s) for s ∈ S_t:

q(v_t | v^{t−1}, s) := ∑_{z_s ∈ {0,1}^4} η_s(z_s | v^{t−1}) × { q(v_t | v^{t−1}, s_ch) if z_{s s_ch} = 1; q̃(v_t | v^{t−1}, s, z_s) if z_{s s_ch} = 0 },    (9)

where, for the deepest node s = {(x(t), y(t))}, which has no child on the path, the mixture contains only the q̃ terms. The weight η_s(z_s | v^t) is also recursively updated for s ∈ S_t as follows:

η_s(z_s | v^t) := η_s(z_s | v^{t−1}) q(v_t | v^{t−1}, s_ch)^{1[z_{s s_ch}=1]} q̃(v_t | v^{t−1}, s, z_s)^{1[z_{s s_ch}=0]} / q(v_t | v^{t−1}, s),    (10)

and η_s(z_s | v^t) := η_s(z_s | v^{t−1}) for s ∉ S_t. Consequently, the following theorem holds.
Theorem 1. The Bayes-optimal coding probability q*(v_t | v^{t−1}) for the proposed model is calculated by

q*(v_t | v^{t−1}) = q(v_t | v^{t−1}, s_λ).

Although Theorem 1 is proved by applying Corollary 2 of Theorem 7 in [15], we briefly describe a proof restricted to our model in Appendix A to make this paper self-contained. Theorem 1 means that the summation with respect to m ∈ M in (4) can be replaced by the summation with respect to s ∈ S_t and z_s ∈ {0,1}^4, which costs only O(2^4 d_max). The proposed algorithm recursively calculates a weighted mixture of the coding probability q̃(v_t | v^{t−1}, s, z_s) for the case where block s is not divided at s_ch (i.e., z_{s s_ch} = 0) and the coding probability q(v_t | v^{t−1}, s_ch) for the case where block s is divided at s_ch (i.e., z_{s s_ch} = 1).
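The recursion of Definition 6 and Theorem 1 can be sketched as follows for a single pixel (a sketch under our own data layout; the node encoding, the dictionaries, and the `q_tilde` callback are ours, and the block-wise predictive q̃ is supplied externally because its form depends on the model assumed within blocks):

```python
def bayes_mixture_on_path(path, eta, q_tilde):
    """Mix coding probabilities along the path S_t for one pixel.

    `path` lists the nodes of S_t from the root s_lambda (path[0]) down to
    the deepest node (path[-1]); nodes are label tuples.  eta[s][z] holds the
    weight eta_s(z | v^{t-1}) for z in {0,1}^4, and q_tilde(s, z) returns the
    block-wise predictive q~(v_t | v^{t-1}, s, z).  Returns the mixture
    q(v_t | v^{t-1}, s_lambda) and the updated weights eta_s(z | v^t).
    """
    quadrants = [(0, 0), (0, 1), (1, 0), (1, 1)]
    q = {}
    for i in range(len(path) - 1, -1, -1):        # from the deepest node up
        s = path[i]
        if i == len(path) - 1:
            # deepest node: no child on the path, mix only the q~ terms
            q[s] = sum(eta[s][z] * q_tilde(s, z) for z in eta[s])
        else:
            k = quadrants.index(path[i + 1][-1])  # which child lies on the path
            q[s] = sum(eta[s][z] * (q[path[i + 1]] if z[k] else q_tilde(s, z))
                       for z in eta[s])
    new_eta = {s: dict(ws) for s, ws in eta.items()}
    for i, s in enumerate(path):                  # posterior update, cf. (10)
        for z in eta[s]:
            if i < len(path) - 1 and z[quadrants.index(path[i + 1][-1])]:
                num = q[path[i + 1]]              # block divided at s_ch
            else:
                num = q_tilde(s, z)               # block not divided at s_ch
            new_eta[s][z] = eta[s][z] * num / q[s]
    return q[path[0]], new_eta
```

Only the d_max + 1 nodes on the path are touched and each mixes 2^4 patterns, which is the O(2^4 d_max) cost stated in Theorem 1.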

Experiments
In this section, we perform four experiments. Three of them are similar to the experiments in [1]; the fourth one is newly added. In Experiments 1, 2, and 3, we assume V = {0, 1}, which is the simplest setting, to focus on the effect of the improper quadtrees. In Experiment 4, we assume V = {0, 1, ..., 255} to show that our method is also applicable to grayscale images. The purpose of the first experiment is to confirm the Bayes optimality of q(v_t | v^{t−1}, s_λ) for synthetic images generated from the proposed model. The purpose of the second experiment is to show an example image suitable for our model. The purpose of the third experiment is to compare the average coding rates of our proposed algorithm with those of current image coding procedures on real images. The purpose of the fourth experiment is to show that our method is applicable to grayscale images.
In Experiments 1 and 2, p(v_t | v^{t−1}, θ_m, m) is the Bernoulli distribution Bern(v_t | θ^m_s) for the minimal s that satisfies (x(t), y(t)) ∈ s ∈ U_m. Each element of θ_m is i.i.d. according to the beta distribution Beta(θ | α, β), which is the conjugate prior of the Bernoulli distribution. Therefore, the integral in (4) has a closed form. The hyperparameter η_s(z) of the model prior is η_s(z) = 1/2^4 for every s ∈ S and z ∈ {0,1}^4, and the hyperparameters of the beta distribution are α = β = 1/2. For comparison, we used the previous method based on proper quadtrees, whose hyperparameters are the same as in the experiments in [1], and the standard methods known as JBIG [17] and JBIG2 [18].
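The closed form of the integral in (4) under the Beta-Bernoulli pair is the familiar posterior predictive, which with α = β = 1/2 is the Krichevsky-Trofimov estimator; a minimal sketch (the function name and count-based interface are ours):

```python
def beta_bernoulli_predictive(n1: int, n0: int,
                              alpha: float = 0.5, beta: float = 0.5) -> float:
    """Posterior predictive P(v_t = 1) under Bern(theta) with a Beta(alpha,
    beta) prior, after observing n1 ones and n0 zeros in the block.

    The integral over theta reduces to (n1 + alpha) / (n1 + n0 + alpha + beta);
    alpha = beta = 1/2 gives the Krichevsky-Trofimov estimator.
    """
    return (n1 + alpha) / (n1 + n0 + alpha + beta)

# Before seeing any data the predictive is 1/2; it then tracks the counts.
print(beta_bernoulli_predictive(0, 0))  # 0.5
print(beta_bernoulli_predictive(3, 1))  # 0.7
```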

Experiment 1
The setting of Experiment 1 is as follows. The width and height of the images are w = h = 2^{d_max} = 64. We generate 1000 images according to the following procedure.

1.

Generate m according to p(m).

2.

Generate θ^m_s according to p(θ^m_s | m) for s ∈ U_m.

3.

Generate the pixel values v_0, v_1, ..., v_{hw−1} according to p(v_t | v^{t−1}, θ_m, m).
Examples of the generated images are shown in Figure 5. Subsequently, we compress these 1000 images. The size of the image is saved in the header of the compressed file using 4 bytes. The coding probability calculated by the proposed algorithm is quantized into 2^16 levels and passed to the range coder [16]. Table 1 shows the coding rates (bit/pel) averaged over all the images. Our proposed code has the minimum coding rate, as expected from its Bayes optimality.
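The quantization step can be sketched as follows (a sketch of one common convention; the exact interface expected by the range coder [16] may differ, and the clamping rule here is our assumption):

```python
def quantize_probability(p: float, bits: int = 16) -> int:
    """Quantize a coding probability to an integer frequency in [1, 2^bits - 1].

    Clamping to at least 1 keeps every symbol encodable and clamping below
    2^bits keeps the frequency representable; this is a common convention,
    not necessarily the exact one used with the range coder [16].
    """
    levels = 1 << bits
    return min(max(int(p * levels), 1), levels - 1)
```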

Experiment 2
In Experiment 2, we compress camera.tif from [19], binarized with a threshold of 128. The settings of the header and the range coder are the same as those of Experiment 1. Figure 6 visualizes the maximum a posteriori (MAP) estimates m_MAP = argmax_m p(m | v^{hw−1}) based on the improper quadtree model and the proper quadtree model [1], which are by-products of the compression. They are obtained by applying Theorem 3 in [15] and the algorithm in Appendix B of the preprint of the full version of [15], which is uploaded on arXiv. The improper quadtree represents the non-stationarity with fewer regions (i.e., fewer parameters) than the proper quadtree [1]. Table 2 shows that the coding rate of our proposed model for camera.tif is lower than those of the previous model based on the proper quadtree [1] and JBIG [17] without any special tuning. However, JBIG2 [18] showed the lowest coding rate. An improvement of our method for real images will be described in the next experiment.

Experiment 3
In Experiment 3, we compare the proposed algorithm with the proper-quadtree-based algorithm [1], JBIG [17], and JBIG2 [18] on real images from [19]. They are binarized in a similar manner to Experiment 2. The settings of the header and the range coder are the same as those of Experiments 1 and 2. A difference from Experiments 1 and 2 is in the stochastic generative model p(v_t | v^{t−1}, θ_m, m) assumed on each block s. We assume another model in which the pixel value within a block follows a Bernoulli distribution whose parameter depends on the four neighboring past pixel values, so that each block s has 2^4 = 16 parameters θ^m_{s;0000}, θ^m_{s;0001}, ..., θ^m_{s;1111}. The results are shown in Table 3. The algorithms labeled Improper-i.i.d. and Proper-i.i.d. are the same as those in Experiments 1 and 2. The algorithms labeled Improper-Markov and Proper-Markov are the aforementioned ones. Improper-Markov outperforms the other methods from the perspective of average coding rates. The effect of the improper quadtree is probably amplified because the number of parameters for each block is increased. However, JBIG2 [18] still outperforms our algorithms for text images only. We consider that this is because JBIG2 [18] is designed for text images such as faxes, in contrast to our general-purpose algorithm. Note that our algorithm has room for improvement by tuning the hyperparameters α and β of the beta distribution for each of θ^m_{s;0000}, θ^m_{s;0001}, ..., θ^m_{s;1111}.
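The 16-context selection can be sketched as follows (an illustrative sketch; the exact choice of the four causal neighbors in the paper's Markov model is our assumption, and out-of-image neighbors are treated as 0):

```python
def context_index(v, x, y, w):
    """Index in {0, ..., 15} formed by four causal neighbors of pixel (x, y).

    `v` is the binary image flattened in raster-scan order with width w.  We
    illustratively take the west, north-west, north, and north-east pixels
    (an assumption, not necessarily the paper's neighborhood); neighbors
    outside the image are treated as 0.
    """
    def px(r, c):
        return v[r * w + c] if 0 <= r and 0 <= c < w else 0
    bits = (px(x, y - 1), px(x - 1, y - 1), px(x - 1, y), px(x - 1, y + 1))
    return bits[0] * 8 + bits[1] * 4 + bits[2] * 2 + bits[3]
```

Within a block, the index selects which of the 16 Bernoulli parameters θ^m_{s;0000}, ..., θ^m_{s;1111} governs the current pixel.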

Experiment 4
Through Experiment 4, we show that our method is applicable to grayscale images. Herein, we assume two types of stochastic generative models p(v_t | v^{t−1}, θ_m, m) within the blocks of the proper quadtree and the improper quadtree: the first one is an i.i.d. model within each block, and the second one is an autoregressive (AR) model within each block. The results are shown in Table 4. (The values for previous studies [2,4,20,21] are cited from [21].) The coding rates of the proper-quadtree-based algorithm are improved by our proposed method for all the images in this data set and for both settings of the stochastic generative model assumed within blocks. This indicates the superiority of the improper-quadtree-based model over the proper-quadtree-based model. The method labeled Improper-AR showed an average coding rate lower than that of JPEG2000, averaged over all the images. It also showed an average coding rate lower than that of JPEG-LS, averaged over the natural images. Although it does not outperform recent methods such as MRP and Vanilc, we consider that this is because of the suitability of the stochastic generative model within blocks, which is out of the scope of this paper.

Conclusions
We proposed a novel stochastic model based on the improper quadtree, which effectively represents the variable block size segmentation of images. Then, we constructed a Bayes code for the proposed stochastic model. Moreover, we introduced an algorithm that implements it in polynomial order of the data size without loss of optimality. Experiments on both synthetic and real images demonstrated the flexibility of our stochastic model and the efficiency of our algorithm. As a result, the derived algorithm showed a better average coding rate than that of JBIG2 [18].

Acknowledgments:
We would like to thank the members of Matsushima laboratory for their meaningful discussions.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. Validity of the Prior Distribution for Models
Although a general proof for any rooted tree is described in [15] (please see also the preprint for the full version of [15] uploaded on arXiv), in the following, we briefly describe a proof restricted to our model to make this paper self-contained.
In (A3), M_{s′} denotes the set of subtrees whose root node is s′. The factorization from (A2) to (A3) holds because m in (A2) is determined by the subtrees m′ whose root nodes are in Ch(s_λ). The same idea is also detailed in Figure 4 in the preprint of the full version of [15], which is uploaded on arXiv. The underbraced parts (a) and (b) have the same structure except for the depth of the root node. We represent them by φ(s), which is a function of the root node s of the subtree. Subsequently, we obtain a recursion for φ(s). Therefore, the desired identity ∑_{m∈M} p(m) = 1 holds by recursively substituting φ(s) from the leaf nodes.
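In symbols, the recursion for φ(s) can be sketched as follows (our reconstruction of the omitted display; notation as in Assumption 2):

```latex
\varphi(s) \;=\; \sum_{z_s \in \{0,1\}^4} \eta_s(z_s)
  \prod_{\substack{s' \in \mathrm{Ch}(s) \\ z_{s s'} = 1}} \varphi(s')
```

At the deepest nodes no child can be divided, so φ(s) = ∑_{z_s} η_s(z_s) = 1 there; induction from the leaves then gives φ(s) = 1 for every s, and hence ∑_{m∈M} p(m) = φ(s_λ) = 1.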
Proof of Lemma 1. Let R(s, z_s) denote ∪_{s′∈Ch(s): z_{s s′}=0} s′, which is the region whose pixel values are generated according to θ_s under the division pattern z_s when (x(t), y(t)) ∈ s. Then, we obtain (A7), where ∝ means that the left-hand side is proportional to the right-hand side, regarding the variables except v_t as constants, and θ_{m\s} denotes the parameters θ_m except θ^m_s. Formula (A7) does not depend on m but only on (s, z_s).
Proof of Theorem 1. Although Theorem 1 is proved by applying Corollary 2 of Theorem 7 in [15] (please see also the preprint for the full version of [15] uploaded on arXiv), in the following, we briefly describe a proof restricted to our model to make this paper self-contained.
Theorem 1 is proved by induction. First, we assume (A8): p(m | v^{t−1}) = ∏_{s∈S_m} η_s(z^m_s | v^{t−1}), which is true for t = 0 because of Assumption 2 and will be proved later for t > 0. In addition, we define the function f(v_t | v^{t−1}, s, z_s) in (A9) to simplify the notation. Using this notation, we can represent p(v_t | v^{t−1}, m) as in (A10) and (A11). Since the right-hand side of (A11) has a similar form to the underbraced part (a) in (A2), we can define a recursive function q(v_t | v^{t−1}, s) that satisfies p(v_t | v^{t−1}) = q(v_t | v^{t−1}, s_λ), where q(v_t | v^{t−1}, s) is defined by the recursion in (A13). By substituting (A9), q(v_t | v^{t−1}, s) = 1 holds for s ∌ (x(t), y(t)) (or equivalently, for s ∉ S_t). Therefore, we need not calculate (A13) for s ∉ S_t, and (9) is derived by substituting (A9) again for s ∈ S_t. Lastly, we prove (A8). Using (A9), the updating Formula (10) can be generally represented as in (A14): η_s(z_s | v^t) = η_s(z_s | v^{t−1}) f(v_t | v^{t−1}, s, z_s) ∏_{s′∈Ch(s)} q(v_t | v^{t−1}, s′)^{z_{s s′}} / q(v_t | v^{t−1}, s). In this operation, (A15) is a telescoping product; that is, q(v_t | v^{t−1}, s) appears exactly once in each of the denominator and the numerator. Therefore, we can cancel them except for q(v_t | v^{t−1}, s_λ). (A16) follows from (A8), (A10), and (A11), where (A8) and (A11) are the induction hypotheses.