A Stochastic Model for Block Segmentation of Images Based on the Quadtree and the Bayes Code for It

In information theory, lossless compression of general data is based on an explicit assumption of a stochastic generative model on target data. However, in lossless image compression, researchers have mainly focused on the coding procedure that outputs the coded sequence from the input image, and the assumption of the stochastic generative model is implicit. In these studies, it is difficult to discuss the difference between the expected code length and the entropy of the stochastic generative model. We resolve this difficulty for a class of images that have non-stationarity among segments. In this paper, we propose a novel stochastic generative model of images by explicitly redefining the stochastic generative model implicit in a previous coding procedure. Our model is based on the quadtree, so that it effectively represents the variable block size segmentation of images. Then, we construct the Bayes code optimal for the proposed stochastic generative model. It requires a summation over all possible quadtrees, weighted by their posterior probabilities. In general, its computational cost increases exponentially with the image size. However, we introduce an efficient algorithm that calculates it in polynomial order of the image size without loss of optimality. As a result, the derived algorithm achieves a better average coding rate than that of JBIG.


Lossless Data Compression in Information Theory
In information theory, lossless compression for general data (not only images) is based on an explicit assumption of a stochastic generative model p(x) on target data x [1]. This assumption determines the theoretical limit, which is called entropy, of the expected code length for p(x). When p(x) is known, entropy codes such as Huffman code [2] and arithmetic code (see, e.g., [3]) achieve the theoretical limit. Then, researchers have considered a setup in which p(x) is unknown. One method to describe the uncertainty of p(x) is removing any specific assumption from the stochastic model p(x) (e.g., [4,5]). Another is considering a class of parameterized stochastic generative models p(x|θ) and assuming the class is known but the parameter θ is unknown. We focus on the latter method in this paper. Even for this setup, researchers have proposed a variety of stochastic generative model classes and coding algorithms achieving those theoretical limits, e.g., i.i.d. model class, Markov model class, context tree model class, and so on (see, e.g., [6][7][8][9][10]).
In this setup, the variety of stochastic generative models is described by that of the unknown parameters or model variables. For example, the i.i.d. model is determined by a vector θ whose elements are the occurrence probabilities of each symbol, and it is described as p(x|θ). The Markov model contains another variable c that represents the state or context, i.e., a string of the most recent symbols at each time point, and an occurrence probability vector θ_c is multiplied for each c. Then, the Markov model can be described as p(x|θ_c, c).
Further, when the order of the Markov model is unknown, it contains another variable k that represents the order, and the occurrence probabilities θ_{c_k} and the state variables c_k are multiplied for each k. Then, the Markov model with unknown order can be described as p(x|θ_{c_k}, c_k, k). Moreover, in the context tree model, the order depends on the context, and k is replaced by an unknown model variable m that represents a set of contexts. Finally, the context tree model can be described as p(x|θ_{c_m}, c_m, m). It should be noted that these parameters and model variables θ, k, and m are statistical parameters that govern the generation of the data x. Therefore, a coding algorithm achieving the theoretical limit of these stochastic generative models inevitably contains some kind of statistically optimal action, e.g., their statistical estimation θ̂(x), k̂(x), m̂(x) as values, the estimation of their posteriors p(θ|x), p(k|x), p(m|x) in a Bayesian setting, or model weighting with those posteriors. The explicit assumption of the stochastic generative model and the construction of the coding algorithm with the statistically optimal action have been successful in text compression. In fact, various text coding algorithms have been derived (e.g., [8][9][10]).

Lossless Image Compression as Image Processing
However, in most cases of lossless "image" compression, the main focus is on the construction of a coding procedure f(x) that just outputs the coded sequence from the input pixel values x, without the explicit assumption of a stochastic generative model. In the usual case, the coding algorithm has a tuning parameter a and is represented as f(x; a). This tuning parameter a is adaptively tuned to the pixel values x, and we express this tuning method as ã(x). Then, the coded sequence f(x; ã(x)) from x is uniquely determined.
Therefore, the variety of the coding procedures is described as that of the tuning parameters and the tuning methods. More specifically, we give a brief review of a type of lossless image coding called predictive coding. Most predictive coding procedures have the form f(x^{t-1}; a, b) with two parameters a and b. Here, a is a parameter of the predictor, which predicts the next pixel value x_t from the already compressed pixels x^{t-1} at time t, and b is a parameter that determines the assignment of the code length to the predictive error sequence. Note that the assignment of the code length can be represented by a vector whose elements sum to 1, and it is sometimes called a "probability". However, it does not represent the occurrence probability of the pixel value x_t in an explicitly assumed stochastic generative model. Therefore, in this paper, we call it the code length assign vector to distinguish the two. Then, the predictive error sequence and the code length assign vector are input to an entropy code such as the arithmetic code [3]. For example, JPEG-LS [11] uses three predictors that are switched according to the neighboring pixels. This can be regarded as a ∈ {1, 2, 3} corresponding to the index of the three predictors, and the rule to switch them is represented by ã(x^{t-1}). The code length assign vector of JPEG-LS [11] is represented by a two-sided geometric distribution, which is tuned by the past sequence x^{t-1}. This can be regarded as b being a parameter of the two-sided geometric distribution and b̃(x^{t-1}) being its tuning method. In other studies [12][13][14][15][16][17], the authors proposed coding procedures f(x^{t-1}; a, b, c_a) in which the coefficients c_a of each linear predictor are tuned by a certain method c̃_a(x^{t-1}), e.g., the least squares method or the weighted least squares method.
In [18,19], the authors proposed coding procedures f(x^{t-1}; a, b, c_a, w) in which multiple predictors are combined according to another tuning parameter w that represents the weight of each predictor. Regarding the code length assign vector, Matsuda et al. [20] dealt with a procedure f(x^{t-1}; a, b, c_a, d) in which the code length assign vector is represented by the generalized Gaussian distribution, which has another tuning parameter d. (This notation is just for explaining the idea of the previous studies; it does not completely match the notation of each paper, and it does not contain all of the tuning parameters of each procedure.) One of the latest studies constructing a complicated coding procedure was reported by Ulacha et al. [21], in which numerous tuning parameters are tuned through careful experiments. Lossless image compression using deep learning (see, e.g., [22]) can be regarded as one of the coding procedures with a huge number of pre-trained tuning parameters.
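As a toy illustration of the generic form f(x^{t-1}; a, b) reviewed above, the following sketch separates a predictor-switching rule (playing the role of ã(x^{t-1})) from a two-sided geometric code length assign vector (playing the role of b). Both functions are purely illustrative assumptions of this sketch; they are not the actual rules of JPEG-LS or any other cited procedure.

```python
def switch_predictor(west, north):
    """A toy stand-in for the tuning rule a~(x^{t-1}): choose a predictor index
    a in {1, 2, 3} from the neighbouring pixels (illustrative only; this is not
    the actual switching rule of JPEG-LS)."""
    if west is None:
        return 2                      # no west neighbour: predict from the north pixel
    if north is None:
        return 1                      # no north neighbour: predict from the west pixel
    return 3                          # otherwise: predict with the average of the two

def code_length_assign(theta=0.6, emax=8):
    """A toy code length assign vector: a two-sided geometric shape over the
    prediction errors -emax..emax, normalised so the elements sum to 1
    (theta plays the role of the tuning parameter b)."""
    weights = {e: theta ** abs(e) for e in range(-emax, emax + 1)}
    z = sum(weights.values())
    return {e: v / z for e, v in weights.items()}
```

The point of the sketch is only the shape of the interface: a is a discrete switch, b parameterizes a normalized vector that is handed to an entropy coder, and neither is a statistical parameter of a generative model.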
These studies have been practically successful. However, it should be noted that the tuning parameters a and b are not statistical parameters that govern the generation of the pixel values x, since they are introduced just to add degrees of freedom to the coding procedure. Even the parameter b, which superficially appears to be a parameter of a probability distribution, does not directly govern the generation of the pixel values x unless the coding procedure is extremely simple; it is just used to represent the code length assign vector with fewer variables. Therefore, the tuning of these parameters adaptively to x is not theoretically grounded in statistics or information theory. If our task were not lossless compression but, e.g., lossy compression or image super-resolution, this parameter tuning could be evaluated from various points of view, e.g., subjective evaluation, because such tasks have difficulty in defining the performance measure itself. In lossless image compression, by contrast, it should be evaluated from an information-theoretical perspective. These parameters should be tuned to decrease the difference between the expected code length and the entropy of the assumed stochastic generative model, and we have to say that any other tuning method is heuristic unless it pursues added value beyond the coding rate. However, such an information-theoretical evaluation is impossible because there is no explicit assumption of the stochastic generative model p(x), and the entropy, i.e., the theoretical limit of the expected code length, is itself not defined. This is a critical problem of the previous studies above. In addition, the more tuning parameters are introduced, the more difficult the construction of the tuning method becomes, since there is no confirmation of the optimality of each tuning method.

Lossless Image Compression on an Explicitly Redefined Stochastic Generative Model
However, there are some coding procedures f(x; a) [11][12][13][14][16][17][18][19][20][23] whose tuning parameter a can be regarded as a statistical parameter of an implicitly assumed stochastic generative model p(x|a) by changing the viewpoint. (In some of these studies, the assumption of a stochastic generative model is claimed, but the distinction between the stochastic generative model and the code length assign vector is ambiguous, and the discussion of the difference between the expected code length and the entropy of the stochastic generative model is insufficient.) Further, its parameter tuning method ã(x) can be regarded as a heuristic approximation of a statistically optimal estimation â(x) ≈ ã(x). Then, by explicitly redefining the stochastic generative model implicit behind the previous coding procedures, we can construct a stochastic generative model supported by their practical achievements. Moreover, if we derive a coding algorithm that minimizes the difference between the expected code length and the entropy of the constructed stochastic generative model under some criterion, this algorithm inevitably contains a statistically optimal action that is an improved version of ã(x). Such an action is not necessarily the estimation â(x) as a value; we can also estimate the posterior p(a|x) or mix the coding procedures weighted by the posterior p(a|x).
To derive such a coding algorithm, we can utilize the coding algorithms in text coding. Although image data are different from the text data, their stochastic generative models may contain a similar structure, and we may utilize the estimating algorithm in the text coding. In fact, we utilize the efficient algorithm for the context tree model class [8][9][10] for our stochastic generative model in this paper.
It is true that the coding algorithm constructed by this approach does not necessarily work for real images, since the optimality is guaranteed only for the stochastic generative model, and it is difficult to prove that real images are generated from the assumed stochastic generative model. Therefore, the constructed coding algorithm might be inferior to existing ones in the initial stage of this approach. However, we claim that this problem should be solved, as much as possible, not by heuristic tuning of the parameters in the coding procedure but by an explicit extension of the stochastic generative model. Such parameter tuning should be done in the final stage before implementation or standardization.
We already adopted this approach in the previous studies [24,25]. In [24], we proposed a two-dimensional autoregressive model and the optimal coding algorithm by interpreting the basic procedure [11][12][13]16,23] of the predictive coding as a stochastic generative model. In [25], we proposed a two-dimensional autoregressive hidden Markov model by interpreting the predictor weighting procedure around a diagonal edge [18] as a stochastic generative model. However, these stochastic generative models do not have enough flexibility to represent the non-stationarity among segments of an image. Therefore, we proposed a stochastic generative model for the non-stationarity in [26]. This paper is an extended version of it.

The Contribution of This Paper
Our target data are thus images in which the properties of the pixel values differ depending on the segment. In this paper, we achieve the following purposes.

1. We propose a stochastic generative model that effectively represents the non-stationarity among the segments in an image.
2. We derive the optimal code that minimizes the difference between the expected code length and the entropy of the proposed stochastic model under the Bayes criterion.
3. We derive an efficient algorithm for the implementation of the code without loss of optimality.
A trivial way to represent the non-stationarity in a stochastic generative model is to divide the image into fixed-size blocks and assume different probability distributions for each block. However, such a stochastic generative model is not flexible enough to represent segments smaller than the block size and is inefficient at representing segments larger than it.
On the other hand, one of the most efficient lossless image coding procedures [20] contains preprocessing that determines a quadtree representing a variable block size segmentation. Then, different predictors are assigned to each block to mitigate the non-stationarity. The quadtree is also used in various fields of image and video processing to represent variable block size segmentation, and its flexibility and computational efficiency have been reported in a number of studies, e.g., in H.265 [27]. However, the quadtree in these studies is a tuning parameter of a procedure. To the best of our knowledge, there are no studies that regard the quadtree as a statistical model variable m of a stochastic generative model p(x|m) governing the generation of the pixel values x and construct the optimal code that minimizes the difference between the expected code length and the entropy of that model under the Bayes criterion.
In this paper, we propose a novel stochastic generative model based on the quadtree, so that our model effectively represents the non-stationarity among segments by the variable block size segmentation. Then, we construct the optimal code that minimizes the difference between the expected code length and the entropy of the proposed stochastic generative model under the Bayes criterion. The optimal code is given by a weighted sum of all the possible model quadtrees m, and the optimal weight is given by its posterior p(m|x). In general, its computational cost increases exponentially for the image size. However, we introduce a computationally efficient algorithm to implement our code without loss of optimality, taking in the knowledge of the text coding [8][9][10]. A similar algorithm is also used for decision tree weighting in machine learning [28]. It is in contrast to the previous lossless image coding procedure [20] that fixes a single quadtree in the preprocessing, which statistically corresponds to some kind of model selection.
Although the main theme of this paper is lossless image compression, the substantial contribution of our results is the construction of the stochastic model. Therefore, the proposed stochastic model contributes to not only lossless image compression but also any other stochastic image processing such as recognition, generation, feature extraction, and so on.
The organization of this paper is as follows. In Section 2, we describe the proposed stochastic generative model. In Section 3, we derive the optimal code for the proposed model. In Section 4, we derive an efficient algorithm to implement the derived code.
In Section 5, we perform some experiments to confirm the flexibility of our stochastic generative model and the efficiency of our algorithm. In Section 6, we describe future works. Section 7 is the conclusion of this paper.

The Proposed Stochastic Model
At first, we define some notations. Note that the following notations are independent of those in Section 1. Let V denote the set of possible values of a pixel. For example, V = {0, 1} for binary images, V = {0, 1, . . . , 255} for grayscale images, and V = {0, 1, . . . , 255}^3 for color images. Let N denote the set of natural numbers, and let h ∈ N and w ∈ N denote the height and width of an image, respectively. Although our model is able to represent any rectangular image and its block segmentation, we assume that h = w = 2^{d_max} for some d_max ∈ N in the following, for simplicity of notation. Then, let V_t denote the random variable of the t-th pixel value in raster-scan order and v_t ∈ V denote its realized value. Note that V_t is at the x(t)-th row and the y(t)-th column, where x(t) is the quotient and y(t) the remainder of t divided by w. In addition, let V^t denote the sequence of pixel values V_0, V_1, . . . , V_t. Note that all indices start from zero in this paper.
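The raster-scan indexing above can be sketched as follows; `raster_to_rc` and `rc_to_raster` are hypothetical helper names introduced only for this illustration.

```python
def raster_to_rc(t, w):
    """Row x(t) and column y(t) of the t-th pixel in raster-scan order:
    t divided by w gives x(t) with remainder y(t) (all indices start from 0)."""
    return t // w, t % w

def rc_to_raster(x, y, w):
    """Inverse mapping: the raster index t of the pixel at row x, column y."""
    return x * w + y
```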
We consider that the pixel value V_t is generated from various probability distributions depending on a model m ∈ M and parameters θ_m ∈ Θ_m. Therefore, it is represented by p(v_t|v^{t-1}, θ_m, m) in general. Note that the model m and the parameters θ_m are unobservable and should be estimated in actual situations. The definitions of m and θ_m are as follows.
Therefore, it represents the indices of the upper-right region. The remaining quadrants s_{(01)}, s_{(10)}, and s_{(11)} are defined in a similar manner. It should be noted that the cardinality |s| of each s ∈ S represents the number of pixels in the block.

Definition 2.
We define the model m as a full quadtree whose nodes are elements of S. Let L_m ⊂ S and I_m ⊂ S denote the sets of the leaf nodes and the inner nodes of m, respectively. Then, L_m corresponds to a pattern of variable block size segmentation, as shown in Figure 1. Let M denote the set of full quadtrees (i.e., quadtrees in which every inner node has exactly four child nodes) whose depth is smaller than or equal to d_max. Under the model m ∈ M and the parameters θ_m ∈ Θ_m, we assume that the t-th pixel value v_t ∈ V is generated as follows.
Thus, the pixel value V_t depends only on the parameter of the block s that contains V_t, given the past sequence V^{t-1}.
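As an illustration of how the leaf blocks of m determine which parameter governs a pixel, the following sketch walks a model quadtree from the root to the leaf block containing a given pixel. The encoding of nodes as tuples of quadrant labels, and the representation of m by its set of divided (inner) nodes, are assumptions of this sketch, not the paper's notation.

```python
def leaf_block(inner, x, y, d_max):
    """Find the leaf block of the model quadtree that contains pixel (x, y).
    `inner` is the set of divided (inner) nodes; a node is a tuple of quadrant
    labels (x_i, y_i), and the empty tuple () is the root block s_lambda."""
    s = ()
    while s in inner and len(s) < d_max:
        d = len(s)
        # quadrant of (x, y) at depth d: the d-th most significant bits of x and y
        s = s + (((x >> (d_max - 1 - d)) & 1, (y >> (d_max - 1 - d)) & 1),)
    return s
```

For instance, with `inner = {()}` (only the root is divided), every pixel of a 4x4 image falls into one of the four depth-1 blocks.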

The Bayes Code for the Proposed Model
If we knew the true model m and the parameters θ_m, we would be able to compress the pixel value v_t down to the entropy of p(v_t|v^{t-1}, θ_m, m) by a well-known entropy code such as the arithmetic code. However, the true m and θ_m are unobservable. One reasonable solution is to estimate them and substitute the estimates m̂ and θ̂_m into p(v_t|v^{t-1}, θ_m, m). Then, we can use p(v_t|v^{t-1}, θ̂_m, m̂) as the coding probability of the entropy code.
However, there is another powerful solution, in which we assume prior distributions p(m) and p(θ_m|m). Then, instead of m and θ_m, we estimate the true coding probability p(v_t|v^{t-1}, θ_m, m) itself by q(v_t|v^{t-1}), so that q(v_t|v^{t-1}) minimizes the Bayes risk function based on the loss function between the expected code length of the entropy code using p(v_t|v^{t-1}, θ_m, m) and that using q(v_t|v^{t-1}). The code constructed by such a method is called the Bayes code (see, e.g., [29,30]).
It is known that the expected code length of the Bayes code converges to the entropy of the true stochastic model for sufficiently large data length t, and its convergence speed achieves the theoretical limit [30]. In fact, the Bayes code achieves remarkable performance in text compression (e.g., [8]). Therefore, we derive the Bayes code for the proposed stochastic model. According to the general formula in [29], the optimal coding probability for v_t in the scheme of the Bayes code is derived as follows.

Proposition 1. The optimal coding probability q*(v_t|v^{t-1}) that minimizes the Bayes risk function is

    q*(v_t|v^{t-1}) = Σ_{m ∈ M} p(m|v^{t-1}) ∫ p(v_t|v^{t-1}, θ_m, m) p(θ_m|v^{t-1}, m) dθ_m.   (5)

We call q*(v_t|v^{t-1}) the Bayes optimal coding probability.
Proposition 1 implies that we should calculate the posterior distributions p(m|v^{t-1}) and p(θ_m|v^{t-1}, m). Then, we should use as the coding probability a mixture of p(v_t|v^{t-1}, θ_m, m) over every block segmentation pattern m and parameters θ_m, weighted according to the posteriors p(m|v^{t-1}) and p(θ_m|v^{t-1}, m).

The Efficient Algorithm to Calculate the Coding Probability
Unfortunately, the Bayes optimal coding probability (5) contains computationally difficult calculations. As the depth d_max of the full quadtree increases, the amount of calculation for the sum with respect to m ∈ M increases exponentially. Moreover, the posterior p(m|v^{t-1}) does not have a closed-form expression in general. (Strictly speaking, a few other problems remain. Both the integral with respect to θ_m and the posterior p(θ_m|m, v^{t-1}) do not have closed-form expressions in general. These problems can be solved by various methods depending on the setting of p(v_t|v^{t-1}, θ_m, m) and p(θ_m|m), almost independently of our proposed model. Therefore, we describe an example of a feasible setting of p(v_t|v^{t-1}, θ_m, m) and p(θ_m|m) in the next section. Other settings are described in Section 6 as future works.) Similar problems have been studied in text compression, and efficient algorithms to calculate the coding probability have been constructed (see, e.g., [8][9][10]). In these algorithms, a weighted sum over context trees is calculated instead of over quadtrees. We apply this idea to our proposed model. In this section, we focus on describing the procedure of the constructed algorithm. Its validity is shown in Appendix A.
First, we assume the following priors on m and θ m .

Assumption 2.
We assume that each node s ∈ S has a hyperparameter g_s ∈ [0, 1], and the model prior p(m) is represented by

    p(m) = ∏_{s ∈ I_m} g_s ∏_{s' ∈ L_m} (1 − g_{s'}),   (6)

where g_s = 0 for s whose cardinality |s| equals 1, and the empty product equals 1.
The idea behind this form is to represent p(m) as a product of the probabilities that each block s is divided; such a probability is denoted by g_s in (6). Note that |s| = 1 means that the block s consists of only one pixel and cannot be divided. A proof that the above prior satisfies the condition ∑_{m ∈ M} p(m) = 1 is given in Appendix A. Note that the above assumption does not restrict the expressive capability of a general prior, in the sense that every model m can still be assigned a non-zero probability p(m) > 0.
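The normalization of the prior of Assumption 2 can be checked numerically for a small d_max: enumerating every full quadtree and summing the prior should give 1. In this sketch we assume, purely for brevity, that the split probability g_s depends only on the depth of s (the paper allows a separate g_s per node).

```python
from itertools import product

def all_trees(d, d_max):
    """All full quadtrees rooted at depth d: a leaf is None, an inner node is a
    4-tuple of child subtrees (every inner node has exactly four children)."""
    if d == d_max:
        return [None]                      # maximum depth: the node cannot be divided
    sub = all_trees(d + 1, d_max)
    return [None] + [tuple(c) for c in product(sub, repeat=4)]

def prior(tree, d, g):
    """p(m) in the product form of Assumption 2, with a depth-dependent split
    probability g[d] (g[d_max] = 0): inner nodes contribute g, leaves 1 - g."""
    if tree is None:
        return 1.0 - g[d]
    p = g[d]
    for child in tree:
        p *= prior(child, d + 1, g)
    return p
```

For d_max = 2 there are 1 + 2^4 = 17 full quadtrees, and their prior probabilities sum to 1 for any choice of the split probabilities.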
Therefore, each element θ_{m,s} of the parameters θ_m depends only on s and is independent of both the other elements and the model m.
From Assumptions 1 and 3, the following lemma holds.
Then, we represent it by q̃(v_t|v^{t-1}, s), because it depends not on m but only on s.
The proof of Lemma 1 is in Appendix A. Lemma 1 means that the optimal coding probability for v_t depends only on the leaf node block s that contains v_t, and it can be calculated as q̃(v_t|v^{t-1}, s) if s is known.
At last, the efficient algorithm to compute the Bayes optimal coding probability q*(v_t|v^{t-1}) is represented as an iteration of updating g_s and summing the functions q̃(v_t|v^{t-1}, s), weighted by g_s, over the nodes on a path of the complete quadtree on S.

Definition 4. Let S_t denote the set of nodes that contain (x(t), y(t)). They form a path from the leaf node s_{(x_1 y_1)(x_2 y_2)···(x_{d_max} y_{d_max})} = {(x(t), y(t))} to the root node s_λ on the complete quadtree of depth d_max on S, as shown in Figure 2. In addition, let s_child ∈ S_t denote the child node of s ∈ S_t on that path.

Definition 5. We define the following recursive function q(v_t|v^{t-1}, s) for s ∈ S_t:

    q(v_t|v^{t-1}, s) = q̃(v_t|v^{t-1}, s)   if |s| = 1,
    q(v_t|v^{t-1}, s) = (1 − g_{s|t−1}) q̃(v_t|v^{t-1}, s) + g_{s|t−1} q(v_t|v^{t-1}, s_child)   otherwise,   (10)

where g_{s|t} is also recursively updated from its initial value g_s as follows:

    g_{s|t} = g_{s|t−1} q(v_t|v^{t-1}, s_child) / q(v_t|v^{t-1}, s).   (11)

Then, the following theorem holds.

Theorem 1. The Bayes optimal coding probability q*(v_t|v^{t-1}) for the proposed model is calculated by

    q*(v_t|v^{t-1}) = q(v_t|v^{t-1}, s_λ).

The proof of Theorem 1 is in Appendix A. Theorem 1 means that the summation with respect to m ∈ M in (5) can be replaced by a summation with respect to s ∈ S_t, which costs only O(d_max). In a sense, (1 − g_{s|t−1}) can be regarded as the marginal posterior probability that the true block division stopped at s. Then, the proposed algorithm takes a mixture of the coding probabilities q̃(v_t|v^{t-1}, s), weighting such a case by (1 − g_{s|t−1}) and the other cases by g_{s|t−1}.
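The per-pixel recursion of Definitions 4 and 5 and Theorem 1 can be sketched for binary images as follows, using a Beta-Bernoulli closed form (as in the experiments of Section 5) for q̃(v_t|v^{t-1}, s). The initialization of every split probability to 1/2 and the encoding of nodes as tuples of quadrant labels are simplifying assumptions of this sketch.

```python
def bayes_code(img, d_max, alpha=0.5):
    """Sequential coding probabilities for a binary image: a sketch of the
    recursion (10), the update (11), and Theorem 1.  Each block uses the
    closed-form Beta(alpha, alpha)-Bernoulli estimator as q~(v_t|v^{t-1}, s);
    every split probability g_s is initialised to 1/2 (an assumed choice).
    Nodes are tuples of quadrant labels; () is the root s_lambda."""
    w = 2 ** d_max
    g = {}        # g_{s|t} for nodes above the single-pixel depth
    counts = {}   # counts[s] = (n0, n1): past pixel values observed inside block s
    probs = []
    flat = [v for row in img for v in row]             # raster-scan order
    for t, v in enumerate(flat):
        x, y = t // w, t % w
        # S_t: the path of nodes from the root down to the pixel (x(t), y(t))
        path = [()]
        for d in range(d_max):
            sh = d_max - 1 - d
            path.append(path[-1] + (((x >> sh) & 1, (y >> sh) & 1),))
        # bottom-up evaluation of (10): mix the block estimator with the child
        q = {}
        for s in reversed(path):
            n0, n1 = counts.get(s, (0, 0))
            tot = n0 + n1 + 2 * alpha
            tilde = ((n0 + alpha) / tot, (n1 + alpha) / tot)
            if len(s) == d_max:                        # deepest node: g_s = 0
                q[s] = tilde
            else:
                gs = g.get(s, 0.5)
                child = q[path[len(s) + 1]]
                q[s] = tuple((1 - gs) * tilde[b] + gs * child[b] for b in (0, 1))
        probs.append(q[()][v])                         # Theorem 1: q* = q(.|s_lambda)
        # posterior update (11) along the path, then the block counts
        for s in path[:-1]:
            g[s] = g.get(s, 0.5) * q[path[len(s) + 1]][v] / q[s][v]
        for s in path:
            n0, n1 = counts.get(s, (0, 0))
            counts[s] = (n0 + (1 - v), n1 + v)
    return probs
```

The cost per pixel is O(d_max): only the nodes on the path S_t are touched, rather than all models m ∈ M.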

Experiments
We performed three experiments. The purpose of the first experiment was to confirm the Bayes optimality of q(v_t|v^{t-1}, s_λ). Therefore, we used synthetic images randomly generated from the proposed model. The purpose of the second experiment was to demonstrate the flexibility of our model. Therefore, we used a well-known benchmark image. We also used the Bayes optimal code for fixed block size segmentation for comparison in these two experiments. (Let 2^d be the fixed block size. Such a model is derived by substituting g_s = 1 for s whose depth is smaller than d_max − d and g_s = 0 otherwise.) The purpose of the third experiment was to compare the average coding rates of our proposed algorithm with those of a current image coding procedure on real images.

Experiment 1
In Experiments 1 and 2, we assumed V = {0, 1}; in other words, we treated only binary images. p(v_t|v^{t-1}, θ_m, m) was assumed to be the Bernoulli distribution Bern(v_t|θ_{m,s}) for the s that satisfies (x(t), y(t)) ∈ s. Each element of θ_m was i.i.d. with the beta distribution Beta(θ|α, β), which is the conjugate prior of the Bernoulli distribution. Therefore, the integral in (5) had a closed form. The hyperparameter g_s of the model prior was g_s = 1/2 for every s ∈ S \ {s_λ} and g_{s_λ} = 1, and the hyperparameters of the beta distribution were α = β = 1/2.
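The closed form of the integral in (5) under this Beta-Bernoulli pair is the usual posterior predictive probability; `predictive` is a hypothetical helper name, with n0 and n1 the counts of zeros and ones observed so far in the block.

```python
def predictive(n0, n1, v, alpha=0.5, beta=0.5):
    """Posterior predictive probability of the next pixel value v under a
    Bernoulli likelihood with a Beta(alpha, beta) conjugate prior, after
    observing n0 zeros and n1 ones in the block: the closed form of the
    integral of Bern(v|theta) against the Beta posterior."""
    total = n0 + n1 + alpha + beta
    return (n1 + alpha) / total if v == 1 else (n0 + beta) / total
```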
The setting of Experiment 1 was as follows. The width and height of the images were w = h = 2^{d_max} = 64. Then, we generated 1000 images according to the following procedure.

1. Generate a model m according to p(m).
2. Generate θ_{m,s} according to p(θ_{m,s}|m) for s ∈ L_m.
3. Generate the pixel values v_0, v_1, . . . , v_{hw−1} according to p(v_t|v^{t-1}, θ_m, m).
Examples of the generated images are shown in Figure 3. Then, we compressed these 1000 images. The size of the image was saved in the header of the compressed file using 4 bytes. The coding probability calculated by the proposed algorithm was quantized into 2^16 levels and input to the range coder [31]. The coding rates (bit/pel) averaged over all the images are shown in Table 1. Our proposed code has the minimum coding rate, as expected from its Bayes optimality. Additionally, we compressed the images with a standard lossless binary image coder called JBIG [32]. It did not work well for the generated images, probably because JBIG [32] is designed not for synthetic images but mainly for real images such as faxes. A more detailed comparison was done in Experiment 3.

Experiment 2
In Experiment 2, we compressed the binarized version of camera.tif from [33], where the threshold of binarization was 128. The settings of the header and the range coder were the same as those of Experiment 1. Figure 4 visualizes the maximum a posteriori (MAP) estimate m_MAP = arg max_m p(m|v^{hw−1}), which was calculated as a by-product of the compression by the algorithm detailed in Appendix B. It shows that our proposed model has the flexibility to represent the non-stationarity among the regions. The coding rate for camera.tif is shown in Table 2. For this image, the proposed algorithm showed a better coding rate than JBIG [32].

Experiment 3
In Experiment 3, we compared the proposed algorithm with JBIG [32] on real images from [33]. They were binarized in a similar manner to Experiment 2. The settings of the header and the range coder were the same as those of Experiments 1 and 2. The results are shown in Table 3. The algorithm labeled Proposed 1 in Table 3 is the same as that in Experiments 1 and 2. In the algorithm labeled Proposed 2 in Table 3, we assumed that p(v_t|v^{t-1}, θ_m, m) is the Bernoulli distribution Bern(v_t|θ_{m,s; v_{t−w−1} v_{t−w} v_{t−w+1} v_{t−1}}), which depends on the neighboring four pixels. (If the indices go out of the image, we used the nearest past pixel in Manhattan distance.) In other words, there were 16 parameters θ_{m,s;0000}, θ_{m,s;0001}, . . . , θ_{m,s;1111} for each block s of model m, and one of them was chosen by the realized values v_{t−w−1}, v_{t−w}, v_{t−w+1}, and v_{t−1} in the past. Each parameter was i.i.d. with the beta distribution whose hyperparameters were α = β = 1/2. Proposed 2 outperforms JBIG [32] in terms of average coding rates, without any specialized tuning of the hyperparameters. On the other hand, JBIG [32] outperforms our algorithms for crosses and text. This is because JBIG [32] is designed for text images such as faxes, while our stochastic generative model is for images with non-stationarity among segments. The structure of text images should be represented not by the proposed quadtree-based stochastic generative model but by the stochastic model p(v_t|v^{t-1}, θ_m, m) in each block. Although refinement of p(v_t|v^{t-1}, θ_m, m) for target images is out of the scope of this paper, it is an important problem for the future (see the next section).
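The 4-bit context selection of Proposed 2 can be sketched as follows. Note that the border fallback here is simplified to 0; the experiment itself uses the nearest past pixel in Manhattan distance, as stated above.

```python
def context_index(img, x, y):
    """4-bit context from the north-west, north, north-east, and west
    neighbours of pixel (x, y) in a binary image; all four are past pixels in
    raster-scan order.  Out-of-image neighbours fall back to 0 here (a
    simplification of the Manhattan-distance rule used in the experiment)."""
    h, w = len(img), len(img[0])
    def px(i, j):
        return img[i][j] if 0 <= i < h and 0 <= j < w else 0
    nw, n = px(x - 1, y - 1), px(x - 1, y)
    ne, we = px(x - 1, y + 1), px(x, y - 1)
    # the bit order of the 16 contexts is an arbitrary choice of this sketch
    return (nw << 3) | (n << 2) | (ne << 1) | we
```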

Future Works
In this paper, we focused only on the stochastic representation of the non-stationarity among the segments. The discussion of the stochastic model p(v_t|v^{t-1}, θ_m, m) and the prior p(θ_m|m) to be assumed in each block is out of scope; this is the first future work. For example, our model also works with the pairs of categorical distribution and Dirichlet distribution, normal distribution and normal-gamma distribution, and two-dimensional autoregressive model and normal-gamma distribution [24]. Moreover, using approximate Bayesian estimation methods such as the variational Bayesian method, we expect that more complicated stochastic models (e.g., [25]) can be assumed.
The second future work is to apply our model to other stochastic image processing tasks: image recognition, image generation, image inpainting, feature extraction, etc. In particular, image generation and image inpainting may be especially suitable, because the whole structure of stochastic image generation is described in our model and the parameters of the stochastic model can be learned optimally.

Conclusions
We propose a novel stochastic model based on the quadtree, so that our model effectively represents the variable block size segmentation of images. Then, we construct a Bayes code for the proposed stochastic model. Moreover, we introduce an efficient algorithm to implement it in polynomial order of the data size without loss of optimality. As a result, the derived algorithm has a better average coding rate than that of JBIG [32].

Data Availability Statement: Publicly available datasets were analyzed in this study. These data can be found here: http://links.uwaterloo.ca/Repository.html (accessed on 30 July 2021).

Acknowledgments:
We would like to thank the members of Matsushima laboratory for their meaningful discussions.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. Validity of the Proposed Algorithm
Appendix A.1. The Property of the Model Prior p(m)

First, we prove the following lemma for a general case. Note that the empty product equals 1 as usual.
Lemma A1. Consider the k-ary complete tree T̄ with depth D, in which each node u has a parameter g_u ∈ [0, 1]. Let T denote the set of full subtrees of T̄ that contain the root node λ of T̄. Then, the following holds:

    ∑_{T ∈ T} ∏_{u ∈ I_T} g_u ∏_{u' ∈ L_T} (1 − g_{u'}) = 1,   (A1)

where L_T and I_T denote the sets of leaf nodes and inner nodes of T, respectively, and g_u = 0 for u whose depth is D.
Proof. Lemma A1 is proved by induction with respect to the depth D. Let [λ] denote the tree that consists of only the root node λ of T̄. When D = 0, T = {[λ]}, I_{[λ]} = ∅, and L_{[λ]} = {λ}. Hence, the left-hand side of (A1) equals 1 − g_λ = 1, since the empty product equals 1 and, by the assumption of the statement, g_u = 0 for u whose depth is D.
If we assume (A1) for D = d ≥ 0 as the induction hypothesis, then the following holds for D = d + 1.
Since each subtree T ∈ T \ {[λ]} is identified by its k sub-subtrees whose root nodes are the child nodes of λ, let λ_{child,i} denote the i-th child node of λ for 0 ≤ i ≤ k − 1 and T_{λ_{child,i}} denote the set of sub-subtrees whose root node is λ_{child,i}. Then, the summation in (A6) is factorized as follows.
Using (A1) for D = d as the induction hypothesis, each factor of the factorized summation equals 1. Therefore, Lemma A1 holds for any D.
Using this lemma, the following corollaries hold for our model.

Proof of Corollary A2.
Since each m ∈ {m' ∈ M | s ∈ L_{m'}} has the right-hand side of (A12) as a factor in its prior, this factor can be taken out of the sum. Then, factorizing the remaining sum in a similar manner to (A7)-(A9) and using Lemma A1 for the subtrees whose root nodes are outside A_s, Corollary A2 is proved.

Figure A1. The example for the proof of Corollary A2.
Appendix A.2. Proof of Lemma 1

Proof of Lemma 1.
where ∝ means that the left-hand side is proportional to the right-hand side, regarding the variables except v_t as constants, and θ_{m\s} denotes the parameters θ_m except θ_{m,s}. Here, we use Assumptions 1 and 3. As a result, Formula (A21) is independent of m.

Appendix A.3. Proof of Theorem 1

Proof of Theorem 1. We prove the following two equations simultaneously.
(A22) means that the posterior distribution of the model m has the same form as the prior. (A23) is equivalent to Theorem 1. They are proved by induction with respect to t. Therefore, the proof consists of the following four steps.
Step 3: In the following, we assume (A22) and (A23) for t = k as the induction hypotheses. Let r ∈ L_m satisfy (x(k), y(k)) ∈ r, and let S_k be as defined in Definition 4. Then, for t = k + 1, when |r| = 1, we substitute (11) and (10) in this order. Here, (A34) is given by the cancellation of the telescoping product.
In addition, it holds that ∑_{m ∈ {m' ∈ M | s ∈ L_{m'}}} p(m|v^k) = (1 − g_{s|k}) ∏_{s' ∈ A_s} g_{s'|k}, since the posterior p(m|v^k) has the same form as the prior p(m), so Corollary A2 can be applied.
Step 4: (A23) can be proved for t = k + 1 in a similar manner to the case where t = 0.

Appendix B. The Algorithm to Calculate m MAP
In this appendix, we derive the algorithm to calculate arg max_m p(m|v^t). First, max_m p(m|v^t) can be decomposed in a similar manner to the proof of Lemma A1, by replacing the sum with the max. We define a recursive function φ_t : S → R as follows.
We can calculate h_{s|t} and φ_t(s) simultaneously. Then, arg max_m p(m|v^t) is identified as the model m that satisfies s ∈ L_m ⇒ h_{s|t} = 0. Such a model can be found by backtracking from s_λ after the calculation of φ_t(s_λ) and h_{s_λ|t}.
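The max-product decomposition and backtracking described above can be sketched as follows, assuming the posterior of m factorizes into per-node split probabilities g_{s|t} (here a dict `g` with a default value, an assumption of this sketch); φ_t(s) and the stop decision corresponding to h_{s|t} are computed together.

```python
from math import prod

def map_tree(g, d_max, s=(), default=0.5):
    """Return phi_t(s) and the leaf blocks of the MAP quadtree below node s,
    where the posterior weight of a model is the product of g[s'] over its
    inner nodes and (1 - g[s']) over its leaves (g = 0 at depth d_max).
    Nodes are tuples of quadrant labels; () is the root s_lambda."""
    if len(s) == d_max:
        return 1.0, [s]                        # single-pixel node: forced leaf
    gs = g.get(s, default)
    sub = [map_tree(g, d_max, s + ((i, j),), default) for i in (0, 1) for j in (0, 1)]
    phi_split = gs * prod(phi for phi, _ in sub)
    if phi_split > 1.0 - gs:                   # h_{s|t} = 1: dividing s is MAP-optimal
        return phi_split, [leaf for _, leaves in sub for leaf in leaves]
    return 1.0 - gs, [s]                       # h_{s|t} = 0: s is a leaf of m_MAP
```

For example, if the root is likely divided but all deeper nodes are not, the MAP model has exactly the four depth-1 blocks as its leaves.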