Information Theoretic Modeling of High Precision Disparity Data for Lossy Compression and Object Segmentation

In this paper, we study the geometry data associated with disparity map or depth map images in order to extract easy to compress polynomial surface models at different bitrates, proposing an efficient mining strategy for geometry information. The segmentation, or partition of the image pixels, is viewed as a model structure selection problem, where the decisions are based on the implementable codelength of the model, akin to minimum description length for lossy representations. The intended usage of the extracted disparity map is to provide to the decoder the geometry information at a very small fraction from what is required for a lossless compressed version, and secondly, to convey to the decoder a segmentation describing the contours of the objects from the scene. We propose first an algorithm for constructing a hierarchical segmentation based on the persistency of the contours of regions in an iterative re-estimation algorithm. Then, we propose a second algorithm for constructing a new sequence of segmentations, by selecting the order in which the persistent contours are included in the model, driven by decisions based on the descriptive codelength. We consider real disparity datasets which have the geometry information at a high precision, in floating point format, but for which encoding of the raw information, in about 32 bits per pixels, is too expensive, and we then demonstrate good approximations preserving the object structure of the scene, achieved for rates below 0.2 bits per pixels.


Motivation
One of the most important types of information handled in modern imaging applications is the geometry of the scene and of the objects present in the scene. The depth maps convey the geometry of the scene and are needed as such as a separate data, to be explicitly encoded and transmitted in some application areas like industrial computer vision or robotics. In other applications, depth data might not be needed explicitly by the user, but it is still used as an intermediate variable helping in removing the redundancy in stereo and multiview image encoding. The disparity map for a stereo pair is proportional to the reciprocal depth map of the scene for an ideal fronto-parallel stereo optical system [1], and modeling the geometry based on depth or on disparity information satisfies the same goal of extracting relevant geometry information, hence, the algorithms that we present can be applied to both types of data. We consider here two aspects that are usually considered separately: the first is the compression of the geometry data, and the second is the mining of the geometry data for finding the relevant objects and parts of objects.
From the compression perspective, we consider in this paper high precision disparity data occurring in immersive media, which is under standardization in JPEG and MPEG working groups [2,3].
In the following, we give examples where this task is relevant. One example is the standardization of point cloud compression (PCC) with voxelized point clouds having very high precision, from 10 to 30 bits per coordinate. In one encoding methodology, using 2D projections of the point clouds to several planes, one gets high precision depth images that need to be encoded [3]. Another example in the light field compression standardized by JPEG Pleno Light Field [2] is that one needs to extract a simplified model of the disparity map from the known disparity map which might be available in floating point precision for the synthetic imagery. The cost of lossless encoding of the disparity map at the full precision may be justified only at very high overall bitrates. For lower bitrates, one needs to reduce the cost of disparity map data, either using a standard lossy encoder of the initial disparity map data, or by using a model based encoding of the disparity information, as we propose in this paper. For the encoding of synthetic images, which occurs in the gaming industry, the image content is generated using models from which the depth information is available at a high precision for all the points in the scene.

Related Work
Our paper considers two aspects that are extremely well studied separately, but which are seldom tackled together. The first one is a model-based disparity map compression, investigated in the image compression literature, and the second is depth image segmentation, which is investigated thoroughly in the pattern recognition literature, but usually without considering the performance of the implementable algorithms for compressing the segmentation. Related papers can be found in the literature dealing with information theory-based segmentation or with model-based depth map compression.

Methods for Disparity Map Compression
We consider in this paper the compression of high precision depth or disparity data, represented in floating point format or high precision integer format. There is work on the lossless compression of such data [4] but most depth compression papers were concerned with the early datasets of 8 bits per pixel raw formats, or on datasets where depth did not have enough precision to justify polynomial approximations beyond planar models. With high precision depth, one can get significantly better approximations by quadratic surface models, as we do here in this paper, where all polynomial models are quadratic surfaces; we show that very precise detection of the contours of the objects is obtained.
While there is a very rich literature on using polynomial models in the image compression standardization related literature, many of them deal with block-based processing of the image, where the segmentations are represented by using quadtree partitions and, hence, the boundaries of objects are not followed at pixel level. We are interested in region based compression, where the arbitrary-shaped contours in the segmentation are encoded explicitly. We use for arbitrary shape region coding the method dubbed crack-edge-region-value (CERV) in [5], which is a context based method for encoding the contours of arbitrary regions, given in the form of crack-edges (elementary contour element between any two neighbor pixels). In [5], the constant disparity values inside each region are also encoded by a context-based algorithm, resulting in an overall lossless compression of the disparity image. A lossy method for disparity maps using arbitrary-shaped regions and planar models inside each region was proposed in [6], with a further refinement in [7] to include a selection of the best model between a planar model and a constant model, inside each region. In [8], the planar models and the underlying segmentation are obtained through Markov random field modeling, by optimizing an energy function, where the number of planar models is an important parameter, and where examples are shown for up to 50 planar models. A wavelet transform approach to depth modeling and coding is presented in [9], where the contours of the regions and the planar variations in each region are modeled using a flexible model.
The most recent JPEG activity for disparity map compression deals with the compression of breakpoints for improving the compressibility of images having high discontinuities, along the lines of the breakpoint adaptive discrete wavelet transform [10], which was illustrated in [11] for the compression of light field images.

Methods for Image Segmentation and Edge Detection
Edge (or contour) detection has a long history [12,13] and is an essential part in modern computer vision systems. Edges provide useful structural information about a color or depth image. Recent works on edge detection [14][15][16] demonstrate that the field of research underwent a rapid development during the last decade.
A classical and perhaps the most well-known method for edge detection is the Canny Edge Detector [13] which applies numerical optimization relying on a general mathematical formulation of detection and localization criteria. Recent approaches are based on machine learning models. Dollar and Zitnick [14] train structured random decision forests for edge detection. DeepContour [17] utilizes a basic CNN architecture and learns to classify input image patches into different contour shape classes. Xie and Tu [15] formulate edge detection as an image-to-image prediction problem and solve it using fully convolutional neural networks in their "Holistically-Nested Edge Detection" framework. More recently, Liu et al. [16] designed a multiscale network architecture to generate rich hierarchical representations.
Superpixels are perceptually meaningful regions in an image. In several computer vision tasks, superpixels are found to be useful, e.g., for object recognition [18,19], scene labeling [20] and object localization [21]. Selective Search [18] starts with an over-segmentation of an input image generated by the graph-based segmentation in [22]. According to certain similarity measures defined over color, shape and texture of regions, neighbors are iteratively merged into larger regions. A state of the art method for objects proposal is multiscale combinatorial grouping (MCG) [19], which performs grouping of multiscale regions and relies on a fast normalized cut algorithm, and was shown to perform very well on the benchmarks for color image segmentation.
A superpixel has to be confined within a single object, therefore, superpixel segmentation has to be loyal to object boundaries. Unlike semantic segmentation, which aims to segment only the objects from a predefined set of classes (leaving some parts of the image unlabeled), superpixel segmentation aims to include every single pixel in one of the segments. This coincides well with our goals because we aim to interpret and compress the entire image. Moreover, by combining the individual superpixels into larger superpixels, it is possible to obtain a segmentation hierarchy.
Superpixel segmentation and edge detection are related in the sense that they both attempt to capture the edge information in an image. On the other hand, it should be noted that edge detection does not necessarily lead to closed contours and, therefore, it is not always easy to recover regions from an edge map. Arbelaez et al. [23] provide a unified approach to these two problems, which we use in the experimental section.

Methods Combining Image Compression and Image Segmentation
Image compression provides the most efficient description of an image, and therefore, it is the essential tool for making inference based on description length. The minimum description length (MDL) for image segmentation was used in [24], where an image partitioning problem is presented in terms of finding the minimum description of an image according to a descriptive language. The connections between the MDL approach, the region growing and the snakes was studied in [25]. Other image segmentation under MDL approach include: [26], where the regions are constrained to be connected components; [27] proposing learning of Gaussian Mixture Models using description codelength criteria; and [28] where a recursive MDL criterion is used in the framework of graph-cuts, and where similar structures in the images are represented by already described similar structures. All the above papers dealt with segmentation of color images, not of disparity map images, but used the same principle of obtaining a segmentation by minimizing the description length of a certain model, which is in final terms expressed as the codelength of a certain program for reconstructing the image.
For applications on more complex data, the MDL-based segmentation of depth maps was studied in [29] for the lossless compression of light fields.
The MDL principle applied to lossy compression in the form of minimizing the description length for a given distortion can be thought of as another facet of the Kolmogorov structure function [30,31].
Another information theoretic approach for model selection is intersection of confidence intervals ICI. Linear Polynomial Approximations combined with ICI(LPA-ICI). Ref. [32] are state-of-the-art for processing images with noise, being extremely efficient especially with impulsive or speckle noise. In the case of our high resolution disparity images, this noise component is not present, but certainly the use of LPA-ICI and similar techniques are worth pursuing in future research. As some interesting related references, segmentation of ultrasonic images was investigated using ICI selection in [33] and by MDL selection in [34].

Contribution
This paper has a dual goal: to provide good overall segmentations of the scene, and to achieve efficient lossy compression at low bitrates. Figure 1 shows the main contribution of the paper: obtaining segmentations with very precise contours from disparity images, and using them for model-based lossy compression of disparity maps. Instead of directly transmitting the raw information, we propose to extract polynomial surface models for the regions belonging to suitable selected partitions of the disparity image. The most challenging task is to obtain a good segmentation of the disparity image, which is achieved from a competition between several possible segmentations. We propose an algorithm belonging to the parametric approach for lossy disparity coding, in which the cost of the models involved will be evaluated in bits, and the task is to seek for the best compromise between the precision of the model and its cost, which has different solutions at different target bitrates.
An important goal for the developed algorithm is to obtain such models of the depth image data that convey information about the objects in the scene. Fitting the polynomial models to the depth data will provide segmentations that capture the most important contours in the image, and will also be able to delineate the objects of interest. We compare the boundaries of the regions in the segmentations that we obtain with our algorithm to the edges obtained from the color images representing the same scene, and we find a high degree of correspondence, which we quantify using the established benchmarking techniques from the segmentation literature.
The main contributions are two segmentation algorithms which produce partitions having simple regions, easy to be transmitted at different desired bitrates to the decoder. During the stage of creating the regions, several models compete for the most efficient image splitting into regions, while ensures a good approximation quality in a procedure based on information theoretic model selection decisions. We analyze the obtained algorithms from the perspective of lossy image compression and from the precision-recall properties of the obtained segmentations.

Image Partition into Regions
We introduce first notations and definitions for presenting in a formal way the proposed algorithms. We consider depth or disparities images G ∈ R n r ×n c having n r rows and n c columns. The set of pixels of the image is denoted 2 (unless otherwise specified we assume in the paper that connectivity level is 4). A label image, X ∈ N n r ×n c , specifies a label X(i, j) for each pixel (i, j) ∈ Ω 0 . A region Ω ⊂ Ω 0 is a subset of Ω 0 and is specified in the label image X by the equivalence (i, j) ∈ Ω ⇔ X(i, j) = .
The region is said to be a connected component if for any pixel (i, j) ∈ Ω there is at least one pixel (i , j ) ∈ Ω , which is connected in connectivity 4 to the pixel (i, j). A partition of the image into regions is denoted as a set as P = {Ω 1 , . . . , Ω L } and can be unequivocally described by a label image with L labels, which we denote X(P ).

Representing the Region's Contours
The contour, or boundary, of any region is formed of horizontal and vertical crack edges: a horizontal crack edge image H X ∈ {0, 1} n r ×n c specifies by H X (i, j) = 1 that X(i − 1, j) = X(i, j) and similarly the vertical crack edge image V X ∈ {0, 1} n r ×n c specifies by V X (i, j) = 1 that X(i, j − 1) = X(i, j) and we define the contour matrix for the label matrix X to be the concatenated matrix C X = H X V X . A crack edge is also named contour element. A label image X has associated a unique contour matrix C X = H X V X . Conversely, a contour image C can be processed by a region labeling routine, X = RegLab(C), to construct a label image X where each connected component has a distinct label, (see, e.g., the BSD benchmarking software [35]). We note that for any label matrix X where each region is a connected component, it holds that X = RegLab(C X ) differs from X only by a permutation of the labels. The set of contour elements set to one in C X that form the outside border of a region Ω is denoted Γ(Ω).

Representing a Hierarchical Segmentation
In the literature dealing with hierarchical segmentations, the representation of a sequence of segmentations is given by the ultrametric contour map (UCM) which can be formally defined as a contour matrix U ∈ R n r ×2n c having real entries, as opposed to the contour matrix C, which has binary elements. It is usual to normalize the real value of the contour element U (i, j) to the range [0; 1] and then consider the value as the probability that a contour element separates two adjacent pixels having different labels. However we keep the UCM matrix to be integer-valued, with the elements specifying a persistency level. By thresholding the elements of the UCM matrix U at a threshold τ one obtains a binary matrix C . Using a decreasing sequence of thresholds one obtains a sequence of binary contour images C 1 , . . . , C L , corresponding to nested segmentations X 1 , . . . , X L of the image G, which together form a hierarchical segmentation.
Considering two consecutive nested segmentations X and X +1 , and two neighbor regions, Ω 1 and Ω 2 , in X +1 that were obtained by splitting a single region Ω 1,2 in X . The split is obtained by setting to one the contour elements from the set ∆Γ = Γ(Ω 1 ) ∩ Γ(Ω 2 ). The cost of the split in terms of bitrate, L(∆Γ), can be approximated to be proportional to the number of contour elements in the set ∆Γ, hence L(∆Γ) = c|∆Γ|, as is done in most papers using MDL merging-splitting optimization [7,24].

Polynomial Surface for Approximating the Disparity Map over a Region
We consider the following two dimensional polynomials: and denote generically P θ (i, j) = ϕ k (i, j) T θ where the elements of the regression vector ϕ k (i, j) are monomials in the variables i and j. The main model considered in this paper is the reconstruction S(Θ, P ) ∈ R n r ×n c of the image G, as a function of a partition P = {Ω 1 , . . . , Ω L } and a set of polynomial parameter vectors Θ = {θ 1 , . . . , θ L }, where the reconstruction surface S for a pixel (i, j) belonging to region Ω is obtained with the parameters θ , as S i,j = P θ (i, j). Finally, we denote the code length necessary for representing the parameters as L(Θ) = ∑ L =1 L(θ ), where we assume that the elements of θ are quantized to a finite precision and are encoded by Golomb-Rice coding (hence assuming a geometric distribution of the parameters). The image of contours C X is encoded by the CERV algorithm [5] and the resulting codelength is denoted L C (C X ), or for short L(X).
The goal of this paper is to start from a given disparity map image G and to find a sequence of partitions P 1 , . . . , P N (or equivalently a sequence of label images X 1 , . . . X N ) and the corresponding polynomial models Θ 1 , . . . , Θ N satisfying two desiderata: should be competitive with the rate-distortion of lossy compression algorithms, at very low bitrates. The wish is to extract relevant information from G, to encode it efficiently, and use it for obtaining a reconstruction with a small distortion, as in the lossy compression tasks, but with the next additional wish on the relevance of the segmentation for the objects in the image. 2. The sequence of partitions P 1 , . . . , P N should compare favorably with the hierarchical partitions obtained from the color information of the same scene, having the diagram (recall, precision) competitive with the existing state of the art boundary detection or segmentation algorithms for finding general structure in images.

Statement of the Problem
We start by defining the disparity map model, consisting of a partition of the image pixels into regions, and of a polynomial surface inside each region. Then we describe the iterative process for obtaining a partition of the image into regions, with a polynomial surface model for reconstructing the depth inside each region, where the optimality criterion is the overall codelength for encoding (describing) the partition and the polynomial models for all regions, subject to a given allowed distortion over each region.
Given the disparity map image G we define a partition P = {Ω ; = 1, . . . , L} of the image support, Ω 0 , into L disjoint regions Ω ; = 1, . . . , L , such that L =1 Ω = Ω 0 . The minimum description length criterion consists of the cost L(P ) of transmitting a segmentation P n , evaluated by the implementable codelength obtained by context based coding of the segmentation [5], and of the cost L(Θ n ) of encoding the parameters of all polynomial models. The precise cost of encoding any segment of the contour can be extracted during the coding process, and we will denote L(Γ) the codelength of encoding the contour segment Γ. We denote Γ(Ω) the outer contour of a connected region Ω.
For any given distortion D one needs to solve the optimization problem

Algorithm for Hierarchical Segmentation based on Persistency of Contours of the Segmentations Generated by Iterative Piece-Wise Polynomial Modeling
The two components of the model are the following: (a) the segmentation and (b) the set of polynomial models, one for each region of the segmentation. Estimating the model that gives directly the minimum solution to the optimization problem (2) for a given D is approached by finding first a set of good "optimal" segmentations, and then checking what is the distortion corresponding to each segmentation, building thus a RD plot of solutions of (2).
The segmentation problem is sometimes seen as the estimation of a latent variable, defined for each pixel, and we introduced the label image notation X for this latent variable.
A simple attempt to finding a good model (including the segmentation X and the polynomial models) will be in the spirit of K-means iterative algorithm, rephrased as a K-models algorithm: fix a desired number of regions N reg in the segmentation and initialize a partition of the image into N reg regions. In a first stage, fit the best polynomial model over each region and in the second stage re-partition the image into N reg regions, so that each pixel (i, j) is associated with the model that gives the smallest reconstruction error of G(i, j). A true K-means or K-models would iterate the two stages until convergence, if that ever occur. However we are taking a different route: we operate with a large N reg , in a very "over-segmented" regime, with N reg larger several times than the final intended maximum number of regions, and we are not interested in iterating until stabilization of the N reg regions. Instead we are interested in exploring as much variability in the region boundaries. Since the re-estimation will make the region to change their boundaries, we track during the process for each contour element, say H(i, j), the number of times in which it was part of regions boundaries during the process. We call counts(H(i, j)) the persistency degree of the contour element, and we are building our segmentations by considering progressively the contour elements in the decreasing order of their degree of persistency.
There are a few problems with the simple K-models approach, and we discuss them and introduce at the same time our algorithmic steps that are correcting the problems.
We introduce several regularization options to this algorithm, resulting in the Algorithm 3. Even with the introduced change we notice that the iterative re-estimation has a high variability of the region contours decided at consecutive iterations. In Figure 2, we show on middle row on panels 1 and 2 one detail of the image Adirondack. The panels 1 to 3 in the top row show consecutive segmentations obtained during re-estimation. Since there are very many models initially (one for each (11 × 11) patch), on the long board which is the arm-rest of the chair there are several patches, with similar almost planar models, which are competing with each other during the re-estimation, and one sees the high variability of the contours of these models within the arm-rest in panels 1 to 3. However, the outline of the arm-rest remains as a clear part of region boundaries in all iterations. The main feature of our algorithm is to let many fitting polynomial surfaces to compete during the re-estimation iterations, resulting in many contour pixels that are changing from one iteration to the other, but also resulting in contour elements that remain "persistent" from one iteration to the next. We are keeping track of the persistency of all contour elements in the image, and after a number of iteration (n iter = 40 in all experiments) we check the persistency of each contour element and we use the most persistent elements for obtain contours that are true outlines of distinct objects or object parts. Just to show the final result of both Algorithms 1 and 2, we show in Figure 2, middle row, panel 3, that after selecting carefully the contour elements in Algorithm 2, using a rate-distortion marking of the regions, we are obtaining a segmentation very relevant for the object parts (the presented segmentation is obtained in Algorithm 2 after including persistent contours resulting in 243 regions in the segmentation image). In there, one can see that the collected persistent contours were successful in providing meaningful image features, resulting in a convincing segmentation.

Algorithm 1 Hierarchical segmentations based on persistency of contours generated by iterative piece-wise polynomial modeling
Input: The input disparity map G. Stage 1. Find persistent contours in the image G: Iterate finding the best fitting models for the current image partition, and then finding the best image partition for the current set of polynomial models. At each iteration mark the boundaries of the partition's regions and add the binary edge matrix to the overall contour persistency matrix; 1.0 Initialize the partition P 0 as being formed of n r L s × n c L s disjoint square regions (L s × L s ). The corresponding label image is denoted X 0 . The overall contour persistency matrix is set to U = 0 ∈ R n r ×2n c ; 1.1 For n = 1, . . . , n iter // Iterate a re-estimation algorithm n iter times Find all connected components of B and denote Ω * the largest of them 1.1.2.2.3 If the cardinality of Ω * is larger than a given size, N S , then a new region is declared, r ← r + 1 and Ω r = Ω * 1.1.2.2.4 Include the new region Ω r in the partition P n , by updating the label image X n (i, j) = r and the corresponding reconstruction R n (i, j) = S (i, j), for each (i, j) ∈ Ω r . 1.1.2.3 Construct the contour image C n for X n and add it to the overall UCM matrix U ← U + C n . Stage 2. Construct a hierarchical segmentation from the persistency contours matrix U , filtering out small regions 2.1 For i p = n iter , n iter − 1, . . . , n min ( Iterate the persistency level from highest to smallest) 2.1.1 Construct a current contours image, C having C(i, j) = 1 if U (i, j) ≥ i p . 2.1.2 Find the labels image X corresponding to C, and if i p = n iter set X i p = X and continue to i p = n iter − 1 2.1.3 Find all connected components of the labels image X 2.1.4 Initialize X i p = X i p +1 (the labels of the previous partition) 2.1.5 For all connected components of the labels image X larger than N 2 2.1.5.1 If the connected component Ω has holes, fill each hole that is smaller than N H pixels and then copy the filled Ω to X i p as a new region 2.1.5 Construct the contour map matrix C i p corresponding to label image X i p larger than N S and update the UCM matrix, U ← U + C i p 2.1 Rename the sequence X n iter , . . . , X n min as X 1 , . . . , X N Output: The ultrametric contour map matrix U , and the sequence of segmentations X 1 , . . . , X N , dubbed Hierarchical segmentations A.

Algorithm 2 Hierarchical partition based on (description length -distortion) optimization
Input: The sequence of segmentations X 1 , . . . , X N from Algorithm 1.  2.3 If λ c * is smaller than a threshold λ 0 , stop adding regions and exit, else add the new region, by modifying R n and X n accounting for O S p * . Output: The sequence of segmentationsX 1 , . . . , X N dubbed Hierarchical segmentation B. Now we present the main particularities of running Algorithm 1 followed by Algorithm 2, as cures to the K-means clustering. First, we do not know a priori a suitable number of regions, corresponding to a given distortion level D. For that reason we are letting the number of regions N reg = n cc to change during the re-estimation iterations, and N reg will be decided implicitly by the selection decisions at each step. We initialize the algorithm with square partitions for simplicity, with the square side 11 pixels. This is similar to the initialization of the segmentation algorithms based on super-pixels. The initial N reg is in the order of thousands, resulting in a heavy over-segmentation of the image.
A major problem of the partition re-estimation is that when distributing each pixel (i, j) to the model that achieves the smallest reconstruction error of G(i, j), there might be very many good models that represent well the other pixels within a neighborhood of (i, j), and then in the neighborhood of (i, j) there may be many different labels of winning models. A certain model might result in many winning patches distributed over the image, with each patch having many holes due to the many similar competitors.
To tackle this problem we adopt several changes to the simple K-models structure of the algorithm. We enforce that during the nth re-estimation of the partition, a given model has associated only one connected component (the largest one) out of all possible connected components where the model was winning over the best current reconstruction. We go over the models in such an order that first we treat the models having a smaller winning patch, and we sequentially mark the winning patches in a label image X n , overwriting the labels created by earlier patches. At the end of this marking process the label image X n will remain with the labels of the models having large winning patches. This process is described in Algorithm 1 at the Step 1.1.2.2 Use the competition of the models of Θ for defining the new partition. The label image X n can remain with many undecided pixels, since we restricted the marking of the winning patches to be only (large) connected components. All pixels with label 0 will be considered again in the decomposition of X n into connected components, at next run of the Step 1.1.1.1., and hence the number of models considered again in Step 1.1. may grow again larger than N cc .
The number of models n 0 cc that are re-estimated based on the new partition might be too large, exceeding our desired level N cc . We use a very simple reduction of their number, by grouping together the "similar" models in the following way: we quantize each model with decreasing precision, by quantizing Q(θ ,r ) = θ ,r 2 n b for n b = 10, 9, . . . , −10 and for each n b we check how many quantized models are distinct in the sequence of parameter vectors Q(θ 1 ), . . . , Q(θ n cc ), picking the n b as the first number for which n cc remains below N cc . This is the process described at the Step 1.1.1.3.
At each iteration of the re-estimation process we pick the contours elements set to 1 in C X n and increment the contour matrix U at the corresponding locations. The contour matrix is a (n r × 2n c ) matrix, where the first half block U (1 : n r , 1 : n c ) specifies that the labels at X(i − 1, j) and X(i, j − 1) are different (horizontal edge) and the second half U (1 : n r , (n c + 1) : 2n c ) specifies that the labels at X(i, j − 1) and X(i, j) are different (vertical edge).
When the re-estimation iterations of Stage 1 are finished, we pass to Stage 2, to analyze the persistency levels marked in the matrix U , with the maximum possible value of n iter . At each persistency level i p we create the contour matrix and then find the associated label matrix X i p . We want to avoid too small regions in X i p , and for that we decompose the image into connected components, fill for each of them the holes that are smaller than a fixed N H (we have used N H = 50) and use the filled connected components for a new label in X i p . The detailed description of Algorithm 1 is presented in the panel of the Algorithm 1.
To illustrate the re-estimation process, we show in Figure 2 bottom panel the evolution of some of the meaningful variables in Algorithm 1, Stage1: the number of connected components found in Step 1.1.1.1 n cc is marked #Connect. Comp.; The number of models estimated at the large connected components, n 0 cc is marked #Large Connect. Comp.; The number n cc of models forced to be smaller than N cc is marked #Kept Models; The number of pixels remaining unclassified (unlabeled) after Step 1.1.2.2 is marked # Unclassified pixels; finally, the number of contour elements (crack edges) in H and V that are set to one at Step 1.1.2.3 is marked #Marked Crack Edges. It is seen that the variables in the re-estimation algorithm are changing at each iteration, inducing variability in the segmentations obtained at each iteration, which is our main goal in the iteration process.

Algorithm for Hierarchical Segmentation based on (Description Length-Distortion) Optimization
The Algorithm 2 starts at Stage 1 with creating a library of large regions (called now "objects" for simplicity, although no "object" meaning is claimed), by inspecting the sequence of segmentations obtained by Algorithm 1. The objects are allowed to overlap (but more than 95% is not allowed). An object is described by its set of pixel locations. For each such object the best model for reconstructing G is found and stored for later use.
At Stage 2 there is a competition between the objects, for being included as new labels in the label matrix X n . The benefits in terms of MSE for including each candidate object with new labels in X n are evaluated, and possibly some subsets of the candidate object are removed, if the current reconstruction at the subset is better than that provided by the object. The best fit over the (carved) object is computed and the improvement in distortion ∆MSE is evaluated.
In order to solve the rate-distortion problem (2) the typical way is to evaluate the slope of the RD curve λ p = ∆MSE L(θ p )+L(Γ) and to add the objects to the segmentation in the order of their slope. The cost of the polynomial models is explained in Section 3.4. We denote Γ the set of additional contour elements that will be set to 1 due to the setting of the new region label in X n . The cost for encoding Γ is estimated as being proportional to the cardinality of the set Γ (with a proportionality factor 1.5 found to cover experimentally well the cost of coding contour elements by the algorithm CERV). The candidate region having the largest value of λ is selected and the Algorithm 2 proceeds to find a new region, exiting when no region with improvements of ∆MSE can be found.

The Datasets
We have experimented with a dataset of high-resolution disparity images [36], where all the images were acquired from real scenes. In Figure 3 we show the scenes used. For the drafting of our algorithms and for setting the thresholds we have experimented over a set of synthetically generated images, but for space economy we do not present here results for the synthetic data. We emphasize that no threshold or algorithmic routines were tuned over the real data.
The dataset [36] was constructed for benchmarking of stereo matching algorithms, and it contains for each scene a left and a right color image, and also a left and a right disparity image. The disparity values can be assumed to be approximately inverse proportional to the depth values in the stereo setting, and then one could apply the algorithms to the modeling of the depth values (equivalent to inverse disparities), but we did not follow that route. Applying the polynomial surface models to depth images and to disparity images can produce rather different results. In [36] the goal was to produce high resolution disparity images, which were carefully evaluated and found to have approximately a quarter of a pixel precision when checking the stereo correspondence of the left and right color views. Hence we decided to apply the polynomial surface approximation directly to the disparity image data provided in [36], not to the inverse disparity data.

Obtaining the Sequences of Segmentations A and B
Our Algorithms 1 and 2 were run over the left disparity images of 13 scenes, on the quarter sized images, e.g., the image Adirondack has size (496, 718). The left color images and left disparity map for the 13 scenes are shown in Figure 3. We have obtained for each scene two sequences of segmentations, A and B, using the Algorithms 1 and 2.
The number of segmentations produced by Algorithm 1 is maximum 38, since we have used n iter = 40, and we stopped the iterations of Step 2.1 at n min = 3, hence the maximum persistency level were 40. However, for some scenes the segmentations obtained at two consecutive values of i p where identical (especially at the very high i p values). The size threshold N S in both Algorithms 1 and 2 was set to 100.
The Algorithm 2 was producing a much larger number of segmentations, in the order of several hundreds, since we have used as the exit threshold condition in Step 2.3 λ 0 = 0, i.e., we exited when no large region had still associated a positive improvement gain λ.

Benchmarking the Sequences of Segmentations against References Extracted from the Color Images
The dataset also contains RGB images of the scenes, that we used only for finding color based contour maps or segmentations so that we can compare the segmentations obtained by our algorithms with other segmentation methods.
In order to find a reference contour map based on the RGB image we have used the software provided in a recent convolutional neural network (CNN) method [16] that performs edge detection, which was shown to achieve similar F-values as the human annotations over the BSD benchmarking database.
We have used the edge map produced for every RGB image of a scene by the software associated to [16], to obtain the edge map probability at every pixel, and we used additionally the non-maxima suppression from [14] for obtaining thin edges. We used a threshold for cutting the edge map probability for obtaining a map of binary edges, and by varying the threshold we obtained candidate contour maps of the image. By visually analyzing the contour maps at various thresholds, we have chosen for each scene one threshold, resulting in a computer generated "color reference edge map", where the threshold was chosen by human. Since there are no public available annotations of the contours in Middleburry images, we chose to provide a substitute of the ground truth that we need for benchmarking by this reference edge map, generated with state of the art CNN based edge detection. We show in third row of Figure 4 the edge maps of the scenes, that we consider as reference color edge maps. They are used as a reference for benchmarking the performance of four algorithms: first: the segmentations generated by Algorithm 1, second, generated by Algorithm 2, (both obtained from disparity data), third, the segmentations obtained by the method Selective Search [18] dubbed here "SSS algorithm" using the disparity data, and fourth the segmentations obtained by the method Multiscale Combinatorial Grouping [19] dubbed here "MCG algorithm" using the disparity data, all against the reference color edge maps obtained from the RGB image.
For benchmarking we have used the established methodology for boundaries evaluations from [35] summarized next: one considers a binary edge map B re f as "ground truth", which we take to be the reference color edge map mentioned above. Then we take say sequence A of segmentations, and go over each segmentation X t , transform it to a binary boundary image B t (function provided in the software [35]) and then the contours in B re f and B t are aligned by a dynamic programming routine, resulting in the best alignment, and in two matching images, out of which one can obtain the true and false positive and the true and false negative, the precision P t and the recall R t , and finally the F values F t = 2P t R t /(P t + R t ). For the best F-value for each image we also show the matching maps in Figure 4, where one can notice that the Algorithms 1 and 2 generate meaningful boundaries (the set of green and blue edges) and miss contour elements that are associated to color features, but not to disparity features. We present a complete description of the (precision-recall) performance in Figure 5 for all the images and all four algorithms and show by a circle the place with maximum F on each curve. In the Figure 5 and in the rest of the paper we refer to the method from [18] as SSS algorithm and to the method from [19] as MCG algorithm.
We also show in Table 1 the F values obtained for each algorithm over each image from the dataset. One can notice that Algorithm 2, based on the optimization using the gains λ defined in term of codelength, is the most often winner.   [18] and MCG algorithm [19] for disparity images from Middlebury Dataset [36]. Algorithm 1: Blue, Algorithm 2: Red, SSS algorithm: Magenta, MCG algorithm: Black. F-value remains constant on each green curve.
Finally, we show for visual evaluation in Figure 6 the segmentations of the four compared methods, at the highest value of F. One can notice the good matching of the segmentation regions with the objects, or distinct parts of the objects in the scene. Figure 6. The segmentation corresponding to the best F-value obtained by Algorithm 2 (first row), Algorithm 1 (second row), by SSS algorithm [18] (third row) and by MCG algorithm [19] (last row). Table 1. Boundary-Recall and compression performances of hierarchical segmentation using four different segmentation algorithms (Algorithm 1, shortened as A1; Algorithm 2, shortened as A2; SSS algorithm [18]; and MCG algorithm [19])). The columns 2 to 5 present the optimal Precision-Recall pairs. The columns 6 to 9 present the optimal corresponding F-values. The columns 10 to 13 present the Bjøntegaard BD-PSNR values between the rate -distortion results when applying the encoding algorithm from Algorithm 3 to each of the four segmentation algorithms, compared against the rate-distortion of JPEG2000, considered as anchor. The bold results present the winning of the fours algorithm for each scene according to F-Value and according to Bjøntegaard BD-PSNR.

Rate-Distortion Performance of the Segmentation Algorithm
We have used the encoding algorithm presented in Algorithm 3 for compressing G using the sequences obtained at the output of the Algorithm 1, Algorithm 2, SSS algorithm [18] and MCG algorithm [19]. The polynomial models have the parameter vector θ of length 6, corresponding to quadratic polynomial surfaces (1). For each obtained θ we quantize the parameters to 8 bits in the fractional part, and we encode the 6 quantized numbers by Golomb-Rice codes, determining the optimal parameter k = 2 r of the GR codes, where r ∈ {1, . . . , 8} is also encoded (in three bits) and transmitted in the header of the bitstream. For each scalar coefficient θ we transmit in one bit the sign of the coefficient, in unary coding the quotient q = |θ| 2 r (i.e., q bits one followed by a 0 bit) and in r bits we transmit the binary representation of (|θ| − q2 r ). The codelength for θ for each r ∈ {1, . . . , 8} is evaluated and the optimal value of r is selected. The length of the bitstream for encoding the parameter vectors θ 1 , . . . , θ L is the model parameter cost ∑ L =1 L(θ ). Additionally, the cost L(X) of transmitting the segmentation X by using the CERV algorithm [5] needs to be added, to obtain the overall codelength L.

Algorithm 3
Encoding G based on the segmentation X and polynomial models over each region Input: The input disparity map G. The segmentation X. Stage 1. Encode the segmentation X by the CERV encoding algorithm from [5], resulting in the codelength L(X). 3.1 For = 1, . . . , L (for each region Ω of the segmentation X) 1.1.1.2.1 Estimate the parameters θ of the polynomial surface model by minimizing ∑ (i,j)∈Ω (G(i, j) − P θ (i, j)) 2 . 1.1.1.2.1 Encode the parameters θ using a Golomb-Rice code, resulting in the codelength L(θ ). 1.1.1.2.1 Compute the sum of squares SSE = ∑ (i,j)∈Ω (G(i, j) − P θ (i, j)) 2 of the reconstruction using the parameters θ 1.1.1.2.1 Compute the mean square error MSE = 1 n r n c ∑ L =1 SSE and the peak signal to noise ratio PSNR = 10 log 10 Mg 2 MSE with Mg = max (i,j)∈Ω 0 G(i, j). Output: The PSNR in dB and the rate RATE = 1 n r n c (L(X) + ∑ L =1 L(θ ) in bits per pixel We additionally consider the rate-distortion of other two lossy compression methods: first the wavelet based coder JPEG 2000, which uses a hierarchical wavelet decomposition and context based coding of the wavelet coefficients, in a heavily engineered reference software. Second we consider a simple method, dubbed "Model 0+UQ", which performs first the uniform quantization (UQ) of the image G, with various quantization steps, and then transmits the quantized image using the CERV algorithm. For reconstruction, the scaling back by the quantization step is performed. Qualitatively, the quantized version of the image G looks like a geodesic map, being formed of constant regions enclosed by simple boundaries. This image can be encoded extremely efficiently by CERV, and hence "Model 0+UQ" has a very good RD curve, better than JPEG 2000 starting from a given rate on. However, at low bitrates the "Model 0+UQ" is below the JPEG 2000, and this is the place where the Algorithms 1 and 2 are intended to be utilized, for conveying at a low bitrate simple efficient reconstructions of G, which additionally convey a good segmentation into objects or parts of objects. The "Model 0+UQ" and JPEG 2000 competing methods do not provide informative segmentations, so we can compare them to ours only on the compression performance.
We show in Figure 7 the Rate-Distortion plots for all the images and algorithms, in which we present the rate in normalized form, L/(n r n c ), as bits per pixels. The performance of Algorithms 1 and 2 is better than "Model 0+UQ" at low bitrates, but still not as high as JPEG 2000. At the very low bitrate Algorithm 2 is systematically the best, but for some images Algorithm 1 succeeds to overpass significantly Algorithm 2 at higher bitrates. Finally, we show in Table 1 the Bjøntegaard distances (BD-PSNR) between the RD curves of each algorithm and the RD plot of JPEG algorithm. The minus sign for BD-PSNR means that the average of (PSNR Alg (R) − PSNR JPEG (R)) is negative. The RD performance, for reconstructing the G, has to be seen in connection to the performance for boundary detection and object detection, which favors consistently Algorithm 2.

Conclusions
We have proposed algorithms that create segmentations from disparity images, useful for two goals: as segmentation of the scene, and also as partitions for a piece-wise polynomial model based lossy compression. These algorithms can be further combined in more complicated structures to cover the more advanced applications mentioned in the paper. The investigation in this paper revealed good properties of the created partitions for high resolution disparity images, encouraging further study of these models for the new types of immersive image modalities that may benefit from precise geometry modeling and compression.