Depth Image Coding Using Entropy-Based Adaptive Measurement Allocation

Unlike traditional two-dimensional texture images, the depth images of three-dimensional (3D) video systems are significantly sparse under certain transform bases, which makes it possible for compressive sensing to represent depth information efficiently. Therefore, in this paper, a novel depth image coding scheme is proposed based on a block compressive sensing method. At the encoder, in view of the characteristics of depth images, the entropy of the pixels in each block is employed to represent the sparsity of the depth signal. Then, according to the different sparsity in the pixel domain, the measurements are adaptively allocated to each block for higher compression efficiency. At the decoder, the sparse transform is combined with the compressive sensing reconstruction. Experimental results show that, at the same sampling rate, the proposed scheme obtains higher PSNR values and better subjective quality of the rendered virtual views than the method using a uniform sampling rate.


Introduction
Three-dimensional (3D) video can provide viewers with a high-quality, immersive multimedia experience, and it has drawn increasing attention among industry and academic researchers [1]. Two typical 3D applications have appeared in the form of three-dimensional television (3DTV) [2] and free-viewpoint television (FTV) [3]. In 3DTV applications, multiple views from different viewing angles can be rendered for depth perception of the scene, while in FTV applications, arbitrary viewpoints within a certain range can be selected interactively by viewers.
The basic format of 3D video is a multiview representation, which is usually captured simultaneously by multiple cameras at slightly displaced positions [4]. However, as the number of views increases, the huge amount of data from multiview video poses a great challenge for 3D applications, particularly for data compression and transmission. To solve this problem, the multiview video plus depth (MVD) format has emerged as an efficient data representation for 3D systems. Compared to the pure multiview video format without depth information, the main advantage of the MVD format is that desired virtual views at arbitrary viewpoint positions can be conveniently synthesized via the depth-image-based rendering (DIBR) technique [5].
Depth images represent the distance between the camera and the objects in the scene. They are often treated as gray-scale image sequences, which are similar to the luminance component of texture video. However, unlike texture video, the depth image has its own special characteristics. Firstly, the depth image signal is much sparser than texture video under certain transform bases, such as the Discrete Cosine Transform (DCT) or the Discrete Wavelet Transform (DWT). It contains no texture but sharp object boundaries, since the gray levels are nearly the same in most regions within an object but change abruptly across the boundaries. Furthermore, the depth image is not directly used for display, but it plays an important role in virtual view synthesis. Distortion of the depth data, especially around object boundaries, will seriously degrade the quality of the rendered virtual views [6]. Therefore, exploiting the characteristics of depth images for efficient compression is an essential part of 3D systems.
In view of the sparsity of depth images, we attempt to apply compressive sensing (CS) [7] to represent depth information efficiently. CS is a method to capture and represent compressible signals at a rate significantly below the conventional Shannon/Nyquist rate. According to the conventional Shannon/Nyquist sampling theorem, when capturing a signal, one must sample at a rate of at least twice the signal bandwidth in order to avoid losing information. Owing to its low sampling rate, CS can avoid the heavy burden of data storage and processing at the conventional encoder.
In recent years, CS has been applied to image compression, and the basic framework is shown in Figure 1. At the encoder, the input image is processed block by block. For each block in the image, a sparse transform, such as the DCT or DWT, is used to produce coefficients with sparse characteristics. Then compressive sensing is employed to encode the transform coefficients and generate the same number of measurements for each block. At the decoder, a convex optimization method, such as the log-barrier or multiplier method [8], can be adopted for the CS recovery. Finally, the corresponding inverse transform is used for the image reconstruction. Block compressed sensing for natural images has been proposed using the same measurement matrix for all blocks, and it is claimed to sufficiently capture the complicated geometric structures of natural images [9]. A new image/video coding approach has been proposed that combines CS theory with the traditional DCT-based coding method to achieve better compression efficiency for spatially sparse signals [10]. Furthermore, the whole depth image can be processed by CS, and its performance evaluated with the rendered virtual view quality [11]. A novel compressed sensing framework has been presented for depth image compression using adaptive graph-based transforms [12]. However, a greedy algorithm is used there to find the optimal edge image, which means higher complexity, especially when the depth image block size increases. To address the above problems, in this paper, a novel depth image coding scheme is proposed based on a block compressive sensing method. The main improvements of the proposed scheme are as follows: (1) to keep the complexity of the CS encoder low, the entropy of the pixels in each block is employed to represent the sparsity of the depth signal; (2) in view of the different sparse characteristics of each block in the depth image, an adaptive measurement rate is allocated for higher compression efficiency; (3) unlike conventional CS, in this paper the measurements are obtained directly in the pixel domain and the sparse transform is combined with the CS reconstruction, which preserves both the low complexity of the CS encoder and the reconstructed image quality; (4) in order to better estimate the performance, the objective and subjective quality of the rendered virtual views are taken into account.
The rest of this paper is organized as follows: in Section 2, the proposed scheme is presented step by step. In Section 3, the performance of the proposed scheme is examined. We conclude the paper in Section 4.

Overview
Figure 2 illustrates the block diagram of the proposed scheme. The N views from Cameras 1 to N can be processed independently, and each view includes texture video and its corresponding depth image. Since texture video is very similar to traditional two-dimensional (2-D) video, it can be compressed by a standard codec, such as High Efficiency Video Coding (HEVC), for high compression efficiency. In this paper, we focus on the compression of depth images. In view of the sparsity of depth images, a block compressive sensing method is applied to compress them. Firstly, in order to reduce the amount of computation, the original depth image is down-sampled [13] with the sampling rate set to 0.5. Then the entropy of the pixels in each block is calculated to determine the sparsity in the pixel domain. According to the sparsity, measurements are adaptively allocated to each block for better compression efficiency. Note that, in order to reduce the complexity of the CS encoder, the sparse transform is shifted into the CS reconstruction. Therefore, at the decoder, the CS recovery is obtained by solving a convex optimization problem combined with the sparse transform.

Basic Idea of CS
Firstly, we will review the basics of the CS theory [7].
Let $x \in \mathbb{R}^n$ be a discrete signal and $u$ its coefficient vector in some orthonormal basis $\Psi$, so that $x = \Psi^T u$. Here, $x$ is said to be $k$-sparse with respect to $\Psi$ if only $k$ of the $n$ coefficients are non-zero. In CS theory, instead of encoding the $k$ non-zero coefficients, the CS encoder operates as follows:
$$y = \Phi x \qquad (1)$$
where $\Phi$ is an $m \times n$ matrix and $y \in \mathbb{R}^m$. Since $m < n$, the original signal $x$ is compressed. At the CS decoder, $u$ can be reconstructed by solving the following optimization problem:
$$\min_{u} \|u\|_1 \quad \text{subject to} \quad y = \Phi \Psi^T u \qquad (2)$$
Then, according to $x = \Psi^T u$, the original signal $x$ can be obtained.
In this paper, the CS encoder is applied block by block to each frame to generate the CS frame. Each block is organized to form a $1 \times n$ vector $x$. Here, the rows of the matrix $\Phi$ are samples of an independent identically distributed (i.i.d.) symmetric Bernoulli distribution. To be more specific, each row of $\Phi$ consists of entries $\pm 1$, and the probabilities of $+1$ and $-1$ are both 0.5. Note that, for low complexity, the matrix $\Phi$ is the same for all blocks. According to Equation (1), the measurement $y$, whose size is $1 \times m$, is produced directly in the pixel domain. Then the measurement $y$ is encoded and transmitted over the channel. At the decoder, we use a generic log-barrier algorithm to solve Equation (2). The corresponding MATLAB code can be found in [14]. Furthermore, the DCT basis is adopted as the orthonormal basis $\Psi$ for simplicity. In this paper, the DCT is not applied at the encoder but shifted to the decoder. The corresponding details can be found in Section 2.4.
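The encoder step in Equation (1) can be illustrated with a minimal sketch in plain Python. The function names `bernoulli_matrix` and `measure`, the fixed seed, and the 16-pixel example block are assumptions made for illustration only; the paper's own implementation is in MATLAB and is not reproduced here.

```python
import random


def bernoulli_matrix(m, n, seed=0):
    """Return an m x n matrix with i.i.d. entries +1 or -1, each with probability 0.5.

    The fixed seed is an assumption: it lets encoder and decoder rebuild the
    same matrix, which is then shared by all blocks.
    """
    rng = random.Random(seed)
    return [[rng.choice((1, -1)) for _ in range(n)] for _ in range(m)]


def measure(phi, x):
    """Compute y = Phi x directly in the pixel domain for a flattened block x."""
    return [sum(p * v for p, v in zip(row, x)) for row in phi]


# Hypothetical 4x4 block flattened to n = 16 pixels, sampled at a 50% rate (m = 8).
block = [128] * 8 + [64] * 8
phi = bernoulli_matrix(m=8, n=16)
y = measure(phi, block)
print(len(y), "measurements for", len(block), "pixels")
```

Because the entries of the matrix are only +1 and -1, each measurement needs additions and subtractions only, which is in line with the low encoder complexity the scheme aims for.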

Entropy Calculation
In information theory [15], entropy is the average amount of information contained in a source. Therefore, for an image source, the entropy can represent the complexity of the image content to a great extent. To be more specific, if the image content is very complex, the entropy is larger, while if the image content is very smooth, it is smaller. For this reason, entropy is employed in this paper to measure the sparsity of the depth image. Generally, the entropy $H$ of a discrete random variable $X$ with possible values $\{x_1, x_2, \ldots, x_i, \ldots, x_n\}$ and probability mass function $P(X)$ is defined as follows [15]:
$$H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i) \qquad (3)$$
According to Equation (3), we can calculate the entropy of each block in the depth image. However, before the calculation, the background noise of the depth image should be removed so that the entropy can be calculated accurately. Figure 3 shows an example of an anti-ground noise filter for depth images. Due to the background noise, neighboring pixel values differ slightly from one another, which makes entropy an inaccurate description of the information content. To remove the background noise without high computational complexity, an anti-ground noise filter is utilized here. We adopt 8 as the step size to quantize all 256 pixel values of the original depth image. This leaves at most 32 quantized values (0-31), which provides a good basis for the subsequent work. When we calculate the entropy of the blocks in the depth image, the probability of each quantized value is obtained by counting its occurrences:
$$P(x_i) = \frac{n(x_i)}{N} \qquad (4)$$
Here, $N$ is the total number of pixels in a block and $n(x_i)$ is the number of pixels with the quantized value $x_i$. As a result, the entropy of each block can be computed to measure the sparsity of the depth image.
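The quantization and entropy steps described above can be sketched as follows. This is an illustrative reading rather than the authors' code: the function names are hypothetical, the quantizer is assumed to be integer division by the step size 8, and the logarithm is assumed to be base 2 (the text does not state the base).

```python
import math
from collections import Counter


def quantize(pixels, step=8):
    """Map 8-bit depth values (0-255) to at most 32 levels (0-31).

    Integer division by the step is an assumed form of the noise filter.
    """
    return [p // step for p in pixels]


def block_entropy(pixels, step=8):
    """Entropy of one block after quantization, with P(x_i) = n(x_i) / N."""
    levels = quantize(pixels, step)
    total = len(levels)
    counts = Counter(levels)
    # Base-2 logarithm is an assumption; any base only rescales the values.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A smooth block with slight background noise collapses to a single level.
noisy_flat = [100, 101, 102, 103] * 4
# A block that straddles an object boundary keeps two distinct levels.
boundary = [40] * 8 + [200] * 8
print(block_entropy(noisy_flat), block_entropy(boundary))
```

In this sketch the noisy but smooth block ends up with zero entropy, while the boundary block has an entropy of one bit, which illustrates why the noise is removed before the entropy is used as a sparsity measure.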

Adaptive Measurement Allocation
In order to reconstruct a higher-quality depth image at a lower sampling rate, we allocate different sampling rates to different blocks according to their entropy, as shown in the flowchart in Figure 4. By comparing the entropy of each block with the thresholds $E_j$, one of six sampling rates ($S_6 < S_5 < S_4 < S_3 < S_2 < S_1$) is allocated to each block until all the blocks have been processed. Here, $S_6 = 20\%$, $S_5 = 30\%$, $S_4 = 40\%$, $S_3 = 50\%$, $S_2 = 60\%$ and $S_1 = 70\%$. Note that, since there are six possible decisions in total, three bits are required for each block as the overhead of the proposed scheme. In Figure 5, a typical example for the standard test depth image Kendo is shown to explain the adaptive measurement allocation. Here, we use different colors to represent different sampling rates: white for $S_1$, red for $S_2$, blue for $S_3$, green for $S_4$, yellow for $S_5$ and black for $S_6$. In view of the characteristics of depth images, the smoothest blocks, marked in black, are allocated the lowest sampling rate, while the complex blocks, marked in white, are allocated the highest sampling rate. As shown in Figure 5, since the smooth blocks account for a larger percentage of all the blocks, higher compression efficiency may be achieved using unequal sampling rates than with a uniform sampling rate. Note that the thresholds $E_j$ can be computed by statistical methods. Firstly, since five thresholds are needed, we divide the entropy values of all blocks into five equal intervals.
Here, we again take the standard test depth image Kendo as an example, as shown in Figure 6. Furthermore, we obtain the central value of each bin, marked by the colored circles in Figure 6. These central values are used as the thresholds. The entropy must be computed for all blocks before the thresholds are decided, and for a different image the entropy thresholds have to be computed again. Currently, we consider six levels of thresholding. More levels mean better reconstructed image quality, but they also increase the computational complexity.
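The threshold determination and rate allocation described in this section can be sketched as follows. The helper names are hypothetical, and two details are assumptions because the text does not fix them: the five equal intervals are taken over the range from the minimum to the maximum block entropy, and a block whose entropy equals a threshold is assigned to the higher rate.

```python
from bisect import bisect_right

# Sampling rates ordered from the smoothest blocks to the most complex: S6 ... S1.
RATES = [0.20, 0.30, 0.40, 0.50, 0.60, 0.70]
BITS_PER_BLOCK = 3  # overhead needed to signal one of six decisions


def entropy_thresholds(entropies, bins=5):
    """Split [min, max] of the block entropies into equal bins and return
    the central value of each bin as a threshold (five thresholds by default)."""
    low, high = min(entropies), max(entropies)
    width = (high - low) / bins
    return [low + (j + 0.5) * width for j in range(bins)]


def allocate_rates(entropies, thresholds):
    """Pick a rate per block: the more thresholds an entropy reaches, the higher the rate."""
    return [RATES[bisect_right(thresholds, e)] for e in entropies]


def average_rate(rates):
    """Average sampling rate over all blocks, comparable with a uniform rate."""
    return sum(rates) / len(rates)


# Hypothetical entropies of six blocks in one depth image.
entropies = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
thresholds = entropy_thresholds(entropies)
rates = allocate_rates(entropies, thresholds)
overhead_bits = BITS_PER_BLOCK * len(entropies)
print(thresholds, rates, average_rate(rates), overhead_bits)
```

Since the thresholds depend on the entropy distribution of the image being coded, `entropy_thresholds` has to be called again for each new image, as noted in the text.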

Improved CS Reconstruction
Here, the sparse transform is shifted to the decoder to reduce the complexity of the encoder. The log-barrier algorithm is used to solve the quadratically constrained $\ell_1$ minimization:
$$\min_{u} \|u\|_1 \quad \text{subject to} \quad \|Au - b\|_2 \le \epsilon \qquad (5)$$
Here, $A = \Phi\Psi^T$, $u$ is the coefficient vector of the original pixels $x$ in the orthonormal basis $\Psi$, $b$ is the observation vector, and $\epsilon$ bounds the residual. Note that, because the sparse transform is combined with the log-barrier algorithm, some parameters must be updated. The derivation, given in Equation (6), uses the singular value decomposition (SVD) of $A^T$:
$$A^T = U S V^T \qquad (7)$$
Since $A$ is an $m \times n$ matrix, $U$ is an $m \times m$ unitary matrix, $S$ is an $m \times n$ diagonal matrix, and the $n \times n$ unitary matrix $V^T$ denotes the conjugate transpose of the $n \times n$ unitary matrix $V$. Furthermore, according to Equation (7), Equation (6) can be rewritten as Equation (8), which can in turn be rearranged into Equation (9). Comparing Equation (5) with Equation (9), the parameters are updated as follows: firstly, $b$ in Equation (5) is replaced by $S^{-1} V^T y$; secondly, $A$ in Equation (5) is replaced by $U^T$; finally, the initial $u$ is replaced by $Ub$.
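One way to see where these parameter updates come from is the short derivation below. It is a sketch under assumptions that are not stated in the text: the measurement relation is taken to be $y = Au$, the SVD of $A^T$ is taken in its thin form, and $A$ is assumed to have full row rank so that $S$ is an invertible $m \times m$ diagonal matrix. Under this convention $U$ is $n \times m$ with orthonormal columns and $V$ is $m \times m$, which differs from the dimensions quoted above.

```latex
\begin{aligned}
A^T &= U S V^T \;\Longrightarrow\; A = V S U^T
      && \text{(transpose; } S \text{ is diagonal)} \\
y   &= A u = V S U^T u
      && \text{(assumed measurement relation)} \\
V^T y &= S U^T u
      && \text{(since } V^T V = I\text{)} \\
S^{-1} V^T y &= U^T u
      && \text{(since } S \text{ is invertible)}
\end{aligned}
```

Read against the constraint in the quadratically constrained problem, the last line has the same form with $U^T$ in the role of $A$ and $S^{-1} V^T y$ in the role of $b$. A common starting point for such solvers is the transpose of the system matrix applied to the observation, which here gives $(U^T)^T b = Ub$ and agrees with the initial value stated above.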

Experimental Results
In this paper, the standard test sequences listed in Table 1 are selected to validate the proposed scheme. The input for each view is the first color image frame with the corresponding depth image. In practical applications, the camera can process the multiple views image by image, which is similar to intra-coding in the traditional method. The experiments are run on a PC with a 2.67 GHz Intel Core i5 CPU, and the main scheme is implemented in MATLAB R2010a. The virtual viewpoint synthesis software VSRS 3.5 is adopted as the experimental platform. It can be seen from Figure 7a,c,e that the proposed scheme outperforms the uniform sampling scheme in the PSNR values of the depth map at the same ratio. Here, the ratio is the average sampling rate of the adaptive measurements. Among the three tested sequences, the PSNR values of the sequence Pantomime are higher than those of the other two because this sequence has more smooth regions and better sparsity. Since the depth map is not directly used for display, the objective and subjective quality of the rendered virtual views should be taken into account. For the objective evaluation, the virtual viewpoint image is synthesized from two original camera images. For example, for the tested sequences Balloons and Kendo, the depth and texture of the 1st and 3rd views are used to synthesize the texture of the 2nd view, while for the sequence Pantomime, the depth and texture of the 37th and 39th views generate the texture of the 38th view. Then we compare the uniform sampling scheme with the proposed one by observing the quality of the synthesized image. In Figure 7b,d,f, it can be seen that, at the same average sampling rate, the image synthesized using the proposed scheme outperforms that of the uniform sampling scheme in PSNR values. Furthermore, the encoding and decoding times of the proposed scheme and the uniform one are shown in Table 2.
From Table 2, we can see that the proposed scheme needs more time than the uniform one due to its increased complexity. Next, we further discuss the reconstruction quality of the synthesized images. From Figure 8, it can be seen that the proposed scheme gives better visual quality of the synthesized images than the uniform sampling scheme, especially in the parts marked by a yellow rectangle. Table 3 shows the comparison with traditional coding. Here, following the main idea of JPEG or H.264 intra-coding, the traditional coding method is simulated based on the Discrete Cosine Transform (DCT). We decompose each block (16×16) of the original depth map by the DCT and then perform the reconstruction using only the significant DCT coefficients. In Table 3, the traditional method obtains higher PSNR values than CS coding but requires more encoding time. Therefore, the CS method is suitable for real-time compression of high-speed camera images. At the current stage, the CS method cannot compete with the traditional method in terms of compression efficiency. The main reason is that the DCT encoder can effectively remove the correlation in the original image, so that most of the information in the image can be recovered from a small number of transform coefficients. In contrast, the CS encoder achieves compression mainly through random sampling, and only at the decoder can the sparse transform be applied to gain performance. In the future, it will be necessary to draw on the traditional method to improve the results of the CS method.
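The PSNR figures used throughout this section follow the usual definition for 8-bit images, which can be sketched as below. The function name and the flattened-list input format are assumptions for illustration; the text does not say how PSNR was computed in the experiments.

```python
import math


def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists.

    Uses PSNR = 10 * log10(peak^2 / MSE), with peak = 255 for 8-bit depth maps.
    """
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


# Hypothetical example: every pixel is off by one gray level, so MSE = 1.
reference = [10, 20, 30, 40]
decoded = [11, 19, 31, 39]
print(round(psnr(reference, decoded), 2))
```

The same function applies to both evaluations discussed here: the reconstructed depth map against the original depth map, and the synthesized virtual view against the captured view at that position.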

Conclusions
In this paper, we fully consider the sparse characteristics of depth images and propose a novel scheme based on block compressive sensing. Since the entropy can describe the sparsity of the depth image to some extent, adaptive measurement allocation is designed based on the entropy of each block. The simulation results show that, compared with the uniform sampling scheme, the proposed scheme has better rate-distortion performance for both depth maps and synthesized virtual viewpoints.

Figure 1. Basic framework of image compression based on CS.

Figure 2. Block diagram of the proposed scheme.

Figure 3. Example of the anti-ground noise filter.

Figure 5. A typical example of measurement allocation.

Figure 6. A typical example of threshold determination.
For simplicity, the depth image can be divided into m blocks with size n × n.

Table 1. Test sequences.

Table 2. Depth image encoding and decoding times of the uniform and adaptive methods.

Table 3. Comparison with the traditional method.