Multiple Description Coding Based on Optimized Redundancy Removal for 3D Depth Map

Multiple description (MD) coding is a promising alternative for the robust transmission of information over error-prone channels. In 3D image technology, the depth map represents the distance between the camera and objects in the scene. Combining the depth map with the existing multiview images makes it efficient to synthesize images at any virtual viewpoint, which allows more realistic 3D scenes to be displayed. Unlike the conventional 2D texture image, the depth map contains a great deal of spatially redundant information, which is not necessary for view synthesis but may waste compressed bits, especially when MD coding is used for robust transmission. In this paper, we focus on redundancy removal for MD coding in the DCT (discrete cosine transform) domain. In view of the characteristics of DCT coefficients, a Lagrange optimization approach is designed at the encoder to determine the amount of high-frequency coefficients in the DCT domain to be removed. To keep the computational complexity low, entropy is adopted to estimate the bit rate in the optimization. Furthermore, at the decoder, adaptive zero-padding is applied to reconstruct the depth map when some information is lost. The experimental results show that, compared to the corresponding scheme, the proposed method achieves better rate central and side distortion performance.


Introduction
3D images and video, which offer a stereoscopic and immersive multimedia experience, have attracted increasing attention [1]. Recently, the applications of 3D images and video have become more and more popular in many fields, including cinema, TV broadcasting, streaming and smart phones [2]. Unlike the conventional 2D texture image, the depth map, as a special format of 3D image data, represents the distance information between a camera and the objects in the scene. Each pixel in a frame of a depth data stream represents the distance from a specific object in the view of the depth sensor to the camera plane. Depth maps are often treated as gray-scale image sequences, which are similar to the luminance component of texture videos. However, the depth map has its own special characteristics. First, the depth map signal is much sparser than the texture image. It contains no texture, but has sharp object boundaries, because the gray levels are nearly the same in most regions within an object but change abruptly across boundaries. Furthermore, the depth map is not directly used for display, but it plays an important role in virtual view synthesis.
During the past few years, many studies have focused on the depth map. Edge-based compression was applied to the depth map in [3], which utilizes JBIG (Joint Bi-level Image experts Group) to encode the contours and DPCM (Differential Pulse Code Modulation) to compress both the pixels around the contours and the uniform sparse sampling points of the depth map. The method in [4] uses a symmetric smoothing filter to smooth the whole depth map before image warping. While this method decreases the number of holes as the smoothing strength increases, some geometric distortions also become visible in regions with vertical straight lines. Instead of coding numerous views, two or three texture views with their corresponding depth maps are used in the multiview video coding (MVC) standard. Virtual views can be generated by specifying the preferred viewing angles through depth-image-based rendering (DIBR) [5]. Later, many scholars applied compressive sensing (CS) theory to texture image and depth map coding. CS theory was proposed by Candès and Donoho in [6][7][8]. CS applications in image/video coding can be found in [9][10][11][12]. In [9], for spatially sparse signals, a new image/video coding approach is proposed, which combines CS theory with the traditional discrete cosine transform (DCT)-based coding method to achieve better compression efficiency. In [10], using block compressed sensing, image acquisition is conducted in a block-by-block manner through the same operator. It is claimed that this can sufficiently capture the complicated geometric structures of natural images. In [11,12], compressed sensing theory is used in depth map compression, which subsamples the depth map in the frequency domain by choosing a sub-sampling matrix that ensures the incoherence of the measurements. The two recent papers in [13,14] propose two kinds of exponential wavelet iterative shrinkage thresholding algorithms based on compressed sensing theory. The work in [13] is only for magnetic resonance images, and [14] adds a random shift for magnetic resonance images. The two schemes are tested on four kinds of magnetic resonance images, and they achieve better reconstruction quality and faster reconstruction speed than state-of-the-art algorithms, such as FCSA (Fast Composite Splitting Algorithm), ISTA (Iterative Shrinkage/Threshold Algorithm) and FISTA (Fast Iterative Shrinkage/Threshold Algorithm).
Multiple description (MD) coding is a technique that has emerged as a promising approach to enhance the fault tolerance of a video delivery system [15]. The first work on MD coding was introduced in 1993 [16]. The original video signal can be split into multiple bit streams (descriptions) using an MD encoder. These descriptions can then be transmitted over multiple channels. The probability that all channels fail at the same time is very low. Therefore, at the MD decoder, even a single received description can reconstruct the video with acceptable quality, and the resultant distortion is called side distortion. Of course, more descriptions produce video of better quality. In a simple architecture with two channels, the distortion with both descriptions received is called central distortion [17].
Later, a number of MDC methods were presented, mainly including the MD scalar quantizer [18], the MD lattice vector quantizer [19,20], MD based on pairwise correlating transforms [21] and MD based on FEC (Forward Error Correction) [22]. All of the above methods are difficult to use in practical applications because these specially designed MD encoders are not compatible with the widely used standard codecs. To overcome this limitation, some standard-compliant MD video coding schemes were designed and achieved promising results [23,24]. Another significant class of MDC is based on pre- and post-processing. In preprocessing, the original source is split into multiple sub-sources before encoding, and these sub-sources can then be encoded separately by the standard codec to generate multiple descriptions. The typical versions are MDC based on spatial sampling [25] and temporal sampling [26,27].
In this paper, considering the special characteristics of depth information, an MDC scheme is proposed based on optimized redundancy removal for the 3D depth map. Firstly, the depth map is transformed from the pixel domain into the DCT domain. According to the characteristics of DCT coefficients, a Lagrange optimization approach is designed at the encoder to determine the amount of high-frequency coefficients in the DCT domain to be removed. To keep the computational complexity low, entropy is adopted to estimate the bit rate in the optimization. Then, the processed depth map is separated into odd and even rows or columns and transmitted. Adaptive decoding methods are designed differently for the two situations. Furthermore, at the decoder, adaptive zero-padding is applied to reconstruct the depth map when some information is lost.
The rest of this paper is organized as follows. The proposed scheme is presented in Section 2. In Section 3, the performance of the proposed scheme is investigated through extensive simulation in different situations. Finally, the paper is concluded in Section 4.

Entropy 2016, 18, 245

Encoder
The block diagram of the proposed encoder is illustrated in Figure 1. The depth map of any view can be encoded as follows. At the encoder, the depth map is first transformed into the DCT domain. According to the characteristics of the high-frequency DCT coefficients, a Lagrange optimization approach is designed to determine the redundancy to be removed. After the inverse DCT (IDCT), the processed depth map is separated into odd and even lines. The generated sub-images can then be encoded by any standard codec to produce two descriptions.
As depicted in the block diagram, removing $N$ lines of high-frequency coefficients in the DCT domain reduces the approximately redundant information. The value of $N$ is very important for the entire coding process because it affects the reconstructed image quality at the decoder, and both the side and central distortion are sensitive to it. When $N$ increases, both side and central distortion increase. On the other hand, the bit rates associated with the side and central distortion decrease as $N$ increases. To obtain an optimized value of $N$, the following optimization problem should be solved:
$$\min_{N}\ \left(D_S(N),\ D_C(N),\ R(N)\right)$$
In the above optimization problem, we need to consider the bit rate and the distortion of the reconstructed images from the side and central decoders. It can be rewritten as a Lagrangian formulation, as shown in Equation (1), where $J(N)$ denotes the Lagrange cost function:
$$J(N) = D_S(N) + \lambda_1 D_C(N) + \lambda_2 R(N) \qquad (1)$$
$R(N)$, $D_S(N)$ and $D_C(N)$ denote the bit rate, side distortion and central distortion of the image, respectively. Additionally, $\lambda_1$ and $\lambda_2$ are weighting parameters that balance $R(N)$, $D_S(N)$ and $D_C(N)$. In this paper, the side distortion $D_S(N)$ and the central distortion $D_C(N)$ are regarded as equally significant, so $\lambda_1 = 1$. In the experimental results, we discuss the rate distortion performance under different values of $\lambda_2$.
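The encoder-side processing described above (DCT, removal of $N$ lines of high-frequency coefficients, IDCT, odd/even separation) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes that "lines" are rows of the coefficient matrix, so that zeroing them is equivalent to a column-wise 1-D DCT with the last $N$ coefficients set to zero. The function names `remove_high_freq_lines` and `split_lines` are hypothetical, and rounding back to integer gray levels is omitted.

```python
import math

def _alpha(k, n):
    # Scaling factor of the orthonormal DCT-II basis.
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct(x):
    """Orthonormal 1-D DCT-II of a list of numbers."""
    n = len(x)
    return [_alpha(k, n) * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i in range(n))
            for k in range(n)]

def idct(c):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(c)
    return [sum(_alpha(k, n) * c[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for k in range(n))
            for i in range(n)]

def remove_high_freq_lines(depth, n_lines):
    """Zero the n_lines highest vertical-frequency DCT lines (assumed meaning of N)."""
    height, width = len(depth), len(depth[0])
    columns = []
    for j in range(width):
        coeffs = dct([depth[i][j] for i in range(height)])
        for k in range(height - n_lines, height):
            coeffs[k] = 0.0
        columns.append(idct(coeffs))
    # Transpose the processed columns back to row-major order.
    return [[columns[j][i] for j in range(width)] for i in range(height)]

def split_lines(depth):
    """Return (odd-line, even-line) sub-images, counting lines from 1."""
    return depth[0::2], depth[1::2]
```

Each sub-image returned by `split_lines` would then be passed to a standard codec to form one description.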
Solving the optimization problem in Equation (1) is difficult in multiple description coding. Given $\lambda_2$, a triple $(D_S(N), D_C(N), R(N))$ needs to be computed for each mode selection, which requires time-consuming actual encoding and decoding for each mode. To lower the encoding complexity, we use the entropy of the image instead of the bit rate $R(N)$. Here, $R_e(N)$ represents the entropy of a depth map. It is defined as follows:
$$R_e(N) = -\sum_{i=0}^{k} p_i \log_2 p_i \qquad (2)$$
where $p_i$ is the probability of gray level $i$ in the depth map and $k$ is the number of gray levels.
To calculate the entropy, we need to count the probability of occurrence of each gray level in a depth map. The probability is computed as:
$$p_i = \frac{s_i}{S}$$
where $S$ is the number of symbols in an image and $s_i$ is the number of pixels whose value is $i$.
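The entropy estimate can be illustrated with a short sketch. It assumes the standard Shannon entropy with a base-2 logarithm, so that the result is in bits per pixel, and a depth map stored as a 2-D list of integer gray levels; the name `depth_entropy` is hypothetical.

```python
import math
from collections import Counter

def depth_entropy(depth):
    """Entropy (bits per pixel) of the gray levels in a 2-D list of ints."""
    pixels = [value for row in depth for value in row]
    total = len(pixels)              # S: number of symbols in the image
    counts = Counter(pixels)         # s_i: number of pixels with gray level i
    # Gray levels that never occur are skipped, since p * log2(p) -> 0 as p -> 0.
    return -sum((s / total) * math.log2(s / total) for s in counts.values())
```

A constant depth map has zero entropy, while a map split evenly between two gray levels has an entropy of one bit per pixel, which matches the intuition that smoother maps are cheaper to code.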
Once all of the $p_i$ of an image are obtained, the entropy can be calculated according to Equation (2). Furthermore, the MSE (mean square error) is used to represent $D_S(N)$ and $D_C(N)$. Let $I_1$, $I_2$ denote two images of size $H \times W$; their MSE is obtained by the following formula:
$$\mathrm{MSE} = \frac{1}{HW}\sum_{x=1}^{H}\sum_{y=1}^{W}\left(I_1(x,y) - I_2(x,y)\right)^2$$
In this paper, $D_C(N)$ is the MSE between the reconstructed image from the central decoder and the original map, and $D_S(N)$ is the average MSE of the two reconstructed images from the side decoders:
$$D_S(N) = \frac{\mathrm{MSE}_1 + \mathrm{MSE}_2}{2}$$
Here, $\mathrm{MSE}_1$ is the distortion from one side decoder, and $\mathrm{MSE}_2$ is the distortion from the other side decoder.
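As a small sketch of the two distortion measures, assuming images stored as equally sized 2-D lists (the function names are hypothetical):

```python
def mse(img_a, img_b):
    """Mean square error between two equally sized 2-D lists."""
    height, width = len(img_a), len(img_a[0])
    total = sum((img_a[i][j] - img_b[i][j]) ** 2
                for i in range(height) for j in range(width))
    return total / (height * width)

def side_distortion(original, recon_1, recon_2):
    """Average MSE of the two side reconstructions against the original."""
    return (mse(original, recon_1) + mse(original, recon_2)) / 2

def central_distortion(original, recon_central):
    """MSE of the central reconstruction against the original."""
    return mse(original, recon_central)
```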
In this paper, we use an iterative approach to solve for $N$. Because the block size in the encoder is set to 16, the value of $N$ is limited to the range from 16 to $16m$, where $m$ differs between sequences. If $m$ is too small, a better value of $N$ may not be found before the algorithm stops; if $m$ is too large, the value of $R_e(N)$ will increase, and $\lambda_2$ will lose its effect.
First, $N$ and $\lambda_2$ are initialized, and the depth map processed in the DCT domain is separated into odd and even lines. From the two sub-images, we compute the values of $R_e(N)$, $D_S(N)$ and $D_C(N)$. Then, we calculate $J(N)$ from these three values and move on to the next value of $N$. Once $N$ falls outside the range from 16 to $16m$, the optimized $N$ is chosen as the one with the minimum value of $J(N)$. The basic optimization algorithm is shown in Figure 2.
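The iterative search can be sketched as below. Several points are assumptions rather than details given in the text: the cost is taken to be the weighted sum $J(N) = D_S(N) + \lambda_1 D_C(N) + \lambda_2 R_e(N)$, $N$ is stepped in multiples of the block size 16, and `evaluate` is a hypothetical callable that runs the encoder and both decoders for a given $N$ and returns the triple $(D_S, D_C, R_e)$.

```python
def choose_n(evaluate, m, lambda_1=1.0, lambda_2=1.0, block=16):
    """Return (best_n, best_cost) over N = block, 2*block, ..., m*block.

    evaluate(n) is a user-supplied function returning
    (side_distortion, central_distortion, entropy_rate) for that n.
    """
    best_n, best_cost = None, float("inf")
    for n in range(block, block * m + 1, block):
        d_side, d_central, rate = evaluate(n)
        # Assumed weighted-sum form of the Lagrange cost J(N).
        cost = d_side + lambda_1 * d_central + lambda_2 * rate
        if cost < best_cost:
            best_n, best_cost = n, cost
    return best_n, best_cost
```

A larger `lambda_2` puts more weight on the rate term and therefore tends to favor larger values of `n`, since the rate decreases as more coefficients are removed.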

Decoder
At the decoder, there are two decoding cases for MDC. For a depth map, the distortion with both descriptions received is called central distortion, and the distortion with only one description received is called side distortion [15]. The block diagram of the proposed decoder is illustrated in Figure 3.
Firstly, the design of the side decoder is discussed. For the side decoder, only Description 1 or Description 2 is received. Since the processes for Descriptions 1 and 2 are the same, we use Description 1 as the example. After standard decoding, the decoded odd sub-image is transformed into the DCT domain. Then, according to the resolution of the original image, adaptive zero-padding is applied to fill in zeros for the missing DCT coefficients of the odd sub-image. Finally, after the inverse DCT, the reconstructed depth image is obtained, which can be used to calculate the side distortion.
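A minimal sketch of zero-padding in the DCT domain for the side decoder is given below. It is plain DCT-domain upsampling along the vertical direction and is not claimed to match the adaptive rule of the proposed decoder, whose details are not given here. The amplitude gain of $\sqrt{H/h}$ for an orthonormal DCT and the function names are assumptions of this sketch, and the half-line offset between the sub-image and the full grid is ignored.

```python
import math

def _alpha(k, n):
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct(x):
    """Orthonormal 1-D DCT-II."""
    n = len(x)
    return [_alpha(k, n) * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i in range(n))
            for k in range(n)]

def idct(c):
    """Inverse of dct()."""
    n = len(c)
    return [sum(_alpha(k, n) * c[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for k in range(n))
            for i in range(n)]

def zero_pad_upsample(column, target_len):
    """Upsample one column by appending zero high-frequency coefficients."""
    gain = math.sqrt(target_len / len(column))   # keeps the mean gray level
    padded = [gain * c for c in dct(column)]
    padded += [0.0] * (target_len - len(column))
    return idct(padded)

def side_decode(sub_image, full_height):
    """Reconstruct a full-height depth map from one decoded sub-image."""
    width = len(sub_image[0])
    columns = [zero_pad_upsample([row[j] for row in sub_image], full_height)
               for j in range(width)]
    return [[columns[j][i] for j in range(width)] for i in range(full_height)]
```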
Next, we discuss the design of the central decoder. For the central decoder, both descriptions are received. Therefore, after standard decoding, Descriptions 1 and 2 are rearranged according to the odd and even lines. Then, with the same process as in the side decoder, adaptive zero-padding is used to fill in zeros in the DCT domain. Finally, we obtain the reconstruction with central distortion.
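The rearrangement step of the central decoder can be sketched as a simple interleaving of the two decoded sub-images. This covers only the line reordering, not the subsequent zero-padding, and the name `merge_lines` is hypothetical.

```python
def merge_lines(odd_lines, even_lines):
    """Interleave odd-line and even-line sub-images (lines counted from 1)."""
    merged = []
    for odd_row, even_row in zip(odd_lines, even_lines):
        merged.append(odd_row)
        merged.append(even_row)
    # An image with an odd number of lines has one extra odd line.
    if len(odd_lines) > len(even_lines):
        merged.append(odd_lines[-1])
    return merged
```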


Experimental Results
To highlight the performance of the proposed scheme, the experiments are implemented on five standard sequences of depth maps, namely Balloons, Kendo, Newspaper, Dancer and Pantomime, which can be downloaded from [28]. The detailed information and the choice of some parameters of the tested sequences are shown in Table 1. This paper focuses on the comparison of the proposed optimized scheme against the conventional scheme. To show the generality of the results, several groups of data in each sequence are selected for comparison. According to the MDC quality assessment, we compare not only the rate central distortion performance when both descriptions are received correctly, but also the rate side distortion performance when only one description is received. The objective and subjective quality comparison for the sequences Balloons, Kendo, Newspaper, Dancer and Pantomime is presented in Figures 4-8.
In these figures, the horizontal axis represents the bit rate, and the vertical axis represents the PSNR values. One view is chosen for each of the five sequences for comparison in (a): the first view of Balloons, Kendo and Dancer, the second view of Newspaper and the 37th view of Pantomime. Compared to the basic comparison scheme, our proposed depth map coding scheme improves on average by around 1.2 dB for Balloons, 0.4 dB for Kendo, 0.5 dB for Newspaper, 2.6 dB for Dancer and 3.0 dB for Pantomime in side distortion and by around 1.2 dB for Balloons, 0.5 dB for Kendo, 0.8 dB for Newspaper, 3 dB for Dancer and 3.1 dB for Pantomime in central distortion at the same or a very close bit rate.
Another view is chosen for each of the five sequences for comparison in (b): the third view of Balloons, Kendo and Dancer, the fourth view of Newspaper and the 39th view of Pantomime. Compared to the basic comparison scheme, our proposed depth map coding scheme improves on average by around 0.7 dB for Balloons, 0.6 dB for Kendo, 0.6 dB for Newspaper, 3 dB for Dancer and 2.8 dB for Pantomime in side distortion and by around 0.8 dB for Balloons, 0.6 dB for Kendo, 0.8 dB for Newspaper, 3.3 dB for Dancer and 2.8 dB for Pantomime in central distortion at the same or a very close bit rate.
Given that the depth map is not directly used for display, the objective and subjective qualities of the rendered virtual views should be taken into account. For objective quality, the synthesized virtual viewpoint image is obtained from two original camera images. For example, for the tested sequences Balloons, Kendo and Dancer, the depth and texture from the first and third views are used to synthesize the texture of the second view, while for the sequence Newspaper, the depth and texture from the second and fourth views generate the texture of the third view, and for the sequence Pantomime, the depth and texture from the 37th and 39th views are used to synthesize the texture of the 38th view. Here, the texture image is also compressed, and the comparison for synthesized images is presented in (c). Compared to the basic comparison scheme, the images synthesized with our proposed scheme improve on average by around 1.1 dB for Balloons, 0.5 dB for Kendo, 0.4 dB for Newspaper, 2.2 dB for Dancer and 2.6 dB for Pantomime in side distortion and by around 1.1 dB for Balloons, 0.6 dB for Kendo, 0.4 dB for Newspaper, 2.2 dB for Dancer and 2.7 dB for Pantomime in central distortion at the same or a very close bit rate.
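To relate the reported dB gains to the MSE-based distortion, PSNR can be computed from the MSE as sketched below. The sketch assumes 8-bit images with a peak value of 255, which is not stated explicitly in the text; the name `psnr` is hypothetical.

```python
import math

def psnr(mse_value, peak=255.0):
    """Peak signal-to-noise ratio in dB for a given mean square error."""
    if mse_value == 0:
        return float("inf")   # identical images
    return 10.0 * math.log10(peak ** 2 / mse_value)
```

Under this definition, a gain of about 3 dB corresponds to roughly halving the MSE at the same bit rate.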
Furthermore, the advantages of the proposed scheme can be seen more clearly in Figures 9-12, in which the subjective quality of the synthesized virtual viewpoints of Kendo and Dancer is presented, especially in the parts marked by the red rectangles.
The structural similarity (SSIM) index [29] is a method for predicting the perceived quality of digital television and cinematic pictures, as well as other kinds of digital images and videos. SSIM measures the similarity between two images and is a full-reference metric. It is designed to improve on traditional methods, such as peak signal-to-noise ratio (PSNR) and mean squared error (MSE), which have proven to be inconsistent with human visual perception. Following each figure, the SSIM values for the sequences Balloons, Kendo, Newspaper, Dancer and Pantomime are listed in Tables 2-11: Tables 2, 4, 6, 8 and 10 give the SSIM values of the five sequences for the side decoder, and Tables 3, 5, 7, 9 and 11 give those for the central decoder. As the tables show, the SSIM values of both the depth maps and the synthesized images produced by the proposed scheme, measured against the uncompressed originals, are close to one, and the values of the proposed scheme are greater than or equal to those of the comparison scheme. This indicates that the proposed scheme is better than the basic comparison scheme.
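For illustration, a simplified single-window form of SSIM is sketched below. The standard index is computed over local windows and then averaged, so this global version, which uses whole-image statistics, is only an approximation. The constants $C_1 = (0.01 L)^2$ and $C_2 = (0.03 L)^2$ with $L = 255$ follow the commonly used defaults, and the name `global_ssim` is hypothetical.

```python
def global_ssim(img_x, img_y, peak=255.0):
    """SSIM computed from whole-image statistics of two 2-D lists."""
    xs = [float(v) for row in img_x for v in row]
    ys = [float(v) for row in img_y for v in row]
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mu_x) ** 2 for x in xs) / n
    var_y = sum((y - mu_y) ** 2 for y in ys) / n
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical images give a value of one, and the value drops as the luminance, contrast or structure of the two images diverge.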

Conclusions
In this paper, part of the high-frequency component of the depth map is removed in the DCT domain. An effective and adaptive optimization scheme is incorporated into the encoder to achieve better rate central and side distortion performance. The experiments also show that the proposed scheme has better subjective quality. Therefore, the proposed scheme is a worthy choice for depth map coding.


Figure 3. The diagram of the proposed decoder.


Figure 4. Objective quality comparison for the sequence Balloons: (a) the left view of depth map; (b) the right view of depth map; (c) the synthesized virtual viewpoint.


Figure 6. Objective quality comparison for the sequence Newspaper: (a) the left view of depth map; (b) the right view of depth map; (c) the synthesized virtual viewpoint.

Table 1. The parameters of the tested sequences.

Table 2. The SSIM values of the sequence Balloons for the side decoder.

Table 3. The SSIM values of the sequence Balloons for the central decoder.



Table 6. The SSIM values of the sequence Newspaper for the side decoder.

Table 7. The SSIM values of the sequence Newspaper for the central decoder.