Reliability-Based View Synthesis for Free Viewpoint Video

View synthesis is a crucial technique for free viewpoint video and multi-view video coding because of its capability to render an unlimited number of virtual viewpoints from adjacent captured texture images and their corresponding depth maps. The accuracy of the depth maps is critical to rendering quality, since depth image-based rendering (DIBR) is the most widely used technology among synthesis algorithms, yet stereo depth estimation is error-prone. In addition, filling occlusions is another challenge in producing desirable synthesized images. In this paper, we propose a reliability-based view synthesis framework. A depth refinement method is used to check the reliability of depth values and refine unreliable pixels, and an adaptive background modeling algorithm is utilized to construct a background image that fills the regions remaining empty after a proposed weighted blending process. Finally, the proposed approach is implemented and tested on test video sequences, and experimental results indicate objective and subjective improvements compared to previous view synthesis methods.


Introduction
In the past few decades, three-dimensional video has been widely adopted in various applications. Free viewpoint video (FVV) is a novel display format, evolved from 3D video, that enables viewers to watch a scene from any position [1]. This free navigation (FN) experience provides a rich and compelling immersive feeling that is much better than traditional 3D video [2]. However, FVV places significant requirements on video acquisition, compression, and transmission technology. Due to the limitations on camera volume and the bandwidth of the communication system, only a limited number of views can be transferred. View synthesis technology is proposed to support the FN capability by generating texture images that are not captured by a real camera. Depth image-based rendering (DIBR) [3] is a crucial technology for view synthesis. DIBR utilizes one or more reference texture images and their associated depth images to synthesize virtual view images, wherein every pixel in the original reference image plane is projected to the 3D world coordinate system according to its associated depth value; thereafter, the 3D world coordinates are projected onto the image plane of the virtual viewpoint [4].
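As a rough illustration of this two-stage warping, the following numpy sketch back-projects one pixel and re-projects it into the virtual camera. The pinhole model with intrinsics K and world-to-camera pose (R, t) is our assumption, and all names are illustrative, not the paper's exact parameterization:

```python
import numpy as np

def warp_pixel(u, v, z, K_ref, R_ref, t_ref, K_virt, R_virt, t_virt):
    """Back-project pixel (u, v) with depth z from the reference camera to
    world coordinates, then re-project it into the virtual camera."""
    # Pixel -> 3D point in the reference camera frame.
    p_cam = z * np.linalg.inv(K_ref) @ np.array([u, v, 1.0])
    # Camera frame -> world frame (pose given as world-to-camera R, t).
    p_world = R_ref.T @ (p_cam - t_ref)
    # World frame -> virtual camera frame -> virtual image plane.
    p = K_virt @ (R_virt @ p_world + t_virt)
    return p[0] / p[2], p[1] / p[2], p[2]
```

With identical reference and virtual cameras, a pixel maps onto itself, which is a convenient sanity check of the projection chain.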
Although a virtual view at an arbitrary viewpoint can be reconstructed by utilizing reference texture and depth information, DIBR still introduces some artifacts due to inaccurate depth images. Inpainting techniques can repair small holes in the synthesized image, while texture synthesis can fill the large-scale holes [12][13][14]. Since large holes are observed when areas that are occluded by foreground objects in the reference view become exposed in the synthesized view, view-blending approaches can be used to alleviate this problem, as two adjacent cameras can cover a relatively wider viewing angle [15][16][17].
As to exploiting the temporal correlation, Schmeing and Jiang [18] tried to determine the background information using a background subtraction method, but this approach relies on good performance of the foreground segmentation method, so it cannot be adopted in complex circumstances. Chen [19] explored the motion vectors of the H.264/AVC bit stream to render disocclusions in the virtual view. In [20,21], a background sprite was generated from the original texture and the synthesized images of temporally previous frames using disocclusion filling, but the temporal consistency of the synthesized images needs further investigation, as described in [22]. In [23], Yao proposed a disocclusion filling approach based on temporal correlation and depth information. Experimental results showed that this approach yields better subjective and objective performance than the above-mentioned spatial methods of filling disocclusions. However, the SVD format limits its wide usage because of the small baseline. Besides, some disocclusion regions that are not included in a single reference view may easily be spotted in another virtual viewpoint, and reverse mapping [4] may be more reliable. Luo et al. [24] proposed the use of a constructed background video with a modified Gaussian mixture model (GMM) to eliminate the holes in synthesized video. The foreground objects are detected and removed, then motion compensation and modified GMMs are applied to construct a stable background. Results indicated that a clean background without artifacts of foreground objects can be generated by using this background model, so that blurry effects or artifacts in disoccluded regions can be eliminated and the sharp edges along the foreground boundaries can be preserved with a realistic appearance [24].
Although [17] indicates that large holes in a target virtual view can be greatly reduced by using additional complementary views beyond the two most neighboring primary views, we still employ only two reference views to render virtual views in our proposed framework. The occlusions that appear on one warped image will be filled by the other reference viewpoint in the weighted blending process.
In this paper, a multiview plus depth (MVD) format is employed for view synthesis. Two reference views are selected to interpolate virtual views located between them. Occlusions that appear on one warped image will be filled by another reference viewpoint. In addition, an adaptive background modeling method is proposed to construct the background intensity distribution. The stable constructed reference background image helps to fill the remaining unfilled regions that are left due to the unreliable depth map. Another novelty of the proposed algorithm relates to depth refinement, which has the advantage of eliminating some noise caused by the coarse depth map. We also present a weighted blending process to blend two warped images from reference views based on the reliability of each pixel. An adaptive median filter and a depth map processing method are utilized before generating the synthesized virtual image.

Proposed Framework
In this section, the proposed approach will be presented in detail. The framework of the proposed synthesis algorithm is illustrated in Figure 1. Four main techniques are proposed in this framework: depth refinement, background modeling, reliability-based blending, and a depth map processing method. These approaches will be discussed in Section 3.


Depth Refinement
There are two steps in the depth refinement process, as illustrated in Figure 1a. In the first step, depth consistency cross-checking is used to check whether each pixel's depth value is reliable or unreliable. Second, depth refinement is employed to interpolate the depth values of unreliable pixels. The details of the first step are as follows. For depth consistency cross-checking of the left reference view: let (u, v) be the coordinate of one pixel from the left reference view; its corresponding pixel (u_w, v_w) in the right reference view is obtained through the classical DIBR technology [2]. The texture value I and depth value D of these two pixels are verified; the subscripts L and R indicate the left view and right view, respectively. I_th is a large preset threshold value for texture comparison and D_th is a small preset threshold value for depth comparison. The consistency checking produces five results, as follows: (1) If ||I_L(u, v) − I_R(u_w, v_w)||²₂ ≤ I_th and |D_L(u, v) − D_R(u_w, v_w)| ≤ D_th are both satisfied, these two pixels are matched. The depth pixel in the left reference depth map is reliable only in this situation, and it is marked as black in its cross-checking mask. (2) If ||I_L(u, v) − I_R(u_w, v_w)||²₂ > I_th and |D_L(u, v) − D_R(u_w, v_w)| > D_th are both satisfied, these two pixels fail to match. In this situation, there is a high probability that the pixel belongs to an occluded area, so the reliability of the depth pixel in the left reference depth map cannot be verified; it is marked as blue in its cross-checking mask. (3) If ||I_L(u, v) − I_R(u_w, v_w)||²₂ > I_th and |D_L(u, v) − D_R(u_w, v_w)| ≤ D_th are both satisfied, these two pixels also fail to match. Either an erroneous texture pixel or an unreliable depth value causes this situation; we check the surrounding depth distribution to find the real reason in the second step. The depth pixel in the left reference depth map is unreliable, and it is marked as red in its cross-checking mask. (4) If ||I_L(u, v) − I_R(u_w, v_w)||²₂ ≤ I_th and |D_L(u, v) − D_R(u_w, v_w)| > D_th are both satisfied, this also implies that the depth pixel is unreliable, and it is marked as green in its cross-checking mask.
(5) Some pixels in the left reference view are not able to project into the right reference view, because their corresponding pixels are located outside the image boundary. These areas are marked as white. Figure 2 shows a result of the depth consistency check; because pixels in the white and blue regions have no chance to verify their reliability, they are all treated as unreliable, and a specific weight is given when they are interpolated into the virtual view. Several measures are implemented to refine the other unreliable pixels, especially those in the red and green regions. The main idea of the refinement is to find the most appropriate reliable pixel values to interpolate the depth value of an unreliable pixel. Neighboring pixels from four directions are utilized, and both the inverse proportion of distance and the reliability of the depth value are considered in calculating the weighting factors. If a reliable depth pixel maps to a reliable pixel in the other view, this indicates that the depth pixel is highly reliable. On the contrary, if the corresponding pixel in the other view is unreliable, the reliability of the pixel is lower.
Let WD_t, WD_b, WD_l, and WD_r be the weighting factors calculated from the distance between the current unreliable depth pixel and the nearest reliable depth pixel in the top, bottom, left, and right directions, respectively. W_H and W_L are the weighting values for high reliability and low reliability, respectively. The weighting factor for each direction can be formulated as:

W_direction = WD_direction × W_H, if the pixel in this direction has high reliability; W_direction = WD_direction × W_L, if the pixel in this direction has low reliability, (1)

where the subscript direction can be either t, b, l, or r. The four weighting factors are normalized as WN_direction; then the unreliable depth value D_r can be interpolated by Equation (2):

D_r = Σ_direction WN_direction × D_direction, (2)

where D_direction is the nearest reliable depth value in each of the four directions.
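The four-direction interpolation of Equations (1) and (2) can be sketched as follows. This is a minimal Python illustration; the concrete values of W_H and W_L, the function name, and the neighbor representation are our assumptions:

```python
import numpy as np

# Hypothetical reliability weights: W_H for neighbors confirmed reliable in
# the other view, W_L for neighbors of lower reliability (values illustrative).
W_H, W_L = 1.0, 0.5

def refine_depth(neighbors):
    """Interpolate an unreliable depth value from the nearest reliable
    neighbor in each direction (top, bottom, left, right).

    `neighbors` maps a direction to (distance, depth, high_reliability)."""
    weights, depths = [], []
    for dist, depth, high_rel in neighbors.values():
        wd = 1.0 / dist                      # inverse proportion of distance
        w = wd * (W_H if high_rel else W_L)  # Equation (1)
        weights.append(w)
        depths.append(depth)
    wn = np.array(weights) / sum(weights)    # normalized WN_direction
    return float(wn @ np.array(depths))      # Equation (2)
```

A closer neighbor, or one whose correspondence in the other view is reliable, thus contributes proportionally more to the interpolated value.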

Adaptive Background Modeling
In the previous step, a refined depth map was obtained. In Section 3.2, we propose to apply an adaptive background modeling method, evolved from the Gaussian mixture model (GMM), to generate a reference image. GMM is commonly used in video processing to detect moving objects because of its capacity to identify foreground and background pixels [7]. In previous research, GMM was utilized to construct a stable background image aiming to fill large empty regions. However, GMM is not suitable for scenes that contain periodic or reciprocating foreground objects; these moving foreground objects are easily detected as erroneous background pixels, thus generating an inaccurate background image. In addition, some background pixels might change slightly, for example, pixel intensities differ when shadows caused by foreground objects appear or move. Thus, the stable background images generated by previous approaches always had blurring effects and were not accurate. In our proposed adaptive background modeling method, both the texture images and their associated depth maps are utilized to explore the temporal correlation. In addition, we propose to apply a reliability-based view synthesis method using background information to interpolate the intermediate image and fill the disocclusions.

The proposed method works at the pixel level, and every pixel is modeled independently by a mixture of K Gaussian distributions, where K is usually between 3 and 5. By using this distribution, pixel values that have a high probability of occurring are saved if their associated depth values show that they belong to the background. The Gaussian mixture distribution with K components can be written as [25]:

p(x_{j,t}) = Σ_{i=1..K} ω_{j,i,t} · η(x_{j,t}, μ_{j,i,t}, σ²_{j,i,t}),

where p(x_{j,t}) denotes the probability density of value x_{j,t} of pixel j at time t; η is the Gaussian density function with three dependent variables x_{j,t}, μ_{j,i,t}, and σ²_{j,i,t}, where μ_{j,i,t} denotes the mean value of pixel x_j and σ²_{j,i,t} its variance. Further, ω_{j,i,t} is the weight of the ith Gaussian distribution at time t of pixel j, with Σ_{i=1..K} ω_{j,i,t} = 1. The function η is given by:

η(x, μ, σ²) = (1/√(2πσ²)) · exp(−(x − μ)² / (2σ²)).

Before texture information is modeled by the Gaussian distributions, we propose to verify each new pixel to ensure that it is not from a foreground region. If the depth value is much bigger than the stored depth buffer (which means the pixel is nearer to the capture device), the pixel is considered a foreground pixel. Otherwise, if the depth value is much smaller than the stored buffer, the pixel is considered a background pixel; the modeled distribution is then not reliable and should be restarted. The detailed process to generate the reference background distribution is as follows: 1.
Initialization. The model is initialized at the beginning of the generation (time t_0):

μ_{j,t_0} = x_{j,t_0}, d_j = d_{j,t_0},

where the variance value σ²_j is set to a certain large number, d_j is the stored depth buffer for pixel j, and d_{j,t_0} is the depth value of pixel j at time t_0.

2.
Update. In the next frame, i.e., at time t_1, we first check the depth level of this pixel, and d_{j,t_1} is compared with the existing depth buffer d_j. There are three situations for the depth comparison results: (a) If the condition d_{j,t_1} − d_j > t_d is satisfied (t_d is a predefined threshold depth value), the new pixel x_{j,t_1} belongs to the foreground objects; it is discarded, and the background distribution is not updated. (b) If |d_{j,t_1} − d_j| < t_d is verified, x_{j,t_1} is matched against the K Gaussian models. For each model i from 1 to K, if the condition |x_{j,t_1} − μ_{j,i,t_0}| ≤ 2.5 σ_{j,i,t_0} is satisfied, the matching process stops and the matched Gaussian model is updated as follows:

ω_{j,i,t_1} = (1 − α) ω_{j,i,t_0} + α,
μ_{j,i,t_1} = (1 − ρ) μ_{j,i,t_0} + ρ x_{j,t_1},
σ²_{j,i,t_1} = (1 − ρ) σ²_{j,i,t_0} + ρ (x_{j,t_1} − μ_{j,i,t_1})²,

where α is the model learning rate (α = 0.01) and ρ = α/ω_{j,i,t_0}. The parameters of the other Gaussian models remain unchanged except their weights, which decay as ω_{j,i,t_1} = (1 − α) ω_{j,i,t_0}. The learning rates α and ρ reflect the rate of model convergence. If pixel x_{j,t_1} fails to match all the current Gaussian models, a new Gaussian model is introduced to evict the Gaussian model with the smallest ω/σ value. The mean and variance values of the other Gaussian models remain unchanged, while the new model is set with μ_{j,t_1} = x_{j,t_1}, σ_{j,t_1} = 30, ω_{j,t_1} = 0.01. Finally, the weights of the K Gaussian models are normalized so that they sum to one. (c) In the third situation, if the condition d_j − d_{j,t_1} > t_d is satisfied, the new input pixel x_{j,t_1} belongs to the background, and the previous Gaussian distributions need to be abandoned. The first step is executed again for x_{j,t_1}.

3.
Convergence. The remaining frames are processed by repeating step 2. The value of each background pixel is derived from μ, and the most stable pixels in the time domain form the modeled background image; meanwhile, the number of Gaussian models of each pixel is obtained to determine whether the pixel experiences similar intensities over time or not.
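The three steps above can be sketched as a per-pixel model. This is a simplified scalar-intensity illustration; the class name and the depth threshold are our assumptions, while α = 0.01, the 2.5σ matching rule, and σ = 30, ω = 0.01 for newly introduced models follow the text:

```python
import numpy as np

ALPHA, T_D, K_MAX = 0.01, 10.0, 4   # learning rate, depth threshold, max models

class PixelBackgroundModel:
    """Per-pixel adaptive background model with depth gating (steps 1-3)."""
    def __init__(self, x0, d0):
        # Step 1: a single Gaussian centered on the first observation,
        # with a large initial variance; d is the stored depth buffer.
        self.mu, self.var, self.w = [float(x0)], [900.0], [1.0]
        self.d = d0

    def update(self, x, d):
        if d - self.d > T_D:             # (a) nearer pixel: foreground, discard
            return
        if self.d - d > T_D:             # (c) farther pixel: restart the model
            self.__init__(x, d)
            return
        for i in range(len(self.mu)):    # (b) try to match an existing model
            if abs(x - self.mu[i]) <= 2.5 * np.sqrt(self.var[i]):
                rho = ALPHA / self.w[i]
                self.w[i] = (1 - ALPHA) * self.w[i] + ALPHA
                self.mu[i] = (1 - rho) * self.mu[i] + rho * x
                self.var[i] = (1 - rho) * self.var[i] + rho * (x - self.mu[i]) ** 2
                break
        else:                            # no match: introduce a new model
            if len(self.mu) == K_MAX:    # evict the model with smallest w/sigma
                j = int(np.argmin([w / np.sqrt(v)
                                   for w, v in zip(self.w, self.var)]))
                del self.mu[j], self.var[j], self.w[j]
            self.mu.append(float(x)); self.var.append(900.0); self.w.append(0.01)
        s = sum(self.w)                  # normalize weights to sum to one
        self.w = [w / s for w in self.w]

    def background_value(self):
        return self.mu[int(np.argmax(self.w))]   # most supported intensity
```

In a full implementation, one such model is maintained for every pixel location, and the background image is read out from the dominant means after convergence.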
Figure 3 shows two examples of adaptive background modeling.Figure 3a presents the Ballet background image generated with a small baseline, where cam03 is chosen as a target virtual viewpoint that is interpolated by the reference viewpoints cam02 and cam04.Figure 3b presents the Breakdancers modeling result, where the background image at virtual viewpoint cam04 is projected from reference viewpoints cam02 and cam06.Although some foreground objects are stored in a stable temporal background reference using the mechanism of the proposed framework, these effects would not affect the quality of the final synthesized image, since the filling of remaining empty regions always occurs in the background areas.Thus, the temporal stable background information can be obtained by both large and small baseline instances.This adaptive background modeling approach can be widely adopted in applications with unchanged scenes.

Reliability-Based Weighted Blending
As the background distribution for each reference view is obtained by the proposed background modeling method discussed in Section 3.2, two background images are projected into the virtual viewpoint and then blended into one background image in the virtual viewpoint (represented by I_B). Previous research shows that GMM has an inherent capacity to capture background and foreground pixel intensities; missing pixel intensities of an occluded area are successfully recovered by exploiting temporal correlation.

In our proposed method, weighting factors are also applied to blend two reference views and one background image into a synthesized image. Two reference texture images are projected to the virtual view using their corresponding refined depth maps, and two intermediate texture images I_L, I_R and depth images D_L, D_R are obtained. The reliability-based weighted blending process to produce a virtual image I_V is as follows: (1) If a pixel is filled in both I_L and I_R, the two depth values are compared first. If the depth value of one pixel is much bigger than the other, that pixel is obviously nearer to the capturing device, and I_V is filled by the pixel with the bigger associated depth value. If the two depth values are very close, weighting factors are utilized and I_V is formulated as follows:

I_V = W_L × I_L + W_R × I_R, where W_i = WD_i × WR_i for i ∈ {L, R},

where WD_i is the weighting factor inversely proportional to the distance between the reference view and the virtual view, and WR_i is the weighting factor for the previously defined reliability of the depth value. One of three values (r_H, r_M, or r_L) is assigned to WR_i when the pixel in this reference intermediate image is mapped by a reliable, refined, or unreliable depth value, respectively. It should be noted that W_L and W_R are normalized so that W_L + W_R = 1. (2) If only one pixel is filled in the two reference views, for example only I_L is filled, the reliability of I_L is taken into consideration. If I_L is mapped by a reliable depth value, I_V can simply be filled with I_L (I_V = I_L). Otherwise, background information is used to generate I_V: if D_L is close to the background depth value D_B, then I_V = (I_L + I_B)/2; if D_L is much bigger than D_B, then I_V = I_L. (3) If the pixel is filled in neither reference view, we use the constructed background image to deal with the hole-filling challenge. First, we check the surrounding depth values of I_V and use the filled depth values to determine a proper depth value range. Then I_V is filled by the background pixel if its depth value is in the obtained range. Otherwise, inverse warping and classical inpainting are applied to fill I_V.
We propose this hole-filling method to ensure that background pixels are appropriate to fill the remaining hole regions. Because depth information is adopted, suitable background pixels can be chosen to improve the rendered image quality even when a hole is surrounded by foreground objects.
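Case (1) of the blending process can be sketched as follows. This is an illustrative Python fragment: the depth threshold, the concrete reliability weights, and the simplified handling of case (2) (the background fallback is omitted) are our assumptions:

```python
# Hypothetical reliability weights for reliable / refined / unreliable
# depth values, and a "very close" depth threshold (values illustrative).
R_H, R_M, R_L = 1.0, 0.6, 0.2
DEPTH_EPS = 5.0

def blend_pixel(iL, dL, rL, iR, dR, rR, wdL, wdR):
    """Blend one pixel from the two warped reference views.
    iX: intensity (None if unfilled), dX: depth, rX: reliability weight
    (R_H/R_M/R_L), wdX: inverse-distance weight of each reference view."""
    if iL is None and iR is None:
        return None                     # case 3: left for background filling
    if iR is None:
        return iL                       # case 2 (reliability check omitted)
    if iL is None:
        return iR
    if dL - dR > DEPTH_EPS:             # occlusion: keep the nearer pixel
        return iL
    if dR - dL > DEPTH_EPS:
        return iR
    wL, wR = wdL * rL, wdR * rR         # combine distance and reliability
    return (wL * iL + wR * iR) / (wL + wR)   # normalized so W_L + W_R = 1
```

Dividing by wL + wR performs the normalization of the two weighting factors, so a view whose pixel stems from a refined or unreliable depth value contributes correspondingly less.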

Depth Map Processing Method
After weighted blending is completed, the warped texture image and depth map are entirely filled. However, cracks and pinholes can still be observed in the rendered image. In previous methods, a classical median filter was applied to smooth the texture image and remove these artifacts. In our framework, a depth map processing method (DMPM) is proposed. Not only the above-mentioned artifacts, but also the background pixels in cracks of foreground regions (shown in Figure 4a,b) can be removed. This method has an advantage in preserving texture details, since it is only performed on the detected coordinates.
The main idea of DMPM is based on the fact that pixel values in depth maps change smoothly over large areas, except at the sharp edges along the boundary between foreground objects and background. These features allow easy detection of noise in depth maps. In fact, most artifacts and noise caused by inaccurate depth values are reduced by the previously introduced depth refinement, but some unreliable or undetected depth values remain in the reference depth map, most of them in out-of-boundary areas and occluded areas. Therefore, DMPM is still necessary. Details of the depth map processing method are as follows: (1) A conventional median filter is applied to the coarse depth map d_in(x, y) to obtain an improved depth map d'(x, y). It is capable of removing the existing noise while preserving the sharp boundary information.
(2) The texture image I in (x, y) is refined according to the improvement of its associated depth map.
If the condition |d'(x, y) − d_in(x, y)| > ε is satisfied (ε is a threshold value for the depth difference), the depth value of the pixel is unreliable, and it is renewed after the median filter. An inverse mapping process using the updated depth value is employed to find an appropriate texture pixel. A depth range d ∈ [d' − ε, d' + ε] is used as a candidate to find the corresponding pixel in the two reference views. Through the backward projection equations, we can get a corresponding reference pixel location (u_r, v_r) from pixel (x, y) and the associated depth values z_v and z_r; A and b denote the rotation matrix and translation matrix, respectively. Several measurements are used to make sure a highly reliable pixel is obtained by backward warping. First, the depth value of the obtained pixel should be close to the updated depth value d'(x, y). Second, the disparity between (x, y) and (u_r, v_r) should not be too large, according to the alignment of the reference viewpoint and virtual viewpoint. In our previous method, we simply used a median filter on (x, y), and this turned out to be very effective when the texture of the area was smooth. However, a median filter easily produces blurring effects when the scene has detailed textures. Unlike texture images, the smooth regions in a depth map are barely affected by filtering of the gray-value distribution. After the renovation is conducted, the associated texture image is updated according to the improvement of its depth map.
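The filtering and detection steps of DMPM can be sketched as follows. This is a numpy illustration with an assumed 3×3 window and threshold; the backward-warping renewal of the flagged texture pixels is omitted:

```python
import numpy as np

EPS = 8.0   # depth-difference threshold epsilon (value illustrative)

def median3x3(img):
    """3x3 median filter with edge replication."""
    p = np.pad(img, 1, mode='edge')
    stack = [p[r:r + img.shape[0], c:c + img.shape[1]]
             for r in range(3) for c in range(3)]
    return np.median(np.stack(stack), axis=0)

def dmpm_detect(depth):
    """Median-filter the depth map and flag the pixels whose value changed
    by more than EPS.  Only the flagged coordinates would then be renewed
    by backward warping, leaving the rest of the texture image untouched."""
    d = median3x3(depth.astype(float))
    mask = np.abs(d - depth) > EPS
    return d, mask
```

On a smooth depth map, an isolated impulse is removed by the median filter and flagged by the mask, while all other pixels keep their values, which is exactly the selective behavior the method relies on.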
Figure 4d shows the updated version of the integrated depth map, where the infiltration errors and unnatural depth distribution are eliminated by the classical median filter, while the sharp edges are preserved.Comparing Figure 4a,c, the DMPM generates desirable improvement and avoids filtering of the entire image at the same time.

Experimental Results
In this section, the proposed framework, implemented in C++ based on OpenCV, is evaluated. The tested multiview video plus depth sequences include two Microsoft datasets: Ballet and Breakdancers. In all video sequences, the size of each frame is 1024 × 768 pixels, and each video contains 100 frames with a static background. The baseline between two adjacent cameras is 20 cm for both Ballet and Breakdancers. The associated depth maps and camera parameters are provided with the sequences. The format of all video sequences is AVI, and the texture images contain three channels (RGB).

To evaluate the performance of the proposed method, we implemented two state-of-the-art methods and our previous work [26] for comparison with the proposed approach. One of the two methods is a commonly used reference software, VSRS 3.5 [27], which mainly contains a simple DIBR method [3] and a classical inpainting technique [10]. The other is a hole-filling method exploiting temporal correlations based on GMM [5]. These two methods [5,27] represent the exploitation of temporal correlation and spatial correlation, respectively. In each experiment, the test sequence was composed of three real video sequences from three reference viewpoints. The coded left and right views with their associated depth videos were projected to interpolate the virtual video at the target viewpoint between them. The rendered sequence was compared with the actual video at the target viewpoint to measure the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM). In order to show the wide practical applicability of the proposed synthesis algorithm, each view synthesis method was performed on both small baseline and large baseline instances. Tables 1 and 2 show the average PSNR and SSIM values over 100 frames. In the PSNR evaluation, the proposed approach obtained 4-10 dB better results than VSRS 3.5 on Ballet for the large baseline instance. In the case of a small baseline, the results for both Ballet and Breakdancers were also better. The proposed method also outperformed the GMM-based disocclusion filling method and our previous work. Inpainting is an effective algorithm for filling narrow gaps and other small empty regions when the baseline is small; however, it is not practical for filling large empty regions. Moreover, our previous work did not perform well on either the Ballet or the Breakdancers sequence. This is due to the fact that a simple GMM is not capable of dealing with scenes in which foreground objects undergo reciprocating motion.
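The PSNR measure used in this evaluation can be computed as below. This is the standard definition, shown for completeness; 255 is assumed as the peak value for 8-bit images:

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio (dB) between the rendered frame and the
    ground-truth view captured at the target viewpoint."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float('inf')              # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

For video sequences, the per-frame PSNR values are averaged over all 100 frames, as in Tables 1 and 2.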
Consequently, the proposed approach yielded better results on both tested sequences, and the larger the baseline, the greater the improvement. In addition to the objective measurements, Figure 5 shows a subjective comparison. Figure 5a presents the synthesized results generated by a simple DIBR technique, where the disocclusion regions and pinholes remain to be filled. Figure 5b shows the performance of VSRS 3.5, where large empty regions are filled based on neighboring texture information; blurring effects are observed, in contrast to our proposed method in Figure 5e. This improvement comes from our idea of avoiding global processing of every pixel when handling noise. Hence, our method reduces errors and removes unwanted effects while the texture remains sharp and clear. Figure 5c shows an enlarged part of the synthesis result produced by the GMM-based disocclusion filling method; this temporal-correlation method fills large empty areas better than the inpainting method. Depth refinement and weighted blending lead to much more satisfactory interpolation results, as shown in Figure 5e.
Frame-by-frame comparisons of PSNR and SSIM are shown in Figure 6. Figure 6a,b show a synthesis result with a large baseline: viewpoint cam03 is interpolated from cam01 and cam07. Another PSNR and SSIM comparison (Figure 6c,d) comes from a small baseline: two reference viewpoints, cam03 and cam05, were utilized to render the target virtual view cam04. Both instances are from the Ballet sequence. Clearly, exploiting temporal correlations to fill the disocclusions outperforms the inpainting-based view synthesis method, which explores only spatial correlation, especially when the baseline is large. Across all frames, our proposed framework produces more stable output than the GMM-based method.
We additionally measured the computation time of all four approaches. The greater improvements in subjective and objective image quality come at the cost of more complex computation: in our proposed method, the 3D warping process is performed six times and adaptive background modeling is applied twice, which is why its computation cost is high. We consider this cost acceptable for three reasons. First, GPU acceleration is commonly used in image processing and hardware performance is growing rapidly, so if a parallel algorithm is adopted, the increased per-frame computation time does not add much to the time needed to synthesize the whole sequence. Second, this work mainly explores the contribution of the depth refinement technique and adaptive background modeling; the pipeline can be streamlined when the method is applied in practical applications. Finally, our implementation is based on the OpenCV library without optimization, so careful coding could reduce the computation time considerably.
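The parallelization argument can be illustrated with a simple worker-pool sketch. Here `synthesize_frame` is a hypothetical stand-in for the per-frame pipeline, not the paper's implementation, and we assume frames can be processed independently once the background model is available:

```python
from concurrent.futures import ThreadPoolExecutor

def synthesize_frame(frame_index):
    # Hypothetical stand-in for the per-frame pipeline
    # (3D warping, weighted blending, background-based hole filling).
    return frame_index

def synthesize_sequence(num_frames, workers=4):
    # Dispatch independent frames to a pool of workers, so the higher
    # per-frame cost is amortized across the whole sequence.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(synthesize_frame, range(num_frames)))
```

In a real system the workers would be GPU streams or processes rather than threads; the point is only that the per-frame cost does not accumulate linearly in wall-clock time.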

Conclusions
In this paper, we presented a reliability-based view synthesis framework using depth refinement and an adaptive background modeling method. Multiple viewpoints are employed to render desirable virtual images. In the proposed algorithm, the disocclusion regions are filled from a combination of two sources. The first is the pair of reference viewpoints: disocclusion regions generated from one reference view are likely to be visible in the other reference view because of its different position and viewing angle. If the disocclusion regions are missing in both reference views, the updated background image is used to fill the static regions. Experimental results indicate that depth refinement clearly improves the accuracy of the depth map, and thereby the performance of the proposed adaptive background modeling and forward (and backward) warping. In addition, an adaptive median filter and the depth map processing method (DMPM) are proposed to replace the classical median filter, owing to their ability to eliminate unwanted effects and noise while preserving high-quality texture. The experimental results show that the combination of the proposed techniques yields satisfactory subjective and objective improvements. Our future research will focus on three aspects. First, we will improve synthesis quality while reducing computational complexity. Second, we will explore how to construct a stable temporal correlation for complex scenes with moving cameras. Finally, as deep learning becomes increasingly popular in various fields, learning-based view synthesis appears to be a promising direction.
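The background-based disocclusion filling summarized above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions (a running-average update restricted to pixels classified as background), not the adaptive model used in the paper:

```python
import numpy as np

def update_background(background, frame, is_background, rate=0.05):
    """Running-average update, applied only at background-classified pixels."""
    bg = background.astype(np.float64).copy()
    bg[is_background] = (1 - rate) * bg[is_background] + rate * frame[is_background]
    return bg

def fill_disocclusions(warped, hole_mask, background):
    """Copy background-model pixels into the empty (disoccluded) regions
    that remain after warping and blending the reference views."""
    filled = warped.copy()
    filled[hole_mask] = background[hole_mask]
    return filled
```

Restricting the update to background-classified pixels keeps moving foreground objects from contaminating the model, which is the property the hole filling relies on.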

Figure 1. Framework of the proposed view synthesis: (a) illustration of depth refinement; (b) the framework of the proposed approach using refined depth information.

Figure 2. Result of the depth consistency cross-check.

Figure 4. Examples of the depth map processing method: (a,b) enlarged integrated texture image and its associated depth map before the depth map processing method (DMPM); (c,d) the same image and its associated depth map after DMPM.

Author Contributions: Z.D. and M.W. designed the experiments. Z.D. performed the experiments. Z.D. wrote the paper and analyzed the data. M.W. contributed simulation tools. M.W. supervised the whole work.

Table 1. Average peak signal-to-noise ratio (PSNR) comparison of the proposed technique and three state-of-the-art techniques.

Table 2. Average structural similarity index (SSIM) comparison of the proposed technique and three state-of-the-art techniques.