# Reliability-Based View Synthesis for Free Viewpoint Video


## Abstract


## Featured Application

**View synthesis technology is important for free viewpoint video (FVV) and multiview video coding (MVC). It is a practical approach to reducing storage and transmission bandwidth for multiview videos.**


## 1. Introduction

## 2. Related Work

## 3. Proposed Framework

#### 3.1. Depth Refinement

For each pixel $(u,v)$ in the left reference view, its corresponding pixel $({u}^{w},{v}^{w})$ in the right reference view is obtained through the classical DIBR technology [2]. The texture value $I$ and depth value $D$ of these two pixels are verified; the subscripts $L$ and $R$ indicate the left view and right view, respectively. ${I}_{th}$ is a large preset threshold for texture comparison and ${D}_{th}$ is a small preset threshold for depth comparison. The consistency check produces five results, as follows:

- (1)
- If ${||{I}_{L}\left(u,v\right)-{I}_{R}\left({u}^{w},{v}^{w}\right)||}_{2}^{2}\le {I}_{th}$ and $\left|{D}_{L}\left(u,v\right)-{D}_{R}\left({u}^{w},{v}^{w}\right)\right|\le {D}_{th}$ are both satisfied, this implies that these two pixels are matched. This depth pixel in the left reference depth map is reliable only in this situation, and it is marked as black in its cross-checking mask.
- (2)
- If ${||{I}_{L}\left(u,v\right)-{I}_{R}\left({u}^{w},{v}^{w}\right)||}_{2}^{2}>{I}_{th}$ and $\left|{D}_{L}\left(u,v\right)-{D}_{R}\left({u}^{w},{v}^{w}\right)\right|>{D}_{th}$ are both satisfied, this implies that these two pixels fail to match. In this situation, there is a high probability that the pixel belongs to an occlusion area; the reliability of this depth pixel in the left reference depth map cannot be determined, and it is marked as blue in its cross-checking mask.
- (3)
- If ${||{I}_{L}\left(u,v\right)-{I}_{R}\left({u}^{w},{v}^{w}\right)||}_{2}^{2}>{I}_{th}$ and $\left|{D}_{L}\left(u,v\right)-{D}_{R}\left({u}^{w},{v}^{w}\right)\right|\le {D}_{th}$ are both satisfied, this implies that these two pixels fail to match. Either an erroneous texture pixel or an unreliable depth value causes this situation. We will check its surrounding depth distribution to find the real reason in the second step. The depth pixel in the left reference depth map is unreliable and it is marked as red in its cross-checking mask.
- (4)
- If ${||{I}_{L}\left(u,v\right)-{I}_{R}\left({u}^{w},{v}^{w}\right)||}_{2}^{2}\le {I}_{th}$ and $\left|{D}_{L}\left(u,v\right)-{D}_{R}\left({u}^{w},{v}^{w}\right)\right|>{D}_{th}$ are both satisfied, this also implies that the depth pixel is unreliable, and it is marked as green in its cross-checking mask.
- (5)
- Some pixels in the left reference view are not able to project into the right reference view, because their corresponding pixels are located outside the image boundary. These areas are marked as white.
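
The five-way check above can be collected into a few lines of code. The sketch below is a hedged, minimal vectorized version: it assumes the warped coordinates $({u}^{w},{v}^{w})$ have already been computed by DIBR, that textures are RGB `uint8` arrays, and the threshold values and function name are illustrative, not the paper's.

```python
import numpy as np

# Mask labels; they correspond to black / blue / red / green / white in the text.
MATCHED, OCCLUDED, TEX_MISMATCH, DEPTH_MISMATCH, OUT_OF_BOUNDS = range(5)

def cross_check(IL, DL, IR, DR, uw, vw, I_th=300.0, D_th=5):
    """Classify every left-view pixel into one of the five mask labels.
    uw, vw: integer arrays giving the warped right-view coordinates."""
    h, w = DL.shape
    mask = np.full((h, w), OUT_OF_BOUNDS, dtype=np.uint8)   # case (5)
    inside = (uw >= 0) & (uw < w) & (vw >= 0) & (vw < h)
    ys, xs = np.nonzero(inside)
    ur, vr = uw[ys, xs], vw[ys, xs]
    # squared L2 texture difference and absolute depth difference
    tex = np.sum((IL[ys, xs].astype(np.float64) - IR[vr, ur]) ** 2, axis=1)
    dep = np.abs(DL[ys, xs].astype(np.int32) - DR[vr, ur])
    mask[ys, xs] = np.where((tex <= I_th) & (dep <= D_th), MATCHED,       # (1)
                   np.where((tex >  I_th) & (dep >  D_th), OCCLUDED,      # (2)
                   np.where( tex >  I_th, TEX_MISMATCH, DEPTH_MISMATCH))) # (3)/(4)
    return mask
```

The same mask can then drive the refinement step of Section 3.1, which only touches pixels labeled as mismatched.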

Let $W{D}_{t}$, $W{D}_{b}$, $W{D}_{l}$, and $W{D}_{r}$ be the weighting factors calculated from the distance between the current unreliable depth value and the nearest reliable depth pixel in the top, bottom, left, and right directions, respectively. ${W}_{H}$ and ${W}_{L}$ are the weighting values for high reliability and low reliability, respectively. The weighting factor for each direction is formulated by Equation (1). Denoting the weighting factor of a direction by $W{D}_{direction}$, the unreliable depth value ${D}_{r}$ can be interpolated by Equation (2), where ${D}_{d}$ is the nearest reliable depth value in one of the four directions.
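
Equations (1) and (2) do not survive in this excerpt, so the sketch below is only a placeholder under a common assumption: each direction's weight is ${W}_{H}$ or ${W}_{L}$ (depending on the reliability of the nearest found pixel) divided by its distance, and the refined depth is the weight-normalized average.

```python
def interpolate_depth(neighbors, W_H=1.0, W_L=0.5):
    """neighbors: one (distance, depth, is_high_reliability) tuple per
    direction (top/bottom/left/right) in which a reliable pixel was found.
    Assumed stand-ins for Equations (1) and (2): distance-inverse weights."""
    num = den = 0.0
    for dist, depth, high in neighbors:
        wd = (W_H if high else W_L) / dist   # assumed Equation (1)
        num += wd * depth                    # assumed Equation (2): weighted sum
        den += wd
    return num / den if den > 0 else 0.0
```

With agreeing neighbors the interpolation simply reproduces their value; distant or low-reliability neighbors contribute less.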

#### 3.2. Adaptive Background Modeling

**Initialization.** The model is initialized at the beginning of the generation (time ${t}_{0}$):
$${\omega}_{j,{t}_{0}}^{i}=\left\{\begin{array}{ll}1,&\mathrm{if}\ i=1\\0,&\mathrm{else}\end{array}\right.$$
$${\mu}_{j,i,{t}_{0}}=\left\{\begin{array}{ll}{x}_{j,{t}_{0}},&\mathrm{if}\ i=1\\0,&\mathrm{else}\end{array}\right.$$
$${\sigma}_{j,i,{t}_{0}}^{2}={\sigma}_{j}^{2},\qquad {d}_{j}={d}_{j,{t}_{0}}.$$

**Update.** In the next frame, i.e., at time ${t}_{1}$, we first check the depth level of the pixel: ${d}_{j,{t}_{1}}$ is compared with the existing depth buffer ${d}_{j}$. There are three situations for the depth comparison results:

- (a)
- If the condition ${d}_{j,{t}_{1}}-{d}_{j}>{t}_{d}$ is satisfied (${t}_{d}$ is a predefined depth threshold), the new pixel ${x}_{j,{t}_{1}}$ belongs to the foreground objects; it is discarded, and the background distribution is not updated.
- (b)
- If $\left|{d}_{j,{t}_{1}}-{d}_{j}\right|\le {t}_{d}$ is verified, ${x}_{j,{t}_{1}}$ is matched against the K Gaussian models. For each model $i$ from 1 to $K$, if the condition $\left|{x}_{j,{t}_{1}}-{\mu}_{j,i,{t}_{0}}\right|\le 2.5\left|{\sigma}_{j,i,{t}_{0}}\right|$ is satisfied, the matching process stops, and the matched Gaussian model is updated as follows:$${\omega}_{j,{t}_{1}}^{i}=\left(1-\alpha \right){\omega}_{j,{t}_{0}}^{i}+\alpha ,$$$${\mu}_{j,i,{t}_{1}}=\left(1-\rho \right){\mu}_{j,i,{t}_{0}}+\rho \cdot {x}_{j,{t}_{1}},$$$${\sigma}_{j,i,{t}_{1}}^{2}=\left(1-\rho \right){\sigma}_{j,i,{t}_{0}}^{2}+\rho {\left({x}_{j,{t}_{1}}-{\mu}_{j,i,{t}_{1}}\right)}^{2},$$$${d}_{j}=\frac{{d}_{j}+{d}_{j,{t}_{1}}}{2}.$$The weights of the unmatched models decay as$${\omega}_{j,{t}_{1}}^{i}=\left(1-\alpha \right){\omega}_{j,{t}_{0}}^{i}.$$Here $\alpha$ and $\rho$ are the learning rates; these two parameters reflect the rate of model convergence. If pixel ${x}_{j,{t}_{1}}$ fails to match all the current Gaussian models, a new Gaussian model is introduced to evict the Gaussian model with the smallest $\omega /\sigma$ value. The mean and variance values of the other Gaussian models remain unchanged, while the new model is set with ${\mu}_{j,{t}_{1}}={x}_{j,{t}_{1}}$, ${\sigma}_{j,{t}_{1}}=30$, ${\omega}_{j,{t}_{1}}=0.01$. Finally, the weights of the K Gaussian models are normalized so that ${\sum}_{i=1}^{K}{\omega}_{j,{t}_{1}}^{i}=1$.
- (c)
- In the third situation, if the condition ${d}_{j}-{d}_{j,{t}_{1}}>{t}_{d}$ is satisfied, the new input pixel ${x}_{j,{t}_{1}}$ belongs to the background and the previous Gaussian distributions need to be abandoned. The initialization step is then executed for ${x}_{j,{t}_{1}}$.

**Convergence.** The remaining frames are processed by repeating the update step. The value of each background pixel is derived from $\mu$, so that the most stable pixels in the time domain are modeled as the background image; meanwhile, the number of Gaussian models of each pixel indicates whether the pixel experiences similar intensities over time.
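
The update cases (a)–(c) above can be collected into a small per-pixel model. The sketch below is a minimal transcription of the equations in this section; the values of `t_d`, `alpha`, `rho`, and `K`, and the class name itself, are illustrative assumptions (the text only quotes the new-model parameters $\sigma = 30$, $\omega = 0.01$).

```python
import math

class PixelGMM:
    """Depth-guided Gaussian mixture model for one pixel position j."""
    def __init__(self, x0, d0, K=3, sigma0=30.0):
        self.K, self.t_d = K, 10            # t_d: assumed depth threshold
        self.w  = [1.0] + [0.0] * (K - 1)   # weights (first model gets all)
        self.mu = [float(x0)] + [0.0] * (K - 1)
        self.sg = [sigma0] * K              # standard deviations
        self.d  = float(d0)                 # depth buffer

    def update(self, x, d, alpha=0.05, rho=0.05):
        if d - self.d > self.t_d:           # (a) foreground pixel: discard
            return
        if self.d - d > self.t_d:           # (c) revealed background: re-init
            self.__init__(x, d, self.K)
            return
        # (b) depth is consistent: try to match one of the K Gaussians
        match = next((i for i in range(self.K)
                      if abs(x - self.mu[i]) <= 2.5 * self.sg[i]), None)
        if match is None:
            # evict the model with the smallest w/sigma ratio
            j = min(range(self.K), key=lambda i: self.w[i] / max(self.sg[i], 1e-9))
            self.mu[j], self.sg[j], self.w[j] = float(x), 30.0, 0.01
        else:
            for i in range(self.K):
                if i == match:              # matched model: Equations above
                    self.w[i]  = (1 - alpha) * self.w[i] + alpha
                    self.mu[i] = (1 - rho) * self.mu[i] + rho * x
                    var = (1 - rho) * self.sg[i] ** 2 \
                          + rho * (x - self.mu[i]) ** 2
                    self.sg[i] = math.sqrt(var)
                else:                       # unmatched models decay
                    self.w[i] *= (1 - alpha)
            self.d = (self.d + d) / 2       # average the depth buffer
        s = sum(self.w)                     # normalize so the weights sum to 1
        self.w = [wi / s for wi in self.w]

    def background(self):
        """Background intensity: mean of the dominant Gaussian."""
        return self.mu[max(range(self.K), key=lambda i: self.w[i])]
```

Running `update` over a sequence of frames lets the dominant Gaussian settle on the temporally stable (background) intensity while transient foreground samples are rejected by the depth test.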

#### 3.3. Reliability-Based Weighted Blending

The adaptive background modeling provides a background image (${I}_{B}$). Previous research shows that GMM has an inherent capacity to capture background and foreground pixel intensities; missing pixel intensities of an occluded area are successfully recovered by exploiting temporal correlation.

After warping the two reference views to the virtual viewpoint, two intermediate texture images ${I}_{L}$, ${I}_{R}$ and depth images ${D}_{L}$, ${D}_{R}$ are obtained. The reliability-based weighted blending process to produce a virtual image ${I}_{V}$ is as follows:

- (1)
- If a pixel is filled in both ${I}_{L}$ and ${I}_{R}$, the two depth values are compared first. If the depth value of one pixel is much bigger than the other, that pixel is obviously nearer to the capturing device, and ${I}_{V}$ is filled by the pixel with the bigger associated depth value. If the two depth values are very close, weighting factors are utilized and ${I}_{V}$ is formulated as follows:$${I}_{V}={W}_{L}\cdot {I}_{L}+{W}_{R}\cdot {I}_{R},$$$${W}_{i}^{\prime}=W{D}_{i}\times W{R}_{i},\ i=L,R,$$where a reliability factor (${r}_{H}$, ${r}_{M}$, or ${r}_{L}$) is assigned to $WR$ when a pixel in the reference intermediate image is mapped by a reliable, refined, or unreliable depth value, respectively. It should be noted that ${W}_{L}$ and ${W}_{R}$ are obtained by normalizing ${W}_{i}^{\prime}$ so that ${W}_{L}+{W}_{R}=1$.
- (2)
- If only one pixel is filled in the two reference views, for example only ${I}_{L}$, the reliability of ${I}_{L}$ is taken into consideration. If ${I}_{L}$ is mapped by a reliable depth value, ${I}_{V}$ can simply be filled with ${I}_{L}$ (${I}_{V}={I}_{L}$). Otherwise, background information is used to generate ${I}_{V}$: if ${D}_{L}$ is close to the background depth value, then ${I}_{V}=\left({I}_{L}+{I}_{B}\right)/2$; if ${D}_{L}$ is much bigger than ${D}_{B}$, then ${I}_{V}={I}_{L}$.
- (3)
- If the pixels in both reference views are not filled, the constructed background image is used to deal with the hole-filling challenge. First, we check the surrounding depth values of ${I}_{V}$ and use the filled depth values to determine a proper depth range. Then ${I}_{V}$ is filled by the background pixel if its depth value is in the obtained range. Otherwise, inverse warping and classical inpainting are applied to fill ${I}_{V}$.
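
The per-pixel decision logic of case (1) can be sketched as below. The reliability factors for ${r}_{H}$/${r}_{M}$/${r}_{L}$ and the "much bigger" depth margin are assumed values (the excerpt does not quote them), and the background/inpainting fallbacks of cases (2) and (3) are only stubbed.

```python
# Assumed reliability factors r_H, r_M, r_L for reliable/refined/unreliable.
R = {'reliable': 1.0, 'refined': 0.7, 'unreliable': 0.3}

def blend_pixel(iL, dL, relL, iR, dR, relR, wdL, wdR, depth_margin=10):
    """iL/iR: texture values (None if hole); dL/dR: associated depths;
    relL/relR: reliability labels; wdL/wdR: distance-based view weights."""
    if iL is not None and iR is not None:        # case (1): both views filled
        if abs(dL - dR) > depth_margin:          # one pixel is clearly nearer
            return iL if dL > dR else iR
        wL = wdL * R[relL]                       # W'_i = WD_i * WR_i
        wR = wdR * R[relR]
        wL, wR = wL / (wL + wR), wR / (wL + wR)  # normalize: W_L + W_R = 1
        return wL * iL + wR * iR
    if iL is not None:                           # case (2): single view filled
        return iL          # background-aware handling of the text omitted here
    if iR is not None:
        return iR
    return None            # case (3): fall back to background / inpainting
```

A symmetric depth margin and flat reliability table keep the sketch short; in the paper these choices are driven by the cross-checking mask of Section 3.1.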

#### 3.4. Depth Map Processing Method

- (1)
- A conventional median filter is applied to the coarse depth map ${d}_{in}\left(x,y\right)$ to obtain an improved depth map ${d}^{\prime}\left(x,y\right)$. It removes the existing noise while preserving sharp boundary information.
- (2)
- The texture image ${I}_{in}\left(x,y\right)$ is refined according to the improvement of its associated depth map. If the condition $\left|{d}^{\prime}\left(x,y\right)-{d}_{in}\left(x,y\right)\right|>\epsilon$ is satisfied ($\epsilon$ is a threshold for the depth difference), the depth value of the pixel is unreliable, and it is renewed by the median filter. An inverse mapping process using the updated depth value is then employed to find an appropriate texture pixel: candidate depth values ${d}^{\prime \prime}\in \left[{d}^{\prime}-\epsilon ,{d}^{\prime}+\epsilon \right]$ are used to find the corresponding pixel in the two reference views. By Equations (11) and (12), we obtain the corresponding reference pixel location $({u}_{r},{v}_{r})$ from pixel $(x,y)$ and the associated depth values ${z}_{v}$ and ${z}_{r}$; $\mathit{A}$ and $\mathit{b}$ denote the rotation matrix and translation vector, respectively. Several measurements ensure that a highly reliable pixel is obtained by backward warping. First, the depth value of the obtained pixel should be close to the updated depth value ${d}^{\prime}\left(x,y\right)$. Second, the disparity between $(x,y)$ and $({u}_{r},{v}_{r})$ should not be too large, according to the alignment of the reference viewpoint and virtual viewpoint:$${z}_{v}\left[\begin{array}{c}x\\ y\\ 1\end{array}\right]={z}_{r}\mathit{A}\left[\begin{array}{c}{u}_{r}\\ {v}_{r}\\ 1\end{array}\right]+\mathit{b},$$$$z=\frac{255{z}_{near}{z}_{far}}{{d}^{\prime \prime}\left({z}_{far}-{z}_{near}\right)+255{z}_{near}}.$$
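
Equations (11) and (12) translate directly into code. The sketch below assumes $\mathit{A}$ is invertible and solves Equation (11) for the reference coordinates; Equation (12) maps an 8-bit depth level to metric depth, so level 255 corresponds to ${z}_{near}$ and level 0 to ${z}_{far}$. Function names are illustrative.

```python
import numpy as np

def backward_warp(x, y, z_v, A, b):
    """Solve Equation (11), z_v [x, y, 1]^T = z_r A [u_r, v_r, 1]^T + b,
    for the reference pixel (u_r, v_r) and its depth z_r."""
    p = np.linalg.inv(A) @ (z_v * np.array([x, y, 1.0]) - b)
    z_r = p[2]                      # third component recovers z_r
    return p[0] / z_r, p[1] / z_r, z_r

def depth_level_to_z(d, z_near, z_far):
    """Equation (12): convert an 8-bit depth level d'' to metric depth z."""
    return 255.0 * z_near * z_far / (d * (z_far - z_near) + 255.0 * z_near)
```

Scanning `d` over the candidate range $[{d}^{\prime}-\epsilon ,{d}^{\prime}+\epsilon ]$ and warping back with these two functions yields the candidate texture pixels described above.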

## 4. Experimental Results

## 5. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

- Tech, G.; Chen, Y.; Müller, K.; Ohm, J.-R.; Vetro, A.; Wang, Y.-K. Overview of the multi-view and 3D extensions of high efficiency video coding. IEEE Trans. Circuits Syst. Video Technol. **2016**, 26, 35–49.
- Tanimoto, M.; Tehrani, M.P.; Fujii, T.; Yendo, T. Free-viewpoint TV. IEEE Signal Process. Mag. **2011**, 28, 67–76.
- Fehn, C. Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV. In Proceedings of the Society of Photo-Optical Instrumentation Engineers (SPIE), San Jose, CA, USA, 19–22 January 2004.
- Farid, M.S.; Lucenteforte, M.; Grangetto, M. Depth image based rendering with inverse mapping. In Proceedings of the IEEE 15th International Workshop on Multimedia Signal Processing, Pula, Italy, 30 September–2 October 2013.
- Rahaman, D.M.M.; Paul, M. Virtual view synthesis for free viewpoint video and multiview video compression using Gaussian mixture modeling. IEEE Trans. Image Process. **2018**, 27, 1190–1201.
- Zhang, L.; Tam, W.J.; Wang, D.M. Stereoscopic image generation based on depth images. In Proceedings of the International Conference on Image Processing (ICIP), Singapore, 24–27 October 2004; pp. 2993–2996.
- Zhang, L.; Tam, W.J. Stereoscopic image generation based on depth images for 3D TV. IEEE Trans. Broadcast. **2005**, 51, 191–199.
- Cheng, C.M.; Lin, S.J.; Lai, S.H.; Yang, J.C. Improved novel view synthesis from depth image with large baseline. In Proceedings of the 19th International Conference on Pattern Recognition (ICPR), Tampa, FL, USA, 8–11 December 2008; pp. 1–4.
- Lee, P.J. Non-geometric distortion smoothing approach for depth map preprocessing. IEEE Trans. Multimed. **2011**, 13, 246–254.
- Criminisi, A.; Perez, P.; Toyama, K. Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process. **2004**, 13, 1200–1212.
- Ahn, I.; Kim, C. A novel depth-based virtual view synthesis method for free viewpoint video. IEEE Trans. Broadcast. **2013**, 59, 614–626.
- Zhao, Y.; Zhu, C.; Chen, Z. Boundary artifact reduction in view synthesis of 3D video: From perspective of texture-depth alignment. IEEE Trans. Broadcast. **2011**, 57, 510–522.
- Bertalmio, M.; Bertozzi, A.L.; Sapiro, G. Navier–Stokes, fluid dynamics, and image and video inpainting. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Kauai, HI, USA, 8–14 December 2001; pp. 355–362.
- Bertalmio, M. Strong-continuation, contrast-invariant inpainting with a third-order optimal PDE. IEEE Trans. Image Process. **2006**, 15, 1934–1938.
- Rahaman, D.M.M.; Paul, M. Hole-filling for single-view plus depth based rendering with temporal texture synthesis. In Proceedings of the IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Seattle, WA, USA, 11–15 July 2016; pp. 1–6.
- Rahaman, D.M.M.; Paul, M. Free view-point video synthesis using Gaussian mixture modeling. In Proceedings of the IEEE Conference on Image and Vision Computing, Auckland, New Zealand, 23–24 November 2015; pp. 1–6.
- Li, S.; Zhu, C.; Sun, M.T. Hole filling with multiple reference views in DIBR view synthesis. IEEE Trans. Multimed. **2018**.
- Schmeing, M.; Jiang, X. Depth image based rendering: A faithful approach for the disocclusion problem. In Proceedings of the 3DTV-Conference: The True Vision—Capture, Transmission and Display of 3D Video, Tampere, Finland, 7–9 June 2010; pp. 1–4.
- Chen, K.Y.; Tsung, P.K.; Lin, P.C.; Yang, H.J.; Chen, L.G. Hybrid motion/depth-oriented inpainting for virtual view synthesis in multiview applications. In Proceedings of the 3DTV-Conference: The True Vision—Capture, Transmission and Display of 3D Video, Tampere, Finland, 7–9 June 2010; pp. 1–4.
- Köppel, M.; Ndjiki-Nya, P.; Doshkov, D.; Lakshman, H.; Merkle, P.; Müller, K.; Wiegand, T. Temporally consistent handling of disocclusions with texture synthesis for depth-image-based rendering. In Proceedings of the IEEE International Conference on Image Processing, Hong Kong, China, 26–29 September 2010; pp. 1809–1812.
- Ndjiki-Nya, P.; Köppel, M.; Doshkov, D.; Lakshman, H.; Merkle, P.; Müller, K.; Wiegand, T. Depth image-based rendering with advanced texture synthesis for 3-D video. IEEE Trans. Multimed. **2011**, 13, 453–465.
- Bosc, E.; Köppel, M.; Pépion, R.; Pressigout, M.; Morin, L.; Ndjiki-Nya, P.; Le Callet, P. Can 3D synthesized views be reliably assessed through usual subjective and objective evaluation protocols? In Proceedings of the 18th IEEE International Conference on Image Processing, Brussels, Belgium, 11–14 September 2011; pp. 2597–2600.
- Yao, C.; Tillo, T.; Zhao, Y.; Xiao, J.; Bai, H.; Lin, C. Depth map driven hole filling algorithm exploiting temporal correlation information. IEEE Trans. Broadcast. **2014**, 60, 394–404.
- Luo, G.; Zhu, Y.; Li, Z.; Zhang, L. A hole filling approach based on background reconstruction for view synthesis in 3D video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1781–1789.
- Stauffer, C.; Grimson, W.E.L. Adaptive background mixture models for real-time tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Fort Collins, CO, USA, 23–25 June 1999; pp. 246–252.
- Deng, Z.M.; Wang, M.J. Hybrid temporal correlation based on Gaussian mixture model framework for view synthesis. In Proceedings of the 18th International Conference on Computers and Communication Networks, Boston, MA, USA, 24–25 April 2017; pp. 1936–1944.
- Tanimoto, M.; Fujii, T.; Suzuki, K. Reference Software of Depth Estimation and View Synthesis for FTV/3DV. ISO/IEC JTC1/SC29/WG11, M15836, 2008. Available online: http://wg11.sc29.org/svn/repos/MPEG-4/test/trunk/3D/view_synthesis/VSRS (accessed on 19 May 2018).

**Figure 1.** Framework of the proposed view synthesis: (**a**) illustration of depth refinement; (**b**) the framework of the proposed approaches using refined depth information.

**Figure 3.** Adaptive background modeling results: (**a**) Ballet background image; (**b**) Breakdancers background image.

**Figure 4.** Examples of the depth map processing method: (**a**,**b**) enlarged integrated texture image and its associated depth map before the depth map processing method (DMPM); (**c**,**d**) the image and its associated depth map after DMPM.

**Figure 5.** Subjective comparisons of four disocclusion-filling methods for the sequence Ballet (frame 1): (**a**) disocclusions after simple depth image-based rendering (DIBR) [3]; (**b**) VSRS 3.5 [27]; (**c**) modified Gaussian mixture model (GMM) method [5]; (**d**) previous method [26]; (**e**) proposed synthesis method.

**Table 1.** Average peak signal-to-noise ratio (PSNR) comparison of the proposed technique and three state-of-the-art techniques.

| Sequence | Camera Set | Baseline (cm) | VSRS [27] | GMM-Based [5] | Previous [26] | Proposed |
|---|---|---|---|---|---|---|
| Ballet | 3, 5 → 4 | 20, 20 | 29.39 | 32.29 | 28.97 | 33.56 |
| Ballet | 1, 7 → 3 | 40, 80 | 22.64 | 31.01 | 25.23 | 32.43 |
| Ballet | 1, 7 → 4 | 60, 60 | 22.03 | 31.13 | 25.78 | 32.54 |
| Ballet | 1, 7 → 5 | 80, 40 | 22.12 | 30.98 | 25.12 | 32.32 |
| Breakdancers | 1, 3 → 2 | 20, 20 | 30.88 | 34.42 | 29.46 | 35.37 |
| Breakdancers | 2, 6 → 3 | 20, 60 | 23.89 | 32.31 | 27.51 | 33.65 |
| Breakdancers | 2, 6 → 4 | 40, 40 | 23.76 | 32.66 | 27.87 | 34.41 |
| Breakdancers | 2, 6 → 5 | 60, 20 | 23.64 | 32.53 | 27.76 | 34.66 |

All PSNR values are in dB.

**Table 2.**Average structural similarity index (SSIM) comparison of the proposed technique and three state-of-the-art techniques.

| Sequence | Camera Set | Baseline (cm) | VSRS [27] | GMM-Based [5] | Previous [26] | Proposed |
|---|---|---|---|---|---|---|
| Ballet | 3, 5 → 4 | 20, 20 | 0.8229 | 0.8839 | 0.8114 | 0.8937 |
| Ballet | 1, 7 → 3 | 40, 80 | 0.7976 | 0.8839 | 0.8645 | 0.8941 |
| Ballet | 1, 7 → 4 | 60, 60 | 0.7997 | 0.8843 | 0.8688 | 0.8946 |
| Ballet | 1, 7 → 5 | 80, 40 | 0.7913 | 0.8847 | 0.8698 | 0.8955 |
| Breakdancers | 1, 3 → 2 | 20, 20 | 0.8387 | 0.8687 | 0.8344 | 0.8813 |
| Breakdancers | 2, 6 → 3 | 20, 60 | 0.8143 | 0.8601 | 0.8714 | 0.8872 |
| Breakdancers | 2, 6 → 4 | 40, 40 | 0.8156 | 0.8587 | 0.8702 | 0.8818 |
| Breakdancers | 2, 6 → 5 | 60, 20 | 0.8132 | 0.8565 | 0.8707 | 0.8821 |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Deng, Z.; Wang, M.
Reliability-Based View Synthesis for Free Viewpoint Video. *Appl. Sci.* **2018**, *8*, 823.
https://doi.org/10.3390/app8050823
