1. Introduction
Three-dimensional (3D) technologies, e.g., augmented reality, virtual reality, mixed reality, and stereoscopy, have recently enjoyed remarkable growth due to their numerous applications in entertainment, gaming, electro-medical equipment, and other fields. 3D television (3DTV) and the more recent free-viewpoint television (FTV) [1] have enhanced users' television experience by providing immersion. 3DTV projects two views of the same scene from slightly different viewpoints to provide the sensation of depth. FTV, in addition to the immersive experience, enables viewers to enjoy the scene from different viewpoints by changing their position in front of the television. To provide full parallax, FTV needs dozens of views, ideally an infinite number. Capturing, coding, and transmitting such a large number of views is impractical due to various financial and technological constraints, such as limited available bandwidth. Therefore, novel 3D video (3DV) formats and representations have been explored to design compression-friendly and cost-efficient solutions. The multiview video plus depth (MVD) format is considered the most suitable for 3D television. In addition to color images, MVD also provides the corresponding depth maps, which represent the geometry of the 3D scene.
The additional depth dimension in MVD makes it possible to generate novel views from a set of available views using the depth-image-based rendering (DIBR) technique [2], thus enabling stereoscopy. The quality of the synthesized views is important for a pleasant user experience. Since depth maps are usually generated using stereo-matching algorithms [3], they are often inaccurate. When used in DIBR, these inaccuracies may introduce various distortions in the synthesized images, degrading their quality and resulting in a poor quality of experience (QoE). Assessing the quality of DIBR-synthesized views is therefore necessary to ensure a satisfactory user experience.
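The warping step at the core of DIBR can be illustrated with a deliberately simplified sketch: each pixel of the reference view is shifted horizontally by a disparity derived from its depth, a z-buffer resolves pixels competing for the same target position, and target positions that receive no pixel remain as holes. The `focal` and `baseline` values and the purely horizontal shift are illustrative assumptions, not a full camera model.

```python
import numpy as np

def dibr_forward_warp(color, depth, baseline=0.02, focal=100.0):
    """Toy 1D forward warp of a reference view to a virtual view.

    color: (h, w, 3) float array; depth: (h, w) positive depths.
    Disparity is approximated as focal * baseline / depth (pixels).
    Returns the warped image and a boolean mask of disocclusion holes.
    """
    h, w = depth.shape
    warped = np.zeros_like(color)
    zbuf = np.full((h, w), np.inf)  # keep the nearest (smallest-depth) pixel
    for y in range(h):
        for x in range(w):
            disp = int(round(focal * baseline / depth[y, x]))
            xt = x - disp  # horizontal-shift-only camera geometry
            if 0 <= xt < w and depth[y, x] < zbuf[y, xt]:
                zbuf[y, xt] = depth[y, x]
                warped[y, xt] = color[y, x]
    holes = np.isinf(zbuf)  # pixels no source mapped to; to be inpainted
    return warped, holes
```

Constant depth produces a uniform shift and a band of holes at one image border; real depth maps instead produce disocclusion holes along depth discontinuities, which an inpainting step must then fill.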
Inaccuracies in depth maps cause textural and structural distortions such as ghost artifacts and inconsistent object shifts in the synthesized views [
4,
5,
6,
7,
8]. Texture and depth compression also introduce artifacts in the virtual images [
9,
10]. Another factor that degrades virtual image quality is regions occluded in the original view that become visible in the virtual view, called holes. These holes are usually filled using image inpainting techniques, which do not always produce a visually pleasing reconstruction.
Figure 1a shows the artifacts introduced in a synthesized view by visible occluded regions. Note the distorted face of a spectator in
Figure 1b, caused by erroneous depth values used in DIBR.
The various structural and textural distortions introduced in DIBR images may affect picture quality, depth sensation, and visual comfort, which are considered the three main factors of user QoE [
6]. Besides viewing experience, studies show that the distortion in 3D images can affect the performance of various applications designed for the 3D environment, such as image saliency detection, video target tracking, face detection, and event detection [
11,
12,
13]. This means that image quality is important not only for viewer satisfaction in a stereoscopic environment but also for the various 3D applications built on it. Therefore, 3D image quality assessment (3D-IQA) is an essential part of the 3D video processing chain.
In this paper, we propose a 3D-IQA metric to estimate the quality of DIBR-synthesized images. The proposed metric measures the structural and textural distortions introduced in the synthesized image by depth-image-based rendering and combines them to predict the overall quality of the image. Structural details are considered important for image quality because the human visual system (HVS) is more sensitive to them [14,15]. Differences in luminance and color are what make an object or the main features of an image distinguishable; distortion of these features, referred to as textural distortion, is therefore also important for a true image quality estimate. The textural and structural metric scores are combined to obtain an overall quality score.
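The precise pooling used by the proposed metric is given in Section 3. Purely as a generic illustration of combining two per-image scores, a convex combination is a common choice; the weight `alpha` here is a hypothetical tuning parameter, not a value from this work.

```python
def combine_scores(s_structural, s_textural, alpha=0.5):
    """Illustrative pooling of a structural and a textural quality score.

    alpha in [0, 1] trades off the two terms; alpha itself would
    normally be fitted on subjectively rated training data.
    """
    return alpha * s_structural + (1.0 - alpha) * s_textural
```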
The rest of the paper is organized as follows.
Section 2 reviews the related literature, and
Section 3 presents the proposed 3D-IQA technique. The experimental evaluation of the proposed metric is carried out in
Section 4, and
Section 5 concludes the paper.
2. Related Work
The quality of an image can be assessed either through subjective tests or by using an automated objective metric [16]. As human eyes are the ultimate receiver of the image, a subjective test is certainly the most reliable way to assess visual image quality. In such tests, a set of human observers assigns quality scores to the image, which are averaged to obtain a single score. This approach, however, is time-consuming and expensive. Automatic and fast quality assessment methods were therefore needed, which motivated researchers to develop objective metrics for quantitative image quality evaluation, a significant advance in the field of image quality assessment.
Objective image and video quality metrics can be grouped into three classes based on the availability of the original reference images: full-reference (FR), no-reference (NR), and reduced-reference (RR) [
17]. An IQA metric that requires the original reference image to evaluate the quality of its distorted version is referred to as a full-reference metric. An IQA approach that assesses the quality of an image in the absence of a corresponding reference image is classified as a no-reference metric. Reduced-reference metrics lie between the two categories: they do not require the reference images themselves, but some of their features must be available for comparison.
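To make the full-reference category concrete, the classical PSNR metric is a minimal example: it cannot be computed without the pristine reference image, pixel for pixel.

```python
import numpy as np

def psnr(ref, dist, peak=255.0):
    """Full-reference PSNR in dB: compares a distorted image against
    its pristine reference; `peak` is the maximum pixel value (255 for
    8-bit images)."""
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float('inf')  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```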
In the literature, several 2D and 3D objective quality assessment metrics have been proposed to assess visual image quality. Initially, 2D metrics were used to assess the quality of 3D content; however, conventional 2D metrics were found inappropriate for assessing the true quality of 3D images because several additional factors of 3D videos are not considered by 2D-IQA algorithms [
18,
19,
20]. Therefore, novel IQA algorithms were needed to evaluate the quality of 3D videos. Such algorithms, in addition to 2D-related artifacts, must also consider artifacts introduced due to the additional dimension of depth in the videos.
In recent years, several algorithms have been proposed to evaluate the quality of 3D images. Many of them utilize the existing 2D quality metrics for this purpose, e.g., [
21,
22,
23,
24]. Since these algorithms rely on metrics designed specifically for 2D images, they do not consider the most important factor of 3D images, i.e., depth, and are therefore neither accurate nor reliable.
Many 3D-IQA techniques consider depth/disparity information while assessing the quality of 3D images, e.g., [
25,
26,
27,
28]. You et al. [
19] adopted a belief-propagation-based method to estimate the disparity and combined the quality maps of the distorted image and the distorted disparity, both computed using conventional 2D metrics. The method proposed in [
25] exploits the disparity as well as binocular rivalry to determine the quality. It uses the Multi-scale Structural Similarity Index Measure (MSSIM) [
29] metric to evaluate the quality of the disparity of stereo images. Zhan et al. [26] presented a machine-learning-based method that learns features from 2D-IQA metrics together with specially designed 3D features; the Scale Invariant Feature Transform (SIFT) flow algorithm [30] was used to obtain the depth information. Features of disparity and of three types of distortions (blur, noise, and compression) were used in [28] to evaluate the quality of 3D images; these features were used to train a quality prediction model with the random forest regression algorithm. The method proposed in [
18] addressed the issue of structural distortion in a synthesized view due to DIBR, but the method is limited to structural distortions so it cannot be used to evaluate the overall quality of the image.
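Several of the surveyed methods reuse SSIM-family metrics such as MS-SSIM [29] as a building block. At a single scale, with statistics pooled over one global window instead of the usual sliding local window, the SSIM index reduces to the sketch below; the constants follow the common choice for 8-bit images, and the single-window simplification is ours.

```python
import numpy as np

def ssim_global(x, y, c1=6.5025, c2=58.5225):
    """Single-window SSIM over whole images (luminance, contrast, and
    structure terms in combined form). c1 = (0.01*255)^2 and
    c2 = (0.03*255)^2 are the usual stabilizing constants."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

The full metrics apply this per local window (and, for MS-SSIM, across several scales) before pooling, which is what makes them sensitive to localized DIBR artifacts.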
The 3D-IQA method presented in [
31] identifies the disocclusion edges in the synthesized image and inversely maps them to the original image, and the corresponding regions are then compared to assess the quality. The algorithm in [
32] uses matched feature points in the synthesized and reference images to compute the quality degradation. The Just Noticeable Difference (JND) model is exploited in [
33] to compute the global sharpness and distortion in holes in the DIBR image to assess its quality. The quality metric proposed in [
34] identifies the critical blocks in the DIBR synthesized image and the reference image. The texture and color contrast similarities between these blocks are compared to estimate the quality of the synthesized image. The method in [
35] works by extracting the features of energy-weighted spatial and temporal information and entropy. Then, support vector regression uses these features for depth estimation. Gorley et al. proposed a stereo-band-limited contrast method in [
36] that considers contrast sensitivity and luminance changes as important factors for the assessment of image quality. The method presented in [
37] extracts natural scene features in the discrete cosine transform (DCT) domain, and a deep belief network (DBN) model is trained to obtain deep features. These deep features and DMOS values are then used to train a support vector regression (SVR) model to predict image quality. The learning framework proposed in [
38] also uses a regression model to learn features and, in addition to assessing quality, improves the quality of stereo images. The method proposed in [
39] considers the global visual characteristics using structural similarity, while the local quality is evaluated by computing the local magnitude and local phase. The global and local quality scores are combined to obtain the final score.
Binocular perception or binocular rivalry is an important factor in 3D image quality assessment [
40,
41]. Humans perceive a scene with both eyes, which see it from slightly different positions, so the left-eye and right-eye views of an image differ. This difference is called binocular parallax or binocular disparity, and when the two views conflict it gives rise to the visual phenomenon of binocular rivalry. Binocular disparity can be divided into horizontal and vertical parallax: horizontal parallax affects depth perception, and vertical parallax affects visual comfort [
37]. This binocular perception was taken into account in [
42] and a binocular fusion process was proposed for quality assessment of stereoscopic images. The 3D-IQA metric proposed in [
41] is also based on binocular visual characteristics. A learning-based metric [
43] uses binocular receptive field properties for assessing the quality of stereo images. Shao et al. [
44] proposed a metric that simplifies the process of binocular quality prediction by dividing the problem into monocular feature encoding and binocular feature combination.
Lin et al. combine binocular integration behaviors such as binocular combination and binocular frequency integration with conventional 2D metrics in [
45] to evaluate the quality of stereo images. Binocular spatial sensitivity influenced by binocular fusion and binocular rivalry properties was taken into consideration in [
46]. The method proposed in [
47] uses binocular responses, e.g., binocular energy response (BER), binocular rivalry response (BRR), and local structure distribution, for 3D-IQA. Quality assessment of asymmetrically distorted stereoscopic images was targeted in [
48]. The method is inspired by binocular rivalry and it uses estimated disparity and Gabor filter responses to create an intermediate synthesized view whose quality is estimated using 2D-IQA algorithms. A multi-scale model using binocular rivalry is presented in [
49] for quality assessment of 3D images. Numerous other 3D-IQA algorithms use binocular cues for evaluating the quality of 3D images, e.g., [
50,
51,
52].