Stereoscopic Image Super-Resolution Method with View Incorporation and Convolutional Neural Networks

Super-resolution (SR) plays an important role in the processing and display of mixed-resolution (MR) stereoscopic images. Therefore, a stereoscopic image SR method based on view incorporation and convolutional neural networks (CNN) is proposed. For a given MR stereoscopic image, in which the left view is observed at full resolution while the right view is observed at low resolution, the SR method is implemented in two stages. In the first stage, a view difference image is defined to represent the correlation between views. It is estimated by using the full-resolution left view and the interpolated right view as input to the modified CNN. Accordingly, a high-precision view difference image is obtained. In the second stage, to incorporate the right view estimated in the first stage, a global reconstruction constraint is presented to make the estimated right view consistent with the low-resolution right view in terms of the MR stereoscopic image observation model. Experimental results demonstrated that, compared with the SR convolutional neural network (SRCNN) method and the depth map based SR method, the proposed method improved the reconstructed right view quality by 0.54 dB and 1.14 dB, respectively, in Peak Signal to Noise Ratio (PSNR), and subjective evaluation also indicated that the proposed method produced better reconstructed stereoscopic images.


Introduction
With advancements in imaging, processing, and display technologies in recent years, stereoscopic video entertainment and communication have emerged as promising services offering novel visual user experiences such as three-dimensional (3D) television [1], free-viewpoint video [2], and video conferencing [3]. Compared with monocular images, stereoscopic images provide depth perception and engender an immersive user experience [4]. Meanwhile, the immense amount of data generated by stereoscopic imaging requires large storage and transmission capabilities and thus must be efficiently encoded and processed. On the basis of binocular suppression theory [5], the perceived quality of stereo vision in the human visual system (HVS) is dominated by the higher-quality view. Thus, mixed-resolution (MR) stereoscopic image processing techniques are motivated by binocular perception theory. Specifically, one view of the MR stereoscopic image is provided at full resolution (FR), whereas the other view is degraded according to the MR stereoscopic image observation model. To decrease the amount of data while preserving the high-definition and stereo vision experience, the low-resolution (LR) view must be super-resolved to high resolution (HR) at the decoder and display side. In recent years, MR stereoscopic imaging and processing techniques have proven to be effective approaches for stereoscopic imaging and compression [6].
Existing super-resolution (SR) methods are used to reconstruct the FR image from its LR version. These methods are mainly divided into three types: interpolation [7,8], reconstruction [9,10], and learning [11-15]. Among them, learning-based methods have become widely used owing to their outstanding performance. Their basic idea is to establish a mapping relation between LR and HR image patches and then to find the optimal solution from the LR image. Thus, to exploit the common prior knowledge shared among image patches, most renowned methods adopt the learning-based strategy [16]. Chang et al. [11], for example, adopted the concept of local linear embedding to propose an SR reconstruction method based on neighborhood embedding. Furthermore, this neighborhood embedding was generalized to a more sophisticated sparse coding formulation in Yang et al.'s work [12,13]. They determined that a linear combination of atoms from an over-complete dictionary can represent natural image patches well. Therefore, after a training process on HR and LR image patches, HR and LR dictionaries are jointly obtained. Then, according to the observed LR patch and the LR dictionary, the sparse coefficients are computed and applied to the HR dictionary to produce the final HR patches.
To improve SR speed while maintaining SR accuracy, Timofte et al. [14] used several smaller complete dictionaries to replace a single large over-complete dictionary, thereby greatly reducing the computational cost. In recent years, with the development of deep learning, an increasing number of researchers have employed deep learning for image processing tasks such as image classification [17], object detection [18], and image denoising [19]. Additionally, some researchers have begun establishing deep models for SR reconstruction. Cui et al. [20] adopted stacked auto-encoders, combined with an internal example-based approach, to gradually upsample LR images layer by layer. Moreover, Dong et al. [15] combined dictionary learning and neural networks to establish the SR convolutional neural network (SRCNN) model. This model showed better performance than traditional methods such as dictionary learning and sparse coding. Furthermore, Liu et al. [21] emphasized the importance of traditional sparse representation and integrated it into deep learning to further improve the SR results. Although the LR view of an MR stereoscopic image can be directly upsampled by these single-view SR methods, these methods do not take advantage of the correspondence between views for stereoscopic image SR.
The observed FR image in the neighboring view of the stereoscopic image provides richly detailed information about the scene. Thus, the correlation between views has been utilized to strengthen particular LR views. Garcia et al. [22] proposed an SR method that exploits depth information for MR multi-view video. On the basis of the available depth maps, their approach enhanced the observed LR image by extracting the high-frequency content from the neighboring FR view. However, the acquisition of the depth maps was not discussed in their work, as these are not easy to estimate accurately. In addition, Brust et al. [23] employed an estimated depth map, calculated in advance from the original FR stereoscopic pairs, to render the LR view from the other HR view. Again, the original FR stereoscopic pairs cannot be obtained at the decoder side. Therefore, none of the above methods is fully consistent with MR stereoscopic imaging and processing techniques.
Unlike existing SR methods, which require depth maps or depth estimation, we exploit the correlation between the views of the MR stereoscopic image without estimating the depth map. We propose a stereoscopic image SR method based on view incorporation and a convolutional neural network (CNN). The proposed stereoscopic image SR method is implemented in two stages. In the first stage, to establish links between views, a view difference image is defined, and a modified CNN is created to estimate a high-precision view difference image. Then, the estimated right view image is obtained by subtracting the estimated view difference image from the observed FR left view image. In the second stage, we consider that the estimated right view image should remain consistent with the LR right view image with regard to the MR stereoscopic image observation model. Accordingly, we model the global reconstruction constraint for incorporating the right view by projecting the estimated right view image obtained in the first stage onto the solution space of the image observation model. The solution can be computed by iterative back projection [24]. The SR results demonstrate that the proposed SR method retains more details and effectively reduces ringing.
In short, the contributions of this paper are outlined as follows:

• We combine the correlation between the views of the MR stereoscopic image without estimating the depth map;

• We use the view difference image, which contains both the image texture information and the depth information of the stereoscopic pair, as the input to the modified CNN, whose pooling layers and fully connected layers are removed to suit the SR task;

• We combine the high-precision view difference image estimated by the modified CNN with the global reconstruction constraint to further improve the performance of MR stereoscopic image SR.
The remainder of this paper is organized as follows. Section 2 describes the MR stereoscopic image observation model. Then, the proposed stereoscopic image SR method is illustrated in Section 3. Experiments are presented and discussed in Section 4. Section 5 presents the conclusions.

MR Stereoscopic Image Observation Model
As shown in Figure 1, the observed FR left view and downsampled right view constitute the MR stereoscopic video sequence. Accordingly, the MR stereoscopic video coding model enables a large amount of data to be compressed for storage and transmission, since directly encoding FR stereoscopic videos would double the required storage and bandwidth. Indeed, owing to restricted storage space and network bandwidth, this MR stereoscopic video coding model is crucial for providing clear bitrate reduction [6]. Furthermore, to ensure stereo vision comfort, SR of the MR stereoscopic video is needed at the decoder. Therefore, to contribute to the MR stereoscopic video coding model, we adopt an MR stereoscopic image observation model for stereoscopic image SR.
As shown in Figure 2, a stereoscopic imaging system obtains the original FR stereoscopic image pair (with the size of N_1 × N_2 for each view). In general, we assume that there exist the observed FR left view and the observed LR right view (with the size of M_1 × M_2), which is degraded by the blurring and downsampling operations. The degradation model is expressed by

Y = DBX

where Y and X denote the observed LR right view image and the original FR right view image (that is, the unknown FR right view image), respectively. Moreover, Y and X are both in vector format, with the sizes of M_1 M_2 × 1 and N_1 N_2 × 1, respectively. D is the downsampling matrix with the size of M_1 M_2 × N_1 N_2, and B is the blurring matrix with the size of N_1 N_2 × N_1 N_2.
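The degradation model Y = DBX can be illustrated with a minimal NumPy sketch. Here the blur B is stood in by a simple box filter and D by plain decimation; the paper does not specify its blur kernel, so these are assumptions for illustration only.

```python
import numpy as np

def degrade(x_fr, scale=2):
    """Degrade a full-resolution (FR) right view into its low-resolution
    (LR) observation: blur (a box filter standing in for the blurring
    matrix B), then downsample (the matrix D). Illustrative sketch of
    Y = DBX, not the paper's exact kernel."""
    n1, n2 = x_fr.shape
    k = scale
    # Blurring step B: average each pixel over its k x k neighborhood.
    padded = np.pad(x_fr.astype(float), ((0, k - 1), (0, k - 1)), mode="edge")
    blurred = np.zeros((n1, n2))
    for i in range(n1):
        for j in range(n2):
            blurred[i, j] = padded[i:i + k, j:j + k].mean()
    # Downsampling step D: keep every scale-th row and column.
    return blurred[::scale, ::scale]

x = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 FR right view
y = degrade(x, scale=2)                        # 2x2 LR observation
print(y.shape)  # (2, 2)
```

With a 4 × 4 input and scale 2, the LR observation has size M_1 × M_2 = 2 × 2, matching the dimensions in the model.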

Proposed Stereoscopic Image SR Method with View Incorporation and CNN
In this paper, we focus on the estimation of the view difference image and the global reconstruction constraint to solve the SR task for MR stereoscopic images, as depicted in Figure 3. The proposed method includes two main stages. The first stage is the estimation of a view difference image, which exploits the correlation between the left and right views by using a modified CNN. Firstly, the observed LR right view image is interpolated to FR by using a bicubic interpolation filter. Secondly, the view difference image between the observed FR left view image and the interpolated right view image is defined and employed as the input to the modified CNN. Hence, the high-precision view difference image is provided as the output. Furthermore, the estimated right view image is obtained by combining the observed FR left view image with the high-precision view difference image. The second stage is the global reconstruction constraint process for incorporating the right view. By making the estimated right view image obtained in the first stage consistent with the LR right view according to the MR stereoscopic image observation model, we model the global reconstruction constraint by using the iterative back projection method. Finally, after the above two stages, the SR of the LR right view is obtained.


Estimation of View Difference Image with Modified CNN
The key aspect of the SR process for MR stereoscopic images is to use the correlation between views as fully as possible to enhance the resolution of the LR right view. As mentioned above, the view difference image is very important for representing the correlation between views because both the image texture information and the depth information of the stereoscopic pair [25] are included in the view difference image. Therefore, we establish a modified CNN, the pooling layers and fully connected layers of which are removed to suit the SR task, to construct the high-precision view difference image, as shown in Figure 4. In addition to the input layer, the modified CNN training framework consists of three layers, in which the hidden layers [26] are the first two convolution layers, and the output layer is the third convolution layer. Given an initial view difference sub-image obtained from an FR left view training sub-image and an interpolated right view training sub-image, the first convolution layer of the modified CNN extracts a number of feature maps. Then, the second convolution layer maps these feature maps to high-precision feature vectors. Finally, the third convolution layer produces the high-precision view difference sub-image according to these high-precision feature vectors.


Image Patch Extraction and Representation
The image patch extraction and representation operation extracts image patches from the view difference image and represents them as high-dimensional vectors over a number of bases. In our formulation, this is the same process as convolving the view difference image with a number of filters. The vectors obtained by convolution form a number of feature maps that can be further mapped to finer feature vectors.
Generally, the view difference image is defined by

I_d(x, y) = Z(x, y) − Y_B(x + d_x(x, y), y + d_y(x, y))

where Z and Y_B denote the observed FR left view image and the interpolated right view image (which is at FR), respectively; that is, Z = {Z(x, y)} and Y_B = {Y_B(x, y)}. In addition, d_x(x, y) and d_y(x, y) are the horizontal and vertical components of the disparity at position (x, y), and I_d = {I_d(x, y)} is the view difference image defined on the basis of the disparity information. Actually, for SR of the MR stereoscopic image, we cannot obtain the original FR stereoscopic image pair at the decoder side. Thus, the disparity map cannot be accurately identified. Similar to [25], the view difference image is therefore directly calculated from the stereoscopic image pair as

I_d(x, y) = Z(x, y) − Y_B(x, y)

We take the initial view difference image, I_d, as the input of the modified CNN, and the convolutional operation in the first layer of the CNN is represented as W_1 * I_d + B_1, where * denotes the convolutional operation. Then, we apply the rectified linear unit (RELU) [27] to alleviate the overfitting problem after the convolutional operation:

I_d1 = max(0, W_1 * I_d + B_1)

where W_1 and B_1 represent n_1 filters of support f_1 × f_1 × c_1 and the biases, which comprise an n_1-dimensional vector, respectively. Here, f_1 represents the filter size, and c_1 denotes the number of channels in the input image. Additionally, the output I_d1 is composed of n_1 feature maps of the input image.
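The view difference image and the first-layer operation can be sketched in NumPy. The toy filter, bias, and image values below are assumptions for illustration; the paper uses n_1 = 64 filters of size 9 × 9.

```python
import numpy as np

def view_difference(z, y_b):
    # I_d(x, y) = Z(x, y) - Y_B(x, y): FR left view minus interpolated right view.
    return z - y_b

def conv2d_relu(img, w, b):
    """First-layer operation max(0, W_1 * I_d + B_1) for a single filter
    w (f x f) and scalar bias b; 'valid' convolution for brevity."""
    f = w.shape[0]
    h, wd = img.shape
    out = np.zeros((h - f + 1, wd - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + f, j:j + f] * w) + b
    return np.maximum(out, 0.0)  # RELU

z = np.ones((5, 5)) * 3.0        # toy FR left view
y_b = np.ones((5, 5))            # toy interpolated right view
i_d = view_difference(z, y_b)    # constant difference image of 2s
w1 = np.full((3, 3), 1.0 / 9.0)  # toy 3x3 averaging filter
feat = conv2d_relu(i_d, w1, b=-1.0)
```

Each position of `feat` sums a 3 × 3 patch of the difference image against the filter, adds the bias, and clips negatives to zero, which is exactly the conv-plus-RELU composition of the first layer.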

Mapping and Estimation of View Difference Image and Right View Image
After extracting an n_1-dimensional feature map for each image patch in the first CNN layer, these n_1-dimensional vectors are mapped into n_2-dimensional vectors in the second CNN layer. The high-precision view difference image is estimated in the third CNN layer. Finally, we estimate the right view image after the three-layer operation.
Similar to the first layer, the second layer is built through convolution and the RELU as

I_d2 = max(0, W_2 * I_d1 + B_2)

where W_2 and B_2 represent n_2 filters of support f_2 × f_2 × c_2 and the biases, which comprise an n_2-dimensional vector, respectively. The output I_d2 is composed of n_2 feature maps, which can be conceptually used as representations of high-precision view difference image patches that will constitute a full high-precision view difference image.
According to the representations of high-precision view difference image patches in the second layer, the convolutional operation in the third CNN layer is defined to produce the high-precision view difference image, I_d3, as

I_d3 = W_3 * I_d2 + B_3

where W_3 and B_3 represent n_3 filters of support f_3 × f_3 × c_3 and the biases, which comprise an n_3-dimensional vector, respectively.
After the three-layer operation, the right view image is estimated as

X_1 = Z − I_d3

where X_1 is the estimated right view image, which is the output of the first stage.
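The complete first-stage forward pass can be sketched as follows. To keep the sketch compact, each layer is reduced to a per-pixel (1 × 1) linear map with toy scalar weights; the paper's actual filter sizes are 9 × 9, 1 × 1, and 5 × 5.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def forward(i_d, params):
    """Three-layer forward pass of the modified CNN (pooling and fully
    connected layers removed). Each layer here is a 1x1 convolution,
    i.e. a per-pixel linear map, purely for illustration."""
    w1, b1, w2, b2, w3, b3 = params
    i_d1 = relu(w1 * i_d + b1)    # patch extraction and representation
    i_d2 = relu(w2 * i_d1 + b2)   # non-linear mapping
    i_d3 = w3 * i_d2 + b3         # high-precision view difference image
    return i_d3

z = np.full((4, 4), 5.0)                  # toy FR left view
i_d = np.full((4, 4), 2.0)                # toy initial view difference image
params = (1.0, 0.0, 1.0, 0.0, 1.0, 0.0)   # identity-like toy weights
i_d3 = forward(i_d, params)
x1 = z - i_d3                             # estimated right view X_1 = Z - I_d3
```

Note that only the first two layers apply the RELU; the output layer is a plain convolution, matching the layer definitions above.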

Modified CNN and Training
In this paper, the modified CNN is created to estimate the high-precision view difference image.
Our objective is to train the three-layer network f given a training dataset {Z(i), X(i), Y_B(i)}, i = 1, ..., N, so that the high-precision view difference image, I_d3, can be accurately estimated. Here, Z(i) and X(i) denote the ground-truth left view image and ground-truth right view image, respectively. Furthermore, Y_B(i) is the interpolated right view image, and N is the number of training samples.
Then, the labeled ground-truth difference image used in the CNN is defined as

I_d-label(i) = Z(i) − X(i)

To produce the high-precision view difference image, we use the sum of the absolute difference (SAD) as the distortion function:

L(θ) = Σ_{i=1}^{N} |I_d3(i) − I_d-label(i)|

According to Equation (9), the distortion between the estimated high-precision view difference image I_d3 and the ground-truth difference image I_d-label is minimized, and the parameters θ = {W_1, W_2, W_3, B_1, B_2, B_3} are thereby obtained.
Stochastic gradient descent [28] is used to minimize the distortion by standard back-propagation. The parameter matrices are then updated as

Δ_{i+1} = M · Δ_i − r · ∂L/∂θ_l(i),  θ_l(i+1) = θ_l(i) + Δ_{i+1}

where l ∈ {1, 2, 3} represents the index of the layer, i denotes the iteration, M is the momentum parameter with a value of 0.9, r is the learning rate, and ∂L/∂θ_l(i) is the derivative of the loss with respect to the network parameters. Initially, all filters are initialized from a random Gaussian distribution with zero mean and a standard deviation of 0.001. Meanwhile, the biases of each layer are initialized to zero. The learning rate is 2.5 × 10^−4 for the first two layers and 2.5 × 10^−5 for the last layer.
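The SAD distortion and one momentum-SGD update can be written out as a short sketch. The scalar parameter and gradient values are toy assumptions; in practice θ collects all filter weights and biases, and the gradient comes from back-propagation.

```python
import numpy as np

def sad_loss(i_d3, i_d_label):
    # Sum of absolute differences between the estimated and ground-truth
    # view difference images (the SAD distortion function).
    return np.sum(np.abs(i_d3 - i_d_label))

def momentum_step(theta, velocity, grad, m=0.9, r=2.5e-4):
    """One SGD-with-momentum update:
    delta_{i+1} = m * delta_i - r * dL/dtheta
    theta_{i+1} = theta_i + delta_{i+1}."""
    velocity = m * velocity - r * grad
    theta = theta + velocity
    return theta, velocity

loss = sad_loss(np.array([1.0, 3.0]), np.array([2.0, 1.0]))  # |1-2| + |3-1|
theta, v = 1.0, 0.0
theta, v = momentum_step(theta, v, grad=4.0)  # one update with the paper's rates
```

With zero initial velocity, the first update reduces to a plain gradient step of size r · grad; the momentum term M · Δ_i only takes effect from the second iteration onward.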

Global Reconstruction Constraint to Incorporate the Right View
For a given LR right view, Y, there may be many HR right views X owing to the extremely ill-posed character of SR. We consider that the reconstructed HR right view X should remain consistent with the LR right view Y in terms of the MR stereoscopic image observation model. However, the estimated right view image X_1 obtained in the first stage may not satisfy this condition. Therefore, the global reconstruction constraint is enforced for incorporating the right view by projecting X_1 onto the solution space of Y = DBX. It is computed as

X* = arg min_X ||Y − DBX||_2^2, initialized with X_1

Appl. Sci. 2017, 7, 526

Thus, the solution can be computed by iterative back projection. The update equation is

X_{t+1} = X_t + ν (DB)^T (Y − DB X_t)

where X_t denotes the reconstructed right view image after the (t − 1)-th iteration and ν denotes the step size, with a value of one. We use X* from the aforementioned optimization as the ultimately reconstructed right view image. On one hand, image X* is as close as possible to the estimated right view image X_1, obtained by the estimation of the view difference image in the first stage. On the other hand, it satisfies the global reconstruction constraint of the second stage.
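Iterative back projection can be sketched with simple stand-ins for the degradation operators: here down() is plain decimation and up() nearest-neighbour replication, assumed substitutes for DB and its transpose rather than the paper's exact operators.

```python
import numpy as np

def back_project(x1, y, scale=2, steps=10, nu=1.0):
    """Global reconstruction constraint via iterative back projection:
    X_{t+1} = X_t + nu * up(Y - down(X_t)), starting from the
    first-stage estimate X_1."""
    x = x1.astype(float).copy()
    for _ in range(steps):
        residual = y - x[::scale, ::scale]                      # Y - DBX_t
        x += nu * np.repeat(np.repeat(residual, scale, 0), scale, 1)
    return x

x1 = np.zeros((4, 4))      # toy first-stage estimate
y = np.full((2, 2), 7.0)   # toy LR observation
x_star = back_project(x1, y)
```

After convergence the downsampled reconstruction matches Y exactly, while areas untouched by the residual retain the detail of the first-stage estimate X_1, which is precisely the two-sided behaviour described above.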

Experimental Results and Discussion
The training and testing data used in our experiments are described in this section. We explore our modified CNN and provide a convergence analysis to ensure that the method is efficient. Then, the SR results of the proposed method are compared with those of recent state-of-the-art methods. Next, the running times of all these methods are compared to evaluate the computational complexity. Finally, subjective evaluation is adopted to analyze the perceived quality of the stereoscopic images.

Training and Testing Data
Considering that an appropriate amount of training data can increase CNN performance [29], we used 20 stereoscopic images from the Middlebury Stereo Dataset [30] as the training dataset. Each image pair comprised two views. To test the performance and robustness of the modified CNN, we employed the first frame of various multi-view video sequences, including Champagne_tower, Dog, Pantomime, Newspaper, Poznan Street [31], Ballet, and Breakdancers [32], as well as the remaining image pairs in the Middlebury Stereo Dataset: Sword 2, Umbrella, and Vintage. Figure 6 shows the right view of each pair in the testing dataset. For Champagne_tower, the FR left view is view 38 and view 39 is selected as the LR right view. For Dog and Pantomime, the FR left view is view 40 and view 39 is selected as the LR right view. For Newspaper, the FR left view is view 2 and view 3 is selected as the LR right view. For Poznan Street, the FR left view is view 5 and view 4 is selected as the LR right view. For Ballet and Breakdancers, the FR left view is view 2 and view 1 is selected as the LR right view.

In this paper, the specific CNN parameters are established according to practical experience, as shown in Figure 4. The MATLAB toolbox MatConvNet [33] is applied to implement our CNN. The parameters of the convolutional filters are set to [f_l, f_l, c_l, n_l] = [9, 9, 1, 64] for l = 1, [1, 1, 64, 32] for l = 2, and [5, 5, 32, 1] for l = 3. To strengthen the correlation between image patches, the convolution stride is set to one for all layers. For a three-channel color image, we followed the methods of [12-15] in the experiments. Additionally, our stereoscopic image SR is intended to contribute to the MR stereoscopic video coding model, whose video format is YCbCr (a color space in which Y is the brightness (luma) component, while Cb and Cr are the blue and red concentration offset components). Thus, the color image is converted to the YCbCr color space, and only the Y component of the image is reconstructed by our SR method, while the Cb and Cr components are upsampled by the bicubic interpolation method [34]. Furthermore, although our network can easily be extended to multi-channel image processing, Dong's experiments on the YCbCr color space [15] demonstrated that the Cb and Cr channels scarcely improved the performance. Therefore, the following objective evaluation indices are calculated only on the Y channel in Sections 4.2 and 4.3.
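Extracting the Y channel that our method operates on can be sketched as below, using the standard ITU-R BT.601 luma weights; Cb/Cr handling is omitted since those channels are simply upsampled by bicubic interpolation.

```python
import numpy as np

def rgb_to_y(rgb):
    """Luma (Y) component of the YCbCr colour space, ITU-R BT.601
    weights. Only this channel is fed to the SR reconstruction; Cb and
    Cr are handled separately by bicubic interpolation."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

rgb = np.zeros((2, 2, 3))
rgb[..., 0] = 255.0        # toy pure-red image
y = rgb_to_y(rgb)
```

The design choice follows from the HVS being far more sensitive to luma than chroma, which is why super-resolving only Y recovers nearly all the perceptible detail.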

Convergence Analysis
To verify the efficiency of the proposed method, a convergence analysis was conducted. Figure 7 shows the training performance on the Middlebury Stereo Dataset [30] and the testing performance on the image Dog. For evaluating the convergence of the modified CNN, the above-mentioned SAD was computed as the training error, and the Peak Signal to Noise Ratio (PSNR) was used as the testing error in each epoch. Here, an epoch denotes one pass over the training data.
It is evident in the figure that, with the increase in epochs, the SAD in training gradually decreases, whereas the PSNR in testing progressively increases. Furthermore, a gradual convergence tendency with increasing epochs can be observed. These results show that the reconstruction results improve with more training epochs until the modified CNN reaches stability. Furthermore, the performance of the proposed method surpasses that of the sparse coding (SC) method [13] baseline after only a few training epochs, and it outperforms the SRCNN [15] with sufficient training epochs. Finally, it converges to a PSNR value of 46.66 dB on the image Dog.
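The PSNR testing error used above can be computed with a few lines of NumPy; the 8 × 8 constant images are toy inputs for illustration.

```python
import numpy as np

def psnr(ref, est, peak=255.0):
    """Peak Signal to Noise Ratio in dB: 10 * log10(peak^2 / MSE),
    the testing error reported in the convergence analysis."""
    mse = np.mean((ref.astype(float) - est.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.full((8, 8), 100.0)
est = ref + 1.0                 # uniform error of one gray level
p = psnr(ref, est)
```

A uniform error of one gray level out of 255 already yields a PSNR above 48 dB, which puts the 46.66 dB convergence value in context: it corresponds to a very small average reconstruction error.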

Contrast with Single-View Methods
To demonstrate that the proposed method is more effective than single-view methods, comparative experiments were carried out between the SC method [13], the SRCNN method [15], and the proposed method. In addition to the common PSNR index, we used two other objective evaluation indices, namely the structural similarity index (SSIM) [35] and the blind/referenceless image spatial quality evaluator (BRISQUE) [36]. Since PSNR and SSIM are full-reference image quality assessments and BRISQUE is a no-reference image quality assessment, we considered PSNR and SSIM together to facilitate the analysis. Figure 8 depicts the reconstruction results of the three methods for the first frame of the Pozan Street sequence. As seen in the details of the local amplification region, SC [13] produced a relatively vague reconstruction. Furthermore, as shown by the edge of the letter P and the number 0, the proposed method was more effective in ringing reduction than SRCNN [15]. Figure 9 shows the reconstruction results of the three methods for the Umbrella image. From the magnified details of the umbrella stick and umbrella skeleton, the proposed method produced sharper edges that more closely approximate the real HR images. Figure 10 additionally shows that the details of the pantomimist's clothes in the first frame of the Pantomime sequence obtained by the proposed method are clearer.
Table 1 presents the PSNR and SSIM values in the Y channel obtained by SC [13], SRCNN [15], and the proposed method. As shown in the table, compared with SC [13] and SRCNN [15], the average PSNR value of the proposed method is increased by 2.39 dB and 0.54 dB, respectively, and the average SSIM value is slightly increased by 0.01 and 0.0002, respectively. Furthermore, the BRISQUE values in Table 2, which typically range from 0 to 100 (0 represents the best quality, 100 the worst), demonstrate the strong performance of the proposed method: the average BRISQUE value of the proposed method is decreased by 21.66 and 2.03 compared with SC [13] and SRCNN [15], respectively. Although the CNN architecture of the proposed method is similar to that of SRCNN [15], the proposed method achieves better performance by combining the high-precision view difference image estimated by the modified CNN with the global reconstruction constraint. Overall, according to the tables and the reconstructed images, the proposed method achieves a better SR result than SC [13] and SRCNN [15].
Table 1. Comparison with SC [13] and SRCNN [15] for PSNR (dB)/structural similarity index (SSIM).

Comparison with the Depth-Based Method
To show the advantages of the proposed method, which fully leverages the correspondence between views without estimating the depth map, comparative experiments were carried out between the depth-based method presented by Garcia [22] and the proposed method. As mentioned in Garcia's work [22], to fully consider the high-frequency information of the original images, the test sequences, including Dog, Pantomime, Pozan Street, Ballet, and Breakdancers, were resized to 640 × 480, 640 × 480, 960 × 544, 512 × 384, and 256 × 192, respectively, using a six-tap Lanczos interpolation filter. To ensure consistency, we employed these downsampled images in place of the original images for testing under the same conditions.
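All PSNR values in these comparisons follow the standard Y-channel definition over 8-bit images. As a concrete reference, a minimal NumPy sketch (with a hypothetical toy image pair, not the test sequences) is:

```python
import numpy as np

def psnr(reference, reconstructed, peak=255.0):
    """Peak Signal to Noise Ratio between two same-sized images (dB)."""
    ref = reference.astype(np.float64)
    rec = reconstructed.astype(np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# Hypothetical example: a constant offset of 1 gray level gives MSE = 1,
# so PSNR = 10 * log10(255^2) ~= 48.13 dB.
ref = np.full((4, 4), 128, dtype=np.uint8)
rec = ref + 1
print(round(psnr(ref, rec), 2))  # 48.13
```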

As depicted in Table 3, the average PSNR value of the proposed method increased by 1.14 dB compared with Garcia's method [22]. Two characteristics of the proposed method contributed to this improvement: (1) the estimation of a high-precision view difference image between views by the modified CNN in the first stage, and (2) the gradual improvement of the reconstructed right view by enforcing the global reconstruction constraint to incorporate the right view.
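The global reconstruction constraint of point (2) can be illustrated with a generic iterative back-projection loop; the sketch below uses an assumed block-average operator standing in for the blur-plus-downsampling of the observation model, not the paper's exact operators:

```python
import numpy as np

def observe(x, s=2):
    """Combined blur + downsampling: average each s x s block (acts as DB)."""
    h, w = x.shape
    return x.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def upsample(r, s=2):
    """Back-project an LR residual to the HR grid by pixel replication."""
    return np.kron(r, np.ones((s, s)))

def back_project(x0, y, s=2, n_iter=20):
    """Iteratively enforce the global reconstruction constraint so that
    the observed version of the HR estimate agrees with the LR image y."""
    x = x0.copy()
    for _ in range(n_iter):
        residual = y - observe(x, s)   # mismatch in the LR domain
        x = x + upsample(residual, s)  # correct the HR estimate
    return x

# Hypothetical toy example: after convergence the estimate reproduces y.
x0 = np.zeros((4, 4))
y = np.array([[1.0, 2.0], [3.0, 4.0]])
x = back_project(x0, y)
print(np.allclose(observe(x), y))  # True
```

In the proposed method, the role of `x0` is played by the right view estimated by the modified CNN in the first stage; the loop then pulls that estimate onto the solution space defined by the LR observation.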

Running Time
To evaluate the computational complexity, the running times of all methods were compared on the ten testing stereoscopic images listed in Table 1, as shown in Table 4. All results were acquired from the corresponding authors' MATLAB code, and ours were likewise obtained in MATLAB. All algorithms were run on the same machine with an Intel 2.30-GHz CPU and 16 GB of RAM. The proposed method was clearly faster than the SC method [13] because, unlike the SC method, it does not need to solve a complex optimization problem. Moreover, the proposed method's speed was close to that of SRCNN [15]: once the training of our modified CNN was complete, the SR results were quickly obtained by feed-forward computation.

Subjective Quality Evaluation
To evaluate the quality of the stereoscopic images generated by the different SR methods, a subjective experiment was implemented. The procedure followed International Telecommunication Union-Radiocommunication Sector (ITU-R) Recommendation BT.500 [37], so that the subjective quality of the reconstructed stereoscopic images relative to the original stereoscopic images was obtained. Specifically, the experiment used the ten aforementioned testing stereoscopic images and adopted the Double-Stimulus Continuous Quality-Scale method [37], in which two stimuli, the reconstructed stereoscopic image and the original stereoscopic image, are scored simultaneously. Thus, for ten testing stereoscopic images and three different SR methods, a total of 10 × 3 = 30 comparison clips were produced.

Experimental Environment and Participants
In the experiment, a stereoscopic projection system was adopted, and the participants wore polarized glasses, which deliver the left and right views to the appropriate eyes. The system consisted of two projectors (BenQ PB8250 DLP), DELL real-time 3D graphics workstations, a polarized-light bracket, and a 150-inch metal screen. The experiment was conducted in a dedicated laboratory in which the illumination, temperature, and other experimental conditions followed ITU-R Recommendation BT.500.
A total of 20 participants, with an average age of 23 years, were involved in the study. All participants passed a color vision test, a 20/30 visual acuity test, and a stereoscopic visual acuity test at 40 arc-seconds. They were non-experts whose professional backgrounds were not directly related to image quality. Each participant scored 30 pairs of stereoscopic images; each pair was displayed for 40 s, followed by 10 s for scoring and 10 s for resting. According to ITU-R Recommendation BT.500, the display order of the 30 pairs was random. The distance between the participants and the screen was 3 m, that is, three times the height of the screen. In addition, to familiarize the participants with the scoring process, four additional pairs of stereoscopic images were displayed to them before the official scoring.

Ranking and Raw Data Processing
After the stereoscopic images were scored by the participants, the Difference Mean Opinion Score (DMOS) between each pair of stereoscopic images, i.e., the original stereoscopic image and the corresponding reconstructed stereoscopic image, was calculated. According to ITU-R BT.500, the value of DMOS ranges from 0 to 100. The difference score is

$d_{ij} = r_{i\,\mathrm{ref}(j)} - r_{ij}$ (14)

where $r_{ij}$ denotes the raw quality score of the $j$-th reconstructed stereoscopic image evaluated by the $i$-th participant, $r_{i\,\mathrm{ref}(j)}$ is the raw quality score assigned by the $i$-th participant to the $j$-th original stereoscopic image, and $N_j$ denotes the number of participants involved in the assessment of the $j$-th stereoscopic image. Then $\mathrm{DMOS}_j$ is the mean of the raw difference scores, $\mathrm{DMOS}_j = \frac{1}{N_j}\sum_{i=1}^{N_j} d_{ij}$.
Before calculating the final DMOS of each reconstructed stereoscopic image, the data of participants with poor scoring stability were removed. Specifically, if the value of $d_{ij}$ falls outside the 95% confidence interval of $\mathrm{DMOS}_j$, then $d_{ij}$ is regarded as an outlier.
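The difference scores and the screening step can be sketched as follows (NumPy; the raw scores are hypothetical, and the 95% confidence interval uses the usual 1.96 standard-error bound):

```python
import numpy as np

def dmos_with_screening(ref_scores, rec_scores, z=1.96):
    """Per-image DMOS after discarding outlier difference scores.

    ref_scores, rec_scores: (participants x images) raw quality scores.
    A difference score d_ij is an outlier if it falls outside the 95%
    confidence interval of the per-image mean difference score.
    """
    d = ref_scores - rec_scores                    # d_ij = r_iref(j) - r_ij
    mean = d.mean(axis=0)
    half = z * d.std(axis=0, ddof=1) / np.sqrt(d.shape[0])
    keep = np.abs(d - mean) <= half                # inside the 95% CI
    return np.nanmean(np.where(keep, d, np.nan), axis=0)

# Hypothetical example: three stable raters and one aberrant rater.
ref = np.array([[90.0], [90.0], [90.0], [90.0]])
rec = np.array([[80.0], [80.0], [80.0], [40.0]])
print(dmos_with_screening(ref, rec))  # [10.] -- the aberrant score is dropped
```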

Results and Analysis
Figure 11 shows the DMOS of the ten testing stereoscopic images reconstructed with the three different SR methods. It is evident that the SC method [13] performed worse than the other two SR methods; most participants found the stereoscopic images reconstructed by the SC method more obscure. Benefiting from the correlation between views, the proposed method also outperformed SRCNN [15], demonstrating better stereo visual quality.

Conclusions
To ensure the perceived quality of stereo vision, we proposed in this paper a method that uses CNNs and incorporates views to reconstruct an FR stereoscopic image. In contrast to single-view and depth-based methods, the proposed method exploits the correlation between the left and right views without estimating the depth image. First, a deep learning tool is used to estimate the view difference image, which comprises the image texture information and the depth information of the stereoscopic pair. Then, the estimated right view image is projected onto the solution space of the MR stereoscopic image observation model. Finally, the HR reconstructed right view image is obtained. The experimental results indicated that the performance of the proposed method is superior to that of the existing methods in terms of both reconstruction quality and speed. In the future, we will focus on accelerating the CNN convergence and further consider exploiting the temporal correlation of video to study MR stereoscopic video SR.

The MR stereoscopic image observation model is $Y = DBX$, where Y and X denote the observed LR right view image and the original FR right view image (that is, the unknown FR right view image), respectively. Both Y and X are in vector form, with sizes $M_1M_2 \times 1$ and $N_1N_2 \times 1$; D is the downsampling matrix of size $M_1M_2 \times N_1N_2$, and B is the blurring matrix of size $N_1N_2 \times N_1N_2$. Letting Z denote the observed FR left view image, the purpose of the proposed method is to acquire an FR right view image X by making full use of the abundant information in the observed LR right view image Y and the observed FR left view image Z.
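The stated matrix dimensions can be checked with a small sketch that builds explicit toy-sized D and B operators (hypothetical 2x decimation, with an identity placeholder for the blur):

```python
import numpy as np

def downsample_matrix(N1, N2, s=2):
    """D: selects every s-th pixel of the vectorized N1 x N2 image."""
    M1, M2 = N1 // s, N2 // s
    D = np.zeros((M1 * M2, N1 * N2))
    for i in range(M1):
        for j in range(M2):
            D[i * M2 + j, (i * s) * N2 + (j * s)] = 1.0
    return D

def blur_matrix(N1, N2):
    """B: identity placeholder (a real B would be a convolution matrix)."""
    return np.eye(N1 * N2)

N1, N2, s = 4, 4, 2
D = downsample_matrix(N1, N2, s)
B = blur_matrix(N1, N2)
X = np.arange(N1 * N2, dtype=float)  # vectorized FR right view, N1N2 x 1
Y = D @ B @ X                        # observed LR right view, M1M2 x 1
print(D.shape, B.shape, Y.shape)     # (4, 16) (16, 16) (4,)
```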

Figure 1. Overall procedure of the proposed method.

Figure 3. Overall procedure of the proposed method.

Figure 5 shows the right view of each pair. To provide sufficient information for training the modified CNN while limiting its complexity, we extracted the view difference image between the 20 pairs of FR left view images and interpolated right view images, and we randomly cropped sub-images of size 33 × 33 from the training images. A total of 552,729 sub-images (that is, N, the number of training samples) were generated for training. To avoid boundary effects during training, none of the convolutional layers uses padding. Moreover, although a fixed image size was used during training, the modified CNN can be applied to images of arbitrary size at test time.
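The sub-image extraction described here can be sketched as random 33 × 33 cropping (NumPy; the input image below is a hypothetical placeholder for a view difference image):

```python
import numpy as np

def crop_sub_images(image, n, size=33, rng=None):
    """Randomly crop n sub-images of size x size from a single image."""
    rng = np.random.default_rng(rng)
    h, w = image.shape[:2]
    crops = []
    for _ in range(n):
        top = rng.integers(0, h - size + 1)
        left = rng.integers(0, w - size + 1)
        crops.append(image[top:top + size, left:left + size])
    return np.stack(crops)

# Hypothetical example: 8 crops from a 100 x 120 difference image.
diff_image = np.zeros((100, 120))
batch = crop_sub_images(diff_image, 8, rng=0)
print(batch.shape)  # (8, 33, 33)
```

Repeating this over all 20 training pairs yields the pool of sub-images used as training samples.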

Figure 5. Right view of each stereoscopic pair in the training dataset (from the Middlebury Stereo Dataset [30]).


Figure 6. Right view of each stereoscopic pair in the testing dataset.


Figure 7. Convergence curves. (a) Training performance on the Middlebury Stereo Dataset [30]; (b) testing performance on the Dog image.


Figure 8. Super-resolution (SR) results and Peak Signal to Noise Ratio (PSNR) values in the Y channel for the first frame of the Pozan Street sequence. (a) Ground truth; (b) local amplification region of the ground truth; (c) local amplification region of sparse coding (SC) [13]; (d) local amplification region of SRCNN [15]; and (e) local amplification region of the proposed method.


Figure 9. SR results and PSNR values in the Y channel for the Umbrella image. (a) Ground truth; (b) local amplification region of the ground truth; (c) local amplification region of SC [13]; (d) local amplification region of SRCNN [15]; and (e) local amplification region of the proposed method.



Figure 10. SR results and PSNR values in the Y channel for the first frame of the Pantomime sequence. (a) Ground truth; (b) local amplification region of the ground truth; (c) local amplification region of SC [13]; (d) local amplification region of SRCNN [15]; and (e) local amplification region of the proposed method.

Figure 11. Difference Mean Opinion Scores (DMOS) of the ten testing stereoscopic images reconstructed with three different SR methods.


Table 4. Running time for the ten testing stereoscopic images listed in Table 1 (unit: s).