Depth Estimation for Light-Field Images Using Stereo Matching and Convolutional Neural Networks

The paper presents a novel depth-estimation method for light-field (LF) images based on innovative multi-stereo matching and machine-learning techniques. In the first stage, a novel block-based stereo matching algorithm is employed to compute the initial estimation. The proposed algorithm is specifically designed to operate on any pair of sub-aperture images (SAIs) in the LF image and to compute the pair’s corresponding disparity map. For the central SAI, a disparity fusion technique is proposed to compute the initial disparity map based on all available pairwise disparities. In the second stage, a novel pixel-wise deep-learning (DL)-based method for residual error prediction is employed to further refine the disparity estimation. A novel neural network architecture is proposed based on a new structure of layers. The proposed DL-based method is employed to predict the residual error of the initial estimation and to refine the final disparity map. The experimental results demonstrate the superiority of the proposed framework and reveal that the proposed method achieves an average improvement of 15.65% in root mean squared error (RMSE), 43.62% in mean absolute error (MAE), and 5.03% in structural similarity index (SSIM) over machine-learning-based state-of-the-art methods.


Introduction
Light-field (LF) cameras were recently introduced in the image-processing and computer-vision domains in order to overcome the limitations of the conventional camera model. Conventional cameras were designed to capture, at each pixel position, the red, green, and blue (RGB) color components and the accumulated intensity of the light rays incident to the camera plane from all directions. In contrast to this model, LF cameras were designed to capture the intensity, color, and directional information of each light ray at each pixel position, yielding a 4D LF image for each acquisition.
LF cameras, also known as plenoptic cameras, are implemented by placing an array of microlenses in front of the camera sensor. They serve as an alternative to the conventional paradigm for acquiring 4D LF data, which is to arrange conventional RGB cameras on a rectangular grid. Such camera arrays are difficult to implement and handle, and the inherently large baselines between cameras cause substantial difficulties when handling occlusions in many applications.
The problem of depth estimation from stereo images has been widely investigated over several decades. Recently, LF images have received increased attention due to their capability to provide both light-ray intensity and directional information about the scene. Stereo matching was limited to

Conventional Computer-Vision Methods
The rise of LF cameras allows the use of additional cues, as it is no longer necessary to shoot multiple images separately to capture the scene under various focus settings or viewpoints. Thus, Tao et al. [28] devised an algorithm to compute a dense depth estimation by combining multiple cues, namely defocus and correspondence. This work was extended in [29] by refining the shape using the shading cue, and by removing the specularity component of a scene, allowing for better depth estimation in specular areas. Wang et al. [3] developed a depth-estimation algorithm targeting the main issues of [28] by detecting and improving the depth of occluded regions. Returning to the correspondence cue, Jeon et al. [4] estimated the depth by shifting SAIs using the phase shift theorem, gradient costs, and a multi-label optimization, while Buades et al. [30] combined multiple pairwise disparity estimations using a multi-scale and multi-window stereo matching algorithm which rejects the unreliable pixels. Navarro and Buades [31] proposed to improve the disparity estimations of [30] by employing an optical-flow algorithm to interpolate the disparity. Williem et al. [32] proposed two new data costs to improve the depth estimation in noisy and occluded areas, where the first one is based on the angular patch approach and the second on the refocus image approach. Huang [33] developed a stereo matching algorithm by employing an empirical Bayesian framework based on Markov Random Fields to infer the depth maps of any kind of scene: dense, sparse, denoised, RGB, or grayscale.
A different approach is based on the estimation of the slopes of an epipolar plane image (EPI) to compute the depth. Wanner and Goldluecke [34] used the structure tensor and a convex optimization method to find the dominant slopes of the EPI and convert them to depth. Zhang et al. [35] used a spinning parallelogram operator to determine the disparity, which is given by the orientation of the parallelogram when the distance between the two regions enclosed by its two sides in an EPI is maximal.
More recently, Mishiba's work [36] focused on devising a fast stereo-matching-based depth-estimation algorithm for LF cameras. The main novelty lies in an offline cost volume interpolation and in a weighted median filter which replaces the usual graph-cut algorithm as the global optimization solver, thus increasing the speed of the overall algorithm.

Deep-Learning-Based Methods
An approach based on ML techniques addresses the depth-estimation problem from stereo pairs by employing learning-based methods for stereo matching. In [37], a supervised learning approach is proposed for predicting the correctness of stereo matches based on a random forest and a set of features about each pixel. In [38], the authors propose a framework which trains a model for matching cost computation in an unsupervised manner. In [39], the authors propose a deep-learning-based method that predicts the confidence level to improve the accuracy of an estimated depth map in stereo matching. This method was further improved in [40], where the confidence is estimated through a unified deep network, built based on the multi-scale patch processing approach that combines the confidence features extracted both from the matching probability volume and its corresponding disparity.
Recently, several solutions based on ML techniques were developed to address the depth-estimation problem from 4D LF images by employing various DL-based methods. Feng et al. [22] proposed a two-stream Convolutional Neural Network (CNN) by training the network model using the pairs of block-size input patches extracted from the horizontal and vertical EPIs. Shin et al. [23] proposed a four-stream fully convolutional neural network (FCNN) where each stream was designed to process a block-size input patch extracted from a specific EPI (horizontal, vertical, main diagonal, or anti-diagonal). In our prior work [25], a neural network design is proposed to estimate each pixel depth by training network models using the input patches extracted from each of the following EPIs: horizontal, vertical, main diagonal, and anti-diagonal, and by further processing the four estimated depth maps to compute the final depth map. All these methods employ a pixel-wise strategy where the depth information of each pixel in the central view is estimated by inferring the patches from different EPIs containing the local neighborhood of the current pixel. Ma et al. [41] performed multi-scale convolutions to extract multi-scale features from the SAIs and obtain good estimations of the disparity in texture-less and reflective areas based on the estimations at object boundaries.

Proposed Method
The baseline algorithm used for stereo matching was pioneered by Buades and Facciolo [30] and then reused by Navarro and Buades [31]. In our previous work [24], we built on this algorithm by adapting it to operate on arrays of LF cameras. In this paper, we propose to first enhance our previous design in [24] to increase its robustness, and then to employ a DL-based algorithm for residual error prediction to refine the final disparity estimation. Figure 1 depicts the scheme of the proposed method. The proposed multi-stereo matching (MSM) algorithm is first employed to compute the initial estimation, denoted by D_msm. The proposed CNN-based algorithm for residual error prediction is then employed to recover additional details and obtain the refined estimation, denoted by D_cnn. Finally, a post-processing algorithm is employed to compute the final estimation, denoted by D_final. In this paper, we focus on estimating the disparity information, as it is well known that the depth information can be easily computed based on the disparity, camera baseline, and focal length. This section is organized as follows. Section 3.1 introduces the proposed MSM algorithm. Section 3.2 describes the proposed DL-based refinement algorithm.

Multi-Stereo Matching Method
The starting point of this work is represented by our algorithm from [24], where the proposed disparity estimation method based on a multi-scale multi-window approach was improved by employing the belief propagation [42] as the global energy minimization function. The method is capable of achieving sub-pixel accuracy, which is of critical importance when dealing with LF images that have very small disparity values. Furthermore, the method enforces the estimations to be reliable as unreliable pixels of each stereo pair are rejected, which leads to gaps that can be filled in based on estimations from other pairs. This approach is very well suited when dealing with LF images where multiple disparity estimations are to be fused together. Indeed, the accuracy of each disparity map is more important than its completeness as the missing pixels are likely to have an estimation in other stereo pairs.
In this paper, we propose to extend our previous work from [24] to improve the disparity estimation in flat and untextured areas. The following concepts are proposed: (i) remove the constraint that the two SAIs in the stereo pair must be aligned horizontally or vertically, and extend the set of stereo pairs to contain all available SAI pairs; and (ii) modify the disparity fusion algorithm to employ a weighted mean estimation based on the pixel confidence, instead of a median filter aggregation, and depending on the camera baseline.

Neighborhood Window Selection
To compute the disparity map corresponding to each stereo pair, eight different costs are computed for each pixel and disparity value using different neighborhood windows. These windows, which have different orientations, are depicted in Figure 2. By employing these windows instead of a regular square window, the proposed multi-stereo method provides improved results in the areas close to object boundaries and in slanted regions, where the selected window can align with the minimal disparity changes [31]. However, highly untextured areas remain difficult. To overcome this limitation, we introduce a threshold, denoted by τ, on the variance of the neighborhood inside the window. If the computed variance is below τ, the size of the window is increased. This ensures that enough information about the scene is available for finding overlapping patches in both stereo images, for each pixel p and for each of the eight windows. More exactly, four different window sizes are used in our implementation: 9 × 9, 17 × 17, 65 × 65, and 129 × 129. Figure 3 illustrates the window size selected for each pixel of the five LF images commonly used in the literature to compare the state-of-the-art methods for depth estimation, extracted from the dataset introduced in [43], where white marks the smallest window size and black marks the largest window size. As expected, uniform areas such as the wall in the kitchen LF image use a large window size, while highly textured areas like the floor in the town LF image use a small window size. The same observation holds for the object boundaries, where clearly recognizable edges use the smallest windows. The use of various window sizes improves the disparity estimation in large uniform areas.
The threshold plays an important role: if it is too small, a 9 × 9 window will be assigned to all pixels, which does not capture enough information about the scene in untextured areas; if it is too large, a 129 × 129 window will be used, which reduces the sharpness of object boundaries. Our threshold was empirically chosen and fixed for all images, to prove the robustness of the approach. Our experiments show that, for the tested dataset, the PSNR increases by up to 0.35 dB and the number of reliable pixels increases by up to 3%.
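As an illustration, the variance-driven window selection described above can be sketched as follows; the function name `select_window_size` is ours, and the value passed as `tau` is a placeholder rather than the empirically chosen threshold used in the paper:

```python
import numpy as np

WINDOW_SIZES = [9, 17, 65, 129]  # the four candidate square window sizes

def select_window_size(image, i, j, tau):
    """Pick the smallest window whose local variance exceeds tau.

    image: 2D grayscale array; (i, j): pixel position; tau: variance threshold.
    Falls back to the largest window in highly untextured areas.
    """
    for size in WINDOW_SIZES:
        b = size // 2
        patch = image[max(0, i - b):i + b + 1, max(0, j - b):j + b + 1]
        if patch.var() > tau:
            return size
    return WINDOW_SIZES[-1]
```

On a perfectly flat region the variance never exceeds the threshold and the largest window is returned, while on textured regions the smallest window already suffices, matching the behavior observed in Figure 3.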
In the case of stratified scenes such as dots, which contain a lot of noise, the variance is high everywhere in the image, limiting the impact of this threshold. Hence, we propose to employ a DL-based algorithm to refine the initial estimation map, by first detecting the cases where the proposed stereo matching method does not perform well and then computing the corresponding residual error to adjust the estimated disparity.

Fusion Algorithm
LF cameras were invented as a different method to capture information about a given scene, with the intention of improving its 3D reconstruction. Intuitively, more information about the incoming light should allow for a better 3D reconstruction. In [31], only the SAIs located on the same row or column as the SAI for which the disparity maps are computed are used. In contrast, the proposed method uses all available SAIs when necessary. However, to achieve this goal, we propose to extend our method to take into account the epipolar lines of each stereo pair, because the pairs that are not extracted from the same row or column are neither horizontally nor vertically registered. The epipolar lines are computed using the intrinsic and extrinsic matrices of each SAI. By adding the epipolar lines, the algorithm also gains the advantage of getting closer to a realistic setting, where two images captured by two different cameras are unlikely to be registered.
The LF images in the 4D benchmark dataset from [43] consist of an array of 9 × 9 SAIs. Therefore, here we first apply the proposed stereo matching algorithm on all the available 9² − 1 = 80 stereo pairs, where each pair contains the selected reference SAI (i.e., the corresponding view of the estimated disparity map) and one of the remaining SAIs, selected in turn. The algorithm computes 80 different disparity estimations for the reference SAI, which in our case is selected as the central view. The final disparity estimation is obtained by fusing the 80 disparity maps into one disparity map, D_msm. In [24,31], the final disparity value for each pixel was computed as the median of the estimations. In this paper, we propose to carefully analyze the set of obtained estimations to find the best possible method to fuse them and compute the disparity maps. Figure 4 depicts the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE) computed for each stereo pair between the estimated disparity map and the ground truth. Each image collects the RMSE (first row) and MAE (second row) on the positions of the stereo pairs marked in Figure 5. One easily notes that the error decreases when the baseline between the stereo images increases. Moreover, the observations depicted in Figure 4 are consistent for all LF images in the dataset from [43], and not exclusively for this set of LF images. For the kitchen and dots LF images, the visualizations look slightly different: most probably due to the reflective surfaces contained in the scene in the former, and to the high amount of noise in the latter. Regardless, the relationship between the baseline and the estimation error remains the same. Therefore, we decided to give more importance to the disparity maps resulting from the SAI pairs with a larger baseline.
More exactly, we only use the disparities obtained from the inner SAIs if the estimations from the outer SAIs are not reliable and, thus, rejected, as explained in Section 3.1. These baselines are depicted in Figure 5. Please note that the figure does not express the actual distances between the SAIs, but merely provides a reference of the baseline distances as interpreted by the proposed algorithm.
In addition, the cost associated with each pixel, denoted by cost(i, j), was computed when estimating the disparity maps as the zero-mean sum of squared differences (ZSSD) between the patches in the two images, as expressed by Equation (2), where cost(i, j) is the cost associated with pixel p = (i, j) in the first image for the chosen window W and disparity d leading to the corresponding pixel q in the second image. Since this cost is inversely proportional to the similarity of the patches in the stereo pair, it is used here to compute the confidence level of the found match, denoted by conf(i, j), as a decreasing function of the cost. The output estimated disparity map of the proposed SM-based method, D_msm, is computed as the confidence-weighted mean of the multiple estimated disparity maps D, i.e., D_msm(i, j) = (Σ_{k∈K} conf_k(i, j) · D_k(i, j)) / (Σ_{k∈K} conf_k(i, j)), where K is the set of estimated disparity maps to consider for a given pixel, based on their baselines and reliabilities. The proposed fusion method, shown in Algorithm 1, combines the k estimated disparity maps D_k and their corresponding masks m_k based on these two observations. For each pixel, the weighted mean of the reliable disparity estimations is first computed using the largest available baseline, which has the value 4 in the case of an LF image represented as an array of 9 × 9 SAIs for which the disparity map at the central location is computed. If not all pixels are filled in, due to the lack of a reliable estimation, the next available baseline, 3, is then used. This process is repeated until all pixels are filled in or no more disparity estimations are available. In general, due to the large amount of information available in an LF image, all the pixels are filled in. Otherwise, a simple inpainting technique is employed for the remaining pixels. Our tests show that the only remaining holes are isolated pixels and not large patches, so the values can easily be estimated from the surrounding pixels.
Therefore, the remaining pixels are filled in based on their neighborhood's disparity estimations, as the mean of the reliable values among its eight neighbors.
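For reference, the ZSSD cost can be sketched as below. The `confidence` mapping shown here is a stand-in: any decreasing function of the cost satisfies the description in the text, and the paper's exact expression is not reproduced:

```python
import numpy as np

def zssd_cost(patch1, patch2):
    """Zero-mean sum of squared differences between two equally sized patches.

    Subtracting each patch's mean makes the cost invariant to a constant
    intensity offset between the two views.
    """
    return np.sum(((patch1 - patch1.mean()) - (patch2 - patch2.mean())) ** 2)

def confidence(cost):
    """Stand-in confidence measure: a decreasing function of the matching cost
    (the paper defines its own mapping based on cost(i, j))."""
    return 1.0 / (1.0 + cost)
```

A perfect match yields zero cost and maximal confidence, and a patch shifted by a constant brightness offset still matches perfectly thanks to the zero-mean normalization.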

Algorithm 1 Disparity maps fusion algorithm
Input: k disparity maps D_k, k reliability masks m_k, and k baselines b_k
Output: disparity map D_msm
1: baseline = 4
2: while baseline > 0 and not all pixels filled in do
3:     for all pixels (i, j) in D_msm not yet filled in do
4:         create empty vector K of estimations to consider
5:         for all k do
6:             if b_k == baseline and m_k(i, j) == 1 then
7:                 add D_k to vector K

Deep-Learning-Based Refinement
The estimated disparity map computed by the proposed MSM-based method is refined by employing a novel DL-based refinement method. The proposed pixel-wise DL-based algorithm is employed to process the local neighborhood information around the current pixel position and to estimate the corresponding residual error of the initial MSM-based estimation. Hence, the initial estimation is first adjusted with the CNN-based residual error prediction, and then further refined by employing a post-processing algorithm to compute the final estimation map.
The proposed training strategy is presented in Section 3.2.1, the proposed network architecture is described in Section 3.2.2, the loss function formulation is introduced in Section 3.2.3, while the final post-processing algorithm is outlined in Section 3.2.4.

Training Strategy
The goal of the proposed DL-based algorithm is to process a small patch extracted from D msm , and to evaluate the performance of the proposed MSM-based method by predicting its residual error. More exactly, the proposed DL-based algorithm uses the local context of the current pixel to detect the cases where the initially applied MSM-based method fails to provide a good disparity estimation, and then predicts the corresponding adjustment needed to correct the current estimation. The strategy was successfully applied in our previous work on lossless image compression [26,27]; however, in this paper, several design changes are required to refine the initial estimation.
The input patch corresponding to the current pixel position, (i, j), is denoted by X_i,j and contains the neighborhood of the current pixel, extracted from D_msm by selecting all the rows and columns between the current position and a maximum distance of b pixels, as expressed by Equation (5). Here we set b = 15 and generate input patches of size (2b + 1) × (2b + 1) = 31 × 31, as our experiments show that this patch size offers a good performance vs. complexity trade-off. Let us denote the ground truth map as D_gt. For the current pixel position, the residual error of the proposed SM-based method, ε_i,j, is computed as the difference between D_gt and D_msm at (i, j). A natural approach would set the target prediction of the proposed network, denoted by y_i,j, to the residual error ε_i,j. However, the distribution of the residual error shows that, for most of the samples, the target lies within a small range centered at 0. In such a case, the neural network tends to ignore the large-magnitude errors, as not enough samples are available in the corresponding context for the network to adjust its weights. Therefore, we propose to threshold the residual error using T = 0.25 and set the target prediction as y_i,j = −T for large negative errors and y_i,j = T for large positive errors. More exactly, we choose to focus on predicting the large number of small residual errors found in large image areas, since the high-magnitude errors are sparse and their correction would have a small visual impact. Furthermore, we propose to quantize the residual error, ε_i,j, and set the target prediction, y_i,j, to the quantized value, as expressed by Equation (7), where q = T/N_cls is the quantization step and N_cls = 100 is the number of classes generated for the corresponding range. More exactly, we propose to assign each input patch to one of the 2N_cls + 1 = 201 available classes, where all the samples in a class share a single quantized residual error.
Based on this strategy, more samples are allocated to each class to create reliable contexts. One notes that the quantized residual error is rounded towards zero, so that the adjustment will introduce the smallest distortion.
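The thresholding and quantization of the target prediction can be sketched as follows; the rounding towards zero follows the text, while the exact form of Equation (7) is not reproduced, so this is a sketch consistent with the description:

```python
import numpy as np

T = 0.25        # residual-error threshold
N_CLS = 100     # number of classes per sign
Q = T / N_CLS   # quantization step q = T / N_cls

def target_prediction(eps):
    """Clip the residual error to [-T, T] and quantize it towards zero,
    yielding one of 2*N_CLS + 1 = 201 quantized target values."""
    eps = np.clip(eps, -T, T)
    return np.trunc(eps / Q) * Q
```

Small residuals near zero collapse to the zero class, while any residual beyond ±T saturates at ±T, matching the class structure described above.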
In conclusion, the proposed neural network architecture (presented below) is trained based on the input patches X_i,j, extracted using Equation (5), and the target prediction y_i,j, set using Equation (7), and computes the predicted residual error, denoted by ε̂(i, j). Moreover, our tests have shown that a small improvement is obtained by quantizing ε̂(i, j). Hence, ε̂(i, j) is further processed using Equation (7) to obtain the quantized prediction, ε̃(i, j).
Finally, the adjusted CNN map, D_cnn, is computed by adding the quantized residual prediction to the initial estimation, i.e., D_cnn(i, j) = D_msm(i, j) + ε̃(i, j).

Proposed Neural Network Design
The proposed architecture is called Depth Residual Error estimation Convolutional Neural Network (DRE-CNN). Figure 6 depicts the DRE-CNN architecture design built based on two types of layer structure, the Convolution Block (CB) and the 2-branch Convolutional Block (2bCB), depicted in Figure 7a,b, respectively. The CB block contains the following layer sequence: (i) a 2D convolution (2D Conv) layer with a 3 × 3 kernel; (ii) a batch normalization (BN) layer [44]; and (iii) a rectified linear unit activation function layer (ReLU). Please note that the parameters of the 2D Conv layers are set using the following notation "[N, k, s]", where N denotes the number of channels, k denotes the k × k kernel size, and s denotes the stride. The default stride is s = (1, 1) and it is omitted in the DRE-CNN design, while the stride s = (2, 2) is denoted by s = /2.
The 2bCB block was inspired by the ResLB block proposed in our previous work [27]. 2bCB follows a two-branch strategy where "branch 1" processes the input patch using one convolutional layer, while "branch 2" processes the input patch using a sequence of two convolutional layers. Compared to the ResLB block [27], the 2bCB block proposes the following modifications: (1) it introduces two Dropout layers [45] with a probability of 0.2 for setting the input samples to zero, one placed after the ReLU activation layer of the first 2D Conv layer in "branch 2", and one placed at the end of the 2bCB layer structure; and (2) it replaces the addition layer with a concatenation (Concat) layer, halving the number of channels in the 2D Conv layers before concatenation and reducing the network complexity.
The proposed DRE-CNN architecture contains one CB block, seven 2bCB blocks, and one dense layer, also known as a fully connected layer. The CB block is equipped with N channels to process the input patch of size 31 × 31, and extracts the initial features. The following two 2bCB blocks are employed to further process the patches, while the remaining five 2bCB blocks are employed to reduce the patch resolution from 31 × 31 to 1 × 1 by employing a stride s = /2. Please note that the DRE-CNN design follows the general rule of doubling the number of channels whenever the patch resolution is halved. The last 2bCB block contains 2D Conv layers with kernels of size 2 × 2. The last layer in the DRE-CNN architecture is a dense layer with a single output, through which the network estimates the final residual error, ε̂(i, j). In this paper, we set N = 16 and train models containing around 2.3 million (M) parameters. One notes that DRE-CNN takes advantage of both the BN and Dropout concepts proposed in the literature to avoid overfitting.
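A PyTorch sketch of the 2bCB layer structure, assuming a Conv-BN-ReLU ordering in both branches; the exact per-block channel and kernel settings of DRE-CNN are given in Figure 6 and are not reproduced here:

```python
import torch
import torch.nn as nn

class TwoBranchCB(nn.Module):
    """Sketch of the 2bCB block: two branches, each producing half the output
    channels, concatenated and followed by the closing Dropout layer."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        half = out_ch // 2
        # branch 1: a single CB block (Conv-BN-ReLU)
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, half, 3, stride=stride, padding=1),
            nn.BatchNorm2d(half), nn.ReLU())
        # branch 2: two CB blocks with a Dropout after the first ReLU
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, half, 3, stride=stride, padding=1),
            nn.BatchNorm2d(half), nn.ReLU(), nn.Dropout(0.2),
            nn.Conv2d(half, half, 3, padding=1),
            nn.BatchNorm2d(half), nn.ReLU())
        self.drop = nn.Dropout(0.2)  # dropout closing the 2bCB structure

    def forward(self, x):
        # concatenation replaces the addition layer of the ResLB block
        return self.drop(torch.cat([self.branch1(x), self.branch2(x)], dim=1))
```

With stride s = /2, a 31 × 31 input is reduced to 16 × 16 while the channel count doubles, consistent with the design rule stated above.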

Loss Function Formulation
The loss function is computed based on the Mean Squared Error (MSE), and it employs ℓ2 regularization to avoid overfitting. Let Θ denote the set of all DRE-CNN model parameters, where W_i ∈ Θ are the trained weights; X_i is the i-th input patch of size 31 × 31; and y_i is the corresponding target prediction set using Equation (7). Let F(·) be the function which processes X_i using Θ to compute the predicted residual error as ε̂_i = F(X_i, Θ). The loss function is formulated as L = L_MSE + λ · L_L2, where: (a) L_MSE is the loss term computed as the MSE between y_i and ε̂_i, i.e., L_MSE = (1/m) Σ_{i=1}^{m} (y_i − ε̂_i)², where m is the number of samples in the batch; and (b) L_L2 is the ℓ2 regularization term computed over the trained weights W_i ∈ Θ. In this paper, the DRE-CNN models are trained using λ = 10^−2.
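A minimal sketch of this loss, assuming the regularization term is the plain sum of squared weights (any constant factor can be absorbed into λ); the function name is illustrative:

```python
import numpy as np

def drecnn_loss(y, eps_hat, weights, lam=1e-2):
    """MSE data term plus l2 weight regularization, as described in the text.

    y, eps_hat: arrays of targets and predicted residuals for one batch;
    weights: iterable of weight arrays; lam: regularization factor (10^-2).
    """
    l_mse = np.mean((y - eps_hat) ** 2)
    l_l2 = sum(np.sum(w ** 2) for w in weights)
    return l_mse + lam * l_l2
```

With perfect predictions and zero weights the loss vanishes, and non-zero weights contribute λ times their squared norm.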

Final Post-Processing
The output of the proposed DL-based algorithm, D_cnn, is then post-processed using the algorithm proposed in our previous work [25], where the estimated disparity is first filtered and then denoised based on a conventional algorithm. Hence, D_cnn is first filtered twice using a mean filter with a 3 × 3 window, where the disparity values outside the [d̃ − τ, d̃ + τ] range are removed from the window, with τ = 1 and d̃ the median value inside the window.
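One pass of this trimmed mean filter can be sketched as follows (the paper applies it twice); the border handling here, leaving the outermost pixels unchanged, is a simplification:

```python
import numpy as np

def trimmed_mean_filter(disp, tau=1.0):
    """3x3 mean filter that discards window values outside
    [median - tau, median + tau] before averaging."""
    out = disp.copy()
    h, w = disp.shape
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            win = disp[i - 1:i + 2, j - 1:j + 2].ravel()
            med = np.median(win)
            kept = win[np.abs(win - med) <= tau]  # reject outliers
            out[i, j] = kept.mean()
    return out
```

A single outlier in an otherwise flat disparity map is rejected by the median test and replaced by the mean of its inliers, which is the intended smoothing behavior.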
Finally, the disparity map is further processed to obtain the final disparity map, D f inal , using the denoising algorithm proposed in [22] and available online. The algorithm uses a directional NonLocal Means (NLM) filter, where the neighborhood regularization term is defined based on the idea that the pixels with similar colors are encouraged to have similar depth values [46]. Therefore, the refinement of the disparity map is guided by the color information found in the corresponding central view (SAI) of the LF image. In the literature, various studies have proven that NLM can be efficiently used for image restoration and denoising [47], and depth map refinement [46,48].

Experimental Setup
The experimental and visual results are shown in comparison with the state-of-the-art depth-estimation methods proposed by: (a) Wang et al. [3]; (b) Williem et al. [32]; (c) Feng et al. [22]; and (d) Schiopu et al. [25]. The results for [3,32] are obtained by running the source codes, which were kindly provided by the authors of these methods. The numerical and visual results for [22] were extracted from the paper. The performance of these state-of-the-art methods is compared with the performance of the Proposed Multi-Stereo Matching method, denoted simply Proposed Stereo and presented in Section 3.1, and the proposed method depicted in Figure 1, denoted simply Proposed Method.
The experimental evaluation is performed on synthetic LF images [43]. The dataset contains 24 LF images, each represented as a grid of 9 × 9 SAIs, belonging to the sub-categories additional, stratified, and training, with available ground truth. Each SAI has a resolution of 512 × 512 pixels.
The results of Proposed Method are obtained by first training a DRE-CNN model on a set of LF images, called the Training Set, and then running inference on the remaining LF images, called the Test Set. In [22], the following training configuration is proposed: a Training Set of 19 LF images and a Test Set containing the following LF images: town, pillows, medieval2, kitchen, and dots. Please note that 512 × 512 = 262,144 samples are extracted from each LF image, e.g., for the configuration of [22], a total of 19 × 262,144 = 4,980,736 samples are extracted for training. In [25], other training configurations were proposed by randomly collecting a Training Set of 20 LF images and a Test Set of the four remaining LF images, i.e., 5,242,880 samples are extracted for training. The Training Set is further divided into 15 LF images for model training and 4 LF images for model validation in the configuration from [22], and 5 LF images for model validation in the configurations from [25]. Here, the configurations C2 and C3 from [25] are selected to prove the robustness of Proposed Method. Therefore, three DRE-CNN models were trained, i.e., one for each training configuration. The Adam optimization algorithm [49] is applied, as it is known as an improved procedure for adjusting the learning rate. Each model is trained for 40 epochs, using a learning rate of 2 × 10^−4 and a batch size of 512 samples.
The performance is measured based on the MAE and RMSE metrics (where smaller values mark better results) and the structural similarity index measure (SSIM) [50] (where larger values mark better results), computed between the ground truth and the estimated disparity map. Moreover, we introduce the Relative Gain (RG) (%) metric, which measures the average improvement of Proposed Method relative to the average result of a method M. Table 1 shows the depth-estimation results obtained for the training configuration of [22] for: the two traditional methods, [3,32]; the two DL-based methods, [22,25]; and the two proposed methods, Proposed Stereo and Proposed Method, where the bold values mark the best result. Please note that the conventional methods have different pre-processing and post-processing steps, including cost volume computation, to produce the best possible results; therefore, only the final result is shown here for convenience. The average results show that Proposed Method provides much better results compared to all other methods. One can note that, based on the RMSE metric, Proposed Method achieves an improved performance of 15.65% compared with [25] and 14.97% compared with Proposed Stereo. Based on the MAE metric, Proposed Method achieves an improved performance of 43.62% compared with [25] and 22.69% compared with Proposed Stereo, which proves the efficiency of the proposed DL-based algorithm for residual error prediction. Moreover, based on the SSIM index, Proposed Method achieves an improved performance of 12.12% compared with the state-of-the-art method based on conventional techniques [32], and of 5.03% compared with the state-of-the-art method based on ML techniques [25]. Figure 8 shows the visual results for the four state-of-the-art methods [3,22,25,32] and the two proposed methods on the five test LF images of the configuration from [22], from left to right: town, pillows, medieval2, kitchen, and dots. (1st row) RGB images [43]. (2nd row) Results for Wang et al. [3]. (3rd row) Results for Williem et al. [32]. (4th row) Results for Feng et al. [22]. (5th row) Results for Schiopu et al. [25]. (6th row) Results for Proposed Stereo. (7th row) Results for Proposed Method. (8th row) Ground truth disparity [43].

Experimental Results
One notes that Proposed Method systematically improves the quality of the disparity map estimated by Proposed Stereo. For example, in town (2nd column) the disparity around the church tower and its top is improved and the flat areas are smoothed; in pillows (3rd column) the pillow surface is smoothed; in medieval2 (4th column) the entire disparity map is much sharper; and in dots (5th column) the background noise is removed completely. Moreover, one notes that the results of Proposed Method look visually much better than the results of the other methods. Figure 8 shows the qualitative comparison for the five test LF images presented in Table 1. Because DL-based approaches produce continuous disparity maps at a lower resolution, in contrast to the discrete disparity maps of conventional methods, their results look different from those of conventional methods, e.g., see the results of Schiopu et al. [25]. Tables 2 and 3 show the depth-estimation results, in MAE and SSIM, for training configurations C2 and C3 of [25], respectively, for the methods [3,25,32] and the two proposed methods. One notes that Proposed Method achieves the best overall results. Moreover, the DL-based refinement layer provides an improved performance of 9.77% in MAE and 6.60% in SSIM for configuration C2, and of 15.53% in MAE and 9.77% in SSIM for configuration C3, compared with Proposed Stereo. Figure 9 shows the visual results of Proposed Method for the eight LF images of configurations C2 and C3. One notes that Proposed Method provides: (i) sharp edges, e.g., see pens and dino; (ii) smooth areas, e.g., see dishes; and (iii) the ability to detect specific local features, e.g., see the pens in pens and the toys in dino.
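A minimal sketch of the evaluation metrics used above. The exact Relative Gain formula is not reproduced in this excerpt, so the sign convention below (positive RG means Proposed Method is better on an error metric) is an assumption; SSIM would be computed analogously, e.g., with `skimage.metrics.structural_similarity`.

```python
import numpy as np

def rmse(gt, est):
    """Root mean squared error between ground-truth and estimated disparity."""
    return float(np.sqrt(np.mean((gt - est) ** 2)))

def mae(gt, est):
    """Mean absolute error between ground-truth and estimated disparity."""
    return float(np.mean(np.abs(gt - est)))

def relative_gain(err_method, err_proposed):
    """Assumed RG(%) of Proposed Method over method M for an error metric."""
    return 100.0 * (err_method - err_proposed) / err_method

# Toy 2x2 disparity maps for illustration.
gt  = np.array([[0.0, 1.0], [2.0, 3.0]])
est = np.array([[0.5, 1.0], [2.0, 2.5]])
print(rmse(gt, est))  # ~0.3536
print(mae(gt, est))   # 0.25
```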

Ablation Study
Several design variations were tested in an ablation study focused on finding the best neural network design for residual error prediction. We analyze the impact of the following design decisions and concepts taken into account when building the proposed DRE-CNN architecture: (1) employing the neural network as a classifier instead of a residual error predictor; (2) the efficiency of the 2bCB block compared with the ResLB block [27]; (3) the importance of the quantization step in the generation of the training data; (4) the influence of the input patch size on the method's performance.
The first architecture variation studies the effect of employing the proposed network design as a classifier. More exactly, the DRE-CNN design was slightly modified by changing the number of output classes of the dense layer from 1 to 2N_cls + 1 = 201 and by adding a SoftMax activation function as the last layer in the network, see Figure 6. This change adds 4.45% more parameters (compared with DRE-CNN), needed to train the weights of the extra 200 classes in the last dense layer. In this case, the network is employed to classify the input patch into a class, and the corresponding class index selects the quantized residual error computed by Equation (7). The obtained method is called Classification design.
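The output stage of this variant can be sketched as follows. This is a hedged illustration only: the mapping assumes a uniform quantizer centred at zero with step q, since Equation (7) is not reproduced in this excerpt.

```python
# Classification-design output stage (sketch): the network picks one of
# 2 * N_cls + 1 = 201 classes, and the class index is mapped back to a
# signed quantized residual error. The zero-centred mapping is an assumption.
N_CLS = 100
NUM_CLASSES = 2 * N_CLS + 1   # 201 output classes

def class_to_residual(class_idx: int, q: float) -> float:
    """Map a class index in [0, 2*N_CLS] to a signed quantized residual."""
    return (class_idx - N_CLS) * q

print(NUM_CLASSES)                  # 201
print(class_to_residual(100, 0.1))  # 0.0 (middle class = zero residual)
```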
The second architecture variation studies the efficiency of the 2bCB block compared with the ResLB block [27]. More exactly, the DRE-CNN design was modified by replacing all 2bCB blocks with corresponding ResLB blocks. Please note that in contrast to the concatenation layer in the 2bCB design, the addition layer in the ResLB design increases the number of parameters. This change introduced 173.69% more trainable parameters compared with the proposed DRE-CNN architecture. The obtained method is called ResLB-based design.
The third experiment studies the importance of pre-processing the training samples. More exactly, we propose to employ a smaller quantization step q = T/N_cls in Equation (7), obtained by using N_cls = 1000 instead of N_cls = 100, see Section 3.2.1. Please note that in this case, only the training samples are modified, and no design change is applied to the proposed DRE-CNN architecture. The obtained method is called Reduced quantization step.
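A minimal sketch of this pre-processing step, assuming a uniform quantizer with step q = T/N_cls; the value of T and the rounding rule are assumptions, not taken from the paper.

```python
# Residual-error quantization (sketch): increasing N_cls from 100 to 1000
# shrinks the quantization step q = T / N_cls tenfold, so the training
# targets preserve finer residual detail.
def quantize(residual: float, T: float, n_cls: int) -> float:
    q = T / n_cls                   # quantization step
    return round(residual / q) * q  # assumed uniform rounding quantizer

T = 1.0
print(quantize(0.123, T, 100))   # step 0.01  -> ~0.12
print(quantize(0.123, T, 1000))  # step 0.001 -> ~0.123
```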
The last experiment studies how the input patch size influences the performance of the proposed method. Please note that in all other experiments, we set b = 15 and generate input patches of size 31 × 31. In this experiment, we first propose to set b = 11 and generate input patches of size 23 × 23. One notes that the network architecture remains the same, while the patches are processed at a smaller resolution. This reduces the runtime, as the kernels are applied fewer times. The obtained method is called Reduced patch size (b = 11). Secondly, we propose to reduce the input patch size further, to less than a quarter, by setting b = 7 and generating input patches of size 15 × 15. In this case, the network architecture was slightly modified by removing the processing block 2bCB_7, as shown in Figure 6, which further reduces the inference runtime. The obtained method is called Quarter patch size (b = 7). Table 4 shows the results for the two proposed methods and the experiments presented above. One notes that: (i) all DRE-CNN variations are still able to improve the performance of Proposed Stereo; (ii) DRE-CNN operates better as a predictor than as a classifier; (iii) the proposed 2bCB block structure provides an important performance gain at a low complexity; (iv) training-data pre-processing plays an important role in network training; (v) the inference runtime can be reduced by employing a smaller input patch, at the cost of decreased performance.
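The patch sizes above follow from the half-window b, with patches of size (2b + 1) × (2b + 1):

```python
# Check of the patch sizes quoted above, derived from the half-window b.
def patch_size(b: int) -> int:
    return 2 * b + 1

print(patch_size(15))  # 31 (default)
print(patch_size(11))  # 23 (Reduced patch size)
print(patch_size(7))   # 15 (Quarter patch size)
# Area ratio of the 15x15 patch to the 31x31 patch: less than a quarter.
print(15 * 15 / (31 * 31))
```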

Time Complexity
The proposed stereo estimation method requires around 9.45 min to process a stereo pair with the largest baseline; the runtime decreases for narrower baselines. Our experiments show that the proposed method can provide a good initial estimation with only 8 stereo pairs: 4 on the corners, 2 on the same row, and 2 on the same column. In this case, the performance of the proposed approach drops by only around 4%.
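The reduced 8-pair configuration can be sketched as follows. This is an illustration under assumptions: a 9 × 9 grid of SAIs with the central view at (4, 4); the grid size and exact SAI coordinates are not stated in this excerpt.

```python
# Sketch of the reduced 8-pair configuration: the central SAI is paired with
# the four corner SAIs, the two extremes of its row, and the two extremes of
# its column (assumed 9x9 angular grid).
U = 9                      # assumed angular resolution (U x U SAIs)
c = U // 2                 # central SAI index
corners  = [(0, 0), (0, U - 1), (U - 1, 0), (U - 1, U - 1)]
same_row = [(c, 0), (c, U - 1)]
same_col = [(0, c), (U - 1, c)]
pairs = [((c, c), sai) for sai in corners + same_row + same_col]
print(len(pairs))  # 8 stereo pairs
```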
The proposed neural network is implemented in Python using the Keras open-source deep-learning library and runs on machines equipped with Titan Xp Graphics Processing Units (GPUs). Table 4 shows the inference time for computing the refined estimation, D_cnn, for the different experiments proposed in the ablation study. The proposed DRE-CNN network requires around 23 h to train one model, and an average inference time of 47 ms per batch of 512 input patches. Therefore, for each LF image, which yields 512 batches, a total time of 512 × 0.047 s = 24.064 s is required to apply the CNN model. The experiments show that by halving the input patch resolution, the inference time can be reduced by a factor of around 3.34, while the performance drops by around 3.18% in RMSE, 12.35% in MAE, and 0.22% in SSIM.
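The per-image inference time follows from the figures above; a quick check:

```python
# Check of the inference-time figures quoted above.
patches_per_image = 512 * 512   # 262,144 input patches per LF image
batch_size = 512
time_per_batch = 0.047          # 47 ms per batch of 512 patches

num_batches = patches_per_image // batch_size   # 512 batches per image
total_time = num_batches * time_per_batch       # seconds per LF image
print(num_batches)
print(round(total_time, 3))  # 24.064 s
```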

Conclusions
The paper proposed a novel depth-estimation method for LF images, which combines multi-stereo matching and ML techniques. A novel block-based stereo matching method is proposed to compute the initial disparity estimation by operating on any pair of SAIs in the LF image. A novel DL-based method for residual error prediction is proposed to refine the initial estimation. A novel neural network architecture, DRE-CNN, is designed based on a more efficient layer structure, 2bCB. Experimental results on synthetic LF data demonstrate that the proposed framework quantitatively and qualitatively outperforms existing state-of-the-art methods for depth estimation.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results. All authors read and approved the final manuscript.