Efficiency–Accuracy Trade-Off in Light Field Estimation with Cost Volume Construction and Aggregation

The rich spatial and angular information in light field images enables accurate depth estimation, which is a crucial aspect of environmental perception. However, the abundance of light field information also leads to high computational costs and memory pressure. Selectively pruning some of the light field information can significantly improve computational efficiency, but at the expense of depth estimation accuracy, especially in low-texture regions and occluded areas where angular diversity is reduced. In this study, we propose a lightweight disparity estimation model that balances speed and accuracy and enhances depth estimation in textureless regions. We combine cost-matching methods based on absolute difference and correlation to construct cost volumes, improving both accuracy and robustness. Additionally, we develop a multi-scale disparity cost fusion architecture, employing 3D convolutions and a UNet-like structure to handle matching costs at different depth scales. This method effectively integrates information across scales, utilizing the UNet structure for efficient fusion and completion of cost volumes, thus yielding more precise depth maps. Extensive testing shows that our method achieves computational efficiency on par with the most efficient existing methods, yet with double the accuracy. Moreover, our approach achieves accuracy comparable to the current highest-accuracy methods but with an order-of-magnitude improvement in computational performance.


Introduction
Light field imaging is a promising tool for constructing 3D environments. Unlike traditional imaging, light field imaging captures a richer array of information, describing the distribution of light rays in three-dimensional space. It has applications in virtual reality [1,2], view synthesis [3], 3D reconstruction [4], and autonomous driving [5,6]. Depth estimation is a fundamental and critical step in these research areas. However, these applications require not only high accuracy but also rapid generation speeds, so depth estimation must simultaneously meet the demands of both estimation accuracy and computational efficiency.
In recent years, deep learning-based methods [6][7][8][9][10][11] have achieved significant advances and demonstrated considerable potential in light field depth estimation. These approaches typically utilize all available light field image information, effectively improving the accuracy of depth estimation. However, the abundance of light field information leads to a substantial increase in network memory consumption and computational load. Many researchers have therefore tried to improve computational speed by pruning redundant light field images, which indeed achieves good results. However, reducing the input information inevitably lowers accuracy and the perception of occlusions and textured regions. To address this issue, state-of-the-art methods [10] have utilized full correlation to construct matching cost volumes, replacing 3D convolution operations with 2D convolutions to further reduce the computational load while incorporating multi-scale aggregation to boost accuracy. Nonetheless, full correlation tends to lose critical depth features and performs poorly in textureless areas. There are also many in-depth studies on the perception of textureless regions [12][13][14]. In summary, although 2D convolutions reduce the data volume, they lose significant spatial depth information compared with 3D convolutions, rendering the network less effective in handling complex spatial scenes.
To overcome these limitations, we propose a more effective cost volume construction method and cost aggregation architecture. Firstly, we replace full correlation with grouped correlation to enrich feature-matching information. Secondly, we introduce feature dissimilarity operations to compensate for the shortcomings of feature correlation in textureless areas. Lastly, we present an architecture integrating Hourglass modules with 3D convolutions for multi-scale disparity fusion. This network structure more effectively captures multi-scale spatial features, and the use of 4D feature volumes further enhances the model's ability to detect texture details, thereby improving both performance and accuracy in depth estimation. Our performance is shown in Figure 1.

Related Work
In this section, we review the main traditional and deep learning approaches to light field depth estimation.

Traditional Methods
The concept of EPIs was first introduced by Bolles et al., who employed light field (LF) epipolar geometry to calculate line slopes for depth prediction. Wanner et al. [28] then incorporated EPIs into light field depth estimation, using only the horizontal- and vertical-direction EPIs of light field images and optimizing the results with a global consistency labeling algorithm. The EPI method significantly accelerated disparity estimation. Zhang et al. [29] further proposed the spinning parallelogram operator (SPO) to compute line slopes in EPIs, enhancing the accuracy of disparity estimation. Sheng et al. [30] used multi-directional EPIs to improve slope estimation accuracy, achieving results surpassing the SPO. Heber et al. [21] developed a principal-component-analysis matching term for multi-view stereo reconstruction, combined with the projection of sub-aperture images (SAIs), for depth estimation. Jeon et al. [22] introduced a Fourier-transform-based phase-shift theory to address the small disparities between SAIs. In the refocusing approach, Tao et al. [24] combined defocus cues with consistency cues in light field images for depth map estimation, though the method performs poorly in occluded areas. Tao et al. [26] proposed a shading-based refinement method to enhance the robustness of depth map estimation.
While these traditional methods have continuously improved the accuracy and computational efficiency of light field disparity estimation, they are limited by nonlinear optimization and manually designed features. They demand extensive computational resources and perform poorly in occluded and weakly textured areas, leaving substantial room for improvement in both accuracy and computational performance.

Deep Learning Methods
In recent years, the use of deep convolutional neural networks (CNNs) for light field depth estimation has achieved impressive results. Focusing on disparity estimation accuracy, Tsai et al. [31] introduced an attention-based view selection module that integrates the importance of each view into depth estimation, significantly enhancing robustness against noise interference. Building on this, Chen et al. [8] combined attention mechanisms with multi-level fusion networks, using fusion between different angular branches to further enhance disparity estimation accuracy. Most recently, Yang et al. [32] integrated local and global features within view-feature cost volumes to address occlusions and textureless regions, further improving disparity estimation accuracy. However, owing to their use of redundant information and extensive 3D convolution operations, these methods tend to be slow in generating disparity maps.
In another direction, Heber et al. [33] proposed a U-shaped neural network, initially operating on EPIs, to extract geometric information from light fields for depth estimation. Subsequently, Shin et al. [34] used CNNs to extract geometric disparities from EPIs and proposed a fully convolutional end-to-end network. Further, Huang [10] designed a disparity estimation model that replaced 3D convolutions with 2D convolutions, significantly reducing the number of learnable parameters and enhancing computational performance. These methods, which generate disparity maps from a lower proportion of the light field images as input, offer good computational performance but are limited in disparity estimation accuracy and in robustness against real-world noise, especially compared with methods that use all views as input.
In summary, previous research has already shown the advantages of deep neural networks in light field depth estimation. However, there has been insufficient focus on balancing accuracy and computational performance, often forcing a trade-off when generating disparity images. In this paper, we propose a lightweight convolutional neural network that employs multi-disparity cost aggregation. This network extracts richer depth information from less input data and achieves a balance between computational load and depth estimation accuracy.

Method
In this paper, we introduce a novel method that balances efficiency and accuracy in light field depth estimation. The overall architecture of the network is depicted in Figure 2. Initially, acknowledging the redundancy in light field images, we use only the sub-aperture images (SAIs) from the horizontal and vertical directions as inputs, reducing computational costs as much as possible. A shared feature extraction module is then employed to extract SAI features (Section 3.1). Following this, we construct cost volumes from the pixel-shifted features of the surrounding views and the central-view features. In this process, we propose a hybrid cost volume network to enhance detail perception (Section 3.2). Finally, a multi-scale disparity cost aggregation module synthesizes the mixed cost depth information, which is then processed by a disparity regression module to predict the disparity map (Section 3.3).

Feature Extraction
The extraction of effective features is crucial to estimating the disparity map, particularly because the small disparity range in light fields complicates accurate estimation in low-texture and occluded areas. To address this, we use multiple basic residual blocks for preliminary feature extraction, applying stride-2 convolutions for downsampling. The feature maps are downsampled at two different scales and subsequently restored to their original scale through bilinear interpolation. These features at different levels are concatenated and fused via convolution, and an SENet [35] module is added to strengthen the weighting of key feature channels. The resulting feature map serves as the input to our hybrid cost volume network.
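To make the channel re-weighting step concrete, here is a minimal NumPy sketch of an SENet-style squeeze-and-excitation operation. The surrounding residual blocks, stride-2 downsampling, and bilinear upsampling are omitted, and the function name and weight shapes are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def se_attention(features, w1, w2):
    """SENet-style channel re-weighting (squeeze-and-excitation).

    features: (C, H, W) feature map.
    w1: (C//r, C) and w2: (C, C//r) are the two fully connected
    layers of the excitation MLP (r is the reduction ratio).
    """
    squeeze = features.mean(axis=(1, 2))            # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)          # FC + ReLU
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # FC + sigmoid -> per-channel weight
    return features * scale[:, None, None]          # re-weight each channel
```

In the full model this operation would sit after the multi-scale concatenation, letting the network emphasize the feature channels most useful for matching.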

Texture-Aware Cost Volume
After extracting the features of the SAIs, a 4D cost volume is constructed to predict the disparity map by establishing a correspondence between shifted surrounding-view features and the central-view feature. We use two-parallel-plane parameterization to represent the four-dimensional light field, L(x, u), where x and u are the spatial and angular coordinates, respectively. I_c(x, u_c) is the central view, d(x, u_c) denotes the disparity of the light field, and u_c denotes the central view position. According to LF geometry, given a surrounding SAI I_s(x, u), the reconstructed central view, Ĩ_c(x, u), can be expressed as

Ĩ_c(x, u) = I_s(x + (u − u_c) · d(x, u_c), u).

By using this equation, we can reconstruct the central view, Ĩ_c(x, u), from the surrounding views, I_s(x, u), where x + (u − u_c) · d(x, u_c) accounts for the displacement caused by the disparity.
Typically, the full-correlation cost volume is obtained by applying correlation operations [10,36] between the warped features of the surrounding views and the central view to regress the disparity map. However, relying solely on full correlation can result in the loss of significant information. To further reduce the computational load, we group features to compress the matching cost volume. The number of channels of a unary feature is denoted by N_c, and the channels are uniformly divided into N_g groups along the channel dimension, so each group has N_c/N_g channels. Correlation is then computed for each group. The correlation between the surrounding-view features, F_s^g, and the central-view features, F_c^g, is represented as follows:

C_corr^g(d, x) = (N_g/N_c) · ⟨F_s^g(x + (u − u_c) · d, u), F_c^g(x)⟩,

where ⟨•, •⟩ represents the inner product of two features and C_corr^g is the correlation cost volume for feature group g and disparity d. However, because feature values are low in textureless areas, the multiplication in the correlation operation yields only a small variance between the feature costs at the correct and incorrect depths; this is easily disturbed by noise and leads to incorrect depth estimation. To address this issue, we introduce a second set of cost volumes, constructed from the sum of absolute differences between feature views, which effectively widens the variance range of depth-related feature costs and enhances the network's perception of the correct depth in textureless areas. The dissimilarity cost volume between feature-view pixels is represented as follows:

C_diff^g(d, x) = Σ |F_s^g(x + (u − u_c) · d, u) − F_c^g(x)|,

where the sum runs over the N_c/N_g channels of group g. Furthermore, the cost volumes are stacked along the depth direction as 4D arrays (G × D × H × W) and then concatenated so that volumes with the same disparity scale are connected, forming the initial cost volume (4G × D × H × W). Here, G denotes the number of groups, D the number of disparity layers, and H and W the dimensions of the input image. It is important to note that the relationship of disparity scales at different
angular distances is proportional to the distance (d). Here, we define the maximum disparity of the innermost view as d_max. Since disparity estimation requires multiple downsampling operations, we set the total number of disparity layers to be even, with a disparity range of [−d_max, 1 + d_max] and 2 + 2d_max disparity levels. Similarly, the disparity range of the outermost view is [−4d_max, 1 + 4d_max], with 2 + 4d_max disparity levels. The disparity level refers to the number of discrete disparities within the interval from the minimum to the maximum disparity.
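As an illustration of the two matching costs described above, the following NumPy sketch computes a group-wise correlation and a group-wise sum of absolute differences for a single surrounding view and a single disparity hypothesis. It is a toy stand-in: the full pipeline warps every surrounding SAI over all disparity levels and stacks the per-group costs into 4D volumes, and the integer-shift warp, names, and shapes here are simplifications:

```python
import numpy as np

def build_cost_volumes(center, side, shift, num_groups):
    """Toy group-wise correlation and dissimilarity costs.

    center, side: (C, H, W) feature maps of the central and one
    surrounding view; `shift` is the pixel displacement implied by
    one disparity hypothesis, applied along the width axis.
    """
    warped = np.roll(side, shift, axis=2)   # warp the surrounding view by the hypothesis
    g_c = center.reshape(num_groups, -1, *center.shape[1:])   # (G, C/G, H, W)
    g_s = warped.reshape(num_groups, -1, *warped.shape[1:])
    # Group-wise correlation: mean inner product over each channel group.
    corr = (g_c * g_s).mean(axis=1)                           # (G, H, W)
    # Group-wise sum of absolute differences (the dissimilarity cost).
    sad = np.abs(g_c - g_s).sum(axis=1)                       # (G, H, W)
    return corr, sad
```

At the disparity hypothesis that matches the true depth, the warped features align with the central features, so the correlation cost peaks while the dissimilarity cost approaches zero; the two signals are complementary, which is why both are retained.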
Finally, through the network we propose, the correlation cost volume and the differentiation cost volume are fused, as illustrated in the CVCM module shown in the lower-right corner of Figure 2. The two cost volumes are processed separately through 3D convolution and a 3D channel attention mechanism to extract matching information and are then combined to form the final cost volume.
During the construction of the cost volume, we employ a grouping method to compress the most memory-intensive part, reducing the computational load and memory consumption. Additionally, we propose a correlation-and-dissimilarity fusion structure to enhance perception and accuracy in textureless regions.

Multi-View Cost Volume Aggregation and Disparity Regression
To fuse cost volumes of different scales for disparity estimation and enhance the model's performance in occluded areas, we propose a multi-level fusion strategy. Given that spatially occluded areas in the central view are visible from other directional viewpoints, we employ a U-Net-like structure, featuring upsampling and downsampling along with Hourglass modules, to extract useful features from unoccluded areas. Furthermore, to achieve better depth estimation accuracy, we designed a multi-level-scale disparity fusion structure to enhance feature robustness. As shown in Figure 3, the first-layer input is the maximum-disparity cost volume, which captures the broadest spatial information and provides a comprehensive initial perspective for depth estimation. Information from different disparity scales is integrated from largest to smallest, yielding a multi-level feature fusion strategy that transitions from local to global and back to local. Finally, each layer's disparity is upsampled to the same size and fused by using three-dimensional convolution. To more effectively capture global and local texture features at each layer's current scale, we embed Hourglass modules [37,38] at every level. These modules not only recognize large-scale structures within the feature cost volumes but also process fine textures and edge details. The specific details of this process are illustrated in Figure 4. This approach ensures a more nuanced and comprehensive analysis of the light field data, significantly enhancing depth estimation accuracy and detail.
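The coarse-to-fine fusion idea can be caricatured in a few lines of NumPy: each coarser cost volume is upsampled to the finest spatial resolution and the volumes are averaged. This is only a stand-in for the learned fusion; the actual module uses 3D convolutions and Hourglass blocks rather than averaging, and the function below assumes all volumes share the same number of disparity layers and that the resolutions divide evenly:

```python
import numpy as np

def fuse_multiscale_costs(cost_volumes):
    """Fuse (D, H, W) cost volumes, coarsest first, by nearest-neighbour
    upsampling to the finest resolution and averaging."""
    target_h, target_w = cost_volumes[-1].shape[-2:]
    fused = np.zeros_like(cost_volumes[-1], dtype=float)
    for cv in cost_volumes:
        fh = target_h // cv.shape[-2]
        fw = target_w // cv.shape[-1]
        # Nearest-neighbour upsampling via repeat along H and W.
        fused += cv.repeat(fh, axis=-2).repeat(fw, axis=-1)
    return fused / len(cost_volumes)
```

The paper's architecture replaces both the upsampling and the averaging with learned operators, but the data flow, i.e. bringing every disparity scale to a common resolution before combining them, is the same.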
After obtaining the final cost volume, each pixel is represented by a vector containing the probabilities of all disparity levels. We use the softmax-based disparity regression introduced in [39] to generate a continuous distribution of disparity predictions. The predicted disparity value, d̂, is defined as

d̂ = Σ_{d_i = D_min}^{D_max} d_i × softmax(−C_f(d_i)),

where d̂ is the disparity predicted for the pixel; D_min and D_max denote the minimum and maximum disparities of the outermost view, respectively; and C_f(d_i) is the predicted cost at disparity d_i.
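A minimal NumPy sketch of this soft-argmin regression follows; negating the cost before the softmax converts low matching cost into high probability, as in [39], and the names and shapes are illustrative:

```python
import numpy as np

def regress_disparity(cost, disparities):
    """Soft-argmin disparity regression.

    cost: (D, H, W) aggregated matching cost.
    disparities: (D,) candidate disparity levels.
    Returns the per-pixel probability-weighted mean disparity, (H, W).
    """
    logits = -cost                                   # low cost -> high probability
    logits -= logits.max(axis=0, keepdims=True)      # subtract max for numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)          # softmax over the disparity axis
    return (prob * disparities[:, None, None]).sum(axis=0)
```

Because the weighted mean is differentiable, the whole network can be trained end to end, and the prediction is not restricted to the discrete disparity levels.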

Experiments
In this section, we introduce the details and results of our implementation and compare our method with several state-of-the-art light field depth estimation methods.

Dataset and Implementation Details
The 4D light field dataset [40] is widely regarded as a benchmark for evaluating light field image disparity estimation methods. This dataset, rendered with Blender, includes 28 densely arranged synthetic light field scenes and their corresponding ground-truth disparity maps, divided into four subsets: "Stratified", "Test", "Training", and "Additional". These scenes incorporate a mix of various materials, lighting conditions, and complex spatial occlusions. All light field data have a 9 × 9 angular resolution and a 512 × 512 image resolution.
We utilized the "Additional" category of the dataset to train our model. During training, we randomly cropped the SAIs into 48 × 48 grayscale patches and applied various image augmentation techniques, including random rotation, brightness and contrast adjustments, and noise injection. Inspired by Zhao [41], we adopted a joint L1 and SSIM [42] loss function, denoted by L, and optimized it by using the Adam method [43]. The loss function is formulated as follows:

L = α · (1/M) Σ |d̂ − d| + (1 − α) · (1 − SSIM(d̂, d)),

where d̂ is the predicted disparity map, d is the ground-truth disparity map, and M denotes the number of pixels, with α set to 0.9. We tested weights α ranging from 0.1 to 0.9 and found that depth estimation accuracy improves as the weight increases.
The proposed method was implemented on the PyTorch platform [44] and optimized by using the Adam optimizer [43] (β1 = 0.9, β2 = 0.999). The batch size was set to 16, and the initial learning rate was 10^−3, decaying by a factor of 0.8 every 100 epochs. The total training comprised 1000 epochs. Our model was trained on a PC equipped with an Nvidia 6000× GPU, requiring approximately three days.
To demonstrate the accuracy of our method, we compared its performance with that of other state-of-the-art methods on the "Stratified" and "Training" categories of the 4D LF data in terms of bad pixel rate (BadPix 0.07) and mean squared error (MSE). The comparative results are reported in Table 1, showing that our method achieved good overall results compared with the other methods. For the performance comparison, to ensure fairness, we executed these methods on the same platform and compared the average running time on these scenes. The results, shown in Table 2, indicate that our method outperforms the other methods, except for the Fast method. Additionally, while our method's accuracy is second only to LFAtt [31], it computes faster. The traditional methods CAE [25] and SPO [29] were run on a CPU platform configured with an Intel i7-10850H.

Method            BadPix (0.07)   Runtime (s)
LFAtt [31]        2.836           5.913
Fast [10]         6.792           0.601
Distrib [46]      22.15           4.830
EPI-m [34]        3.383           10.65
SOA-Net [48]      7.170           74.30
LF_OCC [45]       4.021           21.00
UnsuperNet [47]   10.19           1.079
Ours              3.203           0.693

The data for SOA-Net [48] and UnsuperNet [47] are cited from reference [49]. The best result is shown in deep blue and the second best in orange.
In addition, we subjectively compared the performance of our method with that of other methods in textureless and occluded areas. Figure 5 displays visual comparison results across four scenes: "Dish", "Dots", "Rosemary", and "Origami". The depth estimation results and errors for the first three scenes show that our method performs well in spatial areas with occluded edges, comparable to the best available methods [31]. Additionally, in the "Origami" scene, our method achieves better accuracy in the marked textureless areas. Overall, the results demonstrate that our method exhibits superior performance and robustness in handling textureless and occluded areas.
To comprehensively evaluate the performance of our method, we also used real-world datasets for testing and comparison with state-of-the-art methods. As illustrated in Figure 6, the depth maps generated by our method are more consistent and exhibit less noise. This indicates that our approach generalizes effectively to real LF depth estimation. The scenes "Bench" and "Leaf" were captured with our Lytro Illum camera, and "Knights" was captured with the gantry setup from the Stanford Light Field Archive.

Ablation Study
We conducted extensive ablation experiments to analyze the effectiveness of our method. Our ablation study covers the trade-off between performance and efficiency, the choice of disparity cost operations, and the combination of loss functions.

Disparity Cost Calculation
The disparity cost has a significant impact on depth estimation accuracy, so it is crucial to choose appropriate cost generation operations. We conducted separate tests with and without the feature dissimilarity operation on the HCI 4D LF benchmark. When we removed the feature dissimilarity operation, the network's performance in terms of both MSE and BadPix (0.07) deteriorated.

Computational Cost
To validate the impact of the light field input distribution and the number of grouped aggregation channels on the computational performance and effectiveness of our network, we used three different combinations of horizontal, vertical, and diagonal distributions as network inputs, as well as varying numbers of cost aggregation groups as intermediate variables. The results are shown in Table 3. The number of image inputs has a significant impact on network performance: augmenting the horizontal-and-vertical input distribution with additional views yields a small gain in accuracy but a significant drop in speed. As indicated in the fourth column of the table, the optimal number of grouped aggregation channels is four, and the number of groups has little impact on network performance. Overall, our method achieves optimal network performance with the chosen input data distribution and number of groups.

Effectiveness of Cost Aggregation Network
To verify the role of the cost aggregation network within the overall architecture, we constructed a baseline cost aggregation network by using the ResNet structure [50], as depicted in Figure 4, as a benchmark module for comparison. Considering memory limitations, the input cost volumes were uniformly resampled to the same dimensions, B × 12 × 2D × H × W, before aggregation, followed by two 3D convolutions and eight 3D convolutions within the ResNet structure, and then depth regression. We trained for a total of 500 epochs. The results on the HCI light field benchmark dataset are shown in Table 4. The depth estimation accuracy achieved by our cost aggregation network is higher than that of the comparison experiment, indicating that our proposed multi-scale cost aggregation module plays a significant role in improving accuracy.
Overall, our ablation study validates the approach we proposed, demonstrating that each modular component of our model makes a valuable contribution to the final results.

Discussions and Conclusions
Despite achieving commendable results in both disparity estimation accuracy and computational performance, our method still has certain limitations. First, our approach relies heavily on high-quality light field data, and its robustness to distortion and noise in light field images is weak, especially when the amount of input data is limited. Consequently, the performance of our method might decrease when applied to real-world light field data. In the future, we could explore integrating specially designed network modules to mitigate the impact of distortion.
Secondly, our ablation studies reveal that the total amount of input data significantly affects computational performance. While optimizing the cost aggregation can reduce the number of parameters and thus computation time, it does not substantially affect overall performance. At the same time, the quality of cost volume construction directly influences depth estimation accuracy in challenging areas. Future work could therefore focus on exploring better input structures and cost construction methods to balance computational performance and accuracy.
In this paper, we propose an end-to-end network architecture that trades off computational performance and depth estimation accuracy. Our feature dissimilarity cost construction effectively compensates for the shortcomings of feature correlation, enhancing network accuracy in textureless areas. Moreover, our multi-scale cost aggregation architecture significantly improves depth estimation accuracy while maintaining good computational performance. Overall, compared with state-of-the-art methods, our approach achieves the best trade-off between computational performance and accuracy, as demonstrated on the broad HCI benchmark set and on a real-world light field dataset.

Figure 1 .
Figure 1. Comparison of the accuracy and computational performance of light field disparity estimation algorithms.

Figure 2 .
Figure 2. An overview of our network. The term "MCAM" denotes the multi-view cost volume aggregation module; its specific structure is detailed in Figure 3.

Figure 3 .
Figure 3. Multi-View Cost Aggregation Module architecture. Note that the feature size annotations in the diagram omit the spatial dimensions (H × W) of the input light field images.

Figure 5 .
Figure 5. Qualitative comparison of the performance of different methods on the HCI light field benchmark. (a-g) show the results for CAE [25], SPO [29], LFAtt [31], Fast [10], Distrib [46], EPI-m [34], and our method, respectively. The first row in each scene shows the estimated disparity corresponding to the original image, and the second row displays the distribution of bad pixels, with red indicating areas where the bad pixel rate exceeds 0.07.
Figure 7 illustrates the influence of feature dissimilarity on depth estimation within the aggregated cost volume. It indicates that adding feature dissimilarity effectively improves the network's performance in weak-texture regions and enhances its robustness. The panels compare results without ("Non-Diff Cost") and with ("With-Diff Cost") the feature dissimilarity term for the "Rosemary" and "Wall" scenes.

Figure 7 .
Figure 7. Disparity maps in synthetic and real scenes. "Wall" is a light field image captured by us, and "Rosemary" is a synthetic light field image.

Table 1 .
Quantitative comparison with other state-of-the-art methods in terms of BadPix (0.07) and MSE on the 4D light field benchmark dataset [40]. The bad pixel ratio at a threshold of 0.07 (BP07) and the MSE (multiplied by 100) are the accuracy metrics; lower scores represent better performance. The best result is shown in deep blue and the second best in orange.

Table 2 .
Quantitative comparison of the average performance and efficiency with state-of-the-art methods on the 4D LF Benchmark.

Table 3 .
Quantitative comparison of results with different inputs and varying numbers of aggregation group channels. Horizontal, Vertical, and Diagonals denote inputs at 0°, 90°, and the two diagonal directions through the central sub-aperture image of the light field array, respectively.

Table 4 .
Depth estimation results of our cost aggregation module and the ResNet benchmark module on the HCI dataset.