Article

FAMNet: A Lightweight Stereo Matching Network for Real-Time Depth Estimation in Autonomous Driving

College of Computer Science, Beijing Information Science & Technology University, Beijing 102206, China
*
Author to whom correspondence should be addressed.
Symmetry 2025, 17(8), 1214; https://doi.org/10.3390/sym17081214
Submission received: 22 June 2025 / Revised: 18 July 2025 / Accepted: 25 July 2025 / Published: 1 August 2025

Abstract

Accurate and efficient stereo matching is fundamental to real-time depth estimation from symmetric stereo cameras in autonomous driving systems. However, existing high-accuracy stereo matching networks typically rely on computationally expensive 3D convolutions, which limit their practicality in real-world environments. In contrast, real-time methods often sacrifice accuracy or generalization capability. To address these challenges, we propose FAMNet (Fusion Attention Multi-Scale Network), a lightweight and generalizable stereo matching framework tailored for real-time depth estimation in autonomous driving applications. FAMNet consists of two novel modules: Fusion Attention-based Cost Volume (FACV) and Multi-scale Attention Aggregation (MAA). FACV constructs a compact yet expressive cost volume by integrating multi-scale correlation, attention-guided feature fusion, and channel reweighting, thereby reducing reliance on heavy 3D convolutions. MAA further enhances disparity estimation by fusing multi-scale contextual cues through pyramid-based aggregation and dual-path attention mechanisms. Extensive experiments on the KITTI 2012 and KITTI 2015 benchmarks demonstrate that FAMNet achieves a favorable trade-off between accuracy, efficiency, and generalization. On KITTI 2015, with the incorporation of FACV and MAA, the prediction accuracy of the baseline model is improved by 37% and 38%, respectively, and a total improvement of 42% is achieved by our final model. These results highlight FAMNet’s potential for practical deployment in resource-constrained autonomous driving systems requiring real-time and reliable depth perception.

1. Introduction

In autonomous driving systems, accurate and efficient stereo depth estimation is essential for enabling real-time 3D understanding of the surrounding environment [1,2,3]. It is achieved by identifying pixel-wise correspondences between rectified left and right views. As illustrated in Figure 1, a pair of symmetric stereo cameras capture the same object at different horizontal positions, and the disparity between corresponding pixels is used to compute object distance via triangulation. This depth information supports a wide range of driving functions such as obstacle avoidance, lane following, and scene understanding.
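As a brief illustration of the triangulation step (standard rectified-stereo geometry; the symbols are introduced here for illustration and are not taken from the paper): for a stereo rig with focal length $f$ and baseline $B$, a point observed with disparity $d$ lies at depth $Z = \frac{f \cdot B}{d}$, so nearby objects produce large disparities while distant objects produce small ones.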
Stereo matching, as the core task of disparity estimation from binocular imagery, plays a critical role in achieving accurate and robust depth estimation in autonomous vehicles. A typical stereo matching framework follows a pipeline of four steps: feature extraction, cost volume construction, cost aggregation, and disparity regression. Recently, advances in deep learning have led to remarkable improvements in stereo matching accuracy, driven by increasingly powerful CNN-based representations. However, many state-of-the-art models rely heavily on 3D convolutional regularization to refine the cost volume, resulting in high computational complexity and latency. This hinders deployment in embedded automotive systems, which must operate in real time under tight resource constraints [4,5,6,7,8].
To address this challenge, several studies have explored more-efficient cost volume construction strategies and lightweight aggregation mechanisms. Methods such as GwcNet [6] and ACVNet [8] demonstrate that well-designed cost volume representations can significantly improve stereo matching accuracy. Nonetheless, these approaches often require complex 3D encoder–decoder architectures, leading to performance bottlenecks. Other works [9,10,11] have investigated multi-scale feature fusion to enhance cost volume expressiveness, but have faced difficulties in balancing accuracy with computational efficiency.
In this paper, we propose FAMNet, a lightweight and generalizable stereo matching network designed specifically for real-time depth estimation in autonomous driving. FAMNet consists of two novel components: the FACV and MAA modules. FACV constructs a compact yet informative cost volume by combining multi-scale correlation, attention-guided fusion, and channel-wise feature reweighting, significantly reducing the reliance on heavy 3D convolutions. MAA further enhances disparity estimation by integrating multi-resolution contextual cues through pyramid-based aggregation and dual-path attention mechanisms. Compared to other networks that benefit from attention mechanisms and multi-scale processing in stereo matching [12,13,14], our FAMNet employs a concise streamlined architecture with lightweight attention mechanisms for efficient yet effective information fusion. Extensive experiments on the KITTI 2012 and 2015 benchmarks demonstrate that these modules enable FAMNet to achieve a favorable trade-off between accuracy, efficiency, and generalization across diverse driving environments, making it suitable for real-time autonomous applications.
The main contributions of this paper can be summarized as follows:
(1)
We propose FACV, which integrates multi-scale correlation, attention-guided fusion, and channel reweighting to construct compact and informative cost volumes with reduced reliance on 3D convolutions.
(2)
We introduce MAA, which combines pyramid-based hierarchical cost aggregation and dual-path attention to enhance disparity estimation with minimal computational cost.
(3)
We develop a lightweight and real-time stereo matching network, FAMNet, to achieve a favorable balance between accuracy, efficiency, and cross-domain generalization, demonstrating its feasibility for practical deployment in autonomous driving scenarios.
This paper is organized as follows: Section 2 describes the related work. Section 3 presents the details of our proposed method. The experimental results are discussed in Section 4. Section 5 summarizes our work and discusses directions for future improvement.

2. Related Work

2.1. End-to-End Stereo Matching Network

In recent years, CNNs have made significant advancements in various vision tasks such as object detection [15,16], depth estimation [9,17], and semantic segmentation [18,19]. With the strong representational power of CNNs, learning-based stereo matching methods achieve impressive performance over traditional methods with manually designed features [4]. DispNetC [7] is the first stereo matching model to predict dense disparity in an end-to-end manner. It builds dense correspondences between pixels in the left and right images by employing a correlation layer. The correlation between left and right features measures similarity, but contextual information is lost when the channels are squeezed. GCNet [4] retains contextual information by concatenating left and right features at each disparity level and then applies 3D convolutions to regularize the cost volume. It defines a differentiable soft argmin operation to regress a smooth disparity estimate, thereby achieving excellent results. Although the concatenated cost volume contains rich contextual information, it has to learn similarities from scratch. GwcNet [6] constructs the cost volume in a group-wise manner, so that the resulting 4D volume carries both contextual information and similarity measurements. Owing to its lightweight and effective design, group-wise correlation has become a popular way to build cost volumes. PSMNet [5] improves prediction accuracy by stacking multiple hourglass aggregation subnetworks with a large number of 3D convolutions. However, these resource-costly 3D convolutions lead to significant latency. GANet [7] comprises a semi-global guided aggregation module and a local guided aggregation module to replace the widely used 3D convolutional layers. Zeng et al. [20] propose hysteresis attention, inspired by the hysteresis comparator in electronic circuits, to improve the representation capability of the feature extractor. ACVNet [8] generates weights from correlation cues to suppress redundant information and enhance matching-related information in the concatenation volume, achieving state-of-the-art results. Recently, refinement through iterative updates has made disparity prediction more accurate. Raft-Stereo [9] introduces multi-level convolutional GRUs to iteratively refine disparity candidates. Building on Raft-Stereo, IGEV [10] constructs a combined geometry-encoding volume that encodes geometry and contextual information and then iteratively indexes it to update the disparity map. Furthermore, IGEV++ [11] collects multi-range geometry information for ill-posed regions and large disparities, and fine-grained geometry information for details and small disparities, and then leverages ConvGRUs to iteratively update the disparity map.

2.2. Real-Time Stereo Matching Network

Besides improving prediction accuracy with increasingly complicated networks, achieving low latency, which is crucial to safety, is a significant factor for depth perception in autonomous driving. Driven by the demand for high-speed and accurate depth estimation in autonomous driving applications, real-time stereo matching has witnessed significant advancements in recent years. StereoNet [21] is the first end-to-end model to realize real-time disparity prediction. It constructs the cost volume using feature maps at a lower resolution to reduce resource consumption, and the coarse prediction is refined by edge-aware upsampling to obtain a better disparity map. StereoNet achieves a high frame rate on high-end GPUs. AANet [22] introduces an adaptive aggregation strategy that dynamically adjusts the receptive field size, enabling both efficiency and competitive accuracy. It replaces all the 3D convolutions with 2D ones and constructs adaptive intra-scale and cross-scale aggregation modules to accelerate the network. AnyNet [23] is a multi-stage network that progressively refines disparity predictions, allowing real-time inference with early-exit capabilities for flexible performance-speed trade-offs. It generates a coarse disparity map at the lowest resolution and then refines it with disparity residuals at higher resolutions. DeepPruner [24] formulates stereo matching as a pruning problem, where a small subset of disparity candidates is selected and refined using a learned pruning strategy. It speeds up inference with a differentiable PatchMatch module and a pruned disparity search space. BGNet [25] introduces a learnable bilateral grid to upsample the aggregated volume with edges preserved, allowing complex computation at lower resolutions. CoEx [26] constructs a guided cost volume excitation module and shows that prediction accuracy can be improved by simple channel excitation of the cost volume under the guidance of the image. Inspired by CoEx, CGI-Stereo [27] includes a context and geometry fusion block to adaptively fuse context and geometry information for accurate and efficient cost aggregation. Ghost-Stereo [28] utilizes GhostNet-based feature extraction and cost aggregation to balance accuracy and speed. DTPNet [29] adopts a distill-and-then-prune strategy to compress the stereo matching network and achieves fast inference on edge devices. LightStereo [30] is a lightweight network that relies solely on 2D convolutions to avoid the resource consumption of 3D convolutions and improves the results with multi-scale attention.
Two-dimensional cost aggregation models, represented by DispNetC [7], AANet [22], and LightStereo [30], use 2D convolutions to regularize and aggregate a single-channel cost volume, which makes them fast. However, compared to cost aggregation with 3D convolutions, 2D aggregation discards the channel-dimension information encoded in the cost volume and therefore cannot capture multi-dimensional dependencies across the spatial, channel, and disparity dimensions. Thus, 3D cost aggregation can extract more contextual and semantic information and improve prediction accuracy, but at the cost of speed, while the speed of 2D aggregation comes at the cost of accuracy. Reducing the number of 3D convolutional layers in cost aggregation is therefore a trade-off between prediction accuracy and speed, and various approaches such as attention mechanisms have been proposed to compensate for the degraded accuracy. The appropriate strategy should be chosen according to the application scenario, such as autonomous vehicles, mobile robots, or fine-grained 3D reconstruction.

3. Method

FAMNet is carefully designed to minimize the reliance on heavy 3D convolutions, enabling efficient cost volume construction and aggregation. As illustrated in Figure 2, it comprises four key components: multi-scale feature extraction, FACV, MAA, and disparity regression. We employ a U-Net [31] structure with MobileNetV2 [32] as a backbone for hierarchical feature extraction. The backbone is pre-trained on the ImageNet dataset to confer a strong representation capability on FAMNet. The encoder progressively downsamples the input stereo pairs to capture features at multiple scales, while the decoder incorporates skip connections to effectively fuse low-level and high-level features. This multi-scale representation provides both detailed local information and broader contextual understanding, which is crucial for accurate disparity estimation.
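For illustration only, a minimal PyTorch sketch of such a MobileNetV2-based multi-scale encoder is given below; the class name and the stage split indices are assumptions rather than the authors' exact implementation:

```python
import torch.nn as nn
import torchvision


class MultiScaleFeatures(nn.Module):
    """Hypothetical multi-scale encoder: a MobileNetV2 trunk tapped at several
    downsampling stages (the split indices below are assumptions)."""

    def __init__(self):
        super().__init__()
        trunk = torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1").features
        self.stage1 = trunk[:2]    # ~1/2 resolution
        self.stage2 = trunk[2:4]   # ~1/4 resolution
        self.stage3 = trunk[4:7]   # ~1/8 resolution
        self.stage4 = trunk[7:14]  # ~1/16 resolution

    def forward(self, x):
        f2 = self.stage1(x)
        f4 = self.stage2(f2)
        f8 = self.stage3(f4)
        f16 = self.stage4(f8)
        return f2, f4, f8, f16
```

The four returned feature maps (roughly 1/2 to 1/16 resolution) correspond to the kind of hierarchy that a decoder with skip connections would fuse.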

3.1. FACV—Fusion Attention-Based Cost Volume

FACV is designed to compute matching costs and build a cost volume for disparity estimation. Its structure is shown in Figure 3. This module addresses fundamental limitations of conventional cost volume construction through three mechanisms: Multi-Modal Similarity Computation, Attention-Guided Fusion Architecture, and Channel-Wise Feature Reweighting. The rich similarity measurements and contextual information contained in FACV help reduce the complexity of cost aggregation, resulting in lightweight aggregation and fast inference. To capture multi-scale matching cues, we construct multiple cost volumes via global and grouped correlation [5], thereby encoding both coarse and fine-grained disparity information. These cost volumes, along with the concatenation volume, are concatenated along the channel dimension and then fused using convolutional layers to improve the geometric representation. We use a feature pyramid built from 1/2-, 1/4-, and 1/8-resolution maps to generate adaptive attention maps via upsampling and convolution, which enables the selective integration of multi-resolution features into the fused volume. A disparity attention mechanism then further refines the disparity confidence by dynamic recalibration under local disparity cues.
$C_{\mathrm{global}}(x, y, d) = \mathrm{inner}\left( f_l(x, y),\, f_r(x - d, y) \right)$
$C_{\mathrm{gwc}}(x, y, d, g) = \frac{1}{N_c / N_g} \, \mathrm{inner}\left( f_l^{g}(x, y),\, f_r^{g}(x - d, y) \right)$
$C_{\mathrm{concatenate}}(x, y, d) = \mathrm{concat}\left( f_l(x, y),\, f_r(x - d, y) \right)$
where $f_l$ and $f_r$ denote the left and right feature maps, $f_l^{g}$ and $f_r^{g}$ their $g$-th feature groups, $d$ the disparity hypothesis, $N_c$ the number of feature channels, and $N_g$ the number of groups.
Multi-Modal Similarity Computation. The FACV module fundamentally rethinks how feature similarity is computed by simultaneously employing two complementary operations. We compute both full correlation (capturing global feature relationships) and group correlation (preserving local feature interactions). This dual correlation strategy provides a comprehensive insight into feature similarity across scales and contexts. Then the original feature information is preserved by directly connecting the left and right image features, which avoids the loss of the absolute feature values in the correlation operation. The module dynamically balances these operations through learnable parameters, automatically adjusting their relative contributions based on image content. For texture-rich regions, correlation operations dominate to capture complex patterns, while in smooth areas, concatenation operations provide more-stable matching cues.
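To make this concrete, the following PyTorch sketch builds the three similarity volumes described above; the (B, C, D, H, W) layout, the group count, and the function name are illustrative assumptions rather than the paper's exact implementation:

```python
import torch


def build_cost_volumes(feat_l, feat_r, max_disp, num_groups=8):
    """Illustrative sketch of the three similarity measures above: global
    correlation, group-wise correlation, and concatenation."""
    b, c, h, w = feat_l.shape
    corr = feat_l.new_zeros(b, 1, max_disp, h, w)           # global correlation volume
    gwc = feat_l.new_zeros(b, num_groups, max_disp, h, w)   # group-wise correlation volume
    cat = feat_l.new_zeros(b, 2 * c, max_disp, h, w)        # concatenation volume
    for d in range(max_disp):
        l = feat_l[..., d:]                                 # left features at valid columns
        r = feat_r[..., : w - d]                            # right features shifted by d
        corr[:, 0, d, :, d:] = (l * r).mean(dim=1)
        gwc[:, :, d, :, d:] = (l * r).view(b, num_groups, c // num_groups, h, w - d).mean(dim=2)
        cat[:, :, d, :, d:] = torch.cat([l, r], dim=1)
    return corr, gwc, cat
```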
Attention-Guided Fusion Architecture. The fusion process employs a sophisticated attention mechanism that operates at multiple levels. We compute attention weights along the disparity dimension to emphasize likely matching positions while suppressing improbable matches. We implement a cross-scale spatial attention mechanism that identifies important image regions (e.g., object boundaries) and allocates more focus to these areas. A novel attention gate dynamically weights the contributions from different similarity computations (correlation vs. concatenation) at each spatial location. This multi-head attention system is implemented efficiently using separable convolutions and channel shuffling operations, maintaining computational efficiency while providing rich feature interactions.
Channel-Wise Feature Reweighting. The final component in FACV introduces a hierarchical channel attention mechanism, which processes individual feature channels to identify and amplify important matching cues while suppressing noisy ones. It models relationships between different feature channels to capture complex matching patterns, and then aggregates information across different receptive fields to handle both fine details and global context. The reweighting process uses a gated mechanism that combines both bottom-up (feature-driven) and top-down (task-driven) signals, ensuring that the network focuses on the most discriminative features for the current matching task.
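A plausible reading of this reweighting step is a squeeze-and-excitation-style gate over the cost-volume channels, sketched below; the reduction ratio, layer sizes, and class name are assumptions:

```python
import torch.nn as nn


class ChannelReweighting(nn.Module):
    """Gate each cost-volume channel by a learned weight derived from its
    global statistics (a minimal sketch, not the authors' exact design)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),                              # squeeze D, H, W
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, volume):                                    # volume: (B, C, D, H, W)
        return volume * self.gate(volume)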

3.2. MAA—Multi-Scale Attention Aggregation

Building upon the hierarchical design of FAMNet, the proposed MAA module introduces a pyramidal cost aggregation strategy across three distinct resolution levels. Its structure is shown in Figure 4. Specifically, cost volumes are constructed using hierarchical feature representations at 1/4, 1/8, and 1/16 scales. High-resolution features (1/4) retain fine-grained texture and edge information; mid-level features (1/8) capture intermediate spatial structures; and low-resolution features (1/16) encode global semantic context. The module incorporates three key design elements to maintain precision. Residual learning with skip connections at each scale mitigates gradient vanishing and preserves high-frequency details. Progressive upsampling via transposed 3D convolutions (stride = 2) enables hierarchical feature recovery while maintaining spatial coherence during resolution restoration. Feature refinement through convolutional residual blocks provides iterative disparity estimation refinement. Following FAMNet’s multi-scale cost aggregation paradigm, these features are progressively downsampled through strided 3D convolutions and then reintegrated via skip connections, ensuring both seamless information flow across scales and computational efficiency. This hierarchical architecture enables comprehensive scene understanding by combining local precision with global context awareness.
MAA employs a dual-path attention mechanism to dynamically enhance feature discriminability. The disparity attention pathway first applies global average and max pooling along the disparity dimension, processes the pooled features through weight-shared 3D CNNs (kernel size 1 × 1 × 1), and adjusts channel dimensions via the reduction rate R to generate disparity-specific attention weights through sigmoid activation.
$V_d = \mathrm{sigmoid}\Big( R \times f_{3D}^{1\times1\times1}\big( f_{\mathrm{avgpool}}^{\mathrm{global}}(V) + f_{\mathrm{maxpool}}^{\mathrm{global}}(V) \big) \Big) \times V + V$
where $V$ is the input cost volume, $V_d$ is the output cost volume, and $f_{3D}^{1\times1\times1}$, $f_{\mathrm{avgpool}}^{\mathrm{global}}$, and $f_{\mathrm{maxpool}}^{\mathrm{global}}$ denote the convolution and pooling operations, respectively.
Concurrently, the spatial attention pathway computes average and maximum values along the disparity dimension, concatenates these features, and refines them via a 3D CNN (1 × 7 × 7 kernel) to produce spatial attention weights. By jointly optimizing these pathways, the mechanism selectively emphasizes relevant depth levels and critical spatial regions, improving robustness in texture-less areas while maintaining computational efficiency.
$V_s = \mathrm{sigmoid}\Big( f_{3D}^{1\times7\times7}\big( f_{\mathrm{avgpool}}^{\mathrm{disparity}}(V_d) + f_{\mathrm{maxpool}}^{\mathrm{disparity}}(V_d) \big) \Big) \times V_d + V_d$
where $V_d$ is the output of the disparity attention above, $V_s$ is the output cost volume, and $f_{3D}^{1\times7\times7}$, $f_{\mathrm{avgpool}}^{\mathrm{disparity}}$, and $f_{\mathrm{maxpool}}^{\mathrm{disparity}}$ denote the convolution and pooling operations, respectively. Therefore, MAA is a lightweight cost aggregation module with reduced costly operations (3D convolutions) to enable real-time performance of FAMNet. The attention inside MAA further improves the prediction results.
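The following PyTorch sketch implements one possible reading of this dual-path attention, following the two equations above; the reduction ratio and the use of a sum (rather than concatenation) of the pooled maps are assumptions:

```python
import torch
import torch.nn as nn


class DualPathAttention(nn.Module):
    """Sketch of MAA's disparity and spatial attention pathways. Kernel sizes
    follow the text (1x1x1 and 1x7x7); channel sizes are assumptions."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        # Weight-shared 1x1x1 3D convs that squeeze and restore the channel dimension.
        self.disp_fc = nn.Sequential(
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
        )
        # 1x7x7 3D conv producing a single-channel spatial attention map.
        self.spatial_conv = nn.Conv3d(channels, 1, kernel_size=(1, 7, 7), padding=(0, 3, 3))

    def forward(self, v):                                   # v: (B, C, D, H, W)
        # Disparity attention: global avg/max pooling, shared MLP, sigmoid gate, residual.
        avg = self.disp_fc(v.mean(dim=(2, 3, 4), keepdim=True))
        mx = self.disp_fc(v.amax(dim=(2, 3, 4), keepdim=True))
        v_d = torch.sigmoid(avg + mx) * v + v
        # Spatial attention: avg/max pooling along disparity, 1x7x7 conv, sigmoid gate, residual.
        s = v_d.mean(dim=2, keepdim=True) + v_d.amax(dim=2, keepdim=True)
        v_s = torch.sigmoid(self.spatial_conv(s)) * v_d + v_d
        return v_s
```

In a full model, such a block could be applied to the cost volume at each pyramid level before the transposed-convolution upsampling described above.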

3.3. Disparity Regression and Loss Function

Disparity regression refines disparity estimation at the sub-pixel level by converting the matching costs in the cost volume into a probability distribution. A continuous disparity is then computed as the probability-weighted average via the soft argmin operation:
$\hat{d}_i = \sum_{d=0}^{d_{\max}-1} d \cdot \sigma(C_A)$
where $\hat{d}_i$ denotes the predicted disparity, $\sigma$ denotes the softmax operation, and $C_A$ denotes the aggregated cost volume. We use top-k regression [4], where the value of k is 2. To more efficiently upsample the disparity map, we utilize "super pixel" weights around pixels [26]. The whole network is trained in a supervised end-to-end manner using the smooth $L_1$ loss to measure the difference between predicted and ground-truth values:
$L = \mathrm{Smooth}_{L_1}\big(\hat{d}_i - d_{gt}\big),$
in which
$\mathrm{Smooth}_{L_1}(x) = \begin{cases} 0.5x^2, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise,} \end{cases}$
where $\hat{d}_i$ is the predicted disparity and $d_{gt}$ denotes the ground-truth disparity map.
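A minimal sketch of the regression and loss described above, assuming a (B, D, H, W) aggregated cost volume in which larger values indicate better matches:

```python
import torch
import torch.nn.functional as F


def regress_disparity(cost, k=2):
    """Top-k soft regression with k = 2, as in the text.
    Assumes larger cost values indicate better matches (negate otherwise)."""
    topk_val, topk_idx = cost.topk(k, dim=1)        # keep the k most likely hypotheses
    prob = F.softmax(topk_val, dim=1)               # probability over those k candidates
    return (prob * topk_idx.float()).sum(dim=1)     # expected (sub-pixel) disparity


def disparity_loss(pred, gt, max_disp=192):
    """Smooth L1 loss on valid ground-truth pixels, as in the loss definition above."""
    mask = (gt > 0) & (gt < max_disp)
    return F.smooth_l1_loss(pred[mask], gt[mask])
```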

4. Experiments

4.1. Datasets and Evaluation Metrics

The experiments employ a hybrid dataset approach combining synthetic (Scene Flow) and real-world (KITTI) data to ensure comprehensive validation. The Scene Flow dataset [7], a large-scale synthetic stereo image collection containing 35,454 training pairs and 4370 test pairs with ground-truth disparity maps, serves as the primary dataset for pre-training. Models are evaluated on this dataset using the endpoint error (EPE), which measures pixel-wise disparity deviation. For real-world automotive scenario validation, this study utilizes the KITTI benchmarks. The KITTI 2012 dataset [33], comprising 194 training images and 195 test images, employs a dual evaluation protocol: a 3-pixel error threshold over non-occluded regions (3px-noc) and over the full frame (3px-all). The subsequently released KITTI 2015 dataset [34] expands this with 200 balanced training–testing pairs while introducing enhanced evaluation metrics. These retain the EPE measurement while adding D1 error categorization, which quantifies the proportion of outliers (pixels whose disparity error exceeds both 3 pixels and 5% of the ground-truth value) across three spatial partitions: background regions (D1-bg), foreground objects (D1-fg), and composite scenes (D1-all). This hierarchical evaluation system enables granular performance analysis across various elements of autonomous driving scenarios.
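For reference, the two metrics can be computed as follows (a minimal sketch; the valid-pixel masking convention is an assumption):

```python
import torch


def epe_and_d1(pred, gt, max_disp=192):
    """End-point error and D1 outlier rate; an outlier has an error above 3 px
    and above 5% of the true disparity (the standard KITTI 2015 rule)."""
    mask = (gt > 0) & (gt < max_disp)              # valid ground-truth pixels only
    err = (pred[mask] - gt[mask]).abs()
    epe = err.mean()
    d1 = ((err > 3.0) & (err > 0.05 * gt[mask])).float().mean() * 100.0
    return epe.item(), d1.item()
```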

4.2. Implementation Details

Our experimental implementation is carried out using the PyTorch 2.0.1 framework on an NVIDIA (Santa Clara, CA, USA) RTX 3090 GPU platform. The models are trained in an end-to-end manner using the Adam [35] optimizer configured with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. To prepare the training data, the input images are randomly cropped to 256 × 512 pixels, with the maximum disparity set to 192. The training protocol consists of two distinct phases: initial pre-training and subsequent fine-tuning. The pre-training phase uses the Scene Flow dataset for 32 epochs with an initial learning rate of 0.001, which is scheduled to decrease at epochs 20, 26, 28, and 30. Following pre-training, the model is fine-tuned on a combined training set comprising both the KITTI 2012 and 2015 datasets for 600 epochs, maintaining the same initial learning rate of 0.001, which is then halved at the 300th epoch. This training strategy ensures robust model performance across different datasets while maintaining computational efficiency.
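A minimal sketch of the reported optimizer and learning-rate schedule is shown below; the model is a placeholder and the decay factor for the pre-training milestones is an assumption, since only the milestone epochs are stated:

```python
import torch

# Placeholder model; the KITTI schedule halves the learning rate at epoch 300.
model = torch.nn.Conv2d(3, 3, 3)  # stand-in for FAMNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
pretrain_sched = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[20, 26, 28, 30], gamma=0.5)  # Scene Flow, 32 epochs (gamma assumed)
finetune_sched = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[300], gamma=0.5)             # KITTI 2012 + 2015, 600 epochs
```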

4.3. Ablation Analysis

To evaluate the effectiveness of individual components in our proposed method, we performed systematic ablation studies on the Scene Flow and KITTI 2015 validation sets. The experimental setup involved training all models on the respective training sets of the Scene Flow and KITTI datasets following the previously described training strategies, with runtime performance measured on the NVIDIA RTX 3090 GPU platform.
Our baseline architecture uses 4D cost volume construction at 1/4 resolution with standard 3D convolutions for cost aggregation. The comprehensive experimental results, as presented in Table 1, demonstrate the significant performance gains achieved by each proposed component, while maintaining the real-time processing capabilities essential for practical applications. By sequentially incorporating the proposed modules, we observe that both FACV and MAA mechanisms contribute significantly to accuracy improvements on both datasets. FACV provides rich similarity measurements and contextual information for matching costs. The rich information is beneficial to the improvement in prediction results. The attention in MAA provides statistical information in the spatial and channel dimensions to regularize the cost volume. Notably, these improvements are achieved with mild increases in processing time compared to the baseline model, which maintains reasonable computational efficiency. A visualization of experimental data via a histogram plot is shown in Figure 5 to demonstrate distinct comparisons between ablation results.

4.4. Performance

Table 2 provides a detailed comparison between the proposed model and representative state-of-the-art stereo matching methods on the KITTI 2012 and KITTI 2015 benchmarks. The comparison is stratified into two categories: high-accuracy models and real-time models. Notably, the proposed method achieves a significant speed advantage (up to 5× faster) over non-real-time baselines such as PSMNet [5] and GwcNet [6], while maintaining comparable disparity estimation accuracy. Although high-performance models such as OpenStereo [36] and MonSter [37] are more accurate, our method retains a speed advantage of roughly 10×, which preserves its practical effectiveness. Among real-time networks, the proposed FAMNet exhibits competitive accuracy, with only a marginal performance gap compared to LightStereo [30], the cutting-edge real-time method. Resource consumption is a key metric for evaluating the efficiency of a lightweight model. We compare our model with state-of-the-art models in terms of computing resources and present the results in Table 3. All models in the table are lightweight, support real-time prediction, and require only modest computing resources. The resource consumption of our model is moderate among the listed models, while its accuracy ranks second. The parameter count in the table reflects memory overhead; our model achieves a 57% parameter reduction compared to LightStereo-M, substantially decreasing storage requirements and model loading time. While achieving better accuracy, our model reduces computational cost by about 20% compared to BGNet and CoEx. Thus, FAMNet achieves a better balance between prediction accuracy and resource consumption.
Qualitative results on KITTI 2012 and 2015 are shown in Figure 6 and Figure 7. For a fairer comparison, we also include the high-performance network CFNet [40]. From the disparity maps, it can be seen that FAMNet predicts accurate disparities for thin objects and non-continuous regions. The annotations in Figure 6 highlight the strong performance of our model. For the thin and tiny objects in the left parts of the first and third images in Figure 6, FAMNet predicts more accurate disparity maps; the disparities are more continuous and more distinct from the background. This can be attributed to the rich contextual and semantic information provided by the MAA module. For object outlines (non-continuous disparities) in the right parts of the images in Figure 6, the predicted disparities are more accurate and clearer for our model than for the other models, since the FACV module builds a cost volume with stronger geometric priors. As shown by the matching errors in Figure 8, our model achieves results comparable to other state-of-the-art models, all of which perform well in most regions. Owing to the effectiveness of the proposed modules, our model is competitively robust in ill-posed regions such as tiny structures, reflective glass, and over-exposed surfaces. Cross-domain performance is crucial to the practical capability of a stereo matching model. For the cross-domain evaluation, FAMNet and the other models are pre-trained only on the Scene Flow dataset and evaluated on Middlebury 2014 [41] and ETH 3D [42]. Table 4 highlights the strong generalization ability of our model, which achieves excellent performance on both the Middlebury and ETH3D datasets, rivaling recent high-performing methods such as CGI-Stereo [27] and Ghost-Stereo [28]. Figure 9 illustrates qualitative disparity predictions in zero-shot scenes, where the model resolves fine details and accurately estimates depth in regions containing small or distant objects. The error maps in Figure 9 also show that our model performs well in most regions, including tiny structures and texture-less surfaces; in particular, the disparities for the pencils, the details of the motorcycle, and the chair behind the motorcycle are distinct and accurate. Notably, our model also predicts newly appearing small objects robustly, further validating the effectiveness of our FACV and MAA modules in unseen scenes.

5. Conclusions

In this paper, we proposed FAMNet, a lightweight and generalizable stereo matching framework designed for real-time depth estimation in autonomous driving scenarios. To address the limitations of conventional 3D convolution-based networks, FAMNet introduces two key innovations. The FACV module integrates multi-scale correlation, attention-guided fusion, and channel reweighting to construct compact and expressive cost volumes with reduced computational overhead. The MAA module efficiently refines disparity estimation by aggregating hierarchical contextual information through pyramid-based processing and a dual-path attention mechanism. Extensive experiments on both synthetic and real-world datasets demonstrate that FAMNet achieves a compelling balance between accuracy, efficiency, and generalization. It outperforms most 3D convolution-heavy models in runtime while maintaining competitive disparity estimation accuracy. Compared to the cutting-edge method CoEx [26], FAMNet improves accuracy by 3% and 5% on the KITTI 2015 and KITTI 2012 datasets, respectively, and achieves up to a 22% improvement in generalization on the Middlebury 2014 and ETH 3D datasets, while maintaining real-time performance. Moreover, its strong cross-domain generalization on these benchmarks demonstrates robustness across diverse driving environments. With its modular design, low latency, and high accuracy, FAMNet offers a practical solution for onboard depth perception in autonomous vehicles. Although the proposed model achieves an excellent trade-off between speed and accuracy, a considerable accuracy gap remains between our model and most accuracy-oriented models, and the model is not optimized for ill-posed regions such as texture-less and reflective surfaces. In future work, we therefore plan to design more effective yet efficient modules that improve prediction accuracy with little loss of speed, and to extract richer geometric features efficiently to improve performance in ill-posed regions. We also aim to extend the framework to more challenging scenarios, such as motion-aware stereo matching under texture-less surfaces and varying illumination conditions.

Author Contributions

Conceptualization, J.Z., Q.T., N.Y. and X.L.; methodology, J.Z. and Q.T.; software, J.Z. and N.Y.; validation, Q.T., N.Y. and X.L.; formal analysis, J.Z., Q.T. and N.Y.; investigation, J.Z., Q.T. and N.Y.; resources, Q.T. and X.L.; data curation, J.Z., Q.T. and N.Y.; writing—original draft preparation, J.Z., Q.T. and N.Y.; writing—review and editing, Q.T. and X.L.; visualization, J.Z. and N.Y.; supervision, Q.T. and X.L.; project administration, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Research Fund of R&D and application demonstration of trusted multimodal large-scale model technology for industrial situation awareness and decision-making (No. Z241100001324010).

Data Availability Statement

The data are unavailable at present due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
FAMNet: Fusion Attention Multi-Scale Network
FACV: Fusion Attention-Based Cost Volume
MAA: Multi-Scale Attention Aggregation
GRU: Gated Recurrent Unit
GPU: Graphics Processing Unit
GwcNet: Group-Wise Correlation Network
CFNet: Cascade and Fused Cost Volume-Based Network
ACVNet: Attention Concatenation Volume Network
BGNet: Bilateral Grid Network
GANet: Guided Aggregation Network
AANet: Adaptive Aggregation Network
GCNet: Geometry and Context Network
PSMNet: Pyramid Stereo Matching Network
IGEV: Iterative Geometry-Encoding Volume

References

  1. Chen, C.; Seff, A.; Kornhauser, A.; Xiao, J. Deepdriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  2. Biswas, J.; Veloso, M. Depth camera based localization and navigation for indoor mobile robots. In Proceedings of the IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011. [Google Scholar]
  3. Alhaija, H.; Mustikovela, S.K.; Mescheder, L.; Geiger, A.; Rother, C. Augmented reality meets computer vision: Efficient data generation for urban driving scenes. Int. J. Comput. Vis. 2018, 126, 961–972. [Google Scholar] [CrossRef]
  4. Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P.; Kennedy, R.; Bachrach, A.; Bry, A. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  5. Chang, J.R.; Chen, Y.S. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  6. Guo, X.; Yang, K.; Yang, W.; Wang, X.; Li, H. Group-wise correlation stereo network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  7. Zhang, F.; Prisacariu, V.; Yang, R.; Torr, P.H.S. Ga-net: Guided aggregation net for end-to-end stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  8. Xu, G.; Cheng, J.; Guo, P.; Yang, X. Attention concatenation volume for accurate and efficient stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  9. Lipson, L.; Teed, Z.; Deng, J. Raft-stereo: Multilevel recurrent field transforms for stereo matching. In Proceedings of the International Conference on 3D Vision, Prague, Czech Republic, 1–3 December 2021. [Google Scholar]
  10. Xu, G.; Wang, X.; Ding, X.; Yang, X. Iterative geometry encoding volume for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  11. Xu, G.; Wang, X.; Zhang, Z.; Cheng, J.; Liao, C.; Yang, X. IGEV++: Iterative multi-range geometry encoding volumes for stereo matching. arXiv 2024, arXiv:2409.00638. [Google Scholar] [CrossRef] [PubMed]
  12. Liao, L.; Zeng, J.; Lai, T.; Xiao, Z.; Zou, F.; Fujita, H. Stereo matching on images based on volume fusion and disparity space attention. Eng. Appl. Artif. Intell. 2024, 136, 108902. [Google Scholar] [CrossRef]
  13. Lu, Y.; He, X.; Zhang, Q.; Zhang, D. Fast stereo conformer: Real-time stereo matching with enhanced feature fusion for autonomous driving. Eng. Appl. Artif. Intell. 2025, 149, 110565. [Google Scholar] [CrossRef]
  14. Tahmasebi, M.; Huq, S.; Meehan, K.; McAfee, M. DCVSMNet: Double cost volume stereo matching network. Neurocomputing 2025, 618, 129002. [Google Scholar] [CrossRef]
  15. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  16. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  17. Bhar, S.F.; Alhashim, I.; Wonka, P. AdaBins: Depth estimation using adaptive bins. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021. [Google Scholar]
  18. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  19. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  20. Li, Z.; Liu, X.; Drenkow, N.; Ding, A.; Creighton, F.X.; Taylor, R.H.; Unberath, M. Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  21. Khamis, S.; Fanello, S.; Rhemann, C.; Kowdle, A.; Valentin, J.L.; Izadi, S. Stereonet: Guided hierarchical refinement for real-time edge-aware depth prediction. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  22. Xu, H.; Zhang, J. Aanet: Adaptive aggregation network for efficient stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  23. Wang, Y.; Lai, Z.; Huang, G.; Wang, B.; Maaten, L.; Campbell, M.; Weinberger, K.Q. Anytime stereo image depth estimation on mobile devices. In Proceedings of the IEEE International Conference on Robotics and Automation, Montreal, BC, Canada, 20–24 May 2019. [Google Scholar]
  24. Duggal, S.; Wang, S.; Ma, W.; Hu, R.; Urtasun, R. Deeppruner: Learning efficient stereo matching via differentiable patchmatch. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  25. Xu, B.; Xu, Y.; Yang, X.; Jia, W.; Guo, Y. Bilateral grid learning for stereo matching networks. In Proceedings of the IEEE International Conference on Computer Vision, Virtual, 19–25 June 2021. [Google Scholar]
  26. Bangunharcana, A.; Cho, J.W.; Lee, S.; Kweon, I.S.; Kim, K.S.; Kim, S. Correlate-and-excite: Real-time stereo matching via guided cost volume excitation. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Prague, Czech Republic, 27 September–1 October 2021. [Google Scholar]
  27. Xu, G.; Zhou, H.; Yang, X. CGI-Stereo: Accurate and real-time stereo matching via context and geometry interaction. arXiv 2023, arXiv:2301.02789. [Google Scholar]
  28. Jiang, X.; Bian, X.; Guo, C. Ghost-Stereo: GhostNet-based cost volume enhancement and aggregation for stereo matching networks. arXiv 2024, arXiv:2405.14520. [Google Scholar]
  29. Pan, B.; Jiao, J.; Pang, J.; Cheng, J. Distill-then-prune: An efficient compression framework for real-time stereo matching network on edge devices. In Proceedings of the IEEE International Conference on Robotics and Automation, Yokohama, Japan, 13–17 May 2024. [Google Scholar]
  30. Guo, X.; Zhang, C.; Zhang, Y.; Zheng, W.; Nie, D.; Poggi, M.; Chen, L. Light Stereo: Channel boost is all you need for efficient 2D cost aggregation. arXiv 2025, arXiv:2406.19833v3. [Google Scholar]
  31. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015. [Google Scholar]
  32. Sandler, M.; Howard, A.; Zhu, M.L.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE International Conference on Computer Vision, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  33. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE International Conference on Computer Vision, Providence, RI, USA, 16–21 June 2012. [Google Scholar]
  34. Menze, M.; Geiger, A. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  35. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  36. Guo, X.; Zhang, C.; Lu, J.; Duan, Y.; Wang, Y.; Yang, T.; Zhu, Z.; Chen, L. OpenStereo: A comprehensive benchmark for stereo matching and strong baseline. arXiv 2023, arXiv:2312.00343. [Google Scholar]
  37. Cheng, J.; Liu, L.; Xu, G.; Wang, X.; Zhang, Z.; Deng, Y.; Zang, J.; Chen, Y.; Cai, Z.; Yang, X. MonSter: Marry monodepth to stereo unleashes power. arXiv 2025, arXiv:2501.08643. [Google Scholar] [CrossRef]
  38. Yang, J.; Wu, C.; Wang, G.; Xu, R.; Zhang, M.; Xu, Y. Guided aggregation and disparity refinement for real-time stereo matching. Signal Image Video Process. 2024, 18, 4467–4477. [Google Scholar] [CrossRef]
  39. Wu, Z.; Zhu, H.; He, L.; Zhao, Q.; Shi, J.; Wu, W. Real-time stereo matching with high accuracy via spatial attention-guided upsampling. Appl. Intell. 2023, 53, 24253–24274. [Google Scholar] [CrossRef]
  40. Shen, Z.; Dai, Y.; Rao, Z. CFNet: Cascade and fused cost volume for robust stereo matching. In Proceedings of the IEEE International Conference on Computer Vision, Virtual, 19–25 June 2021. [Google Scholar]
  41. Scharstein, D.; Hirschmüller, H.; Kitajima, Y.; Krathwohl, G.; Nešic, N.; Wang, X.; Westling, P. High-resolution stereo datasets with subpixel-accurate ground truth. In Proceedings of the German Conference on Pattern Recognition, Cham, Switzerland, 2–5 September 2014. [Google Scholar]
  42. Schöps, T.; Schönberger, J.L.; Galliani, S.; Sattler, T.; Schindler, K.; Pollefeys, M.; Geiger, A. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
Figure 1. Principle of depth estimation via symmetric stereo cameras. (a) Autonomous vehicle with symmetric stereo cameras. (b) Object on symmetric image planes. (c) Disparity estimated from pixel-to-pixel difference.
Figure 2. Architecture of FAMNet. It consists of a straight pipeline with four stages of feature extraction, cost volume construction, cost aggregation, and disparity regression.
Figure 3. Structure of Fusion Attention-based Cost Volume.
Figure 4. Structure of Multi-scale Attention Aggregation.
Figure 5. Histogram of ablation experiments.
Figure 6. Qualitative results of FAMNet compared to other models on KITTI 2012 dataset.
Figure 7. Qualitative results of FAMNet compared to other models on KITTI 2015 dataset.
Figure 8. Qualitative results of matching errors on KITTI 2015 dataset.
Figure 9. Visualization results of disparities and errors for FAMNet on Middlebury 2014.
Table 1. Ablation study of proposed model on Scene Flow and KITTI validation sets.

| Model | Scene Flow EPE (px) | KITTI 2015 D1-all (%) | KITTI 2015 EPE (px) | Runtime (ms) |
|---|---|---|---|---|
| Baseline | 0.89 | 2.59 | 0.73 | 19 |
| Baseline + FACV | 0.74 | 1.64 | 0.65 | 25 |
| Baseline + MAA | 0.72 | 1.61 | 0.64 | 26 |
| Baseline + FACV + MAA (Ours) | 0.62 | 1.49 | 0.59 | 31 |
Table 2. Quantitative results on KITTI online benchmarks.

| Model | KITTI 2012 3px-noc (%) | KITTI 2012 3px-all (%) | KITTI 2015 D1-bg (%) | KITTI 2015 D1-fg (%) | KITTI 2015 D1-all (%) | Platform | Runtime (ms) |
|---|---|---|---|---|---|---|---|
| Accuracy | | | | | | | |
| GANet-deep [7] | 1.19 | 1.60 | 1.48 | 3.46 | 1.81 | Tesla P40 | 1800 |
| PSMNet [5] | 1.49 | 1.89 | 1.86 | 4.62 | 2.32 | Titan X | 410 |
| GwcNet [6] | 1.32 | 1.70 | 2.21 | 6.16 | 2.11 | Titan X | 320 |
| GCNet [4] | 1.77 | 2.30 | 1.37 | 3.16 | 2.87 | Titan X | 900 |
| IGEV-Stereo [10] | 1.12 | 1.44 | 1.38 | 2.67 | 1.59 | RTX 3090 | 180 |
| Raft-Stereo [9] | 1.30 | 1.66 | 1.58 | 3.05 | 1.82 | RTX 6000 | 380 |
| CFNet [40] | 1.23 | 1.58 | 1.54 | 3.56 | 1.88 | Tesla V100 | 180 |
| ACVNet [8] | 1.13 | 1.47 | 1.37 | 3.07 | 1.65 | RTX 3090 | 250 |
| OpenStereo [36] | 1.00 | 1.26 | 1.28 | 2.26 | 1.44 | - | 290 |
| MonSter [37] | 0.84 | 1.09 | 1.13 | 2.81 | 1.41 | RTX 3090 | 450 |
| Speed | | | | | | | |
| StereoNet [21] | 4.91 | 6.02 | 4.30 | 7.45 | 4.83 | Titan X | 15 |
| AnyNet [23] | 2.20 | 2.66 | - | - | 2.71 | RTX 2080TI | 27 |
| AANet [22] | 1.91 | 2.42 | 1.99 | 5.39 | 2.55 | Tesla V100 | 62 |
| BGNet [25] | 1.77 | 2.15 | 2.07 | 4.74 | 2.51 | RTX 2080TI | 25 |
| Fast-ACVNet [8] | 1.68 | 2.13 | 1.82 | 3.93 | 2.17 | RTX 3090 | 39 |
| CoEx [26] | 1.55 | 1.93 | 1.79 | 3.82 | 2.13 | RTX 2080TI | 27 |
| GADR-Stereo [38] | - | - | 1.80 | - | 2.11 | RTX 3090 | 31 |
| SAGU-Net-fr [39] | 1.55 | 1.55 | 1.70 | 3.79 | 2.05 | RTX 3090 | 35 |
| Ghost-Stereo [28] | 1.45 | 1.80 | 1.71 | 3.77 | 2.05 | RTX 3090 | 37 |
| LightStereo-M [30] | 1.56 | 1.91 | 1.81 | 3.22 | 2.04 | RTX 3090 | 23 |
| FAMNet (Ours) | 1.53 | 1.81 | 1.72 | 3.43 | 2.06 | RTX 3090 | 31 |
Table 3. Comparisons of computing resources.

| Model | Params (M) | FLOPS (G) | D1-all (%) | Runtime (ms) |
|---|---|---|---|---|
| StereoNet [21] | 0.29 | 52 | 4.83 | 15 |
| BGNet [25] | 2.97 | 51 | 2.17 | 25 |
| Fast-ACVNet [8] | 2.56 | 36 | 2.13 | 39 |
| CoEx [26] | 2.69 | 50 | 2.13 | 27 |
| LightStereo-M [30] | 7.61 | 33 | 2.04 | 23 |
| FAMNet (Ours) | 3.26 | 41 | 2.06 | 31 |
Table 4. Cross-domain generalization performance.

| Model | Middlebury 2014 (>2 px) (%) | ETH 3D (>1 px) (%) |
|---|---|---|
| PSMNet [5] | 15.8 | 9.8 |
| GANet-deep [7] | 20.3 | 14.1 |
| CFNet [40] | 15.4 | 5.3 |
| CoEx [26] | 14.5 | 9 |
| BGNet [25] | 24.7 | 22.6 |
| CGI-Stereo [27] | 13.5 | 6.3 |
| Ghost-Stereo [28] | 11.4 | 7.3 |
| FAMNet (Ours) | 12.1 | 7.0 |
