Article

FAMNet: A Lightweight Stereo Matching Network for Real-Time Depth Estimation in Autonomous Driving

College of Computer Science, Beijing Information Science & Technology University, Beijing 102206, China
*
Author to whom correspondence should be addressed.
Symmetry 2025, 17(8), 1214; https://doi.org/10.3390/sym17081214
Submission received: 22 June 2025 / Revised: 18 July 2025 / Accepted: 25 July 2025 / Published: 1 August 2025

Abstract

Accurate and efficient stereo matching is fundamental to real-time depth estimation from symmetric stereo cameras in autonomous driving systems. However, existing high-accuracy stereo matching networks typically rely on computationally expensive 3D convolutions, which limit their practicality in real-world environments. In contrast, real-time methods often sacrifice accuracy or generalization capability. To address these challenges, we propose FAMNet (Fusion Attention Multi-Scale Network), a lightweight and generalizable stereo matching framework tailored for real-time depth estimation in autonomous driving applications. FAMNet consists of two novel modules: Fusion Attention-based Cost Volume (FACV) and Multi-scale Attention Aggregation (MAA). FACV constructs a compact yet expressive cost volume by integrating multi-scale correlation, attention-guided feature fusion, and channel reweighting, thereby reducing reliance on heavy 3D convolutions. MAA further enhances disparity estimation by fusing multi-scale contextual cues through pyramid-based aggregation and dual-path attention mechanisms. Extensive experiments on the KITTI 2012 and KITTI 2015 benchmarks demonstrate that FAMNet achieves a favorable trade-off between accuracy, efficiency, and generalization. On KITTI 2015, with the incorporation of FACV and MAA, the prediction accuracy of the baseline model is improved by 37% and 38%, respectively, and a total improvement of 42% is achieved by our final model. These results highlight FAMNet’s potential for practical deployment in resource-constrained autonomous driving systems requiring real-time and reliable depth perception.

1. Introduction

In autonomous driving systems, accurate and efficient stereo depth estimation is essential for enabling real-time 3D understanding of the surrounding environment [1,2,3]. It is achieved by identifying pixel-wise correspondences between rectified left and right views. As illustrated in Figure 1, a pair of symmetric stereo cameras capture the same object at different horizontal positions, and the disparity between corresponding pixels is used to compute object distance via triangulation. This depth information supports a wide range of driving functions such as obstacle avoidance, lane following, and scene understanding.
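As a brief illustration of the triangulation step (standard rectified-stereo geometry; the symbols are introduced here for illustration and are not taken from the paper): for a stereo rig with focal length $f$ and baseline $B$, a point observed with disparity $d$ lies at depth $Z = \frac{f \cdot B}{d}$, so nearby objects produce large disparities while distant objects produce small ones.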
Stereo matching, as the core task of disparity estimation from binocular imagery, plays a critical role in achieving accurate and robust depth estimation in autonomous vehicles. A typical stereo matching framework follows a pipeline of four steps: feature extraction, cost volume construction, cost aggregation, and disparity regression. Recently, advances in deep learning have led to remarkable improvements in stereo matching accuracy, driven by increasingly powerful CNN-based representations. However, many state-of-the-art models rely heavily on 3D convolutional regularization to refine the cost volume, resulting in high computational complexity and latency. This hinders deployment in embedded automotive systems, which must operate in real time under tight resource constraints [4,5,6,7,8].
To address this challenge, several studies have explored more-efficient cost volume construction strategies and lightweight aggregation mechanisms. Methods such as GwcNet [6] and ACVNet [8] demonstrate that well-designed cost volume representations can significantly improve stereo matching accuracy. Nonetheless, these approaches often require complex 3D encoder–decoder architectures, leading to performance bottlenecks. Other works [9,10,11] have investigated multi-scale feature fusion to enhance cost volume expressiveness, but have faced difficulties in balancing accuracy with computational efficiency.
In this paper, we propose FAMNet, a lightweight and generalizable stereo matching network designed specifically for real-time depth estimation in autonomous driving. FAMNet consists of two novel components: the FACV and MAA modules. FACV constructs a compact yet informative cost volume by combining multi-scale correlation, attention-guided fusion, and channel-wise feature reweighting, significantly reducing the reliance on heavy 3D convolutions. MAA further enhances disparity estimation by integrating multi-resolution contextual cues through pyramid-based aggregation and dual-path attention mechanisms. Compared to other networks that benefit from attention mechanisms and multi-scale processing in stereo matching [12,13,14], our FAMNet employs a concise streamlined architecture with lightweight attention mechanisms for efficient yet effective information fusion. Extensive experiments on the KITTI 2012 and 2015 benchmarks demonstrate that these modules enable FAMNet to achieve a favorable trade-off between accuracy, efficiency, and generalization across diverse driving environments, making it suitable for real-time autonomous applications.
The main contributions of this paper can be summarized as follows:
(1)
We propose FACV, which integrates multi-scale correlation, attention-guided fusion, and channel reweighting to construct compact and informative cost volumes with reduced reliance on 3D convolutions.
(2)
We introduce MAA, which combines pyramid-based hierarchical cost aggregation and dual-path attention to enhance disparity estimation with minimal computational cost.
(3)
We develop a lightweight and real-time stereo matching network, FAMNet, to achieve a favorable balance between accuracy, efficiency, and cross-domain generalization, demonstrating its feasibility for practical deployment in autonomous driving scenarios.
This paper is organized as follows: Section 2 describes the related work. Section 3 presents the details of our proposed method. The experimental results are discussed in Section 4. Section 5 summarizes our work and discusses directions for future improvement.

2. Related Work

2.1. End-to-End Stereo Matching Network

In recent years, CNNs have made significant advancements in various vision tasks such as object detection [15,16], depth estimation [9,17], and semantic segmentation [18,19]. With the strong representational power of CNNs, learning-based stereo matching methods achieve impressive performance over traditional methods with manually designed features [4]. DispNetC [7] is the first stereo matching model to predict dense disparity in an end-to-end manner. It builds dense correspondences between pixels in the left and right images by employing a correlation layer. The correlation between left and right features measures similarity, but contextual information is lost when the channels are squeezed. GCNet [4] retains contextual information by concatenating left and right features at each disparity level and then applies 3D convolutions to regularize the cost volume. It defines a differentiable soft argmin operation to regress a smooth disparity estimate, thereby achieving excellent results. Although the concatenated cost volume contains rich contextual information, it has to learn similarities from scratch. GwcNet [6] constructs the cost volume in a group-wise manner, so that the resulting 4D volume carries both contextual information and similarity measurements. Owing to its lightweight and effective design, group-wise correlation has become a popular way to build cost volumes. PSMNet [5] improves prediction accuracy by stacking multiple hourglass aggregation subnetworks with a large number of 3D convolutions. However, these resource-costly 3D convolutions lead to significant latency. GANet [7] comprises a semi-global guided aggregation module and a local guided aggregation module to replace the widely used 3D convolutional layers. Zeng et al. [20] propose hysteresis attention, inspired by the hysteresis comparator in electronic circuits, to improve the representation capability of the feature extractor. ACVNet [8] generates weights from correlation cues to suppress redundant information and enhance matching-related information in the concatenation volume, achieving state-of-the-art results. Recently, refinement through iterative updates has made disparity prediction more accurate. Raft-Stereo [9] introduces multi-level convolutional GRUs to iteratively refine disparity candidates. Building on Raft-Stereo, IGEV [10] constructs a combined geometry-encoding volume that encodes geometry and contextual information and then iteratively indexes it to update the disparity map. Furthermore, IGEV++ [11] collects multi-range geometry information for ill-posed regions and large disparities, and fine-grained geometry information for details and small disparities, and then leverages ConvGRUs to iteratively update the disparity map.

2.2. Real-Time Stereo Matching Network

Besides improving prediction accuracy with increasingly complicated networks, achieving low latency, which is crucial to safety, is a significant factor for depth perception in autonomous driving. Driven by the demand for high-speed and accurate depth estimation in autonomous driving applications, real-time stereo matching has witnessed significant advancements in recent years. StereoNet [21] is the first end-to-end model to realize real-time disparity prediction. It constructs the cost volume using feature maps at a lower resolution to reduce resource consumption, and the coarse prediction is refined by edge-aware upsampling to obtain a better disparity map. StereoNet achieves a high frame rate on high-end GPUs. AANet [22] introduces an adaptive aggregation strategy that dynamically adjusts the receptive field size, enabling both efficiency and competitive accuracy. It replaces all the 3D convolutions with 2D ones and constructs adaptive intra-scale and cross-scale aggregation modules to accelerate the network. AnyNet [23] is a multi-stage network that progressively refines disparity predictions, allowing real-time inference with early-exit capabilities for flexible performance-speed trade-offs. It generates a coarse disparity map at the lowest resolution and then refines it with disparity residuals at higher resolutions. DeepPruner [24] formulates stereo matching as a pruning problem, where a small subset of disparity candidates is selected and refined using a learned pruning strategy. It speeds up inference with a differentiable PatchMatch module and a pruned disparity search space. BGNet [25] introduces a learnable bilateral grid to upsample the aggregated volume with edges preserved, allowing complex computation at lower resolutions. CoEx [26] constructs a guided cost volume excitation module and shows that prediction accuracy can be improved by simple channel excitation of the cost volume under the guidance of the image. Inspired by CoEx, CGI-Stereo [27] includes a context and geometry fusion block to adaptively fuse context and geometry information for accurate and efficient cost aggregation. Ghost-Stereo [28] utilizes GhostNet-based feature extraction and cost aggregation to balance accuracy and speed. DTPNet [29] adopts a distill-and-then-prune strategy to compress the stereo matching network and achieves fast inference on edge devices. LightStereo [30] is a lightweight network that relies solely on 2D convolutions to avoid the resource consumption of 3D convolutions and improves the results with multi-scale attention.
Two-dimensional cost aggregation models, represented by DispNetC [7], AANet [22], and LightStereo [30], use 2D convolutions to regularize and aggregate a single-channel cost volume, which makes them fast. However, compared to cost aggregation with 3D convolutions, 2D aggregation discards the channel-dimension information encoded in the cost volume and therefore cannot capture multi-dimensional dependencies across the spatial, channel, and disparity dimensions. Thus, 3D cost aggregation can extract more contextual and semantic information and improve prediction accuracy, but at the cost of speed, while the speed of 2D aggregation comes at the cost of accuracy. Reducing the number of 3D convolutional layers in cost aggregation is therefore a trade-off between prediction accuracy and speed, and various approaches such as attention mechanisms have been proposed to compensate for the degraded accuracy. The appropriate strategy should be chosen according to the application scenario, such as autonomous vehicles, mobile robots, or fine-grained 3D reconstruction.

3. Method

FAMNet is carefully designed to minimize the reliance on heavy 3D convolutions, enabling efficient cost volume construction and aggregation. As illustrated in Figure 2, it comprises four key components: multi-scale feature extraction, FACV, MAA, and disparity regression. We employ a U-Net [31] structure with MobileNetV2 [32] as a backbone for hierarchical feature extraction. The backbone is pre-trained on the ImageNet dataset to confer a strong representation capability on FAMNet. The encoder progressively downsamples the input stereo pairs to capture features at multiple scales, while the decoder incorporates skip connections to effectively fuse low-level and high-level features. This multi-scale representation provides both detailed local information and broader contextual understanding, which is crucial for accurate disparity estimation.
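For illustration only, a minimal PyTorch sketch of such a MobileNetV2-based multi-scale encoder is given below; the class name and the stage split indices are assumptions rather than the authors' exact implementation:

```python
import torch.nn as nn
import torchvision


class MultiScaleFeatures(nn.Module):
    """Hypothetical multi-scale encoder: a MobileNetV2 trunk tapped at several
    downsampling stages (the split indices below are assumptions)."""

    def __init__(self):
        super().__init__()
        trunk = torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1").features
        self.stage1 = trunk[:2]    # ~1/2 resolution
        self.stage2 = trunk[2:4]   # ~1/4 resolution
        self.stage3 = trunk[4:7]   # ~1/8 resolution
        self.stage4 = trunk[7:14]  # ~1/16 resolution

    def forward(self, x):
        f2 = self.stage1(x)
        f4 = self.stage2(f2)
        f8 = self.stage3(f4)
        f16 = self.stage4(f8)
        return f2, f4, f8, f16
```

The four returned feature maps (roughly 1/2 to 1/16 resolution) correspond to the kind of hierarchy that a decoder with skip connections would fuse.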

3.1. FACV—Fusion Attention-Based Cost Volume

FACV is designed to compute matching costs and build a cost volume for disparity estimation. Its structure is shown in Figure 3. This module addresses fundamental limitations of conventional cost volume construction through three mechanisms: Multi-Modal Similarity Computation, Attention-Guided Fusion Architecture, and Channel-Wise Feature Reweighting. The rich similarity measurements and contextual information contained in FACV help reduce the complexity of cost aggregation, resulting in lightweight aggregation and fast inference. To capture multi-scale matching cues, we construct multiple cost volumes via global and grouped correlation [5], thereby encoding both coarse and fine-grained disparity information. These cost volumes, along with the concatenation volume, are concatenated along the channel dimension and then fused using convolutional layers to improve the geometric representation. We use a feature pyramid built from 1/2-, 1/4-, and 1/8-resolution maps to generate adaptive attention maps via upsampling and convolution, which enables the selective integration of multi-resolution features into the fused volume. A disparity attention mechanism then further refines the disparity confidence by dynamic recalibration under local disparity cues.
$C_{\mathrm{global}}(x, y, d) = \mathrm{inner}\left( f_l(x, y),\, f_r(x - d, y) \right)$
$C_{\mathrm{gwc}}(x, y, d, g) = \frac{1}{N_c / N_g} \, \mathrm{inner}\left( f_l^{g}(x, y),\, f_r^{g}(x - d, y) \right)$
$C_{\mathrm{concatenate}}(x, y, d) = \mathrm{concat}\left( f_l(x, y),\, f_r(x - d, y) \right)$
where $f_l$ and $f_r$ denote the left and right feature maps, $f_l^{g}$ and $f_r^{g}$ their $g$-th feature groups, $d$ the disparity hypothesis, $N_c$ the number of feature channels, and $N_g$ the number of groups.
Multi-Modal Similarity Computation. The FACV module fundamentally rethinks how feature similarity is computed by simultaneously employing two complementary operations. We compute both full correlation (capturing global feature relationships) and group correlation (preserving local feature interactions). This dual correlation strategy provides a comprehensive insight into feature similarity across scales and contexts. Then the original feature information is preserved by directly connecting the left and right image features, which avoids the loss of the absolute feature values in the correlation operation. The module dynamically balances these operations through learnable parameters, automatically adjusting their relative contributions based on image content. For texture-rich regions, correlation operations dominate to capture complex patterns, while in smooth areas, concatenation operations provide more-stable matching cues.
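To make this concrete, the following PyTorch sketch builds the three similarity volumes described above; the (B, C, D, H, W) layout, the group count, and the function name are illustrative assumptions rather than the paper's exact implementation:

```python
import torch


def build_cost_volumes(feat_l, feat_r, max_disp, num_groups=8):
    """Illustrative sketch of the three similarity measures above: global
    correlation, group-wise correlation, and concatenation."""
    b, c, h, w = feat_l.shape
    corr = feat_l.new_zeros(b, 1, max_disp, h, w)           # global correlation volume
    gwc = feat_l.new_zeros(b, num_groups, max_disp, h, w)   # group-wise correlation volume
    cat = feat_l.new_zeros(b, 2 * c, max_disp, h, w)        # concatenation volume
    for d in range(max_disp):
        l = feat_l[..., d:]                                 # left features at valid columns
        r = feat_r[..., : w - d]                            # right features shifted by d
        corr[:, 0, d, :, d:] = (l * r).mean(dim=1)
        gwc[:, :, d, :, d:] = (l * r).view(b, num_groups, c // num_groups, h, w - d).mean(dim=2)
        cat[:, :, d, :, d:] = torch.cat([l, r], dim=1)
    return corr, gwc, cat
```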
Attention-Guided Fusion Architecture. The fusion process employs a sophisticated attention mechanism that operates at multiple levels. We compute attention weights along the disparity dimension to emphasize likely matching positions while suppressing improbable matches. We implement a cross-scale spatial attention mechanism that identifies important image regions (e.g., object boundaries) and allocates more focus to these areas. A novel attention gate dynamically weights the contributions from different similarity computations (correlation vs. concatenation) at each spatial location. This multi-head attention system is implemented efficiently using separable convolutions and channel shuffling operations, maintaining computational efficiency while providing rich feature interactions.
Channel-Wise Feature Reweighting. The final component in FACV introduces a hierarchical channel attention mechanism, which processes individual feature channels to identify and amplify important matching cues while suppressing noisy ones. It models relationships between different feature channels to capture complex matching patterns, and then aggregates information across different receptive fields to handle both fine details and global context. The reweighting process uses a gated mechanism that combines both bottom-up (feature-driven) and top-down (task-driven) signals, ensuring that the network focuses on the most discriminative features for the current matching task.
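A plausible reading of this reweighting step is a squeeze-and-excitation-style gate over the cost-volume channels, sketched below; the reduction ratio, layer sizes, and class name are assumptions:

```python
import torch.nn as nn


class ChannelReweighting(nn.Module):
    """Gate each cost-volume channel by a learned weight derived from its
    global statistics (a minimal sketch, not the authors' exact design)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),                              # squeeze D, H, W
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, volume):                                    # volume: (B, C, D, H, W)
        return volume * self.gate(volume)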

3.2. MAA—Multi-Scale Attention Aggregation

Building upon the hierarchical design of FAMNet, the proposed MAA module introduces a pyramidal cost aggregation strategy across three distinct resolution levels. Its structure is shown in Figure 4. Specifically, cost volumes are constructed using hierarchical feature representations at 1/4, 1/8, and 1/16 scales. High-resolution features (1/4) retain fine-grained texture and edge information; mid-level features (1/8) capture intermediate spatial structures; and low-resolution features (1/16) encode global semantic context. The module incorporates three key design elements to maintain precision. Residual learning with skip connections at each scale mitigates gradient vanishing and preserves high-frequency details. Progressive upsampling via transposed 3D convolutions (stride = 2) enables hierarchical feature recovery while maintaining spatial coherence during resolution restoration. Feature refinement through convolutional residual blocks provides iterative disparity estimation refinement. Following FAMNet’s multi-scale cost aggregation paradigm, these features are progressively downsampled through strided 3D convolutions and then reintegrated via skip connections, ensuring both seamless information flow across scales and computational efficiency. This hierarchical architecture enables comprehensive scene understanding by combining local precision with global context awareness.
MAA employs a dual-path attention mechanism to dynamically enhance feature discriminability. The disparity attention pathway first applies global average and max pooling along the disparity dimension, processes the pooled features through weight-shared 3D CNNs (kernel size 1 × 1 × 1), and adjusts channel dimensions via the reduction rate R to generate disparity-specific attention weights through sigmoid activation.
$V_d = \mathrm{sigmoid}\Big( R \times f_{3D}^{1\times1\times1}\big( f_{\mathrm{avgpool}}^{\mathrm{global}}(V) + f_{\mathrm{maxpool}}^{\mathrm{global}}(V) \big) \Big) \times V + V$
where $V$ is the input cost volume, $V_d$ is the output cost volume, and $f_{3D}^{1\times1\times1}$, $f_{\mathrm{avgpool}}^{\mathrm{global}}$, and $f_{\mathrm{maxpool}}^{\mathrm{global}}$ denote the convolution and pooling operations, respectively.
Concurrently, the spatial attention pathway computes average and maximum values along the disparity dimension, concatenates these features, and refines them via a 3D CNN (1 × 7 × 7 kernel) to produce spatial attention weights. By jointly optimizing these pathways, the mechanism selectively emphasizes relevant depth levels and critical spatial regions, improving robustness in texture-less areas while maintaining computational efficiency.
$V_s = \mathrm{sigmoid}\Big( f_{3D}^{1\times7\times7}\big( f_{\mathrm{avgpool}}^{\mathrm{disparity}}(V_d) + f_{\mathrm{maxpool}}^{\mathrm{disparity}}(V_d) \big) \Big) \times V_d + V_d$
where $V_d$ is the output of the disparity attention above, $V_s$ is the output cost volume, and $f_{3D}^{1\times7\times7}$, $f_{\mathrm{avgpool}}^{\mathrm{disparity}}$, and $f_{\mathrm{maxpool}}^{\mathrm{disparity}}$ denote the convolution and pooling operations, respectively. Therefore, MAA is a lightweight cost aggregation module with reduced costly operations (3D convolutions) to enable real-time performance of FAMNet. The attention inside MAA further improves the prediction results.
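The following PyTorch sketch implements one possible reading of this dual-path attention, following the two equations above; the reduction ratio and the use of a sum (rather than concatenation) of the pooled maps are assumptions:

```python
import torch
import torch.nn as nn


class DualPathAttention(nn.Module):
    """Sketch of MAA's disparity and spatial attention pathways. Kernel sizes
    follow the text (1x1x1 and 1x7x7); channel sizes are assumptions."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        # Weight-shared 1x1x1 3D convs that squeeze and restore the channel dimension.
        self.disp_fc = nn.Sequential(
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
        )
        # 1x7x7 3D conv producing a single-channel spatial attention map.
        self.spatial_conv = nn.Conv3d(channels, 1, kernel_size=(1, 7, 7), padding=(0, 3, 3))

    def forward(self, v):                                   # v: (B, C, D, H, W)
        # Disparity attention: global avg/max pooling, shared MLP, sigmoid gate, residual.
        avg = self.disp_fc(v.mean(dim=(2, 3, 4), keepdim=True))
        mx = self.disp_fc(v.amax(dim=(2, 3, 4), keepdim=True))
        v_d = torch.sigmoid(avg + mx) * v + v
        # Spatial attention: avg/max pooling along disparity, 1x7x7 conv, sigmoid gate, residual.
        s = v_d.mean(dim=2, keepdim=True) + v_d.amax(dim=2, keepdim=True)
        v_s = torch.sigmoid(self.spatial_conv(s)) * v_d + v_d
        return v_s
```

In a full model, such a block could be applied to the cost volume at each pyramid level before the transposed-convolution upsampling described above.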

3.3. Disparity Regression and Loss Function

Disparity regression refines disparity estimation at the sub-pixel level by converting the matching costs in the cost volume into a probability distribution. A continuous disparity is then computed as the probability-weighted average via the soft argmin operation:
$\hat{d}_i = \sum_{d=0}^{d_{\max}-1} d \cdot \sigma(C_A)$
where $\hat{d}_i$ denotes the predicted disparity, $\sigma$ denotes the softmax operation, and $C_A$ denotes the aggregated cost volume. We use top-k regression [4], where the value of k is 2. To more efficiently upsample the disparity map, we utilize "super pixel" weights around pixels [26]. The whole network is trained in a supervised end-to-end manner using the smooth $L_1$ loss to measure the difference between predicted and ground-truth values:
$L = \mathrm{Smooth}_{L_1}\big(\hat{d}_i - d_{gt}\big),$
in which
$\mathrm{Smooth}_{L_1}(x) = \begin{cases} 0.5x^2, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise,} \end{cases}$
where $\hat{d}_i$ is the predicted disparity and $d_{gt}$ denotes the ground-truth disparity map.
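A minimal sketch of the regression and loss described above, assuming a (B, D, H, W) aggregated cost volume in which larger values indicate better matches:

```python
import torch
import torch.nn.functional as F


def regress_disparity(cost, k=2):
    """Top-k soft regression with k = 2, as in the text.
    Assumes larger cost values indicate better matches (negate otherwise)."""
    topk_val, topk_idx = cost.topk(k, dim=1)        # keep the k most likely hypotheses
    prob = F.softmax(topk_val, dim=1)               # probability over those k candidates
    return (prob * topk_idx.float()).sum(dim=1)     # expected (sub-pixel) disparity


def disparity_loss(pred, gt, max_disp=192):
    """Smooth L1 loss on valid ground-truth pixels, as in the loss definition above."""
    mask = (gt > 0) & (gt < max_disp)
    return F.smooth_l1_loss(pred[mask], gt[mask])
```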

4. Experiments

4.1. Datasets and Evaluation Metrics

The experiments employ a hybrid dataset approach combining synthetic (Scene Flow) and real-world (KITTI) data to ensure comprehensive validation. The Scene Flow dataset [7], a large-scale synthetic stereo image collection containing 35,454 training pairs and 4370 test pairs with ground-truth disparity maps, serves as the primary dataset for pre-training. Models are evaluated on this dataset using the endpoint error (EPE), which measures pixel-wise disparity deviation. For real-world automotive scenario validation, this study utilizes the KITTI benchmarks. The KITTI 2012 dataset [33], comprising 194 training images and 195 test images, employs a dual evaluation protocol: a 3-pixel error threshold over non-occluded regions (3px-noc) and over the full frame (3px-all). The subsequently released KITTI 2015 dataset [34] expands this with 200 balanced training–testing pairs while introducing enhanced evaluation metrics. These retain the EPE measurement while adding D1 error categorization, which quantifies the proportion of outliers (pixels whose disparity error exceeds both 3 pixels and 5% of the ground-truth value) across three spatial partitions: background regions (D1-bg), foreground objects (D1-fg), and composite scenes (D1-all). This hierarchical evaluation system enables granular performance analysis across various elements of autonomous driving scenarios.
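For reference, the two metrics can be computed as follows (a minimal sketch; the valid-pixel masking convention is an assumption):

```python
import torch


def epe_and_d1(pred, gt, max_disp=192):
    """End-point error and D1 outlier rate; an outlier has an error above 3 px
    and above 5% of the true disparity (the standard KITTI 2015 rule)."""
    mask = (gt > 0) & (gt < max_disp)              # valid ground-truth pixels only
    err = (pred[mask] - gt[mask]).abs()
    epe = err.mean()
    d1 = ((err > 3.0) & (err > 0.05 * gt[mask])).float().mean() * 100.0
    return epe.item(), d1.item()
```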

4.2. Implementation Details

Our experimental implementation is carried out using the PyTorch 2.0.1 framework on an NVIDIA (Santa Clara, CA, USA) RTX 3090 GPU platform. The models are trained in an end-to-end manner using the Adam [35] optimizer configured with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. To prepare the training data, the input images are randomly cropped to 256 × 512 pixels, with the maximum disparity set to 192. The training protocol consists of two distinct phases: initial pre-training and subsequent fine-tuning. The pre-training phase uses the Scene Flow dataset for 32 epochs with an initial learning rate of 0.001, which is scheduled to decrease at epochs 20, 26, 28, and 30. Following pre-training, the model is fine-tuned on a combined training set comprising both the KITTI 2012 and 2015 datasets for 600 epochs, maintaining the same initial learning rate of 0.001, which is then halved at the 300th epoch. This training strategy ensures robust model performance across different datasets while maintaining computational efficiency.
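A minimal sketch of the reported optimizer and learning-rate schedule is shown below; the model is a placeholder and the decay factor for the pre-training milestones is an assumption, since only the milestone epochs are stated:

```python
import torch

# Placeholder model; the KITTI schedule halves the learning rate at epoch 300.
model = torch.nn.Conv2d(3, 3, 3)  # stand-in for FAMNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
pretrain_sched = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[20, 26, 28, 30], gamma=0.5)  # Scene Flow, 32 epochs (gamma assumed)
finetune_sched = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[300], gamma=0.5)             # KITTI 2012 + 2015, 600 epochs
```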

4.3. Ablation Analysis

To evaluate the effectiveness of individual components in our proposed method, we performed systematic ablation studies on the Scene Flow and KITTI 2015 validation sets. The experimental setup involved training all models on the respective training sets of the Scene Flow and KITTI datasets following the previously described training strategies, with runtime performance measured on the NVIDIA RTX 3090 GPU platform.
Our baseline architecture uses 4D cost volume construction at 1/4 resolution with standard 3D convolutions for cost aggregation. The comprehensive experimental results, as presented in Table 1, demonstrate the significant performance gains achieved by each proposed component, while maintaining the real-time processing capabilities essential for practical applications. By sequentially incorporating the proposed modules, we observe that both FACV and MAA mechanisms contribute significantly to accuracy improvements on both datasets. FACV provides rich similarity measurements and contextual information for matching costs. The rich information is beneficial to the improvement in prediction results. The attention in MAA provides statistical information in the spatial and channel dimensions to regularize the cost volume. Notably, these improvements are achieved with mild increases in processing time compared to the baseline model, which maintains reasonable computational efficiency. A visualization of experimental data via a histogram plot is shown in Figure 5 to demonstrate distinct comparisons between ablation results.

4.4. Performance

Table 2 provides a detailed comparison between the proposed model and representative state-of-the-art stereo matching methods on the KITTI 2012 and KITTI 2015 benchmarks. The comparison is stratified into two categories: high-accuracy models and real-time models. Notably, the proposed method achieves a significant speed advantage (up to 5× faster) over non-real-time baselines such as PSMNet [5] and GwcNet [6], while maintaining comparable disparity estimation accuracy. Although high-performance models such as OpenStereo [36] and MonSter [37] are more accurate, our method retains a speed advantage of roughly 10×, which preserves its practical effectiveness. Among real-time networks, the proposed FAMNet exhibits competitive accuracy, with only a marginal performance gap compared to LightStereo [30], the cutting-edge real-time method. Resource consumption is a key metric for evaluating the efficiency of a lightweight model. We compare our model with state-of-the-art models in terms of computing resources and present the results in Table 3. All models in the table are lightweight, support real-time prediction, and require only modest computing resources. The resource consumption of our model is moderate among the listed models, while its accuracy ranks second. The parameter count in the table reflects memory overhead; our model achieves a 57% parameter reduction compared to LightStereo-M, substantially decreasing storage requirements and model loading time. While achieving better accuracy, our model reduces computational cost by about 20% compared to BGNet and CoEx. Thus, FAMNet achieves a better balance between prediction accuracy and resource consumption.
Qualitative results on KITTI 2012 and 2015 are shown in Figure 6 and Figure 7. For a fairer comparison, we also include the high-performance network CFNet [40]. From the disparity maps, it can be seen that FAMNet predicts accurate disparities for thin objects and non-continuous regions. The annotations in Figure 6 highlight the strong performance of our model. For the thin and tiny objects in the left parts of the first and third images in Figure 6, FAMNet predicts more accurate disparity maps; the disparities are more continuous and more distinct from the background. This can be attributed to the rich contextual and semantic information provided by the MAA module. For object outlines (non-continuous disparities) in the right parts of the images in Figure 6, the predicted disparities are more accurate and clearer for our model than for the other models, since the FACV module builds a cost volume with stronger geometric priors. As shown by the matching errors in Figure 8, our model achieves results comparable to other state-of-the-art models, all of which perform well in most regions. Owing to the effectiveness of the proposed modules, our model is competitively robust in ill-posed regions such as tiny structures, reflective glass, and over-exposed surfaces. Cross-domain performance is crucial to the practical capability of a stereo matching model. For the cross-domain evaluation, FAMNet and the other models are pre-trained only on the Scene Flow dataset and evaluated on Middlebury 2014 [41] and ETH 3D [42]. Table 4 highlights the strong generalization ability of our model, which achieves excellent performance on both the Middlebury and ETH3D datasets, rivaling recent high-performing methods such as CGI-Stereo [27] and Ghost-Stereo [28]. Figure 9 illustrates qualitative disparity predictions in zero-shot scenes, where the model resolves fine details and accurately estimates depth in regions containing small or distant objects. The error maps in Figure 9 also show that our model performs well in most regions, including tiny structures and texture-less surfaces; in particular, the disparities for the pencils, the details of the motorcycle, and the chair behind the motorcycle are distinct and accurate. Notably, our model also predicts newly appearing small objects robustly, further validating the effectiveness of our FACV and MAA modules in unseen scenes.

5. Conclusions

In this paper, we proposed FAMNet, a lightweight and generalizable stereo matching framework designed for real-time depth estimation in autonomous driving scenarios. To address the limitations of conventional 3D convolution-based networks, FAMNet introduces two key innovations. The FACV module integrates multi-scale correlation, attention-guided fusion, and channel reweighting to construct compact and expressive cost volumes with reduced computational overhead. The MAA module efficiently refines disparity estimation by aggregating hierarchical contextual information through pyramid-based processing and a dual-path attention mechanism. Extensive experiments on both synthetic and real-world datasets demonstrate that FAMNet achieves a compelling balance between accuracy, efficiency, and generalization. It outperforms most 3D convolution-heavy models in runtime while maintaining competitive disparity estimation accuracy. Compared to the cutting-edge method CoEx [26], FAMNet improves accuracy by 3% and 5% on the KITTI 2015 and KITTI 2012 datasets, respectively, and achieves up to a 22% improvement in generalization on the Middlebury 2014 and ETH 3D datasets, while maintaining real-time performance. Moreover, its strong cross-domain generalization on these benchmarks demonstrates robustness across diverse driving environments. With its modular design, low latency, and high accuracy, FAMNet offers a practical solution for onboard depth perception in autonomous vehicles. Although the proposed model achieves an excellent trade-off between speed and accuracy, a considerable accuracy gap remains between our model and most accuracy-oriented models, and the model is not optimized for ill-posed regions such as texture-less and reflective surfaces. In future work, we therefore plan to design more effective yet efficient modules that improve prediction accuracy with little loss of speed, and to extract richer geometric features efficiently to improve performance in ill-posed regions. We also aim to extend the framework to more challenging scenarios, such as motion-aware stereo matching under texture-less surfaces and varying illumination conditions.

Author Contributions

Conceptualization, J.Z., Q.T., N.Y. and X.L.; methodology, J.Z. and Q.T.; software, J.Z. and N.Y.; validation, Q.T., N.Y. and X.L.; formal analysis, J.Z., Q.T. and N.Y.; investigation, J.Z., Q.T. and N.Y.; resources, Q.T. and X.L.; data curation, J.Z., Q.T. and N.Y.; writing—original draft preparation, J.Z., Q.T. and N.Y.; writing—review and editing, Q.T. and X.L.; visualization, J.Z. and N.Y.; supervision, Q.T. and X.L.; project administration, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Research Fund of R&D and application demonstration of trusted multimodal large-scale model technology for industrial situation awareness and decision-making (No. Z241100001324010).

Data Availability Statement

The data are unavailable at present due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
FAMNet: Fusion Attention Multi-Scale Network
FACV: Fusion Attention-Based Cost Volume
MAA: Multi-Scale Attention Aggregation
GRU: Gated Recurrent Unit
GPU: Graphics Processing Unit
GwcNet: Group-Wise Correlation Network
CFNet: Cascade and Fused Cost Volume-Based Network
ACVNet: Attention Concatenation Volume Network
BGNet: Bilateral Grid Network
GANet: Guided Aggregation Network
AANet: Adaptive Aggregation Network
GCNet: Geometry and Context Network
PSMNet: Pyramid Stereo Matching Network
IGEV: Iterative Geometry-Encoding Volume

References

  1. Chen, C.; Seff, A.; Kornhauser, A.; Xiao, J. Deepdriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  2. Biswas, J.; Veloso, M. Depth camera based localization and navigation for indoor mobile robots. In Proceedings of the IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011. [Google Scholar]
  3. Alhaija, H.; Mustikovela, S.K.; Mescheder, L.; Geiger, A.; Rother, C. Augmented reality meets computer vision: Efficient data generation for urban driving scenes. Int. J. Comput. Vis. 2018, 126, 961–972. [Google Scholar] [CrossRef]
  4. Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P.; Kennedy, R.; Bachrach, A.; Bry, A. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  5. Chang, J.R.; Chen, Y.S. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  6. Guo, X.; Yang, K.; Yang, W.; Wang, X.; Li, H. Group-wise correlation stereo network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  7. Zhang, F.; Prisacariu, V.; Yang, R.; Torr, P.H.S. Ga-net: Guided aggregation net for end-to-end stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  8. Xu, G.; Cheng, J.; Guo, P.; Yang, X. Attention concatenation volume for accurate and efficient stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  9. Lipson, L.; Teed, Z.; Deng, J. Raft-stereo: Multilevel recurrent field transforms for stereo matching. In Proceedings of the International Conference on 3D Vision, Prague, Czech Republic, 1–3 December 2021. [Google Scholar]
  10. Xu, G.; Wang, X.; Ding, X.; Yang, X. Iterative geometry encoding volume for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  11. Xu, G.; Wang, X.; Zhang, Z.; Cheng, J.; Liao, C.; Yang, X. IGEV++: Iterative multi-range geometry encoding volumes for stereo matching. arXiv 2024, arXiv:2409.00638. [Google Scholar] [CrossRef] [PubMed]
  12. Liao, L.; Zeng, J.; Lai, T.; Xiao, Z.; Zou, F.; Fujita, H. Stereo matching on images based on volume fusion and disparity space attention. Eng. Appl. Artif. Intell. 2024, 136, 108902. [Google Scholar] [CrossRef]
  13. Lu, Y.; He, X.; Zhang, Q.; Zhang, D. Fast stereo conformer: Real-time stereo matching with enhanced feature fusion for autonomous driving. Eng. Appl. Artif. Intell. 2025, 149, 110565. [Google Scholar] [CrossRef]
  14. Tahmasebi, M.; Huq, S.; Meehan, K.; McAfee, M. DCVSMNet: Double cost volume stereo matching network. Neurocomputing 2025, 618, 129002. [Google Scholar] [CrossRef]
  15. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  16. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  17. Bhar, S.F.; Alhashim, I.; Wonka, P. AdaBins: Depth estimation using adaptive bins. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021. [Google Scholar]
  18. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  19. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  20. Li, Z.; Liu, X.; Drenkow, N.; Ding, A.; Creighton, F.X.; Taylor, R.H.; Unberath, M. Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  21. Khamis, S.; Fanello, S.; Rhemann, C.; Kowdle, A.; Valentin, J.L.; Izadi, S. Stereonet: Guided hierarchical refinement for real-time edge-aware depth prediction. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  22. Xu, H.; Zhang, J. Aanet: Adaptive aggregation network for efficient stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  23. Wang, Y.; Lai, Z.; Huang, G.; Wang, B.; Maaten, L.; Campbell, M.; Weinberger, K.Q. Anytime stereo image depth estimation on mobile devices. In Proceedings of the IEEE International Conference on Robotics and Automation, Montreal, BC, Canada, 20–24 May 2019. [Google Scholar]
  24. Duggal, S.; Wang, S.; Ma, W.; Hu, R.; Urtasun, R. Deeppruner: Learning efficient stereo matching via differentiable patchmatch. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  25. Xu, B.; Xu, Y.; Yang, X.; Jia, W.; Guo, Y. Bilateral grid learning for stereo matching networks. In Proceedings of the IEEE International Conference on Computer Vision, Virtual, 19–25 June 2021. [Google Scholar]
  26. Bangunharcana, A.; Cho, J.W.; Lee, S.; Kweon, I.S.; Kim, K.S.; Kim, S. Correlate-and-excite: Real-time stereo matching via guided cost volume excitation. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Prague, Czech Republic, 27 September–1 October 2021. [Google Scholar]
  27. Xu, G.; Zhou, H.; Yang, X. CGI-Stereo: Accurate and real-time stereo matching via context and geometry interaction. arXiv 2023, arXiv:2301.02789. [Google Scholar]
  28. Jiang, X.; Bian, X.; Guo, C. Ghost-Stereo: GhostNet-based cost volume enhancement and aggregation for stereo matching networks. arXiv 2024, arXiv:2405.14520. [Google Scholar]
  29. Pan, B.; Jiao, J.; Pang, J.; Cheng, J. Distill-then-prune: An efficient compression framework for real-time stereo matching network on edge devices. In Proceedings of the IEEE International Conference on Robotics and Automation, Yokohama, Japan, 13–17 May 2024. [Google Scholar]
  30. Guo, X.; Zhang, C.; Zhang, Y.; Zheng, W.; Nie, D.; Poggi, M.; Chen, L. Light Stereo: Channel boost is all you need for efficient 2D cost aggregation. arXiv 2025, arXiv:2406.19833v3. [Google Scholar]
  31. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015. [Google Scholar]
  32. Sandler, M.; Howard, A.; Zhu, M.L.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE International Conference on Computer Vision, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  33. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE International Conference on Computer Vision, Providence, RI, USA, 16–21 June 2012. [Google Scholar]
  34. Menze, M.; Geiger, A. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  35. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  36. Guo, X.; Zhang, C.; Lu, J.; Duan, Y.; Wang, Y.; Yang, T.; Zhu, Z.; Chen, L. OpenStereo: A comprehensive benchmark for stereo matching and strong baseline. arXiv 2023, arXiv:2312.00343. [Google Scholar]
  37. Cheng, J.; Liu, L.; Xu, G.; Wang, X.; Zhang, Z.; Deng, Y.; Zang, J.; Chen, Y.; Cai, Z.; Yang, X. MonSter: Marry monodepth to stereo unleashes power. arXiv 2025, arXiv:2501.08643. [Google Scholar] [CrossRef]
  38. Yang, J.; Wu, C.; Wang, G.; Xu, R.; Zhang, M.; Xu, Y. Guided aggregation and disparity refinement for real-time stereo matching. Signal Image Video Process. 2024, 18, 4467–4477. [Google Scholar] [CrossRef]
  39. Wu, Z.; Zhu, H.; He, L.; Zhao, Q.; Shi, J.; Wu, W. Real-time stereo matching with high accuracy via spatial attention-guided upsampling. Appl. Intell. 2023, 53, 24253–24274. [Google Scholar] [CrossRef]
  40. Shen, Z.; Dai, Y.; Rao, Z. CFNet: Cascade and fused cost volume for robust stereo matching. In Proceedings of the IEEE International Conference on Computer Vision, Virtual, 19–25 June 2021. [Google Scholar]
  41. Scharstein, D.; Hirschmüller, H.; Kitajima, Y.; Krathwohl, G.; Nešic, N.; Wang, X.; Westling, P. High-resolution stereo datasets with subpixel-accurate ground truth. In Proceedings of the German Conference on Pattern Recognition, Cham, Switzerland, 2–5 September 2014. [Google Scholar]
  42. Schöps, T.; Schönberger, J.L.; Galliani, S.; Sattler, T.; Schindler, K.; Pollefeys, M.; Geiger, A. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
Figure 1. Principle of depth estimation via symmetric stereo cameras. (a) Autonomous vehicle with symmetric stereo cameras. (b) Object on symmetric image planes. (c) Disparity estimated from pixel-to-pixel difference.
Figure 2. Architecture of FAMNet. It consists of a straight pipeline with four stages of feature extraction, cost volume construction, cost aggregation, and disparity regression.
Figure 3. Structure of Fusion Attention-based Cost Volume.
Figure 4. Structure of Multi-scale Attention Aggregation.
Figure 5. Histogram of ablation experiments.
Figure 6. Qualitative results of FAMNet compared to other models on KITTI 2012 dataset.
Figure 7. Qualitative results of FAMNet compared to other models on KITTI 2015 dataset.
Figure 8. Qualitative results of matching errors on KITTI 2015 dataset.
Figure 9. Visualization results of disparities and errors for FAMNet on Middlebury 2014.
Table 1. Ablation study of proposed model on Scene Flow and KITTI validation sets.

| Model | Scene Flow EPE (px) | KITTI 2015 D1-all (%) | KITTI 2015 EPE (px) | Runtime (ms) |
|---|---|---|---|---|
| Baseline | 0.89 | 2.59 | 0.73 | 19 |
| Baseline + FACV | 0.74 | 1.64 | 0.65 | 25 |
| Baseline + MAA | 0.72 | 1.61 | 0.64 | 26 |
| Baseline + FACV + MAA (Ours) | 0.62 | 1.49 | 0.59 | 31 |
Table 2. Quantitative results on KITTI online benchmarks.

| Model | KITTI 2012 3px-noc (%) | KITTI 2012 3px-all (%) | KITTI 2015 D1-bg (%) | KITTI 2015 D1-fg (%) | KITTI 2015 D1-all (%) | Platform | Runtime (ms) |
|---|---|---|---|---|---|---|---|
| Accuracy | | | | | | | |
| GANet-deep [7] | 1.19 | 1.60 | 1.48 | 3.46 | 1.81 | Tesla P40 | 1800 |
| PSMNet [5] | 1.49 | 1.89 | 1.86 | 4.62 | 2.32 | Titan X | 410 |
| GwcNet [6] | 1.32 | 1.70 | 2.21 | 6.16 | 2.11 | Titan X | 320 |
| GCNet [4] | 1.77 | 2.30 | 1.37 | 3.16 | 2.87 | Titan X | 900 |
| IGEV-Stereo [10] | 1.12 | 1.44 | 1.38 | 2.67 | 1.59 | RTX 3090 | 180 |
| Raft-Stereo [9] | 1.30 | 1.66 | 1.58 | 3.05 | 1.82 | RTX 6000 | 380 |
| CFNet [40] | 1.23 | 1.58 | 1.54 | 3.56 | 1.88 | Tesla V100 | 180 |
| ACVNet [8] | 1.13 | 1.47 | 1.37 | 3.07 | 1.65 | RTX 3090 | 250 |
| OpenStereo [36] | 1.00 | 1.26 | 1.28 | 2.26 | 1.44 | - | 290 |
| MonSter [37] | 0.84 | 1.09 | 1.13 | 2.81 | 1.41 | RTX 3090 | 450 |
| Speed | | | | | | | |
| StereoNet [21] | 4.91 | 6.02 | 4.30 | 7.45 | 4.83 | Titan X | 15 |
| AnyNet [23] | 2.20 | 2.66 | - | - | 2.71 | RTX 2080TI | 27 |
| AANet [22] | 1.91 | 2.42 | 1.99 | 5.39 | 2.55 | Tesla V100 | 62 |
| BGNet [25] | 1.77 | 2.15 | 2.07 | 4.74 | 2.51 | RTX 2080TI | 25 |
| Fast-ACVNet [8] | 1.68 | 2.13 | 1.82 | 3.93 | 2.17 | RTX 3090 | 39 |
| CoEx [26] | 1.55 | 1.93 | 1.79 | 3.82 | 2.13 | RTX 2080TI | 27 |
| GADR-Stereo [38] | - | - | 1.80 | - | 2.11 | RTX 3090 | 31 |
| SAGU-Net-fr [39] | 1.55 | 1.55 | 1.70 | 3.79 | 2.05 | RTX 3090 | 35 |
| Ghost-Stereo [28] | 1.45 | 1.80 | 1.71 | 3.77 | 2.05 | RTX 3090 | 37 |
| LightStereo-M [30] | 1.56 | 1.91 | 1.81 | 3.22 | 2.04 | RTX 3090 | 23 |
| FAMNet (Ours) | 1.53 | 1.81 | 1.72 | 3.43 | 2.06 | RTX 3090 | 31 |
Table 3. Comparisons of computing resources.

| Model | Params (M) | FLOPS (G) | D1-all (%) | Runtime (ms) |
|---|---|---|---|---|
| StereoNet [21] | 0.29 | 52 | 4.83 | 15 |
| BGNet [25] | 2.97 | 51 | 2.17 | 25 |
| Fast-ACVNet [8] | 2.56 | 36 | 2.13 | 39 |
| CoEx [26] | 2.69 | 50 | 2.13 | 27 |
| LightStereo-M [30] | 7.61 | 33 | 2.04 | 23 |
| FAMNet (Ours) | 3.26 | 41 | 2.06 | 31 |
Table 4. Cross-domain generalization performance.

| Model | Middlebury 2014 (>2 px) (%) | ETH 3D (>1 px) (%) |
|---|---|---|
| PSMNet [5] | 15.8 | 9.8 |
| GANet-deep [7] | 20.3 | 14.1 |
| CFNet [40] | 15.4 | 5.3 |
| CoEx [26] | 14.5 | 9 |
| BGNet [25] | 24.7 | 22.6 |
| CGI-Stereo [27] | 13.5 | 6.3 |
| Ghost-Stereo [28] | 11.4 | 7.3 |
| FAMNet (Ours) | 12.1 | 7.0 |
