Binocular Stereo Vision in Remote Sensing: A Review
Highlights
- This study synthesizes the technical progress and inherent limitations of representative models in the field of remote sensing stereo matching.
- The research provides a critical analysis of the characteristics and domain-specific constraints of major remote sensing stereo datasets.
- This study serves as a practical guide for researchers to select the most suitable models and datasets for specific remote sensing stereo matching tasks.
- This survey identifies the unresolved bottlenecks—large disparity ranges, ill-posed regions, cross-sensor domain shift, and label scarcity—that should guide the design of next-generation remote sensing stereo matching algorithms.
Abstract
1. Introduction
- (1)
- Empirically designed cost metrics: Traditional matching costs often rely on heuristic metrics, such as luminance difference and correlation coefficients. These hand-crafted measures are often insufficient for capturing complex radiometric variations in high-resolution images.
- (2)
- Localized cost aggregation: Cost aggregation is typically performed within a finite neighborhood with a fixed window size. This localized approach lacks the flexibility to adapt to varying terrain scales and often fails in regions with repetitive textures or depth discontinuities.
- (3)
- Dependency on post-processing: Disparity refinement heavily depends on a sequence of manual techniques, such as median filtering for smoothing and subpixel enhancement. These multi-stage pipelines increase computational complexity and require extensive parameter tuning.
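The three stages criticized above can be illustrated with a minimal sketch of a classical local matcher on a single rectified scanline. It is not taken from any of the cited methods: the function names, the window size, and the toy signal are assumptions chosen for illustration. The cost is a plain absolute luminance difference, aggregation uses a fixed window, and refinement chains parabolic subpixel interpolation with a median filter.

```python
from statistics import median


def sad_cost(left, right, x, d, half_win):
    """Absolute luminance difference summed over a fixed window (border-clamped)."""
    cost = 0.0
    for dx in range(-half_win, half_win + 1):
        xl = min(max(x + dx, 0), len(left) - 1)
        xr = min(max(x + dx - d, 0), len(right) - 1)  # left x matches right x - d
        cost += abs(left[xl] - right[xr])
    return cost


def match_scanline(left, right, max_disp, half_win=1):
    """Winner-takes-all over the fixed range 0..max_disp, then a parabolic subpixel fit."""
    disparities = []
    for x in range(len(left)):
        costs = [sad_cost(left, right, x, d, half_win) for d in range(max_disp + 1)]
        best = min(range(len(costs)), key=costs.__getitem__)
        offset = 0.0
        if 0 < best < max_disp:
            c0, c1, c2 = costs[best - 1], costs[best], costs[best + 1]
            curvature = c0 - 2.0 * c1 + c2
            if curvature > 0:
                # Vertex of the parabola through the three neighbouring costs.
                offset = 0.5 * (c0 - c2) / curvature
        disparities.append(best + offset)
    return disparities


def median_filter(values, half_win=1):
    """Post-processing: sliding median with windows truncated at the borders."""
    out = []
    for i in range(len(values)):
        lo, hi = max(i - half_win, 0), min(i + half_win + 1, len(values))
        out.append(median(values[lo:hi]))
    return out


# Hypothetical toy scanlines: the right view is the left view shifted by 2 pixels.
left = [3, 7, 1, 9, 4, 12, 6, 15, 2, 11, 8, 14]
right = left[2:] + [0, 0]
disparity = median_filter(match_scanline(left, right, max_disp=4))
```

Every stage exposes a hand-set parameter (`half_win`, `max_disp`, the filter width), which is the tuning burden the list above describes.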
- (1)
- Computational inefficiency: High-resolution inputs from satellite sensors significantly increase computational demands. For instance, PSMNet [17] requires 576 ms to process a remote sensing image on an NVIDIA GTX 1080Ti. BGA-Net [12] takes 1572 ms on the same GPU, and MaskCRNet [18] requires 724 ms on an NVIDIA RTX 3090. These requirements hinder real-time use and complicate deployment on edge devices.
- (2)
- Reduced accuracy in remote sensing scenes: Remote sensing imagery poses several challenges, including occlusions in urban environments, repetitive textures such as forests and water surfaces, seasonal appearance changes, illumination variations, changes caused by human activity, and the presence of small-scale targets. As illustrated in Figure 1, remote sensing stereo pairs often exhibit substantial appearance variations across views. These factors introduce significant ambiguities that degrade the performance of stereo matching models trained primarily on conventional terrestrial datasets. For example, PSMNet [17] has a three-pixel error rate of 24.81% on the remote sensing WHU-Stereo dataset [19], compared with only 1.89% on the terrestrial KITTI 2012 dataset [20].
- (3)
- Dependence on predefined priors: Many stereo matching methods rely on a fixed disparity range (e.g., 0–192), typically determined by the camera baseline and image resolution. This assumption limits model adaptability to the varying imaging geometries present in satellite-based systems. Moreover, the disparity distributions in remote sensing datasets, such as US3D [21] and WHU-Stereo [19], differ markedly from those in terrestrial datasets like KITTI [20] and ETH3D [22]. As illustrated in Figure 2a, remote sensing disparities often span ranges that include negative values (e.g., −64 to 64 in US3D), whereas terrestrial datasets commonly cover non-negative ranges (e.g., 0–192 in KITTI). This disparity mismatch not only impedes model generalization but also complicates joint training across domains, thereby limiting the transferability of terrestrial datasets and pretrained models to remote sensing scenarios.
- (4)
- Limited data: The scarcity of labeled stereo image pairs in remote sensing significantly constrains the generalization ability of deep learning models, particularly in supervised settings. As shown in Figure 2b, the WHU-Stereo dataset provides only 1757 annotated samples, whereas the widely used terrestrial SceneFlow dataset includes 39,049 samples. This severe data imbalance poses a major obstacle to training robust and transferable stereo matching models in remote sensing domains.
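The three-pixel error rate quoted above can be made concrete with a short sketch. It assumes one common definition, the percentage of labelled pixels whose absolute disparity error exceeds 3 pixels. Benchmarks differ in details such as masking, and the exact protocol behind the quoted figures is not restated here. The function name and the use of `None` for unlabelled pixels are illustrative choices.

```python
def bad_pixel_rate(pred, gt, threshold=3.0):
    """Percentage of labelled pixels whose absolute disparity error exceeds `threshold`.

    Ground-truth entries equal to None are treated as unlabelled and skipped.
    """
    errors = [abs(p - g) for p, g in zip(pred, gt) if g is not None]
    if not errors:
        raise ValueError("no labelled pixels to evaluate")
    return 100.0 * sum(e > threshold for e in errors) / len(errors)


# Hypothetical values: three labelled pixels, one of which is off by 4 pixels.
pred = [10.0, 12.0, 20.0, 5.0]
gt = [10.0, 16.0, 20.5, None]
rate = bad_pixel_rate(pred, gt)
```

Because only labelled pixels enter the denominator, sparse ground truth (common in remote sensing) can make the rate sensitive to where labels happen to exist.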
2. Traditional Stereo Matching Methods
3. Deep Learning-Based Stereo Matching Methods
3.1. Combining Deep Learning and Traditional Algorithms
3.2. End-to-End Supervised Algorithms
3.2.1. 2D Convolution-Based Models
3.2.2. 3D Convolution-Based Models
3.2.3. Iterative Optimization-Based Models
3.2.4. Transformer-Based Models
3.2.5. Multi-Task Learning-Based Models
3.2.6. Integrating Vision Foundation Models
3.2.7. Other Emerging Models
3.3. Weakly-, Semi-, and Self-Supervised Algorithms
3.3.1. Self-Supervised Models
3.3.2. Semi-Supervised Models
3.3.3. Weakly-Supervised Models
3.4. Advantages and Limitations
4. Acceleration of Stereo Matching Models
4.1. Lightweight Design
4.2. Compression
4.2.1. Model Pruning
4.2.2. Model Quantization
5. Remote Sensing Stereo Matching Datasets
5.1. Development of Stereo Datasets
5.2. Dataset Characteristics
- (1)
- Ground Sampling Distance (GSD). The GSD defines the spatial resolution of a remote sensing image: it is the real-world distance on the ground between the centers of two adjacent pixels. Remote sensing datasets exhibit diverse spatial resolutions, for instance US3D at approximately 0.3 m, WHU-Stereo at around 0.8 m, GaoFen-7 at about 0.8 m, and UAVStereo ranging from 0.05 m to 0.2 m. In satellite stereo imagery, GSD is determined by the imaging altitude H, focal length f, and sensor pixel size p, following the relation $\mathrm{GSD} = H \cdot p / f$. A smaller GSD provides finer spatial detail, allowing more accurate disparity estimation and 3D reconstruction, but also increases the data volume and computational burden. Conversely, larger GSD values reduce spatial precision and may obscure small objects such as vehicles or trees, thereby complicating stereo correspondence.
- (2)
- Acquisition mode and processing methods. A further source of variability arises from acquisition strategies. Certain datasets (e.g., US3D, SatStereo) are constructed from incidental multi-temporal views of the same geographic location. These inevitably introduce seasonal and illumination changes, thereby creating radiometric inconsistencies between the left and right images. By contrast, other datasets (e.g., WHU-Stereo, GaoFen-7, ISPRS 2021) are based on near-simultaneous acquisitions, which mitigate such inconsistencies but still retain geometric and atmospheric challenges.
- (3)
- Disparity range. Remote sensing datasets also exhibit distinctive disparity distributions. Unlike computer vision benchmarks such as KITTI (0–192) or SceneFlow (0–256), the disparity range in RS data is often narrower and may include negative values (e.g., −64 to 64 in US3D). This has direct implications for algorithm design: methods that assume a fixed, non-negative disparity range become less suitable, and adaptive cost volume construction, hierarchical search, or multi-scale disparity estimation strategies are needed instead.
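The consequence of negative disparities for a fixed search range can be sketched on a toy scanline. The code below is an illustrative assumption, not a method from the surveyed literature: `window_cost` and `best_disparity` are hypothetical helpers, and the signal is synthetic. It compares a signed candidate range with the non-negative range assumed by many terrestrial pipelines.

```python
def window_cost(left, right, x, d, half_win=1):
    """Mean absolute difference between a left window at x and a right window at x - d."""
    total, count = 0.0, 0
    for dx in range(-half_win, half_win + 1):
        xl, xr = x + dx, x + dx - d
        if 0 <= xl < len(left) and 0 <= xr < len(right):
            total += abs(left[xl] - right[xr])
            count += 1
    return total / count if count else float("inf")


def best_disparity(left, right, x, d_min, d_max, half_win=1):
    """Winner-takes-all over an explicit, possibly negative, integer range."""
    return min(range(d_min, d_max + 1),
               key=lambda d: window_cost(left, right, x, d, half_win))


base = [3, 7, 1, 9, 4, 12, 6, 15, 2, 11, 8, 14]
left = base
right = [0, 0] + base[:-2]   # right[x] == left[x - 2], so the true disparity is -2

signed = best_disparity(left, right, x=5, d_min=-4, d_max=4)
unsigned = best_disparity(left, right, x=5, d_min=0, d_max=4)
```

With the signed range the search can reach the zero-cost match at −2. With the non-negative range the correct candidate is never evaluated, so the returned disparity is wrong regardless of how good the matching cost is.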
5.3. Dataset Challenges
- (1)
- Limited data volume. Most RS stereo datasets are relatively small compared to terrestrial benchmarks, which restricts the training of deep stereo networks and limits their ability to learn complex spatial and semantic patterns.
- (2)
- Seasonal and environmental variations. In datasets with multi-temporal acquisition, differences in season, weather, and illumination between left and right images substantially complicate disparity estimation and require models with stronger generalization capability.
- (3)
- Distinct disparity statistics. The disparity distributions in RS imagery differ markedly from those of everyday scenes, diminishing the effectiveness of pre-training on large-scale synthetic datasets designed for CV applications.
- (4)
- High frequency of small targets. RS scenes often contain numerous fine-scale objects, such as buildings, trees, and vehicles. Accurate disparity estimation for such targets requires stereo models to maintain high spatial resolution and fine-grained matching precision.
- (5)
- Irregular and repetitive textures. Surfaces such as vegetation, water bodies, and urban facades frequently exhibit repetitive or ambiguous patterns. These characteristics create challenges for traditional correspondence search and necessitate algorithms that integrate semantic and geometric context.
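The effect of illumination differences between views, noted under seasonal and environmental variations, can be sketched by comparing two hand-crafted patch measures. The example assumes the simplest possible radiometric change, a global gain and offset. Real multi-date changes are far more complex, so this only shows why raw luminance differences are fragile, not that correlation solves the problem. All names and values are illustrative.

```python
from math import sqrt


def sad(a, b):
    """Sum of absolute luminance differences between two patches."""
    return sum(abs(x - y) for x, y in zip(a, b))


def zncc(a, b):
    """Zero-mean normalized cross-correlation; returns 0.0 for a textureless patch."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return num / den if den > 0 else 0.0


# Hypothetical patch and the same patch under a brighter exposure.
patch = [10, 40, 25, 60, 15]
brighter = [1.5 * v + 20 for v in patch]

sad_score = sad(patch, brighter)    # large, although the content is identical
zncc_score = zncc(patch, brighter)  # close to 1, invariant to gain and offset
```

The zero denominator branch also hints at the repetitive and textureless case: on a flat patch such as calm water the correlation carries no information at all.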
6. Future Work and Suggestions
6.1. Model Design
- (1)
- Development of high-accuracy, remote sensing-oriented algorithms. Remote sensing imagery exhibits distinctive characteristics such as seasonal variation, small object size, and repetitive textures across large areas. Current models, often adapted from terrestrial benchmarks, struggle to capture these patterns effectively. Future work should focus on designing specialized architectures that explicitly model the geometric, spectral, and temporal properties of remote sensing imagery, improving both robustness and precision in complex scenes. Meanwhile, the integration of large-scale vision foundation models and multi-modal satellite data (e.g., SAR–optical fusion) will likely define the next phase of RS stereo research.
- (2)
- Lightweight and efficient stereo matching networks. Although deep learning-based methods have achieved remarkable accuracy, their high computational cost limits practical deployment. With increasing demands for onboard processing on UAVs, satellites, and embedded systems, future models must balance accuracy, model complexity, and runtime. Lightweight multi-scale architectures, efficient cost aggregation, and adaptive attention mechanisms could facilitate real-time inference in resource-constrained environments. Even though 3D reconstruction tasks prioritize accuracy, edge-based inference remains essential for onboard satellite or UAV applications that require immediate geometric feedback.
- (3)
- Intelligent and adaptive learning under label scarcity. The limited availability of annotated stereo pairs and large domain gaps between terrestrial and aerial imagery restrict supervised learning. Future research should explore adaptive and self-evolving learning paradigms, such as self-supervised, weakly supervised, or meta-learning frameworks, integrated with foundation models and domain adaptation to enhance generalization across sensors and conditions.
- (4)
- Unified multi-task stereo matching frameworks. Stereo matching in remote sensing is often coupled with tasks like semantic segmentation, object detection, and 3D reconstruction. Developing unified architectures that enable mutual task reinforcement while maintaining interpretability and computational efficiency could lead to comprehensive and scene-aware 3D understanding in remote sensing.
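To illustrate how learning under label scarcity can proceed without ground-truth disparity, the sketch below computes a photometric reconstruction loss on one rectified scanline. It is a simplified assumption about how self-supervised stereo objectives are commonly built, not the loss of any specific model in this survey. Practical systems typically add further terms, and the function names are hypothetical.

```python
def warp_right_to_left(right, disparity):
    """Resample the right scanline at x - d using linear interpolation.

    Returns None where the sample falls outside the right image.
    """
    n = len(right)
    warped = []
    for x, d in enumerate(disparity):
        xs = x - d
        if xs < 0 or xs > n - 1:
            warped.append(None)
            continue
        x0 = int(xs)               # floor, since xs >= 0 here
        x1 = min(x0 + 1, n - 1)
        w = xs - x0
        warped.append((1.0 - w) * right[x0] + w * right[x1])
    return warped


def photometric_loss(left, right, disparity):
    """Mean absolute difference between the left scanline and the warped right one."""
    warped = warp_right_to_left(right, disparity)
    residuals = [abs(l - r) for l, r in zip(left, warped) if r is not None]
    if not residuals:
        raise ValueError("disparity maps every pixel outside the right image")
    return sum(residuals) / len(residuals)


left = [3, 7, 1, 9, 4, 12, 6, 15, 2, 11, 8, 14]
right = left[2:] + [0, 0]          # left[x] == right[x - 2]: true disparity is 2

loss_true = photometric_loss(left, right, [2.0] * len(left))
loss_zero = photometric_loss(left, right, [0.0] * len(left))
```

The correct disparity drives the loss to zero on this toy signal, which is the supervisory signal exploited in place of labels. The same loss is misleading when the two views differ radiometrically, as in multi-temporal satellite pairs.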
6.2. Dataset Construction
- (1)
- Large-scale and diverse benchmarks. Future datasets should encompass a broad range of Ground Sampling Distances (GSDs)—from centimeter-level UAV imagery to meter-level satellite data—and include both simultaneous and multi-temporal acquisitions to assess robustness under illumination, seasonal, and geometric variations. Integrating negative disparity ranges and metadata such as viewing geometry or land-cover categories will further enhance their utility.
- (2)
- Task-driven and standardized dataset design. Beyond disparity estimation alone, future benchmarks should be constructed with downstream applications in mind, such as 3D reconstruction, digital surface modeling, and change detection. Building datasets with consistent annotation standards, unified evaluation metrics, and open-source protocols will foster fair comparison, reproducibility, and progress across the community.
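The span from centimeter-level UAV imagery to meter-level satellite data can be worked through numerically with the usual nadir-viewing pinhole approximation, GSD = H · p / f, where H is the imaging altitude, p the sensor pixel size, and f the focal length. The sensor parameters below are hypothetical round numbers chosen only to land in those two regimes. They do not describe any dataset or platform named in this review.

```python
from math import ceil


def ground_sampling_distance(altitude_m, pixel_size_m, focal_length_m):
    """GSD in meters per pixel under the nadir pinhole approximation H * p / f."""
    if focal_length_m <= 0:
        raise ValueError("focal length must be positive")
    return altitude_m * pixel_size_m / focal_length_m


def pixels_to_cover(extent_m, gsd_m):
    """Number of pixels needed along one axis to cover a ground extent."""
    return ceil(extent_m / gsd_m)


# Hypothetical satellite: 500 km altitude, 7 micrometer pixels, 5 m focal length.
satellite_gsd = ground_sampling_distance(500e3, 7e-6, 5.0)

# Hypothetical UAV: 100 m altitude, 4 micrometer pixels, 8 mm focal length.
uav_gsd = ground_sampling_distance(100.0, 4e-6, 8e-3)

# Pixels needed along a 1 km line at the satellite resolution.
satellite_pixels_per_km = pixels_to_cover(1000.0, satellite_gsd)
```

Because the pixel count per axis scales inversely with GSD, the image area in pixels grows quadratically as GSD shrinks, which is why a benchmark spanning both regimes also spans very different data volumes and compute budgets.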
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Pourrahmati, M.R.; Baghdadi, N.; Scolforo, H.F.; Alvares, C.A.; Stape, J.L.; Fayad, I.; Le Maire, G. Integration of very high-resolution stereo satellite images and airborne or satellite LiDAR for eucalyptus canopy height estimation. Sci. Remote Sens. 2024, 10, 100170.
- Venkatesan, V.; Panangian, D.; Reyes, M.F.; Bittner, K. SyntStereo2Real: Edge-aware GAN for remote sensing image-to-image translation while maintaining stereo constraint. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–21 June 2024; pp. 512–521.
- Wang, X.; Jiang, L.; Wang, F.; You, H.; Xiang, Y. Disparity refinement for stereo matching of high-resolution remote sensing images based on GIS data. Remote Sens. 2024, 16, 487.
- Zhang, J.; Huang, L.; Bai, X.; Zheng, J.; Gu, L.; Hancock, E. Exploring the usage of pre-trained features for stereo matching. Int. J. Comput. Vis. 2024, 132, 4305–4326.
- Scharstein, D.; Szeliski, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 2002, 47, 7–42.
- Boykov, Y.Y.; Jolly, M.P. Interactive graph cuts for optimal boundary & region segmentation of objects in ND images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Vancouver, BC, Canada, 7–14 July 2001; Volume 1, pp. 105–112.
- Hirschmuller, H. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 328–341.
- Bleyer, M.; Rhemann, C.; Rother, C. PatchMatch stereo-stereo matching with slanted support windows. In Proceedings of the British Machine Vision Conference (BMVC), Dundee, UK, 29 August–2 September 2011; Volume 11, pp. 1–11.
- Mayer, N.; Ilg, E.; Hausser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4040–4048.
- Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P.; Kennedy, R.; Bachrach, A.; Bry, A. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 66–75.
- Tao, R.; Xiang, Y.; You, H. Stereo matching of VHR remote sensing images via bidirectional pyramid network. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Waikoloa, HI, USA, 26 September–2 October 2020; pp. 6742–6745.
- Rao, Z.; He, M.; Zhu, Z.; Dai, Y.; He, R. Bidirectional guided attention network for 3-D semantic detection of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6138–6153.
- Teed, Z.; Deng, J. RAFT: Recurrent all-pairs field transforms for optical flow. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 402–419.
- Lipson, L.; Teed, Z.; Deng, J. RAFT-Stereo: Multilevel recurrent field transforms for stereo matching. In Proceedings of the International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021; pp. 218–227.
- Li, Z.; Liu, X.; Drenkow, N.; Ding, A.; Creighton, F.X.; Taylor, R.H.; Unberath, M. Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 6197–6206.
- Lin, L.; Zhang, Y.; Wang, Z.; Zhang, L.; Liu, X.; Wang, Q. A-SATMVSNet: An attention-aware multi-view stereo matching network based on satellite imagery. Front. Earth Sci. 2023, 11, 1108403.
- Chang, J.R.; Chen, Y.S. Pyramid stereo matching network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 5410–5418.
- Rao, Z.; Li, X.; Xiong, B.; Dai, Y.; Shen, Z.; Li, H.; Lou, Y. Cascaded recurrent networks with masked representation learning for stereo matching of high-resolution satellite images. ISPRS J. Photogramm. Remote Sens. 2024, 218, 151–165.
- Li, S.; He, S.; Jiang, S.; Jiang, W.; Zhang, L. WHU-Stereo: A challenging benchmark for stereo matching of high-resolution satellite images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5603914.
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? the KITTI vision benchmark suite. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
- Bosch, M.; Foster, K.; Christie, G.; Wang, S.; Hager, G.D.; Brown, M. Semantic stereo for incidental satellite images. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1524–1532.
- Schops, T.; Schonberger, J.L.; Galliani, S.; Sattler, T.; Schindler, K.; Pollefeys, M.; Geiger, A. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3260–3269.
- Hirschmuller, H.; Scharstein, D. Evaluation of cost functions for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, MN, USA, 18–23 June 2007; pp. 1–8.
- Kordelas, G.A.; Alexiadis, D.S.; Daras, P.; Izquierdo, E. Content-based guided image filtering, weighted semi-global optimization, and efficient disparity refinement for fast and accurate disparity estimation. IEEE Trans. Multimed. 2016, 18, 155–170.
- Facciolo, G.; De Franchis, C.; Meinhardt, E. MGM: A significantly more global matching for stereovision. In Proceedings of the British Machine Vision Conference (BMVC), Swansea, UK, 7–10 September 2015; pp. 1–13.
- Patil, S.; Prakash, T.; Comandur, B.; Kak, A. A comparative evaluation of SGM variants (including a new variant, tMGM) for dense stereo matching. arXiv 2019, arXiv:1911.09800.
- Lee, H.Y.; Kim, T.; Park, W.; Lee, H.K. Extraction of digital elevation models from satellite stereo images through stereo matching based on epipolarity and scene geometry. Image Vis. Comput. 2003, 21, 789–796.
- Ghuffar, S. Satellite stereo based digital surface model generation using semi global matching in object and image space. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2016, 3, 63–68.
- Qin, R. A critical analysis of satellite stereo pairs for digital surface model generation and a matching quality prediction model. ISPRS J. Photogramm. Remote Sens. 2019, 154, 139–150.
- Xia, Y.; d’Angelo, P.; Tian, J.; Fraundorfer, F.; Reinartz, P. Multi-label learning based semi-global matching forest. Remote Sens. 2020, 12, 1069.
- Wang, Y.; Qin, A.; Hao, Q.; Dang, J. Semi-global stereo matching of remote sensing images combined with speeded up robust features. Acta Opt. Sin. 2020, 40, 1628003–1628012.
- Tatar, N.; Arefi, H.; Hahn, M. High-resolution satellite stereo matching by object-based semiglobal matching and iterative guided edge-preserving filter. IEEE Geosci. Remote Sens. Lett. 2021, 18, 1841–1845.
- Zhao, L.; Liu, Y.; Men, C.; Men, Y. Double propagation stereo matching for urban 3-D reconstruction from satellite imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5601717.
- Zhang, Y.; Zou, S.; Liu, X.; Huang, X.; Wan, Y.; Yao, Y. LiDAR-guided stereo matching with a spatial consistency constraint. ISPRS J. Photogramm. Remote Sens. 2022, 183, 164–177.
- Zhao, J.; Chen, X.; Hou, W.; Han, J. Stereo matching based on urban satellite remote sensing image pair. Opt. Precis. Eng. 2022, 30, 830–839.
- Zou, S.; Liu, X.; Huang, X.; Zhang, Y.; Wang, S.; Wu, S.; Zheng, Z.; Liu, B. Edge-preserving stereo matching using LiDAR points and image line features. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
- Yue, Y.; Fang, T.; Li, W.; Chen, M.; Xu, B.; Ge, X.; Hu, H.; Zhang, Z. Hierarchical edge-preserving dense matching by exploiting reliably matched line segments. Remote Sens. 2023, 15, 4311.
- Žbontar, J.; LeCun, Y. Computing the stereo matching cost with a convolutional neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1592–1599.
- Zhang, F.; Prisacariu, V.; Yang, R.; Torr, P.H. GA-Net: Guided aggregation net for end-to-end stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 185–194.
- Duggal, S.; Wang, S.; Ma, W.C.; Hu, R.; Urtasun, R. DeepPruner: Learning efficient stereo matching via differentiable PatchMatch. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4384–4393.
- Gómez, A.; Randall, G.; Facciolo, G.; von Gioi, R.G. An experimental comparison of multi-view stereo approaches on satellite images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2022; pp. 844–853.
- Gómez, A.; Randall, G.; Facciolo, G.; von Gioi, R.G. Improving the pair selection and the model fusion steps of satellite multi-view stereo pipelines. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–7 January 2023; pp. 6344–6353.
- Albanwan, H.; Qin, R. A comparative study on deep-learning methods for dense image matching of multi-angle and multi-date remote sensing stereo-images. Photogramm. Rec. 2022, 37, 385–409.
- Gao, J.; Liu, J.; Ji, S. A general deep learning based framework for 3D reconstruction from multi-view stereo satellite images. ISPRS J. Photogramm. Remote Sens. 2023, 195, 446–461.
- Zheng, Z.; Wan, Y.; Zhang, Y.; Hu, Z.; Wei, D.; Yao, Y.; Zhu, C.; Yang, K.; Xiao, R. Digital surface model generation from high-resolution satellite stereos based on hybrid feature fusion network. Photogramm. Rec. 2024, 39, 36–66.
- Xu, H.; Zhang, J. AANet: Adaptive aggregation network for efficient stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 1959–1968.
- Badki, A.; Troccoli, A.; Kim, K.; Kautz, J.; Sen, P.; Gallo, O. Bi3D: Stereo depth estimation via binary classifications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 1600–1608.
- Tankovich, V.; Hane, C.; Zhang, Y.; Kowdle, A.; Fanello, S.; Bouaziz, S. HITNet: Hierarchical iterative tile refinement network for real-time stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 14362–14372.
- Tosi, F.; Liao, Y.; Schmitt, C.; Geiger, A. SMD-Nets: Stereo mixture density networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 8942–8952.
- Ji, S.; Liu, J.; Lu, M. CNN-based dense image matching for aerial remote sensing images. Photogramm. Eng. Remote Sens. 2019, 85, 415–424.
- Wang, Y.; Gong, D.; Hu, H.; Wang, S.; Han, Y.; Wang, Y.; Ma, X. State of the art in dense image matching cost computation for high-resolution satellite stereo. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2021, 43, 109–114.
- Shen, Z.; Dai, Y.; Rao, Z. CFNet: Cascade and fused cost volume for robust stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13906–13915.
- Shen, Z.; Dai, Y.; Song, X.; Rao, Z.; Zhou, D.; Zhang, L. PCW-Net: Pyramid combination and warping cost volume for stereo matching. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 280–297.
- Li, X.; Fan, Y.; Lv, G.; Ma, H. Area-based correlation and non-local attention network for stereo matching. Vis. Comput. 2022, 38, 3881–3895.
- Li, X.; Fan, Y.; Rao, Z.; Lv, G.; Liu, S. Synthetic-to-real domain adaptation joint spatial feature transform for stereo matching. IEEE Signal Process. Lett. 2022, 29, 60–64.
- Xu, P.; Xiang, Z.; Qiao, C.; Fu, J.; Pu, T. Adaptive multi-modal cross-entropy loss for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 5135–5144.
- Chen, Q.; Ge, B.; Quan, J. Unambiguous pyramid cost volumes fusion for stereo matching. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 9223–9236.
- He, S.; Li, S.; Jiang, S.; Jiang, W. HMSM-Net: Hierarchical multi-scale matching network for disparity estimation of high-resolution satellite stereo images. ISPRS J. Photogramm. Remote Sens. 2022, 188, 314–330.
- Jiang, L.; Wang, F.; Zhang, W.; Li, P.; You, H.; Xiang, Y. Rethinking the Key Factors for the Generalization of Remote Sensing Stereo Matching Networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 4936–4948.
- Tao, R.; Xiang, Y.; You, H. A confidence-aware cascade network for multi-scale stereo matching of very-high-resolution remote sensing images. Remote Sens. 2022, 14, 1667.
- Wu, T.; Vallet, B.; Pierrot-Deseilligny, M. PSMNet-FusionX3: LiDAR-guided deep learning stereo dense matching on aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 18–22 June 2023; pp. 6527–6536.
- Kim, J.; Cho, S.; Chung, M.; Kim, Y. Improving Disparity Consistency With Self-Refined Cost Volumes for Deep Learning-Based Satellite Stereo Matching. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 9262–9278.
- Xu, Z.; Jiang, Y.; Wang, J.; Wang, Y. A Dual Branch Multiscale Stereo Matching Network for High-Resolution Satellite Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 949–964.
- Xu, G.; Wang, X.; Ding, X.; Yang, X. Iterative geometry encoding volume for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 21919–21928.
- Xu, G.; Wang, X.; Zhang, Z.; Cheng, J.; Liao, C.; Yang, X. IGEV++: Iterative multi-range geometry encoding volumes for stereo matching. arXiv 2024, arXiv:2409.00638.
- Chen, Z.; Long, W.; Yao, H.; Zhang, Y.; Wang, B.; Qin, Y.; Wu, J. MoCha-Stereo: Motif channel attention network for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 27768–27777.
- Wang, X.; Xu, G.; Jia, H.; Yang, X. Selective-Stereo: Adaptive frequency information selection for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 19701–19710.
- Zeng, J.; Yao, C.; Wu, Y.; Jia, Y. Temporally consistent stereo matching. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 341–359.
- Feng, M.; Cheng, J.; Jia, H.; Liu, L.; Xu, G.; Yang, X. MC-Stereo: Multi-peak lookup and cascade search range for stereo matching. In Proceedings of the International Conference on 3D Vision (3DV), Davos, Switzerland, 18–21 March 2024; pp. 344–353.
- Patil, S.; Guo, Q. Stellar: A large satellite stereo dataset for digital surface model generation. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2023, 48, 433–440.
- Rao, Z.; He, M.; Dai, Y.; Shen, Z. Sliding space-disparity transformer for stereo matching. Neural Comput. Appl. 2022, 34, 21863–21876.
- Lou, J.; Liu, W.; Chen, Z.; Liu, F.; Cheng, J. ELFNet: Evidential local-global fusion for stereo matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 17784–17793.
- Min, J.; Jeon, Y.; Kim, J.; Choi, M. S2M2: Scalable Stereo Matching Model for Reliable Depth Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 19–23 October 2025; pp. 26729–26739.
- Wei, K.; Huang, X.; Li, H. Stereo matching method for remote sensing images based on attention and scale fusion. Remote Sens. 2024, 16, 387.
- Dovesi, P.L.; Poggi, M.; Andraghetti, L.; Martí, M.; Kjellström, H.; Pieropan, A.; Mattoccia, S. Real-time semantic stereo matching. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 10780–10787.
- Chen, S.; Xiang, Z.; Qiao, C.; Chen, Y.; Bai, T. SGNet: Semantics guided deep stereo matching. In Proceedings of the Asian Conference on Computer Vision (ACCV), Kyoto, Japan, 30 November–4 December 2020; pp. 106–122.
- Kusupati, U.; Cheng, S.; Chen, R.; Su, H. Normal assisted stereo depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 2189–2199.
- Aleotti, F.; Poggi, M.; Tosi, F.; Mattoccia, S. Learning end-to-end scene flow by distilling single tasks knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10435–10442.
- Rao, Z.; Xiong, B.; He, M.; Dai, Y.; He, R.; Shen, Z.; Li, X. Masked representation learning for domain generalized stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 5435–5444.
- Liao, P.; Zhang, X.; Chen, G.; Wang, T.; Li, X.; Yang, H.; Zhou, W.; He, C.; Wang, Q. S2Net: A multi-task learning network for semantic stereo of satellite image pairs. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13.
- Chen, H.; Lin, M.; Zhang, H.; Yang, G.; Xia, G.S.; Zheng, X.; Zhang, L. Multi-level fusion of the multi-receptive fields contextual networks and disparity network for pairwise semantic stereo. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Yokohama, Japan, 28 July–2 August 2019; pp. 4967–4970.
- Qin, R.; Huang, X.; Liu, W.; Xiao, C. Pairwise stereo image disparity and semantics estimation with the combination of U-Net and pyramid stereo matching network. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Yokohama, Japan, 28 July–2 August 2019; pp. 4971–4974.
- Rao, Z.; He, M.; Zhu, Z.; Dai, Y.; He, R. SDBF-Net: Semantic and disparity bidirectional fusion network for 3D semantic detection on incidental satellite images. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Lanzhou, China, 18–21 November 2019; pp. 438–444.
- Chen, C.; Zhao, L.; He, Y.; Long, Y.; Chen, K.; Wang, Z.; Hu, Y.; Sun, X. SemStereo: Semantic-Constrained Stereo Matching Network for Remote Sensing. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 15758–15766.
- Wen, B.; Trepte, M.; Aribido, J.; Kautz, J.; Gallo, O.; Birchfield, S. FoundationStereo: Zero-Shot Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 5249–5260.
- Liu, C.W.; Chen, Q.; Fan, R. Playing to Vision Foundation Model’s Strengths in Stereo Matching. IEEE Trans. Intell. Veh. 2024, 1–12.
- Cheng, J.; Liu, L.; Xu, G.; Wang, X.; Zhang, Z.; Deng, Y.; Zang, J.; Chen, Y.; Cai, Z.; Yang, X. MonSter: Marry Monodepth to Stereo Unleashes Power. arXiv 2025, arXiv:2501.08643.
- Jiang, H.; Lou, Z.; Ding, L.; Xu, R.; Tan, M.; Jiang, W.; Huang, R. DEFOM-Stereo: Depth Foundation Model Based Stereo Matching. arXiv 2025, arXiv:2501.09466.
- He, X.; Yang, M.; Jiang, S.; Jiang, W.; Li, Q. Stereo Matching of High-Resolution Satellite Images via Hierarchical ViT and Self-Supervised DINO. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2025, X-G-2025, 357–364.
- Cheng, X.; Zhong, Y.; Harandi, M.; Dai, Y.; Chang, X.; Li, H.; Drummond, T.; Ge, Z. Hierarchical neural architecture search for deep stereo matching. Adv. Neural Inf. Process. Syst. 2020, 33, 22158–22169.
- Guan, T.; Wang, C.; Liu, Y.H. Neural markov random field for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 5459–5469.
- Weinzaepfel, P.; Lucas, T.; Leroy, V.; Cabon, Y.; Arora, V.; Brégier, R.; Csurka, G.; Antsfeld, L.; Chidlovskii, B.; Revaud, J. CroCo v2: Improved cross-view completion pre-training for stereo matching and optical flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 17969–17980. [Google Scholar]
- Zheng, D.; Wu, X.M.; Liu, Z.; Meng, J.; Zheng, W.S. DiffuVolume: Diffusion model for volume based stereo matching. Int. J. Comput. Vis. 2025, 133, 3807–3821. [Google Scholar] [CrossRef]
- Shi, Y. Rethinking iterative stereo matching from diffusion bridge model perspective. arXiv 2024, arXiv:2404.09051. [Google Scholar] [CrossRef]
- Chebbi, M.A.; Rupnik, E.; Pierrot-Deseilligny, M.; Lopes, P. DeepSim-Nets: Deep similarity networks for stereo image matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 18–22 June 2023; pp. 2097–2105. [Google Scholar]
- Yang, M.; Jiang, S.; Jiang, W.; Li, Q. Mamba-Based Feature Extraction and Multifrequency Information Fusion for Stereo Matching of High-Resolution Satellite Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 23273–23288. [Google Scholar] [CrossRef]
- Zhong, Y.; Dai, Y.; Li, H. Self-supervised learning for stereo matching with self-improving ability. arXiv 2017, arXiv:1709.00930. [Google Scholar]
- Zhang, Y.; Khamis, S.; Rhemann, C.; Valentin, J.; Kowdle, A.; Tankovich, V.; Schoenberg, M.; Izadi, S.; Funkhouser, T.; Fanello, S. ActiveStereoNet: End-to-end self-supervised learning for active stereo systems. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 784–801. [Google Scholar]
- Liu, P.; King, I.; Lyu, M.R.; Xu, J. Flow2Stereo: Effective self-supervised learning of optical flow and stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 6648–6657. [Google Scholar]
- Fang, I.S.; Wen, H.C.; Hsu, C.L.; Jen, P.C.; Chen, P.Y.; Chen, Y.S. ES3Net: Accurate and efficient edge-based self-supervised stereo matching network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 18–22 June 2023; pp. 4472–4481. [Google Scholar]
- Yuan, W.; Zhang, Y.; Wu, B.; Zhu, S.; Tan, P.; Wang, M.Y.; Chen, Q. Stereo matching by self-supervision of multiscopic vision. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 5702–5709. [Google Scholar]
- Wang, Y.; Zheng, J.; Zhang, C.; Zhang, Z.; Li, K.; Zhang, Y.; Hu, J. DualNet: Robust Self-Supervised Stereo Matching with Pseudo-Label Supervision. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 8178–8186. [Google Scholar]
- Knöbelreiter, P.; Vogel, C.; Pock, T. Self-supervised learning for stereo reconstruction on aerial images. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Valencia, Spain, 22–27 July 2018; pp. 4379–4382. [Google Scholar]
- Igeta, T.; Iwasaki, A. An unsupervised network for stereo matching of very high resolution satellite imagery. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 971–974. [Google Scholar]
- Chen, W.; Chen, H.; Yang, S. Self-supervised stereo matching method based on SRWP and PCAM for urban satellite images. Remote Sens. 2022, 14, 1636. [Google Scholar] [CrossRef]
- Smolyanskiy, N.; Kamenev, A.; Birchfield, S. On the importance of stereo for accurate depth estimation: An efficient semi-supervised deep neural network approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1007–1015. [Google Scholar]
- Amiri, A.J.; Loo, S.Y.; Zhang, H. Semi-supervised monocular depth estimation with left-right consistency using deep neural network. In Proceedings of the IEEE International Conference on Robotics and Biomimetics (ROBIO), Dali, China, 6–8 December 2019; pp. 602–607. [Google Scholar]
- Xu, F.; Wang, L.; Li, H. A unified and efficient semi-supervised learning framework for stereo matching. Pattern Recognit. 2024, 147, 110129. [Google Scholar] [CrossRef]
- Yue, X.; Lu, Z.; Lin, X.; Ren, W.; Shao, Z.; Hu, H.; Zhang, Y.; Liao, Q. Semi-Stereo: A universal stereo matching framework for imperfect data via semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–21 June 2024; pp. 646–655. [Google Scholar]
- Zhang, G.; Jiang, Y.; Wei, S.; Wang, Y.; Chu, J.; Tan, M.; Li, Z. Hierarchical domain adaptation framework for disparity estimation in optical satellite stereo imagery: Bridging spatiotemporal-sensor heterogeneity. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4704516. [Google Scholar] [CrossRef]
- Tulyakov, S.; Ivanov, A.; Fleuret, F. Weakly supervised learning of deep metrics for stereo reconstruction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1339–1348. [Google Scholar]
- Rao, Z.; He, M.; Dai, Y.; Shen, Z. Patch attention network with generative adversarial model for semi-supervised binocular disparity prediction. Vis. Comput. 2022, 38, 77–93. [Google Scholar] [CrossRef]
- Ren, H.; El-Khamy, M.; Lee, J. Stereo disparity estimation via joint supervised, unsupervised, and weakly supervised learning. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 2760–2764. [Google Scholar]
- Albanwan, H.; Qin, R. Fine-tuning deep learning models for stereo matching using results from semi-global matching. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2022, V-2-2022, 39–46. [Google Scholar] [CrossRef]
- Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Appl. 1998, 13, 18–28. [Google Scholar] [CrossRef]
- Schönberger, J.L.; Zheng, E.; Frahm, J.M.; Pollefeys, M. Pixelwise view selection for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 501–518. [Google Scholar]
- Ji, S.; Luo, C.; Liu, J. A Review of Dense Stereo Image Matching Methods Based on Deep Learning. Geomat. Inf. Sci. Wuhan Univ. 2021, 46, 193–202. [Google Scholar]
- Wang, L.; Guo, Y.; Wang, Y.; Liang, Z.; Lin, Z.; Yang, J.; An, W. Parallax attention for unsupervised stereo correspondence learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 2108–2125. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
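- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]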
- Kunwar, S.; Chen, H.; Lin, M.; Zhang, H.; D’Angelo, P.; Cerra, D.; Azimi, S.M.; Brown, M.; Hager, G.; Yokoya, N.; et al. Large-scale semantic 3-D reconstruction: Outcome of the 2019 IEEE GRSS Data Fusion Contest—Part A. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 922–935. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9650–9660. [Google Scholar]
- Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
- Yang, L.; Kang, B.; Huang, Z.; Zhao, Z.; Xu, X.; Feng, J.; Zhao, H. Depth Anything v2. Adv. Neural Inf. Process. Syst. 2024, 37, 21875–21911. [Google Scholar]
- Liu, H.; Simonyan, K.; Yang, Y. DARTS: Differentiable architecture search. arXiv 2018, arXiv:1806.09055. [Google Scholar]
- Weinzaepfel, P.; Leroy, V.; Lucas, T.; Brégier, R.; Cabon, Y.; Arora, V.; Antsfeld, L.; Chidlovskii, B.; Csurka, G.; Revaud, J. CroCo: Self-supervised pre-training for 3D vision tasks by cross-view completion. Adv. Neural Inf. Process. Syst. 2022, 35, 3502–3516. [Google Scholar]
- Lu, C.; Uchiyama, H.; Thomas, D.; Shimada, A.; Taniguchi, R.I. Sparse cost volume for efficient stereo matching. Remote Sens. 2018, 10, 1844. [Google Scholar] [CrossRef]
- Xia, Y.; d’Angelo, P.; Fraundorfer, F.; Tian, J.; Fuentes Reyes, M.; Reinartz, P. GA-Net-Pyramid: An efficient end-to-end network for dense matching. Remote Sens. 2022, 14, 1942. [Google Scholar] [CrossRef]
- Xiang, X.; Wang, Z.; Lao, S.; Zhang, B. Pruning multi-view stereo net for efficient 3D reconstruction. ISPRS J. Photogramm. Remote Sens. 2020, 168, 17–27. [Google Scholar] [CrossRef]
- Chen, G.; Ling, Y.; He, T.; Meng, H.; He, S.; Zhang, Y.; Huang, K. StereoEngine: An FPGA-based accelerator for real-time high-quality stereo estimation with binary neural network. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 4179–4190. [Google Scholar] [CrossRef]
- Wang, J.; Scharstein, D.; Bapat, A.; Blackburn-Matzen, K.; Yu, M.; Lehman, J.; Alsisan, S.; Wang, Y.; Tsai, S.; Frahm, J.M.; et al. A practical stereo depth system for smart glasses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 21498–21507. [Google Scholar]
- Menze, M.; Geiger, A. Object scene flow for autonomous vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3061–3070. [Google Scholar]
- Scharstein, D.; Hirschmüller, H.; Kitajima, Y.; Krathwohl, G.; Nešić, N.; Wang, X.; Westling, P. High-resolution stereo datasets with subpixel-accurate ground truth. In Proceedings of the German Conference on Pattern Recognition (GCPR), Münster, Germany, 2–5 September 2014; pp. 31–42. [Google Scholar]
- Mayer, N.; Ilg, E.; Fischer, P.; Hazirbas, C.; Cremers, D.; Dosovitskiy, A.; Brox, T. What makes good synthetic training data for learning disparity and optical flow estimation? Int. J. Comput. Vis. 2018, 126, 942–960. [Google Scholar] [CrossRef]
- Huang, X.; Cheng, X.; Geng, Q.; Cao, B.; Zhou, D.; Wang, P.; Lin, Y.; Yang, R. The ApolloScape dataset for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 954–960. [Google Scholar]
- Yang, G.; Song, X.; Huang, C.; Deng, Z.; Shi, J.; Zhou, B. DrivingStereo: A large-scale dataset for stereo matching in autonomous driving scenarios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 899–908. [Google Scholar]
- Li, J.; Wang, P.; Xiong, P.; Cai, T.; Yan, Z.; Yang, L.; Liu, J.; Fan, H.; Liu, S. Practical stereo matching via cascaded recurrent network with adaptive correlation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 16263–16272. [Google Scholar]
- Patil, S.; Comandur, B.; Prakash, T.; Kak, A.C. A new stereo benchmarking dataset for satellite images. arXiv 2019, arXiv:1907.04404. [Google Scholar] [CrossRef]
- Hirschmüller, H. Accurate and efficient stereo processing by semi-global matching and mutual information. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–26 June 2005; Volume 2, pp. 807–814. [Google Scholar]
- Liang, Z.; Feng, Y.; Guo, Y.; Liu, H.; Chen, W.; Qiao, L.; Zhou, L.; Zhang, J. Learning for disparity estimation through feature constancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 2811–2820. [Google Scholar]
- Atienza, R. Fast disparity estimation using dense networks. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 3207–3212. [Google Scholar]
- Yang, Q.; Chen, G.; Tan, X.; Wang, T.; Wang, J.; Zhang, X. S3Net: Innovating stereo matching and semantic segmentation with a single-branch semantic stereo network in satellite epipolar imagery. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Athens, Greece, 7–12 July 2024; pp. 8737–8740. [Google Scholar]
- Wu, T.; Vallet, B.; Pierrot-Deseilligny, M.; Rupnik, E. A new stereo dense matching benchmark dataset for deep learning. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2021, 43, 405–412. [Google Scholar] [CrossRef]
- Reyes, M.F.; d’Angelo, P.; Fraundorfer, F. SyntCities: A large synthetic remote sensing dataset for disparity estimation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 10087–10098. [Google Scholar] [CrossRef]
- Zhang, X.; Cao, X.; Yu, A.; Yu, W.; Li, Z.; Quan, Y. UAVStereo: A multiple resolution dataset for stereo matching in UAV scenarios. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 2942–2953. [Google Scholar] [CrossRef]
- Khamis, S.; Fanello, S.; Rhemann, C.; Kowdle, A.; Valentin, J.; Izadi, S. StereoNet: Guided hierarchical refinement for real-time edge-aware depth prediction. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 573–590. [Google Scholar]

| Type | Category | Characteristic | Representative Works (Common Data) | Representative Works (Remote Sensing) |
|---|---|---|---|---|
| Traditional | Classical | Interpretable, efficient, strong generalization | Scharstein and Szeliski [5], SGM [7], Hirschmuller et al. [23], Kordelas et al. [24], MGM [25], tMGM [26] | Lee et al. [27], Ghuffar [28], Qin et al. [29], SGM-ForestM [30], Wang et al. [31], Tatar et al. [32], DPSM [33], LGSM [34], Zhao et al. [35], L2GSM [36], Yue et al. [37] |
| Hybrid | Combine deep learning with traditional | Combines geometric priors with learning flexibility | MC-CNN [38], GA-Net [39], DeepPruner [40] | S2P-GANet [41], Gómez et al. [42], Albanwan and Qin [43], Sat-MVSF [44], Zheng et al. [45] |
| Supervised | 2D Convolution-based | Fast inference | DispNet [9], AANet [46], Bi3D [47], HITNet [48], SMD-Nets [49] | Ji et al. [50], Wang et al. [51] |
| Supervised | 3D Convolution-based | High accuracy, high computing cost | Bosch et al. [21], GC-Net [10], CFNet [52], PCW-Net [53], Abc-Net [54], SDA [55], ADL [56], UPFNet [57] | Tao et al. [11], HMSM-Net [58], Jiang et al. [59], Tao et al. [60], PSMNet-FusionX3 [61], SRCV-Net [62], DBMSMNet [63] |
| Supervised | Iterative Optimization | High accuracy, iteratively updating | RAFT-Stereo [14], IGEV [64], IGEV++ [65], MoCha-Stereo [66], Selective-Stereo [67], TC-Stereo [68], MC-Stereo [69] | MaskCRNet [18], Patil and Guo [70] |
| Supervised | Transformer-based | Global context modeling | STTR [15], SSTTStereo [71], ELFNet [72], S2M2 [73] | Wei et al. [74], MaskCRNet [18], A-SATMVSNet [16] |
| Supervised | Multi-task Learning | Cross-task supervision | RTS2Net [75], SGNet [76], NNNet [77], DWARF [78], Rao et al. [79] | BGA-Net [12], S2Net [80], Chen et al. [81], Qin et al. [82], SDBF-Net [83], SemStereo [84] |
| Supervised | Vision Foundation Model Integration | Using priors from VFM, strong generalization | FoundationStereo [85], ViTAStereo [86], MonSter [87], DEFOM-Stereo [88] | HDSM-Net [89] |
| Supervised | Other DL Models | Innovative, flexible | LEAStereo [90], NMRF-Stereo [91], CroCo v2 [92], DiffuVolume [93], DMIO [94] | DeepSim-Nets [95], MEMF-Net [96] |
| Alternative Supervised | Self-supervised | No GT, geometry-driven | SsSMnet [97], ActiveStereoNet [98], Flow2Stereo [99], ES3Net [100], Yuan et al. [101], DualNet [102] | Knöbelreiter et al. [103], Igeta and Iwasaki [104], Chen et al. [105] |
| Alternative Supervised | Semi-supervised | Limited GT | Smolyanskiy et al. [106], SemiDepth [107], SSMF [108], Semi-Stereo [109] | HDADE [110] |
| Alternative Supervised | Weakly-supervised | Indirect cues | Tulyakov et al. [111], PA-Net [112], SUW-Stereo [113] | Albanwan and Qin [114] |

| Category | Advantages | Disadvantages |
|---|---|---|
| Hybrid methods | Combine robustness of priors with flexibility of deep learning; improved cross-domain generalization | Complex pipeline; limited improvement compared to fully learning networks |
| 2D convolution-based models | Efficient and lightweight; suitable for real-time and edge devices | Limited accuracy and robustness in complex or textureless regions |
| 3D convolution-based models | Strong spatial regularization; higher disparity accuracy | High computation and memory cost; difficult for large-scale or real-time use |
| Iterative optimization-based models | Accurate and memory-efficient; support progressive refinement | Sequential inference introduces latency, especially on high-resolution RS data |
| Transformer-based models | Capture long-range dependencies; strong semantic awareness | High computational cost; limited applications in RS domain |
| Multi-task learning models | Leverage semantic/geometry cues; benefit downstream tasks | Task balancing is challenging; increased annotation and computation cost |
| Vision Foundation Model Integration | Strong generalization; effective cross-domain adaptation via large-scale pretraining | Integration into stereo is still preliminary; training/fine-tuning is resource-demanding |
| Self/semi/weakly supervised methods | Reduce dependence on dense GT disparity; more practical in RS | Sensitive to seasonal/illumination changes; accuracy gap with fully supervised remains |

| Method | Year | Publication | EPE (px) ↓ | D1 (%) ↓ | Time (ms) ↓ | Equipment |
|---|---|---|---|---|---|---|
| SGM † [141] | 2005 | CVPR | 10.34 | 43.00 | - | Xeon |
| iResNet-i2 † [142] | 2018 | CVPR | 3.05 | 33.00 | - | NVIDIA Titan X |
| DenseMapNet † [143] | 2018 | ICRA | 3.51 | 35.00 | - | NVIDIA GTX 1080Ti |
| MRFCNet [81] | 2019 | IGARSS | 1.34 | 9.01 | 670 | - |
| BGA-Net [12] | 2020 | TGRS | 1.20 | 7.20 | 1572 | NVIDIA GTX 1080Ti |
| RAFT-Stereo [14] | 2021 | 3DV | 1.29 | 8.01 | 575 | NVIDIA RTX 6000 |
| HMSM-Net [58] | 2022 | ISPRS | 1.19 | 7.16 | 511 | NVIDIA RTX 3090 |
| S2Net [80] | 2023 | TGRS | 1.439 | 10.051 | - | NVIDIA Tesla V100 |
| MaskCRNet [18] | 2024 | ISPRS | 1.12 | 7.01 | 724 | NVIDIA RTX 3090 |
| S3Net [144] | 2024 | IGARSS | 1.403 | 9.579 | - | NVIDIA Tesla V100 |
| MEMF-Net [96] | 2025 | JSTARS | 1.18 | 6.64 | - | NVIDIA RTX 4090 |
| HDSM-Net * [89] | 2025 | ISPRS | 2.09 | 18.6 | - | NVIDIA RTX 4090 |
| SRCV-Net [62] | 2025 | JSTARS | 1.387 | 8.615 | - | NVIDIA RTX A6000 |
| DBMSMNet [63] | 2025 | JSTARS | 1.48 | 7.41 | 490 | NVIDIA RTX 4090 |

| Method | Year | Publication | EPE (px) ↓ | D1 (%) ↓ | Time (ms) ↓ | Equipment |
|---|---|---|---|---|---|---|
| SGM [141] | 2005 | CVPR | 4.88 | 50.79 | 506 | Xeon |
| StereoNet [148] | 2018 | ECCV | 2.45 | 25.12 | 238 | NVIDIA Titan X |
| PSMNet [17] | 2018 | CVPR | 2.48 | 24.81 | 614 | NVIDIA Titan-Xp |
| RAFT-Stereo [14] | 2021 | 3DV | 1.77 | 14.35 | 575 | NVIDIA RTX 6000 |
| HMSM-Net [58] | 2022 | ISPRS | 1.67 | 12.94 | 511 | - |
| MaskCRNet [18] | 2024 | ISPRS | 1.66 | 12.87 | 724 | NVIDIA RTX 3090 |
| MEMF-Net [96] | 2025 | JSTARS | 1.524 | 11.6 | - | NVIDIA RTX 4090 |
| HDSM-Net * [89] | 2025 | ISPRS | 3.310 | 29.0 | - | NVIDIA RTX 4090 |
| SRCV-Net [62] | 2025 | JSTARS | 1.7523 | 14.15 | - | NVIDIA RTX A6000 |
| DBMSMNet [63] | 2025 | JSTARS | 2.13 | 13.46 | - | NVIDIA RTX 4090 |
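
The EPE and D1 columns in the two comparison tables above are not defined in this part of the article. The following is a minimal sketch of how such metrics are commonly computed, under stated assumptions: EPE is taken as the mean absolute disparity error over pixels with valid ground truth, and D1 as the percentage of those pixels whose error exceeds a threshold. The 3-pixel threshold is an assumption (KITTI-style D1 additionally requires a relative error above 5%, which is omitted here), and the names `epe_and_d1` and `threshold_px` are hypothetical, not taken from any of the cited works.

```python
import math
from typing import Iterable, Optional, Tuple


def epe_and_d1(
    pred: Iterable[float],
    gt: Iterable[Optional[float]],
    threshold_px: float = 3.0,  # assumed outlier threshold, not stated in the tables
) -> Tuple[float, float]:
    """Return (EPE in pixels, D1 in percent) over pixels with valid ground truth."""
    abs_errors = [
        abs(p - g)
        for p, g in zip(pred, gt)
        if g is not None and not math.isnan(g)  # skip pixels without a label
    ]
    if not abs_errors:
        raise ValueError("no valid ground-truth pixels")
    # End-point error: mean absolute disparity difference.
    epe = sum(abs_errors) / len(abs_errors)
    # D1: share of valid pixels whose error exceeds the threshold.
    outliers = sum(1 for e in abs_errors if e > threshold_px)
    d1 = 100.0 * outliers / len(abs_errors)
    return epe, d1


# Toy example: five pixels, one of them without ground truth.
pred = [10.0, 12.5, 7.0, 20.0, 3.0]
gt = [10.5, 12.0, 11.5, None, 3.0]
epe, d1 = epe_and_d1(pred, gt)
print(f"EPE = {epe:.3f} px, D1 = {d1:.1f} %")  # EPE = 1.375 px, D1 = 25.0 %
```

Masking invalid pixels matters for datasets with sparse labels, since averaging over unlabeled pixels would distort both metrics. Reported values can also differ between papers if they use different thresholds or validity masks, so cross-paper comparisons should be read with that in mind.
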

| Dataset | Year | Resolution | Mode | Total Data | Training Size | Testing Size | Label Type |
|---|---|---|---|---|---|---|---|
| US3D [21] | 2019 | | RGB | 4292 | 4242 | 50 | Dense |
| SatStereo [140] | 2019 | | Panchromatic | 72 | – | – | Dense |
| ISPRS 2021 [145] | 2021 | | IRRG | 1092 | 585 | 507 | Sparse |
| GaoFen-7 [58] | 2022 | | Panchromatic | 490 | 400 | 90 | Dense |
| SyntCities [146] | 2022 | | RGB | 8100 | 6480 | 1620 | Dense |
| WHU-Stereo [19] | 2023 | | Panchromatic | 1757 | 1220 | 537 | Dense |
| UAVStereo [147] | 2023 | ∼ | RGB | 38,781 | 30,924 | 7757 | Dense |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Li, X.; Zhou, H.; Sun, M.; Xiong, B.; Dai, Y.; He, R.; Chen, Z.; Rao, Z. Binocular Stereo Vision in Remote Sensing: A Review. Remote Sens. 2026, 18, 1480. https://doi.org/10.3390/rs18101480

