Open Access Review

Peer-Review Record

Binocular Stereo Vision in Remote Sensing: A Review

Remote Sens. 2026, 18(10), 1480; https://doi.org/10.3390/rs18101480

by Xing Li¹, Hongwei Zhou¹, Mingyu Sun¹, Bangshu Xiong¹, Yuchao Dai², Renjie He², Zhihua Chen¹ and Zhibo Rao¹,*

Reviewer 1: Ming Wei

Reviewer 2: Jia Chen

Submission received: 8 March 2026 / Revised: 30 April 2026 / Accepted: 6 May 2026 / Published: 9 May 2026

(This article belongs to the Special Issue 3D City Modeling and Observation Using Remote Sensing and Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This review manuscript addresses binocular stereo vision in remote sensing. Stereo matching sits at the intersection of computer vision and photogrammetry, and its remote sensing branch is an important application area that aligns with current research interests and practical needs across remote sensing, photogrammetry, and computer vision. The topic is well targeted: previous reviews have mostly centered on ground-level data, whereas this manuscript systematically analyzes the challenges unique to remote sensing scenarios, remote sensing-specific models, and remote sensing datasets. It summarizes the technologies, datasets, and future directions relevant to this setting, providing a unified overview and reference for the subfield. It can offer clear guidance for subsequent research and help move the technology from theory to practical remote sensing applications. Overall, the manuscript gives a comprehensive summary of binocular stereo matching methods for remote sensing, with a clear, logical structure that is easy to follow. The references include many classic stereo matching papers as well as recent work in remote sensing, and they meet the requirements.

Here are my suggestions:

1. Strengthen the domain-specific analysis. When reviewing models, explain clearly how each model is adapted to remote sensing scenarios and which remote sensing challenges it addresses. Avoid directly reusing the analysis logic applied to ground-level models, and highlight the design of remote sensing-specific models.
2. Judging from the title and content, the manuscript aims to summarize both traditional and deep learning binocular stereo vision methods in remote sensing, not deep learning methods alone. However, the first paragraph of the Introduction presents the model formulas of stereo vision, and from the second paragraph onward the text moves directly to deep learning methods, with no summary of traditional methods.
3. In Table 3, EPE (px) and D (%) are not clearly described in the text. Every abbreviation should be defined where it first appears.
4. The tables should have a unified format (they currently differ from Table 1).
5. The Weakly-, Semi-, and Self-supervised Algorithms section should include figures of network structures, as the other sections do, to help readers understand the content.
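For readers of this record, the two metrics queried above can be illustrated with a short sketch. EPE (end-point error) is the mean absolute disparity error in pixels. D1 is assumed here to be the percentage of valid pixels whose error exceeds 3 px, a common convention that the manuscript may or may not follow. The function name `epe_and_d1`, the `invalid` marker, and the sample values are hypothetical and are not taken from the manuscript.

```python
from typing import Optional, Sequence, Tuple


def epe_and_d1(
    pred: Sequence[float],
    gt: Sequence[float],
    threshold_px: float = 3.0,        # assumed outlier threshold, not from the manuscript
    invalid: Optional[float] = None,  # hypothetical marker for pixels without ground truth
) -> Tuple[float, float]:
    """Return (EPE in pixels, D1 in percent) over pixels with valid ground truth."""
    errors = [
        abs(p - g)
        for p, g in zip(pred, gt)
        if invalid is None or g != invalid
    ]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    epe = sum(errors) / len(errors)
    # D1: share of valid pixels whose absolute error is above the threshold.
    d1 = 100.0 * sum(1 for e in errors if e > threshold_px) / len(errors)
    return epe, d1


# Made-up disparities for four pixels (signed values are allowed).
pred = [10.0, -5.0, 20.0, 0.5]
gt = [11.0, -9.0, 20.0, 0.0]
epe, d1 = epe_and_d1(pred, gt)
print(f"EPE = {epe:.3f} px, D1 = {d1:.1f} %")
```

With these sample values the absolute errors are 1, 4, 0, and 0.5 px, so one of the four pixels counts as an outlier under the assumed 3 px threshold.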

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Unlike prior surveys that primarily address ground-level scenes, this paper presents a comprehensive review of stereo matching techniques tailored for remote sensing. The survey covers binocular stereo vision in remote sensing, encompassing traditional algorithms, deep learning-based models, acceleration strategies, and datasets.

One strength of this paper is that the authors have grasped the differences between remote sensing stereo matching and traditional computer vision stereo matching, including unique challenges such as limited access to satellite images, seasonal differences, difficulty recognizing small targets, and large-scale repeated textures.
Another strength is the detailed classification of deep learning architectures (including 2D/3D convolution, iterative optimization, Transformers, multi-task learning, and vision foundation model integration).
Binocular stereo vision in remote sensing is important for terrain modeling and environmental monitoring, so this survey makes a significant contribution to the remote sensing field. I hope this manuscript will be published as soon as possible for readers.
The manuscript is well organized and comprehensive. For example, it also describes well the limitations and future directions of deep stereo networks in remote sensing.
The references to related and previous work are appropriate and adequate (147 references).

A few points need the authors' attention:

1. In the Introduction, disparity is defined as d = |x_l - x_r|, i.e., as an absolute value. However, the manuscript later points out that negative disparity ranges often appear in remote sensing datasets; for example, the disparity range of US3D can be (-64, 64) or (-112, 64). These two statements appear to conflict: under an absolute-value definition, negative disparity cannot be expressed naturally. The authors should check whether this definition needs to be revised.
2. Tables 3 and 4 summarize the EPE, D1, and running time on US3D and WHU Stereo. The running times cannot be compared directly. As mentioned earlier in the article, different models were tested on different GPUs; for example, PSMNet and BGA Net were tested on a GTX 1080Ti, while MaskCRNet was tested on an RTX 3090. Placing the times from these different papers side by side in one table gives readers the impression of a "speed comparison under the same hardware device". It would be better to state in the table notes which device each model was tested on, or alternatively to run all tests on a single hardware platform.
3. Check the initial definition of abbreviations throughout the text. For example, EPE and D1 appear in the headers of Tables 3 and 4, but there seems to be no clear definition of them in the main text.
4. In Section 5.1, the authors write "Fig. 8 presents the qualitative results of the WHU Stereo dataset", but the caption of Fig. 8 clearly refers to the US3D dataset. Is there a contradiction between the two? I suggest that the authors thoroughly review the consistency of all figure, table, and caption numbering.
5. The review could deepen its comparison of previous works. For example, when introducing the various 3D CNN adaptations for remote sensing, it could analyze in more depth which pain points of remote sensing stereo matching each one targets (such as large disparity range or excessive memory consumption) and how these motivate the improvements.
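The first point above, on the sign of the disparity, can be made concrete with a minimal sketch. It assumes the common signed convention d = x_l - x_r as the alternative to the absolute-value definition; the function names and the two pixel correspondences are hypothetical and are not drawn from the manuscript or from any dataset.

```python
from typing import List, Tuple


def signed_disparity(x_left: float, x_right: float) -> float:
    """Signed disparity d = x_l - x_r, which keeps the direction of the shift."""
    return x_left - x_right


def absolute_disparity(x_left: float, x_right: float) -> float:
    """Absolute-value definition d = |x_l - x_r| discussed in the comment."""
    return abs(x_left - x_right)


# Two made-up correspondences (x_left, x_right) shifted in opposite directions.
matches: List[Tuple[float, float]] = [(100.0, 130.0), (100.0, 70.0)]

signed = [signed_disparity(xl, xr) for xl, xr in matches]
absolute = [absolute_disparity(xl, xr) for xl, xr in matches]

# With the signed value, the right-image column is recoverable as x_r = x_l - d.
recovered = [xl - d for (xl, _), d in zip(matches, signed)]

print("signed:   ", signed)
print("absolute: ", absolute)
print("recovered:", recovered)
```

Under the absolute-value definition both correspondences map to the same disparity, so the direction of the shift is lost and the right-image column can no longer be recovered uniquely. The signed form keeps the two cases distinct, which is what a range such as (-64, 64) requires.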

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
