Dense Matching with Low Computational Complexity for Disparity Estimation in the Radargrammetric Approach of SAR Intensity Images
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
1) The novelty should be clarified; in particular, the advantages of the proposed approach over deep-learning-based methods should be enhanced with respect to both theoretical and experimental aspects.
2) The core innovation—reducing kernel dimensionality via feature vector encoding—is intriguing but needs clearer theoretical justification. Equations (5)–(9) should explicitly link kernel reduction to computational savings without sacrificing accuracy.
3) The claim of “stable computational complexity” (Lines 26–27) is vague. Provide a detailed analysis comparing the proposed method with NCC/SGM under varying input/kernel sizes.
4) The adaptive speckle reduction and histogram matching (Section 2.3.1) are critical but lack details. Specify parameters (e.g., filter sizes, adaptation criteria) and justify their impact on matching performance.
5) The Sentinel-1 C-band data (10 m resolution) may not fully challenge the algorithm. Include results from higher-resolution datasets (e.g., TerraSAR-X) to demonstrate scalability.
6) The 34.1 m average DSM accuracy (Line 33) is reasonable but lacks context. Compare with theoretical limits (Equation 20) and discuss error sources (e.g., geometric distortions, kernel size trade-offs).
Comments on the Quality of English Language
1) English should be improved.
2) Abstract: Simplify technical jargon (e.g., “quadratic growth in computation time” → “quadratic computational complexity”).
Author Response
Reviewer 1,
Dear reviewer,
We would like to thank you very much for your time and insightful comments, which helped us further improve the presentation of our manuscript. We have carefully addressed your comments in the revised version. Please find below our point-by-point responses; the corresponding changes are highlighted in the revised manuscript.
- The novelty should be clarified; in particular, the advantages of the proposed approach over deep-learning-based methods should be enhanced with respect to both theoretical and experimental aspects.
Response. Thank you for your comment. The central contribution of this study lies in the strategic utilization of local methods that offer inherently low computational complexity. Among dense matching techniques, local approaches are theoretically the most efficient, and this computational advantage has informed the structural design of the proposed algorithm. As such, the algorithm not only retains the benefits of local methods but also achieves lower computational complexity compared to more elaborate alternatives.
To aid clarity, comparative tables (Tables 1 and 2) have been included in the introduction, providing an approximate estimation of computational complexity across major dense matching algorithms and situating deep learning models within this landscape. Dense matching algorithms are grouped into two broad categories: classical methods and deep learning–based approaches. Although both aim to produce accurate pixel-level correspondences, their architectures and deployment requirements differ fundamentally. Deep learning methods inherently rely on training, necessitating access to large-scale annotated datasets and high computational resources.
Due to the architectural complexity of deep networks—substantially greater than that of classical methods even in their optimized forms—GPU-based acceleration has become essential. This is now feasible on modern consumer-grade systems, rendering such designs increasingly practical. When based on linear algebraic operations such as dot products and summations, deep networks can achieve speedups of several orders of magnitude on GPU hardware.
The proposed algorithm adheres to this linear computational pattern, comprising primarily additive and multiplicative operations. Thus, the computational gap highlighted in the comparative table is expected to persist even under GPU deployment.
With respect to output quality, the proposed method is expected to outperform conventional local methods and approach the accuracy of semi-global techniques such as SGM, owing to its two-stage design that mitigates local method limitations while maintaining efficiency. Nevertheless, without further architectural enhancements or adaptive mechanisms, its performance may remain inferior to state-of-the-art deep networks—particularly on unseen datasets.
Effective deployment of deep networks on novel data typically requires retraining with representative samples, which poses a considerable challenge for non-specialist users. A fair numerical comparison between classical and deep methods remains inherently difficult, as performance strongly depends on familiarity with the network architecture and training protocols. Inadequate tuning may lead to biased or suboptimal outcomes, obscuring the full potential of deep models.
It is for this reason that deep learning methods are often benchmarked on standardized datasets and compared primarily against other neural approaches. This minimizes the need for retraining and ensures the integrity of comparative evaluations. The mention of deep networks in this study serves solely to establish a conceptual framework for theoretical comparison—based on computational complexity as understood at the time of writing. These metrics are expected to evolve with ongoing research. Despite their potential for high-quality results, proper exploitation of deep networks remains a technically demanding task, often beyond the reach of casual users or even those not deeply familiar with the subject matter.
2) The core innovation—reducing kernel dimensionality via feature vector encoding—is intriguing but needs clearer theoretical justification. Equations (5)–(9) should explicitly link kernel reduction to computational savings without sacrificing accuracy.
Response. Thank you for your comment. To better illustrate the proposed concepts, the referenced equations are summarized and interpreted below. Each formulation supports key aspects of the workflow, including optimized disparity range estimation, feature dimensionality reduction, and efficient cost function construction. Their interplay enables faster computation and reduced memory usage.
Equation (5): The use of a kernel surrounding each pixel as a feature descriptor introduces inherent redundancy, analogous to the correlation observed in hyperspectral data. For an image with n-bit radiometric resolution and a kernel of size k, the resulting feature space—approximated by Equation (5)—can become disproportionately large, especially given that pixel counts in typical images reach multi-megapixel scales. In rectified imagery, however, cost volume construction and matching typically involve only a few hundred disparity candidates. This mismatch reinforces the justification for feature reduction. Dimensionality reduction not only eliminates redundancy but also lowers computational complexity, with processing time decreasing quadratically relative to descriptor length. Additionally, shorter feature vectors enable the construction of a 3D data cube, facilitating direct cost function computation via matrix-based operations instead of iterative evaluation (a minimal illustrative sketch of this matrix-based computation is given after this list).
Equations (6)–(8): These equations enable automatic extraction of the disparity interval, offering an optional but potentially beneficial preprocessing layer. By estimating the effective disparity range, this step supports more informed parameter selection—including minimal kernel sizes for feature extraction in Stage 1 and cost volume construction in Stage 2. Notably, in rectified stereo imagery, phase correlation lacks a distinct peak and instead exhibits a semi-step response in the first row. The magnitude and sharpness of this step are directly proportional to the underlying disparity distribution, providing a heuristic cue for disparity interval extraction.
Equation (9): This equation plays a pivotal role in determining the minimum shift steps required for accurate cost function recovery. In Stage 1, larger kernels yield broader cost function profiles with increased full width at half maximum (FWHM). This effect becomes more pronounced when kernel dimensions are reduced, as depicted in Figure 16. Consequently, peak localization can be achieved with fewer samples, permitting the construction of the cost volume using shift increments greater than one pixel. This adaptive sampling approach inherently compresses the final cost volume, significantly reducing memory usage and computational overhead.
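For illustration, the minimal Python/NumPy sketch below shows how a cost volume can be assembled directly from short per-pixel descriptors with a configurable shift increment. The SAD cost, the winner-takes-all selection, and all names are simplifying assumptions made for this letter, not the exact formulation of the revised manuscript.

```python
import numpy as np

def disparity_from_feature_cube(feat_ref, feat_sec, d_min, d_max, step=1):
    """Illustrative only: vectorized cost volume from per-pixel descriptors.

    feat_ref, feat_sec : (H, W, L) float arrays, L-element descriptors (L << k*k)
    d_min, d_max       : disparity search interval (e.g., estimated as in Eqs. (6)-(8))
    step               : shift increment (> 1 allowed, in the spirit of Eq. (9))
    """
    H, W, L = feat_ref.shape
    disparities = np.arange(d_min, d_max + 1, step)
    cost = np.full((H, W, len(disparities)), np.inf, dtype=np.float32)

    for i, d in enumerate(disparities):
        if d >= 0:
            diff = feat_ref[:, d:, :] - feat_sec[:, :W - d, :]
            cost[:, d:, i] = np.abs(diff).sum(axis=2)   # SAD over the L features
        else:
            diff = feat_ref[:, :d, :] - feat_sec[:, -d:, :]
            cost[:, :d, i] = np.abs(diff).sum(axis=2)

    # Winner-takes-all over the (compressed) cost volume
    return disparities[np.argmin(cost, axis=2)]
```

Because each disparity candidate now costs only L operations per pixel instead of k×k, the matching term no longer grows with the kernel size, and a shift step greater than one further compresses the cost volume.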
To offer readers a cohesive understanding of the proposed algorithm and to clearly contextualize the equations introduced in Section 2.3, a flow diagram (Figure 17) has been incorporated at the end of this section. This figure provides a stage-wise depiction of the algorithm’s execution while situating each equation within its corresponding step, thereby enhancing the conceptual clarity and continuity of the presentation.
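In the same illustrative spirit, the sketch below shows one way the semi-step phase-correlation cue associated with Equations (6)–(8) could be converted into a disparity search interval. The normalization, the relative threshold of 0.1, and the function name are assumptions made for this letter, not the manuscript's actual formulation.

```python
import numpy as np

def disparity_interval_from_phase_correlation(ref, sec, rel_threshold=0.1):
    """Heuristic sketch: estimate [d_min, d_max] from the first row of the
    phase-correlation surface of an epipolar-rectified image pair."""
    R = np.fft.fft2(ref) * np.conj(np.fft.fft2(sec))
    R /= np.abs(R) + 1e-12                      # cross-power spectrum normalisation
    surface = np.abs(np.fft.ifft2(R))
    row = np.fft.fftshift(surface[0, :])        # zero shift moved to the centre
    active = np.where(row > rel_threshold * row.max())[0]
    centre = row.size // 2
    return int(active.min() - centre), int(active.max() - centre)
```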
3) The claim of “stable computational complexity” (Lines 26–27) is vague. Provide a detailed analysis comparing the proposed method with NCC/SGM under varying input/kernel sizes.
Response. Thank you for your comment. To better contextualize the proposed algorithm within the computational complexity landscape, comparative tables and figures have been added to the introduction. Specifically, Table 1 outlines the approximate computational complexity of major dense matching algorithms, while Table 2 presents the relative complexity of sparse feature-based methods. Furthermore, the impact of increasing kernel dimensions on algorithmic complexity is illustrated in Figure 3, helping to visualize the trade-off between matching precision and computational demand.
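As a rough companion to these tables and to Figure 3 (not a reproduction of them), the toy calculation below compares order-of-magnitude operation counts under common textbook cost models. The image size, disparity count, descriptor length L = 9, and the windowed-cost assumption for SGM are our illustrative choices for this letter, not values taken from the manuscript.

```python
# Rough operation-count models (textbook approximations, illustrative only).
H, W, D, L = 2000, 2000, 200, 9      # image size, disparity candidates, descriptor length
for k in (5, 9, 17, 33):             # kernel sizes
    ncc_ops     = H * W * D * k * k          # window correlation per candidate
    sgm_ops     = H * W * D * (k * k + 8)    # windowed cost plus 8-path aggregation (one possible SGM variant)
    reduced_ops = H * W * (k * k + D * L)    # one-off reduction plus vector-based cost
    print(f"k={k:2d}  NCC~{ncc_ops:.1e}  SGM~{sgm_ops:.1e}  reduced-descriptor~{reduced_ops:.1e}")
```

Its only purpose is to show that the matching term of a reduced-descriptor approach no longer scales with the kernel area, which is the trade-off the added figure visualizes.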
4) The adaptive speckle reduction and histogram matching (Section 2.3.1) are critical but lack details. Specify parameters (e.g., filter sizes, adaptation criteria) and justify their impact on matching performance.
Response. Thank you for your comment. Section 2.3.1 includes Tables 4 and 5, which detail the parameters employed in the preprocessing methods along with corresponding explanations. To further evaluate the impact of preprocessing on the stereo matching process, results obtained from the proposed algorithm applied to the raw input data (without preprocessing) are also presented (Figure 28). This dual comparison facilitates both quantitative and qualitative assessment of the algorithm's baseline performance and highlights the effect of preprocessing on key evaluation metrics.
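Purely for illustration, a minimal preprocessing sketch in this spirit is given below, with a classic Lee filter and histogram matching standing in for the adaptive procedures of Section 2.3.1. The synthetic data, the 7×7 window, and the variable names are assumptions made for this letter and do not reproduce the exact parameters reported in Tables 4 and 5.

```python
import numpy as np
from scipy.ndimage import uniform_filter
from skimage.exposure import match_histograms

def lee_filter(img, size=7):
    """Classic Lee speckle filter (illustrative stand-in; window size assumed)."""
    img = img.astype(np.float32)
    mean = uniform_filter(img, size)
    mean_sq = uniform_filter(img * img, size)
    var = np.maximum(mean_sq - mean * mean, 0.0)
    noise_var = float(np.mean(var))             # crude global noise estimate
    gain = var / (var + noise_var + 1e-12)
    return mean + gain * (img - mean)

# Synthetic multi-look speckled pair (placeholder for the rectified SAR intensities).
rng = np.random.default_rng(0)
clean = rng.gamma(2.0, 50.0, size=(256, 256)).astype(np.float32)
ref_img = clean * rng.gamma(4.0, 1 / 4.0, clean.shape)        # 4-look speckle
sec_img = 0.8 * clean * rng.gamma(4.0, 1 / 4.0, clean.shape)  # radiometric offset

ref_f = lee_filter(ref_img, size=7)
sec_f = lee_filter(sec_img, size=7)
sec_matched = match_histograms(sec_f, ref_f)   # radiometric alignment to the reference
```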
5) The Sentinel-1 C-band data (10 m resolution) may not fully challenge the algorithm. Include results from higher-resolution datasets (e.g., TerraSAR-X) to demonstrate scalability.
Response. Thank you for your comment. To enhance algorithmic evaluation, an additional TerraSAR-X stereo pair (SM mode) has been added to the experimental dataset.
6) The 34.1 m average DSM accuracy (Line 33) is reasonable but lacks context. Compare with theoretical limits (Equation 20) and discuss error sources (e.g., geometric distortions, kernel size trade-offs).
Response. Thank you for your comment. The content outlined below has been added to the discussion section for clarification:
“As discussed in the expected accuracy analysis, the achieved results fall short of ideal predictions due to multiple contributing factors. One major source of deviation stems from inherent geometric distortions present in radar imagery. As illustrated schematically, when the radar line of sight approaches a near-vertical orientation (i.e., small incidence angles, close to nadir), the effective ground resolution in the range direction deteriorates significantly compared to the nominal sensor resolution. This causes aggregation of extended surface areas into individual pixels, leading to reduced spatial precision and overall matching accuracy. The degradation is particularly pronounced in mountainous or topographically complex regions.
Such structural limitations also influence the stereo matching process, particularly the γ parameter in Equation (20). Although a nominal value of 0.5 is commonly used for estimating ideal accuracy, it cannot be assumed that all pixels conform to this level of precision. Among the influential parameters is the kernel size used during matching. To achieve optimal performance, both the proposed algorithm and conventional local methods are tested with varying kernel sizes, selecting the configuration that yields the best output. However, uniform kernel sizes across the image may not result in uniform accuracy. A kernel that performs optimally in one region may not be ideal elsewhere, particularly across heterogeneous landscapes.
This motivates the use of adaptive methods, where kernel dimensions are dynamically adjusted to compensate for local variations. While the proposed algorithm supports flexible kernel configurations, its current design—based on unified computation—does not accommodate per-pixel dynamic kernel adaptation within a single processing pass. This presents an opportunity for future work, where simultaneous optimization of kernel sizes across the image may further improve disparity estimation accuracy.”
Comments on the Quality of English Language
1) English should be improved.
2) Abstract: Simplify technical jargon (e.g., “quadratic growth in computation time” → “quadratic computational complexity”).
Response. Thank you for your comment. Done!
We hope the changes that we have made are satisfactory.
Respectfully yours,
Authors
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript proposes an efficient dense matching algorithm for SAR radargrammetry. The novelty is clear, the experimental design is sound, and the results are convincing. It successfully addresses the critical issue of the quadratic growth in computational cost associated with large windows, which are often required for local methods when processing SAR images. The specific suggestions for revision are as follows:
- The introduction effectively outlines the dilemma between local and global/semi-global methods in the field of radargrammetry. However, the core innovation of this paper lies in a "low-complexity neighborhood vectorization" technique. The introduction fails to connect this idea with broader related work in the computer vision community, such as classic feature descriptors (e.g., SIFT, which also encodes neighborhood information into a vector) or fast filtering techniques (e.g., integral images). This makes the innovation appear somewhat isolated and fails to fully situate its contribution within a broader academic context. While the motivation is stated at the end of the introduction, the specific research gap could be summarized more concisely.
- The proposed method is a combination of multiple modules (coarse matching with a large kernel, feature dimensionality reduction, a two-stage process, median filter post-processing, etc.). The study validates the effectiveness of the entire method but does not analyze the contribution of each individual module to the final result. For instance, what would the performance be if the first stage used a large window directly without feature reduction? What is the accuracy of using only the first-stage results? Comparing the accuracy and performance of "using a 33x33 kernel reduced to a 9-element vector" versus "directly using a 9x9 kernel" would be insightful. The lack of an ablation study makes it difficult for readers to determine which components are key to the performance improvement.
- The choice of baselines (NCC, SGM series) is classic and appropriate. However, since the introduction mentions deep learning methods, it would be beneficial to briefly reiterate in the experimental setup why they were not included as a primary comparison (e.g., the scarcity of public, large-scale labeled SAR stereo datasets).
- The description on page 13 (Section 2.3.2) and in Figure 10, which details the core innovation of compressing large-window neighborhood information into a low-dimensional feature vector, is somewhat ambiguous. The phrase "The mean of these two directions is computed" is vague. How is this mean calculated? Is it a pixel-wise average of the two sampling results (from the row and column directions) to generate the final 3D data cube? This critical step requires a more explicit mathematical description or pseudocode.
- There are significant inconsistencies in the numbering of figures and tables throughout the manuscript. For example, the text on page 25 refers to "Figure 16 and Figure 17," but the caption on that page is for "Figure 15." This level of disorganization in cross-referencing is a major issue that needs to be thoroughly addressed.
- The statement on page 28 (line 739) that the method can "produce an output with limited elevation accuracy even without preprocessing" is a strong claim. However, the paper does not present any experimental results from a no-preprocessing scenario to support it.
- The conclusion lacks a discussion of the method's limitations. For instance, is the performance inferior to SGM (which includes a smoothness constraint) when dealing with repetitive textures? How does it perform at sharp disparity discontinuities? The absence of a discussion on limitations makes the conclusion seem less balanced and comprehensive.
Overall, the English expression in this manuscript is clear and understandable for the most part. However, there are some sentences that are unidiomatic, sometimes awkward, or verbose, along with a few minor errors in grammar and word choice. This suggests that the authors may not be native English speakers, but they possess a solid capability for scientific writing. The quality of the paper would be significantly enhanced after professional language polishing.
Author Response
Reviewer 2,
Dear reviewer,
We would like to thank you very much for your time and insightful comments, which helped us further improve the presentation of our manuscript. We have carefully addressed your comments in the revised version. Please find below our point-by-point responses; the corresponding changes are highlighted in the revised manuscript.
This manuscript proposes an efficient dense matching algorithm for SAR radargrammetry. The novelty is clear, the experimental design is sound, and the results are convincing. It successfully addresses the critical issue of the quadratic growth in computational cost associated with large windows, which are often required for local methods when processing SAR images. The specific suggestions for revision are as follows:
The introduction effectively outlines the dilemma between local and global/semi-global methods in the field of radargrammetry. However, the core innovation of this paper lies in a "low-complexity neighborhood vectorization" technique. The introduction fails to connect this idea with broader related work in the computer vision community, such as classic feature descriptors (e.g., SIFT, which also encodes neighborhood information into a vector) or fast filtering techniques (e.g., integral images). This makes the innovation appear somewhat isolated and fails to fully situate its contribution within a broader academic context. While the motivation is stated at the end of the introduction, the specific research gap could be summarized more concisely.
Response. Thank you for your comment. In addition to classical dense matching approaches and deep learning-based methods, there exists a distinct category of image matching techniques known as feature-based matching, which are typically implemented in a sparse rather than dense fashion. The main strength of these methods lies in their applicability to raw imagery—images with unknown geometry, often unrectified or lacking epipolar resampling. However, several considerations regarding these methods warrant attention.
First, feature-based algorithms begin by scanning the input images to select candidate points for matching—commonly referred to as interest points. A variety of criteria have been proposed to identify such points, typically corresponding to prominent image structures that are visible in both the reference and secondary images. These structures may include points, corners, circular or elliptical shapes, and similar salient features.
Second, as illustrated in the newly added figure in the introduction (Figure 2), such features tend to reside in the high-frequency components of the image and therefore represent a very small portion of the overall scene. This characteristic makes the comparison stage computationally efficient, but may limit suitability for dense matching applications where uniform coverage is required.
Third, given their use on unrectified or cross-sensor imagery, feature-based methods typically require descriptors to be defined for each selected keypoint. These descriptors—either manually chosen or automatically extracted—must demonstrate robustness, stability, and accuracy, which often necessitates nonlinear formulation. Considering these three aspects, such techniques are well-suited for establishing initial correspondences or predicting approximate pixel locations to reduce matching errors. Nonetheless, by design, they inherently operate sparsely rather than densely.
Moreover, even if the number of extracted keypoints is artificially increased, the computational complexity of the matching stage scales quadratically with the number of candidate points, as shown in the comparative tables (Table 1 and Table 2). Thus, applying such methods across all pixels—even with keypoint duplication—remains computationally inefficient. For this reason, a limited set of features is typically used to estimate image geometry and perform epipolar resampling, substantially reducing the search space to a linear domain.
Finally, the pixel-wise descriptor definition required for dense matching—when factoring in memory constraints and computational demands—becomes highly challenging, despite efforts to maintain acceptable accuracy. Unlike sparse cases with a limited set of interest points, dense matching must contend with a vastly larger solution space. Consequently, in radargrammetric or photogrammetric elevation modeling pipelines, dense matching is usually decoupled from geometry estimation. Epipolar rectification is typically performed automatically using feature-based methods, which have demonstrated excellent performance. Dense matching over raw imagery, however, remains impractical under most circumstances.
Together with Table 2, the following explanation has been integrated into the introduction section to provide comparative context and foster a deeper conceptual grasp:
“Feature-based correspondence techniques, beyond traditional dense matching and deep learning methods, offer a sparse alternative, particularly effective on raw or unrectified imagery. These methods first detect salient keypoints—such as corners, circular features, and edges—that appear in both reference and secondary images. Because these features often reside in high-frequency regions, they occupy only a small portion of the image, simplifying matching but limiting density.
This sparsity proves efficient for estimating initial geometric relationships or minimizing mismatch risk, but inherently lacks coverage for full-image matching. Even when keypoints are artificially multiplied, computational costs escalate quadratically, making full-pixel comparison impractical. As a result, a small subset of points is used to derive epipolar geometry and resample images, narrowing the search space to a linear domain.
Moreover, defining robust descriptors for each point—often nonlinear due to stability and invariance demands—is central to performance, especially when handling multisensor data. Applying such descriptors across all pixels for dense matching introduces memory and complexity bottlenecks. Therefore, in elevation modeling via radargrammetry or photogrammetry, dense matching is typically deferred until after geometric rectification using feature-based methods, which excel in this preprocessing stage.”
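To connect this discussion to a widely used computer-vision toolchain, the generic sketch below shows the sparse-features-then-rectify pattern with OpenCV. The file names are placeholders, Lowe's 0.75 ratio test and the RANSAC defaults are assumed settings, and this is not the SAR-specific geometry handling used in the manuscript.

```python
import cv2
import numpy as np

# Placeholder inputs: any epipolar-unrectified grayscale stereo pair.
img_l = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
img_r = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_l, des_l = sift.detectAndCompute(img_l, None)
kp_r, des_r = sift.detectAndCompute(img_r, None)

# Brute-force descriptor matching (quadratic in the number of keypoints),
# filtered with Lowe's ratio test.
matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_l, des_r, k=2)
good = [p[0] for p in matches if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]

pts_l = np.float32([kp_l[m.queryIdx].pt for m in good])
pts_r = np.float32([kp_r[m.trainIdx].pt for m in good])

# Fundamental matrix from the sparse correspondences, then uncalibrated
# rectification: dense matching afterwards becomes a 1-D search along rows.
F, inlier_mask = cv2.findFundamentalMat(pts_l, pts_r, cv2.FM_RANSAC)
h, w = img_l.shape
ok, H_l, H_r = cv2.stereoRectifyUncalibrated(pts_l, pts_r, F, (w, h))
rect_l = cv2.warpPerspective(img_l, H_l, (w, h))
rect_r = cv2.warpPerspective(img_r, H_r, (w, h))
```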
The proposed method is a combination of multiple modules (coarse matching with a large kernel, feature dimensionality reduction, a two-stage process, median filter post-processing, etc.). The study validates the effectiveness of the entire method but does not analyze the contribution of each individual module to the final result. For instance, what would the performance be if the first stage used a large window directly without feature reduction? What is the accuracy of using only the first-stage results? Comparing the accuracy and performance of "using a 33x33 kernel reduced to a 9-element vector" versus "directly using a 9x9 kernel" would be insightful. The lack of an ablation study makes it difficult for readers to determine which components are key to the performance improvement.
Response. Thank you for your comment. To provide a clearer understanding of the proposed algorithm's operational flow and the individual contributions of its two main stages, Tables 12 and 13 have been added to the results section. These tables illustrate the outputs of the first and second stages, respectively, and separately analyze their effects. Specifically, Table 13 includes two columns—one with the execution of the first stage and another without it—to facilitate a direct comparison and better highlight the impact of the initial stage.
The choice of baselines (NCC, SGM series) is classic and appropriate. However, since the introduction mentions deep learning methods, it would be beneficial to briefly reiterate in the experimental setup why they were not included as a primary comparison (e.g., the scarcity of public, large-scale labeled SAR stereo datasets).
Response. Thank you for your comment. The central contribution of this study lies in the strategic utilization of local methods that offer inherently low computational complexity. Among dense matching techniques, local approaches are theoretically the most efficient, and this computational advantage has informed the structural design of the proposed algorithm. As such, the algorithm not only retains the benefits of local methods but also achieves lower computational complexity compared to more elaborate alternatives.
To aid clarity, comparative tables (Tables 1 and 2) have been included in the introduction, providing an approximate estimation of computational complexity across major dense matching algorithms and situating deep learning models within this landscape. Dense matching algorithms are grouped into two broad categories: classical methods and deep learning–based approaches. Although both aim to produce accurate pixel-level correspondences, their architectures and deployment requirements differ fundamentally. Deep learning methods inherently rely on training, necessitating access to large-scale annotated datasets and high computational resources.
Due to the architectural complexity of deep networks—substantially greater than that of classical methods even in their optimized forms—GPU-based acceleration has become essential. This is now feasible on modern consumer-grade systems, rendering such designs increasingly practical. When based on linear algebraic operations such as dot products and summations, deep networks can achieve speedups of several orders of magnitude on GPU hardware.
The proposed algorithm adheres to this linear computational pattern, comprising primarily additive and multiplicative operations. Thus, the computational gap highlighted in the comparative table is expected to persist even under GPU deployment.
With respect to output quality, the proposed method is expected to outperform conventional local methods and approach the accuracy of semi-global techniques such as SGM, owing to its two-stage design that mitigates local method limitations while maintaining efficiency. Nevertheless, without further architectural enhancements or adaptive mechanisms, its performance may remain inferior to state-of-the-art deep networks—particularly on unseen datasets.
Effective deployment of deep networks on novel data typically requires retraining with representative samples, which poses a considerable challenge for non-specialist users. A fair numerical comparison between classical and deep methods remains inherently difficult, as performance strongly depends on familiarity with the network architecture and training protocols. Inadequate tuning may lead to biased or suboptimal outcomes, obscuring the full potential of deep models.
It is for this reason that deep learning methods are often benchmarked on standardized datasets and compared primarily against other neural approaches. This minimizes the need for retraining and ensures the integrity of comparative evaluations. The mention of deep networks in this study serves solely to establish a conceptual framework for theoretical comparison—based on computational complexity as understood at the time of writing. These metrics are expected to evolve with ongoing research. Despite their potential for high-quality results, proper exploitation of deep networks remains a technically demanding task, often beyond the reach of casual users or even those not deeply familiar with the subject matter.
The description on page 13 (Section 2.3.2) and in Figure 10, which details the core innovation of compressing large-window neighborhood information into a low-dimensional feature vector, is somewhat ambiguous. The phrase "The mean of these two directions is computed" is vague. How is this mean calculated? Is it a pixel-wise average of the two sampling results (from the row and column directions) to generate the final 3D data cube? This critical step requires a more explicit mathematical description or pseudocode.
Response. Thank you for your comment. The referenced figure has been revised, and the computation procedure for the output feature vector has been clearly illustrated within the updated diagram.
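Since pseudocode was requested, the hedged sketch below spells out one possible reading of the revised figure: the k×k neighbourhood is summarized along the row and column directions, the two L-element profiles are averaged element-wise, and the result fills the 3-D feature cube. The sampling scheme, window summaries, and names are an interpretation offered for illustration, not a verbatim transcription of the manuscript's procedure.

```python
import numpy as np

def neighbourhood_to_vector(img, k=33, L=9):
    """One possible interpretation (illustrative only) of reducing a k x k
    neighbourhood to an L-element descriptor via row/column summaries."""
    H, W = img.shape
    r = k // 2
    pad = np.pad(img.astype(np.float32), r, mode="reflect")
    idx = np.linspace(0, k - 1, L).round().astype(int)   # L sample positions inside the kernel

    cube = np.empty((H, W, L), dtype=np.float32)
    for y in range(H):
        for x in range(W):
            win = pad[y:y + k, x:x + k]
            row_dir = win.mean(axis=1)[idx]          # summary along the row direction
            col_dir = win.mean(axis=0)[idx]          # summary along the column direction
            cube[y, x] = 0.5 * (row_dir + col_dir)   # element-wise mean of the two
    return cube
```

A cube produced this way could then feed a vector-based cost computation such as the earlier sketch in this letter.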
There are significant inconsistencies in the numbering of figures and tables throughout the manuscript. For example, the text on page 25 refers to "Figure 16 and Figure 17," but the caption on that page is for "Figure 15." This level of disorganization in cross-referencing is a major issue that needs to be thoroughly addressed.
Response. Thank you for your comment. Cross-referencing and numbering have been revised throughout the manuscript text.
The statement on page 28 (line 739) that the method can "produce an output with limited elevation accuracy even without preprocessing" is a strong claim. However, the paper does not present any experimental results from a no-preprocessing scenario to support it.
Response. Thank you for your comment. To elucidate the algorithm's behavior in light of the discussed aspects, the output derived from its implementation on the raw (unprocessed) data has been added to the Results section (Figure 27). This inclusion enables a direct evaluation of the obtained output and underscores the importance of the preprocessing stage—particularly in terms of the quantitative values of the quality metrics—which is now discussed and substantiated in greater detail.
The conclusion lacks a discussion of the method's limitations. For instance, is the performance inferior to SGM (which includes a smoothness constraint) when dealing with repetitive textures? How does it perform at sharp disparity discontinuities? The absence of a discussion on limitations makes the conclusion seem less balanced and comprehensive.
Response. Thank you for your comment. The discussion and conclusion section has been revised to address the reviewer’s comment and to incorporate potential limitations when comparing with other existing methods.
Comments on the Quality of English Language
Overall, the English expression in this manuscript is clear and understandable for the most part. However, there are some sentences that are unidiomatic, sometimes awkward, or verbose, along with a few minor errors in grammar and word choice. This suggests that the authors may not be native English speakers, but they possess a solid capability for scientific writing. The quality of the paper would be significantly enhanced after professional language polishing.
Response. Thank you for your comment. We have carefully reviewed the manuscript and corrected the grammar and typographical issues.
We hope the changes that we have made are satisfactory.
Respectfully yours,
Authors
Reviewer 3 Report
Comments and Suggestions for Authors
- Your paper presents a new method; I suggest you emphasize its novelty in your introduction.
- Please fix the “Error! Reference source not found.” issue on line 390.
- You need more comparisons regarding Learning Based Models
- Please add complementary metrics such as RMSE, MAE, or elevation difference histograms against the reference DSM to improve result interpretability
- Some subsections, particularly 2.3, are quite dense and would benefit from clearer subheadings, bullet points, or diagrams to guide the reader.
Author Response
Reviewer 3,
Dear reviewer,
We would like to thank you very much for your time and insightful comments, which helped us further improve the presentation of our manuscript. We have carefully addressed your comments in the revised version. Please find below our point-by-point responses; the corresponding changes are highlighted in the revised manuscript.
- Your paper presents a new method; I suggest you emphasize its novelty in your introduction.
Response. Thank you for your comment. The introduction section has been revised to include additional explanations, along with Tables 1 and 2 and Figure 3, to more effectively present the problem and delineate the positioning of various methods relative to the topic under study. These additions also help clarify the conceptual foundation and intended contribution of the proposed algorithm.
- Please fix the “Error! Reference source not found.” issue on line 390.
Response. Thank you for your comment. Done!
- You need more comparisons regarding Learning Based Models
Response. Thank you for your comment. The central contribution of this study lies in the strategic utilization of local methods that offer inherently low computational complexity. Among dense matching techniques, local approaches are theoretically the most efficient, and this computational advantage has informed the structural design of the proposed algorithm. As such, the algorithm not only retains the benefits of local methods but also achieves lower computational complexity compared to more elaborate alternatives.
To aid clarity, comparative tables (Tables 1 and 2) have been included in the introduction, providing an approximate estimation of computational complexity across major dense matching algorithms and situating deep learning models within this landscape. Dense matching algorithms are grouped into two broad categories: classical methods and deep learning–based approaches. Although both aim to produce accurate pixel-level correspondences, their architectures and deployment requirements differ fundamentally. Deep learning methods inherently rely on training, necessitating access to large-scale annotated datasets and high computational resources.
Due to the architectural complexity of deep networks—substantially greater than that of classical methods even in their optimized forms—GPU-based acceleration has become essential. This is now feasible on modern consumer-grade systems, rendering such designs increasingly practical. When based on linear algebraic operations such as dot products and summations, deep networks can achieve speedups of several orders of magnitude on GPU hardware.
The proposed algorithm adheres to this linear computational pattern, comprising primarily additive and multiplicative operations. Thus, the computational gap highlighted in the comparative table is expected to persist even under GPU deployment.
With respect to output quality, the proposed method is expected to outperform conventional local methods and approach the accuracy of semi-global techniques such as SGM, owing to its two-stage design that mitigates local method limitations while maintaining efficiency. Nevertheless, without further architectural enhancements or adaptive mechanisms, its performance may remain inferior to state-of-the-art deep networks—particularly on unseen datasets.
Effective deployment of deep networks on novel data typically requires retraining with representative samples, which poses a considerable challenge for non-specialist users. A fair numerical comparison between classical and deep methods remains inherently difficult, as performance strongly depends on familiarity with the network architecture and training protocols. Inadequate tuning may lead to biased or suboptimal outcomes, obscuring the full potential of deep models.
It is for this reason that deep learning methods are often benchmarked on standardized datasets and compared primarily against other neural approaches. This minimizes the need for retraining and ensures the integrity of comparative evaluations. The mention of deep networks in this study serves solely to establish a conceptual framework for theoretical comparison—based on computational complexity as understood at the time of writing. These metrics are expected to evolve with ongoing research. Despite their potential for high-quality results, proper exploitation of deep networks remains a technically demanding task, often beyond the reach of casual users or even those not deeply familiar with the subject matter.
- Please add complementary metrics such as RMSE, MAE, or elevation difference histograms against the reference DSM to improve result interpretability.
Response. Thank you for your comment. Figure 26 presents the output for the Sentinel-1 dataset, while Figure 30 corresponds to the newly added TerraSAR-X dataset, which was incorporated to support broader comparative assessment and to investigate the generalizability of the proposed algorithm.
- Some subsections, particularly 2.3, are quite dense and would benefit from clearer subheadings, bullet points, or diagrams to guide the reader.
Response. Thank you for your comment. To offer readers a cohesive understanding of the proposed algorithm and to clearly contextualize the equations introduced in Section 2.3, a flow diagram (Figure 17) has been incorporated at the end of this section. This figure provides a stage-wise depiction of the algorithm’s execution while situating each equation within its corresponding step, thereby enhancing the conceptual clarity and continuity of the presentation.
We hope the changes that we have made are satisfactory.
Respectfully yours,
Authors
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The paper could be accepted.