Article
Peer-Review Record

Extending SETSM Capability from Stereo to Multi-Pair Imagery

Remote Sens. 2025, 17(18), 3206; https://doi.org/10.3390/rs17183206
by Myoung-Jong Noh 1,* and Ian M. Howat 1,2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 25 July 2025 / Revised: 9 September 2025 / Accepted: 15 September 2025 / Published: 17 September 2025
(This article belongs to the Section Remote Sensing Image Processing)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This is a well-written, interesting, and important paper describing a new approach: an extension of the SETSM algorithm to multiple-stereopair matching.

Overall, the paper balances detail of the new methodology well with concepts already covered in (Noh, Howat; 2015, 2017, 2018, 2019).

I have only a few minor comments, questions, and suggestions, mostly to increase clarity.

  • Throughout, but particularly in the introduction, recommend breaking up the text into more paragraphs to make it more digestible. This would help, for example, to indicate whether the material describes existing, competing, or the present approaches.
  • Line 122: "...VLL to geometrically constrain..."
  • Line 221: Suggest "is approximated by" instead of "is given by." Either way is probably fine, though.
  • Line 224: Suggest instead something like “…are the standard errors of the contributions of the image matching measurement in image space units (same as f) and object space units (same as Z), respectively.”
  • Line 251: Typo: “of at coarser levels”
  • In Figure 2, it appears that the intersection of the WNCC_p and the cumulative WNCC_p is the criterion for determining the WNCC_th threshold. I think this is an unfortunate byproduct of the selected scale for the WNCC_p ordinate (0 to 5%). It may be clearer if the maximum for this axis were set to, e.g., 3%.
  • Line 248: If P_wncc is increased at each pyramid level, and assuming an increased pyramid level means a decrease in resolution, why would (line 251) coarser levels have higher WNCC_th? Clarify here how the pattern of “subsequent” image level affects resolution, i.e., coarse-to-fine.
  • Line 274: Typo: “in a … similarly measurements”
  • Figure 4, and related text. Please elaborate on this process.  What are the red lines and what are the blue dots?  Are the finer-scale GSD pixels resampled based on the GSD ratio?
  • Figure 5: should K_t=6 be K_t=7?  This is somewhat answered later in the document when it is made clear that even-numbered kernel sizes are used (max = 20), but it does look like a 7 x 7 kernel size.
  • Line 325: What is the point of f_s, the scaling factor in the weighting?
  • Eq. 7, Line 348: the middle part (=sum(WH_q)/sum(WQ_oh)) is not needed since SWQ_mp is defined in the text, and is expected.
  • Line 380: please provide a brief explanation why a too-large kernel size would lead to overestimation of FWOH_MP.
  • If possible, please describe the model of DMC used in the experiments.
  • Figure 12 and related text: make it explicit in the caption (and first discussion in the text) that the result (a) is from two-image SETSM (MMP?), and (b-e) are from SETSM MMP.
  • Line 528: "most accurate expected height accuracy” --> maybe "highest" or "best" expected height accuracy?
  • Line 535: For more than 3 optimal heights, yes, but those with 3 optimal heights were subsequently called out in the text. The best RMSE for 3 optimal heights was achieved with 4 images, not 6.  This is probably fine.
  • Sorry if I missed it in the manuscript, but in what order were the stereopairs introduced in the experiment as the number of images (2, 3, 4, 5, 6) was increased? From lowest to highest Table 1 ID?
  • Line 559: Suggest: “(green scale color in Figure 15c)”
  • Line 566, 564: "DSM"
  • Figure 16: Suggest moving this to earlier in the Results section.
  • Line 586: Explicitly, (c) reduces smooth-surface blunders or enhances edges compared to e and f?  It would be helpful to circle or otherwise point out specific areas where this is evident.
  • Line 605: space between 1 and %.
  • Line 605: 3.71% can be found using Figure 18 and Table 2. How would a reader independently arrive at 3.71%? 
  • Figure 18: Some numbers are occluded and not readable.
  • Line 621: “Center Parc Stadium”
  • Line 623: “lower-left corner” – Further, the portion that is not reconstructed is not evident (upper section of the stadium).

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This manuscript presents a significant extension of the SETSM algorithm to multi-pair imagery (SETSM MMP) for improved DSM generation, addressing occlusion challenges in urban environments. The work is technically sound and aligns with the journal’s scope. However, several technical clarifications and contextual enhancements are needed for reproducibility and scholarly rigor.
1: The manuscript claims SETSM MMP "efficiently eliminates occlusions" (DMC results in Fig. 21), but quantitative metrics for occlusion recovery are absent.
2: Figure 17 compares variants of 3D KWHE but omits baselines like traditional SGM or CNN-based multi-view methods.
3: Add a baseline (SGM with median/mean merging) in Fig. 17–19 to contextualize gains from KWHE.
4: The 9-year gap between WV-2 (2009) and LiDAR (2018) data introduces confounding factors (new buildings in Fig. 15c). While acknowledged, the impact on RMSE is not isolated.
5: The pixel-to-pixel similarity adjustment (Fig. 4) assumes a linear GSD ratio for kernel repositioning. This may fail for extreme off-nadir cases (WV-2 ID6, GSD=0.92m vs. 0.47m) where projective distortion is nonlinear. How does the method handle nonlinear distortions (tall buildings leaning in off-nadir views)?
6: The literature review dismisses CNN-based methods (PSM-Net, GCS-Net) as "parameter-sensitive" but overlooks recent advances. To strengthen the motivation:
Oriented SAR Ship Detection based on Edge Deformable Convolution and Point Set Representation
Weighted Pseudo-Labels and Bounding Boxes for Semisupervised SAR Target Detection
7: Discuss scalability to large-scale datasets and handling of non-urban terrains.
8: Highlight implications for time-series DSM generation given the temporal gap issue.
9: Use temporal consistency masks to filter changed regions in RMSE calculations (Section 3.3).
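
For illustration only, the kind of evaluation requested in comments 3 and 9 could be sketched as follows. This is a minimal sketch and not the authors' or SETSM's implementation: the function names `merge_dsms` and `masked_rmse`, the use of NaN as the no-data value, and the assumption that all grids are already co-registered on the same raster are hypothetical choices made here. How the change mask itself is derived (for example, from known new construction between the image and LiDAR acquisition dates) is left open.

```python
import numpy as np


def merge_dsms(dsm_stack, method="median"):
    """Merge co-registered pairwise DSMs of shape (n_pairs, rows, cols).

    NaN cells are treated as no-data and ignored per pixel (assumption).
    """
    stack = np.asarray(dsm_stack, dtype=float)
    if method == "median":
        return np.nanmedian(stack, axis=0)
    if method == "mean":
        return np.nanmean(stack, axis=0)
    raise ValueError(f"unknown merging method: {method!r}")


def masked_rmse(dsm, reference, change_mask=None):
    """RMSE between a DSM and a reference surface.

    Cells that are NaN in either grid are skipped. If change_mask is given,
    cells where it is True (flagged as changed between acquisitions) are
    excluded as well.
    """
    dsm = np.asarray(dsm, dtype=float)
    reference = np.asarray(reference, dtype=float)
    valid = np.isfinite(dsm) & np.isfinite(reference)
    if change_mask is not None:
        valid &= ~np.asarray(change_mask, dtype=bool)
    if not valid.any():
        return float("nan")
    diff = dsm[valid] - reference[valid]
    return float(np.sqrt(np.mean(diff ** 2)))
```

Reporting the RMSE both with and without the mask would indicate how much of the error is attributable to the temporal gap rather than to the matching itself.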

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This is a very good article that could benefit from minor refinements. Here they are:

1) The introduction is a bit confusing to read. You have written a paragraph that spans nearly 3 pages, and the text is extremely direct, without more in-depth explanations of each step. Please break it into paragraphs and explain the context of extending SETSM capabilities to stereo-pairs. Otherwise, you'll lose readers right at the beginning.

2) Again, the methods section is a bit confusing. Instead of listing from A to E, you should integrate what you say at the very beginning with the 2.1, 2.2, etc. subtopics. I found myself having to go back and forth to understand what you meant to say. 

3) From 2.1 all the way through 2.4, the explanations are very clear to me. Please note, though, that some images are hard to read, and they greatly enhance comprehension. Ensure the text in images is large enough to be readable.

4) Also, when it comes to section 3, I believe the images and LiDAR data should be moved to a materials section under 2. So, I suggest creating a 2.1 Materials section, with 2.1.1 Worldview images, 2.1.2 LiDAR data and so on... and a 2.2 Methods section with the revised content of section 2. It would have been easier for me to understand the methodology had I seen the sample data beforehand. For example: I was curious about the reference LiDAR DTM used.

5) Regarding the six scenarios from (a) to (e) starting on line 598, was there an evident difference in terms of processing time when applying SGM strategies? Some of them yield similar results, so, at the end of the day, processing time could play an important role in terms of applying the algorithm to a large variety of overlapping scenes.

6) Your article already does an excellent job at extending SETSM to multiple stereopairs. However, your test data uses mostly off-nadir images oriented toward the northeast. You should address this in the discussion section and perhaps elaborate a bit more on the applicability of the algorithm when dealing with off-nadir images oriented toward different directions (and thus, different parallax and occlusion between images). There's no need to redo your work with images like these, but the reality is that it is very difficult to find several WorldView images oriented towards a similar direction, and this constraint will have to be taken into consideration when applying the algorithm (or its refinements) to other scenarios.

7) It is not mandatory, but I'd recommend adding a 2-3 paragraph conclusion which summarizes what was done and the key findings.

Good luck with the revision. I'm looking forward to reading the new version of the article.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

ACCEPT!
