Extending SETSM Capability from Stereo to Multi-Pair Imagery
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This is a well-written, interesting, and important paper describing a new approach: an extension of the SETSM algorithm to matching across multiple stereopairs.
Overall, the paper balances detail of the new methodology well with concepts already covered in (Noh, Howat; 2015, 2017, 2018, 2019).
I have only a few minor comments, questions, and suggestions, mostly to increase clarity.
- Throughout, but particularly in the introduction, recommend breaking up the text into more paragraphs to make it more digestible. Paragraph breaks could, for example, indicate whether the material describes existing, competing, or present approaches.
- Line 122: "...VLL to geometrically constrain..."
- Line 221: Suggest "is approximated by" instead of "is given by." Either way is probably fine, though.
- Line 224: Suggest instead something like “…are the standard errors of the contributions of the image matching measurement in image space units (same as f) and object space units (same as Z), respectively.”
- Line 251: Typo: “of at coarser levels”
- In Figure 2, it appears that the intersection of the WNCC_p and the cumulative WNCC_p is the criterion for determining the WNCC_th threshold. I think this is an unfortunate byproduct of the selected scale for the WNCCp ordinate (0 to 5%). It may be made clearer if the max for this axis was made to be, e.g., 3%.
- Line 248: If P_wncc is increased at each pyramid level, and assuming increased pyramid level is a decrease in resolution, why would (line 251) coarser levels have higher WNCCth? Clarify here how the pattern of “subsequent” image level affects resolution. Coarse-to-fine.
- Line 274: Typo: “in a … similarly measurements”
- Figure 4, and related text. Please elaborate on this process. What are the red lines and what are the blue dots? Are the finer-scale GSD pixels resampled based on the GSD ratio?
- Figure 5: should K_t=6 be K_t=7? This is somewhat answered later in the document when it is made clear that even-numbered kernel sizes are used (max = 20), but it does look like a 7 x 7 kernel size.
- Line 325: What is the point of f_s, the scaling factor in the weighting?
- Eq. 7, Line 348: the middle part (= sum(WH_q)/sum(WQ_oh)) is not needed since SWQ_mp is defined in the text, and is expected.
- Line 380: please provide a brief explanation why a too-large kernel size would lead to overestimation of FWOH_MP.
- If possible, please describe the model of DMC used in the experiments.
- Figure 12 and related text: make it explicit in the caption (and first discussion in the text) that the result (a) is from two-image SETSM (MMP?), and (b-e) are from SETSM MMP.
- Line 528: "most accurate expected height accuracy” --> maybe "highest" or "best" expected height accuracy?
- Line 535: For more than 3 optimal heights, yes, but those with 3 optimal heights were subsequently called out in the text. The best RMSE for 3 optimal heights was achieved with 4 images, not 6. This is probably fine.
- Sorry if I missed it in the manuscript, but in what order were the stereopairs introduced in the experiment as the number of images (2, 3, 4, 5, 6) was increased? From lowest to highest Table 1 ID?
- Line 559: Suggest: “(green scale color in Figure 15c)”
- Line 566, 564: "DSM"
- Figure 16: Suggest moving this to earlier in the Results section.
- Line 586: Explicitly, (c) reduces smooth-surface blunders or enhances edges compared to (e) and (f)? It would be helpful to circle or otherwise point out specific areas where this is evident.
- Line 605: space between 1 and %.
- Line 605: 3.71% can be found using Figure 18 and Table 2. How would a reader independently arrive at 3.71%?
- Figure 18: Some numbers are occluded and not readable.
- Line 621: “Center Parc Stadium”
- Line 623: “lower-left corner” – Further, the portion that is not reconstructed is not evident (upper section of the stadium).
Author Response
Please see the attachment
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript presents a significant extension of the SETSM algorithm to multi-pair imagery (SETSM MMP) for improved DSM generation, addressing occlusion challenges in urban environments. The work is technically sound and aligns with the journal’s scope. However, several technical clarifications and contextual enhancements are needed for reproducibility and scholarly rigor.
1: The manuscript claims SETSM MMP "efficiently eliminates occlusions" (DMC results in Fig. 21), but quantitative metrics for occlusion recovery are absent.
2: Figure 17 compares variants of 3D KWHE but omits baselines like traditional SGM or CNN-based multi-view methods.
3: Add a baseline (SGM with median/mean merging) in Fig. 17–19 to contextualize gains from KWHE.
4: The 9-year gap between WV-2 (2009) and LiDAR (2018) data introduces confounding factors (new buildings in Fig. 15c). While acknowledged, the impact on RMSE is not isolated.
5: The pixel-to-pixel similarity adjustment (Fig. 4) assumes a linear GSD ratio for kernel repositioning. This may fail for extreme off-nadir cases (WV-2 ID6, GSD=0.92m vs. 0.47m) where projective distortion is nonlinear. How does the method handle nonlinear distortions (tall buildings leaning in off-nadir views)?
6: The literature review dismisses CNN-based methods (PSM-Net, GCS-Net) as "parameter-sensitive" but overlooks recent advances. To strengthen the motivation:
Oriented SAR Ship Detection based on Edge Deformable Convolution and Point Set Representation
Weighted Pseudo-Labels and Bounding Boxes for Semisupervised SAR Target Detection
7: Discuss scalability to large-scale datasets and handling of non-urban terrains.
8: Highlight implications for time-series DSM generation given the temporal gap issue.
9: Use temporal consistency masks to filter changed regions in RMSE calculations (Section 3.3).
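As an illustration of comment 9, the following is a minimal sketch of an RMSE that skips cells flagged as changed between the image and LiDAR acquisitions. It is not taken from the manuscript: the function name `masked_rmse`, the grid layout (equally sized 2D lists, with `None` for no-data), and the sample heights are all assumptions made for this example, and how the change mask itself is derived is left open.

```python
import math


def masked_rmse(dsm, reference, changed_mask):
    """RMSE of (dsm - reference), ignoring cells flagged as changed.

    Hypothetical inputs: three equally sized 2D lists. changed_mask is True
    where the surface changed between acquisitions; None marks no-data.
    """
    squared_sum = 0.0
    count = 0
    for dsm_row, ref_row, mask_row in zip(dsm, reference, changed_mask):
        for z, z_ref, changed in zip(dsm_row, ref_row, mask_row):
            if changed or z is None or z_ref is None:
                continue  # skip changed or missing cells
            squared_sum += (z - z_ref) ** 2
            count += 1
    if count == 0:
        raise ValueError("no unchanged cells left to evaluate")
    return math.sqrt(squared_sum / count)


# Made-up heights in metres: the lower-right cell stands for a new building.
dsm = [[10.0, 12.0], [11.0, 30.0]]
lidar = [[10.5, 12.0], [11.0, 11.0]]
changed = [[False, False], [False, True]]
no_mask = [[False, False], [False, False]]

rmse_all = masked_rmse(dsm, lidar, no_mask)
rmse_unchanged = masked_rmse(dsm, lidar, changed)
print(f"all cells: {rmse_all:.3f} m, unchanged cells only: {rmse_unchanged:.3f} m")
```

Reporting both values side by side would show how much of the error is attributable to surface change rather than to matching.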
Author Response
Please see the attachment
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This is a very good article that could benefit from minor refinements. Here they are:
1) The introduction is a bit confusing to read. You have written a paragraph that spans over nearly 3 pages and the text is extremely direct, without more in-depth explanations of each step. Please break it into paragraphs and explain the context of extending SETSM capabilities to stereo-pairs. Otherwise, you'll lose readers right at the beginning.
2) Again, the methods section is a bit confusing. Instead of listing from A to E, you should integrate what you say at the very beginning with the 2.1, 2.2, etc. subtopics. I found myself having to go back and forth to understand what you meant to say.
3) From 2.1 all the way through 2.4, the explanations are very clear to me. Please note, though, that some images are hard to read, and they greatly enhance comprehension. Ensure the text in images is large enough to be readable.
4) Also, when it comes to section 3, I believe the images and LiDAR data should be moved to a materials section under 2. So, I suggest creating a 2.1 Materials section, with 2.1.1 Worldview images, 2.1.2 LiDAR data and so on... and a 2.2 Methods section with the revised content of section 2. It would have been easier for me to understand the methodology had I seen the sample data used before. For example: I was curious about the reference LiDAR DTM used.
5) Regarding the six scenarios from (a) to (e) starting on line 598, was there an evident difference in terms of processing time when applying SGM strategies? Some of them yield similar results, so, at the end of the day, processing time could play an important role in terms of applying the algorithm to a large variety of overlapping scenes.
6) Your article already does an excellent job of extending SETSM to multiple stereopairs. However, your test data uses mostly off-nadir images oriented toward the northeast. You should address this in the discussion section and perhaps elaborate a bit more on the applicability of the algorithm when dealing with off-nadir images oriented toward different directions (and thus, different parallax and occlusion between images). There's no need to redo your work with images like these, but the reality is that it is very difficult to find several WorldView images oriented towards a similar direction, and this constraint will have to be taken into consideration when applying the algorithm (or its refinements) to other scenarios. This could be added to the discussion section.
7) It is not mandatory, but I'd recommend adding a 2-3 paragraph conclusion which summarizes what was done and the key findings.
Good luck with the revision. I'm looking forward to reading the new version of the article.
Author Response
Please see the attachment
Author Response File:
Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
ACCEPT!

