Article
Peer-Review Record

An Enhanced Image Feature Extraction and Matching Method for Three-Dimensional Reconstruction of Forest Scenes

Remote Sens. 2026, 18(11), 1681; https://doi.org/10.3390/rs18111681
by Hangui Wang and Hongyu Huang *
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 20 March 2026 / Revised: 15 May 2026 / Accepted: 20 May 2026 / Published: 22 May 2026
(This article belongs to the Special Issue Digital Modeling for Sustainable Forest Management)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Please find attached.

Comments for author File: Comments.pdf

Author Response

Thank you very much for taking the time to review our manuscript. Please find the detailed responses below and the corresponding revisions (in track changes) in the re-submitted files.

Comments 1: The method introduces an image pyramid (4 levels) to extract multi-scale features via ALIKED, but the fusion process is not clearly described. Please specify how keypoints across scales are combined: coordinate rescaling method, cross-scale non-maximum suppression radius, normalization and comparison of confidence scores across scales, and whether descriptors are re-normalized after fusion.

Response 1: We appreciate the insightful points raised by the reviewer. We have added a description of the feature fusion process in Section 2.2. Specifically, the keypoint coordinates detected at the i-th pyramid level are multiplied by that level’s scaling factor to project them onto the original-resolution image. To eliminate duplicate detections, we apply cross-scale non-maximum suppression (NMS) with a radius of 1 pixel. Confidence scores are on a consistent scale across all levels, so they are compared directly by magnitude. The fused descriptors are re-normalized, since feature matching would otherwise not be possible. The specific details have been added on Page 7, lines 13 through 18 (clean version).
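The fusion steps described in this response (rescaling, cross-scale NMS, descriptor re-normalization) can be sketched roughly as follows. This is an illustrative NumPy sketch, not the authors' implementation: the function name `fuse_multiscale_keypoints`, the tuple layout of `levels`, the greedy score-ordered suppression, and the use of L2 normalization are all assumptions made for the example.

```python
import numpy as np

def fuse_multiscale_keypoints(levels, nms_radius=1.0):
    """Fuse per-level keypoints into one set at the original resolution.

    levels: list of (keypoints, scores, descriptors, scale) tuples (hypothetical
    layout). keypoints is (N, 2) in that level's pixel coordinates; scale maps
    level coordinates back to the original image (e.g. 2.0 for half resolution).
    """
    # 1) Project every level's keypoints onto the original-resolution image.
    kpts = np.concatenate([k * s for k, _, _, s in levels], axis=0)
    scores = np.concatenate([sc for _, sc, _, _ in levels], axis=0)
    descs = np.concatenate([d for _, _, d, _ in levels], axis=0)

    # 2) Greedy cross-scale NMS (assumed variant): visit keypoints by descending
    #    score and drop any keypoint within nms_radius of one already kept.
    order = np.argsort(-scores, kind="stable")
    keep = []
    for i in order:
        if keep:
            dist = np.linalg.norm(kpts[keep] - kpts[i], axis=1)
            if np.any(dist <= nms_radius):
                continue
        keep.append(int(i))
    keep = np.array(keep, dtype=int)

    # 3) Re-normalize the surviving descriptors to unit L2 norm (assumed norm).
    fused = descs[keep]
    norms = np.linalg.norm(fused, axis=1, keepdims=True)
    fused = fused / np.maximum(norms, 1e-12)
    return kpts[keep], scores[keep], fused
```

The pairwise distance check is quadratic in the number of kept keypoints, which is acceptable for a sketch; a spatial grid or KD-tree would be the usual choice at realistic keypoint counts.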

Comments 2: The calculation of Average Precision and Average Inlier Number relies on the definition of inliers. Please clarify: which geometric model and robust estimator were used? What thresholds were applied? Were the same parameters used consistently for SIFT, ALIKED+LightGlue, and the proposed method?

Response 2: We appreciate the reviewer’s attention. Inliers are defined as matches that fit the estimated geometric constraint (such as a homography or fundamental matrix). We have added the definition of inliers to the corresponding section. Specifically, we use the LO-RANSAC (Locally Optimized RANSAC) implementation integrated within COLMAP to estimate the fundamental matrix and determine inlier correspondences, with an inlier threshold of 1 pixel. These parameters were kept identical across all compared methods. This information has been updated in the text.
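To make the inlier definition concrete, the sketch below scores a set of matches against an already estimated fundamental matrix and reports the inlier fraction at a 1-pixel threshold. It is only an illustration under stated assumptions: it does not run LO-RANSAC or call COLMAP, the residual is taken to be the Sampson error (the response does not say which residual is used), and the names `sampson_error` and `inlier_precision` are hypothetical.

```python
import numpy as np

def sampson_error(F, pts1, pts2):
    """Sampson (first-order geometric) error of each match under F, in px^2."""
    ones = np.ones((len(pts1), 1))
    x1 = np.hstack([pts1, ones])          # (N, 3) homogeneous points, image 1
    x2 = np.hstack([pts2, ones])          # (N, 3) homogeneous points, image 2
    Fx1 = x1 @ F.T                        # rows are F @ x1_i (lines in image 2)
    Ftx2 = x2 @ F                         # rows are F.T @ x2_i (lines in image 1)
    num = np.sum(x2 * Fx1, axis=1) ** 2   # (x2^T F x1)^2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / np.maximum(den, 1e-12)

def inlier_precision(F, pts1, pts2, threshold_px=1.0):
    """Return (boolean inlier mask, inlier fraction) for the given matches."""
    if len(pts1) == 0:
        return np.zeros(0, dtype=bool), 0.0
    err = sampson_error(F, pts1, pts2)
    inliers = err <= threshold_px ** 2    # compare squared error to squared threshold
    return inliers, float(inliers.mean())
```

Averaging the per-pair inlier fraction and inlier count over all matched image pairs would give quantities of the kind the reviewer asks about, provided the same threshold is used for every method.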

Comments 3: The neighborhood constraint and NetVLAD retrieval are applied only in the proposed method. To isolate the effect of the feature extraction and matching front-end, please include a control experiment where SIFT and the baseline ALIKED+LightGlue use the same pair selection strategy as the proposed method.

Response 3: We sincerely appreciate the thorough and professional suggestions provided by the reviewer. To clarify the practical effects of the feature extraction and matching strategies, we conducted independent experimental analyses in Section 3.3 (ablation experiments) to evaluate the performance of different module combinations. Additionally, we conducted comparative experiments with the SIFT algorithm to observe the impact of the neighborhood constraint. The data were added to Tables 7 and 8.

Comments 4: The paper emphasizes efficiency but lacks runtime data. Please report: a) Average feature extraction time per image per pyramid level; b) Per-pair matching time with LightGlue; c) Total SfM processing time and memory usage for both datasets.

Response 4: Thank you for your advice. While this paper does emphasize efficiency, its primary objective is to highlight the reduction in the number of image pairs that must be matched, rather than the time spent on feature extraction and on individual matches. In fact, feature extraction accounts for a small portion of the total time in sparse 3D reconstruction: specifically, ALIKED feature extraction took approximately 2 minutes for both datasets; with the inclusion of image pyramids, the time increased to around 3 minutes. In contrast, the feature matching phase consumes the majority of time and memory resources, so the time spent on feature extraction is negligible compared to the overall SfM process. The core optimization focus of this paper is to enhance the efficiency of the feature matching step. To this end, we have listed the total matching times and the number of actual matched image pairs for each method in Tables 7 and 8, clearly demonstrating the improvements in efficiency. A comparative analysis is also provided in the ablation experiments section.

Regarding memory consumption: Due to the use of deep learning models, GPU memory requirements are relatively high. All experiments were conducted on an NVIDIA RTX 4090 GPU, with peak VRAM usage of approximately 25 GB (about 80% of the total GPU memory). Our proposed method effectively reduces the computational burden during matching by eliminating redundant or unnecessary image pairs, thereby shortening the time required for the matching phase. These numbers have been added to Tables 7 and 8 of the article. We also expanded the discussion on this point.
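The pair-reduction idea can be illustrated with a small sketch that builds the list of image pairs sent to the matcher from a sequential window plus top-K retrieval on global descriptors. Several things here are assumptions rather than the authors' procedure: the function name `select_pairs`, the use of cosine similarity on precomputed global descriptors (standing in for NetVLAD output), and the choice to take the union of both rules. The default values 5 and 10 echo the window size and top-K quoted by Reviewer 3 later in this record.

```python
import numpy as np

def select_pairs(global_desc, window=5, top_k=10):
    """Return the set of image pairs (i, j), i < j, to send to the matcher.

    global_desc: (n, d) array with one global descriptor per image, rows in
    capture order (assumed to come from an image-retrieval model).
    """
    n = len(global_desc)
    pairs = set()

    # Neighborhood constraint: pair each frame with the next `window` frames.
    for i in range(n):
        for j in range(i + 1, min(i + window + 1, n)):
            pairs.add((i, j))

    # Retrieval: add each image's top_k most similar images (cosine similarity),
    # which can recover non-sequential overlaps such as loop closures.
    desc = global_desc / np.linalg.norm(global_desc, axis=1, keepdims=True)
    sim = desc @ desc.T
    np.fill_diagonal(sim, -np.inf)        # never pair an image with itself
    k = min(top_k, n - 1)
    for i in range(n):
        for j in np.argsort(-sim[i])[:k]:
            pairs.add((min(i, int(j)), max(i, int(j))))
    return pairs
```

For n images, exhaustive matching requires n(n-1)/2 pairs, whereas this selection grows at most linearly, bounded by n × (window + top_k), which is where the saving in matching time would come from.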

Comments 5: The method introduces several key parameters: K in NetVLAD retrieval, neighborhood window size, number of pyramid levels, and per-level keypoint caps. Please provide sensitivity analyses showing how these parameters affect registration rate, AP, and runtime, to justify the chosen values.

Response 5: We appreciate the expert suggestions. In this paper, the parameter values were determined through iterative adjustment and comparison in our previous experiments. These parameters include the K-value in NetVLAD retrieval, the neighborhood constraint window size, the number of pyramid layers, and the upper limit of keypoints per layer. The neighborhood constraint accounts for overlap rates; the K-value balances efficiency with full coverage of overlapping regions; the pyramid layer count prevents excessive downsampling from blurring the lower-layer images and causing points to be misidentified; and the maximum number of keypoints per layer ensures optimal feature extraction (typically twice the number of layers at the highest resolution). We now explicitly state that the final result is sensitive to these parameters, and that we conducted experiments systematically to determine the appropriate values.

Comments 6: The current comparison includes only SIFT and ALIKED+LightGlue. To better position the method relative to the state of the art, please add comparisons to additional learned feature and matching pipelines.

Response 6: We appreciate the reviewer’s professional and insightful suggestions. Our goal was to explore methods for better handling three-dimensional modeling of forest imagery, but at present we do not have the time or resources to compare all of the available methods exhaustively. We accept the reviewer’s feedback and will conduct such comparative experiments in the future. In the Discussion we added this paragraph:

“This study focuses on exploring the effectiveness of the improved ALIKED+LightGlue baseline model combination in SfM reconstruction of forest scenes. It does not intend to provide a systematic or exhaustive comparison with other mainstream deep learning feature matching methods, such as SuperPoint+SuperGlue. This limitation indeed restricts a comprehensive evaluation of the performance advantages of the proposed approach. As a follow-up effort, we plan to conduct quantitative comparison experiments with representative deep learning methods on a broader set of natural scene datasets, covering aspects such as matching accuracy, reconstruction completeness, computational efficiency, and robustness. This will further validate the competitiveness and applicability boundaries of our framework and provide more universally applicable references for algorithm selection”.

Comments 7: The paper claims that the method offers a cost-effective alternative to LiDAR, but no quantitative comparison against LiDAR or field measurements is provided. Please either add such validation or revise the claim to reflect what is actually demonstrated.

Response 7: We appreciate your expert comment. We have downplayed the mention of alternatives to LiDAR in the abstract and conclusions. We now describe our method as a “cost-effective and complementary photogrammetric solution” suitable for scenarios where the use of LiDAR sensors is logistically or financially constrained, while acknowledging that LiDAR remains the gold standard for absolute structural accuracy. The second point under “What are the implications of the main findings?” in the Highlights section on the first page was removed and replaced with: “It is now possible to achieve large-scale, high-fidelity 3D reconstructions of complex natural environments such as forests using widely available and low-cost mobile video devices.” Relevant information was also removed from the abstract.

Response to Comments on the Quality of English Language

Point 1: The English needs to be polished.

Response 1: We appreciate your constructive comment. We agree that the language in the previous version needed improvement. To address this, we have thoroughly revised the manuscript. We believe the revised manuscript is now much clearer and easier to follow.

Thanks again for your review comments and suggestions.

Reviewer 2 Report

Comments and Suggestions for Authors

This manuscript proposes an "improved SIFT" combined with adaptive
Harris corner detection and a relaxation matching strategy for 3D reconstruction
of forest scenes. While the problem of feature extraction in texture-poor and
repetitive-texture environments is relevant, this manuscript suffers from
significant weaknesses in novelty, experimental depth, and reproducibility
that preclude acceptance at this venue. Major revisions would be required
before reconsideration.


1. The claimed contributions are not sufficiently novel. The paper essentially
   combines three known techniques: (i) SIFT with parameter adjustments, (ii)
   Harris corner detection, and (iii) relaxation matching. These individual
   components have been previously published in the literature, and the
   paper does not demonstrate a synergistic combination that produces
   qualitatively new capabilities. The term "I-SIFT" merely suggests parameter
   tuning or minor modifications to the standard SIFT descriptor, which does
   not constitute a novel method. The field has moved substantially beyond
   hand-crafted features; methods such as SuperPoint, SuperGlue, and
   learned feature matching achieve dramatically better performance on
   challenging scenes. The paper does not acknowledge or compare against
   any deep-learning-based approaches, which represents a critical gap. The authors must clearly articulate what specific
   algorithmic innovation distinguishes their work from prior arts, and
   why this innovation specifically benefits forest scene reconstruction.

2. The related work section is superficial and fails to properly situate
   this work within the broader literature. Key missing references include:
   - Learned local features: SuperPoint (Detone et al., CVPR 2018),
     SuperGlue (Sarlin et al., CVPR 2020)
   - Point cloud generation in vegetation: NeRF-based approaches for
     vegetation, such as TRADNet: Temporal and Regional-Aware Diffusion Model for Point Cloud Generation, 2025
   Furthermore, the paper does not cite or compare with established
   forest-specific 3D reconstruction methods, which is surprising given
   the specific application claimed in the title.

   Recommendation: A comprehensive literature review covering both
   general feature extraction/matching advances and domain-specific
   (forest/vegetation) 3D reconstruction methods must be included.

3. The method description is insufficient for reproducibility. Critical
   parameters and thresholds are not reported:
   - The specific changes made to SIFT to create "I-SIFT" are not
     clearly specified. What descriptor changes were made? What difference
     does "enhanced key point extraction" make algorithmically?
   - The adaptive threshold for Harris corner detection is not defined.
     How is the adaptation performed?
   - The relaxation matching strategy parameters are not disclosed.
     What is the convergence criterion? What is the iteration count?
   - All control parameters appear to be using default values without
     justification for the specific choices.
   Additionally, the mathematical formulation is weak. The paper lacks
   a proper formal description of the objective functions and algorithms
   used. Equations are sparse and lack proper variable definitions.
 The authors must provide a complete algorithmic
   description with all parameters specified, ideally in pseudocode
   form, to enable reproducibility.

4. The experimental evaluation is severely limited:

   a) Dataset: Results are presented on only 3 forest scenes from what
      appears to be a private dataset. No public standard dataset (e.g.,
      TUT dataset, Multi-View Stereo datasets, or other vegetation/forest
      benchmarks) is used. This makes comparison with prior work impossible.

   b) Baselines: Only SIFT is used as a baseline. The paper should compare
      against: SURF, ORB, AKAZE, and especially learned features (SuperPoint,
      SuperGlue). Without these comparisons, the value of the proposed method
      cannot be assessed.

   c) Quantitative metrics: Only 3D reconstruction accuracy is reported.
      Standard metrics for feature matching should include: inlier ratio,
      matching precision/recall, repeatability score, and homography
      estimation accuracy (as done in standard benchmarks like HPatches).

   d) Ablation study: There is no ablation study to validate the contribution
      of each component (I-SIFT, Harris corners, relaxation matching).
      It is unclear which component provides the actual improvement.

   e) Statistical significance: No error bars, confidence intervals, or
      statistical tests are provided. Results are presented as single
      point estimates, which is insufficient for a rigorous evaluation.

   f) Runtime analysis: Only matching time is reported. Full pipeline
      runtime (feature detection + description + matching + reconstruction)
      should be compared against baselines.

   The authors must expand experiments to (i) use public
   benchmark datasets, (ii) compare with state-of-the-art methods including
   learned approaches, (iii) report standard matching metrics beyond
   reconstruction accuracy, and (iv) conduct a proper ablation study.

5. WRITING AND PRESENTATION QUALITY
   a) The term "forest scenes" is mentioned throughout, but the paper
      does not clearly articulate what specific characteristics of forest
      scenes make standard SIFT fail. Repetitive textures and illumination
      variation are generic challenges, not forest-specific ones.

   b) Figure quality is substandard. Fig. 6 and Fig. 7 are difficult to
      read at the provided resolution. Point cloud figures lack scale bars
      and orientation indicators.

   c) The paper uses passive voice excessively and contains grammatical
      errors throughout (e.g., "The results shows" should be "The results
      show"; "algorithm is improved" is vague).

   d) Section headings are inconsistent in style and do not follow standard
      IMRAD format.

   e) Figure captions lack sufficient detail. For example, "Fig. 5. Matching
      results comparison" provides no information about what is being compared,
      the number of matches, or the evaluation criteria.

Author Response

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions (in clean version, as well as in track changes) in the re-submitted files.

Comments 1: The claimed contributions are not sufficiently novel. The paper essentially combines three known techniques: (i) SIFT with parameter adjustments, (ii) Harris corner detection, and (iii) relaxation matching. These individual components have been previously published in the literature, and the paper does not demonstrate a synergistic combination that produces qualitatively new capabilities. The term "I-SIFT" merely suggests parameter tuning or minor modifications to the standard SIFT descriptor, which does not constitute a novel method. The field has moved substantially beyond hand-crafted features; methods such as SuperPoint, SuperGlue, and learned feature matching achieve dramatically better performance on challenging scenes. The paper does not acknowledge or compare against any deep-learning-based approaches, which represents a critical gap. The authors must clearly articulate what specific algorithmic innovation distinguishes their work from prior arts, and why this innovation specifically benefits forest scene reconstruction.

Response 1: Thank you for your feedback, but our manuscript does not contain content regarding the use of SIFT or Harris as a baseline model and the adjustment of parameters. In fact, we use an improved deep-learning-based method (ALIKED+LightGlue) to conduct our experiments. We incorporate an image pyramid scheme to achieve multi-scale feature extraction and, in the feature matching stage, a neighborhood constraint matching strategy to select matching pairs for subsequent sparse reconstruction. Compared with the original methods, our approach shows better performance in forested scenes.

Comments 2: The related work section is superficial and fails to properly situate this work within the broader literature. Key missing references include:
   - Learned local features: SuperPoint (Detone et al., CVPR 2018), SuperGlue (Sarlin et al., CVPR 2020)
   - Point cloud generation in vegetation: NeRF-based approaches for vegetation, such as TRADNet: Temporal and Regional-Aware Diffusion Model for Point Cloud Generation, 2025
 Furthermore, the paper does not cite or compare with established forest-specific 3D reconstruction methods, which is surprising given the specific application claimed in the title.

 Recommendation: A comprehensive literature review covering both general feature extraction/matching advances and domain-specific (forest/vegetation) 3D reconstruction methods must be included.

Response 2: Thank you for your recommendation and valuable comments. Regarding the comparison of methods, we have already cited and discussed some state-of-the-art deep learning algorithms such as SuperPoint and SuperGlue in the Related Work section (References No. 24 and No. 27). We also reviewed point cloud generation in vegetation using approaches including NeRF-based methods, such as those shown in References No. 18, 19 and 38, 39. Our literature review covering both general feature extraction/matching advances and domain-specific (forest/vegetation) 3D reconstruction methods is in the 2nd and 3rd paragraphs of the Introduction. The TRADNet paper you mentioned is a diffusion model for point cloud generation of manmade objects such as chairs, airplanes and cars, which is not directly relevant to this study.

Comments 3: The method description is insufficient for reproducibility. Critical parameters and thresholds are not reported:
   - The specific changes made to SIFT to create "I-SIFT" are not clearly specified. What descriptor changes were made? What difference does "enhanced key point extraction" make algorithmically?
   - The adaptive threshold for Harris corner detection is not defined. How is the adaptation performed?
   - The relaxation matching strategy parameters are not disclosed. What is the convergence criterion? What is the iteration count?
   - All control parameters appear to be using default values without justification for the specific choices.
   Additionally, the mathematical formulation is weak. The paper lacks a proper formal description of the objective functions and algorithms used. Equations are sparse and lack proper variable definitions.
 The authors must provide a complete algorithmic description with all parameters specified, ideally in pseudocode form, to enable reproducibility.

Response 3: We thank the reviewer for reminding us of the importance of reproducibility. The specific data processing procedures (flowchart), relevant parameter settings, and threshold selection have been thoroughly explained in Section 2.2.3 (Data Processing) of the paper, to ensure the replicability of this study. Again, we would like to clarify that this paper does not propose or create the concept of “I-SIFT” or the adaptive Harris corner detector. The key contributions are described in Section 2.2.1, Sparse Reconstruction Algorithm for Forest Stand Images. Thanks again for your attention.

Comments 4: The experimental evaluation is severely limited:

   a) Dataset: Results are presented on only 3 forest scenes from what appears to be a private dataset. No public standard dataset (e.g., TUT dataset, Multi-View Stereo datasets, or other vegetation/forest benchmarks) is used. This makes comparison with prior work impossible.

   b) Baselines: Only SIFT is used as a baseline. The paper should compare against: SURF, ORB, AKAZE, and especially learned features (SuperPoint, SuperGlue). Without these comparisons, the value of the proposed method cannot be assessed.

   c) Quantitative metrics: Only 3D reconstruction accuracy is reported. Standard metrics for feature matching should include: inlier ratio, matching precision/recall, repeatability score, and homography estimation accuracy (as done in standard benchmarks like HPatches).

   d) Ablation study: There is no ablation study to validate the contribution of each component (I-SIFT, Harris corners, relaxation matching). It is unclear which component provides the actual improvement.

   e) Statistical significance: No error bars, confidence intervals, or statistical tests are provided. Results are presented as single point estimates, which is insufficient for a rigorous evaluation.

   f) Runtime analysis: Only matching time is reported. Full pipeline runtime (feature detection + description + matching + reconstruction) should be compared against baselines.

   The authors must expand experiments to (i) use public benchmark datasets, (ii) compare with state-of-the-art methods including learned approaches, (iii) report standard matching metrics beyond reconstruction accuracy, and (iv) conduct a proper ablation study.

Response 4: Thank you for your valuable feedback.

Regarding point a), our sample site 1 (Plot 1) uses data collected in-house. The data for sample site 2 (Plot 2) were obtained from an external source, a temperate forest benchmark (Reference No. 41: Benchmarking laser scanning and terrestrial photogrammetry to extract forest inventory parameters in a complex temperate forest). The TUT dataset and the Multi-View Stereo datasets you suggested do not contain as many forest scenes as our study requires.

For point b), SIFT is a baseline model widely used in computer vision studies. The baseline model used in this study is ALIKED+LightGlue. It would be valuable to compare comprehensively against other learned models; we thank the reviewer for this insightful suggestion and will incorporate it into our future work.

Regarding point c), we have reported not only the precision of the 3D reconstruction but also the inlier ratio, matching accuracy, repeatability scores, and estimation precision of perspective invariance. Due to time constraints, we were unable to conduct a comprehensive comparison. We accept the reviewer’s feedback and will supplement such comparative experiments in the future.

For point d), we conducted ablation experiments, which are described in Section 3.3, Ablation Study and Analysis.

Regarding point e), we report results on feature count, average number of matched point pairs, average number of inliers, and metrics including average track length and mean reprojection error. These data are computed from hundreds of images. The reported metrics are deterministic outputs of a fixed pipeline on the given dataset. We therefore do not report confidence intervals or error bars.

Regarding point f), the improvements found in this study are most significant in feature extraction and matching, so we compared the efficiency of this stage and added the runtime results to Tables 7 and 8. Our research aims to improve the efficiency of the matching phase; therefore, we record only the time required for matching. The time spent on deep learning feature extraction and description is a relatively small portion compared to the time required for feature matching. The difference in runtime after enhancing the extracted features with image pyramids is minimal.

In summary, our study did follow what the reviewer had suggested: (i) used public benchmark dataset (from Reference No. 41), (ii) compared with state-of-the-art methods including learned approaches (ALIKED+LightGlue), (iii) reported standard matching metrics beyond reconstruction accuracy, and (iv) conducted a proper ablation study.

Comments 5: WRITING AND PRESENTATION QUALITY
   a) The term "forest scenes" is mentioned throughout, but the paper does not clearly articulate what specific characteristics of forest scenes make standard SIFT fail. Repetitive textures and illumination variation are generic challenges, not forest-specific ones.

   b) Figure quality is substandard. Fig. 6 and Fig. 7 are difficult to read at the provided resolution. Point cloud figures lack scale bars and orientation indicators.

   c) The paper uses passive voice excessively and contains grammatical errors throughout (e.g., "The results shows" should be "The results show"; "algorithm is improved" is vague).

   d) Section headings are inconsistent in style and do not follow standard IMRAD format.

   e) Figure captions lack sufficient detail. For example, "Fig. 5. Matching results comparison" provides no information about what is being compared, the number of matches, or the evaluation criteria.

Response 5: Thank you for your valuable feedback. In response to the reviewer’s suggestions regarding the quality of the writing, we have made extensive revisions to the entire document.

Firstly, we clarified in the Introduction the unique challenges posed by forest scenes, which are characterized by repetitive textures, dense canopies, and branch-leaf occlusion. Image matching algorithms are prone to erroneously associating regions that are spatially distinct but visually similar. In addition, frequent fine structures (branches and leaves) cause instability in scale and space, and non-rigid deformations and complex occlusions affect feature tracking. Repetitive textures and illumination variation are generic challenges, but they are much more severe in forest scenes. In our experience, many algorithms (traditional and learning-based alike) fail to process this kind of imagery properly.

Secondly, all figures (particularly Figures 6 and 7) have been replaced with high-resolution versions at 600 DPI. Additionally, we have refined the language throughout the document, significantly reducing passive voice and correcting grammatical errors (such as subject-verb agreement issues), thereby making the expression more direct and accurate. Finally, we have strictly adhered to the IMRAD format guidelines for section headings and expanded the caption information for figures (such as the number of matched pairs, evaluation metrics, etc.) to ensure that all figures are self-explanatory without reference to the main text. A scale bar and other relevant missing information have also been added to the figures.

Response to Comments on the Quality of English Language

Point 1: The English needs to be polished.

Response 1: We appreciate your constructive comment. We agree that the language in the previous version needed improvement. To address this, we have thoroughly revised the manuscript. We believe the revised manuscript is now much clearer and easier to follow.

Reviewer 3 Report

Comments and Suggestions for Authors

This paper proposes an image-based sparse reconstruction pipeline tailored for forest scenes that integrates ALIKED for feature extraction with a multi-scale image pyramid (MS-ALIKED), LightGlue for feature matching, and a neighborhood-constrained pairing strategy augmented with NetVLAD retrieval to capture non-sequential loop closures. Evaluations on two plots (an urban palm stand and a complex mixed forest) show increased feature counts and coverage, higher matching precision, improved registration rates (100% on the complex plot), and denser, more stable sparse point clouds compared to SIFT and an ALIKED+LightGlue baseline, with modest reprojection errors.

This paper addresses an important and practically challenging problem: reliable image-based 3D reconstruction in forest environments where traditional SfM front-ends fail due to occlusion and repetitive textures. The integration of MS-ALIKED, LightGlue, and pairing constraints with NetVLAD retrieval is well-engineered and produces clear improvements in registration completeness, track length, and point cloud density on two real datasets. However, the methodological novelty is modest, several strong modern baselines are missing, and core claims about accuracy and adaptability are not substantiated with absolute geometric validation or efficiency measurements. Below are my comments:

How are inliers determined for AP (model type, estimator, thresholds)? Are you using MAGSAC++/LO-RANSAC in COLMAP, and are the settings consistent across methods?

Were temporally adjacent frames in addition to NetVLAD top K considered in the matching step for Plot 2? Could you clarify the pairing rule and the value of K selected?

What is the wall clock time and memory footprint of the feature extraction step, matching step, and SfM step for SIFT, ALIKED+LightGlue, and your approach on both datasets? How do the constraints of the neighborhood affect the speedup?

What was the configuration and training of the NetVLAD model like? Did you observe false positives in areas of the forest that were visually similar but spatially disjoint?

How does the de-duplication of cross-scale keypoints in MS-ALIKED work beyond the "per pixel" approach? Would the use of radius-based NMS affect the count vs. reprojection error trade-off?

How sensitive is your approach to the number of pyramid levels and the linear scaling schedule?

Could you evaluate your approach on additional deep-learning-based approaches like SuperPoint/SuperGlue, DISK+LightGlue, R2D2, and even a detector-free approach like LoFTR or DKM to put your approach in the best possible light in the presence of very competitive approaches in natural scenes?

Do you have access to any TLS/ALS or surveyed checkpoints for either plot to quantify absolute geometric accuracy (e.g., trunk axis locations, canopy envelope)? If not, can you provide MVS evaluations (e.g., completeness/precision) to back up your claim for "high-fidelity" results?

How does your method deal with dynamic foliage (wind effects) and motion blur resulting from handheld imagery? Any robustness studies?

Were your camera intrinsics calibrated or computed? Could you please specify your results for principal point coordinates, distortion model, and rolling shutter?

"Low reconstruction efficiency" is claimed to be solved in your paper. However, mathematically adding image pyramids and deep neural networks increases computational complexity.

Your paper's claim for providing a "cost-effective and flexible substitute for LiDAR" cannot be verified quantitatively since no independent ground truth data (e.g., TLS/LiDAR point clouds) is used to assess absolute 3D dimensional accuracy (e.g., errors in DBH or height).

The proposed "neighborhood constraint strategy" is based on arbitrary thresholds (a linear window size of 5 for Plot 1, top 10 for Plot 2). However, there is no scientific justification for these parameters.

The conclusion section only contains qualitative

The discussion section only analyzes the internal ablation study quantitatively. There is no quantitative comparison to contemporary literature.

How do you logically go from your method for reconstructing individual trees to NetVLAD for global image retrieval in unstructured plots?

Comments on the Quality of English Language

The English needs to be polished.

Author Response

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions (in clean version, as well as in track changes) in the re-submitted files.

Comments 1: How are inliers determined for AP (model type, estimator, thresholds)? Are you using MAGSAC++/LO-RANSAC in COLMAP, and are the settings consistent across methods?

Response 1: Thank you for pointing out this key concept. We have revised the manuscript to explicitly state that the fundamental matrix estimation is performed using COLMAP’s built-in LO-RANSAC (Locally Optimized RANSAC) algorithm, with an inlier threshold of 1 pixel. The inliers identified by this procedure are used to compute the Average Precision (AP) metric reported in our experiments. Relevant information has been added to page 8, lines 4-8 of the article (clean version).
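As an illustration only, the inlier test described above can be sketched as follows. This is not COLMAP's code: the choice of the Sampson distance as the error measure and the function names `sampson_distance_sq` and `inlier_mask` are assumptions made for this sketch; only the 1-pixel threshold comes from the response. How AP is then derived from the inliers is not specified here, so the sketch stops at the inlier mask.

```python
import numpy as np

def sampson_distance_sq(F, x1, x2):
    """Squared Sampson distance for correspondences x1 <-> x2 (N x 2 arrays, in pixels)."""
    ones = np.ones((x1.shape[0], 1))
    p1 = np.hstack([x1, ones])            # homogeneous points in image 1
    p2 = np.hstack([x2, ones])            # homogeneous points in image 2
    Fp1 = p1 @ F.T                        # row i is F @ p1[i]: epipolar line in image 2
    Ftp2 = p2 @ F                         # row i is F.T @ p2[i]: epipolar line in image 1
    num = np.sum(p2 * Fp1, axis=1) ** 2   # (p2^T F p1)^2
    den = Fp1[:, 0] ** 2 + Fp1[:, 1] ** 2 + Ftp2[:, 0] ** 2 + Ftp2[:, 1] ** 2
    return num / den

def inlier_mask(F, x1, x2, threshold_px=1.0):
    """A correspondence counts as an inlier if its Sampson distance is within the threshold."""
    return sampson_distance_sq(F, x1, x2) <= threshold_px ** 2

# Hypothetical example: F for a pure horizontal translation (epipolar lines are image rows).
F = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]])
x1 = np.array([[10.0, 20.0], [10.0, 20.0], [0.0, 0.0]])
x2 = np.array([[15.0, 20.0], [15.0, 25.0], [3.0, 1.0]])
mask = inlier_mask(F, x1, x2)
print(mask, mask.mean())
```

Applying the same threshold and the same error measure to every method is what keeps the comparison consistent across feature extractors.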

Comments 2: Were temporally adjacent frames in addition to NetVLAD top K considered in the matching step for Plot 2? Could you clarify the pairing rule and the value of K selected?

Response 2: In the matching step for Plot 2, we did not consider temporally adjacent frames. The specific image pairing strategy and the chosen K value are detailed in Section 2.2.3 (“Data Processing”) of the manuscript; namely “…set the similarity neighborhood window to 10 for Plot 2 (matching each image with its top 10 most similar images)”.

Comments 3: What is the wall clock time and memory footprint of the feature extraction step, matching step, and SfM step for SIFT, ALIKED+LightGlue, and your approach on both datasets? How do the constraints of the neighborhood affect the speedup?

Response 3: Thank you for the comment. In fact, feature extraction accounts for only a small fraction of the total time in the sparse 3D reconstruction pipeline. Specifically, ALIKED feature extraction takes approximately 2 minutes on both datasets; with the addition of an image pyramid, this increases to about 3 minutes. By contrast, the feature matching stage consumes the majority of both time and memory resources.

The core optimization of this paper focuses on improving the efficiency of the feature matching step. To this end, we have added the total matching time and the number of image pairs actually matched for each method in Tables 7 and 8, Section 3.2 (Experimental Results and Analysis of Sparse Point Cloud Reconstruction), enabling a clear observation of the efficiency improvements.

Regarding memory consumption: due to the use of deep learning models, GPU memory demand is relatively high. All experiments were conducted on an NVIDIA RTX 4090 GPU, with peak VRAM usage reaching approximately 25 GB (about 80% of the card’s total memory). Our proposed method effectively alleviates the computational burden during matching by reducing redundant image pairs, thereby decreasing the time required for the matching stage. Relevant information has been added to page 18, Lines 1-2.

Comments 4: What was the configuration and training of the NetVLAD model like? Did you observe false positives in areas of the forest that were visually similar but spatially disjoint?

Response 4: We employed a publicly available, pre-trained NetVLAD model rather than a self-trained one. The model uses a VGG-16 backbone and was trained on the large-scale place recognition dataset Pittsburgh 250k. In our experiments, we use these pre-trained weights to extract a 4096-dimensional global descriptor for each input image and perform similarity retrieval based on L2 distance. In visually similar but spatially disconnected regions—such as forests with repetitive textures—NetVLAD can indeed suffer from appearance ambiguity, potentially leading to false positives. To mitigate this issue, we restrict the matching process to only the top-10 most similar images, thereby limiting the number of spurious matches at the retrieval stage. Relevant information has been added to Page 6, lines 6-10 of the article.
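The retrieval-based pairing rule can be sketched as below. This is a minimal illustration, assuming the global descriptors have already been extracted; the function name `top_k_pairs` and the toy descriptors are hypothetical, and only the L2 distance and K = 10 come from the response.

```python
import numpy as np

def top_k_pairs(descriptors, k=10):
    """Pair each image with its k nearest images by L2 distance between global descriptors."""
    d = np.asarray(descriptors, dtype=np.float64)
    sq = np.sum(d ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (d @ d.T)   # squared L2 distances
    np.fill_diagonal(dist2, np.inf)                       # never pair an image with itself
    k = min(k, len(d) - 1)
    nearest = np.argsort(dist2, axis=1)[:, :k]
    # Store each pair once, as (smaller index, larger index).
    pairs = {(min(i, int(j)), max(i, int(j))) for i in range(len(d)) for j in nearest[i]}
    return sorted(pairs)

# Toy 1-D "descriptors" standing in for the 4096-dimensional NetVLAD vectors.
print(top_k_pairs([[0.0], [1.0], [2.5], [10.0]], k=1))
```

Because each image contributes at most K pairs, the number of pairs grows linearly with the number of images rather than quadratically; a false positive from retrieval still costs one matching attempt, but it can be rejected later by geometric verification.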

Comments 5: How does the de-duplication of cross-scale keypoints in MS-ALIKED work beyond the "per pixel" approach? Would the use of radius-based NMS affect the count vs. reprojection error trade-off?

Response 5: Thank you for raising this technical point about our feature processing pipeline.

Regarding the de-duplication of cross-scale keypoints in MS-ALIKED, our approach moves beyond a simple "per-pixel" deduplication. We implement a radius-based filtering strategy where, within a specific neighborhood radius, we retain only the keypoint with the highest confidence (response score). This mechanism acts as a form of non-maximum suppression, effectively preventing the spatial clustering of redundant features.

We acknowledge that this method reduces the overall count of extracted keypoints. However, this reduction is intentional and beneficial. The suppressed points are typically low-confidence detections (often found near strong edges or as "satellite" points around strong features) which are prone to localization errors and serve as primary sources of mismatching.

By removing these unstable candidates, we improve the quality of the remaining feature set. Our experiments confirm that while the quantity decreases, the trade-off results in a reduction in average reprojection error. This leads to more robust and accurate pose estimation, demonstrating a favorable balance between feature sparsity and matching precision.
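A minimal sketch of such a radius-based suppression across pyramid levels is given below. It is illustrative rather than the manuscript's implementation: the function name `fuse_keypoints`, the input layout, and the strict "greater than radius" comparison are assumptions, and the quadratic loop is written for clarity, not speed.

```python
import numpy as np

def fuse_keypoints(levels, radius=1.0):
    """Fuse keypoints from several pyramid levels.

    levels: list of (xy, scores, scale_to_original), where xy is an (N, 2) array
    in level coordinates and scale_to_original maps it to the full-resolution image.
    Returns the kept coordinates, their scores, and their indices in the stacked input.
    """
    xy = np.vstack([np.asarray(p, dtype=float) * s for p, _, s in levels])
    scores = np.concatenate([np.asarray(sc, dtype=float) for _, sc, _ in levels])
    order = np.argsort(-scores, kind="stable")   # highest confidence first
    kept = []
    for idx in order:
        # Keep a point only if no stronger, already-kept point lies within the radius.
        if all(np.hypot(*(xy[idx] - xy[j])) > radius for j in kept):
            kept.append(int(idx))
    kept = np.array(kept, dtype=int)
    return xy[kept], scores[kept], kept

# Hypothetical detections: level 0 at full resolution, level 1 at half resolution.
levels = [
    (np.array([[10.0, 10.0], [50.0, 50.0]]), [0.9, 0.5], 1.0),
    (np.array([[5.2, 5.0], [30.0, 30.0]]), [0.8, 0.7], 2.0),
]
xy, scores, kept = fuse_keypoints(levels)
print(xy, scores)
```

The returned indices can be used to select the matching descriptor rows, which would then be re-normalized before matching.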

Comments 6: How sensitive is your approach to the number of pyramid levels and the linear scaling schedule?

Response 6: We thank the reviewer for raising the question regarding the sensitivity of the pyramid levels and scaling schedule. Our method demonstrates good robustness in both aspects. Within a reasonable range (e.g., 3-5 levels), increasing the number of pyramid levels improves the coverage of multi-scale features. However, excessive levels lead to overly blurred deep-layer images, introducing noise and drastically increasing computational costs with diminishing returns. Regarding the linear scaling factor, as long as sufficient information overlap is maintained between adjacent layers, the matching stability is not significantly affected. We selected the current configuration as it achieves an optimal balance between reconstruction completeness and computational efficiency.

Comments 7: Could you evaluate your approach on additional deep-learning-based approaches like SuperPoint/SuperGlue, DISK+LightGlue, R2D2, and even a detector-free approach like LoFTR or DKM to put your approach in the best possible light in the presence of very competitive approaches in natural scenes?

Response 7: We sincerely appreciate the reviewer's suggestion to evaluate our approach against advanced deep learning-based methods such as SuperPoint/SuperGlue, DISK+LightGlue, and LoFTR. Our primary objective in this study is to explore superior approaches that can effectively handle 3D modeling of forest imagery. However, due to current time and resource constraints, we are unable to exhaustively compare all the mentioned competitive methods in this specific stage. We fully accept your feedback and plan to supplement our work with these comparative experiments in our future research to better position our approach. The shortcomings were also highlighted during the discussion, specifically in lines 30 to 40 on page 19.

Comments 8: Do you have access to any TLS/ALS or surveyed checkpoints for either plot to quantify absolute geometric accuracy (e.g., trunk axis locations, canopy envelope)? If not, can you provide MVS evaluations (e.g., completeness/precision) to back up your claim for "high-fidelity" results?

Response 8: We do have access to a TLS reference point cloud. Our initial intention was to focus on sparse reconstruction, so we did not consider it necessary to include LiDAR point clouds for validation. Following your suggestion, we now include the LiDAR point cloud for a visual assessment of the dense reconstruction results. Section 3.4, on the visual comparison of dense 3D reconstruction, has been added, in which the results of both methods (the baseline model and our improved model) are compared and analyzed. This new section is on pages 18 and 19. The description of the reference data collection has also been expanded on pages 6 and 7.

Comments 9: How does your method deal with dynamic foliage (wind effects) and motion blur resulting from handheld imagery? Any robustness studies?

Response 9: We acknowledge that wind-induced foliage movement violates the static scene assumption of SfM. However, the image pyramid is specifically designed to mitigate such errors. Through Gaussian smoothing, high-frequency, unstable leaf features are suppressed, forcing the algorithm to focus on feature points from stable, rigid structures like trunks and branches, thereby ensuring stability. Regarding motion blur from handheld capture, the multi-scale pyramid strategy offers inherent robustness: even if features are diminished at one scale due to blur, other levels can still capture valid information. Combined with the high overlap of the image sequence, adjacent sharp frames can provide sufficient geometric constraints to compensate for any blurred images. The relevant description has been added to page 5, lines 27 to 30.
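The smoothing-then-downsampling idea can be sketched as follows. This is only an illustration: the scale factors `(1.0, 0.75, 0.5, 0.25)`, the rule `sigma = 0.5 / scale`, the nearest-pixel resampling, and the function names are all assumptions made for the sketch, not values from the manuscript.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D normalized Gaussian kernel truncated at about three sigma."""
    radius = max(1, int(round(3 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur(img, sigma):
    """Separable Gaussian blur with edge padding; the output has the input's shape."""
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    padded = np.pad(img, pad, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def build_pyramid(img, scales=(1.0, 0.75, 0.5, 0.25)):
    """Return a list of (level_image, scale_to_original) for a grayscale image."""
    img = np.asarray(img, dtype=float)
    levels = []
    for s in scales:
        if s >= 1.0:
            levels.append((img.copy(), 1.0))
            continue
        smoothed = blur(img, sigma=0.5 / s)   # stronger smoothing for coarser levels
        h = max(1, int(round(img.shape[0] * s)))
        w = max(1, int(round(img.shape[1] * s)))
        ys = np.minimum((np.arange(h) / s).astype(int), img.shape[0] - 1)
        xs = np.minimum((np.arange(w) / s).astype(int), img.shape[1] - 1)
        levels.append((smoothed[np.ix_(ys, xs)], 1.0 / s))
    return levels

pyramid = build_pyramid(np.random.default_rng(0).random((40, 80)))
print([(lvl.shape, round(f, 3)) for lvl, f in pyramid])
```

The second element of each tuple is the factor by which keypoint coordinates detected at that level would be multiplied to return to the original resolution.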

Comments 10: Were your camera intrinsics calibrated or computed? Could you please specify your results for principal point coordinates, distortion model, and rolling shutter?

Response 10: The camera intrinsics (focal length and principal point coordinates) in this study were computed simultaneously during the reconstruction process via the SfM pipeline, rather than relying on pre-calibrated parameters. For the distortion model, we adopted the Brown-Conrady model incorporating radial (k1, k2) and tangential (p1, p2) components, which were jointly optimized in bundle adjustment; the resulting coefficients fell within reasonable ranges, effectively correcting lens distortion. Given the slow handheld motion and high frame rate, the rolling shutter effect was negligible in our experiments. Consequently, we employed a global shutter model to balance computational efficiency and accuracy, and the results demonstrate that this simplified model is sufficient for the reconstruction requirements of the current scenario.
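For reference, the radial-tangential (Brown-Conrady) distortion with coefficients k1, k2, p1, p2 can be written as a short function. This sketch follows the widely used formulation on normalized image coordinates; the function name `distort` and the example values are hypothetical, and the coefficients estimated in the study are not reproduced here.

```python
import numpy as np

def distort(xy, k1, k2, p1, p2):
    """Apply Brown-Conrady distortion to normalized image coordinates (N x 2)."""
    xy = np.asarray(xy, dtype=float)
    x, y = xy[:, 0], xy[:, 1]
    r2 = x ** 2 + y ** 2
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2                        # radial term
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x ** 2)  # plus tangential term
    y_d = y * radial + p1 * (r2 + 2.0 * y ** 2) + 2.0 * p2 * x * y
    return np.stack([x_d, y_d], axis=1)

# Hypothetical coefficients, chosen only to show the effect of each term.
print(distort([[1.0, 0.0]], k1=0.1, k2=0.0, p1=0.0, p2=0.0))   # radial only
print(distort([[1.0, 1.0]], k1=0.0, k2=0.0, p1=0.01, p2=0.0))  # tangential only
```

Pixel coordinates then follow from the focal length and principal point, for example u = fx * x_d + cx and v = fy * y_d + cy.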

Comments 11: "Low reconstruction efficiency" is claimed to be solved in your paper. However, mathematically adding image pyramids and deep neural networks increases computational complexity.

Response 11: Thank you for bringing this issue to our attention. We agree that the use of layered image pyramids does increase computational complexity, thereby lengthening the time required for individual matches. However, our research focuses on using constrained matching to reduce the number of image pairs to be matched, thereby decreasing the overall time required for the matching stage and addressing the issue of low reconstruction efficiency. The runtime data are now included in Tables 7 and 8 in Section 3.2.

Comments 12: Your paper's claim for providing a "cost-effective and flexible substitute for LiDAR" cannot be verified quantitatively since no independent ground truth data (e.g., TLS/LiDAR point clouds) is used to assess absolute 3D dimensional accuracy (e.g., errors in DBH or height).

Response 12: We sincerely thank the reviewer for this valid critique. We agree that our previous phrasing regarding a "substitute for LiDAR" was indeed misleading. The core contribution of our work lies in the optimization of the SfM pipeline itself, rather than positioning it as a direct hardware replacement for LiDAR. We have revised the manuscript to clarify this distinction and have removed the ambiguous statements. The second point under “What are the implications of the main findings?” in the Highlights section on the first page was replaced with: “It is now possible to achieve large-scale, high-fidelity 3D reconstructions of complex natural environments such as forests using widely available and low-cost mobile video devices.” Relevant information was also removed from the abstract.

Comments 13: The proposed "neighborhood constraint strategy" is based on arbitrary thresholds (a linear window size of 5 for Plot 1, top 10 for Plot 2). However, there is no scientific justification for these parameters.

Response 13: Thank you very much for your insightful comment. We agree that it is crucial to provide clear explanations regarding the selection of parameters. These parameters were not set arbitrarily; rather, they were determined through empirical analysis based on the specific image shooting conditions in each experimental area. We conducted a systematic range test of the size of the constraint window and analyzed the trade-off between computational efficiency and the precision of sparse point clouds. We ultimately selected parameters that were relatively suitable. We have revised the wording in the article to reflect the work we have done: “Our previous experiments demonstrated that a balance between efficiency and accuracy was achieved by setting the linear neighborhood window to 5 for Plot 1 (matching each image with its five preceding and five succeeding neighbors) and the similarity neighborhood window to 10 for Plot 2 (matching each image with its top 10 most similar images)”.
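To make the effect of the linear window concrete, the sketch below enumerates the pairs produced by a window of 5 and compares the count with exhaustive pairing. The sequence length of 100 images is a hypothetical value chosen for the example, and the function names are illustrative.

```python
def sequential_pairs(n_images, window=5):
    """Pair image i with its next `window` images; this also covers the preceding ones."""
    return [
        (i, j)
        for i in range(n_images)
        for j in range(i + 1, min(i + window, n_images - 1) + 1)
    ]

def exhaustive_pair_count(n_images):
    """Number of unordered pairs when every image is matched against every other."""
    return n_images * (n_images - 1) // 2

n = 100  # hypothetical sequence length
print(len(sequential_pairs(n, window=5)), exhaustive_pair_count(n))
```

Each unordered pair is listed once, so pairing an image with its five successors is equivalent to matching it with its five preceding and five succeeding neighbors. The windowed count grows linearly with the number of images, whereas the exhaustive count grows quadratically, which is where the reduction in matching time comes from.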

Comments 14: The conclusion section only contains qualitative

Response 14: Thank you very much for your insightful comment that our conclusion was primarily qualitative. To address this, we have revised the Conclusion section to include a more comprehensive quantitative summary. It is now changed to:

“The experimental results show that:

1. Compared with traditional handcrafted feature extraction methods, deep learning–based approaches significantly increase both the number and spatial coverage of extracted features. When combined with the image pyramid, the proposed method further boosts the number of detected keypoints by approximately 1.5 times, and improves feature coverage by 3% to 10%.

2. Compared with traditional handcrafted feature matching methods, deep learning–based matching achieves more accurate results. Integrating our proposed approach further enhances matching accuracy by 4.1% to 8.4%.

3. In terms of camera pose recovery, our method yields more accurate poses. Under consistent reprojection error thresholds, the average track length increases by 0.6 to 1.3, effectively mitigating camera pose drift.

4. By introducing neighborhood constraints, our method effectively reduces mismatches caused by cross-image feature matching, eliminating a large portion of point cloud points with reprojection errors greater than 2 pixels. This ensures geometric consistency among matched features and thereby improves the accuracy of point cloud initialization.”

 

Comments 15: The discussion section only analyzes the internal ablation study quantitatively. There is no quantitative comparison to contemporary literature.

Response 15: Thank you very much for sharing this insightful perspective. We acknowledge that, due to the primary focus of our initial study on the ablation experiments and internal validation of the framework we proposed, we were unable to conduct direct comparative experiments with all contemporary methods. However, we found that due to significant differences in experimental setups (such as data sets, scene scales) and evaluation metrics among existing studies, direct quantitative comparisons were not possible. Most existing methods address different objectives or use metrics that are not directly compatible with our specific process.

Comments 16: How do you logically go from your method for reconstructing individual trees to NetVLAD for global image retrieval in unstructured plots?

Response 16: Thank you for your interest in the logical flow of our methodology. We have noted your mention of “reconstruction of individual trees” and would like to clarify that our work does not actually involve the detailed three-dimensional reconstruction of individual trees. Our primary objective is to extract discriminative global features directly from images to achieve efficient location recognition and image retrieval. Therefore, we utilize NetVLAD to encode features for the entire scene image, generating compact global descriptors, thereby bypassing the time-consuming process of local instance segmentation and reconstruction. We have further clarified this workflow in the relevant sections of the paper to avoid any potential misunderstandings.

Response to Comments on the Quality of English Language

Point 1: The English needs to be polished.

Response 1: We appreciate your constructive comment. We agree that the language in the previous version needed improvement. To address this, we have thoroughly revised the manuscript. We believe the revised manuscript is now much clearer and easier to follow.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you for the author's response. I have no further questions.

Author Response

Comment 1:

 Thank you for the author's response. I have no further questions.

Response 1: Thank you very much for taking the time to review this manuscript.

Reviewer 3 Report

Comments and Suggestions for Authors

I am fine with the modifications from the authors. However, I still notice some minor mistakes that I would like to draw to the authors' attention:

Page 1, Abstract
"LiDAR" -> "Light Detection and Ranging (LiDAR)"

Page 1, Abstract
"ALIKED" -> "ALIKED [First-use definition]"

Page 7, Section 2.2.2
"GoPro action camera" -> "GoPro action camera [Manufacturer details]"

Comments on the Quality of English Language

Page 25
"a more thoroughly, quantitative evaluation" -> "a more thorough, quantitative evaluation"

Author Response

Thank you very much for taking the time to review this manuscript. 

Comment 1:

I am fine with the modifications from the authors. However, I still notice some minor mistakes that I would like to draw to the authors' attention:

Page 1, Abstract

"LiDAR" -> "Light Detection and Ranging (LiDAR)"

Page 1, Abstract

"ALIKED" -> "ALIKED [First-use definition]"

Page 7, Section 2.2.2

"GoPro action camera" -> "GoPro action camera [Manufacturer details]"

Response 1: Thank you very much for your positive feedback and for carefully pointing out the remaining minor issues. We have addressed your suggestions as follows: in the Abstract on Page 1, we have expanded "LiDAR" to "Light Detection and Ranging (LiDAR)" at first mention, and we have added a first-use definition for "ALIKED" (e.g., "ALIKED (A Lightweight Keypoint Detection and Description Network)"); on Page 7, Section 2.2.2, we have modified "GoPro action camera" to "GoPro action camera (GoPro Inc., San Mateo, CA, USA)".

Comment on the Quality of English Language:

Page 25

"a more thoroughly, quantitative evaluation" -> "a more thorough, quantitative evaluation"

Response: We have corrected the English language error from "a more thoroughly, quantitative evaluation" to "a more thorough, quantitative evaluation".

 

We greatly appreciate your constructive comments, which have helped improve the clarity and quality of our paper.
