Review Reports
- Yihao Sun1,
- Mingrui Wang1 and
- Xiaoyi Huang2
- et al.
Reviewer 1: Anonymous Reviewer 2: Anonymous Reviewer 3: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper reports a remote sensing semantic segmentation network based on a score map and transformer-based fusion. The experimental results show some improvement over several baselines. However, some issues regarding the description and the experiments need to be addressed.
1. The main contributions should be re-summarized to clarify why the novel approach improves the accuracy and speed of image segmentation. Insights into the improvements should be provided rather than merely pointing out the proposed approach. Additionally, Sections 2.1 and 2.3 contain repeated components that can be integrated for a more accurate discussion of the networks.
2. In the introduction, the motivation for the proposed approach should be clarified. The authors state that the side output from the global branch backbone has a relatively low feature map resolution, limiting its representative capacity. However, existing solutions have addressed this issue and the local-global structure is not new in remote sensing semantic segmentation. The differences between the proposed approach and existing networks, such as those in 10.1109/TGRS.2024.3373033, 10.1109/LGRS.2024.3414293, and 10.1109/TGRS.2023.3240982 should be clearly outlined.
3. Figure 1 is not informative enough. More text annotations or graphics would be preferred.
4. In the experiments, the ablation study is not well-presented. Using charts for parameter analysis would improve clarity. Additionally, more analysis is required to verify the effectiveness of the score maps, such as through visualization.
Comments on the Quality of English Language
N/A
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The paper titled "Fast Semantic Segmentation of Ultra-High-Resolution Remote Sensing Images via Score Map and Fast Transformer-based Fusion" proposes a deep learning framework that combines a convolutional neural network with a Transformer for semantic segmentation of ultra-high-resolution (UHR) remote sensing images. In this work, convolutional neural networks (CNNs) are used to extract local and global features, and a multi-head attention module is used for fast feature fusion, in order to address the challenge of balancing computational efficiency and storage space in UHR image semantic segmentation. Experiments on the public ISPRS dataset demonstrate the effectiveness of the proposed method. However, the novelty of the proposed method is limited. In addition, some of the content in the manuscript is vague and difficult to understand. I have the following comments.
1. The principles of the two novel modules are unclear. Please explain how the score map module evaluates the feature score map to select the optimal local features, and specify the inputs and outputs of the fast fusion module. In addition, the proposed fast feature fusion method is mainly a modification of the existing multi-head attention mechanism, so its novelty is limited.
2. As mentioned in Figure 3, the optimization objectives of the method include one main loss and two auxiliary losses. These losses are not described in the manuscript; please add the loss functions and their rationale to this section.
3. The EFFNet overview diagram (Fig. 1) and the fast fusion mechanism diagram (Fig. 2) in this manuscript are not clear about the meaning of each module, which is not conducive to understanding the role of each module. The legend needs to be more descriptive for clarity.
4. The ablation study shows that the number of patches has a certain effect on the experimental results, but the number of patches used to obtain the reported results is not stated in the experimental description of the manuscript.
5. There are some writing errors in the manuscript, such as no punctuation at the end of page 11.
Comments on the Quality of English Language
There are some writing errors in the manuscript.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The core of the paper is a method for semantic segmentation of ultra-high-resolution remote sensing images. The backbone of the procedure is based on a residual neural network (ResNet50) and a feature pyramid network (FPN). In addition, some algorithmic extensions are proposed, such as a score map and fast-transformer-based fusion. The contribution of the paper seems fair, as it contains extensive experiments and results that might be interesting from a practical standpoint. In general, the literal presentation of the paper is good, but there is still room for improvement in this regard. The evaluation of results and the discussion of related methods should be improved. In summary, I consider that the contents of the paper might be suitable for publication, but the following issues should be addressed in a revised version of the paper.
- There are some minor issues in the literal presentation of the paper. For instance, (i) abstract, change “proposes an Feature Fusion…” to “proposes a Feature Fusion…”. (ii) line 238, change “mitigated” to “mitigates”. (iii) line 241, “leverage the expressive capability”. Please explain in what concrete terms expressive capacity is defined. According to the last paragraph of Section 3.3, it seems to be related to allocating attention weights appropriately. (iv) There is some redundancy in the text, e.g., lines 105-107 and 181-103. (v) Please explain the difference between “semantic segmentation” and “segmentation”; both terms are used in the paper. (vi) Conclusions, line 435, I would change “Our work” to “This work”.
Therefore, an English proofreading of the paper is required.
- The proposed procedure and the methods used for comparison are computationally intensive. Some specific runtime measurements are provided in Table 3. A general analysis of the computational burden (using, e.g., big O notation) that compares the computational order of the implemented methods would provide a better comparison.
- The discussion of related methods should be extended. Several options are available for image segmentation in remote sensing that could be comparable to or competitive with the proposed one. One of them is probabilistic hierarchical clustering, which has been successfully applied to estimate the number of endmembers in hyperspectral images. I suggest the following reference: https://doi.org/10.3390/rs12213585.
- The statistical significance of the results should be estimated and discussed. In addition, a set of Monte Carlo experiments should be run in order to measure the variability of the results in terms of the mean and standard deviation of the figures of merit. Theoretically, fusion steps help to improve not only classification accuracy but also the stability of the results. Please discuss this.
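As an illustration of the variability analysis requested above, the following is a minimal Python sketch, assuming each method is trained several times with different random seeds and yields one figure of merit (for example, mIoU) per run. The function names `summarize_runs` and `paired_t_statistic` are hypothetical, and the numbers in the usage part are illustrative placeholders, not results from the manuscript.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of one metric over repeated runs."""
    if len(scores) < 2:
        raise ValueError("at least two runs are needed to estimate variability")
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two methods evaluated on the same seeds or splits.

    The result has len(scores_a) - 1 degrees of freedom; compare it against
    a Student t table to judge significance.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both methods need the same number of runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff, std_diff = summarize_runs(diffs)
    if std_diff == 0:
        raise ValueError("differences are constant; t statistic is undefined")
    return mean_diff / (std_diff / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Illustrative placeholder values only (assumption), one entry per seed.
    proposed = [0.71, 0.73, 0.72, 0.74, 0.72]
    baseline = [0.69, 0.70, 0.71, 0.70, 0.69]
    mean, std = summarize_runs(proposed)
    print(f"proposed: {mean:.3f} +/- {std:.3f}")
    print(f"paired t: {paired_t_statistic(proposed, baseline):.2f}")
```

Reporting the mean and standard deviation per method, together with a paired test over matched seeds, is one common way to address both the variability and the significance points; other tests (for example, a Wilcoxon signed-rank test) may be more appropriate when few runs are available.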
Comments on the Quality of English Language
Please see the "Comments and Suggestions for Authors" section.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
All the issues have been addressed, and this paper can be accepted.
Comments on the Quality of English Language
N/A
Reviewer 2 Report
Comments and Suggestions for Authors
No further questions.
Reviewer 3 Report
Comments and Suggestions for Authors
The quality of the paper has been improved significantly. All my concerns have been adequately addressed in the revised version of the paper, including the following: improvement of the literal presentation of the paper; extension of the discussion on related methods and estimation of the computational complexity; and addition of a study of statistical significance and variability of the results.