Article
Peer-Review Record

Efficient Depth Fusion Transformer for Aerial Image Semantic Segmentation

Remote Sens. 2022, 14(5), 1294; https://doi.org/10.3390/rs14051294
by Li Yan 1,2, Jianming Huang 1, Hong Xie 1,*, Pengcheng Wei 1 and Zhao Gao 2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 13 February 2022 / Accepted: 3 March 2022 / Published: 7 March 2022

Round 1

Reviewer 1 Report

-

Reviewer 2 Report

It can be accepted now.

Reviewer 3 Report

First of all, I would like to thank the Editorial Board of the journal Remote Sensing for the opportunity to participate in this review process.

The paper deals with a topic of great current interest: the use of artificial-intelligence techniques for the automatic interpretation of aerial images and the extraction of semantic information (segmentation). The topic is clearly aligned with the scope of the journal and should be of great interest to its readers.

It is not easy to review works of this nature. Nevertheless, this work presents a new semantic segmentation method, and the tests reported (carried out on standard, publicly available datasets so that the experiments can be reproduced) show results that improve on the compared algorithms in most cases. It is also worth noting that the authors make their code available to readers, which is undoubtedly an excellent initiative.

As for the formal aspects, the work is well structured, of adequate length and content, carefully written, with well-prepared figures and diagrams and a complete, up-to-date list of bibliographical references (which is essential in this field).

Finally, in my opinion the conclusions are well supported by the results. I therefore have no additional comments other than to highlight the interest of the work, and I consider it suitable for publication in Remote Sensing.

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

The submitted manuscript describes a Transformer-based network that is validated in a remote sensing application to semantically segment aerial images. Overall, the manuscript could be considered sufficient in this context; nonetheless, major issues should be corrected to improve its quality and enhance its scientific impact.

  • Various technical improvements are needed, as many errors were spotted, including in the English language, e.g., the first sentence of Section 2 (there is a subordinate clause but no main clause). In addition, many entries of the reference list show that little effort was dedicated to them: e.g., in [40] the word “contributors” is given instead of the authors’ names, and for [16] and [20] the journal or conference is missing. Figures 1 and 2 are of low resolution and quality as representation schemes. In addition, all parameters used in the equations should be defined immediately after the equations.
  • The novelty is not highlighted and justified sufficiently. Referring to the three bullets on Page 2, the first bullet is not a “new discovery”, while the third is a claim rather than a contribution. In addition, in Section 3, every processing step should not only be presented but also justified, describing its added value. Moreover, the description of the method is confusing: how are SegFormer and EDFT connected to produce the final semantic map? Considering the results in Figure 5, either there is no connection between them or the analysis is misleading. Furthermore, an extensive analysis is provided for the concatenation process in Section 3.2.3, which is merely a way of representing the intermediate data with no added scientific value. No detail is provided about the depth images of the used datasets.
  • The related work is not focused on the application of such architectures to the specific problem. Most of the cited works should address the remote sensing aspects rather than merely present general semantic segmentation techniques. On top of that, the section should conclude with a description of how the proposed method covers the disadvantages of the other techniques; only then can the comparison be considered efficient and valid.
  • Concerning the results presented in Figures 6 and 7, these should be updated with images produced by the techniques in Tables 1 and 2 to allow a proper visual comparison. In addition, the method would be expected to be compared against the exact same methods on the different datasets; simply counting the rows of Table 2 shows that fewer techniques are compared than in Table 1, without any justification. Finally, the results provided in those tables do not show that the method achieves state-of-the-art performance.

Reviewer 2 Report

===== Synopsis:

This study explains how depth information can aid pixel classification in aerial images in a minimal manner, to save computation. Results show an improvement over the state of the art. It is one of the very few manuscripts that has a real conclusion paragraph.

 

===== General Comments:

The study reads well overall, but some formulations appear careless. I am not an expert in this subject, but if I were to dive into it, I would definitely consult this article. I am afraid I cannot give substantial comments, but I will give some hints on how to improve some formulations.

 

Some sentences need a comma, or even two or more. Hyphenation should be taken more seriously; otherwise some of the sentences are very hard to read.

 

===== Specific Comments:

- Hyphenation is sometimes lacking, for example:

line 192: depth aware self attention module

 

- line 149: "The decoder consists of all MLP layers and fuses multi-level features.". MLP for multi-layer perceptron? I think you mean 'only' instead of 'all'. Or: all layers are of type MLP (see the sketch after these comments).

 

- line 167: "But for the UNet [35] like network whose encoder has multi-stage outputs". Hard to read. I think you mean ... UNet-like network [35]...

 

- line 164: "SegNet [34] like network" same as above (I was rereading the paragraph)

 

- line 173: "consisting of transformer". Unclear: consisting of A and B? Or perhaps: consisting of a transformer for each one.
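
As a hint on the "all layers are of type MLP" reading of line 149 above: in a SegFormer-style decoder every layer is a per-pixel linear (MLP) projection, with multi-level encoder features projected to a common width, upsampled, concatenated and fused. A minimal PyTorch sketch of that reading, assuming the standard SegFormer design and not necessarily the authors' exact implementation, could be:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AllMLPDecoder(nn.Module):
        """Sketch of a SegFormer-style decoder in which every layer is a
        per-pixel linear (MLP) projection, here written as 1x1 convolutions."""
        def __init__(self, in_channels=(64, 128, 320, 512), embed_dim=256, num_classes=6):
            super().__init__()
            # one linear projection per encoder stage
            self.proj = nn.ModuleList(nn.Conv2d(c, embed_dim, 1) for c in in_channels)
            self.fuse = nn.Conv2d(embed_dim * len(in_channels), embed_dim, 1)
            self.classify = nn.Conv2d(embed_dim, num_classes, 1)

        def forward(self, feats):
            # feats: per-stage maps [B, C_i, H_i, W_i]; feats[0] is the highest-resolution (1/4-scale) one
            target = feats[0].shape[2:]
            x = [F.interpolate(p(f), size=target, mode='bilinear', align_corners=False)
                 for p, f in zip(self.proj, feats)]
            x = self.fuse(torch.cat(x, dim=1))   # concatenate along channels, fuse with another MLP
            return self.classify(x)              # per-pixel class logits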

Reviewer 3 Report

1) The authors must include the confusion matrix which is used to derive the TP, TN, FP and FN values (a sketch of how these follow from the confusion matrix is given after this list).

2) The TP, TN, FP and FN values also must be displayed.

3) The novelty of the paper is not very clear. It seems the existing methodologies are adopted but in a different way. This part must be clearly described.

4) What is the use of attention models in this paper? Justify their use in your work. How do they enhance the performance measures?

5) What do you infer from the qualitative results given in the paper?

6) Have you used just the concatenation of pixel values in the fusion process? More information must be provided on that (an illustrative concatenation sketch is given after this list).

7) What is the reason behind obtaining different performance measures for the different datasets? How can you then validate your approach?

8) What is the significance of the word "semantic" in your title? How does depth assist with this keyword in your proposed method?

9) Quantitative experimental results must be elaborated on.

10) What are the different types of classes in each dataset? Is there any relation between the content of each dataset and the proposed method?
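
Regarding comments 1) and 2): the per-class TP, TN, FP and FN values can be read directly off a confusion matrix whose rows are ground-truth classes and whose columns are predicted classes. A minimal numpy sketch, using a made-up 3-class matrix purely for illustration:

    import numpy as np

    def per_class_counts(conf_mat):
        """Derive per-class TP, TN, FP, FN from a confusion matrix
        (rows = ground truth, columns = predictions)."""
        conf_mat = np.asarray(conf_mat)
        total = conf_mat.sum()
        tp = np.diag(conf_mat)            # correctly labelled pixels of each class
        fp = conf_mat.sum(axis=0) - tp    # predicted as the class but belonging to others
        fn = conf_mat.sum(axis=1) - tp    # belonging to the class but predicted as others
        tn = total - tp - fp - fn         # everything else
        return tp, tn, fp, fn

    # example: 3-class confusion matrix (values invented for illustration)
    cm = [[50, 2, 1],
          [ 3, 40, 2],
          [ 0, 1, 10]]
    print(per_class_counts(cm))

Regarding comment 6): if the fusion really is a plain channel-wise concatenation of RGB and depth feature maps followed by a mixing layer, it would amount to something like the following illustrative PyTorch sketch (not the authors' code):

    import torch
    import torch.nn as nn

    class ConcatFusion(nn.Module):
        """Illustrative channel-wise concatenation of RGB and depth feature maps,
        followed by a 1x1 convolution that mixes the two modalities."""
        def __init__(self, rgb_channels, depth_channels, out_channels):
            super().__init__()
            self.mix = nn.Conv2d(rgb_channels + depth_channels, out_channels, kernel_size=1)

        def forward(self, rgb_feat, depth_feat):
            fused = torch.cat([rgb_feat, depth_feat], dim=1)  # stack along the channel axis
            return self.mix(fused)

    # example: fuse 256-channel RGB features with 64-channel depth features
    fusion = ConcatFusion(256, 64, 256)
    out = fusion(torch.randn(1, 256, 64, 64), torch.randn(1, 64, 64, 64))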
