Article
Peer-Review Record

Road Extraction from Remote Sensing Imagery with Spatial Attention Based on Swin Transformer

Remote Sens. 2024, 16(7), 1183; https://doi.org/10.3390/rs16071183
by Xianhong Zhu 1, Xiaohui Huang 1,*, Weijia Cao 2, Xiaofei Yang 3, Yunfei Zhou 1 and Shaokai Wang 1
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 23 February 2024 / Revised: 25 March 2024 / Accepted: 26 March 2024 / Published: 28 March 2024

Round 1

Reviewer 1 Report (Previous Reviewer 2)

Comments and Suggestions for Authors

The submitted manuscript describes research carried out to develop a methodology that improves the extraction of streets and roads from orthoimagery.

The abstract conveys the main ideas of the manuscript adequately.

The statement of the research problem is clear and concise.

The literature on the topic is very extensive and continuously growing. The authors have included a sufficient number of relevant references.

Line 117. Please extend this sentence so that it makes sense without the citation. Maybe the authors should explicitly mention somewhere that Swin Transformer is a type of transformer-based neural network architecture designed for computer vision tasks.

I have some comments concerning the methodology:

The reader can deduce from Figure 1 that the final output of the procedure is a black-and-white raster image that differentiates streets and roads, right? This kind of output is not really new in the proposed methodology, regardless of the improvements in the metrics. Can the procedure be extended to obtain vector output for a real application in autonomous driving or path navigation, as claimed in the abstract?

Lines 232 to 235. Please define the variables more precisely with mathematical rigor.

Lines 253 to 261. Please develop this part of the manuscript further to make it clearer.

Lines 307 to 315. It is welcome that the authors describe the data used to test the method in sufficient detail. If possible, I would like to see an analysis of the sensitivity of the procedure to data quality, since such analyses are very hard to find in the literature. Otherwise, the authors could include this idea as a future line of research.

Line 325. Please define Stochastic Gradient Descent further or include a reference.

Lines 334 to 338. How did the authors adapt the methods cited to the road extraction in the dataset used?

Line 341. How was the ground truth needed to calculate the metrics of the different procedures obtained?

Regarding the conclusions section, I must mention that this text seemed very short to me. The findings of the research might allow the authors to extend this part of the text further.

Concerning minor issues, I have a few comments.

The manuscript contains many acronyms that should be defined the first time they are used in the main text. Please pay attention to this.

Section 2.1. I would name it “Conventional methods” rather than “Traditional methods”.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report (Previous Reviewer 1)

Comments and Suggestions for Authors

All my concerns have been addressed.

Author Response

Thank you very much for your helpful suggestions on this manuscript.

Reviewer 3 Report (Previous Reviewer 3)

Comments and Suggestions for Authors

This paper presents a new road extraction model named Spatial Attention Swin Transformer (SASwin Transformer). It mainly contains a Spatial Self-Attention (SSA) module and a Spatial MLP (SMLP) module. The reported results confirm that the SASwin Transformer is feasible. However, many points should be explained further before possible acceptance.

1.     The main contributions and motivation are not clear. What are the advantages and distinctive features of your method compared to the vast number of existing methods?

2.     Related work should be streamlined. For example, graph-based methods can be omitted, as the proposed SASwin Transformer is unrelated to graph theory.

3.     The methodology description is not sufficiently clear. The distinction between the proposed method and existing methods is not evident.

4.     The experiments should be improved. The latest RS-oriented methods should be included as baselines to validate your SASwin Transformer.


5.     The references should be updated and enriched.

Comments on the Quality of English Language

N/A

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report (Previous Reviewer 2)

Comments and Suggestions for Authors

The authors have addressed most of my comments. However, some others have not been reflected in the text of the manuscript:

1. As the authors have not studied the sensitivity of the procedure to data quality (since it is very hard to find something like this in the literature), they should have explicitly included this idea as a future line of research.

2. In addition to the reference, a sentence defining Stochastic Gradient Descent should have been included (line 312).

3. The authors have not explained in the manuscript how they adapted the methods cited to the road extraction in the dataset used.

4. The authors have not explained in the manuscript how the ground truth used to calculate the metrics of the procedures was obtained.


Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report (Previous Reviewer 3)

Comments and Suggestions for Authors

All issues have been addressed, and the current version can be accepted.

Comments on the Quality of English Language

N/A

Author Response

Thank you very much for your helpful suggestions on this manuscript.

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper presents a method named SASwin Transformer for road extraction from remote sensing images. Two public datasets are used to evaluate the proposed method. The paper is well structured and includes experimental results. However, some issues should be addressed.

(1) The caption of Figure 10, '… ..., with differences highlighted in red circles', doesn’t match the color in figures. 

(2) Recently, many methods combining transformers with CNNs have been proposed to capture long-range dependencies as well as global and local contextual information for road extraction. What are the advantages of your method compared to these methods? For example:

Tao, Jingjing, et al. "Seg-Road: A Segmentation Network for Road Extraction Based on Transformer and CNN with Connectivity Structures." Remote Sensing 15.6 (2023): 1602. 

(3) Please compare the parameter size (e.g., in Bytes) and efficiency (e.g., FPS, FLOPS) with other models.


Author Response

See attached.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript describes research conducted to extract the road and street layout in plan view from orthoimagery, using an algorithm based on the Swin Transformer.

The problem addressed has been extensively studied in recent years and it is difficult to evaluate whether it contains significant contributions to the state of knowledge.

The title of the manuscript contains words that are confusing regarding the actual aim and scope of the study. “Road extraction” is more appropriate than “road profile inversion”. The research described has nothing to do with either the longitudinal profile or the cross-sectional profile of roads.

The procedure and algorithms as well as the data used seem to have been well described.

Tables 1 and 2 include methods that apparently have not been examined in the study presented; therefore, it is unlikely that they were tested under the same conditions. Hence, the evaluation metrics cannot be compared.

It is not clear what the output of the process is, i.e., raster or vector. As a result, the reader cannot understand how the evaluation metrics were computed.

Have the authors tested the sensitivity of the method to image resolution? This is the greatest gap currently existing within the scope of this study.

The conclusions are a bare summary of the research, without clearly stating any contribution.


Author Response

Please see attached

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

A Spatial Attention Swin Transformer (SASwin Transformer) is proposed in this paper to create a robust encoder capable of extracting roads from remote sensing imagery. The introduced idea is interesting and feasible. Also, the manuscript is well written and well organized. Thus, I suggest that the current version be accepted.

Author Response

See attached.

Author Response File: Author Response.pdf
