Review Reports - DPSDA-Net: Dual-Path Convolutional Neural Network with Strip Dilated Attention Module for Road Extraction from High-Resolution Remote Sensing Images

Round 1

Reviewer 1 Report

This paper proposes DPSDA-Net, which improves the connectivity and integrity of road extraction from remote sensing images. Extensive experiments covering the STPA, LDSC, and PDSA modules demonstrate the effectiveness of DPSDA-Net.

Overall, the paper is well organized, and I recommend accepting it after minor revision.

1. Before describing the proposed model in detail, it is recommended to use a flow chart to describe the technical route.

2. How important is it to extract a feature map using a strip convolution? In the experimental analysis, each module is explained. It is necessary to elaborate on the function of strip convolution in the article.

3. The figures leave a lot of blank space in the layout of the article. The article would be better if this could be improved.

4. As far as the details of the experiment are concerned, the total number of training epochs was not stated.

The structure of the article is reasonable, but attention should be paid to the standardization and accuracy of the language expression. Some expressions are not rigorous enough or inappropriate. It is suggested that the article should be carefully considered and corrected during the revision, so as to improve the quality and practicality of the article.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

1. Compared with other models, the model achieves good results on the two datasets, but the accuracy difference between the two datasets is too large. Please explain the reasons for this gap.

2. Please read the full text carefully to ensure that the pictures are consistent with the description in the text, and correct some typos and grammatical errors.

3. In the experimental comparison, can the width and height of the blank space between pictures be kept consistent, so that the layout of the full text is more uniform?

4. As the author described in the article, the parameters of the network structure are relatively complex. Is it necessary to explain which operations in the module complicate the model?

5. When the paper introduced the characteristics and advantages of DPSDA-Net, the article clearly distinguished the role of each technical component through experiments, but did not explain in detail why the combination of these components can improve the accuracy and robustness of road extraction.

6. It is recommended that the author summarize the main contributions and innovations of the paper in more detail in the conclusion section, and discuss future research directions and challenges. This can help readers better understand the value and significance of the article, and stimulate more research thinking and exploration.

1. There are some grammar issues in the article, please carefully revise and improve it.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

review

This paper proposes a two-way convolutional neural network for road extraction, which finally achieves state-of-the-art results on the Massachusetts and LRSNY datasets.

1. In your Figure 1, do not use the expression Maxpooling (2); try to change it to 2×2 Maxpooling. The same applies to UpsamplingBilinear2d below.

2. In your Figure 1, your strip convolutions are all 3×1 convolutions. I think this is no different from 3×3 convolutions, and the receptive field is smaller than that of 3×3 convolutions.

3. Your Figure 2 and Figure 3 are blurry; please revise them. In addition, I think the C×C part of the matrix transformation in your module in Figure 2 is wrong; please correct it.

4. I would like to know the specific implementation of your left-diagonal and right-diagonal convolution modules; please describe them in detail. In addition, what are the specific directions of the 1×9 RD and 9×1 LD in your description?

5. In your Figure 3, what is the function of F.interpolate?
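
To make the comparison in comment 2 concrete, the following is a minimal sketch of how the receptive field and weight count of a single convolution layer depend on its kernel shape. The kernel shapes (3×1, 3×3, 9×1, 1×9) are the ones named in the comments above; the helper name `conv_footprint` and the dilation argument are illustrative assumptions and are not taken from the manuscript.

```python
def conv_footprint(kernel_h, kernel_w, dilation=1):
    """Return (receptive-field height, receptive-field width, weights per
    input/output channel pair) for one convolution layer.

    Hypothetical helper for illustration; bias terms are ignored.
    """
    rf_h = (kernel_h - 1) * dilation + 1
    rf_w = (kernel_w - 1) * dilation + 1
    return rf_h, rf_w, kernel_h * kernel_w


# Kernel shapes mentioned in the review comments.
kernels = {
    "3x1 strip": (3, 1),
    "3x3 square": (3, 3),
    "9x1 strip": (9, 1),
    "1x9 strip": (1, 9),
}

for name, (kh, kw) in kernels.items():
    rf_h, rf_w, weights = conv_footprint(kh, kw)
    print(f"{name}: receptive field {rf_h}x{rf_w}, {weights} weights per channel pair")
```

Under these assumptions a single 3×1 kernel sees a 3×1 window with 3 weights, whereas a 3×3 kernel sees a 3×3 window with 9 weights, which is the trade-off the comment asks the authors to justify.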

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 4 Report

This paper presents a novel deep framework for the extraction (segmentation) of road networks in aerial remote sensing images. Specifically, it combines three advanced modules (stripped position attention, long distance shortcut connections and pyramid dilated stripped attention) within a classical UNet architecture.

The proposed model is benchmarked on two datasets commonly used for road extraction applications, and compared with several (6) other deep frameworks. An ablation study is also conducted.

Overall, the paper is quite well written (apart from some minor typos and English flaws), the structure is sound, and the presented results demonstrate that the proposed model indeed outperforms the competing methods.

Here are my overall comments:

* I appreciate the effort made by the authors to describe in detail the STPA, LDSC and PDSA modules in order to give some insight into their usefulness. However, I find the description of the STPA module (section 2.2) rather unclear (especially between l.265 and l.272).

* What are RC (ll.266, 272, 276, 278, 293) and RN (ll.270, 274)?

* The way the dimensions are presented and handled in Figure 2 is unclear: how can a CxC matrix become a (HxW)x(HxW) one (middle branch)? The same question applies to the bottom branch (CxC becomes CxHxW).

* The train/val/test split strategy for the conducted experiments is unclear, especially for the Massachusetts Road Dataset where 95% of images are put in the train set (1108/1171) while only 1% goes into the validation set (14/1171). Why this division? Also, it is then said (section 3.3) that the available (1000x1000 and 1500x1500) images are further divided into 512x512 images: what is the adopted proportion of train/val/test data then?

* Please thoroughly check the provided F1 score figures in table 1 and table 2: I recomputed them based on the provided precision / recall percentages, and I consistently found different numbers (approximately 1% higher than what is provided, for example in table 1 NL-LinkNet is 76.57% instead of 75.64%, and DPSDA-Net is 78.38% instead of 77.66%. Note that it does not change the provided conclusion as the relative ordering of the methods remains the same).

* What was the total training time for both datasets?
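
Two of the checks requested above (the F1 recomputation and the split proportions) come down to simple arithmetic, sketched below. This is only an illustration: the function names are hypothetical, the precision/recall pair is a placeholder and is not taken from the paper's tables, and treating the images left over after train and validation as the test set is an assumption.

```python
def f1_from_percent(precision_pct, recall_pct):
    """F1 as the harmonic mean of precision and recall, both in percent."""
    if precision_pct + recall_pct == 0:
        return 0.0
    return 2 * precision_pct * recall_pct / (precision_pct + recall_pct)


def split_shares(train, val, total):
    """Percentage of images in train, val, and the remainder.

    The remainder is assumed (not stated in the comments) to be the test set.
    """
    test = total - train - val
    counts = (("train", train), ("val", val), ("test", test))
    return {name: 100 * n / total for name, n in counts}


# Placeholder precision/recall values, NOT the figures from Table 1 or Table 2.
example_f1 = f1_from_percent(80.0, 75.0)
print(f"F1 = {example_f1:.2f}%")

# Image counts quoted in the comment on the Massachusetts Road Dataset split.
shares = split_shares(train=1108, val=14, total=1171)
for name, pct in shares.items():
    print(f"{name}: {pct:.2f}%")
```

Substituting each method's reported precision and recall into `f1_from_percent` would show whether the tabulated F1 scores are consistent with them.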

Here are my comments related to the quality of English:

* in general, please be careful with acronyms: some of them are used without being defined (VHR l.58, SE l.136, BN l.297, etc.), and some of them are also inconsistent, such as R-BiasUNet (ref [17]), which is also named R-BasicUNet and R_Basic_UNet

* l.203: the word "briefly" appears twice in the same sentence

* l.360: "What's more" is not scientific English

* There are some typos and inconsistencies/mismatches in the references: ref [9] is attributed to Long J et al. in the text but its first author is Buslaev; the same applies to refs [19] and [25] (mismatch between the text and the name of the first author); there is a typo in ref [30] (Mnih => Minh); and refs [17] and [31] refer to the same article.

Author Response

Please see the attachment.

Author Response File: Author Response.docx