Peer-Review Record

Capsule–Encoder–Decoder: A Method for Generalizable Building Extraction from Remote Sensing Images

Remote Sens. 2022, 14(5), 1235; https://doi.org/10.3390/rs14051235
by Zhenchao Tang 1,2,†, Calvin Yu-Chian Chen 2,†, Chengzhen Jiang 3, Dongying Zhang 1,*, Weiran Luo 4, Zhiming Hong 5 and Huaiwei Sun 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 30 January 2022 / Revised: 21 February 2022 / Accepted: 1 March 2022 / Published: 2 March 2022
(This article belongs to the Special Issue Deep Learning for Very-High Resolution Land-Cover Mapping)

Round 1

Reviewer 1 Report

This paper presents a capsule-based CNN model for detecting buildings in high-resolution remote sensing imagery. 

I have the following questions/comments:
- Page 5, third paragraph: Nice explanation about how the proposed method captures building parts and their features in an explainable way. As stated in this explanation, it seems logical that an explainable method will lead to better generalization. However, how do the authors know that the features of their method are actually explainable? Could they demonstrate this visually? Can they compare the features in their model to those in other methods that use regular CNNs for building extraction, and demonstrate/prove that their features are more explainable?
- Page 6, line 220: How do the authors compute the pose, color, and texture properties of a building used in Eq. (1)? Are there special operations/convolution blocks involved? I would like to know how these properties are captured in their model. Perhaps they could show some figures with features such as pose_map and c_map.

- Page 8, line 291: How did the authors come up with the idea of using the sum of lambda*att_map in the second loss function (Eq. 6)? What would happen if they used the posterior part distribution (part_map) in Eq. 6 instead? From a computational standpoint, would it not be more efficient to utilize a pre-computed entity such as part_map in the loss function rather than generating a new entity (sigma(lambda*att_map))? Comparing the results of these two approaches would be worthwhile.

- Page 8, lines 304-305: missing '×' in 256×256 and 3000×3000.

- Page 10, Table 2: In view of the trend in Table 2, I would expect omitting the second loss to negatively affect the performance. Nonetheless, it would be helpful for readers to know how the model performs without the second loss (a=10, b=0).

- Page 14, figure: From this figure, it is apparent that the posterior features represent the edges and main body of the buildings. In these highlighted maps of the posterior features, I was not able to find any of the building parts mentioned on pages 6 and 7 (pose, color, texture, etc.). I therefore question whether the argument regarding the effectiveness of integrating those building parts into the capsules is valid.

Although the model proposed in this study shows only a slight improvement in IoU and PA metrics on an existing dataset (Yellow River), it demonstrates a significant improvement on an unfamiliar test set. This illustrates the behavior of the model when a distribution gap exists between the training and test sets. For further clarification of this domain gap, I recommend reporting the numerical results of the model on another building extraction dataset such as Chesapeake (https://lila.science/datasets/chesapeakelandcover).

The following settings are worth experimenting with:
- Train on the Yellow River training set and test on the Chesapeake test set.
- Train on the Chesapeake training set, test on the Chesapeake test set, and test on the Yellow River test set.
- I would particularly like to see the results of the other two capsule-based models in the above settings.

In general, I would not recommend publication of this work prior to seeing more numerical results and responses to my questions.

Author Response

Point 1: Page 5, third paragraph: nice explanation about how the proposed method captures building parts and their features in an explainable way. As stated in this explanation, it seems logical that an explainable method will lead to better generalization. However, how do the authors know that the features of their method are actually explainable? Could they demonstrate this visually? Can they compare the features in their model to those in other methods that use regular CNNs for building extraction, and demonstrate/prove that their features are more explainable?

 

Response 1: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

1) We visualize the output feature maps and add the feature maps output by regular CNNs for comparison. (page 15, Figure 7)

2) We added a description of the feature-map visualization, demonstrating that our method can output explainable feature maps. (page 15, lines 537 to 541)

 

Point 2: Page 6, line 220: How do the authors compute the pose, color, and texture properties of a building used in Eq. (1)? Are there special operations/convolution blocks involved? I would like to know how these properties are captured in their model. Perhaps they could show some figures with features such as pose_map and c_map.

Response 2: We are grateful for the valuable suggestion. We have made the following modifications in the revised manuscript:

1) For posture, color, and texture information, we use the parameters of UPerNet for initialization. The first layer extracts common features, and the remaining three layers are arranged in parallel to extract the posture, color, and texture of different parts; we explain this in Section 2.2 (page 6, lines 242 to 245). A minimal sketch of this layout is given after this list.

2) We added Section 4.1 to the Discussion. We visualize the posture, color and texture of buildings in Section 4.1. (page 20, Figure 13)

3) Additionally, we elaborate in Section 4.1 the relationship between the explainability of the parts posterior distribution and the posture, color, and texture information. (pages 20 to 21, lines 617 to 641)
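To make the shared-plus-parallel layout concrete, the following minimal PyTorch sketch shows a shared first layer feeding three parallel property branches. The layer widths, kernel sizes, and branch names (pose_branch, color_branch, texture_branch) are illustrative assumptions, not the exact UPerNet-initialized configuration from the paper.

```python
import torch.nn as nn

class PartPropertyExtractor(nn.Module):
    """Hypothetical sketch: one shared layer followed by three
    parallel branches for posture, color, and texture features."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()

        def block(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True))

        self.shared = block(in_ch, feat_ch)        # common low-level features
        self.pose_branch = block(feat_ch, feat_ch)
        self.color_branch = block(feat_ch, feat_ch)
        self.texture_branch = block(feat_ch, feat_ch)

    def forward(self, x):
        h = self.shared(x)                         # shared first layer
        return (self.pose_branch(h),               # three parallel heads
                self.color_branch(h),
                self.texture_branch(h))
```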

 

Point 3: page 8, line 291: How did the authors come up with the idea of using the sum of lambda*att_map in the second loss function (eq. 6)? The question I have is what would happen if they used posterior part distribution (part_map) in eq. 6? Is it not more efficient to utilize a pre-computed entity such as part_map in the loss function rather than generating a new entity (sigma(lambda*att_map)) from a computational standpoint? Comparing the results of these two approaches would be worthwhile.

Response 3: We are grateful for the valuable suggestion. We have made the following modifications in the revised manuscript:

1) The idea of lambda*attn_map in Eq. 6 comes from the prototype network: a representation of the whole object is obtained by fusing the parts, the difference lambda between the whole object and each part is then computed, and this difference is used to correct the attn_map, yielding a more accurate spatial distribution of the parts. The rationale is that the more similar a part is to the whole object, the more likely the part belongs to that object. (page 8, lines 302 to 309)

2) In the paper, part_map is the concatenation of lambda*attn_map, so using part_map has the same effect as using lambda*attn_map (see the sketch below). In addition, we added an experimental comparison, using the parts prior distribution and the parts posterior distribution as part_map, respectively, to test the effect of the two approaches. (page 10, lines 372 to 374; page 11, Table 1, lines 409 to 412; page 12, Figure 4.a)
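As an illustration of the correction described in 1) and the part_map construction in 2), here is a minimal, hypothetical PyTorch sketch; the tensor shapes are assumed, and cosine similarity stands in for the paper's part-whole comparison (the difference lambda). It is not the exact implementation from the paper.

```python
import torch
import torch.nn.functional as F

def corrected_part_map(attn_maps, part_feats):
    """Hypothetical sketch of the prototype-style correction.
    attn_maps:  (B, K, H, W) spatial attention map per part
    part_feats: (B, K, C)   feature vector per part
    Returns a (B, K, H, W) part_map in which each part's attention
    is re-weighted by its similarity (lambda) to the whole object."""
    # whole-object prototype = fusion (here: mean) of the part features
    prototype = part_feats.mean(dim=1, keepdim=True)             # (B, 1, C)
    # lambda: similarity between each part and the whole object
    lam = F.cosine_similarity(part_feats,
                              prototype.expand_as(part_feats),
                              dim=-1)                            # (B, K)
    # parts more similar to the whole object receive larger weight
    return lam[..., None, None] * attn_maps                      # (B, K, H, W)

# the second loss could then act on the fused (summed) corrected map:
# loss2 = criterion(corrected_part_map(attn_maps, part_feats).sum(dim=1), target)
```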

 

Point 4: page 8, line 304, 305: missing x in 256x256 and 3000x3000.

Response 4: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

We have corrected the notation using '$\times$'. (page 9, lines 343 to 344)
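For context, cropping large scenes (e.g. 3000×3000) into 256×256 training patches is typically done as in the following generic sketch; this is not the paper's exact preprocessing code.

```python
import numpy as np

def tile_image(img, tile=256):
    """Minimal sketch (not the authors' pipeline): crop a large scene
    into non-overlapping tile x tile patches, discarding the partial
    border remainder."""
    h, w = img.shape[:2]
    patches = []
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            patches.append(img[y:y + tile, x:x + tile])
    return patches

# a 3000x3000 scene yields 11 x 11 = 121 full 256x256 patches
patches = tile_image(np.zeros((3000, 3000, 3), dtype=np.uint8))
assert len(patches) == 121
```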

 

Point 5: Page 10, Table 2: In view of the trend in Table 2, I would expect omitting the second loss to negatively affect the performance. Nonetheless, it would be helpful for readers to know how the model performs without the second loss (a=10, b=0).

Response 5: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

We added the case where there is no second loss. (page 11, Table 2; page 12, Figure 4.b)

 

Point 6: Page 14, figure: From this figure, it is apparent that the posterior features represent the edges and main body of the buildings. In these highlighted maps of the posterior features, I was not able to find any of the building parts mentioned on pages 6 and 7 (pose, color, texture, etc.). I therefore question whether the argument regarding the effectiveness of integrating those building parts into the capsules is valid.

Response 6: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

Posture, color, and texture information are low-level features that describe building parts. Our goal is to obtain an explainable parts posterior distribution, and capturing posture, color, and texture information is a prerequisite for that. We have added an analysis of posture, color, and texture information in Section 4.1. (pages 20 to 21, lines 617 to 641)

 

Point 7: Although the model proposed in this study shows only a slight improvement in IoU and PA metrics on an existing dataset (Yellow River), it demonstrates a significant improvement on an unfamiliar test set. This illustrates the behavior of the model when a distribution gap exists between the training and test sets. For further clarification of this domain gap, I recommend reporting the numerical results of the model on another building extraction dataset such as Chesapeake. The following settings are worth experimenting with:
- Train on the Yellow River training set and test on the Chesapeake test set.
- Train on the Chesapeake training set, test on the Chesapeake test set, and test on the Yellow River test set.
- I would particularly like to see the results of the other two capsule-based models in the above settings.

Response 7: We are grateful for the valuable suggestion. We have made the following modifications in the revised manuscript:

1) Since the public WHU dataset has a convenient PyTorch processing API, we introduced the WHU dataset into the experiments. (page 9, line 330)

2) We performed further experiments in two settings: train on the Yellow River training set and test on the WHU test set; train on the WHU training set and test on both the WHU and Yellow River test sets. (page 10, lines 394 to 397)

3) We added the experimental results to Section 3.6. (pages 17 to 19, Table 5 and Table 6, Figure 10, Figure 11 and Figure 12)

 

Author Response File: Author Response.docx

Reviewer 2 Report

Remote sensing building datasets with large areas and long time series are usually characterized by large variance and long-tailed statistical distributions, which degrade the performance of deep learning models trained only in the source domain. For the building extraction task, common deep learning methods generalize weakly and lack robustness. To solve this problem, this paper proposes a Capsule-Encoder-Decoder model. The proposed algorithm converges faster and shows higher accuracy. However, there are some minor issues to consider.

1. As far as I know, in this work a capsule is a vector: a latent representation of the original image extracted by the feature extractor. More explanation is needed to illustrate the difference between a feature representation vector and a capsule, and the advantages of using a capsule.

2. From the results in Figure 5, the algorithm proposed in the article extracts more non-buildings. Whether extracting many non-buildings as buildings seriously affects the experimental results needs to be illustrated and analyzed.

3. For different network sizes, the obtained results should differ. The network size and testing time should be analyzed and explained in this paper to further illustrate the effectiveness of the algorithm.

4. For remote sensing, some recent literature should be included, such as 10.1109/TGRS.2021.3128764.

Author Response

Point 1: As far as I know, in this work, a capsule is a vector, which is a potential representation of the original image extracted by the feature extractor. In this work, more presentations are needed to illustrate the difference between a feature representation vector and a capsule, and the advantages of using a capsule.

Response 1: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

We added a note describing capsules and feature representation vectors and stating the advantages of using capsules. (page 5, lines 205 to 214)
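For reference, the usual distinction is that a capsule's vector length can be read as an existence probability while its orientation encodes the entity's properties, a constraint enforced by the squash nonlinearity of Sabour et al. (2017). Below is a generic sketch of that nonlinearity, not taken from the paper's code.

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    """Squash nonlinearity: shrinks a capsule vector so its length
    lies in [0, 1) and can be read as an existence probability, while
    its orientation encodes the entity's properties. An ordinary
    feature vector carries no such built-in semantics."""
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / torch.sqrt(sq_norm + eps)

v = squash(torch.randn(4, 16))   # 4 capsules of dimension 16
print(v.norm(dim=-1))            # all lengths fall in [0, 1)
```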

 

Point 2: From the results in Figure 5, the algorithm proposed in the article extracts more non-buildings. Whether extracting many non-buildings as buildings will seriously affect the experimental results needs to be somewhat illustrated and analyzed.

Response 2: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

We analyzed this phenomenon: the proportion of non-building pixels extracted by the model is small, so it does not seriously affect the overall results. We also analyzed the reasons why the model extracts non-buildings. (page 14, lines 498 to 505) Moreover, we added more experimental results in Section 3.6 to demonstrate the reliability of our algorithm. (pages 17 to 19, Table 5 and Table 6, Figure 10, Figure 11 and Figure 12)
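As a generic illustration of why a small fraction of false-positive pixels barely moves the reported metrics, the sketch below computes IoU and pixel accuracy (PA) for binary masks; the toy example and array shapes are assumptions, not data from the paper.

```python
import numpy as np

def iou_and_pa(pred, gt):
    """IoU and pixel accuracy (PA) for binary building masks
    (pred, gt: boolean arrays of equal shape)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 1.0
    pa = (pred == gt).mean()
    return iou, pa

# toy example: a 128x128 building plus a small 16x16 false-positive patch
gt = np.zeros((256, 256), dtype=bool)
gt[64:192, 64:192] = True
pred = gt.copy()
pred[0:16, 0:16] = True
print(iou_and_pa(pred, gt))   # IoU ~0.985, PA ~0.996: minor impact
```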

 

Point 3: For different network sizes, the obtained results should be different. The network size and testing time should be analyzed and explained in this paper to further illustrate the effectiveness of the algorithm.

Response 3: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

We analyzed the network size and test speed, and we also tested the efficiency of the CNN-based and capsule-based methods; the efficiency comparison of all methods is shown in Figure 9. The results show that our method has a fast test speed and good generalization. As the network size decreases, the test speed increases, but the generalization performance drops accordingly. (page 17, lines 587 to 590, Figure 9)
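A generic way to obtain such size/speed numbers is sketched below; the model, input shape, and run count are placeholders rather than the configurations benchmarked in the paper.

```python
import time
import torch
import torch.nn as nn

def profile(model, input_shape=(1, 3, 256, 256), runs=50):
    """Generic sketch for the size/speed trade-off analysis:
    report parameter count and mean forward-pass time."""
    n_params = sum(p.numel() for p in model.parameters())
    x = torch.randn(*input_shape)
    model.eval()
    with torch.no_grad():
        model(x)                        # warm-up pass
        t0 = time.perf_counter()
        for _ in range(runs):
            model(x)
        dt = (time.perf_counter() - t0) / runs
    return n_params, dt

net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 1, 1))
print(profile(net))   # (parameter count, seconds per forward pass)
```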

 

Point 4: For remote sensing, some recent literature should be included such as 10.1109/TGRS.2021.3128764

Response 4: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

We have added recent literature to the Introduction. (page 2, lines 54 to 58)

Author Response File: Author Response.docx

Reviewer 3 Report

The paper proposes a new building extraction method from remote sensing images, based on capsule, encoder, and decoder.

The study is interesting and potentially useful to practitioners. The methodology seems partially original. The results, also proposing a comparison with other literature methods, are convincing. The discussion of methods and results is adequate. Furthermore, the paper's contribution to the research literature stands out.

  • “1. Introduction” section: The section adequately reports the paper motivation and gaps of the literature that the authors want to address in their research. The proposed approach should be only mentioned here and described in an extensive and in-depth manner in the next section. In my opinion, the description of the proposed model, from line 55 to line 85, should be largely moved to the “2. Materials and Methods” section. Please, move also Figure 1.
  • “1.3 Contributions” sub-section: The results of the study should not be reported in the introductory part of the paper. I recommend that a) you purge the contributions from the results or b) integrate this subsection in the “5. Conclusions” section and, eventually, replace it with a subsection on the objectives of the study
  • Lines 340-341: “beta parameters” instead of “parameter betas”.
  • Figure captions should be concise and comprehensive. Any additional information, explanation and/or interpretation should be reported in the text. Please, reduce the caption length of Figures 1, 2, 3, 5 and 6.
  • Figure 7: Figures always go after the first reference to that figure. So, please, move Figure 7.
  • Please, remove periods placed at the end of the equations 4, 5, 6, 9 and 10.
  • Please, remove commas placed at the end of equations 1, 2, 3, 7 and 8.
  • Please, always leave a space before an open parenthesis or bracket. Check this throughout the whole paper.
  • Please, always leave a space after a period or comma (see lines: 236, 259-262, 266-268, 283-284, 320).
  • The items reported in lines from 303 to 306 should be organized in a bullet list.
  • Lines 303-304: “3000x3000” instead of “3000 3000” and “256x256” instead of “256 256”.

Author Response

Response to Reviewer 3 Comments

Point 1:  “1. Introduction” section: The section adequately reports the paper motivation and gaps of the literature that the authors want to address in their research. The proposed approach should be only mentioned here and described in an extensive and in-depth manner in the next section. In my opinion, the description of the proposed model, from line 55 to line 85, should be largely moved to the “2. Materials and Methods” section. Please, move also Figure 1.

Response 1: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

1) We have removed the detailed description of the model in the Introduction. (pages 2 to 3, lines 71 to 88, the deleted part is marked in blue)

2) We removed the original description in Section 2.1. (pages 4 to 5, lines 186 to 193, the deleted part is marked in blue)

3) We re-added the description of the model in Section 2.1 and moved Figure 1 to Section 2.1. (page 5, lines 194 to 204, additions marked in orange)

 

Point 2: “1.3 Contributions” sub-section: The results of the study should not be reported in the introductory part of the paper. I recommend that a) you purge the contributions from the results or b) integrate this subsection in the “5. Conclusions” section and, eventually, replace it with a subsection on the objectives of the study.

Response 2: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

We followed suggestion (b) and revised Section 1.3 to present the research objectives. (page 4, lines 161 to 178)

 

Point 3: Lines 340-341: “beta parameters” instead of “parameter betas”.

Response 3: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

We have made this change. (page 10, lines 382 to 383)

 

Point 4: Figure captions should be concise and comprehensive. Any additional information, explanation and/or interpretation should be reported in the text. Please, reduce the caption length of Figures 1, 2, 3, 5 and 6.

Response 4: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

We reduced the caption length of Figures 1, 2, 3, 5 and 6. (Page 5, Figure 1; Page 7, Figure 2; Page 8, Figure 3; Page 13, Figure 5; Page 14, Figure 6)

 

Point 5: Figure 7: Figures always go after the first reference to that figure. So, please, move Figure 7.

Response 5: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

We have moved Figure 7. (page 15, Figure 7)

 

Point 6: Please, remove periods placed at the end of the equations 4, 5, 6, 9 and 10.

Response 6: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

We have removed the period at the end of the equations. (page 8, Equation 4 and Equation 5; page 9, Equation 6; page 10, Equation 9 and Equation 10)

 

Point 7: Please, remove commas placed at the end of equations 1, 2, 3, 7 and 8.

Response 7: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

We have removed the comma at the end of the equations. (page 6, Equation 1 and Equation 2; page 7, Equation 3; page 9, Equation 7 and Equation 8)

 

Point 8: Please, always leave a space before the open parenthesis or brackets. Check it along with the whole paper.

Response 8: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

We have checked the whole paper.

 

Point 9: Please, always leave a space after a period or comma (see lines: 236, 259-262, 266-268, 283-284, 320).

Response 9: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

We have added spaces after periods and commas.

 

Point 10: The items reported in lines from 303 to 306 should be organized in a bullet list.

Response 10: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

We have organized the processing steps of the dataset into a list. (page 9, lines 342 to 346)

 

Point 11: Lines 303-304: “3000x3000” instead of “3000 3000” and “256x256” instead of “256 256”.

Response 11: We are grateful for the suggestion. We have made the following modifications in the revised manuscript:

We have corrected the notation using '$\times$'. (page 9, lines 343 to 344)

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

I would like to thank the authors for addressing my concerns and questions. I am satisfied with their response.
