Article
Peer-Review Record

Looking for Change? Roll the Dice and Demand Attention

Remote Sens. 2021, 13(18), 3707; https://doi.org/10.3390/rs13183707
by Foivos I. Diakogiannis 1,2,*,†,‡, François Waldner 3,‡ and Peter Caccetta 2,‡
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 28 June 2021 / Revised: 10 August 2021 / Accepted: 9 September 2021 / Published: 16 September 2021
(This article belongs to the Special Issue Image Change Detection Research in Remote Sensing)

Round 1

Reviewer 1 Report

The article presents a new deep learning framework for semantic change detection tasks in very high-resolution aerial images, including a new loss function, a new attention module, new feature extraction building blocks, and a new backbone architecture tailored to the semantic change detection task. Specifically, the authors define a new form of set similarity, the fractal Tanimoto coefficient, which is based on an iterative evaluation of a variant of the Dice coefficient. This coefficient can quantify similarity at a desired level of detail (up to a delta function), regulated by a temperature-like hyper-parameter. The article uses this similarity metric to define a new loss function together with a novel training scheme in which the loss function evolves in step with learning-rate reductions. This helps avoid overfitting and allows for a small increase in performance. The authors also propose a novel spatial and channel attention layer, the fractal Tanimoto Attention Layer (FracTAL), that uses the fractal Tanimoto similarity coefficient to quantify the similarity between query and key entries. This layer is memory efficient and scales well with the size of input features. In addition, two new efficient, self-contained feature extraction convolutional units, namely the CEECNet and FracTAL ResNet units, are introduced, along with a new encoder/decoder architecture, i.e., a network macro-topology tailored for the change detection task. The key insight of the approach is to use relative attention between the two convolutional layers in order to fuse them. Finally, all of the networks presented in this contribution, mantis FracTAL ResNet and mantis CEECNet V1 & V2, outperform competing networks and achieve state-of-the-art results on the LEVIR-CD and WHU building change detection datasets.
The authors also give a detailed description of the experimental analysis and code at the end.
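For readers unfamiliar with the iterated-Dice construction summarised above, the following is a minimal sketch of how such a fractal Tanimoto coefficient can be computed. The exact weighting used here (powers of two in the iterated denominator) is our reading of the paper's construction and should be checked against the manuscript; the function name is illustrative:

```python
import numpy as np

def fractal_tanimoto(p, l, depth=5):
    """Average of the iterated Tanimoto/Dice similarity up to `depth`.

    T_0 is the standard Tanimoto coefficient; each iteration sharpens the
    similarity measure, with `depth` acting as the temperature-like
    hyper-parameter described in the review summary.
    """
    p = np.asarray(p, dtype=float)
    l = np.asarray(l, dtype=float)
    tpl = np.sum(p * l)                   # <p, l>
    tpp = np.sum(p * p) + np.sum(l * l)   # ||p||^2 + ||l||^2
    total = 0.0
    for i in range(depth):
        # i = 0 reduces to the usual Tanimoto: <p,l> / (||p||^2 + ||l||^2 - <p,l>)
        denom = (2.0 ** i) * tpp - (2.0 ** (i + 1) - 1.0) * tpl
        total += tpl / denom
    return total / depth
```

For identical inputs every iterate equals 1, so the average is 1; for partially overlapping masks the deeper iterates shrink toward 0, which is what makes the coefficient progressively stricter.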

However, this paper also has the following problems, which need to be discussed:

 

  1. In Figure 4(b), the meaning of some symbols is confusing. As explained by the authors, this symbol denotes a normalized convolution layer with relative fusion attention. However, in the diagram the symbol appears to represent concatenation of features along the channel dimension, rather than a relative fusion attention followed by a normalized convolution layer, because those two parts are already represented in the figure. The authors are advised to reconsider this.
  2. Similar to the previous question, in Figure 5 the meaning of some symbols is still unclear. If a symbol has the same meaning as the one in Figure 4, how does it differ from the symbol ‘FUSE’? If it has the same meaning as ‘FUSE’, why are there two symbols representing the same meaning in the same figure?
  3. In addition, there may be some typos to be corrected. 

Author Response

Dear Reviewer,

thank you very much for the time taken to review our manuscript and for your suggestions, comments, and corrections. Below we address all comments and suggestions; our responses are in blue.


Reviewer 1:

 

The article presents a new deep learning framework for semantic change detection tasks in very high-resolution aerial images, including a new loss function, a new attention module, new feature extraction building blocks, and a new backbone architecture tailored to the semantic change detection task. Specifically, the authors define a new form of set similarity, the fractal Tanimoto coefficient, which is based on an iterative evaluation of a variant of the Dice coefficient. This coefficient can quantify similarity at a desired level of detail (up to a delta function), regulated by a temperature-like hyper-parameter. The article uses this similarity metric to define a new loss function together with a novel training scheme in which the loss function evolves in step with learning-rate reductions. This helps avoid overfitting and allows for a small increase in performance. The authors also propose a novel spatial and channel attention layer, the fractal Tanimoto Attention Layer (FracTAL), that uses the fractal Tanimoto similarity coefficient to quantify the similarity between query and key entries. This layer is memory efficient and scales well with the size of input features. In addition, two new efficient, self-contained feature extraction convolutional units, namely the CEECNet and FracTAL ResNet units, are introduced, along with a new encoder/decoder architecture, i.e., a network macro-topology tailored for the change detection task. The key insight of the approach is to use relative attention between the two convolutional layers in order to fuse them. Finally, all of the networks presented in this contribution, mantis FracTAL ResNet and mantis CEECNet V1 & V2, outperform competing networks and achieve state-of-the-art results on the LEVIR-CD and WHU building change detection datasets.
The authors also give a detailed description of the experimental analysis and code at the end.

However, this paper also has the following problems, which need to be discussed:

  1. In Figure 4(b), the meaning of some symbols is confusing. As explained by the authors, this symbol denotes a normalized convolution layer with relative fusion attention. However, in the diagram the symbol appears to represent concatenation of features along the channel dimension, rather than a relative fusion attention followed by a normalized convolution layer, because those two parts are already represented in the figure. The authors are advised to reconsider this.

    We thank you for this very useful comment. Although we do not see an explicit symbol in your comments, we assume that you mean the concatenation symbol $\uplus$. Indeed, it is not clear in the caption of Figure 4b to which concatenation we refer (there are three concatenations in total) when we mention the CEECNet V2 version. We modified the caption to indicate that we replace the concatenation followed by a normed convolution for all three operations, and we added a reference to Listing 3, which expresses this operation algorithmically in Python code.


  2. Similar to the previous question, in Figure 5 the meaning of some symbols is still unclear. If a symbol has the same meaning as the one in Figure 4, how does it differ from the symbol ‘FUSE’? If it has the same meaning as ‘FUSE’, why are there two symbols representing the same meaning in the same figure?

    Thank you for your comment, which helps us clarify our work further. Yes, the symbol FUSE is the same as in Listing 3, and the same as the replacement of concatenation with relative-attention fusion in Figure 4. The reason we used different symbols is that in Figure 4b we present the CEECNet V1 module (which uses simple concatenation by default), while Figure 5 shows FUSE because this operation is always used there and is what differentiates the mantis macro-topology from standard Siamese dual-encoder architectures.

  3. In addition, there may be some typos to be corrected.

    Thank you very much for your comments. We carefully reviewed the manuscript for English usage (by a native English speaker) and typos.


Again, thank you very much for your time devoted to review our paper.

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper is sound, well written and organized, and deserves publication in its present form.

Author Response

Dear Reviewer, we thank you very much for the time taken to carefully read our manuscript and for your high evaluation of our work.

Reviewer 3 Report

This paper proposes to use attention for change detection. Overall, the structure of this paper is well organized, and the presentation is relatively clear. However, there are still some crucial problems that need to be carefully addressed before a possible publication. More specifically,

  1. A deeper literature review should be given, particularly regarding advanced and recent remote sensing data processing and analysis. The reviewer therefore suggests supplementing related works, and discussing and citing them in the revised manuscript, e.g., “Graph Convolutional Networks for Hyperspectral Image Classification, IEEE Transactions on Geoscience and Remote Sensing, 2021, 59(7): 5966-5978.” and “Nonconvex-sparsity and nonlocal-smoothness-based blind hyperspectral unmixing. IEEE Transactions on Image Processing, 2019, 28(6), 2991-3006.”
  2. The running time and algorithm complexity should be analyzed.
  3. Parameter sensitivity analysis, e.g., hyper-parameters, should be given to show the effectiveness of the proposed method.
  4. It is well-known that the remote sensing images tend to suffer from various degradation, noise effects, or variabilities in the process of imaging. Please give the discussion and analysis by referring to the paper titled by An Augmented Linear Mixing Model to Address Spectral Variability. The reviewer is wondering what will happen if the proposed method meets the various variabilities.
  5. The motivation should be enhanced to explain why the studied topic is meaningful.

Author Response

Dear Reviewer,

thank you very much for the time taken to review our manuscript and for your suggestions, comments, and corrections. Below we address all comments in detail; our responses are in blue.

Reviewer 3: This paper proposes to use attention for change detection. Overall, the structure of this paper is well organized, and the presentation is relatively clear. However, there are still some crucial problems that need to be carefully addressed before a possible publication.

 

More specifically,

  1. A deeper literature review should be given, particularly regarding advanced and recent remote sensing data processing and analysis. The reviewer therefore suggests supplementing related works, and discussing and citing them in the revised manuscript, e.g., “Graph Convolutional Networks for Hyperspectral Image Classification, IEEE Transactions on Geoscience and Remote Sensing, 2021, 59(7): 5966-5978.” and “Nonconvex-sparsity and nonlocal-smoothness-based blind hyperspectral unmixing. IEEE Transactions on Image Processing, 2019, 28(6), 2991-3006.”

    We thank you very much for your comment and the suggested references. We went carefully through the proposed manuscripts and found no connection to our work. In particular, the first suggestion (GCNs, a specialization of geometric deep learning for hyperspectral images) does not contain significant material on convolutional neural networks (other than as a point of comparison with the proposed graph convolutional network approach), nor does it specialize in semantic segmentation or change detection; in addition, it focuses on hyperspectral images. The second reference is also on hyperspectral images and has nothing to do with convolutional neural networks or deep learning (nowhere in the text is there mention of deep learning or convolutions). Therefore we could not add these references to our manuscript.

    In addition to the review papers on the topic of change detection, with and without deep learning, that we already cited [26,27,28], we added three recent references that relate to attention and memory-footprint reduction: the introduction of linear attention (Katharopoulos et al.), its extension to 2D (Li et al.), and the ViT (Vision Transformer), which partially alleviates this problem. These are all recent state-of-the-art papers relevant to the attention problem we addressed. Given that our manuscript already cites 63 articles on the topic and has a full section on previous work, we feel no additional references need to be added.

  2. The running time and algorithm complexity should be analyzed.

    We thank you very much for your comment. In Appendix D we provide information on the training characteristics and computational resources used for this work, including running time. In addition, we included the model parameters in Table 1. Note also that in Section 2.4 we explain that the attention we defined has linear algorithmic complexity, which we compare with the quadratic complexity of the self-attention introduced in the “Attention Is All You Need” paper.
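    The linear-versus-quadratic contrast referred to here can be sketched generically. The following is an illustration of the feature-map trick from the cited linear-attention paper (Katharopoulos et al.), not the FracTAL layer itself: vanilla attention materialises an N × N score matrix, whereas the linear variant only ever forms d × d and length-N quantities.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Vanilla scaled dot-product attention: O(N^2) time and memory,
    because the full N x N score matrix is materialised."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])               # (N, N)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, eps=1e-6):
    """Linear attention: with a positive feature map phi, attention
    rearranges to phi(Q) @ (phi(K).T @ V), which is O(N) in sequence
    length since no N x N matrix is ever formed."""
    phi = lambda x: np.maximum(x, 0.0) + 1.0             # simple positive map
    KV = phi(K).T @ V                                    # (d, d_v), independent of N
    Z = phi(Q) @ phi(K).sum(axis=0)                      # (N,) normaliser
    return (phi(Q) @ KV) / (Z[:, None] + eps)
```

    The two functions are not numerically identical (the feature map replaces the softmax); the point is only the asymptotic memory behaviour, which is the property discussed in Section 2.4 of the manuscript.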

  3. Parameter sensitivity analysis, e.g., hyper-parameters, should be given to show the effectiveness of the proposed method.

    We thank you very much for your comment. The hyper-parameters we provide (including network parameters as well as training characteristics in Appendix D) achieve state-of-the-art results on change detection on two datasets, as well as excellent performance on semantic segmentation, as shown in our latest work on field-boundary detection (Waldner et al. 2020, “Detect, Consolidate, Delineate: Scalable Mapping of Field Boundaries Using Satellite Images”). We do not claim that the current parametrization will achieve state-of-the-art performance on every dataset; however, it is an excellent starting point. As for any deep-learning algorithm, hyper-parameter optimisation remains important to achieve good results on new data.

  4. It is well-known that the remote sensing images tend to suffer from various degradation, noise effects, or variabilities in the process of imaging. Please give the discussion and analysis by referring to the paper titled by An Augmented Linear Mixing Model to Address Spectral Variability. The reviewer is wondering what will happen if the proposed method meets the various variabilities.

    Thank you very much for your comment. In our work we used various data augmentation techniques to increase the variability of the spectra, as described in the manuscript (e.g., artificial shadows, random brightness, etc.). This improves the performance of the algorithm. In general, convolutional networks handle spectral variability very well because they identify objects (or contiguous areas) of interest in the input data; they do not rely solely on spectral information. This holds, of course, provided that the distribution of the spectra (numerical values) is within the range the network was trained on.
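    As an illustration of the kind of augmentation referred to in this response, a random-brightness jitter can be sketched as follows. This is a hypothetical minimal version, not the exact implementation used in the manuscript:

```python
import numpy as np

def random_brightness(image, max_delta=0.2, rng=None):
    """Shift all bands of an image (values in [0, 1]) by a single random
    offset drawn from [-max_delta, max_delta], then clip back to [0, 1].
    One of many possible brightness-jitter implementations; the details
    here are illustrative only."""
    rng = rng or np.random.default_rng()
    delta = rng.uniform(-max_delta, max_delta)
    return np.clip(image + delta, 0.0, 1.0)
```

    Applied independently to each training chip, such a transform exposes the network to a wider range of radiometric conditions than the raw imagery alone.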

    Our work deviates significantly from the suggested reference, which is on hyperspectral images (with much richer spectral information) and focuses on a completely different problem, namely pixel-level analysis of spectra (mixing/unmixing). We feel that this does not relate to the topic of our manuscript, and we are therefore not able to include it in our present work.

  5. The motivation should be enhanced to explain why the studied topic is meaningful.

    Thank you very much for your suggestion; however, this is a rather surprising comment. As explained in the first paragraph, as well as in the literature review (Section 1.1), change detection is one of the oldest and most common applications of remote sensing. Building on prior work, we propose new methods for this task and show that they achieve state-of-the-art performance.

 

Again thank you very much for your careful reading of our work and your comments.

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

The authors did not address the reviewer's concerns well. The newly added value and contributions of this version are relatively limited, and the experiments are somewhat insufficient.
