Hybrid-TransCD: A Hybrid Transformer Remote Sensing Image Change Detection Network via Token Aggregation
Round 1
Reviewer 1 Report
With interest, I read the manuscript. It is appreciated that the manuscript is easy to follow and not too long. The message is clear and of interest to the community. The authors propose a paper titled "Hybrid-TransCD: A Hybrid Transformer Remote Sensing Image Change Detection Network Via Token Aggregation". The proposed method seems promising in terms of computational simplicity and classification accuracy. I would like to accept the manuscript in its present form.
Author Response
Point 1: With interest, I read the manuscript. It is appreciated that the manuscript is easy to follow and not too long. The message is clear and of interest to the community. The authors propose a paper titled "Hybrid-TransCD: A Hybrid Transformer Remote Sensing Image Change Detection Network Via Token Aggregation". The proposed method seems promising in terms of computational simplicity and classification accuracy. I would like to accept the manuscript in its present form.
Response 1: Thank you very much for your positive assessment of our work. We will continue to revise the manuscript carefully.
Reviewer 2 Report
Dear authors,
I find your work interesting and usable. The introduction of your hybrid transformer seems to alleviate problems exhibited by different transformer and attention networks for change detection. However, to make your manuscript publishable, several key changes must be addressed. I will list them here:
- The text needs better English. The sentences should be shorter and contain at most one inserted clause each; this makes them less complex and easier to understand. Also, try not to use Google Translate.
- Break the introduction into at least two, or maybe even three, sections. Keep the problem statement separate from the related work! Since you hint at the solution in the introduction, consider adding a third section with the proposal.
- Keep the constants out of the Methods section. At the beginning of the 5th page you state that the number of output channels is the same as the hidden dimension, namely 768. What is particular about this number? Is it tied to a specific image width or height? The number of patches? The patch size or length?
- Again, judging by the results reported in Tables 1 and 2, your proposal seems better than the compared networks. However, I am missing a time-efficiency comparison of your proposed network and the others listed, something similar to what you already report in Tables 3, 4 and 5. You should report the number of parameters and the number of floating-point operations for the networks in Tables 1 and 2, either in a new table or in Table 1.
With the above-mentioned issues solved, the manuscript could be published.
Author Response
Point 1: The text needs better English. The sentences should be shorter and contain at most one inserted clause each; this makes them less complex and easier to understand. Also, try not to use Google Translate.
Response 1: Thank you for your suggestion. We have shortened our sentences where possible. For some expressions involving technical terms, longer sentences were unavoidable; we hope you understand.
Point 2: Break the introduction into at least two, or maybe even three, sections. Keep the problem statement separate from the related work! Since you hint at the solution in the introduction, consider adding a third section with the proposal.
Response 2: Thank you for your reminder. As requested, we have split the introduction into a proper introduction and a separate related-work section, and we have also added further related work. We hope our revisions are satisfactory.
Point 3: Keep the constants out of the Methods section. At the beginning of the 5th page you state that the number of output channels is the same as the hidden dimension, namely 768. What is particular about this number? Is it tied to a specific image width or height? The number of patches? The patch size or length?
Response 3: Thank you for your suggestion; we have removed this constant from the Methods section as requested. In fact, most Transformer-based works treat this value as a default hyperparameter. It is the embedding dimension into which the image patch sequence is initially projected; since this projection is usually implemented as a convolution, it corresponds to the number of output channels of that convolution. It is not a particular value and is not tied to the image size or the number of patches. A larger value yields a richer representation of the patch sequence but increases the computational cost, so 768 is commonly used as a balanced default in vision Transformer models.
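To illustrate the point, here is a minimal sketch of the usual patch-embedding convolution in a vision Transformer (a generic example, not our exact implementation): the hidden dimension is simply the output-channel count of the projection, independent of the image size.
```python
import torch
import torch.nn as nn

# Sketch of patch embedding: hidden_dim (768) is the number of output
# channels of the projection convolution; patch size and image size are
# independent choices.
class PatchEmbed(nn.Module):
    def __init__(self, in_chans=3, hidden_dim=768, patch_size=16):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, hidden_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.proj(x)                         # (B, 768, H/16, W/16)
        return x.flatten(2).transpose(1, 2)      # (B, num_patches, 768)

tokens = PatchEmbed()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 256, 768])
```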
Point 4: Again, judging by the results reported in Tables 1 and 2, your proposal seems better than the compared networks. However, I am missing a time-efficiency comparison of your proposed network and the others listed, something similar to what you already report in Tables 3, 4 and 5. You should report the number of parameters and the number of floating-point operations for the networks in Tables 1 and 2, either in a new table or in Table 1.
Response 4: Thank you for your suggestion. As requested, we have added to the revised manuscript a comparison of the computational efficiency of the different algorithms; it is given in Table 3 of the submitted version.
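For reference, the parameter and FLOP counts in such tables are typically obtained as follows (a sketch assuming a PyTorch model; thop is one common profiling library, and a bi-temporal change detection network would receive two inputs):
```python
import torch
from thop import profile  # pip install thop

def complexity(model, input_shape=(1, 3, 256, 256)):
    """Report parameter count and multiply-accumulate operations (MACs)."""
    params = sum(p.numel() for p in model.parameters())
    dummy = torch.randn(*input_shape)
    # For a bi-temporal network, pass inputs=(dummy, dummy) instead.
    macs, _ = profile(model, inputs=(dummy,), verbose=False)
    # FLOPs are conventionally reported as roughly 2 * MACs.
    print(f"Params: {params / 1e6:.2f} M, MACs: {macs / 1e9:.2f} G")
```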
Reviewer 3 Report
1- The Abstract needs improvement. There are no significant results in the abstract to show the work's importance (state the key statistics from the findings to show the promising capabilities of the proposed model).
2- I could not find any problem statement in the introduction section.
3- In the introduction, the authors should clearly include the following: problem, motivation, scope, objective.
4- The problem statement should also address why it is important and why it needs to be solved.
5- Page 5, line 174 ("Fig 4" should be "Figure 4").
6- The introduction merely lists previous similar work; no critical analysis is given to support the research motivation or the limitations of current work. How is the related work relevant to the proposed model?
7- The Materials and Methods section needs further improvement; there should be some introductory text before the subsections begin.
8- What is the basis of the proposed model? Why were transformers used?
9- Which technique was used for the visualizations? Can you please explain?
10- Kindly provide a comparison table with previous work.
Author Response
Thank you for taking the time to review our manuscript. The suggestions you put forward are instructive for us.
Point 1: The Abstract needs improvement. There are no significant results in the abstract to show the work's importance (state the key statistics from the findings to show the promising capabilities of the proposed model).
Response 1: Thank you for your suggestion. Following your instructions, we have supplemented the abstract with statistics from the results. We sincerely hope you find it satisfactory.
Point 2: I could not find any problem statement in the introduction section.
Response 2: Thank you for your question. Due to an oversight, we did not clearly state the definition of change detection in the first paragraph of the introduction, so we have supplemented that paragraph. In addition, in the second paragraph we have added content pointing out the shortcomings that current methods face, which motivates our research on this problem.
Point 3: In the introduction, the authors should clearly include the following: problem, motivation, scope, objective.
Response 3: Thank you for your suggestion. We apologize that the introduction was poorly organized, so that the problem, objective, etc. were hard to find. We have re-divided the introduction and moved some of its content into a related-work section. The introduction mentions the source and goal of the problem in several places, as well as the improvements and advantages of our method over other algorithms; we have now revised it throughout and hope you are satisfied.
Point 4: The problem statement should also address why it is important and why it needs to be solved.
Response 4: Thank you for your question. We apologize that the logic of the introduction was confusing, so we have supplemented and revised it. Regarding the change detection task itself, we already note that current deep learning methods either use plain CNNs to extract features or add attention modules to capture global context, but these methods are not well suited to high-resolution remote sensing images: as mentioned in the introduction, some objects in high-resolution images exhibit different appearances at different times and locations. We also repeatedly emphasize the computational inefficiency of other self-attention methods, which is why we adopt the Transformer model. Since the traditional Transformer structure cannot capture representations of objects at different granularities, we introduce an improved hybrid Transformer.
Point 5: Page 5, line 174 ("Fig 4" should be "Figure 4").
Response 5: Thank you for the correction. We have fixed the error; after reorganizing the figures, it is now Figure 8.
Point 6: The introduction merely lists previous similar work; no critical analysis is given to support the research motivation or the limitations of current work. How is the related work relevant to the proposed model?
Response 6: Thank you for raising this concern. As requested, in the second paragraph of the introduction we now elaborate on the critical shortcomings of the different types of change detection methods, and in the third paragraph we give the shortcoming of the traditional vision Transformer structure: it can only capture single-scale objects and cannot obtain multiple receptive fields within a single extraction layer, which would avoid mixing object and background information. Although traditional self-attention methods can achieve good results, their key disadvantage is that the self-attention operation consumes a large amount of memory, which is why we turned to the Transformer model for remote sensing change detection. To a certain extent, our method is among the first to use this technique for the change detection task, so there are few existing works for comparison. We hope you are satisfied.
Point 7: The Materials and Methods section needs further improvement; there should be some introductory text before the subsections begin.
Response 7: Thank you for your suggestion. We have added introductory content to the Materials and Methods section, including some preliminaries and related background; we hope you are satisfied.
Point 8: What is the basis of the proposed model? Why were transformers used?
Response 8: Thank you for your question. In remote sensing image change detection, traditional methods and classical machine learning methods involve complex processing pipelines and deliver poor performance, so our research focuses on improving deep network-based models. Since vision Transformers only became widely known in 2021, little work so far has proposed a Transformer-based change detection algorithm. BiT and TransCD, mentioned in the related work, are the only such methods at present, but they simply adapt the network to the bi-temporal image task. Although a global receptive field can be obtained in a single Transformer layer, they can only handle change objects at a single scale, so our method mainly improves the intra-layer structure on their basis, so that each layer covers receptive fields at multiple granularities.
Point 9: Which technique was used for the visualizations? Can you please explain?
Response 9: We used feature heatmap visualization, i.e., the response intensity at different positions in the image: the most salient pixels/regions receive higher activation values and thus darker colors. The attention map is a representation of relations within the image and shows the areas on which the model focuses. Code for this kind of visualization is widely available online.
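For clarity, a minimal sketch of the common heatmap procedure (our illustrative code, not the exact script used for the figures): average the feature map over channels, normalize, and upsample to the image resolution.
```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def feature_heatmap(feat, image_hw):
    """Collapse a feature map (C, H, W) into a heatmap over the image."""
    heat = feat.mean(dim=0, keepdim=True)            # (1, H, W)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    heat = F.interpolate(heat[None], size=image_hw,
                         mode="bilinear", align_corners=False)[0, 0]
    return heat

feat = torch.randn(768, 16, 16)                      # e.g. an encoder output
plt.imshow(feature_heatmap(feat, (256, 256)).numpy(), cmap="jet")
plt.axis("off")
plt.show()
```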
Point 10: Kindly provide a comparison table with previous work.
Response 10: Thank you for your suggestion. We already compare against previous work in the experimental section (Tables 1, 2 and 3), with visual comparison results in Figures 13 and 14. The corresponding results are analyzed as well, and the compared methods are the current state-of-the-art change detection algorithms.
Reviewer 4 Report
The article describes a process to detect changes between images based on deep learning algorithms. Below are some comments, mainly about the presentation of the state of the art, the methods and the experiments.
---- abstract : well-written and clear
-DDT layer = HDTD layer ?
---- introduction: the introduction could be more structured, and the mentioned methods more clearly and explicitly presented. Identify in which applications the gaps of the existing methods matter and therefore need to be filled.
-« surface cover » = land cover?
-« land resource management » = water, mining?
-« salient regions are ignored »: to clarify
-« binary change detection » = so the work is based on the pixel-level change approach? Relation with the previous sentence?
-« dense No-local operations » = upper case?
-« Visual Transformer inherited from natural language translation »: what are the similarities here between natural language processing and remote sensing?
-ViT [11] = to make explicit at first mention?
-not sure I understand « multihead attention » (to explain at first mention) + « creatively »?
-line 68 « our hybrid transformer »: also introduced later at line 83
-Introduction of the hybrid method at line 83, in the middle of the state of the art? Gather the presentation of your hybrid model.
-Baseline: relate « hybrid vision » and the « two manners named Late-Diff (LD) and Early-Diff (ED) »
-Figure 1: how are the changes of colorimetry and season managed?
-line 118: is « Our contributions » related to the « various scale representations »?
-contribution 3: « two manners for representing »: to make explicit here, even if repeated
-contribution 5: « abundant experiments »: experiments to assess your model and to determine the added value over existing methods.
----Materials and Methods
-Cf. Figure 2 in the introduction?
-« contain abundant semantic change information »: more details about this information
-not sure I understand the conclusion of section 2.1: why the choice of « Siamese H-TE »? A mix with the state of the art?
-Application to satellite and aerial images? To LIDAR scenes? To photographs?
-« multi-scale key-value pairs »: here and before, « multi-scale » over which range? Sub-meter to several meters? Resolution or size of the geographical objects?
-« channel dimension » = RGB and others?
-« By controlling the value of r »: r to be named
----Experiments
-For the dataset LEVIR-CD: only buildings to detect (would roads be possible to detect as well?)
-semantic features: what is the typology of the types of features extracted?
-Effects of rare classes?
-Not sure I understand the shape of the (c) and (f) attention maps in Figure 15.
Author Response
Thank you very much for the time you spent reviewing the manuscript and for your very encouraging comments on its merits; they provide an important direction for revising our paper.
Point 1: DDT layer = HDTD layer ?
Response 1: Thank you for your correction; we apologize for the omission. The error has been corrected in the revised manuscript.
Point 2: Introduction: the introduction could be more structured, and the mentioned methods more clearly and explicitly presented. Identify in which applications the gaps of the existing methods matter and therefore need to be filled.
Response 2: Thank you for your suggestion. The introduction was indeed insufficiently organized, so we have re-divided it into two parts, introduction and related work, and supplemented each with the relevant statements. The introduction now clearly describes the problem definition and the gaps in current methods; we hope you are satisfied.
Points 3-5: -« surface cover » = land cover?
-« land resource management » = water, mining?
-« salient regions are ignored »: to clarify
Response 3-5: We have revised these in the corresponding sections.
Point 6: « binary change detection » = so the work is based on the pixel-level change approach? Relation with the previous sentence?
Response 6: Thank you for the reminder; we have rephrased this sentence. In fact, "binary change detection" names the goal of the task, namely pixel-level change/no-change classification. The network we built, however, operates at the feature level: bi-temporal features, or fused difference features, are extracted with the CNN and Transformer, and the pixel-level map is produced at the end. We hope this is clear.
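To make the relation concrete, here is a minimal sketch (generic naming, not the paper's exact head): feature-level processing still ends in a pixel-level binary map.
```python
import torch
import torch.nn as nn

# A small classification head turns fused difference features into a
# pixel-level change / no-change map.
head = nn.Conv2d(64, 2, kernel_size=1)      # 2 classes: no-change, change

diff_feat = torch.randn(1, 64, 256, 256)    # fused difference features
logits = head(diff_feat)                    # (1, 2, 256, 256)
change_map = logits.argmax(dim=1)           # (1, 256, 256), values in {0, 1}
```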
Point 7: « dense No-local operations » = upper case?
Response 7: We have revised this; it should be lower case.
Point 8: « Visual Transformer inherited from natural language translation »: what are the similarities here between TAL and remote sensing?
Response 8:Thanks for your reminder. Transformer was indeed used to deal with natural language processing tasks at first, but since the composition of natural language is words, in order to process image sequences, the visual Transformer divides the image into different patches to model the global context of the entire image, while remote sensing image change detection can be regarded as the branch of image understanding, can also solve tasks with improved visual transformer.
Point 9: ViT [11] = to make explicit at first mention?
Response 9: Thank you for the reminder. We have revised this.
Point 10: not sure I understand « multihead attention » (to explain at first mention) + « creatively »?
Response 10: Thank you for your suggestion; we now explain multi-head attention at the start of the Hybrid-Transformer Encoder section. In fact, the multi-head attention modules of current Transformer models follow the ViT formulation by default.
Point 11: line 68 « our hybrid transformer »: also introduced later at line 83
Response 11: Thank you for your careful observation. « Our hybrid transformer » at line 68 refers to the improved multi-head attention within our proposed transformer structure, while the mention at line 83 re-emphasizes the entire hybrid Transformer structure, so there should be no conflict between them; we hope this is clear.
Point 12: Introduction of the hybrid method at line 83, in the middle of the state of the art? Gather the presentation of your hybrid model.
Response 12: Thank you for your opinion. In the introduction (line 83), we point out the effectiveness of the proposed method by analyzing the shortcomings of other vision Transformer models. The mentioned PVT and Swin Transformer were designed for other visual recognition tasks and have not been adopted in concurrent change detection work, so we propose an improved structure that addresses their inherent shortcomings and apply it to our change detection task; we hope you are satisfied.
Point 13: Baseline: relate « hybrid vision » and the « two manners named Late-Diff (LD) and Early-Diff (ED) »
Response 13: Thank you for your question. « Hybrid vision » means that the Transformer structure we designed is a hybrid compared with the traditional ViT. This hybrid structure is used in both the encoder and the decoder: the ViT encoder is replaced by a hybrid version, and a decoder (which does not exist in ViT) is added, also in hybrid form. Since both encoding and decoding ultimately serve to obtain difference feature maps for change detection, we propose two decoder structures, « Late-Diff (LD) and Early-Diff (ED) », so that the difference feature map is obtained directly through the Transformer module; these two structures also operate in hybrid form, as sketched below. We hope our explanation is clear.
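An illustrative sketch of the two manners (hypothetical decode callables standing in for the hybrid Transformer decoder; the distinction is only where the absolute difference is taken):
```python
import torch

def early_diff(decode, tok1, tok2):
    """ED: take the absolute difference of the encoded tokens, then decode."""
    return decode(torch.abs(tok1 - tok2))

def late_diff(decode, tok1, tok2):
    """LD: decode each token stream first, then take the difference."""
    return torch.abs(decode(tok1) - decode(tok2))
```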
Point 14: Figure 1: how are the changes of colorimetry and season managed?
Response 14: Thank you for your question. Figure 1 only illustrates the pseudo-changes of the same object in high-resolution remote sensing images caused by differences in appearance at different times; the irrelevant factors include the satellite viewing angle and the season. Such pseudo-changes are difficult for many CNN-based deep models to distinguish, and our method mitigates these false alarms.
Point 15: line 118: is « Our contributions » related to the « various scale representations »?
Response 15: Thank you for your careful observation. « Our contributions » summarizes all of our work and is not specifically tied to « various scale representations »; we have given the former a separate paragraph.
Point 16: contribution 3: « two manners for representing »: to make explicit here, even if repeated
Response 16: Thank you for the reminder; we have added the detailed improvements to contribution 3 and hope you are satisfied.
Point 17: contribution 5: « abundant experiments »: experiments to assess your model and to determine the added value over existing methods.
Response 17: Thank you for your suggestion. We have added quantitative improvement figures where appropriate; in fact, the F1 score is the metric that all the algorithms report.
Point 18: Cf. Figure 2 in the introduction?
Response 18: Thank you for your careful observation; we have reorganized the figures of the article.
Point 19: « contain abundant semantic change information »: more details about this information
Response 19: Thank you for your suggestion. We have added the following at the corresponding position: "the absolute difference is either first taken from the encoded token pair and then decoded (Early Difference), or the token pair is decoded first and the difference taken afterwards (Late Difference)".
Point 20: not sure I understand the conclusion of section 2.1: why the choice of « Siamese H-TE »? A mix with the state of the art?
Response 20: Thank you for your question. We omitted the implicit default that both the hybrid Transformer encoder and decoder are built in a Siamese structure, i.e., during training the bi-temporal features are learned with shared weights. This is the default in essentially all change detection models.
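A minimal sketch of this Siamese convention (a toy encoder standing in for the hybrid Transformer encoder): the same module instance processes both temporal images, so the weights are shared by construction.
```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())

img_t1 = torch.randn(1, 3, 256, 256)
img_t2 = torch.randn(1, 3, 256, 256)
feat_t1 = encoder(img_t1)   # both calls use the same parameters,
feat_t2 = encoder(img_t2)   # so gradients update one shared weight set
```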
Point 21: Application to satellite and aerial images? To LIDAR scenes? To photographs?
Response 21: Our method is applied to optical remote sensing imagery, especially high-resolution imagery.
Point 22: « multi-scale key-value pairs »: here and before, « multi-scale » over which range? Sub-meter to several meters? Resolution or size of the geographical objects?
Response 22: Thank you for your question. In the revised version, «multi-scale key-value pairs» has been changed to «multiple key-value pairs», i.e., several key-value pairs of the same size whose correlations are computed in parallel; the remaining uses of «multi-scale» refer to the feature-fusion scheme. Since the features extracted by the CNN backbone span different scales, and we generate key-value pairs at different scales through different values of the downsampling rate r, the above «multi-scale» can also be understood as a kind of multi-scale fusion.
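An illustrative sketch of how a downsampling rate r yields keys and values at a coarser scale (our generic construction in the spirit of spatial-reduction attention, not the paper's exact module):
```python
import torch
import torch.nn as nn

class ReducedKV(nn.Module):
    def __init__(self, dim=768, r=4):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=r, stride=r)  # shrink HxW by r
        self.kv = nn.Linear(dim, 2 * dim)

    def forward(self, x):                      # x: (B, dim, H, W)
        x = self.sr(x)                         # (B, dim, H/r, W/r)
        x = x.flatten(2).transpose(1, 2)       # (B, HW/r^2, dim)
        k, v = self.kv(x).chunk(2, dim=-1)     # shorter key/value sequences
        return k, v
```
Varying r across heads or branches gives key-value pairs at several scales within a single attention layer.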
Point 23: « channel dimension » = RGB and others?
Response 23: Here « channel dimension » is a model hyperparameter, not the RGB channels. Most Transformer-based works treat it as a default parameter: it is the embedding dimension into which the image patch sequence is initially projected, and since the projection is usually implemented as a convolution, it equals the convolution's number of output channels. It is not a particular value; a larger value yields richer features for the patch sequence but a higher computational cost, so 768 is commonly used as a balanced default in vision Transformer models.
Point 24: « By controlling the value of r »: r to be named
Response 24: Thank you for your suggestion. We have named it in the revised version, i.e., « By controlling the downsampling rate r ».
Point 25: For the dataset LEVIR-CD: only buildings to detect (would roads be possible to detect as well?)
Response 25: Thank you for your careful observation. The experimental results do show that both buildings and roads can be detected accurately, but since the dataset itself focuses on building changes, other objects, although detectable, are in practice less important than buildings. We hope you understand.
Point 26: semantic features: what is the typology of the types of features extracted?
Response 26: In the different layers of a deep neural network, the extracted features carry different levels of semantic information: high-level features contain abstract semantics, while low-level features contain detailed texture. By fusing features from different levels, both low- and high-level representations are preserved, which contributes to the final change prediction.
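A minimal sketch of such multi-level fusion (a common scheme, not necessarily our exact one): upsample the abstract high-level features to the resolution of the detailed low-level ones and concatenate along channels.
```python
import torch
import torch.nn.functional as F

low = torch.randn(1, 64, 64, 64)     # early layer: texture detail
high = torch.randn(1, 256, 16, 16)   # deep layer: abstract semantics

high_up = F.interpolate(high, size=low.shape[-2:],
                        mode="bilinear", align_corners=False)
fused = torch.cat([low, high_up], dim=1)   # (1, 320, 64, 64)
```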
Point 27: Effects of rare classes?
Response 27: Thank you for your question. In the change detection task, more attention is paid to detecting positive (changed) samples; the treatment of negative samples is implicit in metrics such as mIoU and the F1 score. Most current algorithms concentrate on reducing missed detections, because even simple deep models produce very few false alarms.
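For reference, these metrics are computed on the positive (change) class as follows (a standard computation, shown as a sketch):
```python
import torch

def f1_and_iou(pred, target):
    """F1 and IoU of the change class; pred and target are 0/1 tensors."""
    tp = ((pred == 1) & (target == 1)).sum().float()
    fp = ((pred == 1) & (target == 0)).sum().float()
    fn = ((pred == 0) & (target == 1)).sum().float()
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    iou = tp / (tp + fp + fn + 1e-8)
    return f1.item(), iou.item()
```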
Point 28: Not sure I understand the shape of the (c) and (f) attention maps in Figure 15.
Response 28: Thank you for raising this. These are the multi-head attention maps produced by the multi-head attention layer, computed from the global context of the features; strips of different colors represent the information of different tokens.
Round 2
Reviewer 2 Report
Dear authors,
thank you for your responses. I agree with publication now.
Reviewer 3 Report
The authors addressed all the given comments