Attention-Guided Siamese Fusion Network for Change Detection of Remote Sensing Images

Change detection for remote sensing images is an indispensable procedure for many remote sensing applications, such as geological disaster assessment, environmental monitoring, and urban development monitoring. Through this technique, the difference in certain areas after an emergency can be determined to estimate its influence. Additionally, by analyzing sequential difference maps, the change tendency can be found to help predict future changes, such as urban development and environmental pollution. The complex variety of changes and the interfering changes caused by imaging conditions, such as season, weather and sensor differences, are critical factors that affect the effectiveness of change detection methods. Recently, there have been many research achievements surrounding this topic, but a perfect solution to all the problems in change detection has not yet been achieved. In this paper, we mainly focus on reducing the influence of imaging conditions through deep neural networks with limited labeled samples. The attention-guided Siamese fusion network is constructed based on one basic Siamese network for change detection. In contrast to common processing, besides high-level feature fusion, feature fusion is performed throughout the whole feature extraction process by using an attention information fusion module. This module not only realizes the information fusion of the two feature extraction network branches, but also guides the feature learning network to focus on feature channels with high importance. Finally, extensive experiments were performed on three public datasets, which verify the significance of information fusion and the guidance of the attention mechanism during feature learning in comparison with related methods.


Introduction
The development of remote sensing techniques extends humans' ability to sense their living environment without the limitations of space and time. Remote sensing sensors can be installed on satellites or airplanes to obtain multi-scale observation data of the Earth's surface to satisfy the different requirements of various applications. Identifying changes in our living environment using remote sensing data is a necessary task. With the rise of artificial intelligence, automatic change detection in remote sensing images is becoming the most common technology, surpassing manual interpretation. Monitoring the development of urban areas is currently the biggest application area of change detection, by which unapproved construction projects, changes in land use and development trends of urban areas can be obtained. In [1], the authors propose the use of a classification method combined with spectral indices of some common landcovers to identify trends in urban growth. In [2], the authors propose the use of a landcover classification method. Additionally, dictionary learning and sparse representation are also applied to construct the final feature space [40].
Recently, the neural network has become the most important research topic in the information technology domain and has achieved remarkable effectiveness in machine vision [41][42][43]. In fact, neural network-based change detection methods also have proliferated in recent years [44][45][46][47], and the growth trend is obvious [8]. Unsupervised neural network-based change detection methods aim to reduce the distance between two temporal images' feature spaces by using a neural network to learn new feature spaces [44][45][46][47][48][49][50].
Considering the power of label information for change detection, many unsupervised methods introduce some pseudo-label information obtained by pre-classification methods into the feature learning network, to improve the discriminant ability of the new feature space [45,46,49,50].
Although unsupervised methods avoid the data annotation problem, the obtained change detection results are difficult to use in real applications, since too many irrelevant change types are detected.
In contrast to unsupervised methods, supervised neural network-based methods utilize label information to improve the final classification, which focuses only on distinguishing specific change types within a whole scene. For this kind of method, change detection is mainly treated as a pixel classification problem or a semantic segmentation problem. In [51], a new GAN architecture based on W-Net is constructed to generate a change detection map directly from an input image pair. The combination of the spatial characteristics of convolutional networks and the temporal characteristics of recurrent networks can improve the effectiveness of high-level features for change detection [52][53][54]. The Siamese network structure is usually considered the basic network structure to extract high-level features of temporal images [54][55][56]. In addition, the attention mechanism [57] has also been introduced into neural networks to improve their performance. In [58], the authors combined a convolutional neural network with the CBAM attention mechanism to improve the performance of feature learning for SAR image change detection. In [59], based on the Unet++ framework, an up-sampling attention unit replaced the original up-sampling unit to enhance the guidance of attention in both the spatial and channel spaces. However, these methods simply add an attention block into the network in the common manner and do not deeply integrate the attention mechanism with the change detection task itself.
In this paper, the attention-guided Siamese fusion network is proposed for the change detection of remote sensing images. Although the Siamese network structure can extract the high-level features of the two temporal images separately, which preserves the feature integrity of each image, it ignores the importance of information interaction between the two feature flows during the feature learning process. However, as verified by our experiments, this information interaction between feature flows can improve the final performance. Therefore, we dexterously combine an attention mechanism with our basic Siamese network, which not only places more focus on important features, but also realizes the information interaction of the two feature flows. The innovations of our work can be summarized as follows.
(1) The attention information fusion module (AIFM) is proposed. In contrast to common operations, which directly insert an attention block into a neural network to guide the feature learning process, the AIFM utilizes an attention block as an information bridge between two feature learning networks to realize their information fusion. (2) Based on the AIFM structure and ResNet32, the attention-guided Siamese fusion network is constructed. The integration of multiple AIFMs extends the influence of the AIFM over the whole feature learning process and fuses the information of the two feature learning network branches comprehensively and thoroughly.

The remaining contents are organized as follows. The second section provides a detailed theoretical introduction of the proposed method. The third section presents the experiments designed to verify the availability and superiority of the proposed method. The final section concludes this work.

The Proposed Method
In the proposed change detection framework, two temporal images of a certain area, denoted as I_1 and I_2 with size H × W, are compared to detect the changes between them. To train the fusion network, some pixels are labeled as changed or unchanged to construct the training sample set D_train = {(x_1, y_1, l_1), (x_2, y_2, l_2), ..., (x_n, y_n, l_n)}, where x_i and y_i represent the row and column coordinates of the i-th training sample, respectively, and l_i ∈ {0, 1} is the label; 0 denotes an unchanged pixel, and 1 denotes a changed pixel. The rest of the pixels of the image pair are treated as the testing set for the fusion network. After the training and testing processes, the probability of each pixel being a changed pixel is obtained, which can also be viewed as a difference map. The final change detection result can then be achieved by splitting the difference map into a changed pixel set and an unchanged pixel set with a simple method such as threshold segmentation.
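As a concrete illustration, the construction of D_train can be sketched in NumPy as follows; the function name, random seed and toy ground-truth mask are illustrative assumptions, not details from the paper.

```python
import numpy as np

def build_training_set(label_mask, n_train, seed=0):
    """Randomly sample pixel coordinates and their labels into
    D_train = {(x_1, y_1, l_1), ..., (x_n, y_n, l_n)},
    where l_i = 1 marks a changed pixel and l_i = 0 an unchanged one."""
    rng = np.random.default_rng(seed)
    H, W = label_mask.shape
    flat = rng.choice(H * W, size=n_train, replace=False)  # distinct pixels
    xs, ys = np.unravel_index(flat, (H, W))
    return [(int(x), int(y), int(label_mask[x, y])) for x, y in zip(xs, ys)]

# Toy 8 x 8 ground truth with one small changed region.
mask = np.zeros((8, 8), dtype=int)
mask[2:4, 2:4] = 1
d_train = build_training_set(mask, n_train=10)
```

All pixels not selected into `d_train` would form the testing set in this scheme.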
Many abbreviations are used in this section to make the description more concise. Therefore, an abbreviations list is shown in Table 1 with their full descriptions.

Basic Siamese Fusion Network
ResNet is a classical network for classification, which has demonstrated strong performance in many visual applications [60]. The most representative ResNet networks are ResNet50 and ResNet101, which involve a large number of parameters. Compared to natural scenes, remote sensing scenes are simpler, and the labeled samples are limited. Therefore, in our research work, one simple ResNet structure (ResNet32) was selected as the basic network framework. The specific structure of ResNet32 is shown in Table 2. When the change detection problem is treated as a pixel classification problem, the simplest scheme is to treat the concatenation of the patch pair around the center pixel in the temporal images as the input, and the probability of being a changed or an unchanged pixel as the output of the network. Here, ResNet32 was used as the network for feature learning and classification, of which a simple schematic diagram is shown in Figure 1. This change detection method is denoted as ResNet-CD.
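The data-level fusion used by ResNet-CD amounts to stacking the two patches along the channel axis before any feature extraction. A minimal NumPy sketch, in which the patch size and channel count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two k x k x C patches centred on the same pixel of I_1 and I_2
# (k = 22 and C = 1 are illustrative values, not fixed by this step).
k, C = 22, 1
patch_1 = rng.random((k, k, C))
patch_2 = rng.random((k, k, C))

# Data-level fusion: concatenate along the channel axis, giving a
# single k x k x 2C tensor as the input of ResNet-CD.
fused_input = np.concatenate([patch_1, patch_2], axis=-1)
```

The network then sees only this fused tensor, so the intrinsic difference between the two images is not modeled explicitly.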
In ResNet-CD, the information of the two temporal images is fused before feature extraction, which is the common operation of data-level fusion. With this fusion, the subsequent feature learning procedure is undiscriminating: the intrinsic difference between the two images is not considered. Information fusion can be divided into three levels, namely, data-level fusion, feature-level fusion and decision-level fusion [61]. For change detection, decision-level fusion is not possible, as there is only one output for the two input images. Feature-level fusion is therefore the remaining alternative, in which feature extraction is conducted for the two input images separately and their high-level features are then fused.
The Siamese network structure is selected to satisfy the requirement of feature-level fusion in change detection. The two temporal images complete feature extraction through different network branches. After the high-level features are obtained, they are concatenated and imported into a fully connected layer, which maps the features to a two-dimensional output indicating the probability of being changed or unchanged. The final change detection map is obtained by segmenting the changed-probability map. The basic Siamese fusion network for change detection is also built on ResNet32 and is denoted as Siam-ResNet-CD. A structure schematic diagram of Siam-ResNet-CD is shown in Figure 2. The structures of the two network branches are identical. The parameters of one branch are updated during back-propagation, and the other branch simply copies the parameters from that branch without any computation; in this situation, the network is symmetrical. There is another situation, called the Pseudo-Siamese network, in which the structure of the network is symmetrical but the parameter values are not shared, and both branches are updated during back-propagation. In our experiments, we verified that the Siamese network achieved performance close to that of the Pseudo-Siamese network at a lower computation cost. Therefore, in the proposed attention-guided Siamese fusion network, the real Siamese network is chosen as the basic network framework.
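The feature-level fusion of Siam-ResNet-CD can be sketched as follows. The single linear-plus-ReLU map stands in for a full ResNet32 branch, and all shapes and weights are illustrative stand-ins; the two points the sketch captures are that both branches share the same weights and that fusion happens only after feature extraction.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(patch, W):
    # Stand-in for one ResNet32 branch: a linear map followed by ReLU.
    return np.maximum(patch.ravel() @ W, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

k, C, d = 8, 1, 16
W_shared = rng.normal(size=(k * k * C, d))   # one weight set, shared
W_fc = rng.normal(size=(2 * d, 2))           # fully connected head

patch_1, patch_2 = rng.random((k, k, C)), rng.random((k, k, C))
f1 = extract_features(patch_1, W_shared)     # branch 1
f2 = extract_features(patch_2, W_shared)     # branch 2 (copied weights)

# Feature-level fusion: concatenate high-level features, then map to
# the two-dimensional output (P(changed), P(unchanged)).
probs = softmax(np.concatenate([f1, f2]) @ W_fc)
```

A Pseudo-Siamese variant would simply use two independent weight matrices in place of `W_shared`.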

The Attention-Guided Siamese Fusion Network (Atten-SiamNet)
In Siam-ResNet-CD, although the high-level features of two image patches from different temporal images are extracted to form the Siamese network, the two branches are totally isolated, without any information interaction between them. In other words, the intrinsic differences of the two temporal images are considered, but the correlation between them is not considered in Siam-ResNet-CD. The temporal images are different observations of an identical region, so there must be a large amount of shared relevant information. Therefore, this correlation should not be ignored during feature learning.
Here, the attention mechanism is chosen to realize the interaction of the two feature learning branches based on their correlation, conducting information fusion throughout the whole feature learning process rather than only in the final step. The attention mechanism has been widely used in deep learning to improve final performance [59]. Through parameter learning, the attention block can adaptively compute the importance of feature channels and guide the learning process to pay more attention to channels or positions with high importance scores. To realize the interaction, an attention block is inserted into Siam-ResNet-CD in a novel way to improve the final performance. The attention information fusion module is the key procedure, in which the feature flows influence each other and more important features receive more attention during learning.

Attention Information Fusion Module (AIFM)
The attention information fusion module is constructed as an information bridge between the two feature learning branches. The features obtained by the two branches are inputted into one attention network. The computation of the channel importance scores then itself realizes the information fusion of the two branches, as the fully connected layers adaptively map the combined features into a new space. Therefore, the final importance scores are not obtained from the two feature branches in isolation; they are the information fusion results of the two feature flows. This is why we denote it the attention information fusion module.
In this module, there are two inputs and two outputs, where the inputs are the feature blocks from the preceding feature extraction layer of the two branches and the outputs are the inputs of the next feature extraction layer of the two branches. The structure of the attention fusion block is related to the SE attention mechanism [62], which is well known and widely applied in computer vision. In Figure 3, the schematic diagram of the attention information fusion module is shown. This module is constructed on the basic residual module: the two inputs obtain new feature maps through the basic module independently, and these new feature maps are imported into the attention block. Concatenation first fuses the feature maps together, and two fully connected layers combine the feature maps of the two branches. Finally, through the Sigmoid activation function, a vector in the range from 0 to 1 is obtained, in which each element measures the importance of one feature map. The feature maps are multiplied by this vector, with one feature map corresponding to one element, so that the more important feature maps carry more weight than the others.
For a better understanding of this module, it can be divided into two parts, namely, the common residual part, shown in Figure 4a, and the attention fusion part, shown in Figure 4b. The common residual part assures the basic feature learning performance of each branch. As in the traditional residual module, the input feature map (R^{H×W×C}) is entered into a convolutional residual network and mapped into an output feature map (R^{H×W×C}). In the attention fusion part, the feature maps obtained by the convolutional networks (F(·, W)) of the Siamese branches in the same layer, F_1^{H×W×C} and F_2^{H×W×C}, are entered and concatenated into a whole feature map (V_1^{H×W×2C}). Then, V_1 is squeezed into one sequence of length 2C through global average pooling. The excitation processing contains two fully connected layers, whose mapping function is denoted as f_ex(·, W). Through this processing, the connection between the two branches is constructed, and the length of the sequence shrinks from 2C to C. The output of the excitation processing (F_scale(·, ·)) fuses the information of the Siamese branches and expresses the importance of each feature channel for the final task as a weight vector. The weights of each channel are fed back into the respective branches to enhance the influence of important channels and reduce the influence of less important channels in the following feature extraction. The parameters of f_ex(·, W) are adaptively learned from the training data and reflect the information interaction of the whole 2C channels. Through the attention information fusion module, the information interaction of the two branches and the introduction of the attention mechanism are both accomplished.
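A minimal NumPy sketch of the squeeze-and-excitation computation inside the AIFM. The two-layer excitation shrinking 2C to C follows the description above; applying the same C weights to both branches is our reading of the paper's feedback step, and the random weight matrices are stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def aifm(F1, F2, W1, W2):
    """Attention information fusion over two branch feature maps.

    F1, F2: H x W x C outputs of the same layer of the two branches.
    Squeeze: concatenate to H x W x 2C, then global average pooling
    gives a 2C-length vector. Excitation: two fully connected layers
    shrink it to C sigmoid weights, which rescale the channels of
    both branches."""
    V = np.concatenate([F1, F2], axis=-1)        # H x W x 2C
    z = V.mean(axis=(0, 1))                      # squeeze -> length 2C
    s = sigmoid(np.maximum(z @ W1, 0.0) @ W2)    # excitation -> C weights
    return F1 * s, F2 * s                        # channel-wise rescaling

H, W, C = 4, 4, 8
W1 = rng.normal(size=(2 * C, 2 * C))
W2 = rng.normal(size=(2 * C, C))
F1, F2 = rng.random((H, W, C)), rng.random((H, W, C))
F1_out, F2_out = aifm(F1, F2, W1, W2)
```

Because the excitation weights are computed from the concatenated 2C channels, each output channel's scale depends on both branches, which is exactly the information fusion the module is named for.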

Network Architecture of Atten-SiamNet-CD
Attention-guided Siamese fusion network-based change detection (Atten-SiamNet-CD) is an enhanced version of Siam-ResNet-CD, as introduced in Section 2.1. ResNet32 is again the basic network framework, as in Siam-ResNet-CD. Multiple AIFMs are integrated into the middle network part (Layer1, Layer2 and Layer3) of Siam-ResNet-CD. In ResNet32, each layer contains 5 residual modules. In Atten-SiamNet-CD, these five residual modules are replaced by five attention information fusion modules, as shown in Figure 5. To reduce the size of the feature maps, the stride of the convolutional layer of Block 2 and Block 3 is set to 2, by which the output feature maps are reduced to half their original size.

Like other pixel-level change detection methods [18][19][20], the image patch (k × k) around the center pixel is sampled as the original feature of that pixel. Compared with a single pixel, image patches contain more structural information about the scene, which is beneficial for change detection. As shown in Figure 5, image patches of a certain pixel in the two temporal images are inputted into the network in pairs. After the attention-guided Siamese fusion network, high-level features are extracted and mapped into the final labels, which indicate whether the pixel pair is changed or unchanged. One-hot encoding is used here, and the cross-entropy loss measures the distance between the true label and the predicted label and guides the learning process. Training data are sampled from labeled data containing changed pixels and unchanged pixels.
Two temporal remote sensing images are predicted through the trained attention-guided Siamese fusion network. The result can be represented as a tensor, DI ∈ R^{H×W×2}. The values of this tensor denote the difference of the pixel pairs, where DI_{:,:,1} is the probability of a pixel pair being changed and DI_{:,:,2} is the probability of a pixel pair being unchanged. Furthermore, DI_{:,:,1} and DI_{:,:,2} are complementary, with a sum of 1. Therefore, only DI_{:,:,1} is used as the difference map in the following processing. To verify the feature learning performance of the attention-guided Siamese fusion network, the simple threshold segmentation method is used to obtain the final change detection results. Pixels whose difference values are larger than the threshold are defined as changed pixels, and the others are unchanged pixels. The threshold is set to 0.6 based on multiple experiments.

Experimental Results
In the above section, the proposed attention-guided Siamese fusion network was introduced in detail. This section describes the many experiments that were designed to verify its correctness and superiority. First, the datasets and experimental settings are introduced. Then, the experimental analysis of important parameters is described. Finally, the experimental results of multiple change detection methods on the experimental datasets are shown and analyzed to demonstrate the superiority of the proposed method.

Introduction of Datasets and Experimental Setting
To ensure the equitability and reliability of the experimental results, temporal image pairs were selected from three public change detection datasets. Ground truthing conducted by humans tends to mark human-made changes, such as industrial development and housing construction, which is very important for urban development monitoring and research on land use. Although there are multiple kinds of changes in one image pair, we focused only on the changes labeled in the ground truth and ignored unlabeled changes. In total, six image pairs were tested here, and the characteristics of each image pair varied considerably. In the following introductions, their characteristics are discussed.
The first dataset was the SZTAKI AirChange Benchmark, which contained 13 image pairs with sizes of 952 × 640. The spatial resolution of this dataset was 1.5 m, and the labels were manually annotated. The label information concentrated only on certain changes, such as new urban areas, construction sites, new vegetation and new farmland. However, the labels only indicated which pixels were changed or unchanged, without the change class. In this experimental section, Szada/2 and Tiszadob/3 were chosen as experimental data, which are shown in Figure 6. Szada/2 and Tiszadob/3 both showed changes in vegetation areas. However, Szada/2 only labeled areas changing from vegetation to human-made cover, such as roads, buildings and plazas. Tiszadob/3 displayed areas changing from one kind of vegetation to another, which is obvious in the two images.

The second dataset was the QuickBird dataset provided by the Data Innovation Competition of the Guangdong Government, which was captured in 2015 and 2017. In this experimental section, three image pairs were selected from this dataset, each of which contained 512 × 512 pixels. The source images contained four bands, which were converted to one-band images by averaging the four bands. Their schematic diagrams are shown in Figure 7. The ground truthing of these three image pairs showed an increase in buildings in the fixed area. QuickBird-1 and QuickBird-3 focused on the changes in factory buildings and ignored other changes. QuickBird-2 showed an increase in residential buildings around the lake. The structures of residential buildings and factory buildings were very different, and the original landcovers were also different.
The third dataset was made up of heterogeneous data captured by an optical sensor and a SAR sensor over Shu Guang village in 2012 and 2008. The size of the data was 921 × 593 pixels. The labeled changed area was construction built on farmland. Their schematic diagrams are shown in Figure 8. In contrast to the above two datasets, both obtained by optical sensors, this dataset contained two kinds of remote sensing images. In the visual sense, there were huge differences between these two images. In terms of the imaging mechanism, the information reflected by each pixel was also different. Therefore, overcoming the intrinsic difference between data sources is a challenge for change detection.

For each image pair, the training data in the experiments were sampled randomly from the labeled data: 400 changed pixel pairs and 1600 unchanged pixel pairs, around whose center pixels image patch pairs were split from the original image pairs as the original features input into the neural network. The size of the patches is one important factor that may affect the final performance. Here, 22 × 22 was chosen as the patch size based on experimental experience, which was confirmed in the subsequent experiments. During the model training process, the initial learning rate was set to 0.001 and reduced to 10% of its value every 80 iterations. The total number of iterations was 200. The Adam method [63], one of the most widely used optimization algorithms, was chosen to optimize the objective function. The convolutional layers were initialized using Kaiming initialization [64]. The scale factor and the shift factor were initialized as 1 and 0, respectively. The batch size was set to 128. The final results shown in this section are the average of 10 random experiments.
The experiments were built on PyTorch 0.4.1, provided by Facebook AI Research, and run on a Lenovo Y7000P with an NVIDIA GeForce RTX 2060.
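To make the training configuration above concrete, the following PyTorch sketch applies the stated settings: Adam with an initial learning rate of 0.001, step decay to 10% every 80 iterations, Kaiming initialization for convolutional layers, and batch-norm scale/shift factors initialized to 1 and 0. The small stand-in network is only an illustrative placeholder, not the proposed Siamese fusion model.

```python
import torch
import torch.nn as nn

# Placeholder network; the actual attention-guided Siamese model differs.
model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16),
                      nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(16, 2))

# Kaiming initialization for convolutions; BN scale = 1, shift = 0.
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.constant_(m.weight, 1)
        nn.init.constant_(m.bias, 0)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Learning rate drops to 10% of its value every 80 iterations (200 total).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=80, gamma=0.1)
```

In an actual run, each of the 200 iterations would draw a mini-batch of 128 patch pairs, compute the loss, call `optimizer.step()` and then `scheduler.step()`.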

Experiments about Network Architecture
In this subsection, the experiments designed to verify the network architecture choices are described, and the experimental results are shown and analyzed. The experiments were divided into two parts: first, an analysis of the influence of patch size on change detection performance; second, an analysis of the difference between the Pseudo-Siamese structure and the Siamese structure.
Aiming to measure the performance of the change detection results in all respects, five numerical indices were used in this section. The precision rate (Pre) denotes the ratio of real changed pixels to all pixels in the predicted changed pixel set. The recall rate (Rec) denotes the ratio of correctly predicted changed pixels to all real changed pixels. The accuracy rate (Acc) denotes the ratio of pixels classified correctly to all pixels, considering both changed and unchanged pixels. The Kappa coefficient (Kappa) is a more reasonable classification index than Acc, with a higher value representing a better classification result. The F1 coefficient is the harmonic mean of Pre and Rec, which often trade off against each other. Each index concentrates on a specific aspect of a change detection method's performance, so no single index is entirely equitable on its own.
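The five indices above all follow from the binary confusion matrix of changed vs. unchanged pixels. The NumPy sketch below implements the standard definitions (the function name is illustrative):

```python
import numpy as np

def change_metrics(pred, gt):
    """Compute Pre, Rec, Acc, Kappa and F1 from binary change maps.

    pred, gt: arrays of 0 (unchanged) / 1 (changed).
    """
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()
    tp = np.sum((pred == 1) & (gt == 1))  # correctly detected changes
    fp = np.sum((pred == 1) & (gt == 0))  # false alarms
    fn = np.sum((pred == 0) & (gt == 1))  # missed changes
    tn = np.sum((pred == 0) & (gt == 0))  # correct unchanged pixels
    n = tp + fp + fn + tn
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    acc = (tp + tn) / n
    # Kappa corrects Acc for the agreement expected by chance.
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (acc - pe) / (1 - pe)
    f1 = 2 * pre * rec / (pre + rec)
    return pre, rec, acc, kappa, f1
```

Because unchanged pixels heavily outnumber changed ones in these datasets, Acc alone can look high even for poor detectors, which is why Kappa and F1 are reported alongside it.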

Analysis of Image Patch Size
Image patch size is an important factor in data sampling, as it decides how much neighborhood information is available for the center pixel. When the size is too large, the excess of neighborhood information can interfere with the judgment of the center pixel's situation. However, an overly small size may also hurt performance due to the limited neighborhood information. In this subsection, the influence of different image patch sizes on the final results is discussed. In the experiments, the patch size was set to 6 × 6, 10 × 10, 14 × 14, 18 × 18, 22 × 22 and 26 × 26. Under otherwise identical experimental settings, the experiments were performed on the Szada/2, QuickBird-1 and Shu Guang village data. In Figure 9, the Kappa and Acc for different patch sizes are shown, from which a quantitative analysis of the performance influence can be made. From the results shown in Figure 9, there is an obvious variance in the performance of different patch sizes for the three datasets. With the increase in patch size, the results also demonstrated a growth trend. However, when the patch size reached a certain value, such as 22 × 22, the results began to decrease or stabilize. This phenomenon verifies that the patch size should be neither too big nor too small in this situation. Although there were small differences in how the results varied with patch size on different data, 22 × 22 could approximately satisfy the performance requirements of all of them.
Additionally, in Figure 10, some difference maps for different patch sizes are shown, which were predicted by the proposed fusion model using the Szada/2 data. In the visual sense, the difference maps showed some variances, which resulted from the learning and predicting processes of the neural network. Although the contrast ratio in Figure 10b,d is higher than that in Figure 10c, this did not have a large effect during the final threshold segmentation process. Considering the details of the labeled region in the ground truthing, some regions were detected to be smaller than those in the ground truthing in Figure 10b, and some regions were detected to be larger than those in the ground truthing in Figure 10d. Figure 10c shows change detection results that are better than the others.
This can also be supported by theoretical analysis. The patch size determines which scale of neighborhood information is introduced into the feature learning process. In the proposed method, the pixels in the neighborhood are all treated as part of the original features of the center pixel. If these neighbor pixels are of the same class as the center pixel, they help the center pixel to be classified into the true class. On the other hand, if some of the neighborhood pixels do not belong to the same class as the center pixel, their values will reduce the probability of correct classification. This influence becomes significant at the margin of the changed area. A low value means little neighborhood information, and an excessively large value means a strong influence of neighbor pixels. Therefore, this parameter can be neither too large nor too small, and it is usually determined by experiments for a given application. Through our experiments, 22 × 22 was found to be the appropriate patch size, which was fixed in the following experiments.
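The sampling step discussed above, cropping co-registered patch pairs around labeled center pixels, can be sketched as follows; the function name and the border-handling policy (skipping patches that would cross the image border) are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def sample_patch_pairs(img1, img2, centers, size=22):
    """Crop co-registered patch pairs around labeled center pixels.

    img1, img2: 2-D arrays of the two acquisition dates (same shape).
    centers: list of (row, col) positions sampled from the label map.
    Patches that would cross the image border are skipped for simplicity.
    """
    half = size // 2
    pairs = []
    for r, c in centers:
        if half <= r <= img1.shape[0] - half and half <= c <= img1.shape[1] - half:
            p1 = img1[r - half:r + half, c - half:c + half]
            p2 = img2[r - half:r + half, c - half:c + half]
            pairs.append((p1, p2))
    return pairs
```

Each returned pair then forms one two-branch input sample for the network, with the label of the center pixel as the target.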


Analysis of Pseudo-Siamese and Siamese Network
In this section, we analyze the performance difference between the Pseudo-Siamese and Siamese networks. The experiments were performed on three datasets, namely, the Szada/2, QuickBird-1 and Shu Guang village data, as in the above subsection. The main difference between the Pseudo-Siamese and Siamese networks is whether a parameter-sharing mechanism is applied during network training. The Siamese network only updates the parameters of one branch during training and duplicates these parameters to the other branch. In contrast, the Pseudo-Siamese network updates the parameters of both branches independently during training. Here, all experimental settings were kept identical. The final change detection results for these three datasets are shown in Table 3, where the bold text indicates the better value in each comparison. From these results, it is clear that the Siamese network obtained results comparable to those of the Pseudo-Siamese network on the three datasets. However, as the Siamese network uses the parameter-sharing mechanism, its parameter quantity is much lower than that of the Pseudo-Siamese network, which is supported by the statistical result shown in Table 4. Considering both performance and computation cost, the Siamese network structure is more suitable for the proposed method than the Pseudo-Siamese structure. Therefore, the Siamese network was chosen in our research work.
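The structural distinction can be made concrete in PyTorch: a Siamese network routes both inputs through the same module, so the weights are shared, whereas a Pseudo-Siamese network keeps two independent copies of the same architecture and therefore has twice the branch parameters. The toy branches below are placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn

def make_branch():
    # Tiny illustrative feature extractor, not the paper's ResNet branch.
    return nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())

class Siamese(nn.Module):
    """Both inputs pass through the SAME module, so weights are shared."""
    def __init__(self):
        super().__init__()
        self.branch = make_branch()
    def forward(self, x1, x2):
        return self.branch(x1), self.branch(x2)

class PseudoSiamese(nn.Module):
    """Two independent copies: same architecture, twice the parameters."""
    def __init__(self):
        super().__init__()
        self.b1 = make_branch()
        self.b2 = make_branch()
    def forward(self, x1, x2):
        return self.b1(x1), self.b2(x2)

def n_params(m):
    return sum(p.numel() for p in m.parameters())
```

This is exactly the parameter-count gap reflected in Table 4: the Pseudo-Siamese branch parameters are double those of the Siamese network.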

Comparison with Other Methods
Aiming to certify the performance of the proposed method in change detection, some related change detection methods were chosen for comparison. DNN-CD [20] and CNN-LSTM [52] are two typical change detection methods based on neural networks. DNN-CD utilizes a deep RBM network as the classification network for change detection, based on label data obtained through pre-classification. CNN-LSTM combines a CNN network with an LSTM network to extract spectral-spatial-temporal features. Additionally, the basic models of the proposed method, namely, ResNet-CD and Siam-ResNet-CD, were also considered to verify the performance of the proposed method. These methods were trained under the same experimental settings to achieve their best results for comparison, except that DNN-CD was trained using all labeled samples obtained through pre-classification. Figure 11 shows the change detection results obtained by the different methods on the SZTAKI AirChange Benchmark, and their numerical indices are shown in Table 5, where the indices in bold are the best results. Figure 11(b1-f1) show the results for the Szada/2 data. Figure 11(b1) shows the result of DNN-CD; there are obvious false detections in the lower right corner. The results of CNN-LSTM-CD and ResNet-CD are shown in Figure 11(c1,d1); they detect the main change areas, but the details are not precise. The detection map of Siam-ResNet-CD in Figure 11(e1) and the detection map of Atten-SiamNet-CD are better than those of the other methods. However, the integrality of the detection map of Atten-SiamNet-CD is better than that of Siam-ResNet-CD. Analyzing the numerical indices of these methods yields conclusions similar to those from the detection maps. Although the result of Siam-ResNet-CD is close to that of the proposed method on Pre, the proposed Atten-SiamNet-CD achieved a remarkable improvement in Rec, Kappa and F1. Figure 11(b2-f2) show the results for the Tiszadob/3 data.
Compared with the Szada/2 data, these data are simpler, as there are fewer types of change in the labeled area. Therefore, all methods obtained decent detection results, both in terms of the detection maps and the numerical indices. However, the proposed Atten-SiamNet-CD achieved the best results; the other methods show problems in certain aspects, which is also verified by the numerical results. Figures 12 and 13 show the change detection maps obtained by the different methods for the QuickBird dataset, with their numerical indices shown in Table 6. As shown in Figure 12, the change detection map of DNN-CD seems better than the others in the visual sense, as there is little obscurity at the edges between the changed and unchanged areas. However, there are also some errors, such as false detection in the top left area and missed detection in the lower area. Therefore, although the values of Pre and Acc obtained by DNN-CD are high, the values of Rec, Kappa and F1 are not higher than those of the others. The proposed Atten-SiamNet-CD achieved the best results for Rec, Acc, Kappa and F1. For the QuickBird-2 and QuickBird-3 data, the results of DNN-CD are not shown in Figure 13 and Table 6, as they are much lower than those of the others.
The main reason for this is that the label data obtained by pre-classification are not consistent with the ground truthing. In contrast to QuickBird-1 data, the change categories existing in the whole scene are very complex, and the ground truthing only annotates some areas of concern.

Results for QuickBird Dataset
For QuickBird-2, the change detection map of Atten-SiamNet-CD is much closer to the ground truthing. Compared with the other methods, it also obtained the best results for Acc, Kappa and F1, and results close to the best for Pre and Rec.
As shown in Table 6, for these data, all methods obtained similar results for the numerical indices Precision, Acc, Kappa and F1. However, Atten-SiamNet-CD obtained a significant improvement in Recall, which means that this method correctly detects more changed areas than the other methods. This result can also be verified by the change detection maps shown in Figure 13. All methods could detect the main changed areas but also showed some defects, such as with the small building in the left corner. The results of Siam-ResNet-CD are not precise, as shown by the building marked by a red circle in Figure 13(d2). In both of the above problems, Atten-SiamNet-CD produced more precise results than the others.

Results for Shu Guang Village Data
The Shu Guang village data contain two heterogeneous remote sensing images; one is an SAR image and the other an optical image. The difference in imaging mechanism enlarges the variation between them. In these data, we focused only on the change from farmland to buildings. However, there was a large discrepancy in the unchanged areas, which, for DNN-CD, affected the accuracy of the labeled data; therefore, the result of DNN-CD is not listed here. Figure 14 shows the change detection results obtained by the different methods for the Shu Guang village data. Comparing these results, there is very little difference among them. No method could obtain good results in the area marked by a red circle in Figure 14a. The main reason for this is that the intrinsic difference between the data sources is too large, which affects the determination. However, in the numerical indices shown in Table 7, Atten-SiamNet-CD improved Pre, Acc, Kappa and F1.


Summary of Experiments
In the above subsections, the change detection results of the different methods for six temporal image pairs were shown. The comparison of ResNet-CD and Siam-ResNet-CD demonstrated that the Siamese structure is more suitable for change detection problems than concatenating the two temporal image patches as a single input, which was supported by almost all the experiments. Analysis of the results of Siam-ResNet-CD and Atten-SiamNet-CD verifies the importance of information interaction between the branches during feature learning. This provides strong experimental support for our findings.
In comparison with CNN-LSTM-CD, the proposed Atten-SiamNet-CD showed obvious advantages on the SZTAKI AirChange Benchmark, QuickBird-1 and QuickBird-2; the results on the other two datasets are similar. As CNN-LSTM-CD uses two kinds of neural network to construct the two feature learning branches, besides spatial-spectral features, it can also extract temporal features. However, like Siam-ResNet-CD, this method is a typical Siamese structure; it does not consider the interaction of the two branches or the importance of each feature. This is the main reason that the proposed Atten-SiamNet-CD performs better.
The performance of DNN-CD is highly dependent on the accuracy of the labeled samples obtained by pre-classification. If the scene is complex, the labeled samples may introduce a large amount of interference into the training process and generate inaccurate change detection results. This is the main limitation of this method. Additionally, DNN-CD uses the RBM as its basic network block, so the image patches must be rearranged into vectors before being input into the network, which loses spatial structural information during feature learning.
In summary, the proposed method is effective for change detection. Using the attention information fusion module to realize the interaction of two feature extraction branches can improve the final change detection results. In comparison with DNN-CD and CNN-LSTM-CD, the proposed Atten-SiamNet-CD has both performance and theoretical advantages.
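As a rough illustration of this interaction idea, the sketch below shows an SE-style channel-attention fusion of two branch feature maps: concatenated branch features are squeezed by global pooling, re-weighted channel-wise, and split back to the two branches. This is an assumption-laden toy in the spirit of the description above, not the paper's exact attention information fusion module.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative channel-attention-guided fusion of two branch features.

    Hypothetical SE-style block: the concatenated branch features are
    globally pooled, passed through a bottleneck MLP to produce per-channel
    importance weights, re-weighted, and split back to the two branches.
    """
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, 2 * channels // reduction), nn.ReLU(),
            nn.Linear(2 * channels // reduction, 2 * channels), nn.Sigmoid())

    def forward(self, f1, f2):
        f = torch.cat([f1, f2], dim=1)           # (N, 2C, H, W)
        w = self.fc(f.mean(dim=(2, 3)))          # per-channel importance
        f = f * w.unsqueeze(-1).unsqueeze(-1)    # re-weight channels
        return f.chunk(2, dim=1)                 # back to two branches
```

Because the attention weights are computed from both branches jointly, each branch's features are modulated by information from the other, which is the interaction effect credited above for the improvement over the plain Siamese baseline.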

Conclusions
The change detection of remote sensing images is a useful and significant research topic that is promising for practical applications. However, there are still many obstacles to the broad practical application of change detection. Besides the influences caused by imaging processing, the main challenge is the imbalance between the complexity of the scene and the limited cognition of the scene. It is impossible to annotate a mass of samples to train a perfect network for arbitrary scenes and arbitrary images. Commonly, only a few change types are considered in practice, although many change types exist in a given scene; people label only those changed samples they find interesting. With a limited training set, the network structure is the key factor for better results. In this work, we focused on constructing a new network structure suitable for change detection. By analyzing the problems of basic networks such as ResNet-CD and Siam-ResNet-CD, an attention-guided information fusion network was constructed for change detection. In contrast to the common use of the attention mechanism, the attention block was integrated with the double feature networks to realize the information interaction and fusion of the two branches and also to guide the feature learning process. Our experimental results verified that this design performs better. Although exploring a more suitable network structure can improve change detection results, it cannot completely solve the practical application problem of change detection. In change detection, there are always only a few labeled changed areas, and the rest of the areas are treated as unchanged. Therefore, the effect of unchanged areas cannot be ignored. Utilizing the information hidden in unchanged areas via unsupervised methods to improve the effectiveness of supervised learning may be a good direction for change detection.
In future works, we plan to explore this research direction.