MAFF-Net: Multi-Attention Guided Feature Fusion Network for Change Detection in Remote Sensing Images

Change Detection (CD) is one of the most important tasks in remote sensing image analysis and is key to obtaining accurate information about changes on the Earth’s surface. In this paper, a Multi-Attention Guided Feature Fusion Network (MAFF-Net) is designed for CD tasks. The network enhances feature extraction and feature fusion by building different blocks. First, a Feature Enhancement Module (FEM) is proposed. The FEM introduces Coordinate Attention (CA), which embeds position information into channel attention to capture accurate position information and channel relationships in remote sensing images. An updated feature map is obtained by an element-wise summation of the input of the FEM and the output of the CA block, which enhances the feature representation in the network. Then, an attention-based Feature Fusion Module (FFM) is designed. Instead of the conventional layer-by-layer fusion, it adopts cross-layer aggregation to compensate for the semantic information that is lost as the number of layers increases, and it plays an important role in the communication between feature maps at different scales. To further refine the feature representation, a Refinement Residual Block (RRB) is proposed. The RRB changes the number of channels of the aggregated features and uses convolutional blocks to further refine the feature representation. Compared with the other methods evaluated, MAFF-Net improves the F1-score by 4.9%, 3.2%, and 1.7% on three publicly available benchmark datasets, CDD, LEVIR-CD, and WHU-CD, respectively. The experimental results show that MAFF-Net achieves state-of-the-art (SOTA) CD performance on these three challenging datasets.


Introduction
Remote sensing image change detection (CD) uses two or more remote sensing images of the same area at different times to compare and analyze the atmospheric, spectral, and sensor information through artificial intelligence or mathematical statistics to obtain the change information of the area [1,2]. CD is an important research direction in the field of remote sensing and plays a great role in many fields such as land planning, urban expansion [3,4], environmental monitoring [5][6][7], and disaster assessment [8] as a key technology for monitoring surface conditions.
Recently, with the gradual maturity of remote sensing imaging technology, high-resolution (HR) remote sensing image data have been emerging. Compared with medium-resolution and low-resolution remote sensing images, HR remote sensing images have richer geometric and spatial information, which provides favorable conditions for monitoring surface changes more accurately. Therefore, researchers have paid more attention to the processing of HR remote sensing images. Effectively extracting the rich feature information of HR remote sensing images, better focusing on the change regions, avoiding the interference of other factors, and reducing the interference of pseudo-changes are the key issues of remote sensing image CD research [9].
Many CD methods have been proposed, and different authors have summarized and classified them from different perspectives. In this paper, we summarize and compare them from two perspectives: traditional methods and deep learning-based methods.
The traditional methods are divided into pixel-based remote sensing image CD methods and object-oriented remote sensing image CD methods according to the size of the basic unit [10]. The pixel-based remote sensing image CD method usually directly processes the input image according to the pixel-level spectral features, texture features, and other specific meaningful features (water bodies, vegetation indices). It obtains the difference image by difference or ratio. The change information is then extracted using a threshold segmentation method [11]. In the early days, methods such as image difference [12], image ratio [13], and regression analysis [14] were commonly used. However, these methods usually failed to obtain complete change information. To better utilize the spectral information of images, methods based on image transformation such as independent component analysis (ICA) [15] and multivariate alteration detection (MAD) [16,17] have emerged one after another and have achieved good results in land CD. For multispectral remote sensing images, the change vector analysis (CVA) [18] method is proposed to detect different changes in the ground. The CVA methods calculate the amplitude and phase angle and use the phase angle information to subdivide the changes. However, the performance of this type of method depends heavily on the quality of the spectral bands involved in the calculation, and the stability of the algorithm cannot be guaranteed. Therefore, improved versions of the CVA technique have been proposed during 2012-2016 to further improve the performance of CD [19][20][21][22]. With the development of HR optical remote sensing satellite technology, more and more HR remote sensing images are used for CD.
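The CVA idea described above, computing a magnitude and a phase angle per pixel and using the angle to subdivide changes, can be illustrated with a minimal sketch. It assumes a two-band pixel, a fixed magnitude threshold, and a quadrant-based labeling of the angle; the function names, the threshold, and the quadrant rule are illustrative assumptions and are not taken from [18].

```python
import math

def change_vector(pixel_t1, pixel_t2):
    """Magnitude and direction (degrees) of the change vector of a 2-band pixel."""
    d_band1 = pixel_t2[0] - pixel_t1[0]
    d_band2 = pixel_t2[1] - pixel_t1[1]
    magnitude = math.hypot(d_band1, d_band2)
    angle = math.degrees(math.atan2(d_band2, d_band1)) % 360.0
    return magnitude, angle

def classify_pixel(pixel_t1, pixel_t2, threshold):
    """Label a pixel "unchanged", or by the quadrant of its change direction.

    The quadrant rule is a hypothetical way to subdivide changes by angle.
    """
    magnitude, angle = change_vector(pixel_t1, pixel_t2)
    if magnitude < threshold:
        return "unchanged"
    return "change-type-%d" % (int(angle // 90) + 1)

print(change_vector((10, 10), (13, 14)))
print(classify_pixel((10, 10), (13, 14), threshold=4.0))
print(classify_pixel((10, 10), (11, 10), threshold=4.0))
```

Because both the magnitude and the angle are computed directly from band differences, noise in any band propagates into the result, which is consistent with the sensitivity to band quality noted above.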
In HR remote sensing images, the characteristic of "different objects in the same spectrum" easily leads to "salt and pepper" noise in the detection results. This problem further limits the practical application of pixel-level CD methods to HR remote sensing images [23]. Object-based CD methods are commonly used for HR remote sensing image CD because they allow for a richer representation of information. Ma et al. [24] investigated the effects of semantic strategy, scale, and feature space on an unsupervised, object-based CD method in urban areas. Subsequently, Zhang et al. [25] proposed an object-based method for unsupervised CD by incorporating a multi-scale uncertainty analysis. Zhang et al. [26] proposed a method based on the box-whisker plot with cosine law, which outperformed the traditional CD method. For CD tasks where "from-to" change information has to be determined, Gil-Yepes et al. [27] and Qin et al. [28] utilized a post-classification comparison strategy. Although object-based CD methods can better utilize the spatial feature information of HR remote sensing images than pixel-based CD methods, they still rely on traditional manual feature extraction, which is not only complicated and inefficient but also yields less stable CD performance [9].
In recent years, deep learning methods have been widely used in natural language processing, speech recognition [29,30], and image processing [31][32][33]. Deep learning methods have excellent learning ability and do not require manually designed features. With the success of deep learning in the field of image processing, deep learning-based CD for remote sensing images has quickly attracted the interest of scholars, and excellent research based on convolutional neural networks (CNNs) has begun to appear in the field of remote sensing CD [34].
CNNs do not require manually designed features for feature extraction. In the field of remote sensing CD, ResNet [35], fully convolutional network (FCN) [36], and UNet [37] structures have been widely used for feature map extraction with some success. With continued research, remote sensing CD models have been continuously optimized and improved.
For example, FC-EF [38] concatenates the two images before feeding them into a backbone network with a UNet structure. The Siamese variants instead process the two images separately through two branches that have the same network structure and shared parameters, and the outputs of the two branches are finally combined using convolutional layers. FC-Siam-conc [38] skip-connects three feature maps, those from the two encoder branches and that of the corresponding decoder layer. FC-Siam-diff [38] first takes the difference between the feature maps of the two encoder branches, then computes the absolute value of the difference, and finally uses a skip connection to connect it with the corresponding decoder layer. Subsequently, the FCN-based UNet network was successfully applied to the CD task [39,40] and was trained end-to-end from scratch using only available CD datasets. Coarse-to-fine [41] proposes a coarse-to-fine detection framework to detect remote sensing change regions. It first uses an encoder and decoder to obtain coarse change maps of bi-temporal images and then applies the idea of residuals to obtain refined change maps. The method can effectively detect the change regions with good results. Beyond relating the feature maps of different layers through residuals, many scholars also use the attention mechanism in remote sensing CD to extract richer and finer feature maps. STANet [42] uses ResNet as its backbone and adds a self-attention module for CD to the feature extraction process, which can compute the relationship between any two pixels. The authors of this model later introduced a Transformer on top of ResNet, which further improved the network performance [43]. DASNet [44] proposes a dual-attention mechanism to generate better feature representations and enhance the performance of the network. Zhang et al. [45] first use a Siamese network with two branches as the feature extraction network for the raw images. To enhance the integrity of change map boundaries and internal densities, multi-level deep features are fused with image difference map features by an attention mechanism. In 2021, Hou et al. [46] proposed a novel attention mechanism for mobile networks by embedding location information into channel attention, calling it Coordinate Attention (CA). CA enhances feature representation. Also in 2021, HDFNet [47] used hierarchical fusion and a dynamic convolution model to obtain a fine feature map. The network innovates in the fusion of features at different levels, which gives it superior recognition performance. The above methods have achieved certain results in the field of remote sensing CD. However, the accurate extraction of effective feature representations and the adequate fusion of feature information at different scales are still research challenges in the field of remote sensing CD. For ease of reference, a summary of the above-mentioned methods is presented in Table 1.
Table 1. Summary of contemporary CD methods.

In this paper, we propose a Multi-Attention Guided Feature Fusion Network (MAFF-Net) for remote sensing images to address the above problems effectively. The main contributions of this article are as follows:

1. We propose the Feature Enhancement Module (FEM), which addresses the problem that the features extracted from the backbone network contain much interference information and that the feature representation is not clear enough. The FEM captures not only cross-channel information but also direction-aware and location-sensitive information, which helps the model locate the region of interest more accurately and enhances the representation of changing region features.
2. To solve the problem of inadequate feature fusion and insufficient feature communication across different layers or scales, we design the attention-based Feature Fusion Module (FFM), which is divided into FFM_S1 and FFM_S2 according to the input feature maps. FFM_S1 fuses the high-level feature maps with the low-level feature maps by a cross-layer approach, which helps highlight the spatial consistency of objects. FFM_S2 fuses two feature maps of the same scale, one from T1 and one from T2. The role of FFM_S2 is to fully fuse the feature maps of the bi-temporal image pairs to obtain a better change map.
3. We propose a Refinement Residual Block (RRB) with a residual structure, which compensates for the shortcomings of using a single 3 × 3 convolutional kernel to refine the feature representation.
We tested the model on three publicly available remote sensing image datasets. The experimental results validate the effectiveness of our proposed algorithm. The remainder of this article is organized as follows: Section 2 describes the proposed method in detail. In Section 3, corresponding experiments are designed to verify the effectiveness of the method in this article, and the experimental results are analyzed and discussed. Section 4 draws some conclusions about our method.

Methodology
In this section, the network proposed for the remote sensing image CD task is described in detail. First, the backbone of the architecture is described. Second, the proposed FEM is presented. Next, the attention-guided feature fusion mechanism, which is the focus of this section, is described module by module. Then, the RRB proposed in this paper is introduced. Finally, the prediction results are generated by applying convolutional operations [48,49] to the final fused feature maps.

Network Architecture
The overall structure of the proposed network is shown in Figure 1. The proposed network uses ResNet18 as its backbone network. Based on some previous work [42,50,51], the proposed network modifies ResNet18 by removing the last max-pooling layer and the fully connected layer and retaining the layers in the first five convolutional blocks (Conv1 to Conv5).
First, the bi-temporal image pair (T1, T2) is input to the feature extraction network to obtain two sets of feature maps, $F_{T1}$ and $F_{T2}$. For each set of feature maps, the proposed method uses only the last four feature maps. These feature maps are then fed into the Feature Enhancement Module (FEM) according to their respective scales to obtain two sets of updated feature maps, $F^1_1, F^2_1, F^3_1, F^4_1$ and $F^1_2, F^2_2, F^3_2, F^4_2$. Next, the cross-layer feature fusion strategy is applied to each of the two updated sets. Note that this strategy targets features of different scales from the same image. Take image T1 as an example. First, bilinear up-sampling [52][53][54] and convolution operations are performed on the high-level feature $F^3_1 \in \mathbb{R}^{4C \times H/4 \times W/4}$ to bring it to size $C \times H \times W$, where $H \times W$ is the size of the feature map $F^1_1 \in \mathbb{R}^{C \times H \times W}$ and $C$ is the channel dimension of $F^1_1$. Then, $F^1_1$ and the resized $F^3_1$ are concatenated to obtain the feature $F^{13}_1 \in \mathbb{R}^{2C \times H \times W}$. $F^{13}_1$ is input to the convolutional block attention module (CBAM) [55], and a 3 × 3 convolution then reduces the result to $\mathbb{R}^{C \times H \times W}$. The same method is used to fuse the remaining pairs, giving $F^{24}_1$ for T1 and $F^{13}_2$, $F^{24}_2$ for T2. The pairs $F^{13}_1, F^{13}_2$ and $F^{24}_1, F^{24}_2$ are fed into our proposed RRB to further refine the feature representation and obtain $F^{13}_{12} \in \mathbb{R}^{C \times H \times W}$ and $F^{24}_{12} \in \mathbb{R}^{2C \times H/2 \times W/2}$, respectively. Finally, the two feature maps, $F^{13}_{12}$ and $F^{24}_{12}$, are sent to the FFM for final fusion. The prediction map is obtained after applying a pixel classifier (consisting of a 3 × 3 Conv, batch normalization (BN) [56], and ReLU [57] in sequence).

Feature Enhancement Module
Existing CD methods for HR remote sensing images have paid little attention to position information and channel relationships, although HR remote sensing images carry rich spatial location information. To address this, a Feature Enhancement Module (FEM) based on coordinate attention (CA) is proposed in this paper to obtain the accurate location information and channel relationships of HR remote sensing images. The module considers both position information and channel information. The structure of the FEM is shown in Figure 2.
In Figure 2, a 3 × 3 convolution operation is first performed on the input F1. The result is then fed into the CA block to obtain the weighted feature map $F2 \in \mathbb{R}^{C \times H \times W}$. Feature maps F1 and F2 are merged into one feature map by element-wise summation, and a 3 × 3 convolution operation is used to obtain F3.
In Figure 2, the coordinate attention module encodes the H and W directions separately. In the HR remote sensing image, for a given position $(i, j)$, the pixel value on channel $c$ is $x_c(i, j)$. The H average pooling output of the $c$-th channel at height $h$ is given by Equation (1) [46]:
$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \tag{1}$$
Similarly, the W average pooling output of the $c$-th channel at width $w$ is given by Equation (2) [46]:
$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{2}$$
Then, the Reshape operation is used to permute the dimensions of the $z^h$ tensor to match those of the $z^w$ tensor. Next, the coordinate attention module uses concatenation, convolution, and activation function operations, as defined in Equation (3) [46]:
$$f = \delta(F_C([z^h, z^w])) \tag{3}$$
where [,] indicates a concatenation operation, $F_C$ indicates a 1 × 1 convolution operation, and $\delta$ indicates the ReLU activation function. $f$ is the output feature map of the ReLU layer. After the split operation, $f$ is decomposed into $f^h \in \mathbb{R}^{C/r \times 1 \times H}$ and $f^w \in \mathbb{R}^{C/r \times 1 \times W}$. The reshape operation is used again to permute the dimensions of $f^h$ to obtain $f^h \in \mathbb{R}^{C/r \times H \times 1}$. Next, two 1 × 1 convolutional transforms $F_h$ and $F_w$ are used to transform $f^h$ and $f^w$, respectively, into tensors with the same number of channels as the input I1. Then, applying the sigmoid activation function [58] to the tensors updated by $F_h$ and $F_w$ gives the two outputs shown in Equations (4) and (5) [46]:
$$g^h = \sigma(F_h(f^h)) \tag{4}$$
$$g^w = \sigma(F_w(f^w)) \tag{5}$$
where $\sigma$ indicates the sigmoid activation function. The Resize operation expands $g^h \in \mathbb{R}^{C \times H \times 1}$ and $g^w \in \mathbb{R}^{C \times 1 \times W}$ to the same size as the input $I1 \in \mathbb{R}^{C \times H \times W}$, and the resized $g^h$ and $g^w$ are used as attention weights. Finally, the output feature map I2 of the CA block is defined by Equation (6) [46]:
$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{6}$$
where $c$ is the $c$-th channel, $g_c^h(i)$ is the weight of the $i$-th position in the H direction, $g_c^w(j)$ is the weight of the $j$-th position in the W direction, and $y_c(i, j)$ is the value of the output feature map I2.
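The CA steps referenced as Equations (1)-(6) can be sketched in plain Python on nested lists of shape C × H × W. This is a minimal illustration under stated assumptions, not the authors' implementation: the 1 × 1 convolutions are written as small weight matrices (`w_c`, `w_h`, `w_w`, all hypothetical names), biases and any normalization layer are omitted, and ReLU is used for δ as stated in the text.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def pool_h(x):
    """Eq. (1): average over the width, giving one value per (channel, row)."""
    return [[sum(row) / len(row) for row in ch] for ch in x]

def pool_w(x):
    """Eq. (2): average over the height, giving one value per (channel, column)."""
    return [[sum(ch[i][j] for i in range(len(ch))) / len(ch)
             for j in range(len(ch[0]))] for ch in x]

def conv1x1(feat, weight):
    """1x1 convolution on a (C_in x L) map; weight is a (C_out x C_in) matrix."""
    length = len(feat[0])
    return [[sum(w * feat[c][l] for c, w in enumerate(w_row)) for l in range(length)]
            for w_row in weight]

def coordinate_attention(x, w_c, w_h, w_w):
    """Return y with y[c][i][j] = x[c][i][j] * g_h[c][i] * g_w[c][j] (Eq. (6))."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    z_h, z_w = pool_h(x), pool_w(x)
    cat = [z_h[c] + z_w[c] for c in range(channels)]                # C x (H + W)
    f = [[max(0.0, v) for v in row] for row in conv1x1(cat, w_c)]   # Eq. (3), ReLU
    f_h = [row[:height] for row in f]                               # split along H
    f_w = [row[height:] for row in f]                               # split along W
    g_h = [[sigmoid(v) for v in row] for row in conv1x1(f_h, w_h)]  # Eq. (4)
    g_w = [[sigmoid(v) for v in row] for row in conv1x1(f_w, w_w)]  # Eq. (5)
    return [[[x[c][i][j] * g_h[c][i] * g_w[c][j] for j in range(width)]
             for i in range(height)] for c in range(channels)]

# Toy input with C = 2, H = W = 2 and reduction to C/r = 1 channel.
x = [[[1, 2], [3, 4]], [[4, 4], [4, 4]]]
w_c = [[1.0, 1.0]]        # C/r x C
w_h = [[0.0], [0.0]]      # C x C/r; zero weights make every gate sigmoid(0) = 0.5
w_w = [[0.0], [0.0]]
y = coordinate_attention(x, w_c, w_h, w_w)
print(y[0])
```

With the zero gate weights used here, every output value is the input scaled by 0.5 × 0.5; learned weights would instead produce row-specific and column-specific gates, which is what makes the attention position-sensitive.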

Feature Fusion Module
Studies of deep learning-based CD have found that performance is unsatisfactory if the task relies only on simple feature extraction networks. On the one hand, simple feature extraction networks cannot eliminate semantic interference such as seasonal appearance differences and cannot accurately label change regions in the presence of diverse object shapes and complex boundaries. On the other hand, multi-scale information is not fully exploited; fusing multi-scale features so that they communicate with each other can help our network improve its performance.
Therefore, as shown in Figure 1, an attention-based Feature Fusion Module (FFM) is introduced into the CD network. The detail of the FFM is shown in Figure 3.
The proposed FFM is slightly different at different stages. The FFM whose input features are from FEM is named FFM_S1, and the FFM whose input features are from RRB is named FFM_S2. Specifically, the difference between FFM_S1 and FFM_S2 lies in the input part. The inputs of FFM_S1 are two feature maps of different scales of one image, while the input of FFM_S2 is two feature maps of the same scale of two images.
After FEM processing, two sets of updated feature maps, $F^1_1, F^2_1, F^3_1, F^4_1$ and $F^1_2, F^2_2, F^3_2, F^4_2$, are obtained. For FFM_S1, the inputs are the feature map pairs $(F^1_1, F^3_1)$, $(F^2_1, F^4_1)$, $(F^1_2, F^3_2)$, and $(F^2_2, F^4_2)$, respectively. Figure 3 shows FFM_S1; the structure of FFM_S2 is not drawn separately because the two differ only in their inputs. However, it should be emphasized that FFM_S2, whose two input feature maps have the same scale, does not distinguish between high-level and low-level features and does not need to up-sample high-level features as FFM_S1 does. FFM_S1 is described next. Experiments show that fusing features across layers is more effective. This may be because, as the number of convolution layers increases, the high-level features lose some of the information carried by the original image or the low-level features, such as edge features, and fusion with low-level features can compensate for this deficiency. At the same time, the semantic information carried by the feature maps of neighboring layers does not differ much, so cross-layer fusion is more beneficial than fusing adjacent layers. For the original image T1, the feature map pairs $(F^1_1, F^3_1)$ and $(F^2_1, F^4_1)$ are fed into FFM_S1. For the original image T2, the feature map pairs $(F^1_2, F^3_2)$ and $(F^2_2, F^4_2)$ are fed into FFM_S1. As shown in Figure 3, the high-level feature needs an up-sampling operation to make its shape consistent with the low-level feature. Next, one 1 × 1 convolution is used to obtain the feature map $F2 \in \mathbb{R}^{C \times H \times W}$. The two feature maps $F1 \in \mathbb{R}^{C \times H \times W}$ and F2 are concatenated to obtain the feature map $F3 \in \mathbb{R}^{2C \times H \times W}$. The resulting feature map can be viewed as a feature map with different channels. The calculation of F3 is given by Equation (7); that is, F3 is the concatenation [F1, F2], with F2 obtained by applying Conv to the up-sampled high-level feature, where Conv denotes the 1 × 1 convolution and [.,.] denotes the concatenation operation.
Because this direct cross-layer aggregation does not yet let the features communicate well in the channel and spatial dimensions, F3 is fed into the CBAM. CBAM is an attention module consisting of channel attention and spatial attention. It considers both the importance of pixels in different channels and the importance of pixels at different positions in the same channel. The CBAM outputs the feature map $F4 \in \mathbb{R}^{2C \times H \times W}$. Then, a 3 × 3 convolution block is applied, whose main purpose is to restore the number of channels of the aggregated feature map to that of the input feature map. This calculation is given by Equation (8); that is, Conv is applied to the CBAM output F4, where Conv denotes the 3 × 3 convolution block. The two parts of CBAM, namely the channel attention module and the spatial attention module, are described in detail in the following two subsubsections.
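To make the tensor sizes in FFM_S1 concrete, the following sketch traces only the (channels, height, width) shapes through the steps described above; it performs no actual convolution or attention. The function name and the example sizes (C = 64, H = W = 64) are assumptions chosen for illustration.

```python
def ffm_s1_shapes(low_shape, high_shape):
    """Trace (channels, height, width) through the FFM_S1 steps described in the text."""
    c_low, h_low, w_low = low_shape
    c_high, _, _ = high_shape
    upsampled = (c_high, h_low, w_low)   # bilinear up-sampling to the low-level size
    f2 = (c_low, h_low, w_low)           # 1x1 convolution: c_high -> c_low channels
    f3 = (c_low + f2[0], h_low, w_low)   # concatenation [F1, F2] along channels
    f4 = f3                              # CBAM reweights features, shape unchanged
    out = (c_low, h_low, w_low)          # 3x3 conv block restores the input channels
    return {"upsampled": upsampled, "F2": f2, "F3": f3, "F4": f4, "out": out}

# Pair (level 1, level 3): C x H x W fused with 4C x H/4 x W/4.
print(ffm_s1_shapes((64, 64, 64), (256, 16, 16)))
# Pair (level 2, level 4): 2C x H/2 x W/2 fused with 8C x H/8 x W/8.
print(ffm_s1_shapes((128, 32, 32), (512, 8, 8)))
```

In both cases the output keeps the low-level shape, which is why the fused maps of the two temporal images can later be combined at matching scales.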

Channel Attention Module
In the Channel Attention Module (CAM), the vectors $F^{ca}_{avg} \in \mathbb{R}^{B \times C \times 1}$ and $F^{ca}_{max} \in \mathbb{R}^{B \times C \times 1}$ are obtained by the average-pooling and max-pooling operations, respectively. Each of them is then input to a shared multi-layer perceptron (MLP) with one hidden layer to produce two vectors, which are merged into one feature vector by element-wise summation. After sigmoid activation, the feature map of the CAM is finally obtained. This is given by Equation (9) [55], which applies $\delta$ to the sum $FC_1(FC_0(F^{ca}_{avg})) + FC_1(FC_0(F^{ca}_{max}))$, where $FC_0$ and $FC_1$ denote the convolution operations in the MLP and $\delta$ denotes the sigmoid function. The CAM compresses the spatial dimensions of the feature map into a one-dimensional vector before operating on it. Channel attention is concerned with what is significant in the feature map. During gradient backpropagation, average-pooling passes gradients to every pixel of the feature map, whereas max-pooling passes gradients only to the location where the response is greatest.
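The CAM computation can be sketched for a single sample (no batch dimension) as follows. This is an illustrative sketch, assuming the two MLP layers are plain weight matrices without bias (`fc0`, `fc1`) and that a ReLU follows the first layer as in CBAM [55]; the toy weights are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def linear(vec, weight):
    """Multiply an (out x in) weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def shared_mlp(vec, fc0, fc1):
    hidden = [max(0.0, h) for h in linear(vec, fc0)]  # ReLU after FC0, as in CBAM [55]
    return linear(hidden, fc1)

def channel_attention(x, fc0, fc1):
    """x: C x H x W nested lists. Returns one attention weight per channel."""
    flat = [[v for row in ch for v in row] for ch in x]
    f_avg = [sum(ch) / len(ch) for ch in flat]   # spatial average-pooling
    f_max = [max(ch) for ch in flat]             # spatial max-pooling
    summed = [a + m for a, m in zip(shared_mlp(f_avg, fc0, fc1),
                                    shared_mlp(f_max, fc0, fc1))]
    return [sigmoid(s) for s in summed]

def apply_channel_attention(x, weights):
    """Scale every channel of x by its attention weight."""
    return [[[v * w for v in row] for row in ch] for ch, w in zip(x, weights)]

x = [[[0, 2], [0, 2]], [[1, 1], [1, 1]]]   # C = 2, H = W = 2
fc0 = [[1.0, 0.0]]                         # hidden size 1 (channel reduction)
fc1 = [[1.0], [0.0]]
att = channel_attention(x, fc0, fc1)
print(att)
print(apply_channel_attention(x, att))
```

Because the same `fc0` and `fc1` are used for both pooled vectors, the two descriptors are mapped into a common space before they are summed, which is the role of the shared MLP described above.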

Spatial Attention Module
In the Spatial Attention Module (SAM), the feature map output from the CAM is used as input. First, max-pooling and average-pooling are performed along the channel dimension, and the two resulting maps are concatenated. Then, a convolution reduces the result to one channel, and the feature map output from the SAM is obtained by sigmoid activation. This is given by Equation (10) [55], which applies $\delta$ to the output of $f^{7 \times 7}$ on the Cat of the two pooled maps, where Cat is the concatenation operation, $f^{7 \times 7}$ represents a convolutional layer with a filter size of 7 × 7, and $\delta$ denotes the sigmoid function. The SAM is a channel compression mechanism that performs average-pooling and max-pooling along the channel dimension. The max-pooling operation extracts the maximum value across the channels at each position, so the number of extractions is H × W. The average-pooling operation extracts the average value across the channels at each position, so the number of extractions is also H × W. Thus, a 2-channel feature map is obtained.
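The SAM steps can be sketched in plain Python as follows. This is a minimal illustration, assuming zero padding so that the 7 × 7 convolution preserves the H × W size, no bias term, and the [average, max] channel order used by CBAM [55]; the helper `center_tap_kernel` and its weights are hypothetical and only serve to make the result easy to check.

```python
import math

def channel_pool(x):
    """x: C x H x W. Returns (max_map, avg_map), each H x W, pooled over channels."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    max_map = [[max(x[c][i][j] for c in range(channels)) for j in range(width)]
               for i in range(height)]
    avg_map = [[sum(x[c][i][j] for c in range(channels)) / channels
                for j in range(width)] for i in range(height)]
    return max_map, avg_map

def conv2d_same(maps, kernel):
    """Zero-padded convolution of several H x W maps into one H x W output.

    kernel[ch] is the k x k weight grid applied to maps[ch].
    """
    height, width = len(maps[0]), len(maps[0][0])
    k = len(kernel[0])
    pad = k // 2
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for ch, m in enumerate(maps):
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < height and 0 <= jj < width:
                            acc += kernel[ch][di][dj] * m[ii][jj]
            out[i][j] = acc
    return out

def spatial_attention(x, kernel):
    """Return an H x W attention map with values in (0, 1)."""
    max_map, avg_map = channel_pool(x)
    logits = conv2d_same([avg_map, max_map], kernel)   # 2 channels -> 1 channel
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in logits]

def center_tap_kernel(size, value):
    """Hypothetical k x k kernel that is zero except for its center weight."""
    kernel = [[0.0] * size for _ in range(size)]
    kernel[size // 2][size // 2] = value
    return kernel

x = [[[1, 2], [3, 4]], [[3, 2], [1, 0]]]               # C = 2, H = W = 2
kernel = [center_tap_kernel(7, 1.0), center_tap_kernel(7, 0.0)]
sam = spatial_attention(x, kernel)
print(channel_pool(x))
print(sam)
```

The resulting H × W map is multiplied position-wise with every channel of the input, so, unlike the CAM, the SAM weights locations rather than channels.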

Refinement Residual Block
The use of a single 3 × 3 convolutional kernel has some shortcomings in refining the feature representation. Inspired by Yu et al. [59], a Refinement Residual Block (RRB) is introduced to modify the channels of the aggregated feature map to be consistent with the input feature map and further refine the feature representation before the final feature fusion using FFM_S2. Its structure is shown in Figure 4.
As can be seen in Figure 4, the RRB has three inputs, one of which is the difference map of two feature maps. The three feature maps are first subjected to a concatenation operation, followed by two consecutive convolution blocks, each consisting of Conv 3 × 3, BN, and ReLU. The two convolution blocks output the feature maps $F1 \in \mathbb{R}^{C \times H \times W}$ and $F2 \in \mathbb{R}^{C \times H \times W}$, respectively. Here, it should be noted that the number of channels of each convolutional block output is different. In addition, the module adds residual connections with 1 × 1 convolutional layers to obtain additional spatial information from the remote sensing images. Finally, the four feature maps are subjected to element-wise summation, and the final output feature map $F4 \in \mathbb{R}^{C \times H \times W}$ is obtained.

Loss Function
In the training stage, the cross-entropy loss function used by Chen et al. [43] is adopted, and the network parameters are optimized by minimizing it. Formally, the loss function is defined by Equation (11) [43]:
$$L = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} l(P_{hw}, Y_{hw}) \tag{11}$$
where $H \times W$ is the size of the prediction map, $l(P_{hw}, y) = -\log P_{hwy}$ is the cross-entropy loss, and $Y_{hw}$ is the label for the pixel at location $(h, w)$ [43].
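The pixel-wise loss can be illustrated with a short sketch that averages $-\log P_{hwy}$ over all pixel locations. It assumes the prediction is already a per-pixel probability distribution (for example after a softmax) stored as nested lists; the function name and the toy values are illustrative only.

```python
import math

def mean_cross_entropy(probs, labels):
    """probs[h][w][k]: predicted probability of class k; labels[h][w]: class index."""
    height, width = len(labels), len(labels[0])
    total = 0.0
    for h in range(height):
        for w in range(width):
            total += -math.log(probs[h][w][labels[h][w]])
    return total / (height * width)

# A 1 x 2 prediction map with two classes (0 = unchanged, 1 = changed).
probs = [[[0.5, 0.5], [0.25, 0.75]]]
labels = [[0, 1]]
print(mean_cross_entropy(probs, labels))
```

A confident correct prediction contributes a value close to zero, while a confident wrong prediction contributes a large value, so minimizing the average pushes each pixel's predicted probability toward its label.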

Experiments and Results
In this section, the proposed network MAFF-Net is evaluated on three publicly available benchmark datasets to demonstrate its effectiveness. First, the details of the three datasets, the CDD dataset [60], the LEVIR-CD dataset [42], and the WHU-CD dataset [61], are introduced. Next, the implementation details are presented, including the experimental environment and evaluation metrics. Then, seven state-of-the-art (SOTA) comparison methods are introduced. Finally, quantitative and qualitative analyses of these methods are presented on the three datasets.

Datasets and Settings
The CDD dataset has three types of images: synthetic images with no relative movement of objects, synthetic images with small relative movement of objects, and real remote sensing images with seasonal changes (obtained from Google Earth). In this paper, the subset of real remote sensing images with seasonal changes is selected. This subset has 16,000 images with an image size of 256 × 256 pixels, of which 10,000 images are used as the training set, 3000 images as the validation set, and 3000 images as the test set. As shown in Figure 5, the change scenarios of this dataset include building changes, road changes, and vehicle changes. The dataset takes objects of different sizes into account.
The third dataset is the WHU-CD dataset, which is a public building CD dataset. The dataset covers an area that was struck by a 6.3-magnitude earthquake in February 2011 and was rebuilt in the following years. It consists of a pair of HR (0.075 m) aerial images of size 32,507 × 15,354. Because the authors of the original paper did not provide a scheme for splitting the data, as shown in Figure 7, the images were cropped into small patches of size 224 × 224 and randomly divided into three parts: 7918/987/955 for training/validation/testing, respectively.
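The WHU-CD patch counts can be checked with a short calculation. The sketch below assumes non-overlapping 224 × 224 tiles with incomplete border tiles discarded, which is an assumption that happens to reproduce the total of 7918 + 987 + 955 patches; the seeded shuffle is only one possible way to draw a random split, and the function names are illustrative.

```python
import random

def tile_grid(width, height, tile):
    """Number of complete non-overlapping tiles along each axis and in total."""
    cols, rows = width // tile, height // tile
    return cols, rows, cols * rows

def random_split(n_items, sizes, seed=0):
    """Shuffle the indices 0..n_items-1 and cut them into consecutive parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    splits, start = [], 0
    for size in sizes:
        splits.append(indices[start:start + size])
        start += size
    return splits

cols, rows, n_tiles = tile_grid(32507, 15354, 224)
train_ids, val_ids, test_ids = random_split(n_tiles, (7918, 987, 955))
print(cols, rows, n_tiles)                           # 145 68 9860
print(len(train_ids), len(val_ids), len(test_ids))   # 7918 987 955
```

The split corresponds to roughly 80%, 10%, and 10% of the patches for training, validation, and testing.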

Evaluation Metrics and Settings
For quantitative assessment, three indices, namely the F1-score (F1), Kappa coefficient (Kappa), and overall accuracy (OA), are used as the evaluation metrics. These three indices are calculated as follows:
$$F1 = \frac{2 \times P \times R}{P + R}, \quad P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$$
$$OA = \frac{TP + TN}{TP + FP + TN + FN}$$
$$Kappa = \frac{OA - PRE}{1 - PRE}, \quad PRE = \frac{(TP + FN)(TP + FP) + (TN + FP)(TN + FN)}{(TP + FP + TN + FN)^2}$$
where P and R denote precision and recall, and OA and PRE denote the overall accuracy and expected accuracy, respectively. TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively.
We implemented our proposed method with PyTorch, supported by NVIDIA CUDA with a GeForce RTX 2080 Ti GPU. In the training stage, the feature extraction backbone of the proposed MAFF-Net is initialized from ResNet18. We used the Adam optimizer (β1 = 0.5, β2 = 0.9), and the entire training period was set to 200 epochs. The learning rate is 0.001 for the first 100 epochs and then decays linearly to 0 over the next 100 epochs. Considering the GPU memory, we set the batch size to 8.
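As a worked illustration of the three indices, the following sketch computes F1, Kappa, and OA from the four confusion-matrix counts using the standard definitions. The function name and the example counts are illustrative and are not results from the paper.

```python
def cd_metrics(tp, fp, tn, fn):
    """Return F1, Kappa, and OA computed from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / total
    # Expected accuracy under chance agreement between prediction and label.
    pre = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (total * total)
    kappa = (oa - pre) / (1 - pre)
    return {"F1": f1, "Kappa": kappa, "OA": oa}

print(cd_metrics(tp=40, fp=10, tn=40, fn=10))
```

For these balanced toy counts, F1 and OA are both 0.8 while Kappa is about 0.6, because Kappa discounts the 0.5 agreement expected by chance. On CD datasets, where unchanged pixels dominate, OA tends to be high for every method, which is why F1 and Kappa are usually more informative.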

Comparison of Experimental Results
In this section, the performance of the different methods is compared on the three datasets CDD, LEVIR-CD, and WHU-CD, respectively. The advantages and disadvantages of each method are further described based on the results of the quantitative and qualitative analyses. In addition, an ablation study is performed on the proposed method to compare and analyze the effectiveness of each of its modules.

Comparison Methods
To verify the effectiveness and superiority of our method, we selected seven representative CD methods and compared their performance on CDD, LEVIR-CD, and WHU-CD. A brief description of the selected methods is as follows:
1. CD-Net [62] combines multi-sensor fusion SLAM and fast dense 3D reconstruction for coarse alignment of image pairs, followed by deep learning methods for pixel-level CD.
2. FC-EF [38] refers to early fusion with full convolution. It concatenates the two input images before feeding them into the network, treating them as different channels of one image. The result is then fed into a standard U-Net.
3. FC-Siam-conc [38] connects three feature maps, those from the two encoder branches and that of the corresponding layer of the decoder.
4. FC-Siam-diff [38] first computes the absolute value of the difference between the feature maps of the two encoder branches and then makes a skip connection to the corresponding layer of the decoder.
5. DASNet [44] is a CD model based on a dual-attention fully convolutional Siamese network. It proposes a weighted double-margin contrastive loss (WDMC) to address the sample imbalance problem.
6. IFN [45] first uses a Siamese network with two branches as the feature extraction network for the raw images. To enhance the integrity of change map boundaries and internal densities, multi-level deep features are fused with image difference map features by an attention mechanism.
7. STANet [42] proposes a new spatial-temporal attention neural network based on Siamese networks. The network exploits spatial-temporal dependence and designs a CD self-attention mechanism to model spatial-temporal relations. A new HR remote sensing image dataset, LEVIR-CD, is also proposed.

CDD Dataset
For quantitative comparison, we calculated and summarized the evaluation metrics for CDD, LEVIR-CD, and WHU-CD, as shown in Tables 2-4, respectively. To compare the performance of each method more intuitively, we visualized the test results of each method on the three datasets, as shown in Figures 8-10, respectively. White indicates changed regions that were correctly detected. Black indicates unchanged regions that were correctly identified. Red indicates false alarms. Blue indicates missed changes.
As can be seen from Table 2, the proposed MAFF-Net ranks first in F1, Kappa, and OA on the CDD dataset, which indicates that the proposed network performs best on this dataset. It is also evident from Figure 8 that the proposed network marks the change regions better, with few false and missed detections. Specifically, as can be seen from the data in Table 2, CD-Net, which does not pay attention to the connections and interactions between multi-scale features, performs relatively poorly on the three evaluation metrics and is 14.6% lower than the proposed MAFF-Net in terms of F1 score. This is partly related to its fewer network levels and relatively simple structure. Considering early fusion and late fusion strategies separately and using skip-connected encoding-decoding, the FC-EF, FC-Siam-conc, and FC-Siam-diff baselines achieve better performance with their compact and efficient structures. Among these three baselines, the late fusion baselines show a clear advantage over the early fusion baseline. Fusing the feature maps of the bi-temporal image pair together with their difference maps achieves better results than fusing the feature maps of the bi-temporal image pair alone: FC-Siam-diff scores 0.8%, 0.9%, and 0.1% higher than FC-Siam-conc on F1, Kappa, and OA, respectively. This is because the original image coding features are preserved as much as possible while the difference maps are obtained, which helps the network achieve better performance. Because its attention mechanism can further focus on the information exchange between feature maps, DASNet works better than FC-EF. IFN pays more attention to the connection and interaction of multi-scale information. It introduces channel attention and spatial attention and uses a post-fusion strategy for deep supervision. Its F1 and Kappa scores reach 90.1% and 89.2%, respectively. STANet proposes a spatial-temporal attention module based on a feature pyramid to better adapt the network to the detection task of complex scenes, and it ranks second in all evaluation metrics. The proposed MAFF-Net achieves the highest score in all metrics. It detects and labels the change regions better than the other methods because the network employs an attention-based cross-layer feature fusion strategy and a refinement residual block that further improves the detection performance.
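As a point of reference for the F1, Kappa, and OA values discussed above, the following minimal sketch computes these metrics from pixel-level confusion counts. It assumes the standard binary-classification definitions of the three metrics; the function name `cd_metrics` and the example counts are illustrative and are not taken from the paper.

```python
def cd_metrics(tp, fp, fn, tn):
    """Return F1, Kappa, and OA (as fractions) from pixel-level confusion counts.

    tp: changed pixels predicted as changed
    fp: unchanged pixels predicted as changed (false alarms)
    fn: changed pixels predicted as unchanged (missed detections)
    tn: unchanged pixels predicted as unchanged
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion counts must not all be zero")

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Overall accuracy: fraction of correctly classified pixels.
    oa = (tp + tn) / total

    # Expected agreement by chance, from the marginal totals.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (total * total)
    kappa = (oa - pe) / (1 - pe) if pe != 1 else 0.0

    return {"F1": f1, "Kappa": kappa, "OA": oa}


# Illustrative counts only (not results from the paper).
example = cd_metrics(tp=40, fp=10, fn=10, tn=40)
```

Because unchanged pixels usually dominate CD datasets, OA tends to stay high even for weak models, which is consistent with the small OA differences and larger F1 and Kappa differences reported above.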
The qualitative results in Figure 8 allow a further analysis of the performance of each network. For visual analysis, eight challenging sets of bi-temporal images were selected and visualized. Each set of images contains change regions of different extents or different change scenes. Among the three FCN-based baselines, FC-Siam-conc and FC-Siam-diff give better results than FC-EF. As can be seen in Figure 8, FC-EF marks only a small number of change regions (Figure 8a) and performs poorly for smaller change regions and more complex scenes (Figure 8b-h). This is because it does not preserve the features of each original image, especially the shallow features, which makes the detected change regions significantly inaccurate. In general, the other two baselines perform better than FC-EF, as evidenced by the completeness of the detected change regions in the illustrations. However, they still suffer from many missed and false detections, as in Figure 8b-g. In particular, in Figure 8e, they do not detect the change region at all. Therefore, there is still potential for improvement. By introducing dual attention in the decoding stage, DASNet can detect most of the change regions. However, its detection performance for small change regions needs to be improved. For example, in Figure 8e, its detection results contain many missed regions as well as false detection regions. This demonstrates that it is not yet accurate enough in terms of the boundaries and details of the change regions. In addition, it does not perform well in Figure 8b,f,h, where both false detections and missed detections occur.
IFN and STANet are relatively more complete in terms of local detail because of the introduction of channel and spatial attention. However, they still produce false positives and false negatives when detecting some very small target regions or edges, as shown by the red and blue regions in Figure 8c,e,g,h. Some regions are over-smoothed, and some edge information is ignored to some extent. The proposed MAFF-Net labels the change regions better and accurately detects their edges. It can be seen from the exhibited samples that there are very few red and blue regions representing false and missed detections. In particular, the detection performance is good for small and complex change regions, as shown in Figure 8e-h. This also demonstrates that the proposed network can detect the change regions accurately in general.
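The color coding used in the qualitative figures (white for correctly detected changes, black for correctly identified unchanged regions, red for false alarms, blue for missed changes) can be sketched as follows. This is a minimal illustration under the assumption that the prediction and the ground truth are binary masks of equal size with 1 denoting change; the function name `color_code` and the RGB values are hypothetical and are not taken from the authors' code.

```python
# RGB colors for the four possible outcomes per pixel.
WHITE = (255, 255, 255)  # change correctly detected (true positive)
BLACK = (0, 0, 0)        # no change correctly identified (true negative)
RED = (255, 0, 0)        # false alarm (false positive)
BLUE = (0, 0, 255)       # missed change (false negative)

# Keys are (predicted label, ground-truth label), with 1 meaning "changed".
PALETTE = {(1, 1): WHITE, (0, 0): BLACK, (1, 0): RED, (0, 1): BLUE}


def color_code(pred, gt):
    """Map a binary prediction and ground-truth mask to an RGB error map.

    pred, gt: 2D lists (rows of 0/1 values) with identical shapes.
    Returns a 2D list of RGB tuples with the same shape.
    """
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("prediction and ground truth must have the same shape")
    return [
        [PALETTE[(int(p), int(g))] for p, g in zip(pred_row, gt_row)]
        for pred_row, gt_row in zip(pred, gt)
    ]


# Tiny 2x2 example covering all four outcomes.
error_map = color_code(pred=[[1, 0], [1, 0]], gt=[[1, 0], [0, 1]])
```

In such a map, a method with few red and blue pixels agrees closely with the GT, which is the criterion used in the visual comparisons above.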

LEVIR-CD Dataset
As can be seen from Table 3, the difference in performance between the three FCN-based baselines is not significant. FC-Siam-diff scores highest among the three, with F1 and Kappa scores of 83.7% and 82.8%, respectively. By introducing dual attention, DASNet improves the F1 score by about 0.9% compared to the three FCN-based baselines. IFN ranks third in F1 score with 86.2%, and its Kappa and OA scores are also good. However, all of its metrics are lower than those of STANet, which may be because STANet pays more attention to multi-scale information while introducing attention. By introducing an attention mechanism involving multiple scales, the proposed MAFF-Net improves the F1 score to 89.7%, which is better than the other compared methods. Moreover, its Kappa and OA reach the highest values among the compared methods, with 89.1% and 98.7%, respectively. Figure 9 illustrates the change maps on eight selected sets of bi-temporal images. The change regions in these images cover multiple scenes, areas, shapes, and distribution ranges. For the multiple regularly shaped building changes in Figure 9a,b, the overall contours of the buildings are correctly detected. However, the detection results of the CD-Net and FC-EF methods still contain obvious false detection and missed detection areas. Although STANet can locate the change regions, its detection of more complex and small change regions is not entirely correct. For example, as shown in Figure 9f,h, the proposed MAFF-Net is more accurate than the other methods, as seen from the fewer regions marked in red and blue. For Figure 9a,b,d, the results of the attention-based methods DASNet and STANet and of the proposed attention-guided cross-layer feature fusion network MAFF-Net are visually closer to the GT.
For the more densely distributed change regions in Figure 9c, DASNet, STANet, and MAFF-Net maintain visual correctness, while MAFF-Net has fewer errors and can accurately detect and distinguish multiple dense change regions. However, for Figure 9f-h, which contain more complex edges and smaller change regions, IFN, DASNet, and STANet do not perform well. In contrast, MAFF-Net shows better adaptability and can accurately detect change regions with complex shapes and small objects.

WHU-CD Dataset
According to the data in Table 4, the performance of the methods with FCN as the baseline does not differ much. The dual attention-based DASNet performs slightly better than IFN and STANet, with scores of 90.7%, 90.1%, and 99.0% for F1, Kappa, and OA, respectively. We attribute this to the fact that the weighted double-margin contrastive loss (WDMC) used by DASNet can alleviate the problem of sample imbalance. The proposed MAFF-Net achieves the best scores in all evaluation metrics compared to the other methods. Compared with the methods using FCN as the baseline, the proposed method obtains increases of 9.1%, 9.6%, and 1.0% in F1, Kappa, and OA, respectively. This also demonstrates the effectiveness of the proposed multi-attention-guided feature fusion method. Compared to DASNet, IFN, and STANet, the proposed method improves F1, Kappa, and OA by 1.7%, 2.0%, and 0.4%, respectively. These gains stem from our fusion strategy, which fully considers multi-scale features while effectively exploiting the advantages of attention to improve the network performance. For visual comparison, Figure 10 shows some typical CD results for the test samples in the WHU-CD dataset. As shown in Figure 10a,c-e,h, there are many missed detections and false detections in the compared methods. As shown in Figure 10c,e,h, CD-Net not only produces false detections but also has many missed detection regions. The performance of the FCN-based FC-EF, FC-Siam-conc, and FC-Siam-diff is improved, and their missed detection regions are significantly reduced. However, they still have the same problems as CD-Net, as shown in Figure 10d,e,h. In Figure 10e,h, the attention-based DASNet, IFN, and STANet do not perform well, with significant missed detection regions and some false detection regions. In terms of consistency with the GT, the proposed MAFF-Net achieves the best visual performance.
Specifically, as shown in the samples in Figure 10, MAFF-Net significantly reduces the missed detections and has a very low false detection rate compared with the other methods. In addition, the change maps generated by MAFF-Net have clearer and more accurate boundaries.

Ablation Study
In the CD task, our proposed model achieves superior performance. To validate the effectiveness and feasibility of the proposed method, we conducted a series of ablation experiments on the three HR datasets CDD, LEVIR-CD, and WHU-CD. In these experiments, the Baseline represents the ResNet18 network structure. In total, five configurations were evaluated: Baseline, Baseline+FEM, Baseline+FEM+FFM_S1, Baseline+FEM+FFM_S1+RRB, and MAFF-Net (Baseline+FEM+FFM_S1+RRB+FFM_S2). As shown in Figure 11, the Baseline does not achieve good performance in detecting change regions, especially when the change scene is more complex or the change region is small (Figure 11d). Compared with the Baseline, the Baseline+FEM method obtains richer features after adding the FEM, which helps the network detect most of the change regions. It can be seen that Baseline+FEM+FFM_S1 effectively removes some irrelevant information (Figure 11f), while further capturing the change features and refining the feature representation. The FFM_S1 module adopts a cross-layer fusion strategy, which helps the model fully fuse the features of high and low layers to achieve a better feature representation. Compared with the Baseline+FEM method, the Baseline+FEM+FFM_S1 method detects more accurate and complete change regions. However, this method still falls slightly short when faced with small change regions or weakly characterized features (Figure 11f-1). Therefore, the Baseline+FEM+FFM_S1+RRB method aims to further refine the feature representation, which helps to detect smaller change features and improve the network performance. As can be seen from Figure 11g, the change map obtained by this method is already very close to the change region of the GT.
Finally, the method proposed in this paper performs a final fusion of the feature maps to obtain a prediction map that is closest to the real change regions. As can be seen from Figure 11h, the change map obtained by the proposed method is very close to the GT, which also demonstrates the effectiveness of the proposed method. Meanwhile, the proposed method shows good accuracy on three different datasets. The comparison of the visualization results of each module thus demonstrates the effectiveness and accuracy of the proposed MAFF-Net. In addition, we compared the F1, Kappa, and OA values of the different configurations. As shown in Table 5, the model achieves the best performance when all proposed modules are added, which also supports the effectiveness of these modules. The Baseline+FEM method shows a significant improvement in the three metrics compared with the Baseline. On the CDD dataset, Kappa, F1, and OA increase by 6.4%, 5.6%, and 1.5%, respectively. On the LEVIR-CD dataset, Kappa, F1, and OA increase by 3.9%, 3.7%, and 0.4%, respectively. On the WHU-CD dataset, Kappa, F1, and OA increase by 4%, 3.9%, and 0.2%, respectively.
For the Baseline+FEM+FFM_S1 method, all metrics are improved compared to the Baseline+FEM method. On the CDD dataset, Kappa, F1, and OA improve by 1.1%, 1%, and 0.3%, respectively. On the LEVIR-CD dataset, Kappa, F1, and OA improve by 1.3%, 1.2%, and 0.1%, respectively. On the WHU-CD dataset, Kappa, F1, and OA improve by 1.6%, 1.5%, and 0.1%, respectively. All metrics improve on all datasets, indicating the validity of the proposed FFM_S1, while the joint use of FFM_S1 and FEM achieves better performance and makes the model more accurate.
For the Baseline+FEM+FFM_S1+RRB method, all metrics are improved compared with the Baseline+FEM+FFM_S1 method. On the CDD dataset, Kappa, F1, and OA improve by 1.6%, 1.3%, and 0.3%, respectively. On the LEVIR-CD dataset, Kappa, F1, and OA improve by 0.6%, 0.6%, and 0.1%, respectively. On the WHU-CD dataset, Kappa, F1, and OA improve by 0.6%, 0.5%, and 0.1%, respectively. All metrics improve on all datasets, indicating that the proposed RRB enhances the feature representation of the feature map, while the combined use of FFM_S1, FEM, and RRB leads to better performance of the model.
For the Baseline+FEM+FFM_S1+RRB+FFM_S2 method, all metrics are improved compared to the Baseline+FEM+FFM_S1+RRB method. On the CDD dataset, Kappa, F1, and OA improve by 0.6%, 0.6%, and 0.2%, respectively. On the LEVIR-CD dataset, Kappa and F1 improve by 0.9% and 0.9%, respectively. On the WHU-CD dataset, Kappa, F1, and OA improve by 0.6%, 0.5%, and 0.1%, respectively. These improvements indicate that FFM_S2 facilitates the exchange of multi-scale feature information during fusion, and that FFM_S1 and FFM_S2 reinforce each other in feature extraction. The experiments also show that MAFF-Net helps the network fuse multi-scale features and achieve multi-scale information communication, which improves the performance of the network.
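To make the per-module contributions easier to compare, the following sketch accumulates the F1 gains quoted in the ablation discussion above. It assumes that each quoted gain is measured in percentage points relative to the preceding configuration, so that the gains add up to the total improvement of the full model over the Baseline; the dictionary layout and the function name `cumulative_gain` are illustrative only.

```python
# F1 gains (percentage points) quoted in the text for each added module,
# in the order FEM, FFM_S1, RRB, FFM_S2.
F1_GAINS = {
    "CDD": [5.6, 1.0, 1.3, 0.6],
    "LEVIR-CD": [3.7, 1.2, 0.6, 0.9],
    "WHU-CD": [3.9, 1.5, 0.5, 0.5],
}


def cumulative_gain(step_gains):
    """Sum step-wise gains, assuming each is relative to the previous step."""
    return round(sum(step_gains), 1)


# Total F1 gain of the full model over the Baseline under this assumption.
totals = {name: cumulative_gain(gains) for name, gains in F1_GAINS.items()}
```

Under this additive assumption, the FEM accounts for more than half of the total F1 gain on each dataset, with the remaining modules contributing smaller refinements.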

Efficiency Analysis of the Proposed Network
Although the proposed MAFF-Net achieves encouraging performance, it has some potential limitations. The computational complexity of MAFF-Net is relatively high and the number of parameters is large, which is unfavorable for devices and applications with limited resources. In this section, the number of parameters (in millions, M) and the training time per epoch (in min/epoch) are used as quantitative indicators for evaluation. As shown in Figure 12, the number of trainable parameters of MAFF-Net is 49.08 million, which is the largest among the compared methods. However, the training efficiency of the proposed MAFF-Net is comparatively good. Compared with STANet and DASNet, the training time of the proposed method is reduced by 56.22% and 40.86%, respectively, which makes the proposed method more valuable in practical applications under the same equipment conditions. Considering the number of parameters and the training time together, the proposed method still has room for improvement. For example, model compression techniques such as pruning and knowledge distillation [63,64] can be applied to the proposed network to reduce the size of the model.
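The reported reductions in training time can be read as relative reductions with respect to the reference method. The sketch below shows this calculation; it assumes the usual definition (reference time minus our time, divided by reference time), and the per-epoch times in the example are hypothetical placeholders rather than measurements from the paper.

```python
def relative_reduction(reference_time, our_time):
    """Percentage by which our_time is lower than reference_time.

    Both arguments must use the same unit, e.g. min/epoch.
    """
    if reference_time <= 0:
        raise ValueError("reference_time must be positive")
    return 100.0 * (reference_time - our_time) / reference_time


# Hypothetical example: a reference method needing 10 min/epoch versus
# a method needing 4 min/epoch corresponds to a 60% reduction.
reduction = relative_reduction(reference_time=10.0, our_time=4.0)
```

Note that a large relative reduction in training time does not imply a smaller model: training time also depends on factors other than the parameter count, which is why the two indicators are reported separately.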

Conclusions
In this paper, we propose a novel feature fusion network for remote sensing image CD tasks. To enhance the feature representation, we propose a Feature Enhancement Module (FEM), which introduces Coordinate Attention (CA) that can capture long-range dependencies with precise location information while modeling inter-channel relationships. The FEM helps the network to further refine the features extracted by the backbone network ResNet18. The quantitative and qualitative analysis of the ablation study shows that adding the FEM improves the performance of the Baseline, which demonstrates the rationality and effectiveness of the FEM. Considering that layer-by-layer feature fusion may lose part of the semantic information, we propose an FFM employing a cross-layer feature fusion strategy. The FFM uses semantic cues in the high-level feature map to guide feature selection in the low-level feature map. In addition, to highlight changing regions and suppress useless features, we introduce a CBAM in the FFM, which combines the advantages of channel attention and spatial attention, allowing the model to learn which regions to focus on and pay more attention to critical information. Depending on the input features, we divide the FFM into FFM_S1 and FFM_S2, both of which further enhance the feature fusion effect. Based on the ablation study in Section 3, we can see that the FFM significantly improves the performance of the network. To compensate for the shortcomings of using a single convolutional kernel for feature refinement, we propose a Refinement Residual Block (RRB) that employs a residual structure. The RRB changes the number of channels of the aggregated features and uses convolutional blocks to further refine the feature representation.
Based on the quantitative and qualitative comparison between the proposed MAFF-Net and other methods, the proposed method is able to efficiently detect changing regions and has a strong ability to select features through a feature fusion strategy guided by multiple attention mechanisms. On the three publicly available benchmark datasets CDD, LEVIR-CD, and WHU-CD, the F1 scores of MAFF-Net are improved by 4.9%, 3.2%, and 1.7%, respectively, compared to the other methods. This demonstrates that our method performs better than the other SOTA methods.
However, it should be noted that, as shown in Figure 12, although the proposed model has an advantage in terms of training speed, its number of parameters is relatively large, reaching 49.08 M. This may limit its practical application in the future. Therefore, in future work, we hope to make the network lightweight by using model compression techniques. In addition, the proposed method addresses the CD task for bi-temporal remote sensing images, and in future work we will focus on the CD task for multi-temporal remote sensing images.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The