MSGFNet: Multi-Scale Gated Fusion Network for Remote Sensing Image Change Detection

Abstract: Change detection (CD) stands out as a pivotal yet challenging task in the interpretation of remote sensing images. Significant developments have been witnessed, particularly with the rapid advancements in deep learning techniques. Nevertheless, challenges such as incomplete detection targets and unsmooth boundaries remain, as most CD methods suffer from ineffective feature fusion. Therefore, this paper presents a multi-scale gated fusion network (MSGFNet) to improve the accuracy of CD results. To effectively extract bi-temporal features, the EfficientNetB4 model based on a Siamese network is employed. Subsequently, we propose a multi-scale gated fusion module (MSGFM) that comprises a multi-scale progressive fusion (MSPF) unit and a gated weight adaptive fusion (GWAF) unit, aimed at fusing bi-temporal multi-scale features to maintain boundary details and detect changed targets completely. Finally, we use the simple yet efficient UNet structure to recover the feature maps and predict results. To demonstrate the effectiveness of the MSGFNet, the LEVIR-CD, WHU-CD, and SYSU-CD datasets were utilized, and the MSGFNet achieved F1 scores of 90.86%, 92.46%, and 80.39% on the three datasets, respectively. Furthermore, the low computational cost and small model size of the MSGFNet underline its efficiency.


Introduction
The advance of satellite imaging technology has facilitated the acquisition of remote sensing images (RSIs). Change detection (CD) is the process of identifying changes on the ground within the same geographical area utilizing RSIs taken at two different times [1]. Due to its wide application in urban sprawl detection [2], urban green ecosystems [3], damage assessment [4], etc., CD, as a fundamental and important task, has increasingly gained attention in the remote sensing field.
During the early stages of CD research, numerous methods were proposed by researchers [5,6]. For example, image differencing, which subtracts bi-temporal images pixel by pixel, was one of the earliest CD methods [7]. To address spurious changes and counter positional errors, a robust change vector analysis method was proposed by Thonfeld et al. [8], combining intensity information with the advantages of change vector analysis (CVA). Researchers have made substantial progress through extensive research on these traditional methods [9-11]. However, these traditional CD methods face new challenges with the increased spatial resolution of remote sensing images. On the one hand, traditional CD methods are designed for medium- and low-resolution RSIs, resulting in poor performance when dealing with the rich information in high-resolution RSIs [12]. On the other hand, these methods rely on handcrafted features that are sensitive to radiation differences and illumination changes [13,14]. Consequently, the application of traditional CD methods is limited in scope.
Recently, with the advent of the big data era, deep neural networks have demonstrated their strong feature extraction capabilities [15,16], with the end-to-end advantages of convolutional neural networks (CNNs) being particularly notable. CNNs have been widely employed in CD tasks and have spawned a number of promising CD methods [17,18]. For example, Zhang et al. [19] integrated a CycleMLP block into a Siamese network, proposing an MLP-based method for CD. However, it is important to note that this method incurs a substantial inference time. Fang et al. [20] introduced a CD method that combines the UNet++ architecture with a Siamese network. This method mitigates the loss of localization feature information by establishing a dense connection between the encoder and decoder.
Although the methods mentioned above have achieved good results, they do not consider the characteristics of bi-temporal multi-scale features, thereby resulting in incomplete detection targets and limited accuracy. Inspired by the widely used multi-scale pyramid architecture for extracting multi-scale feature information in medical image segmentation [21], several methods have been proposed to address these problems by using multi-scale features [22-24]. For instance, Li et al. [23] proposed a multi-scale convolutional channel attention mechanism to generate detailed local features and integral global features. To capture feature information on all scales, Xiang et al. [22] introduced a multi-receptive field position enhancement module incorporating convolutional layers with different kernel sizes. Despite the improvements achieved by the above methods through the incorporation of multi-scale features, they still exhibit certain shortcomings. On the one hand, these methods employ a simple concatenation strategy for fusing multi-scale features without considering the interaction between them. On the other hand, they extract multi-scale features after a simple feature fusion (i.e., feature difference) rather than employing bi-temporal multi-scale feature fusion. Consequently, the features obtained by such simple fusion are often not discriminative enough, which results in unsmooth detection boundaries.
To address such problems, this study investigates the multi-scale fusion of bi-temporal features to detect complete change targets and improve the accuracy of results, and we further propose a multi-scale gated fusion network (MSGFNet). In particular, we opt for a lightweight model, namely EfficientNetB4 [25], as the encoder for constructing the Siamese architecture. This architecture is utilized to extract multi-layer features from bi-temporal images. Then, we propose a multi-scale gated fusion module (MSGFM) that has a multi-scale progressive fusion (MSPF) unit and a gated weight adaptive fusion (GWAF) unit. This module aims to obtain discriminative fusion features, improving the details of boundaries and effectively detecting the complete change targets. Finally, the decoder processes the fused multi-scale features to gradually reconstruct the results. The main contributions of this study may be summarized as follows:
1. We propose a novel end-to-end CD network, namely the multi-scale gated fusion network (MSGFNet). The MSGFNet is designed with a weight-sharing Siamese architecture tailored to be compatible with the CD task;
2. To improve the details of boundaries and detect the complete change targets, we propose an MSGFM comprising an MSPF unit and a GWAF unit. The MSGFM adaptively fuses bi-temporal multi-scale features based on gate mechanisms to obtain discriminative fusion features;
3. To confirm the efficacy of the MSGFNet, we employed the LEVIR-CD, WHU-CD, and SYSU-CD datasets for our comparison experiments. The results demonstrate that the MSGFNet outperforms several state-of-the-art (SOTA) methods. Additionally, the MSGFM was validated through ablation studies.
The remainder of the paper is organized as follows. A brief review of the latest relevant works is given in Section 2. Section 3 details the overall framework of the MSGFNet. Section 4 sequentially offers information on experimental datasets, evaluation metrics, comparison methods, experimental details, results, and ablation studies. Section 5 concludes the paper.

Related Work
In this section, a brief review of the latest methods based on deep learning is given. The current deep-learning-based CD methods can be categorized into three groups based on network structure: CNN-based, transformer-based, and hybrid-based methods.

CNN-Based Methods
From the perspective of the fusion strategy, CNN-based methods can be further categorized into single-stream and two-stream methods [26]. In detail, single-stream methods take inspiration from semantic segmentation tasks. Researchers have proposed some image-level fusion strategies that match the semantic segmentation networks. For instance, Sun et al. [27] introduced conventional long short-term memory into UNet for CD. Peng et al. [28] fed concatenated bi-temporal images into a UNet++ network. They further proposed a fusion strategy on multiple side outputs to improve the accuracy of results. Nevertheless, the independent feature characteristics of each bi-temporal image cannot be directly captured by single-stream CD methods based on semantic segmentation networks.
In contrast to single-stream methods, two-stream methods leverage the Siamese architecture, which consists of two streams that share weights to generate features of bi-temporal images. Most existing CD methods [20,29-31] adopt the Siamese architecture because it is appropriate for handling the input of RSIs. For instance, Dai et al. [29] introduced a building CD method that comprises a multi-scale joint supervision module and an improved consistency regularization module. Ye et al. [30] employed Siamese networks to propose a feature decomposition optimization reorganization network for CD. The edge and main body features were modeled using a feature decomposition strategy. Li et al. [32] proposed a lightweight CD method composed of three modules: a neighbor aggregation module (NAM), a progressive change identifying module (PCIM), and a supervised attention module (SAM), to improve the accuracy of results. Zhou et al. [33] introduced a context aggregation method utilizing a Siamese network. The multi-level features were fed into a context extraction module in this method, enabling the acquisition of long-range spatial-channel context features.

Transformer-Based Methods
Transformers, originally developed for natural language processing, are now being applied to encode bi-temporal images for CD. For example, Bandara et al. [34] introduced a CD method that combines a transformer with a Siamese architecture. This method introduced a transformer feature encoder to extract coarse and fine features with high and low resolution, respectively. Song et al. [35] introduced a progressive sampling transformer network (PSTNet) by using the excellent modeling ability of the transformer. In this method, the optimized tokens are iteratively mapped back to the original features to establish enhanced spatial connections in the spatial domain. Fang et al. [36] introduced a CD method, Changer, which uses a Siamese hierarchical transformer to extract multilayered features and then designs a flow-based dual-alignment fusion module to fuse the two branches' features. Zhang et al. [37] introduced a CD method that used a pure Swin transformer within a Siamese network to extract long-term global features. However, transformer-based methods face limitations in terms of computational complexity and larger parameter sizes [38]. In addition, transformer-based methods often produce irregular boundaries in the results because they disregard the subtle details of shallow features.

Hybrid-Based Methods
Hybrid-based methods combine CNN and transformer architectures, aiming to improve feature extraction abilities [39]. For example, to couple the global and local features, Feng et al. [40] integrated a transformer and a CNN to design a CD method composed of an inter-scale feature fusion module and an intra-scale cross-interaction module, which were designed for obtaining discriminative feature maps and constructing spatial-temporal contextual information, respectively. To address the issues of blurred edges and neglect caused by sampling that is either too shallow or too deep, Song et al. [41] introduced a simple convolutional network and a progressive sampling CNN to generate fine and coarse features, respectively. Subsequently, a mixed-attention module was introduced to merge coarse and fine features. Finally, the results were generated by feeding the fused features into a transformer decoder. Chu et al. [42] proposed a dual-branch feature-guided aggregation network for CD. This method employs a dual-branch structure composed of a CNN and a transformer to extract both semantic and spatial features at various scales. However, in this method, the feature extractor is complicated and the network has a large number of parameters. Tang et al. [43] introduced a W-shaped dual Siamese network (WNet) for CD. In this method, a deformable convolution was introduced into the CNN branch and transformer to mitigate the limited receptive fields and regular patch generation, respectively. Similarly, this method also possesses a significant number of parameters. Moreover, hybrid-based CD methods further require the design of a complicated fusion module to fuse the CNN features and token features, which are extracted from the CNN network and transformer network, respectively.

Framework
As depicted in Figure 1, the MSGFNet follows a standard U-shaped [44] network that employs a Siamese architecture. In particular, the MSGFNet comprises a Siamese feature encoder, an MSGFM, and a decoder for result prediction. First, to preserve the independence of features in bi-temporal images [45], each bi-temporal image is separately fed into the shared-weight Siamese EfficientNetB4 to generate the multi-level features. Subsequently, to effectively fuse multi-scale features aimed at improving the details of changed boundaries, the MSGFM is designed to adaptively fuse the corresponding bi-temporal features at the same feature level. The fused features are decoded following the same skip connection method as in the classic UNet architecture [44], followed by a sigmoid classifier to generate the results.

Siamese Feature Encoder
Considering feature extraction abilities, network parameters, and computational memory, we chose EfficientNetB4 [25] as the backbone encoder for the Siamese architecture. More specifically, we made use of the first four convolutional stages of EfficientNetB4 that have been pre-trained on ImageNet. In particular, the first stage is a common 3 × 3 convolutional layer. The second, third, and fourth stages are each composed of identical MBConv blocks, with 2, 4, and 4 MBConv blocks, respectively.
The structure of the MBConv is depicted in Figure 2. In particular, the MBConv is composed of two 1 × 1 convolutional layers, a k × k depthwise convolutional layer, and a squeeze-excitation module. Within the MBConv block, the channel dimension of the input features is increased by the first convolutional layer. The kernel size k of the depthwise layer in the fourth stage is 5, whereas in other stages, it is 3. Squeeze-excitation is a specific attention mechanism that is able to suppress background feature information and enhance significant information. The purpose of the final convolutional layer is to reduce the channel dimension of the features to align them with the input features, allowing for the utilization of a residual connection mechanism. More details can be found in the literature [25].

Given the bi-temporal images represented as $I_1, I_2 \in \mathbb{R}^{C \times H \times W}$, where H, W, and C denote the height, width, and number of image bands, respectively, the bi-temporal images are separately input into each branch, corresponding to the first four stages of the Siamese EfficientNetB4, to generate multi-level features. As a result, the multi-level features are represented as $f^1_i$ and $f^2_i$, $i \in \{1, 2, 3, 4\}$, respectively, where i represents the i-th stage. The feature depths of the four stages are 48, 24, 32, and 56, respectively. The spatial scales of the extracted multi-level features in the successive stages are

Multi-Scale Gated Fusion Module
Generally, the changed objects in bi-temporal images often have significant size variations [24], which leads to incomplete detection targets and unsmooth boundaries in the results. Consequently, it is imperative to explore multi-scale feature fusion strategies to smooth the boundaries and improve the accuracy of results. Hence, an MSGFM that is capable of adaptively fusing multi-scale features is proposed. More specifically, the MSGFM comprises an MSPF unit and a GWAF unit.

Multi-Scale Progressive Fusion Unit
Previous studies [46,47] have demonstrated that a local receptive field is insufficient for accurately detecting ground objects of various shapes and sizes. To better capture ground objects of different sizes, we propose the use of an MSPF unit (Figure 3). Specifically, the MSPF unit uses four parallel atrous convolutions and a progressive connection strategy to progressively fuse the multi-scale features.
Consider a pair of bi-temporal features of any stage generated from the Siamese feature encoder, denoted as $f^1$ and $f^2$. First, to effectively capture multi-scale feature information about ground objects, we utilize four parallel atrous convolutions with the same kernel size but different atrous rates to generate features at different pyramid scales. In particular, the kernel size for all four convolutions is set to 3 × 3, and the atrous rates of the four convolutions are 7, 5, 3, and 1, respectively. In addition, the output channels of the four convolutions are set to one-fourth of the channels of the input features. For instance, the bi-temporal features $f^1$ and $f^2$ are input into the four parallel atrous convolutions, which can be denoted as follows:
$$f^1_{3,i} = \mathrm{Conv}_{3,i}(f^1), \quad f^2_{3,i} = \mathrm{Conv}_{3,i}(f^2), \quad i \in \{7, 5, 3, 1\}$$
where $\mathrm{Conv}_{3,i}$ is the convolution function with different atrous rates, the subscript 3 denotes the kernel size of the convolution, and the subscript i represents the atrous rate of each convolution. $f^1_{3,i}$ and $f^2_{3,i}$ are the resulting pyramid features of the two temporal branches.
As described above, the proposed MSPF is a progressive process that fuses bi-temporal multi-scale features. Specifically, the features $f^1_{3,7}$ and $f^2_{3,7}$ are fed into the GWAF to achieve weighted adaptive feature fusion. In addition, to mitigate the loss of fused feature information, each set of fused features produced by the upper GWAF is input into the next GWAF based on a progressive connection strategy, as depicted in Figure 3. The specifics of the GWAF will be explained in the next section. The above process can be formulated as follows:
$$f_i = \begin{cases} \mathrm{GWAF}(f^1_{3,i}, f^2_{3,i}), & i = 7 \\ \mathrm{GWAF}(f^1_{3,i}, f^2_{3,i}, f_{i+2}), & i \in \{5, 3, 1\} \end{cases}$$
where GWAF denotes the weighted adaptive fusion operation and $f_i$ is the fused feature for each scale. It is essential to note that the first GWAF fusion unit does not have a progressive connection input. Subsequently, the four fused features are concatenated along the channel dimension, followed by a 1 × 1 convolutional layer employed to produce the discriminative fusion features. The process is formulated as follows:
$$F_i = \mathrm{Conv}_1(\mathrm{Concat}(f_7, f_5, f_3, f_1))$$
where $F_i$ represents the fused multi-level features (here i indexes the encoder stage), Concat denotes concatenation along the channel dimension, and $\mathrm{Conv}_1$ denotes a 1 × 1 convolutional layer.
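The branch geometry described above can be checked with a little arithmetic. The following is a minimal sketch, assuming stride-1 convolutions and the standard dilated-kernel extent formula; the helper names (`effective_kernel`, `same_padding`, `branch_channels`) are illustrative and do not come from the paper. It lists the spatial extent covered by each 3 × 3 atrous kernel, the zero padding that would keep the feature map size unchanged, and the per-branch channel count for each of the four encoder depths.

```python
ATROUS_RATES = (7, 5, 3, 1)        # order given in the text
ENCODER_DEPTHS = (48, 24, 32, 56)  # feature depths of the four encoder stages


def effective_kernel(kernel_size: int, rate: int) -> int:
    """Spatial extent covered by a dilated (atrous) kernel."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def same_padding(kernel_size: int, rate: int) -> int:
    """Zero padding per side that preserves spatial size at stride 1 (assumed)."""
    return (effective_kernel(kernel_size, rate) - 1) // 2


def branch_channels(in_channels: int, branches: int = 4) -> int:
    """Each parallel branch outputs one-fourth of the input channels."""
    return in_channels // branches


for rate in ATROUS_RATES:
    print(f"rate={rate}: extent={effective_kernel(3, rate)}, "
          f"padding={same_padding(3, rate)}")

for depth in ENCODER_DEPTHS:
    print(f"input channels={depth}: channels per branch={branch_channels(depth)}")
```

Because every listed depth is divisible by four, concatenating the four branch outputs restores the input channel count before the final 1 × 1 convolution.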

Gated Weight Adaptive Fusion Unit
Previous studies [39,48,49] have generally fused the bi-temporal features using simple summation or concatenation. Nevertheless, it is difficult for these direct fusion strategies to effectively highlight the changed feature information and suppress the unchanged feature information. Taking inspiration from the gate mechanism [50], which can learn to highlight the contributions of changed regions, we propose a GWAF unit for bi-temporal multi-scale feature weighted adaptive fusion. Figure 4 depicts the details of the GWAF unit.
At a given scale, the bi-temporal features are represented as $f^1_i$ and $f^2_i$, $i \in \{7, 5, 3, 1\}$. In particular, the GWAF unit is roughly composed of three branches: the two individual inputs (i.e., $f^1_i$ and $f^2_i$) generated from the multi-scale atrous convolutional layer, and one fused feature ($f_{i+2}$) obtained from the upper-scale GWAF unit. It is essential to note that the top-scale GWAF unit does not contain the additional fused feature. For convenient illustration, we simplified the subscripts of symbols.
To obtain the gated weight map $G_i$ between bi-temporal features, the bi-temporal features $f^1_i$ and $f^2_i$ are first concatenated along the channel dimension, and then a 3 × 3 convolutional layer is used to fuse the bi-temporal features. Subsequently, the fused feature $f_{i+2}$ obtained from the upper-scale GWAF unit is added to the current scale with a residual connection strategy. After that, a 1 × 1 convolutional layer followed by a sigmoid function is applied to further fuse the multi-scale feature information and obtain the gated weight map $G_i$. The process can be formulated as follows:
$$f^{cat}_i = \mathrm{Conv}_3(\mathrm{Concat}(f^1_i, f^2_i))$$
$$G_i = \sigma(\mathrm{Conv}_1(f^{cat}_i + f_{i+2}))$$
where $f^1_i$ and $f^2_i$ are the bi-temporal features, $\mathrm{Conv}_3$ is a 3 × 3 convolutional layer, $f^{cat}_i$ denotes the concatenation features, and $f_{i+2}$ is the fused feature generated from the upper-scale GWAF unit. $\mathrm{Conv}_1$ is a 1 × 1 convolution and $\sigma$ is the sigmoid layer that follows it. $G_i$ denotes the gated weight map.
To use the gated weight map $G_i$ to refine the changed feature information, the feature $f^1_i$ is input into a 3 × 3 convolutional layer to extract more abstract semantic feature information. Then, the residual connection strategy is employed to add the features before and after convolution. Subsequently, the gated weight map $G_i$ is element-wise multiplied with the newly added features to generate the discriminative fused features. In addition, the inverse gated weight map $1 - G_i$ is used for $f^2_i$ to generate the enhanced feature $f^2_G$. The process can be formulated as follows:
$$f^1_G = G_i \otimes (\mathrm{Conv}_3(f^1_i) + f^1_i)$$
$$f^2_G = (1 - G_i) \otimes (\mathrm{Conv}_3(f^2_i) + f^2_i)$$
where $\otimes$ denotes element-wise multiplication, and $f^1_G$ and $f^2_G$ represent the adaptively fused bi-temporal features corresponding to $f^1_i$ and $f^2_i$, respectively. After that, the enhanced features are concatenated along the channel dimension. Finally, a 1 × 1 convolutional layer is utilized to obtain the fused features of the i-th GWAF unit. The above process can be formulated as follows:
$$f_i = \mathrm{Conv}_1(\mathrm{Concat}(f^1_G, f^2_G))$$
where $f_i$ represents the features generated from the i-th GWAF unit. By combining the GWAF unit with the MSPF unit, this approach is capable of efficiently fusing the bi-temporal features to highlight the changed features while suppressing the unchanged features in bi-temporal images.
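The complementary weighting at the heart of the gate can be illustrated on plain numbers. The sketch below is a deliberately reduced toy, not the GWAF unit itself: it omits every convolution, the residual additions, and the upper-scale input, and it treats the gate logits as given values rather than learned ones. The function name `gated_pair` is hypothetical. It only shows that a sigmoid gate G and its inverse 1 − G split the weight between the two temporal branches at each position.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_pair(feat_t1, feat_t2, gate_logits):
    """Weight the first branch by G and the second by (1 - G), element-wise.

    Toy stand-in for the gating step only: in the described unit the logits
    would come from convolutions over the concatenated bi-temporal features.
    """
    gates = [sigmoid(z) for z in gate_logits]
    weighted_t1 = [g * x for g, x in zip(gates, feat_t1)]
    weighted_t2 = [(1.0 - g) * y for g, y in zip(gates, feat_t2)]
    return weighted_t1, weighted_t2, gates


t1, t2, g = gated_pair([2.0, 2.0], [4.0, 4.0], [0.0, 6.0])
print(g)   # first gate is exactly 0.5, second is close to 1
print(t1)  # first branch is kept almost fully where the gate is high
print(t2)  # second branch is suppressed where the gate is high
```

A gate value near 0.5 passes both branches equally, while a value near 0 or 1 lets one temporal branch dominate at that position.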

Decoder
A decoder is employed to reconstruct the multi-level fused features to produce the results [23]. UNet is a widely used semantic segmentation network that uses skip connections to transmit detailed feature information from the encoder to the decoder [44]. As a result, many researchers have incorporated UNet into CD tasks and proposed a series of CD methods [51-53]. Following this, we use UNet, which has a simple yet effective architecture, to generate the change maps.
Specifically, the fourth-level fused bi-temporal features (i.e., $F_4$) are up-sampled to the spatial size of the third-level fused features (i.e., $F_3$). After that, the features up-sampled from $F_4$ are concatenated with $F_3$ along the channel dimension. Subsequently, a convolutional block is utilized to project the concatenated features to obtain the corresponding features with the same channel numbers as $F_3$. The convolutional block comprises a 3 × 3 convolutional layer, a BN layer, and a ReLU layer. The above process can be formulated as follows:
$$F'_3 = \mathrm{ConvBlock}(\mathrm{Concat}(\mathrm{Up}_2(F_4), F_3))$$
where $\mathrm{Up}_2$ is the up-sampling operation, ConvBlock denotes the convolutional block described above, and $F'_3$ represents the generated features that have the same channel numbers as $F_3$. The above steps are repeated until we obtain the features $F'_1$. Finally, the last 1 × 1 convolutional layer is employed to map the features $F'_1$ to the predicted maps, which have two channels (i.e., representing the changed and unchanged classes).

Details of Loss Function
In classification tasks, the cross-entropy (CE) function is frequently employed, and CD can be regarded as a special two-class classification task. Consequently, the CE function is employed as the loss function and expressed as follows:
$$L_{ce} = -\frac{1}{H \times W} \sum_{i=1}^{H \times W} \left[ t_i \log p_i + (1 - t_i) \log (1 - p_i) \right]$$
where H and W are the image height and width, respectively, $p_i$ is the predicted probability at pixel i, and $t_i$ represents the corresponding value in the truth map.
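As a small numerical illustration of the loss, the following sketch averages the two-class cross-entropy over a flattened list of pixels. It assumes the binary form of the CE loss, with each prediction given as the probability of the changed class and each target as 0 or 1; the function name and the `eps` clamp (added only to avoid taking the logarithm of zero) are illustrative choices rather than details from the paper.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean two-class cross-entropy over a flattened H x W map.

    probs   -- predicted probabilities of the changed class, one per pixel
    targets -- ground-truth labels (1 = changed, 0 = unchanged)
    eps     -- clamp that keeps log() finite; an assumption of this sketch
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


uncertain = binary_cross_entropy([0.5, 0.5], [1, 0])
confident = binary_cross_entropy([0.9, 0.1], [1, 0])
print(uncertain, confident)  # the confident, correct prediction has the lower loss
```

An uninformative prediction of 0.5 everywhere gives a loss of log 2, which is a convenient baseline when reading training curves.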

Results
The three public building datasets that were used in the experiments are first described. Next, the evaluation metrics, comparison methods, and experimental details are introduced in turn. Finally, the results of the experiments are carefully investigated.

WHU-CD
The WHU-CD dataset [54] is a dataset for detecting building changes. This dataset contains a pair of aerial images with 32,507 × 15,354 pixels that were obtained in 2012 and 2016, respectively, and has a spatial resolution of 0.2 m. The bi-temporal images in this study were cropped into 256 × 256 non-overlapping sub-images (Figure 5a), which were then divided into training, validation, and testing sets at an 8:1:1 ratio.

LEVIR-CD
The LEVIR-CD dataset [55] is a large-scale CD dataset that was collected using Google Earth from 2002 to 2018. This dataset includes 637 image pairs of size 1024 × 1024, with a spatial resolution of 0.5 m. Owing to limited graphics processing unit (GPU) memory and following the division setting of the official study [55], the bi-temporal images were cropped into 256 × 256 non-overlapping sub-images (Figure 5b), and 7120/1024/2048 pairs were obtained for training, validation, and testing.
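The crop counts for the two tiled datasets follow from simple arithmetic, sketched below. The helper `tile_count` is hypothetical; for WHU-CD the text does not say how the image border is handled when the size is not a multiple of 256, so both the discard-remainder and pad-to-full-tile counts are shown as possibilities rather than as the authors' procedure.

```python
def tile_count(height: int, width: int, tile: int = 256, pad: bool = False) -> int:
    """Number of non-overlapping tile x tile crops in a height x width image.

    pad=False drops incomplete border tiles; pad=True counts them as if the
    image were padded up to a multiple of the tile size.
    """
    if pad:
        rows = -(-height // tile)  # ceiling division
        cols = -(-width // tile)
    else:
        rows = height // tile
        cols = width // tile
    return rows * cols


# LEVIR-CD: 637 pairs of 1024 x 1024 images, 16 crops per pair.
levir_total = 637 * tile_count(1024, 1024)
print(levir_total, 7120 + 1024 + 2048)  # both are 10192

# WHU-CD: one 32,507 x 15,354 pair; border handling is an assumption here.
print(tile_count(15354, 32507))            # 7434 when the remainder is dropped
print(tile_count(15354, 32507, pad=True))  # 7620 when the border is padded
```

The LEVIR-CD total matches the 7120/1024/2048 split reported above exactly.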

SYSU-CD
The SYSU-CD dataset [56] comprises a total of 20,000 pairs of aerial images with a resolution of 0.5 m and a spatial size of 256 × 256. This dataset encompasses various change types occurring in complex scenarios, such as building dilation, vegetation change, and sea construction. Following the official settings [56], the pairs in this dataset were divided into training, validation, and testing sets, with 12,000, 4000, and 4000 pairs, respectively. Some examples are illustrated in Figure 5c.

Evaluation Metrics
The four common evaluation measures that we used to thoroughly assess the MSGFNet were precision, recall, F1, and intersection over union (IoU). Among these indicators, precision and recall reflect false detections and omissions, respectively. F1 is a more comprehensive metric that is computed as the harmonic mean of recall and precision [17]. Therefore, this paper selects F1 and IoU as the main evaluation measures. These measures are defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
where TP represents the pixel numbers of true positives, TN represents the pixel numbers of true negatives, FN represents the pixel numbers of false negatives, and FP denotes the pixel numbers of false positives.
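The four measures can be computed directly from pixel counts, as in the minimal sketch below. The function names are illustrative. The second helper uses the identity IoU = F1 / (2 − F1), which follows algebraically from the standard definitions (F1 = 2TP / (2TP + FP + FN)) and offers a quick consistency check between a reported F1 and IoU.

```python
def cd_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, F1 and IoU of the changed class from pixel counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou


def iou_from_f1(f1: float) -> float:
    """IoU implied by an F1 score via IoU = F1 / (2 - F1)."""
    return f1 / (2.0 - f1)


precision, recall, f1, iou = cd_metrics(tp=80, fp=20, fn=20)
print(precision, recall, f1, iou)

# Consistency check for an F1 of 92.46%.
print(round(100 * iou_from_f1(0.9246), 2))  # 85.98
```

The final line shows that an F1 of 92.46% implies an IoU of 85.98%, which agrees with the pair of values reported for the MSGFNet on WHU-CD.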

Comparison Methods
We conducted a comparative analysis using eight SOTA CD methods to assess the performance of the MSGFNet. A brief description of these methods is provided below:
1. FC-EF [57]: FC-EF stands as a milestone method, utilizing a classic U-Net architecture. In this method, the bi-temporal images are concatenated along the feature direction before being input into the network.
2. FC-Siam-Diff [57]: FC-Siam-Diff is a CD method with a Siamese CNN architecture. This network first extracts multi-level features from bi-temporal images and then uses the feature difference as the feature fusion module to generate change information.
3. STANet [55]: STANet is a metric-based method. This method suggests using a spatiotemporal attention module based on self-attention mechanisms to model the spatial and temporal relationships to obtain significant information about changed features.
4. DSIFNet [58]: DSIFNet is a deeply supervised image fusion method. This method proposes an attention module to integrate multilevel feature information and employs the deep supervision strategy to optimize the network and improve its performance.
5. SNUNet [20]: SNUNet is a combination of the NestedUNet and Siamese networks. This method alleviates the localization information loss by using a dense connection between the encoder and decoder. Furthermore, an ensemble channel attention module is built to refine the change features at different semantic levels.
6. BITNet [59]: BITNet is a combination of a transformer and a CNN. This network first extracts semantic features by using the CNN, and then uses the transformer to model the global feature into a set of tokens, strengthening the contextual information of the changed features.
7. ChangeFormer [34]: ChangeFormer is a purely transformer-based change detection method. This method uses a Siamese transformer to build the bi-temporal image features and then uses a multi-layer perceptron to decode the difference features.
8. LightCDNet [60]: LightCDNet employs a lightweight MobileNetV2 to extract multilevel features and introduces a multi-temporal feature fusion module to fuse the corresponding level features. Finally, deconvolutional layers are utilized to recover the change map.
To achieve a fair comparison, all the comparison methods were evaluated under the same experimental setting. If a comparison method and the proposed MSGFNet used the same dataset, we utilized the pre-trained weights provided by the respective comparison paper. Otherwise, we employed the provided code and default parameters of the comparison method.
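The first two baselines differ mainly in where and how the two dates are combined. The sketch below contrasts the two ideas on plain nested lists (channels × height × width). It is a simplified illustration only: the function names are hypothetical, and in FC-Siam-Diff the difference is taken between encoder features rather than between raw arrays as done here.

```python
from typing import List

Tensor3D = List[List[List[float]]]  # channels x height x width


def early_fusion(t1: Tensor3D, t2: Tensor3D) -> Tensor3D:
    """Early fusion: stack the channels of both dates into a single input."""
    return t1 + t2  # list concatenation along the channel axis


def difference_fusion(f1: Tensor3D, f2: Tensor3D) -> Tensor3D:
    """Difference fusion: element-wise absolute difference of two feature maps."""
    return [
        [[abs(a - b) for a, b in zip(row1, row2)] for row1, row2 in zip(ch1, ch2)]
        for ch1, ch2 in zip(f1, f2)
    ]


# Two toy inputs with 2 channels and a 1 x 2 spatial size (hypothetical values).
t1 = [[[1.0, 2.0]], [[3.0, 4.0]]]
t2 = [[[1.0, 0.0]], [[5.0, 4.0]]]

stacked = early_fusion(t1, t2)    # 4 channels: both dates side by side
diff = difference_fusion(t1, t2)  # 2 channels: nonzero only where the dates differ
```

Early fusion leaves the comparison of the two dates entirely to the network, whereas difference fusion makes the comparison explicit but discards the original magnitudes.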

Experimental Details
In this study, NVIDIA GeForce RTX 3080Ti graphics cards with 12 GB of memory were used for training, and all experiments were carried out using the PyTorch framework. During the training process, the AdamW optimizer was employed with a weight decay of 1 × 10⁻⁴ and an initial learning rate of 1 × 10⁻³. In addition, all experiments utilized a batch size of 8, and each dataset was trained for 100 epochs.
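To make the roles of the learning rate and the weight decay concrete, the following sketch performs a single AdamW update on one scalar parameter in plain Python. The learning rate and weight decay are the values stated above. The `beta1`, `beta2`, and `eps` values are assumptions (commonly used defaults), since the paper does not report them, and `adamw_step` is a hypothetical helper rather than the training code.

```python
import math
from typing import Tuple


def adamw_step(
    param: float,
    grad: float,
    m: float,
    v: float,
    t: int,
    lr: float = 1e-3,            # initial learning rate stated in the text
    weight_decay: float = 1e-4,  # weight decay stated in the text
    beta1: float = 0.9,          # assumed default, not stated in the paper
    beta2: float = 0.999,        # assumed default, not stated in the paper
    eps: float = 1e-8,           # assumed default, not stated in the paper
) -> Tuple[float, float, float]:
    """One AdamW update for a scalar parameter at step t (t starts at 1)."""
    # Decoupled weight decay: shrink the parameter directly, not via the gradient.
    param = param * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialized moments.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First update of a parameter equal to 1.0 with gradient 0.5 (toy values).
p, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

On the first step the adaptive term moves the parameter by roughly the learning rate, while the decay term contributes only a factor of 1 − 10⁻⁷, which shows how mild this weight decay setting is per step.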

Results
In this part, we compare the MSGFNet with the eight SOTA methods on the three datasets. We categorized the SOTA methods into three classes: CNN-based, transformer-based, and hybrid-based. To enhance readability in the visualization comparisons, we depict the false positives in red, the false negatives in green, the true negatives in black, and the true positives in white.
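The color coding used in the visual comparisons can be reproduced with a simple lookup from each (prediction, ground truth) pair to an RGB color. The sketch below is a minimal illustration on nested lists; `PALETTE` and `error_map` are hypothetical names, not part of the published code.

```python
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]

# Keys are (predicted label, ground-truth label), with 1 = changed.
PALETTE: Dict[Tuple[int, int], RGB] = {
    (1, 1): (255, 255, 255),  # true positive  -> white
    (0, 0): (0, 0, 0),        # true negative  -> black
    (1, 0): (255, 0, 0),      # false positive -> red
    (0, 1): (0, 255, 0),      # false negative -> green
}


def error_map(pred: List[List[int]], gt: List[List[int]]) -> List[List[RGB]]:
    """Turn binary prediction and ground-truth masks into a color-coded map."""
    return [
        [PALETTE[(p, g)] for p, g in zip(pred_row, gt_row)]
        for pred_row, gt_row in zip(pred, gt)
    ]


# 2 x 2 toy masks (hypothetical values) covering all four outcomes.
pred = [[1, 1], [0, 0]]
gt = [[1, 0], [1, 0]]
colored = error_map(pred, gt)
```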

Experimental Analysis on the WHU-CD Dataset
The experimental results on the WHU-CD dataset are displayed in Table 1. The proposed MSGFNet shows outstanding performance, with an F1 of 92.46% and an IoU of 85.98%. The hybrid-based BITNet and transformer-based ChangeFormer secure the second and third positions, respectively, with F1 scores of 91.25% and 89.82%. The proposed MSGFNet exhibits a superior F1 compared to BITNet and ChangeFormer, surpassing them by 1.21% and 2.64%, respectively. In addition, BITNet and ChangeFormer outperform the other CNN-based methods except for LightCDNet. However, despite using a CNN-based architecture, the MSGFNet achieved the best results among the compared SOTA methods. This can be attributed to the effectiveness of the proposed MSGFM in capturing discriminative change features between the bi-temporal images.

An intuitive visual comparison of all the methods is shown in Figure 6. Both STANet and FC-Siam-Diff not only exhibit a significant number of false negatives but also have rough boundaries in changed regions. Additionally, the boundary detection results of both SNUNet and DSIFNet are unsatisfactory. Compared to the second-ranked LightCDNet, the MSGFNet not only has more accurate boundary details but also has fewer false positives and false negatives. In summary, the MSGFNet achieves the best visualization performance on the WHU-CD dataset.

Experimental Analysis on the LEVIR-CD Dataset
The quantitative results of all methods on the LEVIR-CD dataset are displayed in Table 2. From the table, it is evident that FC-Siam-Diff obtained the poorest performance. This may be attributed to its use of a simple Siamese UNet, which leads to poor feature extraction and fusion ability. In contrast, other CNN-based methods, such as STANet, DSIFNet, and SNUNet, introduce various attention mechanisms that enhance the discriminative features of the bi-temporal images. Consequently, these methods show varying degrees of improvement in accuracy. ChangeFormer obtained the second-highest performance, achieving an F1 and IoU of 90.40% and 82.48%, respectively. The MSGFNet improves on ChangeFormer by roughly 0.46% in F1 and 0.77% in IoU. In addition, the proposed method also achieved the highest precision, with a score of 92.12%. In conclusion, the quantitative analysis presented above validates the effectiveness of the MSGFNet.

Figure 7 shows an intuitive visual comparison of all the methods on the LEVIR-CD dataset. For the first case, a densely built area, the changed buildings in the results of FC-Siam-Diff, STANet, DSIFNet, and SNUNet are clustered together to some extent. The results of the proposed MSGFNet show better detail and boundaries for the small ground objects. For the case featuring buildings of different scales in Figure 7, there are illumination changes and building shadows between the bi-temporal images. The results of the proposed MSGFNet have fewer false positives and false negatives than those of several SOTA methods while also preserving the integrity of small ground targets.

Experimental Analysis on the SYSU-CD Dataset
The experimental results on the SYSU-CD dataset are displayed in Table 3. Notably, FC-Siam-Diff exhibits the least favorable performance, with an F1 value of 70.17% and an IoU of 55.11%. SNUNet slightly outperforms FC-Siam-Diff, which may be attributed to SNUNet's dense connection strategy, which can alleviate the loss of feature information [20]. Among the comparative methods, STANet, DSIFNet, and ChangeFormer exhibit comparable performance, with F1 scores of 77.75%, 77.46%, and 77.83%, respectively. LightCDNet and BITNet were the second- and third-ranked methods, with F1 values of 78.52% and 78.72%, respectively. The proposed MSGFNet outperforms the comparative methods in all evaluation metrics except recall. Specifically, the proposed MSGFNet outperforms the second-ranked LightCDNet method by over 1.64% in F1. Although STANet obtains the highest recall value, its F1 and IoU values are 2.64% and 3.63% lower than those of the proposed MSGFNet, respectively. In conclusion, the quantitative analysis presented above validates the effectiveness of the MSGFNet.
An intuitive visual comparison of all the methods is shown in Figure 8. Unlike the WHU-CD and LEVIR-CD datasets, which only contain building changes, the SYSU-CD dataset is more challenging because it encompasses various change types occurring in complex scenarios [56]. For the building changes in the first case, the results of FC-EF and FC-Siam-Diff contain many missed detections (i.e., false negatives). However, the results of other comparative methods, such as DSIFNet and ChangeFormer, contain many false detections (i.e., false positives). For the second case, which is a vegetation change sample, all the comparative methods miss a large area of change. For these two change cases, only the proposed MSGFNet detects the changed ground objects completely, and it produces the best visualization among the compared SOTA methods. Compared to the second-ranked LightCDNet method, the proposed MSGFNet not only has fewer false positives and false negatives but also maintains better boundary details. In summary, our method achieves the best performance on the SYSU-CD dataset.

Model Size and Computational Complexity
In addition, we conducted a comparative analysis of the model size (number of parameters) and computational cost (number of floating-point operations, FLOPs) of all the methods, as presented in Table 4. The MSGFNet not only has the best performance in terms of F1 but also has the smallest model size in terms of network parameters. Specifically, the MSGFNet has just 0.58 M parameters, fewer than the FC-Siam-Diff and BITNet methods. Additionally, our method has the lowest FLOPs. The model size and computational complexity demonstrate that the MSGFNet achieves a better trade-off between performance and model size. For an intuitive visualization, the scatterplot between the parameters and F1 of all methods is shown in Figure 9.
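For readers less familiar with how these two quantities are obtained, the sketch below counts the parameters and multiply-accumulate operations (MACs) of a single 2D convolution layer. The helper names and the layer sizes are hypothetical, and whether a reported FLOPs figure counts one MAC as one or two operations varies between tools; the convention used for Table 4 is not stated here.

```python
def conv2d_params(c_in: int, c_out: int, k: int, groups: int = 1, bias: bool = True) -> int:
    """Number of learnable parameters of a k x k convolution."""
    weights = (c_in // groups) * c_out * k * k
    return weights + (c_out if bias else 0)


def conv2d_macs(c_in: int, c_out: int, k: int, h_out: int, w_out: int, groups: int = 1) -> int:
    """Multiply-accumulate operations for one forward pass (bias ignored)."""
    return (c_in // groups) * c_out * k * k * h_out * w_out


# Standard 3 x 3 convolution, 64 -> 64 channels, on a 256 x 256 output map.
standard_params = conv2d_params(64, 64, 3)
standard_macs = conv2d_macs(64, 64, 3, 256, 256)

# Depthwise 3 x 3 convolution (groups == channels), as used inside MBConv-style blocks.
depthwise_params = conv2d_params(64, 64, 3, groups=64, bias=False)
```

With these hypothetical sizes, the depthwise layer needs 576 weights against 36,928 parameters for the standard layer, which illustrates why encoders built from depthwise convolutions can stay small.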

Ablation Studies
We conducted ablation studies on the LEVIR-CD dataset to demonstrate the effectiveness of the MSGFM. The proposed multi-scale gated fusion module consists of two units: a multi-scale progressive fusion unit and a gated weight adaptive fusion unit. Therefore, we performed a corresponding ablation study on each unit. "Base" refers to a Siamese encoder without any further modules, the multi-scale progressive fusion unit is denoted as "MSPF", and the gated weight adaptive fusion unit is denoted as "GWAF". To validate the effectiveness of the MSPF unit, we removed it and utilized only the GWAF unit to fuse the bi-temporal features. It is essential to point out that, in this scenario, there is no additional input branch generated from the upper-scale GWAF unit. To validate the effectiveness of the GWAF unit, we replaced it with the general difference fusion mode. In addition, we set up a control group in which the network employs the same architecture as FC-Siam-Diff to produce the change map.
Table 5 lists the quantitative results of the ablation studies. The "Base" mode without the MSPF and GWAF units yields the lowest performance. The results obtained with either the GWAF or the MSPF unit are significantly better than those of the "Base" mode. Specifically, the utilization of the GWAF unit results in a 1.99% improvement in F1 and a 2.66% improvement in IoU. With the help of the MSPF unit, F1 and IoU are enhanced by 2.71% and 4.40%, respectively. Furthermore, the best results are produced when both the GWAF and MSPF units are added to the "Base". In particular, there is an enhancement of 1.35% in F1 and 2.78% in IoU when the GWAF unit is added. On the other hand, when using the MSPF unit alone, F1 and IoU are both improved, by 0.63% and 1.04%, respectively. In general, these improvements indicate the effectiveness of the proposed GWAF and MSPF units.

Some examples of the results are shown in Figure 10. The results of the "Base" mode have many false positives and false negatives. When the GWAF and MSPF units are added individually, the results improve slightly. When the GWAF and MSPF units are both added, we achieve the best visualization results: there are fewer false negatives and false positives, and the boundary details are more precise. The visual results validate the effectiveness of the GWAF and MSPF units.

Conclusions
This paper proposes a CD method, namely the MSGFNet. To capture useful feature information, the MSGFNet combines EfficientNetB4 with a Siamese structure to extract multi-level features. An MSGFM that comprises an MSPF unit and a GWAF unit is proposed to progressively and adaptively fuse bi-temporal multi-scale features. This module can obtain discriminative fusion features to smooth the details of changed object boundaries and improve the accuracy of the results. Finally, the results obtained on three publicly available datasets show that the MSGFNet outperforms several SOTA methods in terms of both effectiveness and complexity. On the WHU-CD, LEVIR-CD, and SYSU-CD datasets, the MSGFNet achieved improvements of 1.21%, 0.46%, and 1.64% in F1 and 1.68%, 0.77%, and 2.24% in IoU, respectively, compared to the best-performing SOTA method on each dataset. Additionally, the FLOPs and Params of the proposed MSGFNet are 3.99 G and 0.58 M, respectively. Both values are lower than those of several SOTA methods. In summary, the proposed MSGFNet outperforms several SOTA CD methods.

Figure 1. General overview of the MSGFNet.

Figure 2. The structure of the MBConv block.

Figure 3. The structure of the proposed MSPF unit.

Figure 4. The structure of the GWAF unit.

Figure 5. Example display of the three datasets, where T1 and T2 represent the bi-temporal images and GT represents the ground truth.

Figure 6. Visual comparisons between the MSGFNet and the SOTA methods on the WHU-CD dataset.

Figure 7. Visual comparisons between the MSGFNet and the SOTA methods on the LEVIR-CD dataset.

Figure 8. Visual comparisons between the MSGFNet and the SOTA methods on the SYSU-CD dataset.

Figure 9. The scatterplot between parameters and F1 of all methods.

Figure 10. Visual comparisons of ablation experiments on the LEVIR-CD dataset.

Table 1. Quantitative evaluation of the MSGFNet and the SOTA methods on the WHU-CD dataset. The best scores are marked in bold.

Table 2. Quantitative evaluation of the MSGFNet and the SOTA methods on the LEVIR-CD dataset. The best scores are marked in bold.

Table 3. Quantitative evaluation of the MSGFNet and the SOTA methods on the SYSU-CD dataset. The best scores are marked in bold.

Table 4. Comparison of model size and computational complexity on the WHU-CD dataset.

Table 5. Quantitative evaluation results of ablation studies on the LEVIR-CD dataset. The best scores are marked in bold.