SMNet: Symmetric Multi-Task Network for Semantic Change Detection in Remote Sensing Images Based on CNN and Transformer

Abstract: Deep learning has achieved great success in remote sensing image change detection (CD). However, most methods focus only on the changed regions of images and cannot accurately identify their detailed semantic categories. In addition, most CD methods using convolutional neural networks (CNN) have difficulty capturing sufficient global information from images. To address these issues, we propose a novel symmetric multi-task network (SMNet) that integrates global and local information for semantic change detection (SCD). Specifically, we employ a hybrid unit consisting of pre-activated residual blocks (PR) and transformation blocks (TB) to construct the PRTB backbone, which obtains more abundant semantic features with local and global information from bi-temporal images. To accurately capture fine-grained changes, the multi-content fusion module (MCFM) is introduced, which effectively enhances change features by distinguishing foreground and background information in complex scenes. Meanwhile, multi-task prediction branches are adopted, and a multi-task loss function is used to jointly supervise model training to improve the performance of the network. Extensive experimental results on the challenging SECOND and Landsat-SCD datasets demonstrate that our SMNet obtains 71.95% and 85.65% mean Intersection over Union (mIoU), respectively. In addition, the proposed SMNet achieves 20.29% and 51.14% Separated Kappa coefficient (SeK) on the SECOND and Landsat-SCD datasets, respectively. These results demonstrate the effectiveness and superiority of the proposed method.


Introduction
Accurate access to dynamics and change information on the land surface is important for understanding and studying the natural environment, human activities, and their correlations [1,2]. With the rapid development of remote sensing sensors, it is possible to obtain massive high-resolution remote sensing images, which further provides reliable data sources for studying CD. Remote sensing image CD is used to identify the differences between two images that are located at the same position but are obtained at different times. Therefore, remote sensing image CD is of great significance in many fields, including urban planning, ecosystem assessment, and natural resource management [3-5].
Over the years, deep learning has been successfully applied to image classification [6,7], object detection [8,9], and semantic segmentation [10,11], which contributes new ideas and methods for remote sensing image CD. A simple and commonly used method is stacking the bi-temporal images and then feeding them into a CNN to extract change information [12-14]. However, this method does not make full use of the information from the bi-temporal images. Therefore, more researchers adopted an architecture with two parallel CNNs and some effective techniques (e.g., multi-scale feature fusion methods, attention mechanisms, deep supervision) to improve the performance of the architecture. For example, Zhang et al. [15] proposed a CD framework with a hierarchical fusion strategy and introduced the dynamic convolutional module for adaptive learning to improve the integrity of the results. Fang et al. [16] used a Siamese network based on U-Net++ to extract the features of bi-temporal images and adopted an integrated channel attention module (CAM) for deep supervision. Ling et al. [17] proposed an integrated residual attention multi-scale Siamese network, which effectively obtains multi-scale semantic information to alleviate the lack of unpredictable change details and global semantic information. Chen et al. [18] designed a dual-attention mechanism to capture more discriminative features than compared methods, which improves CD accuracy. Peng et al. [19] proposed a difference-enhancement dense-attention CNN, which uses upsampling attention units and difference-enhancement units to extract and select change information, respectively. Guo et al. [20] proposed CosimNet for scene CD, using a thresholded contrastive loss to learn more discriminative metrics. Zhu et al. [21] adopted a global hierarchical mechanism to enhance unbalanced samples and improve SCD accuracy.
Although CNN significantly improves CD accuracy for remote sensing images with its powerful feature extraction capabilities, the existing methods still suffer from the following problems. First, the CNN's effective receptive field is much smaller than the theoretical maximum, which makes it difficult to establish long-range dependencies in space and time. Hence, the global and contextual information of remote sensing images is often ignored by the CNN. Second, CNN-based methods handle the boundaries of changed regions unsatisfactorily. Last but not least, most CD algorithms can predict whether image pixels have changed, but they cannot detect their semantic categories on the bi-temporal images.
To overcome the difficulty faced by CNN in handling bi-temporal features and to better leverage its feature extraction powers for SCD, we propose a method based on CNN and Transformer called the symmetric multi-task network (SMNet), which not only locates changed areas but also identifies the type of change. First, to extract the local and global information from bi-temporal remote sensing images, a hybrid PRTB backbone composed of PR and TB is built [22], which extracts different hierarchical features. Then, a novel multi-content fusion module (MCFM) [23] is used to strengthen the change-related features obtained by subtracting the extracted corresponding hierarchical features. Finally, we use the multi-task prediction branches (i.e., two semantic branches and one change branch) to obtain SCD results. To improve overall network performance, the multi-task loss function is used to jointly supervise training. Extensive experiments on the SECOND dataset demonstrate that our proposed method obtains better accuracy than compared methods. The main contributions of this research are as follows:
(1) We propose a novel multi-task model (SMNet) for remote sensing image SCD, which reduces confusion between semantic and change information. To sufficiently capture the local and global information from the bi-temporal images, we exploit a multi-scale feature extraction encoder that integrates the PR and TB to obtain more abundant semantic information. To enhance the extraction of change-related features in complex scenes, the MCFM is introduced to mitigate false detection of fine-grained changes.
(2) Extensive experiments on two public datasets demonstrate the effectiveness of the proposed SMNet. The proposed method outperforms compared methods, yielding the highest SeK of 20.29% and 50.14%, respectively.
The rest of this paper is organized as follows. Related works are reviewed in Section 2. A detailed description of the proposed method is provided in Section 3. The experimental data, evaluation indicators, and training details are given in Section 4. The experimental results and discussion are reported in Section 5, and the conclusions are provided in Section 6.

Binary Change Detection
Numerous scholars have conducted extensive research on binary change detection (BCD). To suppress background noise in remote sensing images, Chen et al. [24] designed a feature constraint CD network that uses a self-supervised learning strategy to constrain feature extraction and feature fusion. Zhi et al. [25] proposed a novel neural network with a spatial-spectral attention mechanism and multi-scale dilation convolution modules, effectively alleviating the pseudo-changes caused by solar height and soil moisture in land cover CD for remote sensing images. To address the problems of feature diversity and scale-change flexibility, Lei et al. [26] employed scale-adaptive attention to establish relationships between feature maps and convolution kernel scales, utilizing a multi-layer perceptron (MLP) that fuses low-level details and high-level semantics to improve feature discrimination. Wei et al. [27] designed a location guidance module that accurately identifies changed regions. To handle the varying resolutions of bi-temporal images, a super-resolution module containing a generator and discriminator [28] was introduced to directly learn super-resolution features. Meanwhile, a stacked attention module was used to capture more useful channel and spatial information, which effectively improved the accuracy of multi-resolution remote sensing image CD. CD for remote sensing images is also challenged by the small number of available datasets and the heavy workload of data labeling. To alleviate these challenges, a semi-supervised convolutional network based on a generative adversarial network [29] was proposed, which uses two discriminators to enforce the feature distribution consistency of segmentation maps and entropy maps between the labeled and unlabeled data, achieving high-precision CD with a small amount of labeled data.

Semantic Change Detection
Despite the great success of BCD, it is difficult to meet the needs of practical production applications due to the lack of semantic information. Therefore, some scholars have researched SCD. To address coarse boundaries, Tsutsui et al. [30] developed a multi-task SCD method based on U-Net, which improves the accuracy of boundary predictions by simultaneously performing CD and semantic segmentation through a shared feature extraction network. A convolutional network for large-scale SCD [31] utilizes a multi-scale atrous convolution unit to enlarge the receptive field and capture multi-scale information. Additionally, an attention mechanism and a deep supervision strategy are introduced to further improve network performance. To overcome scale variation and class imbalance problems, a dual-task constrained deep Siamese convolutional network [32] was proposed, which introduces a dual-attention module to obtain discriminant features and achieves good performance on the WHU dataset. The temporal correlation of bi-temporal images is also worth considering. To this end, an end-to-end network that combines a CNN and a recurrent neural network [33] was proposed to extract spatial and temporal information; it directly applies changed labels for model training. A multi-task learning framework [34] was constructed, which uses a fully convolutional long short-term memory to capture the temporal relationship among spatial feature vectors and to boost the overall performance. Daudt et al. [35] applied multi-task learning to extract temporal correlations and feature-categorized information to improve SCD performance. To extract asymmetric change information from multi-temporal images, an asymmetric Siamese network [36] was designed to extract deep features through an asymmetric gating unit, which effectively distinguishes changed features in complex scenes. Zheng et al. [37] constructed a multitask encoder-converter-decoder network with a reduced number of encoder branches and explored the relationships between semantic change and time symmetry.

Overview
The network architecture of the proposed SMNet is shown in Figure 1. SMNet adopts an encoder-decoder structure, which is mainly composed of three parts: a multi-scale feature extraction encoder, multi-content fusion enhancement, and a multi-task prediction decoder. First, the feature extraction encoder contains two symmetric branches composed of the PR and TB, which are used to extract semantic features at different levels from the bi-temporal images (T1 and T2). Two paired sets of feature maps, [$E^{1}_{T1}$, $E^{2}_{T1}$, $E^{3}_{T1}$, $E^{4}_{T1}$, $E^{5}_{T1}$] and [$E^{1}_{T2}$, $E^{2}_{T2}$, $E^{3}_{T2}$, $E^{4}_{T2}$, $E^{5}_{T2}$], are obtained from the feature extraction encoder. Then, five difference feature maps [$D_1$, $D_2$, $D_3$, $D_4$, $D_5$], one at each scale ($E^{i}_{T1}$, $E^{i}_{T2}$; i = 1, 2, 3, 4, 5), are generated for change analysis. To improve the ability of our network to capture useful information from complex scenes, we introduce the MCFM to aggregate foreground, background, and global features. The enhanced change features [$F_1$, $F_2$, $F_3$, $F_4$, $F_5$] are then generated by multi-scale feature enhancement composed of MCFMs. Finally, the multi-task prediction decoder is used to make synchronous predictions on the semantic and enhanced change features, which generates two semantic maps ($S_1$, $S_2$) and a binary change map ($B$). Further, semantic change maps ($P_1$, $P_2$) are generated by masking $S_1$ and $S_2$ with $B$.

Multi-Scale Feature Extraction Encoder
Remote sensing image SCD faces the challenge that a pure CNN captures global contextual information poorly. The transformer compensates for this shortcoming by exploiting multi-head self-attention (MHSA) to establish a strong global dependency. This enables the receptive field to flexibly learn more powerful features. With the addition of the transformer, both local and global features can be extracted to facilitate semantic understanding. The PRTB backbone is thus constructed to extract multi-scale semantic features at different spatial levels.
Figure 2 illustrates the detailed structure of the PRTB. Compared with the residual blocks of ResNet, the PR block (see Figure 2a) treats the activation functions (batch normalization (BN) and rectified linear unit (ReLU)) as "pre-activation" of the weight layers. The PR block allows information to be transmitted directly from one layer to another in both the forward and backward passes [38], which makes SMNet easier to train and improves its generalizability. The input feature $X_{l-1}$ of layer $l-1$ is first fed to the BN and ReLU layers. Then, 3 × 3 convolutional layers are used to capture the local information, and the PR block generates $X_l$ for layer $l$. The TB structure is illustrated in Figure 2b. The MHSA module is used to extract global information, and layer normalization (LN), MLP, and residual connection modules are introduced to improve representability. Finally, $X_l$ is transformed into $X_{l+1}$ through the n-layer TB. In the corresponding formulas, $X_i$ denotes the features extracted from the i-th layer after the images are input to the network, CRB(·) indicates that the feature map passes through BN, ReLU, and 3 × 3 convolutional layers, and $\bar{X}_l$ is the output of the MHSA module.
Remote Sens. 2023, 15, 949
As shown in the left part of Figure 1, the feature extraction encoder consists of five stages. After the first stage, the feature resolution of the images remains unchanged, while the channel dimension is increased to 32. At each following stage, the feature resolution is halved, while the channel dimension is doubled. For example, when input images T1 and T2 are both 256 × 256 × 3, the sizes of the semantic feature maps $E^{i}_{T1}$ and $E^{i}_{T2}$ (i = 1, 2, 3, 4, 5) are determined by this rule.
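The stage-wise rule described above can be traced with a few lines of Python. This is only an illustrative sketch: the function name `encoder_stage_shapes` and its parameters are hypothetical, and it assumes that halving and doubling apply at every stage after the first, which is consistent with the decoder sizes reported later in the paper.

```python
def encoder_stage_shapes(height=256, width=256, base_channels=32, num_stages=5):
    """Return the (H, W, C) feature size of each encoder stage.

    Assumption: stage 1 keeps the input resolution and outputs
    `base_channels` channels; every later stage halves H and W
    and doubles the channel dimension.
    """
    shapes = []
    h, w, c = height, width, base_channels
    for stage in range(num_stages):
        if stage > 0:
            # Halve the resolution and double the channels.
            h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes


shapes = encoder_stage_shapes()
for i, (h, w, c) in enumerate(shapes, start=1):
    print(f"stage {i}: {h} x {w} x {c}")
```

Under these assumptions, the deepest stage has one sixteenth of the input resolution in each spatial dimension.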

Multi-Content Fusion Enhancement
False alarms, missed detections, and the loss of details are common problems in SCD methods and are usually caused by the insufficient ability of the model to extract foreground (region-of-interest) information. Therefore, we introduce the MCFM to handle feature extraction in the foreground and background, as well as globally. While capturing the foreground information, background and global information are supplemented to improve the discrimination of foreground features and obtain more effective edges. As shown in the middle part of Figure 1, we generate the difference feature map $D_i$ from the corresponding semantic feature maps ($E^{i}_{T1}$, $E^{i}_{T2}$) for change analysis. Then, $D_i$ is enhanced by the MCFM to obtain the enhanced feature $F_i$. This process is applied at each scale i = 1, 2, 3, 4, 5. The specific MCFM architecture is illustrated in Figure 3.
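The two steps described above can be written compactly. The following is a sketch under stated assumptions rather than the paper's own equations: the text only says that the corresponding features are subtracted, so whether an absolute value or another normalization is applied to the difference is not known here.

```latex
\begin{aligned}
D_i &= E^{i}_{T1} - E^{i}_{T2}, \\
F_i &= \mathrm{MCFM}(D_i), \qquad i = 1, 2, 3, 4, 5 .
\end{aligned}
```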

Foreground and Background Branches
First, the difference feature $D_i \in \mathbb{R}^{C \times H \times W}$ (C, H, and W are the channel dimension, height, and width of the feature) is fed into the channel attention (CA) module to reduce redundant information. The output is the CA feature $f_{ca}$, whose computation involves concatenation in the channel dimension (denoted ⊕). Then, to embed the CA information, we obtain the spatial attention (SA) feature $f_{sa}$ through the SA module. On the one hand, $f_{sa}$ is used as an input to the foreground branch alongside the CA feature. On the other hand, $f_{sa}$ enters the background branch. Because $f_{sa}$ represents high-level features that omit spatial details, edge features are lost during the successive convolutions. Therefore, in the background branch, we introduce reverse attention to mine valid edge information. The two branches produce the foreground features $f_{fore}$ and the background features $f_{back}$, using element-wise multiplication (denoted ⊗).
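The foreground/background split can be illustrated with a small NumPy sketch. It is not the paper's formulation: it assumes that the spatial attention yields sigmoid weights in (0, 1), that the foreground is the CA feature weighted by them, and that reverse attention weights the CA feature by their complement. The function name `split_foreground_background` and the argument `sa_logits` are hypothetical.

```python
import numpy as np


def split_foreground_background(f_ca, sa_logits):
    """Split channel-attended features into foreground and background parts.

    f_ca:      (C, H, W) features after channel attention.
    sa_logits: (1, H, W) raw spatial-attention scores (assumed input).
    """
    # Assumed spatial attention: sigmoid weights in (0, 1).
    f_sa = 1.0 / (1.0 + np.exp(-sa_logits))
    # Foreground: element-wise multiplication, broadcast over channels.
    f_fore = f_ca * f_sa
    # Background via reverse attention: weight by the complement.
    f_back = f_ca * (1.0 - f_sa)
    return f_fore, f_back


rng = np.random.default_rng(0)
f_ca = rng.normal(size=(4, 8, 8))
sa_logits = rng.normal(size=(1, 8, 8))
f_fore, f_back = split_foreground_background(f_ca, sa_logits)
```

With this choice, the two parts sum back to the input feature, so the background branch sees exactly what the foreground weighting suppresses.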

Global Branch
The foreground and background branches pay more attention to the local features of images; thus, we add a global branch to obtain more global features. Specifically, we apply global average pooling (GAP) on $D_i$ to integrate the global information and reduce redundant parameters. We also utilize a 1 × 1 convolution for feature smoothing. Then, we reconstruct the global feature to the same size as the difference feature $D_i$ by bilinear interpolation upsampling. This rough operation loses detailed information, but the reconstructed features reflect the overall characteristics of the original features. Next, the reconstructed features are fed into the SA and residually multiplied with $D_i$ to obtain the final global feature $f_{global}$. In the formulation of this process, $\mathrm{Conv}_{1\times1}(\cdot)$ denotes the 1 × 1 convolution layer, and $\mathrm{up}(\cdot)$ denotes the upsampling operation.

Feature Fusion
We obtain three kinds of features (i.e., $f_{fore}$, $f_{back}$, and $f_{global}$) and further reshape them using a 3 × 3 convolution layer. Then, we aggregate the reshaped features by concatenating them along the channel dimension. We also apply a skip connection to retain the original features and generate the output features $F_i \in \mathbb{R}^{C \times H \times W}$. In the formulation of this process, $\mathrm{Conv}_{3\times3}(\cdot)$ denotes the 3 × 3 convolutional layer.

Multi-Task Prediction Decoder
Single-task methods use two decoders to directly generate two semantic change maps, which causes confusion between the change and semantic information. This normally reduces CD accuracy. Therefore, we design the decoder with multi-task prediction branches that include two semantic decoders for predicting the semantic maps of the bi-temporal images and a change decoder for predicting binary change maps, as shown on the right side of Figure 1. Each of the three decoders contains four stages that receive features from the previous stages, and the semantic decoders integrate the semantic feature maps $E^{i}_{T1}$ and $E^{i}_{T2}$ (i = 1, 2, 3, 4, 5). The change decoder integrates the multi-scale enhanced features $F_i$ (i = 2, 3, 4, 5) through skip connections. Each stage consists of a bilinear upsampling layer and a 3 × 3 convolutional layer. At each stage of the decoder, the channel dimension is halved, while the feature resolution is doubled until the original image resolution is restored. For example, when input images T1 and T2 are both 256 × 256 × 3, the feature maps in the decoder are 32 × 32 × 256, 64 × 64 × 128, 128 × 128 × 64, and 256 × 256 × 32. Then, two semantic maps ($S_1$ and $S_2$) and a BCD map ($B$) are generated by the classification layer. Finally, $P_1$ and $P_2$ are obtained by masking $S_1$ and $S_2$ with $B$. In these calculations, $S^{i}_{T1}$ and $S^{i}_{T2}$ represent the feature maps of each stage of the two semantic decoders, and $B^{i}$ represents the feature map of each stage of the change decoder, with i = 2, 3, 4, 5.
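The final masking step can be sketched in NumPy as follows. This is a minimal illustration, assuming that the semantic maps hold integer class indices, that the binary map marks changed pixels with 1, and that index 0 denotes "no change" in the output, as in the label convention described for the loss. The function name `mask_semantic_maps` is hypothetical.

```python
import numpy as np


def mask_semantic_maps(s1, s2, b):
    """Keep semantic classes only where the binary map reports a change.

    s1, s2: (H, W) integer class maps predicted for T1 and T2.
    b:      (H, W) binary change map (1 = changed, 0 = unchanged).
    Unchanged pixels are set to 0, the assumed no-change index.
    """
    changed = b.astype(bool)
    p1 = np.where(changed, s1, 0)
    p2 = np.where(changed, s2, 0)
    return p1, p2


s1 = np.array([[1, 2], [3, 4]])
s2 = np.array([[1, 5], [3, 6]])
b = np.array([[0, 1], [0, 1]])
p1, p2 = mask_semantic_maps(s1, s2, b)
```

Because the same mask is applied to both dates, the two semantic change maps always agree on which pixels are unchanged.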
If only the change loss were used to train the multi-task prediction, semantic information would be lacking for training supervision, which would negatively affect network performance. We therefore use the multi-task loss function ($L_{mul}$), including the semantic losses, $L_{sem1}$ and $L_{sem2}$, and the binary change loss, $L_c$, to jointly guide model training. In $L_{mul}$, the weight α adjusts the effect of the different loss functions on the network. The semantic losses, $L_{sem1}$ and $L_{sem2}$, are the multi-class cross-entropy losses between the semantic segmentation results, $S_1$ and $S_2$, and the ground truth (GT) semantic change maps, $L_1$ and $L_2$. The no-change class (index "0") is excluded from the loss calculation. $L_{sem}$ is calculated on each pixel, where N is the number of semantic classes, and $y_i$ and $p_i$ denote the GT label and the predicted probability of the i-th class, respectively.
A severe class imbalance problem occurs with BCD, in which the number of unchanged pixels is much larger than the number of changed pixels. To alleviate this, the binary change loss $L_c$ combines binary cross-entropy and dice loss to jointly supervise the BCD. Here, $y_b$ is the GT label, which is obtained by replacing non-zero labels with "1" as a change label, and $\hat{y}_b$ is the predicted probability of binary change.
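The loss terms described above can be sketched in NumPy. This is an illustration under assumptions, not the paper's exact formulas: the cross-entropy is averaged over labeled pixels, the dice term uses the common soft form, the two binary terms are summed with equal weight, and α is assumed to scale the semantic terms. All function names are hypothetical.

```python
import numpy as np

EPS = 1e-7  # numerical guard for log and division


def semantic_loss(probs, labels, ignore_index=0):
    """Multi-class cross-entropy that skips the no-change class.

    probs:  (K, H, W) per-class probabilities.
    labels: (H, W) integer GT map; `ignore_index` marks no-change pixels.
    """
    rows, cols = np.nonzero(labels != ignore_index)
    if rows.size == 0:
        return 0.0
    # Probability assigned to the GT class of each labeled pixel.
    picked = probs[labels[rows, cols], rows, cols]
    return float(-np.mean(np.log(picked + EPS)))


def binary_change_loss(pred, target):
    """Binary cross-entropy plus soft dice loss (assumed equal weights)."""
    pred = np.clip(pred, EPS, 1.0 - EPS)
    bce = -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    dice = 1 - (2 * np.sum(pred * target) + EPS) / (np.sum(pred) + np.sum(target) + EPS)
    return float(bce + dice)


def multi_task_loss(l_sem1, l_sem2, l_c, alpha=0.5):
    """Assumed combination: alpha scales the two semantic terms."""
    return alpha * (l_sem1 + l_sem2) + l_c


labels = np.array([[0, 1], [2, 0]])
y_b = (labels != 0).astype(float)   # non-zero labels become the change label 1
uniform = np.full((3, 2, 2), 1.0 / 3.0)
l_sem = semantic_loss(uniform, labels)
l_c = binary_change_loss(y_b, y_b)  # a perfect binary prediction
total = multi_task_loss(l_sem, l_sem, l_c, alpha=0.5)
```

The default `alpha=0.5` mirrors the value reported in the training details; where exactly α enters the sum remains an assumption of this sketch.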

Datasets
The SECOND [36] is a high-resolution dataset collected by several aerial platforms and sensors for remote sensing image SCD. Among the 4662 pairs of temporal images, 2968 are openly available, covering several cities (e.g., Hangzhou, Chengdu, and Shanghai). Each pair of data provides original images and corresponding semantic change labels. Each image has a fixed size of 512 × 512 pixels, with the spatial resolution varying from 0.3 to 5 m. The dataset holds six land-cover classes, including non-vegetated ground surface, tree, low vegetation, water, building, and playground, and involves 30 common change categories. Figure 4 presents sample images from the SECOND. As can be seen, unchanged pixels account for more than 80% of the total, whereas the 30 change categories only take up small proportions, which poses a huge imbalance challenge to SCD methods.
The Landsat-SCD dataset is made up of Landsat images collected between 1990 and 2020. The observation area is Tumshuk, Xinjiang, China. The dataset consists of 8468 pairs of images, each having a fixed size of 416 × 416 pixels with a resolution of 30 m. The dataset includes a no-change class and four land-cover classes: farmland, desert, buildings, and water. Figure 5 shows sample images from the Landsat-SCD dataset. The dataset contains many complex detection scenes, where the buildings are small and scattered. Changed pixels account for about 19% of the total, which provides a realistic evaluation dataset for SCD methods.

Evaluation Metrics
In this paper, four evaluation metrics are utilized to assess the performance of different methods: overall accuracy (OA), mean Intersection over Union (mIoU), the separated Kappa (κ) coefficient (SeK), and a comprehensive score (Score). OA reflects the proportion of correctly classified samples among all samples:

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively. Owing to the obvious imbalance of positive and negative samples, the OA calculation can easily be dominated by negative samples; hence, it fails to provide a reasonable perspective of full-task accuracy. Thus, we turn to mIoU and SeK. The former is used to evaluate SCD results from the BCD perspective, and the latter takes the SCD perspective. mIoU is the mean value of the IoU of no-change pixels ($\mathrm{IoU}_1$) and changed pixels ($\mathrm{IoU}_2$), i.e., $\mathrm{mIoU} = (\mathrm{IoU}_1 + \mathrm{IoU}_2)/2$. SeK combines $\mathrm{IoU}_2$ with a new κ that is computed after the true predictions of non-changed pixels are excluded. It is calculated from the confusion matrix $Q = \{q_{ij}\}$, $0 \le i \le N$, $0 \le j \le N$, in which "0" represents the unchanged class and N represents the number of categories; $q_{i+}$ denotes the row sum of Q, and $q_{+i}$ denotes the column sum. Based on mIoU and SeK, Score is calculated as in [37].
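The metrics above can be computed from a confusion matrix as in the following sketch. It assumes the definitions commonly used with the SECOND benchmark, which the extracted text does not spell out: SeK is κ (computed with $q_{00}$ set to zero) multiplied by $e^{\mathrm{IoU}_2 - 1}$, and Score is 0.3 · mIoU + 0.7 · SeK. These weights and the function name `scd_metrics` are assumptions, not taken from the text.

```python
import numpy as np


def scd_metrics(q):
    """Compute OA, mIoU, SeK, and Score from a confusion matrix.

    q: (N+1, N+1) matrix; index 0 is the unchanged class.
    """
    q = np.asarray(q, dtype=float)
    total = q.sum()
    oa = np.trace(q) / total

    # IoU of unchanged pixels and of changed pixels (BCD perspective).
    iou1 = q[0, 0] / (q[0, :].sum() + q[:, 0].sum() - q[0, 0])
    iou2 = q[1:, 1:].sum() / (total - q[0, 0])
    miou = (iou1 + iou2) / 2

    # Kappa after discarding the true predictions of unchanged pixels.
    q_hat = q.copy()
    q_hat[0, 0] = 0.0
    n = q_hat.sum()
    rho = np.trace(q_hat) / n
    eta = (q_hat.sum(axis=1) * q_hat.sum(axis=0)).sum() / n ** 2
    kappa = (rho - eta) / (1 - eta)

    sek = kappa * np.exp(iou2 - 1)      # assumed SeK definition
    score = 0.3 * miou + 0.7 * sek      # assumed Score weights
    return {"OA": oa, "mIoU": miou, "SeK": sek, "Score": score}


confusion = [[80, 5, 0],
             [5, 4, 1],
             [0, 1, 4]]
metrics = scd_metrics(confusion)
```

In this toy matrix, OA is high because unchanged pixels dominate, while SeK stays low, which illustrates why OA alone is not a reasonable indicator for SCD.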

Training Details
All methods are implemented with PyTorch on Linux and trained using two NVIDIA RTX 2080Ti GPUs. During training, the same experimental parameters are used in all experiments. We use stochastic gradient descent to train the network with an initial learning rate of 0.01 and a weight decay of 0.9. The batch size and α are set to 4 and 0.5, respectively. We further split each dataset into training, validation, and testing sets randomly at a ratio of 7:1:2. To make better use of the GPU for training, we uniformly resize image patches to 256 × 256 and train the framework for 50 epochs. To improve robustness, data augmentation is performed via random flipping and rotating of the input images.
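For bi-temporal data, flipping and rotation have to be applied identically to both images and both label maps so that pixels stay aligned. The sketch below shows one way to do this in NumPy; the flip probability of 0.5, the restriction to multiples of 90 degrees, and the function name `augment_pair` are assumptions, since the text does not specify them.

```python
import numpy as np


def augment_pair(img1, img2, label1, label2, rng):
    """Apply the same random flip and rotation to a bi-temporal sample.

    img1, img2:     (H, W, 3) images for T1 and T2.
    label1, label2: (H, W) semantic change labels.
    rng:            a numpy random Generator.
    """
    arrays = [img1, img2, label1, label2]
    if rng.random() < 0.5:  # horizontal flip (assumed probability)
        arrays = [np.flip(a, axis=1) for a in arrays]
    if rng.random() < 0.5:  # vertical flip (assumed probability)
        arrays = [np.flip(a, axis=0) for a in arrays]
    k = int(rng.integers(0, 4))  # rotate by k * 90 degrees (assumed)
    arrays = [np.rot90(a, k, axes=(0, 1)) for a in arrays]
    return arrays


rng = np.random.default_rng(42)
label1 = rng.integers(0, 7, size=(8, 8))
label2 = rng.integers(0, 7, size=(8, 8))
# Toy images whose channels copy the labels, so alignment can be checked.
img1 = np.stack([label1] * 3, axis=-1)
img2 = np.stack([label2] * 3, axis=-1)
a1, a2, l1, l2 = augment_pair(img1, img2, label1, label2, rng)
```

Drawing the random decisions once per sample, rather than once per array, is what keeps the images and labels consistent.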

Comparison Experiments
To verify the superiority of SMNet on SCD tasks, several excellent remote sensing image CD models were compared:
• FC-Siam-conc [12]: A fully convolutional Siamese network that fuses bi-temporal features through skip-connections for CD.
• FC-Siam-diff [12]: A fully convolutional Siamese network that utilizes multi-layer difference features to fuse bi-temporal information.
• DSIFN [39]: A deeply supervised differential network that generates change maps using multi-scale feature fusion.
• HRSCD-str3 [35]: A network that introduces temporal correlation information by constructing a BCD branch.
• HRSCD-str4 [35]: A Siamese network that designs a skip operation to connect Siamese encoders with the decoder of the CD branch.
• BiSRNet [40]: A bi-temporal semantic reasoning (SR) network that applies Siamese and cross-temporal SR to enhance information exchange between temporal and change branches.
• FCCDN [24]: A feature constraint CD network based on a dual encoder-decoder that uses a non-local feature pyramid network to extract and fuse multi-scale features and proposes a densely connected feature fusion module to enhance robustness.
• BIT [41]: A network that combines a CNN and a transformer and learns a compact set of tokens to represent high-level concepts that reveal the change of interest in bi-temporal images. The transformer finds the relationship between semantic concepts in the token-based space-time.
The quantitative results on the SECOND dataset are shown in Table 1 and indicate that our method delivers excellent performance. Specifically, SMNet achieves the best mIoU, SeK, and OA values of 71.95%, 20.29%, and 86.68%, respectively. Compared with BIT, SMNet shows consistent improvements on all evaluation metrics, demonstrating that the combination of PR and TB greatly assists feature extraction and yields much richer global information. Like the multi-task methods BiSRNet and FCCDN, our method shows clear advantages over the single-task methods, such as FC-Siam-conc, FC-Siam-diff, and DSIFN. The main reason may be that semantic and change information interfere with each other in the single-task methods. The experimental results on the Landsat-SCD dataset are shown in Table 2, where the proposed method achieves the best results on every evaluation metric. In particular, our method is 7.16% higher than BIT in SeK.
Figure 6 visualizes partial results on the SECOND dataset. The multi-task methods decouple the SCD task into two simple sub-tasks, semantic segmentation and BCD, which largely reduces the output space. As a result, multi-task methods are more accurate and complete than single-task methods (e.g., FC-Siam-conc). In complex scenes, as shown in the first and second groups in Figure 5, the compared methods produce a large number of false and missed detections because low vegetation and trees have relatively similar spectral characteristics in the images. In contrast, the proposed method effectively establishes context information and clearly distinguishes low vegetation from trees by capturing richer global information. In the case of irregular changes at multiple scales, as shown in the third and fourth groups in Figure 5, FC-Siam-conc, BiSRNet, FCCDN, and BIT all miss detections to different degrees, whereas SMNet demonstrates higher adaptability to multi-scale change regions and produces smoother edges by distinguishing foreground and background information. Since the resolution of the Landsat-SCD dataset is relatively low, it poses a greater challenge to the performance of the model. On this dataset, the proposed model accurately identifies fine-grained changes that are easily missed by the compared models, such as the drying of rivers and the changes of small farmland and buildings in Figure 7.
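To make the remark about the reduced output space concrete, the following minimal sketch counts the per-pixel labels each formulation has to choose from and shows one common way of combining the decoupled outputs. It is an illustration under stated assumptions, not the SMNet implementation: the function names `output_space_sizes` and `compose_scd_maps` are hypothetical, label 0 is assumed to mean "no change", and masking the two semantic maps with the binary change map is assumed as the way to form the final SCD result.

```python
def output_space_sizes(num_classes):
    """Per-pixel label counts for single-task vs. decoupled SCD (illustrative)."""
    # Single-task: one label per ordered "from-to" pair of different classes,
    # plus one label for "no change".
    single_task = num_classes * (num_classes - 1) + 1
    # Multi-task: each head only chooses among its own, much smaller label set.
    multi_task = {
        "semantic_t1": num_classes,
        "semantic_t2": num_classes,
        "binary_change": 2,
    }
    return single_task, multi_task


def compose_scd_maps(sem_t1, sem_t2, change_mask, no_change=0):
    """Keep semantic labels only where the binary change map is 1 (assumed rule)."""
    masked_t1 = [[c if m else no_change for c, m in zip(row, mask_row)]
                 for row, mask_row in zip(sem_t1, change_mask)]
    masked_t2 = [[c if m else no_change for c, m in zip(row, mask_row)]
                 for row, mask_row in zip(sem_t2, change_mask)]
    return masked_t1, masked_t2


if __name__ == "__main__":
    print(output_space_sizes(6))
    t1, t2 = compose_scd_maps([[1, 2], [3, 1]], [[1, 4], [3, 2]], [[0, 1], [0, 1]])
    print(t1, t2)
```

With six land-cover classes, the single-task formulation has to separate 31 labels per pixel, whereas each semantic head separates only six and the change head two.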
Figure 8 presents the intermediate results of BIT and the proposed method on the SECOND dataset. It can be seen that BIT shows obvious misjudgments in the recognition of semantic types, which may be caused by its transformer receiving only the lowest semantic features from the CNN, resulting in inadequate semantic information. The proposed method not only effectively reduces the response to unrelated information and improves the classification of different objects via the PRTB backbone, but also successfully leverages the MCFM to accurately detect change regions and improve boundary representation. It is worth noting that, in the case of extreme label imbalance, water is detected more accurately by SMNet, which demonstrates that the proposed method achieves excellent performance for remote sensing image SCD.
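The mIoU and SeK values quoted in the comparisons above can be computed from a single confusion matrix. The sketch below follows the definitions commonly used with the SECOND benchmark; it is an assumption that they match the formulas given earlier in this paper, since they are not restated in this section. The function name `scd_metrics` is hypothetical, and index 0 of the matrix is assumed to be the "no change" class.

```python
import math


def scd_metrics(q):
    """Return (mIoU, SeK) from a square confusion matrix q (list of lists).

    q[i][j] counts pixels of class i assigned to class j; index 0 is "no change".
    Degenerate matrices (e.g. no changed pixels at all) are not handled.
    """
    n = len(q)
    total = sum(sum(row) for row in q)
    row0 = sum(q[0])
    col0 = sum(q[i][0] for i in range(n))
    q00 = q[0][0]

    # mIoU: mean of the "no change" IoU and the class-agnostic "change" IoU.
    iou_nochange = q00 / (row0 + col0 - q00)
    changed_in_both = total - row0 - col0 + q00
    iou_change = changed_in_both / (total - q00)
    miou = 0.5 * (iou_nochange + iou_change)

    # SeK: kappa computed with the true "no change" count zeroed out, so the
    # dominant unchanged pixels cannot inflate the score.
    total_hat = total - q00
    agree = sum(q[i][i] for i in range(1, n)) / total_hat
    rows = [sum(q[i]) - (q00 if i == 0 else 0) for i in range(n)]
    cols = [sum(q[i][j] for i in range(n)) - (q00 if j == 0 else 0)
            for j in range(n)]
    chance = sum(r * c for r, c in zip(rows, cols)) / total_hat ** 2
    kappa = (agree - chance) / (1.0 - chance)
    sek = kappa * math.exp(iou_change - 1.0)
    return miou, sek


if __name__ == "__main__":
    print(scd_metrics([[80, 5, 5], [4, 3, 0], [2, 0, 1]]))
```

Because the unchanged-to-unchanged count is removed before the kappa is computed, SeK rewards correct semantic labels inside changed regions, which is why it is far lower than mIoU in Tables 1 and 2.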

Ablation Experiments
To validate the contributions of the key components of our framework, four sets of ablation experiments were performed on the SECOND dataset. We use the backbone network composed of P-Res blocks as the baseline, and each module is added to the network individually for performance evaluation on the basis of the baseline. The results are listed in Table 3. As shown in rows 1 and 2 of Table 3, compared with the baseline, adding the multi-task loss achieves gains of 0.34% and 0.73% in mIoU and SeK, respectively. This indicates that the multi-task loss improves semantic representation via co-supervised model training. The third row of Table 3 shows that adding the PRTB backbone achieves significant improvements in all metrics, because the PRTB backbone takes more global and context information into account. As shown in the last row of Table 3, introducing the MCFM increases the mIoU value from 71.50% to 71.95% and the SeK value from 19.45% to 20.29%. This illustrates that the MCFM is effective in highlighting the boundaries of change areas. Some of the visualization results are shown in Figure 9, which illustrates the improvements in terms of fewer missed and false detections. In particular, the ability to distinguish the ground, low vegetation, and tree categories has been significantly increased.
In addition, we examine the performance of different numbers of transformer block layers on the SECOND dataset; the results are shown in Table 4. It can be seen that too many transformer blocks decrease the detection effectiveness of the network. When the numbers of layers n are set to (1, 2, 4, 2, 1), the best detection results are obtained in mIoU and SeK, which also reduces redundant computation.
The loss function provides effective supervision information for network training and has a significant impact on network performance. In this paper, α is introduced in the multi-task loss to balance the loss values between the semantic and change branches. During training, a larger value of α indicates more supervision information from the semantic branches. To analyze the sensitivity of α, we chose four sets of values from 0 to 0.8 for our experiments. Figure 10 shows the effect of different α values in the multi-task loss on each evaluation metric. It can be seen that all curves on the SECOND dataset reach their best value when α = 0.5, which indicates that the semantic loss and the binary change loss play equal roles in the co-supervised network training. However, all evaluation metrics have their lowest value when α = 0.8, which illustrates that semantic supervision is greatly important for remote sensing image SCD.
The SECOND dataset is currently the only available dataset for remote sensing image SCD tasks with pixel-level annotations of large-scale images. The creation of such datasets is extremely time-consuming and laborious. Therefore, we plan to consider unsupervised or semi-supervised methods in the future to capture semantic change information through self-learning and achieve SCD tasks with comparable or even higher accuracy.
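The role of α in the sensitivity study above can be illustrated with a small sketch of a weighted multi-task loss. The exact loss is not restated in this section, so the convex combination used here, the averaging of the two semantic branches, and the plain cross-entropy terms are all assumptions, chosen only because they are consistent with the statement that α = 0.5 gives the semantic and change losses equal roles. The names `cross_entropy` and `multi_task_loss` are hypothetical.

```python
import math


def cross_entropy(probs, targets):
    """Mean negative log-likelihood over pixels.

    probs: list of per-pixel class probability lists; targets: list of class indices.
    """
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


def multi_task_loss(sem_probs_t1, sem_t1, sem_probs_t2, sem_t2,
                    chg_probs, chg, alpha=0.5):
    """Assumed form: alpha * semantic loss + (1 - alpha) * binary change loss."""
    # Average the two temporal semantic branches into one semantic term.
    loss_sem = 0.5 * (cross_entropy(sem_probs_t1, sem_t1)
                      + cross_entropy(sem_probs_t2, sem_t2))
    loss_chg = cross_entropy(chg_probs, chg)
    # A larger alpha gives more weight to the semantic supervision.
    return alpha * loss_sem + (1.0 - alpha) * loss_chg


if __name__ == "__main__":
    sem = [[0.5, 0.5]]    # one pixel, two semantic classes
    chg = [[0.25, 0.75]]  # one pixel, unchanged/changed probabilities
    for alpha in (0.0, 0.5, 0.8):
        print(alpha, multi_task_loss(sem, [0], sem, [0], chg, [1], alpha))
```

Under this assumed form, α = 0 trains only the change branch and α = 1 only the semantic branches, so sweeping α as in Figure 10 moves the supervision between the two tasks.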

Figure 1. The architecture of the proposed SMNet.

Figure 8. Comparisons of the results provided by BIT and our method. We highlight interesting regions with orange boxes. Definitions: S-map, semantic map.

SCD is a meaningful and challenging task in the remote sensing field that requires full exploitation of semantic and change features. In this paper, we propose SMNet, a novel model based on a CNN and a transformer for remote sensing image SCD, which outputs semantic and change information end-to-end. Based on an encoder-decoder architecture, we built the PRTB backbone, which combines a CNN and a transformer to extract image features with more global information. This improves the semantic representation ability of our network. Then, we introduce the MCFM, which enhances the sensitivity of the model to change regions in complex backgrounds by fusing foreground, background, and global information. Finally, we use the multi-task loss to jointly guide network training, further improving the detection efficacy of SMNet. Comparative experimental results on the SECOND dataset show that our new framework achieves better accuracy than the compared methods. Ablation experiments then clearly demonstrate the necessity of each module for the detection results.

Table 1. Comparison results on the SECOND dataset.

Table 3. Ablation study on the SECOND dataset.

Table 4. The effect of the n-layer transformer block on the SECOND dataset.
