Dual-Task Semantic Change Detection for Remote Sensing Images Using the Generative Change Field Module

Abstract: With the advent of very-high-resolution remote sensing images, semantic change detection (SCD) based on deep learning has become a research hotspot in recent years. SCD aims to observe changes on the Earth’s land surface and plays a vital role in monitoring the ecological environment, land use and land cover. Existing research mainly focuses on single-task semantic change detection, and such methods are incapable of identifying which change type has occurred in each multi-temporal image. In addition, few methods use the binary change region to help train a deep SCD network. Hence, we propose a dual-task semantic change detection network (GCF-SCD-Net) that uses a generative change field (GCF) module to locate and segment the change region; moreover, the proposed network is end-to-end trainable. To counter the influence of label imbalance, we propose a separable loss function that alleviates the over-fitting problem. Extensive experiments are conducted to validate the performance of our method, which achieves a 69.9% mIoU and 17.9 SeK on the SECOND dataset. Compared with traditional networks, GCF-SCD-Net achieves the best results.

Semantic change detection (SCD) can detect the change regions and identify the semantic labels simultaneously. However, openly available SCD datasets are still limited [18]. Hence, to validate the proposed methods, Daudt et al. [18] built a large-scale semantic change detection dataset (HRSCD) and proposed a sequential training framework for semantic change detection. Mou et al. [19] proposed a recurrent convolutional neural network (ReCNN) for semantic change detection and built two datasets to validate their work, but these datasets are not publicly available. Although existing methods can achieve semantic change detection with promising results, they face the problems of locating and identifying the area of change [20]. Consequently, Yang et al. [20] proposed an asymmetric Siamese network (ASN) for dual-task semantic change detection, and a large-scale semantic change [...] feature map is an effective method to improve the segmentation results.
In this paper, the main contributions are as follows.
(1) We propose a novel dual-task semantic change network to identify the change region in bitemporal images, and it achieves strong results on the SECOND dataset. The proposed SCD-based model effectively solves the dual-task semantic change detection problem.
(2) To the best of our knowledge, we are the first to exploit the generative change field method to guide two branch networks to achieve dual-task semantic change detection.
(3) To alleviate the influence of label imbalance between the change region and the no-change region, we propose a robust separable loss function that improves the performance of the network.

Materials and Methods
To simplify the mathematical modeling, Table 1 depicts the meanings of the abbreviated letters we used.

Siamese Convolutional Network
Siam-Conv has been widely used to extract feature information in existing works [13,15–18], which shows the effectiveness of Siam-Conv for change detection. Therefore, we exploited Siam-Conv to extract the feature information from the image pairs, as shown in Figure 1.
Remote Sens. 2021, 13, 3336
Let $I_1$ and $I_2$ represent an input image pair, $I_i \in \mathbb{R}^{C \times H \times W}$, $i \in \{1, 2\}$. Let $E_\theta(\cdot)$ be Siam-Conv (the Siam-Conv module in Figure 1), which extracts the feature maps from the bitemporal images, and let $\theta$ be the learnable parameters of Siam-Conv. Then, the feature extraction can be formulated as follows:

$$f_i = E_\theta(I_i), \quad i \in \{1, 2\} \qquad (1)$$

where $f_i$ denotes the feature maps captured from the input image at time $T_i$.
In this work, we used seven residual blocks from ResNet34, proposed by He et al. [28], to extract the shallow features. The details of Siam-Conv are listed in Table 2. Firstly, we used Layer0 (corresponding to "Layer0" in Figure 1), which contains a convolutional layer (Conv) with a kernel size of 7 × 7, a Batch Normalization layer (BN), a Rectified Linear Unit (ReLU) and a Maxpooling layer (Maxpool), to extract the salient features and reduce the size of the feature maps. Next, we used three residual blocks ("Layer1" in Figure 1) to extract the low-dimension features with a size of 64 × 64, and four residual blocks ("Layer2" in Figure 1) to capture the middle-dimension features with a size of 32 × 32.
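As a rough check of the feature-map sizes quoted above (64 × 64 after Layer1 and 32 × 32 after Layer2), the spatial size after each stage can be traced with the standard convolution output-size formula. The sketch below is only illustrative: the strides and paddings follow the common ResNet34 layout, and the 256 × 256 input is a hypothetical value; neither is stated in this section.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def siam_conv_sizes(size: int) -> dict:
    """Trace the spatial size through Layer0, Layer1 and Layer2.

    Strides and paddings follow the common ResNet34 layout (an assumption).
    """
    sizes = {}
    size = conv_out(size, kernel=7, stride=2, padding=3)  # Layer0: 7x7 Conv
    size = conv_out(size, kernel=3, stride=2, padding=1)  # Layer0: Maxpool
    sizes["Layer0"] = size
    sizes["Layer1"] = size  # stride-1 residual blocks keep the size
    size = conv_out(size, kernel=3, stride=2, padding=1)  # Layer2 downsamples once
    sizes["Layer2"] = size
    return sizes


sizes = siam_conv_sizes(256)  # hypothetical input size
print(sizes)
```

Under these assumptions the two branches of Siam-Conv, which share the parameters $\theta$, each produce feature maps at 1/4 and 1/8 of the input resolution.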

General Networks for Dual-Task Semantic Change Detection
Traditional change detection methods cannot effectively achieve dual-task semantic change detection, mainly because general networks are designed for a single task. Hence, based on classical segmentation networks such as UNet [29] and PSPNet [30], we built two dual-task semantic change detection networks, as shown in Figure 2.

UNet-SCD
In the Siam-Conv module, we exploited ResNet34 to extract the feature representation from each of the bitemporal images separately. Let $f_1$ and $f_2$ represent the feature maps captured from the image pair at two different times, $T_1$ and $T_2$. Next, the feature pairs were fused by a concatenation operation, which can be formulated as

$$f = \mathrm{Concat}(f_1, f_2)$$

where $f \in \mathbb{R}^{\frac{C}{32} \times \frac{H}{32} \times \frac{W}{32}}$. In order to obtain the difference features, we used two convolutional groups (Conv + BN + ReLU) to refine the difference maps. The convolutional group was defined as follows:

$$\mathrm{ConvGroup}(x) = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(x)))$$

Then, we used two upsampling branches and a skip connection strategy to guide each branch to generate the difference maps.

PSPNet-SCD
Different from the UNet-SCD network, the bitemporal images were downsampled by 1/8 in the Siam-Conv module. Then, we used the concatenation operation to fuse the bitemporal features. Meanwhile, we used the pyramid pooling module (PPM) [30] to capture features with different receptive fields. Finally, by reusing the bitemporal features output by the Siam-Conv module, we obtained the dual-task semantic change maps. In the fusion process, $D$ denotes the difference features taken from the PPM and $f$ represents the bitemporal feature maps; the fused features are denoted $D_{f_i}$. To extract the difference maps from $D_{f_i}$, we used two convolutional groups to generate the semantic change maps for the two time periods separately.

Generative Change Field Network for Dual-Task Semantic Change Detection
Above, we established two types of dual-task semantic change detection networks. However, their results obtained on the SECOND dataset are mediocre. Hence, we developed a method (GCF-SCD-Net) that can achieve strong performance for dual-task SCD in this work.
We introduce a generative change field (GCF) module that focuses only on change regions based on difference maps. Since the details of the Siam-Conv module are given in Table 2, we only present the configuration of the GCF-based dual-task semantic change detection in Table 3. The process of generating a change field is as follows. Firstly, we obtain the bitemporal features $E_\theta(I_1)$ and $E_\theta(I_2)$. Let $D_f$ represent the fused features, $D_f \in \mathbb{R}^{3C \times H \times W}$, obtained through a concatenation operation. Then, using two Conv modules, we reduced the dimension of $D_f$ to obtain $f$, $f \in \mathbb{R}^{C \times H \times W}$. Next, nine residual blocks were used to obtain the difference maps (corresponding to "Layer3" and "Layer4" in Figure 1). To effectively capture the feature representation from $f$, we used the PPM strategy (the PPM module shown in Figure 1) to extract the difference feature maps at different scales. In this work, we used four scales of the feature maps with bin sizes of 1 × 1, 2 × 2, 3 × 3 and 6 × 6, respectively. Finally, a convolutional group and an output layer were used to generate the binary change maps, $\varphi$, which are marked with a green box in Figure 1.
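To make the pyramid pooling step concrete, the following minimal sketch average-pools a single-channel feature map into the four bin sizes listed above. It is plain Python for illustration only: the helper name `adaptive_avg_pool` and the window rule (the one commonly used for adaptive pooling) are assumptions, and a full PPM would additionally apply convolutions, upsampling and concatenation to each pooled map.

```python
def adaptive_avg_pool(fmap, bins):
    """Average-pool a 2D list `fmap` (H x W) into a `bins` x `bins` grid."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(bins):
        # Window rows: floor(i*h/bins) .. ceil((i+1)*h/bins)
        r0, r1 = (i * h) // bins, ((i + 1) * h + bins - 1) // bins
        row = []
        for j in range(bins):
            c0, c1 = (j * w) // bins, ((j + 1) * w + bins - 1) // bins
            window = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Toy 6 x 6 feature map with values 0..35 (hypothetical data).
fmap = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
pyramid = {b: adaptive_avg_pool(fmap, b) for b in (1, 2, 3, 6)}
```

The 1 × 1 bin summarizes the whole map, while the 6 × 6 bin keeps the finest context, which is why combining the four scales gives the change field access to several receptive fields at once.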
The aim of GCF-SCD-Net is to achieve dual-task semantic change detection; thus, we exploited two branches to generate bitemporal semantic change maps. A "Seg1" module containing two convolutional layers is used to increase the channel dimension of $f_1$ from 128 to 512, so that we obtain the feature maps $f_1$ at time $T_1$, $f_1 \in \mathbb{R}^{512 \times 32 \times 32}$. After that, we use the difference feature maps generated by the GCF module to guide branch-1 (top of Figure 1) to predict the pixel categories of the $T_1$ image ($p_1$). Similarly, the semantic change map of the $T_2$ image ($p_2$) is obtained in the same way, as shown at the bottom of Figure 1.
The final binary change map and semantic change maps are generated by the softmax activation function, which can be formulated as

$$\psi(\cdot)_{\max} = \arg\max_{c} \, \psi\big(p_c^{(x,y)}\big)$$

where $\psi(\cdot)$ is the softmax function and $\psi(\cdot)_{\max}$ represents the index of the maximum prediction, so we have

$$\psi\big(p_c^{(x,y)}\big) = \frac{\exp\big(p_c^{(x,y)}\big)}{\sum_{k=0}^{S-1} \exp\big(p_k^{(x,y)}\big)}$$

where $p_c^{(x,y)}$ represents the prediction of the pixel in row $x$ and column $y$ of the $c$-th channel, and $S$ is the number of channels, which is equal to the total number of change types plus the no-change type. Through the abovementioned strategy, we can obtain the change region maps and semantic change segmentation maps simultaneously.
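The per-pixel decision described above can be sketched in a few lines: apply the softmax over the channel scores of one pixel, then take the index of the largest probability. The function names and the four-channel example are hypothetical, and channel 0 is taken to be the no-change type.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of per-channel scores."""
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(scores):
    """Return (label, probabilities); label 0 is taken to be no-change."""
    probs = softmax(scores)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs


# One pixel with S = 4 hypothetical channels.
label, probs = predict_label([0.2, 2.5, -1.0, 0.3])
```

Applying `predict_label` to every pixel of the binary branch and of the two semantic branches yields the change region map and the two semantic change maps.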

Dual-Task Semantic Change Detection Loss Function
Model training plays an important role in achieving strong results. A suitable loss function helps to obtain better performance and reduce training time. In this article, the loss function consists of a generative change field loss (gcf_loss) and a semantic change loss (sc_loss).

WCE_Loss
For our task, the cross-entropy loss (CE_loss) is capable of measuring the similarity between the prediction and the ground truth, which makes it an appropriate choice for calculating the loss. The CE_loss can be formulated as follows:

$$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} y_i \log(p_i) \qquad (11)$$

where $N$ is the number of training samples, and $y_i$ and $p_i$ are the ground truth and prediction of the $i$-th sample. However, the distribution of the target classes is unbalanced. Hence, according to the pixel statistics of each class, we added a distribution weight based on the number of pixels in each category to the cross-entropy loss function (WCE_loss), so Equation (11) can be rewritten as

$$L_{WCE} = -\frac{1}{N} \sum_{i=1}^{N} w \, y_i \log(p_i)$$

where $w$ is the weight, $w = (w_1, w_2, \ldots, w_S)$.
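A minimal sketch of the weighted cross-entropy is given below. The text only says that the weights are derived from the per-class pixel counts, so the inverse-frequency scheme in `class_weights` is an assumption, as are the function names and the toy numbers.

```python
import math


def class_weights(pixel_counts):
    """Inverse-frequency weights, normalised to mean 1 (an assumed scheme)."""
    total = sum(pixel_counts)
    raw = [total / c for c in pixel_counts]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


def wce_loss(probs, labels, weights):
    """Weighted cross-entropy averaged over N samples.

    probs[i] is the softmax output of sample i and labels[i] its class index,
    so the one-hot ground truth selects a single log-probability per sample.
    """
    n = len(labels)
    return -sum(weights[y] * math.log(p[y]) for p, y in zip(probs, labels)) / n


counts = [900, 60, 40]  # hypothetical pixel counts per class
w = class_weights(counts)
probs = [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.2, 0.2, 0.6]]
labels = [0, 1, 2]
ce = wce_loss(probs, labels, [1.0, 1.0, 1.0])  # plain CE_loss
wce = wce_loss(probs, labels, w)
```

With unit weights the function reduces to the CE_loss; with the assumed weights, errors on the rare classes contribute more to the total than errors on the dominant class.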

Separable Loss
Due to the influence of label imbalance, we propose a separable loss function that calculates the losses of the no-change regions and the change regions separately. Let $\varphi(\cdot)$ indicate the softmax function, where $\varphi(p_{i,j}^{0})$ is the prediction of the 0th channel of $p_{i,j}$ activated by the softmax function, $p_{i,j} \in \mathbb{R}^{C \times H \times W}$. In the separable loss, $\bar{B} = E - B$, where $E$ is the all-ones matrix and $B$ is the binary change ground truth; $y^{*}_{i,j}$ and $p^{*}_{i,j}$ are the ground truth and prediction of the $j$-th sample at time $T_i$ without the no-change type, $p^{*}_{i,j} \in \mathbb{R}^{(S-1) \times H \times W}$, $y^{*}_{i,j} \in \mathbb{R}^{(S-1) \times H \times W}$. The no-change loss trains the network by calculating the loss of the no-change channel. Meanwhile, the change loss, which ignores the no-change channel, contributes to the total loss without being influenced by the imbalance between change and no-change labels.
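The idea of splitting the loss into a no-change term and a change term can be illustrated with the sketch below. It is one possible reading of the description above, not the exact formulation of the paper: here the no-change term is assumed to be a binary cross-entropy on the softmax probability of channel 0 against $\bar{B}$, and the change term is assumed to be a cross-entropy over channels 1 to S−1 evaluated only on change pixels. All names are hypothetical.

```python
import math


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def separable_loss(logits, labels, eps=1e-12):
    """Assumed form: no-change term on channel 0 plus change term on channels 1..S-1.

    logits: per-pixel score lists with S channels (channel 0 = no-change).
    labels: per-pixel class indices (0 = no-change).
    """
    no_change, change, n_change = 0.0, 0.0, 0
    for z, y in zip(logits, labels):
        b = 1 if y > 0 else 0  # B: binary change ground truth
        b_bar = 1 - b          # B-bar = E - B
        p0 = softmax(z)[0]     # probability of the no-change channel
        no_change -= b_bar * math.log(p0 + eps) + b * math.log(1.0 - p0 + eps)
        if b:                  # change term ignores the no-change channel
            p_star = softmax(z[1:])
            change -= math.log(p_star[y - 1] + eps)
            n_change += 1
    no_change /= len(labels)
    change = change / n_change if n_change else 0.0
    return no_change + change


logits = [[4.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]]
good = separable_loss(logits, [0, 1, 2])  # predictions agree with labels
bad = separable_loss(logits, [0, 2, 1])   # change types swapped
```

Because the change term is averaged over change pixels only, its magnitude does not shrink when no-change pixels dominate the image, which is the effect the separable loss is designed to achieve.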

Union Loss
In this work, we use the WCE_loss and the separable loss to penalize the network on the generative change field and semantic change predictions during training. Let $y$ and $\hat{y}$ represent the prediction and ground truth, and let $b$ and $B$ be the binary change prediction and ground truth; the union loss is computed from these two pairs.

Results
In this section, we first introduce the experimental setup in Section 3.1. Next, we describe the dataset used in this work and the evaluation criteria of the models in Sections 3.2 and 3.3. Further, in Sections 3.4 and 3.5, we present the experimental results and analysis.

Implementation Details
To ensure a fair comparison, all experiments were conducted with the same training strategy, software environment and hardware platform. The details are as follows.
The total number of training epochs was 100 for all experiments. We used the stochastic gradient descent (SGD) algorithm [31] to train the networks; the momentum was set to 0.9, with a weight decay of $5 \times 10^{-4}$. The initial learning rate was 0.005, and it was decreased by 1/10 every 30 epochs during training. The input size of the image pair was
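The learning rate schedule described above can be written as a simple step decay. The sketch assumes that "decreased by 1/10" means the rate is multiplied by 0.1 every 30 epochs, which is the usual convention; the function name is hypothetical.

```python
def learning_rate(epoch: int, base_lr: float = 0.005, step: int = 30, gamma: float = 0.1) -> float:
    """Step decay: multiply the rate by `gamma` every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


# Learning rate for each of the 100 training epochs.
schedule = [learning_rate(e) for e in range(100)]
```

Under this assumption the rate takes four values over the 100 epochs, changing at epochs 30, 60 and 90.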

Dataset
Currently, public semantic change detection datasets are still limited. Most of the existing public benchmarks [32–35] focus on binary change detection. Mou et al. [19] built a multi-class change detection benchmark for SCD, but that dataset is single-task and is not publicly available.
Hence, we used the dual-task SCD benchmark (SECOND), available at http://www.captainwhu.com/PROJECT/SCD/ (accessed on 8 March 2021) and proposed by [20], to validate our methods. There are six categories in the SECOND dataset: non-vegetated ground surface, tree, low vegetation, water, buildings and playgrounds. It contains 4662 pairs of aerial images, and each sample has a size of 512 × 512 in three bands (red, green and blue).

Metrics
For change detection, the Overall Accuracy (OA) and mean Intersection over Union (mIoU) are usually utilized for validating the performance of the methods [15,18,19,26,35]. Since the numbers of pixels per category are unevenly distributed, Yang et al. [20] proposed a Separated Kappa (SeK) coefficient to alleviate the influence of label imbalance.
In this article, we utilize three types of metrics, namely, OA, mIoU and SeK, to evaluate the performance of our methods. The OA is defined as

$$OA = \frac{\sum_{i=0}^{S-1} p_{i,i}}{\sum_{i=0}^{S-1} \sum_{j=0}^{S-1} p_{i,j}}$$

where $S$ represents the total number of change categories plus the no-change type ($i = 0$), $p_{i,i}$ indicates the total number of pixels correctly predicted by the network, and $p_{i,j}$ represents the total number of pixels that are predicted as the $i$-th change type but in fact belong to the $j$-th type. mIoU = (IoU1 + IoU2)/2, where IoU1 is used to evaluate the prediction in the no-change regions, and IoU2 is used for validating the change regions.
The SeK is formulated in terms of $\zeta$, the consistency between the prediction and the ground truth [20].
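For illustration, OA and mIoU can be computed from a confusion matrix as in the sketch below. The function name and the toy matrix are hypothetical, and the IoU terms are an assumed reading of the description above: IoU1 is the IoU of the no-change class and IoU2 is the IoU of the merged change class. SeK is left out because its formula is not reproduced in this section.

```python
def scd_metrics(conf):
    """OA, IoU1, IoU2 and mIoU from an S x S confusion matrix.

    conf[i][j] counts pixels predicted as class i whose true class is j;
    class 0 is the no-change type.
    """
    s = len(conf)
    total = sum(sum(row) for row in conf)
    oa = sum(conf[i][i] for i in range(s)) / total
    pred0 = sum(conf[0])                       # pixels predicted as no-change
    true0 = sum(conf[i][0] for i in range(s))  # pixels that are truly no-change
    iou1 = conf[0][0] / (pred0 + true0 - conf[0][0])
    # Merged change class: prediction and ground truth both in 1..S-1.
    inter = sum(conf[i][j] for i in range(1, s) for j in range(1, s))
    iou2 = inter / (total - conf[0][0])
    return {"OA": oa, "IoU1": iou1, "IoU2": iou2, "mIoU": (iou1 + iou2) / 2}


conf = [
    [80, 4, 2],
    [3, 6, 1],
    [1, 1, 2],
]
m = scd_metrics(conf)
```

In this toy example OA is high because the no-change class dominates, while IoU2 is much lower, which illustrates why OA alone is a weak indicator under label imbalance.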

Effect of the GCF Module
Firstly, we evaluate the effect of the GCF module for dual-task semantic change detection on the SECOND dataset. For a fair comparison, all experiments in this subsection were trained with the conventional cross-entropy loss function. As illustrated in Table 4, in terms of the SCD task, conventional methods such as FC-EF, FC-Siam-conv and FC-Siam-diff cannot achieve competitive results. Compared with FC-Siam-diff, the proposed dual-task PSPNet-SCD improves semantic change detection by 1.2% in IoU2, 0.8% in mIoU and 1.8 in SeK. Moreover, the GCF-based network improves IoU2 and SeK by 2.6% and 2.7, respectively, compared with FC-Siam-diff, which demonstrates that the proposed GCF module is effective. Table 4 also shows that the GCF-based network outperforms PSPNet-SCD, with improvements in IoU2 and SeK of 1.4% and 0.9, respectively. Therefore, the GCF module effectively increases the performance of the models.

Performance Analysis of Separable Loss
Label imbalance is a severe problem that makes network training difficult. To obtain strong results for the SCD task, this article introduces a separable loss to improve the performance of semantic segmentation in the change region. As shown in Table 5, the proposed separable loss alleviates the label imbalance problem. Networks trained with the separable loss outperform those trained with WCE_loss: PSPNet-SCD and GCF-SCD-Net improve IoU2 by 1.9% and 1.7%, and SeK by 1.4 and 1.5, respectively. Although this improvement comes at the cost of OA and IoU1, which are affected or even slightly decreased, the segmentation results in the change region are greatly improved. Lin et al. [36] proposed a focal loss function to alleviate the influence of sample imbalance. To compare it with the separable loss, we also report the results obtained with the focal loss in Table 5; the proposed loss function is superior across all evaluation metrics.
The reason for this phenomenon is that the separable loss calculates the losses for the change region and the no-change region separately, which guides the model to pay more attention to the loss of the change area and alleviates the problem of over-confidence caused by label imbalance. In particular, the improvements in the overall metrics obtained by GCF-SCD-Net further demonstrate the competitive performance of the proposed GCF module.

Discussion
To validate the performance of GCF-SCD-Net, we list the results obtained by [20] and by our methods in Table 6. The proposed method achieves the best results on the SECOND dataset, 16.5 in SeK and 69.1% in mIoU. In the testing process, Yang et al. [20] improved the detection results with flip methods and a multiscale strategy. Since the proposed network did not use the multiscale strategy to optimize the parameters of the models, we only used the flip method to validate the performance of the networks. As shown in Table 6, although ASN-ATL slightly outperforms GCF-SCD-Net in mIoU, the proposed network achieves the best SeK, with an improvement of 1.1.
Table 6. Comparison with the state-of-the-art methods (% for all except SeK).

Overall, the proposed method consistently improves the performance of semantic change detection, which demonstrates the effectiveness and robustness of GCF-SCD-Net.
To present the change detection results intuitively, we visualized the segmentation results to demonstrate the performance of the proposed methods. Figure 3 shows the change detection results generated by FC-EF, FC-Siam-conv, FC-Siam-diff and the three types of dual-task semantic change detection networks proposed in this work.
According to the semantic segmentation results in Samples A and B, the proposed GCF module can accurately identify both the change region and the no-change region in complex scenarios. Since "tree" and "low vegetation" have similar texture and color, most SCD networks detect them poorly, but this does not limit the segmentation results of GCF-SCD-Net, as shown in Sample B. In Sample C, conventional change detection methods cannot identify the pseudochange region well, whereas the proposed UNet-SCD and PSPNet-SCD alleviate this problem, which demonstrates that existing change detection networks are ill-suited to dual-task SCD. Generally, the trees on both sides of a road form elongated regions. According to Sample D, the proposed GCF module performs well in such stripe scenarios. Change types with a small number of samples (such as "Playground") are difficult to identify accurately, yet our method performs well under these conditions. Figure 4 depicts the visual results of the semantic prediction and binary prediction based on GCF-SCD-Net, where we can see that the GCF module extracts the change regions accurately. Consequently, the semantic change detection module can effectively classify each pixel based on the change field.
Figure 4. Semantic change maps and binary change maps generated by GCF-SCD-Net. c1 is an image pair; c2 and c3 are the semantic label and prediction; images in c4 were obtained by fusing the raw images and semantic prediction masks; c5, c6 and c7 represent the binary change label, binary change prediction and binary fusion results.
Above, the visual results fully demonstrate that our method is effective and superior to the existing methods.

Conclusions
In this work, in order to address the problem that existing methods are incapable of obtaining a significant result for dual-task semantic change detection, we proposed a generative change field (GCF)-based dual-task semantic change detection network for remote sensing images. The proposed network consists of a Siamese convolutional neural network (Siam-Conv) module for extracting the feature representation from the raw image pairs, a generative change field module for obtaining the binary change map and two generative semantic change modules for generating the semantic segmentation maps of the bitemporal images. Moreover, it is an end-to-end SCD network. To alleviate the sample imbalance problem, we designed a separable loss for better training the deep models.
Extensive experiments were conducted in this work to demonstrate the competitive performance that can be achieved by GCF-SCD-Net, compared with existing methods as well as the proposed dual-task SCD networks (UNet-SCD and PSPNet-SCD). What is more, we validate the effectiveness of the proposed separable loss function; it is worth noting that the proposed separable loss is a general strategy to alleviate the sample imbalance problem. Therefore, it can be applied to other benchmark datasets that suffer from label imbalance.
At present, the SECOND dataset is the only public dataset for dual-task semantic change detection. We also note that the proposed network and conventional networks perform poorly in edge detection and contour extraction in intersecting zones. Consequently, in future work, we intend to build a large-scale, very-high-resolution benchmark dataset for semantic change detection based on multi-source satellite data. To achieve better segmentation results, we also intend to use the Markov Random Field (MRF) [37] method as well as a boundary loss [38] to optimize the segmentation results.