1. Introduction
Change detection aims at identifying significant differences between ground targets or phenomena in multi-temporal remote sensing images and is one of the most important means by which humans can observe changes on the Earth's surface. It has been applied in several fields, such as urban monitoring [1,2], forest monitoring [3], open-pit mine monitoring [4,5,6], and disaster assessment [7].
With the rapid development of remote sensing technology, the increasing availability of Earth observation data from various satellites, such as WorldView, QuickBird, ZY-3, GaoFen, Sentinel, and Landsat, has made remote-sensing-based change detection a widespread concern among researchers [8]. In particular, detailed information, such as the texture, spectrum, and location of ground targets, can be captured at a finer scale by very-high-resolution (VHR) remote sensing images [9]. Therefore, VHR remote sensing images are considered one of the most important data sources for change detection and serve related studies [10,11,12].
Due to the non-negligible disadvantages of visual interpretation for change detection, such as high cost and low efficiency, several traditional and automatic methods have been proposed. The most widely used ones mainly include algebraic-based methods, transformation-based methods, and classification-based methods. (1) Algebraic-based methods: algebraic operations or transformations are performed on multi-temporal remote sensing images to obtain change maps, such as change vector analysis (CVA) [13], image regression [14], and image differencing [15]. The key to this type of method is determining the change threshold; as there is not yet a reliable method for selecting this threshold, their accuracy is strongly subject to human influence (a toy sketch of this thresholding step is given after this paragraph). (2) Transformation-based methods: change maps are obtained by reducing data dimensionality and highlighting difference information in multi-temporal images, such as tasseled cap transformation [16] and principal component analysis (PCA) [17]. However, this type of method is likely to affect the localization of change areas and the determination of change types. (3) Classification-based methods: change maps are obtained by comparing multiple classification maps [18,19,20]. This strategy of classification followed by change detection is prone to error accumulation. Although these methods have improved efficiency, they share a common and non-negligible disadvantage: they all conduct comparative analysis on handcrafted features (textures, spectra, etc.) to detect changes, which makes it difficult to accurately represent the various types of complex environments in remote sensing images.
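As a concrete example of point (1), the following is a minimal, illustrative sketch of CVA-style image differencing with a hand-picked change threshold; the data and the threshold value are hypothetical stand-ins, not from this study:

```python
import numpy as np

# Toy illustration of an algebraic-based method: CVA-style image differencing.
t1 = np.random.rand(256, 256, 3)  # multispectral image at date 1 (stand-in data)
t2 = np.random.rand(256, 256, 3)  # multispectral image at date 2

magnitude = np.linalg.norm(t2 - t1, axis=2)  # per-pixel change vector magnitude
threshold = 0.8                              # hand-picked, which is the core weakness
change_map = magnitude > threshold           # binary change map
```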
Since the successful application of deep learning to computer vision, many CNNs (Convolutional Neural Networks) have been proposed, such as FCN (Fully Convolutional Networks), PSPNet (Pyramid Scene Parsing Network), DeepLabv3+, and SegNet [21,22]. These CNNs demonstrate a powerful learning capability for structured image features and provide new research ideas for change detection [23,24]. Existing research on deep-learning-based change detection can be divided into early-difference networks and late-difference networks according to how the multi-temporal remote sensing images are fused. The general flow of change detection for both types of networks is shown in Figure 1. (1) Early-difference networks: the two images are fused by channel stacking or by taking their absolute difference to fit the single input of this type of network. The change information of the two images is present from input to output, so the network can focus on discovering change regions throughout, without error accumulation. However, the early layers of the network cannot provide deep features of a single image for image reconstruction (deep features here refer to information such as the actual boundaries and internal integrity of ground targets), so the change detection results are prone to rough boundaries and scattered holes [5,25,26]. (2) Late-difference networks: these are the opposite of early-difference networks in that they use two inputs to receive the two images, with the early layers extracting the deep features of each image and the late layers obtaining the change information by taking the absolute difference. Although the early layers can provide deep features of a single image to reconstruct the images, this two-stage approach to extracting change features is prone to error accumulation, so the change detection results are prone to spurious changes, such as background changes due to seasons and shadows [6,27,28]. In summary, while some progress has been made in deep-learning-based change detection research, the problem that existing single-level difference networks tend to produce change detection results with rough boundaries, scattered holes, and spurious changes remains to be solved (early-difference networks and late-difference networks are collectively called single-level difference networks here).
To address the above problems, this study constructs an early-difference network and a late-difference network, and then combines them to propose a multi-level difference network (MDNet) for change detection from VHR remote sensing images. MDNet enables image reconstruction and the reduction of error accumulation to be conducted together in one network, i.e., it can simultaneously alleviate the rough boundaries, scattered holes, and spurious changes in the change detection results and improve accuracy. Specifically, MDNet uses an encoder-decoder architecture for end-to-end change detection. The encoder consists of a late-difference network and an early-difference network. After the two images are input into the late-difference network, the deep features are first extracted separately, and then the change features are extracted. The absolute difference between the two images is input into the early-difference network, and the change features are extracted directly. To effectively fuse the two heterogeneous change features, the Multi-level Change Features Fusion Module (MCFFM) proposed in this study is used in the decoder for their weighted fusion. Further, shallow information from the encoder is introduced into the decoder using skip connections [29] to reduce the information loss caused by increasing network depth and thereby reduce missed detections in small change areas. For example, landslides are small ground targets in remote sensing images, and their pixel proportion may even be less than 6% in some datasets [21,30]. Therefore, reducing missed detections in small change areas is crucial to improving the accuracy of similar tasks.
The main contributions of this study include:
- (1) MDNet for high-precision change detection from VHR remote sensing images is proposed by combining an early-difference network and a late-difference network. This study demonstrates that multi-level difference networks are more advantageous than the widely used single-level difference networks for change detection from VHR remote sensing images.
- (2) MCFFM for the effective fusion of multi-level change features is proposed, which further enhances the performance of MDNet.
- (3) Change detection of open-pit mines over a large area is implemented based on the publicly available OMCD dataset, and experimental results on this dataset show that the proposed MDNet has the best change detection performance. In addition, a self-made OMCD dataset containing two open-pit mines was produced, and localized, fine-scale change detection of open-pit mines was implemented on it. The experimental results show that the proposed MDNet outperforms all benchmark methods.
- (4) A multi-scenario suitability analysis was carried out using the Season-varying Change Detection Dataset, and the results show that MDNet can also detect changes in other scenarios very well.
2. Methods
2.1. MDNet
MDNet adopts an encoder-decoder architecture for end-to-end change detection, with the encoder extracting multi-level change features, and the decoder fusing these features and generating the change detection result. Its structure is shown in Figure 2.
The encoder consists of a late-difference network and an early-difference network, both of which use ResNet50 for feature extraction, as the residual structure of ResNet50 allows it to better extract the deep features of the images. In the encoder, the late-difference network takes the two images as input, applies layer-by-layer down-sampling to extract multi-scale features from each image, and calculates the absolute difference between the features layer by layer to obtain multi-scale change features. The early-difference network takes the absolute difference between the two images as input and extracts the multi-scale change features directly. In the decoder, the two change features extracted by the encoder, which differ considerably from each other, are first stacked along the channel dimension, and then the MCFFM proposed in this study is used to fuse them effectively, followed by layer-by-layer up-sampling; finally, the change detection result is output. It is worth noting that the information loss in the late layers of the network increases with network depth. Therefore, multi-scale change information from the early layers is introduced into the decoder through skip connections and MCFFM to reduce information loss and improve change detection accuracy. Finally, to reduce the effect of sample imbalance on the training process, this study used a joint loss function to calculate the loss of MDNet [31].
Regarding the feature map sizes, take a 3 × 512 × 512 image as an example (3 × 512 × 512 means the image has 3 channels, a height of 512, and a width of 512): it is input into MDNet, passes through the encoder and decoder in turn, and finally a 2 × 512 × 512 change map is output. In the encoder, the sizes of the 5 feature maps are 64 × 512 × 512, 128 × 256 × 256, 256 × 128 × 128, 512 × 64 × 64, and 1024 × 32 × 32, in that order. In the decoder, the 4 feature maps have sizes of 512 × 64 × 64, 256 × 128 × 128, 128 × 256 × 256, and 64 × 512 × 512, in that order. A shape-level sketch of the two-branch encoder is given below.
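To make the two-branch design concrete, the following is a minimal PyTorch sketch of the multi-level difference encoder, not the authors' implementation. `TinyExtractor` is a hypothetical stand-in for ResNet50 that reproduces the five feature-map sizes quoted above, and the late-difference branch is assumed to share weights between the two dates, as in typical Siamese designs:

```python
import torch
import torch.nn as nn

class TinyExtractor(nn.Module):
    """Hypothetical stand-in for ResNet50 yielding 64x512x512 ... 1024x32x32."""
    def __init__(self):
        super().__init__()
        chans = [3, 64, 128, 256, 512, 1024]
        self.stages = nn.ModuleList()
        for i in range(5):
            stride = 1 if i == 0 else 2  # stage 1 keeps 512x512; later stages halve
            self.stages.append(nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], 3, stride=stride, padding=1),
                nn.BatchNorm2d(chans[i + 1]),
                nn.ReLU(inplace=True)))

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # five multi-scale feature maps

class MultiLevelDifferenceEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.late_branch = TinyExtractor()   # weight-shared, applied to both dates
        self.early_branch = TinyExtractor()  # fed with |t1 - t2|

    def forward(self, t1, t2):
        # Late difference: extract features per image, then difference layer by layer.
        late = [torch.abs(a - b) for a, b in
                zip(self.late_branch(t1), self.late_branch(t2))]
        # Early difference: difference the images, then extract change features.
        early = self.early_branch(torch.abs(t1 - t2))
        return late, early  # two heterogeneous sets of multi-scale change features

t1, t2 = torch.rand(1, 3, 512, 512), torch.rand(1, 3, 512, 512)
late, early = MultiLevelDifferenceEncoder()(t1, t2)
print([tuple(f.shape) for f in late])  # (1,64,512,512) ... (1,1024,32,32)
```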
2.2. ResNet50 for Feature Extraction
ResNet50 is a residual network proposed by He et al. [32] to mitigate the performance degradation of deep neural networks caused by increasing network depth. The residual blocks of ResNet50 are of two types, the convolutional block and the identity block, as shown in Figure 3. The backbone of both residual blocks consists of two 1 × 1 convolutions, one 3 × 3 convolution, and several batch normalizations and ReLUs. The difference between the two is that the convolutional block adds a 1 × 1 convolution at the skip connection, whereas the identity block does not. The role of the convolutional block is to change the size of the feature map and save computational resources. The identity block is used to increase the depth of the network and extract deeper features.
ResNet50 can be divided into five stages from input to output, as shown in Figure 4. First, stage 1 changes the feature map size to 1/4 of the original through a 7 × 7 convolution and a 3 × 3 MaxPool. Stages 2 to 5 then proceed in sequence, each consisting of a convolutional block and several identity blocks. A minimal sketch of the two block types follows.
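For reference, here is a compact PyTorch sketch of the two residual blocks under the standard bottleneck design of He et al.; the class name and channel numbers are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet50-style residual block: 1x1 -> 3x3 -> 1x1 convolutions."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1, convolutional=False):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))
        # Convolutional block: 1x1 conv on the skip path to match size/channels.
        # Identity block: plain skip connection.
        self.skip = (nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch)) if convolutional else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))

x = torch.rand(1, 256, 64, 64)
conv_block = Bottleneck(256, 128, 512, stride=2, convolutional=True)  # changes size
ident_block = Bottleneck(512, 128, 512)                               # keeps size
print(ident_block(conv_block(x)).shape)  # torch.Size([1, 512, 32, 32])
```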
2.3. MCFFM for Multi-Level Change Feature Fusion
Not all of the high-level features extracted by MDNet contribute to the recognition of image differences, and irrelevant features can instead make network training more difficult. In addition, there is heterogeneity between the two types of change features extracted by the late-difference network and the early-difference network in MDNet, so directly up-sampling them after channel stacking does not make full use of the effective change information. With this in mind, this study proposes the MCFFM for the effective fusion of the two change features, as shown in Figure 5.
To obtain accurate spatial information about the change feature map, MCFFM decomposes the global pooling into the X and Y directions [33], and Figure 6 shows an example using MaxPool. Specifically, the global features of the change feature map (C × H × W, where C, H, and W represent the number of channels, the height, and the width, respectively) are first extracted along the X and Y directions using pooling kernels of size H × 1 and 1 × W, respectively. Then, the MaxPool feature $z_c^{max}(w)$ and AvgPool feature $z_c^{avg}(w)$ of channel $c$ at width $w$ can be expressed by Equations (1) and (2), respectively:

$$z_c^{max}(w) = \max_{0 \le i < H} x_c(i, w) \quad (1)$$

$$z_c^{avg}(w) = \frac{1}{H} \sum_{i=0}^{H-1} x_c(i, w) \quad (2)$$

Similarly, the MaxPool feature $z_c^{max}(h)$ and AvgPool feature $z_c^{avg}(h)$ of channel $c$ at height $h$ can be expressed by Equations (3) and (4), respectively:

$$z_c^{max}(h) = \max_{0 \le j < W} x_c(h, j) \quad (3)$$

$$z_c^{avg}(h) = \frac{1}{W} \sum_{j=0}^{W-1} x_c(h, j) \quad (4)$$

where $x_c(i, j)$ represents the value of the pixel at position $(i, j)$ in channel $c$.
After obtaining the global features in both directions, matrix addition is performed on the two feature maps in the X direction, and then 1 × 1 convolution, batch normalization, ReLU, 1 × 1 convolution, and Sigmoid are applied in turn to obtain the weight matrix Xw (C × 1 × W). The weight matrix Yw (C × H × 1) in the Y direction is obtained in the same way. Next, the weight map Zw (C × H × W), which carries precise location information, is calculated as Zw = Yw × Xw. Finally, the input and Zw are multiplied element by element to obtain the fused change feature.
In summary, MCFFM is essentially an attention-based fusion method that can automatically discover the importance of change features in both the spatial and channel dimensions. First, MCFFM computes change features with precise location information using two pooling methods in two directions, resulting in a weight map that accurately separates change pixels from non-change pixels (high weights for change pixels and low weights for non-change pixels). Then, MCFFM compresses and expands the change feature map in the channel dimension through 1 × 1 convolution, batch normalization, ReLU, 1 × 1 convolution, and Sigmoid, which effectively constructs the dependencies between channels. As a result, important channel information is preserved and unwanted channel information is discarded, thus addressing the cross-channel heterogeneity in multi-level change feature fusion. A minimal sketch of the module is given below.
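The following PyTorch sketch implements this procedure as we read it from the text; it is not the authors' code, and the module name and the channel `reduction` ratio are our own assumptions:

```python
import torch
import torch.nn as nn

class MCFFM(nn.Module):
    """Sketch of MCFFM: directional max+avg pooling -> weight matrices Xw, Yw
    -> weight map Zw = Yw x Xw -> element-wise reweighting of the input."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        def gate():  # 1x1 conv -> BN -> ReLU -> 1x1 conv -> Sigmoid
            return nn.Sequential(
                nn.Conv2d(channels, channels // reduction, 1),
                nn.BatchNorm2d(channels // reduction), nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.gate_x, self.gate_y = gate(), gate()

    def forward(self, z):  # z: channel-stacked change features, B x C x H x W
        # Global pooling decomposed along the two spatial directions,
        # with the MaxPool and AvgPool results added together.
        x_feat = z.max(dim=2, keepdim=True).values + z.mean(dim=2, keepdim=True)  # C x 1 x W
        y_feat = z.max(dim=3, keepdim=True).values + z.mean(dim=3, keepdim=True)  # C x H x 1
        xw = self.gate_x(x_feat)  # weight matrix Xw: C x 1 x W
        yw = self.gate_y(y_feat)  # weight matrix Yw: C x H x 1
        zw = yw * xw              # broadcast to the weight map Zw: C x H x W
        return z * zw             # fused change feature

fused = MCFFM(channels=64)(torch.rand(2, 64, 32, 32))
print(fused.shape)  # torch.Size([2, 64, 32, 32])
```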
2.4. Joint Loss Function for Loss Calculation
The loss function is used to calculate the difference between the reference and the predicted value. To reduce the impact of sample imbalance on network training, this study used a joint loss function ($L_{joint}$) consisting of a cross-entropy loss function ($L_{CE}$) and a DICE coefficient loss function ($L_{DICE}$) to train MDNet [31]. The joint loss function combines pixel-related and region-related losses and is defined in Equation (5):

$$L_{joint} = L_{CE} + L_{DICE} \quad (5)$$

$L_{CE}$ can effectively measure the discrepancy between the true and predicted distributions, which is related to pixels and is defined in Equation (6):

$$L_{CE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] \quad (6)$$

where $n$ is the number of pixels, and $y_i$ and $\hat{y}_i$ represent the reference value and the predicted probability value, respectively, with $y_i \in \{0, 1\}$ and $\hat{y}_i \in (0, 1)$.

$L_{DICE}$ can effectively calculate the overlap between the reference and predicted values, which is related to regions and is defined in Equation (7):

$$L_{DICE} = 1 - DICE = 1 - \frac{2\left| A \cap B \right|}{\left| A \right| + \left| B \right|} \quad (7)$$

where $DICE$ denotes the DICE coefficient; $A$ and $B$ denote the two sample sets; $A \cap B$ denotes the intersection between $A$ and $B$; and $\left| A \right|$ and $\left| B \right|$ denote the number of elements in $A$ and $B$, respectively. In change detection, $A$ denotes the set of reference change pixels, and $B$ denotes the set of change pixels predicted by the network.
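A minimal sketch of the joint loss follows; the equal weighting of the two terms in Equation (5) is an assumption (the formulation in [31] may weight them differently), and a soft, differentiable intersection replaces the set intersection of Equation (7):

```python
import torch
import torch.nn.functional as F

def joint_loss(pred, target, eps=1e-6):
    # pred: predicted change probabilities in (0, 1); target: binary reference
    # map of the same shape, with 1 for change and 0 for non-change.
    ce = F.binary_cross_entropy(pred, target)                     # Eq. (6), pixel-related
    inter = (pred * target).sum()                                 # soft |A ∩ B|
    dice = (2 * inter + eps) / (pred.sum() + target.sum() + eps)  # DICE coefficient, Eq. (7)
    return ce + (1 - dice)                                        # Eq. (5)

pred = torch.rand(8, 1, 64, 64)                    # e.g. Sigmoid output of the network
target = (torch.rand(8, 1, 64, 64) > 0.9).float()  # imbalanced toy labels
print(joint_loss(pred, target))
```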
4. Discussion
In order to further analyze the performance of the proposed MDNet, this study presents an in-depth discussion in Section 4.1, Section 4.2, Section 4.3, Section 4.4, and Section 4.5 using the publicly available OMCD dataset as an example. In Section 4.6, this study verifies the multi-scenario suitability of MDNet using the Season-varying Change Detection Dataset. In Section 4.7, some future work worth doing is discussed in detail.
4.1. Multi-Level vs. Single-Level
Ablation experiments were designed in this study to demonstrate the effectiveness of MDNet, and Figure 12 shows the change detection results for MDNet and its three ablation networks. As shown in regions 1 and 6, removing either the early-difference network or the late-difference network from MDNet significantly increased the false detections in the results. There is also a marked increase in missed areas, as in regions 2, 3, 4, and 5. This is due to the inability of the early-difference network to provide the deep features of a single image for image reconstruction, and to the error propagation in the late-difference network. It can be seen that multi-level difference networks have an advantage over single-level difference networks in change detection. The completeness of the detection results from MDNet with MCFFM removed is significantly reduced, as the network is unable to effectively fuse the multi-level change features. It is therefore concluded that MCFFM can further improve the performance of multi-level difference networks.

Figure 13 illustrates the accuracy evaluation of MDNet and its three ablation networks. The Precision, Recall, F1-score, and IoU of MDNet are 86.8%, 91.6%, 89.2%, and 80.4%, respectively, all of which are optimal. The four evaluation metrics of MDNet after removing MCFFM, the early-difference network, or the late-difference network all decrease to varying degrees. The ranking of the comprehensive performance of the three ablation networks is MDNet without MCFFM > MDNet without early-difference network > MDNet without late-difference network.
Finally, to ensure that the number of layers in the proposed MDNet is optimal, we tested four cases of increasing or decreasing the number of layers in the network: (1) Case 1: one layer fewer; (2) Case 2: two layers fewer; (3) Case 3: one layer more; (4) Case 4: two layers more. Each case was tested on the publicly available OMCD dataset. Table 6 shows the change detection accuracy of MDNet for the four cases. From Table 6, we can see that the MDNet proposed in this paper is optimal and that increasing or decreasing the number of layers reduces the change detection accuracy of the network. Therefore, it can be concluded that the number of layers in the proposed MDNet is reasonable and optimal.
4.2. Effectiveness Analysis of MCFFM
Currently, the field of computer vision frequently employs attention for feature fusion, so in this study, four commonly used attention modules were selected for comparison with the proposed MCFFM, namely, CA (Coordinate Attention) [33], CBAM (Convolutional Block Attention Module) [37], BAM (Bottleneck Attention Module) [38], and SENet (Squeeze-and-Excitation Networks) [39]. Figure 14 illustrates the accuracy evaluation of MCFFM and these attention modules as applied to multi-level change feature fusion. It can be found that MCFFM achieves the optimal Precision, Recall, F1-score, and IoU, proving that it is better able to fuse multi-level change features. In terms of overall performance, the ranking of the four attention modules is CA > CBAM > BAM > SENet.
4.3. Effectiveness Analysis of Feature Extraction Network
To verify that ResNet50 in MDNet performs feature extraction better, four commonly used feature extraction networks were selected for comparison in this study: ResNet18 [40], VGG16 [41], Xception [6], and MobileNetV2 [42]. Figure 15 shows the accuracy evaluation of MDNet with each feature extraction network. It can be found that the change detection performance of MDNet is significantly improved by using ResNet50 for feature extraction compared to the other four networks. The ranking of the overall performance of the four compared feature extraction networks is ResNet18 > VGG16 > Xception > MobileNetV2.
4.4. Comparison of Network Size and Efficiency
To verify the feasibility of the proposed MDNet, statistics on the number of parameters (Figure 16a), network size (Figure 16b), and time cost of network training and testing (Figure 16c) for the seven networks were compiled in this study. Among all networks, MDNet is the largest in number of parameters (7.25 × 10⁷) and network size (277.01 MB). Its training time cost (65 s/epoch), apart from being significantly higher than those of DeepLabv3+ (41 s/epoch) and PSPNet (36 s/epoch), is roughly on par with the other networks, with a maximum difference of no more than 7 s/epoch. Its testing time cost (12 s/epoch) is about the same as those of all the compared networks, with a maximum difference of no more than 3 s/epoch. Overall, given the better change detection performance of MDNet and the small differences in time cost between the networks, its larger number of parameters and network size are acceptable.
4.5. The Training Process of MDNet
The trend of the loss during training can reflect the performance and stability of the network. The training process of MDNet and its three ablation networks is shown in Figure 17. As the number of epochs increases, the loss of the four networks gradually decreases with little fluctuation. After 134 epochs, the loss of MDNet was steadily lower than that of the three ablation networks. After 182 epochs, the loss of all four networks stabilized. As a result, the proposed MDNet can be trained stably and with a lower overall loss, which makes it more advantageous than single-level difference networks.
4.6. Multi-Scenario Suitability Analysis of MDNet
To verify that the MDNet proposed in this study can detect changes in other scenarios well, a multi-scenario suitability analysis was conducted using the Season-varying Change Detection Dataset. Figure 18 shows the changes in cars and highways. From region 1, it can be seen that the results of MDNet capture the changes in cars quite completely, while CSA-CDGAN and DeepLabv3+ are unable to detect the cars completely. From region 2, it can be seen that the results of the other networks have more missed areas than those of MDNet. Figure 19 shows the changes in roads. As can be seen from region 3, the roads detected by the networks used for comparison all have a large number of disconnections and a small number of false detections, while the roads detected by MDNet have only very few disconnections and almost no false detections, which is closest to the ground truth. Figure 20 shows the changes in buildings. From region 4, it can be seen that the change in buildings detected by MDNet is closest to the ground truth, with the other networks showing more false detections in their results.
Table 7 shows the accuracy of each network applied to the Season-varying Change Detection Dataset. Except for Recall, the Precision, F1-score, and IoU of MDNet are all optimal and significantly improved compared to the other networks.
In summary, for the Season-varying Change Detection Dataset, the visual effect and accuracy of the change detection results for MDNet are significantly better than those of other networks, which indicates that MDNet can detect changes in more scenarios well and has some prospects for generalization.
4.7. Prospects
The multi-level difference network is an idea for constructing change detection networks that aims at introducing multiple effective and complementary change features to improve the accuracy of change detection. Notably, it is not limited to a specific network structure and has potential for further development. The proposed method is not optimal in terms of network size and training time cost, but it is not significantly different from the other methods in these respects. It is worth noting that the proposed method does achieve the best change detection performance, which is due to the new network architecture and the feature fusion module proposed in this study. In order to work towards a more comprehensive and superior model, the following directions can be considered for future work: (1) using pruning algorithms, such as filter-wise, channel-wise, shape-wise, and block-wise pruning, to compress the network and reduce its training time; (2) replacing the feature extraction network: in this study, ResNet50 was chosen for feature extraction because it gives MDNet the best change detection performance among the five tested feature extraction networks, but its network size is comparatively large (111.67 MB); therefore, without degrading the accuracy, a smaller feature extraction network could be attempted, for example by replacing a moderate amount of conventional convolution in ResNet50 with Group Convolution or Depthwise Separable Convolution to reduce the number of network parameters.
Last but not least, the results of automatic change detection methods inevitably contain varying degrees of holes, false change areas, and rough boundaries, which hardly correspond to the actual change of the ground target. For holes and false change areas, optimization can be attempted using the closing and opening operations of mathematical morphology. For rough boundaries, optimization can be performed using filtering, such as median filtering. All of these operations can be implemented using OpenCV (Open Source Computer Vision Library), as sketched below, but how to determine the parameters of these operations requires more in-depth study.
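For illustration, a minimal OpenCV sketch of this post-processing chain is given below; the kernel sizes and filter aperture are hypothetical placeholders for exactly the parameters that, as noted, require further study:

```python
import cv2
import numpy as np

# Stand-in binary change map (0 = non-change, 255 = change).
change_map = (np.random.rand(512, 512) > 0.5).astype(np.uint8) * 255

kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))  # size needs tuning
# Closing fills small holes inside change regions; opening removes small
# false change areas.
closed = cv2.morphologyEx(change_map, cv2.MORPH_CLOSE, kernel)
opened = cv2.morphologyEx(closed, cv2.MORPH_OPEN, kernel)
# Median filtering smooths rough boundaries.
smoothed = cv2.medianBlur(opened, 5)
```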
5. Conclusions
An automatic and accurate change detection method based on VHR remote sensing images is of great importance to the national economy. In this study, to address the problem of limited change detection accuracy in existing single-level difference networks, a novel deep learning model, MDNet, and a multi-level change feature fusion module, MCFFM, are proposed for change detection from VHR remote sensing images. Three datasets were used in the experiments: the publicly available OMCD dataset, a self-made OMCD dataset, and the Season-varying Change Detection Dataset. The superiority of MDNet was demonstrated by comparing it with advanced deep learning models, such as SMCDNet, SNUNet, DA-UNet++, CSA-CDGAN, DeepLabv3+, and PSPNet. The following conclusions were drawn from this study:
- (1) Multi-level difference networks are more beneficial than single-level difference networks in achieving high-precision change detection from VHR remote sensing images.
- (2) MCFFM can further enhance the change detection performance of multi-level difference networks, as it fuses multi-level change features more effectively.
- (3) ResNet50 is a good deep feature extractor for high-resolution remote sensing images.
- (4) Although MDNet has more parameters than all the compared networks, its training and testing times are of the same order of magnitude as those of all the compared networks, so it is feasible to apply MDNet to change detection for high-resolution remote sensing images.
- (5) MDNet has advanced performance not only in open-pit mine change detection, but also in other scenarios.