1. Introduction
Remote sensing is an effective means of obtaining spatial and temporal information on ground objects with full coverage and long time series, and it provides reliable support for land cover monitoring [1,2], agricultural surveys [3], disaster assessment [4,5], military reconnaissance [6], and other applications. Very-high-resolution (VHR) remote sensing images contain rich spatial information and detailed features that can be used to detect changes to specific ground objects, such as buildings [7]. Buildings are important carriers of urban development. Real-time and accurate monitoring of building changes can provide a scientific basis and reference for ecological environmental protection, natural resource management, and urban expansion analysis [8,9]. Therefore, the automatic and efficient detection of building changes from multi-temporal VHR remote sensing images has become a research hotspot that has received a significant amount of attention [10].
Building change information can be detected by the direct comparison of images [11,12] or by post-classification comparison [13,14]. The former computes the differences between images and obtains change results by clustering or thresholding [15]. Yuan et al. [16] used multichannel Gabor filters to process images at different scales and orientations and to extract building texture features, which addressed false changes caused by projection differences. In addition to computing differences from the radiometric values of each band of the image pixels, feature indices can be used, such as the commonly used morphological building index (MBI) [17] or the normalized difference built-up index (NDBI) [18]. The post-classification comparison method typically classifies multi-temporal remote sensing images with machine learning models and then performs change detection. Random forest (RF) [19], decision tree [20], support vector machine (SVM) [21], Markov random field (MRF) [22], and conditional random field (CRF) [23] models have all been applied. Tan et al. [24] constructed an SVM-based hyperspectral remote sensing image classification model, and their results showed that the radial basis function kernel gave the highest classification accuracy for SVM-based hyperspectral classification. Zhang et al. [25] used RF to obtain image-level change detection results, reducing environmental effects such as illumination and observation angle, and then fused pixel-level change results with image objects to achieve image-level and target-level building change detection. Although classical change detection methods are well researched, their results are strongly affected by atmospheric conditions and they often perform poorly in complex scenes, so they cannot meet the demand for high-precision change detection.
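As a rough illustration of the direct-comparison approach described above, the following sketch differences two co-registered single-band images and thresholds the result. The function name `difference_change_map`, the small example grids, and the mean-plus-k-standard-deviations threshold are assumptions made for illustration only; the cited works may use clustering or other thresholding rules.

```python
from statistics import mean, pstdev


def difference_change_map(before, after, k=1.0):
    """Return a binary change map from two equally sized 2-D grids.

    A pixel is flagged as changed (1) when its absolute difference exceeds
    mean + k * std of all absolute differences. This rule is an assumed,
    deliberately simple stand-in for the thresholding step.
    """
    diffs = [
        [abs(a - b) for a, b in zip(row_after, row_before)]
        for row_before, row_after in zip(before, after)
    ]
    flat = [d for row in diffs for d in row]
    threshold = mean(flat) + k * pstdev(flat)
    return [[1 if d > threshold else 0 for d in row] for row in diffs]


# Hypothetical 3 x 3 radiance values; only the centre pixel changes strongly.
before = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
after = [[10, 11, 10], [10, 90, 10], [10, 10, 11]]
change_map = difference_change_map(before, after)
```

In this toy case the small differences of one grey level fall below the threshold, which mirrors how thresholding suppresses minor radiometric noise. It also shows why such methods are sensitive to illumination and atmospheric effects: a global brightness shift changes every difference value.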
After several years of development, the spatial resolution of satellite imagery has improved from the kilometer level to the meter and submeter level, which allows finer monitoring of ground changes. At the same time, deep learning methods have injected new vigor into building change detection research. Zhu et al. [26] combined an improved SegNet network with image morphology to identify new buildings, but the method is not sensitive to changes with large structural differences. Peng et al. [27] proposed a Unet++-based network with multiple side outputs; however, because the deep features of each individual image are not extracted, the accuracy of change detection is limited. Daudt et al. [28] proposed three end-to-end structures, among which the two-branch siamese structure extracts feature information from each image separately and effectively improves the accuracy of change detection. Fang et al. [29] combined the siamese network with NestedUNet and proposed the SNUNet change detection network, which obtained better results on the CDD dataset through dense skip connections between the encoder and decoder. Chen et al. [30] divided the image into multiscale subregions, extracted features using a two-branch pyramid model, and produced a remote sensing building change detection dataset (LEVIR-CD). Although these siamese building change detection methods improve the efficiency and accuracy of change detection, most models extract image features inadequately and are prone to missed detections.
The attention mechanism focuses the model on regions of interest, improving building change detection accuracy to some extent [31,32,33]. DASNet [34] introduces an extended attention mechanism consisting of two components, a channel attention mechanism and a spatial attention mechanism, which enhances the performance of the model in detecting building changes. Lei et al. [35] proposed the SNLRUX++ network for building change detection, which cascades multiscale feature fusion to improve the prediction of feature maps at different scales and thereby the detection of dense buildings. Wang et al. [36] combined Unet++ with a multilevel difference module to highlight change regions while reducing the influence of “pseudo-change”. These deep learning methods extract image features at a deeper level than traditional methods. However, continuous down-sampling loses spatial position information, which leads to incomplete detection of change regions and rough building edges.
To address these problems, a siamese EfficientNet B4-MANet network (Siam-EMNet) is proposed for extracting building change information from VHR images. The primary contributions of this article are as follows:
(1) A two-branch EfficientNet B4 encoder structure is designed. The encoder applies compound scaling to the width, depth, and resolution of the network, which helps to predict building change regions more accurately and to obtain higher-quality prediction results. In addition, pretrained weights are used so that training converges more stably.
(2) A Siam-EMNet change detection network is constructed. The decoder integrates the position-wise attention block (PAB) and the multi-scale fusion attention block (MFAB) from MANet to up-sample the feature maps of the encoder. Details are recovered and refined during up-sampling, so the edges of changed regions are detected more accurately and missed detections of small regions are effectively reduced.
(3) The Siam-EMNet model is optimized using a hybrid loss function, a weighted combination of dice loss and cross-entropy loss, to reduce the detection error caused by the imbalance between changed and unchanged samples.
The rest of the paper is arranged as follows: Section 2 explains the proposed Siam-EMNet method; Section 3 analyzes the experimental results; Section 4 discusses the proposed method; and Section 5 concludes the paper.
4. Discussion
(1) Validation of model generalization
A single dataset cannot measure the generalizability of the model. In addition to the LEVIR-CD dataset, experiments were also performed on the WHU-CD [54] change detection dataset, which is provided by Shunping Ji’s team at Wuhan University. The dataset has a spatial resolution of 0.075 m and a size of 32,507 pixels × 15,354 pixels, and the original images were cropped to 256 pixels × 256 pixels during preprocessing. This paper uses 3107 image pairs for training, 433 pairs for validation, and 923 pairs for testing. The quantitative assessment results on the WHU-CD dataset are shown in Table 7: the precision, recall, accuracy, and F1-score of the Siam-EMNet method are 92.22%, 93.46%, 95.92%, and 92.77%, respectively. Compared with the SOTA methods, Siam-EMNet achieves the best recall, accuracy, and F1-score. Its precision is lower only than that of DSIFN, whereas its recall, accuracy, and F1-score are 11.66%, 1.23%, and 4.5% higher than those of DSIFN, respectively. The higher the recall, the lower the miss rate of the model and, consequently, the more complete the detection of the change regions. The results on the two datasets indicate that the proposed network model generalizes well.
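For reference, the four metrics reported above can be computed from pixel-wise binary labels as in the following minimal sketch. It assumes the standard confusion-matrix definitions with "changed" as the positive class; the function name `change_detection_metrics` and the ten-pixel example are hypothetical and are not taken from the experiments in this paper.

```python
def change_detection_metrics(pred, truth):
    """Compute precision, recall, accuracy, and F1 from flat binary labels.

    Both inputs are sequences of 0 (unchanged) and 1 (changed) of equal length.
    """
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical labels: one missed change (false negative), one false alarm.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = change_detection_metrics(pred, truth)
```

Because recall divides the true positives by all truly changed pixels, every missed change lowers it directly, which is why a higher recall corresponds to more complete detection of change regions.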
(2) Effect of different versions of EfficientNet on change detection results
EfficientNet has eight versions, B0–B7. The versions differ in how far the depth, width, and resolution of the network are scaled, and they place different requirements on image resolution, feature type, and computing capability. To explore how the different versions perform in building change detection, this paper uses the EfficientNet B0–B6 structures as the encoder. Because EfficientNet B7 has relatively high computing requirements, it is not included in the comparative experiment. As seen from the quantitative assessment results in Table 8, the EfficientNet B4 architecture selected in this paper performs best in detecting building changes.
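To make the scaling of depth, width, and resolution more concrete, the sketch below evaluates the compound scaling rule from the original EfficientNet work, in which a single coefficient phi scales all three dimensions. The base values 1.2, 1.1, and 1.15 are the ones reported for EfficientNet as far as we are aware, and should be treated as an assumption here. The released B1–B7 variants use rounded, individually specified multipliers, so treating phi = k as version Bk is only an approximation. The function name `compound_scaling` is hypothetical.

```python
# Assumed base coefficients for depth, width, and input resolution.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scaling(phi):
    """Return (depth, width, resolution, flops) multipliers for coefficient phi.

    Depth scales as ALPHA**phi, width as BETA**phi, resolution as GAMMA**phi.
    FLOPs grow roughly with depth * width**2 * resolution**2.
    """
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    flops = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi
    return depth, width, resolution, flops


# Print the multipliers for phi = 0..6 to show how quickly the cost grows.
for phi in range(7):
    d, w, r, f = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{f:.1f}")
```

Under these assumed coefficients the cost roughly doubles with each step of phi, which is consistent with the decision above to exclude the largest version on computational grounds.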
(3) Effect of different loss functions on the change detection results
In this paper, the effect of the hybrid loss function on model performance is explored through ablation experiments. As seen from the quantitative assessment results in Table 9, when the cross-entropy loss function is used alone, precision is slightly higher than with the hybrid loss function used by the proposed method, but recall, accuracy, and F1-score are lower. Therefore, among the loss functions compared, the combination of the dice loss function and the cross-entropy loss function optimizes the proposed model most effectively and reduces the detection error caused by the imbalance between the changed and unchanged samples.
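The following minimal sketch shows one way a weighted combination of dice loss and binary cross-entropy loss can be written. It is not the implementation used in this paper: the equal weights `w_dice` and `w_ce`, the smoothing constants, and the function names are assumptions made for illustration, since the weighting is not restated in this section.

```python
import math


def dice_loss(probs, targets, eps=1e-6):
    """Dice loss: 1 - 2|P ∩ G| / (|P| + |G|), with eps for numerical safety."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    total = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def cross_entropy_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over all pixels."""
    losses = []
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses)


def hybrid_loss(probs, targets, w_dice=0.5, w_ce=0.5):
    """Weighted sum of dice and cross-entropy losses (weights are assumed)."""
    return (w_dice * dice_loss(probs, targets)
            + w_ce * cross_entropy_loss(probs, targets))


# Hypothetical imbalanced sample: 9 unchanged pixels, 1 changed pixel.
targets = [0] * 9 + [1]
missed = [0.01] * 10            # predicts "unchanged" everywhere
detected = [0.01] * 9 + [0.99]  # also finds the changed pixel
loss_missed = hybrid_loss(missed, targets)
loss_detected = hybrid_loss(detected, targets)
```

In this example the dice term stays close to 1 when the single changed pixel is missed, even though most pixels are classified correctly. This illustrates why adding dice loss to cross-entropy loss helps when changed pixels are much rarer than unchanged ones.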
5. Conclusions
A VHR remote sensing image building change detection network based on Siamese EfficientNet B4-MANet (Siam-EMNet) was proposed to solve the problems of incomplete detection of change areas and rough edges in building change detection. The encoder applies compound scaling to the width, depth, and resolution of the network, which enhances its ability to extract building feature information. The decoder integrates PAB and MFAB, enhancing the detection of building edge details and effectively reducing the missed detection of small regions. Skip connections between the encoder and decoder fuse the bi-temporal feature information extracted from the VHR remote sensing images, achieving efficient multilevel information fusion. The results obtained on the LEVIR-CD dataset show that Siam-EMNet is effective for building change detection, with a precision, recall, accuracy, and F1-score of 92.00%, 88.51%, 95.71%, and 90.21%, respectively. Compared with BIT, CDNet, DSIFN, L-Unet, P2V-CD, and SNUNet, Siam-EMNet achieves the best overall performance. Additionally, the WHU-CD dataset was used to verify the generalizability of the network model, and the results show that the proposed model generalizes well. Despite the advantages of EfficientNet B4 in terms of accuracy, the model itself has a large number of parameters. Therefore, in future research, the number of parameters can be reduced by pruning to improve the adaptability of the model.