1. Introduction
The change detection task aims to identify differences between multi-temporal remote sensing images of the same geographical area using appropriate algorithms [1]. Change detection technology can be used in the planning of urban built-up areas, disaster assessment, ecological monitoring, and other applications, allowing the development of the world to be tracked quantitatively, quickly, and reliably. At present, change detection is mainly applied to high-resolution optical images. However, optical images are susceptible to cloud cover, sun angle, seasonal changes, and other factors, which can greatly degrade the detection results. In contrast, synthetic aperture radar (SAR) images have strong penetration, are not affected by cloud, fog, or solar altitude angle, and provide all-weather data. SAR imagery can therefore compensate for the shortcomings of optical imagery and has important research value in remote sensing applications [2]. Nevertheless, the resolution of SAR image datasets is generally lower than that of optical images, and SAR images convey less information that can be interpreted intuitively.
Figure 1 shows an optical image and the corresponding SAR image of the same area of Changshu City. Comparing the two, the optical image, which superimposes several visible-light bands, shows a more distinct geometric structure, clearer object edge contours, and richer spectral and texture information. The application of SAR image change detection is therefore limited by this lack of information. In this paper, deep learning was used to extract more targeted and deeper information from SAR images, which expands the application range of SAR image change detection compared with traditional methods. Although some deep learning change detection techniques for SAR images already exist, few studies focus on built-up areas. Using SAR images to study changes in urban built-up areas, such as urban road and building changes, therefore has considerable potential for urban planning and urban change analysis. Because there is a lack of public SAR image change detection datasets, the datasets used in this paper were produced by the authors and their team. These SAR datasets also further verify the practicability of deep learning for SAR imagery.
Traditional SAR image change detection techniques usually consist of three steps: (1) image preprocessing; (2) generation of a difference image (DI); and (3) analysis of the DI to identify the changed areas [3]. During preprocessing, the spatial correspondence between the bitemporal SAR images must be established through image registration [4], which ensures good spatial consistency between the images [5]. The core of SAR change detection is the generation and analysis of the DI, and the quality of the DI and the method used to analyze it strongly affect the final detection results. The common algorithms for generating a DI are the difference operator and the ratio operator. Because of the characteristics of the SAR imaging mechanism, the ratio operator, which generates the DI through ratio operations on the bitemporal SAR images, is generally preferred; common variants include the logarithmic-ratio operator and the mean-ratio operator. The log-ratio operator converts the pixel values into the logarithmic domain and then computes the ratio of the image pair [6]. It effectively compresses the large differences produced by the ratio calculation and reduces the influence of isolated background points in the unchanged class, but its ability to preserve edge information is weak [5]. The mean-ratio operator integrates over the local neighborhood of each pixel and computes the mean ratio of the image pair [7]; by considering spatial neighborhood information, it suppresses noise well in change detection [8].
The common methods for DI analysis can be divided into supervised and unsupervised methods. Unsupervised change detection usually relies on thresholding or clustering. In the thresholding approach, an optimal threshold is sought and the pixels of the DI are classified into a changed class and an unchanged class: pixels above the threshold are set to white (changed class), and pixels at or below the threshold are set to black (unchanged class). A classical thresholding algorithm is the method of Moser et al., which combines the image ratio with a generalization of the Kittler and Illingworth minimum-error thresholding algorithm (K&I) [9]. Building on the KI algorithm, models were later constructed to find the optimal threshold, leading to the generalized KI (GKI) threshold selection algorithm [5], which was further extended to account for the non-Gaussian distribution of SAR image amplitudes [10]. Clustering-based DI analysis is more convenient and flexible because it requires no explicit modeling; among the most popular clustering methods, the fuzzy C-means (FCM) algorithm can retain more information than hard clustering in some cases. Krinidis and Chatzis [11] proposed the fuzzy local information C-means (FLICM) clustering method, which takes into account both local spatial information and gray level and defines an automatic fuzzy factor to improve robustness without human intervention. Although these unsupervised analysis methods do not need prior information from a training set, they struggle to meet current needs as data complexity and the range of applications increase.
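To make the classical pipeline above concrete, the following is a minimal NumPy sketch of unsupervised DI analysis: a log-ratio DI is generated from the co-registered bitemporal images and then binarized with a threshold. Otsu's method is used purely as a simple stand-in for the KI/GKI threshold selection discussed above, and the input arrays are placeholders.

```python
import numpy as np
from skimage.filters import threshold_otsu

def log_ratio_di(img1: np.ndarray, img2: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Log-ratio difference image for co-registered bitemporal SAR intensities."""
    return np.abs(np.log((img2.astype(np.float64) + eps) /
                         (img1.astype(np.float64) + eps)))

def threshold_change_map(di: np.ndarray) -> np.ndarray:
    """Binarize the DI: pixels above the threshold are 'changed' (white)."""
    t = threshold_otsu(di)              # stand-in for KI/GKI threshold selection
    return (di > t).astype(np.uint8) * 255

# Usage (placeholder arrays standing in for registered SAR images):
# t1 = np.load("sar_t1.npy"); t2 = np.load("sar_t2.npy")
# change_map = threshold_change_map(log_ratio_di(t1, t2))
```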
Supervised methods use a training set with ground-truth change labels as prior information to train classifiers such as the support vector machine (SVM) and extreme learning machine (ELM) [12]. Chao Li et al. proposed a context-sensitive similarity measure based on a supervised classification framework that magnifies the difference between changed and unchanged pixels [13]. Goodfellow et al. [14] combined thresholding with deep belief networks (DBNs): a thresholding method first converts the DI into a binary image, and the training samples are then flattened into column vectors for training in the DBN; however, this approach ignores the spatial arrangement of the samples and cannot learn their spatial features. Overall, supervised classification uses features extracted from labeled data as prior information and tends to achieve better classification performance than unsupervised classification. However, most experiments with these methods have used open datasets of farmland, water areas, glaciers, and other natural areas, and research specifically targeting urban built-up areas requires further investigation. A minimal sketch of such a supervised pipeline is given below.
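As an illustration of the supervised DI-analysis idea described above, the sketch below trains an SVM on flattened pixel neighborhoods of the DI using ground-truth change labels. The patch size, coordinate sampling, and function names are illustrative assumptions, not the setup of any specific cited method.

```python
import numpy as np
from sklearn.svm import SVC

def extract_patches(di: np.ndarray, coords, size: int = 5) -> np.ndarray:
    """Flatten a (size x size) neighborhood of the DI around each (y, x) coordinate."""
    r = size // 2
    padded = np.pad(di, r, mode="reflect")
    return np.stack([padded[y:y + size, x:x + size].ravel() for y, x in coords])

# di: difference image, labels: ground-truth change map (1 = changed, 0 = unchanged)
# train_coords / test_coords: pixel coordinates sampled for training and prediction
def train_and_classify(di, labels, train_coords, test_coords):
    X_train = extract_patches(di, train_coords)
    y_train = np.array([labels[y, x] for y, x in train_coords])
    clf = SVC(kernel="rbf").fit(X_train, y_train)   # supervised classifier
    return clf.predict(extract_patches(di, test_coords))
```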
In summary, several unavoidable problems remain in the wider application of SAR image change detection. The first is speckle noise suppression: speckle noise, caused by the fading of the radar echo signal, is an inherent defect of SAR imaging that blurs image details and degrades image quality, making it difficult to construct a good difference image. Further problems are the generation of a high-quality DI and the design of effective classifiers. A good DI is an essential precondition for correct detection results in most methods, and the classifier determines the final change detection result. At present, most commonly used classifiers perform classification through complex formulas or the optimization of objective functions, but such methods cannot make good use of the image features, so the classification performance is not ideal [15].
Figure 2 shows the result of processing a SAR image pair with the traditional SAR image change detection method CD_DCNet. Figure 2c shows the ground truth, and Figure 2d shows the corresponding change detection result; it can be seen that the traditional method is strongly disturbed by noise.
Deep learning is the state of the art in computer vision and image processing [16]. Its core idea is the hierarchical representation of input data: stacked neural network layers learn progressively more abstract and more robust feature representations. Deep learning now shows outstanding performance in image feature extraction and classification and has attracted wide attention from researchers worldwide [15]. CNN-based change detection reduces the number of parameters to be trained, lowers the model complexity, and at the same time improves the generalization ability of the model. Using CNNs to learn features from remote sensing images and study the changes between bitemporal images has become a common approach in change detection. Tao Liu et al. proposed a dual-channel convolutional neural network containing two parallel CNN branches that extract features from the two multi-temporal SAR images [17]. However, because of the inherent limitations of the convolution operation, the receptive field is constrained to locally connected pixels, and this relatively simple CNN structure cannot extract rich contextual information. Fang Liu et al. proposed a convolutional neural network with local constraints, in which a spatial constraint (the local constraint) is added to the output layer of the CNN to learn a multi-layer difference map [18]. However, this network is tailored to polarimetric SAR data, and its applicability to change detection of urban building complexes in SAR images has not been demonstrated.
This paper proposes a network architecture that combines early and late fusion strategies. Compared with other CNN-based SAR image change detection networks, the design is deeper and more complex, but it can extract deeper features. The model combines a UNet++ and a Transformer structure for SAR image change detection. Unlike the traditional SAR change detection pipeline, there is no difference image generation or analysis step; instead, the network is trained directly on the labeled samples, so the influence of the speckle noise inherent in SAR images is suppressed to some extent. First, the bitemporal SAR images are fed into two weight-sharing branches of a single network that learn the features of each single-temporal image layer by layer, and the learned feature maps of corresponding layers are concatenated. The concatenated features are then passed to the Transformer decoder, which extracts rich contextual information and learns local–global semantic features. Finally, an up-sampling module restores the feature maps to the original resolution to obtain the change detection result. Learning the deep features of the two dates separately avoids, to a certain extent, the influence of pseudo-changes and SAR noise on the learned representations; encoding the concatenated context features captures local–global information; the shallow features are retained; and contextual information is fully exploited throughout the process. The resulting SAR change detection model can accurately locate changes in building groups with improved accuracy. A simplified sketch of this pipeline is given after the list of contributions below. The specific contributions of our model are as follows:
- (1)
We propose an end-to-end network architecture that combines UNet++ and the Transformer. This is the first attempt to combine UNet++ with the Visual Transformer in SAR image change detection. Through this hybrid architecture, the model can capture long-distance dependencies and fully learn multi-layer features.
- (2)
The model maintains high change detection accuracy on SAR images with heavy noise. It performs excellently in detecting changes of building groups in urban built-up areas and effectively reduces the influence of noise and false changes in SAR images.
- (3)
The experimental results show that the proposed model outperformed other change detection models in terms of expression ability and the suppression of pseudo-changes and noise. On several representative datasets, it achieved better results in terms of F1, IoU, and other evaluation indices.
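The following is a minimal PyTorch-style sketch of the pipeline described above: two weight-sharing convolutional branches, concatenation of the bitemporal features, a Transformer layer standing in for the context-modeling decoder, and up-sampling back to the input resolution. The layer sizes and module choices are illustrative assumptions and do not reproduce the exact TransUNet++SAR configuration.

```python
import torch
import torch.nn as nn

class SiameseCDSketch(nn.Module):
    """Illustrative sketch: weight-shared CNN branches + Transformer context + upsampling."""
    def __init__(self, in_ch=1, feat=64, num_classes=2):
        super().__init__()
        # Shared convolutional branch applied to each SAR date (single-temporal features)
        self.branch = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        # Transformer layer models local-global context on the fused bitemporal features
        self.context = nn.TransformerEncoderLayer(d_model=2 * feat, nhead=8,
                                                  batch_first=True)
        # Upsampling head restores the original resolution and predicts change/no-change
        self.head = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(2 * feat, num_classes, 1),
        )

    def forward(self, x1, x2):
        f1, f2 = self.branch(x1), self.branch(x2)      # shared weights for both dates
        fused = torch.cat([f1, f2], dim=1)             # (B, 2*feat, H/4, W/4)
        b, c, h, w = fused.shape
        tokens = fused.flatten(2).transpose(1, 2)      # (B, H*W/16, 2*feat)
        tokens = self.context(tokens)                  # local-global context modeling
        fused = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.head(fused)                        # (B, num_classes, H, W)

# Usage with a dummy 64 x 64 bitemporal crop:
# model = SiameseCDSketch()
# out = model(torch.randn(1, 1, 64, 64), torch.randn(1, 1, 64, 64))
```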
The rest of this paper is organized as follows. Section 2 reviews the related work, Section 3 introduces the proposed TransUNet++SAR method, Section 4 describes the experiments and the analysis of the results, Section 5 discusses the findings, and Section 6 presents the conclusions of this paper.
4. Results
In this section, we first describe the datasets used in our experiments (including their acquisition, processing, and pixel sizes), the basic experimental settings, and the evaluation metrics. We then introduce the comparative methods, report and analyze the results on each dataset, and finally examine the computational complexity of the models.
4.1. Datasets and Parameter Settings
- (1)
Beijing Datasets
The Beijing dataset for detecting changes in building complexes was obtained from ESA in GRD format with the IW scanning mode. After acquiring the imagery, we performed geometric correction, image registration, and region cropping to obtain bitemporal images of 10,000 × 10,000 pixels. We then delineated the outlines of areas with obvious building changes in the bitemporal images and produced a 10,000 × 10,000 pixel label, in which changed areas are white and unchanged areas are black. The bitemporal images and labels were cut into 256 × 256 pixel samples, and samples without building changes were removed. The remaining samples were mirrored and rotated to augment the dataset, which finally contained 5000 training samples, 1000 test samples, and 1000 validation samples. A sketch of this tiling and augmentation procedure is shown below.
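The following is a minimal sketch of the tiling, filtering, and augmentation procedure described above (similar steps were applied to the other datasets). Array shapes, the change-area criterion, and the function name are illustrative assumptions.

```python
import numpy as np

def make_samples(img_t1, img_t2, label, tile=256, min_changed_pixels=1):
    """Cut co-registered bitemporal images and the label into tiles,
    keep only tiles containing building changes, and augment them."""
    samples = []
    h, w = label.shape
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            lab = label[y:y + tile, x:x + tile]
            if np.count_nonzero(lab) < min_changed_pixels:
                continue                      # drop tiles without changes
            t1 = img_t1[y:y + tile, x:x + tile]
            t2 = img_t2[y:y + tile, x:x + tile]
            for k in range(4):                # rotations by 0/90/180/270 degrees
                for flip in (False, True):    # plus mirroring
                    aug = lambda m: np.flip(np.rot90(m, k), axis=1) if flip else np.rot90(m, k)
                    samples.append((aug(t1), aug(t2), aug(lab)))
    return samples
```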
- (2)
Qingdao Datasets
The Qingdao change detection dataset was captured by the Sentinel-1 satellite. It covers the urban areas of Huanxiu Street, Chaohai Street, and Longshan Street near Qingdao City, Shandong Province, and contains a pair of SAR images acquired on 12 April 2017 and 16 April 2022 together with change detection labels for the buildings. Each complete image is 35,154 × 21,177 pixels with a spatial resolution of 10 m per pixel. The dataset was divided into non-overlapping 256 × 256 pixel tiles, and invalid images and tiles without building changes were filtered out. The remaining samples were then split proportionally into 4000 training samples, 800 validation samples, and 800 test samples.
- (3)
Guangzhou Datasets
The Guangzhou change detection dataset consists of 2099 pairs of C-band SAR images acquired by the Sentinel-1 satellites between 2017 and 2022, with a spatial resolution of around 10 m. Level-0, Level-1, and Level-2 data products are available to users; in this paper, the GRD product acquired in IW mode was used. The images were cut into 256 × 256 pixel blocks, each block was flipped, and the images were additionally rotated by 90°, 180°, and 270°. The samples were divided in a ratio of 3:1:1, yielding 1260 training samples, 420 validation samples, and 419 test samples.
To ensure the accuracy of the SAR sample labels, the urban change detection samples used in this paper were selected by comparing preprocessed high-resolution optical images with the SAR images, with a registration error of less than 5 pixels; this ensures that the labels indeed correspond to changes in urban buildings and that the delineation of the change extent is accurate. A specific comparison between the optical and SAR images is shown in Figure 7.
Our experiments were implemented in the PyTorch framework and trained, validated, and tested on an NVIDIA GeForce RTX 3060 Ti GPU. The model was optimized with the cross-entropy loss function and the Adam optimizer with a learning rate of 1 × 10⁻⁴, a weight decay of 5 × 10⁻⁴, and a momentum of 0.9. The running time depended on the number of samples: training on a set of 1000 samples took about 8 h, and testing a set of 800 samples took about 15 min.
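A minimal sketch of this training configuration is shown below. In PyTorch's Adam the momentum-like term is the first beta coefficient, so the reported momentum of 0.9 is mapped to betas=(0.9, 0.999) here; the model, data loader, and epoch count are placeholders.

```python
import torch
import torch.nn as nn

# model, train_loader: placeholders for TransUNet++SAR and the SAR sample loader
def train(model, train_loader, device="cuda", epochs=100):
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()                       # cross-entropy loss
    optimizer = torch.optim.Adam(model.parameters(),
                                 lr=1e-4,                   # learning rate 1e-4
                                 betas=(0.9, 0.999),        # 0.9 ~ reported momentum
                                 weight_decay=5e-4)         # weight decay 5e-4
    for epoch in range(epochs):
        for t1, t2, label in train_loader:                  # bitemporal tiles + label
            t1, t2, label = t1.to(device), t2.to(device), label.to(device)
            optimizer.zero_grad()
            loss = criterion(model(t1, t2), label)
            loss.backward()
            optimizer.step()
```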
4.2. Evaluation Index
For model evaluation, precision (P), recall (R), intersection over union (IoU), and the F1-score were used. These are calculated as follows:
P = TP / (TP + FP),
R = TP / (TP + FN),
IoU = TP / (TP + FP + FN),
F1 = 2 × P × R / (P + R).
TP is the number of actual positive pixels predicted as positive, FN is the number of actual positive pixels predicted as negative, FP is the number of actual negative pixels predicted as positive, and TN is the number of actual negative pixels predicted as negative. IoU and the F1-score are comprehensive evaluation indices and better reflect the generalization ability of the model.
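A short sketch of these metrics computed from a predicted binary change map and the ground truth is given below (binary masks are assumed, with 1 denoting the changed class).

```python
import numpy as np

def change_detection_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Precision, recall, IoU, and F1 for binary change maps (1 = changed)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    p = tp / (tp + fp + 1e-12)
    r = tp / (tp + fn + 1e-12)
    return {"precision": p,
            "recall": r,
            "iou": tp / (tp + fp + fn + 1e-12),
            "f1": 2 * p * r / (p + r + 1e-12)}
```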
4.3. Comparative Methods
To verify the effectiveness of our model, we selected two unsupervised SAR image change detection algorithms, GaborPCANet and CWNN, and two supervised change detection algorithms originally developed for optical images, DDCNN and BIT_CD.
GaborPCANet [37]: This method uses PCA filters and exploits the representative neighborhood features of each pixel. It is therefore relatively robust to speckle noise and produces less noisy change maps.
CWNN [38]: A change detection method based on a convolutional wavelet neural network. It mainly addresses the problems of noise and limited samples: the dual-tree complex wavelet transform is used to reduce the influence of speckle noise, and a virtual sample generation scheme is used to alleviate the shortage of training samples.
DDCNN [16]: A supervised CNN-based change detection method originally designed for optical remote sensing images, used here as a deep learning baseline.
BIT_CD [24]: A bitemporal image transformer (BIT) for efficient context modeling in the spatio-temporal domain. The bitemporal images are represented by multiple tokens, and a transformer encoder is used for context modeling. The learned context-rich tokens are then projected back to pixel space and passed through a transformer decoder to refine the original pixel-level features.
4.4. Dataset Evaluation
We used the SAR change detection datasets from Beijing, Qingdao, and Guangzhou to verify the efficacy of the TransUNet++SAR algorithm. The evaluation results on the validation sets are given in Table 1, Table 2 and Table 3. Compared with the other change detection algorithms, TransUNet++SAR achieved improvements in the F1-score and IoU.
- (1)
Beijing Datasets
The Beijing built-up area dataset contains multiple scenes, and three typical scenes were selected for visualization. Figure 8 shows the change detection results for the three building groups. According to the optical image, the first row of Figure 8 is the local change map of a large building. Comparing (a) with (b), it can be seen that after the building was demolished, the gray value decreased significantly, and the change was accurately detected by the SAR model. The second row of Figure 8 shows the local change of a square: in (a), the square is under construction, and in (b) it has been completed; the gray value of the built-up part increased significantly, and the change was also completely outlined by our SAR model. The third row of Figure 8 is the local change map of a small building group. After part of the building group was removed, the corresponding gray value decreased, and our model detected these changes. Some other changes also appear in the third row of Figure 8, but they were not detected because they were not building changes. For changes of large buildings, the detection results of the optical-image algorithms were clearly better than those of the SAR algorithms; the SAR image algorithms were more susceptible to noise and false changes. The performance of TransUNet++SAR was close to that of BIT_CD, but the boundaries detected by TransUNet++SAR were closer to the boundaries of the ground-truth labels. As shown in Figure 8, TransUNet++SAR achieved high detection accuracy for large-area building changes.
- (2)
Guangzhou Datasets
Compared with Beijing, the building density in Guangzhou is lower, the building layout is less neat and regular, and large buildings change less than in the Beijing area. Limited by data availability, only 263 SAR images of Guangzhou were collected, which were rotated and mirrored for training. The smaller number of training samples and the smaller scale of the changing buildings led to a slightly lower accuracy than on the Beijing dataset. Although the detected boundaries were slightly smaller than the ground-truth boundaries, both TransUNet++SAR and BIT_CD could roughly detect the changes. As shown in Figure 9, when the buildings in the last image of the Guangzhou dataset changed, the gray values of the SAR image changed only slightly, yet our method still performed better than the compared methods. The change outlines produced by TransUNet++SAR were smoother, and while BIT_CD was affected by spurious changes, TransUNet++SAR largely avoided them. Therefore, TransUNet++SAR remained superior to the other algorithms in this case.
- (3)
Qingdao Datasets
We conducted experiments on the Qingdao dataset to further evaluate the effectiveness of the proposed method. As shown in Table 3, TransUNet++SAR achieved an F1-score of 96.51% and an IoU of 90.61%, outperforming the compared methods. We selected the results for some large structures, such as building groups and newly built-up areas, for visualization, as shown in Figure 10. The figure shows that TransUNet++SAR detected the changes in the SAR images more accurately.
4.5. Computational Complexity
Table 4 shows the computational complexity of each model: "Params" denotes the number of model parameters, a measure of space complexity (the storage required by the model), and "FLOPs" denotes the number of floating-point operations, a measure of time complexity (a proxy for running time). As can be seen from the table, the two unsupervised methods, GaborPCANet and CWNN, had the smallest time and space complexity, but they also had the lowest accuracy. Among the supervised methods, DDCNN had a lower computational complexity, but its accuracy was far inferior to that of the other two supervised methods. Although the computational complexity of BIT_CD was smaller than that of our method, its accuracy was lower; on the Beijing dataset in particular, its P, R, F1, and IoU were 1.16%, 4.43%, 4.38%, and 9.79% lower, respectively, than those of our method. A minimal sketch of how these two quantities can be measured is shown below.
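As an illustration of how "Params" and "FLOPs" are typically obtained for PyTorch models, the sketch below counts trainable parameters directly and uses the third-party thop package for FLOPs; the use of thop and the dummy input size are assumptions, not necessarily the tooling used for Table 4.

```python
import torch
from thop import profile   # third-party package: pip install thop

def complexity(model, input_shape=(1, 1, 256, 256)):
    """Report parameter count (space complexity) and FLOPs (time complexity)."""
    params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    dummy1 = torch.randn(*input_shape)
    dummy2 = torch.randn(*input_shape)              # bitemporal input pair
    flops, _ = profile(model, inputs=(dummy1, dummy2), verbose=False)
    return {"Params (M)": params / 1e6, "FLOPs (G)": flops / 1e9}
```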
5. Discussion
We compared the accuracy indices of our model with those of the other four models. On the Beijing, Guangzhou, and Qingdao datasets, the proposed model achieved the best precision, recall, intersection over union, and F1-score. On the Beijing dataset, the precision, recall, F1-score, and IoU reached 98.76%, 96.00%, 97.36%, and 96.67%, respectively, exceeding the baseline method GaborPCANet by 44.08%, 57.70%, 42.26%, and 58.65%, and the most similar model (BIT_CD) by 1.16%, 4.43%, 4.38%, and 9.19%, respectively. On the Guangzhou dataset, although the accuracy indices of our method decreased, the model still achieved accurate results with a good trade-off between recall and precision. On the Qingdao dataset, our model remained at least 1% more accurate than the other models. Comparing the quantitative results across the different datasets, we conclude that the proposed model produces clear change boundaries, is less susceptible to noise than the traditional methods, and achieves the highest change detection accuracy.
6. Conclusions
In this paper, we proposed a network architecture for change detection in bitemporal SAR images (TransUNet++SAR). The model incorporates the Visual Transformer module into the network architecture as part of the decoder to compensate for the shortcomings of CNNs in learning global context and long-range dependencies. This structure, combined with the UNet++ encoder, enhances the extraction of underlying structural features, captures fine details, and improves the delineation of edge information in the changed regions.
In addition, a series operation, rather than a parallel operation, was used at the input of the CNN network: the bitemporal inputs are processed one by one, the features of corresponding layers are connected, and the result is fed into the encoder structure. Compared with traditional unsupervised SAR change detection, this model reduces the influence of image noise on the CNN feature extraction and improves the overall feature extraction ability. As a result, our model can avoid the influence of noise and spurious changes, which is also a major advantage of supervised SAR image change detection over unsupervised classification.
The experimental results show that our network structure is effective for SAR datasets: TransUNet++SAR achieved consistently good results on SAR image datasets with different building distribution patterns, namely the Beijing, Qingdao, and Guangzhou datasets.
In future work, we will continue to study the adaptability of our model to higher-resolution SAR images such as Gaofen-3 imagery and will adjust the model structure according to the image characteristics, so that the model can extract more information and become applicable to a wider range of SAR images.