Article

JointNet4BCD: A Semi-Supervised Joint Learning Neural Network with Decision Fusion for Building Change Detection

by Hao Chen, Chengzhe Sun *, Jun Li and Chun Du

Department of Cognitive Communication, College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China

* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(23), 4569; https://doi.org/10.3390/rs16234569
Submission received: 31 October 2024 / Revised: 27 November 2024 / Accepted: 3 December 2024 / Published: 5 December 2024

Abstract:
Remote sensing image building change detection aims to identify building changes that occur in remote sensing images of the same areas acquired at different times. In recent years, the development of deep learning has led to significant advancements in building change detection methods. However, these fully supervised methods require a large number of bi-temporal remote sensing images with pixel-wise change detection labels to train the model, which requires substantial time and manpower for annotation. To address this issue, this study proposes a novel single-temporal semi-supervised joint learning framework for building change detection, called JointNet4BCD. Firstly, to reduce annotation costs, we design a semi-supervised learning scheme that trains our model using a small number of building extraction labels instead of a large number of building change detection labels. Furthermore, to improve the semantic understanding capability of the model, we propose a joint learning approach for the building extraction and change detection tasks. Lastly, a decision fusion block is designed to fuse the building extraction results into the building change detection results to further improve the accuracy of building change detection. Experimental results on two widely used datasets demonstrate that the proposed JointNet4BCD achieves excellent building change detection performance while reducing the need for labels from thousands to dozens. Using only ten labeled images, JointNet4BCD achieved F1-Scores of 83.93% and 83.45% on the LEVIR2000 and WHU datasets, respectively.

1. Introduction

Remote sensing (RS) image change detection (CD) is a significant and challenging fundamental task aimed at identifying semantic changes within the same region across multi-temporal RS images [1,2]. This process typically involves taking a pair of bi-temporal RS images as input and producing pixel-level CD results as output. Presently, remote sensing building change detection (BCD) finds extensive application in areas such as land resource management [3], smart city development [4], and natural disaster evaluation [5].
With the emergence of RS big data and the development of deep learning (DL), many deep neural network-based methods for RS image CD have been proposed and have achieved highly accurate results. However, most of these methods are fully supervised, relying on a large number of high-quality pixel-wise labels for effective model training. Due to the differences in data distribution caused by variations in sensors and regions, a change detection network trained on one dataset may not generalize well to others. Creating new datasets to train change detection models is therefore often unavoidable. Unfortunately, manually annotating pixel-level change detection labels for a large number of bi-temporal RS images is both labor-intensive and time-consuming. Creating these labels involves visually comparing two remote sensing images containing numerous objects to identify those that have changed; annotators then use specialized tools to draw change maps that mark the altered objects. Fully supervised learning methods require extensive training data with change detection labels, leading to significant human and time costs in creating change detection datasets. As a result, supervised learning-based change detection models usually suffer from either insufficient training data or costly manual labeling, limiting their real-world applications. To tackle this issue, methods based on weakly supervised learning [6,7,8], unsupervised learning [9,10,11] and semi-supervised learning [12,13,14] have been developed.
Unsupervised methods [15] can achieve CD without requiring any labeled data; however, the accuracy achieved is often insufficient to support practical applications at present. Weakly supervised methods aim to substitute precise pixel-level labels with coarser labels such as bounding boxes [16], image-level labels [17], or scribbles [18]. These techniques can reduce labeling effort compared to fully supervised ones. However, generating and using coarse labels that effectively represent fine-grained changes between bi-temporal RS images remains challenging, particularly for high-resolution RS images. The third category is semi-supervised learning, which leverages a limited set of labeled images alongside a substantial number of unlabeled images to train the change detection model. The labeled images provide the model with essential guidance, while the use of unlabeled images helps to improve its robustness and generalization capabilities. Semi-supervised methods can be implemented through various techniques, including generative adversarial networks (GANs) [19], autoencoders [20], and graph neural networks [21]. Recent studies [22,23] have shown these approaches to be effective for change detection tasks.
In addition to the aforementioned three methods, a single-temporal supervised CD method known as ChangeStar [24] has been introduced. This method primarily targets the building change detection task and leverages an annotated building extraction dataset to create a pseudo-bi-temporal image change detection dataset for training purposes. Specifically, during the training of the CD model, the inputs to the network are two distinct single-temporal images instead of a pair of bi-temporal RS images. The change detection labels are generated by applying an exclusive-or (XOR) operation on the building extraction labels associated with these single-temporal images. This method bypasses the manual labeling of pixel-level change masks for bi-temporal RS images. Combining this method with a semi-supervised learning approach can achieve highly accurate building change detection results with even fewer building extraction labels, which has been shown to be a promising approach [14].
However, the potential of current single-temporal semi-supervised change detection methods has not been fully exploited. On the one hand, since the building change detection labels are generated from the corresponding building extraction labels, high-quality pseudo-building extraction labels are essential for training a single-temporal building change detection model in a semi-supervised manner. On the other hand, the building change detection results can also be regarded as a potential supervisory signal for building extraction model training. We argue that the building change detection task and the building extraction task can mutually enhance each other. Therefore, JointNet4BCD is proposed, aiming to improve the building change detection accuracy of single-temporal semi-supervised methods through the joint learning of change detection and building extraction, along with the incorporation of a decision fusion block. JointNet4BCD integrates joint learning, semi-supervised learning, and single-temporal supervised change detection to explore the change detection performance achievable with an extremely limited amount of single-temporal labeled data. This approach further harnesses the potential of large amounts of unlabeled data in the change detection task. Additionally, the introduction of the decision fusion block provides a valuable reference for enhancing the performance of change detection based on joint learning.
The main contributions of this paper are summarized as follows:
(1)
To unleash the potential of single-temporal change detection methods, this paper proposes a single-temporal semi-supervised joint learning method called JointNet4BCD. By jointly training a semi-supervised building extraction task and a semi-supervised building change detection task, the proposed model enhances its ability to extract and represent pivotal features, which is effective when labels are scarce in the semi-supervised setting.
(2)
To further enhance the effectiveness of the proposed semi-supervised joint learning method, we design a decision fusion block. The accuracy of the building change detection results is increased by fusing the building extraction results and the change detection results at the decision level, while consistency regularization boosts the robustness of the decision fusion effect.
(3)
Comprehensive experiments conducted on two widely used high-resolution RS datasets show that our model achieves notable improvements in the accuracy of building change detection and outperforms multiple state-of-the-art methods.
The rest of the paper is organized as follows. Section 2 reviews related work on change detection methods, Section 3 details our proposed method, Section 4 describes the experimental design and presents the results with their analysis, Section 5 discusses the advantages and limitations of the proposed method and outlines future work, and Section 6 concludes the paper.

2. Related Work

This section will briefly describe traditional change detection methods, deep learning-based change detection methods and semi-supervised CD methods.

2.1. Traditional Change Detection Methods

Early change detection was realized by algebraic computations between image bands [25], including operations such as band differences, band ratios, band regressions, and spectral angles. As research evolved, methods focusing on feature extraction and statistical analysis of data through image transformations became prevalent in change detection. Such methods include PCA regression, PCA interpolation, and joint PCA [26]. However, these methods mainly focus on pixel-level analysis and do not incorporate spatial information. As a result, the generated change detection maps often do not match the boundaries and details of the actual objects. The advent of high-resolution remote sensing imagery further exacerbated these issues.
To deal with complex spectral and geometric features as well as the richer details in high-resolution remote sensing images, object-based approaches have been developed. These approaches divide the image into many homogeneous regions, using objects as the basic processing units instead of pixels. This allows the use of object-level spatial contextual information such as shape, texture, and spatial relationships with adjacent objects. Object-based approaches can be further categorized into four classifications [27]: (1) object-based change detection, which achieves change detection by directly comparing segmented objects; (2) classification-based change detection, which achieves change detection by directly comparing the classified segmented objects from two images, relying on the accuracy of the classification for detection precision; (3) multi-temporal object-based change detection, which first performs unified segmentation on multi-temporal images and then achieves change detection by calculating the feature similarity between segmented objects; (4) hybrid-based change detection, which overcomes the limitations of single methods by integrating various pieces of spatial information using different approaches.
However, with the continuous improvement in remote sensing image quality, the accuracy and reconstruction capability of traditional methods often fall short of meeting the requirements for practical applications, especially when dealing with high-resolution remote sensing images for change detection.

2.2. Deep Learning-Based Change Detection Methods

Over the years, as deep learning (DL) has evolved, numerous DL-based RS image processing methods [2] have been developed and applied in diverse areas, such as RS image registration [28], road extraction [29,30], online map generation [31,32], and image semantic segmentation [33,34].
Owing to their strong feature extraction and representation capabilities, DL-based methods have seen widespread application in change detection tasks in recent years. The most commonly used architectures for remote sensing change detection are UNet [35,36] and its extensions [37,38,39]. Tang et al. [40] proposed a U-shaped Siamese transformer for remote sensing image change detection; the network integrates UNet, SiameseNet, the Swin Transformer, and a feature fusion module, and can effectively fuse global and local bi-temporal image features. Shao et al. [41] adapted a Siamese UNet for both satellite and UAV images. Fang et al. [42] developed SNUNet, integrating concatenated and nested UNets to significantly improve change detection accuracy.
The use of attention mechanisms to enhance change detection models has become a prominent research topic. Zhang et al. [43] integrated spatial and channel attention mechanisms into the UNet decoder, allowing the network to weight feature channels and focus on spatial dimensions more effectively. Chen et al. [44] proposed a spatio-temporal attention method to capture correlations between pixels across different locations and times, improving robustness to lighting and misregistration. Li et al. [45] proposed a spatial–temporal attention with a difference enhancement-based network to extract important features from the difference map to detect changed regions. Sun et al. [38] incorporated the graph attention mechanism (GAT) [46] before the UNet decoder, using encoder-extracted features as nodes and their dependencies for feature fusion, enhancing long-range feature extraction and integration. In recent years, the rapid development of Transformers [47,48,49] has facilitated significant applications of self-attention and cross-attention in change detection tasks. Chen et al. [50] introduced a bitemporal image transformer (BIT) to model context information. Li et al. [51] proposed TransUNetCD, which leverages the strengths of both transformers and U-Net for change detection. Zhou et al. [52] used self-attention to model the contextual–semantic relationships between input bitemporal images. Noman et al. [53] developed a Transformer-based change detection method trained from scratch, utilizing a shuffled sparse-attention operation that focuses on selected sparse informative regions to capture the inherent characteristics of the CD data.
Zheng et al. [24] proposed a single-temporal supervised change detection method, which is a new supervised approach for building change detection. Unlike traditional supervised remote sensing change detection methods, which require bi-temporal images and their corresponding CD labels, this method utilizes single-temporal remote sensing images and their building extraction labels. Pseudo-bi-temporal RS images are generated by randomly combining images from the building extraction dataset, and CD labels are produced by applying XOR operations to the building extraction labels of these pairs. Subsequently, the pseudo-bi-temporal images and their pseudo-change detection labels are used to train the single-temporal supervised change detection model. The excellent test performance on the remote sensing change detection dataset proves the effectiveness of training CD models with single-temporal images.
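As a concrete illustration of this pairing scheme, the sketch below shows how a pseudo-bi-temporal pair and its XOR change label can be derived from a single-temporal building extraction dataset. It is a minimal sketch: the function and variable names are illustrative and not taken from the ChangeStar implementation, and the random shifting/flipping augmentations are omitted.

```python
import random
import numpy as np

def make_pseudo_pair(images, masks):
    """Build one pseudo-bi-temporal sample from a single-temporal building
    extraction dataset (illustrative sketch, not the ChangeStar code).

    images: list of H x W x C arrays (single-temporal RS images)
    masks:  list of H x W binary building masks aligned with `images`
    """
    # Randomly draw two different single-temporal images to act as "t1" and "t2".
    i, j = random.sample(range(len(images)), 2)
    x_t1, x_t2 = images[i], images[j]
    m_t1, m_t2 = masks[i].astype(bool), masks[j].astype(bool)

    # The pseudo change label marks pixels that are building in exactly one of
    # the two masks: an exclusive-or (XOR) of the building extraction labels.
    y_cd = np.logical_xor(m_t1, m_t2).astype(np.uint8)

    return (x_t1, x_t2), (m_t1.astype(np.uint8), m_t2.astype(np.uint8)), y_cd
```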

2.3. Semi-Supervised Change Detection Methods

In the domain of deep learning, having access to large volumes of data is essential for training effective neural networks. Despite the ease of data collection in the age of big data, annotating data is typically labor-intensive and time-consuming, particularly for pixel-level labeling. Semi-supervised learning (SSL) stands out as a promising approach to overcome this limitation, as it seeks to train models using only a small quantity of labeled data alongside a significant amount of unlabeled data [54].
Semi-supervised learning has gained increasing attention in RS image processing, particularly for change detection, as it helps alleviate the challenges of labeling variations between RS images, thus largely saving the time and labor required for labeling. Peng et al. [22] introduced a semi-supervised GAN-based change detection approach that employs a generator to forecast change detection outcomes for both labeled and unlabeled bi-temporal images, while utilizing two discriminators to enhance the consistency of these predictions. Zhao et al. [20] introduced a Siamese Variable Autoencoder (VAE) that integrates unsupervised feature learning with supervised fine-tuning for SAR image change detection. Sun et al. [23] developed a method based on consistent regularization and pseudo-labeling, where unlabeled images are first pseudo-labeled by a weak model and then strongly augmented to improve prediction consistency. Zhang et al. [55] proposed a feature prediction alignment (FPA) method that effectively utilizes unlabeled RS image pairs through class-aware feature alignment and pixelwise prediction alignment, reducing the need for labeled data. Shen et al. [56] introduced the Progressive Uncertainty-aware and Uncertainty-guided Framework (PUF), which decodes and quantifies aleatoric uncertainty in RS images. PUF includes Progressive Uncertainty-aware Learning (PUAL) and Uncertainty-guided Multi-view Learning (UML), using uncertainty values to generate distorted and mixed image pairs, guiding the model to learn more discriminative features from high-quality samples.

3. Methods

JointNet4BCD is a single-temporal semi-supervised change detection method that does not require any change detection labels. Through the use of a few single-temporal remote sensing images with building extraction labels and a large quantity of unlabeled bi-temporal images, JointNet4BCD can achieve high-precision building change detection. It consists of two parts, as shown in Figure 1. The first part is the single-temporal joint learning part, which uses a small number of single-temporal remote sensing images with building extraction labels to guide the model in simultaneously learning building extraction and change detection tasks. The second part is the unsupervised joint learning part, which uses a large number of unlabeled bi-temporal remote sensing images and trains the model based on the consistency regularization principle to improve the robustness and generalization of the model. The backbone network for both parts is the Nested UNet with dense channel attention [14]. Additionally, a decision fusion block is added at the end of the network to integrate the results of building extraction and change detection, thereby enhancing the change detection accuracy of the model.

3.1. Single-Temporal Joint Learning

The single-temporal joint learning component in Figure 1a uses a small number of building extraction labels to guide both the change detection task and the building extraction task, allowing the model to understand the tasks and extract task-relevant feature representations. Unlike traditional joint learning-based change detection methods that use both change detection labels and segmentation labels for training, the single-temporal joint learning component uses only the building extraction labels to train the change detection model and the building extraction model.
Specifically, JointNet4BCD first randomly pairs the single-temporal images $X_l$ into pseudo-bi-temporal remote sensing images $X_{l1}$ and $X_{l2}$ by sampling with replacement, random shifting, and flipping. The features of $X_{l1}$ and $X_{l2}$ are then extracted by two weight-shared encoders. On the one hand, the extracted features are fed into the building extraction decoder to obtain the building extraction results $\hat{Y}_{be1}$ and $\hat{Y}_{be2}$. On the other hand, the two feature maps are merged along the channel dimension and fed into the change detection decoder to obtain the change detection result $\hat{Y}_{cd}$. During this process, the encoder is shared between the change detection task and the building extraction task, so dual-task joint training improves the encoder's semantic understanding and feature representation ability.
Furthermore, to enhance the change detection efficacy by leveraging the information from the building extraction decoder, while avoiding the interference caused by disparate feature domains during feature fusion, JointNet4BCD adopts decision-level fusion: the outputs of the decoders are concatenated and passed to the decision fusion block, which produces the final change detection result $\hat{Y}_{final}$.
Note that in JointNet4BCD, the encoder and decoder adopt the nested UNet structure with a dense channel attention mechanism [14].
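For clarity, the following PyTorch-style sketch summarizes this forward flow with a generic weight-shared encoder and two decoders. It is a schematic illustration only: the module classes are placeholders (the actual backbone is the nested UNet just mentioned), and the multi-scale skip connections are omitted.

```python
import torch
import torch.nn as nn

class JointForward(nn.Module):
    """Schematic shared-encoder, dual-decoder forward pass (illustration only;
    the actual modules are the nested UNet with dense channel attention [14])."""

    def __init__(self, encoder: nn.Module, be_decoder: nn.Module,
                 cd_decoder: nn.Module, fusion: nn.Module):
        super().__init__()
        self.encoder = encoder        # weight-shared Siamese encoder
        self.be_decoder = be_decoder  # building extraction decoder
        self.cd_decoder = cd_decoder  # change detection decoder
        self.fusion = fusion          # decision fusion block (Section 3.4)

    def forward(self, x1, x2):
        f1 = self.encoder(x1)         # features of pseudo-temporal image 1
        f2 = self.encoder(x2)         # same encoder (shared weights) for image 2

        y_be1 = self.be_decoder(f1)   # building extraction result for image 1
        y_be2 = self.be_decoder(f2)   # building extraction result for image 2

        # Merge the two feature maps along the channel dimension for the CD decoder.
        y_cd = self.cd_decoder(torch.cat([f1, f2], dim=1))

        # Decision-level fusion of the three outputs yields the final CD map.
        y_final = self.fusion(y_be1, y_be2, y_cd)
        return y_be1, y_be2, y_cd, y_final
```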

3.2. Unsupervised Joint Learning

The unsupervised joint learning component in Figure 1b aims to improve the network’s robustness and generalization through the use of unlabeled bi-temporal images. It is based on consistency regularization: the inputs to the model are perturbed, and the model is forced to generate consistent segmentation results for the data before and after the perturbation. This mechanism improves the model’s robustness and makes it focus on task-relevant semantic information rather than irrelevant color and shape variations. The consistency regularization in JointNet4BCD targets both the building extraction and change detection tasks, which further enhances the model’s robustness and generalization capacity.
The specific flow is shown in Figure 1b, in which the solid lines denote the processing of the unlabeled images without perturbation and the dashed lines denote the processing of the perturbed images. First, the unlabeled remote sensing image pair $X_{u1}$ and $X_{u2}$ is fed into the model to obtain the building extraction maps $\hat{Y}_{u1}$ and $\hat{Y}_{u2}$ and the change detection result $\hat{Y}_{cd\_pl}$. Then, the perturbed image pair $X_{per1}$ and $X_{per2}$ is fed into the model to obtain the building extraction maps $\hat{Y}_{per1}$ and $\hat{Y}_{per2}$ and the change detection result $\hat{Y}_{cd\_per}$. Next, $\hat{Y}_{u1}$, $\hat{Y}_{u2}$, and $\hat{Y}_{cd\_pl}$ are passed through the decision fusion block to obtain $\hat{Y}_{final\_pl}$, and $\hat{Y}_{per1}$, $\hat{Y}_{per2}$, and $\hat{Y}_{cd\_per}$ are likewise fused to obtain $\hat{Y}_{final\_per}$. Finally, the model is trained to make $\hat{Y}_{per1}$ close to $\hat{Y}_{u1}$, $\hat{Y}_{per2}$ close to $\hat{Y}_{u2}$, $\hat{Y}_{cd\_per}$ close to $\hat{Y}_{cd\_pl}$, and $\hat{Y}_{final\_per}$ close to $\hat{Y}_{final\_pl}$. The data perturbations include random color perturbations and random deformations, as described in [23].
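The sketch below outlines one consistency-regularization step on an unlabeled pair, assuming the model returns the four two-class logit maps of the sketch in Section 3.1; `perturb` stands in for the random color jitter and deformation of [23], and `criterion` for the combined loss of Section 3.5. It is illustrative rather than the exact training loop.

```python
import torch

def unsupervised_step(model, x_u1, x_u2, perturb, criterion):
    """One consistency-regularization step on an unlabeled image pair
    (sketch; `perturb` stands for the random color jitter and deformation
    of [23], `criterion` for the combined CE + Dice loss of Section 3.5)."""

    # Predictions on the clean pair serve as pseudo-targets (no gradient).
    with torch.no_grad():
        y_u1, y_u2, y_cd_pl, y_final_pl = model(x_u1, x_u2)
        # Hard pseudo-labels, assuming two-class logits as outputs.
        t_u1, t_u2 = y_u1.argmax(1), y_u2.argmax(1)
        t_cd, t_final = y_cd_pl.argmax(1), y_final_pl.argmax(1)

    # Predictions on the perturbed pair must stay consistent with the clean ones.
    y_per1, y_per2, y_cd_per, y_final_per = model(perturb(x_u1), perturb(x_u2))

    loss_u = (criterion(y_per1, t_u1) + criterion(y_per2, t_u2)
              + criterion(y_cd_per, t_cd) + criterion(y_final_per, t_final))
    return loss_u
```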

3.3. Backbone Network

The backbone network for both parts is the Nested UNet with dense channel attention [14]. This network uses residual convolutional blocks as basic units, connecting various convolutional blocks through nested skip connections to extract and fuse features at different levels. Additionally, to mitigate the issue of information confusion during the fusion of features from different levels, the network employs channel attention mechanisms to process the concatenated features, further enhancing the network’s ability to integrate multi-level features.
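As an illustration of this mechanism, the following is a minimal squeeze-and-excitation-style channel attention module applied to concatenated multi-level features; it is a sketch of the general idea, and the exact dense channel attention design of [14] may differ.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention for reweighting
    concatenated multi-level features (illustrative; the exact dense channel
    attention module of [14] may differ)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                         # x: (N, C, H, W) concatenated features
        w = x.mean(dim=(2, 3))                    # global average pooling -> (N, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                              # reweight channels before further fusion
```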

3.4. Decision Fusion Block

The purpose of the decision fusion block is to use the model’s building extraction capability to enhance the accuracy of the change detection outcomes. Since the change detection decoder and the building extraction decoder are trained on the change detection task and the building extraction task, respectively, the features they extract are not the same, and direct fusion will lead to feature confusion. Therefore, we adopt the idea of decision-level fusion and construct a decision fusion block, as shown in Figure 2.
First, the building extraction results $\hat{Y}_{seg1}$ and $\hat{Y}_{seg2}$ and the change detection result $\hat{Y}_{cd}$ are concatenated along the channel dimension. The stacked maps are then fused using a channel attention mechanism followed by convolutional layers to obtain the final CD result $\hat{Y}_{final}$.
From Figure 1, it can be seen that the decision fusion block appears in both the single-temporal joint learning component and the unsupervised joint learning component, and the two blocks share weights. The consistency regularization in the unsupervised joint learning component further enhances the robustness of the decision fusion block.
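A minimal sketch of such a decision fusion block is given below, assuming two-class score maps as inputs; the layer widths are placeholders rather than the exact JointNet4BCD configuration, and `ChannelAttention` refers to the sketch in Section 3.3.

```python
import torch
import torch.nn as nn

class DecisionFusionBlock(nn.Module):
    """Decision-level fusion of the two building extraction maps and the CD map
    (sketch; layer widths are assumptions, not the exact JointNet4BCD design).
    `ChannelAttention` refers to the sketch in Section 3.3."""

    def __init__(self, classes: int = 2, hidden: int = 16):
        super().__init__()
        in_ch = 3 * classes                      # Y_seg1, Y_seg2 and Y_cd stacked
        self.attn = ChannelAttention(in_ch)
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, classes, kernel_size=1),
        )

    def forward(self, y_seg1, y_seg2, y_cd):
        x = torch.cat([y_seg1, y_seg2, y_cd], dim=1)  # concatenate along channels
        x = self.attn(x)                              # attention over the stacked decisions
        return self.head(x)                           # final change detection logits
```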

3.5. Loss Calculation

Since the CD task involves a significantly larger number of unchanged pixels than changed pixels, we adopt a combination of the cross-entropy loss $L_{ce}$ and the Dice loss $L_{dice}$ as the loss function.
Denote the label as $Y$ and the change detection result as $\hat{Y} = \{\hat{y}_k\},\ k = 1, 2, \ldots, H \times W$, where $H$ represents the height and $W$ the width of the image, and $\hat{y}_k$ is the value of the $k$-th pixel. Let $c$ be 0 or 1, signifying whether there is a change in the $k$-th pixel according to the label. The Dice loss and the cross-entropy loss are calculated as follows:
$$L_{dice} = 1 - \frac{2 \cdot Y \cdot \mathrm{softmax}(\hat{Y})}{Y + \mathrm{softmax}(\hat{Y})}$$
$$L_{ce} = -\frac{1}{H \times W} \sum_{k=1}^{H \times W} \log \frac{\exp(\hat{y}_k^c)}{\sum_{l=0}^{1} \exp(\hat{y}_k^l)}$$
Therefore, the loss function used in the JointNet4BCD can be calculated as follows:
$$L = L_{ce} + L_{dice}$$
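A minimal PyTorch sketch of this combined loss, assuming two-class logits and binary integer labels (the Dice term is computed on the foreground class), is as follows:

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, eps: float = 1e-6):
    """Cross-entropy + Dice loss (sketch; assumes `logits` of shape (N, 2, H, W)
    and integer `target` of shape (N, H, W) with values in {0, 1})."""
    ce = F.cross_entropy(logits, target)

    # Dice term computed on the foreground ("changed"/"building") class.
    prob = torch.softmax(logits, dim=1)[:, 1]
    tgt = target.float()
    inter = (prob * tgt).sum(dim=(1, 2))
    union = prob.sum(dim=(1, 2)) + tgt.sum(dim=(1, 2))
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)

    return ce + dice.mean()
```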
In JointNet4BCD, the single-temporal joint learning loss and the unsupervised joint learning loss must be calculated. The single-temporal joint learning loss aims to guide the model to understand the tasks with labels. Assume that the building extraction labels corresponding to the single-temporal images $X_{l1}$ and $X_{l2}$ are $Y_{be1}$ and $Y_{be2}$, respectively, and the pseudo-change detection label, denoted as $Y_{cd}$, is calculated through the exclusive-or (XOR) operation of $Y_{be1}$ and $Y_{be2}$. The single-temporal joint learning loss consists of four parts: (1) the loss $L(\hat{Y}_{be1}, Y_{be1})$ between the building extraction result $\hat{Y}_{be1}$ and the label $Y_{be1}$; (2) the loss $L(\hat{Y}_{be2}, Y_{be2})$ between the building extraction result $\hat{Y}_{be2}$ and the label $Y_{be2}$; (3) the loss $L(\hat{Y}_{cd}, Y_{cd})$ between the CD result $\hat{Y}_{cd}$ and the pseudo-CD label $Y_{cd}$; (4) the loss $L(\hat{Y}_{final}, Y_{cd})$ between the CD result $\hat{Y}_{final}$ and the pseudo-CD label $Y_{cd}$.
The single-temporal loss is calculated as follows:
$$L_s = L(\hat{Y}_{be1}, Y_{be1}) + L(\hat{Y}_{be2}, Y_{be2}) + L(\hat{Y}_{cd}, Y_{cd}) + L(\hat{Y}_{final}, Y_{cd})$$
The unsupervised joint learning loss aims to improve the model’s robustness and generalization. It also consists of four parts: (1) the loss $L(\hat{Y}_{per1}, \hat{Y}_{u1})$ between the building extraction result $\hat{Y}_{per1}$ and the pseudo-label $\hat{Y}_{u1}$; (2) the loss $L(\hat{Y}_{per2}, \hat{Y}_{u2})$ between the building extraction result $\hat{Y}_{per2}$ and the pseudo-label $\hat{Y}_{u2}$; (3) the loss $L(\hat{Y}_{cd\_per}, \hat{Y}_{cd\_pl})$ between the change detection result $\hat{Y}_{cd\_per}$ and the pseudo-label $\hat{Y}_{cd\_pl}$; (4) the loss $L(\hat{Y}_{final\_per}, \hat{Y}_{final\_pl})$ between the final change detection result $\hat{Y}_{final\_per}$ and the pseudo-label $\hat{Y}_{final\_pl}$. The unsupervised joint learning loss is calculated as follows:
$$L_u = L(\hat{Y}_{cd\_per}, \hat{Y}_{cd\_pl}) + L(\hat{Y}_{final\_per}, \hat{Y}_{final\_pl}) + L(\hat{Y}_{per1}, \hat{Y}_{u1}) + L(\hat{Y}_{per2}, \hat{Y}_{u2})$$
Finally, the total loss of JointNet4BCD is obtained by the weighted summation of $L_s$ and $L_u$, which is calculated as follows:
$$L_{total} = L_s + \lambda L_u$$
Here, $\lambda$ is a weighting parameter for the loss $L_u$.
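Under the same assumptions, the supervised and total losses can be assembled as in the short sketch below; `criterion` denotes the combined CE + Dice loss above, and the outputs follow the notation of Section 3.1.

```python
def supervised_loss(outputs, y_be1, y_be2, y_cd, criterion):
    """Single-temporal joint learning loss L_s (sketch; `criterion` is the
    combined CE + Dice loss, `outputs` are the four model outputs of Section 3.1)."""
    p_be1, p_be2, p_cd, p_final = outputs
    return (criterion(p_be1, y_be1) + criterion(p_be2, y_be2)
            + criterion(p_cd, y_cd) + criterion(p_final, y_cd))


def total_loss(loss_s, loss_u, lam: float = 8.0):
    """L_total = L_s + lambda * L_u; lambda = 8 follows the ablation in [23]."""
    return loss_s + lam * loss_u
```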

4. Experiments

4.1. Datasets

To validate the effectiveness of this method, we chose two commonly used RS image CD datasets: the LEVIR dataset and the WHU dataset. Both are VHR datasets that include a substantial number of images.
  • The LEVIR dataset, accessible at https://justchenhao.github.io/LEVIR/ (last accessed on 26 October 2022), is a comprehensive remote sensing dataset designed for building change detection (BCD). It consists of 637 very-high-resolution (VHR) images from Google Earth, each with a size of 1024 × 1024 pixels and a resolution of 0.5 m per pixel. The dataset includes bi-temporal images captured across 20 different locations in Texas, USA, between the years 2002 and 2018. It focuses on detecting changes in land use, especially in buildings, covering a diverse array of structures such as high-rise apartments, small garages, cottage homes, and large warehouses. The images present many pseudo-changes due to seasonal influences, and the buildings exhibit varied geometric features, making the dataset particularly challenging. In order to cope with GPU memory limitations, the image pairs were divided into sections of 256 × 256 pixels and then randomly distributed into training and test sets. The training set contains 7120 images, whereas the test set includes 1024 images. Recognizing that 2000 images from the LEVIR dataset are adequate for training a high-precision change detection model and to reduce training duration, we created the LEVIR2000 dataset, which is composed of 2000 randomly selected images from the original LEVIR dataset.
  • The WHU dataset (https://study.rsgis.whu.edu.cn/pages/download/building_dataset.html, accessed on 26 October 2022) consists of remote sensing imagery and change maps from the same region in Christchurch, New Zealand, captured in 2012 and 2016. Each image has a resolution of 0.075 m per pixel and dimensions of 32,507 × 15,345 pixels, with the primary focus being changes in buildings. To accommodate GPU memory constraints, we divided the images into segments of 256 × 256 pixels, creating a training set with 2000 image pairs and a test set with 996 image pairs. The WHU dataset is considered challenging due to the significant variation and the highly heterogeneous distribution of the changes in the buildings.

4.2. Evaluation Metrics

To evaluate the effectiveness of different change detection methods, we apply the F1-Score, Kappa coefficient (Kappa), and Intersection-over-Union (IoU). They are defined as follows:
$$F1 = \frac{2 \times P \times R}{P + R}$$
$$Kappa = \frac{OA - PRE}{1 - PRE}$$
$$IoU = \frac{TP}{TP + FP + FN}$$
Let TP represent the number of true positives, TN the number of true negatives, FN the number of false negatives, and FP the number of false positives. We can calculate P, R, OA, and PRE as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$PRE = \frac{(TP + FN) \times (TP + FP) + (TN + FP) \times (TN + FN)}{(TP + TN + FP + FN)^2}$$
$$OA = \frac{TP + TN}{TP + TN + FP + FN}$$
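These metrics follow directly from the confusion-matrix counts; the short function below is a direct transcription of the formulas above (a sketch with no guards against empty classes).

```python
def change_detection_metrics(tp: int, tn: int, fp: int, fn: int):
    """F1-Score, Kappa and IoU from binary confusion-matrix counts
    (a direct transcription of the formulas above; no zero-division guards)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)

    total = tp + tn + fp + fn
    oa = (tp + tn) / total
    pre = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (total ** 2)
    kappa = (oa - pre) / (1 - pre)

    iou = tp / (tp + fp + fn)
    return f1, kappa, iou
```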

4.3. Experimental Setting and Baselines

The proposed method is implemented in the PyTorch framework on a machine powered by an Intel Core i7-10700 CPU (2.9 GHz, 8 cores, 16 GB RAM) and two NVIDIA RTX 4090 GPUs with 24 GB of memory each. The AdamW optimizer is employed with a learning rate of 1 × 10−3 and a weight decay of 1 × 10−2. Training for the proposed semi-supervised method lasted 2000 epochs for both the LEVIR and WHU datasets. The batch size for the labeled data in these datasets is set to N, where N is the count of labeled single-temporal images. To balance training speed and GPU memory usage, the batch size for the unlabeled data is set to 10. According to the findings from the ablation studies in [23], the hyperparameter λ is configured to 8. In terms of the CD network, the number of convolution kernels for each convolution module is set to 8, 16, 32, 64, and 128, with the size of the convolution kernels standardized at 3 × 3.
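For reference, a minimal sketch of the reported optimizer configuration is shown below; the model construction itself is omitted and the helper name is illustrative.

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module) -> torch.optim.AdamW:
    """AdamW with the reported settings: learning rate 1e-3, weight decay 1e-2."""
    return torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```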
In order to demonstrate the effectiveness of our method, we compare the proposed approach with the following state-of-the-art fully supervised CD and semi-supervised CD methods:
  • KPCA-MNet [15]: KPCA-MNet is an unsupervised change detection method that utilizes a deep Siamese network composed of weight-shared kernel principal component analysis (KPCA) convolutional layers to extract high-level spatial–spectral feature maps. The change detection results are obtained through threshold segmentation.
  • SNUNet-ECAM [42]: A fully supervised method for remote sensing change detection, combining elements of a Siamese network with a nested UNet design. It utilizes an ensemble channel attention mechanism (ECAM) to effectively fuse the outputs from four nested UNets of varying depths.
  • BIT-CD [50]: As a fully supervised remote sensing change detection method that utilizes a transformer-based approach, BIT-CD introduces the bi-temporal image transformer (BIT) to model contexts efficiently and effectively within the spatial–temporal domain.
  • s4GAN [57]: By employing a feature matching loss, s4GAN, which is a semi-supervised semantic segmentation method, can minimize the gap between the segmentation maps predicted and the actual ones from semi-supervised data.
  • semiCDNet [22]: A semi-supervised change detection method that leverages GANs, using two discriminators to ensure better consistency in segmentation and entropy maps across both labeled and unlabeled data.
  • SemiSANet [23]: As a semi-supervised method for change detection, SemiSANet utilizes consistency regularization to attain high accuracy in CD results, particularly in situations where labels are limited.
  • FPA [55]: FPA introduces a novel semisupervised change detection framework that effectively utilizes unlabeled remote sensing image pairs through class-aware feature alignment and pixelwise prediction alignment, achieving state-of-the-art performance across multiple benchmark datasets.
  • PUF [56]: PUF is a progressive uncertainty-aware and uncertainty-guided framework, enhancing semi-supervised change detection performance by decoding and quantifying aleatoric uncertainty in remote sensing images.
  • ChangeStar [24]: ChangeStar represents the first fully supervised approach that relies entirely on single-temporal remote sensing images paired with building extraction labels to train a robust change detection model.
To ensure a fair comparison among the diverse input data utilized by these methods, we have established certain guidelines. For SNUNet-ECAM, a fully supervised method, and the semi-supervised methods semiCDNet, s4GAN, and SemiSANet, we use the same number of labeled bi-temporal image pairs as the number of labeled single-temporal images used by JointNet4BCD. For example, if JointNet4BCD uses 20 labeled single-temporal images, these methods use 20 pairs of labeled bi-temporal images. Note that JointNet4BCD uses building extraction labels, while the other methods use CD labels. As indicated in the original paper [24], ChangeStar relies on the xView2 building extraction dataset. Furthermore, since KPCA-MNet is an unsupervised method, it is trained without labels.

4.4. Comparison Experiments

4.4.1. Prediction on LEVIR2000 Dataset

On the LEVIR2000 dataset, experiments were carried out to compare the proposed method with the baseline methods. Moreover, to gauge the time taken by each method, we recorded the training time for a single epoch and referred to this as “TIME”.
Under three evaluation metrics, the quantitative results were calculated and summarized for different quantities of labeled data, as displayed in Table 1.
The experimental results show that the change detection accuracy of JointNet4BCD significantly outperforms the comparison algorithms. KPCA-MNet, as an unsupervised approach, is ineffective for building change detection, achieving an F1-Score of just 9.72%. This is because KPCA-MNet does not target semantic change detection, i.e., it does not consider whether a changed object is a building when performing change detection. The CNN-based fully supervised algorithm SNUNet-ECAM and the Transformer-based fully supervised algorithm BIT-CD perform poorly with only 10, 15, 20, or 30 pairs of change detection labels, with F1-Scores ranging from 15% to 35%. This is because fully supervised algorithms require a large number of labels, making it difficult to train a highly accurate change detection model when labels are scarce. s4GAN and semiCDNet are GAN-based semi-supervised algorithms. With only 10 pairs of change detection labels, the F1-Score of s4GAN is 32.95% and that of semiCDNet is 38.42%. When the number of change detection labels reaches 15 pairs, the results of both s4GAN and semiCDNet improve significantly, with F1-Scores above 47%. With 30 pairs of labels, s4GAN reaches an F1-Score of 54.30% and semiCDNet reaches 60.49%. These results show that the two GAN-based semi-supervised algorithms are sensitive to the number of labels and perform poorly when labels are particularly scarce. The consistency regularization-based semi-supervised algorithm SemiSANet achieves an F1-Score of 70.02% with 10 pairs of change detection labels, outperforming the previous semi-supervised algorithms. However, as the number of labels increases from 10 to 30 pairs, SemiSANet’s accuracy improves little, with the F1-Score reaching 73%. The labels used by the single-temporal methods ChangeStar and JointNet4BCD are both building extraction labels. As for the uncertainty-aware framework PUF, when the number of labeled data pairs is fewer than 20, it shows moderate performance with an F1-Score below 65%; when the number of labeled pairs increases to 30, its F1-Score improves substantially to 80.30%, making it the second-best method among those evaluated. The single-temporal fully supervised approach ChangeStar is trained on the entire xView2 building extraction dataset and tested on the LEVIR2000 dataset, obtaining an F1-Score of 72.07%. JointNet4BCD, a single-temporal semi-supervised algorithm, achieves an F1-Score of 83.93% with just 10 building extraction labels, exceeding the second-best algorithm by 11.86%. The change detection accuracy of JointNet4BCD improves gradually, though modestly, as the number of labels increases. With 30 labels, the F1-Score of JointNet4BCD reaches 84.93%, Kappa reaches 84.28%, and IoU reaches 73.80%, exceeding the second-best algorithm by 11.93%, 12.24%, and 16.31%, respectively. The experimental results indicate that JointNet4BCD is the most effective among the compared algorithms. Even with a very small number of labels, JointNet4BCD can still produce highly accurate change detection results.
On the one hand, this is because JointNet4BCD uses a small amount of single-temporal data for random pairing to generate a large volume of pseudo-bi-temporal data, which guides the model in understanding the CD task. On the other hand, the combination of semi-supervised learning and joint learning greatly improves the encoder’s ability to extract features. Meanwhile, the incorporation of the decision fusion block allows for the integration of decision information from both the building extraction decoder and the change detection decoder to generate the final change detection results, thereby further enhancing the model’s change detection performance.
In order to visualize the performance of each model more clearly, the change detection result images are illustrated in Figure 3. From the visualization results, it is evident that our method demonstrates significant advantages both in terms of detection accuracy and the detail captured in the change maps.

4.4.2. Prediction on WHU Dataset

Comparative experiments on the WHU dataset were conducted between the proposed method and the baseline methods. The quantitative results, which were calculated and summarized based on different numbers of labeled data, are displayed in Table 2.
The experimental results on the WHU dataset show that JointNet4BCD significantly outperforms the comparison methods in change detection accuracy. The F1-Score of the unsupervised approach KPCA-MNet on the WHU dataset is only 20.87%, because it is not specific to building changes and responds to changes in all object types. The CNN-based model SNUNet-ECAM, trained on a small number of labels, performs better than KPCA-MNet: its F1-Score is 32.28% with 10 labeled pairs and rises markedly to 51.10% with 15 pairs, but increases only slightly to 52.87% and 54.44% with 20 and 30 pairs, respectively. The Transformer-based fully supervised method BIT-CD exhibits the same trend but achieves better accuracy than SNUNet-ECAM, with an F1-Score of 38.49% at 10 labeled pairs and 59.07% at 30 pairs. Due to the limited number of labels, the effectiveness of the fully supervised algorithms remains consistently poor. On the WHU dataset, the semi-supervised algorithms significantly outperform the fully supervised methods. With 10 pairs of change detection labels, both s4GAN and semiCDNet, which are GAN-based semi-supervised algorithms, achieve F1-Scores above 60%. However, their change detection performance does not improve appreciably as the number of labels grows from 10 to 30 pairs, since the label count remains small. With 10 labeled pairs, SemiSANet, a consistency regularization-based semi-supervised method, is less effective than the GAN-based algorithms, with an F1-Score of 53.16%. However, once the number of labeled pairs surpasses 15, SemiSANet’s F1-Score climbs above 75%, overtaking the GAN-based methods. The feature prediction alignment (FPA) method and the uncertainty-aware PUF method exhibit similar trends on this dataset: with very few labeled pairs their change detection performance is modest, with F1-Scores below 60%, but performance improves significantly as the number of labeled pairs increases.
For example, when the number of labeled pairs reaches 20, PUF’s F1-Score jumps to 82.94%, an increase of 19.22% from 63.72% with 15 labeled pairs. Similarly, when the number of labeled pairs reaches 30, FPA’s F1-Score rises to 81.68%, an increase of 21.47% from 60.21% with 20 labeled pairs. In contrast, our model, JointNet4BCD, demonstrates superior performance even with a very limited number of labels: with only ten labeled pairs, the F1-Score already reaches 83.45%. This highlights the significant advantage of our method in scenarios where labeled data are extremely scarce. To better illustrate the experimental results, we visualize the performance of various models with 20 labeled pairs in Figure 4.

4.5. Ablation Studies

To assess the effectiveness of each component within JointNet4BCD, we performed ablation studies on the LEVIR2000 and WHU datasets. Specifically, we removed the semi-supervised learning component, the joint learning component, and the decision fusion block from JointNet4BCD to evaluate the impact of these three components. The results of these experiments are detailed in Table 3 and Table 4.
The ablation experiments on the LEVIR2000 dataset are summarized in Table 3. It can be observed that when only single-temporal supervised learning of the change detection task is used, the model’s change detection performance is poor. With 10 labeled images, the F1-Score is 33.67%. As the number of labels increases, the F1-Score also gradually increases, but it never exceeds 60%; with 15, 20, and 30 labels, the F1-Score is 48.78%, 57.68%, and 59.18%, respectively. It is evident that the number of labels significantly affects the performance of the single-temporal change detection task.
When combining semi-supervised learning with single-temporal supervised learning, the model’s performance improves rapidly. With 10 labels, the model achieves an F1-Score of 75.53% in change detection tasks, representing a 41.86% increase compared to using only single-temporal supervised learning. As the number of labels increases, the model’s change detection performance also gradually improves. At label quantities of 15, 20, and 30, F1-Scores reach 79.88%, 81.38%, and 83.29%, respectively. It can be seen that the combination of semi-supervised learning and single-temporal supervised learning allows the model to achieve an F1-Score of over 80% even with only 20 labeled images.
When joint learning is added, the model’s change detection performance is further enhanced. With only 10 labeled images, the model’s F1-Score reaches 83%, a 7.47% increase compared to using only single-temporal semi-supervised learning. As the number of labels increases, the model’s performance increases slowly. When the number of labels increases to 30, the model’s F1-Score only increases by 1.11%. This demonstrates that joint learning can further leverage the information from an extremely small number of labels, allowing the model to understand the task and learn task-related image features with just a few labels. As a result, the model’s reliance on the quantity of labels is reduced. After incorporating the decision fusion block, the model’s performance is further improved. The performance on the LEVIR2000 dataset is on average 0.76% higher than without the inclusion of the decision fusion block.
Table 4 presents the results of the ablation study on the WHU dataset. When using only single-temporal supervised learning, the model’s change detection performance was better than on the LEVIR2000 dataset, with an F1-Score exceeding 60%. However, as the number of labels increased, the F1-Score did not improve accordingly but fluctuated around 66%, with the best result being 66.48%. After incorporating semi-supervised learning, the model showed a significant performance improvement. With 10 labels, the model’s F1-Score was only 68.82%, a mere 2.66% increase compared to not using semi-supervised learning. But as the number of labels increased, providing more guidance for the model to learn the change detection task, the model’s performance rapidly improved. With 15, 20, and 30 labels, the F1-Scores were 76.20%, 83.17%, and 83.93%, respectively, an increase of over 13% compared to when semi-supervised learning was not used. Subsequently, the addition of joint learning resulted in a significant leap in performance. With 10 labels, the model’s F1-Score for the change detection task reached 83.06%, a 14.24% increase over not using joint learning. As the number of labels increased, the model’s performance steadily improved, reaching 84.12%, 85.82%, and 86.81% with 15, 20, and 30 labels, respectively. After adding the decision fusion block, the model’s performance was further enhanced; with 30 labels, the model’s F1-Score exceeded 87%. Compared to not incorporating the decision fusion block, the model’s average F1-Score increased by 0.61%.

5. Discussion

5.1. Performance and Efficiency

The proposed JointNet4BCD excels in leveraging the potential of data through the use of semi-supervised and joint learning methods. It achieves high-precision change detection with only a minimal amount of labeled data, thereby significantly reducing the labor and time costs associated with label creation. On the LEVIR2000 dataset, JointNet4BCD achieves an F1-Score of 83.93% with just 10 building extraction labels, which is a substantial improvement over the second-best method. Similarly, on the WHU dataset, JointNet4BCD outperforms all other methods, achieving an F1-Score of 83.45% with only 10 labeled pairs. This highlights the efficiency and effectiveness of the semi-supervised and joint learning approach in leveraging a minimal amount of labeled data.

5.2. Critical Considerations and Limitations

While JointNet4BCD demonstrates impressive performance, several critical considerations and limitations must be addressed. First, because JointNet4BCD simultaneously performs building extraction and change detection and is trained on both labeled and unlabeled data, it requires more training time and GPU memory. Second, JointNet4BCD requires that the labeled and unlabeled data come from the same dataset to ensure the accuracy of change detection, meaning the sensors and regions must be consistent. Therefore, additional label annotation is necessary when training the model for different regions, although the labeling cost is significantly lower than that of fully supervised algorithms. Third, the backbone of JointNet4BCD is entirely based on convolutional neural networks, which lack the ability to effectively aggregate global features. Finally, the design of the decision fusion block is relatively simple, which may limit its effectiveness in more complex scenarios.

5.3. Future Work

Given the limitations and the promising potential of JointNet4BCD, we propose several avenues for future enhancement. Firstly, optimizing the underlying algorithms and simplifying the network architecture could significantly reduce the computational costs of training JointNet4BCD. Secondly, leveraging advanced large-scale model training techniques, together with the accumulation of extensive unlabeled bi-temporal remote sensing data spanning various regions, sensor types, and resolutions, can equip JointNet4BCD with robust cross-domain change detection capabilities. Additionally, integrating state-of-the-art models such as Transformers or state space models into JointNet4BCD could enhance its global feature aggregation abilities. Lastly, refining the decision fusion block, perhaps through the adoption of cross-attention mechanisms, would facilitate more effective information exchange between semantic segmentation and change detection, ultimately boosting the overall performance of the system.

6. Conclusions

This paper proposes a semi-supervised method, JointNet4BCD, for the detection of building changes in remote sensing images. The method combines single-temporal semi-supervised learning and joint learning. By employing semi-supervised learning, JointNet4BCD is able to train highly accurate models for building change detection with just a few single-temporal RS images that have building extraction labels, alongside many unlabeled bi-temporal RS images. Joint learning enables JointNet4BCD to utilize building extraction labels to train the model for both the building extraction and change detection tasks simultaneously, thereby enhancing the model’s feature extraction capability. Additionally, the incorporation of the decision fusion block enables the use of building extraction results to further improve the model’s change detection accuracy. Comprehensive experiments have demonstrated that JointNet4BCD outperforms state-of-the-art algorithms even under conditions of extreme labeled data scarcity. Using only ten labeled images, JointNet4BCD achieved high-precision change detection on both the LEVIR2000 dataset and the WHU dataset, with F1-Scores of 83.93% and 83.45%, respectively. However, JointNet4BCD has some limitations, including high computational costs and a lack of global feature aggregation capabilities. In future work, we intend to further improve the algorithm by leveraging self-supervised learning and large model techniques to enhance JointNet4BCD’s cross-domain change detection capabilities. Additionally, we plan to introduce Transformer technology to enable JointNet4BCD to better extract and aggregate global features.

Author Contributions

Conceptualization, C.S. and H.C.; methodology, C.S. and H.C.; software, C.S.; validation, C.S., C.D. and H.C.; data curation, H.C.; writing—original draft preparation, C.S.; writing—review and editing, C.D. and J.L.; supervision, H.C.; project administration, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

The work in this paper is supported by the National Natural Science Foundation of China (Grants 42471403, 42101435, 42101432, and 62106276).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bai, T.; Wang, L.; Yin, D.; Sun, K.; Chen, Y.; Li, W.; Li, D. Deep learning for change detection in remote sensing: A review. Geo Spat. Inf. Sci. 2023, 26, 262–288. [Google Scholar] [CrossRef]
Figure 1. Flowchart of JointNet4BCD. (a) Single-temporal joint learning: this part generates pseudo-bi-temporal images X_l^1 and X_l^2 from single-temporal images and their building extraction labels, and trains the building extraction and change detection tasks simultaneously. The solid blue line represents the building extraction process for X_l^1, the solid green line represents the building extraction process for X_l^2, and the solid yellow line represents the change detection process. (b) Unsupervised joint learning: this part uses unlabeled bi-temporal remote sensing images for consistency regularization training of the building extraction and change detection tasks. Solid lines represent the processing flow for the original images, while dashed lines represent the processing flow for the perturbed images. The blue lines represent the building extraction process for image X_u^1, the green lines represent the building extraction process for image X_u^2, and the yellow lines represent the change detection process.
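To make the two training branches in Figure 1 concrete, the following is a minimal PyTorch-style sketch, not the authors' released code, of how a supervised single-temporal joint loss and an unsupervised consistency loss could be combined. The `model` interface, the `perturb` function, and the weighting `lambda_u` are illustrative assumptions; the figure specifies only that building extraction and change detection are trained jointly on pseudo-bi-temporal pairs and regularized for consistency on unlabeled bi-temporal pairs.

```python
import torch
import torch.nn.functional as F

def supervised_joint_loss(model, x1, x2, be_label1, be_label2, cd_label):
    """Single-temporal joint learning: building extraction (BE) and change
    detection (CD) supervised on a pseudo-bi-temporal pair.

    x1, x2: pseudo-bi-temporal images (B, C, H, W); labels are binary masks.
    `model` is assumed to output probability maps for BE at both epochs and for CD.
    """
    be_pred1, be_pred2, cd_pred = model(x1, x2)
    loss_be = F.binary_cross_entropy(be_pred1, be_label1) + \
              F.binary_cross_entropy(be_pred2, be_label2)
    loss_cd = F.binary_cross_entropy(cd_pred, cd_label)
    return loss_be + loss_cd

def unsupervised_consistency_loss(model, x1, x2, perturb):
    """Unsupervised joint learning: predictions on original and perturbed
    unlabeled bi-temporal images are encouraged to agree."""
    with torch.no_grad():  # predictions on the original images act as targets
        be1, be2, cd = model(x1, x2)
    be1_p, be2_p, cd_p = model(perturb(x1), perturb(x2))
    return F.mse_loss(be1_p, be1) + F.mse_loss(be2_p, be2) + F.mse_loss(cd_p, cd)

# One training step (lambda_u is an assumed weight for the unsupervised term):
# loss = supervised_joint_loss(...) + lambda_u * unsupervised_consistency_loss(...)
```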
Figure 2. Flowchart of the decision fusion block.
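The internal structure of the decision fusion block is given only as a flowchart here. As a purely hypothetical illustration of decision-level fusion, the sketch below suppresses change predictions where the two building extraction maps agree, so that a pixel is reported as changed only if a building appears or disappears between the two epochs. The helper name `fuse_decisions` and the 0.5 thresholds are assumptions, not the paper's exact rule.

```python
import numpy as np

def fuse_decisions(cd_prob, be_prob_t1, be_prob_t2, thr=0.5):
    """Hypothetical decision fusion: keep a detected change only where the two
    building-extraction maps disagree (a building was built or demolished).

    cd_prob, be_prob_t1, be_prob_t2: float arrays in [0, 1] of shape (H, W).
    Returns a binary change map of shape (H, W).
    """
    building_t1 = be_prob_t1 > thr
    building_t2 = be_prob_t2 > thr
    building_changed = np.logical_xor(building_t1, building_t2)  # appeared or disappeared
    changed = cd_prob > thr
    return np.logical_and(changed, building_changed).astype(np.uint8)
```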
Figure 3. A visual comparison of the change detection maps produced by various approaches on the LEVIR2000 dataset, trained with 20 labeled images. The parts enclosed in the red boxes indicate areas with false alarms or missed detections. (a) Image T_1. (b) Image T_2. (c) Ground truth. (d) KPCA-MNet. (e) SNUNet-ECAM. (f) BIT-CD. (g) s4GAN. (h) semiCDNet. (i) semiSANet. (j) ChangeStar. (k) FPA. (l) PUF. (m) JointNet4BCD.
Figure 4. A visual comparison of the change detection maps produced by various approaches on the WHU dataset, trained with 20 labeled images. The parts enclosed in the red boxes indicate areas with false alarms or missed detections. (a) Image T_1. (b) Image T_2. (c) Ground truth. (d) KPCA-MNet. (e) SNUNet-ECAM. (f) BIT-CD. (g) s4GAN. (h) semiCDNet. (i) semiSANet. (j) ChangeStar. (k) FPA. (l) PUF. (m) JointNet4BCD.
Table 1. The experimental results on the LEVIR2000 dataset. Each cell lists F1 / Kappa / IoU / Time for the given number of labeled training images.

| Method | 10 labeled | 15 labeled | 20 labeled | 30 labeled |
|---|---|---|---|---|
| KPCA-MNet | 0.0972 / 0.7214 / 0.0511 / 440 s | 0.0972 / 0.7214 / 0.0511 / 440 s | 0.0972 / 0.7214 / 0.0511 / 440 s | 0.0972 / 0.7214 / 0.0511 / 440 s |
| SNUNet-ECAM | 0.2755 / 0.2570 / 0.1597 / 0.13 s | 0.3390 / 0.3192 / 0.2041 / 0.17 s | 0.3335 / 0.3121 / 0.2001 / 0.23 s | 0.3375 / 0.3076 / 0.2030 / 0.37 s |
| BIT-CD | 0.2294 / 0.2074 / 0.1295 / 0.42 s | 0.1504 / 0.1351 / 0.0813 / 0.52 s | 0.1911 / 0.1793 / 0.1056 / 0.81 s | 0.3274 / 0.3106 / 0.1957 / 1.07 s |
| s4GAN | 0.3295 / 0.2882 / 0.1973 / 99 s | 0.4767 / 0.4479 / 0.3129 / 107 s | 0.5626 / 0.5404 / 0.3914 / 115 s | 0.5430 / 0.3076 / 0.2030 / 140 s |
| semiCDNet | 0.3842 / 0.3502 / 0.2378 / 90 s | 0.4769 / 0.4568 / 0.3131 / 112 s | 0.6007 / 0.5863 / 0.4293 / 147 s | 0.6049 / 0.5903 / 0.4336 / 178 s |
| semiSANet | 0.7002 / 0.6876 / 0.5387 / 85 s | 0.7132 / 0.7014 / 0.5543 / 90 s | 0.7376 / 0.7272 / 0.5843 / 97 s | 0.7300 / 0.7204 / 0.5749 / 113 s |
| ChangeStar | 0.7207 / 0.7081 / 0.5633 / 25 min | 0.7207 / 0.7081 / 0.5633 / 25 min | 0.7207 / 0.7081 / 0.5633 / 25 min | 0.7207 / 0.7081 / 0.5633 / 25 min |
| FPA | 0.5294 / 0.5106 / 0.3600 / 57 s | 0.5686 / 0.5473 / 0.3972 / 52 s | 0.5613 / 0.5421 / 0.3901 / 61 s | 0.6696 / 0.6565 / 0.5033 / 57 s |
| PUF | 0.6163 / 0.6185 / 0.4454 / 135 s | 0.6402 / 0.6327 / 0.4708 / 135 s | 0.6416 / 0.6229 / 0.4724 / 135 s | 0.8030 / 0.7736 / 0.6709 / 135 s |
| JointNet4BCD | 0.8393 / 0.8326 / 0.7232 / 125 s | 0.8403 / 0.8333 / 0.7245 / 131 s | 0.8449 / 0.8385 / 0.7315 / 144 s | 0.8493 / 0.8428 / 0.7380 / 152 s |
Table 2. The experimental results on the WHU dataset. Each cell lists F1 / Kappa / IoU / Time for the given number of labeled training images.

| Method | 10 labeled | 15 labeled | 20 labeled | 30 labeled |
|---|---|---|---|---|
| KPCA-MNet | 0.2087 / 0.7210 / 0.1165 / 440 s | 0.2087 / 0.7210 / 0.1165 / 440 s | 0.2087 / 0.7210 / 0.1165 / 440 s | 0.2087 / 0.7210 / 0.1165 / 440 s |
| SNUNet-ECAM | 0.3228 / 0.2594 / 0.1924 / 0.13 s | 0.5110 / 0.4486 / 0.3432 / 0.17 s | 0.5287 / 0.4674 / 0.3594 / 0.23 s | 0.5444 / 0.4883 / 0.3740 / 0.37 s |
| BIT-CD | 0.3849 / 0.2840 / 0.2382 / 0.42 s | 0.5393 / 0.4735 / 0.3691 / 0.52 s | 0.5830 / 0.5368 / 0.4114 / 0.81 s | 0.5907 / 0.5451 / 0.4191 / 1.07 s |
| s4GAN | 0.6364 / 0.5924 / 0.4667 / 99 s | 0.5898 / 0.5351 / 0.4182 / 107 s | 0.6197 / 0.5674 / 0.4490 / 115 s | 0.5763 / 0.5182 / 0.4048 / 140 s |
| semiCDNet | 0.6251 / 0.5765 / 0.4547 / 90 s | 0.6265 / 0.5892 / 0.4561 / 112 s | 0.6335 / 0.5853 / 0.4636 / 147 s | 0.6331 / 0.5952 / 0.4631 / 178 s |
| semiSANet | 0.5316 / 0.4870 / 0.3620 / 85 s | 0.7603 / 0.7331 / 0.6132 / 90 s | 0.7778 / 0.7568 / 0.6364 / 97 s | 0.7812 / 0.7595 / 0.6410 / 113 s |
| ChangeStar | 0.6704 / 0.6358 / 0.5042 / 25 min | 0.6704 / 0.6358 / 0.5042 / 25 min | 0.6704 / 0.6358 / 0.5042 / 25 min | 0.6704 / 0.6358 / 0.5042 / 25 min |
| FPA | 0.5716 / 0.5401 / 0.4002 / 57 s | 0.5838 / 0.5537 / 0.4122 / 58 s | 0.6021 / 0.5694 / 0.4307 / 63 s | 0.8168 / 0.7981 / 0.6904 / 63 s |
| PUF | 0.5468 / 0.3464 / 0.3763 / 140 s | 0.6372 / 0.4599 / 0.4676 / 142 s | 0.8294 / 0.8024 / 0.7085 / 142 s | 0.8692 / 0.8466 / 0.7687 / 143 s |
| JointNet4BCD | 0.8345 / 0.8271 / 0.7159 / 125 s | 0.8487 / 0.8325 / 0.7371 / 131 s | 0.8664 / 0.8537 / 0.7644 / 144 s | 0.8730 / 0.8597 / 0.7746 / 152 s |
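For clarity, the F1-Score, Kappa coefficient, and IoU reported in Tables 1–4 can all be computed from the binary confusion matrix between a predicted change map and the ground truth. The minimal sketch below uses the standard definitions of these metrics; it is provided only as an illustration and is not the authors' evaluation code (the helper name `change_detection_metrics` and the epsilon smoothing are assumptions).

```python
import numpy as np

def change_detection_metrics(pred, gt, eps=1e-12):
    """Compute F1, Cohen's Kappa, and IoU for binary change maps (arrays of 0/1)."""
    pred = pred.astype(bool).ravel()
    gt = gt.astype(bool).ravel()
    tp = np.sum(pred & gt)    # changed pixels correctly detected
    fp = np.sum(pred & ~gt)   # false alarms
    fn = np.sum(~pred & gt)   # missed detections
    tn = np.sum(~pred & ~gt)  # unchanged pixels correctly rejected
    n = tp + fp + fn + tn

    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)

    # Cohen's Kappa: observed agreement corrected for chance agreement.
    po = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (po - pe) / (1 - pe + eps)
    return f1, kappa, iou
```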
Table 3. The ablation study results on the LEVIR2000 dataset. Each cell lists F1 / Kappa / IoU for the given number of labeled training images (✓ = component used, × = component not used).

| Method | Semi | Joint | DIM | 10 labeled | 15 labeled | 20 labeled | 30 labeled |
|---|---|---|---|---|---|---|---|
| JointNet4BCD | × | × | × | 0.3367 / 0.2945 / 0.2024 | 0.4878 / 0.4588 / 0.3225 | 0.5768 / 0.5573 / 0.4053 | 0.5918 / 0.5700 / 0.4203 |
| JointNet4BCD | ✓ | × | × | 0.7553 / 0.7457 / 0.6068 | 0.7988 / 0.7900 / 0.6651 | 0.8138 / 0.8055 / 0.6860 | 0.8329 / 0.8257 / 0.7136 |
| JointNet4BCD | ✓ | ✓ | × | 0.8300 / 0.8228 / 0.7093 | 0.8322 / 0.8247 / 0.7126 | 0.8411 / 0.8341 / 0.7257 | 0.8411 / 0.8344 / 0.7258 |
| JointNet4BCD | ✓ | ✓ | ✓ | 0.8393 / 0.8326 / 0.7232 | 0.8403 / 0.8333 / 0.7245 | 0.8449 / 0.8385 / 0.7315 | 0.8493 / 0.8428 / 0.7380 |
Table 4. The ablation study results on the WHU dataset. Each cell lists F1 / Kappa / IoU for the given number of labeled training images (✓ = component used, × = component not used).

| Method | Semi | Joint | DIM | 10 labeled | 15 labeled | 20 labeled | 30 labeled |
|---|---|---|---|---|---|---|---|
| JointNet4BCD | × | × | × | 0.6616 / 0.6254 / 0.4943 | 0.6268 / 0.5811 / 0.4564 | 0.6648 / 0.6237 / 0.4979 | 0.6583 / 0.6191 / 0.4907 |
| JointNet4BCD | ✓ | × | × | 0.6882 / 0.6510 / 0.5246 | 0.7620 / 0.7340 / 0.6155 | 0.8317 / 0.8131 / 0.7119 | 0.8393 / 0.8216 / 0.7230 |
| JointNet4BCD | ✓ | ✓ | × | 0.8306 / 0.8123 / 0.7103 | 0.8412 / 0.8242 / 0.7260 | 0.8582 / 0.8427 / 0.7517 | 0.8681 / 0.8539 / 0.7669 |
| JointNet4BCD | ✓ | ✓ | ✓ | 0.8345 / 0.8271 / 0.7159 | 0.8487 / 0.8325 / 0.7371 | 0.8664 / 0.8537 / 0.7644 | 0.8730 / 0.8597 / 0.7746 |