1. Introduction
Both building change detection and damage assessment using Remote Sensing (RS) images play a vital role in timely and accurate post-disaster analysis and response [1,2]. Optical RS imagery is one of the most widely used data sources for change detection. Rapid detection of building changes and assessment of damage levels is crucial for disaster response, resource allocation, and recovery planning to minimize loss of life and economic impact [3].
Figure 1 represents the hierarchical structure of building change detection and damage assessment tasks. Specifically, building change detection focuses on identifying changed building areas between bi-temporal images. On the other hand, building damage assessment can be divided into building segmentation, which identifies all buildings in the images, and damage classification, which assesses the severity of damage in segmented buildings.
In recent years, deep learning has become the dominant method of Geospatial Artificial Intelligence (GeoAI) for RS-based building change detection and damage assessment [4]. Among the various approaches, models based on convolutional neural networks (CNNs) are popular due to their ability to capture spatial patterns [5,6]. However, CNNs have inherent limitations, such as restricted receptive fields, which limit their ability to model complex and diverse multi-temporal scenes [7]. To address this, attention mechanisms have been integrated into deep learning models. For example, D2ANet utilizes dual-temporal aggregation and attention modules to capture multi-level changes [8].
Moving beyond CNNs, the Vision Transformer (ViT) has gained attention. ViTs are advantageous in handling complex spatial relationships and capturing long-distance dependencies in images [9]. Recent works, such as the Bitemporal Attention Module with ViT, utilize cross-attention mechanisms to achieve temporal fusion for building change detection [10]. Additionally, the ChangeMamba architecture, based on the Visual Mamba model, has shown promising results in enhancing the accuracy of building damage assessment [11]. Despite these advancements, supervised learning approaches have limitations, including reliance on large labeled datasets and overfitting to training data [12].
Self-supervised learning (SSL) has contributed to the field of deep learning by allowing models to learn generalized feature representations from unlabeled datasets [13]. SSL achieves this by creating pretext tasks from raw data, which exploit patterns within the data to train models without labels. Once pre-trained on these pretext tasks, the model can be fine-tuned on labeled datasets for downstream tasks such as classification, segmentation, and object detection. Notable SSL frameworks in computer vision include Bootstrap Your Own Latent (BYOL) [14], the Simple Framework for Contrastive Learning (SimCLR) [15], Momentum Contrast (MoCo) [16,17], Self-Distillation with No Labels (DINO) [18], and the Masked Autoencoder (MAE) [19].
In the context of RS building change detection, SSL frameworks like RECM combine RGB-elevation knowledge distillation and image mask prediction to improve detection performance [20]. However, RS images differ significantly from typical RGB images (e.g., ImageNet [21]). ImageNet provides clear object categories, while RS images capture the Earth's surface, featuring objects that vary widely in scale, color, shape, and texture due to different weather conditions, human mobility, and urban changes [22]. Thus, masking large regions of RS images (the methodology of MAE) may cause semantic loss. Other SSL methods, such as the spatiotemporal contrastive representation learning model (ST-CRL), learn features of building damage using contrastive learning [3]. This method focuses on learning an embedding space where similar features are close together and dissimilar ones are pushed apart [15]. However, RS images often contain information at coarse, middle, and fine-grained levels, which is not fully exploited by contrastive learning alone [22].
In addition, existing SSL research primarily relies on either denoising or masked prediction strategies, with limited investigation into their impact on information representation. To address this gap, we compare these two strategies by evaluating their generative ability. Based on this, we further propose a novel SSL framework, the Denoising AutoEncoder-enhanced Dual-Fusion Network (DAEDFN), which integrates a denoising strategy instead of masked prediction. This approach preserves essential semantic information while forcing the model to learn latent semantic representations. To further evaluate the effectiveness of the proposed SSL framework, we conduct experiments on five datasets for two main downstream tasks: building damage assessment and building change detection. The main contributions of this work are summarized as follows:
Investigate the performance of denoising and masking strategies for semantic information reconstruction in remote sensing images.
Develop a dual denoising autoencoder (DAE) with a Vision Transformer backbone and contrastive learning strategy for self-supervised pretraining, enabling effective extraction of multi-scale image representations for various vision tasks.
Design and implement two transfer learning networks, composed of task-specific decoders, incorporating an edge guidance module and edge detection loss, to effectively adapt the pretrained model for building damage assessment and change detection tasks.
3. Materials and Methods
3.1. Datasets
The experiments are conducted on five widely used benchmark datasets for building damage assessment and change detection tasks: xBD, LEVIR, LEVIR+, SYSU-CD, and WHU-CD. The xBD dataset, provided by Carnegie Mellon University and the Defense Innovation Unit, USA [23], is the largest publicly available dataset for building damage assessment. It includes 850,736 annotated buildings spanning a total area of 45,362 square kilometers, with an image resolution of 0.8 m. Building annotations are categorized into four damage levels: no damage, minor damage, major damage, and destroyed. In this study, the xBD training set is used for self-supervised pretraining and the supervised building damage assessment task.
The LEVIR dataset [51] contains 637 pairs of Google Earth image patches, each with a high resolution of 0.5 m/pixel and a size of 1024 × 1024 pixels. These images were collected from 20 different regions in various cities across Texas, USA, between 2002 and 2018. The dataset includes diverse building types such as villas, high-rise apartments, small garages, and large warehouses. In this study, it is used for the building change detection task.
The LEVIR+ dataset is an extension of the LEVIR-CD dataset [52]. It includes over 985 very high-resolution (0.5 m/pixel) bitemporal Google Earth images with dimensions of 1024 × 1024 pixels. These images were captured from 20 different regions located in various cities across Texas and span the period from 2002 to 2020. LEVIR+ is used for the building change detection task in this study.
The SYSU-CD dataset [53] is a category-agnostic change detection dataset comprising 20,000 pairs of 0.5 m/pixel aerial images of Hong Kong acquired between 2007 and 2014. The dataset is distinguished by its focus on urban and coastal changes, featuring high-rise buildings and infrastructure developments, where change detection poses significant challenges due to shadow and deviation effects. In this study, the SYSU-CD dataset is used for the building change detection task.
The WHU-CD dataset [54], a subset of the larger WHU Building dataset, is tailored for the building change detection task. It comprises two aerial datasets from Christchurch, New Zealand, captured in April 2012 and 2016, with a spatial resolution of 0.3 m/pixel. This dataset is particularly focused on detecting changes in large and sparse building structures. The aerial images captured in 2012 cover an area of 20.5 km² with 12,796 buildings, while the 2016 imagery contains 16,077 buildings within the same area, reflecting significant urban development over the four-year period. The dataset follows an official split into training (21,243 × 15,354 pixels) and testing (11,265 × 15,354 pixels) areas. In this study, WHU-CD is used for the building change detection task.
In this study, the multitemporal image pairs and associated labels of all datasets are cropped to 224 × 224 pixels for input to the network. We also divide the datasets into training and test sets using an 80%/20% ratio. For few-shot experiments (e.g., 5% and 10% settings), we randomly sample the specified proportion from the training set while keeping the test sets fixed.
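As a minimal illustration of this preprocessing, the sketch below crops image/label pairs into 224 × 224 tiles and performs the 80%/20% split with optional few-shot subsampling. The helper names (`crop_to_tiles`, `split_and_subsample`), the fixed seed, and the channel-first array layout are our own assumptions, not details of the original pipeline.

```python
import random

def crop_to_tiles(image, label, tile=224):
    """Crop an image and its label into non-overlapping 224x224 tiles.
    Assumes channel-first arrays of shape (..., H, W)."""
    h, w = image.shape[-2], image.shape[-1]
    tiles = []
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            tiles.append((image[..., top:top + tile, left:left + tile],
                          label[..., top:top + tile, left:left + tile]))
    return tiles

def split_and_subsample(samples, train_ratio=0.8, few_shot_fraction=None, seed=42):
    """80/20 train/test split; optionally subsample the training set
    for few-shot runs (e.g., 0.05 or 0.10) while the test set stays fixed."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    cut = int(train_ratio * len(indices))
    train_idx, test_idx = indices[:cut], indices[cut:]
    if few_shot_fraction is not None:
        train_idx = rng.sample(train_idx, max(1, int(few_shot_fraction * len(train_idx))))
    return [samples[i] for i in train_idx], [samples[i] for i in test_idx]
```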
3.2. Problem Statement
3.2.1. SSL Pretraining
In the context of SSL using a dual DAE, the model aims to learn feature representations by reconstructing clean images from noisy inputs. The pretraining process can be formulated as

$$\hat{x} = D\big(E(\tilde{x})\big), \quad x \in \{x_{\mathrm{pre}}, x_{\mathrm{post}}\},$$

where $x_{\mathrm{pre}}$ and $x_{\mathrm{post}}$ are the pre-event and post-event images, $x$ is the original clean input, $\tilde{x}$ represents the noisy input, $E$ is the encoding function that extracts latent representations, $D$ is the decoding function that reconstructs the clean images, and $\hat{x}$ is the reconstructed output.
The objective is to minimize the reconstruction error while ensuring the decoder D also constrains the dissimilarity between pre-event and post-event image pairs. This pretraining step allows the encoder to extract meaningful latent features, which are later frozen or fine-tuned for downstream tasks.
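To make the formulation concrete, a minimal sketch of one pretraining step is given below: both inputs are corrupted with Gaussian noise, encoded, and reconstructed against the clean originals. The noise level `noise_std` and the callable `encoder`/`decoder` interfaces are our assumptions, not details specified in the text.

```python
import torch

def pretrain_step(encoder, decoder, x_pre, x_post, noise_std=0.1):
    """One denoising reconstruction step of the dual DAE formulation:
    x_hat = D(E(x + noise)) for both temporal inputs."""
    x_pre_noisy = x_pre + noise_std * torch.randn_like(x_pre)
    x_post_noisy = x_post + noise_std * torch.randn_like(x_post)
    z_pre, z_post = encoder(x_pre_noisy), encoder(x_post_noisy)
    x_pre_hat, x_post_hat = decoder(z_pre), decoder(z_post)
    # Reconstruction error against the clean images
    recon = torch.nn.functional.mse_loss(x_pre_hat, x_pre) + \
            torch.nn.functional.mse_loss(x_post_hat, x_post)
    return recon, (z_pre, z_post)
```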
3.2.2. Building Change Detection Task
Binary change detection focuses on identifying where changes happen between bi-temporal images. It can be formally defined as
where
, and
are pre-event and post-event images, respectively;
is the building change mask for the
,
pair.
In this task, the decoder of the pre-trained DAE is replaced with a task-specific segmentation head. The latent representation extracted from the encoder is used to predict the change mask:

$$\hat{M} = S\big(E(x_{\mathrm{pre}}), E(x_{\mathrm{post}})\big),$$

where $S$ represents the segmentation head and $\hat{M}$ is the predicted segmentation mask.
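A minimal sketch of such a segmentation head is shown below, assuming the encoder's token outputs have already been reshaped into (B, C, H', W') feature maps; the layer widths are illustrative and not the paper's exact decoder.

```python
import torch
import torch.nn as nn

class ChangeHead(nn.Module):
    """Sketch of a task-specific change head: the two latent maps are
    concatenated and projected down to a one-channel change mask."""
    def __init__(self, channels=768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(2 * channels, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, kernel_size=1),
        )

    def forward(self, z_pre, z_post):
        # z_pre, z_post: (B, C, H', W') latent feature maps from the encoder
        return torch.sigmoid(self.head(torch.cat([z_pre, z_post], dim=1)))
```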
3.2.3. Building Damage Assessment Task
Building damage assessment extends binary change detection by identifying both the location and the severity of building damage. This can be considered a one-to-many semantic change detection task [4]. It is defined as

$$\big(M_{\mathrm{loc}}, M_{\mathrm{dam}}\big) = f(x_{\mathrm{pre}}, x_{\mathrm{post}}),$$

where $M_{\mathrm{loc}}(i,j)$ is the building mask at pixel $(i,j)$, and $M_{\mathrm{dam}}(i,j) \in \{1, \dots, K\}$ is the post-disaster damage level of the building at $(i,j)$, where $K$ is the number of damage classes.
For this task, we add two task-specific heads: a segmentation head for localization and a classification head for damage severity prediction. The formulation is as follows:

$$\hat{M}_{\mathrm{loc}} = S\big(E(x_{\mathrm{pre}})\big), \qquad \hat{M}_{\mathrm{dam}} = C\big(E(x_{\mathrm{pre}}), E(x_{\mathrm{post}})\big),$$

where $S$ and $C$ represent the segmentation head and classification head, $\hat{M}_{\mathrm{loc}}$ is the predicted segmentation mask, and $\hat{M}_{\mathrm{dam}}$ is the predicted damage severity level.
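The two-head formulation can be sketched as below, with `encoder`, `seg_head`, and `cls_head` as placeholder callables standing in for the components described in Section 3.6.

```python
import torch

def damage_assessment_forward(encoder, seg_head, cls_head, x_pre, x_post):
    """Dual-head forward pass: the segmentation head predicts the building
    mask from the pre-event latent, while the classification head predicts
    per-pixel damage severity from both latents."""
    with torch.no_grad():  # the pretrained encoder is kept frozen for this task
        z_pre, z_post = encoder(x_pre), encoder(x_post)
    m_loc = seg_head(z_pre)          # predicted building mask, M_loc
    m_dam = cls_head(z_pre, z_post)  # predicted damage severity map, M_dam
    return m_loc, m_dam
```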
3.3. Image Reconstruction Strategy Comparison
Masked image reconstruction randomly masks patches in an image and learns to generate the original [19], thus enabling representation learning from unlabeled data. Meanwhile, some works utilize self-supervised pretraining to improve image denoising [55,56], and other studies have proposed hybrid approaches that unify masked and denoising strategies for representation learning [57]. However, to the best of our knowledge, no prior work has systematically compared these two strategies in terms of their effectiveness for RS image reconstruction. Given that RS images contain complex object environments and changes, evaluating their reconstruction performance is essential.
In this study, we compare denoising and masked reconstruction strategies using generative evaluation metrics, namely PSNR and SSIM [58]. Based on this analysis, we determine the most suitable strategy for constructing our SSL framework.
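Both metrics are standard; a minimal sketch using scikit-image is given below, assuming reconstructions and clean images are float arrays normalized to [0, 1].

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def reconstruction_quality(clean, reconstructed):
    """PSNR/SSIM between a clean RS image and its reconstruction.
    Both arrays: (H, W, 3), float values scaled to [0, 1]."""
    psnr = peak_signal_noise_ratio(clean, reconstructed, data_range=1.0)
    ssim = structural_similarity(clean, reconstructed, data_range=1.0,
                                 channel_axis=-1)
    return psnr, ssim
```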
3.4. Overview of DAE-Enhanced Dual-Fusion Network
The proposed method consists of three key stages (Figure 2): Stage A (self-supervised pretraining), Stage B1 (supervised building damage assessment), and Stage B2 (supervised building change detection). This framework integrates self-supervised feature extraction with fine-tuning for both building change detection and building damage assessment tasks. Stage A serves as a shared pretraining phase, in which a dual encoder-decoder model learns spatial and semantic representations. After Stage A, the pretrained encoder is reused in the two independent downstream tasks (B1 and B2), each trained with task-specific supervision.
Stage A: Self-Supervised Pretraining Integrating Unified Denoising and Contrastive Learning
In this stage, distorted bi-temporal input pairs are passed through a dual encoder–decoder structure. The model learns visual representations using a combination of reconstruction loss (recovering clean images) and contrastive InfoNCE loss (aligning latent embeddings). This ensures the encoder effectively captures spatial features and semantic differences across bi-temporal images.
Stage B1: Supervised Building Damage Assessment using Multi-task Learning
The pretrained encoder is frozen and utilized for supervised building damage assessment. The encoder features are processed through a dual decoder consisting of an FPN for building segmentation and a multi-scale ResNet for predicting damage severity levels. This generates both segmentation masks for building localization and post-event damage maps for each pixel.
Stage B2: Supervised Building Change Detection with Transfer Learning
In this stage, the Transformer blocks of the pretrained encoder are fine-tuned through multi-head adapters that extract features from the pre-event and post-event images; these features are then concatenated and processed by an FPN to generate high-quality change detection masks.
This unified approach effectively combines self-supervised learning for feature extraction and transfer learning for task-specific fine-tuning, enabling competitive performance across building change detection and damage assessment tasks.
3.5. Self-Supervised Pretraining Integrating Unified Denoising and Contrastive Learning
3.5.1. Network Architecture
The self-supervised pretraining framework is illustrated in Figure 2A. It follows a structure inspired by MAE [19] but employs a denoising strategy [59] instead of masked image modeling. The proposed pretraining framework consists of two key components: the denoising strategy and a dual DAE with a ViT backbone.
First, Gaussian noise sampled from a normal distribution is added to both the pre-event and post-event images. This distortion perturbs the semantic content of the inputs, forcing the network to learn meaningful representations in order to reconstruct the clean images.
The dual DAE model consists of an encoder-decoder architecture, both based on the ViT-Base model. The encoder comprises 12 Transformer blocks, which map noisy inputs into latent representations. The decoder includes eight Transformer blocks followed by a final linear projection layer, which reconstructs the original clean images. During training, the distorted bi-temporal inputs ($\tilde{x}_{\mathrm{pre}}$, $\tilde{x}_{\mathrm{post}}$) are passed through the dual DAE to recover the clean pre-event and post-event images.
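A condensed sketch of one branch of this dual DAE (both temporal branches share it) using standard PyTorch Transformer layers is shown below. The patch size, embedding width, head count, and the omission of positional embeddings are simplifications on our part, not the paper's exact ViT implementation.

```python
import torch.nn as nn

class DualDAE(nn.Module):
    """Sketch of the dual DAE branch: a ViT-Base-style encoder (12 blocks)
    and a lighter decoder (8 blocks + linear projection back to pixels).
    Positional embeddings are omitted for brevity."""
    def __init__(self, embed_dim=768, patch=16, n_heads=12):
        super().__init__()
        self.patchify = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=12)
        dec_layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=8)
        self.proj = nn.Linear(embed_dim, patch * patch * 3)  # back to pixel patches

    def forward(self, x_noisy):
        tokens = self.patchify(x_noisy).flatten(2).transpose(1, 2)  # (B, N, C)
        latent = self.encoder(tokens)
        return self.proj(self.decoder(latent)), latent
```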
3.5.2. Loss Functions
The self-supervised pretraining phase optimizes a combined loss function consisting of a Mean Squared Error (MSE) loss and a contrastive InfoNCE loss [60]:

$$\mathcal{L}_{\mathrm{pretrain}} = \mathcal{L}_{\mathrm{MSE}} + \mathcal{L}_{\mathrm{InfoNCE}}.$$
Here, the MSE loss is calculated between the reconstructed images and the original input patches, encouraging the model to accurately reconstruct the input data. The contrastive InfoNCE loss ensures the model aligns the latent representations of pre-event and post-event images while distinguishing them from unrelated samples. It is defined as

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\!\big(\mathrm{sim}(z_{\mathrm{pre}}, z_{\mathrm{post}})/\tau\big)}{B},$$

where $\tau$ is the temperature parameter, and $B$ is defined as

$$B = \sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(z_{\mathrm{pre}}, z_{\mathrm{post}}^{j})/\tau\big),$$

with $N$ denoting the number of samples in a batch. Here, $z_{\mathrm{pre}}$ and $z_{\mathrm{post}}$ are the latent representations of pre-event and post-event images $x_{\mathrm{pre}}$ and $x_{\mathrm{post}}$, respectively, and $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity, defined as

$$\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert \,\lVert v \rVert}.$$
The contrastive loss ensures that the representations of corresponding pre-event and post-event images are pulled together, while unrelated samples are pushed apart. This alignment allows the encoder to learn discriminative features that highlight subtle differences between pre- and post-disaster inputs.
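The sketch below computes a batch-level version of this loss, treating corresponding pre/post latents as positives and all other pairings within the batch as negatives. The token mean-pooling and the temperature value are our assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z_pre, z_post, temperature=0.07):
    """InfoNCE over a batch of latent token sequences (B, N, C): matching
    pre/post latents are positives, all other pairs are negatives."""
    z1 = F.normalize(z_pre.mean(dim=1), dim=-1)   # pool tokens -> (B, C)
    z2 = F.normalize(z_post.mean(dim=1), dim=-1)
    logits = z1 @ z2.t() / temperature            # cosine-similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)  # diagonal = positives
    return F.cross_entropy(logits, targets)
```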
3.6. Supervised Building Damage Assessment Using Multi-Task Learning
Figure 2(B1) shows the network architecture for the building damage assessment task. This multi-task network consists of a frozen dual DAE encoder, a two-branch FPN-ResNet decoder, and an Edge Guidance Module (EGM). It is designed to predict building localization and damage severity simultaneously.
First, the pretrained DAE encoder is frozen for latent image representation extraction. For building segmentation, the FPN is used to capture pre-disaster building characteristics at multiple spatial resolutions. The EGM, composed of two 1 × 1 convolution layers, enhances the segmentation results by emphasizing edge information. For building damage assessment, multi-level features of post-disaster images are extracted using a ResNet-50 backbone, upsampled, and concatenated with the pre-event edge map. A final classification head outputs a multi-class damage severity map.
To supervise building segmentation, a combined Focal-Dice loss is employed:

$$\mathcal{L}_{\mathrm{seg}} = \mathcal{L}_{\mathrm{focal}} + \mathcal{L}_{\mathrm{dice}},$$

where $\mathcal{L}_{\mathrm{focal}}$ handles class imbalance [61] and $\mathcal{L}_{\mathrm{dice}}$ measures the overlap alignment [62].
For building edge and mask predictions, the combined loss can be extended as

$$\mathcal{L}_{\mathrm{seg}} = w\,\big(\mathcal{L}_{\mathrm{focal}}^{\mathrm{edge}} + \mathcal{L}_{\mathrm{dice}}^{\mathrm{edge}}\big) + (1 - w)\,\big(\mathcal{L}_{\mathrm{focal}}^{\mathrm{mask}} + \mathcal{L}_{\mathrm{dice}}^{\mathrm{mask}}\big),$$

where $\mathcal{L}_{\mathrm{focal}}^{\mathrm{edge}}$ and $\mathcal{L}_{\mathrm{dice}}^{\mathrm{edge}}$ are applied to edge maps, and $\mathcal{L}_{\mathrm{focal}}^{\mathrm{mask}}$ and $\mathcal{L}_{\mathrm{dice}}^{\mathrm{mask}}$ to segmentation masks. After experiments, we set $w$ to 0.5 to balance edge and mask detection.
For damage severity classification, a cross-entropy loss is used:

$$\mathcal{L}_{\mathrm{cls}} = -\sum_{k=1}^{K} y_k \log \hat{y}_k,$$

where $y_k$ and $\hat{y}_k$ are the ground-truth and predicted probabilities for damage class $k$. The final loss for the BDA task is

$$\mathcal{L}_{\mathrm{BDA}} = \mathcal{L}_{\mathrm{seg}} + \mathcal{L}_{\mathrm{cls}}.$$
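A compact sketch of the weighted Focal-Dice combination (with $w = 0.5$) is shown below; the focal parameters alpha and gamma follow common defaults and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(pred, target, alpha=0.25, gamma=2.0):
    """Binary focal loss on probability maps; alpha/gamma are common defaults."""
    bce = F.binary_cross_entropy(pred, target, reduction="none")
    p_t = target * pred + (1 - target) * (1 - pred)
    return (alpha * (1 - p_t) ** gamma * bce).mean()

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss measuring overlap between prediction and ground truth."""
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def combined_seg_loss(edge_pred, edge_gt, mask_pred, mask_gt, w=0.5):
    """Weighted Focal-Dice loss over edge maps and segmentation masks."""
    edge_term = focal_loss(edge_pred, edge_gt) + dice_loss(edge_pred, edge_gt)
    mask_term = focal_loss(mask_pred, mask_gt) + dice_loss(mask_pred, mask_gt)
    return w * edge_term + (1 - w) * mask_term
```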
3.7. Supervised Building Change Detection with Transfer Learning
Figure 2(B2) shows the network tailored to building change detection, which involves binary segmentation only. Similarly, pre-event and post-event images are first fed into the dual DAE encoder, where multi-head adapters are integrated into each Transformer block to fine-tune features. After feature extraction, the latent features from the pre-event and post-event images are compared to compute a difference map. Finally, the EGM refines the output by producing edge maps and segmentation masks for precise boundary detection.
Similar to the BDA task, a customized combined Focal-Dice loss supervises the segmentation and edge predictions:

$$\mathcal{L}_{\mathrm{CD}} = w\,\big(\mathcal{L}_{\mathrm{focal}}^{\mathrm{edge}} + \mathcal{L}_{\mathrm{dice}}^{\mathrm{edge}}\big) + (1 - w)\,\big(\mathcal{L}_{\mathrm{focal}}^{\mathrm{mask}} + \mathcal{L}_{\mathrm{dice}}^{\mathrm{mask}}\big).$$

Here, $\mathcal{L}_{\mathrm{focal}}^{\mathrm{edge}}$ and $\mathcal{L}_{\mathrm{dice}}^{\mathrm{edge}}$ are applied to edge maps, and $\mathcal{L}_{\mathrm{focal}}^{\mathrm{mask}}$ and $\mathcal{L}_{\mathrm{dice}}^{\mathrm{mask}}$ to segmentation masks. After experiments, we set $w$ to 0.5 to balance edge and mask detection.
The ground truth edge map $\mathrm{Edge}$ is computed using the Sobel operator [63]:

$$\mathrm{Edge} = \sqrt{G_x^2 + G_y^2},$$

where $G_x$ and $G_y$ are the image gradients along the horizontal and vertical directions, respectively.
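As an illustration, the sketch below derives such an edge map from a binary building mask with OpenCV's Sobel operator; the final binarization step is our own choice rather than a detail from the paper.

```python
import cv2
import numpy as np

def ground_truth_edges(mask):
    """Ground-truth edge map from a binary building mask via the Sobel
    operator: Edge = sqrt(Gx^2 + Gy^2), then binarized."""
    m = mask.astype(np.float32)
    gx = cv2.Sobel(m, cv2.CV_32F, 1, 0, ksize=3)  # horizontal gradient Gx
    gy = cv2.Sobel(m, cv2.CV_32F, 0, 1, ksize=3)  # vertical gradient Gy
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    return (magnitude > 0).astype(np.float32)
```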
3.8. Evaluation Metrics
The evaluation of our method is based on the F1 Score (F1), a metric that balances precision and recall using their harmonic mean. The F1 Score is defined as

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

with $TP$ denoting true positives, $FP$ false positives, and $FN$ false negatives.
Specifically, the F1 Score for building localization is denoted as $F1_{\mathrm{loc}}$, the F1 Score for the individual damage severity classes as $F1_{i}$ ($i = 1, \dots, 4$), and the overall F1 Score for damage severity classification as $F1_{\mathrm{cls}}$. In the xBD dataset, classes 1 to 4 correspond to "no damage", "minor damage", "major damage", and "destroyed", respectively.
In addition to the F1 Score, we also evaluate our method using the Intersection over Union (IoU), a widely adopted metric for measuring the overlap between predicted and ground truth regions. The IoU is defined as

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN},$$

where $TP$ denotes true positives, $FP$ false positives, and $FN$ false negatives. A higher IoU indicates better spatial agreement between predictions and ground truth.
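For reference, both metrics can be computed directly from confusion counts, as in the following sketch for binary masks (the small epsilon guards against empty masks).

```python
import numpy as np

def f1_and_iou(pred, target, eps=1e-9):
    """F1 and IoU from binary masks (np.ndarray of {0, 1})."""
    tp = np.logical_and(pred == 1, target == 1).sum()
    fp = np.logical_and(pred == 1, target == 0).sum()
    fn = np.logical_and(pred == 0, target == 1).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return f1, iou
```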
We adopt the F1 Score and IoU as the evaluation metrics in this study to maintain consistency with the previous work [3] we compare against, ensuring a fair comparison of results.
3.9. Experimental Setup
The proposed architectures are implemented in PyTorch [64]. The bi-temporal image pairs and associated labels are resized to 224 × 224 pixels before being input to the network. During SSL pretraining, the Adam optimizer is employed, with the initial learning rate scheduled using ReduceLROnPlateau and with weight decay regularization. The batch size is set to 32, and data augmentation techniques, including random flip, color jitter, random grayscale, and Gaussian blur, are applied to improve generalization. The xBD training set serves as the pretraining dataset.
For downstream tasks, all experimental settings remain the same except for the learning rate, which is adjusted for fine-tuning. To address class imbalance, we randomly sampled xBD training sets of size n = 200,000 and testing sets of size n = 20,000 [3]. The same data augmentation methods are applied to ensure robustness in downstream training for both building change detection and building damage assessment. For both training and inference, we used an NVIDIA TITAN RTX GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 24 GB of memory and CUDA 12.2.
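A sketch of this optimization setup in PyTorch is given below. Because the exact learning-rate and weight-decay values are not reproduced here, the numbers shown are placeholders, and the augmentation parameters are likewise illustrative; `DualDAE` refers to the architecture sketch in Section 3.5.1.

```python
import torch
from torchvision import transforms

model = DualDAE()  # stand-in for the dual DAE sketched in Section 3.5.1

# Adam with ReduceLROnPlateau scheduling; lr and weight_decay are assumed values
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")

# Augmentations named in the text: random flip, color jitter,
# random grayscale, and Gaussian blur (parameters are illustrative)
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
])
```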