1. Introduction
Island building change detection is a crucial technology for environmental monitoring [1], disaster response [2], and urban planning [3]. It plays a vital role in the scientific management and sustainable development of island resources [4]. With the advancement of remote sensing capabilities, high-resolution remote sensing images have become increasingly available, providing a solid data foundation for refined surface monitoring [5]. However, despite the strong potential of remote sensing images for detecting and characterizing buildings over large areas, accurately extracting and analyzing building change information from dual-temporal remote sensing images remains a highly challenging task [6].
Traditional change detection methods primarily rely on handcrafted feature engineering, where low-level visual features from bi-temporal remote sensing images are compared to identify changes. For instance, He et al. [7] proposed an improved superpixel clustering approach that integrates regional consistency and boundary information to extract large-scale building changes. While effective in low-density urban areas, this method often suffers from over-segmentation and error propagation in dense or high-rise regions due to the structural complexity and texture heterogeneity within superpixels, thereby reducing detection accuracy. Feature point-based matching strategies have also been widely applied in multi-temporal image registration and change extraction [8]. For example, Wang et al. [9] developed a matching algorithm constrained by neighborhood topology and affine geometry, which demonstrates strong performance in multi-source image registration but is prone to mismatches in repetitive or weak-texture regions, undermining the reliability of subsequent change analysis. To improve detection stability, some researchers have incorporated multi-temporal constraints. Zhang et al. [10] introduced a three-temporal logical constraint change vector analysis method, which enhances detection consistency and performs well in significant building demolition and construction scenarios. Nevertheless, this method remains insufficiently responsive to subtle or low-contrast changes and shows limited robustness against noise and pseudo-changes. Overall, while these traditional methods have expanded the technical pathways of change detection, their reliance on handcrafted features leads to weak generalization ability, complex parameter tuning, and low automation [11], making them inadequate for accurately extracting subtle building changes in high-resolution island imagery.
In recent years, deep learning has emerged as a dominant paradigm in change detection due to its powerful capabilities in image modeling, feature representation, and pattern recognition [12,13]. Fully convolutional networks (FCNs) pioneered the end-to-end pipeline from raw images to change maps [14]. Subsequently, UNet introduced skip connections to achieve multi-scale feature fusion, demonstrating excellent performance in semantic segmentation and being widely adapted to building change detection (BCD) tasks [15]. However, due to the limited receptive fields and isotropic nature of conventional convolution kernels [16,17], FCN-based models still struggle with capturing complex building contours, precise boundary localization, and scale adaptability. To address these issues, various architectural improvements have been proposed. Zhao et al. [18] developed HRSCD-Net, which integrates a high-resolution semantic preservation module and a multi-scale context aggregation mechanism, significantly improving the detection of small buildings and achieving competitive results on datasets such as HRSCD and LEVIR-CD. Nonetheless, its convolution-dominated design remains less effective in representing irregular or complex building boundaries, particularly under challenging island conditions.
To further enhance feature discrimination, attention mechanisms have been incorporated into change detection networks [19]. For example, SNUNet integrates both channel and spatial attention modules into the UNet architecture [20], thereby improving the sensitivity and robustness of the model to change regions and surpassing conventional methods across multiple benchmark datasets. More recently, Transformer architectures have attracted growing attention owing to their global modeling capability [21]. BIT-CD employs a dual-temporal Transformer framework to model temporal dependencies, achieving notable improvements in small-object change detection and boundary preservation, and is regarded as one of the current state-of-the-art models [22]. However, the quadratic growth of computational complexity with input resolution imposes substantial demands on computational resources and data scale, which significantly constrains the applicability of Transformer-based approaches in large-scale, high-resolution scenarios.
To address the widespread class imbalance and boundary blurring issues in island building change detection, this study proposes the MSDT-Net model based on multi-scale deep feature fusion. The model employs a Siamese architecture for feature extraction, replacing traditional complex backbones with a weight-sharing ConvNeXt-Tiny. Specifically, the model introduces the Differential Transformer Encoding Module (DTEM), which explicitly models the temporal and spatial difference features between dual-temporal images, effectively enhancing small-target change detection capability and boundary localization accuracy. On this basis, the newly designed Multi-Scale Smoothing Attention (MSA) module integrates multi-scale convolutions and channel attention mechanisms to strengthen the representation of boundary features in change regions, overcoming the limitations of traditional methods in contour recognition.
The two modules work in synergy to address two key challenges in island scenes—namely, the difficulty of detecting small objects and the ambiguity of building boundaries—from the feature representation perspective. This collaborative design enables structured modeling and differential enhancement of complex island environments. Their joint integration not only enhances the model’s sensitivity to subtle local variations but also strengthens the stability of cross-temporal feature alignment, thereby providing solid semantic support for subsequent change region segmentation.
In response to the class imbalance characteristic in island scenarios, the proposed FDUB-Loss joint loss function uses a dynamic weight adjustment mechanism to significantly improve the model’s sensitivity to sparse change targets. Experimental results demonstrate that in island scenarios with a very low proportion of changed pixels, the model effectively addresses the issues of missed detections and false positives in traditional methods, significantly improving detection accuracy and boundary integrity.
The contributions of our work can be summarized as follows:
(1) A novel remote sensing change detection framework, termed MSDT-Net, is proposed, which creatively integrates the advantages of Siamese neural networks and the ConvNeXt architecture to achieve a balanced representation of local texture details and global semantic consistency. By effectively exploiting the spatiotemporal correlations of multi-temporal imagery and multi-scale contextual information, the proposed method significantly improves the continuity of change contours and successfully mitigates the problem of class imbalance.
(2) Two core functional modules are designed: the Multi-Scale Smoothing Attention (MSA) module and the Differential Transformer Encoding Module (DTEM). These modules form a complementary structural relationship: MSA focuses on boundary smoothing and multi-scale perception, while DTEM explicitly models spatiotemporal discrepancies. Their cooperation enables highly sensitive detection of small-scale change boundaries in complex island environments.
(3) Furthermore, a Focal–Dice–IoU Boundary Unified Loss (FDUB-Loss) is introduced. This hybrid loss function adaptively optimizes the model with respect to both imbalanced sample distributions and boundary ambiguity, maintaining high accuracy and robustness even in low-change scenarios. It effectively alleviates the class imbalance issue and substantially enhances the model’s sensitivity to subtle variations. The proposed loss function can be seamlessly integrated into existing deep learning models for change detection, markedly improving their performance in island building change detection tasks.
Overall, the innovations of this study lie not only in the novel architectural design but also in the systematic optimization strategies tailored for the specific challenge of island building change detection, achieving breakthroughs in both methodological and application aspects.
2. Related Work
Building change detection refers to the process of analyzing remote sensing imagery acquired at different times over the same geographic region to identify and extract building-related change information [23]. Traditional change detection methods often struggle with interference factors such as diverse building forms, varying lighting conditions, and seasonal vegetation differences. Recently, the rapid development of deep learning technologies has brought new breakthroughs to this field, particularly through Siamese neural network architectures, which have significantly improved change detection performance. Yun et al. [24] were among the first to apply Siamese networks to building change detection in coastal environments. By using a shared-weight mechanism, they effectively alleviated feature confusion between marine backgrounds and building targets; however, the method still exhibits notable scale sensitivity when detecting small buildings. To improve classification accuracy, Pan et al. [25] proposed a simplified object network that innovatively combines object-oriented segmentation with deep learning. By adopting a multi-level feature fusion strategy, they significantly improved classification accuracy for high-resolution images, although the real-time performance of their change detection still requires improvement. Zheng et al. [26] developed HFA-Net, which introduces a high-frequency attention module to enhance the expression of structural edge features and achieves significant results in building change detection on ultra-high-resolution imagery, though misclassification still occurs under strong shadow interference. Notably, Transformer architectures, such as the Siamese Transformer proposed by Rao et al. [27], achieve superior global semantic modeling through hierarchical attention mechanisms, yet their complex network structures and large parameter sizes limit practical application.
Although these methods have made remarkable progress in multi-scale perception [28], boundary delineation [29], and context modeling [30], they still face two key technical bottlenecks. First, building edge features are easily ignored or blurred during detection, leading to insufficient boundary localization accuracy; this issue is especially prominent when detecting small buildings on islands. Second, the class imbalance caused by the extremely low proportion of changed areas in the imagery leads models to overly favor the “no change” class, severely restricting detection performance on small change targets. To address the core challenge of class imbalance, recent research has proposed several innovative solutions. Alcantarilla et al. [31] introduced multi-scale feature fusion and dynamic weight adjustment mechanisms that significantly improve detection accuracy, though the method still has limitations in adapting to heterogeneous data fusion. Wang et al. [32] designed a foreground-prior sampling strategy in their HANet to enhance the modeling of change areas, but this method is prone to sample bias in complex backgrounds. Mou et al. [33] developed a cascading attention network that strengthens long-term temporal dependency modeling, though its network complexity reduces computational efficiency. Kemker et al. [34] developed a progressive change detection framework that effectively reduces false alarm rates but exhibits noticeable latency in handling sudden changes. MASNet [35] employs a bidirectional attention mechanism to suppress false change interference, but it still faces limitations in boundary preservation in low-texture areas.
In addition, substantial progress has been made in recent years in attention-based feature fusion strategies. The SegFormer model [36] achieves efficient inter-layer feature aggregation through a lightweight multi-scale attention architecture, balancing global modeling capability and computational efficiency while maintaining strong semantic consistency. FCIHMRT (Feature Cross-Layer Interaction Hybrid Method) [37] further introduces a cross-layer interaction mechanism that enhances semantic correlations among hierarchical features through a hybrid attention strategy, thereby improving feature fusion in complex scenes. These studies provide valuable insights for the design of the proposed Multi-Scale Smoothing Attention (MSA) module, enabling it to better integrate multi-scale differential features while preserving boundary smoothness.
A comprehensive analysis of the existing studies reveals that current methods still have significant shortcomings in global and local feature collaborative modeling, distinguishing between true changes and interfering factors, and representing internal structures of change targets. In particular, in complex scenarios characterized by extreme sample imbalance, boundary blurring, and sparse changes, a unified robust detection framework has yet to be established.
3. Materials and Methods
To address the challenges of class imbalance and boundary blurring in island building change detection, this paper proposes a novel Siamese change detection network, MSDT-Net. Compared with traditional change detection networks, this model innovatively integrates two core modules: the Multi-Scale Smoothing Attention (MSA) module, which enhances the feature representation of island building boundaries through a collaborative design of multi-scale convolutions and channel and spatial attention mechanisms, and the Differential Transformer Encoding Module (DTEM), which explicitly models the spatiotemporal difference features of dual-temporal imagery with a Transformer architecture, significantly improving the model's ability to detect subtle changes in island buildings. Furthermore, the FDUB-Loss function designed in this study optimizes the model's sensitivity to sparse change regions in island scenes by integrating Focal, Dice, and IoU losses. The following sections first provide an overview of the MSDT-Net architecture in Section 3.1, followed by detailed explanations of the MSA and DTEM modules in Section 3.2 and Section 3.3, respectively, and the FDUB-Loss in Section 3.4.
3.1. MSDT-Net
The proposed MSDT-Net is a change detection model based on a Siamese ConvNeXt backbone. It models feature differences explicitly, enhances boundary awareness in complex backgrounds through the Multi-Scale Smoothing Attention module, and employs a progressive upsampling decoder to recover high-resolution masks for accurate pixel-level change detection.
The MSDT-Net model takes dual-temporal remote sensing images I1 and I2 of size H0 × W0 as input and aims to generate a binary change detection map M ∈ {0, 1}^(H0 × W0) of the same size, where 1 represents the changed area (e.g., newly constructed or demolished buildings) and 0 represents the unchanged area. As shown in Figure 1, the model architecture consists of four key components: the feature extraction network, the difference modeling module, the multi-scale attention module, and a lightweight decoder. Based on the Siamese neural network architecture, a weight-sharing ConvNeXt-Tiny is used as the backbone to extract deep features from the dual-temporal remote sensing images, effectively representing the spatial semantic information of island buildings through a 768-dimensional feature map. To address the challenge of difference modeling in change detection, a Transformer-based difference encoding module is designed: the dual-temporal features and their absolute differences are concatenated and linearly projected into a 512-dimensional space, and a two-layer Transformer encoder captures long-range spatial dependencies, generating a difference feature map enriched with contextual information. To enhance boundary recognition, the Multi-Scale Attention Module (MSA) extracts multi-scale local features through parallel 3 × 3, 5 × 5, and 7 × 7 convolutional kernels and dynamically integrates information from different receptive fields via a channel attention mechanism, significantly improving the feature response strength at boundaries. Finally, a lightweight transposed-convolution decoder progressively upsamples the features to one-quarter of the original resolution, producing the binary change detection map.
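To make the data flow concrete, the stage-by-stage tensor shapes can be sketched as follows. This is a hypothetical trace: the stride-32, 768-channel backbone output follows the standard ConvNeXt-Tiny design, and the exact decoder layout is an assumption rather than the paper's implementation.

```python
def msdt_shapes(h0=256, w0=256):
    """Hypothetical shape trace through MSDT-Net for an h0 x w0 input pair."""
    # ConvNeXt-Tiny backbone: standard stride-32 final stage with 768 channels
    # (assumed to correspond to the 768-dimensional feature map in the text).
    h, w = h0 // 32, w0 // 32
    return {
        "backbone_per_image": (768, h, w),
        # DTEM: [F1, F2, |F1 - F2|] concatenated, then projected to 512 dims
        # and flattened into L = h * w tokens for the Transformer encoder.
        "dtem_tokens": (h * w, 512),
        # MSA refines the 512-channel difference features at the same resolution.
        "msa_out": (512, h, w),
        # decoder upsamples to one-quarter of the input resolution (single-channel map).
        "decoder_out": (1, h0 // 4, w0 // 4),
    }

shapes = msdt_shapes(256, 256)
```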
3.2. Multi-Scale Smoothing Attention (MSA)
Boundary feature representation plays a key role in remote sensing image analysis, such as building extraction [38], land cover classification [39], and change detection [40]. Multi-scale boundary feature fusion has been shown in previous studies to significantly improve boundary localization accuracy [12,21]. Inspired by the multi-scale residual structure of MSRN [41], the proposed Multi-Scale Smoothing Attention (MSA) module integrates multi-scale convolutions and channel and spatial attention mechanisms to adaptively enhance the representation of building boundary features. Specifically, parallel convolution kernels of 3 × 3, 5 × 5, and 7 × 7 cooperate to extract local detail features, while the attention mechanism dynamically weights important feature channels, jointly enhancing the model's ability to capture the complex boundaries of island buildings.
In addition, to further enhance channel-wise attention modeling, the Squeeze-and-Excitation (SE) mechanism [42] is incorporated into the MSA module. The SE block adaptively learns the importance weights of different channels through the “Squeeze” and “Excitation” operations, thereby strengthening the response to critical boundary features and suppressing redundant information. This mechanism effectively improves the model's sensitivity to boundary regions and its feature representation capacity, providing more accurate boundary feature support for subsequent change detection.
The architecture of the MSA module, illustrated in Figure 2, systematically enhances feature representation by integrating multi-scale convolution with an attention fusion mechanism. Parallel convolutional kernels of sizes 3 × 3, 5 × 5, and 7 × 7 operate under different receptive fields to jointly capture both local details and global contextual information, thereby alleviating the limitation of single-scale convolutions in boundary characterization. Meanwhile, a channel attention module based on the SE structure adaptively reweights semantic channels, enabling the model to emphasize boundary-related features, while spatial attention guides the network to focus on regions with significant boundary variations, effectively mitigating boundary blurring caused by background interference. This design jointly optimizes feature representation from the perspectives of scale, channel, and space, addressing the insufficient extraction of building boundaries while improving the recognition accuracy of minority-class change regions.
Let X denote the feature map and k the size of the two-dimensional convolution kernel; Xk denotes the feature map extracted using a convolution kernel of size k × k. Multi-scale features are extracted by employing convolutional kernels of different sizes: larger kernels capture broad contextual information suitable for large-scale buildings, while smaller kernels preserve fine-grained details, which is particularly beneficial for delineating the boundaries of small-scale buildings.
Let z denote the global descriptor vector and s the channel attention weight vector. The channel dimensionality is first reduced through a fully connected layer W1, which significantly decreases the number of model parameters while maintaining computational efficiency. A ReLU activation function is then applied to introduce nonlinear transformation capability, enhancing the representational power of the model. Finally, a fully connected layer W2 restores the original channel dimensionality, and a Sigmoid activation function generates normalized attention weights ranging from 0 to 1.
The channel attention-weighted output feature map, denoted Xcatt, is obtained by applying the channel attention weight vector s to the input feature map Xcat. Element-wise multiplication between the channel attention weights and the multi-scale features scales each feature map by its corresponding weight in s, strengthening the responses of important feature channels while suppressing those of less informative ones.
A 1 × 1 convolution is applied to project the fused multi-scale features into the target output dimension, which not only accomplishes feature integration but also effectively controls the number of parameters and computational complexity. Finally, the output feature map Y is obtained.
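The squeeze, excitation, and reweighting path described above can be sketched in numpy as follows. This is a minimal sketch: the weight matrices W1 and W2 are random placeholders standing in for learned parameters, and the channel count and reduction ratio r = 4 are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_channel_attention(x_cat, w1, w2):
    """SE-style reweighting of the concatenated multi-scale features.
    x_cat: (C, H, W) concatenation of the 3x3 / 5x5 / 7x7 branch outputs.
    w1: (C // r, C) dimension-reducing FC; w2: (C, C // r) dimension-restoring FC."""
    z = x_cat.mean(axis=(1, 2))                 # "Squeeze": global average pooling -> (C,)
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))   # "Excitation": FC -> ReLU -> FC -> Sigmoid
    return x_cat * s[:, None, None]             # scale each channel by its attention weight

rng = np.random.default_rng(0)
C, r = 12, 4                                    # 12 channels, reduction ratio 4 (assumed)
x = rng.standard_normal((C, 8, 8))
w1 = 0.1 * rng.standard_normal((C // r, C))
w2 = 0.1 * rng.standard_normal((C, C // r))
y = se_channel_attention(x, w1, w2)             # same shape as x, channels rescaled
```

The final 1 × 1 projection would follow this step; it is omitted here to keep the sketch focused on the attention mechanism itself.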
3.3. Differential Transformer Encoding Module (DTEM)
To capture fine-grained change features between pre- and post-temporal remote sensing images more effectively, this paper designs the Transformer-based Differential Transformer Encoding Module (DTEM). Built on the Siamese feature extraction structure, this module explicitly constructs the difference representation between images and introduces a global modeling mechanism to enhance the model's ability to discriminate small-scale targets and boundary changes in complex scenes.
The core design of the DTEM lies in the integration of explicit feature differencing and global modeling to achieve precise characterization of bi-temporal image discrepancies. This design effectively highlights change regions while suppressing noise responses in non-change areas.
Specifically, as shown in Figure 3, the module first computes the pixel-wise absolute difference between the two input feature maps to explicitly characterize the change regions across the temporal pair. During the differential modeling stage, a pixel-wise absolute difference operation, defined as D = |F1 − F2|, explicitly describes the spatial and semantic feature differences between the two temporal images. Compared with simple concatenation or weighted summation, this operation provides a more intuitive representation of change regions and effectively reduces interference from unchanged areas. The original bi-temporal features and the difference map are then unfolded into sequences and concatenated into a unified representation, a fusion strategy that preserves the original semantic information while strengthening sensitivity and discriminative ability toward change regions.
During multi-scale feature fusion, simple channel-wise concatenation may overlook semantic discrepancies across different scales, potentially introducing conflicts or redundancy. To address this issue, the DTEM applies stacked Transformer encoders with self-attention to the fused features: global attention modeling adaptively selects complementary information during fusion and effectively suppresses semantic conflicts and feature redundancy across scales, so that the multi-scale features retain strong discriminability for small objects and sparse change regions.
To adapt the fused features to the Transformer input structure, they are projected into fixed-dimensional embeddings before being processed by stacked Transformer encoders. Through the self-attention mechanism, DTEM captures long-range spatial dependencies and cross-temporal global change relations, significantly enhancing the separability of small-scale objects and sparse changes. This dual modeling strategy, based on differential and global representations, enables DTEM to more effectively distinguish building changes from environmental noise in complex island scenes, thereby achieving more robust change detection performance. Finally, the encoded features are reconstructed into spatial structures to support subsequent change region segmentation tasks.
The absolute difference D = |F1 − F2| between the bi-temporal feature maps is computed, where F1 and F2 are extracted from the two time phases. The difference D directly reflects pixel-level change information and serves as the fundamental input for subsequent change detection. This differential operation is the key step of the entire module: it explicitly maps the changes between bi-temporal features into a difference tensor that provides the essential input for subsequent global modeling by the Transformer.
The spatial feature maps are flattened into sequences of length L (= H × W), transforming the two-dimensional spatial structure into a one-dimensional sequence so that the Transformer can process image data while keeping the batch dimension B and channel dimension C unchanged. This serialization preserves the spatial continuity of positional information within the Transformer input, facilitating the capture of long-range dependencies. The three serialized features are then concatenated along the channel dimension; this concatenation preserves both the original feature information (F1seq, F2seq) and the explicit difference information Dseq, providing the Transformer with enriched contextual representations. The tri-feature concatenation strategy enables the model to simultaneously incorporate original semantics, temporal differences, and contextual awareness, thus enhancing the completeness and robustness of differential modeling.
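The difference, serialization, and concatenation steps above can be sketched in numpy as follows. This illustrates the tensor manipulation only; the linear projection to the embedding dimension and the Transformer encoder stages are omitted.

```python
import numpy as np

def dtem_prepare(f1, f2):
    """Differencing and serialization steps of DTEM (Transformer encoder omitted).
    f1, f2: bi-temporal feature maps of shape (B, C, H, W)."""
    b, c, h, w = f1.shape
    d = np.abs(f1 - f2)  # D = |F1 - F2|: explicit pixel-wise absolute difference

    def to_seq(f):
        # flatten H x W into a sequence of length L = H * W -> (B, L, C)
        return f.reshape(b, c, h * w).transpose(0, 2, 1)

    # concatenate [F1_seq, F2_seq, D_seq] along the channel dimension -> (B, L, 3C)
    return np.concatenate([to_seq(f1), to_seq(f2), to_seq(d)], axis=-1)

rng = np.random.default_rng(0)
f1 = rng.standard_normal((2, 16, 4, 4))
f2 = rng.standard_normal((2, 16, 4, 4))
tokens = dtem_prepare(f1, f2)  # shape (2, 16, 48): L = 16 tokens, 3C = 48 channels
```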
3.4. Focal–Dice–IoU Boundary Joint Loss Function
To improve the model's detection ability for small-scale changes in island buildings, this paper designs the Focal–Dice–IoU boundary joint loss function, which effectively mitigates the class imbalance problem and improves the precision of detecting subtle changes. The function is formed by the weighted fusion of Focal Loss (Lfocal), Dice Loss (Ldice), and IoU Loss (Liou):

L_FDUB = λ1 · Lfocal + λ2 · Ldice + λ3 · Liou,

where the weights are set as λ1 = 0.5, λ2 = 1.0, λ3 = 1.0. Lfocal reduces the impact of easily classified samples using a balancing factor α = 0.25 and a modulation factor γ = 2, focusing optimization on hard-to-classify samples. Ldice enhances the model's sensitivity to small targets and boundary features through its region-overlap metric. Liou enforces an intersection-over-union (IoU) constraint that improves the completeness of the predicted region. In the implementation, let N be the total number of pixels in the image, qn ∈ {0,1} the true label, and pn ∈ [0,1] the predicted probability; a smoothing term ε = 1 × 10−6 is introduced to ensure numerical stability. This multi-objective optimization strategy improves the model's performance in detecting fine-grained changes in island buildings while maintaining computational robustness, and shows significant advantages in addressing the key issues of class imbalance and boundary blurring.
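Using these definitions, the joint loss can be sketched in numpy as follows. The per-term normalization (e.g., averaging the focal term over all N pixels) is an assumption, since the text specifies only the weights and hyperparameters.

```python
import numpy as np

def fdub_loss(p, q, lam=(0.5, 1.0, 1.0), alpha=0.25, gamma=2.0, eps=1e-6):
    """FDUB-Loss sketch: weighted sum of Focal, Dice, and IoU terms.
    p: predicted probabilities p_n in [0, 1]; q: binary ground-truth labels q_n in {0, 1}."""
    p, q = p.ravel().astype(float), q.ravel().astype(float)
    # Focal term: down-weight easy samples via (1 - pt)^gamma, balance classes via alpha.
    pt = np.where(q == 1, p, 1 - p)
    at = np.where(q == 1, alpha, 1 - alpha)
    l_focal = np.mean(-at * (1 - pt) ** gamma * np.log(pt + eps))
    # Dice term: region-overlap measure, sensitive to small targets and boundaries.
    inter = np.sum(p * q)
    l_dice = 1 - (2 * inter + eps) / (np.sum(p) + np.sum(q) + eps)
    # IoU term: intersection-over-union constraint on region completeness.
    union = np.sum(p) + np.sum(q) - inter
    l_iou = 1 - (inter + eps) / (union + eps)
    return lam[0] * l_focal + lam[1] * l_dice + lam[2] * l_iou
```

A perfect prediction drives all three terms to zero, while predictions that miss the sparse changed pixels are penalized by all three terms simultaneously, which is the intended behavior under class imbalance.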
5. Discussion
To validate the generalization ability of MSDT-Net across domains, this study directly transfers the model trained on typical island building change data to a land-based urban building change detection task for testing, as depicted in Figure 12. Island buildings typically feature dispersed layouts, large scale variations, and strong background interference, making them visually more challenging. A model trained on such scenes can therefore learn robust features sensitive to building changes more effectively, and is expected to demonstrate good transfer performance and detection accuracy when applied to land-based scenarios with more stable structures and uniform target shapes.
From the visualization results, MSDT-Net demonstrates good cross-domain adaptation ability in land-based urban building change detection. In terms of accuracy, the model can accurately identify building change areas in island building images, with the white detection results highly consistent with the actual annotations, validating its precise detection ability in high-interference backgrounds. In urban land images, despite the presence of various building types and dense distributions, the model is still able to consistently capture the changed targets. This shows that even when the target shape changes, the model maintains high detection performance, indicating that its accuracy does not degrade significantly when transferred to the land-based scenario. In terms of stability, whether facing small-scale targets in island environments or large-scale building changes in land-based scenarios, the model’s output results do not show significant performance fluctuations, reflecting its good robustness. This performance stability is crucial for cross-regional change detection in practical applications, especially for diverse remote sensing data processing scenarios. From the efficiency perspective, the model achieves good performance with no need for retraining on land-based urban buildings, saving a significant amount of data collection, model training time, and computational resources.
Although there are differences between island and land buildings in terms of style and environmental background, they share certain geometric structure and boundary characteristics. The model extracts cross-domain representative discriminative features through multi-scale difference modeling and edge-aware mechanisms, enabling effective knowledge transfer from high-complexity scenarios to structurally stable regions.
6. Conclusions
Island building change detection is a key technology for environmental monitoring, disaster early warning, and urban planning. It plays a crucial role in the dynamic monitoring and sustainable development of island resources. However, due to challenges such as class imbalance and boundary blurring, existing methods face significant difficulties in recognizing subtle changes and complex boundaries of island buildings.
This study proposes the MSDT-Net model, which achieves high-precision detection based on a Siamese ConvNeXt architecture. Its innovations are threefold: the Multi-Scale Smoothing Attention (MSA) module addresses recognition accuracy issues caused by boundary blurring; the Differential Transformer Encoding Module (DTEM) enhances detection performance for target and boundary changes; and the Focal–Dice–IoU boundary joint loss function (FDUB-Loss) effectively alleviates the class imbalance problem and significantly improves the model's sensitivity to sparse change regions.
Experiments show that MSDT-Net significantly improves key metrics on the self-built island dataset, performing excellently in scenarios with minimal change pixels. Multi-scenario tests demonstrate its strong generalization and robustness. Tests across land and island scenes verify its effectiveness in diverse remote sensing data processing scenarios.
In summary, the MSDT-Net model, through innovative architecture design and loss function optimization, successfully addresses the challenges of class imbalance and boundary blurring in island building change detection, providing a reliable technical path for the task. Future work will focus on optimizing the model structure to enhance its adaptability and robustness in complex remote sensing imagery.