1. Introduction
Bi-temporal change detection (CD) for remote sensing images generates change maps by comparing and analyzing two images acquired at different times, and it is a vital task in remote sensing. This task is commonly applied in disaster assessment, urban design, agricultural planning, geographic surveys, and river monitoring [1,2,3]. Although deep learning has made considerable progress in analyzing large-scale, high-resolution remote sensing images [4,5,6], bi-temporal CD still faces a major challenge: interfering factors can produce pseudo-changes, in which unchanged areas are misclassified as changed. Common causes of pseudo-change include temporary objects, shadow variations, and cloud interference. As shown in Figure 1a, these interferences cause shadow and projection differences in unchanged areas to be erroneously detected as changed areas [7,8,9]. Among these, pseudo-change caused by thin cloud interference is a particularly noteworthy problem [10]. Moreover, existing remote sensing CD datasets generally do not focus on the effects of thin cloud interference, so CD models trained on these datasets generally fail to account for it.
Conventional change detection methods produce difference maps by examining pixel-level variations in the spectral information of RS images, and then obtain change maps through thresholding or clustering. Widely adopted techniques include principal component analysis (PCA) [11], change vector analysis (CVA) [12], and support vector machine methods [13]. However, these methods mainly focus on the spectral changes of individual pixels and ignore the spatial context between pixels. They are also sensitive to the subjective selection of empirical thresholds and lack robustness.
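As a concrete illustration of the pixel-level pipeline described above, the following minimal NumPy sketch computes a CVA-style difference magnitude and thresholds it. The function name `cva_change_map`, the toy arrays, and the mean-plus-one-standard-deviation threshold are illustrative assumptions rather than the procedure of any cited work.

```python
import numpy as np

def cva_change_map(img_t1, img_t2, threshold=None):
    """Change vector analysis on two (H, W, B) images with B spectral bands."""
    diff = img_t2.astype(np.float64) - img_t1.astype(np.float64)
    # Magnitude of the per-pixel spectral change vector.
    magnitude = np.sqrt((diff ** 2).sum(axis=-1))
    if threshold is None:
        # Assumed empirical rule: mean + one standard deviation.
        threshold = magnitude.mean() + magnitude.std()
    return magnitude, magnitude > threshold

# Toy example: a 2 x 2 block changes in all three bands.
t1 = np.zeros((4, 4, 3))
t2 = t1.copy()
t2[1:3, 1:3, :] = 1.0
magnitude, change = cva_change_map(t1, t2)
```

Each pixel is handled independently here, and the result depends on a hand-picked threshold rule, which is the sensitivity noted above.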
Deep learning-based algorithms outperform traditional methods that rely on handcrafted features, primarily because they can automatically learn discriminative features from large-scale, high-quality datasets. Among these, algorithms based on deep convolutional neural networks (CNNs) show particularly strong results. CNNs are widely used in CD for extracting discriminative features, including classical CNN architectures and their extensions such as ResNet [14], U-Net [15], and HRNet [16]. A CNN captures the spatial distribution and structure of ground objects through convolution operations and builds increasingly abstract representations of the image layer by layer. However, a CNN is limited by its fixed receptive field, which leads to an insufficient response to changes in ground objects of different scales and makes some complex changes difficult to distinguish [17]. With an attention mechanism, the model can focus more effectively on the informative areas of the image [18]. It adaptively learns the importance of different parts of the input image, thereby alleviating the fixed-receptive-field limitation inherent in pure CNN methods. Therefore, in recent years, numerous studies have incorporated attention mechanisms into change detection to improve accuracy. For example, Liu et al. [19] designed feature exchange and channel attention modules to effectively model contextual information in bi-temporal images. Feng et al. [20] introduced a joint attention module that combines self-attention and cross-attention to guide the global feature distribution of the input and promote information coupling. Nevertheless, CNN-based models still often struggle to model long-range dependencies and contextual relationships between pixels because of their inherent locality.
The Transformer [21], initially developed for natural language processing, has drawn significant attention within the computer vision community. Compared with pure CNN-based models, transformers are powerful at modeling long-distance dependencies [22]: they can capture global context and relate features that are far apart. This development has greatly advanced CD algorithms. For example, Chen et al. [23] proposed a dual spatiotemporal image transformer that uses an encoder to capture spatiotemporal context and a decoder to refine the original features, thereby promoting the effective use and modeling of spatial context information and significantly improving CD performance. Zhang et al. [24] designed a U-shaped transformer framework with the Swin Transformer as the basic unit. The model processes bi-temporal images through an encoder to extract multi-scale features and restores detailed change information through a decoder, thereby obtaining more accurate CD results. A joint CNN–Transformer model can integrate the advantages of the two architectures and effectively handle multi-scale changes of ground objects, capturing local details while accounting for global correlations [18,21,22]. Although CNN–Transformer models have shown excellent performance in CD tasks, they still have shortcomings. The increasing complexity of ground objects in bi-temporal remote sensing images, coupled with style changes such as shadow changes, temporary objects, and cloud occlusions, may lead to inconsistent feature representations of the same object. Previous CNN–Transformer models have difficulty alleviating the pseudo-changes caused by these differences in feature distribution.
Change detection networks based on graph convolutional networks (GCNs) have also made increasing progress recently [25]. These approaches can model relationships among pixels in remote sensing images, enabling the extraction of more complete feature information. Graph-based feature representations also allow semantic similarities among unlabeled pixels to be explored [26]. However, because graph computation grows quadratically with the number of nodes, these methods are often restricted to small-scale scenes or require simplified graph constructions. Because graph operations are computationally expensive, many studies employ graph convolution merely as an auxiliary feature extractor to improve information interaction between bi-temporal images [27]. In particular, within fully supervised change detection, achieving an appropriate trade-off between model complexity and detection performance remains a key challenge when applying graph convolution.
Despite the remarkable progress of CNN-, Transformer-, and GCN-based CD methods, most existing works share an implicit assumption: the bi-temporal images can be mapped into a naturally consistent feature space through feature extraction and interaction alone. In other words, these methods focus primarily on cross-temporal feature fusion while overlooking a more fundamental issue, namely the potential distribution discrepancy between features extracted from images acquired at different times. In real-world remote sensing scenarios, this assumption is frequently violated. Factors such as cloud occlusion, shadow variation, illumination changes, and temporary objects introduce significant perturbations to the visual appearance of the scene without altering the actual land-cover semantics. As a consequence, the feature distributions of bi-temporal images may diverge noticeably even before feature interaction is performed, as shown in Figure 1b. This phenomenon leads to what is commonly observed as “pseudo-change”, where models mistakenly interpret domain-induced feature shifts as semantic changes. Unlike most existing change detection methods that directly model cross-temporal feature interaction, this work starts from a different observation: pseudo-changes caused by clouds, shadows, and transient objects are essentially the result of domain-induced feature distribution discrepancies between bi-temporal images. If features from the two temporal phases are not aligned into a consistent representation space beforehand, subsequent interaction mechanisms (e.g., attention, GCN, or transformer-based modeling) may unintentionally amplify these discrepancies and misinterpret them as real changes.
We argue that pseudo-change is not merely a feature representation problem, but essentially a bi-temporal covariate domain shift problem. Formally, let $X_{T_1}$ and $X_{T_2}$ denote the inputs from the two temporal phases and $Y$ the change label. Although the conditional relationship between the scene and its change label remains invariant, i.e., $P(Y \mid X_{T_1}) = P(Y \mid X_{T_2})$, the marginal distributions of the inputs differ significantly, i.e., $P(X_{T_1}) \neq P(X_{T_2})$. Existing CD frameworks, regardless of whether they employ CNNs, Transformers, or GCNs, rarely address this discrepancy explicitly. Instead, they attempt to enhance cross-temporal interaction under the implicit assumption that the feature spaces are already aligned. This observation motivates a fundamentally different perspective for change detection under interference conditions: before performing cross-temporal feature interaction, it is necessary to first align the feature distributions of bi-temporal images to suppress domain-induced variations.
Domain adaptation is widely used in remote sensing image processing to align the feature spaces between different domains—such as images from different sensors, locations, or resolutions—so that models trained on one domain can perform well on another. Techniques include joint distribution adaptive-alignment methods, Maximum Mean Discrepancy (MMD) methods, and adversarial learning methods. These domain adaptation strategies significantly improve the transferability and generalization of remote sensing models across diverse and challenging scenarios. Adversarial learning methods, such as Generative Adversarial Networks (GANs), make it difficult to distinguish between the source and target domain features, thereby enabling more seamless feature mapping. In this work, we first introduce domain adversarial learning into the change detection pipeline to explicitly minimize the distribution discrepancy between features extracted from two temporal phases. Unlike conventional feature alignment or attention-based fusion strategies, our approach treats the bi-temporal images as samples drawn from different domains and performs adversarial domain alignment guided by clustering-based domain discovery.
Furthermore, we observe that once domain-induced variations are suppressed, semantic reasoning across temporal images becomes more reliable. To this end, we project the aligned feature maps into a compact graph space via soft semantic clustering, where graph convolution and inter-graph interaction (GFIM) are employed to model non-local semantic dependencies efficiently. This design not only reduces computational complexity compared to pixel-level graph reasoning, but also enhances robustness to imaging disturbances by operating at a higher semantic abstraction level. The Boundary Feature Module (BFM) further refines spatial details by integrating shallow texture cues with semantic reasoning. To validate the above hypothesis that pseudo-change originates from domain shift, we construct a new real-world dataset named “Cloud Interference Change Detection (CICD)”. Unlike existing benchmarks that mainly focus on seasonal or structural variations, CICD explicitly contains cloud occlusion, shadows, illumination differences, and temporary objects, providing a dedicated testbed for evaluating domain-robust change detection methods.
Therefore, instead of strengthening cross-temporal interaction first, we propose to mitigate domain discrepancy prior to interaction. This design philosophy forms the core of the proposed DAAINet.
In summary, this paper makes the following contributions:
1. We propose DAAINet, a framework that prioritizes bi-temporal feature distribution alignment through domain adversarial learning guided by clustering-based pseudo-domain discovery. This alignment ensures that subsequent cross-temporal interaction operates on domain-consistent representations, effectively suppressing interference-induced false responses.
2. We design a sequential pipeline of alignment, interaction, and refinement, where semantic graph interaction (GFIM) and boundary correction (BFM) are performed only after domain alignment, forming a logically consistent mechanism to reduce pseudo-changes.
3. We construct the CICD dataset with diverse interference conditions to specifically evaluate domain-induced pseudo-changes. Extensive experiments demonstrate that the proposed framework achieves superior robustness in interference-rich scenarios while maintaining competitive performance on standard benchmarks, validating the effectiveness of the proposed domain alignment perspective.
3. Methodology
In real-world remote sensing, domain discrepancies often exist between training and testing data [57]. Specifically, under interference conditions, models are typically trained on clean bi-temporal image pairs but applied to pairs affected by interference. Such discrepancy primarily alters the input distribution $P(X)$, while the underlying mapping from input clean images $X_c$ and interference images $X_i$ to change labels $Y$ remains consistent. Formally, this corresponds to a covariate shift scenario, defined as:
$$P(X_c) \neq P(X_i), \qquad P(Y \mid X_c) = P(Y \mid X_i).$$
In this case, various interference sources cause a significant shift in the feature distribution of the visual inputs without altering the intrinsic semantic relationship between the observed scene and its change label. Consequently, the model experiences performance degradation not because the change-detection concept itself has changed, but because of a mismatch in the input domain statistics.
3.1. Dataset Construction
In the domain of remote sensing change detection, most publicly available datasets have primarily focused on factors such as seasonal variations, illumination differences, shadow effects, vegetation dynamics, and architectural diversity. In contrast, the impact of cloud interference—despite its frequent occurrence in optical remote sensing imagery—has received comparatively limited attention. Clouds and their associated shadows often obscure ground objects, distort spectral signatures, and disrupt temporal consistency, thereby introducing significant challenges for accurate change detection.
To address this critical gap, we construct a real-world cloud interference change detection dataset, specifically designed to evaluate algorithmic robustness under complex atmospheric conditions. The dataset is built from multi-temporal high-resolution optical imagery collected over diverse geographic regions. It consists of 294 image pairs with dimensions of 1024 × 1024 pixels, encompassing diverse land-cover types including forests, grasslands, rivers, bridges, and residential buildings, ensuring representative coverage of urban, agricultural, and natural environments. Each image pair is accompanied by pixel-level annotations that delineate true surface changes while explicitly accounting for areas affected by cloud cover and shadows. This enables a more realistic assessment of change detection methods in scenarios where atmospheric perturbations may be mistaken for land-cover transitions. Some examples of images in the dataset are shown in Figure 2. The images are collected from multiple remote sensing competition datasets, covering diverse geographic regions, sensors, and acquisition intervals, which helps reduce geographic bias and improves generalization.
Annotation Rules: To ensure that the CICD dataset specifically evaluates the robustness of change detection algorithms against pseudo-change caused by interference, we follow strict annotation principles during label generation.
Only genuine land-cover changes are annotated as positive change regions. Areas affected by cloud occlusion, cloud shadows, illumination variation, seasonal chromatic differences, and temporary objects (e.g., vehicles, construction materials, movable facilities) are explicitly annotated as non-change regions, even though they exhibit significant visual differences between the two temporal images. This annotation strategy forces change detection models to distinguish between true semantic changes and domain-induced visual perturbations, which is the core objective of CICD.
Data Split Protocol: The dataset consists of 294 bi-temporal image pairs, each 1024 × 1024 pixels in size. To ensure fair evaluation and reproducibility, we adopt a fixed split protocol:
1. Training set: 204 pairs (approximately 70%)
2. Validation set: 30 pairs (approximately 10%)
3. Test set: 60 pairs (approximately 20%)
All splits are scene-independent, meaning that image pairs from the same geographic region are not shared across different subsets, preventing spatial leakage.
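A scene-independent split of this kind can be sketched as follows. This is a minimal illustration rather than the script used to build CICD: the function name `scene_independent_split`, the greedy assignment rule, and the toy metadata (49 hypothetical scenes of 6 pairs each) are assumptions chosen so that the 204/30/60 targets can be met exactly.

```python
import random
from collections import defaultdict

def scene_independent_split(pair_scenes, targets, seed=0):
    """pair_scenes: {pair_id: scene_id}; targets: {split_name: target pair count}."""
    by_scene = defaultdict(list)
    for pair_id, scene in pair_scenes.items():
        by_scene[scene].append(pair_id)
    scenes = sorted(by_scene)
    random.Random(seed).shuffle(scenes)  # fixed seed keeps the split reproducible
    splits = {name: [] for name in targets}
    for scene in scenes:
        # Assign the whole scene to the split furthest below its target,
        # so pairs from one scene never land in two different subsets.
        name = max(targets, key=lambda n: targets[n] - len(splits[n]))
        splits[name].extend(by_scene[scene])
    return splits

# Toy metadata (hypothetical): 294 pairs spread over 49 scenes of 6 pairs each.
pair_scenes = {i: i // 6 for i in range(294)}
splits = scene_independent_split(pair_scenes, {"train": 204, "val": 30, "test": 60})
```

With real metadata the scene sizes would be uneven, so the subset sizes would only approximate the targets.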
Interference Type Distribution: Each image pair is manually inspected and categorized according to the dominant interference factors. The interference sources in CICD mainly include thin cloud occlusion, cloud shadow, illumination variation, temporary objects, and mixed interference conditions. The statistical distribution of these interference types is reported in Table 1, providing a clearer understanding of the dataset complexity.
Unlike standard benchmarks that focus on clean temporal pairs, CICD is intentionally constructed to include diverse interference conditions collected from multiple remote sensing datasets. This diversity introduces significant domain discrepancies between temporal images, making CICD particularly suitable for evaluating methods designed to handle domain-induced pseudo changes.
3.2. Architecture Overview
As discussed in Section 2, pseudo-change in bi-temporal remote sensing images is mainly caused by domain-induced visual perturbations rather than genuine land-cover variations. Therefore, the core challenge of interference-robust change detection is not merely how to enhance cross-temporal feature interaction, but how to suppress the distribution discrepancy between features extracted from the two temporal images before interaction.
Based on this observation, the proposed DAAINet follows a three-stage design philosophy:
Domain Alignment: reduce the feature distribution gap between bi-temporal images caused by clouds, shadows, and illumination variations;
Semantic Reasoning: perform reliable cross-temporal interaction after domain-induced variations are suppressed;
Boundary Refinement: recover fine spatial details that may be lost during high-level reasoning.
The overall pipeline of DAAINet follows a principle that differs from conventional change detection frameworks. Rather than immediately establishing interaction between bi-temporal features, DAAINet first enforces domain-level alignment to ensure that features extracted from different temporal phases lie in a consistent representation space. Only after this alignment step are cross-temporal interaction and boundary refinement performed. This sequential design is essential for suppressing domain-induced pseudo changes under complex interference conditions.
The overall pipeline of DAAINet is illustrated in Figure 3. Given a pair of bi-temporal images, a shared backbone first extracts a feature map from each image. These features are then aligned via a domain adversarial learning branch guided by clustering-based domain discovery. After alignment, the features are projected into a compact graph space, where the Graph Feature Interaction Module (GFIM) performs semantic reasoning across temporal domains. Finally, the Boundary Feature Module (BFM) integrates shallow texture cues with high-level semantics to produce the final change map.
A key question in applying domain adversarial learning to change detection is how to define domain labels. Unlike conventional domain adaptation problems where domain identity is explicitly known (e.g., different sensors or datasets), the domain discrepancy in CICD is implicitly caused by interference factors such as clouds and shadows. Therefore, we employ unsupervised clustering on the feature space to automatically discover latent domain groups that correspond to different interference patterns. The clustering results serve as pseudo domain labels to guide the adversarial learning branch. This design has two important implications: (1) It enables domain alignment without requiring manual domain annotation; (2) It ensures that the adversarial learning specifically targets interference-induced distribution shifts rather than semantic content.
After domain alignment, the feature distributions of the two temporal images become more consistent. At this stage, cross-temporal interaction becomes more reliable because the model is less likely to confuse domain-induced variations with real changes. Instead of performing interaction at the pixel level, which is sensitive to noise and computationally expensive, we project the aligned features into a graph space via soft semantic clustering. The GFIM operates on these graph nodes to capture non-local semantic dependencies across temporal domains efficiently. High-level graph reasoning may weaken spatial boundary details. Therefore, the BFM is introduced to fuse shallow spatial features with graph-enhanced semantic representations, ensuring accurate boundary localization in the final change map.
Feature Extractor: Maps the input data into a specific feature space, enabling the change detector to distinguish between classes in the source domain data while preventing the domain classifier from identifying the data’s origin.
Change Detector: Performs change detection on the input data, aiming to accurately identify changed regions.
Domain Classifier: Classifies the feature-space data, attempting to determine which domain (source or target) the data originates from.
The feature extractor and change detector together form a feedforward neural network. A domain classifier is then appended after the feature extractor, connected via a Gradient Reversal Layer (GRL). During training, for bi-temporal remote sensing images, the network continuously minimizes the change detector’s loss based on the change detection labels of the input data. Simultaneously, guided by the domain labels assigned by the clustering module, the network minimizes the domain classifier’s loss.
The training objective is two-fold: (1) Ensure the change detector is sufficiently robust to accurately identify changed regions. (2) Encourage the feature representations of bi-temporal remote sensing images to be as similar as possible in the feature space, thereby confusing the domain classifier.
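Formally, let $F$, $C$, and $D$ denote the feature extractor, the change detector, and the domain classifier, with $D$ attached to $F$ through the GRL. During training, for bi-temporal remote sensing images $(x_1, x_2)$, the network optimizes a change detection loss and a domain classification loss:
$$\mathcal{L}_{ce} = \ell\big(C(F(x_1), F(x_2)),\, y\big), \qquad \mathcal{L}_{da} = \ell\big(D(F(x_t)),\, d\big), \quad t \in \{1, 2\},$$
where $y$ denotes change detection labels and $d$ represents domain labels. The training objective encourages domain-invariant feature representations:
$$\min_{F,\,C}\;\max_{D}\;\; \mathcal{L}_{ce} - \lambda\, \mathcal{L}_{da}.$$
The model implements an adversarial mechanism through a gradient reversal in the backward pass, with the GRL $R$ acting as the identity in the forward pass:
$$R(z) = z, \qquad \frac{\partial R}{\partial z} = -\alpha I,$$
where $\alpha$ controls the adversarial strength. This induces a feature space where:
$$P\big(F(x_1)\big) \approx P\big(F(x_2)\big),$$
while maintaining discriminative power for change detection.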
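The gradient reversal step can be made concrete with a small sketch. The class below is a framework-free illustration, not the authors' implementation: the name `GradientReversal`, the coefficient name `alpha`, and its value of 0.5 are assumptions. In a deep learning framework the same behavior would normally be written as a custom autograd operation.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales the gradient by -alpha in the backward pass."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # adversarial strength (assumed name and value)

    def forward(self, features):
        # Features reach the domain classifier unchanged.
        return features

    def backward(self, grad_from_domain_classifier):
        # The feature extractor receives the negated, scaled gradient, so it is
        # pushed to increase the domain classification loss.
        return -self.alpha * grad_from_domain_classifier

grl = GradientReversal(alpha=0.5)
features = np.array([[1.0, -2.0], [0.5, 3.0]])
out = grl.forward(features)
grad_to_extractor = grl.backward(np.ones_like(features))
```

Because only the sign and scale of the gradient change, the domain classifier itself is still trained to minimize its own loss, while the feature extractor is driven in the opposite direction.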
3.3. Graph Feature Interaction Module
The structure of the GFIM module is shown in Figure 4. The bidirectional arrows in GFIM indicate mutual feature interaction between the graph representations of the two temporal images. This interaction is implemented through feature concatenation rather than through any explicit loss function.
Although the three outputs of the MLP resemble the Q, K, V structure in self-attention, GFIM does not operate at the pixel level. Instead, the spatial dimension is first compressed into a small number of semantic graph nodes via soft clustering. Therefore, the query matrix no longer represents pixel-wise queries but semantic node queries. This compression significantly reduces computational complexity and makes the interaction more robust to interference such as clouds and shadows, which primarily affect local pixels but not high-level semantic clusters.
GFIM does not align features at the pixel level. Instead, after domain alignment, the two feature maps are projected into a shared semantic graph space via clustering. Nodes in this space represent high-level semantic regions rather than local pixels. Since clouds, shadows, and temporary objects mainly affect local appearance, their influence is significantly reduced in this graph space. As a result, GFIM aligns the two temporal representations at the semantic level, enabling reliable comparison even under severe interference.
The computational complexity of a conventional pixel-level GCN is $O((HW)^2)$. In contrast, GFIM first compresses the feature map into $K$ semantic nodes ($K \ll HW$) via clustering. Therefore, the complexity is reduced to $O(HWK)$. With the small $K$ used in our implementation and $HW = 65{,}536$ for the feature map, this leads to an over 1000× reduction in pairwise computation.
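The size of this reduction can be checked with simple arithmetic. The sketch below counts pixel-to-pixel pairs against pixel-to-node pairs, which is an assumed cost model. The node count `K = 32` is a hypothetical value chosen only for illustration, and the square 256 × 256 map is one shape consistent with HW = 65,536.

```python
H, W = 256, 256            # assumed square feature map, so that H * W = 65,536
K = 32                     # hypothetical number of semantic graph nodes

HW = H * W
pixel_pairs = HW * HW      # pairwise terms for a pixel-level graph
node_pairs = HW * K        # pixel-to-node terms after compression
reduction = pixel_pairs // node_pairs   # equals HW // K
```

Under this cost model the reduction factor is HW / K, so any K of at most 65 gives more than a 1000× reduction.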
3.4. Boundary Feature Module
The structure of the BFM module is shown in Figure 5. In the BFM, the shallow feature map is concatenated twice before fusion. This design is not redundant but intentionally acts as a form of multi-scale residual reinforcement.
Residual errors mainly appear around object boundaries due to fine-grained spatial ambiguity introduced by interference. Graph reasoning in GFIM operates on highly abstracted semantic nodes, which may weaken fine-grained spatial textures. By repeatedly injecting the original shallow feature map, BFM explicitly strengthens low-level spatial cues and stabilizes boundary localization during upsampling.
3.5. Domain Clustering
Since the original dataset lacks domain labels, it remains uncertain whether all data samples belong to the same domain or exhibit feature distribution shifts due to interference factors. To address this, we preliminarily assign temporary domain labels to all remote sensing images, enabling subsequent domain alignment through the gradient reversal layer in the domain classification network.
After extracting feature maps from all remote sensing images, we perform $k$-means clustering on the complete set of feature representations. We empirically tested several values of $k$ and observed that larger $k$ values tend to fragment semantic regions and weaken domain discrimination. As shown in Table 2, the best-performing setting indicates that the dominant distribution discrepancy mainly lies between interference-affected and normal regions. This parameter can be adjusted when more interference factors lead to increased domain diversity. Each image is then assigned a domain label based on its nearest cluster center. These annotated images subsequently proceed to the change detection and domain classification modules.
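The pseudo-domain labeling step can be sketched with a plain NumPy k-means. This is an illustration under stated assumptions: each image is represented by one pooled feature vector, `k = 2` is an illustrative choice, and the function name `kmeans_domain_labels` and the synthetic "clean" and "cloudy" features are hypothetical.

```python
import numpy as np

def kmeans_domain_labels(features, k=2, n_iter=20, seed=0):
    """Cluster (N, C) image-level features and return pseudo-domain labels."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()

    def assign(c):
        # Squared Euclidean distance from every feature to every center.
        dists = ((features[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Each image takes the label of its nearest cluster center.
    return assign(centers), centers

# Synthetic features: two well-separated groups standing in for two domains.
rng = np.random.default_rng(1)
clean = rng.normal(0.0, 0.1, size=(10, 8))
cloudy = rng.normal(5.0, 0.1, size=(10, 8))
features = np.vstack([clean, cloudy])
labels, centers = kmeans_domain_labels(features, k=2)
```

The resulting labels carry no semantic meaning on their own. They only serve as targets for the domain classifier.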
3.6. Graph Projection and Reprojection
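This section describes the procedure for projecting bi-temporal images into a graph representation. For clarity, we illustrate the feature projection for one temporal instance, whose feature map is flattened into $X = [x_1, \dots, x_{HW}]^{\top} \in \mathbb{R}^{HW \times C}$. Following the strategy in [58], two matrices $W \in \mathbb{R}^{K \times C}$ and $M \in \mathbb{R}^{K \times C}$ are learned, where $W$ denotes the learnable projection matrix in the MLP, $M$ represents the affinity matrix between semantic graph nodes, and $K$ denotes the predefined number of graph nodes. Each row vector $w_k$ in $W$ functions as the anchor of the $k$-th vertex. The soft assignment $q_i^k$ between feature vector $x_i$ and anchor $w_k$ is then computed as:
$$q_i^k = \frac{\exp\!\left(-\left\| (x_i - w_k) / \sigma_k \right\|_2^2 / 2\right)}{\sum_{j=1}^{K} \exp\!\left(-\left\| (x_i - w_j) / \sigma_j \right\|_2^2 / 2\right)}.$$
In this formulation, $\sigma_k$ denotes the $k$-th row of matrix $M$, whose values are restricted to the range $(0, 1)$ through a sigmoid activation, and the division is element-wise. The numerator measures the affinity between the feature vector and its corresponding anchor, whereas the denominator normalizes this assignment over all graph nodes.
Next, the vertex feature matrix $Z = [z_1, \dots, z_K]^{\top} \in \mathbb{R}^{K \times C}$ is formed by aggregating the associated pixel-level features. For each vertex $k$, its representation $z'_k$ is computed as the weighted mean of the residuals between feature vectors $x_i$ and the anchor $w_k$. The resulting vector $z'_k$ is then normalized to yield a unit vector $z_k$:
$$z'_k = \frac{\sum_{i} q_i^k\,(x_i - w_k)}{\sum_{i} q_i^k}, \qquad z_k = \frac{z'_k}{\left\| z'_k \right\|_2}.$$
Finally, the graph adjacency matrix $A$ is computed as:
$$A = Z Z^{\top} \in \mathbb{R}^{K \times K}.$$
To transform the refined graph features back into the original spatial domain, we make use of the assignment relationships established during the graph projection stage:
$$\tilde{X}_1 = Q_1 \tilde{Z}_1, \qquad \tilde{X}_2 = Q_2 \tilde{Z}_2,$$
where $Q_t \in \mathbb{R}^{HW \times K}$ collects the soft assignments $q_i^k$ of temporal instance $t$, $\tilde{Z}_t$ denotes the refined graph features, and $\tilde{X}_1$ and $\tilde{X}_2$ are the interacted feature maps.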
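The projection and reprojection steps can be sketched in NumPy as follows. This is a minimal illustration that assumes a Gaussian-style soft assignment with sigmoid-bounded scales, in the spirit of the strategy referenced above. The function names, the random toy inputs, and the choice to initialize the anchors from data points are all assumptions.

```python
import numpy as np

def project_to_graph(X, W, M_raw):
    """X: (N, C) pixel features; W: (K, C) anchors; M_raw: (K, C) pre-sigmoid scales."""
    sigma = 1.0 / (1.0 + np.exp(-M_raw))                  # scales in (0, 1)
    resid = X[:, None, :] - W[None, :, :]                 # (N, K, C) residuals
    logits = -0.5 * ((resid / sigma[None]) ** 2).sum(-1)  # (N, K) affinities
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    Q = np.exp(logits)
    Q /= Q.sum(axis=1, keepdims=True)                     # normalize over the K nodes
    # Weighted mean of residuals per node, then unit-normalize each node vector.
    Z = (Q[:, :, None] * resid).sum(axis=0) / Q.sum(axis=0)[:, None]
    Z /= np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12
    A = Z @ Z.T                                           # (K, K) adjacency
    return Q, Z, A

def reproject(Q, Z_refined):
    """Map (K, C) node features back to (N, C) pixel features via the assignments."""
    return Q @ Z_refined

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))      # toy 8 x 8 feature map with 8 channels, flattened
W = X[:4].copy()                  # 4 anchors, initialized from data (assumption)
M_raw = np.zeros((4, 8))
Q, Z, A = project_to_graph(X, W, M_raw)
X_back = reproject(Q, Z)
```

Graph reasoning then operates on the K × C matrix `Z` instead of the N × C pixel features, which is where the computational saving comes from.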
3.7. Bitemporal Graph Semantic Interaction
This module implements the semantic reasoning stage after domain alignment. The proposed GFIM module takes as input the two graph embeddings derived from the feature maps of the two temporal images. It facilitates inter-graph semantic interaction by guiding bidirectional message passing between the two graphs.
We use different multilayer perceptrons (MLPs) to transform each graph embedding into a query graph, a key graph, and a value graph. The transformed representations of the two temporal graphs are then unified, and a similarity matrix is computed for each temporal graph using the softmax function. Information is exchanged between the two graphs according to these similarity matrices. Following the inter-graph interaction, intra-graph reasoning is performed on the two exchanged graph representations with a graph convolutional layer that uses the ReLU activation function and learnable weight parameters, yielding refined graph representations.
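One plausible instantiation of this interaction is sketched below. The exact unification and exchange rules are not reproduced here, so the code assumes standard scaled dot-product cross-attention between the two graphs with a residual connection, single linear maps standing in for the MLPs, and a softmax-normalized adjacency for the graph convolution. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def cross_graph_interaction(Z1, Z2, P):
    """Z1, Z2: (K, C) node features of the two temporal graphs; P: dict of (C, C) maps."""
    Q1, K1, V1 = Z1 @ P["q1"], Z1 @ P["k1"], Z1 @ P["v1"]
    Q2, K2, V2 = Z2 @ P["q2"], Z2 @ P["k2"], Z2 @ P["v2"]
    scale = np.sqrt(Q1.shape[1])
    S12 = softmax(Q1 @ K2.T / scale)   # graph-1 nodes attending to graph-2 nodes
    S21 = softmax(Q2 @ K1.T / scale)   # graph-2 nodes attending to graph-1 nodes
    # Bidirectional message passing with a residual connection (assumption).
    return Z1 + S12 @ V2, Z2 + S21 @ V1, S12, S21

def intra_graph_reasoning(Z, weight):
    """One graph-convolution step with ReLU on a similarity-based adjacency."""
    A = softmax(Z @ Z.T)
    return np.maximum(A @ Z @ weight, 0.0)

rng = np.random.default_rng(0)
num_nodes, C = 4, 8
Z1 = rng.normal(size=(num_nodes, C))
Z2 = rng.normal(size=(num_nodes, C))
P = {n: rng.normal(scale=0.1, size=(C, C)) for n in ("q1", "k1", "v1", "q2", "k2", "v2")}
Z1_x, Z2_x, S12, S21 = cross_graph_interaction(Z1, Z2, P)
G1 = intra_graph_reasoning(Z1_x, rng.normal(scale=0.1, size=(C, C)))
G2 = intra_graph_reasoning(Z2_x, rng.normal(scale=0.1, size=(C, C)))
```

Because the attention matrices are only K × K, their cost is independent of the image resolution.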
3.8. Domain Classifier
This module corresponds to the domain alignment stage. The purpose of domain clustering is not to perform semantic grouping, but to provide reliable pseudo-domain labels for adversarial alignment. Since interference factors (clouds, shadows, transient objects) introduce distinct feature distribution patterns, clustering enables the network to explicitly model these distribution differences and guide the domain adversarial branch to reduce them. This step ensures that subsequent cross-temporal interaction operates on domain-consistent features rather than misaligned representations.
The features produced by the backbone extractor are forwarded simultaneously to the change detection module and the domain classification module. As illustrated in Figure 3, the domain classifier is composed of a Gradient Reversal Layer (GRL), two fully connected layers, a ReLU activation, and a log-softmax output layer. The domain labels of all feature maps in a batch are obtained through domain clustering. This module converts each input feature map into a probability vector representing the likelihood of belonging to each domain, and then calculates the loss against the domain label. During training, the domain classification loss is first computed and optimized via gradient descent. When backpropagating to the feature extractor, the domain classifier parameters remain fixed, and the negative gradient (inverse gradient) of the domain classification loss is used to update the feature extractor parameters. This adversarial update strategy intentionally maximizes domain classification errors, thereby forcing the feature extractor to learn domain-invariant feature representations.
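The forward pass of such a classifier head can be sketched as below. This is a minimal illustration: it assumes each feature map has already been reduced to one vector (for example by global average pooling, which is not stated above), it omits the GRL because that layer is the identity in the forward pass, and the layer sizes and names are hypothetical.

```python
import numpy as np

def log_softmax(x):
    shifted = x - x.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def domain_classifier(features, W1, b1, W2, b2):
    """features: (N, C) pooled vectors -> (N, num_domains) log-probabilities."""
    hidden = np.maximum(features @ W1 + b1, 0.0)   # first FC layer + ReLU
    return log_softmax(hidden @ W2 + b2)           # second FC layer + log-softmax

rng = np.random.default_rng(0)
N, C, hidden_dim, num_domains = 6, 16, 8, 2        # hypothetical sizes
features = rng.normal(size=(N, C))
W1, b1 = rng.normal(scale=0.1, size=(C, hidden_dim)), np.zeros(hidden_dim)
W2, b2 = rng.normal(scale=0.1, size=(hidden_dim, num_domains)), np.zeros(num_domains)
log_probs = domain_classifier(features, W1, b1, W2, b2)
```

Exponentiating each row of `log_probs` gives the per-domain probability vector that is compared with the clustering-based domain label.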
3.9. Prediction Head and Loss Function
In the final step of producing the binary change map, a unified prediction head is applied. First, three decoded aggregated maps at different scales are upsampled to match the input image size using bilinear interpolation. These maps are then processed through two convolutional layers followed by batch normalization to generate a differential prediction. After multiscale differential computation, aggregation, and reorganization, the model is trained by minimizing both the domain adaptation loss $\mathcal{L}_{da}$ and the cross-entropy loss $\mathcal{L}_{ce}$ between the predictions and the ground truth. For $N$ samples and $k$ domain categories, the domain adaptation loss is the cross-entropy between the predicted domain probabilities and the domain labels:
$$\mathcal{L}_{da} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{k} d_{i,c} \log p_{i,c}.$$
Here, we set
in all experiments, where
$d_{i,c}$ denotes the one-hot domain label of the $i$-th sample for the $c$-th domain category, and $p_{i,c}$ represents the predicted probability that the $i$-th sample belongs to domain $c$ produced by the domain classifier. The total training loss is given by:
$$\mathcal{L} = \mathcal{L}_{ce} + \lambda\, \mathcal{L}_{da},$$
where $\lambda$ balances the contribution between the domain adaptation loss $\mathcal{L}_{da}$ and the change detection loss $\mathcal{L}_{ce}$.
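The loss computation can be illustrated numerically. The sketch below assumes a mean cross-entropy over samples for the domain term and an additive weighting of the two losses. The function names, the weight value `lam = 0.1`, and the placeholder change detection loss of 0.5 are hypothetical.

```python
import numpy as np

def domain_adaptation_loss(log_probs, domain_labels):
    """Mean cross-entropy between (N, k) log-probabilities and integer labels (N,)."""
    n = len(domain_labels)
    # Picking the log-probability of the true domain is the one-hot weighted sum.
    return -log_probs[np.arange(n), domain_labels].mean()

def total_loss(loss_ce, loss_da, lam=0.1):
    """Weighted sum of the change detection loss and the domain adaptation loss."""
    return loss_ce + lam * loss_da

log_probs = np.log(np.array([[0.9, 0.1],
                             [0.2, 0.8]]))
domain_labels = np.array([0, 1])
l_da = domain_adaptation_loss(log_probs, domain_labels)
l_total = total_loss(0.5, l_da, lam=0.1)
```

When gradient reversal is used, this summed loss is minimized as usual, and the reversal layer alone is responsible for turning the domain term into an adversarial signal for the feature extractor.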
5. Discussion
The consistent top-tier performance across all benchmark and noisy datasets suggests that our approach offers more reliable change detection for practical remote sensing applications where data quality varies significantly. These numerical results provide clear evidence that the DA structure contributes meaningfully to CD tasks.
The visualization results for each model on different datasets are presented in Figure 6, Figure 7 and Figure 8. These figures showcase the accuracy of different models in detecting changing areas. DAAINet generates significantly fewer false alarms than most competing algorithms on both datasets, and exhibits excellent capability in suppressing pseudo-changes caused by shadows and chromatic aberrations. While most existing algorithms struggle to accurately detect changes in irregularly shaped buildings of varying sizes, our DAAINet maintains clear identification of building boundaries.
Despite the effectiveness of the proposed framework under interference-rich scenarios, several limitations remain.
First, the proposed domain alignment mechanism relies on clustering-based pseudo-domain discovery. Although this strategy provides useful guidance for adversarial learning, the clustering quality may vary across datasets and interference types. In cases where domain discrepancy is subtle, such as clean benchmarks with minimal interference, the benefit of explicit domain alignment becomes less significant. Second, the CICD dataset, although specifically constructed to evaluate interference-induced pseudo changes, is relatively limited in scale and collected from heterogeneous sources. The diversity of sensors, resolutions, and acquisition conditions introduces domain variety, but also makes standardized evaluation more challenging.
Future work will focus on expanding CICD into a larger, standardized benchmark with richer annotations and metadata. In addition, integrating the proposed domain alignment perspective with transformer-based architectures may further improve generalization ability under complex interference conditions.