1. Introduction
Remote sensing-based change detection technology, leveraging multi-temporal observation data, can effectively monitor dynamic surface evolution processes and provide reliable support for scientific decision-making in ecological monitoring and resource management. Surface changes such as land use transitions, urban expansion, and forest degradation may profoundly impact ecosystems, climate regulation, and human activities [1]. Therefore, the accurate detection of these changes is crucial for environmental monitoring, resource management, and disaster warning. Change detection has been widely applied in practical problems such as land use monitoring [2], urban expansion [3], post-disaster assessment [4], and environmental change analysis [5]. It helps decision-makers obtain key change information in a timely manner and take corresponding measures, playing a key role in environmental protection, resource management, and disaster response. For instance, following natural disasters such as earthquakes or landslides, remote sensing-based change detection can rapidly identify affected areas, providing critical decision-making support for emergency response and resource allocation. In land use monitoring, it enables the tracking of farmland abandonment, illegal land occupation, and cultivated land changes, thereby facilitating land resource management and policy implementation. In recent years, remote sensing change detection methods have progressively evolved from traditional manual interpretation to deep learning-based automated approaches [6]. These advanced methods demonstrate powerful feature extraction and representation capabilities, significantly improving detection efficiency and accuracy while reducing human error. This technological shift has made large-scale, intelligent change analysis feasible.
Change detection technology has shifted from manual interpretation to intelligent methods driven by machine learning and deep learning. Traditional approaches, such as visual interpretation [7], threshold segmentation [8], PCA [9], and CVA [6,10], relied heavily on spectral features and statistical models. Although partially automated, these methods were highly sensitive to radiometric inconsistencies, illumination variations, and registration errors [11,12], and lacked robustness in complex environments. To address these issues, machine learning algorithms were introduced. For instance, SVM [13] enhances class separability through kernel mapping in high-dimensional space, while RF [14] improves generalization by combining multiple decision trees. Celik [15] proposed an unsupervised method based on PCA and K-means for urban change monitoring, and Chen et al. [16] designed an AdaBoost-based multi-classifier system integrating SVM, decision trees, and neural networks to improve land cover classification. However, these methods still depend on handcrafted features such as texture and spectral signatures, limiting adaptability in complex scenarios and resulting in poor generalization across datasets [17].
The rise of deep learning has enabled end-to-end learning in change detection, overcoming the limitations of handcrafted features. Daudt et al. [18] proposed fully convolutional change detection networks, including the early-fusion FC-EF and Siamese variants that use shared weights for temporal feature alignment. Subsequent works introduced attention mechanisms to enhance focus on change regions, such as STANet [19], which fuses multi-temporal features via spatial–temporal attention. However, CNN-based models are limited in capturing global context due to their local receptive fields. To mitigate this, SNUNet [20] and DSAMNet [21] introduced dense connections and spatial attention to improve localization and reduce complexity. Advanced CNN variants such as LGPNet [22] and USSFCNet [23] integrated multi-scale feature extraction and spectral–spatial attention for improved accuracy in high-resolution imagery.
Inspired by breakthroughs in natural language processing (NLP), sequence modeling architectures such as the Transformer [24] and Mamba [25] have been introduced into remote sensing change detection. Transformer-based methods demonstrate strong capability in capturing global dependencies via self-attention. For instance, ChangeFormer [26] employs multi-scale Transformer encoders to improve cross-scene adaptability, BIT [27] models semantic token context for efficient detection, and ICIFNet [28] enhances feature fusion through cascaded cross-attention. TransUNetCD [29] combines CNN and Transformer backbones with a differential enhancement module for better feature representation. To reduce the high computational cost of Transformers while preserving global modeling ability, Mamba leverages state-space models (SSMs) to achieve linear-time sequence modeling. In remote sensing, ChangeMamba [30] integrates Mamba into spatiotemporal feature extraction, demonstrating superior performance over CNN and Transformer counterparts. CD-Lamba [31] further enhances spatial consistency with a CT-LASS module, and RSMamba [32] introduces a multi-path activation mechanism to improve scale adaptability, offering useful insights for change detection.
Despite the remarkable progress achieved by deep learning models in change detection tasks, existing algorithms still exhibit notable limitations, which can be summarized as follows:
(1) Inadequate modeling of motion information across multi-temporal images hampers the accurate capture of temporal variations. This issue is particularly pronounced in scenarios involving small-scale changes or complex backgrounds, often resulting in false positives and missed detections.
(2) Most current approaches employ a single attention mechanism for feature enhancement, lacking comprehensive multi-domain modeling. This shortcoming limits the extraction of discriminative change features in complex scenes.
It is noteworthy that optical flow field modeling can effectively compensate for the temporal motion features often overlooked by traditional methods in change detection tasks. To more accurately capture dynamic changes between images, researchers have progressively explored more efficient optical flow estimation approaches. Early optical flow methods such as the Horn–Schunck method relied on a brightness-constancy assumption to compute motion, but offered limited accuracy and robustness. With the development of deep learning, FlowNet, proposed by Fischer et al. [33], became a groundbreaking deep learning model for directly estimating optical flow from image sequences. LiteFlowNet2 [34] builds on LiteFlowNet [35] and strengthens multi-scale feature capture, providing better optical flow estimation by introducing a feature pyramid network and a bidirectional flow estimation strategy. On the other hand, spatial domain methods have limitations in capturing cross-temporal variation patterns, and frequency domain analysis provides an important supplement. AFFormer [36] and FcaNet [37] have confirmed that frequency domain features can effectively enhance image representation capabilities. Ma et al. [38] proposed DDLNet, which employs a tailored frequency domain enhancement module to extract frequency components from bi-temporal images via the discrete cosine transform (DCT) to emphasize significant changes, together with a spatial recovery module (SRM) that fuses spatiotemporal features and reconstructs the spatial details of change representations.
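As a concrete illustration of the dense-flow cue discussed above, the minimal sketch below estimates a flow field between two bi-temporal patches. It uses torchvision's off-the-shelf RAFT model purely for demonstration; OFNet's own optical flow branch follows the FlowNet/LiteFlowNet line described above, and the random tensors stand in for real image patches.

```python
# Illustrative only: computes a dense optical flow field between two
# bi-temporal 256x256 patches using torchvision's pretrained RAFT model.
import torch
from torchvision.models.optical_flow import raft_small, Raft_Small_Weights

weights = Raft_Small_Weights.DEFAULT
model = raft_small(weights=weights).eval()

t1 = torch.rand(1, 3, 256, 256)  # stand-in for the image at time T1
t2 = torch.rand(1, 3, 256, 256)  # stand-in for the image at time T2
t1, t2 = weights.transforms()(t1, t2)  # normalize to the range RAFT expects

with torch.no_grad():
    flows = model(t1, t2)  # list of iteratively refined flow estimates
flow = flows[-1]           # final flow field, shape (1, 2, 256, 256)
print(flow.shape)          # per-pixel (dx, dy) motion between T1 and T2
```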
Based on the above insights, this study proposes a novel change detection network framework. Building upon the Siamese architecture, an optical flow branch is integrated to leverage motion information for enhancing the identification of change regions, thereby improving the network’s responsiveness to dynamic features. Our investigation reveals that spatial attention mechanisms are effective in focusing on prominent local changes within the image, whereas frequency-domain attention excels at capturing fine-grained variations and periodic patterns. To exploit the complementary strengths of these two domains, we design a dual-domain attention mechanism that combines the spatial attention’s ability to highlight salient local changes with the frequency attention’s capacity to enhance subtle and periodic features, achieving more comprehensive feature modeling. By further incorporating multi-level feature fusion, the introduced approach exhibits enhanced precision and resilience in managing background noise and identifying minor changes, thereby significantly enhancing change detection performance. The core contributions of this study are outlined as follows:
Building upon the Siamese network architecture, we introduce an optical flow branch module to explicitly model pixel-level motion across dual-phase images. This module guides the network in identifying genuine changes caused by the movement of real-world objects, thereby enhancing its sensitivity to dynamic change regions and improving robustness in complex scenes.
We further design a bi-domain attention mechanism that integrates spatial and frequency attention modules to model change features from the perspectives of local structures and frequency distributions, respectively. This design effectively enhances the model's sensitivity to subtle and boundary-level changes (a minimal illustrative sketch follows this list).
The proposed method, OFNet, demonstrates superior performance across multiple publicly available remote sensing datasets, consistently outperforming existing state-of-the-art change detection approaches. Moreover, it maintains a low parameter count and computational cost, indicating strong potential for real-world deployment. Visualization analyses further validate the model’s ability to perceive complex change regions, and ablation studies are conducted to evaluate the individual contributions of key modules to overall performance.
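To make the bi-domain idea concrete, the sketch below pairs a CBAM-style spatial gate with an FcaNet-flavoured DCT channel gate. The module names, pooling size, and the choice of n low-frequency components are illustrative assumptions, not OFNet's actual BDA implementation, which is specified in the Materials and Methods section.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: channel-pooled maps -> conv -> sigmoid gate."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)   # average over channels
        mx = x.amax(dim=1, keepdim=True)    # max over channels
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

def dct2_basis(size, freqs):
    """2D DCT-II basis functions for the listed (u, v) frequency pairs."""
    pos = torch.arange(size).float()
    basis = []
    for u, v in freqs:
        bu = torch.cos((2 * pos + 1) * u * math.pi / (2 * size))
        bv = torch.cos((2 * pos + 1) * v * math.pi / (2 * size))
        basis.append(torch.outer(bu, bv))
    return torch.stack(basis)  # (n, size, size)

class FrequencyAttention(nn.Module):
    """FcaNet-flavoured channel attention built from n DCT components
    (n is assumed to be a perfect square, e.g. 16 -> a 4x4 low-frequency grid)."""
    def __init__(self, channels, pool_size=16, n=16):
        super().__init__()
        side = int(math.isqrt(n))
        freqs = [(u, v) for u in range(side) for v in range(side)]
        self.register_buffer("basis", dct2_basis(pool_size, freqs))
        self.fc = nn.Sequential(
            nn.Linear(channels * len(freqs), channels), nn.Sigmoid())

    def forward(self, x):
        b, c = x.shape[:2]
        xr = F.adaptive_avg_pool2d(x, self.basis.shape[-1])   # fixed spatial size
        spec = torch.einsum("bchw,nhw->bcn", xr, self.basis)  # DCT responses
        return x * self.fc(spec.reshape(b, -1)).view(b, c, 1, 1)

# Bi-domain gating applied sequentially to a fused bi-temporal feature map.
x = torch.rand(2, 64, 64, 64)
out = FrequencyAttention(64)(SpatialAttention()(x))
print(out.shape)  # torch.Size([2, 64, 64, 64])
```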
The rest of this article is structured as follows:
Section 2, Materials and Methods, provides a detailed explanation of the components of the proposed method and model.
Section 3, Results, introduces the datasets and experimental setup, compares the performance of the proposed method with existing methods, and presents ablation studies and model validation.
Section 5, Conclusions, summarizes the research work of this article.
3. Results
3.1. Dataset and Evaluation Metrics
3.1.1. Datasets
The proposed model is evaluated on two publicly available remote sensing change detection datasets, LEVIR-CD [19] and WHU-CD [40]:
WHU-CD is a benchmark dataset focused on change analysis in remote sensing imagery, primarily used for building change detection. It contains a pair of high-resolution aerial images with a resolution of 32,507 × 15,354 pixels, covering diverse environments and architectural types with abundant change information. To facilitate model training and evaluation, the original images were cropped into 256 × 256 pixel patches and randomly partitioned into a training subset (6096 images), a validation subset (762 images), and a testing subset (762 images). This dataset provides significant practical utility for building change identification and classification, serving as a key benchmark in remote sensing change detection research.
The LEVIR-CD dataset was also developed for high-resolution remote sensing change detection tasks and includes 637 pairs of bi-temporal images with rich structural detail. These images capture various types of changes across both urban and natural environments and are particularly suited for detecting changes in typical targets such as buildings and roads. To ensure generalizability across different scenes, the images were divided into non-overlapping 256 × 256 pixel patches and randomly distributed into a training subset (7120 images), a validation subset (1024 images), and a testing subset (2048 images). LEVIR-CD provides high-quality and challenging samples, and its diverse scene coverage makes it a key resource in remote sensing change detection research.
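Both datasets are prepared by cropping large scenes into 256 × 256 patches. A minimal sketch of such non-overlapping tiling is shown below; the function name and the discard-remainder policy at the image edges are illustrative assumptions, since the exact cropping scripts are not given here.

```python
import numpy as np

def tile(image, size=256):
    """Split an (H, W, C) array into non-overlapping size x size patches,
    discarding any remainder at the right/bottom edges."""
    h, w = image.shape[:2]
    return [image[y:y + size, x:x + size]
            for y in range(0, h - size + 1, size)
            for x in range(0, w - size + 1, size)]

patches = tile(np.zeros((1024, 1024, 3), dtype=np.uint8))
print(len(patches))  # 16 patches of 256 x 256
```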
3.1.2. Evaluation Metrics
To thoroughly assess the effectiveness of the proposed method, this research employed precision (P), recall (R), F1 score, and Intersection over Union (IoU) as primary metrics. These indicators can respectively measure the accuracy of the model in predicting positive samples, the recognition ability of the target area, the comprehensive performance, and the degree of spatial matching, ensuring a comprehensive evaluation of the change detection task.
Among them, precision (P) represents the proportion of samples that are truly positive among all predicted positive classes, which measures how much of the model's prediction results are accurate. Its formula is as follows:

$$P = \frac{TP}{TP + FP}$$
Recall (R) indicates the percentage of true positive samples correctly identified, reflecting the model's capability to detect the target region. Its formula is as follows:

$$R = \frac{TP}{TP + FN}$$
The F1 score is the harmonic mean of precision and recall, which can effectively reflect the overall classification ability of the model when precision and recall are balanced. If the F1 score is high, it indicates that the model has good recall ability while ensuring high precision. Its formula is as follows:

$$F1 = \frac{2 \times P \times R}{P + R}$$
In addition, to assess the spatial overlap between the model's predicted change region and the actual change area, Intersection over Union (IoU) was examined and computed. IoU quantifies the ratio of the overlapping region to the combined area of the prediction and ground truth. Values closer to 1 indicate a higher agreement between the prediction and true annotation, demonstrating stronger localization performance of the model. Its formula is as follows:

$$IoU = \frac{TP}{TP + FP + FN}$$
Here, TP (true positive) denotes the number of pixels that are correctly predicted as changed, i.e., the pixels that are actually changed and are correctly identified as changed by the model. FP (false positive) refers to pixels that are incorrectly predicted as changed (but are actually unchanged). FN (false negative) indicates the number of changed pixels that are mistakenly predicted as unchanged. These definitions provide a clear and accurate basis for evaluating the classification accuracy (precision), detection capability (recall), and spatial overlap (IoU) in the change detection model.
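The four metrics follow directly from these pixel-level counts. As a sanity check on the formulas, here is a minimal Python implementation (the function name and epsilon guard against empty masks are illustrative):

```python
import numpy as np

def change_metrics(pred, gt, eps=1e-10):
    """Precision, recall, F1, and IoU for binary change masks (0/1 arrays)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()   # changed and predicted changed
    fp = np.logical_and(pred, ~gt).sum()  # unchanged but predicted changed
    fn = np.logical_and(~pred, gt).sum()  # changed but predicted unchanged
    p = tp / (tp + fp + eps)
    r = tp / (tp + fn + eps)
    f1 = 2 * p * r / (p + r + eps)
    iou = tp / (tp + fp + fn + eps)
    return p, r, f1, iou

print(change_metrics(np.array([[1, 0], [1, 1]]), np.array([[1, 0], [0, 1]])))
```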
3.1.3. Implementation Details
Parameter settings: The experiments were conducted using the OpenCD [41] framework on a desktop equipped with a single NVIDIA TITAN V GPU (12 GB) (NVIDIA, Santa Clara, CA, USA) running Ubuntu 20.04 with CUDA 10.1. During training, the AdamW optimizer was adopted, configured with a learning rate of 0.001, momentum parameters (β₁, β₂), and a weight decay coefficient of 0.05. In the experiments, the batch size for the WHU-CD and LEVIR-CD datasets was set to 8, with a total of 40,000 iterations.
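A minimal sketch of the reported optimizer configuration follows. The momentum (beta) values were lost from the text above, so AdamW's defaults are assumed here, and the placeholder module stands in for OFNet:

```python
import torch
from torch import nn

model = nn.Conv2d(3, 1, 3)  # placeholder; substitute the OFNet definition
# lr and weight decay follow the reported settings; the paper's momentum
# parameters are not recoverable here, so AdamW defaults are assumed.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
```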
Data augmentation: In this study, we uniformly applied the same data augmentation operations to all datasets, including rotation, cropping, flipping, and photometric transformation, to enhance the model's generalizability and resilience. Rotation enhances the adaptability of the model to directional changes; cropping helps the model focus on local areas and learn richer spatial features; random flipping maintains stable performance under different viewpoints; and photometric transformation improves robustness to different lighting conditions by adjusting brightness, contrast, saturation, and hue. These strategies effectively reduce the model's dependence on specific data distributions and improve its detection ability in complex scenarios.
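For bi-temporal change detection, the geometric transforms must share identical random parameters across both images and the mask, while photometric jitter can vary per image. The helper below is an illustrative sketch of such paired augmentation; the function name, probabilities, and jitter strengths are assumptions, not settings taken from the paper:

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

photometric = T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1)

def joint_augment(t1, t2, mask):
    """Apply identical random flips/rotations to both images and the mask;
    photometric jitter is applied to each image independently."""
    if random.random() < 0.5:
        t1, t2, mask = TF.hflip(t1), TF.hflip(t2), TF.hflip(mask)
    if random.random() < 0.5:
        t1, t2, mask = TF.vflip(t1), TF.vflip(t2), TF.vflip(mask)
    k = random.randint(0, 3)  # rotation by a multiple of 90 degrees
    if k:
        t1, t2, mask = (TF.rotate(x, 90 * k) for x in (t1, t2, mask))
    return photometric(t1), photometric(t2), mask
```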
3.2. Comparison and Analysis
To comprehensively evaluate the effectiveness and efficiency of the proposed OFNet in bi-temporal change detection tasks, we selected ten classic and cutting-edge change detection models for comparative experiments, covering different architectural designs and feature fusion strategies.
1. Traditional fully convolutional baseline models: (1) FC-EF [18]: concatenates the bi-temporal image channels and feeds them into a single encoder; the structure is simple but feature interaction is limited. (2) FC-Siam-Di [18] and FC-Siam-Conc [18]: Siamese-network variants that generate change maps through differential features and concatenation-based fusion strategies, representing typical paradigms of early Siamese network design.
2. Lightweight and efficient designs: (1) SNUNet [20]: introduces dense skip connections to enhance multi-scale feature fusion and significantly improve the accuracy of change boundaries. (2) IFNet [42]: a deeply supervised image fusion network that extracts features with a dual-stream CNN, fuses multi-level image difference features through attention modules, and applies loss functions directly at intermediate layers to enhance boundary integrity and internal compactness.
3. Transformer-based models: (1) BIT [27]: compresses bi-temporal images into semantic tokens and models global spatiotemporal context through a Transformer encoder. (2) ChangeFormer [26]: a Siamese network with a pure Transformer architecture, using a hierarchical Transformer encoder to capture multi-scale spatiotemporal features and a lightweight MLP decoder to directly generate change maps.
4. Feature decoupling and spatiotemporal modeling: (1) ChangeStar (FarSeg) [43]: based on a single-temporal supervised framework, it generates pseudo bi-temporal labels from unpaired images, extends a semantic segmentation model such as FarSeg into a change detector through the ChangeMixin module, and introduces a temporal symmetry loss to alleviate overfitting. (2) STNet [44]: an explicit spatiotemporal feature fusion network incorporating a cross-temporal gating mechanism (TFF) to filter out irrelevant changes and a cross-scale attention mechanism (SFF) to fuse multi-level features and restore details. (3) DDLNet [38]: a dual-domain learning network that extracts frequency domain features through the discrete cosine transform (DCT) to enhance change regions and combines a spatial recovery module (SRM) to reconstruct spatial details, achieving frequency–spatial collaborative optimization.
3.2.1. Quantitative Results
In this study, the performance of OFNet was comprehensively validated in change detection tasks on multiple datasets. By comparing various advanced methods on the LEVIR-CD and WHU-CD datasets, OFNet demonstrated excellent change detection capabilities.
On the LEVIR-CD dataset, OFNet achieved an F1 score of 90.73 and an IoU of 83.03, significantly outperforming other methods, as shown in Table 1. For example, the FC-EF method achieves an F1 score of 83.4 and an IoU of 71.53, while OFNet shows substantial improvements in both of these important metrics. In addition, OFNet's advantage in recall gives it higher sensitivity in change detection, enabling it to effectively identify subtle changes. These improvements can be attributed to the design of OFNet. Specifically, the optical flow branch (OFB) enables the model to capture temporal motion, which enhances its ability to identify real changes in dynamic scenes. Compared to other high-performing models such as DDLNet (F1: 90.60, IoU: 82.49), which utilizes frequency-domain features for global modeling, STNet (F1: 90.52, IoU: 82.09), which introduces cross-scale attention to enhance feature interactions, and ChangeFormer (F1: 90.40, IoU: 82.48), which relies on Transformer-based global context modeling, OFNet (F1: 90.73, IoU: 83.03) integrates motion-sensitive cues with Bi-Domain Attention (BDA) to adaptively fuse spatial and frequency features. This synergy allows OFNet to maintain high precision and recall, especially on the LEVIR-CD dataset, where fine-grained and small-scale changes are common, resulting in more accurate and complete change localization.
On the WHU-CD dataset, OFNet consistently demonstrates strong performance, achieving an F1 score of 90.63, an IoU of 82.86, and a recall of 88.88, indicating its effectiveness in building change detection, as shown in Table 2. Among several strong-performing models on the WHU-CD dataset, ChangeStar (FarSeg) (F1: 90.23, IoU: 81.77) introduces semantic-guided strategies to enhance change region prediction, STNet (F1: 87.46, IoU: 77.72) leverages cross-scale attention to capture hierarchical feature interactions, and DDLNet (F1: 90.56, IoU: 82.75) employs frequency-domain modeling to enhance global perception. In comparison, OFNet (F1: 90.63, IoU: 82.86) combines temporal motion modeling through the optical flow branch (OFB) with the Bi-Domain Attention (BDA) mechanism to adaptively fuse spatial and frequency features. This synergy enables OFNet to better capture structural changes in complex urban scenes while maintaining high recall and precision, resulting in more complete and accurate change localization.
Overall, unlike models such as DDLNet that focus only on spatial and frequency domain features, OFNet adopts a more comprehensive strategy by explicitly modeling motion through an optical flow branch and enhancing feature representation via a dual-domain attention mechanism. This integration of motion cues with spatial-frequency information enables OFNet to better capture subtle changes, making it especially effective in complex change detection scenarios.
In addition to performance testing, this study also evaluated the computational complexity of OFNet and compared it with existing methods in terms of parameters and floating-point operations (FLOPs). As shown in Table 3, OFNet contains only 12.17 million parameters and 11.27 GFLOPs, which is significantly lower than most existing models. For example, ChangeFormer and IFNet have 41.02 M and 50.44 M parameters, respectively, along with much higher FLOPs of 202.87 G and 82.26 G. Although DDLNet and SNUNet have comparable parameter sizes (12.67 M and 12.03 M, respectively), their FLOPs (7.35 G and 54.88 G) either sacrifice model representation capability or introduce computational redundancy. STNet, while lightweight in FLOPs (9.61 G), still has a higher parameter count (14.6 M) and underperforms OFNet in detection accuracy. By contrast, OFNet attains an ideal trade-off between detection precision and computational cost, ranking second in both parameter size and FLOPs while maintaining state-of-the-art performance. These results demonstrate that OFNet maintains high detection accuracy while significantly reducing computational overhead, rendering it better suited for deployment in settings with constrained computational capacity, such as edge devices or onboard processing systems on remote sensing platforms. This balance between performance and efficiency highlights the practical value and adaptability of OFNet in resource-constrained applications.
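As a reproducibility note, parameter counts like those in Table 3 can be obtained directly in PyTorch; the snippet below is illustrative, with a placeholder module standing in for OFNet:

```python
import torch
from torch import nn

model = nn.Conv2d(3, 16, 3)  # placeholder module; substitute the real OFNet
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params / 1e6:.2f} M trainable parameters")
# FLOPs figures such as those in Table 3 are usually measured with a profiler
# (e.g., thop or fvcore); the paper does not state which tool was used.
```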
To verify whether the frequency domain components selected by the frequency domain attention mechanism are optimal, we conducted experiments with different numbers of frequency domain components n on the LEVIR-CD dataset. The experimental results are shown in Table 4. The results show that values of n that are either too small or too large degrade model performance. When n = 4 or n = 8, the precision is low, resulting in a relatively low F1 score and IoU, indicating that low-frequency information alone is not sufficient to provide a complete feature representation. When n = 32 or n = 64, the larger set of frequency components admits high-frequency information that may introduce too many irrelevant details, impairing the ability to distinguish change regions. Overall, n = 16 achieved good performance on all indicators, showing that this setting balances high- and low-frequency information well and confirming the strong feature representation capability of the frequency domain attention mechanism when the frequency components are chosen correctly.
We conducted a systematic set of comparative experiments on the weight of the auxiliary loss function to thoroughly investigate its impact on overall model performance under different settings. As shown in Table 5, the model achieved the best results across multiple key evaluation metrics when the auxiliary loss weight was set to 0.4, demonstrating superior generalization ability and stability. This finding clearly indicates that appropriately incorporating auxiliary supervision during training can effectively enhance feature representation and thereby improve the performance of the primary task. In contrast, when the weight is set too low, the auxiliary task provides insufficient guidance, yielding limited benefits, whereas an excessively high weight may cause the auxiliary signal to dominate the learning process and detract from the primary objective. Overall, a weight of 0.4 strikes a well-balanced synergy between the main and auxiliary tasks, leading to a significant improvement in training efficiency and model performance.
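A minimal sketch of how such a weighted auxiliary term can be combined with the main objective follows; binary cross-entropy is assumed here, since this section does not restate OFNet's exact loss functions:

```python
import torch
import torch.nn.functional as F

def total_loss(main_logits, aux_logits, target, aux_weight=0.4):
    """Primary change-map loss plus a weighted auxiliary (deep supervision)
    term; 0.4 is the best-performing weight reported in Table 5."""
    main = F.binary_cross_entropy_with_logits(main_logits, target)
    aux = F.binary_cross_entropy_with_logits(aux_logits, target)
    return main + aux_weight * aux

# Example with dummy logits and a binary target mask:
logits = torch.randn(2, 1, 256, 256)
target = torch.randint(0, 2, (2, 1, 256, 256)).float()
print(total_loss(logits, logits.clone(), target))
```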
3.2.2. Qualitative Results
In order to further validate the advantages of the proposed OFNet in change detection tasks, this study compared its visual detection performance with other methods on the LEVIR-CD and WHU-CD datasets. The proposed OFNet (i) was compared with representative change detection methods, including (d) FC-EF, (e) FC-Siam-Diff, (f) IFNet, (g) SNUNet, and (h) STNet, on the test sets of the LEVIR-CD and WHU-CD datasets. (a) and (b) denote the bi-temporal input images T1 and T2, while (c) shows the ground-truth labels. For better visualization, different pixel colors are used: white represents true positives, black true negatives, red false positives, and green false negatives. Through visual analysis, it is possible to intuitively observe the performance of different methods in terms of the integrity of change regions, extraction of edge details, and suppression of false positives. The first two sample groups in the figures originate from the LEVIR-CD dataset, while the remaining three groups come from the WHU-CD dataset.
Figure 6 shows the visualization experiments on the LEVIR-CD dataset, comparing the detection performance of different methods in scenarios with discrete small changes, large-area array changes, and regular linear changes. The results indicate that our model performs better under these different change patterns. FC-EF and FC-Siam-Diff produce more false positives, while IFNet and SNUNet still miss small targets; our model captures dispersed changes more accurately and improves detection completeness. For large-scale array changes, IFNet and STNet yield discontinuous edges or adhesion between adjacent targets, whereas our model maintains target integrity and accurately detects changes in building clusters. For regular linear changes, SNUNet and STNet exhibit significant edge errors and IFNet misses detections, while our model's boundary handling is more refined, effectively reducing both false positives and false negatives. Overall, our model performs better in detection completeness, boundary handling, and noise suppression.
The visualization experiments on the WHU-CD dataset, shown in Figure 7, demonstrate that our model performs better in three scenarios: complex building changes, small independent changes, and dense building group changes. For complex building changes, FC-EF and FC-Siam-Diff produce more false detections, and IFNet and SNUNet suffer from edge discontinuities, whereas our model preserves building outlines more accurately and reduces edge false detections. For small independent changes, most methods are prone to missed or false detections; SNUNet and STNet perform well but still fall short, while our model captures the changes more comprehensively and improves detection accuracy. For dense building clusters, FC-EF and FC-Siam-Diff show severe false positives, and although IFNet and STNet improve on this to some extent, they still produce incomplete contours; our model identifies the changed areas more accurately, reduces boundary noise, and yields more complete detections. Overall, our model improves detection completeness, reduces false positives and false negatives, and demonstrates stronger robustness in complex scenarios.
3.3. Ablation Studies
To verify the effectiveness of the key components of OFNet, this study conducted ablation experiments on the LEVIR-CD validation set to analyze the roles of the optical flow branch (OFB) and the Bi-Domain Attention (BDA) in model performance. Specifically, OFB (w/o OFB) and BDA (w/o BDA) were removed in turn for comparative experiments, with the complete model, Full (Ours), used as the benchmark. The experimental results are shown in Table 6.
From the experimental results, it can be seen that removing OFB decreases the F1 score to 90.31 and the IoU to 82.33, indicating that the optical flow branch plays a key role in capturing motion information and helps improve the perception of change regions. Meanwhile, although the precision only slightly decreased to 90.26, the recall reached its highest value at 90.36, suggesting that without OFB the model tends to predict more change regions, but the accompanying increase in false positives lowers the overall IoU. Removing BDA decreases the F1 score to 90.53 and the IoU to 82.70, indicating that the bi-domain attention mechanism enhances the fusion of spatial and frequency domain features. Although the precision of this variant reached 90.88, the recall decreased to 90.18, indicating that BDA mainly contributes to reducing false positives and helps the model extract the edge details of change regions more accurately.
To further demonstrate the contribution of the optical flow branch (OFB), we visualize the input change mask, optical flow map, and the bi-temporal remote sensing images in Figure 8. As seen in the visualization, while the optical flow maps do not always perfectly align with the binary change masks, they capture rich motion-related cues such as building construction, shadow shifts, and even lighting condition variations. These motion-sensitive cues are essential in guiding the model to distinguish actual changes from visually similar but static regions. The OFB uses this dense motion information to provide complementary features to the Siamese backbone, enabling the network to focus on regions with dynamic context or high uncertainty.
The ablation study in Figure 9 visually compares the full OFNet model with its variants in which either the BDA or OFB module is removed, using representative samples from the LEVIR-CD dataset. The full model (column (d)) achieves the most accurate results, showing the fewest false positives (red) and false negatives (green), thereby maintaining high precision and recall. When the BDA module is removed (column (e)), the number of false positives increases, especially along object boundaries and in fine-grained change areas, indicating that BDA plays a crucial role in enhancing edge localization and reducing noise. Although false negatives remain moderate in this variant, the increased false positives suggest reduced precision. Removing the OFB module (column (f)) leads to a significant increase in false negatives, particularly in regions with dynamic changes, highlighting the OFB module's importance in capturing motion information and detecting subtle temporal differences. The decline in recall underscores the OFB module's role in improving sensitivity to actual changes. Overall, the two modules offer complementary benefits: BDA improves precision through better edge and noise handling, whereas OFB enhances recall by modeling temporal motion. Their joint integration in OFNet is thus essential for achieving robust and balanced change detection performance.
Although both the optical flow branch (OFB) and the Siamese encoder extract bi-temporal differences, they capture complementary information. The Siamese structure mainly learns high-level semantic changes between T1 and T2, while the OFB focuses on modeling pixel-level motion displacement through optical flow. During joint training, the network implicitly distinguishes motion-induced local variations from semantic-level temporal changes by fusing the outputs of both branches. This enables the model to attend to true changes while reducing sensitivity to irrelevant motion. The ablation results in Table 6 further demonstrate that the inclusion of OFB improves IoU by 0.70 points compared to the version without it, confirming the independent contribution of the motion-guided branch.
These results enable a step-by-step interpretation of each component’s contribution. Compared to the full model, removing OFB leads to a 0.70-point drop in IoU, and removing BDA results in a 0.33-point decrease, indicating that OFB has a stronger effect on motion-related change localization, while BDA further refines the predictions by enhancing discriminative features. Although we do not present an explicit baseline-only model, the existing ablation setting implicitly reflects a progressive enhancement: starting from a base with one component removed, adding the next component yields a quantifiable improvement in performance. This validates that the proposed modules contribute incrementally and synergistically to the overall detection accuracy.
In the complete model, the F1 score reached 90.73 and the IoU was the highest at 83.03, both improved over the ablation variants. This indicates that the combination of the optical flow branch and bi-domain attention effectively improves the detection of change regions while balancing precision and recall. In particular, the precision of the complete model is 91.96, significantly higher than the other versions, indicating that the method reduces false positives and improves the reliability of the detected change regions.
To verify the impact of introducing the auxiliary loss (Aux Loss) on model performance, we compared the experimental results with and without the auxiliary loss on the LEVIR-CD and WHU-CD datasets, as shown in Table 7. Across all evaluation metrics (F1 score, precision, recall, and IoU), the model with the auxiliary loss is significantly better than the model without it, indicating that the auxiliary loss effectively improves the model's change detection ability.
Specifically, on the LEVIR-CD dataset, the introduction of the auxiliary loss improved the F1 score from 88.85 to 90.73 and the IoU from 79.93 to 83.03, indicating that the model detects change regions more accurately. On the WHU-CD dataset, the auxiliary loss also brought significant gains, with the F1 score increasing from 89.63 to 90.63 and the IoU from 81.2 to 82.86, further verifying the generalization of this strategy across datasets. Overall, the auxiliary loss constrains the intermediate representation of the fused features, guiding them in directions more conducive to change detection and thus boosting the model's capability to detect change areas. The experimental results show that the strategy improves all indicators on both datasets; the increase in IoU in particular reflects more accurate prediction of change regions and an effective reduction in false positives and false negatives. These results validate the effectiveness of the proposed auxiliary loss optimization strategy, indicating that it enhances overall detection performance and improves generalization across datasets.
5. Conclusions
This research introduces OFNet, an innovative remote sensing image change detection model that integrates optical flow modeling with a dual-domain attention mechanism. By incorporating an optical flow branch, OFNet effectively captures motion information between bi-temporal images, thereby enhancing the model's sensitivity to change regions. This branch enables the model to recognize subtle pixel-level displacements and better perceive spatiotemporal dynamics, which are crucial for accurate change detection. The dual-domain attention mechanism further strengthens OFNet by enabling it to attend to complementary information in both the spatial and frequency domains. Spatially, it facilitates the extraction of localized structural changes; in the frequency domain, it captures spectral variations that may be overlooked by traditional convolutional operations. This combined strategy significantly improves the model's robustness in scenarios with complex backgrounds or weak signals. Experimental results on benchmark remote sensing change detection datasets demonstrate that OFNet achieves an F1 score of 90.73 and an IoU of 83.03 on LEVIR-CD, and an F1 score of 90.63 and an IoU of 82.86 on WHU-CD, consistently surpassing state-of-the-art approaches in precision and robustness. Compared with previous works, which either rely solely on spatial cues or disregard temporal motion, our model shows clear advantages by leveraging both temporal motion cues and frequency-based distinctions.
Despite these promising results, several limitations should be acknowledged. First, the computation of optical flow introduces additional computational overhead, which may hinder real-time deployment in resource-constrained environments. Second, while the dual-domain attention improves performance, it increases the model's complexity, which could affect scalability to larger or more diverse datasets. Moreover, the model's performance under severe illumination variations or seasonal changes, common challenges in remote sensing, was not explicitly examined and may require further robustness testing.
Future research could explore lightweight variants of OFNet to reduce inference costs, potentially by integrating more efficient optical flow estimation or attention mechanisms. Extending OFNet to multi-temporal or multi-modal (e.g., SAR-optical fusion) settings would also be valuable for practical applications, and self-supervised or semi-supervised learning techniques could lessen the dependence on extensive labeled datasets, which are typically costly to acquire in remote sensing contexts. In summary, OFNet offers a compelling and effective solution to the change detection problem, while leaving important opportunities to improve its efficiency, scalability, and adaptability in future work.