1. Introduction
Remote sensing-based change detection technology, leveraging multi-temporal observation data, can effectively monitor dynamic surface evolution processes and provide reliable support for scientific decision-making in ecological monitoring and resource management. Surface changes such as land use transitions, urban expansion, and forest degradation may profoundly impact ecosystems, climate regulation, and human activities [1]. Therefore, the accurate detection of these changes is crucial for environmental monitoring, resource management, and disaster warning. Change detection has been widely applied in practical problems such as land use monitoring [2], urban expansion [3], post-disaster assessment [4], and environmental change analysis [5]. It helps decision-makers obtain key change information in a timely manner and take corresponding measures, playing a key role in environmental protection, resource management, and disaster response. For instance, following natural disasters such as earthquakes or landslides, remote sensing-based change detection can rapidly identify affected areas, providing critical decision-making support for emergency response and resource allocation. In land use monitoring, it enables the tracking of farmland abandonment, illegal land occupation, and cultivated land changes, thereby facilitating land resource management and policy implementation. In recent years, remote sensing change detection methods have progressively evolved from traditional manual interpretation to deep learning-based automated approaches [6]. These advanced methods demonstrate powerful feature extraction and representation capabilities, significantly improving detection efficiency and accuracy while reducing human error. This technological shift has made large-scale, intelligent change analysis feasible.
Change detection technology has shifted from manual interpretation to intelligent methods driven by machine learning and deep learning. Traditional approaches, such as visual interpretation [7], threshold segmentation [8], PCA [9], and CVA [6,10], relied heavily on spectral features and statistical models. Although partially automated, these methods were highly sensitive to radiometric inconsistencies, illumination variations, and registration errors [11,12], and lacked robustness in complex environments. To address these issues, machine learning algorithms were introduced. For instance, SVM [13] enhances class separability through kernel mapping in high-dimensional space, while RF [14] improves generalization by combining multiple decision trees. Celik [15] proposed an unsupervised method based on PCA and K-means for urban change monitoring, and Chen et al. [16] designed an AdaBoost-based multi-classifier system integrating SVM, decision trees, and neural networks to improve land cover classification. However, these methods still depend on handcrafted features such as texture and spectral signatures, limiting adaptability in complex scenarios and resulting in poor generalization across datasets [17].
The rise of deep learning has enabled end-to-end learning in change detection, overcoming the limitations of handcrafted features. Daudt et al. [18] proposed fully convolutional change detection networks, including the early-fusion FC-EF and Siamese variants that use shared weights for temporal feature alignment. Subsequent works introduced attention mechanisms to enhance focus on change regions, such as STANet [19], which fuses multi-temporal features via spatial–temporal attention. However, CNN-based models are limited in capturing global context due to their local receptive fields. To mitigate this, SNUNet [20] and DSAMNet [21] introduced dense connections and spatial attention to improve localization and reduce complexity. Advanced CNN variants such as LGPNet [22] and USSFCNet [23] integrated multi-scale feature extraction and spectral–spatial attention for improved accuracy in high-resolution imagery.
Inspired by breakthroughs in natural language processing (NLP), sequence modeling architectures such as the Transformer [24] and Mamba [25] have been introduced into remote sensing change detection. Transformer-based methods demonstrate strong capability in capturing global dependencies via self-attention. For instance, ChangeFormer [26] employs multi-scale Transformer encoders to improve cross-scene adaptability, BIT [27] models semantic token context for efficient detection, and ICIFNet [28] enhances feature fusion through cascaded cross-attention. TransUNetCD [29] combines CNN and Transformer backbones with a differential enhancement module for better feature representation. To reduce the high computational cost of Transformers while preserving global modeling ability, Mamba leverages state-space models (SSMs) to achieve linear-time sequence modeling. In remote sensing, ChangeMamba [30] integrates Mamba into spatiotemporal feature extraction, demonstrating superior performance over CNN and Transformer counterparts. CD-Lamba [31] further enhances spatial consistency with a CT-LASS module, and RSMamba [32] introduces a multi-path activation mechanism to improve scale adaptability, offering useful insights for change detection.
Despite the remarkable progress achieved by deep learning models in change detection tasks, existing algorithms still exhibit notable limitations, which can be summarized as follows:
(1) Inadequate modeling of motion information across multi-temporal images hampers the accurate capture of temporal variations. This issue is particularly pronounced in scenarios involving small-scale changes or complex backgrounds, often resulting in false positives and missed detections.
(2) Most current approaches employ a single attention mechanism for feature enhancement, lacking comprehensive multi-domain modeling. This shortcoming limits the extraction of discriminative change features in complex scenes.
It is noteworthy that optical flow field modeling can effectively compensate for the temporal motion features often overlooked by traditional methods in change detection tasks. To more accurately capture dynamic changes between images, researchers have progressively explored more efficient optical flow estimation approaches. Early optical flow methods such as the Horn–Schunck method relied on a brightness-constancy assumption to compute motion, but offered limited accuracy and robustness. With the development of deep learning, FlowNet, proposed by Fischer et al. [33], became a groundbreaking deep learning model for directly estimating optical flow from image sequences. LiteFlowNet2 [34] builds on LiteFlowNet [35] and strengthens multi-scale feature capture, providing better optical flow estimation by introducing a feature pyramid network and a bidirectional flow estimation strategy. On the other hand, spatial domain methods have limitations in capturing cross-temporal variation patterns, and frequency domain analysis provides an important supplement. AFFormer [36] and FcaNet [37] have confirmed that frequency domain features can effectively enhance image representation capabilities. Ma et al. [38] proposed DDLNet, which employs a tailored frequency domain enhancement module to extract frequency components from bi-temporal images via the discrete cosine transform (DCT) to emphasize significant changes, together with a spatial recovery module (SRM) that fuses spatiotemporal features and reconstructs the spatial details of change representations.
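As a concrete illustration of the dense-flow cue discussed above, the minimal sketch below estimates a flow field between two bi-temporal patches. It uses torchvision's off-the-shelf RAFT model purely for demonstration; OFNet's own optical flow branch follows the FlowNet/LiteFlowNet line described above, and the random tensors stand in for real image patches.

```python
# Illustrative only: computes a dense optical flow field between two
# bi-temporal 256x256 patches using torchvision's pretrained RAFT model.
import torch
from torchvision.models.optical_flow import raft_small, Raft_Small_Weights

weights = Raft_Small_Weights.DEFAULT
model = raft_small(weights=weights).eval()

t1 = torch.rand(1, 3, 256, 256)  # stand-in for the image at time T1
t2 = torch.rand(1, 3, 256, 256)  # stand-in for the image at time T2
t1, t2 = weights.transforms()(t1, t2)  # normalize to the range RAFT expects

with torch.no_grad():
    flows = model(t1, t2)  # list of iteratively refined flow estimates
flow = flows[-1]           # final flow field, shape (1, 2, 256, 256)
print(flow.shape)          # per-pixel (dx, dy) motion between T1 and T2
```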
Based on the above insights, this study proposes a novel change detection network framework. Building upon the Siamese architecture, an optical flow branch is integrated to leverage motion information for enhancing the identification of change regions, thereby improving the network’s responsiveness to dynamic features. Our investigation reveals that spatial attention mechanisms are effective in focusing on prominent local changes within the image, whereas frequency-domain attention excels at capturing fine-grained variations and periodic patterns. To exploit the complementary strengths of these two domains, we design a dual-domain attention mechanism that combines the spatial attention’s ability to highlight salient local changes with the frequency attention’s capacity to enhance subtle and periodic features, achieving more comprehensive feature modeling. By further incorporating multi-level feature fusion, the introduced approach exhibits enhanced precision and resilience in managing background noise and identifying minor changes, thereby significantly enhancing change detection performance. The core contributions of this study are outlined as follows:
Building upon the Siamese network architecture, we introduce an optical flow branch module to explicitly model pixel-level motion across dual-phase images. This module guides the network in identifying genuine changes caused by the movement of real-world objects, thereby enhancing its sensitivity to dynamic change regions and improving robustness in complex scenes.
We further design a bi-domain attention mechanism that integrates spatial and frequency attention modules to model change features from the perspectives of local structures and frequency distributions, respectively. This design effectively enhances the model's sensitivity to subtle and boundary-level changes (a minimal illustrative sketch follows this list).
The proposed method, OFNet, demonstrates superior performance across multiple publicly available remote sensing datasets, consistently outperforming existing state-of-the-art change detection approaches. Moreover, it maintains a low parameter count and computational cost, indicating strong potential for real-world deployment. Visualization analyses further validate the model’s ability to perceive complex change regions, and ablation studies are conducted to evaluate the individual contributions of key modules to overall performance.
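To make the bi-domain idea concrete, the sketch below pairs a CBAM-style spatial gate with an FcaNet-flavoured DCT channel gate. The module names, pooling size, and the choice of n low-frequency components are illustrative assumptions, not OFNet's actual BDA implementation, which is specified in the Materials and Methods section.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: channel-pooled maps -> conv -> sigmoid gate."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)   # average over channels
        mx = x.amax(dim=1, keepdim=True)    # max over channels
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

def dct2_basis(size, freqs):
    """2D DCT-II basis functions for the listed (u, v) frequency pairs."""
    pos = torch.arange(size).float()
    basis = []
    for u, v in freqs:
        bu = torch.cos((2 * pos + 1) * u * math.pi / (2 * size))
        bv = torch.cos((2 * pos + 1) * v * math.pi / (2 * size))
        basis.append(torch.outer(bu, bv))
    return torch.stack(basis)  # (n, size, size)

class FrequencyAttention(nn.Module):
    """FcaNet-flavoured channel attention built from n DCT components
    (n is assumed to be a perfect square, e.g. 16 -> a 4x4 low-frequency grid)."""
    def __init__(self, channels, pool_size=16, n=16):
        super().__init__()
        side = int(math.isqrt(n))
        freqs = [(u, v) for u in range(side) for v in range(side)]
        self.register_buffer("basis", dct2_basis(pool_size, freqs))
        self.fc = nn.Sequential(
            nn.Linear(channels * len(freqs), channels), nn.Sigmoid())

    def forward(self, x):
        b, c = x.shape[:2]
        xr = F.adaptive_avg_pool2d(x, self.basis.shape[-1])   # fixed spatial size
        spec = torch.einsum("bchw,nhw->bcn", xr, self.basis)  # DCT responses
        return x * self.fc(spec.reshape(b, -1)).view(b, c, 1, 1)

# Bi-domain gating applied sequentially to a fused bi-temporal feature map.
x = torch.rand(2, 64, 64, 64)
out = FrequencyAttention(64)(SpatialAttention()(x))
print(out.shape)  # torch.Size([2, 64, 64, 64])
```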
The rest of this article is structured as follows:
Section 2, Materials and Methods, provides a detailed explanation of the components of the proposed method and model.
Section 3, Results, introduces the datasets and experimental setup, compares the performance of the proposed method with existing methods, and presents ablation studies and model validation.
Section 5, Conclusions, summarizes the research work of this article.
3. Results
3.1. Dataset and Evaluation Metrics
3.1.1. Datasets
The proposed model is evaluated on two publicly available remote sensing change detection datasets, LEVIR-CD [19] and WHU-CD [40]:
WHU-CD is a benchmark dataset focused on change analysis in remote sensing imagery, primarily used for building change detection. It contains a pair of high-resolution aerial images with a resolution of 32,507 × 15,354 pixels, covering diverse environments and architectural types with abundant change information. To facilitate model training and evaluation, the original images were cropped into 256 × 256 pixel patches and randomly partitioned into a training subset (6096 images), a validation subset (762 images), and a testing subset (762 images). This dataset provides significant practical utility for building change identification and classification, serving as a key benchmark in remote sensing change detection research.
The LEVIR-CD dataset was also developed for high-resolution remote sensing change detection tasks and includes 637 pairs of bi-temporal images with rich structural detail. These images capture various types of changes across both urban and natural environments and are particularly suited for detecting changes in typical targets such as buildings and roads. To ensure generalizability across different scenes, the images were divided into non-overlapping 256 × 256 pixel patches and randomly distributed into a training subset (7120 images), a validation subset (1024 images), and a testing subset (2048 images). LEVIR-CD provides high-quality and challenging samples, and its diverse scene coverage makes it a key resource in remote sensing change detection research.
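Both datasets are prepared by cropping large scenes into 256 × 256 patches. A minimal sketch of such non-overlapping tiling is shown below; the function name and the discard-remainder policy at the image edges are illustrative assumptions, since the exact cropping scripts are not given here.

```python
import numpy as np

def tile(image, size=256):
    """Split an (H, W, C) array into non-overlapping size x size patches,
    discarding any remainder at the right/bottom edges."""
    h, w = image.shape[:2]
    return [image[y:y + size, x:x + size]
            for y in range(0, h - size + 1, size)
            for x in range(0, w - size + 1, size)]

patches = tile(np.zeros((1024, 1024, 3), dtype=np.uint8))
print(len(patches))  # 16 patches of 256 x 256
```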
3.1.2. Evaluation Metrics
To thoroughly assess the effectiveness of the proposed method, this research employed precision (P), recall (R), F1 score, and Intersection over Union (IoU) as primary metrics. These indicators can respectively measure the accuracy of the model in predicting positive samples, the recognition ability of the target area, the comprehensive performance, and the degree of spatial matching, ensuring a comprehensive evaluation of the change detection task.
Among them, precision (P) represents the proportion of samples that are truly positive among all predicted positive classes, which measures how much of the model's prediction results are accurate. Its formula is as follows:

$$P = \frac{TP}{TP + FP}$$
Recall (R) indicates the percentage of true positive samples correctly identified, reflecting the model's capability to detect the target region. Its formula is as follows:

$$R = \frac{TP}{TP + FN}$$
The F1 score is the harmonic mean of precision and recall, which can effectively reflect the overall classification ability of the model when precision and recall are balanced. If the F1 score is high, it indicates that the model has good recall ability while ensuring high precision. Its formula is as follows:

$$F1 = \frac{2 \times P \times R}{P + R}$$
In addition, to assess the spatial overlap between the model's predicted change region and the actual change area, Intersection over Union (IoU) was examined and computed. IoU quantifies the ratio of the overlapping region to the combined area of the prediction and ground truth. Values closer to 1 indicate a higher agreement between the prediction and true annotation, demonstrating stronger localization performance of the model. Its formula is as follows:

$$IoU = \frac{TP}{TP + FP + FN}$$
Here, TP (true positive) denotes the number of pixels that are correctly predicted as changed, i.e., the pixels that are actually changed and are correctly identified as changed by the model. FP (false positive) refers to pixels that are incorrectly predicted as changed (but are actually unchanged). FN (false negative) indicates the number of changed pixels that are mistakenly predicted as unchanged. These definitions provide a clear and accurate basis for evaluating the classification accuracy (precision), detection capability (recall), and spatial overlap (IoU) in the change detection model.
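The four metrics follow directly from these pixel-level counts. As a sanity check on the formulas, here is a minimal Python implementation (the function name and epsilon guard against empty masks are illustrative):

```python
import numpy as np

def change_metrics(pred, gt, eps=1e-10):
    """Precision, recall, F1, and IoU for binary change masks (0/1 arrays)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()   # changed and predicted changed
    fp = np.logical_and(pred, ~gt).sum()  # unchanged but predicted changed
    fn = np.logical_and(~pred, gt).sum()  # changed but predicted unchanged
    p = tp / (tp + fp + eps)
    r = tp / (tp + fn + eps)
    f1 = 2 * p * r / (p + r + eps)
    iou = tp / (tp + fp + fn + eps)
    return p, r, f1, iou

print(change_metrics(np.array([[1, 0], [1, 1]]), np.array([[1, 0], [0, 1]])))
```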
3.1.3. Implementation Details
Parameter settings: The experiments were conducted using the OpenCD [41] framework on a desktop equipped with a single NVIDIA TITAN V GPU (12 GB) (NVIDIA, Santa Clara, CA, USA) running Ubuntu 20.04 with CUDA 10.1. During training, the AdamW optimizer was adopted, configured with a learning rate of 0.001, momentum parameters (β₁, β₂), and a weight decay coefficient of 0.05. In the experiments, the batch size for the WHU-CD and LEVIR-CD datasets was set to 8, with a total of 40,000 iterations.
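A minimal sketch of the reported optimizer configuration follows. The momentum (beta) values were lost from the text above, so AdamW's defaults are assumed here, and the placeholder module stands in for OFNet:

```python
import torch
from torch import nn

model = nn.Conv2d(3, 1, 3)  # placeholder; substitute the OFNet definition
# lr and weight decay follow the reported settings; the paper's momentum
# parameters are not recoverable here, so AdamW defaults are assumed.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
```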
Data augmentation: In this study, we uniformly applied the same data augmentation operations to all datasets, including rotation, cropping, flipping, and photometric transformation, to enhance the model's generalizability and resilience. Rotation enhances the adaptability of the model to directional changes; cropping helps the model focus on local areas and learn richer spatial features; random flipping maintains stable performance under different viewpoints; and photometric transformation improves robustness to different lighting conditions by adjusting brightness, contrast, saturation, and hue. These strategies effectively reduce the model's dependence on specific data distributions and improve its detection ability in complex scenarios.
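For bi-temporal change detection, the geometric transforms must share identical random parameters across both images and the mask, while photometric jitter can vary per image. The helper below is an illustrative sketch of such paired augmentation; the function name, probabilities, and jitter strengths are assumptions, not settings taken from the paper:

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

photometric = T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1)

def joint_augment(t1, t2, mask):
    """Apply identical random flips/rotations to both images and the mask;
    photometric jitter is applied to each image independently."""
    if random.random() < 0.5:
        t1, t2, mask = TF.hflip(t1), TF.hflip(t2), TF.hflip(mask)
    if random.random() < 0.5:
        t1, t2, mask = TF.vflip(t1), TF.vflip(t2), TF.vflip(mask)
    k = random.randint(0, 3)  # rotation by a multiple of 90 degrees
    if k:
        t1, t2, mask = (TF.rotate(x, 90 * k) for x in (t1, t2, mask))
    return photometric(t1), photometric(t2), mask
```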
3.2. Comparison and Analysis
To comprehensively evaluate the effectiveness and efficiency of the proposed OFNet in bi-temporal change detection tasks, we selected ten classic and cutting-edge change detection models for comparative experiments, covering different architectural designs and feature fusion strategies.
1. Traditional fully convolutional baseline models: (1) FC-EF [18]: concatenates the bi-temporal image channels and feeds them into a single encoder; the structure is simple but feature interaction is limited. (2) FC-Siam-Di [18] and FC-Siam-Conc [18]: Siamese-network variants that generate change maps through differential features and concatenation-based fusion strategies, representing typical paradigms of early Siamese network design.
2. Lightweight and efficient designs: (1) SNUNet [20]: introduces dense skip connections to enhance multi-scale feature fusion and significantly improve the accuracy of change boundaries. (2) IFNet [42]: a deeply supervised image fusion network that extracts features with a dual-stream CNN, fuses multi-level image difference features through attention modules, and applies loss functions directly at intermediate layers to enhance boundary integrity and internal compactness.
3. Transformer-based models: (1) BIT [27]: compresses bi-temporal images into semantic tokens and models global spatiotemporal context through a Transformer encoder. (2) ChangeFormer [26]: a Siamese network with a pure Transformer architecture, using a hierarchical Transformer encoder to capture multi-scale spatiotemporal features and a lightweight MLP decoder to directly generate change maps.
4. Feature decoupling and spatiotemporal modeling: (1) ChangeStar (FarSeg) [43]: based on a single-temporal supervised framework, it generates pseudo bi-temporal labels from unpaired images, extends a semantic segmentation model such as FarSeg into a change detector through the ChangeMixin module, and introduces a temporal symmetry loss to alleviate overfitting. (2) STNet [44]: an explicit spatiotemporal feature fusion network incorporating a cross-temporal gating mechanism (TFF) to filter out irrelevant changes and a cross-scale attention mechanism (SFF) to fuse multi-level features and restore details. (3) DDLNet [38]: a dual-domain learning network that extracts frequency domain features through the discrete cosine transform (DCT) to enhance change regions and combines a spatial recovery module (SRM) to reconstruct spatial details, achieving frequency–spatial collaborative optimization.
3.2.1. Quantitative Results
In this study, the performance of OFNet was comprehensively validated in change detection tasks on multiple datasets. By comparing various advanced methods on the LEVIR-CD and WHU-CD datasets, OFNet demonstrated excellent change detection capabilities.
On the LEVIR-CD dataset, OFNet achieved an F1 score of 90.73 and an IoU of 83.03, significantly outperforming other methods, as shown in Table 1. For example, the FC-EF method achieves an F1 score of 83.4 and an IoU of 71.53, while OFNet shows substantial improvements in both of these important metrics. In addition, OFNet's advantage in recall gives it higher sensitivity in change detection, enabling it to effectively identify subtle changes. These improvements can be attributed to the design of OFNet. Specifically, the optical flow branch (OFB) enables the model to capture temporal motion, which enhances its ability to identify real changes in dynamic scenes. Compared to other high-performing models such as DDLNet (F1: 90.60, IoU: 82.49), which utilizes frequency-domain features for global modeling, STNet (F1: 90.52, IoU: 82.09), which introduces cross-scale attention to enhance feature interactions, and ChangeFormer (F1: 90.40, IoU: 82.48), which relies on Transformer-based global context modeling, OFNet (F1: 90.73, IoU: 83.03) integrates motion-sensitive cues with Bi-Domain Attention (BDA) to adaptively fuse spatial and frequency features. This synergy allows OFNet to maintain high precision and recall, especially on the LEVIR-CD dataset, where fine-grained and small-scale changes are common, resulting in more accurate and complete change localization.
On the WHU-CD dataset, OFNet consistently demonstrates strong performance, achieving an F1 score of 90.63, an IoU of 82.86, and a recall of 88.88, indicating its effectiveness in building change detection, as shown in Table 2. Among several strong-performing models on the WHU-CD dataset, ChangeStar (FarSeg) (F1: 90.23, IoU: 81.77) introduces semantic-guided strategies to enhance change region prediction, STNet (F1: 87.46, IoU: 77.72) leverages cross-scale attention to capture hierarchical feature interactions, and DDLNet (F1: 90.56, IoU: 82.75) employs frequency-domain modeling to enhance global perception. In comparison, OFNet (F1: 90.63, IoU: 82.86) combines temporal motion modeling through the optical flow branch (OFB) with the Bi-Domain Attention (BDA) mechanism to adaptively fuse spatial and frequency features. This synergy enables OFNet to better capture structural changes in complex urban scenes while maintaining high recall and precision, resulting in more complete and accurate change localization.
Overall, unlike models such as DDLNet that focus only on spatial and frequency domain features, OFNet adopts a more comprehensive strategy by explicitly modeling motion through an optical flow branch and enhancing feature representation via a dual-domain attention mechanism. This integration of motion cues with spatial-frequency information enables OFNet to better capture subtle changes, making it especially effective in complex change detection scenarios.
In addition to performance testing, this study also evaluated the computational complexity of OFNet and compared it with existing methods in terms of parameters and floating-point operations (FLOPs). As shown in Table 3, OFNet contains only 12.17 million parameters and 11.27 GFLOPs, which is significantly lower than most existing models. For example, ChangeFormer and IFNet have 41.02 M and 50.44 M parameters, respectively, along with much higher FLOPs of 202.87 G and 82.26 G. Although DDLNet and SNUNet have comparable parameter sizes (12.67 M and 12.03 M, respectively), their FLOPs (7.35 G and 54.88 G) either sacrifice model representation capability or introduce computational redundancy. STNet, while lightweight in FLOPs (9.61 G), still has a higher parameter count (14.6 M) and underperforms OFNet in detection accuracy. By contrast, OFNet attains an ideal trade-off between detection precision and computational cost, ranking second in both parameter size and FLOPs while maintaining state-of-the-art performance. These results demonstrate that OFNet maintains high detection accuracy while significantly reducing computational overhead, rendering it better suited for deployment in settings with constrained computational capacity, such as edge devices or onboard processing systems on remote sensing platforms. This balance between performance and efficiency highlights the practical value and adaptability of OFNet in resource-constrained applications.
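As a reproducibility note, parameter counts like those in Table 3 can be obtained directly in PyTorch; the snippet below is illustrative, with a placeholder module standing in for OFNet:

```python
import torch
from torch import nn

model = nn.Conv2d(3, 16, 3)  # placeholder module; substitute the real OFNet
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params / 1e6:.2f} M trainable parameters")
# FLOPs figures such as those in Table 3 are usually measured with a profiler
# (e.g., thop or fvcore); the paper does not state which tool was used.
```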
To verify whether the frequency domain components selected by the frequency domain attention mechanism are optimal, we conducted experiments with different numbers of frequency domain components n on the LEVIR-CD dataset. The experimental results are shown in Table 4. The results show that values of n that are either too small or too large degrade model performance. When n = 4 or n = 8, the precision is low, resulting in a relatively low F1 score and IoU, indicating that low-frequency information alone is not sufficient to provide a complete feature representation. When n = 32 or n = 64, the larger set of frequency components admits high-frequency information that may introduce too many irrelevant details, impairing the ability to distinguish change regions. Overall, n = 16 achieved good performance on all indicators, showing that this setting balances high- and low-frequency information well and confirming the strong feature representation capability of the frequency domain attention mechanism when the frequency components are chosen correctly.
We conducted a systematic set of comparative experiments on the weight of the auxiliary loss function to thoroughly investigate its impact on overall model performance under different settings. As shown in Table 5, the model achieved the best results across multiple key evaluation metrics when the auxiliary loss weight was set to 0.4, demonstrating superior generalization ability and stability. This finding clearly indicates that appropriately incorporating auxiliary supervision during training can effectively enhance feature representation and thereby improve the performance of the primary task. In contrast, when the weight is set too low, the auxiliary task provides insufficient guidance, yielding limited benefits, whereas an excessively high weight may cause the auxiliary signal to dominate the learning process and detract from the primary objective. Overall, a weight of 0.4 strikes a well-balanced synergy between the main and auxiliary tasks, leading to a significant improvement in training efficiency and model performance.
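A minimal sketch of how such a weighted auxiliary term can be combined with the main objective follows; binary cross-entropy is assumed here, since this section does not restate OFNet's exact loss functions:

```python
import torch
import torch.nn.functional as F

def total_loss(main_logits, aux_logits, target, aux_weight=0.4):
    """Primary change-map loss plus a weighted auxiliary (deep supervision)
    term; 0.4 is the best-performing weight reported in Table 5."""
    main = F.binary_cross_entropy_with_logits(main_logits, target)
    aux = F.binary_cross_entropy_with_logits(aux_logits, target)
    return main + aux_weight * aux

# Example with dummy logits and a binary target mask:
logits = torch.randn(2, 1, 256, 256)
target = torch.randint(0, 2, (2, 1, 256, 256)).float()
print(total_loss(logits, logits.clone(), target))
```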
3.2.2. Qualitative Results
In order to further validate the advantages of the proposed OFNet in change detection tasks, this study compared its visual detection performance with other methods on the LEVIR-CD and WHU-CD datasets. The proposed OFNet (i) was compared with representative change detection methods, including (d) FC-EF, (e) FC-Siam-Diff, (f) IFNet, (g) SNUNet, and (h) STNet, on the test sets of the LEVIR-CD and WHU-CD datasets. (a) and (b) denote the bi-temporal input images T1 and T2, while (c) shows the ground-truth labels. For better visualization, different pixel colors are used: white represents true positives, black true negatives, red false positives, and green false negatives. Through visual analysis, it is possible to intuitively observe the performance of different methods in terms of the integrity of change regions, extraction of edge details, and suppression of false positives. The first two sample groups in the figures originate from the LEVIR-CD dataset, while the remaining three groups come from the WHU-CD dataset.
Figure 6 shows the visualization experiments on the LEVIR-CD dataset, comparing the detection performance of different methods in scenarios with discrete small changes, large-area array changes, and regular linear changes. The results indicate that our model performs better under these different change patterns. FC-EF and FC-Siam-Diff produce more false positives, while IFNet and SNUNet still miss small targets; our model captures dispersed changes more accurately and improves detection completeness. For large-scale array changes, IFNet and STNet yield discontinuous edges or adhesion between adjacent targets, whereas our model maintains target integrity and accurately detects changes in building clusters. For regular linear changes, SNUNet and STNet exhibit significant edge errors and IFNet misses detections, while our model's boundary handling is more refined, effectively reducing both false positives and false negatives. Overall, our model performs better in detection completeness, boundary handling, and noise suppression.
The visualization experiments on the WHU-CD dataset, shown in Figure 7, demonstrate that our model performs better in three scenarios: complex building changes, small independent changes, and dense building group changes. For complex building changes, FC-EF and FC-Siam-Diff produce more false detections, and IFNet and SNUNet suffer from edge discontinuities, whereas our model preserves building outlines more accurately and reduces edge false detections. For small independent changes, most methods are prone to missed or false detections; SNUNet and STNet perform well but still fall short, while our model captures the changes more comprehensively and improves detection accuracy. For dense building clusters, FC-EF and FC-Siam-Diff show severe false positives, and although IFNet and STNet improve on this to some extent, they still produce incomplete contours; our model identifies the changed areas more accurately, reduces boundary noise, and yields more complete detections. Overall, our model improves detection completeness, reduces false positives and false negatives, and demonstrates stronger robustness in complex scenarios.
3.3. Ablation Studies
To verify the effectiveness of the key components of OFNet, this study conducted ablation experiments on the LEVIR-CD validation set to analyze the roles of the optical flow branch (OFB) and the Bi-Domain Attention (BDA) in model performance. Specifically, OFB (w/o OFB) and BDA (w/o BDA) were removed in turn for comparative experiments, with the complete model, Full (Ours), used as the benchmark. The experimental results are shown in Table 6.
From the experimental results, it can be seen that removing OFB decreases the F1 score to 90.31 and the IoU to 82.33, indicating that the optical flow branch plays a key role in capturing motion information and helps improve the perception of change regions. Meanwhile, although the precision only slightly decreased to 90.26, the recall reached its highest value at 90.36, suggesting that without OFB the model tends to predict more change regions, but the accompanying increase in false positives lowers the overall IoU. Removing BDA decreases the F1 score to 90.53 and the IoU to 82.70, indicating that the bi-domain attention mechanism enhances the fusion of spatial and frequency domain features. Although the precision of this variant reached 90.88, the recall decreased to 90.18, indicating that BDA mainly contributes to reducing false positives and helps the model extract the edge details of change regions more accurately.
To further demonstrate the contribution of the optical flow branch (OFB), we visualize the input change mask, optical flow map, and the bi-temporal remote sensing images in Figure 8. As seen in the visualization, while the optical flow maps do not always perfectly align with the binary change masks, they capture rich motion-related cues such as building construction, shadow shifts, and even lighting condition variations. These motion-sensitive cues are essential in guiding the model to distinguish actual changes from visually similar but static regions. The OFB uses this dense motion information to provide complementary features to the Siamese backbone, enabling the network to focus on regions with dynamic context or high uncertainty.
The ablation study in Figure 9 visually compares the full OFNet model with its variants in which either the BDA or OFB module is removed, using representative samples from the LEVIR-CD dataset. The full model (column (d)) achieves the most accurate results, showing the fewest false positives (red) and false negatives (green), thereby maintaining high precision and recall. When the BDA module is removed (column (e)), the number of false positives increases, especially along object boundaries and in fine-grained change areas, indicating that BDA plays a crucial role in enhancing edge localization and reducing noise. Although false negatives remain moderate in this variant, the increased false positives suggest reduced precision. Removing the OFB module (column (f)) leads to a significant increase in false negatives, particularly in regions with dynamic changes, highlighting the OFB module's importance in capturing motion information and detecting subtle temporal differences. The decline in recall underscores the OFB module's role in improving sensitivity to actual changes. Overall, the two modules offer complementary benefits: BDA improves precision through better edge and noise handling, whereas OFB enhances recall by modeling temporal motion. Their joint integration in OFNet is thus essential for achieving robust and balanced change detection performance.
Although both the optical flow branch (OFB) and the Siamese encoder extract bi-temporal differences, they capture complementary information. The Siamese structure mainly learns high-level semantic changes between T1 and T2, while the OFB focuses on modeling pixel-level motion displacement through optical flow. During joint training, the network implicitly distinguishes motion-induced local variations from semantic-level temporal changes by fusing the outputs of both branches. This enables the model to attend to true changes while reducing sensitivity to irrelevant motion. The ablation results in Table 6 further demonstrate that the inclusion of OFB improves IoU by 0.70 points compared to the version without it, confirming the independent contribution of the motion-guided branch.
These results enable a step-by-step interpretation of each component’s contribution. Compared to the full model, removing OFB leads to a 0.70-point drop in IoU, and removing BDA results in a 0.33-point decrease, indicating that OFB has a stronger effect on motion-related change localization, while BDA further refines the predictions by enhancing discriminative features. Although we do not present an explicit baseline-only model, the existing ablation setting implicitly reflects a progressive enhancement: starting from a base with one component removed, adding the next component yields a quantifiable improvement in performance. This validates that the proposed modules contribute incrementally and synergistically to the overall detection accuracy.
In the complete model, the F1 score reached 90.73 and the IoU was the highest at 83.03, both improved over the ablation variants. This indicates that the combination of the optical flow branch and bi-domain attention effectively improves the detection of change regions while balancing precision and recall. In particular, the precision of the complete model is 91.96, significantly higher than the other versions, indicating that the method reduces false positives and improves the reliability of the detected change regions.
To verify the impact of introducing the auxiliary loss (Aux Loss) on model performance, we compared the experimental results with and without the auxiliary loss on the LEVIR-CD and WHU-CD datasets, as shown in Table 7. Across all evaluation metrics (F1 score, precision, recall, and IoU), the model with the auxiliary loss is significantly better than the model without it, indicating that the auxiliary loss effectively improves the model's change detection ability.
Specifically, on the LEVIR-CD dataset, the introduction of the auxiliary loss improved the F1 score from 88.85 to 90.73 and the IoU from 79.93 to 83.03, indicating that the model detects change regions more accurately. On the WHU-CD dataset, the auxiliary loss also brought significant gains, with the F1 score increasing from 89.63 to 90.63 and the IoU from 81.2 to 82.86, further verifying the generalization of this strategy across datasets. Overall, the auxiliary loss constrains the intermediate representation of the fused features, guiding them in directions more conducive to change detection and thus boosting the model's capability to detect change areas. The experimental results show that the strategy improves all indicators on both datasets; the increase in IoU in particular reflects more accurate prediction of change regions and an effective reduction in false positives and false negatives. These results validate the effectiveness of the proposed auxiliary loss optimization strategy, indicating that it enhances overall detection performance and improves generalization across datasets.
5. Conclusions
This research introduces OFNet, an innovative remote sensing image change detection model that integrates optical flow modeling with a dual-domain attention mechanism. By incorporating an optical flow branch, OFNet effectively captures motion information between bi-temporal images, thereby enhancing the model's sensitivity to change regions. This branch enables the model to recognize subtle pixel-level displacements and better perceive spatiotemporal dynamics, which are crucial for accurate change detection. The dual-domain attention mechanism further strengthens OFNet by enabling it to attend to complementary information in both the spatial and frequency domains. Spatially, it facilitates the extraction of localized structural changes; in the frequency domain, it captures spectral variations that may be overlooked by traditional convolutional operations. This combined strategy significantly improves the model's robustness in scenarios with complex backgrounds or weak signals. Experimental results on benchmark remote sensing change detection datasets demonstrate that OFNet achieves an F1 score of 90.73 and an IoU of 83.03 on LEVIR-CD, and an F1 score of 90.63 and an IoU of 82.86 on WHU-CD, consistently surpassing state-of-the-art approaches in precision and robustness. Compared with previous works, which either rely solely on spatial cues or disregard temporal motion, our model shows clear advantages by leveraging both temporal motion cues and frequency-based distinctions.
Despite these promising results, several limitations should be acknowledged. First, the computation of optical flow introduces additional computational overhead, which may hinder real-time deployment in resource-constrained environments. Second, while the dual-domain attention improves performance, it increases the model's complexity, which could affect scalability to larger or more diverse datasets. Moreover, the model's performance under severe illumination variations or seasonal changes, common challenges in remote sensing, was not explicitly examined and may require further robustness testing.
Future research could explore lightweight variants of OFNet to reduce inference costs, potentially by integrating more efficient optical flow estimation or attention mechanisms. Extending OFNet to multi-temporal or multi-modal (e.g., SAR-optical fusion) settings would also be valuable for practical applications, and self-supervised or semi-supervised learning techniques could lessen the dependence on extensive labeled datasets, which are typically costly to acquire in remote sensing contexts. In summary, OFNet offers a compelling and effective solution to the change detection problem, while leaving important opportunities to improve its efficiency, scalability, and adaptability in future work.