Article

FFLKCDNet: First Fusion Large-Kernel Change Detection Network for High-Resolution Remote Sensing Images

1 Faculty of Applied Sciences, Macao Polytechnic University, Macao 999078, China
2 Macao Polytechnic University, Macao 999078, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(5), 824; https://doi.org/10.3390/rs17050824
Submission received: 6 December 2024 / Revised: 9 February 2025 / Accepted: 25 February 2025 / Published: 26 February 2025

Abstract:
Change detection is an important technique that identifies areas of change by comparing images of the same location taken at different times, and it is widely used in urban expansion monitoring, resource exploration, land use detection, and post-disaster monitoring. However, existing change detection methods often struggle with balancing the extraction of fine-grained spatial details and effective semantic information integration, particularly for high-resolution remote sensing imagery. This paper proposes a high-resolution remote sensing image change detection model called FFLKCDNet (First Fusion Large-Kernel Change Detection Network) to solve this issue. FFLKCDNet features a Bi-temporal Feature Fusion Module (BFFM) to fuse remote sensing features from different temporal scales, and an improved ResNet network (RAResNet) that combines large-kernel convolution and multi-attention mechanisms to enhance feature extraction. The model also includes a Cross-Dimensional Large-Kernel Attention Fusion Module (CD-LKAFM) to integrate multi-scale information during the feature recovery stage, improving the resolution of details and the integration of semantic information. Experimental results showed that FFLKCDNet outperformed existing methods on datasets such as GVLM, SYSU, and LEVIR, achieving superior performance in metrics such as Kappa coefficient, mIoU, MPA, and F1 score. The model achieves high-precision change detection for remote sensing images through multi-scale feature fusion, noise suppression, and fine-grained information capture. These advancements pave the way for more precise and reliable applications in urban planning, environmental monitoring, and disaster management.

1. Introduction

Change detection is a critical technique in remote sensing image analysis, used to identify areas of change by comparing two images of the same location acquired at different times [1]. This technique is extensively applied across various domains, including urban expansion monitoring, resource exploration, land use assessment, and post-disaster evaluation. However, high-resolution remote sensing images present challenges due to their large size, where the proportion of altered areas within the entire image is typically very small [2]. Consequently, manual comparison of such images becomes both time-consuming and labor-intensive. Over the past few decades, numerous methods have been developed to improve the efficiency and accuracy of change detection.
A primary challenge in change detection lies in effectively modeling the temporal correlation between dual-time images. Variations in atmospheric scattering conditions and complex light-scattering mechanisms introduce significant nonlinearity into the change detection process, complicating accurate analysis. Consequently, a task-driven, learning-based approach is essential to address this complexity. With advancements in geographic object-oriented image analysis (GEOBIA) methods [3], pixels with similar characteristics are grouped to form geographic objects that retain specific geometric shapes, sizes, textures, and other features. These geographic objects are then used as the basic units for change detection, leveraging their attribute characteristics [4]. This approach allows for a more realistic and objective representation of regional changes, particularly in high-resolution images, resulting in more accurate and contextually relevant detection outcomes.
High-resolution remote sensing satellite images provide distinct advantages for change detection tasks, due to their fine spatial details and ability to capture even subtle changes in the landscape. The increased resolution allows for the identification of smaller and finer changes that might be overlooked in lower-resolution imagery, such as slight shifts in land cover, small-scale urbanization, or localized environmental changes. This heightened sensitivity is particularly valuable in applications such as monitoring urban sprawl, tracking deforestation, or assessing disaster impacts, where detecting even minute alterations can significantly impact decision-making. Furthermore, high-resolution images improve the overall accuracy of the detection process by providing more precise spatial references, enabling a clearer distinction between unchanged and changed areas. As a result, these images contribute to more reliable and contextually relevant detection outcomes, which is crucial for tasks requiring a high level of detail and accuracy in mapping landscape changes.
In addition, high-resolution remote sensing images allow for better integration with other data sources, such as geographic information systems (GIS) and land use databases, facilitating more comprehensive change detection analysis. The detailed spatial information embedded in these images enhances the ability to conduct multi-scale analysis, which can better capture variations in both global and local features across the study area. This integration of detailed imagery with additional datasets not only improves the detection of spatial patterns but also enriches the interpretation of the underlying processes driving those changes.
Deep learning methods utilize labeled training data to learn which areas have changed over time. With the rapid advancement in graphics processing units (GPUs), deep learning techniques have become increasingly applicable across a range of fields, including remote sensing change detection. In this domain, standard end-to-end two-dimensional convolutional neural networks (2D-CNNs) are widely employed to effectively extract distinguishing features from higher levels. These models can also incorporate a hybrid affinity matrix combined with sub-pixel representations, enhancing their generalization capabilities. Notable progress has been made using CNN-based models in remote sensing change detection, such as DMINet [5], which supports both change detection and land cover mapping by integrating land cover information to facilitate change prediction. Additionally, a change detection approach based on deep Siamese convolutional networks, known as P2V [6], inputs dual images into the same network to generate distinct feature maps, leveraging feature vector separation to detect changes between pixel pairs in the images. This approach enables accurate detection of changes by analyzing spatially aligned feature sets.
Land cover change reflects not only transformations in geographical environments but also specific shifts in land use patterns and surface cover. Global features primarily describe the overall distribution and trends in land cover change, such as the spatial distribution and proportions of different land cover types, including urban, agricultural, and forested areas. These features are critical for understanding the broader context and overarching trends in land cover change. Local features, by contrast, focus on finer details, capturing subtle alterations in surface cover. These features provide valuable insights into the microscopic mechanisms and specific processes driving land cover change. Therefore, to comprehensively understand the nature and dynamics of land cover change, it is essential to integrate both global and local features in detection tasks. To address this need, this paper proposes the FFLKCDNet network. FFLKCDNet introduces the BFFM, which leverages multiple convolutional units to fuse change features from dual-temporal remote sensing images, combining information from different scale receptive fields and dimensions. Following BFFM processing, a backbone network extracts multi-dimensional change features. The backbone network in FFLKCDNet, RAResNet, employs multi-attention to aggregate features and incorporates Reparameterized Large-Kernel Convolution (ReLK) to capture high-dimensional semantic change information over a larger receptive field. For the feature recovery phase, we designed the CD-LKAFM, which merges change features across dimensions, integrating the backbone network's high-dimensional semantic information with spatial information and ensuring that multi-scale features are considered in feature recovery. The primary contributions of this paper are summarized as follows:
  • BFFM: A novel module for fusing dual-temporal change features from varying scales and dimensions;
  • RAResNet: An improved ResNet50, incorporating multi-attention and ReLK, which aggregates change information from remote sensing images over a large receptive field;
  • CD-LKAFM: A cross-dimensional module in the feature recovery phase that further integrates global and local change features, effectively merging semantic and spatial information.
Extensive experiments demonstrated that FFLKCDNet outperformed existing methods, including DSAMNet and USSFCNet. The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 details the architecture of FFLKCDNet; Section 4 presents the experimental setup and results; and Section 5 concludes with a summary of our contributions.

2. Related Work

CNN-based methods have the advantage of automatically learning feature representations from data, capturing complex patterns and spatial dependencies between pixels, which leads to improved performance over traditional methods. With the rise of CNNs, researchers began exploring their applications in remote sensing change detection. Daudt et al. [7] were among the first to introduce fully convolutional networks (FCNs) into remote sensing change detection, proposing models such as FC-Siam-Conc, FC-EF, and FC-Siam-Diff. These models extract image features through convolutional layers and utilize a twin network structure to process dual-temporal images, marking a milestone in CNN-based change detection for remote sensing.
To further enhance CNN performance in remote sensing change detection, researchers have introduced various optimizations. With the development of deeper CNN architectures like ResNet [8] and DenseNet [9], these models have been applied to change detection, as deeper CNNs can capture richer image features by stacking more convolutional layers, leading to higher detection accuracy. For instance, Zhang et al. [10] proposed the DSIFN model, which applies a CNN to multi-source remote sensing image change detection by incorporating deep supervision mechanisms, achieving notable results. Fang et al. [11] further developed the SNUNET model, which employs a densely connected network to capture deep spatiotemporal relationships, thereby improving change detection accuracy.
The success of attention mechanisms in natural language processing has inspired their application in computer vision, including remote sensing change detection. Integrating attention mechanisms with CNNs enables models to focus on crucial areas within an image, thereby improving detection accuracy. Liu et al. [12] introduced the DSAMNet model, which incorporates spatial and channel attention modules, significantly enhancing remote sensing change detection performance. Shi et al. [13] developed the CDNet model, combining attention mechanisms and deep supervision to improve feature representation through a multi-layer attention mechanism, achieving meaningful advancements. Additionally, Huang et al. [14] employed CNNs to filter noise in multi-scale features by designing specific remote sensing feature fusion methods, further enhancing change detection accuracy for remote sensing images.
The local receptive field inherent in convolutional neural networks (CNNs) limits their capacity to capture global contextual information, thereby affecting their accuracy in Land Cover Change Detection (LCCD) tasks. In contrast, the Transformer model [15], which is built upon a self-attention mechanism, demonstrates remarkable efficacy in processing sequential data. The enhanced long-range modeling capabilities of Vision Transformers (ViT) and Swin Transformers enable them to effectively address the limitations encountered by CNNs. Consequently, Transformers have found extensive applications within the field of computer vision.
In remote sensing image processing, numerous studies have begun to leverage the Transformer architecture for Land Cover Change Detection (LCCD) tasks. Bandara et al. proposed ChangeFormer [16], a Siamese network exclusively utilizing Transformers, featuring a Siamese Transformer encoder and a multi-layer perceptron (MLP) decoder, thereby eliminating the need for a CNN-based feature extractor. Chen et al. [17] introduced BIT as a pioneering approach that transforms multi-temporal images into semantic tokens, effectively modeling spatiotemporal relationships within a token-based framework to enhance change detection. Zhang et al. [18] developed SwinUnet for change detection tasks, constructing a model architecture based on Swin Transformers. Liu et al. [19] proposed MSCANet, which synergistically combines the strengths of CNNs and Transformers to achieve efficient and effective agricultural land change detection. Feng et al. [20] introduced ICIFNet, utilizing Transformers to model both the intra-scale and inter-scale relationships of diverse temporal features. Additionally, models such as ACABFNet [21] and DMINet [5] employ Transformers to enhance the modeling of spatiotemporal relationships among multi-temporal features, effectively capturing complex dependencies between pixels and the associated spatiotemporal changes. DARNet [22] presented the concept of dense attention and a refined network structure, using Transformers to optimize the feature extraction stage and reduce noise interference, achieving commendable results. Similarly, the PT-Former [23] introduced in recent studies leverages a position-time aware Transformer, which effectively models the spatial and temporal dependencies in multi-temporal images, significantly improving change detection performance in complex scenarios. Furthermore, SMBCNet [24], a model that integrates hierarchical Transformer encoders with a cross-scale enhancement module, is designed to capture a broader range of features, improving detection accuracy across multiple spatial and temporal scales.
In remote sensing change detection, CNN-based models have demonstrated excellent performance, especially in capturing local details and spatial features. Through local convolution operations, CNNs can efficiently extract small changes in images, which is important for detecting fine-grained changes in remote sensing imagery. Although Transformer models have advantages in global feature modeling, CNNs remain a very effective choice for change detection because they extract key local features at lower computational cost and adapt well during training. Their computational efficiency is especially valuable when processing high-resolution remote sensing images, making them more feasible for practical applications. Although some researchers have begun exploring hybrid models that combine CNNs and Transformers, CNN-based models, with their mature technology and efficient feature extraction capabilities, can still deliver excellent performance in resource-limited situations. Therefore, CNNs remain highly competitive for remote sensing change detection, particularly in scenarios with sufficient data and a focus on local change details.
In the Land Cover Change Detection (LCCD) task, obtaining a universal feature representation of changing targets against a complex background is essential for accurate detection. Consequently, the algorithm must be able to capture large-scale land cover features, which requires a model with an extensive receptive field. Recent studies have explored large-kernel convolution (LKC) to achieve this goal: by extending the size of traditional convolution kernels, LKC enables the model to capture a broader receptive field and extract features from images more effectively.
To enhance LKC performance, Liu et al. introduced ConvNeXt [25], which is entirely based on a standard CNN architecture, without incorporating Transformers, yet demonstrates competitive accuracy, scalability, and robustness. The authors experimentally validated that varying the convolution kernel size can lead to performance improvements. RepLKNet [26] employs parameterizable large depthwise separable convolutions to create a CNN architecture featuring a substantial 31 × 31 kernel, providing insights on effectively designing high-performance CNNs with large kernels. Experimental results have indicated that LKC offers a larger effective receptive field and enhanced shape bias compared to smaller kernels.
Additionally, Ding et al. proposed UniRepLKNet [27], a versatile perceptual LKC model that exhibits remarkable performance across various tasks, including audio, video, point cloud, time series, and image recognition. UniRepLKNet integrates Squeeze-and-Excitation (SE) blocks to increase the depth and employs extended reparameterization blocks for convolutional layer optimization, further demonstrating the robust performance of large-kernel convolution. These advancements underscore the significant potential of LKC in the domain of remote sensing image processing.

3. Model Overview

In high-resolution remote sensing images, shallow features such as geometric, texture, and color characteristics are often prominent, which can impede the algorithm’s ability to capture deeper feature information, particularly data spatial relationships and semantic context. This limitation complicates the accurate localization of target areas. Furthermore, remote sensing images may contain pseudo-change areas, arising from a lack of distinctive features that clearly differentiate change regions. This phenomenon introduces noise and elevates the risk of misclassification by the model.
To capture more spatial information from remote sensing images, enhance feature dependencies, and integrate global and local features for a comprehensive understanding of land cover changes, this paper introduces FFLKCDNet. The overall architecture of FFLKCDNet is illustrated in Figure 1. In the feature extraction stage, the model employs the BFFM to fuse remote sensing images from different time points, producing a feature map X_1. Subsequently, RAResNet extracts feature maps that encapsulate dual-time remote sensing information, resulting in feature maps X_2, X_3, and X_4. Atrous Spatial Pyramid Pooling (ASPP) [28] is then utilized to aggregate spatial information dependencies from X_4, yielding X_5. The ASPP module expands the receptive field of the convolution kernel by using atrous convolution with multiple different dilation rates, thereby capturing a larger range of spatial contextual information without increasing computational complexity. In addition, ASPP integrates global contextual features extracted through global average pooling, and forms a multi-scale comprehensive feature map by concatenating or fusing the dilated convolution feature maps at different dilation rates with the global pooling feature map. This approach enhances the model's ability to perceive objects of different scales, which helps improve performance in complex scenes, especially when dealing with tasks that require capturing multi-level spatial information. In the feature recovery stage, the Cross-Dimensional Large-Kernel Attention Fusion Module (CD-LKAFM) sequentially fuses features from high to low dimensions to capture contextual semantic information, while filtering out noise. Finally, a classifier processes these features to generate the model's prediction results.
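To make the ASPP aggregation step concrete, the following is a minimal, illustrative sketch in PyTorch-style Python (the paper's implementation uses PaddlePaddle); the dilation rates (6, 12, 18), channel handling, and layer arrangement are assumptions made for illustration, not the authors' configuration.

```python
# Minimal ASPP sketch: parallel atrous convolutions plus a global-pooling branch,
# concatenated and projected back to a single multi-scale feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):  # rates are assumed values
        super().__init__()
        self.branches = nn.ModuleList([nn.Conv2d(in_ch, out_ch, 1, bias=False)])
        for r in rates:
            # Atrous 3x3 convolutions enlarge the receptive field at the same cost.
            self.branches.append(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False))
        # Image-level branch: global average pooling captures global context.
        self.gap = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(in_ch, out_ch, 1, bias=False))
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        # Upsample the pooled branch back to the feature-map size before fusion.
        feats.append(F.interpolate(self.gap(x), size=(h, w), mode='bilinear',
                                   align_corners=False))
        return self.project(torch.cat(feats, dim=1))  # multi-scale output (X_5-like)
```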

3.1. Bi-Temporal Feature Fusion Module

The structure of the BFFM is illustrated in Figure 2. This architecture comprises two input channels, designated as Input1 and Input2, each with dimensions [3, H, W], where the channel count is 3 and the spatial dimensions are H and W, respectively. Initially, each input undergoes processing through a 3 × 3 convolutional layer, accompanied by batch normalization and a ReLU activation function. This results in an output feature map of size [32, H, W], effectively increasing the channel count to 32, while maintaining the spatial dimensions.
The two convolved feature maps are subsequently concatenated along the channel axis, resulting in a combined feature map with dimensions [64, H, W]. To extract fine-grained information at different scales, three depthwise separable convolution operations are performed, each using a 3 × 3 convolutional kernel, with dilation rates of 1, 3, and 5, respectively, and batch normalization and ReLU activation functions are applied simultaneously. This multi-scale approach enables the module to capture detailed spatial representations of land cover change characteristics. The output of these convolutions is concatenated again, preserving the size of the feature map at [64, H, W] and enhancing the richness of the feature representation. Following this, the concatenated feature map is processed through another 3 × 3 convolutional layer, which again employs batch normalization and ReLU activation, resulting in a feature map of size [64, H, W]. This feature map is then directed to a feature adjustment module (y_m), where it undergoes a 1 × 1 convolution, along with batch normalization and ReLU activation, ensuring that the channel count and spatial dimensions remain [64, H, W]. Subsequently, the processed feature map from the y_m module is fed into the next module, y_m′, where similar steps of 1 × 1 convolution, batch normalization, and ReLU activation are applied to further refine the feature representation. The output size after processing in y_m′ remains [64, H, W].
Upon completion of these steps, the output feature map from y_m′ is fused with the initial convolution feature map through element-wise addition (Add), facilitating feature integration. The resulting fused feature map is then processed through another 3 × 3 convolution layer, followed by batch normalization and ReLU activation. This stage includes downsampling, which reduces the spatial dimensions of the output feature map by half, leading to a final output size of [64, H/2, W/2]. This final feature map, denoted X_1, encapsulates the results of multiscale feature extraction, feature adjustment, feature fusion, and downsampling, thereby providing a rich and compact representation for subsequent layers of the network. The dual-temporal remote sensing feature map X_1 is computed as follows:
$X_1 = \rho_{\mathrm{bn}}\left(f_2^{3 \times 3}\left(y'_m\left(y_m\right)\right)\right),$
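As a reading aid, the sketch below reproduces the BFFM data flow described above in PyTorch-style Python. The channel counts and dilation rates (1, 3, 5) follow Section 3.1; projecting the three concatenated branches back to 64 channels and the exact placement of the residual addition are assumptions made for illustration, not the authors' implementation.

```python
# Minimal BFFM sketch: two stems, multi-scale depthwise-separable dilated branches,
# y_m / y_m' adjustment, residual addition, and a stride-2 downsampling to give X1.
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, k=3, s=1, d=1):
    p = d * (k - 1) // 2  # "same" padding for odd kernels
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, s, p, dilation=d, bias=False),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class BFFM(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem1 = conv_bn_relu(3, 32)            # Input1 -> [32, H, W]
        self.stem2 = conv_bn_relu(3, 32)            # Input2 -> [32, H, W]
        self.branches = nn.ModuleList()
        for d in (1, 3, 5):                          # dilation rates from Section 3.1
            self.branches.append(nn.Sequential(
                nn.Conv2d(64, 64, 3, padding=d, dilation=d, groups=64, bias=False),
                nn.Conv2d(64, 64, 1, bias=False),    # depthwise + pointwise
                nn.BatchNorm2d(64), nn.ReLU(inplace=True)))
        self.reduce = conv_bn_relu(192, 64)          # back to [64, H, W]
        self.y_m = conv_bn_relu(64, 64, k=1)         # feature adjustment y_m
        self.y_m2 = conv_bn_relu(64, 64, k=1)        # feature adjustment y_m'
        self.down = conv_bn_relu(64, 64, s=2)        # downsample to [64, H/2, W/2]

    def forward(self, t1, t2):
        base = torch.cat([self.stem1(t1), self.stem2(t2)], dim=1)    # [64, H, W]
        multi = torch.cat([b(base) for b in self.branches], dim=1)   # [192, H, W]
        adjusted = self.y_m2(self.y_m(self.reduce(multi)))
        return self.down(adjusted + base)                             # X1
```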

3.2. ReLK-Attention ResNet

In FFLKCDNet, RAResNet is used as the backbone network for extracting features from remote sensing images. As an effective backbone for semantic segmentation, ResNet solves the gradient vanishing problem in deep networks, enabling the model to learn effectively through deeper layers, without sacrificing training accuracy. In the feature extraction process, the addition of the attention module allows for the acquisition of more discriminative features that not only consider the information of a single pixel but also integrate the global context in the spatiotemporal dimension, thereby enhancing the richness and accuracy of the feature representation. The use of large-kernel convolution helps extract broader and more global feature information, thereby expanding the receptive field of the model. This enhancement helps the model to fully understand the overall features and contextual information of the input image, resulting in more accurate reasoning and prediction results. RAResNet consists of three RAResBlocks, each of which contains a multi-attention downsampling module and two improved bottlenecks. The structure of RAResBlock is shown in Figure 3.
Figure 3 shows a schematic diagram of the improvements made to the residual block in RAResBlock. In this design, the downsampling component of ResNet50 is replaced by a multi-attention downsampling module (MADM). In MADM, convolution is first applied to downsample the input features to produce m, and then multi-attention is applied to generate the output. This attention mechanism effectively captures long-range and rich spatio-temporal correlations, resulting in features that are robust to illumination changes and registration errors. As a result, the network is better able to generate similar feature representations for similar objects at different times and locations, thereby mitigating the impact of registration errors and improving the accuracy of change detection. In addition, a residual structure is adopted to combine m with the output of the multi-attention, which helps introduce the attention mechanism within the residual framework. This integration allows self-attention to work in various subregions, allowing the model to capture spatio-temporal dependencies at different scales, so that the algorithm can better adapt to object scale changes, thereby improving the detection ability for objects of different sizes.
After the convolutional layer and projection operation, the generated Q and K tensors are used to calculate the attention weights: the inner product of Q and K yields the initial attention matrix, and the attention distribution is generated after softmax processing. The purpose of this step is to identify the correlation between input features through the self-attention mechanism, enhance the model's ability to capture long-distance dependencies, and thus better understand global features. Next, this attention distribution is multiplied by the input feature again and convolved to obtain a new intermediate feature m′, which is the final attention-weighted result and represents the feature map after attention adjustment. By re-weighting and adjusting the original features, m′ enhances the diversity and discrimination of the features, further improving the model's ability to focus on details and important information. Finally, m′ is fused with m (through addition) and convolved to form the output feature F_1 of MADM. This fusion operation enables the final output feature to retain the information of the original features after attention weighting and integrating important features, providing a richer feature input for subsequent modules. The calculation formula of the feature map F_1 is as follows:
$F_1 = \rho_{\mathrm{bn}}\left(f_1^{3 \times 3}\left(m' \oplus m\right)\right)$
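The sketch below illustrates one plausible realization of MADM in PyTorch-style Python: a strided convolution produces m, a single-head global self-attention over spatial positions produces the re-weighted feature m′, and the two are added and convolved to give F_1. The head count, scaling, and projection layout are assumptions, not the paper's exact configuration.

```python
# Minimal MADM sketch: convolutional downsampling (m) followed by spatial
# self-attention (m'), fused by addition and a final convolution (F1).
import torch
import torch.nn as nn

class MADM(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.q = nn.Conv2d(out_ch, out_ch, 1)   # query projection
        self.k = nn.Conv2d(out_ch, out_ch, 1)   # key projection
        self.v = nn.Conv2d(out_ch, out_ch, 1)   # value projection
        self.proj = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.out = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        m = self.down(x)                                   # downsampled feature m
        b, c, h, w = m.shape
        q = self.q(m).flatten(2).transpose(1, 2)           # [B, HW, C]
        k = self.k(m).flatten(2)                           # [B, C, HW]
        v = self.v(m).flatten(2).transpose(1, 2)           # [B, HW, C]
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)     # softmax(QK) over positions
        m_prime = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        m_prime = self.proj(m_prime)                       # attention-weighted m'
        return self.out(m + m_prime)                       # F1 = rho_bn(f(m' (+) m))
```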
The first bottleneck module (MBottleneck) receives F_1 as input and first performs channel compression through a 1 × 1 convolution layer, followed by batch normalization and ReLU activation. The advantage of this operation is that it reduces the number of channels and the computational complexity, while maintaining the integrity of the main information.
This intermediate feature retains the main feature information after preliminary compression, providing an effective input for the subsequent Reparam Large-Kernel Convolution (ReLK) and Squeeze-and-Excitation (SE) modules. In the ReLK module, the reparameterization strategy for large convolution kernels is used to enhance the spatial perception ability of the features. This operation enables the model to capture a wider range of spatial information and context through large convolution kernels, significantly improving the performance of the model when processing large-scale information. Next, the SE module dynamically weights the channel features through global average pooling and nonlinear activation to enhance important features (denoted F_1′). The advantage of the SE module is that by assigning different weights to each channel, the model can pay more attention to channel information with higher discrimination, thereby improving the selectivity of feature expression and the discrimination ability of the model. After ReLK and SE processing, F_1′ is further sent to the 3 × 3 convolution, and after batch normalization and ReLU activation, the final output feature F_2 is generated, which is the output of the first bottleneck module. The 3 × 3 convolution further refines the feature extraction, enhances the ability to capture local information, and makes the final output feature more detailed.
The second bottleneck module performs similar operations on F_2. First, it generates intermediate features through 1 × 1 convolution layers, batch normalization, and activation functions, and then further enhances them through the ReLK and SE modules. Finally, it obtains the final output X_{n+1} through 3 × 3 convolution. This repeated bottleneck structure enables the model to gradually enhance the expressiveness of features at different levels by stacking multiple layers. After stacking features layer by layer, it provides a more comprehensive feature representation for the final output, further improving the model's discrimination ability and adaptability. In this process, m and m′ are intermediate features generated in the MADM module, representing the downsampled feature and the attention-weighted result, respectively; F_1′ is the intermediate feature obtained after the 1 × 1 convolution and nonlinear activation in the bottleneck module (MBottleneck) has been further compressed and filtered by the ReLK and SE modules, ensuring a balance between feature effectiveness and computational efficiency. Overall, this design not only improves the model's feature extraction capability through the combination of multiple technologies but also enhances its ability to understand information at different scales and global contexts. The calculation formula of the feature map F_2 is as follows:
$F_2 = F_1 \oplus \rho_{\mathrm{bn}}\left(f_1^{3 \times 3}\left(F'_1\right)\right)$
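A compact sketch of the modified bottleneck is given below in PyTorch-style Python: 1 × 1 compression, a large-kernel depthwise convolution standing in for the full ReLK block (its multi-branch, reparameterized form is sketched after Figure 4), SE channel re-weighting, a 3 × 3 expansion, and the residual addition. The 4× compression ratio and the 9 × 9 kernel are assumptions made for illustration.

```python
# Minimal MBottleneck sketch: compress -> large-kernel depthwise conv -> SE ->
# 3x3 expansion, with a residual connection (F2 = F1 (+) rho_bn(f(F1'))).
import torch
import torch.nn as nn

class SE(nn.Module):
    """Squeeze-and-Excitation: re-weights channels using global average pooling."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.fc = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(ch // r, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)

class MBottleneck(nn.Module):
    def __init__(self, ch, lk=9):
        super().__init__()
        mid = ch // 4                                       # assumed compression ratio
        self.compress = nn.Sequential(nn.Conv2d(ch, mid, 1, bias=False),
                                      nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        # Large-kernel depthwise convolution as a stand-in for ReLK.
        self.relk = nn.Conv2d(mid, mid, lk, padding=lk // 2, groups=mid, bias=False)
        self.se = SE(mid)
        self.expand = nn.Sequential(nn.Conv2d(mid, ch, 3, padding=1, bias=False),
                                    nn.BatchNorm2d(ch))
        self.act = nn.ReLU(inplace=True)

    def forward(self, f1):
        f1_prime = self.se(self.relk(self.compress(f1)))    # F1' after ReLK and SE
        return self.act(f1 + self.expand(f1_prime))         # F2
```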
To further enhance the extraction of change information from a larger receptive field, we improved the bottleneck structure originally present in ResNet. Specifically, we replaced the 3 × 3 convolution in the residual block with a reparameterized large-kernel convolution. By employing reparameterization techniques, we can effectively utilize very large convolution kernels without a substantial increase in the number of parameters, thereby minimizing the computational costs associated with training and inference.
The advantages of reparameterized large-kernel convolution in remote sensing image feature extraction are primarily evident in the integration of multi-scale spatial features, the application of a multi-branch topology, and optimizations for lightweight model design and resource efficiency. This approach facilitates the effective extraction of advanced semantic information and significantly enhances the feature recognition capabilities. These benefits allow for more accurate and efficient processing of remote sensing images, enabling the capture of intricate details and contextual information, while maintaining high performance under resource constraints. Consequently, this provides a robust tool for the interpretation and analysis of remote sensing imagery.
In the Modified Bottleneck, the ultra-large-kernel convolution is decomposed into six depthwise separable convolutions, each utilizing different kernel sizes and dilation rates. For further details, refer to Figure 4.
The receptive field of convolution is influenced by both the convolution kernel size and the dilation rate. In the Modified Bottleneck, the convolution kernel sizes of the six depthwise separable convolutions are set to 9 × 9, 5 × 5, 5 × 5, 3 × 3, 3 × 3, and 3 × 3. By adjusting the dilation rates, we can effectively achieve an equivalent replacement of the ultra-large-kernel convolutions with these varying kernel sizes. This strategy enhances the model’s ability to capture features across different scales, while maintaining computational efficiency.
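The sketch below illustrates this decomposition idea in PyTorch-style Python: six parallel depthwise convolutions whose kernel sizes follow the 9 × 9, 5 × 5, 5 × 5, 3 × 3, 3 × 3, 3 × 3 arrangement above, each with a hypothetical dilation rate so that its effective kernel size, k_eff = (k − 1) · d + 1, covers part of the ultra-large kernel; the branch outputs are summed and can be merged into a single large kernel after training. The dilation rates shown are assumptions, as the paper does not list them.

```python
# Illustrative ReLK decomposition: parallel depthwise dilated convolutions that
# jointly approximate one ultra-large-kernel convolution. Dilation rates here are
# hypothetical; each branch keeps the spatial size via "same" padding.
import torch
import torch.nn as nn

class ReLKBranches(nn.Module):
    def __init__(self, ch,
                 specs=((9, 1), (5, 2), (5, 3), (3, 4), (3, 5), (3, 6))):
        super().__init__()
        self.branches = nn.ModuleList()
        for k, d in specs:                        # (kernel size, dilation rate)
            pad = d * (k - 1) // 2                # effective kernel: (k - 1) * d + 1
            self.branches.append(nn.Conv2d(ch, ch, k, padding=pad, dilation=d,
                                           groups=ch, bias=False))
        self.bn = nn.BatchNorm2d(ch)

    def forward(self, x):
        # Branch outputs are summed; after training, the parallel kernels can be
        # reparameterized (merged) into one large depthwise kernel for inference.
        return self.bn(sum(branch(x) for branch in self.branches))
```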

3.3. Cross-Dimensional Large-Kernel Attention Fusion Module

Following the BFFM, RAResNet, and ASPP, five feature maps X_N (N = 1, 2, 3, 4, 5) of varying dimensions are obtained. To effectively integrate high-dimensional semantic information with low-dimensional spatial data and enhance the positioning and shape representation of targets, this paper introduces the CD-LKAFM. The structure of the CD-LKAFM is illustrated in Figure 5. The core feature of the CD-LKAFM structure lies in the combination of feature fusion and an attention mechanism. The following paragraphs detail the processes and functionalities of the various modules within this structure:
CD-LKAFM is designed to enhance feature representation through a multi-branch architecture and attention mechanisms. Typically, the input x_1 is the feature extracted by RAResNet, while x_2 represents the output of the previous CD-LKAFM, where x_2 is twice the spatial size of x_1. To accommodate this size difference, the smaller input feature map x_1 is first enlarged through upsampling to produce x_1′, ensuring it matches the dimensions of x_2. This upsampling typically employs interpolation methods to maintain the smoothness and continuity of the feature map, facilitating effective integration with higher-resolution features.
Next, x_1′ is added to x_2, yielding the fused feature map t, which enriches the representation by combining features of different scales and thereby aids in the capture of multi-level information. This feature map is then processed through a sequence of convolution, batch normalization, and ReLU (ConvBR), which serves to extract and enhance the effective information present in the input features. The ConvBR layer is critical for extracting local features and boosting the network's nonlinear capabilities. Specifically, the convolution extracts local patterns, batch normalization accelerates training and mitigates the gradient vanishing issue, and the ReLU activation function enhances the network's nonlinear representation.
Following this, the feature map t is transformed into s_1 through ConvBR and Reparameterized Large-Kernel Convolution (RepLK). The application of RepLK is intended to capture a broader range of feature patterns, enabling the network to better understand complex spatial relationships.
Simultaneously, s_1 is divided into two branches for processing: the maximum value branch and the average value branch. These branches are designed to extract distinct types of important features, with the maximum value branch focusing on local maximum responses and the average value branch capturing overall average responses, thereby providing a multi-faceted understanding of the features. The combined output from these branches is passed through a convolution followed by a sigmoid activation function (ConvS) to generate s_2. The sigmoid function produces attention weights, allowing the network to dynamically adjust its focus on different features.
Next, s_2 is multiplied element-wise with the original s_1 to produce s_3, effectively introducing the attention mechanism. This weighted approach enhances significant features, while suppressing less important ones, improving the overall feature representation and enabling the network to concentrate on critical areas.
Concurrently, the feature map t is processed through global average pooling (GAP) to yield c_1. This operation captures global feature information, assisting the network in understanding the broader context. Subsequently, c_1 is transformed into c_2 through the ConvBR layer and the ConvS layer, further enhancing the feature expression capabilities of the model.
The feature map c_2 is multiplied by t to generate c_3, achieving the fusion of global information with the original features and enhancing the feature diversity and robustness. Subsequently, s_3 undergoes processing through a ConvBR layer and is added to c_3 to establish a residual connection among the features. This design helps mitigate the gradient vanishing problem and facilitates the effective transfer of features throughout the network. Finally, the output y is generated through another ConvBR layer, which integrates multi-level features and improves the overall expressiveness and robustness of the model. The fusion of high-dimensional semantic information with low-dimensional spatial information significantly enhances the model's performance in target positioning and recognition tasks.
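To summarize the data flow just described, the sketch below implements CD-LKAFM in PyTorch-style Python under the assumption that x_1 and x_2 share the same channel count after upsampling; the 9 × 9 RepLK kernel, the 7 × 7 spatial-attention convolution, and the channel sizes are illustrative choices, not the paper's exact settings.

```python
# Minimal CD-LKAFM sketch: upsample x1 to x1', fuse with x2 to get t, build a
# large-kernel spatial-attention branch (s1 -> s2 -> s3) and a global-pooling
# channel branch (c1 -> c2 -> c3), then combine them residually into y.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_br(in_ch, out_ch, k=3):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class CDLKAFM(nn.Module):
    def __init__(self, ch, lk=9):
        super().__init__()
        self.lk = nn.Sequential(conv_br(ch, ch),                  # ConvBR
                                nn.Conv2d(ch, ch, lk, padding=lk // 2,
                                          groups=ch, bias=False))  # RepLK -> s1
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())
        self.channel = nn.Sequential(conv_br(ch, ch, k=1),
                                     nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.refine = conv_br(ch, ch)
        self.out = conv_br(ch, ch)

    def forward(self, x1, x2):
        x1_up = F.interpolate(x1, size=x2.shape[-2:], mode='bilinear',
                              align_corners=False)            # x1'
        t = x1_up + x2                                          # fused feature map t
        s1 = self.lk(t)
        # Max / average summaries of s1 feed a sigmoid to give attention map s2.
        s2 = self.spatial(torch.cat([s1.max(dim=1, keepdim=True).values,
                                     s1.mean(dim=1, keepdim=True)], dim=1))
        s3 = s1 * s2                                            # spatially re-weighted
        c2 = self.channel(F.adaptive_avg_pool2d(t, 1))          # GAP (c1) -> c2
        c3 = t * c2                                             # channel re-weighted
        return self.out(self.refine(s3) + c3)                   # output y
```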

4. Experimental Setup

4.1. Dataset Introduction

GVLM-CD [29] is the first large-scale open-source dataset for very-high-resolution (VHR) landslide mapping, comprising 17 pairs of bi-temporal images with a resolution of 0.59 m, sourced from Google Earth. The images were cropped to a size of 256 × 256 pixels without overlap, resulting in a total of 7327 pairs. GVLM-CD is suitable for developing and evaluating machine learning and deep learning models for change detection and related applications. Figure 6 shows some examples from the GVLM-CD dataset.
SYSU-CD [30] consists of 20,000 pairs of 0.5 m high-resolution aerial images, capturing seven years of land cover and land use changes in the Hong Kong region. Compared to other datasets, SYSU-CD provides complementary examples related to the evolution of high-rise buildings and ports. However, due to the diverse range of change types included, SYSU-CD also presents more distracting information, making the detection of change targets more challenging. Figure 7 shows some examples from the SYSU-CD dataset.
LEVIR-CD [31] comprises 637 very high-resolution (VHR, 0.5 m/pixel) image patch pairs from Google Earth, each sized at 1024 × 1024 pixels. These bi-temporal images, spanning 5 to 14 years, exhibit significant land-use changes, particularly in construction growth. The dataset includes various types of buildings, such as villas, high-rise apartments, small garages, and large warehouses. LEVIR-CD is an excellent resource for evaluating change detection metrics in deep learning models. Figure 8 shows some examples of the LEVIR-CD dataset.

4.2. Experimental Setting and Metrics

All experiments were conducted on an NVIDIA V100 GPU, utilizing the Adam optimizer along with a polynomial learning rate adjustment strategy. The polynomial decay is calculated as follows:
$lr = base\_lr \times \left(1 - \frac{epoch}{num\_epoch}\right)^{power},$
where lr is the new learning rate, base_lr is the initial learning rate, epoch is the current iteration number, num_epoch is the maximum number of iterations, and power is a constant that controls the decay rate. In this paper, base_lr = 5 × 10⁻⁵, num_epoch = 200, and power = 0.9. In addition, the batch size was set to 8. Cross-entropy was used for loss calculation; in the PaddlePaddle framework, paddle.nn.CrossEntropyLoss integrates softmax and cross-entropy loss calculations internally, allowing for direct computation using logits and labels, without manually applying softmax. Thus, the cross-entropy calculation can be simplified to
$L = -\frac{1}{N}\sum_{n=1}^{N} y_n \log(p_n),$
where p_n is the probability distribution of sample n predicted by the model, and y_n is the true label of sample n.
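For reference, the snippet below restates the learning rate schedule and the averaged cross-entropy loss in plain Python/NumPy; it is an illustrative re-implementation of the formulas above, whereas the actual training used paddle.nn.CrossEntropyLoss, which applies softmax to the logits internally.

```python
# Polynomial learning rate decay and averaged cross-entropy, as defined above.
import numpy as np

def poly_lr(epoch, base_lr=5e-5, num_epoch=200, power=0.9):
    """lr = base_lr * (1 - epoch / num_epoch) ** power"""
    return base_lr * (1.0 - epoch / num_epoch) ** power

def cross_entropy(logits, labels):
    """logits: [N, C] raw scores; labels: [N] integer class ids."""
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

print(poly_lr(0), poly_lr(100), poly_lr(199))  # decays from 5e-5 toward zero
```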
Generally, the model converged within the 200 training epochs. To ensure the effectiveness of the training process, we monitored the changes in the loss function during training and observed that the loss value stabilized within this iteration count, with the model performance reaching the expected convergence. Training used gradient descent, with parameters updated continuously through backpropagation, and the best-performing model parameters were saved.
The datasets were divided as follows: for the GVLM dataset, the split was roughly 80:10:10, meaning the validation and test sets each contained 733 samples, while the training set contained 5861 samples. The SYSU-CD dataset was split roughly 60:20:20, meaning the validation and test sets each contained 4000 samples, while the training set contained 12,000 samples. The LEVIR-CD dataset was split roughly 70:20:10, with 256 samples in the validation set, 512 samples in the test set, and 1780 samples in the training set. These splits provide sufficient data for training, while maintaining a reasonable proportion of data for validation and testing.

4.3. Ablation Experiments and Result Analysis

To evaluate the contribution of each module in FFLKCDNet to the overall performance of the model, we conducted an ablation experiment using the GVLM dataset. By systematically removing key modules from FFLKCDNet and comparing the results under consistent experimental conditions, we could effectively quantify the impact of each module. The following ablation experiments were performed, and the results are presented in Table 1.
The BFFM module is primarily designed to fuse remote-sensing image features across different time frames and to suppress noise in the early stages. In the experiment, we trained and tested the model after removing the BFFM module. The results indicated a significant drop in accuracy across all datasets, particularly in scenes with high background noise. This underscores the crucial role of BFFM in effectively fusing temporal information and initially mitigating noise.
We replaced RAResNet in the model with the standard ResNet50 to assess the impact of the large-kernel convolution and multi-attention mechanism incorporated in RAResNet. The experimental results revealed a decrease in kappa, mIoU, and F1 scores when using the standard ResNet50. This indicates that large-kernel convolution effectively extracts a broader range of features within a larger receptive field, enhancing the model’s capacity to capture long-range dependencies and large-scale changes, which is crucial for addressing significant transformations. Additionally, the multi-attention mechanism plays a vital role in capturing intricate details and spatiotemporal relationships within remote sensing images.
CD-LKAFM is utilized to fuse semantic and spatial information of varying dimensions during the feature recovery stage. Upon removing this module, we observed a decline in the model’s ability to perform fine-grained feature extraction, particularly reflected in the mIoU and F1 Score metrics. This highlights the critical role of CD-LKAFM in mitigating noise and effectively integrating information across different dimensions.
When analyzing the removal of different module combinations, the exclusion of BFFM and RAResNet (leaving only CD-LKAFM) resulted in a decrease in Kappa to 0.7942, MIoU to 0.8032, F1 to 0.8964, and MPA to 0.9237. The primary function of the BFFM module is to fuse features from different time frames of remote sensing images. Particularly for time-series data, this aids in capturing temporal variations and effectively suppresses noise in the early stages. RAResNet, on the other hand, utilizes large-kernel convolutions and a multi-attention mechanism to extract features from the images, enhancing the model’s ability to capture complex details and long-range dependencies. The combination of these two modules facilitates both temporal feature fusion and enhanced detail extraction. BFFM provides stable feature fusion across time scales, supporting RAResNet in extracting finer details at the spatial scale. The multi-attention mechanism of RAResNet helps the model focus on more crucial features across different time frames, improving the quality of the temporal information fused by BFFM. Therefore, the integration of BFFM and RAResNet complementarily strengthens the extraction of spatiotemporal features and the retention of details. Removing BFFM results in a loss of the model’s capacity to handle temporal features, impairing the feature extraction capabilities of RAResNet, which leads to a decline in overall performance. Similarly, removing RAResNet weakens the model’s feature extraction capacity, which in turn diminishes the temporal feature fusion effect of BFFM, negatively impacting the model’s performance. After the removal of both BFFM and RAResNet, the model’s ability to extract features and fuse temporal information is significantly weakened. The CD-LKAFM module, when used alone, cannot compensate for these deficiencies, leading to a notable performance degradation.
Upon removal of RAResNet and CD-LKAFM (retaining only BFFM), the Kappa value was 0.7964, MIoU was 0.7855, F1 was 0.8856, and MPA was 0.9211. The deep features extracted by RAResNet provide CD-LKAFM with rich semantic information, while CD-LKAFM further enhances the representation of these features through multi-scale fusion, particularly in detail recovery and semantic integration. The combination of these two modules enables better handling of complex details and multi-scale variations in remote sensing images. When RAResNet is removed, CD-LKAFM cannot effectively leverage deep features for detail recovery, leading to insufficient detail expression. Conversely, while RAResNet captures complex image features, removing CD-LKAFM results in a lack of detail recovery and multi-scale information fusion, thus impairing the overall image quality and semantic representation. This suggests that although BFFM performs temporal feature fusion and noise suppression, the removal of RAResNet and CD-LKAFM significantly reduces the model’s feature extraction and detail recovery capabilities, resulting in a weakened overall performance.
Finally, when BFFM and CD-LKAFM were removed (retaining only RAResNet), the Kappa value was 0.8026, MIoU was 0.8049, F1 was 0.8987, and MPA was 0.9393. The primary function of BFFM is to fuse features across different temporal scales, while CD-LKAFM performs multi-scale information fusion during the feature recovery phase, enhancing the model’s ability to restore details. BFFM aids in capturing dynamic temporal information by fusing time-series data, whereas CD-LKAFM focuses on multi-scale fusion, strengthening the integration of image details and semantic information. BFFM provides CD-LKAFM with fused temporal features, which CD-LKAFM further integrates at multiple scales and semantic levels. The combination of these two modules enhances the spatial-temporal feature representation of the image, particularly for fine-grained feature recovery and multi-scale information fusion, exhibiting a synergistic effect. After the removal of BFFM, CD-LKAFM’s multi-scale information fusion loses the support of temporal information, limiting its ability to recover details. Conversely, removing CD-LKAFM causes the model to lose the ability to address fine-grained feature and spatial information fusion, leading to insufficient detail expression in the temporal features fused by BFFM, thereby impairing the overall performance. Although RAResNet provides strong feature extraction capabilities, the absence of BFFM’s temporal feature fusion and CD-LKAFM’s multi-scale information integration reduces the model’s pixel-level accuracy compared to the complete model.
In summary, the synergistic interaction of BFFM, RAResNet, and CD-LKAFM is crucial for the performance of the model. BFFM effectively integrates temporal information and suppresses noise; RAResNet enhances the feature extraction capabilities, particularly in capturing complex details and long-range dependencies; while CD-LKAFM plays a critical role in feature recovery and multi-scale information fusion. The removal of any module, especially the combined modules, results in a significant degradation in model performance, further underscoring the complementary and synergistic roles of these three modules in ensuring the success of the model.
The overall results of the ablation experiments demonstrate that each module contributes significantly to the model’s performance. Notably, the multi-attention mechanism and large-kernel convolution in RAResNet, along with the multi-dimensional fusion capabilities of CD-LKAFM, markedly enhanced the model’s ability to detect changing targets in remote sensing images.
Additionally, to demonstrate the consistency and robustness of the proposed FFLKCDNet across different datasets, we further conducted ablation experiments on the SYSU-CD and LEVIR-CD datasets. The results, presented in Table 2 and Table 3, further support the conclusions drawn from the GVLM dataset.

4.4. Comparative Experiment and Result Analysis

4.4.1. Comparisons on GVLM

To comprehensively evaluate the performance of the proposed FFLKCDNet, we conducted multiple experiments comparing it with BIT-CD [32], FC-Siam-diff [7], ChangeFormer [16], ResUNet [33], DSAMNet [12], MSCANet [19], DSIFNet [34], DTCDSCNet [35], ICIFNet [20], SNUNet [11], and USSFCNet [36]. Table 4 presents a detailed comparison of key evaluation metrics across the different models on the GVLM dataset, including the Kappa coefficient, mean intersection over union (mIoU), mean pixel accuracy (MPA), and F1 score. FFLKCDNet demonstrated a strong performance across all metrics, underscoring its effective change detection capabilities.
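For clarity, the snippet below shows how these four metrics can be computed from a binary (change / no-change) confusion matrix using their standard definitions; it illustrates the metrics only and is not the authors' evaluation code.

```python
# Kappa, mIoU, MPA, and F1 from a binary confusion matrix (tn, fp, fn, tp).
def change_detection_metrics(tn, fp, fn, tp):
    total = tn + fp + fn + tp
    po = (tp + tn) / total                                    # observed agreement
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (po - pe) / (1 - pe)                              # chance-corrected agreement
    iou_change = tp / (tp + fp + fn)
    iou_unchanged = tn / (tn + fn + fp)
    miou = (iou_change + iou_unchanged) / 2
    mpa = (tp / (tp + fn) + tn / (tn + fp)) / 2               # mean per-class accuracy
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"kappa": kappa, "mIoU": miou, "MPA": mpa, "F1": f1}

# Toy example (hypothetical pixel counts):
print(change_detection_metrics(tn=9000, fp=150, fn=200, tp=650))
```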
FFLKCDNet achieved an exceptional balance between computational efficiency and performance, showcasing its advantages across multiple dimensions. Compared to the other models with high computational complexity, FFLKCDNet had a significantly lower GFLOPs of 56.28, in contrast to methods such as BIT-CD (206.03 GFLOPs) and MSCANet (164.82 GFLOPs). While these high-complexity models exhibited strong performance, their computational demands are substantial, leading to a heavy reliance on computational resources. In contrast, FFLKCDNet leverages a thoughtful design and optimization to maintain efficient computation, while still achieving near-optimal performance, demonstrating its superior computational efficiency.
One of the key reasons for FFLKCDNet’s ability to maintain low computational complexity while delivering exceptional performance lies in its use of large-kernel convolutions. Although large-kernel convolutions are, theoretically, computationally more demanding, their advantage lies in processing broader regions in a single step, effectively merging more information. This method of information fusion enables FFLKCDNet to perform complex feature extraction with fewer layers, avoiding the redundant stacking of layers typical in traditional small-kernel convolution networks, thereby reducing additional computational overhead. In contrast, high-complexity models such as BIT-CD and MSCANet, while utilizing more layers and finer feature extraction, result in a significant increase in computational load, with their redundant layers and operations similarly contributing to a higher resource consumption.
Additionally, FFLKCDNet had a parameter count of 68.47 M, which, while slightly higher than some models with fewer parameters, such as FC-Siam-diff (40.19 M), remains a relatively reasonable choice when considering its optimization of both accuracy and computational efficiency. The use of large-kernel convolutions allows FFLKCDNet to achieve efficient feature extraction and robust model performance, with a parameter count that is well-suited for a wide range of complex scenarios. When compared to the other methods, FFLKCDNet not only excels in computational efficiency but also ensures that the parameter count contributes to its resilience and efficiency in various applications.
In summary, FFLKCDNet strikes an ideal balance between computational efficiency, parameter count, and performance through the strategic use of large-kernel convolutions and an optimized network structure. Compared to the other high-complexity models, FFLKCDNet demonstrated a lower computational overhead, higher resource utilization efficiency, and maintained a competitive edge in terms of accuracy.
For instance, FFLKCDNet achieved a Kappa coefficient of 0.8538, compared to 0.7921 for FC-Siam-diff and 0.8133 for BIT-CD. The Kappa coefficient measures the agreement between predicted and actual classifications, while accounting for chance agreement. A higher Kappa value indicates improved consistency and precision in change detection, highlighting FFLKCDNet’s ability to minimize misclassifications, even in complex regions with diverse land cover types.
Moreover, the mean intersection over union (mIoU) reached 0.8708. This metric is essential, as it quantifies the overlap between predicted change areas and the ground truth, with higher values indicating better segmentation quality. This significant improvement suggests that FFLKCDNet is more effective at capturing subtle differences between the two temporal images, successfully distinguishing changes from static areas.
FFLKCDNet also excelled in MPA, achieving 0.985 and surpassing ChangeFormer’s 0.9809. This indicates that the proposed model correctly identified a larger proportion of pixels, leading to more accurate overall change maps. The F1 Score of 0.927, significantly higher than USSFCNet’s 0.9054, further highlights the model’s balance between precision and recall, validating FFLKCDNet’s robustness in accurately identifying change areas, while minimizing false positives and negatives. The comparative analysis clearly demonstrates that FFLKCDNet outperformed the other models, attributable to its unique architecture that integrates multi-scale feature fusion and large-kernel attention mechanisms. This design enables the model to capture both global contextual information and fine-grained local details, providing a comprehensive understanding of changes within images.
Figure 9 visually compares the change detection results generated by FFLKCDNet with those of the other models. It presents a side-by-side view of predicted change maps, alongside ground truth labels for reference. As shown, FFLKCDNet produced sharper, more precise boundaries around change areas, particularly in regions where changes were subtle or dispersed across complex landscapes. For example, in areas marked by significant vegetation loss, models like ResUNet and ChangeFormer exhibited less precise boundaries, often merging static and change regions, which led to misclassification. Specifically, ResUNet tended to blur the transition between non-change and change regions, especially when changes were subtle, such as in the case of minor vegetation loss. ChangeFormer, on the other hand, struggled with distinguishing between closely related areas, leading to misclassifications where the model incorrectly identified areas as changed when they had remained static. In contrast, FFLKCDNet distinctly delineated these boundaries, maintaining the integrity of non-change areas, while accurately identifying regions of change.
In other cases, such as terrain alterations caused by floods or landslides, models like DSAMNet and USSFCNet performed less effectively, due to their reliance on shallow feature extraction, which does not fully capture complex landscape variations. These models often failed to recognize smaller changes, such as slight shifts in terrain structure that are crucial for accurate change detection. Conversely, FFLKCDNet, with its use of large-kernel convolutions, excelled in these scenarios by considering a broader context, enhancing its ability to detect subtle but significant changes in the terrain. It captures even the smallest variations, making it more suitable for detecting subtle changes in challenging environments. The models FC-Siam-diff and BIT-CD, while effective in detecting larger changes, also suffered from an inability to detect fine-grained changes. BIT-CD, for instance, sometimes merged areas that should have been separated, particularly in urban landscapes, where small changes in building structures or roads are critical. FC-Siam-diff, although useful for capturing large-scale differences, did not perform as well when subtle spatial changes were distributed over larger areas, resulting in reduced precision.
FFLKCDNet, on the other hand, outperformed these models across the various scenarios by consistently producing more accurate and sharper boundaries, especially in complex, real-world applications where changes are subtle and require the model to distinguish between small variations in time or space. This performance boost is largely due to the integration of the BFFM and RAResNet backbone, which enhances the model’s capability to focus on critical change regions and better understand complex temporal dynamics.
The visual results further reinforce the quantitative findings, demonstrating that FFLKCDNet achieved a more accurate and reliable segmentation of change areas, with fewer false positives and better-defined change boundaries. By effectively utilizing CD-LKAFM, the model balanced the extraction of high-level semantic features with the preservation of low-level spatial details, leading to a superior performance on high-resolution remote sensing images. In conclusion, the experiments on the GVLM dataset demonstrated that FFLKCDNet delivered state-of-the-art performance, surpassing existing models in terms of accuracy, consistency, and robustness. The success of FFLKCDNet underscores the advantages of its architectural innovations, including the integration of multi-scale temporal feature fusion and the application of large-kernel convolutions for enhanced change detection. These findings affirm FFLKCDNet’s potential as a powerful tool for remote sensing applications, particularly in scenarios where accurate and detailed change analysis is critical.

4.4.2. Comparisons on SYSU

The SYSU-CD dataset serves as a robust testing ground for change detection models, comprising 20,000 pairs of high-resolution (0.5 m) aerial images that capture over seven years of land cover and land use changes in the Hong Kong region. This dataset is notable for its diversity, documenting changes such as the growth of high-rise buildings, expansion of port areas, and other urban developments. However, it also presents significant challenges, due to the inclusion of varied and distracting information, which complicates the task of distinguishing true change targets from background noise. Table 5 presents a comparison of various models on the SYSU dataset, utilizing key metrics such as the Kappa coefficient, mIoU, MPA, and F1 Score. FFLKCDNet demonstrated exceptional performance, particularly in its ability to maintain high accuracy despite the complex and cluttered nature of the dataset. This highlights its robustness in effectively distinguishing change targets amidst diverse and distracting information.
For example, FFLKCDNet achieved a Kappa coefficient of 0.6842, significantly improving over USSFCNet’s 0.6435 and ResUNet’s 0.6521. This higher Kappa value reflects FFLKCDNet’s robustness in accurately identifying true changes while minimizing false detections. The mIoU of 0.738 achieved by FFLKCDNet sets a new benchmark on the SYSU dataset, surpassing DSAMNet’s previous best of 0.7312. This achievement highlights FFLKCDNet’s superior feature extraction and fusion strategies, which can effectively segment complex, multi-scale changes. Additionally, FFLKCDNet delivered an MPA of 0.8944, outpacing BIT-CD’s 0.8849, demonstrating its capacity to accurately classify pixels in densely built-up areas with overlapping features. The F1 Score of 0.8422 further emphasizes the model’s balance between precision and recall, effectively addressing both under-detection and over-detection issues.
These results demonstrate that FFLKCDNet’s integration of multi-attention mechanisms and reparameterizable large-kernel convolutions enables it to more effectively capture intricate details in high-resolution imagery. This capability is particularly advantageous for detecting gradual and subtle changes in urban areas, such as new constructions or expansions of existing infrastructures, which might be overlooked by other models. The model’s ability to maintain high performance in complex environments underscores its potential for practical applications in urban change detection.
Figure 10 illustrates a set of visual examples comparing the change maps generated by FFLKCDNet and other models. In highly urbanized regions with dense building structures, FFLKCDNet provided clearer and more accurate delineations of change areas. The comparison reveals that models like ChangeFormer and DSAMNet often produced fragmented or blurry change maps, struggling to accurately segment complex urban features. This limitation arose from their difficulties in capturing multi-scale dependencies and effectively fusing global and local features. In contrast, FFLKCDNet leveraged the CD-LKAFM to integrate contextual information across varying spatial scales, resulting in more cohesive and precise change maps. This highlights FFLKCDNet’s superiority in navigating the complexities of urban environments.
For example, in Figure 10, FFLKCDNet effectively identified both the height and spatial spread of new high-rise buildings, a challenging task given the overlapping shadows and occlusions typical in urban landscapes. Other models, particularly ChangeFormer, struggled in these areas, often misclassifying unchanged regions as new developments, which resulted in false positives.
FFLKCDNet’s enhanced segmentation accuracy can be attributed to its use of large-kernel convolutions, which provide a broader receptive field to capture long-range dependencies. This capability allows the model to better distinguish between true changes and artifacts caused by occlusions. In conclusion, the experiments on the SYSU dataset further validated the effectiveness of FFLKCDNet. Its outstanding performance in diverse and challenging urban environments demonstrates its ability to generalize across various change detection scenarios. By leveraging innovations like cross-dimensional feature fusion and large-kernel attention, FFLKCDNet consistently surpassed existing models, establishing a new benchmark for change detection accuracy in high-resolution remote sensing.
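The exact RAResNet blocks are defined in the released code; as a rough illustration of the reparameterized large-kernel idea referred to above, the PyTorch sketch below trains a large depthwise kernel alongside a small one and then folds both into a single kernel for inference. The class and method names and the kernel sizes are our assumptions, not the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepLKConv(nn.Module):
    """Sketch of a reparameterizable large-kernel depthwise convolution.
    Training runs a large and a small branch in parallel; at inference the
    small kernel is zero-padded and merged into the large one (single conv)."""
    def __init__(self, channels, large_k=13, small_k=3):
        super().__init__()
        self.large = nn.Conv2d(channels, channels, large_k, padding=large_k // 2,
                               groups=channels, bias=True)
        self.small = nn.Conv2d(channels, channels, small_k, padding=small_k // 2,
                               groups=channels, bias=True)
        self.fused = None  # created by reparameterize()

    def forward(self, x):
        if self.fused is not None:          # inference path: one convolution
            return self.fused(x)
        return self.large(x) + self.small(x)  # training path: two parallel branches

    def reparameterize(self):
        # Zero-pad the small kernel to the large kernel size and add the weights.
        pad = (self.large.kernel_size[0] - self.small.kernel_size[0]) // 2
        w = self.large.weight + F.pad(self.small.weight, [pad] * 4)
        b = self.large.bias + self.small.bias
        self.fused = nn.Conv2d(self.large.in_channels, self.large.out_channels,
                               self.large.kernel_size[0],
                               padding=self.large.padding[0],
                               groups=self.large.groups, bias=True)
        self.fused.weight.data, self.fused.bias.data = w.data, b.data
```

Because both branches use stride 1 and matching padding, summing their outputs is mathematically identical to convolving with the summed (padded) kernel, so the merged convolution preserves the trained behavior while keeping the broad receptive field.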

4.4.3. Comparisons on LEVIR

The LEVIR-CD dataset serves as a significant benchmark for change detection, comprising 637 pairs of very-high-resolution (VHR, 0.5 m/pixel) image patches, each sized 1024 × 1024 pixels. These bi-temporal images cover a period of 5 to 14 years, effectively capturing substantial land-use changes, particularly in urban and suburban areas characterized by new constructions such as villas, tall apartments, small garages, and large warehouses. This dataset is particularly suited for assessing models’ capabilities in detecting structural changes over extended periods. Table 6 details the comparative performance of various models on the LEVIR dataset, evaluated using key metrics such as Kappa coefficient, mIoU, MPA, and F1 Score. As highlighted in the table, FFLKCDNet achieved exceptional results, setting new records across all evaluation metrics, further affirming its effectiveness in accurately detecting changes in high-resolution imagery.
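Bi-temporal pairs of this size are usually tiled into smaller patches before training and evaluation. The snippet below is a minimal illustration of such tiling; the 256-pixel patch size and the helper name are assumptions for illustration, not settings reported in this paper.

```python
import numpy as np

def tile_pair(img_a, img_b, label, patch=256):
    """Split a co-registered bi-temporal pair (H, W, C) and its label (H, W)
    into non-overlapping patches. The patch size is illustrative only."""
    h, w = label.shape
    patches = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            patches.append((img_a[y:y + patch, x:x + patch],
                            img_b[y:y + patch, x:x + patch],
                            label[y:y + patch, x:x + patch]))
    return patches

# e.g., a 1024 x 1024 LEVIR-CD pair yields 16 patches of 256 x 256
```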
FFLKCDNet achieved a Kappa coefficient of 0.9009, significantly improving upon the next best score of 0.8498 recorded by ChangeFormer. This high Kappa value indicates the model’s excellent ability to accurately classify both changed and unchanged areas, minimizing confusion, even amid complex construction patterns over extended periods. In terms of mIoU, FFLKCDNet recorded 0.9092, slightly surpassing ResUNet’s score of 0.908. This metric is particularly critical for datasets like LEVIR, where new constructions may occupy only a small fraction of the image. This superior mIoU underscores FFLKCDNet’s capability to accurately segment small, intricate structures, ensuring that even subtle changes are detected.
The MPA of 0.9921 achieved by FFLKCDNet further exemplifies its robust performance, exceeding USSFCNet’s score of 0.9878. This reflects the model’s effectiveness in maintaining high classification accuracy across the entire image, effectively capturing both large structures and minor features. Additionally, the F1 Score of 0.9505 emphasizes FFLKCDNet’s reliability in detecting and classifying new constructions, while minimizing false positives and negatives, reinforcing its overall effectiveness in change detection tasks.
These results affirm that FFLKCDNet’s architectural enhancements, including reparameterizable large-kernel convolutions and multi-attention mechanisms, significantly boosted its performance compared to the other models. This was particularly evident on the LEVIR dataset, where accurately detecting various building structures and land-use changes necessitates a model that excels in both local feature extraction and global context integration. FFLKCDNet’s ability to effectively combine these elements allows it to capture subtle changes, while maintaining high accuracy across different types of structural transformations.
Figure 11 illustrates visual comparisons of the change detection outputs generated by FFLKCDNet and the other models. As seen, FFLKCDNet consistently delivered sharper and more accurate segmentation results, particularly in areas with complex structural changes. For instance, in images depicting the construction of new residential blocks, FFLKCDNet precisely identified the new buildings, marking their boundaries with high fidelity. Other models, in contrast, could produce less distinct boundaries or misclassifications, underscoring FFLKCDNet’s superior capability to handle intricate urban transformations effectively. This ability highlights its potential for applications in urban planning and monitoring, where accurate change detection is critical. One example in Figure 11 highlights an area where new housing units have been added over the years. FFLKCDNet’s prediction accurately delineated these new structures, maintaining clear boundaries between the new and existing buildings. In contrast, ChangeFormer’s output tended to merge parts of the new construction with the old, leading to over-segmentation, while USSFCNet sometimes under-segmented, failing to detect smaller buildings. The improved precision in FFLKCDNet’s results can be attributed to its BFFM, which effectively merges temporal information, enhancing the detection of nuanced changes. This capability underscores FFLKCDNet’s robustness in complex urban environments, making it a valuable tool for monitoring land-use changes over time.
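The precise BFFM design is described earlier in the paper; the sketch below only conveys, in PyTorch, the general idea of early bi-temporal fusion with depthwise-separable multi-scale branches and a learned gate. All names, channel widths, and kernel sizes here are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class BiTemporalFusionSketch(nn.Module):
    """Illustrative early fusion of a bi-temporal pair: project the concatenated
    images, extract multi-scale responses with depthwise-separable branches,
    and blend them back through a learned gate."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.proj = nn.Conv2d(2 * in_ch, out_ch, 1)
        self.branches = nn.ModuleList([
            nn.Sequential(  # depthwise-separable convolution at several kernel sizes
                nn.Conv2d(out_ch, out_ch, k, padding=k // 2, groups=out_ch),
                nn.Conv2d(out_ch, out_ch, 1),
            ) for k in (3, 5, 7)
        ])
        self.gate = nn.Sequential(nn.Conv2d(out_ch, out_ch, 1), nn.Sigmoid())

    def forward(self, t1, t2):
        x = self.proj(torch.cat([t1, t2], dim=1))    # join the two acquisition dates early
        multi = sum(branch(x) for branch in self.branches)
        return x + self.gate(multi) * multi          # gated residual fusion
```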
In another instance, the dataset captures the expansion of an industrial warehouse, where FFLKCDNet proved to be particularly adept at detecting changes. Its CD-LKAFM enables the model to seamlessly integrate high-level semantic information with detailed spatial features, ensuring that even minor changes are accurately identified. In contrast, other models, such as BIT-CD, often struggled in these scenarios, missing subtle expansion areas or misclassifying unchanged regions as new. Furthermore, FFLKCDNet’s use of reparameterizable large-kernel convolutions enhances its ability to capture long-range dependencies, making it more effective at distinguishing between genuine changes and static noise. This capability significantly improves the model’s performance in complex environments, demonstrating its potential for reliable change detection in high-resolution remote sensing applications.
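The exact CD-LKAFM is likewise defined earlier in the paper; the following PyTorch sketch only illustrates the decoder-side idea of gating fine-grained low-level features with a large-kernel spatial attention map derived from upsampled high-level semantics. The module name, kernel size, and channel handling are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LargeKernelAttentionFusion(nn.Module):
    """Illustrative decoder-side fusion: an attention map computed from upsampled
    high-level semantics gates the low-level features before concatenation."""
    def __init__(self, high_ch, low_ch, out_ch, k=11):
        super().__init__()
        self.reduce = nn.Conv2d(high_ch, low_ch, 1)      # align channel counts
        self.lk_attn = nn.Sequential(                    # large-kernel spatial attention
            nn.Conv2d(low_ch, low_ch, k, padding=k // 2, groups=low_ch),
            nn.Conv2d(low_ch, low_ch, 1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(low_ch * 2, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, high, low):
        high = F.interpolate(high, size=low.shape[-2:], mode="bilinear",
                             align_corners=False)
        high = self.reduce(high)
        attn = self.lk_attn(high)        # where to look, driven by semantics
        low = low * attn                 # keep spatial detail in likely change regions
        return self.fuse(torch.cat([high, low], dim=1))
```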
Overall, the experiments on the LEVIR dataset further substantiated FFLKCDNet’s effectiveness in addressing the challenges of change detection in complex, high-resolution scenarios. Its superior performance across all metrics, coupled with its ability to produce accurate and clear segmentation maps, underscores its potential as a reliable tool for applications such as urban development monitoring, construction mapping, and infrastructure assessment. By leveraging innovative architectural components like the BFFM and the CD-LKAFM, FFLKCDNet achieves a delicate balance between local feature sensitivity and global context awareness, setting a new benchmark for high-resolution remote sensing change detection.
It is worth noting that, compared to the other two datasets, the proposed model demonstrated lower Kappa and mIoU scores on the SYSU-CD dataset, although it still outperformed the other models.
This performance discrepancy on the SYSU-CD dataset may have stemmed from several factors. First, the scenes and features in the SYSU-CD dataset differ significantly from those in the other two datasets (GVLM-CD and LEVIR-CD), which presented greater challenges for the model when processing this dataset. In particular, factors such as background complexity, lighting changes, and object occlusion may have limited the model’s cross-domain adaptation ability; despite the support of the BFFM and CD-LKAFM, these modules may not have performed as expected when handling such complex visual features. Second, the lack of significant temporal variation between images in the SYSU-CD dataset weakened the effectiveness of the BFFM, impacting overall performance. Additionally, the data quality of the SYSU-CD dataset may not be as high as that of the other datasets, with issues such as inconsistent annotations or blurred object boundaries affecting detection accuracy and further lowering the Kappa and mIoU scores. Finally, the convolutional and attention mechanisms may not have fully exploited their potential when facing the specific complexities of the SYSU-CD dataset: the model struggled to make accurate predictions under varying scales, occlusions, and environmental changes, resulting in a relatively poor performance on this dataset.
However, the metrics of all models on this dataset were lower than on the other two datasets; under the same experimental conditions, the proposed model still held a clear advantage over the other models, even though its absolute scores were lower. We will further investigate improving the relevant metrics for specific datasets in future work.

5. Conclusions

For the RSCD task, this paper proposed FFLKCDNet, which comprises BFFM, RAResNet, and CD-LKAFM modules. BFFM initially fuses change features from dual-temporal remote sensing images at different scales, while filtering out noise, thereby providing relevant information for further processing. RAResNet utilizes multi-attention mechanisms and reparameterized large-kernel convolutions to extract high-dimensional semantic change information over a larger receptive field. In the feature recovery stage, CD-LKAFM integrates multi-dimensional semantic information with spatial details, ensuring that features from various scales are effectively combined. This approach enhances the extraction of change features and strengthens local correlations, leading to improved prediction results and overall reliability in change detection tasks. Experimental comparisons with state-of-the-art models on the GVLM, LEVIR, and SYSU datasets demonstrated that FFLKCDNet outperformed the existing methods across different scenarios. Future work will focus on further exploring the integration of local and global information, and enhancing feature dependencies in remote-sensing images to achieve higher accuracy in change detection.
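As a purely illustrative summary of the data flow described above, the following PyTorch-style sketch wires the three components together; the module classes are placeholders rather than the released implementation.

```python
import torch.nn as nn

class FFLKCDNetSketch(nn.Module):
    """Data-flow sketch of the pipeline described above (placeholder modules):
    early bi-temporal fusion -> large-kernel attentive encoding -> attentive
    multi-scale decoding to a binary change map."""
    def __init__(self, bffm, raresnet, cd_lkafm_decoder, classifier):
        super().__init__()
        self.bffm = bffm                  # fuses the two acquisition dates first
        self.encoder = raresnet           # multi-attention, large-kernel backbone
        self.decoder = cd_lkafm_decoder   # fuses semantics with spatial detail
        self.classifier = classifier      # e.g., 1x1 conv to change / no-change logits

    def forward(self, img_t1, img_t2):
        fused = self.bffm(img_t1, img_t2)   # "first fusion" of the bi-temporal pair
        feats = self.encoder(fused)         # multi-scale feature maps
        decoded = self.decoder(feats)       # coarse-to-fine recovery
        return self.classifier(decoded)
```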

Author Contributions

Conceptualization, B.C.; methodology, B.C. and Y.W.; software, Y.W.; validation, X.Y. (Xu Yang) and X.Y. (Xiaochen Yuan); formal analysis, Y.W.; investigation, X.Y. (Xu Yang); resources, X.Y. (Xu Yang); data curation, Y.W.; writing—original draft preparation, Y.W.; writing—review and editing, B.C.; visualization, B.C.; supervision, X.Y. (Xiaochen Yuan); project administration, S.K.I.; funding acquisition, S.K.I. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by Macao Polytechnic University under grant No. RP/ESCA-08/2021.

Data Availability Statement

Publicly available datasets were used in this study. The GVLM dataset can be found at (https://github.com/zxk688/GVLM, accessed on 5 January 2024). The SYSU dataset can be found at (https://github.com/liumency/SYSU-CD, accessed on 5 May 2024). The LEVIR dataset can be found at (https://github.com/justchenhao/LEVIR, accessed on 5 June 2024). The code presented in this study is openly available at: (https://github.com/FFLKCDNet/FFLKCDNet).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Carlotto, M.J. Detection and analysis of change in remotely sensed imagery with application to wide area surveillance. IEEE Trans. Image Process. 1997, 6, 189–202. [Google Scholar] [CrossRef]
  2. Treitz, P.; Rogan, J. Remote sensing for mapping and monitoring land-cover and land-use change—An introduction. Prog. Plan. 2004, 61, 269–279. [Google Scholar] [CrossRef]
  3. Blaschke, T.; Hay, G.J.; Kelly, M.; Lang, S.; Hofmann, P.; Addink, E.; Feitosa, R.Q.; Van der Meer, F.; Van der Werff, H.; Van Coillie, F.; et al. Geographic object-based image analysis–towards a new paradigm. ISPRS J. Photogramm. Remote Sens. 2014, 87, 180–191. [Google Scholar] [CrossRef] [PubMed]
  4. Hussain, M.; Chen, D.; Cheng, A.; Wei, H.; Stanley, D. Change detection from remotely sensed images: From pixel-based to object-based approaches. ISPRS J. Photogramm. Remote Sens. 2013, 80, 91–106. [Google Scholar] [CrossRef]
  5. Feng, Y.; Jiang, J.; Xu, H.; Zheng, J. Change detection on remote sensing images using dual-branch multilevel intertemporal network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  6. Lin, M.; Yang, G.; Zhang, H. Transition is a process: Pair-to-video change detection networks for very high resolution remote sensing images. IEEE Trans. Image Process. 2022, 32, 57–71. [Google Scholar] [CrossRef] [PubMed]
  7. Daudt, R.C.; Le Saux, B.; Boulch, A. Fully convolutional siamese networks for change detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4063–4067. [Google Scholar]
  8. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  9. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  10. Zhu, S.; Song, Y.; Zhang, Y.; Zhang, Y. ECFNet: A Siamese network with fewer FPs and fewer FNs for change detection of remote-sensing images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  11. Fang, S.; Li, K.; Shao, J.; Li, Z. SNUNet-CD: A densely connected Siamese network for change detection of VHR images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  12. Liu, M.; Shi, Q. DSAMNet: A deeply supervised attention metric based network for change detection of high-resolution images. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 6159–6162. [Google Scholar]
  13. Jin, W.D.; Xu, J.; Han, Q.; Zhang, Y.; Cheng, M.M. CDNet: Complementary depth network for RGB-D salient object detection. IEEE Trans. Image Process. 2021, 30, 3376–3390. [Google Scholar] [CrossRef] [PubMed]
  14. Huang, J.; Yuan, X.; Lam, C.T.; Huang, G. F3Net: Feature Filtering Fusing Network for Change Detection of Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 10621–10635. [Google Scholar] [CrossRef]
  15. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
  16. Bandara, W.G.C.; Patel, V.M. A Transformer-Based Siamese Network for Change Detection. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 207–210. [Google Scholar] [CrossRef]
  17. Chen, H.; Qi, Z.; Shi, Z. Remote sensing image change detection with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14. [Google Scholar] [CrossRef]
  18. Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure transformer network for remote sensing image change detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  19. Sun, Y.; Dai, D.; Zhang, Q.; Wang, Y.; Xu, S.; Lian, C. MSCA-Net: Multi-scale contextual attention network for skin lesion segmentation. Pattern Recognit. 2023, 139, 109524. [Google Scholar] [CrossRef]
  20. Feng, Y.; Xu, H.; Jiang, J.; Liu, H.; Zheng, J. ICIF-Net: Intra-scale cross-interaction and inter-scale feature fusion network for bitemporal remote sensing images change detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  21. Song, L.; Xia, M.; Weng, L.; Lin, H.; Qian, M.; Chen, B. Axial cross attention meets CNN: Bibranch fusion network for change detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 16, 21–32. [Google Scholar] [CrossRef]
  22. Cheng, D.; Liao, R.; Fidler, S.; Urtasun, R. Darnet: Deep active ray network for building segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7431–7439. [Google Scholar]
  23. Liu, Y.; Wang, K.; Li, M.; Huang, Y.; Yang, G. A Position-Temporal Awareness Transformer for Remote Sensing Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
  24. Feng, J.; Yang, X.; Gu, Z.; Zeng, M.; Zheng, W. SMBCNet: A Transformer-Based Approach for Change Detection in Remote Sensing Images through Semantic Segmentation. Remote Sens. 2023, 15, 3566. [Google Scholar] [CrossRef]
  25. Yu, W.; Zhou, P.; Yan, S.; Wang, X. Inceptionnext: When inception meets convnext. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5672–5683. [Google Scholar]
  26. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11963–11975. [Google Scholar]
  27. Ding, X.; Zhang, Y.; Ge, Y.; Zhao, S.; Song, L.; Yue, X.; Shan, Y. UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio Video Point Cloud Time-Series and Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5513–5524. [Google Scholar]
  28. Libiao, J.; Wenchao, Z.; Changyu, L.; Zheng, W. Semantic segmentation based on DeeplabV3+ with multiple fusions of low-level features. In Proceedings of the 2021 IEEE 5th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, 12–14 March 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1957–1963. [Google Scholar]
  29. Zhang, X.; Yu, W.; Pun, M.O.; Shi, W. Cross-domain landslide mapping from large-scale remote sensing images using prototype-guided domain-aware progressive representation learning. ISPRS J. Photogramm. Remote Sens. 2023, 197, 1–17. [Google Scholar] [CrossRef]
  30. Shi, Q.; Liu, M.; Li, S.; Liu, X.; Wang, F.; Zhang, L. A Deeply Supervised Attention Metric-Based Network and an Open Aerial Image Dataset for Remote Sensing Change Detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–16. [Google Scholar] [CrossRef]
  31. Chen, H.; Shi, Z. A Spatial-Temporal Attention-Based Method and a New Dataset for Remote Sensing Image Change Detection. Remote Sens. 2020, 12, 1662. [Google Scholar] [CrossRef]
  32. Cui, F.; Jiang, J. Shuffle-CDNet: A lightweight network for change detection of bitemporal remote-sensing images. Remote Sens. 2022, 14, 3548. [Google Scholar] [CrossRef]
  33. Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS J. Photogramm. Remote Sens. 2020, 162, 94–114. [Google Scholar] [CrossRef]
  34. Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. Remote Sens. 2020, 166, 183–200. [Google Scholar] [CrossRef]
  35. Liu, Y.; Pang, C.; Zhan, Z.; Zhang, X.; Yang, X. Building change detection for remote sensing images using a dual-task constrained deep siamese convolutional network model. IEEE Geosci. Remote Sens. Lett. 2020, 18, 811–815. [Google Scholar] [CrossRef]
  36. Lei, T.; Geng, X.; Ning, H.; Lv, Z.; Gong, M.; Jin, Y.; Nandi, A.K. Ultralightweight spatial–spectral feature cooperation network for change detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar] [CrossRef]
Figure 1. First Fusion Large-Kernel Change Detection Network (FFLKCDNet) overall structure diagram. First, the BFFM facilitates the fusion process by effectively aligning and combining the temporally disparate images. Second, RAResNet is employed to extract feature maps that encapsulate dual-time remote sensing information, enhancing the representation of temporal dynamics. Finally, the CD-LKAFM performs a dimensionality-reduction-based fusion of features, progressing from high-dimensional to low-dimensional spaces, thereby capturing contextual semantic information and enabling robust feature recovery. The red and green lines represent downsampling and upsampling operations, respectively.
Figure 2. Structure of the BFFM. dwc represents depthwise separable convolution. This structure is excellent at multi-scale feature extraction and has efficient feature expression ability and low computational complexity.
Figure 3. Structure of the RAResBlock. ReLK denotes reparameterized large-kernel convolution, and SE denotes the Squeeze-and-Excitation module. The block integrates a multi-attention mechanism with channel attention and strengthens multi-input feature fusion through an enhancement mechanism, achieving efficient feature extraction and information flow through its structural design.
Figure 4. Structure of the Reparams module. Flesh color denotes single-layer convolution points, green denotes double-layer convolution points, yellow denotes three-layer convolution points, and red denotes six-layer convolution points. ddwc_{k×k,d} denotes a depthwise separable convolution with kernel size k × k and dilation rate d. RepLK denotes reparameterized large-kernel convolution.
Figure 5. Structure of CD-LKAF. This module integrates the design concepts of multi-scale feature extraction, context enhancement, and dynamic weight adjustment, aiming to achieve efficient processing and fusion of complex input features.
Figure 6. Example images from the GVLM-CD dataset. A and B represent the same location at different times. Labels represent the changed areas, where white represents the changed area and black represents the unchanged area.
Figure 7. Example images from the SYSU-CD dataset. A and B represent the same location at different times. Labels represent the changed areas, where white represents the changed area and black represents the unchanged area.
Figure 8. Example images from the LEVIR-CD dataset. A and B represent the same location at different times. Labels represent the changed areas, where white represents the changed area and black represents the unchanged area.
Figure 9. The prediction results of selected models on GVLM. In the prediction results, white represents a true positive, black is a true negative, red indicates a false negative, and green stands as a false positive. In short, the lower the proportion of red and green, the better the predictive performance of the model.
Figure 10. The prediction results of selected models on SYSU. In the prediction results, white represents a true positive, black is a true negative, red indicates a false negative, and green stands as a false positive. In short, the lower the proportion of red and green, the better the predictive performance of the model.
Figure 11. The prediction results of selected models on LEVIR. In the prediction results, white represents a true positive, black is a true negative, red indicates a false negative, and green stands as a false positive. In short, the lower the proportion of red and green, the better the predictive performance of the model.
Table 1. The ablation experiment results on the GVLM dataset (with the best results in bold).
BFFM / RAResNet / CD-LKAF | Kappa | MIoU | MPA | F1
× | 0.8469 | 0.8628 | 0.9764 | 0.914
× | 0.8334 | 0.8439 | 0.9673 | 0.9067
× | 0.8438 | 0.8574 | 0.9706 | 0.9127
× × | 0.7942 | 0.8032 | 0.937 | 0.8964
× × | 0.8026 | 0.8049 | 0.9393 | 0.8987
× × | 0.7964 | 0.7855 | 0.9211 | 0.8856
all modules | 0.8538 | 0.8708 | 0.985 | 0.927
Table 2. The ablation experiment results on the SYSU dataset (with the best results in bold).
BFFM / RAResNet / CD-LKAF | Kappa | MIoU | MPA | F1
× | 0.6541 | 0.7129 | 0.8778 | 0.812
× | 0.6097 | 0.6653 | 0.8545 | 0.7746
× | 0.6439 | 0.6971 | 0.8758 | 0.7967
× × | 0.5864 | 0.6416 | 0.8377 | 0.7532
× × | 0.5973 | 0.6548 | 0.8464 | 0.7597
× × | 0.5433 | 0.6299 | 0.8238 | 0.7479
all modules | 0.6842 | 0.738 | 0.8944 | 0.8422
Table 3. The ablation experiment results on the LEVIR dataset (with the best results in bold).
BFFM / RAResNet / CD-LKAF | Kappa | MIoU | MPA | F1
× | 0.8671 | 0.8745 | 0.9769 | 0.9277
× | 0.805 | 0.8162 | 0.9522 | 0.8869
× | 0.8246 | 0.8411 | 0.9597 | 0.8935
× × | 0.7758 | 0.7955 | 0.9279 | 0.8436
× × | 0.7825 | 0.8064 | 0.9431 | 0.8546
× × | 0.7547 | 0.7623 | 0.9173 | 0.8011
all modules | 0.9009 | 0.9092 | 0.9921 | 0.9505
Table 4. The experimental results for each group based on the GVLM dataset (bold indicates the best result).
Methods | Kappa | MIoU | MPA | F1 | GFlops | Parameter (M)
BIT-CD [32] | 0.8133 | 0.8399 | 0.9806 | 0.9059 | 206.03 | 63.87
FC-Siam-diff [7] | 0.7921 | 0.8286 | 0.9791 | 0.8989 | 97.64 | 40.19
ChangeFormer [16] | 0.816 | 0.8428 | 0.9809 | 0.9087 | 202.79 | 61.03
MSCANet [19] | 0.7711 | 0.8089 | 0.9739 | 0.8859 | 164.82 | 55.17
DSIFNet [34] | 0.7539 | 0.7967 | 0.9692 | 0.8749 | 58.37 | 44.8
DTCDSCNet [35] | 0.7801 | 0.818 | 0.9766 | 0.8897 | 182.67 | 56.36
ICIFNet [20] | 0.8289 | 0.8522 | 0.9835 | 0.9094 | 138.58 | 49.87
SNUNet [11] | 0.8118 | 0.8394 | 0.9815 | 0.9047 | 75.85 | 50.69
ResUNet [33] | 0.8159 | 0.8439 | 0.9073 | 0.9099 | 62.29 | 49.72
DSAMNet [12] | 0.7947 | 0.8261 | 0.9782 | 0.8982 | 145.32 | 52.86
USSFCNet [36] | 0.8061 | 0.8351 | 0.979 | 0.9054 | 51.81 | 48.39
FFLKCDNet (Ours) | 0.8538 | 0.8708 | 0.985 | 0.927 | 56.28 | 68.47
Table 5. The experimental results for each group based on the SYSU dataset (bold indicates the best result).
Methods | Kappa | MIoU | MPA | F1
BIT-CD | 0.6468 | 0.7132 | 0.8849 | 0.8238
FC-Siam-diff | 0.6028 | 0.6845 | 0.8836 | 0.797
ChangeFormer | 0.6177 | 0.6925 | 0.8738 | 0.8097
MSCANet | 0.5962 | 0.6669 | 0.8679 | 0.7854
DSIFNet | 0.6598 | 0.7219 | 0.8889 | 0.831
DTCDSCNet | 0.5921 | 0.6822 | 0.8721 | 0.7962
ICIFNet | 0.6651 | 0.7252 | 0.8892 | 0.8317
SNUNet | 0.6391 | 0.7086 | 0.8819 | 0.8187
ResUNet | 0.6521 | 0.716 | 0.8819 | 0.8274
DSAMNet | 0.6779 | 0.7312 | 0.8833 | 0.8383
USSFCNet | 0.6435 | 0.7089 | 0.8828 | 0.8228
FFLKCDNet (Ours) | 0.6842 | 0.738 | 0.8944 | 0.8422
Table 6. The experimental results for each group based on the LEVIR-CD dataset (bold indicates the best result).
Methods | Kappa | MIoU | MPA | F1
BIT-CD | 0.8935 | 0.9033 | 0.9899 | 0.945
FC-Siam-diff | 0.8772 | 0.8886 | 0.991 | 0.9388
ChangeFormer | 0.8498 | 0.8673 | 0.9877 | 0.929
MSCANet | 0.8024 | 0.8319 | 0.9816 | 0.901
DSIFNet | 0.769 | 0.7581 | 0.9782 | 0.8287
DTCDSCNet | 0.8259 | 0.8499 | 0.9842 | 0.9129
ICIFNet | 0.885 | 0.8563 | 0.982 | 0.9295
SNUNet | 0.8438 | 0.8628 | 0.9875 | 0.922
ResUNet | 0.8959 | 0.908 | 0.9904 | 0.9517
DSAMNet | 0.8501 | 0.8702 | 0.9859 | 0.9274
USSFCNet | 0.8493 | 0.8688 | 0.9878 | 0.9282
FFLKCDNet (Ours) | 0.9009 | 0.9092 | 0.9921 | 0.9505