1. Introduction
Change detection identifies differences between two images captured at different times over the same location; it labels the pixels in changed areas as 1 and the pixels in unchanged areas as 0. With the ongoing advancement of remote sensing technology, change detection in remote sensing images has become a key focus in the field [1]. In addition, change detection in remote sensing images plays a significant role in various fields such as urban development planning [2], agricultural surveys [3,4], land management [5], and more.
Weismiller [6] introduced the difference method in 1977 to distinguish differences in remote sensing images, marking the beginning of traditional change detection methods. Since then, numerous scientific advancements have been made in this area, including methods such as change vector analysis [7], post-classification refinement [8], and principal component analysis [9]. These methods can effectively distinguish differences in remote sensing images within a short time, significantly improving efficiency. However, as sensors continue to advance, remote sensing images are achieving higher resolutions, resulting in greater diversity in the appearance, color, and brightness of different land features. Consequently, the detection accuracy of traditional change detection methods is often unsatisfactory. In this context, it is highly meaningful to develop a change detection method that can quickly and accurately adapt to high-resolution remote sensing images.
AlexNet, proposed by Krizhevsky et al. [10], won the ImageNet image classification competition, demonstrating the powerful feature extraction capability of deep learning-based image processing methods at that time. Moreover, with the continuous advancement of hardware technology, deep learning-based image processing methods quickly realized end-to-end training, showing significant advantages in terms of efficiency and speed. Subsequently, deep learning-based image processing methods developed rapidly and were also applied to change detection in remote sensing images. Gong et al. [11] proposed a novel deep learning-based method for image change detection, demonstrating the effectiveness of convolutional neural networks (CNNs) in change detection tasks. In 2018, Rodrigo et al. [12] introduced three convolutional neural network (CNN) architectures specifically designed for change detection tasks; they also proposed a novel Siamese network structure with shared weights for processing the two remote sensing images, achieving the best scores on two datasets. The Siamese structure subsequently became one of the mainstream approaches in the field of change detection. These convolution-based methods adapt well to images of various sizes and achieve good accuracy. However, the limited receptive field of convolutions restricts their ability to capture broader, global features, a key drawback of convolutional neural networks that is often overlooked.
Due to the excellent capability of Transformers [13,14] in extracting global semantic features in the field of natural language processing, Dosovitskiy et al. [15] applied Transformers to the domain of image processing in 2020, providing new insights for image processing. Subsequently, many networks utilizing Transformers emerged in the field of image processing. However, while Transformers provide the ability to extract global semantic features, they also increase the number of parameters. Liu et al. [16] incorporated a shifted-window mechanism into Transformer-based networks, which retained the advantages of Transformers while significantly reducing the number of parameters. Additionally, some studies have combined Transformers with convolutional operations, using Transformers to overcome the limitations of convolutional neural networks in extracting global semantic information while leveraging convolutional networks to reduce the large parameter count associated with Transformers alone. These approaches have shown promise. Zhang et al. [17] designed a pure Transformer network called SwinSUNet with a Siamese U-shaped structure to solve the change detection problem. Peng et al. [18] proposed the Conformer network, which combines Transformer and CNN architectures; they introduced coupling units to fuse the features extracted by both models. Chen et al. [19] proposed the Bitemporal Image Transformer (BIT), which expresses bitemporal images as a small number of semantic tokens; a Transformer encoder and decoder are used to effectively model the context in the spatio-temporal domain, and BIT is integrated into a change detection framework based on deep feature differencing. Zhan et al. [20] proposed an attention-guided multi-scale fusion network for bi-temporal change detection. Li et al. [21] proposed an end-to-end encoding–decoding hybrid Transformer model for change detection (CD) known as TransUNetCD. This model solves the problem of redundant information generated when extracting low-level features under the UNet framework and improves the effectiveness of the network and the accuracy of extracting change features by weighting each pixel and selectively aggregating features.
Compared with purely convolutional or purely Transformer-based models, the aforementioned works process images more effectively and improve model accuracy. However, several issues remain:
The extensive use of Transformers in the encoding stage incurs a high computational cost that is often overlooked. Additionally, Transformer modules with a large number of parameters are susceptible to overfitting, particularly on small- to medium-sized datasets.
Due to the computational cost of the encoding stage, these models often use relatively simple fusion methods in the decoding stage. However, such fusion approaches may lead to inadequate information integration, negatively impacting the restoration of feature maps during decoding.
To address these issues, we propose ZAQNet, a frozen-parameter Transformer self-attention change detection network. In ZAQNet, ResNet-34 serves as the backbone network to extract features from the two original remote sensing images. The features obtained from each layer are then fed into a Feature Fusion Unit (FFU) for fusion. Simultaneously, at the bottom-most layer of the network, we sum the outputs of the two ResNet-34 branches and feed the result into a Pooling Refinement Unit (PRU) module, whose output is passed to the upsampling stage. Finally, in the upsampling stage, we sum the output of the PRU module with the outputs of each level of the FFU, followed by upsampling operations until the feature maps are restored to their original size.
In this study, we employed parameter-freezing techniques to optimize network performance. Specifically, we first trained the network normally until it achieved satisfactory performance metrics (such as IOU and F1 scores). Once the network met the expected performance, we froze its parameters and constructed a new network with the same architecture. The frozen modules from each layer (such as the ResNet-34 and FFU units) were then integrated with the untrained parts of the new network, forming a combined network for further training and optimization. This strategy ensures that the pre-trained weights are utilized to their fullest potential, and the improvement in network performance is closely related to the selection of frozen parameters; the best performance of ZAQNet is typically achieved when optimal pre-trained weights are obtained, as discussed in detail in the relevant sections.
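As an illustrative sketch of this strategy (the class and variable names below are ours, not the paper's implementation), the frozen and trainable branches can be combined as follows:

```python
import torch
import torch.nn as nn

class FrozenPlusTrainable(nn.Module):
    """Combine a frozen, pre-trained branch with a trainable branch of the same
    architecture (illustrative sketch of the parameter-freezing strategy above)."""

    def __init__(self, pretrained_branch: nn.Module, fresh_branch: nn.Module):
        super().__init__()
        # Frozen branch: pre-trained weights are fixed and excluded from gradient updates.
        self.frozen = pretrained_branch
        for p in self.frozen.parameters():
            p.requires_grad = False
        # Trainable branch: a newly constructed network with the same architecture.
        self.trainable = fresh_branch

    def train(self, mode: bool = True):
        # Keep the frozen branch in eval mode so that, e.g., BatchNorm statistics stay fixed.
        super().train(mode)
        self.frozen.eval()
        return self

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            y_frozen = self.frozen(x)
        # Residual-like combination: the trainable branch only has to fit the gap
        # between the frozen branch's output and the optimal solution.
        return y_frozen + self.trainable(x)
```

In ZAQNet, the pre-trained ResNet-34/FFU path plays the role of the frozen branch and the newly constructed path plays the role of the trainable branch; only the latter receives gradient updates during the second training stage.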
The contributions of this paper can be summarized as follows:
A frozen-parameter Transformer self-attention change detection network is proposed, which fully utilizes the information contained in remote sensing images to perform the change detection task.
We propose three modules: the GIAU, GSAU, and GCAU. The GIAU effectively integrates features from the two remote sensing images; the GSAU performs self-attention processing on the input features along the spatial dimensions of the image; and the GCAU performs self-attention processing on the input features along the channel dimension.
ZAQNet achieved the best performance among the compared change detection algorithms on two different datasets, demonstrating its superiority.
2. Methodology
2.1. Overall Framework
CNN and Transformer methods are two prominent approaches in the field of image processing [22]. CNNs excel at extracting local features, while Transformers are known for their ability to capture global features. Combining the two is therefore a promising choice, as it leverages the strengths of both methods. Additionally, we introduce a parameter-freezing method, in which we freeze the parameters of certain modules of a pre-trained network and combine them with an untrained network to form a new network. This approach resembles a residual structure, where the untrained network essentially learns to fit the difference between the frozen parameters and the optimal solution. Based on the above, we propose the frozen-parameter Transformer self-attention change detection network, which combines the powerful feature extraction capabilities of Transformers with the parameter-freezing technique to improve change detection performance.
Figure 1 provides a detailed illustration of the overall framework of ZAQNet. This paper proposes a Transformer self-attention change detection network, ZAQNet, based on frozen parameters, which detects change regions more accurately. In addition, this paper introduces four innovative modules: the GIAU (Global Information Aggregation Unit), GSAU (Global Spatial Attention Unit), GCAU (Global Channel Attention Unit), and PRU (Pooling Refinement Unit). As shown in Figures 2–5, these modules optimize the performance of the model in terms of feature fusion, spatial attention, channel attention, and spatial position refinement. The GIAU accurately locates the change area by fusing the features of dual-temporal remote sensing images. The GSAU implements a self-attention mechanism in the spatial dimension to enhance the model's ability to capture global change information; the GCAU implements a self-attention operation in the channel dimension to strengthen the model's attention to change-related feature maps. The PRU further improves the recovery of the feature map by extracting and refining the spatial location information of the bottom-level feature map. In the downsampling stage, we employ two pairs of ResNets with shared weights to extract features from the two remote sensing images. Notably, one of the ResNet pairs undergoes pre-training, and its parameters are subsequently frozen. Afterwards, the feature maps obtained from the two pairs of ResNets are fed into the Feature Fusion Units (FFUs) for further processing; the specific operations are illustrated in the pink region of Figure 1. The features extracted by the ResNet pair with frozen parameters are fed into the FFU with frozen parameters, while the features extracted by the ResNet pair involved in training pass through the FFU involved in training. Finally, the processed feature maps are summed and fed into the upsampling stage to assist the model in generating the mask image; the specific operations are depicted in the blue region of Figure 1. Additionally, within the FFU, we introduce three auxiliary units—the GIAU, GSAU, and GCAU—to process the extracted features at different levels. The internal structure of the FFU is depicted in the flaxen region of Figure 1. Furthermore, Table 1 shows the specific parameter details of the model.
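To make this data flow concrete, the schematic forward pass below sketches one plausible reading of Figure 1 (module names follow the text; tensor shapes, the number of levels, and the decoder details are assumptions, not the exact implementation):

```python
def zaqnet_forward(img_a, img_b, resnet, resnet_frozen, ffus, ffus_frozen, pru, up_blocks):
    """Schematic ZAQNet forward pass (simplified sketch, not the exact implementation).

    resnet / resnet_frozen : weight-sharing Siamese ResNet-34 pairs (trainable / frozen),
                             each returning a list of per-level feature maps
    ffus / ffus_frozen     : one Feature Fusion Unit per level (trainable / frozen)
    pru                    : Pooling Refinement Unit applied at the bottom level
    up_blocks              : decoder blocks that upsample back to the input resolution
    """
    # Multi-level features of both temporal images, from both branches.
    feats_a,  feats_b  = resnet(img_a),        resnet(img_b)
    feats_af, feats_bf = resnet_frozen(img_a), resnet_frozen(img_b)

    # Per-level fusion: trainable features pass through trainable FFUs,
    # frozen features through frozen FFUs; the two results are summed.
    fused = [ffu(fa, fb) + ffu_f(faf, fbf)
             for ffu, ffu_f, fa, fb, faf, fbf
             in zip(ffus, ffus_frozen, feats_a, feats_b, feats_af, feats_bf)]

    # Bottom level: combine the deepest features of the two temporal images and refine with the PRU.
    x = pru(feats_a[-1], feats_b[-1])

    # Decoder: at each level, add the corresponding fused feature map, then upsample.
    for up, skip in zip(up_blocks, reversed(fused)):
        x = up(x + skip)
    return x  # change-mask logits at the original resolution
```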
2.2. Global Information Aggregation Unit (GIAU)
The unique aspect of change detection tasks lies in the handling of two remote sensing images taken at different times and the generation of a mask image that identifies the changed and unchanged regions. This implies that the two sets of feature information extracted by a pair of ResNets with shared weights need to be aggregated into one set of feature information that is strongly correlated with the changed regions. Furthermore, due to the difference in capture time, the two images often exhibit variations beyond the true changed regions, such as shooting angle deviations, shadow effects, cloud cover, and seasonal vegetation changes. Simple fusion approaches may lose information or even corrupt the extracted features when trying to isolate the true changed regions from these noise sources. We therefore propose a novel Feature Fusion Unit (FFU) to perform the fusion and extraction of the two sets of feature information. The FFU consists of three independent units and receives the feature sets extracted during the downsampling stage. To facilitate comprehensive interaction between these sets, we designed the Global Information Aggregation Unit (GIAU), which selectively extracts the information most relevant to the changed regions.
The GIAU is designed to handle variations in input remote sensing images, such as angular deviations, shadows, and seasonal changes. The core mechanism involves first subtracting features from the two images and applying a convolutional layer to compute the differences. These differences are used to calculate weights, which are then multiplied with the feature maps to enhance relevant change information. The operations in the GIAU, particularly the convolutional layer, enable the model to focus on the most significant changes, effectively addressing variations in environmental conditions. Furthermore, the selective extraction mechanism is enhanced by aggregating features from the two images, which helps to focus on genuine changes rather than noise or irrelevant variations.
Figure 2 shows a schematic diagram of the GIAU, whose inputs are denoted $F_1$ and $F_2$. First, we subtract $F_1$ and $F_2$ and send the absolute difference to a convolutional layer that keeps the number of channels unchanged. The output of this convolutional layer is used as a weight map $W$, which is multiplied with each of the two inputs before the inputs $F_1$ and $F_2$ are added back. The two enhanced features are then concatenated and sent to a convolutional layer that halves the number of channels. Finally, the output of this channel-halving convolutional layer is added to the two original inputs and sent to a convolutional layer that keeps the number of channels unchanged, yielding the final output $F_{out}$. These operations can be summarized as follows:

$$W = \mathrm{Conv}\left(\left|F_1 - F_2\right|\right)$$
$$\hat{F}_1 = W \otimes F_1 + F_1, \quad \hat{F}_2 = W \otimes F_2 + F_2$$
$$F_{out} = \mathrm{Conv}\left(\mathrm{Conv}_{\downarrow}\left(\left[\hat{F}_1, \hat{F}_2\right]\right) + F_1 + F_2\right)$$

where $\left|\cdot\right|$ represents the absolute value operation, $\left[\cdot,\cdot\right]$ represents the concatenation along the channel dimension, $\otimes$ represents element-wise multiplication, and $\mathrm{Conv}_{\downarrow}$ denotes the convolutional layer that halves the number of channels.
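A minimal PyTorch-style sketch of the GIAU as described above (the 3 × 3 kernel sizes and the sigmoid gating of the weight map are assumptions where the text does not specify them):

```python
import torch
import torch.nn as nn

class GIAU(nn.Module):
    """Global Information Aggregation Unit (illustrative sketch of the described operations)."""

    def __init__(self, channels: int):
        super().__init__()
        # Weight branch computed from the absolute difference of the two inputs.
        self.diff_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Fuses the two enhanced features and halves the channel number (2C -> C).
        self.fuse_conv = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        # Final convolution that keeps the channel number unchanged.
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        # Difference-derived weights highlight locations likely to have changed
        # (the sigmoid gating is an assumption).
        w = torch.sigmoid(self.diff_conv(torch.abs(f1 - f2)))
        e1 = w * f1 + f1
        e2 = w * f2 + f2
        fused = self.fuse_conv(torch.cat([e1, e2], dim=1))  # channel concat, halved to C
        return self.out_conv(fused + f1 + f2)
```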
2.3. Global Spatial Attention Unit (GSAU)
Self-attention possesses powerful capabilities for global information extraction by computing the relationships between each pixel and the other pixels in an image. However, this computation approach can lead to significant memory consumption. Furthermore, as mentioned earlier, the feature maps often contain a significant amount of noise and interference for various reasons. If a straightforward and coarse application of self-attention is used to attend to global information, it can lead to efficiency issues. Therefore, we propose the Global Spatial Attention Unit (GSAU), which retains the global information extraction capabilities of self-attention while reducing memory consumption.
Figure 3 shows a schematic diagram of the GSAU, where the input of the GSAU is denoted $f$. First, $f$ is sent to global average pooling layers along the horizontal and vertical directions to obtain $f_h$ and $f_w$; then, we perform dimension squeezing on $f_h$ and $f_w$ to obtain the sequences $s_h$ and $s_w$. Next, we send $s_h$ and $s_w$ to two multi-head self-attention modules to obtain $a_h$ and $a_w$, and we perform dimension expansion on $a_h$ and $a_w$ to obtain $A_h$ and $A_w$. Then, we use the broadcast mechanism to sum $A_h$ and $A_w$ and multiply the result by $f$. Finally, the product is convolved and then added to itself to obtain the final output $f_{out}$. These operations can be summarized as follows:

$$A_h = E\left(\mathrm{MHSA}\left(S\left(f_h\right)\right)\right), \quad A_w = E\left(\mathrm{MHSA}\left(S\left(f_w\right)\right)\right)$$
$$\hat{f} = \left(A_h + A_w\right) \otimes f, \quad f_{out} = \mathrm{Conv}\left(\hat{f}\right) + \hat{f}$$

In this context, $\mathrm{MHSA}$ represents the multi-head self-attention module, $S$ denotes the dimension compression (performed along the vertical direction for one branch and the horizontal direction for the other), and $E$ represents the corresponding dimension expansion.
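A hedged PyTorch sketch of the GSAU (the head count, the 3 × 3 kernel size, and the exact squeeze/expand layout are assumptions):

```python
import torch
import torch.nn as nn

class GSAU(nn.Module):
    """Global Spatial Attention Unit (illustrative sketch of the described operations)."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn_h = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.attn_w = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # Global average pooling along the two spatial directions.
        f_h = f.mean(dim=3)                      # (B, C, H): averaged over width
        f_w = f.mean(dim=2)                      # (B, C, W): averaged over height
        # Squeeze to token sequences and apply multi-head self-attention.
        s_h = f_h.permute(0, 2, 1)               # (B, H, C)
        s_w = f_w.permute(0, 2, 1)               # (B, W, C)
        a_h, _ = self.attn_h(s_h, s_h, s_h)
        a_w, _ = self.attn_w(s_w, s_w, s_w)
        # Expand back and combine via broadcasting into a full (B, C, H, W) attention map.
        a_h = a_h.permute(0, 2, 1).unsqueeze(3)  # (B, C, H, 1)
        a_w = a_w.permute(0, 2, 1).unsqueeze(2)  # (B, C, 1, W)
        attended = (a_h + a_w) * f
        # Final convolution with a residual connection.
        return self.conv(attended) + attended
```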
2.4. Global Channel Attention Unit (GCAU)
The GSAU utilizes self-attention to achieve global information extraction within a feature map; however, it cannot simultaneously attend to the information between different feature maps. From Table 1, it is evident that the channel numbers of the feature maps propagated to the FFU range from 64 to 512. To capture the information between different feature maps at various levels, we propose the GCAU. Similar to the GSAU, we introduce self-attention in the GCAU to explore the information between all the feature maps. Subsequently, this information is used as weights assigned to the original features, thereby enhancing the model's focus on feature maps containing change-related information.
A schematic diagram of the GCAU is shown in Figure 4, and the input of the GCAU is denoted $f$. First, we send $f$ to global average pooling to obtain $f_c$; then, we perform dimension compression on $f_c$ and transpose it to obtain $s_c$. Next, $s_c$ is sent to the multi-head self-attention module, whose output is used as the weight and multiplied with $f$. Finally, the product is convolved and then added to itself to obtain the final output $f_{out}$. These operations can be summarized as follows:

$$f_c = \mathrm{GAP}\left(f\right), \quad W = \mathrm{MHSA}\left(T\left(S\left(f_c\right)\right)\right)$$
$$\hat{f} = W \otimes f, \quad f_{out} = \mathrm{Conv}\left(\hat{f}\right) + \hat{f}$$

where $\mathrm{GAP}$ denotes global average pooling, $S$ represents the dimension compression operation, $T$ represents the transpose operation on the tensor, and $\mathrm{MHSA}$ represents the multi-head self-attention module.
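A hedged PyTorch sketch of the GCAU (treating each channel as one token with a one-dimensional embedding is our assumption about the token layout, as is the 3 × 3 kernel size):

```python
import torch
import torch.nn as nn

class GCAU(nn.Module):
    """Global Channel Attention Unit (illustrative sketch of the described operations)."""

    def __init__(self, channels: int):
        super().__init__()
        # Each channel becomes one token with a 1-dimensional embedding (assumption).
        self.attn = nn.MultiheadAttention(embed_dim=1, num_heads=1, batch_first=True)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        # Global average pooling, then squeeze/transpose into a (B, C, 1) token sequence.
        tokens = f.mean(dim=(2, 3)).unsqueeze(-1)   # (B, C, 1)
        w_c, _ = self.attn(tokens, tokens, tokens)  # channel-wise attention weights
        weighted = f * w_c.view(b, c, 1, 1)         # reweight the feature maps
        # Final convolution with a residual connection.
        return self.conv(weighted) + weighted
```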
2.5. Pooling Refinement Unit (PRU)
As the depth of the ResNet increases, the receptive field of the model also expands. In the feature maps extracted at the bottom level of the network, a single pixel often represents the result of convolutions over a large spatial region of the original image. Therefore, these feature maps contain rich spatial position information. To extract and refine this spatial position information, we propose the Pooling Refinement Unit (PRU). The PPM [23] module applies pooling at different scales to the original feature map to obtain multiple feature maps of different sizes, which are then concatenated together. This approach utilizes the mutual guidance of context to enhance the model's prediction accuracy; however, it tends to overlook pixel-level local details. Building on its advantages, we utilize four pooling layers to extract contextual information from the feature maps at the lowest level of the network. At this stage, the feature maps themselves are at the pixel level, thereby mitigating the drawback of overlooking local details. Furthermore, we strengthen the guidance of context by concatenating the output of the pooling layer with a smaller pooling kernel and the original input along the channel dimension. This concatenation effectively guides the pooling layer with a larger pooling kernel, enabling it to capture richer contextual information.
A schematic diagram of the PRU is shown in Figure 5, and the inputs of the PRU are denoted $f_1$ and $f_2$; we sum $f_1$ and $f_2$ to obtain $f$. First, we send $f$ to the pooling layer with a kernel of 1 followed by a convolution to obtain $p_1$; then, $p_1$ is concatenated with $f$ and sent to the pooling layer with a kernel of 2 followed by a convolution to obtain $p_2$. After that, $p_2$ is concatenated with $f$ and sent to the pooling layer with a kernel of 4 followed by a convolution to obtain $p_4$. Similarly, $p_4$ is concatenated with $f$ and sent to a pooling layer with a kernel of 6 followed by a convolution to obtain $p_6$. Finally, the results of the four convolutional layers are concatenated in the channel dimension and added to the original input to obtain the final output $f_{out}$. These operations can be summarized as follows:

$$p_1 = \mathrm{Conv}\left(P_1\left(f\right)\right), \quad p_2 = \mathrm{Conv}\left(P_2\left(\left[p_1, f\right]\right)\right)$$
$$p_4 = \mathrm{Conv}\left(P_4\left(\left[p_2, f\right]\right)\right), \quad p_6 = \mathrm{Conv}\left(P_6\left(\left[p_4, f\right]\right)\right)$$
$$f_{out} = \left[p_1, p_2, p_4, p_6\right] + f$$

where $\left[\cdot,\cdot\right]$ indicates splicing (concatenation) in the channel dimension and $P_a$ indicates average pooling with a pooling kernel of $a$.
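A hedged PyTorch sketch of the PRU (the use of adaptive average pooling with output sizes equal to the stated kernels, the bilinear upsampling back to the input size, and the per-branch channel split are assumptions needed to make the shapes consistent):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PRU(nn.Module):
    """Pooling Refinement Unit (illustrative sketch of the described operations)."""

    def __init__(self, channels: int, pool_sizes=(1, 2, 4, 6)):
        super().__init__()
        branch_ch = channels // len(pool_sizes)
        self.pool_sizes = pool_sizes
        # One convolution per pooling branch; together they restore C channels after concat.
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, branch_ch, kernel_size=3, padding=1)]
            + [nn.Conv2d(channels + branch_ch, branch_ch, kernel_size=3, padding=1)
               for _ in pool_sizes[1:]]
        )

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        f = f1 + f2
        h, w = f.shape[2:]
        outs, prev = [], None
        for size, conv in zip(self.pool_sizes, self.convs):
            # Concatenate the previous branch's output with f to guide the larger pooling kernel.
            x = f if prev is None else torch.cat([prev, f], dim=1)
            x = F.adaptive_avg_pool2d(x, size)
            x = F.interpolate(conv(x), size=(h, w), mode="bilinear", align_corners=False)
            outs.append(x)
            prev = x
        # Concatenate the four refined maps along channels and add the original input.
        return torch.cat(outs, dim=1) + f
```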
4. Experiment
4.1. Introduction to Experimental Indicators
In this study, we employed several commonly used metrics, including the precision ($PR$), recall ($RC$), intersection over union ($IoU$), and $F1$ score, to assess the performance of the model. Their mathematical formulas are as follows:

$$PR = \frac{TP}{TP + FP}, \quad RC = \frac{TP}{TP + FN}$$
$$IoU = \frac{TP}{TP + FP + FN}, \quad F1 = \frac{2 \times PR \times RC}{PR + RC}$$
Here, $TP$ stands for true positive, $FN$ for false negative, $FP$ for false positive, and $TN$ for true negative. The Poly strategy can dynamically adjust the learning rate, and this paper adopts this strategy for learning rate scheduling. The Poly strategy is expressed as follows:

$$lr = lr_{base} \times \left(1 - \frac{e}{e_{max}}\right)^{p}$$

In the above formula, $lr$ represents the current learning rate, $lr_{base}$ is the base learning rate, $e$ is the current iteration, $e_{max}$ is the maximum number of iterations, and $p$ is the decay exponent.
4.2. Experimental Parameter Settings
We conducted all experiments using the PyTorch deep learning framework on a single RTX 3080Ti graphics card. An appropriate learning rate is crucial for the training process: a larger learning rate at the beginning of training can accelerate convergence, but if the learning rate remains too large, the model may fail to converge, in which case a smaller learning rate is more appropriate. In all the experiments conducted in this paper, $lr_{base}$ was set to 0.0015, $e_{max}$ was set to 200, and $p$ was set to 0.95. Adam was the optimization algorithm employed in this work, and we used BCEWithLogitsLoss as the loss function.
Additionally, we set the batch size to 8 for training on the 512 × 512 pixel BTRS-CD dataset and to 16 for training on the 256 × 256 pixel LEVIR-CD dataset.
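A hedged sketch of the corresponding training configuration (`ZAQNet` and `train_loader` are placeholders; the optimizer, loss, and Poly schedule follow the settings above, and whether the schedule is stepped per epoch or per iteration is an assumption here):

```python
import torch
import torch.nn as nn

base_lr, e_max, p = 0.0015, 200, 0.95            # hyperparameters from Section 4.2

model = ZAQNet().cuda()                           # placeholder for the actual model class
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
criterion = nn.BCEWithLogitsLoss()

# Poly schedule: lr = base_lr * (1 - e / e_max) ** p.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: (1 - e / e_max) ** p)

for epoch in range(e_max):
    for img_a, img_b, label in train_loader:      # batch size 8 (512 x 512) or 16 (256 x 256)
        img_a, img_b, label = img_a.cuda(), img_b.cuda(), label.cuda()
        optimizer.zero_grad()
        loss = criterion(model(img_a, img_b), label.float())
        loss.backward()
        optimizer.step()
    scheduler.step()
```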
We performed ablation experiments on the two datasets to further validate our network. The specific experimental results are presented in Table 2.
The experimental results in the first five rows represent the performance of progressively adding modules on top of the backbone network. The last row represents the results obtained by introducing parameter freezing into the network. The ablation experiments on the two datasets demonstrate that the proposed methods improve the network's performance, leading to the following conclusions:
The PRU module extracts contextual semantic information from the bottom-level feature maps to guide the recovery of the mask. On the two datasets, the PRU improved the IOU metric by 0.84% and 1.12%, respectively, and the F1 metric by 0.97% and 0.68%, respectively.
The GIAU is used to aggregate the two sets of feature information obtained from each level and extract the information strongly related to the change region. The GIAU improved the IOU metric by 0.97% and 1.40%, respectively, and the F1 metric by 0.74% and 1.17%, respectively.
The GSAU provides the model with global attention capability, enhancing the model's global focus on change information at the image level. The GSAU improved the IOU metric by 0.81% and 0.50%, respectively, and the F1 metric by 0.75% and 0.28%, respectively.
The GCAU provides the model with global attention capability in the channel dimension, allowing the model to focus more on feature maps that contain change information. The GCAU improved the IOU metric by 0.55% and 0.51%, respectively, and the F1 metric by 0.37% and 0.21%, respectively.
Net_F is the introduced parameter-freezing method in the complete network, which provides a structure similar to residual networks to re-fit the difference between predicted values and ground truth. This method improved the IOU metric by 0.30% and 0.38%, respectively, and the F1 metric by 0.21% and 0.11%, respectively.
4.3. Comparative Experiment
Comparative experiments can further demonstrate the performance of the model, and conducting experiments on different datasets can verify its robustness. We conducted comparative experiments on two datasets, and the specific values of all parameters involved in the experiments are presented in Section 4.2. To make the experimental setup more convincing, we selected CNN-based and Transformer-based change detection methods as baselines. The descriptions of these networks are as follows:
The three networks FC-EF, FC-Siam-Conc, and FC-Siam-Diff come from the same article, whose authors first proposed the weight-sharing Siamese structure. TCDNet [25], SNUNet [26], MFGAN [27], STANet [28], DASNet [29], TFI-GR [30], and ChangeNet [31] are CNN-based change detection methods proposed in recent years. BIT [19] and ChangeFormer [32] are change detection methods proposed in recent years that combine the CNN and Transformer.
4.3.1. Comparative Experiments on the BTRS-CD Dataset
The details of the comparison tests on the BTRS-CD dataset are provided in Table 3. From the data presented in Table 3, it is evident that ZAQNet achieved superior accuracy on the BTRS-CD dataset, surpassing the other networks by at least 1.63% and 1.75% in terms of the IOU and F1 scores, respectively. To further illustrate the performance of the models, we visualized the results of each network on the BTRS-CD dataset. The specific visualizations are shown in Figure 8.
In Figure 8, the first set of comparison images shows larger and regular changed regions; the second set includes challenging details in the changed regions; and the third set depicts densely distributed and small changed regions. As the visualizations of these three sets of comparison images show, our network performed best, effectively capturing the details of the changed regions.
4.3.2. Comparative Experiments on the LEVIR-CD Dataset
Additionally, we conducted comparative experiments on the LEVIR-CD dataset, and the specific experimental results are presented in Table 4. In Table 4, our network achieved the best accuracy, specifically outperforming the other networks by at least 0.58% in the IOU and 0.36% in the F1 score. This indicates that our network exhibits good generalization performance. In addition, we also visualized the performance of all the networks on the LEVIR-CD dataset. The specific visualization results are shown below.
The LEVIR-CD dataset contains densely populated clusters of small buildings. On the one hand, the model may have difficulty distinguishing the boundaries of change regions, causing two separate change regions to be detected as a single unified region. On the other hand, the dense arrangement of buildings can result in missed detections and false detections. In Figure 9, we illustrate three different scenarios of building clusters. Our network achieved the best results in these three scenarios, successfully avoiding the two aforementioned issues.
4.3.3. Comprehensive Efficiency Analysis of the Models
This paper aims to achieve high-precision detection while reducing computational complexity. Therefore, we conducted a comprehensive analysis and comparison of the networks on the LEVIR-CD dataset, with evaluation metrics including floating point operations (FLOPs), number of parameters (Params), inference time, and F1 score. We randomly selected 1000 images of 256 × 256 pixels from the validation set for inference and averaged all results to evaluate each model's inference time, as detailed in Table 5. ZAQNet had lower FLOPs and Params than the average levels of the other models, yet it achieved the highest F1 score. This indicates that ZAQNet significantly reduces the computational burden while improving performance, offering a more efficient solution for practical applications. Overall, ZAQNet achieves excellent detection results at lower computational cost, making it well suited for hardware deployment, especially in resource-constrained environments, and demonstrating its potential for high-performance change detection.
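A minimal sketch of how such an average inference time can be measured in PyTorch (dummy bitemporal inputs stand in for the 1000 validation images; the warm-up loop and CUDA synchronization are our additions for stable timing):

```python
import time
import torch

@torch.no_grad()
def average_inference_time(model, num_images=1000, size=256, device="cuda"):
    """Average per-image inference time over dummy 256 x 256 bitemporal inputs (sketch)."""
    model.eval().to(device)
    x1 = torch.randn(1, 3, size, size, device=device)
    x2 = torch.randn(1, 3, size, size, device=device)
    for _ in range(10):               # warm-up so one-off CUDA initialization is not timed
        model(x1, x2)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(num_images):
        model(x1, x2)
    torch.cuda.synchronize()
    return (time.time() - start) / num_images
```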
5. Discussion
This study proposes a Transformer self-attention change detection network (ZAQNet) with frozen parameters, aiming to improve the accuracy and efficiency of change detection in remote sensing images. The model combines the advantages of convolutional neural networks (CNNs) and Transformers, extracting both local and global features while reducing the computational burden of standard Transformer architectures. Additionally, the introduction of frozen parameters stabilizes feature extraction, reduces overfitting, and enhances the generalization ability of the model.
Experimental results show that ZAQNet outperformed existing advanced change detection methods on the BTRS-CD and LEVIR-CD datasets. The model achieved superior performance in multiple metrics, including precision, recall, IoU, and F1 score, particularly excelling in detecting fine-grained and spatially complex changes. The GIAU (Global Information Aggregation Unit) ensures effective fusion of dual-temporal features, the GSAU (Global Spatial Attention Unit) enhances spatial awareness, and the GCAU (Global Channel Attention Unit) improves channel-wise feature discrimination. Additionally, the PRU (Pooling Refinement Unit) further optimizes spatial features, leading to more precise change localization.
Despite these advancements, some limitations remain. First, relying on frozen parameters may limit the model’s adaptability to new datasets with significantly different distributions. Future research could explore adaptive parameter freezing mechanisms to enhance model transferability. Second, while ZAQNet achieved excellent performance, its real-time inference speed and computational efficiency still require optimization. Lightweight Transformer modules or knowledge distillation techniques could be investigated to improve deployment efficiency. Finally, the current model primarily focuses on optical remote sensing images. Future work could explore multi-modal fusion strategies, integrating SAR or LiDAR data to improve robustness under varying environmental conditions such as cloud cover or seasonal changes.
6. Conclusions
This paper proposes ZAQNet, a Transformer self-attention change detection network with frozen parameters, designed to improve the accuracy of change detection in remote sensing images. By leveraging the advantages of CNNs and Transformers, ZAQNet effectively extracts both local and global features while reducing computational costs. The network integrates three key self-attention modules—the GIAU, GSAU, and GCAU—alongside the PRU module, ensuring precise feature extraction and refinement. Extensive experiments on the BTRS-CD and LEVIR-CD datasets validate the superiority of ZAQNet, achieving state-of-the-art performance in multiple evaluation metrics. These results highlight the effectiveness of parameter freezing and self-attention mechanisms in remote sensing change detection tasks. Looking ahead, further research will focus on improving model adaptability, optimizing computational efficiency, and extending the framework to multi-modal data fusion. By addressing these challenges, ZAQNet is expected to play a greater role in real-world applications such as environmental monitoring, urban planning, and disaster assessment.