A Network Combining a Transformer and a Convolutional Neural Network for Remote Sensing Image Change Detection

Abstract: With the development of deep learning techniques in the field of remote sensing change detection, many change detection algorithms based on convolutional neural networks (CNNs) and nonlocal self-attention (NLSA) mechanisms have been widely used and have obtained good detection accuracy. However, these methods mainly extract semantic features from images of different periods without taking into account the temporal dependence between these features. This leads to more “pseudo-change” in complex scenes. In this paper, we propose a network architecture named UVACD for bitemporal image change detection. The network combines a CNN backbone for extracting high-level semantic information with a visual transformer. Here, the visual transformer constructs change intensity tokens to complete the temporal information interaction and suppresses the weights of irrelevant information to help extract more distinguishable change features. Our network is validated and tested on both the LEVIR-CD and WHU datasets. For the LEVIR-CD dataset, we achieve an intersection over union (IoU) of 0.8398 and an F1 score of 0.9130. For the WHU dataset, we achieve an IoU of 0.8664 and an F1 score of 0.9284. The experimental results show that the proposed method outperforms some previous state-of-the-art change detection methods.


Introduction
Change detection is the process of identifying differences by observing the states of an object or phenomenon at different times [1]. It is one of the main problems in Earth observation and has been studied extensively in recent years. With the continuous development of Earth observation technology, a large amount of remote sensing data with high spectral, spatial, and temporal resolutions is now available, bringing new requirements for change detection and promoting the development of change detection technology. The application fields of change detection include urban expansion [2], building change detection [3,4], water environment change detection [5,6], forest detection [7], and debris flow and landslide detection [8].
Most traditional change detection methods can be divided into two steps: change unit analysis and change identification [9]. Change unit analysis usually divides the image into pixel-level or object-level units and then constructs useful features based on these units. Different forms of analysis units share similar feature extraction techniques, common spectral features, and spatial features. Change identification uses handcrafted or learned rules to compare the feature representations of analysis units to determine change categories. A common and simple method is to calculate a feature difference map and then use thresholding to segment the change region [10]. In change vector analysis (CVA) [11], the direction and magnitude of the change vector are analyzed to determine the type of change. Alternatively, hand-designed rules are used to construct decision trees [12], or support vector machines are used [13], to identify the type of change.
Among deep learning methods, FC-Siam-D [37] uses feature differences in algebraic operations to detect changes. Another study [38] extracted the most representative features at different semantic levels by cascading bitemporal features at different scales and then applied channel attention to these features. Ref. [20] makes the network focus on useful information in the bitemporal images by adding attention mechanisms to each branch of the Siamese network separately.
Although successful, these methods mainly focus on extracting semantic features from images of different time periods and do not consider the temporal dependence between these features, which leads to more pseudo-change in complex scenes. Considering the superior capability of transformers in modeling global dependencies, we use a transformer to integrate spatial and temporal information for change detection and generate discriminative spatiotemporal features for feature fusion. More specifically, we propose a new spatiotemporal module based on a visual transformer for change information fusion. The new architecture contains three key components: an encoder, a fusion module, and a difference information extraction module. The encoder receives the original images from the two time periods and generates independent feature maps. The fusion module uses the visual transformer structure to enhance the independent feature maps and fuses the spatiotemporal information. The difference information extraction module is based on 3D convolution (conv3d); unlike general algebraic operations such as summation and difference, it generates change feature maps more flexibly, and its output can be used directly for classification.
In summary, this work makes three contributions.
1. We propose a new 3D convolutional difference information extraction module specifically for extracting bitemporal change feature maps. It aggregates bitemporal features easily and efficiently, which lets us generate difference feature maps more flexibly and focus more on the feature encoding or feature enhancement part.
2. We propose a visual transformer-based spatiotemporal feature enhancement strategy for fusing and processing bitemporal feature information. Temporal information modeling is achieved by first executing the transformer separately in the spatial dimension and then constructing a classification token of aggregated temporal features in the temporal dimension; in other words, this approach can be understood as a global average pooling operation involving fused temporal information. The method fully considers the long-range dependencies between feature maps and unites feature information in the temporal dimension, and the experimental results show that this is important for temporal feature enhancement.
3. For the combined CNN + Transformer approach, we construct an additional training loss for the transformer part to strengthen the influence of the transformer module on the network and experimentally validate the feasibility of this approach, which provides a new way to build a change detection network with CNN + Transformer.
The rest of this paper is organized as follows. Section 2 presents the overall structure of the proposed change detection network and describes the bitemporal feature fusion enhancement module and the feature difference module in detail. Section 3 provides the experimental results and analysis. In Section 4, we discuss some of the details and parameters of the experimental results. Section 5 concludes the paper.

Materials and Methods
In this section, we introduce the architecture of the proposed method in detail. The overall structure of the change detection network for learning spatiotemporal features is presented in Figure 1. For clarity, we first introduce the basic change detection network with three main modules: a feature extraction backbone in Section 2.1.1, a difference information extraction module in Section 2.1.2, and a decoder for segmentation in Sections 2.1.3 and 2.1.4. Subsequently, in Section 2.2, we add a transformer-based temporal feature enhancement module to the basic network to build a long-range model of spatiotemporal features; the long-range dependence captured by self-attention makes the network more robust.
The transformer locates key position information by computing global adaptive weights for the input, which makes the computational cost of the model grow with the size of the input image. ViViT [32] divides the image into nonoverlapping blocks and computes block-to-block correlations; although this greatly reduces the computational cost, it loses the ability to model the relationships between pixels within a block. In contrast, convolutional networks perform feature extraction with sliding windows, which is less computationally intensive than a transformer. Therefore, to compensate for this limitation of block-to-block correlation analysis, we use ResNet [15] as the backbone to build a deep feature extraction framework, and to guarantee the model capacity, we use ResNet50 to achieve a better feature representation. More specifically, the original ResNet is unchanged except for the removal of the last stage and the fully connected layers. The input of the backbone is a pair of images, a pre-period image $z_1 \in \mathbb{R}^{3 \times H \times W}$ and a post-period image $z_2 \in \mathbb{R}^{3 \times H \times W}$, which together locate the change state information.
Afterward, by passing them through the backbone, the pre- and post-period images $z_1$ and $z_2$ are mapped to feature maps $f_{z_1} \in \mathbb{R}^{C \times \frac{H}{s} \times \frac{W}{s}}$ and $f_{z_2} \in \mathbb{R}^{C \times \frac{H}{s} \times \frac{W}{s}}$, respectively. Here, $s$ represents the size of the block. For the change detection task, we construct the backbone of the network using a Siamese structure, as shown in Figure 2. In the figure, the deep residual convolution branches are drawn in the same color to indicate that they have the same network structure and weights, which forms a Siamese backbone. Because different input images produce different outputs when passed through this backbone, the outputs are drawn in different colors.

Difference Information Extraction
The difference feature maps are obtained by simply cascading the bitemporal feature maps extracted by the Siamese backbone and then processing them with a 3D convolution module [39]. Unlike traditional 2D convolution, the kernel of a 3D convolution has an additional temporal dimension for processing sequential image input, so the input feature maps need to be stacked along the temporal dimension. The process is shown in Equation (1). In this module, the channel number C, height H, and width W of the bitemporal feature maps $f_{z_1}$ and $f_{z_2}$ are kept constant. The receptive fields on the multiscale temporal feature maps are fused by applying dilated (atrous) 3D convolutions (conv3d) with dilation rates of 1, 4, and 8, followed by a convolution for feature aggregation. To fuse the receptive fields at different scales, the 3D convolution module borrows the idea of atrous spatial pyramid pooling (ASPP) [39]; we name it ASPP3d. A dilation rate of 1 corresponds to normal convolution, while the rates 4 and 8 are added to aggregate the difference features at multiple scales and to increase the model capacity. In addition, we visualize the absolute value of the feature map and the difference feature map extracted using ASPP3d. Figure 3 shows that the extracted features are more robust and have less noise.
Remote Sens. 2022, 14, 2228
where $g: \mathbb{R}^{H \times W \times C \times 2} \to \mathbb{R}^{H \times W \times C'}$ represents the difference feature extraction operation and $\sigma$ represents the aggregation of the outputs of $g$ under different dilation parameters. $C'$ is the number of channels after convolution, where $C' = C$.
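Equation (1) itself did not survive in this copy of the text. One plausible form, consistent with the definitions of $g$ and $\sigma$ given here, is sketched below. The symbols $F_{\mathrm{diff}}$ and $X$, the stacking notation $[\cdot\,;\cdot]$, and the subscript on $g$ for the dilation rate are assumptions introduced for illustration, not the authors' notation.

```latex
X = [\, f_{z_1} ;\, f_{z_2} \,] \in \mathbb{R}^{H \times W \times C \times 2},
\qquad
F_{\mathrm{diff}} = \sigma\big( g_{1}(X),\; g_{4}(X),\; g_{8}(X) \big) \in \mathbb{R}^{H \times W \times C'}
```

Read this way, each $g_d$ is a 3D convolution with dilation rate $d$ that collapses the temporal axis of length 2, and $\sigma$ is the final aggregation convolution over the three branches.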

Classification Head
Since the feature maps extracted by the ResNet backbone have a lower resolution than the original image, the difference feature maps need to be upsampled and then classified by convolution so that the segmentation output has the size of the original image. Because a simple upsampling operation produces checkerboard artifacts, an additional convolution is applied to the upsampled difference feature map to stabilize the result; this is followed by a classification convolution that outputs the change probabilities for the two classes. The first channel represents the background probability, with a label value of 0, and the second channel represents the probability of the change foreground to be extracted, with a label value of 1. The structure is shown in Figure 4.

Construction of Extra Predictions
Although the ASPP3d difference feature extraction module proposed in this paper can simply aggregate the bitemporal feature relations, it does not consider the multiscale resolution relations of multilevel feature maps. Therefore, we adopt the UNet [40] structure, which consists of a symmetric encoder-decoder network with skip connections to enhance detail retention, to build an extra classification output. The overall process is shown in Figure 5. Here, the symbol "C" represents the cascading and convolution of the difference feature maps. The red and yellow parts at the bottom of the figure represent the feature maps generated from the bitemporal images by the Siamese backbone. The light blue part is their difference feature map, and the transformer is the bitemporal feature enhancement module for difference feature extraction, which is introduced in Section 2.2.

Loss Function
As change detection is treated as a semantic segmentation task in this paper, we use the softmax cross-entropy loss in the training phase. The loss formula is as follows:
$$l(x, y) = L = (l_1, \cdots, l_N)^{\top} \qquad (2)$$
where y is the true value of a point (usually 0 or 1), x is the predicted value of a point, N is the batch size, and w is the weight given to each batch. Because the network makes predictions at multiple resolutions, there are two losses to calculate in total, and the final loss is their sum: loss_total = loss1 + loss2. The smaller the subscript of the loss is, the closer it is to the final output of the network and the smaller its resolution. When each loss is calculated, the binary label map must be resampled to the resolution of the corresponding prediction.
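The two-term loss described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the per-pixel form of the softmax cross-entropy, the mean reduction, and the helper names `softmax_cross_entropy`, `mean_loss`, `downsample_labels`, and `total_loss` are all assumptions introduced here.

```python
import math

def softmax_cross_entropy(logits, label, weight=1.0):
    """Loss of one pixel. logits = [background, change]; label is 0 or 1."""
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return -weight * (logits[label] - log_sum)

def mean_loss(logit_list, label_list):
    """Average the per-pixel losses (mean reduction is an assumption)."""
    losses = [softmax_cross_entropy(x, y) for x, y in zip(logit_list, label_list)]
    return sum(losses) / len(losses)

def downsample_labels(labels, factor):
    """Nearest-neighbour resampling of a 2D binary label map (hypothetical helper)."""
    return [row[::factor] for row in labels[::factor]]

def total_loss(loss1, loss2):
    """Sum of the two prediction losses, as in loss_total = loss1 + loss2."""
    return loss1 + loss2

# Example: a 4x4 label map resampled to the 2x2 resolution of a coarser prediction.
labels = [[0, 1, 1, 0],
          [0, 1, 1, 0],
          [1, 1, 0, 0],
          [1, 1, 0, 0]]
coarse_labels = downsample_labels(labels, 2)
```

Nearest-neighbour resampling keeps the coarse label map binary, which is why it is used in this sketch; the text does not say which resampling method the network actually uses.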

Bitemporal Feature Enhancement
To enhance the model so that it can handle global contextual information while specifying the temporal and spatial relationships between pixel pairs to generate differentiated spatiotemporal features, we propose a bitemporal feature fusion transformer module. It mainly consists of the following two structures: tokenization in Section 2.2.1 and the spatiotemporal transformer in Section 2.2.2, as shown in Figure 6.
Figure 6. Bitemporal feature enhancement module, where c1 and c2 are the change intensity tokens of the pre- and post-feature maps, respectively.

Tokenization
ViT divides an image into a sequence of patches as input, scans each element of the sequence, and learns their dependencies. Although this makes it inherently good at capturing global information in sequence data, it cannot implicitly learn the position information of the sequence, so position encoding (position embedding, PE) must be applied to the sequence to retain the position information. The formula can be expressed as follows. To learn the relationships between temporal feature maps, a temporal classification token is used here, as shown in the formula below:
where the tokenization mapping $\mathbb{R}^{\frac{H}{s} \times \frac{W}{s} \times C \times 2} \to \mathbb{R}^{\left(\frac{H}{s \cdot p} \cdot \frac{W}{s \cdot p} + 1\right) \times C' \times 2}$ produces the sequence that divides the input feature map into patches of size $p$ and encodes the original feature width $C$ as $C'$; the change intensity tokens used to fill the time dimension are generated in this process. $pos \in \mathbb{R}^{\left(\frac{H}{s \cdot p} \cdot \frac{W}{s \cdot p} + 1\right) \times 2}$ is a position-encoding parameter that can be calculated. Note that since the height H and width W of the original image are already reduced by a factor of $s$ by the Siamese backbone from the previous section, image chunking is already achieved, so we set $p$ to 1.
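The sequence length implied by the tokenization can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the function name `token_sequence_shape` is hypothetical, and the values `s = 16` and `embed_dim = 256` are illustrative choices that the text does not specify.

```python
def token_sequence_shape(height, width, s, p=1, embed_dim=256):
    """Shape of the token sequence for one bitemporal pair.

    height, width: size of the input image
    s: downsampling factor of the backbone (assumed value in the example)
    p: patch size applied on the feature map (the paper sets p = 1)
    embed_dim: encoded feature width C' (assumed value)
    Returns (tokens per date, embed_dim, 2); the "+ 1" is the change
    intensity token added to each date.
    """
    if height % (s * p) or width % (s * p):
        raise ValueError("image size must be divisible by s * p")
    n_patches = (height // (s * p)) * (width // (s * p))
    return (n_patches + 1, embed_dim, 2)

# A 256 x 256 patch with an assumed backbone stride of 16 and p = 1.
shape = token_sequence_shape(256, 256, s=16, p=1)
```

With these assumed values, each date contributes 16 × 16 = 256 spatial tokens plus one change intensity token.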

Spatiotemporal Transformer
This part of the structure contains a spatial transformer and a temporal transformer, both of which use a similar multiheaded attention mechanism. Both parts are composed of L transformer encoder layers, with each layer containing a multiheaded self-attention mechanism (MSA) with layer normalization (LN) and a multilayer perceptron (MLP) structure, which are denoted as follows:
$$y_l = \mathrm{MSA}(\mathrm{LN}(T_l)) + T_l$$
$$T_{l+1} = \mathrm{MLP}(\mathrm{LN}(y_l)) + y_l$$
The MLP structure consists of two linear layers with the GELU activation function, and the dimensionality $C'$ of the sequence is kept constant throughout the process. In addition, the model uses a separated multiheaded attention mechanism, utilizing different heads to compute spatial attention and temporal attention separately. Its attention is defined as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$
where $Q = XW_q$, $K = XW_k$, $V = XW_v$, $X, Q, K, V \in \mathbb{R}^{(N/r^2) \times C}$, $d$ is the feature dimension of the keys, and the sequence length is denoted $N$, which is divided into two types, $N_s$ and $N_t$, representing the sequence lengths in the space and time dimensions, respectively. The calculation formulas are as follows.
The core idea of this structure is to construct $Q_s, K_s, V_s \in \mathbb{R}^{N_s \times C}$ and $Q_t, K_t, V_t \in \mathbb{R}^{N_t \times d}$, representing the query, key, and value information of the respective dimensions. Multiheaded attention is first used to compute the spatial features $Y_s = \mathrm{Attention}(Q_s, K_s, V_s)$; the class token is then taken along the temporal dimension to compute the temporal features $Y_t = \mathrm{Attention}(Q_t, K_t, V_t)$. Finally, the temporal and spatial features are multiplied together and combined with the residual connection.
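The separated attention described above can be illustrated with a small single-head sketch in plain Python. It is a simplification under stated assumptions: the projections $W_q$, $W_k$, $W_v$ are taken as the identity, a single head is used, the change intensity token is assumed to sit at index 0 of each sequence, and the final multiplication and residual step is omitted. The function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax of a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention, softmax(Q K^T / sqrt(d)) V, on lists of rows."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def spatiotemporal(tokens_t1, tokens_t2):
    """Spatial attention within each date, then temporal attention over class tokens."""
    # Spatial step: self-attention over the tokens of one date at a time.
    y1 = attention(tokens_t1, tokens_t1, tokens_t1)
    y2 = attention(tokens_t2, tokens_t2, tokens_t2)
    # Temporal step: attend over the two class tokens (assumed to be at index 0).
    cls_tokens = [y1[0], y2[0]]
    yt = attention(cls_tokens, cls_tokens, cls_tokens)
    return y1, y2, yt

# With identical keys, every query receives the plain average of the values.
averaged = attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]])
```

The spatial step has sequence length $N_s$ and the temporal step has sequence length 2 in this sketch, one class token per date.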

Experiment
We name the network UVACD after its overall structure: it follows the form of UNet, uses a visual transformer, and realizes the change detection task through the proposed ASPP3d difference information extraction module. We validate UVACD on two publicly available building change detection datasets: the LEVIR-CD [36] dataset and the WHU [41] dataset. The experimental results show that the proposed UVACD network outperforms recently proposed change detection methods. In this section, we start by introducing the experimental datasets. Then, we describe the details of our implementation and present the utilized evaluation metrics. Finally, we list the comparison with some other methods.
LEVIR-CD [36] is an open dataset containing 637 ultrahigh-resolution (0.5 m-resolution) Google Earth image pairs with 1024 × 1024 pixels. Images of 20 different locations in several cities in Texas were collected from 2002 to 2018, and the time span of the image pairs ranges from 5 to 14 years. The changes introduced by seasonal and light variations in the dataset help in developing effective methods for mitigating the effects of unrelated changes on actual changes. Building-related changes include building growth (changes from soil/grassland/hardened ground or areas under construction to new building areas) and building decay (changes from building areas to nonbuilding areas such as soil/grassland/hardened ground). The dataset covers various types of buildings, such as villas, high-rise apartments, small garages, and large warehouses. The dataset contains a total of 31,333 individual building changes, with an average of approximately 50 building changes per image pair and an average size of approximately 987 pixels per change area. Note that most of the changes are due to building growth. The authors of LEVIR-CD provided a standard training/validation/test split, which assigns 70% of the samples for training, 10% for validation, and 20% for testing.
Because of the GPU memory capacity limitation, we follow the standard split proposed by reference [31]. We cut the images into small patches of size 256 × 256 with no overlap. Therefore, we obtain 7120/1024/2048 pairs of patches for training/validation/testing.
WHU [41] is a building change detection dataset consisting of two-period aerial images, each with a resolution of 0.3 m. This dataset covers an area where a 6.3-magnitude earthquake occurred in February 2011 and that was rebuilt in the following years. This dataset consists of aerial images obtained in April 2012 that contain 12,796 buildings in 20.5 km² (16,077 buildings in the same area in the 2016 dataset). A standard split does not exist for this dataset, and different researchers use different data splitting approaches to validate their models. For comparison, we use the splitting approach that was used in reference [21]. We crop the images into small patches of size 256 × 256 with no overlap. Note that we use fewer training samples, so we split the patches into three parts (4491/498/2700) for training/validation/testing according to the range of test set vectors given by the original dataset, where the validation set represents 10% of the training set. See Table 1 and Figure 7 for more details.
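The LEVIR-CD patch counts quoted above can be cross-checked against the image size and the total of 637 image pairs. The short sketch below does only that arithmetic; the helper name `patches_per_image` is hypothetical.

```python
def patches_per_image(image_size=1024, patch_size=256):
    """Number of nonoverlapping square patches cut from one square image."""
    if image_size % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    per_side = image_size // patch_size
    return per_side * per_side

# Patch-pair counts reported for LEVIR-CD (training/validation/testing).
reported_patches = {"train": 7120, "val": 1024, "test": 2048}

per_image = patches_per_image()
# Image pairs implied by each reported patch count.
image_pairs = {name: count // per_image for name, count in reported_patches.items()}
total_pairs = sum(image_pairs.values())
```

Each 1024 × 1024 image yields 16 patches, so the reported counts correspond to 445, 64, and 128 image pairs, which add up to the 637 pairs in the dataset.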

Training Details
Our model is implemented in PyTorch and trained on a single machine running Ubuntu 20.04 with four NVIDIA Tesla V100 GPUs; the training strategy uses distributed data parallel (DDP). In the data loading process, we normalized the image data to between 0 and 1 and then standardized them to zero mean and unit variance. Then, data augmentation was performed, with each operation applied with a probability of 0.5: random flipping, random resizing, random cropping, Gaussian noise, and random color change. We used the cross-entropy loss and the AdamW optimizer with a weight decay of 0.01. The initial learning rate was set to 0.005; during the first four epochs, the learning rate was warmed up linearly from 1 × 10⁻⁶ to the initial learning rate, followed by cosine annealing. The number of epochs was 200, and the batch size was 32.
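The learning rate schedule described above (linear warmup followed by cosine annealing) can be written as a single function of the epoch index. This is a sketch under assumptions: the text does not say whether the rate is updated per epoch or per iteration, nor what the final rate of the cosine phase is, so per-epoch updates and `min_lr = 0.0` are illustrative choices, and the function name is hypothetical.

```python
import math

def learning_rate(epoch, base_lr=0.005, warmup_start=1e-6,
                  warmup_epochs=4, total_epochs=200, min_lr=0.0):
    """Learning rate at a given epoch: linear warmup, then cosine annealing."""
    if epoch < warmup_epochs:
        # Linear ramp from warmup_start (epoch 0) up to base_lr (epoch warmup_epochs).
        return warmup_start + (base_lr - warmup_start) * epoch / warmup_epochs
    # Cosine decay from base_lr down to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [learning_rate(e) for e in range(200)]
```

In this sketch the rate starts at 1 × 10⁻⁶, reaches 0.005 at epoch 4, and then decreases monotonically for the rest of training.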

Evaluation Metrics
To compare the performance of our model with the performances of other methods, we report their F1 and intersection over union (IoU) scores with regard to the change class as the primary quantitative indices. Additionally, we report the precision and recall values for the change category and the overall accuracy (OA) of the change detection task. The IoU and F1 values range from 0 to 1, and the higher each value is, the better the performance. The IoU and F1 scores are calculated as follows, where TP denotes true positives, FP denotes false positives, FN denotes false negatives, and TN denotes true negatives.
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
$$F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
The precision is calculated as:
$$\mathrm{precision} = \frac{TP}{TP + FP}$$
The recall is calculated as:
$$\mathrm{recall} = \frac{TP}{TP + FN} \qquad (14)$$
The OA is calculated as:
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}$$
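The metrics above can be computed directly from the four confusion counts of the change class. The sketch below uses the standard definitions; the function names and the example counts are illustrative, and the code assumes every denominator is nonzero.

```python
def change_metrics(tp, fp, tn, fn):
    """Precision, recall, F1, IoU, and OA for the change class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    oa = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou, "oa": oa}

def iou_from_f1(f1):
    """IoU implied by an F1 score, using IoU = F1 / (2 - F1)."""
    return f1 / (2.0 - f1)

# Illustrative counts, not taken from the paper.
example = change_metrics(tp=80, fp=10, tn=900, fn=10)
```

Because F1 and IoU are both functions of TP, FP, and FN only, one determines the other. Applying `iou_from_f1` to the reported F1 scores of 0.9130 and 0.9284 gives values that agree with the reported IoU scores of 0.8398 and 0.8664 to within rounding.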

Comparison with Other Methods
In this section, we compare the change detection performance of our UVACD approach with the performances of some existing deep learning change detection methods. The methods used for comparison are as follows:
• FC-EF [37]: This method concatenates original bitemporal images and processes them through ConvNet to detect changes.
• FC-Siam-D [37]: This is a feature-level difference method that extracts the multilevel features of bitemporal images from a Siamese ConvNet and uses feature differences in algebraic operations to detect changes.
• FC-Siam-Conc [37]: This is a feature-level concatenation method that extracts the multilevel features of bitemporal images from a Siamese ConvNet, and feature concatenation in the channel dimension is used to detect changes.
• DTCDSCN [19]: This is an attention-based, feature-level method that utilizes a dual attention module (DAM) to exploit the interdependencies between the channels and spatial positions of ConvNet features to detect changes.
• STANet [36]: This is another attention-based, feature-level network for CD that integrates a spatial-temporal attention mechanism to detect changes.
• IFNet [14]: This is a multiscale feature-level method that applies channel attention and spatial attention to the concatenated bitemporal features at each level of the decoder. A supervised loss is computed at each level of the decoder. We use multi-loss training strategies inspired by this technique.
• SNUNet [38]: This is a multiscale feature-level concatenation method in which a densely connected (NestedUNet) Siamese network is used for change detection.
• BIT [31]: This is a transformer-based, feature-level method that uses a transformer decoder network to enhance the context information of ConvNet features via semantic tokens; this is followed by feature differencing to obtain the change map.
• ChangeFormer [30]: This is a pure transformer feature-level method that uses a transformer encoder-decoder network to obtain the change map directly.

Experimental Results Obtained on the WHU Dataset
In this section, we present the results of the comparison of UVACD with some other change detection algorithms on the WHU dataset. The comparison is mainly based on the work of [21], which tested many change detection methods with its own splitting approach. Here, we also crop the images into nonoverlapping 256 × 256 patches, although we use less training data to train the network on the WHU dataset. We present the comparative results in Table 2. The table shows that UVACD performs excellently on this dataset: our method improves the F1, IoU, recall, and precision values by 1.8%, 1.74%, 0.57%, and 0.88%, respectively. We also plot some of the inference results from the test set, as shown in Figure 8. In the first three columns, we present the bitemporal images (A, B) and ground truths (GTs). The fourth column shows the change masks produced by UVACD. To show the test results more visually, the false negative (FN) parts are shown in green, the false positive (FP) parts are shown in red, and the remaining parts, which overlap the label image, are shown in the same color as the label. Overall, UVACD preserves the contour shapes in the image with few FPs. In the first image, the building change is not simply an exclusive OR (XOR) relationship but also includes a change from one building to another, which our algorithm still identifies accurately. However, the third row of the image shows that the performance of our method is weaker on small targets.
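The nonoverlapping 256 × 256 cropping mentioned above can be sketched as a simple tiling of the image grid. The helper name `tile_boxes` and the choice to drop partial tiles at the right and bottom borders are assumptions of this example; the text does not say how border remainders are handled.

```python
def tile_boxes(height, width, tile=256):
    """Return (top, left, bottom, right) boxes of nonoverlapping tiles.

    Partial tiles at the image border are dropped (an assumed convention).
    """
    boxes = []
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            boxes.append((top, left, top + tile, left + tile))
    return boxes


if __name__ == "__main__":
    # A 1024 x 1024 image yields a 4 x 4 grid of 256 x 256 patches.
    print(len(tile_boxes(1024, 1024)))
```

The same boxes would be applied to both bitemporal images and to the label image so that the three patches stay aligned.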

Experimental Results Obtained on the LEVIR-CD Dataset
In this section, we compare UVACD with some other change detection algorithms on the LEVIR-CD dataset. The comparison is mainly based on the work of [30], which tested many change detection methods with its own splitting approach, and we divide the data in the same way. We present the comparative results in Table 3. The table shows that UVACD performs better on this dataset. Compared with the second-ranked values, the F1 and IoU values of our approach increase by 0.9% and 1.8%, respectively, and the recall and precision remain optimal.
Ref. [30] published their code and the results on a test image for their comparison experiment. We also ran inference on their test image, as shown in Figure 9. Here, we focus on the red boxes in A (pre-period image) and B (post-period image), which contain a change caused by the disappearance of a building as well as an unlabeled change type. Our UVACD and the BIT method still accurately detect the building-disappearance change, whereas the other compared methods all fail on the irrelevant change type. As highlighted in red (false detection) and green (missed detection), our UVACD method is more robust to building additions and reductions, and even to suspected building additions, than the other change detection methods. Both the quantitative and qualitative comparisons show the superiority of our proposed method for change detection with bitemporal images.

Effects of Transformers on the Network Structure
To verify the impact of the transformer-based bitemporal feature enhancement module on the overall network structure in this paper, we designed the following four ablation experiments.

• Base_Single: Only the ASPP3d convolution fusion module proposed in this paper is used (without extra classification).
• Base_Muti: The base network is used with extra classification.
• UVA_Single: The transformer-based bitemporal feature enhancement is used (without extra classification).
• UVA_Muti: The transformer-based bitemporal feature enhancement is used with extra classification.

The results of these four experiments on the WHU dataset and the LEVIR-CD dataset are presented in Table 4. For the base network, adding extra classification does not significantly improve the five metrics, and the base network is not optimal in terms of recall. When bitemporal feature enhancement is used (the UVA prefix), the recall increases significantly; the maximum increases in recall on LEVIR-CD and WHU are 2.25% and 1.16%, respectively. When we use the extra classification, the F1 and IoU metrics of UVA_Muti increase by 0.66% and 1.1% compared to those of UVA_Single on the LEVIR-CD dataset, but these metrics increase by only 0.09% and 0.67% on the WHU dataset. This shows that the addition of extra classification tasks facilitates the construction of the bitemporal feature enhancement strategy. However, the benefit is much greater on the LEVIR-CD dataset than on the WHU dataset. This may be because the change types in the LEVIR-CD dataset are more complex than those in the WHU dataset (the test metrics on WHU are better than those on LEVIR-CD), so the ability of the visual transformer to cope with more complex change types is more fully exploited. We further plot the inference results on the test sets of these two datasets for the four experiments in Figure 10. The first three rows in the figure show the results on the WHU dataset, and the last three rows show the results on the LEVIR-CD dataset. In the first three columns, we present the bitemporal images A (pre-period image), B (post-period image), and GTs. The last four columns show the results of the four ablation experiments constructed in this paper. The UVA prefix indicates that the transformer bitemporal feature enhancement strategy is used.
Here, green represents FNs, red represents FPs, white represents TPs, and black represents TNs. From the first row, we see that the ability of the network to detect small target changes improves slightly with the addition of the transformer, but there are still omissions. In the sixth row, the network built with the transformer shows false detections similar to those of the base network, as shown by the red parts of the image, when there is no additional classification task. We consider that, without additional classification constraints, the overall performance of the network is biased toward the performance of the convolutional neural network. Therefore, it is necessary to build additional classification constraints on the transformer.
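The color coding used in these visualizations (green for FNs, red for FPs, white for TPs, black for TNs) can be reproduced with a small per-pixel mapping. The sketch below works on plain nested lists of 0/1 values to stay self-contained; the names `COLORS` and `error_map` are illustrative and not taken from the authors' code.

```python
# RGB colors for each outcome, following the color scheme described in the text.
COLORS = {
    "TP": (255, 255, 255),  # white
    "TN": (0, 0, 0),        # black
    "FP": (255, 0, 0),      # red
    "FN": (0, 255, 0),      # green
}


def error_map(pred, gt):
    """Map binary prediction and ground-truth masks to an RGB error map.

    pred and gt are 2D lists of 0/1 values with the same shape.
    """
    out = []
    for pred_row, gt_row in zip(pred, gt):
        row = []
        for p, g in zip(pred_row, gt_row):
            if p and g:
                key = "TP"
            elif p:
                key = "FP"
            elif g:
                key = "FN"
            else:
                key = "TN"
            row.append(COLORS[key])
        out.append(row)
    return out


if __name__ == "__main__":
    print(error_map([[1, 1], [0, 0]], [[1, 0], [1, 0]]))
```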

Figure 10. Visualization for the transformer on the network structure.

Visualization of Change Intensity Tokens
Since the Siamese backbone extracts the feature maps of the pre- and post-temporal phases separately, it is important to consider both feature maps in the transformer modeling. Here, we fuse temporal information by constructing a change intensity token for each of the two feature maps, which the transformer computes in the spatial dimension; the two resulting change intensity tokens then interact through a further transformer computation. To further display the feature extraction ability of the proposed transformer bitemporal feature enhancement, gradient-weighted class activation mapping (G-CAM) is adopted to evaluate the proposed method. The G-CAM method displays the areas of the image that are important to the model's prediction by generating a coarse attention map from the last layer of the neural network. Red denotes higher attention values, and blue denotes lower values, as shown in Figure 11.

From the second column to the third column, we can see that the transformer has successfully located the building information on the high-level semantic feature map extracted by the CNN. However, the map remains relatively red overall, which indicates that the transformer still maintains higher weights for nonbuilding areas. By introducing change intensity tokens, the weight values of nonbuilding areas are clearly suppressed, resulting in a light green color. The last column, which shows the change feature map, is clearly localized on the change area, within which the color is biased toward red.
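For readers unfamiliar with the G-CAM visualization used above, the following minimal sketch shows the core computation of gradient-weighted class activation mapping on plain nested lists: each channel of a feature map is weighted by the spatial average of its gradient, the weighted channels are summed, and a ReLU keeps only positive evidence. The function name `grad_cam` and the normalization to [0, 1] are assumptions of this example; in practice, the activations and gradients would come from hooks on the chosen network layer, and the authors' exact layer choice is not specified here.

```python
def grad_cam(activations, gradients):
    """Compute a coarse class activation map.

    activations, gradients: nested lists shaped [K][H][W], holding a layer's
    feature maps and the gradients of the target score w.r.t. those maps.
    Returns an [H][W] map scaled to [0, 1] (scaling is an assumed convention).
    """
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average of the gradients over spatial positions.
    weights = [sum(sum(row) for row in grad) / (height * width)
               for grad in gradients]

    # Weighted sum of the feature maps over channels.
    cam = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]

    # ReLU keeps only locations that support the target class.
    cam = [[max(v, 0.0) for v in row] for row in cam]

    # Normalize so the strongest response equals 1.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


if __name__ == "__main__":
    acts = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
    grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
    print(grad_cam(acts, grads))
```

The resulting map is usually upsampled to the input resolution and rendered with a colormap in which high values appear red and low values appear blue.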

Conclusions
Current deep learning-based change detection methods mainly extract semantic features from images of different time periods without considering the temporal correlation between these features, which leads to more "pseudo-changes" in complex scenes. To address this problem, we propose a network architecture for bitemporal image change detection named UVACD. The network combines a CNN backbone for extracting high-level semantic information with a visual transformer. The visual transformer constructs change intensity tokens to complete the temporal information interaction and suppresses the weights of irrelevant information, which helps extract more distinguishable change features. The experimental results show that the proposed method is effective and outperforms some previous state-of-the-art change detection methods. The results also show that constructing extra classification tasks for the output of the transformer can improve the performance of the network. However, our method is still limited in detecting changes in small targets, and there is room for improvement. Our future work will focus on improving the detection of changes in small targets.

Conflicts of Interest:
The authors declare no conflict of interest.