Building Change Detection Based on an Edge-Guided Convolutional Neural Network Combined with a Transformer

Abstract: Change detection extracts change areas in bitemporal remote sensing images, and plays an important role in urban construction and coordination. However, due to image offsets and brightness differences in bitemporal remote sensing images, traditional change detection algorithms often have reduced applicability and accuracy. The development of deep learning-based algorithms has improved their applicability and accuracy; however, existing models use either convolutions or transformers in the feature encoding stage. During feature extraction, local fine features and global features in images cannot always be obtained simultaneously. To address these issues, we propose a novel end-to-end change detection network (EGCTNet) with a fusion encoder (FE) that combines convolutional neural network (CNN) and transformer features. An intermediate decoder (IMD) eliminates global noise introduced during the encoding stage. We note that ground objects have clear semantic information and distinct edge features. Therefore, we propose an edge detection branch (EDB) that uses object edges to guide mask features. We conducted extensive experiments on the LEVIR-CD and WHU-CD datasets, and EGCTNet exhibits good performance in detecting both small and large building objects. On the LEVIR-CD dataset, we obtain F1 and IoU scores of 0.9008 and 0.8295. On the WHU-CD dataset, we obtain F1 and IoU scores of 0.9070 and 0.8298. Experimental results show that our model outperforms several previous change detection methods.


Introduction
In the field of remote sensing, change detection is an important research topic. The purpose of change detection is to identify changed regions in remote sensing images obtained during different periods [1,2]. Change detection plays an important role in land use [3], urban management [4], and disaster assessment [5,6]. However, the common issues of image offsets [7] and brightness differences in remote sensing images obtained in different periods increase the difficulty of change detection research.
Traditional change detection algorithms can be divided into two main categories: pixel-based methods and object-based methods. Pixel-based methods obtain change maps by directly comparing the pixels in two images [8], such as the image difference method [9], image ratio method [10], regression analysis [11], change vector analysis [12,13], and principal component analysis [14]. However, since pixel-based methods focus on pixels and tend to ignore contextual relationships between pixels, their change maps contain considerable noise. Therefore, some scholars have proposed object-based methods [15,16]. These methods use spectral and spatial adjacency similarity to divide an image into regions and use context information such as texture, shape, and spatial relationships [17] to judge the changed regions. However, object-based methods often fail to make good judgements about false detections caused by image shifts and brightness differences [18]. With advances in our understanding of image features, some scholars have proposed change detection methods based on post-classification objects [19,20]. However, this strategy is considerably affected by the object segmentation algorithm: error information introduced during segmentation is transmitted to the change map, reducing the accuracy of the change detection algorithm.
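The classical pixel-based image differencing described above can be sketched in a few lines. This is a minimal illustration, not a method from the paper; the threshold value is a hypothetical choice that would be tuned per scene in practice.

```python
import numpy as np

def difference_change_map(img_a, img_b, threshold=30):
    """Pixel-based change detection by image differencing.

    img_a, img_b: uint8 grayscale arrays of the same shape.
    Returns a binary change map (1 = changed). The threshold is an
    illustrative value; real scenes require tuning or automatic selection.
    """
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(img_a.astype(np.int16) - img_b.astype(np.int16))
    return (diff > threshold).astype(np.uint8)

# Toy example: a bright 2x2 patch appears in the second image.
a = np.zeros((8, 8), dtype=np.uint8)
b = a.copy()
b[2:4, 2:4] = 200
change = difference_change_map(a, b)
```

Because every pixel is judged independently, noise in either image produces isolated false detections, which is exactly the weakness that motivates object-based and learning-based methods.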
With the development of deep learning algorithms, an increasing number of scholars have applied deep learning algorithms in the field of remote sensing and have achieved good results in object detection, semantic segmentation, and change detection of remote sensing objects. Deep learning algorithms blur the concept of pixels and objects in traditional change detection algorithms [21]. After the image passes through a neural network unit, the image is converted into high-dimensional abstract feature information, including the spatial information of the context, and is restored to its original size through upsampling, resulting in pixel-level classification predictions. Because deep learning algorithms are robust and more generalizable than traditional algorithms, an increasing number of scholars have transferred their research on change detection algorithms to deep learning-based algorithms.
Existing change detection networks can be divided into the following categories: Convolution-based change detection networks: The powerful capability of convolutions to understand images has increased research on CNN-based change detection networks. Ref. [22] adopted a spatial self-attention module in a Siamese network that calculates attention weights in images with different sizes and scales. Ref. [23] enhanced correlations between feature pairs by aggregating low-level and global feature information using a pyramid-based attention mechanism. Ref. [24] introduced a dual-attention module that focused on the dependencies between image channels and spatial locations to improve the discriminative ability of features. However, these studies focused on image channels and spatial dependencies, and ignored correlations between features at different scales. Therefore, ref. [25] proposed a differentially enhanced dense attention mechanism that simultaneously focuses on spatial context information and relationships between high- and low-dimensional features, thereby improving the network results. Ref. [26] adopted UNet as the network backbone and designed a cross-layer block (CLB) to combine multiscale features and multilevel context information. Based on UNet, ref. [27] used skip connections to connect encoder and decoder features and supplement feature information lost during the encoding stage in the output pixel prediction map. Ref. [28] introduced a multiscale fusion module and a multiscale supervision module to obtain more complete building boundaries. Ref. [29] proposed an edge-guided recurrent convolutional neural network (EGRCNN) that used edge information to guide building change detection. Although networks with CNNs as their backbone have achieved good results, their receptive fields are usually small, and features are often lost when handling large objects; thus, global features are captured worse by CNN-based models than by transformer-based networks.
Transformer-based change detection networks: Recent research on the transformer has led to its use in various vision tasks; however, there has been little research on its use in change detection. Ref. [30] successfully designed a layered encoder based on a transformer architecture. A change map is obtained at each scale through a differential module; these change maps are then decoded by a multilayer perceptron (MLP), which effectively utilizes the multiscale information of the global features.
Convolution and transformer-based change detection networks: Although there have been several studies on the combination of transformers and CNNs, few scholars have investigated these models in change detection tasks in remote sensing images. Ref. [31] used a CNN as the backbone to extract multidimensional image features, refined the original features with a transformer decoder, and achieved good results on multiple datasets. Although many networks for change detection have been proposed, these methods have several disadvantages: (1) Existing networks often use only convolutions or transformers during the encoding stage and thus do not take advantage of the translation invariance and local correlation of convolutional neural networks or the broad global vision of the transformer. Therefore, the detection effect of certain objects is often unsatisfactory.
(2) Existing change detection networks generally ignore edge features, and the boundaries of ground objects are not always ideal in the final change result. (3) Most existing methods obtain change maps by directly fusing the encoder features, which may be disturbed by background noise.
In remote sensing images, target objects contain both obvious semantic information and clear edge features [32]. Therefore, the use of edge information to constrain semantic features is an important research topic. In a building segmentation study, ref. [33] adopted a holistically nested edge detection (HED) module to extract edge features, and the edges and segmentation masks were combined in a boundary enhancement (BE) module; the output edge information was used to optimize the segmentation mask, thereby improving the IoU score.
Inspired by these works, we note that in the change detection task, the changed objects are a subset of all the objects of that type in the bitemporal images. Therefore, we added an edge detection branch to our change detection network and designed an edge fusion (EF) module to combine and supervise edge features. Moreover, on the output change mask, we fused the edge information to strengthen the boundaries of the mask.
The contributions of this paper can be summarized as follows: (1) We explore a novel idea, namely, fusing CNN and transformer features, and apply this concept to change detection in remote sensing images. (2) To obtain a more complete ground object mask, we add an edge detection branch to our change detection algorithm and use the edge information to constrain the output change mask. (3) To suppress the background noise introduced during the encoding stage, we add an intermediate decoder before running the change decoder. The intermediate decoder performs upsampling on the feature information obtained during the encoding stage, partially restores the object information, and fuses the mask and edge features. Our code can be found at https://github.com/chen11221/EGCTNet_pytorch (accessed on 1 September 2022).

Methods
In the first part of this section, we introduce the overall architecture of our end-to-end change detection network. The second part of this section discusses the fusion encoder; the third part details the intermediate decoder; the fourth part describes the change decoder; the fifth part presents our proposed edge detection branch; the sixth part presents our loss function.

Network Architecture
The proposed network architecture is shown in Figure 1 and can be divided into two main parts: the change detection branch and the edge detection branch. The change detection branch has three main components: a fusion encoder, an intermediate decoder, and a change decoder. The fusion encoder extracts multiscale features, and the intermediate decoder partially restores the encoded information and fuses the edge information. The change decoder includes a differential module and a semantic multiscale feature fusion module. The difference module identifies changed features in the two sets of features. The semantic multiscale feature fusion module combines the change features obtained at different scales to generate the final change map. The edge detection branch shares the fusion encoder with the change detection branch. Since the edge features are low-level features, these features disappear in the deep network; thus, we use the feature layer after the first convolution and the first and second layer features of the encoder as the edge feature layers. The edge detection branch also contains edge self-attention (ESA) and an edge decoder. The edge decoder includes edge fusion and edge multiscale feature fusion. The EF module fuses two sets of edge features obtained at different scales, and the edge multiscale feature fusion module combines the fused edges at different scales to generate the final edge output.

Figure 1.
Overall architecture of EGCTNet. The red box is the edge detection branch, FE represents the fusion encoder, ESA represents edge self-attention, IMD represents the intermediate decoder, the blue circle is the edge decoder, EF represents edge fusion, EMFF represents edge multiscale feature fusion, the green circle is the change decoder, DM represents the differential module, and SMFF represents semantic multiscale feature fusion.

Fusion Encoder
The fusion encoder, shown in Figure 2, consists of a ResNet encoder [34] and a transformer encoder [30]. The ResNet encoder extracts image features at multiple scales by stacking residual blocks, where n in each downsampling residual block is 64, 128, 256, or 512, and the numbers of residual blocks in the four layers are 3, 4, 6, and 3, respectively. In the first residual block of each stage, a stride of 2 is used to halve the height and width of the feature matrix. The residual block structure is shown in Figure 3. The convolution branch introduces translation invariance to EGCTNet, and the correlations between local pixels in the image are obtained by sharing the convolution kernel.
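A stride-2 residual block of the kind described above can be sketched as follows. This is an illustrative PyTorch sketch of a standard ResNet basic block, not the paper's exact implementation; layer names and channel counts are assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block; stride=2 in the first block of a stage
    halves the spatial size, as in the ResNet encoder described above."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection on the shortcut whenever the shape changes.
        self.shortcut = (nn.Identity()
                         if stride == 1 and in_ch == out_ch
                         else nn.Sequential(
                             nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                             nn.BatchNorm2d(out_ch)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

x = torch.randn(1, 64, 64, 64)
y = ResidualBlock(64, 128, stride=2)(x)   # spatial size halved, channels doubled
```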

The transformer branch encoder consists of four sets of downsampling layers and transformer blocks. The first downsampling layer uses a Conv2D layer with a kernel size (K) of 7, a stride (S) of 4, and a padding (P) of 3, and the remaining downsampling layers use Conv2D layers with K = 3, S = 2, and P = 1. The transformer blocks consist of a self-attention module and an MLP layer.
Due to the large number of pixels in high-resolution images, traditional self-attention modules require a large number of parameters to process the image sequence. Sequence reduction [35] can substantially reduce the number of parameters and can be formulated as follows:

S' = Reshape(HW/R, C · R)(S)
S'' = Linear(C · R, C)(S')

where S represents the original sequence, Reshape represents the shaping operation, R is the reduction rate, H, W, and C are the height, width, and number of channels of the image, respectively, and Linear is the fully connected layer, which reduces the number of channels from C · R to C. This process generates a new sequence of shape (HW/R, C). The generated feature sequence is passed through 2 MLP blocks and a convolutional layer to generate image features that contain global information. The MLP block encodes the feature sequence. In contrast to existing MLP structures, after the first fully connected layer, we convert the sequence to a matrix, perform a convolution, and then convert the matrix back to a sequence. The MLP block is structured as shown in Figure 4.

To combine the image features obtained by the CNN and transformer branches, we use a simple feature aggregator (FA), implemented by a 1 × 1 Conv2D layer with K = 1, S = 1, and P = 0, as follows:

F_fuse = Conv2D(Cat(F_c, F_t))

where Cat represents the tensor connection, F_c represents the image features obtained by the CNN branch, and F_t represents the image features obtained by the transformer branch.

To enhance multiscale features, we use the atrous spatial pyramid pooling (ASPP) [36] module in the last layer of the encoder. The ASPP module applies dilated convolutions with different sampling rates, concatenates the resulting channels with Cat, and restores the channel count with a 1 × 1 convolution.
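The sequence-reduction idea in the transformer branch can be sketched as below. This is a SegFormer-style efficient self-attention sketch under assumed layer names, not the paper's code; here the reduction is realized with a strided convolution over the spatial grid, so keys and values come from a shortened sequence.

```python
import torch
import torch.nn as nn

class ReducedSelfAttention(nn.Module):
    """Self-attention with sequence reduction: keys/values are computed
    from a sequence shortened by the reduction factor, which cuts the
    quadratic attention cost. Illustrative sketch only."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # Plays the role of Reshape + Linear in the text: a conv with
        # kernel = stride = reduction shortens the spatial sequence.
        self.sr = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction)
        self.scale = dim ** -0.5

    def forward(self, x, h, w):
        # x: (B, HW, C) token sequence for an h x w feature map
        b, n, c = x.shape
        q = self.q(x)
        s = x.transpose(1, 2).reshape(b, c, h, w)
        s = self.sr(s).reshape(b, c, -1).transpose(1, 2)  # reduced sequence
        k, v = self.kv(s).chunk(2, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        return attn.softmax(dim=-1) @ v

x = torch.randn(2, 16 * 16, 32)
out = ReducedSelfAttention(32, reduction=4)(x, 16, 16)  # same length as input
```

The output keeps the full query sequence length, while the attention matrix is only (HW) × (HW/R²) instead of (HW) × (HW).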

Intermediate Decoder
The intermediate decoder substantially reduces the background noise contained in the features of the ground objects by partially restoring the ground object information. The IMD module is shown in Figure 5. The edge features are obtained from the edge self-attention module, which is introduced in detail in Section 2.5. First, we combine the edge features and mask features and pass the combined features to the deep layer. To aggregate multiscale information when each layer's edge features are fused with the mask features, we fuse the previous layer's feature information during each upsampling step. The upsampling operation uses a bilinear interpolation algorithm and skip connections to prevent gradient disappearance. The Edge Embedding Module (EEM) embeds edge information into mask feature matrices of different scales. The EEM outputs the final mask features and passes the mask features to the change decoder. The EEM has two main goals.
(1) We downsample the edge features to fit the multiscale fused mask feature structure at each layer; then, we add the features and convolve the output, as follows:

F_out = Conv(Down(F_e) + F_s)

where Down is the downsampling operation, F_e is an edge feature, and F_s is a mask feature containing multiscale information. (2) To prevent gradient disappearance, we retain the direct fusion of the mask and edge features output by the last layer, as follows:

F_out = Conv(Cat(Upsample(F_s), F_e))

where Upsample is a bilinear interpolation algorithm, F_e is an edge feature, and F_s is a mask feature.
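The first EEM step (downsample the edge features, add them to the mask features, then convolve) can be sketched as follows. The channel counts and the 1 × 1 projection used to match channels are illustrative assumptions, not the paper's exact layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeEmbedding(nn.Module):
    """Sketch of the edge embedding step: resize the edge features to the
    mask feature's resolution, add, then convolve the result."""
    def __init__(self, edge_ch, mask_ch):
        super().__init__()
        self.proj = nn.Conv2d(edge_ch, mask_ch, 1)         # match channels (assumed)
        self.conv = nn.Conv2d(mask_ch, mask_ch, 3, padding=1)

    def forward(self, f_edge, f_mask):
        # Down(F_e): bilinear resize to the mask feature's spatial size
        e = F.interpolate(f_edge, size=f_mask.shape[2:], mode="bilinear",
                          align_corners=False)
        return self.conv(self.proj(e) + f_mask)            # Conv(Down(F_e) + F_s)

edge = torch.randn(1, 32, 64, 64)    # shallow, high-resolution edge features
mask = torch.randn(1, 128, 16, 16)   # deeper, low-resolution mask features
out = EdgeEmbedding(32, 128)(edge, mask)
```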


Change Decoder
The change decoder generates the final change map, and its structure is shown in Figure 6. The difference module identifies changed features in feature pairs acquired at different scales, and its specific structure is as follows:

F_diff^i = ReLU(BN(Conv(Cat(F_pre^i, F_post^i))))

where F_pre^i and F_post^i represent the i-th layer feature maps before and after the change, respectively. Cat is a tensor connection that relates the positions of the features in the feature maps before and after the change at different scales so that the network can learn the changed features at each scale. ReLU is the activation function, and BN represents the normalization operation. Considering the offset issues in the images, we did not use the difference between F_pre^i and F_post^i as the change feature.

Furthermore, to prevent the deep change result from vanishing during gradient descent, the change result of the upper layer is added each time the change feature is calculated; here, F_diff^4 is the change feature obtained by the deepest EEM in the intermediate decoder, and F_diff^5 is the change feature obtained by the skip-connected EEM in the intermediate decoder.
Next, the change features obtained by the differential module are uniformly restored to (H/2, W/2) through upsampling. The semantic multiscale feature fusion module combines the 5 groups of changed features, and the fusion process is formulated as follows:

F_fuse = Conv2D(Cat(F_diff^1, F_diff^2, F_diff^3, F_diff^4, F_diff^5))

where F_diff^2, F_diff^3, and F_diff^4 are the upsampling results of each layer, the upsampling algorithm is bilinear interpolation, Conv2D is a 1 × 1 convolution, and Cat represents the tensor connection.
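The multiscale fusion step can be sketched as below: every change feature is bilinearly upsampled to the largest scale, concatenated, and projected back with a 1 × 1 convolution. Channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_multiscale(feats, out_conv):
    """Upsample all change features to the resolution of the first
    (largest) one, concatenate along channels, and fuse with a 1x1 conv."""
    target = feats[0].shape[2:]                      # e.g. (H/2, W/2)
    ups = [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
           for f in feats]
    return out_conv(torch.cat(ups, dim=1))

feats = [torch.randn(1, 32, 64, 64),
         torch.randn(1, 32, 32, 32),
         torch.randn(1, 32, 16, 16)]
fused = fuse_multiscale(feats, nn.Conv2d(96, 32, 1))  # 3 * 32 -> 32 channels
```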
The fused changed features are upsampled using a transposed convolution to restore the features to their original size and finally passed through a classifier that determines whether each pixel in the matrix has changed.

Edge Detection Branch
The edge detection branch is used to assist the mask edges of the objects in the change detection branch, and its structure is shown in Figure 7. Although the shallow structure contains rich edge information, it also contains complex background noise. Edge self-attention can help shallow features eliminate background noise. Edge self-attention uses deep mask features as a guide to remove noise in shallow features and focus more on ground features. The specific implementation is as follows:

F_edge' = Sigmoid(Upsample(F_sem)) · F_edge

where F_sem represents a feature of the ground object mask, F_edge represents an edge feature, and Upsample is a bilinear interpolation algorithm that restores the mask feature to the size of the edge feature layer. Sigmoid is the activation function, which is used to obtain the global attention layer containing the ground object mask information. Since the edges are components of the mask features, the dot product between the global attention layer of the feature mask and the edge feature layer removes background noise in the edge feature layer. In addition, we take the edge features obtained in the shallowest edge self-attention layer as the edge features in the IMD module. EF is an edge fusion module that fuses the edge features obtained in images at different scales before and after the change, as follows:

F_ef^i = Conv(Cat(F_pre^i, F_post^i))

where F_pre^i represents an edge feature in the i-th layer before the change, and F_post^i represents an edge feature in the i-th layer after the change.
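The gating step of edge self-attention can be sketched as below: a sigmoid gate built from the deep mask features suppresses background noise in the shallow edge features via an element-wise product. This is our reading of the description above, with single-channel tensors assumed for simplicity.

```python
import torch
import torch.nn.functional as F

def edge_self_attention(f_sem, f_edge):
    """Sigmoid(Upsample(F_sem)) * F_edge: the mask-derived attention map
    is in (0, 1), so non-object regions of the edge features are damped."""
    gate = torch.sigmoid(
        F.interpolate(f_sem, size=f_edge.shape[2:], mode="bilinear",
                      align_corners=False))
    return gate * f_edge   # element-wise (dot) product

f_sem = torch.randn(1, 1, 16, 16)    # deep mask feature (channels assumed matched)
f_edge = torch.randn(1, 1, 64, 64)   # shallow, noisy edge feature
out = edge_self_attention(f_sem, f_edge)
```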
The edge multiscale feature fusion module has a structure similar to that of the semantic multiscale feature fusion module in the change detection branch. We upsample the fused edge features to size (H/2, W/2) and obtain the final edge features by learning the correlations between the edge feature positions at different scales. After the edge feature is restored to size (H, W) through bilinear interpolation, the edge classifier is used to generate the final edge output.
Figure 7. The details of the edge detection branch. The red box is the edge multiscale feature fusion module.


Loss Function
Our loss function includes the losses of the edge branch and the change detection branch and is formulated as follows:

L = L_ce + λ · L_cbce

where L_cbce is the loss that supervises the edge branch, and L_ce is the loss that supervises the change detection branch. λ is a regularization parameter that balances the two loss functions. In the experiments performed in this paper, the edge and change loss weights are balanced when λ = 10.
In the edge task, the edge includes only the single-pixel outline of an object, resulting in a serious imbalance between the numbers of edge and non-edge pixels. Therefore, class-balanced cross-entropy (CBCE) is often used as the loss function in edge tasks and is defined as follows:

L_cbce = −β Σ_{j∈Y+} log P(y_j = 1) − (1 − β) Σ_{j∈Y−} log P(y_j = 0)

where β is the percentage of non-edge pixels in the dataset, which is used to balance the uneven distribution of edge and non-edge pixels, Y+ and Y− are the sets of edge and non-edge pixels, P(y_j = 1) is the predicted probability that pixel j is an edge pixel, and P(y_j = 0) is the predicted probability that it is a non-edge pixel.
The cross-entropy loss function is commonly used in classification tasks and is formulated as follows:

L_ce = −Σ_i P(i) log Q(i)

where P(i) represents the true value of the i-th pixel, and Q(i) represents the predicted probability that the i-th pixel is in the changed state.
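The combined loss can be sketched in PyTorch as below. This is an illustrative sketch, not the paper's code: the class-balanced term is implemented with per-pixel weights on a binary cross-entropy, and λ = 10 follows the value reported above.

```python
import torch
import torch.nn.functional as F

def cbce_loss(logits, edges):
    """Class-balanced BCE for the edge branch: beta is the fraction of
    non-edge pixels, so the rare edge pixels are weighted up by beta."""
    beta = 1.0 - edges.mean()
    weight = torch.where(edges > 0.5, beta, 1.0 - beta)
    return F.binary_cross_entropy_with_logits(logits, edges, weight=weight)

def total_loss(change_logits, change_gt, edge_logits, edge_gt, lam=10.0):
    """L = L_ce + lambda * L_cbce, with lambda = 10 as in the paper."""
    l_ce = F.cross_entropy(change_logits, change_gt)
    return l_ce + lam * cbce_loss(edge_logits, edge_gt)

change_logits = torch.randn(2, 2, 16, 16)            # 2-class change prediction
change_gt = torch.randint(0, 2, (2, 16, 16))
edge_logits = torch.randn(2, 1, 16, 16)
edge_gt = (torch.rand(2, 1, 16, 16) > 0.9).float()   # sparse edge labels
loss = total_loss(change_logits, change_gt, edge_logits, edge_gt)
```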

Experiments
To verify the effectiveness of EGCTNet, in this section, we explain the datasets, training details, and evaluation indicators used in the experiments, compare existing change detection methods developed in recent years with the proposed network, and analyze the comparison results.

Datasets and Preprocessing
We used two publicly available CD datasets, namely, LEVIR-CD [22] and WHU-CD [37]. Several dataset examples are shown in Figure 8.
LEVIR-CD was acquired by Google Earth between 2002 and 2018, with an image resolution of 0.5 m. This dataset contains 637 pairs of high-resolution remote sensing images of size 1024 × 1024. The LEVIR-CD dataset is used for building change detection, focusing on small and dense buildings with different types of changes.
The authors provided a standard training set, test set, and validation set division for the LEVIR-CD dataset. Our experiments follow the original data division and, following recent research, crop the images into 256 × 256 patches.
The WHU-CD dataset consists of two sets of aerial data that were acquired in 2012 and 2016, with an image resolution of 0.3 m. This dataset is used for building change detection, focusing on large and sparse buildings.
Since the authors did not divide the WHU dataset, we divided it into 7680 nonoverlapping 256 × 256 images and randomly split them into training, test, and validation sets at a ratio of 7:2:1.

Implementation Details
We implemented our model in PyTorch and used an NVIDIA RTX 3090 GPU for training. To control for variables, we used the same data augmentation for all methods and kept the hyperparameters as consistent as possible. During training, we performed data augmentation on the dataset, including random flips, random rescaling (0.8-1.2), random cropping, Gaussian blur, and random color jittering. The network was randomly initialized at the beginning of training and trained using the AdamW optimizer with beta values of (0.9, 0.999) and a weight decay of 0.01. The learning rate was initially set to 0.0001 and decayed linearly with the epoch; the batch size was set to 16, and the number of epochs was set to 200.
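The optimizer and schedule described above can be configured as below. The model here is a trivial stand-in (EGCTNet itself is at the linked repository); the optimizer settings match the reported hyperparameters, and the linear decay is one plausible reading of "decayed linearly according to the epoch".

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Stand-in model; replace with EGCTNet from the linked repository.
model = torch.nn.Conv2d(3, 2, 3, padding=1)

optimizer = AdamW(model.parameters(), lr=1e-4,
                  betas=(0.9, 0.999), weight_decay=0.01)

epochs = 200
# Linear decay of the learning rate with the epoch index:
# factor 1.0 at epoch 0, approaching 0 at the final epoch.
scheduler = LambdaLR(optimizer, lr_lambda=lambda e: 1.0 - e / epochs)
```

In a training loop, `scheduler.step()` would be called once per epoch after the optimizer updates.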


Evaluation Metrics
In this experiment, we used the F1 and IoU scores of the change category as the main quantitative metrics. We also compared the overall accuracy (OA), precision, and recall of the change category. These metrics are calculated as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall)
IoU = TP / (TP + FP + FN)
OA = (TP + TN) / (TP + TN + FP + FN)

where TP denotes a true positive, TN denotes a true negative, FP denotes a false positive, and FN denotes a false negative.
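These standard definitions can be computed directly from a predicted binary change map and its ground truth; a minimal sketch:

```python
import numpy as np

def change_metrics(pred, gt):
    """Compute precision, recall, F1, IoU, and OA for the 'changed'
    class from the confusion counts, as defined above."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "iou": tp / (tp + fp + fn),
        "oa": (tp + tn) / (tp + tn + fp + fn),
    }

pred = np.array([[1, 1, 0, 0], [1, 0, 0, 0]])
gt   = np.array([[1, 0, 0, 0], [1, 0, 1, 0]])
m = change_metrics(pred, gt)
```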
We compared EGCTNet with the following methods: FC-Siam-Di [27]: A change detection method with a CNN structure that encodes multilevel features in bitemporal images and uses feature difference concatenation to obtain change information.
FC-Siam-Conc [27]: A change detection method with a CNN structure that encodes multilevel features in bitemporal images and uses feature fusion concatenation to obtain change information.
NestUnet [38]: A change detection method with a CNN structure that encodes multilevel features in bitemporal images and uses feature difference connections to obtain change information.
STANet [22]: A spatiotemporal attention-based method that calculates the weights of different pixels in space and time to obtain feature information to better distinguish changes.
DTCDSCN [24]: A channel spatial attention-based approach that uses a dual-attention module (DAM) to improve the model's ability to discriminate features. Following other works, we omit the semantic segmentation decoder to ensure fairness.
SNUNet [39]: A channel attention-based method that applies an ensemble channel attention module (ECAM) to refine the most representative features at different semantic levels for change classification.
ChangeFormer [30]: A transformer-based method that obtains multiscale change information through a transformer encoder and an MLP decoder.
BIT [31]: A method based on convolutions and a transformer that refines convolution features through a transformer decoder and uses feature differences to obtain change information.

Results on the LEVIR-CD Dataset
On the LEVIR-CD building dataset, we implemented all comparison methods except NestUnet using the hyperparameter settings described in the original literature. Since the original authors of NestUnet did not conduct experiments on LEVIR-CD, to ensure fairness, we used the same hyperparameter settings for NestUnet as for our method. The experimental results are shown in Table 1. Our method achieves the best results on most metrics, with an F1 score of 0.9008 and an IoU score of 0.8295. Our network may benefit from the feature advantages introduced by CNN and transformer fusion coding. Moreover, the addition of the edge detection branch may have improved the IoU score.
As shown in Figure 9, NestUnet performs notably well in detecting small buildings, which may be because network models with convolutional backbones are better at local feature extraction. In contrast, network models with transformer backbones perform better on large building detection; the third set of comparison images in Figure 9 supports this conclusion. EGCTNet combines the advantages of convolutions and transformers and performs well in both small and large building detection. We use red rectangles to highlight some of the buildings in Figure 9, showing that EGCTNet performs better on the edges of most buildings. However, some buildings are occluded by shadows, which leads to some false detections with EGCTNet.

Results on the WHU Building Dataset
On the WHU-CD building dataset, to control the hyperparameters, we trained all models except STANet with the same hyperparameters: a learning rate of 0.0001, a batch size of 16, and 200 epochs, with AdamW as the optimizer. For STANet, we initially set the batch size to 16; however, the video memory overflowed on our device during training, so we reduced its batch size to 4. The final results are shown in Table 2. Our method achieves the highest scores on all metrics, with an F1 score of 0.9070 and an IoU score of 0.8298. Since the WHU-CD dataset mainly includes large buildings, these results indicate that our method outperforms existing methods for change detection of large buildings, and the high IoU score shows that our method also performs better on building boundaries.
Figure 10 intuitively shows the performance differences among the models. NestUnet performs well in small building change detection, as illustrated by the first and second groups of images in Figure 10. This result suggests that models with CNN backbones handle small buildings better due to the local correlations captured by CNNs. However, the fourth set of images reveals an issue with CNN-backbone models: poor performance in large building change detection. The performance of BIT, ChangeFormer, and EGCTNet in the fourth group of images demonstrates the importance of the global information introduced by the transformer for large building detection. However, using only a transformer as the model backbone results in a loss of features for small buildings, as illustrated by the second set of images. After the edge detection branch is introduced in EGCTNet, the proposed network shows better performance on building boundaries, as illustrated in Figure 10.

Ablation Studies
To verify the effectiveness of the fusion encoder, intermediate decoder, and edge detection branch proposed in this paper, we randomly selected 1910 pairs of images from the WHU-CD dataset as the ablation dataset and randomly divided them into training, test, and validation sets. Three comparison networks were designed using a ResNet encoder and change decoder as the basic network structure. The first comparison network (CTNet) used the fusion encoder, and the second (CTINet) added an intermediate decoder; in this experiment, we removed the edge features from the intermediate decoder. The third comparison network added the edge detection branch, yielding the model proposed in this paper. The results are shown in Table 3.
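A random split like the one described can be sketched as follows. The 70/20/10 ratio and the fixed seed are assumptions for illustration only; the paper does not state the exact split proportions.

```python
import random

def split_dataset(pairs, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle image pairs reproducibly and split them into three subsets."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train = round(ratios[0] * n)
    n_test = round(ratios[1] * n)
    train = pairs[:n_train]
    test = pairs[n_train:n_train + n_test]
    val = pairs[n_train + n_test:]   # remainder goes to validation
    return train, test, val

# 1910 image pairs, as in the ablation study.
train, test, val = split_dataset(range(1910))
```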
The second row of Table 3 shows the result of using the fusion encoder. The F1 score increased from an initial value of 0.8874 to 0.8934, and the IoU score increased from 0.7976 to 0.8074. These increases may be due to the global information introduced by the transformer encoder, which enriches the extracted features and improves the overall result. The fusion encoder improves the recall score the most, increasing it from 0.8644 to 0.8821, which indicates that more changed pixels are detected. However, the precision drops from 0.9117 to 0.9050, which may be because the transformer encoder introduces global noise along with the global information.
The third row of Table 3 shows the result of adding the intermediate decoder. The fourth row of Table 3 shows the result of adding the edge detection branch. Except for precision, this model achieves the best results on all indicators, with an IoU of 0.8152 and an F1 score of 0.8982, demonstrating that the edge information introduced by the edge detection branch improves the detection result.
To represent the performance differences more intuitively, we show the results of each comparison network in Figure 11.
Figure 11 supports our conclusions. After the transformer coding structure is added to the model, large building detection improves; however, this structure also introduces noise. After the features pass through the intermediate decoder, the excess noise is eliminated, as shown by the results of the second and fifth groups in Figure 11. Moreover, the method proposed in this paper yields better results at the boundaries than the other methods.

Discussion
We present a new approach to change detection that combines a CNN and a transformer for joint encoding and develops multiple customized modules. Our results surpass those of several other models (see Figures 9 and 10 and Tables 1 and 2).
Each module brings a different degree of performance improvement to EGCTNet; however, adding modules also increases the computation and parameter counts. The computation and parameter counts of each module are shown in Table 4. Most of the parameters and computations of EGCTNet stem from the fusion encoder. Compared with the performance improvement brought by the intermediate decoder and edge detection branch, their added parameters and computation are acceptable. Since the fusion encoder and the encoder in Base differ only in the transformer structure, the transformer structure in the FE is likely the main reason for the increase in parameters and computation. Although the FE incurs a larger computational and parameter cost, the transformer structure provides a global view for EGCTNet, yielding comparatively large advantages in both quantitative and qualitative results (see Figure 11 and Table 3). The fusion encoder combines global and local features. In the future, we will continue to improve the FE, for example, by using the window idea of the Swin Transformer to maintain performance while reducing parameters and computation as much as possible.
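Per-module parameter counts like those reported in Table 4 can be obtained by summing `numel()` over each submodule's parameters. The toy two-module model below is purely illustrative; EGCTNet's actual modules are not reproduced here.

```python
from torch import nn

def param_counts(model: nn.Module) -> dict:
    """Number of parameters in each top-level submodule of a model."""
    return {
        name: sum(p.numel() for p in module.parameters())
        for name, module in model.named_children()
    }

# Toy stand-in with an "encoder" and a "decoder" module.
toy = nn.Sequential()
toy.add_module("encoder", nn.Conv2d(3, 8, kernel_size=3))  # 8*3*3*3 + 8 = 224 params
toy.add_module("decoder", nn.Conv2d(8, 1, kernel_size=1))  # 1*8*1*1 + 1 = 9 params
```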

Conclusions
In this paper, we propose a novel end-to-end change detection network named EGCTNet. After the transformer coding structure is added to the model, EGCTNet shows good performance in detecting large objects. The addition of the intermediate decoder eliminates the global noise introduced by the transformer, and the addition of the edge detection branch improves EGCTNet's performance on object edges. The experimental results demonstrate that EGCTNet performs well on both small, dense buildings and large, sparse buildings. Thus, our method exhibits good performance for building change detection.
Although the model proposed in this paper achieved good results on multiple datasets, it has some limitations. Because the network introduces both transformer and CNN modules, the proposed model still has a larger computational load than existing networks, even though we apply sequence reduction to the feature sequence in the transformer module. The multiple custom modules increase the complexity of the model while achieving better performance. In future work, we will optimize the calculations in the model to reduce computational costs while maintaining model performance.

Conflicts of Interest:
The authors declare no conflict of interest.