1. Introduction
Change detection (CD) captures the spatial changes between two multitemporal satellite images caused by man-made or natural phenomena [1]. Pixels in the same region but acquired at different times are usually classified as changed or unchanged by comparing the coregistered images [2]. The development of CD algorithms has been driven by two aspects. First, many researchers have analyzed the imaging mechanism and designed features accordingly. Second, the resolution of satellite data is constantly improving, from a spatial resolution of 79 m (Landsat-1) to today's massive volumes of submeter satellite data, which provides experimental scenarios and further development directions for research on change detection algorithms.
Traditional CD algorithms achieve good detection results on low-resolution data by manually and elaborately designing features and tuning hyperparameters. At the early stage of development, researchers directly compared images and identified changes in Landsat imagery by image differencing, image ratioing, and regression analysis [3,4,5]. Among feature-space mapping methods, principal component analysis (PCA) [6] is utilized to compress features and obtain the main components, and changed regions are then obtained by the change vector analysis (CVA) [7] algorithm. However, the above methods place high requirements on the data and require, as far as possible, the same sensor and data with consistent radiometric characteristics [8]. With the increase in the number of satellites, the use of classification-based CD methods has grown [9,10,11]. These methods classify first and then detect changes, avoiding the false changes caused by inconsistent radiometry. The change detection problem is thus transformed into how to classify surface features with high quality.
With the development of satellite imaging and the launch of numerous land-observation satellites, high-resolution images have become increasingly accessible. In China, operational departments use images with resolutions from 0.5 to 2 m for national land cover detection and change monitoring. In the SpaceNet 7 challenge, high-resolution SPOT data are used for the competition, which places higher demands on change-target extraction techniques. Traditional CD algorithms based on manually designed features and many hyperparameters perform poorly because it is difficult to model semantics in high-resolution images [12,13,14]. Thus, these algorithms are gradually being replaced by deep neural networks trained on large amounts of data and samples.
Presently, algorithms based on convolutional neural networks (CNNs) have shown excellent performance in a variety of change detection tasks. These networks extract multilayer pyramid features via encoders and fuse change features via decoders [15,16,17,18,19,20,21,22,23,24,25,26]. Building on the encoder–decoder structure, numerous researchers have made innovations tailored to the characteristics of different change detection tasks. Chen et al. designed the encoder with residual connections and pretrained it on a large change detection dataset [22]. Daudt et al. decoupled the dual-phase features and introduced a twin encoder for independent feature extraction [15]. Some scholars have investigated how to fuse the dual features after the twin network has extracted them and proposed more efficient fusion methods that integrate prior knowledge of change detection [15,19,21,22,27,28]. In addition, multiscale information exchange channels can be added to the decoder, and change decoding can be handled better by adding deep supervision [21]. In view of the lack of change samples, we previously proposed using segmentation information to guide encoder learning [29,30] and designed data augmentation schemes specific to change detection tasks to greatly improve the utilization efficiency of sample information [31].
However, CNN-based algorithms encounter some problems. (1) The receptive field is limited. CNNs expand the receptive field via downsampling, which easily causes information loss. (2) It is difficult to exchange information across scales. The multiscale features of CNNs are usually fused by layer-by-layer sampling [32] or direct concatenation [33]. Information across scales and channels is simply mixed, which makes it difficult to exploit and limits the achievable accuracy. On change detection tasks, this manifests as incomplete detection of large targets and easy loss of small targets.
Recently, the appearance of the transformer [34] provided a new feature extraction method. Through its self-attention mechanism, each pixel can obtain global information. With the application of transformers to image segmentation [35,36,37,38] and detection [39] tasks, accuracy has greatly improved. In the field of change detection, BIT [40] uses a convolutional neural network to extract features and then constructs a transformer encoder–decoder to relate the features at different times. Although a transformer is utilized, the hybridization of the transformer and the CNN is not well designed, so its accuracy is inferior to that of a pure CNN. In subsequent research, researchers used the transformer to fuse twin-network features under the overall architecture of a CNN [21,23] and achieved better accuracy than a pure CNN. DMATNet [41] fuses fine and coarse features with transformer-based dual-feature mixed attention. It can not only extract more specific regions of interest but also overcome the misjudgment caused by oversampling and synchronize feature extraction with target information integration. SwinSUNet [42] builds UNet-style encoders and decoders based on Swin Transformer [36] and achieves leading results across multiple datasets. Transformers can not only provide global feature modeling to obtain long-range semantic information but also build information exchange channels between two features [23]. However, the original multihead self-attention (MSA) [43] disregards the inductive bias of CNNs and is computationally expensive: network computation is slow, and convergence is difficult when the dataset is small. Thus, although the transformer can provide higher accuracy, it demands large amounts of data, and the original MSA involves extensive computation.
The development of current change detection research is inseparable from the support of many open datasets. Public datasets provide an open benchmark so that different algorithms can be compared on the same basis. WHU-CD [44] and LEVIR-CD [45] provide extremely high-resolution building changes, and SECOND [46] provides multicategory change labels. However, in actual change detection scenes, there are numerous large targets that contain both regions with weak changes and regions with obvious changes. During detection, people usually need to infer the weak regions from the regions with obvious changes. This scenario does not exist in the current public datasets. Therefore, this paper constructs a change detection dataset covering large scenes, which complements the lack of such change scenarios in the current public datasets.
In this paper, to solve the problem that current CNNs have difficulty detecting large targets completely and easily lose small targets, a Change MSA module is designed at the intrascale level by using the global modeling capability of the transformer. For the problem that the original MSA involves extensive computation, a block-based MSA named S-MSA is utilized. To solve the problem of change-data fusion, the tokens of the original MSA, which are constructed along the spatial dimension, are replaced by tokens constructed along the channel dimension, thus realizing direct global feature exchange in the channel dimension. To address the transformer's large data demand, MSA is combined with a feature fusion module (FFM) to enhance the local feature modeling of the transformer by using the inductive bias of convolution and to greatly reduce the computational requirements. Through the multiscale characteristics of blocks, the module can efficiently model features of different scales while maintaining low computational complexity. The transformer features and the twin features extracted by the encoder are merged layer by layer in the CNN decoder so that the network can simultaneously extract large and small targets while maintaining the high precision of the original CNN and can use the regions with obvious changes to enhance the features of the regions with weak changes.
The contributions of this work are summarized as follows:
We propose a hybrid transformer–CNN change detection network named TChange. Under the condition of maintaining a low computational cost, the network can globally and efficiently model the features within the scale and provide a direct information exchange channel for features across scales.
A novel MSA module named Change MSA is proposed to acquire global features and pairwise feature information within scales. In addition, a new feature fusion method, which conducts MSA computations in the channel dimension rather than the spatial dimension via channel crossing, is proposed for change detection tasks. The inductive bias of CNNs is used to enhance the local modeling ability of MSA.
An interscale transformer module (ISTM) is proposed to build a multiscale feature exchange channel.
A new remote sensing change detection dataset named TZ-CD is constructed by taking into account changed regions with various areas, which compensates for the lack of scenarios in the current change detection public dataset.
2. Materials and Methods
2.1. Overview
TChange uses a transformer and a CNN to build a model that provides high-precision pixel-level change area detection. The overall structure of TChange is shown in Figure 1. First, the network extracts multilevel pyramid features via multiple feature extractors. Second, to fuse multiscale features, the common approach is to upsample and fuse layer by layer with a CNN so that shallow features can obtain information from deep features. However, this information flow is one-way, small targets in the deep features easily lose information, and deep features suffer from a limited receptive field. Therefore, this paper proposes an interscale information fusion module based on the transformer. While preserving the layer-by-layer fusion decoding scheme, information is exchanged among multiscale features through the long-distance modeling capability of the transformer. Last, because the transformer easily loses high-frequency features and the decoding module is relatively complex, deep supervision is applied to learn edges of various sizes, ensuring that high-precision edges are obtained while the changing target is detected.
2.2. Encoder
TChange is aimed at high-resolution remote sensing images. The structure of TChange includes a CNN–transformer hybrid decoder, so the computation is heavy. Therefore, to balance efficiency and accuracy, the lightweight and efficient EfficientNet-B1 is selected as the feature extractor to extract the multilayer pyramid features. The inputs of TChange are remote sensing images of the same region acquired in different time phases. Multiscale features are aligned by twin encoders that share weights, and two groups of multiscale features, one per time phase, are output at progressively downsampled spatial scales. TChange uses a universal encoder, which can directly use an ImageNet-pretrained model to obtain higher accuracy and a faster training speed.
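For illustration, the weight-sharing twin encoder can be sketched as follows. This is a minimal sketch rather than the authors' released code: the exact backbone variant (here timm's efficientnet_b1) and the returned feature scales are assumptions.

```python
# Minimal sketch of a weight-sharing twin (Siamese) pyramid encoder.
# The backbone name "efficientnet_b1" and the timm API are assumptions,
# not the authors' exact implementation.
import torch
import timm


class TwinEncoder(torch.nn.Module):
    def __init__(self, backbone: str = "efficientnet_b1", pretrained: bool = False):
        super().__init__()
        # features_only returns the multilevel pyramid features of the backbone.
        self.backbone = timm.create_model(backbone, features_only=True, pretrained=pretrained)

    def forward(self, t1: torch.Tensor, t2: torch.Tensor):
        # The same (shared-weight) backbone processes both time phases,
        # so the two feature pyramids are aligned channel by channel.
        feats_p = self.backbone(t1)
        feats_q = self.backbone(t2)
        return feats_p, feats_q


if __name__ == "__main__":
    enc = TwinEncoder()
    t1 = torch.randn(1, 3, 256, 256)
    t2 = torch.randn(1, 3, 256, 256)
    fp, fq = enc(t1, t2)
    for p, q in zip(fp, fq):
        print(p.shape, q.shape)  # pyramid features at progressively smaller scales
```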
2.3. Change MSA
In high-resolution remote sensing images, there are many large change targets, and each large change target contains weakly changed features. When people interpret images visually, they gradually infer the weakly changed areas from the areas with obvious change characteristics. Previously, some scholars applied the UNet structure to fuse features of different scales layer by layer and identify changes in weak areas by using long-range feature information. The rise of transformers provides a more efficient way to obtain long-range semantic information without downsampling. Therefore, this paper proposes an MSA-based approach to discover changes and obtain long-range semantic information.
To solve the problem of long-distance semantic acquisition and feature pair change discovery, this paper designs an MSA module for change detection tasks based on MSA, named Change MSA, which consists of three parts: (1) C-MSA for interchannel awareness, (2) S-MSA for spatial long-distance semantic acquisition, and (3) FFM for convolutional biased induction to enhance local feature learning.
Before introducing our specific implementation, we first describe the basic paradigm of MSA in image segmentation. The input is a feature map $X \in \mathbb{R}^{H \times W \times C}$, which is reshaped to $HW$ tokens, where the dimension of each token is $C$. The tokens are linearly projected into the query $Q$, the key $K$, and the value $V$. Second, the input is sent to multiple heads for self-attention calculation, and each single head is calculated as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V,$$
where $K^{T}$ is the transposition of $K$, and $\sqrt{d}$ is the normalization factor to avoid large values after the dot-product operation. In a multihead self-attention network with $N$ heads, the channel dimensions of $Q$, $K$, and $V$ are divided into $N$ parts, and the head outputs are concatenated along the channel dimension. The computational cost of the original MSA is on the order of $(HW)^{2}C$, i.e., quadratic in the number of spatial tokens.
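As a concrete reference, this paradigm can be written as a minimal spatial MSA over $HW$ tokens. The sketch below is a generic, textbook-style implementation, not the paper's code; the module and parameter names are illustrative.

```python
# Minimal multihead self-attention over HW spatial tokens (generic sketch,
# not the paper's implementation). Cost is quadratic in the number of tokens.
import torch
import torch.nn as nn


class SpatialMSA(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) -> tokens (B, HW, C)
        B, H, W, C = x.shape
        tokens = x.reshape(B, H * W, C)
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)

        def split(t):
            # split channels into heads: (B, heads, HW, C/heads)
            return t.reshape(B, H * W, self.num_heads, -1).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, heads, HW, HW)
        out = attn.softmax(dim=-1) @ v                  # (B, heads, HW, C/heads)
        out = out.transpose(1, 2).reshape(B, H * W, C)
        return self.proj(out).reshape(B, H, W, C)


print(SpatialMSA(dim=64)(torch.randn(2, 16, 16, 64)).shape)  # torch.Size([2, 16, 16, 64])
```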
The original MSA is intended to obtain long-distance semantic information. However, the change detection task requires not only spatial long-distance information but also the information between two change features. Therefore, the Change MSA proposed in this paper includes two parts, channel awareness and spatial awareness, which are detailed in the following section.
2.3.1. C-MSA
Change detection uses twin encoders to obtain multilevel pyramid features. In most works, the features are fused by simple concatenation, and the change area is obtained roughly through the learning ability of deep networks. From the perspective of prior knowledge, this paper designs C-MSA based on the original MSA.
As shown in Figure 2, the inputs are the paired features $p, q \in \mathbb{R}^{H \times W \times C}$ extracted by the twin encoders. We argue that, because the twin encoders share weights, corresponding channels of the two features are the responses of the same detector. Therefore, all channels of the two features are interleaved (crossed) to form the feature $X \in \mathbb{R}^{H \times W \times 2C}$. Every two adjacent channels, one from each time phase, are combined into a group to build $C$ tokens, where the subscripts now represent token indices. Since C-MSA uses a paired spatial region as a token, each token is reshaped from $H \times W \times 2$ into a vector, and its feature depth is $2HW$. Compared with the tokens of the original MSA, as shown in Figure 3, the original MSA uses a spatial feature as a token, giving $HW$ tokens (Figure 3a), whereas this paper uses a dual-feature channel pair as a token, giving $C$ tokens (Figure 3c). The dimension of each token is $2HW$, and the tokens are linearly projected into $Q$, $K$, and $V$.
Next, $Q$, $K$, and $V$ are split into $N$ pieces along the token-dimension direction and fed into $N$ heads, so the dimension of each head is $2HW/N$. The calculation of the $j$th head is expressed as follows:
$$\mathrm{head}_{j} = \mathrm{softmax}\!\left(\frac{Q_{j}K_{j}^{T}}{\sqrt{d_{j}}}\right)V_{j} + P_{j},$$
where $K_{j}^{T}$ is the transposition of $K_{j}$, $\sqrt{d_{j}}$ is a normalization factor to avoid large values after the dot-product operation, and $P_{j}$ is a function for encoding the channel position. The position encoding of the original MSA encodes spatial information, whereas C-MSA flattens each channel pair into a token, so its position encoding is utilized to encode channel position information. The final output of C-MSA is the concatenation of all head outputs, reshaped back to the spatial layout.
Compared with the computational complexity of the original MSA, which grows quadratically with the spatial resolution, the computational complexity of C-MSA grows only linearly with the spatial resolution $HW$, which matters greatly when carrying out detection on large remote sensing images.
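A sketch of this channel-token idea is given below. It assumes 1 * 1 convolutions for the Q/K/V projections (a common choice for channel attention) and omits the channel position encoding; these details, and the class and parameter names, are assumptions rather than the authors' exact design.

```python
# Sketch of C-MSA's channel-token attention: pairs of interleaved channels are
# tokens, each of dimension 2*H*W, so attention cost grows linearly with H*W.
# Projection layout is an assumption; the position encoding is omitted here.
import torch
import torch.nn as nn


class CMSA(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.channels = channels
        # 1x1 convs produce Q, K, V without mixing spatial positions (assumption).
        self.qkv = nn.Conv2d(2 * channels, 6 * channels, kernel_size=1)
        self.proj = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # p, q: (B, C, H, W) twin features from the shared encoder.
        B, C, H, W = p.shape
        x = torch.stack((p, q), dim=2).reshape(B, 2 * C, H, W)  # interleave channels
        qs, ks, vs = self.qkv(x).chunk(3, dim=1)                # each (B, 2C, H, W)

        def to_tokens(t):
            # C tokens, each a flattened channel pair of dimension 2*H*W,
            # split across heads along that dimension.
            t = t.reshape(B, C, 2 * H * W)
            return t.reshape(B, C, self.num_heads, -1).transpose(1, 2)

        qs, ks, vs = to_tokens(qs), to_tokens(ks), to_tokens(vs)
        scale = qs.shape[-1] ** -0.5
        attn = (qs @ ks.transpose(-2, -1)) * scale   # (B, heads, C, C): channel-to-channel
        out = attn.softmax(dim=-1) @ vs              # (B, heads, C, 2HW/heads)
        out = out.transpose(1, 2).reshape(B, 2 * C, H, W)
        return self.proj(out)


print(CMSA(channels=32)(torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64)).shape)
```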
2.3.2. S-MSA
To preserve the edge texture of targets and realize the detection of small targets, spatial MSA (S-MSA) is applied to the 1/4-downsampled features. The original MSA has the problem of high computational complexity. Therefore, this paper draws on Swin Transformer [36] and adopts its window-based attention scheme. The original MSA treats every single pixel as a token over the whole image, while S-MSA restricts attention to nonoverlapping local windows, which greatly reduces the number of tokens involved in each attention operation. The tokens are shown in Figure 3b, where the original yellow area is aggregated into a token pixel by pixel. The subsequent S-MSA is calculated in the same way as the original MSA.
The computational cost of S-MSA also grows only linearly with the spatial resolution. Thus, TChange can use MSA to obtain long-distance spatial information at a higher resolution.
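Window-restricted attention of this kind can be sketched as follows. Whether the paper flattens each window into a single token or keeps per-pixel tokens inside each window is not fully recoverable from the text; this sketch keeps per-pixel tokens and uses an assumed window size of 8.

```python
# Sketch of window-restricted spatial attention (S-MSA). Attention is computed
# inside nonoverlapping windows, so the cost grows linearly with H*W. The
# per-pixel-token layout and the window size are assumptions for illustration.
import torch
from torch import nn


def window_partition(x: torch.Tensor, win: int) -> torch.Tensor:
    # x: (B, H, W, C) -> (B * num_windows, win*win, C)
    B, H, W, C = x.shape
    x = x.reshape(B, H // win, win, W // win, win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)


def window_reverse(tokens: torch.Tensor, win: int, H: int, W: int) -> torch.Tensor:
    # inverse of window_partition
    B = tokens.shape[0] // ((H // win) * (W // win))
    x = tokens.reshape(B, H // win, W // win, win, win, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class SMSA(nn.Module):
    def __init__(self, dim: int, win: int = 8, num_heads: int = 4):
        super().__init__()
        self.win = win
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C); H and W are assumed divisible by the window size.
        B, H, W, C = x.shape
        tokens = window_partition(x, self.win)
        out, _ = self.attn(tokens, tokens, tokens)  # attention within each window
        return window_reverse(out, self.win, H, W)


print(SMSA(dim=64)(torch.randn(1, 32, 32, 64)).shape)  # torch.Size([1, 32, 32, 64])
```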
2.3.3. FFM
The feature fusion module (FFM) uses the inductive bias of the convolutional network to capture local features, which strengthens the global feature modeling of MSA. The process is described as follows: first, the output feature of MSA is restored to the spatial feature X. Second, X is passed through a convolution layer to learn the inductive bias and then through LayerNorm and GELU activation layers. Last, the final output is obtained via another convolution layer.
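A minimal sketch of the FFM is shown below. The 3 * 3 and 1 * 1 kernel sizes are assumptions, while the conv–LayerNorm–GELU–conv ordering follows the description above.

```python
# Sketch of the feature fusion module (FFM): a convolution supplies the local
# inductive bias that MSA lacks. Kernel sizes (3x3 then 1x1) are assumptions;
# the conv -> LayerNorm -> GELU -> conv ordering follows the text.
import torch
from torch import nn


class FFM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # assumed 3x3
        self.norm = nn.LayerNorm(channels)
        self.act = nn.GELU()
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=1)             # assumed 1x1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W), the MSA output restored to its spatial layout.
        y = self.conv1(x)
        # LayerNorm over the channel dimension: move channels last and back.
        y = self.norm(y.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        y = self.act(y)
        return self.conv2(y)


print(FFM(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```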
2.3.4. Change MSA Summary
The structure of Change MSA is shown in Figure 4. The paired features p and q are extracted at each of the four encoder stages, and their channels are interleaved to obtain X. In the corresponding formulas, LN represents layer normalization, and the position encodings are generated by convolution. First, X is layer-normalized and fed into C-MSA for channel feature coding, and the output is added to X to obtain the intermediate feature. This feature is input into the FFM to strengthen local features, then into S-MSA, and the final output is obtained via another FFM module. Its spatial scale is the same as that of the input. It should be noted that, to balance precision and computation, TChange applies Change MSA starting from the 1/4 scale.
2.4. The Inter-Scale Transformer Module
In change detection research, the transmission and fusion of feature information across different scales greatly improve detection accuracy. This paper proposes a cross-scale feature exchange mechanism based on the transformer, which directly exchanges information among features at different scales by exploiting the characteristics of the transformer. The design of this module has two goals: (1) building a communication channel between features at different scales and (2) keeping the computation low.
Different from the original MSA, the input of the proposed ISTM is a set of multiscale features. To exchange information among features of different scales, we flatten the feature maps at each scale and then concatenate them. Through the self-attention mechanism of the transformer, each spatial feature can obtain global feature information. To further reduce the computation, we divide the features of different scales into blocks covering the same spatial regions, as shown in Figure 5. The same block area at different scales corresponds to the same area of the original data, so that the computation grows linearly with the spatial dimension while the cross-scale information exchange is realized.
The specific implementation is described as follows:
The input of this module is a set of multiscale feature maps, which are first divided into N regions per scale. The same region at different scales has a different size, but all correspond to features of the same area of the input data at different levels. The features of each region are flattened and then spliced to obtain the input token sequence, where a flattening function expands the spatial dimension of each feature and a concatenation function joins all flattened features into one sequence.
In Change MSA, the spatial and channel dimensions are already fully mixed by self-attention, so the ISTM is mainly aimed at providing an information exchange channel between features of different sizes. To reduce the computation, the features have been processed in blocks. The input tokens represent pixel-level features of the same area at different scales. We use the original MSA to exchange information among them: the concatenated tokens are layer-normalized and processed by MSA, the result is reshaped back to the original spatial dimensions, and the N regions are reassembled to obtain the output of the ISTM.
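The ISTM can be sketched as follows. Projecting every scale to a shared channel width before concatenation, and the fixed region grid, are assumptions made so that tokens from different scales can share one attention operation; they are not necessarily the authors' exact choices.

```python
# Sketch of the interscale transformer module (ISTM): the same spatial region at
# every scale is flattened, concatenated into one token sequence, and mixed with
# standard MSA. The shared channel width (dim) and regions-x-regions grid are
# assumptions made for this illustration.
import torch
from torch import nn


class ISTM(nn.Module):
    def __init__(self, in_channels, dim=64, num_heads=4, regions=4):
        super().__init__()
        self.regions = regions
        self.proj = nn.ModuleList(nn.Conv2d(c, dim, 1) for c in in_channels)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):
        # feats[i]: (B, C_i, H_i, W_i); every H_i, W_i assumed divisible by `regions`.
        feats = [p(f) for p, f in zip(self.proj, feats)]
        outs = [torch.zeros_like(f) for f in feats]
        r = self.regions
        for i in range(r):                    # iterate over the shared region grid
            for j in range(r):
                tokens, sizes = [], []
                for f in feats:               # collect the same region from every scale
                    h, w = f.shape[2] // r, f.shape[3] // r
                    block = f[:, :, i * h:(i + 1) * h, j * w:(j + 1) * w]
                    sizes.append((h, w))
                    tokens.append(block.flatten(2).transpose(1, 2))   # (B, h*w, dim)
                x = self.norm(torch.cat(tokens, dim=1))
                y, _ = self.attn(x, x, x)     # cross-scale information exchange
                start = 0                     # scatter mixed tokens back to each scale
                for out, (h, w) in zip(outs, sizes):
                    part = y[:, start:start + h * w].transpose(1, 2)
                    out[:, :, i * h:(i + 1) * h, j * w:(j + 1) * w] = part.reshape(-1, y.shape[-1], h, w)
                    start += h * w
        return outs


f1, f2 = torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32)
o1, o2 = ISTM([32, 64], dim=64)([f1, f2])
print(o1.shape, o2.shape)  # torch.Size([1, 64, 64, 64]) torch.Size([1, 64, 32, 32])
```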
2.5. The CNN Decoder
The decoder consists of two parts: a fusion step, which combines the transformer and CNN features, and the segmentation block (SEB); decoding then proceeds layer by layer via the SEB.
Fusion has three inputs, as shown in Figure 6a: the two CNN features from the twin encoder and the feature from the transformer branch. Compared with common feature fusion operations, this design adds the prior knowledge of changes to the change module while losing no features, so that a faster training speed and higher accuracy can be obtained. The CNN features are concatenated in the channel dimension. When n = 1, there is no input from the transformer, so the concatenated CNN features are directly input into the SEB module. The overall fusion module applies a convolution with a 1 * 1 kernel to the concatenated features; the features from the transformer and the CNN are thereby fused, complemented, and input into the SEB.
Segmentation block: the fused feature is taken as the input to the upsampling layer, and its spatial dimension is expanded by a factor of two. It is then fused with the features of the next scale and processed by a group of convolution operations. Afterward, we use SE attention [47] for feature fusion, which better combines the CNN and transformer features. The output of the SEB module is obtained by another group of convolution operations. As a special case, when n = 5, the block is at the deepest level, so there is no fusion with upper-layer features; the fusion step is skipped, and the upsampling results are directly passed to the subsequent operations.
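One decoder step can be sketched as follows. The channel widths, the SE reduction ratio, and the use of convolution–batch normalization–ReLU groups are assumptions for illustration, not the authors' exact configuration.

```python
# Sketch of one decoder step: fuse features with a 1x1 conv, upsample, merge
# with the next (finer) scale, and reweight channels with SE attention [47].
# Channel widths, the reduction ratio, and the layer ordering are assumptions.
import torch
from torch import nn


class SEAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)  # per-channel reweighting


class SEB(nn.Module):
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.fuse = nn.Conv2d(in_ch, out_ch, 1)  # "fusion" 1x1 conv over concatenated inputs
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.block = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.se = SEAttention(out_ch)
        self.out = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, deep: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        # deep: concatenated CNN (+ transformer) features at the coarser scale,
        # skip: fused features at the next finer scale.
        x = self.up(self.fuse(deep))
        x = self.block(torch.cat([x, skip], dim=1))
        return self.out(self.se(x))


print(SEB(in_ch=128, skip_ch=64, out_ch=64)(torch.randn(1, 128, 16, 16), torch.randn(1, 64, 32, 32)).shape)
```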
2.6. Output and Deep Edge Supervision
To produce the final outputs from the decoder, the TChange structure includes two output modules: the segmentation head and the edge head.
Segmentation Head: The input is the result of the last decoder layer. After a 3 * 3 convolution, the feature is upsampled by a factor of two with bilinear interpolation and activated via a sigmoid to obtain the 0–1 probability map of the changed region.
Edge Head: Because the decoder is heavy and the transformer is prone to losing high-frequency information, we adopt deep supervision for each decoder layer. Different from commonly employed deep supervision, we replace pixel-level supervision with edge supervision to counter the transformer's tendency to lose high-frequency information. Unlike the segmentation head, the edge head needs to actively guide high-frequency information. Therefore, we simultaneously perform average pooling and max pooling operations on the decoder features. Average pooling can be considered a smoothing filter that obtains the low-frequency information of the feature map, while max pooling obtains the high-frequency information. The low-frequency and high-frequency information are decoupled via pooling and then passed through a 3 * 3 convolution layer, and the 0–1 probability map is obtained using a sigmoid.
Note that the edge head only decouples high-frequency and low-frequency information to make it easier for the edge to learn.
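The two heads can be sketched as follows. The pooling kernel size and channel widths are assumptions, while the conv 3 * 3 plus sigmoid structure follows the description above.

```python
# Sketch of the two output heads. The edge head decouples low- and
# high-frequency content with average and max pooling before a 3x3 conv and
# sigmoid; pooling kernel size and channel widths are assumptions.
import torch
from torch import nn


class SegHead(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, x):
        # conv 3x3 -> 2x upsample -> sigmoid, giving a 0-1 change probability map
        x = nn.functional.interpolate(self.conv(x), scale_factor=2, mode="bilinear", align_corners=False)
        return torch.sigmoid(x)


class EdgeHead(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.avg = nn.AvgPool2d(3, stride=1, padding=1)  # low-frequency branch
        self.max = nn.MaxPool2d(3, stride=1, padding=1)  # high-frequency branch
        self.conv = nn.Conv2d(channels * 2, 1, 3, padding=1)

    def forward(self, x):
        x = torch.cat([self.avg(x), self.max(x)], dim=1)
        return torch.sigmoid(self.conv(x))


feat = torch.randn(1, 64, 128, 128)
print(SegHead(64)(feat).shape, EdgeHead(64)(feat).shape)
```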
In view of the small proportion of foreground pixels in the prediction results, the Dice loss is added to the commonly employed binary cross-entropy loss to optimize the foreground objective. The two loss functions are weighted 0.5:0.5.
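The combined loss can be written as below; the Dice smoothing constant is an assumption.

```python
# Sketch of the combined loss: binary cross-entropy plus Dice, weighted 0.5:0.5
# as stated above. The smoothing constant eps is an assumption.
import torch


def dice_loss(prob: torch.Tensor, target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    # prob and target are 0-1 maps of the changed region, flattened per sample.
    prob, target = prob.flatten(1), target.flatten(1)
    inter = (prob * target).sum(dim=1)
    return 1.0 - ((2.0 * inter + eps) / (prob.sum(dim=1) + target.sum(dim=1) + eps)).mean()


def change_loss(prob: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    bce = torch.nn.functional.binary_cross_entropy(prob, target)
    return 0.5 * bce + 0.5 * dice_loss(prob, target)


pred = torch.rand(2, 1, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.9).float()
print(change_loss(pred, mask).item())
```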
5. Conclusions
In this paper, we propose a hybrid transformer–CNN network (TChange) for the high-resolution change detection task, which efficiently extracts multiscale features by using the lightweight EfficientNet-B1 in the encoder. For the problem that detections of large targets are prone to internal voids, Change MSA is proposed to acquire long-range information in both the channel dimension and the spatial dimension. To address the problem that both large and small targets are easily missed, we propose the transformer-based ISTM to construct direct communication channels between scales. In the CNN decoder, the features from the transformer are input together with the multiscale features extracted by the encoder and fused to obtain the final change region. For the problem that the transformer tends to lose high-frequency features, deep edge supervision is proposed to replace the commonly employed deep supervision.
To verify the effectiveness of the proposed algorithm, a large-scale building change detection dataset, TZ-CD, is constructed. The dataset contains both very large targets and targets with weak features, which are lacking in public datasets, so that the performance of the algorithm can be measured more comprehensively. TChange achieves state-of-the-art results on the two open-source datasets and on TZ-CD.
Although TChange improves accuracy compared with previous networks and performs well on the public datasets, current change detection algorithms, as exposed by the TZ-CD dataset, still cannot achieve complete extraction when the task is more complex. Moreover, the datasets used in this paper have essentially the same distribution between the training set and the test set, so the experimental environment is relatively ideal. It can be expected that when domain differences appear, the accuracy gaps between different models will widen further, and more problems will be exposed. Therefore, in subsequent work, we will test the performance of the algorithm on complex tasks and under domain differences and then identify and solve the resulting problems.