Cascaded U-Net with Training Wheel Attention Module for Change Detection in Satellite Images

Abstract: Change detection is an important application of remote sensing image interpretation, which identifies changed areas of interest from a pair of bi-temporal remote sensing images. Various deep-learning-based approaches have demonstrated promising results, and most of these models use an encoder–decoder shape such as U-Net for segmentation of the changed areas. In order to obtain more refined features, this paper introduces a change detection model with cascaded U-Nets. The proposed network architecture contains four cascaded U-Nets with ConvNeXT blocks. With a patch embedding layer, the cascaded structure can improve detection results with acceptable computational overhead. To facilitate the training of the cascaded U-Nets, we propose a novel attention mechanism called the Training whEel Attention Module (TEAM). During the training phase, TEAM aggregates outputs from different stages of the cascaded structure and shifts attention from the outputs of shallow stages to the outputs of deeper stages. The experimental results show that our cascaded U-Net architecture with TEAM achieves state-of-the-art performance on two change detection datasets without extra training data.


Introduction
Satellite images have a wide range of applications in many areas, including dynamic disaster monitoring [1,2], environmental pollution monitoring [3][4][5][6], and urban planning [7][8][9]. Among them, change detection is an active research field that detects the changes in land cover from remote sensing image sequences (usually image pairs). Visual interpretation of changes relies heavily on expert knowledge, which is expensive and time-consuming. Automatic change detection is a computer vision task that takes sequential images of the same location as inputs and produces a binary map marking the changes. In this paper, we focus on deep-learning-based techniques for binary change detection, in which classification is made at the pixel level.
Traditional remote sensing image change detection methods are mostly designed with handcrafted features and focus on the identification of strong, abrupt changes [10]. Pixel-based traditional methods such as image algebra [11] and transformation [12] treat each pixel independently. Methods such as Markov random fields [13], conditional random fields [14,15], simple neighboring windows [16], and hypergraph models [17] have been used in change detection to utilize spatial-contextual information. These traditional methods are limited in their applications and are easily affected by changing seasons and atmospheric conditions.
With the increased availability of data and computational resources, deep learning methods are becoming more advantageous [18]. Deep learning neural networks, especially convolutional neural networks, have demonstrated great capability in change detection applications. As described in [19], most change detection neural networks can be grouped into single-stream and double-stream architectures.

The contributions of this paper include the following: (2) A novel attention module is proposed to facilitate the training process of the cascaded U-Nets, which increases the accuracy of the model without extra cost at inference time. (3) Extensive experiments on two change detection datasets validate the effectiveness and efficiency of the proposed method.
The rest of this paper is organized as follows. In Section 2, the overall structure of our proposed model DUNE-CD is introduced along with the proposed TEAM module and the used loss function. In Section 3, the change detection datasets used in our experiments, evaluation metrics, and the comparison methods used to evaluate our model are described. In Section 4, we show our experimental results on two datasets: LEVIR-CD and CCD. Section 5 is the discussion of the effectiveness of the cascaded stages, the improvements introduced by TEAM and other modules, and the computational costs. Lastly, the conclusion is drawn in Section 6.

Materials and Methods
Our proposed model is introduced in this section. First, the overall structure of the proposed cascaded network is described and the structure of our model is compared with a common U-Net. Then, the internal details of the modified U-Net are presented and the mechanism of TEAM is introduced. Finally, the loss function used to train our model is described.

Overall Structure of Proposed Neural Network
The overall structure of the proposed change detection model DUNE-CD is shown in Figure 1. DUNE-CD is composed of a patch embedding layer, several cascaded U-Net stages, a TEAM module to aggregate outputs of different stages, and a patch restore layer to produce the final change map prediction.  The bi-temporal input images are concatenated into one tensor with 6 channels. This introduces asymmetry between the pair of input images (the prediction of the model will be affected by the order of before and after images), but this effect can be reduced by swapping before and after images randomly during training. In the rest of this subsection, how DUNE-CD produces the change map prediction from the fused inputs is introduced.
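The input fusion and the random swap augmentation described above can be sketched as follows (a minimal sketch; the function name and swap probability are illustrative assumptions):

```python
import random
import torch

def fuse_bitemporal(img_a, img_b, training=False, p_swap=0.5):
    """Concatenate a bi-temporal pair (each 3 x H x W) into one 6-channel tensor.
    During training, the before/after order is swapped at random to reduce the
    asymmetry introduced by concatenation."""
    if training and random.random() < p_swap:
        img_a, img_b = img_b, img_a
    return torch.cat([img_a, img_b], dim=0)  # -> 6 x H x W
```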
Patch Embedding: First, the fused input image tensor is fed into the patch embedding layer. Small regions of the input tensor are grouped into single features by the patch embedding layer, which can increase the receptive field of the model. In our model, the size of these small regions is 4 × 4 with 6 channels (2 concatenated RGB images). Every 4 × 4 non-overlapping patches with 6 channels are embedded into one feature with 96 channels. Then, these embedded features are fed into the following cascaded U-Net structure. A larger patch size can reduce the size of the input feature of the subsequent cascading structure and increase the receptive field even more; however, the predicted change map of models with a larger patch size will contain more artifacts and the cascaded structures make the model less dependent on the patch embedding layer to increase the receptive field.
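The patch embedding layer above can be realized as a strided convolution; the following is a minimal PyTorch sketch under that assumption (the class name is illustrative):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Embed non-overlapping 4 x 4 patches of the 6-channel fused input into
    96-channel features, implemented as a convolution whose kernel size and
    stride both equal the patch size."""
    def __init__(self, in_ch=6, embed_dim=96, patch=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):        # x: B x 6 x H x W
        return self.proj(x)      # -> B x 96 x H/4 x W/4
```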
Cascaded U-Net Structure: The embedded features from the patch embedding layer are then fed into the following cascaded U-Net structure. As shown in Figure 1, the cascaded U-Net structure is composed of four U-Nets, and all the U-Net stages share the same structure. Each U-Net stage is composed of encoders (represented as blue rectangles in Figure 1) and decoders (represented as green rectangles). The difference between encoders and decoders lies in the vertical path: in encoders, the vertical path indicates that the features are downsampled, and in decoders, the vertical path indicates upsampling.
The commonly used U-Net can be viewed as a single encoder–decoder pair, compared with the four pairs in our proposed network. The U-Net structure is modified to cascade the different stages: in a regular U-Net, the decoder has only one output, whereas here different levels of the intermediate features of the U-Nets are passed between stages to cascade the model. To receive these multiple inputs, a 1 × 1 convolutional layer is added (shown as a purple arrow in the sub-graph of Figure 1) to combine the input and internal features of the encoders and decoders. Note that, to align with the traditional depiction of a U-Net, the sub-graph is upside down compared with the four stages above it.
Apart from the multiple inputs and outputs, another small modification is introduced in the internals of the U-Net. The channel sizes are changed during upsampling or downsampling, in contrast to the original U-Net, which modifies channel sizes in its convolution layers. As channel sizes are changed only in the downsampling layer of ConvNeXT, we follow a similar approach and change channel sizes only in the downsampling and upsampling layers. This modification makes the model slightly easier to implement: because channels are changed only in the upsampling and downsampling layers, the input and output shapes of the ConvNeXT layers stay the same, so the input can be directly added to the output as a residual connection. Two subsequent ConvNeXT layers are used at each level (the yellow arrow in the sub-graph of Figure 1), similar to the two consecutive 3 × 3 convolutional layers often used in regular U-Nets.
The complex skip connection rules between stages used by many previous works are not necessary, thanks to the built-in residual connections inside the ConvNeXT blocks and our proposed attention mechanism. These U-Net stages, which can receive multiple inputs and produce multiple outputs, are cascaded together directly. The patch embedding layer produces embeddings with a channel size of 96, which are used as the input of the cascaded structure.
TEAM Module: The outputs of each stage are fed into our proposed Training whEel Attention Module (TEAM), which can be viewed as a simple weighted sum. The novelty of TEAM is that its attention weights are shifted during the training process. To make the behavior of the TEAM module more interpretable, scalar values are used as attention weights for each input of TEAM. These attention weights are updated both by gradient descent and by our attention shift algorithm. At the beginning of the training process, this module provides shortcuts from the shallow stages of the cascaded structure. As training proceeds, these shortcuts are gradually removed. In other words, TEAM helps the network learn from easy to hard, which fits well with the cascaded structure. The details of the attention shift algorithm are introduced in the next section.
Patch restoring: Finally, the aggregated output of TEAM is fed into the patch restore layer, which restores every feature with 96 channels to a 4 × 4 patch with 2 channels (for the change and non-change classes). The restored patches form the final change map prediction, which is used in training and inference.

Details of Proposed Neural Network
In this subsection, the details of our proposed DUNE-CD are introduced from top to bottom. First, the internals of encoders and decoders are introduced, which are the building blocks of DUNE-CD. Then, the ConvNeXT layer used in our model is described. The upsampling and downsampling layers in the encoders and decoders are described. Next, the details of the patch embedding layer and patch restoring layer are given.
Encoder/Decoder: Each U-Net stage is composed of one pair of encoder and decoder. These encoders and decoders share the same structure and are composed of ConvBlocks. The difference between encoders and decoders is the path on which the data flow. In encoders, the features are downsampled and flow from large spatial resolution (low level) to low spatial resolution (high level) and channel sizes are doubled. In decoders, the features are continuously upsampled and channel sizes are halved at every layer.
The ConvBlocks in encoders and decoders can receive one input or two inputs with the same spatial resolution and produce one output feature, as shown in the sub-graph of Figure 1. ConvBlock: The ConvBlocks of encoders and decoders share the same structure, which is composed of a 1 × 1 convolution layer and dual ConvNeXT blocks. The 1 × 1 convolution is used to reduce the channel size by 1/2 when two inputs of the same shape are received, fusing the two inputs into one. However, these blocks do not always receive two inputs, so the 1 × 1 convolution is removed at the top of the encoders and the bottom of the decoders, as in the sub-graph of Figure 1. The output of a ConvBlock has the same shape as its input (or each of its two inputs).
ConvNeXT block: ConvNeXT blocks are composed of a 2D depth-wise group convolution with a kernel size of 3 × 3 followed by a LayerNorm and two pointwise convolutional layers with a GELU [51] activation function in between as illustrated in Figure 2. We replace the original 7 × 7 convolutional kernels of ConvNeXT with 3 × 3 to reduce computation. Two subsequent ConvNeXT blocks are used in ConvBlock, which is similar to the original U-Net.
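The block above can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' exact implementation; in particular, the 4× channel expansion in the point-wise layers follows the original ConvNeXT design and is an assumption here:

```python
import torch
import torch.nn as nn

class ConvNeXTBlock(nn.Module):
    """3x3 depth-wise convolution, LayerNorm over channels, and two point-wise
    (1x1) convolutions with GELU in between, plus a residual connection; the
    input and output shapes match, as required by the cascaded structure."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)       # normalizes the channel dimension
        self.pw1 = nn.Linear(dim, 4 * dim)  # point-wise conv as a Linear layer
        self.act = nn.GELU()
        self.pw2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                        # x: B x C x H x W
        r = self.dwconv(x).permute(0, 2, 3, 1)   # -> B x H x W x C
        r = self.pw2(self.act(self.pw1(self.norm(r))))
        return x + r.permute(0, 3, 1, 2)         # residual connection
```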
Upsampling layers and downsampling layers: Downsampling layers are composed of a LayerNorm [52] followed by a 2D convolutional layer. Formally,

Down(x) = Conv(C_in, C_out, stride = s, kernel = s)(LN(x)), (1)

where LN represents a layer norm and Conv is a 2D convolution with input channels C_in and output channels C_out. The stride and kernel size are both s, which is set to 2 in downsampling layers. Upsampling layers are a combination of a LayerNorm and a transposed convolutional layer:

Up(x) = LN(ConvTranspose(C_in, C_out, stride = s, kernel = s)(x)), (2)

where ConvTranspose is a transposed convolution with s set to 2, which restores the shape of the feature changed by the downsampling layer.
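A minimal PyTorch sketch of these downsampling and upsampling layers, assuming the LayerNorm is applied over the channel dimension (the helper class name is illustrative):

```python
import torch
import torch.nn as nn

class ChannelLayerNorm(nn.Module):
    """LayerNorm over the channel dimension of a B x C x H x W tensor."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

def downsample(c_in):
    """LayerNorm then a strided 2x2 convolution; the channel size doubles."""
    return nn.Sequential(ChannelLayerNorm(c_in),
                         nn.Conv2d(c_in, 2 * c_in, kernel_size=2, stride=2))

def upsample(c_in):
    """A 2x2 transposed convolution then LayerNorm; the channel size halves."""
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_in // 2, kernel_size=2, stride=2),
                         ChannelLayerNorm(c_in // 2))
```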

The convolutional layers in the downsampling layers double the channel size, and in the upsampling layers the channel size is halved by the transposed convolution layers.
Patch embedding and patch restoring layers: The patch embedding layer is the same as the downsampling layer, but its convolution layer has a kernel size of 4 and a stride of 4, as in Equation (1) with s equal to 4. The patch restoring layer is the same as the upsampling layers, but the stride and kernel size are 4, as in Equation (2) with s equal to 4. The patch embedding layer produces features with a channel size of 96, and the channel size of the patch restore layer's output is 2, corresponding to the change and non-change classes.

Training whEel Attention Module (TEAM)
In this subsection, the Training whEel Attention Module (TEAM) is proposed. According to [53], the attention process can be formulated as Attention = f(g(x), x), where g(x) generates the attention (the region to be attended to) and f(g(x), x) is a process that combines x with the attention g(x). However, similar to [21], the g(x) of our attention module is a learned weight that does not change with x but is guided by prior knowledge during the training process. Besides back-propagation, these weights are changed by our weight-shifting policy.
The policy is based on the prior knowledge that, at the beginning of training, outputs from shallow stages are easier to learn, but as training proceeds, the model's attention should be shifted to deeper stages to make more use of the cascaded structure. This follows from the cascading nature of our network: each U-Net's output depends on the input from the previous U-Net.
To be more precise, the g(x) of the proposed module is a function of the number of training iterations, which causes the attention to shift during the training process; perhaps this kind of training-time attention can be counted as a new paradigm of attention mechanisms. TEAM can be viewed as a simple weighted sum:

y = Σ_{i=1}^{m} attr_i · x_i,

where m is the number of stages, attr_i is the i-th weight, and x_i is the output of the i-th stage. The prior knowledge added to our model is that the desired distribution of these weights should satisfy attr_i > attr_{i−1} for all i > 1, which ensures that deeper stages receive more attention. The simplest implementation, reducing attention on the shallow stages by adding an L2 norm of the weights as a regularization term with larger coefficients on the shallow weights, does not work well: because these weights are trainable parameters, lowering the shallow weights through regularization also reduces the weights of the deeper parts of the network, and the distribution of the weights does not change as desired. The attention should instead be shifted to the deeper stages of the network. To obtain this behavior, we directly modify these weights on each forward pass during the training phase.
These weights are updated with the following algorithm (Algorithm 1) at every forward pass during the training phase.
All the weights except the last one (the weight corresponding to the deepest output) are reduced, and the reduced amounts are added to the last weight. α_i is the weight shift coefficient; α_0 is set to 1 and the coefficients are linearly decreased to 0. The coefficients are multiplied by ξ (the initial learning rate); when the learning rate is reduced, ξ becomes comparatively larger, so the weight adjustment by TEAM takes over and the shortcuts from the shallow U-Nets to the output are gradually removed. The impact of the proposed algorithm can be tuned by the hyperparameter λ.
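A minimal Python sketch of this weight-shift step under the description above; the exact per-weight step size (here λ · α_i · ξ) is an assumption, and the function name is illustrative:

```python
def team_shift(attr, alpha, xi, lam):
    """One TEAM weight-shift step.
    attr:  list of m scalar attention weights (attr[-1] is the deepest stage)
    alpha: per-weight shift coefficients, alpha[0] = 1, linearly decreased to 0
    xi:    initial learning rate
    lam:   the hyper-parameter lambda
    """
    moved = 0.0
    for i in range(len(attr) - 1):        # every weight except the last
        step = lam * alpha[i] * xi        # assumed form of the reduction
        attr[i] -= step
        moved += step
    attr[-1] += moved                     # shift attention to the deepest stage
    return attr
```

Note that the total weight is conserved: whatever is removed from the shallow stages is added to the deepest stage, so the shortcuts fade without rescaling the overall output.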
Each attr i is a scalar value and in this way the distribution of attr i on different stages can be easily observed. These attr i can be replaced with vectors with a size equal to the number of channels of each output stage and the update rules can work the same.
The TEAM only shifts attention during training, which means that TEAM does not introduce any computational cost at the inference phase but the impacts on the cascaded structure remain. With carefully chosen λ, which we set to 0.05 on both datasets, TEAM can improve the accuracy of the model. The λ is an important hyper-parameter and will impact the training result. A large λ will reduce TEAM to just use the output of the last stage and a small λ will reduce TEAM to a simple weighted sum. Therefore, with a good selection of λ, TEAM can harness the power of both weighted sum and the deep cascaded structure.

Loss Function
The negative log-likelihood loss is used in our framework; it is commonly used in classification tasks, and the change detection problem classifies pixels into change and non-change classes. The loss function is defined as:

L = − (1 / (B · W · H)) Σ_{b=1}^{B} Σ_{i=1}^{W} Σ_{j=1}^{H} y_GT(b, i, j) · log σ(y_o(b, i, j)),

where σ(·) is a softmax function, B is the batch size, W is the width of the change map, and H is the height; the loss is the mean of the product of the ground-truth vector y_GT and the log-softmaxed output vector y_o.
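A minimal NumPy sketch of this loss, assuming one-hot ground-truth maps and channel-last tensors of shape (B, W, H, 2); the function name is illustrative:

```python
import numpy as np

def nll_change_loss(y_o, y_gt):
    """Negative log-likelihood over softmaxed outputs, averaged over all
    pixels of the batch.  y_o: raw outputs (B, W, H, 2); y_gt: one-hot
    ground truth of the same shape."""
    z = y_o - y_o.max(axis=-1, keepdims=True)          # numerically stable softmax
    soft = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    b, w, h = y_o.shape[:3]
    return -(y_gt * np.log(soft + 1e-12)).sum() / (b * w * h)
```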

Experiments
This section includes details of our experiments. First, the datasets and evaluation metrics are described. Then, several current state-of-the-art (SOTA) comparison methods are introduced. Last, the detailed experimental settings including our hyper-parameter settings are introduced.

Datasets and Evaluation Metrics
The experiments were conducted on two different change detection datasets, namely, LEVIR-CD [28] and CCD [54], to evaluate our model. These datasets are commonly used in recent change detection literature and are publicly available.
The LEVIR-CD dataset contains 637 pairs of remote sensing bitemporal images from Google Earth with corresponding binary building-related change labels; the ground sample distance (GSD) of these images is 0.5 m/pixel. The default split (training/validation/test) provided by the dataset is used, and the images in the validation and test sets are cropped into small 256 × 256 non-overlapping patches. Images in the training set are cropped to the same size with overlapping, using a stride of 128. After cropping, the training set contains 21,805 pairs of image patches, the validation set contains 1024 pairs, and the test set contains 2048 pairs.
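The cropping scheme above can be sketched as follows; the 1024 × 1024 native tile size of LEVIR-CD is an assumption here, but it is consistent with the patch counts reported above (e.g., 21,805 / 49 patches per image = 445 training images):

```python
def crop_positions(size, patch, stride):
    """Top-left (y, x) coordinates of all patch x patch crops of a square
    image of the given size, taken with the given stride."""
    return [(y, x)
            for y in range(0, size - patch + 1, stride)
            for x in range(0, size - patch + 1, stride)]
```

With stride equal to the patch size the crops are non-overlapping (16 per 1024-pixel tile); with stride 128 they overlap, yielding 49 per tile.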
The CCD dataset contains 11 pairs of remote sensing images with GSD in the range from 0.03 m/pixel to 1 m/pixel of which 4 pairs are of size 1900 × 1000 and 7 pairs are of size 4725 × 2200. The original datasets already contain cropped images of size 256 × 256 and are divided into training/validation/test subsets. We use the original split provided in the dataset to train, validate, and test our model, which contains 10,000 non-overlapping image pairs with binary change maps in the training set, 3000 samples in the testing set, and 3000 in the validation set.

Evaluation Metrics
In all our experiments, the following metrics were used to evaluate the experimental results for comparison:

Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F1 = 2 · Precision · Recall / (Precision + Recall),
IoU = TP / (TP + FP + FN),

where TP, TN, FP, and FN stand for the numbers of true positives, true negatives, false positives, and false negatives, respectively. Since the F1-score provides a single score that considers both precision and recall, it is used as the main evaluation index. In addition, the number of multiply–accumulate (MAC) operations and the average inference time to process one image pair are used to evaluate the different methods.
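These standard definitions can be computed directly from the confusion counts; a minimal sketch:

```python
def cd_metrics(tp, tn, fp, fn):
    """Pixel-wise precision, recall, F1-score, and IoU from confusion counts.
    (tn is unused by these four metrics but kept for a complete confusion matrix.)"""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou
```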

Comparison Methods
To evaluate our proposed deep learning change detection model, state-of-the-art deep learning change detection models are chosen for comparison. Both types of change detection networks are included: CDNet [20], UNet++ MSOF [22], RDPNet [21], DeepLab V3 [55], and DeepLab V3+ [55] are single-stream networks, while FC-Siam-Diff, FC-Siam-Conc [23], IFNet [24], DSAMNet [33], and BIT [26] are double-stream methods. These models are described as follows: FC-Siam-Diff and FC-Siam-Conc are two fully convolutional Siamese networks. These networks have two encoders with shared weights to extract features from the bi-temporal inputs. The difference between the two networks is how the outputs from the two branches are fused. The kernel size of the convolutional layers of the two networks is 3 × 3.
CDNet is composed of contraction and expansion blocks. The contraction blocks consist of a convolution, batch normalization, ReLU, and max pooling, and the expansion blocks consist of un-pooling, convolution, batch normalization, and ReLU. Contraction and expansion layers are composed of 7 × 7 convolutional layers with 64 fixed channels.
DSAMNet is an attention metric-based network with a feature extractor and a change decision module. ResNet-18 is used as the feature extractor, which generates bi-temporal features; the distance between these features is used by the change decision module to compute the change map. The model was trained with two loss functions: batch contrastive loss (BCL) and dice loss.
IFNet extracts features via a shared deep feature extraction network (VGG16). The extracted features are then fed into the deeply supervised difference discrimination network (DDN) to produce the change prediction, in which multi-level deep features of the raw images are fused with image difference features by attention modules.
DeepLab V3 is a semantic segmentation architecture. At the end of the encoder, it uses an atrous spatial pyramid pooling operation to handle the problem of segmenting objects at multiple scales. ResNet-50 is used as the backbone in our experiments.
DeepLab V3+ is improved upon DeepLabv3 by adding a decoder module to further refine the segmentation results. In our experiments, the ResNet-50 is used as the backbone.
UNet++ MSOF used nested and dense skip connections to fuse features from different levels with a multiple side-outputs fusion (MSOF) strategy to produce highly accurate results. The original configuration of the network in [22] is used in our experiments.
BIT tokenizes the bitemporal image inputs and feeds these tokens into the transformer encoder and decoder to produce the change map which is the absolute value of the difference between the outputted features. The original setting of BIT [26] is used in which ResNet-18 is used as the backbone.
RDP-Net concatenates the bitemporal image at the beginning of the network. The concatenated input is divided into patches and fed into several ConvMixer layers. Then, the output patches are merged into output features. For fair comparison, we did not include the efficient training strategy and edge loss in our experiments.

Experimental Details
The proposed model is implemented with the PyTorch framework. All methods in our paper are trained and evaluated on a single NVIDIA GeForce RTX 3060 GPU. Without loss of generality, the Adam [56] optimizer is used without weight decay, with an initial learning rate of 4 × 10^−4 for training on the CCD dataset and 0.002 for LEVIR-CD. For training on CCD, we initially used 200 epochs, as is common in [21,26]. Through experiments, we found that a longer training routine led to higher performance; thus, we chose 280 epochs as a compromise between training time and model performance. Due to the overlapping cropping of the LEVIR-CD training set (which produces more data samples), we use 80 epochs. The learning rate is multiplied by 0.1 after 60 epochs when training on LEVIR-CD and after 210 epochs when training on CCD. Simple data augmentation is used, including random rotation of 0°, 90°, 180°, or 270° with a random flip. Due to memory constraints, the batch size is set to 8 and the size of the input images is 256 × 256 for both datasets. For a fair comparison, all the models are trained with the same batch size.
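The LEVIR-CD optimizer and learning-rate schedule above can be sketched with PyTorch's built-in MultiStepLR; the single stand-in parameter below replaces the real model, which is an assumption of this sketch:

```python
import torch

# Adam with initial learning rate 2e-3, no weight decay, and a 10x decay
# after epoch 60, over an 80-epoch run (LEVIR-CD settings described above).
params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for model.parameters()
optimizer = torch.optim.Adam(params, lr=2e-3, weight_decay=0)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60], gamma=0.1)

for epoch in range(80):
    optimizer.step()      # one epoch of training would go here
    scheduler.step()      # decays the learning rate once epoch 60 is passed
```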

Experimental Results
In this section, the experimental results on two change detection datasets, LEVIR-CD and CCD, are shown in detail. The results of our model are compared with state-of-the-art change detection models to verify the effectiveness of our proposed model.
The results show that our model obtains an F1-score of 90.52% on the LEVIR-CD dataset, which is higher than all comparison methods. The precision of BIT is 92.66%, which is higher than that of our proposed model. The recall of DSAMNet on LEVIR-CD is 96.41%, which is higher than all other methods including our proposed method; this is because DSAMNet is prone to predicting pixels as changed, which drops its precision to 70.61%. In change detection, the F1-score and IoU are better metrics for comparing different models because they account for both precision and recall. In the F1-score and IoU, our proposed model outperforms all comparison methods. The F1-score of our model is 90.52%, which is higher than that of the second-best model, BIT, at 90.28%. Figures 3-6 show some of the predicted change maps from DUNE-CD and compare them with FC-Siam-Diff, FC-Siam-Conc, CDNet, DSAMNet, IFNet, DeepLab V3, DeepLab V3+, UNet++ MSOF, BIT, and RDPNet for qualitative analysis of our algorithm. The red areas on these change maps represent false positives and the blue areas are false negatives. The figures show that change maps predicted by DUNE-CD have fewer false positives and false negatives, and our method can detect both small buildings (Figures 3n and 6n) and large buildings (Figure 4n) well. Most mislabeling of pixels occurs around the edges of buildings, especially at corners; this is partially due to noise in the training labels. There are more false negatives in the results from BIT than from DUNE-CD (Figures 3l,n, 4l,n, 5l,n and 6l,n), which explains why BIT has a higher precision score but a lower F1-score than our proposed method.

Experimental Results on CCD
In Table 2, the F1-score of BIT is higher than that of RDP-Net on LEVIR-CD but lower on CCD. Our model achieves the highest F1-score on both datasets. The improvement in IoU, which is 1.53% higher than RDPNet, is more obvious than that in the F1-score.

To qualitatively analyze the results on CCD, some of the predicted results are shown in Figures 7-10. The comparison shows that BIT and RDPNet can produce good detection results on the CCD dataset, but our proposed model can further improve the results. Larger changed areas can be captured by almost all models (Figures 7 and 8). The differences are mainly on edges (Figures 7 and 8), thin roads (Figure 9), and tiny vehicles (Figure 10). Detection results of the proposed method (Figures 7n, 8n, 9n and 10n) contain fewer false positives and false negatives.

Discussion
Though the experiments show that our DUNE-CD outperforms the comparison methods, some questions remain about the cascaded structure and the effectiveness of TEAM. Each subsection of this section addresses one of them:
Q1. DUNE-CD with four U-Net stages was tested in the comparison with the SOTA methods. How does the number of cascaded stages affect the performance of DUNE-CD, and what is the optimal number of stages?
Q2. How much of the improvement of DUNE-CD is contributed by TEAM? Are the patch embedding layer, ConvNeXT block, and the two consecutive blocks really as effective as claimed?
Q3. If a larger number of cascaded stages benefits DUNE-CD, can we further reduce the computational cost without losing the advantages of cascaded stages?
Section 5.1 answers Q1 by examining how the number of cascaded stages matters, how much of the model accuracy is attributable to the cascaded structure, and how the computational efficiency of DUNE-CD compares. Section 5.2 discusses the effectiveness of the TEAM module (Q2) on the cascaded structure and compares TEAM with alternative ways of producing the output; ablation studies on the different modules of the proposed method follow. The last subsection answers Q3 by cascading stages with a different number of layers at each stage, which reduces the computational cost.

Effectiveness of Cascaded Stages
This section discusses the effect of cascaded structures on the CCD dataset. The experimental results in Table 3 show the performance with different numbers of cascaded stages; DUNE-n in Table 3 denotes DUNE-CD with n cascaded U-Net stages, e.g., DUNE-1 is a single modified U-Net without cascading. The performance of DUNE-CD with different numbers of stages can be quantitatively analyzed from Table 3. The cascaded structure improves performance over DUNE-1, the simple modified U-Net: even DUNE-2 with two stages is clearly better, with a 0.51% higher F1-score than DUNE-1. Though the performance gap between DUNE-2 and DUNE-3 is small, our proposed method with four stages achieves the best performance, with a 0.15% higher F1-score than DUNE-3. The computational time increases linearly with the number of cascaded U-Nets.
The experiments provide sufficient evidence that the cascaded structure improves the performance of our model. As Table 3 suggests, the performance gain of cascaded stages may continue beyond four stages; however, DUNE-CD with four stages already achieves good results. As a compromise between detection accuracy and computational cost, four cascaded U-Nets are used in the proposed method.
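The cascaded control flow described above can be sketched abstractly: stages run sequentially, each consuming the previous stage's refined output, and every intermediate output is retained so an aggregation module such as TEAM can weight them later. This is a structural toy (our illustration, not the paper's network; the "stages" here are placeholder functions, not U-Nets):

```python
# Structural sketch of a cascade: stage i refines the output of stage i-1,
# and all per-stage outputs are kept for later aggregation. The toy stage
# below moves an estimate halfway toward a target, illustrating the
# diminishing per-stage gains observed between DUNE-2/3/4.

from typing import Callable, List

def run_cascade(stages: List[Callable[[float], float]], x: float) -> List[float]:
    outputs = []
    for stage in stages:
        x = stage(x)        # sequential: each stage waits for the previous one
        outputs.append(x)   # keep every stage's output for aggregation
    return outputs

toy_stage = lambda v: v + 0.5 * (1.0 - v)   # halve the remaining error
outs = run_cascade([toy_stage] * 4, 0.0)
print(outs)  # [0.5, 0.75, 0.875, 0.9375] -- gains shrink per added stage
```

The sequential dependency is also why inference time grows linearly with the number of stages, as noted in Table 3.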

Ablation Study
This subsection presents ablation studies on the effect of TEAM on our DUNE-CD change detection model. In the ablation experiments, the row "DUNE-CD w/o TEAM" in Table 4 replaces TEAM with a simple weighted sum, and in "DUNE-CD w/o weighted sum" the output of the last U-Net is used directly as the output, without any weighted sum. As shown in Table 4 below, without TEAM the proposed model achieves an F1-score of only 97.03%. Though this is already higher than the comparison methods, with the help of TEAM the F1-score rises to 97.50%. With the weighted sum removed, the model only achieves an F1-score of 96.75%, the worst of the three. Furthermore, the number of GMACs and the inference time of all three methods are almost the same, which indicates that the cost of the TEAM mechanism at inference time is negligible. From Figure 11, it can be observed that with the help of TEAM, more attention is paid to the last stage while attention on the first three stages drops gradually. Furthermore, at the later stage of training, the curves become smoother due to learning rate scheduling, which magnifies the impact of TEAM; thus, the attention weight of the first stage drops as shown in Figure 11b.
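To make the "training wheel" idea concrete, the aggregation can be sketched as a weighted sum over stage outputs whose weights shift toward the deepest stage as training progresses. This is our reading of TEAM, not the paper's code: the paper's weights are learned during training, whereas the linear schedule below is an assumption for illustration only.

```python
# Hedged sketch of TEAM-style aggregation (illustrative; the actual
# attention weights in the paper are learned, not scheduled linearly).

def team_weights(step: int, total_steps: int, n_stages: int = 4) -> list:
    """Shift attention from shallow stages toward the last stage over training."""
    t = step / total_steps                       # training progress in [0, 1]
    last = (1 + (n_stages - 1) * t) / n_stages   # deepest stage's weight grows
    rest = (1 - last) / (n_stages - 1)           # remainder shared equally
    return [rest] * (n_stages - 1) + [last]

def aggregate(stage_outputs: list, weights: list) -> float:
    # Weighted sum of the per-stage change-map outputs.
    return sum(w * o for w, o in zip(weights, stage_outputs))

print(team_weights(0, 100))    # [0.25, 0.25, 0.25, 0.25] -- uniform at start
print(team_weights(100, 100))  # [0.0, 0.0, 0.0, 1.0] -- last stage only
```

At the end of training all attention rests on the final stage, so at inference the aggregation adds essentially no cost, consistent with the near-identical GMACs in Table 4.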

The results of the ablation studies on the patch-embedding layer (PE), ConvNeXT blocks (CX), and the two consecutive blocks (2C) are shown in Table 5 below. All these methods remove the TEAM module and experiment with a single U-Net. "Proposed 1" is a U-Net with ConvNeXT blocks; the ConvNeXT blocks increase the F1-score significantly compared with the baseline method, but the computation is very slow. With patch embedding, "Proposed 2" is four times faster than "Proposed 1" with only a slight drop in detection accuracy. The spatial size of the features is reduced via patch embedding; thus, the increase in inference time from the two consecutive blocks is trivial. Furthermore, our proposed method with four cascaded U-Nets achieves higher scores with half the computational time compared with "Proposed 1".
Table 5. Ablation studies of the patch-embedding layer (PE), ConvNeXT blocks (CX), and two consecutive blocks (2C) on the CCD dataset. The best scores are in bold font, and a module is enabled when checked with "✓".

More Efficient Cascaded Stages
The main disadvantage of the cascading structure is its time and memory complexity. Each stage must be executed after all previous stages finish, and the final output is produced along an up-and-down path across different stages. The memory usage and inference time grow linearly with the number of cascaded stages, but the performance gain does not. In real-world change detection applications, data are often processed offline, so time complexity is not critical in these applications; the computational cost of the cascaded U-Nets is reasonable.
To partially overcome this difficulty, this subsection discusses a method that reduces the computational cost while still harnessing the power of cascaded structures. In our proposed method, we cascade U-Nets with the same number of levels. Further experiments indicate that if the upper-right part of the proposed model is removed, the model can nearly halve the computational cost with only a slight loss in accuracy. The reduced cascaded U-Net structure is shown in Figure 12, and its experimental results are shown in Tables 6 and 7. With a slight loss in accuracy, the model can be reduced to almost half the size. Table 6 shows that the reduced model achieves an F1-score of 97.19% on the CCD dataset and also achieves decent performance on the LEVIR-CD dataset. This reduced model's predictions are more accurate than those of the proposed method with three cascaded U-Nets, with fewer GMACs.
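The arithmetic behind the saving is straightforward: because the stages run sequentially, the total cost of a cascade is simply the sum of the per-stage costs, so truncating the upper decoder levels of the later stages directly cuts the total. The unit costs below are made-up placeholders for illustration, not measured GMACs:

```python
# Back-of-the-envelope sketch (unit stage costs assumed, not measured):
# a sequential cascade's total cost is the sum of its stage costs.

def cascade_cost(stage_costs: list) -> float:
    # Stages execute one after another, so their costs simply add up.
    return sum(stage_costs)

full = cascade_cost([1.0, 1.0, 1.0, 1.0])      # four full U-Net stages
reduced = cascade_cost([1.0, 0.5, 0.5, 0.5])   # later stages truncated
print(full, reduced, reduced / full)  # 4.0 2.5 0.625
```

With these toy numbers the reduced cascade costs roughly 60% of the full one; the actual savings depend on which decoder levels are removed, and the paper reports close to half.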

Conclusions
In this paper, an effective change detection model with four cascaded U-Nets is proposed. The TEAM module is added to aggregate the results from the four cascaded stages. The proposed model, DUNE-CD, achieves good results on two different change detection datasets. In the discussion, the effectiveness of the cascaded structure is studied by comparing models with different numbers of cascaded stages, and ablation experiments on the different modules of the proposed method are performed; these experiments confirm the validity of our proposed method. Furthermore, a scheme to reduce the computation without losing the advantages of the cascaded structure is introduced. The scheme shows that our method offers a better tradeoff between efficiency and accuracy than the ten comparison methods.