A Multi-Temporal Network for Improving Semantic Segmentation of Large-Scale Landsat Imagery

With the development of deep learning, semantic segmentation has gradually become the mainstream technical method in large-scale multi-temporal landcover classification. Large scale and multiple temporal phases are two significant characteristics of Landsat imagery. However, mainstream single-temporal semantic segmentation networks lack the constraints and assistance of pre-temporal information, resulting in unstable results, poor generalization ability, and inconsistency with the actual situation in multi-temporal classification results. In this paper, we propose a multi-temporal network that introduces pre-temporal information as prior constrained auxiliary knowledge. We propose an element-wise weighting block module to improve the fine-grainedness of feature optimization. We propose a chained deduced classification strategy to improve the stability and generalization ability of multi-temporal classification. We build and label a large-scale multi-temporal Landsat landcover classification dataset, on which our method achieves an overall classification accuracy of over 90%. Extensive experiments show that, compared with mainstream semantic segmentation methods, our proposed multi-temporal network achieves state-of-the-art performance with good robustness and generalization ability.


Introduction
In recent years, with the rapid development of remote sensing information extraction technology, we can exploit rich information from remotely sensed big data [1]. Volume, variety, velocity, and veracity are the 4V characteristics of remotely sensed big data [2,3]. Therefore, remotely sensed big data is applied in many practical scenarios. In remote sensing, image classification is an important research direction for the intelligent extraction of information. Typical applications are land-use [4][5][6], landcover [7][8][9], cultivated land extraction [10][11][12], woodland extraction [13][14][15], waterbody extraction [16][17][18], residential area extraction [19][20][21], etc. Image classification based on traditional methods requires the manual design of corresponding feature extractors according to the characteristics of different objects, with the help of expert prior knowledge [22]. For example, indices [23][24][25] such as NDVI, NDWI, and NDBI, or textures such as edges and shapes, are used to extract ground objects [26]. However, traditional methods require manual fine-tuning of parameters when facing complex application scenes. When the data distribution changes, the artificially designed feature extractors are likely to fail and generalize poorly. With the development of deep learning, those problems have been alleviated. In computer vision, semantic segmentation methods use deep convolutional neural networks (DCNNs) to exploit the deep features of different objects [27]. DCNNs avoid the limitations of manually designed feature extractors.

The spatial resolution of Landsat images is relatively low, so it is necessary to exploit effective classification information from limited spatial contexts. We can introduce an attention mechanism so that the network can pay attention to relatively important features and suppress useless features.
SENet [42] uses the channel attention mechanism to enhance the effective channels in high-dimensional features and suppress the invalid channels to optimize the features. CBAM [43] uses a spatial attention mechanism to filter and optimize the valuable features and suppress useless features, thereby improving the positional accuracy of target objects. The features of different ground objects are prominently reflected on different channels and spatial positions of the feature map. Therefore, using simple channel attention and spatial attention mechanisms will indiscriminately suppress an entire channel or some spatial positions of the feature map, so that the effective features in the feature cube cannot be accurately optimized. As shown in Figure 2, the color represents the enhanced feature, and the black represents the suppressed feature. DANet [44] uses both the channel attention and spatial attention mechanisms in a single network. However, it only uses two branches to independently enhance the effective features in the channel dimension and spatial dimension, and cannot achieve high-fine-grained feature selection. We propose an element-wise weighting block module, which can perform element-level enhancement or suppression on the features in the feature cube to improve the effect of feature extraction.
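The granularity difference can be illustrated with a small numpy sketch (a toy illustration with random features, not the paper's implementation): a channel-attention weight is one scalar per channel, so an entire channel is enhanced or suppressed everywhere, while an element-wise weight has the same shape as the full feature cube.

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.standard_normal((64, 16, 16))  # one feature cube: C x H x W

# SE-style channel attention: one scalar weight per channel, so a
# suppressed channel is suppressed at every spatial position.
channel_w = 1 / (1 + np.exp(-feat.mean(axis=(1, 2))))  # shape (64,)
se_out = feat * channel_w[:, None, None]

# Element-wise weighting: an independent weight for every element of
# the cube, so suppression can vary per channel AND per position.
elem_w = 1 / (1 + np.exp(-feat))                       # shape (64, 16, 16)
ew_out = feat * elem_w

assert channel_w.shape == (64,)
assert elem_w.shape == feat.shape
```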
For a multi-temporal semantic segmentation task with time-lapse order, the model obtained by training on samples from two adjacent phases is more in line with the changes between adjacent phases. When two-phase images with a larger time span need to be classified, directly using the adjacent-phase model will lead to a decrease in accuracy due to the larger changes in the ground objects, while training on samples from all possible phase pairs would greatly increase the amount of computation. We propose a chained deduced classification strategy, which can achieve multi-temporal semantic segmentation with time-lapse order, achieve high classification accuracy, and remain consistent with the real changes of ground objects. In this paper, we integrate the multi-temporal network architecture with the element-wise weighting block module and use the chained deduced classification strategy for training and inference to solve the above problems. In summary, the main contributions of this paper are as follows:

• We propose a multi-temporal network (MTNet) architecture that introduces the label information of the previous phase at the network's input to provide prior constraint knowledge for the feature extraction of the latter phase. It avoids pseudo changes caused by differences in the color distribution of multi-temporal images and improves the performance of semantic segmentation.

• We propose an element-wise weighting block (EWB) module to perform high-fine-grained feature enhancement and suppression on feature cubes, addressing the limited fine-grainedness of simple channel attention and spatial attention mechanisms.

• We propose a chained deduced classification strategy (CDCS) to improve the overall accuracy of multi-temporal semantic segmentation tasks and ensure that the multi-temporal classification results are consistent with the real changes of ground objects.

• To validate our proposed method, we build a large-scale multi-temporal Landsat dataset.
Extensive experiments demonstrate that our method achieves state-of-the-art performance on Landsat images. The classification accuracy of our proposed method is much higher than other mainstream semantic segmentation networks.

Multi-Temporal Network
UNet is a relatively simple and commonly used encoder-decoder network, and we use it as the baseline network in this work because of its low GPU memory overhead and fast running speed. We design a dual-branch encoder network that introduces the previous-phase images and the corresponding ground-truth labels to assist the segmentation of the latter-phase images. Each encoder branch extracts image features in a different time phase. We take the ground-truth label of the previous phase as known prior knowledge and input it into the network together with the image of the previous phase to assist network learning. This makes the network focus more on the features of the changing areas and reduces the difficulty of feature extraction. The network evolves from fully learning all feature information to focusing on learning the feature information of changing regions. When a certain position does not change, the category from the prior knowledge of the previous phase is directly carried into the classification result; when a certain position changes, the feature information of the ground object categories before and after the change is learned by the network. Feature learning is thus optimized from relatively dense learning to relatively sparse learning, significantly improving the network's feature extraction ability.
It can avoid the pseudo-change phenomenon caused by the difference in color distribution of images in different phases.
We name the input data of the two phases the reference phase data and the unknown phase data, respectively. Suppose the number of bands of the image is N. The reference phase data is formed by stacking the image and the corresponding label, so the number of input channels is N + 1. We call this branch the reference branch. The unknown phase data is only image data, so the number of input channels is N. We call this branch the unknown branch. Both the reference branch and the unknown branch use the ResNet-50 [45] encoder for feature extraction. Since the numbers of input channels of the two branches are different, the encoder weights are independent of each other and are not shared. The features of each stage of the two encoders are fused by the EWB module. The fused features are transferred to the decoder for low- and high-level feature fusion. The decoder is a standard UNet decoder. We call this network architecture the dual temporal network (DTNet). The overall architecture of the DTNet is shown in Figure 3. The DTNet can be denoted as follows:

$$ y = f_{dec}\big(f_{EWB}\big(f_{ref}(x_1^{N+1}),\ f_{unk}(x_2^{N})\big)\big) $$

where $x_1^{N+1}$ represents the (N + 1)-band image-and-label input in the first phase, $x_2^{N}$ represents the N-band image input in the second phase, $f_{ref}$ represents the reference branch encoder, $f_{unk}$ represents the unknown branch encoder, $f_{EWB}$ represents the EWB module, and $f_{dec}$ represents the decoder.
The number of input channels of the two branches is N + 1 and N, respectively. Therefore, the network is an asymmetric two-branch network. In the training phase, the labels corresponding to the unknown phase data are used to calculate the loss value at the end of the network and guide the network backpropagation and gradient update. In the inference stage, we need to input the images and labels of the previous phase and the images of the latter phase to obtain the latter phase's segmentation results.
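The asymmetric channel bookkeeping of the two branches can be sketched as follows. This is a minimal numpy illustration: the encoders and decoder are stand-in 1 × 1 linear maps (not ResNet-50 or the UNet decoder), the EWB fusion is replaced by plain concatenation, and all sizes are illustrative assumptions.

```python
import numpy as np

N, H, W, CLASSES = 6, 32, 32, 8   # 6 Landsat bands; class count assumed
rng = np.random.default_rng(1)

def encoder(channels_in, channels_out):
    """Stand-in for a network branch: a 1x1 linear map over channels."""
    w = rng.standard_normal((channels_out, channels_in)) * 0.1
    return lambda x: np.einsum('oc,chw->ohw', w, x)

# Reference branch sees image + label plane (N+1 channels); unknown
# branch sees the image only (N channels) -> separate weights.
f_ref = encoder(N + 1, 64)
f_unk = encoder(N, 64)
f_dec = encoder(128, CLASSES)      # decoder stand-in on fused features

img_t1 = rng.standard_normal((N, H, W))
lbl_t1 = rng.integers(0, CLASSES, (1, H, W)).astype(float)
img_t2 = rng.standard_normal((N, H, W))

x_ref = np.concatenate([img_t1, lbl_t1], axis=0)   # N+1 channels
fused = np.concatenate([f_ref(x_ref), f_unk(img_t2)], axis=0)
logits = f_dec(fused)
assert logits.shape == (CLASSES, H, W)
```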
Based on DTNet, we can introduce more previous-phase images and the corresponding ground-truth labels to obtain richer temporal information and more stable ground object change information. The architecture is similar to DTNet, but in the encoder part we add a new reference branch to extract features from the newly added reference phase data. The newly added reference branch is similar to the reference branch of DTNet: the image of this phase and the corresponding ground-truth label are stacked as input data, and ResNet-50 is also used as the encoder to extract features. The number of input channels of both reference branches is N + 1. To reduce the network's number of parameters and training difficulty, we share the weights of the ResNet-50 encoders of the two reference branches. This means the same encoder extracts the feature information from both reference-phase inputs. After sharing the weights, the positions of the features extracted by the two reference branches in the feature cube are also aligned, so that the network can compare the differences between the two-phase features and learn the change feature information of the ground objects. During backpropagation, the two reference branches jointly update the weights of the same encoder. Since the number of input channels of the unknown branch differs from that of the reference branches, the unknown branch still uses an independent encoder and does not share weights. Like DTNet, the features of each stage of the three branches are fused by the EWB module. The decoder is a standard UNet decoder. We call this network architecture the triple temporal network (TTNet). The overall architecture of the TTNet is shown in Figure 4.
The TTNet can be denoted as follows:

$$ y = f_{dec}\big(f_{EWB}\big(f_{ref}(x_1^{N+1}),\ f_{ref}(x_2^{N+1}),\ f_{unk}(x_3^{N})\big)\big) $$

where $x_1^{N+1}$ and $x_2^{N+1}$ represent the (N + 1)-band image-and-label inputs in the first and second phases, $x_3^{N}$ represents the N-band image input in the third phase, $f_{ref}$ represents the weight-sharing reference branch encoder, $f_{unk}$ represents the independent unknown branch encoder, $f_{EWB}$ represents the EWB module, and $f_{dec}$ represents the decoder.
The number of input channels of the three branches is N + 1, N + 1, and N, respectively. Two N + 1 branches are weight-sharing branches. Therefore, the network is an asymmetric weight-sharing three-branch network. In the training phase, the labels corresponding to the unknown phase data are used to calculate the loss value at the end of the network and guide the network backpropagation and gradient update. In the inference stage, we need to input the images and labels of the previous two phases and the images of the latter phase to obtain the latter phase's segmentation results.
In theory, when there are more phases of input data, more reference branches can be stacked to build a multi-temporal network (MTNet). The reference branches in the MTNet use a weight-shared encoder, the unknown branch uses an independent encoder, and the multi-branch feature fusion uses the EWB module. MTNet is a plug-and-play network framework: in addition to UNet, various mainstream semantic segmentation networks can be transformed into MTNet architectures. The more reference branches there are, the richer the temporal information of the ground objects, the easier it is for the network to exploit their feature information, and the more accurately the last unknown phase can be classified.
The MTNet can be denoted as follows:

$$ y = f_{dec}\big(f_{EWB}\big(f_{ref}(x_1^{N+1}),\ f_{ref}(x_2^{N+1}),\ \cdots,\ f_{ref}(x_{m-1}^{N+1}),\ f_{unk}(x_m^{N})\big)\big) $$

where $x_1^{N+1}, x_2^{N+1}, \cdots, x_{m-1}^{N+1}$ represent the (N + 1)-band image-and-label inputs in the first to the (m − 1)th phases, respectively, $x_m^{N}$ represents the N-band image input in the mth phase, $f_{ref}$ represents the weight-sharing reference branch encoder, $f_{unk}$ represents the independent unknown branch encoder, $f_{EWB}$ represents the EWB module, and $f_{dec}$ represents the decoder.

Element-Wise Weighting Block
We build a weight feature cube the same size as the original feature cube. Attention learning is performed for each spatial position and each channel, enhancing or suppressing the features of different channels at different spatial positions more flexibly. Compared with simple channel attention and spatial attention, our proposed element-wise weighting block (EWB) module can improve the fine-grainedness of attention weights. As shown in Figure 5, the color represents the enhanced feature, and the black represents the suppressed feature. According to the response of different categories of features at different spatial positions and channels, the network suppresses redundant useless information and enhances effective information, thereby improving the efficiency and accuracy of network learning. For multitemporal branch feature fusion, the EWB module can exploit the effective information of different temporal features with high fine-grainedness and learn the change feature information by itself.
As shown in Figure 6, we first concatenate the features of multiple encoder branches to obtain the combined features. Then we use a 1 × 1 convolution operation to obtain initial feature weights and the rectified linear unit (ReLU) function to deactivate some features. Next, we use a 1 × 1 convolution operation again and then use the sigmoid function to transform the feature weights to obtain the final feature weights. The feature weights are applied to the combined features to obtain the weighted fusion features. We express a convolution layer $W_n(x)$ as follows:

$$ W_n(x) = W_{n \times n} * x + b $$

where $*$ represents the convolution operator, $W_{n \times n}$ represents the n × n convolutional kernel, b represents the bias vector, and x represents the input data.
We denote the concatenation operation as follows:

$$ x_{concat} = x_1 \oplus x_2 $$

where $\oplus$ represents the concatenation operator, and $x_1$ and $x_2$ represent the features of the two branches. The EWB module can be denoted as follows:

$$ y = f_{sigmoid}\big(W_1^2\big(f_{ReLU}\big(W_1^1(x_{concat})\big)\big)\big) \otimes x_{concat} $$

where $\otimes$ represents the element-wise (dot) multiplication operator, $f_{sigmoid}$ represents the sigmoid function, $f_{ReLU}$ represents the ReLU function, $W_1^1$ and $W_1^2$ represent the first and second 1 × 1 convolution layers, respectively, and $x_{concat}$ represents the combined feature.
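A minimal numpy sketch of the EWB computation on two branch features; the 1 × 1 convolutions are per-pixel linear maps over channels, here with random, untrained kernels, and the feature sizes are illustrative assumptions.

```python
import numpy as np

def ewb(x1, x2, rng=np.random.default_rng(2)):
    """Element-wise weighting block on two branch features (C x H x W each)."""
    x_concat = np.concatenate([x1, x2], axis=0)        # x1 ⊕ x2
    c = x_concat.shape[0]
    w1 = rng.standard_normal((c, c)) * 0.1             # first 1x1 conv
    w2 = rng.standard_normal((c, c)) * 0.1             # second 1x1 conv
    h = np.maximum(np.einsum('oc,chw->ohw', w1, x_concat), 0.0)   # ReLU
    weights = 1 / (1 + np.exp(-np.einsum('oc,chw->ohw', w2, h)))  # sigmoid
    return weights * x_concat                          # element-wise product

f1 = np.random.default_rng(3).standard_normal((32, 8, 8))
f2 = np.random.default_rng(4).standard_normal((32, 8, 8))
out = ewb(f1, f2)
assert out.shape == (64, 8, 8)   # weight cube matches the combined feature
```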

Chained Deduced Classification Strategy
Suppose we have 4 phases of images, denoted $I_{t1}$, $I_{t2}$, $I_{t3}$, and $I_{t4}$, respectively. We also have the ground-truth labels for the two phases t1 and t2, denoted $L_{t1}$ and $L_{t2}$, respectively. The choice of 4 image phases and 2 label phases is only for ease of understanding; in practice, the number of phases of both images and labels can be larger.
First, we use the images and ground-truth labels of the two phases t1 and t2 as samples and train the DTNet described in Section 2.1. $I_{t1}$, $L_{t1}$, and $I_{t2}$ are the input data for the network, and $L_{t2}$ is the label for supervised learning. We obtain model $M_{t1\_t2}$ after training. We denote this process as follows:

$$ M_{t1\_t2} : f_{DT}(I_{t1}, L_{t1}, I_{t2}) \Leftrightarrow L_{t2} $$

where $f_{DT}$ represents the DTNet, the two sides of $\Leftrightarrow$ represent the network output and the supervised learning labels, respectively, and $\Leftrightarrow$ represents the training task, including forward and backward propagation. Then, we use the trained model $M_{t1\_t2}$ to predict the classification result $C_{t3}$ of phase t3. The input data of the network is changed from the two phases t1 and t2 to the two phases t2 and t3. We denote this process as follows:

$$ C_{t3} : f_{DT}(I_{t2}, L_{t2}, I_{t3}) \Rightarrow M_{t1\_t2} $$

where the two sides of $\Rightarrow$ represent the data and the model, respectively, and $\Rightarrow$ represents the inference task, including only forward propagation. Then we use the classification result $C_{t3}$ of phase t3 as the pseudo label to adaptively train the model $M_{t2\_t3}$ for the two phases t2 and t3:

$$ M_{t2\_t3} : f_{DT}(I_{t2}, L_{t2}, I_{t3}) \Leftrightarrow C_{t3} $$

Then we use the trained model $M_{t2\_t3}$ to predict the classification result $C_{t4}$ of phase t4, changing the input data from the two phases t2 and t3 to the two phases t3 and t4:

$$ C_{t4} : f_{DT}(I_{t3}, C_{t3}, I_{t4}) \Rightarrow M_{t2\_t3} $$

So far, we have completed the semantic segmentation of the two phases t3 and t4. When we have more follow-up phase images, we can continue to deduce the results of new phases according to the above rules.
In summary, we use the classification result $C_{t[n-1]}$ of phase t[n − 1] as the pseudo label to adaptively train the model $M_{t[n-2]\_t[n-1]}$ for the two phases t[n − 2] and t[n − 1]:

$$ M_{t[n-2]\_t[n-1]} : f_{DT}(I_{t[n-2]}, C_{t[n-2]}, I_{t[n-1]}) \Leftrightarrow C_{t[n-1]} $$

Then we use the trained model $M_{t[n-2]\_t[n-1]}$ to predict the classification result $C_{t[n]}$ of phase t[n]:

$$ C_{t[n]} : f_{DT}(I_{t[n-1]}, C_{t[n-1]}, I_{t[n]}) \Rightarrow M_{t[n-2]\_t[n-1]} $$

where $f_{DT}$ represents the DTNet, $\Leftrightarrow$ represents the training task, including forward and backward propagation, and $\Rightarrow$ represents the inference task, including only forward propagation. We refer to this alternating training/inference classification strategy as the chained deduced classification strategy (CDCS). The schematic diagram of the CDCS is shown in Figure 7. This idea can also be extended to the TTNet and MTNet frameworks. This strategy requires that the classification results of later phases be deduced sequentially in chronological order.
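The alternating train/infer chain can be sketched in plain Python. `train` and `predict` are placeholders standing in for DTNet training and inference; the sketch only demonstrates the order of operations along the time axis.

```python
def cdcs(phases, labels):
    """Chained deduced classification over time-ordered phases.

    phases: e.g. ['t1', 't2', 't3', 't4']; labels: ground-truth labels
    for the first two phases.  Returns a label/pseudo-label per phase
    and a log of the train/predict steps in execution order.
    """
    log = []
    results = dict(labels)

    def train(ref, unk):                      # placeholder: DTNet training
        log.append(('train', ref, unk))
        return (ref, unk)                     # "model" M_{ref_unk}

    def predict(model, ref, unk):             # placeholder: DTNet inference
        log.append(('predict', model, unk))
        return f'C_{unk}'                     # pseudo label for phase unk

    model = train(phases[0], phases[1])       # M_t1_t2 from ground truth
    for i in range(2, len(phases)):
        prev, cur = phases[i - 1], phases[i]
        results[cur] = predict(model, prev, cur)  # infer the new phase...
        model = train(prev, cur)              # ...then adapt on the pseudo label
    return results, log

results, log = cdcs(['t1', 't2', 't3', 't4'], {'t1': 'L_t1', 't2': 'L_t2'})
assert results['t4'] == 'C_t4'
```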
Because the changing rules of some ground objects, such as residential areas, generally show an overall increasing trend, using the later phase as the reference for the earlier one would cause the erroneous phenomenon of "reduction of residential areas". Considering the real surface changes and the differences in the color distribution of multi-temporal images, the change feature information from t1 to t2, t3, or t4 can be quite different. When the model $M_{t1\_t2}$ is directly applied to the t3 or t4 classification, the error will be larger than with the CDCS. Therefore, the multi-temporal classification results obtained with the CDCS are more stable and more consistent with the actual changes.

Datasets
To test our proposed network architecture, we need multiple phases of images. Each phase of images is fully registered. Each phase of the image needs to have a corresponding classification label. The currently published remote sensing semantic segmentation datasets cannot meet the requirements of multi-phase image registration. Some remote sensing change detection datasets have multiple registered images, but each image is not labeled pixel by pixel. Only the change area is labeled in those change detection datasets. Therefore, it is indispensable to make our own Landsat dataset.
We choose parts of northern and southwestern China as the study areas. We downloaded Landsat images in 2000, 2005, 2010, 2015, and 2020, with 102 images per year. All downloaded images are at the L1TP level, and the surface reflectance data were obtained after FLAASH processing in ENVI software. We only used six bands of data: blue, green, red, near-infrared, shortwave infrared 1, and shortwave infrared 2. We stitched the images of the two study areas by year, cut out the redundant parts, and only kept the images within the study areas. We finally cropped all images according to the standardized grid and obtained 20 tiles of images with 10,240 × 10,240 pixels each year. Region-I is the name defined for the study area in northern China. It has 16 tiles covering an area of 1,440,000 km² and is located between 110.9° E ∼ 122.…

All samples were visually interpreted manually in ArcGIS software by a workgroup of more than ten people over six months. Since some small ground objects are difficult to interpret visually, the workgroup used multi-temporal high-resolution images as a reference. To ensure high-quality labels, the entire workgroup has basic knowledge of landcover, and some controversial sample sites were confirmed through field surveys. All samples were randomly cross-checked three times, and disputed samples were uniformly adjudicated. Although there is a certain possibility of error in manual labeling, we tried our best to minimize it and make the label accuracy as close to 100% as possible.
The ground objects in adjacent areas have certain correlations and similarities, so the pixel value distributions of their images are also correlated. Therefore, if all image slices were randomly divided into the training sub-dataset and the test sub-dataset, there would be no way to de-correlate the two sub-datasets, resulting in an inaccurate evaluation of the model's generalization ability. Therefore, we must keep the different sub-datasets in relatively independent geographic locations. As shown in Figure 11a, the inner 4 tiles (IDs 6, 7, 10, 11) of Region-I are used for the training and validation dataset. We name it Landsat dataset I (LSDS-I). The outer 12 tiles (IDs 1, 2, 3, 4, 5, 8, 9, 12, 13, 14, 15, 16) of Region-I are used for the test dataset. We name it Landsat dataset II (LSDS-II). To further validate the model's cross-region generalization ability, the images in Region-II are used only for prediction and are not involved in the training stage. Region-II is far away from Region-I. As shown in Figure 11b, all 4 tiles (IDs 17, 18, 19, 20) of Region-II are used for the test dataset. We name it Landsat dataset III (LSDS-III).
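The geographic split by tile ID can be written down directly from the description above; the tile IDs are taken from the text, and the set arithmetic simply checks that the three datasets are disjoint.

```python
# Tile-ID split described in the text: inner tiles of Region-I for
# training/validation, outer tiles for testing, Region-II tiles for
# cross-region testing.
LSDS_I = {6, 7, 10, 11}                    # train + validation
LSDS_II = set(range(1, 17)) - LSDS_I       # outer 12 Region-I tiles
LSDS_III = {17, 18, 19, 20}                # all Region-II tiles

assert LSDS_II == {1, 2, 3, 4, 5, 8, 9, 12, 13, 14, 15, 16}
# The three sub-datasets are geographically disjoint by construction.
assert not (LSDS_I & LSDS_II) and not ((LSDS_I | LSDS_II) & LSDS_III)
```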

Data Preprocessing
For LSDS-I, we cropped the big tiles with a sliding window to generate 1600 small tiles of 512 × 512 pixels. We randomly divided the cropped tiles into a training set and a validation set in a ratio of 8:2. LSDS-II and LSDS-III do not participate in training and validation; they are only used as test sets for evaluating generalization ability. Since the mainstream semantic segmentation networks and our proposed MTNet are all fully convolutional neural networks that can adapt to any image size, LSDS-II and LSDS-III are not cropped. All bands in the data are used for training and prediction.
Three phases of samples are included in the three LSDS series datasets. The mainstream semantic segmentation network adapts to single-temporal input, our proposed DTNet adapts to dual-temporal input, and the TTNet adapts to triple-temporal input. To ensure that all three phases of samples can be trained in the single-temporal network and DTNet, we preprocessed the multi-phase samples. For the single-temporal network, we mix the three-phase samples together to generate pseudo-single-phase samples for training. For the DTNet, we mix the three-phase samples in pairs to generate pseudo-dual-temporal samples for training.
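A sketch of the pseudo-sample generation described above. Whether the pseudo-dual-temporal samples use all chronologically ordered pairs or only adjacent pairs is an assumption here; the sketch uses all ordered pairs with the earlier phase as the reference.

```python
from itertools import combinations

phases = ['t1', 't2', 't3']

# Single-temporal network: every phase becomes an independent
# pseudo-single-phase sample.
single_samples = list(phases)

# DTNet: chronologically ordered pairs, earlier phase first as the
# reference phase (using all pairs is an assumption of this sketch).
dual_samples = list(combinations(phases, 2))

assert single_samples == ['t1', 't2', 't3']
assert dual_samples == [('t1', 't2'), ('t1', 't3'), ('t2', 't3')]
```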
We normalize the training image data according to Equation (13) to conform to the network's standard input format:

$$ \hat{D} = \frac{D - mean}{stddev} \tag{13} $$

where $\hat{D}$ represents the normalized data, D represents the input data, mean represents the mean value of the corresponding channel in the training image data, and stddev represents the standard deviation of the corresponding channel in the training image data.
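A numpy sketch of Equation (13), computing per-channel statistics over the whole training set (the array layout is an assumption):

```python
import numpy as np

def normalize(images):
    """Per-channel standardization: (D - mean) / stddev.

    images: array of shape (num_images, channels, H, W); the statistics
    are computed per channel over the whole training set.
    """
    mean = images.mean(axis=(0, 2, 3), keepdims=True)
    stddev = images.std(axis=(0, 2, 3), keepdims=True)
    return (images - mean) / stddev

data = np.random.default_rng(5).normal(50.0, 10.0, (4, 6, 16, 16))
out = normalize(data)
assert np.allclose(out.mean(axis=(0, 2, 3)), 0.0, atol=1e-6)
assert np.allclose(out.std(axis=(0, 2, 3)), 1.0, atol=1e-6)
```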

Training Settings
The PyTorch deep learning framework [46] is used to train the MTNet proposed in this paper. The mainstream semantic segmentation models used for comparison are also implemented in PyTorch. We use four NVIDIA RTX 3090 GPUs with 24 GB of memory each for training. The data augmentation operations include random horizontal flip, random vertical flip, and random rotation. We choose 32 as the batch size and AdamW [47] as the optimizer. The initial learning rate (LR) is set to 1 × 10⁻⁵. First, we use a warm-up strategy to adjust the LR. According to Equation (14), the LR rises to 1 × 10⁻³ after 10 epochs. Then, we use a reduce-LR-on-plateau strategy to adjust the LR. According to Equation (16), the LR is multiplied by a factor of 0.3 if the validation accuracy does not increase within 20 epochs. The training process is stopped when the LR falls below 1 × 10⁻⁷.
$$ lr = lr_0 + (lr_* - lr_0)\,\frac{t}{n} \tag{14} $$

where lr represents the current LR, $lr_0$ represents the initial LR, $lr_*$ represents the LR after the warm-up stage, t represents the current iteration number, and n represents the total iteration number in the warm-up stage. n is defined as follows:

$$ n = e \times k \tag{15} $$

where e represents the total epoch number of the warm-up stage, and k represents the iteration number per epoch.
$$ lr = \alpha \cdot lr' \tag{16} $$

where lr represents the current LR, $lr'$ represents the last LR, and $\alpha$ represents the adjustment factor in the reduce-LR-on-plateau stage.
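The two-stage schedule can be sketched as follows. The linear form of the warm-up ramp and the iterations-per-epoch value `k` are assumptions of this sketch; only the endpoints 1 × 10⁻⁵ and 1 × 10⁻³, the 10 warm-up epochs, and the factor 0.3 are stated in the text.

```python
def warmup_lr(lr0, lr_star, t, n):
    """Linear warm-up: ramp the LR from lr0 to lr_star over n iterations."""
    return lr0 + (lr_star - lr0) * t / n

def plateau_lr(last_lr, alpha=0.3):
    """Reduce-on-plateau: multiply the last LR by the factor alpha."""
    return alpha * last_lr

# e = 10 warm-up epochs; k = 40 iterations/epoch is an assumed value.
n = 10 * 40
lr = warmup_lr(1e-5, 1e-3, n, n)   # LR at the end of warm-up
assert abs(lr - 1e-3) < 1e-12

lr = plateau_lr(lr)                # one plateau step afterwards
assert abs(lr - 3e-4) < 1e-12
```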
To speed up network convergence and improve learning ability, we use cross-entropy loss to calculate pixel-level loss and Lovasz-softmax loss to calculate region-level loss simultaneously. To avoid the effects of differences in hardware and hyperparameters, we configure each model training with the same hardware and hyperparameters. There will be random errors in the training stage, resulting in slight fluctuations in the accuracy of each training. To make the accuracy more objective, we randomly trained each model 10 times. We compute the average accuracy of each model as the metric of the final accuracy comparison.
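A numpy sketch of the combined loss on softmax probabilities. The Lovász part follows the published Lovász-softmax algorithm (per-class errors sorted in descending order and weighted by the gradient of the Lovász extension of the Jaccard loss); it is not the paper's exact implementation, and the equal weighting of the two terms is an assumption.

```python
import numpy as np

def lovasz_grad(gt_sorted):
    """Gradient of the Lovász extension of the Jaccard loss."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    if len(gt_sorted) > 1:
        jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def combined_loss(probs, target):
    """Pixel-level cross-entropy + region-level Lovász-softmax.

    probs: (C, P) softmax probabilities for P pixels; target: (P,) labels.
    """
    p = probs[target, np.arange(len(target))]
    ce = -np.log(p).mean()                           # cross-entropy term
    lov = 0.0
    for c in range(probs.shape[0]):                  # Lovász-softmax term
        fg = (target == c).astype(float)
        errors = np.abs(fg - probs[c])
        order = np.argsort(-errors)                  # descending errors
        lov += errors[order] @ lovasz_grad(fg[order])
    return ce + lov / probs.shape[0]

rng = np.random.default_rng(6)
logits = rng.standard_normal((5, 100))
probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
target = rng.integers(0, 5, 100)
loss = combined_loss(probs, target)
assert np.isfinite(loss) and loss > 0
```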

Evaluation Metrics
Overall accuracy (OA) is the proportion of correctly classified pixels, which can be defined as follows:

$$ OA = \frac{1}{M}\sum_{i=1}^{M} \mathbb{1}(y_i = \hat{y}_i) $$

where $y$ represents the ground truth of the samples, $\hat{y}$ represents the predicted results, and M represents the total number of pixels. Each single-class accuracy uses the $F_1$ score as the evaluation metric. True positive (TP) means that the sample is true and the prediction result is also true. False positive (FP) means that the sample is false and the prediction result is true. False negative (FN) means that the sample is true and the prediction result is false. The precision and recall can be calculated from TP, FP, and FN as follows:

$$ precision = \frac{TP}{TP + FP}, \qquad recall = \frac{TP}{TP + FN} $$

Therefore, the $F_1$ score is defined as follows:

$$ F_1 = \frac{2 \times precision \times recall}{precision + recall} $$

Since OA is not sensitive enough to categories with a small proportion [48], we additionally use the mean $F_1$ score ($mF_1$) as the second metric for the overall evaluation. The $mF_1$ is defined as the mean of the $F_1$ scores over all categories:

$$ mF_1 = \frac{1}{N}\sum_{i=1}^{N} F_1^i $$

where $F_1^i$ represents the $F_1$ score of the ith category, and N represents the number of categories.
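The metrics can be computed directly from label arrays; this numpy sketch assumes every class appears in both the labels and the predictions, so no division by zero occurs.

```python
import numpy as np

def overall_accuracy(y, y_hat):
    """OA: proportion of correctly classified pixels."""
    return (y == y_hat).mean()

def f1_per_class(y, y_hat, c):
    """F1 score of class c from TP, FP, and FN counts."""
    tp = np.sum((y == c) & (y_hat == c))
    fp = np.sum((y != c) & (y_hat == c))
    fn = np.sum((y == c) & (y_hat != c))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mean_f1(y, y_hat, num_classes):
    """mF1: mean of the per-class F1 scores."""
    return np.mean([f1_per_class(y, y_hat, c) for c in range(num_classes)])

y = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 1, 1, 1, 2, 0])
assert abs(overall_accuracy(y, y_hat) - 4 / 6) < 1e-12
```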

Experiments on the Landsat Dataset I

Ablation Study
Based on the baseline UNet, we incrementally build DTNet and TTNet in turn and introduce the EWB module to evaluate the effect of each module and structure on improving the classification accuracy. Table 1 shows the ablation comparison accuracy of the MTNet on the validation set of LSDS-I. The OA of UNet is 81.1%. We add a reference branch and use the Concat method for dual branch fusion to build the network DTNet-C. Due to the prior information and constraints provided by the previous phase data, the accuracy of each category has been increased. The accuracy of bare land has been greatly increased. The OA of DTNet-C has reached 89.4%. We further add the previous reference branch and still use the Concat method to achieve triple branch fusion to build the network TTNet-C. Since we have obtained richer temporal information and more change feature information in ground objects, the accuracy of all classes has been further increased. The OA of TTNet-C has reached 90.6%. To verify the effect of the EWB module, we replace the Concat modules of DTNet-C and TTNet-C with the EWB module to build DTNet-E and TTNet-E, respectively. Since the EWB module allows the network to select effective features more efficiently from multi-branch features and reduce the interference of redundant and useless features, the accuracy of all classes has been greatly increased compared to the Concat fusion method. On the Landsat image, two groups of easily confused ground objects, wetland/waterbody and grassland/bare land, are better distinguished with the help of the EWB module. Therefore, the accuracy of those confused categories is improved obviously. The accuracies of DTNet-E and TTNet-E reach 91.3% and 93.2%, respectively. Compared with the baseline UNet, the accuracy of the best MTNet member TTNet-E proposed in this paper is improved by 12.1%. The visualization results of our proposed MTNet family networks and the baseline UNet on LSDS-I are shown in Figure 12. 
In the first row, UNet cannot extract the wetlands in the red rectangle. Although DTNet-C adds a reference branch, there is no EWB module to filter and optimize the effective multi-branch features, so the wetlands are misclassified, and the grassland in the lower right corner is wrongly classified as woodland. TTNet-C introduces richer temporal information, but without the help of the EWB module it misclassifies wetlands as waterbodies. DTNet-E can correctly extract wetlands. However, with only one reference phase available, the network only obtains the information that the area was previously cultivated land, so the waterbody is misclassified as cultivated land. Under the optimization of rich temporal information and the EWB module, TTNet-E learns the change feature information of the ground objects, so the waterbodies and wetlands in this area are correctly classified. In the second row, UNet mistakenly classifies the wetland in the yellow rectangle as woodland; the artificial surface is also roughly classified, and woodland is incorrectly classified as cultivated land. After DTNet-C and TTNet-C introduce the image and label information of the previous phase, the misclassification of large areas of woodland is solved. However, the wetland in the yellow rectangle is still wrongly classified as grassland; the boundary details of the artificial surface are improved, but the river has disconnection problems. After DTNet-E and TTNet-E introduce the EWB module, the multi-branch features are filtered and optimized: the wetlands in the yellow rectangle are correctly classified, the river connectivity is better, and the boundaries between ground objects are more accurate. In the third and fourth rows, the details of small objects are greatly improved after adding the previous-phase label information.
These results show that the MTNet and EWB module proposed in this paper are effective, especially for optimizing small targets and boundary details.
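The ablation above hinges on how the branch features are merged. As a rough sketch (the shapes, names, and NumPy stand-ins for the network's feature cubes are ours for illustration, not the paper's implementation), Concat fusion simply stacks the branch features along the channel axis, which is what the EWB module later replaces:

```python
import numpy as np

def concat_fuse(*branch_feats):
    # Concat-style fusion (as in DTNet-C / TTNet-C): stack the feature
    # cubes of all branches along the channel axis; a following
    # convolution in the real network would then mix them.
    return np.concatenate(branch_feats, axis=0)

# Toy feature cubes in (channels, height, width) layout for the current
# image, the previous-phase image, and the phase before that.
cur = np.random.rand(64, 32, 32)
prev = np.random.rand(64, 32, 32)
prev2 = np.random.rand(64, 32, 32)

dual = concat_fuse(cur, prev)            # dual-branch: 128 channels
triple = concat_fuse(cur, prev, prev2)   # triple-branch: 192 channels
```

The channel count grows with each added reference branch, which is why the fusion step becomes the natural place to filter redundant features.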

Comparing Methods
We trained several mainstream semantic segmentation networks and our best-performing MTNet member, TTNet-E, on LSDS-I. Table 2 shows that the OA of the mainstream networks is around 79% to 81%, whereas TTNet-E reaches 93.2%. The accuracy of every category is much higher than that of the mainstream networks, which perform particularly poorly on bare land. TTNet-E improves the accuracy of bare land to 81.1%, and the improvement in the other categories ranges from about 13.6% to 33.1% compared with UNet.
The mainstream semantic segmentation networks we trained in this work are as follows: (1) UNet [33]: UNet uses an encoder to extract the features of each stage as spatial feature information; the decoder gradually restores the pooled features to the original size and fuses them with the corresponding spatial feature information of the encoder. (2) UNet++ [49]: UNet++ adds more node modules at each stage of UNet's decoder, making feature processing more intensive. Figure 13 shows the visualized results of MTNet and the mainstream networks on LSDS-I. In the first row, the mainstream networks cannot correctly extract the bare land, and the grass in the upper left corner is incorrectly classified as woodland. MTNet, relying on multi-temporal information constraints and EWB module optimization, correctly classifies bare land and grassland, and the boundaries of ground objects are excellent. In the second row, the location distribution of woodland, grassland, and wetland is very complicated, with a certain degree of confusion, and the mainstream networks classify this area poorly; the artificial surface in the lower left corner also cannot be extracted correctly. MTNet, however, can effectively classify complex and interlaced ground objects after the feature optimization of the EWB module; the classification result is very close to the label, and the accuracy is very high. In the third and fourth rows, the other mainstream models produce many misclassifications, while MTNet relies on the pre-phase label and change-information constraints to improve classification accuracy significantly. Therefore, the MTNet proposed in this paper far exceeds the current mainstream networks and reaches the state-of-the-art level.

Ablation Study
We performed prediction and accuracy evaluation on LSDS-II to test our proposed model's generalization ability on Landsat images. Table 3 shows that the accuracy of all networks decreases slightly relative to LSDS-I. LSDS-I is used for accuracy validation in the training stage, and the best model weights are selected according to the accuracy on the LSDS-I validation set; because LSDS-II is used only for prediction, the accuracy of the same model is usually lower on LSDS-II than on LSDS-I. Our proposed TTNet-E still achieves a significant improvement on LSDS-II, reaching 91.3% OA, which indicates that our MTNet has good generalization ability. The visualization results of our proposed MTNet family networks and the baseline UNet on LSDS-II are shown in Figure 14. In the first row, waterbodies and wetlands are the main ground objects. UNet, without the help of pre-phase information, mistakenly classifies wetlands as bare land. On Landsat images, wetlands and grasslands are easily confused; because DTNet-C does not use the EWB module to filter and optimize features, it mistakenly classifies the wetlands as grassland. Wetlands and waterbodies are also prone to confusion; without the EWB module, TTNet-C incorrectly classifies wetlands as waterbodies. After introducing the EWB module, DTNet-E can correctly extract most of the wetlands. However, the two-branch network has limited change information on the ground objects, and under the influence of the bare land in the previous phase, many waterbodies are wrongly classified as bare land. TTNet-E further optimizes the classification results through richer change feature information and the feature selection and optimization of the EWB module. In the second row, the artificial surface classified by UNet is relatively rough, and large areas of cultivated land are also wrongly classified as wetland.
Without the EWB module, both DTNet-C and TTNet-C misclassify a small wetland in the middle as grassland or waterbody. TTNet-E solves the problem of change information learning and multi-branch effective feature selection and optimization. Therefore, the classification result has been significantly optimized. In the third and fourth rows, it can be clearly seen that after adding the pre-phase label information, the long and narrow objects can be correctly extracted, and the details have been greatly improved. Therefore, the MTNet family and EWB module proposed in this paper have excellent generalization ability on multi-temporal Landsat images.

Comparing Methods
We performed prediction and accuracy evaluation on LSDS-II to test our proposed model's generalization ability on Landsat images and compare it with the mainstream networks. TTNet-E is chosen as the representative of MTNet. As shown in Table 4, although the accuracies of all models drop slightly on LSDS-II, TTNet-E still outperforms the other mainstream networks with 91.3% OA; compared with UNet and PAN, the best-performing mainstream networks, the accuracy is improved by 13.1%. The mainstream networks used for comparison are the same as in Section 3.3.2, and our TTNet-E continues to be denoted as MTNet here. Figure 15 shows the visualized results of MTNet and the mainstream networks on LSDS-II. In the first row, MTNet can correctly extract the woodland below, and the details of the artificial surface are more accurate, whereas the other networks cannot effectively distinguish the woodland from the cultivated land and miss parts of the artificial surface. In the second row, with the help of the pre-phase information, MTNet delineates the outline of the woodland very accurately and classifies the artificial surface completely, whereas the woodland boundaries extracted by the other networks are very inaccurate and large parts of the artificial surface are omitted. In the third and fourth rows, the other mainstream networks miss small objects and misclassify large areas. With the help of the pre-phase information, MTNet focuses on extracting feature information from the changing areas, preventing the unchanged areas from being misclassified and significantly improving the classification accuracy. Therefore, the MTNet proposed in this paper has advanced performance and good generalization ability.

Experiments on the Landsat Dataset III
We performed prediction and accuracy evaluation on LSDS-III to further test our proposed model's generalization ability on multi-temporal Landsat images. As shown in Table 5, since Region-II is far away from Region-I, the accuracy of all models drops to a certain extent. Because a DCNN is trained on the scenes present in the samples, the accuracy inevitably drops when facing a scene it has not encountered before; the ability to retain accuracy therefore indicates the model's generalization ability. Our proposed MTNet mainly focuses on the changed information: when the network determines that a certain region has not changed, it retains the label of the previous phase as the classification result. Hence, the stability of MTNet's classification results is very high. On LSDS-III, the accuracy drop of MTNet is very small, showing that MTNet still has strong generalization ability. MTNet can still achieve 90.1% OA on LSDS-III, far exceeding the other mainstream networks. Because the other mainstream networks need to learn complex features and adapt poorly to unfamiliar scenes, their generalization ability is severely reduced, whereas MTNet mainly focuses on changing feature information, adapts well to unfamiliar scenes, and thus retains good generalization ability. Figure 16 shows the visualized results of MTNet and the mainstream networks on LSDS-III. Obviously, the other mainstream networks produce many misclassifications due to their poor generalization ability, while MTNet can still maintain high classification accuracy. Table 6 shows the quantitative accuracy comparison of the baseline UNet, MTNet without CDCS, and MTNet with CDCS. For convenience, our DTNet-E is denoted as MTNet here. LSDS-I and LSDS-II, located in Region-I, were evaluated for accuracy together, while LSDS-III, located in Region-II, was evaluated independently to compare geospatial generalization ability.
All models were optimized with the data in 2005 as the training target. The data in 2010 were not used for training and were instead used to compare the multi-temporal generalization ability. Since the 2005 results of MTNet without CDCS and MTNet with CDCS were predicted by the same model, and CDCS begins to take effect in the 2010 results, the 2010 results and labels were used for accuracy evaluation. It can be seen that CDCS improves network performance; although the improvement is smaller than that brought by the multi-temporal reference branch, it is still effective. This is because, in CDCS, the model undergoes deduced training and gradually adapts to the color distribution and feature pattern of the next phase's image, so the classification results in the next phase are more stable. Without the assistance of CDCS, the model is prone to unstable classification results due to differences in image color distribution. The data in 2015 and 2020 have only images and no corresponding ground-truth labels, so qualitative analysis and evaluation are mainly performed by visual inspection. To verify and compare the effect of CDCS in MTNet, we put the classification results of 2005, 2010, 2015, and 2020 together to obtain a multi-temporal classification result sequence. Figure 17 shows the multi-temporal classification results of MTNet without CDCS and MTNet with CDCS in Region-I. In the first group, without the help of CDCS, a large number of artificial-surface false detections appear in the 2015 results; after adding CDCS, the consistency of the multi-temporal classification results is significantly improved. In the second group, without the support of CDCS, the 2015 and 2020 results show a spurious explosive growth of the artificial surface; with the constraints of CDCS, the false classification results are suppressed, and the growth trend of the artificial surface is more in line with the actual situation.
In the third group, without the assistance of CDCS, the 2020 classification results show an unreasonable explosion of artificial surfaces; under the action of CDCS, the false detections are eliminated, and the growth trend of the artificial surface is more in line with the actual situation. Figure 18 shows the multi-temporal classification results of MTNet without CDCS and MTNet with CDCS in Region-II. In the first group, in the absence of CDCS, the artificial surface shrinks anomalously in the 2020 results; with the assistance of CDCS, the artificial surface shows a gradual, slight growth trend, which is more in line with the actual situation. In the second group, without CDCS, the classification results show waterbodies abnormally increasing and then decreasing; under the constraint of CDCS, the abnormal waterbody results are suppressed, and the consistency of the classification results improves significantly. In the third group, without CDCS, the 2015 results show a large amount of grassland, while the other years show woodland; with the help of CDCS, the falsely detected grasslands are suppressed, and the multi-temporal classification results have excellent consistency.

Ablation Study
It can be seen that, although the overall consistency is good, when the color difference between images is apparent, the artificial surface, waterbody, and grassland are prone to false detection, resulting in a wrong trend of change. CDCS can gradually adapt to the color distribution and feature pattern of the next phase's image, thereby significantly improving the consistency of multi-temporal results. Table 7 shows the quantitative accuracy comparison between the mainstream single-temporal networks and our proposed MTNet with CDCS. For convenience, our DTNet-E is denoted as MTNet here. LSDS-I and LSDS-II, located in Region-I, were evaluated for accuracy together, while LSDS-III, located in Region-II, was evaluated independently to compare geospatial generalization ability. All models were optimized with the data in 2005 as the training target. The data in 2010 were not used for training and were instead used to compare the multi-temporal generalization ability. It can be seen that the overall performance of MTNet is much higher than that of the single-temporal networks. From 2005 to 2010 in Region-I, the accuracy of each category of the single-temporal networks decreases to varying degrees. The wetland category decreases very seriously, even dropping below 10% accuracy; the single-temporal networks completely lose their generalization ability for the multi-temporal classification of wetlands. In contrast, the accuracy of each category of MTNet is maintained, with only a slight decline. Considering that the samples in 2010 did not participate in the training stage, this decline is normal. The single-temporal networks cannot solve the wetland generalization problem, whereas MTNet maintains 78.2% accuracy for the wetland category. This shows that MTNet fully uses the advantages of multi-temporal data in multi-temporal tasks and has outstanding generalization ability.
For LSDS-III, which does not participate in the training stage and has a very low geographic correlation, the accuracy of each category of the single-temporal networks also decreases to varying degrees. In addition to the severe decline in the accuracy of wetlands, the accuracy of grasslands, waterbodies, and artificial surfaces has also dropped significantly. That is to say, the generalization ability of single-temporal networks in both geographic space and temporal changes is fragile. However, in unfamiliar geographic scenes, MTNet can enhance the learning of changing information with the help of multi-temporal information and reduce the burden of unchanging information in the feature extraction stage. It has a powerful generalization ability regarding geographic space and temporal changes.

Comparing Methods
The data in 2015 and 2020 have only images and no corresponding ground-truth labels, so qualitative analysis and evaluation are mainly performed by visual inspection. Table 7 shows that the overall performance of the single-temporal networks is relatively similar; we therefore choose UNet++ as the representative single-temporal network and compare it with MTNet with CDCS. To compare the effect of MTNet with CDCS, we put the classification results of 2005, 2010, 2015, and 2020 together to obtain a multi-temporal classification result sequence. Figure 19 shows the multi-temporal classification results of UNet++ and MTNet in Region-I. In the first group, the classification results of UNet++ miss many small ground objects and details, such as woodlands, small villages, and roads. Residential areas increase significantly in 2015 but decline obviously in 2020; these changes are completely inconsistent with the real scene. MTNet can effectively extract small ground objects and maintain good detail, and the residential area increases gradually, in line with the real natural scene. In the second group, the results of UNet++ show grassland and cultivated land changing back and forth, although the cultivated land should change little in the actual scene; in addition, the details of the classification results are poor. MTNet keeps the distribution of grassland and cultivated land stable with few changes, and only the artificial surface increases slightly, showing that MTNet has good stability and consistency. In the third group, the results of UNet++ are not stable enough: although the image shows no major change in the ground objects, the results change chaotically among woodland, grassland, waterbody, bare land, artificial surface, and cultivated land, which is completely inconsistent with reality.
MTNet can keep the stability of the classification results with only a few changes, which is consistent with the real situation. It can be seen that UNet++ cannot establish multi-temporal correlation and the classification results are unstable. Obvious errors that do not conform to the actual situation always occur. The results of MTNet are consistent with the real changes in ground objects. It has good stability and consistency. It also has a strong ability to keep details.  Figure 20 shows the multi-temporal classification results of UNet++ and MTNet in Region-II. In the first group, UNet++ was unable to separate the relatively finely fragmented grassland in the upper right corner from the woodland. The wetland changed too much, and the waterbody was missed by UNet++. The surrounding woodland was misclassified. MTNet can keep the outline of the wetland, and the waterbody in the wetland is increased, which is consistent with the image. In the second group, the woodland, grassland, and cultivated land in the UNet++ results have changed too much. But from the image, the change in this area is very small. MTNet can keep the stability and consistency of classification results. In the third group, UNet++ missed a lot of small ground objects. In 2015, there were many false detections of woodland, and in 2020, there were many unreasonable increases in residential areas. MTNet can effectively extract small ground objects. The residential area gradually increases, which is in line with the law of natural development. The results of UNet++ and MTNet on Region-II are similar to that on Region-I. Obviously, the MTNet has better time deduction logic than the single-temporal network.  To sum up, MTNet with the CDCS has excellent stability, robustness, and powerful spatial-temporal generalization ability when facing multi-temporal classification tasks. 
For multi-temporal classification, single-temporal networks cannot establish the temporal correlation of ground objects, and images of different phases are treated, to a certain extent, as a new region. Due to differences in imaging conditions and image preprocessing, the color distribution of similar objects differs between phases, resulting in inconsistent classification results, serious misclassification, abnormal changes in ground objects, and inconsistency with natural laws. MTNet uses the pre-phase samples to establish the change feature information of ground objects and reduce the interference caused by pseudo color changes in the images. In addition, MTNet makes the model continuously adapt to the characteristics of the following phases by alternating training and prediction, maintaining the stability and consistency of the multi-temporal classification so that the changes of ground objects conform to natural laws.

Large-Scale Multi-Temporal Landcover Mapping
We finally adopted DTNet-E with CDCS as the representative of MTNet to classify the Landsat images of 2005, 2010, 2015, and 2020 in the two study areas and stitched all the classification results by year. Figures 21 and 22 show the large-scale multi-temporal classification results of Region-I and Region-II, respectively. Intuitively, the classification results are greatly improved with the help of the multi-temporal label prior knowledge and change feature information. Our proposed MTNet has strong generalization ability on multi-temporal Landsat images, together with good robustness, stability, and consistency. The change of landcover is in line with the real situation over the 15 years, which gives the results high research and application value.

Trade-Off Problem
The MTNet and CDCS proposed in this paper can effectively solve the consistency problem of multi-temporal classification, but there are some trade-off problems in accuracy and efficiency.
The trade-off between sample labeling and multi-temporal consistency. MTNet has high requirements for samples, requiring at least two strictly registered phases of samples, so the preparatory work is relatively heavy. However, once the samples are prepared, the multi-temporal landcover classification can be completed very quickly; it has excellent consistency and can be used for quantitative analysis of landcover changes. The traditional mainstream single-temporal network has low sample-labeling requirements and little preparatory work, but it cannot guarantee multi-temporal consistency, and its classification results have poor detail, which cannot satisfy the accuracy requirements of quantitative analysis. Therefore, we can choose the traditional single-temporal network or the MTNet proposed in this paper according to the actual application requirements.
The trade-off between parameter quantity and multi-temporal consistency. When using CDCS to deduce the multi-temporal landcover classification, a model must be trained for each phase, so the number of model parameters inevitably increases linearly. This linearly growing parameter count is directly tied to the number of phases and will not grow uncontrollably, while the multi-temporal consistency can be significantly improved. Therefore, according to the actual computing resources, we can choose whether to use CDCS in MTNet.
The trade-off between running time and multi-temporal consistency. When using CDCS to deduce the multi-temporal landcover classification, in addition to the linear increase in the number of parameters, the running time also increases linearly. CDCS improves the model's adaptation to the image color distribution through time-phased training, thereby further improving the consistency of the multi-temporal landcover classification results. Therefore, whether to use CDCS in MTNet can be chosen according to the actual computing-time limitations. Table 8 shows the effects of CDCS on the number of parameters, training time, and inference time. It can be seen that, since MTNet has one more encoder branch, its parameter count is directly doubled compared with UNet. Without CDCS, the running times of UNet and MTNet are similar; since MTNet is twice the size of UNet, a slight drop in running speed is normal. After using CDCS, the number of network parameters and the running time both increase linearly. In summary, the MTNet and CDCS proposed in this paper must be selected according to the application requirements. Traditional single-temporal networks are used when we cannot obtain multi-temporal registered samples. Our proposed MTNet is used when good consistency is needed. When storage resources are limited and large models cannot be deployed, MTNet without CDCS can be used; MTNet without CDCS is also used when there is a strict limit on running speed and the model cannot be trained multiple times. When pursuing only high precision and excellent multi-temporal consistency, with no computational resource limitations at all, we can use MTNet with CDCS.
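The observations above can be condensed into a back-of-envelope cost model (a sketch only: the factor of two and the linear scaling are taken from the text, and the numbers below are illustrative, not measured values from Table 8):

```python
def mtnet_cost(unet_params, unet_time, n_phases, use_cdcs):
    # MTNet's extra reference encoder branch roughly doubles UNet's
    # parameter count; with CDCS one model is trained per phase, so
    # both parameters and total training time grow linearly with the
    # number of phases.
    params = 2 * unet_params
    time = unet_time  # comparable to UNet when CDCS is not used
    if use_cdcs:
        params *= n_phases
        time *= n_phases
    return params, time

# Illustrative values only (not the paper's measurements).
no_cdcs = mtnet_cost(unet_params=31, unet_time=1.0, n_phases=4, use_cdcs=False)
with_cdcs = mtnet_cost(unet_params=31, unet_time=1.0, n_phases=4, use_cdcs=True)
```

The model makes the trade-off explicit: the growth is bounded and linear in the phase count, never combinatorial.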

Implications and Limitations
The MTNet proposed in this paper solves two critical problems of classical semantic segmentation networks in multi-temporal landcover classification. The first problem is that the classical networks only perform feature extraction and classification on single-temporal images and do not establish the association between multi-temporal images and labels. The second problem is that when the classical networks perform multi-temporal classification, the classification results are not stable enough, the consistency is poor, and the changes in ground objects do not match the real situation.
Based on the classic semantic segmentation network, this paper introduces the images and labels of the previous phase and extends the network to a multi-branch network. Since the labels of previous phases are input to the network as prior knowledge, the network evolves from classical dense full-semantic learning to sparsely focused learning, which dramatically reduces the difficulty of feature extraction and network training. With the help of the pre-phase label information, the network is not affected by differences in the color distribution of the images; only when a change is obvious does the network regard a position as changed, thereby avoiding the interference of pseudo-changes. In a classic network, small objects tend to be swallowed by the larger objects around them. Thanks to the reduced feature-expression burden of sparsely focused learning, the multi-temporal network can attend to more detailed features, and small objects that are difficult to classify can also be accurately extracted. For the more difficult, easily confused categories, the pre-phase label plays a key reference role. Therefore, the accuracy of each category of the multi-temporal network is very high and its anti-interference ability is very strong, so there is no need to worry about the long-tail effect that is often encountered in semantic segmentation. Since MTNet focuses on whether the two images have changed, it still has very good generalization ability when the image to be classified differs considerably from the samples, whereas the classical semantic segmentation network often suffers serious accuracy degradation due to sample mismatch and cannot be used.
We have made minor modifications to the common attention mechanism, making the weight cube in the attention module more fine-grained and enabling finer filtering and optimization of features. Due to the sparsely focused learning in MTNet, the feature cube contains many useless and redundant features; the EWB module can fine-tune and filter these features, further improving classification accuracy. However, the improvement from the EWB module is small, and the main contribution is provided by MTNet. MTNet can actually be regarded as a plug-and-play framework: any classical semantic segmentation network can be upgraded to MTNet by adding reference branches, simultaneously taking advantage of the high-performance feature extraction of state-of-the-art semantic segmentation networks and the sparsely focused learning of MTNet.
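The difference between common channel attention and the fine-grained, element-wise weighting described above can be sketched as follows (a minimal NumPy illustration; the sigmoid gate is a stand-in for the EWB module's learned layers, whose exact design is not reproduced here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    # Common channel attention: one scalar weight per channel,
    # i.e. a (C, 1, 1) weight cube broadcast over the spatial axes.
    weights = sigmoid(feat.mean(axis=(1, 2), keepdims=True))
    return feat * weights

def element_wise_weighting(feat):
    # EWB-style weighting: the weight cube has the same (C, H, W)
    # shape as the feature cube, so every element is gated
    # individually. The gate here is a placeholder for learned layers.
    weights = sigmoid(feat)
    return feat * weights

feat = np.random.rand(64, 32, 32)
coarse = channel_attention(feat)       # per-channel re-weighting
fine = element_wise_weighting(feat)    # per-element re-weighting
```

The finer weight cube is what allows redundant features to be suppressed at individual spatial positions rather than whole channels at a time.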
In the multi-temporal classification task, when the time span of the images to be classified is large, training and prediction are alternated phase by phase so that the model can continuously adapt to the characteristics of subsequent phase images. Classical semantic segmentation networks tend to use all available samples, together with a large number of data augmentation strategies, to try to simulate the color distribution of the images to be classified. However, when the difference in color processing is too large, the variation of ground objects in the multi-temporal classification results becomes outrageous, sometimes even increasing and then suddenly decreasing; in multi-temporal classification tasks, single-temporal networks make many such errors. The advantage of MTNet is that it focuses on learning the features of changing regions; the advantage of CDCS is that it controls the magnitude of change and continuously, gradually adapts to the changing features. Therefore, MTNet with CDCS can ensure that, in multi-temporal classification, the classification results are deduced phase by phase, the classification accuracy is kept at a high level, and the change rule is completely consistent with the real situation. Since the deep learning network relies on big-data-driven training and has a certain fault tolerance, the error accumulated in CDCS remains within an acceptable range.
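The phase-by-phase alternation described above can be sketched as a simple loop (the `train` and `predict` callables are hypothetical stand-ins for per-phase model fitting and inference, not the paper's actual implementation):

```python
def chained_deduced_classification(images, first_label, train, predict):
    # Chained deduced classification strategy (CDCS), sketched:
    # classify each phase using the previous phase's image and label
    # as reference, then feed the prediction forward as the reference
    # label for the next phase.
    ref_label = first_label
    results = []
    for t in range(1, len(images)):
        # re-train per phase so the model adapts to the color
        # distribution and feature pattern of the next image
        model = train(images[t - 1], ref_label, images[t])
        pred = predict(model, images[t - 1], ref_label, images[t])
        results.append(pred)
        ref_label = pred  # deduce the following phase from this result
    return results

# Toy run with stand-in callables: each phase's "prediction" just
# increments the reference label, mimicking the chained hand-over.
out = chained_deduced_classification(
    images=[2005, 2010, 2015, 2020],
    first_label=0,
    train=lambda img_prev, lbl_prev, img_cur: None,
    predict=lambda model, img_prev, lbl_prev, img_cur: lbl_prev + 1,
)
```

The key property is that each phase's output becomes the next phase's constraint, which is how the strategy bounds the magnitude of change between consecutive results.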
All in all, compared with mainstream single-temporal semantic segmentation networks, the method proposed in this paper works by introducing additional reference constraint information. It has solved the consistency problem encountered in multi-temporal semantic segmentation from another perspective and greatly improved the classification accuracy and the stability of multi-temporal classification. This method is helpful for studying the changes in landcover and can be applied to scenarios such as urban expansion, cultivated land changes, geological disaster monitoring, ecological and environmental protection, wetland monitoring, woodland protection, etc., and has significant application value.
However, this method also has certain limitations, requiring strict registration of multi-temporal images. An additional phase of images and labels is required as reference input in both the training and prediction stages; therefore, if only one phase of samples is available, the method proposed in this paper is not applicable. In future research, we will explore using samples at non-corresponding positions as reference constraint information to reduce the dependence on multi-temporal registered samples. Since few-shot learning, self-supervised learning, and active learning have gradually become more popular, we will also try to build label-free or label-efficient models in our future work.

Conclusions
In this paper, we proposed a multi-temporal network. Based on the single-temporal semantic segmentation network, we added one or more reference phases of samples as prior constraint knowledge, which solves the pseudo-change and misclassification problems caused by differences in the color distribution of images in different phases. We proposed an element-wise weighting block module, which makes the attention weights more fine-grained and improves the filtering and optimization of feature cubes. We proposed a chained deduced classification strategy, which improves the stability and consistency of multi-temporal landcover classification and ensures that the multi-temporal classification results are consistent with the real changes of ground objects. In large-scale multi-temporal Landsat landcover classification, our method surpasses most mainstream networks, achieves state-of-the-art performance, and has strong robustness and generalization ability. In future research, we will generalize our proposed MTNet to more, and higher-resolution, multi-temporal remote sensing data.
Author Contributions: X.Y. wrote the manuscript, designed the methodology, and conducted experiments; B.Z. and Z.C. supervised the research; Y.B. and P.C. preprocessed the data of the study area and made the datasets. All authors have read and agreed to the published version of the manuscript.

Acknowledgments:
The authors thank the editors and anonymous reviewers for their valuable comments, which greatly improved the quality of the paper.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: