Transferable Deep Learning from Time Series of Landsat Data for National Land-Cover Mapping with Noisy Labels: A Case Study of China

Large-scale land-cover classification using a supervised algorithm is a challenging task. Enormous efforts have been made to manually process and check the production of national land-cover maps. This has led to complex pre- and post-processing and even to inaccurate mapping products from large-scale remote sensing images. Inspired by the recent success of deep learning techniques, in this study we provide a feasible automatic solution for improving the quality of national land-cover maps. However, the application of deep learning to national land-cover mapping remains limited because only small-scale noisy labels are available. To this end, a mutual transfer network (MTNet) was developed. MTNet is capable of learning better feature representations by mutually transferring pre-trained models from time series of data and fine-tuning them on current data. Such an interactive training strategy can effectively alleviate the effects of inaccurate or noisy labels and unbalanced sample distributions, thus yielding a relatively stable classification system. Extensive experiments were conducted on several representative regions to evaluate the classification results of the proposed method. Quantitative results showed that the proposed MTNet outperformed its baseline model by about 1%, and that the accuracy can be improved by up to 6.45% compared with the model trained on the training set of another year. We also visualized the national classification maps generated by MTNet for two different time periods to qualitatively analyze the performance gain. It was concluded that the proposed MTNet provides an efficient method for large-scale land-cover mapping.


Introduction
Large-scale land-cover mapping is an important data source for monitoring changes in land cover and land use, managing land resources, and achieving sustainable development [1,2]. The most important factors affecting the accuracy of large-scale land-cover maps are the model used for image classification and the training samples, which should provide sufficient information to train the model. Land-cover classification methods have developed from unsupervised to supervised methods. In the 1990s, clustering-based methods constituted the mainstream of land-cover mapping [3]. However, the spectral characteristics of objects are very complex to model without reference samples, and this can result in unsatisfactory classification results [4]. Supervised methods, in contrast, use samples of the target classes to train a classifier and then use the trained classifier to predict the class of each pixel in a remote sensing image. With the help of the knowledge extracted from the training samples, supervised methods greatly outperform unsupervised ones [5].
The maximum likelihood classifier (MLC), the decision tree (DT), the random forest (RF) and the support vector machine (SVM) are the most popular supervised classifiers. The MLC uses statistical models to describe the characteristics of training samples [6,7]. It is suitable for application to images with known statistical distributions, but the data distribution needs to be assumed in advance. The essence of the DT is to build a set of rules that depend on the selected image features. The introduction of information gain makes it possible to construct the tree automatically and thus greatly expands its applications [8][9][10]. However, the classification results are still highly dependent on the selected image features and are sensitive to noise. The RF classifier was designed to overcome these drawbacks. The RF decreases the impact of individual image features by using an ensemble of DTs trained with various training subsets [11,12]. Sometimes, image features are difficult to distinguish in their original forms. Therefore, the SVM maps the original image features to a higher-dimensional space in which they can be easily divided into two parts by a hyperplane [13,14]. A small number of support vectors are sufficient for estimating this hyperplane. However, in large-scale remote sensing images the same class presents different spectral features because of differences in imaging conditions. A small number of training samples cannot cover all the image features of the same class. Thus, using only a small number of training samples to construct a classifier is a disadvantage in large-scale remote sensing image classification. A similar situation occurs in the other supervised classifiers mentioned above. To alleviate this drawback, a large-scale study area is often divided into several small sub-regions, and a model is trained independently in each sub-region. However, this introduces another problem: the classification criteria and the related accuracy are inconsistent between sub-regions, even when complex pre- and post-processing is used.
A convolutional neural network (CNN) has the advantage of being able to learn essential features from a large set of training samples. Because of this, CNNs have achieved great success in computer vision. This provides a reference for large-scale remote sensing land-cover mapping, and CNN-based mapping has attracted a significant amount of attention [15,16]. Most of these networks end with fully connected layers. Networks of this kind assume that all pixels in an image patch share the same label, and only the label of the center pixel can be predicted each time. This not only seriously reduces the efficiency of producing large-scale products but also leads to inaccurate segmentation results, especially near object boundaries, where the input image patch contains more than one class. Fully convolutional networks are composed only of convolutional layers. Such networks output the labels of all the pixels in the input image and can thus avoid the drawbacks of patch-based ones. FCN was the first fully convolutional network; it extracts information from the input image using a range of convolutional kernels [17]. However, detailed information about the detected objects may be lost during the stacking of convolutional and pooling layers, and the object boundaries produced by FCN are unsatisfactory. U-Net adopts a hierarchical upsampling strategy to reconstruct target details and concatenates upsampled layers with the corresponding downsampled ones to improve the transfer of detailed information [18]. However, the contribution of detailed information and small-object features to the final classification results is small, as only low-level features are concatenated with high-level ones. PSPNet uses multiple branches to extract a range of information and then downsamples the learned information to different scales to construct dependencies at different scales [19]. By concatenating features at different scales, PSPNet is able to learn both global and detailed information. Considering the range of object scales captured in large-scale remote sensing images, PSPNet is suitable for the corresponding land-cover mapping. Unfortunately, CNN-based models need a large number of accurately annotated training samples.
Sufficient high-quality training samples for large-scale land-cover mapping are difficult to obtain [20]. Insufficient or unreliable training samples may cause serious misclassification [21,22]. Manual labeling is labor-intensive and error-prone. Traditional image classification algorithms use point-based samples to train the classifier, and the training samples can be obtained by automatic sample selection methods according to their radiometric information or other characteristics [23]. In contrast, samples for CNNs must be labeled pixel-wise, which makes them much more difficult to obtain. Existing high-precision land-cover maps are therefore a potentially valuable source of samples. When constructing training sets from existing land-cover maps, the most serious problem is the noisy label (a label that does not represent the real class of the corresponding pixel).
Traditional data-cleaning methods cannot be directly used on pixel-wise training samples [24]. Other methods such as noise transition and relabeling may need human assistance, which makes them tedious and error-prone [25][26][27]. Specially defined loss functions and the corresponding reweighting methods require detailed knowledge of the training set, which implies a huge workload for large-scale land-cover mapping [28,29]. Existing small-scale noisy labels nevertheless contain enough information for large-scale land-cover mapping if they are used properly. Fortunately, time series images contain complementary information because of the differences in imaging conditions between acquisitions. Aiming to fully utilize the information in the original Landsat images to overcome the drawbacks of noisy labels, we proposed a mutual transfer network (MTNet) that exploits the transferability of CNNs. The model trained on one year was considered both a good initialization and a regularization on the target training samples to improve the generality of the network. In this study, PSPNet was employed as the baseline because of its ability to extract local and global information and to handle the changes in object scale in remote sensing images. The main contributions can be summarized as follows.
(1) To the best of our knowledge, this is the first time that a mutual deep-learning model of this type, that is, an MTNet that can be used for national-level land-cover mapping, has been developed.
(2) A novel interactive training strategy was proposed, and this was embedded into our MTNet to produce large-scale land-cover maps with unbalanced training samples and noisy labels.
The remainder of this article is arranged as follows. First, previous studies are introduced in Section 2. The data sources and corresponding training sets are shown in Section 3. Then, the proposed MTNet is described in detail in Section 4. Experiments for two different time periods (2005 and 2010) are carried out to test the effectiveness of the proposed MTNet in Section 5. Then, the experiments are extended to the national level to achieve highly accurate land-cover maps of China in Section 6. Conclusions and ideas for future work are discussed in Section 7.

CNN-Based Land-Cover Mapping
The rise of convolutional neural networks (CNNs) in computer vision has provided a new approach to large-scale land-cover mapping because of their advantages in processing big data. Rezaee et al. [30] used AlexNet [31] to extract areas of wetland from remote sensing images, and it was found that this method outperformed traditional classifiers. To improve the diversity of learned feature maps, Huang combined AlexNet with a light parallel network [32] and achieved a classification accuracy of up to 80%. However, training a network requires a large number of accurate training samples, and these are usually difficult to obtain in the context of remote sensing imagery. In [33], fine-tuning a model pre-trained on ImageNet improved the overall accuracy from 83.1% to 92.4%. Besides pre-trained models, image augmentation has also been demonstrated to make remote sensing image classification more efficient [34].
The CNNs mentioned above all end with fully connected layers, which means that they assume that all pixels in an image patch share the same label and predict the label of the center pixel instead of the whole image patch. This assumption leads to inaccurate classification results near target boundaries. To reduce this effect, superpixel- and object-based networks have been developed [35,36]. The consistent spectral information provided by superpixels is able to improve the recognition ability of CNNs [37]. However, the application of CNNs to large-scale land-cover mapping still faces problems such as weak generalization, rotation variance and the difficulty of collecting pixel-level training samples. To solve these problems, pooling layers were recombined to improve the efficiency of information transmission in the network, and hierarchical sampling strategies were proposed to automatically construct training datasets [38]. Rotation equivariance was encoded in a CNN architecture to maintain rotation invariance [39].
Most traditional classifiers utilize image features with clear physical meanings. These features are easy to understand and can provide stable information for classifiers. In contrast, the feature maps in CNNs are learned from the training samples automatically, which makes them difficult to interpret. To make full use of the difference between these two kinds of features, a decision-based classifier that combines the features from RF, SVM and CNNs was designed [40]. Not only do the features learned by traditional classifiers differ from those learned by CNNs; the feature maps learned by different CNNs also differ [41]. The use of ensemble CNNs is therefore also an efficient method of improving the classification performance [42]. Ref. [43] combined contextual-based CNNs with a pixel-based multilayer perceptron (MLP) to take advantage of both the spatial and spectral feature representations. To further explore the advantages of CNNs and MLP, Ref. [44] proposed a joint learning strategy that learned the joint distribution between CNN and MLP. The proposed algorithm was demonstrated to produce a great increase in the classification accuracy.

Noisy Label Problem
Noisy labels are an inevitable problem in the practice of land-cover mapping. For traditional classifiers, which use point-based training samples, detecting and filtering noisy labels is an effective method [45,46]. However, it does not work for the pixel-wise-labeled samples used by CNNs. Fortunately, noisy labels and correct labels affect the learning process differently [47]. To fully utilize this property, Ref. [46] used a self-organizing map to identify inliers and outliers. Ref. [48] proposed to exploit model consistency across iterations and combined hard mask selection with soft mask reweighting to invalidate noisy labels without discarding possibly clean ones. In fact, outliers also contain useful information, so identifying and discarding them can be harmful, especially when training samples are insufficient. To overcome this drawback, Ref. [49] used the uncertainty information of training samples and proposed an uncertainty-aware co-training method to achieve good generalization performance. Ref. [50] proposed a turning value to effectively learn positive samples over negative ones to increase the learning ability of the network. However, it is difficult to recognize and evaluate correct labels and noisy labels. Ref. [45] randomly selected some seed points as clean labels and propagated the label information from them to the remaining unlabeled samples. Ref. [51] proposed a coarse-to-fine label iteration model to extract a set of high-quality labels from fully aggregated labels by using a sparse filter.
The loss function dominates the learning preference of deep learning networks. Different loss functions focus on describing different aspects of the difference between the predicted result and the corresponding label. Most loss functions are robust to noisy labels to some extent when a large number of training samples is available. However, too many noisy labels will affect the generality of the network [52]. Ref. [53] combined mixup entropy and Kullback-Leibler entropy to define a new loss function by fully utilizing the difference between them. A similar idea was pursued in [54]. Ref. [55] proposed an end-to-end correction with mix-up and balance terms to correct noisy labels to true labels. To fully utilize the similarities of pixels belonging to the same class, Ref. [56] proposed a method dubbed self-reweighting from class centroids. The class centroids can be used to measure the reliability of data labels and thus improve the robustness to noisy labels. Besides the loss function, the structure of the CNN also has a great impact on its learning ability. Ref. [54] proposed a novel dual-channel structure to improve the learning ability, along with a noise-robust loss function constructed from reverse cross entropy and normalized cross entropy.

Data Sources
For large-scale land-cover mapping, data availability and the corresponding observation ability are the main factors affecting the selection of data sources. Although full-waveform LiDAR contains backscattering information about the detected object, it is still difficult to recognize the detected classes from it. In addition, LiDAR has the highest cost among commonly used remote sensing datasets [56,57]. SAR is a cost-effective alternative to LiDAR and is unaffected by clouds. However, the backscattering and interferometric coherence in some areas are seriously degraded, leading to unclear structure representation [58]. Overall, optical images are the most appropriate, even though they are easily affected by clouds. As for the observation ability, low-resolution images cannot reflect detailed information about the detected objects, which affects the ability to distinguish between classes such as vegetation types. Intermediate-resolution Landsat and Sentinel images are the most commonly used resources in large-scale land-cover mapping. Experiments demonstrated that even though Sentinel images have a higher spatial resolution, the overall accuracy of land-cover maps produced from Landsat 8 images is similar to that of maps produced from Sentinel-2 images [59]. Considering the long time series available, Landsat images were employed in this work. The Level 1T Landsat 5 images from the growing seasons of 2005 and 2010 were downloaded from https://earthexplorer.usgs.gov (accessed on 10 December 2017).
Different from traditional classifiers, pixel-wise-labeled image patches are necessary to train a fully convolutional neural network. That means most of the existing training samples are invalid for CNN. Manually labeling samples for CNN is labor-intensive and, more importantly, unreliable. Under this circumstance, existing high-accuracy land cover maps can be an option. The Land Cover Map of the People's Republic of China is a high accuracy time series product, which is widely accepted. It can be downloaded from http://www.geodata.cn (accessed on 10 April 2018). This map contains six first-level classes and 38 second-level classes (details shown in Table 1) that have overall accuracies of 94% and 86%, respectively.

Data Pre-Processing
Pre-processing is inevitable in traditional large-scale land-cover mapping, since it can provide reliable back-scattering properties of the detected classes. However, laborious and time-consuming pre-processing heavily limits practical application in large-scale land-cover mapping. Fortunately, CNNs use a large number of parameters to approximate the transformation from input image to output labels, and they are robust to the changes in spectral features in large-scale remote sensing images. That makes them a natural choice for large-scale land-cover mapping, since laborious pre-processing such as radiometric calibration and atmospheric correction can be avoided. For convenience of automatic processing, an image mosaic using median values was produced to alleviate the impact of clouds, images from the non-vegetation season, etc. As for the reference, the data downloaded from http://www.geodata.cn (accessed on 10 April 2018) were stored by province. They can be directly aligned with the mosaicked Landsat images. However, to fully utilize the label information, the classification system should be modified to match the representation ability of Landsat images. Referencing the existing classification system employed in [60] and other commonly accepted land-cover maps, the 38 second-level classes were merged into 8 first-level classes, as shown in Table 1. The merged land-cover map was then used as the reference.
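The two pre-processing steps described above, median mosaicking and merging second-level classes, can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the processing chain used in the study: the function names, the use of None to mark cloud-masked or missing pixels, and the class codes in SECOND_TO_FIRST are all hypothetical (the actual class mapping is the one given in Table 1).

```python
from statistics import median


def median_composite(observations):
    """Per-pixel median over a stack of co-registered single-band images.

    `observations` is a list of 2-D lists (rows x cols). None marks a pixel
    that is missing or masked (for example, cloud) in that acquisition.
    """
    rows, cols = len(observations[0]), len(observations[0][0])
    out = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            valid = [img[r][c] for img in observations if img[r][c] is not None]
            if valid:  # stays None where no acquisition has a valid value
                out[r][c] = median(valid)
    return out


# Hypothetical second-level -> first-level codes; the real mapping is Table 1.
SECOND_TO_FIRST = {11: 1, 12: 1, 21: 2, 22: 2, 31: 3}


def merge_classes(label_map, lookup=SECOND_TO_FIRST, unknown=0):
    """Relabel a 2-D label map; codes absent from `lookup` become `unknown`."""
    return [[lookup.get(value, unknown) for value in row] for row in label_map]
```

A median is used here because a single cloudy or off-season acquisition shifts it far less than it would shift a mean, which matches the stated purpose of the mosaic.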

Study Area and Training Sets
In the sample selection stage, the sample distribution and the accuracy of the corresponding reference had the most impact on the construction of the training sets. Therefore, the selected samples should be evenly distributed and should avoid locations with obvious classification errors. When training a CNN, the larger a training sample is, the more information and global features it can represent; however, more computation and GPU memory are also required. As a trade-off between the expression of global characteristics and the computational capacity of the GPU, a size of 512 × 512 pixels was chosen. For the experiments in this study, 80% of the selected samples formed the training set, and the other 20% acted as the validation set.
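The patch size and the 80/20 split described above can be illustrated with a short sketch. It assumes a regular, non-overlapping tiling and a random shuffle before splitting; the text does not state how samples were actually assigned to the two sets, so the function names, the seed and the tiling scheme are hypothetical.

```python
import random


def tile_origins(height, width, tile=512):
    """Top-left corners of non-overlapping tile x tile patches that fit fully."""
    return [(row, col)
            for row in range(0, height - tile + 1, tile)
            for col in range(0, width - tile + 1, tile)]


def split_samples(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of `samples` and split it into (training, validation)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

For example, a 1024 × 1536 pixel scene yields six 512 × 512 patches, and patches that would extend past the image edge are simply not generated.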

Training Samples for the Jing-Jin-Ji Region with Different Composition
It is known that the quality and quantity of training samples have a great impact on the accuracy of supervised classifiers [61]. Most training sets for large-scale land-cover mapping contain a significant amount of redundant information as well as inaccurate information. Experiments in [62] demonstrated that the RF classifier can produce stable global land-cover maps even when 60% of the sample points are removed or when 20% of the training set contains errors; the full training set contains 340,000 sample units of various sizes (from 30 m × 30 m to 500 m × 500 m) located at approximately 93,000 sites worldwide. However, the training samples used in this study were pixel-wise image patches, which were cropped from existing classification results and matched with the collected original Landsat images. That means we could not accurately evaluate the accuracy of the training samples without a great effort. Therefore, we designed a small experiment to reflect the accuracy of the training samples. In this experiment, the Jing-Jin-Ji region shown in Figure 1 was selected because of the rich variety of classes it contains. Three training sets were constructed:
TS-1: This set consisted of 1280 training samples covering all the classes present in the study area, namely, forest, grassland, wetland, water body, cultivated land, artificial surface and bare land.
TS-2: As there is serious confusion between the labels of grassland and forest in TS-1, these two classes were merged in this set to decrease the impact of labeling errors.
TS-3: In this set, grassland was oversampled so that the impact of the proportion of inaccurate classes on the final land-cover map could be explored.

Typical Regions
The Chinese provinces of Liaoning, Hubei and Qinghai were selected as study areas. The locations of these areas within China are shown in Figure 2, in which the original Landsat images are shown as pseudo-color composites of the NIR, red and green bands. Liaoning is located in northeast China and is mainly composed of cultivated land, artificial surfaces, forest and grassland. Because of the low temperatures, the vegetation types within the same class are clearly different from those in other regions, which leads to different spectral features for vegetation. Hubei is located in the middle of China. There are thousands of lakes in this province, so it is a perfect study area for testing the classification accuracy on water bodies. Qinghai province is located on the Qinghai-Tibet Plateau, very far from the sea. This province is mainly composed of bare land and grassland; in addition, 15.19% of the wetlands in China are found in this province. As in the experiments on the Jing-Jin-Ji region, the labels shown in Figure 2 were taken from the reference derived from the Land Cover Map of the People's Republic of China. Considering the influence of class proportions in the training set, 15,000 non-overlapping samples of 512 × 512 pixels were collected for 2005 and 2010 from across China, excluding the three provinces selected as study areas. Four thousand samples of 512 × 512 pixels were selected from the study areas and used for validation (as shown in Table 2). There were significant differences in the number of training samples for each class because of the natural imbalance between these classes within China. In addition, the original Landsat images for 2005 included some images acquired in the non-growing season, which increased the uncertainty in the 2005 samples. To express different features as comprehensively as possible, the selected samples contained different spectral features for the same objects in different locations. Additionally, the training set contained objects of different sizes so that the DL-based method could learn both detailed and global information. The six bands other than the thermal infrared band were used to provide sufficient information for the classifier.
To train a network that could be used in national land-cover mapping, 19,200 non-overlapping training samples, each with a size of 512 × 512 pixels, were manually selected for both 2005 and 2010, as shown in Table 2. The two training sets, from the typical regions and from the whole nation, each covered two time periods, namely, 2005 and 2010. Time series training samples were employed to increase the generality of the proposed MTNet in this study, owing to the similarity between their image features and the differences in the corresponding information. Since we wanted to train a network that would be used to produce a national land-cover map of China, the selected training samples were distributed across the whole nation. More training samples were taken from regions where more features were present, such as eastern coastal areas; fewer training samples were acquired from regions containing fewer features, such as the large areas of grassland and desert within China. It is well known that the learning ability of a DL network is better when the training samples are balanced. If a DL-based network is trained with an unbalanced training set, the accuracy of the classes that occupy a small proportion tends to be sacrificed to ensure the overall accuracy. However, the distribution of classes across China as a whole is naturally unbalanced. Considering the influence of the different proportions of each class, the proportion of the less-distributed classes was increased toward that of the majority classes to balance the training set as much as possible. The proportions of different classes for the whole of China and in the training set are shown in Figure 4. Less-distributed classes such as water bodies, artificial surfaces and snow and ice clearly had greater proportions in the training set than their actual proportions within China; for wetland, the opposite was the case. The reason for this is that the areas of wetland were very dispersed, and this made it difficult to collect training samples for this class. Even if a greater effort had been made to achieve a better balance between the class proportions, the variation would still have been large.
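The comparison of class proportions and the idea of raising the share of less-distributed classes can be illustrated with the following sketch. The text does not specify how the rebalancing was carried out, so the capped oversampling factor used here is an assumption, and the class counts in the usage note are invented purely for illustration.

```python
def class_proportions(pixel_counts):
    """Fraction of all labeled pixels that belongs to each class."""
    total = sum(pixel_counts.values())
    return {name: count / total for name, count in pixel_counts.items()}


def oversampling_factors(pixel_counts, cap=10.0):
    """Repetition factor per class that moves it toward the largest class.

    The factor is capped so that a very rare class is not repeated so often
    that the network simply memorizes its few samples.
    """
    largest = max(pixel_counts.values())
    return {name: min(cap, largest / count)
            for name, count in pixel_counts.items()}
```

With hypothetical counts such as {"grassland": 600, "water": 100, "wetland": 20}, grassland keeps a factor of 1, water is repeated 6 times, and wetland reaches the cap of 10 rather than the 30 that full balancing would require, which mirrors the observation that a dispersed class such as wetland remains under-represented.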

Methodology
Traditional CNNs learn essential features from a training set, for which sufficient high-quality training samples are necessary. However, sufficient high-quality training samples are difficult to obtain, especially for large-scale study areas. Fortunately, small-scale noisy labels can be derived from existing land-cover maps. Although a CNN is robust to noise to some extent, its learning ability is highly dependent on the degree of noise corruption. Serious noise corruption will lead the network to over-fit, which will seriously decrease the quality of the produced land-cover maps. As a widely used source for land-cover mapping, Landsat images are superior to other remote sensing images in terms of long-term and stable observation ability. The similarities and differences contained in time series Landsat images gave us a chance to overcome the drawback of inaccurate training samples by providing various features for the same class and thus improving the generality of the network. To fully utilize the consistency and complementarity of time series images, a mutual transfer network called MTNet was proposed in this study. First, the DL-based network was trained on the time series training set introduced in Section 3. As the training samples in the time series are similar, the purpose of this step in the training process was to learn the essential features of the training samples: the more details that can be learned, the better. Then, the trained network was used as an initialization on the current training set. Benefiting from the consistency of Landsat images, the fine-tuning process on current data can not only make full use of the learned essential information but can also adapt to the distribution of the current classes. In addition, the initially trained MTNet can be considered as a regularization on the time series training set to resist the impact of noisy labels and thus improve the generality of the network.
CNN stacks convolutional layers to extract information from input images. The convolutional layer can be expressed as
$$X^{(t+1)} = f\left(w^{(t)} * X^{(t)} + b^{(t)}\right) \qquad (1)$$
where $t$ represents the iteration index; $X^{(t)}$ is the input image; $w^{(t)}$ is the parameter of the convolutional kernel, which slides over $X^{(t)}$; $b^{(t)}$ is the bias of the learned features and $f(\cdot)$ is an activation function that can convert a linear combination of information into a non-linear one. By stacking different convolutional kernels, the same classes present similar features, while different classes present different features. The learned features were mainly dominated by the training process. Taking the Adam optimizer as an example, the gradient, the first-order momentum and the second-order momentum can be calculated as
$$g^{(t)} = \nabla_{w} L\left(w^{(t)}\right), \quad m^{(t)} = \beta_1 m^{(t-1)} + (1-\beta_1)\, g^{(t)}, \quad V^{(t)} = \beta_2 V^{(t-1)} + (1-\beta_2) \left(g^{(t)}\right)^2 \qquad (2)$$
Here, $L$ is the loss function, and $\beta_1$ and $\beta_2$ are coefficients of the first- and second-order momentum, respectively. The step is then given by
$$\eta^{(t)} = \alpha \, \frac{m^{(t)}}{\sqrt{V^{(t)}}} \qquad (3)$$
where $\alpha$ is the learning rate. Finally, the updated parameter $w^{(t+1)}$ is given by
$$w^{(t+1)} = w^{(t)} - \eta^{(t)} \qquad (4)$$
When the training set is imbalanced or contains noisy labels, the calculation of Equations (2) and (3) will have large deviations, thus leading to over-fitting. On the contrary, time series training samples can provide additive information and generalize the distribution of the classes. Due to the increase in correctly labeled samples in the generalized distribution, it is more robust to imbalanced and noisy labels. To take advantage of this property, an MTNet was proposed. Firstly, the proposed MTNet trained the network with one of the time series training sets to approach the optimum parameters. Then, the trained model was used as an initialization to fine-tune the model on the current data. As an initialization, it can provide a near-optimum combination of parameters. On this basis, the fine-tuned network can not only converge faster but also better adapt to the distribution of the target domain.
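As a worked illustration of the Adam update described above, the following sketch applies one step to a single scalar parameter. It follows the standard Adam formulation, including the bias-correction terms and the small constant eps, neither of which is spelled out in the text; the default values of beta1, beta2 and eps are the commonly used ones and are assumptions here, while the learning rate of 10^-4 matches the value reported for the experiments.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step index."""
    m = beta1 * m + (1 - beta1) * grad           # first-order momentum
    v = beta2 * v + (1 - beta2) * grad * grad    # second-order momentum
    m_hat = m / (1 - beta1 ** t)                 # bias correction (assumed)
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (math.sqrt(v_hat) + eps)
    return w - step, m, v


# First step from w = 1.0 with gradient 2.0 and zero-initialized momenta.
w_new, m_new, v_new = adam_step(w=1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

Because the step is the ratio of the first-order momentum to the square root of the second-order momentum, its magnitude on the first iteration is close to the learning rate regardless of the gradient scale. A gradient computed from a mislabeled batch therefore moves the parameters about as far as a gradient from a clean batch, which is one way to read the remark that noisy labels cause large deviations in the update.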
In this way, Equation (4) was obtained from two training sets: the original parameter $w^{(t)}$ comes from the time series training set that was used to train the MTNet, and the step $\eta^{(t)}$ was obtained from the current training set to further approach the optimum. The disagreements between the time series training sets are the main source of information in the mutual learning. The learned parameters can be written in the form
$$w = w_{first\_set} + \Delta w_{second\_set} \qquad (5)$$
where $w_{first\_set}$ represents the parameters trained on the time series training set, which dominates the learning ability of the MTNet, and $\Delta w_{second\_set}$ represents the parameter update obtained by fine-tuning on the current training set. As a regularization, the pre-trained model can provide diversified information to enlarge the parameter space. In contrast to the training process, which narrows the search scope of the parameter space, the pre-trained model can prevent the network from learning features that are too complex. In other words, by utilizing the consistency and complementarity between time series Landsat images, the MTNet is able to prevent the network from over-fitting in the presence of noisy labels. Taking PSPNet as an example, the flowchart of the proposed MTNet is shown in Figure 5. The efficiency of the MTNet is improved by ensuring that the network learns essential and general features from time series training sets. By fully utilizing the similar and dissimilar features of the time series training sets, the MTNet can learn general information and is also robust to imbalanced and noisy labels.
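The two-stage procedure described above (train on one year, then fine-tune on the other starting from the learned parameters) can be sketched with a deliberately tiny stand-in model. The example below is not the MTNet implementation: it replaces PSPNet with a one-dimensional logistic model and Adam with plain gradient descent so that it runs with the standard library alone, and the data points and function names are hypothetical. The epoch counts mirror those reported for the experiments (300 for training, 40 for fine-tuning).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(data, w=0.0, b=0.0, epochs=300, lr=0.1):
    """Fit a 1-D logistic model to (x, y) pairs, starting from (w, b)."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


def transfer(first_set, second_set, pre_epochs=300, tune_epochs=40):
    """Stage 1: train on one year. Stage 2: fine-tune on the other year."""
    w_first, b_first = train(first_set, epochs=pre_epochs)
    w, b = train(second_set, w_first, b_first, epochs=tune_epochs)
    # The final parameters equal the first-stage ones plus the fine-tuning update.
    return (w_first, b_first), (w, b), (w - w_first, b - b_first)


# Hypothetical "2005" and "2010" samples; each set contains one noisy label.
data_2005 = [(-2, 0), (-1, 0), (1, 1), (2, 1), (0.5, 0)]
data_2010 = [(-2, 0), (-1, 0), (1, 1), (2, 1), (-0.5, 1)]
first, final, delta = transfer(data_2005, data_2010)
```

Calling transfer(data_2010, data_2005) gives the transfer in the opposite direction, which corresponds to the mutual aspect of the strategy. The short fine-tuning stage keeps the final parameters close to the first-stage solution, which is the sense in which the pre-trained model acts as a regularizer.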

Experiments and Analysis on Typical Regions
The network was implemented in PyTorch and trained on three Titan XP GPUs, each with 12 GB of memory. To improve the efficiency and decrease the number of training samples needed, a model that had been pre-trained on ImageNet was employed to initialize the network parameters. An Adam optimizer was used to calculate the gradient of the network and perform the backward propagation. The learning rate was set to $10^{-4}$, and the weight decay was $10^{-4}$. The batch size was 18, with a momentum of 0.1. The network was trained for 300 epochs and fine-tuned for 40 epochs so that it could fully utilize the information contained in the time series images.
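For reference, the reported settings can be collected in a small configuration object. This is only a sketch: the class and field names are illustrative, and reading the 300 and 40 epochs as a pre-training stage followed by a fine-tuning stage is an assumption based on the training strategy described earlier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the text; field names are illustrative."""
    optimizer: str = "adam"
    learning_rate: float = 1e-4
    weight_decay: float = 1e-4
    batch_size: int = 18
    momentum: float = 0.1        # reported value; the text does not say which term it controls
    pretrain_epochs: int = 300   # assumed: training on a time series training set
    finetune_epochs: int = 40    # assumed: fine-tuning on the current training set

    def total_epochs(self) -> int:
        """Total number of epochs across both stages."""
        return self.pretrain_epochs + self.finetune_epochs

config = TrainConfig()
```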

Experiments on Different Compositions of Training Samples
PSPNet was applied to the three training sets introduced in Section 3.3.1. The corresponding precisions are listed in Table 3. Comparing the classification results of TS-2 with those of TS-1, it is clear that combining the labels of forest and grassland significantly improves both the precision (from 75.25% to 83.16%) and the F1_score (from 71.83% to 78.32%). Of course, merging classes simplifies the learning problem, but the improvement in the accuracy of the training set also contributes to the improvement in the precision and the F1_score. Labeling errors in the training set not only affect the recognition of the corresponding class but also degrade the learning ability of the network. In practice, the quality of the training set cannot be easily improved in real-world tasks, so it is difficult to quantitatively describe the connection between the degree of label corruption and the learning ability of the network.
As for the classification result of TS-3, increasing the proportion of grassland also increases the proportion of forest, since the two usually appear adjacent to each other. With more information about grassland, the classification precision rose by about 18.56%, and the F1_score increased by about 7.11%. Similar to the result of TS-2, the accuracies of all the other classes except cultivated land increased in TS-3. This means that introducing more information can significantly increase the learning ability of the CNN, even though the introduced information is uncertain. This gave us a clue for jointly learning from time series remote sensing images by using their similar but distinct information.
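The effect of merging two easily confused classes on the reported metrics can be sketched as follows. The labels are hypothetical, the helper names are illustrative, and since the text does not state how the precision and F1_score in Table 3 are aggregated, the standard per-class definitions are assumed.

```python
def merge_labels(labels, mapping):
    """Relabel a flat list of class names, e.g. to combine two classes into one."""
    return [mapping.get(c, c) for c in labels]

def precision_recall_f1(reference, predicted, cls):
    """Precision, recall and F1 for one class from paired reference/predicted labels."""
    tp = sum(1 for r, p in zip(reference, predicted) if r == cls and p == cls)
    fp = sum(1 for r, p in zip(reference, predicted) if r != cls and p == cls)
    fn = sum(1 for r, p in zip(reference, predicted) if r == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical pixel labels in which forest and grassland are confused.
reference = ["forest", "forest", "grassland", "grassland", "water", "water"]
predicted = ["forest", "grassland", "forest", "grassland", "water", "water"]

before = precision_recall_f1(reference, predicted, "forest")

# Merge the two confused classes into one, then score the merged class.
mapping = {"forest": "vegetation", "grassland": "vegetation"}
after = precision_recall_f1(merge_labels(reference, mapping),
                            merge_labels(predicted, mapping), "vegetation")
```

Errors between the two merged classes disappear by construction, which is why part of the gain from merging reflects a simpler task rather than a better model.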

Qualitative Results for Typical Regions
To test the efficiency of the proposed MTNet, we compared it with the most widely used methods in large-scale remote sensing image classification (to make a fair comparison, all the networks employed in this study took PSPNet as a backbone), including: (1) the traditional random forest (RF) classifier, which was demonstrated to be the most effective method [63]; (2) PSPNet-RL, which employs the robust loss proposed in [54] to alleviate the impact of noisy labels; (3) PSPNet trained on a training set collected outside the current data (called PSPNet-TF in this study); (4) the traditional PSPNet. The above-mentioned algorithms and the proposed MTNet were applied to the training set introduced in Section 3.3.2.
Detailed results for the performance of the models for Liaoning, Hubei and Qinghai are shown in Figures 6-8. RF uses pixel-wise training samples to train the classifier. The lack of contextual relationships makes it sensitive to noise. Accordingly, the corresponding classification results were not satisfactory. In fact, the class noise contained in the training samples also had a great impact on RF. As demonstrated in the last subsection, the training samples were derived from existing land cover maps and contained some class noise. Since the RF classifier does not need a pixel-wise-labeled image patch, the center pixel of a training sample was selected as a training point to construct a training set for the RF classifier. We also tried selecting 1-30 pixels from each image as training samples, and even using all the training samples, either randomly or following certain rules. However, there was no significant difference in the training or testing precision, owing to the limited capacity of the RF classifier. Considering the trade-off between the precision and the training efficiency, we randomly selected five pixels from each image and constructed a training set with 67,301 sample points by removing pixels that were labeled as others.
It is clear that the PSPNet was sensitive to the training samples that were used. Its recognition ability was significantly reduced when the original Landsat image contained features that were obviously inconsistent with the training samples. Benefiting from the robust loss in PSPNet-RL, the resistance to noise was slightly improved. However, this also decreased the recognition ability of PSPNet-RL on objects that contained complex features, as in Figure 6(b4,d4), Figure 7(b4,c4) and Figure 8(a4). PSPNet obtained better classification results than PSPNet-TF, which was trained with samples from outside the current time period. Overall, the models trained using the 2010 training set had a stronger generalization ability than the models trained using the 2005 training set. This was because the 2010 training set was of better quality. Using the consistent information provided by the 2010 training set, PSPNet was able to learn the image features more efficiently. The MTNet was fine-tuned using the current training set, while its parameters were initially determined by the time series training sets. This means that the MTNet was able to learn more general features from the current training set while maintaining the stable feature-learning ability acquired from the time series training set. In this way, the MTNet can overcome the drawback of having imbalanced training samples and learn essential features from noisy labels. As shown in Figures 6-8, the classification results of the MTNet were the best among all the models and sometimes visually outperformed even the reference.
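The construction of the point-based RF training set described above can be sketched as follows. This is an illustrative reading, not the authors' procedure: the function name, the data layout (lists of label rows) and the order of operations (sample first, then drop pixels labeled as others) are assumptions.

```python
import random

def sample_points(patches, per_image=5, ignore_label="others", seed=0):
    """Randomly pick `per_image` pixels from each labeled patch, dropping ignored labels.

    `patches` is a list of 2-D label grids (lists of rows). Returns a list of
    (patch_index, row, col, label) tuples usable as pixel-wise training points.
    """
    rng = random.Random(seed)
    points = []
    for idx, patch in enumerate(patches):
        coords = [(r, c) for r in range(len(patch)) for c in range(len(patch[0]))]
        for r, c in rng.sample(coords, min(per_image, len(coords))):
            label = patch[r][c]
            if label != ignore_label:      # discard pixels labeled as "others"
                points.append((idx, r, c, label))
    return points

# Two hypothetical 3 x 3 label patches.
patches = [
    [["forest", "forest", "others"], ["forest", "water", "water"], ["others", "water", "water"]],
    [["grassland"] * 3, ["grassland"] * 3, ["others"] * 3],
]
points = sample_points(patches)
```

Because ignored pixels are removed after sampling, the number of points per image can be smaller than `per_image`, which is consistent with a final training set size that is not an exact multiple of five.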

Quantitative Results for Typical Regions
Using the training set introduced in Section 3.3.2, the precision and F1_score for the study areas are listed in Table 4. It can be seen that: (1) Among all the methods, RF obtained the lowest accuracy. Because its training samples are treated independently, RF cannot effectively learn the relationship between pixels and is thus sensitive to noise. Accordingly, both the precision and the F1_score were significantly lower than those of the DL-based methods. (2) Generally, PSPNet-RL obtained better accuracies than PSPNet, owing to the introduced robust loss function. This means that the robust loss proposed in [54] does improve the robustness to noisy labels. However, as shown in Figures 6-8, this improvement in the classification accuracy comes at the cost of a reduced ability to recognize complex features. Unfortunately, for most applications, object recognition ability is much more important than accuracy. (3) PSPNet outperformed PSPNet-TF, while all the classification accuracies of the MTNet were higher than those of the other methods. The MTNet performed well even when the training samples were not balanced and noisy labels were included. The use of interactive training samples from time-series data significantly improved the performance of the networks. This demonstrates the efficiency of the mutual training strategy used in the MTNet. (4) The overall accuracies were affected by the quality of the original images and labels, and so they varied with the study areas. This was a result of the different proportions of each class that were present in the training samples. For example, the main classes made up a large proportion of the training samples for Liaoning and Hubei. As a result, the overall accuracies for these areas were higher than for Qinghai, for which the samples corresponding to the main classes were of poor quality or contained noisy labels.

Extension to Land Cover Maps of China
Most of the existing land cover maps were produced by point-based classifiers, such as MLC, DT, RF and SVM. Point-based training sets are easy to access, and their quality is highly guaranteed. This greatly shortens the training time of the model and ensures the learning efficiency of the classifiers. However, restricted by the limited information contained in small training sets and by the underlying mechanism of the classifiers, traditional point-based classifiers are sensitive to the variation of features representing the same class. This leads to poor performance on large-scale study areas, which is why the production of large-scale land cover products needs complex pre- and post-processing and a large amount of human interaction. It also means that the RF classifier used in the last section cannot obtain an accuracy comparable with that of existing land cover maps. With the rapid development of remote sensing techniques, this kind of classifier cannot keep up with the rapidly growing application demand. DL-based methods use a large number of parameters to learn from the massive information contained in pixel-wise-labeled training samples, even when these samples suffer from imbalanced and noisy labels. This increases the suitability of DL-based methods for large-scale study areas. Nevertheless, the learning ability of a DL-based method is highly dependent on the quality of the training set. Benefiting from the transferability of DL-based methods, the proposed MTNet mutually transfers pre-trained models to other time periods and jointly learns general information from time series training samples. This increases its robustness to imbalanced and noisy labels. To further test the proposed MTNet and also to provide a solution for large-scale land cover mapping, the MTNet was used to produce a new national land cover map of China for 2010. The produced land cover map was qualitatively compared with existing products, as shown in Figure 9.
The original Landsat images for 2010 were only stitched, without any pre-processing. Therefore, the stitching lines are obvious, as shown in Figure 9a. The original Landsat images were exactly the ones used for producing all the land cover maps except Figure 9b. The Land Cover Map of People's Republic of China (LCMPRC) is the most accurate land cover product that we know of. However, there are some obvious errors in northeast China, where cultivated land was classified as wetland, as shown in Figure 9b. The GlobalLandCover 30 (GLC30) product contains 10 first-level classes, which were merged into the proposed 8 first-level classes listed in Table 1. In Figure 9c, most of Qinghai-Tibet was classified as grassland, whereas it should be bare land according to Figure 9a. From a national perspective, the classification results of PSPNet and MTNet were similar.
To further explore the differences among all the land cover maps, detailed classification results are shown in Figure 10. Figure 10(a1) is an area stitched from multiple images, where the lower-right part comes from non-growing seasons. PSPNet classifies this area as grassland, while the proposed MTNet correctly recognizes it as forest. Figure 10(a2) shows the details of the misclassification of cultivated land as wetland in northeast China in LCMPRC. GLC30 cannot obtain a satisfactory result in this area, as shown in Figure 10(a3). Figure 10(b1) is mostly covered by forest (darker red part) and cultivated land (lighter red part). LCMPRC can distinguish the different classes, but the result is fragmented. GLC30 classifies some forest as cultivated land, while PSPNet classifies some forest as grassland. The classification results of MTNet are promising compared with the other land cover maps. Figure 10(c1) is an atypical landform in China, and it can only be found in the middle east area. LCMPRC misclassified some of the cultivated land as an artificial surface, and even typical features of artificial surface exist in this area (as shown in Figure 10(a2)). GLC30 cannot recognize the water body with such salient features and classifies it as cultivated land. PSPNet is able to recognize the water bodies but misses the features of artificial surfaces. Benefiting from time series training samples, MTNet is more robust to changes of features within the same class and obtains satisfactory classification results (as shown in Figure 10(c5)). Owing to the natural distribution of the different classes across China, wetland occupies the smallest proportion. Accordingly, the training samples for wetland are less sufficient than those for other classes. This has a great impact on both traditional point-based classifiers and DL-based methods (as shown in Figure 10(d2,d4)).
Besides, the similarity between features of different classes also has a great impact on the recognition ability of traditional classifiers, as shown in Figure 10(d3). The MTNet jointly learns from time series training samples and is able to learn more information and obtain better classification results.
From a quantitative evaluation point of view, traditional point-based evaluation may over-estimate the accuracies, since sample points have a higher probability of lying in the middle of an object, while misclassification is more likely to appear near the boundary. To roughly estimate the classification results of PSPNet and MTNet, we randomly selected 15 images with 10,240 × 10,240 pixels as an estimation area and then calculated the precision and F1_score by considering GLC30 and LCMPRC as the ground truth, respectively. The corresponding results are listed in Table 5. It was clear that: (1) the evaluation results based on LCMPRC were much higher than those based on GLC30. This was caused by the accuracy of the reference product that was considered as the ground truth. The overall accuracy of GLC30 was about 80%, while it was between 86% and 94% for LCMPRC. This means that the evaluation results based on LCMPRC were more reliable than those based on GLC30.
(2) Under this evaluation criterion, DL-based methods (PSPNet and MTNet) outperformed traditional ones (GLC30, which is a combination of MLC, DT, SVM and human labor, and LCMPRC, which is produced by many cooperating institutes). Benefiting from the large number of parameters, DL-based methods can learn more from the training samples and outperform traditional classifiers. (3) MTNet outperformed PSPNet to some extent. As shown in the detailed results of GLC30 and LCMPRC in Figure 10, these two products may be superior for some classes in some areas, but they also contain obviously misclassified areas. This may have an impact on the evaluation results. Jointly learning from time series training samples allows MTNet to learn more essential information than PSPNet. Overall, MTNet provides a new idea for large-scale land cover mapping, and its classification result is promising compared with existing ones.

Conclusions
A mutual transfer network (MTNet) for large-scale time-series land-cover mapping was proposed. After comparing the performances of PSPNet, PSPNet-RL, PSPNet-TF and MTNet, we proposed an idea for large-scale land cover mapping. It was demonstrated by experiment that, as also stated in [20], the quality of the training samples had a significant impact on the classification results. Based on the transferability of CNNs, the proposed MTNet can take advantage of the time series of training samples and obtain better classification results than a traditional training strategy based on an imbalanced training set with noisy labels. This study provides a solution for practical large-scale land cover mapping, but it did not take the consistency between time series land cover maps into consideration. Large-scale land cover maps produced by MTNet can be post-processed according to [64,65].
DL-based methods provide a new opportunity for producing large-scale land cover maps. In future work, we will focus on introducing training samples with different imaging times to improve the recognition ability of DL-based methods, especially for vegetation. We will also try to introduce domain adaptation and change detection methods into MTNet to improve its generality for large-scale land cover mapping.