Fine Land-Cover Mapping in China Using Landsat Datacube and an Operational SPECLib-Based Approach

Fine resolution land cover information is a vital foundation of Earth science. In this paper, a novel SPECLib-based operational method is presented for the classification of multi-temporal Landsat imagery using reflectance spectra from the spatial-temporal spectral library (SPECLib) for 30 m land-cover mapping for the whole of China. Firstly, using the European Space Agency (ESA) Climate Change Initiative Global Land Cover (CCI_LC) product and the MODIS Version 6 Nadir bidirectional reflectance distribution function adjusted reflectance (NBAR) product (MCD43A4), a global SPECLib with a spatial resolution of 158.85 km (equivalent to 1.43° at the equator) and a temporal resolution of eight days was developed in the sinusoidal projection. Then, the Landsat datacube covering the whole of China was developed using all available observations of Landsat OLI imagery in 2015. Thirdly, the multi-temporal random forest method based on SPECLib was presented to produce an annual land-cover map with 22 land-cover types using the Landsat datacube. Finally, the annual China land-cover map was validated by two different validation systems using approximately 11,000 interpretation points. The mapping results achieved the overall accuracy of 71.3% and 80.7% and the kappa coefficient of 0.664 and 0.757 for the level-2 validation system (19 land-cover types) and the level-1 validation system (nine land-cover types), respectively. Therefore, the case study in China indicates that the proposed SPECLib method is an operational and accurate method for regional/global fine land-cover mapping at a spatial resolution of 30 m.

resolution of eight days was first developed using the MCD43A4 NBAR (Nadir bidirectional reflectance distribution function adjusted reflectance) product and the CCI_LC (European Space Agency Climate Change Initiative Global Land Cover) land-cover product for 2015. Then, using the reflectance spectra in SPECLib, a multi-temporal classification method based on the stacked random forest was proposed to produce an annual land-cover map with 22 land-cover types. Thirdly, the Landsat datacube covering the whole of China was developed using all available observations of Landsat OLI imagery in 2015. Finally, the SPECLib-based multi-temporal land-cover mapping method was tested using the above Landsat datacube to produce China's 2015 annual land-cover map. The validation results indicated that the proposed method was effective and accurate for large-area land-cover mapping.

Landsat Imagery and Datacube
More than 7000 Landsat OLI images covering China and acquired in 2015 were downloaded from the USGS (United States Geological Survey) and RADI (Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences). These images had a spatial resolution of 30 m and a cloud coverage of less than 60%. In this study, all images were radiometrically corrected to surface reflectance (SR) using the C-correction topographical correction method [33,44,45] and the Landsat Surface Reflectance Code (LaSRC) atmospheric correction method [46][47][48].

Reprojection and Tiling
Due to the poleward convergence of Landsat orbits, the across-track scene overlap for adjacent Landsat eight orbits increases significantly from a minimum of 0.105 • at the equator to 12.5 • at a latitude of 80 • [49]. Therefore, using a regular grid of points in the universal transverse mercator (UTM) or geographic projection might be inappropriate for large-area applications because the land surface recorded by the sensor is sampled with the different spatial grid density at different latitudes [50]. In order to increase the chance of having more frequent surface observations, the global sinusoidal equal area projection grid (GSinGrid) defined every 5.560 km (equivalent to 0.05 • at the equator) was used so that the grid sampling density did not vary by geographic region [49].
As there are about 305,000 GSinGrid tiles covering the whole of China (Figure 1), monitoring and analyzing all of the reprojected grid data separately would have been inefficient and complicated. In order to facilitate subsequent handling and application, each GSinGrid tile was dealt with as the minimum unit of the three-level land-tile. Specifically, the level-1 land-tile was defined as being the 10 • × 10 • MODIS land tile and the level-2 land-tile was defined as the global Web-enabled Landsat Data (GWELD) land tile because the size of GWELD tile (158.85km) is almost the same as the 185-km swaths in the Landsat scene [30]. There were 7 × 7 GWELD tiles within each MODIS land tile ( Figure 1). The final level, the level-3 land-tile, was the GSinGrid tile, each tile being composed of 186×186 30-m pixels. Each GWELD tile contained 29 × 29 GSinGrid tiles ( Figure 1). In order to facilitate the application and monitoring of the data like GWELD tile in Zhang and Roy [30], each GSinGrid tile locations were recorded in the filenames, these were given as 'hh(x0)vv(y0)h(x1)v(y1)_p(x2)r(y2)', where x0 and y0 are the horizontal and vertical coordinates for MODIS land tile, x1 and y1, ranged from 0 to 6, are the horizontal and vertical coordinates of GWELD tile, and x2 and y2, ranged from 0 to 28, are the horizontal and vertical coordinates of GSinGrid tile. Moreover, it should be noted that the coordinate origin of the lower level tile, GWELD, and GSinGrid tile, corresponds to the left-upper corner of the upper-level tile, MODIS and GWELD tile, respectively.
To store and retrieve the GSinGrid tile more efficiently and flexibly, a simplified database which was similar to the Australian Geoscience Data Cube (AGDM) [51], called Landsat datacube, was designed to manage these massive GSinGrid tile data. In this study, all 7,059 Landsat SR images were processed into the GSinGrid tiles using the aforementioned method and then stored in the datacube. Figure 2 displays the locations and frequency distributions of the Landsat WRS-2 scenes and corresponding GSinGrid tiles in the datacube. The comparison illustrates that the tiling process significantly increased the data availability especially in the overlap regions between two orbits. To store and retrieve the GSinGrid tile more efficiently and flexibly, a simplified database which was similar to the Australian Geoscience Data Cube (AGDM) [51], called Landsat datacube, was designed to manage these massive GSinGrid tile data. In this study, all 7,059 Landsat SR images were processed into the GSinGrid tiles using the aforementioned method and then stored in the datacube. Figure 2 displays the locations and frequency distributions of the Landsat WRS-2 scenes and corresponding GSinGrid tiles in the datacube. The comparison illustrates that the tiling process significantly increased the data availability especially in the overlap regions between two orbits.

Cloud and Shadow Detection and Filling
As clouds and cloud shadows obscure the spectral characteristics of ground surfaces and significantly influence the availability of useful remote sensing data, identification and filling of cloud-and shadow-contaminated pixels was necessary. First, all cloud and shadow pixels were identified using the Fmask algorithm, which has previously been demonstrated to have higher accuracy and more stability than traditional single-date detection methods [52,53]. Next, a modified neighborhood similar pixel interpolator approach [54] was used to restore the cloud-and shadowcontaminated pixels in the GSinGrid tile data. As the accuracy of the restored images slightly decreased as the amount of cloud increased [54], to ensure the quality of the restored images, the image was discarded if the percentage of cloud and shadow pixels exceeded 30%.  To store and retrieve the GSinGrid tile more efficiently and flexibly, a simplified database which was similar to the Australian Geoscience Data Cube (AGDM) [51], called Landsat datacube, was designed to manage these massive GSinGrid tile data. In this study, all 7,059 Landsat SR images were processed into the GSinGrid tiles using the aforementioned method and then stored in the datacube. Figure 2 displays the locations and frequency distributions of the Landsat WRS-2 scenes and corresponding GSinGrid tiles in the datacube. The comparison illustrates that the tiling process significantly increased the data availability especially in the overlap regions between two orbits.

Cloud and Shadow Detection and Filling
As clouds and cloud shadows obscure the spectral characteristics of ground surfaces and significantly influence the availability of useful remote sensing data, identification and filling of cloud-and shadow-contaminated pixels was necessary. First, all cloud and shadow pixels were identified using the Fmask algorithm, which has previously been demonstrated to have higher accuracy and more stability than traditional single-date detection methods [52,53]. Next, a modified neighborhood similar pixel interpolator approach [54] was used to restore the cloud-and shadowcontaminated pixels in the GSinGrid tile data. As the accuracy of the restored images slightly decreased as the amount of cloud increased [54], to ensure the quality of the restored images, the image was discarded if the percentage of cloud and shadow pixels exceeded 30%.

Cloud and Shadow Detection and Filling
As clouds and cloud shadows obscure the spectral characteristics of ground surfaces and significantly influence the availability of useful remote sensing data, identification and filling of cloudand shadow-contaminated pixels was necessary. First, all cloud and shadow pixels were identified using the Fmask algorithm, which has previously been demonstrated to have higher accuracy and more stability than traditional single-date detection methods [52,53]. Next, a modified neighborhood similar pixel interpolator approach [54] was used to restore the cloud-and shadow-contaminated pixels in the GSinGrid tile data. As the accuracy of the restored images slightly decreased as the amount of cloud increased [54], to ensure the quality of the restored images, the image was discarded if the percentage of cloud and shadow pixels exceeded 30%.

Validation Dataset
The validation dataset used in this study was collected by visual interpretation of high-resolution Google Earth images. Due to the easy access to high-resolution images, Google Earth has great advantages in the collection of ground truth samples [6,33]. In addition, several auxiliary datasets, including the existing land cover maps (GlobeLand30 [4], FROM_GLC from 2015 [12], ESA CCI_LC [55] and MCD12Q1 [56]), as well as topographical data (digital elevation model and slope data) and the normalized difference vegetation index (NDVI) seasonality product [55], were collected to assist the interpretation of each validation sample.
The validation dataset contained approximately 11,000 points and included nine level-1 land-cover types (cropland, forest, shrubland, grassland, wetland, water body, impervious, bare area, and snow/ice) and 19 level-2 land-cover types (Table 1), for the nominal year 2015. In this study, as some forest validation samples-closed broadleaf/needle-leaved forest and open broadleaf/needle-leaved forest-were difficult to distinguish in the visual interpretation process, the classification system containing 22 land-cover types was simplified to the two-level validation systems (Table 1). Figure 3 illustrates the spatial distribution of all validation samples. As the locations of all samples were randomly generated, the proportions of bareland and grassland samples are larger than those of the other land-cover types. Furthermore, to minimize the subjective influence of interpretation by experts, the validation samples were independently interpreted by four different scientists.

The Spatial-Temporal Spectral Library
As explained in our previous work [33], the spatial-temporal spectral library was developed to store the reflectance spectra of different land-covers types within each geographic grid with a temporal resolution of eight days for global land areas. It makes full use of the spectral consistency between Landsat and MODIS data [57] and also the high classification accuracy and detailed classification scheme of GlobCover2009 [58]. However, in contrast to the previous version of SPECLib developed using the MOD09A1 reflectance product and the GlobCover2009 land-cover product, the current SPECLib has been updated using the MCD43A4 NBAR product and the ESA CCI_LC landcover product. The use of these updated products (MCD43A4 and CCI_LC) leads to the following changes and advantages for the current SPECLib version.
First, as the Landsat images processed into the GSinGrid tiles shared the same sinusoidal projection as MCD43A4, and the size of level-2 (GWELD) land tiles (158.85km) was similar to that of the 185 km Landsat swaths, the SPECLib cells were defined to have a size of 158.85 km × 158.85 km (equivalent to 1.43° at the equator) on a sinusoidal projection. If the cells had been larger (as for the level-1 10° × 10° MODIS land product tiles), the spectral signature extension performance would have been significantly poorer [32]. Smaller cells (as for the level-3 0.05° × 0.05° GSinGrid tiles) would have meant that most SPECLib cells had no spectral reflectance because the spatial extent of GSinGrid covered approximately 11 × 11 500-m MCD43A4 pixels when collecting spectral uniform points. To facilitate the spatial matching between the GSinGrid SR with that in the SPECLib, the name of each SPECLib cell also included the tile information as 'hh(x0)vv(y0)h(x1)v(y1)'. Further, the MCD43A4 has an overlapping, eight-day processing cycle [59], therefore, the SPECLib was developed to have a temporal resolution of eight days. This resulted in there being 46 epochs per year for each 158.85 km × 158.85 km cell and, for each GSinGrid tile, a spectral library that matched temporally and spatially and that had a phenological difference of less than five days could be found.
Secondly, at the stage of where spectrally uniform points within each 158.85 km × 158.85 km cell were collected, the uniform points were still determined based on the variance of a 3 × 3 local window

The Spatial-Temporal Spectral Library
As explained in our previous work [33], the spatial-temporal spectral library was developed to store the reflectance spectra of different land-covers types within each geographic grid with a temporal resolution of eight days for global land areas. It makes full use of the spectral consistency between Landsat and MODIS data [57] and also the high classification accuracy and detailed classification scheme of GlobCover2009 [58]. However, in contrast to the previous version of SPECLib developed using the MOD09A1 reflectance product and the GlobCover2009 land-cover product, the current SPECLib has been updated using the MCD43A4 NBAR product and the ESA CCI_LC land-cover product. The use of these updated products (MCD43A4 and CCI_LC) leads to the following changes and advantages for the current SPECLib version.
First, as the Landsat images processed into the GSinGrid tiles shared the same sinusoidal projection as MCD43A4, and the size of level-2 (GWELD) land tiles (158.85km) was similar to that of the 185 km Landsat swaths, the SPECLib cells were defined to have a size of 158.85 km × 158.85 km (equivalent to 1.43 • at the equator) on a sinusoidal projection. If the cells had been larger (as for the level-1 10 • × 10 • MODIS land product tiles), the spectral signature extension performance would have been significantly poorer [32]. Smaller cells (as for the level-3 0.05 • × 0.05 • GSinGrid tiles) would have meant that most SPECLib cells had no spectral reflectance because the spatial extent of GSinGrid covered approximately 11 × 11 500-m MCD43A4 pixels when collecting spectral uniform points. To facilitate the spatial matching between the GSinGrid SR with that in the SPECLib, the name of each SPECLib cell also included the tile information as 'hh(x0)vv(y0)h(x1)v(y1)'. Further, the MCD43A4 has an overlapping, eight-day processing cycle [59], therefore, the SPECLib was developed to have a temporal resolution of eight days. This resulted in there being 46 epochs per year for each 158.85 km × 158.85 km cell and, for each GSinGrid tile, a spectral library that matched temporally and spatially and that had a phenological difference of less than five days could be found.
Secondly, at the stage of where spectrally uniform points within each 158.85 km × 158.85 km cell were collected, the uniform points were still determined based on the variance of a 3 × 3 local window using spectral thresholds of [0.03, 0.03, 0.03, 0.06, 0.03, and 0.03] for the six MODIS bands (blue, green, red, NIR, SWIR1, and SWIR2) [57]. Since the view-angle effects had already been removed from the directional reflectances of MOD09A1, this resulted in a more stable and consistent product (MCD43A4), and more candidate points for spectral uniformity were selected after removing the constraint of view angle between Landsat and MODIS.
Finally, compared with the GlobeCover2009 land-cover product [58], the CCI_LC has a higher classification accuracy, more detailed classification scheme and better stability of the land-cover maps from year to year [6,55]. Therefore, the classification system (Table 1) adopted in this study contained a fine classification scheme that included 22 land-cover types after the removal of several mosaic land-cover types (these included mosaic cropland and natural vegetation, and mosaic tree and shrub and herbaceous cover). The CCI_LC is provided with a geographical projection while the SPECLib and MCD43A4 are provided with a sinusoidal projection. The global CCI_LC map for 2015 was first reprojected to the sinusoidal projection and further tiled into the 10 • × 10 • MODIS land product tiles. Since the spatial resolution of the MCD43A4 (500 m) data was roughly two times that of the CCI_LC (300 m) and the uniform points were determined in 3 × 3 local window, and the CCI_LC achieved higher accuracy over homogeneous areas [55], the larger window of 4 × 4 instead of 2 × 2 was used to determine the land-cover types for these uniform points, namely, the boundary of each uniform point collected in MCD43A4 would be composed of 4 × 4 CCI_LC pixels. In order to ensure the reliability of the reflectance spectra in SPECLib, only those spectrally uniform points that were further identified as homogeneous in the CCI_LC were retained. In other words, if the maximum frequency of dominant land-cover types was less than 14 in the 4 × 4 local window, the point was excluded from the SPECLib.

Normalization of the SPECLib Reflectance Spectra
Although there was a great deal of consistency between the Landsat SR and MCD43A4 NBAR products [60], the minor spectral differences caused by the differences in the acquisition date, the quantitative processing, and the spectral response function, also had to be considered and minimized using relative radiometric normalization. The SPECLib was defined for each level-2 land tile using MCD43A4 and had the same projection as the level-2 land-tile Landsat SR (Section 2.1). To remove the resolution differences between the two sensors, the Landsat SR data in the level-2 land-tiles were further aggregated to the MODIS resolution by averaging the Landsat values. The homogeneous points selected by the spectral thresholds [0.03, 0.03, 0.03, 0.06, 0.03, and 0.03] for the six spectral bands (see also [33] and [57]) in both types of imagery were used to build the linear regression defined as: where the coefficients α and β describe the radiometric differences between the two sensors in the six spectral bands: the closer the slope to the 1:1 line, the better the spectral consistency. Therefore, all the reflectance spectra in SPECLib were normalized using the above formula and then used to train the multi-temporal classification method.

Multi-Temporal Classification Method Based on SPECLib
The RF classifier can handle high data dimensionality and multicollinearity. It is also fast, insensitive to overfitting, less sensitive to noise and more efficient and accurate than other non-parametric classifiers [40]. Therefore, the RF was selected for use in our operational SPECLib-based multi-temporal land-cover mapping method. To generate the annual land-cover map using the intra-annual multi-temporal data, a stacked random forest was developed and adopted ( Figure 4). The stacked random forest usually consists of two steps: the independent RF classifier (base classifier) was first generated using the training data from each time/date. Individual outputs from all the independent classifiers were then used as input features to train the second classifier.

Training the Base Classifier
For a given GSinGrid tile, there were multiple epochs (Nepoch) Landsat SR data (the temporal frequencies are shown in Figure  2). According to the spatial location (hh(x0)vv(y0)h(x1)v(y1)_p(x2)r(y2)) and day of year (DOY) corresponding to these SR data, the Nepoch spectral libraries that were temporally closest (phenological difference less than five days) could be found in the SPECLib which was defined for each GWELD tile (hh(x0)vv(y0)h(x1)v(y1)) and the temporal resolution of eight days. It should be noted that all GSinGrid tiles belonging to the same GWELD tile shared the same reflectance spectra because the SPECLib was defined at the GWELD scale. In addition, to ensure classification consistency across these spatial-neighbor GWELD tiles and to avoid losing reflectance spectra for broken land-cover types caused by the coarse resolution of MCD43A4 and CCI_LC, the adjacent 3 × 3 GWELD tiles from SPECLib were imported to train the random forest for classifying the central tile. After importing adjacent training spectra, the total number of training spectra for each GWELD tile was more than 7,200.
Spectral indexes have been demonstrated effective to improve classification accuracy for landcover types with similar reflectance spectra [42]: for example, the NDVI has been used for vegetation separation [61], the normalized difference water index (NDWI) for inland water detection [62] and the normalized difference built-up index (NDBI) for building detection [63]. In this study, including the six spectral reflectance bands, NDVI, NDWI, and NDBI, there was a total of nine spectral features for each training sample. In addition, as altitude information is helpful in land-cover classification [64], especially for vegetation-related land-cover types [65], the topographical elevations from the Shuttle Radar Topography Mission (SRTM) [66] were also included in the training of the RF classifier. For example, Jin et al. [67] found that the vegetation distribution in the Qilian mountains exhibited an obvious vertical gradient and that the vegetation types from low to high altitude progressed from desert-grassland vegetation through dry shrub-grassland vegetation, mountain forest-grassland vegetation, sub-alpine shrub-grassland vegetation, and cold-desert alpine meadow vegetation.
The RF has only two main parameters: the number of selected prediction variables (Mtry) and the number of classification trees (Ntree) [39]. Du et al. [43] investigated the influence of Ntree on the classification accuracy of the RF classifier (from 10 to 200 trees at 10 intervals) and found no significant relationship between them. In this study, a value of 100 for Ntree was, therefore, selected for each base RF classifier. For the Mtry parameter, Gislason et al. [68] found that the classification accuracy was insensitive to the parameter, so the default value of the square root of the total number of training features was used [68,69]. In this paper, there were ten training features used, including nine spectral features and one elevation channel, so the default value of three for Mtry was selected.

Stacking of Base Classifiers
After training each base classifier using single-date training data, there were Nepoch independent RF classifiers. As Healey et al. [70] explained, how to fuse these single-date land-cover results derived from corresponding classifiers to produce an annual land-cover map can be divided into two main types: one type uses the simple algebraic operators, such as mean or majority voting. The other uses a secondary classification model to reweight the output of these classifiers according to their performance against similar cases in the reference data, namely, using the outputs from these

Training the Base Classifier
For a given GSinGrid tile, there were multiple epochs (Nepoch) Landsat SR data (the temporal frequencies are shown in Figure 2). According to the spatial location (hh(x0)vv(y0)h(x1)v(y1)_p(x2)r(y2)) and day of year (DOY) corresponding to these SR data, the Nepoch spectral libraries that were temporally closest (phenological difference less than five days) could be found in the SPECLib which was defined for each GWELD tile (hh(x0)vv(y0)h(x1)v(y1)) and the temporal resolution of eight days. It should be noted that all GSinGrid tiles belonging to the same GWELD tile shared the same reflectance spectra because the SPECLib was defined at the GWELD scale. In addition, to ensure classification consistency across these spatial-neighbor GWELD tiles and to avoid losing reflectance spectra for broken land-cover types caused by the coarse resolution of MCD43A4 and CCI_LC, the adjacent 3 × 3 GWELD tiles from SPECLib were imported to train the random forest for classifying the central tile. After importing adjacent training spectra, the total number of training spectra for each GWELD tile was more than 7200.
Spectral indexes have been demonstrated effective to improve classification accuracy for land-cover types with similar reflectance spectra [42]: for example, the NDVI has been used for vegetation separation [61], the normalized difference water index (NDWI) for inland water detection [62] and the normalized difference built-up index (NDBI) for building detection [63]. In this study, including the six spectral reflectance bands, NDVI, NDWI, and NDBI, there was a total of nine spectral features for each training sample. In addition, as altitude information is helpful in land-cover classification [64], especially for vegetation-related land-cover types [65], the topographical elevations from the Shuttle Radar Topography Mission (SRTM) [66] were also included in the training of the RF classifier. For example, Jin et al. [67] found that the vegetation distribution in the Qilian mountains exhibited an obvious vertical gradient and that the vegetation types from low to high altitude progressed from desert-grassland vegetation through dry shrub-grassland vegetation, mountain forest-grassland vegetation, sub-alpine shrub-grassland vegetation, and cold-desert alpine meadow vegetation.
The RF has only two main parameters: the number of selected prediction variables (Mtry) and the number of classification trees (Ntree) [39]. Du et al. [43] investigated the influence of Ntree on the classification accuracy of the RF classifier (from 10 to 200 trees at 10 intervals) and found no significant relationship between them. In this study, a value of 100 for Ntree was, therefore, selected for each base RF classifier. For the Mtry parameter, Gislason et al. [68] found that the classification accuracy was insensitive to the parameter, so the default value of the square root of the total number of training features was used [68,69]. In this paper, there were ten training features used, including nine spectral features and one elevation channel, so the default value of three for Mtry was selected.

Stacking of Base Classifiers
After training each base classifier using single-date training data, there were Nepoch independent RF classifiers. As Healey et al. [70] explained, how to fuse these single-date land-cover results derived from corresponding classifiers to produce an annual land-cover map can be divided into two main types: one type uses the simple algebraic operators, such as mean or majority voting. The other uses a secondary classification model to reweight the output of these classifiers according to their performance against similar cases in the reference data, namely, using the outputs from these single-date classifiers as input to train the secondary model [71]. The final decision came from outputs of the secondary model (or stacking classifier). Compared with the simple combination, the stacking (the second solution) strategy has been proven to give better prediction accuracy [70][71][72].
The stacked random forest classifier was trained using the outputs of all the independent base classifiers (posterior probabilities and corresponding land-cover type). Moreover, whether the training data needed to participate in the training of the stacked RF classifier depended on the total number of input features of posterior probabilities and land-cover types because it was possible that the problem of high dimensionality could be amplified when the number of training samples was relatively small, thus resulting in an increase of classification error [40]. Our solution was to train two stacked random forest models, one of which added training data and the other that did not: the model that achieved the higher accuracy was adopted to predict land-cover types for the GSinGrid tile data. Lastly, for the two main parameters (Ntree and Mtry) default values of 500 and the square root of the total number of input features [40], respectively, were used.

Rule-Based Verification
Although the temporal information and topographical variables were considered, spectral similarities, including that between cropland and grassland or water bodies and terrain shadow, together with errors in cloud and shadow detection and filling (described in Section 2.1.2) led to the misclassification of a small number of pixels. For each pixel, if the pixel's land-cover type were to be changed, the new land-cover type would be determined by the corresponding probabilities estimated by the stacked random forest classifier. Specifically, pixels classified as water body and cropland were further checked using slope thresholds of 10 • and 20 • , respectively [33], and classification as permanent snow and ice was restrained using the normalized difference snow index (NDSI) threshold of 0.4 [33,73].

Accuracy Assessment
Although the classification scheme included 22 land-cover types, the validation system needed to be further simplified because the confidence in the validation samples could not be guaranteed for more detailed land-cover types, for example, closed broadleaf/needle-leaved forest and open broadleaf/needle-leaved forest were combined into broadleaf/needle-leaf forest. In this study, the validation system was split into two parts (Table 1): a level-1 validation system containing nine land-cover types and a level-2 validation system containing 19 land-cover types. Specifically, the level-2 validation system was derived from the classification system after merging these similar land-cover types according to the CCI_LC validation scheme [55]. The level-1 validation system was inherited from GlobeLand30 [4] and FROM_GLC [12], and the merging strategy integrated the works of Yang et al. [6] and Defourny et al. [55].
There are many metrics available to assess the performance of a classifier in the literature [74,75]. In this study, the classification error matrixes [75] were developed for the two independent validation systems ( Table 1). The derived parameters in the error matrix-the producer accuracy (P.A.) (measure of omission), user accuracy (U.A.) (measure of commission), overall accuracy (O.A.) and kappa coefficient [75]-were used to assess the performance of the classifier. Figure 5 illustrates the 30-m annual land-cover map for 2015 derived using the multi-temporal random forest method for all 305,000 GSinGrid tiles covering the whole of China. To make the land-cover map more intuitive, the sinusoidal projection has been transformed into a geographical projection. It can be seen that bare areas, grassland, cropland, and forest are among the most abundant land-cover types, a result that is consistent with the true spatial patterns of land-cover types in China. Specifically, the mosaic effect widely existing in many large-area land-cover maps using a supervised classification approach was very slight in our annual land-cover map in Figure 5. In addition, Table 2 and Table 3 include quantitative assessments of the annual land-cover map obtained using different validation systems (see Table 1) and 11,232 validation samples. Table 2 summarizes the accuracy metrics for 19 different land-cover types. Overall, the proposed method achieved a kappa coefficient of 0.664 and an overall accuracy of 71.3%. Specifically, among these landcover types, water body had the best accuracies (94.1% for the user's accuracy and 90.9% for the producer's accuracy) because inland water had the most distinctive spectral curve. This was followed, in terms of accuracy, by bare areas, snow and ice, grassland, and deciduous needle-leaved forest. For the impervious, wetland and shrubland cover types, the performance was relatively poor because these land-cover types were more complex and had more variable spectral and temporal features. For example, the impervious type (accuracy approximately only 50.7%) primarily consisted of asphalt, concrete, sand and stone, brick and glass, and could be divided into high-reflectance (airports and greenhouses), medium-reflectance (urban residential areas) and low-reflectance (rural cottages) sub-types [4]. Mixed-leaf forest had a very low accuracy with an 8.1% user's accuracy because of errors related to confusion between this land-cover type and other forest types-the CCI_LC user guide also gives a low user's accuracy of 0.051 for mixed forest [55]. In addition, there was serious misclassification of land-cover types with similar spectral and phonological features: for example, rainfed cropland and irrigated cropland, evergreen broadleaved forest and evergreen needle-leaved forest, and grassland and bare areas. As these errors were mainly caused by the high degree of similarity between these land-cover types, it was concluded that some confusion between similar land-cover types was inevitable.

Results and Validation
Finally, for a more comprehensive evaluation of the classification accuracy of the proposed method, the confusion matrix obtained using the level-1 validation system is shown in Table 3. After combining similar land-cover types, the classification accuracy clearly improved, giving a kappa coefficient of 0.757 and an overall accuracy of 80.7%. In fact, the major difference between the level-1 and level-2 validation systems was mainly concentrated in the cropland and forest types. Correspondingly, the average producer's accuracy for cropland and forest improved from the 0.611 In addition, Tables 2 and 3 include quantitative assessments of the annual land-cover map obtained using different validation systems (see Table 1) and 11,232 validation samples. Table 2 summarizes the accuracy metrics for 19 different land-cover types. Overall, the proposed method achieved a kappa coefficient of 0.664 and an overall accuracy of 71.3%. Specifically, among these land-cover types, water body had the best accuracies (94.1% for the user's accuracy and 90.9% for the producer's accuracy) because inland water had the most distinctive spectral curve. This was followed, in terms of accuracy, by bare areas, snow and ice, grassland, and deciduous needle-leaved forest. For the impervious, wetland and shrubland cover types, the performance was relatively poor because these land-cover types were more complex and had more variable spectral and temporal features. For example, the impervious type (accuracy approximately only 50.7%) primarily consisted of asphalt, concrete, sand and stone, brick and glass, and could be divided into high-reflectance (airports and greenhouses), medium-reflectance (urban residential areas) and low-reflectance (rural cottages) sub-types [4]. Mixed-leaf forest had a very low accuracy with an 8.1% user's accuracy because of errors related to confusion between this land-cover type and other forest types-the CCI_LC user guide also gives a low user's accuracy of 0.051 for mixed forest [55]. In addition, there was serious misclassification of land-cover types with similar spectral and phonological features: for example, rainfed cropland and irrigated cropland, evergreen broadleaved forest and evergreen needle-leaved forest, and grassland and bare areas. As these errors were mainly caused by the high degree of similarity between these land-cover types, it was concluded that some confusion between similar land-cover types was inevitable. Table 2. Confusion matrix for the annual land cover map using the level-2 validation system.  Total  1210  79  713  725  675  617  263  86  112  32  3076  23  334  75  276  2212  46  454  224   Finally, for a more comprehensive evaluation of the classification accuracy of the proposed method, the confusion matrix obtained using the level-1 validation system is shown in Table 3. After combining similar land-cover types, the classification accuracy clearly improved, giving a kappa coefficient of 0.757 and an overall accuracy of 80.7%. In fact, the major difference between the level-1 and level-2 validation systems was mainly concentrated in the cropland and forest types. Correspondingly, the average producer's accuracy for cropland and forest improved from the 0.611 to 0.768 and from 0.625 to the 0.909, respectively. This sharp increase also indicates the degree of confusion between these similar sub-types.

Influence of the Temporal Frequency
Due to the contamination by cloud and cloud shadow, the temporal frequency of Landsat SR varied greatly by geographical location (Figure 2). As the temporal information contributed a lot to the land-cover mapping, the relationship between the temporal frequency and the classification accuracy was analyzed using the validation samples described in Section 2.2. The analysis in Figure 6 shows that the classification accuracy first increases and then slowly decreases with increasing temporal frequency. In addition, it was found that a temporal frequency around 21 gave the highest overall accuracy.
Remote Sens. 2019, 11, x FOR PEER REVIEW 2 of 18 to 0.768 and from 0.625 to the 0.909, respectively. This sharp increase also indicates the degree of confusion between these similar sub-types.

Influence of the Temporal Frequency
Due to the contamination by cloud and cloud shadow, the temporal frequency of Landsat SR varied greatly by geographical location (Figure 2). As the temporal information contributed a lot to the land-cover mapping, the relationship between the temporal frequency and the classification accuracy was analyzed using the validation samples described in Section 2.2. The analysis in Figure  6 shows that the classification accuracy first increases and then slowly decreases with increasing temporal frequency. In addition, it was found that a temporal frequency around 21 gave the highest overall accuracy. A similar conclusion can also be found in the work of Karakizi et al. [76]: a higher classification accuracy was achieved when only some of the temporal-spectral features were used to train the classifier because a greater number of temporal features can easily lead to the Hughes phenomenon when the number of training samples is fixed [40]. Therefore, the further research work would combine feature-space optimization algorithms, such as wrapper methods, embedded methods, and A similar conclusion can also be found in the work of Karakizi et al. [76]: a higher classification accuracy was achieved when only some of the temporal-spectral features were used to train the classifier because a greater number of temporal features can easily lead to the Hughes phenomenon when the number of training samples is fixed [40]. Therefore, the further research work would combine feature-space optimization algorithms, such as wrapper methods, embedded methods, and filter methods [40,77], with the RF classification model to achieve the best classification accuracy for areas with a high temporal frequency. As for the areas with temporally sparse data, such as the cloudy regions of the southwest of China, according to Figure 2, the temporal frequency was usually less than five because of cloud contamination. Our future work will, therefore, combine other medium-resolution satellite data, such as Landsat 7 [78,79] and Sentinel-2 [80] imagery, to build high temporal frequency datacube for accurate land-cover mapping in cloudy regions.

Consistency between MCD43A4 and Landsat SR for Land-Cover Mapping
As the reflectance spectra in the SPECLib came from the MCD43A4 NBAR product and were further used to classify the Landsat imagery, the radiometric consistency between the two SR products needed to be analyzed. As illustrated in Figure 7, the consistency between the SR data retrieved from Landsat OLI and the MODIS NBAR product is strong in every spectral band and gives an average coefficient of determination of 0.865 and root mean square error (RMSE) of 0.017. As for the fitted coefficients (slope and bias) of the fitted line, the shorter wavelength bands are closer to the 1:1 line than the longer wavelength bands (SWIR1 and SWIR2). The main cause of the reflectance discrepancy is that the two MODIS SWIR bands are spectrally narrower than the corresponding Landsat OLI bands [57]. Therefore, after the linear radiometric normalization for the reflectance spectra in SPECLib, the radiometric discrepancy between the two sensors could be minimized as much as possible.
Remote Sens. 2019, 11, x FOR PEER REVIEW  3 of 18 filter methods [40,77], with the RF classification model to achieve the best classification accuracy for areas with a high temporal frequency. As for the areas with temporally sparse data, such as the cloudy regions of the southwest of China, according to Figure 2, the temporal frequency was usually less than five because of cloud contamination. Our future work will, therefore, combine other mediumresolution satellite data, such as Landsat 7 [78,79] and Sentinel-2 [80] imagery, to build high temporal frequency datacube for accurate land-cover mapping in cloudy regions.

Consistency between MCD43A4 and Landsat SR for Land-Cover Mapping
As the reflectance spectra in the SPECLib came from the MCD43A4 NBAR product and were further used to classify the Landsat imagery, the radiometric consistency between the two SR products needed to be analyzed. As illustrated in Figure 7, the consistency between the SR data retrieved from Landsat OLI and the MODIS NBAR product is strong in every spectral band and gives an average coefficient of determination of 0.865 and root mean square error (RMSE) of 0.017. As for the fitted coefficients (slope and bias) of the fitted line, the shorter wavelength bands are closer to the 1:1 line than the longer wavelength bands (SWIR1 and SWIR2). The main cause of the reflectance discrepancy is that the two MODIS SWIR bands are spectrally narrower than the corresponding Landsat OLI bands [57]. Therefore, after the linear radiometric normalization for the reflectance spectra in SPECLib, the radiometric discrepancy between the two sensors could be minimized as much as possible. Similarly, Liu et al. [27] used the normalized reflectance spectra from the MOD09A1 SR product to classify time-series Landsat images at a different year in Maduo county, Qinghai province, China, and achieved the average overall accuracy of 90.17%, which is even slightly higher than that of traditional scene-by-scene supervised classification. Therefore, the normalized reflectance spectra in SPECLib can be considered as suitable for land-cover mapping using Landsat imagery. Similarly, Liu et al. [27] used the normalized reflectance spectra from the MOD09A1 SR product to classify time-series Landsat images at a different year in Maduo county, Qinghai province, China, and achieved the average overall accuracy of 90.17%, which is even slightly higher than that of traditional scene-by-scene supervised classification. Therefore, the normalized reflectance spectra in SPECLib can be considered as suitable for land-cover mapping using Landsat imagery.

Limitations of SPECLib for Fine-Resolution Land-Cover Mapping
Although the 500-m reflectance spectra in SPECLib were successfully used to produce annual 30-m land-cover map of China, the coarse resolution of the SPECLib might still result in a lack of land-cover types that cover small areas, such as roads and villages. In this study, by importing the reflectance spectra from the 3 × 3 adjacent tiles in SPECLib, we were able to minimize the resolution limitation efficiently. Similarly, Zhang and Roy [30] also imported the training data from adjacent tiles to train the local classifiers. However, it is undeniable that this limitation still exists in some extreme areas, therefore, in subsequent work we will develop another candidate spectral library containing the standard reflectance spectra for all land-cover types for every MODIS land tile.
According to the quantitative statistics shown in Tables 2 and 3, 'impervious' had a lower accuracy than other land-cover types. Similarly, the FROM_GLC land-cover product also suffers from having a low accuracy of 30.77% for the impervious type [12]. Due to the spectral variety of impervious, such as urban areas, rural cottages, and roads, it can be classified into three sub-categories including 'high reflectance', 'low reflectance' and 'vegetated' [4]. In this study, the impervious reflectance spectra contained in SPECLib were treated as a whole instead of as three different sub-types, therefore, the misclassification of this cover type may have been quite serious. Fortunately, some scientists have proposed several methods to independently identify impervious pixels. For example, Tian et al. [81] proposed the construction of a perpendicular impervious surface index for impervious mapping and achieved an overall accuracy of more than 90% for the four cities that they studied. Gao et al. [82] used time series of normalized spectral distances and decision trees to map the expansion of the impervious cover in the Yangtze River Delta, achieving an overall accuracy of 92.5%. Therefore, in future work, this cover type will be classified independently using multi-temporal imagery.

Conclusions
Large area land-cover mapping is a difficult and time-consuming task that includes sample collection, image processing, feature extraction, classification, mosaicking, and accuracy assessment [12]. In this study, a novel operational methodology to classify Landsat OLI datacube using reflectance spectra from SPECLib was used to automatically produce the annual land-cover map for the whole of China. First, SPECLib was developed using a time series of the MCD43A4 NBAR and CCI_LC land-cover products. This SPECLib had a spatial resolution of 158.85 km (equivalent to 1.43 • at the equator) in the sinusoidal projection and had a temporal resolution of eight days. Secondly, the reflectance spectra in SPECLib were radiometrically normalized according to the linear regression relationship between MCD43A4 NBAR and the Landsat OLI datacube. Lastly, the multi-temporal random forest models were trained using the normalized reflectance spectra and then applied to the Landsat datacube for producing an annual land-cover map with 22 land-cover types. Validation of the results gave an overall accuracy of 80.7% and kappa coefficient of 0.757 for the level-1 validation system (nine land-cover types) and an overall accuracy of 71.3% and kappa coefficient of 0.664 for the level-2 validation system (19 land-cover types). Therefore, it was concluded that the proposed method provides a novel strategy for large-area land-cover mapping. Future work will adopt this method to produce a global land-cover map with a fine classification system. Author Contributions: Conceptualization, L.L.; investigation, X.Z.; methodology, L.L. and X.Z.; software, X.Z. and X.C.; validation, X.Z., S.X., X.C. and Y.G.; Writing-Original Draft preparation, X.Z.; Writing-Review and Editing, S.X.