Comparison and Assessment of Regional and Global Land Cover Datasets for Use in CLASS over Canada

: Global land cover information is required to initialize land surface and Earth system models. In recent years, new land cover (LC) datasets at ﬁner spatial resolutions have become available while those currently implemented in most models are outdated. This study assesses the applicability of the Climate Change Initiative (CCI) LC product for use in the Canadian Land Surface Scheme (CLASS) through comparison with ﬁner resolution datasets over Canada, assisted with reference sample data and a vegetation continuous ﬁeld tree cover fraction dataset. The results show that in comparison with the ﬁner resolution maps over Canada, the 300 m CCI product provides much improved LC distribution over that from the 1 km GLC2000 dataset currently used to provide initial surface conditions in CLASS. However, the CCI dataset appears to overestimate needleleaf forest cover especially in the taiga-tundra transition zone of northwestern Canada. This may have partly resulted from limited availability of clear sky MEdium Resolution Imaging Spectrometer (MERIS) images used to generate the CCI classiﬁcation maps due to the long snow cover season in Canada. In addition, changes based on the CCI time series are not always consistent with those from the MODIS or a Landsat-based forest cover change dataset, especially prior to 2003 when only coarse spatial resolution satellite data were available for change detection in the CCI product. It will be helpful for application in global simulations to determine whether these results also apply to other regions with similar landscapes, such as Eurasia. Nevertheless, the detailed LC classes and ﬁner spatial resolution in the CCI dataset provide an improved reference map for use in land surface models in Canada. The results also suggest that uncertainties in the current cross-walking tables are a major source of the often large di ﬀ erences in the plant functional types (PFT) maps, and should be an area of focus in future work.


Introduction
Accurate characterization of land cover and related dynamics are required to document environmental changes and to provide initial surface conditions for land surface, ecosystem, climate, and Earth system models [1,2]. Land cover (LC) is considered an essential climate variable due to its influence on the exchange of energy, water, and carbon between the land surface and the Table 1. Legends for land cover datasets included in this study. The parentheses in the CCI legend are used to indicate sub-classes, while the brackets show the main classes.

Study Area
The study area focuses on Canada, where several national LC products have been generated from satellite data sources. Canada is almost a billion hectares in area and forest dominated ecosystems occupy approximately 65% of the national land base. North of forested ecosystems are sparsely populated tundra and arctic environments. Southern Canada is dominated by a mosaic of agricultural and urban land use ( Figure 1). We divide Canada into 18 regions determined by ecozone (EZ) [39] or sub-ecozone ( Figure 1). Some of the large ecozones are sub-divided into smaller ones mainly based on the location and the spatial distribution of LC characteristics. For example, the Boreal Shield zone extends from northeastern Alberta to Newfoundland and Labrador, and is divided into five sub-zones (EZ5-9), with EZ5 dominated by mixed forest (MF), EZ6, EZ7 and EZ9 dominated by NF, and EZ8 by shrub. Names of the EZs can be found in Table 2.

Reference Sample Data
We used 567 randomly selected LC samples ( Figure 1) classified by [40] through expert interpretation of Google Earth (GE) or Bing Maps (BM) high-resolution images and a false color composite from Landsat Enhanced Thematic Mapper Plus (ETM+) acquired in 2010. Available ground truth data from field campaigns were used to assist in interpretation of the samples. These sample data are representative at the 30m scale, and they are a subset of reference samples used to validate  [40]; thus they were classified using the same NALCMS legend. In this legend (see details below) treed wetland is not separated from herbaceous wetland, which is problematic given our focus is forested areas of Canada. To resolve this issue, the samples are reinterpreted using high-resolution images in GE/BM images focusing on samples classified as wetland. We reclassify the samples into forest and non-forest samples first, then use them as our reference to validate forest/non-forest pixels mapped by the high-resolution datasets used in this study.

Regional LC Datasets Over Canada
In recent years satellite data with finer resolutions (~25-30 m) have become readily available. This in combination with advanced computing power makes it possible to generate finer resolution LC products over large areas. Several LC products were generated using medium-to-high resolution optical satellite data over Canada. Recently a LC dataset for circa 2010 was generated at 30m resolution using Landsat images by the Canada Centre for Remote Sensing under the NALCMS program [40] (hereafter the NALCMS dataset). An innovative approach was used first to train and classify locally confined reference data over a large number of partially overlapping areas which ensured the optimization of the classifier to a local LC distribution; then a weighted combination of labels determined by the classifier in overlapping windows defined the final label for each pixel. The NALCMS dataset has 19 classes based on the United Nations Land Cover Classification System (LCCS, [41]) ( Figure 1). Please note that four of these classes are not found in Canada (underlined in Table 1). An accuracy assessment based on 2011 randomly distributed samples showed that adjacent forest classes can be confused, for example coniferous with mixed forest, as well as deciduous forest, shrubland, shrub-covered wetlands, and certain croplands [40]. Confusion may also arise with the herbaceous class being misidentified as either low biomass croplands, or sparse coniferous forest along the northern tree line. Common to many remote sensing analyses, errors can be attributable to ambiguous spectral features and/or radiometric saturation occurring at leaf area index levels in the range of 3-5 [40,42], which generally apply to all satellite-derived LC products [19]. Nevertheless an overall accuracy of 76.6% was achieved, demonstrating quality and reliability as a land cover product representing Canada. The NALCMS dataset is used as a reference in our comparisons and assessments.
To better monitor and manage Canada's forests, the Earth Observation for Sustainable Development of Forests (EOSD) LC project was initiated as a partnership between the Canadian Forest Service and the Canadian Space Agency, with provincial and territorial participation and support. Based on the National Forest Inventory hierarchical classification system, the EOSD project produced a 23 class LC map of the forested area of Canada representing circa year 2000 conditions at 25 m resolution [43]. Over 480 Landsat-7 ETM+ images were classified-based upon an unsupervised hyperclustering, cluster merging and labeling method. Following release, EOSD was the most detailed and comprehensive map of the forested area of Canada. A unique feature of this dataset is that it provides detailed sub-classes for all vegetative LC classes ( Table 1). The third regional LC dataset was derived from the Moderate Resolution Imaging Spectroradiometer (MODIS) at 250 m resolution, and is available for 2000-2011 [44]. The development of temporally consistent LC time series from satellite data has proven difficult because multi-year observations are usually acquired under different conditions [10,35]. A change-based update approach was developed and applied to annual MODIS time series to generate a consistent LC dataset. The 2005 MODIS LC dataset for Canada [45] was used as the base map. The approach is similar to that employed to generate the annual CCI LC time series. However, the MODIS-derived dataset focuses on Canada only and benefitted from local (Canadian) expertise. It uses the same 19-class legend as the NALCMS dataset ( Table 1)

Global LC Datasets
The GLC2000 dataset was generated from SPOT-VEG data collected from November 1999 to December 2000 at 1 km resolution [9]. It was produced by 21 separate regional expert groups using an unsupervised classification method. Based on the LCCS, the regional products were merged to one global product with a generalized LCCS legend of 22 classes (Table 1). Assessment based on random sampling of reference sites globally estimated an accuracy of 68.6% [17].
Based on broad user consultations and lessons learned from the development of previous LC products, especially the GlobCover products, a unique LC mapping approach was developed to generate annual CCI LC datasets at 300 m for the period 1992-2015 [12, 35,46]. This approach was developed and tested previously in Canada by [44,47]. There are three steps in the classification chain. First, a baseline map was generated using a combination of machine learning and unsupervised classification methods from the entire archive of ENVISAT/MERIS for 2003-2012. The reference database consisted of global, regional and local reference LC maps selected as the most accurate and thematically compatible ones for a region. In the classification algorithm, the whole globe was stratified into 22 equal-reasoning areas based on climatic and remote sensing conditions to account for local LC characteristics [48]. Then annual LC changes were detected at 1 km resolution from the AVHRR time series between 1992 and 1999, SPOT-VEG time series between 1999 and 2013 and PROBA-V time series between 2013 and 2015. The last step consists of backward or forward updating from the baseline map to produce the annual LC maps. This approach is designed to avoid the problem of significant year-to-year variations in LC labels not associated with actual LC changes which were found in the MODIS and GlobCover products [10,35], and enables the internal consistency of the annual LC time series, a major request from the climate modelling community. Based on the LCCS legend, the CCI LC maps have 22 level 1 classes, and 15 level 2 sub-classes (Table 1). Assessment based on the GlobCover validation database estimated an overall accuracy of 71% [12]. The CCI LC product for 2000 and 2010 are assessed relative to GLC2000 and finer resolution LC datasets. Given our reference dataset (NALCMS) is for 2010, we use the CCI dataset for 2010 in the following analyses unless otherwise specified.

Tree Cover Fraction Data
Given the large impact of forest cover on the simulation of winter albedo, we also include a tree cover fraction and forest cover change dataset produced by Hansen et al. [49] (hereafter the Hansen dataset). It was based on Landsat images at 30 m resolution. In contrast to the discrete LC classification datasets described above, the Hansen dataset is a vegetation continuous field product, in which the satellite spectral information was used to estimate the tree cover fraction in each pixel using a regression tree algorithm [50,51]. This may better represent heterogeneous areas than is possible by discrete LC classification. In the algorithm, Quickbird images (2 m resolution) were used to map tree crowns to relate tree cover fraction to spectral signals from Landsat. Tree cover was defined as canopy closure for all vegetation taller than 5 m in height. Forests are generally defined as taller than 3 m in the regional and global LC datasets. The different definitions in tree heights should not result in much difference in areas with mature forests, such as most boreal forests over Canada.

Methods
We use multiple methods to assess the above datasets at different scales. First, we reclassify the three high-resolution datasets (NALCMS, EOSD, and Hansen) into forest and non-forest categories, and evaluate their capability for discriminating forest cover from low vegetation and barren land using the reference sample data described in Section 2.2 at a 30 m scale. This provides some measure of the uncertainties in these datasets. Second, we transform all the LC datasets into a common legend and grid at 1 km resolution, compare the frequency of LC classes of all the LC datasets, and examine the producer and user accuracies of the GLC2000 and CCI datasets. Then we compute the sub-pixel fractional accuracy of the CCI dataset using the 30 m NALCMS dataset as a reference. Finally, we convert the GLC2000 and CCI datasets to the four CLASS PFTs at 0.5 degree and compare the spatial agreement of the PFT maps.

Reclassify to Forest and Non-Forest at 30 m Scale
Through investigating the various class descriptions, it is straightforward to classify the reference sample data (Section 2.2), as well as the NALCMS and EOSD products into forest and non-forest categories. For instance, to obtain the forest category, we simply combine all the forest classes, which are LC 1-6 for reference samples and NALCMS, and LC81 and 211-233 for EOSD (Table 1), and the remaining classes are all combined as the non-forest category. The definition for wetland is different in the EOSD and the NALCMS maps (Table 1). There are three sub-classes for wetland in EOSD: wetland-treed (81), wetland-shrub (82), and wetland-herb (83). Although wetlands are largely herbaceous and shrubs in the NALCMS legend for Canada according to [40], there are trees in some wetland pixels in NALCMS according to the high-resolution GE/BM images. An approximation of wetland-treed pixels in NALCMS is derived based on the ratio of wetland-treed pixels and total wetland pixels in EOSD for each ecozone ( Table 2). It is assumed that forested wetland occupies the same fraction of total wetland as in EOSD and this is added to the forest category in NALCMS to produce Adj-NALCMS.
Given that the NALCMS is for 2010 and the Hansen tree fraction data are for 2000, we apply the forest loss and gain information from the Hansen dataset [49] to the tree fraction data for 2000 to obtain an estimate of tree cover fraction for 2010 (hereafter Hansen2010). Different thresholds of tree cover fraction (from 5% to 25% with 5% increments) are used to classify the Hansen datasets into forest and non-forest categories. The results show that the 25% threshold provides the best match with the reference sample data and is therefore used in our analysis; the 25% tree cover threshold was also used in [51] to define forest classes.
We choose pixels closest to the locations of the reference sample points ( Figure 1) from the original grid of the NALCMS, EOSD and Hansen datasets (their resolutions are all around 30 m), and compare the number of forest and non-forest samples correctly identified by each dataset. In addition, we reproject/regrid the forest/non-forest maps from the three datasets into a 300 m resolution common grid and compare the frequency of forest cover in the 18 ecozones across Canada.

Transform to the Common Legend
The different datasets cannot be compared directly because, as will be explained, they have different resolutions and different numbers of LC classes ( Table 1). As in previous LC dataset comparison studies e.g., [16], we reproject/regrid all the datasets into a common grid and convert each dataset into a common legend. During the transformation process, we visually examine the spatial distribution of the LC classes from each of the datasets, making sure that LC classes converted into the same categories in the common legend are located in similar regions/ecozones.

Reproject to the Common Grid and Convert to the Common Legend
When the GLC2000 dataset was involved in the analysis, we chose the associated latitude-longitude projection as our common grid since it has the lowest resolution. Pixel resolution in GLC2000 is 1/112º, which corresponds to 1 km at the equator. In the reprojection/regridding process, we kept track of the most and the second most abundant categories and the respective fractions in order to define pixels that meet the definition of the mosaic LC class (see details below). All other pixels were assigned the LC class corresponding with the most abundant class. When the GLC2000 dataset was not involved, all datasets with resolutions finer than 1 km were reprojected/regridded to a 300 m (0.0028º) common grid using the same approach.
Based on the characteristics of LC types over Canada, 13 classes were chosen for the common legend based on LCCS ( Table 3); 12 of the classes are similar to the level I legend in the NALCMS dataset [40]. We add a mosaic class of tree and other vegetation because this is a common LC category at 1 km resolution. Except for the EOSD dataset, the LC classes in all the datasets were fully developed in or comply with the LCCS legend. The 23 classes in EOSD can be easily merged to the corresponding broad categories in the common legend (e.g., combine classes of conifer-dense, conifer-open, and conifer-sparse into needleleaf forest (NF)) ( Table 3). Although the forest definition in the EOSD legend looks different compared to the other legends, the sparse, open, and dense forest categories in EOSD can be transformed to corresponding classes in the LCCS legend ( Table 4 in [52]). The forest definitions are also slightly different for GLC2000/CCI (>15% cover) and NALCMS/MODIS (>5% cover for sub-polar taiga needleleaf forest and >20% cover for other forest types). They all comply with the height requirement of above 3 m as in the LCCS legend for forest definition. Please note that LC classes are defined hierarchically by the height of the canopy layer ranging from trees to shrubs to herbaceous cover. Table 3. The common legend and the merging rules for each land cover dataset.  Since treed wetland is not explicitly separated from herbaceous wetland in the NALCMS/MODIS legend, LC classes of tree cover flooded with fresh/saline water in the CCI and GLC2000 legends are labeled as wetland under the common legend (Table 3). While the mosaic classes in the two global datasets (CCI and GCL2000) can be linked directly to the mosaic category in the common legend, there are no mosaic categories in the regional datasets. We define the mosaic class for the regional datasets based on the following rules. If a pixel has (1) the most abundant type as shrub with fractional coverage > 40%, and the second most abundant type is from one of the tree types (e.g., 1,2,5,6 for NALCMS) with fractional coverage > 10%; or (2) has the most abundant type as grass with fractional coverage > 50%, and the second most abundant type is from one of the trees or shrubs (e.g., 1,2,5,6,8 for NALCMS) with fractional coverage > 10%, it is labeled as mosaic forest and other vegetation. This is similar to rules used to define the mosaic classes in the CCI dataset (with labels of 100 and 110). If none of the conditions are met, the pixel remains as a shrub or grass class.

Comparison of LC Datasets at 1 km Scale
We compare the spatial distribution of the LC classes and the number of pixels for each class under the common legend over Canada. Using NALCMS as a reference, we compute the producer and user accuracies for the CCI and GLC2000 datasets. The producer's accuracy is the complement of the omission error [53], and it is the percentage of pixels of a given LC class in the reference dataset (in this case NALCMS) which are correctly identified in the LC dataset being assessed. The user's accuracy is the complement of the commission error, and it is the percentage of pixels in a given LC class in the dataset being assessed that are correctly identified with respect to the reference dataset. Pflugmacher et al. [54] showed that the choice of sampling unit significantly affected accuracy estimates. When the finer resolution datasets (i.e., CCI at 300 m) were regridded into the same grid of the coarser resolution dataset (i.e., GLC2000 at 1 km), the finer resolution datasets performed better in terms of overall accuracy. This is because the finer resolution datasets tend to have more homogenous pixels than the coarser resolution datasets. However, this advantage is reduced for larger pixel blocks [54]. Therefore, we compute the producer/user accuracies for 10 km by 10 km pixel blocks. In addition, we compare the frequency of LC classes from the CCI and GLC2000 datasets with that from the NALCMS dataset for the 18 ecozones across Canada, with a focus on the forest LC classes.

Sub-Pixel Fractional Accuracy at 300 m Scale
Sub-pixel fractional error matrices were introduced by Latifovic and Olthof [16]. They are produced by assigning sub-dominant LC classes from all fine-resolution pixels in the reference data to the corresponding single coarse-resolution pixel. This allows a quantitative assessment of the fractional composition of each class in the coarse-resolution map, which will be useful for the partitioning of the coarse-resolution dataset into PFTs. In this study, the NALCMS data at 30 m resolution is used to compute the sub-pixel fractional error matrices of the CCI dataset based on the 13-class common legend. This is first done for each ecozone, then the mean error matrices are computed from the 18 ecozones across Canada.
Previous studies showed that landscape heterogeneity has a large impact on LC mapping accuracy [16,55,56]. To quantify landscape heterogeneity, 3x3 pixel neighborhoods were assessed for the CCI dataset following [19]. A neighborhood is considered homogenous if only one LC class is present. For comparison, we also compute the sub-pixel fractional error matrices for homogenous areas.  (Table 5).  The CCI LC product user guide [12] provides a cross-walking table for converting the 37 CCI LC classes into ten PFTs. This table was originally developed by [58] based on recommendations from experts in the remote sensing and climate modelling communities. It includes four tree PFTs, four shrub PFTs and two grass PFTs. CLASS does not yet have explicit shrub PFTs (research on including shrubs as a separate PFT is ongoing), so the four shrub PFTs were merged into either the NF or BF tree PFTs as was done in creating the GLC2000 table (Tables 4 and 5). Based on Tables 4 and 5, the four CLASS PFTs are generated from GLC2000 and CCI (for 2000) datasets respectively on a 0.5 × 0.5 degree latitude/longitude grid for a sub-domain of North America.

Comparison of Forest Cover
First, we compare the NALCMS, Hansen, and EOSD datasets with the reference sample data at 30 m resolution for selected ecozones, in which there are relatively abundant sample data. Table 6a (Table 2). This suggests that the Hansen dataset may overestimate tree cover (at least for tree cover less than 25%) in wetland environments, which we confirmed by examining high-resolution images in GE/BM. Overall NALCMS performs the best among the three datasets. The proportions of forest cover from the above three datasets are obtained for each ecozone by aggregating the~30 m data into a 300 m common grid (Figure 2). The forest cover from NALCMS agrees well with those from Hansen2010 (within 10%) in most regions except for EZ7, 10, and 16. Considering the completely different methods used to generate the datasets, and the different definitions for forest cover, the 10% difference is understandable. There are relatively large percentages of wetland in EZ7 (11%) and EZ10 (37%) in NALCMS ( Figure 2 and Table 2). The agreement improves for these two ecozones after wetland-treed pixels are added to the forest cover as shown in Adj_NALCMS (Figure 2). EZ16 only has a small percentage of wetland (2%), but it has a large fraction of taiga NF (0.49, Table 2), which is defined with a cover of greater than 5% (and an upper limit of~25%). These taiga NF areas would not have been mapped by Hansen given the use of a 25% tree cover fraction threshold for forest definition, likely explaining the much higher forest cover in NALCMS than Hansen in EZ16 (Figure 2). EZ10 and EZ15-18 all have a relatively large fraction of taiga NF (> 0.15, Table 2), which may to some extent contribute to the differences in forest cover mapped by the two datasets. The proportion of forest cover from EOSD tends to be lower than that from Hansen and NALCMS in most ecozones across Canada.
agrees well with those from Hansen2010 (within 10%) in most regions except for EZ7, 10, and 16. Considering the completely different methods used to generate the datasets, and the different definitions for forest cover, the 10% difference is understandable. There are relatively large percentages of wetland in EZ7 (11%) and EZ10 (37%) in NALCMS ( Figure 2 and Table 2). The agreement improves for these two ecozones after wetland-treed pixels are added to the forest cover as shown in Adj_NALCMS (Figure 2). EZ16 only has a small percentage of wetland (2%), but it has a large fraction of taiga NF (0.49, Table 2), which is defined with a cover of greater than 5% (and an upper limit of ~25%). These taiga NF areas would not have been mapped by Hansen given the use of a 25% tree cover fraction threshold for forest definition, likely explaining the much higher forest cover in NALCMS than Hansen in EZ16 (Figure 2). EZ10 and EZ15-18 all have a relatively large fraction of taiga NF (> 0.15, Table 2), which may to some extent contribute to the differences in forest cover mapped by the two datasets. The proportion of forest cover from EOSD tends to be lower than that from Hansen and NALCMS in most ecozones across Canada. These results suggest that there are relatively large uncertainties in forest LC mapping in EZ7 and 10 due to larger fractions of wetland in NALCMS. Though the fraction of treed wetland can be estimated based on information from EOSD as above, EOSD mapped a much larger area of wetland than NALCMS (see below), plus the type of forest is not identified/given. We therefore do not attempt to separate treed wetland from herbaceous wetland but keep this in mind in the following analyses. These results suggest that there are relatively large uncertainties in forest LC mapping in EZ7 and 10 due to larger fractions of wetland in NALCMS. Though the fraction of treed wetland can be estimated based on information from EOSD as above, EOSD mapped a much larger area of wetland than NALCMS (see below), plus the type of forest is not identified/given. We therefore do not attempt to separate treed wetland from herbaceous wetland but keep this in mind in the following analyses. The sparse taiga NF in NALCMS is not found in the two global LC datasets, which will add some uncertainty to the results. The NF fractions from NALCMS should be biased high in ecozones with a high percentage of taiga NF (Table 2).

The Spatial Patterns and Pixel Counts
The spatial distribution of LC classes from the NALCMS, MODIS and the CCI maps are similar, while there are relatively large differences from the EOSD and GLC2000 maps compared to the other maps ( Figure 3). Relative to the NALCMS map (Figure 3a), the EOSD map ( Figure 3b) shows less NF and mosaic classes but more shrub and wetland (Table 7 and Figure 4). EOSD mapped at least 100% more wetland than any other LC product, likely due to the different definition of wetland in its legend (as wetland-treed is a treed/forest category [52]). The MODIS maps have less NF but more MF category. The CCI maps have more NF and sparse vegetation but less area classified as barren lands. CCI mapped over one million more pixels of NF than all the other maps, while it mapped a bit less of all the other tree categories relative to NALCMS. The GLC2000 map has more mosaic, shrub, and sparse vegetation and less NF, wetland, and barren land. The mosaic dominated areas in GLC2000 are mainly located in NF dominated regions according to the other maps. GLC2000 has~50% more snow and ice pixels than all the other maps. The large differences in sparse vegetation and barren lands may be partly due to the different definitions in the legends of each dataset. For example, barren lands in NALCMS/MODIS were defined as areas characterized by bare rock, gravel, sand, silt, clay, or other earthen material, with little or no "green" vegetation present and vegetation generally accounts for less than 10% of total cover, while in GLC2000, bare areas were defined as primarily non-vegetated areas containing less than four percent vegetation during at least 10 months a year, including areas like bare rock and sands.   Table 3 for details).

The spatial patterns and pixel counts
The spatial distribution of LC classes from the NALCMS, MODIS and the CCI maps are similar, while there are relatively large differences from the EOSD and GLC2000 maps compared to the other maps ( Figure 3). Relative to the NALCMS map (Figure 3a), the EOSD map (Figure 3b) shows less NF and mosaic classes but more shrub and wetland (Table 7 and Figure 4). EOSD mapped at least 100%  Table 3 for details). The sparse taiga NF in NALCMS is not found in the two global LC datasets, which will add some uncertainty to the results. The NF fractions from NALCMS should be biased high in ecozones with a high percentage of taiga NF ( Table 2).  Table 3 for details).

Changes in LC Classes between 2000 and 2010
The MODIS and CCI datasets are available for multiple years with an overlap period of 2000-2011. We include both the 2000 and 2010 datasets in our analyses and compare the pixel counts of the LC classes from the two datasets. In addition, changes in forest cover (loss/gain) are available from [47] for the period 2000-2012. Thus, we also include a comparison of changes in forest cover from MODIS, CCI, and Hansen between 2000 and 2010 for ecozones dominated by forests. For MODIS and CCI, forest cover fractions are the ratio of the total number of forested pixels (LC1-6 for MODIS and LC50-100, LC160 and LC170 for CCI) and the total number of pixels at the 300 m common grid in each ecozone.
Based on the pixel counts in 2000 and 2010, overall the MODIS maps show larger changes than the CCI maps during the period, and the change direction is not always consistent between the two datasets ( Table 7).   [44] suggested that the decreases in NF were attributable mostly to fire, insect damage, and some harvesting, which was likely not detected by CCI. The CCI user manual [12] suggested that small changes before 2003 may not be captured by the CCI maps because change detection was based on coarse-resolution satellite data. This will be discussed further in the discussion section.  Table 7 show that NF is the dominant forest type in Canada, so here we only present the accuracy results for NF. For the CCI dataset, the producer accuracies are generally high (dark red, > 85%) across Canada except for the Hudson Plains and some high relief areas in northwestern Canada (Figure 6a). In contrast, the user accuracies are in general lower than the producer accuracies except for some areas in Quebec and small areas in western Canada, with values less than 50% (blue to green color in Figure 6b) in large areas, such as the taiga shield in Quebec and areas west of Hudson Bay. Areas with high producer but low user accuracies indicate an over-mapping of NF by CCI. For the GLC2000 dataset, the producer accuracies are generally low across Canada but the user accuracies are relatively higher in most areas of Canada, which indicates an under-mapping of NF (Figure 6c  and 6d). Areas to the southwest of Hudson Bay are an exception where there are high producer but low user accuracies in GLC2000, indicating an over-mapping of NF there. As mentioned in Section 4.1, there are relatively large uncertainties in forest mapping in EZ7 and 10, due to large fractions of wetland in NALCMS. Thus, there are relatively large uncertainties in the accuracy estimation in those  Table 7 show that NF is the dominant forest type in Canada, so here we only present the accuracy results for NF. For the CCI dataset, the producer accuracies are generally high (dark red, > 85%) across Canada except for the Hudson Plains and some high relief areas in northwestern Canada (Figure 6a). In contrast, the user accuracies are in general lower than the producer accuracies except for some areas in Quebec and small areas in western Canada, with values less than 50% (blue to green color in Figure 6b) in large areas, such as the taiga shield in Quebec and areas west of Hudson Bay. Areas with high producer but low user accuracies indicate an over-mapping of NF by CCI. For the GLC2000 dataset, the producer accuracies are generally low across Canada but the user accuracies are relatively higher in most areas of Canada, which indicates an under-mapping of NF (Figure 6c,d).
Areas to the southwest of Hudson Bay are an exception where there are high producer but low user accuracies in GLC2000, indicating an over-mapping of NF there. As mentioned in Section 4.1, there are relatively large uncertainties in forest mapping in EZ7 and 10, due to large fractions of wetland in NALCMS. Thus, there are relatively large uncertainties in the accuracy estimation in those regions. The frequencies of NF, MF, and Mosaic classes for ecozones with more than 10% cover are shown in Figure 7. The results show that relative to NALCMS, the GLC2000 dataset mapped less NF for most ecozones except for EZ7 and 10, while the CCI dataset mapped more or nearly the same NF (Figure 7a). These are consistent with the accuracy assessment results shown in Figure 6. If we assume all wetland-treed pixels are NF and add that to NALCMS (Table 2), this mainly increases the percentage of NF in EZ7 and 10. The GLC2000 dataset mapped more MF and Mosaic classes than NALCMS and CCI in most ecozones (Figure 7b and 7c).  Table 8 shows the fractional error matrices revealing more detailed class compositions of the CCI map. Values at the rows show the mean sub-pixel fractional cover of LC classes in the NALCMS dataset contributing to a given class in the CCI dataset across 18 ecozones in Canada. Values at the diagonal of the table are the sub-pixel fractional accuracies, which show the fraction of a given CCI The frequencies of NF, MF, and Mosaic classes for ecozones with more than 10% cover are shown in Figure 7. The results show that relative to NALCMS, the GLC2000 dataset mapped less NF for most ecozones except for EZ7 and 10, while the CCI dataset mapped more or nearly the same NF (Figure 7a). These are consistent with the accuracy assessment results shown in Figure 6. If we assume all wetland-treed pixels are NF and add that to NALCMS (Table 2), this mainly increases the percentage of NF in EZ7 and 10. The GLC2000 dataset mapped more MF and Mosaic classes than NALCMS and CCI in most ecozones (Figure 7b,c).   Table 8 shows the fractional error matrices revealing more detailed class compositions of the CCI map. Values at the rows show the mean sub-pixel fractional cover of LC classes in the NALCMS dataset contributing to a given class in the CCI dataset across 18 ecozones in Canada. Values at the  Table 8 shows the fractional error matrices revealing more detailed class compositions of the CCI map. Values at the rows show the mean sub-pixel fractional cover of LC classes in the NALCMS dataset contributing to a given class in the CCI dataset across 18 ecozones in Canada. Values at the diagonal of the table are the sub-pixel fractional accuracies, which show the fraction of a given CCI class being mapped as that class by NALCMS. For example, the MF class in CCI is mainly composed of NF (0.23), BF (0.24), MF (0.28), Shrub (0.13) and other classes with small fractions in NALCMS. This suggests that near equal fractions should be assigned to NF and BF for the MF class in CCI in its CW table, which is the case in Table 5, while for the Mosaic class in CCI, nearly double the fraction should be assigned to NF (0.22) than to BF (0.12). However, the fractions for NF are 1/3 of BF in Table 5 for the two mosaic classes (100/110), likely impropriate for use in Canada. The accuracy for the wetland class is rather low (0.16), with most of the wetland class in CCI composed of NF (0.35) and shrub (0.12) in NALCMS. This suggests large discrepancies in this class between the two datasets.  Unlike most previous studies, which used limited sample points or sites in the assessments, we include all CCI pixels for a given class in each ecozone, and the ecozones are large (composed of millions of pixels each), and thus with low homogeneity (see last column in Table 2). NF is the dominant LC class for 15 out of the 18 ecozones, and the remaining LC classes do not cover large areas of Canada. These may partly explain the rather low accuracies in Table 8a, especially for LC3 to LC8 (<0.3). The bottom row in Table 8a shows the maximum accuracy for a given class among the 18 ecozones. The maximum accuracies for LC class of MF, shrub and wetland are still quite low (<0.6), which are likely due to the low mapping accuracies of these classes in both datasets [12,40]. The accuracy for LC classes of urban, water, and snow/ice are among the highest in Table 8a. These classes were largely mapped using external datasets in the CCI product [12]. There are some improvements in the accuracies for most classes when only homogeneous areas are considered (Table 8b). However, the fraction of each class occurring in homogeneous areas is low (< 0.31) except for LC class 1, 7, 9, 12-13 (last row in Table 8b), consistent with the low coverage of most LC classes in Canada (Tables 2 and 7).

Comparison of CLASS PFTs Derived from GLC2000 and CCI
There are large differences in the fractional coverage of all four PFTs from the GLC2000 and the CCI datasets ( Figure 8). Compared to the CCI dataset, the GLC2000 dataset mapped more NF fractions in central Canada, the Hudson Plains, and southern Quebec, while it mapped smaller NF fractions in northwestern NA (Figure 8a). The GLC2000 dataset tends to map more BF fractions in areas where it mapped less NF (Figure 8b). This is at least partly due to the fact that GLC2000 mapped large areas in northwestern Canada as mosaic forests and Wang et al. [57] attributed this class to BF and grass PFTs (Table 4), although NF dominates most of those areas according to the other LC datasets (Figure 3).
The GLC2000 dataset mapped about 40% less fractions of crop than CCI, and it mapped more fractions of grass in the same area (Figure 8c,d). This is in contrast with the frequency comparison results showing that the percentages for crop from the GLC2000 and the CCI are within 10% (not shown). A comparison of the fractions for the main crop categories in the CW tables (Tables 4 and 5) for the GLC2000 and CCI datasets reveals the causes of this inconsistency. In our domain, the main crop class is LC16 from GLC2000 and it has a fraction of 0.5 for crop, while the rest is attributed to either grass or bare ground (Table 4). For the CCI dataset, the main crop class is LC11 and it is 100% attributed to crop (Table 5). Thus, smaller fractions of crop from GLC2000 than those from CCI are mainly due to the smaller fractions in their respective CW tables. In addition, GLC2000 mapped less fractions of grass in the arctic tundra than CCI. Remote Sens. 2019, 11, x FOR PEER REVIEW 21 of 27

Comparison of CLASS PFTs derived from GLC2000 and CCI
There are large differences in the fractional coverage of all four PFTs from the GLC2000 and the CCI datasets ( Figure 8). Compared to the CCI dataset, the GLC2000 dataset mapped more NF fractions in central Canada, the Hudson Plains, and southern Quebec, while it mapped smaller NF fractions in northwestern NA (Figure 8a). The GLC2000 dataset tends to map more BF fractions in areas where it mapped less NF (Figure 8b). This is at least partly due to the fact that GLC2000 mapped large areas in northwestern Canada as mosaic forests and Wang et al. [57] attributed this class to BF and grass PFTs (Table 4), although NF dominates most of those areas according to the other LC datasets (Figure 3).
The GLC2000 dataset mapped about 40% less fractions of crop than CCI, and it mapped more fractions of grass in the same area (Figure 8c and 8d). This is in contrast with the frequency comparison results showing that the percentages for crop from the GLC2000 and the CCI are within 10% (not shown). A comparison of the fractions for the main crop categories in the CW tables (Table  4 and 5) for the GLC2000 and CCI datasets reveals the causes of this inconsistency. In our domain, the main crop class is LC16 from GLC2000 and it has a fraction of 0.5 for crop, while the rest is attributed to either grass or bare ground (Table 4). For the CCI dataset, the main crop class is LC11 and it is 100% attributed to crop (Table 5). Thus, smaller fractions of crop from GLC2000 than those from CCI are mainly due to the smaller fractions in their respective CW tables. In addition, GLC2000 mapped less fractions of grass in the arctic tundra than CCI.

Discussion
The NALCMS LC dataset at 30 m resolution is used as the main reference data to assess the two global LC datasets in this study, which is not ideal as it is a satellite-derived product itself and subject to the limitations of such products (e.g., radiometric saturation, ambiguous spectral separability). We estimate the uncertainties in NALCMS by comparing it with the reference sample data, the EOSD, and the Hansen tree cover fraction datasets with regard to their capability in forest cover mapping. The results show that the wetland class in NALCMS exerts the largest uncertainty in forest cover mapping because treed wetland was not separated from herbaceous wetland in the legend. The inclusion of a taiga NF class with a lower tree cover threshold (>5%) than that used in the global datasets may also exert some uncertainty in NF mapping in NALCMS, which should be biased high relative to the other LC datasets in ecozones with a high percentage of taiga NF.
To compare the LC maps, we converted the maps into a 13-class common legend. The results show that using NALCMS as a reference, the 300 m CCI dataset provides much improved LC distribution over that from the 1 km GLC2000 dataset. Its more detailed LC classes and legend descriptions benefit the partitioning of LC classes into PFTs for use in LSMs. GLC2000 mapped too little NF in northwestern Canada, but too much in the Hudson Plains and some areas in northern Quebec, and it mapped a large area as MF and Mosaic categories, which results in large uncertainties for PFT mapping (i.e., unknown forest type). However, the CCI dataset appears to map more NF over Canada than all other LC datasets included in this study (e.g., Table 7). This is consistent with assessment results in previous studies. Based on the GlobCover 2005 reference dataset, Tsendbazar et al. [20] showed that the CCI Phase I product (epochs 2005) had high producer (85.7%) but low user (58.9%) accuracies for evergreen needleleaf forest (ENF) (their Table 3). This over-mapping of ENF has not been improved much in the CCI Phase II product (used in this study) as shown in the user manual [12]. Based on the GlobCover 2009 reference dataset, the producer/user accuracies for ENF were found to be either 78%/64% for all sites or 85%/66% for homogeneous sites (Tables 3-6 and 3-7 in [12]). Investigation of the causes of this ENF over-mapping suggests the following. Firstly, the main classification process of the CCI LC chain relies on MERIS full resolution mean seasonal surface reflectance composites [59], whose number and period lengths vary regionally. In Canada and other high latitude regions (including most of the boreal forest regions where ENF dominates), usually only one 4-month seasonal composite centred on mid-July is available due to the long cloud and snow cover periods [48], which may have contributed to mislabeling other forest types to ENF and thus the over-mapping. Secondly, values in Tables 3-6 and 3-7 in [12] suggested that this was mainly due to confusion with mixed forests (90), shrubs (120) and grass (130). These LC classes all have low mapping accuracies in current global LC maps, likely due to the fact that most landscapes in these environments are a mixture of different types of vegetation [19,20]. We do not separate ENF from DNF (deciduous needleleaf forest) in this study for DNF accounts for less than 1% of NF over Canada according to the CCI dataset.
We have attempted to assess changes in the LC classes and forest cover by comparing the datasets between 2000 and 2010. The results show that relative to the MODIS maps, the CCI maps show smaller changes and/or changes in the opposite direction for most LC classes. The CCI maps indicate a forest gain while both the MODIS and Hansen datasets indicate a forest loss in most ecozones across Canada during the 2000 to 2010 period. This is consistent with results shown in [60]. Their explanation was that small-scale changes were not captured in the CCI product due to the use of coarse-resolution 1 km data for change detection before 2003 (300 m data were used after 2003), which was also acknowledged in the user manual [12].To verify this we include a comparison of changes in forest cover between 2004 and 2010 from the MODIS and CCI datasets (dashed lines in Figure 5). The results show close agreement for changes from CCI and MODIS in EZ1-6 and 8-10, while there are still large differences for the remaining ecozones. On one hand, this confirms that the use of coarse-resolution data for change detection before 2003 does contribute to large uncertainties in the CCI datasets. On the other hand, it suggests that changes in CCI are not always consistent with those from other datasets especially in northwestern Canada (EZ11, [14][15][16][17][18]. Changes in forest cover from the Hansen dataset are consistent with those from MODIS over the 2000 to 2010 period. However, Hansen shows more forest cover relative to both the reference samples and the NALCMS and EOSD datasets in EZ7 and 10, suggesting that the Hansen tree cover fraction dataset overestimates tree cover fraction, at least for tree cover fraction less than 25% (the threshold used for their forest definition), in wetland environments. Nevertheless, we demonstrate that the Hansen dataset is useful in assessing the performance of LC datasets in forest cover mapping. Given the limitations in currently available LC datasets and the importance of discriminating tree cover from short vegetation and bare ground in LSMs, an accurate vegetation continuous field tree cover fraction product would be very useful to inform the partitioning of LC classes into PFTs. Similar products can also be derived from the Laser Altimeters (by estimating tree cover fraction above a certain height) onboard the newly launched Ice, Cloud, and Land Elevation Satellite-2 and the Global Ecosystem Dynamics Investigation missions, which will be an important addition to the optical satellite-based products, e.g., [61].
A sub-pixel fractional error analysis was carried out for the 300 m CCI dataset using the 30 m NALCMS dataset as a reference, which revealed the detailed composition of the CCI LC classes. This will be especially useful for determining appropriate PFT fractions for the MF and Mosaic LC classes, and thus to develop a CW table representative of LC characteristics of Canada. Similar analyses can be done for other regions where fine-resolution LC datasets are available. Eventually a CW table specific to regional climate and landscapes can be created for the whole globe.
We amalgamated the 10 PFTs in the default CW table provided in the CCI LC product user manual into four broad PFTs in CLASS, and compared them with those based on the GLC2000 dataset currently used to provide initial surface conditions in the ECCC models. The results show that there are large differences in all four PFTs based on the two datasets, which should significantly impact model simulations [62]. Some of the differences are attributable to differences in the LC classes from the two datasets; however, fractions in the CW table appear to play a bigger role in most cases. This is consistent with findings in [31]. Due to the large ranges in some of the LC classes (i.e., >15% for forest classes), there are large uncertainties in the current available CW table for the CCI dataset [58].

Conclusions
Previous studies demonstrated the necessity for developing higher than 1 km resolution LC products in order to improve LC mapping in heterogeneous landscapes and the transition zones where LC changes primarily occur, and this was reinforced by findings in this study. The applicability of the CCI LC dataset for use in CLASS is being assessed through comparisons with the GLC2000 dataset currently in use in the model and with finer resolution (~30 m) datasets over Canada. The assessment results show that in comparison with the finer resolution maps over Canada, the 300 m resolution CCI dataset provides much improved LC distribution over that from the 1km resolution GLC2000 dataset. The GLC2000 mapped too little needleleaf forest in northwestern Canada but too much in the Hudson Plains, and it mapped a large area as mixed and mosaic forest categories, which resulted in large uncertainties for PFT mapping.
However, the CCI dataset appears to overestimate needleleaf forest cover especially in the taiga-tundra transition zone of northwestern Canada. This may have partly resulted from limited availability of clear sky MERIS images used to generate the CCI classification maps due to the long snow cover season in Canada, as well as the generally low mapping accuracies for mixed trees, shrubs, and grass in current LC datasets. Assessment of changes in LC mapping suggests that the CCI annual time series often show contradicting results relative to other datasets, and therefore caution should be exercised when using the annual CCI time series, especially prior to 2003, to represent changes in LC distribution. Nevertheless, the detailed LC classes and finer spatial resolution in the CCI dataset provide an improved reference map for use in CLASS and other LSMs.
Comparison of PFTs derived from the two global datasets suggests that uncertainties in the current CW tables are a major source of the often large differences in the PFT maps, and should be an area of focus in future work. Results in this study also suggest that the sub-pixel error analyses and continuous vegetation field tree cover fraction products are potentially useful for reducing uncertainties in the CW table and thus PFT mapping.
Author Contributions: L.W. conceived and designed this research, and wrote majority of the manuscript; P.B. created the CCI CW table, participated in discussions about the research and contributed to the writing; D.P. reinterpreted the reference sample data and contributed helpful advice for understanding the datasets and results, E.C. designed the reprojection/regridding method for the datasets and contributed to some data analyses; C.L. investigated the causes of needleleaf forest over-mapping in the CCI product and contributed helpful advice for understanding the dataset. All authors provided helpful comments and suggestions which improved the manuscript.
Funding: This research received no external funding.