Accuracy Assessment of GlobeLand 30 2010 Land Cover over China Based on Geographically and Categorically Stratified Validation Sample Data

Land cover information is vital for research and applications concerning natural resources and environmental modeling. Accuracy assessment is an important dimension in use and production of land cover information. GlobeLand30 is a relatively new global land cover information product with a fine spatial resolution of 30 m and is potentially useful for many applications. This paper describes the methods for and results from the first country-wide and statistically based accuracy assessment of GlobeLand3


Introduction
As Earth observation technologies have evolved, a huge amount of remote-sensing image data covering the whole globe has been acquired. Multi-scale land cover datasets are widely used in Earth surface process research, ecosystem assessments, environmental modeling, and sustainable development planning [1][2][3]. Several global land cover products have been developed by national and international organizations during the past decades. These include, for example, the International Geosphere-Biosphere Program Data and Information System's land cover (IGBP DISCover) data product (1 km spatial resolution) [4], the University of Maryland (UMD) land cover maps (1 km resolution) [5], the Global Land Cover 2000 (GLC2000) maps from the European Commission's Joint Research Center (JRC) (1 km resolution) [6], the Moderate Resolution Imaging Spectroradiometer (MODIS) land cover maps (500/1000 m resolution) [7], the GlobCover land cover maps from the European Space Agency (ESA), 2005-2006/2009 (300 m resolution) [8,9], the Climate Change Initiative Land Cover maps (CCI-LC) from ESA, 2000/2005/2010 (300 m resolution) [10], and the Global Land Cover by National Mapping Organizations (GLCNMO) maps, 2003/2008/2013 (500/1000 m resolution) [11]. Recently, GlobeLand30 land cover datasets have been developed by the National Geomatics Center of China, based on integration of pixel- and object-based classification followed by knowledge-based interactive verification of classification results (known for short as the POK-based approach) [12]. This product has been available worldwide since September 2014 [12]. In comparison with previously developed global land cover products, GlobeLand30 has the finest spatial resolution (30 m) and thus can provide more detailed information about land cover status and dynamics.
The importance of accuracy assessment is increasingly recognized. Accuracy assessment helps map users to evaluate the utility of land cover maps for their intended applications [13]. It also informs map production so that map quality may be improved through adopting more robust classifiers, exploring more informative class-discriminant features, incorporating extra image data or ancillary data, or a combination of these.
To verify GlobeLand30's accuracy globally, a preliminary assessment was conducted based on a two-rank sampling strategy. In first-rank (first-stage) sampling, map sheets were selected globally, while in second-rank (second-stage) sampling, sample data for each land cover type within each of the selected map sheets were collected [14]. Eighty sample map sheets were selected from a total of 847 map sheets in the first-rank sampling. A total of 159,874 sample pixels were selected for the assessment of GlobeLand30 2010. The overall accuracy of the product was estimated at about 80.3 ± 0.2% [15].
However, there have been few accuracy assessment efforts for GlobeLand30 in China. Below, we review relevant results from country-wide, thematic (e.g., cropland), and regional assessments, respectively. To compare and assess seven global land cover datasets (including GlobeLand30) in China, Yang et al. [29] manually collected five sets of multi-scale validation sample units (VSUs), with spatial resolution ranging from 600 m × 600 m to 2 km × 2 km. GlobeLand30 2010 was found to be the most accurate dataset examined, with an overall accuracy of 82.4% reported based on 1063 validation sample units (with 600 m resolution, implying coarsening for GlobeLand30 and exclusion of sample units at heterogeneous locations), while the overall accuracies of the other datasets ranged from 33.9% to 67.2% [29].
Lu et al. [30] compared five global cropland datasets in China for the year 2010. Cropland census data at the provincial and regional scales were used to evaluate the cropland areas derived from these datasets, while 5704 validation sample units (including reference data at 2130 test sample units originally acquired for validating the Finer Resolution Observation and Monitoring of Global Land Cover (FROM-GLC) dataset by Gong et al. [31]) were utilized to verify locations of cropland parcels. The results showed that GlobeLand30 datasets are the most accurate for cropland area estimation, with an overall accuracy of 79.6%.
Regional accuracy assessments of GlobeLand30 were also undertaken in China. For example, accuracy assessments of GlobeLand30 2010 in Shaanxi Province and Henan Province indicated overall accuracies of 80.0% and 81.5%, respectively [32,33].
Clearly, validation results in [30] and [32,33] need to be extended in terms of thematic and geographic coverage, respectively. For the country-wide assessment reported by Yang et al. [29], sampling was not probability-based, as the authors acknowledged. Besides, the assessment was not done at the nominal (spatial) resolution of 30 m but at a coarsened resolution of 600 m. This country-wide assessment would be usefully strengthened with respect to statistical soundness and resolution refinement to enable inter-comparisons between classification accuracies of products with comparable resolution.
We should look for theoretical guidance and good practice examples regarding accuracy assessment from the literature. For example, the National Land Cover Database (NLCD) series products (also of 30 m resolution), which were developed by the Multi-Resolution Land Characteristics (MRLC) Consortium (www.mrlc.gov), provide consistent land cover information in the United States from decadal Landsat satellite imagery and other supplementary datasets [34]. Accuracy assessments were carried out for NLCD land cover following the well-established statistical framework and procedures in [35][36][37][38][39]. To facilitate sustainable research and development concerning use and production of GlobeLand30 datasets globally and in China in particular, country-wide assessments of GlobeLand30 over China should be pursued based on well-established procedures of sampling design, response design, and analysis. This is what this paper seeks to contribute.
The remainder of the paper is organized as follows. Section 2 describes the methods for accuracy assessment of GlobeLand30 land cover in China. Section 3 reports accuracy assessment results obtained for different regions and for the whole of China. In Section 4, the main results are discussed, with patterns of misclassifications summarized, followed by conclusions in Section 5.

Methods
In GlobeLand30 maps, ten land-cover classes (as defined in Table 1) are depicted across the world at a nominal pixel resolution of 30 m, in accordance with a hierarchical classification method featuring pixel classification, object-based post-processing, and knowledge-based interactive verification [15]. In this paper, the accuracy assessment of GlobeLand30 2010 was performed by following the guidance suggested by leading scientists in the field, such as Stehman et al. [40] and Olofsson et al. [41]. According to these authors, the three major components of accuracy assessment are sampling design, response design, and analysis, which are described in this section.
In this paper, the spatial units for sampling and analysis concerning accuracy assessment are pixels. Although there is debate over the most appropriate sample unit for accuracy assessment, Stehman et al. [42] suggested that using pixels as sample units could avoid many of the complications that might arise when other spatial support units (e.g., blocks of pixels and polygons) were used. Use of pixels as units for sampling and analysis is a well-established practice in the validation of NLCD land cover information [35][36][37][38][39].

Sampling Design
It may not be feasible to compare GlobeLand30 maps pixel by pixel with reference data of complete coverage even for a county, let alone the whole of China (with 34 provincial administrative regions). Sampling design determines both the cost and statistical rigor of accuracy assessment [43]. Through sampling design, we select a certain number of locations at which reference data will be collected. In this paper, two-level stratified random sampling is adopted. The first level of stratification is based on partitioning Chinese administrative regions (including the regions of Hong Kong, Macao, and Taiwan, but with small islands and the seas excluded) into 10 geographical regions (Figure 1). The regional stratification helps to reduce standard errors in accuracy estimates, facilitate regional reporting of accuracy, and provide an indication of how accuracy varies spatially across China. The second level of stratification is by GlobeLand30 map classes (Table 1 and Figure 1). In terms of sample size, each class in each region is allocated 100 sample pixels. The statistical explanation is that 100 sample pixels per stratum results in an expected standard error of 0.05 for simple random sampling within a stratum, if the true user's accuracy is 50% with a confidence level of 95% [35]. It should be noted that the tundra area in China is almost zero, so tundra was not included in the validation. In addition, permanent snow and ice is only distributed in R1-R4. Thus, in total, 8400 sample pixels were collected for the accuracy assessment, as shown in Table 2.
Table 2. The regional distributions of the mapped land-cover type (percent of area) for GlobeLand30 2010 over China. The sample size is 100 for all land-cover types in all regions, with the exception that permanent snow and ice is not sampled in R5-R10.
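The sample-size reasoning in the sampling design can be checked with a short calculation. The sketch below is illustrative only: it assumes the usual binomial standard error formula sqrt(p(1 - p)/n) for simple random sampling within a stratum, and the function name `binomial_se` is hypothetical rather than taken from the study.

```python
import math


def binomial_se(p, n):
    """Standard error of a proportion p estimated from n simple random samples."""
    return math.sqrt(p * (1.0 - p) / n)


# A true user's accuracy of 50% gives the largest variance for a proportion.
se = binomial_se(0.5, 100)

# Strata implied by the text: 9 classes (tundra excluded) in R1-R4 and
# 8 classes (no permanent snow and ice) in R5-R10, 100 pixels per stratum.
n_strata = 4 * 9 + 6 * 8
total_pixels = n_strata * 100

print(round(se, 3), n_strata, total_pixels)
```

The stratum count obtained this way (84) matches the number of strata mentioned in Appendix A, and the total matches the 8400 sample pixels reported above.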

Response Design
Reference land-cover labels for sample pixels are usually obtained from images of finer resolution. Online access to reference images is greatly enhanced thanks to Google Earth. GlobeLand30 2010 datasets for China are available in raster format with 54 tiles; the datasets are provided in the WGS84 (World Geodetic System 1984) reference system and UTM (Universal Transverse Mercator) projection. Before selecting sample locations, the datasets need to be re-projected into the Web Mercator projection used for Google Earth. Other available datasets (e.g., Bing maps, Yahoo maps and color composites of Landsat images) were used when Google historical images of good quality were not available.
Three experienced interpreters carried out reference data acquisition. All reference data for each individual region were collected by a single interpreter. Interpreters had no a priori knowledge of map land-cover labels, to avoid interpreter bias [36]. In addition to the primary reference land-cover labels, alternate land-cover labels should also be recorded for sample locations that would be more appropriately labeled with both reference labels than with either one alone. A nominal level of confidence was assigned to reference land-cover labels, namely "confident", "somewhat confident", and "not confident", as in Wickham et al. [36]. The acquisition dates of Google historical images (used for determining the reference land-cover classes) and satellite images (used for map production, acquired by Landsat satellites and the Chinese Environmental and Disaster satellite) were also recorded, since the satellite images used for GlobeLand30 2010 were not restricted to the year 2010 alone.
After the first round of visual image interpretation, verification of "somewhat confident" and "not confident" reference class labels was performed by the project manager, who was typically the most experienced photo-interpreter in the team, to minimize errors in reference data. Consistency in reference label assignments within the team is also important, but precedence is given to the project manager's assignments when label assignments disagree. Particular attention should be paid to "not confident" labels, which are allowed to be re-assigned to "somewhat confident" or "confident" labels. In this step, ancillary data, such as DEM data, local photos, and Landsat images, could be used to assist image interpretation.
With consideration for positional uncertainty and thematic ambiguity between land-cover map and reference data, agreement is registered for a sample pixel if the map class matches either the primary or alternate reference label therein [39]. To examine the effects of surrounding pixels on accuracy and to increase the reliability of interpretation, reference land-cover labels in a 3 × 3-pixel neighborhood centered on each sample pixel were also collected.
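The agreement rule described in the response design can be written as a small predicate. This is a minimal sketch, not the authors' implementation: the function name `agrees`, its arguments, and the use of `None` for a missing alternate label are all assumptions made for illustration. The `strict` flag corresponds to the stricter definition used later in the paper (primary label only).

```python
def agrees(map_label, primary, alternate=None, strict=False):
    """Return True if the map label agrees with the reference data.

    Less strict definition: match with either the primary or the alternate label.
    Strict definition: match with the primary label only.
    """
    if map_label == primary:
        return True
    if strict or alternate is None:
        return False
    return map_label == alternate


# Hypothetical sample pixel: mapped as forest, interpreted as shrubland
# with forest recorded as the alternate reference label.
print(agrees("forest", "shrubland", "forest"))               # True
print(agrees("forest", "shrubland", "forest", strict=True))  # False
```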

Analysis
By cross-tabulation between map and reference classes at sample pixels, an error matrix can be constructed, allowing for estimating accuracy indicators, such as overall accuracy, user's accuracy, and producer's accuracy [44]. An example error matrix, as shown in Table 3, is constructed from sample counts, where an element $n_{ij}$ represents the number of sample pixels classified as map category i (i = 1, 2, ..., k; k being the total number of candidate classes) but actually verified to be reference category j (j = 1, 2, ..., k). In Table 3 (and Table 4 below), the rows indicate map classes while the columns represent reference classes. Olofsson et al. [45] described a more informative presentation of the error matrix. It is in terms of unbiased estimators of area proportions in cells (i, j) corresponding to map-reference class label pairs i and j:

$$\hat{p}_{ij} = W_i \frac{n_{ij}}{n_{i\cdot}}, \qquad (1)$$

where $n_{i\cdot}$ is the number of sample pixels in map category i (the row total of Table 3), and $W_i$ represents the proportion of the area mapped as category i (in the region under study) and is calculated as

$$W_i = \frac{A_{m,i}}{A_{tot}},$$

where $A_{tot}$ is the total area of the region under study and $A_{m,i}$ (subscript m denotes "mapped") is the mapped area of category i. An example error matrix populated with estimated area proportions is shown in Table 4, from which accuracy measures can be computed (see Equations (2)-(4)). Estimating the proportion of area in each cell of the error matrix using Equation (1) takes into account the inclusion probabilities of the stratified design. An inclusion probability is defined as the probability that a particular pixel is included in the sample. Unlike in simple random and systematic sampling, where the inclusion probability of each selected pixel is the same (so that accuracy measures may be computed directly based on Table 3), in stratified random sampling, sample units from different strata usually have different weights, as sample units from different strata likely have different inclusion probabilities. Thus, area proportions of the map classes ($W_i$) must be incorporated in the stratified estimators of overall and producer's accuracies to account for different sampling intensities in different strata.
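To make the conversion from sample counts to area proportions concrete, the sketch below assumes that Equation (1) takes the form commonly given for stratified designs, namely cell proportion = W_i × n_ij / (row total of map class i). The function name `area_proportions` and the two-class counts and weights are made up for illustration; they are not data from this study.

```python
def area_proportions(counts, weights):
    """Convert an error matrix of sample counts into estimated area proportions.

    counts[i][j]: sample pixels mapped as class i with reference class j.
    weights[i]:   W_i, the mapped area proportion of class i (weights sum to 1).
    """
    proportions = []
    for row, w in zip(counts, weights):
        n_row = sum(row)  # sample size allocated to map class i
        proportions.append([w * n_ij / n_row for n_ij in row])
    return proportions


# Hypothetical two-class example with 100 sample pixels per map class.
counts = [[90, 10],
          [20, 80]]
weights = [0.7, 0.3]
p_hat = area_proportions(counts, weights)
for row in p_hat:
    print([round(v, 3) for v in row])
```

Because each row is rescaled by its mapped area proportion, the cells of the resulting matrix sum to one even though both map classes received the same number of sample pixels.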
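Once the area proportions are estimated (Equation (1)) as in Table 4, user's accuracy ($\hat{U}_i$) and producer's accuracy ($\hat{P}_j$) for any category and overall map accuracy ($\hat{O}$) are estimated directly from area proportions (Table 4). The estimators are:

$$\hat{U}_i = \frac{\hat{p}_{ii}}{\hat{p}_{i\cdot}}, \qquad (2)$$

$$\hat{P}_j = \frac{\hat{p}_{jj}}{\hat{p}_{\cdot j}}, \qquad (3)$$

$$\hat{O} = \sum_{j=1}^{k} \hat{p}_{jj}, \qquad (4)$$

where $\hat{p}_{ij}$ is the estimated area proportion in cell (i, j), and $\hat{p}_{i\cdot}$ and $\hat{p}_{\cdot j}$ are the totals of row i and column j, respectively. Variance estimators are also necessary for accuracy assessment, as they are used to calculate confidence intervals [43]. For stratified random sampling, the estimated variance of overall accuracy is

$$\hat{V}(\hat{O}) = \sum_{i=1}^{k} \frac{W_i^2 \, \hat{U}_i (1 - \hat{U}_i)}{n_{i\cdot} - 1}, \qquad (5)$$

where $W_i$ is the mapped area proportion of category i and $n_{i\cdot}$ is the number of sample pixels in map category i. As for the estimated variance of user's and producer's accuracy, the method described by Wickham et al. [38] is usually applied. Further detail about variance estimation is provided in Appendix A.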
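The accuracy estimators can be illustrated with a few lines of code. The sketch below assumes the standard forms for stratified random sampling: user's accuracy as diagonal over row total, producer's accuracy as diagonal over column total, overall accuracy as the sum of the diagonal, and the variance of overall accuracy as the sum over map classes of W_i² U_i (1 - U_i) / (n_i - 1). Function names and the two-class matrix are hypothetical and are not taken from the study.

```python
import math


def accuracy_from_proportions(p):
    """User's, producer's and overall accuracy from an area-proportion matrix."""
    k = len(p)
    row_totals = [sum(p[i]) for i in range(k)]
    col_totals = [sum(p[i][j] for i in range(k)) for j in range(k)]
    users = [p[i][i] / row_totals[i] for i in range(k)]
    producers = [p[j][j] / col_totals[j] for j in range(k)]
    overall = sum(p[i][i] for i in range(k))
    return users, producers, overall


def overall_accuracy_variance(weights, users, row_sample_sizes):
    """Estimated variance of overall accuracy under stratified random sampling."""
    return sum(w * w * u * (1.0 - u) / (n - 1)
               for w, u, n in zip(weights, users, row_sample_sizes))


# Hypothetical area proportions for two classes (rows: map, columns: reference).
p_hat = [[0.63, 0.07],
         [0.06, 0.24]]
ua, pa, oa = accuracy_from_proportions(p_hat)
var_oa = overall_accuracy_variance([0.7, 0.3], ua, [100, 100])
print([round(u, 3) for u in ua], [round(v, 3) for v in pa], round(oa, 3))
print(round(math.sqrt(var_oa), 4))  # standard error of overall accuracy
```

An approximate 95% confidence interval for overall accuracy would then be the estimate plus or minus 1.96 times this standard error.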
Equations (1)-(5) are applied for estimating accuracies on the assumption that stratified random sample data are used. For the study in this paper, a two-level stratified sampling design was adopted (Section 2.1). Thus, Equations (1)-(5) are not directly applicable for estimating national accuracies, although they are perfectly suitable for estimating accuracies in individual regions where map-class-stratified random sampling was applied independently.
To evaluate nationwide accuracies, regional error matrices first need to be aggregated to a national error matrix. This is done by summing up corresponding cell values at (i, j) (i.e., area proportions for map-reference class pair (i, j)) in regional error matrices, weighted by the individual regions' areal proportions in the entire study area. Consider the study here as an example. The ten regional error matrices were estimated using Equations (1)-(4), as listed in Appendix B. The weights ($W_h$ in Equation (6)) were the ten regions' areal proportions relative to the whole study area (Table 2, the last row, Row "Country-wide proportion"). In terms of a formula, the aggregated proportion for a particular class pair (i, j) ($\hat{P}_{i,j}$) is computed as a weighted sum of corresponding proportions in regional error matrices:

$$\hat{P}_{i,j} = \sum_{h=1}^{H} W_h \, \hat{P}_{i,j|h}, \qquad (6)$$

where H is the total number of regional strata in the study area (H = 10 for this paper), $W_h$ is region h's areal proportion in the whole study area ($W_h = N_h/N$, with $N_h$ being the population size of region h, and N being the population size for the whole study area), and $\hat{P}_{i,j|h}$ represents cell (i, j) in region h's error matrix. These properly calculated cell values ($\hat{P}_{i,j}$) constitute the national error matrix. Based on it, national accuracy indicators are computed using Equations (2)-(4) by inserting properly estimated $\hat{P}_{i,j}$ values (Equation (6)), as was implemented in this study.
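The aggregation step is a cell-by-cell weighted sum, which the following sketch illustrates. It assumes each regional matrix already holds area proportions that sum to one within its region and that the regional weights sum to one; the function name `aggregate_regional_matrices` and the two-region, two-class numbers are invented for illustration.

```python
def aggregate_regional_matrices(regional, region_weights):
    """Weighted cell-by-cell sum of regional area-proportion matrices.

    regional[h][i][j]: area proportion of cell (i, j) within region h.
    region_weights[h]: W_h, region h's share of the whole study area.
    """
    k = len(regional[0])
    national = [[0.0] * k for _ in range(k)]
    for p_h, w_h in zip(regional, region_weights):
        for i in range(k):
            for j in range(k):
                national[i][j] += w_h * p_h[i][j]
    return national


# Two hypothetical regions covering 60% and 40% of the study area.
regional = [
    [[0.63, 0.07], [0.06, 0.24]],
    [[0.18, 0.02], [0.08, 0.72]],
]
national = aggregate_regional_matrices(regional, [0.6, 0.4])
overall = sum(national[i][i] for i in range(2))
print([[round(v, 3) for v in row] for row in national], round(overall, 3))
```

In this toy case the national overall accuracy (0.882) equals the area-weighted mean of the two regional overall accuracies (0.87 and 0.90), which is what the weighted sum implies for the diagonal.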
Alternatively, country-wide accuracy estimates can be computed by following the method described in Wickham et al. [37,38]. It is the so-called combined ratio estimator based on suitably formulated indicator functions. This method is highly recommended for computing accuracy estimates in situations where there is more than one level of stratification, as in this case study, and where strata differ from the land cover classes for which accuracy indicators need to be estimated [37,46].
In Appendix A, we provide further detail about how national accuracy indicators are estimated based on aggregation of regional error matrices and the combined ratio estimator. We also show the equivalence of these two methods, with explanations and using data from the study reported in this paper.

Results
In this section, we mainly report results of accuracy assessment based on the less strict definition of agreement (at a sample pixel) as a match between the map label and either the primary or alternate reference label. Assessment results obtained using the stricter definition of agreement (as a match between the map label and the primary reference label only) are reported briefly at the end of this section. The ten regional error matrices were estimated (using the less strict definition of agreement unless stated otherwise in the remainder of this section), as shown in Appendix B (Tables A1-A10). Overall accuracies based on estimated population class area proportions (Equation (4)) are shown in Figure 2 (also shown in Table 5, the last row, Row "Overall"). The overall accuracies of the ten regions range from 76.0% to 90.3%, while standard errors for estimated overall accuracy range between 1.6% and 2.6%, as shown in Figure 2. Most regions have overall accuracies over 80%, except for three regions (R4, R6 and R7). Overall accuracies in these three regions are about 8.6% lower than those of the other seven regions. Geographically, overall accuracies of northern regions (i.e., R1, R3, R5, and R10) rank in the top four, with the three southern regions (R4, R6, and R7) ranking in the bottom three, while the rest (R2, R8, and R9) are in the middle. These differences are related to various factors, such as class composition, classification system, and landscape heterogeneity. In terms of land cover composition, land cover classes are unequally distributed across China, as shown in Table 2.
Based on GlobeLand30 2010 maps, R1-R3 and R5 are dominated by grassland and bareland, accounting for 89%, 82%, 73%, and 74% of their total areas, respectively. R4 and R6-R10 are dominated by cultivated land and forest, accounting for 74%, 88%, 87%, 74%, 85%, and 81% of their total areas, respectively. Among them, bareland in R1 occupies as much as 67% of its whole area. The top three classes with the largest areas in each of the ten regions account for at least 85% of the corresponding region's total area. Cultivated land, forest, wetland, water bodies, and permanent snow and ice register user's accuracies greater than 82%, as shown in Table 5. Grassland and bareland have higher user's accuracies in regions where these classes are dominant (R1-R3 and R5), while their user's accuracies are much lower in the other six regions because of class rarity. Shrubland and artificial surfaces, which account for small area proportions of China (1.0% and 1.8%, respectively), have medium-level user's accuracies (Table 5).
Table 5. Regional user's accuracies for nine land cover classes with standard errors (SE) in parentheses. Agreement is defined as a match between the map class and either the primary or alternate reference class (Cultivated land is abbreviated as CuL, Artificial Surfaces as ArS, and Permanent snow and ice as PSI; "-" means there is no permanent snow and ice in a given region; Column "National" represents country-wide user's accuracies).
To evaluate the nationwide thematic accuracy, a country-wide error matrix of estimated area proportions was constructed by aggregating the ten regional error matrices using methods described in Section 2.3, with results shown in Table 7. At the country level, the overall accuracy for GlobeLand30 2010 over China is 84.2%, as shown in Table 7. This estimate of overall accuracy is slightly higher than the results reported in [29]. Below, we report results of accuracy evaluation obtained following the procedures above but using the stricter definition of agreement at a sample pixel (as a match between the map label and the primary reference label only). The results are shown in Appendix C. Specifically, regional overall accuracies and user's accuracies are shown in Table A11 (Appendix C), while regional producer's accuracies are shown in Table A12. The national error matrix is shown in Table A13. Tables A11-A13 are similar in format to Tables 5-7, respectively, for convenience of comparison.
Consider national accuracy assessment results. As expected, with the stricter definition of agreement, country-wide overall accuracy is estimated lower at 81.0% (Table A13), reduced by about 3 percentage points from the estimated overall accuracy of 84.2% (Table 7) based on the less strict definition of agreement. User's and producer's accuracies decrease to differing extents depending on the specific classes concerned, although there are no differences in either user's or producer's accuracies for bareland and permanent snow and ice. The most obvious decreases in accuracy are observed for the class of artificial surface, which registers a user's accuracy of 70% (Table A13) (as opposed to 80%, Table 7) and a producer's accuracy of 53% (Table A13) (as opposed to 62%, Table 7). Further detail about decreases in accuracies (using the stricter definition of agreement as opposed to the less strict definition) and their variations in regional, categorical, and national terms is shown in Tables A11-A13 in Appendix C.
Table 7. The country-wide error matrix; cell entries are expressed as percent of area (see Table 5 for meanings of CuL, ArS, and PSI). Agreement is defined as a match between the map class and either the primary or alternate reference class. User's accuracy (UA) and producer's accuracy (PA) are reported with standard errors (SE) in parentheses. Overall accuracy is 84.2% (0.7%).

Discussion
In this section, discussion is based on assessment results obtained with the less strict definition of agreement (as a match between the map label and either the primary or the alternate reference label at a sample pixel). It is, however, sensible to be aware of the implications of using a stricter definition vs. a less strict definition of agreement for accuracy assessment and to appreciate the relevance of the latter definition for accuracy validation in complex landscapes.

Patterns of Misclassification Errors
Error patterns for GlobeLand30 2010 over China are summarized below (in the remainder of the paper, where no ambiguity arises, reference to China will not be made unless necessary):
(1) Some dominant classes are overestimated where there are inclusions, such as forest and cultivated land included in artificial surfaces, artificial surfaces and grassland in cultivated land, shrubland and grassland in forest, and grassland and shrubland in bareland.
(2) Grassland, shrubland and forest are difficult to distinguish as their spectral characteristics are often similar. These three classes are usually mixed in natural environments, making each "look like its surroundings".
(3) Fragmented patches, such as scattered villages and small blocks of cultivated land in hilly/mountainous areas, are likely to be omitted and classified as dominant classes in their neighborhoods.
(4) "Salt and pepper" noise is common in GlobeLand30 maps. GlobeLand30 has a prescribed minimum mapping unit (MMU) for each land-cover class and an allowable minimum error of omission or commission per scene for each class. However, there is no restriction on small blocks with size smaller than the specified MMU. There are still many small blocks of shrubland, grassland, bareland, and forest in GlobeLand30 maps. When sample pixels are located in these small blocks, they are likely to be classified as dominant classes in their neighborhoods.
(5) Time lags between map image acquisition dates and the dates for image interpretation also have effects on reported accuracies, because of the possibility of land-cover change. The dates of Landsat images used for GlobeLand30 2010 over China range from January to December (2010), while 89.4% of sample pixels were based on fine resolution images flown from May to November. Although land-cover change within the year is relatively rare, misclassifications caused by time lags are observed (e.g., bareland in summer is misclassified as permanent snow and ice, and cultivated land in the non-growing season is misclassified as bareland).
(6) Map heterogeneity has the expected effect of reducing reported agreements between map and reference classes. Land-cover heterogeneity is defined as the number of land-cover classes occurring in a 3 × 3-pixel window centered on the sample pixel [47]. In this paper, a heterogeneity value equal to one is defined as homogeneous (interior pixels), and otherwise as heterogeneous (edge pixels). Nationwide, 72.6% of the sample pixels are interior, while the probability of disagreement for edge pixels is 2.3 times that of interior pixels.
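The heterogeneity measure in item (6) is simple to compute. The sketch below follows the definition given in the text (number of distinct classes in the 3 × 3 window centered on the sample pixel); the function names and the class labels in the example windows are hypothetical.

```python
def heterogeneity(window):
    """Number of distinct land-cover classes in a 3 x 3 window (list of 3 rows)."""
    return len({label for row in window for label in row})


def is_interior(window):
    """A sample pixel is 'interior' when its window contains a single class."""
    return heterogeneity(window) == 1


homogeneous = [["forest"] * 3 for _ in range(3)]
mixed = [["forest", "forest", "shrubland"],
         ["forest", "forest", "grassland"],
         ["forest", "forest", "forest"]]

print(heterogeneity(homogeneous), is_interior(homogeneous))  # 1 True
print(heterogeneity(mixed), is_interior(mixed))              # 3 False
```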

Information for User Community and Product Improvement
Accuracy assessment is a standard component of the GlobeLand30 mapping protocol. The overall accuracy for the GlobeLand30 2010 dataset was estimated to be 84.2%, while user's accuracies for individual classes (except for shrubland) exceeded 78% (Table 7). This indicates that the GlobeLand30 2010 dataset depicts spatial distributions of different land cover types in China with relatively high accuracy, given its fine resolution. Regional accuracy assessment results showed variations in accuracies due to regional differences in landscape patterns, as summarized at the end of Section 3.
Accuracy assessment can also inform and guide future map production. The results of GlobeLand30 2010 accuracy assessment show greater than 79% producer's accuracies for all classes except for shrubland (only 11%), wetland, and artificial surface (Table 7). This suggests that mapping protocols need to be further developed and refined to better distinguish scattered grassland, shrubland, and small mixed grass-shrub patches, which are usually misclassified as other classes nearby. In China, especially southwestern China, spectrally mixed pixels are common in complicated and fragmented landscapes. Although image segmentation and object-based classification techniques developed in GlobeLand30 data production can suppress "pepper-and-salt" effects in resultant classifications to some extent, determination of suitable segmentation parameters that are globally applicable is extremely difficult. Therefore, omission errors (of shrubland) are inevitable, especially in areas with a complex landscape. As shown in [48,49], scattered classes of small extents are likely to be misclassified as dominant classes in their neighborhoods. This was similarly observed for wetland and artificial surface (e.g., human settlements in rural areas dominated by cultivated land). Further work is required to solve these problems in future global land cover mapping at fine spatial resolution. Regions and classes prone to misclassifications (in particular, shrubland in all regions, wetland in R2, R6 and R7, and artificial surface in R2, R4 and R7; see Table 6) should be given special attention.

Comparison of GlobeLand30 with Other Related Data Products of 30 m Resolution
In this subsection, we compare GlobeLand30 with two fine-resolution land cover products whose Level I thematic resolution (i.e., number of classes) is similar to that of GlobeLand30. One is a global product, known as FROM-GLC [31]; the other is a US national product, NLCD [50], as mentioned in Section 1.
FROM-GLC contains two levels of land cover classes: 10 Level I classes (i.e., cropland, forest, grassland, shrubland, water bodies, impervious areas, bare lands, snow and ice, clouds and unclassified) and 29 Level II classes. It utilized thousands of Landsat images flown from 1981 to 2011 and was generated from fully automated image classification. The overall accuracies (Level I) range from 54% (maximum likelihood classifier) to 65% (support vector machine, SVM), while its best overall accuracy (Level II) is 53%. Yu et al. [51] improved FROM-GLC classification results using MODIS time series images and other auxiliary data, achieving an overall classification accuracy of 67%. In comparison, FROM-GLC is of lower accuracy than GlobeLand30 2010, as the former resulted from fully automated classification while the latter was enhanced with object-based classification and knowledge utilization.
Clearly, in terms of overall accuracy, GlobeLand30 2010 is slightly inferior to NLCD Level I products (except for NLCD 1992 and NLCD 2006). We take the liberty of assuming approximate semantic equivalence between the following GlobeLand30 and NLCD classes: "cultivated land" = "agriculture", "bareland" = "barren", and "artificial surface" = "developed". As for user's accuracies, shrubland, grassland, and artificial surface are less accurately classified in GlobeLand30 than in NLCD 2011, while barren and wetland are less accurately classified in NLCD 2011 than in GlobeLand30. With respect to producer's accuracies, NLCD 2011 is apparently superior to GlobeLand30, as there were large omission errors for shrubland, wetland, and artificial surface in the latter.
However, it should be noted that comparisons with FROM-GLC and NLCD were made for Level I classes above. A total of 15 land cover classes will be mapped in GlobeLand30 2015, which is under development [55]. Class ambiguity tends to increase as thematic detail of classification increases (e.g., from 10 classes to 15 classes), and this increasing ambiguity may have a negative impact on accuracy. For instance, overall accuracy decreases from 84% to 78% for NLCD 2006 when thematic resolution increases from Level I (8 classes) to Level II (16 classes) [37]. Users and producers should be aware of the implications of increased thematic resolution for GlobeLand30 2015.

Conclusions
Accuracy assessment of land cover information is of great importance to research and applications concerning natural resources and the environment. This paper provides detailed information about regional vs. national accuracies (overall, user's, and producer's) of GlobeLand30 2010 land cover over China by adopting a two-level stratified random sampling design and furnishing suitable methods for aggregating regional error matrices to a national one. The national overall accuracy for GlobeLand30 2010 was estimated as 84.2% (with agreement at a sample pixel defined as a match between the map label and either the primary or the alternate reference label), indicating GlobeLand30's relatively high accuracy. The national overall accuracy was estimated as 81.0% when defining agreement at a sample pixel more strictly as a match between the map label and the primary reference label only. However, areas with heterogeneous landscapes and scattered small patches, in particular, need to be mapped with improved classification methods in future endeavors, as such areas tend to be labeled with low accuracies, as revealed in this study. GlobeLand30 production and validation teams will benefit from the research reported in this paper.

Appendix A
aggregated with their corresponding proportions $\hat{P}_{ij|h\text{-reweighted}}$ summed up, giving rise to the national error matrix with correct (national) proportions $\hat{P}_{i,j}$ (Equation (6)).
For example, $P_{2,2}$ for the forest-forest class pair in the national error matrix (shown in ...

After creation of the national error matrix, UA, PA, and OA can be easily computed using Equations (2)-(4), respectively. For example, based on the national error matrix in Table 7, UA and PA for forest are:

$$UA_2 = \frac{P_{2,2}}{P_{2\bullet}}, \qquad PA_2 = \frac{P_{2,2}}{P_{\bullet 2}},$$

where $P_{i\bullet}$ and $P_{\bullet j}$ represent the totals of row $i$ and column $j$ in the national error matrix, respectively ($i = j = 2$ for forest). The estimate for OA is:

OA = (18 + 19.3 + 23.2 + 0.7 + 0.4 + 1.5 + 1.4 + 18.8 + 0.9)/100 = 0.842.
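The computation of UA, PA, and OA from an error matrix of area proportions can be sketched as follows. This is a minimal illustration, assuming rows hold map classes and columns hold reference classes; the function name `accuracy_from_matrix` and the 3-class matrix `toy` are hypothetical and are not taken from the paper. Only the diagonal values in `national_diag` come from the OA example in the text.

```python
import numpy as np

def accuracy_from_matrix(p):
    """Return (UA, PA, OA) for an error matrix of area proportions.

    Rows are assumed to be map classes, columns reference classes.
    """
    p = np.asarray(p, dtype=float)
    diag = np.diag(p)
    ua = diag / p.sum(axis=1)    # P_ii divided by the row total P_i.
    pa = diag / p.sum(axis=0)    # P_ii divided by the column total P_.i
    oa = diag.sum() / p.sum()    # sum of the diagonal over the grand total
    return ua, pa, oa

# Hypothetical 3-class matrix in percent of area (illustration only).
toy = [[30.0, 4.0, 1.0],
       [3.0, 40.0, 2.0],
       [2.0, 1.0, 17.0]]
ua, pa, oa = accuracy_from_matrix(toy)

# Diagonal of the national error matrix as quoted in the text (percent).
national_diag = [18, 19.3, 23.2, 0.7, 0.4, 1.5, 1.4, 18.8, 0.9]
national_oa = sum(national_diag) / 100
print(round(national_oa, 3))
```

The last lines reproduce the OA arithmetic given above from the nine diagonal entries of the national error matrix.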
By the combined ratio estimator method, on the other hand, UA and PA are estimated as a ratio $R = Y/X$, where $Y$ is the population total of $y_u$ and $X$ is the population total of $x_u$ ($u$ being a pixel in the population). $y_u$ and $x_u$ are indicator functions for pixel $u$ on condition A and condition B, respectively. For UA of a particular class, say "forest", condition A is that the map and reference labels are both forest, while condition B is that the map label is forest. For PA of forest, condition A remains the same, but condition B is that the reference label is forest, as also explained in [38].
The combined ratio estimator for UA or PA is:

$$\hat{R} = \frac{\sum_{h=1}^{H} N_h \bar{y}_h}{\sum_{h=1}^{H} N_h \bar{x}_h}, \qquad \text{(A2)}$$

where $\bar{x}_h$ is the sample mean of $x_u$ in stratum $h$, $\bar{y}_h$ is the sample mean of $y_u$ in stratum $h$, $N_h$ is the population size in stratum $h$, and $H$ is the number of strata in the study area [37]. This ratio estimator is very general, as it can handle sample data with double stratification (as in this paper) and situations where there is no one-to-one correspondence between strata and the classes for which accuracy indicators need to be estimated. In this study, the sample data were collected following a two-level stratification (ten regions, each with eight or nine land cover classes), as described in Section 2.1. We can treat the sample data as consisting of 84 strata and calculate $\hat{R}$ with Equation (A2) ($H = 84$).
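A minimal sketch of the combined ratio estimator follows, assuming the usual form in which stratum sample means are weighted by stratum population sizes. The function name `combined_ratio` and the three-stratum figures are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def combined_ratio(N_h, x_bar_h, y_bar_h):
    """Combined ratio estimator: sum(N_h * ybar_h) / sum(N_h * xbar_h)."""
    N_h = np.asarray(N_h, dtype=float)
    numerator = np.sum(N_h * np.asarray(y_bar_h, dtype=float))
    denominator = np.sum(N_h * np.asarray(x_bar_h, dtype=float))
    return numerator / denominator

# Hypothetical strata (not from the paper).
N_h = [5000, 3000, 2000]       # population size of each stratum
x_bar = [0.60, 0.10, 0.05]     # sample share with map label "forest" (condition B for UA)
y_bar = [0.54, 0.07, 0.02]     # sample share with map and reference both "forest" (condition A)

ua_forest = combined_ratio(N_h, x_bar, y_bar)
print(ua_forest)
```

For PA, the same function applies with `x_bar` replaced by the sample share whose reference label is the class of interest.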
However, given the congruence among regional error matrices and the one-to-one correspondence of strata and mapped classes in individual regions, simplified use of Equation (A2) is possible on the basis of regional error matrices. In other words, it is more sensible to view the regions in the study as the strata to work with when applying Equation (A2) (i.e., $H = 10$). Then, for a particular region $h$, $N_h$ refers to the region's areal proportion (i.e., Table 2, bottom row). In addition, given regional error matrices, we can easily get the sample statistics required in Equation (A2). Specifically, $\bar{x}_h$ is class $i$'s sample proportion in the region (e.g., the total of row or column $i$ in the error matrix for region $h$, $P_{i\bullet|h}$ or $P_{\bullet i|h}$, depending on whether UA or PA is concerned), while $\bar{y}_h$ is the proportion of sample pixels of reference class $i$ classified correctly as class $i$ in the region (e.g., $P_{i,i|h}$ in the error matrix for region $h$).
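For example, using regional error matrices (Tables A1-A10, Appendix B), UA and PA for forest are computed as:

$$UA_2 = \frac{\sum_{h=1}^{H} N_h P_{2,2|h}}{\sum_{h=1}^{H} N_h P_{2\bullet|h}}, \qquad PA_2 = \frac{\sum_{h=1}^{H} N_h P_{2,2|h}}{\sum_{h=1}^{H} N_h P_{\bullet 2|h}}.$$

Clearly, estimates for national UA and PA computed from the two methods are identical. The equivalence between the two methods' results holds not only for forest but for any class $i$, as established by:

$$\frac{\sum_{h=1}^{H} N_h P_{i,i|h}}{\sum_{h=1}^{H} N_h P_{i\bullet|h}} = \frac{P_{i,i}}{P_{i\bullet}}, \qquad \frac{\sum_{h=1}^{H} N_h P_{i,i|h}}{\sum_{h=1}^{H} N_h P_{\bullet i|h}} = \frac{P_{i,i}}{P_{\bullet i}},$$

where Equation (6) is applied for numerators and denominators, separately.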
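The agreement between the two routes can be checked numerically. The sketch below assumes that the national error matrix is the area-weighted sum of regional error matrices; the two regional matrices and the weights are hypothetical and are not values from Tables A1-A10.

```python
import numpy as np

# Hypothetical regional error matrices of area proportions (each sums to 1).
# Axis 0: region, axis 1: map class (rows), axis 2: reference class (columns).
regional = np.array([
    [[0.50, 0.10], [0.05, 0.35]],   # region 1
    [[0.20, 0.05], [0.15, 0.60]],   # region 2
])
weights = np.array([0.7, 0.3])      # hypothetical regional areal proportions

# Route 1: aggregate to a national matrix, then UA_i = P_ii / P_i.
national = np.tensordot(weights, regional, axes=1)
ua_matrix = np.diag(national) / national.sum(axis=1)

# Route 2: combined ratio estimator with regions as strata.
y_bar = np.diagonal(regional, axis1=1, axis2=2)   # P_ii|h, shape (regions, classes)
x_bar = regional.sum(axis=2)                      # row totals P_i.|h
w = weights[:, None]
ua_ratio = (w * y_bar).sum(axis=0) / (w * x_bar).sum(axis=0)

print(ua_matrix, ua_ratio)
```

Both routes divide the same weighted sum of diagonal entries by the same weighted sum of row totals, which is why the printed vectors coincide. Using column totals (`regional.sum(axis=1)`) instead gives the corresponding check for PA.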
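As for the estimated variance of UA and PA, the method described in [38] is applicable. Specifically, the estimated variance of the combined ratio estimator is computed as:

$$\hat{V}(\hat{R}) = \frac{1}{\hat{X}^2} \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{s_{yh}^2 + \hat{R}^2 s_{xh}^2 - 2 \hat{R} s_{xyh}}{n_h},$$

where $\hat{X} = \sum_{h=1}^{H} N_h \bar{x}_h$ is the denominator of Equation (A2), $n_h$ is the sample size for stratum $h$ ($N_h$ being the population size for stratum $h$, as previously in Equation (A2)), $H$ is the number of strata (84 for the study in this paper), $s_{yh}^2$ and $s_{xh}^2$ are the sample variances of $y_u$ and $x_u$ for stratum $h$, and $s_{xyh}$ is the sample covariance of $x_u$ and $y_u$ in stratum $h$.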
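The variance computation can be sketched as below. This assumes the standard stratified form of the variance estimator for a combined ratio, with a finite population correction per stratum; the function name `ratio_variance` and the two small strata are hypothetical, chosen so the result can be verified by hand.

```python
import numpy as np

def ratio_variance(N_h, x, y):
    """Combined ratio estimate and its estimated variance.

    N_h  : population size of each stratum
    x, y : lists of 0/1 indicator arrays, one pair per stratum
    """
    N_h = np.asarray(N_h, dtype=float)
    n_h = np.array([len(v) for v in x], dtype=float)
    x_bar = np.array([np.mean(v) for v in x])
    y_bar = np.array([np.mean(v) for v in y])
    X_hat = np.sum(N_h * x_bar)                  # estimated total of x_u
    R = np.sum(N_h * y_bar) / X_hat              # combined ratio estimate
    s2_x = np.array([np.var(v, ddof=1) for v in x])
    s2_y = np.array([np.var(v, ddof=1) for v in y])
    s_xy = np.array([np.cov(a, b, ddof=1)[0, 1] for a, b in zip(x, y)])
    fpc = 1.0 - n_h / N_h                        # finite population correction
    terms = N_h**2 * fpc * (s2_y + R**2 * s2_x - 2.0 * R * s_xy) / n_h
    return R, terms.sum() / X_hat**2

# Hypothetical sample: two strata with four sample pixels each.
x = [[1, 1, 1, 0], [0, 1, 0, 0]]   # condition B indicator
y = [[1, 1, 0, 0], [0, 1, 0, 0]]   # condition A indicator
R, var_R = ratio_variance([100, 200], x, y)
se_R = var_R ** 0.5                # standard error of the ratio
print(R, var_R, se_R)
```

Each stratum needs at least two sample pixels, since the sample variances and covariance use `ddof=1`.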
The estimated variance of national OA can be calculated using Equation (5). Adaptation is, however, required by viewing the country-wide population as consisting of 84 strata (region-class combinations), for which weights $W_{i,h|H}$ need to be calculated properly (see Equation (A1) and the example for computing OA three lines above in Equation (A3)). In addition, we can compute the variance of national OA using Equation (A6):

$$\hat{V}(\widehat{OA}) = \frac{1}{N^2} \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{s_h^2}{n_h}, \qquad \text{(A6)}$$

where $s_h^2$ is the sample variance of stratum $h$, $N$ is the population size (total number of pixels) of the study area, and $H$ is the number of strata over which the summation runs (84 for the study in this paper). We employed Equation (A6) for computing the variance of estimated national OA, although we tested both methods (Equations (5) and (A6)) and obtained identical results.
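A minimal sketch of the stratified estimate of OA and its variance follows, assuming the standard stratified-sampling form with a finite population correction. The function name `stratified_oa` and the two-stratum sample are hypothetical; each array marks whether a sample pixel's map label agrees with its reference label.

```python
import numpy as np

def stratified_oa(N_h, correct):
    """Overall accuracy and its estimated variance under stratified sampling.

    N_h     : population size of each stratum
    correct : list of 0/1 arrays, 1 where map and reference labels agree
    """
    N_h = np.asarray(N_h, dtype=float)
    N = N_h.sum()                                   # total population size
    n_h = np.array([len(c) for c in correct], dtype=float)
    p_h = np.array([np.mean(c) for c in correct])   # per-stratum agreement rate
    s2_h = np.array([np.var(c, ddof=1) for c in correct])
    oa = np.sum(N_h * p_h) / N
    var = np.sum(N_h**2 * (1.0 - n_h / N_h) * s2_h / n_h) / N**2
    return oa, var

# Hypothetical sample: two strata with 4 and 5 sample pixels.
correct = [[1, 1, 1, 0], [1, 1, 0, 0, 1]]
oa_hat, var_oa = stratified_oa([100, 300], correct)
print(oa_hat, var_oa, var_oa ** 0.5)
```

In the setting described above, the list would hold one array per region-class stratum (84 in total), with `N_h` given by the pixel count of each stratum.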

Appendix B. Regional Error Matrices of Estimated Area Proportions when Defining Agreement at a Sample Pixel as a Match between the Map Label and Either the Primary or Alternate Reference Label
Tables A1-A10 show error matrices of estimated area proportions for GlobeLand30 2010 in all regions (R1-R10). In Tables A1-A10, cultivated land is abbreviated as CuL, artificial surfaces as ArS, and permanent snow and ice as PSI; "-" means there is no permanent snow and ice in a given region, as in Tables 5-7. User's accuracy (UA) and producer's accuracy (PA) are reported with standard errors (SE) in parentheses. These results are based on defining agreement as a match between the map label and the primary or alternate reference label at sample pixels.

Figure 1. Regional stratification for GlobeLand30 2010 accuracy assessment over China. The boundaries of 10 geographic strata are shown in black. The labels "R1-R10" identify the regions used to geographically stratify sample data (e.g., R1 = Region 1).

Figure 2. Regional overall accuracies for GlobeLand30 2010 over China based on the definition of agreement as a match between the map class and either the primary or alternate reference class. Standard errors for the overall accuracies are in parentheses.

Table 1. Classification, codes, and definition of each land cover type of GlobeLand30.

Table 3. Error matrix of sample counts, $n_{ij}$.

Table 6. Regional producer's accuracies for nine land cover classes with standard errors (SE) in parentheses. Agreement is defined as a match between the map class and either the primary or alternate reference class (see Table 5 for meanings of CuL, ArS, PSI, and -; the column "National" represents countrywide producer's accuracies).

Table A11. Regional user's accuracies for nine land cover classes with standard errors (SE) in parentheses when agreement is defined as a match with the primary label.

Table A12. Regional producer's accuracies for nine land cover classes with standard errors (SE) in parentheses when agreement is defined as a match with the primary label.