An Explorative Study on Estimating Local Accuracies in LandCover Information Using Logistic Regression and Class-Heterogeneity-Stratified Data

It is increasingly recognized that classification accuracy should be characterized locally at the level of individual pixels to depict its spatial variability to better inform users and producers of land-cover information than by conventional error-matrix-based methods. Local or per-pixel accuracy is usually estimated through empirical modelling, such as logistic regression, which often proceeds in a class-aggregated or a class-stratified way, with the latter being generally more accurate due to its accommodation for between-class inhomogeneity in accuracy-context relations. As an extension to class-stratified modelling, class-heterogeneity-stratified modelling, in which logistic models are built separately for contextually heterogeneous vs. homogeneous sub-strata in individual strata of map classes, is proposed in this paper for proper handling of within-class inhomogeneity in accuracy-context relations to increase accuracy of estimation. Unlike in existing literature where sampling is usually approached separately, the double-stratification method is also adopted in sampling design so that more sample data are likely allocated to heterogeneous sub-strata (which are more prone to misclassifications than homogeneous ones). This class-heterogeneity-stratified method furnished for sampling and modelling jointly thus constitutes an integrative framework for accuracy estimation and information refinement. As the first step in building up such a framework, this paper investigates the proposed double-stratification method’s performance and sensitivity to sample size regarding local accuracy estimation in comparison with those of existing methods through a case study concerning Globeland3


Introduction
Land-cover information is important for resource management and environmental modelling.A variety of land-cover information products (static and dynamic, crisp and soft) are generated from different sensor datasets at regional and global scales [1][2][3][4][5][6].This research focuses on static land-cover information coded with discrete class labels rather than percent covers (or fractional covers or class proportions).However, land-cover information is always inaccurate to some extent.This is because information about land-cover status and dynamics is not directly measurable but results from complex processes of image and data analyses, interpretation, and reasoning, which are subject to various forms of uncertainty.There are increasing research efforts directed towards describing, quantifying, and analyzing accuracies (or misclassification errors) in land-cover information [7][8][9][10][11][12][13].
Research on local accuracy has focused on two major inter-related aspects: (1) local accuracy characterization through spatial and statistical analyses of accuracy-context associations, and (2) local accuracy estimation which is usually based on sample data and empirically built accuracy models.Here, context, as a broadly defined term, includes map class labels, locations, and indices quantifying patterns of class occurrences, as is the case in this paper.It may also be defined in image data and feature space [16].Classes can refer to static land-cover types or their changes (e.g., forest loss and urban gain, as in [17]), although this paper concerns the former case.We review related work on these two aspects below.
Research on analyses of accuracy-context relationships has found that informative contextual features (for explaining spatial variations in classification accuracy) include spatial heterogeneity, patch size, and other landscape pattern indices [18][19][20].Heterogeneity indicates textural complexity of land-cover classes occurring in certain neighborhoods and generally includes compositional (the number and proportions of different classes) and configurational (the spatial arrangement of classes) types [21].A few examples are as follows.
It was found that land-cover heterogeneity and patch size were important factors determining local accuracy for the United States National Land-Cover Data (NLCD) land-cover product [18,19].Van Oort et al. established

relationships between classification error and landscape characteristics,
showing that the probability of correct classification decreases with higher focal heterogeneity (in a neighborhood of 3 by 3 pixels) and smaller patch size [20].Lechner et al. developed a statistical simulation model to test the effects of patch size and shape, classification threshold, and grid location on classification accuracy of small and linear features.They found that the patch size was an important factor affecting classification accuracy [22].Chen et al. analyzed and examined the relationships between accuracies of crop classification and area estimation and spatial heterogeneities, in particular, sample pixel impurity and landscape heterogeneity, and found that the impact of configurational heterogeneity on the area estimation was more significant than that of the compositional heterogeneity [21].As reviewed above, complex landscapes (as indicated by increased heterogeneity, decreased dominance, and smaller patch sizes, etc.) likely lead to more misclassifications.Clearly, misclassifications are also more likely with blurred remote-sensing images and lack of class separability in feature space (e.g., [16]).Logistic models were usually used for describing statistical relationships between local accuracies and contextual/landscape patterns [18][19][20].There is also increasing research on local accuracy estimation (or prediction) as follows.
The aforementioned methods are mostly employed in the spatial domain (i.e., user's domain), while some of them are also applicable in the spectral domain (i.e., producer's domain) (e.g., kernel functions and logistic regression, as described in [24]).A useful method for local accuracy estimation in the spectral domain is the so-called calibration method that seeks to transform various classification certainty measures, such as maximum posterior probabilities, which are computed as intermediate results prior to output of end results, to accuracy indicators [31].There was also research on local accuracy estimation in combined spectral and spatial domains.For example, Steele et al. [28] formulated a concept of misclassification probability and present a resampling-based method of estimating misclassification probabilities at training sample locations, from which misclassification probability estimates are then interpolated to a lattice of points via kriging.Additional examples of combined spectral and spatial methods [16,29], in which spectral data and spectrally-derived soft class probabilities were used as the basis for modelling local accuracies, respectively.
As this research is oriented to local accuracy estimation in spatial domains, we elaborate on relevant (spatial-domain) methods, though most of them are mentioned above.A useful method is to compare map and reference class labels at certain sample locations so that a map of misclassifications can be created, helping to analyze their occurrences in the map being assessed.However, such an error location map does not show complete-coverage misclassifications over the problem domain.For mapping per-pixel accuracies, Foody [25] proposed a method based on interpolating accuracy measures computed from locally constructed error matrices.This method relies heavily on availability of relatively dense sample data to work well (a sampling intensity of about 6% was employed [25]).However, sampling intensities say 2.5% which are definitely affordable for small areas (e.g., [24]) will become prohibitive for large-area assessments (e.g., [17]).Developments on this method are reflected on geographical weighting and other extensions in local construction of error matrices [23,27,32].In addition to such methods making use of only locational information contained in the sample data, logistic modelling using contextual information (in addition to locational information) for estimating local accuracy may be usefully explored (as in this paper), given the observation that per-pixel probability of correct classification is closely related to contextual features characterizing patterns of map class occurrences in the neighborhoods [18][19][20].In fact, logistic regression was implemented in both geographic space (e.g., locations) [24] and contextual feature space (e.g., contextual information about class occurrence patterns and landscape characteristics) [17].The fitted logistic models can then be used to estimate the per-pixel probabilities of correct classification and hence generate maps showing spatially varying accuracies.See the work by Wickham et al. [17], Khatami et al. [24] and Zhang and Mei [30] for examples of using logistic models built on sample data and land-cover map data to estimate local accuracies.
Having provided some relatively solid justification for local accuracy estimation based on logistic modelling (which this research adopts), we consider issues of sampling (for collecting reference sample data), in particular, coupling of sampling designs and modelling approaches, below.The coupling of modelling and sampling facilitates integration of accuracy estimation and information refinement, with the latter using information about local accuracies in fusion of map and reference data for enhancing quality of fused maps [33,34].This integrative framework actually represents the paper's major contribution to the literature, as is seen below.
Like in error-matrix-based accuracy assessment, reference or validation sample data consisting of reference class labels (from which binary data indicating correct or incorrect classifications at sample pixels are obtained) are necessary for model-building.As understandable, models empirically built and model predictions are conditional to specific sample data employed (for model training), which are collected following certain sampling designs.It is thus important to reflect on how logistic modelling was implemented in combination with sampling in the past.The review below aims to provide a general indication to the largely loose coupling between modelling and sampling in existing literature, though it is by no mean comprehensive or detailed.
Smith et al. implemented logistic regression for characterizing local accuracy in the NLCD datasets in the eastern US encompassing four regions across 21 states with 5020 sample pixels (presumably with a region-stratified random sampling design) by a class-aggregated modelling strategy [18], with models built for individual regions separately.Then, Smith et al. carried out logistic modelling of local accuracies by a (map) class-stratified modelling strategy (stratifications with map classes at both Levels I and II), using the same sample set (5020 sample pixels) [19].Based on a class-aggregated modelling strategy, Van Oort et al. used a sample set of 1161 grid cells (collected with a kind of near-systematic sub-sampling) to model and analyze the classification accuracy of agricultural crops in the Dutch national land-cover database [20].Based on a simple random sample data collected at a sampling intensity of about 5%, Zhang and Mei integrated logistic regression and geostatistics for local accuracy characterization in land-cover change information via class-aggregated modelling [30].With stratified random sample data collected at intensities of 0.5% and 2.5%, Khatami et al. compared logistic modelling with other modelling approaches for estimating local accuracies in classified remote-sensing images, with both class-aggregated and class-specific (i.e., class-stratified) modelling approaches considered in the spatial domain or spectral domain [24].It was confirmed that class-specific modelling provides more accurate estimation of local accuracies than class-aggregated modelling, as investigated in [18,19,24].
As reviewed above, with reference sample data collected, logistic modelling can be performed in a (map) class-aggregated or class-stratified way.The latter is well suited to accommodating between-class inhomogeneity in accuracy-context relations, as demonstrated in [19], and has been confirmed to be more accurate than the former [24].In addition to systematic sampling and random sampling (simple or stratified), which are among the commonly used sampling designs, sampling adaptive to local class heterogeneity (e.g., class impurity in a focal neighborhood of 3 by 3 pixels) was also explored for accuracy assessment [35].This is motivated by the observation that boundary areas (i.e., edge pixels) are more likely misclassified than inner areas (i.e., interior pixels), as amply demonstrated in the literature on local accuracy estimation [35].Based on sample data in which edge pixels and interior pixels were treated separately, accuracy assessment was carried out, showing large differences between classification accuracies in segments of edge pixels and those of interior pixels [36].
Similar to the aforementioned error-matrix-based accuracy assessment, models of local accuracies may be built separately for contextually heterogeneous vs. homogeneous pixel segments (sub-strata) in individual strata of map classes, hopefully increasing accuracy in resultant model estimation.In other words, as an extension to class-stratified modelling, class-heterogeneity-stratified modelling can be usefully explored for proper handling of within-strata inhomogeneity in accuracy-context relations.This double-stratified method should also be considered for sampling pertaining to reference sample data collection so that sampling and modelling are well coupled with each other.More importantly, with this double-stratified method applied in sampling designs, heterogeneous sub-strata (which usually are more prone to misclassification than homogeneous sub-strata) are likely sampled at greater sampling intensities than with other designs without considering sub-stratification by heterogeneity.The increased number of sample pixels in error-prone locations will, in turn, enable detailed studies of misclassification patterns and facilitate direct correction of misclassification errors for refinement of land-cover information through fusion of map data and reference sample data.This helps to broaden usability of sample data for not only local accuracy estimation but also information refinement.Therefore, the aforementioned class-heterogeneity-stratified method for sampling and logistic regression modelling constitutes this paper's major contribution to the literature.The main features and values of the proposed double-stratification method include a combined perspective of sampling and modelling (which were seldom treated coherently in the past) and an integrative construct for local accuracy characterization and information refinement.
As the first step towards building up the aforementioned integrative framework, this paper investigates performances of the proposed double-stratified method (featuring class-heterogeneity-stratification in both logistic modelling and sampling) in comparison with those of alternative methods (i.e., logistic regression modelling and sampling that are not class-heterogeneity-stratified).This is important as the proposed method needs to be proved competitive in terms of performance for local accuracy estimation at the first place to be worthy of being pursued further for information refinement.In addition to comparing the proposed and alternative methods' performances based on a separate model-testing sample, these methods' sensitivities to sample sizes were also analyzed, with their robustness to varying sample sizes examined.This (sensitivity analysis) actually represents another contribution of this research to the literature, as it was rarely considered in similar research.As shown in the case study, the proposed class-heterogeneity-stratified method generates significantly more accurate estimation of local accuracies than alternative methods including a double-stratification method with sub-stratification by edge vs. interior pixels (as described in [36]), according to results of statistical testing and sensitivity analyses.
The remainder of the article is as follows.In Section 2, the study area and data used in the research are described first, followed by descriptions of methods for sampling and logistic regression modelling, in particular, those with double stratifications by class and heterogeneity.Section 3 describes the experiment carried out and the results obtained, aiming to test the proposed method in comparison with alternative methods.Finally, Section 5 concludes the paper after discussing some issues in Section 4.

The Study Area and Experimental Data
GlobeLand30 2010 land-cover dataset for Wuhan city was used for the study in this paper, as shown in Figure 1.As a global fine-resolution land-cover information product, GlobeLand30 (for 2000 and 2010), which was produced by the National Geomatics Center of China (NGCC) in 2014, has ten land-cover classes (http://www.globallandcover.com).The city of Wuhan (Lat 29 • 58 -31 • 22 N, Long 113 • 41 -115 • 05 E) is about 8495 km 2 in areal extent, located in the middle and lower reaches of the Yangtze, and is the provincial capital of China's Hubei province, as shown in Figure 1 (the inset map of China, lower right corner).For Wuhan, there are seven classes (Table 1, except for shrub, tundra, and permanent snow and ice), as shown in Figure 1.In particular, the dominate class is cultivated land, occupying about 60 percent of Wuhan's areal extent, followed by water, forest, and artificial surface, which account for 15 percent, 12 percent, and 7 percent of the total area of Wuhan, respectively.Grassland, wetland, and bare land together take about 6 percent of Wuhan's areal extent.
Reference data recording reference class labels are required for local accuracy estimation.Reference class labeling is defined as the best available assessment of the ground conditions.Collecting reference data is a time-consuming and costly procedure.In the study, the reference classes at sample pixels (sampling will be described in the next subsection) were obtained using visual interpretation of high spatial resolution images (i.e., Google Earth images).Interpretation was undertaken according to the standards consistent with GlobeLand30 classification system (Table 1).In most cases, Google Earth images were used for interpretation.When such images were not available, actual ground visits and Landsat TM image flown in temporal proximity of corresponding GlobeLand30 2010 maps were used as sources to obtain reference class labels.A set of reference data were used as (model) training data, with another as testing data for performance evaluation (which is to be described in Section 3.4).
The training data and testing data were independent.For the training data, further information about sampling design and resultant sample data collected is provided in Sections 2.2 and 3.1, respectively.Reference data recording reference class labels are required for local accuracy estimation.Reference class labeling is defined as the best available assessment of the ground conditions.Collecting reference data is a time-consuming and costly procedure.In the study, the reference classes at sample pixels (sampling will be described in the next subsection) were obtained using visual interpretation of high spatial resolution images (i.e., Google Earth images).Interpretation was undertaken according to the standards consistent with GlobeLand30 classification system (Table 1).In most cases, Google Earth images were used for interpretation.When such images were not available, actual ground visits and Landsat TM image flown in temporal proximity of corresponding GlobeLand30 2010 maps were used as sources to obtain reference class labels.A set of reference data were used as (model) training data, with another as testing data for performance evaluation (which is to be described in Section 3.4).The training data and testing data were independent.For the training data, further information about sampling design and resultant sample data collected is provided in Sections 2.2 and 3.1, respectively.

Class Name (Abbreviation) Definition
Cultivated land (Cultivt) Land used for agriculture, horticulture, and gardens, including paddy fields, irrigated and dry farmland, vegetable and fruit gardens, etc.
Forest Land covered by trees, vegetation covers over 30%, including deciduous and coniferous forests, and sparse woodland with cover 10-30%, etc.

Grassland (Grass)
Land covered by natural grass with cover over 10%, etc.

Shrub
Land covered by shrubs with cover over 30%, including deciduous and evergreen shrubs, and desert steppe with cover over 10%, etc.

Wetland
Land covered by wetland plants and water bodies, including inland marsh, lake marsh, river floodplain wetland, forest/shrub wetland, peat bogs, mangrove, and salt marsh, etc.
Water bodies (Water) Water bodies in land area, including river, lake, reservoir, fish pond, etc.

Class Name (Abbreviation) Definition
Tundra Land covered by lichen, moss, hardy perennial herbs and shrubs in the polar regions, including shrub tundra, herbaceous tundra, wet tundra, and barren tundra, etc.
Artificial surfaces (Artfct) Land modified by human activities, including all kinds of habitation, industrial and mining area, transportation facilities, and interior urban green zones and water bodies, etc.
Bare land (Bare) Land with vegetation cover lower than 10%, including desert, sandy fields, Gobi, bare rocks, saline and alkaline land, etc.

Permanent snow and ice
Lands covered by permanent snow, glacier, and icecap.

Sampling Design and Sample Allocation for Reference Data
In simple random sampling (SRS), n (sample) units are selected out of N units in the population such that every one of the distinct samples has an equal chance of being drawn.By a stratified random sampling (StRS) that may be based on geographic regions or map classes [12][13][14], independent and random sample units are drawn from individual strata.In this study, SRS and StRS were employed for comparative study about the proposed class-heterogeneity-stratified sampling design in the context of local accuracy estimation.They were sampling designs well suited for class-aggregated (CA) and class-stratified (CS) modelling, respectively, thus identified also as CA and CS, respectively, in the remainder of the paper.
As mentioned previously, for the proposed method, stratification is first based on map class and then heterogeneity/homogeneity. Here, homogeneity is defined as the number of pixels with the same class label as that of the center pixel in a focal neighborhood of 3 by 3 pixels.The homogeneity value can be viewed as the patch size of the center pixel in the focal neighborhood.A homogeneity value of 4 is chosen to be the threshold value to determine if the center pixel lies in a homogeneous sub-stratum or a heterogeneous one within a stratum of a certain map class.In contrast, a threshold of 8 was used for determining if the center pixel is interior (i.e., it belongs to a homogeneous sub-stratum) in [36].The former sampling design (stratified by map classes and sub-stratified by pixels being at the centers of homogenous vs. heterogeneous focal neighborhoods) is labeled EO, while the latter (stratified by class and sub-stratified by edge vs. interior pixels) EI.
For sample allocation, Neyman allocation method was used in this study.This is because Neyman allocation considering stratum size or proportion and the degree of variation of each stratum will greatly improve variance or standard error of estimation.For StRS, in particular, when sample size is fixed, variance or standard error of estimation can be minimized by Neyman allocation.For example, Neyman allocation method was used for sample allocation among different strata of map classes [35].By Neyman allocation method, the sample size for a stratum (but sub-stratum for EI or EO) is calculated as: where n h is the number of sample units in stratum h, n is total sample size, W h indicates the stratum's area proportion, and S h represents the stratum's standard deviation.In this paper, for CS sampling and class-heterogeneity-stratified sampling (EO and EI), the "strata" when using Equation (1) were map classes and combinations of map classes and heterogeneity/homogeneity sub-classes, respectively.When conducting Neyman allocation, the standard deviation (or variance) of each stratum should be known for calculating the sample size of each stratum.Standard deviation of each stratum was estimated based on initial sample data, as in [13,35].For stratum h, its standard deviation (S h ) is calculated according to Equation (5.55) in [37].
The initial sample data were collected using stratified sampling design with proper stratification plan (e.g., class-stratification for Method CS, class-heterogeneity-stratification for Methods EI and EO).

Logistic Regression Modelling
Logistic regression models are usually used to describe relationships between a binary response variable I(x) and one or more explanatory variables Z k (x) (k = 1, . . ., K) at pixel x.For mapping local accuracies, the response variable is an indicator I(x) for classification correctness (or agreement between the map and reference class labels) at pixel x, and is coded as 1 if pixel x was correctly classified and 0 otherwise.The explanatory variables are pattern indices for map class occurrences within multi-scale neighborhoods, as discussed previously.Model predictions are probabilities of individual pixels being correctly classified.A logistic model is where p(x) is the probability of pixel x being correctly classified, β = (β 0 , β 1 , β 2 . . ., β k ) represents the parameters to be estimated.Relations between p(x) and I(x) are further discussed in the Discussion, so is possible extension to logistic-regression-based estimation.Logistic regression analyses are preferably approached in a way well coupled with the model-training sample data available.In this study, logistic modelling was performed in different ways: for Methods EO and EI, models were built for individual sub-strata separately; models were built for individual strata in CS; for CA, a single model was built for the study areas as a whole.Resultant logistic models built for EO and EI should be applied to their corresponding sub-strata in the land-cover map concerned, and those for CS to corresponding strata.When modelling is performed separately per land-cover class, as in EO, EI, and CS methods, logistic models can be applied to commission errors [17], although only per-pixel probabilities of correct classification were considered in this study.
Class occurrence pattern indices including homogeneity, heterogeneity, dominance, entropy, and contagion were used as candidate explanatory variables in this study (Table 2).These pattern indices were quantified in different-sized moving windows.In the study, due to computational limitation, moving window sizes were 3 by 3, 5 by 5, to 39 by 39 pixels at the maximum.
Definitions for the candidate explanatory variables shown in Table 2 are as follows.Class is represented by six binary variables, as there are seven land classes occurring in the study area.Homogeneity (Hom) refers to the number of pixels with the same map class label as the center pixel in a neighborhood (or moving window).Heterogeneity (Het) indicates the number of different classes in the moving window.If the value of heterogeneity equals 1, it indicates that the neighborhood is pure or homogeneous with the same class.Entropy (Ent) indicates the average uncertainty of class occurrences.When the probability of each class being present in a given neighborhood is roughly the same, entropy value reaches its maximum, and when only one class dominates, entropy value is zero.Dominance is the difference between the maximum possible diversity of the neighborhood or moving window being considered (measured by entropy, ln(K), K being the number of classes occurring in the moving window) and computed diversity in the moving window.Thus, dominance measures the extent to which one or a few cover types dominate the landscape (the moving window, to be more precise).A higher value indicates that the neighborhood is dominated by one or a few land-cover classes, and a lower value indicates that land-cover types have nearly equal proportions [38], given the same number of classes in the area (moving window) (as the maximum diversity value is determined by the number of classes in an area).Clearly, dominance is related to entropy, although their relation is modulated by K, which is itself varied rather than fixed across the landscape where moving windows fall in.Contagion index describes the degree of clumping of land cover, which is computed from the frequencies by which different pairs of cover types occur as adjacent pixels within a moving window [39].Because the index contains spatial information, it is one of the most important landscape pattern indices.In general, high contagion values show that certain landscape patch types form good connectivity; on the contrary, low contagion values indicate that the landscape is highly fragmented [20,39].
The −2LogLikelihood statistics (−2LL) is often used to test whether all regression parameters in a model are simultaneously zero, furthermore, the difference between the -2LL of the two models follows a chi-square distribution with degrees freedom being equal to the number of the extra variables to those shared by the two models [40].A χ 2 -test then can be used to test whether the addition of these extra variables can significantly improve model-fitting.

Experiment, Results, and Analyses
The experiment procedures consisted of reference sample data collection (for model-training and testing, see Sections 3.1 and 3.4, respectively), model-building, predictive mapping of local accuracies using models fitted with training data, and performance evaluation based on testing data.The performances of different methods for local accuracy estimation were compared, so were their sensitivities to sample size.

Training Sample Data Collection
As mentioned in Section 2.2, the sampling designs tested were: (1) SRS (CA); (2) StRS (CS), (3) EI (a sampling design with stratification by map classes and sub-stratification by edge/interior pixels), and (4) EO (the proposed class-heterogeneity-stratified random sampling).For all methods being examined, the sample size was 3000 pixels (for model-training).The rationales are that a sample size of 3000 pixels is about the minimum to enable wetland and bare land (rare classes) to be allocated minimum numbers of sample pixels necessary for model-building and that the sample size is to be reduced to 1000 pixels at the minimum for sensitivity analysis as in Section 3.5.Sample allocations for the aforementioned sampling designs are shown in Table 3.For sample allocation, the Neyman method was used for all methods except CA (where SRS was adopted).For EO and EI, the homogenous and heterogonous areas of each class were considered as different strata during sampling.
As shown in Table 3 (upper part, strata of classes), sampling intensities are about the same across different classes (strata) for Method CA as SRS was employed therein, while variations among different classes are apparently increasing in CS, EI, and EO.As shown in Table 3 (lower part), sampling intensities at heterogeneous sub-strata are obviously higher than at homogeneous sub-strata, as evaluated relative to the respective population sub-strata (e.g., about 3.93% for Wetland_E, wetland class, heterogeneous sub-stratum, but only 0.10% for Wetland_O, wetland class, homogeneous sub-stratum).Such a remarkable difference is seen to be narrower in the cases of EI, even reversed in CS and CA (Wetland_E vs. Wetland_O, for example).

Model Selection
For model selection, an exhaustive procedure was applied to find the optimal model containing the largest number of significant explanatory variables based on a particular model-training sample, as in [20].There were 98 candidate explanatory variables (i.e., map class (1), pattern indices computed in different-sized windows (95), and sample pixel's coordinates (2)) to choose from.The pattern indices in different windows were written as the combination of abbreviations and numbers, where numbers represented the sizes of the window.For example, Hom3 indicated the homogeneity of the sample data in a 3 by 3 pixels' window.Individual candidate explanatory variables were tested using chi-square statistics with respect to their statistical significance in model-fitting.This was to test if adding a candidate variable to a model already selected (i.e., a simpler model) significantly improves model-fitting (i.e., leading to a significant decrease in model deviance) (at a significance level (α) of 0.05).The significance of interaction terms for every two significant explanatory variables already selected in logistic regression was also assessed.Optimal models with the largest number of significant explanatory variables were identified when it was confirmed that adding variables further leads to insignificant reduction in model deviances.This means that a simpler model (with less significant explanatory variables) is preferred between two models with the same level of goodness of fit.This procedure was implemented on the R software system.
Results of model selection are shown in Table 4. "E" and "I" (Method EI) represent edge (more heterogeneous) and interior (more homogeneous) pixels, respectively, in Table 4. Thus, Forest_I indicates interior pixels sub-stratum for Forest class, and corresponds (though not one-to-one) to Forest_O (for Method EO) of homogenous sub-stratum of Forest class.Unsurprisingly, there are apparent differences in the optimal models for individual strata or sub-strata, as shown in Table 4.In other words, individual strata or sub-strata have their own optimal models with unique significant explanatory variables of their own.This (non-uniformity of logistic models of local accuracies across map strata and heterogeneity sub-strata) confirms the need for modelling local accuracies for individual map classes and heterogeneous/homogeneous sub-strata (of each class) separately.This also necessitates sensitivity analyses (to be described in Section 3.5) for methods being tested with respect to different sample sizes.
Lastly, regarding relative significance of locational vs. contextual explanatory variables in logistic modelling, the latter seem to be more informative in explaining observed classification correctness.This is supported by the results shown in Table 4, where accuracy models have solely contextual features (e.g., CS, Forest) as explanatory variables in 21 cases out of a total of 36 cases, while there are only four cases when explanatory variables are locations alone (e.g., CS, Water).

Mapping Per-Pixel Probabilities of Correct Classification and Classification Correctness at Sample Locations
With models of local accuracies obtained for the four methods (i.e., CA, CS, EI, and EO) as in Table 4, maps of per-pixel probabilities of correct classification were generated using these models, as shown in Figure 2a-d, respectively.The differences between the maps of estimated local accuracies shown in Figure 2 are somehow appreciable.Also, we can check their differences quantitatively using model-testing sample data, as shown later.

Performance Evaluation
As mentioned previously, AUC was used to evaluate performances of different methods for local accuracy estimation.T-test was performed on AUC values to examine whether there exist significant differences between the aforementioned different methods.As shown in Figure 2, locations of misclassifications (in red) are more likely found near locations with smaller estimated probabilities of correct classification (from orange to red).On the other hand, locations of correct classifications (in blue) are more likely found near locations with larger estimated probabilities of correct classification (from green to blue).
The closeness between estimated probabilities and actual indicators of correct classification can be assessed quantitatively using the area under the receiver operating characteristic curve (AUC) [41][42][43][44], as in the next subsection for performance evaluation based on independent testing sample data.AUC is a commonly used metric for assessing performances (discriminatory power) of models constructed to predict binary outcomes.AUC values can theoretically range from 0 to 1, with larger value indicating greater accuracy in predictions [42,43].Using training sample data (3000 sample pixels but different designs for different methods) as references, AUC values for the four methods tested were computed using an R package pROC [45].Their AUC values were estimated to be 0.6810, 0.7198, 0.7349, and 0.7625, for methods CA, CS, EI, and EO, respectively, indicating their increasing accuracy in predicting per-pixel probabilities of correct classification.Nevertheless, it should be noted that assessing models' accuracy based on training data is not recommendable.Rather, we should use independent sample data to assess these models' performances, as in the next subsection.

Performance Evaluation
As mentioned previously, AUC was used to evaluate performances of different methods for local accuracy estimation.T-test was performed on AUC values to examine whether there exist significant differences between the aforementioned different methods.
For this, an independent set of 1020 model-testing sample pixels were acquired (with a simple random design) from visual interpretation of high-resolution satellite images, as described in Section 2.1.Different sets of local accuracy estimations (obtained with samples collected with different sampling designs) were compared to their corresponding reference indicator data, with AUCs computed also using the R package pROC [45].AUC values obtained by these different methods are shown in Table 5.As mentioned previously, the range of AUC is from 0.0 to 1.0.Models are graded as excellent, fair, and poor.Specifically, models providing excellent predictions have AUC values higher than 0.9, fair models have AUC values between 0.7 and 0.9, and models are considered as poor when their AUC values are below 0.7 [41,43].The AUC values shown in Table 5 (based on testing sample data) are in the range between 0.75 and 0.8.Thus, all methods should be considered as fair.In comparison, as shown in Table 5, the performances of predictions obtained with EO, CS, and EI are better than that with CA in terms of AUC values, with EO method getting the greatest AUC value of 0.7968.AUC values shown in Table 5 are greater than their respective AUC values obtained with training sample data.This is perhaps due to use of different sample designs by them and a much smaller sample size used in the latter.After all, AUC values obtained from sample data were estimates and can be assessed with respect to their variability, as examined in the next subsection.
T-test was performed to determine statistical significances of differences between these AUC values, leading to results shown in Table 6.R package pROC [45] was used also for testing the significances of differences in AUC values obtained from alternative methods.pROC has a built-in function for t-test, assessing the significance of differences in AUC values, with variance and covariance of AUC estimates for a pair of methods computed by the package before p-values are output.As shown in Table 6, EO method is significantly more accurate (α = 0.05) than EI, CS, and CA methods, while CS is significantly more accurate than SRS (α = 0.10).Comparing EI and CS methods, there was no significant difference between them.

Sensitivity Analysis
As model predictions depend on model-training sample data characteristics (i.e., sampling design and sample size), it is important to undertake sensitivity analysis for the methods being compared in the study.This was done in the following steps:

•
The original sample size (3000 pixels) decreased at an equal step of 200 (pixels), down to 1000 pixels at the minimum, leading to ten reduced sample sizes (N d d = 1, 2, . . ., 10) for ten new samples.Sample allocations to strata and sub-strata were done proportionately to the reduced sample sizes N d for new samples, according to sample allocation for the original sample shown in Table 3.

•
The ten new sample sets (of reduced sizes) for each method were collected from the original sample by SRS from the original sample (CA), individual strata (CS), or sub-strata (EO and EI).

•
Logistic regression modelling was done using samples of reduced sizes, for the four methods.For this, model selection was carried out for all methods with all new samples with reduced sizes (refer to Section 3.2 for technical detail).This resulted in different models with different optimal explanatory variables based on different new samples for different methods.

•
Local accuracies were estimated for different methods using their respective new samples of reduced sizes.The resultant accuracy estimations were assessed using the test sample (1020 pixels) mentioned in Section 3.4, leading to AUC values for different methods with new samples of reduced sizes.

•
Steps 2 through 4 were repeated 20 times, giving rises to 20 AUC values for each of the method with a specific reduced sample size N d Means and standard deviation were computed based on these AUC values.The mean AUC values for different methods with different reduced sample sizes are shown in Figure 3.
As shown in Figure 3, for all methods, mean AUC values decrease as sample sizes decrease.EO method has the highest mean AUC values except for the sample with 1200 pixels when CS method performs slightly better.When the sample size is greater than 1800 pixels, the order of performances from good to poor in terms of mean AUC values is Methods EO, CS, EI, and CA.As sample size decreases from 1800 to 1000, the order of performances for Methods CS, EI, and CA fluctuates, with Method EI performing worse than Method CA.Two more observations are outstanding from Figure 3. First, CA method's quality of predictions (as quantified by AUC values) does not increase as sample size increases after 1400.Second, even with much smaller sample size, class-specific methods, especially EO method, can perform better than or the same as CA.For example, AUC of EO with 1400 samples is very close to AUC of CA with 3000 samples.From a practical point of view, this is important as it shows a simple modification in modelling can save costs via decreasing sample size.
3. First, CA method's quality of predictions (as quantified by AUC values) does not increase as sample size increases after 1400.Second, even with much smaller sample size, class-specific methods, especially EO method, can perform better than or the same as CA.For example, AUC of EO with 1400 samples is very close to AUC of CA with 3000 samples.From a practical point of view, this is important as it shows a simple modification in modelling can save costs via decreasing sample size.7. Method EO is significantly more accurate than other methods, especially when the sample size is greater than 1800 pixels.Method EI is significantly less accurate than Method CS.When the sample size is no less than 2000, all kinds of stratified methods are significantly more accurate than CA.However, when sample size is less than 1800, the relativity among Methods EI, CS, and CA fluctuates.T-test was undertaken based on the aforementioned sets of AUC values (means and standard deviation) for different methods with sample sizes.The results are shown in Table 7. Method EO is significantly more accurate than other methods, especially when the sample size is greater than 1800 pixels.Method EI is significantly less accurate than Method CS.When the sample size is no less than 2000, all kinds of stratified methods are significantly more accurate than CA.However, when sample size is less than 1800, the relativity among Methods EI, CS, and CA fluctuates.

Discussion
As shown in the results obtained in the study, the proposed class-heterogeneity-stratified method (applied for sampling and logistic modelling jointly) was confirmed to be the most accurate for estimating local accuracies in comparison with other methods.Sensitivity analyses also showed the proposed method's effectiveness and robustness, confirming its fair level of reliability.This study has met its goal of testing the proposed method's performance in local accuracy estimation, as the first step towards building an integrative framework for accuracy estimation and information refinement.
Below, some aspects of the work reported in the paper are reflected upon, with further work prospected briefly.
Firstly, in the paper, residuals of logistic regression predictions were not analyzed with respect to spatial correlation, nor was logistic-regression-kriging explored for mapping local accuracies as in [30].In the logistic model in Equation (2), p(x) represents the probability of correct classification (or agreement between map and reference class labels) at pixel x. p(x) is actually the mean of a binary variable I(x) indicating if x is correctly classified: p(x) = E(I(x)).Logistic-regression-kriging can thus be viewed as kriging with local means to get estimation of I(x), with logistic regression predicting local means, while kriging transferring spatial information contained in residuals (i.e., I(x) − p(x)) from sampled locations to unsampled ones [46].It (logistic-regression-kriging) certainly merits consideration for mapping local accuracies, especially when regression residuals are spatial correlated (hence should be incorporated for improved estimation of local accuracies).However, given the paper's future orientation information refinement (after local accuracy estimation), it makes sense to perform kriging based on land-cover data concerned directly (rather than indicator data representing classification correctness) when pursuing data fusion in the future.Another reason for having not pursued kriging in the paper is the extra computational cost that would be incurred by implementing kriging after logistic regression, since sensitivity analysis, as a relatively novel aspect of this study, was already computationally expensive.
Secondly, in this study, double-stratified modelling in combination with double-stratified sampling was confirmed to be the most accurate for local accuracy estimation, given same sample sizes.However, it is worth exploring how sampling may be optimally configured (beyond double stratifications) with respect to information refinement [33,34] (the top priority in the future).Related to this is the issue of how we may figure out the optimal sample size for a specific study area given the budgets for reference data collection.Furthermore, it is important to devise methods for combined use of all reference data available to improve accuracy characterization and data fusion, regardless of with what designs (which may be more complex than those in this study) they were originally collected.We acknowledge that there is great room for improvements of and extensions to the work done in the paper, being aware that sampling is itself a topic of breadth and depth.
Thirdly, some technical aspects are worth further explorations.On one hand, given the facts that double-stratified modelling tends to become complicated with much more models to build than CA modelling and that sub-stratified models are very much homogenized over corresponding data sub-strata, it seems sensible to explore possible simplification of sub-stratified models without significantly compromising accuracy of estimation (e.g., using sub-stratum means).On the other hand, it is worth exploring the double-stratification method in time.For this, it is interesting to investigate how the double-stratification method may be used to characterize per-pixel accuracies in land-cover change [17,30].
Fourthly, we discuss the issue concerning threshold selection for defining heterogeneous vs. homogeneous sub-strata, which are essential for the proposed Method EO.As described in Section 2.2, the threshold for a sub-stratum being homogeneous was 4 pixels in a neighborhood of 3 by 3 pixels.This means that the class type of center pixel needs to be in majority (no less than 5/9) to claim it being in a relatively homogeneous neighborhood.Please note that homogeneity is defined as the number of pixels with the same class label as that of the center pixel in the focal neighborhood (see also Table 2).Clearly, unlike first level stratification by map classes that are fixed for a given map, sub-stratification into heterogeneous vs. homogeneous pixels segments in a stratum can be made on a more adaptive basis, as threshold selection is obviously related to and varies by the land-cover mosaic (cover types, patch shape, and landscape texture, etc.) depicted in the map being assessed.By adaptively selecting thresholds of homogeneity in sub-stratification, we can optimize sub-stratification by optimal thresholding to maximize reliability in estimated local accuracies under the constraints of sampling intensity and sample size.This issue (threshold selection) is certainly worth exploring in future research.Fifthly, we discuss potentially useful methods for per-pixel accuracy estimation in soft (subpixel) classifications [2,5].Soft classifications are often considered as a kind of fuzzy classifications.In other words, "fuzzy" is more general than "soft" in conceptual terms, as the former refers to the cases whereby classes themselves are vaguely defined (e.g., the severity of drought).However, for soft classifications representing subpixel proportions of candidate classes in individual pixels, their probabilistic interpretations seem to be more relevant.With this understanding, we assume numerical equivalence (or similarity, more correctly) between subpixel class proportions and fuzzy membership values without causing confusion in the following discussion.Comber et al. [32] represented one piece of pioneering work on per-pixel accuracy estimation for fuzzy/soft classifications, while Khatami et al. [47] was more recent contribution to the relevant literature.For such maps, per-pixel accuracy measures were differences between map and reference membership values (or class proportions) for a candidate class (denoted D) in [32,47] (absolute differences, |D|, were used in the former).In the papers by Comber et al. [32] and Khatami et al. [47], spatial interpolation was applied to generate surfaces of weighted moving window means of D's at sample locations, where weights are computed with distance-based kernels in a similar manner to geographically weighted regression (GWR).Clearly, the proposed method is not directly applicable to estimating local accuracies in fuzzy maps.To make the proposed method applicable to fuzzy classifications, two extensions are required.One concerns adaptions to regression modelling, the other is related to how heterogeneity is defined on fuzzy maps to better facilitate double-stratification on such maps.Regression modelling needs to consider the fact that accuracy measures applied to fuzzy maps are no longer probabilistic but continuous-valued D. Thus, regression analyses rather than logistic regression may be explored.See the work by Shortridge and Messina [48] for an example of analyses of continuous-valued errors in Shuttle Radar Topography Mission (SRTM) DEM and their associations with globally available topographic and land-cover variables across a wide range of landscapes in the United States, although it was not about classifications per se.On the other hand, definitions of heterogeneity vs. homogeneity should be reviewed in the context of fuzzy maps and their local accuracy modelling [49].It seems that continua of heterogeneity-homogeneity are closely related to class proportions (or fuzziness in class memberships), although relations are not yet well understood.Once heterogeneity is defined with proper thresholds, double-stratification may be implemented: sub-stratifications of heterogeneity vs. homogeneity are based on the thresholds chosen while strata of prototype map classes are based on alpha-cuts [49,50] or maximum membership values (or dominant classes' proportions) [51].
Lastly but not the least, it should be recognized that there is issue of uncertainty related to estimated local accuracies.This is so because the models obtained (i.e., significant explanatory variables and model parameters) were conditional to the specific sample data given, as shown in Table 4, even to sample data with same sampling design and sample size (Table 7).Reference sample data quality [52] is also a factor, although reference data in this research were assumed to be accurate.More importantly, the explanatory variables used in this research were derived from map data that were known to be contaminated with unavoidable misclassification errors.This means that estimated local accuracies were subject to two-levels of uncertainties propagating: from map data to explanatory variables (i.e., contextual features or landscape pattern indices computed from a land-cover map being assessed) and from explanatory variables to logistic-modelling-based estimation.Local accuracy estimation (reported in the paper and elsewhere) is, thus, by no means perfect, no matter how sophisticated the methods employed are.It is important to develop and promote methods that not only depict spatially varying accuracies in land-cover information products but also support uncertainty analyses in predicted per-pixel accuracies.
We discuss further the aforementioned two-level uncertainties in the remainder of this section.As mentioned above, unless field-measured data that are sufficiently accurate are used [53], spatial analyses and modelling based on remote-sensing data and land-cover information estimated from them are subject to uncertainty.There is impressive literature on uncertainty in landscape pattern indices (or metrics) and analyses due to misclassification errors in land-cover maps [54][55][56][57][58][59].As landscape pattern indices were used as explanatory variables for logistic modelling of accuracy in this paper, existing methods in the literature listed above may be usefully explored for analyzing sensitivities of relevant pattern indices to misclassification errors.
However, we need to go further to analyze and quantify uncertainty in estimated local accuracies using map-data-derived pattern indices in future research.Relevant literature is rather limited, especially with respect to uncertainty in logistic-modelling-based accuracy estimation.Nevertheless, literature on error-in-variables in regression analysis may shed light on issues of two-level uncertainties, while simulation-based error modelling is well worth exploring as another promising methodology.Regarding regression modelling considering error-in-variables, Zhang et al. [60] and Fu et al. [61]) addressed error-in-variables issue in the context of forest inventory using linear regression analyses based on remote-sensing data that are known to suffer from errors.Literature on error-in-variables in the context of logistic regression is more relevant to furthering this research, as logistic regression is designed for binary response variables (e.g., agreement/disagreement between map and reference labels).The work by Carroll and Wand [62] and Yi et al. [63] may serve as good starting points for further research.On the other hand, simulation-based approaches (e.g., [59]) also merit consideration.We can simulate (land-cover) maps containing misclassification errors.These simulated maps can be used to generate a large sample of maps showing estimated local accuracies.Statistical summary and analyses on these maps of local accuracies can provide useful information about the effects of map inaccuracy on resultant per-pixel accuracy estimation, supporting uncertainty-informed local accuracy estimation and information refinement.Simulation-based methods are necessarily adapted to facilitate conditioning to reference sample data, to accommodate spatial correlation in misclassification errors in land-cover maps being assessed, and to promote mechanism-based uncertainty analyses [64].

Conclusions
Local accuracy characterization for land-cover information products is important for users and producers alike.In this paper, Method EO (a class-heterogeneity-stratified method) is proposed for sampling and modelling in the context local accuracy estimation.This method was compared with three alternative methods (Methods EI, CS, and CA) based on GlobeLand30 2010 land cover over Wuhan.Comparisons were also made under different sample sizes through sensitivity analysis.It was confirmed that Method EO generally yields the most accurate estimates of local accuracies with varying sample sizes, while Method CS performs with the second highest accuracy.The work accomplished in this paper holds great potentials for further research on local accuracy estimation and information refinement based on data fusion, given continuing proliferation of land-cover information products and growing accumulation of reference sample data for product validation.
As discussed in Section 4, although logistic modelling using class-heterogeneity-stratified data was advocated as a promising method for local accuracy estimation with a fair level of reliability confirmed, there exist issues deserving further research.The major issue concerns two-level uncertainty in Method EO and other alternatives, as explanatory variables used were actually map data and their derivatives (i.e., contextual features), which are subject to errors themselves.As conventionally understood, the reliability of resultant local accuracy estimates depends on various factors, such as sampling intensity and sample size, domains of explanatory variables (spatial vs. spectral), and strengths of empirically derived accuracy-context relationships.While these factors were analyzed or discussed to some extent in the past, potential effects of two-level uncertainty on local accuracy estimation have not received the kind of research attention they deserve.Advancements on methods for handling and incorporating two-level uncertainty in local accuracy mapping will bring our research on land-cover information validation and refinement to an elevated level of sophistication.
F. Goodchild (Emeritus Professor of Geography at the University of California, Santa Barbara) and Roger Kirby (retiree from The University of Edinburgh, Scotland) have provided long-term advice about spatial uncertainty for J.Z., the principal author.

22 Figure 2 .
Figure 2. Maps of per-pixel probabilities of correct classification estimated by different methods (overlaid with observed classification correctness indicators at respective model-training sample pixels): (a) CA, (b) CS, (c) EI, and (d) EO.

Figure 2 .
Figure 2. Maps of per-pixel probabilities of correct classification by different methods (overlaid with observed classification correctness indicators at respective model-training sample pixels): (a) CA, (b) CS, (c) EI, and (d) EO.

Figure 3 .
Figure 3. Mean AUC values for different methods with varying sample sizes

Figure 3 .
Figure 3. Mean AUC values for different methods with varying sample sizes.

Table 3 .
Sample allocations for different sampling designs (number of sample pixels belonging to individual strata or sub-strata, while sampling intensities (in percentages) shown in parentheses are with respect to the total number of pixels (N strata ) belonging to specific strata or sub-strata in the land-cover map).

Table 4 .
Optimal models with significant explanatory variables for different methods.

Table 5 .
AUC values for different methods.

Table 6 .
T-test results of statistical significance in differences between AUC values shown in Table5(p-value shown, * for significant at α = 0.10, ** for significant at α = 0.05).