Surface Heterogeneity-Involved Estimation of Sample Size for Accuracy Assessment of Land Cover Product from Satellite Imagery

Sample size estimation is a key issue for validating land cover products derived from satellite images. Based on the fact that present sample size estimation methods account for the characteristics of the Earth’s subsurface, this study developed a model for estimating sample size by considering the scale effect and surface heterogeneity. First, we introduced a watershed with different areas to indicate the scale effect on the sample size. Then, by employing an all-subsets regression feature selection method, three landscape indicators describing the aggregation and diversity of the land cover patches were selected (from 14 indicators) as the main factors for indicating the surface heterogeneity. Finally, we developed a multi-level linear model for sample size estimation using explanatory variables, including the estimated sample size (n) calculated from the traditional statistical model, size of the test region, and three landscape indicators. As reference data for developing this model, we employed a case study in the Jiangxi Province using a 30 m spatial resolution global land cover product (Globeland30) from 2010 as a classified map, and national 30 m land use/cover change (LUCC) data from 2010 in China. The results showed that the adjusted square coefficient of R2 is 0.79, indicating that the joint explanatory ability of all predictive variables in the model to the sample size is 79%. This means that the predictability of this model is at a good level. By comparing the sample size NS obtained by the developed multi-level linear model and n as calculated from the statistics model, we find that NS is much smaller than n, which mainly contributes to the concerns regarding surface heterogeneity in this study. The validity of the established model is tested and is proven as effective in the Anhui Province. This indicates that the estimated sample size from considering the scale effect and spatial heterogeneity in this study achieved the same accuracy as that calculated from a probability statistical model, while simultaneously saving more time, labour, and money in the accuracy assessment of a land cover dataset.


Introduction
Land cover provides basic geospatial information for applications in the fields of global environmental change, natural resources management, carbon and nitrogen cycle, and ecological monitoring [1][2][3]. Because of the continuous earth surface scanning and the correspondingly long-term data archives, satellite remote sensing is proven as an effective way in mapping global land cover and measuring land cover dynamic change [4]. Currently, national and international agencies have successfully created no less than ten global scale land cover datasets with spatial resolutions of 1 km, Therefore, this study addresses how to determine a number of samples while considering surface spatial heterogeneity.
Statistical theory is the foundation for determining the sample size. This study derived a sample size estimation using a stratified sampling approach. Then, a multi-level linear sample size estimation model was developed by considering the scale effect and surface spatial heterogeneity, with emphasis on two aspects of these issues. First, a watershed unit with ecological and geographical significance was introduced in this study as the basic spatial unit for performing the accuracy assessment, avoiding the objectivity issues existing in current spatial units of pixels or polygons [28]. Second, landscape indicators were employed to describe the surface heterogeneity and complexity. As the characteristics of the spatial heterogeneity would inevitably affect the sample size used for validating the land cover dataset, this study computed several major landscape indicators and assessed their impacts on the surface heterogeneity in watershed units, thereby reaching the goal of this study (to develop a reasonable model to estimate the sample size).
The remainder of the paper is organised as follows. Section 2 introduces the study area, data sources, and data pre-processing methods. Section 3 describes the commonly used statistical model of sample size estimation, the selection of scale factor and landscape indicators, and the development of the multi-level linear regression model. Section 4 presents results and an analysis of the developed model, and Section 5 provides preliminary conclusions.

Study Area
The Jiangxi Province is located in south-eastern China, with a total area of 169,900 square kilometres. It belongs to a subtropical humid climate with abundant rainfall. The landform is dominated by mountains (36%) and hills (42%). The main land use types are forest lands and crop lands, with the forest coverage rate reaching 60%, ranking as first in China. The typical geomorphological and climatic features cause Jiangxi to be covered by various types of land covers. Therefore, we selected the Jiangxi Province as the study site to develop the sample size estimation model [29,30]. The Anhui province which is adjacent to the Jiangxi Province is selected to testify the developed model ( Figure 1). Until now, the determination of a sample size has primarily relied on expert knowledge or conditional assumptions. This often makes it difficult to ensure the rationality of the sample size. Therefore, this study addresses how to determine a number of samples while considering surface spatial heterogeneity.
Statistical theory is the foundation for determining the sample size. This study derived a sample size estimation using a stratified sampling approach. Then, a multi-level linear sample size estimation model was developed by considering the scale effect and surface spatial heterogeneity, with emphasis on two aspects of these issues. First, a watershed unit with ecological and geographical significance was introduced in this study as the basic spatial unit for performing the accuracy assessment, avoiding the objectivity issues existing in current spatial units of pixels or polygons [28]. Second, landscape indicators were employed to describe the surface heterogeneity and complexity. As the characteristics of the spatial heterogeneity would inevitably affect the sample size used for validating the land cover dataset, this study computed several major landscape indicators and assessed their impacts on the surface heterogeneity in watershed units, thereby reaching the goal of this study (to develop a reasonable model to estimate the sample size).
The remainder of the paper is organised as follows. Section 2 introduces the study area, data sources, and data pre-processing methods. Section 3 describes the commonly used statistical model of sample size estimation, the selection of scale factor and landscape indicators, and the development of the multi-level linear regression model. Section 4 presents results and an analysis of the developed model, and Section 5 provides preliminary conclusions.

Study Area
The Jiangxi Province is located in south-eastern China, with a total area of 169,900 square kilometres. It belongs to a subtropical humid climate with abundant rainfall. The landform is dominated by mountains (36%) and hills (42%). The main land use types are forest lands and crop lands, with the forest coverage rate reaching 60%, ranking as first in China. The typical geomorphological and climatic features cause Jiangxi to be covered by various types of land covers. Therefore, we selected the Jiangxi Province as the study site to develop the sample size estimation model [29,30]. The Anhui province which is adjacent to the Jiangxi Province is selected to testify the developed model ( Figure 1).

Data Sources
In this study, the global land cover dataset 'Globeland30' from 2010 was selected as the test dataset (http://www.globeland30.org). Globeland30 is a high-resolution land cover mapping product developed by the National Geomatics Centre of China, with a spatial resolution of 30 m. It has attracted the attention of researchers at home and abroad. The Globeland30 dataset includes 10 first-level classes. An approach based on the integration of pixel-and object-based methods with knowledge (POK) was used to extract land categories, effectively improving the classification accuracy [31].
The reference data is the national land use/cover change (LUCC) dataset from 2010 with a scale of 1:100,000, which was provided by the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences (CAS). The data was produced mainly by visual interpretation from remote sensing experts in China and was updated every five years. The LUCC data has six first-level classes and 25 second-level classes. Field investigation has shown that the LUCC classification accuracy is very high, with more than 90% overall accuracy (OA) for the second-level class. Both of these two datasets use the same major data sources of Landsat imagery [32,33]. In addition, the digital elevation model (DEM) dataset of the 'Advanced Spaceborne Thermal Emission and Reflection Radiometer Global Digital Elevation Model' (ASTGTM) (http://datamirror.csdb.cn) in the study area was downloaded to perform the division of the watershed unit.

Classification System Transformation
The Globeland30 and LUCC datasets must be pre-processed before accuracy assessment, owing to their differences in projection and classification systems. For avoiding area deformation, the Globeland30 dataset was projected to the Albert equal area projection (European Petroleum Survey Group (EPSG) code: 3857), which is the projection system employed for the LUCC dataset. Then, these two products were unified in the classification system. Based on the LUCC definition for each product, the classification transformation is presented in Table 1 [34,35]. Finally, we obtained the test and reference data, as shown in Figure 2.

Digital Elevation Model (DEM) Processing
The DEM was projected into the Albert equal area projection system. A hydrological analysis was implemented by using ArcGIS 10.2 to calculate the watershed units. The watershed unit number was decided by the threshold value of the catchment surface [36][37][38]. After repeated experiments and visualization on the division of the watershed unit, we used a threshold value of flow accumulation of 4,000,000 for the catchment surface to obtain 48 basin units. Some units with small areas emerged adjacent to the unit that has the largest area. Finally, we collected 30 basin units as shown in Figure  3b, to perform the following study.

Digital Elevation Model (DEM) Processing
The DEM was projected into the Albert equal area projection system. A hydrological analysis was implemented by using ArcGIS 10.2 to calculate the watershed units. The watershed unit number was decided by the threshold value of the catchment surface [36][37][38]. After repeated experiments and visualization on the division of the watershed unit, we used a threshold value of flow accumulation of 4,000,000 for the catchment surface to obtain 48 basin units. Some units with small areas emerged adjacent to the unit that has the largest area. Finally, we collected 30 basin units as shown in Figure 3b, to perform the following study.

Sample Size Determination from Probability Statistical Model
The estimation of the sample size is dependent on the sampling design method. For a land cover dataset with multiple categories, a stratified random sampling design has often been recommended in related research [39][40][41]. Cochran provided a set of rigorous calculation equations to obtain a

Sample Size Determination from Probability Statistical Model
The estimation of the sample size is dependent on the sampling design method. For a land cover dataset with multiple categories, a stratified random sampling design has often been recommended in related research [39][40][41]. Cochran provided a set of rigorous calculation equations to obtain a sample size from the perspective of statistical theory, which is the foundation of latter research on sampling techniques [27,42,43]. According to the good practices for estimating accuracy recommended by Olofsson et al. [27], the commonly used equation for calculating sample size is: where, V(y st ) is the standard error of the estimated OA, and is generally designated as 0.01, W h is the stratum weight (proportion of area of class i in the map), S h = p i (1 − p i ) [43], S h represents the level i standard deviation, and p i is the user accuracy, which can be obtained through experiments [42,44].

Determination of Variables in a Multi-Level Linear Model
On the basis of the estimated sample size from the statistical model, this study attempted to develop a sample size estimation model by considering the scale effects of spatial units and the spatial heterogeneity reflected from the land cover product. As at least three factors, i.e., sample size from the statistical model, spatial effect, and heterogeneity characteristics, impacted the number of samples, this study employed a multi-level linear regression method to develop the sample size estimation, expressed as follows: In the above, N S is the estimated sample size, n is the sample size calculated by the probability sampling method (Equation (1)), and C represents the sampling constraint on each watershed unit. Theoretically, the larger the test region is, the more samples are needed, and vice versa. In that regard, the 'best' sample constraints for regions with different areas should be discussed in this study. L is a set of spatial heterogeneity factors, including a number of landscape indicators that can indicate information on the fragmentation, diversity, and stability of the watershed units. A 0 , A 1 , A 2 , and A 3 are regression coefficients, and will be discussed in Section 4. With the help of watershed units, we obtained Globeland30 and LUCC data as test and reference data respectively, for each single watershed. This following part mainly describes the calculation and determination of n, N S , C, L, and the other variables needed in Equation (2). The main procedures are shown in Figure 4. the 'best' sample constraints for regions with different areas should be discussed in this study. L is a set of spatial heterogeneity factors, including a number of landscape indicators that can indicate information on the fragmentation, diversity, and stability of the watershed units. A0, A1, A2, and A3 are regression coefficients, and will be discussed in Section 4.
With the help of watershed units, we obtained Globeland30 and LUCC data as test and reference data respectively, for each single watershed. This following part mainly describes the calculation and determination of n, NS, C, L, and the other variables needed in Equation (2). The main procedures are shown in Figure 4.  (1) in Section 3.1 is used to calculate the sample size n. We can see that all parameters in Equation (1) are available except for S h . To ensure the optimization and specificity of the n value in each watershed unit, S h is determined by stratified random sampling for the land cover products  (1) in Section 3.1 is used to calculate the sample size n. We can see that all parameters in Equation (1) are available except for S h . To ensure the optimization and specificity of the n value in each watershed unit, S h is determined by stratified random sampling for the land cover products of Globeland30. The main steps are as follows ( Figure 5): First, each basin unit in the study area is sampled by the stratified random sampling method. Sample N S values are assigned by 50, 100, 150, . . . , 1000, i.e., with an interval of 50. Second, a user's accuracy p i from different samples N S in each basin unit is calculated from the constructed error matrix, while using LUCC as a reference data. Third, S h is obtained for each single basin unit according to the evaluated results.

Determining NS, n, and C
Equation (1) in Section 3.1 is used to calculate the sample size n. We can see that all parameters in Equation (1) are available except for S h . To ensure the optimization and specificity of the n value in each watershed unit, S h is determined by stratified random sampling for the land cover products of Globeland30. The main steps are as follows ( Figure 5): First, each basin unit in the study area is sampled by the stratified random sampling method. Sample NS values are assigned by 50, 100, 150, ••••••, 1000, i.e., with an interval of 50. Second, a user's accuracy pi from different samples NS in each basin unit is calculated from the constructed error matrix, while using LUCC as a reference data. Third, S h is obtained for each single basin unit according to the evaluated results. As an example, Table 2 describes the S h results for the Number 2 basin unit. The value of h ranges from 1 to 8, representing eight different strata or land types of Globeland30. W h is the area weight information of each stratum in the basin unit. The S h for each basin unit can be obtained by the above-mentioned stratified random sampling.
After obtaining the parameter S h , the sample size n of each NS can be calculated from Equation   As an example, Table 2 describes the S h results for the Number 2 basin unit. The value of h ranges from 1 to 8, representing eight different strata or land types of Globeland30. W h is the area weight information of each stratum in the basin unit. The S h for each basin unit can be obtained by the above-mentioned stratified random sampling.
After obtaining the parameter S h , the sample size n of each N S can be calculated from Equation  Figure 6 shows broken-line maps between the N S values obtained from the successive stratified random sampling of 30 basin units and the OA information. The blue line is the result from the stratified random sampling method, the green line represents the result of the accuracy assessed by the whole sample, the black dotted line gives the allowable precision range under the premise of an absolute error of 0.01, and the red circled part is the selected region where the OA varies with N S in the acceptable precision range. The first N S value in the circled region is regarded as the estimated sample size [45], aiming to obtain a reasonable result in accuracy assessment by using as few samples as possible. The sample size n corresponds to N S . Once N S is determined, n can also be obtained. Finally, we obtained 30 records of N S and n from the 30 basin units.
Sample constraint C describes how large of a region can allocate a sample, which is a factor affecting the number of samples. In this study, the sample constraints of each basin unit were obtained by the following expression: Here, S is the basin unit area (km 2 ).     Figure 6. Determination of sample size Ns. Figure 6. Determination of sample size Ns.

Selection of Landscape Indicators
The landscape index, an index for quantitative analysis of landscape patterns, can measure the type, quantity, shape, spatial distribution, and complexity of the analysis units [46,47]. In recent years, an increasing number of studies have used the landscape index to describe spatial heterogeneity information, although their focus is not on the estimation of sample size, but on the layout of sample points [45,48] or land cover extraction [49,50]. According to the target of estimating the sample size of the surface coverage data, 14 landscape indicators were selected to describe the spatial heterogeneity information of the landscape levels in the watershed units from seven categories: area metrics, contrast metrics, edge metrics, shape metrics, proximity metrics, aggregation metrics, and diversity indexes. The ecological significance and descriptions of these indicators are presented in Table 3. Table 3. Landscape indicators used in this study.

Class
Name Unit Range We need to identify the most representative indicators from the 14 landscape indexes in the 7 categories. All-subsets regression, a commonly used method for feature selection, was employed to select the indicators. By adjusting the values of R 2 , the 'best' model was selected to determine the variables of the fitting model. As shown in Figure 7, we found that the adjusted R 2 value of the 'best' model was 0.78, and the corresponding landscape indices were the landscape shape index (LSI), contagion index (CONTAG), Shannon's evenness index (SHEI), area-weighted mean shape index (AWMSI), area-weighted mean patch fractal dimension (AWMPFD), and patch richness density (PRD). In addition, the R 2 value of the sample size from the probability sampling theory and sample constraints in the fitting model is greater than 0.7, indicating that both of them are independent in the model for sample size estimation. Therefore, the 6 indicators of LSI, CONTAG, SHEI, AWMSI, AWMPFD, and PRD were selected for the following multi-level regression analysis. contagion index (CONTAG), Shannon's evenness index (SHEI), area-weighted mean shape index (AWMSI), area-weighted mean patch fractal dimension (AWMPFD), and patch richness density (PRD). In addition, the R 2 value of the sample size from the probability sampling theory and sample constraints in the fitting model is greater than 0.7, indicating that both of them are independent in the model for sample size estimation. Therefore, the 6 indicators of LSI, CONTAG, SHEI, AWMSI, AWMPFD, and PRD were selected for the following multi-level regression analysis.

Regression Analysis
Equation (2) indicates that the sample size estimation is supposed to be a multi-level linear function of sample size from probability sampling theory, sample constraints, and spatial heterogeneity. Among them, coefficients A 0 , A 1 , A 2 , and A 3 are determined by regression analysis. In this study, an ordinary least squares (OLS) regression model was used to determine these regression coefficients [51,52]. OLS is one of the most commonly used core methods in multi-level linear regression models. Its form is as follows:Ŷ i =B 0 +B 1 X 1i + · · · · · · +B k X ki (4) where, i = 1, 2, . . . , n,Ŷ i is the predicted value of the dependent variable corresponding to the i th observation, X ki denotes the value of the j th predictive variable corresponding to the i th observation, andB 0 denotes the intercept term, i.e., the predicted value of Y when all of the predicted variables are zero.B k is the regression coefficient of the predictive variable j, i.e., the change of Y caused by X j changing a unit. The OLS model obtains regression coefficients by reducing the difference between the real values of response variables and the predicted values, i.e., by minimizing the sum of squares of residual errors:

Multi-Level Regression
As shown in Figure 3b, there are 30 basin units in the study area. Therefore, we obtained a total of 30 records with optimised sample size N S . To show the correlations between sample size N S and the explanatory variables, a scatterplot matrix was obtained, and is presented in Figure 8. The diagonal area is the density map of the variables, whereas the blue line in the non-diagonal area represents the linear and smooth-fitting curves. From the matrix, every predictive variable has a tilt trend to some extent. The sample size N S decreases with the increase of sample constraints (C), SHEI, and AWMPFD, whereas it increases with an increase of CONTAG and PRD. This means that the relationship between the sample size N S and the independent variables is not a phenomenon of simple positive or negative correspondence. the explanatory variables, a scatterplot matrix was obtained, and is presented in Figure 8. The diagonal area is the density map of the variables, whereas the blue line in the non-diagonal area represents the linear and smooth-fitting curves. From the matrix, every predictive variable has a tilt trend to some extent. The sample size NS decreases with the increase of sample constraints (C), SHEI, and AWMPFD, whereas it increases with an increase of CONTAG and PRD. This means that the relationship between the sample size NS and the independent variables is not a phenomenon of simple positive or negative correspondence. After analysing the scatterplot matrix, the multi-level linear regression analysis was implemented, and the coefficients are shown in Table 4. The column for Pr(>t) shows the significance of the regression coefficients of the independent variables, and the column of 'Significance codes' represents the degree of significance. The regression coefficients of AWMSI, AWMPFD, and PRD are not significant enough to pass the t test, and the multiple R 2 of the model is different from the adjusted R 2 . This reflects a problem of instability. Therefore, the model needs to be improved. From additional experiments, we found that the regression coefficients of the independent variables are more significant when AWMSI, AWMPFD, and PRD are removed. Although the R 2 value decreases, the stability performance improves ( Table 5). As a result, only 3 indicators were involved in the final estimation of sample size.
By substituting Equation (1) into Equation (6), we finally obtain the estimation of the sample size by considering the scale effect and spatial heterogeneity characteristics, and it is expressed as follows: (7) This relationship shows that LSI, CONTAG, and SHEI contribute positively to the sample size N S , whereas the sample size n of probability sampling theory and the sample constraint C contribute negatively to the sample size N S . CONTAG and SHEI contribute the most and play a vital role in the change of sample size. All of the predictive variables explain the variance of 79% of the sample size N S .

Model Verification
This study used a cross-validation method to test the generalization ability of the established OLS regression model. Cross-validation, as a commonly used model validation technology, has the advantage of high prediction accuracy [53]. Considering the small amount of data used to fit the model, three-fold cross-validation was used. Three-fold cross-validation was used to divide the original sample into three equal-sized sub-samples, one of which was retained as test data for model validation, while the other two sub-samples were used as training data. This process was repeated three times, and each of the three sub-samples was used as the validation data. The average R 2 of the three cross-validation results was taken as the final estimation solution, as shown in Table 6. The results show that there are some differences between the original R 2 and the three-fold cross-validation R 2 . The R 2 value changes greatly after cross-validation, indicating that the stability between the variables and the generalization ability of the models is less than satisfactory. In addition to the cross-validation, we applied the method of sample size estimation to the Anhui Province to test the developed model. Using the same method as mentioned above, we divided the Anhui province into 32 watersheds and selected 14 units to perform the model validation ( Figure 9). The same datasets as those used in the Jiangxi Province were used to test this model. First, we calculated the sample size using a statistical model, i.e., Equation (1), and the developed model in this study, i.e., Equation (7), respectively. Then, we assessed the accuracy of Globeland30 by using the two above-calculated sample sizes. Finally, we compared the accuracies and the results are presented in Table 7. We can see that the difference of OA obtained from two different approaches is very small in most watersheds, the maximum and minimum difference is 4% and 0.1% respectively, with an average of 1.21%. In contrast to the similar accuracy, the sample sizes calculated by the developed model are smaller than that from the statistical model. For the employed 14 watersheds in the Anhui Province, there are 24,425 sample points calculated by the statistical model, while it is 12,399 computed by the developed model in this work. Compared with the statistical model, our developed model decreases the sample size by 49%. This indicates that a smaller sample size can achieve the same performance as the statistical model by considering the scale effect and surface heterogeneity.

Relative Importance of Predictor Variables
The OLS regression method is used to analyse the influence of the sample size n of the probability sampling theory, the sample constraints C, and spatial heterogeneity factors on the estimated sample size N S , and the relationship between them is obtained in Equation (7). The standardised regression coefficient method and the relative weight method are employed to evaluate the relative importance of the predictor variables in the multi-level regression analysis. The results are shown in Table 8 and Figure 10. The standardised regression coefficient is the simplest method for predicting the relative importance of variables. It represents the expected change of response variables caused by the change of one standard deviation of a predictor variable when other predictor variables remain unchanged. Table 8 shows that when the other variables remain unchanged, SHEI changes by a standard deviation, and that the sample size will increase by 2.13 standard deviations, i.e., the most important relative to the sample size N S . In contrast, the sample size n of the probability sampling theory has the least relative importance to the sample size.
As compared with the standardized regression coefficient, the relative weight is a more promising method for predicting the relative importance of factors [54]. It ranks variables according to their contribution to R 2 . Figure 10 shows the relative importance of each factor. The results show that the sample constraint C accounts for 44% of R 2 , which is of the greatest relative importance to the sample size N S . It shows that the regional scale cannot be neglected in the sample size determination. SHEI and CONTAG, as spatial heterogeneity factors describing the aggregation and diversity of watershed units, explain 25% and 15% of R 2 , respectively. The remaining factors are LSI and n. Therefore, in terms of relative importance, the regional scale effect and spatial heterogeneity have an influence on the determination of the sample size in accuracy assessment of remotely sensed land cover products.  Table 8 and Figure 10. The standardised regression coefficient is the simplest method for predicting the relative importance of variables. It represents the expected change of response variables caused by the change of one standard deviation of a predictor variable when other predictor variables remain unchanged. Table 8 shows that when the other variables remain unchanged, SHEI changes by a standard deviation, and that the sample size will increase by 2.13 standard deviations, i.e., the most important relative to the sample size NS. In contrast, the sample size n of the probability sampling theory has the least relative importance to the sample size.
As compared with the standardized regression coefficient, the relative weight is a more promising method for predicting the relative importance of factors [54]. It ranks variables according to their contribution to R 2 . Figure 10 shows the relative importance of each factor. The results show that the sample constraint C accounts for 44% of R 2 , which is of the greatest relative importance to the sample size NS. It shows that the regional scale cannot be neglected in the sample size determination. SHEI and CONTAG, as spatial heterogeneity factors describing the aggregation and diversity of watershed units, explain 25% and 15% of R 2 , respectively. The remaining factors are LSI and n. Therefore, in terms of relative importance, the regional scale effect and spatial heterogeneity have an influence on the determination of the sample size in accuracy assessment of remotely sensed land cover products.  Figure 11 is a histogram of the sample size NS obtained from the multi-level linear model and sample size n and based on probability and statistics theory. Figures 12 and Figure 13 are polygonal maps of the landscape index and sample constraint factor C, respectively. LSI reflects the complexity of the overall landscape shape of the watershed units. The larger the value, the simpler the overall landscape shape. CONTAG describes the degree of agglomeration of patch types in the watershed units. The higher the value, the better the degree of agglomeration. SHEI describes the diversity of the patch distribution in the watershed units. The smaller the value, the simpler the patch type, and the smaller the diversity.  Figure 11 is a histogram of the sample size N S obtained from the multi-level linear model and sample size n and based on probability and statistics theory. Figures 12 and 13 are polygonal maps of the landscape index and sample constraint factor C, respectively. LSI reflects the complexity of the overall landscape shape of the watershed units. The larger the value, the simpler the overall landscape shape. CONTAG describes the degree of agglomeration of patch types in the watershed units. The higher the value, the better the degree of agglomeration. SHEI describes the diversity of the patch distribution in the watershed units. The smaller the value, the simpler the patch type, and the smaller the diversity. NS and n are obtained under the premise that the OA standard deviation V(y st ) equals 0.01. According to Figure 11, the value of n is often more than 1000, and most of them are distributed near 1750, whereas the values of NS are much smaller. This shows that the sample size NS, as obtained by the multi-level linear model considering spatial heterogeneity, can save more manpower and material resources in the accuracy assessment of a land cover dataset.

Analysis
The green squares in Figure 11 indicate the relative size of the basin unit area, i.e., the scale difference. According to Figure 3b, there are evident scale differences among the basin units. In theory, the larger the scale of the watershed unit, the more samples are extracted, and the n values obtained based on the probability and statistics theory should show similar laws. However, the larger the scale of the unit, the smaller the NS values needed, such as the basin unit number (No.) 11,No. 20,and No. 35, and the smaller the unit, the larger the NS values, e.g., basin unit No. 14 and No. 17. According to Figures 12 and 13, we can explain why NS changes with areas of the spatial unit. The values of the LSI, CONTAG, and SHEI in units No . 11,No. 20,and No. 35 are higher, indicating that these few units have a high patch aggregation, uniform patch distribution, and simple landscape shape, and thus weak spatial heterogeneity. Although the scale is large, the required sample size NS is small. However, the values of LSI and CONTAG in units No. 14 and No. 17 are low, and the SHEI values are high. They need a large sample size of NS. Therefore, by analysing the relationship between the sample size NS of the watershed units No. 11,No. 14,No. 17,No. 20,No. 35 and the landscape index, it is further demonstrated that the values calculated from the multi-level sample size estimation model (considering the spatial heterogeneity of land cover) are more reasonable than those calculated from the probability statistical model. Figure 12. Diagrams of landscape shape index (LSI), contagion index (CONTAG), and C.
As noted above, C indicates how large of a sample is taken. If the land cover in different areas is homogeneous and the patch types and distribution are the same, the C values are similar. However, in Figure 12, the trend of C variation is similar to that of LSI, because the spatial distribution of the N S and n are obtained under the premise that the OA standard deviation V(y st ) equals 0.01. According to Figure 11, the value of n is often more than 1000, and most of them are distributed near 1750, whereas the values of N S are much smaller. This shows that the sample size N S , as obtained by the multi-level linear model considering spatial heterogeneity, can save more manpower and material resources in the accuracy assessment of a land cover dataset.
The green squares in Figure 11 indicate the relative size of the basin unit area, i.e., the scale difference. According to Figure 3b, there are evident scale differences among the basin units. In theory, the larger the scale of the watershed unit, the more samples are extracted, and the n values obtained based on the probability and statistics theory should show similar laws. However, the larger the scale of the unit, the smaller the N S values needed, such as the basin unit number (No.) 11,No. 20,and No. 35, and the smaller the unit, the larger the N S values, e.g., basin unit No. 14 and No. 17. According to Figures 12 and 13, we can explain why N S changes with areas of the spatial unit. The values of the LSI, CONTAG, and SHEI in units No. 11,No. 20,and No. 35 are higher, indicating that these few units have a high patch aggregation, uniform patch distribution, and simple landscape shape, and thus weak spatial heterogeneity. Although the scale is large, the required sample size N S is small. However, the values of LSI and CONTAG in units No. 14 and No. 17 are low, and the SHEI values are high. They need a large sample size of N S . Therefore, by analysing the relationship between the sample size N S of the watershed units No. 11,No. 14,No. 17,No. 20,No. 35 and the landscape index, it is further demonstrated that the values calculated from the multi-level sample size estimation model (considering the spatial heterogeneity of land cover) are more reasonable than those calculated from the probability statistical model. NS and n are obtained under the premise that the OA standard deviation V(y st ) equals 0.01. According to Figure 11, the value of n is often more than 1000, and most of them are distributed near 1750, whereas the values of NS are much smaller. This shows that the sample size NS, as obtained by the multi-level linear model considering spatial heterogeneity, can save more manpower and material resources in the accuracy assessment of a land cover dataset.
The green squares in Figure 11 indicate the relative size of the basin unit area, i.e., the scale difference. According to Figure 3b, there are evident scale differences among the basin units. In theory, the larger the scale of the watershed unit, the more samples are extracted, and the n values obtained based on the probability and statistics theory should show similar laws. However, the larger the scale of the unit, the smaller the NS values needed, such as the basin unit number (No.) 11,No. 20,and No. 35, and the smaller the unit, the larger the NS values, e.g., basin unit No. 14 and No. 17. According to Figures 12 and 13, we can explain why NS changes with areas of the spatial unit. The values of the LSI, CONTAG, and SHEI in units No. 11,No. 20,and No. 35 are higher, indicating that these few units have a high patch aggregation, uniform patch distribution, and simple landscape shape, and thus weak spatial heterogeneity. Although the scale is large, the required sample size NS is small. However, the values of LSI and CONTAG in units No. 14 and No. 17 are low, and the SHEI values are high. They need a large sample size of NS. Therefore, by analysing the relationship between the sample size NS of the watershed units No. 11,No. 14,No. 17,No. 20,No. 35 and the landscape index, it is further demonstrated that the values calculated from the multi-level sample size estimation model (considering the spatial heterogeneity of land cover) are more reasonable than those calculated from the probability statistical model. As noted above, C indicates how large of a sample is taken. If the land cover in different areas is homogeneous and the patch types and distribution are the same, the C values are similar. However, in Figure 12, the trend of C variation is similar to that of LSI, because the spatial distribution of the As noted above, C indicates how large of a sample is taken. If the land cover in different areas is homogeneous and the patch types and distribution are the same, the C values are similar. However, in Figure 12, the trend of C variation is similar to that of LSI, because the spatial distribution of the patches in different basin units is different. In an area with simple spatial distribution and weak spatial heterogeneity, N S is small, and the C value is high, i.e., C has a negative correlation to N S , whereas when N S is large and the C value is low, C has a positive correlation to N S .
Where the C values are the same, e.g., in basin units No. 16  patches in different basin units is different. In an area with simple spatial distribution and weak spatial heterogeneity, NS is small, and the C value is high, i.e., C has a negative correlation to NS, whereas when NS is large and the C value is low, C has a positive correlation to NS.
Where the C values are the same, e.g., in basin units No. 16 39. As a result, C, as a factor affecting the sample size of NS, is also influenced by the surface spatial heterogeneity.

Conclusion
The accuracy of a dataset is often the first problem to be considered in scientific research and field applications. Sample size calculation is the first step in performing an accuracy assessment of land cover products from satellite imagery. On the basis of a statistical model for the estimation of sample size, this study establishes a multi-level linear model for estimation of sample size by considering the scale effect and spatial heterogeneity. A watershed unit was introduced to obtain a spatial analysis unit, to avoid subjectivity in the selection of assessment units. Landscape indices were selected to indicate the spatial heterogeneity of the region.
The multi-level linear sample size estimation model shows that: (1): All predictive variables can explain 79% of the variance of the sample size NS. The coefficients of the predictive variables of the model are significant, indicating that there is a strong relationship between the sample size NS and the independent variables. By comparing the sample size NS from the multi-level linear model with a sample size n based on probability and statistics theory, we see that the sample size of NS is much smaller than that of n. The smaller sample size can achieve the same performance as the statistical model and it contributes to the consideration of surface heterogeneity. The relative importance of the predicted variables in the model is calculated by using standardised regression coefficients and relative weights. The results show that the CONTAG and SHEI indicators (describing the diversity and dispersion of basin units, respectively) are relatively important, followed by LSI, sample constraint C, and sample size n, as calculated from probability sampling theory. According to the validation of the developed model, we can conclude that the smaller sample size from the developed estimation model can achieve the same performance as the statistical model while saving more time, cost, and energy in the accuracy assessment of land cover products.
(2): After performing three-fold cross-validation, the R 2 value changes from 0.79 to 0.63. This means that the generalization of the sample size estimation model is still a problem, although the test of the model in the Anhui Province proved the validity of this estimation of sample size. Therefore,

Conclusions
The accuracy of a dataset is often the first problem to be considered in scientific research and field applications. Sample size calculation is the first step in performing an accuracy assessment of land cover products from satellite imagery. On the basis of a statistical model for the estimation of sample size, this study establishes a multi-level linear model for estimation of sample size by considering the scale effect and spatial heterogeneity. A watershed unit was introduced to obtain a spatial analysis unit, to avoid subjectivity in the selection of assessment units. Landscape indices were selected to indicate the spatial heterogeneity of the region.
The multi-level linear sample size estimation model shows that: (1): All predictive variables can explain 79% of the variance of the sample size N S . The coefficients of the predictive variables of the model are significant, indicating that there is a strong relationship between the sample size N S and the independent variables. By comparing the sample size N S from the multi-level linear model with a sample size n based on probability and statistics theory, we see that the sample size of N S is much smaller than that of n. The smaller sample size can achieve the same performance as the statistical model and it contributes to the consideration of surface heterogeneity. The relative importance of the predicted variables in the model is calculated by using standardised regression coefficients and relative weights. The results show that the CONTAG and SHEI indicators (describing the diversity and dispersion of basin units, respectively) are relatively important, followed by LSI, sample constraint C, and sample size n, as calculated from probability sampling theory. According to the validation of the developed model, we can conclude that the smaller sample size from the developed estimation model can achieve the same performance as the statistical model while saving more time, cost, and energy in the accuracy assessment of land cover products.
(2): After performing three-fold cross-validation, the R 2 value changes from 0.79 to 0.63. This means that the generalization of the sample size estimation model is still a problem, although the test of the model in the Anhui Province proved the validity of this estimation of sample size. Therefore, we need more work on the improvement and testing of the developed model for sample size estimation.
For a specific work on accuracy assessment, although the model established in this study cannot be directly applied, it is expected to provide an approach for the determination of sample size, by considering the study areas and the characteristics of the surface heterogeneity. In the future, we need to improve the developed model by applying this surface heterogeneity-concerned sample size estimation model to other study sites, aiming to assess the accuracy of a land cover dataset with as few samples as possible.