Analysis of the Impact of Positional Accuracy When Using a Single Pixel for Thematic Accuracy Assessment

Remote Sens. 2020, 12, 4093

Abstract: The primary goal of thematic accuracy assessment is to measure the quality of land cover products, and it has become an essential component of global and regional land cover mapping. However, many uncertainties introduced in the validation process can propagate into the derived accuracy measures and therefore impact the decisions made with these maps. Choosing the appropriate reference data sample unit is one of the most important decisions in this process. The majority of researchers have used a single pixel as the assessment unit for thematic accuracy assessment, while others have claimed that a single pixel is not appropriate. The research reported here presents the results of a simulation analysis from the perspective of positional errors. Factors including landscape characteristics, the classification scheme, the spatial scale, and the labeling threshold were also examined. The thematic errors caused by positional errors were analyzed using the current level of geo-registration accuracy achieved by several global land cover mapping projects. The primary results demonstrate that using a single pixel as an assessment unit introduces a significant amount of thematic error. In addition, the coarser the spatial scale, the greater the impact of positional errors, as most pixels in the image become mixed. A classification scheme with more classes and a more heterogeneous landscape increased the positional effect. A higher labeling threshold decreased the positional impact but greatly increased the number of abandoned units in the sample. This research shows that remote sensing applications should not employ a single pixel as the assessment unit in thematic accuracy assessment.


Introduction
Land cover maps describe the natural and human-made materials that cover the Earth's physical surface (e.g., water, forests, croplands, grasslands, and developed land) [1,2]. These maps have become essential data for studying the complex interaction between human activities and climate change [3,4]. The rapid development of satellite remote sensing has provided a variety of spatial, spectral, and temporal images for regional and global land cover mapping [5,6]. Many successful land cover products have been published, such as the National Land Cover Database (NLCD) [7] covering the United States, the Global Land Cover 2000 (GLC 2000) [8], and the Global Food Security-Support Analysis Data at 30 m (GFSAD) [9]. Before such data are released for decision making, a thematic accuracy assessment should be performed to ensure their scientific validity [8,10,11].
The purpose of thematic accuracy assessment is to determine the quality of the information in the land cover map [12][13][14]. Validation has evolved into a widely accepted framework that generally consists of four components: reference data collection, sampling, generating an error matrix, and accuracy analysis. The reference labeling procedure determines the label of the reference data [14]. This procedure is necessary because, while a pixel in the per-pixel classification map has a unique class label, the same spatial unit in the reference data may consist of multiple classes. For example, suppose a pixel (e.g., at a spatial resolution of 250, 500, or 1000 m) in a Moderate Resolution Imaging Spectroradiometer (MODIS) image was classified as evergreen needle-leaved forest. The same spatial extent in the reference data (e.g., a Landsat image) may include 55% needle-leaved forest, 30% mixed forest, and 15% grassland. The coarser the classification map, the more significant the heterogeneity of the spatial units in the reference data. Intuitively, a slight positional error could substantially change the proportions of the classes and, therefore, the reference label. To match the unique label in the classification map and also reduce the positional effect, per-pixel classification accuracy assessment usually hardens the multiple labels within a reference unit to a unique one and then compares it with that of the classification map [12,14]. The hardening process usually applies a simple majority rule, but a labeling threshold can be added to the rule to reinforce the reference unit's homogeneity [30]. The majority rule determines whether a unique class dominates; if not, the reference unit is abandoned.
For example, any reference data sample unit must contain at least 50% of a single map class or it is not used as a valid reference data sample. The labeling threshold could also be set higher or lower than 50%. If the threshold is set higher, then the reference sample units that are retained for the analysis will have greater homogeneity, but more samples will be abandoned. The converse is also true.
A higher labeling threshold potentially reduces the positional effect. However, a higher labeling threshold also leads to a larger number of abandoned assessment units, which may further affect the sampling population. A trade-off between the homogeneity to minimize the positional effect and abandoned units needs to be considered. However, how this threshold affects the balance has not been well studied.
Therefore, the primary objective of this study was to determine whether using a single pixel as an assessment unit is appropriate for thematic accuracy assessment. Positional error was assumed to be the leading factor in this determination. Other factors, including landscape characteristics, classification scheme, spatial scale, and labeling threshold, were also considered.

Simulation of Positional Errors between the Map and the Reference Data
Suppose a map is created using a classification scheme composed of C classes labeled 1, 2, . . . , C, from a remotely sensed image based on a hard classification algorithm (e.g., support vector machine or random forest [49,50]). The resulting classification map consists of N pixels, where the pixel at (x, y) has a unique label c(x,y). A sample of n pixels was randomly collected for the thematic accuracy assessment, and a higher spatial resolution image was obtained as the reference data. Due to the difference in spatial resolution, a sampled pixel in the classification map corresponds to a cluster of pixels in the reference data.
As a result of misregistration, the sampled pixel at (x, y) would use the cluster at (α0 + α1x + α2y, α3 + α4y + α5x) as reference data, where the parameters α0 through α5 describe the amount and forms of the misregistration, such as scaling, rotation, translation, and scan skewing [51]. Additionally, positional errors vary non-uniformly across the map [40]. These factors make a full simulation of positional errors complicated and impractical. Therefore, this research applied a simplified model developed by Dai and Khorram [51] and Chen et al. [52], assuming that positional errors are distributed equally within a small neighborhood. In other words, a pixel (x, y) in the sample would use the cluster at (x + ∆, y + ∆) as its reference location, where ∆ is a relative distance denoting the number of pixels offset from the original position.
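Under this simplified model, extracting a shifted reference cluster can be sketched in a few lines of Python (a minimal illustration, not the authors' code; the array layout and the function name are assumptions of this sketch):

```python
import numpy as np

def shifted_cluster(reference, x, y, ratio, delta):
    """Extract the reference cluster for map pixel (x, y) under the
    simplified positional-error model: the sample is shifted by a
    uniform offset of `delta` coarse pixels along both axes.

    reference : 2-D array of fine-resolution class labels
    x, y      : row/column of the sampled pixel in the coarse map
    ratio     : fine pixels per coarse pixel (e.g., 5 for 150 m over 30 m)
    delta     : positional error in coarse-pixel units (e.g., 0.5)
    """
    # Convert the coarse-pixel offset to whole fine pixels.
    off = int(round(delta * ratio))
    r0, c0 = x * ratio + off, y * ratio + off
    return reference[r0:r0 + ratio, c0:c0 + ratio]
```

With a 150 m map over 30 m reference data (ratio = 5), a positional error of 0.2 coarse pixels corresponds to a one-fine-pixel shift of the cluster window.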

Determination of the Reference Label
The pixel at (x, y) in the classification map has a unique label c(x,y). However, multiple classes may exist within a reference cluster. A unique cluster label was determined by Equations (1) and (2).
In Equation (1), r(x+∆, y+∆) represents the label of a cluster, which is assigned either a unique label r belonging to a non-empty set R, or the label Null if R is empty. The set R depends on Equation (2). The term P(i) in Equation (2) represents the proportion of class i within the cluster. The condition ∀i ∈ C : P(i) < P(l) returns the single dominant class l if it has a higher proportion than any other class i in C. Note that if a cluster has two dominant classes with the same percentage, no class satisfies the condition, making R empty. The condition P(l) > T requires that the dominant class's proportion exceed the labeling threshold T. As T increases, the clusters retained in the reference data become more homogeneous, reducing the possibility of a wrong label being assigned to a heterogeneous cluster because of positional errors. However, a higher T also makes R more likely to be empty. Therefore, this research also counted the number of abandoned assessment sample units (AASU), i.e., the units for which r(x+∆, y+∆) = Null, and then calculated the abandoned proportion of assessment units (APAU) by dividing the AASU by the total number of sampling units (n). A higher APAU means a larger percentage of abandoned assessment units in the sample.
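The labeling rule of Equations (1) and (2) can be sketched as follows (a minimal Python illustration; note that it tests P(l) ≥ T rather than the strict inequality, so that pure clusters still qualify at T = 100%, matching the "at least 50%" wording used earlier):

```python
from collections import Counter

def reference_label(cluster_labels, threshold):
    """Equations (1)-(2): return the single dominant class l of a
    reference cluster if P(l) beats every other class and meets the
    threshold T; return None (Null) otherwise, in which case the
    assessment unit is abandoned."""
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    top = counts.most_common(2)
    # A tie for the dominant class makes the set R empty.
    if len(top) > 1 and top[0][1] == top[1][1]:
        return None
    label, count = top[0]
    return label if count / total >= threshold else None
```

For instance, a cluster of 55% needle-leaved forest, 30% mixed forest, and 15% grassland keeps the dominant label at T = 50% but is abandoned at T = 75%.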

Thematic Accuracy Assessment
An error matrix (Table 1) was constructed from the qualified samples. From the error matrix, the overall accuracy (OA) and Kappa can be estimated. A few researchers have questioned the value of Kappa and proposed quantity disagreement (QD) and allocation disagreement (AD) instead [53,54]. QD reflects the difference between a classification map and its reference data due to a mismatch in the proportions of the classes, while AD measures the difference caused by a mismatch in the spatial allocation of the classes [53]. It is worth noting that the sum of AD and QD equals 1 minus OA [53,54]. Despite these doubts, Kappa is still widely used in validating land cover mapping [55][56][57]. The focus of this research is not on which accuracy measure is better, but on analyzing the impact of positional accuracy when using a single pixel for thematic accuracy assessment; therefore, AD and QD were also included as accuracy measures. To calculate AD and QD, the error matrix (Table 1) was transformed into a proportional error matrix (Table 2) using Equation (3).
In Equation (3), n_ij represents the number of samples classified as i in the classification map but as class j in the reference data (Table 1). The term n_i+ denotes the number of samples classified as i. N is the total number of pixels in the classification map, while N_i is the number of pixels mapped as i.
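Equation (3) does not survive in this text. Based on the definitions in this paragraph (and on the standard estimator for samples allocated by map class), it can be reconstructed as:

```latex
p_{ij} \;=\; \frac{N_i}{N}\cdot\frac{n_{ij}}{n_{i+}} \qquad \text{(3)}
```

Each cell of the proportional error matrix (Table 2) weights the within-class sample proportion n_ij / n_i+ by the mapped area proportion N_i / N.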
This research was interested in the component of the thematic error caused by positional errors and in the APAU due to the threshold T. This component of the thematic error equals the absolute value of an accuracy measure without positional errors minus its counterpart with positional errors (Equation (4)).
In Equation (4), AM(∆=0, T=t) represents an accuracy measure (AM) without positional errors, while AM(∆=p, T=t) is the same accuracy measure with positional errors at the same threshold t. AM was replaced by OA, Kappa, QD, and AD to derive the OA-error, Kappa-error, QD-error, and AD-error, respectively. |·| denotes the absolute value operation. Whether to use a single pixel as the assessment unit depends highly on the magnitude of the AM-error and the APAU.
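Equation (4) is likewise absent from this text; from the description above, it reads:

```latex
AM\text{-}error \;=\; \bigl|\, AM_{\Delta=0,\,T=t} \;-\; AM_{\Delta=p,\,T=t} \,\bigr| \qquad \text{(4)}
```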

Study Area
Landscape characteristics were assumed to impact the positional effect. Therefore, twelve study sites representing varied landscape conditions were selected within the conterminous United States through stratified random sampling (Figure 1). The sampling procedure was performed as follows. First, a fishnet composed of 197 square grids was created entirely within the conterminous United States; the extent of each grid was 180 × 180 km, nearly the size of a Landsat scene [58]. Second, the landscape shape index (LSI), which measures the overall geometric complexity of the landscape, was calculated for each grid based on the NLCD 2016 product (level II classification scheme, Table 3) [10,59] using Fragstats v4.2, a software package widely used for computing landscape metrics from categorical maps [60]. Third, the 197 grids were stratified into seven strata by LSI: LSI < 200, 200 ≤ LSI < 300, 300 ≤ LSI < 400, 400 ≤ LSI < 500, 500 ≤ LSI < 600, 600 ≤ LSI < 700, and LSI ≥ 700. Stratified random sampling was then implemented with a sample size of 12, with the sample size within each stratum proportional to the number of grids in that stratum. Twelve study sites were considered sufficient to represent the landscape characteristics of the conterminous United States because they account for over 6% of the sampling population (197 grids) and each stratum contains at least one sample. The twelve study sites were numbered #1 to #12 in order of increasing LSI (Table 4).
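The allocation step can be sketched as follows (an illustrative Python sketch with invented stratum sizes, not the study's actual implementation; it assumes the sample size is at least the number of strata):

```python
import random

def stratified_sample(strata, n, seed=0):
    """Stratified random sampling: allocation proportional to stratum
    size, with a floor of one sample per stratum. `strata` maps a
    stratum key to its list of grids; n is the total sample size."""
    rng = random.Random(seed)
    total = sum(len(g) for g in strata.values())
    # Proportional allocation, rounded, with a floor of one per stratum.
    alloc = {k: max(1, round(n * len(g) / total)) for k, g in strata.items()}
    # Trim (or pad) the largest strata first until allocations sum to n.
    order = sorted(alloc, key=lambda k: len(strata[k]), reverse=True)
    i = 0
    while sum(alloc.values()) > n:
        k = order[i % len(order)]
        if alloc[k] > 1:
            alloc[k] -= 1
        i += 1
    while sum(alloc.values()) < n:
        alloc[order[i % len(order)]] += 1
        i += 1
    return {k: rng.sample(strata[k], alloc[k]) for k in strata}
```

With 197 grids in seven strata and n = 12, every stratum receives at least one grid and the remaining slots go to the larger strata, mirroring the proportional rule described above.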
Figure 1. Locations of the twelve study sites.

Table 3. Two levels of the classification scheme (Level II is from the NLCD 2016 product legend, and the class names ending with the (#) symbol are not contained within any of the twelve study sites). Columns: Level I Class Name, Level II Class Name.

Classification Data
The classification map of each study site was extracted from the NLCD 2016, produced from Landsat data at a spatial resolution of 30 m [61]. The classification scheme was assumed to influence the positional effect; therefore, two classification schemes were utilized in this study (Table 3). Level II represents the NLCD 2016 classification scheme, with 15 classes common to the twelve study sites [10]. The percentage of the 15 classes at each study site is shown in Table 5. The level I classification scheme was created by merging the thematic classes of level II into 8 classes (Table 3). The LSI value and the percentages of the 8 classes at each study site are shown in Tables 4 and 6, respectively. The spatial resolution was also assumed to impact the positional effect. As a result, a series of classification maps at different spatial resolutions was generated for each study site by upscaling the NLCD 2016. Window sizes of 5 × 5, 10 × 10, 20 × 20, and 30 × 30 pixels were used to create classification maps at spatial scales of 150, 300, 600, and 900 m, respectively. Figure 2 shows an example of creating a classification map at a scale of 150 m. The label of each coarser pixel in the upscaled classification map was aggregated using the majority rule; that is, the dominant class in the window determined the label of the upscaled pixel. If there was more than one dominant class within the window, the upscaled pixel was labeled "unclassified". This upscaling method was applied to each study site for both levels of the classification scheme.
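The majority-rule aggregation can be sketched as follows (a simplified Python illustration; the window size w corresponds to the 5-30-pixel windows above, and the numeric "unclassified" sentinel is an assumption of this sketch):

```python
import numpy as np
from collections import Counter

def upscale_majority(classes, w, unclassified=-1):
    """Aggregate a fine-resolution class map into coarse pixels of
    w x w fine pixels using the majority rule; windows with more than
    one dominant class become `unclassified`."""
    rows, cols = classes.shape[0] // w, classes.shape[1] // w
    out = np.empty((rows, cols), dtype=classes.dtype)
    for r in range(rows):
        for c in range(cols):
            window = classes[r * w:(r + 1) * w, c * w:(c + 1) * w].ravel()
            top = Counter(window.tolist()).most_common(2)
            if len(top) > 1 and top[0][1] == top[1][1]:
                out[r, c] = unclassified  # tied dominant classes
            else:
                out[r, c] = top[0][0]
    return out
```

Aggregating a 30 m map with w = 5 produces the 150 m map in Figure 2; w = 10, 20, and 30 produce the 300, 600, and 900 m maps.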

Reference Data
The NLCD 2016 product [10] at the spatial resolution of 30 m for the two classification scheme levels, without the introduction of any positional error, was used as the reference data for each study site. Since the same data were used as the map and the reference data in this study, the accuracy between the two data sets is 100% before the simulation of positional error. In this way, thematic errors not caused by positional shifts were completely controlled.

Accuracy Assessment
In the simulation analysis, the positional error ∆ was varied from 0 to 2 pixels in increments of 0.1 pixels. The positional error model was applied to every pixel in the classification map and its corresponding cluster in the reference data. Different labeling thresholds (T) were used to determine the label of a cluster in the reference data: 0%, 25%, 50%, 75%, and 100%. The threshold of 0% simulates the scenario in which the simple majority rule alone determines the label of a reference cluster [62]. Figure 2 uses the upper-left classified pixel as an example of determining a reference label from the multiple classes within a cluster under positional errors and various labeling thresholds. The spatial resolution of the classification map is 150 m, while that of the reference map is 30 m. The upper-left pixel in the classification map was classified as non-vegetation, but its reference cluster contained multiple classes. If the positional error (∆) was set to 0 pixels, the proportions of vegetation, non-vegetation, and water within the reference cluster would be 4%, 68%, and 28%, respectively; non-vegetation is therefore the dominant class. The reference label was determined as non-vegetation because its proportion in the reference cluster exceeds the labeling threshold (T). If the proportion of the dominant class is less than the labeling threshold (T), the reference label is specified as null. In this example, the reference cluster was labeled non-vegetation when the threshold was set to 0%, 25%, or 50%, and null when the threshold was 75% or 100%. A sampling unit (reference cluster) labeled null is abandoned. The same calculation was applied to the reference cluster when the positional error was 0.1 and 0.2 pixels, respectively.
The labeling result was the same when the positional error was 0.1 pixels, although this amount of positional error slightly changed the class proportions. The result was completely different when the positional error reached 0.2 pixels because the dominant class became water, with a proportion of 44%. Consequently, if a labeling threshold of 0% or 25% was applied, the reference cluster was labeled as water; otherwise, it was labeled as null and abandoned.
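The Figure 2 example can be replayed numerically (a hedged sketch: the 44% water share at 0.2 pixels is stated in the text, but the other two shifted shares are illustrative assumptions):

```python
def label_with_threshold(proportions, threshold):
    """Dominant-class labeling with a threshold T: return the dominant
    class if its proportion is at least T, otherwise None (abandoned).
    `proportions` maps class name -> fraction of the reference cluster."""
    dominant = max(proportions, key=proportions.get)
    ties = [c for c, p in proportions.items() if p == proportions[dominant]]
    if len(ties) > 1:
        return None  # two dominant classes -> unit abandoned
    return dominant if proportions[dominant] >= threshold else None

# Proportions from the Figure 2 example at 0 pixels of positional error,
# and assumed proportions at 0.2 pixels (water = 44% per the text):
no_shift = {"vegetation": 0.04, "non-vegetation": 0.68, "water": 0.28}
shifted = {"vegetation": 0.16, "non-vegetation": 0.40, "water": 0.44}

for t in (0.0, 0.25, 0.50, 0.75, 1.00):
    print(t, label_with_threshold(no_shift, t), label_with_threshold(shifted, t))
```

Without the shift, the unit keeps the label non-vegetation up to T = 50%; with the 0.2-pixel shift, it is labeled water only at T = 0% or 25% and abandoned at higher thresholds.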
To avoid sampling errors, all pixels in the classification map except those labeled as "unclassified" were included in the thematic accuracy assessment. Because the map and the reference data were both generated from the NLCD 2016, the thematic errors shown in the error matrix resulted only from positional errors. The OA-error, Kappa-error, QD-error, AD-error, and APAU were recorded between each pair of classification map and reference data for each of the 12 study sites.

Results

Figure 3 shows the mean and standard deviation of the OA-error, Kappa-error, AD-error, and APAU (abandoned proportion of assessment units) of the 12 study sites at the four spatial scales using the eight-class classification scheme. The QD-error is not presented because its values are zero regardless of the positional errors, spatial scales, and thresholds: simulating positional errors between a classification map and its reference data does not alter the proportions among the classes. Figure 3 consists of twenty sub-figures, divided into five groups by rows or four groups by columns. Each row represents a specific labeling threshold (T). The first three columns denote the OA-error, Kappa-error, and AD-error, respectively, while the last column shows the APAU. Four curves in each sub-figure represent the different spatial scales. Among the results found in Figure 3: when the positional error is 0.5 pixels and T is no more than 50%, the Kappa-errors, AD-errors, and OA-errors all exceed 10%. When the threshold is 75%, the AD-errors and OA-errors lie between 3.69% and 4.72%, while the Kappa-errors lie between 8.65% and 9.60%; nevertheless, over 23.73% of the assessment units were abandoned, with a maximum of approximately 42.98%. If T is 100%, the Kappa-errors, AD-errors, and OA-errors drop to 0%, but the APAU exceeds 50.72% and reaches as high as 87.90%.

The results of Figure 3 also hold in Figure 4, where the classification scheme consists of 15 classes. The 15-class scheme produces higher Kappa-errors, AD-errors, OA-errors, and APAU than the 8-class scheme.

The per-site results (Figure 5, eight-class scheme) are detailed below. (1) The upper-left sub-figure shows the OA-errors resulting from positional errors ranging from 0 to 2.0 pixels when T is 0%. Generally, the curve of a study site with a smaller LSI lies below that of a study site with a larger LSI. For example, the line of study site #5, with an LSI of 367.1, is below that of study site #12, which has an LSI of 438.5. However, this does not hold for all study sites: study site #1 has the minimum LSI (Table 4), yet its line is above that of study site #5. (2) OA-errors drop as T grows from 0% to 100%. For example, most OA-errors of the twelve study sites at a positional error of 0.5 pixels are higher than 10% if T is no more than 50%. When T reaches 75%, the OA-errors are reduced to under 10%. If T exceeds 75%, the OA-errors drop to 0%. (3) The same patterns were found using the Kappa-errors and AD-errors. However, Kappa-errors are more sensitive to positional errors than OA-errors: with 2 pixels of positional error, all OA-errors are below 40%, whereas the Kappa-errors approach 70%. (4) The values of APAU remain steady compared to the OA-error, Kappa-error, and AD-error. APAU approaches 0% at the thresholds of 0% and 25%. However, APAU lies between 1.12% and 14.55% when T reaches 50%, and between 16.73% and 53.69% when T reaches 75%. When T is 100%, the APAU varies between 49.31% and 87.17%.

The patterns found in Figure 6 are similar to those of Figure 5. The only difference is that each line in Figure 6 is higher than the corresponding one in Figure 5 because the classification scheme consists of more classes (15 instead of 8).


Discussion
Thematic accuracy assessment aims to measure the classification accuracy of land cover products. However, many uncertainties in the validation procedure can propagate into the error matrix and therefore make the accuracy measures misleading [24,26]. This research examined whether a single pixel is appropriate as an assessment unit for validating land cover mapping from the perspective of the impacts of positional errors. Twelve study sites with different landscape characteristics, each covering a spatial extent of 180 km × 180 km, were investigated by comparing the NLCD 2016 as the reference data with several coarser classification maps generated from the NLCD 2016 at two classification scheme levels. The results presented the errors in the thematic accuracy measures (overall accuracy, Kappa, allocation disagreement, and quantity disagreement) impacted only by the positional error.
The results showed that overall accuracy, Kappa, and allocation disagreement are very sensitive to positional errors. However, no errors existed in quantity disagreement because neither generating coarser classification maps from the NLCD 2016 at 30 m nor simulating positional errors altered the proportions among the classes. As a result, the OA-errors are equal to the AD-errors, and the following analysis therefore focused on the OA-errors and Kappa-errors. Kappa-errors are larger than OA-errors for the same amount of positional error because Kappa, unlike overall accuracy, incorporates the off-diagonal elements of the error matrix (Table 1) into the calculation.
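The relationship among the four measures can be checked numerically (a sketch following the standard definitions of QD and AD for a proportional error matrix; the matrix values below are invented for illustration):

```python
import numpy as np

def accuracy_measures(p):
    """Compute OA, Kappa, quantity disagreement (QD), and allocation
    disagreement (AD) from a proportional error matrix p (rows = map,
    columns = reference; entries sum to 1)."""
    p = np.asarray(p, dtype=float)
    row, col, diag = p.sum(axis=1), p.sum(axis=0), np.diag(p)
    oa = diag.sum()                                 # overall accuracy
    pe = (row * col).sum()                          # chance agreement
    kappa = (oa - pe) / (1 - pe)
    qd = 0.5 * np.abs(row - col).sum()              # quantity disagreement
    ad = np.minimum(row - diag, col - diag).sum()   # allocation disagreement
    return oa, kappa, qd, ad
```

For any proportional matrix, qd + ad equals 1 − oa; when QD is zero, as in this simulation, the OA-errors and AD-errors therefore coincide.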
Previous studies have not taken the labeling threshold (T) into account; our research therefore compares directly with these studies when T = 0% (the simple majority rule). Using only the majority rule, Powell et al. [27] indicated that over 30% of the thematic error was attributable to one pixel of misregistration when using Landsat TM as classification data. Stehman and Wickham [30] reported a conservative bias of 8% to 24% in overall accuracy due to one pixel of positional error. With the same amount of positional error, this research found that the OA-errors vary from 20.05% to 23.21% and from 27.72% to 32.00% (Figures 3 and 4) using classification schemes of 8 and 15 classes, respectively. The difference between these previous studies and this research results from two factors. First, previous studies performed their experiments at the spatial scale of 30 m, while this study covered four spatial scales. Second, they analyzed the effect on the accuracy measures when classification and positional errors existed simultaneously. In contrast, the thematic errors in this research resulted only from positional errors, which is more explicit.
This research also demonstrated that a classification map exhibiting more heterogeneous landscape characteristics (Figures 5 and 6) or a classification scheme with more classes (Figure 4 vs. Figure 3) increases the positional effect. These findings are consistent with the results of Gu et al. [26]. A coarser spatial scale results in a greater positional effect (Figure 3 or Figure 4) because of a higher proportion of mixed pixels, as shown in Figure 7. The average error lines and associated error bars tend to overlap each other, especially at the scales of 600 and 900 m. The underlying reason is that over 50% of the pixels in half of the study sites are already mixed at the spatial scale of 150 m, and the mixed proportion increases with higher heterogeneity and coarser scale (Figure 7). In other words, the effect of the spatial scale increases as it becomes coarser because most pixels in the classification map are mixed.

Increasing the labeling threshold, T, reduces the errors in the accuracy measures; however, it increases the proportion of sample units that are abandoned (APAU) (Figures 3-6). Because half-pixel geo-registration accuracy has been achieved and reported in most moderate resolution remote sensing applications [44,45], this research emphasized the thematic errors at a positional error of 0.5 pixels. The OA-errors and Kappa-errors are above 10% at a positional error of 0.5 pixels if T is no more than 50%. They decrease to below 10%, and even to 0%, if T exceeds 50%; however, over 30% of the assessment units were abandoned at the three coarser scales (Figure 3), or in nine study sites (Figure 5), when the 8-class map was used. The abandoned proportion exceeds 60% when T is 100%, and this phenomenon is more severe with the 15-class map (Figures 4 and 6). These results demonstrate that the labeling threshold is not a workable remedy for thematic accuracy assessment using a single pixel as the assessment unit.
Global land cover mapping has favored pixel-based classification and the use of a single pixel as an assessment unit [63][64][65]. Table 7 shows several common global land cover products created at spatial resolutions ranging from 300 to 1000 m. The achieved positional accuracy depends highly on the remote sensing sensor [63,66,67]. The global land cover datasets of IGBP, UMD, and GLC 2000 were created at a spatial resolution of 1000 m and validated using Landsat TM and SPOT images as reference data [24,64]. The positional accuracy achieved was 1 pixel, 1 pixel, and 0.3-0.47 pixels, respectively, relative to their spatial resolutions. The corresponding average OA-error would be 31.92%, 31.92%, and 15.06-18.92%, according to the data (16 classes, 900 m) shown in Figure 4, and the average Kappa-error would be 49.56%, 49.56%, and 23.08-29.07%. The advent of medium-resolution sensors such as MODIS and the MEdium Resolution Imaging Spectrometer (MERIS) has allowed researchers to map the Earth's surface at a spatial scale of 500 or 300 m [68,69]. Meanwhile, these sensors have achieved sub-pixel geolocation accuracy [70]. This research analyzed the potential errors in the accuracy measures using the positional accuracy of MCD12 because it has the highest positional accuracy (0.1-0.2 pixels) at the spatial scale of 500 m. Even so, the OA-errors and Kappa-errors would vary on average from 5.5% to 10.4% and from 8.21% to 15.57%, respectively, according to the data (16 classes, 600 m) in Figure 4. The potential average OA-error and Kappa-error for GlobCover would be 13.56% and 19.71%, respectively, according to Figure 4 (16 classes, 300 m). It is worth noting that these global land cover products contain more classes, which means the actual thematic errors would be even more severe. Therefore, from the perspective of the positional effect, using a single pixel as an assessment unit is not appropriate for thematic accuracy assessment.
The errors in the accuracy measures in this research were induced only by positional errors. In reality, however, there are other sources of error, such as sampling and interpretation errors, that would add to the uncertainties evident in the error matrix [12,18,24,27]. This combined effect would further strengthen our conclusion.
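The mechanism discussed above, in which positional error alone degrades the accuracy measures, can be illustrated with a small simulation: validate a hypothetically perfect map against a reference that is misregistered by one pixel, so that every disagreement is pure positional error. The following sketch is illustrative only; the Voronoi-style synthetic landscape, the 4-class scheme, and all parameter values are assumptions, not the study's actual data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical landscape: 200 x 200 pixels, 4 classes, grown from random seed
# points (nearest-seed labeling) so that classes form spatially coherent patches.
n, k, n_seeds = 200, 4, 60
ys, xs = np.mgrid[0:n, 0:n]
sy, sx = rng.integers(0, n, n_seeds), rng.integers(0, n, n_seeds)
seed_class = rng.integers(0, k, n_seeds)
dist = (ys[..., None] - sy) ** 2 + (xs[..., None] - sx) ** 2
truth = seed_class[dist.argmin(-1)]          # the reference map

def kappa(a, b):
    """Cohen's kappa between two label maps."""
    cm = np.zeros((k, k))
    np.add.at(cm, (a.ravel(), b.ravel()), 1)  # confusion matrix
    cm /= cm.sum()
    pe = (cm.sum(0) * cm.sum(1)).sum()        # chance agreement
    return (np.trace(cm) - pe) / (1 - pe)

# Validate a "perfect" map against a reference shifted by one pixel:
# every disagreement is caused purely by the positional error.
shifted = np.roll(truth, (1, 0), axis=(0, 1))
oa_error = 1.0 - (truth == shifted).mean()
kappa_error = 1.0 - kappa(truth, shifted)
print(f"OA-error: {oa_error:.1%}, Kappa-error: {kappa_error:.1%}")
```

Since Kappa-error equals OA-error divided by (1 - chance agreement), it always exceeds OA-error, consistent with the larger Kappa-error values reported above.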
The choices of an assessment unit for the per-pixel classification accuracy assessment used by remote sensing analysts include a single pixel, a cluster of pixels (e.g., 3 × 3 pixels), and a polygon [12]. This research complemented the work of [27,30] by using twelve study sites with various landscape characteristics under multiple cofactors. It further confirmed that even if the image achieves half-pixel registration accuracy, choosing a single pixel as the assessment unit is an inferior choice. However, using a cluster or polygon may raise new questions, such as how to sample, compare, and then present the error matrix [30], which need further research.
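The cluster alternative, together with the labeling threshold discussed in this paper, can be sketched as follows: each non-overlapping 3 × 3 block is labeled with its majority class, and blocks whose majority fraction falls below a threshold are abandoned. The `block_labels` helper and the toy two-class map are illustrative assumptions, not the study's actual procedure.

```python
import numpy as np

def block_labels(img, size=3, threshold=0.5):
    """Label each non-overlapping size x size block with its majority class;
    blocks whose majority fraction is below `threshold` are abandoned (-1)."""
    h, w = (s // size * size for s in img.shape)
    blocks = img[:h, :w].reshape(h // size, size, w // size, size).swapaxes(1, 2)
    blocks = blocks.reshape(h // size, w // size, size * size)
    labels = np.full(blocks.shape[:2], -1)
    for i in range(blocks.shape[0]):
        for j in range(blocks.shape[1]):
            vals, counts = np.unique(blocks[i, j], return_counts=True)
            if counts.max() / (size * size) >= threshold:
                labels[i, j] = vals[counts.argmax()]
    return labels

# Toy two-class map with a vertical class boundary at column 7.
truth = np.zeros((12, 12), dtype=int)
truth[:, 7:] = 1
print(block_labels(truth, threshold=0.5))  # all units receive a label
print(block_labels(truth, threshold=0.7))  # mixed boundary units abandoned (-1)
```

Raising the threshold discards exactly the mixed boundary units that are most sensitive to misregistration, which mirrors the trade-off reported in this study: less positional impact, but more abandoned sample units.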
This research also has several limitations. First, we assumed a uniform spatial distribution for positional errors. Future work should consider different forms of geometric distortion and evaluate the worst-case effect [40]. Second, the twelve study sites using the level I classification scheme were created by collapsing the classes in the stratified, randomly selected level II study sites. Therefore, while the level II sites were selected to be representative of the landscape characteristics, there was no guarantee that the sites would remain representative when collapsed to level I. However, a preliminary analysis showed that the LSI values of the twelve study sites at level I (Table 3) still cover the dynamic range according to the histogram of LSI (Figure 8). Third, this research only considered classification map spatial scales ranging from 150 to 900 m, a range limited by the spatial resolution of the NLCD 2016 reference data. However, considering the relative pixel sizes of the simulated classification map and the reference data, the conclusions could be extended to other spatial scales. Unfortunately, as the spatial resolution increases, it becomes more challenging to geo-register a pixel to the sub-pixel level due to the lack of a high-precision digital elevation model [12,71]. Therefore, this research speculates that using a single pixel as an assessment unit is also inappropriate for thematic accuracy assessment at finer spatial scales. Finally, this research only included twelve study sites within the United States, and all accuracy measures were computed at the map level; the simulations already required 1488 h of processing time on a laptop workstation with an E-2176M 6-core processor and 32 GB of memory. Future work could test these conclusions at the categorical level.
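Because the discussion of landscape heterogeneity relies on the Landscape Shape Index (LSI), a minimal raster implementation may be useful. This sketch assumes the common FRAGSTATS-style raster formulation LSI = 0.25 E / sqrt(A), where E counts all class-boundary edges (including the outer landscape boundary) and A is the number of pixels; the paper's exact computation may differ.

```python
import numpy as np

def landscape_shape_index(img):
    """Raster LSI = 0.25 * E / sqrt(A), where E is the total edge count
    (internal class boundaries plus the outer landscape boundary) and A is
    the total number of pixels. LSI = 1 for a single-class square landscape
    and grows with landscape heterogeneity."""
    h, w = img.shape
    edges = 2 * (h + w)                          # outer boundary edges
    edges += (img[:, 1:] != img[:, :-1]).sum()   # internal vertical edges
    edges += (img[1:, :] != img[:-1, :]).sum()   # internal horizontal edges
    return 0.25 * edges / np.sqrt(h * w)

print(landscape_shape_index(np.zeros((10, 10), dtype=int)))  # homogeneous: 1.0
checkerboard = np.indices((10, 10)).sum(0) % 2               # maximally mixed
print(landscape_shape_index(checkerboard))
```

A homogeneous square landscape yields the minimum LSI of 1.0, while the checkerboard (every pixel a boundary pixel) yields a much larger value, matching the intuition that higher LSI marks a more fragmented landscape.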

Conclusions
Choosing an assessment unit is a crucial component of the thematic accuracy assessment framework. There are various choices for the assessment unit, including a single pixel, a cluster of pixels, and a polygon. The main debate lies in whether a single pixel is an appropriate assessment unit for thematic accuracy assessment. This research conducted a simulation analysis from the perspective of positional errors. Other factors, including landscape characteristics, the classification scheme, spatial scale, and the labeling threshold, were also analyzed. The results showed that a single pixel is not an appropriate assessment unit for use in a thematic accuracy assessment. A classification map with a more heterogeneous landscape or a classification scheme with more classes increases the positional effect. The spatial scale has a greater impact when most pixels in the classification map are mixed. Increasing the labeling threshold reduces the positional impact; however, it increases the number of assessment units that must be abandoned. Careful consideration of the issues and analysis described in this paper will result in improved thematic accuracy assessments in the future.

Author Contributions: J.G. and R.G.C. conceived and designed the experiments. J.G. performed the experiments and analyzed the data with guidance from R.G.C.; J.G. wrote the paper. R.G.C. edited and finalized the paper and manuscript. All authors have read and agreed to the published version of the manuscript.
Funding: Partial funding was provided by the New Hampshire Agricultural Experiment Station. This is Scientific