How Response Designs and Class Proportions Affect the Accuracy of Validation Data

Reference data collected to validate land-cover maps are generally considered free of errors. In practice, however, they contain errors despite best efforts to minimize them. These errors propagate during accuracy assessment and tweak the validation results. For photo-interpreted reference data, the two most widely studied sources of error are systematic incorrect labeling and vigilance drops. How estimation errors, i.e., errors intrinsic to the response design, affect the accuracy of reference data is far less understood. In this paper, we analyzed the impact of estimation errors for two types of classification systems (binary and multiclass) as well as for two common response designs (point-based and partition-based) with a range of sub-sample sizes. Our quantitative results indicate that labeling errors due to proportion estimations should not be neglected. They further confirm that the accuracy of response designs depends on the class proportions within the sampling units, with complex landscapes being more prone to errors. As a result, response designs where the number of sub-samples is predefined and fixed are inefficient. To guarantee high accuracy standards of validation data with minimum data collection effort, we propose a new method to adapt the number of sub-samples for each sample during the validation process. In practice, sub-samples are incrementally selected and labeled until the estimated class proportions reach the desired level of confidence. As a result, less effort is spent on labeling univocal cases and the spared effort can be allocated to more ambiguous cases. This increases the reliability of reference data and of subsequent accuracy assessment. Across our study site, we demonstrated that such an approach could reduce the labeling effort by 50% to 75%, with greater gains in homogeneous landscapes. We contend that adopting this optimization approach will not only increase the efficiency of reference data collection, but will also help deliver more reliable accuracy estimates to the user community.


Introduction
Over the last decades, the number of openly-available geographic data sets has tremendously increased along with their use in policy making, environmental monitoring, hazard prevention and scientific studies. It is of paramount importance that their quality is rigorously evaluated to inform users about their limitations and to limit contradicting results. Good practices in accuracy assessment include recommendations about i) the sampling design that determine how many sampling units should be collected along with their locations, ii) the response design that defines the protocol for labelling each sampling unit, and iii) a rigorous estimation of accuracy using specific metrics (Congalton, 1991;Stehman and Czaplewski, 1998;Stehman, 1999Stehman, , 2001Olofsson et al., 2014). A statistically rigorous assessment is thus a combination of a probability sampling design, appropriate accuracy estimators, and a response design chosen in accordance with features of the mapping and classification process.
Uncertainties linked to the sampling design and variance of the estimators are usually well quantified, but validation methods typically assume that reference data are error-free. In fact, the process of determining the so-called "ground truth" is seldom discussed in the literature and is often considered a straightforward -yet costly-task. Nonetheless, generating authoritative reference data sets remains a major challenge in accuracy assessment and it merits greater consideration in accuracy assessment (Stehman and Foody, 2019). Errors can indeed alter the process of generating reference data and even a small amount of errors can propagate and significantly impact the accuracy assessment (Woodcock and Gopal, 2000;Foody, 2009Foody, , 2013. There is thus a need for new methods that offer better control over the quality of reference data. Practical constraints such as accessibility to the sample locations often affect the implementation of the accuracy assessment. For instance, this prompted Stehman (2001) to propose criteria for quality and statistical rigour while taking, at the same time, practical utility (accessibility and reduced costs) into account. Another alternative is to replace ground observations with photo-interpreted very high-resolution images. Photo-interpretation by a group of experts having regional knowledge is generally seen as the gold standard for reference data collection, especially when dealing with largearea thematic products. Photo-interpretation is not perfect -it typically reaches 80% accuracy-and it varies considerably among operators (with accuracy levels ranging from 11% up to 100% Van Coillie et al., 2014). Errors (e) affecting photo-interpreted samples can be divided in three categories: vigilance, systematic, and estimation errors: e = e vigilance + e systematic + e estimation (1) Vigilance errors, i.e., loss of performance after performing the same monotonous task over a long period, has been highlighted for a wide range of visual interpretation tasks (Parasuraman, 1986). Attitude, either optimistic or pessimistic, may determine how an operator will respond to training for vigilance (Szalma et al., 2006). Drops of vigilance, which are difficult to predict and manage for an individual interpreter, can be reduced by relying on more than one operator (Pengra et al., 2019).
Systematic errors (e systematic ) occur when a photo-interpreter is incorrectly reading the image. Reading of images and maps belongs to cartographic and visual literacy (Svatoňová, 2017). Visual literacy skills change over time and can be improved with the development of geospatial thinking. Image interpretation is a process that combines perception and cognition, both of which tend to facilitate identification (the cognitive task of identifying a pattern) and signification (the assignment of a meaning to a particular pattern (Olson, 1960;Colwell, 1965)). The types of insight derived from imagery are strongly influenced by the interpreter's expertise. Experts bring specialised knowledge, highly attuned perceptual skills and flexible reasoning abilities that novices lack (Klein and Hoffman, 1993). There is however no strong relationship between the fieldwork experience of operators and their photo-interpretation accuracy. This might be explained by the dissimilarities between, on one side, air-and spaceborne images and, on the other side, panoramic images in at least three important aspects: i) the portrayal of features from a downwards -often unfamiliar-perspective, ii) the use of wavelengths outside the visual portion of the spectrum, and iii) the depiction of the Earth's surface at unfamiliar scales and resolutions (Lillesand et al., 2008). The most capable interpreters have keen powers of observation, coupled with imagination and a great deal of patience (Lillesand et al., 2008).
Another individual factor potentially influencing image interpretation accuracy is search strategy. Compared to random search, training in systematic inspection caused higher performance (Wang et al., 1997). Maruff et al. (1999) have suggested that behavioural goals constrain the selection of visual information more than the physical characteristics of the information. This suggests that photo-interpreters with a search strategy based on previous experiences would be more successful at extracting relevant information than someone randomly searching for this information. Geographers would therefore be more successful than non-geographers during a single categorisation round of aerial photos (Lloyd et al., 2002). Accordingly, crowdsourcing (i.e., when photo-interpreters are replaced by volunteers) is particularly prone to errors as it is open to anyone, regardless of the level of expertise of the volunteers. Systematic errors can thus largely be avoided by providing training to photo-interpreters, selecting operators with local knowledge, and relying on multiple contributors (Pengra et al., 2019).
Estimation errors (e estimation ) arise when the class proportions within sampling units are imprecise even when all labels related to the sampling units are correct. These errors stem from three main factors: the number of sub-samples to label per sampling unit, the landscape structure, and the choice of the legend. Imprecise estimates of the proportion of different classifiers for mixed or transitional classes reportedly account for the majority of disagreements among photo-interpreters (Powell et al., 2004). It has also been shown that the accuracy of the labelling as well as the accuracy of the image-based classification generally decrease when the sub-pixel heterogeneity increases (Tran et al., 2014). Contrary to the systematic and vigilance errors, there is currently no mechanism to control estimation errors. That is when best practices in quality control are implemented, i.e., e systematic = 0 and e vigilance = 0, errors in the photo-interpreted labels remain and are solely due to estimation errors. If left unchecked, estimation errors can thus bias reference data as they are intrinsically linked to the complexity of the sub-pixel landscape structure. We thus proposed that estimation errors need to be managed in the response design.
With the objective of improving the accuracy of photointerpreted reference data, we sought to untangle the intricate relationships between legends, response designs and landscape fragmentation with regards to the estimation errors. Our specific goals were i) to estimate the impact of imprecise estimation of land cover proportions on the accuracy of reference data and ii) to propose a response design that optimises the labelling effort. We particularly focused on two aspects of response designs: their structure (point-based vs. partition-based designs) and their associated labelling effort (the number of sub-samples to be labelled per sampling unit). These were studied for both binary an multi-class legends. Our main contributions can be summarised as follows: • We provide an in-depth review of the different types of response designs and their applications; • We analyse the performance of response designs for different types of legends; • We generalise case-specific results using landscape indices; • We optimise the sampling effort with adaptive response designs that leverage the confidence intervals of the estimated proportions.
To investigate estimation errors independently of other sources of error, we simulated different realisations of response designs and legends based on a 2-m land cover map of Wallonia, Belgium, that we considered as ground-truth. To isolate the effect of estimation errors, we assumed a perfect quality control system throughout the paper, i.e., there were no vigilance or systematic errors.

Types of classification system
There is no one ideal classification system and it is unlikely that one could be developed (Anderson, 1976). A large number of classification systems have therefore been developed depending on the purpose of the map and the scale of the analysis. Congalton et al. (2014) suggest that all classes of the legend should be clear determined at the beginning of the mapping projects, which is not always the case (Grekousis et al., 2015). Those classification systems can be divided into three categories: semantic legends, aggregative legends and continuous fields.
A semantic legend provides a formal description of the classes, including properties and relationships with their sub-parts. They are usually preferred to describe spatial entities at large scales like e.g., trees or buildings. Spatial entities unambiguously described by a semantic legend can in turn be used as diagnostic criteria (also referred to as classifiers in the context of classification systems; Di Gregorio, 2005) to build aggregative legends. In some cases, a semantic legend can be used to describe spatial regions instead of spatial entities (Radoux and Bogaert, 2017). In this case, the semantic refers to a specific meaning that encompasses a large set of properties and relationships, e.g., a city (land use type) or a savanna (ecosystem type). To avoid ambiguity, all classes should be precisely described by an ontology that includes a representation, a formal naming, and a definition of their properties and relations (Arvor et al., 2019).
When the spatial resolution of the image becomes coarser than the size of the spatial entities, pixels can either represent a continuous field with the proportions of all (or a subset of) the classifiers (Fernandes et al., 2004), or can be described by an aggregative categorical legend. For the latter, the proportion of spatial entities is computed and decision rules are applied to define the classes at the coarser resolution (Petit and Lambin, 2001). These decision rules of aggregative legends are either based on a majority rule or on fixed threshold values for class proportions.
Majority-based legends are the most prevalent type of aggregative legends in land cover mapping. For instance, all the first global land cover maps used of IGBP (International Geosphere-Biosphere Programme) legend (Grekousis et al., 2015). It is often implicitly assumed (without being always clearly stated) that labels correspond to the class of the prevalent spatial entity within each mapping unit. The boundary between a pure semantic legend and a majority-based legend is therefore fuzzy. The drawbacks are that (i) majority is undefined when multiple classes are equally dominant, and (ii) no information about the actual class proportions within the mapping units is conveyed to users. As an example, in the case of a legend with ten classes, a majority label could be assigned to a class covering between 10% (all classes present in the same proportion) and 100% ("pure" class) of the area of the mapping unit.
Threshold-based legends are also a widely used type of aggregative legends. They rely on a set of rules to partition the feature space of the classifiers' proportions. Those rules introduce sharp boundaries to the continuous field of the classifiers' proportions in order to obtain a limited number of categorical classes. The number of those classes usually exceeds the number of classifiers in the mapping area. Threshold-based legends should be defined with consistent classification systems such as the Land Cover Classification System (LCCS; Di Gregorio, 2005). The LCCS guarantees no overlap between classes and a full description of the possible combinations of classifiers and is considered by Grekousis et al. (2015) as the only universally applicable classification system. LCCS is used for coarse to medium resolution global land cover maps (Bartholome and Belward, 2005;Bontemps et al., 2013), but also for high resolution object-based classification (Radoux and Bogaert, 2017). They provide a larger thematic precision (more classes from the same set of classifiers) than majority-based legend, but they require to define (often arbitrary) thresholds and conditional statements to draw the boundaries between the classes. This may lead to difficult naming conventions when the feature space of classifiers is large and the rules become complex.
Binary legends are a particular case of aggregative legends that indicates whether a given classifier is present or absent inside a given pixel. Several examples include global crop/non cop , water/non water (Lamarche et al., 2017) or forest/non forest (Shimada et al., 2014) maps. The labelling is then defined by a proportion threshold. Most of the time, this threshold is set to 50%, which is then nearly equivalent to a majority-based legend for two classes, but can be unambiguously defined at the proportion of 50% (unlike the majority). There are however examples where the threshold is not 50%, e.g., 10% for a forest/non forest map in arid regions (Bastin et al., 2017).

Structures of response designs
The values of the accuracy parameters are strongly affected by the protocols implemented for the response design. This includes the choice of spatial units and how within-unit homogeneity is addressed when assigning class labels (Stehman and Wickham, 2011). Most of the time, the sampling unit is the elementary unit that is labelled according to the legend of the map. In this study, the sampling unit is a pixel and there is an agreement when the map classification and the response design converge to the same label. The studied response design is therefore crisp, as opposed to alternative response designs where the proportions of spatial entities are used to validate categorical variables, e.g., fuzzy validation. Best practices in accuracy assessment suggest that photo-interpretation should rely on images of finer resolution than the map being validated. With finer resolution data, the sampling unit could appear as heterogeneous. Three protocols could then be used to assign a reference value to those sampling units: Direct assignment: a single label is directly assigned to the sampling unit; this is the most common response design (see Perger et al., 2014); Point-based sub-sampling: the sampling unit is sub-sampled by a set of points that are individually labelled by the interpreter (see Bey et al., 2016;Bastin et al., 2017) before automated aggregation with decision rules; Partition-based sub-sampling: the sampling unit is partitioned into sub-parts that are labelled individually by the interpreter (see Bayas et al., 2017;Waldner et al., 2019) before automated aggregation with decision rules.
Direct assignment is the fastest method because a single class is assigned for each sampling unit by looking at it as a whole. However, (i) it is strongly dependent on the operator skills and its level of concentration, (ii) it is poorly inter-operable because there is no information about classifiers proportion that would help to apply other labelling rules, and (iii) the confidence of the labelling must be provided by the operator. In this study, we focused on the two designs that involve sub-samples, namely point-based and partition-based designs ( Fig. 1) because our primary assumption was that a large part of the labelling errors could be quantified from the selected response design, i.e., independently from the operators. Figure 1: Two types of response designs compared in this study: point-based and partition-based designs. Grassland are in green, impervious surfaces are in red and cropland are in yellow. Both methods would agree on a majority of grassland, as well as the direct assignment of the majority class without looking at sub-samples.

Point-based response designs
In point-based response designs, photo-interpreters label a set of points within every sampling unit. The final class is then assigned based on the proportions of the number of points for each observed category. By definition, points are dimensionless but, in practice, the photo-interpretation is limited by the spatial resolution of the reference image. Nevertheless, even if the vicinity of points provides contextual information as a part of the photo-interpretation process, the label is defined at the precise location of the point.
Randomly selecting a set of points inside each sampling unit inherits the same properties as for the sampling designs of a map as a whole. Systematic sampling is therefore often the most efficient (Stehman, 1992). However, if strong periodicity in the spatial pattern of the landscape is suspected, systematic sampling should be avoided unless sufficient information is available to avoid the phasing between spatial pattern and sampling interval (Matérn, 2013).

Partition-based response designs
Contrary to their point-based counterparts, partition-based response designs provide exhaustive coverage of the sampling unit (Fig. 1). The counterpart is that it implies a discretisation of the landscape, which could result in inaccurate labels. In practice, photo-interpreters are tasked with estimating the proportion of each class within each sub-sample, based on their entire content. The final label is automatically attributed from the estimated class proportions following a set of rules that are specific to the legend. These rules are applied in a two-step process: the first step is performed by a photo-interpreter who assigns a label to each sub-part, and the second step consists in aggregating the labels of the sub-parts to attribute a final label, which can be automated. While square sub-parts are the most widespread type of partition (Bayas et al., 2017), irregular polygons can also be used (Waldner et al., 2019). In the latter case, accurate delineation of the polygons plays a major role in the reliability of the response design.
In the case of binary classification, there are two approaches to define the sampling unit labels. The first approach, hereafter referred to as Threshold-Then-Majority (TTM), applies the labelling rule for each sub-sample level, then determines the majority label amongst sub-samples of the whole sampling unit. The second protocol, hereafter referred to as Majority-Then-Threshold (MTT), starts by determining the majority class inside the sub-samples, then applies the labelling rule with threshold at the level of the sampling unit. These methods are identical when the binary threshold is equal to 50% but differ otherwise. These two frameworks could be applied to LCCS-like multi-class legends as well. In the case of majority class legend, they both simplify into a two-stages application of the majority rule. TTM is expected to work best in fragmented landscapes and MTT for large homogeneous patches.

Generating sampling units and sub-samples from a reference land cover map
To obtain an error-free reference data set across resolutions, we considered a 2-m land cover map as ground truth ( Fig. 2; Radoux et al., 2019). The original map covers the Walloon Region, Belgium, and includes ten land cover classes. The two marginal grassland classes were merged with the agricultural grassland because of their scarcity. Using a 2-m resolution map provides unambiguous labelling of classifiers because the size of the pixels is inferior to the size of standard objects in this landscape. In case of mixed pixels located at the boundary of two spatial objects, the label was chosen based on the pixel centroid. The overall accuracy of the land cover map was estimated based on 1000 points selected by stratified probabilistic sampling. Sample units were photo-interpreted on 25 cm resolution orthophotos by two trained operators. Sampling units with conflicting or uncertain labelling were verified in situ with sub-metric positioning. The estimated overall accuracy of the 2-m land cover map was 85%, this including both thematic and geolocation errors, and it reaches 93% when geolocation errors of less than 5 meters are tolerated. Hence the map was expected to exhibit realistic land cover patterns. Henceforth, we considered the 2-m land cover map as an exact reference, i.e., spatially precise and thematically accurate.
In this experiment, pixels in the map at 360-m resolution corresponded to sampling units and their labels were considered as the target values for the different response designs. The high resolution data set was aggregated at the resolution of 360 m using the different labelling rules based on the pixel counts (with a number of 32400 2-m pixels per 360-m cell; Fig. 3). While 360 m does not correspond to the spatial resolution of any current satellites, it is similar to medium-resolution satellite such as PROBA-V, MODIS, Sentinel-3 or VIIRS and it has the advantage of being factorisable by a diversity of integers (2, 3, 4, 5, 6, 9, 12, 15, 20, . . . ), thereby allowing partitions of a large variety of size of squares.

Types of legends
Two types of aggregative legends were compared in this study: a multi-class majority-based legend and a set of three threshold-based binary legends (Fig. 2). With the majoritybased legend, the label of the most frequent class within the mapping unit was selected. There are eight classes in the 2-m map so there are also eight classes at the aggregated level. In the cases of binary legends, the map represents the presence or absence of a specific class. When pixels are larger or similar in size to the spatial objects or the spatial regions of interest, an arbitrary threshold on the proportion of the class becomes necessary to handle mixed pixels. In this study, we propose to look at maps of the forest class (broadleaved and needleleaved forests together). Three different thresholds of crown cover have been chosen, namely 10% (FAO's forest definition Food and Agriculture Organization of the United Nations, 2010), 50% (the most commonly-used threshold) and 75% (threshold used for closed canopy forests).

Methods
We sought to answer the following questions: 1. What is the accuracy of point-based and partition-based response designs for different number of sub-samples in a realistic case study?
2. How can the accuracy of response designs be predicted based on landscape structure indices?
3. How to optimise the number of sub-samples per sampling unit?
We addressed these questions in three successive steps for the four legends described in section 3.2. First, the accuracy of the various response designs was compared with simulated sampling across the study site (section 4.1). Second, we generalised the relationship between error rates and the landscape structure underlying the sampling units (section 4.2). We finally proposed a method that iteratively adds sub-samples to label until the estimated class proportions driving the labelling process reach a the desired confidence level (section 4.3).
This optimisation methods is formulated for point-based designs only, as theoretical confidence intervals are not available for partition-based designs and labels cannot be reused when the number of partition is increased. In fact, optimising the partition-based designs depends on the ability of the operator to decide the appropriate number of sub-samples. As we assumed perfect operators throughout this paper, the question of optimising partition-based designs falls beyond the scope. Nonetheless, the impact of photo-interpretation errors on the response design is discussed in section 6. 4.1. Accuracy of point-based and partition-based response designs Our approach to empirically quantify the accuracy of response designs was based on a Monte Carlo framework. For every sampling unit, we repeatedly estimated the class proportions of ground truth for a range of sub-sampling efforts. The labels of the randomly simulated response designs were then assigned using the decision rules of the four legends. The same decision rules were applied on the true proportions (i.e., computed from the 2-m reference raster) to derive the true label. For each iteration, the error rate was computed by dividing the number of disagreements between the true and the simulated labels of the sampling units, with Error rate = Number of erroneous labels Number of sampling units For the point-based response designs, the sub-sample selection was repeated 36 times. Sub-samples were selected by simple probabilistic sampling of the 2-m pixels located inside each sampling unit.
For partition-based response designs, the 36 realisations were generated by shifting the origin of the grid by six multiples of eleven pixels (22,33,44,55,66,77) in both the X and Y directions (resulting sampling units that are not completely inside the study area are discarded). Spatial resampling of the 2-m reference map was performed at intermediate spatial resolutions of 180, 120, 90, 72, 60, 36 and 30 m that respectively correspond to a partitioning in 4, 9, 16, 25, 36, 100 and 144 squares. In the TTM case, the proportion of the forest class was computed within every intermediate resolution pixel, which were then labelled as forest or non-forest according to the threshold value. Those pixels were then resampled at the spatial resolution of 360 m with a majority rule to select the final label. In the MTT case, the forest label was assigned to each sub-sample where forest was majority. The proportion of forest pixels was then computed for each 360 m pixel and the final forest/nonforest label was assigned based on the selected threshold. For the majority legend, the majority class was first identified for each sub-sample and the majority of the sub-sample labels was finally assigned to the sampling unit.

Impact of landscape fragmentation
In this section, we sought to evidence the link between landscape structure and response design in order to predict the response design accuracy in other landscapes where prior structure knowledge is available. We therefore selected two landscape metrics -one per legend type-that can be easily computed for any areal sampling unit and any scale.
For binary legends, we characterised landscapes by reporting the purity (π) of the forest class within each sampling unit, with where S forest is the area of forest (more precisely in this case, tree crown cover) inside the sampling unit, and S total is the area of the sampling unit.
For multi-class legends, we opted for the Equivalent Reference Probability ( ; Bogaert et al., 2016). Rooted in information theory, is particularly interesting because it accounts for the full set of probabilities and remains consistent with the maximum probability, unlike entropy. Given p = (p 1 , ..., p k ) the vector of the class proportions in the landscape, k the number of classes and i * the index of the dominant class, the Equivalent Reference Probability is where E[D(i||i * )] is the expected difference of information: with p i * , the proportion of the majority class. Class purity and were computed for each sampling unit based on the true proportions.
Average error rates of the response designs were estimated for the full range of possible π and values with a step of 0.05. For visualisation purposes, the error rate were smoothed by fitting local regressions (LOESS; Cleveland et al., 1992).

Local optimisation of the number of sub-samples
When collecting validation data, the structure of the landscapes covered by the sample units is generally unknown, so that the optimal number of sub-sampling units cannot be estimated a priori from relationships between accuracy and landscape fragmentation. However, the interactivity of Web 2.0 validation platforms allows us to compute on-the-fly the proportion of each class according to the available sub-samples. This third part of the study aimed to optimise the number of sub-samples needed for reaching a certain level of accuracy, resulting in an optimal response design that minimises cost and/or time constraints. Therefore, we propose to define an optimal number of sub-samples for each sampling unit based on the confidence intervals of the estimated sub-sample class proportions. Here, α was set to 0.001 to illustrate the stringent requirements of building authoritative reference data sets, and to 0.1 to illustrate the required effort for collecting reference data under constrained conditions.
In practice, the local optimisation process consisted in randomly selecting an initial set of nine sub-samples and assessing the corresponding confidence level. Sub-samples were then added one at a time until the confidence on the estimated proportions is larger than the desired confidence. For the binary legends, the confidence interval (for a given confidence level) around the estimated proportion must not include the threshold value that divides the study area in the two binary classes. For the multi-class majority legend, the confidence interval around the estimated proportions of the majority class must not include the estimated proportion of the second most frequent class.
For binary legends, a given sampling unit is correctly labelled if the estimated proportion is on the same side of the threshold value than the true proportion. In practice, the proportion of the sampled area is unknown. However, the probability of assigning the correct label can be estimated based on the estimated value of the binomial distribution. Because of the small number of points, the Clopper-Pearson exact confidence interval (CI) was used instead of the Normal approximation (Clopper and Pearson, 1934), with where CI LB is the confidence interval lower bound, CI U B is the confidence interval upper bound, n is the number of sub-sampling units, m is the number of points belonging to the majority class, α is the percent chance of making a Type I error, and 1 − α is the confidence level.
Exact confidence intervals are not available for multinomial cases. Following Goodman (1965), simultaneous confidence interval estimates were selected because preliminary tests revealed that, in a binomial case, it provides a closer match to the Clopper-Pearson interval than other alternatives. For a multinomial distribution p, Goodman's simultaneous confidence interval for the i th class is given by where b = χ 2 1−α/k (1), the 1 − α/k quantile of the chi-square distribution with one degree of freedom.
In some cases, e.g., where the observed class proportion is equal to the arbitrary threshold in a binary classification or when several classes have the same proportion in the case of majority rule, the number of points to meet the required confidence could grow infinitely. Therefore, the maximum number of points was arbitrarily set to 144. This process was repeated 25 times in order to compare the theoretical confidence levels with the observed accuracy and to estimate the average number of sub-samples needed for each sampled area.

Impact of response design and sampling effort on accuracy
of the labels Overall, results highlight the relatively large uncertainty linked with the response designs for all types of legends in the study area. In addition, the average error rate is not only linked with the sampling effort, but also depends on the combination of the legend and the type of response design (Fig. 4).
For any sub-sample size, the most reliable labels are obtained for the binary legend with a threshold at 50% for both partition-based and point-based response designs. For the other legends, the ranking of the ease of validation differs across response designs. For instance, the second most consistent labelling is obtained for the majority legend with a partitionbased design, while the binary legend with 10% threshold ranks second for point-based validation. For the same legend, the partition-based response design performs poorly, with an error rate of 12% for 25 sub-samples.
The average error rates of point-based designs markedly decreases between 4 and 100 sub-samples (Fig. 4a). This trend is observed across the four legends. With only four points, error rates are > 15%. The error rates then drop to < 2.5% for sub-sample sizes larger than 100 in the case of threshold-based legends. The decreasing error rate with respect to the sampling effort is also observed for the majority legend, but the improvement is smaller (still 6% error with 100 sub-samples). In comparison, the other binary legends provide more correct labels (4% error with 100 sub-samples for the 75% binary legend), with the most accurate labelling obtained from the binary 50% legend (3.5% error).
The two types of partition-based response designs exhibit an opposite behaviour for the binary thresholds of 10% and 75%. In those two cases, the error rates of a perfect operator increase in TTM (but decrease in MTT) for increasing numbers of sub-samples. The binary legend at 50% yielded similar results for MTT and TTM, with slightly better results from the TTM approach. It reaches 98.4% accuracy with 25 sub-samples. The majority-based legend fails to generate labels with less than 3% error when using less than 144 sub-samples, and achieves less than 5% errors starting from 25 sub-samples (Fig. 4b).
The results of the case study show that the most efficient response design depends on the legend of the map. Given the spatial resolution of the sampling units and the relatively fragmented landscape of the study area, the partition-based response design outperformed point-based response design for the majority legend. With the latter, 25 sub-samples were necessary to achieve 95% of accuracy. The validation effort required for binary legends depended on the threshold value. The least effort was required with a threshold of 50% and a partitionbased model (95.2% with only 4 sub-samples). On the contrary, point-based response design outperformed partition-based response designs for the 10% threshold. This legend was the most difficult to validate in the study area -36 sub-samples were needed to reach at least 95% accuracy.

Relationship between sampling unit heterogeneity and accuracy
Heterogeneity indices allow us to generalise the overall error rates estimated on the study area. The selected heterogeneity indices, which are independent of the landscape and spatial resolution, highlight the peaks of the labelling uncertainty and the sampling units where the label can be trusted. The error rates are strongly related to the heterogeneity indices of the sampling units for both binary (π, see Fig. 5) and majority legends ( , see Fig. 6).
In point-based designs, the error rate is maximum for sampling units with forest proportions close to the class threshold. The error distribution is slightly asymmetric, especially with small sampling efforts, with the longest tail towards the proportion of 50%. Consistently with the expression of the variance for a binomial distribution, the largest variance of the distribution of errors is observed with the 50% threshold and decreased towards the extremities of the range. In order to achieve an error rate of < 1% on average with 16 points, the actual proportion need to be different from the threshold value by circa 20%.
The error distribution of partition-based response designs also peaks near the threshold values. The results clearly indicate that the partition-based method are strongly biased when the threshold value is not 50%: with 10 and 75% thresholds, the errors rates increase from 50% towards the value of the threshold, where they are systematically wrong. This is due to the systematic omission of the class that contains the 50% interval when approaching the extremities of the range, which is the only type of error when using a partition-based response design with these threshold values. Indeed, the error rate drops to zero in terms of detection of the class that is located at one end of the interval.
With majority legends, the largest error rates are observed for sampling units with similar proportions (low value). Labelling is 99% correct when ≥ 0.5. Interestingly, point-based designs are more accurate than partition-based designs for complex landscapes. On the other hand, designs based on a limited number of partitions outperformed their point-based counterparts for simple landscapes ( >0.25). Overall, partition-based designs appear relatively insensitive to an increase above 9 in the number of sub-samples, while the impact of the number of sub-samples is obvious in the case of point-based design. This corroborates the results of overall error rates in the case study (Fig. 4).

Optimisation of the number of sub-samples
We optimised the number of sub-samples so that the resulting label reached a 90% or 99.9% confidence level for each sampling unit or the maximum number of sub-samples (144) was attained. The difference between the error rates of the 99.9% optimised and the error rate with a fixed (144) number of sub-samples was <1%.
The regions of label uncertainty highlighted in Fig. 5 and Fig. 4 are consistent with the regions that require more validation efforts (Figure 7a and b). More samples are needed when is low or π is close to the threshold value. The mean number of sub-samples then decreases quickly, especially with the binary legends, so that less than 20 points are needed for most of the range of or π when the confidence level is set to 99.9%. Furthermore, the 90% confidence level is achieved with low effort for the binary legends (Fig. 7a). The sampling effort around the minimum value for the majority-based legend remains high in comparison (Fig. 7b), which is consistent with the larger error rates observed for the point-based validation of majority-based legends.
The main difference between the shapes of the distribution of the error rates (Fig. 5) compared with the mean optimized number of points (Fig. 7a) occurs on the extreme values of π for the binary legends. Indeed, the error rate for π value of 0% for the binary legend with 10% threshold (or 100% with threshold 75%) is close to zero, but the number of subsamples needed to achieve 99.9% confidence is relatively high (75 for the 10% threshold and 40 for the 75% threshold).
In the study area, the optimisation method with a very high confidence (99.9%) could more than halve the labelling effort compared to a systematic sub-sampling of 144 sub-samples per sampling unit (Fig. 7c). It required an average of 57, 27 and 26 sub-samples for binary thresholds of 10, 50 and 75%, respectively. The majority-rule legend was the most difficult to validate, with an average of 115 points needed and the maximum (144) number of sub-samples needed in 56.1% of the sampling units.
The 90% confidence interval could be achieved for the binary legends with on average 21, 10 and 11 sub-samples for thresholds of 10, 50 and 75% respectively. With the majority-based legend, it required an average of 99 points. Those results are due to the fact that the maximum number of sub-samples (that is 144 in this study) was reached 38% of the cases.
As observed in the Fig8, Fig. 8 shows that reaching a confidence level of 90% can be achieved at a reasonable cost, but the a large confidence level is much more demanding. However, the spatial distribution of the required number of subsamples (Fig. 8) highlights the particularities of the landscape in the study area. For instance, the majority-based legend requires more points in heterogeneous areas (such as the large urban areas) while the binary legends is more demanding in open landscapes (for the 10% threshold) or closed forests (for the 75% threshold), when the actual proportions are close to the threshold values. On the other hand, the benefits of the optimisation of the number of samples is obvious on large patches where the land cover proportions are distinct from the threshold values of the legend.

Discussions
We assessed the accuracy of two main types of quantitative response designs -a grid of points and a grid of squares-based on a protocol that provided full control over the validation process. While it is well known that mixed pixels are more difficult to label than pure ones, we quantified how the labelling uncertainty increased for class proportions close to the decision boundaries of the legend. These results highlighted the difficulty of building accurate reference data sets for any combination of response design and legend. In many cases, the required number of sub-samples to reach 98% confidence level was indeed too high (more than 100 subsamples per sampling unit) to be practically implemented. When factoring in the cost of response designs with large number of sub-samples, collecting error-free reference data seems thus barely feasible. Therefore, matching the data collection effort to the available resources appears critical. In other words, there is a necessary sacrifice of the confidence on the reference data in order to achieve a rigorous quality assessment at a reasonable cost.
The efficiency of point-based and partition-based response designs differed depending on the legend type. Partition-based response designs ought to be preferred in case of majoritybased legend or binary legend when the proportion is close to 50%, while point-based response designs become more efficient when the binary legends use thresholds that are close to 0 or 100%. The ability to directly determine the class proportions inside a sampling could also help to arbitrate between the two types of partition-based response designs, MTT and TTM, because TTM is much more dependent of the operator skills than the other response designs.
The main advantage of point-based validation is the possibility to estimate the reliability of the label from the points themselves, and hence to objectively optimise the sub-sampling process without prior knowledge about the sampling units. We demonstrated that relying on a fixed number of sub-samples is inefficient because the same amount of resources is spent to label both obvious and complex sampling units. An efficient approach would reduce the effort for those easy-to-interpret cases and allocate it to label complex cases so as to increase their confidence. To this aim, we proposed to iteratively interpret sub-samples until the estimated class proportions reached the desired confidence level. Combined with advanced validation applications, such an approach computes the required number of sub-samples on the fly, thereby reducing the labelling cost in the case of obvious sampling units. We showed that, in our study area, the labelling effort could be reduced by 50% to 75% without affecting the accuracy of the labels. As a result, the labelling effort was strongly reduced across the study site and concentrated in the fragmented and ambiguous areas. In some cases, i.e., close to the legend cut-off value, the added value of labelling additional points plateaus because sampling units with proportions close to the legend definition are always uncertain. It is therefore encouraged to set a threshold on the maximum labelling effort.
An iterative optimisation approach for partition-based designs is impractical because labels could be contradictory when changing scale. Therefore, optimising partition-based designs would rather depend on subjective operator decision about the proportions she/he estimates inside each sampling unit. Nonetheless, well trained operators could be granted the ability to select a number of partitions based on their impression of the complexity of the landscape. This method is likely to work well for threshold values near 50% and could avoid extra work in simple cases, but remains sensitive to the Modifiable Area Unit Problem (MAUP) -a statistical biasing effect that occurs when arbitrary units are used to collect data such as class proportions. As described in Jelinski and Wu (1996), the MAUP applies to two types of problems which are relevant in partition-based designs. The first aspect of the MAUP is the "scale problem", where the same set of areal data is aggregated into several sets of larger areal units, with each combination leading to different data values. The second is the "zoning problem", where a given set of areal units is recombined into zones that are of the same size but located differently, again resulting in variation in data values. MAUP could be mitigated by generating partitions that correspond to actual image objects derived via segmentation (Waldner et al., 2019). Image segmentation is mainly justified in landscapes that can be divided in a small number of homogeneous patches, not in areas that are very fragmented at a larger scale than the sampling unit. However, one may loose control over the num- ber of sub-samples generated by the segmentation algorithm, leading to unpractical labelling effort. Besides, the effect of delineation errors is difficult to predict, so that the accuracy of the response design should then be assessed with external data or again rely on an estimation provided by the operator.
In this study, the reported error rates resulted only from imprecise estimation of the class proportions. There are, however, additional errors that should still be considered for a complete understanding of the response design reliability: the simplification of the pixel model which is a simplified representation of the area observed by remote sensing (Foody and Arora, 1996;Hsieh et al., 2001;Radoux et al., 2016;Fisher, 1997), geolocation errors, and photo-interpretation errors. While we assumed that operators made no errors throughout the paper, their performance is in reality imperfect (Powell et al., 2004;Vancutsem et al., 2012;See et al., 2013;Waldner et al., 2019). For instance, Powell et al. (2004) concluded that five interpreters were required to agree upon a specific class. Human factors are responsible for no less than 20% of the inter-individual differences in operator performance (Van Coillie et al., 2014). To be more realistic, errors rates should account for errors of interpretation of the landscape and, in partition-based designs, errors in estimating the area of each class. As such, the error rates reported in this study are thus lower bounds.
Sampling units of 360 m × 360 m were used because these are divisible by a large number of integers and, therefore, allowed us to easily simulate a large set of regular partitionbased designs. While this practical constraint has no direct impact on the generalisation of the results, changing the size of sampling units would, however, indirectly impact the response design accuracy. Indeed, the average purity of the sampling units increases when the ratio between the ground sampling distance and the width of the object increases (Hsieh et al., 2001). This general rule was also observed in our study area, which showed very strong relationship (R 2 > 0.99) between the pixel purity and the spatial resolution (Fig. 9). If the legend remains the same, the accuracy of the response design will likely increase for sampling units of higher spatial resolution, and the estimation errors could be neglected when the sampling units become smaller than the spatial objects of interest.
In consolidating good practices to collect gold-standard validation data, we demonstrated that the amount of sub-samples required to meet stringent confidence levels is often too large to be realistically implemented. Therefore, this work suggests three main directions for future research. First, direct class assessment by operators should be compared with sub-sampling approaches to evaluate the overall level of confidence with real photo-interpretation in both cases. Second, unbiased confusion matrices could be built to account for uncertain reference data. While the errors affecting reference data cannot be predicted, we have however shown that the probability of error could be estimated based on the sub-samples. This information could be used to quantify a large part of the uncertainty of a reference data set at no extra cost. Third, the recent advances in image recognition and computer vision suggest that computer-assisted Figure 9: Proportions of the purity index values for different spatial resolutions. The yellow line indicates pixels with a purity above 90% and the blue line shows pixels with a purity of less than 50% labelling of sub-samples could help to increase the number of sub-samples at lower cost (see Xing et al., 2018, for instance). However, algorithms would need to perform at a high level of accuracy to avoid compromising the quality standards of reference data, and the accuracy of the secondary classification algorithm should also be assessed in order to measure the reliability of this kind of response design.

Conclusions
Photo-interpreted reference data sets are generally assumed error-free but they are in reality affected by erroneous labelling due to inaccurate image interpretation, drops of vigilance and estimation errors. We argued that, contrarily to interpretation errors and drops of vigilance that could be prevented (using for instance repeated-labelling), estimation errors are intrinsic to the response design and cannot be avoided once the response design is defined. With the goal of improving good practices in reference data collection, we empirically assessed the relationship between the accuracy of reference data and the type of response design for binary and majority legends. Our results highlighted the need of a dense sub-sampling to obtain errorfree reference data set. We further evidenced that estimation errors are strongly linked to landscape composition, labelling errors being more prevalent when the class proportions are close to the class definition threshold (binary legends) or in areas with complex class compositions (majority legends). To leverage the relationship between landscape composition and labelling accuracy, we proposed to iteratively interpret sub-samples until the class proportions are estimated with the desired confidence level. By quantifying the confidence of photo-interpreted labels, this optimisation method provides an efficient trade-off between the accuracy of the reference data and the labelling cost. Therefore, its uptake by the remote-sensing community will likely result in more reliable accuracy estimates and subsequently improved assessment of the usability of thematic maps.