Accounting for Training Data Error in Machine Learning Applied to Earth Observations

Remote sensing, or Earth Observation (EO), is increasingly used to understand Earth system dynamics and create continuous and categorical maps of biophysical properties and land cover, especially based on recent advances in machine learning (ML). ML models typically require large, spatially explicit training datasets to make accurate predictions. Training data (TD) are typically generated by digitizing polygons on high spatial-resolution imagery, by collecting in situ data, or by using pre-existing datasets. TD are often assumed to accurately represent the truth, but in practice almost always contain error, stemming from (1) sample design and (2) sample collection. The latter is particularly relevant for image-interpreted TD, an increasingly common approach given its practicality and the growing training sample size requirements of modern ML algorithms. TD errors can cause substantial errors in the maps created using ML algorithms, which may affect map use and interpretation. Despite these potential errors and their real-world consequences for map-based decisions, TD error is often not accounted for or reported in EO research. Here we review current practices for collecting and handling TD. We identify the sources of TD error, illustrate their impacts using several case studies representing different EO applications (infrastructure mapping, global surface flux estimates, and agricultural monitoring), and provide guidelines for minimizing and accounting for TD errors. To harmonize terminology, we distinguish TD from three other classes of data that should be used to create and assess ML models: training reference data, used to assess the quality of TD during data generation; validation data, used to iteratively improve models; and map reference data, used only for final accuracy assessment.
We focus primarily on TD, but our advice is generally applicable to all four classes, and we ground our review in established best practices from the map accuracy assessment literature. EO researchers should start by determining the tolerable levels of map error and appropriate error metrics. Next, TD error should be minimized during sample design by choosing a representative spatio-temporal collection strategy, by using spatially and temporally relevant imagery and ancillary data sources during TD creation, and by selecting a set of legend definitions supported by the data. Furthermore, TD error can be minimized during the collection of individual samples by using consensus-based collection strategies, by directly comparing interpreted training observations against expert-generated training reference data to derive TD error metrics, and by providing image interpreters with thorough application-specific training. We strongly advise that TD error be incorporated in model outputs, either directly in bias and variance estimates or, at a minimum, by documenting the sources and implications of error. TD should be fully documented and made available via an open TD repository, allowing others to replicate and assess its use. To guide researchers in this process, we propose three tiers of TD error accounting standards. Finally, we advise researchers to clearly communicate the magnitude and impacts of TD error on map outputs, with specific consideration given to the likely map audience.


Introduction
Recent technological advancements have led to a new era in Earth observation (EO, also known as remote sensing), marked by rapid gains in our ability to map and measure features on the Earth's surface such as land cover and land use (LCLU), e.g., [1,2], vegetation cover and abundance [3], soil moisture [4], infrastructure [5,6], vegetation phenology [7][8][9], land surface albedo [10][11][12], and land surface temperature [13,14]. The resulting data are used by an expanding set of disciplines to gain new insights into socioeconomic and environmental dynamics, such as community-level poverty rates [15], changes in surface water [16] and forest cover [17], and carbon accounting [18]. As such, EO is increasingly shaping our understanding of how the world works, and how it is changing.
These breakthroughs are facilitated by several technological advances, particularly the increasing availability of moderate (5-30 m), high-resolution (1-5 m, HR), and very high resolution (<1 m, VHR) imagery, as well as new machine-learning (ML) algorithms that frequently require large, high-quality training datasets [19][20][21][22][23][24]. Large training datasets have been necessary for decades in the production of continental and global maps [1,2,25,26]. In the current data-rich era, the impact of training data (TD) quality and quantity on map accuracy is even more relevant, especially for maps generated by data-hungry ML algorithms [27][28][29][30][31][32]. Errors in these products also impact the veracity of any downstream products into which they are ingested [33]. While progress in algorithmic performance continues apace, standards regarding the collection and use of TD remain uncoordinated across researchers [34]. Additionally, much of the research and development of big data and ML is occurring in industry and the fields of computer science and (non-spatial) data science, leaving a potential knowledge gap for EO scientists [35,36].
The measurement and communication of map accuracy is a mature topic in EO and related fields, with a variety of metrics and approaches tailored to different data types, analyses, and

Characterizing Training Data Error
Due to different disciplinary lineages, terminology associated with the various datasets used to train and evaluate map algorithms is sometimes contradictory or disparate. Here we harmonize terminology by defining four distinct types of data: training, validation, training reference, and map reference. Training data (TD) refers to a sample of observations, typically consisting of points or polygons, that relate image pixels and/or objects to semantic labels. Validation data are typically a random subset of TD that are withheld from model fitting and used to tune model parameters and iteratively evaluate performance during model development. Training reference data are expert-defined exemplar observations used to assess TD errors during or after data creation. Map reference data are independent observations used to assess final map accuracy; while these may be collected using many of the same procedures as the other three datasets [57], they have more stringent design protocols and can only be used to assess the final map product, rather than being used iteratively in model or map improvement [57]. Map reference data are often referred to as the test set in ML literature [77], but we use the former term to align with the terminology commonly used by the EO community.
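As a concrete illustration of the distinction between training and validation data, the split described above might be sketched as follows (the sample structure and the 80/20 fraction are illustrative choices, not a prescribed protocol; map reference data would be drawn independently and never enter this pool):

```python
import random

def split_samples(samples, val_fraction=0.2, seed=42):
    """Split labeled samples into training and validation subsets.

    Map reference data must be collected independently of this pool and
    used only once, for the final accuracy assessment.
    """
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    validation = shuffled[:n_val]   # withheld; used to iteratively evaluate the model
    training = shuffled[n_val:]     # used to fit model parameters
    return training, validation

# Example: 100 point observations labeled by image interpretation
samples = [{"id": i, "label": "cropland" if i % 2 else "forest"} for i in range(100)]
train, val = split_samples(samples)
```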

Map Accuracy Assessment Procedures
Map accuracy assessment practices and standards are well-established in the EO literature [39,40,45,57,78]. We briefly review these procedures here because they are essential for quantifying how TD error impacts map accuracy. Additionally, the growing use of ML algorithms developed outside of EO has brought with it accuracy assessment practices and terminology that often differ nominally or substantively from those developed for EO, e.g., [57,79,80]. Reviewing EO accuracy assessment standards can, therefore, help to harmonize and improve accuracy assessment practices, while providing necessary context for procedures that can help to account for TD error.
The accuracy of a map is assessed by evaluating the agreement between the values of the mapped variables and those of a map reference variable, and summarizing those discrepancies using an accuracy metric [41,57]. The accuracy metric selected depends on whether the mapped variable is categorical or continuous, since each type of variable has its own foundation for error analysis [81][82][83][84][85]. For categorical variables, this foundation is provided by the confusion matrix, in which rows (but sometimes columns) typically list how many mapped values fall within each category and columns (but sometimes rows) list the distribution of map reference values for each category. In EO, the most widely used metrics calculated from the confusion matrix are user's accuracy (the complement of commission error), producer's accuracy (the complement of omission error), and overall accuracy (i.e., the complement of proportion error) [40]. A fuller explanation of accuracy metrics and other aspects of the error matrix can be found in existing publications [37,39,57,81,[86][87][88]. Another widely used measure in EO is the Kappa index of agreement [89], but Kappa varies with class prevalence [90] and inappropriately corrects for chance agreement [57], thus its continued use is strongly discouraged [40,57,91]. There are a number of other categorical accuracy metrics suitable for assessing the accuracy of a binary categorical variable, such as the F1 score [80], and the true skill statistic [90], which are described in the supplemental materials.
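These three metrics follow directly from the confusion matrix; a minimal sketch, assuming the convention above (rows = mapped classes, columns = reference classes):

```python
import numpy as np

def accuracy_metrics(cm):
    """Compute standard EO accuracy metrics from a confusion matrix.

    cm[i, j] = number of samples mapped as class i whose map reference
    label is class j (rows = map, columns = reference).
    """
    cm = np.asarray(cm, dtype=float)
    users = np.diag(cm) / cm.sum(axis=1)      # per class: 1 - commission error
    producers = np.diag(cm) / cm.sum(axis=0)  # per class: 1 - omission error
    overall = np.diag(cm).sum() / cm.sum()    # 1 - proportion error
    return users, producers, overall

# Toy two-class example: rows = mapped (cropland, forest), columns = reference
cm = [[40, 10],
      [5, 45]]
users, producers, overall = accuracy_metrics(cm)
# overall = 0.85; user's accuracy = (0.80, 0.90)
```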
The scatter plot provides the basis for error analysis for continuous variables, wherein deviations between the mapped values plotted on the Y-axis are measured against those of the map reference on the X-axis. Several measures are used to summarize these deviations (see supplementary materials). The root mean squared error (RMSE, also known as root mean square deviation, RMSD) and mean absolute deviation (MAD) summarize deviations along the identity line, also referred to as the 1:1 or y = x line. RMSE has widespread use, but we recommend caution since it combines MAD with variation among the deviations [92][93][94]. Another widely used measure is R², or the coefficient of determination, but this measures deviation relative to the linear regression line, rather than the y = x line [82,92].
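The distinction between the regression line and the 1:1 line can be made concrete with a short sketch (illustrative, not a prescribed toolchain): a map with a constant bias achieves a perfect R² yet has nonzero RMSE and MAD against the y = x line.

```python
import numpy as np

def continuous_metrics(mapped, reference):
    """Summarize deviations of mapped values from map reference values.

    RMSE and MAD measure deviation from the 1:1 (y = x) line; R² measures
    fit relative to the least-squares regression line, which differs.
    """
    mapped = np.asarray(mapped, dtype=float)
    reference = np.asarray(reference, dtype=float)
    dev = mapped - reference
    rmse = np.sqrt(np.mean(dev ** 2))
    mad = np.mean(np.abs(dev))
    r2 = np.corrcoef(mapped, reference)[0, 1] ** 2
    return rmse, mad, r2

# A map with a constant bias of +2: R² is perfect (1.0), yet RMSE = MAD = 2,
# illustrating why R² alone can hide systematic error against the 1:1 line.
ref = np.array([1.0, 2.0, 3.0, 4.0])
mapped = ref + 2.0
rmse, mad, r2 = continuous_metrics(mapped, ref)
```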
Beyond these, there are measures for comparing continuous mapped variables to a binary reference variable, including the receiver operating characteristic (ROC) and the total operating characteristic (TOC) [83,95,96]. The area under the curve (AUC) of an ROC/TOC plot is often used as a single measure of overall accuracy that summarizes numerous thresholds for the continuous variable [96]. There are also metrics for assessing the accuracy of object-based image analysis (OBIA, [97]), which we do not cover here (but see the supplementary information (SI)) because the choice of measure varies according to mapping objectives [65,98].
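As an illustration, ROC AUC can be computed without explicitly tracing the curve, via its rank-comparison equivalence: the probability that a randomly chosen positive case outscores a randomly chosen negative one, with ties counted as half. This is a generic sketch, not an implementation from the cited works:

```python
def roc_auc(scores, labels):
    """ROC AUC by pairwise rank comparison (ties count one half).

    scores: continuous mapped values; labels: binary reference (0 or 1).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Continuous map values (e.g., predicted water fraction) vs. binary reference
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
auc = roc_auc(scores, labels)  # 5 of 6 positive/negative pairs correctly ranked
```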
The creation of the map reference sample is an integral part of the accuracy assessment process and has two major aspects. The first of these is the design of the sample itself (i.e., the placement of sample units), which should be probability-based but can follow several different designs (e.g., simple random, stratified, cluster, systematic) depending on the application and a priori knowledge of the study area [39,57]. The second aspect is the response design, which governs the procedures for assigning values to the map reference samples [39,57]. These include the choice of the sample's spatial and temporal units, the source of data from which sample values are extracted (e.g., high-resolution imagery), and the procedure for converting reference data values into map-relevant values [39,57]. For a categorical map in which the reference data source is high-resolution imagery, the map reference sample is assigned labels corresponding to the map legend (e.g., a land-cover scheme) based on a human supervisor's interpretation of the imagery [57].
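A stratified random design, one of the probability-based options listed above, can be sketched as follows (the class strata, pixel coordinates, and per-class sample size are hypothetical):

```python
import random

def stratified_sample(pixels_by_class, n_per_class, seed=0):
    """Draw a stratified random reference sample: a fixed number of pixel
    locations per mapped class (stratum), each chosen with equal
    probability within its stratum."""
    rng = random.Random(seed)
    sample = {}
    for cls, pixels in pixels_by_class.items():
        sample[cls] = rng.sample(pixels, n_per_class)
    return sample

# Hypothetical mapped strata: lists of (row, col) pixel coordinates per class
strata = {
    "water": [(r, c) for r in range(10) for c in range(10)],
    "urban": [(r, c) for r in range(10, 30) for c in range(10)],
}
ref_sample = stratified_sample(strata, n_per_class=5)
```

Stratification guarantees that rare classes receive enough reference samples, at the cost of requiring inclusion probabilities to be accounted for when estimating area proportions.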
A key aspect of response design is that map reference data should be substantially more accurate than the map being assessed, even though they are always likely to have some uncertainty [30,39,46,47,57]. This uncertainty should be measured and factored into the accuracy assessment [39,46]. However, in practice this accounting is rarely done, while map reference data uncertainty is also rarely examined [34,38,57]. This tendency is illustrated by Ye et al. [65], who reviewed 209 journal articles focused on object-based image analysis, finding that one third gave incomplete information about the sample design and size of their map reference data, let alone any mention of error within the sample. Errors in map reference data can bias the map accuracy assessment [47,99], as well as estimates derived from the confusion matrix, such as land cover class proportions and their standard errors [46]. To correct for such impacts to map accuracy assessment, one can use published accuracy assessment procedures, including variance estimators, that account for map reference error [38,46,47]. These approaches depend on quantifying errors in the map reference data.

Current Approaches for Assessing and Accounting for Training Data Error
Most of the aforementioned considerations regarding map reference data creation largely apply to TD, particularly since map reference data and TD may often be collected together, e.g., [55], provided the former are kept strictly separate to ensure their independence [57]. Considerations regarding TD may diverge with respect to sample design, as TD often need to be collected in ways that deviate from probability-based sampling in order to satisfy algorithm-specific requirements related to, for example, class balance, representativeness, or the size of the training sample [31,51]. Another difference is that TD can be usable even with substantial error [48,50,51], although we show in Section 3 that TD error can propagate substantial map error, whereas map reference data need to have the highest possible accuracy and their uncertainty should be quantified, as described above [39,46,57].
If the quality of map reference data is often unexamined, TD quality may be even less so. To gain further insight into the level of attention TD receives in EO studies, we reviewed 30 top-ranked research papers published within the previous 10 years that describe land cover mapping studies. (Publications were identified via Google Scholar search results in January 2019, using the terms land cover and land use mapping, including permutations of spelling and punctuation; 27 articles were kept after initial screening for relevance, see Table S1 [2,63,64,).) This assessment showed that only three papers explicitly and systematically assessed the quality of the TD used in classification [2,115,122], while 16 made no mention of TD standards at all. Over 75% of these studies used image interpretation, as opposed to in situ data, for training, accuracy assessment, or both. One-quarter of these papers used unsupervised classifiers in the processing chain to outline training areas, followed by image interpretation to assign labels to the polygons/pixels. Although only a snapshot, this finding suggests that key details regarding the design and collection of TD (and even map reference data) are lacking in the EO literature.
Even though TD quality appears to be largely unreported, efforts have been made to examine how TD error can impact ML-based classifications, typically within the context of evaluating specific algorithms. For example, research examining the effectiveness of random forests [124] for land-cover classification also evaluated their sensitivity to TD error, sample size, and class imbalance [48,51,125]; similar research has been conducted for Support Vector Machines (SVM) [28,32,52]. Several studies comparing multiple ML algorithms also compared how each reacted to variations in TD sample size and/or error [50,59,126,127]. Maxwell et al. [31] touch on a number of these TD quality issues in a broader review of ML algorithms widely used in EO classification, excluding newer deep-learning approaches.
Beyond these examples, several studies have focused more explicitly on how to train ML algorithms for remote sensing classification when TD error is present. Foody et al. [30] conducted tests to examine how two different types of TD labeling error impacted land-cover classifications, with a primary interest in SVM. Similarly, Mellor et al. [48] measured uncertainty introduced by TD error in a random forest classifier, with specific focus on class imbalance and labeling errors. Swan et al. [49] examined how increasing amounts of error introduced into the TD for a deep-learning model impacted its accuracy in identifying building footprints. These studies collectively demonstrate that TD error has a substantial impact on ML-generated maps. They also reveal that there is no standard, widely accepted practice for assessing TD error which, as with map reference data error, is generally not reported, leaving TD implicitly treated as error-free [30].
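The flavor of these sensitivity experiments can be reproduced with a self-contained sketch: synthetic two-band "imagery", a nearest-centroid classifier standing in for the far more capable algorithms cited above, and label noise concentrated in one class (mimicking interpreter confusion in one direction). Even before accuracy visibly degrades, the mislabeled samples bias the fitted class signature; all data and rates here are hypothetical.

```python
import numpy as np

def fit_centroids(X, y):
    """Fit per-class mean 'spectral signatures' (a stand-in for an ML classifier)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def inject_label_noise(y, noise_rate, target_class=0, seed=0):
    """Flip a fraction of one class's labels, mimicking interpreter errors
    concentrated in a confusable class (binary labels 0/1)."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    flip = (y == target_class) & (rng.random(len(y)) < noise_rate)
    y_noisy[flip] = 1 - target_class
    return y_noisy

# Synthetic two-band "imagery": two well-separated spectral clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (200, 2)),   # class 0, e.g., water
               rng.normal(3.0, 0.3, (200, 2))])  # class 1, e.g., vegetation
y = np.repeat([0, 1], 200)

clean = fit_centroids(X, y)
noisy = fit_centroids(X, inject_label_noise(y, noise_rate=0.4))
# The class-1 signature is dragged toward class 0 by the mislabeled samples,
# shifting the decision boundary the classifier would learn.
```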

Sources and Impacts of Training Data Error
In the following two sections we describe the common causes of TD error and explore its potential impacts. To describe these causes, we divide the sources of TD error into two general classes: (1) errors stemming from the design of the training sample, including some aspects of sample and response design that are shared with standards for the collection of map reference data (see 1.2.1 above), and (2) errors made during the collection of the training sample, including additional elements of response design such as the process of digitizing and labeling points or polygons when interpreting imagery or when collecting field measurements. In addressing the impacts of error, we provide a summary of potential problems, and then two concrete case examples for illustrative purposes.

Design-Related Errors
With respect to TD sampling design, errors primarily stem from failures to adequately represent the spatial-temporal-spectral domains of the features of interest in the manner most suited to the specific ML algorithm being used [53]. This problem may be exacerbated when TD are collected exclusively using the same rigorous probability-based specifications used for map reference data, which may be overly restrictive for the purposes of TD collection. While such standards may be workable when the data set is large enough (e.g., a large benchmark data set), smaller training data sets and/or geographically sparse target classes/objects benefit strongly from the greater flexibility afforded to TD collection (e.g., purposive rather than purely probabilistic sampling). A lack of geographic representation of the phenomena of interest results in a disparity between the distribution of TD and the true distribution of the mapped phenomenon in geographic and/or feature space [28][29][30][31]. This problem is highly relevant for ML approaches, which are sensitive to TD quality, including class balance, labeling accuracy, and class comprehensiveness relative to the study area's true composition [30].
Temporal unrepresentativeness is also a common source of error in the response design of TD, due to the prevalence of image interpretation as a source for TD. In this case, error arises when obsolete imagery is interpreted to collect training points or polygons and their associated labels [39,61]. The problem is illustrated in Figure 1, which contrasts smallholder fields that are clearly visible in a satellite base map (Bing Maps) with ground data collected in 2018. Center pivot fields were installed after the base map imagery was collected, but before ground data collection, causing a temporal mismatch between the base map and the in situ data. Labels generated from the base map would therefore introduce substantial error into an ML algorithm classifying more recent imagery. New HR/VHR satellites that have more frequent acquisitions (e.g., PlanetScope [128]) can help minimize such temporal gaps for projects that are designed to map present-day conditions (e.g., 2018 land cover), but cannot solve this problem for mapping projects covering earlier time periods (i.e., before 2016). The same can be said for aerial and unmanned aerial vehicle acquisitions, which are typically limited in geographic and temporal extent [129]. While hardcopy historical maps can help supplement temporal data gaps, these data sources come with their own problems, such as errors introduced during scanning and co-registration, and unknown production standards and undocumented mapping uncertainties.
Spatial co-registration can be a substantial source of response design error when training with HR and VHR commercial satellite imagery. Due to their narrow swath widths, HR/VHR sensors are often tasked, resulting in substantially off-nadir image acquisitions [61]. Due to large view zenith angles and the lack of adequate digital elevation models, side-overlapping imagery for stereo photogrammetry, or other relevant control points, HR/VHR imagery often does not meet the same orthorectification standards as coarser resolution, government-operated satellites [130][131][132]. When integrating HR/VHR imagery acquired at different azimuth and elevation angles, features such as building roofs show offsets similar to those caused by topography. These offsets are particularly problematic (a) for training repeated mappings of the same features, and/or (b) when using an existing vector dataset such as OpenStreetMap (OSM) as TD [133][134][135].
TD collected by interpreting HR/VHR imagery is often co-registered with the coarser resolution imagery used as ML model data. This creates a potential spatial resolution conflict because the minimum mapping unit (MMU), i.e., the relationship between image objects and pixel size, may differ between the two imagery data sets. This potentially leads to situations in which objects delineated as spectrally homogeneous areas in HR/VHR imagery are part of mixed pixels in moderate- or coarse-resolution model imagery. This mismatch is similar to the concept of H-resolution versus L-resolution scene models proposed by Strahler et al. [136]; in H-resolution models, the objects of interest are substantially larger than the pixel size, and vice versa for L-resolution models. The incorporation of mixed pixels may degrade classification model performance, or at least introduce undesired spectral variability within classes [127,137,138]. This situation may be alleviated by displaying HR/VHR imagery and/or other ancillary datasets alongside the coarser model imagery during training data creation [139,140]. However, such practices may not be possible when training data are taken from previous research projects, or when they are applied in the context of time series analysis, in which spatial features change over time, e.g., [141].
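The mixed-pixel effect can be illustrated with the standard linear mixing model (the two-band endmember "spectra" and sub-pixel fractions below are hypothetical): a coarse pixel that an interpreter labels as a single class from HR/VHR imagery may in fact record an area-weighted mixture of several classes.

```python
import numpy as np

# Hypothetical endmember "spectra" (two bands) for two pure cover types
crop = np.array([0.4, 0.6])
water = np.array([0.1, 0.05])

# Sub-pixel area fractions within one coarse-resolution pixel
fractions = {"crop": 0.7, "water": 0.3}

# Linear mixing: the coarse pixel's signal is the area-weighted sum, so a
# pixel labeled "crop" from HR imagery carries a water-contaminated spectrum
mixed = fractions["crop"] * crop + fractions["water"] * water
```

Training a classifier on such pixels under the pure "crop" label injects exactly the within-class spectral variability described above.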
Similar spatial resolution and scaling issues must be dealt with when combining in situ measurements with satellite observations for continuous variables. Field-collected data often cannot practically cover the entire area of a pixel in the model data, especially for moderate or coarse-resolution imagery, and can thus induce scaling errors related to the modifiable areal unit problem [142,143]. Spatial representativeness assessments and interpolation methods are used to limit this problem for operational EO science products [144][145][146][147], but this issue is likely to be a source of error for most in situ TD samples.

Collection-Related Errors
There are several common forms of error that occur when collecting both TD and map reference data. The first of these comprises errors of interpretation [39]: mistakes made during manual image interpretation. Image interpretation is widely used to generate TD, but often leads to inconsistent labels between interpreters for the same areas of interest [34,37,99,155]. Interpreters may lack experience in the task or be unfamiliar with the context of the study area, e.g., [156]. In an unusually thorough analysis of error in image interpretation, Powell et al. [99] showed that inter-interpreter agreement averaged 86% but ranged from 46% to 92%, depending on land cover. This research, which relied on trained image interpreters, concluded that transitional land cover classes produce substantial interpretation uncertainty, which is particularly problematic since much land cover mapping effort is directed towards change detection. Another image interpretation study, using a crowdsourcing platform, found that interpreters' average accuracy in digitizing crop field boundaries in high-resolution imagery was ~80%, based on comparisons against training reference data [150]. This result held whether interpreters mapped several hundred sites or fewer than 50 (Figure 2), indicating that increased interpreter experience does not necessarily eliminate labeling error, even when analysts are highly seasoned [99]. These findings underscore the need to assess uncertainty in TD, as well as map reference data, using predefined training reference data or inter-interpreter comparisons [46,60,99,157,158].
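Inter-interpreter comparisons of the kind reported by Powell et al. can be summarized with a simple pairwise percent-agreement measure; the sketch below (with hypothetical labels) averages agreement over all interpreter pairs:

```python
from itertools import combinations

def pairwise_agreement(labels_by_interpreter):
    """Mean percent agreement across all interpreter pairs.

    labels_by_interpreter: list of equal-length label sequences, one per
    interpreter, aligned by site.
    """
    agreements = []
    for a, b in combinations(labels_by_interpreter, 2):
        agreements.append(sum(x == y for x, y in zip(a, b)) / len(a))
    return sum(agreements) / len(agreements)

# Three interpreters labeling the same five sites
interp = [
    ["crop", "forest", "crop", "urban", "crop"],
    ["crop", "forest", "crop", "crop", "crop"],
    ["crop", "shrub", "crop", "urban", "crop"],
]
agreement = pairwise_agreement(interp)  # mean of 4/5, 4/5, and 3/5
```

Low agreement flags classes (here, the transitional/confusable sites) that need tightened definitions or consensus labeling.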
Figure 2. Number of sites mapped per worker versus the average score received at reference sites, where workers' maps were compared to reference maps using a built-in accuracy assessment protocol within a crowdsourcing platform for collecting cropland data [150].
Labeling error may also result from inadequate or poorly communicated semantic class definitions [159,160], particularly when identifying human land use as opposed to biophysical land cover [161]. This is especially evident in urban environments, which exhibit high spatial and spectral heterogeneity (even within HR/VHR imagery [162]), and are also semantically vague (i.e., hard to define) even at the ground level. For example, Figure 3 shows a typical example of TD collection for mapping informal settlements (i.e., slums) in Nairobi, Kenya, in which several trained interpreters separately delineate the same area [163]. Because slums may be defined by sociodemographic factors in addition to spatial and spectral properties, TD creation for such areas is prone to error stemming from semantic issues [160]. Complex classes such as slums may exhibit high variability between study areas, as local idiosyncrasies link the definition of slums to different physical, remotely observable characteristics. These characteristics make it hard to develop a generalizable mapping capability for land uses such as informal settlements. These results further illustrate the importance of consensus mapping for image interpretation, particularly for spatially, spectrally, or temporally heterogeneous LCLU classes, which may have vague or regionally idiosyncratic semantic definitions.
Categorical mapping projects typically define a crisp set of non-overlapping categories, rather than a fuzzy set [164,165]. However, many human and natural land covers exhibit continuous gradation between classes, implying that crisp map legends will necessarily cause semantic ambiguity when image pixels in transitional areas are labeled [166,167]. This problem is particularly acute with moderate- and coarse-resolution imagery [26]. Local variance is highest when scene objects approximate the spatial dimension of the image resolution, leading to poor classification accuracy [168]. While substantial research has been devoted to the issue of mixed pixels [85,137,138,[169][170][171], crisp categories are still often relied on during the training and testing phases of image classification [172]. Alternative approaches based on fuzzy set theory are available, but have seen limited adoption [165,173]. Labeling errors can also arise if analysts are not properly trained regarding class definitions, or by the failure to capture comprehensive metadata while collecting TD in the field or during image interpretation. Lack of TD metadata is particularly problematic in the context of difficult-to-determine labeling cases, or when there is potential confusion between spectrally, spatially, or semantically/conceptually similar classes [161]. Such inadequacies limit the analysis of TD error and, therefore, the ability to account for error propagation.
Categorical mapping projects typically define a crisp set of non-overlapping categories, rather than a fuzzy set [164,165]. However, many human and natural land covers exhibit continuous gradation between classes, implying that crisp map legends will necessarily cause semantic ambiguity when image pixels in transitional areas are labeled [166,167]. This problem is particularly acute with moderate- and coarse-resolution imagery [26]. Local variance is highest when scene objects approximate the spatial dimension of the image resolution, leading to poor classification accuracy [168]. While substantial research has been devoted to the issue of mixed pixels [85,137,138,169–171], crisp categories are still often relied on during the training and testing phases of image classification [172]. Alternative approaches based on fuzzy set theory are available, but have seen limited adoption [165,173]. Labeling errors can also arise if analysts are not properly trained regarding class definitions, or by the failure to capture comprehensive metadata while collecting TD in the field or during image interpretation. Lack of TD metadata is particularly problematic in the context of difficult-to-determine labeling cases, or when there is potential confusion between spectrally, spatially, or semantically/conceptually similar classes [161]. Such inadequacies limit the analysis of TD error and, therefore, the ability to account for error propagation. Collection-related errors may be particularly acute in large-scale crowdsourcing campaigns or citizen science initiatives, which are increasingly valued for mapping projects due to their larger size and cheaper acquisition costs [22,66,150,151]. Such datasets are often collected rapidly and entail labeling many observations over a short period of time by participants who are not domain experts [153,174].
In such cases, label quality is a function of interpreter skill, experience, contextual knowledge, personal interest, and motivation for involvement in the data collection [22]. Errors can be exacerbated if interpreters are inadequately trained or unfamiliar with the study area, or lack experience with EO data and methods. For example, delineation of different classes of urban land use may be extremely difficult without the benefit of local knowledge [160]. Furthermore, image interpretation is complicated when participants are required to interpret HR/VHR satellite imagery collected over multiple sensors, on different acquisition dates, with varying quality (e.g., cloud cover percentage and atmospheric correction), and/or with varying view/sun angles [175]. Inadequate or confusing user interfaces may also lead to error [22,160]. Once crowdsourced/citizen science data have been post-processed for noise, they can be highly detailed and spatially extensive [66,69–71]. Nevertheless, quality problems in such datasets can be particularly hard to find and clean and are thus an important source of TD error that may propagate through ML algorithms into map outputs [57,151,176]. Therefore, these data should be used more cautiously than expert-derived TD.
Errors also arise in in situ TD, caused, for example, by measurement error, geolocation inaccuracy, or incorrect identification of relevant objects (e.g., vegetation species) [177]. In addition to these factors, some feature types may also be difficult to discern on the ground [30]. Aside from these problems, there are many sources of technologically induced errors, such as defects in the software or hardware of measurement devices, user input error, or calibration errors (e.g., in spectroradiometers or other equipment). However, accounting for quantitative measurement error is more straightforward than for thematic TD creation. Textbook tools to quantify measurement error are widely available, and in situ data collection procedures often include inter-analyst measurement comparison [178,179].

Impacts of Training Data Error
TD errors carry through to impact the map production process and outcomes. From a design perspective, the size and class composition of TD is particularly impactful on ML algorithms, which are susceptible to overfitting and class imbalance problems [31,73]. Additionally, the assumption of representativeness of training pixels is often overstated, and many TD may in fact not be generalizable to broader scales (discussed by Tuia et al. [154]). TD errors arising from the collection process also impact map quality. Both design- and collection-related errors may be particularly hard to discern, or quantify in absolute terms, if errors in the map reference data are unknown.
Several studies reviewed in Section 1.2.2 provide insight into how much TD error can impact ML-generated land-cover maps, focusing on aspects of sample size and balance (design-related errors) and labeling error (collection-related error). This work shows that the impact of each error source varies according to the algorithm used. For example, SVMs were relatively insensitive to changes in sample size, with accuracy dropping by only 3%-6% under TD size reductions of 85-94% [28,180]. Random forests (RF) also proved robust to TD sample size, showing slightly higher accuracy drops of 4-10+% when TD was reduced by 70-99% [48,51,180]. Sample size also impacts the certainty of RF classification by lowering the mean margin (a measure of certainty related to the number of class votes) by ~50% for sample size reductions of 95% [48]. In contrast to SVM and RF, maps classified with single decision trees are highly affected by TD size, with 13% accuracy loss for TD reductions of 85% [28], and up to 50-85% loss with TD size reductions of 50-70% [51,59]. NNs show varying responses to sample size, depending on their algorithmic design: one NN based on adaptive resonance theory showed accuracy reductions of ~30% to ~65% when TD samples were halved [59], while a feed-forward NN lost just 2% accuracy when TD was reduced by 85% [28].
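The sample-size sensitivities summarized above can be probed with a simple subsampling experiment. The sketch below is illustrative only: it uses synthetic features in place of real spectral data, and the dataset parameters and subsampling fractions are assumptions for demonstration. It repeatedly trains a random forest and a single decision tree on progressively smaller fractions of the TD and records held-out accuracy.

```python
# Illustrative sketch: measure classifier sensitivity to TD sample size by
# subsampling the training set and recording accuracy on a held-out test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for spectral features with categorical land-cover labels.
X, y = make_classification(n_samples=5000, n_features=8, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
size_results = {}
for frac in (1.0, 0.3, 0.1, 0.05):            # TD size reductions of 0-95%
    idx = rng.choice(len(X_tr), size=int(frac * len(X_tr)), replace=False)
    for name, clf in (("RF", RandomForestClassifier(random_state=0)),
                      ("tree", DecisionTreeClassifier(random_state=0))):
        size_results[(name, frac)] = clf.fit(X_tr[idx], y_tr[idx]).score(X_te, y_te)
        print(f"{name} trained on {frac:.0%} of TD: accuracy "
              f"{size_results[(name, frac)]:.3f}")
```

Consistent with the findings cited above, the accuracy gap between the full and reduced TD is typically larger for the single tree than for the ensemble, though the exact magnitudes depend on the data.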
Classifiers are also sensitive to class balance within the training data. For example, the accuracy of RF-generated maps declined by ~12% to ~23% and classification confidence fell ~25% to ~50% when TD class balances were highly skewed [48]. Notably, the ranges in these accuracy and confidence declines were attributable to differing TD sample sizes, showing the synergistic effect of sample size and class balance sensitivities. Maxwell et al. [31] provide a more comprehensive review of class imbalance for RF, SVM, NN, and k-nearest neighbors (kNN) classifiers, finding that all models were sensitive to class imbalance, but the accuracy impact was largest for rare classes, as opposed to overall map accuracy.
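A similar experiment can isolate class-balance effects. The following sketch (again with synthetic, hypothetical data and invented skew levels) holds TD size constant while shrinking the share of a minority class, then compares overall accuracy against the rare class's recall, mirroring the distinction drawn by Maxwell et al.:

```python
# Illustrative sketch: skew the class balance of a fixed-size training set and
# track rare-class recall versus overall accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=8, n_informative=5,
                           n_classes=2, weights=[0.5, 0.5], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2,
                                          stratify=y)

rng = np.random.default_rng(2)

def skewed_subset(labels, minority_frac, n=1500):
    """Index a fixed-size training subset with the given minority share."""
    n_min = int(n * minority_frac)
    idx0 = rng.choice(np.where(labels == 0)[0], n - n_min, replace=False)
    idx1 = rng.choice(np.where(labels == 1)[0], n_min, replace=False)
    return np.concatenate([idx0, idx1])

balance_results = {}
for frac in (0.5, 0.2, 0.05):
    idx = skewed_subset(y_tr, frac)
    clf = RandomForestClassifier(random_state=2).fit(X_tr[idx], y_tr[idx])
    pred = clf.predict(X_te)
    balance_results[frac] = ((pred == y_te).mean(),
                             recall_score(y_te, pred, pos_label=1))
    print(f"minority share {frac:.0%}: overall acc {balance_results[frac][0]:.3f}, "
          f"rare-class recall {balance_results[frac][1]:.3f}")
```

As in the studies above, overall accuracy can remain deceptively stable while recall on the under-represented class degrades.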
The impact of TD labeling errors, also referred to as noise, varies substantially between mapping algorithms. SVMs and closely related derivatives appear least sensitive to mislabeling. SVMs lost just 0-5% in land-cover classification accuracy when 20-30% of TD samples were mislabeled either randomly or uniformly across classes [30,52,126]. Relevance vector machines (RVMs) were even less sensitive under these conditions (2.5% accuracy loss for 20% mislabeling [30]), and an SVM designed specifically for handling noisy TD (context-sensitive semi-supervised SVM) was even more robust (2.4% reduction in kappa for 28% mislabeling [52]). However, the impact of TD noise was greater for all three models when mislabeling was confined to specific classes. SVMs lost 9% accuracy and 31% kappa when 20-28% of samples in spectrally similar classes were mislabeled [30,52]. The RVM showed a 6% accuracy loss [30], and the specialized SVM showed a 12% kappa reduction [52] under the same conditions. As with sample size, RF is the next least sensitive to TD noise [48,51]. Mislabeling 25% of TD samples reduced RF accuracy by 3-7% for a binary classifier and 7-10% for a multiclass model, with the ranges in accuracy loss also varying according to TD sample size [48]. Classification certainty was more heavily impacted by label error, dropping by 45-55%, as measured by the mean margin [48]. Other classification models showed larger impacts due to label noise, including 11-41% kappa declines for a kNN (28% label noise [52]), and 24% [126,181] and 40-43% accuracy loss for a kernel perceptron and NN, respectively, when each is trained with 30% of TD labeled incorrectly [59,126,181]. Single decision-tree models were most sensitive to label error, registering 39% to nearly 70% accuracy declines for 30% label noise [59,126,181].
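Label-noise experiments of this kind can be emulated by flipping a fraction of training labels to a random different class before fitting. The sketch below is illustrative only (synthetic data, assumed noise rates), not a reproduction of the cited studies:

```python
# Illustrative sketch: inject random label noise into the TD at increasing
# rates and measure the resulting drop in held-out accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=8, n_informative=5,
                           n_classes=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

rng = np.random.default_rng(1)
noise_results = {}
for noise in (0.0, 0.1, 0.2, 0.3):
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_noisy)) < noise
    # Replace each flipped label with a random *different* class (mod 3).
    y_noisy[flip] = (y_noisy[flip] + rng.integers(1, 3, flip.sum())) % 3
    for name, clf in (("SVM", SVC()),
                      ("RF", RandomForestClassifier(random_state=1))):
        noise_results[(name, noise)] = clf.fit(X_tr, y_noisy).score(X_te, y_te)
        print(f"{name} with {noise:.0%} label noise: accuracy "
              f"{noise_results[(name, noise)]:.3f}")
```

Restricting the flips to a specific pair of confusable classes, rather than flipping uniformly, would emulate the harsher "spectrally similar class" condition described above.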
The research described above provides substantial information on how TD error can impact the accuracy and certainty of older-generation ML classifiers. Further understanding of the consequences of these errors can be inferred from literature examining the impact of errors in map reference data. Map reference errors can substantially bias areal estimates of land-cover classes, as well as the estimation of variance in those classes, particularly when examining land-cover change [46,182,183]. While methods exist to incorporate map reference data error into map accuracy assessments and area estimates [38,46,47], and also to account for TD uncertainty in assessing classifier accuracy [48], there has been little work that shows how to address both TD and map reference error.
Less information is available regarding the ways in which TD error may propagate beyond the map it initially creates. Initial research by Estes et al. [33] examined how error propagates from a primary land-cover map into subsequent derived products. This work used a high-quality reference cropland map to quantify the errors in 1 km cropland fractions derived from existing land cover datasets and measured how these errors propagated in several map-based analyses drawing on cropland fractions for inputs. The results suggest that downstream errors were in some instances several fold larger than those in the input cropland maps (e.g., carbon stock estimates, Figure 4), whereas in other cases (e.g., evapotranspiration estimates) errors were muted. In either case, the degree to which the error magnifies or reduces in subsequent maps is difficult to anticipate, and the high likelihood that error could increase means that any conclusions based on such land cover-derived maps must be treated with caution when error propagation is not quantified. This analysis suggests how TD errors might impact the maps they generate and provides a potential method for quantifying their impacts on map accuracy.
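One general way to anticipate such downstream effects is Monte Carlo error propagation: perturb the input map according to an assumed error model and observe the bias and spread of the derived quantity. The sketch below is purely illustrative; the error model, carbon density, and pixel area are invented for demonstration and do not reproduce the Estes et al. analysis.

```python
# Illustrative sketch: Monte Carlo propagation of cropland-fraction map error
# into a derived carbon-stock estimate.
import numpy as np

rng = np.random.default_rng(42)
n_pixels, n_sims = 10_000, 1000

true_crop_frac = rng.beta(2, 5, n_pixels)   # "true" cropland fraction per pixel
carbon_density = 120.0                      # Mg C per ha of non-crop vegetation (assumed)
pixel_area_ha = 100.0                       # 1 km x 1 km pixel

def carbon_stock(crop_frac):
    """Total carbon stored in the non-cropland portion of each pixel."""
    return np.sum((1 - crop_frac) * carbon_density * pixel_area_ha)

true_stock = carbon_stock(true_crop_frac)
stocks = np.empty(n_sims)
for i in range(n_sims):
    # Perturb the fractions with an assumed error model (bias + noise),
    # clipped to the valid [0, 1] range.
    err = rng.normal(loc=-0.05, scale=0.10, size=n_pixels)
    stocks[i] = carbon_stock(np.clip(true_crop_frac + err, 0, 1))

rel_bias = (stocks.mean() - true_stock) / true_stock
print(f"relative bias in derived carbon stock: {rel_bias:+.1%}")
print(f"spread (CV) across simulations: {stocks.std() / stocks.mean():.2%}")
```

Even a modest systematic underestimate of cropland fraction produces a persistent overestimate of the derived stock, illustrating how input-map bias can amplify rather than average out downstream.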
The impact of map input errors can also be seen in the practice of using well-known standard datasets, such as the National Land Cover Database (NLCD, [184]), to map quantities of interest, such as urban tree canopy biomass.
Urban trees play a crucial role in regional carbon cycles [185–187] but are often omitted from EO studies of carbon dynamics, e.g., MODIS Net Primary Productivity [188]. As urban lands are expected to triple between 2000 and 2030 [189,190], the need to factor them into carbon accounting is pressing, but remotely mapping urban tree cover is limited by (a) spatial resolutions that are too coarse for highly variable urban landscapes and (b) TD that are often biased to forested, agricultural, and other rural landscapes. For these reasons, the Landsat-derived NLCD Percent Tree Cover (PTC) product [191], which estimates canopy cover at 30-m resolution across the US, provides a practical input for empirical models to map tree biomass. However, previous studies have shown that this product has higher uncertainty in urban areas [191] and tends to underestimate urban canopy cover compared to high-resolution datasets. Therefore, to quantify the potential impact of NLCD PTC error on canopy biomass estimates, we compared the accuracy of the NLCD PTC dataset to canopy cover estimates derived from manually digitized VHR imagery for a suburb of Washington, D.C., USA. We found that NLCD PTC underestimated canopy cover by 15.9% overall, and by 27% along forest edges (Figure 5). This discrepancy is particularly important in heterogeneous urban landscapes, where forest edges comprise a high proportion of total forest area. Scaling field data from forest plots to the entire study area yielded an estimate of 8164 Mg C stored in aboveground forest biomass based on our manually digitized canopy cover map, compared to only 5960 Mg C based on the NLCD PTC. This finding indicates the significance of these map errors for carbon accounting, as carbon storage and rates of sequestration at temperate forest edges are much larger (by 64% and 89%, respectively) than in forest interiors [192].
Quantifying errors in the NLCD is thus important for correcting subsequent estimates trained on these data.
Figure 5. Spatial variations in canopy cover (A) and uncertainty in canopy cover estimates (B) in forested and non-forested areas of the heterogeneous suburban landscape of the National Institute of Standards and Technology campus in Gaithersburg, Maryland. Percent canopy cover at a 30-m resolution from the commonly used National Land Cover Database (NLCD) Percent Canopy Cover product (and its uncertainty) is superimposed over a high-resolution map of forested areas (hollow outlined polygons) and non-forest trees (e.g., street trees; solid polygons) that were manually mapped using <1-m resolution Wayback World Imagery. Note the lower estimates of percent canopy cover along forest edges (A) and the associated higher levels of uncertainty (B) using the NLCD product.
These brief examples help illustrate the potential problems of TD error, but the range of potential impacts is as varied as the number of mapping projects underway across academic research, commercial operations, and the public sphere. To represent the growing set of remote-sensing applications in which TD error may be encountered, we present a set of case studies below. To help lay a common framework, we show a typical methods sequence for a ML-based remote-sensing analysis in Figure 6, which also helps clarify the terminology used in this paper. The figure shows the various sources and implications of error in the modeling and mapping process, beginning with issues in the data sources and sample design, and continuing through model training, validation, and ultimately map accuracy assessment.


Case Studies
To better illustrate the potential impact of TD error, we provide several case studies across different mapping applications that represent the broad range of ML-based mapping and modeling applications that rely on TD.

Incorporating Noisy Training Label Data
Automated building footprint detection is an important but difficult mapping task, with potential benefits for a wide range of applications. The following case study illustrates the use of Raster Vision (https://rastervision.io/), an open source deep learning framework, to train several models for automated building detection from high resolution imagery (Additional detail available at: https://www.azavea.com/blog/2019/08/05/noisy-labels-deep-learning/). These models perform best when trained on a large number of correctly labeled examples, usually generated by a paid team of expert labelers. An alternative, less costly approach was conducted in which a building segmentation model was trained using labels extracted from OSM. However, the labeled training polygons generated from OSM contain errors: some buildings are missing, and others are poorly aligned with the imagery or have missing details. This provides a good test case for experimentation on how noise in the labels affects the accuracy of the resulting model.
To measure the relationship between label noise and model accuracy, the amount of label noise was varied while holding all other variables constant. To do this, an off-the-shelf dataset (the SpaceNet Vegas buildings dataset) was used in place of OSM, into which label errors were systematically introduced. Missing and imprecisely drawn building errors were systematically introduced to this relatively large training dataset (~30,000 labeled buildings) (https://spacenetchallenge.github.io/datasets/spacenetBuildings-V2summary.html), and then the resulting model accuracy was measured. The experimental design consisted of two series of six datasets each, with random deletion or shift of buildings at increasing probabilities and magnitudes, respectively. For each dataset, a UNet semantic segmentation model with a ResNet18 backbone was trained using the fastai/PyTorch plugin for Raster Vision (https://github.com/azavea/raster-vision-fastai-plugin). These experiments, including data preparation and visualization, can be replicated using code at https://github.com/azavea/raster-vision-experiments/tree/master/noisy_buildings_semseg. Figure 7 shows the ground truth and predictions for a variety of scenes and noise levels, showing that the quality of the predictions decreases with the noise level. The background and central portions of buildings tend to be predicted correctly, whereas the outer periphery of buildings presented a greater challenge. These results are quantified in Figure 8, which shows F1, precision, and recall values for each of the noise levels (see Table S2 for terminology description). The precision falls more slowly than recall (and even increases for noisy drops), which is consistent with the pattern of errors observed in the prediction plots. Pixels that are predicted as building tend to be in the central portion of buildings, leading to high precision.
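The two corruption series can be emulated in a few lines of code. The sketch below is a simplified stand-in for the actual experiment code linked above: buildings are reduced to axis-aligned boxes on a raster, randomly deleted with probability p or shifted by up to m pixels; all sizes and rates are invented for illustration.

```python
# Illustrative sketch of the two corruption schemes: random deletion
# ("noisy drop") and random displacement ("noisy shift") of building labels.
import numpy as np

rng = np.random.default_rng(0)
H = W = 256
# A grid of 12x12-pixel "buildings" as (row, col, height, width) boxes.
buildings = [(r, c, 12, 12) for r in range(10, 240, 40) for c in range(10, 240, 40)]

def rasterize(boxes, shape=(H, W)):
    mask = np.zeros(shape, dtype=bool)
    for r, c, h, w in boxes:
        mask[max(r, 0):r + h, max(c, 0):c + w] = True
    return mask

def noisy_drop(boxes, p, rng):
    """Delete each building independently with probability p."""
    return [b for b in boxes if rng.random() >= p]

def noisy_shift(boxes, m, rng):
    """Displace each building by up to m pixels in each axis."""
    return [(r + rng.integers(-m, m + 1), c + rng.integers(-m, m + 1), h, w)
            for r, c, h, w in boxes]

clean = rasterize(buildings)
for p in (0.1, 0.3, 0.5):
    dropped = rasterize(noisy_drop(buildings, p, rng))
    print(f"drop p={p}: {dropped.sum() / clean.sum():.0%} of building pixels remain")
for m in (2, 5, 10):
    shifted = rasterize(noisy_shift(buildings, m, rng))
    overlap = (clean & shifted).sum() / clean.sum()
    print(f"shift m={m}: {overlap:.0%} of building pixels still overlap")
```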
In panels (A) and (B) of Figure 8, the x-axis shows the noise from randomly dropped and randomly shifted labels, respectively. Panel (C) combines the effects of noisy deletions and noisy shifts on accuracy in a single graph, showing F1 of the labels on the x-axis and F1 of the prediction on the y-axis. The F1 score of the noisy versus ground truth labels is a function of the pixel-wise errors; this metric has the benefit of measuring the effect of noise on error in a way that is comparable across datasets and object classes. For instance, a noisy shift of 10 in a dataset with large buildings might result in a different proportion of erroneous label pixels than in another dataset with small buildings. From this, panel (C) shows that while some of the shifted datasets have a greater level of noise, the prediction F1 scores are similar between the two series when the noise level is similar.
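Pixel-wise F1 against ground truth, used to place both noise series on a common scale, can be computed directly from binary masks. The snippet below is a minimal illustration with a toy mask pair (the mask sizes are arbitrary):

```python
# Minimal pixel-wise precision/recall/F1 between a label mask and ground truth.
import numpy as np

def pixel_f1(pred, truth):
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: a 10x10 label square shifted by two pixels relative to the truth.
truth = np.zeros((20, 20), dtype=bool)
truth[5:15, 5:15] = True
shifted = np.zeros_like(truth)
shifted[7:17, 7:17] = True
print(pixel_f1(shifted, truth))   # precision = recall = F1 = 0.64
```

Because the score is computed per pixel rather than per object, the same displacement yields different F1 values for large and small buildings, which is exactly the normalization panel (C) relies on.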
These results present a small step toward determining how much accuracy is sacrificed by using TD from OSM. Preliminary results indicate that accuracy decreases as noise increases, and that the model becomes more conservative as the noise level increases, only predicting central portions of buildings. Furthermore, the noisy shift experiments suggest that the relationship between noise level and accuracy is non-linear. Future work will quantify the functional form of this relationship, and how it varies with the size of the training set. Some preliminary work toward this goal has been described in Rolnick et al. [193], which focuses on image classification of ImageNet-style images.
One limitation of these results is that the magnitude of error in OSM for most areas is unknown, making it difficult to predict the effect of using OSM labels to train models in a generalized, global sense. The noise level in OSM can be estimated by measuring the disparity between OSM labels and clean labels, such as the SpaceNet labels used here, providing a local estimate of OSM noise. A more general but less rigorous approach is to roughly estimate the noise level by visually inspecting the labels in OSM, and comparing to Figure 7, which shows examples of the labels at different noise levels.

Detecting Roads from Satellite Imagery
Road networks constitute a critical geographical data layer used to assist national decision makers in resource allocation, infrastructure planning, vaccination campaigns, and disaster response, among others. However, accurate and up-to-date road networks are not available in many developing countries. High-resolution satellite imagery, paired with deep-learning methods, provides the capacity to detect and map roads at large spatial scales. This important goal, however, is dependent on availability of local high-quality TD.
To evaluate the impact of local TD availability on predicted road network accuracy, a study was carried out in Kumasi, Ghana [194]. Two datasets were used to train ML models: (1) the SpaceNet (https://spacenetchallenge.github.io/) dataset [195] in Khartoum, Sudan, and Las Vegas, USA, and (2) OSM data in Kumasi, Ghana. The SpaceNet Dataset includes high quality road labels with human expert validation, but unfortunately was not available in Kumasi, Ghana. Therefore, the latter study site relied on OSM data, consisting of crowdsourced labels with no accuracy assessment or expert validation. A series of experiments were carried out to assess the feasibility of using transfer learning, using the Raster Vision Python library for training and evaluation. For all MobileNet V2 models introduced in the following list, the image chip size was set to 300 × 300 pixels, and the training/validation split was 80/20.
The Las Vegas model was trained and validated on SpaceNet data in Las Vegas and produced very high accuracy predictions. However, when this model was used in Kumasi, it predicted very few roads, with only scattered road segments. The Khartoum model was also trained using SpaceNet data in Khartoum. The Kumasi model used Maxar WorldView-3 imagery and labels from OSM as input; OSM was used to test the quality of crowdsourced labels in training a road detection model. The Khartoum model was then fine-tuned on OSM labels in Kumasi for three different training durations of 100 K, 50 K, and 10 K steps. All models used the same hyperparameters, to isolate the role of TD in model performance.
To validate the models' performance using an independent dataset, a set of expert labels was generated over a small part of Kumasi. Figure 9 shows the region with human expert data vetting, along with the three model predictions. The Las Vegas model is excluded from this figure as it does not have any meaningful prediction in Kumasi. Quantitative performance metrics were calculated using the expert-created labels, to which the models had been blind during training. The results indicate that, as shown by Figure 9, the F1 score for roads was substantially higher for the Kumasi model (0.6458) than when using the Khartoum model (0.3780). However, by retraining and fine-tuning the Khartoum model, the F1 score for roads increased to 0.6135. The full accuracy results for this experiment are presented in Table S3, and prediction maps are shown in Figure S1.
Based on these results, it is concluded that: (1) lack of diverse TD significantly limits the geographic applicability of models, as the types, surfaces, and arrangements of roads vary substantially between regions; (2) regional training datasets are essential for the model to learn the features of roads in that region; and (3) transfer learning from a reasonably similar geography can help train models.

Global Surface Flux Estimates
Fluxes at the land-atmosphere boundary play a key role in regulating water, carbon and energy cycles. These fluxes include latent heat flux (LE), sensible heat flux (H), and gross primary production (GPP). While these fluxes cannot be measured directly from remote-sensing observations, other remotely sensed variables can be used to estimate these fluxes. Moreover, these three fluxes are highly coupled and, therefore, a coupled model is ideal.
A fully connected neural network model was developed for this purpose [196], named water, energy, and carbon with artificial neural networks (WECANN). Inputs to WECANN are remotely sensed estimates of precipitation, soil moisture, net radiation, snow water equivalent, air temperature, and solar-induced fluorescence. The target variables for training the model were derived from the outputs of global models. However, this presents a difficulty: the target variables are model outputs that can have substantial error, which will propagate into the WECANN model. To mitigate this problem, three independent estimates of each of the three fluxes (LE, H, and GPP) were retrieved from the global models. A novel statistical approach, named triple collocation (TC; Figure S2, equation S1), was then used to combine those estimates into a new dataset for training the WECANN model.
Triple collocation (TC) is a technique for estimating the unknown errors (expressed as standard deviations or RMSEs) of three mutually independent measurement systems, without treating any one system as zero-error "truth" [197]. The three measurement systems estimate a variable collocated in space and time, hence the name triple collocation. The TC error estimates are converted into selection probabilities for the three products; at each pixel and at each time step, one of the three estimates of the target variable is then randomly selected according to these probabilities to generate the TD.
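The covariance form of TC can be sketched in a few lines. This is a generic illustration using synthetic data with known noise levels, not the WECANN inputs or equation S1; the function names and test values are ours:

```python
import numpy as np

def triple_collocation_errors(x, y, z):
    """Covariance-based triple collocation: estimate the error variance of
    three collocated measurement series with mutually independent errors,
    without designating any one series as truth."""
    c = np.cov(np.vstack([x, y, z]))  # 3x3 sample covariance matrix
    var_ex = c[0, 0] - c[0, 1] * c[0, 2] / c[1, 2]
    var_ey = c[1, 1] - c[0, 1] * c[1, 2] / c[0, 2]
    var_ez = c[2, 2] - c[0, 2] * c[1, 2] / c[0, 1]
    return var_ex, var_ey, var_ez

# Synthetic check: one shared signal plus independent noise of known scale,
# so the recovered error variances should approach 0.04, 0.25, and 1.0.
rng = np.random.default_rng(0)
truth = rng.normal(size=100_000)
x = truth + rng.normal(scale=0.2, size=truth.size)
y = truth + rng.normal(scale=0.5, size=truth.size)
z = truth + rng.normal(scale=1.0, size=truth.size)
print([round(v, 2) for v in triple_collocation_errors(x, y, z)])
```

Because no product is treated as truth, the same sketch applies whether the three inputs are model outputs (as here) or independent retrievals.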

The results of WECANN model outputs were evaluated against ground measurements from global FLUXNET towers from 2007 to 2015 (Figure 10), using both the coefficient of determination and RMSE to evaluate accuracy. These show that WECANN's correlation was on average 17% higher (range 8-51%) than that of any one of the three individual inputs, while the RMSE was 21% lower (range 4-54%). These differences provide a partial quantification of the error inherent in any one of these training inputs and show that by combining them using the TC technique, we can reduce error in an ML model for predicting the fluxes at global scale. This case study illustrates a means of assessing and accounting for error in TD for cases in which these data are not created specifically for the project, but rather are pre-existing data products with potentially quite different characteristics and potentially unknown error.

Agricultural Monitoring
Two agricultural cases illustrate how TD error can impact both categorical and quantitative remotely sensed measures. The first relates to cropland mapping and is drawn from an ongoing study focused on mapping smallholder agricultural fields at high spatial resolution (3-4 m) in Ghana. The mapping method is based on "active learning", in which a random forest-based [124,198,199] ML algorithm is iteratively trained and validated by a crowdsourcing platform. This platform enlists human trainers to visually interpret and digitize field boundaries in the imagery (PlanetScope visual and near-infrared surface reflectance [128]) being classified [149,150,198]. During this process, a protocol is used to assess the accuracy of training labels, in which each worker is periodically directed to a training reference site where the field boundaries are already known but are invisible to the worker. Using these training reference sites, the interpreters' maps are then scored using a multi-dimensional accuracy assessment algorithm [150], resulting in an average TD accuracy score for each worker ranging from 0 (complete disagreement with the reference) to 1 (perfect agreement). Each label site is mapped by at least five workers, and the resulting worker-specific accuracy scores are used within a Bayesian merging algorithm to combine the five sets of labels into a single consensus label, which is then used to train the random forest classifier. Here we use the worker-specific training accuracy scores to assess the impact of label quality on map accuracy by assessing three variants of two random forest-generated maps, one over Central Ghana (~3400 km²) and one over Northern Ghana (~3100 km²). The first pair of maps was trained using labels generated by the worker with the least accurate TD, the second pair using labels from the most accurate worker, and the third pair using the consensus labels.
The accuracy of each pair of maps was then assessed against the validation set (reserved consensus labels) using the true skill statistic [90] (sensitivity + specificity − 1, with scores ranging from −1 to 1). The results show a substantial difference in accuracy between the maps trained with the least and most accurate workers' labels (Figure 11A), with the latter having 7-9% more skill than the former, while maps based on consensus labels have ~3% more skill than those trained on the most accurate worker's labels.
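The true skill statistic used here can be computed directly from binary map and reference labels. The sketch below is generic, not the case study's assessment code, and assumes {0, 1} labels with both classes present:

```python
import numpy as np

def true_skill_statistic(y_true, y_pred):
    """True skill statistic (sensitivity + specificity - 1) for binary
    {0, 1} labels; ranges from -1 to 1, with 0 indicating random skill.
    Assumes both classes occur in y_true (otherwise a rate is undefined)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1

# Toy example: sensitivity 0.75, specificity 0.5 -> TSS of 0.25.
print(true_skill_statistic([1, 1, 1, 1, 0, 0, 0, 0],
                           [1, 1, 1, 0, 0, 0, 1, 1]))  # 0.25
```

Unlike overall accuracy, the statistic is insensitive to class prevalence, which is why it suits maps dominated by one cover type.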

The second case relates to remotely sensed estimates of wheat yields collected in 48 smallholder fields in Bihar, India in 2016-17 [200]. Yield data were collected via eight 2 m × 1 m crop cuts within each field, and PlanetScope-derived green chlorophyll vegetation index (GCVI) values were calculated over each field from imagery collected on four dates during the growing season (13 January, 25 February, 12 March, and 14 April 2017). A random forest regression was trained on the yield measured for each field, using the four dates of GCVI values as predictors. To test the effect of TD error on the resulting yield predictions, three types of noise were artificially introduced into the yield data used for training: (1) a systematic 0.5 ton/ha overestimate combined with randomly distributed errors drawn from a normal distribution with a mean of 0 ton/ha, (2) random noise with a standard deviation of 0.5 ton/ha, and (3) random noise with a standard deviation of 1 ton/ha. A baseline model fit to unperturbed data was also developed. Each model was trained on three separate randomly selected subsets of 32 perturbed observations, and predictions were made for the remaining 16 held-out (independent) yield observations, which were not perturbed. This three-fold cross-validation process was repeated 50 times, with each permutation using a different random seed to construct the folds, in order to achieve stable error metrics.
The model performance was assessed by calculating the averages of the mean absolute error (MAE) of the predictions and the R² of regressions fit between predicted and observed values (Figure 11B).
The results show that all four models, including the baseline, compressed the range of yields, as seen in the shallow slope between observed and predicted values, but prediction error was 18-31% higher when the training yields contained either the higher level of random error or the systematic error. The smaller amount of random noise added only ~6% error to the predictions, suggesting that random forests are tolerant of some training error. Note that the average R² of the observed-predicted regression fit was nearly the same for the systematic error case as for the baseline, which shows that this metric can be an unreliable measure of performance for quantitative measures, and that it is important to assess fit against the y = x line and to use a metric such as mean absolute error.
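The noise-injection protocol can be sketched in a dependency-free form. Here synthetic data and a linear fit stand in for the Bihar measurements and the random forest, so the numbers illustrate the experimental design rather than reproduce the reported 6-31% figures:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the experiment: 48 "fields" with one vegetation-
# index predictor linearly related to yield (ton/ha).
n = 48
gcvi = rng.uniform(2.0, 8.0, size=n)
true_yield = 1.0 + 0.4 * gcvi + rng.normal(scale=0.3, size=n)

def mean_mae(train_noise):
    """Perturb only the training yields, fit the model, and report MAE on
    unperturbed held-out yields, averaged over repeated 32/16 splits."""
    errs = []
    for _ in range(50):
        idx = rng.permutation(n)
        tr, te = idx[:32], idx[32:]
        slope, intercept = np.polyfit(gcvi[tr],
                                      true_yield[tr] + train_noise(32), 1)
        pred = slope * gcvi[te] + intercept
        errs.append(np.mean(np.abs(pred - true_yield[te])))
    return float(np.mean(errs))

baseline = mean_mae(lambda k: np.zeros(k))
systematic = mean_mae(lambda k: 0.5 + rng.normal(0.0, 0.3, k))
random_small = mean_mae(lambda k: rng.normal(0.0, 0.5, k))
random_large = mean_mae(lambda k: rng.normal(0.0, 1.0, k))
print(f"MAE -- baseline: {baseline:.2f}, systematic +0.5: {systematic:.2f}, "
      f"random 0.5: {random_small:.2f}, random 1.0: {random_large:.2f}")
```

As in the case study, the systematic offset inflates MAE far more than modest random noise, because the model absorbs the bias into its predictions while held-out yields remain unperturbed.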

Guidelines and Recommendations
Our review and case studies show that the impacts of TD error on EO applications can vary, as can the procedures for assessing those impacts. Nevertheless, several best practices and guidelines can be discerned from this work. Below we synthesize a set of suggested steps for minimizing and accounting for TD error, within the context of undertaking and assessing the accuracy of a typical ML-based mapping project.

Step 1: Define Acceptable Level of Accuracy and Choose Appropriate Metric
As a starting point, researchers should determine the minimum level of accuracy required for their application, using the accuracy metric(s) most appropriate for answering their questions [201]. For example, if the goal of creating a categorical map is to obtain an unbiased area estimate for a particular land cover, it is essential to account for the map's commission and omission errors by adjusting the proportional area estimate of the cover type derived from the map by the proportion of that type estimated from the map reference sample [39,40,57]. For a continuous variable for which the absolute accuracy of the mapped variable is most important, the mean absolute deviation from the y = x line is more informative than R² [93,94].
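For the categorical case, the area adjustment can be sketched with the stratified estimator commonly used for this purpose. The confusion matrix and areas below are hypothetical, and the sketch assumes the sampling strata coincide with the map classes:

```python
import numpy as np

def error_adjusted_area(conf_counts, mapped_area):
    """Error-adjusted area estimates from a confusion matrix of sample
    counts (rows: map class, cols: reference class) and the area mapped
    as each class, using a stratified area estimator."""
    conf = np.asarray(conf_counts, dtype=float)
    area = np.asarray(mapped_area, dtype=float)
    w = area / area.sum()                      # mapped area proportions
    # Estimated proportion of total area in each (map, reference) cell.
    p = w[:, None] * conf / conf.sum(axis=1, keepdims=True)
    return p.sum(axis=0) * area.sum()          # area per reference class

# Hypothetical two-class example: omission errors outweigh commission
# errors, so the naive mapped crop area (800 km^2) adjusts upward.
conf = [[45, 5],    # mapped crop: 45 confirmed, 5 commission errors
        [10, 40]]   # mapped non-crop: 10 omission errors, 40 confirmed
print(error_adjusted_area(conf, mapped_area=[800, 1200]))
# crop: 960 km^2, non-crop: 1040 km^2
```

The adjusted estimate moves away from the pixel-counted area precisely because commission and omission errors are asymmetric, which is the situation the text warns about.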
Error in the map reference data should also be factored into the selected accuracy metrics and resulting map-derived measures. Several published methods exist for categorical data (see Section 1.2.1). For continuous variables, the fit between the mapped and map reference variables can be assessed using Type 2 regression, which allows for error in the dependent (map reference) variable [202], unlike the more commonly used Type 1 regression. Determining map reference data error is critical to determining overall map accuracy. The error in these data effectively determines the upper limit of achievable map accuracy, as it is difficult (but not impossible; see [47]) to know whether a model's predictions are more accurate than its map reference data; thus if the map reference data are only 90% accurate, then the map can be at most 90% accurate. Acceptable accuracy should thus be determined relative to the accuracy of the map reference data, rather than the implicit assumption of 100%, which is widely used since map reference data are usually considered perfect [38,39,47,57,67].
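For the continuous case, a reduced major axis fit (one common Type 2 estimator) can be contrasted with ordinary least squares, which attenuates the slope when the reference variable carries error. This is a generic sketch with synthetic data, not a prescription of which Type 2 variant to use:

```python
import numpy as np

def rma_fit(x, y):
    """Reduced major axis (a Type 2 regression) slope and intercept,
    allowing for error in both variables."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    r = np.corrcoef(x, y)[0, 1]
    slope = np.sign(r) * y.std() / x.std()
    return slope, y.mean() - slope * x.mean()

# Both "map" and "reference" observe the same truth with equal noise, so
# the underlying relationship is 1:1; OLS shrinks the slope toward zero
# while RMA recovers a slope near 1.
rng = np.random.default_rng(1)
truth = rng.normal(size=5000)
mapped = truth + rng.normal(scale=0.5, size=truth.size)
reference = truth + rng.normal(scale=0.5, size=truth.size)
ols_slope = np.polyfit(reference, mapped, 1)[0]
rma_slope, _ = rma_fit(reference, mapped)
print(f"OLS slope {ols_slope:.2f}, RMA slope {rma_slope:.2f}")
```

The attenuated OLS slope (~0.8 here) is exactly the artifact that makes Type 1 regression misleading when the reference data are themselves noisy.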
Although the above steps relate primarily to concerns about map accuracy assessment, they are essential to establishing best practices for map TD. This is firstly due to the fact that, without undertaking rigorous accuracy assessment as described above, it is not possible to assess fully how TD error impacts map accuracy. And secondly, the processes of map reference data and TD generation are often tightly intertwined and impacted by many of the same sources of error (see Sections 1.2.1 and 1.2.2). The procedures for minimizing and measuring errors in both datasets are thus often the same. Our subsequent recommendations, therefore, cover both training and map reference datasets, except where we indicate necessary distinctions.

Step 2: Minimize Design-Related Errors
The next logical step in a mapping project that relies on TD is to design strategies for independently collecting the training and map reference samples. Although there are numerous factors to consider, there are several general aspects of design that can help minimize potential TD errors.

Sample Design
The first consideration relates to the sampling design itself, meaning where, when, how many, and what type of samples are placed (e.g., simple random, clustered, stratified, systematic, purposive/directed). With respect to the TD, this depends to a certain extent on the requirements of the selected ML algorithm, since various ML algorithms have differing requirements with respect to geographic distribution [53] and class balance of samples, e.g., [31,48,80]. Geographic representativeness and the degree to which the TD capture the variability in the feature of interest are also important TD sample design considerations [53,61,150,203]. Continuous TD, particularly those collected in situ, are often point samples. Therefore a sampling protocol should be used to match field measurements and pixel dimensions in order to avoid scaling problems associated with the modifiable areal unit problem [142,143].
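As a minimal illustration of one such design choice, a stratified random sample can enforce the class balance that a simple random sample would not achieve under strong class imbalance. The strata layer below is a hypothetical prior land-cover map:

```python
import numpy as np

def stratified_sample(strata, n_per_class, rng):
    """Draw a class-balanced stratified random sample of pixel indices
    from a categorical strata map (e.g., a prior land-cover layer)."""
    picks = []
    for cls in np.unique(strata):
        idx = np.flatnonzero(strata == cls)
        picks.append(rng.choice(idx, size=min(n_per_class, idx.size),
                                replace=False))
    return np.concatenate(picks)

rng = np.random.default_rng(7)
# Flattened strata map with an 85/10/5 class imbalance.
strata = rng.choice([0, 1, 2], size=10_000, p=[0.85, 0.10, 0.05])
sample = stratified_sample(strata, n_per_class=50, rng=rng)
# Each class contributes 50 sites despite the imbalance.
print(np.bincount(strata[sample]))  # [50 50 50]
```

If area estimation is the goal, the inclusion probabilities implied by such a design must be carried through to the analysis, since the sample deliberately over-represents rare classes.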
The road mapping case study above shows the type of errors that can result when maps are trained with samples that do not adequately represent the features in a particular region. TD can in practice be highly localized or relevant for a limited spatial extent or temporal period [160,194]. This problem may continue to become more relevant, given the increase in stock or benchmark training libraries and subsequent attempts to transfer pre-trained models to different regions, time periods, or scales of observation [73,204]. While such benchmark libraries can be of immense benefit as TD for large area EO research, the representativeness of the features of interest should be assessed and augmented as needed, as shown above in the Khartoum model case study ( Figure 9D). For some widely-used ML algorithms, such as random forests, the best practice appears to be to train with data collected within the mapping region (e.g., within a particular agroecoregion [55,205]), and to avoid over-generalizing or transferring models to other regions [206]. However, until more published studies are available, it is not clear whether this rule applies to deep-learning models. When using citizen science or crowdsourcing approaches to generate these data, representativeness can be ensured by directing labelers to the selected TD sites, e.g., [150], rather than having the interpreters select the regions to map.
Samples should also be temporally representative of the imagery that is being classified [61]. That is, relative to the imagery being classified, the TD (and map reference) sample should be collected within a window of time that matches the characteristic rate of change of the feature being mapped. This interval can be estimated by measuring the temporal autocorrelation in the feature of interest [207]. For rapidly changing phenomena, such as deforestation events, snow/ice melt, and vegetation coverage during phenological transition, the sample may need to be captured within a few days or weeks of the acquisition of the imagery being classified, whereas for slower-moving features a sample collected within a few years may be sufficient.
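The suggested interval can be roughly estimated as the lag at which temporal autocorrelation decays below a chosen threshold (1/e is a common convention). This is a generic sketch using a synthetic AR(1) series, not a prescribed estimator:

```python
import numpy as np

def decorrelation_lag(series, threshold=1 / np.e):
    """First lag at which the autocorrelation of a (mean-removed) time
    series drops below a threshold -- a rough guide to how stale TD can
    be relative to the imagery being classified."""
    x = np.asarray(series, float) - np.mean(series)
    acf = np.correlate(x, x, mode="full")[x.size - 1:]
    acf = acf / acf[0]
    below = np.flatnonzero(acf < threshold)
    return int(below[0]) if below.size else None

# AR(1) toy series with coefficient 0.9: autocorrelation decays as
# 0.9**lag, so the 1/e threshold should be crossed near lag 10.
rng = np.random.default_rng(3)
x = np.zeros(5000)
for t in range(1, x.size):
    x[t] = 0.9 * x[t - 1] + rng.normal()
print(decorrelation_lag(x))
```

Applied to, say, a vegetation-index time series, the returned lag (in observation intervals) suggests how far apart in time TD collection and image acquisition can be before the feature has substantially changed.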
In cases where training and reference samples are generated simultaneously, it is essential that TD sample design does not undermine the standards required for an independent, probabilistic map reference sample, sensu [67]. This is particularly relevant for extremely large, geographically broad benchmark datasets, which may be used for either TD or map reference data, provided the specific dataset conforms to the appropriate criteria. Stehman et al. [176] describe procedures for rigorously incorporating crowdsourced data while maintaining an appropriate probability-based sampling approach specifically for map reference data, and Stehman and Foody [57] explore issues relating to independence between TD and map reference data during sample design. Beyond those considerations, it is important to note that the map reference sample's independence is compromised when it is used to iteratively refine the mapping algorithm. This problem can best be understood within the context of cross validation, which is appropriate for ML parameter tuning, e.g., [31]. However, when the number of folds exceeds one (as in our yield estimation case study; Figure 11B), the portions excluded from training lose statistical independence and can no longer serve as the map reference [77]. Map reference data independence may also be undermined when training sites are selected iteratively, in order to increase their representativeness and improve ML performance, e.g., [55,149]. If the gain due to new training sites is assessed against the map reference, then it will also lose independence after the first iteration. Moreover, any error in the map reference sample will be integrated into the final map. Xiong et al. [55] avoided this problem by visually assessing whether their classifier improved map quality after new TD points were added to the initial sample.
A more quantitative approach is to divide an initial sample into three splits: one for training, the second for validating algorithm improvements, including those related to the addition of new training sites, and the third as the map reference, used only for final accuracy assessment. This partitioning approach can be implemented in the mapping platform used in the cropland mapping case study, Figure 11A [199].
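The three-way partitioning can be sketched as follows; the split fractions are illustrative choices, not prescribed values:

```python
import numpy as np

def three_way_split(n_samples, frac_train=0.6, frac_val=0.2, seed=0):
    """Partition sample indices into training, validation (for iterative
    model refinement), and map-reference (final assessment only) splits.
    The remainder after train and validation becomes the map reference."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_tr = int(frac_train * n_samples)
    n_val = int(frac_val * n_samples)
    return idx[:n_tr], idx[n_tr:n_tr + n_val], idx[n_tr + n_val:]

train, val, map_ref = three_way_split(500)
print(len(train), len(val), len(map_ref))  # 300 100 100
```

The essential discipline is procedural rather than computational: the third split must never be consulted until the model is frozen, or its independence is lost in exactly the way described above.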

Training Data Sources
The requirements for temporal representativeness make the source of training imagery a critical consideration for projects that rely on image interpretation. The use of basemap imagery is not recommended for training maps of dynamic features, given their broad range and uneven distribution of image acquisition dates [61], unless the age of the imagery being classified can be matched to that of the training imagery. Otherwise, there is substantial potential for introducing error into the mapping algorithm (e.g., Figure 1), and its impact may be hard to assess, particularly if the map reference sample is collected from the basemap. The goal of temporal representativeness must be balanced with the need for sufficiently high spatial resolution to accurately interpret the smallest target features/classes (i.e., the MMU; see Step 3). Beyond matters of cost, this tradeoff is one reason that HR/VHR basemaps are widely used [61]. Newly available commercial imagery such as PlanetScope [128] is collected at high temporal frequency (near-daily) and has a spatial resolution sufficient for many visual interpretation tasks (3-4 m); it may therefore be a preferable source of training imagery for developing maps representing the post-2016 period. Finally, in designing an image-based sample, it is also important to consider additional characteristics that can influence interpreters' judgement, such as atmospheric quality (e.g., clouds, haze), sensor view angle, sun angle, spectral band selection, and image contrast stretches [74].

Legend Design
For thematic maps, legend design merits special consideration as it relates to TD, particularly for multi-temporal and/or large area projects that rely on multiple image datasets [61]. As discussed in Section 2 above, objects of interest, including land-cover patches (i.e., the MMU), should be at least twice as large as the pixel resolution of the imagery used in the classification algorithm, assuming a requirement for spectrally pure pixels [136,168,208]. When image spatial resolution is too coarse relative to the scene elements of interest, image interpretation errors are likely due to mixed pixels [127,137,138]. This implies that in designing a legend, researchers should select classes with an MMU that is large enough to be effectively captured by the coarsest resolution imagery to be incorporated in the model, which will help avoid the problem of collecting training samples with mixed pixels [55]. This consideration is particularly relevant since HR/VHR imagery is often used to create TD and map reference data, while the mapping algorithm is applied to moderate- or coarse-resolution imagery, e.g., [55,120,209,210]. Alternatively, researchers may opt to select a classification workflow which explicitly incorporates mixed pixels, e.g., [98,165,173].
Spatial representativeness should be considered as a limiting factor for legend design [53], and to the extent possible, researchers should attempt to use categories that are supported by both the spatial resolution of the model data and the field sampling protocols to be used; we recommend that researchers consult the extensive literature on legend design [25,144-147,211-213].

Step 3: Minimize Collection-Related Errors
There are numerous ways to collect TD for categorical and continuous mapping projects, each with its own sources of error. There are thus many potential approaches for minimizing the associated collection errors, which may be quite specific to a particular variable (e.g., for agricultural area estimates [214]). However, several general approaches can be followed to minimize TD collection errors. Our focus here is primarily on error in image-interpreted TD, one of the most common approaches used to train ML mapping algorithms. We also touch on the specific case of model-derived training data.
Whenever possible, we recommend using protocols that incorporate training reference data to independently assess TD accuracy, particularly for image-interpreted TD, e.g., [150]. Training reference datasets can be limited in size compared to the ultimate sample size, provided that training reference locations are randomly presented to interpreters during the data creation campaign [150]. Active feedback during training label creation can also help reduce errors on a rolling basis by providing interpreters with information regarding their performance [174].
If comparison against training reference data is not possible, then consensus methods for generating TD may be the next best alternative. Consensus between at least three interpreters is recommended to allow for majority voting [34,46], but more complex land covers may require up to seven interpreters [46]. Consensus among several domain experts may also be the best and most practical measure for collecting both training reference data and map reference data [34,57]. In the case of image-interpreted samples, consensus approaches should employ multiple interpreters to label the same site. For continuous variables, several independent or repeated in situ measurements should be made and aggregated. For modeled variables where the error is unknown, as in the surface flux case study, training based on the outputs of multiple independent models is recommended. The agricultural case study shows how multiple mappings can be used to quantify label uncertainty (Figure 12A) and minimize the amount of labeling error, yielding improved map accuracy (Figure 11A). The surface flux case study demonstrates these same benefits across several continuous variables (Figure 10). The number of separate measures or interpreters needed will vary depending on the application. For example, in the cropland mapping case study, five interpreters labeled each consensus training sample, and in the continuous surface flux example, three separate modeled inputs were used.
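A simplified stand-in for such consensus labeling is an accuracy-weighted vote, sketched below. The Bayesian merging used in [150] is more involved, and the labels and scores here are hypothetical:

```python
import numpy as np

def consensus_label(labels, interpreter_scores):
    """Combine per-interpreter categorical labels for one site into a
    consensus label, weighting each vote by that interpreter's TD
    accuracy score (a simplified stand-in for Bayesian merging)."""
    labels = np.asarray(labels)
    weights = np.asarray(interpreter_scores, float)
    classes = np.unique(labels)
    votes = [weights[labels == cls].sum() for cls in classes]
    return classes[int(np.argmax(votes))]

# Five interpreters label one site: two high-accuracy interpreters
# outweigh three low-accuracy interpreters who agree with each other.
print(consensus_label(["crop", "noncrop", "noncrop", "noncrop", "crop"],
                      [0.95, 0.30, 0.25, 0.30, 0.90]))  # crop
```

With uniform weights this reduces to simple majority voting; the weighted form shows why tracking per-interpreter accuracy scores (as in the cropland case study) changes the consensus outcome.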
Further steps can be taken to minimize TD collection errors arising from image interpretation. Interpreters should be given thorough training regarding the task [34], which may include instruction on remote-sensing principles as well as local or regional contextual information. Local domain expertise is particularly helpful for consistent identification of idiosyncratic land covers [163]. Interpreter education is particularly important for crowdsourcing or citizen science data collection campaigns, as participants typically lack formal experience in image interpretation [151,215].
As described in Step 2 above, image interpretation is inadvisable when the available imagery does not support the legend categories in terms of spatial, spectral, temporal, or radiometric resolution [216-218]. Researchers must be especially cautious in the similar but potentially more hazardous case that HR/VHR imagery is used to create training samples that are then used with coarser resolution imagery when ingested into the ML model. Assuming that researchers correctly specify their data selection and legend design when using higher spatial resolution imagery to create TD, image interpretation errors due to insufficient resolution should be minimized; however, special care should be given to borderline classes, or classes exhibiting a high degree of spatial and/or spectral variability due to land-cover mixtures within the pixel [127,137,138,154,219]. In such cases, we recommend that training polygons be created near the center of scene objects, where pixel mixing is likely to be minimized, e.g., [55].
Another important error-minimizing approach relates to cases in which TD comes from a process model, as in the surface flux example outlined above. Process models are also increasingly used to train crop yield mapping models, due to the difficulty of obtaining sufficiently large and reliable field-scale yield data for training [220]. To circumvent this challenge, the scalable yield mapping (SCYM) method [221,222] uses a mechanistic crop model to simulate yields under various environmental and management conditions. The model's outputs then become inputs for training an empirical mapping model (typically ML), in which the simulated yield is the dependent variable and a subset of remotely retrievable model variables serve as predictors. TD errors in such cases can be minimized by rigorously calibrating the process model (itself a challenging task) using best practices from the relevant modeling literature, e.g., [223]. Alternatively, if modeled TD are necessary but careful calibration is not possible (e.g., because the data are pre-existing), then a merging approach such as triple collocation (Section 3.2) can help reduce training error.

Step 4: Assess Error in Training Data
The best way to assess TD (and map reference data) error is to measure it directly. For continuous variables, calculating measurement error should be possible in many cases, even for model-generated data, for which the variance can be calculated from simulation treatments, e.g., [223]. For categorical mapping, label error can be measured using an internal accuracy assessment protocol that makes use of predefined training reference data (e.g., Estes et al. [150]).
However, it can be challenging to produce training reference data, and indeed in some cases the true category is not clear, whether looking at an image or standing on site. In these cases, or when a direct TD error measurement protocol is not available, we recommend that researchers calculate uncertainty estimates based on the repeated measures or multiple interpreter approaches (e.g., the crowd standard deviation [151]) described in Step 3 above (and see Figure 12); this is useful for both training and map reference data. We also recommend that additional measures relating to data collection speed, precision, and consistency be collected for individual data creators, as these can generate further insight into relative TD errors. This recommendation is based on experience in crowdsourced data creation [150,151], but it is applicable to any type of data collection, and could greatly bolster the understanding and quantification of error propagation. If it is not possible to either directly quantify TD error or relative uncertainty, then researchers should at a minimum clearly document the data creation methods, and detail likely sources of error and potential uncertainties.
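For repeated categorical labels, a simple per-pixel agreement surface (the inverse of uncertainty, analogous to the heat map in Figure 12B) can be computed as follows; the interpreter masks are toy examples:

```python
import numpy as np

def label_agreement(masks):
    """Per-pixel agreement across repeated binary label rasters from
    independent interpreters: 1.0 where all agree the feature is present,
    0.0 where all agree it is absent, intermediate values in between."""
    stack = np.asarray(masks, float)  # shape (n_interpreters, H, W)
    return stack.mean(axis=0)

# Three 2x2 "field" masks from three interpreters: agreement of 1.0 marks
# pixels every interpreter labeled as field, while intermediate values
# flag uncertain boundary pixels.
masks = [[[1, 1], [0, 0]],
         [[1, 1], [1, 0]],
         [[1, 0], [0, 0]]]
print(label_agreement(masks))
```

For continuous variables, the analogous quantity is the per-site standard deviation across repeated measurements, which can be reported alongside the TD as a relative uncertainty layer.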

Figure 12. Two examples of consensus-based mapping approaches and their potential use for assessing training (or reference) data uncertainty. Panel (A) shows a collection of crop field boundary polygons drawn by five independent workers around crop fields visible in PlanetScope imagery collected over Ghana. These labels can be converted into a heat map (B) showing the overall agreement, the inverse of uncertainty. Similarly, 19 independent experts were asked to delineate slum settlements in an image subset from Cape Town, South Africa. The polygons are converted into overall agreement, and the uncertainty is modeled using random sets: panel (C) shows the covering function, which is then used to calculate the standard deviation of the random set (D). Both metrics indicate the variability as well as the stability of boundaries delineated by different experts. Adapted with permission from Kohli et al. [163].

4.5. Step 5. Evaluate and Communicate the Impact of Training Data Error

TD Treatment Tiers
Due to the wide range of remote-sensing research currently underway, a wide variety of TD and classification algorithms are in use. Therefore, it is not possible to specify a single protocol for treatment of TD error. Instead, we outline three tiers that represent different levels of accounting for the impact of TD errors on resulting map products. These three tiers presuppose that researchers follow best practices for map accuracy assessment, which includes selecting the most appropriate, literature-recommended accuracy measure(s), quantifying map reference sample error, and accounting for the impact of map reference data error on the accuracy measures (per Step 1). If these best practices are followed, TD error impacts will already be implicitly accounted for within the accuracy measures, and the selected TD accounting tier will be governed by the purposes of the mapping application.

Tier 1
The optimal TD accuracy assessment, termed Tier 1, involves quantifying TD error using gold standard training reference data (Step 4). This information is then used to quantify various characteristics of the TD sample such as class balance and sample size. It is also used to determine the impacts of collection error stemming from label or measurement errors on model uncertainty and map accuracy (see Sections 1.2.2 and 2.2). For example, the impact of TD error on the certainty of random forest classifications can be assessed using measures derived from the margin function [48]. The impact of TD error on map accuracy should also be assessed by training models with TD adjusted to reflect the range of measured TD error, as illustrated by our cropland mapping case study, and with respect to variations in TD sample size and class balance [30,48,149]. This approach can be used to inform the researcher how much map improvement can be obtained by improving TD quality. As such, these tests should be performed against the validation sample rather than the map reference data, in order to preserve the independence of the latter.
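The perturbation test described above can be sketched in a few lines. The following is a minimal, self-contained illustration using a synthetic one-dimensional feature and a nearest-neighbour classifier as a stand-in for an actual EO model; the label-flip rates and data are hypothetical, simulating measured TD label error:

```python
import random
from statistics import mean

random.seed(0)

def make_data(n):
    """Synthetic 1-D feature with two separable classes (stand-ins for real TD)."""
    data = [(random.uniform(0.0, 1.0), 0) for _ in range(n // 2)]
    data += [(random.uniform(1.2, 2.2), 1) for _ in range(n // 2)]
    return data

def flip_labels(data, rate):
    """Simulate TD collection error by flipping a fraction of class labels."""
    return [(x, 1 - y) if random.random() < rate else (x, y) for x, y in data]

def predict_1nn(train, x):
    """Classify by the label of the nearest training observation."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

train, valid = make_data(400), make_data(400)
results = {}
for rate in (0.0, 0.1, 0.2, 0.3):
    noisy = flip_labels(train, rate)
    acc = mean(1 if predict_1nn(noisy, x) == y else 0 for x, y in valid)
    results[rate] = acc
    print(f"label-flip rate {rate:.1f}: validation accuracy {acc:.3f}")
```

In practice, the flip rates would be set to the TD error measured against the training reference data, and accuracy would be computed against the validation sample, in keeping with the independence requirement noted above.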
We recommend that developers of benchmark TD libraries adhere to the Tier 1 guidelines, keeping in mind that these datasets are likely to be used for a variety of purposes, including as TD and map reference data. Undertaking such evaluations can provide users with important information about the appropriate usage of these data for different ML models and geographies, and whether the benchmark data are appropriate for use as TD, training reference data, validation data, and/or map reference data. A rigorous quantification of error in the samples themselves is particularly important, since such data are likely to be used as training and/or map reference data. We strongly urge researchers to consider which purposes these benchmark datasets are appropriate for, and refer the reader to previously published literature regarding the incorporation of non-probabilistic samples [176]. Ideally, this tier should also be followed by the makers of map products intended for widespread public use, who should also release the TD and map reference data used during map creation [57]. This step would give users full insight into the quality and usability of the map for their own purposes.
Published TD (and map reference data) should be documented with standard metadata, as shown in Table S4, including the relevant error metric associated with each observation. The SpatioTemporal Asset Catalog (STAC, https://stacspec.org/) provides a framework for standardization of metadata for EO data and is increasingly seen as an international standard for geospatial data.
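For illustration, a single TD observation could be encoded as a STAC Item, which is a GeoJSON Feature carrying STAC fields. The `td:*` property names below are hypothetical placeholders for per-observation metadata such as the label and its error metric; they are not part of the STAC core specification:

```python
import json

# A minimal STAC Item (GeoJSON Feature) describing one TD observation.
# The "td:*" property names are illustrative, not part of the STAC core spec.
td_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "td-obs-000123",
    "geometry": {"type": "Point", "coordinates": [-1.62, 6.67]},
    "bbox": [-1.62, 6.67, -1.62, 6.67],
    "properties": {
        "datetime": "2019-07-15T10:30:00Z",
        "td:label": "cropland",
        "td:label_error": 0.12,   # e.g., disagreement with training reference data
        "td:interpreters": 5,     # number of independent labellers
    },
    "links": [],
    "assets": {},
}

print(json.dumps(td_item, indent=2))
```

Recording each observation in this way keeps the label, its location and timestamp, and its error metric together in a machine-readable form that downstream users can filter on.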

Tier 2
If it is not possible to directly measure and quantify TD error, the next best approach to account for TD error is to introduce a plausible range of simulated error into the TD and evaluate its impact on model uncertainty and map accuracy after training separate models with the perturbed datasets [48]. If multiple workers are tasked with collecting TD for the same site, then the variance in their data can be calculated [151] to derive the uncertainty bounds (e.g., Figure 12). This approach is demonstrated in the building mapping case study (Section 3.1.1), which illustrates the sensitivity of key accuracy metrics to two different kinds of simulated labeling errors. The wheat yield case study (see Section 3.3) provides an example of this approach for a continuous variable.
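The multi-worker variance calculation can be sketched as follows; this is a toy example in the spirit of Figure 12A,B, with hypothetical binary labels from five workers over six pixels:

```python
from statistics import pvariance

# Hypothetical binary labels from five independent workers for six pixels
# (1 = inside a field boundary, 0 = outside).
worker_labels = [
    [1, 1, 1, 0, 0, 1],  # worker 1
    [1, 1, 0, 0, 0, 1],  # worker 2
    [1, 1, 1, 0, 1, 1],  # worker 3
    [1, 0, 1, 0, 0, 1],  # worker 4
    [1, 1, 1, 0, 0, 0],  # worker 5
]

n_workers = len(worker_labels)
per_pixel = list(zip(*worker_labels))  # regroup labels by pixel

# Agreement "heat map": fraction of workers labelling each pixel positive.
agreement = [sum(votes) / n_workers for votes in per_pixel]
# Per-pixel variance across workers gives simple uncertainty bounds.
variance = [pvariance(votes) for votes in per_pixel]

for i, (a, v) in enumerate(zip(agreement, variance)):
    print(f"pixel {i}: agreement {a:.2f}, variance {v:.2f}")
```

Pixels where all workers agree have zero variance, while contested pixels (split votes) show the highest variance, flagging locations where the labels, and hence the trained model, are least reliable.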
This tier may also provide an acceptable standard for both benchmark datasets and publicly released map products, particularly where absolute error quantification is less important. TD and map reference data should also be made openly available with standard metadata, as described above, including the uncertainty metric for each observation. If it is not possible to publish them (e.g., because of privacy concerns), then researchers should provide documentation that summarizes these data and their uncertainty.

Tier 3
If the TD error quantification of Tiers 1 or 2 is not possible, then researchers should at minimum publish their TD and map reference data (e.g., [55]) with accompanying metadata that includes descriptions of potential errors and uncertainties. If the data cannot be made openly available, then researchers should publish full descriptions of the potential error in the data. Adherence to this tier, at least its reporting component, should be the minimal standard practice in peer-reviewed, map-based scientific research.

Communicating Error
Finally, uncertainty in ML-generated maps associated with both TD and map reference error should be faithfully reported within the maps and accompanying documents. Incomplete error reporting limits the scientific validity and usefulness of these products [57]. Given that ML-generated maps are increasingly used in public and policy domains, we advise makers of widely used maps to communicate these uncertainties and their consequences in a manner that is clear and understandable for broad audiences, including non-specialists, so that users can understand the map and its limitations. In general, we recommend including the error on or very close to the actual map, whether by means of metrics, the error matrix, and/or cartographic techniques for representing uncertainty. Effective cartographic techniques for conveying uncertainty include selecting appropriate, intuitive, and color-blind-friendly color schemes for classes and symbols; varying color value, saturation, and font/line weight to indicate levels of uncertainty; using crisp versus blurred boundaries and symbols to indicate the range of uncertainty; and displaying consensus maps or side-by-side juxtapositions in cases of multiple, mutually exclusive predictions for the same place and time (e.g., representing differently specified models) [42,43]. Maps of consensus in training labels, such as those shown in Figure 12A,B, can provide valuable uncertainty information to users.

Towards an Open Training Data Repository
For the scientific community, the ideal standard of openness and replicability is to provide a complete description of TD collection practices, appropriate accuracy metrics, and, perhaps most importantly, the raw data. We recommend the creation of a centralized, open-source database of all available and relevant TD, using the details collected in the proposed template (Table S4) and recorded using the STAC framework. This type of open repository, taking inspiration from similar large-scale databases for computer vision (ImageNet, SIFT10M [224,225]) and remote sensing (BigEarthNet, DeepSat, UC Merced Land-Use Dataset [73,226,227]), should contain full training metadata, citations to the peer-reviewed literature, and links to downloadable versions of TD collection protocols. Following the philosophy of free and open-source software, we strongly recommend that researchers embrace open data, the only means by which a study can be truly reproduced.

Conclusions
Current practices in EO research are generally inattentive to the need to evaluate and communicate the impact of TD error on ML-generated maps. This oversight undermines the goals of scientific reproducibility and may compromise the insights drawn from the resulting maps. Improving these practices is important due to the increasing use of TD-intensive ML algorithms, which have motivated our review and recommendations.
To resolve terminological differences arising from the influence of non-EO disciplines, and to help contextualize TD considerations relative to established map accuracy assessment practice, we distinguish between four types of "truth" data used in ML-based mapping projects (training, validation, training reference, and map reference data), and define the appropriate role for each (Section 1.2). We identify causes of error in TD as well as map reference data, distinguishing where these vary (Section 2.1). We then explore the impacts of TD error (Section 2.2) and provide a set of case studies to illustrate the consequences of such error across a range of ML-based mapping applications (Section 3).
We then provide a set of guidelines for minimizing error arising from the design and collection of TD samples, and present recommendations for measuring and accounting for the impact of these errors (Section 4). Many of these guidelines and procedures also relate to map reference data generation, and we ground our recommendations in the existing best practices for map accuracy assessment (Sections 1.2.1 and 4.1). We conclude by defining three tiers of TD error accounting and reporting standards, which are designed to accommodate a wide range of ML-based mapping projects. The highest tiers should be adopted when creating open training libraries and public map products, both of which are increasingly being developed to meet the growing demand for EO-derived maps. In this context, there is a pressing need to rigorously evaluate the training requirements and relative performance of deep-learning models as they become more widely used for EO [36]. While TD is more visible in the context of LCLU and other categorical mapping projects, rigorous, well-documented TD is also critically important for continuous variable applications in the Earth system sciences (e.g., hydrological research [228]). If adopted within the peer-reviewed literature, the standards we propose for TD treatment may improve confidence in scientific findings drawn from map-based research, which can otherwise be confounded by poorly quantified map errors [33,57].

Supplementary Materials
Figure S2: Schematic of product selection using the Triple Collocation approach. Table S1: List of peer-reviewed publications retrieved using the Google Scholar search. Table S2: Summary of commonly used error metrics. Table S3: Quantitative results of comparing each of the three models trained for the road detection case in Kumasi, Ghana to the validation labels.