Twice Is Nice: The Benefits of Two Ground Measures for Evaluating the Accuracy of Satellite-Based Sustainability Estimates

Satellite data offer great promise for improving measures related to sustainable development goals. However, assessing satellite estimates is complicated by the fact that traditional ground-based measures of these same outcomes are often very noisy, leading to underestimation of satellite performance. Here, we quantify the amount of noise in traditional measures for three outcomes commonly studied in prior work (agricultural yields, household asset ownership, and household consumption expenditures) and present a theoretical basis for properly characterizing satellite performance in the presence of noisy ground data. We find that for both yield and consumption, repeated ground measures often disagree with each other, with less than half of the variability in one ground measure captured by the other. Estimates of the performance of satellite measures, in terms of squared correlation (r²), that account for this noise in ground data are accordingly higher than, and occasionally even double, the apparent performance based on a naïve comparison of satellite and ground measures. Our results caution against evaluating satellite measures without accounting for noise in ground data and emphasize the benefit of estimating that noise by collecting at least two independent ground measures.


Introduction
Researchers and policy makers working on issues of poverty and food security often face a paucity of reliable data on outcomes of interest. As a result, decisions about resource allocation to improve these outcomes are often made on the basis of very limited information. This longstanding situation has motivated efforts to develop alternate measurement approaches, including using satellite imagery, mobile phone data, crowdsourcing platforms, and social media [1][2][3].
Efforts to use satellite data are particularly attractive because of the ubiquitous and rapidly expanding availability of imagery, much of it in the public domain. Recent work has demonstrated promising results for using satellites to estimate outcomes such as agricultural crop yields, village-level measures of wealth based on asset ownership, average household consumption, income inequality, and the prevalence of informal settlements [4][5][6].
Despite research progress, the operational use of satellite-based estimates remains low. For example, while satellite data have proven useful for yield estimation in many situations [7][8][9][10][11][12], they are still not used operationally in most of the major efforts to assess farm productivity. The United States Department of Agriculture still relies on a combination of farmer phone and mail surveys and field measurements for its in-season yield forecasts and end-of-year yield estimates [13]. In smallholder systems, typically defined as those with field sizes below 2 ha, governments have primarily relied on subjective assessments of local officials, self-reported farmer yields, or, in some countries (e.g., Ethiopia), extensive in-field harvests of small plots ("crop-cuts") [14,15].
One conundrum facing the research community is that traditional measures against which new approaches are compared are often themselves quite noisy. Self-reported measures from household surveys on farm production or consumption expenditures, for instance, can be fraught with problems arising from limited memories, poor recordkeeping, a tendency to round, inconsistent unit conversions, and deliberate under- or over-reporting [16][17][18]. In some cases, more objective measures are possible. For crop yields, one can conduct crop-cuts within agricultural fields. However, even then the sampled values can deviate substantially from the yield of the entire field because of within-field heterogeneity [18,19].
Ironically, the substantial and often unrecognized noise in traditional measures has arguably hampered adoption of alternative approaches, since agreement with these traditional measures is often a key measure of performance sought by potential users. To resolve this conundrum, we present here an approach to explicitly account for noise in ground-based measures when evaluating satellite estimates, and thereby improve assessments of satellite performance. The approach relies primarily on having multiple, independent ground-based measures of outcomes, although in the case of crop yields, we also present an alternative that uses the satellite measures themselves to estimate likely errors in the ground data.

Correlation between Ground Measures
We considered three outcomes related to economic activity and conditions that are commonly reported in the literature: crop yields, household asset wealth, and household consumption expenditures. For crop yields, we identified published papers that reported having multiple crop-cuts, and requested the original datasets from the authors. We also obtained many datasets from the International Maize and Wheat Improvement Center (CIMMYT) Research Data & Software Repository Network (https://data.cimmyt.org/, accessed on 12 October 2020). The largest number of observations (n = 15) was available for maize, primarily because of the Taking Maize Agronomy to Scale in Africa (TAMASA) program that conducted multiple years of crop-cuts in three countries, with multiple crop-cut locations per field. Table S1 summarizes values for each dataset and Figure S2 summarizes how correlations between crop-cuts varied across crop types, crop-cut size, and whether the fields were irrigated or rainfed. Although for yields we focus here on studies with more than one crop-cut, some studies have both a single crop-cut and a self-reported yield. In those cases, correlations are typically also quite low, as shown in Table S2.
For asset wealth, we utilized data from the Demographic and Health Surveys (DHS) Program funded by USAID (available at https://dhsprogram.com/data/, accessed on 11 March 2021), which routinely collects data on household ownership of a standard set of assets (e.g., radio, refrigerator, type of flooring) in many countries. For household consumption, we used survey data from the World Bank's Living Standards Measurement Surveys (LSMS) (available at https://microdata.worldbank.org/index.php/catalog/lsms, accessed on 11 March 2021), which measure expenditures on all items over a fixed recall period (e.g., past 7 days or past month). Asset wealth was summarized based on the first principal component of a set of household assets, as described in [20], whereas total household consumption was provided in the LSMS data. To estimate noise in the survey measures, for each survey enumeration area (or "cluster", roughly equivalent to a village in rural areas) we split surveyed households into two random subsets of equal size, computed the average value for each subset, and then reported the correlation across clusters between the two subcluster averages.
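The split-half procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the code used in the study: the function name `split_half_correlation`, the synthetic data, and the choice to drop one household when a cluster has an odd count are all assumptions made here.

```python
import numpy as np

def split_half_correlation(values, cluster_ids, rng):
    """Correlation across clusters between the means of two random household halves."""
    values = np.asarray(values, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    half_a, half_b = [], []
    for cluster in np.unique(cluster_ids):
        # Shuffle the households of this cluster, then cut the list in two.
        members = rng.permutation(values[cluster_ids == cluster])
        half = len(members) // 2
        if half == 0:
            continue  # a single household cannot be split
        # Assumption: with an odd count, one household is dropped so halves are equal.
        half_a.append(members[:half].mean())
        half_b.append(members[half:2 * half].mean())
    return np.corrcoef(half_a, half_b)[0, 1]

# Synthetic example (hypothetical values): 200 clusters of 10 households each,
# where each household equals its cluster mean plus household-level noise.
rng = np.random.default_rng(0)
n_clusters, n_households = 200, 10
cluster_means = rng.normal(0.0, 1.0, n_clusters)
cluster_ids = np.repeat(np.arange(n_clusters), n_households)
values = cluster_means[cluster_ids] + rng.normal(0.0, 2.0, cluster_ids.size)
r_split = split_half_correlation(values, cluster_ids, rng)
print(f"split-half correlation: {r_split:.2f}")
```

Because each half contains only half of the surveyed households, the resulting correlation describes the noise in half-cluster averages rather than in the full-cluster average.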

Derivation of Correction Equation
Building from the intuition that noisy ground data will hamper evaluation of satellite measures, we seek to develop a formal way of correcting performance measures for this noise. We begin by considering the distribution of the "true" outcome ($Y$), which for simplicity we will assume is normally distributed with mean $\mu$ and standard deviation $\sigma_Y$:

$$Y \sim N(\mu, \sigma_Y^2) \tag{1}$$

We then consider a satellite measure $S$ that is an unbiased but noisy measure of $Y$:

$$S = Y + \varepsilon_S \tag{2}$$

with $\varepsilon_S$ representing normally distributed error with mean 0 and standard deviation $\sigma_{\varepsilon_S}$, which is independent of $Y$. Then we can express the standard deviation of $S$ as

$$\sigma_S = \sqrt{\sigma_Y^2 + \sigma_{\varepsilon_S}^2} \tag{3}$$

Similarly, we consider a ground-based measure $G_1$ that is a noisy measure of $Y$, with a standard deviation of noise $\sigma_{\varepsilon_{G_1}}$:

$$G_1 = Y + \varepsilon_{G_1} \tag{4}$$

with $\varepsilon_{G_1}$ independent of both $S$ and $Y$, in which case

$$\sigma_{G_1} = \sqrt{\sigma_Y^2 + \sigma_{\varepsilon_{G_1}}^2} \tag{5}$$

Of interest is typically the linear correlation (i.e., Pearson correlation coefficient) between $S$ and $Y$, which is calculated as:

$$r(S,Y) = \frac{\mathrm{cov}(S,Y)}{\sigma_S \sigma_Y} = \frac{\sigma_Y^2}{\sigma_S \sigma_Y} = \frac{\sigma_Y}{\sigma_S} \tag{6}$$

However, since we cannot measure the true yields, we instead are left to calculate:

$$r(S,G_1) = \frac{\mathrm{cov}(S,G_1)}{\sigma_S \sigma_{G_1}} = \frac{\sigma_Y^2}{\sigma_S \sigma_{G_1}} \tag{7}$$

We can see that $r(S,G_1)$ is smaller than $r(S,Y)$, because $\sigma_{G_1} > \sigma_Y$. Thus, correlations reported in studies are typically a lower bound on the correlation of $S$ with the true outcome, but without knowing the ratio $\sigma_{G_1}:\sigma_Y$ one cannot estimate the true value of $r(S,Y)$.
Importantly, this situation is remedied if one also obtains a second ground-based measure $G_2$, which is independent of the first measure, $G_1$:

$$G_2 = Y + \varepsilon_{G_2} \tag{8}$$

with

$$\sigma_{G_2} = \sqrt{\sigma_Y^2 + \sigma_{\varepsilon_{G_2}}^2} \tag{9}$$

As before, we can calculate the correlation between $S$ and $G_2$ as:

$$r(S,G_2) = \frac{\sigma_Y^2}{\sigma_S \sigma_{G_2}} \tag{10}$$

We can also calculate the correlation between the two ground measures, which if $\varepsilon_{G_1}$ and $\varepsilon_{G_2}$ are independent is:

$$r(G_1,G_2) = \frac{\sigma_Y^2}{\sigma_{G_1} \sigma_{G_2}} \tag{11}$$

By combining Equations (7) and (10), we can express the product of the correlations of $S$ with the two ground measures as:

$$r(S,G_1)\, r(S,G_2) = \frac{\sigma_Y^4}{\sigma_S^2 \sigma_{G_1} \sigma_{G_2}} \tag{12}$$

Dividing by Equation (11) then gives:

$$\frac{r(S,G_1)\, r(S,G_2)}{r(G_1,G_2)} = \frac{\sigma_Y^2}{\sigma_S^2} \tag{13}$$

Finally, substituting Equation (13) into (6) and squaring both sides gives:

$$r^2(S,Y) = \frac{r(S,G_1)\, r(S,G_2)}{r(G_1,G_2)} \tag{14}$$

Equation (14) says that the squared correlation coefficient between $S$ and the true outcome $Y$, which cannot be directly observed since we cannot measure $Y$, can be calculated from the correlation of $S$ with each of two ground measures, together with the correlation of these ground measures with each other. Intuitively, $r^2(S,Y)$ is measured not by the absolute agreement between $S$ and $G_1$ or $G_2$, but by how this agreement compares to how well the ground measures agree with each other.
Equation (14) is only valid if the two ground measures are independent. For example, if G1 is a crop-cut yield from a random subplot, G2 could be the yield self-reported by a farmer only if that farmer is not aware of the crop-cut estimate, or it could be a crop-cut yield from a separate location within the same field, as long as that location is randomly selected. The implications of violations of this assumption are addressed in Section 4.
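As a minimal sketch, the correction can be written as a short Python function that takes the satellite estimates and two ground measures for the same locations. The function name `corrected_r2` and the synthetic parameter values below are illustrative assumptions, not part of the study; the noise terms are drawn independently, as the derivation requires.

```python
import numpy as np

def corrected_r2(s, g1, g2):
    """Estimate r^2(S, Y) from three observable correlations (Equation (14))."""
    r_s_g1 = np.corrcoef(s, g1)[0, 1]
    r_s_g2 = np.corrcoef(s, g2)[0, 1]
    r_g1_g2 = np.corrcoef(g1, g2)[0, 1]
    return r_s_g1 * r_s_g2 / r_g1_g2

# Synthetic check with hypothetical values: sigma_Y = 2, satellite noise SD = 1,
# and ground noise SD = 2 for both measures, so sigma_Y^2 / sigma_S^2 = 4 / 5.
rng = np.random.default_rng(0)
n = 20000
y = rng.normal(5.0, 2.0, n)
s = y + rng.normal(0.0, 1.0, n)
g1 = y + rng.normal(0.0, 2.0, n)
g2 = y + rng.normal(0.0, 2.0, n)

true_r2 = np.corrcoef(s, y)[0, 1] ** 2    # not observable in practice
naive_r2 = np.corrcoef(s, g1)[0, 1] ** 2  # what a single ground measure yields
estimate = corrected_r2(s, g1, g2)
print(f"true {true_r2:.2f}, naive {naive_r2:.2f}, corrected {estimate:.2f}")
```

The ratio becomes unstable when the correlation between the two ground measures is close to zero, so in practice it is worth inspecting that correlation before interpreting the corrected value.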

Simulations
To verify the accuracy of Equation (14) for correcting estimates of satellite performance, we conduct a series of simulations based on hypothetical variation in crop yields. For a given number of fields (N), we simulate realizations of Y, S, G1, and G2 on these fields under a set of specified values for µ, σ_Y, σ_S, σ_G1, and σ_G2. We then calculate the correlations r(S,Y), r(S,G1), r(S,G2), and r(G1,G2), as well as the estimated r²(S,Y) from Equation (14). We then repeat these simulations 1000 times, using different combinations of parameters for the simulated noise in both satellite and ground measures to explore a range of potential conditions. The ranges of parameter values used were 1-3 t/ha for σ_Y, and 0.5-3 t/ha for σ_S, σ_G1, and σ_G2. The value of µ was fixed at 5 t/ha, since varying this value does not affect the resulting r².
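A simulation of this kind could be sketched in Python as follows. Two assumptions are made here that the text does not specify: parameter values are drawn uniformly from the stated ranges, and the values given for the satellite and ground measures are treated as standard deviations of the added noise. The helper names are hypothetical.

```python
import numpy as np

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def simulate_once(n_fields, sd_y, sd_s, sd_g1, sd_g2, rng, mu=5.0):
    """One simulated dataset with classical (additive, independent) errors."""
    y = rng.normal(mu, sd_y, n_fields)           # true yields
    s = y + rng.normal(0.0, sd_s, n_fields)      # satellite estimate
    g1 = y + rng.normal(0.0, sd_g1, n_fields)    # first ground measure
    g2 = y + rng.normal(0.0, sd_g2, n_fields)    # second ground measure
    true_r2 = corr(s, y) ** 2
    naive_r2 = corr(s, g1) ** 2
    corrected = corr(s, g1) * corr(s, g2) / corr(g1, g2)  # Equation (14)
    return true_r2, naive_r2, corrected

rng = np.random.default_rng(1)
results = np.array([
    simulate_once(
        1000,
        rng.uniform(1.0, 3.0),   # sd of true yields, t/ha
        rng.uniform(0.5, 3.0),   # satellite noise sd (assumed interpretation)
        rng.uniform(0.5, 3.0),   # ground noise sd, measure 1
        rng.uniform(0.5, 3.0),   # ground noise sd, measure 2
        rng,
    )
    for _ in range(1000)
])
true_r2, naive_r2, corrected_r2 = results.T
mean_bias_naive = float(np.mean(naive_r2 - true_r2))
mean_bias_corrected = float(np.mean(corrected_r2 - true_r2))
print(f"mean bias, naive: {mean_bias_naive:.3f}")
print(f"mean bias, corrected: {mean_bias_corrected:.3f}")
```

Comparing the two mean biases gives a quick numerical check of whether the naive comparison understates performance while the corrected estimate stays close to the true value.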
A potential objection to the derivations above is that they assume that both satellite and ground observations exhibit classical measurement error, i.e., observations are the sum of the true values and random noise. A reasonable question is therefore how these equations would perform under conditions of non-classical errors. We focus here on situations where the satellite-based estimates exhibit so-called Berkson error, with values that are smoother (i.e., have less variance) than the true values. This situation is plausible given that satellite estimates rely on spectral vegetation indices (VIs) that are primarily sensitive to total canopy biomass. Although biomass is strongly correlated with grain yields, and in fact many approaches assume a constant proportion of biomass in grains (referred to as the harvest index), changes in harvest index will cause variations in true yields that are not captured in the satellite estimates, S.
We therefore repeat the simulations above, this time modeling Berkson error, in which the satellite yields are smoother than the actual yields because the true yield equals the satellite estimate plus noise that is independent of S:

$$Y = S + \varepsilon_S$$
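The Berkson case can be illustrated with a short Python sketch in which the satellite estimate is drawn first and the true yield is generated as that estimate plus independent noise. All numerical values below are hypothetical choices made for illustration.

```python
import numpy as np

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

rng = np.random.default_rng(2)
n_fields = 20000

# Berkson error: the satellite estimate is the smooth quantity, and the true
# yield scatters around it (e.g., because harvest index varies).
s = rng.normal(5.0, 1.5, n_fields)
y = s + rng.normal(0.0, 1.0, n_fields)

# Ground measures still carry classical error around the true yield.
g1 = y + rng.normal(0.0, 1.5, n_fields)
g2 = y + rng.normal(0.0, 1.5, n_fields)

true_r2 = corr(s, y) ** 2
naive_r2 = corr(s, g1) ** 2
corrected = corr(s, g1) * corr(s, g2) / corr(g1, g2)  # Equation (14)
print(f"true {true_r2:.2f}, naive {naive_r2:.2f}, corrected {corrected:.2f}")
```

Under this error model the covariance of S with each ground measure equals the variance of S rather than of Y, but the ratio in Equation (14) still reduces to the squared correlation between S and Y, which is what the sketch checks numerically.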

Application to Crop Yields
We applied Equation (14) for three prior studies of satellite yield estimates for which at least two ground measures per field were available. In two cases, for sorghum in Mali [21] and wheat in Nepal [22], two independent crop-cuts were available. In the third case, a study of maize in Uganda [11], we used one crop-cut and one self-report yield. Self-report yields were only measured on a subset of fields (n = 43 out of 78) for which the farmers harvested their own fields (other fields included a full plot harvest by the research team). The 8 m × 8 m crop-cut was randomly located within the field, and then partitioned into four 4 m × 4 m quadrants, with yields measured separately for each quadrant. Although we did not have multiple independent 4 m × 4 m crop-cuts (since they were adjacent to each other), in a prior year of field work [17] we obtained two independent 2 m × 2 m crop-cuts, as well as adjacent 2 m × 2 m crop-cuts. These indicated that the correlation between independent crop-cuts was 0.29 lower than the correlation between adjacent crop-cuts (Table S3). Since the correlation between the adjacent 4 m × 4 m crop-cuts was 0.71, we used an estimate of 0.42 for the correlation between two independent 4 m × 4 m crop-cuts. This value is lower than the median, but within the range shown in literature values (see Figure 1).

The satellite yield estimates in all three studies were based on Sentinel-2 data, which includes several bands with 10 m resolution and others, particularly in the red-edge region, with 20 m resolution. We utilized the best performing model for each study.
Figure 1. [...] Table S1. For household assets, we used data from DHS for 23 countries, with 1-4 years per country. For consumption estimates, we used data from LSMS for three countries, with two years per country. The mean number of fields for crop-cut studies was 145. The mean number of clusters for asset surveys was 462, with an average of 25 households per cluster. The mean number of clusters for consumption surveys was 161, with a mean of 10 households per cluster.

Application to Household Consumption
For six country-year datasets (two years of data in each of three countries), we first randomly split the dataset with 60% of clusters in training and 40% in test. All of the training clusters were pooled to train a single model that predicted mean household consumption based on nighttime lights (NL). Specifically, the 2016 NL values from NASA's Black Marble 500 m resolution product (available at https://earthobservatory.nasa.gov/features/NightLights, accessed on 11 March 2021) were obtained for a 7 km × 7 km box surrounding the cluster location, and the number of values falling into each of 15 bins was calculated. The NL histograms were then input into a random forest model to predict consumption, similar to prior work [20], which found this approach approximated the performance of more sophisticated models using daytime imagery. For the test clusters, we randomly split the households into two equal-sized subsets and calculated the average consumption for each subset. The predictions of the NL model were then combined with the two subsets to calculate satellite performance using Equation (14). The naïve r² values (i.e., from direct comparison with ground measures without using Equation (14)) between the NL predictions and either the overall average or the average of the first subset were also calculated for comparison.

Correlations between Ground Measures
Figure 1 presents a quantitative summary of one straightforward measure of noise in typical ground data: the correlation between two independent ground measures of the same outcome. For the case of crop yields, we identified all studies with at least two crop-cuts per field (Table S1) and report the correlation across all sampled fields of the first and second crop-cut (or the mean correlation if there are more than two crop-cuts).
For household wealth and consumption, we used public datasets (see Section 2) and for each survey enumeration area (or "cluster", roughly equivalent to a village in rural areas) we split surveyed households into two random subsets of equal size, computed the average value for each subset, and then reported the correlation across clusters between the two subcluster averages.
Overall, both crop yields and household consumption exhibit considerable noise, with correlations commonly below 0.7. This indicates that in a typical setting, less than half of the variation in a measured outcome can be explained by an independent measure of that exact same outcome, at the same location (i.e., field or village), using the exact same instrument (i.e., crop-cut or household survey). Notably, measures of household assets appear more robust, perhaps reflecting the larger number of households typically surveyed in a cluster for DHS (mean of 25) compared to LSMS surveys (mean of 10), the fact that asset ownership is easier to recall and verify, or some combination of these and other factors. Given the higher correlations for asset measures, we focus below on correcting performance measures for yields and household consumption and leave aside the issue of correcting performance measures for household assets.

Simulations Illustrate the Value of Two Ground Measures
We begin by testing the validity of Equation (14) for different sample sizes using a series of simple simulations. In particular, the derivation of Equation (14) rested on the assumption that the covariance between two noisy measures of yield is exactly σ_Y², which is true in the limit since the covariance between independent realizations of noise will be zero. However, for finite samples of noise (ε_S, ε_G1, ε_G2), these covariances will generally be slightly larger or smaller than zero, leading to a question of how accurate Equation (14) will be under typical conditions.
To address this question, we turn to a set of simple simulations as described in the Methods. Figure 2a compares the true r²(S,Y) with both r²(S,G1) (e.g., the training r² for a satellite model trained on a noisy ground measure) and the estimated r²(S,Y) based on Equation (14). This initial plot pertains to a large number of locations (N = 1000) to illustrate two key points. First, the training r² is always less than the true r², and can be well below 50% of the true value. Second, Equation (14) results in an unbiased estimate of the true r² across a range of parameter combinations, with most points falling very close to the 1:1 line. Equation (14) remains valid even if one assumes that satellite estimates are subject to Berkson rather than classical measurement error, with values that are smoother (i.e., have less variance) than the true values (Figure S1).
Figure 2. [...] Equation (14) to correct the measure of satellite performance for noise in ground data. (left) Training r² (gray points) and corrected r² (blue points, using Equation (14)) for a linear regression model that predicts ground-measured yields with satellite-measured yields for 1000 fields, plotted against the "true" r² between satellite-measured and true yields. True yields and the noisy ground- and satellite-based measures of yields were simulated from Equations (4)-(8), (11) and (12), with each point representing a single set of simulation parameters, and simulations then repeated for different combinations of parameter values to span a wide range of r². Lines show a locally-weighted polynomial (LOWESS) fit to the points. Parameters ranged from 1 to 3 t/ha for σ_Y, and 0.5 to 3 t/ha for σ_S, σ_G1, and σ_G2. The value of µ was fixed at 5 t/ha, since varying this value does not affect the resulting r². Overall, training r² underestimates the true r², whereas Equation (14) results in an unbiased estimate of the true r². (right) The difference between the true and corrected r² (from Equation (14)) plotted against the number of fields simulated. The mean difference (i.e., bias) remains small when a small number of fields are simulated, but the median absolute error rises sharply when fewer than 100 fields are measured.
We then implement the simulations using a smaller number of locations (N ranging from 20 to 1000) to evaluate how the performance of Equation (14) varies as the number of fields available to calculate the relevant terms declines. The results reveal that while the mean error across multiple simulations remains close to zero, the median absolute error is higher for smaller sample sizes (Figure 2b). The median absolute error reaches as high as 0.10 for a sample of 20 locations, but is below 0.05 for sample sizes above ~100. These simulations indicate that Equation (14) will still perform well on average when applied to datasets with fewer than 100 fields, but that the variance of the performance will increase as the number of fields decreases.
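The dependence on sample size can be explored with a sketch like the one below, which repeats the classical-error simulation for several values of N and records the median absolute error of the corrected estimate. The uniform sampling of parameters, the interpretation of the satellite and ground parameters as noise standard deviations, and the use of 500 repetitions per sample size are assumptions made here to keep the example short.

```python
import numpy as np

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def corrected_minus_true(n_fields, rng):
    """Error of Equation (14) relative to the (simulated) true r^2 for one dataset."""
    sd_y = rng.uniform(1.0, 3.0)
    sd_s, sd_g1, sd_g2 = rng.uniform(0.5, 3.0, size=3)
    y = rng.normal(5.0, sd_y, n_fields)
    s = y + rng.normal(0.0, sd_s, n_fields)
    g1 = y + rng.normal(0.0, sd_g1, n_fields)
    g2 = y + rng.normal(0.0, sd_g2, n_fields)
    return corr(s, g1) * corr(s, g2) / corr(g1, g2) - corr(s, y) ** 2

rng = np.random.default_rng(3)
median_abs_error = {}
for n_fields in (20, 50, 100, 500, 1000):
    errors = [corrected_minus_true(n_fields, rng) for _ in range(500)]
    median_abs_error[n_fields] = float(np.median(np.abs(errors)))

for n_fields, mae in median_abs_error.items():
    print(f"N = {n_fields:4d}: median absolute error = {mae:.3f}")
```

The median is used here because, with very few fields, the sample correlation between the two ground measures can fall close to zero, and the resulting ratio can then take extreme values that dominate a mean.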

Correcting for Ground Noise Significantly Improves Performance Measures for Crop Yields
To illustrate the application of Equation (14) for crop yields, we reconsider some previously published remote sensing studies where more than one ground measure was available (Figure 3). Specifically, we compare the agreement of satellite yield estimates with each individual ground measure, and then apply Equation (14) to estimate a "corrected" r². We also compute a 5-95% confidence interval for each measure of r² by resampling the data with replacement 100 times, calculating the r² values for each sample, and taking the 5th and 95th highest values out of the 100 bootstrap samples.
Figure 3. Naïve r² understates true performance for satellite crop yield estimation. Comparison of r² between satellite-based yields and two different ground-based measures of yields, as well as a corrected r² using Equation (14). The confidence intervals (gray bars) were calculated by resampling the fields and recalculating the different r² values. The bottom panel indicates the location, crop, and type of ground measures used. Uganda data are from [11], Mali data from [21], and Nepal data from [22].
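A bootstrap interval of the kind described above might be computed as in the following Python sketch. The function names and the synthetic data are hypothetical, and `np.percentile` is used as a stand-in for picking ranked values out of the 100 resamples, so the bounds may differ slightly from a rank-based rule.

```python
import numpy as np

def corrected_r2(s, g1, g2):
    """Equation (14): r(S,G1) * r(S,G2) / r(G1,G2)."""
    c = np.corrcoef([s, g1, g2])
    return c[0, 1] * c[0, 2] / c[1, 2]

def bootstrap_interval(s, g1, g2, rng, n_boot=100):
    """5-95% interval for the corrected r^2, resampling fields with replacement."""
    s, g1, g2 = (np.asarray(a, dtype=float) for a in (s, g1, g2))
    n = len(s)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # same fields drawn for all three measures
        estimates[b] = corrected_r2(s[idx], g1[idx], g2[idx])
    low, high = np.percentile(estimates, [5, 95])
    return float(low), float(high)

# Synthetic example with 150 hypothetical fields.
rng = np.random.default_rng(4)
y = rng.normal(5.0, 2.0, 150)
s = y + rng.normal(0.0, 1.0, 150)
g1 = y + rng.normal(0.0, 1.5, 150)
g2 = y + rng.normal(0.0, 1.5, 150)
low, high = bootstrap_interval(s, g1, g2, rng)
print(f"corrected r2 = {corrected_r2(s, g1, g2):.2f}, 5-95% interval = ({low:.2f}, {high:.2f})")
```

Resampling whole fields, rather than each measure separately, keeps the three values observed on a field together, so that the correlations are recomputed on consistent samples.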
In a study of pure stand maize fields in Uganda for 2016 [11], three different ground-based measures were collected: self-report yields, 8 × 8 m crop-cuts, and 4 × 4 m crop-cuts. When combining self-report yields with either of the crop-cuts, the corrected r² using Equation (14) is above 0.8, more than twice the uncorrected values, albeit with wide confidence intervals. When using two 4 × 4 m crop-cuts as the two ground measures, the corrected r² is lower at 0.52. In this study, we also obtained a full plot harvest for a subset of fields [11], which if treated as the truth provides a direct estimate of r²(S,Y) of 0.56. Thus, the corrected r² for the 4 × 4 m crop-cuts was very similar to the direct estimate of the true value. The corrected r² values when using self-report were also consistent with the true value but exhibited wide error bars owing to the smaller number of fields with both self-report and crop-cut data.
In a study of 557 sorghum fields in Mali [21], satellite-based yields exhibited an r² of 0.07 and 0.20 with self-report and crop-cut yields, respectively. In contrast, the corrected r² was considerably higher at 0.37, because of the low observed correlation between self-report and crop-cuts (r = 0.32). In a study of 147 wheat fields in Nepal, two separate 5 × 5 m crop-cuts were obtained in each field, with a high correlation of 0.93 between them despite the fact that they were reportedly independent samples. This could reflect the fact that these wheat fields were irrigated, which tends to increase the homogeneity of yield outcomes. As a result of the higher correlation between crop-cuts, the corrected r² of 0.45 was only slightly higher than the training r² for the individual crop-cuts (of 0.41 and 0.43).
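The corrected values for Mali and Nepal can be reproduced directly from the reported summary statistics. The short Python sketch below assumes that the correlations between satellite and ground yields are positive, so that each r can be recovered as the square root of the reported r²; the function name is hypothetical.

```python
import math

def corrected_r2_from_summary(r2_s_g1, r2_s_g2, r_g1_g2):
    """Equation (14) applied to reported r^2 values and the ground-ground correlation."""
    # Assumes positive satellite-ground correlations, so r = sqrt(r^2).
    return math.sqrt(r2_s_g1) * math.sqrt(r2_s_g2) / r_g1_g2

# Mali sorghum: r^2 of 0.07 (self-report) and 0.20 (crop-cut), r(G1, G2) = 0.32.
mali = corrected_r2_from_summary(0.07, 0.20, 0.32)

# Nepal wheat: r^2 of 0.41 and 0.43 for the two crop-cuts, r(G1, G2) = 0.93.
nepal = corrected_r2_from_summary(0.41, 0.43, 0.93)

print(f"Mali corrected r2:  {mali:.2f}")   # 0.37
print(f"Nepal corrected r2: {nepal:.2f}")  # 0.45
```

The contrast between the two cases shows how the size of the correction is governed by the denominator: a ground-to-ground correlation of 0.32 inflates the product of correlations roughly threefold, whereas a correlation of 0.93 leaves it nearly unchanged.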

Correcting for Ground Noise Significantly Improves Performance Measures for Household Consumption
As with crop yields, household consumption expenditures are difficult to measure with traditional ground-based instruments, resulting in low correlations between two independent measures (Figure 1). For six country-year combinations, we trained a single model to estimate consumption based on the distribution of satellite nighttime lights (NL) (see Section 2). Although more complicated models using high-resolution daytime imagery are possible [1,23], a model using NL distribution has been shown to perform nearly as well as the best models, and serves the purpose here of providing a credible satellite-based estimate. The model was trained using 60% of data from each country-year, with the remaining 40% used to test the model.
Performance ($r^2$) on the held-out test data was estimated using the naïve comparison, where all households in a cluster were used to estimate average consumption, as well as using Equation (14) with $G_1$ defined as the average of a random half of the households and $G_2$ as the average of the other half (Figure 4). For comparison, we also show how the naïve $r^2$ changes if using only half of the households ($G_1$). Consistent with the notion that sampling more households helps to reduce noise in the ground-based estimate, the naïve $r^2$ was higher in all six cases when using the average of all households rather than half of the households. However, the corrected $r^2$, which explicitly accounts for noise in the ground-based measures, was higher still, especially in Tanzania and Nigeria. Corrected $r^2$ in Ethiopia remained low, mainly because nightlights appear to explain very little variation in ground-measured household consumption (the numerator in Equation (14) is close to zero).
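The split-half procedure can be illustrated with simulated data. The following is a minimal sketch, not the code used in this study: it assumes Equation (14) has the form r(S,G1)·r(S,G2)/r(G1,G2), draws hypothetical clusters with purely classical (independent, additive) household-level noise, and uses illustrative names (`pearson`, `split_half_r2`) and parameter values of our own choosing.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def split_half_r2(sat, households, rng):
    """Return (naive r2 with all households, naive r2 with half, corrected r2)."""
    g_all, g1, g2 = [], [], []
    for hh in households:
        shuffled = list(hh)
        rng.shuffle(shuffled)                 # random split within the cluster
        half = len(shuffled) // 2
        g1.append(sum(shuffled[:half]) / half)
        g2.append(sum(shuffled[half:]) / (len(shuffled) - half))
        g_all.append(sum(hh) / len(hh))
    naive_all = pearson(sat, g_all) ** 2
    naive_half = pearson(sat, g1) ** 2
    # Assumed form of Equation (14).
    corrected = pearson(sat, g1) * pearson(sat, g2) / pearson(g1, g2)
    return naive_all, naive_half, corrected

# Hypothetical data: 5000 clusters with 10 households each.
rng = random.Random(0)
sat, households = [], []
for _ in range(5000):
    y = rng.gauss(0, 1)                       # true cluster-level consumption
    sat.append(y + rng.gauss(0, 1))           # satellite estimate; true r2(S, Y) = 0.5
    households.append([y + rng.gauss(0, 3) for _ in range(10)])  # noisy reports

naive_all, naive_half, corrected = split_half_r2(sat, households, rng)
print(f"naive (all): {naive_all:.2f}  naive (half): {naive_half:.2f}  "
      f"corrected: {corrected:.2f}")
```

In this simulated setting both naive values should fall well below the true value of 0.5, with the half-sample value lowest, while the corrected value should land close to 0.5.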

On average, the corrected $r^2$ was 0.2 points higher than the naïve $r^2$ for consumption expenditures across the six cases, with a median difference of 0.16. Interestingly, this difference is similar in magnitude to the difference in many countries between satellite performance for predicting assets vs. consumption [1,20]. Thus, although on the surface it appears that assets are "easier" than consumption to predict from satellite, it may be that much of the difference in fact arises because consumption is harder to measure than assets on the ground.

Correcting Satellite Performance Measures in the Absence of Two Ground-Based Estimates
One main argument of this paper is that researchers should strive to obtain two independent ground-based measures of an outcome of interest in order to properly assess the performance of satellite-based measures. In the case of household surveys, this implies splitting households within a location into two groups rather than combining them into one. In the case of crop yields, this implies collecting two small crop-cuts rather than one large one. Yet we recognize that for various reasons this may not be feasible in some situations and, therefore, offer a few points for proceeding with only a single ground measure. First and foremost, the $r^2$ between satellite and ground measures should be reported with an emphasis on the downward bias that results from noise in the ground measure.
Second, one can still use Equation (14) with an estimate of how well a second measure of the same type would correlate with the first (as in the example of the 4 × 4 m crop-cuts in Uganda). The correlations in Figure 1 can provide some indication of plausible values for these correlations.
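As a rough illustration of this second option, the sketch below scans a range of plausible ground-ground correlations. It assumes Equation (14) has the form r(S,G1)·r(S,G2)/r(G1,G2); if the hypothetical second measure is of the same type as the first, so that r(S,G2) equals r(S,G1), this reduces to the naive squared correlation divided by r(G1,G2). The function name, the naive value of 0.25, and the candidate correlations are hypothetical.

```python
def corrected_r2_single(naive_r2, assumed_r_gg):
    """Corrected r2 from one ground measure and an assumed ground-ground correlation."""
    if not 0 < assumed_r_gg <= 1:
        raise ValueError("assumed correlation must be in (0, 1]")
    # Capping at 1 is our choice: a too-low assumed correlation (or sampling
    # noise) can otherwise push the ratio above the feasible range.
    return min(naive_r2 / assumed_r_gg, 1.0)

naive_r2 = 0.25  # hypothetical naive r2 between satellite and a single crop-cut
for r_gg in (0.3, 0.5, 0.7, 0.9):
    value = corrected_r2_single(naive_r2, r_gg)
    print(f"assumed r(G1, G2) = {r_gg:.1f} -> corrected r2 = {value:.2f}")
```

Reporting the corrected value over such a range, rather than at a single assumed correlation, makes the dependence on the assumption explicit.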
Third, in situations where errors in ground-based measures are due partly to spatial heterogeneity, as in the case of yield crop-cuts, one can use a map of satellite-based estimates themselves to assess the likely correlation between two independent ground-based measures. This approach assumes that (i) imperfect correlation between two crop-cuts arises primarily from in-field yield heterogeneity rather than measurement error for the crop-cut area itself, and (ii) satellite-based yields can adequately characterize in-field spatial heterogeneity. For the latter condition, it is likely important that the resolution of the satellite data be close to the size of the crop-cut area.
To test this approach, we return to the Uganda and Mali examples above, where we have one 8 m × 8 m crop-cut as well as yield maps based on the 10 m × 10 m Sentinel-2 imagery. We randomly sampled two pixels from within each field and calculated the correlation between the two samples across all fields. The corrected $r^2$ was then calculated as:
$$r^2(S,Y) = \frac{r(S,G_1)^2}{r(S_1,S_2)} \qquad (17)$$
where $S$ is the satellite-based yield estimate for the entire field, $G_1$ is the crop-cut yield, and $S_1$ and $S_2$ are the satellite estimates for the two sampled pixels. This calculation was repeated 100 times, each time taking a new sample of fields and of pixels within the fields, to estimate a confidence interval on the corrected $r^2$. The median correlation between two independently sampled pixels in each field was 0.38 for maize in Uganda and 0.70 for sorghum in Mali. The corrected $r^2$ that results from combining these correlations with the single crop-cut (Equation (17)) agreed well with the corrected $r^2$ from using the combination of self-report and crop-cut (Equation (14); Figure 5). Thus, at least for these two examples, it appears that using satellite measures of in-field heterogeneity can be a useful substitute for two independent ground measures of yield.
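The pixel-sampling procedure can be sketched as follows. This is an illustration on simulated fields, not the code used in this study. It assumes the corrected value is computed as r(S,G1)² / r(S1,S2), and it constructs hypothetical fields in which satellite pixels share the in-field covariance structure of true yields, so that the method's assumptions hold by design. All names and parameter values are ours.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pixel_corrected_r2(field_sat, cropcut, pixel_sat, rng):
    """Assumed form: r(S, G1)^2 / r(S1, S2), with two random pixels per field."""
    pairs = [rng.sample(px, 2) for px in pixel_sat]
    s1 = [a for a, _ in pairs]
    s2 = [b for _, b in pairs]
    return pearson(field_sat, cropcut) ** 2 / pearson(s1, s2)

def bootstrap_corrected_r2(field_sat, cropcut, pixel_sat, rng, reps=100):
    """Resample fields (and pixels within them); return sorted corrected r2 values."""
    n = len(field_sat)
    estimates = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]   # fields drawn with replacement
        estimates.append(pixel_corrected_r2(
            [field_sat[i] for i in idx],
            [cropcut[i] for i in idx],
            [pixel_sat[i] for i in idx],
            rng,
        ))
    return sorted(estimates)

# Hypothetical fields: true r2(S, Y) = rho^2 = 0.5 by construction.
rng = random.Random(1)
rho, sd_within, n_pixels = math.sqrt(0.5), 1.0, 20
field_sat, cropcut, pixel_sat = [], [], []
for _ in range(3000):
    m, k = rng.gauss(0, 1), rng.gauss(0, 1)
    true_px = [m + rng.gauss(0, sd_within) for _ in range(n_pixels)]
    other_px = [k + rng.gauss(0, sd_within) for _ in range(n_pixels)]
    sat_px = [rho * t + math.sqrt(1 - rho ** 2) * o
              for t, o in zip(true_px, other_px)]
    pixel_sat.append(sat_px)
    field_sat.append(sum(sat_px) / n_pixels)   # S: field-level satellite estimate
    cropcut.append(rng.choice(true_px))        # G1: one crop-cut at a random pixel

est = bootstrap_corrected_r2(field_sat, cropcut, pixel_sat, rng)
naive = pearson(field_sat, cropcut) ** 2
median = est[len(est) // 2]
# With 100 replicates, the 3rd smallest and 3rd largest values approximate
# a 95% interval.
print(f"naive r2: {naive:.2f}")
print(f"corrected r2: median {median:.2f}, interval {est[2]:.2f} to {est[-3]:.2f}")
```

Because the crop-cut pixel also contributes to the field mean, the corrected value in this simulation is expected to sit slightly above the true 0.5, while the naive value falls well below it.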
Figure 5. Satellite measures of in-field heterogeneity can substitute for a second crop-cut. Comparison of naïve $r^2$ for a regression of satellite-based yields on crop-cut yields ($G_1$), the corrected $r^2$ using self-report yields and Equation (14) as in Figure 3 (method 1), and corrected $r^2$ using satellite-based estimates of the correlation between two crop-cuts and Equation (15) (method 2). The confidence intervals (gray bars) were calculated by resampling the fields and recalculating the different $r^2$ values.

Discussion
The notion that having two noisy ground-based measures allows one to recover the true correlation of S with an outcome is perhaps counter-intuitive. Indeed, to our knowledge, studies that have taken multiple ground measures typically combine these into an average before comparing with satellite measures. However, the notion that three noisy but independent measures of a quantity can be used to infer the true values is well established in other fields, such as the triple collocation methods used for remote sensing of wind speed or soil moisture [24,25]. Whereas those approaches are focused on estimating the true value at locations with three noisy observations, here we are concerned with the related problem of measuring the correlation of one easily scalable measurement (e.g., satellite measures) with the true unobserved values, by employing two additional independent measures at a small number of locations.
Several remaining issues deserve future attention. First, while our correction approach assumes the availability of two independent ground measures, many situations are likely to arise when the ground measures are not perfectly independent. Self-report yields, for instance, are prone to many sources of non-classical measurement error including mistaken beliefs [26] that could be correlated with errors in the crop-cut yield, or could themselves be affected by farmer awareness of crop-cut values. Similarly, errors in measuring household expenditures could be correlated across households in the same cluster, for instance if households in a certain region have common incentives to over- or under-report. We note that, to the extent these measurement errors are positively correlated, as is plausibly the case for both yields and consumption, our corrected performance estimates from Equation (14) will understate true performance. Future work should probe the likely bias from Equation (14) in the presence of non-independent ground measures, as well as potential remedies in these situations.
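The direction of this bias can be seen in a small simulation. The sketch below is illustrative only: it assumes Equation (14) has the form r(S,G1)·r(S,G2)/r(G1,G2), uses additive unit-variance errors on both ground measures, and introduces a hypothetical parameter `error_corr` for the correlation between those errors.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def corrected_r2_with_correlated_errors(error_corr, n=20000, seed=0):
    """Corrected r2 when the two ground errors have correlation error_corr (0 to 1)."""
    rng = random.Random(seed)
    w_shared, w_own = math.sqrt(error_corr), math.sqrt(1 - error_corr)
    s, g1, g2 = [], [], []
    for _ in range(n):
        y = rng.gauss(0, 1)                    # true outcome
        shared = rng.gauss(0, 1)               # error component common to G1 and G2
        s.append(y + rng.gauss(0, 1))          # satellite; true r2(S, Y) = 0.5
        g1.append(y + w_shared * shared + w_own * rng.gauss(0, 1))
        g2.append(y + w_shared * shared + w_own * rng.gauss(0, 1))
    return pearson(s, g1) * pearson(s, g2) / pearson(g1, g2)

for c in (0.0, 0.25, 0.5):
    print(f"error correlation {c:.2f} -> corrected r2 = "
          f"{corrected_r2_with_correlated_errors(c):.2f}")
```

In this particular setup the corrected value converges to 0.5/(1 + c) for error correlation c, so it recovers the true 0.5 only when the errors are independent and falls progressively below it as c increases.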
Second, while our focus here is on characterizing the overall performance of satellite measures, future work could attempt to estimate (and correct for) the error for individual observations. This could be achieved, for example, by using triple collocation methods for locations where two ground measures are available, or by identifying covariates that are predictive of deviations between satellite and ground measures.
Third, there are many development-relevant variables that we did not consider in the current study, such as population density or the existence of informal settlements [4]. More work is needed to establish approaches to quantify the degree of noise in ground data for these variables, characterize the magnitude of this noise, and re-evaluate the performance of satellite measures in light of these errors.

Conclusions
With increased availability of satellite remote sensing data and growing interest in lowering the cost and improving the accuracy of economic statistics, we anticipate continued growth in studies that use satellite data to characterize outcomes such as crop yields and household wealth. Based on the results presented here, we offer a few key points for future work.
First, be wary of using unadjusted correlations between satellite and ground measures to evaluate performance, given that these can substantially underestimate true performance. Second, whenever possible, try to obtain at least two independent ground-based measures of an outcome. For outcomes measured in household surveys, this can be readily achieved by splitting households into two independent groups within each location (i.e., cluster). For outcomes measured by other means, such as in the case of crop-cuts for yield estimation, we recommend prioritizing two measures for each field, even if they are fairly small. For example, two 2 m × 2 m crop-cuts would be more valuable than one 4 m × 4 m crop-cut.
Third, if only one ground measure is available, adjustments can still be made, either by using a range of values for noise from the literature or by using satellite-based measures of correlations between pixels sampled from the same location (i.e., within fields as in Figure 5). If only one ground measure is available, we also advise relying less on head-to-head comparisons between satellite and ground-based measures, and more on collecting measures of factors that are likely to influence the outcome, such as fertilizer use or soil quality in the case of crop yields. One can then perform regressions that examine the outcome response to these factors, with attention to whether regression coefficients have similar or better precision when using satellite-based rather than ground-based measures [10,11,17]. In these situations, the ability to uncover response coefficients is a more useful indicator of the accuracy of satellite-based outcomes than a direct comparison to a noisy ground-based measure [4].
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/rs13163160/s1, Figure S1: Simulations with Berkson error, Figure S2: Crop-cut correlations by crop, Table S1: Crop-cut correlations by study, Table S2: Self-report correlations with crop-cut, Table S3: Uganda correlations for adjacent and random crop-cut locations.
Data Availability Statement: All data used in this study are available from the authors upon request.