Sight for Sorghums: Comparisons of Satellite- and Ground-Based Sorghum Yield Estimates in Mali

: The advent of multiple satellite systems capable of resolving smallholder agricultural plots raises possibilities for signiﬁcant advances in measuring and understanding agricultural productivity in smallholder systems. However, since only imperfect yield data are typically available for model training and validation, assessing the accuracy of satellite-based estimates remains a central challenge. Leveraging a survey experiment in Mali, this study uses plot-level sorghum yield estimates, based on farmer reporting and crop cutting, to construct and evaluate estimates from three satellite-based sensors. Consistent with prior work, the analysis indicates low correlation between the ground-based yield measures (r = 0.33). Satellite greenness, as measured by the growing season peak value of the green chlorophyll vegetation index from Sentinel-2, correlates much more strongly with crop cut (r = 0.48) than with self-reported (r = 0.22) yields. Given the inevitable limitations of ground-based measures, the paper reports the results from the regressions of self-reported, crop cut, and (crop cut-calibrated) satellite sorghum yields. The regression covariates explain more than twice as much variation in calibrated satellite yields (R 2 = 0.25) compared to self-reported or crop cut yields, suggesting that a satellite-based approach anchored in crop cuts can be used to track sorghum yields as well or perhaps better than traditional measures. Finally, the paper gauges the sensitivity of yield predictions to the use of Sentinel-2 versus higher-resolution imagery from Planetscope and DigitalGlobe. All three sensors exhibit similar performance, suggesting little gains from ﬁner resolutions in this system.


Introduction
Agriculture remains a key contributor to national employment and economic output in many countries throughout the world. Improving agricultural productivity is thus a key engine for both increasing local food security as well as spurring overall economic growth [1]. Finding policy solutions that can improve performance depends on understanding the drivers of agricultural productivity, which in turn depends on the ability to accurately measure productivity.
A widely used measure of cropland productivity is crop yield, defined as the weight of edible product (e.g., kilogram of grain) produced per unit area (e.g., hectare). Although yield alone cannot capture all aspects of household income or well-being, it is a key determinant of the profitability of Lambert et al. [12] study in Mali measured yields on 27 total sorghum plots, whereas the current study covers 575 plots.
Third, we compare estimates from publicly available Sentinel-2 imagery (10 m spatial resolution) with estimates from very high resolution (VHR) Planetscope (~3 m resolution) and DigitalGlobe multispectral imagery (~1-2 m resolution) to assess the added value of VHR for this application. Given the significantly higher costs of acquisition of VHR imagery, it is informative for budget-constrained researchers and development organizations to assess the relationship, if any, between image cost and accuracy of agricultural productivity measures.
The following section describes the data used in this study. Section 3 then describes the methods used to evaluate the quality of both the ground-based and satellite-based yield estimates. Section 4 then discusses the results, while Section 5 summarizes the study's conclusions.

Survey and Crop Cut Data
Our study region was the Dioïla Cercle, an administrative subdivision in the southeastern part of the Koulikoro region of Mali ( Figure 1A). This area is within the primary sorghum-producing area in Mali and has been the locus of many activities by the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT). Key to our research is a farm survey that was implemented during the 2017 agricultural season. The survey fieldwork was conducted from August 2017 to February 2018 by ICRISAT-Mali, under the supervision of and technical assistance from the World Bank Living Standards Measurement Study (LSMS) team. Within Dioïla, four 10 × 10 km blocks were identified to ensure heterogeneity of topographic relief ( Figure 1B). In each block, the complete list of villages was constructed based on the shapefiles from the 2009 Population and Housing Census. Overall, there were 17 villages across all blocks. In each village, a complete listing of households was carried out, as part of which the sorghum-producing households with at least one purestand plot were identified. Given the low incidence of intercropped sorghum plots in Dioïla (estimated at 6% according to the 2017 LSMS-ISA survey), we elected to exclusively sample purestand plots. Subsequently, we randomly selected 150 of these households in each block. The across-village allocation of the sampled households in each block was proportional to the village-level total count of households with purestand sorghum plots.
Each sampled household was in turn visited three times. The first visit's fieldwork spanned the period of mid-September-October 2017. During the first visit, the post-planting questionnaire collected detailed information on household composition and demographics, housing conditions, ownership of consumer durables and agricultural implements, receipt of agricultural extension services, and farm organization. The questionnaire modules regarding the latter collected farmer-reported information for (i) all parcels owned and/or cultivated by the household and (ii) all plots cultivated with sorghum within these parcels, on a variety of topics, including parcel and plot areas, parcel ownership and tenure, plot management and farming practices, plot-level labor inputs for land preparation and planting, and plot-level sorghum cultivation and varietal attributes.
One purestand sorghum plot was in turn selected at random in each household. The area and boundaries for the selected sorghum plot were captured via a handheld Garmin eTrex 30 GPS unit with the Wide Area Augmentation System enabled for higher accuracy. An 8 × 8 m crop cut sub-plot was then established on each plot for the actual processing and weighing the crop harvest at the end of the season. The latitude and longitude of the edges of each sub-plot were recorded by Android tablets, and each sub-plot was sub-divided into four 4 × 4 m quadrants, for which the crop harvest was processed and weighed separately. The approach of a random placement of the crop cut sub-plots, supervision of the crop cut sub-plots throughout the season, and harvesting, processing, and tracking of sorghum cultivated in each quadrant were all identical to the approach in an earlier methodological study focused on maize in Eastern Uganda [6] (please see [6] and their Appendix D for more details). out, as part of which the sorghum-producing households with at least one purestand plot were identified. Given the low incidence of intercropped sorghum plots in Dioïla (estimated at 6% according to the 2017 LSMS-ISA survey), we elected to exclusively sample purestand plots. Subsequently, we randomly selected 150 of these households in each block. The across-village allocation of the sampled households in each block was proportional to the village-level total count of households with purestand sorghum plots.  The second visit's fieldwork was conducted during the harvest season, over the period of November-December 2017. During the second visit, the harvest from each crop cut quadrant was barcoded and processed for drying, which was done at a centralized location in each village, under the supervision of the resident enumerator. Finally, the third visit fieldwork spanned the period of late-January-February 2018. During the third visit, the dried harvest associated with each crop cut quadrant was weighed, and the households were administered a post-harvest questionnaire that elicited farmer-reported information for all plots cultivated with sorghum regarding (i) sorghum production (allowing for the use of non-standard measurement units for the quantification of production), (ii) use of non-labor inputs, including organic and inorganic fertilizers, as well as pesticides, and (iii) household and hired labor inputs for weeding, input application, and crop harvest.
The final dataset consisted of household-and plot-level information augmented with self-reported and GPS-based plot areas, self-reported and crop cut sorghum harvest weights, and plot boundary polygons. GPS plot boundaries were processed to remove any self-intersections and to erase any areas of overlap between neighboring plot polygons. Crop cut yields were calculated by dividing the total grain weight of sorghum by the 64 m 2 designated crop cut area. Grain moisture was deemed to be uniformly low, given the arid climate of the region, and was not measured or adjusted for in the crop cut yield estimates. A total of 25 fields were omitted from subsequent analysis, either because of missing crop cut yields or self-intersecting GPS boundaries, leaving a final sample size of 575 plots.
Self-report yields were calculated by first converting each survey answer to kilograms (kg), since responses were provided in varying non-standard measurement units such as a "sheaf". To obtain kg-equivalent production measures, village-level median conversion factors were computed based on the conversion factors that were provided by the respondents for each non-standard unit during the third interview that solicited self-reported information on agricultural production. (We compared our conversion factors to those that were obtained from a pilot study that was conducted by the Ministry Remote Sens. 2020, 12, 100 5 of 16 of Agriculture (MoA), with support from the World Bank LSMS-ISA. The conversion factors matched across the two studies, even though "bunches" exhibited a large variance. The scaled-up version of the MoA conversion factor survey is slotted for implementation starting in December 2019.) Further, if the reported harvest condition was unshelled, a shelling factor was applied. We computed the conversion factors at the village level in part to reduce potential biases in farmer-reported conversions, while recognizing the possibility of spatial variation in conversion factors based on prior work [17]. In the current study, only 4% of observed variability in self-report yields were explained by the conversion factors used.
The total harvest weights in kg were then divided by the self-reported plot area to obtain self-reported yields. In sensitivity tests, self-reported yields were also calculated using the polygon area, with any notable differences discussed below. The preference for calculating self-reported yields using self-reported area is based on Abay et al. [7], who show that if measurement errors in self-reported production and area are correlated, correcting one (area) can lead to bigger errors in analyzing yields than correcting neither.

Satellite Data
We used Sentinel-2 as the primary source of remote sensing imagery for this study. The Sentinel-2 Multispectral Instrument (MSI) measures 13 spectral bands that collectively span the visible/near infrared and short wave infrared spectral range, from roughly 440-2200 nm. For the current study we used the Sentinel-2 Level 1-C product, which represents top-of-atmosphere reflectance. From these raw bands we calculated several common vegetation indices that have been found to be useful for agricultural monitoring in similar settings [12,15,16].
Specifically, we computed the Green Chlorophyll Vegetation Index (GCVI) [18] at 10 m resolution, the Normalised Difference Vegetation Index (NDVI) [19] at 10 m resolution, and the MERIS Terrestrial Chlorophyll Index (MTCI) [20] at 20 m resolution: where R λ refers to reflectance centered at wavelength λ and B refers to the corresponding Sentinel-2 band. Sentinel-2 has a revisit period of 5 days in this study region. Based on the reported sowing and harvest dates from the household survey, we used the time series from 5 May 2017 through to 31 December 2017, for a total of 87 unique image dates for each plot. For higher-resolution imagery, we considered two alternatives. First, DigitalGlobe (DG) images were acquired by their GeoEye-1 (1.84 m multispectral resolution), WorldView-2 (1.84 m), and WorldView-3 (1.24 m) sensors on a monthly basis from July through December 2017 for the four 10 × 10 km (red) blocks shown in Figure 1. Given the limited swath of DG images, only a fraction of plots (typically less than one-quarter) were covered by any individual image. Removing images with excessive cloud cover resulted in successful acquisition of 18 images, with the dates for each image shown in Table 1. Table 1. Summary of images acquired by DigitalGlobe (DG) sensors in the study region. Region refers to the quadrant shown in Figure 1. Sensor source is shown in parentheses G = GeoEye-1, W2 = WorldView-2, and W3 = WorldView-3.

Region
Digital Globe Imagery Dates Second, approximately 3 m resolution Planetscope images were collected at roughly daily frequency by sensors on the Planet company's "dove" of Planetscope cubesats. The Planet's constellation consists of >100 small cubesats in low-Earth orbit. In contrast to the large size of traditional VHR sensors (e.g., WorldView-3 weighs 2800 kg), Planet doves weigh approximately 5 kg and thus have dramatically lower launch costs. Although the image quality is lower in terms of both sensor signal-to-noise ratio and spatial resolution, the large number of doves allows frequent observations at any point on the Earth's surface. For the current study, access to Planetscope images was provided via Planet's ambassador program. We downloaded all images in the study region with fewer than 5% of clouds, with an average of 52 observations per plot throughout the growing season. All images were converted to top-of-atmosphere reflectance using coefficients provided in the image metadata.

Satellite Data Processing
For each of the 575 plots, Sentinel-2-based vegetation index values were computed for both the entire plot and for the 8 × 8 m crop cut sub-plot on each plot. A common challenge with optical satellite imagery is the prevalence of clouds that obscure or completely block the satellite sensor's view of the land surface. Sentinel-2 has a predefined cloud mask band, but preliminary analysis revealed this mask was unreliable in the region-with many false positives and false negatives. Rather than explicitly flagging cloudy observations, we adopted a recursive curve fitting procedure similar to that implemented in the Timesat software package [21], which is robust to the inclusion of cloudy observations. Specifically, a discrete Fourier transform, also known as a "harmonic regression," was fit to all 87 observations in a pixel between 5 May and 31 December: where f (t) is the value of the reflectance band or vegetation index of interest, t is the date of observation, N is the number of harmonic terms included, and a k , b k , and, c are cosine, sine, and intercept coefficients estimated by the regression, respectively. Past research has shown that harmonic regressions can adequately characterize vegetation phenology, including in agricultural settings [21][22][23]. Here we used the second-order harmonic (N = 2) following [22]. When fit to the raw data, Equation (4) is heavily influenced by cloudy observations, as illustrated in Figure 2 for the time series of GCVI at a representative plot. The raw values shown in black dots display frequent drops in value to near zero, indicating cloud cover on those days. Therefore, an iterative procedure was used whereby the predicted values from Equation (4) were compared with the input value at each date. If the input value was lower, it was replaced with the predicted value, and the resulting time series was then used to refit Equation (4). This process can be repeated until the influence of cloudy observations is minimized. In this case, we found that the 10th iteration approximated the

Calibration of Satellite-Based Yield Model
To convert satellite vegetation indices to yield, we performed a simple regression between the peak value of the harmonic curve (VIpeak) on each plot i, and the associated ground-based yield measure (Y): When crop cut yields were used for Equation (5), the VI was taken from the Sentinel-2 pixel that contained the centroid of the 8 × 8 crop cut sub-plot. Taking an average of pixels around this point was also tried but had little effect on the results. When self-reported yields were used, the VI for each plot was computed as an average of values from all non-tree pixels with at least half of their area falling within the plot boundary. Following [12], we masked out pixels with trees since plots often contain enough trees to appreciably influence the apparent greenness of the plot. In our case, pixels with trees were identified as those with a GCVI harmonic fit that exceeded a value of 1.0 at any point between 1 June and 15 July, which is the beginning of the crop growing season when crop pixels were typically well below this value ( Figure 2). Overall, 34% of the pixels within plots were removed in this manner.
We alternatively tested GCVI, NDVI, and MTCI as potential predictors of yield, with GCVI generally performing best. Models with additional terms, such as the date on which the peak occurred or the fitted value of the harmonic at earlier times in the season, were also tested but did not significantly improve the model and therefore are not presented for the sake of brevity.

Comparison of Survey and Satellite Data
A major difficulty in assessing the quality of satellite yield estimates is the fact that the traditional ground-based measures (i.e., self-reported and crop cut yields) are themselves prone to errors as discussed in the introduction. The most common approach has simply been to treat ground data as "ground truth," and attribute any differences between satellite and ground-estimates to errors in the satellite data. In this case, the calibration R 2 of Equation (5) is a common metric of how well the satellite-based yield model performs. An alternative to directly comparing the two measures is to use a vector of variables (X) that are expected to influence yields (Y), and test whether the regression results in weaker, stronger, or equivalent β coefficients (with expected signs) when using satellitebased yields. This approach avoids dependence on the assumption that the ground-based measures

Calibration of Satellite-Based Yield Model
To convert satellite vegetation indices to yield, we performed a simple regression between the peak value of the harmonic curve (VI peak ) on each plot i, and the associated ground-based yield measure (Y): When crop cut yields were used for Equation (5), the VI was taken from the Sentinel-2 pixel that contained the centroid of the 8 × 8 crop cut sub-plot. Taking an average of pixels around this point was also tried but had little effect on the results. When self-reported yields were used, the VI for each plot was computed as an average of values from all non-tree pixels with at least half of their area falling within the plot boundary. Following [12], we masked out pixels with trees since plots often contain enough trees to appreciably influence the apparent greenness of the plot. In our case, pixels with trees were identified as those with a GCVI harmonic fit that exceeded a value of 1.0 at any point between 1 June and 15 July, which is the beginning of the crop growing season when crop pixels were typically well below this value ( Figure 2). Overall, 34% of the pixels within plots were removed in this manner.
We alternatively tested GCVI, NDVI, and MTCI as potential predictors of yield, with GCVI generally performing best. Models with additional terms, such as the date on which the peak occurred or the fitted value of the harmonic at earlier times in the season, were also tested but did not significantly improve the model and therefore are not presented for the sake of brevity.

Comparison of Survey and Satellite Data
A major difficulty in assessing the quality of satellite yield estimates is the fact that the traditional ground-based measures (i.e., self-reported and crop cut yields) are themselves prone to errors as discussed in the introduction. The most common approach has simply been to treat ground data as "ground truth," and attribute any differences between satellite and ground-estimates to errors in the satellite data. In this case, the calibration R 2 of Equation (5) is a common metric of how well the satellite-based yield model performs. An alternative to directly comparing the two measures is to use a vector of variables (X) that are expected to influence yields (Y), and test whether the regression Remote Sens. 2020, 12, 100 8 of 16 results in weaker, stronger, or equivalent β coefficients (with expected signs) when using satellite-based yields. This approach avoids dependence on the assumption that the ground-based measures are free of error, and was used to evaluate estimates of maize yields in recent studies in Kenya [15] and Uganda [16]. Similar to the prior work, the following (self-reported) variables are included in our vector X: Household size; head of household age, and education level; plot size; plot distance to (i) household, (ii) road, and (iii) market; use of crop rotation and fallowing; frequency of ploughing; quantity of labor, seed, fertilizer, and pesticides used; and rainfall totals for the entire growing season and for the month of August, which is especially important for determining drought stress during flowering. Rainfall was obtained for each 5-day period in the growing season from the CHIRPS dataset [24]. The coefficients and overall explanatory power of the regressions were then compared for models that alternatively used self-reported, crop cut, or satellite-based yields.

Results and Discussion
Summary statistics for survey responses and crop cut yields are presented in Table 2. Overall, farmers reported an average plot size of 1.88 hectares (ha), which is 23% larger than the 1.53 ha average size of the plot polygons measured with GPS. This tendency to overestimate plot area was particularly noteworthy on plots below 2 ha, consistent with the findings in other LSMS-ISA studies where plot sizes were overestimated by farmers on plots below 2 ha in Malawi, Uganda, Tanzania, and Niger [25]. Only 31% of farmers reported using any inorganic fertilizer, with a mean reported rate of just below 20 kg/ha. The majority of farmers reported having clay soils and a plain (flat) terrain. Although fewer than 10% of farmers reported having a steeply sloping plot, 20% reported having had erosion problems.
Both self-reported and crop cut yields exhibited similar mean values of just below 500 kg/ha. However, the median yield was roughly 25% lower for self-reports, consistent with the overestimation of areas by a similar amount. At the same time, self-reports had a much higher upper bound of yields stretching to values above 6000 kg/ha, whereas crop cuts never exceeded 2500 kg/ha. This is striking given that the crop cuts cover a much smaller area (64 m 2 is 0.4% of the mean plot size of 15,303 m 2 ), and because of sub-plot heterogeneity they would be expected to exhibit more extreme values than self-reports that correspond to the plot-level mean yield. A scatter plot between the self-report and crop cut yields on each plot reveals a striking lack of agreement between the two ground-based measures (Figure 3). The overall correlation between the two measures was just 0.33, indicating that one ground-based measure can explain only approximately 11% of variation in the other. The correlation slightly improves to 0.40 when omitting plots with self-report values above 2500 kg/ha ( Figure 3B). The low agreement between ground-based measures is similar to that reported for maize plots in Eastern Uganda, where self-report and crop cut yields had a correlation below 0.30 even after excluding very small plots where self-reports were deemed least reliable [16]. A scatter plot between the self-report and crop cut yields on each plot reveals a striking lack of agreement between the two ground-based measures (Figure 3). The overall correlation between the two measures was just 0.33, indicating that one ground-based measure can explain only approximately 11% of variation in the other. The correlation slightly improves to 0.40 when omitting plots with self-report values above 2500 kg/ha ( Figure 3B). The low agreement between ground-based measures is similar to that reported for maize plots in Eastern Uganda, where self-report and crop cut yields had a correlation below 0.30 even after excluding very small plots where self-reports were deemed least reliable [16]. Moving to the comparison with satellite measures, Figure 4 shows the correlation between the two ground-based yield measures and GCVI for each date of Sentinel-2 imagery, both for the raw GCVI values and the harmonic fit at those dates. We find the highest correlations between GCVI and ground-based yields were observed in mid-to late-September for both self-report and crop cut yields, which is generally the same time that the GCVI curve reaches its maximum (Figure 2). The GCVI correlations with crop cut yields were consistently higher than those with self-report yields, with a peak value of 0.48 for crop cut compared to 0.23 for self-reports ( Figure 4). The self-report correlation improved slightly to 0.27 when removing fields with self-report yields above 2500 kg/ha, but still remained well below the corresponding correlation for crop cuts. This finding suggests that the selfreport yields are the less reliable of the two ground measures.
The decision to use VIpeak in Equation (5) is supported by the observation that the GCVI-yield correlation peaks in late September (Figure 4). Results were similar when using NDVI, with r = 0.48 for GCVI as compared to r = 0.50 for NDVI. MTCI was found to be less suitable to the harmonic fitting because clouds did not systematically reduce MTCI values; nonetheless MTCI performed similarly to GCVI and NDVI for clear images in this setting, with a peak correlation of 0.40 compared to 0.45 for GCVI.
Predictions from Equation (5) using GCVI were significantly correlated with both ground-based measures ( Figure 5, p < 0.01 for slopes in both panels), with better agreement with crop cut than selfreport yields. Although GCVI successfully captured some of the variation in ground-based yield Moving to the comparison with satellite measures, Figure 4 shows the correlation between the two ground-based yield measures and GCVI for each date of Sentinel-2 imagery, both for the raw GCVI values and the harmonic fit at those dates. We find the highest correlations between GCVI and ground-based yields were observed in mid-to late-September for both self-report and crop cut yields, which is generally the same time that the GCVI curve reaches its maximum (Figure 2). The GCVI correlations with crop cut yields were consistently higher than those with self-report yields, with a peak value of 0.48 for crop cut compared to 0.23 for self-reports ( Figure 4). The self-report correlation improved slightly to 0.27 when removing fields with self-report yields above 2500 kg/ha, but still remained well below the corresponding correlation for crop cuts. This finding suggests that the self-report yields are the less reliable of the two ground measures.
The decision to use VI peak in Equation (5) is supported by the observation that the GCVI-yield correlation peaks in late September (Figure 4). Results were similar when using NDVI, with r = 0.48 for GCVI as compared to r = 0.50 for NDVI. MTCI was found to be less suitable to the harmonic fitting because clouds did not systematically reduce MTCI values; nonetheless MTCI performed similarly to GCVI and NDVI for clear images in this setting, with a peak correlation of 0.40 compared to 0.45 for GCVI.  Both satellite-based models used peak GCVI from the 10th harmonic regression, as illustrated in Figure 2. In general, GCVI was more strongly correlated with crop cut than with selfreported yields.
As a separate assessment, we performed the regression in Equation (6) using three alternative measures of yield: self-report, crop cut, and satellite-based estimates using the model calibrated to crop cuts. For predicting field-scale yields with GCVI, we use the mean value of GCVI for the entire plot (again excluding pixels with trees), rather than for the crop cut sub-plot used in the calibration step. Table 3 summarizes the coefficients and explanatory power of each model. All models were statistically significant, although less than one-quarter of total variation in measured yields was captured in all cases. The models' explanatory power was likely limited by a lack of objective soil measures in the current study, especially given overall low levels of input use in the region ( Table 2). Consistent with a high importance of soil, all models showed a statistically significant (p < 0.01) negative association between yields and whether the plot had been fallowed in the past decade (a likely indication of poor soil quality). Predictions from Equation (5) using GCVI were significantly correlated with both ground-based measures ( Figure 5, p < 0.01 for slopes in both panels), with better agreement with crop cut than self-report yields. Although GCVI successfully captured some of the variation in ground-based yield measures, well over half of the variation was not captured. At least in part, these discrepancies arise because of noise in the ground-based measures, namely reporting errors in the case of self-reports or the effects of spatial heterogeneity in the case of crop cuts.   Both satellite-based models used peak GCVI from the 10th harmonic regression, as illustrated in Figure 2. In general, GCVI was more strongly correlated with crop cut than with selfreported yields.
As a separate assessment, we performed the regression in Equation (6) using three alternative measures of yield: self-report, crop cut, and satellite-based estimates using the model calibrated to crop cuts. For predicting field-scale yields with GCVI, we use the mean value of GCVI for the entire plot (again excluding pixels with trees), rather than for the crop cut sub-plot used in the calibration step. Table 3 summarizes the coefficients and explanatory power of each model. All models were statistically significant, although less than one-quarter of total variation in measured yields was  Figure 2. In general, GCVI was more strongly correlated with crop cut than with self-reported yields.
As a separate assessment, we performed the regression in Equation (6) using three alternative measures of yield: Self-report, crop cut, and satellite-based estimates using the model calibrated to crop cuts. For predicting field-scale yields with GCVI, we use the mean value of GCVI for the entire plot (again excluding pixels with trees), rather than for the crop cut sub-plot used in the calibration step. Table 3 summarizes the coefficients and explanatory power of each model. All models were statistically significant, although less than one-quarter of total variation in measured yields was captured in all cases. The models' explanatory power was likely limited by a lack of objective soil measures in the current study, especially given overall low levels of input use in the region (Table 2). Consistent with a high importance of soil, all models showed a statistically significant (p < 0.01) negative association between yields and whether the plot had been fallowed in the past decade (a likely indication of poor soil quality). Table 3. Regression results using three alternative measures of sorghum yield: (i) Self-reported, (ii) crop cut, and (iii) crop cut-calibrated Sentinel-2 satellite yields. In first column, carrots (ˆ) indicate inputs that were input as a binary variable (0,1) and asterisk (*) indicate variables that were calculated on a per ha basis using self-report area. Values in parentheses show the standard error of the estimate, and significance of the coefficient is indicated as (* p < 0.1, ** p < 0.05, *** p < 0.01). The model using crop cut yields exhibited the lowest explanatory power (adjusted R 2 = 0.05), likely reflecting the fact that the crop cut sub-plot areas represent on average less than 1% of the total plot area that pertains to the survey responses on management and soil conditions. The model using self-report yields had higher explanatory power (adjusted R 2 = 0.11). Several factors likely contribute to this increase, including that (i) self-report yields (unlike crop cut yields) correspond to the same spatial scale as other self-reported plot characteristics; (ii) measurement errors for different survey questions are likely correlated and could inflate the overall model performance, for instance if a farmer who overestimates production tends to also overestimate seed or fertilizer inputs; (iii) farmer perceptions of yield are influenced by the weather during the growing season, as suggested by the fact that the association with August rainfall is larger for self-report yields than the other measures; and (iv) self-reported yields tend to be higher on smaller plots, as indicated by a significant negative coefficient for plot size. Although this last factor increases the explanatory power of the model, we interpret it as an artefact of self-report bias on small fields rather than a true inverse relationship between plot size and productivity, as discussed in [6].
The model using crop cut-calibrated satellite yields exhibited the highest explanatory power among the three models (adjusted R 2 = 0.24), more than twice as much as for crop cut or self-reported yields. One interpretation of this could be that errors in the satellite model are correlated with the some of the factors in X. For example, GCVI is known to correlate strongly with overall canopy biomass and nitrogen content [26,27], but the relationship between biomass and yield is variable in rainfed sorghum systems [12]. Therefore, it is possible that inputs such as fertilizer are more predictive of biomass than yield, while at the same time our satellite-based estimates are more sensitive to biomass than yield. Unfortunately, without explicit measures of biomass we cannot assess the extent to which this explanation holds. Another potential explanation, in our view a more likely one, is that the higher explanatory power indicates a greater ability of satellite data to detect plot-level variations in yield compared to either self-reports or crop cut estimates.
The results discussed above all pertain to satellite estimates obtained when using ground data from all 575 fields to calibrate Equation (5). A practical question is whether similar results could be achieved at lower cost by using significantly fewer plot samples. To test this, we varied the number of ground samples used in calibration from 5 to 500, each time randomly selecting the calibration samples and reserving the remaining fields as a validation set. Figure 6 shows the average root mean square error (RMSE) for the validation set, averaged over all validation points and for 200 different iterations with different randomly selected calibration samples. The out-of-sample performance improves most rapidly up to roughly 30 ground samples, after which RMSE exhibits a more gradual decrease. Also shown in Figure 6 is the RMSE when the model is calibrated to self-reported yields, which consistently do worse than crop cuts regardless of how many ground samples are obtained. biomass and nitrogen content [26,27], but the relationship between biomass and yield is variable in rainfed sorghum systems [12]. Therefore, it is possible that inputs such as fertilizer are more predictive of biomass than yield, while at the same time our satellite-based estimates are more sensitive to biomass than yield. Unfortunately, without explicit measures of biomass we cannot assess the extent to which this explanation holds. Another potential explanation, in our view a more likely one, is that the higher explanatory power indicates a greater ability of satellite data to detect plotlevel variations in yield compared to either self-reports or crop cut estimates. The results discussed above all pertain to satellite estimates obtained when using ground data from all 575 fields to calibrate Equation (5). A practical question is whether similar results could be achieved at lower cost by using significantly fewer plot samples. To test this, we varied the number of ground samples used in calibration from 5 to 500, each time randomly selecting the calibration samples and reserving the remaining fields as a validation set. Figure 6 shows the average root mean square error (RMSE) for the validation set, averaged over all validation points and for 200 different iterations with different randomly selected calibration samples. The out-of-sample performance improves most rapidly up to roughly 30 ground samples, after which RMSE exhibits a more gradual decrease. Also shown in Figure 6 is the RMSE when the model is calibrated to self-reported yields, which consistently do worse than crop cuts regardless of how many ground samples are obtained. Figure 6. The average root mean squared error (RMSE) between crop cut yields and satellite-based yield estimates for fields not used in calibration (i.e., out-of-sample), plotted for different size of training datasets used in calibration and for different sources of yield data for calibration (crop cut or farmer self-report). Values shown are the mean across 200 different random subset of calibration points for each calibration size. Performance improves rapidly with each additional sample until ~30 training points, after which improvement is more gradual.
A final consideration for this study was whether the spatial resolution of the satellite data is a major factor determining the performance of satellite-based models. The Planetscope imagery was processed in the same manner as Sentinel-2, with recursive harmonics fit to the GCVI observations, and the peak of the harmonic fit used to predict yield. We also considered whether combining Sentinel-2 and Planetscope observations, which results in a denser time series of VI observations, led to any improvements. Overall, the two sensors performed similarly (Figure 7), suggesting that the benefits of a finer resolution (for Planetscope) or better sensor spectral resolution and signal-to-noise (for Sentinel-2) were not substantial in this particular system. Figure 6. The average root mean squared error (RMSE) between crop cut yields and satellite-based yield estimates for fields not used in calibration (i.e., out-of-sample), plotted for different size of training datasets used in calibration and for different sources of yield data for calibration (crop cut or farmer self-report). Values shown are the mean across 200 different random subset of calibration points for each calibration size. Performance improves rapidly with each additional sample until~30 training points, after which improvement is more gradual.
A final consideration for this study was whether the spatial resolution of the satellite data is a major factor determining the performance of satellite-based models. The Planetscope imagery was processed in the same manner as Sentinel-2, with recursive harmonics fit to the GCVI observations, and the peak of the harmonic fit used to predict yield. We also considered whether combining Sentinel-2 and Planetscope observations, which results in a denser time series of VI observations, led to any improvements. Overall, the two sensors performed similarly (Figure 7), suggesting that the benefits of a finer resolution (for Planetscope) or better sensor spectral resolution and signal-to-noise (for Sentinel-2) were not substantial in this particular system.  Figure 7. Correlation between ground-based yields and satellite peak GCVI for Sentinel-2 (S2), Planetscope (PS), and the combination of the two. Peak GCVI values were determined from a harmonic fit (Equation (5)) to the individual observations of GCVI throughout the season. Performance was similar for the different satellite sensors.
The DG imagery could not be processed in a similar way given the sporadic coverage of fields throughout the season. That is, the differing frequency of DG images would complicate the interpretation of the resulting harmonic coefficients relative to those from Sentinel-2. Therefore, to compare with Sentinel-2 we calculated the Sentinel-2 GCVI from the harmonic regression for the date of the DG image and computed the correlation between GCVI for each sensor and crop cut yields for the sub-set of fields that were located within the image. A comparison of the correlations for the two sensors with crop cut yields revealed little systematic difference between the two sensors ( Figure 8). Sentinel-2 outperformed DG in slightly more than half of the image pairs, but differences were typically only a few percentage points.  Correlation between ground-based yields and satellite peak GCVI for Sentinel-2 (S2), Planetscope (PS), and the combination of the two. Peak GCVI values were determined from a harmonic fit (Equation (5)) to the individual observations of GCVI throughout the season. Performance was similar for the different satellite sensors.
The DG imagery could not be processed in a similar way given the sporadic coverage of fields throughout the season. That is, the differing frequency of DG images would complicate the interpretation of the resulting harmonic coefficients relative to those from Sentinel-2. Therefore, to compare with Sentinel-2 we calculated the Sentinel-2 GCVI from the harmonic regression for the date of the DG image and computed the correlation between GCVI for each sensor and crop cut yields for the sub-set of fields that were located within the image. A comparison of the correlations for the two sensors with crop cut yields revealed little systematic difference between the two sensors ( Figure 8). Sentinel-2 outperformed DG in slightly more than half of the image pairs, but differences were typically only a few percentage points. . Correlation between ground-based yields and satellite peak GCVI for Sentinel-2 (S2), Planetscope (PS), and the combination of the two. Peak GCVI values were determined from a harmonic fit (Equation (5)) to the individual observations of GCVI throughout the season. Performance was similar for the different satellite sensors.
The DG imagery could not be processed in a similar way given the sporadic coverage of fields throughout the season. That is, the differing frequency of DG images would complicate the interpretation of the resulting harmonic coefficients relative to those from Sentinel-2. Therefore, to compare with Sentinel-2 we calculated the Sentinel-2 GCVI from the harmonic regression for the date of the DG image and computed the correlation between GCVI for each sensor and crop cut yields for the sub-set of fields that were located within the image. A comparison of the correlations for the two sensors with crop cut yields revealed little systematic difference between the two sensors ( Figure 8). Sentinel-2 outperformed DG in slightly more than half of the image pairs, but differences were typically only a few percentage points.

Conclusions
The rapidly growing availability of satellite data offers great promise for monitoring smallholder agricultural plots. The results of this study indicate that the perceived performance of satellite-based measures can depend to a large extent on how the performance is measured. In our setting, when compared to ground measures of yield, satellite data agreed much more strongly with crop cuts from 8 × 8 m sub-plot than with farmer self-reported yields for the entire plot. Studies that rely solely on comparisons with self-reported yields may therefore understate the ability of satellites to measure yield variation, particularly in subsistence systems where farmers do not typically measure production.
Moreover, even comparisons with crop cuts may not give a complete view of the performance of (crop cut) calibrated satellite yields, given high yield heterogeneity in many fields and the fact that satellite pixels will inevitably contain some signal from outside the crop cut sub-plot. Therefore, we recommend (i) following a regression-based approach to estimate the relationships between yields and factors that are likely to influence yields, including inputs, soil conditions, and management practices, and (ii) investigating the sensitivity of these estimates across different yield measures, as an additional form of evaluation of calibrated satellite-based estimates. In the current study, these factors explained more than twice as much variation in satellite-based yield estimates as for either crop cuts or self-report yields. This finding provides an additional line of evidence that satellite-based yields are no worse, and possibly better, than traditional ground-based measures for assessing yield variation. Of course, obtaining accurate ground-based data on agricultural systems remains important, both for the calibration and evaluation of satellite estimates, as well as for the many other aspects of interest that cannot be measured remotely (e.g., off-farm income). However, measuring production is often one of the most time-intensive and therefore expensive parts of field surveys, and satellite-based estimates can help to reduce the number of fields for which ground-based measures are needed. Our estimates suggest that, at least in this setting, only a few dozen high-quality ground observations would be needed to train an accurate model.
As the use of satellite data becomes more mainstream for monitoring and studying agricultural systems, the relative benefits and costs of different sensors is of increasing relevance. Here we considered three sources of satellite imagery, the public and freely available 10 m data from Sentinel-2 as well as finer resolution imagery from two private sector providers (Planet and DigitalGlobe). For the study setting, where plot sizes averaged 1.5 ha and there were frequent cloud-free Sentinel-2 observations during the growing season, there appeared little benefit to using the finer resolution data. In regions with more cloud cover or smaller fields, the benefits of these other sensors would likely be larger, and more comparisons are needed to understand the conditions under which these sensors provide the most value.