Real-time Prediction of Crop Yields from MODIS Relative Vegetation Health: A Continent-wide Analysis of Africa

: Developing countries often have poor monitoring and reporting of weather and crop health, leading 1 to slow responses to droughts and food shortages. Here I develop satellite analysis methods and software tools 2 to predict crop yields two to four months before the harvest. This method measures relative vegetation health 3 based on pixel-level monthly anomalies of NDVI, EVI and NDWI. Because no crop mask, tuning, or subnational 4 ground truth data is required, this method can be applied to any location, crop, or climate, making it ideal for 5 African countries with small ﬁelds and poor ground observations. Testing began in Illinois where there is 6 reliable county-level crop data. Correlations were computed between corn, soybean, and sorghum yields and 7 monthly vegetation health anomalies for every county and year. A multivariate regression using every index and 8 month (up to 1600 values) produced a correlation of 0.86 with corn, 0.74 for soybeans, and 0.65 for sorghum, 9 all with p -values less than 10 − 6 . The high correlations in Illinois show that this model has good forecasting skill 10 for crop yields. Next, the method was applied to every country in Africa for each country’s main crops. Crop 11 production was then predicted for the 2018 harvest and compared to actual production values. Twenty percent 12 of the predictions had less than 2% error, and 40% had less than 5% error. 13

• Corn, soybeans, and sorghum were analyzed in every county. • A multivariate regression was able to produce predictions with an error of 6.3%, 5.7%, and 33% respectively. • These high correlations show that this model has good forecasting skill for crop yields Correlations: • The 2-4 crops with the highest production were analyzed in each country. • The model was trained on years 2013-2015, and then 2018 production values were predicted. • Once 2018 production values were published, predictions were compared to actual values and were found to have a median error of 8.6% Clouds and snow in images can disrupt data and distort values. In order to account for cloud contamination, 121 a cloud mask was retrieved from the Descartes Platform. Pixels with clouds or snow were not included in monthly 122 averages, and images with over 80% clouds were removed altogether ( Figure 3c).

123
To measure the health of crops throughout the growing season, three indices were computed: Normalized  For every pixel in Illinois, the NDVI, EVI, and NDWI monthly averages and climatologies were computed.

134
The climatology is defined as the average NDVI, EVI, or NDWI over years 2000 through 2016 for each month 135 and pixel. Next, the monthly climatology was subtracted from the monthly average for every pixel, resulting in 136 the monthly anomaly. The pixels in each county were then averaged together to find the monthly anomaly for 137 NDVI, EVI, and NDWI. Monthly averaging was chosen for simplicity.

138
Illinois was chosen as a test site because the land is mostly agricultural and can provide a clear signal of crop 139 health. Illinois also has very little irrigation: most counties irrigate less than 1% of their fields [35]. Similarly,

140
90% of staple food production in sub-Saharan Africa comes from rain-fed farming systems [36]. To test the predictive ability of the model, the data was split into a training group of 90% and a testing group 150 of the remaining 10%. The multivariate regression was then fit to the training data and asked to predict the testing 151 set. To ensure randomness, this process was repeated 10 times for each crop.

152
After testing in Illinois was complete, the method was applied to three countries in Africa: Ethiopia,

153
Tunisia, and Morocco. These countries were used as initial case studies because they have a recent history of 154 relative agricultural and political stability and offer a range of climates and crops. In each country, the two Table 1. Definitions of indices to measure crop health. NIR is near infrared, G is the gain factor, L is the canopy background adjustment that addresses non-linear, differential NIR and red radiant transfer through a canopy, and C 1 , C2 are the coefficients of the aerosol resistance term, which uses the blue band to correct for aerosol influences in the red band.  to four highest-producing crops were analyzed. African crop yields were downloaded from Index Mundi, a 156 comprehensive data portal with country-level statistics compiled from multiple sources, but the production data 157 was originally collected by the USDA Foreign Agricultural Service (FAS) [40].

158
In each country, a box was analyzed over a dense farming region ( Figure 14) and was then correlated 159 to national crop production data [40]. Subsections in each country were chosen based on sub-national crop 160 production estimates from the Spatial Production Allocation Model (MAPSPAM), a global spatial crop allocation 161 model [41]. Sample areas were selected rather than the entire country in order to limit the amount of data required.

162
A continent-wide analysis would require significant data transfer and computational power, which is expensive 163 and time consuming. Even with the smaller areas for analysis shown in Figure 14, the MODIS imagery over 164 Africa totaled to 10 terabytes of data. This study provides a proof of concept that dense farming areas can serve 165 as representative samples of larger regions, and shows that a single user with a personal computer can produce 166 reasonable forecasts of crop yields for the whole continent.

167
The daily MODIS imagery over the selected boxes in each country was processed in a similar way to Illinois.

168
First, the bands were retrieved from the Descartes Platform. NDVI, EVI, and NDWI were computed, and cloudy 169 pixels were masked out. The climatology for each pixel was subtracted to gain monthly anomalies as well as 170 averages of all three indices, resulting six variables for correlation analysis: NDVI average, NDVI anomaly, EVI 171 average, EVI anomaly, NDWI average, and NDWI anomaly. Next, correlations were computed between the six indices of the month at the height of the growing season and the crop production. The height of the growing 173 season is defined as the month in the growing season that the NDVI average peaks.

174
After initial successes in Ethiopia, Tunisia, and Morocco, the method was expanded to every African country 175 with the exceptions of Western Sahara due to lack of crops, and Equatorial Guinea and Gabon due to constant 176 cloud cover. Satellite data was restricted in this study to 2013-2018 based on the limited download and compute 177 time that is avaliable to a typical home user on a modern-day laptop. The satellite imagery processed in Africa 178 totaled 10 terabytes even with only five years of data. Future production was then predicted for every African 179 country with a harvest between December 2017 (e.g. Ethiopia) and June 2018 (e.g. Namibia). When the actual 180 production values were published, the error of the predictions in every country and crop was computed.

182
The method was first validated in Illinois and then applied in Africa. as examples to show corn yield and satellite anomalies at the county level ( Figure 6).

195
Next, the relationships were examined at a higher resolution. The corn, soybean, and sorghum county yield than one in a million chance of them occurring through a random process.

203
Correlations for each crop have been computed with three indices (NDVI, EVI, and NDWI) and five months, 204 for a total of fifteen independent variables. In order to create a single predictive measure of crop yields, a 205 multivariate regression was fit to every index and every month using a Python machine learning library. Figure 9 206 shows an example of the multivariate regression for two of the variables and corn yield. The multivariate 207 regression improved the individual correlations for all three crops to 0.86, 0.74, and 0.65 respectively (thick solid 208 lines in Figure 8).

209
To test the predictive power of the model, the multivariate regression was trained on a random 90% of 210 the data and then predicted the remaining 10%. This process was repeated ten times. The median errors of 211 the predicted yields are 9.8 bu/acre (6.3%), 2.7 bu/acre (5.7%), and 9.1 bu/acre (33%) for corn, soybeans, and  with corn (left), soybeans (middle), and sorghum (right). Corn and soybeans have the best correlations, and sorghum is slightly worse, likely because it is grown much less in Illinois than the other crops. All correlations are extremely significant with p-values less than 10 −6 . August was the month with the highest correlations to yields. a.  Figure 10. Accuracy of the multivariate regression predictions of yields in Illinois. The model was trained with a randomly selscted 90% of the data and then predicted the other 10%. This process was then repeated ten times to ensure randomness. The median error was lowest for soybeans with 5.7%, corn was similar at 6.3%, and sorghum had the worst error with 32.6%. a.

May
b.
c. Figure 11. NDVI monthly average for Ethiopia (a), Tunisia (b), and Morocco (c). The annual rainy season produces high NDVI values and corresponds to the crop-growing months. Ethiopia includes the corn production as green bars, which has a very high correlation to maximum NDVI at 0.98.

222
The high correlations in Illinois show that this model has good forecasting skill for crop yields. Next, this 223 method was applied to three countries in Africa: Ethiopia, Morocco, and Tunisia. For each country, a box within 224 a major crop-growing region was analyzed (Figures 13a, 14).

225
Crop estimation in developing countries is vastly different than Illinois and the developed world. The 226 greatest distinctions include the heterogeneity of the landscape, lack of agricultural technology, the spatial size 227 of crop reports, and the accuracy of reported values. In Illinois, the ground is covered with large fields which 228 grow a small number of crops: mostly corn and soybeans. In Africa, the landscape is highly diverse, with small 229 family-owned farms neighboring villages, lakes, mountains, and forests, sometimes all within a couple pixels 230 ( Figure 1). These farms, usually smaller than an acre, lack much of agricultural technology found in the US, such 231 as pesticides, herbicides, and fertilizers. This makes crops yields much more variable in Africa both seasonally 232 and spatially. One of the largest difficulties of crop prediction in Africa is the area for which production numbers 233 are reported. While the US reports crop data for every county, which range from 400 to 3000 square kilometers,

234
African data is only easily available at the country level, which is 1.1 million square kilometers for a country like with the highest production in each country were evaluated for this study. Table 2  c.

258
Satellite imagery was processed for every African country. First, a box in an agricultural region was 259 selected in every African country and a total of 10 terabytes of daily satellite imagery was processed according to 260 the method above. Correlations and linear regressions were computed in every country for their 2-4 highest 261 producing crops. Difficulties in finding accurate correlations could include:

262
• false reporting of production in some countries due to lack of resources, poor oversight, or corruption 263 (e.g. DR Congo, Eritrea, Libya). In severe cases, one could simply use the NDVI anomaly as a proxy for 264 production rather than computing a correlation with reported crop yields.

Production Predictions 2018 with Actual Yield
Production Prediction Std Dev from the Average Figure 16. The map in the center displays the predicted crop production for African countries with harvests between December 2017 and June 2018 in standard deviations from the average. Surrounding the map are bar charts of satellite indices (blue), historical crop production (dark green), predicted 2018 crop production (pink), and actual 2018 production (light green). To view the accuracy of predictions for all crops and countries predicted, see Table 2.

295
In this research, I developed a method to use three measures of crop health computed from daily MODIS 296 satellite imagery as a predictive tool for crop yields 2-4 months before the harvest. The model was first validated 297 in Illinois where there is high-resolution yield data by computing the linear fit between harvest yields in October

298
[45] and the satellite indices in July and August. When a split sample validation was applied to a multivariate 299 regression with all months of the growing season and all three indices, the model could predict the crop yields 300 within an accuracy of 6.3%, 5.7%, and 33% for corn, soybeans, and sorghum respectively. Next, the method was 301 applied to three countries in Africa (Ethiopia, Tunisia, Morocco), all with different climates and crops. High 302 correlations between maximum satellite indices and crop production were calculated in all three countries, with 303 Ethiopia the highest at 0.99 to sorghum. After this success, satellite imagery was analyzed in every African 304 country, and productions for the 2018 harvests were predicted 2-4 months before the harvest. Once 2018 harvests 305 were published, the prediction accuracy was tested against the reported values. Forty percent of the predictions 306 were found to have less than a 5% error.

307
The main objective of this study was to show how a very simple method can serve as an early warning 308 system to predict crop yields in every African country. This method relies solely on NDVI, EVI, and NDWI 309 anomalies calculated over specific subsections of the countries, without the use of crop masks, subnational yield 310 statistics, or special tuning for location or climate. Even with these many simplifications, the model was still able 311 to produce predictions with minimal error over Illinois and throughout Africa.

312
The model was found to perform best in countries with a relatively large agricultural sector and recent 313 political stability. Some countries with relatively poor correlations include Sudan, Somalia and DR Congo, which 314 are all currently in a state of conflict. Because of the political instability, it can be assumed that agricultural 315 reports in those countries could have lower accuracy, and farming as a whole could be of a lower priority. A 316 limitation of this model is that it relies on published yield data, so it will not predict as reliably in countries that 317 lack reporting accuracy. In these places, the NDVI anomaly could be used as a proxy for relative crop yields 318 compared to a mean. The model also only predicts yields at the national level and has no subnational component.

319
However, it has the ability to predict yields sub-nationally in the future when sub-national crop data is supplied.

320
The model developed here may be compared to the existing early-warning systems of GEOGLAM and 321 FEWS NET. Both systems can most likely be said to have predictions with better accuracy than the one presented 322 here, which can be traced back to a couple reasons. Primarily, both are run under large budgets by an extensive 323 team of people with partnerships around the globe. Their systems include ground observations, remotely sensed 324 data, agroclimate indicators, field reports, and communications with national and regional experts. In contrast, 325 this method can be run by a single user on a modern laptop computer. It was developed over the course of a 326 couple months, and is practically free. This model is also able to predict a numerical value of crop production, 327 while GEOGLAM and FEWS NET present their results as a qualitative measure: conditions are compacted into 328 five categories of crop conditions or food insecurity phases.

329
The power of the method developed here is that can be applied to any crop, location, or climate to produce 330 reasonable real-time forecasts of crop yields. It is unique because of its versatility and easy to apply due to its 331 simplicity.