Improving Winter Wheat Yield Forecasting Based on Multi-Source Data and Machine Learning

: To meet the challenges of climate change, population growth, and an increasing food demand, an accurate, timely and dynamic yield estimation of regional and global crop yield is critical to food trade and policy-making. In this study, a machine learning method (Random Forest, RF) was used to estimate winter wheat yield in China from 2014 to 2018 by integrating satellite data, climate data, and geographic information. The results show that the yield estimation accuracy of RF is higher than that of the multiple linear regression method. The yield estimation accuracy can be signiﬁcantly improved by using climate data and geographic information. According to the model results, the estimation accuracy of winter wheat yield increases dramatically and then ﬂattens out over months; it approached the maximum in March, with R 2 and RMSE reaching 0.87 and 488.59 kg/ha, respectively; this model can achieve a better yield forecasting at a large scale two months in advance.


Introduction
As the world's largest producer and consumer of wheat [1], China faces great challenges of food security. Winter wheat production, one of the most important summer grain production of China [2], stagnated in 56% of China from 1961 to 2008 [3]. Therefore, a timely and an accurate winter wheat yield forecasting in China is of great importance for food trade and policymakers. Recently, there has been increasing research on winter wheat yield estimation where the yield prediction models are based on the physiological and the ecological processes of crops. These have been developed constantly, such as WOFOST [4], DSSAT [5], APSIM [6], STICS [7], and MONICA [8]. Such models mostly simulate daily crop development, growth, and yield formation as well as climate variables that are used as the main inputs to describe environmental conditions during the period of crop growth. However, the growth state of crops is not only affected by abiotic factors (growth environment) but also by biological factors (such as plant diseases) [9][10][11]. Therefore, using climate data alone may not be sufficient to estimate yield. Meanwhile, due to the high spatial heterogeneity of crop varieties, farmer management policies, and environments, there is significant uncertainty in the practical application of the model on a large scale [12,13].
Satellite remote sensing can continuously monitor crop growth across various spectral bands and provide useful additional information for crop yield estimations [14][15][16]. In the past decades, remote sensing monitoring technology has been successfully applied to crop yield estimations [17,18]. Such research was mostly about the empirical relationship are mainly in the North China Plain and also include the winter wheat planting area in ten provinces-Inner Mongolia, Shandong, Hubei, Chongqing, Sichuan, Guizhou, Yunnan, Shaanxi, Gansu, and Qinghai-and two autonomous regions: Ningxia and Xinjiang. It is thus clear that the study area is widely distributed and the proportion of planting areas in other provinces are relatively small. Winter wheat in China is generally sown at the end of September or at the beginning of October, and it is harvested by the middle of June of the following year [59]. Generally, irrigation and fertilization are available in these areas. In this study, January to June is defined as the major growth period of winter wheat on which the analysis and the modeling are focused in view of the effect of cold and frost injury on crops in January [60].

Dataset
The study uses data on crop yield, planting area, satellite data, climate data, and spatial information ( Table 1). The multi-source data collected in the study have various temporal and spatial resolutions. Therefore, firstly the raster data were resampled to a spatial resolution of 1 km and the climate and satellite data are unified into monthly interval. Then, monthly climate and satellite are aggregated at prefecture-level by using the crop map generated.

Dataset
The study uses data on crop yield, planting area, satellite data, climate data, and spatial information ( Table 1). The multi-source data collected in the study have various temporal and spatial resolutions. Therefore, firstly the raster data were resampled to a spatial resolution of 1 km and the climate and satellite data are unified into monthly interval. Then, monthly climate and satellite are aggregated at prefecture-level by using the crop map generated. ) and SIF reanalysis datasets (GOSIF) of OCO-2 satellite (http://data.globalecology.unh.edu (accessed on 27 November 2019)). Compared with the normalized vegetation index (NDVI), EVI is closely related to biomass and crop yield [62], and it can better represent the leaf and the chlorophyll content of crop canopy; GOSIF is a reanalysis dataset based on SIF data derived from OCO-2, MODIS data, and climate data. Compared with SIF data with coarse resolution and calculated directly from OCO-2, GOSIF has a better spatio-temporal resolution (0.05 • , 8 days), continuous global coverage, and longer records.
Climate data: a total of 10 climate variables are collected from CRU_TS 4.04 (Climatic Research Unit Timeseries 4.04) series and CMFD (The China Meteorological Forcing Dataset) series ( Table 1). The CRU_TS series is based on the record analysis of more than 4000 independent weather stations and it is gridded at a resolution of 0.5 × 0.5 • , including monthly precipitation, daily maximum and minimum temperatures, cloud cover, and other variables covering the terrestrial region of the earth from 1901 to 2020. The CMFD dataset, with a spatial resolution of 0.1 × 0.1 • , mainly includes precipitation per 3 h, surface radiation, wind speed, air specific humidity, and other variables covering 1979 to 2018.
Geographical basic data: the crops growth status and growth environment have spatial heterogeneity. Studies have indicated that crop yields in neighboring counties are usually similar in a certain year. The spatial autocorrelation can be explained by coding geographical coordinates (lat, lon) in feature space [62,63]. In this study, all data, including EVI, SIF, climate variables, and geographical coding, are covered by the raster data of winter wheat distribution and they were collected to prefecture-level cities with an average value.

Data Preprocessing
Selecting input variables was indispensable before machine learning and linear regression, which can not only reduce the input dimension, i.e., integrating expert knowledge to select the most appropriate input but also quantify the correlation between different potential independent variables and dependent variables to help to explain the results of machine learning algorithms. Based on previous studies on the relation between climate and crop yield [19][20][21][22], ten climate variables were selected for the study. To facilitate variable selection and interpretation, the best variable combinations of yield estimation were chosen without wasting information. Firstly, the 10 climate variables were divided into four groups according to prior knowledge: (1) water-supply-related, including precipitation (pre), wet day frequency (wet), and air specific humidity (shum); (2) temperature-related, including near-surface average temperature (tmp), near-surface temperature minimum (tmn), and near-surface temperature maximum (tmx); (3) water-demand-related, including potential evapotranspiration (pet) and vapor pressure (vap); and (4) radiation-related, including surface downward shortwave radiation (srad) and surface downward longwave radiation (lrad). The correlation analysis was carried out based on the mean value of the variables of the growing season (January-June) to eliminate the influence of the seasonal cycle. This study selected appropriate dependent variable inputs from the climate variables based on the following criteria: 1) selecting the variables which have the maximum absolute correlation with the yield in each group; and 2) selecting the variables whose value of correlation with the previously selected climate variables in the same group is not greater than 0.5.

Multiple Linear Regression
Multiple linear regression (MLR) is one of the most widely used methods of crop yield estimation, and it is easy to use. Based on the principle of Ordinary Least Square (OLS) and the stepwise regression method, the independent variables were selected with significant effects and they constructed the optimal regression model for winter wheat yield estimation by using the climate, satellite, and space information data. The yield estimation model is calculated by Equation (1) where Y represents the winter wheat yield of prefecture-level cities; x 1 . . . x n represent different independent variable factors used to predict Y; a 1 . . . a n represent partial regression coefficient; β is a random variable and a constant term; and ε represents random error. The criterion for the stepwise regression model to pass the significance test is that the equation of linear relation model passes the F test and all the coefficients of the equation passes the t test.

Random Forest
Random Forest (RF) is an integrated learning technology, which classifies or regresses by combining a group of CART decision trees. Due to the introduction of randomness, RF is not prone to over-fitting, and it has good learning stability [56]. In this paper, the scikit-learn, an ML library of Python, is used to develop the RF model, which includes three steps: (1) normalizing all the selected variables and yield and randomly dividing the whole data set into training data with 70% and test data with 30% [64,65]; (2) for the training data set only, optimizing the key parameters of each model based on the highest R 2 and the lowest RMSE by ten-fold cross-verification; and (3) conducting the "leave one year out" experiment from 2014 to 2018, and R 2 and RMSE are used to evaluate the performance and generalization of the model. Considering the climate data, satellite data, and spatial information, this study counts the yield data of 187 out of 385 prefecture-level cities in China from 2014 to 2018.

Experiment Design
Two groups of experiments were designed ( Figure 2) to answer the research questions raised in this paper. The purpose of the first group of experiments was to explore the effect of different input combinations on crop yield estimation models and to compare the potential of SIF in crop yield estimation. There are 11 data input combinations for the experiment, namely: (1) only SIF; (2) only EVI; (3) only climate; (4) SIF combined with spatial information; (5) EVI combined with spatial information; (6) climate combined with spatial information; (7) SIF combined with climate; (8) EVI combined with climate; (9) SIF combined with climate and spatial information; (10) EVI combined with climate and spatial information; and (11) SIF combined with EVI, climate, and spatial information. To assess the practicality of these models, based on the most suitable selected input, we recursively performed hindcasting for each year from 2014 to 2018 to evaluate whether the models can be promoted in different years; for example, the data for 2014-2017 was collected as training data to predict winter wheat yield in 2018. Certainly, future data cannot be used to predict current data. However, more verification samples can be provided for these hypotheses to increase the understanding of the model's performance. The RMSE (root mean square error) and R 2 (determination coefficient) between the predicted yield and the actual yield of winter wheat were calculated to verify the accuracy of the model. sively performed hindcasting for each year from 2014 to 2018 to evaluate whether the models can be promoted in different years; for example, the data for 2014-2017 was collected as training data to predict winter wheat yield in 2018. Certainly, future data cannot be used to predict current data. However, more verification samples can be provided for these hypotheses to increase the understanding of the model's performance. The RMSE (root mean square error) and R 2 (determination coefficient) between the predicted yield and the actual yield of winter wheat were calculated to verify the accuracy of the model. Figure 2. Experimental flow chart (Related parameter description: Location represents spatial information, including latitude and longitude; a1…an represent partial regression coefficient; β is a random variable and a constant term; ε represents random error; Jan is short for January; Feb is short for February).
The second group of experiments explored the effect of time series data on yield estimation models and the contribution of climate data and satellite data to crop yield prediction at different growth stages. In this experiment, the location information was not added since by default the spatial information of crops remained unchanged in this study in the short term. In the experiments, the climate and the satellite data were added and compared the change of R 2 and RMSE based on two methods of modeling to evaluate the Figure 2. Experimental flow chart (Related parameter description: Location represents spatial information, including latitude and longitude; a 1 . . . a n represent partial regression coefficient; β is a random variable and a constant term; ε represents random error; Jan is short for January; Feb is short for February).
The second group of experiments explored the effect of time series data on yield estimation models and the contribution of climate data and satellite data to crop yield prediction at different growth stages. In this experiment, the location information was not added since by default the spatial information of crops remained unchanged in this study in the short term. In the experiments, the climate and the satellite data were added and compared the change of R 2 and RMSE based on two methods of modeling to evaluate the change of performance of winter wheat estimation models. During the growing season (January-June), the input data of all months were used to predict winter wheat yield. The experiments were based on the three input combinations (namely climate, satellite, as well as climate and satellite). According to the experiment results, the added value of climate or satellite data to the estimation model in any period can be determined, and through different methods and input combinations the time for the model to achieve the best performance of estimation can be tested. Figure 3 shows the correlation analysis results of 10 climate variables, which demonstrated that water-supply-related variables (shum, pre, and wet) are all positively correlated with yield, while tmx and tmp among the temperature-related variables are negatively correlated with yield. Among the water-demand-related variables, pet is negatively correlated with yield, while vap is positively correlated with yield. Among radiation-related variables, srad is negatively correlated with yield, and lrad is positively correlated with yield. To select appropriate variables from each group as the input of yield estimation, 5 out of 10 climate variables are selected according to the method in Section 2.4, namely wet, tmx, pet, srad, and vap. lated with yield, while tmx and tmp among the temperature-related variables are negatively correlated with yield. Among the water-demand-related variables, pet is negatively correlated with yield, while vap is positively correlated with yield. Among radiation-related variables, srad is negatively correlated with yield, and lrad is positively correlated with yield. To select appropriate variables from each group as the input of yield estimation, 5 out of 10 climate variables are selected according to the method in Section 2.2.4, namely wet, tmx, pet, srad, and vap.

The Influence of Different Input Data Combinations on the Simulation of the Model
The results of the first group of experiments given in (Figure 4) that two models have the following similar characteristics with different combinations of data inputs: in the single data, the yield estimation performance of climate data is better. It may be because the climate variables can better simulate the growing environment of crops. There is an obvious spatial pattern of winter wheat yield at the prefecture scale, which indicates that the addition of spatial information is helpful to improve the prediction accuracy of the model. A better simulated result of yield is obtained by combining satellite data, climate data, and spatial information (MLR: R 2~0 .68; RF: R 2~0 .95). What is noteworthy is that on a monthly scale, compared with the addition of EVI, that of SIF does not significantly improve the yield estimation accuracy. The estimation effect of RF by combining SIF with other environmental factors is even lower than that of EVI. The result indicates that at the seasonal and the prefecture scale, SIF cannot provide much additional information that is different from EVI in crop yield estimation, and it shows no advantages of yield estimation on the small scale in the field. This may be related to the low signal-to-noise ratio, coarse resolution, and complex extraction algorithm of SIF [38]. However, the resolution of the SIF dataset used in this study (0.05 • ) has been improved compared with previous datasets (0.5~1 • ) [38]. Yet, the performance of the model remains the same, indicating that the downscale SIF dataset based on statistical methods alone cannot significantly enhance the effect of seasonal-scale crop yield estimation, which is consistent with Lindsey [66]. different from EVI in crop yield estimation, and it shows no advantages of yield estimation on the small scale in the field. This may be related to the low signal-to-noise ratio, coarse resolution, and complex extraction algorithm of SIF [38]. However, the resolution of the SIF dataset used in this study (0.05°) has been improved compared with previous datasets (0.5~1°) [38]. Yet, the performance of the model remains the same, indicating that the downscale SIF dataset based on statistical methods alone cannot significantly enhance the effect of seasonal-scale crop yield estimation, which is consistent with Lindsey [66].

Comparison of Yield Estimation Performance of the Model
The results show that the yield estimation performance of RF is generally higher than that of MLR, which may be because the relationships between crop yield and variables are mostly nonlinear while nonlinear methods capture these relationships better than linear methods. Besides comparing the performance and generalization of the models, we conduct a "leave-one-year-out" experiment to verify the extrapolation potential of the models, which is establishing models based on all data and the crop yields in four out of five years and then separately verifying the estimated yield result of the year left. The result is shown in Table 2 in which each row represents the model performance of one year. The RMSE and R 2 between the estimated and the actual yield of winter wheat are compared. The results show that R 2 and RMSE are fairly stable in each year of winter wheat yield estimation at the prefecture scale, except for 2015. For example, the spatial distribution of yield prediction of 2014 based on two models ( Figure 5) shows that RF can well reflect the spatial difference of winter wheat yield, especially in the North China Plain, in addition to having a high potential of yield estimation. In 2014, the errors of RF are mainly in Henan Province, while MLR generally underestimates the crop yield in high-yield areas, and the errors are concentrated in Henan, Hebei, and Shandong provinces. distribution of yield prediction of 2014 based on two models ( Figure 5) shows that RF can well reflect the spatial difference of winter wheat yield, especially in the North China Plain, in addition to having a high potential of yield estimation. In 2014, the errors of RF are mainly in Henan Province, while MLR generally underestimates the crop yield in high-yield areas, and the errors are concentrated in Henan, Hebei, and Shandong provinces.

The Influence of Time Series Data on the Simulation Ability of the Model
Since SIF has no advantage in large-scale yield estimation, EVI is used as an example in this experiment. The results ( Figure 6) illustrate that the two models have the following similar characteristics: (1) for any particular inputs, the yield estimation accuracy of the model increases rapidly with the increase of acquired data and the growth rate slows down and gradually reaches saturation at a later stage of the growing season; (2) the combination can significantly improve the performance of the yield estimation model which can be significantly improved through combining climate data with satellite data, and climate data plays an essential role in the model. However, there are significant differences in the trajectory of model prediction performance produced by different inputs. With only satellite data as inputs, the model generally starts from very poor performance (R 2~0 .1-0.2), and then it improves relatively by much (R 2~0 .4-0.5), while with only climate data and combined data as inputs, the model starts with relatively good performance (R 2~0 .4-0.6) and has a small increase during the growing season. To understand more clearly the

The Influence of Time Series Data on the Simulation Ability of the Model
Since SIF has no advantage in large-scale yield estimation, EVI is used as an example in this experiment. The results ( Figure 6) illustrate that the two models have the following similar characteristics: (1) for any particular inputs, the yield estimation accuracy of the model increases rapidly with the increase of acquired data and the growth rate slows down and gradually reaches saturation at a later stage of the growing season; (2) the combination can significantly improve the performance of the yield estimation model which can be significantly improved through combining climate data with satellite data, and climate data plays an essential role in the model. However, there are significant differences in the trajectory of model prediction performance produced by different inputs. With only satellite data as inputs, the model generally starts from very poor performance (R 2~0 .1-0.2), and then it improves relatively by much (R 2~0 .4-0.5), while with only climate data and combined data as inputs, the model starts with relatively good performance (R 2~0 .4-0.6) and has a small increase during the growing season. To understand more clearly the effect of multi-resource data on the model, we assume that climate and satellite data have independent and overlapping contributions to the yield estimation model. We quantify the contributions of data from different sources. For example, the independent influence of climate data on the model is the difference between the combination R 2 and satellite R 2 . The results indicate that (Figure 7) climate data always play a vital role in the performance of the model. With the advance of the growing season, the proportion of overlapping information becomes higher, and the contribution of climate data to the model decreases gradually. The results depict that satellite data gradually absorbs climate information as time goes by. In addition, models with only climate data and only satellite data can generally achieve a high simulation performance in May, while the performance of those with multi-source data can generally get close to the maximum in March (RF) and April (MLR). Therefore, the combination of multi-source data can achieve a high estimation accuracy of the crop yield one or two months in advance.
gradually. The results depict that satellite data gradually absorbs climate information as time goes by. In addition, models with only climate data and only satellite data can generally achieve a high simulation performance in May, while the performance of those with multi-source data can generally get close to the maximum in March (RF) and April (MLR). Therefore, the combination of multi-source data can achieve a high estimation accuracy of the crop yield one or two months in advance.

Discussion
The results show the effects of different input combinations on yield estimation accuracy, and different yield estimation models have similar results. During the whole crop growing season, climate data always provide important information for crop yield estimation, which is consistent with the previous conclusions [20,21,22]. The performance of two yield estimation models with satellite data has been significantly improved, which is consistent with the view of Guan et al., who indicate that satellite data can provide for crop growth additional information different from climate data [43]. It is worth noting that the addition of SIF in the regional range does not significantly improve the yield estimation accuracy of the model in this case. Compared with EVI, SIF has no significant advantage in performance of yield estimation. It has not provided much additional infor-

Discussion
The results show the effects of different input combinations on yield estimation accuracy, and different yield estimation models have similar results. During the whole crop growing season, climate data always provide important information for crop yield estimation, which is consistent with the previous conclusions [20][21][22]. The performance of two yield estimation models with satellite data has been significantly improved, which is consistent with the view of Guan et al., who indicate that satellite data can provide for crop growth additional information different from climate data [43]. It is worth noting that the addition of SIF in the regional range does not significantly improve the yield estimation accuracy of the model in this case. Compared with EVI, SIF has no significant advantage in performance of yield estimation. It has not provided much additional information to the yield estimation model. It agrees with previous research results on whether SIF has an advantage in capturing crop growth state on the regional scale [38,66], which may be related to the coarse resolution and the complex extraction algorithm and thus more uncertainties of SIF. Random Forest can better predict crop yield. This is consistent with the literature [39], as both temperature and precipitation have a nonlinear response to yield [67,68]; and the nonlinear yield estimation model was more in line with the actual situation.
As time goes on in the growing season, the amount of the input satellite and the climate data increases, and the changes of yield estimation accuracy of different models show a similarity. In accordance with the previous conclusions [38], in the early stage of crop growth climate data play an important role in the model, the satellite gradually absorbs crop growth information, and the yield estimation accuracy of the model increases significantly, while the yield estimation accuracy reaches the maximum in the late stage of growth. It is worth noting that the time when the estimation accuracy reaches the maximum varies slightly between estimation models. While in the regression model the accuracy of yield estimation peaks only one month before harvest, RF achieves a high performance of the yield prediction two months in advance.
The winter wheat planting areas are widely distributed in China, with obvious spatial differences. The main planting areas of winter wheat are in the North China Plain, about 15,309.1 kha. Most of the previous studies have not considered the situation of Inner Mongolia, Ningxia, Xinjiang, and other regions, which accounts for approximately 27.4 percent of the total winter wheat planting area in China. This study establishes models of yield estimation at the national scale with consideration of the spatial heterogeneity of winter wheat yield by adding extra basic geographic data. The results show that adding spatial information data can improve the yield estimation accuracy of models [3] and it is helpful to establish a unified model on a large scale.
In this paper, RF and MLR are used to build yield estimation models of winter wheat in China, which avoids the randomness of single model analysis. However, due to the availability of data, the research is mainly on the prediction of crop yield at the prefecture scale. Furthermore, the spatial resolution of the satellite data used is relatively coarse, which leads to small training samples, and it limits the ability of machine learning methods [38,64,69]. The newly launched satellite provides a variety of data (such as EVI, SIF, and climate variables) and it has a higher spatiotemporal resolution (such as Landsat, Sentinel, and Fluorescence EXplorer) [70][71][72], which can provide the potential for future improvement. Besides, although the machine learning model performs well in the yield estimation in the prefecture, the process-based explanation is limited, which weakens the traceability and the interpretability of the model. How to better combine the process-based model with machine learning algorithm to realize more efficiently the extrapolation beyond the training conditions [54,69] and special migration can be investigated in the future. This is to improve the crop yield estimation of models in areas where there are not enough historical yield records, such as Africa [44]. At the same time, some key factors have not been considered in this study, such as biological factors other than those captured by satellite, namely soil characteristics [26,40], which will also help to explain more yield variability. In addition, due to the data limitation of the spatial distribution of winter wheat, the spatial distribution of winter wheat from 2014 to 2018 was represented by 2014 data in this paper, which may also lead to errors in the statistics of remote sensing data. For future research, it is suggested that the crop interannual spatial distribution information data should be generated from satellite data to reduce potential errors.

Conclusions
To avoid the randomness of a single model, this study conducts yield estimation of winter wheat in China at the prefecture scale based on a MLR model and a RF model combining climate data, satellite data, and spatial information. The effects of different input combinations and time-series data on the performance of the model have been discussed. The main conclusions are as follows: (1) By decomposing and quantifying the contribution of satellite data and climate data to the model's performance in different growth periods, we find that satellite data can gradually capture the changes of crop growth and with accumulation of information can absorb part of the climate data. Spatial information and climate data have made a unique contribution to the yield forecasting of winter wheat in the whole growing season. (2) By comparing the satellite data from two sources (i.e., SIF and EVI), it was found that the downsized SIF products do not perform better than EVI on the yield forecasting at the prefecture scale in China, which may be largely owing to the low signal-to-noise ratio of SIF products and the difficulty of extraction algorithm. (3) By comparing the extrapolation and the spatial generalization ability of two models, RF can generally better capture the spatiotemporal heterogeneity of crop growth and thus is expected to better understand the impact of meteorology on agricultural production.
This study demonstrated a new scalable, simple, and inexpensive framework in estimating winter wheat yields over a wide range of areas based on publicly available data, which is applicable to other crops and geographical environments.
Author Contributions: S.Z. led the project and developed the framework; F.T. and S.Z. conceptualized and designed this research strategy; Y.S. carried out the field work and was responsible for data processing and manuscript writing; S.Z., R.A. and A.A. revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
The data presented in this study are available on request from the authors.

Conflicts of Interest:
The authors declare no conflict of interest.