Combining Multi-Source Data and Machine Learning Approaches to Predict Winter Wheat Yield in the Conterminous United States

Abstract: Winter wheat (Triticum aestivum L.) is one of the most important cereal crops, supplying essential food for the world population. Because the United States is a major producer and exporter of wheat to the world market, accurate and timely forecasting of wheat yield in the United States (U.S.) is fundamental to national crop management as well as global food security. Previous studies have mainly focused on developing empirical models using only satellite remote sensing images, while other yield determinants have not yet been adequately explored. In addition, these models are based on traditional statistical regression algorithms, while more advanced machine learning approaches have not been explored. To address these issues, this study used advanced machine learning algorithms to establish within-season yield prediction models for winter wheat using multi-source data. Specifically, yield driving factors were extracted from four different data sources, including satellite images, climate data, soil maps, and historical yield records. Subsequently, two linear regression methods, ordinary least square (OLS) and least absolute shrinkage and selection operator (LASSO), and four well-known machine learning methods, support vector machine (SVM), random forest (RF), Adaptive Boosting (AdaBoost), and deep neural network (DNN), were applied and compared for estimating the county-level winter wheat yield in the Conterminous United States (CONUS) within the growing season. Our models were trained on data from 2008 to 2016 and evaluated on data from 2017 and 2018, with the results demonstrating that the machine learning approaches performed better than the linear regression models, with the best performance achieved by the AdaBoost model (R2 = 0.86, RMSE = 0.51 t/ha, MAE = 0.39 t/ha).
Additionally, the results showed that combining data from multiple sources outperformed using single-source satellite data, with the highest accuracy obtained when all four data sources were considered in the model development. Finally, the prediction accuracy was also evaluated against timeliness within the growing season, with reliable predictions (R2 > 0.84) achievable 2.5 months before harvest when the multi-source data were combined.


Introduction
As one of the top cereals that supply world food, wheat is a rich source of calories and an essential staple in most regions of the world [1]. For example, in 2017, global cereal production was 2.98 billion tons, of which 780 million tons were wheat [2]. Despite the tremendous quantity of wheat grown annually, the total demand is still difficult to meet, leading to a recent increase in food prices [3,4]. The United States is one of the leading wheat producers in the world, with an annual production of over 51 million tons in 2018 [5]. Winter wheat (Triticum aestivum L.), which is planted in the preceding fall, is the primary variety, representing more than 70% of total U.S. production [6]. Given this large scale, timely and accurate forecasts of the winter wheat yield in the United States (U.S.) are of great significance for regional and global food security.
Satellite remote sensing provides a non-destructive and efficient way to monitor crop growth and, thus, has great potential for wheat yield analysis. In the last few decades, research has mainly focused on establishing the relationship between vegetation indices (VIs) extracted from satellite images and the reported yields [7][8][9]. The VIs have been mainly obtained from Landsat and the Moderate Resolution Imaging Spectroradiometer (MODIS) [10]. Landsat data can provide finer-scale spatial information, while MODIS data are more appropriate for studying large-scale crop production due to their moderate spatial and high temporal resolution. The most commonly used VIs include the Normalized Difference Vegetation Index (NDVI) and the Enhanced Vegetation Index (EVI). For example, Ren et al. applied time series NDVI from MODIS to estimate winter wheat yield in Shandong, China [11]. Similarly, Becker-Reshef et al. used daily NDVI from MODIS to estimate the winter wheat yield in Kansas and Ukraine [12]. Kouadio et al. extracted multiple VIs to forecast wheat yield at the ecodistrict scale in Canada, and found that the prediction errors could be significantly reduced using MODIS-EVI and NDVI [13].
Although encouraging results have been achieved by using VIs, intrinsic yield driving factors have not yet been adequately explored in the model development. Very recently, researchers have started to assess wheat yield using data from more than one source [14][15][16][17]. For example, Saeed et al. combined weather data and NDVI to predict the wheat yield in Pakistan [18], and confirmed that the best results were achieved when combining the two data sources. Cai et al. integrated satellite and weather data to predict wheat yield in Australia, and demonstrated the unique contribution of each data source [19]. These studies provide insight into the potential impact of weather data on wheat production, while other yield determinants, such as soil properties, and prior information, such as historical yields, have not yet been considered. In addition, the most effective combination of multi-source data still needs to be explored.
Two types of approaches have been used for wheat yield prediction: crop simulation models and statistical models [20]. The crop models forecast yield by simulating the entire crop growth process, taking into account the physiological characteristics of plants and a number of environmental factors; representative wheat simulation models include CERES [21], ARCWHEAT1 [22], DAISY [23], SIRIUS [24], and AFRCWHEAT2 [25]. Although these models can have high predictive power, they typically require extensive input parameters related to crop variety and management practices, which are often difficult to obtain over large areas [26,27]. Moreover, the complexity of the physiological processes brings further challenges for model calibration [27].
As an alternative to simulating the physiological mechanisms of plant growth with crop models, statistical algorithms may be used to develop an empirical relationship between a large number of current-season yield determinants, along with historical yield records, and future yields. Several linear regression approaches have been developed recently for wheat yield prediction [28][29][30][31]. However, the main drawback of these linear models is that they often fail to reveal the complex relationship between the input predictors and the yield, thus limiting their prediction performance and generalization to other areas [26,32]. New machine learning approaches that are capable of capturing the nonlinear relationships between the yield and a large set of predictors might offer an improved method [33]. For example, Zhang et al. used hyperspectral data and the support vector machine (SVM) method to forecast the winter wheat yield in Hebei Province, China [34]. Safa et al. applied artificial neural networks (ANN) to predict wheat production in Canterbury province, New Zealand [35]. Besides using a single approach, researchers have recently started to explore different models in their studies. For instance, Wang et al. developed three models, including random forest (RF), SVM, and ANN, for estimating the biomass of winter wheat in China, and RF was found to be the best [36]. Cai et al. used the same three models for predicting the winter wheat yield in Australia, and SVM achieved the best performance [19]. Since model performance varies spatially, a comprehensive comparison between different machine learning methods, including deep learning models, needs to be investigated for modeling the winter wheat yield in the Conterminous United States (CONUS).
In this study, multi-source data, including satellite imagery, climate data, soil maps, and historical yield records, were synergistically used as predictors to forecast the winter wheat yield in the CONUS from 2008 to 2018 at the county level. Two linear regression methods (ordinary least square (OLS) and least absolute shrinkage and selection operator (LASSO)) and four machine learning models (SVM, RF, Adaptive Boosting (AdaBoost), and deep neural network (DNN)) were built and compared. The goal of the study was to answer the following three questions: (1) Which is the best model for predicting CONUS winter wheat yield? (2) How much improvement can be obtained by combining the multi-source data? (3) What is the near real-time prediction performance within the growing season?

Data Acquisition and Preprocessing
In this study, four types of data were collected, including yield records, remote sensing images, climate data, and soil maps. Table 1 shows a complete summary, and detailed descriptions of each data source and the extracted variables are given in the following sections. The goal of this study was to build machine learning models to forecast the county-level winter wheat yield in the CONUS. A critical component is the historical yield records required for model training, which we obtained from the United States Department of Agriculture (USDA) NASS Quick Stats database [37]. For each county, the average yield over the years 2008-2018 was calculated, as mapped in Figure 1. The primary production regions include the Great Plains, the Pacific Northwest, the states along the Mississippi River, and the eastern states. The figure also shows that yields vary substantially with geographic location, with higher yields in the west (e.g., Idaho, Oregon, and California) and east (e.g., Indiana and Illinois) and lower yields in the central regions (e.g., Kansas, Oklahoma, and Texas).
For most counties in the CONUS, winter wheat is generally planted in the fall [38] and it becomes established in the period of late autumn and winter. It then goes into dormancy and requires a process of vernalization until the following spring [39]. The plants recover in the spring and grow rapidly until the summer harvest. For modeling the winter wheat in the CONUS, a uniform growing season is needed and, therefore, we defined it as from the preceding October to the end of July of the current year. We used the previous two years' yields as one type of the model inputs since the historical yields were found to be useful for crop yield prediction [20].


VI Data
Since the effectiveness of VIs had been previously validated for yield prediction [40][41][42], they were also considered in this study and derived from MODIS data. Specifically, the cloud-free 16-day averaged daily MODIS MCD43A3 product, which provides a wealth of band information at 500-m spatial resolution, including the visible, near-infrared (NIR), and shortwave bands, was used for the VI extraction [43]. Four VIs that are complementary and extensively used in crop yield analysis were calculated: the NDVI, NDWI, EVI, and Green Chlorophyll Index (GCI). In particular, NDVI is widely used to assess vegetation vigor. NDWI is capable of extracting the moisture content of the vegetation canopy. EVI is an optimized NDVI designed to improve the sensitivity of vegetation signals in areas with high biomass. GCI can indicate the canopy chlorophyll content and light use efficiency related to crop growth [44,45]. The four VIs were calculated using Equations (1)-(4).
NDVI = (NIR − Red) / (NIR + Red) (1)
NDWI = (NIR − SWIR) / (NIR + SWIR) (2)
EVI = G × (NIR − Red) / (NIR + C1 × Red − C2 × Blue + L) (3)
GCI = NIR / Green − 1 (4)
where NIR, Red, Green, Blue, and SWIR represent the reflectance of the near-infrared, red, green, blue, and shortwave-infrared bands, respectively; L is the canopy background adjustment; C1 and C2 are the coefficients of the aerosol resistance term; and G is the gain factor. The coefficients of EVI in Equation (3) were chosen according to the MODIS EVI algorithm, with L = 1, C1 = 6, C2 = 7.5, and G = 2.5 [46].
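For illustration, the four indices can be computed from band reflectances as in the following sketch. This is not the study's processing code; the canopy-moisture (NIR/SWIR) variant of NDWI is assumed, and the sample reflectance values are invented:

```python
def compute_vis(nir, red, green, blue, swir):
    """Return (NDVI, NDWI, EVI, GCI) from band reflectances."""
    ndvi = (nir - red) / (nir + red)
    ndwi = (nir - swir) / (nir + swir)  # canopy-moisture NDWI (assumed variant)
    # MODIS EVI coefficients: G = 2.5, C1 = 6, C2 = 7.5, L = 1
    evi = 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)
    gci = nir / green - 1.0
    return ndvi, ndwi, evi, gci

# Plausible surface-reflectance values for a healthy wheat canopy
ndvi, ndwi, evi, gci = compute_vis(0.45, 0.08, 0.10, 0.05, 0.20)
```

The same arithmetic applies unchanged to per-pixel arrays when the inputs are NumPy rasters instead of scalars.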

Climate Data
In addition to the widely adopted VIs, we also extracted the climate variables within the growing season from two gridded data sources, including the Parameter-elevation Regressions on Independent Slopes Model (PRISM) dataset [47] and MODIS Land Surface Temperature (LST) MOD11A2 product [48]. The primary meteorological variables were derived based on the PRISM dataset with 4 km spatial resolution. These variables include daily total precipitation (PPT), daily minimum and maximum vapor pressure deficit (VPDmn, VPDmx), daily minimum, maximum, and mean temperature (Tmn, Tmx, Tmean), and daily mean dew point temperature (TDmean). Two additional variables, including daytime and nighttime land surface temperature (LST_D, LST_N), were extracted from the MOD11A2 product, which is an eight-day composite data product with a 1 km spatial resolution. In sum, the extracted factors can be divided into two categories: (1) water-related factors including PPT, VPDmn, and VPDmx, (2) temperature-related factors, including Tmean, Tmn, Tmx, TDmean, LST_D, and LST_N.

Soil Data
Soil properties, such as soil moisture and nutrients, also significantly influence plant growth. In this study, six types of soil properties, including organic carbon content (SOC), total nitrogen (TN), bulk density (BD), pH, sand content (SC), and clay content (CC), were derived from the US_SoilGrids100m soil maps [49]. The maps were generated to provide an estimation of soil properties for the CONUS at seven different depths (0, 5, 15, 30, 60, 100, and 200 cm) that are spatially continuous and internally consistent at a 100-m spatial resolution. Each map describes one soil property at a certain depth. All of the maps are available from the Penn State University repository at https://scholarsphere.psu.edu [49]. We considered the six soil properties at all seven depths since winter wheat roots have been shown to reach a depth of 200 cm [50].

Data Preprocessing
All of the extracted yield-affecting factors, including the VIs, climate variables, and soil properties, were spatially aggregated to the county level, and the Cropland Data Layer (CDL) at 30-m resolution provided by NASS [51,52] was used to identify and extract the wheat fields. The sequential variables, including the climate data and VIs, were collected and aggregated to a 16-day interval during the growing season, resulting in a total of twenty temporal records. The Google Earth Engine (GEE) platform [53] was used for these preprocessing steps. A total of 304 input variables were obtained for model development: 260 sequential features (four VIs and nine climate variables, with twenty temporal records each), 42 soil features (six soil properties at seven depths), and two years of historical yields. We also show the sample size for each year in Appendix A Figure A1; the "processed" samples in Figure A1 were used for the model development. The size difference among years is mainly due to two reasons. One is the long-term downward trend in winter wheat planting (Figure A1 "raw" samples), which is likely due to growing foreign competition in international wheat markets [6]. Additionally, in each year, samples with missing values in their features were removed from the model development.
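The 16-day temporal aggregation can be sketched with pandas (the dates and values here are synthetic; the study performed this step on the GEE platform, not with this code):

```python
import numpy as np
import pandas as pd

# Synthetic daily variable over one October-July growing season
days = pd.date_range("2015-10-01", "2016-07-31", freq="D")  # 305 days
daily = pd.Series(np.random.default_rng(0).uniform(0.0, 1.0, len(days)), index=days)

# Aggregate to 16-day means: a 305-day season yields 20 temporal records,
# matching the twenty records per variable described above
records = daily.resample("16D").mean()
```

The same resampling applies to each VI and climate variable independently before the county-level spatial aggregation.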

Model Development and Performance Evaluation
To build accurate and timely yield prediction models, the extracted variables were first explored, and the significant ones were then applied for model development. Different machine learning models were then compared and evaluated. Next, the best model was selected for analyzing the contribution of the multi-source data and the near real-time prediction performance within the growing season. Specifically, the entire workflow consists of the following five main steps: (1) select the most informative input variables using correlation analysis; (2) construct prediction models using two linear regression methods (OLS and LASSO) and four machine learning methods (SVM, RF, AdaBoost, and DNN); (3) compare the models in terms of prediction accuracy and spatial adaptability; (4) investigate the contribution of multiple data sources; and (5) evaluate the forecasting performance against timeliness.

Strategy for Input Variable Selection
We first applied the Pearson correlation coefficient (PCC) to select the variables that most affect the yield, in order to reduce the input data dimensionality. The PCC is a common method to measure the degree of correlation between two variables, and an absolute value greater than 0.5 indicates good correlation [54].
Specifically, the PCCs were calculated between (1) each input variable and the yield and (2) any two variables within each data source. For the sequential VIs and climate data, the mean value of each time series variable was used for the calculation. Subsequently, two variables were selected from each data source based on the following criteria: (1) the variable that had the maximum absolute correlation with the yield was selected; and (2) among the remaining factors, the one that had a correlation with the previously selected variable below a threshold (0.5 in this study) while being most correlated with the yield was also included. This strategy ensures the selection of informative yet diverse factors, improving efficiency while reducing redundancy in the modeling.
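The two-step selection rule can be sketched as follows; the function name and the toy data are illustrative, not the study's actual inputs:

```python
import pandas as pd

def select_two(df, yield_col, candidates, threshold=0.5):
    """Pick the candidate most correlated with yield, then the most
    yield-correlated remaining candidate whose correlation with the
    first pick stays below the threshold."""
    corr_with_yield = df[candidates].corrwith(df[yield_col]).abs()
    first = corr_with_yield.idxmax()
    for var in corr_with_yield.drop(first).sort_values(ascending=False).index:
        if abs(df[var].corr(df[first])) < threshold:
            return first, var
    return first, None

# Toy example: 'a' tracks yield exactly, 'b' duplicates 'a', 'c' is nearly independent
df = pd.DataFrame({"yield": [1, 2, 3, 4, 5],
                   "a": [1, 2, 3, 4, 5],
                   "b": [2, 4, 6, 8, 10],
                   "c": [5, 1, 4, 2, 6]})
picked = select_two(df, "yield", ["a", "b", "c"])  # 'b' is skipped as redundant with 'a'
```

In the toy data, `b` is perfectly correlated with the first pick `a`, so the rule falls through to the less redundant `c`.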

Machine Learning Algorithms for Yield Prediction
We explored two linear regression methods and four machine learning methods in this work. The dependent variable was the county-level winter wheat yield for the current year. The independent variables were the selected variables and they were normalized before the model development.
The models were trained on the data from 2008 to 2016, and the trained models were then evaluated on 2017 and 2018. The six algorithms were implemented in Python using the scikit-learn library, and their hyper-parameters were optimized through a five-fold cross-validated grid search with the "GridSearchCV" module [55]. A brief description of each algorithm is presented below.
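This tuning pattern can be sketched as follows; the data, the estimator, and the grid values are placeholders rather than the study's actual configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                              # placeholder features
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Five-fold cross-validated grid search over a small illustrative grid
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [4, 8]},
    cv=5,
)
search.fit(X, y)
best_model = search.best_estimator_                        # refit on the full training set
```

`GridSearchCV` refits the best configuration on all training data, so `best_model` is ready for evaluation on held-out years.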
OLS is a commonly used linear algorithm that aims to minimize the sum of the squared errors between the ground truth and the predicted values. However, OLS lacks penalties on the regression coefficients and thus may cause overfitting when there is a large number of input variables. LASSO, proposed by Tibshirani [56], is a shrinkage and selection algorithm for linear regression designed to reduce overfitting. It assumes that some independent variables are more important than others. Building on OLS, an L1-norm regularization term is added to the loss function, which not only reduces overfitting but also removes redundant information from the inputs. The main hyper-parameter of LASSO is the weighting parameter of the regularization term, which was set to 0.2 in this study.
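A minimal sketch of LASSO's sparsity effect, using the regularization weight of 0.2 stated above (the synthetic data are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)  # two informative inputs

# The L1 penalty drives the coefficients of uninformative inputs to exactly zero
lasso = Lasso(alpha=0.2).fit(X, y)
n_selected = int(np.sum(lasso.coef_ != 0))
```

With eight pure-noise inputs, LASSO typically retains only the two informative coefficients, illustrating the built-in variable selection the text describes.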
SVM is a class of algorithms characterized by the use of kernels, proposed by Gunn [57]. There are two main steps in an SVM regression model. First, the variables are mapped from the original space to a high-dimensional feature space by a kernel function. The kernel function can be either linear or non-linear, and it is determined by the relationship between the independent and dependent variables. Second, a linear model is constructed in the derived feature space to minimize the errors [58]. The main hyper-parameters include the kernel function, the kernel coefficient, and the regularizer weighting parameter, which were set to the radial basis function (RBF), 0.1, and 2, respectively, in this work.
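A sketch of the corresponding scikit-learn configuration (RBF kernel, kernel coefficient 0.1, regularizer weight 2); the synthetic data and the scaling step are illustrative additions:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(200, 4))
y = np.sin(2.0 * X[:, 0]) + 0.5 * X[:, 1]          # non-linear relationship

# The RBF kernel implicitly maps inputs to a high-dimensional feature
# space, where a linear model is fitted, as described above
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", gamma=0.1, C=2.0))
model.fit(X, y)
predictions = model.predict(X)
```

Scaling before SVR matters because the RBF kernel is distance-based; the text's normalization step plays the same role.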
RF is an ensemble learning method that has been widely used in different applications [59][60][61]. It is constructed by a large set of decision trees, with each tree being built using a random set of features and samples. RF is a bagging algorithm in which the decision trees are learned in parallel and independent of each other, with the final prediction combining the results from each individual tree. Two critical hyper-parameters include the total number of trees and the maximum depth of each tree, which were set to 400 and 8, respectively.
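A sketch with the two key hyper-parameters stated above (400 trees, maximum depth 8); the synthetic data are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6))
y = X[:, 0] * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=300)

# Bagging: each of the 400 trees is grown independently on a random
# subset of samples and features, and predictions are averaged
rf = RandomForestRegressor(n_estimators=400, max_depth=8, random_state=0).fit(X, y)
importances = rf.feature_importances_
```

The per-feature importances are a useful by-product for checking which yield determinants the ensemble actually relies on.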
AdaBoost is another ensemble learning method, proposed by Freund and Schapire [62]. In contrast to RF, it is a boosting algorithm in which the base learners are generated sequentially, and the final prediction is obtained by taking a weighted average of the base learners. In each iteration, the training samples are assigned different weighting factors, with more weight placed on poorly predicted instances, which allows AdaBoost to control the bias and variance. In this work, the decision tree was used as the base learner. The hyper-parameters, including the maximum number of estimators, the learning rate, and the loss function, were set to 400, 1, and "linear", respectively.
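A sketch with the stated settings (400 estimators, learning rate 1, linear loss); scikit-learn's `AdaBoostRegressor` uses a decision tree as its default base learner, consistent with the text, and the synthetic data are illustrative:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=300)

# Boosting: base trees are fitted sequentially, with sample weights
# increased on poorly predicted instances at each iteration
ada = AdaBoostRegressor(n_estimators=400, learning_rate=1.0, loss="linear",
                        random_state=0).fit(X, y)
```

Note that 400 is the maximum number of estimators; boosting may stop earlier if the training error vanishes.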
Deep learning is a breakthrough technology in machine learning and data mining, with a structure that provides sufficient capacity to learn feature representations from the data [63]. There are different types of architectures; our study applied the fully-connected DNN [64], which increases the depth of conventional ANNs [65]. A DNN contains one input layer, several hidden layers, and one output layer. Each layer holds a collection of conceptualized neurons, and information flows between layers through weighted connections and activation functions. In this study, we designed a seven-layer network, in which the numbers of neurons in the five hidden layers were set to 256, 128, 128, 64, and 24, respectively. We used ReLU as the activation function and applied the dropout strategy to improve the model generalization ability. Finally, we configured the model with the Adam optimizer and the mean absolute error loss function, and set the learning rate to 0.001.
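As a hedged sketch, scikit-learn's `MLPRegressor` can approximate this architecture; it lacks dropout and uses a squared-error loss rather than MAE, so it is not the exact network described, and the synthetic data are illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 12))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=500)

# Five hidden layers (256, 128, 128, 64, 24), ReLU activations, Adam
# optimizer, learning rate 0.001, per the text; no dropout available here
dnn = MLPRegressor(hidden_layer_sizes=(256, 128, 128, 64, 24),
                   activation="relu", solver="adam",
                   learning_rate_init=0.001, max_iter=300, random_state=0)
dnn.fit(X, y)
```

A deep-learning framework such as Keras would be needed to reproduce the dropout layers and MAE loss exactly.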

Metrics for Model Evaluation
We used the root mean squared error (RMSE), coefficient of determination (R2), and mean absolute error (MAE) to quantify the model prediction performance. Additionally, the spatial autocorrelation of the prediction errors, which can be measured by the Global Moran's I metric, reflects the model generalizability over the spatial domain, as demonstrated by previous studies [66][67][68][69]. The value of Global Moran's I ranges from -1 to 1 [70], where positive values indicate a trend of aggregation and negative values indicate a trend of dispersion. Values near zero indicate spatially random patterns, with higher randomness indicating a better model [71]. In this study, we used ArcGIS 10.6 to calculate the Global Moran's I.
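Global Moran's I can be computed directly from its definition, as in this sketch (the study used ArcGIS for this step; the toy rook-adjacency weights matrix here is illustrative):

```python
import numpy as np

def global_morans_i(errors, weights):
    """Global Moran's I of prediction errors, given a spatial weights
    matrix (weights[i, j] > 0 when units i and j are neighbours,
    diagonal zero)."""
    x = errors - errors.mean()
    n = errors.size
    num = n * (weights * np.outer(x, x)).sum()
    den = weights.sum() * (x ** 2).sum()
    return num / den

# Four locations on a line, rook-style neighbour weights
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
i_clustered = global_morans_i(np.array([1.0, 1.0, -1.0, -1.0]), w)  # similar neighbours
i_dispersed = global_morans_i(np.array([1.0, -1.0, 1.0, -1.0]), w)  # alternating signs
```

The clustered error pattern yields a positive I (aggregation), while the alternating pattern yields a negative I (dispersion), matching the interpretation above.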

Important Factor Selection
The PCC was calculated within each data source, except for the historical yields, as only the two previous years' yields were considered in this study. Additionally, it is worth noting that the climate variables were divided into two categories, the water-related and the temperature-related variables. The correlation coefficients for each group are shown in Figure 2, and the corresponding p-values are shown in Appendix A Figure A2. Generally, the VIs all demonstrated positive correlations with the yield, while the temperature-related factors were negatively correlated with the yield. For the other two groups, both positive and negative correlations were observed. The method described in Section 2.2.1 was used to select important factors from each source. For example, within the VI group, NDWI and EVI were selected, because NDWI showed the strongest correlation with the yield, and EVI had a correlation of less than 0.5 with the selected NDWI while showing the highest correlation with the yield among the remaining three factors. Similarly, we selected LST_D and Tmean among the temperature-related factors; VPDmx and PPT from the water-related factors; and SOC and CC within the soil properties.

Model Comparison
To compare the performance of the different models and evaluate the effectiveness of the factor selection, we trained the six models described in Section 2.2.2 using the full and the selected factors separately; the accuracies on the testing data are reported in Table 2. First, all of the approaches achieved an RMSE below 0.7 t/ha, an MAE below 0.6 t/ha, and an R2 above 0.75, demonstrating the efficacy of using multi-source data for yield prediction. Second, when comparing the different methods, we found that the non-linear machine learning approaches generally performed better than the two linear approaches, and the AdaBoost model outperformed all other approaches, achieving an R2 of 0.85 with the full factors and 0.86 with the reduced factors. Regarding the factor selection, improvements were observed in most cases, and the R2 increased significantly for the DNN model when the reduced factors were used. This was primarily because reducing the input factors greatly reduced the number of weights to be estimated in the neural network, which helped to prevent overfitting and enhanced the model generalization ability.
In addition, Figure 3 shows the agreement between the predicted and the reported yields for the models developed with the selected factors, for both the training (red) and testing (black) datasets. Again, among the six statistical models, the best agreement was found for AdaBoost (Figure 3e), and the two linear methods showed less agreement than the other approaches (Figure 3a,b). We also noticed that the yields were underestimated in some high-yield counties (green ellipses), notably in the OLS, LASSO, and SVM models, mainly because the number of high-yield samples was relatively small, making it difficult to model the yield variation.
For the following analysis, we excluded the OLS model, as it consistently showed the worst performance among all of the approaches.


The Spatial Patterns of Predicted Yield
We further reported the predicted yields for the two testing years (2017 and 2018) in Figure 4; the predictions were obtained by the five different models trained with the selected factors. In general, the spatial patterns of the predicted yield were consistent with the reported yields for all methods (Figure 4c-j), showing that high-yield counties were distributed in the west and east, while yields in the central region were relatively low. When comparing the different approaches, we noticed that yields in the northeast region in 2018 were underestimated by most models (black ellipses in Figure 4d,h,j,l), in agreement with the results shown in Figure 3 and likely due to the small sample size of the high-yield counties. Additionally, the predicted yield generated by the AdaBoost model (Figure 4e,f) had the spatial patterns most similar to the ground truth, while the linear LASSO model performed the worst, presenting both an underestimation in the northeast in 2018 (black ellipse in Figure 4l) and a significant overestimation in 2017 (red ellipse in Figure 4k). We also calculated the relative prediction error of each model to evaluate the adaptability of the models over the spatial domain; the error maps are shown in Figure 5.

Multi-Source Data Contribution
To demonstrate the effectiveness of using the multi-source data, we further compared the full multi-source model with several reduced models built from fewer data sources. In this analysis, the AdaBoost algorithm was used due to its superior performance, and only the selected factors from each data source were used for building the models. The prediction accuracies on the testing dataset, with different data source combinations, are shown in Figure 6. The best performance was obtained when all of the data sources were combined, with an R 2 of 0.86. Within the single data source group, the best performance was achieved using the soil data (R 2 = 0.77), and the historical yields and the sequential VIs performed slightly better than the climate variables. This was probably because the soil data had the finest spatial resolution, providing precise input for modeling, while the other sources had coarser resolutions, down to 4 km for several climate variables (Tmean, PPT, VPDmx). In addition to the single-source scenario, we also constructed two other reduced models using the combinations of sequential (VI and climate) and non-sequential (soil and historical yields) factors. In both cases, the results outperformed the predictions obtained from the corresponding single data sources, and an R 2 of 0.80 was achieved when the soil and historical yields were combined. However, the time-series factors were not able to explain more yield variation than the non-sequential combination, which might also be due to the coarse spatial resolution of these variables.
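This kind of source-ablation experiment can be sketched as follows. The factor groups, sample sizes, and signal structure are synthetic placeholders standing in for the paper's four data sources, not its actual variables:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n = 400
# Hypothetical factor groups standing in for the four data sources.
sources = {
    "VI": rng.normal(size=(n, 4)),
    "climate": rng.normal(size=(n, 4)),
    "soil": rng.normal(size=(n, 3)),
    "history": rng.normal(size=(n, 2)),
}
# Synthetic yield influenced by every source.
y = sum(block[:, 0] for block in sources.values()) + 0.2 * rng.normal(size=n)

def evaluate(groups):
    """Train/test AdaBoost using only the named factor groups."""
    X = np.hstack([sources[g] for g in groups])
    model = AdaBoostRegressor(n_estimators=100, random_state=0)
    model.fit(X[:300], y[:300])
    return r2_score(y[300:], model.predict(X[300:]))

for combo in [("soil",), ("VI", "climate"), ("soil", "history"),
              ("VI", "climate", "soil", "history")]:
    print("+".join(combo), f"R2={evaluate(combo):.2f}")
```

Because each source contributes independent signal here, the combined model recovers more of the yield variance than any single group, which is the qualitative pattern reported in Figure 6.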

Time-Series Prediction Performance
We then further investigated the model forecasting performance against timeliness to achieve real-time yield prediction. Within the growing season, we compared the multi-source full model with the reduced model trained only on the sequential data over time; again, the AdaBoost algorithm was applied in all cases. The results are reported separately for the two testing years in Figure 7.

When only considering the sequential data, the model accuracy increased significantly as time proceeded, and the best performance was achieved at harvest (July). Within the growing season, a noticeable accuracy improvement was observed between December and the following January, during which period the wheat starts the vernalization process and then transitions from vegetative to reproductive growth. During this period, the meteorological conditions are essential for developing the wheat head size and, thus, play an important role in modeling the yield [72,73]. In addition, we divided the entire growth period into three sub-seasons: (1) early season, from the preceding October to December; (2) middle season, from January to April; and (3) late season, from May to July. The prediction results (Appendix A Table A1) also demonstrated that each phenological stage has a certain impact on the end-of-season yield and, thus, that lacking the information from any stage would negatively affect the model performance.
In contrast, the full model showed a different pattern: it outperformed the sequential model and showed robust performance throughout the entire growing season. This was largely because the complementary information provided by the multi-source data contributed to a more stable model. For example, the historical yields and soil properties included in the full model can provide prior knowledge of the yield potential, leading to better early-season predictions. In addition, although the accuracy was stable, noticeable improvements were still observed from mid-April to mid-May, during which the plants go through the heading, flowering, and early grain-filling stages. The model prediction capacity might be strengthened by the stronger correlation between the VIs and the yield during the flowering and grain-filling stages, as previously demonstrated [74,75]. Moreover, winter wheat is highly sensitive to environmental stress during this stage [76,77], and the climatic variables would therefore contribute significantly to yield modeling within this period. These comparisons showed that the full model developed from the multi-source data is more robust over time, and that a high-performance prediction could be achieved in the middle of May, two and a half months before harvest.
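The within-season comparison can be sketched by truncating the sequential factors at successive cutoffs and retraining at each step. The monthly features, static "prior knowledge" factors, and signal weights below are synthetic assumptions designed to mimic the qualitative behavior described above, not the paper's data:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
n, n_months = 400, 10  # a hypothetical October-July growing season
# Monthly sequential factors (stand-ins for VI/climate composites); later
# months carry more yield signal, mimicking flowering and grain filling.
monthly = rng.normal(size=(n, n_months))
weights = np.linspace(0.2, 1.0, n_months)
# Static factors (stand-ins for soil and yield history) carry prior
# knowledge of the yield potential.
static = rng.normal(size=(n, 3))
y = monthly @ weights + 2.0 * static[:, 0] + 0.3 * rng.normal(size=n)

results = []
for cutoff in range(2, n_months + 1, 2):
    X_seq = monthly[:, :cutoff]            # sequential-only model
    X_full = np.hstack([X_seq, static])    # multi-source "full" model
    scores = []
    for X in (X_seq, X_full):
        m = AdaBoostRegressor(n_estimators=100, random_state=0)
        m.fit(X[:300], y[:300])
        scores.append(r2_score(y[300:], m.predict(X[300:])))
    results.append((cutoff, scores[0], scores[1]))
    print(f"months={cutoff:2d}  sequential R2={scores[0]:.2f}  full R2={scores[1]:.2f}")
```

Early in the season the static factors dominate, so the full model leads by a wide margin; as more months of sequential data accumulate, the gap narrows, reproducing the pattern seen in Figure 7.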

Performance of Machine Learning Models
This study aimed to combine machine learning methods and multi-source data to predict the winter wheat yield in the CONUS. We used both the full and selected factors from the multi-source data to develop two linear and four machine learning models for yield prediction. The results indicated that factor selection played an important role in improving model performance, in agreement with previous studies [19,60]. We noticed that the OLS method had lower accuracy than all of the other methods and that its performance could not be improved through factor selection. This is because OLS is a basic linear method that is unable to handle high-dimensional data. Although LASSO is also a linear approach, it incorporates a weight-shrinkage constraint, resulting in a better performance when dealing with a large number of input variables.
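The OLS-versus-LASSO contrast can be illustrated on a small synthetic problem with few samples and many candidate factors. The dimensions, coefficients, and LASSO penalty below are illustrative assumptions, not values from the study:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)
n_train, n_test, p = 60, 200, 50  # few samples, many candidate factors
X = rng.normal(size=(n_train + n_test, p))
# Only three factors truly drive the response; the rest are noise.
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.normal(size=n_train + n_test)

ols = LinearRegression().fit(X[:n_train], y[:n_train])
lasso = Lasso(alpha=0.1).fit(X[:n_train], y[:n_train])
r2_ols = r2_score(y[n_train:], ols.predict(X[n_train:]))
r2_lasso = r2_score(y[n_train:], lasso.predict(X[n_train:]))
print(f"OLS test R2:   {r2_ols:.2f}")
print(f"LASSO test R2: {r2_lasso:.2f}")
```

With nearly as many predictors as training samples, unregularized OLS fits the noise variables and generalizes poorly, whereas the L1 penalty shrinks irrelevant coefficients toward zero, which is why LASSO tolerates a large input dimension better.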
The results also demonstrated that the machine learning methods were superior to the two linear methods, consistent with previous studies showing that machine learning methods are good at handling nonlinear and complex datasets [78,79]. Among the four machine learning models, the AdaBoost method achieved the highest accuracy in this study. The DNN was inferior to the two ensemble learning methods (RF and AdaBoost) in this work, although it often outperforms traditional machine learning methods given a larger sample size [80,81]. This is mainly because the sample set, covering the winter wheat-growing counties over 11 years (2008-2018) in the CONUS, was still relatively small for training the DNN model. RF outperformed all of the other models except AdaBoost, probably because combining multiple base learners into one predictive model helps to decrease the variance [82]. In this study, the AdaBoost method performed better than RF, which was probably attributable to the boosting strategy, which pays more attention to the prediction errors and generates each new tree to minimize them.

Spatial Adaptability of Model Performance
The study found that, in some high-yield regions, the yields were underestimated by several models (Figure 3), in agreement with previous studies [12,67,83]. This may be due to the small quantity of high-yield samples, or to the fact that a higher yield is typically associated with a higher canopy density, which can cause a saturation issue in optical remote sensing and thus affect the model performance. We also compared the spatial adaptability of the different models using the global Moran's I, which showed that the prediction errors from all of the models had some spatial aggregation, indicating that their performances vary over space. Our study predicted yield for the entire CONUS, and such a large spatial scale understandably introduces more complex environmental conditions and greater variability in planting conditions, which can also lead to unstable modeling performance in the spatial domain [84,85]. It is noteworthy that the spatial aggregation of the LASSO model was higher than that of the machine learning methods, which indicated that this linear method was unable to model the complex spatial data [86,87]. Meanwhile, the AdaBoost model exhibited the best performance in addressing the spatial variation, with the lowest Moran's I value, indicating that the boosting strategy improves not only the overall accuracy but also the spatial adaptability.
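The global Moran's I statistic used here can be computed directly from a vector of prediction errors and a spatial weight matrix. The toy grid, rook adjacency, and error patterns below are illustrative assumptions, not the study's county geometry:

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I: spatial autocorrelation of `values` under a
    spatial weight matrix `weights` (zero diagonal)."""
    z = values - values.mean()
    num = (weights * np.outer(z, z)).sum()
    den = (z ** 2).sum()
    return len(values) / weights.sum() * num / den

# Toy 5x5 grid of "counties" with rook adjacency (shared edges).
side = 5
n = side * side
W = np.zeros((n, n))
for i in range(n):
    r, c = divmod(i, side)
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        rr, cc = r + dr, c + dc
        if 0 <= rr < side and 0 <= cc < side:
            W[i, rr * side + cc] = 1.0

rng = np.random.default_rng(4)
random_err = rng.normal(size=n)                    # spatially random errors
smooth_err = (np.arange(n) // side).astype(float)  # errors clustered by row
print(f"random errors:    I = {morans_i(random_err, W):.2f}")
print(f"clustered errors: I = {morans_i(smooth_err, W):.2f}")
```

Values near zero indicate spatially random errors, while values approaching 1 indicate clustering, so a lower Moran's I of the residuals (as reported for AdaBoost) signals better spatial adaptability.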

Impact of Multi-Source Data on Yield Prediction
We first used the PCC method to analyze the correlation between the yield and the variables from the multi-source data. The results showed that all of the VI factors were positively correlated with the yield, while the temperature-related factors were negatively correlated with it, consistent with a previous study [19]. This method allowed us to identify the important yield-affecting factors, including the NDWI and EVI from the VIs; LSD_T, Tmean, VPDmx, and PPT from the climate variables; SOC and CC from the soil properties; and the previous two years of yields. With the selected factors from the multi-source data, the AdaBoost model achieved the highest accuracy in this study.
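The PCC-based screening step can be sketched as follows. The factor names reuse the paper's abbreviations for readability, but the data, effect sizes, and selection threshold are synthetic assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
# Hypothetical candidate factors; names follow the paper's abbreviations.
names = ["NDWI", "EVI", "LSD_T", "Tmean", "VPDmx", "PPT",
         "SOC", "CC", "noise1", "noise2"]
X = rng.normal(size=(n, len(names)))
# Synthetic yield: VIs contribute positively, temperature negatively.
y = 0.8 * X[:, 0] + 0.6 * X[:, 1] - 0.5 * X[:, 3] + 0.4 * rng.normal(size=n)

# Pearson correlation coefficient between each factor and the yield.
pcc = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
selected = [name for name, r in zip(names, pcc) if abs(r) > 0.3]
for name, r in zip(names, pcc):
    print(f"{name:6s} r = {r:+.2f}")
print("selected:", selected)
```

Screening on the absolute correlation keeps both positively correlated factors (the VIs) and negatively correlated ones (temperature), matching the sign pattern reported above.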
We then evaluated the contribution of the different data sources to yield modeling. Among the single data sources, the soil factors achieved a better performance than the others. Soil properties represent the fertility of the soil, which directly affects the crop yield. Majchrzak et al. established a multiple regression to estimate the wheat yield using 16 soil properties in Illinois, and the final model explained 78% of the variation in the wheat yield (R 2 = 0.78) [88]. Our soil variables, extracted at seven different depths, provided more detailed information, leading to enhanced prediction performance [89]. We also noticed that the combinations of multiple sources generally outperformed the single data sources, with all data sources combined achieving the best performance. However, the combination of the VI and climate data underperformed the models using only the soil factors. One possible reason is that the resolutions of the VIs (500 m) and climate variables (4 km) were much lower than that of the soil maps (100 m), and such coarse resolutions might fail to capture the variation within the target unit [19]. Finally, we found that the multi-source data also played an important role in the time-series prediction. Compared with using only the sequential data (VIs and climate variables), the full model with multi-source data not only maintained stable accuracy over the growing season but also achieved good performance two and a half months before the harvest.

Uncertainties and Future Work
Like all models, the current study and modeling methods have some limitations. First, the spatial resolution of the remote sensing data used in this work was relatively coarse and may affect the accuracy of the yield prediction; for further studies, we would consider data with higher resolutions, such as Landsat and Sentinel. Second, spatial autocorrelation was not considered in this modeling, and it would be necessary to incorporate spatial analysis into the machine learning approaches in order to reduce the spatial clustering issue. Third, although this approach worked well in the CONUS, it would be necessary to test its adaptability for other crops and regions.

Conclusions
Because of the significant role of winter wheat in the U.S., it is essential to estimate its yield accurately and in a timely manner. For this purpose, we developed machine learning models to predict the county-level winter wheat yield for the CONUS using multi-source data. The results indicated that the machine learning methods outperformed the statistical methods, with the AdaBoost method performing the best. In addition, we examined the spatial autocorrelation of the prediction errors for the different models and, again, the AdaBoost model showed the least clustered pattern, indicating its stronger spatial adaptability in comparison with the other approaches. Furthermore, we explored the contribution of the different data sources and demonstrated the efficacy of using the multi-source data; we also found that the soil properties were more critical than the other data sources. Finally, we evaluated the model performance in a simulated real-time manner, with the results indicating that a reliable prediction could be achieved two and a half months before harvest.