remote

: Subseasonal-to-seasonal (S2S) prediction of winter wheat yields is crucial for farmers and decision-makers to reduce yield losses and ensure food security. Recently, numerous researchers have utilized machine learning (ML) methods to predict crop yield, using observational climate variables and satellite data. Meanwhile, some studies also illustrated the potential of state-of-the-art dynamical atmospheric prediction in crop yield forecasting. However, the potential of coupling both methods has not been fully explored. Herein, we aimed to establish a skilled ML–dynamical hybrid model for crop yield forecasting (MHCF v1.0), which hybridizes ML and a global dynamical atmospheric prediction system, and applied it to northern China at the S2S time scale. In this study, we adopted three mainstream machining learning algorithms (XGBoost, RF, and SVR) and the multiple linear regression (MLR) model, and three major datasets, including satellite data from MOD13C1, observational climate data from CRU, and S2S atmospheric prediction data from IAP CAS, used to predict winter wheat yield from 2005 to 2014, at the grid level. We found that, among the four models examined in this work, XGBoost reached the highest skill with the S2S prediction as inputs, scoring R 2 of 0.85 and RMSE of 0.78 t/ha 3–4 months, leading the winter wheat harvest. Moreover, the results demonstrated that crop yield forecasting with S2S dynamical predictions generally outperforms that with observational climate data. Our ﬁndings highlighted that the coupling of ML and S2S dynamical atmospheric prediction provided a useful tool for yield forecasting, which could guide agricultural practices, policy-making and agricultural insurance.


Introduction
Growing populations and an increasing frequency of extreme weather events are threatening wheat production and bringing great challenges to global food security [1][2][3][4][5]. Besides, timely crop yield forecasting on subseasonal-to-seasonal (S2S) time scales (2 weeks to 3 months) is of great interest to agricultural production, decision-making, market futures and the insurance industry [6][7][8]. Wheat is the world's most widely distributed cereal, model accuracy and comparison of machine learning models, and reports findings on the spatial distribution of yield forecasting and the optimum lead time. Sections 4 and 5 provide some further discussion and conclusions, respectively.

Study Area
The study area is the main winter-wheat-producing area in northern China with a temperate continental monsoon climate, including most areas of Hebei, Henan, Shandong province, and southwestern Shanxi province, central Shaanxi province (Figure 1). The main prevailing rotation mode is winter wheat and summer corn in this region. In general, the growing season of winter wheat starts in October and ends in June the following year [30].

Data and Preprocessing
Four types of data were used in this study: crop yield data, satellite data, observational climate data, and S2S atmospheric prediction data. The detailed description and sources of the datasets are shown in Tables 1 and 2. Firstly, all variables were aggregated into 0.5° spatial resolution and monthly temporal resolution. Then all variables were further masked based on the winter-wheat-planting areas. The sample size used in the study comprised 219 grids × 10 years × 9 months (19,710) in total.

Data and Preprocessing
Four types of data were used in this study: crop yield data, satellite data, observational climate data, and S2S atmospheric prediction data. The detailed description and sources of the datasets are shown in Tables 1 and 2. Firstly, all variables were aggregated into 0.5 • spatial resolution and monthly temporal resolution. Then all variables were further masked based on the winter-wheat-planting areas. The sample size used in the study comprised 219 grids × 10 years × 9 months (19,710) in total.  We collected county-level winter wheat yield data (t/ha) in the winter-wheat-producing regions of northern China from 2005 to 2014, which were gathered by the Agricultural Statistical Yearbook of the Ministry of Agriculture of China. The winter-wheat-planting area of 2014 was used to mask the winter wheat yields with a resolution of 1 km [31]. Overall, we selected 393 counties across the winter-wheat-planting areas in China. To match the spatial scale of other variables, the county-level winter wheat yields were assigned to a 0.5 • grid through weighted averaging.

Satellite Data
Enhanced Vegetation Index (EVI) has been proven to be superior in crop yield prediction to Normalized Difference Vegetation Index (NDVI), which is more sensitive to higher canopy Leaf Area Index [15,[32][33][34][35]. Thus, we chose EVI as the satellite indicator, which was derived from the MOD13C1 (Collection 6) product with a 16-day repeat and 0.05 • spatial resolution [22]. Finally, the EVI was resampled by the MVC (maximum synthesis method) to have 0.5 • spatial resolution and monthly time steps.

Observational Climate Data
Observational climate data were obtained from the Climatic Research Unit (CRU), including monthly maximum temperature, minimum temperature, mean temperature, precipitation, vapor pressure deficit (VPD), and growing degree day (GDD). One other variable was the Standardized Precipitation-Evapotranspiration Index (SPEI) [36]. VPD and GDD were calculated from the CRU variables using the following formulas [22].
Vap is the vapor pressure in CRU, e sat is the saturated vapor pressure (in hPa) at mean air temperature (T mp ). T mx , T mn are the monthly maximum temperature and minimum temperatures respectively, and T base is the lower temperature limits ( • C). The lower temperature for winter wheat is 0 • C.

S2S Climate Prediction Data
The S2S atmospheric prediction outputs were obtained from the IAP-CAS FGOALS-f2 dynamical forecasting system, which has been applied at China National Climate Center for real-time S2S prediction [37][38][39]. The prediction model is a Climate System Model representing the interaction between the atmosphere, oceans, land, and sea ice, which uses the time-lag perturbation method to generate 35 ensemble samples, and rolls on the 20th of each month to predict the climate conditions in the next 6 months [40]. The prediction products include the monthly and daily average and 6-h average forecast data of the next 6 months of the 4 modules of the atmosphere, ocean, land, and sea ice [37,38]. Studies have shown that FGOALS-f2 is skilled in predicting extreme events, such as summer drought and tropical cyclones genesis, which inevitably affect crop growth and factual yield [41,42]. In this study, we used the monthly atmospheric prediction outputs of FGOALS-f2 with a 0.5 • spatial resolution for the next whole month to forecast the winter wheat yield, including 925-hPa air temperature in K, 925-hPa eastward wind in m/s, 925-hPa northward wind in m/s, 925-hPa specific humidity in kg/kg, ground temperature in K, surface (2-m) air temperature in K, total precipitation rate in mm/h, and surface net shortwave radiation in W/m 2 .

Model Development
Four ML methods (MLR, SVR, RF, XGBoost) were adopted to establish prediction models between input variables and winter wheat yield. Before applying the models, the data preprocessing was carried out. Firstly, we randomly divided the whole dataset into 70% training data and 30% testing data [33,43,44], and then preprocessed the training data and testing data separately. This is because the testing data are often unavailable in real life, otherwise, it is equivalent to telling the model the answer to a part of the prediction in advance. Secondly, to have a mean of 0 and a standard deviation of 1, the training data and testing data were normalized respectively by the Z-score [22]. Next, the best hyper-parameters for each model were determined by the five-fold cross-validation using the GridSearchCV package [45,46]. Finally, the optimal models were determined based on the tuned hyperparameters, and the predicted R 2 was calculated on the testing dataset.

Multiple Linear Regression (MLR)
Multiple linear regression (MLR) is the most commonly used linear regression method that utilizes multiple explanatory variables to predict the response variable [44,47]. The goal of MLR is to establish a linear relationship between the independent variable and the dependent variable. Essentially, MLR is an extension of ordinary least squares, which involves more than one explanatory variable [48]. As mentioned above, the factors affecting crop yield are complex and interrelated. Therefore, it is reasonable for our study to choose MLR instead of simple linear regression.

Support Vector Machine Regression (SVR)
Support vector machine regression (SVR) is a model derived from Support Vector Machine (SVM). Similar to SVM, SVR finds a regression plane to minimize the distance of all inputs to this hyperplane. Generally speaking, SVR requires a kernel function to map all inputs from the original space to a high-dimensional space. Then, a linear function is constructed in this feature space to balance error minimization and overfitting [49,50]. The most commonly used kernel functions include linear kernels, polynomial kernels, and Gaussian radial basis kernels. In addition, the hyperparameters that need tuning are the penalty coefficient C and the kernel coefficient gamma.

Random Forest (RF)
The RF model is an ensemble learning method for both regression and classification, which was introduced by Breiman [51]. RF is composed of many decision trees. In the training phase, the input training dataset is divided into multiple different sub-training datasets using bootstrap sampling, and then each sub-training set is used to generate a single decision tree. After training each single decision tree, it gives a prediction. Finally, predictions of each decision tree are averaged to provide the final prediction [47,52]. Moreover, RF is more robust and insensitive to noise. The number of decision trees (n_estimators), the maximum depth (max_depth) are the need-to-be-tuned hyper-parameters in our study.

EXtreme Gradient Boost (XGBoost)
XGBoost is a scalable machine learning algorithm for tree boosting proposed by Chen and Guestrin [53]. The basic principle of the XGBoost algorithm is to build multiple weak learners on the full data and aggregate the modeling results of all weak learners to obtain better regression or classification performance. XGBoost incorporates a regularized model to prevent overfitting and the weak learner can be a regression tree or a linear model [53][54][55]. The XGBoost model in this research is established based on the decision tree. The predicted values in XGBoost are obtained by directly summing the leaf weights on all decision trees. The optimal parameters of the XGBoost algorithm are obtained using the GridSearchCV package [45,46].

Model Evaluation
We conducted leave-one-year-out cross-validation to assess the practicality of the models, i.e., using all years' data from 2005 to 2014 except the target year to train the model and then make a prediction for the target year [28]. This approach is an extensively used cross-validation method because of its simplicity, universality, and superiority in avoiding the issue of over-fitting [28,33,56]. To evaluate the model performance, the coefficient of determination (R 2 ), root-mean-square error (RMSE), and percent error (PE) were selected as the evaluation metrics in this paper. The RMSE and PE are calculated by Formulas (3) and (4).
y i is the predicted yield,ŷ i is the observed yield, and n is the number of samples.

Experimental Design
Three experiments were designed to address the aims of our study as outlined in the introduction. Firstly, we trained crop yield models with observational climate data and S2S atmospheric prediction outputs separately to identify which data source is superior for crop yield forecasting. Secondly, we compared several yield prediction models with different algorithms to find the model with the highest accuracy. Thirdly, we performed in-season prediction to assess the lead time that can reasonably predict the winter wheat yield. The in-season prediction started at the beginning of the growing season and ended one month before harvest. The flow diagram of the yield modeling steps is presented in Figure 2.
S2S atmospheric prediction outputs separately to identify which data source is superior for crop yield forecasting. Secondly, we compared several yield prediction models with different algorithms to find the model with the highest accuracy. Thirdly, we performed in-season prediction to assess the lead time that can reasonably predict the winter wheat yield. The in-season prediction started at the beginning of the growing season and ended one month before harvest. The flow diagram of the yield modeling steps is presented in Figure 2.

Comparison of Different Data Sources on Model Accuracy
To compare the performance of observational climate data with S2S dynamical atmospheric prediction outputs in yield forecasting, three groups of data were integrated into the selected models-namely, S2S dynamical atmospheric prediction outputs (S2Sbased), observational climate data (Observation-based), and the combination of the two (S2S+Observation-based). As mentioned earlier, the hybrid of S2S atmospheric prediction outputs and ML have been called MHCF v1.0, while the combination of observational climate data and ML are called the basic model, i.e., benchmark. We found that the performance of the S2S-based model was better than the observation-based model in all selected ML models, but not for MLR ( Figure 3). S2S-based alone outperformed the

Comparison of Different Data Sources on Model Accuracy
To compare the performance of observational climate data with S2S dynamical atmospheric prediction outputs in yield forecasting, three groups of data were integrated into the selected models-namely, S2S dynamical atmospheric prediction outputs (S2Sbased), observational climate data (Observation-based), and the combination of the two (S2S+Observation-based). As mentioned earlier, the hybrid of S2S atmospheric prediction outputs and ML have been called MHCF v1.0, while the combination of observational climate data and ML are called the basic model, i.e., benchmark. We found that the performance of the S2S-based model was better than the observation-based model in all selected ML models, but not for MLR ( Figure 3). S2S-based alone outperformed the observation-based, as well as the combination of both, among the ML-based models, with an R 2 of 0.85, 0.84, and 076 in XGBoost, RF, and SVR (RMSEs of 0.78 t/ha, 0.82 t/ha, and 0.99 t/ha), respectively. Meanwhile, the prediction R 2 values of the observation-based models were 0.81, 0.8 and 0.65 in XGBoost, RF and SVR (RMSEs were 0.89 t/ha, 0.92 t/ha and 1.21 t/ha), respectively.
Additionally, we noticed that the S2S+observation-based model performed almost equivalently to the S2S-based model alone for XGBoost and RF, and worse for SVR, indicating that S2S atmospheric prediction outputs plus observational climate variables as model inputs do not add extra contributions. This phenomenon may largely be because the S2S atmospheric prediction outputs are generated based on a coupled prediction system of the atmosphere, ocean, land, and sea ice, which is different from the observational climate data that are collected based on the conditions of the historical atmosphere. Therefore, the overlapping of variables from different systems will result in them restricting each other and will ultimately worsen the model performance for the ML methods. observation-based, as well as the combination of both, among the ML-based models, with an R 2 of 0.85, 0.84, and 076 in XGBoost, RF, and SVR (RMSEs of 0.78 t/ha, 0.82 t/ha, and 0.99 t/ha), respectively. Meanwhile, the prediction R 2 values of the observation-based models were 0.81, 0.8 and 0.65 in XGBoost, RF and SVR (RMSEs were 0.89 t/ha, 0.92 t/ha and 1.21 t/ha), respectively. Additionally, we noticed that the S2S+observation-based model performed almost equivalently to the S2S-based model alone for XGBoost and RF, and worse for SVR, indicating that S2S atmospheric prediction outputs plus observational climate variables as model inputs do not add extra contributions. This phenomenon may largely be because the S2S atmospheric prediction outputs are generated based on a coupled prediction system of the atmosphere, ocean, land, and sea ice, which is different from the observational climate data that are collected based on the conditions of the historical atmosphere. Therefore, the overlapping of variables from different systems will result in them restricting each other and will ultimately worsen the model performance for the ML methods.
We also evaluated the contribution of EVI to the candidate models ( Figure 4). As shown in Figures 4 and 5, whether for R 2 or RMSE, the inclusion or not of EVI did not significantly change the difference between the two models of MHCF v1.0 and Benchmark among the four selected algorithms. The result showed that EVI contributed little to the predictive capability. Therefore, in the following analysis, both the MHCF v1.0 and Benchmark models do not include EVI.

Comparison of Machine Learning Models
As stated in Section 3.1, the S2S-based model alone achieved the highest accuracy among the ML-based models (XGB>RF>SVR) (Figures 3 and 4). For MLR, the performance of the S2S-based model was the worst among the three groups of input variables, followed by observation-based, and then S2S+observation-based, which had the most input data, and achieved the highest R 2 , which was completely different from the results obtained by the ML-based models. One possible explanation for this is that the MLR method essentially captures the correlation between input variables and yield [47], and the correlation between observational climate data and yield is much greater than that between S2S atmospheric prediction and yield ( Figure 6). Thus, the ML models (i.e., XGBoost, RF, and SVR) outperformed the linear method (i.e., MLR), since most relationships between yield and different variables are nonlinear, and ensemble learning methods are better able to capture these relationships than the linear method [22]. We also evaluated the contribution of EVI to the candidate models ( Figure 4). As shown in Figures 4 and 5, whether for R 2 or RMSE, the inclusion or not of EVI did not significantly change the difference between the two models of MHCF v1.0 and Benchmark among the four selected algorithms. The result showed that EVI contributed little to the predictive capability. Therefore, in the following analysis, both the MHCF v1.0 and Benchmark models do not include EVI.

Comparison of Machine Learning Models
As stated in Section 3.1, the S2S-based model alone achieved the highest accuracy among the ML-based models (XGB > RF > SVR) (Figures 3 and 4). For MLR, the performance of the S2S-based model was the worst among the three groups of input variables, followed by observation-based, and then S2S+observation-based, which had the most input data, and achieved the highest R 2 , which was completely different from the results obtained by the ML-based models. One possible explanation for this is that the MLR method essentially captures the correlation between input variables and yield [47], and the correlation between observational climate data and yield is much greater than that between S2S atmospheric prediction and yield ( Figure 6). Thus, the ML models (i.e., XGBoost, RF, and SVR) outperformed the linear method (i.e., MLR), since most relationships between yield and different variables are nonlinear, and ensemble learning methods are better able to capture these relationships than the linear method [22].

Spatial Patterns of Yield Predictions at the Grid Level
The spatial distributions of predicted yields are also provided (Figure 7). We only evaluated the predictions made by XGBoost and RF for 2005, 2010, and 2014, using the S2S atmospheric prediction and observational climate data as training inputs, respectively, as they showed the best prediction skill among the four candidate models. Figure 7 shows the spatial patterns of predicted yield by XGBoost, and Figure 8 shows the results predicted by RF. In general, the spatial patterns of the predicted yield were consistent with the recorded yield. The high-yield grids were mainly concentrated in the east of the planting area, while the grids with low yield were mainly distributed in the west. As shown in Figure 7, some high-yield grids in the east were slightly underestimated. The prediction yields obtained by RF in Figure 8 are roughly similar to those of XGBoost.

Spatial Patterns of Yield Predictions at the Grid Level
The spatial distributions of predicted yields are also provided (Figure 7). We onl evaluated the predictions made by XGBoost and RF for 2005, 2010, and 2014, using th S2S atmospheric prediction and observational climate data as training inputs, respec tively, as they showed the best prediction skill among the four candidate models. Figur  7 shows the spatial patterns of predicted yield by XGBoost, and Figure 8 shows the result predicted by RF. In general, the spatial patterns of the predicted yield were consisten with the recorded yield. The high-yield grids were mainly concentrated in the east of th planting area, while the grids with low yield were mainly distributed in the west. A shown in Figure 7, some high-yield grids in the east were slightly underestimated. Th prediction yields obtained by RF in Figure 8 are roughly similar to those of XGBoost. To further compare the performances of different sources of input data, we present the mean prediction PE of winter wheat yield predicted by XGBoost. It was found that grids with low yield tended to have larger PE (≥25%) (Figure 9a,b). The PE differences between Benchmark and MHCF v1.0 were obtained by the PE of Benchmark to minus that of MHCF v1.0 (Figure 9c). The model performance from integrating different sources of meteorological data also corresponded with crop yield. In other words, the high-yield grids using the MHCF v1.0 model achieved a satisfactory R 2 compared to the low-yield grids, whereas, in contrast, the combination of low-yield grids and Benchmark model achieved better predictive precision.     between Benchmark and MHCF v1.0 were obtained by the PE of Benchmark to minus that of MHCF v1.0 (Figure 9c). The model performance from integrating different sources of meteorological data also corresponded with crop yield. In other words, the high-yield grids using the MHCF v1.0 model achieved a satisfactory R 2 compared to the low-yield grids, whereas, in contrast, the combination of low-yield grids and Benchmark model achieved better predictive precision.

Optimum Lead Time of Yield Prediction
To investigate the limit of making an optimum early yield prediction, we conducted an in-season prediction experiment on winter wheat yield using XGBoost and RF. The inseason predictions were made in monthly intervals, which means we added the climate information for a more recent month at each time step. In general, better performance from the model was achieved with more input data, i.e., as the prediction period approached the end of the growing season, the R 2 increased and the RMSE decreased [57]. As shown in Figure 10, XGBoost and RF performed comparably in terms of predictive capability for yield forecasting, sharing the same trend. Although RF achieved a stable R 2 and RMSE earlier than XGBoost, XGBoost had a higher R 2 and lower RMSE than RF. Overall, the performance of both XGBoost and RF increased with the accumulation of data. In terms of XGBoost, the R 2 ranged from 0.77 to 0.85 and 0.75 to 0.81 (RMSE from 0.97 to 0.78 t/ha and 1.01 to 0.89 t/ha) for MHCF v1.0 and Benchmark model, respectively (Figure 10), while for RF, the R 2 ranged from 0.78 to 0.84 and 0.75 to 0.8 (RMSE from 0.94 to 0.82 t/ha and 1.03 to 0.9 t/ha) for the two models, respectively ( Figure 10). The results confirmed that the S2S atmospheric predictions were superior to the observational climate variables as model inputs, and XGBoost was slightly superior to RF in predicting winter wheat yield.

Optimum Lead Time of Yield Prediction
To investigate the limit of making an optimum early yield prediction, we conducted an in-season prediction experiment on winter wheat yield using XGBoost and RF. The in-season predictions were made in monthly intervals, which means we added the climate information for a more recent month at each time step. In general, better performance from the model was achieved with more input data, i.e., as the prediction period approached the end of the growing season, the R 2 increased and the RMSE decreased [57]. As shown in Figure 10, XGBoost and RF performed comparably in terms of predictive capability for yield forecasting, sharing the same trend. Although RF achieved a stable R 2 and RMSE earlier than XGBoost, XGBoost had a higher R 2 and lower RMSE than RF. Overall, the performance of both XGBoost and RF increased with the accumulation of data. In terms of XGBoost, the R 2 ranged from 0.77 to 0.85 and 0.75 to 0.81 (RMSE from 0.97 to 0.78 t/ha and 1.01 to 0.89 t/ha) for MHCF v1.0 and Benchmark model, respectively (Figure 10), while for RF, the R 2 ranged from 0.78 to 0.84 and 0.75 to 0.8 (RMSE from 0.94 to 0.82 t/ha and 1.03 to 0.9 t/ha) for the two models, respectively ( Figure 10). The results confirmed that the S2S atmospheric predictions were superior to the observational climate variables as model inputs, and XGBoost was slightly superior to RF in predicting winter wheat yield.
Regarding the optimum forecasting time, the MHCF v1.0 models (XGBoost and RF) reached a stable R 2 of 0.85 and 0.84 in February and January, respectively, with a lead time of about four months before the harvesting of winter wheat (Figure 10a). By contrast, the Benchmark models (XGBoost and RF) resulted in the highest R 2 of 0.81 and 0.8 in March and January for XGBoost and RF, respectively, with a lead time of three months (Figure 10b). Our work achieves a more satisfactory lead time before harvest compared with previous studies [22,44,57]. The findings demonstrate that S2S atmospheric predictions have the potential to achieve an earlier optimum yield forecast than observational climate data as model inputs. Regarding the optimum forecasting time, the MHCF v1.0 models (XGBoost and RF) reached a stable R 2 of 0.85 and 0.84 in February and January, respectively, with a lead time of about four months before the harvesting of winter wheat (Figure 10a). By contrast, the Benchmark models (XGBoost and RF) resulted in the highest R 2 of 0.81 and 0.8 in March and January for XGBoost and RF, respectively, with a lead time of three months ( Figure  10b). Our work achieves a more satisfactory lead time before harvest compared with previous studies [22,44,57]. The findings demonstrate that S2S atmospheric predictions have the potential to achieve an earlier optimum yield forecast than observational climate data as model inputs.

Model Performance
In this work, we have further developed the four most commonly used models (XGBoost, RF, SVR, MLR) for winter wheat yield forecasting. Linear regression generally explains the linear relationship between input and output variables, but from Figure 6, we could speculate that yield is not the result of a linear relationship with the input variables. In this study, the MLR method performed the worst for simulating yield among the developed models, which simply confirmed this point. ML algorithms have shown strong predictive capability, especially for XGBoost and RF. Thus, ML methods can achieve superior performance in capturing the complex relationships between S2S atmospheric predictions and yield, and they provide a window of opportunity to predict crop yield at regional, or even global, scales. Here, we built yield prediction models for winter wheat within the framework of ML because of its simplicity and efficiency, but also its limitations. By contrast, deep learning, with its higher accuracy at the cost of computational intensity and model complexity, may have great potential for improving global grain yield forecasting, and it is essential to find a balance between complexity and efficiency for future research [58]. Moreover, the hybridization of process-based crop growth models

Model Performance
In this work, we have further developed the four most commonly used models (XGBoost, RF, SVR, MLR) for winter wheat yield forecasting. Linear regression generally explains the linear relationship between input and output variables, but from Figure 6, we could speculate that yield is not the result of a linear relationship with the input variables. In this study, the MLR method performed the worst for simulating yield among the developed models, which simply confirmed this point. ML algorithms have shown strong predictive capability, especially for XGBoost and RF. Thus, ML methods can achieve superior performance in capturing the complex relationships between S2S atmospheric predictions and yield, and they provide a window of opportunity to predict crop yield at regional, or even global, scales. Here, we built yield prediction models for winter wheat within the framework of ML because of its simplicity and efficiency, but also its limitations. By contrast, deep learning, with its higher accuracy at the cost of computational intensity and model complexity, may have great potential for improving global grain yield forecasting, and it is essential to find a balance between complexity and efficiency for future research [58]. Moreover, the hybridization of process-based crop growth models and/or ML with deep learning, as well as multi-source data, also has the potential to offer an improved and optimized technique for yield forecasting [17,59].

Comparison of Different Data Sources
Our findings demonstrated that the S2S atmospheric predictions outperformed those based on observational climate data. This may be largely due to S2S dynamical atmospheric predictions being able to provide more information over the simple use of observational climate data to simulate possible future scenarios, confronted with climate change, together with climate teleconnections between a region of interest and other parts of the globe [27,60]. Compared to previous works, which usually used observed climate data and satellite data, the inclusion of S2S atmospheric prediction data in our study greatly improved the accuracy of winter wheat yield prediction. More S2S dynamic prediction models from various institutions (e.g., IAP-CAS, ECMWF, NECP-CFSv2, etc.) could be used for comparison to improve yield forecasting for future study [41,42]. In further research, deep learning methods should be considered to improve the accuracy of S2S atmospheric prediction [34,61]. Besides, the horizontal resolution of S2S atmospheric prediction systems should be increased to a convection-permitting resolution (<10 km), to provide crop yield predictions at finer spatial scales.
For the satellite data, only EVI was included in our study, and the results showed that EVI contributed little to the yield prediction model. The existing study showed that EVI provides most information in the peak season because that peak-season EVI contains biotic or abiotic stress information and may not be captured by the accumulated climate information [22]. However, climate variables over the whole growing period play critical roles in determining the wheat growth and final wheat yield. Thus, the performance of EVI for yield prediction was insignificant for predictions based on the whole growing season ( Figure 5). Moreover, the newly emerging satellite Sun-Induced Chlorophyll Fluorescence (SIF) contains information about the physiological, biochemical, and the fraction of absorbed photosynthetically active radiation (fPAR) [62,63], and some studies indicated that SIF provided feasible satellite data for crop yield estimation at a larger scale, and it was not inferior to EVI [33,43]. For future research, SIF will undoubtedly play an important role in crop yield prediction.

Optimum Lead Time of Winter Wheat Yield Prediction
Our results also indicated the capability of MHCF v1.0 in winter wheat yield prediction, which achieved the highest R 2 , ranging from 0.76-0.85, with a 3-4-month lead time ( Figure 10). The model performance and lead time are consistent with, or better than, existing previous works; for example, a study about wheat yield prediction in Australia showed that their model achieved R 2 ranging from 0.73-0.75, with a 2-month lead time before harvest [22], and another study also predicted winter wheat yield in China and achieved higher accuracy, with R 2 ranging from 0.79-0.81, 1-2 months in advance of the harvest dates [56]. These findings show that the inclusion of S2S atmospheric predictions, as in our work, improves yield prediction performance, and that has increased as the S2S atmospheric prediction outputs can normally be available. In our study, we used the S2S atmospheric prediction data with a forecast length of 1 month in advance. Further research can compare the accuracy of S2S prediction outputs with different forecast lengths (1-6 months in advance) for the yield prediction model.

Conclusions
This study aimed to establish a forecasting model (MHCF v1.0) for the major production areas of winter wheat yield in northern China, using ML driven by an S2S atmospheric prediction system. To this end, we incorporated various datasets, including satellite data, observational climate data, and S2S atmospheric prediction data, for the four models (MLR, SVR, RF, XGBoost). We designed several experiments to test the model performance and compare the predictive performance of S2S atmospheric predictions, observational climate data, and the combination of meteorological and satellite data. Firstly, our results indicated that S2S atmospheric predictions are superior in their forecasting performance, in terms of crop yield, as compared with observational climate data, and the inclusion of S2S atmospheric predictions largely improves yield prediction performance. Secondly, we demonstrated that ML methods, especially ensemble learning models, perform significantly better than linear-regression-based methods (XGB > RF > SVR > MLR). Thirdly, a skilled prediction, 3-4 months before the harvest, was achieved by XGBoost, with R 2 ranging from 0.81-0.85 and RMSE ranging from 0.78-0.89 t//ha. This research proved that MHCF v1.0 is a novel and promising technique for regional yield prediction, and we hope that it can be extended to crop yield forecasts in other regions, and even on a global scale.