Improving the Prediction of Grain Protein Content in Winter Wheat at the County Level with Multisource Data: A Case Study in Jiangsu Province of China

Abstract: Wheat is an important food crop in China, and its quality affects the development of the agricultural economy. However, domestic production of high-quality wheat cannot meet demand, so developing high-quality wheat is an important research direction. Grain protein content (GPC) is a key criterion of winter wheat quality. Studying the spatial heterogeneity of wheat grain protein helps predict wheat quality and guides quality identification, grading, and processing. Because wheat quality is complex and variable, conventional evaluation methods suffer from low accuracy and poor applicability. To better predict GPC, geographically weighted regression (GWR), multiple linear regression (MLR), random forest (RF), back-propagation neural network (BPNN), support vector machine (SVM), and long short-term memory (LSTM) algorithms were applied to meteorological data (March to May) and soil data of Jiangsu Province for 2019–2022. At the county level in Jiangsu, winter wheat GPC rose by 0.17% with every 0.1° increase in north latitude. A comparison of the coefficient of determination, mean bias error, root mean square error, and mean absolute error across the algorithms showed that the GWR model was the most accurate, followed by the RF model. The regression coefficient of mean temperature in April showed the smallest range of variation among all factors, indicating that it had a more stable effect on GPC in the study area than the other factors. Therefore, taking spatial information into account might be beneficial in predicting county-level winter wheat GPC. GWR models based on meteorological and soil factors enrich the studies on predicting wheat GPC from environmental data. They might be applied to predict winter wheat GPC and improve wheat quality, and thus to better guide large-scale production and processing.


Introduction
Wheat is one of the most important food crops in the world, feeding nearly 40% of the global population [1] and providing approximately 20% of the energy and 22% of the protein in daily human diets. In recent years, the demand for agricultural products, including wheat, has shifted from availability to quality. Although China is a large agricultural country with stable wheat production, current domestic production cannot meet the demand for high-quality strong-gluten wheat, and the gap is commonly filled by imports.
The quality of wheat determines its purchase price, processing purpose, and use value, and in turn affects the profitability of wheat production [2]. As a key indicator of wheat quality, grain protein content (GPC) depends mainly on variety, climate during growth, and soil fertility, among which meteorological factors have the greatest influence [3]. Many studies have analyzed the effect of climate on winter wheat GPC using daily or monthly data [4]. They have demonstrated that meteorological conditions during grain filling, including temperature, radiation, and precipitation, can influence the protein content and composition of wheat grain [5]. However, the effect of precipitation on GPC remains uncertain: it has been reported to be both positively [5] and negatively [6] correlated with GPC. In addition to variety and environment, wheat GPC also varies with geographic location. Some studies found that, across latitudes and ecological environments, differences in wheat GPC between locations were greater than differences between varieties. Therefore, studying the spatial heterogeneity of wheat GPC can help improve the quality of processed wheat products and guide large-scale production and processing [7].
For a long time, researchers in China and abroad have conducted numerous experiments to obtain physiological and biochemical parameters in order to advance the study of spatial heterogeneity in wheat. However, although existing methods can provide an initial assessment of wheat quality, they remain insufficient for practical application, because the weights of multiple indicators are not clearly distinguished and prediction accuracy needs improvement [8].
Statistical models and simulation models are the two main approaches to studying spatial heterogeneity in wheat based on environmental information [9]. Statistical models, which relate recorded environmental elements to crop GPC [10] and quality [4], can complement simulation models. Previous studies have developed various statistical models, including multiple linear regression, stepwise regression analysis, spatial lagged effects models, and hierarchical linear regression models [11], to study spatial heterogeneity in wheat based on meteorological parameters [12]. Crop simulation models are designed to explore the effect of climate on winter wheat GPC in the absence of agricultural management [13], but they focus solely on the direct effects of climatic factors on crop growth and ignore indirect effects such as crop losses due to adverse meteorological conditions or pests and diseases [14].
In contrast, machine learning has several advantages. It is highly data-driven and can build powerful nonlinear regression models that effectively mine and exploit detailed information such as spatial structure, which supports smarter, more efficient, and more accurate decisions and services. Combined with multisource data, machine learning has great potential for investigating spatial heterogeneity [15].
Because of variable geographic factors and a diverse climate, wheat GPC and other wheat characteristics vary greatly from region to region and from year to year under normal growth conditions, which is a serious obstacle to the study of spatial heterogeneity [16]. To address this issue, the geographically weighted regression (GWR) model, which is based on local regression, has been proposed for assessing whether the spatial relationships in regional data are stable [17]. This model integrates the geographic location of the selected sampling points, examines the variation in their spatial parameters [18], and can explore the spatial variability of a study object at a certain scale [19]. It is also widely used in sociology, agroecology, and soil surveys. It is therefore worthwhile to use the GWR model to assess the spatial heterogeneity of winter wheat GPC at the county level. Although the effects of meteorological parameters at various locations on GPC have been widely investigated, studies on the effects of soil elements on the spatial heterogeneity of winter wheat GPC are still lacking.
In summary, few studies have assessed spatial heterogeneity and spatio-temporal variability in county-level wheat GPC. The novelty of this study lies in using multiple machine learning algorithms to fuse GPC, soil, and meteorological data from 4 years of multi-site winter wheat trials in Jiangsu Province, in order to construct a county-level prediction model of winter wheat GPC and to study the spatial heterogeneity of GPC through correlation analysis, feature combination, feature selection, and feature importance assessment. The objectives of this study were (1) to investigate the correlation of county-level winter wheat GPC with latitude, (2) to evaluate the accuracy of models built with multiple machine learning algorithms, and (3) to assess the influences of various meteorological and soil factors on the heterogeneity of county-level winter wheat GPC in Jiangsu.

Study Region
The wheat GPC data and environmental parameters in this study were obtained from Jiangsu Province (30°45′ N~35°20′ N, 116°18′ E~121°57′ E), China, from 2019 to 2022. Jiangsu has a variety of climate types, including subtropical humid monsoon, subtropical monsoon, and warm temperate humid and semi-humid monsoon. Approximately 25% of the entire winter wheat growing region in China (approximately 607.5 million ha) is in Jiangsu Province, which leads to a high representativeness of our results for winter wheat yield prediction in China (National Bureau of Statistics of China, 2019 to 2022).

Winter Wheat GPC at the County Level
The winter wheat GPC data were collected from the Jiangsu Wheat Quality Report of the General Agricultural Technology Extension Station of Jiangsu Province. The dataset is based entirely on experimental measurements of samples of varying grades collected from counties and districts across Jiangsu Province, reported to two decimal places. The GPC data for 2019 to 2022 comprise 1253 samples from 73 counties in the region.

Environmental Data
In the four study years, winter wheat in Jiangsu Province was harvested between late May and late June. This work focuses on the key growth stages of winter wheat that affect GPC formation, such as stem elongation, germination, and partial filling, so the meteorological data analyzed in this paper cover March to May of each year. The mean temperature in March (MT03), April (MT04), and May (MT05), mean maximum temperature in March (TMAX03), April (TMAX04), and May (TMAX05), mean sunshine duration in March (MSD03), April (MSD04), and May (MSD05), and precipitation in March (PRE03), April (PRE04), and May (PRE05) from 2019 to 2022 were acquired from the ECMWF climate data shared service system (https://cds.climate.copernicus.eu/, accessed on June 2022). To examine the effects of soil factors on GPC, basic soil attribute data were collected in field experiments at the experimental sites in different locations and years; these included alkali-hydrolyzable nitrogen (N), available phosphorus (P), rapidly available potassium (K), and soil organic matter (SOM). More experimental details are given in Ruan et al. [20]. The 12 meteorological factors and 4 soil factors used in this paper are listed in Table 1.

Estimation Method
This study explored the linear relationship between latitude and county-level winter wheat GPC using SPSS (The SPSSAU project (2021), SPSSAU (Version 21.0) Online Application Software; retrieved from https://www.spssau.com, accessed on August 2022). Two statistical models, GWR and multiple linear regression (MLR), and four machine learning algorithms, namely, random forest (RF), BP neural network (BPNN), support vector machine (SVM), and long short-term memory (LSTM), were used to construct and evaluate county-level winter wheat GPC prediction models. SHapley Additive exPlanations (SHAP) and sensitivity analysis were used to study the spatial heterogeneity of winter wheat GPC in Jiangsu Province. Out-of-sample prediction in this study was modeled with the Python Spatial Analysis Library (PySAL).

Geographically Weighted Regression
GWR is a spatial analysis technique that explores the spatial variability of a study object at a given scale, and the associated drivers, by fitting a local regression equation at each point in the study area. Because it accounts for the local influence of each spatial object, GWR excels in accuracy. The specific expression is as follows:

$$y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{n} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i$$

where $y_i$ is the GPC of the i-th county in Jiangsu Province; $n$ is the number of independent variables in the model; $(u_i, v_i)$ represents the geographic location of the i-th county; $\beta_0(u_i, v_i)$, $x_{ik}$, and $\beta_k(u_i, v_i)$ are the intercept, the value of the k-th explanatory variable, and the local regression coefficient of the k-th variable, respectively; and $\varepsilon_i$ is the error term. The sign of the regression parameter $\beta_k(u_i, v_i)$ is meaningful: a positive value indicates a positive local association between the k-th variable and GPC, and a negative value a negative one. In this study, each county is abstracted to its central point so that county-to-county distances can be determined easily.
If the regression coefficients $\beta_k$ do not vary with sample point i, GWR reduces to ordinary (global) linear regression. The core of the GWR model is the spatial weight matrix: the local regression coefficients are estimated by weighted least squares using the weights $W_{ij}$, which decay with distance. In the commonly used Gaussian form, which is consistent with the symbols defined here, the weight is expressed as follows:

$$W_{ij} = \exp\left[-\left(\frac{d_{ij}}{b}\right)^{2}\right]$$

where $d_{ij}$ is the distance between observation point i and observation point j, and $b$ is a nonnegative attenuation parameter describing the functional relationship between weight and distance, called the bandwidth. The best bandwidth is generally obtained by cross-validation.
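The local weighted least-squares fit and the cross-validated bandwidth choice described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the study: the Gaussian kernel, the function names (`gaussian_weights`, `gwr_fit`, `loo_cv_score`), and the synthetic data are all assumptions introduced here.

```python
import numpy as np

def gaussian_weights(dist, bandwidth):
    """Assumed Gaussian distance-decay kernel: w = exp(-(d / b)^2)."""
    return np.exp(-(dist / bandwidth) ** 2)

def gwr_fit(coords, X, y, bandwidth):
    """Return one row of local coefficients (intercept first) per location."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])              # add intercept column
    betas = np.empty((n, Xd.shape[1]))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)  # distances to point i
        w = gaussian_weights(d, bandwidth)
        XtW = Xd.T * w                                  # X^T W_i
        betas[i] = np.linalg.solve(XtW @ Xd, XtW @ y)   # weighted least squares
    return betas

def loo_cv_score(coords, X, y, bandwidth):
    """Leave-one-out CV error: predict each point from a fit that excludes it."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])
    sq_err = 0.0
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = gaussian_weights(d, bandwidth)
        w[i] = 0.0                                      # drop the point itself
        XtW = Xd.T * w
        beta = np.linalg.solve(XtW @ Xd, XtW @ y)
        sq_err += (y[i] - Xd[i] @ beta) ** 2
    return sq_err

# Hypothetical synthetic data with a spatially constant relationship
rng = np.random.default_rng(0)
coords = rng.uniform(0, 1, size=(30, 2))
X = rng.normal(size=(30, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]
betas = gwr_fit(coords, X, y, bandwidth=0.5)
```

Because the synthetic relationship does not vary in space, every row of `betas` recovers the same global coefficients, which illustrates the statement that GWR reduces to linear regression when the coefficients do not change with location. In practice, `loo_cv_score` would be evaluated over a grid of candidate bandwidths and the minimizer retained.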
The GWR model works from a weight matrix derived from the locations of spatial objects, but it cannot fully reveal the autocorrelation of spatial variables. The spatial autocorrelation coefficient can characterize the spatial autocorrelation of winter wheat GPC across counties. Spatial statistical analysis can describe the spatial autocorrelation of the data, but it has difficulty quantifying the causal relationships between spatial objects. The GWR model, on the other hand, can quantitatively describe the association between winter wheat GPC in each county and its influencing factors while taking the locations of the individual spatial objects into account. Therefore, this study combines the advantages of spatial autocorrelation analysis and GWR to examine the spatial relationship between county-level GPC and latitude, as well as the relationships with meteorological, soil, and other parameters. The GWR model can still be improved, however. For example, multi-scale geographically weighted regression (MGWR) extends GWR by allowing the effects of different factors to operate at different spatial scales, which makes the modeling results more reliable.
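The text does not name the spatial autocorrelation coefficient it relies on. A common choice is global Moran's I, and the sketch below assumes that statistic purely for illustration; the function name `morans_i`, the binary adjacency matrix, and the toy values are hypothetical.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I for values observed at n locations.

    weights is an n x n spatial weight matrix with a zero diagonal.
    """
    z = np.asarray(values, dtype=float) - np.mean(values)  # deviations from mean
    w = np.asarray(weights, dtype=float)
    n = z.size
    return (n / w.sum()) * (z @ w @ z) / (z @ z)

# Four hypothetical locations on a line; neighbours share an edge
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

i_trend = morans_i([1, 2, 3, 4], W)    # smooth trend: positive autocorrelation
i_alt = morans_i([1, -1, 1, -1], W)    # alternating values: negative autocorrelation
```

Values near +1 indicate that similar GPC values cluster in neighbouring counties, values near -1 indicate a checkerboard pattern, and values near 0 indicate no spatial structure.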

Machine Learning
Machine learning algorithms are also known as estimators. This study evaluates four estimators, namely, RF, BPNN, SVM, and LSTM, for wheat quality prediction. The RF algorithm is an ensemble of regression trees that adapts well to varied datasets and runs efficiently on large ones. The BPNN in this study consists of input, hidden, and output layers, with neurons connected between adjacent layers and no connections within a layer. Its construction is divided into three parts: building the prediction model, training and optimizing the model, and testing and simulating the model. The SVM is a supervised learning algorithm that performs well on small samples and on nonlinear and high-dimensional problems. For a two-dimensional dataset, for example, many straight lines of different slopes can separate the samples, and the SVM searches among them for the line with the best generalization capability. When the dataset has three or more dimensions, the separating line is replaced by a maximum-margin hyperplane. The LSTM is a recurrent neural network that uses LSTM units as the building blocks of its hidden layers and is very effective for time-series problems.
Because machine learning models are complex, a prediction model sometimes fits the known data too closely and thus loses the ability to generalize to future observations; this is overfitting. To limit it, 60% of the dataset was used to train the predictive models, 20% was used as a validation set to tune the hyperparameters and preliminarily evaluate the models, and the remaining 20% was held out as a test set to evaluate the models' ability to predict unseen data. The overall workflow of the quality prediction procedure is shown in Figure 1.
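A 60/20/20 partition of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: a simple random shuffle is assumed (the text does not say whether the split was random, stratified by county, or by year), and the function name `split_indices` and the seed are hypothetical.

```python
import numpy as np

def split_indices(n_samples, train=0.6, test=0.2, seed=0):
    """Shuffle sample indices and cut them into train / validation / test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train * n_samples))
    n_test = int(round(test * n_samples))
    train_idx = order[:n_train]
    test_idx = order[n_train:n_train + n_test]
    val_idx = order[n_train + n_test:]      # whatever remains is validation
    return train_idx, val_idx, test_idx

# 1253 samples, as reported for the 2019 to 2022 GPC dataset
train_idx, val_idx, test_idx = split_indices(1253)
```

The validation indices are used only for hyperparameter tuning, and the test indices are touched once, after tuning, so that the reported accuracy reflects unseen data.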

Accuracy Evaluation
Four metrics are used to evaluate the performance of the models: coefficient of determination (R²), mean bias error (MBE), root mean square error (RMSE), and mean absolute error (MAE), whose formulas are shown below. The higher the R² and the smaller the absolute value of MBE, the better the model predictions and the closer they are to the true values. RMSE and MAE express the difference between the modeled and measured values; the smaller they are, the higher the accuracy of the model.

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

$$\mathrm{MBE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - x_i)^2}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - x_i \right|$$

where $n$ denotes the total number of samples, $x_i$ denotes the i-th measured value, $y_i$ denotes the i-th predicted value, and $\bar{x}$ denotes the mean of the measured values.
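The four metrics can be computed directly from paired measured and predicted values. The sketch below uses the standard definitions with x as the measured value and y as the predicted value, and it assumes that MBE is taken as predicted minus measured (the sign convention is not recoverable from the text). The function name `evaluate` and the three sample values are hypothetical.

```python
import numpy as np

def evaluate(measured, predicted):
    """Return R2, MBE, RMSE, and MAE for paired measured/predicted values."""
    x = np.asarray(measured, dtype=float)
    y = np.asarray(predicted, dtype=float)
    resid = y - x                                   # predicted minus measured
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((x - x.mean()) ** 2)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MBE": resid.mean(),                        # sign shows over/under-prediction
        "RMSE": np.sqrt(np.mean(resid ** 2)),
        "MAE": np.mean(np.abs(resid)),
    }

# Hypothetical GPC values in percent
scores = evaluate([12.0, 14.0, 16.0], [12.5, 13.5, 16.5])
```

Note that positive and negative errors cancel in MBE, so a model can have an MBE near zero while RMSE and MAE remain large; this is why MBE is read together with the other three metrics.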

E-Fast Method
The E-Fast method (extended Fourier amplitude sensitivity test) is a global sensitivity analysis method that combines the advantages of the Sobol method with the robustness, low sample-size requirement, and computational efficiency of the Fourier amplitude sensitivity test. It is a quantitative, variance-based method: by decomposing the variance of the model output, it yields the first-order sensitivity and the total sensitivity of each parameter, which quantify the direct and indirect contributions of each parameter to the model output.
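For reference, the variance decomposition behind these indices can be written out as below. This is the standard variance-based formulation, given here as a sketch; the symbols $V_i$, $V_{ij}$, $V_{\sim i}$, $S_i$, and $S_{T_i}$ are notation introduced for this illustration and do not appear in the original text.

```latex
% Decomposition of the output variance for a model Y with k input parameters
V(Y) = \sum_{i} V_i + \sum_{i<j} V_{ij} + \dots + V_{1,2,\dots,k}

% First-order (direct) sensitivity index of parameter i
S_i = \frac{V_i}{V(Y)}

% Total sensitivity index of parameter i: direct effect plus all interactions.
% V_{\sim i} collects every variance term that does not involve parameter i.
S_{T_i} = \frac{V(Y) - V_{\sim i}}{V(Y)}
```

The gap between $S_{T_i}$ and $S_i$ measures how much of a parameter's influence acts through interactions with other parameters rather than directly.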

SHAP Method
SHAP is a method for interpreting the predictions of machine learning models. It uses the Shapley value from cooperative game theory to compute feature importance, assigning each feature its contribution to the model output. Because it can attribute importance for different combinations of variables and for individual observations, it can be used both to analyze overall feature importance and to explain individual predictions. SHAP values account for the importance of each feature itself and for the interactions between features, which makes the interpretation of prediction results more reliable. SHAP results can be visualized for an entire dataset or for a single data point.
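The Shapley value that SHAP builds on can be computed exactly for a small model by enumerating all feature subsets. The sketch below does this with the standard library only; it is not the SHAP library's algorithm, which uses faster approximations. The `toy_model`, the input, the all-zero baseline, and the convention of replacing absent features by their baseline value are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values; absent features are set to their baseline value."""
    n = len(x)

    def value(subset):
        z = [x[j] if j in subset else baseline[j] for j in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # weight of a coalition of this size in the Shapley average
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_model(z):
    # Hypothetical model with two linear terms and one interaction
    return 10.0 + 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[0] * z[2]

x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The contributions sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term between the first and third features is shared equally between them, which is how Shapley values account for interrelationships between features.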

Relationship between GPC and Latitude
The statistical results of winter wheat GPC at the county level in Jiangsu Province from 2019 to 2022 are shown in Figure 2. The mean winter wheat GPC for the whole dataset was 14.00%, with a range of 9.07% to 20.98%. Correlation analysis between winter wheat GPC and latitude in Jiangsu Province showed that GPC was significantly and positively correlated with latitude, with R² = 0.78 (Figure 3). Winter wheat GPC increased by 0.17% with each 0.1° northward rise in latitude.
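To give a sense of scale, the reported slope can be extended over the latitudinal extent of the province given in the Study Region section. This back-of-the-envelope calculation assumes that the 0.17% is expressed in absolute percentage points of GPC and that the linear trend holds over the full extent; neither assumption is stated in the text.

```latex
% Slope per degree of latitude
\frac{0.17}{0.1^{\circ}} = 1.7 \text{ percentage points per degree}

% Latitudinal extent of Jiangsu Province
35^{\circ}20' - 30^{\circ}45' = 4^{\circ}35' \approx 4.58^{\circ}

% Implied south-to-north difference in GPC
1.7 \times 4.58 \approx 7.8 \text{ percentage points}
```

Under these assumptions the latitude trend alone would account for a difference of roughly 7.8 percentage points, compared with an observed range of about 11.9 percentage points (9.07% to 20.98%), which also reflects variation from factors other than latitude.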

Accuracy Comparison of Winter Wheat GPC Models Based on Different Methods
In this study, six methods, namely, RF, BPNN, SVM, LSTM, GWR, and MLR, were used to combine meteorological and soil information to construct a winter wheat GPC prediction model. As shown in Figure 4, the GWR model fits the predicted and actual values significantly better than the other models. Among the four machine learning algorithms, RF gave the best simulation, followed by SVM, and finally BPNN and LSTM. In addition, six feature subsets (Table 2) were created to evaluate the prediction accuracy of each method for winter wheat GPC (Figure 5, Table 3) through four indices, namely, R², RMSE, MAE, and MBE. Among all the models, the GWR model had the largest R² and the smallest RMSE and MAE, and its absolute MBE was slightly smaller than that of the RF model, so the GWR model had the highest accuracy. MLR had the lowest accuracy. Therefore, this study used GWR to further analyze the spatial heterogeneity of winter wheat GPC.


Analysis of the Spatial Heterogeneity of Winter Wheat GPC Based on the GWR Model
The GWR coefficients of the 16 factors are listed in Table 4 to better explore the spatial influences of the various parameters on GPC. Comparison of the interquartile range of the estimation coefficients of the GWR and MLR models yielded errors within 1 standard error, indicating that the effects of environmental variables on the winter wheat GPC at the county level are spatially heterogeneous. The sensitivity indices of the environmental parameters for county-level winter wheat GPC are illustrated in Figure 6, which shows that the 12 meteorological factors and 4 soil factors exhibited spatial heterogeneity in their sensitivity; among them, the sensitivity indices of P, PRE03, PRE04, and PRE05 were poorly correlated with latitude.

Sensitivity of Factors Affecting Winter Wheat GPC at the County Level
As shown by the sensitivity indices of the factors in the GWR model (Figure 7), the effects of each environmental factor on GPC differed significantly across space, meaning that the relationship between GPC and these environmental factors is spatially non-stationary. The degree of influence of each factor on GPC can be explained by the corresponding regression coefficient: its absolute value reflects the intensity of the effect on GPC (Figure 7), and its sign indicates whether the factor is positively or negatively correlated with GPC. For example, the regression coefficients of MT04 had a smaller range of variation than those of the other factors, indicating that the effect of MT04 on GPC in the study area was more stable than that of the other factors. TMAX03, N, and K had much greater effects on wheat quality than the other factors; that is, changes in TMAX03, N, and K had the greatest impact on the accuracy of the model. These were followed by MSD04, TMAX04, and TMAX05, while P, PRE03, PRE04, and PRE05 had the least effect. Furthermore, SHAP values were used to explain the four machine learning models. As shown in Figure 8, MT04 had a much greater effect on wheat quality than the other factors, meaning that changes in MT04 had the greatest effect on the accuracy of the model; it was followed by MSD03 and TMAX03. TMAX04, P, and SOM had the least effect.

Spatial Heterogeneity Analysis of the County-Level Winter Wheat GPC
Winter wheat with low GPC is common at low subtropical latitudes, whereas high-GPC wheat is more common at high latitudes, where GPC is also susceptible to climatic factors (Figure 3). Our study revealed that in Jiangsu Province, the county-level winter wheat GPC increased by 0.17% for every 0.1° increase in north latitude. This result is consistent with previous findings that wheat GPC decreases with decreasing latitude in the northeast spring wheat region, the Yellow-Huai winter wheat region, and the middle and lower Yangtze River winter wheat region [21]. It also agrees with the report that wheat GPC is generally high in the northeast and low in the southwest, decreasing from year to year, mostly in a zonal distribution, and overall higher in the north than in the south, with latitude as the main influencing factor [22]. Many factors contribute to the spatial heterogeneity of wheat GPC. For example, it is warmer during wheat growth at low latitudes, and high temperatures in the early stages of winter wheat can dilute GPC while positively affecting yield [23].
At low latitudes, precipitation affects the formation of GPC. During the early stages of grain development, water stress diminishes the uptake potential of the grain by reducing the number of formed endosperm cells and amyloplasts [24], which leads to a decrease in grain weight and an increase in protein content [25]. In addition, sunlight also influences the accumulation of seed protein to some extent due to variations in sunlight duration and light intensity. Photosynthesis provides energy for the accumulation of protein content in wheat seeds. The longer the duration of sunlight, the longer the photosynthesis time and the more organic matter accumulates [26]. Meanwhile, when investigating the spatial differences in the winter wheat GPC, differences in cropping patterns, for example, early sowing and late harvesting of winter wheat and the length of the growing season, cannot be ignored.

Effectiveness of Different Methods to Predict the GPC of Winter Wheat
The MLR model had a limited ability to express the spatial heterogeneity of winter wheat GPC. In contrast, the GWR model is a local model that accounts for the spatial heterogeneity of the variables and decomposes global parameter estimates into local ones. The GWR model clearly performs better than the MLR model in estimation accuracy and in preserving the spatial features of the samples, and it can effectively weaken the spatial autocorrelation of model residuals [27]. The RF model does not require prior assumptions about the relationship between the independent and dependent variables; it can effectively overcome multicollinearity among independent variables and rank the importance of each variable [28], and it has been applied in agricultural factor analysis and biomass prediction [29]. As shown in Figure 5, the R² of the RF model was 0.70, similar to that of GWR, and both achieved a good fit. BPNN is widely used for model building because of its powerful nonlinear mapping ability and flexible network structure, but the results of this study show that it was not effective in predicting winter wheat GPC. The SVM model can improve the intelligence and automation of spectral data processing [30]. In this study, it also predicted winter wheat GPC with better results than BPNN, but its prediction accuracy was not as high as that of the GWR and RF models. LSTM can process temporal data and is widely used in natural language processing and speech recognition [31]. However, the fit of winter wheat GPC by LSTM in this paper was not satisfactory.
In summary, GWR is the best choice for constructing a winter wheat GPC model at the county level. In this study, to build a better general model applicable to the whole region, meteorological and soil factors were added to the GWR model to accommodate geospatial variations. Previous studies have shown that the phenological period of wheat in Jiangsu is earlier than that in some northern regions [32]. Therefore, it may be challenging to rely on the phenological period for a more detailed geographical analysis of winter wheat GPC. GWR is a linear regression at specific sample points, while the effect of meteorological factors on GPC may be nonlinear [33]. In a subsequent study, the effect of nonlinear regression of exponential, logarithmic, and quadratic functions on the quality of wheat in large geographical areas was investigated [34]. In addition, remote sensing data can reveal differences between winter wheat growth and GPC in the same geographic setting [12,35]. It is hoped that remote sensing data will be combined with the GWR model in the future to establish a multilevel GPC estimation model to further diagnose the spatial and temporal differences in winter wheat GPC.

Sensitivity Analysis of GPC Predictor Variables for Winter Wheat at the County Level
Meteorological factors are utilized in studies of spatial heterogeneity. In this study, 12 meteorological factors and 4 soil factors were identified, and MLR and GWR models were established to explore the spatial heterogeneity of the county-level winter wheat GPC in different years in Jiangsu Province. The use of monthly meteorological and soil data as independent variables for the study of spatial heterogeneity supports the results of previous studies [12].
From the results of this study, the regression coefficients of N and K as soil factors had a wide range of variability, indicating that the effects of N and K on winter wheat GPC vary more significantly in different places. The effect of N on winter wheat GPC has been mentioned in previous studies on wheat quality [23]. N uptake, assimilation, and utilization by wheat plants directly affect seed yield and GPC, and an increase in N application significantly improves wheat GPC [36]. K application can increase leaf K content at anthesis and significantly increase N accumulation at anthesis and maturity, which in turn facilitates the transport of N stored before flowering to seeds, thereby enhancing winter wheat GPC. In recent years, it has been shown that increased N fertilization and higher crop yield levels accelerate soil K export [37], and K has become a key limiting factor in improving wheat yield and achieving high-quality wheat.
In China, winter wheat turns green again in early March and is harvested in early May, which matches the period of seed filling and ripening [3]. In this study, the sensitivity indices of TMAX03 and N were negatively correlated with latitude, whereas those of the other meteorological and soil factors showed the opposite pattern. Photosynthesis, which is driven by radiation and varies with latitude, is mediated by daytime temperature, while the respiration rate responds to both daytime and night temperatures. The effects of photosynthesis and respiration on crop development differ during different periods. At the county level, the effects of temperature, radiation, precipitation, and N content on winter wheat GPC are complex and challenging to interpret. Although the interannual variation in March maximum temperature in Jiangsu Province is not significant, timely and reasonable planting management and fertilizer management will minimize the effect of temperature on winter wheat GPC.

Limitations and Future Applications
The accuracy of the GWR model is superior to that of the other models, consistent with the results of numerous previous studies [38][39][40]. However, the GWR model can still be improved. It is challenging to fully understand how temperature, radiation, precipitation, and nitrogen content affect GPC. Moreover, these effects are not always consistent with the assumptions of the GWR model, which may affect the accuracy of the simulations. For example, in practice, the effects of independent variables on the dependent variable do not always differ in space as GWR assumes, which gives rise to errors in model simulations. The analysis of wheat GPC spatial heterogeneity using machine learning methods was valid. Future work should continue to study the following areas in depth.
First, the sample size of the training data can be expanded. Because of limited data availability, the present article included 4 years of data from 73 counties in 13 cities in Jiangsu Province; future studies could expand the data volume. Second, the model algorithm can be further optimized. Based on the model for predicting wheat quality established in this study, more influential factors in addition to meteorological, soil, and latitude factors should be selected for training to optimize the model algorithm. Third, other machine learning algorithms, such as gradient boosted decision trees and other nonlinear modeling methods, can be considered for further exploratory analysis.
In addition to model improvement, various factors can continue to be selected or added for modeling in subsequent studies. Some scholars constructed a GPC prediction model based on hyperspectral data and agronomic parameters, involving the relationship between agronomic parameters and seed protein content at maturity, which could accurately predict seed quality and provide necessary information for agricultural management and production [41]. Although 16 factors were selected for GWR modeling in this study, the regulation of winter wheat GPC is complex. Regional spatial spans and local microclimates vary greatly, resulting in large differences in wheat GPC between regions and years under conventional cultivation conditions [28]. Therefore, corrections and refinements need to be made based on the actual situation in practice.

Conclusions
GPC is one of the important indicators of winter wheat quality, and studying its spatial heterogeneity helps predict wheat quality. In this study, the relationship of county-level environmental variables with winter wheat GPC was investigated using the GWR model. The results showed that the effect of April mean temperature (MT04) on GPC was more stable than that of the other factors, and that county-level winter wheat GPC increased by 0.17% for every 0.1° increase in north latitude in Jiangsu Province. Linear fitting of the predicted and true values of each model and comparison of the four metrics, namely, R², RMSE, MAE, and MBE, showed that the GWR model was the most accurate, followed by RF, and that MLR was the least accurate, which illustrates the superiority of the GWR model in the study of spatial heterogeneity. The GWR analysis showed significant spatial heterogeneity in the effects of environmental factors on GPC: the sensitivity indices of the soil and meteorological factors varied with latitude. The GWR model based on meteorological and soil factors could be used for county-level winter wheat GPC prediction and for studying the spatial heterogeneity of winter wheat, and it is beneficial for assessing wheat quality in a timely and accurate manner. In conclusion, the GWR model is valuable for improving wheat prediction research systems and offers a reference for guiding wheat production and processing.

Figure 1. Workflow of county-level winter wheat quality prediction based on machine learning.

Figure 3. Correlation between latitude and winter wheat grain protein content (GPC) in Jiangsu Province. Note: ** denotes statistically significant differences at p < 0.01.

Figure 5. Comparison of the values of the six methods on the metrics (a) R², (b) RMSE, (c) MAE, and (d) MBE based on 6 features.

Table 1. Summary of the collected soil and weather datasets.

Table 2. Selected features in different subsets.

Table 3. Accuracy evaluation of four machine learning methods.

Table 4. GWR and MLR coefficients of the 16 independent variables.