3.4.1. The Development of Predictive Models Utilizing Linear Regression
The predictive analysis for the data set “Yield, Yield component and Chemical grain parameters” was performed in relation to three predictors (G, H, S). For two parameters (moisture and oil), models of high accuracy and reliability were obtained (test set R2 of about 0.70 to 0.82), while the models for protein reached an R2 of only about 0.44 to 0.50 (Table 7). The linear regression model appeared to be the most accurate approach and confirmed that PCA had successfully identified the quantities most closely associated with the predictors.
Furthermore, there is currently no reliable and predictively applicable model for PRI utilizing these input predictors (G, H, S).
Among the modelled target variables, Vlaga (grain moisture content) and Ulja (grain oil content) demonstrated the most promising predictive performance across all applied ML models. In particular, LR achieved the highest independent test set R2 values (0.824 for Vlaga and 0.704 for Ulja) and consistent cross-validation stability, outperforming more complex approaches such as random forest and XGBoost. By contrast, the prediction of protein content showed considerably weaker performance, with R2 values not exceeding 0.45, indicating limited suitability of the selected descriptors for this property. Therefore, in the following sections we focus on Vlaga and Ulja as representative cases, providing a more detailed evaluation of their predictive models. For these two best-performing targets, additional error-based evaluation parameters (MAE, MBE, RMSE, NRMSE, and MAPE) are reported in a separate table to ensure a comprehensive assessment of accuracy and reliability.
According to the results presented in Table 8, Vlaga and Ulja are indeed the two targets with reliable predictive performance. For Vlaga, the LR model achieved a high independent test set R2 of 0.8242 with stable cross-validation (0.8920 ± 0.0382). The error parameters indicate excellent predictive quality, with low MAE (0.7072), minimal bias reflected in MBE (0.0002), and moderate absolute error as RMSE (1.1162). Both relative error measures, NRMSE (0.0863) and MAPE (5.0431%), confirm robustness and accuracy. For Ulja, predictive performance was slightly lower but still satisfactory, with an independent test set R2 of 0.7044 and a cross-validation mean of 0.6854 (±0.1507). Error analysis shows low MAE (0.2233), small negative bias (MBE = −0.0412), and low overall error (RMSE = 0.2811), with relative measures NRMSE (0.0673) and MAPE (5.2367%) confirming strong reliability.
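The error parameters quoted above can be computed from paired observed and predicted values as in the following minimal sketch. The source does not state the exact conventions used, so two choices here are assumptions: MBE is taken as the mean of (predicted − observed), and NRMSE is RMSE divided by the mean of the observed values (normalising by the observed range is another common convention). The function name `error_metrics` and the sample numbers are illustrative only and are not the study's data.

```python
import math


def error_metrics(y_true, y_pred):
    """Return MAE, MBE, RMSE, NRMSE and MAPE for paired observations.

    Assumptions (not confirmed by the source):
      - MBE uses residual = predicted - observed, so a negative MBE
        means the model under-predicts on average.
      - NRMSE normalises RMSE by the mean of the observed values.
      - MAPE requires all observed values to be non-zero.
    """
    n = len(y_true)
    residuals = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(r) for r in residuals) / n
    mbe = sum(residuals) / n
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    nrmse = rmse / (sum(y_true) / n)
    mape = 100.0 * sum(abs(r / t) for r, t in zip(residuals, y_true)) / n

    return {"MAE": mae, "MBE": mbe, "RMSE": rmse, "NRMSE": nrmse, "MAPE": mape}


# Illustrative values only (hypothetical, not taken from Table 8).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

for name, value in error_metrics(observed, predicted).items():
    print(f"{name}: {value:.4f}")
```

Because MBE averages signed residuals, over- and under-predictions cancel, which is why a model can show an MBE close to zero while MAE and RMSE remain clearly positive.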
To further illustrate the predictive relationship between the selected variables, the final regression equation for Vlaga is presented below, providing an interpretable representation of the contribution of each predictor (Equation (6)).
Examination of the key coefficients within the model shows that the year of production significantly affects the variation in grain moisture content. In this model, the change from G1 to G2 is associated with an average decrease in moisture content of about 4.96 units. With the genotype (hybrid) as a predictor, hybrids H5 and H6 are associated with an average increase in the estimated grain moisture content of about 1.13 units, or 0.041, whereas hybrid H3 is linked to an average decrease in moisture of approximately 0.67 units. These findings suggest that the predicted moisture value is statistically more influenced by the experimental year and the selected hybrid than by crop density (Figure 4).
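To make this reading of the coefficients concrete, the sketch below fits an ordinary least squares model to categorical predictors encoded as dummy variables. It assumes reference (drop-first) coding with G1, H1 and S1 as baselines, which the source does not confirm, and it uses a small synthetic, noise-free data set with hypothetical effect sizes rather than the study's measurements. The helpers `one_hot`, `solve` and `fit_ols` are illustrative names, written in plain Python so the example is self-contained.

```python
from itertools import product


def one_hot(level, levels):
    """Dummy-code a level, dropping the first level as the reference."""
    return [1.0 if level == other else 0.0 for other in levels[1:]]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical factor levels and additive effects (NOT the paper's values).
years = ["G1", "G2"]
hybrids = ["H1", "H2", "H3"]
densities = ["S1", "S2"]
effect = {"G1": 0.0, "G2": -5.0, "H1": 0.0, "H2": 1.0, "H3": -0.7,
          "S1": 0.0, "S2": 0.1}

design, target = [], []
for g, h, s in product(years, hybrids, densities):
    design.append([1.0] + one_hot(g, years) + one_hot(h, hybrids)
                  + one_hot(s, densities))
    target.append(20.0 + effect[g] + effect[h] + effect[s])

names = ["intercept", "G2", "H2", "H3", "S2"]
coef = dict(zip(names, fit_ols(design, target)))
for name in names:
    print(f"{name}: {coef[name]:+.3f}")
```

Under this coding, each coefficient is the expected change in the target relative to the reference level with the other factors held fixed, so the G2 coefficient corresponds to the G1 to G2 difference discussed in the text.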
According to the results presented in Table 7, Ulja also demonstrated reliable predictive performance, although at a slightly lower level compared to Vlaga. The LR model achieved an independent test set R2 of 0.7044, supported by a cross-validation mean of 0.6854 (±0.1507). The error parameters indicate satisfactory predictive quality, with low MAE (0.2233), a small negative bias reflected in MBE (−0.0412), and low absolute error as RMSE (0.2811). Both relative error measures, NRMSE (0.0673) and MAPE (5.2367%), further confirm the reliability and robustness of the model for predicting Ulja.
As for Vlaga, the final regression equation for Ulja is presented, providing an interpretable representation of the contribution of each predictor (Equation (7)).
Examining the most prominent coefficients within the model reveals the significance of environmental conditions. The change in meteorological conditions during the vegetation season, particularly in the transition from G1 to G2, is associated with an average reduction in oil value of roughly 0.75 units. The projection of oil quantity in the grain with the genotype (hybrid) as a predictor underscores the importance of this factor. Specifically, hybrid H5 shows an average increase in oil value of about 0.42 units, while hybrid H4 is associated with an average decrease of about 0.34 units. These results indicate that the year of the experiment and the choice of hybrid have a more pronounced statistical influence on the predicted oil value than planting density (Figure 5).
The linear regression model created for analysing protein levels with the predictors G, S, and H yielded an R2 value of 0.44. This indicates that roughly 44% of the variability in protein levels can be accounted for by the selected predictors (Table 7). Additionally, cross-validation resulted in an average R2 of 0.50, which signifies a moderate degree of generalization performance. While these R2 values do not suggest a fully dependable model, the cross-validation outcome and the small difference between the test set and cross-validation R2 values imply that this model may possess some practical applicability, though it is limited.
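The comparison between a test set R2 and a cross-validation mean R2 can be illustrated with the following minimal sketch of k-fold cross-validation. The source does not state the number of folds or how they were drawn, so the three interleaved folds used here are an assumption, as is the simple group-mean model (equivalent to least squares on a single categorical factor). The data are synthetic and the function names are hypothetical.

```python
from statistics import mean, stdev


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_group_means(levels, y):
    """Fit a model that predicts the training mean of each factor level."""
    groups = {}
    for level, value in zip(levels, y):
        groups.setdefault(level, []).append(value)
    return {level: mean(values) for level, values in groups.items()}


def cross_val_r2(levels, y, k):
    """Return (mean, standard deviation) of R2 over k interleaved folds."""
    n = len(y)
    scores = []
    for fold in range(k):
        test_idx = list(range(fold, n, k))  # every k-th sample is held out
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        model = fit_group_means([levels[i] for i in train_idx],
                                [y[i] for i in train_idx])
        predictions = [model[levels[i]] for i in test_idx]
        scores.append(r2_score([y[i] for i in test_idx], predictions))
    return mean(scores), stdev(scores)


# Synthetic example (hypothetical values): one two-level factor.
levels = ["G1", "G2"] * 6
values = [10.2, 8.1, 9.8, 7.9, 10.1, 8.3, 9.9, 7.8, 10.0, 8.0, 10.3, 8.2]

cv_mean, cv_std = cross_val_r2(levels, values, k=3)
print(f"CV R2 = {cv_mean:.4f} ± {cv_std:.4f}")
```

A small standard deviation across folds indicates that the fit does not depend strongly on which samples are held out, which is the sense in which the cross-validation results are described as stable.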
The analysis of the most important coefficients in the third prediction model indicates that, as in the previous models, the year of production significantly affected the variability of the protein trait. The production conditions during the transition from G1 to G2 were associated with an average reduction in protein values of about 1.11 units. The impact of the hybrids on protein values varied according to their genetic foundation. Hybrids H3 and H2 are each associated with an average decrease in protein of about 0.77 units. Conversely, hybrids H4 and H5 show an increase in protein, with H5 showing a notably greater increase. These results imply that both the experimental year and the chosen hybrid have a statistically greater influence on the predicted protein value than planting density, which appears to play a minor role according to this model (Figure 6).
3.4.2. Predictive Analysis of the Influence of G, H, and S on Moisture Release Parameters
PCA identified grain moisture content as the primary factor influencing variability among all studied parameters. Based on this, the moisture release properties (SMZ, VSMZ, 105 mz) were the focus of the predictive analysis. The predictive analysis of the “Moisture Release” dataset was conducted with respect to four predictors (G, H, S, and V), while all other numerical variables were treated as target variables. As in the case of the “Yield, Yield Component, and Chemical Grain Parameters” dataset, four ML models were tested against the target variables, and these results are summarized in Table S4.1.
Several target variables yielded models with exceptionally high accuracy and reliability. For these targets, the XGBoost machine learning model emerged as the most precise method in all cases, and PCA effectively identified the maize characteristics most closely associated with the predictors. Notably, the model for the SMZ parameter exhibited remarkable performance, with R2 (0.90) and CV mean R2 (0.93) values reflecting outstanding accuracy and consistency. Two further parameters, VSMZ and 105 mz, also achieved models with an accuracy of more than 80%. These models were subsequently analysed to elucidate the impact of the predictors on the measured variables (Table S4.1). Error-based parameters for these selected models are presented in Table 9.
For SMZ, the model showed the most reliable results, with a high independent test set R2 of 0.8975 and excellent cross-validation stability (CV mean R2 = 0.9346, CV std = 0.0124). Error-based metrics further confirm accuracy, with low MAE (4.3251) and RMSE (5.4387), minimal bias (MBE = 0.3572), and acceptable relative error values (NRMSE = 0.1514, MAPE = 13.7417%). These results indicate that SMZ can be predicted with high confidence. For VSMZ, the model also performed well, with a test set R2 of 0.8625 and a stable cross-validation mean of 0.8952 (±0.0144). Absolute error values were low (MAE = 2.2292, RMSE = 2.8655), while relative error measures (NRMSE = 0.2583, MAPE = 29.6875%) suggest that although the model successfully captured the underlying trend, small variations in the data may have contributed to slightly higher relative deviations. For 105 mz, predictive performance was comparable, with a test set R2 of 0.8522 and cross-validation mean of 0.8930 (±0.0153). Absolute errors again remained low (MAE = 2.2307, RMSE = 2.8498), and relative errors (NRMSE = 0.2689, MAPE = 31.7272%) were somewhat higher but still within an acceptable range for complex biological datasets.
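The contrast between low absolute errors and higher relative errors can be illustrated with a short sketch: the same absolute residuals produce a much larger MAPE when the observed values themselves are small. The numbers below are hypothetical and chosen only to show the effect; they are not the VSMZ or 105 mz measurements.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error (observed values must be non-zero)."""
    ratios = [abs((p - t) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)


# Identical absolute residuals (MAE = 2.0) applied at two target scales.
residuals = [2.0, -2.0, 2.0, -2.0]
large_targets = [30.0, 32.0, 28.0, 30.0]  # hypothetical high-valued target
small_targets = [8.0, 6.0, 10.0, 8.0]     # hypothetical low-valued target

mape_large = mape(large_targets, [t + r for t, r in zip(large_targets, residuals)])
mape_small = mape(small_targets, [t + r for t, r in zip(small_targets, residuals)])

print(f"MAPE at the larger scale:  {mape_large:.2f}%")
print(f"MAPE at the smaller scale: {mape_small:.2f}%")
```

This is one plausible reason, stated here as an assumption, why a target with low MAE and RMSE can still show a comparatively high MAPE or NRMSE.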
In the following section, we provide a more detailed analysis of these models, including an interpretation of feature importance and SHAP visualizations to better understand the contribution of individual predictors. We begin with the model for the SMZ parameter, whose results are summarized in Figure 7.
According to the results presented in Figure 7a, the main predictors identified are G and V. Notably, G2 and V4 each play a significant role in the model, with importance values exceeding 0.45. This indicates that the year of the experiment (G2) and the timing of sampling and measurement (V4) are crucial for predicting the SMZ value, with a combined importance greater than 90%. According to the SHAP analysis presented in Figure 7b, V4 increased SMZ values by about 20 units on average, positioning this temporal factor as the second most critical predictor. V2 and V3 exhibited a weak positive influence on SMZ. The predictor H provided a smaller yet noticeable contribution to the variability of SMZ. The mean SHAP values of factor S are mostly at zero, indicating that it did not influence the SMZ value. The remaining predictors show minimal relevance, and their effect on SMZ prediction is negligible compared to G2 and V4.
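The idea behind a SHAP value, namely how much a given predictor shifts a single prediction relative to a baseline, can be illustrated by computing exact Shapley values for a small toy model. This is a sketch under stated assumptions: it enumerates all feature subsets against a single all-zero baseline, whereas SHAP analyses of tree models such as XGBoost are typically computed with a dedicated tree algorithm over a background data set. The function `toy_model` and its coefficients are hypothetical and are not the fitted SMZ model.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature for one prediction.

    Features outside the coalition are set to their baseline value.
    The cost grows as 2**n, so this is only practical for a few features.
    """
    names = list(x)
    n = len(names)
    phi = {}
    for name in names:
        others = [m for m in names if m != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = {m: (x[m] if m in subset else baseline[m])
                           for m in names}
                with_feature = dict(without)
                with_feature[name] = x[name]
                total += weight * (model(with_feature) - model(without))
        phi[name] = total
    return phi


def toy_model(f):
    """Hypothetical response with two main effects and one interaction."""
    return 30.0 + 15.0 * f["G2"] + 20.0 * f["V4"] + 4.0 * f["G2"] * f["V4"]


sample = {"G2": 1, "V4": 1, "S2": 1}    # one-hot indicators of a sample
baseline = {"G2": 0, "V4": 0, "S2": 0}  # assumed reference point

phi = shapley_values(toy_model, sample, baseline)
for name, value in phi.items():
    print(f"{name}: {value:+.2f}")
```

The values sum to the difference between the prediction for the sample and the prediction at the baseline, and a predictor that the model ignores, such as S2 in this toy case, receives a value of zero, which matches the interpretation of near-zero SHAP values in the text.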
Figure 8 summarises the feature importance and SHAP analysis for the model of the VSMZ parameter.
According to the results presented in Figure 8a, the primary determinant for the VSMZ parameter was G2, which played a crucial role in the model with an importance exceeding 0.5, underscoring the substantial impact of the second year of the experiment. V4, a Time Treatments predictor, ranked second in influence, offering a moderate contribution, whereas V3 and V2 also added value to the model, although to a lesser degree. Conversely, all other predictors, including planting density (S) and hybrid (H), held minimal relevance in this model, with no significant contribution to forecasting the VSMZ value (Figure 8a).
According to the SHAP analysis presented in Figure 8b, predictor G2 has a strongly positive influence on the value of VSMZ. Of the Time Treatments predictors, V4 has the strongest positive influence on the VSMZ parameter, while both V3 and V2 show a moderate contribution to the increase. This trend is consistent with the previous observations and points to the temporal component as an important factor. Hybrid H6 has a relatively pronounced negative impact, reducing the value of VSMZ. In some instances, hybrid H5 made a positive contribution, suggesting a potentially favourable influence of factor H. The predictor S exhibited a neutral effect in the majority of cases, with minor fluctuations in both directions (Figure 8b).
Finally, Figure 9 presents the feature importance and SHAP analysis for the model predicting the 105 mz parameter.
The feature importance for the model of the 105 mz parameter (Figure 9a) indicates that the primary predictors were G and V. Variation in the levels of these factors leads to measurable differences in the target variable. G2 emerged as the most critical predictor in the model, followed by V4 as the second most significant predictor. V3 and V2 provided some contribution, whereas the predictors related to hybrids and planting density were of minimal relevance in comparison (Figure 9a).
According to the SHAP analysis for the model of the 105 mz parameter (Figure 9b), the presence of predictor G2 resulted in an increase of the 105 mz parameter. The SHAP values associated with G2 were consistently positive, ranging from approximately +5 to +7.5, which signifies a reliable positive influence of the second year on the variable 105 mz. Predictors V4, V3, and to a lesser extent V2 tended to increase the value of 105 mz: V4 made the largest contribution, with predominantly positive SHAP values, V3 offered a moderate positive contribution, and V2 had only a weak impact. In contrast, the factors H and S showed a variable influence, with some levels increasing and others decreasing the value of 105 mz. Predictor S in particular played a dual role, contributing to both increases and decreases of the 105 mz parameter.