1. Introduction
The expectation that agriculture will provide enough food for a growing world population will remain a major concern for the foreseeable future [1]. Increasing agricultural yields has frequently been cited as a possible way to address the problem of feeding this expanding population [2]. For many crops, field-scale traits such as lodging resistance and plant density strongly influence grain yield. Both factors are directly shaped by shoot architecture: plants with sturdier, well-balanced shoot structures are less susceptible to lodging, while architectures that optimize branch and stem arrangement allow higher planting densities without compromising light interception or resource allocation [3]. Therefore, achieving optimal shoot architecture has become a central breeding goal [4].
The United States Department of Agriculture (USDA) estimates that approximately 399.50 million tons of soybeans (Glycine max (L.) Merr.) are produced annually, making it one of the major crops in the world. Brazil and the United States are the two largest producers, accounting for approximately 41% and 28% of global production, respectively [5]. Soybeans have several uses, including the production of biofuel and serving as a source of nourishment for humans and animals. Most soybeans cultivated worldwide are processed into meal and oil [6]. The high-protein meal is primarily used in the feed sector for cattle, pigs, and poultry [7], while industry uses the oil as a raw material for a variety of products, including hydrogenated fats, refined oil, margarine, and mayonnaise [8]. Additionally, soybean oil is now the primary raw material used to produce biodiesel [9].
To achieve the highest yields, farmers must consider environmental conditions (such as soil, rainfall, and temperature) and effective crop genotype management. They can now survey their production fields using remote sensing technologies to help monitor such conditions [10]. Predicting agronomic factors such as days to maturity (DM), plant height (PH), and grain yield (GY) remains difficult, however, particularly with indirect approaches such as remote sensing systems based on multispectral data gathered by unmanned aerial vehicles (UAVs) [11]. The introduction of new UAV platforms has recently enabled the rapid and thorough mapping of farmland, producing a large volume of data that must be assessed and integrated to support agricultural management [12].
Machine learning (ML) techniques have been used to interpret remote sensing data over the past ten years, a promising approach to support variety evaluation and selection as well as other agricultural applications. A fair number of works addressing crop yield prediction based on machine learning models for various crops can be found in the associated literature [13,14,15,16,17,18,19]. Although the use of ML in agriculture has grown, the majority of research to date has focused on leveraging remote sensing data to estimate crop characteristics such as height or yield [20,21]. Despite their strength, such methods rely on indirect measures and often achieve high accuracy only for a small number of cultivars or under specific constraints [13]. In contrast, the direct prediction of soybean plant height from field-based agronomic variables remains relatively unexplored. In particular, the few studies currently available are often limited to small datasets, do not follow consistent evaluation procedures, and offer limited information about the traits with the most significant effects on plant height [18,19]. Furthermore, many current approaches do not quantify prediction uncertainty, which is crucial for crop management and decision-making in plant breeding [22].
In the aforementioned context, it should be highlighted that the inference of agronomic variables for soybean (Glycine max (L.) Merr.) cultivars based solely on remote sensing data remains an unresolved and difficult task, particularly when multiple genotypes across various geographical locations and seasons are taken into account [23]. Given that agriculture plays a crucial role in the economies of many nations, determining or forecasting factors such as DM, PH, and GY can aid decision-making and advance crop management techniques in the future [22,24].
This study addresses the gap in the existing literature regarding a robust, data-driven framework for predicting soybean PH from a diverse set of agronomic traits. In contrast to previous methods, our technique combines multiple algorithms, methodical hyperparameter optimization, and a comprehensive assessment pipeline that includes uncertainty analysis, accuracy, error metrics, and feature importance. This improves agronomic interpretability and predictive performance by identifying the most important traits. The suggested framework provides a consistent and repeatable method that can guide cultivar selection and phenotypic forecasting in soybean breeding efforts, offering both accuracy and explainability.
The primary objective of this research work was to develop a temporal or stage-aware ML framework that explicitly addresses the dynamic changes in trait–PH relationships across harvest stages using a wide range of measurable agronomic traits. Additionally, the study aimed to evaluate and compare the performance of various ML models, assess the uncertainty of their predictions, and identify the key features that most significantly influence soybean plant height.
2. Materials and Methods
The data were collected from a field experiment conducted at the Pequizeiro farm, located at the Accert Pesquisa e Consultoria Agronomica Experimental Station, approximately 10 km from the municipality of Balsas, MA, Brazil. The site is situated at an elevation of about 283 m, at latitude 07°31′57″ S and longitude 46°02′08″ W (Figure 1). The experiment was conducted during the 2022–2023 harvest, and additional agronomic information can be consulted in reference [25].
The dataset is composed of 320 samples (40 cultivars with 4 replications in each of the two seasons), and the features are: Season, Cultivar, Repetition, plant height (cm) (PH), insertion of the first pod (cm) (IFP), number of legumes per plant (unit) (NLP), number of grains per plant (unit) (NGP), number of grains per pod (unit) (NGL), number of stems (unit) (NS), mass of 100 grains (g) (MHG), and grain yield (kg ha⁻¹) (GY).
The employed dataset was checked for consistency prior to analysis. No missing values were present in any of the recorded agronomic traits. Furthermore, all measurements fell within biologically plausible ranges for soybean cultivars, and no outliers were removed, ensuring that the dataset accurately reflects natural phenotypic variation under field conditions. A comprehensive and consistent dataset was used directly for model training and evaluation.
The dataset consisted of observations from two consecutive soybean harvests. To ensure that the training and testing splits included samples from both sowings, the data from the two harvests were combined into a single dataset for model construction. Instead of being limited to a single season, this methodology enabled the models to learn from the variability introduced by different harvests. Crucially, the models were trained on the pooled dataset rather than individually for each harvest, ensuring that prediction performance reflects the capacity to generalize across harvest stages. Using a holdout method, 80% of the dataset was reserved for training and 20% for final testing. Within the training subset, 3-fold cross-validation (k = 3) was then employed to optimize model parameters and prevent overfitting. In this process, the training data were divided into three folds, and models were repeatedly trained on two of the folds and validated on the third. The optimal hyperparameters were chosen based on the average performance across folds and then assessed on the independent test subset. This method combines the objective evaluation offered by an external test set with the stability of cross-validation.
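For concreteness, the following minimal sketch (assuming Python with pandas and scikit-learn, and a hypothetical soybean_traits.csv file holding the traits listed above) reproduces the 80/20 holdout split and the 3-fold cross-validation object used for tuning; file, column, and variable names are illustrative, not the authors' original code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, KFold

df = pd.read_csv("soybean_traits.csv")   # hypothetical file with pooled harvests
X = df.drop(columns=["PH"])              # predictors: IFP, NLP, NGP, NGL, NS, MHG, GY, ...
y = df["PH"]                             # target: plant height (cm)

# 80% training / 20% independent final test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# 3-fold CV used only within the training subset for hyperparameter selection
cv = KFold(n_splits=3, shuffle=True, random_state=42)
```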
Table 1 presents a statistical summary of the dataset used in the models, where CV and Var. are the coefficient of variation and variance, respectively.
Regarding the collected samples, it is worth noting that PH across the 40 soybean cultivars ranged from 47.6 cm to 94.8 cm (refer to Table 1), indicating substantial phenotypic variation, from shorter, more determinate types to taller, indeterminate varieties relevant to breeding programs. Such a range confirms that the dataset includes both relatively dwarf and tall phenotypes, thus improving the generalizability of the predictive models. Moreover, no global standardization or normalization was applied to the dataset, since tree-based models (e.g., Extra Trees, XGBoost) are invariant to feature scaling. Specifically for the k-Nearest Neighbors (KNN) algorithm, which is sensitive to feature magnitudes, features were standardized to zero mean and unit variance to ensure fair distance calculations.
2.1. Machine Learning Models
The models were selected based on their ability to handle agronomic data effectively and their complementary nature. Elastic Net (EN) was employed as a linear baseline that handles multicollinearity among attributes by combining L1 and L2 penalties. Extremely Randomized Trees (ET) were selected for their resilience and ability to identify essential variables while capturing nonlinear interactions. The Gaussian Process Regressor (GPR) serves as a probabilistic benchmark that provides both predictions and uncertainty estimates useful for agricultural decision-making. KNN offered a straightforward, instance-based baseline against which to illustrate the benefits of more sophisticated techniques. Lastly, ET was complemented by Extreme Gradient Boosting (XGB), which provides regularized, scalable ensembles with gradient-based optimization to model intricate trait interactions.
2.1.1. Elastic Net
An extension of linear regression, Elastic Net (EN) adds L1 (Lasso) and L2 (Ridge) regularization penalties to the loss function [26]. EN is useful for building models when there are many features and some of them are highly correlated. The EN loss function is written as

$$\mathcal{L}(\beta) = \mathrm{MSE}(\beta) + \lambda \left( \alpha \, \|\beta\|_1 + \frac{1 - \alpha}{2} \, \|\beta\|_2^2 \right),$$

where
- $\mathrm{MSE}(\beta)$: mean squared error, the regression loss term that measures the difference between predicted and actual values;
- $\lambda$: regularization parameter, which determines the strength (intensity) of the regularization term;
- $\alpha$: mixing parameter, which determines the balance between the L1 (Lasso) and L2 (Ridge) penalties;
- $\|\beta\|_1$: L1 penalty, inducing sparsity;
- $\|\beta\|_2^2$: L2 penalty, encouraging small coefficients;
- $\beta$: the coefficients of the regression model.
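As an illustration, EN is available in scikit-learn; note that scikit-learn's alpha argument plays the role of the regularization strength $\lambda$ above, while l1_ratio corresponds to the mixing parameter $\alpha$. The hyperparameter values below are placeholders (the study tuned them with Bayesian Optimization, Section 2.2).

```python
# Illustrative Elastic Net baseline; X_train/y_train come from the split
# sketched in Section 2 and are not the authors' original variables.
from sklearn.linear_model import ElasticNet

en = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000, random_state=42)
en.fit(X_train, y_train)
print(en.coef_)  # shrunken (and possibly sparse) regression coefficients
```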
2.1.2. Extremely Randomized Trees
Extremely Randomized Trees (ET) [27] is similar to Random Forest but introduces randomness during training in a different way. Each tree is trained on the full training set rather than a bootstrap sample. As in Random Forest, the best split at a node is determined by examining a random subset of the available features; however, instead of searching for the optimal threshold for each feature, a single threshold is drawn at random per candidate feature, and the split that most improves the chosen score is selected. This higher degree of randomness during training produces more decorrelated trees, which further reduces variance.
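A minimal sketch using scikit-learn's implementation, with placeholder hyperparameters rather than the tuned values reported later:

```python
# ExtraTreesRegressor trains each tree on the full training set (bootstrap=False
# by default) and draws split thresholds at random, as described above.
from sklearn.ensemble import ExtraTreesRegressor

et = ExtraTreesRegressor(n_estimators=300, max_features="sqrt", random_state=42)
et.fit(X_train, y_train)
print(et.feature_importances_)  # impurity-based importances (cf. Section 3.2)
```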
2.1.3. Gaussian Process Regressor
Gaussian Process Regression (GPR) [28] is a nonparametric Bayesian approach for regression that defines a prior distribution over functions and updates it with observed data to obtain a posterior distribution. Formally, a Gaussian process is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution. For a set of training inputs $X = \{x_1, \ldots, x_n\}$ with targets $\mathbf{y} = (y_1, \ldots, y_n)^\top$, and a test point $x_*$, the model assumes

$$f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big),$$

where $f(x)$ is a Gaussian process with mean function $m(x)$ and covariance (kernel) function $k(x, x')$.

The joint distribution of the training outputs $\mathbf{y}$ and the prediction $f_*$ at $x_*$ is given by

$$\begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left( \mathbf{0}, \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, x_*) \\ K(x_*, X) & k(x_*, x_*) \end{bmatrix} \right),$$

where $K(X, X)$ is the covariance matrix computed using the kernel function. The predictive distribution for $f_*$ is then Gaussian with closed-form mean and variance:

$$\bar{f}_* = K(x_*, X)\big[K(X, X) + \sigma_n^2 I\big]^{-1} \mathbf{y}, \qquad \mathbb{V}[f_*] = k(x_*, x_*) - K(x_*, X)\big[K(X, X) + \sigma_n^2 I\big]^{-1} K(X, x_*).$$
The choice of kernel (e.g., squared exponential, Matérn, rational quadratic) determines the smoothness and flexibility of the model. Hyperparameters, such as the length scale, signal variance, and noise variance, are typically optimized by maximizing the log marginal likelihood of the model.
GPR provides not only point predictions but also an uncertainty estimate for each prediction, making it particularly valuable in applications where confidence intervals are as important as mean predictions.
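The following sketch assumes scikit-learn and a squared exponential (RBF) kernel with a white-noise term; the kernel is an illustrative choice, and its hyperparameters are optimized internally by maximizing the log marginal likelihood during fitting.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Squared exponential kernel with learned output scale plus observation noise.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=42)
gpr.fit(X_train, y_train)

# GPR returns both a mean prediction and a per-sample standard deviation.
mean, std = gpr.predict(X_test, return_std=True)
```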
2.1.4. K-Nearest Neighbors
KNN [29] is a nonparametric, instance-based learning method that predicts the response of a query sample $x_q$ as the (optionally distance-weighted) mean of the target values of its $k$ nearest training neighbors. Given a training set $\{(x_i, y_i)\}_{i=1}^{n}$, the predicted value $\hat{y}_q$ is given by

$$\hat{y}_q = \frac{\sum_{i \in \mathcal{N}_k(x_q)} w_i \, y_i}{\sum_{i \in \mathcal{N}_k(x_q)} w_i},$$

where $\mathcal{N}_k(x_q)$ denotes the set of $k$ nearest neighbors of $x_q$ according to the distance metric $d(\cdot, \cdot)$, and $w_i$ are optional distance weights (e.g., $w_i = 1/d(x_q, x_i)$, with $w_i = 1$ for uniform weighting) used to reduce the influence of more distant samples.
The main hyperparameters and practical choices are as follows:
Number of neighbors k: Small k yields irregular, noise-sensitive decision boundaries, whereas large k produces smoother boundaries with higher bias.
Distance metric $d(\cdot, \cdot)$: Euclidean (default), Manhattan, or Minkowski; Mahalanobis when features are correlated; Hamming for categorical variables.
Weighting: Uniform or distance-based.
Preprocessing: Normalize or standardize features, handle outliers, and optionally apply dimensionality reduction to mitigate the “curse of dimensionality”.
KNN offers advantages such as simplicity, strong baseline performance, the ability to model nonlinear boundaries, and native multiclass support. Its main limitations are the high computational cost at prediction time, sensitivity to feature scaling and irrelevant attributes, and degraded performance in high-dimensional spaces.
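A minimal sketch of the KNN setup used here, with standardization fitted inside a pipeline so that scaling parameters are learned from the training data only; $k$ and the weighting scheme are placeholder choices.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Standardize to zero mean and unit variance before distance computation,
# matching the preprocessing described in Section 2.
knn = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=5, weights="distance", metric="minkowski"),
)
knn.fit(X_train, y_train)
```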
2.1.5. Extreme Gradient Boosting
Extreme Gradient Boosting (XGBoost) [30] is an optimized implementation of gradient boosting that builds an ensemble of decision trees sequentially. Each tree attempts to correct the residual errors of the previous ones, minimizing a differentiable loss function through gradient-based optimization. The model at iteration $t$ is defined as

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i), \qquad f_t \in \mathcal{F},$$

where $\mathcal{F}$ is the space of regression trees. XGBoost introduces a regularized objective that balances predictive accuracy and model complexity:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} \ell\big(y_i, \hat{y}_i^{(t)}\big) + \sum_{k=1}^{t} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \|w\|^2,$$

with $\ell$ being the loss function, $T$ the number of leaves in a tree, and $w$ the leaf weights. This formulation penalizes overly complex trees, improving generalization.
The main hyperparameters include: (i) the learning rate $\eta$, which controls the contribution of each tree; (ii) the number of boosting rounds; (iii) the maximum depth of trees, influencing the bias-variance trade-off; (iv) the regularization terms $\lambda$ and $\alpha$ (L2 and L1 penalties, respectively); and (v) subsampling of rows and features to prevent overfitting.
XGBoost is computationally efficient due to optimizations such as parallel tree construction, cache-aware data structures, and support for sparse inputs. Its advantages include high predictive performance, scalability to large datasets, and flexibility to handle regression, classification, and ranking tasks. Limitations include the need for careful hyperparameter tuning and the risk of overfitting in small datasets if regularization and subsampling are not correctly configured.
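An illustrative configuration using the xgboost Python package, mapping the hyperparameters discussed above onto their API names; the values are placeholders, not the tuned ones.

```python
from xgboost import XGBRegressor

# learning_rate = eta; n_estimators = boosting rounds; reg_lambda/reg_alpha =
# L2/L1 penalties; subsample/colsample_bytree = row/feature subsampling.
xgb = XGBRegressor(
    n_estimators=300, learning_rate=0.05, max_depth=4,
    reg_lambda=1.0, reg_alpha=0.0,
    subsample=0.8, colsample_bytree=0.8, random_state=42,
)
xgb.fit(X_train, y_train)
```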
2.2. Bayesian Optimization
Bayesian Optimization (BO) is a sequential technique for optimizing functions, particularly those that are computationally expensive to evaluate. It is especially helpful for the hyperparameter optimization of ML methods (Appendix A.1) [31]. BO iteratively builds and refines a probabilistic model of the objective function to determine optimal hyperparameters. This strategy works particularly well for problems where function evaluations are expensive, since it aims to find the optimal answer with the fewest possible evaluations [32].
The following steps are part of the optimization process:
1. Define the Problem: optimize an objective function $f(\mathbf{x})$, where $f$ represents the model's validation loss. In this work, the objective function is the Root Mean Square Error (RMSE).
2. Choose a Surrogate Model: assume $f$ follows a probabilistic model. The common choice is a Gaussian Process (GP), but other models, such as Random Forests or Bayesian neural networks, can be used. This prior captures beliefs about $f$ before seeing any data.
3. Collect Initial Data: evaluate $f$ at a small number of initial points and fit the GP to these observations.
4. Update the Surrogate Model: after observing data $\mathcal{D} = \{(\mathbf{x}_i, f(\mathbf{x}_i))\}$, update the posterior distribution of the surrogate. The model gives both a mean prediction $\mu(\mathbf{x})$ and an uncertainty estimate $\sigma(\mathbf{x})$.
5. Define an Acquisition Function: the acquisition function balances exploration (sampling where $\sigma(\mathbf{x})$ is high) and exploitation (sampling where $\mu(\mathbf{x})$ is low). Common acquisition functions are Expected Improvement (EI), Upper Confidence Bound (UCB), and Probability of Improvement (PI).
6. Optimize the Acquisition Function: find the point $\mathbf{x}_{\text{next}}$ that maximizes the acquisition function. This step is usually much cheaper than evaluating $f$.
7. Evaluate the Objective Function: evaluate $f(\mathbf{x}_{\text{next}})$ and add the new pair $(\mathbf{x}_{\text{next}}, f(\mathbf{x}_{\text{next}}))$ to the dataset.
8. Update and Repeat: update the surrogate model with the expanded dataset, iterating steps 4–7 until convergence or until the iteration budget is reached.
9. Return the Best Solution: output the best observed $\mathbf{x}^*$. An implementation sketch of this loop is given below.
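One possible realization of this loop, assuming the scikit-optimize package: BayesSearchCV wraps steps 2–9 around a GP surrogate, using the 3-fold CV of Section 2 and RMSE as the objective via scikit-learn's negated scoring convention. The search space shown for the ET model is a hypothetical example, not the study's exact configuration.

```python
from skopt import BayesSearchCV
from skopt.space import Integer, Categorical
from sklearn.ensemble import ExtraTreesRegressor

opt = BayesSearchCV(
    ExtraTreesRegressor(random_state=42),
    {
        "n_estimators": Integer(100, 1000),
        "max_depth": Integer(2, 20),
        "max_features": Categorical(["sqrt", "log2"]),
    },
    n_iter=50,                              # evaluation budget
    cv=3,                                   # 3-fold CV within the training set
    scoring="neg_root_mean_squared_error",  # maximizing this minimizes RMSE
    random_state=42,
)
opt.fit(X_train, y_train)
print(opt.best_params_, -opt.best_score_)   # best hyperparameters and CV RMSE
```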
2.3. Performance Metrics
To assess predictive performance, we employ several complementary metrics. The Coefficient of Determination ($R^2$) [33] measures the proportion of variance in the dependent variable explained by the model, ranging from $-\infty$ to 1, with 1 indicating perfect prediction:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$

where $y_i$ are the true values, $\hat{y}_i$ are the predicted values, $\bar{y}$ is the mean of the true values, and $n$ is the number of samples.

The Root Mean Squared Error (RMSE) [34] quantifies the square root of the average squared prediction errors:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},$$

where large deviations are penalized more strongly due to the squared term.

The Mean Absolute Percentage Error (MAPE) [35] expresses the error as a percentage of the true values:

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|,$$

where the error of each prediction is normalized by the corresponding true value.

The Mean Absolute Error (MAE) [34] is the average of the absolute deviations:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|,$$

being less sensitive to outliers than RMSE.

The Accuracy within 10% (A10) [36] reports the proportion of predictions that fall within $\pm 10\%$ of the true values:

$$\mathrm{A10} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left( \left| \frac{\hat{y}_i - y_i}{y_i} \right| \leq 0.10 \right),$$

where $\mathbb{1}(\cdot)$ is the indicator function that equals 1 if the condition is satisfied and 0 otherwise.

Finally, the Pearson Correlation Coefficient ($R$) [37] evaluates the linear correlation between predictions and true values:

$$R = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2} \, \sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}},$$

where $\bar{y}$ and $\bar{\hat{y}}$ denote the means of the true and predicted values, respectively. Its value ranges from $-1$ (perfect negative correlation) to 1 (perfect positive correlation).
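For reproducibility, all six metrics can be computed directly with NumPy; the function below is a straightforward transcription of the definitions above (y and yhat are arrays of observed and predicted plant heights), not the authors' original code.

```python
import numpy as np

def evaluate(y, yhat):
    """Compute R2, RMSE, MAE, MAPE (%), A10, and Pearson R."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    resid = y - yhat
    r2 = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)
    rmse = np.sqrt(np.mean(resid**2))
    mae = np.mean(np.abs(resid))
    mape = 100.0 * np.mean(np.abs(resid / y))
    a10 = np.mean(np.abs((yhat - y) / y) <= 0.10)  # fraction within +/-10%
    r = np.corrcoef(y, yhat)[0, 1]                 # Pearson correlation
    return {"R2": r2, "RMSE": rmse, "MAE": mae, "MAPE": mape, "A10": a10, "R": r}
```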
2.4. Performance Index (PI)
To facilitate a comprehensive comparison of machine learning models, we employed a Performance Index (PI) that consolidates multiple evaluation metrics into a single score. The PI provides a balanced view of model accuracy and error minimization, avoiding biases toward any individual metric.
Let $M = \{m_1, \ldots, m_J\}$ be the set of evaluation metrics. In this study, the set considered was $M = \{R^2, \mathrm{RMSE}, \mathrm{MAE}, \mathrm{MAPE}, \mathrm{A10}, R\}$. Each metric is first normalized to the range $[0, 1]$:

$$\tilde{m}_{ij} = \frac{m_{ij} - \min_i m_{ij}}{\max_i m_{ij} - \min_i m_{ij}},$$

where $m_{ij}$ is the value of metric $j$ for model $i$.

The models were evaluated over 50 independent runs. To ensure a fair and stable comparison, the normalization is applied per run: in each of the 50 independent runs, the minimum and maximum bounds for metric $j$ are calculated strictly across the five evaluated models (EN, ET, GPR, KNN, XGB) within that specific run. This prevents instability by limiting the normalization scope to the models evaluated in each iteration. Consequently, the PI scores reported in the results represent the mean and standard deviation of the 50 individual PI calculations per model.

For error-based metrics, where lower values indicate better performance (RMSE, MAE, MAPE), the normalized scores are inverted so that higher values consistently represent better performance: $s_{ij} = 1 - \tilde{m}_{ij}$. For accuracy-based metrics, where higher values are better ($R$, $R^2$, A10), we retain the normalized form: $s_{ij} = \tilde{m}_{ij}$.

The PI of model $i$ is then computed as a weighted sum of the adjusted metrics:

$$\mathrm{PI}_i = \sum_{j=1}^{J} w_j \, s_{ij},$$

where the weights satisfy $\sum_{j=1}^{J} w_j = 1$. In this work, we used equal weights ($w_j = 1/J$) to ensure that all metrics contribute equally. Finally, models are ranked according to their PI scores in descending order. A computational sketch of the per-run procedure is given below.
2.5. Uncertainty Analysis
The stability and reliability of model predictions were assessed through an uncertainty analysis. This approach quantifies the dispersion of predictions, thereby complementing traditional measures [38].
Given the test set predictions $\hat{y}$ and observed values $y$, the residuals are defined as

$$e_i = y_i - \hat{y}_i.$$

From these residuals, we compute the mean error

$$\bar{e} = \frac{1}{n} \sum_{i=1}^{n} e_i,$$

and the standard deviation of the errors

$$\sigma_e = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (e_i - \bar{e})^2}.$$

To further explore prediction variability, we generate $N$ synthetic input samples $\tilde{x}_j$ drawn uniformly from the observed feature space:

$$\tilde{x}_j \sim \mathcal{U}(x_{\min}, x_{\max}), \qquad j = 1, \ldots, N.$$

The model produces predictions $\tilde{y}_j = f(\tilde{x}_j)$ for these synthetic inputs. First, we compute the median prediction

$$\tilde{y}_{\mathrm{med}} = \mathrm{median}(\tilde{y}_1, \ldots, \tilde{y}_N).$$

Then, the mean absolute deviation (MAD) from the median is given by

$$\mathrm{MAD} = \frac{1}{N} \sum_{j=1}^{N} \left| \tilde{y}_j - \tilde{y}_{\mathrm{med}} \right|.$$

Finally, the relative uncertainty score is defined as

$$\mathrm{Uncertainty} = \frac{\mathrm{MAD}}{\left| \tilde{y}_{\mathrm{med}} \right|}.$$

This formulation provides a normalized measure of prediction dispersion, enabling a comparative assessment of model stability. Models with lower uncertainty values are considered more reliable, as they combine both accuracy and robustness.
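Under one plausible reading of these definitions, the score can be computed as below (a sketch assuming a fitted scikit-learn-style model with a predict method; the uniform sampling bounds are taken per feature from the observed data).

```python
import numpy as np

def uncertainty_score(model, X, n_samples=1000, seed=42):
    """Relative dispersion of model predictions over synthetic inputs."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    lo, hi = X.min(axis=0), X.max(axis=0)            # observed feature ranges
    X_syn = rng.uniform(lo, hi, size=(n_samples, X.shape[1]))
    preds = model.predict(X_syn)
    med = np.median(preds)
    mad = np.mean(np.abs(preds - med))               # MAD from the median
    return mad / abs(med)                            # relative uncertainty
```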
3. Results
Table 2 presents the performance metrics of the five machine learning models across 50 independent runs, summarizing the models' results using key evaluation metrics, including the coefficient of determination R², RMSE, MAE, MAPE, and A10.
The ET model performs the best overall, achieving the highest R² value of 0.426, indicating a moderate positive correlation between observed and predicted values. Additionally, ET achieves the lowest RMSE of 6.859 cm, demonstrating its superior ability to minimize prediction errors. Nevertheless, this figure means that the models account for less than half of the variation in soybean plant height. This moderate explanatory power can be attributed to several reasons. First, parameters such as plant height vary significantly among cultivars and field conditions, indicating that genotype–environment interactions naturally impact soybean growth. Furthermore, the model's capacity to identify underlying patterns may be further compromised by the variability and errors that manual field observations may introduce. These factors suggest that although the proposed approach provides insightful information, model generalizability and predictive performance could be enhanced by incorporating a broader range of environmental and genetic indicators and expanding the dataset across multiple locations and seasons.
The XGB model also performs well, with an R² of 0.379 and an RMSE of 7.114 cm. While it is not as accurate as ET, XGB still offers strong predictive performance, as evidenced by its relatively low RMSE and high A10 score of 0.677. The EN model produces a relatively low R² of 0.171, reflecting its weaker predictive power compared to ET and XGB, and an RMSE of 8.082 cm, higher than that of both ensemble models.
Both the GPR and KNN models exhibit low performance across all metrics. GPR has an R² of 0.011 and an RMSE of 9.102 cm, while KNN shows a similarly low R² of 0.009 and an RMSE of 9.109 cm. The ET model stands out for its superior predictive performance, followed by XGB. Both EN and KNN perform below expectations, and GPR shows the least effectiveness in terms of both predictive accuracy and error minimization. These results underline the importance of selecting suitable models that balance predictive accuracy and computational efficiency.
Figure 2 presents the relationship between observed and predicted values for the five machine learning models (EN, ET, GPR, KNN, and XGB). The ET and XGB models show the closest alignment with the ideal prediction line, indicating that they perform well in predicting plant height. Both models exhibit higher fit values, around 0.61, which suggests a moderate relationship between the predicted and observed data. These findings support the importance of using robust machine learning models in agriculture, as outlined in the introduction, where the accurate prediction of plant traits is crucial for optimizing cultivar selection and improving yield estimations.
In contrast, the GPR and KNN models perform less effectively, with low fits to the observed data. This indicates that while these models are simpler and may be faster to compute, they are less reliable for predicting soybean plant height in this study. This highlights the challenge mentioned in the introduction regarding the complexity of predicting agronomic traits, such as plant height, using less advanced models.
3.1. Integrated Performance and Model Robustness
Table 3 presents the mean PI scores across model runs, along with their standard deviations. The ET model ranks first with the highest mean PI score of 0.776, accompanied by a standard deviation of 0.124.
The XGB model follows closely in second place with a mean PI score of 0.738 and a higher standard deviation of 0.161. While XGB demonstrates strong performance, its higher variability suggests that it may be more sensitive to changes in the data or model parameters, which can affect its consistency.
EN ranks third, with a mean PI score of 0.526 and a standard deviation of 0.090. Though its PI score is lower than that of ET and XGB, it still provides a reasonably stable performance. The KNN and GPR models show comparatively weaker performance, with KNN scoring 0.362 (0.070) and GPR scoring 0.347 (0.081). These models have lower mean PI scores and higher variability, suggesting that they are less reliable in this context.
Table 4 presents performance metrics showing that ET provides the most reliable predictions, while XGB offers competitive performance. EN, KNN, and GPR, however, are less suited for this task due to their higher uncertainty and error rates. The ET model demonstrates the best performance with the lowest MAD (3.492 cm), uncertainty (5.073 cm), and RMSE (6.859 cm), indicating both high accuracy and low variability.
XGB follows with a slightly higher MAD (4.141 cm), uncertainty (6.267 cm), and RMSE (7.114 cm), suggesting it performs well but with more variability than ET. Elastic Net (EN) shows weaker performance with a MAD of 6.913 cm, higher uncertainty (10.312 cm), and RMSE (8.082 cm), indicating it is less effective in capturing the data’s complexities.
KNN and GPR exhibit higher MAD and RMSE, with substantial uncertainty, reflecting their limited capability in this context. These results are consistent with the study’s methodology, which evaluates the ability of machine learning models to predict soybean plant height based on agronomic traits.
Figure 3 presents a comparative uncertainty analysis for the five machine learning models evaluated in this study, illustrating the balance between prediction accuracy and uncertainty. ET is positioned near the lower left of the chart, with an RMSE of approximately 7 cm and moderate uncertainty around 5 cm; this indicates that ET delivers the most accurate predictions among the models considered. XGB lies slightly to the right and slightly higher, with RMSE near 7.2 cm and uncertainty around 6 cm, demonstrating a modest increase in both error and uncertainty relative to ET. EN appears at the far right, with uncertainty of approximately 10 cm and an RMSE of around 8.5 cm, suggesting that the linear model is unstable and less precise. KNN and GPR occupy the upper left quadrant, showing RMSE values of approximately 9 cm, but with lower uncertainties of roughly 3 cm and 2 cm, respectively. This indicates that these models consistently underperform in terms of prediction accuracy, despite relatively narrow uncertainty intervals.
The variation in feature importance rankings among models reflects their distinct learning mechanisms and the underlying agronomic structure of the data. While the number of stems (NS) was unequivocally the dominant predictor across all models, the insertion of the first pod (IFP) consistently emerged as a key secondary feature, particularly in the top-performing ensemble methods. This aligns with agronomic intuition, as IFP represents a fundamental component of vertical plant architecture that is structurally linked to overall plant stature. The strong performance of these structural traits (NS and IFP) underscores that plant architecture characteristics are primary drivers of height prediction in soybean. The prominence of grain yield (GY) in some ensemble models can be explained by its nature as a composite trait that integrates the cumulative effects of multiple yield determinants [39,40]. This analysis clarifies that the most robust and biologically meaningful features for predicting plant height are NS and IFP, providing reliable targets for future phenotyping efforts in soybean breeding programs.
3.2. Identification of Key Agronomic Predictors (Feature Importance)
Figure 4 presents the feature importance rankings derived from the five machine learning models. This analysis offers crucial insights into the relative impact of each trait on PH prediction, providing a foundation for understanding the agronomic factors that influence plant architecture. The figure, which illustrates importance scores for features such as the number of stems (NS), grain yield (GY), and others, highlights consistent patterns across models, alongside some variations that merit further examination.
Across all models, NS stands out as the most influential feature, consistently ranking highest or near the top. This observation aligns with botanical principles, where the number of stems reflects a plant's structural complexity and resource allocation strategy, directly affecting height through competitive growth within the canopy [4]. The stability of NS as a predictor, evident in its prominence across both linear (EN) and nonlinear (ET, XGB) models, suggests a strong, potentially causal relationship with PH. This consistency enhances the reliability of NS as a key metric for future field assessments, particularly for breeding programs aiming to improve shoot architecture.
Following NS, GY emerges as a significant contributor, though its importance varies by model. For instance, XGB and ET assign higher weights to GY, reflecting its role as an integrated measure of plant productivity that may correlate with taller structures due to increased biomass allocation. The variability in rankings, such as lower importance in GPR, may result from model-specific sensitivities to feature correlations or the limited dataset size, which could restrict GPR’s ability to capture these relationships effectively.
Notably, traits such as the number of legumes per plant (NLP) and number of grains per plant (NGP) exhibit lower importance scores across most models, suggesting they contribute less directly to PH variation. This observation contrasts with the expectation that reproductive traits might influence height through resource competition, indicating that structural traits, such as NS, dominate in this context. The discrepancy could reflect the specific conditions of the dataset, such as the 2022–2023 harvest at a single Brazilian site, where environmental factors may have prioritized stem development over reproductive output.
The differences in feature importance rankings among models reflect their distinct learning mechanisms. ET and XGB, which rely on ensemble techniques and feature subsampling, emphasize NS and GY, likely due to their ability to detect nonlinear interactions [27,30]. In contrast, EN's linear framework provides more balanced weights, while GPR's low importance scores across all features may indicate overfitting or an unsuitable kernel choice for this dataset [28]. These model-specific interpretations indicate that ensemble methods are more suitable for capturing the complex trait interactions in soybean PH prediction.
From a practical perspective, Figure 4 offers valuable guidance for agricultural applications. Prioritizing NS measurement in field trials could simplify data collection efforts, reducing costs while maintaining predictive accuracy. Overall, Figure 4 provides a foundation for linking agronomic traits to PH, supporting data-driven decisions in cultivar selection and breeding. However, further research is necessary to address its limitations.
4. Discussion
This study demonstrates the effectiveness of ensemble ML models, particularly ET, in predicting soybean PH from ground-based agronomic traits. Indeed, the superior performance of ET, as evidenced by its highest mean R² and lowest RMSE, highlights the nonlinear and complex nature of the relationships between agronomic traits and plant architecture. Moreover, the model's ability to capture approximately 43% of the variance in PH, despite the inherent field variability, underscores the potential of using easily measurable agronomic traits for phenotypic forecasting.
The ensemble structure and nonparametric topology of the ET model are responsible for its better performance. In order to reduce variance and capture intricate nonlinear dependencies between agronomic traits, ET builds multiple fully randomized decision trees and aggregates their outputs, in contrast to linear approaches such as EN, which rely on fixed coefficients and assume additive relationships among predictors. To increase model variety and reduce overfitting, each tree is trained using the complete dataset, but with random feature and threshold selection.
On the other hand, GPR, which models smooth functional relationships using a kernel-based probabilistic structure, may perform poorly when the data show discontinuities or heterogeneous feature effects, as in agronomic datasets. The KNN approach is susceptible to feature scaling and data sparsity, since it relies on local distance calculations and lacks an explicit internal structure. Meanwhile, XGB utilizes additive learning in conjunction with sequential tree boosting, which enables high predictive capacity but requires precise hyperparameter adjustment to prevent overfitting in small datasets. ET consistently outperforms the other models across evaluation metrics because it offers a better bias–variance trade-off and greater stability by averaging uncorrelated randomized trees.
These architectural variations support the use of ensemble tree models, such as ET, for soybean plant height prediction by showing that they are especially well-suited for agronomic applications where nonlinear interactions and feature heterogeneity predominate.
The capacity of the ET model to capture the intricate and nonlinear interactions between structural and yield-related factors that collectively influence soybean plant height explains its superior performance from an agronomic standpoint. The physiological principles of competition for light, nutrient distribution, and biomass allocation within the canopy are reflected in traits such as the number of stems (NS) and grain yield (GY), which exhibit nonlinear responses to genetic and environmental variation. Without making any assumptions about the smoothness or linearity of the data, the ET model successfully captures these nonlinear dependencies and interactions by building a diversified ensemble of fully randomized decision trees. Due to its adaptability, it can effectively manage the heterogeneity present in multi-cultivar agronomic datasets, where phenotypic expression is significantly influenced by genotype–environment interactions.
The GPR, on the other hand, is predicated on a kernel-based probabilistic framework and presumes smooth, continuous functional relationships. These assumptions may not hold for field-based agronomic data, which often exhibit discontinuities caused by local soil variability, environmental stress, or cultivar-specific growth patterns. Additionally, the GPR model may have over-smoothed the response and underfitted, given the small dataset size and heterogeneous feature scales. As a result, GPR's predictive accuracy was worse than that of the ET model, since it was unable to detect sudden, nonlinear changes in plant height.
These results show that the ET model is superior not just statistically but also agronomically, owing to its capacity to capture the biological complexity and diversity of soybean development under field conditions. The model's ability to generalize across genotypes and harvests thus accounts for its resilience, providing a dependable and interpretable means of incorporating data-driven insights into breeding and crop management plans.
When compared to prior studies on crop trait prediction, the current findings align with trends in ML applications for agriculture, but highlight specific challenges with ground-based agronomic data. Incidentally, the effectiveness of our BO framework is consistent with its proven utility in other domains, such as structural health monitoring, where a BO-LSTM network achieved superior accuracy in identifying defects in concrete-filled steel tubes [41].
The results of this investigation support and expand upon the findings of recent studies that used machine learning to predict soybean and crop traits. For instance, Sarkar et al. [19] estimated the composition of soybean seeds from crop photos using machine learning models, and their R² values ranged from 0.40 to 0.65, which is similar to the ET model's moderate explanatory power (R² = 0.43). Reiterating that performance metrics in this range are typical for agronomic predictions under field variability, Fu et al. [14] used deep learning with multi-source remote sensing data to forecast soybean production and reported RMSE values between 6 and 9 cm.
In a related investigation, Teodoro et al. [42] used deep learning to estimate plant height directly, obtaining an MAE of 8.32 cm and an RMSE of 10.51 cm. Even with the use of high-dimensional multispectral inputs, these errors are noticeably larger than those produced by our ET model (MAE = 5.36 cm, RMSE = 6.86 cm), indicating that the suggested ground-based framework may produce competitive accuracy with significantly less complicated data collection.
The current study supports the findings of Rahaman et al. [13] and Yuan et al. [11], who highlighted the superiority of ensemble models such as Random Forests and Gradient Boosting over linear regressors in managing heterogeneous agronomic data. However, whereas most of that research relied heavily on UAV or remote sensing data, our results show that comparable accuracy can be attained using only field-measured agronomic features.
As a result, our findings support the use of ensemble tree models on ground-based data as an affordable and interpretable alternative for phenotypic prediction in soybeans, while also being consistent with evidence from previous agricultural ML studies. Indeed, when compared to prior studies on crop trait prediction, the current findings align with trends in ML applications for agriculture but highlight specific challenges with ground-based agronomic data. For instance, recent work by Sarkar et al. [19] achieved higher R² values (0.65–0.89) for seed composition prediction using image-based features, but was limited to controlled laboratory conditions. In contrast, our field-based approach addresses real-world variability across multiple harvests, albeit with moderate explanatory power. Similarly, de Souza et al. [18] reported superior accuracy using UAV multispectral data; however, their methodology requires specialized equipment and focuses on indirect spectral indices rather than direct relationships with agronomic traits. Our work demonstrates that competitive accuracy can be achieved using readily measurable field traits, providing a cost-effective alternative particularly valuable for resource-constrained agricultural systems. Furthermore, unlike Delfani et al. [22], who noted limited interpretability in complex ML models, our framework maintains explainability through feature importance analysis while achieving robust performance.
A critical finding of this study is the consistent identification of the number of stems (NS) and the insertion of the first pod (IFP) as the most influential features for predicting plant height. This provides a novel mechanistic insight into the determinants of soybean plant architecture, suggesting that structural complexity (NS) and pod insertion height (IFP) are key drivers of height. Agronomically, this implies that breeding programs aiming to optimize plant height can focus on these traits, potentially simplifying selection criteria.
It is crucial to recognize that potential multicollinearity among agricultural traits may affect how feature importance is interpreted, especially in linear models. In this research, the use of both regularized and ensemble learning methods aimed to lessen these effects. The EN model integrates L1 and L2 regularization terms, which penalize correlated predictors and stabilize coefficient estimates, ultimately reducing bias linked to multicollinearity. Additionally, ensemble tree-based approaches, such as ET and XGB, are naturally less affected by feature correlation, as their hierarchical and randomized feature selection processes distribute importance across various decision paths. Consequently, the feature importance analysis highlighted findings from ET and XGB, which provide robust impurity-based importance measures even under correlated inputs. The consistent recognition of the number of stems (NS) and insertion of the first pod (IFP) as the most significant features across different modeling approaches underscores their agronomic significance and biological validity. However, future research could enhance this analysis with formal collinearity diagnostics, such as variance inflation factor evaluation or feature clustering, to further confirm the independence of essential predictors.
The uncertainty analysis further reinforces the reliability of the ET model, with the lowest mean absolute deviation and uncertainty. This low uncertainty is crucial for practical applications, such as cultivar selection, where confidence in predictions is necessary for decision-making. The PI rankings confirm the overall superiority of ET, integrating multiple metrics to provide a comprehensive assessment.
Despite the described strengths, several limitations must be considered. In particular, the dataset, comprising 320 samples from 40 cultivars over two harvests at a single location in Brazil, may limit generalizability to diverse climates or soil types [22]. In addition, the moderate R² suggests unmodeled variability, possibly due to unmeasured factors such as soil nutrients or pests. Such a performance level, while moderate, is comparable to or exceeds that of several recent studies using more complex data sources, such as Rahaman et al. [13], who reported R² values of 0.35–0.45 for yield prediction using satellite imagery, and Yuan et al. [11], who achieved comparable accuracy for maturity date prediction using multispectral UAV data. Last, but not least, the Bayesian optimization process, although efficient, explored a predefined hyperparameter space; exploring broader ranges or employing alternative acquisition functions could yield further performance improvements.
In conclusion, this study establishes a robust, explainable ML framework for predicting soybean plant height using ground-based agronomic traits. The novel contribution lies in demonstrating that ensemble models, such as ET, can effectively capture the nonlinear relationships in agronomic data, providing a cost-effective and interpretable alternative to remote sensing. The identification of NS and IFP as key traits offers new insights for breeding programs. Future research should focus on expanding the dataset across diverse environments and integrating these models into real-time farm management systems to enhance crop productivity and food security.
5. Conclusions
This research established a robust, explainable machine learning framework for predicting soybean PH using solely ground-based agronomic traits, offering a low-cost and competitive alternative to methods relying on complex remote sensing data.
The Extra Trees (ET) ensemble model emerged as the superior predictor, demonstrating high accuracy and stability (R² of 0.426, RMSE of 6.859 cm, and uncertainty of 5.073%). The model's efficacy is attributed to its nonparametric topology and ensemble structure, which effectively manage the nonlinear dependencies and feature heterogeneity common in complex field data, surpassing the performance of linear (EN) and probabilistic (GPR) models.
A critical finding is the consistent identification of the number of stems (NS) and insertion of the first pod (IFP) as the most influential features for PH prediction. This provides actionable insight for soybean breeding programs, suggesting that prioritizing the measurement of NS can simplify data collection and maintain high predictive utility for optimizing shoot architecture and yield. Furthermore, the achieved error rates (MAE = 5.36 cm, RMSE = 6.86 cm) are competitive with deep learning methods utilizing high-dimensional multispectral data (MAE = 8.32 cm, RMSE = 10.51 cm), validating the precision of this ground-based approach.
The developed framework, which integrates BO, PI, and uncertainty analysis, presents a reliable and transparent decision-support tool for cultivar selection and phenotypic forecasting that addresses key limitations of previous approaches, including limited interpretability, dataset constraints, and a lack of uncertainty quantification. Future work should focus on expanding the dataset to encompass diverse geographical locations and seasons, thereby enhancing generalizability. Additionally, integrating these ground-based predictions with aerial data or implementing the models in real-time farm management systems will be essential steps toward maximizing food security and optimizing crop management strategies in soybean production.