1. Introduction
The expectation that agriculture will provide enough food for a growing world population will remain a major concern for the foreseeable future [1]. Increasing agricultural yields has frequently been cited as a possible way to address the problem of feeding this expanding population [2]. For many crops, field-scale traits such as lodging resistance and plant density strongly influence grain yield. Both factors are directly shaped by shoot architecture: plants with sturdier, well-balanced shoot structures are less susceptible to lodging, while architectures that optimize branch and stem arrangement allow higher planting densities without compromising light interception or resource allocation [3]. Therefore, achieving optimal shoot architecture has become a central breeding goal [4].
The United States Department of Agriculture (USDA) estimates that approximately 399.50 million tons of soybeans (Glycine max (L.) Merr.) are produced annually, making it one of the major crops in the world. Brazil and the United States are the two largest producers, accounting for approximately 41% and 28% of global production, respectively [5]. Soybeans have several uses, including the production of biofuel and serving as a source of nourishment for humans and animals. Most soybeans cultivated worldwide are processed into meal and oil [6]. The high-protein meal is primarily used in the feed sector for cattle, pigs, and poultry [7], while industry uses the oil as a raw material for a variety of products, including hydrogenated fats, refined oil, margarine, and mayonnaise [8]. Additionally, soybean oil is now the primary raw material used to produce biodiesel [9].
To achieve the highest yields, farmers must consider environmental conditions (such as soil, rainfall, and temperature) and effective crop genotype management. They can now survey their production fields using remote sensing technologies to help monitor such conditions [10]. Predicting agronomic factors such as days to maturity (DM), plant height (PH), and grain yield (GY) remains difficult, however, particularly with indirect approaches such as remote sensing systems based on multispectral data gathered by unmanned aerial vehicles (UAVs) [11]. The introduction of new UAV platforms has recently enabled the rapid and thorough mapping of farmland, producing a large volume of data that must be assessed and integrated to support agricultural management [12].
Machine learning (ML) techniques have been used to interpret remote sensing data over the past ten years, a promising approach to support variety evaluation and selection as well as other agricultural applications. A fair number of works addressing crop yield prediction based on machine learning models for various crops can be found in the associated literature [13,14,15,16,17,18,19]. Although the use of ML in agriculture has grown, the majority of research to date has focused on leveraging remote sensing data to estimate crop characteristics such as height or yield [20,21]. Despite their strength, such methods rely on indirect measures and often achieve high accuracy only for a small number of cultivars or under specific constraints [13]. In contrast, the direct prediction of soybean plant height from field-based agronomic variables remains relatively unexplored. In particular, the few studies currently available are often limited to small datasets, do not follow consistent evaluation procedures, and offer limited information about the traits with the most significant effects on plant height [18,19]. Furthermore, many current approaches do not quantify prediction uncertainty, which is crucial for crop management and decision-making in plant breeding [22].
In the aforementioned context, it should be highlighted that the inference of agronomic variables for soybean (Glycine max (L.) Merr.) cultivars based solely on remote sensing data remains an unresolved and difficult task, particularly when multiple genotypes across various geographical locations and seasons are taken into account [23]. Given that agriculture plays a crucial role in the economies of many nations, determining or forecasting factors such as DM, PH, and GY can aid decision-making and advance crop management techniques in the future [22,24].
This study addresses the gap in the existing literature regarding a robust, data-driven framework for predicting soybean PH from a diverse set of agronomic traits. In contrast to previous methods, our technique combines multiple algorithms, methodical hyperparameter optimization, and a comprehensive assessment pipeline that includes uncertainty analysis, accuracy, error metrics, and feature importance. This improves agronomic interpretability and predictive performance by identifying the most important traits. The suggested framework provides a consistent and repeatable method that can guide cultivar selection and phenotypic forecasting in soybean breeding efforts, offering both accuracy and explainability.
The primary objective of this research work was to develop a temporal or stage-aware ML framework that explicitly addresses the dynamic changes in trait–PH relationships across harvest stages using a wide range of measurable agronomic traits. Additionally, the study aimed to evaluate and compare the performance of various ML models, assess the uncertainty of their predictions, and identify the key features that most significantly influence soybean plant height.
2. Materials and Methods
The data were collected from a field experiment conducted at the Pequizeiro farm, located at the Accert Pesquisa e Consultoria Agronomica Experimental Station, approximately 10 km from the municipality of Balsas, MA, Brazil. The site is situated at an elevation of about 283 m, at latitude 07°31′57″ S and longitude 46°02′08″ W (Figure 1). The experiment was conducted during the 2022–2023 harvest, and additional agronomic information can be consulted in reference [25].
The dataset is composed of 320 samples (40 cultivars with 4 replications in each of the two seasons), and the features are: Season, Cultivar, Repetition, plant height (cm) (PH), insertion of the first pod (cm) (IFP), number of legumes per plant (unit) (NLP), number of grains per plant (unit) (NGP), number of grains per pod (unit) (NGL), number of stems (unit) (NS), mass of 100 grains (g) (MHG), and grain yield (kg ha⁻¹) (GY).
The employed dataset was checked for consistency prior to analysis. No missing values were present in any of the recorded agronomic traits. Furthermore, all measurements fell within biologically plausible ranges for soybean cultivars, and no outliers were removed, ensuring that the dataset accurately reflects natural phenotypic variation under field conditions. A comprehensive and consistent dataset was used directly for model training and evaluation.
The dataset consisted of observations from two consecutive soybean harvests. To ensure that the training and testing splits included samples from both sowings, the data from the two harvests were combined into a single dataset for model construction. Instead of being limited to a single season, this methodology enabled the models to learn from the variability introduced by different harvests. Crucially, the models were trained on the pooled dataset rather than individually for each harvest, ensuring that prediction performance reflects the capacity to generalize across harvest stages. Using a holdout method, 80% of the dataset was reserved for training and 20% for final testing. Within the training subset, 3-fold cross-validation (k = 3) was then employed to optimize model parameters and prevent overfitting. In this process, the training data were divided into three folds, and models were repeatedly trained on two of the folds and validated on the third. The optimal hyperparameters were chosen based on the average performance across folds and then assessed on the independent test subset. This method combines the objective evaluation offered by an external test set with the stability of cross-validation.
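For concreteness, the following minimal sketch (assuming Python with pandas and scikit-learn, and a hypothetical soybean_traits.csv file holding the traits listed above) reproduces the 80/20 holdout split and the 3-fold cross-validation object used for tuning; file, column, and variable names are illustrative, not the authors' original code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, KFold

df = pd.read_csv("soybean_traits.csv")   # hypothetical file with pooled harvests
X = df.drop(columns=["PH"])              # predictors: IFP, NLP, NGP, NGL, NS, MHG, GY, ...
y = df["PH"]                             # target: plant height (cm)

# 80% training / 20% independent final test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# 3-fold CV used only within the training subset for hyperparameter selection
cv = KFold(n_splits=3, shuffle=True, random_state=42)
```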
Table 1 presents a statistical summary of the dataset used in the models, where CV and Var. are the coefficient of variation and variance, respectively.
Regarding the collected samples, it is worth noting that PH across the 40 soybean cultivars ranged from 47.6 cm to 94.8 cm (refer to Table 1), indicating substantial phenotypic variation, from shorter, more determinate types to taller, indeterminate varieties relevant to breeding programs. Such a range confirms that the dataset includes both relatively dwarf and tall phenotypes, thus improving the generalizability of the predictive models. Moreover, no global standardization or normalization was applied to the dataset, since tree-based models (e.g., Extra Trees, XGBoost) are invariant to feature scaling. Specifically for the k-Nearest Neighbors (KNN) algorithm, which is sensitive to feature magnitudes, features were standardized to zero mean and unit variance to ensure fair distance calculations.
2.1. Machine Learning Models
The models were selected based on their ability to handle agronomic data effectively and their complementary nature. Elastic Net (EN) was employed as a linear baseline that handles multicollinearity among attributes by combining L1 and L2 penalties. Extremely Randomized Trees (ET) were selected for their resilience and ability to identify essential variables while capturing nonlinear interactions. The Gaussian Process Regressor (GPR) serves as a probabilistic benchmark that provides both predictions and uncertainty estimates useful for agricultural decision-making. KNN offered a straightforward, instance-based baseline against which to illustrate the benefits of more sophisticated techniques. Lastly, ET was complemented by Extreme Gradient Boosting (XGB), which provides regularized, scalable ensembles with gradient-based optimization to model intricate trait interactions.
2.1.1. Elastic Net
An extension of linear regression, Elastic Net (EN) adds L1 (Lasso) and L2 (Ridge) regularization penalties to the loss function [26]. EN is useful for building models when there are many features and some of them are highly correlated. The EN loss function is written as

$$\mathcal{L}(\beta) = \mathrm{MSE}(\beta) + \lambda \left( \alpha \, \|\beta\|_1 + \frac{1 - \alpha}{2} \, \|\beta\|_2^2 \right),$$

where
- $\mathrm{MSE}(\beta)$: mean squared error, the regression loss term that measures the difference between predicted and actual values;
- $\lambda$: regularization parameter, which determines the strength (intensity) of the regularization term;
- $\alpha$: mixing parameter, which determines the balance between the L1 (Lasso) and L2 (Ridge) penalties;
- $\|\beta\|_1$: L1 penalty, inducing sparsity;
- $\|\beta\|_2^2$: L2 penalty, encouraging small coefficients;
- $\beta$: the coefficients of the regression model.
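As an illustration, EN is available in scikit-learn; note that scikit-learn's alpha argument plays the role of the regularization strength $\lambda$ above, while l1_ratio corresponds to the mixing parameter $\alpha$. The hyperparameter values below are placeholders (the study tuned them with Bayesian Optimization, Section 2.2).

```python
# Illustrative Elastic Net baseline; X_train/y_train come from the split
# sketched in Section 2 and are not the authors' original variables.
from sklearn.linear_model import ElasticNet

en = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000, random_state=42)
en.fit(X_train, y_train)
print(en.coef_)  # shrunken (and possibly sparse) regression coefficients
```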
2.1.2. Extremely Randomized Trees
Extremely Randomized Trees (ET) [27] is similar to Random Forest but introduces randomness during training in a different way. Each tree is trained on the full training set rather than a bootstrap sample. As in Random Forest, the best split at a node is determined by examining a random subset of the available features; however, instead of searching for the optimal threshold for each feature, a single threshold is drawn at random per candidate feature, and the split that most improves the chosen score is selected. This higher degree of randomness during training produces more decorrelated trees, which further reduces variance.
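A minimal sketch using scikit-learn's implementation, with placeholder hyperparameters rather than the tuned values reported later:

```python
# ExtraTreesRegressor trains each tree on the full training set (bootstrap=False
# by default) and draws split thresholds at random, as described above.
from sklearn.ensemble import ExtraTreesRegressor

et = ExtraTreesRegressor(n_estimators=300, max_features="sqrt", random_state=42)
et.fit(X_train, y_train)
print(et.feature_importances_)  # impurity-based importances (cf. Section 3.2)
```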
2.1.3. Gaussian Process Regressor
Gaussian Process Regression (GPR) [28] is a nonparametric Bayesian approach for regression that defines a prior distribution over functions and updates it with observed data to obtain a posterior distribution. Formally, a Gaussian process is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution. For a set of training inputs $X = \{x_1, \ldots, x_n\}$ with targets $\mathbf{y} = (y_1, \ldots, y_n)^\top$, and a test point $x_*$, the model assumes

$$f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big),$$

where $f(x)$ is a Gaussian process with mean function $m(x)$ and covariance (kernel) function $k(x, x')$.

The joint distribution of the training outputs $\mathbf{y}$ and the prediction $f_*$ at $x_*$ is given by

$$\begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left( \mathbf{0}, \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, x_*) \\ K(x_*, X) & k(x_*, x_*) \end{bmatrix} \right),$$

where $K(X, X)$ is the covariance matrix computed using the kernel function. The predictive distribution for $f_*$ is then Gaussian with closed-form mean and variance:

$$\bar{f}_* = K(x_*, X)\big[K(X, X) + \sigma_n^2 I\big]^{-1} \mathbf{y}, \qquad \mathbb{V}[f_*] = k(x_*, x_*) - K(x_*, X)\big[K(X, X) + \sigma_n^2 I\big]^{-1} K(X, x_*).$$
The choice of kernel (e.g., squared exponential, Matérn, rational quadratic) determines the smoothness and flexibility of the model. Hyperparameters, such as the length scale, signal variance, and noise variance, are typically optimized by maximizing the log marginal likelihood of the model.
GPR provides not only point predictions but also an uncertainty estimate for each prediction, making it particularly valuable in applications where confidence intervals are as important as mean predictions.
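The following sketch assumes scikit-learn and a squared exponential (RBF) kernel with a white-noise term; the kernel is an illustrative choice, and its hyperparameters are optimized internally by maximizing the log marginal likelihood during fitting.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Squared exponential kernel with learned output scale plus observation noise.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=42)
gpr.fit(X_train, y_train)

# GPR returns both a mean prediction and a per-sample standard deviation.
mean, std = gpr.predict(X_test, return_std=True)
```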
2.1.4. K-Nearest Neighbors
KNN [29] is a nonparametric, instance-based learning method that predicts the response of a query sample $x_q$ as the (optionally distance-weighted) mean of the target values of its $k$ nearest training neighbors. Given a training set $\{(x_i, y_i)\}_{i=1}^{n}$, the predicted value $\hat{y}_q$ is given by

$$\hat{y}_q = \frac{\sum_{i \in \mathcal{N}_k(x_q)} w_i \, y_i}{\sum_{i \in \mathcal{N}_k(x_q)} w_i},$$

where $\mathcal{N}_k(x_q)$ denotes the set of $k$ nearest neighbors of $x_q$ according to the distance metric $d(\cdot, \cdot)$, and $w_i$ are optional distance weights (e.g., $w_i = 1/d(x_q, x_i)$, with $w_i = 1$ for uniform weighting) used to reduce the influence of more distant samples.
The main hyperparameters and practical choices are as follows:
Number of neighbors k: Small k yields irregular, noise-sensitive decision boundaries, whereas large k produces smoother boundaries with higher bias.
Distance metric $d(\cdot, \cdot)$: Euclidean (default), Manhattan, or Minkowski; Mahalanobis when features are correlated; Hamming for categorical variables.
Weighting: Uniform or distance-based.
Preprocessing: Normalize or standardize features, handle outliers, and optionally apply dimensionality reduction to mitigate the “curse of dimensionality”.
KNN offers advantages such as simplicity, strong baseline performance, the ability to model nonlinear boundaries, and native multiclass support. Its main limitations are the high computational cost at prediction time, sensitivity to feature scaling and irrelevant attributes, and degraded performance in high-dimensional spaces.
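A minimal sketch of the KNN setup used here, with standardization fitted inside a pipeline so that scaling parameters are learned from the training data only; $k$ and the weighting scheme are placeholder choices.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Standardize to zero mean and unit variance before distance computation,
# matching the preprocessing described in Section 2.
knn = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=5, weights="distance", metric="minkowski"),
)
knn.fit(X_train, y_train)
```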
2.1.5. Extreme Gradient Boosting
Extreme Gradient Boosting (XGBoost) [30] is an optimized implementation of gradient boosting that builds an ensemble of decision trees sequentially. Each tree attempts to correct the residual errors of the previous ones, minimizing a differentiable loss function through gradient-based optimization. The model at iteration $t$ is defined as

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i), \qquad f_t \in \mathcal{F},$$

where $\mathcal{F}$ is the space of regression trees. XGBoost introduces a regularized objective that balances predictive accuracy and model complexity:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} \ell\big(y_i, \hat{y}_i^{(t)}\big) + \sum_{k=1}^{t} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \|w\|^2,$$

with $\ell$ being the loss function, $T$ the number of leaves in a tree, and $w$ the leaf weights. This formulation penalizes overly complex trees, improving generalization.
The main hyperparameters include: (i) the learning rate $\eta$, which controls the contribution of each tree; (ii) the number of boosting rounds; (iii) the maximum depth of trees, influencing the bias-variance trade-off; (iv) the regularization terms $\lambda$ and $\alpha$ (L2 and L1 penalties, respectively); and (v) subsampling of rows and features to prevent overfitting.
XGBoost is computationally efficient due to optimizations such as parallel tree construction, cache-aware data structures, and support for sparse inputs. Its advantages include high predictive performance, scalability to large datasets, and flexibility to handle regression, classification, and ranking tasks. Limitations include the need for careful hyperparameter tuning and the risk of overfitting in small datasets if regularization and subsampling are not correctly configured.
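An illustrative configuration using the xgboost Python package, mapping the hyperparameters discussed above onto their API names; the values are placeholders, not the tuned ones.

```python
from xgboost import XGBRegressor

# learning_rate = eta; n_estimators = boosting rounds; reg_lambda/reg_alpha =
# L2/L1 penalties; subsample/colsample_bytree = row/feature subsampling.
xgb = XGBRegressor(
    n_estimators=300, learning_rate=0.05, max_depth=4,
    reg_lambda=1.0, reg_alpha=0.0,
    subsample=0.8, colsample_bytree=0.8, random_state=42,
)
xgb.fit(X_train, y_train)
```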
2.2. Bayesian Optimization
Bayesian Optimization (BO) is a sequential technique for optimizing functions, particularly those that are computationally expensive to evaluate. It is especially helpful for the hyperparameter optimization of ML methods (Appendix A.1) [31]. BO iteratively builds and refines a probabilistic model of the objective function to determine optimal hyperparameters. This strategy works particularly well for problems where function evaluations are expensive, since it aims to find the optimal answer with the fewest possible evaluations [32].
The following steps are part of the optimization process:
1. Define the Problem: optimize an objective function $f(\mathbf{x})$, where $f$ represents the model's validation loss. In this work, the objective function is the Root Mean Square Error (RMSE).
2. Choose a Surrogate Model: assume $f$ follows a probabilistic model. The common choice is a Gaussian Process (GP), but other models, such as Random Forests or Bayesian neural networks, can be used. This prior captures beliefs about $f$ before seeing any data.
3. Collect Initial Data: evaluate $f$ at a small number of initial points and fit the GP to these observations.
4. Update the Surrogate Model: after observing data $\mathcal{D} = \{(\mathbf{x}_i, f(\mathbf{x}_i))\}$, update the posterior distribution of the surrogate. The model gives both a mean prediction $\mu(\mathbf{x})$ and an uncertainty estimate $\sigma(\mathbf{x})$.
5. Define an Acquisition Function: the acquisition function balances exploration (sampling where $\sigma(\mathbf{x})$ is high) and exploitation (sampling where $\mu(\mathbf{x})$ is low). Common acquisition functions are Expected Improvement (EI), Upper Confidence Bound (UCB), and Probability of Improvement (PI).
6. Optimize the Acquisition Function: find the point $\mathbf{x}_{\text{next}}$ that maximizes the acquisition function. This step is usually much cheaper than evaluating $f$.
7. Evaluate the Objective Function: evaluate $f(\mathbf{x}_{\text{next}})$ and add the new pair $(\mathbf{x}_{\text{next}}, f(\mathbf{x}_{\text{next}}))$ to the dataset.
8. Update and Repeat: update the surrogate model with the expanded dataset, iterating steps 4–7 until convergence or until the iteration budget is reached.
9. Return the Best Solution: output the best observed $\mathbf{x}^*$. An implementation sketch of this loop is given below.
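One possible realization of this loop, assuming the scikit-optimize package: BayesSearchCV wraps steps 2–9 around a GP surrogate, using the 3-fold CV of Section 2 and RMSE as the objective via scikit-learn's negated scoring convention. The search space shown for the ET model is a hypothetical example, not the study's exact configuration.

```python
from skopt import BayesSearchCV
from skopt.space import Integer, Categorical
from sklearn.ensemble import ExtraTreesRegressor

opt = BayesSearchCV(
    ExtraTreesRegressor(random_state=42),
    {
        "n_estimators": Integer(100, 1000),
        "max_depth": Integer(2, 20),
        "max_features": Categorical(["sqrt", "log2"]),
    },
    n_iter=50,                              # evaluation budget
    cv=3,                                   # 3-fold CV within the training set
    scoring="neg_root_mean_squared_error",  # maximizing this minimizes RMSE
    random_state=42,
)
opt.fit(X_train, y_train)
print(opt.best_params_, -opt.best_score_)   # best hyperparameters and CV RMSE
```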
2.3. Performance Metrics
To assess predictive performance, we employ several complementary metrics. The Coefficient of Determination ($R^2$) [33] measures the proportion of variance in the dependent variable explained by the model, ranging from $-\infty$ to 1, with 1 indicating perfect prediction:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$

where $y_i$ are the true values, $\hat{y}_i$ are the predicted values, $\bar{y}$ is the mean of the true values, and $n$ is the number of samples.

The Root Mean Squared Error (RMSE) [34] quantifies the square root of the average squared prediction errors:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},$$

where large deviations are penalized more strongly due to the squared term.

The Mean Absolute Percentage Error (MAPE) [35] expresses the error as a percentage of the true values:

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|,$$

where the error of each prediction is normalized by the corresponding true value.

The Mean Absolute Error (MAE) [34] is the average of the absolute deviations:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|,$$

being less sensitive to outliers than RMSE.

The Accuracy within 10% (A10) [36] reports the proportion of predictions that fall within $\pm 10\%$ of the true values:

$$\mathrm{A10} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left( \left| \frac{\hat{y}_i - y_i}{y_i} \right| \leq 0.10 \right),$$

where $\mathbb{1}(\cdot)$ is the indicator function that equals 1 if the condition is satisfied and 0 otherwise.

Finally, the Pearson Correlation Coefficient ($R$) [37] evaluates the linear correlation between predictions and true values:

$$R = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2} \, \sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}},$$

where $\bar{y}$ and $\bar{\hat{y}}$ denote the means of the true and predicted values, respectively. Its value ranges from $-1$ (perfect negative correlation) to 1 (perfect positive correlation).
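For reproducibility, all six metrics can be computed directly with NumPy; the function below is a straightforward transcription of the definitions above (y and yhat are arrays of observed and predicted plant heights), not the authors' original code.

```python
import numpy as np

def evaluate(y, yhat):
    """Compute R2, RMSE, MAE, MAPE (%), A10, and Pearson R."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    resid = y - yhat
    r2 = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)
    rmse = np.sqrt(np.mean(resid**2))
    mae = np.mean(np.abs(resid))
    mape = 100.0 * np.mean(np.abs(resid / y))
    a10 = np.mean(np.abs((yhat - y) / y) <= 0.10)  # fraction within +/-10%
    r = np.corrcoef(y, yhat)[0, 1]                 # Pearson correlation
    return {"R2": r2, "RMSE": rmse, "MAE": mae, "MAPE": mape, "A10": a10, "R": r}
```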
2.4. Performance Index (PI)
To facilitate a comprehensive comparison of machine learning models, we employed a Performance Index (PI) that consolidates multiple evaluation metrics into a single score. The PI provides a balanced view of model accuracy and error minimization, avoiding biases toward any individual metric.
Let $M = \{m_1, \ldots, m_J\}$ be the set of evaluation metrics. In this study, the set considered was $M = \{R^2, \mathrm{RMSE}, \mathrm{MAE}, \mathrm{MAPE}, \mathrm{A10}, R\}$. Each metric is first normalized to the range $[0, 1]$:

$$\tilde{m}_{ij} = \frac{m_{ij} - \min_i m_{ij}}{\max_i m_{ij} - \min_i m_{ij}},$$

where $m_{ij}$ is the value of metric $j$ for model $i$.

The models were evaluated over 50 independent runs. To ensure a fair and stable comparison, the normalization is applied per run: in each of the 50 independent runs, the minimum and maximum bounds for metric $j$ are calculated strictly across the five evaluated models (EN, ET, GPR, KNN, XGB) within that specific run. This prevents instability by limiting the normalization scope to the models evaluated in each iteration. Consequently, the PI scores reported in the results represent the mean and standard deviation of the 50 individual PI calculations per model.

For error-based metrics, where lower values indicate better performance (RMSE, MAE, MAPE), the normalized scores are inverted so that higher values consistently represent better performance: $s_{ij} = 1 - \tilde{m}_{ij}$. For accuracy-based metrics, where higher values are better ($R$, $R^2$, A10), we retain the normalized form: $s_{ij} = \tilde{m}_{ij}$.

The PI of model $i$ is then computed as a weighted sum of the adjusted metrics:

$$\mathrm{PI}_i = \sum_{j=1}^{J} w_j \, s_{ij},$$

where the weights satisfy $\sum_{j=1}^{J} w_j = 1$. In this work, we used equal weights ($w_j = 1/J$) to ensure that all metrics contribute equally. Finally, models are ranked according to their PI scores in descending order. A computational sketch of the per-run procedure is given below.
2.5. Uncertainty Analysis
The stability and reliability of model predictions were assessed through an uncertainty analysis. This approach quantifies the dispersion of predictions, thereby complementing traditional measures [38].
Given the test set predictions $\hat{y}$ and observed values $y$, the residuals are defined as

$$e_i = y_i - \hat{y}_i.$$

From these residuals, we compute the mean error

$$\bar{e} = \frac{1}{n} \sum_{i=1}^{n} e_i,$$

and the standard deviation of the errors

$$\sigma_e = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (e_i - \bar{e})^2}.$$

To further explore prediction variability, we generate $N$ synthetic input samples $\tilde{x}_j$ drawn uniformly from the observed feature space:

$$\tilde{x}_j \sim \mathcal{U}(x_{\min}, x_{\max}), \qquad j = 1, \ldots, N.$$

The model produces predictions $\tilde{y}_j = f(\tilde{x}_j)$ for these synthetic inputs. First, we compute the median prediction

$$\tilde{y}_{\mathrm{med}} = \mathrm{median}(\tilde{y}_1, \ldots, \tilde{y}_N).$$

Then, the mean absolute deviation (MAD) from the median is given by

$$\mathrm{MAD} = \frac{1}{N} \sum_{j=1}^{N} \left| \tilde{y}_j - \tilde{y}_{\mathrm{med}} \right|.$$

Finally, the relative uncertainty score is defined as

$$\mathrm{Uncertainty} = \frac{\mathrm{MAD}}{\left| \tilde{y}_{\mathrm{med}} \right|}.$$

This formulation provides a normalized measure of prediction dispersion, enabling a comparative assessment of model stability. Models with lower uncertainty values are considered more reliable, as they combine both accuracy and robustness.
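Under one plausible reading of these definitions, the score can be computed as below (a sketch assuming a fitted scikit-learn-style model with a predict method; the uniform sampling bounds are taken per feature from the observed data).

```python
import numpy as np

def uncertainty_score(model, X, n_samples=1000, seed=42):
    """Relative dispersion of model predictions over synthetic inputs."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    lo, hi = X.min(axis=0), X.max(axis=0)            # observed feature ranges
    X_syn = rng.uniform(lo, hi, size=(n_samples, X.shape[1]))
    preds = model.predict(X_syn)
    med = np.median(preds)
    mad = np.mean(np.abs(preds - med))               # MAD from the median
    return mad / abs(med)                            # relative uncertainty
```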
3. Results
Table 2 presents the performance metrics of the five machine learning models across 50 independent runs, summarizing the models' results using key evaluation metrics, including the coefficient of determination R², RMSE, MAE, MAPE, and A10.
The ET model performs the best overall, achieving the highest R² value of 0.426, indicating a moderate positive correlation between observed and predicted values. Additionally, ET achieves the lowest RMSE of 6.859 cm, demonstrating its superior ability to minimize prediction errors. Nevertheless, this figure means that the models account for less than half of the variation in soybean plant height. This moderate explanatory power can be attributed to several reasons. First, parameters such as plant height vary significantly among cultivars and field conditions, indicating that genotype–environment interactions naturally impact soybean growth. Furthermore, the model's capacity to identify underlying patterns may be further compromised by the variability and errors that manual field observations may introduce. These factors suggest that although the proposed approach provides insightful information, model generalizability and predictive performance could be enhanced by incorporating a broader range of environmental and genetic indicators and expanding the dataset across multiple locations and seasons.
The XGB model also performs well, with an R² of 0.379 and an RMSE of 7.114 cm. While it is not as accurate as ET, XGB still offers strong predictive performance, as evidenced by its relatively low RMSE and high A10 score of 0.677. The EN model produces a relatively low R² of 0.171, reflecting its weaker predictive power compared to ET and XGB, and an RMSE of 8.082 cm, higher than that of both ensemble models.
Both the GPR and KNN models exhibit low performance across all metrics. GPR has an R² of 0.011 and an RMSE of 9.102 cm, while KNN shows a similarly low R² of 0.009 and an RMSE of 9.109 cm. The ET model stands out for its superior predictive performance, followed by XGB. Both EN and KNN perform below expectations, and GPR shows the least effectiveness in terms of both predictive accuracy and error minimization. These results underline the importance of selecting suitable models that balance predictive accuracy and computational efficiency.
Figure 2 presents the relationship between observed and predicted values for the five machine learning models (EN, ET, GPR, KNN, and XGB). The ET and XGB models show the closest alignment with the ideal prediction line, indicating that they perform well in predicting plant height. Both models exhibit higher fit values, around 0.61, which suggests a moderate relationship between the predicted and observed data. These findings support the importance of using robust machine learning models in agriculture, as outlined in the introduction, where the accurate prediction of plant traits is crucial for optimizing cultivar selection and improving yield estimations.
In contrast, the GPR and KNN models perform less effectively, with low fits to the observed data. This indicates that while these models are simpler and may be faster to compute, they are less reliable for predicting soybean plant height in this study. This highlights the challenge mentioned in the introduction regarding the complexity of predicting agronomic traits, such as plant height, using less advanced models.
3.1. Integrated Performance and Model Robustness
Table 3 presents the mean PI scores across model runs, along with their standard deviations. The ET model ranks first with the highest mean PI score of 0.776, accompanied by a standard deviation of 0.124.
The XGB model follows closely in second place with a mean PI score of 0.738 and a higher standard deviation of 0.161. While XGB demonstrates strong performance, its higher variability suggests that it may be more sensitive to changes in the data or model parameters, which can affect its consistency.
EN ranks third, with a mean PI score of 0.526 and a standard deviation of 0.090. Though its PI score is lower than that of ET and XGB, it still provides a reasonably stable performance. The KNN and GPR models show comparatively weaker performance, with KNN scoring 0.362 (0.070) and GPR scoring 0.347 (0.081). These models have lower mean PI scores and higher variability, suggesting that they are less reliable in this context.
Table 4 presents performance metrics showing that ET provides the most reliable predictions, while XGB offers competitive performance. EN, KNN, and GPR, however, are less suited for this task due to their higher uncertainty and error rates. The ET model demonstrates the best performance with the lowest MAD (3.492 cm), uncertainty (5.073 cm), and RMSE (6.859 cm), indicating both high accuracy and low variability.
XGB follows with a slightly higher MAD (4.141 cm), uncertainty (6.267 cm), and RMSE (7.114 cm), suggesting it performs well but with more variability than ET. Elastic Net (EN) shows weaker performance with a MAD of 6.913 cm, higher uncertainty (10.312 cm), and RMSE (8.082 cm), indicating it is less effective in capturing the data’s complexities.
KNN and GPR exhibit higher MAD and RMSE, with substantial uncertainty, reflecting their limited capability in this context. These results are consistent with the study’s methodology, which evaluates the ability of machine learning models to predict soybean plant height based on agronomic traits.
Figure 3 presents a comparative uncertainty analysis for the five machine learning models evaluated in this study, illustrating the balance between prediction accuracy and uncertainty. ET is positioned near the lower left of the chart, with an RMSE of approximately 7 cm and moderate uncertainty around 5 cm; this indicates that ET delivers the most accurate predictions among the models considered. XGB lies slightly to the right and slightly higher, with RMSE near 7.2 cm and uncertainty around 6 cm, demonstrating a modest increase in both error and uncertainty relative to ET. EN appears at the far right, with uncertainty of approximately 10 cm and an RMSE of around 8.5 cm, suggesting that the linear model is unstable and less precise. KNN and GPR occupy the upper left quadrant, showing RMSE values of approximately 9 cm, but with lower uncertainties of roughly 3 cm and 2 cm, respectively. This indicates that these models consistently underperform in terms of prediction accuracy, despite relatively narrow uncertainty intervals.
The variation in feature importance rankings among models reflects their distinct learning mechanisms and the underlying agronomic structure of the data. While the number of stems (NS) was unequivocally the dominant predictor across all models, the insertion of the first pod (IFP) consistently emerged as a key secondary feature, particularly in the top-performing ensemble methods. This aligns with agronomic intuition, as IFP represents a fundamental component of vertical plant architecture that is structurally linked to overall plant stature. The strong performance of these structural traits (NS and IFP) underscores that plant architecture characteristics are primary drivers of height prediction in soybean. The prominence of grain yield (GY) in some ensemble models can be explained by its nature as a composite trait that integrates the cumulative effects of multiple yield determinants [39,40]. This analysis clarifies that the most robust and biologically meaningful features for predicting plant height are NS and IFP, providing reliable targets for future phenotyping efforts in soybean breeding programs.
3.2. Identification of Key Agronomic Predictors (Feature Importance)
Figure 4 presents the feature importance rankings derived from the five machine learning models. This analysis offers crucial insights into the relative impact of each trait on PH prediction, providing a foundation for understanding the agronomic factors that influence plant architecture. The figure, which illustrates importance scores for features such as the number of stems (NS), grain yield (GY), and others, highlights consistent patterns across models, alongside some variations that merit further examination.
Across all models, NS stands out as the most influential feature, consistently ranking highest or near the top. This observation aligns with botanical principles, where the number of stems reflects a plant's structural complexity and resource allocation strategy, directly affecting height through competitive growth within the canopy [4]. The stability of NS as a predictor, evident in its prominence across both linear (EN) and nonlinear (ET, XGB) models, suggests a strong, potentially causal relationship with PH. This consistency enhances the reliability of NS as a key metric for future field assessments, particularly for breeding programs aiming to improve shoot architecture.
Following NS, GY emerges as a significant contributor, though its importance varies by model. For instance, XGB and ET assign higher weights to GY, reflecting its role as an integrated measure of plant productivity that may correlate with taller structures due to increased biomass allocation. The variability in rankings, such as lower importance in GPR, may result from model-specific sensitivities to feature correlations or the limited dataset size, which could restrict GPR’s ability to capture these relationships effectively.
Notably, traits such as the number of legumes per plant (NLP) and number of grains per plant (NGP) exhibit lower importance scores across most models, suggesting they contribute less directly to PH variation. This observation contrasts with the expectation that reproductive traits might influence height through resource competition, indicating that structural traits, such as NS, dominate in this context. The discrepancy could reflect the specific conditions of the dataset, such as the 2022–2023 harvest at a single Brazilian site, where environmental factors may have prioritized stem development over reproductive output.
The differences in feature importance rankings among models reflect their distinct learning mechanisms. ET and XGB, which rely on ensemble techniques and feature subsampling, emphasize NS and GY, likely due to their ability to detect nonlinear interactions [27,30]. In contrast, EN's linear framework provides more balanced weights, while GPR's low importance scores across all features may indicate overfitting or an unsuitable kernel choice for this dataset [28]. These model-specific interpretations indicate that ensemble methods are more suitable for capturing the complex trait interactions in soybean PH prediction.
From a practical perspective, Figure 4 offers valuable guidance for agricultural applications. Prioritizing NS measurement in field trials could simplify data collection efforts, reducing costs while maintaining predictive accuracy. Overall, Figure 4 provides a foundation for linking agronomic traits to PH, supporting data-driven decisions in cultivar selection and breeding. However, further research is necessary to address its limitations.
4. Discussion
This study demonstrates the effectiveness of ensemble ML models, particularly ET, in predicting soybean PH from ground-based agronomic traits. Indeed, the superior performance of ET, as evidenced by its highest mean R² and lowest RMSE, highlights the nonlinear and complex nature of the relationships between agronomic traits and plant architecture. Moreover, the model's ability to capture approximately 43% of the variance in PH, despite the inherent field variability, underscores the potential of using easily measurable agronomic traits for phenotypic forecasting.
The ensemble structure and nonparametric topology of the ET model are responsible for its better performance. In order to reduce variance and capture intricate nonlinear dependencies between agronomic traits, ET builds multiple fully randomized decision trees and aggregates their outputs, in contrast to linear approaches such as EN, which rely on fixed coefficients and assume additive relationships among predictors. To increase model variety and reduce overfitting, each tree is trained using the complete dataset, but with random feature and threshold selection.
On the other hand, GPR, which models smooth functional relationships using a kernel-based probabilistic structure, may perform poorly when the data show discontinuities or heterogeneous feature effects, as in agronomic datasets. The KNN approach is susceptible to feature scaling and data sparsity, since it relies on local distance calculations and lacks an explicit internal structure. Meanwhile, XGB utilizes additive learning in conjunction with sequential tree boosting, which enables high predictive capacity but requires precise hyperparameter adjustment to prevent overfitting in small datasets. ET consistently outperforms the other models across evaluation metrics because it offers a better bias–variance trade-off and greater stability by averaging uncorrelated randomized trees.
These architectural variations support the use of ensemble tree models, such as ET, for soybean plant height prediction by showing that they are especially well-suited for agronomic applications where nonlinear interactions and feature heterogeneity predominate.
The capacity of the ET model to capture the intricate and nonlinear interactions between structural and yield-related factors that collectively influence soybean plant height explains its superior performance from an agronomic standpoint. The physiological principles of competition for light, nutrient distribution, and biomass allocation within the canopy are reflected in traits such as the number of stems (NS) and grain yield (GY), which exhibit nonlinear responses to genetic and environmental variation. Without making any assumptions about the smoothness or linearity of the data, the ET model successfully captures these nonlinear dependencies and interactions by building a diversified ensemble of fully randomized decision trees. Due to its adaptability, it can effectively manage the heterogeneity present in multi-cultivar agronomic datasets, where phenotypic expression is significantly influenced by genotype–environment interactions.
The GPR, on the other hand, is predicated on a kernel-based probabilistic framework and presumes smooth, continuous functional relationships. These assumptions may not hold for field-based agronomic data, which often exhibit discontinuities caused by local soil variability, environmental stress, or cultivar-specific growth patterns. Additionally, the GPR model may have over-smoothed the response and underfitted, given the small dataset size and heterogeneous feature scales. As a result, GPR's predictive accuracy was worse than that of the ET model, since it was unable to detect sudden, nonlinear changes in plant height.
These results show that the ET model is superior not just statistically but also agronomically, owing to its capacity to capture the biological complexity and diversity of soybean development under field conditions. The model's ability to generalize across genotypes and harvests thus accounts for its resilience, providing a dependable and interpretable means of incorporating data-driven insights into breeding and crop management plans.
When compared to prior studies on crop trait prediction, the current findings align with trends in ML applications for agriculture, but highlight specific challenges with ground-based agronomic data. Incidentally, the effectiveness of our BO framework is consistent with its proven utility in other domains, such as structural health monitoring, where a BO-LSTM network achieved superior accuracy in identifying defects in concrete-filled steel tubes [41].
The results of this investigation support and expand upon the findings of recent studies that used machine learning to predict soybean and crop traits. For instance, Sarkar et al. [19] estimated the composition of soybean seeds from crop photos using machine learning models, and their R² values ranged from 0.40 to 0.65, which is similar to the ET model's moderate explanatory power (R² = 0.43). Reiterating that performance metrics in this range are typical for agronomic predictions under field variability, Fu et al. [14] used deep learning with multi-source remote sensing data to forecast soybean production and reported RMSE values between 6 and 9 cm.
In a related investigation, Teodoro et al. [42] used deep learning to estimate plant height directly, obtaining an MAE of 8.32 cm and an RMSE of 10.51 cm. Even with the use of high-dimensional multispectral inputs, these errors are noticeably larger than those produced by our ET model (MAE = 5.36 cm, RMSE = 6.86 cm), indicating that the suggested ground-based framework may produce competitive accuracy with significantly less complicated data collection.
The current study supports the findings of Rahaman et al. [13] and Yuan et al. [11], who highlighted the superiority of ensemble models such as Random Forests and Gradient Boosting over linear regressors in managing heterogeneous agronomic data. However, whereas most of that research relied heavily on UAV or remote sensing data, our results show that comparable accuracy can be attained using only field-measured agronomic features.
As a result, our findings support the use of ensemble tree models on ground-based data as an affordable and interpretable alternative for phenotypic prediction in soybeans, while also being consistent with evidence from previous agricultural ML studies. Indeed, when compared to prior studies on crop trait prediction, the current findings align with trends in ML applications for agriculture but highlight specific challenges with ground-based agronomic data. For instance, recent work by Sarkar et al. [19] achieved higher R² values (0.65–0.89) for seed composition prediction using image-based features, but was limited to controlled laboratory conditions. In contrast, our field-based approach addresses real-world variability across multiple harvests, albeit with moderate explanatory power. Similarly, de Souza et al. [18] reported superior accuracy using UAV multispectral data; however, their methodology requires specialized equipment and focuses on indirect spectral indices rather than direct relationships with agronomic traits. Our work demonstrates that competitive accuracy can be achieved using readily measurable field traits, providing a cost-effective alternative particularly valuable for resource-constrained agricultural systems. Furthermore, unlike Delfani et al. [22], who noted limited interpretability in complex ML models, our framework maintains explainability through feature importance analysis while achieving robust performance.
A critical finding of this study is the consistent identification of the number of stems (NS) and the insertion of the first pod (IFP) as the most influential features for predicting plant height. This provides a novel mechanistic insight into the determinants of soybean plant architecture, suggesting that structural complexity (NS) and pod insertion height (IFP) are key drivers of height. Agronomically, this implies that breeding programs aiming to optimize plant height can focus on these traits, potentially simplifying selection criteria.
It is crucial to recognize that potential multicollinearity among agricultural traits may affect how feature importance is interpreted, especially in linear models. In this research, the use of both regularized and ensemble learning methods aimed to lessen these effects. The EN model integrates L1 and L2 regularization terms, which penalize correlated predictors and stabilize coefficient estimates, ultimately reducing bias linked to multicollinearity. Additionally, ensemble tree-based approaches, such as ET and XGB, are naturally less affected by feature correlation, as their hierarchical and randomized feature selection processes distribute importance across various decision paths. Consequently, the feature importance analysis highlighted findings from ET and XGB, which provide robust impurity-based importance measures even under correlated inputs. The consistent recognition of the number of stems (NS) and insertion of the first pod (IFP) as the most significant features across different modeling approaches underscores their agronomic significance and biological validity. However, future research could enhance this analysis with formal collinearity diagnostics, such as variance inflation factor evaluation or feature clustering, to further confirm the independence of essential predictors.
The uncertainty analysis further reinforces the reliability of the ET model, with the lowest mean absolute deviation and uncertainty. This low uncertainty is crucial for practical applications, such as cultivar selection, where confidence in predictions is necessary for decision-making. The PI rankings confirm the overall superiority of ET, integrating multiple metrics to provide a comprehensive assessment.
Despite the described strengths, several limitations must be considered. In particular, the dataset, comprising 320 samples from 40 cultivars over two harvests at a single location in Brazil, may limit generalizability to diverse climates or soil types [22]. In addition, the moderate R² suggests unmodeled variability, possibly due to unmeasured factors such as soil nutrients or pests. Such a performance level, while moderate, is comparable to or exceeds that of several recent studies using more complex data sources, such as Rahaman et al. [13], who reported R² values of 0.35–0.45 for yield prediction using satellite imagery, and Yuan et al. [11], who achieved comparable accuracy for maturity date prediction using multispectral UAV data. Last, but not least, the Bayesian optimization process, although efficient, explored a predefined hyperparameter space; exploring broader ranges or employing alternative acquisition functions could yield further performance improvements.
In conclusion, this study establishes a robust, explainable ML framework for predicting soybean plant height using ground-based agronomic traits. The novel contribution lies in demonstrating that ensemble models, such as ET, can effectively capture the nonlinear relationships in agronomic data, providing a cost-effective and interpretable alternative to remote sensing. The identification of NS and IFP as key traits offers new insights for breeding programs. Future research should focus on expanding the dataset across diverse environments and integrating these models into real-time farm management systems to enhance crop productivity and food security.
5. Conclusions
This research established a robust, explainable machine learning framework for predicting soybean PH using solely ground-based agronomic traits, offering a low-cost and competitive alternative to methods relying on complex remote sensing data.
The Extra Trees (ET) ensemble model emerged as the superior predictor, demonstrating high accuracy and stability (R² of 0.426, RMSE of 6.859 cm, and uncertainty of 5.073%). The model's efficacy is attributed to its nonparametric topology and ensemble structure, which effectively manage the nonlinear dependencies and feature heterogeneity common in complex field data, surpassing the performance of linear (EN) and probabilistic (GPR) models.
A critical finding is the consistent identification of the number of stems (NS) and insertion of the first pod (IFP) as the most influential features for PH prediction. This provides actionable insight for soybean breeding programs, suggesting that prioritizing the measurement of NS can simplify data collection and maintain high predictive utility for optimizing shoot architecture and yield. Furthermore, the achieved error rates (MAE = 5.36 cm, RMSE = 6.86 cm) are competitive with deep learning methods utilizing high-dimensional multispectral data (MAE = 8.32 cm, RMSE = 10.51 cm), validating the precision of this ground-based approach.
The developed framework, which integrates BO, PI, and uncertainty analysis, presents a reliable and transparent decision-support tool for cultivar selection and phenotypic forecasting that addresses key limitations of previous approaches, including limited interpretability, dataset constraints, and a lack of uncertainty quantification. Future work should focus on expanding the dataset to encompass diverse geographical locations and seasons, thereby enhancing generalizability. Additionally, integrating these ground-based predictions with aerial data or implementing the models in real-time farm management systems will be essential steps toward maximizing food security and optimizing crop management strategies in soybean production.