Delineating Soybean Mega-Environments Across State Lines: A Statistical Learning Approach to Multi-State Official Variety Trial Analysis

Mirahki, Isaac; Bond, Richard; Heiniger, Ryan; Moseley, David; Sykes, Virginia R.

doi:10.3390/agronomy16030376

Open AccessArticle

Delineating Soybean Mega-Environments Across State Lines: A Statistical Learning Approach to Multi-State Official Variety Trial Analysis

by

Isaac Mirahki

¹

,

Richard Bond

²,

Ryan Heiniger

³,

David Moseley

⁴ and

Virginia R. Sykes

^1,*

¹

Department of Plant Sciences, University of Tennessee, 308 Agriculture and Natural Resources Building, Knoxville, TN 37996, USA

²

Department of Crop, Soil and Environmental Sciences, University of Arkansas, 115 Plant Sciences Building, 495 N Campus Walk, Fayetteville, AR 72701, USA

³

Department of Crop and Soil Sciences, North Carolina State University, Williams Hall, 101 Derieux Pl, Raleigh, NC 27695, USA

⁴

Dean Lee Ag Center, Louisiana State University, 8105 Tom Bowman Dr, Alexandria, LA 71302, USA

^*

Author to whom correspondence should be addressed.

Agronomy 2026, 16(3), 376; https://doi.org/10.3390/agronomy16030376

Submission received: 16 December 2025 / Revised: 30 January 2026 / Accepted: 31 January 2026 / Published: 4 February 2026

(This article belongs to the Special Issue Advanced Machine Learning in Agriculture—2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

The current state-centric analysis of Official Variety Trials (OVTs) restricts the identification of stable performance zones across political boundaries. This study employed multivariate statistical learning techniques to delineate soybean (Glycine max L.) “mega-environments” using yield data from 2269 varieties collected across seven U.S. states (2019–2022). Utilizing Quadratic Discriminant Analysis (QDA), Principal Component Analysis (PCA), and Agglomerative Hierarchical Clustering (AHC), we examined the edaphoclimatic factors influencing yield stability. QDA classified over 79% of environments into distinct temporal categories, highlighting significant inter-annual climatic variability driven by Growing Degree Days (GDD) and latitude. PCA distinguished broad climatic drivers (PC1) from localized soil texture constraints (PC2). AHC identified optimal production clusters that frequently diverged from geographic proximity, indicating that distant sites often share more critical yield-determining factors than neighboring counties. By operationalizing these latent environmental patterns, this study provides a data-driven framework for cross-state environmental zoning that can support more precise variety placement once genotype performance has been evaluated within these zones.

Keywords:

soybean yield; statistical learning algorithms; Quadratic Discriminant Analysis; Principal Component Analysis; Agglomerative Hierarchical Clustering; Official Variety Trials; sustainable agriculture

1. Introduction

Soybean (Glycine max L.), a descendant of G. soja Sieb. and Zucc., has a cultivation history of over 5000 years [1] and has been a staple of U.S. agriculture since Official Variety Testing (OVT) became standard practice in the 1940s [2,3,4]. Characteristics like high protein content and benefits for soil health have made soybean a cornerstone of sustainable crop rotation [5], a role underscored by projections of an 8.7% increase in global production to 425.4 million tons [6,7]. While biotic stressors like weeds and diseases can cause significant yield losses of up to 50% and 30%, respectively [8,9,10], environmental factors remain the primary drivers of productivity. Temperature and precipitation alone control 20–40% of yield variability [11], with heat stress potentially causing 17% yield loss [2] and water scarcity during the reproductive stage causing up to 30% loss [12]. Furthermore, soil texture [13] and geographic factors like longitude, latitude, and altitude critically influence growth characteristics [14,15].

Optimizing soybean productivity therefore requires selecting varieties tailored to these regional edaphoclimatic conditions [16], a strategy that enhances land use efficiency and economic sustainability [17,18]. State variety testing data remain vital resources for this decision-making [19]. However, extracting actionable insights from these data requires novel analytical techniques to uncover hidden relationships between influential variables. Machine learning algorithms (MLA), which learn patterns to explain results and predict yield without explicit programming [20,21,22], have shown notable success in classifying crop performance across environments [22,23]. The adoption of statistical learning (UL-MLA)—including Discriminant Analysis (DA), Principal Component Analysis (PCA), and Agglomerative Hierarchical Clustering (AHC)—promises to uncover latent patterns in OVT datasets to promote sustainable farming practices [20,24,25].

Specific ML techniques offer distinct advantages for analyzing multi-environment trials. DA differentiates classes based on predictors [26,27], predicting group membership for observations [28] and facilitating the classification of varieties into performance groups [29,30]. Quadratic DA (QDA) is particularly useful as it allows for unequal covariance matrices between classes [31]. PCA reduces dataset dimensionality while preserving variance [32,33,34], identifying the most effective variables in large, highly correlated datasets [35,36,37,38]. Finally, AHC reveals hierarchical relationships [39,40] and has been successfully used to identify heterotic groups and patterns in complex datasets [41].

While methods such as Quadratic Discriminant Analysis (QDA), Principal Component Analysis (PCA), and Agglomerative Hierarchical Clustering (AHC) have deep roots in classical statistics, they are widely recognized as foundational algorithms within the field of machine learning. Following standard classifications in the field [42,43], we employ these techniques as supervised (QDA) and unsupervised (PCA, AHC) learning algorithms to extract latent patterns from high-dimensional ecological data. These methods allow for the unbiased discovery of environmental structures without the constraints of rigid, pre-defined political boundaries. The objective of this study was to apply these advanced analytical techniques to define data-driven recommendation domains for soybean, moving beyond simple geographic proximity. By providing verified, location-specific variety selection information [44,45], this study aims to integrate ML insights into dynamic models that enhance the adaptability and sustainability of soybean production [46]. Because the goal of this study is to delineate structural, cross-state mega-environments, we focus on seasonal thermal and moisture indices rather than event-scale extremes, and we do not explicitly model short-term heat or drought episodes. Consequently, the resulting zones are intended to describe long-term edaphoclimatic potential, not to quantify intra-seasonal production risk. In addition, due to incomplete metadata across the OVT network, management practices (e.g., planting date, irrigation, and input intensity) are not included, and the mega-environments should be interpreted as environmental domains rather than full G × E × M production systems.

2. Materials and Methods

Seeds with a relative maturity from 3 to 3.9, from 4 to 4.5, from 4.6 to 4.9, from 5 to 5.5, and from 5.6 to 5.9, respectively, are considered maturity group (MG) 3, 4 early (4E), 4 late (4L), 5 early (5E), and 5 late (5L). Varieties were stratified and analyzed within their specific maturity groups rather than pooled across the experiment. This stratification was necessary because soybean is a photoperiod-sensitive short-day plant; grouping disparate MGs (e.g., MG 3 vs. MG 5) would introduce significant confounding genotypic variance related to flowering time and maturity dates that would obscure the edaphoclimatic signals targeted by this study [47].

In the unsupervised statistical learning analyses (PCA and AHC), specific ‘Location-Years’ served as the observational units. ‘Location’ was utilized as the primary identifier because it functions as an aggregate proxy for the specific combination of edaphoclimatic and geographic variables (e.g., GDD, precipitation, soil texture) experienced by the crop at that site. Therefore, clustering or discriminating by ‘Location’ effectively groups sites based on their integrated environmental profiles rather than political boundaries.

2.1. Data Description

The time series yields data obtained from the official variety test (OVT) conducted in 2019 (Y₁), 2020 (Y₂), 2021 (Y₃), and 2022 (Y₄) [Y = 4] in 60 different locations across the seven mid-southeastern states of the U.S., including Arkansas (AR) [48,49,50,51], Kentucky (KY) [52,53,54,55], Louisiana (LA) [56,57,58,59], Missouri (MO) [60,61,62,63], North Carolina (NC) [64,65,66,67], Tennessee (TN) [68,69,70,71], and Virginia (VA) [72,73,74,75]. The current results are from locations between 30.12 and 40.37° N and between −75.73 and −95.43° W, with an average temperature from 10.56 to 20.85° C, minimum temperature from 4.3 to 15.89° C, maximum temperature from 16.02 to 26.41° C, precipitation from 689.36 to 2173.2 mm, and altitude from 2 to 450 masl; the soil of the experimental sites had a texture with from 5 to 70% sand, from 20 to 65% silt, and from 10 to 50% clay. The spatial distribution of the 60 OVT sites across Arkansas, Kentucky, Louisiana, Missouri, North Carolina, Tennessee, and Virginia is shown in Figure 1.

2.2. Data Entry Criteria

The criteria for data analysis consisted of two steps:

The overall experiment: All the reported data from all locations were analyzed.

Locations were included only if they reported data for all four years (2019–2022).

2.3. Climate Data

The long-term environmental data, including minimum, maximum, average temperature (°C), and accumulated precipitation (mm), were obtained from the National Centers for Environmental Information (NCEI) [76]. As specific planting dates varied by site-year, the analysis utilized the standard frost-free growing season for each location to calculate accumulated climatic indices. Consequently, the effect of specific planting timing is captured within the ‘Location-Year’ variance. While soybean development is driven by photothermal accumulation, Growing Degree Days (GDD) were utilized as the primary thermal metric rather than Photothermal Units (PTU) due to the lack of variety-specific genetic coefficients (e.g., critical photoperiod thresholds) for the 2269 commercial genotypes evaluated. To account for the photoperiodic influence on yield and development, the analysis was: (1) stratified by maturity group, effectively grouping varieties with similar photoperiod sensitivities; and (2) incorporated latitude as a direct independent variable in the machine learning models to serve as a continuous proxy for the photoperiodic environment. The calculation of environmental data, as well as the accumulated degree days (GDD), was as follows:

{\bar{T}}_{M a x .} = \frac{Σ_{i = 1}^{n} T_{M a x, i}}{n}

(1)

{\bar{T}}_{M i n .} = \frac{Σ_{i = 1}^{n} T_{M i n, i}}{n}

(2)

{\bar{T}}_{A v g .} = \frac{Σ_{i = 1}^{n} T_{A v g, i}}{n}

(3)

P r e c i p . = \sum_{i = 1}^{n} P_{i}

(4)

G D D = (\frac{T_{M a x .} + T_{M i n .}}{2}) - T_{B}

(5)

where T¯_Max. is the maximum temperature during the growth season, T_Max. is the maximum temperature on the i-th day, n is the total number of days of the growth season, T¯_Min. is the minimum temperature during the growth season, T_Min. is the minimum temperature on the i-th day, T¯_Avg. is the mean temperature during the growth season, T_Avg. is the mean temperature on the i-th day, Precip. is the accumulated precipitation during the growth season, P_i is the amount of precipitation recorded on the i-th day, GDD is the accumulated growth degree days, and T_B is the base temperature of soybean (10 °C).

2.4. Soil and Geographical Data

Soil texture composition, including the percentage of sand, silt, and clay, was obtained from the United States Department of Agriculture (USDA) Web Soil Survey (WSS) [77,78]. Geographic coordinates (latitude and longitude) and altitude for each trial site were derived from the United States Geological Survey (USGS) National Map data, corresponding to the specific physical location of the agricultural experiment stations or the centroid of the reported county [79].

2.5. Data Analysis

2.5.1. Discriminant Analysis (DA)

Discriminant Analysis (DA) was utilized to classify and predict the allocation of observations during the experiment. Based on the nature of the experiment, the Quadratic Discriminant Analysis (QDA), which assumes inequality in class covariance matrices [26,80,81], was chosen for DA of the yield in different years as affected by climatic, geographic, and soil characteristics. The two-box test (Chi-square (X²) and Fisher’s F (F) test [81]) were used to assess the null hypothesis. Also, Wilk’s Lambda test evaluated the equality of the vectors of the means for various groups.

The general formula for QDA is as follows:

\begin{matrix} (Q D A) : g_{k} (x) = - \frac{1}{2} \log |Σ_{k}| - \frac{1}{2} {(x - μ_{k})}^{T} Σ_{k}^{- 1} (x - μ_{k}) \\ + l o g P_{k} \end{matrix}

(6)

where g_k(x) is the discriminant score for class k, Σ_k is the covariance matrix (separately calculated for each class) for class k (Equation (7)) and |Σ_k| is its determinant, x is the feature vector of the observation, μ_k is the mean vector of class k (centroid) (Equation (8)), T is the transpose operator, Σ_k⁻¹ is the inverse of the covariance matrix for class k (calculates Mahalanobis distances and accounts for correlations between features) (Equation (9)), log P_k is the prior probability of class k (Equation (10)).

Σ_{k} = \frac{1}{N_{k}} \sum_{i \in c l a s s k} (x_{i} - μ_{k}) {(x_{i} - μ_{k})}^{T}

(7)

μ_{k} = \frac{1}{N_{k}} \sum_{i \in c l a s s k} x_{i}

(8)

where N_k is the number of samples in class k, and x_i is the i-th data point in class k.

|Σ_{k}^{- 1}| = {(x - μ_{k})}^{T} Σ_{k}^{- 1} (x - μ_{k})

(9)

P_{k} = \frac{N_{k}}{N}

(10)

where N is the total number of samples.

To classify the sample x, g_k(x) for all classes was computed, and observation x was assigned to the class with the highest discriminant score. The formula is expressed as:

C l a s s (x) = {a r g m a x}_{k}, g_{k} (x)

(11)

Statistical test

Fisher’s F Test

Fisher’s F test was used at the preprocessing stage to evaluate the equality of the means (H0) between different classes. It is expressed as:

F = \frac{\frac{S_{B}}{(k - 1)}}{\frac{S_{W}}{(N - k)}}

(12)

where S_B is the between-class sum of squares (Equation (13)), S_W is the within-class sum of squares, k is the number of classes (Equation (14)), and N is the total number of observations across all groups.

S_{B} = \sum_{i - 1}^{K} N_{i} (μ_{i} - μ) {(μ_{i} - μ)}^{T}

(13)

where N_i is the number of observations in class i, μ_i is the mean of group i, μ is the overall mean of all classes, and T is the transpose operator.

S_{W} = \sum_{i = 1}^{k} \sum_{j = 1}^{N_{i}} {(x_{i j} - μ_{i}) (x_{i j} - μ_{i})}^{T}

(14)

where x_ij is the individual observation in class i, and μ_i is the mean of group i.

The DF of SB is expressed as:

{D F}_{S B} = k - 1

(15)

The DF of SW is expressed as:

{D F}_{S W} = N - K

(16)

The Chi-Square

The Chi-square test evaluated the relationship between the categorical variables (e.g., climatic, geographic, etc.) to determine the model fit. This test is expressed as:

X^{2} = Σ \frac{{(O_{i j} - E_{i j})}^{2}}{E_{i j}}

(17)

where O_ij is the observed frequency representing the actual count of occurrences in category ij, and E_ij is the expected frequency for category ij calculated based on the null hypothesis, and calculated as:

E_{i j} = \frac{R o w T o t a l \times C o l u m n T o t a l}{G r a n d T o t a l}

(18)

The degree of freedom (DF) in Chi-square is calculated as:

{D F}_{x^{2}} = (r - 1) (c - 1)

(19)

where DF_x² is the degrees of freedom of chi-square (for tests involving categorical variables, DF is commonly used), r is the number of rows, and c is the number of columns in the contingency table.

The Wilk’s Lambda (Λ) Test

The Wilk’s Lambda (Λ) test assessed the discriminative ability of the variables by comparing the variance within groups to the variance between groups and is calculated through:

Λ = \frac{|S_{W}|}{|S_{W} + S_{B}|}

(20)

where |S_W| is the determinant of the within-class scatter matrix, and |S_W + S_B| (S_B & S_W Equations (13) and (14)) is the determinant of the total scatter matrix.

The DFs of Λ are calculated through Equations (15) and (16).

In this study, QDA is used as a diagnostic tool to quantify how effectively the edaphoclimatic covariates separate distinct temporal regimes (years) within each maturity group, rather than as the primary procedure for defining mega-environments.

2.5.2. Principal Component Analysis (PCA)

Principal Component Analysis (PCA) was employed to reduce dimensionality while preserving most variances and simultaneously identifying the most effective variable in constructing each principal component: 14 edaphoclimatic variables (yield, GDD, temperatures, precipitation, latitude/longitude/altitude, soil texture %) used Location-Years as observations; standardized via z-score; Pearson correlation matrix; PC selection: scree-plot elbow retaining ≥95% cumulative variance (Section 2.6 for software). The PCA based on the Pearson correlation matrix was selected due to the multiscale nature of the dataset. The steps of PCA are as follows [33,34,35,36,38,82].

Data Standardization

Data standardization was performed to ensure that all the components contributed equally and to prevent the features with a larger range from dominating the analysis. It is expressed as:

Z = \frac{X - μ}{σ}

(21)

where Z is the standardized data matrix, X is the original data matrix (n × p, where n is the number of samples, p is the number of features), μ is the mean of each feature, and σ is the standard deviation of each feature.

The Pearson Correlation Matrix

The Pearson correlation matrix was computed to measure the linear relationship between variables while ensuring scale invariance. It was calculated through:

R = \frac{1}{n - 1} Z^{T} Z

(22)

where R is the correlation matrix, and Z^T is the transpose of the standardized data matrix. Each element R_ij in the matrix represents the Pearson correlation between the i and j features.

Eigenvalue and Eigenvector

The eigenvalue and eigenvector were computed to identify the principal components (PC) and to quantify the variance explained by each PC. Its formula is:

{R v}_{i} = λ_{i} v_{i}

(23)

where v_i is the corresponding eigenvector, and λ_i is the eigenvalue.

The Eigenvalues Were Sorted

The eigenvalues were sorted in descending order to identify the eigenvalues corresponding to the most significant PCs. The top k PCs were selected at the elbow point of the changes in the corresponding inertia of PCs (Graph 3).

The PC Score

The PC score was calculated by projecting the original data into the selected PCs (new space). The expression is as follows:

P C = Z V_{k}

(24)

where PC is the transformed dataset in the new k-dimensional space, and V_k is the matrix of the top k eigenvectors.

The Explained Variance

The explained variance was measured to quantify the variability explained by each PC to conclude the final number of PCs to retain most of the dataset’s variance. It is calculated through:

V E = \frac{λ_{j}}{\sum_{i = 1}^{p} λ_{i}}

(25)

where VE is the variance explained, λ_j is the eigenvalue of the j-th component, and

\sum_{i = 1}^{p} λ_{i}

is the total variance in the dataset.

Statistical tests

Bartlett’s Test of Sphericity

Bartlett’s test of sphericity was performed to evaluate the difference between the correlation matrix and the original matrix. This procedure aimed to explore the variables’ interrelation and suitability for PCA and find a significant p-value supporting conducting PCA. Its expression is as follows:

X^{2} = - (n - 1 - \frac{2 p + 5}{6}) \log |R|

(26)

where X² is the Chi-square, n is the sample size, p is the number of variables, and |R| is the determinant of the correlation matrix.

The Kaiser–Meyer–Olkin (KMO)

The Kaiser–Meyer–Olkin (KMO) was conducted to assess the sampling adequacy for PCA by comparing the magnitude of the observed correlation coefficient to the partial correlation coefficient. Results with overall values closer to 1 indicated the suitability of the dataset for PCA. The KMO formula is as follows:

K M O = \frac{\sum \sum_{i \neq j} r_{i j}^{2}}{\sum \sum_{i \neq j} r_{i j}^{2} + \sum \sum_{i \neq j} u_{i j}^{2}}

(27)

where r_ij is the correlation coefficient, and u_ij is the partial correlation coefficient.

2.5.3. Agglomerative Hierarchical Clustering (AHC)

Agglomerative Hierarchical Clustering (AHC) was conducted to explore the underlying structure of the data, uncover latent correlations between variables, and quantify their effect on the classification of the participating locations in OVT. AHC builds nested clusters by iteratively merging the smaller clusters into larger ones. The steps of AHC are as follows:

Initialization

Initialization started by considering each data point (Location) as a cluster. (If there is n data points, there will be n clusters.)

Distance Measurements

Distance measurements calculated the distance (dissimilarity) between the initial cluster. The principal metric used for this purpose was Euclidean distance, which is expressed as:

d (i, j) = \sqrt{\sum_{k = 1}^{p} (x_{i k} - {x_{j k})}^{2}}

(28)

where d is the distance, (i, j) are indices for two different data points, p is the number of features (dimensions), x_ik is the value of the k-th feature for data point i, and x_jk is the value of the k-th feature for data point j.

Ward’s Linkage

Ward’s linkage was used for merging clusters because of its ability to create well-balanced (even) and compacted clusters, moderate sensitivity to outliers, and minimize the within-cluster variance. Its mathematical formula is:

d (A, B) = \sqrt{\frac{|A| |B|}{|A| + |B|}} . d (A ¯, B ¯)

(29)

where |A| and |B| are the number of observations of cluster A and B, A¯ and B¯ are the centroids of cluster A and B, and d(A¯, B¯) is the Euclidean distance between centroid A and B.

Statistical tests

Silhouette Index

The Silhouette index assessed the clustering quality by measuring how well each data point fitted within its cluster compared to the data points in the neighboring cluster by combining intra-cluster and inter-cluster cohesion [83]. Its mathematical expression is as follows:

S (i) = \frac{b (i) - a (i)}{m a x (a (i), b (i))}

(30)

where a(i) is the average intra-cluster distance for point i, b(i) is the average distance between point i and points in the nearest cluster. A higher S(i) value (closer to 1) indicates a better cluster cohesion and separation (cluster validity), not classification accuracy.

Hartigan Index (H)

The Hartigan index (H) was utilized to quantify the improvement in clustering by measuring the reduction in within-cluster dispersion as the number of clusters increases from k − 1 to k [84]. The expression of this test is:

H (k) = \frac{Δ W}{W_{k}}

(31)

where ΔW is the change in the within-cluster dispersion (Equation (32)), k is the number of clusters, and W_k is the within-cluster sum of squared deviations for the k-th cluster (Equation (33)).

Δ W = W_{k - 1} - W_{k}

(32)

W_{k} = \sum_{i = 1}^{k} \sum_{x \in C_{i}} | x - {μ_{i} |}^{2}

(33)

where k is the number of clusters, x is the data point in cluster C_i, C_i is the i-th cluster, and μ_i is the centroid of cluster C_i.

Calinski–Harabasz Index (CH)

The Calinski–Harabasz Index (CH), also known as the variance ratio criterion, was used to evaluate the clustering quality by comparing the dispersion within clusters to the dispersion between clusters [85]. It is defined as:

C H (k) = \frac{T r (B_{k})}{T r (W_{k})} . \frac{n - k}{k - 1}

(34)

where k is the number of clusters, n is the number of data points, Tr(B_k) is the trace of the between-cluster scatter matrix (Equation (35)), and Tr(W_k) is the trace of the within-cluster scatter matrix (Equation (36)).

B_{k} = \sum_{i = 1}^{k} n_{i} (μ_{i} - μ) ({μ_{i} - μ)}^{T}

(35)

where n_i is the number of points in cluster i, μ_i is the centroid of cluster i, μ is the global mean of all data points, and T is the transpose operator.

W_{k} = \sum_{i = 1}^{k} \sum_{x \in C_{i}} (x - μ_{i}) ({x - μ_{i})}^{T}

(36)

where C_i is the i-th cluster, x is the datapoint in cluster C_i, and μ_i is the centroid of cluster C_i.

The H (k − 1) − H(k) Criterion

The H (k − 1) − H(k) criterion was used to identify diminishing returns in clustering quality as the number of clusters increases, to select the optimal cluster number [84]. This method uses the difference in the Hartigan index between consecutive cluster numbers (k and k − 1) to evaluate the improvement in clustering. A significant positive value of ΔH(k) indicates a significant improvement in clustering quality by increasing the number of clusters from k − 1 to k. It is defined as:

Δ H (k) = H (k - 1) - H (k)

(37)

where H(k) is the Hartigan index for k clusters, and H(k − 1) is the Hartigan index at the k − 1 clusters.

2.6. Software and Computational Tools

Statistical analysis and data visualization were performed using SAS 9.4 [86] Microsoft Excel (Microsoft Corp., Redmond, WA, USA), and the XLSTAT add-in [87], which were used for all calculations and figure generation.

3. Results

3.1. Quadratic Discriminant Analysis

Table DA’s two box tests (Chi-square (X2) and Fisher’s (F) distribution test) (Table 1) showed that the covariance matrices are unequal between different groups (years) in all MGs and the Exp., which rejects the null hypothesis (H0) and proves that the covariance matrices of the years are not equal. The Wilk’s Lambda test (Λ) also showed that the vector of the means of all the MGs and the Exp. significantly differed.

The statistical results of the Discriminant Analysis (Table 2) of the MGs and the Exp. showed that the prior probability (PP) in each class (year), which was directly affected by the sum of weights (SW) (frequency) of the cumulative measures, was relatively consistent. However, the logarithms of determinants (Log.D), in other words, the dispersion of the data in each class varied more noticeably. In MG3, MG4E, MG4L, MG5E, MG5L, and Exp., respectively, classes 2019, 2022, 2020, 2021, 2020, and 2020 had the lowest, and 2022, 2021, 2019, 2022, 2022, and 2022 had the highest Log.D. The high variability in 2022 (the highest Log.D in most MGs) and lower classification accuracy (MG5E/5L) likely stem from significant thermal shifts; 2022 was characterized by a drastic shift toward the negative side of F1 and F2, driven by extreme GDD and temperature fluctuations that diverged from the 2019–2021 patterns.

Table 3 shows that F1 was the dominant factor in separating the classes for all the MGs and the Exp. The F1 of the Exp. and MG3, respectively, had the highest and the lowest discriminative effect on classifying the observations. However, the condition for F2 was reversed, which shows that the MG3 varieties are most likely more sensitive to variables that affect the proportion of F2 discrimination. Also, Bartlett’s test for eigenvalue significance showed that both F1 and F2 effectively distinguished the observations in different classes in all MGs and the total experiment.

Based on the Mahalanobis distance test (Table 4) between all classes, in MG3, MG4E, and MG4L, the longest distance was between classes 2022 and 2020; in MG5E and Exp., the longest distance was between classes 2019 and 2020, and in MG5L, the class of 2020 had the longest distance from the class of 2022. The shortest distance between classes in MG3 was between 2019 and 2021; in MG4E and MG4L, it was between 2022 and 2019; in MG5E and MG5L, it was between 2021 and 2022; and in the Exp., it was between 2021 and 2022.

Figure 2 visualizes the correlation between variables with each function. In MG3, minimum temperature and precipitation had the highest positive correlation, respectively, with F1 and F2. In MG4E, MG4L, MG5E, MG5L, and the Exp., precipitation had the highest positive correlation with F1. The highest negative correlation with F1 in MG3, MG4E, MG4L, MG5E, MG5L, and the Exp. resulted from GDD; in the MG3 yield and percentage of sand in the soil texture, the MG4E and MG4L variety, the MG5E and MG5L yield, and Exp., longitude had the highest negative correlation with F2.

The top five variables by absolute standardized canonical coefficient (|X| ranking, Table 5) for F1 classification were MG3 (min temp, GDD, max temp, latitude, longitude); MG4E/MG4L/MG5E/MG5L/Exp (avg temp, GDD, latitude, longitude). For F2, the variables were MG3 (max temp, min temp, GDD, latitude, precipitation); others emphasized latitude and altitude. Note: Coefficient magnitudes depend on variable scaling and collinearity and should be interpreted relative to standardization within each MG (see Table 5 caption).

Influential variables were identified by ranking absolute standardized canonical discriminant coefficients (Table 5); note that coefficient magnitudes depend on variable scaling and collinearity and should be interpreted as relative discriminatory power within each model. The visualization of the classification score of the observations (Figure 3), their relative bootstrap ellipses, and the centroid of each class on the factor axis clarifies how effectively this method was able to distinguish between different classes for different maturity groups and Exp. It also illustrates that in all of the MGs and Exp., (a) the 2019, 2021, and 2022 were located relatively close together; (b) the class of 2019 was usually located at the negative side of the F1 and positive side of the F2; (c) however, in the next consecutive (following) class of 2020, the classification moved toward the positive side of F1 and relatively less movement toward the negative side of F2.;(d) the following classification (2021) moved to an in-between position and (e) was followed by a drastic shift toward the negative side of both F1 and F2.

Validation protocol: Confusion matrices (Table 6) report resubstitution (training, Tr) accuracy using all data and leave-one-out cross-validation (LOOCV, CV) accuracy. LOOCV folds were defined by individual Location-Year observations (internal validation, no independent holdout). Respectively, 95.77%, 83.34%, 97.1%, 82.99%, 91.54%, and 94.39% of observations in MG3, MG4E, MG4L, MG5E, MG5L, and Exp. achieved Tr accuracy, with CV accuracies of 92.67%, 79.16%, 95.98%, 79.22%, 90.89%, and 93.13%. The observations of class 2021 in MG4E, and 2022 in MG5E and MG5L were the only classes with <75% cross-validation accuracy below our internal threshold, which we adopted based on standard performance benchmarks for Discriminant Analysis in ecological data where year-to-year variability is the primary grouping factor. However, the ratio of well-classified observations to the total observations was above the acceptable range. These results confirmed both the eligibility of data for use in allocating similar observations to the function of the corresponding group to determine the specific classification of observation of interest, and the ability of QDA in identifying the quality of the data for classification (Supplementary Materials, Tables S1–S6). Note on coefficients: Large values/intercepts result from z-score standardization and Mahalanobis distance scaling in QDA; zeros indicate negligible contribution after covariance adjustment (not regularization/selection). Full formulas are available in the Supplementary Materials.

3.2. Principal Component Analysis (PCA)

The results of the KMO test (Table 7) of the sampling adequacy of variables to be included in the PCA showed that the overall KMO of all the MGs was in the ‘moderate’ range (0.5 < KMO < 0.7), and for Exp. was within the ‘good’ range (0.7 < KMO < 0.8) for the Principal Component Analysis.

The scree plot visualization (Figure 4) suggested that two principal components (PCs) could optimally capture the majority of the variance in the data in all cases.

The circle plot (Figure 5) based on the correlation (factor loading) between variables and PCs provides the coordinates of variables in the new space. The results of the PCA indicated that for MG3, MG4E, MG4L, MG5E, MG5L, and the Exp., PC1 explained 51.57%, 49.97%, 48.1%, 45.43%, 54.72%, and 47.62% of the variance, respectively. PC2 accounted for 23.39%, 23.74%, 25.22%, 30.06%, 19.01%, and 23.05% of the variance in these groups. It also shows that in all cases, climatic factors (average, maximum, and minimum temperature, GDD, and precipitation) had the highest positive, and latitude and altitude had the highest negative correlations with PC1. Soil type, sand percentage, and yield had the highest positive correlation, and clay percentage, silt percentage, and longitude (and altitude in MG5L) had the highest negative correlation with PC2.

The pie chart of the variables’ contribution percentage and their significance level in building PC1 and PC2 is visualized in Figure 6. It shows that climatic characteristics along with latitude and altitude (except in MG5L, it was longitude) were the most significant contributors in building PC1. Longitude, the percentage of sand, silt, and clay in all cases (in MG5E, precipitation was added to the mix, and in MG5L, it was only soil characteristics) had the highest contribution in building PC2.

3.3. Agglomerative Hierarchical Clustering (AHC)

Table 8 (S(i), H, Δ, and CH) and Figure 7 (PIDOC of BSS and WSS) provide verified guidance for selecting the optimal number of clusters and the explanatory indices. The optimal number of clusters for MG3 was four, where S(i), H, Δ, and CH were 0.37, 4.26, 5.8, and 16.78, respectively. Also, the PIDOC of BSS and WSS at four clusters explained 70.56% and 29.44% of the total inertia. MG4E was optimal at four clusters where S(i) = 0.31, H = 6.37, Δ = 4.01, CH = 19.82, and the PIDOC of BSS and WSS, respectively, explained 65.58% and 34.42% of the total inertia. MG4L was optimal at three clusters with S(i) = 0.33, H = 7.02, Δ = 8.07, CH = 17.52, and PIDOC of BSS and WSS, respectively, were responsible for 53.88% and 46.12% of the total inertia. MG5E was optimal at three clusters with S(i) = 0.33, H = 5.5, Δ = 10.18, CH = 16.6, and PIDOCs of BSS and WSS were 56.08% and 43.92%. MG5L was optimal at two clusters with S(i) = 0.34, H = 5.45, Δ = 7.53, CH = 12.07, and PIDOCs of BSS and WSS were 57.01% and 42.99%. The optimal number of clusters for the Exp. was determined to be four, with metrics S(i) = 0.29, H = 6.86, Δ = 9.41, CH = 28.17, and PIDOC of BSS and WSS were 61.03% and 38.97%, respectively.

The dendrogram (Figure 8) shows that in MG3, cluster 1, unlike the other three clusters, was the most diverse cluster (it had members from KY, MO, and TN). Clusters 2, 3, and 4 were dominantly comprised of MO, LA, and VA counties, respectively. The subcluster of Tensas Parish, LA, with West Carrol Parish, LA (from cluster 3), had the least, and Dunklin Co., MO, with Robertson Co., TN (from cluster 1), had the highest dissimilarities. Cluster 4 was the shortest node height, which means the data points had the least within-cluster dissimilarities, and cluster 3 was the least compacted (tallest, highest within-cluster dissimilarities). Clusters 1 and 2 were the first linked clusters (least dissimilarity), which were later linked with cluster 4 and eventually with cluster 3.

In MG4E, clusters 1 (KY, MO, TN) and 2 (AR, LA, TN) were the most diverse clusters, and cluster 3 (MO) was the least diverse cluster. The subcluster of Lee Co., AR, and St. Francis Co., AR, had the least, and East Baton Rouge Parish, LA, with Red River Parish, LA, had the most dissimilarities at the first level of linkage. Cluster 3 and 2, respectively, were the most and least compacted clusters. Clusters 3 and 1 had a lower level of dissimilarity linked together than clusters 2 and 4.

Among the three clusters of MG4L, clusters 1 (KY, MO, TN) and 2 (AR, LA, TN) were the most diverse clusters. Once again, the subcluster of Lee Co., AR, and St. Francis Co., AR, had the least, and East Baton Rouge Parish, LA, with Red River Parish, LA, had the most dissimilarities. Clusters 2 and 1 were the most and least compacted clusters. Clusters 1 and 2 at a lower node height are linked together (lower distance between clusters.

In MG5E, clusters 1 and 3 were the most and least diverse clusters. The least and most dissimilar subclusters were observed in the cluster of St. Francis Co., AR, with Shelby Co., TN (from cluster 2), and Suffolk, VA, with Richmond Co., VA. Clusters 1 and 3 were the most and least compacted clusters. Clusters 1 and 3 had the least between-cluster distance, which was later linked with cluster 2.

In MG5L, the subclusters of Tensas Parish, LA, with Rapides Parish, LA, had the least dissimilarities, and Orange Co., VA, with Nottoway Co., VA, had the most. Components of cluster 1 had a lower level of dissimilarity linked together (lower difference) compared to components of cluster 2.

In the Exp., clusters 1 (KY, NC, TN, VA) and 2 (AR, KY, MO, TN) were the most diverse, and cluster 4 (MO) was the least diverse cluster. In the first subcluster, the grouping of Lee Co., AR, with St. Francis Co., AR, from cluster 3 had the least, and Yadkin Co., NC, with Pulaski Co., KY, from cluster 1 had the highest dissimilarities. Cluster 4 and 1, respectively, were the most and least compacted clusters. Clusters 2 and 4 were linked at a shorter node height compared to the linkage between clusters 1 and 3.

The profile plot (Figure 9) showed that the variables followed a pattern in the contribution level in clustering. The effect of yield on clustering (in a scale from −2 to 2) was usually between −0.87 (in cluster 3 of the MG3) and 0.62 (in cluster 4 of the MG3). The highest positive contribution of GDD, average temperature, minimum temperature, maximum temperature, and precipitation (climatic variables), respectively, were 1.89, 1.83, 1.84, 1.8, and 1.29 on cluster 3 of MG3; the highest negative contribution of these variables were −1.21, −1.33, −1.29, −1.35, and −1.34 on cluster 3 of MG4E. The highest positive contribution of (geographical characteristics) latitude and longitude, respectively, was 1.23 and 0.97 in cluster 4 of the Exp., and the highest positive contribution of altitude was in cluster 3 in MG4E. The highest negative contributions of latitude (−1.93), longitude (−1.96), and altitude (−1.3), respectively, were on clusters 3, 4, and 3 of MG3. The highest positive contributions of (soil characteristics) soil type (0.6) and percentage of sand (1.65), silt (0.72), and clay (0.73), respectively, were on clusters 3 of MG5E, 4 of MG4E, 1, and 1 of MG5E. The highest negative contributions of soil type and percentage of sand, respectively, were −0.59 and −0.84 in cluster 1 of MG5E; silt was −1.55 in cluster 4 of MG3 and 4 of MG4E, and clay was −1.07 in cluster 3 of MG4L.

4. Discussion

The application of machine learning algorithms in this study successfully moved the analysis of soybean performance beyond traditional political boundaries, establishing a robust framework for delineating environmental stability zones. The main contribution of this work is not the introduction of new algorithms, but the integration of QDA, PCA, and AHC into a unified statistical-learning framework applied to a large, seven-state soybean OVT network to delineate cross-state mega-environments at the maturity-group level. This provides an explicit, data-driven alternative to the traditional state-based zoning used in current recommendation systems.

4.1. Discriminant Analysis and Yield Stability

The QDA effectively highlighted year-to-year yield variability, classifying over 79% of observations into distinct temporal classes. While year-to-year climatic variation is expected, the high classification accuracy confirms that specific variables—namely GDD and latitude—are the primary discriminators of yield potential. Unlike retrospective analyses, these discriminant functions provide a predictive model: by inputting forecasted or early-season GDD and latitudinal data, breeders can predict which “performance class” a growing season is likely to resemble. This aligns with findings by [88] regarding rice yield classification and suggests that breeding programs must prioritize genotypic stability against thermal accumulation variances (GDD) rather than precipitation alone. Furthermore, the variation in discriminant functions across different years suggests that not all factors influence yield equally in every season, supporting the need for tailored management approaches for specific maturity groups or environmental conditions [89]. Although internal classification and cross-validation accuracies are high, the current analysis does not yet include external validation with independent years, withheld locations, or prospective trials. At this stage, the mega-environments should therefore be viewed as descriptive, data-driven environmental strata that require additional validation before being used operationally for predicting future yield stability or variety rankings. The high year-wise QDA accuracies therefore demonstrate that the selected edaphoclimatic variables capture strong inter-annual differences in background climate, which complements—but does not replace—the PCA–AHC-based delineation of spatial mega-environments.

4.2. Latent Environmental Drivers (PCA)

The PCA results provided critical insight into the hierarchy of yield-limiting factors, verifying that climatic and geographic variables—specifically latitude and altitude—are the determining factors in constructing PC1, while soil characteristics and longitude drive PC2 [90]. In all maturity groups, PC1 was dominated by broad climatic drivers, explaining nearly 50% of the variance. This orthogonal separation—climate on PC1 and soil on PC2—validates the agronomic reality that while photothermal conditions set the potential yield ceiling [91], local edaphic factors determine the realized yield. This methodology allows for the adjustment and verification of correlation weights to better understand land-climate interactions [92], effectively simplifying the analysis to focus on biological interpretation rather than data processing [32].

4.3. Delineating Mega-Environments (AHC)

Most notably, the AHC analysis revealed that optimal production clusters frequently defied geographic proximity, indicating that “neighboring” counties often belong to different mega-environments.

Cross-State Clustering: For example, in MG4E, cluster 2 grouped locations from Arkansas, Louisiana, and Tennessee together, separating them from cluster 3, which was comprised almost entirely of Missouri locations. This suggests that a grower in Western Tennessee may share more critical yield-determining factors with a producer in the Arkansas Delta than with a producer in Eastern Tennessee, despite the latter being within the same state political boundary.

Operational Implications: Similarly, in MG3, cluster 1 formed a distinct “transitional” mega-environment comprising counties from Kentucky, Missouri, and Tennessee. This grouping operationalizes the concept that these specific sites share a microclimate and soil profile that justifies sharing variety recommendations.

The profile plots (Figure 9) further elucidate why these distant sites clustered together; for instance, the high negative contribution of latitude in MG3 cluster 3 indicates a “Southern-adaptability” zone that spans state lines. This confirms that while geographic proximity plays a role, local microclimate and soil characteristics are often equally influential in determining cluster membership.

4.4. Conclusion on Methodology

These findings support the move toward data-driven recommendation domains. As noted by Dawson and Belkhir (2009) [39], the node height in our dendrograms serves as a proxy for environmental similarity. Just as Das et al. (2021) [16] utilized AHC to differentiate wheat varieties based on physiological parameters, our study successfully differentiates production environments to optimize variety placement. Utilizing these clusters can assist with reducing the cost and improving the accuracy of strategizing for production. This approach mirrors the work of Ibrar et al. (2024) [41], who used hierarchical clustering to identify distinct heterotic groups; similarly, we identified distinct environmental groups to streamline testing locations and reduce redundancy in trial networks.

4.5. Limitations and Future Directions

First, the climatic covariates are based on seasonal means and cumulative indices summarized over the frost-free growing season. While these metrics capture broad suitability, they do not explicitly capture short-duration extremes, such as brief heat waves or intra-seasonal droughts, which often drive year-specific yield anomalies. As a result, the framework is better suited for long-term zoning and strategic trial placement than for assessing short-term production risk under extreme events. While this study establishes a statistical learning framework for mega-environment delineation, several limitations must be noted. First, the Silhouette scores for the identified clusters (0.29–0.37) indicate a “moderate-to-weak” structure. This suggests that the study area represents a continuum of environmental gradients rather than sharply distinct ecological islands. Consequently, these clusters should be viewed as probabilistic transition zones rather than absolute boundaries. Second, this analysis relied on seasonal means for climatic variables. As noted by recent studies, soybean yield is often limited by short-term acute weather events (e.g., heat stress during flowering), which may be smoothed over in seasonal averages. Finally, this study focused on Genotype × Environment (G × E) interactions; future work should integrate Management (M) variables—such as planting date and irrigation—to provide a full G × E × M analysis.

Second, both climate data from NCEI and soil texture information from the Web Soil Survey are derived from interpolated products, which inevitably smooth local variability in weather and soil properties. Consequently, the identified mega-environments should be interpreted as regional edaphoclimatic groupings, not as prescriptive units for fine-scale or within-field precision management. Future work could pair this framework with higher-resolution datasets (e.g., Mesonet observations, gridded reanalysis, or proximal soil sensing) to refine spatial detail where needed.

Third, key management variables such as planting date, irrigation infrastructure, plant population, and overall management intensity were not consistently available across states and therefore were not included in the models. The mega-environments derived here thus represent environmental potential rather than fully specified G × E × M systems, and cultivar recommendations will ultimately need to overlay management metadata on top of these environmental strata.

5. Conclusions

This study’s results show that QDA effectively classified 79+% of the yield of all maturity groups into distinguished classes. Also, QDA, while highlighting GDD and latitude as the main discriminatory variables in the two main discrimination factors (F1 and F2) in all maturity groups, illustrated that the climatic and geographic variables are the main variables in controlling yield classification in different years. Furthermore, PCA showed that a 2-D principal component could successfully explain the majority of the variance, in which the climatic and geographic (except longitude) variables were the main influential variables on PC1, and soil characteristics, along with longitude, were the most influential variables on PC2. Moreover, AHC showed that the optimal clusters varied between two and four for different maturity groups and the overall experiment. This result also showed that geographic approximation did not necessarily result in grouping locations in the same cluster. The profile plot indicated that environmental and geographical variables (except longitude) usually play the leading role in AHC.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/agronomy16030376/s1, Table S1: Classification Function Formula of MG3; Table S2: Classification Function Formula of MG4E; Table S3: Classification Function Formula of MG4L; Table S4: Classification Function Formula of MG5E; Table S5: Classification Function Formula of MG5L; Table S6: Classification Function Formula of the overall experiment (Exp.).

Author Contributions

Conceptualization, V.R.S.; methodology, I.M. and V.R.S.; software, I.M.; formal analysis, I.M.; investigation, I.M., V.R.S., R.B., R.H., and D.M.; resources, V.R.S., R.B., R.H., and D.M.; data curation, I.M., V.R.S., R.B., R.H., and D.M.; writing—original draft preparation, I.M. and V.R.S.; writing—review and editing, I.M., V.R.S., R.B., R.H., and D.M.; visualization, I.M.; supervision, V.R.S.; project administration, V.R.S.; funding acquisition, V.R.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the United Soybean Board, project number 2323-206-0301.

Data Availability Statement

The original data presented in the study are openly available online or by request from each state OVT program, which publishes trial results annually.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

AHC	Agglomerative Hierarchical Clustering
BSS	Between-Cluster Sum of Squares
CH	Calinski–Harabasz Index
CV	Cross-Validation
DA	Discriminant Analysis
DF	Degree of Freedom
Exp.	Experiment (Overall)
GDD	Growing Degree Days
H	Hartigan Index
IDOC	Inertia Decomposition at Optimal Classification
KMO	Kaiser–Meyer–Olkin
masl	Meters Above Sea Level
MG	Maturity Group
MLA	Machine Learning Algorithms
NCEI	National Centers for Environmental Information
OVT	Official Variety Trials (or Testing)
PC	Principal Component
PCA	Principal Component Analysis
PIDOC	Percentage of Inertia Decomposition at Optimal Clustering
PP	Prior Probability
PTU	Photothermal Units
QDA	Quadratic Discriminant Analysis
SB	Between-Class Sum of Squares
SW	Within-Class Sum of Squares (also Sum of Weights)
S(i)	Silhouette Index
Tr	Training Sample
UL-MLA	Unsupervised Learning Machine Learning Algorithms
USDA	United States Department of Agriculture
USGS	United States Geological Survey
WSS	Web Soil Survey (also Within-Cluster Sum of Squares)

References

Hymowitz, T. The history of the soybean. In Soybeans; Johnson, L.A., White, P.J., Galloway, R., Eds.; AOCS Press: Amsterdam, The Netherlands, 2008; pp. 1–31. [Google Scholar] [CrossRef]
USDA. Uniform Soybean Tests, Northern States. U.S. Department of Agriculture. 1951. Available online: https://www.ars.usda.gov/ARSUserFiles/60661000/UniformSoybeanTests/51soybook.pdf (accessed on 15 August 2025).
Yang, L.; Song, W.; Xu, C.; Sapey, E.; Jiang, D.; Wu, C. Effects of high night temperature on soybean yield and compositions. Front. Plant Sci. 2023, 14, 1065604. [Google Scholar] [CrossRef]
Shurtleff, W.; Aoyagi, A. History of Soybean Variety Development, Breeding and Genetic Engineering 1902–2020. Soyinfo Center. 2020. Available online: https://www.soyinfocenter.com/pdf/229/PrVd.pdf (accessed on 15 August 2025).
Jemo, M.; Devkota, K.P.; Epule, T.E.; Chfadi, T.; Motiq, R.; Hafidi, M.; Silatsa, F.B.T.; Jibrin, J.M. Exploring the potential of mapped soil properties, rhizobium inoculation, and phosphorus supplementation for predicting soybean yield in the savanna areas of Nigeria. Front. Plant Sci. Sec. Plant Symbiotic Interact. 2023, 14. [Google Scholar] [CrossRef]
USDA National Agricultural Statistics Service. Crop Production 2020 Summary. 2020. Available online: https://www.nass.usda.gov/Publications/Todays_Reports/reports/cropan20.pdf (accessed on 15 August 2025).
USDA National Agricultural Statistics Service. 2020 State Variety Testing Report: Soybean. United States Department of Agriculture. 2020. Available online: https://www.nass.usda.gov/Publications/ (accessed on 15 August 2025).
Everman, W. Weed Management. NC State Extension Publications. ncsu.edu. 2024. Available online: https://content.ces.ncsu.edu/north-carolina-soybean-production-guide/soybean-weed-management (accessed on 15 August 2025).
Lin, F.; Chhapekar, S.S.; Vieira, C.C.; Da Silva, M.P.; Rojas, A.; Lee, D.; Liu, N.; Pardo, E.M.; Lee, Y.-C.; Dong, Z.; et al. Breeding for disease resistance in soybean: A global perspective. Theor. Appl. Genet. 2022, 135, 3773–3872. [Google Scholar] [CrossRef]
Allen, T.W.; Bradley, C.A.; Sisson, A.J.; Byamukama, E.; Chilvers, M.I.; Coker, C.M.; Collins, A.A.; Damicone, J.P.; Dorrance, A.E.; Dufault, N.S.; et al. Soybean yield loss estimates due to diseases in the United States and Ontario, Canada, from 2010 to 2014. Plant Health Prog. 2017, 18, 19–27. [Google Scholar] [CrossRef]
Fowler, A.; Basso, B.; Maureira, F.; Millar, N.; Ulbrich, R.; Brinton, W.F. Spatial patterns of historical crop yields reveal soil health attributes in US Midwest fields. Sci. Rep. 2024, 14, 465. [Google Scholar] [CrossRef]
Zeleke, K.; Nendel, C. Yield response and water productivity of soybean (Glycine max L.) to deficit irrigation and sowing time in south-eastern Australia. Agric. Water Manag. 2024, 296, 108815. [Google Scholar] [CrossRef]
Bashir, M.; Adam, A.M.; Shehu, B.M.; Abubakar, M.S. Effects of Soil Texture and Nutrients Application on Soybean Nutrient Uptake, Growth and Yield Response. J. Agric. Food Sci. 2022, 20, 227–241. [Google Scholar] [CrossRef]
Lin, T.S.; Song, Y.; Lawrence, P.; Kheshgi, H.S.; Jain, A.K. Worldwide maize and soybean yield response to environmental and management factors over the 20th and 21st centuries. J. Geophys. Res. Biogeosci. 2021, 126, e2021JG006304. [Google Scholar] [CrossRef]
Dong, A.; Lai, X.; Han, T.; Nsigayehe, J.M.V.; Li, G.; Shen, Y. Crossing latitude introduction delayed flowering and facilitated dry matter accumulation of soybean as a forage crop. J. Integr. Agric. 2024, 24, 0033. [Google Scholar] [CrossRef]
Das, S.; Christopher, J.; Apan, A.; Choudhury, M.R.; Chapman, S.; Menzies, N.W.; Dang, Y.P. UAV-Thermal imaging and agglomerative hierarchical clustering techniques to evaluate and rank physiological performance of wheat genotypes on sodic soil. ISPRS J. Photogramm. Remote Sens. 2021, 173, 221–237. [Google Scholar] [CrossRef]
Smith, J.R.; Jones, M.A. The role of variety testing in sustainable agricultural practices: Implications for soybean producers. J. Agric. Sci. 2021, 159, 567–579. [Google Scholar]
Chen, H.; Pan, X.; Wang, F.; Liu, C.; Wang, X.; Li, Y.; Zhang, Q. Novel QTL and Meta-QTL Mapping for Major Quality Traits in Soybean. Front. Plant Sci. 2021, 12, 774270. [Google Scholar] [CrossRef]
USDA National Agricultural Statistics Service. 2021 State Variety Testing Report: Soybean. United States Department of Agriculture. 2021. Available online: https://www.nass.usda.gov/Publications/ (accessed on 15 August 2025).
Wakefield, K. A Guide to Machine Learning Algorithms and Their Applications: Understanding the Types of Machine Learning Algorithms and When to Use Them. SAS UK. Available online: https://www.sas.com/en_us/insights/articles/analytics/machine-learning-algorithms-guide.html (accessed on 19 December 2024).
Sarker, I.H. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN Comput. Sci. 2021, 2, 160. [Google Scholar] [CrossRef]
Elbasi, E.; Zaki, C.; Topcu, A.E.; Abdelbaki, W.; Zreikat, A.I.; Cina, E.; Shdefat, A.; Saker, L. Crop Prediction Model Using Machine Learning Algorithms. Appl. Sci. 2023, 13, 9288. [Google Scholar] [CrossRef]
Araújo, S.O.; Peres, R.S.; Ramalho, J.C.; Lidon, F.; Barata, J. Machine Learning Applications in Agriculture: Current Trends, Challenges, and Future Perspectives. Agronomy 2023, 13, 2976. [Google Scholar] [CrossRef]
Van Klompenburg, T.; Kassahun, A.; Catal, C. Crop yield prediction using machine learning: A systematic literature review. Comput. Electron. Agric. 2020, 177, 105709. [Google Scholar] [CrossRef]
Gao, Z.; Ma, W.; Huang, S.; Hua, P.; Lan, C. Deep Learning for Super-Resolution in a Field Emission Scanning Electron Microscope. AI 2020, 1, 1–10. [Google Scholar] [CrossRef]
McLachlan, G.J. Discriminant Analysis and Statistical Pattern Recognition; Wiley: Hoboken, NJ, USA, 2004; Available online: https://onlinelibrary.wiley.com/doi/book/10.1002/0471725293 (accessed on 15 August 2025).
Dong, S.; Gao, Y.; Xin, L.; Ding, W. Insights into the effects of transgenic glyphosate-resistant semiwild soybean on soil microbial diversity. Sci. Rep. 2024, 14, 32017. [Google Scholar] [CrossRef] [PubMed]
Bianchini, A.; Moraes, P.V.D.; Longhi, S.J.; Adami, P.F.; Rossi, P.; Batista, V.V. Multivariate analysis using a discriminant method for evaluating the techniques of weed management in soybean crop. Agric. Sci. 2020, 12, 48–61. [Google Scholar] [CrossRef]
Kim, S.-Y.; Kim, S.Y.; Lee, S.M.; Lee, D.Y.; Shin, B.K.; Kang, D.J.; Choi, H.-K.; Kim, Y.-S. Discrimination of Cultivated Regions of Soybeans (Glycine max) Based on Multivariate Data Analysis of Volatile Metabolite Profiles. Molecules 2020, 25, 763. [Google Scholar] [CrossRef] [PubMed]
Li, X.; He, Z.; Liu, F.; Chen, R. Fast Identification of Soybean Seed Varieties Using Laser-Induced Breakdown Spectroscopy Combined With Convolutional Neural Network. Front. Plant Sci. 2021, 12, 21. [Google Scholar] [CrossRef]
Wu, W.; Mallet, Y.; Walczak, B.; Penninckx, W.; Massart, D.L.; Heuerding, S.; Erni, F. Comparison of regularized discriminant analysis, linear discriminant analysis, and quadratic discriminant analysis applied to NIR data. Anal. Chim. Acta 1996, 329, 257–265. [Google Scholar] [CrossRef]
Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent development. Phil. Trans. R. Soc. A 2016, 374, 20150202. [Google Scholar] [CrossRef]
Ringnér, M. What is principal component analysis? Nat. Biotechnol. 2008, 26, 303–304. [Google Scholar] [CrossRef]
Abdi, H.; Williams, L.J. Principal Component Analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459. [Google Scholar] [CrossRef]
Jackson, J.O. A User’s Guide to Principal Components; John Wiley & Sons: Hoboken, NJ, USA, 2003. [Google Scholar]
Maronna, R. Principal Components and Orthogonal Regression Based on Robust Scales. Technometrics 2005, 47, 264–273. [Google Scholar] [CrossRef]
Mirahki, I.; Ardakani, M.R.; Golzardi, F.; Paknejad, F.; Mahrokh, A.; Faraji, S. Yield, Water Use Efficiency and Silage Feeding Value of Sorghum Cultivars as Affected by Planting Date and Planting Method. Gesunde Pflanz. 2023, 75, 1963–1973. [Google Scholar] [CrossRef]
Hair, J.F.; Anderson, R.E.; Tatham, R.L.; Black, W.C. Multivariate Data Analysis; Pearson: Abingdon, UK, 2010. [Google Scholar]
Dawson, K.; Belkhir, K. An agglomerative hierarchical approach to visualization in Bayesian clustering problems. Heredity 2009, 103, 32–45. [Google Scholar] [CrossRef] [PubMed]
Shen, B.; José, J.; Feng, Q.; Li, D.; Ye, Y.; Ahmadi, G. Semi-supervised hierarchical ensemble clustering based on an innovative distance metric and constraint information. Eng. Appl. Artif. Intell. 2023, 124, 106571. [Google Scholar] [CrossRef]
Ibrar, D.; Khan, S.; Raza, M.; Nawaz, M.; Hasnain, Z.; Kashif, M.; Rais, A.; Gul, S.; Ahmad, R.; Gaafar, A.-R.Z. Application of machine learning for identification of heterotic groups in sunflower through combined approach of phenotyping, genotyping and protein profiling. Sci. Rep. 2024, 14, 7333. [Google Scholar] [CrossRef]
Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
Pan, W.-J.; Wang, X.; Deng, Y.-R.; Li, J.-H.; Chen, W.; Chiang, J.Y.; Yang, J.-B.; Zheng, L. Nondestructive and intuitive determination of circadian chlorophyll rhythms in soybean leaves using multispectral imaging. Sci. Rep. 2015, 5, 11108. [Google Scholar] [CrossRef]
Ratke, R.F.; de Sousa, A.; Chaves, D.V.; Zanatta, F.L.; Edvan, R.L.; Sousa, H.R.; Silva-Filho, E.C.; Osajima, J.A.; Nascimento, A.M.S.S.; Aguilera, J.G.; et al. Cashew gum hydrogel as an alternative to minimize the effect of drought stress on soybean. Sci. Rep. 2024, 14, 2159. [Google Scholar] [CrossRef] [PubMed]
Cao, P.; Zhao, Y.; Wu, F.; Xin, D.; Liu, C.; Wu, X.; Lv, J.; Chen, Q.; Qi, Z. Multi-Omics Techniques for Soybean Molecular Breeding. Int. J. Mol. Sci. 2022, 23, 4994. [Google Scholar] [CrossRef] [PubMed]
Salmerón, M.; Purcell, L.C. Simplifying the prediction of phenology with the DSSAT-CROPGRO-soybean model based on relative maturity group and determinacy. Agric. Syst. 2016, 148, 178–187. [Google Scholar] [CrossRef]
Carlin, J.F.; Bond, R.D.; Still, J.A. Arkansas Soybean Performance Tests 2019. Arkansas Agricultural Experiment Station Research Series. 2019. Available online: https://scholarworks.uark.edu/aaesser/158 (accessed on 15 August 2025).
Carlin, J.F.; Bond, R.D.; Morgan, R.B. Arkansas Soybean Performance Tests 2020. Arkansas Agricultural Experiment Station Research Series. 2021. Available online: https://scholarworks.uark.edu/aaesser/196 (accessed on 15 August 2025).
Carlin, J.F.; Morgan, R.B.; Bond, R.D. Arkansas Soybean Performance Tests 2021. Arkansas Agricultural Experiment Station Research Series. 2022. Available online: https://scholarworks.uark.edu/aaesser/206 (accessed on 15 August 2025).
Carlin, J.F.; Mulloy, R.B.; Bond, R.D. Arkansas Soybean Performance Tests 2022. Arkansas Agricultural Experiment Station Research Series. 2023. Available online: https://scholarworks.uark.edu/aaesser/216 (accessed on 15 August 2025).
Venard, C.M.-P.; Duckworth, J. 2019 Kentucky Soybean Performance Tests (PR-775). University of Kentucky, College of Agriculture, Food and Environment. 2019. Available online: https://publications.ca.uky.edu/sites/publications.ca.uky.edu/files/PR775.pdf (accessed on 15 August 2025).
Venard, C.M.-P.; Mertz, D.R. 2020 Kentucky Soybean Variety Performance Tests (PR-794). University of Kentucky, College of Agriculture, Food and Environment. 2020. Available online: https://soybeanresearchinfo.com/wp-content/uploads/2020/12/2020-Kentucky-Soybean-Variety-Performance-Trials.pdf (accessed on 15 August 2025).
Venard, C.M.-P.; Mertz, D.R. 2021 Kentucky Soybean Variety Performance Tests (PR-811). University of Kentucky, College of Agriculture, Food and Environment. 2021. Available online: http://www2.ca.uky.edu/agc/pubs/PR/PR811/PR811.pdf (accessed on 15 August 2025).
Mertz, D.; Bruening, B.; Kenimer, C.; Shine, P. 2022 Kentucky Soybean Variety Performance Trials (PR-828). University of Kentucky, College of Agriculture, Food and Environment. 2022. Available online: https://publications.ca.uky.edu/files/PR828.pdf (accessed on 15 August 2025).
Spivey, T.; Woodard, C.; Stephenson, D.O.; Bollich, P.K.; Buckley, B.; Webster, E.P.; Padgett, G.B.; Harrell, D.L.; Copes, J. 2019 Soybean Variety Yields and Production Practices. Louisiana State University Agricultural Center. 2019. Available online: https://www.lsuagcenter.com/articles/page1544459344263 (accessed on 15 August 2025).
Padgett, G.; Webster, E.P.; Collins, F.L.; Davis, J.A.; May, D.; Woodard, C.; Stephenson, D.O.; Bollich, P.K.; Buckley, B.; Harrell, D.L.; et al. 2020 Soybean Variety Yields and Production Practices. Louisiana State University Agricultural Center. 2020. Available online: https://www.lsuagcenter.com/articles/page1576271753656 (accessed on 15 August 2025).
Moseley, D.; Stephenson, D.O.; Collins, F.L.; Buckley, B.; Brown, S.; Price, P.P., III; Padgett, G.B.; Gentry, G.T.; Harrell, D.L.; Kongchum, M.; et al. 2021 Soybean Variety Yields and Production Practices. Louisiana State University Agricultural Center. 2021. Available online: https://www.lsuagcenter.com/articles/page1606947975606 (accessed on 15 August 2025).
Moseley, D.; Brown, S.; Price, P.P., III; Padgett, G.B.; Gentry, G.T.; Collins, F.L.; Watson, T.; Towles, T.; Davis, J.A.; Burns, D.; et al. 2022 Soybean Variety Yields and Production Practices. Louisiana State University Agricultural Center. 2022. Available online: https://www.lsuagcenter.com/articles/page1639689730578 (accessed on 15 August 2025).
Wiebold, W.J.; Nichols, J.; Knuckles, C.; Wieberg, M.; Miller, C.; Koelling, P. 2019 Missouri soybean crop performance tests. University of Missouri, College of Agriculture, Food and Natural Resources. 2019. Available online: https://varietytesting.missouri.edu/archive/2019-Soybean-Complete.pdf (accessed on 15 August 2025).
Wiebold, W.J.; Knuckles, C.; Wieberg, M.; Miller, C.; Koelling, P. 2020 Missouri Soybean Crop Performance Tests. University of Missouri, College of Agriculture, Food and Natural Resources. 2020. Available online: https://varietytesting.missouri.edu/archive/2020-soybean-complete.pdf (accessed on 15 August 2025).
Wiebold, W.J.; Knuckles, C.; Wieberg, M.; Miller, C.; Koelling, P. 2021 Missouri Soybean Crop Performance Tests. University of Missouri, College of Agriculture, Food and Natural Resources. 2021. Available online: https://varietytesting.missouri.edu/archive/2021-soybean-complete.pdf (accessed on 15 August 2025).
Wiebold, W.J.; Knuckles, C.; Wieberg, M.; Miller, C.; Koelling, P. 2022 Missouri Corn Soybean Performance Tests. University of Missouri, College of Agriculture, Food and Natural Resources. 2022. Available online: https://varietytesting.missouri.edu/archive/2022-soybean-complete.pdf (accessed on 15 August 2025).
Heiniger, R.W. 2019 North Carolina Soybean Variety Performance Tests. North Carolina State University, College of Agriculture and Life Sciences. 2019. Available online: https://officialvarietytesting.ces.ncsu.edu/soybeans-2019/ (accessed on 15 August 2025).
Heiniger, R.W. 2020 North Carolina Soybean Variety Performance Tests. North Carolina State University, College of Agriculture and Life Sciences. 2020. Available online: https://officialvarietytesting.ces.ncsu.edu/soybean-2020/ (accessed on 15 August 2025).
Heiniger, R.W. 2021 North Carolina Soybean Variety Performance Tests. North Carolina State University, College of Agriculture and Life Sciences. 2021. Available online: https://officialvarietytesting.ces.ncsu.edu/soybeans-2021/ (accessed on 15 August 2025).
Heiniger, R.W. 2022 North Carolina Soybean Variety Performance Tests. North Carolina State University, College of Agriculture and Life Sciences. 2022. Available online: https://officialvarietytesting.ces.ncsu.edu/soybeans-2022/ (accessed on 15 August 2025).
Sykes, V.; Blair, R.; Kelly, H.; Wilson, A.; Bracey, W.; Pantalone, V.; McClure, A.T. Soybean Variety Tests in Tennessee 2019. University of Tennessee Institute of Agriculture. 2019. Available online: https://search.utcrops.com/wp-content/uploads/2019/12/2019-Soybean-Variety-Test-PB-FINAL.pdf (accessed on 15 August 2025).
Sykes, V.; Blair, R.; Kelly, H.; Wilson, A.; Bracey, W.; Pantalone, V.; McClure, A.T. Soybean Variety Tests in Tennessee 2020. University of Tennessee Institute of Agriculture. 2020. Available online: https://search.utcrops.com/wp-content/uploads/2020/12/2020-Soybean-Report-WEB.pdf (accessed on 15 August 2025).
Sykes, V.; Blair, R.; Kelly, H.; Wilson, A.; Bracey, W.; Pantalone, V.; McClure, A.T. Soybean Variety Tests in Tennessee 2021. University of Tennessee Institute of Agriculture. 2021. Available online: https://search.utcrops.com/wp-content/uploads/2021/12/Soybean_2021.pdf (accessed on 15 August 2025).
Sykes, V.; Blair, R.; Kelly, H.; Schumacher, L.; Palacios, F.; Keadle, B.; Thelin, A.; Pantalone, V. Soybean Variety Tests in Tennessee 2022. University of Tennessee Institute of Agriculture. 2022. Available online: https://search.utcrops.com/wp-content/uploads/2022/12/2022-Soybean-Publication-Full-FINAL.pdf (accessed on 15 August 2025).
Holshouser, D.; Pawlick, A.; Taylor, B.; Seymore, E. Virginia Soybean Performance Tests 2019. Virginia Tech, Tidewater Agricultural Research and Extension Center. 2019. Available online: https://www.sites.ext.vt.edu/newsletter-archive/soybean-performance-test/2019.pdf (accessed on 15 August 2025).
Holshouser, D.; Pawlick, A.; Taylor, B. Virginia Soybean Performance Tests 2020. Virginia Tech, Tidewater Agricultural Research and Extension Center. 2020. Available online: https://www.sites.ext.vt.edu/newsletter-archive/soybean-performance-test/2020_soybean_performance.pdf (accessed on 15 August 2025).
Holshouser, D.; Taylor, B. Virginia Soybean Performance Tests 2021. Virginia Tech, Tidewater Agricultural Research and Extension Center. 2021. Available online: https://www.sites.ext.vt.edu/newsletter-archive/soybean-performance-test/2021_soybean.pdf (accessed on 15 August 2025).
Holshouser, D.; Taylor, B.; Daughtrey, R. Virginia Soybean Performance Tests 2022. Virginia Tech, Tidewater Agricultural Research and Extension Center. 2022. Available online: https://www.sites.ext.vt.edu/newsletter-archive/soybean-performance-test/2022_soybean.pdf (accessed on 15 August 2025).
NOAA National Centers for Environmental Information. Climate Data Online. National Oceanic and Atmospheric Administration. 2025. Available online: https://www.ncdc.noaa.gov/cdo-web/ (accessed on 15 August 2025).
Soil Survey Staff; Natural Resources Conservation Service; United States Department of Agriculture. (n.d.). Web Soil Survey. Available online: https://websoilsurvey.nrcs.usda.gov/ (accessed on 15 August 2025).
USDA. USDA Corn and Soybean Projections for 2024/2025—October 2024. 2024. Available online: https://www.pig333.com/latest_swine_news/usda-corn-and-soybean-projections-for-2024-2025-october-2024_20856/ (accessed on 15 August 2025).
U.S. Geological Survey. 3D Elevation Program (3DEP). 2019. Available online: https://www.usgs.gov/3d-elevation-program (accessed on 15 August 2025).
Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188. [Google Scholar] [CrossRef]
Mika, S.; Ratsch, G.; Weston, J.; Schölkopf, B.; Muller, K.R. Fisher discriminant analysis with kernels. In Proceedings of the Neural Networks for Signal Processing IX, 1999. NNSP ’99, 1999 IEEE Signal Processing Society Workshop, Madison, WI, USA, 25 August 1999; Volume 1, pp. 41–48. [Google Scholar] [CrossRef]
Jolliffe, I.T. Principal Component Analysis, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2002. [Google Scholar] [CrossRef]
Rousseeuw, P.J. Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A k-means clustering algorithm. Appl. Stat. 1979, 28, 100–108. [Google Scholar] [CrossRef]
Calinski, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat. 1974, 3, 1–27. [Google Scholar] [CrossRef]
SAS Institute Inc. SAS 9.4 [Computer Software]. In SAS/STAT User’s Guide, Version 9.4 ed; SAS Institute Inc.: Cary, NC, USA, 2024. [Google Scholar]
Addinsoft. XLSTAT [Computer Software], Addinsoft: New York, NY, USA, 2024.
Sharma, A.; Kumar, J.; Redhu, M.; Kumar, P.; Godara, M.; Ghiyal, P.; Fu, P.; Rahimi, M. Estimation of rice yield using multivariate analysis techniques based on meteorological parameters. Sci. Rep. 2024, 14, 12626. [Google Scholar] [CrossRef]
Amaral, L.O.; Miranda, G.V.; Val, B.H.P.; Silva, A.P.; Moitinho, A.C.R.; Unêda-Trevisoli, S.H. Artificial Neural Network for Discrimination and Classification of Tropical Soybean Genotypes of Different Relative Maturity Groups. Front. Plant Sci. 2022, 13, 814046. [Google Scholar] [CrossRef]
Omondi, J.O.; Chiduwa, M.S.; Kyei-Boahen, S.; Masikati, P.; Nyagumbo, I. Yield gap decomposition: Quantifying factors limiting soybean yield in Southern Africa. npj Sustain. Agric. 2024, 2, 32. [Google Scholar] [CrossRef]
Corbellini, M.; Bobek, D.V.; de Toledo, J.F.F.; Ferreira, L.U.; Santana, D.C.; Gilio, T.A.S.; Teodoro, L.P.R.; Teodoro, P.E.; Tardin, F.D. Geographical adaptability for optimizing the recommendation of soybean cultivars in the Brazilian Cerrado. Sci. Rep. 2024, 14, 13076. [Google Scholar] [CrossRef] [PubMed]
Joswig, J.S.; Wirth, C.; Schuman, M.C.; Kattge, J.; Reu, B.; Wright, I.J.; Sippel, S.D.; Rüger, N.; Richter, R.; Schaepman, M.E.; et al. Climatic and soil factors explain the two-dimensional spectrum of global plant trait variation. Nat. Ecol. Evol. 2021, 6, 36–50. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Geographic distribution of the 60 soybean OVT sites across seven U.S. states, with state boundaries and site locations indicated.

Figure 2. Circle plot of correlations of variables with discriminant functions (F1, F2), along with the percentage of discrimination and eigenvalue (λ) for each function across different maturity groups and the overall experiment.

Figure 3. Clusters of yield from various locations were categorized into different classes based on year, along with the discrimination percentage and eigenvalue (λ) of each function (F1 and F2) for different maturity groups and the overall experiment.

Figure 4. Scree plot of ratio explanations of eigenvalues (λ), the corresponding inertia over each principal component, and the cumulative variability percentage of explained variability of principal components of different maturity groups and the overall experiment.

Figure 5. The circle plot of the coordinates of the variable’s projection onto the new space based on their correlation with principal components (PC1 and PC2), the eigenvalue (λ), and the percentage of variance in different maturity groups and the overall experiment explained by each PC.

Figure 6. Pie chart showing the contribution percentages of variables and their levels of significance (cosine squared in bold) in constructing PC1 and PC2 across different maturity groups and the overall experiment.

Figure 7. Silhouette score (SS), and the inertia of within-cluster (WSS) and between-cluster (BSS) in each step of clustering, and inertia decomposition percentage at the optimal classification (IDOC) of WSS and BSS across different maturity groups and the overall experiment.

Figure 8. Dendrogram of the location classifications of different maturity groups and the overall experiment. C is the number of clusters.

Figure 9. Profile plot of the clusters based on the behavior of the variables across different clusters of different maturity groups and the overall experiment.

Table 1. Results of H0 and Ha hypothesis tests in the Discriminant Analysis, including the box tests of Chi-square (X2) and Fisher’s (F) asymptotic approximation and Wilk’s Lambda test (Rao’s approximation) (Λ), of the Discriminant Analysis of soybean maturity groups and the overall experiment.

Components	MG3			MG4E			MG4L
Components	X²	F	Λ	X²	F	Λ	X²	F	Λ
Lambda	N/A	N/A	0.03	N/A	N/A	0.052	N/A	N/A	0.044
−2 Log(M)	47,343	47,343	N/A	118,101	118,101	N/A	191,862	191,862	N/A
Observations	46,757	200	309	117,585	431	669	191,459	701	1532
Critical Value	271	25,656	1.42	313	201,127	1.4	313	834,670	1.4
DF_X²	234	N/A	N/A	273	N/A	N/A	273	N/A	N/A
DF_SB	N/A	234	36	N/A	273	39	N/A	273	39
DF_SW	N/A	5,997,786	4917	N/A	54,890,488	15,186	N/A	227,829,934	31,917
p-Value (two-tailed)	***	***	***	***	***	***	***	***	***
Components	MG5E			MG5L			Experiment
Components	X²	F	Λ	X²	F	Λ	X²	F	Λ
Lambda	N/A	N/A	0.033	N/A	N/A	0.014	N/A	N/A	0.069
−2 Log(M)	164,380	164,380	N/A	48,686	48,686	N/A	534,966	534,966	N/A
Observations	163,581	599	760	47,725	174	294	534,564	1697	3335
Critical Value	313	161,425	1.4	313	7567	1.4	357	6,557,852	1.38
DF_X²	272	N/A	N/A	273	N/A	N/A	315	N/A	N/A
DF_SB	N/A	273	39	N/A	273	39	N/A	315	42
DF_SW	N/A	44,053,554	13,702	N/A	2,062,405	3596	N/A	2,065,617,514	96,029
p-Value (two-tailed)	***	***	***	***	***	***	***	***	***

α = 0.05; ***, p-Value < 0.001; X2, goodness of fit; F, comparison between the classes means; Λ, proportion of variance explained by between-class versus within-class variance; Lambda, ratio of within-class variance to the total variance; −2 Log(M), homogeneity of covariance matrices; DF, degree of freedom; SB, between-classes sum of squares; SW, within classes sum of squares; two-tailed, test concerned with deviations in both directions (i.e., increase and decrease) from the null hypothesis. N/A indicates that the statistical test was not performed due to insufficient variance in the treatment subset.

Table 2. The sum of weights (SW), prior probabilities (PP), and logarithms of the determinants (Log.D) of the Discriminant Analysis for each class of soybean maturity group and the overall experiment.

Components	MG3			MG 4E			MG 4L
Class	SW (Freq.)	PP	Log.D	SW (Freq.)	PP	Log.D	SW (Freq.)	PP	Log.D
2019	421	0.251	15.1	1393	0.271	49.1	2823	0.262	80.19
2020	417	0.248	36.88	1313	0.255	47.22	2937	0.272	47.82
2021	427	0.254	15.28	1262	0.245	49.87	2811	0.26	50.63
2022	414	0.247	38.07	1176	0.229	27.5	2223	0.206	50.2
Total	1679	1		5144	1		10,794	1
Components	MG 5E			MG 5L			Experiment
Class	SW (Freq.)	PP	Log.D	SW (Freq.)	PP	Log.D	SW (Freq.)	PP	Log.D
2019	1226	0.264	25.44	394	0.32	18.1	8168	0.252	56.84
2020	1274	0.272	25.08	244	0.198	14.52	8710	0.269	54.31
2021	1064	0.229	23.86	381	0.31	15.23	8740	0.27	58.23
2022	1079	0.232	47.49	211	0.172	37.61	6770	0.209	58.65
Total	4643	1		1230	1		32,388	1

Freq., Frequency in the Model; Color code: Agronomy 16 00376 i001

.

Table 3. The eigenvalue (λ), discrimination percentage (D), cumulative discriminant percentage (ΣD), value of Bartlett’s statistic (X), and the significance level of the X² for the two main functions of the Discriminant Analysis of different maturity groups and the overall experiment.

Components	MG3		MG4L
Components	F1	F2	F1	F2
λ	6.96	2.27	8.25	1.07
D	73.20	23.84	86.77	11.24
ΣD	73.20	97.04	86.77	98.01
X²	5855.00	2391.00	33,692.00	9704.00
p-Value	***	***	***	***
Components	MG5E		Experiment
Components	F1	F2	F1	F2
λ	9.37	1.56	6.46	0.74
D	84.65	14.06	88.37	10.06
ΣD	84.65	98.71	88.37	98.43
X²	15,804.00	4967.00	86,398.00	21,351.00
p-Value	***	***	***	***

α = 0.05; ***, p-Value < 0.001.

Table 4. Mahalanobis distances (δ) between different classes of different maturity groups and the overall experiment.

MG3					MG5E
	2019 δ	2020 δ	2021 δ	2022 δ		2019 δ	2020 δ	2021 δ	2022 δ
2019 δ		435	65	420	2019 δ		367	97	62
2020 δ	137		122	340	2020 δ	142		66	202
2021 δ	19	234		191	2021 δ	45	178		19
2022 δ	53	660	55		2022 δ	36	302	8
MG4E					MG5L
	2019 δ	2020 δ	2021 δ	2022 δ		2019 δ	2020 δ	2021 δ	2022 δ
2019 δ		178	50	21	2019 δ		873	387	773
2020 δ	85		34	108	2020 δ	322		320	1413
2021 δ	20	98		16	2021 δ	80	406		9
2022 δ	11	242	20		2022 δ	98	523	5
MG4L					Experiment
	2019 δ	2020 δ	2021 δ	2022 δ		2019 δ	2020 δ	2021 δ	2022 δ
2019 δ		222	72	31	2019 δ		136	44	14
2020 δ	96		43	146	2020 δ	101		29	68
2021 δ	27	131		15	2021 δ	30	64		7
2022 δ	13	264	19		2022 δ	14	135	9

Color code: Agronomy 16 00376 i002

.

Table 5. Standardized canonical discriminant function coefficients (|X|) shown; standardized to mean = 0, SD = 1 within each MG. Interpretation: Higher absolute values indicate stronger contribution to F1/F2 discrimination; sign shows direction relative to class centroids. Compare within columns (F1 vs. F2 per MG); magnitudes affected by collinearity/scaling.

Components	MG3		MG4E		MG4L		MG5E		MG5L		Experiment
Components	F1	F2	F1	F2	F1	F2	F1	F2	F1	F2	F1	F2
Variety	−0.04	−0.01	0.01	−0.11	0.04	−0.11	0.1	−0.09	−0.01	−0.02	0.03	−0.08
Yield	0.16	0.11	0.03	−0.04	0.06	−0.05	−0.12	−0.04	0	−0.02	0.02	−0.09
Location	−0.12	−0.18	0.06	0.01	0.11	−0.02	0.06	0.22	0.47	0.94	0.19	0.00
GDD	−5.63	2.12	−8.57	1.31	−8.34	1.09	−8.66	1.69	−9.75	3.2	−8.15	1.19
Avg. Temp.	0.00	0.00	12.43	−0.98	11.98	1.8	12.05	3.12	14.44	2.21	12.03	−0.3
Min. Temp.	5.64	2.56	0.06	2.75	−0.29	0.93	0.00	0.00	0.00	0.00	−0.41	1.04
Max. Temp.	3.5	−3.81	0.00	0.00	0.00	0.00	0.74	−0.36	0.44	−0.67	0.00	0.00
Precipitation	0.2	1.24	−0.07	1.05	−0.1	0.96	0.14	0.91	0.49	1.46	−0.07	1.09
Latitude	3.02	1.69	4.75	2.97	5.14	3.67	6.55	4.41	8.69	4.7	4.32	2.11
Longitude	1.21	−0.22	1.97	0.26	2.68	0.47	3.8	0.75	4.03	0.14	2.13	0.01
Altitude	0.06	0.68	−0.42	1.08	−0.36	1.25	−0.06	1.51	0.09	2.18	−0.38	0.6
Soil Type	0.14	0.34	−0.04	−0.08	0.06	−0.13	0.26	−0.15	0.25	−0.72	−0.07	0.16
Sand	0.00	0.00	−0.55	0.7	−0.51	0.65	0.09	0.93	−0.08	−0.25	−0.18	0.34
Silt	0.2	−0.68	0.00	0.00	0.00	0.00	0.11	0.19	0.00	0.00	0.11	0.09
Clay	−0.29	−0.04	−0.24	0.05	−0.17	−0.03	0.00	0.00	0.14	−0.54	0.00	0.00
MG											0.02	−0.08

MG, maturity group; Color code based on the absolute value: Agronomy 16 00376 i003

.

Table 6. Result of classification using the confusion matrix for training (Tr) and cross-validation (CV/LOOCV accuracy) of different maturity groups and the overall experiment.

Components		MG3											MG4E
		2019		2020		2021		2022		Σ	φ (%)		2019		2020		2021		2022		Σ	φ (%)
		Tr	CV	Tr	CV	Tr	CV	Tr	CV	Σ	Tr	CV	Tr	CV	Tr	CV	Tr	CV	Tr	CV	Σ	Tr	CV
2019	Tr	415		0		6		0		421	98.57		1149		0		1		243		1393	82.48
2019	CV		355		0		66		0	421		84.32		1127		0		0		266	1393		80.9
2020	Tr	0		396		21		0		417	94.96		0		1313		0		0		1313	100
2020	CV		0		404		13		0	417		96.88		0		1313		0		0	1313		100
2021	Tr	0		0		427		0		427	100		0		0		649		613		1262	51.43
2021	CV		6		0		421		0	427		98.59		0		0		548		714	1262		43.42
2022	Tr	3		0		41		370		414	89.37		0		0		0		1176		1176	100
2022	CV		0		0		38		376	414		90.82		90		0		2		1084	1176		92.18
Sum	Tr	418		396		495		370		1679	95.77		1149		1313		650		2032		5144	83.34
Sum	CV		361		404		538		376	1679		92.67		1217		1313		550		2064	5144		79.16
Components		MG4L											MG5E
		2019		2020		2021		2022		Σ	φ (%)		2019		2020		2021		2022		Σ	φ (%)
		Tr	CV	Tr	CV	Tr	CV	Tr	CV	Σ	Tr	CV	Tr	CV	Tr	CV	Tr	CV	Tr	CV	Σ	Tr	CV
2019	Tr	2650		0		5		168		2823	93.87		1226		0		0		0		1226	100
2019	CV		2655		0		0		168	2823		94.05		1160		0		0		66	1226		94.62
2020	Tr	0		2937		0		0		2937	100		0		1272		2		0		1274	99.84
2020	CV		0		2937		0		0	2937		100		0		1243		31		0	1274		97.57
2021	Tr	50		0		2758		3		2811	98.11		7		0		1057		0		1064	99.34
2021	CV		0		0		2811		0	2811		100		13		0		988		63	1064		92.86
2022	Tr	39		0		48		2136		2223	96.09		268		0		513		298		1079	27.62
2022	CV		240		0		26		1957	2223		88.03		280		0		512		287	1079		26.6
Sum	Tr	2739		2937		2811		2307		10,794	97.1		1501		1272		1572		298		4643	82.99
Sum	CV		2895		2937		2837		2125	10,794		95.98		1453		1243		1531		416	4643		79.22
Components		MG5L											Experiment
		2019		2020		2021		2022		Σ	φ (%)		2019		2020		2021		2022		Σ	φ (%)
		Tr	CV	Tr	CV	Tr	CV	Tr	CV	Σ	Tr	CV	Tr	CV	Tr	CV	Tr	CV	Tr	CV	Σ	Tr	CV
2019	Tr	394		0		0		0		394	100		7921		0		0		247		8168	96.98
2019	CV		394		0		0		0	394		100		8085		0		0		83	8168		98.98
2020	Tr	0		244		0		0		244	100		0		8673		37		0		8710	99.58
2020	CV		0		244		0		0	244		100		0		8710		0		0	8710		100
2021	Tr	0		0		381		0		381	100		59		9		8259		413		8740	94.5
2021	CV		0		0		368		13	381		96.59		23		0		7890		827	8740		90.27
2022	Tr	15		0		89		107		211	50.71		466		0		585		5719		6770	84.48
2022	CV		5		0		94		112	211		53.08		527		0		766		5477	6770		80.9
Sum	Tr	409		244		470		107		1230	91.54		8446		8682		8881		6379		32,388	94.39
Sum	CV		399		244		462		125	1230		90.89		8635		8710		8656		6387	32,388		93.13

Σ, total number of observations in each class; φ, percentage of well-classified observations. Color code: Agronomy 16 00376 i004

.

Table 7. The KMO (Kaiser–Meyer–Olkin) measure of sampling adequacy of the variables of different maturity groups and the overall experiment.

Components	MG3	MG4E	MG4L	MG5E	MG5L	Experiment
Yield	0.50	0.23	0.31	0.29	0.17	0.60
GDD	0.85	0.82	0.84	0.84	0.65	0.74
Avg. Temp.	0.69	0.67	0.68	0.62	0.69	0.69
Min. Temp.	0.69	0.67	0.68	0.61	0.69	0.69
Max. Temp.	0.69	0.66	0.67	0.60	0.66	0.69
Precipitation	0.74	0.73	0.72	0.79	0.60	0.79
Latitude	0.77	0.83	0.80	0.70	0.85	0.83
Longitude	0.42 *	0.58	0.60	0.66	0.62	0.48
Altitude	0.84	0.80	0.80	0.53	0.30	0.89
Soil Type	0.29	0.36	0.35	0.52	0.24	0.29
Sand	0.70	0.63	0.60	0.56	0.48	0.69
Silt	0.62	0.66	0.61	0.59	0.58	0.70
Clay	0.44	0.63	0.67	0.71	0.43	0.68
KMO	0.69	0.69	0.69	0.64	0.59	0.71

*, values above 0.5 green and less than 0.5 red.

Table 8. Results of the statistical indices of clustering locations that different maturity groups studied based on yield, climate, geography, and soil characteristics.

Components	K	S(i)	H	Δ	CH	Components	K	S(i)	H	Δ	CH
MG3	2	0.41	10.19	2.91	13.10	MG5E	2	0.25	15.68	−4.32	11.35
	3	0.39	10.06	0.12	14.26		3	0.33	5.50	10.18	16.60
	4	0.37	4.26	5.80	16.78		4	0.32	5.38	0.12	14.82
	5	0.38	3.72	0.55	15.60		5	0.29	3.74	1.64	14.41
	6	0.31	4.11	−0.39	14.92		6	0.31	3.54	0.20	13.59
	7	0.33	3.61	0.50	15.16		7	0.25	2.74	0.79	13.16
	8	0.32	3.20	0.40	15.39		8	0.27	2.83	−0.08	12.57
MG4E	2	0.27	15.12	1.21	16.32	MG5L	2	0.34	4.54	7.53	12.07
	3	0.34	10.38	4.74	19.11		3	0.27	4.79	−0.25	9.63
	4	0.31	6.37	4.01	19.82		4	0.29	3.65	1.14	9.64
	5	0.31	5.08	1.29	18.95		5	0.33	3.14	0.51	9.51
	6	0.31	5.35	−0.28	18.17		6	0.32	2.15	1.00	9.49
	7	0.34	3.25	2.11	18.23		7	0.30	2.31	−0.16	9.02
	8	0.29	3.44	−0.19	17.30		8	0.27	2.43	−0.13	8.98
MG4L	2	0.26	15.09	−1.36	13.73	Experiment	2	0.26	21.99	1.38	23.37
	3	0.33	7.02	8.07	17.52		3	0.31	16.27	5.72	26.91
	4	0.30	6.59	0.43	16.36		4	0.29	6.86	9.41	28.17
	5	0.29	4.81	1.78	16.29		5	0.28	6.34	0.52	25.05
	6	0.29	4.45	0.36	15.76		6	0.27	6.49	−0.15	23.26
	7	0.32	3.44	1.01	15.56		7	0.27	6.50	−0.01	22.44
	8	0.28	2.67	0.78	15.08		8	0.29	6.96	−0.46	22.15

K, cluster number; S(i), silhouette index; H, is the Hartigan index; Δ, difference metric or H(k − 1) − H(k), CH, Calinski and Harabasz Index. Bold values are where the optimal number of clustering was obtained.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Mirahki, I.; Bond, R.; Heiniger, R.; Moseley, D.; Sykes, V.R. Delineating Soybean Mega-Environments Across State Lines: A Statistical Learning Approach to Multi-State Official Variety Trial Analysis. Agronomy 2026, 16, 376. https://doi.org/10.3390/agronomy16030376

AMA Style

Mirahki I, Bond R, Heiniger R, Moseley D, Sykes VR. Delineating Soybean Mega-Environments Across State Lines: A Statistical Learning Approach to Multi-State Official Variety Trial Analysis. Agronomy. 2026; 16(3):376. https://doi.org/10.3390/agronomy16030376

Chicago/Turabian Style

Mirahki, Isaac, Richard Bond, Ryan Heiniger, David Moseley, and Virginia R. Sykes. 2026. "Delineating Soybean Mega-Environments Across State Lines: A Statistical Learning Approach to Multi-State Official Variety Trial Analysis" Agronomy 16, no. 3: 376. https://doi.org/10.3390/agronomy16030376

APA Style

Mirahki, I., Bond, R., Heiniger, R., Moseley, D., & Sykes, V. R. (2026). Delineating Soybean Mega-Environments Across State Lines: A Statistical Learning Approach to Multi-State Official Variety Trial Analysis. Agronomy, 16(3), 376. https://doi.org/10.3390/agronomy16030376

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Delineating Soybean Mega-Environments Across State Lines: A Statistical Learning Approach to Multi-State Official Variety Trial Analysis

Abstract

1. Introduction

2. Materials and Methods

2.1. Data Description

2.2. Data Entry Criteria

2.3. Climate Data

2.4. Soil and Geographical Data

2.5. Data Analysis

2.5.1. Discriminant Analysis (DA)

Fisher’s F Test

The Chi-Square

The Wilk’s Lambda (Λ) Test

2.5.2. Principal Component Analysis (PCA)

Data Standardization

The Pearson Correlation Matrix

Eigenvalue and Eigenvector

The Eigenvalues Were Sorted

The PC Score

The Explained Variance

Bartlett’s Test of Sphericity

The Kaiser–Meyer–Olkin (KMO)

2.5.3. Agglomerative Hierarchical Clustering (AHC)

Initialization

Distance Measurements

Ward’s Linkage

Silhouette Index

Hartigan Index (H)

Calinski–Harabasz Index (CH)

The H (k − 1) − H(k) Criterion

2.6. Software and Computational Tools

3. Results

3.1. Quadratic Discriminant Analysis

3.2. Principal Component Analysis (PCA)

3.3. Agglomerative Hierarchical Clustering (AHC)

4. Discussion

4.1. Discriminant Analysis and Yield Stability

4.2. Latent Environmental Drivers (PCA)

4.3. Delineating Mega-Environments (AHC)

4.4. Conclusion on Methodology

4.5. Limitations and Future Directions

5. Conclusions

Supplementary Materials

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI