Article

Delineating Soybean Mega-Environments Across State Lines: A Statistical Learning Approach to Multi-State Official Variety Trial Analysis

1
Department of Plant Sciences, University of Tennessee, 308 Agriculture and Natural Resources Building, Knoxville, TN 37996, USA
2
Department of Crop, Soil and Environmental Sciences, University of Arkansas, 115 Plant Sciences Building, 495 N Campus Walk, Fayetteville, AR 72701, USA
3
Department of Crop and Soil Sciences, North Carolina State University, Williams Hall, 101 Derieux Pl, Raleigh, NC 27695, USA
4
Dean Lee Ag Center, Louisiana State University, 8105 Tom Bowman Dr, Alexandria, LA 71302, USA
*
Author to whom correspondence should be addressed.
Agronomy 2026, 16(3), 376; https://doi.org/10.3390/agronomy16030376
Submission received: 16 December 2025 / Revised: 30 January 2026 / Accepted: 31 January 2026 / Published: 4 February 2026
(This article belongs to the Special Issue Advanced Machine Learning in Agriculture—2nd Edition)

Abstract

The current state-centric analysis of Official Variety Trials (OVTs) restricts the identification of stable performance zones across political boundaries. This study employed multivariate statistical learning techniques to delineate soybean (Glycine max L.) “mega-environments” using yield data from 2269 varieties collected across seven U.S. states (2019–2022). Utilizing Quadratic Discriminant Analysis (QDA), Principal Component Analysis (PCA), and Agglomerative Hierarchical Clustering (AHC), we examined the edaphoclimatic factors influencing yield stability. QDA classified over 79% of environments into distinct temporal categories, highlighting significant inter-annual climatic variability driven by Growing Degree Days (GDD) and latitude. PCA distinguished broad climatic drivers (PC1) from localized soil texture constraints (PC2). AHC identified optimal production clusters that frequently diverged from geographic proximity, indicating that distant sites often share more critical yield-determining factors than neighboring counties. By operationalizing these latent environmental patterns, this study provides a data-driven framework for cross-state environmental zoning that can support more precise variety placement once genotype performance has been evaluated within these zones.

1. Introduction

Soybean (Glycine max L.), a descendant of G. soja Sieb. and Zucc., has a cultivation history of over 5000 years [1] and has been a staple of U.S. agriculture since Official Variety Testing (OVT) became standard practice in the 1940s [2,3,4]. Characteristics like high protein content and benefits for soil health have made soybean a cornerstone of sustainable crop rotation [5], a role underscored by projections of an 8.7% increase in global production to 425.4 million tons [6,7]. While biotic stressors like weeds and diseases can cause significant yield losses of up to 50% and 30%, respectively [8,9,10], environmental factors remain the primary drivers of productivity. Temperature and precipitation alone control 20–40% of yield variability [11], with heat stress potentially causing 17% yield loss [2] and water scarcity during the reproductive stage causing up to 30% loss [12]. Furthermore, soil texture [13] and geographic factors like longitude, latitude, and altitude critically influence growth characteristics [14,15].
Optimizing soybean productivity therefore requires selecting varieties tailored to these regional edaphoclimatic conditions [16], a strategy that enhances land use efficiency and economic sustainability [17,18]. State variety testing data remain vital resources for this decision-making [19]. However, extracting actionable insights from these data requires novel analytical techniques to uncover hidden relationships between influential variables. Machine learning algorithms (MLA), which learn patterns to explain results and predict yield without explicit programming [20,21,22], have shown notable success in classifying crop performance across environments [22,23]. The adoption of statistical learning (UL-MLA)—including Discriminant Analysis (DA), Principal Component Analysis (PCA), and Agglomerative Hierarchical Clustering (AHC)—promises to uncover latent patterns in OVT datasets to promote sustainable farming practices [20,24,25].
Specific ML techniques offer distinct advantages for analyzing multi-environment trials. DA differentiates classes based on predictors [26,27], predicting group membership for observations [28] and facilitating the classification of varieties into performance groups [29,30]. Quadratic DA (QDA) is particularly useful as it allows for unequal covariance matrices between classes [31]. PCA reduces dataset dimensionality while preserving variance [32,33,34], identifying the most effective variables in large, highly correlated datasets [35,36,37,38]. Finally, AHC reveals hierarchical relationships [39,40] and has been successfully used to identify heterotic groups and patterns in complex datasets [41].
While methods such as Quadratic Discriminant Analysis (QDA), Principal Component Analysis (PCA), and Agglomerative Hierarchical Clustering (AHC) have deep roots in classical statistics, they are widely recognized as foundational algorithms within the field of machine learning. Following standard classifications in the field [42,43], we employ these techniques as supervised (QDA) and unsupervised (PCA, AHC) learning algorithms to extract latent patterns from high-dimensional ecological data. These methods allow for the unbiased discovery of environmental structures without the constraints of rigid, pre-defined political boundaries. The objective of this study was to apply these advanced analytical techniques to define data-driven recommendation domains for soybean, moving beyond simple geographic proximity. By providing verified, location-specific variety selection information [44,45], this study aims to integrate ML insights into dynamic models that enhance the adaptability and sustainability of soybean production [46]. Because the goal of this study is to delineate structural, cross-state mega-environments, we focus on seasonal thermal and moisture indices rather than event-scale extremes, and we do not explicitly model short-term heat or drought episodes. Consequently, the resulting zones are intended to describe long-term edaphoclimatic potential, not to quantify intra-seasonal production risk. In addition, due to incomplete metadata across the OVT network, management practices (e.g., planting date, irrigation, and input intensity) are not included, and the mega-environments should be interpreted as environmental domains rather than full G × E × M production systems.

2. Materials and Methods

Varieties with a relative maturity of 3 to 3.9, 4 to 4.5, 4.6 to 4.9, 5 to 5.5, and 5.6 to 5.9 are considered maturity group (MG) 3, 4 early (4E), 4 late (4L), 5 early (5E), and 5 late (5L), respectively. Varieties were stratified and analyzed within their specific maturity groups rather than pooled across the experiment. This stratification was necessary because soybean is a photoperiod-sensitive short-day plant; grouping disparate MGs (e.g., MG 3 vs. MG 5) would introduce significant confounding genotypic variance related to flowering time and maturity dates that would obscure the edaphoclimatic signals targeted by this study [47].
In the unsupervised statistical learning analyses (PCA and AHC), specific ‘Location-Years’ served as the observational units. ‘Location’ was utilized as the primary identifier because it functions as an aggregate proxy for the specific combination of edaphoclimatic and geographic variables (e.g., GDD, precipitation, soil texture) experienced by the crop at that site. Therefore, clustering or discriminating by ‘Location’ effectively groups sites based on their integrated environmental profiles rather than political boundaries.

2.1. Data Description

The time-series yield data were obtained from the official variety tests (OVTs) conducted in 2019 (Y1), 2020 (Y2), 2021 (Y3), and 2022 (Y4) at 60 locations across seven mid-southeastern U.S. states: Arkansas (AR) [48,49,50,51], Kentucky (KY) [52,53,54,55], Louisiana (LA) [56,57,58,59], Missouri (MO) [60,61,62,63], North Carolina (NC) [64,65,66,67], Tennessee (TN) [68,69,70,71], and Virginia (VA) [72,73,74,75]. The locations lie between 30.12 and 40.37° N and between 75.73 and 95.43° W, with average temperatures from 10.56 to 20.85 °C, minimum temperatures from 4.3 to 15.89 °C, maximum temperatures from 16.02 to 26.41 °C, precipitation from 689.36 to 2173.2 mm, and altitudes from 2 to 450 masl; the soils of the experimental sites ranged from 5 to 70% sand, 20 to 65% silt, and 10 to 50% clay. The spatial distribution of the 60 OVT sites across Arkansas, Kentucky, Louisiana, Missouri, North Carolina, Tennessee, and Virginia is shown in Figure 1.

2.2. Data Entry Criteria

The criteria for data analysis consisted of two steps:
The overall experiment: All the reported data from all locations were analyzed.
Locations were included only if they reported data for all four years (2019–2022).

2.3. Climate Data

The long-term environmental data, including minimum, maximum, and average temperature (°C) and accumulated precipitation (mm), were obtained from the National Centers for Environmental Information (NCEI) [76]. As specific planting dates varied by site-year, the analysis utilized the standard frost-free growing season for each location to calculate accumulated climatic indices; consequently, the effect of specific planting timing is captured within the ‘Location-Year’ variance. While soybean development is driven by photothermal accumulation, Growing Degree Days (GDD) were utilized as the primary thermal metric rather than Photothermal Units (PTU) due to the lack of variety-specific genetic coefficients (e.g., critical photoperiod thresholds) for the 2269 commercial genotypes evaluated. To account for the photoperiodic influence on yield and development, the analysis (1) was stratified by maturity group, effectively grouping varieties with similar photoperiod sensitivities, and (2) incorporated latitude as a direct independent variable in the machine learning models to serve as a continuous proxy for the photoperiodic environment. The environmental indices and accumulated growing degree days (GDD) were calculated as follows:
$$\bar{T}_{Max} = \frac{\sum_{i=1}^{n} T_{Max,i}}{n} \quad (1)$$
$$\bar{T}_{Min} = \frac{\sum_{i=1}^{n} T_{Min,i}}{n} \quad (2)$$
$$\bar{T}_{Avg} = \frac{\sum_{i=1}^{n} T_{Avg,i}}{n} \quad (3)$$
$$Precip. = \sum_{i=1}^{n} P_i \quad (4)$$
$$GDD = \sum_{i=1}^{n} \left( \frac{T_{Max,i} + T_{Min,i}}{2} - T_B \right) \quad (5)$$
where $\bar{T}_{Max}$, $\bar{T}_{Min}$, and $\bar{T}_{Avg}$ are the mean maximum, minimum, and average temperatures during the growing season; $T_{Max,i}$, $T_{Min,i}$, and $T_{Avg,i}$ are the maximum, minimum, and mean temperatures on the i-th day; n is the total number of days in the growing season; Precip. is the accumulated precipitation during the growing season; $P_i$ is the amount of precipitation recorded on the i-th day; GDD is the accumulated growing degree days; and $T_B$ is the base temperature of soybean (10 °C).
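As an illustration, the seasonal summaries above can be computed from daily records in a few lines. This is a minimal sketch with made-up daily values, not the NCEI data used in the study; clipping negative daily GDD contributions to zero is a common convention assumed here, not stated in the text.

```python
import numpy as np

# Illustrative daily weather for one growing season (values are made up).
tmax = np.array([24.0, 27.5, 30.1, 22.8, 26.3])  # daily maximum temperature, deg C
tmin = np.array([12.0, 14.2, 16.5, 11.9, 13.4])  # daily minimum temperature, deg C
precip = np.array([0.0, 5.2, 12.7, 0.0, 3.1])    # daily precipitation, mm
T_BASE = 10.0                                    # soybean base temperature, deg C

# Seasonal means and accumulated precipitation, as defined above.
mean_tmax = tmax.mean()
mean_tmin = tmin.mean()
mean_tavg = ((tmax + tmin) / 2).mean()
total_precip = precip.sum()

# Accumulated GDD: daily mean temperature above the base, summed over the season.
# Clipping negative daily contributions to zero is an assumed convention.
gdd = np.clip((tmax + tmin) / 2 - T_BASE, 0, None).sum()
```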

2.4. Soil and Geographical Data

Soil texture composition, including the percentage of sand, silt, and clay, was obtained from the United States Department of Agriculture (USDA) Web Soil Survey (WSS) [77,78]. Geographic coordinates (latitude and longitude) and altitude for each trial site were derived from the United States Geological Survey (USGS) National Map data, corresponding to the specific physical location of the agricultural experiment stations or the centroid of the reported county [79].

2.5. Data Analysis

2.5.1. Discriminant Analysis (DA)

Discriminant Analysis (DA) was utilized to classify and predict the allocation of observations during the experiment. Based on the nature of the experiment, Quadratic Discriminant Analysis (QDA), which allows for inequality in the class covariance matrices [26,80,81], was chosen for DA of yield in different years as affected by climatic, geographic, and soil characteristics. Box's test, assessed via its Chi-square (X2) and Fisher's F approximations [81], was used to test the null hypothesis of equal covariance matrices. Wilks' Lambda was used to evaluate the equality of the mean vectors of the groups.
The general formula for QDA is as follows:
$$g_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log P_k \quad (6)$$
where $g_k(x)$ is the discriminant score for class k; $\Sigma_k$ is the covariance matrix of class k (calculated separately for each class, Equation (7)) and $|\Sigma_k|$ is its determinant; x is the feature vector of the observation; $\mu_k$ is the mean vector (centroid) of class k (Equation (8)); T is the transpose operator; $\Sigma_k^{-1}$ is the inverse of the covariance matrix for class k, which enters the Mahalanobis distance term and accounts for correlations between features (Equation (9)); and $\log P_k$ is the log prior probability of class k (Equation (10)).
$$\Sigma_k = \frac{1}{N_k} \sum_{i \in class\,k} (x_i - \mu_k)(x_i - \mu_k)^T \quad (7)$$
$$\mu_k = \frac{1}{N_k} \sum_{i \in class\,k} x_i \quad (8)$$
where $N_k$ is the number of samples in class k, and $x_i$ is the i-th data point in class k.
$$D_k^2(x) = (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \quad (9)$$
where $D_k^2(x)$ is the Mahalanobis distance of observation x from the centroid of class k.
$$P_k = \frac{N_k}{N} \quad (10)$$
where N is the total number of samples.
To classify a sample x, $g_k(x)$ was computed for all classes, and x was assigned to the class with the highest discriminant score:
$$Class(x) = \arg\max_k \, g_k(x) \quad (11)$$
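The fitting and classification rule above (per-class covariance, mean, prior, and the arg-max assignment) can be sketched as follows. The two synthetic "year" classes and all parameter values are illustrative, not drawn from the OVT data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic "year" classes with unequal covariance matrices (the QDA setting).
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=60)
X2 = rng.multivariate_normal([2.0, 2.0], [[2.0, -0.5], [-0.5, 0.5]], size=40)
X = np.vstack([X1, X2])
y = np.array([0] * 60 + [1] * 40)

def qda_fit(X, y):
    """Per-class mean vector, covariance matrix, and log prior probability."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[int(k)] = (Xk.mean(axis=0),
                          np.cov(Xk, rowvar=False),
                          np.log(len(Xk) / len(X)))
    return params

def qda_score(x, mu, sigma, log_prior):
    """Quadratic discriminant score: -0.5*log|S_k| - 0.5*Mahalanobis^2 + log prior."""
    diff = x - mu
    maha2 = diff @ np.linalg.inv(sigma) @ diff
    return -0.5 * np.log(np.linalg.det(sigma)) - 0.5 * maha2 + log_prior

params = qda_fit(X, y)

def qda_predict(x):
    # Assign x to the class with the highest discriminant score.
    return max(params, key=lambda k: qda_score(x, *params[k]))

train_acc = np.mean([qda_predict(x) == yi for x, yi in zip(X, y)])
```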
Statistical tests
Fisher’s F Test
Fisher’s F test was used at the preprocessing stage to evaluate the equality of the class means (H0). It is expressed as:
$$F = \frac{S_B / (k - 1)}{S_W / (N - k)} \quad (12)$$
where $S_B$ is the between-class sum of squares (Equation (13)), $S_W$ is the within-class sum of squares (Equation (14)), k is the number of classes, and N is the total number of observations across all groups.
$$S_B = \sum_{i=1}^{k} N_i (\mu_i - \mu)(\mu_i - \mu)^T \quad (13)$$
where $N_i$ is the number of observations in class i, $\mu_i$ is the mean of group i, $\mu$ is the overall mean across all classes, and T is the transpose operator.
$$S_W = \sum_{i=1}^{k} \sum_{j=1}^{N_i} (x_{ij} - \mu_i)(x_{ij} - \mu_i)^T \quad (14)$$
where $x_{ij}$ is the j-th observation in class i, and $\mu_i$ is the mean of group i.
The DF of $S_B$ is expressed as:
$$DF_{S_B} = k - 1 \quad (15)$$
The DF of $S_W$ is expressed as:
$$DF_{S_W} = N - k \quad (16)$$
The Chi-Square Test
The Chi-square test evaluated the relationship between the categorical variables (e.g., climatic, geographic, etc.) to determine the model fit. This test is expressed as:
$$X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \quad (17)$$
where Oij is the observed frequency representing the actual count of occurrences in category ij, and Eij is the expected frequency for category ij calculated based on the null hypothesis, and calculated as:
$$E_{ij} = \frac{Row\,Total \times Column\,Total}{Grand\,Total} \quad (18)$$
The degree of freedom (DF) in Chi-square is calculated as:
$$DF_{X^2} = (r - 1)(c - 1) \quad (19)$$
where DFx2 is the degrees of freedom of chi-square (for tests involving categorical variables, DF is commonly used), r is the number of rows, and c is the number of columns in the contingency table.
The Wilks’ Lambda (Λ) Test
The Wilks’ Lambda (Λ) test assessed the discriminative ability of the variables by comparing the within-group variance to the total variance and is calculated through:
$$\Lambda = \frac{|S_W|}{|S_W + S_B|} \quad (20)$$
where $|S_W|$ is the determinant of the within-class scatter matrix, and $|S_W + S_B|$ is the determinant of the total scatter matrix ($S_B$ and $S_W$ as in Equations (13) and (14)).
The DFs of Λ are calculated through Equations (15) and (16).
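A minimal numeric sketch of the scatter matrices and the Wilks' Lambda statistic defined above, using synthetic groups rather than the trial data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Three synthetic groups ("years") of 2-D observations with shifted means.
groups = [rng.normal(loc=m, scale=1.0, size=(30, 2)) for m in (0.0, 1.5, 3.0)]
X = np.vstack(groups)
labels = np.repeat([0, 1, 2], 30)
grand_mean = X.mean(axis=0)

# Between-class (S_B) and within-class (S_W) scatter matrices.
SB = np.zeros((2, 2))
SW = np.zeros((2, 2))
for k in range(3):
    Xk = X[labels == k]
    mk = Xk.mean(axis=0)
    d = (mk - grand_mean).reshape(-1, 1)
    SB += len(Xk) * (d @ d.T)
    SW += (Xk - mk).T @ (Xk - mk)

# Wilks' Lambda = |S_W| / |S_W + S_B|; values near 0 indicate strong separation.
wilks_lambda = np.linalg.det(SW) / np.linalg.det(SW + SB)
```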
In this study, QDA is used as a diagnostic tool to quantify how effectively the edaphoclimatic covariates separate distinct temporal regimes (years) within each maturity group, rather than as the primary procedure for defining mega-environments.

2.5.2. Principal Component Analysis (PCA)

Principal Component Analysis (PCA) was employed to reduce dimensionality while preserving most of the variance and to identify the most influential variables in constructing each principal component. Fourteen edaphoclimatic variables (yield, GDD, temperatures, precipitation, latitude, longitude, altitude, and soil texture percentages) were analyzed with Location-Years as the observations. Variables were standardized via z-scores, the Pearson correlation matrix was used, and PCs were selected at the scree-plot elbow while retaining ≥95% cumulative variance (see Section 2.6 for software). PCA based on the Pearson correlation matrix was selected due to the multiscale nature of the dataset. The steps of PCA are as follows [33,34,35,36,38,82].
Data Standardization
Data standardization was performed to ensure that all the components contributed equally and to prevent the features with a larger range from dominating the analysis. It is expressed as:
$$Z = \frac{X - \mu}{\sigma} \quad (21)$$
where Z is the standardized data matrix, X is the original data matrix (n × p, where n is the number of samples and p is the number of features), μ is the mean of each feature, and σ is the standard deviation of each feature.
The Pearson Correlation Matrix
The Pearson correlation matrix was computed to measure the linear relationship between variables while ensuring scale invariance. It was calculated through:
$$R = \frac{1}{n - 1} Z^T Z \quad (22)$$
where R is the correlation matrix, and ZT is the transpose of the standardized data matrix. Each element Rij in the matrix represents the Pearson correlation between the i and j features.
Eigenvalue and Eigenvector
The eigenvalue and eigenvector were computed to identify the principal components (PC) and to quantify the variance explained by each PC. Its formula is:
$$R v_i = \lambda_i v_i \quad (23)$$
where vi is the corresponding eigenvector, and λi is the eigenvalue.
The Eigenvalues Were Sorted
The eigenvalues were sorted in descending order to identify those corresponding to the most significant PCs. The top k PCs were selected at the elbow point of the changes in the corresponding inertia of the PCs (see the scree plot in Figure 4).
The PC Score
The PC score was calculated by projecting the original data into the selected PCs (new space). The expression is as follows:
$$PC = Z V_k \quad (24)$$
where PC is the transformed dataset in the new k-dimensional space, and Vk is the matrix of the top k eigenvectors.
The Explained Variance
The explained variance was measured to quantify the variability explained by each PC to conclude the final number of PCs to retain most of the dataset’s variance. It is calculated through:
$$VE = \frac{\lambda_j}{\sum_{i=1}^{p} \lambda_i} \quad (25)$$
where VE is the variance explained, $\lambda_j$ is the eigenvalue of the j-th component, and $\sum_{i=1}^{p} \lambda_i$ is the total variance in the dataset.
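The PCA pipeline described above (standardization, Pearson correlation matrix, eigendecomposition, PC scores, explained variance) can be sketched with synthetic Location-Year data; the variable roles in the comments are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic Location-Year table: 40 sites x 4 variables, two driven by one latent factor.
latent = rng.normal(size=(40, 1))
X = np.hstack([
    2.0 * latent + rng.normal(scale=0.5, size=(40, 1)),   # e.g., a thermal index
    -1.0 * latent + rng.normal(scale=0.5, size=(40, 1)),  # e.g., a latitude-like variable
    rng.normal(size=(40, 2)),                             # e.g., uncorrelated soil noise
])

# Step 1: z-score standardization.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
# Step 2: Pearson correlation matrix of the standardized data.
R = Z.T @ Z / (len(Z) - 1)
# Step 3: eigendecomposition, eigenvalues sorted in descending order.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# Step 4: PC scores (projection) and explained-variance ratios.
scores = Z @ eigvecs
explained = eigvals / eigvals.sum()
```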
Statistical tests
Bartlett’s Test of Sphericity
Bartlett’s test of sphericity was performed to evaluate the difference between the correlation matrix and the original matrix. This procedure aimed to explore the variables’ interrelation and suitability for PCA and find a significant p-value supporting conducting PCA. Its expression is as follows:
$$X^2 = -\left(n - 1 - \frac{2p + 5}{6}\right) \log|R| \quad (26)$$
where X2 is the Chi-square, n is the sample size, p is the number of variables, and |R| is the determinant of the correlation matrix.
The Kaiser–Meyer–Olkin (KMO)
The Kaiser–Meyer–Olkin (KMO) was conducted to assess the sampling adequacy for PCA by comparing the magnitude of the observed correlation coefficient to the partial correlation coefficient. Results with overall values closer to 1 indicated the suitability of the dataset for PCA. The KMO formula is as follows:
$$KMO = \frac{\sum_{i \neq j} r_{ij}^2}{\sum_{i \neq j} r_{ij}^2 + \sum_{i \neq j} u_{ij}^2} \quad (27)$$
where rij is the correlation coefficient, and uij is the partial correlation coefficient.
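A sketch of the KMO computation; the partial correlations are obtained from the inverse of the correlation matrix (a standard identity, though the computation route is not specified in the text), and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data with a shared common factor (KMO rewards shared variance).
common = rng.normal(size=(100, 1))
X = common + rng.normal(scale=0.7, size=(100, 4))

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = Z.T @ Z / (len(Z) - 1)                 # Pearson correlation matrix

# Partial correlations from the inverse correlation (precision) matrix:
# u_ij = -P_ij / sqrt(P_ii * P_jj), a standard identity.
P = np.linalg.inv(R)
d = np.sqrt(np.diag(P))
U = -P / np.outer(d, d)

# KMO: squared correlations vs. squared partial correlations (off-diagonal only).
off = ~np.eye(R.shape[0], dtype=bool)
kmo = (R[off] ** 2).sum() / ((R[off] ** 2).sum() + (U[off] ** 2).sum())
```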

2.5.3. Agglomerative Hierarchical Clustering (AHC)

Agglomerative Hierarchical Clustering (AHC) was conducted to explore the underlying structure of the data, uncover latent correlations between variables, and quantify their effect on the classification of the participating locations in OVT. AHC builds nested clusters by iteratively merging the smaller clusters into larger ones. The steps of AHC are as follows:
Initialization
Initialization started by considering each data point (Location) as its own cluster; if there are n data points, there are initially n clusters.
Distance Measurements
Distance measurements calculated the distance (dissimilarity) between the initial clusters. The principal metric used for this purpose was the Euclidean distance, which is expressed as:
$$d(i, j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2} \quad (28)$$
where d is the distance, (i, j) are indices for two different data points, p is the number of features (dimensions), xik is the value of the k-th feature for data point i, and xjk is the value of the k-th feature for data point j.
Ward’s Linkage
Ward’s linkage was used for merging clusters because it produces well-balanced, compact clusters, has moderate sensitivity to outliers, and minimizes the within-cluster variance. Its mathematical formula is:
$$d(A, B) = \frac{|A| \cdot |B|}{|A| + |B|} \, d(\bar{A}, \bar{B}) \quad (29)$$
where $|A|$ and $|B|$ are the numbers of observations in clusters A and B, $\bar{A}$ and $\bar{B}$ are the centroids of clusters A and B, and $d(\bar{A}, \bar{B})$ is the Euclidean distance between the two centroids.
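The initialization, Euclidean distance, and Ward's linkage steps can be sketched with SciPy on synthetic site profiles; SciPy's Ward implementation follows the standard minimum-variance update over Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
# Synthetic standardized site profiles forming three compact groups.
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(10, 2)),
    rng.normal(loc=[3.0, 0.0], scale=0.3, size=(10, 2)),
    rng.normal(loc=[0.0, 3.0], scale=0.3, size=(10, 2)),
])

# Each observation starts as its own cluster; Ward's linkage then merges the pair
# of clusters whose fusion yields the smallest increase in within-cluster variance.
Zlink = linkage(X, method="ward")          # Euclidean distances, Ward criterion
labels = fcluster(Zlink, t=3, criterion="maxclust")
```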
Statistical tests
Silhouette Index
The Silhouette index assessed the clustering quality by measuring how well each data point fitted within its cluster compared to the data points in the neighboring cluster by combining intra-cluster and inter-cluster cohesion [83]. Its mathematical expression is as follows:
$$S(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \quad (30)$$
where a(i) is the average intra-cluster distance for point i, b(i) is the average distance between point i and points in the nearest cluster. A higher S(i) value (closer to 1) indicates a better cluster cohesion and separation (cluster validity), not classification accuracy.
Hartigan Index (H)
The Hartigan index (H) was utilized to quantify the improvement in clustering by measuring the reduction in within-cluster dispersion as the number of clusters increases from k − 1 to k [84]. The expression of this test is:
$$H(k) = \frac{\Delta W}{W_k} \quad (31)$$
where ΔW is the change in the within-cluster dispersion (Equation (32)), k is the number of clusters, and Wk is the within-cluster sum of squared deviations for the k-th cluster (Equation (33)).
$$\Delta W = W_{k-1} - W_k \quad (32)$$
$$W_k = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2 \quad (33)$$
where k is the number of clusters, x is the data point in cluster Ci, Ci is the i-th cluster, and μi is the centroid of cluster Ci.
Calinski–Harabasz Index (CH)
The Calinski–Harabasz Index (CH), also known as the variance ratio criterion, was used to evaluate the clustering quality by comparing the dispersion within clusters to the dispersion between clusters [85]. It is defined as:
$$CH(k) = \frac{Tr(B_k)}{Tr(W_k)} \cdot \frac{n - k}{k - 1} \quad (34)$$
where k is the number of clusters, n is the number of data points, Tr(Bk) is the trace of the between-cluster scatter matrix (Equation (35)), and Tr(Wk) is the trace of the within-cluster scatter matrix (Equation (36)).
$$B_k = \sum_{i=1}^{k} n_i (\mu_i - \mu)(\mu_i - \mu)^T \quad (35)$$
where ni is the number of points in cluster i, μi is the centroid of cluster i, μ is the global mean of all data points, and T is the transpose operator.
$$W_k = \sum_{i=1}^{k} \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T \quad (36)$$
where Ci is the i-th cluster, x is the datapoint in cluster Ci, and μi is the centroid of cluster Ci.
The H (k − 1) − H(k) Criterion
The H (k − 1) − H(k) criterion was used to identify diminishing returns in clustering quality as the number of clusters increases, to select the optimal cluster number [84]. This method uses the difference in the Hartigan index between consecutive cluster numbers (k and k − 1) to evaluate the improvement in clustering. A significant positive value of ΔH(k) indicates a significant improvement in clustering quality by increasing the number of clusters from k − 1 to k. It is defined as:
$$\Delta H(k) = H(k - 1) - H(k) \quad (37)$$
where H(k) is the Hartigan index for k clusters, and H(k − 1) is the Hartigan index at the k − 1 clusters.
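The Hartigan-style selection of the cluster number can be sketched by computing the within-cluster dispersion W_k over a range of candidate k values and taking the largest relative drop; the data and the candidate range are illustrative, not the OVT sites.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
# Synthetic data with four well-separated groups.
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(12, 2))
               for c in ([0, 0], [4, 0], [0, 4], [4, 4])])
Zlink = linkage(X, method="ward")

def within_dispersion(X, labels):
    """W_k: sum of squared deviations of points from their cluster centroids."""
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

# W_k for a range of candidate cluster counts cut from the same dendrogram.
W = {k: within_dispersion(X, fcluster(Zlink, t=k, criterion="maxclust"))
     for k in range(1, 8)}
# Hartigan-style index: relative drop in dispersion when going from k-1 to k.
H = {k: (W[k - 1] - W[k]) / W[k] for k in range(2, 8)}
best_k = max(H, key=H.get)
```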

2.6. Software and Computational Tools

Statistical analysis and data visualization were performed using SAS 9.4 [86], Microsoft Excel (Microsoft Corp., Redmond, WA, USA), and the XLSTAT add-in [87], which together were used for all calculations and figure generation.

3. Results

3.1. Quadratic Discriminant Analysis

The Box’s test results (Chi-square (X2) and Fisher’s F approximations) for the DA (Table 1) showed that the covariance matrices were unequal between the groups (years) in all MGs and the Exp., rejecting the null hypothesis (H0) of equal covariance matrices across years. The Wilks’ Lambda test (Λ) also showed that the mean vectors of all the MGs and the Exp. differed significantly.
The statistical results of the Discriminant Analysis (Table 2) of the MGs and the Exp. showed that the prior probability (PP) in each class (year), which was directly affected by the sum of weights (SW) (frequency) of the cumulative measures, was relatively consistent. However, the logarithms of determinants (Log.D), in other words, the dispersion of the data in each class varied more noticeably. In MG3, MG4E, MG4L, MG5E, MG5L, and Exp., respectively, classes 2019, 2022, 2020, 2021, 2020, and 2020 had the lowest, and 2022, 2021, 2019, 2022, 2022, and 2022 had the highest Log.D. The high variability in 2022 (the highest Log.D in most MGs) and lower classification accuracy (MG5E/5L) likely stem from significant thermal shifts; 2022 was characterized by a drastic shift toward the negative side of F1 and F2, driven by extreme GDD and temperature fluctuations that diverged from the 2019–2021 patterns.
Table 3 shows that F1 was the dominant factor in separating the classes for all the MGs and the Exp. The F1 of the Exp. and MG3, respectively, had the highest and the lowest discriminative effect on classifying the observations. However, the condition for F2 was reversed, which shows that the MG3 varieties are most likely more sensitive to variables that affect the proportion of F2 discrimination. Also, Bartlett’s test for eigenvalue significance showed that both F1 and F2 effectively distinguished the observations in different classes in all MGs and the total experiment.
Based on the Mahalanobis distance test (Table 4) between all classes, in MG3, MG4E, and MG4L, the longest distance was between classes 2022 and 2020; in MG5E and Exp., the longest distance was between classes 2019 and 2020, and in MG5L, the class of 2020 had the longest distance from the class of 2022. The shortest distance between classes in MG3 was between 2019 and 2021; in MG4E and MG4L, it was between 2022 and 2019; in MG5E and MG5L, it was between 2021 and 2022; and in the Exp., it was between 2021 and 2022.
Figure 2 visualizes the correlation of each variable with each function. In MG3, minimum temperature and precipitation had the highest positive correlations with F1 and F2, respectively. In MG4E, MG4L, MG5E, MG5L, and the Exp., precipitation had the highest positive correlation with F1. GDD had the highest negative correlation with F1 in MG3, MG4E, MG4L, MG5E, MG5L, and the Exp. For F2, the highest negative correlations came from yield and the percentage of sand in the soil texture in MG3, variety in MG4E and MG4L, yield in MG5E and MG5L, and longitude in the Exp.
The top five variables by absolute standardized canonical coefficient (|X| ranking, Table 5) for F1 classification were MG3 (min temp, GDD, max temp, latitude, longitude); MG4E/MG4L/MG5E/MG5L/Exp (avg temp, GDD, latitude, longitude). For F2, the variables were MG3 (max temp, min temp, GDD, latitude, precipitation); others emphasized latitude and altitude. Note: Coefficient magnitudes depend on variable scaling and collinearity and should be interpreted relative to standardization within each MG (see Table 5 caption).
Influential variables were identified by ranking absolute standardized canonical discriminant coefficients (Table 5); note that coefficient magnitudes depend on variable scaling and collinearity and should be interpreted as relative discriminatory power within each model. The visualization of the classification scores of the observations (Figure 3), their relative bootstrap ellipses, and the centroid of each class on the factor axes clarifies how effectively this method distinguished between classes for the different maturity groups and the Exp. It also illustrates that in all of the MGs and the Exp., (a) the 2019, 2021, and 2022 classes were located relatively close together; (b) the class of 2019 was usually located on the negative side of F1 and the positive side of F2; (c) the next class, 2020, moved toward the positive side of F1 with relatively less movement toward the negative side of F2; (d) the following class, 2021, moved to an intermediate position; and (e) 2022 showed a drastic shift toward the negative side of both F1 and F2.
Validation protocol: Confusion matrices (Table 6) report resubstitution (training, Tr) accuracy using all data and leave-one-out cross-validation (LOOCV, CV) accuracy. LOOCV folds were defined by individual Location-Year observations (internal validation, no independent holdout). Respectively, 95.77%, 83.34%, 97.1%, 82.99%, 91.54%, and 94.39% of observations in MG3, MG4E, MG4L, MG5E, MG5L, and the Exp. were correctly classified in training, with CV accuracies of 92.67%, 79.16%, 95.98%, 79.22%, 90.89%, and 93.13%. The observations of class 2021 in MG4E and class 2022 in MG5E and MG5L were the only classes with cross-validation accuracy below our internal 75% threshold, which we adopted based on standard performance benchmarks for Discriminant Analysis in ecological data where year-to-year variability is the primary grouping factor. However, the ratio of well-classified observations to total observations remained within the acceptable range. These results confirmed both the eligibility of the data for allocating similar observations to the function of the corresponding group, and the ability of QDA to assess the quality of the data for classification (Supplementary Materials, Tables S1–S6). Note on coefficients: large values/intercepts result from z-score standardization and Mahalanobis distance scaling in QDA; zeros indicate negligible contribution after covariance adjustment (not regularization/selection). Full formulas are available in the Supplementary Materials.

3.2. Principal Component Analysis (PCA)

The results of the KMO test (Table 7) of the sampling adequacy of variables to be included in the PCA showed that the overall KMO of all the MGs was in the ‘moderate’ range (0.5 < KMO < 0.7), and for Exp. was within the ‘good’ range (0.7 < KMO < 0.8) for the Principal Component Analysis.
The scree plot visualization (Figure 4) suggested that two principal components (PCs) could optimally capture the majority of the variance in the data in all cases.
The circle plot (Figure 5) based on the correlation (factor loading) between variables and PCs provides the coordinates of variables in the new space. The results of the PCA indicated that for MG3, MG4E, MG4L, MG5E, MG5L, and the Exp., PC1 explained 51.57%, 49.97%, 48.1%, 45.43%, 54.72%, and 47.62% of the variance, respectively. PC2 accounted for 23.39%, 23.74%, 25.22%, 30.06%, 19.01%, and 23.05% of the variance in these groups. It also shows that in all cases, climatic factors (average, maximum, and minimum temperature, GDD, and precipitation) had the highest positive, and latitude and altitude had the highest negative correlations with PC1. Soil type, sand percentage, and yield had the highest positive correlation, and clay percentage, silt percentage, and longitude (and altitude in MG5L) had the highest negative correlation with PC2.
The contribution percentage of each variable to PC1 and PC2, together with its significance level, is visualized in the pie charts of Figure 6. Climatic characteristics, along with latitude and altitude (longitude, in the case of MG5L), were the most significant contributors to PC1. Longitude and the percentages of sand, silt, and clay contributed most to PC2 in all cases (with precipitation also contributing in MG5E, and only soil characteristics in MG5L).

3.3. Agglomerative Hierarchical Clustering (AHC)

Table 8 (S(i), H, Δ, and CH) and Figure 7 (PIDOC of BSS and WSS) provide convergent guidance for selecting the optimal number of clusters. The optimal number of clusters for MG3 was four, where S(i), H, Δ, and CH were 0.37, 4.26, 5.8, and 16.78, respectively, and the PIDOC of BSS and WSS at four clusters explained 70.56% and 29.44% of the total inertia. MG4E was also optimal at four clusters, where S(i) = 0.31, H = 6.37, Δ = 4.01, CH = 19.82, and the PIDOC of BSS and WSS explained 65.58% and 34.42% of the total inertia, respectively. MG4L was optimal at three clusters, with S(i) = 0.33, H = 7.02, Δ = 8.07, CH = 17.52, and PIDOC of BSS and WSS of 53.88% and 46.12%. MG5E was optimal at three clusters, with S(i) = 0.33, H = 5.5, Δ = 10.18, CH = 16.6, and PIDOCs of BSS and WSS of 56.08% and 43.92%. MG5L was optimal at two clusters, with S(i) = 0.34, H = 5.45, Δ = 7.53, CH = 12.07, and PIDOCs of BSS and WSS of 57.01% and 42.99%. The optimal number of clusters for the Exp. was four, with S(i) = 0.29, H = 6.86, Δ = 9.41, CH = 28.17, and PIDOC of BSS and WSS of 61.03% and 38.97%, respectively.
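A minimal sketch of this model-selection step, assuming Ward agglomeration and synthetic site coordinates (not the OVT locations): candidate cluster counts are scanned with the Silhouette and Calinski–Harabasz indices, and the share of total inertia explained between clusters (the BSS analogue of PIDOC) is computed for the chosen cut.

```python
# Illustrative sketch (synthetic sites): choose the number of AHC clusters via
# S(i) and CH, then report the BSS share of total inertia (PIDOC analogue).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import calinski_harabasz_score, silhouette_score

rng = np.random.default_rng(2)
# Four well-separated synthetic "environments"
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(25, 2))
               for c in ((0, 0), (4, 0), (0, 4), (4, 4))])

Z = linkage(X, method="ward")                  # agglomerative (Ward) tree
scores = {}
for k in range(2, 7):
    labels = fcluster(Z, t=k, criterion="maxclust")
    scores[k] = (silhouette_score(X, labels), calinski_harabasz_score(X, labels))

best_k = max(scores, key=lambda k: scores[k][0])   # here: pick k by Silhouette

labels = fcluster(Z, t=best_k, criterion="maxclust")
tss = ((X - X.mean(axis=0)) ** 2).sum()            # total inertia
wss = sum(((X[labels == g] - X[labels == g].mean(axis=0)) ** 2).sum()
          for g in np.unique(labels))
bss_share = 100 * (tss - wss) / tss                # % of inertia between clusters
print(f"optimal k = {best_k}, BSS share = {bss_share:.1f}%")
```

In the study, several indices (S(i), H, Δ, CH) were weighed together rather than relying on the Silhouette alone; the sketch uses one criterion purely for brevity.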
The dendrogram (Figure 8) shows that in MG3, cluster 1, unlike the other three clusters, was the most diverse (it had members from KY, MO, and TN). Clusters 2, 3, and 4 were composed predominantly of MO, LA, and VA counties, respectively. The subcluster of Tensas Parish, LA, with West Carrol Parish, LA (from cluster 3), had the least, and Dunklin Co., MO, with Robertson Co., TN (from cluster 1), the highest dissimilarity. Cluster 4 had the shortest node height, meaning its data points had the least within-cluster dissimilarity, whereas cluster 3 was the least compact (tallest node height, highest within-cluster dissimilarity). Clusters 1 and 2 were linked first (least between-cluster dissimilarity), then joined by cluster 4 and eventually by cluster 3.
In MG4E, clusters 1 (KY, MO, TN) and 2 (AR, LA, TN) were the most diverse clusters, and cluster 3 (MO) was the least diverse. The subcluster of Lee Co., AR, with St. Francis Co., AR, had the least, and East Baton Rouge Parish, LA, with Red River Parish, LA, the most dissimilarity at the first level of linkage. Clusters 3 and 2 were, respectively, the most and least compact. Clusters 3 and 1 were linked at a lower level of dissimilarity than clusters 2 and 4.
Among the three clusters of MG4L, clusters 1 (KY, MO, TN) and 2 (AR, LA, TN) were the most diverse. Once again, the subcluster of Lee Co., AR, with St. Francis Co., AR, had the least, and East Baton Rouge Parish, LA, with Red River Parish, LA, the most dissimilarity. Clusters 2 and 1 were, respectively, the most and least compact. Clusters 1 and 2 were linked at a lower node height (smaller between-cluster distance).
In MG5E, clusters 1 and 3 were, respectively, the most and least diverse. The least dissimilar subcluster was that of St. Francis Co., AR, with Shelby Co., TN (from cluster 2), and the most dissimilar was Suffolk, VA, with Richmond Co., VA. Clusters 1 and 3 were, respectively, the most and least compact. Clusters 1 and 3 had the least between-cluster distance and were later joined by cluster 2.
In MG5L, the subcluster of Tensas Parish, LA, with Rapides Parish, LA, had the least dissimilarity, and that of Orange Co., VA, with Nottoway Co., VA, the most. The members of cluster 1 were linked at lower levels of dissimilarity than those of cluster 2.
In the Exp., clusters 1 (KY, NC, TN, VA) and 2 (AR, KY, MO, TN) were the most diverse, and cluster 4 (MO) was the least diverse cluster. At the first level of linkage, the grouping of Lee Co., AR, with St. Francis Co., AR (from cluster 3), had the least, and Yadkin Co., NC, with Pulaski Co., KY (from cluster 1), the highest dissimilarity. Clusters 4 and 1 were, respectively, the most and least compact. Clusters 2 and 4 were linked at a shorter node height than clusters 1 and 3.
The profile plot (Figure 9) showed that the variables followed a consistent pattern in their contribution to clustering. The effect of yield on clustering (on a scale from −2 to 2) ranged between −0.87 (cluster 3 of MG3) and 0.62 (cluster 4 of MG3). The highest positive contributions of GDD, average temperature, minimum temperature, maximum temperature, and precipitation (climatic variables) were 1.89, 1.83, 1.84, 1.8, and 1.29, respectively, all on cluster 3 of MG3; the highest negative contributions of these variables were −1.21, −1.33, −1.29, −1.35, and −1.34 on cluster 3 of MG4E. Among geographical characteristics, the highest positive contributions of latitude and longitude were 1.23 and 0.97, respectively, in cluster 4 of the Exp., and the highest positive contribution of altitude occurred in cluster 3 of MG4E. The highest negative contributions of latitude (−1.93), longitude (−1.96), and altitude (−1.3) occurred on clusters 3, 4, and 3 of MG3, respectively. Among soil characteristics, the highest positive contributions of soil type (0.6), sand percentage (1.65), silt percentage (0.72), and clay percentage (0.73) occurred on cluster 3 of MG5E, cluster 4 of MG4E, and cluster 1 of MG5E (for both silt and clay), respectively. The highest negative contributions of soil type and sand percentage were −0.59 and −0.84 in cluster 1 of MG5E; that of silt was −1.55 in cluster 4 of MG3 and cluster 4 of MG4E, and that of clay was −1.07 in cluster 3 of MG4L.
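A profile plot of this kind can be read as the standardized (z-scored) mean of each variable within each cluster, which is consistent with the −2 to 2 scale described above. The following is a sketch under that assumption, on hypothetical data rather than the study's:

```python
# Illustrative sketch (hypothetical data): profile values computed as z-scored
# cluster means, so values near +/-2 mark variables that pull a cluster away
# from the overall average.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))        # columns stand in for yield, GDD, latitude, sand %
X[:20, 1] += 2.0                    # make "cluster 1" sites warmer (higher GDD)
labels = np.repeat([1, 2, 3], 20)   # assume cluster memberships from a prior AHC step

Xz = StandardScaler().fit_transform(X)   # standardize across all observations
profiles = np.array([Xz[labels == g].mean(axis=0) for g in (1, 2, 3)])
# profiles[g-1, j]: contribution of variable j to cluster g
```

With equal-sized clusters the cluster profiles average to zero for each variable, so a strong positive value in one cluster is balanced by negative values elsewhere, exactly the push-pull pattern visible in Figure 9.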

4. Discussion

The application of machine learning algorithms in this study successfully moved the analysis of soybean performance beyond traditional political boundaries, establishing a robust framework for delineating environmental stability zones. The main contribution of this work is not the introduction of new algorithms, but the integration of QDA, PCA, and AHC into a unified statistical-learning framework applied to a large, seven-state soybean OVT network to delineate cross-state mega-environments at the maturity-group level. This provides an explicit, data-driven alternative to the traditional state-based zoning used in current recommendation systems.

4.1. Discriminant Analysis and Yield Stability

The QDA effectively highlighted year-to-year yield variability, classifying over 79% of observations into distinct temporal classes. While year-to-year climatic variation is expected, the high classification accuracy confirms that specific variables—namely GDD and latitude—are the primary discriminators of yield potential. Unlike retrospective analyses, these discriminant functions provide a predictive model: by inputting forecasted or early-season GDD and latitudinal data, breeders can predict which “performance class” a growing season is likely to resemble. This aligns with findings by [88] regarding rice yield classification and suggests that breeding programs must prioritize genotypic stability against thermal accumulation variances (GDD) rather than precipitation alone. Furthermore, the variation in discriminant functions across different years suggests that not all factors influence yield equally in every season, supporting the need for tailored management approaches for specific maturity groups or environmental conditions [89]. Although internal classification and cross-validation accuracies are high, the current analysis does not yet include external validation with independent years, withheld locations, or prospective trials. At this stage, the mega-environments should therefore be viewed as descriptive, data-driven environmental strata that require additional validation before being used operationally for predicting future yield stability or variety rankings. The high year-wise QDA accuracies therefore demonstrate that the selected edaphoclimatic variables capture strong inter-annual differences in background climate, which complements—but does not replace—the PCA–AHC-based delineation of spatial mega-environments.

4.2. Latent Environmental Drivers (PCA)

The PCA results provided critical insight into the hierarchy of yield-limiting factors, verifying that climatic and geographic variables—specifically latitude and altitude—are the determining factors in constructing PC1, while soil characteristics and longitude drive PC2 [90]. In all maturity groups, PC1 was dominated by broad climatic drivers, explaining nearly 50% of the variance. This orthogonal separation—climate on PC1 and soil on PC2—validates the agronomic reality that while photothermal conditions set the potential yield ceiling [91], local edaphic factors determine the realized yield. This methodology allows for the adjustment and verification of correlation weights to better understand land-climate interactions [92], effectively simplifying the analysis to focus on biological interpretation rather than data processing [32].

4.3. Delineating Mega-Environments (AHC)

Most notably, the AHC analysis revealed that optimal production clusters frequently defied geographic proximity, indicating that “neighboring” counties often belong to different mega-environments.
Cross-State Clustering: For example, in MG4E, cluster 2 grouped locations from Arkansas, Louisiana, and Tennessee together, separating them from cluster 3, which was composed almost entirely of Missouri locations. This suggests that a grower in Western Tennessee may share more critical yield-determining factors with a producer in the Arkansas Delta than with a producer in Eastern Tennessee, even though the latter lies within the same state political boundary.
Operational Implications: Similarly, in MG3, cluster 1 formed a distinct “transitional” mega-environment comprising counties from Kentucky, Missouri, and Tennessee. This grouping operationalizes the concept that these specific sites share a microclimate and soil profile that justifies sharing variety recommendations.
The profile plots (Figure 9) further elucidate why these distant sites clustered together; for instance, the high negative contribution of latitude in MG3 cluster 3 indicates a “Southern-adaptability” zone that spans state lines. This confirms that while geographic proximity plays a role, local microclimate and soil characteristics are often equally influential in determining cluster membership.

4.4. Conclusion on Methodology

These findings support the move toward data-driven recommendation domains. As noted by Dawson and Belkhir (2009) [39], the node height in our dendrograms serves as a proxy for environmental similarity. Just as Das et al. (2021) [16] utilized AHC to differentiate wheat varieties based on physiological parameters, our study differentiates production environments to optimize variety placement. Utilizing these clusters can reduce the cost and improve the accuracy of production planning. This approach mirrors the work of Ibrar et al. (2024) [41], who used hierarchical clustering to identify distinct heterotic groups; similarly, we identified distinct environmental groups to streamline testing locations and reduce redundancy in trial networks.

4.5. Limitations and Future Directions

While this study establishes a statistical learning framework for mega-environment delineation, several limitations must be noted. First, the climatic covariates are based on seasonal means and cumulative indices summarized over the frost-free growing season. While these metrics capture broad suitability, they do not explicitly capture short-duration extremes, such as brief heat waves, intra-seasonal droughts, or heat stress during flowering, which often drive year-specific yield anomalies. As a result, the framework is better suited for long-term zoning and strategic trial placement than for assessing short-term production risk under extreme events. In addition, the Silhouette scores for the identified clusters (0.29–0.37) indicate a "moderate-to-weak" structure, suggesting that the study area represents a continuum of environmental gradients rather than sharply distinct ecological islands; consequently, these clusters should be viewed as probabilistic transition zones rather than absolute boundaries.
Second, both climate data from NCEI and soil texture information from the Web Soil Survey are derived from interpolated products, which inevitably smooth local variability in weather and soil properties. Consequently, the identified mega-environments should be interpreted as regional edaphoclimatic groupings, not as prescriptive units for fine-scale or within-field precision management. Future work could pair this framework with higher-resolution datasets (e.g., Mesonet observations, gridded reanalysis, or proximal soil sensing) to refine spatial detail where needed.
Third, key management variables such as planting date, irrigation infrastructure, plant population, and overall management intensity were not consistently available across states and therefore were not included in the models. The mega-environments derived here thus represent environmental potential rather than fully specified G × E × M systems, and cultivar recommendations will ultimately need to overlay management metadata on top of these environmental strata.

5. Conclusions

This study's results show that QDA effectively classified over 79% of the yield observations of all maturity groups into distinct classes. While highlighting GDD and latitude as the main discriminatory variables in the two main discriminant factors (F1 and F2) across all maturity groups, QDA also illustrated that climatic and geographic variables primarily control yield classification across years. Furthermore, PCA showed that two principal components successfully explained the majority of the variance, with climatic and geographic variables (except longitude) the main influences on PC1, and soil characteristics, along with longitude, the most influential variables on PC2. Moreover, AHC showed that the optimal number of clusters varied between two and four across maturity groups and the overall experiment, and that geographic proximity did not necessarily result in locations being grouped in the same cluster. The profile plot indicated that environmental and geographical variables (except longitude) usually played the leading role in AHC.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/agronomy16030376/s1, Table S1: Classification Function Formula of MG3; Table S2: Classification Function Formula of MG4E; Table S3: Classification Function Formula of MG4L; Table S4: Classification Function Formula of MG5E; Table S5: Classification Function Formula of MG5L; Table S6: Classification Function Formula of the overall experiment (Exp.).

Author Contributions

Conceptualization, V.R.S.; methodology, I.M. and V.R.S.; software, I.M.; formal analysis, I.M.; investigation, I.M., V.R.S., R.B., R.H., and D.M.; resources, V.R.S., R.B., R.H., and D.M.; data curation, I.M., V.R.S., R.B., R.H., and D.M.; writing—original draft preparation, I.M. and V.R.S.; writing—review and editing, I.M., V.R.S., R.B., R.H., and D.M.; visualization, I.M.; supervision, V.R.S.; project administration, V.R.S.; funding acquisition, V.R.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the United Soybean Board, project number 2323-206-0301.

Data Availability Statement

The original data presented in the study are openly available online or by request from each state OVT program, which publishes trial results annually.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
AHC: Agglomerative Hierarchical Clustering
BSS: Between-Cluster Sum of Squares
CH: Calinski–Harabasz Index
CV: Cross-Validation
DA: Discriminant Analysis
DF: Degree of Freedom
Exp.: Experiment (Overall)
GDD: Growing Degree Days
H: Hartigan Index
IDOC: Inertia Decomposition at Optimal Classification
KMO: Kaiser–Meyer–Olkin
masl: Meters Above Sea Level
MG: Maturity Group
MLA: Machine Learning Algorithms
NCEI: National Centers for Environmental Information
OVT: Official Variety Trials (or Testing)
PC: Principal Component
PCA: Principal Component Analysis
PIDOC: Percentage of Inertia Decomposition at Optimal Clustering
PP: Prior Probability
PTU: Photothermal Units
QDA: Quadratic Discriminant Analysis
SB: Between-Class Sum of Squares
SW: Within-Class Sum of Squares (also Sum of Weights)
S(i): Silhouette Index
Tr: Training Sample
UL-MLA: Unsupervised Learning Machine Learning Algorithms
USDA: United States Department of Agriculture
USGS: United States Geological Survey
WSS: Web Soil Survey (also Within-Cluster Sum of Squares)

References

1. Hymowitz, T. The history of the soybean. In Soybeans; Johnson, L.A., White, P.J., Galloway, R., Eds.; AOCS Press: Amsterdam, The Netherlands, 2008; pp. 1–31.
2. USDA. Uniform Soybean Tests, Northern States. U.S. Department of Agriculture. 1951. Available online: https://www.ars.usda.gov/ARSUserFiles/60661000/UniformSoybeanTests/51soybook.pdf (accessed on 15 August 2025).
3. Yang, L.; Song, W.; Xu, C.; Sapey, E.; Jiang, D.; Wu, C. Effects of high night temperature on soybean yield and compositions. Front. Plant Sci. 2023, 14, 1065604.
4. Shurtleff, W.; Aoyagi, A. History of Soybean Variety Development, Breeding and Genetic Engineering 1902–2020. Soyinfo Center. 2020. Available online: https://www.soyinfocenter.com/pdf/229/PrVd.pdf (accessed on 15 August 2025).
5. Jemo, M.; Devkota, K.P.; Epule, T.E.; Chfadi, T.; Motiq, R.; Hafidi, M.; Silatsa, F.B.T.; Jibrin, J.M. Exploring the potential of mapped soil properties, rhizobium inoculation, and phosphorus supplementation for predicting soybean yield in the savanna areas of Nigeria. Front. Plant Sci. 2023, 14.
6. USDA National Agricultural Statistics Service. Crop Production 2020 Summary. 2020. Available online: https://www.nass.usda.gov/Publications/Todays_Reports/reports/cropan20.pdf (accessed on 15 August 2025).
7. USDA National Agricultural Statistics Service. 2020 State Variety Testing Report: Soybean. United States Department of Agriculture. 2020. Available online: https://www.nass.usda.gov/Publications/ (accessed on 15 August 2025).
8. Everman, W. Weed Management. NC State Extension Publications. 2024. Available online: https://content.ces.ncsu.edu/north-carolina-soybean-production-guide/soybean-weed-management (accessed on 15 August 2025).
9. Lin, F.; Chhapekar, S.S.; Vieira, C.C.; Da Silva, M.P.; Rojas, A.; Lee, D.; Liu, N.; Pardo, E.M.; Lee, Y.-C.; Dong, Z.; et al. Breeding for disease resistance in soybean: A global perspective. Theor. Appl. Genet. 2022, 135, 3773–3872.
10. Allen, T.W.; Bradley, C.A.; Sisson, A.J.; Byamukama, E.; Chilvers, M.I.; Coker, C.M.; Collins, A.A.; Damicone, J.P.; Dorrance, A.E.; Dufault, N.S.; et al. Soybean yield loss estimates due to diseases in the United States and Ontario, Canada, from 2010 to 2014. Plant Health Prog. 2017, 18, 19–27.
11. Fowler, A.; Basso, B.; Maureira, F.; Millar, N.; Ulbrich, R.; Brinton, W.F. Spatial patterns of historical crop yields reveal soil health attributes in US Midwest fields. Sci. Rep. 2024, 14, 465.
12. Zeleke, K.; Nendel, C. Yield response and water productivity of soybean (Glycine max L.) to deficit irrigation and sowing time in south-eastern Australia. Agric. Water Manag. 2024, 296, 108815.
13. Bashir, M.; Adam, A.M.; Shehu, B.M.; Abubakar, M.S. Effects of Soil Texture and Nutrients Application on Soybean Nutrient Uptake, Growth and Yield Response. J. Agric. Food Sci. 2022, 20, 227–241.
14. Lin, T.S.; Song, Y.; Lawrence, P.; Kheshgi, H.S.; Jain, A.K. Worldwide maize and soybean yield response to environmental and management factors over the 20th and 21st centuries. J. Geophys. Res. Biogeosci. 2021, 126, e2021JG006304.
15. Dong, A.; Lai, X.; Han, T.; Nsigayehe, J.M.V.; Li, G.; Shen, Y. Crossing latitude introduction delayed flowering and facilitated dry matter accumulation of soybean as a forage crop. J. Integr. Agric. 2024, 24, 0033.
16. Das, S.; Christopher, J.; Apan, A.; Choudhury, M.R.; Chapman, S.; Menzies, N.W.; Dang, Y.P. UAV-Thermal imaging and agglomerative hierarchical clustering techniques to evaluate and rank physiological performance of wheat genotypes on sodic soil. ISPRS J. Photogramm. Remote Sens. 2021, 173, 221–237.
17. Smith, J.R.; Jones, M.A. The role of variety testing in sustainable agricultural practices: Implications for soybean producers. J. Agric. Sci. 2021, 159, 567–579.
18. Chen, H.; Pan, X.; Wang, F.; Liu, C.; Wang, X.; Li, Y.; Zhang, Q. Novel QTL and Meta-QTL Mapping for Major Quality Traits in Soybean. Front. Plant Sci. 2021, 12, 774270.
19. USDA National Agricultural Statistics Service. 2021 State Variety Testing Report: Soybean. United States Department of Agriculture. 2021. Available online: https://www.nass.usda.gov/Publications/ (accessed on 15 August 2025).
20. Wakefield, K. A Guide to Machine Learning Algorithms and Their Applications: Understanding the Types of Machine Learning Algorithms and When to Use Them. SAS UK. Available online: https://www.sas.com/en_us/insights/articles/analytics/machine-learning-algorithms-guide.html (accessed on 19 December 2024).
21. Sarker, I.H. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN Comput. Sci. 2021, 2, 160.
22. Elbasi, E.; Zaki, C.; Topcu, A.E.; Abdelbaki, W.; Zreikat, A.I.; Cina, E.; Shdefat, A.; Saker, L. Crop Prediction Model Using Machine Learning Algorithms. Appl. Sci. 2023, 13, 9288.
23. Araújo, S.O.; Peres, R.S.; Ramalho, J.C.; Lidon, F.; Barata, J. Machine Learning Applications in Agriculture: Current Trends, Challenges, and Future Perspectives. Agronomy 2023, 13, 2976.
24. Van Klompenburg, T.; Kassahun, A.; Catal, C. Crop yield prediction using machine learning: A systematic literature review. Comput. Electron. Agric. 2020, 177, 105709.
25. Gao, Z.; Ma, W.; Huang, S.; Hua, P.; Lan, C. Deep Learning for Super-Resolution in a Field Emission Scanning Electron Microscope. AI 2020, 1, 1–10.
26. McLachlan, G.J. Discriminant Analysis and Statistical Pattern Recognition; Wiley: Hoboken, NJ, USA, 2004. Available online: https://onlinelibrary.wiley.com/doi/book/10.1002/0471725293 (accessed on 15 August 2025).
27. Dong, S.; Gao, Y.; Xin, L.; Ding, W. Insights into the effects of transgenic glyphosate-resistant semiwild soybean on soil microbial diversity. Sci. Rep. 2024, 14, 32017.
28. Bianchini, A.; Moraes, P.V.D.; Longhi, S.J.; Adami, P.F.; Rossi, P.; Batista, V.V. Multivariate analysis using a discriminant method for evaluating the techniques of weed management in soybean crop. Agric. Sci. 2020, 12, 48–61.
29. Kim, S.-Y.; Kim, S.Y.; Lee, S.M.; Lee, D.Y.; Shin, B.K.; Kang, D.J.; Choi, H.-K.; Kim, Y.-S. Discrimination of Cultivated Regions of Soybeans (Glycine max) Based on Multivariate Data Analysis of Volatile Metabolite Profiles. Molecules 2020, 25, 763.
30. Li, X.; He, Z.; Liu, F.; Chen, R. Fast Identification of Soybean Seed Varieties Using Laser-Induced Breakdown Spectroscopy Combined With Convolutional Neural Network. Front. Plant Sci. 2021, 12, 21.
31. Wu, W.; Mallet, Y.; Walczak, B.; Penninckx, W.; Massart, D.L.; Heuerding, S.; Erni, F. Comparison of regularized discriminant analysis, linear discriminant analysis, and quadratic discriminant analysis applied to NIR data. Anal. Chim. Acta 1996, 329, 257–265.
32. Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent development. Phil. Trans. R. Soc. A 2016, 374, 20150202.
33. Ringnér, M. What is principal component analysis? Nat. Biotechnol. 2008, 26, 303–304.
34. Abdi, H.; Williams, L.J. Principal Component Analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459.
35. Jackson, J.O. A User's Guide to Principal Components; John Wiley & Sons: Hoboken, NJ, USA, 2003.
36. Maronna, R. Principal Components and Orthogonal Regression Based on Robust Scales. Technometrics 2005, 47, 264–273.
37. Mirahki, I.; Ardakani, M.R.; Golzardi, F.; Paknejad, F.; Mahrokh, A.; Faraji, S. Yield, Water Use Efficiency and Silage Feeding Value of Sorghum Cultivars as Affected by Planting Date and Planting Method. Gesunde Pflanz. 2023, 75, 1963–1973.
38. Hair, J.F.; Anderson, R.E.; Tatham, R.L.; Black, W.C. Multivariate Data Analysis; Pearson: Abingdon, UK, 2010.
39. Dawson, K.; Belkhir, K. An agglomerative hierarchical approach to visualization in Bayesian clustering problems. Heredity 2009, 103, 32–45.
40. Shen, B.; José, J.; Feng, Q.; Li, D.; Ye, Y.; Ahmadi, G. Semi-supervised hierarchical ensemble clustering based on an innovative distance metric and constraint information. Eng. Appl. Artif. Intell. 2023, 124, 106571.
41. Ibrar, D.; Khan, S.; Raza, M.; Nawaz, M.; Hasnain, Z.; Kashif, M.; Rais, A.; Gul, S.; Ahmad, R.; Gaafar, A.-R.Z. Application of machine learning for identification of heterotic groups in sunflower through combined approach of phenotyping, genotyping and protein profiling. Sci. Rep. 2024, 14, 7333.
42. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
43. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2009.
44. Pan, W.-J.; Wang, X.; Deng, Y.-R.; Li, J.-H.; Chen, W.; Chiang, J.Y.; Yang, J.-B.; Zheng, L. Nondestructive and intuitive determination of circadian chlorophyll rhythms in soybean leaves using multispectral imaging. Sci. Rep. 2015, 5, 11108.
45. Ratke, R.F.; de Sousa, A.; Chaves, D.V.; Zanatta, F.L.; Edvan, R.L.; Sousa, H.R.; Silva-Filho, E.C.; Osajima, J.A.; Nascimento, A.M.S.S.; Aguilera, J.G.; et al. Cashew gum hydrogel as an alternative to minimize the effect of drought stress on soybean. Sci. Rep. 2024, 14, 2159.
46. Cao, P.; Zhao, Y.; Wu, F.; Xin, D.; Liu, C.; Wu, X.; Lv, J.; Chen, Q.; Qi, Z. Multi-Omics Techniques for Soybean Molecular Breeding. Int. J. Mol. Sci. 2022, 23, 4994.
47. Salmerón, M.; Purcell, L.C. Simplifying the prediction of phenology with the DSSAT-CROPGRO-soybean model based on relative maturity group and determinacy. Agric. Syst. 2016, 148, 178–187.
  48. Carlin, J.F.; Bond, R.D.; Still, J.A. Arkansas Soybean Performance Tests 2019. Arkansas Agricultural Experiment Station Research Series. 2019. Available online: https://scholarworks.uark.edu/aaesser/158 (accessed on 15 August 2025).
  49. Carlin, J.F.; Bond, R.D.; Morgan, R.B. Arkansas Soybean Performance Tests 2020. Arkansas Agricultural Experiment Station Research Series. 2021. Available online: https://scholarworks.uark.edu/aaesser/196 (accessed on 15 August 2025).
  50. Carlin, J.F.; Morgan, R.B.; Bond, R.D. Arkansas Soybean Performance Tests 2021. Arkansas Agricultural Experiment Station Research Series. 2022. Available online: https://scholarworks.uark.edu/aaesser/206 (accessed on 15 August 2025).
  51. Carlin, J.F.; Mulloy, R.B.; Bond, R.D. Arkansas Soybean Performance Tests 2022. Arkansas Agricultural Experiment Station Research Series. 2023. Available online: https://scholarworks.uark.edu/aaesser/216 (accessed on 15 August 2025).
  52. Venard, C.M.-P.; Duckworth, J. 2019 Kentucky Soybean Performance Tests (PR-775). University of Kentucky, College of Agriculture, Food and Environment. 2019. Available online: https://publications.ca.uky.edu/sites/publications.ca.uky.edu/files/PR775.pdf (accessed on 15 August 2025).
Figure 1. Geographic distribution of the 60 soybean OVT sites across seven U.S. states, with state boundaries and site locations indicated.
Figure 2. Circle plot of correlations of variables with discriminant functions (F1, F2), along with the percentage of discrimination and eigenvalue (λ) for each function across different maturity groups and the overall experiment.
Figure 3. Yield clusters from the trial locations, categorized into year-based classes, along with the discrimination percentage and eigenvalue (λ) of each discriminant function (F1 and F2) for different maturity groups and the overall experiment.
Figure 4. Scree plot of the eigenvalues (λ), the corresponding inertia of each principal component, and the cumulative percentage of explained variability for different maturity groups and the overall experiment.
Figure 5. Circle plot of the variables' coordinates projected onto the new space based on their correlations with the principal components (PC1 and PC2), with the eigenvalue (λ) and the percentage of variance explained by each PC for different maturity groups and the overall experiment.
Figure 6. Pie chart showing the contribution percentages of variables and their levels of significance (cosine squared in bold) in constructing PC1 and PC2 across different maturity groups and the overall experiment.
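The contribution percentages and cosine-squared values of Figure 6 follow from standard PCA identities: with PCA on the correlation matrix, a variable's loading on a PC is its correlation with that component, cos² is the squared loading (quality of representation), and a variable's contribution to a PC is its squared loading divided by that PC's eigenvalue. A minimal numpy sketch on random illustrative data (not the OVT variables):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))               # toy data: 100 sites x 4 variables
Z = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize (PCA on correlation matrix)

R = np.corrcoef(Z, rowvar=False)            # correlation matrix
eigval, eigvec = np.linalg.eigh(R)          # eigendecomposition
order = np.argsort(eigval)[::-1]            # sort PCs by decreasing eigenvalue
eigval, eigvec = eigval[order], eigvec[:, order]

loadings = eigvec * np.sqrt(eigval)         # correlation of each variable with each PC
cos2 = loadings ** 2                        # quality of representation (cos squared)
contrib = 100 * cos2 / eigval               # % contribution of each variable to each PC
```

By construction, the contributions to any one PC sum to 100%, and a variable's cos² summed over all PCs equals 1, which is why the pie slices in Figure 6 partition each component.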
Figure 7. Silhouette score (SS), within-cluster (WSS) and between-cluster (BSS) inertia at each step of clustering, and the inertia decomposition percentage at the optimal classification (IDOC) for WSS and BSS across different maturity groups and the overall experiment.
Figure 8. Dendrogram of the location classifications of different maturity groups and the overall experiment. C is the number of clusters.
Figure 9. Profile plot of the clusters based on the behavior of the variables across different clusters of different maturity groups and the overall experiment.
Table 1. Results of the H0 and Ha hypothesis tests in the Discriminant Analysis, including Box's test with Chi-square (X2) and Fisher's (F) asymptotic approximations and Wilks' Lambda test (Rao's approximation) (Λ), for the soybean maturity groups and the overall experiment.
| Components | X2 (MG3) | F (MG3) | Λ (MG3) | X2 (MG4E) | F (MG4E) | Λ (MG4E) | X2 (MG4L) | F (MG4L) | Λ (MG4L) |
|---|---|---|---|---|---|---|---|---|---|
| Lambda | N/A | N/A | 0.03 | N/A | N/A | 0.052 | N/A | N/A | 0.044 |
| −2 Log(M) | 47,343 | 47,343 | N/A | 118,101 | 118,101 | N/A | 191,862 | 191,862 | N/A |
| Observations | 46,757 | 200 | 309 | 117,585 | 431 | 669 | 191,459 | 701 | 1532 |
| Critical Value | 271 | 25,656 | 1.42 | 313 | 201,127 | 1.4 | 313 | 834,670 | 1.4 |
| DF X2 | 234 | N/A | N/A | 273 | N/A | N/A | 273 | N/A | N/A |
| DF SB | N/A | 234 | 36 | N/A | 273 | 39 | N/A | 273 | 39 |
| DF SW | N/A | 5,997,786 | 4917 | N/A | 54,890,488 | 15,186 | N/A | 227,829,934 | 31,917 |
| p-Value (two-tailed) | *** | *** | *** | *** | *** | *** | *** | *** | *** |

| Components | X2 (MG5E) | F (MG5E) | Λ (MG5E) | X2 (MG5L) | F (MG5L) | Λ (MG5L) | X2 (Experiment) | F (Experiment) | Λ (Experiment) |
|---|---|---|---|---|---|---|---|---|---|
| Lambda | N/A | N/A | 0.033 | N/A | N/A | 0.014 | N/A | N/A | 0.069 |
| −2 Log(M) | 164,380 | 164,380 | N/A | 48,686 | 48,686 | N/A | 534,966 | 534,966 | N/A |
| Observations | 163,581 | 599 | 760 | 47,725 | 174 | 294 | 534,564 | 1697 | 3335 |
| Critical Value | 313 | 161,425 | 1.4 | 313 | 7567 | 1.4 | 357 | 6,557,852 | 1.38 |
| DF X2 | 272 | N/A | N/A | 273 | N/A | N/A | 315 | N/A | N/A |
| DF SB | N/A | 273 | 39 | N/A | 273 | 39 | N/A | 315 | 42 |
| DF SW | N/A | 44,053,554 | 13,702 | N/A | 2,062,405 | 3596 | N/A | 2,065,617,514 | 96,029 |
| p-Value (two-tailed) | *** | *** | *** | *** | *** | *** | *** | *** | *** |
α = 0.05; ***, p-Value < 0.001; X2, goodness of fit; F, comparison between the class means; Λ, proportion of variance explained by between-class versus within-class variance; Lambda, ratio of within-class variance to the total variance; −2 Log(M), homogeneity of covariance matrices; DF, degrees of freedom; SB, between-classes sum of squares; SW, within-classes sum of squares; two-tailed, test concerned with deviations in both directions (i.e., increase and decrease) from the null hypothesis. N/A indicates that the statistical test was not performed due to insufficient variance in the treatment subset.
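The Lambda row of Table 1 (ratio of within-class to total variance) is the multivariate Wilks' statistic, Λ = det(W)/det(T), where W and T are the within-class and total sums-of-squares-and-cross-products matrices; values near 0 indicate well-separated year classes. A sketch on synthetic two-class data (not the trial data):

```python
import numpy as np

def wilks_lambda(X, y):
    """Wilks' Lambda: |W| / |T| for data X grouped by labels y."""
    grand = X.mean(axis=0)
    T = (X - grand).T @ (X - grand)          # total SSCP matrix
    W = np.zeros_like(T)
    for g in np.unique(y):
        Xg = X[y == g]
        W += (Xg - Xg.mean(axis=0)).T @ (Xg - Xg.mean(axis=0))  # within-class SSCP
    return np.linalg.det(W) / np.linalg.det(T)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 3)),    # class 0
               rng.normal(3, 1, (50, 3))])   # class 1, shifted by 3 SD
y = np.repeat([0, 1], 50)
lam = wilks_lambda(X, y)
# well-separated classes give Lambda close to 0; identical classes give ~1
```

The small Λ values in Table 1 (0.014 to 0.069) are consistent with strongly separated year classes.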
Table 2. The sum of weights (SW), prior probabilities (PP), and logarithms of the determinants (Log.D) of the Discriminant Analysis for each class of soybean maturity group and the overall experiment.
| Class | SW (Freq.) (MG3) | PP (MG3) | Log.D (MG3) | SW (Freq.) (MG4E) | PP (MG4E) | Log.D (MG4E) | SW (Freq.) (MG4L) | PP (MG4L) | Log.D (MG4L) |
|---|---|---|---|---|---|---|---|---|---|
| 2019 | 421 | 0.251 | 15.1 | 1393 | 0.271 | 49.1 | 2823 | 0.262 | 80.19 |
| 2020 | 417 | 0.248 | 36.88 | 1313 | 0.255 | 47.22 | 2937 | 0.272 | 47.82 |
| 2021 | 427 | 0.254 | 15.28 | 1262 | 0.245 | 49.87 | 2811 | 0.26 | 50.63 |
| 2022 | 414 | 0.247 | 38.07 | 1176 | 0.229 | 27.5 | 2223 | 0.206 | 50.2 |
| Total | 1679 | 1 | | 5144 | 1 | | 10,794 | 1 | |

| Class | SW (Freq.) (MG5E) | PP (MG5E) | Log.D (MG5E) | SW (Freq.) (MG5L) | PP (MG5L) | Log.D (MG5L) | SW (Freq.) (Experiment) | PP (Experiment) | Log.D (Experiment) |
|---|---|---|---|---|---|---|---|---|---|
| 2019 | 1226 | 0.264 | 25.44 | 394 | 0.32 | 18.1 | 8168 | 0.252 | 56.84 |
| 2020 | 1274 | 0.272 | 25.08 | 244 | 0.198 | 14.5 | 8710 | 0.269 | 54.31 |
| 2021 | 1064 | 0.229 | 23.86 | 381 | 0.31 | 15.2 | 8740 | 0.27 | 58.23 |
| 2022 | 1079 | 0.232 | 47.49 | 211 | 0.172 | 37.6 | 6770 | 0.209 | 58.65 |
| Total | 4643 | 1 | | 1230 | 1 | | 32,388 | 1 | |
Freq., Frequency in the Model.
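The prior probabilities in Table 2 are proportional class frequencies, PP = SW/total, so each PP column sums to 1. Using the MG3 frequencies from the table:

```python
# MG3 sum-of-weights (class frequencies) from Table 2
freq = {2019: 421, 2020: 417, 2021: 427, 2022: 414}
total = sum(freq.values())                       # 1679 observations
pp = {year: n / total for year, n in freq.items()}   # proportional priors
```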
Table 3. The eigenvalue (λ), discrimination percentage (D), cumulative discriminant percentage (ΣD), value of Bartlett’s statistic (X), and the significance level of the X2 for the two main functions of the Discriminant Analysis of different maturity groups and the overall experiment.
| Components | F1 (MG3) | F2 (MG3) | F1 (MG4L) | F2 (MG4L) |
|---|---|---|---|---|
| λ | 6.96 | 2.27 | 8.25 | 1.07 |
| D | 73.20 | 23.84 | 86.77 | 11.24 |
| ΣD | 73.20 | 97.04 | 86.77 | 98.01 |
| X2 | 5855.00 | 2391.00 | 33,692.00 | 9704.00 |
| p-Value | *** | *** | *** | *** |

| Components | F1 (MG5E) | F2 (MG5E) | F1 (Experiment) | F2 (Experiment) |
|---|---|---|---|---|
| λ | 9.37 | 1.56 | 6.46 | 0.74 |
| D | 84.65 | 14.06 | 88.37 | 10.06 |
| ΣD | 84.65 | 98.71 | 88.37 | 98.43 |
| X2 | 15,804.00 | 4967.00 | 86,398.00 | 21,351.00 |
| p-Value | *** | *** | *** | *** |
α = 0.05; ***, p-Value < 0.001.
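Each discrimination percentage in Table 3 is an eigenvalue's share of the total, D_i = 100 λ_i / Σ λ_j, with ΣD the running total. A quick sketch (the first two eigenvalues are MG3's from Table 3; the trailing ones are hypothetical stand-ins for the remaining, minor functions):

```python
import numpy as np

# first two values from Table 3 (MG3); trailing eigenvalues are hypothetical
eigenvalues = np.array([6.96, 2.27, 0.21, 0.07])

D = 100 * eigenvalues / eigenvalues.sum()   # discrimination % per function
cumD = np.cumsum(D)                         # cumulative ΣD
```

Because D is a share of the total, the cumulative percentage always reaches 100 at the last function, and F1 dominates whenever λ1 is much larger than the rest.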
Table 4. Mahalanobis distances (δ) between different classes of different maturity groups and the overall experiment.
MG3MG5E
2019 δ2020 δ2021 δ2022 δ 2019 δ2020 δ2021 δ2022 δ
2019 δ 435654202019 δ 3679762
2020 δ137 1223402020 δ142 66202
2021 δ19234 1912021 δ45178 19
2022 δ5366055 2022 δ363028
MG4EMG5L
2019 δ2020 δ2021 δ2022 δ 2019 δ2020 δ2021 δ2022 δ
2019 δ 17850212019 δ 873387773
2020 δ85 341082020 δ322 3201413
2021 δ2098 162021 δ80406 9
2022 δ1124220 2022 δ985235
MG4LExperiment
2019 δ2020 δ2021 δ2022 δ 2019 δ2020 δ2021 δ2022 δ
2019 δ 22272312019 δ 1364414
2020 δ96 431462020 δ101 2968
2021 δ27131 152021 δ3064 7
2022 δ1326419 2022 δ141359
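The distances in Table 4 are Mahalanobis distances between year-class centroids: with pooled within-class covariance S, δ²(m1, m2) = (m1 − m2)ᵀ S⁻¹ (m1 − m2), which down-weights directions of high common variability. A sketch with hypothetical centroids and covariance (the variables and numbers are illustrative only, not the trial estimates):

```python
import numpy as np

def mahalanobis_sq(m1, m2, S):
    """Squared Mahalanobis distance between class centroids m1, m2
    given pooled within-class covariance S."""
    d = m1 - m2
    return float(d @ np.linalg.solve(S, d))

m_2019 = np.array([60.0, 1500.0])   # hypothetical centroid (yield, GDD)
m_2020 = np.array([55.0, 1450.0])
S = np.array([[25.0, 10.0],
              [10.0, 400.0]])       # hypothetical pooled covariance

d2 = mahalanobis_sq(m_2019, m_2020, S)
```

The distance of a centroid to itself is zero, and larger δ values between years indicate stronger inter-annual separation of the environments.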
Table 5. Standardized canonical discriminant function coefficients (|X|) shown; standardized to mean = 0, SD = 1 within each MG. Interpretation: Higher absolute values indicate stronger contribution to F1/F2 discrimination; sign shows direction relative to class centroids. Compare within columns (F1 vs. F2 per MG); magnitudes affected by collinearity/scaling.
| Components | F1 (MG3) | F2 (MG3) | F1 (MG4E) | F2 (MG4E) | F1 (MG4L) | F2 (MG4L) | F1 (MG5E) | F2 (MG5E) | F1 (MG5L) | F2 (MG5L) | F1 (Experiment) | F2 (Experiment) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Variety | −0.04 | −0.01 | 0.01 | −0.11 | 0.04 | −0.11 | 0.1 | −0.09 | −0.01 | −0.02 | 0.03 | −0.08 |
| Yield | 0.16 | 0.11 | 0.03 | −0.04 | 0.06 | −0.05 | −0.12 | −0.04 | 0 | −0.02 | 0.02 | −0.09 |
| Location | −0.12 | −0.18 | 0.06 | 0.01 | 0.11 | −0.02 | 0.06 | 0.22 | 0.47 | 0.94 | 0.19 | 0.00 |
| GDD | −5.63 | 2.12 | −8.57 | 1.31 | −8.34 | 1.09 | −8.66 | 1.69 | −9.75 | 3.2 | −8.15 | 1.19 |
| Avg. Temp. | 0.00 | 0.00 | 12.43 | −0.98 | 11.98 | 1.8 | 12.05 | 3.12 | 14.44 | 2.21 | 12.03 | −0.3 |
| Min. Temp. | 5.64 | 2.56 | 0.06 | 2.75 | −0.29 | 0.93 | 0.00 | 0.00 | 0.00 | 0.00 | −0.41 | 1.04 |
| Max. Temp. | 3.5 | −3.81 | 0.00 | 0.00 | 0.00 | 0.00 | 0.74 | −0.36 | 0.44 | −0.67 | 0.00 | 0.00 |
| Precipitation | 0.2 | 1.24 | −0.07 | 1.05 | −0.1 | 0.96 | 0.14 | 0.91 | 0.49 | 1.46 | −0.07 | 1.09 |
| Latitude | 3.02 | 1.69 | 4.75 | 2.97 | 5.14 | 3.67 | 6.55 | 4.41 | 8.69 | 4.7 | 4.32 | 2.11 |
| Longitude | 1.21 | −0.22 | 1.97 | 0.26 | 2.68 | 0.47 | 3.8 | 0.75 | 4.03 | 0.14 | 2.13 | 0.01 |
| Altitude | 0.06 | 0.68 | −0.42 | 1.08 | −0.36 | 1.25 | −0.06 | 1.51 | 0.09 | 2.18 | −0.38 | 0.6 |
| Soil Type | 0.14 | 0.34 | −0.04 | −0.08 | 0.06 | −0.13 | 0.26 | −0.15 | 0.25 | −0.72 | −0.07 | 0.16 |
| Sand | 0.00 | 0.00 | −0.55 | 0.7 | −0.51 | 0.65 | 0.09 | 0.93 | −0.08 | −0.25 | −0.18 | 0.34 |
| Silt | 0.2 | −0.68 | 0.00 | 0.00 | 0.00 | 0.00 | 0.11 | 0.19 | 0.00 | 0.00 | 0.11 | 0.09 |
| Clay | −0.29 | −0.04 | −0.24 | 0.05 | −0.17 | −0.03 | 0.00 | 0.00 | 0.14 | −0.54 | 0.00 | 0.00 |
| MG | | | | | | | | | | | 0.02 | −0.08 |
MG, maturity group.
Table 6. Classification results from the confusion matrix for training (Tr) and cross-validation (CV/LOOCV accuracy) of different maturity groups and the overall experiment.
MG3

| Class | Set | 2019 | 2020 | 2021 | 2022 | Σ | φ (%) |
|---|---|---|---|---|---|---|---|
| 2019 | Tr | 415 | 0 | 6 | 0 | 421 | 98.57 |
| 2019 | CV | 355 | 0 | 66 | 0 | | 84.32 |
| 2020 | Tr | 0 | 396 | 21 | 0 | 417 | 94.96 |
| 2020 | CV | 0 | 404 | 13 | 0 | | 96.88 |
| 2021 | Tr | 0 | 0 | 427 | 0 | 427 | 100 |
| 2021 | CV | 6 | 0 | 421 | 0 | | 98.59 |
| 2022 | Tr | 3 | 0 | 41 | 370 | 414 | 89.37 |
| 2022 | CV | 0 | 0 | 38 | 376 | | 90.82 |
| Sum | Tr | 418 | 396 | 495 | 370 | 1679 | 95.77 |
| Sum | CV | 361 | 404 | 538 | 376 | | 92.67 |

MG4E

| Class | Set | 2019 | 2020 | 2021 | 2022 | Σ | φ (%) |
|---|---|---|---|---|---|---|---|
| 2019 | Tr | 1149 | 0 | 1 | 243 | 1393 | 82.48 |
| 2019 | CV | 1127 | 0 | 0 | 266 | | 80.9 |
| 2020 | Tr | 0 | 1313 | 0 | 0 | 1313 | 100 |
| 2020 | CV | 0 | 1313 | 0 | 0 | | 100 |
| 2021 | Tr | 0 | 0 | 649 | 613 | 1262 | 51.43 |
| 2021 | CV | 0 | 0 | 548 | 714 | | 43.42 |
| 2022 | Tr | 0 | 0 | 0 | 1176 | 1176 | 100 |
| 2022 | CV | 90 | 0 | 2 | 1084 | | 92.18 |
| Sum | Tr | 1149 | 1313 | 650 | 2032 | 5144 | 83.34 |
| Sum | CV | 1217 | 1313 | 550 | 2064 | | 79.16 |

MG4L

| Class | Set | 2019 | 2020 | 2021 | 2022 | Σ | φ (%) |
|---|---|---|---|---|---|---|---|
| 2019 | Tr | 2650 | 0 | 5 | 168 | 2823 | 93.87 |
| 2019 | CV | 2655 | 0 | 0 | 168 | | 94.05 |
| 2020 | Tr | 0 | 2937 | 0 | 0 | 2937 | 100 |
| 2020 | CV | 0 | 2937 | 0 | 0 | | 100 |
| 2021 | Tr | 50 | 0 | 2758 | 3 | 2811 | 98.11 |
| 2021 | CV | 0 | 0 | 2811 | 0 | | 100 |
| 2022 | Tr | 39 | 0 | 48 | 2136 | 2223 | 96.09 |
| 2022 | CV | 240 | 0 | 26 | 1957 | | 88.03 |
| Sum | Tr | 2739 | 2937 | 2811 | 2307 | 10,794 | 97.1 |
| Sum | CV | 2895 | 2937 | 2837 | 2125 | | 95.98 |

MG5E

| Class | Set | 2019 | 2020 | 2021 | 2022 | Σ | φ (%) |
|---|---|---|---|---|---|---|---|
| 2019 | Tr | 1226 | 0 | 0 | 0 | 1226 | 100 |
| 2019 | CV | 1160 | 0 | 0 | 66 | | 94.62 |
| 2020 | Tr | 0 | 1272 | 2 | 0 | 1274 | 99.84 |
| 2020 | CV | 0 | 1243 | 31 | 0 | | 97.57 |
| 2021 | Tr | 7 | 0 | 1057 | 0 | 1064 | 99.34 |
| 2021 | CV | 13 | 0 | 988 | 63 | | 92.86 |
| 2022 | Tr | 268 | 0 | 513 | 298 | 1079 | 27.62 |
| 2022 | CV | 280 | 0 | 512 | 287 | | 26.6 |
| Sum | Tr | 1501 | 1272 | 1572 | 298 | 4643 | 82.99 |
| Sum | CV | 1453 | 1243 | 1531 | 416 | | 79.22 |

MG5L

| Class | Set | 2019 | 2020 | 2021 | 2022 | Σ | φ (%) |
|---|---|---|---|---|---|---|---|
| 2019 | Tr | 394 | 0 | 0 | 0 | 394 | 100 |
| 2019 | CV | 394 | 0 | 0 | 0 | | 100 |
| 2020 | Tr | 0 | 244 | 0 | 0 | 244 | 100 |
| 2020 | CV | 0 | 244 | 0 | 0 | | 100 |
| 2021 | Tr | 0 | 0 | 381 | 0 | 381 | 100 |
| 2021 | CV | 0 | 0 | 368 | 13 | | 96.59 |
| 2022 | Tr | 15 | 0 | 89 | 107 | 211 | 50.71 |
| 2022 | CV | 5 | 0 | 94 | 112 | | 53.08 |
| Sum | Tr | 409 | 244 | 470 | 107 | 1230 | 91.54 |
| Sum | CV | 399 | 244 | 462 | 125 | | 90.89 |

Experiment

| Class | Set | 2019 | 2020 | 2021 | 2022 | Σ | φ (%) |
|---|---|---|---|---|---|---|---|
| 2019 | Tr | 7921 | 0 | 0 | 247 | 8168 | 96.98 |
| 2019 | CV | 8085 | 0 | 0 | 83 | | 98.98 |
| 2020 | Tr | 0 | 8673 | 37 | 0 | 8710 | 99.58 |
| 2020 | CV | 0 | 8710 | 0 | 0 | | 100 |
| 2021 | Tr | 59 | 9 | 8259 | 413 | 8740 | 94.5 |
| 2021 | CV | 23 | 0 | 7890 | 827 | | 90.27 |
| 2022 | Tr | 466 | 0 | 585 | 5719 | 6770 | 84.48 |
| 2022 | CV | 527 | 0 | 766 | 5477 | | 80.9 |
| Sum | Tr | 8446 | 8682 | 8881 | 6379 | 32,388 | 94.39 |
| Sum | CV | 8635 | 8710 | 8656 | 6387 | | 93.13 |
Σ, total number of observations in each class; φ, percentage of well-classified observations.
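Each φ in Table 6 is the share of a row's observations that fall on the confusion-matrix diagonal; the overall φ uses the full diagonal. A sketch of that bookkeeping with toy labels (not the OVT classifications):

```python
import numpy as np

def confusion_matrix(actual, predicted, k):
    """k x k confusion matrix; rows = actual class, columns = predicted class."""
    cm = np.zeros((k, k), dtype=int)
    for a, p in zip(actual, predicted):
        cm[a, p] += 1
    return cm

# toy labels for four classes (e.g., the four trial years)
actual    = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 3])
predicted = np.array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3])

cm = confusion_matrix(actual, predicted, k=4)
row_totals = cm.sum(axis=1)                       # the Σ column of Table 6
phi_per_class = 100 * np.diag(cm) / row_totals    # per-class φ (%)
phi_overall = 100 * np.trace(cm) / cm.sum()       # overall φ (%)
```

Cross-validated φ follows the same arithmetic, with `predicted` replaced by hold-out (e.g., leave-one-out) predictions.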
Table 7. The KMO (Kaiser–Meyer–Olkin) measure of sampling adequacy of the variables of different maturity groups and the overall experiment.
| Components | MG3 | MG4E | MG4L | MG5E | MG5L | Experiment |
|---|---|---|---|---|---|---|
| Yield | 0.50 | 0.23 | 0.31 | 0.29 | 0.17 | 0.60 |
| GDD | 0.85 | 0.82 | 0.84 | 0.84 | 0.65 | 0.74 |
| Avg. Temp. | 0.69 | 0.67 | 0.68 | 0.62 | 0.69 | 0.69 |
| Min. Temp. | 0.69 | 0.67 | 0.68 | 0.61 | 0.69 | 0.69 |
| Max. Temp. | 0.69 | 0.66 | 0.67 | 0.60 | 0.66 | 0.69 |
| Precipitation | 0.74 | 0.73 | 0.72 | 0.79 | 0.60 | 0.79 |
| Latitude | 0.77 | 0.83 | 0.80 | 0.70 | 0.85 | 0.83 |
| Longitude | 0.42 * | 0.58 | 0.60 | 0.66 | 0.62 | 0.48 |
| Altitude | 0.84 | 0.80 | 0.80 | 0.53 | 0.30 | 0.89 |
| Soil Type | 0.29 | 0.36 | 0.35 | 0.52 | 0.24 | 0.29 |
| Sand | 0.70 | 0.63 | 0.60 | 0.56 | 0.48 | 0.69 |
| Silt | 0.62 | 0.66 | 0.61 | 0.59 | 0.58 | 0.70 |
| Clay | 0.44 | 0.63 | 0.67 | 0.71 | 0.43 | 0.68 |
| KMO | 0.69 | 0.69 | 0.69 | 0.64 | 0.59 | 0.71 |
*, in the original color-coded table, values above 0.5 were shown in green and values below 0.5 in red.
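The KMO statistic in Table 7 compares squared simple correlations with squared partial correlations, the latter obtained from the inverse of the correlation matrix; per-variable values (MSA) use only that variable's row. A sketch on randomly generated, deliberately correlated data (not the trial variables):

```python
import numpy as np

def kmo(X):
    """Kaiser-Meyer-Olkin sampling adequacy: per-variable MSA and overall KMO."""
    R = np.corrcoef(X, rowvar=False)
    Rinv = np.linalg.inv(R)
    d = np.sqrt(np.diag(Rinv))
    P = -Rinv / np.outer(d, d)                 # partial correlation matrix
    np.fill_diagonal(P, 0.0)
    R0 = R.copy()
    np.fill_diagonal(R0, 0.0)
    r2, p2 = R0 ** 2, P ** 2
    msa = r2.sum(axis=0) / (r2.sum(axis=0) + p2.sum(axis=0))   # per variable
    overall = r2.sum() / (r2.sum() + p2.sum())
    return msa, overall

rng = np.random.default_rng(2)
base = rng.normal(size=(200, 1))
X = base + 0.5 * rng.normal(size=(200, 5))   # five variables sharing one factor
msa, overall = kmo(X)
```

Values approach 1 when variables share substantial common variance; the 0.59 to 0.71 range reported in Table 7 is conventionally read as mediocre-to-middling adequacy for PCA.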
Table 8. Statistical indices for clustering the locations of the different maturity groups based on yield, climate, geography, and soil characteristics.
| Components | K | S(i) | H | Δ | CH |
|---|---|---|---|---|---|
| MG3 | 2 | 0.41 | 10.19 | 2.91 | 13.10 |
| | 3 | 0.39 | 10.06 | 0.12 | 14.26 |
| | 4 | 0.37 | 4.26 | 5.80 | 16.78 |
| | 5 | 0.38 | 3.72 | 0.55 | 15.60 |
| | 6 | 0.31 | 4.11 | −0.39 | 14.92 |
| | 7 | 0.33 | 3.61 | 0.50 | 15.16 |
| | 8 | 0.32 | 3.20 | 0.40 | 15.39 |
| MG4E | 2 | 0.27 | 15.12 | 1.21 | 16.32 |
| | 3 | 0.34 | 10.38 | 4.74 | 19.11 |
| | 4 | 0.31 | 6.37 | 4.01 | 19.82 |
| | 5 | 0.31 | 5.08 | 1.29 | 18.95 |
| | 6 | 0.31 | 5.35 | −0.28 | 18.17 |
| | 7 | 0.34 | 3.25 | 2.11 | 18.23 |
| | 8 | 0.29 | 3.44 | −0.19 | 17.30 |
| MG4L | 2 | 0.26 | 15.09 | −1.36 | 13.73 |
| | 3 | 0.33 | 7.02 | 8.07 | 17.52 |
| | 4 | 0.30 | 6.59 | 0.43 | 16.36 |
| | 5 | 0.29 | 4.81 | 1.78 | 16.29 |
| | 6 | 0.29 | 4.45 | 0.36 | 15.76 |
| | 7 | 0.32 | 3.44 | 1.01 | 15.56 |
| | 8 | 0.28 | 2.67 | 0.78 | 15.08 |
| MG5E | 2 | 0.25 | 15.68 | −4.32 | 11.35 |
| | 3 | 0.33 | 5.50 | 10.18 | 16.60 |
| | 4 | 0.32 | 5.38 | 0.12 | 14.82 |
| | 5 | 0.29 | 3.74 | 1.64 | 14.41 |
| | 6 | 0.31 | 3.54 | 0.20 | 13.59 |
| | 7 | 0.25 | 2.74 | 0.79 | 13.16 |
| | 8 | 0.27 | 2.83 | −0.08 | 12.57 |
| MG5L | 2 | 0.34 | 4.54 | 7.53 | 12.07 |
| | 3 | 0.27 | 4.79 | −0.25 | 9.63 |
| | 4 | 0.29 | 3.65 | 1.14 | 9.64 |
| | 5 | 0.33 | 3.14 | 0.51 | 9.51 |
| | 6 | 0.32 | 2.15 | 1.00 | 9.49 |
| | 7 | 0.30 | 2.31 | −0.16 | 9.02 |
| | 8 | 0.27 | 2.43 | −0.13 | 8.98 |
| Experiment | 2 | 0.26 | 21.99 | 1.38 | 23.37 |
| | 3 | 0.31 | 16.27 | 5.72 | 26.91 |
| | 4 | 0.29 | 6.86 | 9.41 | 28.17 |
| | 5 | 0.28 | 6.34 | 0.52 | 25.05 |
| | 6 | 0.27 | 6.49 | −0.15 | 23.26 |
| | 7 | 0.27 | 6.50 | −0.01 | 22.44 |
| | 8 | 0.29 | 6.96 | −0.46 | 22.15 |
K, number of clusters; S(i), silhouette index; H, Hartigan index; Δ, difference H(k − 1) − H(k); CH, Calinski–Harabasz index. Bold values in the original table mark the optimal number of clusters.
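The Calinski–Harabasz index in Table 8 is CH = [BSS/(k − 1)] / [WSS/(n − k)], the ratio of between-cluster to within-cluster dispersion, so a good partition inflates BSS while shrinking WSS. A numpy sketch of CH on toy 2-D points (hypothetical data, not the trial sites):

```python
import numpy as np

def calinski_harabasz(X, labels):
    """CH index: (BSS / (k - 1)) / (WSS / (n - k))."""
    n = len(X)
    classes = np.unique(labels)
    k = len(classes)
    grand = X.mean(axis=0)
    bss = sum((labels == c).sum() * np.sum((X[labels == c].mean(axis=0) - grand) ** 2)
              for c in classes)                        # between-cluster dispersion
    wss = sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
              for c in classes)                        # within-cluster dispersion
    return (bss / (k - 1)) / (wss / (n - k))

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),   # tight cluster near (0, 0)
               rng.normal(5, 0.3, (20, 2))])  # tight cluster near (5, 5)
good = calinski_harabasz(X, np.repeat([0, 1], 20))   # correct partition
bad = calinski_harabasz(X, np.tile([0, 1], 20))      # labels that mix the clusters
```

The index peaks at the partition that best matches the true structure, which is how the bold optima in Table 8 were selected.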

Citation: Mirahki, I.; Bond, R.; Heiniger, R.; Moseley, D.; Sykes, V.R. Delineating Soybean Mega-Environments Across State Lines: A Statistical Learning Approach to Multi-State Official Variety Trial Analysis. Agronomy 2026, 16, 376. https://doi.org/10.3390/agronomy16030376
