Spectral-Based Classification of Genetically Differentiated Groups in Spring Wheat Grown under Contrasting Environments

The global concern about the gap between food production and consumption has intensified the research on the genetics, ecophysiology, and breeding of cereal crops. In this sense, several genetic studies have been conducted to assess the effectiveness and sustainability of collections of germplasm accessions of major crops. In this study, a spectral-based classification approach for the assignment of wheat cultivars to genetically differentiated subpopulations (genetic structure) was carried out using a panel of 316 spring bread cultivars grown in two environments with different water regimes (rainfed and fully irrigated). For that, different machine-learning models were trained with foliar spectral and genetic information to assign the wheat cultivars to subpopulations. The results revealed that, in general, the hyperparameters ReLU (as the activation function), adam (as the optimizer), and a size batch of 10 give neural network models better accuracy. Genetically differentiated groups showed smaller differences in mean wavelengths under rainfed than under full irrigation, which coincided with a reduction in clustering accuracy in neural network models. The comparison of models indicated that the Convolutional Neural Network (CNN) was significantly more accurate in classifying individuals into their respective subpopulations, with 92 and 93% of correct individual assignments in water-limited and fully irrigated environments, respectively, whereas 92% (full irrigation) and 78% (rainfed) of cultivars were correctly assigned to their respective classes by the multilayer perceptron method and partial least squares discriminant analysis, respectively. Notably, CNN did not show significant differences between both environments, which indicates stability in the prediction independent of the different water regimes. It is concluded that foliar spectral variation can be used to accurately infer the belonging of a cultivar to its respective genetically differentiated group, even considering radically different environments, which is highly desirable in the context of crop genetic resources management.


Introduction
The frequency and severity of drought events have been increasing in recent years in various regions around the world [1], a trend that is expected to continue as a consequence of climate change [2]. These drought events are responsible for much of the loss of crop productivity, varying between 10% and 35%, depending on its intensity and duration [3]. Several mechanisms that could provide tolerance to drought stress in agricultural crops have been proposed. For instance, Hassanisaadi et al. [4] notes that nanoparticles have protective effects against drought stress, while Van Oosten et al. [5] indicate that biostimulators and bioeffectors improve plant resilience in water-limiting environments. However,

Results
This section shows the wheat spectral-based classification accuracy obtained from three machine-learning-based methods (PLS-DA, MLP, and CNN). The workflow considered three steps: (I) the estimation of the genetic population structure (Section 2.1), (II) the identification of better hyperparameter combinations in the neural network models (MLP and CNN) (Section 2.2), and (III) the comparison of neural network models with PLS-DA (Section 2.3).

Genetic Structure Analysis and Spectral Data
The genetic population structure analysis indicated that the most probable number of subpopulations in the sample was two (K = 2), according to the methodology of Evanno et al. [24] (Figure 1A,B). These results were similar to what was visualized with the first two primary components of the PCA method ( Figure 1C), which, in the axes PC1 and PC2, explained 23.29% and 8.4% of the variation in the genotypic data, respectively. Moreover, Bayesian credibility intervals (at 95%) of the FST values confirm the existence of significant differences between both clusters. The estimates of the FST for clusters 1 and 2 were 0.1820 (0.1650-0.1980) and 0.3338 (0.3110-0.3530), respectively. Cluster 1 contained 59% of cultivars (186/316), of which 70%, 23% and 7% corresponded to the CIMMYT, INIA-Chile, and INIA-Uruguay breeding programs, respectively. On the other hand, cluster 2 consisted of 41% (130/316), including mainly INIA-Uruguay cultivars (96%).
water availability during the growing season. For this, different machine-learning models were trained with foliar spectral information to assign wheat cultivars to subpopulations previously defined using SNPs markers.

Results
This section shows the wheat spectral-based classification accuracy obtained from three machine-learning-based methods (PLS-DA, MLP, and CNN). The workflow considered three steps: I) the estimation of the genetic population structure (Section 2.1), II) the identification of better hyperparameter combinations in the neural network models (MLP and CNN) (Section 2.2), and III) the comparison of neural network models with PLS-DA (Section 2.3).

Genetic Structure Analysis and Spectral Data
The genetic population structure analysis indicated that the most probable number of subpopulations in the sample was two (K = 2), according to the methodology of Evanno et al. [24] (Figures 1A and 1B). These results were similar to what was visualized with the first two primary components of the PCA method ( Figure 1C), which, in the axes PC1 and PC2, explained 23.29% and 8.4% of the variation in the genotypic data, respectively. Moreover, Bayesian credibility intervals (at 95%) of the FST values confirm the existence of significant differences between both clusters. The estimates of the FST for clusters 1 and 2 were 0.1820 (0.1650-0.1980) and 0.3338 (0.3110-0.3530), respectively. Cluster 1 contained 59% of cultivars (186/316), of which 70%, 23% and 7% corresponded to the CIMMYT, INIA-Chile, and INIA-Uruguay breeding programs, respectively. On the other hand, cluster 2 consisted of 41% (130/316), including mainly INIA-Uruguay cultivars (96%).
In Figure 2, a comparison between the average of spectral curves produced by each genetically differentiated group in each environment is shown. The principal wavelength that distinguishes both groups is between 741 and 1357 nm, in which Santa Rosa presents greater differences in the wavelengths than Cauquenes (0.020-0.017 nm, respectively).  In Figure 2, a comparison between the average of spectral curves produced by each genetically differentiated group in each environment is shown. The principal wavelength that distinguishes both groups is between 741 and 1357 nm, in which Santa Rosa presents greater differences in the wavelengths than Cauquenes (0.020-0.017 nm, respectively).
Plants 2022, 11, x FOR PEER REVIEW spatially distributed in relation to the first two main components. The cultivars according to cluster matching the structure population (blue for cluster 1 and orange f

Hyperparameter Evaluation
For both neural network models (CNN and MLP), the hyperparameter co depicted in Table 1 were evaluated. The evaluation was carried out by runnin for the best classification accuracy over the defined hyperparameters; therefo combination of hyperparameters, both neural networks were trained. On a ReLU activation function was significantly more accurate (CA = 0.80-0.84) th tivation functions in assigning the individuals to their respective clusters, both neural networks in both environments-Santa Rosa and Cauquenes (Tab On the other hand, the Adam optimizer had a higher CA than rmsprop in bo ments; nonetheless, in the full irrigation site (Santa Rosa), the adam was signifi efficient in correctly assigning individuals to clusters. In general, both the ba the number of layers did not have a significant effect on the ability of CNN assign individuals to the respective clusters of each environment evaluated (T 2).
Both artificial neural networks showed the highest CA with the followin tion of hyperparameters: the ReLU activation function, the adam optimizer, of 10, and only one conv1D layer. However, the MLP network in the Cauquen ment showed an increase in the assignment of individuals when the batch s from 10 (CA = 0.78) to 20 (CA = 0.88). For both the MLP and CNN neural ne classification accuracy was lower for the site with a low irrigation supply (C Nevertheless, the CNN had a higher CA value than MLP in the prediction of structure from both environments.

Hyperparameter Evaluation
For both neural network models (CNN and MLP), the hyperparameter combinations depicted in Table 1 were evaluated. The evaluation was carried out by running the search for the best classification accuracy over the defined hyperparameters; therefore, for each combination of hyperparameters, both neural networks were trained. On average, the ReLU activation function was significantly more accurate (CA = 0.80-0.84) than other activation functions in assigning the individuals to their respective clusters, considering both neural networks in both environments-Santa Rosa and Cauquenes (Tables 1 and 2). On the other hand, the Adam optimizer had a higher CA than rmsprop in both environments; nonetheless, in the full irrigation site (Santa Rosa), the adam was significantly more efficient in correctly assigning individuals to clusters. In general, both the batch size and the number of layers did not have a significant effect on the ability of CNN and MLP to assign individuals to the respective clusters of each environment evaluated (Tables 1 and 2). Both artificial neural networks showed the highest CA with the following combination of hyperparameters: the ReLU activation function, the adam optimizer, a batch size of 10, and only one conv1D layer. However, the MLP network in the Cauquenes environment showed an increase in the assignment of individuals when the batch size changed from 10 (CA = 0.78) to 20 (CA = 0.88). For both the MLP and CNN neural networks, the classification accuracy was lower for the site with a low irrigation supply (Cauquenes). Nevertheless, the CNN had a higher CA value than MLP in the prediction of the genetic structure from both environments.

Best Performing Models
The best combination of hyperparameters seen in the previous section was used to train the CNN and MLP final model and obtain a comparison of their performance with PLS-DA. The performance of the best MLP and CNN models with their hyperparameters achieved a CA of 0.88 (MLP in Cauquenes) and 0.93 (CNN in Santa Rosa), while PLS-DA showed the lowest CA values in both environments. In general, MLP was as efficient as CNN in assigning individuals to cluster 1; however, CNN exhibited a significantly higher effectiveness than MLP in the allocation of individuals to cluster 2 for both environments. Moreover, CNN exhibited the highest CA for both environments, which was significantly high compared to that of other models (Table 3). Interestingly, PLS-DA showed significantly higher CA values in Cauquenes (rainfed) than in Santa Rosa (full irrigation), while both neural network models exhibited higher CA values in Santa Rosa; however, CNN was not statistically different between both environments. This indicates that CNN and PLS-DA models are suitable for the assignment of individuals to subpopulation membership regardless of the environment in which the analysis takes place (rainfed or full irrigation).

Discussion
A comprehensive understanding of the genetic structure, and of how diversity is distributed, is needed to efficiently exploit and continuously advance in the wheat breeding programs to meet global future demands. Recent advances in genotyping and sequencing technologies have progressively expedited the introduction of genomic-based techniques in modern plant breeding. DArT, SSR, and SNP molecular marker systems have been validated for population genetic studies in wheat [25][26][27][28]. Particularly, Mora et al. [25] and, more recently, Ballesta et al. [28] reported that the population genetic structure of the Santa Rosa and Cauquenes sites is separated into two significantly differentiated clusters with relatively high differentiation values (Fst) according to the Bayesian clustering method (implemented in the software STRUCTURE) and a principal component analysis (PCA), which is in agreement with previous studies that have examined the genetic structure of different collections of bread wheat germplasm in the world [29,30]. Even though the cost of high-throughput genotyping has been decreasing over the years, it is still considered a limitation in the use of this technology in large-scale plant breeding. Genotyping thousands (or millions) of individuals is challenging and remains inaccessible for most species [31]. In fact, the cost efficiency of these tools has been improved considering the low coverage of the plant genomes and facilitated by genomic data imputation techniques [32]. The results of the present study revealed that the foliar spectral variability could be highly correlated with the genetic diversity of these wheat populations, which is demonstrated by the high classification accuracy (CA = 0.91) of the trained models from the spectral data (mainly, the neural network-based models). In general, the classification accuracy of the PLS-DA method was relatively high (CA > 0.7). However, it was less efficient than any of the neural networks tested in the present study (CNN and MLP). PLS-DA is a linear classifier, which has proven to be unsuitable for dealing with foliar spectral data that have several features to consider for classification procedures [33]. In fact, PLS-DA tends to make a decision criterion based on easily classifiable data, which leads to an inadequate classification for outliers [34]. The genetic structure of different plant populations has been inferred by non-supervised artificial neural network methods using SNP molecular markers [35,36]. Unlike previous studies, our approach used supervised classification neural network-based models, which consider foliar spectral reflectance information as "attributes" and the genetic structure (previously defined with SNPs) as the "class labels". In general, CNN was significantly superior to MLP in terms of classification accuracy. According to Lowe et al. [37], despite the fact that CNN has a complex and time-consuming architecture for training the model, it is one of the methods with the highest precision and a high rate of object recognition. In this sense, CNN differs from MLP mainly by the existence of convulsion and pooling in CNN, which deals with the problem of dimensionality [38]. Additionally, MLP has a dense structure that significantly increases the number of parameters, which is not observable in CNN because it considers convulsion filters that gradually reduce the dimensionality of the data.
Neural network models trained with foliar spectral data have been widely used to classify samples according to different criteria, such as phytosanitary, nutritional, and fruit development status, among others [39,40]. Consistent with the results of this study, Nachtigall et al. [39] evaluated the use of CNN and MLP to detect and classify diseases, damage, and nutritional deficiencies in apple trees from images of their leaves. The authors reported that the CNN models outperformed MLPs, achieving a classification accuracy of 97.3%. Moreover, both neural network models were compared with human expert perception, revealing that the MLP model was less efficient than expert perception in classifying diseases, damage, and nutritional deficiencies. It is noteworthy that, in the context of variety identification and genetic diversity studies, CNNs have proven to be more efficient than other traditional classification methods. Qiu et al. [41] classified rice seed varieties based on hyperspectral images using CNNs. According to the authors, the CNN model outperformed other methods, such as the K-nearest neighbors (KNN) and support vector machine (SVM) models, revealing the effectiveness of analyzing foliar spectral data based on CNN. In another study, Zhu et al. [42] modeled a CNN using nearinfrared hyperspectral imaging to classify three varieties of soybeans (i.e., "Zhonghuang37", "Zhonghuang41", and "Zhonghuang55"), and each variety has a classification accuracy of over 90%. Kittlein et al. [43] performed a CNN model to provide a highly accurate prediction of the genetic diversity and differentiation of Ctenomys australis populations using molecular markers and high-resolution satellite imagery. CNN-based predictions accounted for about 98% of the variation observed in the genetic differentiation index and mean allele richness values, which may facilitate the identification of areas of interest for the conservation and management of populations.
One of the most promising results of the present study is the fact that the neural networks achieved a classification accuracy higher than 80% for both environmental conditions (rainfed and full irrigation), which suggests that the ability of the genetic structure inference of neural network methods (based on foliar spectral data) was not significantly affected by environmental conditions. On the other hand, the results revealed that the classification accuracy was slightly lower in Cauquenes. Previous studies have discussed how water deficit interferes with the spectral signal of vegetation [44,45]. According to Damm et al. [44], plants subjected to water deficit exhibit higher reflectance values (especially within the shortwave infrared spectrum) than plants growing under higher water availability and may exhibit a higher variability in specific regions of the spectrum (in terms of the standard deviation). Therefore, rainfed conditions may be generating potential noise in spectral measurements, which in turn may be affecting classification models. Notably, CNNs exhibited the highest classification accuracy in both sites (Cauquenes and Santa Rosa), which can be explained by the fact that this method has a high ability to process data that have never before been observed and allows for robustness regarding data acquisition conditions, background heterogeneity, and intraclass variability [46]. It is important to note that, unlike neural network models, PLS-DA exhibited a higher classification accuracy in Cauquenes (rainfed) than in Santa Rosa (full irrigation), which indicates the potential of this methodology in assigning individuals to subpopulations under rainfed conditions. Figure 3 provides an overview of the spectral-based classification of genetically differentiated groups carried out in this study. In brief, a spring wheat panel is genotyped using SNP markers, and then the genetic structure analysis is performed using the Bayesian clustering method (available in the STRUCTURE software) and Principal Components Analysis in order to confirm the genetic structure result. The genetic structure of the population is scored as an ordinal variable (in this wheat panel, clusters 1 and 2). Parallelly, high-throughput plant phenotyping of the entire population is performed using a specialized instrument. In this study, the foliar spectral data were considered input variables for the neural networks models [11], which are used to assign individuals to each genetically homogeneous group (clusters) previously inferred by the genetic structure analysis. Finally, the trained neural network models were used to assign individuals to a specific genetic cluster. It is necessary to consider that the analytical models of neural networks proposed in the present study are limited to classifying genotypes in genetic groups previously defined from the structure analysis; therefore, efforts should be made to obtain a genotyping for a set of individuals that are a good representation of the population diversity. The proposed methods are presented as a solution in the short term, at a low cost, and on a large scale of individuals. In addition, considering the relevance of examining the genetic structure of the population for breeding programs and the possibility of implementing high-throughput phenotyping in wheat panels, the pipeline based on neural network methods from foliar spectral data can be used to train neural network models to assign accessions to a given genetic subpopulation on a large scale and at a lower cost.     Cauquenes is classified as a drought-prone area with a Mediterranean climate (with De Martonne arid index 20 ≤ IDM < 24) [28] and represents the water-limited conditions (rainfed). Santa Rosa exhibits a humid climate type (De Martonne index 28 ≤ MDI < 35) and represents a full irrigation condition [28]. The experimental design was an alpha-lattice design with 20 incomplete blocks per replicate and 20 genotypes per block.

High-Throughput Phenotyping
Spectral reflectance (350-2500 nm) was measured using a portable spectroradiometer (FieldSpec 3 Jr., ASD, Boulder, CO, USA), with a 2.3 mm diameter optical fiber. Measurements to 316 cultivars were made from 11:00 to 17:00 h. on clear days (radiation higher than 800 W m −2 ). The equipment was set up to take three scans per plot, with the bean of the optical fiber placed at 45 • and 80 cm over the top of the canopy, and measurements were made by moving the sensor over the plot (covering the three central rows) according to Garriga et al. [47]. To limit variations in reflectance induced by changes in the angle of the sun, radiometric calibration was performed every 15 min, using a white barium sulfate panel as the reference (Spectralon, ASD, Boulder, CO, USA). The foliar spectral data were pre-processed in the R 4.0.5 software (Core Development Team, 2020), where the reflectance measures were normalized (centered and scaled), and their first derivative was computed using a Savitzky-Golay filter, with a window size of 37 data points.

High-Throughput Genotyping and Population Genetic Structure
DNA was isolated from the leaves of 316 cultivars according to the protocol of Porebski et al. [48]. Genotyping by sequencing (GBS) was carried out using the Illumina HiSeq 2000 sequencing platform and the bioinformatic approach for SNP identification described previously by Lado et al. [49]. SNP calls were made using the Tassel Pipeline (http://maizegenetics.net), with modifications for non-reference SNP calling according to Poland et al. [50] and Lado et al. [49]. SNPs with an MAF < 0.01 and a proportion of missing data per location >30% were eliminated according to Ballesta et al. [28], resulting in a total of 2,214 SNP markers from an initial dataset comprising 8,746 SNPs, which were used for the population genetic structure study.
The population genetic structure of the wheat was inferred using the STRUCTURE software [51], following Mora et al. [25]. For each genetically differentiated group (K, with K = 1-6), 10 runs were performed separately, each with 100,000 Monte Carlo Markov Chain (MCMC) replicates and a burn-in period of 10,000 iterations. The optimal K value was determined with Evanno's method [24] according to Mora et al. [25]. The results of STRUCTURE are available in Ballesta et al. [28], which indicated that the 316 wheat cultivars were grouped into two genetically distinct subgroups (K = 2).
A description of the origin breeding program and respective genetic cluster for each cultivar is provided in Data Sheet S1, while the spectral signatures distinguishing each group are presented in Figure 2. PLS-DA is a regression approach where the reduction of the dimensions and the latent decomposition between a set of predictors X (feature matrix with I × J dimensions) and label responses Y (response with I × K dimensions) are key. PLS defines a new subspace of latent variables through an iterative process, considering a compromise between maximum variance in X and maximum correlation to Y [52]. This method can be described statistically by: where the T and U matrices represent the score matrices X and Y, respectively, P and Q are the loading matrices of X and Y, respectively, and the E and F matrices correspond to the residual matrices of X and Y, respectively. The plsDA() function of the DiscriMiner package [53] in R 4.0.5 software [54] was used to run the PLS-DA algorithm.

Neural Network Classifiers
In this study, two types of ANN were tested to classify spectral data: the first is plain Multi-Layer Perceptron (MLP), in which neurons in a layer are fully connected to all neurons in neighboring layers, and the second is a one-dimensional convolutional neural network (Conv1D), which is a special form of CNN. MLP is the simplest form of ANN, while Conv1D employs one-dimensional filters to capture the temporal pattern or shape of the input series.
The architecture of the CNN used in this study is based on a convolutional layer (conv1D) to realize the feature extraction of the data. The Conv1D-based classification models, in general, contain convolution layers, pooling layers, dense layers, and an output layer. In this sense, by applying different convolution filters within each convolution layer, the Conv1D has the ability to extract different one-dimensional features from different layers [55]. On the other hand, the pooling layers are generally used to provide samples with the input data by retaining only the maximum value of the feature map in a window with a specified pool size, while the dense layers represent a deeply connected neural network layer [56]. The Conv1D-based model used in this study was composed of two conv1D layers, a maximum one-dimensional sampling (MaxPooling1D) layer, a flatten layer (to transform the input data into a one-dimensional vector) with a dropout parameter (to reduce the overfitting of the training data), and, finally, a softmax layer which sets the predictions of the model (dense layer for classification). Because of the versatility of the specialized architectures, there is no standard procedure for searching for the optimal combination of hyper-parameters [56]. Therefore, in this study, the optimization of the Conv1D-based model was realized by combining different hyper-parameters. For this, the ReLU, Sigmoid, and Tanh activation functions were tested in conv1D, while the softmax activation function was used in the output layer. The categorical cross-entropy (categorical_crossentropy) was used as the loss function, and the Adam and rmsprop optimizers were tested. The training was realized with 500 epochs, testing a batch size of 10, 20, and 40 (Table 4). MLP is a simple deep feedforward artificial neural network; there are at least three layers (input layer, hidden layer, and output layer), and the neurons of a layer are fully connected with all neurons of neighboring layers [57]. In this study, the architecture of MLP was composed of one or two dense hidden layers and an output layer (dense layer for classification). The ReLU, Sigmoid, and Tanh activation functions were tested in dense hidden layers, while softmax was in the output layer. categorical_crossentropy was used to track the loss function with the Adam and rmsprop optimizers ( Table 4).

Performance Metric
The performance of the proposed CNN and MLP architectures was assessed in terms of classification accuracy, which measures the number of predictions correctly predicted divided by the total number of predictions made, multiplied by 100 (to express it as a percentage). This metric was used in the CNN and MLP networks. Ten-fold cross-validation was applied for tuning each hyperparameter in the CNN and MLP networks and for the later comparison of the classification models (PLS-DA, CNN, and MLP). In each fold, the sample data were split into two datasets: training and validation sets, in which 80% of samples were randomly assigned to the training model, and the remaining 20% were used for validating the model. The Fisher LSD test was performed to compare the accuracy values among the evaluated models.

Conclusions
The present study evaluated the use of an artificial neural network-based spectral approach for the assignment of wheat genotypes to genetically differentiated groups (according to the genetic population structure), considering two contrasting water regimens. In general, the results revealed that the classification accuracy of neural networks was slightly lower in Cauquenes (rainfed condition). Notably, the clusters showed smaller differences in the wavelengths in Cauquenes than in Santa Rosa, which could have affected the average accuracy of the clustering in this environment. However, independent of environmental conditions (rainfed and full irrigation), the neural network models attain high classification accuracies. The methods proposed in this study can be applied to discriminate and assign accessions to a given genetic subpopulation in the short term and at a low cost, thus providing relevant information for breeding purposes. Additionally, the results demonstrated that this methodology has a high precision in different environmental conditions, which is highly desirable in the context of selection methodologies (e.g., marker-assisted selection) and studies related to the conservation of germplasm with a high diversity.
Supplementary Materials: The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/plants12030440/s1, Data Sheet S1: Description of the origin breeding program and the respective genetic cluster for each spring wheat cultivar (from genetic structure analyses).

Data Availability Statement:
The data presented in this study are available upon request from the corresponding author.