Classification of Soybean Genotypes as to Calcium, Magnesium, and Sulfur Content Using Machine Learning Models and UAV–Multispectral Sensor

Making plant breeding programs less expensive, faster, more practical, and more accurate, especially for soybeans, promotes the selection of new soybean genotypes and contributes to the emergence of new varieties that are more efficient in absorbing and metabolizing nutrients. Using spectral information from soybean genotypes combined with nutritional information on secondary macronutrients can help genetic improvement programs select populations that are efficient in absorbing and metabolizing these nutrients. In addition, using machine learning (ML) algorithms to process this information makes the selection of superior genotypes more accurate. Therefore, the objective of the work was to verify the classification performance of soybean genotypes regarding secondary macronutrients by ML algorithms and different inputs. The experiment was conducted in the experimental area of the Federal University of Mato Grosso do Sul, municipality of Chapadão do Sul, Brazil. Soybean was sown in the 2019/20 crop season, with the planting of 103 F2 soybean populations. The experimental design was randomized blocks with two replications. At 60 days after crop emergence (DAE), spectral images were collected with a senseFly eBee RTK fixed-wing remotely piloted aircraft (RPA), with autonomous control of takeoff, flight plan, and landing. At the reproductive stage (R1), three leaves were collected per plant to determine the levels of the macronutrients calcium (Ca), magnesium (Mg), and sulfur (S). The spectral information and the Ca, Mg, and S contents of the genotypes were subjected to a Pearson correlation analysis; the k-means algorithm was used to divide the genotypes into clusters, and a principal component (PC) analysis was carried out to display them. The clusters were taken as output variables, while the spectral data were used as input variables for the classification models in the machine learning analyses.
The configurations tested in the models were spectral bands (SBs), vegetation indices (VIs), and a combination of both. The combination of machine learning algorithms with spectral data can provide important biological information about soybean plants. The classification of soybean genotypes according to calcium, magnesium, and sulfur content can save time, effort, and labor in field evaluations in genetic improvement programs. Therefore, the use of spectral bands as input data for the random forest algorithm makes the process of classifying soybean genotypes in terms of secondary macronutrients efficient and useful for researchers in the field.


Introduction
Soybean genetic improvement programs face the challenge of developing more productive and management-responsive genotypes. They bear part of the responsibility for delivering effective solutions by 2050, when meeting the food needs of the world population will require doubling the current agricultural production rate [1]. Among the crops that are responsible for global food security, soybean is a source of protein widely used in animal feed and a basis for the production of oil used in human food and biofuels [2].
Conventionally, in soybean breeding programs, superior cultivars are selected through phenotypic traits, with the characteristics of interest measured manually and visually [3]. Several lines of work in breeding programs seek to improve the performance of plants under abiotic stress, mainly to find individuals resilient to low availability of water and nutrients, as well as to improve the efficiency in the use of these inputs [4]. Root nutrient absorption rates are highly heritable, and there is a notable genotypic preference for specific ions [5]. Thus, it is possible to select genotypes that are capable of absorbing certain nutrients and to generate populations that use them efficiently, which saves fertilizer and reduces the negative environmental impacts of incorrect fertilizer applications.
The selection of higher plants of agronomic interest was based on phenotypic traits long before the discovery of DNA. Within breeding, these selected plants are used in crosses, and the more crosses and environments used to evaluate the response to selection, the greater the chance of success of the progenies. During the breeding process, researchers need to phenotype a large number of plants and accurately identify the best progeny. In the field of genotyping, significant advances have provided rapid and low-cost genomic information [6], such as marker-assisted recurrent selection (MARS) and genomic selection; like all advances in genomic analysis, these approaches still require phenotypic data [7].
Although its implementation rests on remote sensing, a well-established field of research, high-throughput phenotyping (HTP) is relatively new in agriculture and has developed rapidly in recent years [8]. This type of phenotypic measurement provides detailed, non-invasive information about the plant and enables assessments throughout the plant's life cycle. Therefore, plant breeders can collect information on the variables of interest more efficiently, which enables them to evaluate large soybean populations quickly and accurately [9].
HTP is based on a multiple-image system in which multispectral sensors operate at determined angles, allowing a mathematical relation to be derived between several two-dimensional (2D) images in the visible range (RGB) and thus spectral information to be obtained from plants. Every method that makes use of HTP technologies requires calibration in order to give accurate answers and to relate image information to plant growth dynamics, so that these data sets can ultimately be used to measure phenotypic variation in biological systems of interest [10].
Once the spectral bands are obtained, vegetation indices (VIs) can be calculated. These function as metrics related to, for example, senescence, nutritional status, and chlorophyll degradation due to some stress, such as water stress or pathogens [6]. In this context, the spectral region of 470-800 nm is important in the relationship between leaf pigments and nutritional elements, including the secondary macronutrients calcium (Ca), magnesium (Mg), and sulfur (S) [11].
Multispectral sensors generate a large amount of spectral data, which are not directly related to agronomic variables of interest. Machine learning (ML) algorithms, used as statistical analysis tools, can combine spectral and agronomic information about crops and provide accurate results, especially in recognizing patterns that improve the identification of soybean genotypes [12]. ML algorithms can help with various issues regarding plant classification. To be efficient, data must be collected in a systematic and representative way to enable the design of a reliable data set [13].
In recent literature, several works use ML techniques together with multispectral data for various activities linked to phenotyping, such as [14], which used leaf reflectance to classify soybean genotypes in terms of industrial characters, reaching levels of correct classification close to 0.9 [15]. Given these applications, such technologies are promising for classifying soybean genotypes that are efficient in nutrient absorption; Santana et al. [16] carried out such a selection in soybeans for primary macronutrients, achieving greater precision with algorithms such as SVM and J48.
Using spectral data from soybean genotypes combined with nutritional information regarding secondary macronutrients can help genetic breeding programs select populations that are efficient in absorbing and metabolizing these nutrients. Combined with this information, using machine learning algorithms for data processing makes the selection of superior genotypes more accurate. Therefore, the objective of the study was to verify the classification performance of soybean genotypes regarding secondary macronutrients by ML algorithms and different inputs in datasets.

Materials and Methods
The experimental design was randomized blocks with two replications, with plots of 3 m long rows spaced 0.45 m apart and a planting density of 15 plants per meter. The evaluations were made on plants in the central row.
For sowing, the seeds were treated with fungicide (Pyraclostrobin + Methyl Thiophanate) and insecticide (Fipronil), at a dose of 200 mL of commercial product for every 100 kg of seeds, to prevent pests and soil diseases. The seeds were inoculated with bacteria of the genus Bradyrhizobium at a dose of 200 mL of concentrated liquid inoculant for every 100 kg of seeds. Other cultural treatments were carried out according to the crop needs.
At 60 days after crop emergence (DAE), spectral images were acquired with a senseFly eBee RTK fixed-wing remotely piloted aircraft (RPA), with autonomous control of takeoff, flight plan, and landing. A Parrot Sequoia multispectral sensor was mounted on the eBee, and images were acquired at 09:00, at an altitude of 100 m, with a spatial resolution of 0.10 m, under a cloud-free sky. Radiometric calibration of the sensor was performed for the entire scene using a calibrated reflective surface provided by the manufacturer. The Parrot Sequoia is a multispectral camera for agriculture that includes a sunlight (luminosity) sensor, which allows the acquired values to be calibrated, and an additional 16 Mpx RGB camera for recognition. The multispectral sensor has a horizontal field of view (HFOV) of 61.9°, a vertical field of view (VFOV) of 48.5°, and a diagonal field of view (DFOV) of 73.7°, as explained by [15]. Reflectance values were obtained as the average of each replication of the 103 soybean genotypes evaluated, for the green (550 nm), red (660 nm), red-edge (735 nm), and near-infrared (NIR, 790 nm) spectral bands (SBs). These bands enabled the calculation of vegetation indices (VIs) such as the Enhanced Vegetation Index (EVI, [18]), Green Normalized Difference Vegetation Index (GNDVI, [19]), Modified Chlorophyll Absorption in Reflectance Index (MCARI, [20]), Modified Soil-adjusted Vegetation Index (MSAVI, [20]), Normalized Difference Red Edge Index (NDRE, [19]), Normalized Difference Vegetation Index (NDVI, [21]), Soil-adjusted Vegetation Index (SAVI, [22]), and Simplified Canopy Chlorophyll Content Index (SCCCI, [23]).
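As an illustration of how band reflectances translate into vegetation indices, the sketch below computes six of the indices listed above from their commonly published formulas. It is a minimal example under stated assumptions: the function name, the soil adjustment factor of 0.5, and the reflectance values are hypothetical, and the exact formulations used in the study are those of the cited references, which may differ in detail. EVI and MCARI are left out because their band requirements are not spelled out in the text.

```python
import math


def vegetation_indices(green, red, red_edge, nir, soil_factor=0.5):
    """Compute common vegetation indices from band reflectances (0-1 scale).

    soil_factor is the SAVI soil adjustment term L; 0.5 is a commonly used
    default and is an assumption here, not a value taken from the study.
    """
    ndvi = (nir - red) / (nir + red)
    gndvi = (nir - green) / (nir + green)
    ndre = (nir - red_edge) / (nir + red_edge)
    savi = (1 + soil_factor) * (nir - red) / (nir + red + soil_factor)
    # MSAVI in its closed form, which needs no explicit soil factor
    msavi = (2 * nir + 1 - math.sqrt((2 * nir + 1) ** 2 - 8 * (nir - red))) / 2
    # SCCCI is the ratio of NDRE to NDVI
    sccci = ndre / ndvi
    return {
        "NDVI": ndvi,
        "GNDVI": gndvi,
        "NDRE": ndre,
        "SAVI": savi,
        "MSAVI": msavi,
        "SCCCI": sccci,
    }


# Hypothetical mean reflectances for one plot (not data from the experiment)
indices = vegetation_indices(green=0.08, red=0.05, red_edge=0.30, nir=0.45)
for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```

Because every index is a deterministic function of the bands, the VIs carry no information beyond what the four bands already contain; they only re-express it.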
RTK (Real-Time Kinematics) technology enabled aerial surveying and estimation of the camera position at the time of image collection, with an accuracy of 2.5 m. The images obtained were mosaicked and orthorectified using the computer program Pix4Dmapper, with the positional accuracy of the orthoimages verified with ground control points (GCPs) surveyed with RTK.
When the plants reached the reproductive stage (R1), three leaves of each plant were collected and washed with water, mild detergent solution (0.1%), acid solution (HCl 0.3%), and deionized water. After washing, samples were kept in paper bags and dried in a forced circulation oven at 65 ± 5 °C until constant dry mass. Then, the samples were weighed on a precision scale (0.0001 g) and ground in a Wiley mill. The secondary macronutrient content (calcium, magnesium, and sulfur) was determined following appropriate methods [24].
The spectral data and the secondary macronutrient contents of the genotypes were subjected to Pearson correlation analysis in the Rbio software [25]. Next, the k-means algorithm was applied to group the genotypes around their nearest centroids, minimizing the variation in distance within each group, which resulted in two clusters. Principal component (PC) analysis was performed to display the cluster separation in a biplot, based on the "ggfortify" library in R software [26]. Further, following the Tukey test, boxplots of the nutrient content of each cluster were drawn to highlight which genotype set had the higher nutrient content.
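The clustering step described above can be sketched as follows. This is a minimal, standard-library illustration of k-means with two clusters, not the R workflow used in the study; the function name, the seed, the absence of variable standardization, and the leaf contents are all hypothetical assumptions made for the example.

```python
import math
import random


def kmeans(points, k=2, n_iter=100, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its members, until labels settle."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    labels = None
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
        if new_labels == labels:
            break  # labels unchanged, so the partition has converged
        labels = new_labels
    return labels, centroids


# Hypothetical leaf contents (Ca, Mg, S) for six genotypes, arbitrary units
leaf_contents = [
    (8.0, 3.0, 1.8), (8.5, 3.2, 2.0), (7.8, 2.9, 1.7),
    (12.0, 4.5, 2.8), (11.5, 4.4, 2.6), (12.4, 4.8, 3.0),
]
labels, centroids = kmeans(leaf_contents, k=2)
print(labels)
```

The resulting labels play the role of the output variable for the classifiers, which is why the quality of this partition bounds what any downstream model can learn.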
The clusters formed were used as output variables, while the spectral data were used as input variables for the following classification models in the machine learning analyses: Multilayer Perceptron Artificial Neural Network (ANN, [27]), REPTree Decision Tree Algorithm (DT, [28]), J48 Decision Tree Algorithm (J48, [29]), Logistic Regression (LR, [30]), random forest (RF, [31]), and Support Vector Machine (SVM, [32]). The algorithms were chosen from among those most recently used in the literature [16,33,34]. The inputs tested in the datasets were spectral bands (SBs), vegetation indices (VIs), and the combination of both (SB+VIs). Cluster classification was evaluated with stratified 10-fold cross-validation repeated ten times, giving 100 runs for each model.
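To make the "100 runs" concrete, the sketch below generates the train/test splits of a stratified 10-fold cross-validation repeated ten times. It is a standard-library illustration, not the Weka procedure itself; the function names, the seed, and the 60/43 split of cluster labels are hypothetical assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, rng):
    """Split sample indices into n_folds folds, keeping class proportions
    roughly equal across folds by dealing each class out in turn."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for lab in sorted(by_class):
        indices = by_class[lab]
        rng.shuffle(indices)
        for idx in indices:
            folds[position % n_folds].append(idx)
            position += 1
    return folds


def repeated_stratified_cv(labels, n_folds=10, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        folds = stratified_folds(labels, n_folds, rng)
        for i, test_idx in enumerate(folds):
            train_idx = [
                idx for j, fold in enumerate(folds) if j != i for idx in fold
            ]
            yield train_idx, test_idx


# Hypothetical cluster labels for 103 genotypes (not the study's partition)
labels = ["C1"] * 60 + ["C2"] * 43
splits = list(repeated_stratified_cv(labels))
print(len(splits))  # one train/test split per run
```

Each of the 100 splits would be used to fit a model on the training indices and score it on the held-out fold, and the 100 scores per model then feed the analysis of variance.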
The parameters of the models followed the default configuration in Weka 3.8.5 software. The models' performance was evaluated with the accuracy metrics percentage of correct classifications (CCs), F-score, and kappa coefficient, where the higher the value of a metric, the better the performance of the algorithm. The performance of the inputs, the ML models, and the interaction between them was verified through analysis of variance, resulting in boxplots with means grouped by the Scott-Knott test at the 5% significance level. This analysis was based on the ggplot2 and ExpDes.pt libraries from the R software [26].
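The three metrics can be computed from a vector of true and predicted cluster labels, as in the sketch below. It is a hedged illustration rather than Weka's implementation: the function name and the example labels are hypothetical, CC is assumed here to be the percentage of all instances classified correctly, the F-score is computed for one class treated as positive, and kappa uses the usual observed versus chance agreement.

```python
def classification_metrics(y_true, y_pred, positive):
    """Return (CC in percent, F-score for the positive class, Cohen's kappa)."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    cc = 100 * correct / n
    f_score = 2 * tp / (2 * tp + fp + fn)

    # Kappa: agreement beyond what the label frequencies alone would produce
    classes = set(y_true) | set(y_pred)
    p_observed = correct / n
    p_expected = sum(
        (y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes
    )
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return cc, f_score, kappa


# Hypothetical true and predicted clusters for ten genotypes
y_true = ["C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2", "C2", "C2"]
y_pred = ["C1", "C1", "C1", "C2", "C2", "C2", "C2", "C2", "C1", "C1"]
cc, f_score, kappa = classification_metrics(y_true, y_pred, positive="C2")
print(cc, round(f_score, 3), round(kappa, 3))
```

Kappa is the most conservative of the three on a two-class problem: a classifier that is right 70% of the time here scores only 0.4, because half of the agreement would be expected by chance.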

Results and Discussion
The Pearson correlation analysis was plotted in the form of a scatterplot (Figure 2), where the shades of red represent positive correlations; the more intense the color, the greater the magnitude of the correlation. Similarly, negative correlations are shown in shades of blue, with the same relation between color intensity and magnitude. A medium-magnitude correlation was observed between Ca and Mg. The spectral variables presented low correlations with the macronutrients and high correlations with each other: red and green presented high negative correlations with the VIs, while the VIs and red-edge reached high positive correlations with each other. The medium correlation between calcium and magnesium (Figure 2) can be explained by their similar chemical properties, such as ionic radius, valence, degree of hydration, and mobility; as a result, these nutrients compete for adsorption sites in the soil when being absorbed by plants [35]. Due to this competition for the same absorption site, soil levels of Ca and Mg must be in balance, since an excess of one limits the absorption of the other, which means lower levels of these nutrients in plant leaves and seeds [36]. The Ca-S and Mg-S correlations were very low, due to the different absorption and metabolic routes within the plant.
The high correlations between spectral bands and vegetation indices are expected, since VI calculation relies on SB data [33]. The low correlations observed between nutritional and spectral variables are attributed to the lack of linearity between these variables, which have complex relationships not explained by traditional statistical methods such as Pearson's correlation. In such cases, ML algorithms are recommended because they can handle non-linear relationships between nutritional and spectral variables [37]. These ML algorithms are robust enough to provide reliable results on the relationship between spectral and agronomic data [38].
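The point that a low Pearson coefficient does not rule out a strong relationship can be illustrated with a toy example. The sketch below uses synthetic values, not data from the study, and a hypothetical helper named pearson: a perfectly linear relation gives a coefficient of about 1, while a perfectly deterministic but symmetric quadratic relation gives a coefficient of about 0.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Synthetic predictor, symmetric around zero
x = [i / 10 for i in range(-10, 11)]
linear = [2 * v + 1 for v in x]   # y depends linearly on x
quadratic = [v ** 2 for v in x]   # y depends on x, but not linearly

r_linear = pearson(x, linear)
r_quadratic = pearson(x, quadratic)
print(round(r_linear, 3), round(r_quadratic, 3))
```

In the quadratic case the response is fully determined by the predictor, yet the linear coefficient cannot see it, which is the kind of structure that tree-based or kernel-based classifiers are able to exploit.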
Two clusters (C1 and C2) were formed with the k-means algorithm based on the macronutrients evaluated; genotypes within a cluster are similar to each other and distinct from those in the other cluster (Figure 3). The first two principal components together represent 69% of the total data variation, a value very close to the threshold of 70% suggested in past research [39], which allowed the genotypes to be grouped with confidence and the subsequent analyses to proceed. Every genotype received the same fertilization management yet presented different levels of Ca, Mg, and S, which allowed us to separate them into two groups (Figure 3) with the help of the k-means algorithm. The purpose of the k-means algorithm is to split the dataset according to a clustering criterion, in which data are grouped by trait similarity; the groups formed by k-means are called clusters [40]. After the clusters were defined based on the secondary macronutrient content in the leaves, PC analysis was carried out with the first two principal components.
Subsequently, the nutrients from each cluster were subjected to the Tukey test, in which the genotypes grouped in Cluster 2 reached significantly higher secondary macronutrient concentrations than those in Cluster 1; Cluster 2 presented the highest means for all nutrients (Figure 4). Different genotypes presented different efficiencies in the use of nutrients, which are influenced by genetic and physiological factors. A plant that is efficient in the use of a given nutrient is one that produces higher yields per unit of nutrient applied or absorbed when compared to other plants grown in similar environmental conditions [41]. Therefore, it can be stated that the genotypes contained in Cluster 2 are superior in terms of Ca, Mg, and S metabolization efficiency.
The performance of the machine learning algorithms was analyzed with three metrics, namely correct classification (CC), F-score, and kappa coefficient. The Inputs × ML interaction was significant for CC, kappa, and F-score (Table 1).
With the genotype clusters set, the machine learning algorithms were run with different inputs, searching for the combination with the greatest accuracy in classifying the groups. The combination of ML with multispectral data has shown strong results in modeling diverse crop characteristics, such as yield, biomass, and height [42]. ML methods use advanced statistical tools to model non-linear data with complex interactions between spectral variables and biological variables linked to plants [43]. The LR algorithm was used as the reference for evaluating the ML models, and algorithms that outperformed it were sought. From the perspective of the inputs, the ML techniques employed do not present notable differences, except for LR, for which the SB+VIs input performed better, reaching an accuracy close to 0.60 (Figure 5). With the SB input, the algorithms that obtained the best results were J48, RF, and SVM, with accuracies between about 0.55 and 0.60 for this metric. With the VIs input, the algorithm that performed best was RF. With the SB+VIs input, LR had the best accuracy for CC, close to 0.60. The inputs made a difference only for RF and LR, for which the best performances were achieved using VIs and SB+VIs, respectively. Evaluating the performance of each input within each algorithm, the use of SB provided better results for RF, which also presented better results when VIs were used. SB+VIs provided better performances for SVM (Figure 6).
For the F-score, comparing the three inputs within each algorithm, DT, RF, and SVM showed no difference in performance regardless of the input used (Figure 7). ANN and LR performed better with SB+VIs. The J48 algorithm performed better with SB, reaching values above 0.5. Comparing the algorithms within each input, both SB and VIs provided better results for RF, with values above 0.5. SB+VIs gave better results for RF and LR, with values between 0.5 and 0.6.
In general, RF performed better for all tested accuracy metrics, especially when the tested inputs were SB and VIs. LR performed well for the metrics when SB+VIs was used. J48 and SVM performed well with the SB input only for the CC accuracy metric.
Vegetation indices summarize information on plant canopy reflectance, which makes it possible to evaluate various quantitative and qualitative plant parameters when combined with algorithms [44]. However, the use of spectral bands is more viable from a data processing perspective because no additional calculations are required to obtain the inputs, as occurs with vegetation indices [33]. The use of spectral bands as input data for ML algorithms presents accurate results for the identification of soybean cultivars [14]. Among the algorithms tested in [45], the use of spectral bands as model input data provided better accuracy in determining height and days to maturity in soybean plants. Good accuracy was also found in the classification of soybean genotypes for oil and protein characteristics [33].
Among the ML algorithms used, RF achieved better performance than the other algorithms. RF is a learning algorithm that uses a non-parametric model combining a set of decision trees [46]. Random forest is a high-precision ML technique in various agricultural applications [47], such as predictions of corn yield [37,48] and soybean yield [43], nitrogen content and height prediction of corn plants [49], the classification of injuries in soybean seeds [50], and early detection of diseases [51]. RF has an advantage in identifying soybeans and corn, being an algorithm with high potential for remotely sensed data [52].
Therefore, the results support the effectiveness of using ML algorithms, notably the RF algorithm. RF was superior to the other methods in classifying soybean genotypes in terms of secondary macronutrients. Using SB as algorithm input increases accuracy and reduces data processing work. The use of these technologies in the genetic improvement of plants makes the selection of soybean genotypes that are superior in the absorption and metabolization of nutrients such as Ca, Mg, and S faster, more practical, and non-destructive. In this way, the use of spectral data combined with machine learning techniques allows us to simultaneously analyze phenotypic characteristics and relate them to different nutritional characters, assisting plant improvement by selecting genotypes that will be more efficient in absorbing nutrients. This contribution to agriculture saves time and resources, reducing the costs of labor and of the chemical reagents used in the laboratory to determine such elements. Furthermore, the use of modern techniques, such as ML algorithms, supports a more digital, precise, and efficient agriculture.

Conclusions
Machine learning algorithms have demonstrated promising results when used to classify soybean genotypes in relation to calcium, magnesium, and sulfur content. The algorithm that presented better results than the others was random forest, achieving accuracies close to 0.6 for correct classification and F-score. This algorithm proved to be robust and capable of efficiently generalizing the information obtained, regardless of the type of input used.
In future work, hyperspectral sensors can provide more information across the spectrum and greater detail in relation to these and other nutrients. Furthermore, the machine learning techniques applied in this study can be adapted and extended to other agricultural crops and different nutrients, expanding the application potential and impact of these approaches in high-precision phenotyping.

Figure 1. Location of the experimental area in Chapadão do Sul-MS, Brazil; photograph of the experimental area.

$$\mathrm{CC} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP}} \times 100$$

$$\mathrm{Fscore} = \frac{2 \times \mathrm{TP}}{2 \times \mathrm{TP} + \mathrm{FN} + \mathrm{FP}}$$

$$\mathrm{Kappa} = \frac{P_o - P_e}{1 - P_e}$$

where TP, FN, and FP are the true positive, false negative, and false positive classifications, $P_o$ is the observed agreement, and $P_e$ is the agreement expected by chance.

Figure 2. Pearson correlation scatterplot with spectral variables and secondary macronutrients.

Figure 3. Principal Component (PC) analysis for clusters based on Ca, Mg, and S contents of soybean genotypes, formed by k-means.

Figure 4. Boxplot with Ca, Mg, and S means for clustered data. Means followed by the same letters do not differ for the cluster by the Scott-Knott test at 5% probability.

Figure 5. Boxplot with clustering means for percent correct classification regarding the machine learning models. Means followed by the same uppercase letters do not differ for the inputs tested by the Scott-Knott test at 5% probability; means followed by the same lowercase letters do not differ for the algorithms tested by the Scott-Knott test at 5% probability.

Figure 6. Boxplot with clustering means for kappa regarding machine learning models. Means followed by the same uppercase letters do not differ for the inputs tested by the Scott-Knott test at 5% probability; means followed by the same lowercase letters do not differ for the algorithms tested by the Scott-Knott test at 5% probability.

Figure 7. Boxplot with clustering means for F-score regarding the machine learning models tested. Means followed by the same uppercase letters do not differ for the inputs tested by the Scott-Knott test at 5% probability; means followed by the same lowercase letters do not differ for the algorithms tested by the Scott-Knott test at 5% probability.

Table 1. Summary of the analysis of variance for the variables percent correct classification (CC), kappa coefficient, and F-score.