Soybean Cultivars Identification Using Remotely Sensed Image and Machine Learning Models

Abstract: Using remote sensing combined with machine learning (ML) techniques is a promising approach to classify soybean cultivars. Therefore, the objectives of this study were (i) to verify which input dataset configuration (only spectral bands, only vegetation indices, or both) is more accurate in the identification of soybean cultivars, and (ii) to verify which ML technique is more accurate in the identification of soybean cultivars. Information was extracted from five center-pivot irrigated areas in the same region and with the same sowing date in the 2015/2016 crop year. Each pivot was cultivated with a different cultivar: CV1 (P98y12 RR), CV2 (Desafio RR), CV3 (M6410 IPRO), CV4 (M7110 IPRO), and CV5 (NA5909 RR). A cloud-free orbital image of the site was acquired from the Google Earth Engine platform. In addition to the spectral bands alone, a total of 13 vegetation indices were calculated. The models tested were: artificial neural networks (ANN), radial basis function network (RBF), decision tree algorithms J48 (DT) and reduced error pruning tree (REP), random forest (RF), and support vector machine (SVM). The five soybean cultivars were classified by the six ML models in 10-fold stratified randomized cross-validation with 10 repetitions (100 runs for each model). After obtaining the correct classification and Kappa statistics, analysis of variance was performed considering a 6 × 3 factorial scheme (models versus inputs) with 10 repetitions (folds). The means were grouped by the Scott-Knott test at 5% probability. The spectral bands were the most accurate among the tested inputs in the identification of soybean cultivars. ANN was the most accurate model in identifying soybean cultivars.


Introduction
Soybean (Glycine max L.) is the major Brazilian agricultural commodity. The estimated national production in the 2020/21 harvest was 135.91 million tons, which represents an increase of 8.9% over the previous harvest [1]. In the international market, Brazilian production represents 37% of the 363.19 million tons produced globally [2], thus placing it in a position of international prominence [3,4].
The increase in soybean production worldwide is due to several factors, among which genetic improvement stands out, since it has provided a diversity of cultivars with characteristics suited to the growing location and to the management carried out on farms. Assessing phenotypic plant traits is a crucial step in soybean breeding programs, and thanks to advances in remote sensing and data analysis techniques, this process is becoming faster and more accurate [5]. The enhanced characterization achieved by remote sensing and evidenced by statistical modeling allows the understanding of several plant traits, even the most complex ones, assisting breeding programs in high-throughput phenotyping (HTP) [6].
Silva Junior et al. [7] found satisfactory results in differentiating soybean varieties using vegetation indices (VIs) and wavelengths obtained from UAV-based imagery. Spectral bands and VIs are positively correlated with several plant traits, such as leaf nitrogen content in corn, regardless of the variety analyzed [8]. HTP has assisted in monitoring the development of plants and their relationship with their environment [9].
Combining machine learning (ML) with remote sensing is a promising approach for extracting information on agronomic traits, making data processing automated and more accurate [10]. This is because ML enables the development of algorithms that can handle large datasets with complex information (such as spectral imagery data) that must be integrated [11].
Marques Ramos et al. [12], when using ML techniques combined with different VIs, achieved satisfactory results in predicting maize yields, with Random Forest (RF) standing out. Schwalbert et al. [13] applied ML models to remote sensing data to predict soybean yield, and Artificial Neural Networks (ANN) outperformed the other algorithms.
The hypothesis of our study is that using satellite imagery in data collection and ML in data processing can assist in the identification of soybean cultivars, making this process more accurate and faster. Therefore, the objectives were: (i) to verify which dataset input configuration (only spectral bands, only vegetation indices, or both) is most accurate in soybean cultivar discrimination, and (ii) to verify which ML technique is most accurate in this classification modeling.

Experimental Area and Treatments Evaluated
The study area is located in the municipality of Pereira Barreto, in the State of São Paulo, at the mouth of the Tietê River. According to the Köppen classification, the regional climate is tropical savanna with summer rainfall and winter drought (Aw) [14]. Average temperatures range from 21.2 to 26.8 °C, average annual rainfall is 1128 mm, and the average altitude of the region is 347 m.
To ensure that water deficit did not influence the VIs, so that the differences observed reflected only the soybean cultivars and not other external factors, data from irrigated areas were used, in which irrigation management supplied the water needed for proper crop development over the cycle (Figure 1). Data were collected from five irrigated center pivots with the same sowing date in the 2015/2016 crop season. Each pivot was grown with a different cultivar: CV1 (P98y12 RR), CV2 (Desafio RR), CV3 (M6410 IPRO), CV4 (M7110 IPRO), and CV5 (NA5909 RR) (Figure 1).

Image Acquisition and Multispectral Models
A cloudless orbital image of the site was acquired from the Google Earth Engine platform. The image is already corrected to top-of-atmosphere (TOA) reflectance, in which digital numbers are converted to at-sensor values through a linear transformation that accounts for the solar elevation and the Earth-Sun distance [15], for 2015/2016. The image used was from the Landsat-8 satellite with the OLI sensor (USGS Landsat 8 Collection 1 Tier 1 and Real-Time data TOA Reflectance), available in the LANDSAT/LC08/C01/T1_RT_TOA collection, with all bands used having a spatial resolution of 30 m. In addition to the individual spectral bands, a total of 13 VIs were calculated, as described in Table 1. The 100 random repetitions (pixel by pixel) per cultivar were extracted from the orbital image on the Google Colab platform, in Python, using the packages ee, os, and geemap [16].
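As a minimal sketch of how a vegetation index can be derived from band reflectances and how 100 random pixels can be drawn for one cultivar, the snippet below computes NDVI from the near-infrared (B5) and red (B4) bands. NDVI is used only as a familiar example; whether it is among the 13 VIs of Table 1 is an assumption. The reflectance values, the function names `ndvi` and `sample_pixels`, and the random seeds are hypothetical, and the code does not reproduce the ee/geemap workflow.

```python
import random

def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red reflectance."""
    total = nir + red
    if total == 0:
        return 0.0  # avoid division by zero on empty pixels
    return (nir - red) / total

def sample_pixels(pixels, n=100, seed=0):
    """Draw n pixels at random, without replacement, from one pivot."""
    rng = random.Random(seed)
    return rng.sample(pixels, n)

# Hypothetical TOA reflectance pixels for one pivot: (B4 red, B5 NIR).
rng = random.Random(42)
pivot_pixels = [(rng.uniform(0.03, 0.06), rng.uniform(0.40, 0.55))
                for _ in range(500)]

# 100 random repetitions for this cultivar, then one VI value per pixel.
sampled = sample_pixels(pivot_pixels, n=100)
ndvi_values = [ndvi(nir=b5, red=b4) for b4, b5 in sampled]
print(len(ndvi_values), round(sum(ndvi_values) / len(ndvi_values), 3))
```

The same pattern (one function per index, applied to the sampled pixels) would extend to any other index built from band ratios or differences.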

Using Machine Learning Models
The models tested were: artificial neural networks (ANN), radial basis function (RBF) network, the decision tree algorithms J48 (DT) and reduced error pruning tree (REPTree), random forest (RF), and support vector machine (SVM). The ANN tested consists of a single hidden layer whose number of neurons equals the number of attributes plus the number of classes, divided by 2. The J48 algorithm (DT) is a classifier that generates a C4.5 decision tree with an additional pruning step based on a reduced-error strategy [17,18]. RBF is a feed-forward network in which the hidden layer implements normalized Gaussian radial basis functions, with the k-means clustering algorithm providing the basis functions of this layer, and supervised learning is used for the output layer [19]. REPTree follows the decision tree logic and creates several trees in different iterations; it then selects the best tree, using information gain as the splitting criterion and reduced-error pruning [20]. The RF model produces multiple decision trees for the same dataset and uses a voting scheme among all the learned trees to classify new instances [21]. SVM performs classification by building hyperplanes in a multidimensional space to separate the different classes [22].
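Two of the rules described above are simple enough to illustrate directly: the size of the ANN hidden layer and the voting scheme of the random forest. The sketch below is only an illustration. The function names `hidden_neurons` and `majority_vote`, the example attribute counts, and the example votes are hypothetical, and rounding down is assumed when the sum of attributes and classes is odd.

```python
from collections import Counter

def hidden_neurons(n_attributes, n_classes):
    """Hidden-layer size: (attributes + classes) / 2, rounded down (assumed)."""
    return (n_attributes + n_classes) // 2

def majority_vote(tree_predictions):
    """Class predicted by most trees in the ensemble."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Five cultivar classes; the attribute counts below are illustrative only.
for n_attributes in (7, 13, 20):
    print(n_attributes, hidden_neurons(n_attributes, n_classes=5))

# Hypothetical votes from five trees for a single pixel.
print(majority_vote(["CV1", "CV3", "CV1", "CV1", "CV2"]))
```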
The five soybean cultivars were classified by the six ML models in 10-fold stratified randomized cross-validation with ten repetitions (100 runs for each model). Three inputs were considered for each classification model: spectral bands only (SBs), vegetation indices only (VIs), and SBs + VIs. The metrics used to evaluate the models and inputs were the correct classification (CC, %) and the Kappa coefficient. ML analyses were performed in Weka 3.9.4 using the default settings for all tested models [23], on an Intel® Core™ i5 CPU with 6 GB of RAM.
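The resampling design (10 stratified folds, repeated 10 times) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the procedure Weka runs internally: the function names `stratified_folds` and `repeated_stratified_cv`, the round-robin assignment, and the use of the repetition number as the random seed are all hypothetical choices.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def repeated_stratified_cv(labels, k=10, repeats=10):
    """Yield (repeat, fold, train_idx, test_idx) for every run."""
    for repeat in range(repeats):
        folds = stratified_folds(labels, k=k, seed=repeat)
        for fold, test_idx in enumerate(folds):
            train_idx = [i for j, f in enumerate(folds) if j != fold for i in f]
            yield repeat, fold, train_idx, test_idx

# 100 sampled pixels for each of the five cultivars, as in the study design.
labels = [cv for cv in ("CV1", "CV2", "CV3", "CV4", "CV5") for _ in range(100)]
runs = list(repeated_stratified_cv(labels))
print(len(runs))        # train/test runs per model: 10 folds x 10 repetitions
print(len(runs[0][3]))  # test-set size in each fold
```

With 100 pixels per cultivar, each test fold holds 10 pixels of every cultivar, so every run is evaluated on a balanced set of 50 pixels and trained on the remaining 450.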

Statistical Analysis
After obtaining the correct classification (CC) and Kappa parameters, analysis of variance was performed considering a 6 × 3 factorial scheme (models versus inputs) with ten repetitions (folds). The means were grouped by the Scott-Knott test at 5% probability. Bar graphs were generated for each parameter (CC and Kappa) considering the models and inputs tested. Based on these statistics, the best ML technique was identified, and a confusion matrix was developed for this technique and the different inputs evaluated. These analyses were performed in R [24] using the packages ExpDes.pt and ggplot2.
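One common way to write the two-factor model behind an analysis of variance in a 6 × 3 factorial scheme is given below. The notation is assumed here for illustration and is not taken from the original analysis.

```latex
y_{ijk} = \mu + M_i + I_j + (MI)_{ij} + \varepsilon_{ijk},
\qquad i = 1,\dots,6,\quad j = 1,\dots,3,\quad k = 1,\dots,10
```

Here $y_{ijk}$ is the value of the evaluated statistic for model $i$, input $j$, and repetition $k$; $\mu$ is the overall mean; $M_i$ and $I_j$ are the main effects of model and input; $(MI)_{ij}$ is their interaction; and $\varepsilon_{ijk}$ is the random error. A significant interaction term is what justifies examining models within each input and inputs within each model separately.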

Spectral Signature of Cultivars
The spectral curves extracted from the corrected Landsat-8/OLI TOA image for the 100 repetitions of each cultivar are shown in Figure 2. Visually, there is a slight difference between the spectral signatures of the five cultivars across the eight spectral bands analyzed. The variations are more pronounced when the maximum and minimum values of each cultivar are isolated (Figure 2a-f).
The mean spectral curves (Figure 3a) indicate that the analyzed soybean cultivars (CV1 to CV5) were physiologically healthy, as shown mainly by the high reflectance in B5 (~0.865 µm) and the absorption in B4 (~0.655 µm), B6 (~1.61 µm), and B7 (~2.2 µm). The OLI sensor's reflectance values for all cultivars were consistent with the expected behavior of healthy green vegetation. The curves correspond to the date of the scene acquisition in January, when the soybean crop at the study site was at full vegetative vigor, clearly in the phenological stage R5 (Figure 3).

Scattering between Variables
A scatterplot of the correct classification (%) and Kappa coefficient for the discrimination of the five soybean cultivars using ML models and different inputs is shown in Figure 5. ANNs with the SBs and SBs + VIs inputs gave the highest values of correct classification (%) and Kappa coefficient. With these same inputs, the random forest (RF) algorithm obtained values close to, but slightly lower than, those of the ANNs. Regardless of the model and input tested, variability between folds was low, with only a single outlier in some cases.

Choosing the Best Model and Best Input
The unfolding of the significant model × input interaction for correct classification (%) and Kappa coefficient in the discrimination of the five soybean cultivars is shown in Tables 2 and 3, respectively. In the unfolding of models within each input, ANNs presented the highest mean correct classification and Kappa coefficient regardless of the input used. In the unfolding of inputs within each model, the spectral bands (SBs) and spectral bands + vegetation indices (SBs + VIs) inputs had the highest mean correct classifications and Kappa coefficients and did not differ from each other for the ANN, DT, REPTree, and RF models.

Table 2. Unfolding of the significant model × input interaction for the correct classification (%) of five soybean cultivars using machine learning (ML) models and different inputs (vegetation indices, VIs; spectral bands, SBs; and SBs + VIs).

Confusion Matrix Using ANNs
Based on the results in Tables 2 and 3, the ANNs showed the best ability to discriminate soybean cultivars. Figure 6 therefore shows the confusion matrix obtained with this model for each evaluated input. The diagonal (pink-scale values) shows the number of correct classifications obtained for each cultivar. Using SBs and SBs + VIs as inputs provides the highest number of correct classifications. These two inputs showed no statistical difference between them (see Tables 2 and 3) and were superior to using VIs as input.
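Both evaluation metrics can be computed directly from a confusion matrix, as the sketch below shows. The 3-class matrix is hypothetical and is not taken from Figure 6; the function names `correct_classification` and `cohen_kappa` are likewise illustrative.

```python
def correct_classification(matrix):
    """Percentage of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * correct / total

def cohen_kappa(matrix):
    """Cohen's kappa: agreement beyond what chance alone would give."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(n)) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    # Chance agreement from the marginal totals of actual and predicted classes.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1.0 - expected)

# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
example = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
print(round(correct_classification(example), 1))  # 76.7
print(round(cohen_kappa(example), 3))             # 0.65
```

In this example 23 of 30 samples lie on the diagonal, and chance agreement is 1/3 because all marginal totals are equal, which is why Kappa (0.65) is lower than the raw proportion of correct classifications.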

Figure 6. Confusion matrix for discrimination of five soybean cultivars using artificial neural networks (ANNs) and different inputs (vegetation indices, VIs; spectral bands, SBs; and SBs + VIs).

Tested Models
Machine learning has innovative potential in many areas of science. The basic requirement is enough data to train and validate the tested models, which makes a considerable amount of data necessary [25]. Among the models tested, the ANNs stood out for achieving higher means of correct classification and Kappa coefficient, being the most accurate among the evaluated models in identifying soybean cultivars. Using data derived from spectral images, Eugenio et al. [26] achieved adequate fit and generalization capacity using ANNs to predict soybean yield. ANN-based modeling can achieve high accuracy across a range of situations [27].
In some studies, ANNs have provided more reliable results than other modeling techniques [28], such as Stepwise Multiple Linear Regression (MLR) and Principal Component Regression (PCR) [29]. ANNs are also a more accurate alternative for predicting crop yields than traditional regression models [30].
Taratuhin et al. [31] found high accuracy using ANNs to predict the earliness of soybean accessions. Taratuhin et al. [32] also found high accuracy using ANNs to predict several soybean traits under different climatic conditions. In eucalyptus, ANNs are widely used to estimate yield, since traditional methods are difficult to apply given the number of independent variables and their complex relationship with the dependent variable [33].
Using ANNs together with spectral bands and/or vegetation indices generates accurate information for forest inventories while saving time and labor, since ANNs are able to learn from non-linear data [34]. These coupled techniques improve accuracy, speed, and reliability in several lines of research and also benefit farmers [35].

Tested Inputs
The use of remote sensing for measuring soybean agronomic traits has great potential to revolutionize genetic breeding programs and production systems, especially because this technology allows the quantification of phenotypic variables by combining images [10]. Traditional genotype selection programs are limited to costly and imprecise field analyses, which can be improved using remote sensing technologies [5]. This technology demonstrates efficiency in classifying soybean varieties, as previously reported by Silva Junior et al. [7,36].
When evaluating the inputs within each model, SBs and SBs + VIs obtained the highest means for correct classification and Kappa coefficient. Although both inputs achieved similar results, using the SBs alone is more practical, since obtaining them does not require the calculations needed to derive the VIs.
Spectral bands are a reliable source of spatially and temporally detailed information and allow the estimation of variables such as chlorophyll and leaf area index in agricultural crops [37]. Silva Junior et al. [38] achieved accurate results using spectral bands to discriminate eucalyptus plants under different levels of boron fertilization.
In addition to the results of the unfoldings for correct classification and Kappa coefficient, the correlogram showed a significant relationship between spectral bands and variables. These results demonstrate the efficiency of using the spectral bands B1, B2, B3, B4, B6, and B7 to identify soybean cultivars. Using more than one spectral band in the analyses allows a detailed exploration of the target, providing relevant information for differentiating soybean varieties [7,39]. Tables 2 and 3 show the efficiency of using spectral bands with artificial neural networks, which is highlighted by the results presented in the confusion matrix (Figure 6). Methodologies that evaluate the plant phenotype in association with computational intelligence are an accurate and reliable way to measure traits while the crop is still in the field [40].
Our findings demonstrate that soybean genotypes can be distinguished more accurately using spectral bands as input to the tested machine learning models. This represents an important scientific advance for mapping soybean areas worldwide. In Brazil, for example, a large number of soybean cultivars are grown each year, and they differ from each other in several traits, especially cycle length. Because soybean in Brazil is grown in the main crop season, the ability to distinguish cultivars opens the possibility of public policies for the prevention of end-of-cycle diseases, harvest planning, and off-season planting.
However, more orbital data also need to be evaluated for discriminating plant species, seeking cloud-free observations, either through data with better spatial resolution (Sentinel-2/MSI) or through satellite constellations (PlanetScope). Applying machine learning techniques to the different characteristics of the various orbital sensors may bring new results, even for sensors that are equivalent, as is the case with the new Landsat-9 platform [41].

Conclusions
Spectral bands were the most accurate among the tested inputs in identifying soybean cultivars. Artificial neural networks provided the highest accuracy in identifying soybean cultivars. These findings demonstrate that soybean genotypes can be distinguished more accurately when spectral bands from public images (Landsat-8 satellite) are used as input to the tested machine learning models. This represents an advance in soybean mapping, allowing the most planted cultivars in a given region to be identified accurately. However, more orbital data also need to be evaluated for discriminating plant species, seeking cloud-free observations, either through data with better spatial resolution (Sentinel-2/MSI) or through satellite constellations (PlanetScope).