Identification of Amaranthus Species Using Visible-Near-Infrared (Vis-NIR) Spectroscopy and Machine Learning Methods

Abstract: The feasibility of rapid and non-destructive classification of six different Amaranthus species was investigated using visible-near-infrared (Vis-NIR) spectra coupled with chemometric approaches. The focus of this research was the use of a handheld spectrometer in the field to classify six Amaranthus sp. in different geographical regions of South Korea. Spectra were obtained from the adaxial side of the leaves at 1.5 nm intervals in the Vis-NIR spectral range between 400 and 1075 nm. The obtained spectra were assessed with four different preprocessing methods to identify the one giving the highest classification accuracy. Preprocessed spectra of the six Amaranthus sp. were used as input for the machine learning-based chemometric analysis. All classification results were validated using cross-validation to produce robust estimates of classification accuracy. The different combinations of preprocessing and modeling had classification accuracies between 71% and 99.7% after cross-validation. The combination of Savitzky-Golay preprocessing and Support Vector Machine showed the maximum mean classification accuracy of 99.7% for the discrimination of Amaranthus sp. Given the large number of spectra involved in this study, the growth stage of the plants, the varying measurement locations, and the scanning position of leaves on the plant are all important factors. We conclude that Vis-NIR spectroscopy, in combination with appropriate preprocessing and machine learning methods, may be used in the field to classify Amaranthus sp. effectively for the management of the weedy species and/or for monitoring their food applications.


Introduction
Amaranthus is a cosmopolitan genus of herbs with about 70 species worldwide, of which about nine have been introduced into Korea [1]. It is widely distributed from temperate to tropical regions, and its species are very difficult to distinguish morphologically because of extensive intra-species hybridization [2]. The Amaranthus species introduced into Korea are mainly distributed in areas with relatively large ecosystem disturbances, such as agricultural land, roadsides, bare land, and riversides [3]. Although Amaranthus sp. are widespread in agriculture and are observable throughout their life cycle, a new classification technique is needed because they are difficult to distinguish and manage, particularly for quality control purposes [4]. Furthermore, plant databases are becoming increasingly important for conserving endemic plants and classifying the floral diversity of various species [5]. Species identification of plants by leaf analysis is an important aspect of botany that is attracting growing interest for a variety of reasons, from industrial benefits to endangered species protection [6]. In addition, nearly all Amaranthus sp. are edible, but the varieties sold for eating have been selected for their good seed production and tasty leaves, which are believed to be rich in vitamin C and iron and taste rather like spinach [7].
Generally, the large number of plant species worldwide necessitates the adaptation and development of rapid and competent classification techniques, which has become an active area of research [8]. Basic chemical methods based on isoenzymes or DNA analysis are also used for classification [9]. However, these methods are labor-intensive and time-consuming and cannot be performed under field conditions [9]. Furthermore, sample preparation procedures raise additional cost, time and technical accuracy issues, indicating the need for a non-invasive alternative approach [10,11]. Rapid non-invasive approaches with remote analysis are constantly sought to keep pace with the huge number of plant species of nutritional, economic and historical value [8,9].
Visible and near-infrared (Vis-NIR) spectroscopy is one such method that has recently been employed in multiple studies for component detection and authentication purposes [8]. Vis-NIR spectroscopy is a non-destructive analytical method with the advantages of simple preprocessing and fast data acquisition, and it may be applied in the agricultural industry for monitoring and quality control [9,12]. It is a quantitative method based on the Beer-Lambert law: when specific functional groups in a sample are exposed to Vis-NIR radiation, they undergo molecular vibrations and absorb light of specific wavelengths [13], and the degree of absorption is proportional to the concentration of those functional groups in the sample. Recently, Vis-NIR spectroscopy has been used to accurately identify important chemical components in the pharmaceutical, food and agricultural industries [9]. Its potential for precise and reliable identification of plants is also being investigated [14–16]. According to the literature, Vis-NIR spectroscopy is frequently used in combination with various chemometric and multivariate analyses, which are selected based on the objectives of the study [9,12,17]. Among these techniques, supervised and unsupervised classification are the most commonly used, and they are based on the fact that samples with similar spectral responses are similar in physical, chemical, and biochemical properties [18]. NIR/Vis-NIR spectroscopy techniques have been used to identify chemical properties in recent studies. The near-infrared spectrum has been used to identify a variety of substances, including water physicochemical content, cellulose, lignin, cutin, and xylan [19–21], protein powder [22] and even genetically modified foods. These were all achieved through the so-called "fingerprint" method, in which a unique spectrum or set of spectra can be used to define a particular component in the analytes [23].
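For reference, the Beer-Lambert relation mentioned above is commonly written as shown here. The symbols are standard textbook notation and are not taken from this study: $A$ is absorbance, $I_0$ and $I$ are the incident and transmitted intensities, $\varepsilon$ is the molar absorptivity, $\ell$ is the optical path length, and $c$ is the concentration of the absorbing species. The second expression is the usual "apparent absorbance" convention for reflectance data, with $R$ the measured reflectance; it is given as an assumption about how reflectance readings are typically related to absorbance, and the strict proportionality to concentration holds only approximately for scattering samples such as leaves.

```latex
A(\lambda) = \log_{10}\frac{I_0(\lambda)}{I(\lambda)} = \varepsilon(\lambda)\,\ell\,c,
\qquad
A_{\mathrm{app}}(\lambda) = \log_{10}\frac{1}{R(\lambda)}
```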
Changes in Vis-NIR spectra are often too small to notice with the human eye, which is why the practical usefulness of Vis-NIR spectroscopy as an analytical tool depends on statistical and mathematical manipulation of the spectral data [9]. The physical and environmental conditions of the experiment can also influence spectral quality during Vis-NIR spectroscopy analysis [24]. Therefore, preprocessing techniques have been proposed as one of the initial steps in the analysis of Vis-NIR spectroscopy data for optimized results [25,26]. The combination of appropriate preprocessing, chemometric tools and machine learning approaches with Vis-NIR spectroscopy has been used in various areas of the agricultural sciences [9]. This combination has given excellent results in identifying plant species by measuring the near-infrared spectra of plant tissues (leaves, timber, bark) in tropical rain forests. Sandak et al. [27] developed models for the automated in-field determination of quality indices for log grading in mountain forests by means of a portable spectrometer. Durgante et al. [14] achieved a high level of classification accuracy for closely related species of Eschweilera and Corythophora (Lecythidaceae) from the central Amazon using NIR spectral data from dry leaves.
Despite all this, there is no information on the development of a classification model based on Vis-NIR spectroscopy and machine learning for Amaranthus sp. Hence, the present study aims to analyze the potential of Vis-NIR spectroscopy to discriminate the six Amaranthus sp. from different geographical locations in South Korea with different preprocessing and machine learning approaches.

Plant Materials
Six Amaranthus sp. were identified and selected for species discrimination from six different geographical locations in Korea (Table 1). Details of the Amaranthus sp., their distribution, the spectra collection sites, and the number of measured spectra for each species are provided in Table 1. The different geographical locations have different environmental conditions. Typical images of the different Amaranthus sp. identified in the fields are shown in Figure 1.
The study was performed from May to July 2019 in six different geographical regions of South Korea.

Spectral Measurement in the Field
For the visible and near-infrared (Vis-NIR) spectral acquisition, fully expanded leaves with no signs of disease or insect damage were selected, and an integrated portable spectral analyzer (FieldSpec® HandHeld 2, ASD Inc., Longmont, CO, USA) working in reflectance mode (log 1/R) in the range of 400-1075 nm with a step of 1.5 nm was used. Spectral measurement was performed directly on the adaxial surfaces of the leaves, which are the main light-capturing surfaces. For each leaf, three spectra were taken from different spots on the leaf blade. During each acquisition, the optical window of the Vis-NIR device was placed in direct contact with the surface of the leaf, making sure that the sensor window was completely covered. To avoid contamination of the adaxial surfaces with external pollutants, vinyl gloves were used at all times when handling the leaves.
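The sampling grid described above can be sketched in a few lines to show how many data points each spectrum contributes to the models. This is only an illustration: the function names are hypothetical, the reflectance values are synthetic, and the conversion to apparent absorbance as log10(1/R) is an assumption about how the instrument output maps to the absorbance values used later.

```python
import numpy as np

def wavelength_axis(start_nm=400.0, stop_nm=1075.0, step_nm=1.5):
    """Return evenly spaced wavelengths from start_nm to stop_nm, inclusive."""
    n_points = int(round((stop_nm - start_nm) / step_nm)) + 1
    return np.linspace(start_nm, stop_nm, n_points)

def reflectance_to_absorbance(reflectance, eps=1e-6):
    """Convert reflectance (fraction, 0..1) to apparent absorbance log10(1/R).

    Values are clipped at eps to avoid taking the logarithm of zero.
    """
    r = np.clip(np.asarray(reflectance, dtype=float), eps, None)
    return np.log10(1.0 / r)

wavelengths = wavelength_axis()

# Synthetic reflectance readings for three spots on one leaf (not real data).
rng = np.random.default_rng(0)
reflectance = rng.uniform(0.05, 0.6, size=(3, wavelengths.size))
absorbance = reflectance_to_absorbance(reflectance)

print(wavelengths.size, absorbance.shape)  # 451 (3, 451)
```

Under these assumptions each spectrum has 451 points, so every leaf scan becomes a 451-dimensional input vector for the classifiers.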

Preprocessing of Spectral Data
The raw spectra contained not only sample-related information but also noise generated by different variables, which interferes with the spectral information and hampers model building and the prediction of unknown sample composition or characteristics. To obtain the best discrimination model, four spectral preprocessing options were compared: no treatment (raw data), normalization, Savitzky-Golay [28], and standard normal variate [29]. The aim was to find an optimal preprocessing method that removes noise from the spectral data and improves the predictive performance of the classification models. All computations were carried out in Unscrambler® X software, version 10.5.1 (CAMO ASA, Oslo, Norway).
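The three transformations named above can be sketched as follows. This is not the Unscrambler implementation used in the study. The min-max form of normalization, the Savitzky-Golay window length, polynomial order and derivative order, and all function names are assumptions chosen for illustration, and the spectra are synthetic.

```python
import numpy as np
from scipy.signal import savgol_filter

def minmax_normalize(spectra):
    """Scale each spectrum (row) to the range [0, 1]. Assumed normalization type."""
    lo = spectra.min(axis=1, keepdims=True)
    hi = spectra.max(axis=1, keepdims=True)
    return (spectra - lo) / (hi - lo)

def savitzky_golay(spectra, window_length=11, polyorder=2, deriv=1):
    """Savitzky-Golay filter along the wavelength axis.

    With deriv=1 this returns a smoothed first derivative per data point;
    the window, order and derivative settings here are illustrative guesses.
    """
    return savgol_filter(spectra, window_length, polyorder, deriv=deriv, axis=1)

def standard_normal_variate(spectra):
    """Center each spectrum to zero mean and scale it to unit standard deviation."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

# Synthetic stand-in for a small batch of raw spectra (5 spectra x 451 bands).
rng = np.random.default_rng(1)
raw = rng.normal(loc=1.0, scale=0.2, size=(5, 451))

preprocessed = {
    "raw": raw,
    "normalization": minmax_normalize(raw),
    "savitzky_golay": savitzky_golay(raw),
    "snv": standard_normal_variate(raw),
}
for name, data in preprocessed.items():
    print(name, data.shape)
```

Each variant keeps the same array shape, so the four versions of the dataset can be passed to the same classifiers and compared directly.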

Modeling and Statistical Analysis
To analyze the data extracted from the Vis-NIR spectroscopy, a data mining model was developed. Model construction was performed with RapidMiner Studio version 9.0.002 (RapidMiner, Inc., Boston, MA, USA). RapidMiner is a software package for data mining and machine learning. It was used to apply different algorithms to the dataset, and the performance of each algorithm was evaluated using the performance operator. Four classification algorithms, namely the Support Vector Machine [30], Generalized Linear Model [31], Decision Tree, and Naïve Bayes, were used to find the modeling approach with the highest classification accuracy. For each algorithm, the inputs were the data points of the spectra (absorbance values at wavelengths from 400 nm to 1075 nm, with a step of 1.5 nm) and the classes were the identification labels of each Amaranthus sp. All classification results were validated using cross-validation to obtain robust estimates of classification accuracy [32]. One-way analysis of variance (ANOVA) was performed to compare means and test the influence of (i) the application of a scatter correction method, (ii) the four classification algorithms, and (iii) the interaction of the two preceding factors. Tukey's range test was used as the mean comparison method at a significance level of p ≤ 0.05.
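The model comparison described above can be approximated outside RapidMiner with scikit-learn, as in the sketch below. It is an illustration only: the data are synthetic, the number of folds is an assumption (the source does not state it), the hyperparameters are library defaults, and multinomial logistic regression is used as a stand-in for the Generalized Linear Model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset: 6 classes, 40 spectra each, 451 bands per spectrum.
rng = np.random.default_rng(42)
n_species, n_per_species, n_bands = 6, 40, 451
x = np.linspace(0.0, 1.0, n_bands)
X = np.vstack([
    np.sin(2 * np.pi * (x + 0.05 * k))
    + rng.normal(0.0, 0.3, size=(n_per_species, n_bands))
    for k in range(n_species)
])
y = np.repeat(np.arange(n_species), n_per_species)

# Candidate classifiers; settings are defaults, not those of the study.
models = {
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Generalized linear model (logistic)": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Stratified k-fold keeps the class proportions equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
mean_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Running the same loop once per preprocessed version of the data would give a preprocessing-by-model grid of mean accuracies comparable in layout to Table 2.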

VNIR Spectra and Data Preprocessing
Raw spectra (without preprocessing) collected from the leaves of six Amaranthus sp. are shown in Figure 2a-f. The X-axis represents the wavelength and the Y-axis the spectral absorbance (Figure 2). With the exception of A. viridis, no clear differences could be seen in the spectral patterns of the analyzed species (Figure 2a-f), but the average spectral curves of the Amaranthus sp. were somewhat different, suggesting the need for more detailed mathematical analysis. The differences among the six Amaranthus sp. were further visualized by principal component analysis (PCA), in which PC1 explained 89.35% of the variance. The different species could be partially separated visually in the plot, with the exception of A. patulus, which overlapped with all the other species (Figure 3). Therefore, chemometric methods were introduced to build more reliable qualitative models for classification after outlier detection in PCA. Based on visual inspection of the spectra prior to preprocessing and on outlier detection in PCA, spectra that could have been affected by measurement errors were removed, and the final spectral library contained a total of 6242 leaf spectra.
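The share of variance carried by each principal component, such as the PC1 figure quoted above, can be computed from a singular value decomposition of the mean-centered spectra. The sketch below uses synthetic spectra dominated by a single baseline effect; the function name and data are illustrative and do not reproduce the study's analysis.

```python
import numpy as np

def pca_explained_variance(spectra, n_components=3):
    """Return PCA scores and explained-variance ratios via SVD.

    spectra: array of shape (n_samples, n_bands).
    """
    centered = spectra - spectra.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    variances = s ** 2 / (spectra.shape[0] - 1)
    ratios = variances / variances.sum()
    scores = u[:, :n_components] * s[:n_components]
    return scores, ratios[:n_components]

# Synthetic spectra: one dominant source of variation plus small noise.
rng = np.random.default_rng(7)
profile = np.linspace(1.0, 2.0, 451)
offsets = rng.normal(0.0, 1.0, size=(60, 1))
spectra = offsets * profile + rng.normal(0.0, 0.05, size=(60, 451))

scores, ratios = pca_explained_variance(spectra)
print(scores.shape, np.round(ratios, 4))
```

Plotting the first two columns of `scores`, colored by species, gives the kind of score plot in which overlap between species and candidate outliers can be inspected.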

Chemometric Analysis-Based Species Discrimination
The classification accuracy of the various machine learning approaches combined with the different preprocessing methods was calculated to identify the most suitable method for the discrimination of Amaranthus sp. After cross-validation, the classification accuracy ranged from 71% to 99.7% for the different classification models, depending on the combination of preprocessing and model applied to the spectra (Table 2). For the Support Vector Machine, preprocessing with the Savitzky-Golay derivative yielded the best classification accuracy, 99.7%. For the Generalized Linear Model, the best classification accuracy, 98%, was also achieved with Savitzky-Golay derivative preprocessing. For the Decision Tree and Naive Bayes classification models, the best classification accuracies, 89.6% and 89%, respectively, were achieved using the Standard Normal Variate preprocessing technique. Overall, the Support Vector Machine yielded the highest classification accuracy among all the tested classification models (99.7% with the Savitzky-Golay derivative).
In this study, normalization yielded the lowest accuracies among the tested preprocessing methods (Table 2). With normalization, the accuracies of the Generalized Linear Model and Support Vector Machine were 93% and 91.3%, respectively, whereas those of Naive Bayes and Decision Tree were 78.3% and 72.8%, respectively. With Savitzky-Golay preprocessing, the accuracies of the Support Vector Machine and Generalized Linear Model were 99.7% and 98%, respectively, while those of Naive Bayes and Decision Tree were 87.5% and 89%, respectively. With Standard Normal Variate preprocessing, the Support Vector Machine showed 98.8% accuracy, the Generalized Linear Model 92.5%, Naive Bayes 89%, and Decision Tree 89.6%. It is not certain whether these results will transfer to varietal discrimination or classification of other plants, as numerous factors, such as light conditions and the state of the spectrum acquisition device, can affect the outcome of applying spectral preprocessing methods.

Significance of Preprocessing and Selection of Optimal Classification Model
The effects of preprocessing and of the various modeling algorithms on the spectral datasets obtained from six Amaranthus sp. were statistically analyzed (Table 3). The mean classification accuracy of each modeling method in combination with the different preprocessing methods after cross-validation indicates which combinations discriminate Amaranthus sp. significantly better (Table 3). Among them, the combination of the Generalized Linear Model and Savitzky-Golay preprocessing was found to be significant. Savitzky-Golay preprocessing together with the Support Vector Machine yielded the highest mean classification accuracy, 99.7%. The ANOVA in Table 4 shows the effects of preprocessing and modeling approaches on species classification accuracy. The effect of preprocessing on the discrimination of Amaranthus sp. was significant at p ≤ 0.05 (p-value of 0.0045), as was the effect of the modeling approach (p-value of 0.0039). However, the interaction between preprocessing and model was not significant at this level (p-value of 0.0549). The confusion matrix in Table 5 shows the degree of error in the discrimination of the different Amaranthus sp. and also suggests that Savitzky-Golay preprocessing combined with the Support Vector Machine was the most effective method for the classification. Five of the six species (A. patulus, A. spinosus, A. viridis, A. retroflexus, and A. powellii) showed perfect scores (percentage of correct classification: 100%). For A. lividus, when the spectra were verified with this combination, only one instance was misclassified, being confused with A. retroflexus.
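The per-species percentages read from a confusion matrix such as Table 5 are the diagonal counts divided by the row totals. The sketch below shows the calculation with made-up counts (10 spectra per species), not the values of Table 5. It assumes rows are true species and columns are predicted species, and that the single error is an A. lividus spectrum predicted as A. retroflexus; the direction of that confusion is an assumption.

```python
import numpy as np

species = ["A. patulus", "A. spinosus", "A. viridis",
           "A. retroflexus", "A. powellii", "A. lividus"]

def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix with true classes in rows and predicted classes in columns."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Illustrative labels only: 10 spectra per species, one A. lividus error.
y_true = np.repeat(np.arange(len(species)), 10)
y_pred = y_true.copy()
y_pred[-1] = species.index("A. retroflexus")

cm = confusion_matrix(y_true, y_pred, len(species))
per_class = cm.diagonal() / cm.sum(axis=1)   # fraction correct per true species
overall = cm.trace() / cm.sum()              # overall classification accuracy

for name, acc in zip(species, per_class):
    print(f"{name}: {100 * acc:.1f}%")
print(f"overall: {100 * overall:.1f}%")
```

With a single error, five species score 100% and only the affected species drops, which mirrors the pattern described above.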

Discussion
Field spectroscopy has been widely used for the effective discrimination of plant species in fields and forests. The main difficulty is distinguishing the differences in spectral response among the different species [33,34]. NIR spectroscopy combined with machine learning approaches has addressed this issue. Generally, Vis-NIR spectra may contain substantial noise from the instrument and the environment, and preprocessing methods are highly useful for reducing this noise and obtaining proper results [35]. Fernández-Cabanás et al. [36] noted that selecting a suitable spectral preprocessing method is not easy because several different mathematical transformations can be used, and different preprocessing methods lead to different prediction results. As shown in Figure 4, the spectra preprocessed with the three methods (normalization, Savitzky-Golay and standard normal variate) showed reduced noise and enhanced resolution and spectral features in comparison with the raw spectra. The best preprocessing choice for spectral analysis should be made based on a combination of statistical testing and model prediction with regard to the objective of the study [37].
Model selection is an important part of mathematical modeling and is often performed based on the complexity of the developed models and their prospective application [9,38]. The Support Vector Machine is well suited to high-dimensional data and places no restriction on the values of each attribute [39]. This method produced classification accuracies similar to ours for the classification of cotton leaves through image analysis [40] and for the detection of tomato [41] and guava [42] plant diseases through leaf analysis. Bergo et al. [43] used NIRS and PLS-DA to distinguish between Swietenia macrophylla and Carapa guianensis. Soares-Filho et al. [44] were successful in classifying six look-alike Amazon species of mahogany using a handheld NIRS instrument. Buitrago et al. [45] selected the wavelengths that most effectively distinguish 19 species from infrared spectra using the fresh leaf spectrum. Hadlich et al. [16] identified 11 species in the Amazon forest using Vis-NIR spectra from the outer or inner bark of the trees, collected with a handheld spectrometer in the field. This study confirmed that discrimination and classification of Amaranthus sp. in the field using portable Vis-NIR spectroscopy is possible in combination with different machine learning techniques. Previously, the discrimination of Amaranthus sp. has been performed successfully with biochemical and DNA-based methods [4,46,47], but in this study Vis-NIR spectroscopy proved capable of discriminating six Amaranthus sp. This is particularly important because the method is fast and affordable. In addition, the plant has recently been rediscovered as a promising food crop, mainly due to its resistance to heat, drought, diseases and pests, and the high nutritional value of both seeds and leaves [48]. For example, the gluten-free seeds have a nutty flavor and are high in protein and calcium, while the leaves are reported to be rich in antioxidants and phytochemicals, depending on the species [49].
The classification accuracies (71-99.7%) obtained from the Amaranthus spectral dataset are very high considering the variation in the growth stage of the plants, the measurement locations, and the position of the measured leaves on the plant. Changes in spectral properties related to the biochemical composition and structure of leaves, which depend on many factors such as the developmental position of the leaf on the plant and its microclimate, are known to be powerful factors that induce spectral differentiation [50]. These changes arise from variation in cell wall components such as polysaccharides, proteins, and phenolic compounds, all of which can change significantly throughout the plant growth period [51,52]. There are often spectral differences between different growth stages of the same species, but research suggests that plant species can be distinguished if the differences in spectral signatures between species are sufficiently large [15]. Early studies of plants using diffuse reflectance measurements suggested that plant cuticles and the underlying cell walls determine spectral features [19,53]. In addition, various symbionts, parasites, and epiphylls are found in and on plant tissue, and these can modify the spectral signature. Discriminant function graphs based only on young samples always show greater separation of species than graphs from adults, which is thought to be due to shared biotic contaminants, suggesting that some convergence occurs in mature plants [15]. Castro-Esau et al. [54] suggested that leaves of the same species, but of different age and health, will vary widely in their spectral reflectance properties, and that the internal leaf structure affects leaf reflectance in the near-infrared region. However, some plant species show spectral differences depending on the developmental stage, while others do not. Therefore, more research is needed to determine whether the morphological and chemical changes associated with developmental stage follow the same pattern across plant species.

Conclusions
The results of this study demonstrated that Vis-NIR spectroscopy can discriminate Amaranthus sp. with a notable accuracy of up to 99.7%. The combination of Savitzky-Golay preprocessing and Support Vector Machine yielded high reliability in the development of a varietal classification model, suggesting that remote field analysis of Amaranthus sp. using Vis-NIR is possible. The results indicate that a technology for classifying Amaranthus sp. easily, quickly, and accurately, even at the young stage of the plants, is feasible, and we recommend that it be explored. In future studies, we recommend using Vis-NIR spectroscopy in combination with appropriate preprocessing and models, which can be helpful in the development and maintenance of plant libraries through species identification and discrimination. The study can be expanded to multiple applications in botany, such as the early diagnosis of certain plant diseases to reduce postharvest losses, and to quality assurance in the food industries.