Vis-NIR Spectroscopy and Machine Learning Methods for the Discrimination of Transgenic Brassica napus L. and Their Hybrids with B. juncea
Processes 2022, 10, 240

Abstract: The rapid advancement of genetically modified (GM) technology over the years has raised concerns about the safety of GM crops and foods for human health and the environment. Gene flow from GM crops may be a threat to the environment. Therefore, it is critical to develop reliable, rapid, and low-cost technologies for detecting and monitoring the presence of GM crops and crop products. Here, we used visible near-infrared (Vis-NIR) spectroscopy to distinguish between GM and non-GM Brassica napus, B. juncea, and F1 hybrids (B. juncea X GM B. napus). The Vis-NIR spectra were preprocessed with different preprocessing methods, namely normalization, standard normal variate, and Savitzky–Golay. Both raw and preprocessed spectra were used in combination with eight different chemometric methods for the effective discrimination of GM and non-GM plants. Among the many combinations, standard normal variate with support vector machine was the most accurate model for discriminating GM, non-GM, and hybrid plants (99.4%). The use of deep learning in combination with Savitzky–Golay resulted in 99.1% classification accuracy. According to the findings, handheld Vis-NIR spectroscopy combined with chemometric analyses could be used to distinguish between GM and non-GM B. napus, B. juncea, and F1 hybrids.


Introduction
Brassica juncea L. Czern (Brown Mustard) is an important annual crop and is an outcome of hybridization between the diploid Brassica species B. rapa (AA, 2n = 20) and B. nigra (BB, 2n = 16) followed by spontaneous hybridization with chromosome doubling [1]. In China and Korea, wild B. juncea is a natural weedy species widely found along roadsides or on empty land [2,3]. It is known to have the highest potential for gene transfer from B. napus after B. rapa [4]. It has previously been reported that conventional and transgenic B. napus hybridize with B. juncea spontaneously or by hand pollination [5–8]. Recently, Tang et al. [9] found that the estimated frequencies of natural gene flow from genetically modified (GM) B. napus to 10 different B. juncea cultivars in a field experiment varied from 0.08 to 0.93%. The transgenic hybrids' ability to persist is determined by their fitness as crop-wild hybrids [10]. Little is known about the fitness of the F1 hybrid between B. juncea and B. napus in the environment. According to Lim et al. [2], seeds from a hybrid of B. juncea and GM B. napus have shown an increase in dormancy and overwintering traits, suggesting that they could form soil seed banks. Seeds in such a seed bank can germinate again if they meet a favorable environment, leading to the formation of a feral population. As a result, the transgene may spread across the ecosystem. If the flowering period of B. juncea overlaps with that of B. napus, there is a possibility of forming hybrids with GM B. napus and releasing them into the environment. If GM B. napus and hybrids (B. juncea X GM B. napus) can be quickly identified and removed, it will help to avoid the unintentional environmental release of transgenes and promote the safe management of GM B. napus.
Various methods have been used to detect genetically modified organisms (GMOs), including enzyme-linked immunosorbent assays (ELISA), lateral flow strips, biosensors, Western blots, real-time PCR, qualitative polymerase chain reaction (qPCR), microarrays, electrophoresis, Southern blots, liquid chromatography, and gas chromatography [11]. Spectroscopy is now one of the rapid, accurate, and nondestructive methods for distinguishing between GM and non-GM crops, and it does not require complex sample processing [11]. Spectroscopy-based GMO identification does not aim to detect changes in DNA or single proteins, but rather unknown structural changes due to genotype alterations generated by the introduction of transgenes for specific traits [12]. A large number of spectroscopy methods are available for detecting structural changes in different samples, including absorption spectroscopy, photoacoustic spectroscopy, light-induced thermoelastic spectroscopy, and photothermal spectroscopy [13–15]. Among them, near-infrared (NIR) spectroscopy, which works on the principle of absorption spectroscopy, is the most common for the detection of GMOs [11]. NIR spectroscopy coupled with chemometric analyses was found to be effective in discriminating various types of GM and non-GM crops with very high accuracy [11,16,17]. To distinguish transgenic soybean oils from non-transgenic ones, Luna et al. [18] used NIR and support vector machine discriminant analysis (SVM-DA). Later, Garcia-Molina et al. [19] used NIR spectroscopy in combination with partial least squares (PLS) analysis to successfully distinguish low-gliadin wheat grain from non-transgenic wheat lines with 96% classification accuracy. It has been shown that combining spectroscopy with machine learning algorithms makes it possible to distinguish not only GM and non-GM plants, but also plant species [11,20] and even varieties [21].
However, there is no study that discriminates the GM and non-GM plants with their interspecific hybrids. Therefore, in this study, we used visible near-infrared (Vis-NIR) spectroscopy coupled with different preprocessing and machine learning methods for effective discrimination of B. juncea, GM B. napus, and their hybrids (B. juncea X GM B. napus).

Spectral Analysis and Preprocessing
The averaged raw spectra of B. napus, GM B. napus, B. juncea, and F1 hybrids collected in the greenhouse are depicted in Figure 1A; these spectra were not altered in any manner. The Savitzky-Golay preprocessed spectra are shown in Figure 1B, and the spectra preprocessed with standard normal variate (SNV) and normalization are shown in Figure 1C and Figure 1D, respectively. The spectra were preprocessed to remove systematic noise and highlight the variations across the samples [17]. The majority of the spectra acquired from the four plant groups followed a similar pattern, despite variances in spectral reflectance. The difference in average reflectance between GM and non-GM B. napus, B. juncea, and F1 hybrids is thought to reflect the changes in hundreds of physicochemical constituents in the plant leaves. In general, NIR spectra disclose information about a material's chemical composition and physical state. This provides structural data on the chemical functional groups of the elements that constitute the molecular fingerprints of the sample [22,23]. Using a variety of preprocessing methods at the same time helps to achieve a higher level of classification accuracy and provides the opportunity to choose the optimal preprocessing method for a specific sample [11]. The average spectra for all the plants, raw and preprocessed, were visualized using three different methods: the Savitzky-Golay smoothing filter (21 points), normalization, and standard normal variate (Figure 1). In general, normalization is the process of regularizing the data with respect to variations in sample preparation, sample thickness, absorber concentration, etc. Derivatives are mainly used to resolve peak overlap and eliminate constant and linear baseline shifts between samples. SNV is often used on spectra where baseline and path length changes cause differences between otherwise identical spectra [11,24].
Some typical peaks can be seen in Figure 1, especially around 500-600 nm, which is the spectral range for chlorophyll [25], and also around 800 nm. However, it is difficult to differentiate these samples solely on the basis of spectral reflectance. Thus, Vis-NIR spectroscopy coupled with various models and machine learning methods such as discriminant analysis and principal component analysis (PCA) was used for effective discrimination [11]. All of the different PCs showed the same slight pattern of separation for the different samples in the PCA paired plot from PC1 to PC6 (Figure 2A), but PC1 vs. PC2 showed the most visual differences, as shown in Figure 2B, so outlier detection was performed using these two PCs before starting preprocessing for the machine learning methods.

Chemometric Analysis for Discrimination of B. napus, GM B. napus, B. juncea, and F1 Hybrids
The classification accuracy of different chemometric methods combined with various preprocessing methods was calculated in order to determine the most accurate way of distinguishing between GM and non-GM B. napus, B. juncea, and F1 hybrids. A summary of the classification accuracy for the different methods can be found in Table 1. Both original raw spectra and preprocessed spectra assessed with chemometric analyses resulted in effective discrimination with different classification accuracies. However, preprocessed spectra were found to have comparatively higher classification accuracy than raw spectra in most of the chemometric analyses. The classification accuracies of the different methods generally ranged from 62.6 to 99.4% (Table 1).
According to Table 1, the Savitzky-Golay pretreatment proved to be the most efficient preprocessing method for classifying the different plant species with all the tested classification methods except for the support vector machine (SVM) classification technique, where SNV proved to be more effective. With Savitzky-Golay preprocessing, classification accuracies were always higher than when only raw spectra were used, ranging from 80.1 to 99.1%.
Among the different classification methods, support vector machine, linear discriminant analysis, deep learning, and fast large margin were found to have higher classification accuracies in combination with different preprocessing methods (SNV/SVM, 99.4%; Savitzky-Golay/Deep Learning, 99.1%; Savitzky-Golay/SVM, 98.8%) (Table 1). The support vector machine model showed a high accuracy of 97.1% even when using the raw spectrum without preprocessing the data. The support vector machine is especially suitable for high-dimensional data, and the value of each attribute has no limit [26]. When the accuracies of all models were averaged for each of the four preprocessing methods, Savitzky-Golay showed the highest accuracy, followed by standard normal variate, raw spectrum, and normalization (Table 1). Similar studies have already been performed by various researchers on various crops. Feng et al. [17] used NIR in combination with support vector machine and partial least squares discriminant analysis (PLS-DA) for the effective discrimination of GM and non-GM maize. Similarly, VNIR multispectral imaging and PLS-DA were used for discrimination of GM and non-GM rice using least squares support vector machines (LS-SVM) and PCA backpropagation neural network (PCA-BPNN) [27]; Fourier transform infrared (FT-IR) spectroscopy was also used for discrimination of GM and non-GM soybeans with k-nearest neighbors (KNN) [28]. NIR with support vector machine discriminant analysis (SVM-DA) and PLS-DA was used for discrimination of GM and non-GM soybean [18], and NIR with PLS-DA was used for identification of herbicide-resistant GM soybean seeds [16]. The discrimination of transgenic tomato using Vis-NIR with DA and PLS-DA [23] and of RNAi transgenic wheat using NIR and PLS [19] are further examples of effective discrimination of GM and non-GM crops using spectroscopy and chemometric analyses.
Linear discriminant analysis also yielded a high accuracy of 96.5% even when no preprocessing was performed. Figure 3 shows the linear discriminant analysis plot for discriminating the four different plant varieties. GM B. napus slightly overlapped with B. napus, but B. juncea and F1 hybrids were completely separated from each other and all the other plant varieties. This suggests that GM B. napus and non-GM B. napus may share similar biological composition compared to B. juncea and F1 hybrids. Similar studies also reported higher classification accuracy using NIR spectroscopy and linear discriminant analysis to monitor mung bean sprouts [29], classify different melon varieties [30], and detect pea protein powder containing adulterants [31].

Significance of Preprocessing and Selection of Optimal Classification Model
The efficiency of the preprocessing and machine learning methods used in the study was statistically analyzed (Table 2). After cross-validation, the mean percentage of classification accuracy of each chemometric method combined with various preprocessing methods indicated significant modeling for the discrimination of GM and non-GM B. napus, B. juncea, and F1 hybrids (Table 2). The analysis of variance (ANOVA) (Table 3) showed the sum of squares and mean sum of squares values of the various preprocessing and machine learning approaches, with statistical significance at p ≤ 0.005. However, there was no significance with p ≥ 0.005 when using a combination of preprocessing and different machine learning methods together (p value of 0.0005). The confusion matrix depicts the degree of error in the classification of the evaluated plants, indicating that Savitzky-Golay smoothing in combination with support vector machine was the most effective classification approach (Table 4).
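The quantities reported by an ANOVA (sum of squares, mean sum of squares, and the F ratio) can be illustrated with a minimal one-factor sketch in Python. The function name `one_way_anova` and the toy accuracy values are assumptions made for this illustration and are not taken from Table 2 or Table 3; the p value is omitted because it requires the F distribution.

```python
def one_way_anova(groups):
    """Return sums of squares, mean squares, and the F ratio for one factor.

    `groups` is a list of lists; each inner list holds the accuracies
    observed for one level of the factor (e.g. one preprocessing method).
    """
    k = len(groups)                       # number of factor levels
    n = sum(len(g) for g in groups)       # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "F": ms_between / ms_within,
    }


# Hypothetical accuracies (%) for three preprocessing levels, three folds each
toy_accuracies = [[90.0, 91.0, 92.0], [93.0, 94.0, 95.0], [96.0, 97.0, 98.0]]
print(one_way_anova(toy_accuracies))
```

A larger F ratio means that the differences between factor levels are large relative to the fold-to-fold scatter within a level. Testing the interaction of two factors would require a two-factor model, which this sketch does not cover.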

Plant Materials
The seeds used in the study, namely B. napus L. 'Youngsan', B. juncea var. integrifolia, and GM B. napus seeds with CaMV 35S-regulated bar and early flowering gene (BrAGL20), were procured from the National Agrobiodiversity Center, Jeonju, Republic of Korea. For the hybrid preparation, artificial hand pollination was performed with B. juncea and GM B. napus, and the seeds of F1 hybrids (B. juncea X GM B. napus) were used for further research. The hybrids were confirmed through the survival assay after 0.3% Basta treatment; the phenotype of the hybrids; and polymerase chain reaction with 35S ribosomal DNA, BrAGL20 gene partial region, bar gene, and chloroplast marker. All of the seeds were grown in soil pots (Figure 4) and kept in a controlled environment. This research was carried out in the greenhouse of the National Institute of Agricultural Sciences, Jeonju, Republic of Korea, during May-July 2019.

Spectral Data Collection
A handheld integrated portable spectrum analyzer (FieldSpec HandHeld 2, ASD Inc., Longmont, CO, USA) was used to collect Vis-NIR diffuse reflectance spectra in the range of 325-1075 nm with a stepping of 1.5 nm in reflectance mode (log/R). The spectra were collected on the adaxial surface of the fully expanded leaves, which can easily capture light. Three spectra were obtained from various parts of the leaf blade of 100 plants in each group. A total of 300 (3 × 100 = 300) spectra were collected from each group and used for further analysis. To avoid unnecessary noise, the optical window of the Vis-NIR device was placed in direct contact with the leaf's surface throughout each spectrum capture, ensuring that the sensor window was completely covered [32,33].

Preprocessing and Machine Learning Methods
Due to system parameters and environmental noise, background signals appeared in the raw spectra of samples. Different preprocessing methods, such as raw spectra assessment, normalization (area), standard normal variate (SNV), and derivatives (Savitzky-Golay (first differentiation)) were used, which can reduce the spectral noise and improve the accuracy of modeling approaches. The computations on preprocessing were performed with Unscrambler X software, version 10.5.1 (CAMO ASA, Oslo, Norway).
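The three transformations named above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Unscrambler X implementation: the function names are hypothetical, area normalization is taken to mean division by the sum of absolute values, and the Savitzky-Golay first derivative assumes a linear or quadratic local fit (the polynomial order is not stated in the text) with the 21-point window reported for Figure 1.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: centre a spectrum on its own mean and
    scale it by its own sample standard deviation.
    Fails for a constant spectrum, whose standard deviation is zero."""
    m = mean(spectrum)
    s = stdev(spectrum)
    return [(v - m) / s for v in spectrum]


def area_normalize(spectrum):
    """Scale a spectrum so that the sum of its absolute values equals 1."""
    total = sum(abs(v) for v in spectrum)
    return [v / total for v in spectrum]


def savgol_first_derivative(spectrum, window=21, step=1.5):
    """Savitzky-Golay first derivative for a linear or quadratic local fit.

    For a symmetric window of half-width m, the least-squares slope at the
    window centre is sum(k * y[i + k]) / sum(k * k), divided by the
    wavelength step. Points without a full window (m at each end) are dropped.
    """
    if window < 3 or window % 2 == 0:
        raise ValueError("window must be an odd integer >= 3")
    m = window // 2
    offsets = range(-m, m + 1)
    norm = sum(k * k for k in offsets) * step
    return [
        sum(k * spectrum[i + k] for k in offsets) / norm
        for i in range(m, len(spectrum) - m)
    ]


# Synthetic spectrum on a 325-1075 nm axis with 1.5 nm stepping (not measured data)
wavelengths = [325 + 1.5 * i for i in range(501)]
toy_spectrum = [0.2 + 0.0004 * (w - 325) for w in wavelengths]

print(snv(toy_spectrum)[:3])
print(area_normalize(toy_spectrum)[:3])
print(savgol_first_derivative(toy_spectrum)[:3])
```

SNV and area normalization act on each spectrum independently, so they can be applied before the data are split into training and validation sets without leaking information between the two.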
For the effective visualization and discrimination of spectral data, several machine learning methods were used and compared. The modeling was performed with the software package RapidMiner Studio, version 9.0.002 (RapidMiner, Inc., Boston, MA, USA). In the study, eight classification methods were used to find the best modeling approach with the highest classification accuracy, namely deep learning, decision tree, support vector machine, random forest, generalized linear model, fast large margin, naive Bayes, and linear discriminant analysis. Linear discriminant analysis was performed in RStudio using the aquap2 package developed by Kovacs and Pollner [34]. For each of the algorithms, the inputs were the data points of the spectra and the classes were the identification labels of B. napus, GM B. napus, B. juncea, and F1 hybrids (B. juncea X GM B. napus). Cross-validation was performed to assess the robustness of the models in predicting the different sample types. For this, the data were divided into a training set and a validation set. The training set was made up of two-thirds of the data; thus, the spectra from the first and second replicates of each sample were included, while the validation set was made up of spectra from the third replicate. The data splitting was done three times, such that each sample was used at least once in the calibration and validation sets. The classification results are displayed as score plots or confusion matrices, which illustrate the percentages of classification accuracy. One-way analysis of variance (ANOVA) was used to compare means for determining the influence of (1) the scatter correction method, (2) the eight machine learning methods, and (3) the interaction of preprocessing and machine learning methods. As a mean comparison method, Tukey's range test was used at a significance level of p ≤ 0.05.
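The replicate-wise splitting and the confusion matrix described above can be sketched in plain Python. This is one reading of the text, in which each of the three replicates is held out once in turn; the function names and the toy labels are hypothetical, and the sketch is independent of RapidMiner.

```python
from collections import Counter


def replicate_folds(replicates):
    """Yield (held_out, train_idx, valid_idx) with each replicate held out once.

    `replicates` gives the replicate number (e.g. 1, 2, or 3) of every spectrum.
    """
    for held_out in sorted(set(replicates)):
        train_idx = [i for i, r in enumerate(replicates) if r != held_out]
        valid_idx = [i for i, r in enumerate(replicates) if r == held_out]
        yield held_out, train_idx, valid_idx


def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes (counts)."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


def accuracy_percent(matrix):
    """Share of spectra on the diagonal of the confusion matrix, in percent."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


# Toy example: two plants with three replicate spectra each (hypothetical)
toy_replicates = [1, 2, 3, 1, 2, 3]
for held_out, train_idx, valid_idx in replicate_folds(toy_replicates):
    print(held_out, train_idx, valid_idx)

# Hypothetical true and predicted labels for one validation fold
classes = ["B. napus", "GM B. napus", "B. juncea", "F1 hybrid"]
y_true = ["B. napus", "GM B. napus", "B. juncea", "F1 hybrid"]
y_pred = ["B. napus", "B. napus", "B. juncea", "F1 hybrid"]
cm = confusion_matrix(y_true, y_pred, classes)
print(cm, accuracy_percent(cm))
```

Splitting by replicate number keeps two-thirds of the spectra in the training set and one-third in the validation set for every fold, which matches the proportions stated in the text.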

Conclusions
In conclusion, Vis-NIR spectroscopy in combination with machine learning methods could effectively discriminate between GM and non-GM B. napus, B. juncea, and the F1 hybrids (B. juncea X GM B. napus). The utilization of Vis-NIR spectroscopy and chemometric analyses for the discrimination of GM and non-GM crops is quick and accurate. It can also deliver information for monitoring and safety management of agro-food market products in which GMOs are introduced. Among the different combinations of preprocessing and machine learning methods, the combination of standard normal variate and support vector machine was found to be the most effective method, with 99.4% classification accuracy, but Savitzky-Golay smoothing also yields good classification accuracy when other classification methods are used. Thus, it is proposed that this nondestructive method be employed in the field for the rapid detection and management of unintended releases of GM Brassicaceae crops into the environment. It is also suggested that a database of broad-spectrum results on GM and non-GM Brassicaceae crops be created for the effective utilization of the technology in the field.