Application for Identifying the Origin and Predicting the Physiologically Active Ingredient Contents of Gastrodia elata Blume Using Visible–Near-Infrared Spectroscopy Combined with Machine Learning

Gastrodia elata (G. elata) Blume is widely used as a health product with significant economic, medicinal, and ecological values. Due to variations in the geographical origin, soil pH, and content of organic matter, the levels of physiologically active ingredient contents in G. elata from different origins may vary. Therefore, rapid methods for predicting the geographical origin and the contents of these ingredients are important for the market. This paper proposes a visible–near-infrared (Vis-NIR) spectroscopy technology combined with machine learning. A variety of machine learning models were benchmarked against a one-dimensional convolutional neural network (1D-CNN) in terms of accuracy. In the origin identification models, the 1D-CNN demonstrated excellent performance, with the F1 score being 1.0000, correctly identifying the 11 origins. In the quantitative models, the 1D-CNN outperformed the other three algorithms. For the prediction set of eight physiologically active ingredients, namely, GA, HA, PE, PB, PC, PA, GA + HA, and total, the RMSEP values were 0.2881, 0.0871, 0.3387, 0.2485, 0.0761, 0.7027, 0.3664, and 1.2965, respectively. The Rp2 values were 0.9278, 0.9321, 0.9433, 0.9094, 0.9454, 0.9282, 0.9173, and 0.9323, respectively. This study demonstrated that the 1D-CNN showed highly accurate non-linear descriptive capability. The proposed combinations of Vis-NIR spectroscopy with 1D-CNN models have significant potential in the quality evaluation of G. elata.


Introduction
In recent years, Gastrodia elata (G. elata) Blume has been used as a health product in some countries. Because it has some remarkable and reliable benefits, it has received considerable attention. As one of the traditional food materials and a rare Chinese medicine, G. elata is widely used in cooking, healthcare products, and cosmetics in China [1]. G. elata was designated by the Chinese Health Commission as a pilot variety for the management of "substances that are both food and Chinese herbal medicines according to tradition" and represented a "medicine-food homology" in the legal sense [2,3]. Currently, most G. elata are artificially cultivated. The quality of G. elata may vary significantly between different geographical origins owing to the differences in the growing environment, climate, and soil [4]. In China, G. elata has been cultivated in many provinces, including Guizhou, Yunnan, Shaanxi, Hubei, and Henan. The World Health Organization (WHO) has indicated

Foods 2023, 12, 4061

algorithm, K-nearest neighbor (KNN) exhibits a high accuracy when it is used for classification and regression. However, it is insensitive to outliers and has issues, such as difficult feature band selection [24]. Meanwhile, support vector machine (SVM) and support vector regression (SVR), as non-linear algorithms, are effective in reducing the model complexity and prediction error for both classification and regression tasks. However, they require the manual selection of suitable features and kernel functions [25]. As a deep learning algorithm, the one-dimensional convolutional neural network (1D-CNN) can overcome the aforementioned issues.
So far, the application of convolutional neural networks (CNNs) has produced significant results in various fields. A 1D-CNN has a similar structure to a conventional CNN; however, the former is more powerful at model representation than the latter. It can be trained effectively using limited data sets because it only requires simple pre-processing, exhibits good information extraction efficiency, and has low computational requirements. Therefore, optimal models can be obtained in real-life applications using a 1D-CNN [26–28]. It was thus postulated that combining the excellent predictive performance of a 1D-CNN with detailed Vis-NIR analysis would allow for the simplification of complex tasks, such as the identification of the geographical origin and the prediction of the physiologically active ingredient contents in herbal medicines.
The aim of this study was to establish an effective and industrially referable method for the evaluation of G. elata to identify the origin of G. elata and predict its physiologically active ingredient contents. Specifically, our objectives were (1) to evaluate the potential of applying Vis-NIR spectroscopy for the identification of the geographical origin of G. elata and prediction of the bioactive contents, (2) to compare the prediction efficacy of models established using deep learning algorithms (1D-CNN) and conventional machine learning methods (PLS-DA/PLSR, KNN, and SVM/SVR) and identify the ideal modeling method, (3) to verify the feasibility of using a Vis-NIR discriminant model to identify the origin of G. elata through spectral characterization, and (4) to establish a calibration model for the bioactive components in G. elata using Vis-NIR analysis for predicting the contents of multiple components simultaneously.

Sample Collection and Pre-Treatment
The majority of the market's G. elata comes from various geographical origins in China (Figure 1). The samples for this study were collected from these cultivation bases in December 2021 by the Bijie Institute of Traditional Chinese Medicine, Bijie, Guizhou Province. The information on the samples is shown in Table 1. The G. elata samples from different origins and batches were classified, cleaned, steamed for 30 min, and then dried in a 50 °C oven (Shanghai Yetuo Technology Co., Ltd., Shanghai, China). Subsequently, to obtain a homogeneous G. elata powder sample, each batch of dried G. elata samples was crushed and sieved. A total of 240 powder samples of G. elata were obtained and stored in a laboratory at room temperature (25 ± 1 °C) and a humidity of 45 ± 1%. These samples were analyzed via HPLC after collecting the Vis-NIR spectral data.

Acquisition of Vis-NIR Spectral Data
A Vis-NIR spectrometer (XDS Rapid Content, Foss NIR Systems Inc., Hillerød, Denmark) with silicon (400-1100 nm) and lead sulfide (1100-2500 nm) detectors was used to collect the spectra in the range of 400-2500 nm with a sampling interval of 2 nm. Vis-NIR spectra of samples were collected in the laboratory at a room temperature of 25 ± 1 °C and humidity of 45 ± 1%. The spectrum of each sample was measured three times, and the average spectrum was used for further analysis.

Determination of the Contents of Bioactive Components via HPLC
A total of 2.0 g of each G. elata sample was weighed and transferred into a 50 mL conical flask, and 25 mL of a 60% methanol solution was added. The mixture was weighed, and the extraction was performed via ultrasonication for 1 h. The mixture was then weighed, replenished, and centrifuged (Hunan Kaida Scientific Instrument Co., Ltd., Changsha, China). Thereafter, 5 mL of the supernatant was added to 5 mL of the 60% methanol solution and filtered through a 0.45 µm microporous membrane before analyzing via HPLC.

HPLC analysis was performed on an UltiMate 3000 HPLC system (Thermo Fisher Scientific, Waltham, MA, USA) with a Phenomenex Luna C18 (250 mm × 4.6 mm, 5 µm) column. The mobile phase consisted of acetonitrile (A) and 0.1% aqueous phosphate solution (B). The flow rate was 1.0 mL/min, and the gradient elution conditions were as follows: 0-5 min, 3.0% A; 5-15 min, 3.0-5.0% A; 15-22 min, 5.0% A; 22-25 min, 5.0-10.1% A; 25-35 min, 10.1-10.2% A; 35-45 min, 10.2-14.0% A; 45-52 min, 14.0% A; 52-55 min, 14.0-16.5% A; 55-63 min, 16.5-17.5% A; 63-65 min, 17.5-20.0% A; 65-70 min, 20.0% A. The column temperature was 35 °C, the injection volume was 4 µL, and the detection wavelength was 220 nm. For the validation of the HPLC method, refer to [29]. Compared with the method recorded in the Chinese Pharmacopoeia (2020) [30], this method allows for the simultaneous determination of multiple physiologically active ingredient contents of G. elata.

Partial Least Squares Regression (PLSR)/Partial Least Squares Discriminant Analysis (PLS-DA)
PLSR is a multivariate statistical method commonly used in spectral analysis for regression with linear features. PLSR operates well when processing predictor variables with multicollinearity and considers both spectral and feature information. PLS-DA is an extension of the classical PLS algorithm and is also based on a linear classification technique [31,32]. In this study, the categorical variables of the calibration sample set were established first during the development of the PLS-DA discriminant model, followed by the PLS analysis of the categorical variables and the spectral data to establish their PLS model; the values of the categorical variable (ypredicted) of the testing set were calculated based on this PLS model. The classification variables of the eleven different sample origins were assigned as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11. When 0.5 < ypredicted < 1.5, the samples were assigned to the first category. When 1.5 < ypredicted < 2.5, the samples were assigned to the second category.
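The interval rule for turning a continuous PLS output into a category can be sketched as follows. The text states the rule only for the first two categories; extending the same ±0.5 window to all eleven labels is an assumption, and the function name `assign_origin` is hypothetical.

```python
def assign_origin(y_predicted, n_classes=11):
    """Map a continuous PLS-DA output to a class label in 1..n_classes.

    A sample is assigned to class c when c - 0.5 < y_predicted < c + 0.5.
    Values that fall outside every open interval (including the exact
    boundaries such as 1.5) are left unassigned and returned as None.
    """
    for label in range(1, n_classes + 1):
        if label - 0.5 < y_predicted < label + 0.5:
            return label
    return None


# Example: a predicted value of 1.2 falls in the first category.
first = assign_origin(1.2)
second = assign_origin(2.49)
unassigned = assign_origin(0.3)
```

Returning `None` for out-of-range values keeps unclassifiable samples visible instead of forcing them into the nearest category.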

K-Nearest Neighbor (KNN)
KNN is a supervised-learning-based classification algorithm that can be used for classification and regression tasks [24]. The basic concept of the KNN algorithm [33] is to calculate the distance or similarity between the data of the sample to be classified and the known training samples and determine the K-nearest neighbors to the sample data to be classified according to the distance or similarity. Thereafter, the category of the sample data can be determined based on the categories of the neighbors. When all the K neighbors of the sample data belong to the same category, the sample is assigned to that category. The distance between new data Y = (y_1, y_2, ..., y_n) and a training sample X = (x_1, x_2, ..., x_n) is usually calculated using the following Euclidean distance formula:

$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$$

where d(x, y) represents the Euclidean distance between Y and X, x_k is the k-th feature attribute value of the training sample X, and y_k is the k-th feature attribute value of the new sample Y.
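A minimal sketch of this procedure is given below, using only the Python standard library. The majority vote used when the K neighbors disagree is an assumption (the text only describes the unanimous case), and the function names and toy data are illustrative.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance d(x, y) between two equal-length feature vectors."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))


def knn_predict(train_X, train_labels, query, k=3):
    """Return the most common label among the k training samples nearest to query."""
    ranked = sorted(
        zip(train_X, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy two-feature data standing in for spectra (hypothetical values).
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]]
train_labels = ["A", "A", "A", "B", "B"]
near_a = knn_predict(train_X, train_labels, [0.2, 0.1], k=3)
near_b = knn_predict(train_X, train_labels, [5.5, 5.0], k=3)
```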

Support Vector Machine (SVM)/Support Vector Regression (SVR)
An SVM is a data mining method for classification and regression based on the structural risk minimization principle [25]. SVR is an application of SVM used to address regression problems and can model high-dimensional data well. The radial basis function (RBF) kernel function was applied for SVM modeling in this research. The two important parameters of the RBF kernel function are the penalty parameter c and the kernel function parameter g. These parameters strongly affect the model, especially its complexity, approximation error, and measurement accuracy, so their optimization is essential. Therefore, the grid search (GS) technique was used to select the optimal parameters for this study [34]. For this purpose, all the values of the c-g parameter pairs were tested, and the c-g pair with the highest accuracy was identified via cross-validation and used as the optimal parameters.
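The exhaustive search over c-g pairs can be sketched as follows. The candidate grids (powers of two) and the scoring function `toy_cv_accuracy` are assumptions for illustration only; in practice the scorer would train an RBF-kernel SVM with the given c and g and return its cross-validated accuracy.

```python
import math
from itertools import product


def grid_search(c_values, g_values, cv_accuracy):
    """Evaluate every (c, g) pair and return the (c, g, accuracy) with the highest accuracy."""
    best = None
    for c, g in product(c_values, g_values):
        acc = cv_accuracy(c, g)
        if best is None or acc > best[2]:
            best = (c, g, acc)
    return best


def toy_cv_accuracy(c, g):
    """Stand-in scorer (hypothetical): peaks at c = 2**3 and g = 2**-5."""
    return 1.0 - 0.05 * abs(math.log2(c) - 3) - 0.05 * abs(math.log2(g) + 5)


# Assumed candidate grids; the grids used in the study are not given in the text.
c_grid = [2 ** e for e in range(-5, 11, 2)]
g_grid = [2 ** e for e in range(-15, 4, 2)]
best_c, best_g, best_acc = grid_search(c_grid, g_grid, toy_cv_accuracy)
```

Passing the scorer as a function keeps the search loop independent of the particular SVM library used.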

One-Dimensional Convolutional Neural Network (1D-CNN)
The structure of a 1D-CNN is similar to that of a conventional CNN and consists of input, convolutional, grouping, pooling, fully connected, and dense layers (Figure 2) for feature extraction, learning, and providing numerical outputs for classification or regression tasks [35,36]. However, it is more powerful in model representation than a conventional CNN. The convolutional layer is composed of multiple convolutional kernels. Convolution with these kernels on the original data is regarded as extracting features that contain the characteristics represented by the convolutional kernels. The number of convolutional kernels determines the number of generated features. Different activation functions are used to represent complex features. The dropout layer can temporarily discard a certain percentage of neural network units from each fully connected layer during the training process of a deep neural network. This means that different networks are trained for each batch, reducing the occurrence of overfitting. The fully connected layer maps the learned features to the sample label space. The flatten layer flattens the input data without affecting the batch size, serving as a transformation between the convolutional and fully connected layers.
The structure of the 1D-CNN models was built based on the TensorFlow framework, the structure and parameters of which are shown in Table 2. The default parameters in TensorFlow are not specified in the table. The input spectral data are presented as different origins. As Vis-NIR spectral data exhibit continuous changes in absorbance values with wavelength, when the convolutional kernel has a small size, the kernel may extract features on subintervals that are not near absorption peaks. Modeling with these features makes it difficult for the model to capture distinctive and discriminative spectral patterns, resulting in poor generalization performance. In addition, as shown in Figure 2, 8 kernels were used to capture low-level local features in the first convolutional layer. The second convolutional layer used 16 kernels to further combine and abstract these features, capturing higher-level features. This hierarchical feature extraction helped the model to better understand the structure of the data. Shallow networks have weaker modeling capabilities, and the combination of two convolutional layers could improve the model's representational power. A brief description of the layers is as follows:
(1) The Gaussian noise layer regularized the model by adding Gaussian noise to the pre-processed data and was only active during training. The value 1030 in the input shape of the Gaussian noise layer represented the feature dimension of each sample. In this study, the spectral data of each sample was obtained from the Vis-NIR spectra after spectral pre-processing and consisted of 1030 wavelength points. The Gaussian noise layer only added noise to the data without changing its shape, and thus, the output feature sequence remained 1030.
(2) In deep neural networks, the dimensionality of the input data is altered using the reshape layer. The reshape layer adapts the two-dimensional (2D) Vis-NIR spectroscopy data to three-dimensional (3D) data, with the third dimension having a fixed value of 1.
(3) There were two convolutional layers. The rectified linear unit (ReLU) function was used as the activation function of both convolutional layers. Convolutional layer 1 performed a 1D convolution with 8 convolution kernels of size 32 and a stride of 1. Therefore, when data with 1030 features underwent convolution with a kernel size of 32, the resulting number of features was (1030 - 32) + 1, which is 999 features. The third parameter, namely, 8, represented the number of convolutional kernels, each of which learned different features and generated an output sequence.
(4) The second convolutional layer further enhanced the model's ability to learn feature representations and capture more feature information. Similarly, 16 convolutional kernels with a size of 32 were used in this layer. When data with 999 features underwent convolution with a kernel size of 32, the resulting number of features was (999 - 32) + 1, which is 968 features. The third parameter, namely, 16, represented the number of convolutional kernels.
(5) The dropout layer generalized the model by randomly dropping out neurons to prevent overfitting, with the dropout rate of the input units represented by r.
(6) The flattening layer flattened the features extracted by the convolutional layers. During this process, it sequentially unfolded all the elements and transformed the output data shape to (None, 15488), since 968 × 16 = 15,488.
(7) The dense layer, also known as the fully connected layer, further compressed the nodes in the network. It took an input dimension of 15,488. This layer was configured with 128 neurons, each of which performed a weighted sum over the 15,488 input features, resulting in an output value.
(8) The output layer operated on the same principles as the previous fully connected layer. It was configured with 11 neurons, each of which performed a weighted sum over the 128 input features, resulting in an output value. It mapped the 128-dimensional features of the input to an 11-dimensional feature space, where each neuron corresponded to a category, and the output value represented the probability of that category. Additionally, L1 and L2 losses were set in this layer to achieve weight regularization.
A 1D-CNN is more flexible in extracting key features and more expressive than conventional machine learning algorithms, enabling a better extraction of patterns and features. Furthermore, compared with conventional neural networks, 1D-CNN has fewer parameters and is easier to train.
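The feature lengths quoted in the layer description follow from the output-length rule of an unpadded convolution, which can be checked with a few lines of arithmetic. This is only a shape calculation under the stated settings (kernel size 32, stride 1, no padding); it is not the authors' TensorFlow model, and the helper name `conv1d_out_len` is hypothetical.

```python
def conv1d_out_len(n_in, kernel_size, stride=1):
    """Output length of a 1D convolution without padding."""
    return (n_in - kernel_size) // stride + 1


n_points = 1030                              # wavelength points per pre-processed spectrum
len_conv1 = conv1d_out_len(n_points, 32)     # first convolutional layer, 8 kernels
len_conv2 = conv1d_out_len(len_conv1, 32)    # second convolutional layer, 16 kernels
flat_size = len_conv2 * 16                   # flatten: positions x kernels

# Per-sample shapes after each stage (batch dimension omitted).
layer_shapes = [
    ("gaussian_noise", (n_points,)),
    ("reshape", (n_points, 1)),
    ("conv1", (len_conv1, 8)),
    ("conv2", (len_conv2, 16)),
    ("flatten", (flat_size,)),
    ("dense", (128,)),
    ("output", (11,)),
]
```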

Statistical Analysis
HPLC analysis requires known standards to identify the peaks representing GA, HA, PA, PB, PC, and PE. Based on the calibration curves, the content of each physiologically active ingredient can be calculated. The results of the PLS-DA, the principal component analysis (PCA), and a clustered heatmap were plotted to analyze the physiologically active ingredients.
Accuracy, precision, recall rate, and F1 score were used in this study as the evaluation indicators of the origin discrimination models. By comparing these indicators, the optimal Vis-NIR spectral discrimination model was selected to classify and validate the respective origins of the G. elata samples.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP, TN, FN, and FP are the numbers of true positive, true negative, false negative, and false positive results, respectively. The predictions are also shown in a confusion matrix for convenience [38].
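These indicators can be computed directly from the four counts, as sketched below. Treating each origin one-vs-rest to obtain TP, TN, FP, and FN is an assumption about how the counts are derived for the eleven-class problem; the function names are illustrative.

```python
def one_vs_rest_counts(y_true, y_pred, label):
    """Count TP, TN, FP, FN for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    tn = sum(1 for t, p in pairs if t != label and p != label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, F1); undefined ratios are reported as 0.0."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Toy labels (hypothetical origin codes).
y_true = ["DJ", "DJ", "LS", "LS", "YC", "YC"]
y_pred = ["DJ", "LS", "LS", "LS", "YC", "DJ"]
counts_dj = one_vs_rest_counts(y_true, y_pred, "DJ")
metrics_dj = classification_metrics(*counts_dj)
```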
The final optimal calibration model was chosen based on the minimum root-mean-square error of prediction (RMSEP), the highest coefficients of determination of the training set ($R_v^2$) and the prediction set ($R_p^2$), and the lowest mean relative errors for cross-validation (MRECV) and prediction (MREP). The regression model has a better fit when the coefficient of determination is closer to one. Moreover, numerically closer values of RMSECV and RMSEP suggest a better generalization ability of the model. When the results fulfilled these criteria, it could be concluded that the model was well suited for the prediction of physiologically active ingredient contents in G. elata from different origins.
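A minimal sketch of these regression indicators is given below. The root-mean-square error and the coefficient of determination follow their standard definitions; the mean relative error formula (mean of |y - ŷ| / |y|) is an assumption, since the text does not define it, and the toy values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between reference and predicted contents."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_relative_error(y_true, y_pred):
    """Assumed definition: average of |t - p| / |t| over all samples."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy reference (HPLC) and predicted (Vis-NIR) contents in mg/g.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
rmsep = rmse(reference, predicted)
r2p = r_squared(reference, predicted)
mrep = mean_relative_error(reference, predicted)
```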

Statistical Analysis of Physiologically Active Ingredients Content Determined via the HPLC Method
The contents of the physiologically active ingredients (GA, HA, PA, PB, PC, and PE) in the G. elata samples from different origins were determined via HPLC. The sum of the contents of GA and HA [12], the sum of all component contents (total), and the coefficient of variation (CV) of each content were calculated (Table 4). Considering the CV values, the amounts and types of the G. elata samples from the eleven origins varied considerably. The GA content had the highest variation, with a CV of 61.06%, whereas the PB content had the lowest CV of 22.05%. The samples with a PGI indicating they were from DJ had the highest GA content of 4.7047 mg/g, whereas the samples with a PGI indicating that they were from YC had the lowest GA content of 1.0361 mg/g. The highest content of HA was 1.2674 mg/g in the samples with a PGI indicating that they were from LS, and the lowest HA content of 0.3898 mg/g was observed for samples from DAF. The highest content of PE was 5.8001 mg/g, which was detected for the samples from LP, and the lowest PE content of 1.9492 mg/g was observed for the samples from LY. The highest content of PB was 3.9674 mg/g in the samples with a PGI indicating that they were from DJ, and the lowest PB content was 2.0601 mg/g, which was detected for the samples from DF. The highest content of PC was 1.5170 mg/g in the samples with a PGI indicating that they were from DJ, and the lowest PC content was 0.4814 mg/g, which was detected in the samples from YC. The highest PA content was 10.3782 mg/g, which was determined in the samples with a PGI indicating that they were from DJ, and the lowest PA content was 2.6593 mg/g, which was detected in samples from ZT. The highest sum of the contents of GA and HA was 5.8882 mg/g and was detected in samples with a PGI indicating that they were from DJ, and the lowest value was 1.7200 mg/g obtained for samples from LP. The total contents ranged from a maximum of 24.4433 mg/g observed for samples with a PGI indicating that they were from DJ to a minimum of 12.0235 mg/g detected in samples from DAF. The reason for this phenomenon was that different regions lead to differences in soil pH, organic matter content, and microbial populations [39]. As the contents of the physiologically active ingredients in the G. elata are correlated with these influencing factors, the levels of physiologically active ingredients in the G. elata from different regions varied.
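The coefficient of variation used to compare the spread of each ingredient is the standard deviation expressed as a percentage of the mean. The sketch below uses the sample standard deviation, which is an assumption (the text does not say whether the sample or population form was used), and the input values are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean, times 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100.0


# Hypothetical contents (mg/g) of one ingredient across three origins.
cv_percent = coefficient_of_variation([1.0, 2.0, 3.0])
```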
The results of the comparison show that the G. elata samples with a PGI indicating that they were from DJ had the highest physiologically active ingredients content, which may be attributed to the fact that this region has a humid climate, sufficient sunlight, average temperatures of 13-17 °C throughout the year, and frost-free weather up to 295 days a year [40]. The G. elata samples with a PGI indicating that they were from LS had lower physiologically active ingredient contents than the samples from PUA without a PGI. The total contents in the G. elata samples collected from DAF and ZT with PGIs were lower than those in samples collected from PUA, LP, and WF without PGIs. The season of collection and the grade of the collected G. elata may have both contributed to the poor quality of samples collected from DAF and ZT. This finding indicates that sourcing G. elata from a region with a PGI does not ensure high physiologically active ingredient contents. Furthermore, the quality of G. elata should be determined based on a combination of the origin and physiologically active ingredient contents of the sample.
For a more intuitive study, we combined the analysis of the physiologically active ingredient contents in G. elata with chemometrics to identify the clusters of origins. The 2D PCA score plot for the origin classification (Figure 3a) showed that the sum of PC1 and PC2 accounted for 96.65% of the explained total variance (PC1 = 92.01%, PC2 = 4.65%) and could explain most of the variability. However, no significant clustering of the samples from the eleven origins was observed, and the data points were dispersed and had overlapping distributions. Based on the results of the PLS-DA classification (Figure 3b), the samples from several origins showed indications of classification, but ultimately, they could not be completely separated owing to the extensive overlaps. In addition, to examine whether certain components were indicative of certain origins, a clustered heatmap analysis of the physiologically active ingredients content (x-axis) and sample origin (y-axis) was performed (Figure 3c). The results showed that the samples from the same origin were not distinguished, indicating that the physiologically active ingredient contents varied in samples from the same origin. This may have been because the G. elata samples had different quality grades [41]. The results of the aforementioned analysis indicate that it is not currently possible to identify the origin of G. elata by solely considering the physiologically active ingredient contents. Therefore, efficient methods should be developed to rapidly and accurately identify the origin of G. elata.

Spectral Analysis of the G. elata Samples
Figure 4a shows the average raw spectra of the G. elata samples that represented different origins; there were eight distinct peaks and seven valleys. The average spectra that represent different origins were consistent in trends and overlapped tightly; however, subtle differences were observed in the absorption intensities, particularly in the Vis detection range (400-780 nm). The absorption peaks at approximately 980 nm corresponded to the second harmonic generation (SHG) of O-H in the phenolic acids and water. The absorption peaks at ~1180 nm corresponded to the SHG of C-H in the phenolic acids and polysaccharides in G. elata. The absorption peaks in the range of 1420-1440 nm corresponded to the first harmonic generation (FHG) of O-H in the phenolic acids and polysaccharides in G. elata, as well as in water. The absorption peaks in the range of 1500-1600 nm corresponded to the FHG of C-H in the phenolic acid and polysaccharides in G. elata and the SHG of N-H in amino acids. The absorption peaks at 1940 nm corresponded to the sum frequency generation (SFG) of O-H in water, and those at 2150 and 2350 nm corresponded to [...]
To reduce noise, the Vis-NIR spectra were pre-treated with a second-order derivative (SD) in this study, which significantly changed the shapes of the spectra (Figure 4b). The intensities of certain peaks, such as those at 1400 and 1900 nm, were enhanced. However, distinguishing the origin of G. elata using the averaged spectrum remains difficult.

Visual Analysis of Spectral Characteristics
To understand the similarities and differences in the Vis-NIR spectral datasets of G. elata samples obtained from 11 different origins more intuitively and fully, three methods were used to map the Vis-NIR spectral datasets pre-processed with SD to a 2D space. As shown in Figure 5a-c, where each point represents an individual sample, visualization analysis was performed on the 11 datasets.
The 2D score plot obtained from PCA can cluster samples with similar spectral characteristics together [31,32]. The results show that the sum of PC1 and PC2 for origin classification accounted for 66.3% of the explained total variance (PC1 = 41.2%, PC2 = 25.1%). Figure 5a shows that, except for the samples from DJ, the samples from other regions overlapped to a large extent, indicating poor separation. The t-distributed stochastic neighbor embedding (t-SNE) visualization results (Figure 5b) show that the samples from DF, DJ, LJ, and WF formed distinct clusters, while samples from other regions exhibited significant overlap, indicating poor classification. The difference between the uniform manifold approximation and projection (UMAP) and t-SNE was minimal (Figure 5c). Although PCA is an effective method for extracting data information, it was not able to visualize a large amount of information. Comparatively, as a non-linear dimensionality reduction method, the t-SNE visualization method visualized the data significantly better than PCA [42]. The UMAP approach and computation were largely similar to t-SNE [43]. The results of PCA are consistent with those of the literature [15], whereas the results of t-SNE and UMAP are superior to those of the previous research. The visualization results confirm that the identification of G. elata origin cannot be accomplished by solely considering the clustering of spectra. At the same time, they show that G. elata from different origins had similar chemical compositions. Therefore, the combined application of Vis-NIR spectroscopy and chemometrics is required for further analysis.


Identification of the Origin of G. elata Based on the Vis-NIR Data
The Vis-NIR spectral data were analyzed using 1D-CNN and other learning algorithms (PLS-DA, KNN, and SVM), employing the eleven origins as labels. The results are shown in Table 5. To enhance the stability of the model and avoid overfitting, the original spectral data sets were expanded by adding random offsets and applying multiplication and slope effects. The random offset of the spectral data was set to 0.1 times the mean value, and the slope was offset by 0.05 times, that is, the slope was randomly adjusted between 0.95 and 1.05 to augment the spectrum. The data augmentation method was used to obtain a total of 1440 spectra for the training set. The training set was then used to train the neural network and to avoid the risk of developing models with poor generalization ability. The same expanded training and validation sets were also applied to other chemometric methods to compare the advantages and disadvantages of models developed using 1D-CNN and other chemometric methods before and after the data augmentation.
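One way to read the augmentation described above is sketched below. The exact form of each effect is an assumption: the offset is drawn uniformly within ±0.1 times the spectrum mean, the multiplication factor and the end-point slope factor are both drawn uniformly from 0.95 to 1.05, and the slope is applied as a linear ramp along the wavelength axis. The function name `augment` and its parameters are hypothetical.

```python
import random


def augment(spectrum, offset_scale=0.1, factor_range=0.05, rng=random):
    """Return one augmented copy of a spectrum (list of absorbance values, length >= 2).

    offset: constant shift, uniform in +/- offset_scale * mean(spectrum)
    mult:   global multiplication factor, uniform in [1 - factor_range, 1 + factor_range]
    slope:  linear ramp from 1 at the first point to a random end value
            in [1 - factor_range, 1 + factor_range] at the last point
    """
    n = len(spectrum)
    mean = sum(spectrum) / n
    offset = rng.uniform(-offset_scale, offset_scale) * mean
    mult = rng.uniform(1 - factor_range, 1 + factor_range)
    slope_end = rng.uniform(1 - factor_range, 1 + factor_range)
    augmented = []
    for i, value in enumerate(spectrum):
        ramp = 1 + (slope_end - 1) * i / (n - 1)
        augmented.append(mult * ramp * value + offset)
    return augmented


# Toy spectrum with ten points (hypothetical absorbance values).
random.seed(0)
toy_spectrum = [0.1 * i for i in range(1, 11)]
toy_augmented = augment(toy_spectrum)
```

Each call produces a spectrum of the same length, so augmented copies can be appended to the training set without changing the model input shape.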
When analyzing the models developed with unexpanded data, it was found that the models built on the raw spectra performed the worst. A comparison of the other spectral pre-processing methods for the models developed with unexpanded data showed that the classification outcomes of all four models were improved after the spectral data were pre-processed via a combination of SD and normalization methods. Specifically, the training set accuracy (Acc_train) and the testing set accuracy (Acc_test) of the PLS-DA model improved by 6.07% and 4.24%, respectively, after SD processing. Meanwhile, the precision improved by 4.37%, and the recall rate and the F1 score improved by 2.98% and 4.57%, respectively. Although the Acc_train of the KNN model was reduced, the Acc_test of the KNN model improved from 0.5167 to 0.9667, the precision improved from 0.5060 to 0.9542, the recall rate improved from 0.5167 to 0.9667, and the F1 score improved from 0.5062 to 0.9583. Additionally, the model performance of SVM improved considerably after the SD processing of the spectra. Acc_train improved from 0.8278 to 0.9611, Acc_test improved from 0.6500 to 0.9833, the precision improved from 0.7211 to 0.9847, the recall rate improved from 0.6500 to 0.9833, and the F1 score improved from 0.6579 to 0.9828. Similarly, the 1D-CNN model performance improved after SD processing, obtaining a value of 1.0000 for each of Acc_train, Acc_test, precision, recall rate, and F1 score when using processed data. A comprehensive comparison of the models established with unexpanded data showed that the optimal model for G. elata origin discrimination could be established by pre-processing the Vis-NIR spectra using SD and normalization combined with the application of 1D-CNN.
For the models established based on expanded data, the models built on the raw spectra again performed the worst. A comparison of the two spectral pre-processing methods for the models established based on expanded data suggested that the classification results of all four models were improved after the spectral data were pre-processed with both SD and normalization. Specifically, after pre-processing, for the PLS-DA model, Acc_train and Acc_test improved by 1.00% and 7.27%, respectively; the precision improved by 6.41%, and the recall rate and F1 score improved by 7.27% and 7.29%, respectively. In the case of the KNN model, Acc_test improved from 0.7167 to 0.9833, precision improved from 0.7409 to 0.9861, the recall rate improved from 0.7167 to 0.9833, and the F1 score improved from 0.7121 to 0.9829. The SVM model exhibited a significant improvement in all metrics, with Acc_train improving from 0.8625 to 1.0000, Acc_test improving from 0.8833 to 0.9833, precision improving from 0.9012 to 0.9917, the recall rate improving from 0.8833 to 0.9833, and the F1 score improving from 0.8718 to 0.9849. Finally, the performance of the 1D-CNN model was improved, with the values of Acc_train, Acc_test, precision, recall rate, and F1 score equal to 1.0000. A comprehensive comparison of all the models developed with expanded data revealed that the optimal model for discriminating the origin of G. elata was the combined application of Vis-NIR spectra pre-processed using both SD and normalization and the 1D-CNN method.
The performances of the models established before and after data augmentation were compared after the spectral data were pre-treated via normalization. The identification results of the PLS-DA, KNN, SVM, and 1D-CNN models were improved after the augmentation of the Vis-NIR spectral data. The 1D-CNN model exhibited the highest performance, with an Acc_train of 1.0000, an Acc_test of 0.9833, a precision of 0.9847, a recall rate of 0.9833, and an F1 score of 0.9833. Thereafter, the performances of the models developed before and after data augmentation were compared for spectral data pre-processed using both SD and normalization. The classification results of the PLS-DA, KNN, SVM, and 1D-CNN models were also improved after the Vis-NIR spectral data augmentation. The 1D-CNN model showed the optimal results, with the values of the Acc_train, Acc_test, precision, recall rate, and F1 score equal to 1.0000. These results of the 1D-CNN models also showed the presence of more inherent nonlinear correlations between the spectral data and the origin labels. Therefore, data augmentation is a viable technique for expanding Vis-NIR spectral data sets, and it improves the robustness of the 1D-CNN model.
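For readers who wish to recompute the classification indicators, the sketch below derives accuracy, precision, recall, and F1 score from true and predicted origin labels. It assumes support-weighted averaging over classes, which is not stated in this section; that choice is suggested only by the fact that the reported recall rates equal the reported Acc_test values, as weighted recall always does. The function name and the toy labels are hypothetical.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy and support-weighted precision, recall, and F1 score."""
    classes = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        w = support[c] / n          # weight each class by its share of true samples
        precision += w * p_c
        recall += w * r_c
        f1 += w * f_c
    return accuracy, precision, recall, f1

# Hypothetical toy labels: one WF sample is misclassified as YC
y_true = ["WF", "WF", "YC", "YC", "PA", "PA"]
y_pred = ["WF", "YC", "YC", "YC", "PA", "PA"]
accuracy, precision, recall, f1 = weighted_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```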
In conclusion, the optimal spectral pre-processing method combined pre-processing with SD and normalization. Furthermore, the robustness of the model could be improved using data augmentation, and the optimal modeling algorithm was the 1D-CNN. To further verify the performances and effectiveness of the classification models in this study, the optimal model of each of the four algorithms was selected for plotting their confusion matrices, aiming to apply different discriminant models to each sample to obtain further details (Figure 6). The confusion matrices indicate that one sample from WF was misclassified as being from YC in the PLS-DA model, one sample from YC was misclassified as being from PA in the KNN model, and one sample from YC was misclassified as being from WF in the SVM model. Notably, all samples were correctly classified when using the 1D-CNN model. These results indicate that the PLS-DA, SVM, and KNN algorithms confused the data of the G. elata samples collected from YC, WF, and PA during classification, which may have been because of the characteristic spectral bands that suggest the differences in the origin and the bioactive component contents of G. elata. The 1D-CNN algorithm effectively addressed these issues. Therefore, the optimal model was established by pre-processing the Vis-NIR spectra using both SD and normalization, expanding the data, and modeling with the 1D-CNN algorithm. The results confirmed that the 1D-CNN model had strong automatic learning characteristics and was better suited for the origin identification of G. elata than the other models considered in this research, providing a rapid method for distinguishing samples of different origins with PGI.
The loss and accuracy curves of the training set and testing set can diagnose any issues that could be causing underfit or overfit models during the learning process [44]. If the model is overfitting, the loss curve will gradually decrease on the training set, but it may stabilize or start to increase on the testing set. The accuracy of the model on the training set may approach 100%, while it may decrease or stabilize on the testing set. If the model is underfitting, both the loss curve on the training set and the testing set may fail to reach a low level, and the difference between them may be small. The accuracy of the model on both the training set and the testing set may be low, with a small difference between them. As a result, the 1D-CNN model was trained with an initial learning rate of 0.01, 50 iterations (epoch = 50), and a batch size of 32 (batch_size = 32) in this study. From Figure 7a, it can be observed that as the number of training iterations increased, the loss function of the 1D-CNN model for both the training set and testing set gradually decreased, indicating that the model was finding spectral features related to the origin of G. elata. When the loss function was very low, the loss curve decreased significantly and became flat as it approached zero. At this point, the 1D-CNN model became more stable, and the losses of the training set and testing set both converged, with a small difference between them, indicating successful fitting. Figure 7b shows that the accuracy curve of the 1D-CNN model on both the training set and testing set approached 1 (or 100%), indicating optimal model performance.
In terms of origin identification, the relevant literature has identified G. elata from up to eight different regions [15]. In this study, successful identification of G. elata from 11 different regions was achieved. From the perspective of the algorithm performance, the identification accuracy of the 1D-CNN model was comparable to the accuracy reported in the literature for three or six different regions of G. elata [4,16].
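The reading of loss curves described above (overfitting, underfitting, or successful fitting) can be expressed as a simple rule of thumb. The sketch below is only a heuristic illustration: the function name and the thresholds `low_loss`, `gap_tol`, and `window` are arbitrary assumptions, and the curves are synthetic rather than those of Figure 7.

```python
def diagnose_fit(train_loss, test_loss, low_loss=0.1, gap_tol=0.05, window=5):
    """Label a training run from per-epoch loss lists using simple heuristics."""
    final_train = sum(train_loss[-window:]) / window    # late-epoch training loss
    final_test = sum(test_loss[-window:]) / window      # late-epoch testing loss
    gap = final_test - final_train
    test_rising = test_loss[-1] > min(test_loss) + gap_tol
    if final_train > low_loss and abs(gap) <= gap_tol:
        return "underfitting"    # both losses stay high with a small gap
    if final_train <= low_loss and (gap > gap_tol or test_rising):
        return "overfitting"     # training loss is low, testing loss stalls or rises
    if final_train <= low_loss and abs(gap) <= gap_tol:
        return "good fit"        # both losses converge to a low level
    return "inconclusive"

epochs = range(50)  # 50 epochs, as in the training setup described in the text
converged = diagnose_fit([0.8 ** e for e in epochs],
                         [1.1 * 0.8 ** e + 0.01 for e in epochs])
overfit = diagnose_fit([0.8 ** e for e in epochs],
                       [0.8 ** e + 0.01 * e for e in epochs])
underfit = diagnose_fit([1.0 - 0.002 * e for e in epochs],
                        [1.02 - 0.002 * e for e in epochs])
print(converged, overfit, underfit)
```

In practice such thresholds depend on the loss function and the data scale, so a visual inspection of the curves remains the primary check.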

In addition, the evaluation indicators of the 1D-CNN model in this study were consistent with those in the relevant literature, and the F1 score was 1 [45]. The main reason for this was that 1D-CNN has advantages, such as local feature extraction, parameter sharing, multi-level abstraction, and non-linear activation functions, which enable it to capture the correlations between data more effectively. In this study, although the G. elata samples were sourced from different regions, they were harvested during the same period and under the same cultivation techniques. With fewer confounding factors in the experiment, the 1D-CNN model was able to capture the correlations between the data more easily, resulting in an F1 score value of 1.0000. Therefore, future research should collect more G. elata samples from different regions that are harvested at different times, in order to enhance the reliability and applicability of the 1D-CNN model.

Prediction of Physiologically Active Ingredient Contents in G. elata Based on the Vis-NIR Method
Considering the intrinsic association between the origins of G. elata samples and their contents of physiologically active ingredients, it was investigated whether the Vis-NIR technique could be used to predict the contents of ingredients in G. elata. In this study, the SPXY method was applied to divide the training and testing sets (Table 3). The X variable was the Vis-NIR spectra after pre-processing using both SD and normalization, and the Y variable was the content of each component in the samples determined via HPLC. The X variable was set to 180 samples × 1050 variables before the data augmentation and 1440 samples × 1050 variables after the data augmentation.
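The SPXY division mentioned above can be sketched as follows. This is a minimal version of the commonly described procedure (a Kennard-Stone selection on the sum of the x- and y-distances, each scaled by its maximum), written for a single response variable; the function name, the toy data, and the restriction to a scalar y are assumptions, and the settings used in this study are not given in this section. The sketch also assumes the samples are not all identical, so that the maximum distances are nonzero.

```python
import math

def spxy_split(X, y, n_train):
    """Pick `n_train` training indices with the SPXY rule; the rest form the test set."""
    n = len(X)
    dx = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    dy = [[abs(y[i] - y[j]) for j in range(n)] for i in range(n)]
    max_dx = max(map(max, dx))
    max_dy = max(map(max, dy))
    # Joint distance: x- and y-distances, each scaled by its maximum
    d = [[dx[i][j] / max_dx + dy[i][j] / max_dy for j in range(n)] for i in range(n)]
    # Start from the two samples that are farthest apart
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda pair: d[pair[0]][pair[1]])
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_train:
        # Add the candidate whose nearest selected sample is farthest away
        best = max(remaining, key=lambda k: min(d[k][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected, remaining

# Hypothetical toy data: five two-band "spectra" and one content value each
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.1, 0.1]]
y = [1.0, 1.2, 0.9, 3.0, 1.05]
train_idx, test_idx = spxy_split(X, y, n_train=3)
print(train_idx, test_idx)
```

Because the selection spreads the training samples over both the spectral space and the content range, the training set tends to cover the extremes, which is the usual motivation for preferring SPXY over a random split.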
The parameters of the model for determining the physiologically active ingredient contents of G. elata based on the Vis-NIR full-wavelength spectra are listed in Table 6. When the data augmentation was not performed, the GA content was most effectively predicted by the SVR and 1D-CNN models, as both models had high $R^2_v$ (higher than 0.9800) and $R^2_p$ values (higher than 0.8800). After the data augmentation, the predictive performance of PLSR and SVR did not improve significantly. However, the performances of the KNN and 1D-CNN models improved, yielding $R^2_v$ and $R^2_p$ values higher than 0.9900 and 0.9200, respectively. This indicates that the KNN and 1D-CNN models were more precise in predicting GA content after the data augmentation than the other models. Comparing the performance parameters of the models, the optimal method was determined to be the combined application of the 1D-CNN algorithm and the expanded Vis-NIR spectral data pre-processed using SD (Figure 8a). The optimal model had the highest $R^2_v$ and $R^2_p$ values (0.9974 and 0.9278, respectively) and lower RMSECV, RMSEP, MRECV, and MREP values (0.0843, 0.2881, 0.0328, and 0.1396, respectively) than the other models, suggesting that it was the optimal model for predicting the GA content.
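The evaluation indicators used above can be computed from measured and predicted contents as in the following sketch. It uses common textbook definitions of the coefficient of determination, the root mean square error, and the mean relative error; the exact formulas used in this study are not given in this section, so these definitions, the function name, and the toy values are assumptions.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R^2, RMSE, MRE) for measured and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    # Mean relative error: average absolute error relative to the measured value
    mre = sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / n
    return r2, rmse, mre

# Hypothetical contents in mg/g (measured vs. predicted)
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.5, 7.5]
r2, rmse, mre = regression_metrics(y_true, y_pred)
print(r2, rmse, mre)
```

Applied to the cross-validation predictions, these quantities correspond to $R^2_v$, RMSECV, and MRECV; applied to the testing set, they correspond to $R^2_p$, RMSEP, and MREP.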

In summary, based on the holistic nature of Vis-NIR spectra, combined with the effectiveness in extracting the feature structure and strong modeling ability of 1D-CNN, multiple physiologically active ingredient contents in G. elata from different origins can be rapidly and simultaneously predicted. (1) By comparing different algorithms, it was concluded that the model built using the SD-pre-processed Vis-NIR spectra, data augmentation, and 1D-CNN algorithm had the highest predictive ability. This further demonstrated that the 1D-CNN model was capable of describing non-linear relationships better than the other models. (2) Without the data augmentation, the optimal quantitative modeling algorithms for predicting the contents of PB and GA + HA were SVR and 1D-CNN, whereas the optimal quantitative modeling algorithm for predicting the contents of other physiologically active ingredients was 1D-CNN. (3) After the data augmentation, the optimal quantitative modeling algorithm for predicting the contents of all the physiologically active ingredients was 1D-CNN, which demonstrates that data augmentation can improve the generalization ability of the 1D-CNN model. The relevant literature predicted the content of up to six physiologically active ingredients from G. elata [17]. In this study, successful prediction of the content of up to eight physiologically active ingredients was achieved. Moreover, the 1D-CNN model outperformed the methods presented in the literature regarding predicting ingredients.

Discussion
The main factors that affect the quality of G. elata are its origin and physiologically active ingredients. There are marked differences in the content of physiologically active ingredients in G. elata from 11 geographical origins, which confirms the importance of origin identification and PGI. Compared with the time-consuming HPLC method, Vis-NIR spectroscopy could predict the origin of G. elata and the contents of eight physiologically active ingredients in a single scan within seconds, thereby evaluating the quality of G. elata. This method is rapid, simple, and non-polluting, and it has lower instrument costs than HPLC, which supports adopting Vis-NIR spectroscopy for the rapid quality assessment of G. elata. However, to achieve the rapid quality inspection of G. elata in different scenarios, it is necessary to develop corresponding portable devices for Vis-NIR spectroscopy.
The Vis-NIR models that were established using the 1D-CNN nonlinear method outperformed the other tested conventional models, indicating that there were more inherent nonlinear correlations between spectral data and origin labels or content. In particular, the 1D-CNN method based on deep learning had advantages, such as automatic feature extraction, hierarchical feature learning, parameter sharing and local perception, data augmentation, and generalization ability. If applied to portable devices for Vis-NIR spectroscopy, it could better handle local features in the data, make the model more adaptable, simplify the model construction process, and further improve the generalization ability of the model.
The phenolics in G. elata have neuroprotective, anti-inflammatory, and antioxidant effects. Therefore, besides being used as a health supplement, G. elata is also applied as a drug in clinical applications [46]. For instance, Tianma injection was used to treat a patient who had vertebrobasilar insufficiency [47], and Tianmasu injection was used to treat a patient who had dizziness [48]. As a plant-based medicine, G. elata is increasingly welcomed in some countries. In future research, more G. elata samples will be collected from different origins to expand the application range and improve the reliability of the proposed models. Additionally, in order to produce various high-quality end products, including food supplements and medications, economical and portable Vis-NIR equipment combined with the advantages of a 1D-CNN will be developed, which will meet the demand for rapid quality inspection of G. elata products for industry use.

Conclusions
In this study, Vis-NIR spectroscopy combined with chemometric methods (PLS-DA, KNN, SVM, 1D-CNN) was applied to correctly and rapidly identify the geographical origin

Figure 1. Geographical origin of the G. elata samples (highlighted with color).

3. 2 .
Figure 4a shows the average raw spectra of the G. elata samples from the different origins, which exhibited eight distinct peaks and seven valleys. The average spectra of the different origins were consistent in trend and overlapped closely; however, subtle differences were observed in the absorption intensities, particularly in the Vis detection range (400-780 nm). The absorption peaks at approximately 980 nm were attributed to the second harmonic generation (SHG) of O-H in the phenolic acids and water. The absorption peaks at ~1180 nm were attributed to the SHG of C-H in the phenolic acids and polysaccharides in G. elata. The absorption peaks in the range of 1420-1440 nm were attributed to the first harmonic generation (FHG) of O-H in the phenolic acids and polysaccharides in G. elata, as well as in water. The absorption peaks in the range of 1500-1600 nm were attributed to the FHG of C-H in the phenolic acids and polysaccharides in G. elata and the SHG of N-H in amino acids. The absorption peaks at 1940 nm were attributed to the sum frequency generation

Figure 4. Vis-NIR characterization of G. elata: (a) original averaged spectra of samples from each origin and (b) averaged spectra of samples from each origin after second-order derivative (SD) processing.

Figure 7. Trend chart of loss and accuracy in 1D-CNN training process: (a) Loss curves and (b) accuracy curves.

GA + HA: the sum of gastrodin and p-hydroxybenzyl alcohol; total: the sum of gastrodin, p-hydroxybenzyl alcohol, parishin E, parishin B, parishin C, and parishin A; bolded font indicates the optimal modelling result; $R^2_v$ is the coefficient of determination of the training set; MRECV: mean relative error for cross-validation; RMSECV: root mean square error for cross-validation; $R^2_p$ is the coefficient of determination of the testing set; MREP: mean relative error for prediction; RMSEP: root mean square error for prediction.

Table 1. Geographical information of the G. elata samples.

Table 3. Classification statistics of the G. elata samples.
GA + HA: the sum of gastrodin and p-hydroxybenzyl alcohol; total: the sum of gastrodin, p-hydroxybenzyl alcohol, parishin E, parishin B, parishin C, and parishin A; SPXY: sample set partitioning based on joint x-y distance; Min: minimum value; Max: maximum value; Mean: mean value; Std: standard deviation.

Table 4. Bioactive component contents of G. elata samples from different origins (mg/g).
Mean: mean value; CV: coefficient of variation; GA + HA: the sum of gastrodin and p-hydroxybenzyl alcohol; total: the sum of gastrodin, p-hydroxybenzyl alcohol, parishin E, parishin B, parishin C, and parishin A.

Table 5. Comparison of G. elata origin discrimination models based on different modeling methods.
Bolded font indicates the optimal modeling results; SD: second-order derivative; Acc_train: training set accuracy; Acc_test: testing set accuracy.

Table 6. Results of the calibration models for predicting the contents of bioactive components in G. elata based on different algorithms.