Rapid Discrimination and Prediction of Ginsengs from Three Origins Based on UHPLC-Q-TOF-MS Combined with SVM

Ginseng, which contains abundant ginsenosides, grows mainly in Jilin, Liaoning, and Heilongjiang in China. The quality and traits of ginseng from different origins have been reported to differ greatly. To date, accurately predicting the origin of ginseng samples remains a challenge. Here, we integrated ultra-high-performance liquid chromatography quadrupole time-of-flight mass spectrometry (UHPLC-Q-TOF-MS) with a support vector machine (SVM) for rapid discrimination and prediction of ginseng from the three main regions where it is cultivated in China. Firstly, we developed and optimized a stable and reliable UHPLC-Q-TOF-MS method to obtain robust information for 31 batches of ginseng samples. Subsequently, a rapid pre-processing method was established for the screening and identification of 69 characteristic ginsenosides in the 31 batches of ginseng samples from three different origins. The SVM model successfully distinguished ginseng origin, and its accuracy improved from 83% to 100% when the normalization method was optimized. Six crucial quality markers for different origins of ginseng were screened using a permutation importance algorithm in the SVM model. In addition, to validate the method, eight batches of test samples were used to predict the region of cultivation using the SVM model based on the six selected quality markers. The results show that the proposed strategy is suitable for the discrimination and prediction of the origin of ginseng samples.


Introduction
Ginseng is the dried root of Panax ginseng C. A. Mey., first recorded in Shennong's Classic of Materia Medica. It has been used to treat many diseases for more than two thousand years because of its wide range of pharmacological effects [1]. Modern pharmacological studies have shown that ginseng has various pharmacological activities, such as anti-tumor [2], anti-oxidative [3], immunity-improving [4], and memory-enhancing [5] effects. In China, ginseng is mainly planted in the northeast regions, including Jilin (JL), Liaoning (LN), and Heilongjiang (HLJ). According to reports, the quality and traits of ginseng from different origins show great diversity due to different cultivation techniques and ecological environments [6,7]. Therefore, it is imperative to establish a quality evaluation method to differentiate and characterize ginseng samples from different regions.
Phytochemical studies have revealed the major constituents of ginseng, including ginsenosides, polysaccharides, amino acids, polypeptides, proteins, and volatile oils [8]. Among them, ginsenosides are considered the main active components [9–13]. In the 2020 edition of the Pharmacopoeia of the People's Republic of China (ChP), only three ginsenosides and their contents are used as standards for the quality evaluation of ginseng, which makes it impossible to distinguish ginsengs from different origins [14]. In recent years, methods based on liquid chromatography mass spectrometry (LC-MS) fingerprints, LC-MS quantification, and chemical pattern recognition have been widely used to address this issue [15–17]. Xiu et al. quantified fourteen ginsenosides using UHPLC coupled with a triple quadrupole mass spectrometer (QQQ-MS). Two commonly used traditional multivariate statistical analysis methods, principal component analysis (PCA) and partial least squares discriminant analysis (PLS-DA), were then employed to evaluate differences in the contents of these ginsenosides between origins [18]. However, these methods still lacked objectivity and accuracy in their identification results. Additionally, the established QQQ quantitative method required fourteen reference standards for content determination, resulting in a high detection cost and poor practicality. Thus, it is essential to develop a convenient and effective strategy for the accurate differentiation and characterization of ginsengs from different regions of cultivation.
Recently, the combination of UHPLC-MS and support vector machines (SVM) has been considered a valid method for the authentication of species and the identification of origins of Traditional Chinese Medicines (TCMs), with satisfactory accuracy [19,20]. For instance, Zhao et al. [19] distinguished different varieties of ginseng using UHPLC-MS integrated with SVM and, after sufficient training, accurately separated red ginseng from other ginseng samples (white ginseng, Panax quinquefolium, and Panax notoginseng). However, ginsengs from different origins exhibit high similarity in chemical composition, which increases the difficulty of identification. Thus, model building and data processing for SVM are more demanding in this setting. In addition, as far as we know, the discovery of quality markers based on an SVM model remains challenging.
In this work, a rapid, convenient, and effective differentiation method based on UHPLC-Q-TOF-MS coupled with SVM was developed to evaluate ginseng samples collected from JL, LN, and HLJ. Firstly, stable and reliable data were generated by UHPLC-Q-TOF-MS, and the common ginsenoside components of 31 batches of ginseng were screened. Additionally, an SVM model was established to accurately classify ginseng from different origins using the normalized data. Furthermore, an algorithm of feature contribution values was introduced to the SVM model to obtain quality markers of the ginsengs from the three origins. Finally, on the basis of these quality markers, the SVM model was shown to be able to discriminate and predict the geographical origins of ginseng. This strategy was verified by successfully distinguishing test samples from JL, LN, and HLJ, indicating great reliability and effectiveness. Our strategy has the potential to provide a reference for the regional differentiation and traceability of other TCMs.

Optimization of UHPLC-Q-TOF-MS Analysis Conditions
To achieve good separation and obtain high-quality UHPLC-Q-TOF-MS data, we optimized the extraction method, extraction solvent, mobile phase composition, elution gradient, and injection concentration in detail. The results showed that ultrasonic extraction with 40% ethanol is suitable for ginseng and that water (containing 0.01% formic acid, v:v) and acetonitrile (containing 0.01% formic acid, v:v) were preferred as the mobile phase system because they gave higher peak numbers and better resolution. An injection concentration of 20,000 ppm gave an excellent response without burdening the instrument. These results are shown in Figure S1. They show that optimizing the UHPLC-Q-TOF-MS analysis conditions for ginseng is necessary to ensure that the samples enter the subsequent analysis in the best state.

Validation of the UHPLC-Q-TOF-MS Analysis Method
After the UHPLC-MS analysis conditions were developed, the method was verified using QC samples. The stability and repeatability of the system were evaluated using extracted ion chromatograms (EICs) in the QC samples. QC samples were run before and after the injections every day, and one QC sample was inserted after every ten samples during the injection sequence. As shown in Table S1, information for a total of seven EICs was extracted from the QC samples. The RSDs of the mass accuracy of these seven EICs ranged from 1.10 × 10⁻⁴% to 1.46 × 10⁻⁴%, the RSDs of the retention time ranged from 0.06% to 0.43%, and the RSDs of the peak area ranged from 1.94% to 2.43%. The results showed good stability and repeatability of the UHPLC-Q-TOF-MS system. The analytical environment constructed with UHPLC-Q-TOF-MS can therefore meet the needs of sample analysis and provide reliable and robust data.
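The RSD values reported for the QC samples can be computed as the sample standard deviation divided by the mean, expressed as a percentage. The following is a minimal sketch using only the Python standard library; the retention times are hypothetical values chosen for illustration and are not taken from Table S1.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation (sample SD / mean) as a percentage."""
    return stdev(values) / mean(values) * 100.0


# Hypothetical retention times (min) of one EIC across repeated QC injections.
qc_retention_times = [12.41, 12.43, 12.40, 12.44, 12.42]
rsd = rsd_percent(qc_retention_times)
print(f"Retention time RSD = {rsd:.2f}%")
```

The same function applies unchanged to the measured m/z values or peak areas of each EIC.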

Rapid Screening and Identification of Characteristic Ginsenosides in Ginsengs from Different Regions
The developed UHPLC-Q-TOF-MS method was subsequently applied to the analysis of 31 batches of ginseng samples from JL, LN, and HLJ, and MS data were collected. The total ion chromatogram (TIC) of the ginseng sample by UHPLC-Q-TOF-MS is shown in Figure S2. We established a pre-processing method to rapidly filter high-quality information from redundant mass data for data analysis.
Workstation processing yielded more than 6000 features from the 31 batches of samples. After peak matching, alignment, and filtering, 122 common peaks were found, all of which were present in all 31 batches of ginseng. Furthermore, by comparison with the in-house database (including MS and MS/MS information for over 400 ginsenosides collected from published references), 69 ginsenosides were quickly screened (Figure 1), and their chemical structures were preliminarily identified (Table 1).
Based on the in-house database, the fragmentation patterns of three typical ginsenoside types were summarized. In PPT-type ginsenosides, such as Rg1, the parent ion [M-H]⁻ (m/z 799) in negative-ion mode showed a loss of two glucose residues to yield the aglycone protopanaxatriol (m/z 475), as shown in Figure S3A. In PPD-type ginsenosides, such as Rb1, the parent ion [M-H]⁻ (m/z) in negative-ion mode showed a loss of four glucose residues to yield protopanaxadiol (m/z 459), as shown in Figure S3B. In OA-type ginsenosides, such as Ro, the parent ion [M-H]⁻ (m/z 955) in negative-ion mode showed a loss of two glucose residues and one glucuronic acid group to produce oleanolic acid (m/z 455), as shown in Figure S3C. In negative-ion mode, the parent ion (MS1) information was obtained by full scanning with UHPLC-Q-TOF-MS, and the parent ions mainly existed as [M-H]⁻ and [M+HCOO]⁻. These are common adduct ion forms of ginsenosides, consistent with the literature [16]. In MS/MS mode, the sugar residues on the side chains were cleaved stepwise, and finally a relatively stable aglycone core with m/z 475, 459, or 455 was detected for the three typical ginsenoside types [21]. In addition, these aglycone cores were not easily cleaved, and this fragment information is an important basis for our identification and classification of unknown ginsenosides. The full scan and MS/MS spectra of six compounds, including Compound 13 (ginsenoside Rg1), Compound 14 (ginsenoside Re), Compound 27 (ginsenoside Rf), Compound 46 (ginsenoside Rb1), Compound 50 (ginsenoside Rc), and Compound 56 (ginsenoside Rb2), are shown as examples in Figure S4. These experimental results are consistent with the rules summarized above [22].
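The neutral-loss arithmetic behind these fragmentation patterns can be checked with a short calculation. The sketch below assumes nominal (integer) residue masses of 162 Da for a glucose residue and 176 Da for a glucuronic acid residue; the function name and constants are illustrative and are not part of the in-house workflow.

```python
# Nominal neutral losses (Da) on glycosidic cleavage (assumed integer masses):
# a glucose residue (glucose minus water) and a glucuronic acid residue.
GLC_RESIDUE = 162
GLCA_RESIDUE = 176


def aglycone_mz(precursor_mz, n_glc=0, n_glca=0):
    """Nominal m/z remaining after the given sugar residues are lost from [M-H]-."""
    return precursor_mz - n_glc * GLC_RESIDUE - n_glca * GLCA_RESIDUE


rg1_core = aglycone_mz(799, n_glc=2)           # Rg1 (PPT type)
ro_core = aglycone_mz(955, n_glc=2, n_glca=1)  # Ro (OA type)
print(rg1_core, ro_core)
```

The two results reproduce the protopanaxatriol and oleanolic acid fragment values quoted in the text (m/z 475 and 455).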
Accordingly, the chemical structures of the 69 screened ginsenosides were preliminarily identified (Figure 1, Table 1). Although the exact structures of some components could not be determined, this affects neither the assignment of their types nor the subsequent model analysis.
In brief, the pre-processing method enabled rapid screening and identification of 69 characteristic ginsenosides in 31 batches of ginseng samples from three different origins. The data pre-processing was completed within an hour and provided high-quality data for the subsequent multivariate statistical analysis.

Traditional Multivariate Statistical Analysis
Traditional multivariate statistical analyses, such as PCA and PLS-DA, were conducted using the peak areas of the 69 characteristic ginsenosides to elucidate the similarities and differences between ginsengs from three different geographical origins.
PCA, a commonly used unsupervised method, was used to explore trends among the ginseng samples from different growing origins. The first two principal components accounted for only 33.0% of the variation. As shown in Figure 2, the 31 batches of ginseng samples did not separate by origin. Subsequently, a supervised PLS-DA model was established to further discriminate the samples by origin. The R²Y and Q² of the PLS-DA model were 0.87 and 0.56, respectively. Furthermore, the PLS-DA model was evaluated using a random permutation test (Figure S5), in which the intercepts of R² and Q² were 0.371 and 0.277, respectively. As shown in the PLS-DA score plot (Figure 3), the ginseng samples from the three geographical areas were divided into only two clusters (from or not from LN), indicating that the three origins could not be identified. This was possibly because the least-squares method cannot effectively handle nonlinear MS data.

SVM Analysis
As a widely used method, SVM has been successfully applied in the quality control of TCMs with satisfactory classification and prediction accuracy [20]. In this work, an SVM model was developed to discriminate and predict ginsengs from different cultivation regions, using either the raw peak areas or the normalized data of the 69 characteristic ginsenosides as input vectors and the regions as outputs.
The best values for the parameters C and γ of the SVM model were determined using a grid search combined with ten-fold cross-validation. The parameter C affects the distance between the support vectors and the decision plane. The parameter γ mainly governs how the low-dimensional samples are mapped into the high-dimensional space. The classification accuracy under different combinations of γ and C is shown in Figure 4. There was a large plateau, indicating that the SVM model was well established, and a γ value of 0.03 and a C value of 1 were chosen in the ten-fold cross-validation for all data.
As shown in Table 2, the 31 batches of ginseng samples were assigned to individual origins with a prediction accuracy of 83% when the raw peak areas were used. However, the classification accuracy reached 100% when normalized data were used. Therefore, data normalization significantly improved the SVM performance, because Z-score normalization rescales each feature to zero mean and unit variance. This prevented differences in the mean and variance of individual features from affecting the classification results.
Thus, the results strongly indicated that the developed SVM model with normalized data was a powerful tool for the geographical classification and prediction of ginsengs from JL, LN, and HLJ.
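The parameter search described in this section can be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the authors' script: the data are synthetic stand-ins for the peak-area matrix, the grid values are arbitrary apart from including γ = 0.03 and C = 1, and the Z-score scaling is placed inside a pipeline so that it is refitted on the training folds of each cross-validation split.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the peak-area matrix: 3 origins x 12 samples, 6 peaks.
labels = np.repeat(["JL", "LN", "HLJ"], 12)
centers = {"JL": 1e4, "LN": 2e4, "HLJ": 3e4}  # hypothetical mean peak areas
X = np.vstack([rng.normal(centers[origin], 1e3, size=6) for origin in labels])

# Z-score scaling followed by an RBF-kernel SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.03, 0.1, 1],
}

# Ten-fold stratified cross-validation over every (C, gamma) combination.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(model, param_grid, cv=cv, scoring="accuracy")
search.fit(X, labels)

print(search.best_params_, round(search.best_score_, 3))
```

The mean cross-validated accuracy for each grid point is available in `search.cv_results_["mean_test_score"]`, which is the kind of surface a plot such as Figure 4 summarizes.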

Discovery of Quality Markers of Ginsengs from Three Different Origins
As far as we know, key feature extraction for SVM is still a challenge that cannot be handled by traditional statistical methods such as the t-test. To deal with this problem, a permutation importance algorithm was employed in this study. According to formula (A9), the contribution of every peak to the SVM model was calculated. In the next step, the potential quality markers of ginsengs from JL, LN, and HLJ were selected on the basis of these calculations. Based on the importance value (IV > 0), six quality markers were discovered: peak 65 (AcO-ginsenoside Rd or isomer), peak 18 (AcO-ginsenoside Re or isomer), peak 26 (ginsenoside Re2 or isomer), peak 25 (notoginsenoside M or isomer), peak 3 (ginsenoside Re2 or isomer), and peak 33 (yesanchinoside J or isomer). Their contributions are ranked from highest to lowest in Figure 5. Box plots of the six quality markers are shown in Figure S6 and indicate that the distribution of each marker differed among JL, LN, and HLJ.
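The idea of permutation importance can be illustrated with a small model-agnostic sketch: the importance of a feature is taken as the mean drop in accuracy when that feature's column is shuffled across samples. This is the common definition and is assumed here; it is not a reproduction of formula (A9). The toy data and the `predict` function (a stand-in for a fitted SVM) are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: feature 0 determines the label, feature 1 is unrelated noise.
X = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.9, 2.0], [0.8, 5.0], [0.7, 1.0]]
y = ["A", "A", "A", "B", "B", "B"]


def predict(row):
    """Hypothetical stand-in for a fitted classifier."""
    return "A" if row[0] < 0.5 else "B"


iv = permutation_importance(predict, X, y)
selected = [j for j, value in enumerate(iv) if value > 0]
print(iv, selected)
```

Features whose importance value is greater than zero are retained, mirroring the IV > 0 criterion used to select the six quality markers.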
To test the capability of the six quality markers, an SVM model was established again using only the six quality markers and ten-fold cross-validation. The origin identification accuracy was 100%. The results of the identification of ginseng origin by SVM with the six quality markers are shown in Table S2 and indicate that the six quality markers were sufficient to identify the origin of the ginseng samples. Selecting six peaks from the 69 peaks simplified the process of data acquisition for ginseng samples.

Verification of This Strategy for Ginseng Identification from Different Origins Using Test Samples
To verify the practical applicability of this strategy, eight batches (T1-T8) of test ginseng samples from different growth origins, purchased on the market, were used for prediction experiments. Following the sample preparation, analysis, and pre-processing methods described above, the normalized data of the six quality markers in the eight batches of ginseng samples were extracted and imported into the SVM model as input vectors for origin prediction. As shown in Table 3, the ginseng samples from the three provinces were all correctly identified, with an accuracy of 100%, indicating that this approach can effectively and accurately predict the geographical origin of ginseng samples sold on the market.

Ginseng Samples
Ginseng samples were collected in three provinces in Northeastern China: JL, HLJ, and LN. All samples were identified as dried roots of Panax ginseng C. A. Mey. by Xiaoping Yang from the Dalian Institute of Chemical Physics, Chinese Academy of Sciences. Sample information is shown in Table 4. S1-S31 are the training samples and T1-T8 are the test samples.

Chemicals and Reagents
LC-MS-grade acetonitrile was purchased from Fisher Scientific (Pittsburgh, PA, USA), LC-grade formic acid was purchased from Sigma, ultrapure water was obtained from a Milli-Q IQ 7000 system (Bedford, MA, USA), and analytical-grade ethanol was purchased from Energy Chemical (Shanghai, China).

Preparation of Samples
One gram of dry ginseng powder was extracted with 50 mL of 40% ethanol using an ultrasonic method (Kunshan ultrasonic instruments Co., Ltd., Suzhou, China) for 45 min, and the extracted solution was centrifuged at 10,000 rpm for 10 min to obtain the sample stock solutions for UHPLC-Q-TOF-MS. One milliliter of solution was collected from each sample stock solution from 39 batches of ginseng and mixed to obtain Quality Control (QC) samples. All stock solutions were filtered through a 0.22 µm membrane filter prior to UHPLC-Q-TOF-MS analysis.
The MS analysis of ginseng samples was performed on an Agilent 6545 Q-TOF-MS system (Agilent Technologies Inc., Santa Clara, CA, USA) equipped with a Dual AJS ESI ion source. The optimized parameters for negative-ion mode were as follows: curtain gas temperature, 320 °C; sheath gas temperature, 320 °C; dry gas flow rate, 8 L/min; ionization pressure, −3500 V; fragmentor, 75 V; and collision energy, 40 and 60 V. The scan mode was full scan for MS and auto scan for MS/MS. The m/z range was 400 to 1700 Da for MS and 100 to 1700 Da for MS/MS.

Data Processing and Analysis
The UHPLC-Q-TOF-MS raw data from the 31 batches of samples and the QC samples were analyzed using the target/suspect compound screening algorithm in the MassHunter workstation (version 10.0, Agilent Technologies Inc., Santa Clara, CA, USA). The algorithm took into account all ions exceeding 1000 counts with a charge state equal to one, and compounds were required to have a qualitative score greater than 60. Isotope grouping was based on the common organic molecules model. The resulting features for each sample screened by the workstation were exported for peak matching, alignment, and filtering. Peaks that were missing in more than 80% of the samples were removed in order to obtain the common peaks. The common peaks were then characterized according to their formula, exact molecular weight, and fragments, with reference to our existing database. The common peaks identified as ginsenosides are called characteristic ginsenosides. The peak areas of the characteristic ginsenosides in all samples were used as the data matrix for subsequent data analysis, including normalization, PCA, PLS-DA, and SVM.
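The missing-value filter used to obtain common peaks can be sketched as follows. The data structure (one dictionary of peak identifier to peak area per sample) and the function name are assumptions made for illustration; the MassHunter export format is not reproduced here.

```python
def common_peaks(samples, max_missing_fraction=0.8):
    """Keep peak IDs whose fraction of missing samples does not exceed the cutoff."""
    all_peaks = set().union(*(s.keys() for s in samples))
    n_samples = len(samples)
    kept = []
    for peak in sorted(all_peaks):
        n_missing = sum(1 for s in samples if peak not in s)
        if n_missing / n_samples <= max_missing_fraction:
            kept.append(peak)
    return kept


# Hypothetical aligned features: one {peak_id: area} dictionary per sample.
samples = [
    {"p1": 1200.0, "p2": 300.0},
    {"p1": 1100.0, "p2": 280.0},
    {"p1": 1300.0, "p3": 50.0},
    {"p1": 900.0},
    {"p1": 1000.0},
    {"p1": 950.0},
]

loose = common_peaks(samples)                             # 80% cutoff from the text
strict = common_peaks(samples, max_missing_fraction=0.0)  # present in every sample
print(loose, strict)
```

Setting the cutoff to zero keeps only peaks detected in every sample, which corresponds to the stricter situation in which all common peaks are present in all batches.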

Normalization Methods
The raw data were normalized by mean normalization and by Z-score normalization, defined as follows. Mean normalization: where $P_{m,\mathrm{standardize}}$ refers to peak m in sample k after normalization and $\bar{P}_m$ is the average value of peak m across all samples. Z-score normalization: $P_{k,\mathrm{standardize}} = (P_{k,m} - \bar{P}_k)/\sigma_k$, where $P_{k,\mathrm{standardize}}$ is peak k in sample m after normalization, $P_{k,m}$ is the raw area of peak k in sample m, $\bar{P}_k$ is the average value of peak k across all samples, and $\sigma_k$ is the standard deviation of peak k across all samples.
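Z-score normalization as defined above can be sketched with the Python standard library. The peak areas are hypothetical, and two choices are assumptions rather than details from the text: the population standard deviation is used, and new (test) samples are scaled with the means and standard deviations of the training samples.

```python
from statistics import mean, pstdev


def fit_zscore(train_rows):
    """Per-peak mean and population standard deviation over the training samples."""
    columns = list(zip(*train_rows))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]


def apply_zscore(rows, means, sds):
    """Subtract the per-peak mean and divide by the per-peak standard deviation."""
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


# Hypothetical peak areas: rows are samples, columns are peaks.
train = [[1000.0, 20.0], [2000.0, 40.0], [3000.0, 60.0]]
means, sds = fit_zscore(train)
train_z = apply_zscore(train, means, sds)

# A new sample is scaled with the training statistics, not its own.
test_z = apply_zscore([[2500.0, 30.0]], means, sds)
print(train_z, test_z)
```

After scaling, both peaks contribute on the same scale even though their raw areas differ by a factor of fifty.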

PCA Algorithm
PCA calculates principal components from the covariance of the data and uses them to transform the data linearly; generally, only the first few principal components are retained and the others are ignored [25]. The equation of the PCA model is $Y = PX$, where X is the matrix of independent variables, P is the transformation matrix, and the covariance matrix of PX is diagonal.

PLS-DA Algorithm
PLS-DA is a statistical method related to principal component regression. It finds a regression model by projecting the independent variable X and the dependent variable Y into a new space. PLS-DA is the variant used when Y is categorical [26]. The equations of the PLS model are [27]: $X = TP^{T} + E$ and $Y = UQ^{T} + F$, where X is the matrix of independent variables and Y is the matrix of dependent variables; T and U are the projections of X and Y, respectively; P and Q are the orthogonal loading matrices; and the matrices E and F are the error terms, which are assumed to be independent and identically distributed random normal variables. The decomposition of X and Y is performed to maximize the covariance between T and U.

SVM Algorithm
The support vector machine (SVM) model learns from the training samples through their support vectors and then classifies unknown samples, with the following mathematical expression:
$$w^{*} = \sum_{i} \alpha_{i}^{*} y_{i} x_{i}, \qquad b^{*} = y_{j} - \sum_{i} \alpha_{i}^{*} y_{i} (x_{i} \cdot x_{j})$$
where $\alpha_{i}^{*}$ is the multiplier of the constraint for sample i at each iteration, $x_{i}$ is the vector composed of the peak area data of sample i, $y_{i}$ is the label of sample i, $w^{*}$ is the weight vector calculated at each iteration, $x_{j}$ is the vector composed of the peak area data of support vector sample j, $y_{j}$ is the label of support vector sample j, and $b^{*}$ is the constant term calculated at each iteration.
The final iterative result makes sample j in the support vector satisfy the formula:
$$y_{j}(w^{*} \cdot x_{j} + b^{*}) - 1 = 0$$

Permutation Importance Algorithm
Traditional statistical learning is poorly interpretable, and calculating feature contributions is a common way to account for sample variability. The feature contribution is calculated as follows:
All calculations and pre-processing involving multivariate statistical analysis were performed using Python® (version 3.7.3). The SVM model and the feature selection method were built with Scikit-learn® (version 0.21.2). All raw data files were imported into Python with Pandas® (version 0.25.0).

Conclusions
In this paper, a rapid and efficient strategy is provided to achieve an intelligent distinction between ginsengs from JL, LN, and HLJ. Firstly, a robust UHPLC-Q-TOF-MS analysis method was developed, and a total of 69 characteristic ginsenosides were successfully extracted from 31 batches of samples for subsequent analysis. PCA and PLS-DA could not differentiate the ginseng origins, but our optimized SVM achieved accurate differentiation, with an accuracy of 100%. More importantly, the permutation importance algorithm was used to extract quality markers in SVM for the first time, which greatly improves the interpretability of SVM. Finally, the test samples were accurately predicted based on the six ginsenosides coupled with SVM. The proposed approach enables more specific discrimination and prediction of ginseng origin and provides a simple and reliable method for the discovery of quality markers for other TCMs.
Supplementary Materials: The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/molecules27134225/s1: Figure S1: Optimization of UHPLC-MS analysis conditions; Figure S2: TIC of ginseng samples by UHPLC-Q-TOF-MS (taking a QC sample as an example); Figure S3: The MS/MS spectra of ginsenosides; Figure S4: The full scan mode and MS/MS mode of six compounds; Figure S5: A presentation of 200 times the permutation test for PLS-DA analysis; Figure S6: Distribution of six peaks in ginseng from three different origins; Table S1: Stability and repeatability of UHPLC-Q-TOF-MS; Table S2: The classified and predicted results of ginsengs from three geographical origins using the SVM model's six quality markers.

Data Availability Statement:
The authors confirm that the data supporting the findings of this study are available within the article and from the corresponding author upon request.

Conflicts of Interest:
The authors declare no conflict of interest.
Sample Availability: Ginseng samples are available from the authors.