Metabolomics Combined with Multivariate Statistical Analysis for Screening of Chemical Markers between Gentiana scabra and Gentiana rigescens.

Gentianae Radix et Rhizome (Longdan in Chinese, GRR) in Chinese Pharmacopoeia is derived from the dried roots and rhizomes of Gentiana scabra and G. rigescens, that have long been used for heat-clearing and damp-drying in the medicinal history of China. However, the characterization of the chemical components of two species and the screening of chemical markers still remain unsolved. In current research, the identification and characterization of chemical components of two species was performed using ultra-high-performance liquid chromatography (UHPLC) coupled with linear ion trap-Orbitrap (LTQ-Orbitrap) mass spectrometry. Subsequently, the chemical markers of two species were screened based on metabolomics and multivariate statistical analysis. In total, 87 chemical constituents were characterized in G. scabra (65 chemical constituents) and G. rigescens (51 chemical constituents), with 29 common chemical constituents being discovered. Thereafter, 11 differential characteristic components which could differentiate the two species were designated with orthogonal partial least squares discriminant analysis (OPLS-DA) and random forest (RF) iterative modeling. Finally, seven characteristic components identified as (+)-syringaresinol, lutonarin, trifloroside, 4-O-β-d-glu-trifloroside, 4″-O-β-d-glucopyranosy1-6'-O-(4-O-β-d-glucaffeoyl)-linearroside, macrophylloside a and scabraside were selected as the chemical markers for the recognition of two Gentiana species. It was implied that the results could distinguish the GRR derived from different botanical sources, and also be beneficial in the rational clinical use of GRR.


Introduction
Gentianae Radix et Rhizoma (GRR), named Longdan in Chinese, is a widely used as traditional Chinese medicine (TCM) for the treatment of icterepatitis, dermatophytes and herpes zoster [1]. As first recorded in Shennong Bencao Jing (Shennong's Classic of Materia Medica) in Han Dynasty, GRR has been used as hepatoprotective and choleretic drug for thousands of years. It has also been described in various Chinese ancient medicinal monographs, such as Tujing Bencao in Song Dynasty and Bencao Gangmu in Ming Dynasty. It was mainly used for clearing liver and gallbladder dampness and heat, purging fire of the liver and gallbladder, relaxing tendons and relieving pain, etc. [2][3][4]. Modern pharmacological studies reported that it possessed anti-inflammatory, anti-oxidative and antiviral activities [5][6][7][8]. According to phytochemical research, the main constituents of GRR are iridoids, xanthones, flavonoids, triterpenoids, and others [4,[9][10][11]. Arrays of studies demonstrated that iridoids and xanthones were the active components and had various remarkable pharmacological activities including hepatoprotective [12], anti-inflammatory [13][14][15], antioxidant [8,16,17] and immunomodulatory activities [17]. However, investigations into the chemical profiling of GRR and screening of chemical markers of GRR derived from the different sources remain inadequate.
It is well known that morphologic and microscopic identifications, high-performance thin-layer chromatography (HPTLC) technique, DNA barcodes and metabolomics could all be used for the interspecies identification [18]. However, the there is a high possibility of false identification by applying morphologic and microscopic techniques, due to the subjective judgement by the experimenters. Although the TLC and DNA barcoding techniques were more objective and accurate compared to other approaches, the results lacked integrated adequate chemical information, and could not screen chemical markers to identify and classify different species in an efficient and swift manner. Metabolomics revealed the global chemical profiling of TCM and the raw data could be further analyzed with multivariate statistical analysis for the screening of chemical markers among different species or sources [19].
Nowadays, ultra-high-performance liquid chromatography (UHPLC) coupled with high-resolution mass spectrometry (HRMS) has been widely applied for the chemical profiling of complex mixtures from natural products [20,21]. LC-HRMS could offer the potential chemical composition and infer the detailed structures of analytes on the basis of the measurement of accurate mass and generation of the MS/MS or MS n data [22][23][24]. Moreover, it could determine hundreds or even thousands of MS features using a single injection in a short time and with less consumption of organic solvents, and implement high-throughput data acquisition [25].
In this study, an untargeted metabolomics combined with univariate analysis, orthogonal partial least-squares-discriminant analysis (OPLS-DA), and random forest (RF) for the screening of chemical markers which could be applied to accurately distinguish G. scabra and G. rigescens. Furthermore, the characterization of chemical components of GRR was performed by UHPLC-LTQ-Orbitrap/MS. The common and specific chemical constituents of G. scabra and G. rigescens were compared and classified accordingly.

Optimization of Chromatographic Conditions and Sample Extraction
In order to obtain good separation, the different chromatographic columns, gradient elution program, column temperature and flow rate were optimized in detail. The six different columns including Eclipse Plus C18, Extent C18, BEH C18, Kinetex C18, Zorbax SB C18 and HSS T3 were used to compare the separation capacity, the detailed column information was listed in Table S1. As a result, the HSS T3 column showed a better retention capacity and column efficiency for the separation than the other columns. Subsequently, the flow rate (0.3, 0.4, and 0.5 mL/min) and column temperature (25, 30, 35, and 40°C) were compared under the optimized gradient elution program, and 0.4 mL/min at 35°C was selected for separation ( Figure 1). The injection volume was 2 µL with no solvent effect. In addition, QC samples of GRR were collected in both positive and negative ion modes, and more useful information was prevailed in negative mode ( Figure 2).  Different extract solvents (methanol, ethanol and acetonitrile) were compared according to the response intensity of four standards (longanic acid, gentiopicroside, sweroside and trifloroside), and the result of the extraction of methanol solvent was better than others. The solvent ratio was evaluated as above, and the extraction efficiency of 50% methanol-water (v/v) was optimal ( Figure 3).

Deductive Fragmentation Pathway of Iridoids
Iridoids belong to the highly oxygenated monoterpene; its parent nucleus was a five-carbon cyclopenta pyranoid skeletal structure. In the simple iridoids, C-1 semi-acetal hydroxyl was active, and always linked with β-d-glucose to form glycosides, mostly mono-glycosides.
Some iridoids, known as the seco-iridoids, such as 7,8-secoderivative, were formed by the cleavage of cyclopentane ring at the bond of C-7 and C-8. In the seco-iridoids, it could be calssified into four categories: gentiopicrins, swerosides, swertiamarinss and dicarboxylic acids. The structure of gentiopicrin was characterized by the formation of double bonds between C-3 and C-4 and C5 and C-6 positions, based on the seco-iridoid parent nucleus. To differentiate from gentiopicrin, sweroside glycoside's skeletal structure only has one double bond in C-3 and C-4. Swertiamarin glycoside has the hydroxyl substituents at C-5 on the basis of the sweroside parent nucleus structure. Both sweroside and swertiamarin had a glycosyl-acetylation structure. The structure of rindoside, one compound of swertiamarin group, was taken as an example for summarizing the cleavage pattern: de-saccharification occurs first and then deacetyl group as shown in Figure 4 [26].   According to the cleavage regularity and detected chemical data information, in total, 87 compounds, including 54 iridoids, 13 flavonoids, two xanthone, four triterpenoids and five other components were characterized in G. scabra (65 chemical constituents) and Gentiana rigescens (51 chemical constituents), among which 29 common chemical constituents were shown in Table 1

Univariate Analysis for the Screening of Differential Metabolites
Compared with multivariate statistics, univariate analysis focuses more on independent changes in the levels of metabolites. After the data filtering with 80% and 15% rules, the 366 stable metabolic characteristics were obtained. Student's t-test was applied to figure out the features which had statistical difference (p-value < 0.05) between the two groups. The results showed that 283 features with p-value < 0.05 were screened out, and the heat map (Figure 8) illustrated the differences in 283 features between the two Gentiana species. It can be seen clearly that the 283 screened features accumulated in the two Gentiana species show great differences; about half of the metabolites were upregulated in the G. scabra group compared with their counterparts in the G. rigescens group.

Data Visualization and Experimental Stability Evaluation
Principal component analysis (PCA) was carried out to provide an unbiased visual representation of the sample distributions in the extracted principal components (PCs) space. As a result, after the Pareto scaling, the first seven PCs of the PCA-X model explained 90.6% of the variation in the original dataset (R 2 X(cum) = 0.906); 74.4% of the variation in original data was predicted by the model according to 7-fold cross validation (Q 2 (cum) = 0.744), which means that this PCA-X model could well represent the variation information of the original data (Figure 9a). The sample distribution of two Gentiana species and QC samples in the score plot (PC 1 vs. PC 2 ) was provided in Figure 9b. The sample distribution of QC samples was closely clustered, showing that the data collection was relatively stable during the entire experiment. These test samples from different species are well separated, indicating that the differences in metabolic characteristics between two Gentiana species were obvious. Therefore, for the next step, the decision combination of OPLS-DA and RF were further used in search of the chemical markers which could be used to successfully distinguish two Gentiana species.

Screening of Differential Metabolic Characteristics
A supervised OPLS-DA approach was used to preliminarily investigate the metabolite characteristics that showed the greatest differences between the two Gentiana species. As a result, after the Pareto scaling, the OPLS-DA model described 86.5% of the variation in X (R 2 X (cum) = 0.865); 98.9% in response Y (R 2 Y(cum) = 0.989) and 97.5% in response Y was predicted by the model according to 7-fold cross validation (Q 2 (cum) = 0.975), with one predictive and six orthogonal (1 + 6) components. The high value of those parameters demonstrated that the OPLS-DA model presented a good classification and prediction ability to distinguish two class. In Figure 10a, the sample distribution in the score plot (predictive component 1 vs. X side orthogonal component 1) of two Gentiana species were well separated, and the interclass samples heterogeneity of G. scabra were larger than G. rigescens, which indicated that the chemical variation in different batches of G. scabra was larger than that of G. rigescens. Subsequently, permutation tests (n = 200) were performed to validate the model performance. As shown in Figure 10b, the values of R 2 = (0.0, 0.394) and Q 2 = (0.0, −0.664) of category 1 and R 2 = (0.0, 0.385) and Q 2 = (0.0, −0.672) of category 2 indicated the OPLS-DA model in the present study having no risk of overfitting. In total, 47 variables (VIP > 1, Figure 10c) in the OPLS-DA model were selected as the differential metabolic characteristics, since these variables have an important identification capability, which is also consistent with the result shown in S-plot (Figure 10d). These differential metabolic characteristics were helpful to clarify the chemical differences in two Gentiana species. Although 47 differential metabolic characteristics selected by OPLS-DA could successfully identify two Gentiana species, many metabolic characteristics were not easy to detect in practice. Therefore, these differential metabolic characteristics need to be further refined to select a set of robust features labeled as chemical markers after characterization. Furthermore, the model performance has been greatly affected by the data scaling method [29], which may interfere with the correct selection of important variables in the metabolomic research. By contrast, RF was not very sensitive to the choice of data scaling method [30], and the RF model was further adopted, with 47 differential metabolic characteristics. The parameters of RF were set for 500 trees, and the m try was the default value (square root of the number of variables). RF was modeled through 100 iterations; the variable importance of each modeling process was ranked, then the ranked variable ID was stored for further analysis.
In Figure 11a, according to the sorted table of the variable importance of 100 RF model iterations, the cumulative number of each variable in top Nth was obtained. Each line represented a metabolic characteristic; it could be clearly seen that the variables were divided into two parts. The cumulative number of 36 variables in the left side reached 100 around top 35th, while the remaining 11 variables were situated in the right side due to their relatively low feature importance. Then, 36 variables which showed a relatively larger feature importance in 100 iteration modeling were selected. Subsequently, the frequency of the variable as the most important parameter in the 100 iterative modeling was calculated. In Figure 11b, the variable with the highest frequency appeared nine times, indicating that this metabolic characteristic was of major importance in differentiating between the two Gentiana species. Furthermore, one variable appeared six times, four variables five times and five variables four times, and showed great discriminating power. In the results, 11 metabolic characteristics that appeared more than four times were selected. Finally, based on 11 selected metabolic characteristics, RF model (500 trees) was established to distinguish the two Gentiana species. Results showed that the out-of-bag estimate of error rate of RF model is 0, and the area under the curve (AUC) of receiver operating characteristic (ROC) curve of the RF model was 1.00, indicating that the two Gentiana species could be correctly identified using 11 selected metabolic characteristics.
HPLC-grade acetonitrile and formic acid used in the mobile phase were purchased from Merck KGaA (Darmstadt, Germany) and Tedia Company Inc. (Fairfield., OH, USA), respectively. Ultrapure water (18.2 MΩ cm at 25 • C) was in-lab prepared by a Millipore Alpha-Q water purification system (Millipore, Bedford, MA, USA). HPLC-grade methanol for the sample preparation was purchased from Sino Pharm Chemical Reagent Co., Ltd. (Shanghai, China).

Sample and Standards Preparation
The roots and rhizome of GS and GR were collected from Liaoning Province in September and Yunnan Province in China in November, 2019. Fifty-four batches of GS samples were collected from six areas of Qingyuan, particularly in Liaoning Province (the largest base for the cultivation of GS in China.) Thirty-four batches of GR samples were collected from four areas of Yunxian in Yunnan province. The collection information was summarized in Table 3. These samples were all recorded according to their resources. All of them were identified by Professor Jinglong Zhang, who is the expert in medicinal botany in Changchun University of Chinese Medicine. The plant specimens were stored in Shanghai Research Center for Modernization of Traditional Chinese Medicine, National Engineering Laboratory for TCM Standardization Technology, Shanghai Institute of Materia Medica, Chinese Academy of Sciences. An aliquot of 0.250 g of accurately weighted fine powder of GRR was initially immersed in 10 mL 50% aqueous methanol (v/v) and extracted on a water bath at 40 • C with ultrasound (1130 W, 37 kHz) assistance for 30 min. The supernatant was transferred into a 2 mL centrifuge tube. After being centrifuged at 14,000 rpm in 10 min, the supernatant was filtered through a 0.22 µm PTFE microporous membrane (Agilent Technologies, Santa Clara, CA, USA) to prepare the test solutions. The standard solutions of gentiopicroside, swertiamarin, sweroside, loganin, homoorientin, mangiferin, 6'-O-β-d-gentiopicroside, secologanoside, loganic acid, trifloroside, rindoside and 6'-O-β-d-glu-loganic acid were prepared in methanol at the appropriate concentration.
An LTQ-Orbitrap Velos Pro hybrid mass spectrometer (Thermo Fisher Scientific, San Jose, CA, USA) was used for accurate mass measurements and data collection in negative mode, and with ESI-source-operated. The ESI source parameters were set as follows: ion spray voltage 2.7 kV, capillary temperature 320 • C, source heater temperature 200 • C, sheath gas (N 2 ) 15 arbitrary units, auxiliary gas (N 2 ) eight arbitrary units, and sweep gas (N 2 ) two arbitrary units. The Orbitrap analyzer scanned the mass range from m/z 50 to 1345 with a resolution of 30,000 (FWHM defined at m/z 400) for MS. The MS data were recorded in profile and centroid formats, respectively. The average acquisition time required to finish a scan circle (containing four scan events) was 1.8 s. Default values were used for most other acquisition parameters.

Data Processing and Multivariate Analysis
Progenesis QI (Waters, Milford, USA) was used to process the raw data acquired from the UPLC-LTQ-Orbitrap/MS. The feature detection, precursor ions fusion, and retention time correction operated by Progenesis QI to get a peak table (5528) including retention time, m/z, and normalized abundance of all samples. Regarding the data filtering, firstly, features with a relative standard deviation (RSD) greater than 15% in QC samples were removed because these features were unstable during the data collection of the entire experiment. Secondly, the remaining features were filtered according to the 80% rule, whereas features present in at least 80% of samples in one group were allowed to remain.
In data analysis, univariate and multivariate statistical methods were both introduced to contribute the complementary advantages [31]. Firstly, univariate data analysis, i.e., Student's t-test, was used to figure out the features which had statistical difference (the p-value of a Student's t-test of < 0.05) between the two groups. Secondly, a visual representation of the sample distributions for two Gentiana species and QC samples were provided by principal component analysis (PCA). Thirdly, OPLS-DA analysis was carried out to distinguish two Gentiana species, and the metabolic characteristics with VIP > 1 were chosen as differential metabolic characteristics. Finally, the RF analysis was iteratively modeled 100 times to deeply investigate the permutation accuracy importance of these differential metabolic characteristics, and the chemical markers were selected to differentiate two Gentiana species.

Conclusions
A systematic chemical characterization method was developed to provide an effective and scientific basis for the quality control of GRR, as well as to carry out an integrated platform based on plant metabolomics and chemometrics for the characterization and classification of two officinal Gentiana species. In the present study, 87 components were identified in GRR by UHPLC-LTQ-Orbitrap/MS, including 54 iridoids, 13 flavonoids, two xanthone, and four triterpenoids. Then, after the data visualization and classification analysis of whole metabolite profiles operated by PCA and OPLS-DA, two Gentiana species of GRR were clearly separated and correctly identified, and 47 differential metabolic characteristics were selected. Subsequently, 11 differential features were further selected based on the 100 RF iterative modeling. After matching with authentic compounds, seven differential components were selected as chemical markers for the recognition of two Gentiana species. The results indicated that metabolomics combined with chemometrics is a powerful tool to differentiate closely related species and could be used as essential data for quality control of traditional Chinese medicines.