Article

Assessing Geographical Origin of Gentiana Rigescens Using Untargeted Chromatographic Fingerprint, Data Fusion and Chemometrics

1
Yunnan Herbal Laboratory, Institute of Herb Biotic Resources, School of Life and Sciences, Yunnan University, Kunming 650091, China
2
The International Joint Research Center for Sustainable Utilization of Cordyceps Bioresouces in China and Southeast Asia, Yunnan University, Kunming 650091, China
3
College of Chemistry, Biological and Environment, Yuxi Normal University, Yu’xi 653100, China
4
College of Traditional Chinese Medicine, Yunnan University of Chinese Medicine, Kunming 650500, China
*
Author to whom correspondence should be addressed.
Molecules 2019, 24(14), 2562; https://doi.org/10.3390/molecules24142562
Submission received: 10 June 2019 / Revised: 10 July 2019 / Accepted: 12 July 2019 / Published: 14 July 2019

Abstract

Gentiana rigescens Franchet, which is famous for its bitter properties, is a traditional drug for chronic hepatitis and an important raw material for the pharmaceutical industry in China. In this study, high-performance liquid chromatography (HPLC) coupled with a diode array detector (DAD) and chemometrics was used to investigate the geographical variation in the chemistry of G. rigescens and to classify medicinal materials according to the latitude at which they were grown. The chromatographic fingerprints of 840 samples (rhizomes, stems, and leaves of 280 individuals) from four latitude areas were recorded and analyzed to trace the geographical origin of the medicinal materials. First, HPLC fingerprints of underground and aerial parts were generated using reversed-phase liquid chromatography. After preliminary data exploration, two supervised pattern recognition techniques, random forest (RF) and orthogonal partial least-squares discriminant analysis (OPLS-DA), were applied to the three HPLC fingerprint data sets of rhizomes, stems, and leaves. Furthermore, the fingerprint data sets of aerial and underground parts were separately processed and joined using two data fusion strategies (“low-level” and “mid-level”). The results showed that classification models based on OPLS-DA were more efficient than RF models. The classification models built with low-level data fusion showed good recognition and prediction abilities (accuracy higher than 99%, with sensitivity, specificity, Matthews correlation coefficient, and efficiency ranging from 0.95 to 1.00). Low-level data fusion combined with OPLS-DA provided the best discrimination result. In summary, this study explored the latitudinal variation of the phytochemicals of G. rigescens and developed a reliable and accurate identification method for G. rigescens grown at different latitudes, based on untargeted HPLC fingerprints, data fusion, and chemometrics. The results are meaningful for the authentication and quality control of Chinese medicinal materials.


1. Introduction

Gentiana rigescens Franchet (Dian long dan) is a herbaceous species that grows in the mountainous regions of the Yunnan-Guizhou Plateau in southwest China [1]. Like the traditional European medicinal plant yellow gentian (G. lutea L.), G. rigescens is famous for its bitter properties, which are due to bitter active principles such as loganin, gentiopicroside, swertiamarin, and sweroside [2,3,4]. These compounds have anti-inflammatory, antioxidant, anti-cancer, antiviral, cholagogic, hepatoprotective, and wound-healing activities, among others [3,5]. Additionally, they are used to stimulate appetite and improve digestion [5,6,7]. In addition, a series of neuritogenic compounds has been isolated from the aerial and underground parts of G. rigescens, which could be used as raw material for the preparation of functional food and of a therapeutic drug for Alzheimer’s disease [8,9,10,11]. G. rigescens is now an official drug of the Chinese pharmacopoeia (2015 edition) for chronic hepatitis and an important raw material for the pharmaceutical industry in China [12].
G. rigescens is usually collected from different regions of the Yunnan-Guizhou Plateau in order to satisfy the continuously increasing industrial demand for raw materials. However, some researchers have reported that the chemical constituents of the underground part of G. rigescens are extremely variable and diverse, depending on the growing location or producing area [13,14,15]. Quantitative analysis of bioactive compounds (such as gentiopicroside, sweroside, swertiamarin, and isoorientin) from rhizomes, stems, leaves, and flowers indicated that the northwest of the Yunnan-Guizhou Plateau is suitable for the accumulation of these compounds [13,14,15,16]. Additionally, the conversion and transport of those compounds might be influenced by climatic conditions in the plant habitat [14,17].
Latitude has a strong impact on the local climate environment in southwest China [18,19]. As the main distribution area of G. rigescens, Yunnan-Guizhou Plateau is characterized by very complex topography and it displays a wide variety of micro-climates [18,19,20,21]. There are six climatic zones from the north towards the south [20]. Especially, in the higher latitude areas, such as northwest Yunnan or south of the Hengduan Mountains (26–28° N), the temperature gradients are more abrupt than in the other regions [19]. Furthermore, precipitation and temperature in the Yunnan-Guizhou Plateau also show clear variations along the latitude gradients [19,21]. Therefore, it is necessary to explore the variation of phytochemical and medicinal material quality of G. rigescens that were grown in different latitudes and build a classification model for tracing producing areas of medicinal materials.
The contents of bioactive compounds and the quality of medicinal materials are closely related to the environment of the producing area [22,23,24,25]. With the expanding use of herbal medicines, quality control and geographical indication of medicinal materials have become a major concern for the pharmaceutical industry. However, a few marker compounds cannot reflect the chemical complexity of herbs, so marker-based methods struggle to authenticate the origin of herbal medicines effectively [26,27]. Chemical fingerprints, as a comprehensive evaluation methodology, have been widely used to deal with this problem [26,28,29]. In recent years, infrared spectroscopy (IR), UV-Vis spectroscopy (UV-Vis), and other spectral fingerprints have become well-established analytical techniques for geographical traceability studies of G. rigescens and other medicinal plants worldwide [30,31,32,33,34]. In contrast, there are limited reports on the use of chromatographic fingerprints to identify the producing regions of herbal materials [30,31,32,33,34,35]. Although there are many reports on the discrimination of herbs according to their producing areas using liquid chromatography, most of them are based on the information of a limited number of chemical markers or on chromatographic profiles [36,37,38,39]. The potential of chromatographic fingerprints for herb authentication needs to be further explored.
When compared with chemical marker or chromatographic profile (targeted), chromatographic fingerprint (untargeted) contains unspecific and non-evident information and chemometric tools should extract chemical information [40]. Recently, literature reported some successful studies applying chromatographic fingerprint, together with chemometric methodology, to discriminate herbs and food samples of different origin or cultivars [41,42,43,44]. All of those studies suggested that it is possible to develop a reliable and accurate method for the geographical tracing of G. rigescens by applying the chromatographic fingerprint methodology.
In the progression of improving geographical authentication of food and drugs, one of the important goals is building discrimination models with a less error rate and reducing the uncertainty of the prediction results [33,44]. Data fusion strategy has been widely used in the last years in the field of food authentication in order to improve class discrimination techniques [45]. Some reports about Panax notoginseng, Paris Polyphylla var. yunnanensis and other herb materials also showed the huge potential of this strategy in the discrimination of medicinal materials producing areas [46,47,48]. Today, most of the fused data come from spectral fingerprint and very few studies report the data fusion of chromatographic fingerprint [42,43]. Furthermore, data fusion studies are mostly based on the fusion of multivariate instrumental techniques [42,43], while reports of P. Polyphylla var. yunnanensis, Macrohyporia cocos, and other species indicated that reliable classification results were also available by the fusion analysis of chemical fingerprint data collected from different medicinal parts of herbs [35,49]. Accumulation and distribution of metabolites in the different parts of plants were different because of the differential response of root, stem, flower and other organs to the environment variation of producing area [17,50]. Therefore, fingerprint data fusion of multi-medicinal parts may provide integrated chemical information for the authentication of medicinal materials. At the same time, this method also contributes to a more comprehensive understanding of the response and adaptation of medicinal plants to complex geographical environments.
The aim of this study is to explore the variation of the chromatographic fingerprints of G. rigescens along the latitude gradient, to use chemometrics to mine the chemical information in the fingerprints, and to investigate the potential of the untargeted chromatographic fingerprint for tracing herbs grown at different latitudes. For this purpose, we developed fingerprints of rhizomes, stems, and leaves of G. rigescens by high-performance liquid chromatography with diode array detection (HPLC-DAD). Subsequently, classification models for the identification of different producing areas were built by combining the HPLC fingerprints with RF (random forest algorithm) and OPLS-DA (orthogonal partial least-squares discriminant analysis). Finally, two types of data fusion strategies, “low-level” and “mid-level” data fusion, were studied in order to improve the model performance.

2. Results and Discussion

2.1. Chromatographic Fingerprints Variation Along the Latitude Gradients

Figure 1 displays representative chromatographic fingerprints of rhizome, stem, and leaf. In the HPLC fingerprints, the five iridoid marker compounds eluted before 15 min. The retention times of loganin (1), 6′-O-β-d-glucopyranosylgentiopicroside (2), swertiamarine (3), gentiopicroside (4), and sweroside (5) were 7.279, 9.213, 9.573, 11.376, and 11.622 min, respectively. Loganin and gentiopicroside mainly accumulated in the underground part, whereas sweroside accumulated more in the overground parts. Furthermore, differences in the chemical composition of rhizome, stem, and leaf can also be observed visually in the chromatographic fingerprints. To facilitate subsequent data exploration and modeling, the retention time axis of each fingerprint signal was replaced by variable indices (Figure 1d–f). As a result, the rhizome, stem, and leaf fingerprints comprised 3839, 4140, and 4140 variables, respectively.
Principal component analysis (PCA) and two-dimensional score plots visualized the differences and variation trends of three medicinal parts. Figure 2 shows that the rhizomes and stems of G. rigescens tended to cluster to the left part, while the leaves data scattered to the right.
Although the fingerprints of the aboveground and underground medicinal parts differed obviously, an interesting result is that a trend of separation according to the latitude of the producing region was observed in the PCA score plots of all three medicinal parts. For example, the two-dimensional score plots of the rhizome fingerprints showed that the separation between samples increased with geographical distance, with a clear separation between samples collected from lower-latitude and higher-latitude regions (Figure 3). Conversely, for samples from producing regions that are geographically close to each other, the separation between rhizome samples decreased as the geographical distance decreased (Figure 4). The PCA score plots of stems and leaves followed the same trend as those of rhizomes (Figures S1–S4).
The results of PCA highlighted that the chromatographic fingerprints of G. rigescens were different among rhizomes, stems, and leaves, and were affected by latitude gradients of the production regions. Especially between lower latitudes and higher latitudes, the samples seem to be clearly distinguishable. Based on PCA exploratory analysis (unsupervised methods), supervised pattern recognition (OPLS-DA) should be applied to gain better classification results for samples that were grown in different latitudes (Figure 5 and Figure 6), and OPLS-DA and variable importance in the projection (VIP) analysis were used to further investigate the fingerprint variables of G. rigescens that were sensitive to latitude changes.
A VIP value greater than 1.00 indicates that a variable was clearly affected by the latitude of the producing area. Figure 7a shows that changes in three ranges of the rhizome fingerprint were closely related to the latitude of the producing area: variables with retention times of 2.00–13.00 min, variables with retention times of 15.00–20.00 min, and variables with retention times after 25.00 min. Figure 7b shows that the important variables (VIP value > 1.00) of the stem fingerprint correspond to retention times of 2.00–20.00 min and 25.00–30.00 min. For the leaf fingerprint, the chromatographic variables at retention times of 2.00–15.00 min, 17.00–19.00 min, and 25.00–30.00 min were the most sensitive to latitude changes of the producing areas (Figure 7c). Identification of the major compounds in the fingerprint showed that many of these important variables were chromatographic signals of iridoids and secoiridoids, such as loganin, 6′-O-β-d-glucopyranosylgentiopicroside, swertiamarine, gentiopicroside, and sweroside. A previous study on the spatial profiling of iridoid phytochemical constituents found that the geographical variation of those compounds could be attributed to environmental factors [13,17], for example, differences in precipitation among natural habitats [17]. Additionally, it is interesting to note that the number of important variables after 25 min gradually increased from the rhizome to the leaves. This suggests that, in addition to iridoids, other low-polarity products in G. rigescens contribute to the differentiation of geographical origins.
In short, the current research indicates that the chemical composition of G. rigescens changes with the latitude at which it is grown, in a way that can be traced with the chromatographic fingerprint. Furthermore, three-dimensional (3D) score plots and VIP analysis showed that the geographical variation of phytochemicals differs between overground and underground parts. Those differences might affect the result of geographical origin traceability of samples.

2.2. Geographic Authentication Based on Fingerprints of Different Medicinal Parts

In recent years, literature had already reported satisfying classification results that were obtained by RF or OPLS-DA models [51,52,53,54]. As an ensemble learning method, the RF algorithm could correct for decision trees’ habit of overfitting to their training set [55]. Additionally, OPLS could help to overcome these obstacles by separating useful information from noise and improve complex chemical data features and interpretability [56,57]. In this work, we tested RF and OPLS-DA models, combined with rhizome, stem, and leaf fingerprint data in order to classify G. rigescens according to their grown latitude.

2.2.1. RF Classification

First, the rhizome data set (280 samples and 3839 variables) was separated into a calibration set (186 samples) and a validation set (94 samples) by the Kennard-Stone algorithm. Subsequently, the 186 rhizome samples collected from the four latitude gradients were used to establish the calibration model (R_RF). During the modeling process, the initial value of ntree (which needs to be optimized) was set to 2000, the initial value of mtry was set to the square root of the number of variables, and the remaining parameters were left at their default values. The out-of-bag (OOB) errors were then calculated and the best ntree was chosen as the value with the lowest OOB error. Figure 8 shows that the error and the standard error were lowest with 663 trees. Based on the optimal number of trees, mtry was re-selected by searching values ranging from 50 to 75. The mtry value was set to 61 because the model then had the lowest OOB classification error. Finally, the classification model was established based on the optimum ntree and mtry values.
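The Kennard-Stone split mentioned above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `kennard_stone`, the Euclidean distance, and the toy data are assumptions made for illustration only. The rule starts from the two most distant samples and then repeatedly adds the sample whose nearest already-selected neighbour is farthest away, so that the calibration set spans the data space.

```python
import math


def kennard_stone(samples, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone rule."""
    n = len(samples)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise Euclidean distances (assumed metric).
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    while len(selected) < n_select:
        # Add the candidate whose nearest selected sample is farthest away.
        best = max(sorted(remaining),
                   key=lambda i: min(dist[i][j] for j in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: five one-variable "fingerprints" (hypothetical values).
toy = [(0.0,), (1.0,), (2.0,), (9.0,), (10.0,)]
calibration_idx = kennard_stone(toy, 3)
validation_idx = [i for i in range(len(toy)) if i not in calibration_idx]
```

With the data sizes reported here, the same call would select 186 of the 280 rhizome fingerprints for calibration and leave 94 for validation.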
Table 1 shows that the accuracies for samples of calibration set were 96.77% for low latitude samples, 99.46% for mid-latitude samples, 94.62% for mid-high latitude samples, and 94.09% for high latitude samples. Additionally, the accuracies of samples of validation set were 91.49%, 95.74%, 94.68%, and 98.94% for four different latitudes samples, respectively.
As for the rhizome model, the data sets of stems (280 samples and 4140 variables) and leaves (280 samples and 4140 variables) were each separated into a calibration set and a validation set. Subsequently, the RF calibration models of stems (S_RF) and leaves (L_RF) were built. The optimum ntree and mtry values can be found in Figure 9 and Figure 10.
For the RF model of the stem, calibration set accuracies of 92.47%, 94.62%, 93.01%, and 93.01% were achieved for low latitudes, mid-latitudes, mid-high latitudes, and high latitudes, respectively. The corresponding validation set accuracies were 98.94%, 97.87%, 96.81%, and 97.87% (Table 2).
For the RF model of the leaf, accuracies of 92.47%, 96.24%, 93.01%, and 94.62% were achieved for the calibration set, and accuracies of 85.11%, 93.62%, 89.36%, and 93.62% were achieved for the validation set (Table 3).

2.2.2. OPLS-DA Classification

The OPLS-DA models of rhizomes (R_OPLS-DA), stems (S_OPLS-DA), and leaves (L_OPLS-DA) were constructed based on the same calibration and validation sets that were used for the RF models. All of the models were constructed with internal seven-fold cross-validation, and the permutation plots can be found in the Supplementary Materials.
Table S1 shows that the R2 values of the models ranged from 0.77 to 0.82 and the Q2 values were larger than 0.50, which indicates that the OPLS-DA models were well fitted and had good predictive ability. The permutation test results can be found in Figures S14–S16.
The classification results of R_OPLS-DA model showed (Table 4) accuracies of calibration set were 98.92% for all classes. Accuracies of validation set were 95.47%, 98.94%, 94.86%, and 97.87% for low latitudes, mid-latitudes, mid-high latitudes, and high latitudes samples, respectively. For S_OPLS-DA models (Table 4), although 98.92%, 99.46%, 98.92%, and 98.39% values of calibration set accuracies were obtained for samples that were grown in four different latitudes, a lower value of total accuracy rate of validation set was obtained (93.62%). Parameters of L_OPLS-DA model showed (Table 4) that the accuracies of the calibration set were 97.31%, 99.46%, 97.31%, and 98.39% for low latitude, mid-latitude, mid-high latitude, and high latitude samples, respectively. However, the total accuracy of the validation set was lower than the calibration set. Especially, for samples of class 1, the accuracy was only 88.30%.
Finally, on the basis of the above analysis, we compared the classification performance of the six models. For the RF models, the order of total calibration accuracy was R_RF (96.24%) > L_RF (94.09%) > S_RF (93.28%), and the order of total validation accuracy was S_RF (97.87%) > R_RF (95.21%) > L_RF (90.43%). For the OPLS-DA models, the order of total calibration accuracy was R_OPLS-DA (98.92%) = S_OPLS-DA (98.92%) > L_OPLS-DA (98.12%), and the order of total validation accuracy was R_OPLS-DA (96.81%) > S_OPLS-DA (93.62%) > L_OPLS-DA (92.55%). In terms of accuracy, the classification models built with the leaf data set performed worst. Additionally, the validation sets of the L_RF and L_OPLS-DA models had lower Matthews correlation coefficient (MCC) values. By contrast, all of the models based on the rhizome data set showed better classification performance (total accuracy ranged from 95.21% to 98.92%). The best total accuracy was obtained when the rhizome data were combined with the OPLS-DA algorithm. Judging from the SE, SP, MCC, and EFF values, the R_OPLS-DA model also suffered less from imbalanced recognition among classes than the other models.
Although the classification performance of the OPLS-DA and RF models based on the rhizome data set was good, the classification ability of the models, in terms of accuracy, sensitivity (SE), specificity (SP), MCC, and efficiency (EFF), still needed to be enhanced. In a further step, the feasibility of combining the information from the rhizome, stem, and leaf fingerprint data for the geographical traceability of samples was investigated using low-level and mid-level data fusion strategies.
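For readers unfamiliar with these figures of merit, the following sketch computes them for one class against the rest from a confusion matrix. The definitions of SE, SP, and MCC are the standard ones; defining EFF as the geometric mean of SE and SP is an assumption, since this section does not state the formula. The function name `class_metrics` and the confusion matrix values are hypothetical.

```python
import math


def class_metrics(confusion, k):
    """One-vs-rest metrics for class k.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(confusion[i][k] for i in range(n)) - tp
    tn = total - tp - fn - fp
    se = tp / (tp + fn)          # sensitivity
    sp = tn / (tn + fp)          # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "ACC": (tp + tn) / total,
        "SE": se,
        "SP": sp,
        "MCC": mcc,
        "EFF": math.sqrt(se * sp),  # assumed definition: geometric mean
    }


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
toy_confusion = [
    [9, 1, 0],
    [0, 10, 0],
    [1, 0, 9],
]
m = class_metrics(toy_confusion, 0)
```

The per-class accuracies quoted in this section are consistent with the one-vs-rest accuracy (TP + TN) / N used here; for example, 180 correct decisions out of 186 calibration samples gives 96.77%.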

2.3. Geographic Authentication Based on Data Fusion Strategy

2.3.1. Low-Level Data Fusion

According to the method described in the data preprocessing scheme (Figure 11), the fingerprint data sets of overground and underground organs were concatenated as subsets into a single data block (a new data set). In the case of the low-level strategy, four data sets, rhizome combined with stem (RS), rhizome combined with leaf (RL), stem combined with leaf (SL), and all data combined (RSL), were used to build RF (RS_RF, RL_RF, SL_RF, and RSL_RF) and OPLS-DA (RS_OPLS-DA, RL_OPLS-DA, SL_OPLS-DA, and RSL_OPLS-DA) models. For every data set, the Kennard-Stone algorithm was used to select the calibration set, and the rest of the samples were used as the validation set.
The optimum ntree and mtry values were selected first (Figure S8). Afterwards, the final classification models were established based on the best parameter values. Table 5 shows that the samples collected from the four latitudes were better discriminated using the RS and RSL data sets. The RS_RF model achieved 95.43% total accuracy for the calibration set and 96.81% total accuracy for the validation set. The RSL_RF model achieved 94.89% total accuracy for the calibration set and 97.37% for the validation set. Comparison with the SE, SP, MCC, and EFF values of the S_RF and L_RF models (Table 1 and Table 3) showed that the low-level data fusion strategy reduced the imbalance in recognition among classes in the RF model (Table 5). However, the total accuracy of the models was not obviously improved.
The permutation plots of all models can be found in the Supplementary Materials (Figures S17–S20). For the OPLS-DA models based on low-level data fusion, the R2 values ranged from 0.86 to 0.90 and the Q2 values ranged from 0.74 to 0.80 (Table S2). The total accuracy rates of the calibration sets of RS_OPLS-DA, RL_OPLS-DA, SL_OPLS-DA, and RSL_OPLS-DA were 99.46%, 99.73%, 100.00%, and 99.73%, respectively (Table 6). Additionally, the correct classification rates of the validation sets varied from 97.34% to 98.40% (Table 6). Comparison of the SE, SP, MCC, and EFF values (Table 4 and Table 6) showed that the classification abilities of the data fusion OPLS-DA models were better than those of the individual data set models. Moreover, the RS_OPLS-DA model was the optimum classification model when using the low-level data fusion strategy (Table 5 and Table 6).
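Low-level fusion as described here amounts to joining the variables of several blocks sample by sample. The sketch below shows this under the assumption that every block lists the same plants in the same order and that no scaling or variable removal is applied before concatenation (the details of Figure 11 are not reproduced in this text). The function name `low_level_fusion`, the toy values, and the derived block sizes are illustrative.

```python
def low_level_fusion(*blocks):
    """Concatenate data blocks sample-wise.

    Each block is a list of rows, one row of variables per plant.
    """
    n_samples = len(blocks[0])
    if any(len(block) != n_samples for block in blocks):
        raise ValueError("all blocks must hold the same samples in the same order")
    return [
        sum((list(block[i]) for block in blocks), [])
        for i in range(n_samples)
    ]


# Toy blocks for two plants (hypothetical intensities).
rhizome = [[1.0, 2.0], [3.0, 4.0]]
stem = [[5.0], [6.0]]
leaf = [[7.0, 8.0, 9.0], [0.0, 1.0, 2.0]]
rsl = low_level_fusion(rhizome, stem, leaf)

# Variable counts implied by plain concatenation of the reported blocks
# (derived arithmetic, assuming nothing is dropped during fusion).
n_vars = {"R": 3839, "S": 4140, "L": 4140}
fused_sizes = {
    "RS": n_vars["R"] + n_vars["S"],
    "RL": n_vars["R"] + n_vars["L"],
    "SL": n_vars["S"] + n_vars["L"],
    "RSL": sum(n_vars.values()),
}
```

Under these assumptions the fused RSL block would hold 12,119 variables per plant, which illustrates why low-level fusion increases dimensionality and why variable selection is attractive as a next step.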

2.3.2. Mid-Level Data Fusion

Finally, the feasibility of further optimizing the model parameters by feature subset selection and data fusion was investigated (Figure 11). Variable selection is one of the steps of the mid-level data fusion strategy. For the RF model, the “Boruta” algorithm was used to identify important chromatographic signal variables that significantly contributed to the classification performance. The “Boruta” selection was based on three RF models that were built using the data sets of rhizomes (3839 variables), stems (4140 variables), and leaves (4140 variables), respectively. After comparing the importance of the original attributes with the importance achievable at random, 200 variables of the rhizome data set, 305 variables of the stem data set, and 359 variables of the leaf data set were retained as relevant feature variables for sample discrimination (Figures S9–S11). Subsequently, those feature subsets were combined into a new data block, and the fused data sets (505 variables for RS, 559 variables for RL, 664 variables for SL, and 864 variables for RSL) were used to establish the final classification models. The optimum ntree and mtry values of the RS_RF, RL_RF, SL_RF, and RSL_RF models can be found in Figure S12.
Table 7 lists the statistical results for the classification ability of the four RF models based on mid-level data fusion. The average accuracies of the calibration set and validation set were achieved for 96.44% and 97.21% by using RF algorithm. It is notable that the RL_RF model had accuracies that ranged from 94.09% to 99.46% in the calibration set and accuracy ranging from 96.81% to 100% in the validation set. In addition, parameters of SE (0.87–1.00), SP (0.94–1.00), MCC (0.87–1.00), and EFF (0.92–1.00) for each class of RL_RF model were higher than most RF classification models. As a result, mid-level data fusion strategy could eliminate the unnecessary variables, enhance model classification ability, and improve the phenomenon of imbalance category recognition in the RF model relative to low-level data fusion strategy.
For the OPLS-DA model, three independent classification models were first built using the original data sets of rhizome, stem, and leaf, respectively. Subsequently, the VIP values of the variables in the different classification models were calculated with SIMCA software. The results showed (Figure S12) that a total of 4486 variables (1309 selected from the rhizome data set, 1538 from the stem data set, and 1639 from the leaf data set) had VIP values greater than 1. Those variables, which are of high importance for the geographical traceability of samples, were combined into new data sets (2847 variables for RS, 2948 variables for RL, 3177 variables for SL, and 4486 variables for RSL) for building the final classification models. The R2 and Q2 values and the permutation plots of the RS_OPLS-DA, RL_OPLS-DA, SL_OPLS-DA, and RSL_OPLS-DA models are shown in Table S2 and Figures S21–S24.
The classification results showed that the average accuracies of the calibration and validation sets were 99.66% and 96.81%, respectively (Table 8). The four models exhibited good performance, with MCC values ranging from 0.96 to 1.00 and EFF values ranging from 0.92 to 1.00 (Table 8). The OPLS-DA models based on mid-level and low-level data fusion showed similar accuracy and model performance, although feature selection was useful for removing irrelevant variables when classifying samples.
Overall, data fusion improved the results when compared with the performance of models based on the independent data sets. Considering the similar accuracy of the calibration and validation sets and the higher SE, SP, MCC, and EFF values, the RS_OPLS-DA model based on the low-level data fusion strategy showed the best performance.
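The selection-then-concatenation step of mid-level fusion can be sketched as follows. The VIP values themselves came from SIMCA in this study; here they are simply given as hypothetical input lists, and the function names `select_by_vip` and `mid_level_fusion` are illustrative assumptions rather than any software's API.

```python
def select_by_vip(block, vip, threshold=1.0):
    """Keep only the columns of a block whose VIP value exceeds the threshold."""
    keep = [j for j, value in enumerate(vip) if value > threshold]
    return [[row[j] for j in keep] for row in block]


def mid_level_fusion(blocks, vips, threshold=1.0):
    """Select important variables in each block, then join blocks sample-wise."""
    reduced = [select_by_vip(b, v, threshold) for b, v in zip(blocks, vips)]
    n_samples = len(reduced[0])
    return [sum((r[i] for r in reduced), []) for i in range(n_samples)]


# Toy blocks for two plants with hypothetical VIP values per variable.
rhizome = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
stem = [[7.0, 8.0], [9.0, 10.0]]
fused = mid_level_fusion([rhizome, stem], [[1.4, 0.6, 1.1], [0.9, 1.2]])

# Consistency check of the variable counts reported in the text.
selected = {"R": 1309, "S": 1538, "L": 1639}
fused_sizes = {
    "RS": selected["R"] + selected["S"],
    "RL": selected["R"] + selected["L"],
    "SL": selected["S"] + selected["L"],
    "RSL": sum(selected.values()),
}
```

The derived sizes (2847, 2948, 3177, and 4486) match the numbers reported for the fused OPLS-DA data sets.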

3. Materials and Methods

3.1. Plant Material Collection

Plant materials (29 populations and 280 individuals) of G. rigescens were collected in the fall of 2012 and 2013, during the local traditional harvest period, at different locations in Yunnan, Guizhou, and Sichuan (Figure 12). The producing areas were divided into four groups according to the location of the populations: (I) low latitude area, with latitudes ranging from 23.66 to 23.92° N, south of Yunnan (eight populations and 76 individuals); (II) mid-latitude area, with latitudes ranging from 24.95 to 25.06° N, middle of Yunnan (five populations and 48 individuals); (III) mid-high latitude area, with latitudes ranging from 26.49 to 26.64° N, northwest of Yunnan and west of Guizhou (nine populations and 87 individuals); and (IV) high latitude area, with latitudes ranging from 27.34 to 28.52° N, Hengduan Mountains region of Yunnan and mountainous regions of southwest Sichuan (seven populations and 69 individuals). The fresh materials were authenticated and transported to the laboratory of Yuxi Normal University. Subsequently, the samples were washed and dried at 50 °C as soon as possible. Finally, all samples (rhizomes, stems, and leaves) were stored in a relatively dry environment prior to the extraction procedure.

3.2. Chemicals and Reagents

HPLC-grade acetonitrile and methanol (MeOH) were supplied by Thermo Fisher Scientific (Waltham, MA, USA). HPLC-grade formic acid was purchased from Sigma-Aldrich (Steinheim, Germany). Deionized water was obtained from Wahaha Group Co., Ltd. (Hangzhou, Zhejiang, China). The primary grade reference standards loganin (purity: ≥98%), 6′-O-β-d-glucopyranosylgentiopicroside (purity: ≥98%), swertiamarine (purity: ≥98%), gentiopicroside (purity: ≥98%), and sweroside (purity: ≥98%) were purchased from the Chinese National Institute for Food and Drug Control (Beijing, China) and Shanghai Shifeng Biological Technology (Shanghai, China).

3.3. Sample Preparation

The dried samples (rhizomes, stems, and leaves) were ground and then passed through a 100-mesh sieve. Each sample powder (25 mg) was accurately weighed and extracted with 1.5 mL of 80% methanol-water solution at 25 °C in an ultrasonic extractor for 40 min. The final extract was filtered through a 0.22 μm syringe filter into an HPLC vial and then subjected to HPLC analysis [16,58].

3.4. Instrumentation and HPLC Analysis

Chromatographic analyses were performed with an Agilent 1260 Infinity LC system (Agilent Technologies, Santa Clara, CA, USA), which was equipped with a G1315D diode-array detector, a G1329B ALS autosampler, and a thermostated column compartment. The HPLC fingerprints were recorded with ChemStation software (Agilent Technologies, Waldbronn, Germany).
The analytical separation was adopted from a published method for chemical fingerprinting analysis [16]. The separation was achieved on a reversed-phase C18 column (Agilent Intersil, 5 µm, 4.6 × 150 mm; Agilent, Santa Clara, CA, USA). The mobile phase consisted of (A) 0.1% phosphoric acid in water and (B) 100% acetonitrile. The gradient program was as follows: 0.00–2.50 min: 7–10% B, 2.50–20.00 min: 10–26% B, 20.00–29.02 min: 26–58.3% B, 29.02–30.00 min: 58.3–90% B. The column was subsequently washed with 90% B and re-equilibrated with 7% B prior to injection of the next sample. The flow rate was 1.0 mL/min and the column temperature was 30 °C. The injection volume was 5 µL and the UV detection wavelength was set at 241 nm. Chromatographic data were processed using OpenLab software (Agilent Technologies) [16,58].
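As a small worked example, the gradient program above can be encoded as a table of breakpoints to look up the mobile phase composition at any time. The sketch assumes that %B changes linearly within each segment, which the text does not state explicitly; the names `GRADIENT` and `percent_b` are hypothetical.

```python
# (time in min, %B) breakpoints taken from the gradient program in the text.
GRADIENT = [(0.00, 7.0), (2.50, 10.0), (20.00, 26.0), (29.02, 58.3), (30.00, 90.0)]


def percent_b(t_min, program=GRADIENT):
    """Return %B at time t_min, assuming linear ramps between breakpoints."""
    if not program[0][0] <= t_min <= program[-1][0]:
        raise ValueError("time is outside the gradient program")
    for (t0, b0), (t1, b1) in zip(program, program[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)


# Composition near the elution of gentiopicroside (retention time 11.376 min).
b_at_gentiopicroside = percent_b(11.376)
```

Under the linear-ramp assumption, the midpoint of the second segment (11.25 min) corresponds to 18% B, so the five iridoid markers elute while the mobile phase is still predominantly aqueous.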

3.5. Data Analysis

HPLC fingerprints from the 280 rhizome samples, 280 stem samples, and 280 leaf samples (840 fingerprints in total) were exported in CSV format and imported into MATLAB R2018b (The MathWorks, Inc., Natick, MA, USA), which was used for correlation optimized warping (COW) alignment of the chromatographic fingerprints. MATLAB code for COW is freely available from www.models.kvl.dk. The preprocessed fingerprints were used in the subsequent analyses [59].
Exploratory data analysis (EDA) is a necessary step before building predictive models [60,61]. It helps to reveal interesting correlations among samples or variables and to summarize the main characteristics of a data set [60]. Principal component analysis (PCA) is a popular primary tool in EDA [61,62]. It is often used to visualize the relatedness between samples and to explain the variance in the data. Hence PCA, as an unsupervised pattern recognition technique, has been widely used to extract key information from chemical fingerprints for geographical origin or modelling research [61].
Unlike PCA, orthogonal partial least squares discriminant analysis (OPLS-DA) is a supervised pattern recognition technique. As an extension of PLS, the OPLS-DA model incorporates an inbuilt orthogonal signal correction filter [56]. This algorithm divides the variation in the X variables into two parts: one that is related to class information (Y-predictive) and one that is orthogonal, or unrelated, to class information (Y-uncorrelated). As a result, the interpretability and prediction performance of the model are enhanced [56].
Random forest (RF) is the other supervised pattern recognition technique used in the study. RF is an ensemble learning method [55]. The RF algorithm grows a large number of trees in order to improve predictive ability, and the decisions of the individual trees are combined into the final decision. In general, building more trees can improve accuracy; however, many studies have shown that choosing an optimum number of trees is of great importance for classification performance [33,46].
In this work, exploratory data analysis of the HPLC fingerprints of G. rigescens grown at four different latitudes was performed with PCA. Two supervised pattern recognition techniques, OPLS-DA and RF, were applied to build classification models for the G. rigescens producing areas. PCA and OPLS-DA were carried out in SIMCA 14.1 (Umetrics AB, Umeå, Sweden). RF classification models were established with R 3.5.1 and the package randomForest (Version 4.6-14) [63].
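To make the idea of a PCA score concrete, the sketch below extracts the first principal component of a small samples × variables matrix by power iteration in plain Python. It is only an illustration under stated assumptions: the study itself used SIMCA, the function name `first_principal_component` is hypothetical, and the start vector is an arbitrary choice that would fail in the rare case where it is exactly orthogonal to the first component.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (loadings, scores) of PC1 for a samples x variables matrix.

    Data are mean-centred, then power iteration is applied to X^T X
    (computed implicitly as X^T (X w)); no third-party libraries are used.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    X = [[r[j] - means[j] for j in range(p)] for r in rows]

    # Arbitrary non-uniform unit start vector (an assumption of this sketch).
    w = [float(j + 1) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]

    for _ in range(n_iter):
        t = [sum(x[j] * w[j] for j in range(p)) for x in X]  # scores t = X w
        w_new = [sum(X[i][j] * t[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w_new))
        if norm == 0.0:  # no variance along the current direction
            break
        w = [v / norm for v in w_new]

    scores = [sum(x[j] * w[j] for j in range(p)) for x in X]
    return w, scores
```

Plotting the scores of the first two components against each other gives the kind of two-dimensional score plot used for exploratory inspection; the sign of a component is arbitrary.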
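The step in which the decisions of individual trees are combined can be sketched as a simple majority vote. The snippet below shows only that combining step, not tree growing or bootstrapping, and it is not the randomForest implementation used in the study; the function names are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine the class labels that each tree assigns to one sample.

    Ties are resolved in favour of the label that was encountered first.
    """
    if not tree_predictions:
        raise ValueError("need at least one tree prediction")
    return Counter(tree_predictions).most_common(1)[0][0]


def forest_predict(per_tree_labels):
    """per_tree_labels[k][i] is the label tree k gives to sample i."""
    n_samples = len(per_tree_labels[0])
    return [
        majority_vote([tree[i] for tree in per_tree_labels])
        for i in range(n_samples)
    ]
```

With three trees voting I, I, II for one sample, the forest assigns class I; adding trees changes the result only when the additional votes shift the majority.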

Data Fusion Strategy

In the case of the low-level fusion strategy (Figure 11), the different subsets (HPLC fingerprint data matrices of rhizomes, stems, and leaves) are straightforwardly concatenated into a new chromatographic data matrix for subsequent classification model construction [45,46]. Furthermore, each subset must be fully aligned, and all variables must be kept on the same scale, before the subsets are joined [45,46].
In the case of mid-level fusion (Figure 11), the first step of data treatment is feature selection based on the rhizome, stem, or leaf classification models. Compared with the raw data sets, feature selection reduces the data volume and the data dimensionality. Subsequently, new subsets of rhizomes, stems, and leaves are rebuilt from the selected variables [45]. Finally, those subsets are concatenated into a final data matrix for model construction [45].
In this research, relevant variables for the RF classification models were determined with the R package Boruta [64], and variable importance in projection (VIP) was used to select important variables for OPLS-DA [65].
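Low-level fusion amounts to joining the variable blocks column-wise for each sample. The sketch below shows this in plain Python; the scaling helper is one simple assumption about how blocks could be brought onto a common scale and is not necessarily the scaling used in the study. Both function names are hypothetical.

```python
def scale_block(block):
    """Divide a block by its largest absolute value (one simple scaling choice)."""
    peak = max(abs(v) for row in block for v in row) or 1.0
    return [[v / peak for v in row] for row in block]


def low_level_fusion(*blocks):
    """Concatenate blocks sample by sample.

    Each block is a list of rows (samples x variables); all blocks must
    contain the same samples in the same order.
    """
    n = len(blocks[0])
    if any(len(b) != n for b in blocks):
        raise ValueError("all blocks must contain the same number of samples")
    return [[v for block in blocks for v in block[i]] for i in range(n)]
```

A rhizome block with two variables and a stem block with one variable therefore give a fused matrix with three variables per sample.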
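Mid-level fusion can be sketched as feature selection within each block followed by concatenation of the reduced blocks. In the snippet below the importance scores are supplied by the caller (for example VIP values or another ranking); the default threshold of 1 mirrors the VIP > 1 highlighting in Figure 7 but is an assumption of this sketch, and the function names are hypothetical.

```python
def select_columns(block, importance, threshold=1.0):
    """Keep only the columns whose importance score exceeds the threshold."""
    keep = [j for j, score in enumerate(importance) if score > threshold]
    return [[row[j] for j in keep] for row in block], keep


def mid_level_fusion(blocks, importances, threshold=1.0):
    """Select features in each block, then concatenate the reduced blocks."""
    reduced = [
        select_columns(block, imp, threshold)[0]
        for block, imp in zip(blocks, importances)
    ]
    n = len(reduced[0])
    return [[v for block in reduced for v in block[i]] for i in range(n)]
```

Compared with low-level fusion, the fused matrix is smaller because uninformative variables are dropped before the blocks are joined.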

3.6. Model Evaluation

Five parameters, namely accuracy (ACC), sensitivity (SE), specificity (SP), efficiency (EFF), and Matthews correlation coefficient (MCC), were applied to evaluate the identification ability of the RF and OPLS-DA models. The robustness of the OPLS-DA models was investigated with 200 permutation tests. Furthermore, cumulative prediction ability (Q2), cumulative interpretation ability (R2), root mean square error of estimation (RMSEE), root mean square error of cross-validation (RMSECV), and root mean square error of prediction (RMSEP) were important evaluation indexes for the predictive power of the OPLS-DA models [33,66].
Values of TP (correctly identified samples of the positive class), TN (correctly identified samples of the negative class), FN (incorrectly identified samples of the positive class), and FP (incorrectly identified samples of the negative class) were calculated from the confusion matrices of the classification models. Subsequently, ACC, SE, SP, EFF, and MCC were calculated using Equations (1)–(5), and the values of Q2, R2, RMSEE, RMSECV, and RMSEP were computed by SIMCA 14.1.
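$$\mathrm{ACC} = \frac{\mathrm{TN} + \mathrm{TP}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}} \quad (1)$$
$$\mathrm{SE} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \quad (2)$$
$$\mathrm{SP} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} \quad (3)$$
$$\mathrm{EFF} = \sqrt{\mathrm{SE} \times \mathrm{SP}} \quad (4)$$
$$\mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}} \quad (5)$$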
For model performance, lower values of RMSEE, RMSECV, and RMSEP indicate better predictive ability. Conversely, the closer the values of ACC, SE, SP, EFF, MCC, Q2, and R2 are to 1, the better the model performs.
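The per-class parameters can be computed from a confusion matrix as in the following minimal Python sketch. It treats each class in turn as the positive class (one-vs-rest) and takes EFF as the geometric mean of SE and SP, which matches the tabulated values (for example, SE = 0.75 and SP = 1.00 correspond to EFF = 0.87 in Table 1). The function names and the matrix layout (rows = true class, columns = predicted class) are assumptions of this sketch, and every class is assumed to contain at least one sample.

```python
import math


def one_vs_rest_counts(confusion, k):
    """Return TP, TN, FP, FN for class k.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(row[k] for row in confusion) - tp
    tn = total - tp - fn - fp
    return tp, tn, fp, fn


def class_metrics(tp, tn, fp, fn):
    """Compute ACC, SE, SP, EFF, and MCC from the four counts."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SE": se,
        "SP": sp,
        "EFF": math.sqrt(se * sp),
        # MCC is set to 0 when the denominator vanishes (a common convention).
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }
```

For a four-class latitude model, calling `one_vs_rest_counts` with k = 0, 1, 2, 3 yields one set of parameters per class, which is how the per-class columns I–IV in the tables can be read.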

4. Conclusions

The findings of this study showed that the chemical profiles of G. rigescens were influenced by the latitude gradient of the producing areas, and samples from lower and higher latitudes appeared to be clearly distinguishable. The score plots of PCA and OPLS-DA visualized the phytochemical geographic variation of the aerial and underground parts along the latitude gradient. Subsequently, the potential of HPLC-DAD fingerprint data to discriminate and classify G. rigescens grown at four different latitudes was investigated, and RF and OPLS-DA models were used to develop an effective approach for geographical traceability. When the models were built on independent data sets, the rhizome data set combined with OPLS-DA gave the best performance, with classification accuracies for the calibration and validation sets ranging from 94.68% to 98.94%. In a further step, the feasibility of combining the chromatographic fingerprint data from aerial and underground organs was investigated using two data fusion strategies, low-level and mid-level, in order to improve the performance of the classification models. Notably, the classification performance of the OPLS-DA models was improved by the low-level data fusion strategy, whereas better performance of the RF models appeared to be achieved by the mid-level data fusion strategy. Although satisfactory results were obtained with both RF and OPLS-DA under the two data fusion strategies, OPLS-DA combined with the rhizome-stem fusion data set was the optimum model for discriminating G. rigescens samples according to their growing latitudes, with an accuracy of 97.87–100.00%, SE of 0.96–1.00, SP of 0.98–1.00, MCC of 0.95–1.00, and EFF of 0.97–1.00.

Supplementary Materials

The following are available online. Figure S1: Variation of stems score plots along the latitude gradients, Figure S2: Variation of stems score plots between the adjacent latitudes, Figure S3: Variation of leaves score plots along the latitude gradients, Figure S4: Variation of leaves score plots between the adjacent latitudes, Figure S5: Permutation plot of the OPLS-DA of rhizome samples, Figure S6: Permutation plot of the OPLS-DA of stem samples, Figure S7: Permutation plot of the OPLS-DA of leaf samples, Figure S8: The ntree and mtry screening of RF models based on low-level data fusion strategy, Figure S9: Result of variables selection of rhizome fingerprint data based on “Boruta” algorithm, Figure S10: Result of variables selection of stem fingerprint data based on “Boruta” algorithm, Figure S11: Result of variables selection of leaf fingerprint data based on “Boruta” algorithm, Figure S12: The ntree and mtry screening of RF models based on mid-level data fusion strategy, Figure S13: The importance variables of OPLS-DA models of rhizomes, stems and leaves fingerprints data, Figure S14: Permutation testing (200 times) of the R_OPLS-DA model, Figure S15: Permutation testing (200 times) of the S_OPLS-DA model, Figure S16: Permutation testing (200 times) of the L_OPLS-DA model, Figure S17: Permutation testing (200 times) of the RS_OPLS-DA model based on low-level data fusion, Figure S18: Permutation testing (200 times) of the RL_OPLS-DA model based on low-level data fusion, Figure S19: Permutation testing (200 times) of the SL_OPLS-DA model based on low-level data fusion, Figure S20: Permutation testing (200 times) of the RSL_OPLS-DA model based on low-level data fusion, Figure S21: Permutation testing (200 times) of the RS_OPLS-DA model based on mid-level data fusion, Figure S22: Permutation testing (200 times) of the RL_OPLS-DA model based on mid-level data fusion, Figure S23: Permutation testing (200 times) of the SL_OPLS-DA model based on mid-level data fusion, Figure S24: Permutation testing (200 times) of the RSL_OPLS-DA model based on mid-level data fusion, Table S1: The evaluation indexes for predictive power of OPLS-DA model of rhizome, stem and leaf, Table S2: The evaluation indexes for predictive power of OPLS-DA models based on low-level and mid-level data fusion strategies.

Author Contributions

H.Y. and Y.-Z.W. designed the project and revised the manuscript. T.S. performed the experiments, analyzed the data and wrote the manuscript.

Funding

This research was supported by the Key Project of Yunnan Provincial Natural Science Foundation (2017FA049), the Projects for Applied Basic Research in Yunnan (2017FH001-028), Biodiversity Survey, Monitoring and Assessment (2019HB2096001006) and the Department of Science and Technology of Yunnan Province (2018IA075).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Flora of China Editorial Committee. Flora of China; Science Press and Missouri Botanical Garden Press: Beijing, China, 1995; Volume 22.
  2. Mustafa, A.M.; Caprioli, G.; Ricciutelli, M.; Maggi, F.; Marín, R.; Vittori, S.; Sagratini, G. Comparative HPLC/ESI-MS and HPLC/DAD study of different populations of cultivated, wild and commercial Gentiana lutea L. Food Chem. 2015, 174, 426–433.
  3. Pan, Y.; Zhao, Y.L.; Zhang, J.; Li, W.Y.; Wang, Y.Z. Phytochemistry and pharmacological activities of the genus Gentiana (Gentianaceae). Chem. Biodivers. 2016, 13, 107–150.
  4. Jiang, R.W.; Wong, K.L.; Chan, Y.M.; Xu, H.X.; But, P.P.H.; Shaw, P.C. Isolation of iridoid and secoiridoid glycosides and comparative study on Radix gentianae and related adulterants by HPLC analysis. Phytochemistry 2005, 66, 2674–2680.
  5. Xu, Y.; Li, Y.; Maffucci, K.; Huang, L.; Zeng, R. Analytical methods of phytochemicals from the genus Gentiana. Molecules 2017, 22, 2080.
  6. Mustafa, A.M.; Ricciutelli, M.; Maggi, F.; Sagratini, G.; Vittori, S.; Caprioli, G. Simultaneous determination of 18 bioactive compounds in Italian bitter liqueurs by reversed-phase high-performance liquid chromatography-diode array detection. Food Anal. Method 2014, 7, 697–705.
  7. Mirzaee, F.; Hosseini, A.; Jouybari, H.B.; Davoodi, A.; Azadbakht, M. Medicinal, biological and phytochemical properties of Gentiana species. J. Tradit. Complement. Med. 2017, 7, 400–408.
  8. Gao, L.J.; Li, J.Y.; Qi, J.H. Gentisides A and B, two new neuritogenic compounds from the traditional Chinese medicine Gentiana rigescens franch. Bioorgan. Med. Chem. 2010, 18, 2131–2134.
  9. Gao, L.J.; Xiang, L.; Luo, Y.; Wang, G.F.; Li, J.Y.; Qi, J.H. Gentisides C-K: Nine new neuritogenic compounds from the traditional Chinese medicine Gentiana rigescens franch. Bioorgan. Med. Chem. 2010, 18, 6995–7000.
  10. Li, J.; Gao, L.J.; Sun, K.Y.; Xiao, D.; Li, W.Y.; Xiang, L.; Qi, J.H. Benzoate fraction from Gentiana rigescens franch alleviates scopolamine-induced impaired memory in mice model in vivo. J. Ethnopharmacol. 2016, 193, 107–116.
  11. Mustafa, A.M.; Caprioli, G.; Dikmen, M.; Kaya, E.; Maggi, F.; Sagratini, G.; Vittori, S.; Öztürk, Y. Evaluation of neuritogenic activity of cultivated, wild and commercial roots of Gentiana lutea L. J. Funct. Foods 2015, 19, 164–173.
  12. Pharmacopoeia, C.C. Pharmacopoeia of the People’s Republic of China; China Medicinal Science Press: Beijing, China, 2015.
  13. Pan, Y.; Zhang, J.; Shen, T.; Zhao, Y.L.; Zuo, Z.T.; Wang, Y.Z.; Li, W.Y. Investigation of chemical diversity in different parts and origins of ethnomedicine Gentiana rigescens franch using targeted metabolite profiling and multivariate statistical analysis. Biomed. Chromatogr. 2016, 30, 232–240.
  14. Li, J.; Zhang, J.; Zhao, Y.L.; Huang, H.Y.; Wang, Y.Z. Comprehensive quality assessment based specific chemical profiles for geographic and tissue variation in Gentiana rigescens using HPLC and FTIR method combined with principal component analysis. Front. Chem. 2017, 5, 125.
  15. Wu, Z.; Zhao, Y.L.; Zhang, J.; Wang, Y.Z. Quality assessment of Gentiana rigescens from different geographical origins using FT-IR spectroscopy combined with HPLC. Molecules 2017, 22, 1238.
  16. Wang, Y.; Shen, T.; Zhang, J.; Huang, H.Y.; Wang, Y.Z. Geographical authentication of Gentiana rigescens by high-performance liquid chromatography and infrared spectroscopy. Anal. Lett. 2018, 51, 2173–2191.
  17. Šiler, B.; Avramov, S.; Banjanac, T.; Cvetković, J.; Nestorović Živković, J.; Patenković, A.; Mišić, D. Secoiridoid glycosides as a marker system in chemical variability estimation and chemotype assignment of Centaurium erythraea Rafn from the Balkan Peninsula. Ind. Crop. Prod. 2012, 40, 336–344.
  18. Yang, Y.M.; Tian, K.; Hao, J.M.; Pei, S.J.; Yang, Y.X. Biodiversity and biodiversity conservation in Yunnan, China. Biodivers. Conserv. 2004, 13, 813–826.
  19. Fan, Z.X.; Thomas, A. Spatiotemporal variability of reference evapotranspiration and its contributing climatic factors in Yunnan Province, SW China, 1961–2004. Clim. Chang. 2013, 116, 309–325.
  20. Liu, M.X.; Xu, X.L.; Sun, A.Y.; Wang, K.L.; Yue, Y.M.; Tong, X.W.; Liu, W. Evaluation of high-resolution satellite rainfall products using rain gauge data over complex terrain in southwest China. Theor. Appl. Climatol. 2015, 119, 203–219.
  21. Tang, Q.H.; Ge, Q.S. Atlas of Environmental Risks Facing China under Climate Change; Springer Verlag: Berlin, Germany, 2018.
  22. Zhao, Z.Z.; Guo, P.; Brand, E. The formation of daodi medicinal materials. J. Ethnopharmacol. 2012, 140, 476–481.
  23. Sun, M.M.; Li, L.; Wang, M.; van Wijk, E.; He, M.; van Wijk, R.; Koval, S.; Hankemeier, T.; van der Greef, J.; Wei, S.L. Effects of growth altitude on chemical constituents and delayed luminescence properties in medicinal rhubarb. J. Photoch. Photobio. B 2016, 162, 24–33.
  24. Song, X.Y.; Jin, L.; Shi, Y.P.; Li, Y.D.; Chen, J. Multivariate statistical analysis based on a chromatographic fingerprint for the evaluation of important environmental factors that affect the quality of Angelica sinensis. Anal. Methods 2014, 6, 8268–8276.
  25. Yao, R.Y.; Heinrich, M.; Zou, Y.F.; Reich, E.; Zhang, X.L.; Chen, Y.; Weckerle, C.S. Quality variation of Goji (fruits of Lycium spp.) in China: A comparative morphological and metabolomic analysis. Front. Pharmacol. 2018, 9, 151.
  26. Huang, Y.P.; Wu, Z.W.; Su, R.H.; Ruan, G.H.; Du, F.Y.; Li, G.K. Current application of chemometrics in traditional Chinese herbal medicine research. J. Chromatogr. B 2016, 1026, 27–35.
  27. Zhang, C.; Zheng, X.; Ni, H.; Li, P.; Li, H.J. Discovery of quality control markers from traditional Chinese medicines by fingerprint-efficacy modeling: Current status and future perspectives. J. Pharmaceut. Biomed. 2018, 159, 296–304.
  28. Chen, D.D.; Xie, X.F.; Ao, H.; Liu, J.L.; Peng, C. Raman spectroscopy in quality control of Chinese herbal medicine. J. Chin. Med. Assoc. 2017, 80, 288–296.
  29. Wang, P.; Yu, Z.G. Species authentication and geographical origin discrimination of herbal medicines by near infrared spectroscopy: A review. J. Pharm. Anal. 2015, 5, 277–284.
  30. Qi, L.M.; Zhang, J.; Zhao, Y.L.; Zuo, Z.T.; Wang, Y.Z.; Jin, H. Characterization of Gentiana rigescens by ultraviolet-visible and infrared spectroscopies with chemometrics. Anal. Lett. 2017, 50, 1497–1511.
  31. Zhao, Y.L.; Zhang, J.; Jin, H.; Zhang, J.Y.; Shen, T.; Wang, Y.Z. Discrimination of Gentiana rigescens from different origins by fourier transform infrared spectroscopy combined with chemometric methods. J. AOAC Int. 2015, 98, 22–26.
  32. Lee, D.Y.; Kang, K.B.; Kim, J.; Kim, H.J.; Sung, S.H. Classification of bupleuri radix according to geographical origins using near infrared spectroscopy (NIRS) combined with supervised pattern recognition. Nat. Prod. Sci. 2018, 24, 164.
  33. Pei, Y.F.; Wu, L.H.; Zhang, Q.Z.; Wang, Y.Z. Geographical traceability of cultivated Paris polyphylla var. yunnanensis using ATR-FTMIR spectroscopy with three mathematical algorithms. Anal. Methods 2019, 11, 113–122.
  34. Chen, H.; Lin, Z.; Tan, C. Fast discrimination of the geographical origins of notoginseng by near-infrared spectroscopy and chemometrics. J. Pharmaceut. Biomed. 2018, 161, 239–245.
  35. Wang, Q.Q.; Huang, H.Y.; Wang, Y.Z. Geographical authentication of Macrohyporia cocos by a data fusion method combining ultra-fast liquid chromatography and fourier transform infrared spectroscopy. Molecules 2019, 24, 1320.
  36. Ma, X.D.; Fan, Y.X.; Jin, C.C.; Wang, F.; Xin, G.Z.; Li, P.; Li, H.J. Specific targeted quantification combined with non-targeted metabolite profiling for quality evaluation of Gastrodia elata tubers from different geographical origins and cultivars. J. Chromatogr. A 2016, 1450, 53–63.
  37. Tang, J.F.; Li, W.X.; Zhang, F.; Li, Y.H.; Cao, Y.J.; Zhao, Y.; Li, X.L.; Ma, Z.J. Discrimination of radix polygoni multiflori from different geographical areas by UPLC-QTOF/MS combined with chemometrics. Chin. Med. 2017, 12, 1–12.
  38. Zhu, L.X.; Xu, J.; Wang, R.J.; Li, H.X.; Tan, Y.Z.; Chen, H.B.; Dong, X.P.; Zhao, Z.Z. Correlation between quality and geographical origins of Poria cocos revealed by qualitative fingerprint profiling and quantitative determination of triterpenoid acids. Molecules 2018, 23, 2200.
  39. Sun, L.L.; Wang, M.; Zhang, H.J.; Liu, Y.N.; Ren, X.L.; Deng, Y.R.; Qi, A.D. Comprehensive analysis of polygoni multiflori radix of different geographical origins using ultra-high-performance liquid chromatography fingerprints and multivariate chemometric methods. J. Food Drug Anal. 2018, 26, 90–99.
  40. Cuadros-Rodríguez, L.; Ruiz-Samblás, C.; Valverde-Som, L.; Pérez-Castaño, E.; González-Casado, A. Chromatographic fingerprinting: An innovative approach for food ‘identitation’ and food authentication – A tutorial. Anal. Chim. Acta 2016, 909, 9–23.
  41. Lucio-Gutiérrez, J.R.; Coello, J.; Maspoch, S. Enhanced chromatographic fingerprinting of herb materials by multi-wavelength selection and chemometrics. Anal. Chim. Acta 2012, 710, 40–49.
  42. Zhang, L.L.; Liu, Y.Y.; Liu, Z.L.; Wang, C.; Song, Z.Q.; Liu, Y.X.; Dong, Y.Z.; Ning, Z.C.; Lu, A.P. Comparison of the roots of Salvia miltiorrhiza bunge (danshen) and its variety S. miltiorrhiza Bge f. Alba (baihua danshen) based on multi-wavelength HPLC-fingerprinting and contents of nine active components. Anal. Methods 2016, 8, 3171–3182.
  43. Wang, X.; Li, B.Q.; Xu, M.L.; Liu, J.J.; Zhai, H.L. Quality assessment of traditional Chinese medicine using HPLC-PAD combined with tchebichef image moments. J. Chromatogr. B 2017, 1040, 8–13.
  44. Jiménez-Carvelo, A.M.; Cruz, C.M.; Olivieri, A.C.; González-Casado, A.; Cuadros-Rodríguez, L. Classification of olive oils according to their cultivars based on second-order data using LC-DAD. Talanta 2019, 195, 69–76.
  45. Borràs, E.; Ferré, J.; Boqué, R.; Mestres, M.; Aceña, L.; Busto, O. Data fusion methodologies for food and beverage authentication and quality assessment – A review. Anal. Chim. Acta 2015, 891, 1–14.
  46. Li, Y.; Zhang, J.Y.; Wang, Y.Z. FT-MIR and NIR spectral data fusion: A synergetic strategy for the geographical traceability of Panax notoginseng. Anal. Bioanal. Chem. 2018, 410, 91–103.
  47. Wu, X.M.; Zuo, Z.T.; Zhang, Q.Z.; Wang, Y.Z. FT-MIR and UV–vis data fusion strategy for origins discrimination of wild Paris polyphylla Smith var. yunnanensis. Vib. Spectrosc. 2018, 96, 125–136.
  48. Wang, H.Y.; Song, C.; Sha, M.; Liu, J.; Li, L.P.; Zhang, Z.Y. Discrimination of medicine radix astragali from different geographic origins using multiple spectroscopies combined with data fusion methods. J. Appl. Spectrosc. 2018, 85, 313–319.
  49. Pei, Y.F.; Zhang, Q.Z.; Zuo, Z.T.; Wang, Y.Z. Comparison and Identification for rhizomes and leaves of Paris yunnanensis based on Fourier transform mid-Infrared spectroscopy combined with chemometrics. Molecules 2018, 23, 3343.
  50. Yang, H.; Liu, J.; Chen, S.; Hu, F.; Zhou, D. Spatial variation profiling of four phytochemical constituents in Gentiana straminea (Gentianaceae). J. Nat. Med. 2014, 68, 38–45.
  51. Lei, M.; Yu, X.H.; Li, M.; Zhu, W.X. Geographic origin identification of coal using near-infrared spectroscopy combined with improved random forest method. Infrared Phys. Technol. 2018, 92, 177–182.
  52. Sayago, A.; González-Domínguez, R.; Beltrán, R.; Fernández-Recamales, Á. Combination of complementary data mining methods for geographical characterization of extra virgin olive oils based on mineral composition. Food Chem. 2018, 261, 42–50.
  53. Jandrić, Z.; Haughey, S.A.; Frew, R.D.; Mccomb, K.; Galvin-King, P.; Elliott, C.T.; Cannavan, A. Discrimination of honey of different floral origins by a combination of various chemical parameters. Food Chem. 2015, 189, 52–59.
  54. Jolayemi, O.S.; Ajatta, M.A.; Adegeye, A.A. Geographical discrimination of palm oils (Elaeis guineensis) using quality characteristics and UV-visible spectroscopy. Food Sci. Nutr. 2018, 6, 773–782.
  55. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  56. Wold, S.; Antti, H.; Lindgren, F.; Öhman, J. Orthogonal signal correction of near-infrared spectra. Chemometr. Intell. Lab. 1998, 44, 175–185.
  57. Boccard, J.L.; Rutledge, D.N. A consensus orthogonal partial least squares discriminant analysis (OPLS-DA) strategy for multiblock omics data fusion. Anal. Chim. Acta 2013, 769, 30–39.
  58. Lyv, W.Q.; Zhang, J.; Zuo, Z.T.; Wang, Y.Z.; Zhang, Q.Z. Quality evaluation of Gentiana rigescence by grey relational analysis method. Chin. J. Exp. Tradit. Med. Formul. 2017, 23, 66–73.
  59. Skov, T.; van den Berg, F.; Tomasi, G.; Bro, R. Automated alignment of chromatographic data. J. Chemometr. 2006, 20, 484–497.
  60. Kürzl, H. Exploratory data analysis: Recent advances for the interpretation of geochemical data. J. Geochem. Explor. 1988, 30, 309–322.
  61. Esteki, M.; Shahsavari, Z.; Simal-Gandara, J. Food identification by high performance liquid chromatography fingerprinting and mathematical processing. Food Res. Int. 2019, 122, 303–317.
  62. Berrueta, L.A.; Alonso-Salces, R.M.; Héberger, K. Supervised pattern recognition in food analysis. J. Chromatogr. A 2007, 1158, 196–214.
  63. Liaw, A.; Wiener, M. Classification and regression by randomForest. R News 2002, 2, 18–22.
  64. Kursa, M.B.; Rudnicki, W.R. Feature selection with the boruta package. J. Stat. Softw. 2010, 36, 1–13.
  65. Mehmood, T.; Liland, K.H.; Snipen, L.; Sæbø, S. A review of variable selection methods in partial least squares regression. Chemometr. Intell. Lab. 2012, 118, 62–69.
  66. Cao, D.S.; Hu, Q.N.; Xu, Q.S.; Yang, Y.N.; Zhao, J.C.; Lu, H.M.; Zhang, L.X.; Liang, Y.Z. In silico classification of human maximum recommended daily dose based on modified random forest and substructure fingerprint. Anal. Chim. Acta 2011, 692, 50–56.
Sample Availability: Samples of the compound are not available from the authors.
Figure 1. High-performance liquid chromatography (HPLC) fingerprint of rhizome (a), stem (b), leaf (c) and fingerprints after variable transformation (d–f). (1) loganin, (2) 6′-O-β-d-glucopyranosylgentiopicroside, (3) swertiamarine, (4) gentiopicroside, and (5) sweroside.
Figure 2. Two-dimensional principal component score plot of rhizomes, stems, and leaves samples based on chromatographic fingerprint data.
Figure 3. Variation of rhizomes score plots along the latitude gradients. (a) is low latitude and mid-latitude, (b) is low latitude and mid-high latitude and (c) is low latitude and high latitude (green circles = low latitudes area, 23.92–23.66° N, blue circles = mid-latitude area, 24.95–25.06° N, red circles = mid-high latitude area, 26.49–26.64° N, yellow circles = high latitude area, 27.34–28.52° N).
Figure 4. Variation of rhizomes score plots between the adjacent latitudes. (a) is mid-latitude and mid-high latitude and (b) is mid-high latitude and high latitude (blue circles = mid-latitude area, 24.95–25.06° N, red circles = mid-high latitude area, 26.49–26.64° N, yellow circles = high latitude area, 27.34–28.52° N).
Figure 5. Two-dimensional principal component score plots for samples of rhizomes (a), stems (b), and leaves (c) of G. rigescens grown at four latitudes.
Figure 6. Three-dimensional (3D) Scores-plot diagram of rhizomes (a), stems (b), and leaves (c) orthogonal partial least-squares discriminant analysis (OPLS-DA) analysis among four different latitudes (OPLS-DA model (a) R2 = 0.74 and Q2 = 0.68, model (b) R2 = 0.75 and Q2 = 0.68, model (c) R2 = 0.72 and Q2 = 0.71, permutation plot of three models were shown in Figures S5–S7).
Figure 7. Important variables of fingerprint (purple = variable VIP value > 1) (a) rhizome, (b) stem, and (c) leaf.
Figure 8. The ntree (a) and mtry (b) screening of RF models based on rhizomes fingerprints.
Figure 9. The ntree (a) and mtry (b) screening of RF models based on stems fingerprints.
Figure 10. The ntree (a) and mtry (b) screening of RF models based on leaves fingerprints.
Figure 11. The workflow of geographical authentication of G. rigescens grown at different latitudes using data fusion strategy.
Figure 12. Geographical distribution of sample information.
Table 1. The major parameters of random forest (RF) model based on rhizomes data set.
| Model | Performance | Calibration I | Calibration II | Calibration III | Calibration IV | Validation I | Validation II | Validation III | Validation IV |
|---|---|---|---|---|---|---|---|---|---|
| R_RF | ACC (%) | 96.77 | 99.46 | 94.62 | 94.09 | 91.49 | 95.74 | 94.68 | 98.94 |
| | SE | 0.92 | 0.97 | 0.93 | 0.89 | 0.92 | 0.75 | 0.93 | 0.96 |
| | SP | 0.99 | 1.00 | 0.95 | 0.96 | 0.91 | 1.00 | 0.95 | 1.00 |
| | MCC | 0.92 | 0.98 | 0.88 | 0.84 | 0.80 | 0.84 | 0.88 | 0.97 |
| | EFF | 0.95 | 0.98 | 0.94 | 0.92 | 0.92 | 0.87 | 0.94 | 0.98 |
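Table 2. The major parameters of RF model based on stems data set.
| Model | Performance | Calibration I | Calibration II | Calibration III | Calibration IV | Validation I | Validation II | Validation III | Validation IV |
|---|---|---|---|---|---|---|---|---|---|
| S_RF | ACC (%) | 92.47 | 94.62 | 93.01 | 93.01 | 98.94 | 97.87 | 96.81 | 97.87 |
| | SE | 0.92 | 0.69 | 0.91 | 0.87 | 1.00 | 0.88 | 1.00 | 0.91 |
| | SP | 0.93 | 1.00 | 0.94 | 0.95 | 0.99 | 1.00 | 0.95 | 1.00 |
| | MCC | 0.82 | 0.80 | 0.84 | 0.81 | 0.97 | 0.92 | 0.93 | 0.94 |
| | EFF | 0.92 | 0.83 | 0.93 | 0.91 | 0.99 | 0.94 | 0.98 | 0.96 |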
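Table 3. The major parameters of RF model based on leaves data set.
| Model | Performance | Calibration I | Calibration II | Calibration III | Calibration IV | Validation I | Validation II | Validation III | Validation IV |
|---|---|---|---|---|---|---|---|---|---|
| L_RF | ACC (%) | 92.47 | 96.24 | 93.01 | 94.62 | 85.11 | 93.62 | 89.36 | 93.62 |
| | SE | 0.94 | 0.78 | 0.91 | 0.85 | 0.88 | 0.69 | 0.86 | 0.74 |
| | SP | 0.92 | 1.00 | 0.94 | 0.98 | 0.84 | 0.99 | 0.91 | 1.00 |
| | MCC | 0.82 | 0.86 | 0.84 | 0.85 | 0.67 | 0.76 | 0.76 | 0.83 |
| | EFF | 0.93 | 0.88 | 0.93 | 0.91 | 0.86 | 0.82 | 0.88 | 0.86 |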
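Table 4. The major parameters of OPLS-DA models.

| Model | Performance | Calibration Set I | Calibration Set II | Calibration Set III | Calibration Set IV | Validation Set I | Validation Set II | Validation Set III | Validation Set IV |
|---|---|---|---|---|---|---|---|---|---|
| R_OPLS-DA | ACC (%) | 98.92 | 98.92 | 98.92 | 98.92 | 95.74 | 98.94 | 94.68 | 97.87 |
| R_OPLS-DA | SE | 0.98 | 0.97 | 0.98 | 0.98 | 0.92 | 0.94 | 0.93 | 0.96 |
| R_OPLS-DA | SP | 0.99 | 0.99 | 0.99 | 0.99 | 0.97 | 1.00 | 0.95 | 0.99 |
| R_OPLS-DA | MCC | 0.97 | 0.96 | 0.97 | 0.97 | 0.89 | 0.96 | 0.88 | 0.94 |
| R_OPLS-DA | EFF | 0.99 | 0.98 | 0.99 | 0.99 | 0.95 | 0.97 | 0.94 | 0.97 |
| S_OPLS-DA | ACC (%) | 98.92 | 99.46 | 98.92 | 98.39 | 91.49 | 93.62 | 91.49 | 97.87 |
| S_OPLS-DA | SE | 1.00 | 0.97 | 1.00 | 0.93 | 0.92 | 0.81 | 0.83 | 0.91 |
| S_OPLS-DA | SP | 0.99 | 1.00 | 0.98 | 1.00 | 0.91 | 0.96 | 0.95 | 1.00 |
| S_OPLS-DA | MCC | 0.97 | 0.98 | 0.98 | 0.96 | 0.80 | 0.77 | 0.80 | 0.94 |
| S_OPLS-DA | EFF | 0.99 | 0.98 | 0.99 | 0.97 | 0.92 | 0.88 | 0.89 | 0.96 |
| L_OPLS-DA | ACC (%) | 97.31 | 99.46 | 97.31 | 98.39 | 88.30 | 95.74 | 92.55 | 93.62 |
| L_OPLS-DA | SE | 0.94 | 1.00 | 0.93 | 1.00 | 0.81 | 0.88 | 0.90 | 0.83 |
| L_OPLS-DA | SP | 0.99 | 0.99 | 0.99 | 0.98 | 0.91 | 0.97 | 0.94 | 0.97 |
| L_OPLS-DA | MCC | 0.93 | 0.98 | 0.94 | 0.96 | 0.71 | 0.85 | 0.83 | 0.82 |
| L_OPLS-DA | EFF | 0.96 | 1.00 | 0.96 | 0.99 | 0.86 | 0.92 | 0.92 | 0.90 |
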
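Table 5. The major parameters of RF models based on low-level data fusion strategy.

| Model | Performance | Calibration Set I | Calibration Set II | Calibration Set III | Calibration Set IV | Validation Set I | Validation Set II | Validation Set III | Validation Set IV |
|---|---|---|---|---|---|---|---|---|---|
| RS_RF | ACC (%) | 96.77 | 98.92 | 92.47 | 93.55 | 95.74 | 97.87 | 95.74 | 97.87 |
| RS_RF | SE | 0.92 | 0.94 | 0.91 | 0.87 | 0.92 | 0.88 | 0.97 | 0.96 |
| RS_RF | SP | 0.99 | 1.00 | 0.93 | 0.96 | 0.97 | 1.00 | 0.95 | 0.99 |
| RS_RF | MCC | 0.92 | 0.96 | 0.83 | 0.83 | 0.89 | 0.92 | 0.90 | 0.94 |
| RS_RF | EFF | 0.95 | 0.97 | 0.92 | 0.91 | 0.95 | 0.94 | 0.96 | 0.97 |
| RL_RF | ACC (%) | 94.09 | 98.39 | 93.01 | 96.24 | 87.23 | 94.68 | 91.49 | 92.55 |
| RL_RF | SE | 0.90 | 0.91 | 0.91 | 0.91 | 0.96 | 0.75 | 0.86 | 0.70 |
| RL_RF | SP | 0.96 | 1.00 | 0.94 | 0.98 | 0.84 | 0.99 | 0.94 | 1.00 |
| RL_RF | MCC | 0.85 | 0.94 | 0.84 | 0.90 | 0.74 | 0.80 | 0.80 | 0.80 |
| RL_RF | EFF | 0.93 | 0.95 | 0.93 | 0.95 | 0.90 | 0.86 | 0.90 | 0.83 |
| SL_RF | ACC (%) | 93.55 | 95.70 | 92.47 | 93.55 | 90.43 | 96.81 | 96.81 | 94.68 |
| SL_RF | SE | 0.94 | 0.75 | 0.91 | 0.85 | 0.92 | 0.88 | 0.93 | 0.83 |
| SL_RF | SP | 0.93 | 1.00 | 0.93 | 0.96 | 0.90 | 0.99 | 0.98 | 0.99 |
| SL_RF | MCC | 0.84 | 0.84 | 0.83 | 0.82 | 0.78 | 0.88 | 0.92 | 0.85 |
| SL_RF | EFF | 0.94 | 0.87 | 0.92 | 0.90 | 0.91 | 0.93 | 0.96 | 0.90 |
| RSL_RF | ACC (%) | 95.70 | 99.46 | 91.94 | 92.47 | 94.68 | 96.81 | 100.00 | 97.87 |
| RSL_RF | SE | 0.94 | 0.97 | 0.86 | 0.85 | 0.86 | 1.00 | 1.00 | 0.96 |
| RSL_RF | SP | 0.96 | 1.00 | 0.95 | 0.95 | 0.98 | 0.96 | 1.00 | 0.99 |
| RSL_RF | MCC | 0.89 | 0.98 | 0.81 | 0.80 | 0.87 | 0.88 | 1.00 | 0.94 |
| RSL_RF | EFF | 0.95 | 0.98 | 0.90 | 0.90 | 0.92 | 0.98 | 1.00 | 0.97 |
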
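The model prefixes RS, RL, SL, and RSL appear to denote which fingerprint blocks were combined (R for rhizomes, S for stems, L for leaves, inferred from the single-block model names R_RF, S_RF, and L_RF). Low-level fusion is commonly implemented as a sample-wise concatenation of the raw data blocks. The sketch below illustrates that idea only; the helper name `low_level_fuse` and the toy intensities are hypothetical, and any preprocessing or block scaling applied in the study is not shown.

```python
from itertools import combinations


def low_level_fuse(blocks, names):
    """Concatenate the named blocks row by row (one row per sample)."""
    n_samples = len(blocks[names[0]])
    if any(len(blocks[name]) != n_samples for name in names):
        raise ValueError("all blocks must describe the same samples")
    fused_rows = []
    for i in range(n_samples):
        row = []
        for name in names:
            row.extend(blocks[name][i])  # append this block's variables
        fused_rows.append(row)
    return fused_rows


# Toy fingerprints for two samples (hypothetical values, not from the article).
blocks = {
    "R": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],  # rhizomes
    "S": [[1.0, 1.1], [1.2, 1.3]],            # stems
    "L": [[2.0], [2.1]],                      # leaves
}

# Build every two-block and three-block combination: RS, RL, SL, RSL.
fused = {}
for k in (2, 3):
    for names in combinations("RSL", k):
        fused["".join(names)] = low_level_fuse(blocks, names)

print(list(fused))      # ['RS', 'RL', 'SL', 'RSL']
print(fused["RSL"][0])  # [0.1, 0.2, 0.3, 1.0, 1.1, 2.0]
```

Each fused matrix keeps one row per sample, so the same class labels and the same calibration/validation split can be reused across all four combinations.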
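Table 6. The major parameters of OPLS-DA models based on low-level data fusion strategy.

| Model | Performance | Calibration Set I | Calibration Set II | Calibration Set III | Calibration Set IV | Validation Set I | Validation Set II | Validation Set III | Validation Set IV |
|---|---|---|---|---|---|---|---|---|---|
| RS_OPLS-DA | ACC (%) | 99.46 | 100.00 | 99.46 | 98.92 | 97.87 | 98.94 | 97.87 | 98.94 |
| RS_OPLS-DA | SE | 1.00 | 1.00 | 1.00 | 0.96 | 0.96 | 1.00 | 0.97 | 0.96 |
| RS_OPLS-DA | SP | 0.99 | 1.00 | 0.99 | 1.00 | 0.99 | 0.99 | 0.98 | 1.00 |
| RS_OPLS-DA | MCC | 0.99 | 1.00 | 0.99 | 0.97 | 0.95 | 0.96 | 0.95 | 0.97 |
| RS_OPLS-DA | EFF | 1.00 | 1.00 | 1.00 | 0.98 | 0.97 | 0.99 | 0.98 | 0.98 |
| RL_OPLS-DA | ACC (%) | 99.46 | 100.00 | 100.00 | 99.46 | 95.74 | 97.87 | 97.87 | 97.87 |
| RL_OPLS-DA | SE | 1.00 | 1.00 | 1.00 | 0.98 | 0.88 | 1.00 | 0.97 | 0.96 |
| RL_OPLS-DA | SP | 0.99 | 1.00 | 1.00 | 1.00 | 0.99 | 0.97 | 0.98 | 0.99 |
| RL_OPLS-DA | MCC | 0.99 | 1.00 | 1.00 | 0.99 | 0.89 | 0.93 | 0.95 | 0.94 |
| RL_OPLS-DA | EFF | 1.00 | 1.00 | 1.00 | 0.99 | 0.93 | 0.99 | 0.98 | 0.97 |
| SL_OPLS-DA | ACC (%) | 100.00 | 100.00 | 100.00 | 100.00 | 94.68 | 98.94 | 97.87 | 97.87 |
| SL_OPLS-DA | SE | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.94 | 0.93 | 0.91 |
| SL_OPLS-DA | SP | 1.00 | 1.00 | 1.00 | 1.00 | 0.93 | 1.00 | 1.00 | 1.00 |
| SL_OPLS-DA | MCC | 1.00 | 1.00 | 1.00 | 1.00 | 0.88 | 0.96 | 0.95 | 0.94 |
| SL_OPLS-DA | EFF | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 0.97 | 0.96 | 0.96 |
| RSL_OPLS-DA | ACC (%) | 99.46 | 100.00 | 100.00 | 99.46 | 96.81 | 98.94 | 97.87 | 97.87 |
| RSL_OPLS-DA | SE | 1.00 | 1.00 | 1.00 | 0.98 | 0.92 | 1.00 | 0.97 | 0.96 |
| RSL_OPLS-DA | SP | 0.99 | 1.00 | 1.00 | 1.00 | 0.99 | 0.99 | 0.98 | 0.99 |
| RSL_OPLS-DA | MCC | 0.99 | 1.00 | 1.00 | 0.99 | 0.92 | 0.96 | 0.95 | 0.94 |
| RSL_OPLS-DA | EFF | 1.00 | 1.00 | 1.00 | 0.99 | 0.95 | 0.99 | 0.98 | 0.97 |
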
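Table 7. The major parameters of RF models based on mid-level data fusion strategy.

| Model | Performance | Calibration Set I | Calibration Set II | Calibration Set III | Calibration Set IV | Validation Set I | Validation Set II | Validation Set III | Validation Set IV |
|---|---|---|---|---|---|---|---|---|---|
| RS_RF | ACC (%) | 99.46 | 99.46 | 94.09 | 95.16 | 98.94 | 100.00 | 96.81 | 97.87 |
| RS_RF | SE | 0.98 | 0.97 | 0.95 | 0.87 | 1.00 | 1.00 | 0.97 | 0.91 |
| RS_RF | SP | 1.00 | 1.00 | 0.94 | 0.98 | 0.99 | 1.00 | 0.97 | 1.00 |
| RS_RF | MCC | 0.99 | 0.98 | 0.87 | 0.87 | 0.97 | 1.00 | 0.93 | 0.94 |
| RS_RF | EFF | 0.99 | 0.98 | 0.94 | 0.92 | 0.99 | 1.00 | 0.97 | 0.96 |
| RL_RF | ACC (%) | 95.70 | 96.77 | 96.24 | 97.31 | 91.49 | 98.94 | 91.49 | 94.68 |
| RL_RF | SE | 0.92 | 0.88 | 0.97 | 0.93 | 0.92 | 1.00 | 0.86 | 0.78 |
| RL_RF | SP | 0.97 | 0.99 | 0.96 | 0.99 | 0.91 | 0.99 | 0.94 | 1.00 |
| RL_RF | MCC | 0.89 | 0.88 | 0.91 | 0.93 | 0.80 | 0.96 | 0.80 | 0.86 |
| RL_RF | EFF | 0.94 | 0.93 | 0.96 | 0.96 | 0.92 | 0.99 | 0.90 | 0.88 |
| SL_RF | ACC (%) | 95.16 | 96.77 | 93.55 | 96.24 | 97.87 | 100.00 | 98.94 | 96.81 |
| SL_RF | SE | 0.94 | 0.81 | 0.95 | 0.89 | 0.96 | 1.00 | 1.00 | 0.91 |
| SL_RF | SP | 0.96 | 1.00 | 0.93 | 0.99 | 0.99 | 1.00 | 0.98 | 0.99 |
| SL_RF | MCC | 0.88 | 0.88 | 0.86 | 0.90 | 0.95 | 1.00 | 0.98 | 0.91 |
| SL_RF | EFF | 0.95 | 0.90 | 0.94 | 0.94 | 0.97 | 1.00 | 0.99 | 0.95 |
| RSL_RF | ACC (%) | 97.85 | 99.46 | 94.09 | 95.70 | 96.81 | 100.00 | 95.74 | 98.94 |
| RSL_RF | SE | 0.94 | 0.97 | 0.93 | 0.91 | 0.96 | 1.00 | 0.93 | 0.96 |
| RSL_RF | SP | 0.99 | 1.00 | 0.95 | 0.97 | 0.97 | 1.00 | 0.97 | 1.00 |
| RSL_RF | MCC | 0.94 | 0.98 | 0.86 | 0.88 | 0.92 | 1.00 | 0.90 | 0.97 |
| RSL_RF | EFF | 0.97 | 0.98 | 0.94 | 0.94 | 0.97 | 1.00 | 0.95 | 0.98 |
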
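Table 8. The major parameters of OPLS-DA models based on mid-level data fusion strategy.

| Model | Performance | Calibration Set I | Calibration Set II | Calibration Set III | Calibration Set IV | Validation Set I | Validation Set II | Validation Set III | Validation Set IV |
|---|---|---|---|---|---|---|---|---|---|
| RS_OPLS-DA | ACC (%) | 100.00 | 100.00 | 99.46 | 99.46 | 93.62 | 97.87 | 94.68 | 98.94 |
| RS_OPLS-DA | SE | 1.00 | 1.00 | 1.00 | 0.98 | 0.88 | 1.00 | 0.90 | 0.96 |
| RS_OPLS-DA | SP | 1.00 | 1.00 | 0.99 | 1.00 | 0.96 | 0.97 | 0.97 | 1.00 |
| RS_OPLS-DA | MCC | 1.00 | 1.00 | 0.99 | 0.99 | 0.84 | 0.93 | 0.87 | 0.97 |
| RS_OPLS-DA | EFF | 1.00 | 1.00 | 1.00 | 0.99 | 0.92 | 0.99 | 0.93 | 0.98 |
| RL_OPLS-DA | ACC (%) | 100.00 | 100.00 | 99.46 | 99.46 | 96.81 | 97.87 | 97.87 | 98.94 |
| RL_OPLS-DA | SE | 1.00 | 1.00 | 1.00 | 0.98 | 0.92 | 1.00 | 0.97 | 0.96 |
| RL_OPLS-DA | SP | 1.00 | 1.00 | 0.99 | 1.00 | 0.99 | 0.97 | 0.98 | 1.00 |
| RL_OPLS-DA | MCC | 1.00 | 1.00 | 0.99 | 0.99 | 0.92 | 0.93 | 0.95 | 0.97 |
| RL_OPLS-DA | EFF | 1.00 | 1.00 | 1.00 | 0.99 | 0.95 | 0.99 | 0.98 | 0.98 |
| SL_OPLS-DA | ACC (%) | 100.00 | 100.00 | 98.92 | 98.92 | 93.62 | 97.87 | 94.68 | 96.81 |
| SL_OPLS-DA | SE | 1.00 | 1.00 | 0.98 | 0.98 | 0.92 | 0.94 | 0.90 | 0.91 |
| SL_OPLS-DA | SP | 1.00 | 1.00 | 0.99 | 0.99 | 0.94 | 0.99 | 0.97 | 0.99 |
| SL_OPLS-DA | MCC | 1.00 | 1.00 | 0.97 | 0.97 | 0.85 | 0.92 | 0.87 | 0.91 |
| SL_OPLS-DA | EFF | 1.00 | 1.00 | 0.99 | 0.99 | 0.93 | 0.96 | 0.93 | 0.95 |
| RSL_OPLS-DA | ACC (%) | 100.00 | 100.00 | 99.46 | 99.46 | 95.74 | 98.94 | 96.81 | 97.87 |
| RSL_OPLS-DA | SE | 1.00 | 1.00 | 1.00 | 0.98 | 0.92 | 1.00 | 0.93 | 0.96 |
| RSL_OPLS-DA | SP | 1.00 | 1.00 | 0.99 | 1.00 | 0.97 | 0.99 | 0.98 | 0.99 |
| RSL_OPLS-DA | MCC | 1.00 | 1.00 | 0.99 | 0.99 | 0.89 | 0.96 | 0.92 | 0.94 |
| RSL_OPLS-DA | EFF | 1.00 | 1.00 | 1.00 | 0.99 | 0.95 | 0.99 | 0.96 | 0.97 |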

Share and Cite

MDPI and ACS Style

Shen, T.; Yu, H.; Wang, Y.-Z. Assessing Geographical Origin of Gentiana Rigescens Using Untargeted Chromatographic Fingerprint, Data Fusion and Chemometrics. Molecules 2019, 24, 2562. https://doi.org/10.3390/molecules24142562