Assessing Geographical Origin of Gentiana Rigescens Using Untargeted Chromatographic Fingerprint, Data Fusion and Chemometrics

Gentiana rigescens Franchet, which is famous for its bitter properties, is a traditional drug for chronic hepatitis and an important raw material for the pharmaceutical industry in China. In this study, high-performance liquid chromatography (HPLC) coupled with a diode array detector (DAD) and chemometrics was used to investigate the geographical variation in the chemistry of G. rigescens and to classify medicinal materials according to the latitude at which they were grown. The chromatographic fingerprints of 840 samples (rhizomes, stems, and leaves of 280 individuals) from four latitude areas were recorded and analyzed to trace the geographical origin of the medicinal materials. First, HPLC fingerprints of underground and aerial parts were generated using reversed-phase liquid chromatography. After preliminary data exploration, two supervised pattern recognition techniques, random forest (RF) and orthogonal partial least-squares discriminant analysis (OPLS-DA), were applied to the three HPLC fingerprint data sets of rhizomes, stems, and leaves. Furthermore, fingerprint data sets of aerial and underground parts were separately processed and joined using two data fusion strategies (“low-level” and “mid-level”). The results showed that classification models based on OPLS-DA were more efficient than RF models. The classification models built with the low-level data fusion method showed good recognition and prediction abilities (accuracy higher than 99%, and sensitivity, specificity, Matthews correlation coefficient, and efficiency ranging from 0.95 to 1.00). The low-level data fusion strategy combined with OPLS-DA provided the best discrimination. In summary, this study explored the latitudinal variation of the phytochemistry of G. rigescens and developed a reliable and accurate method, based on untargeted HPLC fingerprints, data fusion, and chemometrics, for identifying G. rigescens grown at different latitudes.
The study results are meaningful for authentication and the quality control of Chinese medicinal materials.


Introduction
Gentiana rigescens Franchet (Dian long dan) is a herbaceous species that grows in mountainous regions of the Yunnan-Guizhou Plateau in the southwest of China [1]. Like the European traditional medicinal plant yellow gentian (G. lutea L.), G. rigescens is famous for its bitter properties, which are due to bitter active principles (e.g., loganin, gentiopicroside, swertiamarin, sweroside, etc.) [2][3][4]. Those compounds Panax notoginseng, Paris polyphylla var. yunnanensis and other herb materials also showed the huge potential of this strategy in the discrimination of the producing areas of medicinal materials [46][47][48]. Today, most fused data come from spectral fingerprints, and very few studies report the data fusion of chromatographic fingerprints [42,43]. Furthermore, data fusion studies are mostly based on the fusion of multiple instrumental techniques [42,43], while reports on P. polyphylla var. yunnanensis, Macrohyporia cocos, and other species indicated that reliable classification results were also available from the fusion analysis of chemical fingerprint data collected from different medicinal parts of herbs [35,49]. The accumulation and distribution of metabolites differ among the parts of a plant because root, stem, flower, and other organs respond differently to the environmental variation of the producing area [17,50]. Therefore, fusing the fingerprint data of multiple medicinal parts may provide integrated chemical information for the authentication of medicinal materials. At the same time, this method also contributes to a more comprehensive understanding of the response and adaptation of medicinal plants to complex geographical environments.
The aim of this study was to explore the variation of the chromatographic fingerprints of G. rigescens along latitude gradients, to use chemometrics to mine the chemical information in the fingerprints, and to investigate the potential of the untargeted chromatographic fingerprint for tracing herbs grown at different latitudes. For this purpose, we developed fingerprints of rhizomes, stems, and leaves of G. rigescens by high-performance liquid chromatography with diode array detection (HPLC-DAD). Subsequently, classification models for the identification of different producing areas were built by combining the HPLC fingerprints with RF (random forest algorithm) and OPLS-DA (orthogonal partial least-squares discriminant analysis). Finally, two data fusion strategies, "low-level" and "mid-level" data fusion, were studied in order to improve the model performances.

Chromatographic Fingerprints Variation Along the Latitude Gradients
Figure 1 displays the representative chromatographic fingerprints of rhizome, stem, and leaf. The HPLC fingerprints show that the five iridoid marker compounds were eluted before 15 min. The retention times (t/min) of loganin (1), 6′-O-β-D-glucopyranosylgentiopicroside (2), swertiamarine (3), gentiopicroside (4), and sweroside (5) were 7.279, 9.213, 9.573, 11.376, and 11.622 min, respectively. Loganin and gentiopicroside accumulated mainly in the underground part, whereas sweroside accumulated more in the aerial parts. Furthermore, differences in the chemical composition of rhizome, stem, and leaf can also be observed visually in the chromatographic fingerprints. To facilitate subsequent data exploration and modeling, the retention time axis of the fingerprint signal was replaced by variables (Figure 1d-f). As a result, the rhizome, stem, and leaf fingerprints contained 3839, 4140, and 4140 variables, respectively.
Principal component analysis (PCA) and two-dimensional score plots visualized the differences and variation trends of the three medicinal parts. Figure 2 shows that the rhizomes and stems of G. rigescens tended to cluster on the left, while the leaf data scattered to the right.
Although the fingerprints of the aboveground and underground medicinal parts were obviously different, an interesting result is that a trend of separation according to the latitude of the producing region was observed in the PCA score plots of all three medicinal parts. For example, two-dimensional score plots of the chromatographic fingerprints of rhizomes showed that the separation of samples increased with geographical distance, with a clear separation between samples collected from lower-latitude and higher-latitude regions (Figure 3). Conversely, for samples from producing regions that were geographically close to each other, the separation of rhizome samples decreased as the geographical distance decreased (Figure 4). The PCA score plots of stems and leaves followed the same trend as those of rhizomes (Figures S1-S4).
The results of PCA highlighted that the chromatographic fingerprints of G. rigescens differed among rhizomes, stems, and leaves, and were affected by the latitude gradients of the production regions. Samples from lower and higher latitudes in particular appeared to be clearly distinguishable. Building on the exploratory (unsupervised) PCA, supervised pattern recognition (OPLS-DA) was applied to obtain a better classification of samples grown at different latitudes (Figures 5 and 6), and OPLS-DA with variable importance in the projection (VIP) analysis was used to further investigate which fingerprint variables of G. rigescens were sensitive to latitude changes.
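The exploratory PCA step described above can be illustrated with a minimal sketch. It assumes the fingerprints are stored as a list of rows (one sample per row, one chromatographic variable per column); the function names and the power-iteration approach are illustrative choices, not the software used in the study.

```python
import math

def center_columns(X):
    """Subtract each column mean so PCA describes variation around the centroid."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def first_loading(Xc, iterations=200):
    """Power iteration on X^T X: unit loading vector of the dominant component."""
    p = len(Xc[0])
    # Deterministic, non-uniform start vector (an arbitrary choice for this sketch)
    w = [float(j + 1) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    for _ in range(iterations):
        t = [sum(x * wj for x, wj in zip(row, w)) for row in Xc]          # t = X w
        w_new = [sum(t[i] * Xc[i][j] for i in range(len(Xc))) for j in range(p)]  # X^T t
        norm = math.sqrt(sum(v * v for v in w_new))
        if norm == 0.0:
            break
        w = [v / norm for v in w_new]
    return w

def pca_scores(X, n_components=2):
    """Score coordinates of each sample, extracting components one by one with deflation."""
    Xc = center_columns(X)
    columns = []
    for _ in range(n_components):
        w = first_loading(Xc)
        t = [sum(x * wj for x, wj in zip(row, w)) for row in Xc]
        columns.append(t)
        # Remove the variation already explained before extracting the next component
        Xc = [[x - ti * wj for x, wj in zip(row, w)] for row, ti in zip(Xc, t)]
    return [list(sample) for sample in zip(*columns)]
```

Plotting the first two score columns, colored by latitude class, would give a two-dimensional score plot of the kind discussed in the text.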
A VIP value greater than 1.00 indicates that a variable was clearly affected by the latitude of the producing area. Figure 7a shows that three ranges of the rhizome fingerprint were closely related to the latitude of the producing areas: the variables at retention times of 2.00-13.00 min, those at 15.00-20.00 min, and those after 25.00 min. Figure 7b shows that the important variables (VIP > 1.00) of the stem fingerprint lie at retention times of 2.00-20.00 min and 25.00-30.00 min. For the leaf fingerprint, the variables at retention times of 2.00-15.00 min, 17.00-19.00 min, and 25.00-30.00 min were the most sensitive to latitude changes of the producing areas (Figure 7c).
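As a hedged illustration of how retention-time windows such as those above could be read off a VIP profile, the sketch below groups consecutive variables whose VIP exceeds the threshold. The inputs `times` and `vip` are hypothetical parallel lists (one retention time and one VIP value per variable); how the authors derived their windows from Figure 7 is not stated.

```python
def important_ranges(times, vip, threshold=1.0):
    """Group consecutive variables with VIP > threshold into (start, end) time windows."""
    ranges = []
    start = None   # retention time where the current window opened
    last = None    # most recent retention time inside the current window
    for t, v in zip(times, vip):
        if v > threshold:
            if start is None:
                start = t
            last = t
        elif start is not None:
            # The run of important variables ended at the previous point
            ranges.append((start, last))
            start = None
    if start is not None:
        ranges.append((start, last))
    return ranges
```

For example, `important_ranges([1, 2, 3, 4, 5, 6], [0.5, 1.2, 1.5, 0.9, 1.1, 1.3])` yields two windows, one covering times 2 to 3 and one covering times 5 to 6.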
According to the identification of the major compounds in the fingerprints, many of these important variables were chromatographic signals of iridoids and secoiridoids, such as loganin, 6′-O-β-D-glucopyranosylgentiopicroside, swertiamarine, gentiopicroside, and sweroside. Previous studies on the spatial profiling of iridoid constituents found that the geographical variation of those compounds could be attributed to environmental factors [13,17], for example, differences in precipitation among natural habitats [17]. It was also interesting to note that the number of important variables after 25 min gradually increased from the rhizome to the leaves. This suggested that, in addition to iridoids, other low-polarity products in G. rigescens are relevant to the differentiation of geographical origins.
In summary, the current results indicated that the chemical composition of G. rigescens changes with the latitude at which it is grown in a way that can be traced with the chromatographic fingerprint. Furthermore, three-dimensional (3D) score plots and VIP analysis showed that the geographic variation of phytochemicals differed between aerial and underground parts. Those differences might affect the geographical origin traceability of samples.

Geographic Authentication Based on Fingerprints of Different Medicinal Parts
In recent years, the literature has reported satisfactory classification results obtained with RF or OPLS-DA models [51][52][53][54]. As an ensemble learning method, the RF algorithm corrects for the tendency of decision trees to overfit their training set [55]. OPLS, in turn, separates useful information from noise and improves the interpretability of complex chemical data [56,57]. In this work, we tested RF and OPLS-DA models, combined with rhizome, stem, and leaf fingerprint data, in order to classify G. rigescens according to the latitude at which the samples were grown.

RF Classification
First, the samples in the rhizome data set (280 samples and 3839 variables) were separated into a calibration set (186 samples) and a validation set (94 samples) by the Kennard-Stone algorithm. Subsequently, the 186 rhizome samples, collected from four latitude gradients, were used to establish the calibration model (R_RF). During the modeling process, the initial value of ntree (which needs to be optimized) was set to 2000, the initial value of mtry was set to the square root of the number of variables, and the remaining parameters were left at their default values. The out-of-bag (OOB) errors were then calculated, and the best ntree was taken as the value with the lowest OOB error. Figure 8 shows that the minimum error and the standard error were lowest with 663 trees. Based on the optimal number of trees, mtry was re-selected by searching values from 50 to 75. The mtry value was set to 61, because the model then had the lowest OOB classification error. Finally, the classification model was established with the optimum ntree and mtry values.
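For readers unfamiliar with the Kennard-Stone split mentioned above, the following is a minimal sketch of the usual selection rule, assuming Euclidean distance between fingerprints stored as rows of a list. The function name and the brute-force distance matrix are illustrative assumptions; the study does not state which implementation was used.

```python
import math

def kennard_stone(X, n_select):
    """Return the indices of n_select calibration samples chosen by the Kennard-Stone rule."""
    n = len(X)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise Euclidean distances between all samples
    dist = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    # Start from the two samples that are farthest apart
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = set(range(n)) - set(selected)
    while len(selected) < n_select:
        # Pick the candidate whose nearest selected neighbour is farthest away
        nxt = max(sorted(remaining), key=lambda r: min(dist[r][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected
```

With a 280-sample data set, `calibration = kennard_stone(X, 186)` would leave the remaining 94 indices as the validation set, which matches the split sizes reported above.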

Following the same procedure as for the rhizome model, the data sets of stems (280 samples and 4140 variables) and leaves (280 samples and 4140 variables) were each separated into a calibration set and a validation set. Subsequently, RF calibration models of stems (S_RF) and leaves (L_RF) were built. The optimum ntree and mtry values can be found in Figures 9 and 10.
For the RF model of the stem, calibration set accuracies of 92.47%, 94.62%, 93.01%, and 93.01% were achieved for low-latitude, mid-latitude, mid-high-latitude, and high-latitude samples, respectively. The corresponding validation set accuracies were 98.94%, 97.87%, 96.81%, and 97.87% (Table 2).
For the RF model of the leaf, accuracies of 92.47%, 96.24%, 93.01%, and 94.62% were achieved for the calibration set, and accuracies of 85.11%, 93.62%, 89.36%, and 93.62% were achieved for the validation set (Table 3).
Figure 10. The ntree (a) and mtry (b) screening of RF models based on leaves fingerprints.

OPLS-DA Classification
The OPLS-DA models of rhizomes (R_OPLS-DA), stems (S_OPLS-DA), and leaves (L_OPLS-DA) were constructed with the same calibration and validation sets that were used for the RF models. All of the models were built with internal seven-fold cross-validation. Table S1 shows that the R² of the models ranged from 0.77 to 0.82 and that the Q² values were larger than 0.50, which indicated that the OPLS-DA models were well fitted and had good predictive ability. The permutation test results can be found in Figures S14-S16 of the Supplementary Materials.
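For orientation, the goodness-of-fit and predictive-ability statistics are commonly defined as follows. This is the textbook form, given here as an assumption, since the exact computation performed by the modeling software is not described in the text. Here $y_i$ is the observed response, $\hat{y}_i$ the fitted value, $\bar{y}$ the mean response, and $\hat{y}_{(i)}$ the prediction for sample $i$ from a model fitted without the cross-validation fold that contains it.

```latex
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2},
\qquad
Q^2 = 1 - \frac{\sum_{i} \bigl(y_i - \hat{y}_{(i)}\bigr)^2}{\sum_{i} (y_i - \bar{y})^2}
```

Under these definitions $Q^2$ cannot exceed $R^2$ in practice, and a large gap between the two is usually read as a sign of overfitting.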
The classification results of the R_OPLS-DA model (Table 4) showed that the calibration set accuracy was 98.92% for all classes. The validation set accuracies were 95.47%, 98.94%, 94.86%, and 97.87% for low-latitude, mid-latitude, mid-high-latitude, and high-latitude samples, respectively. For the S_OPLS-DA model (Table 4), although calibration set accuracies of 98.92%, 99.46%, 98.92%, and 98.39% were obtained for samples grown at the four latitudes, the total accuracy of the validation set was lower (93.62%). For the L_OPLS-DA model (Table 4), the calibration set accuracies were 97.31%, 99.46%, 97.31%, and 98.39% for low-latitude, mid-latitude, mid-high-latitude, and high-latitude samples, respectively. However, the total accuracy of the validation set was lower than that of the calibration set; for class 1 samples in particular, the accuracy was only 88.30%.
Finally, we compared the classification performance of the six models on the basis of the above analysis. For the RF models, the order of calibration total accuracy was R_RF (96.24%) > L_RF (94.09%) > S_RF (93.28%), and the order of validation total accuracy was S_RF (97.87%) > R_RF (95.21%) > L_RF (90.43%). For the OPLS-DA models, the order of calibration total accuracy was R_OPLS-DA (98.92%) = S_OPLS-DA (98.92%) > L_OPLS-DA (98.12%), and the order of validation total accuracy was R_OPLS-DA (96.81%) > S_OPLS-DA (93.62%) > L_OPLS-DA (92.55%). In terms of accuracy, the classification models built with the leaf data set performed worst. Additionally, the validation sets of the L_RF and L_OPLS-DA models had lower Matthews correlation coefficient (MCC) values. By contrast, all of the models based on the rhizome data set showed better classification performance (total accuracy ranging from 95.21% to 98.92%). The best total accuracy occurred when the rhizome data were combined with the OPLS-DA algorithm. Judging from the SE, SP, MCC, and EFF values, imbalanced recognition among categories was less pronounced in the R_OPLS-DA model than in the other models.
Although the classification performance of the OPLS-DA and RF models based on the rhizome data set was good, the classification ability of the models, as measured by accuracy, sensitivity (SE), specificity (SP), MCC, and efficiency (EFF), still needed to be enhanced. In a further step, the feasibility of combining the information from rhizome, stem, and leaf fingerprint data for the geographical traceability of samples was investigated with low-level and mid-level data fusion strategies.
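The per-class figures of merit used above can be computed from true and predicted labels in a one-versus-rest fashion. The sketch below uses the standard definitions of accuracy, SE, SP, and MCC; the definition of EFF as the geometric mean of SE and SP is an assumption (it is a common convention in chemometrics, but the text does not give the formula), and the function name is illustrative.

```python
import math

def class_metrics(y_true, y_pred, positive):
    """One-vs-rest accuracy, SE, SP, MCC and EFF for the class labelled `positive`."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    se = tp / (tp + fn) if tp + fn else 0.0   # sensitivity
    sp = tn / (tn + fp) if tn + fp else 0.0   # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "SE": se,
        "SP": sp,
        "MCC": mcc,
        "EFF": math.sqrt(se * sp),  # assumed definition: geometric mean of SE and SP
    }
```

Under this one-versus-rest convention, a single misclassified sample in a 94-sample validation set gives a class accuracy of 93/94, or about 98.94%, a value that appears several times in the reported results.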

Low-Level Data Fusion
Following the method described in the data preprocessing section (Figure 11), the fingerprint data sets of aerial and underground organs were concatenated as subsets into a single data block (a new data set). For the low-level strategy, four data sets, rhizome combined with stem (RS), rhizome combined with leaf (RL), stem combined with leaf (SL), and all data combined (RSL), were used to build RF (RS_RF, RL_RF, SL_RF, and RSL_RF) and OPLS-DA (RS_OPLS-DA, RL_OPLS-DA, SL_OPLS-DA, and RSL_OPLS-DA) models. For every data set, the calibration samples were selected by the Kennard-Stone algorithm and the remaining samples were used as the validation set.
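Low-level fusion as described above amounts to joining the variable blocks side by side for each individual. A minimal sketch, assuming every block is a list of rows in the same sample order (the function name is illustrative, and any scaling applied before concatenation is not shown because the text does not specify it here):

```python
from itertools import chain

def low_level_fusion(*blocks):
    """Concatenate fingerprint blocks column-wise; row i must be the same individual in each block."""
    n_samples = len(blocks[0])
    if any(len(block) != n_samples for block in blocks):
        raise ValueError("all blocks must contain the same samples in the same order")
    return [list(chain.from_iterable(block[i] for block in blocks)) for i in range(n_samples)]
```

For instance, fusing a rhizome block with 3839 variables and a stem block with 4140 variables in this way would give an RS block with 7979 variables per sample.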
The optimum ntree and mtry values were selected first (Figure S8). Afterwards, the final classification models were established with the best parameter values. Table 5 shows that the samples collected from the four latitudes were better discriminated with the RS and RSL data sets. The RS_RF model achieved a total accuracy of 95.43% for the calibration set and 96.81% for the validation set. The RSL_RF model achieved 94.89% for the calibration set and 97.37% for the validation set. A comparison with the SE, SP, MCC, and EFF values of the S_RF and L_RF models (Tables 1 and 3) showed that the low-level data fusion strategy reduced the imbalance in category recognition of the RF model (Table 5). However, the total accuracy of the models was not obviously improved.
For the OPLS-DA models based on low-level data fusion, the permutation plots can be found in the Supplementary Materials (Figures S17-S20). The R² values of the models ranged from 0.86 to 0.90 and the Q² values ranged from 0.74 to 0.80 (Table S2). The total accuracy rates of the calibration sets of RS_OPLS-DA, RL_OPLS-DA, SL_OPLS-DA, and RSL_OPLS-DA were 99.46%, 99.73%, 100.00%, and 99.73%, respectively (Table 6).
Additionally, the correct classification rates of the validation sets varied from 97.34% to 98.40% (Table 6). A comparison of the SE, SP, MCC, and EFF parameters (Tables 4 and 6) highlights that the classification abilities of the data fusion OPLS-DA models were better than those of the models built on individual data sets. Moreover, the RS_OPLS-DA model was the optimum classification model when using the low-level data fusion strategy (Tables 5 and 6).

Mid-Level Data Fusion
At the end of the research, the feasibility for further optimizing the model parameters by feature subset selection and data fusion was investigated ( Figure 11). Variables selection was one of the steps of the mid-level data fusion strategy. For the RF model, the "Boruta" algorithm was used to identify important chromatographic signal variables that significantly contributed to the classification performance. "Boruta" selection was finished based on three RM models that were built while using data sets of rhizomes (3839 variables), stems (4140 variables), and leaves (4140 variables), respectively. After comparing original attributes' importance with importance achievable at random, 200 variables of rhizome data set, 305 variables of stem data set, and 359 of variables for leaf data set were retained as relevant features variables for sample discrimination (Figures S9-S11). Subsequently, those feature subsets were combined as a new data block and the fused data set (505 variables for RS, 559 variables for RL, 664 variables for SL, and 864 variables for RSL) was used to establish final classification models. The optimum n tree and m try values of RS_RF, RL_RF, SL_RF, and RSL_RF model could be found in Figure S12. Table 7 lists the statistical results for the classification ability of the four RF models based on mid-level data fusion. The average accuracies of the calibration set and validation set were achieved for 96.44% and 97.21% by using RF algorithm. It is notable that the RL_RF model had accuracies that ranged from 94.09% to 99.46% in the calibration set and accuracy ranging from 96.81% to 100% in the validation set. In addition, parameters of SE (0.87-1.00), SP (0.94-1.00), MCC (0.87-1.00), and EFF (0.92-1.00) for each class of RL_RF model were higher than most RF classification models. 
As a result, relative to the low-level data fusion strategy, the mid-level data fusion strategy could eliminate unnecessary variables, enhance the classification ability of the model, and mitigate the imbalanced recognition of categories in the RF model.
For the OPLS-DA model, three independent classification models were first built using the original data sets of rhizome, stem, and leaf, respectively. Subsequently, the VIP values of the variables in the different classification models were calculated with SIMCA software. The results showed (Figure S12) that a total of 4486 variables (1309 selected from the rhizome data set, 1538 from the stem data set, and 1639 from the leaf data set) had VIP values greater than 1. Those variables of high importance for the geographical traceability of samples were combined into new data sets (2847 variables for RS, 2948 variables for RL, 3177 variables for SL, and 4486 variables for RSL) for building the final classification models. The R² and Q² values and the permutation plots of the RS_OPLS-DA, RL_OPLS-DA, SL_OPLS-DA, and RSL_OPLS-DA models are shown in Table S2 and Figures S21-S24. The classification results showed that the average accuracies of the calibration and validation sets were 99.66% and 96.81%, respectively (Table 8). The four models exhibited good performance, with MCC values ranging from 0.96 to 1.00 and EFF values ranging from 0.92 to 1.00 (Table 8). OPLS-DA models based on mid-level and low-level data fusion showed similar accuracy and model performance, although feature selection was useful for reducing irrelevant variables when classifying samples. Overall, data fusion improved the results compared with the performance of models based on independent data sets. Considering the similar accuracy and the higher SE, SP, MCC, and EFF values of the calibration and validation sets, the RS_OPLS-DA model based on the low-level data fusion strategy performed best.
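The sizes of the fused data blocks follow directly from the number of variables retained for each organ. The short Python sketch below is only an arithmetic check of the counts reported above; the function name and the dictionary layout are illustrative and are not part of the original workflow, which was carried out in R and SIMCA.

```python
from itertools import combinations

def fused_sizes(selected):
    """Return the number of variables in every fused block of two or more organs.

    `selected` maps an organ label (R, S, L) to its number of retained variables.
    """
    sizes = {}
    labels = list(selected)
    for k in range(2, len(labels) + 1):
        for combo in combinations(labels, k):
            sizes["".join(combo)] = sum(selected[c] for c in combo)
    return sizes

# Counts reported in the text: Boruta (RF) and VIP > 1 (OPLS-DA)
boruta = {"R": 200, "S": 305, "L": 359}
vip = {"R": 1309, "S": 1538, "L": 1639}

print(fused_sizes(boruta))
print(fused_sizes(vip))
```

Both sets of reported block sizes (RS, RL, SL, and RSL) are reproduced by simple addition of the per-organ counts.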

Sample Preparation
The dried samples (rhizomes, stems, and leaves) were ground and then passed through a 100-mesh sieve. Each sample powder (25 mg) was accurately weighed and extracted with 1.5 mL of 80% methanol-water solution at 25 °C. The extraction was carried out in an ultrasonic extractor for 40 min. The final extract was filtered through a 0.22 µm syringe filter into an HPLC vial and then subjected to HPLC analysis [16,58].

Instrumentation and HPLC Analysis
Chromatographic analyses were performed with an Agilent 1260 Infinity LC system (Agilent Technologies, Santa Clara, CA, USA), which was equipped with a G1315D diode-array detector, a G1329B ALS autosampler, and a thermostated column compartment. The HPLC fingerprints were recorded with ChemStation software (Agilent Technologies, Waldbronn, Germany).
The analytical separation method was adopted from a published method for chemical fingerprinting analysis [16]. The separation was achieved on a reversed-phase C18 column (Agilent Intersil, 5 µm [...]). The column was subsequently washed with 90% B and re-equilibrated with 7% B prior to injection of the next sample. The flow rate was 1.0 mL/min and the column temperature was 30 °C. The injection volume was 5 µL and the detection wavelength was set at 241 nm. Chromatographic data were processed using OpenLab software (Agilent Technologies) [16,58].

Data Analysis
HPLC fingerprints from the 280 rhizome samples, 280 stem samples, and 280 leaf samples (840 fingerprints in total) were exported in CSV format and imported into MATLAB R2018b (The MathWorks, Inc., Natick, MA, USA), which was used for correlation optimized warping (COW) alignment of the chromatographic fingerprints. MATLAB code for COW is freely available from www.models.kvl.dk. The preprocessed fingerprints were used in the subsequent analyses [59].
Exploratory data analysis (EDA) is necessary for building predictive models [60,61]. It helps to reveal interesting correlations among the samples or variables and to summarize the main characteristics of a data set [60]. Principal component analysis (PCA) is a popular primary tool in EDA [61,62]. It is often used to visualize the relatedness between samples and to explain the variance in the data. Hence, PCA, as an unsupervised pattern recognition technique, has been widely used to extract key information from chemical fingerprints in geographical origin and modelling research [61].
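As an illustration of how PCA extracts the main direction of variance, the following minimal Python sketch computes the first principal component of a small matrix by power iteration on the covariance matrix. It is not the SIMCA implementation used in this study; the toy data (rows as samples, columns as signal variables) and all function names are hypothetical.

```python
import math

def center(X):
    """Subtract the column mean from every variable (mean centering)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def covariance(Xc):
    """Sample covariance matrix of an already centered data matrix."""
    n, p = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_pc(X, n_iter=200):
    """Return (loadings, scores, explained variance ratio) of PC1."""
    Xc = center(X)
    C = covariance(Xc)
    p = len(C)
    # Start vector of ones; assumes it is not orthogonal to the first loading
    v = [1.0] * p
    for _ in range(n_iter):
        w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v
    eigenvalue = sum(v[i] * sum(C[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_var = sum(C[i][i] for i in range(p))
    scores = [sum(r[j] * v[j] for j in range(p)) for r in Xc]
    return v, scores, eigenvalue / total_var

# Hypothetical toy data: 4 samples, 2 strongly correlated variables
X = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
loadings, scores, ratio = first_pc(X)
print(loadings, scores, ratio)
```

In a score plot, the `scores` of the first two components would be drawn against each other to inspect the grouping of samples.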
Unlike PCA, orthogonal partial least squares discriminant analysis (OPLS-DA) is a supervised pattern recognition technique. As an extension of PLS, the OPLS-DA model incorporates an inbuilt orthogonal signal correction filter [56]. This algorithm effectively divides the variation in X into two parts: one that is related to class information (Y-predictive) and one that is orthogonal, or unrelated, to class information (Y-uncorrelated). Therefore, the interpretability and prediction performance of the model are enhanced [56].
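In matrix notation, this decomposition is commonly written as shown below. The symbols are introduced here for illustration only and are not taken from the original text: T_p and P_p denote the predictive scores and loadings, T_o and P_o the orthogonal scores and loadings, C the Y-loadings, Y the dummy matrix of class membership, and E and F the residuals.

```latex
\begin{aligned}
X &= T_p P_p^{\mathsf{T}} + T_o P_o^{\mathsf{T}} + E \\
Y &= T_p C^{\mathsf{T}} + F
\end{aligned}
```

Only the predictive term enters the model of Y, whereas the orthogonal term collects the systematic variation in X that is unrelated to the classes.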
Random forest (RF) is the other supervised pattern recognition technique utilized in the study. RF is an ensemble learning method [55]. The RF algorithm produces a large number of trees in order to improve the predictive ability of the model, and the decisions of the individual trees are combined into the final decision. In other words, the more trees built in the random forest classifier, the higher the accuracy that can be achieved. However, many studies have shown that an optimum number of trees is of great importance for classification performance [33,46].
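The combination of the individual tree decisions can be sketched as a simple majority vote. The Python example below shows only this aggregation step, under the assumption that each tree has already produced one class label per sample; the votes and class labels are hypothetical, and the example is not the randomForest implementation used in this study.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Combine per-tree class labels into one label per sample.

    `tree_predictions` holds one list per tree; each list contains that
    tree's predicted class for every sample.
    """
    n_samples = len(tree_predictions[0])
    final = []
    for i in range(n_samples):
        votes = Counter(tree[i] for tree in tree_predictions)
        final.append(votes.most_common(1)[0][0])
    return final

# Hypothetical votes of five trees for three samples (latitude classes A-D)
votes = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["A", "C", "D"],
    ["B", "B", "D"],
    ["A", "B", "C"],
]
print(majority_vote(votes))
```

With more trees, single erroneous votes carry less weight, which is why the number of trees is usually screened together with the number of variables tried at each split.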
In this work, exploratory data analysis of the HPLC fingerprints of G. rigescens grown at four different latitudes was performed with PCA. Two supervised pattern recognition techniques, OPLS-DA and RF, were applied to build classification models for the producing areas of G. rigescens. PCA and OPLS-DA were performed with SIMCA 14.1 software (Umetrics AB, Umeå, Sweden). RF classification models were established with R 3.5.1 and the package randomForest (Version 4.6-14) [63].

Data Fusion Strategy
In the case of the low-level fusion strategy (Figure 11), the different subsets (HPLC fingerprint data matrices of rhizomes, stems, and leaves) are straightforwardly concatenated and compiled into a new chromatographic data matrix for subsequent classification model construction [45,46]. Furthermore, each subset must be fully aligned, and all variables must be kept on the same scale, before the subsets are concatenated [45,46].
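A minimal Python sketch of low-level fusion is given below. It assumes that the rows of every block refer to the same individuals in the same order and uses autoscaling (mean centering and unit variance) to bring the variables onto a common scale; the choice of autoscaling, the function names, and the toy values are assumptions made for illustration only.

```python
import math

def autoscale(block):
    """Mean-center each column and scale it to unit standard deviation."""
    n, p = len(block), len(block[0])
    scaled = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [row[j] for row in block]
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        for i in range(n):
            # Constant columns carry no information; leave them at zero
            scaled[i][j] = (col[i] - mean) / sd if sd > 0 else 0.0
    return scaled

def low_level_fusion(*blocks):
    """Concatenate scaled blocks column-wise, sample by sample."""
    n = len(blocks[0])
    if any(len(b) != n for b in blocks):
        raise ValueError("all blocks need the same number of samples")
    scaled = [autoscale(b) for b in blocks]
    return [sum((s[i] for s in scaled), []) for i in range(n)]

# Hypothetical toy blocks: 3 individuals, 2 rhizome and 1 stem variable
rhizome = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
stem = [[5.0], [7.0], [9.0]]
fused = low_level_fusion(rhizome, stem)
print(fused)
```

The fused matrix keeps one row per individual and as many columns as the blocks have in total, so no variable is discarded at this level.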
In the case of mid-level fusion (Figure 11), the first step of data treatment is feature selection based on the classification models of rhizomes, stems, or leaves. Compared with the raw data sets, feature selection reduces both the amount of data and the dimensionality of each subset. Subsequently, new subsets of rhizomes, stems, and leaves are rebuilt using the selected variables [45]. Finally, those subsets are concatenated and compiled into a final data matrix for model construction [45].
In this research, the relevant variables of the RF classification models were determined with the R package Boruta [64], and VIP was used to select the important variables of OPLS-DA [65].
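The two steps of mid-level fusion, selection within each block followed by concatenation, can be sketched in Python as follows. The example assumes that an importance score per variable is already available (for instance a VIP value, with a threshold of 1); the scores, the toy data, and the function names are hypothetical and do not reproduce the Boruta or SIMCA calculations.

```python
def select_columns(block, scores, threshold=1.0):
    """Keep only the columns whose importance score exceeds the threshold."""
    keep = [j for j, s in enumerate(scores) if s > threshold]
    return [[row[j] for j in keep] for row in block], keep

def mid_level_fusion(blocks, scores, threshold=1.0):
    """Select features per block, then concatenate the reduced blocks."""
    n = len(next(iter(blocks.values())))
    fused = [[] for _ in range(n)]
    kept = {}
    for name, block in blocks.items():
        reduced, keep = select_columns(block, scores[name], threshold)
        kept[name] = keep
        for i in range(n):
            fused[i].extend(reduced[i])
    return fused, kept

# Hypothetical toy blocks (2 samples) and VIP-like scores per variable
blocks = {
    "rhizome": [[0.1, 0.5, 0.9], [0.2, 0.6, 1.0]],
    "stem": [[1.5, 2.5], [1.6, 2.6]],
}
scores = {"rhizome": [1.4, 0.3, 1.1], "stem": [0.8, 1.2]}

fused, kept = mid_level_fusion(blocks, scores)
print(kept)   # indices of the retained variables per block
print(fused)  # fused matrix: 2 samples x 3 retained variables
```

Keeping the retained indices per block makes it possible to trace each fused variable back to its organ and retention time.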

Model Evaluation
Five parameters, namely accuracy (ACC), sensitivity (SE), specificity (SP), efficiency (EFF), and Matthews correlation coefficient (MCC), were applied to evaluate the identification ability of the RF and OPLS-DA models. The ruggedness of the OPLS-DA models was investigated through 200 permutation tests. Furthermore, cumulative prediction ability (Q²), cumulative interpretation ability (R²), root mean square error of estimation (RMSEE), root mean square error of cross-validation (RMSECV), and root mean square error of prediction (RMSEP) were important evaluation indexes for the predictive power of the OPLS-DA models [33,66].
Values of TP (correctly identified samples [...]). For model performance, lower values of RMSEE, RMSECV, and RMSEP indicate better predictive ability. Conversely, the closer the values of ACC, SE, SP, EFF, MCC, Q², and R² are to 1, the better the model performs.
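The five classification parameters can be computed for each class from its confusion counts. The Python sketch below assumes the commonly used one-vs-rest definitions, with TP, TN, FP, and FN denoting true positives, true negatives, false positives, and false negatives; in particular, taking EFF as the geometric mean of SE and SP is an assumption, and the counts in the example are hypothetical.

```python
import math

def class_metrics(tp, tn, fp, fn):
    """One-vs-rest performance figures for a single class."""
    se = tp / (tp + fn)                    # sensitivity
    sp = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    eff = math.sqrt(se * sp)               # efficiency (assumed geometric mean)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {"ACC": acc, "SE": se, "SP": sp, "EFF": eff, "MCC": mcc}

# Hypothetical confusion counts for one latitude class
m = class_metrics(tp=24, tn=69, fp=1, fn=0)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

A single misassigned sample lowers SP, EFF, and MCC while SE stays at 1, which illustrates why several parameters are reported side by side.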

Conclusions
The findings of this study showed that the chemical profiles of G. rigescens were influenced by the latitude gradient of the producing areas, and that samples from lower and higher latitudes were clearly distinguishable. The score plots of PCA and OPLS-DA visualized the phytochemical geographic variation of the aerial and underground parts along the latitude gradient. Subsequently, the potential of fingerprint data obtained using HPLC-DAD to discriminate and classify G. rigescens grown at four different latitudes was investigated. RF and OPLS-DA models were used to develop an effective approach for the geographical traceability of G. rigescens grown at four different latitudes. When independent data sets were used to build models, the rhizome data set combined with OPLS-DA presented the best performance, with classification accuracies of the calibration and validation sets ranging from 94.68% to 98.94%. In a further step, the feasibility of combining the chromatographic fingerprint data from aerial and underground organs was investigated using two data fusion strategies, low-level and mid-level, in order to improve the performance of the classification models. Notably, the classification performance of the OPLS-DA models was efficiently improved by the low-level data fusion strategy, whereas better performance of the RF models appeared to be achieved by the mid-level data fusion strategy. Although satisfactory results were obtained with both RF and OPLS-DA based on the two data fusion strategies, OPLS-DA combined with the rhizome-stem fusion data set was the optimum model for discriminating G. rigescens samples according to the latitudes at which they were grown, with an accuracy of 97.87-100.00%, SE of 0.96-1.00, SP of 0.98-1.00, MCC of 0.95-1.00, and EFF of 0.97-1.00.

Supplementary Materials:
The following are available online. Figure S1: Variation of stems score plots along the latitude gradients, Figure S2: Variation of stems score plots between the adjacent latitudes, Figure S3: Variation of leaves score plots along the latitude gradients, Figure S4: Variation of leaves score plots between the adjacent latitudes, Figure S5: Permutation plot of the OPLS-DA of rhizome samples, Figure S6: Permutation plot of the OPLS-DA of stem samples, Figure S7: Permutation plot of the OPLS-DA of leaf samples, Figure S8: The n_tree and m_try screening of RF models based on low-level data fusion strategy, Figure S9: Result of variables selection of rhizome fingerprint data based on "Boruta" algorithm, Figure S10: Result of variables selection of stem fingerprint data based on "Boruta" algorithm, Figure S11: Result of variables selection of leaf fingerprint data based on "Boruta" algorithm, Figure S12: The n_tree and m_try screening of RF models based on mid-level data fusion strategy, Figure S13: The importance variables of OPLS-DA models of rhizomes, stems and leaves fingerprints data, Figure