Data Fusion of Fourier Transform Mid-Infrared (MIR) and Near-Infrared (NIR) Spectroscopies to Identify Geographical Origin of Wild Paris polyphylla var. yunnanensis

Origin traceability is important for controlling the effect of Chinese medicinal materials and Chinese patent medicines. Paris polyphylla var. yunnanensis is widely distributed and well-known all over the world. In our study, two spectroscopic techniques (Fourier transform mid-infrared (FT-MIR) and near-infrared (NIR)) were applied for the geographical origin traceability of 196 wild P. yunnanensis samples combined with low-, mid-, and high-level data fusion strategies. Partial least squares discriminant analysis (PLS-DA) and random forest (RF) were used to establish classification models. Feature variables extraction (principal component analysis—PCA) and important variables selection models (recursive feature elimination and Boruta) were applied for geographical origin traceability, while the classification ability of models with the former model is better than with the latter. FT-MIR spectra are considered to contribute more than NIR spectra. Besides, the result of high-level data fusion based on principal components (PCs) feature variables extraction is satisfactory with an accuracy of 100%. Hence, data fusion of FT-MIR and NIR signals can effectively identify the geographical origin of wild P. yunnanensis.


Introduction
The rhizome of Paris polyphylla var.yunnanensis (Franch.)Hand.-Mazz (P.yunnanensis) and P. polyphylla Smith var.chinensis (Franch.)Hara (P.chinensis), named as "chonglou" in Chinese, is a renowned and traditional herb with a history of thousands of years in China and plants belong to Paris genus, Liliaceae family.As an ancient history ethnobotanical medicinal plant, it is used to treat snake bite and insect sting, innominate toxin swelling, and a variety of inflammatory and traumatic in the folk in China.Various phytochemical researches have demonstrated that steroidal saponin, phytosterol, molting hormone, flavone, and pentacyclic triterpene are the major chemical components in Paris [1,2].Additionally, Paris is considered to have anti-bacterial, anti-myocardial ischemia, anti-tumor, analgesia, immune-regulation according to numerous pharmacological studies [3][4][5][6][7].As an important and precious medicinal plant, the raw material plants of P. yunnanensis are widely spread over southern China, especially in Yunnan Province [8].Our previous studies have shown that there are significant differences in the content of wild P. yunnanensis samples from different geographical sources, and the saponin content in southern Yunnan is relatively higher than other regions [9,10].Hence, it is crucial to the traceable geographical origin of wild P. yunnanensis samples to ensure effective medicinal values, which helps to ensure the effectiveness of the medication.Some techniques have been applied to identify the authenticity and quality of various herbal medicines, including Fourier transform mid-infrared (FT-MIR), near-infrared (NIR), ultraviolet-visible (UV-Vis), Raman, liquid chromatography-mass spectrometry (LC-MS), high performance liquid chromatography (HPLC), etc. [11][12][13][14][15][16].In recent years, several researches in classification of P. yunnanensis have widely used chemometrics models combined with various analytical techniques, including partial least squares discriminant analysis (PLS-DA), principal component analysis (PCA), hierarchical cluster analysis (HCA), random forest (RF), support vector machine (SVM), etc. [17][18][19][20][21].Among them, spectroscopic techniques are fast, lossless, and efficient for the analysis of herbal medicines.Besides, quality of P. yunnanensis is difficult to identify by one or several chemical components due to the synergistic effect of TCMs, while the integral chemical information of medicinal plants can be provided by chromatographic or spectral fingerprints.For example, Yang et al. applied PCA and cluster analysis combined with FT-MIR to classify the small quantity wild and cultivated P. yunnanensis samples [21].PLS-DA and RF models combined with FT-MIR have successfully traced the cultivated P. yunnanensis samples from Yunnan Province with different cultivation years [17].Besides, our previous study has shown that the PLS-DA model combined with different parts (rhizomes and leaves) FT-MIR information can effectively distinguish the samples of cultivated P. yunnanensis collected in different cities of Yunnan Province [19].
Only a kind of chemical profile was obtained by the single technique, the relative complete chemical information would be provided by multiple platforms.The data fusion strategy contains low-, middle-and high-level, which can effectively fuse the chemical information obtained by different platforms of samples into one dataset to identification and classification researches [22].As a case study, Li et al. found that the discriminant model established by FT-MIR and NIR spectral data combined with high-level data fusion strategy can effectively identify Panax notoginseng from different cultivation regions [23].Wu et al. demonstrated that FT-MIR combined with UV-Vis by data fusion strategy could obtain a reliable and good result to trace the geographical origins of wild P. yunnanensis samples [24].Hence, it is vital to effectively combine multiple techniques datasets of P. yunnanensis to obtain excellently chemical information to identification analysis and the effective results.
In this study, to obtain further realization of the similarities and differences among wild P. yunnanensis samples from central, western, northwest, southeast, and southwest Yunnan, we studied the collected samples using two spectroscopic techniques (FT-MIR and NIR), and fused with data fusion strategy (low-, mid-and high-level) combined with chemometrics including PCA, PLS-DA, and RF.The important variables regions among each classification models and the fast-quality assessment effects for geographical origins of P. yunnanensis were compared.Additionally, the chemical fingerprint of wild P. yunnanensis samples from different areas at Yunnan Province for FT-MIR and NIR spectra were analyzed.The results of our study may provide some basis for comprehensive utilization of P. yunnanensis resources.

Macroscopic Chemistry Components in IR Spectra
Averaged raw FT-MIR and NIR spectra of wild P. yunnanensis samples from central, northwest, western, southeast, and southwest Yunnan Province are shown in Figure 1.Twenty-five major common peaks were contained in FT-MIR spectra from these five regions, as shown in Figure 1a.The 4000 to 3700 cm −1 and 2620 to 1800 cm −1 absorptions were the FT-MIR spectral baseline area and diamond crystal spectral region, respectively, which areas provide invalid spectral information for this study [25].Besides, regions of 3700 to 2620 cm −1 and 1800 to 1300 cm −1 were defined as characteristic areas in our study, which mainly contained C=O, C=C, and C-H stretching vibration as well as C-H bending vibration mode [21,26].In addition, the region of 1300 to 650 cm −1 was the fingerprint region, which greatly contained C-O stretching vibration, C-C stretching vibration, C-OH bending vibration mode, as well as sugar skeleton vibration [27,28].For all above these useful regions, FT-MIR spectra can be divided into five distinct ranges, including 3700 to 2000, 1800 to 1500, 1500 to 1200, 1200 to 900, and 900 to 650 cm −1 [29].In the region of 3700 to 2000 cm −1 , the broad absorption band in the range 3700-3000 cm −1 corresponds to the stretching vibrations of free hydroxyl groups ν(OH) and the groups involved in intra-and intermolecular hydrogen bonds [29].Absorption at the peaks of 2928 and 2852 cm −1 were assigned to normal vibration mode such as the CH 3 asymmetric normal vibration mode at 2960 to 2920 cm −1 , CH 2 asymmetric normal vibration mode at 2930 to 2900 cm −1 , CH 3 symmetric normal vibration mode at 2900-2880 cm −1 , and CH 2 symmetric normal vibration mode at 2860 to 2850 cm −1 [29].Additionally, the region two to the region four were useful to deconvolute the bands into Lorentzian components, and the detailed information can be observed in Table 1.Moreover, amide I band is observed in the region of 1800 to 1500 cm −1 too, which corresponds to the C=C stretching mode of fatty acids and flavonoids [29].Nine major common peaks are obtained in average NIR spectra from five geographical origins, as shown in Figure 1b.The bands in the region of 9000 to 4500 cm −1 be associated with the first or second overtones [30].The peaks in the region 4500 to 4000 cm −1 are so narrow that it was difficult to provide detailed information [31].Besides, the peaks at 5169, 3382, 3334, and 1653 cm −1 were considered, which also may correspond to hydrogen bond stretching and scissoring vibration mode attributed to water molecules [29].The fingerprint of FT-MIR and NIR spectra characteristics of P. yunnanensis from different geographical origins were similar, as shown in Figure 1, which indicated similar chemical composition among these samples.Additionally, detailed peak positions and assignments were applied in Table 1.

Single Block Models
Raw FT-MIR and NIR spectra were pretreated by standard normal variate (SNV), first derivative (FD), second derivative (SD), SNV-FD, and SNV-SD, and parameters of these pretreatment algorithms are applied in Table S1, including parameters of cumulative interpretation ability (R 2 ), cumulative prediction ability (Q 2 ), the root mean square error of estimation (RMSEE), the root mean square error of cross-validation (RMSECV), and accuracy.For the FT-MIR spectra dataset, the worst classification ability was by FD algorithms, which also had accuracy worse than the raw dataset.But for the NIR spectra dataset, the SNV pretreatment algorithm was the worst preprocessing algorithm.However, SD was the best preprocessing algorithm, both for FT-MIR and NIR, whereby the accuracy even reached 100%.Among all preprocessing algorithms, the best pretreatment algorithm (SD) for each kind of spectroscopy should be selected and used to establish geographical classification models.
PLS-DA and RF classification models were established on FT-MIR and NIR SD datasets, respectively.The efficiency for each class and accuracy of the calibration set and validation set of these geographical origin models are shown in Table 2.The parameter of the root means square error of prediction (RMSEP) was one important parameter for evaluating model classification ability.The values of RMSEP were 0.203 and 0.236 of PLS-DA based on FT-MIR and NIR spectra datasets, respectively.With the higher accuracy and lower RMSEP, the PLS-DA model effect of using FT-MIR data to classify geographical origins was better than that of NIR.The permutation test can be used to determine whether the established PLS-DA model is at risk of overfitting [32,33].The intercepts generated by 200 random permutation tests for the FT-MIR PLS-DA model permutation test with the selected-class 1 variables were R 2 = 0.395 and Q 2 = −0.861, the values of R 2 and Q 2 of permutation tests with the remaining four categories variables are shown in Figure S1.All these results showed that this model was robust without overfitting.Additionally, the PLS-DA models involved in this paper were subjected to permutation tests and there was no overfitting.
For the RF model based on the FT-MIR spectra data matrix, the initial number of trees (n tree ) and number of variables (m try ) were set as 2000 and the square root of the number of variables.The optimal value of n tree was selected based on the lowest total out-of-bag (OOB) classification error value, meanwhile assured of the lower OOB classification error values of the most classes.Besides, the optimal n tree should be selected from the smooth region of the curve when the minimum OOB value was obtained in multiple regions.The optimal m try was selected according to the lowest OOB classification error value.As shown in Figure 2a,b, the most suitable values 1780 and 33 were selected as the best n tree and m try , respectively, for the RF model based on the FT-MIR spectra data matrix.Based on the same principle, the optimal values 434 and 39 were the best n tree and m try , respectively, for the RF model based on the NIR spectra dataset, as shown in Figure 2c,d

Important Variable Datasets Selected for Mid-Level Data Fusion
Mid-level on principal components (Mid-level-PCs), mid-level on recursive feature elimination (Mid-level-RFE) and mid-level on Boruta (Mid-level-Bo) dataset matrixes needed to be established to complete mid-level data fusion.The mid-level-RFE dataset was established as follows: RF models were established on FT-MIR and NIR spectra datasets of wild P. yunnanensis samples.For the two RF models, the initial n tree for both was defined as 2000, and the initial m try was set as 33 for the FT-MIR dataset and 39 for the NIR dataset.A total of 1033 trees and 529 trees were the optimal values for n tree of FT-MIR and NIR datasets, respectively, which are shown in Figure S2.Based on the optimal n tree , the number of m try was calculated to be 36 and 32 for FT-MIR and NIR spectra data matrixes, respectively.Next, the optimal n tree and m try were used to further obtain the importance of each variable of individual spectra matrix.All variables of FT-MIR and NIR datasets were arranged from small importance to large importance, respectively.The 10-fold cross-validation error rates of the RF model, based on FT-MIR and NIR data matrixes, are shown in Figure 3a,b.It was reduced sequentially by five variables for each step for the sorted important variables of FT-MIR and NIR spectra data matrixes, respectively.For the FT-MIR dataset (Figure 3a), all variables were divided into region 1, region 2, and region 3, which represent irrelevant variables, interference variables, and important variables, respectively [23].Among them, region 3 remained for further research.In other words, 45 of the most important variables of the FT-MIR dataset were selected to prepare for mid-level data fusion.However, all variables of the NIR data matrix were divided into region 1 and region 2 (Figure 3b), which represent irrelevant variables and important variables.Variables of region 1 were excluded and the other 145 NIR variables of region 2 were used to fuse with important variables of the FT-MIR dataset to establish Mid-level-RFE models.In other words, these two block datasets straightforwardly concatenated and reconstituted the independent data matrix named Mid-level-RFE.Basing on the optimal n tree and m try , important (confirmed and tentative) variables were calculated to be 304 and 343 for NIR and FT-MIR spectra datasets, respectively.These two block datasets (important variables) reconstituted the independent data matrix named Mid-level-Bo.
Similarly, two block datasets of 25 PCs NIR variables and 17 PCs FT-MIR variables straightforwardly concatenated and reconstituted the independent data matrix named Mid-level-PCs.
Besides, the selected variables for establishing mid-level data fusion classification models using RFE and Bo algorithms are shown in Figure 4.The important (confirmed) and tentative variables are represented by blue lines and yellow lines, respectively.

Important Variables Datasets Selected for High-Level Data Fusion
High-level data fusion uses the same PCs as mid-level data fusion as the PCs were selected from the unsupervised PCA model.Hence, 17 PCs of the FT-MIR spectra dataset (FT-MIR-PCs) and 25 PCs of the NIR spectra data matrix (NIR-PCs) were used to establish PLS-DA and RF models, respectively.For FT-MIR-PCs and NIR-PCs RF models, the two parameters of n tree and m try needed to be optimized first.As shown in Figure S3, the optimal values 535 and 1030 trees were selected and; furthermore, calculated the suitable m try to be 4 and 3 for FT-MIR-PCs and NIR-PCs RF models, respectively.Then, final RF models were established on the optimal parameters and the most important variables.Besides, vote results of validation sets and calibration sets of PLS-DA and RF models based on the FT-MIR-PCs and NIR-PCs datasets were obtained as shown in Tables S2-S5.
For the RF model based on the FT-MIR and NIR spectra datasets, the initial n tree was set as 2000, and the initial m try were set as 33 (FT-MIR) and 39 (NIR), respectively.The 1780 trees and 434 trees are the optimal values for n tree of FT-MIR and NIR datasets (Figure 2).The numbers of m try were defined to be 33 and 39 for FT-MIR and NIR spectra data matrixes, respectively, based on the optimal n tree .Furthermore, all variables of FT-MIR and NIR datasets were sorted, respectively.The 10-fold cross validation error rates of the RF model, based on the FT-MIR and NIR data matrixes are shown in Figure S4a,b.For the FT-MIR dataset (Figure S4a), all variables were divided into three regions.Among them, 80 of the most important variables (region 3) of the FT-MIR dataset were selected to establish FT-MIR-RFE PLS-DA and RF models.All variables of the NIR data matrix were divided into two regions (Figure S4b) and 200 NIR variables of region 2 were used to establish NIR-RFE PLS-DA and RF models.PLS-DA and RF models were based on two important variables datasets (FT-MIR-RFE and NIR-RFE), respectively.For FT-MIR-RFE and NIR-RFE RF models, 293 and 1459, respectively, were selected as the suitable trees, as shown in Figure S5.Furthermore, the optimal m try values were calculated to be 8 and 14, respectively.Based on the optimal parameters, FT-MIR-RFE and NIR-RFE datasets were used to establish the RF model.Similarly, vote results of validation sets and calibration sets of PLS-DA and RF models based on FT-MIR-RFE and NIR-RFE data matrixes could be obtained, respectively, as shown in Tables S6-S9.
The FT-MIR-Bo and NIR-Bo datasets were obtained by selecting important variables using the Bo algorithm with FT-MIR and NIR spectra datasets.These two data matrixes were used to establish PLS-DA and RF models, respectively.As shown in Figure S6, 1143 and 875 trees were selected to be the optimal n tree for FT-MIR-Bo and NIR-Bo datasets, respectively.Besides, the suitable m try values of the two datasets were calculated to be 12 and 20, respectively.Similarly, vote results of validation sets and calibration sets of two models were obtained as shown in Tables S10-S13.
Like mid-level data fusion, the comparison of selected variables for establishing high-level data fusion classification models using two algorithms is shown in Figure S7.In addition, it was found that both the fingerprint region and characteristic region variables contribute to the classification of P. yunnanensis from different origins.
The results of PLS-DA and RF models based on RFE, Bo, and PCs selection algorithms are shown in Table 2.For FT-MIR datasets containing three-variable selection algorithms, the validation set classification results of the PLS-DA and RF models were similar and slightly lower than that obtained by models based on the raw FT-MIR data matrix.Among these models, the RF model using the PCs selection algorithm showed the best ability.Besides, the classification ability of the PLS-DA and RF models based on NIR (RFE) and NIR (Bo) data matrixes were significantly lower than that based on the NIR (PCs) dataset.However, the accuracy for the validation sets of the PLS-DA and RF models based on the NIR (PCs) data matrix were both higher than that based on the FT-MIR (PCs) dataset, which is contrary to the results based on original FT-MIR and NIR datasets.Additionally, it is not hard to see that the classification ability of the PLS-DA and RF models based on the RFE algorithm were like the Bo algorithm, no matter whether based on the FT-MIR or the NIR spectral data.Distribution of their important variables was almost the same (Figure S7), which may be the reason for the similar classification results.Bo variables selection algorithm would be the preferred one than RFE algorithm because it used lesser operation time.

Low-Level Data Fusion Models
In this case, the low-level data matrix (196 × 2695) was used to establish PLS-DA and RF models.
For the low-level data fusion RF model, 2000 and 51were set as original n tree and m try values, respectively.As shown in Figure S8, 802 (n tree ) and 51 (m try ) were selected as the optimal parameters to establish the final low-level data fusion RF model.The results of calibration and validation sets for the two kinds of models are reported in Table 3.Although the accuracy of the calibration sets of the two models was quite different, the accuracy of the validation sets was only 3% different.Compared with the accuracy of individual data source models, the accuracy of low-level data fusion by both models was similar to that of the FT-MIR dataset, which was higher than that of the NIR data matrix.Besides, the samples of the second class (Northwest Yunnan) were correctly predicted by both low-level data fusion models.The low-level data fusion strategy enhanced (to 100%) the accuracy of samples from Northwest Yunnan.
Variables whose VIP (Variable importance in the projection) score was greater than 1 were selected as important variables to establish PLS-DA and RF models.The selected variables distribution is shown in Figure S9 and results of classification models are displayed in Table 3.The important variables were lesser dispersed in regions 3580 to 3530 cm −1 , 3460 to 3150 cm −1 , 3000 to 2950 cm −1 , 2750 to 2720 cm −1 , 1400 to 1200 cm −1 , and 1000 to 800 cm −1 of FT-MIR spectra and regions 6540 to 6200 cm −1 and 5070 to 4000 cm −1 of NIR spectra.Many interference and unrelated variables were eliminated by selecting VIP values.The efficiency and accuracy of each class in the low-level (VIP) data fusion model was unchanged and even enhanced.

Mid-Level Data Fusion Models
In our study, Mid-level-PCs, Mid-level-RFE, and Mid-level-Bo three kinds of data matrixes were used to establish PLS-DA and RF classification models to prepare for mid-level data fusion.As shown in Figure S10a,b, 427 (n tree ) and 6 (m try ) were selected as optimal parameters to establish the final mid-level-PCs RF model and obtained a validation set accuracy of 94.12%.The total accuracy of the validation set of the PLS-DA model was 98.53% (Table 3).
Similarly, 2000 and 13 were set as raw n tree and m try values for the mid-level-RFE RF model.According to the principle of parameter optimization of the RF model, the lowest values 262 (n tree ) and 13 (m try ) were defined as the suitable parameters to establish the RF model, as shown in Figure S10c,d.However, the validation set accuracy of the mid-level-RFE was only 69.12%, which was lesser than that of each individual spectral RF model.Besides, the PLS-DA model based on the mid-level-RFE had a worse classification ability with an accuracy of only 55.88% (Table 3).
The numbers of 2000 and 25 were set as raw n tree and m try values, respectively, for the mid-level-Bo RF model.As shown in Figure S10e,f, the suitable parameters were calculated to be 800 and 15, to establish the mid-level-Bo RF model, and the obtained validation set accuracy was 95.59%.The difference of accuracy for both the PLS-DA model and the RF model based on mid-level-RFE and mid-level-Bo was about 30%.By comparing the regions of important variables between mid-level-RFE and mid-level-Bo data matrixes (Figure 4), we could find that the number of important FT-MIR variables for the RFE variable selection algorithm was less than that of the Bo variable selection algorithm, and there was little difference in the important variables of the NIR spectroscopy selected by the two algorithms.Hence, we can infer that the RFE selection algorithm had excluded some of the important variables that may be enhancing the accuracy of classification models.In addition, we can also extrapolate that the FT-MIR dataset was more important for geographical origin classification of wild P. yunnanensis samples than the NIR data matrix, which provides more effective information.

High-Level Data Fusion Models
For high-level data fusion, four fuzzy aggregation operators were chosen as the voting rule for the voting decision, including minimum, maximum, product, and average [23].The category that has the maximum value in each operator is considered to be the selected class.It is worth mentioning that when the difference between the maximum value and the second largest value is less than 0.01, both values are considered maximum.Three kinds of vote results would be obtained by high-level data fusion including correct, false, and multiple discriminated.As shown in Table S14, the true Class of sample NO. 4 belongs to Class 1 and the four fuzzy aggregation operators were fully accorded with the true Class.For example, for No. 31 the true category is Class 1, while three voting results are distinguished into Class 3 and one voting result is defined as Class 5.The high-level data fusion voting results of this sample is Class 3. Besides, NO. 114 was voted into Class 2 and Class 3 by FT-MIR-PCs and NIR-PCs RF models.This sample truly belongs to Class 3, while the final data fusion voting result is distinguished into Class 2. Besides, sample 6 truly belongs to Class 1 while pertained to Class 1 and Class 4 by voting, which was multiple discriminated.Although multivariate discrimination does not affect the accuracy of the model, it influences efficiency values of each Class.This explains that the accuracy of the validation set of RF model is 1, while the efficiency of Class 1 and Class 4 does not reach 1 (Table S14).
As shown in Table 3, the accuracy of the validation set of the High-level-PCs RF model was reached at 100% and that of the High-level-PCs PLS-DA model was 98.53%.The high-level data fusion classification ability based on the PCs selection variables algorithm was better than that based on RFE and Bo algorithms.Besides, there was little difference between the PLS-DA model and the RF model in identifying the origin of P. yunnanensis samples based on the same data set.

Samples Preparation
The 196 rhizomes of wild P. yunnanensis samples were obtained from five different origins at central, western, northwest, southeast, and southwest areas of Yunnan Province, as shown in Figure 5.The detail collection information is shown in Table S15.All wild samples were identified as P. polyphylla var.yunnanensis (Franch.)Hand.-Mazz.by Professor Hang Jin (Institute of Medicinal Plants, Yunnan Academy of Agricultural Sciences, Kunming, China).All rhizome samples were washed with tap water and were dried in a drying oven at 50 • C, then sifted through 100 mesh sieves.Additionally, all samples were preserved in polyethylene zip-lock bags and kept in a dark and dry environment for further analysis.

Fourier Transform Mid-Infrared Spectroscopy (FT-MIR)
FT-MIR spectra were collected with an FTIR spectrometer equipped with a deuterated triglycine sulfate (DTGS) detector and a ZnSe ATR (attenuated total reflection) accessory (Perkin Elmer, Norwalk, CT, USA).All spectra recorded ranges of 4000 to 650 cm −1 with 4 cm −1 resolution, and 16 scans were averaged.Three analytical replicates of FT-MIR spectral data of all wild P. yunnanensis samples were obtained.

Near-Infrared Spectroscopy (NIR)
NIR analysis was conducted with an Antaris II spectrometer (Thermo Fisher Scientific, Madison, WI, USA) equipped, combined with a diffuse reflection module.All spectra recorded ranges of 10,000 to 4000 cm −1 with a spectral resolution of 4 cm −1 , and 16 scans were averaged for wild samples.Three scans were repeated for all wild samples.

Spectral Data Analysis and Software
FT-MIR spectra were converted from transmittance to absorbance and the advanced ATR correction was completed by OMNIC 9.7.7 software (Thermo Fisher Scientific, Madison, WI, USA).The spectral linear relation was greatly disturbed by high-frequency random noise, the interference of light scattering, baseline drift, etc. [34].Hence, FT-MIR and NIR spectra were processed using SNV, FD, SD, and their combination (SNV-FD and SNV-SD), to decrease a part of the irrelevant interferences [10,35,36].All these pretreatment procedures were performed by SIMAC-P + (Version 13.0, Umetrics, Umeå, Sweden).Additionally, the spectral regions of 4000 to 3700 cm −1 and 2620 to 1800 cm −1 were excluded for all FT-MIR spectra before establishing classification models due to a mass of interference information.
Hence, the regions of 3700 to 2620 cm −1 and 1800 to 650 cm −1 formed a data matrix for constructing classification models.
All samples for each class were separated into calibration sets and validation sets as a rate of 2 to 1 with Kennard-Stone algorithm using MATLAB (Version R2017a, Mathworks, Natick, MA, USA) [37,38].In other words, the number of calibration sets (128 samples) in one to five categories was 26, 26, 24, 26, and 26, respectively, and the number of validation sets (68 samples) in one to five classes was 14, 14, 12, 14, and 14, respectively.The preprocessing algorithms were estimated by parameters of R 2 , Q 2 , RMSEE, RMSECV, and accuracy of the calibration set [19,39].The better pretreatment algorithm required higher values of R 2 , Q 2 , and accuracy as well as lower values of RMSEE and RMSECV.Hence, the best preprocessing algorithm would be selected for identification analysis to establish PLS-DA and RF classification models.
The groundwork of PLS-DA is the PLS algorithm and it belongs to the binary classification algorithm from 0 to 1, which has been widely applied to resolve the classification problems for geographical origins, growth years, and others [40].For each sample, the probability of being assigned to each class could be obtained, and the category with the highest probability was seen as the category of this sample.For the validation samples, RF is based on the assembly classification or regression trees algorithm and has a better ability to handle the nonlinear and high-order interaction effects data matrixes [41].Both of these kinds of class-modeling methods belong to supervised pattern recognition and require a calibration set for each class in order to establish an individual model to explore their similarities between samples from the one class and the differences among all classes.Besides, the validation sets were used to validate the identification ability of supervised models.In our study, all PLS-DA were completed by SIMAC-P + (Version 13.0, Umetrics, Umeå, Sweden) and RF were established by RStudio (version 3.5.2,Boston, MA, USA).The operation of RF was roughly divided into the following three steps: Firstly, 2000 and the square root of the number of variables were set as the initial values of n tree and m try , respectively.Secondly, these two parameters were optimized according to the lowest OOB classification error values.Thirdly, the RF model was established with the selected optimal values of n tree and m try .
The RMSEP and accuracy of the validation set were the two parameters used to estimate the classification ability of PLS-DA.The lower values of RMSEP shows the better prediction ability of PLS-DA models.Besides, for the PLS-DA and RF models established by the best preprocessing algorithm, indices of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) were calculated for each class.The sensitivity (SEN) and specificity (SPE) were obtained by the above indices for each class.The efficiency of values was calculated by the geometric mean of SEN and SPE to evaluate the effectiveness of each class of PLS-DA and RF models.All formulas for the above parameters were as follow: The map of sample collection information in our study was obtained by Arc Map (version 10), and all figures were drawn by Origin (version 2018, OriginLab Corporation, Northampton, MA, USA) and Adobe Photoshop CC (version 2019, Adobe Systems Incorporated, San Jose, CA, USA).

Data Fusion Strategy
Data fusion strategy was applied in this study, as the comprehensive information of P. yunnanensis samples were unable to be provided by individual data sources.To compare and select the best data fusion strategy to trace geographical origins, the low-, mid-, and high-level data fusion strategies and three algorithms for variable selection were considered.The best preprocessing FT-MIR and NIR datasets were used to finish data fusion approaches.The schemes for these three strategies combined with two kinds of spectral signals are shown in Figure 6.High-level data fusion strategy, namely decision-data fusion, fused the vote results from the models of the FT-MIR and NIR datasets.Additionally, individual spectral matrices were formed by feature or important variables using variable selection algorithms (PCs, RFE, and Bo) before establishing

Figure 2 .
Figure 2. The parameter optimization of random forest models of independent decision making: (a) number of trees (n tree ) of the FT-MIR dataset; (b) number of variables (m try ) of the FT-MIR dataset; (c) n tree of the NIR dataset; (d) m try of the NIR dataset.

Figure 3 .
Figure 3.The 10-fold cross-validation error rates of the Random Forest (RF) model (sequentially reduced every five variables) based on total P. yunnanensis samples: (a) FT-MIR dataset; (b) NIR dataset.

Figure 4 .
Figure 4.The important variables of Boruta algorithm and RFE algorithm of random forest models based on total P. yunnanensis samples: (a,b) the FT-MIR dataset; (c,d) the NIR dataset.RFE: Recursive feature elimination.

Figure 6 .
Figure 6.Scheme of the low-, mid-, and high-level data fusion approaches used to combine the FT-MIR signals and NIR signals.In low-level data fusion strategy, the FT-MIR and NIR spectral signals are straightforwardly concatenated and reconstitute an independent data matrix.This new dataset (low-level data matrix) was equal to 196 rows and 2695 columns, namely 196 samples and 2695 spectral variables (= 1545 NIR variables + 1150 FT-MIR variables).Finally, the low-level dataset was used to establish the PLS-DA and RF models.Mid-level data fusion strategy, namely feature-level data fusion, was made up of feature important variables from single data sources including FT-MIR and NIR spectra.PLS-DA and RF classification models were established by new data matrixes, which were formed by concatenating the feature important variables from FT-MIR and NIR by different variable selection algorithms.In this case, important variables were selected based on each Y parameter (spectral datasets of total samples).More in detail, three different variable selection algorithms were as follows:• Mid-level-PCs consisted of principal components, which were selected by PCA of FT-MIR and NIR spectral datasets, respectively.PCs were selected based on values of eigenvalue greater than 1.Hence, the mid-level-PCs data matrix was obtained with 196 rows and 42 columns, namely 196 samples and 42 PCs variables (= 25 NIR PCs variables + 17 PCs FT-MIR variables).• Mid-level-RFE, which consisted of merging together the important variables of FT-MIR and NIR spectral datasets, was selected by the recursive feature elimination algorithm based on the RF model.The mid-level-REF dataset size was equal to 196 rows and 190 columns (= 145 NIR REF variables + 45 FT-MIR REF variables).• Mid-level-Bo, which consisted of merging together the important (confirmed and tentative) variables of FT-MIR and NIR spectral datasets, was selected by the Boruta algorithm based on the RF model.The mid-level-Bo data matrix consisted of 196 rows and 647 columns (= 153 NIR Bo confirmed variables + 151 NIR Bo tentative variables + 207 FT-MIR Bo confirmed variables + 136 FT-MIR Bo tentative variables).

Table 1 .
Peak assignments on the FT-MIR and NIR spectra of wild P. yunnanensis.

Table 2 .
The classification efficiency values and total accuracy of independent decision making with Partial least squares discriminant analysis (PLS-DA) and random forest (RF) models.RFE: Recursive feature elimination.

Table 3 .
The classification efficiency values and total accuracy of low-, mid-, and high-level data fusion strategies decision making with PLS-DA and RF models.RFE: Recursive feature elimination, Bo: Boruta, PCs: Principal components, VIP: Variable importance in the projection.