Determination of Cultivation Regions and Quality Parameters of Poria cocos by Near-Infrared Spectroscopy and Chemometrics

Poria cocos (PC) is an important fungus with high medicinal and nutritional values. However, the quality of PC is heavily dependent on multiple factors in the cultivation regions. Traditional methods are not able to perform quality evaluation for this fungus in a short time, and a new method is needed for rapid quality assessment. Here, we used near-infrared (NIR) spectroscopy combined with chemometric method to identify the cultivation regions and determine PC chemical compositions. In our study, 138 batches of samples were collected and their cultivation regions were distinguished by combining NIR spectroscopy and random forest method (RFM) with an accuracy as high as 92.59%. In the meantime, we used partial least square regression (PLSR) to build quantitative models and measure the content of water-soluble extract (WSE), ethanol-soluble extract (ASE), polysaccharides (PSC) and the sum of five triterpenoids (SFT). The performance of these models were verified with correlation coefficients (R2cal and R2pre) above 0.9 for the four quality parameters and the relative errors (RE) of PSC, WSE, ASE and SFT at 4.055%, 3.821%, 4.344% and 3.744%, respectively. Overall, a new approach was developed and validated which is able to distinguish PC production regions, quantify its chemical contents, and effectively evaluate PC quality.


Introduction
Poria cocos (PC) is derived from dried sclerotium of the fungus P. cocos (Schw.) Wolf, and it provides important nutritional and medicinal values. In China, it is known as "Fuling" and used for the treatment of various diseases [1], such as diabetes [2], cancer [3,4] and depression [5]. PC usage for nutritional and medicinal purposes has a long history of over 2000 years. PC was first recorded in an ancient Chinese medical book, Wushi'er Bingfang or Recipes for Fifty-Two Ailments, which described its clinical efficacy in promoting urination, removing dampness, invigorating the spleen and calming nerves [6][7][8]. Recent reports demonstrated that recipes with PC have been developed as part of a daily diet to prevent various chronic diseases such as hyperlipidemia, hyperglycemia, hypertension and atherosclerosis [7,8]. In China, these recipes are often used to develop and produce popular functional foods for both nutritional and medical benefits.
Since PC offers remarkable medicinal values with little side effects, the commercial PC market has grown rapidly in recent years. Cultivated PC has grown strongly due to the high demand for it that cannot be met by very limited amount of wild-growing PC. PC is a Chinese medicinal material, and similar to most materials for Chinese medicine, its quality is dependent on multiple factors within the cultivation region, including planting ecological environments [9], harvesting time [10], processing technology [11], etc. These factors affect the generation of secondary metabolites, which are important indicators of the quality and nutritional value of medicinal materials, including PC.
PC is mainly distributed in East and Southeast Asia where the climate is humid and subtropical, including China, Vietnam, and the Philippines [12,13]. The Dabie Mountains region, Yunnan region and Xiangqian region are the three main cultivation regions of PC in China, and the products of these regions, which are across different climate zones, have their own features [10,11,[14][15][16]. The diverse cultivation conditions in these remotely distanced regions have caused quality discrepancies in PC grown in China. Identification of PC's cultivation region in China would greatly benefit its quality classification, which is part of product quality control. This can be accomplished by developing a quick and accurate method to measure the levels of some key constituents of PC that are dependent on the cultivation region.
According to the Chinese and European Pharmacopoeia, the contents of water-soluble extracts and ethanol-soluble extracts are important indexes for the quality evaluation of PC. These contents contain a variety of chemical constituents, including triterpenes, polysaccharides, sterols, essential oil, proteins, etc. [13]. Among them, the main bioactive constituents are polysaccharides and triterpenes, which have been reported for various pharmacological benefits such as anti-inflammatory, antibacterial, antitumor and antioxidant effects. These pharmacological benefits contribute to the clinical efficacy of PC [17][18][19][20]. PC polysaccharides are a mixture of different types of polysaccharide, and account for 70~89% of the total dried mass of dried sclerotium, but only 2% of it dissolves in water-these water-soluble polysaccharides are the foundation of treating diseases, so PC is generally taken in the form of a water decoction [3]. PC polysaccharides contains rich branched glucans with (1,3)-β,(1,4)-β and (1,6)-β linkages and has exhibited strong bioactivity [21][22][23][24][25]. Up to now, more than 91 bioactive triterpenes have been isolated [13] and identified from PC, including Dehydrotumulosic acid (DTUA), Poricoic acid A (PAA), Polyporenic acid C (PAC), Dehydropachymic acid (DPA), Dehydrotrametenolic acid (DTRA) and other active triterpenoids [26,27]. Among them, DTRA and PAC are lanostane-type triterpenes, DPA and DTUA are Eburicane-type triterpenes, PAA is seco-Eburicane-type triterpenes, and they all show promising bioactivity [28,29]. These chemical constituents have special absorption spectra that show a unique reflection of their composition and molecular structure. This feature can correlate with the spectral response of sample to its constituents [30].
A broad range of analytical methods are available to assess the content of these chemicals in PC. Among them, near-infrared (NIR) spectroscopy is an advancing analytical technique that is rapid, easy to use and nondestructive [31,32]. NIR has been widely used to evaluate the quality of edible fungi, such as Pleurotus ostreatus [33], Ganoderma lucidum [34] and some cultivated mushrooms [35,36], with the capability to classify and identify cultivation regions [37]. NIR provides a complex spectrum with comprehensive absorption peak of H-containing groups (i.e., N-H, O-H, and C-H) that can be found in many active substances, including polysaccharides, flavonoids, triterpenes and phenols, etc. However, the spectroscopy of these substances is very sensitive to external interferences introduced by variances in ambient temperature, light scattering of the spectrometer and the color and physical status of samples. The raw spectra of these substances are similar and broad, consisting of many overlapping narrow bands, and hence make it challenging for further data analysis. To overcome this challenge, some spectral processing methods, such as standard normal variate (SNV) [38], smoothing, multiplicative scanter correction (MSC) [39] and Savizaky-Golay method (SG) [40,41], have to be used to optimize the calibration model by coupling with chemometrics. As previously stated, it is difficult to thoroughly assess the quality of Chinese medicinal herbs, due to the complexity of their active ingredients. However, NIR technology, combined with chemometric techniques, provides new insights into quality assessment of Chinese medicine materials with improved accuracy, and can be used for comprehensive analysis of the chemical composition of PC.
The objective of this study is to establish new models to distinguish cultivation regions and quantify the chemical composition of PC using NIR and chemometrics. Firstly, 138 batches of PC samples were collected from the three main cultivation regions, and then their raw NIR spectra and the content of water-soluble extract, alcohol-soluble extract, polysaccharides and the five triterpene acids were measured. Secondly, we used random forest method (RFM) to distinguish cultivation regions by NIR spectroscopy, and then used partial least square regression (PLSR) method to develop calibration model to determine the concentration of water-soluble extract, alcohol-soluble extract, polysaccharides and the sum of the five triterpenoids.

Sample Preparation
A total of 138 batches of PC samples were collected from the three major PC cultivation regions (Dabie Mountains region, Yunnan region and Xiangqian region). Detailed information of these samples is listed in Supplementary Materials (Supplementary Table S1). These samples were identified and authenticated by associate Prof. Hao Liu of Hunan Academy of Chinese Medicine. All the dry samples were further dried in an oven at 60 • C for 2 h, ground and sieved through a 60 mesh (0.3 mm). Well-prepared powder samples were labeled and stored in a dark and dry place. , all above 98% purity, were purchased from Chroma Biotechnology Co., Ltd. (Chengdu, China). Distilled water was purified using a Milli-Q system (Millipore, Bedford, MA, USA). Acetonitrile (HPLC-grade) was purchased from Merck (Darmstadt, Germany). Ultrapure water was purchased from Hangzhou Wahaha Group Co., Ltd. (Changsha, China). All the reagents and chemicals were of analytical grade.

Extraction of Water-Soluble Content and Ethanol-Soluble Content of PC
The water-soluble and ethanol-soluble contents of the solutions were then extracted and analyzed by following the standard method documented in the Chinese Pharmacopoeia Volume IV: 2201 (2020 Edition).
To determine the content of water-soluble extract: 2.5 g sample powder was mixed in 50 mL water to make a water solution, which was weighed and allowed to stand for 1 h. The solution was then heated in a water bath to slightly boil for 1 h. After this, the solution was cooled and its weight was replenished by adding water. The well-mixed water solution was then filtered and 25 mL of the filtrate was transferred to an evaporating dish, which was heated in a water bath to obtain the extract by removing water through evaporation. The extract was dried at 105 • C for 3 h and allowed to cool for 30 min in a desiccator. Immediately after that, the extract was accurately weighted. The water-soluble extract content was determined using the following equation: where Cw is the water-soluble extract content (%), S is the mass of the sample (g), W 1 is the mass of the evaporating dish and residue after drying (g), and W 0 is the mass of the evaporating dish (g). The content of ethanol-soluble extract was determined in a similar way by using ethanol instead of water.

Polysaccharides Content and Absorbance Curve
The contents of PC samples were determined using the phenol-sulphuric acid method. The powder samples (0.5 g) were mixed with 100 mL of 80% (v/v) ethanol, and the solution was heated in a water bath to reflux for 1 h. The solution was then filtered through a hopper to obtain the residue that was then washed with 15 mL of 80% (v/v) hot ethanol for three times. The remaining residue was heated in 75 mL water and refluxed for 4 h. The residue solution was then filtered while it was hot and the filter was washed with a small amount of hot water. The filtrate was diluted with water in a 100 mL flask. In total, 1 mL of filtrate was precisely measured, placed in a 10 mL test tube, and further homogenized with 0.8 mL of 5% (v/v) phenol and 3.5 mL of concentrated sulphuric acid. The mixture was incubated in water bath at 40 • C for 15 min. After each sample was cooled in an ice bath, its absorbance was measured at a wavelength of 488 nm. A total of 0.098 mg/mL standard solution of glucose was prepared as reference solution. The sample was diluted for different concentrations of 0.0018, 0.0037, 0.0074, 0.0111, 0.0148 and 0.0185 mg/mL. The D-glucose standard was used to generate the standard absorbance curve, with the equation Y = 84.978X + 0.0571, R 2 = 0.9924 (X-sample concentration; Y-absorbance).

Quantitative Analysis of Five Triterpene Acids Using High-Performance Liquid Chromatography (HPLC)
Chromatographic analysis was performed using a HPLC system (1200 series, Agilent Technologies, Santa Clara, CA, USA), comprising a quaternary solvent delivery pump, online degasser, autosampler, column thermostat and diode array detector. The column (ZORBAX SB-C18 column, 5 µm, 4.6 × 250 mm) was performed at 30 • C and at the flow rate of 1.0 mL/min. Acetonitrile and 0.2% (v/v) formic acid/water were used as mobile phase A and B, respectively. The solvent gradient was set as follows: 51% A during 0-5 min, 51-57% A during 5-20 min, 57-70% A during 20-45 min, 70-85% A during 45-55 min, and 85-100% A during 55-60 min. The injection volume was 10 µL and the detection wavelength was 245 nm.
Powdered samples (2.0 g) were accurately weighed, and then extracted with 20 mL methanol by ultrasonication (KM500DB, Kunshan Meimei Ultrasonic Instrument Co., Ltd., Kunshan, Jiangsu, China) at 40 kHz for 30 min. The extracts were then passed through 0.45 µm syringe filter for HPLC analysis. The concentrations of five triterpene acids, including DTUA, PAA, PAC, DPA and DTRA, were determined using standard calibration curves, and the detailed information of standard calibration curves was listed in Table 1. a , x-concentration of the analyte (mg mL −1 ), y-peak value; b , LOD-limit of detection; c , LOQ-limit of quantification.

NIR Spectroscopy Determination
The NIR spectra of the samples were measured by Antaris II FT-NIR spectrometer (Thermo Electron Co., Waltham, MA, USA), and the parameters of spectra were determined as 32 iterations of scans, resolution 8 cm −1 , and scan range 600 to 2000 nm. The powder sample was added to the sample cup and compressed. Each sample was measured three times and the average of three spectra was calculated for further statistical analysis. The temperature was kept at around 25 • C in the laboratory.

Geographical Discriminant Analysis of Random Forest Method (RFM)
RFM was an integration algorithm based on classification regression tree [42]. We used it to distinguish the producing areas of PC samples. During the training process, the RFM method constructed multiple independent tree classifiers based on Bagging algorithm, and the final results were voted by all constructed training classifiers. After model training, it can be used to classify the samples according to their producing areas. Previous studies [43] had reported in more detail how RFM performed better in distinguishing producing areas of Chinese medicinal materials.

Partial Least Squares Regression (PLSR) and Determination of Chemical Composition
PLSR is a classical regression method and has been widely used for quantitative determination through modeling. PLSR intends to project the high-dimensional predictor variables into a smart set of latent variables, which have a maximal covariance to the responses [43]. In this study, we used PLSR to establish the regression model for determination of chemical composition content, i.e., polysaccharides, ethanol-soluble extract, water-soluble extract and the five triterpenoids. The regression model was able to learn the relationship between independent variable X (NIR matrix) and dependent variable Y (chemical composition content), and the determination of chemical composition content was achieved by multiple linear regression of score matrix.

Software Requirements and Data Analysis
Statistical evaluation was performed using SPSS 21.0 (SPSS Inc., Chicago, IL, USA). All the algorithms were implemented in MATLAB 8.0.1 (MathWorks), and all the programing was done by the authors.
The indicators for evaluating the calibration set were the root mean square error of calibration (RMSEC), the root mean square error of prediction (RMSEP), the coefficient of determination for calibration (R 2 cal ) and the coefficient of determination for prediction (R 2 pre ). RMSEC and RMSEP were used to measure the model effectiveness and accuracy, while R 2 cal and R 2 pre were used to evaluate the fit of model. More specifically, a better model is indicated by smaller values of RMSEC and RMSEP and R 2 cal and R 2 pre being closer to 1. RMSEC, RMSEP and R 2 were respectively defined using the following equation: where n c , n p y i , y i andȳ are the number of observations in calibration, number of observations in prediction set, predicted value of the i-th observation, measured value of i-th observation and average of measured values, respectively.

Polysaccharides, Water-Soluble Extract and Ethanol-Soluble Extract in PC Samples
According to Chinese, Japanese, Korean and European pharmacopoeias, ethanolsoluble extract, water-soluble extract and polysaccharides are important indicators to evaluate the quality of PC. Our study had detected all these three indicators, and the detailed contents for these chemical compositions are listed in Table 2 and Figure 1A-C. Samples from the Yunnan region have the highest content of polysaccharide, water-soluble extract and ethanol-soluble extract at 6.81 mg g −1 , 25.23 mg g −1 and 37.85 mg g −1 , respectively. Samples from the Dabie Mountains region contain the lowest levels of polysaccharide (4.59 mg g −1 ), water-soluble extract (19.57 mg g −1 ) and ethanol-soluble extract (31.24 mg g −1 ). Figure 1A-C shows that the polysaccharides, ethanol-soluble extract, and water-soluble extract from the Yunnan region were significantly (p < 0.05) higher than those from the Dabie Mountains region. The contents of polysaccharides, water-soluble extract and ethanol-soluble extract in all the samples ranged from 1.24 mg g −1 to 31.16 mg g −1 , 8.80 mg g −1 to 73.95 mg g −1 , and 16.29 mg g −1 to 90.00 mg g −1 , respectively. The wide range of these values demonstrates that the quality of PC varies greatly among different producing regions. With regard to the standard deviation, the Yunnan region is highest in all of the three chemical compositions, followed by the Xiangqian region and the Dabie Mountains region. Although samples from the Yunnan region have high quality, samples from the Dabie Mountains region show better consistency in quality. The high quality of PC products in the Yunnan region could be attributed to the local climate and natural environment, while the quality consistency in samples from the Dabie Mountains region could be related to higher standard of planting and processing. Data are mean values ± SD (significant differences between the letters). The boxes are bounded by the 25% and 75% quartiles, with the median inside, whereas the extreme levels correspond to 5% and 95% percentiles, the rhombuses and stars represent the atypical values.

HPLC Analysis of Five Triterpene Acids
Triterpenes is also an important indicator of the quality of PC. In this study, DTUA, PAA, PAC, DPA and DTRA were selected as the leading indicators and the contents of the five triterpene acids in these PC samples were determined via HPLC analysis. The detailed results were listed in Table 2 and plotted in Figure 1D-I. The results show that the contents of DTUA, PAC and DPA in samples from the Dabie Mountains region are significantly higher than those from the Yunnan region, while the contents of PAA and DTRA are slightly lower than those from the Yunnan region. However, the total contents of the five triterpenoids in the samples do not change significantly from one cultivation region to another. Similar to polysaccharides, ethanol-soluble extract and water-soluble extract, contents of triterpenoids vary greatly among the producing regions. The result also suggests that high contents of five triterpenoids in PC corresponds to low contents of polysaccharides, and vice versa.

NIR Spectra Characteristics
The samples were analyzed using NIR spectroscopy in the wavenumber range of 12,000 to 4000 cm −1 . The results are presented in Figure 2, which illustrates five absorption peaks were mainly concentrated at about~8256,~6880,~5610,~5180 and~4800 cm −1 . The chemical structures of PC chemical components have abundant C-H and O-H bonds. It can be considered that the chemical components of PC were responsible for the NIR spectra. The strongest absorbance peak was observed at the wavenumber of~5180-5200 cm −1 , which is related to the combination of O-H stretching and bending, while the peak at~4800 cm −1 is related to the combination of both C-O stretching and O-H deformation. The peak at 6880 cm −1 may be related to the first overtone absorption of O-H stretching vibration and the weak peak at~8256 cm −1 is related to the second overtone of symmetric stretching of the -CH bonds of -CH 3 groups. Other less evident absorbance peaks are caused by overtones and combination modes of C-H, N-H, and O-H. The O-H and C-H vibrations are caused by compounds such as polysaccharides, triterpenoids and proteins [34,44]. As shown in Figure 2, the high overlap of NIR spectra makes it difficult to directly utilize the spectral information of target components. This problem can be solved by preprocessing the spectral information for better signal-to-noise ratio.

Elimination of Outliers
Outliers can have deleterious effects on model accuracy. We used Monte Carlo crossvalidation method to identify potential outliers in NIR spectra. As shown in Figure 3, three samples were identified as outliers and thus excluded from model building.

Determination of PC Cultivation Regions
A classification model based on an RFM integration algorithm was used to identify PC cultivation regions. RFM integration algorithm could classify samples from different cultivation regions based on their similarities calculated in the model-training process.
To develop a more reliable model, samples were divided into calibration group (108 samples) and validation group (27 samples) at a ratio of 4:1 by classical Kennard-Stone (K-S) algorithm. The calibration group was used to optimize the model parameters and establish the optimal classification model. The validation group was used to verify the performance of the classification model. The separation plot of samples from different cultivation regions was shown in Figure 4.  Furthermore, the validation group was used to validate the reliability of the classification model we developed. The accuracy of classification prediction model was calculated as 92.59%, or 25 out of 27, with false results from only two samples (Table 3).

Model Development and Optimization
By using Kennard-Stone (K-S) algorithm, 108 samples were selected to optimize the pretreatment processes and develop the prediction model, while the remaining 27 samples were used to validate the quantitative prediction model. Detailed information is provided in Table 4; the results show that the calibration group covers the range of the validation group, making the model more stable. In this section, a PLSR algorithm was used to build these quantitative prediction models, and the number of PLSR factors were optimized by 10-fold cross-validation method.  We further employed several signal pretreatment methods, including multiplicative scatter correction (MSC), Savizky-Golay method (SG), standard normal variate (SNV), and combined process methods, to improve the model performance, which was measured by RMSEC, R 2 cal , RMSEP and R 2 pre . The results obtained by different signal preprocess methods were showed in Table 5 and suggested that SG-1D was the most suitable for the water-soluble extracts and polysaccharides, while SNV worked best for the ethanol-soluble extracts and sum of five triterpenoids. The results in Table 5 allow us to develop the prediction models using the optimized experimental conditions. The model for determination of water-soluble extract, alcoholsoluble extract, polysaccharides and the sum of five triterpenoids generated R 2 cal of 0.978, 0.972, 0.987 and 0.975, respectively, and exhibited RMSEC values of 1.083, 1.502, 0.056 and 0.165, respectively. In general, a good model would generate results with R 2 cal and R 2 pre close to 1, and RMSEC and RMSEP close to 0. A value of R greater than 0.9 indicates a strong correlation between the determined and actual values, and great accuracy of determination. The results in Table 5, with R cal and R pre values close to 1 and RMSEP and RMSEC values close to 0, demonstrate that the calibration models are capable of identifying the producing regions with good accuracy.

Validation of Quantitative Model
After the above processes were completed, the validation group of samples were used to validate the optimal prediction model. From Table 5, the values of R 2 pre and RMSEP are: 0.962 and 1.934, respectively, for water-soluble extract; 0.940 and 1.926, respectively, for alcohol-soluble extract; 0.965 and 0.0079, respectively, for polysaccharides; and 0.961 and 0.191, respectively, for the sum of five triterpenoids, as listed in Table 5. These results demonstrated the high accuracy of the models we developed.
To further evaluate the model, the relative error (RE) of the determined values was calculated. In general, the RE is less than 5% and indicates that the determined values are very close to the values in the reference. As can be seen from Figure 5, the RE of polysaccharides, water-soluble extract, alcohol-soluble extract and the sum of five triterpenoids is 4.055%, 3.821%, 4.344% and 3.744%, respectively. Therefore, it can be concluded that four models have high levels of accuracy and are capable of the fast quantitative analysis of PC chemical components.

Conclusions
The present study combined NIR spectroscopy with chemometrics, including RFM and PLSR, and developed models to determine PC cultivation regions and quality parameters represented by water-soluble extract, alcohol-soluble extract, polysaccharides and the sum of five triterpenoids. We selected optimal pretreatment method and chemometric method to optimize sample classification and generate quantitative models. Using the classification model, we achieved a classification accuracy of 92.59% for the validation group, with only two out of 27 samples incorrectly classified. In the meantime, we developed and optimized the PLSR model that was independent of the calibration or prediction set, with R 2 greater than 0.9 and relative error smaller than 5% for most of the samples. The results verified the feasibility of this method in measuring the content of PC chemical components.
PC is a medicinal and nutritional material commonly used in functional food, healthy snacks and drinks. In recent years, demand for PC has increased continuously, leading to the increasing popularity of PC cultivation. Many factors of cultivation environment affect the quality of PC and cause discrepancies in the quality of products from different regions. NIR spectroscopy is an electromagnetic wave that is mainly the combination of overtone bands from hydrogen-containing function groups, such as C-H, N-H, O-H and their combination of both C-O stretching and O-H deformation. These absorptions contain rich composition information, making NIR spectroscopy very suitable for the measurement of hydrocarbon organic compounds in PC polysaccharides, water-soluble extract, alcoholsoluble extract and triterpene acids. In this study, we used the NIR method, which has multiple advantages over traditional methods, to rapidly and accurately evaluate the quality of PC. The method can be applied to general screening and rapid quality estimation, which, together with more samples in the future, will further improve the accuracy of the models we developed.