1. Introduction
Poria, the dried sclerotium of the fungus
Poria cocos (Schw.) Wolf from the Polyporaceae family, is a valuable traditional Chinese medicinal herb widely used in both food and medicine [
1,
2]. Its annual societal demand exceeds tens of thousands of tons.
Poria cocos has the effects of promoting diuresis, eliminating dampness, strengthening the spleen, and calming the mind, making it valuable in a variety of applications including health foods, cosmetic products, and pharmaceutical preparations [
3,
4]. According to numerous studies, the primary active components of
Poria cocos are polysaccharides and triterpenoids [
5,
6]. Polysaccharides account for 84% by weight of all constituents in the dried sclerotium [
7]. With the increasing cultivation and application of
Poria cocos, researchers have discovered more of its medicinal values, such as anti-tumor, anti-inflammatory, and immune-regulating functions [
8,
9].
According to reports from the China Report Hall in 2024, China is the primary producer of Poria cocos, accounting for approximately 70% of global production. In 2022, the production in China reached 252,000 tons, and by 2024, the national output had increased to 500,000 tons, representing an 8% year-on-year growth. It is projected that by 2025, the cultivated area in China will exceed 2 million mu, with an annual production of 1 million tons. These data indicate that China is the main cultivation region for Poria cocos. Due to differences in geography, climate and other factors, samples from different provinces vary significantly in appearance, nutritional composition, and bioactive constituents. For example, Xia et al. [
10] reported that Poria samples from eight provinces (including Yunnan, Sichuan and Anhui) exhibited significant differences in the content of seven major triterpenoid compounds. In addition, three carbohydrates (mannose, galactose, and palatinose) had the highest relative abundance in samples from Yunnan (YN), followed by Anhui (AH), and the lowest in Hunan (JZ), whereas four amino acids (proline, L-alanine, L-norleucine, and kainic acid) showed the opposite trend, with the highest relative abundance in JZ samples, followed by AH, and the lowest in YN [
11].
The processing of
Poria cocos (including peeling, slicing, steaming and other procedures) significantly alters its original morphological and chemical characteristics [
12]. The processing techniques for
Poria cocos differ across China, varying in specific procedures and excipient dosages. While the Traditional Chinese Medicine sector features a lengthy industrial chain, the accountability for managing its standardization is not clearly established [
13,
14]. Compounding this issue, the current quality control system for Chinese herbal medicines lacks standardized authentication labeling for processed Poria products. The quality evaluation of
Poria cocos products is a comprehensive system, primarily based on the standards of the
Pharmacopoeia of the People’s Republic of China. Key parameters include: active ingredient content (e.g.,
Poria cocos polysaccharides and triterpenes), physicochemical indicators (moisture, ash, and extractives), safety indicators (heavy metals and pesticide residues), and traditional appearance traits (color, texture, and pattern).
These challenges make it difficult to accurately trace the geographic origin of Poria using traditional morphological identification methods. Accurate origin identification, however, is essential for the rational development and management of Poria resources across different regions and provides a scientific basis for clinical applications. Therefore, developing rapid and reliable methods for determining the geographical origin of Poria holds significant practical value in both commercial and research contexts.
Currently, common methods for determining the origin of Poria include terahertz (THz) spectroscopy, hyperspectral imaging [
15], electronic nose technology, high-performance liquid chromatography (HPLC), and near-infrared (NIR) spectroscopy, among others [
16,
17]. However, these techniques have several limitations, such as high instrumentation costs, complex data processing, and destructive sample preparation. In contrast, Raman spectroscopy (RS) is a powerful molecular spectroscopy technique that enables rapid, non-destructive analysis while providing detailed molecular structural information. With its advantages of high detection efficiency and low operational costs, Raman spectroscopy has been widely applied in the analysis of polymer materials, biomolecules, pharmaceuticals, and food products [
18,
19]. For instance, Liu et al. [
20] successfully differentiated three starch types using Raman spectroscopy combined with PCA-SVM analysis. de Angelis et al. [
21] utilized surface-enhanced Raman spectroscopy (SERS) for rapid determination of phenolic compounds in chamomile. Tian et al. [
22] applied Raman spectroscopy combined with chemometrics to classify rice samples from different regions in Heilongjiang Province, using PCA to reduce the dimensionality of the spectral data. Their study demonstrated that PCA effectively extracts Raman spectral features of rice, thereby improving classification and prediction accuracy. Although Raman spectroscopy has not yet been applied to the identification of Poria origin, its proven utility in food authentication and in traceability studies of traditional Chinese medicines suggests strong potential in this field. In a different sensing modality, Peta et al. used gel tactile sensing to record and discriminate fingerprints. This technique could potentially quantify micro-surface features of Poria cocos, such as roughness, texture, and pores, providing a complementary approach for origin identification alongside chemical analysis [
23].
Principal Component Analysis (PCA) is a computationally efficient, unsupervised machine learning method that excels in dimensionality reduction and cluster analysis [
24,
25]. It can extract essential structural information from datasets with limited sample sizes. Unlike many supervised machine learning methods, which typically require large sample sizes to achieve good predictive accuracy, PCA remains effective when only a small number of samples is available.
In this study, Poria samples from Chuxiong in Yunnan, Shangyu in Shaanxi, Yuexi in Anhui and Shennongjia in Hubei were selected as research subjects. A novel approach was developed by combining Raman spectroscopy with an improved PCA algorithm—multi-matrix projection discrimination—to establish an accurate and efficient model for identifying the geographical origin of Poria.
4. Geographical Discrimination of Poria Based on Grouped Modeling and Projection
4.1. Grouping Principle
The above results indicate that the spectral differences in Poria cocos caused by geographical origin are relatively small. Therefore, the principal component vectors obtained from PCA performed directly on the entire spectral dataset cannot effectively capture the variations among spectra from different origins.
According to Equations (5) and (6), any individual spectrum $x$ can be mathematically represented by its mean spectrum, eigenvectors, principal component scores, and deviations:
$$x = \bar{x} + P_k s_k + e \qquad (7)$$
Here, $\bar{x}$ represents the mean spectrum, $s_k$ represents the score vector on the first k principal components, $P_k$ is the matrix composed of the first k eigenvectors, and $e$ is the residual resulting from the incomplete reconstruction of the original spectrum using the first k principal components.
Equation (7) indicates that any spectrum can be expressed as a linear combination of three components: the common spectral background, the reconstructed spectrum in the principal component subspace, and the residual.
When the eigenvector matrix $P_k$ effectively captures the most significant and representative variation patterns of spectra from a given geographical origin, the spectra of samples from that origin can be optimally reconstructed within the principal component subspace. In this case, the corresponding residual $e$ is minimized, and the reconstruction error, defined as the Euclidean norm of the residual, $d = \lVert e \rVert_2$, becomes smaller.
The reconstruction error $d$ essentially reflects the degree of matching between a sample spectrum and the spectral model of a given geographical origin. A smaller $d$ indicates a higher degree of correspondence between the sample spectrum and the origin-specific spectral model.
Based on the above analysis, we propose a method for geographical origin discrimination using the residual magnitude. Specifically, PCA is performed separately on the spectra from each origin to obtain the corresponding origin-specific eigenvector matrices. A sample spectrum from a particular origin is then projected and reconstructed onto each of these eigenvector matrices. The reconstructed spectrum is most accurately restored in the principal component subspace of its true origin, resulting in the smallest residual and minimal reconstruction error. Accordingly, the sample can be assigned to the origin corresponding to the minimal residual.
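The discrimination rule described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names (`fit_origin_model`, `reconstruction_error`, `classify`), the synthetic spectra, and the choice of k = 3 are all hypothetical and serve only to show the mechanics.

```python
import numpy as np


def fit_origin_model(spectra, k):
    """Fit PCA to one origin; return (mean spectrum, n_points x k eigenvector matrix)."""
    mean = spectra.mean(axis=0)
    # Rows of vt are orthonormal principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(spectra - mean, full_matrices=False)
    return mean, vt[:k].T


def reconstruction_error(spectrum, model):
    """Euclidean distance between a spectrum and its reconstruction in one origin's subspace."""
    mean, components = model
    centered = spectrum - mean
    scores = components.T @ centered            # projection onto the first k components
    reconstructed = mean + components @ scores  # inverse transform plus mean spectrum
    return float(np.linalg.norm(spectrum - reconstructed))


def classify(spectrum, models):
    """Assign the spectrum to the origin whose model leaves the smallest residual."""
    errors = {origin: reconstruction_error(spectrum, m) for origin, m in models.items()}
    return min(errors, key=errors.get), errors


# Synthetic stand-in data (assumption): each origin varies along its own 3 latent directions.
rng = np.random.default_rng(0)
n_points = 200
origins = ["Yunnan", "Anhui", "Shaanxi", "Hubei"]
bases = {o: rng.normal(size=(3, n_points)) for o in origins}


def simulate(origin, n):
    """Draw n synthetic spectra for one origin (illustrative only)."""
    return rng.normal(size=(n, 3)) @ bases[origin] + 0.05 * rng.normal(size=(n, n_points))


models = {o: fit_origin_model(simulate(o, 25), k=3) for o in origins}
test_spectrum = simulate("Yunnan", 1)[0]
predicted, errors = classify(test_spectrum, models)
print(predicted, errors)
```

Keeping one (mean, eigenvector matrix) pair per origin means that adding a new origin only requires fitting one more model; the existing models are left unchanged.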
To clearly illustrate the discrimination procedure, a test spectrum from Yunnan is used as an example, detailing its projection and reconstruction across the eigenvector matrices of each origin and the corresponding origin classification method.
4.2. Data Processing Procedure
4.2.1. Establishment of Respective Principal Component Matrices for Different Origins
First, PCA was performed separately on each dataset in a grouped manner: PCA was independently applied to the dataset of each origin (comprising 25 preprocessed Raman spectra), resulting in four distinct principal component matrices. Each matrix contains 24 principal components, with the corresponding eigenvalues and contribution rates arranged in descending order. This ordering facilitates the interpretation of component significance across origins and provides a rational basis for selecting the optimal number of principal components for classification.
Figure 4 illustrates the cumulative contribution rates of principal components for the Yunnan, Anhui, Shaanxi, and Hubei datasets.
As shown in
Figure 4, the cumulative contribution rates of the first six principal components for all four geographical origins exceeded 90%, indicating that these components retain the vast majority of the information contained in the original spectra. This demonstrates that PCA can significantly reduce data complexity while preserving the main spectral features. Among the four origins, the Yunnan samples exhibit the highest contribution rate for the first principal component, whereas the Shaanxi samples show the lowest. This suggests that a single dominant pattern of variation explains a larger share of the total variance in the Yunnan dataset than in the other datasets. Such a distinction may reflect differences in the chemical composition or physical structure of Yunnan Poria compared to samples from other regions. These disparities in principal component eigenvalues may serve as useful indicators for both quality assessment and geographical origin identification of Poria.
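The cumulative contribution rates plotted for each origin can be computed directly from the eigenvalues of the covariance matrix. The sketch below is illustrative only: the helper names `cumulative_contribution` and `components_needed` and the synthetic 25 × 300 dataset are assumptions, not part of the original analysis.

```python
import numpy as np


def cumulative_contribution(spectra):
    """Cumulative variance contribution rates of the principal components of one dataset."""
    centered = spectra - spectra.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    eigenvalues = singular_values**2 / (spectra.shape[0] - 1)  # covariance eigenvalues
    return np.cumsum(eigenvalues) / eigenvalues.sum()


def components_needed(spectra, threshold=0.90):
    """Smallest number of components whose cumulative contribution reaches the threshold."""
    cumulative = cumulative_contribution(spectra)
    return int(np.searchsorted(cumulative, threshold) + 1)


# Synthetic stand-in for one origin (assumption): 25 spectra, 300 points, 4 latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(25, 4)) * np.array([10.0, 5.0, 2.0, 1.0])
spectra = latent @ rng.normal(size=(4, 300)) + 0.01 * rng.normal(size=(25, 300))

cumulative = cumulative_contribution(spectra)
k = components_needed(spectra)
print(np.round(cumulative[:6], 3), k)
```

With 25 spectra per origin, centering removes one degree of freedom, which is why at most 24 components carry non-zero variance.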
4.2.2. Testing Spectral Projection Reconstruction
We selected the first k principal components to construct the classification model and obtained the origin-specific eigenvector matrices $P_k^{(i)}$ (i = 1, 2, 3, 4). For classification, a test Raman spectrum of known origin is selected, denoted as t1, and assumed to belong to the Yunnan group. First, the mean spectrum of each origin is subtracted from the test sample spectrum t1 to obtain four centralized spectra:
$$\tilde{t}_1^{(i)} = t_1 - \bar{x}^{(i)}, \quad i = 1, 2, 3, 4$$
where $\bar{x}^{(i)}$ represents the mean spectrum of the i-th origin dataset. This step serves to eliminate baseline offsets among different origins and enhances characteristic differences between classes.
Next, a projection transformation is performed using the eigenvector matrix of each origin, yielding the corresponding principal component scores:
$$s^{(i)} = \left(P_k^{(i)}\right)^{\mathsf{T}} \tilde{t}_1^{(i)}$$
Subsequently, by applying the inverse transformation to the principal component scores and adding back the respective mean spectrum, the reconstructed spectra of sample t1 for each origin are obtained:
$$\hat{t}_1^{(i)} = P_k^{(i)} s^{(i)} + \bar{x}^{(i)}$$
To clearly visualize the differences between the reconstructed spectra and the original test spectrum
t1, we selected the first seven principal components to construct the classification model. The comparison results between the reconstructed spectra and
t1 are presented in
Figure 5.
As illustrated in the figure, the reconstructed spectra from the four origins exhibit a high degree of consistency with t1 in several key aspects, including the overall spectral morphology, the positions and intensities of characteristic peaks, and the general spectral trend. This consistency demonstrates that the eigenvector matrices corresponding to each origin effectively capture and represent the key spectral features and structural information within the Poria Raman data. The information retained in these matrices is sufficient to accurately reconstruct spectra that closely resemble the original sample through projection-based recovery. Moreover, these results suggest a degree of intrinsic similarity or correlation among the Raman spectra of Poria from different geographical origins. Despite the geographical differences, the principal components extracted from each origin share a considerable overlap in the primary information dimensions. Notably, in the region around 1100 cm−1, the reconstructed spectra from Yunnan and Shaanxi show better alignment with t1 compared to those from Anhui and Hubei, indicating possible differences in spectral detail that may serve as a basis for origin differentiation.
4.2.3. Reconstruction Error
Based on the above process, four reconstructed spectra of the test sample
t1 were obtained. The reconstruction errors between each reconstructed spectrum and
t1 were calculated and compared. A smaller reconstruction error indicates that
t1 is more similar to the reconstructed spectrum in the principal component subspace of that origin, suggesting that
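t1 is more likely to originate from this geographical source. The reconstruction error $d_i$ for the i-th origin is expressed as the Euclidean distance between t1 and its reconstruction $\hat{t}_1^{(i)}$ from that origin, and is calculated using the following formula:
$$d_i = \left\lVert t_1 - \hat{t}_1^{(i)} \right\rVert_2 = \sqrt{\sum_{j=1}^{n} \left(t_{1,j} - \hat{t}_{1,j}^{(i)}\right)^2}$$
where n is the number of spectral points.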
Figure 6 presents a histogram of the Euclidean distances between
t1 and the four reconstructed spectra. As shown in the figure, the Euclidean distance between
t1 and the reconstructed spectrum based on the Yunnan eigenvector matrix is the smallest. This suggests that
t1 most likely originates from Yunnan, which is consistent with our initial assumption.
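A short derivation may help clarify what the reconstruction error measures. It assumes only that the columns of each eigenvector matrix are orthonormal, which holds for PCA eigenvectors. The notation is chosen here for illustration: $t_1$ is the test spectrum, $\bar{x}^{(i)}$ the mean spectrum of the i-th origin, $P_k^{(i)}$ its matrix of the first k eigenvectors, $\hat{t}_1^{(i)}$ the reconstruction of $t_1$ from that origin, and $d_i$ the corresponding reconstruction error.

```latex
\begin{aligned}
t_1 - \hat{t}_1^{(i)}
  &= \left(t_1 - \bar{x}^{(i)}\right)
     - P_k^{(i)} \left(P_k^{(i)}\right)^{\mathsf{T}} \left(t_1 - \bar{x}^{(i)}\right)
   = \left(I - P_k^{(i)} \left(P_k^{(i)}\right)^{\mathsf{T}}\right)\left(t_1 - \bar{x}^{(i)}\right), \\
d_i^2
  &= \left\lVert t_1 - \bar{x}^{(i)} \right\rVert_2^2
     - \left\lVert \left(P_k^{(i)}\right)^{\mathsf{T}} \left(t_1 - \bar{x}^{(i)}\right) \right\rVert_2^2 .
\end{aligned}
```

In other words, $d_i$ is the length of the part of the centered spectrum that lies outside the principal component subspace of the i-th origin. For a fixed origin, it cannot increase as k grows, so the comparison across origins is only meaningful when the same k is used for all of them.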
4.3. Relationship Between the Number of Principal Components and Accuracy
The number of principal components directly influences the classification performance of the model. Using too few components may result in underfitting, while too many may lead to overfitting, ultimately degrading the model’s predictive accuracy and its generalization ability to new test data. Therefore, selecting an appropriate number of principal components is crucial. Generally, the number of required principal components is determined by calculating and accumulating the variance contribution rates of each component until a predefined threshold (e.g., 85% or 90%) is reached. The optimal number should be chosen based on specific research goals and practical considerations. In this study, we evaluated classification accuracy of the model using different numbers of principal components. As a result, the first six principal components were selected for model construction, as they offered the best balance between dimensionality reduction and classification performance. The corresponding confusion matrices for various numbers of principal components are shown in
Figure 7.
When the first six principal components are used to construct the feature vector matrices, the cumulative contribution rate exceeds 90% for all four origins, as illustrated in
Figure 4. This indicates that these components effectively capture the majority of the significant spectral features in the Raman data of Poria from each origin. Under this configuration, the model achieves its highest classification accuracy of 97.5%. The recall rates for Yunnan, Anhui, Shaanxi, and Hubei samples are 100%, 90%, 100%, and 100%, respectively. The results indicate that, even with a limited number of samples, performing PCA in a grouped manner can effectively extract the spectral features of each geographical origin and construct origin-specific eigenvector matrices. When a test spectrum is reconstructed using the eigenvectors (principal components) of its true origin, the reconstruction error for that origin is minimized. Using the minimal reconstruction error for origin discrimination, an accuracy of 97.5% was achieved.
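As a consistency check, the reported accuracy and recall values can be reproduced from a 4 × 4 confusion matrix for the 40 test spectra. The matrix below is an assumption constructed to match the reported recall rates; in particular, the origin to which the single misclassified Anhui spectrum was assigned is not stated in the text and is placed arbitrarily.

```python
import numpy as np

origins = ["Yunnan", "Anhui", "Shaanxi", "Hubei"]

# Rows: true origin, columns: predicted origin (10 test spectra per origin).
# Assumption: the one misclassified Anhui spectrum is put in the Hubei column
# purely for illustration; the text does not say where it was assigned.
confusion = np.array([
    [10, 0, 0, 0],
    [0, 9, 0, 1],
    [0, 0, 10, 0],
    [0, 0, 0, 10],
])

accuracy = np.trace(confusion) / confusion.sum()      # correct predictions / all predictions
recall = np.diag(confusion) / confusion.sum(axis=1)   # per-origin correct / per-origin total
print(accuracy)
print(dict(zip(origins, recall.tolist())))
```

With 39 of 40 spectra on the diagonal, the overall accuracy is 39/40 = 97.5%, and the Anhui recall is 9/10 = 90%, in agreement with the values given above.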
To mitigate the potential randomness caused by the small sample size, leave-one-out cross-validation (LOOCV) and an independent test set were employed. In the training set, each sample was sequentially left out while varying the number of principal components k from 1 to 15. The results showed that the highest accuracy in cross-validation was achieved at k = 12 (97.00% ± 17.14%), whereas at k = 6, the accuracy was 94.00% ± 23.87%, with the results tending to stabilize.
An independent test set, completely excluded from model training (4 origins × 10 spectra), was used to further assess generalization. The highest test set accuracy of 97.5% was obtained at k = 6, whereas at k = 12, the accuracy dropped to 87.5%. These findings suggest that although k = 12 performed best in cross-validation, it exhibited signs of overfitting, while the model with k = 6 achieved a better balance between accuracy and generalization ability.
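The leave-one-out sweep over k can be sketched as follows. This is a hedged illustration rather than the study's code: the helper names (`fit`, `residual`, `loocv_outcomes`), the synthetic data, and the range of k are assumptions. Each held-out spectrum is removed from its own origin's training group, that origin's PCA model is refitted, and the spectrum is then assigned to the origin with the smallest reconstruction error.

```python
import numpy as np


def fit(spectra, k):
    """Mean spectrum and first k principal directions (as columns) of one origin."""
    mean = spectra.mean(axis=0)
    _, _, vt = np.linalg.svd(spectra - mean, full_matrices=False)
    return mean, vt[:k].T


def residual(x, model):
    """Norm of the part of the centered spectrum lying outside the origin's subspace."""
    mean, p = model
    centered = x - mean
    return np.linalg.norm(centered - p @ (p.T @ centered))


def loocv_outcomes(data, k):
    """Return a 0/1 array: 1 where the held-out spectrum was assigned to its true origin."""
    full_models = {origin: fit(group, k) for origin, group in data.items()}
    outcomes = []
    for true_origin, spectra in data.items():
        for i in range(len(spectra)):
            models = dict(full_models)
            # Refit only the held-out spectrum's own origin, without that spectrum.
            models[true_origin] = fit(np.delete(spectra, i, axis=0), k)
            predicted = min(models, key=lambda o: residual(spectra[i], models[o]))
            outcomes.append(int(predicted == true_origin))
    return np.array(outcomes)


# Synthetic stand-in data (assumption): 4 origins x 12 spectra, 3 latent directions each.
rng = np.random.default_rng(2)
data = {}
for origin in ["Yunnan", "Anhui", "Shaanxi", "Hubei"]:
    basis = rng.normal(size=(3, 120))
    data[origin] = rng.normal(size=(12, 3)) @ basis + 0.05 * rng.normal(size=(12, 120))

results = {}
for k in range(1, 9):
    outcomes = loocv_outcomes(data, k)
    # Mean accuracy and sample standard deviation of the per-spectrum 0/1 outcomes.
    results[k] = (outcomes.mean(), outcomes.std(ddof=1))
    print(k, results[k])
```

For 100 per-spectrum 0/1 outcomes, an accuracy of 97% corresponds to a sample standard deviation of about 17.14%, and 94% to about 23.87%. The spreads reported above are consistent with this way of summarizing the folds, which would explain why they appear large relative to the mean.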
4.4. PCA-SVM
The core principle of Support Vector Machines (SVM) is to construct a maximum-margin hyperplane in the feature space, thereby achieving optimal separation between classes. The decision boundary of SVM is determined solely by a few critical samples near the classification margin (called support vectors), rather than the entire dataset. This characteristic enables SVM to maintain strong generalization performance, even when trained on limited data, effectively addressing the dependency of machine learning models on large training sets. Liu et al. employed Raman spectroscopy coupled with PCA-SVM to classify three starch types, achieving an optimized test set accuracy of 93.67% [
20].
In this study, 100 preprocessed Raman spectra of Poria cocos were used as the training set, and 40 spectra were reserved as the test set. PCA was first applied for dimensionality reduction, followed by selection of an appropriate number of principal components and kernel functions to construct the classification models. Ten-fold cross-validation was employed to robustly evaluate model performance.
For the linear kernel, after systematically adjusting the number of principal components and the penalty parameter C, the optimal performance was achieved with the first 9 principal components and C = 8, yielding a cross-validation accuracy of 0.91, a training set accuracy of 0.97, and a test set accuracy of 0.90.
For the RBF kernel, the model performed best with the first 10 principal components, C = 10, and γ = 0.00, resulting in a cross-validation accuracy of 0.91, a training set accuracy of 0.96, and a test set accuracy of 0.925.
However, when using the polynomial kernel, the classification performance was clearly inferior to that of the linear and RBF kernels.
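For comparison, a PCA-SVM baseline of the kind described above can be assembled with scikit-learn. The sketch below is an assumption-laden illustration, not the configuration used in this study: the synthetic spectra, the parameter grid, and the use of `GridSearchCV` are chosen for demonstration, and the text does not state whether PCA was refitted inside each cross-validation fold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 4 origins, each a class-specific spectrum plus noise.
rng = np.random.default_rng(0)
n_points, n_classes = 150, 4
centers = rng.normal(size=(n_classes, n_points))


def simulate(n_per_class):
    """Draw n_per_class noisy spectra per origin and the matching integer labels."""
    X = np.vstack([c + 0.8 * rng.normal(size=(n_per_class, n_points)) for c in centers])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y


X_train, y_train = simulate(25)  # 100 training spectra
X_test, y_test = simulate(10)    # 40 held-out test spectra

# PCA sits inside the pipeline, so it is refitted on the training part of every fold.
pipeline = make_pipeline(PCA(), SVC())
param_grid = {
    "pca__n_components": [3, 6, 9, 10],
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [1, 8, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=10)  # ten-fold cross-validation
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, search.best_score_, test_accuracy)
```

Placing PCA inside the pipeline keeps the held-out fold from influencing the principal components, which avoids an optimistic bias in the cross-validation accuracy.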
Table 2 presents a comparison of the classification accuracy between the PCA-SVM model and the multi-matrix projection discrimination model based on PCA for Poria geographical origin identification. The results indicate that, given the current dataset size, the proposed method achieves higher accuracy in distinguishing the origins of Poria.
4.5. Discussion
The proposed method achieved 97.5% accuracy on the test set, with LOOCV and independent testing indicating that six principal components provided the best balance between accuracy and generalization. In comparison, a PCA-SVM model optimized with 10-fold cross-validation achieved a maximum test set accuracy of 92.5% using the RBF kernel. These results suggest that, with limited samples, the proposed approach captures discriminative features among different origins more effectively, yielding higher accuracy and better generalization ability.
This study achieved high classification accuracy and demonstrates strong practical and economic potential. The method requires no complex pretreatment or costly reagents, enabling rapid, nondestructive detection. It can prevent adulteration and false origin labeling, supporting standardized and intelligent traceability of Poria cocos. Overall, this approach provides both a methodological innovation and a practical tool with significant potential for large-scale application.
Although this study achieved promising results, it still has certain limitations. The sample size was limited and covered only four production regions, which may restrict the generalizability of the model. Future work will focus on expanding the dataset to include more regions and other medicinal materials, developing portable Raman devices for rapid on-site detection, building a standardized Poria cocos spectral database, and further optimizing classification algorithms to improve stability and accuracy.