Article

Geochemical Anomaly Detection via Supervised Learning: Insights from Interpretable Techniques for a Case Study in Pangxidong Area, South China

1 China Aero Geophysical Survey and Remote Sensing Center for Natural Resources, Beijing 100083, China
2 Center for Earth Environment & Resources, Sun Yat-sen University, Zhuhai 519000, China
* Author to whom correspondence should be addressed.
Minerals 2026, 16(1), 49; https://doi.org/10.3390/min16010049
Submission received: 3 November 2025 / Revised: 29 December 2025 / Accepted: 29 December 2025 / Published: 31 December 2025

Abstract

Machine learning (ML) algorithms are widely applied across various fields due to their ability to extract high-level features from large training datasets. However, their use in geochemical prospecting and mineral exploration remains limited because mineralization, a rare geological event, often results in insufficient training samples for supervised ML. Generating adequate training data is thus essential for applying supervised ML in this domain. In this study, we augmented training samples by utilizing adjacent samples centered around known mineral deposits and then employed random forest (RF) modeling to identify multivariate geochemical anomalies associated with mineralization. To evaluate the robustness of data augmentation and gain insights into the geochemical survey data, we applied interpretable ML techniques, feature importance and partial dependence plots (PDPs), to clarify the data processing within mineral prospectivity mapping. The proposed methodology was tested in the Pangxidong Area, South China. The identified geochemical anomalies show strong spatial correlation with known mineral deposits, while feature importance rankings and PDPs validate the effectiveness of the proposed methodology. This practice enhances the applicability of supervised ML in geochemical prospecting and mineral exploration and demonstrates the value of interpretable techniques for understanding how multi-source geoinformation is processed.

1. Introduction

The delineation of geochemical anomalies, that is, discriminating anomalous samples from a population of geochemical samples, is important for locating mineral exploration targets [1]. Various methods have been introduced and adapted to describe the complex population distribution of geochemical survey data effectively. Besides traditional frequency-based univariate and multivariate statistical methods, the family of power-law-based fractal/multifractal models has been widely applied for geochemical anomaly identification [2]. Among machine learning approaches, most geochemical anomaly detection is conducted in an unsupervised manner [3], where known mineral deposits/occurrences are conventionally used to interpret and validate the resulting anomalies. For example, the continuous restricted Boltzmann machine [4], deep autoencoder network [5,6], and dictionary learning [7] models encode and decode the datasets and flag as anomalous those samples that are poorly reconstructed, i.e., that deviate from the main population. Isolation forest models [8] and unsupervised K-nearest neighbor (KNN) models [9] instead identify geochemical anomalies directly by calculating an anomaly score for each geochemical sample.
A few supervised learning algorithms have been applied in geochemical prospecting to narrow down targeting areas for mineral exploration. However, because mineralization is a rare geological event, the number of available training samples is usually insufficient, which impedes the application of supervised learning in this domain [10]. The task can nevertheless be formalized as a highly imbalanced supervised classification/regression problem, in which the effectiveness of classification and regression models is limited by a training dataset built from a small number of known mineral deposits and a large amount of unlabeled non-deposit locations [11]. Models employed for this purpose include the stepwise regression model [12], geographically weighted regression model [13], maximum margin metric learning model [14], geographically weighted lasso model [15], and the deep convolutional neural network model [10], the last of which utilized known mineral deposit locations and their adjacent areas as labeled samples.
For data analysis, ML can leverage the information within datasets without presuming specific data distributions, such as normal or power-law distributions; instead, learning-based principles are applied to extract meaningful information. However, the rise of increasingly complex black-box models has made interpreting the data processing of certain ML algorithms a considerable challenge. In this study, geochemical anomaly delineation is framed as a supervised learning task. To elucidate the role of each geochemical element as an evidence map in mineral prospectivity mapping, interpretable ML techniques (i.e., feature importance and PDPs) were employed for assessment. Mineral-deposit-based positive samples and randomly selected negative samples were combined to form different training datasets for random forest algorithms. The main objective of this study is to provide insights into the delineated geochemical anomalies through interpretable techniques for supervised machine learning and, furthermore, to broaden the horizons for dealing with geochemical survey data.

2. Methods

2.1. Random Forest

Random forest is an ensemble learning algorithm that aggregates predictions from multiple decision trees to form a robust predictive model [16]. These decision trees, which constitute the “forest,” serve as the base classifier. Each tree is trained on a distinct subset of the original training data. These diverse subsets are generated using a sampling technique called “bootstrap aggregating” (or “bagging”) [17,18]. Bagging creates each subset by randomly sampling the original dataset with replacement, which means selected instances are not removed after being chosen and can be reused for training subsequent trees.
Random forest trees are constructed by recursively splitting the root node into binary child nodes. This splitting process at each internal node continues iteratively until a pre-specified stopping condition is met [18,19]. Unlike a standard decision tree, each node split within a random forest tree uses only a randomly selected subset of the available predictor variables as potential discriminative conditions. The algorithm evaluates all possible splits within this subset to identify the one that maximizes the purity of the resulting child nodes. Purity measures the homogeneity of class labels within a node; a node is pure if all samples belong to the same class. This study employs the Gini index (IG), a widely used measure of impurity (or, inversely, purity), calculated as follows [20]:
$I_G(f) = \sum_{i=1}^{n} f_i (1 - f_i)$
where $n$ is the number of classes and $f_i$ is the proportion (probability) of samples belonging to class $i$ at the node, defined as:

$f_i = \frac{m_i}{m}$

Here, $m_i$ is the number of samples of class $i$, and $m$ is the total number of samples at the node. The final prediction of the random forest is determined by majority voting over the predictions of all individual trees. The dual randomness of selecting a different bootstrap sample for each tree and a random subset of variables for each node split reduces correlation between trees and increases ensemble diversity, which effectively enhances the algorithm's robustness and mitigates overfitting [21,22].
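As an illustration, the Gini index above and the majority-vote ensemble can be sketched in a few lines of Python; the data are synthetic and the helper name `gini_impurity` is ours, not from the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def gini_impurity(labels):
    """Gini index I_G = sum_i f_i * (1 - f_i), where f_i is the
    class-i proportion among the samples at a node."""
    _, counts = np.unique(labels, return_counts=True)
    f = counts / counts.sum()
    return float(np.sum(f * (1.0 - f)))

# A pure node has impurity 0; an evenly mixed binary node has 0.5.
print(gini_impurity([1, 1, 1, 1]))   # 0.0
print(gini_impurity([0, 0, 1, 1]))   # 0.5

# Bagging plus per-split feature subsampling and majority voting is
# what RandomForestClassifier implements; a toy fit on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
```

A node split is chosen to maximize the impurity decrease, i.e., the parent's Gini minus the weighted Gini of the two children.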

2.2. Partial Dependence Plots

The partial dependence plot (PDP) visualizes the marginal effect of one or two features on the predictions generated by a machine learning model [23]. It reveals the nature of the relationship between a target variable and a selected feature, indicating whether it is linear, monotonic, or more complex. For example, in the scenarios of linear regression models, PDPs consistently yield linear relationships. The partial dependence function for a regression model is formally defined as follows:
$\hat{f}_S(X_S) = E_{X_C}\left[\hat{f}(X_S, X_C)\right] = \int \hat{f}(X_S, X_C)\, dP(X_C)$
where the set $S$ contains the features for which we want to visualize partial dependence (typically one or two features), while $X_C$ represents all other features in the machine learning model $\hat{f}$, treated as random variables. Together, the feature vectors $X_S$ and $X_C$ comprise the complete input space. Partial dependence computes the marginal effect by averaging model predictions over the distribution of $X_C$. This marginalization yields a function that isolates the relationship between the features in $S$ and the predicted outcome, incorporating their interactions with other features while eliminating the dependence on $X_C$. The partial dependence function $\hat{f}_S$ is estimated using the Monte Carlo method, which averages predictions over the training data while conditioning on the features in $S$:
$\hat{f}_S(X_S) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}\left(X_S, X_C^{(i)}\right)$
The partial dependence function $\hat{f}_S$ quantifies the average marginal effect on model predictions for specified values of the features in set $S$. In the estimation formula, $X_C^{(i)}$ denotes the actual values of the non-target features in $C$ for the $i$-th instance, and $n$ is the total number of instances. Crucially, PDP calculations disregard potential correlations between the features in $S$ and $C$; when such correlations exist, unrealistic or impossible feature combinations may enter the averaged results. For classification models outputting class probabilities, the PDP visualizes a specific class's probability across varying values of the feature(s) in $S$. As a global interpretation method, the PDP aggregates insights from all instances to characterize the overall relationship between the features in $S$ and model predictions.
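The Monte Carlo estimator above can be sketched directly; the model, the synthetic data, and the helper name `partial_dependence_1d` are illustrative stand-ins, not the paper's actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def partial_dependence_1d(model, X, feature, grid):
    """Monte Carlo estimate of the partial dependence function:
    for each grid value of the target feature, clamp that feature for
    every instance and average the predicted class-1 probability."""
    pdp = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v          # fix feature S at the grid value
        pdp.append(model.predict_proba(Xv)[:, 1].mean())
    return np.array(pdp)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

grid = np.linspace(-2, 2, 9)
pdp = partial_dependence_1d(model, X, feature=0, grid=grid)
```

For the truly informative feature, the curve rises from near 0 to near 1 across the grid, the kind of "sharply changed trajectory" discussed later for Ag, Au, and Zn.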

3. Geological Setting and Mineralization of the Study Area

3.1. Geological Setting

The study area, the Pangxidong district, is located northwest of Lianjiang City, Guangdong Province, South China, in a region formed by the late Paleozoic collision of the Yangtze and Cathaysia blocks [24,25,26]. The exposed strata are mainly Cretaceous and Quaternary. The Cretaceous formation is composed of breccia, conglomerate, and shale- and silt-bearing sandstone, while the Quaternary formation comprises an alluvial layer, flood deposits, and a residual diluvial layer.
The Pangxidong district has been recognized as one of China's 10 major silver bases since the 1980s, a status stemming from the successive discovery and exploitation of several giant to supergiant silver–gold polymetallic mines in the region (Figure 1) [27]. Previous studies, including both field surveys and sampling analyses, indicated that these silver–gold mines genetically belong to the structurally controlled, altered-rock-type silver–gold deposits [28,29]. The key geological characteristics of this deposit type in the study area are its structural controls and alteration signature. The mineralization is primarily controlled by NE-trending strike–slip faults and their subsidiary fractures and is associated with a hydrothermal alteration assemblage, including sulfides and quartz veins, that envelops and overprints the fault systems [28]. The ore mineral assemblage comprises silver, gold, and sulfide minerals, primarily argentite, sphalerite, galena, chalcopyrite, and pyrite. The dominant gangue minerals are silicates, including quartz, feldspar, sericite, chlorite, and hornblende [27]. The hydrothermal alteration related to the silver–gold mineralization is mainly characterized by potash feldspathization, chloritization, sericitization, silicification, pyritization, fluoritization, and carbonatization [30].

3.2. Stream Sediment Geochemical Data

The geochemical data were obtained from a 1:50,000 stream sediment survey conducted by the Guangdong Pangxidong Potential Mineral Prospect Investigation Program. A total of 1885 stream sediment samples evenly distributed throughout the study area were collected, with an average sampling density of about 4 sites per km2. The concentrations of 16 elements (Ag, Au, Cu, Pb, Zn, Mn, W, Sn, Mo, As, Sb, Bi, Hg, F, Ba, and B) were analyzed for each sample by inductively coupled plasma atomic emission spectrometry (ICP-AES), atomic absorption spectrometry (AAS), and X-ray fluorescence (XRF) [27] at the geophysical and geochemical laboratory of the Geology & Mineral Exploration Development Authority of Jiangxi Province, China.
The study area was divided into cells with a grid size of 100 m × 100 m, and windows of 3 × 3, 5 × 5, and 7 × 7 cells centered on known deposits (corresponding to training datasets A, B, and C in Table 1) serve as positive samples. For the selection of negative samples, the study area was divided into two parts by the median Au concentration, and the negative samples were randomly selected from the area with Au concentrations below the median (Figure 2). The geochemical concentrations at the grid points were obtained by interpolating the stream sediment survey data to produce evidence maps. Besides the training samples derived from the interpolated evidence maps, the four original geochemical survey data points (i.e., raw data) adjacent to each known mineral deposit location were also selected as training samples (training dataset R) for assessing model performance.
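A minimal sketch of this sample-construction scheme, under the assumption of a hypothetical 50 × 50 cell grid, a single deposit location, and a random stand-in Au evidence map; the helper names are ours:

```python
import numpy as np

def window_cells(center_rc, k, nrows, ncols):
    """Return the (row, col) indices of a k x k window centred on a
    known-deposit cell, clipped to the grid; these become positives."""
    r0, c0 = center_rc
    h = k // 2
    return [(r, c)
            for r in range(max(0, r0 - h), min(nrows, r0 + h + 1))
            for c in range(max(0, c0 - h), min(ncols, c0 + h + 1))]

def sample_negatives(au_grid, n, rng):
    """Randomly pick n cells whose Au concentration lies below the
    area-wide median, mirroring the paper's negative-sampling rule."""
    low_rc = np.argwhere(au_grid < np.median(au_grid))
    idx = rng.choice(len(low_rc), size=n, replace=False)
    return [tuple(rc) for rc in low_rc[idx]]

rng = np.random.default_rng(0)
au = rng.lognormal(size=(50, 50))            # stand-in Au evidence map
positives = window_cells((10, 10), k=3, nrows=50, ncols=50)
negatives = sample_negatives(au, n=len(positives), rng=rng)
print(len(positives), len(negatives))        # 9 9
```

Switching `k` to 5 or 7 yields the larger positive sets corresponding to training datasets B and C.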

4. Results and Discussion

The RF-based geochemical anomaly detection was conducted with the four positive training datasets and an equal number of corresponding negative samples (Table 1). For better comparison, the parameters were fixed throughout the study after trial and error as n_estimators = 200 and max_depth = 20, with the other parameters left at their defaults (e.g., max_features = 'sqrt'). These settings were assessed by halving grid search, with n_estimators searched over [100, 200, 300, 400, 500, 600] and max_depth over [10, 15, 20, 25, 30, 40]; the performance of models trained with the best-found parameters on each training dataset was compared with that of the proposed parameters. Moreover, with 20% of each training dataset held out as test data, the RF modeling for each training dataset was repeated 100 times. The RF model trained with dataset R (i.e., the samples derived from raw data) achieved good classification performance, with average accuracy of 0.881, precision of 0.937, recall of 0.835, and F1_score of 0.871 (for details of these metrics, see [31]). In contrast, the RF models trained on the interpolated datasets (A, B, and C) showed a notable rise in performance, with average accuracy > 0.987, precision of 0.999, recall > 0.976, and F1_score of 0.987 (Table 2). Although such high scores are desirable, this optimistic rise likely reflects a degree of spatial autocorrelation (data leakage) that artificially inflates the performance metrics.
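The parameter search and repeated evaluation described above can be sketched with scikit-learn; the data here are synthetic, and the search grids are reduced from the paper's for brevity:

```python
import numpy as np
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))           # 16 evidence maps per sample
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic labels

# Halving grid search (reduced grids; the paper searches n_estimators
# in [100..600] and max_depth in [10..40]).
search = HalvingGridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 200, 300], "max_depth": [10, 20, 30]},
    cv=3, factor=3, random_state=0).fit(X, y)
print(search.best_params_)

# Repeated 80/20 splits to average the metrics (10 repeats here; the
# paper uses 100).
accs = []
for seed in range(10):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2,
                                          random_state=seed)
    rf = RandomForestClassifier(n_estimators=200, max_depth=20,
                                random_state=seed).fit(Xtr, ytr)
    accs.append(accuracy_score(yte, rf.predict(Xte)))
print(round(float(np.mean(accs)), 3))
```

With spatially clustered positives, a random split like this lets near-duplicate neighbors land in both sets, which is exactly the leakage caveat raised in the text; spatially blocked splits would give a more honest estimate.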
First, to understand the evidence maps used in RF modeling, the correlation coefficient matrix was calculated (Table 3) based on the original 1885 stream sediment sample points; it indicated that most geochemical elements are uncorrelated with each other, with only Sb and As showing a higher correlation coefficient (0.89). Considering the closure effect of geochemical data, which can induce spurious correlations among variables [32], the centered log-ratio (clr) transformation was applied, and the biplot based on the clr-transformed data is shown in Figure 3. The first two principal components of the biplot explain only ~36% of the total variance, which is consistent with the generally low pairwise correlations in the correlation coefficient matrix.
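The clr transformation used here can be sketched as follows, on a stand-in compositional matrix shaped like the survey data (1885 samples × 16 elements):

```python
import numpy as np

def clr(comp):
    """Centred log-ratio transform: log of each part divided by the
    row-wise geometric mean, removing the closure constraint [32]."""
    logc = np.log(comp)
    return logc - logc.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
ppm = rng.lognormal(mean=2.0, size=(1885, 16))   # stand-in concentrations
z = clr(ppm)
print(np.allclose(z.sum(axis=1), 0.0))           # True
```

Each clr row sums to zero by construction, so correlations and PCA computed on `z` are free of the spurious structure that closure imposes on raw concentrations.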
During RF modeling, the importance of features (i.e., evidence maps) was measured in terms of the mean decrease in Gini index, which computes the average gain of purity obtained from splits on a given variable [33]. In practice, the importances over all evidence maps sum to 1, and a standard 0.05 threshold was conventionally applied to select the most important features [34]; on this basis, the RF modeling indicated that Ag, Au, Zn, Sb, Pb, and F are the key evidence maps, while Sn and B contribute little (Figure 4). Despite the low explained total variance, the biplot of PC1 and PC2 also offers some clues: Ag–Au, Zn–Pb–Ba, and As–W are highly correlated within their respective quadrants; Mn, Hg, Cu, and Sb plot close to the Ag–Au cluster; and Sn and B plot on the opposite side of Ag–Au (Figure 3). Because Gini importance is reported to be biased and unreliable when potential predictor variables differ in their scale of measurement or number of categories, another variable importance measure, permutation importance, was used to corroborate it (see Supplementary Materials). As permutation importance is calculated on test samples, the different training datasets were split with test_size = 0.2; the results showed Ag–Au to be the most important features, in line with the Gini importance (Figure S1), but with large standard deviations, obscuring the contributions of the other evidence maps. Moreover, when test_size was set to 0.4 for training datasets B and C, Au–Ag again showed high relative importance and the model trained on dataset C exhibited a lower standard deviation, implying that permutation importance is sensitive to the number of test samples used (Figure S2).
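The two importance measures can be contrasted on synthetic data with scikit-learn; the dataset, feature count, and signal structure are illustrative, not the study's:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (2 * X[:, 0] + X[:, 1] > 0).astype(int)   # features 0 and 1 informative

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)

# Gini (mean decrease in impurity) importance; the values sum to 1,
# which is why a fixed 0.05 threshold is meaningful.
gini = rf.feature_importances_
print(np.isclose(gini.sum(), 1.0))            # True

# Permutation importance is computed on held-out samples, so its mean
# and standard deviation depend on test_size, as the text notes.
perm = permutation_importance(rf, Xte, yte, n_repeats=20, random_state=0)
print(int(gini.argmax()), int(perm.importances_mean.argmax()))
```

With a sufficiently informative signal, both measures rank the dominant feature first, mirroring the Ag–Au agreement reported above; `perm.importances_std` exposes the variability that grows when the test set is small.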
Finally, the evidence maps were removed one at a time during RF training, and the top five features by Gini and permutation importance were compared (Figure S3); the consistency of the evidence-map rankings between the two measures supports the robustness of the trained RF model.
As PDPs can disclose the relationship between a target variable and a selected feature [35], in the MPM setting they were applied to reveal the response relationships between the evidence maps and the predicted probabilities in the study area (Figure 5). All PDP curves indicate that higher concentrations of the geochemical element in each evidence map increase the predicted probabilities in the final prospectivity maps. The most important evidence maps (Ag, Au, and Zn) show sharply rising trajectories (Figure 5 and Figure S4 in the Supplementary Materials), whereas those of the secondarily important evidence maps (Pb, F, and Sb) show predicted probabilities increasing gradually with element content (Figure 5). The nonsignificant evidence maps show no obvious response to changes in element content (i.e., flat trajectories).
PDPs depend on an independence assumption, namely that the feature(s) for which partial dependence is computed are uncorrelated with the other features; when features are correlated, the averaged outputs may include unrealistic combinations of feature values. Given that the Ag–Au evidence maps are the most important in the RF modeling and that their permutation importance showed large standard deviations, the robustness of the PDPs was assessed by retraining the RF model with the same parameters as before but excluding the Ag–Au evidence maps. The resulting importance rankings indicated that Zn, F, Sb, and Pb are the most significant evidence maps, with only slight differences in order compared with the former RF modeling (Figure 6). Correspondingly, the PDP trajectories for Zn, F, Sb, and Pb exhibited little difference from those of the previous RF modeling (Figure 7).
From the perspective of model training, these insights from the feature importance and PDPs, together with their consistency across RF models trained on different datasets, justify the feasibility of augmenting training data with samples adjacent to known mineral deposits.
The prospectivity maps predicted by RF modeling with the different training datasets are depicted in Figure 8 and Figure 9 for comparison, where the latter excludes the Ag and Au evidence maps. The known mineral deposits within the study area exhibit strong spatial alignment with the high-probability regions delineated by the prospectivity maps. To compare the performance of the RF models quantitatively, P-A plots were used. A P-A plot integrates the percentage of known mineral occurrences captured within each prospectivity class with the area covered by that class (expressed as a percentage of the total study area) [36]. In a P-A plot, the intersection point of the two curves, the prediction rate curve for known mineral occurrences and the cumulative area percentage curve, serves as a key criterion for evaluating and comparing mineral prospectivity models: a higher intersection point indicates a better model, one that contains more known mineral deposits within a smaller area. All known deposits fall within the top 5% of high-probability areas (Figure 10), which means that RF models trained on the different datasets capture the prospects of potential mineralization well.

To examine the patterns of the predicted mineral prospectivity maps, violin plots were introduced to analyze the characteristics of the resulting probabilities. Violin plots visualize the distribution of numeric data by combining two statistical tools, the box plot and the kernel density plot. The central body of a violin plot, the "violin" itself, is formed by mirroring a kernel density estimate (KDE) around a central axis; the width of the violin at any given value reflects the estimated data density (relative frequency or proportion) at that point. A wider section indicates a higher concentration of data points within that value range, while a narrower section indicates fewer points. Inside the violin, a short line marks the median, while a thin box spanning the 25th percentile (Q1) to the 75th percentile (Q3) represents the interquartile range (IQR) [37]. These plots effectively reveal the full distribution shape of the probabilities generated by RF modeling.
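The violin construction described above can be reproduced with matplotlib; the probability samples here are synthetic stand-ins for the RF outputs of three hypothetical training datasets:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")              # headless backend for scripted use
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Stand-in predicted probabilities from three hypothetical RF runs.
probs = [rng.beta(a, 2, size=5000) for a in (2, 3, 4)]

fig, ax = plt.subplots()
parts = ax.violinplot(probs, showmedians=True)  # mirrored KDE bodies
ax.set_xticks([1, 2, 3],
              labels=["dataset A", "dataset B", "dataset C"])
ax.set_ylabel("predicted probability")
fig.savefig("violin_probs.png")

# The median lines drawn by violinplot match the sample medians:
medians = [float(np.median(p)) for p in probs]
print([round(m, 2) for m in medians])
```

The mirrored KDE bodies directly expose where each model's probabilities concentrate, which a box plot alone would hide.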
From the violin plots for the six resulting prospectivity maps, the probability values of RF modeling with training dataset A tend to have a higher median value than those with training datasets B and C (Figure 11). High probability values (>0.8) occupy less than 2.6% of the study area. Except for the RF model trained on dataset A without the Ag and Au evidence maps (Figure 11d), the shapes of the violins are similar to each other.
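The P-A intersection criterion discussed earlier can be sketched as follows, under the simplifying assumption that both curves are evaluated on a shared threshold grid; the prospectivity map and deposit values are synthetic:

```python
import numpy as np

def pa_intersection(prob_map, deposit_probs, thresholds):
    """For each threshold t, compute P(t), the fraction of known
    deposits with predicted probability >= t, and A(t), the fraction
    of the study area with probability >= t.  The P-A intersection is
    where the prediction-rate curve P(t) crosses the descending-area
    curve 1 - A(t); a higher crossing means more deposits captured in
    a smaller area."""
    P = np.array([(deposit_probs >= t).mean() for t in thresholds])
    A = np.array([(prob_map >= t).mean() for t in thresholds])
    k = int(np.argmin(np.abs(P - (1 - A))))
    return P, A, thresholds[k], P[k]

rng = np.random.default_rng(0)
prob_map = rng.uniform(size=10000)               # stand-in prospectivity map
deposit_probs = rng.uniform(0.9, 1.0, size=20)   # deposits in high-prob cells

t = np.linspace(0, 1, 201)
P, A, t_cross, p_cross = pa_intersection(prob_map, deposit_probs, t)
print(round(float(t_cross), 2), round(float(p_cross), 2))
```

Because the synthetic deposits all sit in the top 10% of map values, the curves cross high on the plot, the signature of a model that concentrates known deposits in a small area.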

5. Conclusions

In this paper, we conducted geochemical anomaly detection in a supervised learning fashion facilitated by interpretable techniques. Considering the scarcity of known mineral deposits, which are traditionally used to create the positive samples for supervised learning, locations adjacent to known mineral deposits were used to augment the training dataset. Based on the different training datasets modeled by RF, the feature importance and PDP analyses of the evidence maps, and the P-A plot and violin plot analyses of the resulting mineral prospectivity, the conclusions of this case study are as follows: (a) For MPM with few known mineral deposits in the study area, locations near the known deposits can serve as positive samples for supervised learning in detecting geochemical anomalies and thus offer a potential method of data augmentation; this deserves further investigation in MPM scenarios with multi-geoinformation. (b) Interpretable techniques such as feature importance and PDPs are powerful tools for probing the data processing performed by machine learning; they not only help the practitioner understand the datasets but also provide a good way to assess the performance of different machine learning algorithms.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/min16010049/s1, Figure S1. The comparison of Gini importance and Permutation importance based on RF modeling with different training datasets (a, b, c, d for training datasets R, A, B, C). Figure S2. Feature importance based on RF modeling with different test_size = 0.4 (a: training dataset B; b: training dataset C). Figure S3. Feature importance based on RF modeling with training dataset C. (a–d for without Au, Au-Ag, Au-Ag-Zn and Au-Ag-Zn-F). Figure S4. The Partial dependence plots of the evidence maps used in RF modelling with training data R. (Red indicates active area) [38].

Author Contributions

Writing—original draft preparation, Q.C.; conceptualization, S.Z.; data curation, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 42402303, No. 42476272) and National Science and Technology Major Project (No. 2025ZD1008703).

Data Availability Statement

The data are available on request.

Acknowledgments

We thank the anonymous reviewers for their constructive comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, Y.; Lu, L. The Anomaly Detector, Semi-supervised Classifier, and Supervised Classifier Based on K-Nearest Neighbors in Geochemical Anomaly Detection: A Comparative Study. Math. Geosci. 2023, 55, 1011–1033. [Google Scholar] [CrossRef]
  2. Liu, Y.; Zhou, K.; Cheng, Q. A new method for geochemical anomaly separation based on the distribution patterns of singularity indices. Comput. Geosci. 2017, 105, 139–147. [Google Scholar] [CrossRef]
  3. Gonbadi, A.M.; Tabatabaei, S.H.; Carranza, E.J.M. Supervised geochemical anomaly detection by pattern recognition. J. Geochem. Explor. 2015, 157, 81−91. [Google Scholar] [CrossRef]
  4. Chen, Y.; Lu, L.; Li, X. Application of continuous restricted Boltzmann machine to identify multivariate geochemical anomaly. J. Geochem. Explor. 2014, 140, 56−63. [Google Scholar] [CrossRef]
  5. Xiong, Y.; Zuo, R. Recognition of geochemical anomalies using a deep autoencoder network. Comput. Geosci. 2016, 86, 75−82. [Google Scholar] [CrossRef]
  6. Zhang, S.; Xiao, K.; Carranza, E.J.M.; Yang, F.; Zhao, Z. Integration of auto-encoder network with density-based spatial clustering for geochemical anomaly detection for mineral exploration. Comput. Geosci. 2019, 130, 43−56. [Google Scholar] [CrossRef]
  7. Chen, Y.; Shayilan, A. Dictionary learning for multivariate geochemical anomaly detection for mineral exploration targeting. J. Geochem. Explor. 2022, 235, 106958. [Google Scholar] [CrossRef]
  8. Chen, Y.; Wang, S.; Zhao, Q.; Sun, G. Detection of Multivariate Geochemical Anomalies Using the Bat-Optimized Isolation Forest and Bat-Optimized Elliptic Envelope Models. J. Earth Sci. 2021, 32, 415−426. [Google Scholar] [CrossRef]
  9. Chen, Y.; Zhao, Q.; Lu, L. Combining the outputs of various k-nearest neighbor anomaly detectors to form a robust ensemble model for high-dimensional geochemical anomaly detection. J. Geochem. Explor. 2021, 231, 106875. [Google Scholar] [CrossRef]
  10. Zhang, C.; Zuo, R.; Xiong, Y. Detection of the multivariate geochemical anomalies associated with mineralization using a deep convolutional neural network and a pixel-pair feature method. Appl. Geochem. 2021, 130, 104994. [Google Scholar] [CrossRef]
  11. Carreño, A.; Inza, I.; Lozano, J.A. Analyzing rare event, anomaly, novelty and outlier detection terms under the supervised classification framework. Artif. Intell. Rev. 2019, 53, 3575−3594. [Google Scholar] [CrossRef]
  12. Nazarpour, A.; Paydar, G.R.; Carranza, E.J.M. Stepwise regression for recognition of geochemical anomalies: Case study in Takab area, NW Iran. J. Geochem. Explor. 2016, 168, 150−162. [Google Scholar] [CrossRef]
  13. Tian, M.; Wang, X.; Nie, L.; Zhang, C. Recognition of geochemical anomalies based on geographically weighted regression: A case study across the boundary areas of China and Mongolia. J. Geochem. Explor. 2018, 190, 381−389. [Google Scholar] [CrossRef]
  14. Wang, Z.; Dong, Y.; Zuo, R. Mapping geochemical anomalies related to Fe–polymetallic mineralization using the maximum margin metric learning method. Ore Geol. Rev. 2019, 107, 258−265. [Google Scholar] [CrossRef]
  15. Wang, J.; Zuo, R. Assessing geochemical anomalies using geographically weighted lasso. Appl. Geochem. 2020, 119, 104668. [Google Scholar] [CrossRef]
  16. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5−32. [Google Scholar] [CrossRef]
  17. Rodriguez-Galiano, V.; Sanchez-Castillo, M.; Chica-Olmo, M.; Chica-Rivas, M. Machine learning predictive models for mineral prospectivity: An evaluation of neural networks, random forest, regression trees and support vector machines. Ore Geol. Rev. 2015, 71, 804−818. [Google Scholar] [CrossRef]
  18. Rodriguez-Galiano, V.; Chica-Olmo, M.; Chica-Rivas, M. Predictive modelling of gold potential with the integration of multisource information based on random forest: A case study on the Rodalquilar area, Southern Spain. Int. J. Geogr. Inf. Sci. 2014, 28, 1336−1354. [Google Scholar] [CrossRef]
  19. Carranza, E.J.M.; Laborte, A.G. Random forest predictive modeling of mineral prospectivity with small number of prospects and data with missing values in Abra (Philippines). Comput. Geosci. 2015, 74, 60−70. [Google Scholar] [CrossRef]
  20. Breiman, L.; Friedman, J.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Chapman and Hall/CRC: New York, NY, USA, 1984. [Google Scholar] [CrossRef]
  21. Cutler, D.R.; Edwards, T.C., Jr.; Beard, K.H.; Cutler, A.; Hess, K.T.; Gibson, J.; Lawler, J.J. Random forests for classification in ecology. Ecology 2007, 88, 2783−2792. [Google Scholar] [CrossRef]
  22. Sun, T.; Li, H.; Wu, K.; Chen, F.; Zhu, Z.; Hu, Z. Data-Driven Predictive Modelling of Mineral Prospectivity Using Machine Learning and Deep Learning Methods: A Case Study from Southern Jiangxi Province, China. Minerals 2020, 10, 102. [Google Scholar] [CrossRef]
  23. Molnar, C. Interpretable Machine Learning; Lulu Press: Morrisville, NC, USA, 2019. [Google Scholar]
  24. Gilder, S.A.; Gill, J.; Coe, R.S.; Zhao, X.; Liu, Z.; Wang, G.; Yuan, K.; Liu, W.; Kuang, G.; Wu, H. Isotopic and paleomagnetic constraints on the Mesozoic tectonic evolution of south China. J. Geophys. Res. 1996, 101, 16137–16154. [Google Scholar] [CrossRef]
  25. Mao, J.W.; Chen, M.H.; Yuan, S.D.; Guo, C.L. Geological characteristics of the Qinhang (or Shihang) metallogenic belt in south China and spatial-temporal distribution regularity of mineral deposits. Acta Geol. Sin. 2011, 85, 635–658, (In Chinese with English abstract). [Google Scholar]
  26. Zhou, Y.Z.; Li, X.Y.; Zheng, Y.; Shen, W.J.; He, J.G.; Yu, P.P.; Niu, J.; Zeng, C.Y. Geological settings and metallogenesis of Qinzhou Bay—Hangzhou Bay orogenicjuncture belt, south China. Acta Petrol. Sin. 2017, 33, 667–681, (In Chinese with English abstract). [Google Scholar]
  27. Xiao, F.; Wang, K.; Hou, W.; Erten, O. Identifying geochemical anomaly through spatially anisotropic singularity mapping: A case study from silver-gold deposit in Pangxidong district, SE China. J. Geochem. Explor. 2020, 210, 106453. [Google Scholar] [CrossRef]
  28. Wang, Z.W.; Zhou, Y.Z. Geological characteristics and genesis of the Pangxidong-Jinshan Ag-Au deposit in Yunkai terrain, south China. Geotecton. Metallog. 2002, 26, 193–197, (In Chinese with English abstract). [Google Scholar]
  29. Lin, Z.W.; Zhou, Y.Z.; Qin, Y.; Zheng, Y.; Liang, Z.P.; Zou, H.P.; Niu, J. Ore-controlling structure analysis of Panxidong-Jinshan silver-gold orefield, southern Qin-Hang belt: Implications for further exploration. Mineral Deposits 2017, 36, 866–878, (In Chinese with English abstract). [Google Scholar]
  30. Chen, M.; Zheng, Y.; Chen, X.; Yu, P.; Zhang, G.; Wu, Y.; Huang, Y.; Wang, X.; Shu, L.; Lin, Z. High-Cd sphalerite in the Pangxidong Pb-Zn-Ag deposit (Yunkai Domain, South China): Insight for physicochemical condition of orogenic-type deposit. Ore Geol. Rev. 2024, 167, 105974. [Google Scholar] [CrossRef]
  31. Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437. [Google Scholar] [CrossRef]
  32. Aitchison, J. The statistical analysis of compositional data. J. R. Stat. Soc. Ser. B Methodol. 1982, 44, 139–160. [Google Scholar] [CrossRef]
  33. Zhao, B.; Wu, J.; Yang, F.; Pilz, J.; Zhang, D. A novel approach for extraction of Gaoshanhe-Group outcrops using Landsat Operational Land Imager (OLI) data in the heavily loess-covered Baoji District, Western China. Ore Geol. Rev. 2019, 108, 88–100. [Google Scholar] [CrossRef]
  34. Prasetiyowati, M.I.; Maulidevi, N.U.; Surendro, K. Determining threshold value on information gain feature selection to increase speed and prediction accuracy of random forest. J. Big Data 2020, 8, 84. [Google Scholar] [CrossRef]
  35. Zhang, S.; Carranza, E.J.M.; Fu, C.; Zhang, W.; Qin, X. Interpretable Machine Learning for Geochemical Anomaly Delineation in the Yuanbo Nang District, Gansu Province, China. Minerals 2024, 14, 500. [Google Scholar] [CrossRef]
  36. Yousefi, M.; Carranza, E.J.M. Fuzzification of continuous-value spatial evidence for mineral prospectivity mapping. Comput. Geosci. 2015, 74, 97–109. [Google Scholar] [CrossRef]
  37. Hintze, J.L.; Nelson, R.D. Violin plots: A box plot-density trace synergism. Am. Stat. 1998, 52, 181–184. [Google Scholar] [CrossRef]
  38. Strobl, C.; Boulesteix, A.L.; Zeileis, A.; Hothorn, T. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinform. 2007, 8, 25. [Google Scholar] [CrossRef]
Figure 1. Geological map of the study area.
Figure 2. Evidence maps and training sample selection.
Figure 3. Biplot of the first two PCs.
Figure 4. The importance of the evidence maps used in RF modeling.
Figure 5. The partial dependence plots of the evidence maps used in RF modeling.
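The two interpretability outputs behind Figures 4 and 5 (and their Au/Ag-free counterparts in Figures 6 and 7) can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the element list is a subset of the survey variables, the data are synthetic stand-ins, and the importance measure shown is scikit-learn's impurity-based one.

```python
# Sketch: RF feature importance (Figures 4/6) and one-way partial
# dependence (Figures 5/7) on synthetic stand-in data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

elements = ["Au", "Ag", "Cu", "Pb", "Zn", "As"]
rng = np.random.default_rng(0)
X = rng.normal(size=(300, len(elements)))
# Synthetic labels driven mainly by the first two "evidence maps".
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Figure 4/6 analogue: rank evidence maps by impurity-based importance.
ranking = sorted(zip(elements, rf.feature_importances_),
                 key=lambda t: t[1], reverse=True)

# Figure 5/7 analogue: partial dependence of the predicted probability
# on one evidence map (index 0, "Au"), averaging over the others.
pd_au = partial_dependence(rf, X, features=[0], kind="average")
```

A permutation-based importance (`sklearn.inspection.permutation_importance`) is a common alternative when impurity-based rankings are suspected of bias toward high-cardinality features, a caveat discussed by Strobl et al. [38].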
Figure 6. The importance of the evidence maps used in RF modeling without Au and Ag.
Figure 7. The partial dependence plots of the evidence maps used in RF modeling without Au and Ag evidence maps.
Figure 8. The mineral prospectivity map based on RF modeling ((ac) RF modeling with training datasets A, B, and C, respectively).
Figure 9. The mineral prospectivity map based on RF modeling without Au and Ag ((ac) RF modeling with training datasets A, B, and C, respectively).
Figure 10. The P-A plots for different predicted mineral prospectivity maps ((ac) for RF modeling with training datasets A, B, and C, respectively; and (df) for RF modeling with training datasets A, B, and C without Ag and Au evidence maps).
Figure 11. The violin plots for different predicted mineral prospectivity maps ((ac) for RF modeling with training datasets A, B, and C, respectively; and (df) for RF modeling with training datasets A, B, and C without Ag and Au evidence maps).
Table 1. Training dataset used in RF modeling.

Training Datasets   Sample Counts   Train/Test Splits   Interpolated or Not
A                   90              20%                 Interpolated
B                   250             20%                 Interpolated
C                   486             20%                 Interpolated
R                   40              20%                 Not interpolated
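The Table 1 setup, in which each training dataset is split with 20% held out for testing before RF modeling, can be sketched as follows. This is a minimal illustration under stated assumptions: the feature matrix is a synthetic stand-in for the 16 element evidence maps, and dataset B's size (250 samples) is used as the example.

```python
# Sketch of one Table 1 run: an 80/20 train/test split followed by
# random forest fitting. Data are synthetic stand-ins for the survey.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features = 250, 16          # dataset B: 250 samples, 16 element maps
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)   # 1 = mineralized, 0 = non-mineralized

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)  # 20% held out, as in Table 1

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)
```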
Table 2. Performance of the RF modeling under different metrics across the 100 runs.

            Dataset A        Dataset B        Dataset C        Dataset R
            Mean    Stddev   Mean    Stddev   Mean    Stddev   Mean    Stddev
Accuracy    0.987   0.023    0.996   0.009    0.995   0.009    0.881   0.089
Precision   0.999   0.01     0.999   0.007    0.999   0.003    0.937   0.109
Recall      0.976   0.046    0.992   0.016    0.994   0.012    0.835   0.156
F1_Score    0.987   0.025    0.996   0.009    0.997   0.006    0.871   0.101

Stddev: standard deviation.
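The mean and standard deviation columns of Table 2 come from repeating the train/evaluate cycle and summarizing the four metrics over the runs. A hedged sketch of that bookkeeping is given below; it uses synthetic classification data and 10 runs rather than the paper's survey data and 100 runs.

```python
# Sketch of the Table 2 evaluation loop: fit an RF per run, score the
# held-out 20%, then report mean and standard deviation of each metric.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

scores = {"accuracy": [], "precision": [], "recall": [], "f1": []}
for run in range(10):                      # the paper uses 100 runs
    X, y = make_classification(n_samples=250, n_features=16, random_state=run)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=run)
    rf = RandomForestClassifier(n_estimators=100, random_state=run).fit(X_tr, y_tr)
    pred = rf.predict(X_te)
    scores["accuracy"].append(accuracy_score(y_te, pred))
    scores["precision"].append(precision_score(y_te, pred))
    scores["recall"].append(recall_score(y_te, pred))
    scores["f1"].append(f1_score(y_te, pred))

# One (mean, stddev) pair per metric, as in each Table 2 column pair.
summary = {k: (float(np.mean(v)), float(np.std(v))) for k, v in scores.items()}
```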
Table 3. Correlation coefficient matrix.

Element   Au     B      Sn     Cu     Ag     Ba     Mn     Pb     Zn     As     Sb     Bi     Hg     Mo     W      F
Au        1.00  −0.01  −0.02   0.28   0.28  −0.01   0.03   0.32   0.33  −0.01   0.01   0.04   0.00   0.07  −0.01   0.06
B                1.00   0.12  −0.08  −0.01  −0.12   0.14  −0.07  −0.05   0.14   0.06   0.00  −0.01  −0.06   0.10   0.26
Sn                      1.00   0.32   0.00  −0.07  −0.05   0.12   0.02   0.14   0.09   0.41   0.07   0.22   0.28   0.41
Cu                             1.00   0.23  −0.02   0.07   0.36   0.71   0.01   0.04   0.38   0.07   0.20   0.19  −0.03
Ag                                    1.00   0.00   0.01   0.45   0.42   0.00   0.02   0.04   0.07   0.12   0.01   0.06
Ba                                           1.00   0.13   0.22   0.05  −0.03  −0.02  −0.06  −0.04  −0.17  −0.08   0.06
Mn                                                  1.00   0.18   0.05   0.17   0.20   0.10  −0.05  −0.11  −0.01  −0.08
Pb                                                         1.00   0.51   0.01   0.05   0.21   0.09   0.23   0.08   0.21
Zn                                                                1.00   0.01   0.03   0.08   0.08   0.11   0.01   0.08
As                                                                       1.00   0.89   0.03   0.00   0.03   0.05   0.08
Sb                                                                              1.00   0.03   0.04   0.03   0.02   0.03
Bi                                                                                     1.00   0.03   0.33   0.57   0.08
Hg                                                                                            1.00   0.07   0.04   0.09
Mo                                                                                                   1.00   0.29   0.21
W                                                                                                           1.00   0.22
F                                                                                                                  1.00

Chen, Q.; Zhang, S.; Zhou, Y. Geochemical Anomaly Detection via Supervised Learning: Insights from Interpretable Techniques for a Case Study in Pangxidong Area, South China. Minerals 2026, 16, 49. https://doi.org/10.3390/min16010049
