Comparative Evaluation of Unsupervised Machine Learning Methods for Orogenic Gold Exploration Using Stream Sediment Geochemistry

Mostafaei, Kamran; Jodeiri Shokri, Behshad; Mirzaghorbanali, Ali

doi:10.3390/min16060628

Open Access Article

by Kamran Mostafaei ¹,*, Behshad Jodeiri Shokri ²,³ and Ali Mirzaghorbanali ²,³

¹ Department of Mining, Faculty of Engineering, University of Kurdistan, Sanandaj 66177-15175, Iran

² School of Science, Engineering, and Digital Technologies, University of Southern Queensland, Springfield Central, Ipswich, QLD 4300, Australia

³ Centre for Future Materials (CFM), University of Southern Queensland, Toowoomba, QLD 4350, Australia

* Author to whom correspondence should be addressed.

Minerals 2026, 16(6), 628; https://doi.org/10.3390/min16060628

Submission received: 15 May 2026 / Revised: 8 June 2026 / Accepted: 10 June 2026 / Published: 11 June 2026

(This article belongs to the Special Issue Application of Geophysical and Geochemical Techniques in Mining Engineering)

Abstract

Stream sediment geochemistry is a widely used reconnaissance tool in early-stage mineral exploration, particularly in regions where direct evidence of mineralisation is limited. Because stream sediment anomalies provide indirect geochemical signatures and are typically constrained by limited ground-truth information, labelled datasets are often scarce and spatially biased. This limitation restricts the applicability of supervised learning approaches and highlights the need for robust unsupervised methods. In this study, six unsupervised techniques, Principal Component Analysis (PCA), Non-negative Matrix Factorisation (NMF), Uniform Manifold Approximation and Projection (UMAP), Autoencoder (AE), Deep Embedded Clustering (DEC), and an Averaged Ensemble Index (AVE), were evaluated for integrating multivariate stream sediment geochemical data and delineating gold prospectivity zones. Eight gold-related elements (Au, As, Ag, B, Hg, Mo, Sb, and W) were selected based on regional metallogenic characteristics and previously reported geochemical associations. To facilitate direct comparison, all model outputs were normalised to a fuzzy membership scale ranging from 0 to 1. Model performance was quantitatively assessed using Receiver Operating Characteristic–Area Under the Curve (ROC–AUC) and Matthews Correlation Coefficient (MCC) metrics based on independently verified mineralised and non-mineralised locations. The results indicated that DEC and AE consistently outperformed the other methods investigated, achieving the highest ROC–AUC and MCC values, whereas UMAP exhibited comparatively weaker performance. The findings demonstrated that unsupervised representation learning approaches, particularly DEC and AE, provided a more effective framework for integrating multivariate geochemical data and delineating gold-related anomalies in data-limited exploration environments than conventional dimensionality reduction and heuristic integration methods.

Keywords: geochemical data; stream sediment; unsupervised learning; anomaly detection; gold

1. Introduction

Stream sediment geochemistry has long been recognised as one of the most effective reconnaissance-scale tools for mineral exploration, particularly in regional surveys where direct observation of mineralisation is limited [1,2,3,4]. At scales such as 1:100,000, stream sediment datasets provide an integrated geochemical signal that reflects the combined influence of upstream lithology, weathering processes, hydrothermal alteration, and mineralisation. This ability to capture geochemical information from extensive catchment areas makes stream sediment geochemistry especially valuable for early-stage mineral exploration and target delineation.

Despite its advantages, identifying geochemical anomalies associated with ore systems remains a significant challenge in mineral exploration. Conventional anomaly detection methods, including univariate thresholding, background–anomaly separation, and classical statistical techniques, often fail to adequately capture the multivariate and compositional characteristics of geochemical datasets [5,6]. As a result, these approaches may oversimplify complex geochemical interactions, leading to uncertain anomaly delineation and reduced exploration effectiveness, particularly in regions where geochemical contrasts are subtle.

These challenges are particularly evident in gold exploration, where geochemical signatures of mineralisation are often weak, spatially dispersed, and influenced by complex geological processes. Orogenic gold deposits, in particular, are commonly controlled by structural features and are associated with extensive alteration halos rather than distinct, localised geochemical enrichments [7,8]. In stream sediment datasets, geochemical signatures associated with gold mineralisation are often expressed indirectly through pathfinder elements, resulting in complex multielement patterns that are difficult to identify using conventional element-by-element analytical approaches [9,10].

In response to these limitations, multivariate statistical and machine learning approaches have been increasingly adopted in mineral exploration research. Among these, unsupervised methods have gained considerable attention for their ability to uncover intrinsic data structures without labelled training samples, which are often unavailable during the early stages of exploration [11,12,13,14]. Techniques such as principal component analysis, non-negative matrix factorisation, nonlinear manifold learning, and neural network-based representation learning have shown considerable potential for extracting latent patterns from high-dimensional geochemical datasets [15,16,17,18].

Despite the increasing adoption of unsupervised techniques, their application in mineral exploration has largely been limited to the visual interpretation of anomaly maps and qualitative geological assessment. Objective quantitative validation and systematic comparison of different unsupervised anomaly detection methods remain relatively scarce in the literature, particularly where independent mineral occurrence data are available for evaluation [19,20,21]. This lack of rigorous assessment constrains the ability to determine the true exploration value, reliability, and broader applicability of these approaches.

To the best of the authors’ knowledge, few studies have systematically compared multiple unsupervised anomaly detection methods applied to regional stream sediment geochemistry and validated them quantitatively against independent mineral-occurrence data. To address this gap, this research developed a comparative framework to evaluate multiple unsupervised anomaly detection methods using stream sediment geochemical data collected from an orogenic gold-bearing region at a scale of 1:100,000. Following comprehensive data preprocessing and compositional data treatment, continuous anomaly scores were generated using a range of linear, nonlinear, and neural network-based techniques. The resulting anomaly maps were quantitatively evaluated against independent datasets comprising known mineral occurrences and non-mineralised control points using receiver operating characteristic (ROC) analysis and complementary performance metrics. By focusing on anomaly ranking rather than hard classification or clustering, the proposed framework provides an objective and practically relevant assessment of unsupervised anomaly detection methods for regional-scale gold exploration targeting.

2. Geological and Geochemical Settings

2.1. Regional Tectonic Framework and Geology

The study area is located in northwestern Iran, along the border between the Baneh and Saqqez counties of Kurdistan Province and the Sardasht county of West Azerbaijan Province (Figure 1). Geographically, it extends approximately between 45°30′–46°00′ E longitude and 36°00′–36°30′ N latitude. From a regional tectonic perspective, the area is situated within the northwestern segment of the Sanandaj–Sirjan Zone (SSZ), one of the most significant metallogenic belts in Iran [22,23].

The SSZ represents an elongated continental crustal block that separated from Gondwana during the Late Paleozoic and subsequently accreted to the Eurasian margin in the Late Triassic. It is widely interpreted as the continental margin associated with the Neo-Tethyan subduction system. It has undergone prolonged episodes of tectonic, metamorphic, and magmatic activity from the Mesozoic to the Cenozoic. Consequently, the SSZ hosts a diverse range of mineralisation styles and is recognised as one of the most prospective regions for gold exploration in Iran [24,25,26,27].

The study area is characterised by intense tectonic deformation, expressed through numerous thrust faults, shear zones, and fault systems that predominantly trend northwest–southeast, parallel to the Zagros Orogenic Belt [22,23]. The prevailing structural regime is interpreted as shear–thrust-dominated, reflecting compressional-to-transpressional tectonic processes associated with Late Cretaceous to Early Cenozoic convergence.

Magmatic activity is widespread throughout the study area, particularly in its eastern sector, where large intrusive bodies were emplaced within a shear-compressional tectonic regime. These intrusions range in composition from granite and granodiorite to diorite and gabbro and are generally assigned to the Upper Cretaceous [20,24]. Many of the plutonic bodies exhibit varying degrees of deformation and metamorphism, with mylonitic fabrics locally developed in response to subsequent tectonic activity.

The oldest lithological units in the region comprise Precambrian metamorphic rocks, including schist, phyllite, slate, meta-rhyolite, and gneiss. These rocks have undergone low- to medium-grade metamorphism and exhibit a structural fabric consistent with the regional northwest–southeast tectonic trend. Paleozoic units occur sporadically throughout the area, whereas Permian limestones are locally exposed and lack clear stratigraphic continuity. Cretaceous sedimentary–volcanic sequences crop out discontinuously across the region. In the southwestern part of the study area, Paleocene-aged coloured mélange assemblages are exposed and are interpreted as remnants of a continental rift-related tectonic setting [28,29].

2.2. Gold Mineralisation Characteristics

Gold mineralisation within the study area reflects the broader metallogenic characteristics of the SSZ. This tectonic belt hosts a diverse range of gold deposit types, including orogenic, epithermal, porphyry-related, Carlin-type, and volcanogenic massive sulphide (VMS) systems. However, a substantial proportion of the known gold occurrences in the northwestern segment of the SSZ are classified as orogenic gold deposits [7,30,31].

Two principal styles of gold mineralisation are recognised within the study area: (1) shear zone-hosted gold mineralisation, represented by occurrences such as Zavakoh, Mirgenaghshineh, and Sheikh Choopan; and (2) gold-bearing massive sulphide mineralisation hosted by volcanic rocks, exemplified by the Barika deposit [24,25,29]. Structural elements, particularly thrust faults and shear zones, exert a fundamental control on fluid migration and ore deposition. Hydrothermal alteration associated with gold mineralisation is predominantly argillic and propylitic, while the principal ore-related sulphide minerals include pyrite, arsenopyrite, and chalcopyrite.

2.3. Stream Sediment Geochemical Dataset

The geochemical dataset used in this study was derived from a regional stream sediment survey conducted across the Alut 1:100,000 geological map sheet. Stream sediment sampling was designed to provide representative coverage of the drainage network, with an average sampling density of approximately one sample per 3 km². To ensure spatial representativeness, samples were collected as close as possible to the centroid of each catchment basin while maintaining a maximum sampling distance of less than 1 km along active stream channels (Figure 2).

A total of 835 stream sediment samples were collected across the study area. The samples were air-dried, sieved to the <200-mesh fraction, and subsequently analysed for a suite of 20 elements. Geochemical analyses were performed in accredited laboratories using a combination of inductively coupled plasma mass spectrometry (ICP-MS) and atomic absorption spectrometry (AAS). The resulting dataset was obtained from the Geological Survey and Mineral Exploration Organisation of Iran (GSI).

The stream sediment geochemical dataset is multivariate and spatially irregular, reflecting the complex geological, tectonic, and mineralisation characteristics of the study area. The analysed variables include gold together with a range of major and trace elements commonly used in regional mineral exploration programs. As is typical of reconnaissance-scale stream sediment surveys, the dataset exhibits pronounced skewness, non-normal distributions, and compositional constraints arising from the closed-sum nature of geochemical data.

Given these characteristics, the dataset is particularly well-suited to multivariate and unsupervised analytical approaches for anomaly detection and pattern recognition. Detailed descriptions of the data preprocessing procedures, element selection criteria, and anomaly detection methodologies are provided in the subsequent sections. The spatial distributions of the selected gold-related elements are presented in Figure 3 to illustrate the inherent geochemical heterogeneity of the study area. These maps are included to provide a preliminary visual representation of geochemical patterns and are not interpreted individually for exploration targeting.

3. Data Preprocessing and Geochemical Feature Selection

3.1. Stream Sediment Geochemical Data and Preprocessing

As described in Section 2.3, the geochemical dataset comprises 835 stream sediment samples collected from the Alut 1:100,000 geological map sheet in the northwestern segment of the Sanandaj–Sirjan Zone (SSZ), at an average sampling density of approximately one sample per 3 km². Sampling locations were selected near the centroids of drainage basins, within approximately 1 km along active stream channels, to ensure representative coverage of upstream lithological and structural variations. All samples were sieved to the <200-mesh fraction and analysed for 20 elements by atomic absorption spectrometry (AAS) and inductively coupled plasma mass spectrometry (ICP-MS) at accredited laboratories. The dataset, obtained from the Geological Survey and Mineral Exploration Organisation of Iran (GSI), forms the primary input for subsequent geochemical processing and anomaly detection analyses.

Geochemical datasets commonly contain censored values, particularly measurements reported below analytical detection limits. If not properly addressed, such values may introduce bias into multivariate analyses and anomaly detection results. In this study, values below detection limits were treated using a robust substitution approach based on a fraction of the corresponding detection limit, consistent with established practices in geochemical data analysis [32,33]. This procedure preserves the statistical characteristics of the dataset while minimising artificial variance inflation and distortion of inter-element relationships. The corrected dataset was subsequently screened for outliers attributable to analytical or sampling errors, ensuring that the retained extreme values represent genuine geochemical variability and potential mineralisation-related anomalies.
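The substitution step can be illustrated with a minimal sketch. The fraction used in the study is not reported, so the value of 0.5 below, the function name `substitute_censored`, and the example readings are all illustrative assumptions rather than the authors' settings.

```python
def substitute_censored(values, detection_limit, fraction=0.5):
    """Replace censored readings with fraction * detection_limit.

    A reading counts as censored if it is None (reported only as
    "below detection limit") or numerically below the limit.
    fraction=0.5 is an assumption; the source does not state the value used.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must lie in (0, 1]")
    replacement = fraction * detection_limit
    return [
        replacement if v is None or v < detection_limit else v
        for v in values
    ]


# Hypothetical Au readings (ppb) with a hypothetical detection limit of 1 ppb.
au_ppb = [0.4, None, 2.5, 1.0, 13.0]
cleaned = substitute_censored(au_ppb, detection_limit=1.0)
```

Values at or above the detection limit pass through unchanged, so the upper tail that carries potential mineralisation signal is not altered by this step.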

Following the treatment of censored values, descriptive statistical analyses were performed on the processed geochemical dataset (Table 1). The results indicate substantial distributional heterogeneity among the analysed elements. Several elements exhibit strong positive skewness, most notably Au (skewness = 19.06), Mo (13.78), and W (11.80), reflecting the presence of extreme values associated with localised geochemical enrichment. Gold also displays an exceptionally high coefficient of variation (CV = 1065.76), highlighting its highly heterogeneous distribution within the stream sediment dataset and its association with mineralisation-related anomaly patterns. Overall, the predominance of strongly skewed and heavy-tailed distributions indicates a substantial departure from normality, supporting the application of data transformation procedures and multivariate unsupervised anomaly detection methods.
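For readers who wish to reproduce such summary statistics, the following sketch computes skewness and the coefficient of variation. It assumes the moment-based (population) skewness and a CV expressed as a percentage of the mean using the sample standard deviation; the source does not state which estimators were used, and the input values are hypothetical.

```python
import statistics


def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return 100.0 * statistics.stdev(values) / statistics.fmean(values)


# Hypothetical concentrations: one extreme value drives a positive skew.
sample = [1.0, 1.0, 1.0, 1.0, 50.0]
sample_skewness = skewness(sample)
sample_cv = coefficient_of_variation(sample)
```

A single high value among otherwise uniform readings is enough to produce a clearly positive skewness, which is the pattern described above for Au, Mo, and W.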

3.2. Selection of Gold-Related Elements

Element selection is a critical stage in geochemical targeting, as it directly influences both anomaly detection performance and the interpretability of the resulting models. In this study, eight elements (Au, As, Sb, Hg, Mo, Ag, B, and W) were selected prior to anomaly modelling based on: (i) the established geochemical signature of orogenic gold systems; (ii) the regional metallogenic characteristics of the Sanandaj–Sirjan Zone (SSZ); and (iii) documented pathfinder associations reported in previous studies conducted within the study area [24,25,29].

Orogenic gold mineralisation in the SSZ is commonly associated with elevated concentrations of As, Sb, Hg, and Ag, reflecting hydrothermal fluid–rock interaction, sulphide mineralisation, and structurally controlled fluid flow. Tungsten (W) and boron (B) are frequently linked to hydrothermal alteration and fluid evolution processes, whereas molybdenum (Mo) may indicate deeper magmatic or hydrothermal contributions spatially associated with mineralised systems. Gold (Au) was retained as the principal indicator element due to its direct genetic relationship with the target mineralisation.

The selected element suite, therefore, integrates both ore-related and pathfinder geochemical signatures, providing a geologically informed basis for multivariate anomaly detection and exploration targeting. In addition to the conventional pathfinder elements As, Sb, Hg, and Ag, the suite includes B, Mo, and W because they provide complementary information on hydrothermal processes associated with gold mineralisation. Boron is commonly enriched in hydrothermal alteration zones and may serve as an indicator of fluid pathways, while tungsten is frequently associated with reduced hydrothermal systems that are spatially related to orogenic gold mineralisation. Although molybdenum is more commonly linked to magmatic–hydrothermal environments, it may also reflect deeper hydrothermal inputs and has been reported in association with several gold-bearing hydrothermal systems. The inclusion of these elements, therefore, enhances the representation of hydrothermal alteration, fluid evolution, and mineralisation-related geochemical processes within the study area.

The selected element assemblage was defined to capture both direct indicators of gold mineralisation and geochemical signatures associated with alteration and fluid activity [7,30,31]. Elements lacking a documented genetic or spatial relationship with orogenic gold systems in the SSZ were excluded from the modelling stage to minimise the influence of unrelated geochemical variability and improve model interpretability.

3.3. Compositional Data Transformation

Stream sediment geochemical data are inherently compositional because element concentrations represent parts of a whole and are therefore subject to a constant-sum constraint. Direct application of conventional statistical or machine learning techniques to raw compositional data may generate spurious correlations and biased interpretations. To overcome these limitations, log-ratio transformations were applied before multivariate analysis. In this study, the centred log-ratio (CLR) transformation was used for exploratory data analysis and visualisation because it preserves the relative information among components while treating all variables symmetrically. For machine learning and clustering applications, the isometric log-ratio (ILR) transformation was employed because it yields orthonormal coordinates in Euclidean space and satisfies the statistical requirements of distance-based and optimisation-driven algorithms [34,35,36,37]. The combined use of CLR and ILR transformations enables both meaningful interpretation of geochemical relationships and statistically rigorous modelling of compositional data. Consequently, the transformed dataset provides a robust foundation for subsequent unsupervised anomaly detection analyses and multivariate pattern recognition [37,38].
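The two transformations can be sketched as follows. The CLR definition is standard; for the ILR, the orthonormal basis used in the study is not stated, so a pivot-type (Helmert-like) basis is assumed here as one common choice. The example composition is hypothetical.

```python
import math


def clr(composition):
    """Centred log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def ilr(composition):
    """Isometric log-ratio with a pivot-type orthonormal basis (assumed).

    Coordinate i contrasts the geometric mean of the first i parts
    with part i + 1, scaled by sqrt(i / (i + 1)).
    """
    logs = [math.log(x) for x in composition]
    coords = []
    running_sum = 0.0
    for i in range(1, len(logs)):
        running_sum += logs[i - 1]
        coords.append(math.sqrt(i / (i + 1)) * (running_sum / i - logs[i]))
    return coords


# Hypothetical strictly positive concentrations for eight elements.
composition = [0.5, 12.0, 0.2, 30.0, 0.05, 1.5, 0.8, 2.0]
clr_values = clr(composition)
ilr_values = ilr(composition)
```

A composition with D parts yields D CLR values that sum to zero and D − 1 ILR coordinates. Both representations have the same Euclidean norm and are unchanged when the composition is multiplied by a constant, which is why distances computed on ILR coordinates are meaningful for compositional data.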

3.4. Overview of Analytical Workflow

Following data correction and compositional transformation, the processed geochemical dataset was used as input for a suite of unsupervised learning and anomaly detection techniques. Dimensionality reduction, clustering, and feature extraction methods were applied to identify geochemical patterns and anomalous zones that may be associated with gold mineralisation. Detailed descriptions of the individual algorithms and modelling procedures are provided in the following sections.

To facilitate consistent comparison and integration of results derived from multiple methods, all anomaly outputs were subsequently rescaled to a common fuzzy membership domain. Figure 4 presents the overall analytical workflow adopted in this study, encompassing stream sediment data preparation, compositional data transformation, unsupervised modelling, ensemble integration, and quantitative validation.

4. Unsupervised Anomaly Detection Methods

In mineral exploration, particularly during the early stages of regional assessment, labelled datasets are often limited, incomplete, or affected by spatial bias. Consequently, unsupervised learning techniques have been increasingly adopted to identify latent geochemical structures and delineate anomalous zones without relying on predefined classes [39,40]. These approaches seek to detect departures from background geochemical behaviour by exploiting the statistical, geometric, and representation-learning characteristics of multivariate datasets [41,42,43]. In this study, six unsupervised techniques representing distinct analytical paradigms were applied to integrate the selected geochemical variables (Au, As, Sb, Hg, Mo, Ag, B, and W) and generate continuous anomaly scores. The methods were grouped according to their underlying analytical principles to highlight methodological diversity and facilitate a comprehensive evaluation of their relative performance and complementarity.

4.1. Principal Component Analysis (PCA)

Principal component analysis (PCA) is a widely used linear dimensionality-reduction technique that transforms correlated variables into orthogonal components that capture the maximum variance in a dataset [44,45]. In geochemical exploration, PCA has been extensively applied to identify elemental associations, reduce background variability, and enhance geochemical signatures associated with mineralisation processes [46,47,48]. In this study, principal components with strong loadings on gold and associated pathfinder elements were interpreted as potential indicators of mineralised systems and subsequently used to characterise anomalies.
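As a minimal illustration of how a component score can be obtained, the sketch below extracts the first principal component by power iteration on the sample covariance matrix. It uses only the Python standard library so that it runs without dependencies; a production workflow would more likely use a linear algebra library. The function name and the two-variable data are hypothetical, and the sketch does not reproduce the component selection used in the study.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) for PC1 via power iteration.

    Assumes the starting vector is not orthogonal to PC1 and that the
    data have non-zero variance.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centred]
    return vec, scores


# Hypothetical transformed values for two perfectly correlated variables.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loadings, scores = first_principal_component(rows)
```

The loadings show how strongly each variable contributes to the component, and the scores give each sample's position along it; in an exploration setting, a component loaded on Au and its pathfinders would be the one carried forward as an anomaly index.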

4.2. Non-Negative Matrix Factorisation (NMF)

Non-negative Matrix Factorisation (NMF) is a matrix decomposition technique that factorises the original data matrix into a set of additive, non-negative latent factors, thereby providing physically interpretable representations of underlying geochemical processes [49,50,51]. Owing to the inherently non-negative nature of geochemical concentration data, NMF has gained increasing attention in geoscientific applications involving pattern recognition, source identification, and anomaly detection [52,53]. In this study, the intensities of the extracted NMF factors were used to generate continuous anomaly scores, with elevated values interpreted as indicative of geochemical patterns potentially associated with mineralisation.

4.3. Uniform Manifold Approximation and Projection (UMAP)

Uniform Manifold Approximation and Projection (UMAP) is a nonlinear manifold learning technique designed to preserve both local neighbourhood relationships and broader global structures during dimensionality reduction [54,55]. By representing high-dimensional geochemical data as a fuzzy topological manifold, UMAP can capture complex nonlinear relationships that may not be adequately resolved by linear methods [56]. In this study, anomaly scores were derived from distance- and density-based metrics computed within the low-dimensional embedding space, with elevated scores indicating samples that deviate from dominant geochemical patterns and trends [42,43].

4.4. Autoencoder (AE)

Autoencoders (AEs) are neural network architectures that learn compact latent representations of high-dimensional data by minimising the reconstruction error between the input and output layers [57,58]. In the context of unsupervised anomaly detection, samples associated with large reconstruction errors are commonly interpreted as anomalous because they deviate from the dominant patterns learned from the background population [59,60,61]. In this study, reconstruction residuals generated by the autoencoder were used to derive continuous anomaly scores, with higher residuals indicating a greater degree of geochemical anomaly.
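In symbols, a reconstruction-based anomaly score is commonly written as below. The notation is an assumption introduced here for illustration: f_θ denotes the encoder, g_φ the decoder, x_i the transformed geochemical vector of sample i, and N the number of samples. The loss is shown up to a constant scaling factor, and the exact residual measure used in the study is not stated.

```latex
\hat{\mathbf{x}}_i = g_{\phi}\bigl(f_{\theta}(\mathbf{x}_i)\bigr), \qquad
\mathcal{L}(\theta,\phi) = \frac{1}{N}\sum_{i=1}^{N}
  \bigl\lVert \mathbf{x}_i - \hat{\mathbf{x}}_i \bigr\rVert_2^2, \qquad
s_i = \bigl\lVert \mathbf{x}_i - \hat{\mathbf{x}}_i \bigr\rVert_2^2
```

Because the network is fitted to the whole population, background samples dominate the loss and are reconstructed well, so a large s_i flags a sample whose multielement pattern departs from that background.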

4.5. Deep Embedded Clustering (DEC)

Deep Embedded Clustering (DEC) is a deep learning framework that jointly performs representation learning and clustering within a unified optimisation process [62,63]. By learning a low-dimensional feature space while simultaneously refining cluster assignments, DEC can reveal complex data structures that conventional clustering approaches may not capture. Although DEC is primarily designed for clustering applications, its soft cluster assignment probabilities and distances within the embedded feature space can also be used to quantify the degree of anomaly exhibited by individual samples [64,65,66]. In this study, DEC-derived anomaly measures were used to generate continuous anomaly scores, with higher values indicating greater deviation from dominant geochemical populations.
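For reference, the soft assignment and training objective of the standard DEC formulation are given below, where z_i is the embedded representation of sample i, μ_j is the centroid of cluster j, and α is the degrees of freedom of the Student's t kernel (usually set to 1). The final anomaly score s_i is only one plausible option, stated here as an assumption; the source does not specify the exact measure used.

```latex
q_{ij} = \frac{\bigl(1 + \lVert \mathbf{z}_i - \boldsymbol{\mu}_j \rVert^2 / \alpha\bigr)^{-\frac{\alpha+1}{2}}}
              {\sum_{j'} \bigl(1 + \lVert \mathbf{z}_i - \boldsymbol{\mu}_{j'} \rVert^2 / \alpha\bigr)^{-\frac{\alpha+1}{2}}},
\qquad
p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}},
\quad f_j = \sum_i q_{ij}

\mathcal{L} = \mathrm{KL}(P \,\Vert\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
s_i = 1 - \max_j q_{ij} \quad \text{(hypothetical anomaly score)}
```

Under this reading, a sample that sits far from every centroid receives a flat assignment distribution and therefore a high score.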

4.6. Averaged Ensemble Index (AVE)

Ensemble integration strategies are widely employed in mineral prospectivity mapping and geochemical anomaly detection to enhance robustness, mitigate method-specific bias, and reduce uncertainty associated with individual models [42,67]. Rather than introducing an additional learning algorithm, this study employs an Averaged Ensemble Index (AVE) as an integration framework to synthesise the complementary information extracted by multiple unsupervised methods.

The AVE was constructed by averaging the standardised continuous anomaly scores generated by PCA, NMF, UMAP, AE, and DEC. Before integration, all anomaly outputs were rescaled to a common fuzzy membership range of [0, 1], ensuring comparability among methods and preventing any single model from disproportionately influencing the ensemble result. The resulting AVE map represents a consensus-based anomaly index that emphasises spatially persistent geochemical signals consistently identified across multiple analytical paradigms.
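The construction described above can be sketched in a few lines. Linear min-max rescaling is assumed as the fuzzy membership function, since the source does not state which function was applied, and the method scores shown are hypothetical.

```python
def min_max_rescale(scores):
    """Linearly rescale scores to [0, 1]; constant input maps to zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def averaged_ensemble_index(method_scores):
    """Average the rescaled scores of all methods, sample by sample."""
    rescaled = [min_max_rescale(s) for s in method_scores.values()]
    n_methods = len(rescaled)
    return [sum(column) / n_methods for column in zip(*rescaled)]


# Hypothetical anomaly scores for three samples from two methods.
method_scores = {
    "PCA": [0.0, 5.0, 10.0],
    "AE": [1.0, 2.0, 3.0],
}
ave = averaged_ensemble_index(method_scores)
```

Rescaling before averaging is what prevents a method with a wide raw score range from dominating the index: in the example, both methods contribute equally even though their raw scales differ.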

As a baseline ensemble model, the AVE provides a reference framework for evaluating the relative performance of individual unsupervised methods while offering a more stable representation of mineral prospectivity in data-scarce, early-stage exploration environments. By integrating complementary information from multiple analytical approaches, the ensemble strategy reduces sensitivity to method-specific artefacts and improves the reliability of anomaly delineation [68].

4.7. Score Standardisation and Anomaly Map Generation

To ensure inter-method comparability, anomaly scores from all algorithms were normalised to a common scale prior to spatial analysis and mapping. The standardised outputs were subsequently converted into continuous raster layers, enabling consistent spatial comparison and quantitative validation across different anomaly detection methods.

All elemental distribution maps and model-derived anomaly outputs were transformed to a fuzzy membership scale ranging from 0 to 1. This transformation was applied solely to standardise the data and facilitate comparison and integration among heterogeneous geochemical and model-derived layers. It does not constitute an anomaly classification procedure but rather provides a common framework for multi-method evaluation and ensemble analysis.

4.8. Model Implementation and Hyperparameter Setting

To ensure a fair and consistent comparison, all unsupervised methods were applied to the same set of selected gold-related geochemical variables and identical pre-processed input data. For PCA, NMF, and UMAP, three latent components were retained to generate continuous prospectivity indices. The NMF model was implemented using random initialisation with a maximum of 500 iterations. UMAP was configured with three embedding dimensions, 15 nearest neighbours, a minimum distance of 0.1, and a fixed random seed (42) to ensure reproducibility. The key hyperparameters of each method are given in Table 2.

The AE and DEC models were implemented using compact neural network architectures with a three-dimensional latent representation. Model configurations were selected through preliminary experiments aimed at achieving stable latent representations while minimising unnecessary model complexity and reducing the risk of overfitting. Training was performed using the Adam optimiser and mean squared error (MSE) reconstruction loss. For DEC, clustering was initially established using K-means and subsequently refined through iterative optimisation of the latent feature space.
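The settings stated in this section can be collected in a single configuration, sketched below. Key names for PCA, NMF, and UMAP follow the constructor arguments of scikit-learn and umap-learn as an assumption about the tooling, which the source does not name; the AE and DEC keys are hypothetical labels. Settings that are not reported in the text (layer sizes, epochs, learning rate, number of DEC clusters) are deliberately omitted rather than guessed.

```python
# Hyperparameters as stated in the text; nothing unreported is filled in.
HYPERPARAMETERS = {
    "PCA": {"n_components": 3},
    "NMF": {"n_components": 3, "init": "random", "max_iter": 500},
    "UMAP": {
        "n_components": 3,
        "n_neighbors": 15,
        "min_dist": 0.1,
        "random_state": 42,
    },
    # Hypothetical key names for the neural models.
    "AE": {"latent_dim": 3, "optimizer": "adam", "reconstruction_loss": "mse"},
    "DEC": {
        "latent_dim": 3,
        "optimizer": "adam",
        "reconstruction_loss": "mse",
        "cluster_init": "kmeans",
    },
}
```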

Rather than performing extensive hyperparameter tuning for individual methods, parameter values were selected to provide stable, reproducible, and comparable representations across multiple runs. This strategy ensured that differences in performance primarily reflected the intrinsic characteristics of the algorithms rather than the effects of dataset-specific parameter optimisation.

5. Validation Strategy and Performance Assessment

Reliable validation is a critical component of mineral prospectivity mapping and geochemical anomaly detection studies, particularly when unsupervised learning methods are employed. In regional-scale exploration, the limited availability and potential spatial bias of labelled data often restrict the applicability of fully supervised approaches [69,70]. Consequently, independent validation using known mineralised and non-mineralised locations is widely adopted to assess the predictive performance of unsupervised models while preserving their exploratory nature [71,72,73].

In this research, validation was conducted using a set of known gold occurrences within the study area that were not involved in model development, parameter selection, or optimisation. These locations were used exclusively to evaluate the spatial correspondence between predicted anomaly intensities and independently verified mineralisation occurrences. By separating model construction from model evaluation, the validation framework provides an objective assessment of each method’s ability to identify geochemical patterns associated with gold mineralisation.

5.1. Reference Data for Validation

A total of 11 known gold mineralisation occurrences were used as positive reference locations and were complemented by a set of non-mineralised control points representing background geochemical conditions. The non-mineralised locations were selected outside known mineralised zones to minimise spatial ambiguity and maximise contrast with anomalous areas. This validation framework enables an objective assessment of model performance while preserving the unsupervised nature of the applied methods and avoiding the need for labelled training data.

5.2. ROC Analysis and AUC Estimation

ROC analysis was employed to evaluate the discriminative performance of each method by comparing the cumulative detection rates of mineralised and non-mineralised locations across a range of anomaly thresholds [74,75,76]. The area under the ROC curve (AUC) was used as a threshold-independent measure of model performance, with values approaching 1 indicating strong discriminative capability. In contrast, values near 0.5 correspond to performance equivalent to random prediction [77,78]. ROC–AUC analysis is particularly well suited to mineral exploration studies characterised by limited labelled data, as it enables objective comparison of predictive performance without requiring predefined classification thresholds [79,80]. Consequently, the ROC framework provides a robust basis for evaluating and ranking the relative effectiveness of alternative unsupervised anomaly detection methods.
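The ranking interpretation of the AUC can be illustrated with a minimal sketch. It assumes that each validation location has already been assigned a prospectivity score in [0, 1]; the function name `roc_auc` and the example scores are hypothetical and are not taken from the study. The AUC is computed here as the proportion of mineralised/non-mineralised pairs in which the mineralised location receives the higher score, which is equivalent to the area under the empirical ROC curve.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the probability that a mineralised site outranks a
    non-mineralised one; tied scores count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores for illustration only (not data from this study).
mineralised = [0.9, 0.8, 0.6, 0.4]
background = [0.7, 0.3, 0.2, 0.1]
print(roc_auc(mineralised, background))  # 14 of 16 pairs ranked correctly -> 0.875
```

With 11 positive and 11 negative locations there are 121 such pairs, so the AUC can only change in steps of 1/121.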

5.3. Matthews Correlation Coefficient (MCC)

In addition to ROC–AUC analysis, the Matthews Correlation Coefficient (MCC) was calculated to evaluate the balance between true positives, true negatives, false positives, and false negatives [81,82]. MCC is widely recognised as a robust performance metric for binary classification problems, particularly in situations involving imbalanced datasets, which are common in mineral exploration applications [83].

MCC values range from −1 to +1, where +1 indicates perfect agreement between predictions and observations, 0 corresponds to performance equivalent to random prediction, and −1 represents complete disagreement. Owing to its ability to account for all elements of the confusion matrix, MCC provides a balanced assessment of classification performance and complements ROC–AUC analysis in evaluating anomaly detection models.
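For reference, the standard definition of the MCC in terms of the confusion-matrix counts is given below, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. By convention, the MCC is commonly set to 0 when the denominator is zero.

```latex
\mathrm{MCC} =
  \frac{TP \cdot TN - FP \cdot FN}
       {\sqrt{(TP + FP)\,(TP + FN)\,(TN + FP)\,(TN + FN)}}
```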

5.4. Comparative Evaluation of Methods

The predictive performance of PCA, NMF, UMAP, AE, DEC, and AVE was systematically evaluated using ROC–AUC and MCC metrics. This multi-metric evaluation framework provides a comprehensive assessment of model performance by capturing both discriminatory capability (ROC–AUC) and classification balance (MCC). The ensemble-based AVE index was included to assess whether integrating complementary anomaly patterns from individual methods could enhance the robustness and stability of anomaly detection results.

6. Results

6.1. Anomaly Maps Derived from Individual Unsupervised Methods

Following geochemical preprocessing and the selection of eight gold-related elements (Au, As, Ag, Hg, Mo, Sb, W, and B), six unsupervised methods were applied to integrate the stream sediment geochemical data and generate regional gold prospectivity maps (Figure 5). The evaluated methods comprised PCA, NMF, UMAP, AE, DEC, and AVE. To ensure consistency and facilitate inter-method comparison, the outputs of all methods were transformed into continuous anomaly scores and normalised to the [0, 1] interval. This standardisation enabled direct comparison of anomaly intensities and spatial patterns across methods despite differences in their underlying analytical frameworks. The resulting prospectivity maps exhibit distinct spatial characteristics associated with the applied methods. Linear decomposition approaches, including PCA and NMF, generally produced smoother and more spatially extensive anomaly patterns, reflecting dominant variance structures within the geochemical dataset. In contrast, the neural network-based approaches (AE and DEC) delineated more localised and sharply defined anomalous zones, highlighting subtle multivariate relationships among the selected gold-related elements. UMAP, as a nonlinear manifold learning technique, identified anomaly patterns with relatively high spatial coherence while preserving complex relationships within the multivariate dataset. The AVE map displays a broader, more continuous anomaly distribution, reflecting the integrated response of the individual methods and emphasising geochemical patterns consistently identified across multiple analytical approaches.
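The rescaling step can be sketched as follows. This is an illustrative example only: it assumes simple min-max scaling, whereas the transformation used to obtain the fuzzy membership values may differ (for example, a logistic function). The helper names `min_max` and `average_index`, and the toy scores, are hypothetical. The averaging function merely shows how already-normalised layers can be combined with equal weights; which layers enter the AVE index is not restated here.

```python
def min_max(scores):
    """Rescale raw scores linearly to the [0, 1] interval."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # constant layer: no contrast to preserve
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def average_index(layers):
    """Equal-weight mean of several normalised layers, cell by cell."""
    n_layers = len(layers)
    return [sum(cell) / n_layers for cell in zip(*layers)]


# Hypothetical raw outputs of two methods for three grid cells.
raw = {"method_a": [-2.0, 0.0, 2.0], "method_b": [10.0, 30.0, 20.0]}
normalised = {name: min_max(values) for name, values in raw.items()}
print(normalised["method_a"])                      # [0.0, 0.5, 1.0]
print(average_index(list(normalised.values())))    # [0.0, 0.75, 0.75]
```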

6.2. Comparative Analysis of Anomaly Patterns

To evaluate the consistency and divergence among the six unsupervised methods, the resulting anomaly maps were compared through spatial and qualitative analysis. Although the magnitude, geometry, and spatial extent of anomalous zones vary among methods, several areas are consistently identified as anomalous across multiple approaches. The recurrence of these anomalies suggests persistent geochemical patterns that are relatively insensitive to the assumptions and analytical characteristics of individual methods. In contrast, certain anomalous features are identified by only one or a limited number of methods, reflecting differences in algorithmic sensitivity to local geochemical variability, nonlinear relationships, and background geochemical structure. These variations highlight the influence of methodological choice on anomaly delineation and demonstrate the importance of comparing multiple analytical approaches when evaluating exploration targets. Indeed, the comparative analysis indicates that the six methods provide complementary perspectives on the geochemical dataset. Anomalous zones consistently identified by multiple independent methods may therefore represent higher-confidence targets for subsequent exploration and field verification.

6.3. Spatial Agreement Analysis and Priority Target Delineation

Given that the primary objective of this study is to compare the performance of different unsupervised methods for integrating multi-element geochemical data, no additional algorithmic fusion procedure was applied beyond the AVE framework. Instead, a spatial agreement analysis was conducted to provide a transparent and decision-oriented synthesis of the anomaly detection results. For each anomaly map, the upper tail of the anomaly scores (approximately the top 10–15%) was extracted and converted into a binary layer distinguishing anomalous from non-anomalous areas. These binary layers were subsequently overlaid to count how many methods identified each spatial unit as anomalous. The resulting agreement map takes values from 0 to 6, representing the degree of consensus among the six applied methods. Areas characterised by high agreement values correspond to locations where multiple algorithms consistently identify anomalous geochemical signatures. Such zones are considered priority exploration targets because their delineation is supported by several analytical approaches rather than depending on the behaviour of a single method. Furthermore, the spatial correspondence between high-agreement zones and known gold occurrences provides additional support for the effectiveness of the consensus-based interpretation framework.
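The agreement procedure described above can be sketched as follows, assuming that each anomaly map is represented as a flat list of scores over the same grid cells. The function names, the 20% cut-off used in the toy example, and the scores themselves are hypothetical and chosen for illustration only.

```python
def top_fraction_mask(scores, fraction):
    """Return 1 for cells in the top `fraction` of scores, else 0.
    Cells tied with the cut-off value are all flagged."""
    k = max(1, round(len(scores) * fraction))
    cutoff = sorted(scores, reverse=True)[k - 1]
    return [1 if s >= cutoff else 0 for s in scores]


def agreement(score_sets, fraction=0.10):
    """Count, for each cell, how many maps flag it as anomalous."""
    masks = [top_fraction_mask(scores, fraction) for scores in score_sets]
    return [sum(cell) for cell in zip(*masks)]


# Two hypothetical maps over ten cells; top 20% flagged in each.
map_a = [0.1, 0.2, 0.9, 0.8, 0.3, 0.4, 0.5, 0.6, 0.05, 0.0]
map_b = [0.9, 0.1, 0.8, 0.2, 0.3, 0.4, 0.5, 0.6, 0.0, 0.05]
print(agreement([map_a, map_b], fraction=0.2))
# [1, 0, 2, 1, 0, 0, 0, 0, 0, 0]
```

With six input maps, the same function returns values between 0 and 6, matching the range of the agreement map.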

6.4. Rationale for Validation Based on Spatial Consistency

Because the applied methods are unsupervised, visual interpretation of anomaly maps alone is insufficient for objective performance assessment. Although labelled data are typically limited in early-stage mineral exploration, the presence of documented gold occurrences within the study area provides an independent basis for model validation.

Accordingly, the anomaly maps generated by the individual methods, together with the spatial agreement map presented in Section 6.3, were quantitatively evaluated. The underlying premise is that anomalies consistently identified by multiple independent methods are more likely to reflect geochemical signatures associated with mineralisation than anomalies detected by a single approach. In the following section, this premise is quantitatively assessed by comparing anomaly predictions against known gold occurrences and non-mineralised control locations. This evaluation enables an objective comparison of the relative predictive performance, reliability, and robustness of the investigated methods.

7. Validation and Discussion

7.1. Validation Framework and Reference Data

Although the proposed prospectivity mapping framework is fully unsupervised, independent validation was conducted using a limited set of known mineralised and non-mineralised locations. This approach reflects realistic early-stage mineral exploration conditions, where verified labels are limited and are primarily used for post-model evaluation rather than model training. A total of 11 confirmed gold occurrences and 11 non-mineralised reference locations were used for validation. While the number of validation points is relatively small, it is representative of regional-scale exploration settings, where verified mineral occurrences are sparse and carefully documented. Consequently, the objective of the validation is comparative rather than absolute model certification. The validation locations were overlaid on each prospectivity map generated by the six unsupervised methods, and prospectivity scores were extracted to enable both ranking- and threshold-based performance evaluations. To ensure consistency, all prospectivity maps were normalised to a common fuzzy membership scale ranging from 0 to 1. This standardisation facilitated direct comparison among methods despite differences in their underlying analytical frameworks. Non-mineralised reference locations were selected from areas lacking documented mineral occurrences and favourable geological indicators in order to minimise ambiguity in the representation of background conditions.

7.2. ROC-AUC Analysis

The ROC curve was used to evaluate each method’s ability to discriminate between mineralised and non-mineralised locations across a range of decision thresholds. The area under the ROC curve (AUC) provides a threshold-independent measure of ranking performance, reflecting discriminatory capability rather than strict classification accuracy. This characteristic makes ROC–AUC particularly suitable for mineral prospectivity mapping, where the primary objective is to prioritise anomalous locations based on their likelihood of mineralisation rather than assign binary class labels. Figure 6 presents the ROC curves for the six investigated methods, with the corresponding AUC values summarised in the same figure. Higher AUC values indicate a greater ability to distinguish mineralised occurrences from background locations and therefore reflect superior anomaly ranking performance.

The ROC–AUC results reveal clear differences in the discriminatory performance of the investigated methods. DEC achieved the highest AUC value (0.87), followed by AE (0.80), indicating superior capability in distinguishing mineralised from non-mineralised locations. These results suggest that representation-learning-based approaches are effective at capturing complex nonlinear geochemical patterns associated with gold mineralisation. The AVE and NMF methods exhibited moderate performance, with AUC values of 0.78 and 0.76, respectively. PCA achieved an AUC of 0.71, indicating acceptable but comparatively lower performance. The reduced effectiveness of PCA may reflect the limitations of linear dimensionality reduction when applied to geochemical datasets characterised by complex multivariate and nonlinear relationships. UMAP produced the lowest AUC value (0.59), only slightly exceeding the level expected from random discrimination.

Given the limited number of validation samples, the reported AUC values should be interpreted primarily as relative indicators of model performance rather than definitive statistical estimates. Nevertheless, the results provide a useful basis for comparing the ability of different unsupervised methods to rank and prioritise geochemical anomalies in a regional exploration context.

7.3. MCC Analysis

In addition to ROC–AUC analysis, model performance was evaluated using the MCC, which provides a balanced measure of classification performance by simultaneously accounting for true positives, true negatives, false positives, and false negatives. MCC is particularly valuable in mineral prospectivity studies because it incorporates all elements of the confusion matrix and provides a robust assessment of classification quality [81,82,83]. The decision threshold for MCC computation was set at the point on the ROC curve that maximises the Youden Index, thereby linking the threshold-dependent evaluation (MCC) to the threshold-independent one (ROC–AUC). As shown in Figure 7, DEC achieved the highest MCC value (0.65), closely followed by AE (0.64), indicating the strongest overall classification performance among the investigated methods. PCA (0.57), NMF (0.55), and AVE (0.54) exhibited moderate performance and produced relatively similar MCC values. In contrast, UMAP yielded the lowest MCC (0.27), indicating comparatively weaker classification consistency within the adopted validation framework. The MCC results are broadly consistent with the ROC–AUC analysis: DEC and AE rank as the two best-performing methods and UMAP as the weakest under both metrics. The ordering of the three mid-ranked methods differs between the metrics (AVE, NMF, PCA by AUC; PCA, NMF, AVE by MCC), although the MCC differences among them are small (0.54–0.57). The agreement of the two complementary metrics at the top and bottom of the ranking strengthens confidence in the comparative assessment of the investigated unsupervised approaches.
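The threshold selection and MCC computation can be sketched as follows. This is a minimal illustration under stated assumptions: every observed score is tried as a candidate threshold, a location is classified as anomalous when its score is greater than or equal to the threshold, and the first threshold attaining the maximum Youden Index (TPR − FPR) is retained. The function names and example scores are hypothetical and do not reproduce the values reported in Figure 7.

```python
import math


def confusion(pos_scores, neg_scores, threshold):
    """Confusion-matrix counts when scores >= threshold are flagged."""
    tp = sum(1 for s in pos_scores if s >= threshold)
    fn = len(pos_scores) - tp
    fp = sum(1 for s in neg_scores if s >= threshold)
    tn = len(neg_scores) - fp
    return tp, fp, tn, fn


def youden_threshold(pos_scores, neg_scores):
    """Observed score that maximises J = TPR - FPR."""
    best_t, best_j = None, -1.0
    for t in sorted(set(pos_scores) | set(neg_scores)):
        tp, fp, tn, fn = confusion(pos_scores, neg_scores, t)
        j = tp / (tp + fn) - fp / (fp + tn)
        if j > best_j:
            best_t, best_j = t, j
    return best_t


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical scores for illustration only (not data from this study).
mineralised = [0.9, 0.8, 0.6, 0.4]
background = [0.7, 0.3, 0.2, 0.1]
t = youden_threshold(mineralised, background)          # 0.4
print(t, mcc(*confusion(mineralised, background, t)))  # MCC is about 0.775
```

Because the threshold is tuned on the same locations that are used to compute the MCC, the resulting values are best read as comparative indicators across methods.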

7.4. Comparative Discussion of Unsupervised Integration Strategies

The comparative evaluation of the six unsupervised integration methods reveals clear differences in their ability to synthesise multi-element geochemical information into reliable gold prospectivity maps. Although all methods were applied to the same set of fuzzified geochemical variables and evaluated using an identical validation framework, their performance varied considerably due to differences in their underlying analytical principles and sensitivities to data structure. The objective of this study is not to identify a universally optimal predictive model, but rather to compare the behaviour and robustness of alternative unsupervised integration strategies under realistic regional exploration conditions characterised by limited validation data.

Linear representation-based approaches, such as PCA, exhibited relatively conservative behaviour, characterised by lower false-positive rates but only moderate true-positive rates. This suggests that PCA primarily captures dominant variance structures within the dataset, potentially overlooking subtle yet geologically meaningful geochemical signatures associated with mineralisation. While this characteristic may reduce the occurrence of spurious anomalies, it can also limit the detection of weak or spatially localised mineralisation-related patterns. In orogenic gold systems, mineralisation is commonly expressed through subtle but spatially coherent pathfinder-element associations rather than dominant variance structures, which may explain the comparatively moderate performance of PCA.

NMF achieved a higher ROC–AUC than PCA (0.76 versus 0.71), although its MCC was marginally lower (0.55 versus 0.57). The improvement in ranking performance is consistent with the ability of NMF to extract additive and non-negative geochemical patterns, which match the nature of geochemical concentration data, and suggests that parts-based representations are better suited to capturing mineralisation-related associations among elements. Nevertheless, its moderate MCC and ROC–AUC values indicate that NMF remains sensitive to noise and compositional redundancy, particularly where mineralisation is controlled by complex multivariate interactions among pathfinder elements.

The nonlinear embedding approaches exhibited contrasting behaviour. UMAP, despite its effectiveness in dimensionality reduction and visual exploration of high-dimensional datasets, produced the weakest predictive performance among the methods investigated. This result is consistent with UMAP’s primary objective: preserving topological relationships and neighbourhood structure rather than maximising anomaly separation. Consequently, the preservation of local data structure alone does not necessarily translate into effective anomaly delineation or exploration targeting performance.

Neural network-based approaches, particularly AE and DEC, consistently outperformed the other investigated methods. The strong performance of the autoencoder suggests that nonlinear dependencies among gold-related geochemical variables contain important information that cannot be fully captured by linear or matrix-factorisation-based approaches. By learning compact latent representations of the multivariate geochemical dataset, the autoencoder identified anomaly patterns associated with mineralisation while maintaining balanced predictive performance.

Among all methods, DEC achieved the highest ROC–AUC and MCC values, indicating superior capability in discriminating between mineralised and non-mineralised locations. This performance can be attributed to the joint optimisation of latent feature representation and clustering-oriented embedding, which enhances the separation between background geochemical populations and mineralisation-related anomalies. The results further demonstrate that continuous anomaly scores derived from the DEC latent space provide an effective basis for prospectivity mapping and exploration targeting.

The AVE approach achieved relatively high sensitivity but was associated with more false-positive predictions. This behaviour suggests that simple averaging strategies may overemphasise cumulative geochemical responses, thereby enlarging the spatial extent of anomalous zones. Such responses may incorporate background enrichment patterns that are not directly related to mineralisation. Nevertheless, AVE remains a useful baseline integration method because of its simplicity, transparency, and widespread application in mineral exploration studies. Although it assumes equal contributions from all input variables and does not explicitly account for nonlinear relationships, it provides a valuable benchmark against which more advanced unsupervised learning approaches can be evaluated.

Overall, the comparative results indicate that methods capable of modelling nonlinear and multivariate relationships provide a more realistic representation of the geochemical processes associated with gold mineralisation. This advantage becomes particularly important in regional exploration settings, where mineralisation signatures are often subtle, spatially dispersed, and controlled by complex interactions among multiple pathfinder elements.

8. Conclusions

This study presents a comprehensive unsupervised framework for integrating multivariate stream sediment geochemical data to delineate gold prospectivity zones under data-limited exploration conditions. By combining compositional data analysis, geologically informed element selection, and multiple unsupervised learning techniques, the proposed framework provides an effective alternative to supervised approaches that require large and well-labelled datasets.

The comparative evaluation of six integration methods demonstrated that model performance is strongly influenced by the underlying analytical approach. Linear and heuristic methods effectively captured dominant geochemical structures but were less successful in identifying subtle mineralisation-related patterns. In contrast, nonlinear and representation learning-based approaches, particularly neural network models, showed a greater capacity to model the complex multivariate relationships associated with gold mineralisation.

Among the investigated methods, Deep Embedded Clustering (DEC) consistently achieved the highest performance on both ROC–AUC and MCC, highlighting the effectiveness of jointly optimising latent feature representation and clustering structure for geochemical anomaly detection. The use of continuous anomaly scores rather than discrete class assignments further enhanced the interpretability and practical applicability of the resulting prospectivity maps.

The validation framework based on ROC–AUC and MCC proved effective for benchmarking unsupervised models with limited reference data. By combining threshold-independent and threshold-dependent evaluation metrics, the approach provides a balanced assessment of anomaly detection capability and classification performance in regional exploration settings.

Overall, the results demonstrate that advanced unsupervised learning techniques, when integrated with geochemical domain knowledge and appropriate data preprocessing, can substantially improve the identification and prioritisation of prospective mineralisation targets. The proposed framework provides a transparent and scalable methodology for regional-scale prospectivity mapping and offers a practical decision-support tool for mineral exploration in areas characterised by limited geological information and sparse validation data.

From an operational exploration perspective, DEC and AE emerged as the most promising methods for regional target prioritisation, particularly in data-constrained environments where reliable labelled datasets are unavailable or incomplete. Future research should evaluate the transferability of the proposed framework across different geological settings, deposit types, and geochemical datasets to further assess its general applicability.

Author Contributions

Conceptualisation, K.M.; methodology, K.M. and B.J.S.; software, K.M.; validation, K.M.; data curation, K.M.; writing—original draft preparation, K.M. and B.J.S.; writing—review and editing, K.M., B.J.S. and A.M.; project administration, K.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are not publicly available. Access may be granted upon reasonable and urgent request to the corresponding author, subject to ethical and technical feasibility.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

Salminen, R.; De Vos, W.; Tarvainen, T. Geochemical Atlas of Europe; Geological Survey of Finland: Espoo, Finland, 2008.
Reimann, C.; Filzmoser, P.; Garrett, R.; Dutter, R. Statistical Data Analysis Explained: Applied Environmental Statistics with R; John Wiley & Sons: Chichester, UK, 2011.
Govett, G.J.S. Rock Geochemistry in Mineral Exploration; Elsevier: Amsterdam, The Netherlands, 1983; Volume 3, pp. 1–461.
Macheyeki, A.S.; Li, X.; Kafumu, D.P.; Yuan, F. Elements of exploration geochemistry. In Applied Geochemistry: Advances in Mineral Exploration Techniques; Elsevier: Amsterdam, The Netherlands, 2020; pp. 1–43.
Reimann, C.; Filzmoser, P. Normal and lognormal data distribution in geochemistry: Death of a myth. Consequences for the statistical treatment of geochemical and environmental data. Environ. Geol. 2000, 39, 1001–1014.
Carranza, E.J.M. Geochemical Anomaly and Mineral Prospectivity Mapping in GIS; Elsevier: Amsterdam, The Netherlands, 2008; Volume 11, pp. 1–368.
Groves, D.I.; Goldfarb, R.J.; Gebre-Mariam, M.; Hagemann, S.G.; Robert, F. Orogenic gold deposits: A proposed classification in the context of their crustal distribution and relationship to other gold deposit types. Ore Geol. Rev. 1998, 13, 7–27.
Fang, X.; Tang, J.; Beaudoin, G.; Song, Y.; Chen, Y. Geology, mineralogy and geochemistry of the Shangxu orogenic gold deposit, central Tibet, China: Implications for mineral exploration. Ore Geol. Rev. 2020, 120, 103440.
Goldfarb, R.J.; Groves, D.I.; Gardoll, S. Orogenic gold and geologic time: A global synthesis. Ore Geol. Rev. 2001, 18, 1–75.
Kazapoe, R.W. A review of the characteristics and geological settings of orogenic gold deposits of the Boule Mossi Domain: Implication for gold exploration. Geol. Ecol. Landsc. 2025, 9, 690–705.
Cracknell, M.J.; Reading, A.M. Geological mapping using remote sensing data: A comparison of five machine learning algorithms, their response to variations in the spatial distribution of training data and the use of explicit spatial information. Comput. Geosci. 2014, 63, 22–33.
Carranza, E.J.M.; Laborte, A.G. Data-driven predictive modeling of mineral prospectivity using random forests: A case study in Catanduanes Island (Philippines). Nat. Resour. Res. 2016, 25, 35–50.
Mostafaei, K.; Kianpour, M.; Yousefi, M. The use of the Gray Wolf Optimizer algorithm in separating anomalies from the background, Case study: Alut area. Sci. Q. J. Geosci. 2024, 34, 75–88.
Mostafaei, K.; Yousefi, M.; Kreuzer, O.; Kianpour, M.N. Simulation-based mineral prospectivity modeling and Gray Wolf optimization algorithm for delimiting exploration targets. Ore Geol. Rev. 2025, 177, 106458.
Rodriguez-Galiano, V.F.; Chica-Rivas, M. Evaluation of different machine learning methods for land cover mapping of a Mediterranean area using multi-seasonal Landsat images and Digital Terrain Models. Int. J. Digit. Earth 2014, 7, 492–509.
Zhang, S.; Carranza, E.J.M.; Wei, H.; Xiao, K.; Yang, F.; Xiang, J.; Xu, Y. Data-driven mineral prospectivity mapping by joint application of unsupervised convolutional auto-encoder network and supervised convolutional neural network. Nat. Resour. Res. 2021, 30, 1011–1031.
Mostafaei, K.; Maleki, S.; Jodeiri Shokri, B.; Yousefi, M. Predicting gold grade by using support vector machine and neural network to generate an evidence layer for 3D prospectivity analysis. Int. J. Min. Geo-Eng. 2023, 57, 435–444.
Sun, K.; Chen, Y.; Geng, G.; Lu, Z.; Zhang, W.; Song, Z.; Zhang, Z. A Review of Mineral Prospectivity Mapping Using Deep Learning. Minerals 2024, 14, 1021.
Reimann, C.; Filzmoser, P.; Garrett, R.G. Background and threshold: Critical comparison of methods of determination. Sci. Total Environ. 2005, 346, 1–16.
Mostafaei, K.; Kianpour, M.N.; Yousefi, M.; Saleki, M. Improving the results of the fractal model of geochemical Mineralization Probability Index Using the Gray Wolf Algorithm on the Stream Sediments Data of Sarduiyeh-Baft Area. J. Min. Environ. 2025, 16, 1165–1178.
Mostafaei, K.; Kianpour, M.N.; Yousefi, M. Delineation of gold exploration targets based on prospectivity models through an optimization algorithm. J. Min. Environ. 2024, 15, 597–611.
Stöcklin, J. Structural history and tectonics of Iran. AAPG Bull. 1968, 52, 1229–1258.
Ghorbani, M. A Summary of Geology of Iran. In The Economic Geology of Iran: Mineral Deposits and Natural Resources; Springer Geology; Springer: Dordrecht, The Netherlands, 2013; pp. 45–64.
Aliyari, F.; Rastad, E.; Mohajjel, M.; Arehart, G.B. Geology and geochemistry of D–O–C isotope systematics of the Qolqoleh gold deposit, Northwestern Iran: Implications for ore genesis. Ore Geol. Rev. 2009, 36, 306–314.
Afzal, P.; Ahari, H.D.; Omran, N.R.; Aliyari, F. Delineation of gold mineralized zones using concentration–volume fractal model in Qolqoleh gold deposit, NW Iran. Ore Geol. Rev. 2013, 55, 125–133.
Azizi, H.; Stern, R.J. Jurassic igneous rocks of the central Sanandaj–Sirjan zone (Iran) mark a propagating continental rift, not a magmatic arc. Terra Nova 2019, 31, 415–423.
Haji, E.; Safari, H.; Shafiei Bafti, B.; Mojallal, M. Saqez–Sardasht Goldfield, North Sanandaj-Sirjan Zone, Iran: A Tectono-Metallogenic Synthesis. Acta Geol. Sin. 2020, 94, 1693–1710.
Hosseini, S.A.; Afzal, P.; Sadeghi, B.; Sharmad, T.; Shahrokhi, S.V.; Farhadinejad, T. Prospection of Au mineralization based on stream sediments and lithogeochemical data using multifractal modeling in Alut 1:100,000 sheet, NW Iran. Arab. J. Geosci. 2015, 8, 3867–3879.
Mohammadpour, M.; Bahroudi, A.; Abedi, M.; Rahimipour, G.; Jozanikohan, G.; Khalifani, F.M. Geochemical distribution mapping by combining number-size multifractal model and multiple indicator kriging. J. Geochem. Explor. 2019, 200, 13–26.
Sillitoe, R.H. Gold Deposit Types: An Overview. In Geology of the World’s Major Gold Deposits and Provinces; Sillitoe, R.H., Goldfarb, R.J., Robert, F., Simmons, S.F., Eds.; Society of Economic Geologists: Littleton, CO, USA, 2020.
Aliyari, F.; Rastad, E.; Mohajjel, M. Gold Deposits in the Sanandaj–Sirjan Zone: Orogenic Gold Deposits or Intrusion-Related Gold Systems? Resour. Geol. 2012, 62, 296–315.
Sanford, R.F.; Pierson, C.T.; Crovelli, R.A. An objective replacement method for censored geochemical data. Math. Geol. 1993, 25, 59–80.
Verbovšek, T. A comparison of parameters below the limit of detection in geochemical analyses by substitution methods = Primerjava ocenitev parametrov pod mejo določljivosti pri geokemičnih analizah z metodo nadomeščanja. RMZ-Mater. Geoenviron. 2011, 58, 393–404.
Aitchison, J. Principles of compositional data analysis. In Multivariate Analysis and Its Applications; IMS Lecture Notes-Monograph Series; Institute of Mathematical Statistics: Beachwood, OH, USA, 1994; pp. 73–81.
Filzmoser, P.; Hron, K.; Templ, M. Applied Compositional Data Analysis: With Worked Examples in R; Springer: Cham, Switzerland, 2018; Volume 1, pp. 1–280.
Shelton, J.L.; Engle, M.A.; Buccianti, A.; Blondes, M.S. The isometric log-ratio (ilr)-ion plot: A proposed alternative to the Piper diagram. J. Geochem. Explor. 2018, 190, 130–141.
Liu, X.; Wang, W.; Pei, Y.; Yu, P. A knowledge-driven way to interpret the isometric log-ratio transformation and mixture distributions of geochemical data. J. Geochem. Explor. 2020, 210, 106417.
Darabi-Golestan, F.; Hezarkhani, A. Evaluation of elemental mineralization rank using fractal and multivariate techniques and improving the performance by log-ratio transformation. J. Geochem. Explor. 2018, 189, 11–24.
Wang, J.; Zhou, Y.; Xiao, F. Identification of multi-element geochemical anomalies using unsupervised machine learning algorithms: A case study from Ag–Pb–Zn deposits in north-western Zhejiang, China. Appl. Geochem. 2020, 120, 104679.
Guan, Q.; Ren, S.; Chen, L.; Yao, Y.; Hu, Y.; Wang, R.; Chen, W. Recognizing multivariate geochemical anomalies related to mineralization by using deep unsupervised graph learning. Nat. Resour. Res. 2022, 31, 2225–2245.
Wu, G.; Chen, G.; Cheng, Q.; Zhang, Z.; Yang, J. Unsupervised machine learning for lithological mapping using geochemical data in covered areas of Jining, China. Nat. Resour. Res. 2021, 30, 1053–1068.
Zhang, S.; Carranza, E.J.M.; Fu, C.; Zhang, W.; Qin, X. Interpretable Machine Learning for Geochemical Anomaly Delineation in the Yuanbo Nang District, Gansu Province, China. Minerals 2024, 14, 500.
Min, Y.; Zhao, J.; Sui, Y.; Liu, J.; Zou, S.; Zhu, H. Comparison of manifold learning algorithms for identifying geochemical anomalies associated with copper mineralization. Sci. Rep. 2025, 15, 39628.
Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016, 374, 20150202.
Hajihosseinlou, M.; Maghsoudi, A.; Ghezelbash, R. Geochemical Anomaly Detection and Pattern Recognition: A Combined Study of the Apriori Algorithm, Principal Component Analysis, and Spectral Clustering. Minerals 2024, 14, 1202.
Wang, H.; Yuan, Z.; Cheng, Q.; Zhang, S. Geochemical anomaly mapping using sparse principal component analysis in Jining, Inner Mongolia, China. J. Geochem. Explor. 2022, 234, 106936.
Li, H.; Li, Z.; Ouyang, Y.; Yang, L.; Deng, Y.; Jiang, Q.; Zeng, H. Application of principal component analysis and a spectrum-area fractal model to identify geochemical anomalies associated with vanadium mineralization in northeastern Jiangxi Province, South China. Geochem. Explor. Environ. Anal. 2022, 22, geochem2021-090.
Sadeghi, M.; Casey, P.; Carranza, E.J.M.; Lynch, E.P. Principal components analysis and K-means clustering of till geochemical data: Mapping and targeting of prospective areas for lithium exploration in Västernorrland Region, Sweden. Ore Geol. Rev. 2024, 167, 106002.
Lin, X.; Boutros, P.C. Optimization and expansion of non-negative matrix factorization. BMC Bioinform. 2020, 21, 7.
Gan, J.; Liu, T.; Li, L.; Zhang, J. Non-negative matrix factorization: A survey. Comput. J. 2021, 64, 1080–1092.
Huang, Z.; Cai, D.; Sun, Y. Towards more accurate microbial source tracking via non-negative matrix factorization (NMF). Bioinformatics 2024, 40, i68–i78.
Ren, Z.; Zhai, Q.; Sun, L. A novel method for hyperspectral mineral mapping based on clustering-matching and nonnegative matrix factorization. Remote Sens. 2022, 14, 1042.
Yuan, K.; Shang, Y.; Guo, H.; Dong, Y.; Liu, Z. A model of feature extraction for well logging data based on graph regularized non-negative matrix factorization with optimal estimation. Complex Intell. Syst. 2025, 11, 180.
Armstrong, G.; Martino, C.; Rahman, G.; Gonzalez, A.; Vázquez-Baeza, Y.; Mishne, G.; Knight, R. Uniform manifold approximation and projection (UMAP) reveals composite patterns and resolves visualization artifacts in microbiome data. mSystems 2021, 6, 10–1128.
Ghojogh, B.; Crowley, M.; Karray, F.; Ghodsi, A. Unified Spectral Framework and Maximum Variance Unfolding. In Elements of Dimensionality Reduction and Manifold Learning; Springer: Cham, Switzerland, 2023; pp. 479–497. [Google Scholar] [CrossRef]
Vermeulen, M.; Smith, K.; Eremin, K.; Rayner, G.; Walton, M. Application of Uniform Manifold Approximation and Projection (UMAP) in spectral imaging of artworks. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2021, 252, 119547. [Google Scholar] [CrossRef]
Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef] [PubMed]
Sun, K.; Zhang, J.; Zhang, C.; Hu, J. Generalized extreme learning machine autoencoder and a new deep neural network. Neurocomputing 2017, 230, 374–381. [Google Scholar] [CrossRef]
An, J.; Cho, S. Variational autoencoder based anomaly detection using reconstruction probability. Spec. Lect. IE 2015, 2, 1–18. [Google Scholar]
Feng, B.; Chen, L.; Xu, Y.; Zhang, Y. Comparative study on three autoencoder-based deep learning algorithms for geochemical anomaly identification. Earth Space Sci. 2022, 9, e2022EA002626. [Google Scholar] [CrossRef]
Luo, T.; Zhou, Z.; Tang, L.; Gong, H.; Liu, B. Identification of Geochemical Anomalies Using a Memory-Augmented Autoencoder Model with Geological Constraint. Nat. Resour. Res. 2025, 34, 23–40. [Google Scholar] [CrossRef]
Xie, J.; Girshick, R.; Farhadi, A. Unsupervised Deep Embedding for Clustering Analysis. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 20–22 June 2016; PMLR 48. pp. 478–487. [Google Scholar]
Sheng, G.; Wang, Q.; Pei, C.; Gao, Q. Contrastive deep embedded clustering. Neurocomputing 2022, 514, 13–20. [Google Scholar] [CrossRef]
Guo, X.; Gao, L.; Liu, X.; Yin, J. Improved Deep Embedded Clustering with Local Structure Preservation. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), Melbourne, Australia, 19–25 August 2017; pp. 1753–1759. [Google Scholar] [CrossRef]
Hoseinzade, Z.; Bazoobandi, M.H. Deep embedded clustering: Delineating multivariate geochemical anomalies in the Feizabad region. Geochemistry 2024, 84, 126208. [Google Scholar] [CrossRef]
Saremi, M.; Hoseinzade, Z.; Yousefi, M. A deep embedded clustering algorithm in conjunction with an ensemble technique for mineral prospectivity mapping. Sci. Rep. 2025, 15, 38086. [Google Scholar] [CrossRef]
Truong, X.Q.; Tran, T.L.; Dang, T.C.; Tran, V.A.; Truong, X.L. Gold Mineral Prospectivity Mapping Using the Ensemble Model Approach, Case Study of Northwestern Thanh Hoa Province, Vietnam. Int. J. Geoinform. 2025, 21, 1–16. [Google Scholar] [CrossRef]
Nerger, L.; Hiller, W. Software for ensemble-based data assimilation systems—Implementation strategies and scalability. Comput. Geosci. 2013, 55, 110–118. [Google Scholar] [CrossRef]
Carranza, E.J.M.; Laborte, A.G. Data-driven predictive mapping of gold prospectivity, Baguio district, Philippines: Application of Random Forests algorithm. Ore Geol. Rev. 2015, 71, 777–787. [Google Scholar] [CrossRef]
Shahrestani, S.; Sanislav, I.; Fereydooni, H. Rotation-based outlier detection for geochemical anomaly identification in stream sediment multivariate data. Earth Sci. Inform. 2025, 18, 1–18. [Google Scholar] [CrossRef]
Bourdeau, J.E.; Zhang, S.E.; Lawley, C.J.; Parsa, M.; Nwaila, G.T.; Ghorbani, Y. Predictive geochemical exploration: Inferential generation of modern geochemical data, anomaly detection and application to northern Manitoba. Nat. Resour. Res. 2023, 32, 2355–2386. [Google Scholar] [CrossRef]
Huang, D.; Zuo, R.; Wang, J.; Tolosana-Delgado, R. Combining sequential Gaussian co-simulation and Monte Carlo dropout-based deep learning models for geochemical anomaly detection and uncertainty assessment. Appl. Geochem. 2025, 184, 106385. [Google Scholar] [CrossRef]
Bi, R.; Liu, D.; Xia, Q. Identification of Geochemical Anomalies Using a Deep Semi-supervised Anomaly Detection Model: Bi, Liu, and Xia. Nat. Resour. Res. 2026, 35, 1409–1422. [Google Scholar] [CrossRef]
Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
Hoo, Z.H.; Candlish, J.; Teare, D. What is an ROC curve? Emerg. Med. J. 2017, 34, 357–359. [Google Scholar] [CrossRef]
Nakas, C.T.; Bantis, L.E.; Gatsonis, C.A. ROC Analysis for Classification and Prediction in Practice; Chapman and Hall/CRC: Boca Raton, FL, USA, 2023. [Google Scholar]
Heda, V.; Dubey, R.; Tewari, A.; Sarkar, B.C. Integrating Domain-Aware Machine Learning for Mineral Prospectivity Modelling. In Innovative and Responsible Mining for Inclusive Growth (AMC 2025); Springer Proceedings in Earth and Environmental Sciences; Springer: Cham, Switzerland, 2025. [Google Scholar] [CrossRef]
Ehigiator-Irughe, R.; Odumosu, J.O.; Nwodo, J.; Oladosu, O.; Ehigiator, M.O. Mapping of High Potential Gold Mineralization Zones from Merged Optical and Radar Images Using Analytical Hierarchical Process (AHP). Trop. J. Built Environ. (TJOBE) 2025, 6, 106–120. [Google Scholar]
Tende, A.W.; Gajere, J.N.; Amuda, A.K.; Ige, O.O.; Bale, R.B.; Aminu, M.D.; Faisal, M. Prospectivity mapping of gold and cassiterite mineralization using satellite multispectral imagery, geophysical data, and weighted sum model. Model. Earth Syst. Environ. 2025, 11, 156. [Google Scholar] [CrossRef]
Qaderi, S.; Maghsoudi, A. Comparative Analysis of One-Dimensional Convolutional Neural Network and Predictive Raster Averaging for MVT Pb-Zn Mineralization Using the Weighted Class Distribution Evaluation Framework. Appl. Geochem. 2025, 194, 106595. [Google Scholar] [CrossRef]
Chicco, D.; Jurman, G. An invitation to greater use of Matthews correlation coefficient in robotics and artificial intelligence. Front. Robot. AI 2022, 9, 876814. [Google Scholar] [CrossRef] [PubMed]
Diallo, R.; Edalo, C.; Awe, O.O. Machine Learning Evaluation of Imbalanced Health Data: A Comparative Analysis of Balanced Accuracy, MCC, and F1 Score. In Practical Statistical Learning and Data Science Methods; Awe, O.O., Vance, A.E., Eds.; Springer: Cham, Switzerland, 2025. [Google Scholar] [CrossRef]
Chicco, D.; Jurman, G. The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification. BioData Min. 2023, 16, 4. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Structural zones map of Iran showing the location of the study area in the Sanandaj–Sirjan zone.

Figure 2. Distribution of stream sediment samples over the study area.

Figure 3. Spatial distribution of selected gold-related elements ((a) Au; (b) Ag; (c) As; (d) B; (e) Hg; (f) Mo; (g) Sb and (h) W) in stream sediment samples across the study area. These maps illustrate the heterogeneous geochemical background and provide contextual information for subsequent multivariate anomaly detection analyses.

Figure 4. Workflow of the unsupervised geochemical prospectivity mapping framework.

Figure 5. Comparison of fuzzy-scaled prospectivity maps generated by unsupervised methods: (a) PCA; (b) NMF; (c) UMAP; (d) AE; (e) DEC, and (f) AVE.

Figure 6. ROC curves comparing the performance of six unsupervised methods for gold prospectivity based on known mineralised and non-mineralised locations. The diagonal black line represents the reference line corresponding to random classification (AUC = 0.5).
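The AUC summarised in Figure 6 can be read as the probability that a randomly chosen mineralised location receives a higher prospectivity score than a randomly chosen non-mineralised one, which is why uninformative scores fall on the 0.5 reference line. The following is a minimal sketch of that pairwise definition; the function name, scores, and labels are illustrative assumptions and are not values from this study.

```python
def roc_auc_pairwise(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    scores: fuzzy prospectivity scores in [0, 1] (hypothetical values).
    labels: 1 for mineralised, 0 for non-mineralised reference points.
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative reference points only, not data from the study area.
example_scores = [0.9, 0.8, 0.3, 0.2]
example_labels = [1, 0, 1, 0]
example_auc = roc_auc_pairwise(example_scores, example_labels)
```

A model that assigns the same score everywhere ties on every pair and therefore returns exactly 0.5, matching the diagonal in the figure.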

Figure 7. Performance comparison of unsupervised methods based on MCC.
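For readers who want to reproduce the kind of comparison shown in Figure 7, the sketch below computes the MCC from a confusion matrix after thresholding a fuzzy score. The threshold of 0.5, the function name, and the sample values are assumptions made for illustration; the threshold actually used in the study is not stated in this section.

```python
import math


def mcc_from_scores(scores, labels, threshold=0.5):
    """Matthews Correlation Coefficient after thresholding fuzzy scores.

    A location is predicted as mineralised when score >= threshold.
    The threshold value is a hypothetical choice for this sketch.
    """
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        elif predicted == 1 and label == 0:
            fp += 1
        else:
            fn += 1

    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Illustrative values only, not data from the study area.
perfect = mcc_from_scores([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
chance = mcc_from_scores([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

Unlike the AUC, the MCC depends on the chosen threshold, so the two metrics can rank methods differently.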

Table 1. Descriptive statistics of stream sediment elements after censored data treatment.

Element	Mean	Std. Dev.	Coef. Var. (%)	Minimum	Median	Maximum	Skewness
Ag	0.11	0.032	30.88	0.01	0.1	0.47	3.30
As	15.50	11.35	73.29	1.92	11.6	111	2.23
Au	0.008	0.0796	1065.76	0.0004	0.002	1.84	19.06
Ba	426.97	170.51	39.94	120	400	2500	4.01
B	53.59	29.37	54.82	5	50	200	1.13
Be	1.97	0.683	34.78	0.18	1.9	4.7	0.68
Bi	0.291	0.113	38.92	0.1	0.26	1.38	3.04
Co	23.60	6.82	28.89	10	22.5	62	1.30
Cr	173.50	253.95	146.36	50	140	4600	12.86
Cu	35.55	9.38	26.38	13	35	68	0.33
Hg	0.096	0.043	44.81	0.011	0.09	0.41	1.22
Mn	1000	470.6	47.08	220	900	4200	1.99
Mo	1.51	1.69	112.25	0.14	1.19	39	13.78
Ni	79.10	38.60	48.80	18	72	830	9.54
Pb	18.89	7.702	40.78	5	17	72	2.09
Sb	1.38	1.075	77.99	0.15	1.13	18.2	5.89
Sn	3.66	1.058	28.93	2.1	3.5	21	7.59
Ti	7636	2979	39.01	1900	7000	28,000	1.38
W	2.42	8.1	335.32	0.09	1.25	149	11.80
Zn	94.81	22.8	24.04	51	92	380	3.04
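The coefficient of variation column in Table 1 appears to be the standard deviation expressed as a percentage of the mean; this is an inference from the tabulated values rather than a definition given in the table. The short sketch below recomputes it from the rounded means and standard deviations, so small differences from the published column are expected.

```python
def coefficient_of_variation(mean, std_dev):
    """Return the coefficient of variation in percent (100 * std / mean)."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return 100.0 * std_dev / mean


# (mean, standard deviation, published coefficient of variation) from Table 1.
table1_rows = {
    "Ba": (426.97, 170.51, 39.94),
    "Cr": (173.50, 253.95, 146.36),
    "Zn": (94.81, 22.8, 24.04),
}

recomputed = {
    element: round(coefficient_of_variation(mean, std_dev), 2)
    for element, (mean, std_dev, _published) in table1_rows.items()
}
```

Values well above 100%, such as those reported for Au and W, indicate strongly skewed distributions dominated by a few high samples.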

Table 2. Key hyperparameters used for unsupervised integration methods.

Method	Key Parameters
PCA	Number of components = 3; composite index calculated as the mean of retained component scores
NMF	n_components = 3; init = random; max_iter = 500; random_state = 42
UMAP	n_components = 3; n_neighbors = 15; min_dist = 0.1; metric = Euclidean
AE	Three-dimensional latent representation; ReLU activation; Dropout = 0.2; Epochs = 150; Batch size = 16; Optimizer = Adam; Loss function = MSE
DEC	Three-dimensional latent representation; K-means initialisation; Number of clusters = 3; KL divergence optimisation
AVE	Equal-weight arithmetic average of fuzzified geochemical layers
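The AVE entry in Table 2 can be illustrated with a small sketch: each element layer is rescaled to the 0 to 1 range and the rescaled layers are averaged with equal weights. Linear min-max rescaling is an assumption here, since the fuzzification function is not specified in this table, and the function names and sample values are hypothetical.

```python
def minmax_fuzzify(values):
    """Rescale a layer linearly to [0, 1] (assumed fuzzification function)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant layer carries no contrast; map it to zero membership.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


def averaged_ensemble_index(layers):
    """Equal-weight arithmetic mean of fuzzified layers, sample by sample.

    layers: list of equally long lists, one list per element.
    """
    if not layers:
        raise ValueError("at least one layer is required")
    fuzzy_layers = [minmax_fuzzify(layer) for layer in layers]
    count = len(fuzzy_layers)
    return [sum(sample) / count for sample in zip(*fuzzy_layers)]


# Two hypothetical element layers measured at three sample sites.
example_layers = [
    [1.0, 2.0, 3.0],
    [10.0, 30.0, 20.0],
]
example_ave = averaged_ensemble_index(example_layers)
```

Because every layer receives the same weight, a single strongly anomalous element is diluted by the others, which is one reason to compare AVE against the learned representations.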
