1. Introduction
Machine learning (ML) is transforming how geologists interpret the subsurface, enabling fully automated, data-driven workflows that can convert complex assay tables into robust three-dimensional resource models. A particularly compelling application is in mine tailings, where the depositional record mirrors operational history and forms vertical clusters of material as processing and discharge conditions change over time. Unlike deeper ore bodies, tailings are located at or near the surface, making them more accessible and enabling potential reprocessing. These deposits often contain residual concentrations of valuable elements such as copper, cobalt, gold, and rare earth elements (REEs), presenting a significant opportunity for secondary resource recovery. However, they remain chemically reactive and, if mismanaged, can pose serious risks to surrounding water and soil quality [1, 2, 3, 4, 5].
As the mining industry places more emphasis on responsible tailing management, robust geochemical data analysis has become central to modern geoscience practice. Comprehensive tailing assays provide critical insights into mineral composition, elemental distributions, and potential contaminants, and they support both re-mining strategies and environmental monitoring, as documented by Gitari et al. [1, 2]. At the same time, these datasets are typically large, high-dimensional, and noisy, which strains the capacity of conventional analytical techniques. In mine planning and tailing re-mining, drillhole data derived from such assays form the basic input for delineating mineralization domains and evaluating secondary material recovery, while spatial clustering of drillhole data based on geochemical or geotechnical attributes supports resource evaluation and operational planning [3, 4].
Tailings pose particular difficulties for spatial clustering because they are usually low-grade, exhibit limited spatial continuity, and display marked compositional variability [5]. Their vertical layering reflects changes in processing and discharge through time and generates nonstationary downhole sequences with thin, rapidly alternating beds and sharp campaign contacts, which make it easy for conventional clustering to over-segment zones or overlook true depositional packages [6]. Traditional, semi-manual clustering depends heavily on expert judgment, which introduces subjectivity and inconsistency [7], so there is a growing need for automated, reproducible methods that can handle data-intensive tasks with minimal human bias. Recent advances in data analysis provide such tools: automated spatial clustering grounded in ML, deep learning (DL), and geostatistical modeling can reduce subjective choices, improve reproducibility, and align with real-time mining workflows, where continuous data collection and near-term analysis support timely operational adjustments [8, 9, 10]. For tailing re-mining, integrating geochemical data with automated clustering yields more stable zone definitions, clearer estimates of grade distribution, and a more transparent basis for assessing environmental risk.
In recent years, numerous studies have demonstrated the effectiveness of combining ML, DL, and geostatistics for drilling-based geological interpretations, lithological classification, and resource characterization [11, 12, 13, 14]. Horrocks et al. [15], for instance, showcased automated lithological classification in coal exploration using support vector machines, Naïve Bayes, and artificial neural networks; their methods proved scalable and efficient in Queensland, Australia. In a different geological setting, Wang et al. [16] applied DL models to stream sediment geochemical data for leucogranite exploration in the Himalayan orogen, showing strong spatial clustering performance for rare metal deposits. Moreover, Silversides and Melkumyan [4] leveraged Gaussian processes on measurement-while-drilling data to identify geological boundaries in Australian banded iron formations.
Multiscale structure in drillhole data has been captured using wavelet-based methods: wavelet transforms were used by Hill et al. [17] to classify lithochemical units, and continuous wavelet transforms combined with k-means were applied to refine boundary detection and reduce misclassification. Building on this time series perspective, recurrence analysis was introduced as a multivariate technique for efficient detection of geological boundaries in mineral exploration boreholes and offshore gas wells [18], and a hybrid scheme combining recurrence analysis with k-means was developed to detect rock boundaries and classify rock types in an iron ore mine [3]. Complementary improvements in boundary correlation and unit mapping were obtained when principal component analysis was applied to multiple well logs in carbonate reservoirs [19], while random forest classifiers integrated with explainable artificial intelligence (AI) achieved over 90% accuracy in lithological mapping of arid crystalline terrains in the Middle [20].
Complementing these time series approaches, Markov models have been used to encode sequential structure and uncertainty in facies sequences and three-dimensional geological models. Deng et al. [21] used a hidden Markov model (HMM) to infer magma flow paths and injection points from a mineralization block model, while Ouyang et al. [22] represented vertical stratigraphic transitions with Markov chains and propagated them via Monte Carlo simulation to quantify uncertainty in three-dimensional surfaces. At the well-log and seismic scale, Talarico et al. [23] showed that higher-order Markov chains remain a transparent way to impose realistic facies transition probabilities in seismic-to-facies inversion, even when compared with more flexible recurrent neural networks.
In geostatistical clustering, Fouedjio [24] proposed a hierarchical clustering method for identifying spatially contiguous clusters. This work was later extended by Fouedjio et al. [25], who introduced a spectral clustering method tailored to geostatistical data, integrating kernel-based measures of spatial dependence to form contiguous and geologically meaningful clusters, validated on both synthetic and real-world datasets. Related work has also linked AI with geostatistical modeling: Jalloh et al. [26] combined a feedforward neural network with variogram-based kriging in a mineral sands deposit, and van der Grijp et al. [27] applied multiple point statistics with direct sampling in a structurally complex gold deposit, together showing that AI-assisted geostatistics can better capture nonlinear grade patterns, respect spatial continuity, and represent geological uncertainty for mine planning.
Despite these advances, both AI and geostatistics face challenges in spatial clustering. Large datasets can overwhelm geostatistical methods, causing computational bottlenecks and difficulty handling nonstationary or non-normal data. Conversely, AI models often lack inherent physical constraints, risking results that overlook key geological structures and spatial continuity [26, 28, 29].
To address these issues, this paper introduces the Geostatistical k-means Recurrent Neural Network (GkRNN), which combines k-means clustering, spatial continuity from geostatistical analysis, and sequence modeling to produce an automated, spatially informed interpretation of mine tailing data. The following sections present the study site and dataset, outline the workflow, assess geological consistency and operational relevance against k-means and Gaussian Mixture baselines, and conclude with key limitations, practical implications, and directions for future work.
3. Results
The GkRNN workflow was applied to a multivariate, CLR-transformed geochemical dataset from 82 tailing drillholes. Clustering was carried out in a spectral space that encodes joint spatial continuity, and the resulting depthwise labels were subsequently regularized through an HMM layer and an LSTM network that was trained to remain consistent with the learned transition structure. In combination, these steps produced a small number of stratigraphically coherent zones in each drillhole that follow the depositional layering and suppress thin, noise-driven oscillations.
3.1. Determination of the Global Cluster Number by a Continuity-Aware Spectral Elbow Criterion
A model selection analysis was undertaken to identify a number of clusters k that is both parsimonious and expressive for the tailings deposit. The procedure operated in a continuity-aware spectral space derived from a joint similarity graph that blends a multivariate, kernel variogram-based geochemical affinity with a spatial nearest neighbor term. The normalized graph Laplacian was eigendecomposed to obtain the ten leading spectral coordinates, and k-means clustering was performed for candidate values k ∈ [2, 8]. For each value of k, the within-cluster inertia was recorded. This quantity decreases as clusters become more internally homogeneous, but typically exhibits diminishing improvements as k increases.
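The construction of the spectral coordinates can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: a Gaussian kernel on the features stands in for the kernel variogram-based affinity, and the blend weight `alpha`, the neighbor count `n_neighbors`, and all function names are hypothetical choices made only for illustration.

```python
import numpy as np


def knn_adjacency(coords, n_neighbors):
    """Symmetric 0/1 adjacency linking each sample to its nearest spatial neighbors."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a sample is not its own neighbor
    idx = np.argsort(d, axis=1)[:, :n_neighbors]
    adj = np.zeros_like(d)
    rows = np.repeat(np.arange(len(coords)), n_neighbors)
    adj[rows, idx.ravel()] = 1.0
    return np.maximum(adj, adj.T)


def joint_affinity(features, coords, alpha=0.5, n_neighbors=5):
    """Blend a feature affinity with a spatial neighbor term.

    Assumption: a Gaussian kernel replaces the kernel variogram-based
    affinity of the paper, and alpha is a hypothetical blend weight.
    """
    sq = ((features[:, None, :] - features[None, :, :]) ** 2).sum(axis=2)
    scale = np.median(sq[sq > 0])
    geochem = np.exp(-sq / scale)
    return alpha * geochem + (1.0 - alpha) * knn_adjacency(coords, n_neighbors)


def spectral_embedding(affinity, n_coords=10):
    """Eigenvectors of the symmetric normalized Laplacian with the smallest eigenvalues."""
    w = 0.5 * (affinity + affinity.T)  # enforce symmetry
    np.fill_diagonal(w, 0.0)           # drop self-loops
    degree = w.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    laplacian = np.eye(len(w)) - inv_sqrt[:, None] * w * inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    return eigvecs[:, :n_coords]


# Random stand-in data: 30 samples, 4 CLR-like features, 3D coordinates.
rng = np.random.default_rng(0)
features = rng.normal(size=(30, 4))
coords = rng.uniform(size=(30, 3))
embedding = spectral_embedding(joint_affinity(features, coords), n_coords=10)
```

K-means would then be run on `embedding` for each candidate k, recording the within-cluster inertia at each value.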
The elbow location was identified using the maximum distance to chord rule, which selects the value of k that has the largest perpendicular distance from the straight line connecting the endpoints of the inertia curve. As shown in Figure 5, a pronounced change in slope occurs at k = 4. Beyond this point, inertia decreases only slowly, which indicates that additional clusters would mainly subdivide existing regimes rather than reveal new, geologically meaningful structure. Accordingly, k = 4 was adopted as the global cluster count for subsequent depthwise regularization and sequence modeling. This choice balances compositional coherence with model simplicity and is consistent with the expected convex decay behavior of inertia in continuity-aware embeddings of tailing data.
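The maximum distance to chord rule can be sketched in a few lines. The inertia values below are invented for illustration and are not the values behind Figure 5; rescaling both axes to [0, 1] before measuring distance is an assumption, since the text does not state whether the curve was normalized.

```python
import numpy as np


def elbow_by_chord(ks, inertias):
    """Return the k whose point lies farthest from the chord joining the curve endpoints."""
    ks = np.asarray(ks, dtype=float)
    y = np.asarray(inertias, dtype=float)
    # Assumption: rescale both axes to [0, 1] so inertia units do not dominate.
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (y - y.min()) / (y.max() - y.min())
    # Perpendicular distance of each point to the line through the first and last points.
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    dist = np.abs(dx * (y - y[0]) - dy * (x - x[0])) / np.hypot(dx, dy)
    return int(ks[np.argmax(dist)])


# Hypothetical inertia curve for k = 2..8 (illustrative numbers only).
candidate_ks = [2, 3, 4, 5, 6, 7, 8]
toy_inertias = [100.0, 60.0, 30.0, 25.0, 21.0, 18.0, 16.0]
best_k = elbow_by_chord(candidate_ks, toy_inertias)  # 4 for this toy curve
```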
3.2. Delineation of Multivariate Compositional Domains Using Continuity-Aware Spectral Clustering
A continuity-aware spectral representation was constructed from a joint similarity graph that combined a kernel variogram-based geochemical affinity with a spatial nearest neighbor term. K-means clustering was applied to the leading spectral coordinates, and the resulting labels were interpreted as depth sequences. These sequences were regularized with a left-to-right HMM and with an LSTM trained on sliding windows that contained normalized coordinates, leading spectral coordinates, distances to class centroids, and a local similarity score. After sequence learning, a minimum thickness rule was applied to remove very thin, non-persistent runs. Unless stated otherwise, a discrete viridis palette is used throughout, and the figures show the final sequence-consistent zones.
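A minimum thickness rule of this kind can be sketched as a run-length filter on the depth-ordered labels. The merging policy used here (relabel the shortest run with the label of its thicker neighbor, and repeat until every run reaches `min_len` samples) is an assumption; the text does not specify how thin runs are absorbed, and the function names are hypothetical.

```python
def runs(labels):
    """Run-length encode a label sequence as [label, length] pairs."""
    out = []
    for lab in labels:
        if out and out[-1][0] == lab:
            out[-1][1] += 1
        else:
            out.append([lab, 1])
    return out


def enforce_min_thickness(labels, min_len):
    """Absorb runs shorter than min_len samples into their thicker neighbor."""
    labels = list(labels)
    while True:
        encoded = runs(labels)
        if len(encoded) == 1:
            return labels
        lengths = [n for _, n in encoded]
        i = min(range(len(encoded)), key=lengths.__getitem__)  # shortest run
        if lengths[i] >= min_len:
            return labels
        left = encoded[i - 1] if i > 0 else None
        right = encoded[i + 1] if i + 1 < len(encoded) else None
        # Prefer the thicker neighbor; ties go to the run above (left).
        if left is None or (right is not None and right[1] > left[1]):
            new_label = right[0]
        else:
            new_label = left[0]
        start = sum(lengths[:i])
        labels[start:start + lengths[i]] = [new_label] * lengths[i]


# A one-sample flicker of zone 1 inside zone 0 is removed.
cleaned = enforce_min_thickness([0, 0, 0, 1, 0, 0, 2, 2, 2, 2], min_len=2)
```

Each pass removes at least one run, so the loop terminates. On a regular composite interval, `min_len` would correspond to the minimum thickness divided by the composite length.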
The overall multivariate structure can be examined in the variables-versus-depth mosaic for the tailing deposit, where nine elements (Mo, Sr, Zn, Ca, Cu, Fe, K, S, Ti) are plotted against depth and colored by zone (Figure 6). Bands of similar color appear at comparable depths from hole to hole, which indicates that the zones represent coherent stratification rather than isolated classifications. Element-specific banding is most pronounced for Sr, Zn, Fe, and K, so these tracers contribute most strongly to boundary definition, whereas the remaining variables exhibit more gradual contrasts that describe transitional behavior.
Consistency at the single-hole scale was investigated using borehole MPT 19 01_AL, where nine elemental logs are plotted against depth with a zone strip at the right margin (Figure 7). Depth increases downward, and each trace retains a fixed color by variable. Changes in composition align closely with label transitions, and boundaries appear stable rather than alternating rapidly. One zone displays intermediate values across several elements and acts as a transitional layer between enriched and depleted intervals. The visual agreement between compositional breaks and the contiguous color blocks in the zone strip suggests that the sequence-aware labeling captured persistent units rather than pointwise noise.
Deposit-scale stacking was then summarized using depth-binned zone proportions computed at 0.5 m resolution for the tailings deposit (Figure 8). At each depth the stacked ribbons sum to one, and the vertical axis increases downward. A gradual shift can be seen: shallow intervals contain larger proportions of zones 0 and 1, the proportion of the intermediate zone rises through the middle part of the profile, and the deepest section is dominated by zone 2. The smooth trajectories of the ribbons and the absence of abrupt reversals indicate that the labels evolve as a depth-ordered time series rather than as scattered classifications. In this sense, the GkRNN workflow respects the time series character of the tailing deposit, because the Markov and LSTM components learn persistence and transition patterns that govern how zones appear and change with depth.
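Depth-binned zone proportions of this kind can be computed as follows. This is a minimal sketch that pools samples from all drillholes into fixed-width bins measured from the collar; the function name and the toy inputs are hypothetical.

```python
import numpy as np


def zone_proportions(depths, zones, n_zones, bin_size=0.5):
    """Fraction of samples in each zone per depth bin; rows of non-empty bins sum to one."""
    depths = np.asarray(depths, dtype=float)
    zones = np.asarray(zones, dtype=int)
    bins = np.floor(depths / bin_size).astype(int)
    counts = np.zeros((bins.max() + 1, n_zones))
    np.add.at(counts, (bins, zones), 1.0)  # accumulate repeated (bin, zone) pairs
    totals = counts.sum(axis=1, keepdims=True)
    # Bins without samples are returned as NaN rather than zero.
    return np.divide(counts, totals, out=np.full_like(counts, np.nan), where=totals > 0)


# Toy example: six samples, three zones, 0.5 m bins.
props = zone_proportions(
    depths=[0.1, 0.3, 0.6, 0.7, 0.9, 1.2],
    zones=[0, 1, 1, 1, 0, 2],
    n_zones=3,
)
```

Stacking the columns of `props` against bin depth gives a ribbon plot of the type described above.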
To complement this depth-binned view, a 3D facies model was interpolated on a regular grid restricted to the convex hull of the boreholes (Figure 9). Within this footprint, the final GkRNN zones form laterally continuous layers between drillholes and display vertical transitions that are consistent with the depth-binned proportions in Figure 8. This 3D representation shows both vertical and lateral facies variations in a single image and confirms that the workflow produces a coherent volumetric domaining of the tailing deposit.
Cluster-level geochemical signatures were summarized using fingerprints that plot standardized means and 95% confidence intervals across the nine variables (Figure 10). Clear separation is observed for Sr, Zn, Fe, and K, which indicates that these tracers have the strongest discriminating power for the tailings deposit. More moderate offsets on Mo, Ca, S, and Ti reflect gradational behavior that is typical of transitional materials. One zone occupies intermediate values for several elements and therefore acts as a bridge between enriched and depleted regimes. Because values were standardized as deposit-wide z-scores, the contrasts represent relative enrichment rather than absolute concentrations.
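The fingerprint statistics can be sketched as below. The confidence interval uses a normal approximation (mean ± 1.96 standard errors), which is an assumption; the text does not state how the 95% intervals were derived. The function name and toy data are hypothetical.

```python
import numpy as np


def cluster_fingerprints(values, labels):
    """Per-cluster mean and 95% interval of deposit-wide z-scores for each variable."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    # Standardize each variable over the whole deposit, not per cluster.
    z = (values - values.mean(axis=0)) / values.std(axis=0, ddof=1)
    fingerprints = {}
    for lab in np.unique(labels):
        zc = z[labels == lab]
        mean = zc.mean(axis=0)
        # Assumed normal approximation; clusters need at least two samples.
        half_width = 1.96 * zc.std(axis=0, ddof=1) / np.sqrt(len(zc))
        fingerprints[int(lab)] = (mean, mean - half_width, mean + half_width)
    return fingerprints


# Toy data: six samples, two variables, two clusters.
toy_values = np.array(
    [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [7.0, 30.0], [8.0, 33.0], [9.0, 31.0]]
)
toy_labels = np.array([0, 0, 0, 1, 1, 1])
fp = cluster_fingerprints(toy_values, toy_labels)
```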
Directional continuity in the GkRNN zones was evaluated with experimental semivariograms computed vertically and horizontally for all nine elements, and spherical models were fitted by least squares (Figure 11 and Figure 12). For the vertical direction, within-hole pairs were formed after ordering samples by depth; for the horizontal direction, plan view pairs were restricted to similar depths using a small vertical tolerance and a search radius set to a fraction of the deposit footprint. Populous zones showed well-defined ranges and modest nuggets in both directions, which indicates strong internal coherence of the GkRNN partitions. In many variables the lateral range exceeded the vertical range, a pattern that matches the bedded architecture expected in a tailings deposit. The clearest structure was observed for Sr, Zn, Fe and K, agreeing with the zone contrasts seen in the variables-versus-depth mosaics and in the cluster fingerprints.
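The vertical case can be illustrated with a minimal sketch of an experimental semivariogram built from within-hole pairs only. Lag classes are centered on integer multiples of `lag`, which is an assumed binning convention; the function name and toy data are hypothetical, and the spherical model fit is not shown.

```python
import numpy as np


def vertical_semivariogram(hole_ids, depths, values, lag, n_lags):
    """Experimental semivariogram from within-hole pairs, binned to multiples of lag."""
    hole_ids = np.asarray(hole_ids)
    depths = np.asarray(depths, dtype=float)
    values = np.asarray(values, dtype=float)
    sums = np.zeros(n_lags)
    counts = np.zeros(n_lags, dtype=int)
    for hole in np.unique(hole_ids):
        mask = hole_ids == hole
        order = np.argsort(depths[mask])  # order samples by depth within the hole
        d, v = depths[mask][order], values[mask][order]
        i, j = np.triu_indices(len(d), k=1)  # all sample pairs in this hole
        lag_class = np.rint((d[j] - d[i]) / lag).astype(int) - 1
        ok = (lag_class >= 0) & (lag_class < n_lags)
        half_sq_diff = 0.5 * (v[j] - v[i]) ** 2
        np.add.at(sums, lag_class[ok], half_sq_diff[ok])
        np.add.at(counts, lag_class[ok], 1)
    gamma = np.full(n_lags, np.nan)  # lag classes without pairs stay NaN
    gamma[counts > 0] = sums[counts > 0] / counts[counts > 0]
    return gamma, counts


# Two toy holes with a linear trend sampled every 1 m.
gamma, pair_counts = vertical_semivariogram(
    hole_ids=["A", "A", "A", "A", "B", "B", "B", "B"],
    depths=[0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0],
    values=[0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0],
    lag=1.0,
    n_lags=3,
)
```

Reporting the pair counts alongside the semivariogram values makes it easy to flag lag classes, or whole zones, where support is too small for a reliable fit.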
Interpretation in the context of the tailings deposit is therefore straightforward. The large zones appear to represent stratified packages that persist laterally across the footprint, consistent with relatively uniform discharge conditions and settling processes. A rare zone with limited spatial support exhibited short ranges and low sills, which is consistent with patchy or lens-like material produced by local reworking. While uncertainty increases for that zone because pair counts are small, the directional behavior remains compatible with the mapped GkRNN zones.
Taken together, the mosaic-level stratification, the alignment of compositional breaks at the scale of individual drillholes, the smooth depthwise proportions, the discriminative fingerprints, and the directionally consistent semivariograms indicate that the workflow captured both composition and structure in a way that is faithful to tailing deposition.
4. Discussion
4.1. Spatial Comparison Between GkRNN, K-Means, and Gaussian Mixture
Spatial coherence and geological plausibility were evaluated by comparing the continuity-aware GkRNN workflow with a conventional k-means baseline and a Gaussian Mixture model. Figure 13 shows the 3D clustering results for the tailings deposit. The GkRNN model produces vertically consistent, sheet-like domains that follow the expected lamination of the tailings body. In contrast, k-means generates highly fragmented patches that oscillate along depth, while the Gaussian Mixture result lies between these two extremes, with somewhat smoother packages than k-means but still more breaks and isolated lenses than GkRNN. These visual differences reflect how each method handles spatial context. GkRNN operates in a spectral space that encodes joint continuity and then regularizes labels along depth with Markov and LSTM sequence models, which encourages persistence where composition and context agree. Both k-means and Gaussian Mixture assign labels point-by-point in feature space and have no built-in notion of neighborhood or stratigraphic order, so small geochemical perturbations can flip labels along a borehole.
The limitations of k-means are most evident in drillholes that contain gradual geochemical transitions. Because k-means imposes spherical clusters of similar size in Euclidean space, it tends to break softly varying logs into many small segments and often mistakes local noise for lithological change [46, 47]. In drillhole MPT-19-13_AL, for example, the k-means solution splits the log into more than twenty cluster units, which is implausible for laminated tailings with known depositional continuity. The Gaussian Mixture model replaces hard Voronoi partitions with overlapping Gaussian components in the CLR space, which softens some of the abrupt k-means boundaries. Nevertheless, the mixture still treats each depth independently and cannot enforce vertical ordering, so it continues to introduce extra small-scale units that do not correspond to distinct depositional campaigns. In contrast, the GkRNN segmentation of the same log remains within a small number of thicker, contiguous packages and places boundaries only where there is clear multielement support. In this sense, the time series character of the downhole data is respected only by GkRNN, which explicitly learns transition patterns along depth, while the other two methods remain purely point-based.
These behavioral differences have direct practical implications. Every additional contiguous unit increases the work required for variogram modeling, grade estimation, and domaining, and propagates noise into block models and short-term planning. Under the contiguous-unit definition used in this study, k-means yields a mean of 10.4 units per drillhole, with a median of 10, a range of 2 to 22, and a standard deviation of 4.2. GkRNN maintains stratigraphic coherence with a mean of 3.85 units per drillhole, a median of 4, a narrow range of 1 to 4, and a standard deviation of 0.59, which reduces the unit count by roughly two-thirds relative to k-means. The Gaussian Mixture model lies between these two, with a mean of 6.5 units per drillhole, a median of 7, a range of 1 to 12, and a standard deviation of about 2.0. Fewer, thicker, and more stable domains from GkRNN translate into simpler and more defensible models, clearer grade-control decisions, and less ambiguity when drawing dig lines, whereas Gaussian Mixture offers only a partial reduction in complexity relative to k-means.
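The unit counts above can be reproduced from depth-ordered labels with a short sketch. It assumes that a contiguous unit is a maximal run of identical labels along a drillhole; the exact definition used in the study is not restated in this section, and the function names and toy labels are hypothetical.

```python
import numpy as np


def count_contiguous_units(labels):
    """Number of maximal runs of identical labels in a depth-ordered sequence."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0
    return int(1 + np.count_nonzero(labels[1:] != labels[:-1]))


def summarize_unit_counts(labels_by_hole, target=4):
    """Summary statistics of per-hole unit counts, including the share that hits target."""
    counts = np.array([count_contiguous_units(seq) for seq in labels_by_hole.values()])
    return {
        "mean": float(counts.mean()),
        "median": float(np.median(counts)),
        "min": int(counts.min()),
        "max": int(counts.max()),
        "std": float(counts.std(ddof=1)) if counts.size > 1 else 0.0,
        "share_at_target": float(np.mean(counts == target)),
    }


# Toy example with three hypothetical holes (not the study data).
toy_holes = {
    "H1": [0, 0, 1, 1, 2, 2, 3, 3],  # 4 units
    "H2": [0, 1, 0, 1, 2, 3, 2, 3],  # 8 units, a chattering sequence
    "H3": [0, 0, 0, 1, 1, 2, 3, 3],  # 4 units
}
summary = summarize_unit_counts(toy_holes)
```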
Stability achieved by GkRNN also matches the intended four-layer architecture of the synthetic tailings deposit. Across 82 drillholes, exactly four contiguous units are recovered in 76 holes, corresponding to 93% of the dataset. K-means recovers four units in only four holes (5%), and the Gaussian Mixture model reaches four units in eight holes (10%). The distributions in Figure 14 make this contrast clear. The GkRNN counts form a tight peak centered at four units per hole. Gaussian Mixture shows a broader distribution that is shifted upward but still much closer to the GkRNN peak than to k-means. K-means displays the widest spread and many high-count outliers, which is a classic signature of depthwise label chattering.
Table 1 summarizes these statistics and confirms that GkRNN provides the strongest control on contiguity and realistic thickness, Gaussian Mixture improves over the purely point-based k-means but still allows unnecessary subdivision, and k-means is the least suitable of the three for stratigraphically plausible, planning-ready domaining.
Table 1 reports the number of geochemically distinct contiguous units identified per drillhole. Across 82 drillholes, GkRNN concentrates tightly around four units per hole, with a mean of 3.85, median 4, range 1–4, standard deviation 0.59, and exactly four units in 76 of 82 drillholes (93%). K-means fragments the downhole logs, yielding a mean of 10.4 units per drillhole, median 10, range 2–22, standard deviation 4.2, and only 4 drillholes (5%) with exactly four units. The Gaussian Mixture model provides an intermediate outcome, with a mean of 6.5 units per drillhole, median 7, range 1–12, standard deviation about 2.0, and 8 drillholes (10%) with exactly four units. Together with Figure 13 and Figure 14, these values show that the sequence-regularized GkRNN labeling best preserves stratigraphic coherence and realistic thickness, Gaussian Mixture offers a modest improvement over k-means, and k-means tends to over-segment the logs under the contiguous-unit definition.
4.2. Methodological Contributions of the Continuity-Aware GkRNN Workflow
The continuity-aware GkRNN workflow combines compositional preprocessing, continuity-based similarity, and sequence modeling in a single pipeline. Compositional assays are first mapped with a CLR transformation and a kernel variogram so that the similarity graph reflects both multivariate grade structure and spatial continuity. A spectral embedding of this graph provides a low-dimensional space where clusters are already shaped by spatial dependence, and an elbow analysis in this space selects a small set of clusters that represent the main compositional regimes without chasing local noise. Downhole labels are then treated as depth-ordered sequences: a left-to-right HMM and an LSTM trained on sliding windows promote persistence when composition and context agree and allow changes only when there is multielement support, while a minimum thickness filter removes short, isolated runs. The resulting zones reproduce the depth-ordered mosaics, smooth zone proportions, and compact contiguous unit counts, so compositional breaks and stratigraphic continuity are enforced in a consistent way.
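The first step of the pipeline, the CLR transformation, subtracts the mean log concentration of each sample from the log of every component. A minimal sketch is given below; it assumes strictly positive inputs, and any handling of zeros or values below detection limit is left out because it is not described in this section.

```python
import numpy as np


def clr(composition):
    """Centered log-ratio transform applied row-wise to strictly positive compositions."""
    x = np.asarray(composition, dtype=float)
    if np.any(x <= 0):
        # Assumption: zeros and non-detects are replaced upstream.
        raise ValueError("CLR requires strictly positive values")
    log_x = np.log(x)
    return log_x - log_x.mean(axis=1, keepdims=True)


# Two toy samples with three components; the second is the first scaled by 10.
toy_assays = np.array([[10.0, 20.0, 70.0], [100.0, 200.0, 700.0]])
transformed = clr(toy_assays)
```

Each transformed row sums to zero, and rescaling a sample (for example, changing units) leaves its CLR coordinates unchanged, which is why both rows of `transformed` are identical.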
The workflow is built to operate on a regular depth scale. All drillhole assays are composited to a fixed interval before variograms, spectral embedding, and sequence models are fitted, so GkRNN works on equally spaced depth steps and is not tied to any specific physical sampling interval, provided that the chosen support is regular and fine enough relative to vertical correlation ranges and typical facies thicknesses. Logs sampled at different intervals or at a new target resolution can be recomposited and the continuity statistics recomputed without changing the architecture. Behavior in the presence of skewed or sparse variables is governed mainly by preprocessing and continuity modeling. After the CLR transform, variables are standardized, joint kernel variogram matrices are regularized by eigenvalue clipping and a small ridge term, and the resulting dissimilarities are rescaled with a median-based parameter τ. Clustering therefore takes place in a fixed low-dimensional spectral embedding that summarizes joint continuity, meaning that k-means acts on smoothed, continuity-aware features rather than on raw sparse or heavy-tailed variables. The components of this framework are formulated in a general way and do not rely on features unique to mine wastes, so the same pipeline can, in principle, be applied to fluvial sediments, deep ore bodies, or other geological environments wherever multivariate assays and basic spatial continuity allow joint variograms and sequence statistics to be estimated. Tailings are a particularly suitable first test case because dense drilling and stratified, laterally extensive packages from repeated deposition match the continuity-aware facies concept that GkRNN is designed to capture. Other settings would mainly require practical tuning of the variogram lag structure and neighborhood graph to account for channel anisotropy or sparser, deeper drilling.
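The two regularization steps mentioned above can be sketched as follows. The ridge value, the use of the median off-diagonal dissimilarity as τ, and the exponential mapping from dissimilarity to affinity are all assumptions made for illustration; the section names the ingredients but not their exact form, and the function names are hypothetical.

```python
import numpy as np


def clip_and_ridge(matrix, ridge=1e-6):
    """Make a symmetric matrix positive definite by clipping eigenvalues and adding a ridge."""
    sym = 0.5 * (matrix + matrix.T)
    vals, vecs = np.linalg.eigh(sym)
    vals = np.clip(vals, 0.0, None)  # remove negative eigenvalues
    return (vecs * vals) @ vecs.T + ridge * np.eye(len(sym))


def median_scaled_affinity(dissimilarity):
    """Rescale dissimilarities by their median (tau) and map them to affinities in (0, 1]."""
    d = np.asarray(dissimilarity, dtype=float)
    tau = float(np.median(d[np.triu_indices_from(d, k=1)]))
    # Assumed mapping: exponential decay with scale tau.
    return np.exp(-d / tau), tau


# Indefinite toy matrix with eigenvalues -1 and 3.
regularized = clip_and_ridge(np.array([[1.0, 2.0], [2.0, 1.0]]))

# Toy dissimilarity matrix with off-diagonal values 1, 2, 3, so tau is 2.
toy_dissimilarity = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]])
affinity, tau = median_scaled_affinity(toy_dissimilarity)
```

A median-based scale keeps the affinities in a comparable numerical range across datasets, which is one plausible reason for preferring it to a fixed bandwidth.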
5. Conclusions
A continuity-aware, sequence-regularized workflow was developed to transform noisy compositional drillhole assays into stratigraphically coherent domains for tailings deposits. The Geostatistical k-means Recurrent Neural Network (GkRNN) integrates centered log-ratio (CLR) preprocessing, a kernel-variogram-based spectral representation, k-means clustering, and depth regularization via a hidden Markov model (HMM) and a long short-term memory (LSTM) network. Applied to 82 drillholes containing nine elements, the method produced a small number of stable zones per hole, smooth depthwise trends, and semivariograms consistent with laminated material. Compared with a conventional k-means baseline and a Gaussian Mixture model, the mean number of contiguous units per hole decreased from 10.4 (k-means) and 6.5 (Gaussian Mixture) to 3.85 with GkRNN, with exactly four units recovered in 93% of holes versus 5% and 10%, greatly simplifying subsequent modeling and planning.
Viewed through the variogram constraints and the HMM, GkRNN can be considered a general clustering approach that promotes spatial continuity and vertical stratigraphic coherence in layered datasets. Because the preprocessing stage is modular, it can be adapted to case-specific applications, which makes the workflow readily extendable to three-dimensional stratigraphic modeling of borehole data beyond tailings. Potential uses include improved constraint of reservoir properties such as porosity and permeability in petroleum settings, as well as characterization of stratiform mineral systems such as Kupferschiefer and of groundwater aquifers.
Future extensions should integrate multi-sensor inputs such as hyperspectral, wireline, and measurement-while-drilling logs, and include uncertainty envelopes around boundaries. Probabilistic recurrent networks and adaptive thickness priors could further refine transitions and capture site-specific depositional patterns. With these enhancements, GkRNN could advance toward near-real-time domaining with explicit quality controls, supporting safer and more efficient tailing re-mining.