Study of the Inﬂuence of Non-Deposit Locations in Data-Driven Mineral Prospectivity Mapping: A Case Study on the Iskut Project in Northwestern British Columbia, Canada

: The accuracy of data-driven predictive mineral prospectivity models relies heavily on the training datasets used. These models are usually trained using data for “known” deposit locations as well as “non-deposit” locations that are based on randomly generated point patterns. In this study, data related to the Seabridge Gold Inc Iskut project, an epithermal Au deposit in northwestern British Columbia (BC), Canada, are used to test the utility of data-driven mineral prospectivity modeling. The input spatial dataset is comprised mostly of publicly available data. Data for 18 vein and epithermal Au known mineral occurrences (KMO) are obtained from the BC Geological Survey’s MINFILE repository and selected as training deposit locations. A total of eleven sets of non-deposit locations (NDL) were also created, including one set of selected non-prospective KMO for Au deposits from the MINFILE and ten sets of random point patterns. Given the scale of this study, most of the KMO recorded on the property are of the epithermal deposit type. Hence, they could not be used as a selection criterion. Data-driven mineral potential models are generated using the random forest (RF) algorithm and trained on multiple data sets. The comparison of RF models demonstrated that using non-prospective KMO generates more accurate predictions than the random point pattern. The produced mineral prospectivity maps delineated multiple areas with higher discovery potential, which matched viable targets for the Au-Cu epithermal-porphyry system identiﬁed through previous Seabridge Gold Inc. (Toronto, ON, Canada) ﬁeld reconnaissance and drilling programs.


Introduction
With new mineral deposits becoming more challenging to find, geoscientists have focused on development of novel methods to assist with mineral deposit discovery. Development of the geographic information system (GIS) technology, improved computing power, and application of data-driven methods, such as machine learning, are enabling the evolution of quantitative methods of geoscientific data analysis, including mineral potential mapping (MPM) [1,2]. For instance, in 2018, Goldcorp Inc., Vancouver, BC, Canada (now part of the Newmont Corporation) and IBM announced a partnership with a goal to utilize the IBM Watson supercomputer and its artificial intelligence (AI) framework to aid mineral targeting at the Red Lake Mines in northwestern Ontario, Canada.
MPM consists of combining multiple layers of geoscience data into a map identifying areas favorable for mineral exploration. The process can be summarized into five main steps: definition of the exploration model for the type of deposit sought, selection of the geospatial dataset to be used, processing of data, creation of the predictor maps, integration of the predictor maps to create a predictive model [1,2].
There are three main categories of modeling methods using GIS-environment: knowledgedriven, data-driven and hybrid.
The knowledge-driven approach requires "experts" to assign weights and to assess the relative importance of each evidential layer as they relate to the specific exploration model being used. This method is subjective but has the advantage of being well suited for greenfield areas with missing or scarce data and where few deposits are known [3]. Examples of knowledge-driven approach include Boolean logic [1,2,4], index overlays [1,2,4,5], fuzzy logic [6][7][8][9][10] and evidential belief [2,6].
The data-driven approach uses the spatial relationship between the geospatial features and known mineral deposits (training set) to estimate the model parameters. It is well suited for well-established mining areas where a large number of known mineral deposits is available to quantify the spatial associations with evidential features and to guarantee the performance and robustness of the model. Methods such as weights of evidence [4,7,[11][12][13], logistic regression [11,12], neural networks [11,14,15], support vector machine [15][16][17][18] and random forest [3,13,[19][20][21][22] are examples of data-driven approaches.
Since 2000, the number of publications on data-driven methods for MPM has largely increased, especially articles using machine learning algorithms (MLA) [23]. In recent years, the random forest (RF) algorithm has proven to offer a new approach to MPM. Contrarily to other MLA, like artificial neural networks or support vector machine, RF is not a 'black-box' algorithm, meaning that the inner workings of the algorithm are known and can even be represented (i.e., decision trees). It is also simpler to parameterize, more stable and computationally light [15,22,24,25].
For any data-driven method, the training dataset should contain a sufficient number of samples to train a given model, and studies showed that RF can be accurate even with a small training set (i.e., less than 20 deposit locations) [19,20,24,26,27]. Moreover, the training dataset should be balanced, meaning that the dataset must contain an equal number of deposits and non-deposit, to avoid results to be biased for one class or the other [19]. Training deposit locations are usually discovered deposits or known mineral occurrences (KMO) in the study area for the commodity and the type of deposit sought. On the other hand, non-deposit locations are usually generated by random points following specific criteria [19,20,24,25,27] or random locations in lithologies considered unprospective [3,15,22,26].
In this paper, the relative influence of the non-deposit locations in the training dataset is assessed by the accuracy of the RF model to MPM. MPM models using the RF algorithm were created with different training data set. We generated ten sets of the random point pattern using a three criteria selection that we compared with KMO that were categorized as non-prospective and distal to every deposit location. In a broader sense, this study is testing whether non-prospective KMO (for one commodity but can be of similar deposit type) should be preferably used instead of randomly generated locations in MPM, when such data set is available.

RF Algorithm
The RF algorithm is a collection of decision tree classifiers trained to increase their diversity and reduce generalization error of the aggregate classifier made of the individual trees [28]: where the {θ k } are independent identically distributed random vectors, and each tree casts a unit vote for the most popular class at input x. A RF can be composed of either classification or regression trees. The algorithm uses a modified version of the bagging (or bootstrap aggregating) technique to create an ensemble of n tree decision trees [29]. This technique increases the diversity of the trees. In the bagging process, each tree is trained on 2/3 of the input samples. The training set is sampled randomly from the original dataset with replacement (i.e., no deletion of the data selected from the original dataset for the generation of the next subset). In other words, to grow a tree, the input data can be used more than once or not at all. This allows the RF to be more stable and robust to outliers in the input data set, as well as increasing prediction accuracy [28].
The remaining 1/3 of the training samples are referred to as out-of-bag (OOB) samples. The OOB samples can be used to evaluate performance, removing the need for crossvalidation. The resulting OOB error is an unbiased estimate of the generalization error and converges as the number of trees increases; thus, RF does not over-fit the data [28].
Each tree is grown on a random subset of m try features selected from the input evidential features. This increases the diversity of trees within the forest and reduces the correlations between the trees. The RF algorithm does not apply pruning on the grown trees. The output of the RF is calculated differently depending on the type of decision trees. For regression trees, the output is the average of the predictions from all the trees, whereas for classification trees, the output is the majority vote of all the trees. A simplified diagram of the RF algorithm is presented in Figure 1. The RF algorithm tries to maximize purity of the tree grown by making each child nodes 'purer' than the parent node. The tree impurity I(T) is defined by [30]: where I(t) = i(t)p(t) with p(t) an impurity function and i(t) the node impurity function. The decision tree search through all candidate splits to find the optimal split to reduce I(T) or, equivalently, maximizes the information gain [30]: where I(t) is the impurity of the parent node, I(t L ) is the impurity of the left leaf and I(t R ) is the impurity of the right leaf. The Gini Index (I G ) is the impurity function employed in this study. It is defined as [30]: where p j is the proportion of samples that belongs to class m for a particular node and m the number of classes. The decision tree splitting criterion is based on choosing the attribute with the lowest Gini impurity index (I G ). The importance of each feature can be evaluated by the RF algorithm. To measure the importance of the k-th feature, the values of the k-th feature are permuted among the training data while keeping the rest constant. The OOB error estimation is used to measure the decrease in accuracy [28].

Study Area: Iskut Project
The Iskut project has undergone mineral exploration since the early 1900s. Since the discovery of the Snip Mine in 1964 (Skeena Resources: https://www.skeenaresources.com/ projects/snip; accessed on 21 October 2019), the property has had relatively systematic exploration, which has led to the discovery of the Johnny Mountain Mine and the definition of the Bronson Slope deposit. These discoveries, in conjunction with surface anomalies across the property, have seen over C$38 million spent on exploration looking for analogs to these deposits. Seabridge Gold acquired the property in 2016 and has since undertaken exploration for porphyry and epithermal deposits on the property.

Geological Setting
The Iskut property is located in northwestern British Columbia (BC), in the metallogenically important Stewart-Iskut River area also known as the "Golden Triangle.
The Iskut Project lies on the western margin of the Stikine Terrane (Stikinia). Three distinct units of the Stikine Terrane ranging in age from Upper Paleozoic to Middle Jurassic were recognized in the area ( Figure 2). The oldest rocks are Upper Paleozoic metamorphosed and deformed limestone, clastic sedimentary rocks, and polymodal volcanic rocks of the Stikine Assemblage [31,32]. Two groups of the Mesozoic arc-related strata are present in the area: Late Triassic folded marine volcanic and sedimentary arc-related strata with some degree of alteration and low-grade metamorphism of the Stuhini Group and the Early to Middle Jurassic subaerial and submarine volcanic and sedimentary rocks of the Hazelton Group [31]. The two units are separated by a regional angular unconformity. The Bowser Lake basin sedimentary rocks unconformably overly the Jurassic strata and cover mineral deposits to the East of the study area. Quaternary basalts and local volcanism are observed to the North and Northeast of the property along the Iskut river.
There are three major intrusive suites mapped on the property. Two of these are major regional metallogenic events that occur in Stikinia during the late Triassic and over an extended period from Early to Middle Jurassic [31,33]. The Stikine Plutonic Suite emerged from a pulse of arc growth in the Late Triassic (221-236 Ma; [33]) and is coeval with the Stuhini Group strata [31,33]. This intrusion is coincident with emplacement of Cu-Au-Ag enriched pluton, an important metallogenic event within the Cordillera [31,33]. Two magmatic events occurred between Early to Middle Jurassic and are associated with two different mineralization styles. The first event, in Early Jurassic (187-195 Ma; [33]), is associated with porphyry deposits (e.g., Kerr-Sulphurets-Mitchell, Galore Creek and Red Chris) and epithermal gold-veins (e.g., Brucejack) that tend to occur in lower Hazelton Group volcanic rocks [31,33]. The second event is associated with exhalative mineralization (Eskay Camp deposits) and characterized by bimodal volcanic rocks coeval with the upper volcanic sequence of the Hazelton Group [31]. These form as exhalative deposits within a deep marine oceanic crust setting.

Au Mineralization
The Iskut project lies in the metallogenic rich "Golden Triangle" area. Two formerly producing mines and a well defined mining project are located in the study area: the Johnny Mountain mine, a vein hosted gold deposit with a production of 92,300 oz gold, 145,000 oz silver, 2,270,000 lbs copper, the Snip mine, a shear-vein gold deposit with a production of 1,032,000 oz gold, 390,000 oz silver, 550,000 lbs copper and the copper-gold Bronson Slope deposit (190Mt@0.36 g/t Au, 0.122% Cu) (Richards, 2005, unpublished).
Mineralization in the area is principally shear-hosted gold and base metal veins deposits like the Snip, Johnny Mountain (Stonehouse) or Inel deposits. They are associated with brittle-ductile deformation and porphyritic stock and intrusion of the Early Jurassic Texas Creek Plutonic Suite [31,33]. The structural style may be host-rock dependent: mineralized shear-veins are hosted by a clastic sequence of the Stuhini Group at the Snip and the Inel deposits, while dilatant quartz-sulfide veins are hosted by Jurassic coarse volcanic and intrusive rocks at the Stonehouse deposit (but are also present at the Snip and the Inel deposits) [34]. From a comparison between the Stonehouse and the Snip deposit, Rhys [35] suggests that intrusion, semi-brittle deformation, and a mineralizing hydrothermal system were closely related temporally and genetically and that gold was deposited during the formation of the vein and not by later mineralization or remobilization.

Conceptual Exploration Model
Since epithermal ore deposits are formed by tectonic scale earth process systems that concentrate hydrothermal fluids, the exploration model presented below accounts for processes critical to the formation of that type of deposit: source-pathway-trap (physical and/or chemical)-preservation. Based on the mineral system approach to exploration targeting, Seabridge Gold defines the conceptual exploration model at a district-scale for epithermal Au deposits at the Iskut project with the following criteria: • Proximity to mineralized porphyritic intrusions; • Proximity to faults; • Presence of hydrothermal alteration zones; • Geochemical enrichment in gold and associated pathfinder elements; • Viability of host rock.
The most probable source of gold-bearing fluids in the area are the porphyry intrusives of Texas Creek Plutonic Suite and Stikine Plutonic Suite. Mineralization is synchronic with brittle-ductile deformation (i.e., faulting, folding, and shearing) characterized by mineralized dikes and veins having similar orientation as tectonic structures. Faults and fractures can serve two functions; they can be the both conduits taken by metal-rich fluids and physical traps to those fluids. Lithologies from the Stuhini Group and the Hazelton Group are the host lithologies of the known deposits in the area (e.g., Stonehouse, Inel, Bronson slope). They can be considered chemical traps, due to a change in RedOx conditions or geochemical assemblage for instance, as well as physical traps, because of a change in density for example. Hydrothermal alteration and geochemical enrichment in gold and associated pathfinder elements are evidence of chemical traps as the reaction of the mineralized hydrothermal fluids with wall rock. For NDL, three selection criteria are used. First, all sets should have an equal number of NDL to that of the deposit locations. Second, the NDL should be located far from any known deposit to ensure different geospatial characteristics to nearby deposits [2,19]. The third criteria depend on the nature of the data: random locations or KMO.
Point pattern analysis was applied to define a buffer distance from every deposit location. That buffer represents the distance beyond which there is a 100% probability of finding another deposit from any deposit. In the study area given the selected deposits, that distance is 8000 m (Figure 3). However, this length is too restrictive for this study, as it would exclude more than half of the study area. Instead, a buffer distance of 2000 m, representing a 78% probability of finding a neighboring deposit from that distance. Hence, NDL are to be selected in the study area excluding a 2000 m buffer from every deposit location (Figure 4).  Unlike deposit locations, which are 'rare' events and tend to cluster, the NDL should be distributed randomly through the study area as they should not be representative of any particular geological process [2]. Therefore, we generated ten sets of the random point pattern containing 18 independent points, that we generated using the rpoint function from the spatstat (v1.61-0) library in R software (R version 3.5.3 (Great Truth), released on 11 March 2019. Retrieved from R project website: (available online: https://www.r-project. org/; accessed on 18 March 2019). The number of sets generated was chosen arbitrarily so that the different models could be easily plotted and compared. The number of points was chosen to be equal to the number of deposit locations, to obtain a balanced dataset (i.e., equal number of deposit to non-deposit location).
The second set of NDL constitutes of selected KMO from the MINFILE depository that were considered non-prospective for the Au deposit. Every occurrence that has gold listed as one of their first three listed commodities was discarded. A total of 19 locations were left after this selection process (see Appendix A). However, due to the scale of the study area (i.e., project-scale), the majority of the KMO in the area are related either to an epithermal or a porphyry system. Therefore, the deposit type could not be used as a selection criterion as it would have been too restrictive (i.e., not enough samples to conduct this study). Thus, some selected non-prospective KMO can be pre or post-dating the selected prospective KMO and have a similar geospatial signature (e.g., similar pathways, similar traps). It is assumed that the source of fluids is different, which would explain the difference in metals association between the prospective and non-prospective KMO (i.e., with or without Au).
Each sample in our training sets is attributed to a binary variable such as 1 s for prospective locations and 0 s for non-prospective location.

Predictor Maps
The geospatial dataset for this study is selected based on availability of the data and its usefulness to be used as proxy for our conceptual exploration model.
Geological data were derived from 1:50,000 British Columbia digital geology data compilation that was last updated on 5 April 2018 [36] (original dataset related to this article can be found at https://catalogue.data.gov.bc.ca/dataset/bedrock-geology, a public online data repository hosted at Data Catalogue by the Government of British Columbia; accessed on 25 March 2019). In the study area, Au-hydrothermal deposits are strongly correlated with intrusions from the Late Triassic-Early Jurassic and structurally controlled [34]. Therefore, we created predictor maps of distances to Texas Creek Plutonic Suite and Stikine Plutonic Suite intrusions and distance to fault traces at 500 m intervals. Moreover, reactive lithologies can act as a chemical trap for Au deposition. As such, reactive lithologies of the Stuhini Group and the Hazelton Group were categorized as 'favorable' host-rock while the other lithologies present in the study area were categorized as 'non-favorable' host-rock.
The geochemical data comes from a compiled database of soil samples from historic geochemical surveys conducted by private companies from 1981 to 2011. The elements analyzed and the analytical methods used in each survey varied and were not always adequately reported. Exploration efforts focused on areas surrounding known prospects, thus do not cover the entire study area. Only the twelve most present elements in the database were kept for further analysis: Ag, Au, As, Ba, Co, Cu, Fe, Mn, Mo, Pb, Sb, Zn. Values below the lower detection limit were replace by half the detection limit. No imputation was performed on the missing data. Hence, our geochemical dataset contains some missing data. The dataset was transformed using centered log-ratio (clr) [37]. Then, principal component analysis (PCA) is applied to the transformed dataset. Principal components can be interpreted as describing separate geological processes (i.e., differentiation, alteration, mineralization, weathering) [38]. In this study, only PC1 and PC2 were kept as the variation between the loadings of the different elements decreased as the number of principal components increased. They account for 34% and 14% of the variance respectively. Based on the loadings of the different elements (Table 1), PC1 and PC2 seem to represent potential metal associations (e.g., Ag-Au-Sb), and enrichment or depletion. The clr-transformed Au values and the PC1 and PC2 were interpolated using the Inverse Distance Weighted (IDW) algorithm. The geophysical data consists of magnetic data only, as it was the only available data in the area with a relatively small spatial resolution (200 m × 200 m) and was downloaded from the Canadian Aeromagnetic data base. Only processed data was available for download, and we chose to use the first vertical derivative as it is useful to enhance near-surface structure.
For remote sensing data, we used ASTER Level 1 Precision Terrain Corrected Registered At-Sensor Radiance (L1T) data, with a spatial resolution of 15 m, 30 m, and 90 m for the NIR, SWIR and TIR bands respectively. The SWIR and TIR band images were re-sampled to the VNIR band images resolution. The project study area is densely vegetated in the valleys but vegetation becomes scarce with altitude. The area is covered by snow from October to April with presence of multiple glaciers. In order to maximize bedrock exposure, we selected scenes acquired in summer to minimize snow cover and with minimum cloud cover ( Figure 5). The ASTER data was atmospherically corrected and converted to relative reflectance using the Semi-Automatic Classification Plugin in QGIS software. Hydrothermal alteration can effectively be mapped with ASTER data [39][40][41]. Band ratio (BR) and relative absorption band depth (RBD) methods are useful to enhance the absorption feature of some characteristic alteration minerals [42,43]. In this study, the following ratios to map the main alteration present in a hydrothermal system are used [44,45]: A total of twelve predictor maps are generated to map the mineral potential of epithermal Au in the study area (Figures 6 and 7).

Cell Size
Using the methodology laid out by Carranza [46], the unit cell size is determined using point pattern analysis. First, the higher limit of a set of suitable cell size is determined by finding the distance where the likelihood of finding deposits located next to one another is null. That distance is 515 m. The lower limit depends on the predictor map with the smallest scale and can be estimated using the following formula [47]: where MS is the map scale factor. In this study, the predictor maps were derived from  [46]. The most suitable cell size for our study area is defined by the intersection of the two straight lines fitted to the log-log plot ( Figure 8); thus, we selected a 50 m cell size.

RF Algorithm Parameters
As presented in Section 2, RF requires only two essential parameters: k and m try . The k parameter represents the number of trees in the ensemble and m try is the number of input features selected to do the splitting at each node of a tree. In this study, various values of k (from 500 to 5000) and selected k = 1000 as the generalization error started to converge from k ≥ 1000. The lowest RMSE on the OOB samples was used to select the optimal value of m try .

Model Evaluation
After optimization of the RF parameter, the performance of the best-fit models was comprehensively evaluated using confusion matrix, indices of predictive accuracy, and success-rate curves.
A series of statistical indices are calculated from the confusion matrix and permit to evaluate the predictive performance of the trained model: Speci f icity = TN TN + FP (12) where TP is True Positive, TN is True Negative, FN is False Negative, and FP is False Positive. The kappa index measures the fit between the predictor maps and the training dataset [48], the sensitivity and specificity indicate whether the deposit cells or non-deposit cells are correctly classified to their corresponding class respectively. Success-rate curves can be employed to evaluate the overall performance of the models. The curve is generated by calculating the percentage of correctly delineated training deposit in a prospective area for a given threshold with an increment of 5-percentile of the likelihood value [22].

Relative Importance of Predictor Maps
As explained in Section 2, the RF algorithm measures the importance of each predictor maps which provides insights on the best proxies for mineralization in the study area ( Figure 9). Overall, the five most important predictors are PC2, intrusion, host rock, fault, and Au in respective order from higher to lower cumulated percentage. These features represent the principal components of any epithermal Au exploration: source, pathway, physical trap, and chemical trap.
The source of the mineralized hydrothermal fluids is represented by the proximity to porphyritic intrusive bodies of Late Triassic to Early Jurassic age. The faults correspond to the pathway taken by gold-bearing fluids and they can also act as physical traps for those fluids. Lithologies from the Stuhini and Hazelton Group can act as a chemical trap for mineralized fluids. Geochemical anomalies of Au and other pathfinder elements, represented by Au and PC2 predictor maps, are also proxies of a chemical trap.
Among the different hydrothermal alteration mapped using the ASTER images, the phyllic and silica alteration have the highest cumulated percentage.

Predictive Accuracy of the Model
The predicted values range between 0 and 1 and symbolize the probability of occurrence of Au mineral deposit. A threshold of 0.5 was used to classify predictions and to calculate the statistical indices in Tables 2 and 3. Cells with a value higher than the threshold are considered prospective, whereas values below the threshold are considered non-prospective.  Table 3. Accuracy of models generated with selected non-prospective KMO NDL and the mean and standard deviation of accuracy indices of models generated with different sets of random point pattern NDL.

Selected Mean Sd
Accuracy The average accuracy of the random models is lower than the selected model, with 78% and 84%, respectively. The kappa values of 56% and 67% of the averaged random models and the selected model respectively indicate that both models have a moderate fit between the predictor maps and the training datasets [48].
Of all the random models, only R1 yields better results than the selected model, but overall, the selected model is more accurate than the random models. Although the results indicate that all models can capture the spatial relationship between the predictor maps and the training datasets, the selected model is the most accurate.

Performance of RF Modelling
In the previous section, the predictive accuracy of each model is reviewed. However, the models are not evaluated in a spatial context. For that purpose, success-rate curves, describing the performance of the RF modeling based on the resulting predictive maps, are used ( Figure 10). Of the eleven mineral prospectivity maps, the worst performing model is the R1 model which requires 35% of the study area to delineate all of the training deposits. In contrast, the best performing maps capture all of the training deposits in 15% of the study area. Those maps are obtained with the Sel, R6 and R9 models (Figures 11-13).

Conclusions
In this study, different training sets using selected non-prospective KMO and ten sets of randomly generated non-deposit location are compared. The type of deposit could not be used as a selection criterion to select the KMO because the area is an extensive epithermal system. Hence, some of the non-prospective KMO can have a similar geospatial signature as our deposit training data set and introduce bias.
Across all the models, the five predictor maps with the highest cumulated relative importance are representative of the source, pathway, and both physical and chemical traps of an epithermal Au deposit. The different model predictive accuracy are compared and it is found that the model using selected non-prospective KMO has higher accuracy than the average accuracy of the models using random NDL with 84% and 78%, respectively. Thus, using the listed commodities as a discriminant to select non-prospective KMO is enough to have an accurate resulting model.
The predictive maps are evaluated using success-rate curves. The best mineral prospectivity maps are obtained with the R6, R9,and Sel training sets that capture 100% training deposits in 15% of the prospective area. Therefore, when available, it is recommended using KMO classified as 'non-prospective' for the commodity sought rather than the randomly generated non-deposit training set. However, for the larger study area were a diversity of deposit types is present, it is recommended to use the commodities and the deposit type as selection criteria to strengthen the predictive accuracy of the model further.
As found in this study, MPM using RF can be used in early stages of an exploration project when only public data are available. By analyzing and interpreting the response of the target variable to a set of predictor variables, RF is very similar to other knowledgeguided data-driven methods such as evidential belief and weights of evidence modeling. RF can also impute missing values, both continuous and categorical data, particularly when handling heterogeneous datasets, which is the case in this study. This yields an out-of-bag imputation error estimate without the need of a test set or elaborate cross-validation. These characteristics make RF a non-black-box exploration method, which is more suitable for mineral prospectivity modeling than other currently used machine learning approaches.