Mapping Mineral Prospectivity Using a Hybrid Genetic Algorithm–Support Vector Machine (GA–SVM) Model

Machine learning (ML), a powerful data-driven approach, is widely used for mineral prospectivity mapping. This study employs a hybrid of the genetic algorithm (GA) and support vector machine (SVM) model to map prospective areas for Au deposits in Karamay, northwest China. In the proposed method, GA is used as an adaptive optimization search method to find the SVM parameters that yield the best fitness. After obtaining evidence layers from geological and geochemical data, GA–SVM models trained using different training datasets were applied to discriminate between prospective and non-prospective areas for Au deposits, and to produce prospectivity maps for mineral exploration. The F1 score and spatial efficiency of classification were calculated to objectively evaluate the performance of each prospectivity model. The best model predicted 95.83% of the known Au deposits within prospective areas, occupying 35.68% of the study area. The results demonstrate the effectiveness of the GA–SVM model as a tool for mapping mineral prospectivity.


Introduction
Mineral prospectivity mapping (MPM) is a critical step in mineral exploration and exploitation, as it reduces uncertainty and risk by narrowing the target area [1][2][3]. In MPM, multiple datasets (e.g., geological, geophysical, geochemical, and remote sensing data) are collected, analyzed, and integrated to delineate target areas most likely to contain mineral deposits of interest. To achieve this goal, a variety of MPM approaches have been proposed, which can be categorized into knowledge-driven and data-driven methods [4,5].
(1) Knowledge-driven MPM methods use expert knowledge to qualitatively assess the importance of each evidence layer for known deposits of the type sought. Index overlay [6,7], fuzzy logic [8][9][10], and multiple-criteria decision-making methods [11][12][13] are examples of knowledge-driven MPM methods, which are used in frontier or less-explored areas (so-called "greenfields") with no or very few known mineral deposits of the desired type.
(2) Data-driven MPM methods analyze and quantify spatial associations between each evidence layer and the locations of known deposits that share a common genesis, and include weights of evidence [14,15], evidence belief functions [16,17], and logistic regression [18,19]. These methods are commonly applied in well-explored areas with sufficient known mineral deposits of the desired type.
Over the last decade, several machine learning methods have been developed as data-driven approaches for MPM. These include the support vector machine (SVM) [20][21][22], a margin-based classifier suited to small-sample learning that has good generalization capabilities [23] and is an effective tool for modeling the complex nonlinear relationships between evidence layers and mineral occurrences. However, the classification performance of a standard SVM depends heavily on parameter selection (hyper-parameters and kernel parameters), and there are no established criteria or principles to follow when setting these parameters. The genetic algorithm (GA) [24] is a well-known and widely used method for variable selection [25,26]. GA provides a search technique that solves optimization problems by simulating evolution via "survival of the fittest" using various genetic operators. Therefore, an SVM classifier incorporating GA for parameter optimization has great potential in MPM, making full use of the unique merits of these two data-mining approaches.
With respect to training data, SVM, as a supervised algorithm, differs from the traditional data-driven methods used in MPM (e.g., weights of evidence) in that it usually requires both mineralized and non-mineralized training datasets. Because the optimal separating hyperplane between the mineralized and non-mineralized locations is affected by both the mineralized and non-mineralized training datasets, imbalanced training datasets can cause learning bias and increase misclassification [27]. Consequently, balancing the number of mineralized and non-mineralized training samples is an efficient way to obtain more reliable classifications [28]. However, the selection of non-mineralized samples is challenging, because the complexity of geological conditions makes it impossible to verify that every non-mineralized sample is truly non-mineralized [29]. In this context, Carranza et al. [30] summarized four criteria for the selection of non-mineralized samples: (1) non-mineralized samples must be randomly distributed in the study area; (2) non-mineralized samples should be distal to any known mineralized samples; (3) non-mineralized samples must have values for all the univariate geoscience spatial data; and (4) the number of non-mineralized samples must be equal to the number of mineralized samples. In addition, point pattern analysis was applied to evaluate the spatial pattern of non-mineralized samples and determine the optimal distance between the mineralized and non-mineralized samples. Nykänen et al. [31] proposed that other types of known deposits can be used as non-mineralized samples in well-explored areas, whereas random locations that are geologically constrained could represent non-mineralized samples in greenfields. In recent years, various sampling techniques, such as undersampling and oversampling, have been used to select non-mineralized samples [1,32,33]. Prado et al. [33] used the synthetic minority over-sampling technique and random under-sampling to create 400 training datasets with proportions of mineralized-to-non-mineralized samples ranging from 600:30 to 30:600.
In this study, SVM and GA were combined to optimize parameter design and develop a predictive model for mapping Au prospectivity zones in Karamay, NW China. For this purpose, after constructing five evidence layers from geological and geochemical data using spatial data processing methods and a prediction-area (P-A) plot [34,35], point pattern analysis was employed to estimate and randomly select non-mineralized samples based on the selection criteria. Subsequently, GA-SVM models trained using different training datasets were employed to delineate target areas and generate binary prospectivity maps. Ultimately, the F1 score and spatial efficiency were compared between different prospectivity models to evaluate their performance.

Support Vector Machine
The support vector machine (SVM), introduced by Vapnik [23] for classification and regression tasks, is a machine learning method built on the Vapnik-Chervonenkis dimension theory and the structural risk minimization principle. In essence, it uses a nonlinear transformation defined through an inner-product (kernel) function to map the input space into a high-dimensional space, where it finds the optimal linear separating hyperplane. A detailed description of SVM can be found in Cristianini and Shawe-Taylor [36]. Here, a brief summary of SVM is provided.
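Consider a training set of instance-label pairs $(x_i, y_i)$, $i = 1, 2, \ldots, n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$. In linearly separable cases, the following optimization problem needs to be solved to find an optimal separating hyperplane:

$$\min_{\omega, b} \ \frac{1}{2}\|\omega\|^2 \quad \text{s.t.} \quad y_i(\omega \cdot x_i + b) \geq 1, \quad i = 1, 2, \ldots, n \tag{1}$$

where $\omega$ is a vector normal to the hyperplane, and $b$ is a scalar quantity. The Lagrangian multipliers method was introduced to solve the aforementioned problem and obtain a classifying determination function:

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{n} \alpha_i y_i (x_i \cdot x) + b\right) \tag{2}$$

where $\alpha_i$ are the Lagrange multipliers. In linearly non-separable cases, a non-negative slack variable $\xi_i \geq 0$, $i = 1, 2, \ldots, n$, was introduced, and the problem to be solved became:

$$\min_{\omega, b, \xi} \ \frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i(\omega \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, 2, \ldots, n \tag{3}$$

where $C$ is the penalty parameter, which has an important effect on the accuracy of the SVM classifier and must be predetermined by the user. Similar to the linearly separable cases, this optimization model can be solved using the Lagrangian multipliers method.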
In nonlinearly separable cases, the input features are mapped into a new high-dimensional feature space using a kernel function $K(x_i, x_j)$, transforming the problem into a linearly separable case. Several kernel functions, including the polynomial kernel, radial basis function (RBF) kernel, and sigmoid kernel, are popular. This study used the RBF kernel function (Equation (4)), which is an effective kernel function with fewer parameters and provides excellent overall classification performance:

$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) \tag{4}$$

where $\sigma$ is the kernel parameter, which is always greater than zero.
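As a minimal illustration of the RBF kernel, the following Python sketch evaluates it for two feature vectors. It assumes the common Gaussian form with 2σ² in the denominator; the function names and the example vectors are hypothetical and are not taken from the study. The helper `sigma_to_gamma` reflects the fact that LIBSVM parameterizes the RBF kernel as exp(−γ‖u − v‖²), so under this assumption γ = 1/(2σ²).

```python
import math

def rbf_kernel(x_i, x_j, sigma):
    """Gaussian RBF kernel, assuming K = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    if sigma <= 0:
        raise ValueError("sigma must be greater than zero")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def sigma_to_gamma(sigma):
    """Convert sigma to the gamma of the exp(-gamma * ||u - v||^2) form."""
    return 1.0 / (2.0 * sigma ** 2)

# Two hypothetical binary feature vectors (five evidence layers each).
a = [1, 0, 1, 1, 0]
b = [1, 1, 1, 0, 0]
k_ab = rbf_kernel(a, b, sigma=1.0)  # squared distance is 2, so this is exp(-1)
```

A larger σ makes the kernel decay more slowly with distance, so dissimilar feature vectors are treated as more alike; this is one reason the choice of σ (together with C) strongly affects the classifier.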

GA-SVM Model
The genetic algorithm (GA) was first introduced by Holland [24] as an adaptive optimization technique based on the Darwinian evolutionary hypothesis of natural selection. The aim of GA is to find optimum solutions within the search space by defining a fitness function and applying the biological processes of natural selection, crossover, and mutation to individuals in the population. Compared to traditional algorithms, GA can handle large search spaces efficiently and is less prone to converging on a locally optimal solution. Recently, GA has been progressively developed in conjunction with other techniques and has been applied to many optimization problems [37][38][39].
Therefore, GA is used to optimize the SVM parameters σ and C through a process of natural selection, in which accuracy is adopted as the fitness function to evaluate the quality of the solutions. For the two-class problem of mineral prospectivity mapping, the classification results can be represented as a confusion matrix (Table 1), from which accuracy is defined using Equation (5). To determine the optimal parameters for the SVM model, k-fold cross-validation [40,41] was used: the training dataset was divided into k subsets, each subset served in turn as an independent test dataset, and the GA-SVM model was trained with the remaining k − 1 subsets to search for the best fitness; k = 5 achieves an adequate balance between computation time and the reliability of parameter estimation [42]. After repeating the cross-validation process, the fitness was obtained by calculating the accuracy on each test dataset:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$
Here, true positives (TP) and true negatives (TN) are the numbers of known mineralized samples and known non-mineralized samples that are correctly classified, respectively; false positives (FP) are the number of known non-mineralized samples classified as mineralized, and false negatives (FN) are the number of known mineralized samples classified as non-mineralized. The proposed GA-SVM model was employed to extract the optimal combined parameters of SVM to distinguish between prospective and non-prospective areas. The procedure involved in the GA-SVM model for MPM is divided into three parts (Figure 1), as follows:

• Data processing. After constructing a geospatial database, geological maps and geochemical data were analyzed to map five evidence layers and generate training and testing datasets.
• GA optimization. After setting the initial parameters for GA and SVM, the training dataset was used to train an SVM model, while the fitness was calculated as the k-fold cross-validation classification accuracy. If the termination conditions were satisfied, the optimal parameters of SVM were determined. Otherwise, the selection, crossover, and mutation operations were performed to create a new population, and the GA optimization process was repeated.
• Classification. An SVM model was trained with the optimal parameters, and a prospectivity map was produced. Ultimately, the F1 score and spatial efficiency were combined to evaluate the classification ability of the GA-SVM model.
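The GA optimization step described above can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the implementation used in the study: the operators (tournament selection, blend crossover, uniform mutation, elitism), the default population size and probabilities, the lower bound of 0.01, and all function names are hypothetical choices. The `fitness` argument stands in for the k-fold cross-validation accuracy of an SVM trained with a given (C, σ); a toy function is used here so that the sketch runs without an SVM library.

```python
import random

def k_fold_indices(n_samples, k, rng):
    """Shuffle sample indices and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_accuracy(fit_predict, X, y, k, rng):
    """Mean accuracy over k folds; fit_predict(X_tr, y_tr, X_te) returns labels."""
    scores = []
    for fold in k_fold_indices(len(X), k, rng):
        held_out = set(fold)
        X_tr = [x for i, x in enumerate(X) if i not in held_out]
        y_tr = [t for i, t in enumerate(y) if i not in held_out]
        preds = fit_predict(X_tr, y_tr, [X[i] for i in fold])
        correct = sum(p == y[i] for p, i in zip(preds, fold))
        scores.append(correct / len(fold))
    return sum(scores) / k

def ga_search(fitness, bounds, pop_size=20, generations=50,
              p_cross=0.8, p_mut=0.1, seed=0):
    """Maximise fitness(C, sigma) within bounds using a simple real-coded GA."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    best = max(pop, key=lambda ind: fitness(*ind))
    for _ in range(generations):
        scored = [(fitness(*ind), ind) for ind in pop]

        def select():
            # Binary tournament: keep the fitter of two random individuals.
            a, b = rng.sample(scored, 2)
            return (a if a[0] >= b[0] else b)[1]

        children = [best[:]]  # elitism: carry the best individual forward
        while len(children) < pop_size:
            p1, p2 = select(), select()
            if rng.random() < p_cross:
                w = rng.random()  # blend crossover between the two parents
                child = [w * g1 + (1 - w) * g2 for g1, g2 in zip(p1, p2)]
            else:
                child = p1[:]
            for j, (lo, hi) in enumerate(bounds):
                if rng.random() < p_mut:  # uniform mutation within bounds
                    child[j] = rng.uniform(lo, hi)
            children.append(child)
        pop = children
        gen_best = max(pop, key=lambda ind: fitness(*ind))
        if fitness(*gen_best) > fitness(*best):
            best = gen_best
    return best, fitness(*best)

# Toy stand-in for cross-validation accuracy, peaking at C = 10, sigma = 2.
def toy_fitness(C, sigma):
    return -((C - 10.0) ** 2) - (sigma - 2.0) ** 2

BOUNDS = [(0.01, 100.0), (0.01, 100.0)]  # assumed search ranges for C and sigma
best_params, best_fit = ga_search(toy_fitness, BOUNDS)
```

In practice, `toy_fitness` would be replaced by a function that calls `cv_accuracy` with k = 5 and an SVM training routine configured with the candidate C and σ. The termination condition here is simply a fixed number of generations.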

Performance Evaluation
In MPM, although it is important to evaluate a prospectivity model's ability to identify both mineralized and non-mineralized locations, the identification of mineralized locations often has greater significance for the following reasons: (1) MPM aims to identify and distinguish mineralized locations; (2) a mineralized location misclassified as a non-mineralized location can result in the loss of important mineralization and incur high costs [43]; and (3) non-mineralized locations, which are often randomly selected, always introduce uncertainty. Therefore, the F1 score [44], which is the harmonic mean of precision and recall, was used to measure the ability of the prospectivity model to identify mineralized locations:

$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{6}$$

Here, precision represents the proportion of locations classified as deposits that are known deposits, and recall represents the proportion of known deposits that are correctly classified as deposits, as follows:

$$\text{precision} = \frac{TP}{TP + FP} \tag{7}$$

$$\text{recall} = \frac{TP}{TP + FN} \tag{8}$$

The relevant parameters are defined in Table 1 and Equation (5).
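The evaluation metrics can be computed directly from the four confusion-matrix counts, as in the short Python sketch below. The standard definitions of accuracy, precision, recall, and F1 are assumed, and the counts in the example are hypothetical (24 mineralized and 24 non-mineralized samples); they are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 22 of 24 deposits found, 3 of 24 barren samples misclassified.
m = classification_metrics(tp=22, fp=3, fn=2, tn=21)
```

Because F1 reduces to 2TP / (2TP + FP + FN), it ignores true negatives, which is consistent with the emphasis placed here on mineralized locations.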

Study Area
The study area is located in the eastern part of the Tangbale-Hatu belt (western Junggar region, China), part of the eastern extension of the Balkhash-Junggar metallogenic domain, in which a number of Au deposits have been discovered, including Hatu, Qi-II, Qi-III, Qi-IV, and Qi-V. The study area covers approximately 11,784 km². The region is characterized by ophiolitic mélanges, Carboniferous volcanic-sedimentary rocks, and granitoid intrusions (Figure 2). Several ophiolitic mélanges are in fault contact with the Lower Carboniferous volcanic-sedimentary strata [45]. Extensive Carboniferous volcanic-sedimentary rock outcrops, which are mostly distributed on both sides of the Darabut fault, occur in three successive formations: the Tailegula, Baogutu, and Xibeikulasi formations (ordered from bottom to top) [46]. The regional structure is dominated by a series of NE-trending faults. The larger NE-trending faults, namely, the Darabut, Anqi, and Hatu faults, constitute the basic framework of the region. Distributed between these large faults are granite bodies of various sizes, including the Akebasitao, Hatu, Miaoergou, and Karamay plutons, which formed at 300-280 Ma [47][48][49]. These granite plutons provided a favorable tectonic environment for Au mineralization.
Based on the zonal distribution or clustered occurrence of metallogenesis, the study area can be roughly divided into two metallogenic belts: the Hatu metallogenic belt and the Baogutu metallogenic belt [50,51]. Thus, based on synthetic studies of the typical Hatu and Kuogashaye Au deposits, a conceptual model for Au mineralization in the study area was proposed, as shown in Table 2.

Metallogenic Factor | Description
Regional geological background |
  Tectonic environment | The NE-trending fault is the main tectonic line in the region. The crustal uplift and depression transitional zone to the north of the Darabut fault shows evidence of intense magmatic and volcanic activities and is the main ore-forming material source of Au deposits.
  Intrusive rocks | Intermediate-acid intrusive rocks are closely spatially related to mineral deposits.
  Ore-bearing strata | The vast majority of Au deposits are located in the Tailegula and Baogutu Formations of the upper Carboniferous in the Paleozoic.
  Ore-forming epoch | Middle and late Variscan age
  Wallrock alteration | Common forms of wallrock alteration include silicification, pyritization, arsenopyritization, and sericitization.
Regional geochemical field | The geochemistry of this region is dominated by Au anomalies. High Au concentrations are distributed between Toli and Karamay, with clear concentration centers and zoning.

Data
In this study, a spatial dataset was derived from established multi-source geological spatial databases containing geological and geochemical data. Geological maps at a scale of 1:200,000 were collected from the Bureau of Geology and Mineral Resources of Xinjiang. Stream sediment geochemical data at a scale of 1:200,000 were obtained from the National Geochemical Mapping Project of China, with a sampling density of one sample per 4 km² [52].

Evidence Layers
The selection of evidence layers requires consideration of the characteristics of Au deposits, favorable conditions for Au mineralization, and the available data in the study area. Five evidence layers were used to produce a prospectivity map, as shown in Table 3.

Table 3. Summary of Evidence Layers Used in this Study.

Criteria | Evidence Layer | Relevance
Geology | Proximity to lithostratigraphic contacts | The ore-forming elements migrate to the lithostratigraphic contacts and accumulate, resulting in precipitation, enrichment, and mineralization.
Geology | Proximity to NE-trending faults | The region's main tectonic line runs NE and provides the driving force, the migration channel, and the depositing space for the mineral flow.
Geology | Fault intersection density; fault linear density | Fault density reflects the location of frequent magma and hydrothermal activity, and the frequent superimposition of ore-forming elements.
Geochemistry | PC1 scores generated by singularity indices of ore-forming elements | Ag, As, Au, and Sb are present in high concentrations above ore bodies. These elements can be used to differentiate provenance characteristics, understand the migration and evolution patterns of elements, and distinguish geochemical anomalies.
In this area, Au deposits are mainly hosted in Carboniferous formations or in contact with granitoid intrusions. The contacts of the granitoid intrusions and the stratigraphic units between the Tailegula, Xibeikulasi, and Baogutu Formations were extracted from a 1:200,000 geological map of the study area, and a map of proximity to lithostratigraphic contacts was produced using Euclidean distance in the ArcGIS environment (Figure 3a). Similarly, a map of proximity to NE-trending faults was generated (Figure 3b), because NE-trending faults play a dominant ore-controlling role during Au mineralization. Fault density, consisting of fault intersection density and fault linear density, reveals the spatial relationship between the development and accumulation levels of faults and Au deposits. Here, the fault intersection density and fault linear density were analyzed, and the corresponding maps were generated using point density and line density in the ArcGIS environment, respectively (Figure 3c,d).
The element contents of Ag, As, Au, and Sb obtained from 1:200,000 stream sediment geochemical data were analyzed using the singularity mapping technique [53] and principal component analysis (PCA) [54,55]. The singularity mapping technique was implemented to identify local anomalies of Ag, As, Au, and Sb from geochemical background fields based on sliding windows in MATLAB. To highlight the inherent relevance of multiple elements and reduce the uncertainty of each single element, PCA was used to integrate multi-element singularity indices, based on their correlations to delineate comprehensive anomalous areas. As shown in Figure 4a, the first principal component (PC1) accounted for 40.7% of the overall variance, whereas PC2, PC3, and PC4 modeled an additional 23.1%, 21.0%, and 15.2% of the total variance, respectively, indicating the importance of each component. In addition, the resulting PC1 shows positive loadings on the singular association of Ag, As, Au, and Sb, which is consistent with the characteristics of the geochemical anomalies (Figure 4b). Accordingly, PC1 can be used as an evidence layer for Au deposits. As shown in Figure 4c, low PC1 scores have a strong spatial association with most of the known Au deposits. More details on the analytical methods and data processing can be found in Zhou et al. [56].
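To make the window-based estimation of a singularity index concrete, the Python sketch below fits a single index from mean concentrations measured in nested windows. It assumes the usual two-dimensional power-law model of singularity mapping, in which the mean concentration in a window of size r scales as r^(α − 2), so that α is the log-log regression slope plus 2. The function name, window sizes, and concentrations are hypothetical; the study's MATLAB implementation is not reproduced here.

```python
import math

def singularity_index(window_sizes, mean_concentrations):
    """Estimate alpha as (slope of log C versus log r) + 2 by least squares."""
    xs = [math.log(r) for r in window_sizes]
    ys = [math.log(c) for c in mean_concentrations]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx + 2.0

# Hypothetical window sizes (km) and synthetic means following r^(-0.5).
sizes = [2.0, 6.0, 10.0, 14.0]
enriched = [5.0 * r ** -0.5 for r in sizes]
alpha = singularity_index(sizes, enriched)  # close to 1.5 for this synthetic input
```

Under this model, α below 2 indicates local enrichment and α close to 2 indicates background, which may help explain why low scores on a component with positive loadings on the singularity indices coincide with known deposits.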

Target Variable and Feature Vectors
The application of the GA-SVM model for MPM requires a training dataset with geological feature vectors of five evidence layers and a target variable to represent mineral prospectivity. The target variable expresses mineralized locations or non-mineralized locations with scores of 1 and 0, respectively. For mineralized locations, we used 24 known Au deposits to ensure classification accuracy. For non-mineralized locations, we used point pattern analysis [57,58] to analyze the nearest-neighbor distances between every pair of deposits within the study area to determine the optimal distance from known deposits at which the probability of finding a deposit decreased. In this study, most of the nearest-neighbor distances were less than 14.5 km, and there was only one outlier. Thus, 14.5 km is regarded as the optimal distance for the selection of non-mineralized samples. In addition, to obtain a balanced dataset, the number of non-mineralized samples should be the same as the number of known Au deposits. To this end, we created four training datasets, each consisting of 24 randomly selected non-mineralized samples, according to the selection criteria used by Carranza et al. [30]. Figure 5 shows the spatial distribution of the four training datasets.
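The selection of non-mineralized samples can be illustrated with the following Python sketch, which computes nearest-neighbor distances between deposits and then draws random locations that lie at least a given distance from every deposit. The rectangular extent, the deposit coordinates, and the function names are hypothetical. The third criterion of Carranza et al. [30] (valid values in all evidence layers) is not checked here and would need an additional test against the actual raster data.

```python
import math
import random

def nearest_neighbor_distances(points):
    """Distance from each point to its nearest other point."""
    return [min(math.dist(p, q) for j, q in enumerate(points) if j != i)
            for i, p in enumerate(points)]

def sample_non_mineralized(deposits, extent, min_dist, n, seed=0,
                           max_tries=100000):
    """Draw n random points at least min_dist away from every deposit."""
    rng = random.Random(seed)
    xmin, ymin, xmax, ymax = extent
    samples = []
    for _ in range(max_tries):
        if len(samples) == n:
            break
        p = (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
        if all(math.dist(p, d) >= min_dist for d in deposits):
            samples.append(p)
    if len(samples) < n:
        raise RuntimeError("could not place enough samples; relax min_dist")
    return samples

# Hypothetical deposit coordinates (km) in a 100 km x 100 km extent.
deposits = [(20, 20), (25, 22), (30, 30), (60, 65), (62, 70), (70, 60)]
nn = nearest_neighbor_distances(deposits)
barren = sample_non_mineralized(deposits, (0, 0, 100, 100),
                                min_dist=14.5, n=len(deposits))
```

Changing `seed` yields a different random draw, which corresponds to building several alternative training datasets from the same set of known deposits.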
The feature vector is a multidimensional numeric vector representing a combination of the attributes of evidence layers in a specific location. In this study, the attributes of the five evidence layers were encoded as either 1 or 0, where 1 and 0 indicate favorable and unfavorable conditions for Au mineralization, respectively. Consequently, to obtain binary patterns for the evidence layers used in the GA-SVM model, it was necessary to define the optimum threshold for classifying the maps. The P-A plot, which is a simple prediction rate-occupied area plot [34,35], was employed to determine the optimum thresholds with respect to the evidence layers. When the intersection point of two curves is high in a P-A plot, it portrays a small area containing a large number of mineral deposits. In this study, the P-A plot consisted of the curve of the percentage of known mineral occurrences corresponding to the classes of the evidence layer and the curve of the percentage of occupied areas corresponding to the classes of the evidence layer. Therefore, the location of the intersection point in the P-A plot could guide us in finding the binary pattern of evidence layers for Au mineralization. Figure 6 shows that (1) the optimum distances between the location of Au deposits and lithostratigraphic contacts and NE-trending faults are 1028.63 and 1108.10 m, respectively; (2) the optimum densities between the location of Au deposits and fault intersection density and fault linear density are 0.10 and 0.35, respectively; and (3) the optimum cutoff value between the location of Au deposits and PC1 scores was 22.1. According to the above optimal values and spatial associations between each evidence layer and the known Au deposits, all the evidence layers were encoded and combined to generate 2968 feature vectors.
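The threshold selection with a P-A plot can be sketched in Python as follows. The sketch assumes that the intersection point is where the prediction rate equals 100 minus the occupied area (both in percent), which is how the two curves are commonly drawn on opposing axes; the candidate thresholds, the synthetic layer values, and the function names are hypothetical.

```python
def pa_intersection(cell_values, deposit_values, thresholds, favorable_low=True):
    """Return (threshold, prediction rate %, occupied area %) nearest the
    P-A intersection, i.e. where prediction rate = 100 - occupied area."""
    def favorable(v, t):
        return v <= t if favorable_low else v >= t

    best = None
    for t in thresholds:
        pred = 100.0 * sum(favorable(v, t) for v in deposit_values) / len(deposit_values)
        area = 100.0 * sum(favorable(v, t) for v in cell_values) / len(cell_values)
        gap = abs(pred - (100.0 - area))
        if best is None or gap < best[0]:
            best = (gap, t, pred, area)
    _, t, pred, area = best
    return t, pred, area

def binarize(values, threshold, favorable_low=True):
    """Encode a layer as 1 (favorable) or 0 (unfavorable)."""
    if favorable_low:
        return [int(v <= threshold) for v in values]
    return [int(v >= threshold) for v in values]

# Synthetic proximity layer: 100 cells and 10 deposits (arbitrary distance units).
cells = list(range(100))
deposit_cells = [1, 2, 3, 5, 8, 10, 12, 15, 40, 70]
t, pred, area = pa_intersection(cells, deposit_cells, range(0, 100, 5))
encoded = binarize([5, 30], t)
```

For proximity layers, small values are favorable (`favorable_low=True`); for density layers the flag would be set to `False`. Which side of the cutoff is favorable is a geological judgment that should follow the metallogenic model.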

Mineral Prospectivity Mapping
The GA-SVM model implemented in this study was programmed using the LIBSVM package [59] as a supplementary tool in MATLAB. The GA-SVM model with the parameters shown in Table 4 was used to search for the kernel parameter σ of RBF and the penalty parameter C of SVM (Figure 7), and the best fitness values corresponding to the optimal parameters of SVM were obtained (Table 5). Prospectivity models were established using these optimal parameters to determine the spatial associations between evidence layers and mineralized locations and to produce prospectivity maps for mineral exploration (Figure 8).

Parameter | Description | Value
C | The penalty parameter of SVM | 0-100
σ | The kernel parameter of RBF for SVM | 0-100
k | k-fold cross-validation | 6

The confusion matrices and F1 scores for the individual models are presented in Table 6 to demonstrate the performance evaluation results of the GA-SVM model presented in this study. In terms of the confusion matrices, most of the known deposits were classified accurately, and the highest precision was 0.96. In addition, the results show that the F1 score based on training dataset 2 was the highest, indicating that this model had a greater ability to distinguish mineralized locations than the other models.
Although the F1 score provides a proxy for measuring the classification ability of the GA-SVM models, it cannot assess the spatial efficiency of the prospectivity model classifications [33]. Therefore, the number of known deposits within the prospective area and the percentage of the study area occupied by the prospective area were calculated for each prospectivity model to measure the relative spatial efficiency. The statistical comparison between the GA-SVM models trained with the four training datasets (Table 7) shows that the model trained with training dataset 2 captured a larger number of known Au deposits but also delineated the largest prospective area, which accounts for its poor spatial efficiency even though its F1 score was the highest. This may have resulted from overfitting. In contrast, although the lowest F1 score was obtained for the GA-SVM model trained using training dataset 3, it was more efficient in its classification than the other models. This illustrates that the prospectivity model is sensitive to randomly selected non-mineralized samples.

The prospectivity model with training dataset 4 was the best in terms of both the F1 score and the spatial efficiency of the classification, as it reduced the target area while predicting the same number of known deposits and exhibited good performance in identifying mineralized locations. The prospective areas in Figure 8d occupy 35.68% of the study area and contain 95.83% of the known Au deposits. From the perspective of the spatial domain, the best prospectivity map (Figure 8d) shows a spatial correlation with proximity to NE-trending faults, which is consistent with the model of Au mineralization.
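The two spatial-efficiency quantities can be computed as in the Python sketch below. The reported 95.83% is consistent with 23 of the 24 known deposits; the cell counts used here are hypothetical and were chosen only so that the area share rounds to the reported 35.68%. The capture-to-area ratio is an added convenience measure and is not a statistic defined in this study.

```python
def spatial_efficiency(deposits_captured, deposits_total,
                       prospective_cells, total_cells):
    """Return (% deposits captured, % area occupied, capture-to-area ratio)."""
    capture = 100.0 * deposits_captured / deposits_total
    area = 100.0 * prospective_cells / total_cells
    return capture, area, capture / area

# 23 of 24 deposits; 1059 of 2968 cells (cell counts are illustrative).
capture, area, ratio = spatial_efficiency(23, 24, 1059, 2968)
```

A model that captures the same deposits within a smaller share of the study area has a higher ratio, which matches the reasoning used to prefer one prospectivity model over another.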

Conclusions
This study employed a hybrid support vector machine (SVM) model with genetic algorithm (GA) to discriminate between prospective and non-prospective areas for Au deposits in Karamay, northwest China. The findings support the following conclusions:

• Since SVM generalization performance is heavily dependent on the parameters σ and C, it is necessary to adopt GA as an optimization method to select better combinations of the two parameters for SVM.
• Owing to its characteristics, the P-A plot can be used to classify evidence layers into binary patterns. It is important to note that knowledge of the metallogenic model should be applied to differentiate favorable and unfavorable areas in the binary maps.
• A key procedure in implementing the GA-SVM model was the selection of the training dataset, especially the 'non-mineralized' locations. In complex geological environments, it is impossible to identify non-mineralized locations with certainty; thus, point pattern analysis is a useful measure for determining the optimal distance at which non-mineralized locations can be randomly selected based on the selection criteria.
• The performance of the GA-SVM model for distinguishing prospective areas in the study area was evaluated using both the F1 score and spatial efficiency. The best prospectivity model predicted 95.83% of the known Au deposits within prospective areas, occupying 35.68% of the study area.
• The best prospectivity map, as classified by the GA-SVM model, displayed a strong spatial correlation between prospective areas and proximity to NE-trending faults. This conforms to the characterization of spatial associations between geological features and Au deposits and indicates strong control of Au mineralization by NE-trending faults within the study area.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.