Quantitative Geochemical Prediction from Spectral Measurements and its Application to Spatially Dispersed Spectral Data

Featured Application: Authors are encouraged to provide a concise description of the specific application or a potential application of the work. This section is not mandatory. Abstract: The efficacy of predicting geochemical parameters with a 2-chain workflow using spectral data as the initial input is evaluated. Spectral measurements spanning the approximate 400-25000nm spectral range are used to train a workflow consisting of a non-negative matrix function (NMF) step, for data reduction, and a random forest regression (RFR) to predict 8 geochemical parameters. Approximately 175000 spectra with their corresponding chemical analysis were available for training, testing and validation purposes. The samples and their spectral and chemical parameters represent 9399 drillcore. Of those, approximately 20000 spectra and their accompanying analysis were used for training and 5000 for model validation. The remaining pairwise data (150000 samples) were used for testing of the method. The data are distributed over 2 large spatial extents (980 km 2 and 3025 km 2 respectively) and allowed the proposed method to be tested against samples that are spatially distant from the initial training points. Global R 2 scores and wt.% RMSE on the 150000 validation samples are Fe(0.95/3.01), SiO 2 (0.96/3.77), Al 2 O 3 (0.92/1.27), TiO(0.68/0.13), CaO(0.89/0.41), MgO(0.87/0.35), K 2 O(0.65/0.21) and LOI(0.90/1.14), given as Parameter(R 2 /RMSE), and demonstrate that the proposed method is capable of predicting the 8 parameters and is stable enough, in the environment tested, to extend beyond the training sets initial spatial location.


Introduction
The routine collection of spectral reflectance measurements from drillcore and/or laboratory ready samples is now common enough that it is natural to assess the feasibility of using the spectral measurements for quantitative prediction. This action is already performed in various guises with the selected methodology generally based around the desired outcome.
Relatively simple spectral indices have been routinely used within the spectral remote sensing community with great success for many years [1][2][3][4][5] and more recently with proximal spectral sensing of drillcore samples within the exploration community [6][7][8][9][10]. The latter has been driven by the proliferation of hand-held and benchtop spectrometers that have successfully lowered the barrier to entry and provided a means of leveraging such data for more sophisticated qualitative and quantitative methodologies to aid in the exploration task.
Due to the ease with which spectral data can be collected, and the fast turnaround of higher order products generated from said reflectance data, it can be highly beneficial in providing a means of early confirmation and assessment of a variety of results that may aid in the exploration decision making process [11,12].
While questions pertaining to mineral identification in various mineral systems can be addressed with indices-based methods the approach used to relate the bulk, or volumetric, properties of a geological sample to its spectrum require more sophisticated methodologies.
Volume based assessment of geological samples is generally derived from a labbased analysis of the sample which comes with an inherent cost and turnaround time. While the cost and turnaround time of accessing results from laboratory samples are generally accounted for it does not mean that earlier access to that knowledge would not aid an explorer or decision maker. An earlier preliminary result may, for example, aid in the earlier definition of an existing economic ore body and allow preplanning prior to laboratory based confirmation. Alternatively, it may assist in the process of apportioning which samples are best suited for a deeper laboratory analysis and/or the actual sampling frequency best suited to answering the question at hand.
The use of a partial least squares [13] models for regression analysis relating to geochemical properties is well established [14][15][16][17]. These models are generally singular output models and require as many models as the number of parameters that are to be predicted. This use of random forest models [18] have been gaining popularity and applied successfully to both classification [19][20][21] and regression [22,23] problems. As well as proving to be easily implemented and robust they can also make multioutput predictions and therefore reduce the need for multiple models.
In this study we will investigate a methodology that uses a 2-step process to ascertain if wt. % estimates from whole-rock geochemistry are reliably predictable from spectral measurements of drillcore samples prepared as pulps. The 2-step process makes use of a dimensionality reduction step followed by a multioutput decision tree approach to predict the wt.% of 8 different whole-rock parameters, the majors and LOI, and its predictive effectiveness when applied to a testing set that encompasses a spatial extent extending beyond the initial training and test dataset.

Materials
The data used in the study are proprietary and are therefore subject to constraints. Namely, the spatial location of the data source cannot be provided without revealing proprietary information.
The data itself represent pulps collected from multiple drillcore which are distributed over 2 large spatial extents of approximately 360 km 2 and 1120 km 2 respectively. In this study a reference to drillcore sample is given to mean spectral sample as measured from a pulp. The complete dataset is comprised of 7 individual datasets that are made up of spectral measurements and whole rock geochemistry (Fe, SiO2, Al2O3, TiO2, CaO, MgO, K2O and LOI).
Other variables included in the whole rock geochemistry were P, S and Mn but are not used in this study as they failed to produce a working model. The entire dataset comprises approximately 175K samples (approximately 25000 per dataset) with dataset 1 randomly split into 20000 and 5000 samples for training and validation respectively. The remaining 6 datasets (approximately 150K samples) were held out for testing. Table 1 gives the summary statistics comprised of the mean, standard deviation, 50% and 75% quartiles, and the maximum value of the 7 datasets and the 8 geochemical parameters examined in the study The spectral data collected from any given pulp sample was via 2 different spectral instruments. The first is the HyLogger [24][25][26] which collected data in the 350-2500nm spectral range and whose spectral outputs are given with a 4nm sampling interval, and the second, a Fourier Transform Interferometer for spectral collection from 2000nm-25000nm with a spectral sampling interval of 3.857 cm-1. To create a single spectrum the FTIR data from 2000-2500nm was disregarded and the remaining spectral signal appended to the HyLogger spectrum.

Methods
The task is to assess the feasibility of predicting whole rock geochemistry parameters with spectral data used as the driving input to the model/s. The combined HyLogger and FTIR spectral data comprise 1476 spectral bands. To reduce computational overhead and to reduce the dimensionality of the spectra we firstly use a non-negative matrix factorisation (NMF) model [27] and follow that with a random forest regression (RFR) model [18] to make our prediction. The model implementations for the NMF and RFR are provided by the python scikit-learn library [28].

NMF
The NMF model is a method of representing data as a linear representation using non-negativity constraints. The imposed non-negative constraint leads to a part-based representation that allows only additive, not subtractive, combinations of the original data [27,29]. Using the NMF in the first step allows us to reduce the dimensionality of the input spectra to a smaller number of components than the entirety of the spectrum and is used as input features to the RFR model.
In the 2-step process we reduce the spectral data to a series of components with the NMF model, and where the components are representations of the additive parts comprising the training signals. Due to the non-negative nature of the components they can have a physically interpretable correspondence [30,31], or when the components are predefined such that the components represent spectral endmembers the parts based weights returned are indicative of the proportions of those endmembers and hence can be used in a linear spectral unmixing [32][33][34][35][36].
The matrix form of the NMF is given by Where V, of size (# samples, # features), is a linear combination of component weights in W with dimensions of (# samples, # components) and components in H (# components, # features). In this study the number of features in the H matrix are the spectral bands of the spectral signals. In practice the scikit-learn NMF implementation is used in the fitting phase with the spectral data to estimate the W and H matrices of the NMF model. It is the H matrix that we are seeking in this portion of the study.
The 20000 spectra selected for training were used in the construction of 6 NMF models to compute 6 H matrices where each H matrix differed only in the number of components. The number of components in each H matrix was 5, 10, 15, 20, 25, and 30 components. The 6 individual NMF models were used to transform the 5000 validation spectra and return 6 W component weighting matrices. Each of the 6 W matrices were then inverse transformed to recover the equivalent V matrix which in turn is directly compared to the 5000 validation spectra. We sought to produce a reconstruction score between the measured spectra and those returned by the inverse NMF such that the difference between the measured spectra and the reconstructed spectra are minimized and an R 2 score of 0.99 is calculated.
The high valued constraint on the R 2 of the reconstruction is set so we are confident that the components in H are able to represent the measured spectra and provide a reduced dimensionality of spectra fitted to the model via the W matrix. In this study 25 components satisfied the criteria of a 0.99 R 2 . The NMF model was then established for 25 components and saved so it could be used to transform new unseen spectra.
Although it isn't explored further, and as noted earlier, the 25 components in H of the resulting NMF model can be considered spectral end-members of the training set [32][33][34]37] and the weighting values W returned in a transform the proportion of each endmember required to produce the measured spectrum. In terms of physical size on disk the trained NMF model occupies approximately 300 KB of space.

RFR
With a dimensionality reduction step (NMF model) the resulting 25 weighting values output for a given sample are used as input features to the RFR model to provide prediction values of the whole rock geochemistry. The whole rock geochemistry for the 8 parameters represents the prediction outputs from the RFR model. In a regression scenario random forests, or random decision forests, are an ensemble method that use a collection of decision trees to output the mean prediction of the individual trees [18]. The benefit of using random forests is they are generally considered robust and self-correcting so can reduce the overfitting often observed in individual decision trees [18,38].
While several implementation parameters can be used to construct an RFR model we have opted to use the defaults of the sklearn.ensemble.RandomForestRegressor model as set in the scikit-learn library with the exception of the maximum depth of the decision trees. In the final RFR model the maximum depth of an individual tree was set at 16.
The maximum depth value was determined by increasing the maximum depth of the RFR model until the R 2 score of the predictors was found to be minimally different to an RFR model with an unbounded maximum depth on the decision trees. The resulting RFR model was then trained and validated with the 20000 and 5000 spectral samples respectively and saved for future use with the remaining validation data. The physical size of the RFR model on disk is 130 MB.
In summary the application of the 2-step methodology after training and validation is as follows: 1. Set any spectral reflectance values that are less than zero to zero (potential measurement errors). This is a requirement since the NMF cannot work with negative inputs. 2. Transform the Nx1476 (1476 being the total number of spectral bands) individual spectra via the precomputed NMF model to the Nx25 sample space, where N is the number of spectral samples. 3. Input the Nx25 NMF transformed spectra into the trained RFR model as input features and retrieve the estimated parameter values for Fe, SiO2, Al2O3, TiO, CaO, MgO, K2O and LOI.

Results
The results contained herein are split into 3 subsections. Namely, spectral, global, and downhole. Each subsection focuses on an aspect of the data and/or its relevance to the results. The spectral subsection will look at spectra associated with the 8 geochemical parameters and the potential minerals associated with said parameters. The global subsection looks at the performance of the prediction model as it applies to the entire collection of validation data. Lastly, the downhole subsection presents a downhole comparison of predicted results against the measured response of 4 drillcore.

Spectral
To gauge the potential differences in the spectra associated with a given element, validation dataset four was used to retrieve the spectra corresponding to each of the 8 geochemical parameters being at their maximum values. These spectra are shown in Figure 1 where the parameter and value of the maximum for the spectral sample is provided in the legend. To distinguish between the absorptions more easily across the 400-25000nm spectral range two separate plots are shown. The upper plot covers the 400-6000nm and the lower plot the 6000-25000nm spectral region. These spectra are not presented to provide an in-depth analysis of the full suite of potential minerals that might be encountered but rather to ascertain if the mineral types are at least consistent with what might be observed when the given geochemical parameter is at a maximum. Figure 1. Indicative spectra selected from dataset 4 corresponding to the maximum value of a given geochemical parameters. The spectra are indicators only of the potential differences in reflectance that might be associated with a given parameter. It is expected, and confirmed, that certain whole-rock parameters will be consistent with given mineral assemblages. Complementary to this figure is Table 2 which provides a summary of some of the major absorption/emission features of minerals present in the study area and noted within the spectra.
Complementary to Figure 1 is Table 2 which provides a summary of some of the major absorption/emission features of minerals present in the study area. It is noted that respective absorption bands can be present in minerals that are not listed in table. The lower and upper wavelength positions are only given for absorption bands where related compositional changes occur otherwise, an estimated central location is provided. In Figure 1 the spectra named Fe, Al2O3 and TiO2 display iron oxide absorptions that are characterised by crystal field interactions around 900 nm [46,47] and are indicative of hematite and goethite. The spectrum associated with the greatest amount of Fe in this case does not appear to have the greatest 900nm absorption depth, as compared to the Al2O3 and TiO2 spectra, and is seemingly free from indicative kaolin group absorptions located at 2206nm [40,47]and 2705nm [40] and 2761nm [40] which are present in the Al2O3 and TiO2 spectra. The Fe spectrum in this case, and because of the lack of other mineral absorptions, is probably a relatively pure Fe sample.
The CaO and MgO (and the sample with the highest LOI) samples are consistent with carbonates. Absorption features associated with carbonates are observed at approximately 2300nm [41,48], 2500nm [41], 3500nm, 4000nm, 4600nm and 6400-6500nm [7]. Calcium dominated carbonates, such as calcite, have absorptions at longer wavelengths in the 2300nm and 2500nm spectral regions as opposed to those carbonates, such as siderite or magnesite, where Fe or Mg replaces the Ca, and the absorption features shift to shorter wavelengths [48][49][50].
As noted, the Al2O3 spectrum display several absorptions commonly associated with kaolinite but also contain jarosite as defined by a distinct absorption at 2260nm [51]. The sample associated with the greatest TiO2 has weak kaolin group absorptions at 2160nm and 2200nm and around 2700nm.
The spectrum relating to the highest valued SiO2 is devoid of discernible absorption features in the VNIR/SWIR but is distinguishable as a quartz sample by the notable peaks located at approximately 8500nm, 9000nm, 12500nm and 12800nm [42,43]. Lastly, the spectrum associated with the highest K2O value is almost free of any discernible absorption features with the exception being kaolin group absorptions around 2700nm. In this case the SWIR absorptions around 2200nm that are also associated with the kaolin group are not discernible.  Table 3, three separate values are referred to, namely the R 2 , the RMSE and the standard deviation of the RMSE. Column 1 names the dataset in question and lists the number of drillholes that are present in each dataset. Any row that refers to the training dataset are the values as returned by applying the models to the 5000 validation samples while the remaining datasets are the results of applying the models to the unseen testing datasets. The results listed for the "Training" dataset are those values as returned by the validation set (5000 samples) for the RFR model trained on the training set (20000 samples). All global averages given are the averages for the 6 testing datasets (given as Set2-Set7). In Figures 3 and 5 the results do not include any values from the training/validation dataset and are only comprised of results from testing datasets (Set2-Set7).

Figure 2.
The calculated R 2 scores for each of the 6 testing datasets (sets 2-7) and the training and validation dataset (simply given as Training) for the 8 whole-rock parameters modeled. The training and validation data return high R 2 scores overall which are found to generally decrease when the model was applied to the testing datasets. The reduction in the overall R 2 scores was most pronounced in the TiO2, CaO, MgO and K2O parameters whose values low for most of the data and are considered as consisting of primarily background (refer to Figure 4).   In this case the RMSE is that reported on a per-drillhole basis so the spread of potential RMSE can be evaluated. While the R 2 scores (Figure 1) for TiO2, CaO, MgO and K2O were found to generally be smaller than the other four majors the small RMSE and known distribution still indicate a fairly successful modeling. An examination of the training/validation data R 2 scores in Table 3 and Figure 2 to that of Set2-Set7 shows the R 2 is generally maintained in the testing datasets but does decrease, notably for TiO2 and K2O, compared to the R 2 for the training/validation data. This is not wholly unexpected as the spatial locations of the training/validation data are in some cases many tens of kilometres removed from the testing cases. Figure 3 shows the measured versus predicted values for the 6 validation data sets with the black line in each plot representing the 1-to-1 line. Specifically, the range of values for TiO2 and K2O are seen to be small as compared to the other parameters with the bulk of the TiO2 and K2O heavily clustered near the origin. The lack of defining range for these two parameters would seem a likely contributing factor to their decreased R 2 values. CaO, MgO and the LOI have a level of bimodal distribution (see Figure 3) and while they also are heavily distributed near the origin the bimodality most likely helps to extend the range and provide clearer paths for the decision trees within the RFR. The bimodal distribution of the LOI, which corresponds with the MgO and CaO distributions, aligns with the spectral examples given in Figure 1 where the spectrum from dataset 4 that corresponding to the greatest LOI value is a carbonate dominated spectrum like the MgO and CaO spectra. Figure 4 shows the estimated cumulative distribution function for the 6 validation datasets and the 8 geochemical parameters. This plot (estimated from kernel density estimators) allows a comparison of the predicted value distributions to the measured. A successful prediction for a given dataset should show the same distribution without major deviations. In general, the distributions follow each other for a given dataset and parameter indicating that the combined NMF-RFR model is working reasonably well. Figure 5 presents the distribution of RMSE for each of the predicted parameters over the 6 validation datasets. No distinction in this figure is made between the 6 datasets and the results are therefore global. In most cases the central RMSE is small compared to the overall range of values for a given parameter. However, and as noted previously, the clustering of values for TiO2 and K2O around the origin implies the RMSE for these parameters is relatively larger than their counterparts.

Downhole
Lastly, and shown in Figure 6 are the downhole predicted and measured values for Fe, SiO2, Al2O3 and LOI of 4 drillcore. The drillcore shown are not from any one dataset and were selected based on their length (randomly selected from all drillcore that had greater than 150 entries) to show the ability of the model. The other 4 geochemical parameters are not shown since the scale of the plots reduces those traces to lines just above zero. The measured values are shown on the left-hand side and the predicted values on the right. The y-axis on all measured-predicted pairs is the same for ease of comparison. The performance of the model is generally observed to be good and matches the measured values well. Some discrepancies can be found but overall, the geochemical parameters as predicted from the spectral input could most certainly be used to ascertain the downhole distribution and value of said parameters. This section may be divided by subheadings. It should provide a concise and precise description of the experimental results, their interpretation, as well as the experimental conclusions that can be drawn.

Discussion
The aim of this study was to ascertain if geochemical parameters could be predicted from spectral measurements. The spectral measurements in this study have a 1-to-1 correspondence with whole-rock geochemistry and provide a data rich avenue for investigating the feasibility.
The global findings demonstrated that a relatively small amount, compared to the total number of samples in the entire dataset, of spectral samples and matched geochemistry can be used to successfully train a combined NMF and RFR model to make accurate and quantitative predictions. The dataset used for this study are from a reasonably uniform and non-diverse geology and allowed accurate predictions to be made that extended far beyond the spatial confines of the training and test dataset locations. If the underlying geology was to markedly depart from that of the training and test data used to build the models, then it would require new models to be built that can incorporate such changes.
However, not all the whole-rock entries were modelled well or even able to be modelled. To successfully predict quantitative values for a given element requires that the element in question has a broad range. By this it is meant that if an element, such as P or S, are not well represented within the geology and samples then there is, as intuitively expected, nothing to model. In this study several geochemical elements had a small range of values and/or where distributed close to the origin. These represented background values of which larger values were not always present, as was the case with TiO2 and K2O, and for which the predictive power of the model with these elements was limited. Other elements such as CaO and MgO demonstrated bimodal distributions whose range, even though the bulk of the values are scattered accumulated about the origin, allowed a reasonable quantitative prediction to be made. In most cases the RMSE on those elements which were range restricted were still small and can be used confirm that the element is of background quantities. This aspect was reiterated by producing per-drillcore R 2 scores and RMSE values. Those elements that had an extensive range, such as Fe and SiO2, produced per-hole high R 2 scores and small RMSE values. On the contrary, and specifically for those elements that were range constrained, poor R 2 scores (while still producing low RMSE in most cases) result. However, when drillcore are encountered with an extended range the individual R 2 increases and the RMSE also slightly increase.
While this may seem like an obvious result it is worth noting and bearing in mind that if one trains and validates a model and applies it, for example to a single drillcore, the returned result, if it was compared later to measured values for that same single drillcore, may appear to be poor. In other words, producing an R 2 score on the singular drillcore should not be used as an indicator of model performance.
Additionally, the use of a RFR model in this case has proven to be successful but it has limitations that may necessitate the retraining of the model at future dates. While the RFR is robust it does not extrapolate and hence cannot return predictions beyond the largest and smallest values used to train the model. Thus, if the initial data used to train the model is a subset of a greater range then the model would need retraining to account for the extended range. Indicators this may be needed are results being returned that are consistently at the extent of training data's range.
In this study the spectral data cover a comprehensive wavelength range that might not be considered typical. However, the principle applied should be viable for reduced spectral ranges such as those encountered by the HyLogger only or by FTIR only. It is expected that a reduced range, and hence a lack of absorptions features that are representative of various elements, may lead to a reduction in the overall accuracy of the model depending on the element sought and the spectral range considered. For example, attempting to quantify SiO2 from the VNIR/SWIR spectral range may prove to be extremely difficult due to the lack of absorption features associated with SiO2 in that spectral range. Future work will test this hypothesis by constraining the data to reduced ranges to evaluate the impact on the regressions.

Conclusions
This work represents a new and novel approach to the prediction of whole rock geochemistry from spectral measurements. Previous works have used NMF as a method of spectral unmixing but have not utilized the extra step to use those calculated spectral as inputs to a machine learning process, in this case an RFR, to predict ancillary information like whole rock geochemistry.
In summary a viable method of reliably predicting several whole rock geochemistry parameters from spectral measurements of pulps has been defined and validated against a much larger spatially distributed dataset. Of the 8 parameters modelled, 4 show exceptional promise and have validation R 2 scores greater than 0.8 and RMSE in the low single digit range. Of the other 4 parameters the R 2 were less but the RMSE scores possibly still acceptable. The proposed method could be used to return a quick turnaround of potential downhole distributions and might be used to better spend an analysis budget. Namely, by highlighting spatial regions prior to laboratory based whole rock analysis more focus, through the laboratory analysis, can be made of those areas deemed to be of economic importance. Conversely areas identified by the proposed method of having no economic importance might be subject to laboratory analysis at a reduced sampling space. The use of the NMF model to reduce the dimensionality of the spectral measurements was also shown to be effective and provided reductions in dimensionality, and hence the number of input features to the RFR, by a factor of approximately 59. This has the effect of reducing the computational burden and reducing the overall size of the models. Additionally, the use of preexisting software libraries means that such workflows are accessible to everyone and easily implemented. Data Availability Statement: In this section, please provide details regarding where data supporting reported results can be found, including links to publicly archived datasets analyzed or generated during the study. Please refer to suggested Data Availability Statements in section "MDPI Research Data Policies" at https://www.mdpi.com/ethics. You might choose to exclude this statement if the study did not report any data.

Conflicts of Interest:
The authors declare no conflict of interest.