Machine Learning Applied to K-Bentonite Geochemistry for Identification and Correlation: The Ordovician Hagan K-Bentonite Complex Case Study

: Altered tephras (K-bentonites) are of great importance for calibration of the geologic time scale, for local, regional, and global correlations, and paleoenvironmental reconstructions. Thus, de-finitive identification of individual tephras is critical. Single crystal geochemistry has been used to differentiate tephra layers, and apatite is one of the phenocrysts commonly occurring in tephras that has been widely used. Here, we use existing and newly acquired analytical datasets (electron probe micro-analyzer [EPMA] data and laser ablation ICP-MS [LA-ICP-MS] data, respectively) of apatite in several Ordovician K-bentonites that were collected from localities about 1200 km apart (Minne-sota/Iowa/Wisconsin and Alabama, United States) to test the use of machine-learning (ML) techniques to identify with confidence individual tephra layers. Our results show that the decision tree based on EPMA data uses the elemental concentration patterns of Mg, Mn, and Cl, consistent with previous studies that emphasizes the utility of these elements for distinguishing Ordovician K-ben-tonites. Differences in the experimental setups of the analyses, however, can lead to offsets in absolute elemental concentrations that can have a significant impact on the correct identification and correlation of individual K-bentonite beds. The ML model using LA-ICP-MS data was able to identify several K-bentonites in the southern Appalachians and establish links to K-bentonites samples from the Upper Mississippi Valley. Furthermore, the ML model identified individual layers of mul-tiphase eruptions, thus illustrating very well the great potential of applying ML techniques to tephrochronology.


Introduction
The geochemistry of igneous apatite is complex, and in response to various magmatic conditions during crystallization (e.g., oxygen fugacity, temperature, crystallization rate, chemical composition), its trace-element composition can acquire a characteristic geochemical fingerprint [1,2]. Because of this, the geochemistry of apatite is a powerful tool for testing tectonostratigraphic hypotheses and for the high-resolution correlation of pyroclastics produced by large volcanic eruptions, as demonstrated by its successful use in many investigations of tephras ranging in age from the Paleozoic to the recent Cenozoic [3][4][5][6].
Explosive volcanism during the later Ordovician (470-445 Ma; Figure 1; Figure 2) resulted in many airfall deposits of volcanic ash being spread across much of what is now eastern North America [7][8][9], northern Eurasia [10,11], and western South America [12]. It has long been recognized that the geochemistry of these Ordovician K-bentonites can be used for high-resolution stratigraphy [4,13]. The tephrochronological approach has led to development of a high-resolution framework useful for identifying and correlating stratigraphic packages that are otherwise difficult to identify. For instance, much stratigraphic correlation in the northern part of Laurentia (Figure 1) is based on K-bentonite tephrostratigraphy [4,5,14,15]. Most importantly, a high-resolution stratigraphic framework based on apatite phenocryst geochemistry facilitates testing of stratigraphic and geological hypotheses, e.g., apatite geochemistry provided evidence that the stratigraphic changes across the Sandbian-Katian boundary are time-transgressive across part of the Appalachian basin and therefore are most likely not global events [5].
Those tephras, several of which were generated by arguably the largest explosive volcanic eruptions in the Phanerozoic, have been altered to K-bentonites, including three of the K-bentonite beds in the Hagan K-bentonite complex [16] of eastern North America (Deicke, Millbrig, and Elkport) that are the subject of our investigations reported herein. In recent years, machine-learning techniques (hereinafter ML) have been applied to volcanology [17,18]. Here, we test whether ML techniques can be applied to these deposits to enhance our ability to identify and correlate these eruptive events that could be used to track environmental and tectonic changes through time and space in order to test tectonostratigraphic hypotheses [5,[19][20][21].

Machine Learning Environment and Parameters
We used the open-access machine-learning software WEKA [23,24] for these analyses. WEKA is widely used and has appropriate analytical tools for machine-learning applications for decision tree generation based on geochemical datasets [25]. Datasets were imported and, depending on the geochemical dataset, processed as follows. Datasets were processed similarly, as outlined below.
We built a model based on the Emerson et al. [15] dataset, which we call the "UMV (Upper Mississippi Valley)-Model" For developing this "UMV-Model", we used the J48 algorithm for generating our decision trees. The J48 algorithm is the java software implementation of the C4.5 classifier [26] in WEKA, and it is an algorithm that aims at choosing the particular attribute of the data which most effectively splits its set of samples into subsets at each node of the decision tree. The resulting decision tree was pruned by forcing 15 instanced into a leaf, thereby not overfitting the results. The dataset of Emerson et al. [15] is already published, and can be accessed through their manuscript location.
We also developed the "Big-Ridge-Model", which is based on laser ablation-inductively coupled plasma-mass spectrometry (hereinafter LA-ICP-MS) data of newly and previously reported apatite phenocrysts. We also used the J48 tree-building algorithm, and, as with the EPMA data, the resulting decision tree was pruned by forcing 15 instanced into a leaf, thereby not overfitting the results (Figure 3). The dataset is available in Appendix A. Table 1 presents the location information for the Ordovician K-bentonite samples in this study that were investigated using LA-ICP-MS of apatite phenocrysts and then used for the training and testing datasets for the laser ablation data. At Big Ridge (Figure 2), the Deicke K-bentonite was divided into seven visually distinct intervals (D1-D7 from base to top) and the Millbrig K-bentonite was divided into four evenly spaced and visually distinct intervals (M1-M4 from base to top). Table 1. Location information for each sample, the name of each K-bentonite that was collected at each section, and the key references for each sample. "?" refers to K-bentonites that have been tentatively identified, but not yet geochemically fingerprinted.

Training Set for Geochemical Data Based on Electron Probe Micro-Analyzer (EPMA)
We used the dataset of Emerson et al. [15] for the training set for testing the correlation of K-bentonites using existing EPMA data, and we used a subset of that dataset for the LA-ICP-MS test. That study focused on the geochemistry of apatite phenocrysts of four prominent Katian-Sandbian age K-bentonites recognized within the Decorah Formation in the upper Mississippi Valley states of Minnesota, Iowa, and Wisconsin ( Figure  1). These K-bentonites are the Deicke, Millbrig, Elkport, and Dickeyville K-bentonites. Emerson et al. [15] reported results based on apatite phenocrysts that had been analyzed for 15 elements, and we are using only the datapoints that included Mg and Mn values, as these two elements had been identified before as being especially informative and important when using apatite geochemistry for stratigraphic purposes.

Cross Calibration Datasets for Geochemical Data Based on Electron Probe Micro-Analyzer (EPMA)
We evaluated the training model that was derived from the Emerson et al. [15] dataset with the dataset of Carey et al. [14]. Both studies analyzed distinct K-bentonite layers, some of which had previously been independently identified as the Deicke K-bentonite, or the Millbrig K-bentonite, through detailed correlation using wire-line logs, stratigraphic position, and (or) mineral content and geochemistry [7,30]. Thus, the Deicke and Millbrig samples from those studies are suitable for use as a test set, because their identification is confidently known. We used all analyzed elements and replaced missing values that were included in the Emerson et al. [15] dataset, but not in Carey et al. [14], as missing data values.
For data from the Carey et al. [14] study, we first focused on the Deicke and Millbrig K-bentonites to assess the accuracy of the UMV-model for identifying those two major Kbentonites. The dataset of Carey et al. [14] is also already published, and can be accessed through their manuscript location.

LA-ICP-MS Analysis of Apatite Phenocrysts to Establish the Training and Testing Datasets
We used the LA-ICP-MS data set from Big Ridge (Figure 1) for the training set (Big-Ridge-Model). These K-bentonites include the two thickest and most prominent ones that have previously been identified as the Deicke and Millbrig K-bentonite beds [30], as well as three previously unnamed K-bentonites (Kb-A, Kb-B, and Kb-C. The LA-ICP-MS analyses were performed using a Thermo iCap Qc ICP-MS that was connected to a Cetac G2-213 Nd:YAG laser system in the Department of Geology and Geophysics at Louisiana State University. Isotopes of individual elements analyzed included 24 Mg, 43 Ca, 55 Mn, 57 Fe, 89 Y, 232 Th, 238 U and the rare earth elements La-Lu. Elemental concentrations were normalized against the NIST612 glass standard, using 43 Ca as an internal standard using Iolite 2.5 [31]. During preprocessing, we eliminated data points with inclusions or other problems during the laser ablation session. Durango and Madagascar (MAD) apatites were used as secondary standards. Additional sample processing and analytical details were reported by Herrmann et al. [27]. The results of the LA-ICP-MS analyses used for the training set are tabulated in online Appendix A. The results of the LA-ICP-MS analyses for the testing set are tabulated in online Appendix B. The laser parameters for the individual runs are in Appendix C. The model was derived using the following 15 attributes based on elemental concentrations and different ratios: Mg, Mn, Y, U, REEsum, (La/Sm)N, (La/Yb)N, Pr/Pr*, Ce/Ce*, La/La*, Eu/Eu*, Mn/Eu*, Mg/Mn, Ce*/Y, and (Gd+Sm)/Mn. Rare earth elements were normalized to average chondrite values [32].  The UMV-model has distinct differences in identifying the Deicke and Millbrig Kbentonite correctly (Table 2; Figure 3). We treated the data from the known Deicke and Millbrig K-bentonite locations of Carey et al. [14] as unknowns to test the UMV-model. The Deicke K-bentonite was much more reliably identified than the Millbrig K-bentonite. In particular, cut-off values of Cl concentrations and Mg/Mn ratios led to misidentification on the fringes of individual data clusters ( Figure 4). There is no available LA-ICP-MS dataset that can be used for cross-validation similar to the cross-validation of the UMV-model. Although Sell et al. [5] analyzed apatite phenocrysts from K-bentonites using LA-ICP-MS, the available dataset that is part of the supplementary online material accompanying that 2015 paper unfortunately lacks data for critical elements (e.g., Mg) that were reported by the authors to have been obtained from analyses of several specific K-bentonite samples that are of interest herein, thus preventing us from making a meaningful comparison. Therefore, we analyzed a subset of the apatite phenocrysts from well-known K-bentonites in the UMV (Figure 1) that are equivalent to the samples published by Emerson et al. [15]. The results are shown in Figure 5 and 6, and these new data are available in online Appendix B.

Figure 5.
Big-Ridge-Model decision tree of five Ordovician K-bentonites exposed in outcrop at Big Ridge using the J48 classifier. Shown in the inset are the percentages for correctly and incorrectly classified instances based on different decision modes. RZero: essentially guessing the most abundant class for every instance; OneR: classifying each instance based on only one variable.
The Deicke K-bentonite samples from Minnesota (St. Paul and Rochester locations), and Iowa (Decorah location) are identified by the model. Only the sample from Decorah includes predictions that are not originating from one of the Deicke horizons at Big Ridge. The majority of instances in the samples from St. Paul and Rochester appear to come from horizon D5. At Decorah, the majority of instances are consistent with horizon D7 at Big Ridge. The minority of samples have an affinity with the lower horizons of the Deicke at Big Ridge (D1-D4).
The Millbrig K-bentonite samples from the St. Paul and Rochester sections are identified by the model. The majority of instances in those two samples appear to come from horizons M3 and M4. The minority of samples have an affinity with the lower horizons of the Millbrig at Big Ridge (M1-M2). Only one misidentified sample occurs in the Rochester sample.
The Elkport K-bentonite sample from the Dickeyville, WI section has a majority of Kb-B affinity, however it also has several instances from all other K-bentonites at Big Ridge.

Identification of Unknown K-Bentonites in the Southern Appalachians Using the Big-Ridge-Model
The two K-bentonite samples from the Tidwell Hollow location have clear differences. Sample "TH 'Millbrig'" has a majority of Millbrig horizons identified (M2-M4), while Sample "TH 'Deicke'" has a majority of Deicke horizons identified (D1-D7). Besides Deicke horizons, there were no misidentified horizons in that sample, but the TH 'Millbrig' did have a minor contribution of samples that were identified as Deicke samples.
The "Fort Payne 'Deicke'" K-bentonite samples include only predictions that are originating from different Deicke horizons at Big Ridge. The majority of instances come from horizon D5 and D1, with a minor contribution from D4.
The "Bruton Gap" K-bentonite samples include a mix of predictions that are originating from different Deicke horizons (D1 and D4) and one Millbrig horizon (M4) at Big Ridge. The majority of instances come from the Millbrig horizon.

Evaluation of the ML Models Based on Geochemical Data of Apatite Phenocrysts
The main elements that the UMV-model chooses are based on elemental concentration patterns of Mg, Mn, and Cl. This is consistent with previous suggestions that these three elements are important for distinguishing Ordovician K-bentonites [14]. Nevertheless, the different models lead to misidentifications for the following reasons.
First, different experimental setups for the analysis can lead to offsets in absolute values of the different elements. This has been shown to be an issue for Ordovician Kbentonites before [5,27]. In particular, problems of accurately quantifying Cl and F concentrations in geological samples, including apatites, using EPMA has been well documented [33,34]. Thus, offsets of data points and data point clusters caused by analytical differences can have a significant impact on the correct identification and correlation of individual K-bentonite beds.
Second, the heterogeneity of individual K-bentonites that consist of multiple tephra layers produced by different eruptions, or by different phases of the same eruption, can create problems for the correct identification of a given K-bentonite bed. The algorithm works best if discrete clusters of unique elemental concentration pairs can be found. The J48 classifier used in WEKA, which is based on the C4.5 classifier [26], is an algorithm that aims at choosing the particular attribute of the data which most effectively splits its set of samples into subsets at each node of the decision tree. At least two of the Ordovician Kbentonites, however, have been shown to be the result of multiple eruption phases that led to distinct changes in the geochemistry of the phenocrysts of each eruptive phase. For example, on the basis of textural changes in a key exposure in Tennessee, Haynes [30] hypothesized that the Millbrig K-bentonite might consist of four individual ash falls, each with distinct lithological characteristics. This interpretation is consistent with the identification of temporal geochemical changes during the deposition of each ash layer as identified by unique clusters (Figure 3; see refs. [5,14]). More recently, Herrmann et al. [27] showed that the Deicke K-bentonite also has multiple layers, and they suggest that this finding can most likely be attributed to the amalgamation of several discrete tephra layers into one stratigraphic horizon as seen today. The type localities and type sections of the Deicke and Millbrig K-bentonites are in the Upper Mississippi Valley (UMV) and are relatively far removed from the Ordovician volcanic centers that produced these ash falls. Additionally, because the K-bentonites are much thinner in the UMV (maximum of several cm thick) than they are exposures that are closer to the volcanic centers, like the Big Ridge section ( Figure 2) in Alabama (up to 1.2 m thick; [27,29]_, it is conceivable that not all ash fall layers will have all known clusters represented in the more distal locations. Third, the depositional environments along the active margin of Eastern Laurentia during the eruption of the volcano(es) responsible for producing the tephras that comprise the Hagan K-bentonite complex were highly dynamic. Locally and regionally, these include carbonate platform environments, mixed siliciclastic and carbonate environments, and strictly siliciclastic environments, all of which having been interpreted as being in relatively shallow water setting with currents that probably led to the resuspension, erosion and redeposition of the tephras that could have led to mixing of individual tephra components in what are now the K-bentonite beds of this region [21,29,30,35].
Lastly, closer to the eruption center, it would be expected that the largest phenocrysts will have been deposited. In the case of the Deicke K-bentonite, Herrmann et al. [27] also showed that significant zonation is present within apatite phenocrysts. Thus, magma evolution during the eruption can lead to geochemical zonation within individual apatite grains that at the small spot size of EPMA can lead to noise within the dataset. The generally larger spot size of LA-ICP-MS analyses tend to average these values.

Identification and Correlation of "K-Bentonite B" at Big Ridge
The ML model using LA-ICP-MS data was able to establish a link between a previously unnamed K-bentonite at Big Ridge (field notation K-b B; Figure 1) with the Elkport K-bentonite of the Upper Mississippi Valley. The stratigraphic location of K-b B above the Millbrig is consistent with the relative stratigraphic positions of the Millbrig and Elkport K-bentonites in the Upper Mississippi Valley. Thus, we tentatively assign K-b B as the Elkport K-bentonite, with high confidence. This allows a high-resolution correlation of K-bentonites in Alabama to be expanded from only the Deicke and Millbrig stratigraphic interval, to the younger Elkport level. This is an important finding, one that is critical to establishing an ever-more precise correlation framework for evaluating environmental changes across this interval [20,21] The ML model using LA-ICP-MS data was able to confirm the previous interpretation that the Deicke and Millbrig K-bentonites are present at Tidwell Hollow [29] and that the Deicke is present at Fort Payne [20,29,30]. Confirming the presence of these K-bentonites at Tidwell Hollow is not only important for constraining the timing of environmental changes across this interval [20,21], but also for interpreting the paleogeographic and environmental setting of this area during the later Ordovician. The observation that, despite the small dataset, essentially all Deicke and Millbrig horizons from Big Ridge can be identified at these locations, attests to the closeness to the volcanic center such that effectively no filtering of the tephra through depositional processes occurred.

Identification and Correlation of the Millbrig K-Bentonites at Bruton Gap
The ML model tentatively assigns the unknown K-bentonite from Bruton Gap, Alabama to the Millbrig. This unknown K-bentonite is one of five K-bentonites identified in the Bruton Gap section by Haynes [30], in particular this bed is a relatively thin K-bentonite bed that was collected upsection from the thicker K-bentonite tentatively identified as the Deicke, and downsection from the thicker biotite-rich K-bentonite tentatively identified as the Millbrig. With identification of this thinner K-bentonite as having Millbrig affinity, it is quite possible that the original stratigraphic "picks" of Haynes [30] for the Bruton Gap section should be re-evaluated and compared not with the Big Ridge section to the northeast, but with the Red Mountain Expressway section to the southwest, where lenses and discontinuous beds of limestone are intercalated with the Millbrig K-bentonite, as now seems to be the better explanation for the K-bentonite stratigraphy at Bruton Gap. Additional work is planned for further investigation of the Bruton Gap section and its precise relationships with the other sections in the region.

Fine-Tuning our Understanding of the Eruptive Sequence of Ordovician K-Bentonites Generated by Large Caldera Building Eruptions
At the Big Ridge section ( Figure 2) the Deicke and Millbrig K-bentonites are of relatively great thickness [35], and those exposures can be confidently assumed to represent proximal, coarse-grained archives for the eruption of tephras generated during caldera building events. The Deicke and Millbrig K-bentonites at Big Ridge both reach 1-1.2 m in thickness, and both beds can be subdivided into individual layers based on lithological and mineralogical observations ( Figure 2). In contrast, in the Upper Mississippi Valley, where the stratotypes for many of the prominent Ordovician K-bentonites are located [36], the altered tephras are commonly only a few cm thick at most. The majority of the Deicke and Millbrig apatite phenocrysts from UMV samples appear to have originated during the later stages of the formation of the tephra precursors that have now altered to K-bentonites. The samples from the Deicke in the UMV are dominated by horizons D5 and D7, whereas the Millbrig samples from the UMV are dominated by horizons M3 and M4.
Thus, it appears that the first eruption events were smaller in scale, and the caldera forming, massive events happened at the end of each eruption. This would be consistent with previous observations of K-bentonite samples from Virginia [14] that showed changes in the eruptive sequence. Here, based on the absence of apatite phenocrysts that would be the more distal Upper Mississippi Valley time equivalents with the early stages of volcanism as documented at the more proximal Big Ridge section, we can also document that the eruptive style of the of the explosive eruptions' later stages led to a wider dispersal of the volcanic ash. Therefore, we conclude that in samples from more distal locations relative to these volcanic centers, tephra from the early stages might be missing.

Conclusions
We have demonstrated how ML can be applied to a stratigraphic problem with measurable success. Using apatite phenocryst geochemical data from several Ordovician Kbentonite beds, including the well-known Deicke and Millbrig K-bentonites, which are layers of altered airfall tephra from explosive volcanic eruptions at ~454-455 Ma along the Laurentian margin, ML methodology separates and distinguishes the phenocrysts in a given K-bentonite from those in others, and our results show the value of ML in chronostratigraphy and tephrostratigraphy. Our results are also in accord with prior studies which suggested that both the Millbrig and the Deicke explosively eruptive events consisted of multiple eruptive stages. We establish with confidence that only the phenocrysts from the later stages of the eruption were deposited in the more distal regions that now comprise the many excellent exposures in the Upper Mississippi Valley, as compared with the more complete record of phenocrysts that are present in the more proximal sections of Alabama.
The Elkport K-bentonite is for the first time identified in the Big Ridge outcrop. This finding now allows for correlation of the Upper Mississippi Valley deposits into Alabama at this chronostratigraphically precise horizon. The presence of the Deicke and Millbrig K-bentonites at Tidwell Hollow, Alabama, is confirmed. An unknown K-bentonite at Bruton Gap, Alabama, is tentatively assigned to the Millbrig, but further work is planned to confirm its position within the Bruton Gap sequence and to refine its usefulness for regional correlation.
Future work is needed to establish elemental profiles that can best be used to characterize individual K-bentonite beds based on geochemical fingerprinting. This might include elements that are not frequently analyzed in apatites so far. Finally, we note that care must be taken if samples for the training set are collected from exposures that are relatively far away from the eruption center, as such samples from a more distant location might very well not be representative of the entire eruptive sequence.

Informed Consent Statement: Not applicable.
Data Availability Statement: The data for this paper is available in the Appendices A and B. Table A1. Results from the LA-ICP-MS analyses used as the training set. All elemental abundances are given in ppm.