Towards Consistent Interpretations of Coal Geochemistry Data on Whole-Coal versus Ash Bases through Machine Learning

Abstract: Coal geochemistry compositional data on a whole-coal basis can be converted back to an ash basis using the samples' loss on ignition. However, the correlation between the concentrations of elements reported on whole-coal versus ash bases is, in many cases, inconsistent. Traditional statistical methods (e.g., correlation analysis) applied to compositional data on both bases may therefore give misleading results. To address this issue, we propose an improved additive log-ratio data transformation method for analyzing the correlation between element concentrations reported on whole-coal versus ash bases. To test the validity of the proposed method, we use a data set containing comprehensive analyses of 106 Late Paleozoic coal samples from the Datanhao and Adaohai mines, Inner Mongolia, China. A prediction model based on the hierarchical clustering algorithm was built to evaluate the performance of the two methods. The results show that the improved additive log-ratio is more effective in predicting the modes of occurrence of elements in coal than the previously reported stability method, and can therefore be adopted for consistent interpretation of coal geochemistry compositional data on whole-coal vs. ash bases.


Introduction
The modes of occurrence of elements in coal are important because: (1) the release of toxic elements from coal is, in part, dependent on the hosts of these elements [1][2][3][4][5]; (2) they provide insights into the sources of mineral matter in coal, which result from different geological processes [1,6,7]; (3) the technologies designed for critical metals recovery from coal and coal ash largely depend on the modes of occurrence of these elements [8][9][10]; and (4) the modes of occurrence of an element can play an important role in determining the technological behavior of the element [1]. In addition to a number of physical and chemical analyses that have been used for determining the modes of occurrence in coal [11,12], some statistical methods have been commonly adopted to investigate the hosts of both major and trace elements in coal. Correlation analysis of element concentrations vs. ash yields is the simplest method and has been widely used in such studies [1]. Concentrations of elements in coal are usually reported on two bases: whole-coal and ash. The element concentrations on an ash basis can be converted to those on a whole-coal basis using the ash yield (or loss on ignition) through the formula $[E_i]_{\mathrm{coal}} = [E_i]_{\mathrm{ash}} \times \text{ash yield}$, or vice versa, where $E_i$, ash, and coal denote the element concentration, ash basis, and whole-coal basis, respectively.
An element concentration that correlates positively with ash yield indicates a dominant inorganic association. A negative correlation of element concentration with ash yield implies a possible organic association [13,14].
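The basis conversion described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names are hypothetical, the sample values are invented, and the ash yield is assumed to be given in percent (so it is divided by 100 before use).

```python
def ash_to_whole_coal(conc_ash, ash_yield_pct):
    """Convert an ash-basis concentration to whole-coal basis.

    Assumes ash_yield_pct is the ash yield in percent (0-100).
    """
    return conc_ash * ash_yield_pct / 100.0


def whole_coal_to_ash(conc_coal, ash_yield_pct):
    """Inverse conversion: whole-coal basis back to ash basis."""
    if ash_yield_pct <= 0:
        raise ValueError("ash yield must be positive")
    return conc_coal * 100.0 / ash_yield_pct


# Hypothetical sample: 200 ug/g Zr on ash basis, 25 % ash yield.
zr_coal = ash_to_whole_coal(200.0, 25.0)
zr_ash = whole_coal_to_ash(zr_coal, 25.0)
print(zr_coal, zr_ash)
```

Because every element in a sample is scaled by the same ash yield, ratios between elements within a sample are unchanged by the conversion, which is the property the log-ratio methods rely on.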
However, it has been found that, in many cases, the modes of occurrence of elements in coal inferred from correlations between element concentration and ash yield are not consistent between the two bases [13,15,16]. For example, the correlation coefficients between the element concentrations and ash yields in the Pennsylvanian coals in the Datanhao and Adaohai mines in Inner Mongolia, China, are not consistent between the two bases [13,15], and, consequently, different modes of occurrence of the elements are inferred. Geboy et al. [16] showed that this inconsistency is attributable to the nature of the elemental concentrations, i.e., compositional data. The sum of major and trace elements in coal is expected to be 100% on an ash basis; on a whole-coal basis, however, it is the sum of elemental concentrations (not including organic C, H, N, and S) plus loss on ignition (LOI, LOI% = 100% − ash yield%) that is expected to be 100%. In particular, the sum of the concentrations of all elements plus LOI is 100% (whole-coal basis) for the samples from the Datanhao coal mine, Daqingshan Coalfield, Inner Mongolia, northern China [13,15].
Most of the literature on the statistical analysis of compositional data builds on Aitchison [17]. Because compositional data lie in a non-Euclidean space, applying statistical analysis designed for Euclidean space to coal geochemistry data may lead to misleading conclusions. The difference in the correlations of elemental concentrations is attributed to sub-compositional incoherence. In order to fully understand the coal compositional data and then more accurately reveal the modes of occurrence of elements based on statistical analysis, coal compositional data first need to be transformed from non-Euclidean space to Euclidean space. As described below, a log-ratio transformation has been proposed to address the incoherence paradox of coal compositional data reported on different bases.

Compositional Data Transformation
In general, compositional data transformation can be performed in three ways: additive log-ratio transformation, centered log-ratio transformation, and isometric log-ratio transformation.

Additive Log-Ratio (alr) Transformation
A coal geochemistry composition $X$ of $D$ parts with positive components can be expressed as $X = (x_1, \ldots, x_D)$ with $x_1 + \ldots + x_D = 1$. The coal geochemistry compositional data can be mapped from the simplex space $S$ to Euclidean space $R$, so that an observation $x_i \in S$ is transformed into $y_i \in R$. For the coal geochemistry compositional data, the alr transformation [18] is defined as
$$\mathrm{alr}(x) = \left( \ln\frac{x_1}{x_j}, \ldots, \ln\frac{x_{j-1}}{x_j}, \ln\frac{x_{j+1}}{x_j}, \ldots, \ln\frac{x_D}{x_j} \right),$$
where $x_j$ is a subjectively chosen reference element in coal.
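As a rough illustration of the alr transformation, the following Python sketch takes the log-ratio of every part to a chosen reference part. The function name `alr`, the `ref_index` parameter, and the three-part composition are hypothetical and are not taken from the study's data.

```python
import math


def alr(composition, ref_index):
    """Additive log-ratio: ln(x_i / x_ref) for every part except the reference."""
    ref = composition[ref_index]
    return [
        math.log(x / ref)
        for i, x in enumerate(composition)
        if i != ref_index
    ]


# Hypothetical three-part composition; the last part is the reference.
parts = [0.2, 0.3, 0.5]
print(alr(parts, ref_index=2))

# Rescaling every part by the same factor leaves the result unchanged,
# because the factor cancels in each ratio.
print(alr([2.0, 3.0, 5.0], ref_index=2))
```

The result has one coordinate fewer than the composition, and it depends on which part is chosen as the reference, which is the subjectivity mentioned in the text.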

Centered Log-Ratio (clr) Transformation
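To remove the subjective choice of reference element in the additive log-ratio transformation, the clr [18] data transformation was proposed. For the coal geochemistry compositional data, the clr transformation is defined as
$$\mathrm{clr}(x) = \left( \ln\frac{x_1}{g(x)}, \ldots, \ln\frac{x_D}{g(x)} \right), \qquad g(x) = \left( \prod_{i=1}^{D} x_i \right)^{1/D}.$$
In the clr transformation, every part of the coal geochemistry composition is log-transformed relative to the geometric mean $g(x)$, and thus the resulting observation is centered.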
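A minimal Python sketch of the clr transformation follows, assuming the standard definition in which each part is divided by the geometric mean of all parts. The function name and sample values are hypothetical.

```python
import math


def clr(composition):
    """Centered log-ratio: ln(x_i / g(x)), with g(x) the geometric mean."""
    log_parts = [math.log(x) for x in composition]
    mean_log = sum(log_parts) / len(log_parts)  # equals ln of the geometric mean
    return [lp - mean_log for lp in log_parts]


coords = clr([0.2, 0.3, 0.5])
print(coords)
print(sum(coords))  # the clr coordinates always sum to (numerically) zero
```

The zero-sum constraint means the clr coordinates are linearly dependent, which is the singularity that the ilr transformation is designed to avoid.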

Isometric Log-Ratio (ilr) Transformation
The isometric log-ratio (ilr) [19] coordinates aim at building an orthonormal basis in the hyperplane. In particular, ilr coordinates set up an orthonormal basis in the hyperplane formed by the clr coefficients, and ilr thereby avoids the singularity that occurs with clr coefficients. For the coal geochemistry compositional data, the ilr transformation is defined as
$$\mathrm{ilr}(x) = (z_1, \ldots, z_{D-1}), \qquad z_i = \sqrt{\frac{D-i}{D-i+1}} \, \ln\frac{x_i}{\sqrt[D-i]{\prod_{j=i+1}^{D} x_j}}, \quad i = 1, \ldots, D-1.$$
To obtain consistent correlations on both ash and whole-coal bases, Geboy et al. [16] proposed the notion of stability between two different elements as a bivariate measure based on the ilr transformation [19]. While they have shown that the stability between two elements is identical regardless of the reporting basis [16], the stability may seem less illuminating than the correlation coefficient.
In the next section, we propose an improved additive log-ratio data transformation to calculate the correlation between transformed element concentrations. In comparison with other methods, the improved additive log-ratio method is based on geochemical properties, e.g., the immobility of Zr and Al. Additionally, the method is well suited to predicting the modes of occurrence of elements in coal. The results show that the correlation is identical regardless of whole-coal or ash basis. Our purpose is to compare the prediction of the modes of occurrence between the improved additive log-ratio data transformation and the stability method [16], using the hierarchical clustering algorithm [20].
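The sketch below illustrates one common choice of ilr coordinates, in which each part is compared with the geometric mean of the parts that follow it. This particular orthonormal basis is an assumption made for illustration; other bases are equally valid, and the function name and sample values are hypothetical.

```python
import math


def ilr(composition):
    """Isometric log-ratio coordinates for one common orthonormal basis.

    Coordinate i compares part i with the geometric mean of the parts after it.
    """
    logs = [math.log(x) for x in composition]
    coords = []
    for i in range(len(logs) - 1):
        rest = logs[i + 1:]
        n_rest = len(rest)
        scale = math.sqrt(n_rest / (n_rest + 1))
        coords.append(scale * (logs[i] - sum(rest) / n_rest))
    return coords


print(ilr([0.2, 0.3, 0.5]))   # D parts give D - 1 coordinates
print(ilr([0.25, 0.75]))      # two parts: ln(x1 / x2) / sqrt(2)
```

For a two-part subcomposition the single coordinate reduces to $\ln(x_i/x_j)/\sqrt{2}$, which is the form a bivariate measure such as the stability can be built on.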

Improved Additive Log-Ratio and Correlation Analysis
The improved additive log-ratio transformation for coal element data is proposed based on additive log-ratio transformation.

Improved Additive Log-Ratio Transformation
For the coal geochemistry trace-element compositional data, the improved additive log-ratio (ialr) can be defined as
$$\mathrm{ialr}(x_i^0) = \ln\frac{x_i^0}{x_j}.$$
Here, $x_j$ is assigned to Zr because Zr is a stable element during peat accumulation and during diagenetic and epigenetic processes relative to other trace elements. The log-ratio between a trace element $x_i^0$ and Zr is therefore $\ln(x_i^0/\mathrm{Zr})$. For the coal geochemistry major-element compositional data, the improved additive log-ratio (ialr) can be defined as
$$\mathrm{ialr}(x_i^1) = \ln\frac{x_i^1}{x_j}.$$
Here, $x_j$ is assigned to Al$_2$O$_3$ because aluminum is a stable element during peat accumulation and during diagenetic and epigenetic processes relative to other major elements (such as Na, Mg, Si, K, Ca) [21][22][23][24][25][26][27][28][29][30], although Al is somewhat mobile under some very specific geological conditions [11,25].
The log-ratio between a major element $x_i^1$ and Al$_2$O$_3$ is therefore $\ln(x_i^1/\mathrm{Al_2O_3})$.
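A short derivation may help show why a log-ratio to a fixed reference element does not depend on the reporting basis. It assumes only the conversion $[E_i]_{\mathrm{coal}} = [E_i]_{\mathrm{ash}} \times A$, where $A$ (a symbol introduced here for illustration) is the ash yield of the sample expressed as a fraction:

```latex
\ln\frac{[E_i]_{\mathrm{coal}}}{[\mathrm{Zr}]_{\mathrm{coal}}}
  = \ln\frac{[E_i]_{\mathrm{ash}} \, A}{[\mathrm{Zr}]_{\mathrm{ash}} \, A}
  = \ln\frac{[E_i]_{\mathrm{ash}}}{[\mathrm{Zr}]_{\mathrm{ash}}}
```

The ash yield cancels within each sample, so the transformed value is the same on both bases. The same argument applies to the major elements with Al$_2$O$_3$ in place of Zr.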

Correlation Analysis of Different Transformation Methods
The correlation coefficients between the concentrations of elements and ash yields can differ considerably depending on whether the concentrations are reported on an ash or a whole-coal basis. All the coal geochemistry data (shown in Tables 1-8) can be transformed using data transformation methods, in particular the improved alr, clr [18] and ilr [19]. In the closely related literature, the common methods for transforming compositional data from non-Euclidean space to Euclidean space are alr, clr and ilr. Clr avoids the subjective choice of reference element required by alr, while ilr avoids the singularity that occurs with clr coefficients. The improved alr is the new data transformation method proposed in this paper. Once the coal geochemistry data have been transformed, the correlations between different element concentrations can be estimated. The correlation between different element concentrations based on our proposed improved alr is the same regardless of ash or whole-coal basis. Elemental concentrations on either an ash basis or a whole-coal basis have been reported by a number of authors [31][32][33][34][35][36], and either reporting basis is valid, because the concentrations of elements in coal ash can be converted back to the whole-coal basis using the equation $[E_i]_{\mathrm{coal}} = [E_i]_{\mathrm{ash}} \times \text{ash}\%$. The transformed data with the improved alr on whole-coal and ash bases can be described as $\mathrm{ialr}(X)_{wc} = \mathrm{ialr}(X)_{ash}$. The transformed coal element data with clr on whole-coal and ash bases can be described as $\mathrm{clr}(X)_{wc} = \mathrm{clr}(X)_{ash}$. The transformed coal element data with ilr on whole-coal and ash bases can be described as $\mathrm{ilr}(X)_{wc} = \mathrm{ilr}(X)_{ash}$.
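The basis-independence of the improved alr correlation can be checked numerically. The sketch below uses invented concentrations for five hypothetical samples (not the Datanhao or Adaohai data), takes Zr as the reference element, and assumes ash yields are given as fractions; all function and variable names are illustrative.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def ialr_column(element, reference):
    """Per-sample log-ratio of an element to the reference element."""
    return [math.log(e / r) for e, r in zip(element, reference)]


# Hypothetical ash-basis concentrations and ash yields (as fractions).
ash_yield = [0.10, 0.25, 0.40, 0.15, 0.30]
zr_ash = [210.0, 180.0, 250.0, 190.0, 230.0]
cd_ash = [0.8, 1.1, 0.6, 0.9, 1.3]
zn_ash = [60.0, 85.0, 40.0, 70.0, 95.0]


def to_coal(column):
    """Convert an ash-basis column to whole-coal basis, sample by sample."""
    return [c * a for c, a in zip(column, ash_yield)]


# Raw concentrations: the correlation changes with the reporting basis.
r_raw_ash = pearson(cd_ash, zn_ash)
r_raw_coal = pearson(to_coal(cd_ash), to_coal(zn_ash))

# Log-ratios to Zr: the correlation is the same on both bases.
r_ash = pearson(ialr_column(cd_ash, zr_ash), ialr_column(zn_ash, zr_ash))
r_coal = pearson(
    ialr_column(to_coal(cd_ash), to_coal(zr_ash)),
    ialr_column(to_coal(zn_ash), to_coal(zr_ash)),
)
print(r_raw_ash, r_raw_coal)
print(r_ash, r_coal)
```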

Correlation Replaced by Stability
In Geboy et al.'s method [16], all the coal geochemistry compositional data follow the ilr transformation method. The stability between two different elements, a bivariate measure, was then proposed as $\mathrm{stab}(x_i, x_j) = \exp\left(-\mathrm{var}\left(\mathrm{ilr}(x_i, x_j)\right)\right)$.
Geboy et al.'s experiments [16] showed that the stability between different element concentrations is identical, regardless of the reporting basis. The application of the proposed stability, similar to correlation, has solved the inconsistency problem of the coal geochemical data reported on different bases.
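A minimal sketch of the stability measure follows. It assumes that the ilr of a two-part subcomposition is $\ln(x_i/x_j)/\sqrt{2}$ and that the variance is the sample variance across samples; both are assumptions for illustration, as are the function name and the invented data.

```python
import math
from statistics import variance


def stability(x_i, x_j):
    """exp(-var(ilr(x_i, x_j))) across samples, using the two-part ilr balance."""
    balances = [math.log(a / b) / math.sqrt(2) for a, b in zip(x_i, x_j)]
    return math.exp(-variance(balances))


# Hypothetical ash-basis concentrations and ash yields (as fractions).
cd_ash = [0.8, 1.1, 0.6]
zn_ash = [60.0, 85.0, 40.0]
ash_yield = [0.10, 0.25, 0.40]

cd_coal = [c * a for c, a in zip(cd_ash, ash_yield)]
zn_coal = [z * a for z, a in zip(zn_ash, ash_yield)]

# The ash yield cancels in each ratio, so both bases give the same stability.
print(stability(cd_ash, zn_ash), stability(cd_coal, zn_coal))
```

Under these assumptions the stability lies in (0, 1] and equals 1 when the two elements are exactly proportional across samples.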
Tables 1-8 show all the correlations and stability [37] between element concentrations for the Adaohai mine and the Datanhao mine, using different data transformation methods on ash and whole-coal bases. The correlation obtained with our improved alr is consistent regardless of the reporting basis. The stability between element concentrations proposed by Geboy et al. [16], which is similar to correlation, is likewise consistent regardless of the reporting basis.

Prediction for Occurrence Mode of Element in Coal Based on Hierarchy Clustering
To verify the performance of the transformation methods for coal geochemistry compositional data, a prediction model for the modes of occurrence of elements in coal was built based on the hierarchical clustering algorithm. Each element $x_i$ of the coal geochemistry data was selected as a feature for clustering analysis.

Hierarchical Clustering
Hierarchical clustering is a basic unsupervised machine learning method with broad applications in different fields [20]. It divides unlabeled data into different clusters according to their similarity. Data points are clustered together if they are similar in nature, and those that are dissimilar are assigned to different clusters.
Here, the hierarchical clustering algorithm [20] is used to measure the similarity between different groups of data features. As the name suggests, it produces hierarchical representations in which the clusters at each level of the hierarchy are created by merging clusters at the next lower level. At the lowest level, each cluster contains a single feature. At the highest level, there is only one cluster containing all the features.
Existing strategies for hierarchical clustering can be divided into two basic paradigms: agglomerative and divisive [20]. In this work, the agglomerative strategy is used. The agglomerative strategy starts from the bottom, and at each level it recursively merges a selected pair of clusters into a single cluster. This produces a grouping at the next higher level with one less cluster, and the pair chosen for merging consists of the two clusters with the smallest inter-group dissimilarity. The entire hierarchy represents the ordered sequence of clusters containing all the coal elements.

Similarities Analysis between the Elements in Coal
Two methods are commonly used to measure data similarity: one is based on distance and the other on the correlation coefficient. Distance-based similarity states that two data points with a small distance should have a large similarity, whereas correlation-based similarity states that two data points with a large correlation should have a large similarity [38]. Let a weighted graph $G(V, E, \omega)$ be a similarity graph [20]. The node $V(x_i)$ of the graph $G$ represents an element. The edge $E(x_i, x_j)$ represents the relationship between two different elements, and $\omega$ represents the similarity of the different elements. The similarity between two elements in coal is denoted $d_k(x_i, x_j)$; for the log-ratio transformations it is based on the correlation between the transformed concentrations. Additionally, the stability given by Geboy et al. [16] is used as the similarity between elements in coal, i.e., $d_k(x_i, x_j) = \mathrm{stab}(x_i, x_j) = \exp\left(-\mathrm{var}\left(\mathrm{ilr}(x_i, x_j)\right)\right)$. Furthermore, the similarity graph $G$ for the coal geochemistry compositional data can be expressed as $G(V, E, \omega) = G\left(V(x_i), E(x_i, x_j), d_k(x_i, x_j)\right)$.
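One simple way to represent such a similarity graph is a dictionary keyed by element pairs. The sketch below is illustrative only: the function names are hypothetical, the Pearson correlation is used as the edge weight as an assumed example of a correlation-based similarity, and the transformed values are invented.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def build_similarity_graph(columns, similarity):
    """Return {(element_i, element_j): weight} for every unordered pair."""
    return {
        (a, b): similarity(columns[a], columns[b])
        for a, b in combinations(sorted(columns), 2)
    }


# Hypothetical transformed values for three elements across four samples.
columns = {
    "Cd": [1.0, 2.0, 3.0, 4.0],
    "Zn": [2.0, 4.0, 6.0, 8.0],
    "Zr": [4.0, 3.0, 2.0, 1.0],
}
graph = build_similarity_graph(columns, pearson)
for pair, weight in graph.items():
    print(pair, round(weight, 3))
```

Any other pairwise measure with the same call signature, such as a stability function, could be passed in place of `pearson`.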

Agglomerative Clustering Algorithm for Prediction
While agglomerative clustering is a mainstream clustering method that can produce an informative hierarchical structure of clusters, its results depend strongly on the data similarity. Agglomerative clustering starts with every feature representing a single cluster. At each of the $N-1$ steps, the closest two clusters are merged into a single cluster, producing one less cluster at the next higher level. Following [20], the measure of similarity between clusters of elements in coal is defined as $d_k(x_{ik}, x_{jk})$.
Each level of the hierarchy represents a particular grouping of the element features into disjoint clusters. Recursive binary agglomeration can be represented by a binary tree. The nodes of the tree represent all the elements in coal. The $N$ terminal nodes represent $N$ individual elements. Each nonterminal node has two child nodes. Agglomerative clustering merges the child nodes representing two different clusters to form a parent node. The binary tree is plotted so that the height of each node is proportional to the value of the intergroup similarity between the children. The terminal nodes representing individual element features are all plotted at zero height. This type of graphical display is called a dendrogram [20].
Let $I$ and $J$ represent two clusters. The similarity between $I$ and $J$, $d_k(x_{ik} \in I, x_{jk} \in J)$, is computed from the set of pairwise feature similarities in which one feature $i$ of the pair is in cluster $I$ and the other feature $j$ is in cluster $J$. Besides average linkage, common clustering approaches include single linkage, complete linkage, centroid and Ward. Centroid and Ward clustering are usually used with distance-based similarity, not correlation-based similarity. In experiments describing the modes of occurrence of elements in coal in the real coal element dataset of the Datanhao and Adaohai mines [13,15], average linkage agglomerative clustering performed better overall than the other two applicable approaches (single and complete linkage). Therefore, the average linkage agglomerative clustering algorithm is used for prediction of the modes of occurrence of elements in coal in this paper. The average linkage agglomerative clustering algorithm uses the average similarity between the clusters, as shown in Table 9.
It is expressed as
$$d_{AL}(I, J) = \frac{1}{N_I N_J} \sum_{i \in I} \sum_{j \in J} d_k(x_i, x_j),$$
where $N_I$ and $N_J$ are the respective numbers of features in each group. The average linkage agglomerative clustering algorithm for prediction of the modes of occurrence of elements in coal is executed on the real coal element dataset of the Datanhao and Adaohai mines [13,15].
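The average-linkage procedure described above can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used in the study: it assumes the pairwise values are similarities, so at each step it merges the pair of clusters with the largest average similarity (equivalent to the smallest average distance if distance is taken as one minus similarity). The function names and the similarity values are hypothetical.

```python
def average_similarity(cluster_a, cluster_b, sim):
    """Mean pairwise similarity between two clusters (average linkage)."""
    total = sum(sim[frozenset((p, q))] for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(elements, sim):
    """Merge clusters step by step; return a list of (merged_cluster, similarity)."""
    clusters = [(e,) for e in elements]  # start with one cluster per element
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters) - 1):
            for j in range(i + 1, len(clusters)):
                s = average_similarity(clusters[i], clusters[j], sim)
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merges.append((merged, s))
    return merges


# Hypothetical pairwise similarities between four elements.
sim = {
    frozenset(("Cd", "Zn")): 0.90,
    frozenset(("La", "Ce")): 0.95,
    frozenset(("Cd", "La")): 0.10,
    frozenset(("Cd", "Ce")): 0.20,
    frozenset(("Zn", "La")): 0.15,
    frozenset(("Zn", "Ce")): 0.05,
}
merges = agglomerate(["Cd", "Zn", "La", "Ce"], sim)
for cluster, s in merges:
    print(cluster, round(s, 3))
```

The list of merges, read in order, contains the information needed to draw a dendrogram: which clusters were joined and at what similarity level.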

Input: The similarity graph $G(V, E, \omega) = G\left(V(x_i), E(x_i, x_j), d_k(x_i, x_j)\right)$ and the number of elements $n$.

Results
To demonstrate the consistent interpretation of whole-coal and ash bases of element concentrations, we used the data on element concentrations in the coal and ash samples from the Datanhao and Adaohai mines (Inner Mongolia, China). The coal geochemistry data were transformed using different transformation methods, in particular the improved alr, Geboy et al.'s method [16], clr and ilr, and the modes of occurrence of elements were then deduced. All the hierarchical clustering results from the different transformation methods are shown in Figures 1 and 2.
As with other elements, concentrations of rare earth elements and yttrium (REY) could be reported either on ash basis or on whole-coal basis [31][32][33][34][35][36], and the ash basis is particularly suitable for evaluating the REY recovery potential of coal combustion products [8,9]. On either basis, the REY would be expected to cluster together if a method is effective. On the basis of the geochemical nature of the trace elements and of the investigations by Zhao et al. [15] and Dai et al. [13] using direct analysis (such as SEM-EDS, XRD), the two elements in each of the following pairs are geochemically similar: Sr versus Ba, Sn versus Hg, Cd versus Zn, and Nb versus Ta. Additionally, the major elements Ca, Mg, Mn and Fe would be expected to be clustered together. Furthermore, Al and Si are both largely associated with silicate minerals and should be expected to be clustered.
The geochemistry data of the coals from the Datanhao mine were used for performance evaluation of all the methods, and the results show that our improved alr is better at predicting the modes of occurrence of the elements, as shown in Figure 1. The improved alr captured the similarity of Cd and Zn better than Geboy et al.'s method [16]. Note that Cd and Zn were not clustered together by Geboy et al.'s [16] method; however, as indicated by Zhao et al. [15] using direct analysis, Cd and Zn were both associated with sulfide minerals in these coals. On the other hand, all the rare earth elements were clustered together by both methods, as shown in Figure 1. In addition, the similarities of the trace elements Sr and Ba, and Nb and Ta, were the same for the two methods (cf. Figure 1). Finally, the predictions by the two methods on the similarities of the major elements Ca, Mg, Mn and Fe were also the same according to Figure 1.
The coal geochemistry data of the Adaohai mine were also used for performance evaluation of all the methods, and the results show that our improved alr was more appropriate for predicting the modes of occurrence of elements than the method proposed by Geboy et al. [16]. As shown in Figure 2, the improved alr captured the similarity of Cd and Zn better than Geboy et al.'s [16] method. The same holds for Ba and Sr (that is, the improved alr is better than Geboy et al.'s method [16]). In representing the similarities among all the rare earth elements, the two methods gave the same results. Meanwhile, the predicted similarities of the major elements Ca, Mg, Mn and Fe are the same for the improved alr and Geboy et al.'s [16] method. Furthermore, the similarity of the trace elements Nb and Ta was also the same for the two methods, as shown in Figure 2.
For a comprehensive performance evaluation, the coal geochemistry data of the Datanhao mine were used for comparisons, and the results show that the improved alr predicts the modes of occurrence of elements better than clr and ilr do, as shown in Figure 1. The improved alr better captured the similarity of Cd and Zn, of the rare earth elements, of the trace elements Sr and Ba, and Nb and Ta, and of the major elements Ca, Mg, Mn and Fe. In contrast, Cd and Zn were not clustered together by the clr method; additionally, Cd and Zn, the rare earth elements, the trace elements Nb and Ta, and the major elements Ca, Mg, Mn and Fe were not clustered together by the ilr method.
Likewise, the coal geochemistry data of the Adaohai mine were used for comparisons, and the results show that our improved alr predicts the modes of occurrence of elements better than clr and ilr do, as shown in Figure 2. The improved alr better captured the similarity of Cd and Zn, of the trace elements Sr and Ba, and Nb and Ta, and of the major elements Ca, Mg, Mn and Fe. In contrast, the trace elements Sr and Ba were not clustered together by the clr method; additionally, Cd and Zn, the trace elements Sr and Ba, and Nb and Ta, and the major elements Ca, Mg, Mn and Fe were not clustered together by the ilr method.
While more meaningful geological results are produced by the proposed transformation, we acknowledge that there are still a few anomalies. For example, in the Datanhao samples, Sc-Tl, F-Ga and Hf-Pb were agglomerated very early in the clustering process. Similarly, in the Adaohai samples, F-Tl, Co-Mo, Hf-U, and Th-Nb were agglomerated very early. Although a geological explanation is not obvious for all of these pairings, some can be reasonably inferred. For example, based on selective leaching, Wang et al. [39] found that F is elevated in the similar Haerwusu coals of the Jungar coalfield, which is located close to the south of the Daqingshan coalfield, and occurs mainly in boehmite and kaolinite. The high gallium concentration in these coals is also associated with these two minerals [30][31][32][33][34][35][36][37][38][39][40][41][42]. However, other associations such as Sc-Tl and F-Tl need further investigation.

Conclusions
The correlation between element concentrations has been reported inconsistently for ash and whole-coal bases, which is an enduring problem well known in the research community. In this study, we show that (1) the correlation between element concentrations obtained with the proposed improved alr data transformation is consistent regardless of the reporting basis; (2) the stability proposed by Geboy et al. [16], which is similar to correlation, is also consistent between elements in coal regardless of the reporting basis; and (3) a prediction model for the modes of occurrence of elements in coal can be built using the hierarchical clustering algorithm [20] to verify the performance of the improved alr and Geboy et al.'s transformation methods [16]. The prediction results show that our improved alr performs better than Geboy et al.'s method [16].
In conclusion, for the Datanhao and Adaohai data sets, the improved alr performs better than ilr, clr, and Geboy et al.'s [16] approach. An interesting line of future work is to consider consistent interpretations of coal geochemistry data on whole-coal versus ash bases through deep learning [43][44][45].
Output: Print hierarchical clustering records.
Begin:
  Initialize: element clusters C = {c_1, c_2, ..., c_n} = {{x_1}, {x_2}, ..., {x_n}}; minimum distance d_min; cluster indices a, b.
  While length(C) > 1 do
    d_min = Infinity.
    For i = 1 → (length(C) − 1) do
      For j = (i + 1) → length(C) do
        Calculate the distance between c_i and c_j: $d = \frac{\sum_{x_p \in c_i,\, x_q \in c_j} d_k(x_p, x_q)}{|c_i| \, |c_j|}$
        If d < d_min then d_min = d, a = i, b = j.
      End for
    End for
    Merge clusters c_a and c_b: c_tmp = c_a ∪ c_b.
    Delete c_a and c_b from C.
    Append c_tmp to C.
    Print elements V(x_i), x_i ∈ c_tmp.
  End While
End

Figure 1. Cluster analysis for coal element data from the Datanhao mine. (A) Improved alr on coal and ash basis; (B) Geboy's approach on coal and ash basis; (C) clr on coal and ash basis; (D) ilr on coal and ash basis.

Minerals 2020, 10, x FOR PEER REVIEW

Figure 2. Cluster analysis for coal element data from the Adaohai mine. (A) Improved alr on coal and ash basis; (B) Geboy's approach on coal and ash basis; (C) clr on coal and ash basis; (D) ilr on coal and ash basis.


Table 2. Correlation using the improved alr approach on coal and ash basis (Datanhao mine).

Table 3. Correlation using the clr approach on coal and ash basis (Datanhao mine).

Table 4. Correlation using the ilr approach on coal and ash basis (Datanhao mine).

Table 6. Correlation using the improved alr on coal and ash basis (Adaohai mine).

Table 8. Correlation using the ilr on coal and ash basis (Adaohai mine).

Table 9. Average-linkage algorithm for hierarchical clustering.