1. Automated Analysis of X-Ray Absorption Fine Structure (XAFS)
Structure–activity relationships have long been a research focus in chemistry and materials science, forming the theoretical foundation for new material design, particularly for catalysts. Consequently, structural characterisation has become a crucial component of scientific exploration. In the early 1920s, Hertz and Fricke discovered oscillations in the X-ray absorption coefficient occurring before and after the absorption edge of specific atoms in condensed matter. This oscillation is known as the X-ray absorption fine structure (XAFS) [1,2]. It serves as a potent tool for analysing the local structure of materials and comprises the extended X-ray absorption fine structure (EXAFS) and the X-ray absorption near-edge structure (XANES). XANES contains rich information about the coordination environment. However, traditional analytical methods (such as linear combination fitting) rely on known standard spectra, limiting their application to unknown systems [3].
From long short-term memory (LSTM) networks to gated recurrent units (GRUs), the gating mechanism has been widely used in deep learning. The GANDALF algorithm is a tabular deep learning algorithm based on a gating mechanism, designed with built-in feature selection and feature engineering. Compared with conventional deep learning methods such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), it is better at handling interactions among non-adjacent features, which arise in XANES input because the distances between peaks vary, and it therefore offers superior performance on diverse XANES spectra. This study proposes a coordination number prediction method integrating multi-scale feature engineering with the GANDALF algorithm, offering novel insights for structural characterisation of complex material systems.
Section 1 reviews prior research, Section 2 outlines the proposed methodology for determining transition metal coordination numbers from XANES spectra, Section 3 presents the results on public theoretical datasets, and Section 4 concludes with future directions.
1.1. Preliminary Research
Research into automated XAFS analysis has primarily relied on theoretical data derived from computational simulations. Currently, experimental studies determining the coordination number of the absorbing element from XANES spectra still depend on prior knowledge of reference XANES spectra for the relevant system.
Therefore, directly matching the spectra of test samples against pre-computed reference spectra represents an intuitive approach. For instance, Chen Zheng et al. first employed FEFF v9 software to accumulate extensive spectral data in a database via quantum mechanical ab initio calculations. Subsequently, their proposed ensemble learning method, the ensemble spectral matching (ELSIE) algorithm, successfully matched test sample XANES data with database entries, thereby analysing the composition and structure of the test samples [4]. The reliability of such matching algorithms hinges on the accuracy of the computed spectra, an aspect that has received scant attention in recent XAFS research.
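The idea of matching a test spectrum against a database of computed spectra can be illustrated with a minimal sketch. It ranks reference spectra by cosine similarity on a shared energy grid; this is not the ELSIE algorithm itself, and the function name `rank_references` and the synthetic sigmoid-shaped spectra are assumptions made purely for illustration.

```python
import numpy as np


def rank_references(test_spectrum, reference_spectra):
    """Return reference indices sorted from most to least similar.

    Similarity is the cosine similarity between the test spectrum and each
    reference spectrum (one reference per row, same energy grid).
    """
    test = np.asarray(test_spectrum, dtype=float)
    refs = np.asarray(reference_spectra, dtype=float)
    sims = refs @ test / (np.linalg.norm(refs, axis=1) * np.linalg.norm(test))
    order = np.argsort(-sims)  # descending similarity
    return order, sims[order]


# Synthetic example: three hypothetical reference spectra whose absorption
# edges sit at different positions on a normalised energy axis.
energy = np.linspace(0.0, 1.0, 200)
references = np.stack(
    [1.0 / (1.0 + np.exp(-(energy - c) / 0.02)) for c in (0.3, 0.5, 0.7)]
)

# The "measured" spectrum is reference 1 plus a little Gaussian noise.
rng = np.random.default_rng(0)
test = references[1] + rng.normal(0.0, 0.01, energy.size)

order, sims = rank_references(test, references)
best_match = int(order[0])
```

In practice the quality of such a ranking depends on how well the computed references reproduce experiment, which is the accuracy concern raised above.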
1.2. Machine Learning Research on Small Datasets
Rather than comparing test sample spectra with reference spectra across a broad range, researchers have often focused on analysing a specific system. Consequently, training a data-driven model and using it directly to derive results has become a more common approach. Because spectral data are high-dimensional and data suitable for training sets are scarce, dimensionality reduction has become a necessary step, and principal component analysis (PCA) is frequently employed for this purpose. For instance, Oleg A. Usoltsev et al. calculated theoretical XANES spectra using FDMNES and subsequently reduced their dimensionality via PCA. They then employed the multivariate curve resolution (MCR) algorithm to fit the resulting principal components to specific structural parameters of designated metallic crystals [5]. Similarly, A. Martini et al. applied PCA to reduce the dimensionality of theoretically calculated small-molecule XANES spectral data, followed by radial basis function (RBF) fitting to obtain conformational data [6]. Beyond this, dimensionality reduction may also be achieved through manually specified descriptors. For instance, Itsuki Miyazato et al. proposed descriptors based on the atomic number of the element and the energy shift difference $\Delta\mu(E)_x$, which is defined by Equation (1):

$$\Delta\mu(E)_x = E_x - E_0 \tag{1}$$

where $E_x$ represents the minimum photon energy corresponding to an absorption rate of $x$ within the XANES spectrum, and $E_0$ denotes the maximum photon energy at an absorption rate of 0. Machine learning models such as support vector machines, trained using these descriptors as input, successfully identify the valence state of the target element [7]. Shuting Xiang et al. demonstrated CO₂ participation in the reaction pathway by performing linear combination fitting (LCF) on XANES spectra collected for N₂ and CO₂, followed by determination of mixing fractions [8].
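The PCA step that recurs in the small-dataset studies above can be sketched as follows. This is a generic PCA via singular value decomposition on synthetic spectra, not the pipeline of any cited study; the function name `pca_reduce` and the toy spectral model (a sigmoid edge plus a Gaussian peak) are assumptions for illustration.

```python
import numpy as np


def pca_reduce(spectra, n_components):
    """Project mean-centred spectra onto their leading principal components."""
    X = np.asarray(spectra, dtype=float)
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    scores = Xc @ components.T            # low-dimensional representation
    explained = S**2 / np.sum(S**2)       # explained-variance ratio per axis
    return scores, components, explained[:n_components]


# Synthetic data: 50 spectra on a 300-point energy grid (eV relative to edge).
# Only the edge shift and the peak height vary from spectrum to spectrum.
rng = np.random.default_rng(1)
energy = np.linspace(-20.0, 60.0, 300)
edge_shift = rng.uniform(-2.0, 2.0, 50)
peak_height = rng.uniform(0.2, 0.8, 50)
spectra = np.array(
    [
        1.0 / (1.0 + np.exp(-(energy - s)))
        + h * np.exp(-0.5 * ((energy - s - 5.0) / 2.0) ** 2)
        for s, h in zip(edge_shift, peak_height)
    ]
)

scores, components, explained = pca_reduce(spectra, n_components=3)
```

The resulting `scores` matrix (50 × 3 here) is the kind of compact input that can then be passed to a downstream fit or regression model.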
In recent years, with increasing computational capabilities, deep learning methods have gained prominence in data-driven analysis algorithms for synchrotron radiation X-ray absorption fine structure (XAFS) data. This is due to their superior ability to extract underlying information from data compared to traditional multivariate statistical learning models. For instance, Janis Timoshenko et al. employed artificial neural networks (ANNs) trained on theoretically calculated data to analyse the coordination numbers and shapes of metallic nanoparticles, correlating them with the particle size and interatomic distances of actual metallic crystal clusters [9,10,11]. Meanwhile, Shuting Xiang et al. employed a neural network trained on theoretically calculated spectral data to predict the post-reaction structure of single-atom cobalt catalysts after carbon dioxide reduction reactions [8]. Zhengran Ji et al. opted to augment small samples into large ones via linear combinations, thereby training a convolutional neural network (CNN) to predict manganese oxidation states [12].
1.3. Machine Learning Research on Large Datasets
While small datasets focus on specific systems in which only particular structural parameters vary, large datasets contain diverse samples that share only the absorbing element and the energy range, so a model built on them can analyse XAFS without prior knowledge of the system. Research on large datasets typically involves computational analysis of structural information from public crystal databases or direct utilisation of publicly available spectral and characteristic data. Decision-tree-based methods are commonly employed. This approach originated with Steven B. Torrisi et al., who extracted XANES spectral features via polynomial fitting and then used random forests for coordination number classification and for regression of Bader charges and mean nearest neighbour distances [13]. Subsequently, recognising that practical samples are predominantly mixtures, Samuel P. Gleason et al. employed linear combinations of XANES spectra to simulate mixtures, successfully predicting the average oxidation state of copper in copper-containing crystal mixtures using random forests [14]. Tanaporn Na Narong et al. combined XANES with the pair distribution function (PDF) for both oxidation state and coordination number classification, alongside regression for average nearest neighbour bond length. They further interpreted the random forest model using feature importance analysis [15].
Although these studies demonstrate innovation in data sourcing, existing algorithms still leave room for improvement in generalisation capability and do not yet meet the needs of current research. This research proposes a coordination number prediction method based on the GANDALF (Gated Adaptive Network for Deep Automated Learning of Features) algorithm. It exhibits strong generalisation properties and is trained by gradient descent, making it suitable for large-scale sample training and prediction.
2. Results
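This study employs the root-mean-square error (RMSE), coefficient of determination ($R^2$), and adjusted coefficient of determination ($R^2_{\mathrm{adj}}$) as evaluation metrics:

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{m}\left(y_i - \bar{y}\right)^2}$$

$$R^2_{\mathrm{adj}} = 1 - \frac{\left(1 - R^2\right)\left(m - 1\right)}{m - p - 1}$$

where $y_i$ denotes the actual value for the i-th sample; $\hat{y}_i$ denotes the model-predicted value for the i-th sample; $m$ denotes the number of samples; $\bar{y}$ denotes the arithmetic mean of the actual values; and $p$ denotes the number of features, which, in this method, corresponds to the input length of the GANDALF model.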
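As a worked illustration of these three metrics, the sketch below computes them using their standard textbook definitions (stated in the code comments). The function name `regression_metrics` and the toy coordination-number values are assumptions chosen only to make the arithmetic easy to follow.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Return (RMSE, R^2, adjusted R^2) for paired true/predicted values.

    RMSE   = sqrt(sum((y - y_hat)^2) / m)
    R^2    = 1 - sum((y - y_hat)^2) / sum((y - y_mean)^2)
    R^2adj = 1 - (1 - R^2) * (m - 1) / (m - p - 1)
    with m samples and p = n_features input features.
    """
    m = len(y_true)
    if m != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if m - n_features - 1 <= 0:
        raise ValueError("adjusted R^2 requires m > n_features + 1")
    y_mean = sum(y_true) / m
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    rmse = math.sqrt(ss_res / m)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (m - 1) / (m - n_features - 1)
    return rmse, r2, adj_r2


# Toy example: six samples with coordination numbers between 4 and 6.
y_true = [4.0, 5.0, 6.0, 6.0, 4.0, 5.0]
y_pred = [4.5, 5.0, 5.5, 6.0, 4.0, 5.5]
rmse, r2, adj_r2 = regression_metrics(y_true, y_pred, n_features=2)
# Here ss_res = 0.75 and ss_tot = 4.0, so R^2 = 0.8125 and adjusted R^2 = 0.6875.
```

Because the adjusted value penalises the number of input features, it is always at most equal to the unadjusted value when p ≥ 1, which matters when models with different input lengths are compared.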
For samples without linear combinations, the common approach of feeding E-space XANES spectral vectors directly into a random forest to fit coordination numbers was employed as the random forest control group. Given the scarcity of neural network studies analysing structural information in large-sample XANES spectra, the neural network control groups were established by reference to relevant research utilising X-ray absorption spectroscopy for detecting prohibited substances. A 1D U-Net convolutional neural network and a bidirectional long short-term memory (LSTM) neural network were established as neural network control groups [16]. Test results on the uncombined test set are presented in Table 1.
When the absorbing element was Mn, all four models exhibited their poorest predictive performance, with R² values below 0.6. Steven B. Torrisi et al. also achieved only 80% classification accuracy for Mn, markedly lower than for other absorbing elements. Similarly, when the absorbing element was Ni or Cu, R² values were below 0.7. Within the entire dataset, Ni exhibited a marked predominance of 6-coordinate samples over 4- and 5-coordinate samples; Cu showed a clear preponderance of 5-coordinate samples over 4- and 6-coordinate samples; and Mn displayed a distinct scarcity of 4-coordinate samples compared to 5- and 6-coordinate samples. As the training set was randomly selected, this imbalance was also carried into the training data. When the absorbing element was V, all four models achieved their best predictive performance, with R² values exceeding 0.73. The V dataset was also the only one in the entire dataset where the number of samples for each coordination number was no fewer than 1000, exhibiting the best sample balance. Steven B. Torrisi et al. proposed in their research that improving sample balance within datasets is necessary to enhance accuracy [13]. This imbalance is difficult to address because reusing some samples when training models can introduce bias, while under-sampling can cause over-fitting. Both would reduce performance. The marked variation in model performance across different absorbing elements is therefore attributed to differing sample balance within the dataset. For the test set derived from linear combinations of XANES spectra, the test results are presented in Table 2.
Comparing the results in Table 1 and Table 2, both the pseudo-Voigt-GANDALF method (Supplementary Data) and the random forest method exhibit a decrease in R² on the linear combination test set relative to the original test set, though the pseudo-Voigt-GANDALF method shows a smaller decline. In Table 2, for each absorbing element, the adjusted R² of the pseudo-Voigt-GANDALF method is greater than that of the random forest method. This indicates that a training set of pure samples cannot fully capture the relationship between the XANES spectrum of a mixture and the average coordination number of the absorbing element. In Table 1, the adjusted R² values for the pseudo-Voigt-GANDALF method are lower than those for the random forest method across all absorbing elements. This can be attributed to the random forest method being trained solely on the original training set, which contains only integer coordination numbers, and hence performing better on the original test set, which likewise contains only integer coordination numbers.
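The contrast between integer-valued pure samples and continuous-valued mixtures can be made concrete with a minimal sketch of mixture simulation by linear combination. It assumes that the label of a mixture is the fraction-weighted average of the component coordination numbers and that mixing fractions are drawn from a uniform Dirichlet distribution; the function name `simulate_mixtures`, the two-component default, and the toy spectra are illustrative assumptions, not the procedure used to build the test set in Table 2.

```python
import numpy as np


def simulate_mixtures(spectra, coordination_numbers, n_mixtures,
                      n_components=2, seed=0):
    """Build synthetic mixture spectra and their average coordination numbers.

    Each mixture is a convex combination of `n_components` distinct pure
    spectra; its label is the same weighted average of the pure labels.
    """
    rng = np.random.default_rng(seed)
    spectra = np.asarray(spectra, dtype=float)
    cn = np.asarray(coordination_numbers, dtype=float)
    mixed_spectra = np.empty((n_mixtures, spectra.shape[1]))
    mixed_cn = np.empty(n_mixtures)
    for k in range(n_mixtures):
        idx = rng.choice(len(spectra), size=n_components, replace=False)
        # Assumed mixing model: non-negative fractions that sum to one.
        weights = rng.dirichlet(np.ones(n_components))
        mixed_spectra[k] = weights @ spectra[idx]
        mixed_cn[k] = weights @ cn[idx]
    return mixed_spectra, mixed_cn


# Toy pure samples: three four-point "spectra" with integer labels 4, 5, 6.
pure_spectra = np.array([[0.0, 0.5, 1.0, 1.0],
                         [0.0, 0.2, 0.9, 1.1],
                         [0.1, 0.4, 1.2, 1.0]])
pure_cn = np.array([4, 5, 6])

mix_x, mix_y = simulate_mixtures(pure_spectra, pure_cn, n_mixtures=5)
```

The labels in `mix_y` are generally non-integer, which is why a model trained only on integer coordination numbers is not guaranteed to interpolate correctly on such a test set.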
Although the trees of a random forest are trained independently, each tree requires access to the full bootstrapped training set, so the method does not lend itself to mini-batch training or GPU acceleration and faces computational efficiency challenges during large-sample training. Reducing the sample size weakens the diversity and balance of the samples, lowering prediction accuracy. By contrast, deep learning methods employ gradient descent training with high parallelism, can utilise GPU acceleration, and can effectively learn the relationship between XANES spectral features and coordination numbers from large sample data.
Table 2 demonstrates that, across three metrics, the proposed pseudo-Voigt-GANDALF method outperforms U-Net (representing convolutional neural networks), while U-Net itself outperforms LSTM (representing recurrent neural networks). LSTM’s higher RMSE and lower R² indicate limitations in multi-scale feature extraction. Although the U-Net can extract multi-scale features, its extraction method lacks guidance from mechanistic knowledge, resulting in relatively constrained performance in the current system.
Synthesising the results from Table 1 and Table 2, the pseudo-Voigt-GANDALF method achieves the best performance when the absorbing elements are Ti, Co, Ni, and Cu. When the absorbing elements are V, Cr, Mn, and Fe, the model’s performance is inferior only to the random forest model on the original test set. Given the computational efficiency challenges of random forest methods in large-scale training, the pseudo-Voigt-GANDALF approach is the strongest of the methods compared here. It satisfies preliminary practical application requirements for the best-performing predicted elements, V and Cr.
Furthermore, taking the best-performing absorbing elements, V and Cr, as examples, feature importance analysis was conducted on the GANDALF models that predict their coordination numbers. The feature importance of each input variable in the GANDALF model is as follows:

$$I = \sum_{n=1}^{N} M_n$$

where $I$ denotes the feature importance of each input variable, $N$ represents the number of GFLU layers, and $M_n$ is the feature mask for the n-th layer. The feature importance obtained from the above equation is normalised as follows:

$$I_{i,\mathrm{norm}} = \frac{I_i}{\sum_{j=1}^{D} I_j}$$

where $I_i$ and $I_{i,\mathrm{norm}}$ denote the feature importance and its normalised value for the i-th input variable, respectively, and $D$ is the number of input variables [17]. The feature importance analysis results of the GANDALF model predicting the coordination numbers of the absorbing elements V and Cr are shown in Figure 1.
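A minimal numerical sketch of this aggregation is given below. It assumes that the importance vector is the sum of the per-layer feature masks and that normalisation divides by the total over all input variables, so that the normalised values sum to one; the function name `normalised_feature_importance` and the mask values are made up for illustration and are not taken from the trained models.

```python
import numpy as np


def normalised_feature_importance(feature_masks):
    """Aggregate per-layer feature masks into normalised importances.

    `feature_masks` has shape (N, D): one mask per gated layer (N layers)
    over D input variables. Assumed aggregation: sum over layers, then
    divide by the grand total so the result sums to one.
    """
    masks = np.asarray(feature_masks, dtype=float)
    importance = masks.sum(axis=0)          # I_i, one value per input
    return importance / importance.sum()    # normalised I_i


# Hypothetical masks for N = 3 layers and D = 4 input variables.
masks = np.array([[0.7, 0.2, 0.1, 0.0],
                  [0.1, 0.6, 0.2, 0.1],
                  [0.4, 0.4, 0.1, 0.1]])

importance = normalised_feature_importance(masks)
# Column sums are 1.2, 1.2, 0.4, 0.2 (total 3.0).
```

Summing the normalised values over a group of related inputs, for example all pre-edge peak heights across scales, gives grouped importance sums of the kind quoted in the discussion of V and Cr.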
When the absorbing element is V, the peak height and peak position in the pre-edge region are strong predictive indicators, with normalised feature importance sums of 0.0788 and 0.0816, respectively. However, when the spectrum is divided into four and five equal parts, the standard deviation of the Gaussian line-shape function in the post-edge region has a summed importance of 0.0884, while the Lorentzian line-shape half-width in the post-edge region has a summed importance of 0.0733, which is close to the sums for peak height and peak position in the pre-edge region across the four scales. When the absorbing element is Cr, for the three features (peak height, Gaussian line-shape standard deviation, and Lorentzian line-shape half-width), the maximum importance value at every scale occurs in the edge-of-band region, with the sole exception of peak height in the four-division spectrum. This aligns with the conclusion proposed by Tanaporn Na Narong et al. regarding the prediction of coordination numbers, which states that the edge-of-band region possesses non-negligible importance relative to the edge-of-band peak [15]. Surrounding the absorbing atom, ligands both form potential barriers, creating inner trap states, and interfere with outward-extending atomic orbitals to form outer trap states. The probability of 1s orbital electrons transitioning to inner trap states far exceeds that to outer trap states, yielding broad edge-of-band peaks [18]. This enables the edge-of-band region to effectively reflect a material’s coordination number information.
4. Conclusions and Discussion
In this paper, based on publicly available computational X-ray absorption near-edge structure (XANES) spectral datasets, we propose a pseudo-Voigt-GANDALF method to predict the coordination number of the absorbing element corresponding to a specified XANES spectrum. The XANES spectrum was first divided equally into multiple segments at multiple scales. Parameters obtained by fitting each segment with a pseudo-Voigt function, along with the position of the absorption edge, formed a descriptor vector. The GANDALF model was then employed to predict the coordination number from this descriptor vector. Mixture samples were simulated via spectral linear combinations, transforming the classification problem of integer coordination numbers for simple systems into a fitting problem within a continuous range. When the absorbing element was V, the test set R² reached 0.8085 and 0.7837 without and with linear combination, respectively. For all tested groups, the performance degradation on the linear combination test set was smaller than that of traditional methods. The test results confirmed the effectiveness of this model in predicting the coordination number of mixed systems. However, significant performance discrepancies emerged between different absorbing elements, indicating that future research should prioritise improving the balance of sample types within the dataset. Feature importance plots from the GANDALF model reveal that the edge-of-band region contains crucial information when the absorbing elements are V and Cr. This further demonstrates that predicting coordination numbers from XANES spectra requires the full spectrum encompassing both the pre-edge and post-edge regions, rather than merely the pre-edge segment.
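The descriptor construction summarised above can be sketched as follows. This is an illustrative reading under several assumptions, not the authors' implementation: the pseudo-Voigt is taken as an equal-weight sum of a unit-height Gaussian and a unit-height Lorentzian sharing one centre (`ETA = 0.5`), the scales are taken as divisions into 2, 3, 4 and 5 equal parts, and the edge position is taken as the energy of maximum first derivative. The names `pseudo_voigt`, `fit_segment` and `xanes_descriptor` are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

ETA = 0.5  # assumed fixed Lorentzian fraction of the pseudo-Voigt profile


def pseudo_voigt(e, height, center, sigma, gamma):
    """Pseudo-Voigt: weighted sum of unit-height Gaussian and Lorentzian."""
    gauss = np.exp(-0.5 * ((e - center) / sigma) ** 2)
    lorentz = gamma**2 / ((e - center) ** 2 + gamma**2)
    return height * (ETA * lorentz + (1.0 - ETA) * gauss)


def fit_segment(e, mu):
    """Fit one segment; return (height, center, sigma, gamma)."""
    width = e[-1] - e[0]
    p0 = [max(float(mu.max()), 1e-6), e[np.argmax(mu)], width / 4, width / 4]
    lower = [0.0, e[0], 1e-3, 1e-3]
    upper = [np.inf, e[-1], 10 * width, 10 * width]
    try:
        popt, _ = curve_fit(pseudo_voigt, e, mu, p0=p0, bounds=(lower, upper))
    except RuntimeError:
        # If the optimiser does not converge, fall back to the initial guess.
        popt = np.array(p0)
    return popt


def xanes_descriptor(energy, mu, scales=(2, 3, 4, 5)):
    """Edge position followed by fitted parameters of every segment."""
    energy = np.asarray(energy, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Assumed edge definition: energy at the maximum of d(mu)/dE.
    features = [energy[np.argmax(np.gradient(mu, energy))]]
    for n_parts in scales:
        for e_seg, mu_seg in zip(np.array_split(energy, n_parts),
                                 np.array_split(mu, n_parts)):
            features.extend(fit_segment(e_seg, mu_seg))
    return np.array(features)


# Synthetic spectrum: arctangent edge at 5480 eV, a weak pre-edge peak and
# a broader peak just above the edge (all values are illustrative).
energy = np.linspace(5450.0, 5550.0, 300)
mu = (0.5 + np.arctan((energy - 5480.0) / 2.0) / np.pi
      + 0.1 * np.exp(-0.5 * ((energy - 5468.0) / 1.5) ** 2)
      + 0.4 * np.exp(-0.5 * ((energy - 5490.0) / 4.0) ** 2))

descriptor = xanes_descriptor(energy, mu)
```

With these assumed scales the descriptor has 1 + 4 × (2 + 3 + 4 + 5) = 57 entries, and a vector of this kind would serve as the tabular input to the regression model.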
This paper is based on computational datasets, but future work can focus on training and testing with experimental datasets. Although a better model is probably needed because of measurement error, using experimental datasets can be beneficial, as they overcome the limitations of FEFF9 [13]. Future research should also focus on improving the model architecture. The results of this article suggest that higher model complexity did not translate into better performance on either the original test sets or the linear combination test sets. This is due to differences in multi-scale feature extraction. Consequently, simpler models that are better at extracting multi-scale features are expected to show better performance.