The Role of Structural Representation in the Performance of a Deep Neural Network for X-ray Spectroscopy

An important consideration when developing a deep neural network (DNN) for the prediction of molecular properties is the representation of the chemical space. Herein we explore the effect of the representation on the performance of our DNN engineered to predict Fe K-edge X-ray absorption near-edge structure (XANES) spectra, and address the question: How important is the choice of representation for the local environment around an arbitrary Fe absorption site? Using two popular representations of chemical space—the Coulomb matrix (CM) and pair-distribution/radial distribution curve (RDC)—we investigate the effect that the choice of representation has on the performance of our DNN. While CM and RDC featurisation are demonstrably robust descriptors, it is possible to obtain a smaller mean squared error (MSE) between the target and estimated XANES spectra when using RDC featurisation, and converge to this state a) faster and b) using fewer data samples. This is advantageous for future extension of our DNN to other X-ray absorption edges, and for reoptimisation of our DNN to reproduce results from higher levels of theory. In the latter case, dataset sizes will be limited more strongly by the resource-intensive nature of the underlying theoretical calculations.


Introduction
Structural techniques, such as X-ray diffraction (XRD) and X-ray spectroscopy (XS), have made it possible to determine directly the structures of molecules and condensed matter systems, and have had a huge influence across physics, chemistry, and biology. The proliferation of high-brilliance light sources such as 3rd-generation synchrotrons and X-ray free-electron lasers (XFELs) is helping to increase the influence of these techniques by facilitating the measurement of increasingly challenging systems such as operating catalysts [1,2] and short-lived reaction intermediates [3,4].
Besides providing element- and site-specific information on geometric structure, X-ray absorption spectroscopy (XAS) is also able to provide direct information on the electronic structure around the absorption site [5,6]. Indeed, XAS spectra are characterised by X-ray absorption edges that correspond to the excitation of core electrons to the ionisation threshold. The electrons are initially excited to unoccupied or partially-occupied orbitals at energies just below the ionisation potential (IP); these bound transitions, which form the pre-edge spectral features, provide detailed information about the nature of the unoccupied valence orbitals. At energies above the IP, resonances occur due to scattering of the outgoing photoelectron by the atoms surrounding the absorption site; these above-ionisation features consequently encode the local geometric structure. In this work, we compare two representations of the local environment around the absorption site, the CM and the RDC, and demonstrate that the latter not only leads to a smaller mean squared error (MSE) but also achieves this faster and with smaller training set sizes, which greatly supports the development of a DNN which is generalisable across the whole periodic table.

Deep Neural Network
A schematic of our DNN is shown in Figure 1. Our DNN is based on the multilayer perceptron (MLP) model; an MLP is a class of feed-forward neural network comprising an input layer, n hidden layers, and an output layer. The dimensions of the input layer of our DNN are determined by the representation used (see the "Representation" section). The first hidden layer comprises 1200 neurons, and every subsequent hidden layer is reduced in size by 30% relative to the preceding one; our DNN uses four hidden layers. The output layer comprises X neurons, where X is defined by the discretisation of our target XANES spectra. The DNN takes as input the local environment around an atomic absorption site, featurised using either the Coulomb matrix (CM) or the radial distribution curve (RDC) representation; this input is passed through the four hidden layers to yield a predicted spectrum, from which the mean squared error between the theoretical and predicted XANES spectra is evaluated.
The layers are all fully connected, or 'dense', i.e., each neuron in an arbitrary layer, k, is connected to every neuron in the preceding layer, k − 1, via a matrix of weights, w^(k)_ij. The pre-activation value of an arbitrary neuron j in layer k, z_k,j, is given by the linear combination of the input activations, x^(k−1)_i (these being the output activations of each neuron in the preceding layer), and their respective weights: z_k,j = Σ_i w^(k)_ij x^(k−1)_i. A non-linear activation function, g(z), is then applied to the pre-activation values to compute the output activation of each neuron in the layer, ŷ_k,j; our DNN uses a hyperbolic tangent (tanh) activation function, which constrains the possible values of ŷ to between −1 and +1. The resulting activations, obtained for an arbitrary neuron in an arbitrary layer as ŷ_k,j = g(z_k,j), then serve as the input activations for the next layer, unless subject to an intermediate transformation.
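The feed-forward step described above can be sketched in a few lines; the layer size, weights, and input values below are hypothetical placeholders, not those of the actual DNN.

```python
import math

def dense_tanh_layer(x, weights):
    """One fully connected layer: z_kj = sum_i w^(k)_ij * x_i, then y_kj = tanh(z_kj).

    weights[j][i] connects neuron i of layer k-1 to neuron j of layer k.
    """
    z = [sum(w_ji * x_i for w_ji, x_i in zip(row, x)) for row in weights]
    return [math.tanh(z_j) for z_j in z]

# Two input activations feeding a layer of three neurons.
x = [0.5, -1.0]
W = [[0.1, 0.2], [0.3, -0.4], [-0.5, 0.6]]
y = dense_tanh_layer(x, W)
assert all(-1.0 < a < 1.0 for a in y)  # tanh bounds every activation to (-1, +1)
```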
Information propagates through the MLP via this 'feed-forward' process until it arrives at the output layer. The activations of the output layer are then compared against target activations via evaluation of a cost function, J(W) = (1/n) Σ_i L{f(x^(i), W), y^(i)}, that quantifies the difference between the obtained, f(x^(i), W), and expected, y^(i), activations over a dataset of n samples as a function of the weights, W, and input activations, x^(i). Our DNN uses a mean squared error (MSE) cost function of the general form J(W) = (1/n) Σ_i {y^(i) − f(x^(i), W)}². The derivatives of J(W) with respect to the internal weights, ∂J(W)/∂W, can be calculated cost-effectively and used to adjust the internal weights such that J(W) is minimised; succinctly, the objective is to find a set of internal weights, W*, for which W* = argmin_W J(W).
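The MSE cost function follows directly from the expression above; a minimal sketch over a batch of discretised spectra (the short vectors below are hypothetical placeholders for real spectra):

```python
def mse_cost(predicted, target):
    """J(W): mean squared error averaged over samples and spectral points."""
    total = 0.0
    for f_x, y in zip(predicted, target):
        total += sum((y_i - f_i) ** 2 for f_i, y_i in zip(f_x, y)) / len(y)
    return total / len(predicted)

# Two 3-point "spectra" and their targets.
pred = [[0.1, 0.4, 0.9], [0.0, 0.5, 1.0]]
true = [[0.2, 0.4, 0.8], [0.0, 0.6, 0.9]]
cost = mse_cost(pred, true)
assert cost > 0.0  # non-zero whenever prediction and target differ
```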
Our DNN optimises ca. three million internal weights via sequential feed-forward and back-propagation cycles. The internal weights are updated iteratively according to the Adaptive Moment Estimation (ADAM) algorithm, using gradients of the MSE cost function estimated over minibatches of 100 samples. The learning rate for the ADAM algorithm, η, is set to 3 × 10⁻⁴.
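The hidden-layer sizing rule (1200 neurons in the first hidden layer, each subsequent layer 30% smaller) can be made concrete as follows; flooring to an integer neuron count is an assumption, as the text does not state how fractional sizes are rounded.

```python
def hidden_layer_sizes(first=1200, n_layers=4):
    """Sizes of the hidden layers: each layer is 70% the size of the previous."""
    sizes = [first]
    for _ in range(n_layers - 1):
        sizes.append(sizes[-1] * 7 // 10)  # 30% reduction, floored (assumption)
    return sizes

print(hidden_layer_sizes())  # -> [1200, 840, 588, 411] under this rounding
```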
The performance of our DNN is assessed via K-fold cross validation [40]. The data are randomly partitioned into K folds with K − 1 folds kept in-sample to train the DNN and the remaining fold left out-of-sample to evaluate the performance of the DNN on unseen data. K evaluations are made such that every data sample appears in the out-of-sample testing set once, and in the in-sample training set K − 1 times. The entire procedure can be repeated any number of times with different random K-fold partitions. The repeated evaluations of performance can be used to estimate an error. We mitigate the risk of overfitting our DNN to the training set by assessing the performance of each K-fold on the out-of-sample data only. We use five-fold cross-validation, i.e., an 80:20 in-sample/out-of-sample split, with five repetitions.
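The K-fold scheme described above can be sketched as follows (K = 5, with a hypothetical sample count); each sample lands in the out-of-sample fold exactly once across the K splits.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Return K (train, test) index pairs; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # random partitioning
    folds = [idx[i::k] for i in range(k)]
    return [
        ([j for f, fold in enumerate(folds) if f != i for j in fold], folds[i])
        for i in range(k)
    ]

splits = k_fold_indices(100)
# Every sample appears in exactly one out-of-sample fold...
assert sorted(i for _, test_fold in splits for i in test_fold) == list(range(100))
# ...and in K - 1 = 4 in-sample (training) folds, i.e., an 80:20 split.
assert all(len(train) == 80 for train, _ in splits)
```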
Our DNN also utilises dropout [41]; dropout is a regularisation technique in which the activations of a certain fraction of the neurons in each layer are set to zero during the feed-forward/back-propagation procedure. Utilising dropout encourages a DNN to distribute weights probabilistically, and works to mitigate layers adapting to correct for mistakes in other layers; the latter behaviour would otherwise lead to overfitting, as these adaptations will not generalise well beyond the in-sample dataset. Our DNN uses a dropout rate of 15%.
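Dropout can be sketched as below; the 1/(1 − rate) rescaling of surviving activations ('inverted dropout') is an assumption, as the text does not state whether activations are rescaled during training.

```python
import random

def dropout(activations, rate=0.15, rng=random.Random(42)):
    """Zero a fraction `rate` of activations; rescale survivors (assumption)."""
    scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * scale for a in activations]

layer = [0.5] * 1000
dropped = dropout(layer)
zeroed = sum(1 for a in dropped if a == 0.0)
assert 100 <= zeroed <= 200  # roughly 15% of 1000 activations are zeroed
```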

Representation
We trial two alternative representations of chemical space: the CM and the RDC. The CM representation, M_IJ [33][34][35], is constructed as:

M_IJ = 0.5 Z_I^2.4 for I = J; M_IJ = Z_I Z_J / |R_I − R_J| for I ≠ J,

where Z_I is the nuclear charge of atom I and R_I is the position of atom I in Cartesian space. M_IJ is a symmetric matrix of dimensions N × N, where N is the upper limit on the number of atoms designated for inclusion in the CM. The off-diagonal elements of M_IJ correspond to a Coulombic repulsion term between atoms I and J, and the on-diagonal elements of M_IJ correspond to the Coulomb potentials of the free atoms. The rows of M_IJ are sorted in descending order according to their Euclidean (L2) norms, ||M_I||, i.e., a permutation of the rows and columns is found that satisfies the inequality ||M_1|| ≥ ||M_2|| ≥ … ≥ ||M_N||. The upper triangle of M_IJ is then taken and flattened row-wise to yield a feature vector of length N(N + 1)/2. Practicably, this feature vector should have the same length, regardless of the size of the system it encodes, if it is to be input into a neural network. If the system contains more than N atoms, the closest N atoms to the absorbing atom are used to construct the CM and the rest are discarded; if the system contains fewer than N atoms, the remaining rows and columns of the CM are zero-filled.
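The sorted-CM construction can be sketched as follows, using the standard Coulomb matrix definition of Rupp et al. [33]; the three-atom fragment and its coordinates are hypothetical, and only the zero-padding branch (fewer than N atoms) is shown.

```python
import math

def coulomb_matrix_features(Z, R, n_max=20):
    """Zero-padded, row-sorted Coulomb matrix, flattened to its upper triangle."""
    m = [[0.0] * n_max for _ in range(n_max)]
    for i in range(len(Z)):
        for j in range(len(Z)):
            if i == j:
                m[i][j] = 0.5 * Z[i] ** 2.4  # free-atom (diagonal) term
            else:
                m[i][j] = Z[i] * Z[j] / math.dist(R[i], R[j])  # Coulomb repulsion
    # Permute rows and columns so the row L2 norms are in descending order.
    order = sorted(range(n_max), key=lambda i: -math.hypot(*m[i]))
    m = [[m[i][j] for j in order] for i in order]
    # Flatten the upper triangle row-wise: N(N + 1)/2 features.
    return [m[i][j] for i in range(n_max) for j in range(i, n_max)]

# Hypothetical Fe-O-O fragment (distances in arbitrary units).
feats = coulomb_matrix_features([26, 8, 8], [(0, 0, 0), (2.0, 0, 0), (0, 2.0, 0)], n_max=4)
assert len(feats) == 4 * 5 // 2  # feature vector of length N(N + 1)/2
```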
The sorted CM representation is unique, invariant with respect to atomic indexing, translations, and rotations of the chemical space that it describes, and its construction requires no explicit information on chemical bonding [33][34][35].
Where CM featurisation is used in this work, N = 20.

The RDC representation [36][37][38][39] encodes local chemical space as an intensity distribution, f_RDC, as a function of equally-distributed values of R, where the intensity is defined as:

f_RDC(R) = Σ_I Σ_{J>I} Z_I Z_J exp(−α(R − r_IJ)²),

where Z_I and Z_J are the nuclear charges of atoms I and J, respectively, r_IJ is the distance between atoms I and J, and R is a vector obtained by discretising a linear interpolation between zero and twice the cutoff radius around the absorption site (defining the maximum pairwise distance that can be encoded by the RDC). α is a smoothing parameter that controls the resolution of the RDC. As α is increased, so too is the detail that is visible in the RDC, but if α is too large, the RDC starts to become sparse. This is illustrated in Figure 2.

Like the CM representation, the RDC representation is invariant with respect to atomic indexing, translations, and rotations of the chemical space that it describes, and its construction requires no explicit information on chemical bonding [39]. It does not have to be weighted by Z_I and Z_J alone; indeed, it is possible to construct property-weighted RDCs using any relevant atomic property (e.g., electron affinity, electronegativity, van der Waals radius, etc.) [46,47] to engineer the descriptor for a specific purpose.
An additional advantage of the RDC is that f RDC can be discretised to yield a feature vector of constant length [39,46,47] regardless of the size of the chemical space it describes, i.e., CM featurisation encodes information on a fixed number of atoms and, consequently, a fixed number of interatomic distances, while RDC featurisation flexibly encodes information on all atoms and interatomic distances below a cutoff radius as per specification of R.
Where RDC featurisation is used in this work, α = 10.0 and R spans 0.0 to 800.0 pm in increments of 1.2 pm.
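RDC featurisation under parameters of this kind can be sketched as follows; the Gaussian-smoothed form f_RDC(R) = Σ_I Σ_{J>I} Z_I Z_J exp(−α(R − r_IJ)²) is an assumption consistent with the role of α described above, and the molecules and grid units below are hypothetical.

```python
import math

def rdc(Z, coords, grid, alpha=10.0):
    """Discretised radial distribution curve over a fixed grid of R values."""
    f = [0.0] * len(grid)
    for i in range(len(Z)):
        for j in range(i + 1, len(Z)):
            r_ij = math.dist(coords[i], coords[j])
            for k, R in enumerate(grid):
                # Each atom pair contributes a Z-weighted Gaussian centred at r_ij.
                f[k] += Z[i] * Z[j] * math.exp(-alpha * (R - r_ij) ** 2)
    return f

grid = [0.012 * k for k in range(668)]  # R from 0.0 to ~8.0 in 0.012 steps
small = rdc([26, 8], [(0, 0, 0), (2.0, 0, 0)], grid)
large = rdc([26, 8, 8, 8], [(0, 0, 0), (2.0, 0, 0), (0, 2.0, 0), (0, 0, 2.0)], grid)
# Feature-vector length is fixed by the grid, not by the number of atoms.
assert len(small) == len(large) == len(grid)
```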
The CM and RDC representations are useful descriptors on account of their simplicity; both require very little space in memory, and the requisite operations for their construction are easily vectorisable, so large datasets can be featurised quickly and can typically fit in memory in their entirety.

Dataset
Our dataset comprises 9040 unique Fe-containing structures harvested from the Materials Project library via the Materials Project API. Fe K-edge XANES spectra for one arbitrary absorption site per structure have been calculated using multiple scattering theory as implemented in the FDMNES package [12]. The Fe K-edge XANES calculations employed a self-consistent muffin-tin-type potential of radius 6.0 Å around the absorbing site. The interaction with the X-ray field was described using the electric quadrupole approximation, and scalar relativistic effects were included. To transform the computed cross-sections into XANES spectra that can be compared to experiment, the cross-sections need to be convoluted with a function that accounts for the core-hole-lifetime broadening, instrument response, and many-body effects, e.g., inelastic losses. Throughout this work, this convolution has been performed using an energy-dependent arctangent broadening, Γ(E), via an empirical model close to the Seah-Dench formalism [48]: Γ is defined over the energy scale, E, of the XANES spectrum as per specification of the core-level and final-state widths (Γ_i and Γ_f, respectively) and the centre and width of the arctangent function (E_c and E_w, respectively). The arctangent convolution is performed as implemented in the FDMNES package [12].
The arctangent convolution is only applied as a post-processing step on XANES spectra estimated by our DNN; our dataset comprises only unconvoluted cross-sections, and our DNN learns from these unconvoluted cross-sections.

Figure 3a shows the MSE as a function of the number of in-sample spectra accessible to our DNN during the learning process. The local environment around each Fe absorption site has been featurised either as a CM (black) or an RDC (red). In the small-sample limit (ca. 100 in-sample spectra), both representations exhibit similar performance, with an MSE of ca. 0.17. However, as the number of in-sample spectra accessible to our DNN is increased, an almost linear improvement in the MSE is seen when CM featurisation is used, and an MSE of 0.12 is obtained in the large-sample limit (ca. 9000 in-sample spectra). In contrast, RDC featurisation gives a rapid initial improvement, delivering a much smaller MSE than can be achieved via CM featurisation. Beyond ca. 2000 in-sample spectra, the improvement in the MSE begins to slow, but the final MSE in the large-sample limit (ca. 0.08) is still significantly lower than the MSE that can be achieved using CM featurisation. Figure 3b illustrates the performance during the training of the DNN, shown as a function of the number of forward passes through our dataset. While, as seen in Figure 3a, we achieve a lower MSE for the RDC representation, in both cases it is observed that the DNN can be optimised in <500 forward passes through the dataset. This is achievable in as little as five to ten minutes if graphical processing unit (GPU) acceleration is used.

Predictions of Peak Position and Intensity
When predicting XANES spectra, accurate reproduction of the positions and intensities of above-ionisation resonances is crucial, as these directly encode the structural information in the spectrum. Figure 4 shows parity plots of the difference between the estimated and target peak positions on the energy (E_Target and E_Est.) and intensity (μ_Target and μ_Est.) scales. The upper (Figure 4a,b) and lower panels (Figure 4c,d) display the results from CM and RDC featurisation, respectively. In both cases, strong linear relationships are evidenced by the coefficients of determination, R², which are 0.974 and 0.930 for energy and intensity, respectively, if CM featurisation is used, and 0.986 and 0.973 for energy and intensity, respectively, if RDC featurisation is used. As expected from the training curve shown in Figure 3a, the RDC representation performs slightly better, exhibiting a higher R² in both cases, consistent with the narrower spread visible in Figure 4. Figure 5 compares six computed XANES spectra with their corresponding out-of-sample DNN estimations. The dashed lines represent the computed and predicted cross-sections (scaled by 50% for clarity) and the solid lines represent the computed and predicted XANES spectra post-convolution of the cross-sections with the arctangent function (Equation (4)). These XANES spectra all belong to the first centile when performance is ranked over all out-of-sample DNN estimations by MSE.

Predictions of Spectra
The top three XANES spectra (Figure 5a-c) were obtained using CM featurisation, while the bottom three (Figure 5d-f) were obtained using RDC featurisation. In the latter case, the DNN-estimated XANES spectra can hardly be distinguished from the target XANES spectra; while discrepancies can be observed in the unconvoluted cross-sections on which our DNN is trained, the differences are negligibly small and can be considered insignificant once the arctangent convolution has been applied. In contrast, differences between the DNN-estimated and target XANES spectra are amplified when using CM featurisation. Inspection of the unconvoluted cross-sections suggests that these differences have their origin in the estimated intensities of peaks; this is most apparent in Figure 5a. CM featurisation performs less effectively than RDC featurisation on estimated peak intensities, as also evidenced in Figure 4.

Figure 5. Arctangent-convoluted (solid) and unconvoluted (dashed) target (black) and out-of-sample DNN-estimated (red) Fe K-edge X-ray absorption near-edge structure (XANES) spectra for absorption sites in (a,d) C6Al2Fe4O15, (b,e) FeOF, and (c,f) Sm2Fe17H3. Spectra belong to the first centile when performance is ranked over all out-of-sample DNN estimations by MSE. Spectra in panels a-c and d-f were obtained using the CM and RDC representations, respectively. Amplitudes of all unconvoluted spectra have been reduced by half for clarity.

Figure 6 shows three sample spectra, optimised using the CM (panels a-c) and the RDC (panels d-f) representations, drawn from the ninety-ninth centile, i.e., the most poorly predicted spectra. In this case, as with previous work [32], the principal reason that these fall in the lowest centile is an underestimation of the spectral intensity that compounds across the energy scale. This is especially true in the case of the RDC representation, which nonetheless finds peaks in the right positions; it is less so for the CM representation, for which larger deviations are observed, as is especially apparent in Figure 6b,c.

Discussion and Conclusions
Appropriate featurisation is crucial for achieving best-in-class performance from a DNN. In this contribution, we have outlined the effect that the choice of featurisation (CM or RDC) has on the performance of our DNN [32] engineered for the prediction of XANES spectra at the Fe K-edge. In both cases, an MSE of ≤0.12 is readily achievable in only a few minutes of real-time learning, and the example estimations shown in Figures 5 and 6 demonstrate that both representations are able to deliver qualitative predictions of out-of-sample XANES spectra, even in the ninety-ninth centile. However, Figure 3 demonstrates that, with RDC featurisation, convergence of the DNN during the learning process is faster, performance is better if one is restricted to the small-sample limit, and a lower final MSE is ultimately achieved in the large-sample limit. These results evidence that the RDC is to be preferred over the CM as a representation of chemical space for this particular problem and DNN architecture.
At this point, it is worth highlighting an important difference between the CM and RDC representations. Throughout this work, we have limited the dimensions of M_IJ to 20 × 20. Optimisation of our DNN has led us to identify N = 20 as the optimal CM dimension, i.e., that which gives the lowest MSE as evaluated on out-of-sample examples: large enough to capture the necessary structural information, but not so large as to increase the propensity for overfitting. The maximum radius around an absorption site encoded into a CM of dimensions 20 × 20 is consequently system-dependent, e.g., it depends on the identities of the neighbouring atoms around the absorption site and, by extension, the packing density of the system to be featurised. Figure 7a shows a histogram of the maximum radius around the absorption site encoded into M_IJ when the dimensions are limited to 20 × 20. The modal radius is ca. 3.5 Å, which is in close agreement with, albeit slightly smaller than, the optimal cutoff radius identified for RDC featurisation (4.0 Å); this cutoff radius encodes approximately two coordination spheres around the absorption site. The smaller cutoff radius suggests that it is harder for CM featurisation to encode effectively all of the geometric information required to reproduce the XANES spectra as accurately as when using RDC featurisation. Figure 7b shows the reverse, i.e., a histogram of the CM dimensions when the radius around the absorbing atom is set to 4.0 Å. Here the modal dimension is around 25 × 25, meaning that RDC featurisation, on average, can describe the effect of a larger number of atoms around the absorbing atom.
In summary, the performance of our DNN demonstrates that it is possible to develop a highly generalisable neural network for the prediction of XANES spectra at a specific absorption edge for an arbitrary absorption site, and that the RDC is a robust local descriptor for this purpose. This represents a highly encouraging starting point for our proof-of-principle demonstration, which can be developed in a number of ways. Firstly, our theoretical XANES spectra (from which our DNN learns) are calculated under the muffin-tin approximation, and although this represents a computationally cost-effective choice for developing the underlying method, it is clear that the usefulness of our DNN can be considerably improved by moving beyond this. Secondly, the training set from which our DNN learns is composed of perfectly-ordered, homogeneous crystalline systems. While we have previously demonstrated [32] that our DNN can be applied to situations outside of this scope, its sensitivity to irregularities in the bulk such as vacancies, defects, undercoordinated sites, and the effects of lattice stress remains unclear. Finally, our DNN primarily considers only the local geometric environment around the absorption site of interest, and its ability to describe changes in the electronic charge state of the absorbing atom, and therefore to reproduce edge shifts, is still uncertain. This could be incorporated as a post-processing step by simply shifting the predicted XANES spectra; this approach is commonly used to good approximation in time-resolved XAS experiments [49,50]. It remains desirable, however, to have this included from first principles. These aspects will be the focus of future work.

Funding: The research described in this paper was funded by the Leverhulme Trust (Project RPG-2016-103) and EPSRC (EP/S022058/1, EP/R021503/1, and EP/R51309X/1). CDR is supported by a Doctoral Prize Fellowship (EP/R51309X/1). MMMM thanks Jazan University (KSA) for supporting her study and funding.
This research made use of the Rocket High Performance Computing (HPC) service at Newcastle University. CDR additionally thanks the Alan Turing Institute, via which access to the EPSRC-supported (EP/T022205/1) Joint Academic Data Science Endeavour (JADE) HPC cluster was provided under Project JAD029.

Conflicts of Interest:
The authors declare no conflict of interest.