Advanced Machine Learning Methods for Learning from Sparse Data in High-Dimensional Spaces: A Perspective on Uses in the Upstream of Development of Novel Energy Technologies

Machine learning (ML) has found increasing use in the physical sciences, including research on energy conversion and storage technologies, in particular so-called sustainable technologies. While ML is often used to directly optimize the parameters or phenomena of interest in the space of features, in this perspective we focus on using ML to construct objects and methods that help in, or enable, the modeling of the underlying phenomena. We highlight the need for machine learning from very sparse and unevenly distributed numeric data in multidimensional spaces in these applications. After a brief introduction of some common regression-type machine learning techniques, we focus on more advanced ML techniques which use these known methods as building blocks of more complex schemes and thereby allow working with extremely sparse data and also allow generating insight. Specifically, we will highlight the utility of representations with subdimensional functions obtained by combining the high-dimensional model representation ansatz with machine learning methods such as neural networks or Gaussian process regressions, in applications ranging from heterogeneous catalysis to nuclear energy.


Introduction
Machine learning (ML) as well as more complex techniques of artificial intelligence (AI) have been finding increasing use in research and development for novel technologies [1][2][3][4][5][6][7]. Whenever one can identify an input-output mapping whose construction is helpful in achieving a research or engineering goal, ML techniques are often of assistance. One can cast a research problem as a problem of constructing such a mapping and then searching the space of the descriptors (features or inputs) [8,9] for a desirable optimal point. One area where ML is gaining more and more traction is novel energy conversion and storage technologies. These techniques are, in particular, intensely explored for application to the development of technologies typically associated with sustainable generation and use of energy, such as advanced types (organic and inorganic materials-based) of solar cells and LEDs (light-emitting diodes) [10][11][12][13][14][15][16][17][18][19][20][21][22], inorganic and organic metal ion batteries [23,24], fuel cells, and heterogeneous catalysis generally, including electro- and photocatalysis [25][26][27][28][29][30][31][32][33][34]. This is natural in the sense that the development of these technologies often passes through the optimization and balancing of multiple factors acting simultaneously and to opposite ends; for example, in the case of organic solar cells, an optimum is to be sought between the donor's bandgap, the band offset between the donor and the acceptor, the reorganization energies of both the donor and the acceptor, the charge transfer integral, etc. [12,15,16]. These properties are themselves a function of the structure of the molecules, which can be encoded through multiple descriptors [9]. In principle, most of these properties can be computed and/or measured experimentally, but doing so for each candidate molecule is expensive.
It is enticing, based on a limited number of measurements and/or calculations, to build a mapping, with the help of machine learning, between the chosen descriptors and the properties of interest, and then to search the map for an optimal point (such as maximum power conversion efficiency in the case of a solar cell). Similarly, optima need to be sought in the design of battery materials (balancing thermodynamics and kinetics of cation insertion, stability, etc.), catalysts (balancing adsorption energies of key species, kinetics, and stability), and other types of functional materials. Some examples will be given in Section 2.
Functional materials and interfaces are key to the design of novel energy technologies. As the compositional space is too vast to be explored by brute-force and trial-and-error approaches, rational guidance of the design is necessary. Such rational guidance comes, in particular, from simulations at various scales (from the atomistic and electronic structure levels to the macroscopic level). Specifically, for technologies such as fuel cells, batteries, or solar cells, electronic properties are key to the functionality, and it is atomistic-scale, and in particular ab initio, simulations that provide insight into the mechanism at the material level and can thereby guide rational design. Bandgaps and band structures, absorption spectra, molecular adsorption energies, etc., are today routinely computable at the DFT (density functional theory) [35,36] level. Further, structural information is obtainable with techniques such as molecular dynamics [37], Monte Carlo [38], and others which can operate at larger scales than DFT. Such calculations, however, remain costly, in particular DFT, with which routine calculations are still typically limited to about 10^3 atoms. It is therefore enticing to deploy ML methods to replace such calculations or at least to reduce the required number of such calculations. Of course, ML can boost the development not only of the so-called sustainable or "green" energy technologies; we will provide an example in this perspective where it can be used to advance research useful for the future of nuclear energy.
The purpose of this perspective is not a comprehensive or even brief review of ML uses in materials science and energy technologies, nor is it to review staple ML methods; there is already ample literature on these topics. Novel energy technologies are very knowledge-deep, i.e., they require multiple steps in understanding and development; specifically, they require insight at the level of electronic structure and interatomic interactions. Many aspects are being addressed with machine learning, not just predicting material or device properties from descriptors: much is being done at the back end, or upstream, by using machine learning in the workflow of understanding and modeling of materials and phenomena for these technologies, or to improve modeling capabilities [1,[39][40][41][42][43][44]. This in particular requires ML methods and tools capable of working with extremely sparse and unevenly distributed data in multidimensional spaces, beyond the capabilities of standard methods. That is the focus of this perspective.
We will first show illustrative examples of how machine learning can be used to help improve the way we produce and use energy, especially for the benefit of so-called sustainable technologies. We will highlight the fact that in these applications, one usually has to deal with multiple variables, i.e., to operate in multidimensional spaces, and we will see that there are properties of multidimensional spaces with which one typically does not have to deal in one-, two-, or three-dimensional problems. We will show why ML techniques are effective when working with such data. While all major types of machine learning (classification, regression, clustering) are finding use in the above applications, here we focus specifically on regression-type problems, which are highly relevant, as many types of descriptors (atomic coordinates and other structural descriptors, electron densities, ionization potentials, electron affinities, atomic charges, bandgaps, and redox potentials, among others) are real-valued or can be treated as such (e.g., atomic numbers/nuclear charges). We also focus on supervised machine learning, which is also highly relevant, as in many applications the target values for training points in the space of descriptors are known (for example, potential energy, kinetic energy density, bandgap, etc.). We will briefly introduce neural networks (NN) and Gaussian process regressions (GPR) and their pros and cons. We will then focus on using the so-called high-dimensional model representations to structure the functional representation, in conjunction with NN or GPR used to build the components of those representations.
We will see that this is especially useful when trying to recover trends or functional forms from sparse data and to gain insight, using examples from catalysis and a prospective nuclear technology, as well as providing an example of how ML can revolutionize ab initio modeling capabilities, which are critical for the modeling and mechanistic understanding of materials, including those used in novel energy technologies.

Examples of Input-Output Mappings Used in ML for Energy Technologies
Humanity is trying to develop more sustainable ways to produce and use energy. Currently, the world still largely relies on fossil fuels, which are believed to be the result of absorbed solar energy [45]. The use of fossil fuels generates many pollutants, including poisonous gases and heavy metals, and it also generates large amounts of CO2. CO2 can be recycled through vegetation, but unfortunately the timeframe from dead vegetation to fossil fuels is too long to rely on such a natural carbon cycle at current and projected levels of energy consumption. There is a quest for what can be called an anthropogenic chemical carbon cycle [46]. In it, one does not rely on the burning of fossil fuels but on a set of technologies which either generate electricity directly from the sun with solar cells [47] and from other sustainable sources such as wind [48], or synthesize, using sunlight, CO2, and other inputs, fuels which can be utilized in a clean way, such as hydrogen or liquids which can be used with cleaner exhaust in fuel cells or even burnt directly [49][50][51].
Solar and wind technologies are of course intermittent, and to match their output schedule to the demand schedule, storage is necessary [52], in particular with batteries ([53] and references therein). Batteries are also needed for road transport electrification. At the same time, novel and safer types of nuclear reactors are being developed for reliable, low-carbon-footprint baseload generation [54][55][56]. All these technologies combine into an energy mix. Multiple technologies are used, and in each of these technologies, there are multiple scientific, engineering, and technological problems that need to be addressed. Bringing these technologies together to work in one system is also a challenge. Machine learning can help resolve many of these problems, and we will now show some examples.
The reader will have heard about the "hydrogen economy" and about fuel cells, which allow "burning" either hydrogen or other fuels in a clean way. At the core of these capabilities is heterogeneous catalysis, i.e., reactions happening at, and catalyzed by, surfaces. Some of the key reactions relevant to energy production and use are driven by solid catalysts, including the oxygen reduction reaction [57], which is a critical and still problematic step in fuel cells, and the Fischer-Tropsch process [58], which allows producing synthetic fuels from a mixture of hydrogen and CO (so-called syngas); the water-gas shift [59] and steam and dry reforming [60,61] are also very important reactions. These reactions permit transformation between hydrocarbons, water, hydrogen, and carbon oxides. Their efficiency depends on the quality of the catalyst. Many efficient catalysts are rare and/or expensive, such as platinum, widely used in commercial fuel cells (e.g., those used in Toyota Mirai passenger cars), in particular due to the lack of inexpensive alternatives.
Design of better catalysts is therefore a major bottleneck on the way to wide deployment of more sustainable energy technologies. Recently, machine learning has found increasing use in the rational design of catalysts [25][26][27][28][29][30][31][32][33][34]. Typically, one considers a set of descriptors of catalytic activity which include reaction thermodynamics and kinetics, adsorption energies of key reactants and intermediates, as well as structure and electronic structure descriptors of the catalyst [26,33]; some are shown in Figure 1. To sieve through candidate materials, one encodes them into a set of compositional and structural variables such as atomic numbers and positions, crystal structure and surface cut, and adsorption site. Structures and many of the properties, such as adsorption and reaction energies and kinetic barriers as well as electronic structure descriptors, are routinely computed, typically with density functional theory [36,62]. Computations play a key role in the rational design of catalysts with or without machine learning, as many properties are either not directly accessible experimentally or cannot be measured with high throughput and modest cost. Altogether, this makes for many variables on which catalytic activity depends, and one needs to perform searches and uncover relations in multidimensional spaces, which can be done with the help of machine learning [25][26][27][28][29].

Figure 1. Left: Schematic of the SQL data structure of Catalysis-Hub.org, used to store reaction energies (reactions table, green) and DFT calculations (ASE database, blue). Since each reaction energy involves several DFT calculations (and the same DFT calculations can potentially be used for several reactions), a many-to-many mapping schema is used to preserve connections between the table rows. Right: Machine learning-enhanced catalyst candidate prediction: bulk and surface structures retrieved from structure databases like materialsproject.org, OQMD, Catalysis-Hub.org, etc., are used for automated slab generation and enumeration of possible adsorption sites. In an iterative process, limited numbers of DFT-calculated adsorption energies and machine-learning-predicted adsorption energies are used to inform microkinetic models to eventually suggest promising catalyst candidates that should be investigated by experiment. Adapted with permission from [26]. Copyright 2019 Wiley-VCH Verlag GmbH.
To model a catalyzed reaction directly from reactants to products, one would need an interatomic potential energy function, and those potentials also pose a bottleneck in the modeling of heterogeneous catalysis; they are more and more often constructed with machine learning methods [1,40,63,64]. There is a key difference between the traditional fossil fuel-based energy technologies and the newer technologies based on solar cells, fuel cells, batteries, etc.: the fossil fuel-based technologies can be understood based on classical thermodynamics and mechanics, while the newer technologies critically depend on quantum phenomena, ultimately on electronic states' energies, occupancies, and localizations. Quantum mechanics-based modeling therefore takes center stage. Such modeling is difficult and costly, and machine learning is also used to facilitate it; we will show a couple of examples thereof below. This is most obvious in solar cells, where details of the electronic structure, such as the bandgap, the effective mass, etc., directly determine solar cell performance. Machine learning is now more and more often used to help design better solar cells and, specifically, materials for solar cells and other optoelectronic applications [12][13][14][15][65,66]. It is used to predict better active materials, to optimize device performance, or even to optimize fabrication processes [15]. Figure 2 shows the main uses of ML for solar cell design as well as the most widely used ML methods, summarized in a recent review [15]. Artificial neural networks (ANN) remain the most widely used method; also widely used are genetic algorithms (GA), random forests (RF), particle swarm optimization (PSO), simulated annealing (SA), support vector machines (SVM), kernel ridge regression (KRR) and Gaussian process regression (GPR), extremely randomized trees (ERT), k-nearest neighbors, and principal component analysis.
These are just some of the by-now very many machine learning methods. All three major classes of ML (classification, regression, and clustering) are finding use in solar cell research as well as in catalysis research [26]. These examples from the fields of catalysis and solar cells are merely illustrative. Machine learning (with the same methods as mentioned above) is also used to help design battery materials [23,24,67,68] and for other energy technologies, as indicated above, but ML is in demand not just at the material or device level but also at the system level [69][70][71][72]. In an energy mix containing solar farms, wind farms, and other intermittent technologies alongside more established generation methods, such as nuclear or natural gas-powered stations, one needs to constantly balance supply and demand. Predicting demand, predicting possible supply from solar and wind, and choosing how to palliate any excess or shortfall with either battery storage or a call on nuclear or hydrocarbon-based generation is an important problem where there is much potential for machine learning that is yet to be fully realized [73,74]. It also requires working with multidimensional datasets.
To illustrate specifically the challenge posed by the dimensionality of the feature space in ML for materials for novel energy technologies, consider one recent example from the literature [10] of how ML is applied to design better materials for perovskite solar cells [75], in which the main active material and light absorber is a (typically organic-inorganic hybrid) perovskite material. The composition of the perovskite can be changed by different choices of atoms placed in different sites of the crystal lattice. The perovskites most widely used in labs today contain lead, which is not desirable. Moreover, different structural choices also in principle allow one to modulate the bandgap and other properties that could better match with specific electron and hole transporting materials [76], which are in contact with the perovskite in a solar cell. In the example of [10], candidate compositions were encoded with 32 features which included properties of constituent atoms such as ionization potential, orbital radii, etc., as well as properties of the crystal. We will see below that this dimensionality, which from the point of view of applications does not appear to be high (indeed, hundreds to thousands of features are sometimes available), is in fact very high from the point of view of data density.
We point out that in all of the abovementioned applications of machine learning, a multitude of methods are used: classification methods, regression methods, clustering methods; supervised and unsupervised methods. In what follows, we will narrowly focus on specific regression-type methods based on neural networks and Gaussian process regressions, and on composite methods using them as building blocks to enable ML in high-dimensional spaces from extremely sparse data.

New Technologies and Challenges Require New Simulation Methods-A Large Scope for Machine Learning
Machine learning is also useful for improving simulation methods. The example of machine-learned interatomic potentials was cited above; ML is also used to improve the widely used DFT method by learning better exchange-correlation functionals, dispersion corrections, corrections to computed excitation energies, etc. ([39] and references therein). We will draw attention here to only one, but critically important, example that allows showcasing the power of ML. We start with a seemingly purely mathematical problem: consider a function of space ρ(x), where x is the vector of Cartesian coordinates x, y, z. The function is positive definite and integrates to N: ∫ρ(x)dx = N. We represent ρ as a sum of N positive definite pieces, each integrating to 1: ρ(x) = Σᵢρᵢ(x), ∫ρᵢ(x)dx = 1. Because they are positive definite, they are squares of some functions φᵢ: ρᵢ(x) = φᵢ²(x). It is known that this decomposition can be done so that the φᵢ are orthonormal: ∫φᵢ(x)φⱼ(x)dx = δᵢⱼ. We are interested in the quantity T = (1/2)Σᵢ∫|∇φᵢ(x)|²dx, or the corresponding integrand τ(x), which can take two alternative forms (equal upon integration): τ(x) = (1/2)Σᵢ|∇φᵢ(x)|² or τ(x) = −(1/2)Σᵢφᵢ(x)∇²φᵢ(x). We want to express T without explicit reference to φᵢ, as a function of ρ-dependent quantities only. It could be any derivative of ρ, any power of ρ, but only of ρ, without explicit reference to φᵢ. The reason why this example is important is that a solution to this problem opens the way to routine, linear-scaling, large-scale ab initio modeling of materials with the so-called orbital-free DFT (OF-DFT) [77]. We mentioned above that quantum mechanics-based modeling is critical for the understanding and design of novel energy technologies, yet the commonly used Kohn-Sham DFT is unwieldy for systems with more than about 10^3 atoms. Large-scale modeling means more realistic modeling and is required to properly account for a range of phenomena which are intrinsically large-scale (e.g., microstructure-driven properties).
In OF-DFT, ρ takes the meaning of the electron density, N is the number of electrons, φᵢ are single-electron (Kohn-Sham) orbitals [36], T is the kinetic energy, and τ(x) is the kinetic energy density (KED) (we neglect spins and partial orbital occupancy without loss of generality). The mapping T = T[ρ(x)] is the kinetic energy functional (KEF). Approximate formulas for such an expression T[ρ(x)] exist, but they are not accurate enough for use in most applications where ab initio modeling is needed, including (organic and inorganic) semiconductors and transition metal-containing functional materials of novel energy technologies. In the past several years, substantial progress has been made on this problem with the help of machine learning using, in particular, techniques such as neural networks and kernel methods [41,[78][79][80][81][82]. We will return to this example later in the context of deep learning. Some uses of ML to improve quantum mechanics-based modeling methods are reviewed in [39].
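For concreteness, two well-known textbook examples of such approximate KEFs, which depend only on ρ and its gradient (in Hartree atomic units), are the Thomas-Fermi and von Weizsäcker functionals:

```latex
% Thomas--Fermi functional (exact for the uniform electron gas):
T_{\mathrm{TF}}[\rho] = C_F \int \rho^{5/3}(\mathbf{x})\, d\mathbf{x},
\qquad C_F = \frac{3}{10}\,\bigl(3\pi^2\bigr)^{2/3}
% von Weizsaecker functional (exact for a single orbital):
T_{\mathrm{vW}}[\rho] = \frac{1}{8} \int \frac{|\nabla \rho(\mathbf{x})|^2}{\rho(\mathbf{x})}\, d\mathbf{x}
```

Neither of these, nor their standard combinations and gradient expansions, is accurate enough for general materials, which is what motivates machine-learned KEFs.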

The Curse of Dimensionality and Why ML Techniques Are Effective
We mentioned in an example above 32 features used to describe perovskite materials [10]. Is 32 dimensions high or low? It is (very) high. In multidimensional spaces, one is hit by the so-called curse of dimensionality. Imagine we sample a simple univariate function with M points, sufficient to recover the function to a desired accuracy with a common method (e.g., splines). If one wants to maintain the same density of sampling in D dimensions, the number of sampling points grows exponentially, as M^D. With M as small as 10 and D as small as 10, one would need 10^10 data. This is the curse of dimensionality. Moreover, when one constructs a function with polynomials or Fourier expansions, not only does the number of required samples grow exponentially, but the number of terms in the representation does as well. The result of this is that it is impossible to achieve a good density of sampling just by adding more data. For example, already in 20 dimensions, a million data points is equivalent to only about 2 data per degree of freedom (of an equivalent direct product grid). If we somehow managed to get 10 times more data, it would only increase the density of sampling to about 2.2 data per degree of freedom. Practically, one therefore always works with extremely sparse data. That is why 32 dimensions in the example above is in fact very high. In practice, one starts feeling this curse of dimensionality from about 6 dimensions and up.
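These counts are easy to reproduce; the short Python sketch below (illustrative only, not from the cited works) computes the equivalent per-dimension sampling density n^(1/D) of a direct-product grid:

```python
import math

def points_per_dimension(n_samples: int, dim: int) -> float:
    """Equivalent number of samples per dimension of a direct-product
    grid that holds n_samples points in dim dimensions."""
    return n_samples ** (1.0 / dim)

# Keeping a 1D density of M = 10 points in D = 10 dimensions needs M**D points:
print(f"{10**10:,}")                                  # 10,000,000,000

# In D = 20, a million points give only ~2 points per dimension,
# and ten times more data barely helps:
print(round(points_per_dimension(10**6, 20), 2))      # 2.0
print(round(points_per_dimension(10**7, 20), 2))      # 2.24
```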
However, this debilitating scaling strictly speaking only holds for direct product representations. One reason why various machine learning methods work well in multidimensional spaces is because they avoid direct product representations and thereby alleviate the problem of the curse of dimensionality. We will illustrate this point in Section 3.1.
In multidimensional spaces, some intuitions break down. For example, what is local in low dimensionality may not be in high dimensionality. An example is the Gaussian function, exp(−Σᵢ(xᵢ − x̄ᵢ)²/2σ²), which is localized around the mean x̄ in the sense that about 70% of the quadrature of this function is within one standard deviation σ from the mean in the one-dimensional case (D = 1). In six dimensions (D = 6), for example, only about 10% of the quadrature comes from within one standard deviation (in each respective dimension). That is, a multidimensional Gaussian function is no longer a localized function by this measure. Working in high dimensionality also has its advantages. For example, there is the concentration of measure whereby, as the dimension is increased, the width of the distribution of distances between data points collapses: lim(D→∞) (dist_max − dist_min)/dist_min = 0, where dist_max and dist_min are the maximum and minimum distances between the data points. We have observed in our own experience that, for example, neural networks perform better in high-dimensional spaces than in 1, 2, or 3 dimensions. This ultimately relates to the growing advantages of non-direct product representations in high dimensions. One important consequence of the low density of sampling is that the intrinsic dimensionality of the dataset is less than the dimensionality of the space [83,84]. This is obvious when the data are confined to lower-dimensional hypersurfaces or other shapes, but even if the data are not confined to such a sub-dimensional shape, simply by virtue of low density, the intrinsic dimensionality of the dataset is lowered. This ultimately justifies representations with lower-dimensional functions, which will be described below.
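The loss of locality of the Gaussian can be checked in a few lines (a sketch, not from the cited works): for a product Gaussian, the fraction of the integral within one σ of the mean in every dimension is p₁^D, with p₁ = erf(1/√2) ≈ 0.683:

```python
import math

def mass_within_one_sigma(dim: int) -> float:
    """Fraction of a product Gaussian's integral lying within one standard
    deviation of the mean simultaneously in every dimension."""
    p1 = math.erf(1.0 / math.sqrt(2.0))   # ~0.6827: the familiar 1D "68%" rule
    return p1 ** dim

print(round(mass_within_one_sigma(1), 3))   # 0.683
print(round(mass_within_one_sigma(6), 3))   # 0.101
```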

Brief Introduction to Neural Networks and Gaussian Process Regression
Artificial neural networks [85] are an example of a representation of a multivariate function with univariate functions, which are adapted to the problem by the choice of parameters. They are often presented with the help of an analogy with a biological neural network, but for regression problems we find the following interpretation more useful. We expand a function f(x), x ∈ R^D, in a basis of univariate functions σₙ with coefficients cₙ: fₖ(x) = Σₙ cₖₙσₙ(wₙ·x + bₙ). Here we added a subscript k to indicate that several outputs can be computed from the same basis. The basis functions depend on all components of x, but their arguments are scalars dependent on x as well as on weights w and biases b. This parameterization by w and b makes a flexible, non-direct product basis. This representation goes back to the Kolmogorov theorem of 1957 [86], and since then, in a series of papers, restrictions on σ have been relaxed [87][88][89][90][91][92][93][94][95]. This expression is a universal approximator even when σ is the same for all n, as long as σ is smooth and nonlinear [96]. In applications, typically σ is the same for all n and is typically a sigmoid-shaped function such as the hyperbolic tangent, σ(x) = (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ), but it does not have to be [96]. The parameters w and b and the coefficients c are fitted to reproduce a set of known samples of f, fⱼ = f(xⱼ), j = 1, . . . , M. The fit is nonlinear because σ is nonlinear.
Equation (3) (which is often written as f(x) = σ(Σₙ₌₀ᴺ cₙσ(wₙ·x + bₙ)), i.e., with a so-called output neuron, which, however, can be subsumed in the definition of f without loss of generality) expresses a so-called single-hidden-layer neural network. It is graphically represented in Figure 3, where the arrows to each σ, which are called neurons, reflect the formation of a single scalar input. These neurons form a hidden layer. The outputs are collected in a sum to one or more outputs in the output layer, i.e., one can fit a multi-sheet function, or a function and its derivatives, with the same NN. These are called output neurons. One can collect the outputs of this last layer of neurons and feed them to yet another layer of neurons, and so on, obtaining a multilayer, or deep, neural network. We stress that a single-hidden-layer network is already a universal approximator. In many applications, one layer is sufficient, and one needs to be sure that one actually needs a deep NN, as it comes at the price of a large number of nonlinear parameters. We will show examples later of where one actually needs a deep NN.
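The basis-expansion reading of Equation (3) can be made concrete with a toy sketch (our own illustration, with made-up sizes, and with the simplification that the weights w and biases b are drawn at random and held fixed, so that only the linear coefficients c are fitted; a full NN training would optimize w and b too):

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples of a target function f at M = 50 training points in 1D:
x = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)
f = np.sin(np.pi * x).ravel()

# A hidden layer of N = 20 tanh neurons: basis functions tanh(w_n x + b_n).
N = 20
w = rng.normal(scale=2.0, size=(1, N))
b = rng.uniform(-2.0, 2.0, size=N)
basis = np.tanh(x @ w + b)               # shape (M, N): neuron outputs

# With w and b frozen, the output coefficients c_n follow from a linear
# least-squares fit of f to the flexible, non-direct product basis:
c, *_ = np.linalg.lstsq(basis, f, rcond=None)

rmse = float(np.sqrt(np.mean((basis @ c - f) ** 2)))
print(f"train RMSE: {rmse:.1e}")
```

Even this crude "frozen-weights" variant fits a smooth 1D function well, illustrating how adaptable the σ(w·x + b) basis is.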
The fact that σ can be any smooth function opens new possibilities. We showed, for example, that with exponential neurons one easily obtains a sum-of-products representation with a relatively small number of terms [97]: f(x) = Σₙ cₙexp(wₙ·x + bₙ) = Σₙ cₙe^(bₙ) Πᵢ e^(wₙᵢxᵢ). Sum-of-products representations are very useful because they greatly simplify integration of the function (multidimensional integrals over the function f can be computed as sums of products of one-dimensional integrals), which is a major advantage when the dimensionality is high. In fact, sum-of-products representations are required in certain computational methods, for example, in some powerful quantum dynamics methods [98]. A more complicated way to get a sum of products is with multiplicative neural networks, which use error function types of neurons [99]. These are examples of a departure from orthodox network designs which bring significant advantages. We will show below how even more drastic changes in the architecture can bring further gains.
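The integration advantage can be demonstrated directly (a self-contained toy demonstration with a made-up function; this is not code from [97]): for f in exponential sum-of-products form, the D-dimensional integral over the unit cube reduces to products of one-dimensional integrals, ∫₀¹exp(wx)dx = (e^w − 1)/w:

```python
import math
import random
from itertools import product

random.seed(1)
D, N = 6, 4   # toy dimensionality and number of exponential neurons

# f(x) = sum_n c_n * exp(w_n . x + b_n), the form given by exponential neurons:
c = [random.uniform(0.5, 1.5) for _ in range(N)]
b = [random.uniform(-1.0, 1.0) for _ in range(N)]
w = [[random.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(N)]

def f(x):
    return sum(c[n] * math.exp(sum(w[n][i] * x[i] for i in range(D)) + b[n])
               for n in range(N))

def integral_separable():
    """Integral of f over [0,1]^D as a sum of products of 1D integrals."""
    total = 0.0
    for n in range(N):
        term = c[n] * math.exp(b[n])
        for wi in w[n]:
            term *= math.expm1(wi) / wi   # = int_0^1 exp(wi * x) dx
        total += term
    return total

# Cross-check with brute-force tensor-product Gauss-Legendre quadrature
# (3 nodes per dimension -> 3**6 = 729 evaluations of f):
nodes = [0.5 - 0.5 * math.sqrt(0.6), 0.5, 0.5 + 0.5 * math.sqrt(0.6)]
wts = [5.0 / 18.0, 8.0 / 18.0, 5.0 / 18.0]
brute = sum(f([nodes[k] for k in idx]) * math.prod(wts[k] for k in idx)
            for idx in product(range(3), repeat=D))

print(abs(integral_separable() - brute) / brute)   # small relative difference
```

Note that the separable evaluation needs only N × D one-dimensional integrals, while any direct quadrature scales exponentially with D.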

Gaussian Process Regression (GPR)
Another method we briefly highlight is Gaussian process regression (GPR) [100]. GPR answers the question: "given the set of samples fⱼ of a function f(x) at certain points in space xⱼ, what are the expectation values f(x) and their variance Δf(x) for function values at other points in space x?" One assumes that the correlation between data can be described with a kernel, a chosen type of covariance function k(x₁, x₂). The answer is given by

f(x) = K*ᵀK⁻¹f, Δf(x) = k(x, x) − K*ᵀK⁻¹K*, (5)

where f is a vector of all fⱼ values, and the matrices K and K* are computed from pairwise covariances among the data:

Kᵢⱼ = k(xᵢ, xⱼ) + δδᵢⱼ, K*ᵢ = k(x, xᵢ). (6)

The optional δ on the diagonal has the meaning of the magnitude of Gaussian noise, and it helps generalization. Note that Equation (5) as written (and as it commonly appears in the literature) holds for data (the set of fⱼ) normalized to unit variance; otherwise, the left-hand side should be multiplied by the variance of f.
The covariance function is usually chosen as one of the Matérn family of functions [101], given by

k(x₁, x₂) = σ² (2^(1−ν)/Γ(ν)) (√(2ν)‖x₁ − x₂‖/l)^ν K_ν(√(2ν)‖x₁ − x₂‖/l), (7)

where Γ is the gamma function and K_ν is the modified Bessel function of the second kind.
At different values of ν, this function becomes a Gaussian (ν → ∞), a simple exponential (ν = 1/2), and various other widely used functions (such as Matérn 3/2 and Matérn 5/2 for ν = 3/2 and 5/2, respectively). The parameters of the covariance function are the only parameters; they are hyperparameters, and they are few, as few as one (for an isotropic kernel at fixed ν). This is therefore a non-parametric method, which is an advantage. While hyperparameters still need to be chosen, the performance is usually about equally good as long as the hyperparameters are in some reasonable range. In Equation (7), the critical parameter is the length parameter l; the prefactor σ² is fully correlated with δ in Equation (6). Note that Equation (5) is a non-direct product representation.
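A minimal GPR sketch follows (our own illustration: the squared-exponential kernel used here is the ν → ∞ member of the Matérn family, and the training points, length scale l, and δ are made-up values):

```python
import numpy as np

def kernel(a, b, length=0.3):
    """Squared-exponential covariance between point sets of shape (m, d), (n, d)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=(20, 1))      # sparse, unevenly spread data
f_train = np.sin(np.pi * x_train).ravel()

delta = 1e-8                                        # small diagonal term of Eq. (6)
K = kernel(x_train, x_train) + delta * np.eye(len(x_train))
alpha = np.linalg.solve(K, f_train)                 # K^{-1} f, via a linear solve

x_new = np.linspace(-1.0, 1.0, 201).reshape(-1, 1)
K_star = kernel(x_new, x_train)
mean = K_star @ alpha                               # predictive mean of Eq. (5)
var = kernel(x_new, x_new).diagonal() - np.einsum(
    'ij,ji->i', K_star, np.linalg.solve(K, K_star.T))  # predictive variance

rmse = float(np.sqrt(np.mean((mean - np.sin(np.pi * x_new).ravel()) ** 2)))
print(f"test RMSE: {rmse:.3f}")
```

Note that the only tunable quantities are the length scale and δ, in line with the non-parametric character of the method.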

Relative Pros and Cons of GPR vs. NN
Because in GPR one has to wield matrices, and inverses of matrices, of size M × M, where M is the number of data, this method becomes increasingly costly as the dataset size grows. In fact, it quickly becomes prohibitively expensive for datasets as small as tens of thousands of data unless additional approximations are made [102]. However, the method is very accurate. In fact, GPR is the only method which we have seen outperform NNs in controlled comparisons and on the quality of observables, i.e., not just on various fitting error measures, but in controlled tests of the quality of observables computed with functions constructed using NNs or GPR [103]. GPR has been shown to be equivalent to an infinitely large neural network [104]. Contrary to an NN, it provides estimates of uncertainty (we caution, however, that Δf in Equation (5) should not be directly used to compute an error bar of the model; it reflects the uncertainty of the expectation value, which can be higher or lower than the fitting error [105,106]). It is quite costly not just to build but also to evaluate a GPR model when the dataset is more than a few thousand data. The matrix inverse also means instability when data points are close to each other, although this can be handled, e.g., by using pseudoinverses and appropriate values of δ. We also recently proposed a rectangular GPR, which achieves numeric stability and good generalization without δ [107].
Neural networks, in contrast, work well when some points are close; they work well with millions of data even on a desktop computer [78]. Their disadvantage is the large number of parameters that need to be determined in a nonlinear fit. There is therefore a danger of overfitting, and one needs to control for it carefully by a judicious choice of architecture and by using test sets. Often, as few as 5% to 20% of the available data are reserved for testing, with the rest used for training. In our experience, this is grossly insufficient, and we recommend using test sets which are at least as large as the training sets. This is a disadvantage, as it requires more data. In general, with the same number of training points, an NN achieves a lower accuracy than GPR; alternatively, one needs more data to reach the same accuracy with NN as with GPR [103].
Given the advantages of GPR, it would be desirable to have a scheme in which GPR is used in a regime where it is most efficient, i.e., with few data. The approach which we introduce next allows doing just that.

High-Dimensional Model Representation (HDMR)
We now briefly introduce high-dimensional model representation (HDMR) [108][109][110]. HDMR is an expansion over orders of coupling, similar in form to the ANOVA decomposition [111,112]. The idea is to represent the function f(x) as a sum of, first, an uncoupled approximation, then terms due to the collective action of pairs of variables, then triples, etc., up to the last term, which describes the coupling among all D variables:

f(x) = f_0 + Σ_i f_i(x_i) + Σ_{i<j} f_ij(x_i, x_j) + … + f_{12...D}(x_1, x_2, …, x_D).

If taken to this last term, the expansion is exact. If the expression is truncated at some lower order of coupling d, it is approximate. In most physical phenomena, the relative importance of these coupling terms drops rapidly with the order of coupling d. Depending on the application, one can stop at, e.g., 3rd or 2nd order, or even at 1st order, i.e., the uncoupled approximation, without incurring much error [108].
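The number of component functions retained when truncating at order d is easy to count (a small Python sketch; `hdmr_term_count` is our hypothetical helper name):

```python
import math

def hdmr_term_count(D, d):
    """Number of component functions (excluding the constant f_0) in an
    HDMR expansion over D variables truncated at coupling order d."""
    return sum(math.comb(D, k) for k in range(1, d + 1))
```

For example, for D = 15 variables there are C(15, 5) = 3003 five-dimensional terms alone, which previews the combinatorial scaling of the number of terms discussed later in this section.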
When one has stopped at some relatively small order of coupling d, one works only with low-dimensional component functions f_{i1,i2,...,id}(x_{i1}, x_{i2}, ..., x_{id}). All these component functions can be obtained from one and the same set of training points distributed in the full D-dimensional space (in which case one obtains the so-called RS (random-sampling)-HDMR [113,114]). This is advantageous on several counts: low-dimensional functions are generally easier to build, and they can be built from fewer data, e.g., with sparse data, without suffering from overfitting [115]. We indicated above that sparse data are a key problem and that, e.g., Gaussian process regression does not work well with large datasets. One can use HDMR to build a multivariate function from sparse data and stay in the "comfort zone" of the method which is used to build the component functions. A representation with lower-dimensional functions is also easier to use in applications, particularly when f needs to be integrated; HDMR in that case requires computing only low-dimensional quadratures.
In the original HDMR formulation, all component functions are mutually orthogonal, i.e.,

⟨ f_{i1,i2,...,id}, f_{j1,j2,...,jm} ⟩ = 0 for {i1, i2, ..., id} ≠ {j1, j2, ..., jm}, (9)

and are equal to multidimensional integrals (specifically, (D−d)-dimensional integrals need to be computed for d-dimensional component functions), which are quite difficult to compute when D is high [113,114]; further, all lower-dimensional component functions need to be constructed before constructing the d-dimensional functions. We proposed an extension of HDMR whereby we do not build the entire expansion but directly represent the function f with d-dimensional functions [105,[116][117][118][119].
This is achieved by dropping the orthogonality requirement. One need not build the entire HDMR expansion; it is possible to use only the terms of the desired dimensionality d that provides a desired accuracy. The lower-order terms are effectively subsumed into d-dimensional terms. Most importantly, when using machine learning to represent the component functions, no integrals need to be computed, and it is possible to alleviate the issue of combinatorial growth of the number of HDMR terms with d and D.

Machine Learning of HDMR Terms
HDMR component functions can be built with machine learning methods like neural networks [116,118,120] or Gaussian process regressions [105,106,119]. Practically, this can be done by fitting the component functions one at a time, each to the difference between the value of the function f at the training points and the sum of all other component functions:

f_{i1,i2,...,id}(x_{i1}, ..., x_{id}) ← f(x) − Σ_{(j1,...,jd) ≠ (i1,...,id)} f_{j1,...,jd}(x_{j1}, ..., x_{jd}). (11)

One then cycles through all component functions in a sort of self-consistency loop until convergence. We called the resulting methods (RS-)HDMR-NN and (RS-)HDMR-GPR, respectively [105,120]. In Equation (11), a single NN or GPR instance is used for each component function. In the case of NN, the entire HDMR representation can also be encoded into the architecture of one NN. In practice, we found this to be rather cumbersome; the advantage of Equation (11) is that existing NN engines with simple architectures can be used and work well. When using GPR, one can also achieve an HDMR form by using an HDMR-type kernel [119,121], built as a sum of d-dimensional kernels over the chosen variable combinations:

k(x, x') = Σ_{i1<i2<...<id} k_{i1,i2,...,id}((x_{i1}, ..., x_{id}), (x'_{i1}, ..., x'_{id})). (12)

The disadvantage compared with Equation (11) is that a custom kernel has to be defined, but one uses a single GPR instance, and the resulting representation is optimal in the least squares sense without the need for the fitting cycles of Equation (11). The individual component functions are then

f_{i1,i2,...,id}(x_{i1}, ..., x_{id}) = K*_{i1,i2,...,id} c, (13)

where c = K⁻¹ f and K*_{i1,i2,...,id} is a row vector whose elements are the values of the corresponding d-dimensional kernel between the prediction point and the training points. This decomposition can be used, in particular, to prune unnecessary component functions and thereby alleviate the issue of the combinatorial scaling of the number of HDMR terms with D and d [106,119]. This scaling is a major bottleneck in high-dimensional problems. It arises because in HDMR the number and dimensionality of terms are hardwired to each other: the number of d-dimensional terms is the binomial coefficient C(D, d) and can be quite high.
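The self-consistency loop of Equation (11) can be sketched on a toy additive target, with a crude binned-mean regressor standing in for the NN or GPR that would fit each component function in practice (the target function, the regressor, and all names are illustrative):

```python
import random

def binned_fit(xs, ys, nbins=10):
    """Crude 1-D regressor (binned means on [0, 1]) standing in for the
    NN/GPR instance that fits one HDMR component function."""
    sums = [0.0] * nbins
    counts = [0] * nbins
    for x, y in zip(xs, ys):
        b = min(int(x * nbins), nbins - 1)
        sums[b] += y
        counts[b] += 1
    means = [sums[b] / counts[b] if counts[b] else 0.0 for b in range(nbins)]
    return lambda x: means[min(int(x * nbins), nbins - 1)]

random.seed(0)
f = lambda x1, x2: x1 ** 2 + x2 ** 3               # toy additive target
pts = [(random.random(), random.random()) for _ in range(2000)]
ys = [f(*p) for p in pts]

comp = [lambda x: 0.0, lambda x: 0.0]              # first-order (d = 1) components
for cycle in range(5):                             # self-consistency loop, Eq. (11)
    for i in range(2):
        j = 1 - i
        # fit component i to the residual: f minus all other components
        resid = [y - comp[j](p[j]) for p, y in zip(pts, ys)]
        comp[i] = binned_fit([p[i] for p in pts], resid)

rmse = (sum((y - comp[0](p[0]) - comp[1](p[1])) ** 2
            for p, y in zip(pts, ys)) / len(pts)) ** 0.5
```

The loop converges quickly here because the toy target is exactly additive; for a real potential, the residual left after the chosen order d measures the neglected higher-order coupling.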
An even further extension that we proposed is to represent f(x) as a sum of lower-dimensional functions in new coordinates x^(i) which are linear combinations of the original coordinates:

f(x) ≈ Σ_{i=1..N} f_i(x^(i)_1, ..., x^(i)_d), x^(i) = A_i x. (14)

The full set of x^(i) is generally redundant, and the method, when used with NN component functions, was called Red-RS-HDMR-NN [117]. This has the advantage that the number of terms and their dimensionality are uncoupled, and one can achieve good accuracy with a small number N of low-dimensional terms by varying N and d independently. This is illustrated in Figure 4, where the error in the interatomic potential of vinyl bromide is plotted as a function of both the number and the dimensionality of the terms. For comparison, the standard HDMR would contain about 3000 5-dimensional terms. The coordinate transformation itself can conveniently be performed automatically (during the fit) by introducing an additional NN layer with linear neurons. A nonlinear layer can also be used, but we showed that the universal approximator property of the form of Equation (14) holds even with a linear coordinate transformation [117].
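A toy example of the form of Equation (14): the ridge function (x1 + 2·x2 − x3)³ contains a genuine three-variable coupling term (its expansion includes x1·x2·x3), yet it is captured exactly by a single (N = 1) one-dimensional (d = 1) component in a transformed coordinate. The weights and names below are illustrative; in practice, the transformation matrix would be learned during the fit by a linear NN layer:

```python
# One new coordinate x' = w . x (an illustrative row of the matrix A_1)
# and a single illustrative 1-D component function g:
w = [1.0, 2.0, -1.0]
g = lambda t: t ** 3

def f_red(x):
    """N = 1, d = 1 representation in the transformed coordinate."""
    t = sum(wi * xi for wi, xi in zip(w, x))   # linear coordinate transform
    return g(t)
```

In standard HDMR, representing a term coupling all three variables would require a d = 3 component function; here the number of terms and their dimensionality are decoupled.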

RS-HDMR-NN (Random Sampling High-Dimensional Model Representation Neural Network)
When the product of N and d in the representation of Equation (14) is smaller than the dimensionality D of the space, one obtains dimensionality reduction. Specifically, if the coordinate transformation and the component functions are built with neural networks, one effectively obtains a simple version of an autoencoder, i.e., a type of neural network that performs dimensionality reduction, typically by using layers with diminishing numbers of neurons [122].
Dimensionality reduction can be used to find the intrinsic dimensionality of the data, which may differ from the dimensionality of the space. We highlighted above that heterogeneous catalysis is an important pillar of emerging energy technologies, and modeling of catalyzed reactions with either classical or quantum dynamics requires interatomic potentials. This type of HDMR-inspired dimensionality reduction was used to construct such potentials. For example, in [123], the authors used this approach to construct an interatomic potential for the catalytic decomposition of nitrous oxide on copper, in the frozen surface approximation, and computed dissociation probabilities with it. This was the first ab initio-based interatomic potential for a polyatomic molecule-surface catalytic reaction with a full account of all molecular degrees of freedom (intramolecular as well as orientational with respect to the surface). The interatomic potential was built in a 15-dimensional configuration space that allowed preserving symmetries and asymptotic behavior even though the intrinsic dimensionality of this problem was nine. Figure 5 plots the error (on a test set) in the interatomic potential as a function of the dimensionality of a single used component function (i.e., N = 1). The method allows correctly identifying the intrinsic dimensionality as the value of d beyond which there is no further improvement, and it allowed building the function f from a dataset with only about 2 data per degree of freedom, with no overfitting [123].

RS-HDMR-GPR (Random Sampling High-Dimensional Model Representation Gaussian Process Regression)
The previous example used a combination of HDMR with neural networks; here we give examples of using a combination of HDMR with Gaussian process regression (another example of HDMR-NN will be cited in the next section). The first example has to do with the future of nuclear energy. Nuclear energy remains a great source of stable baseload electricity, which is needed in the context of intermittent sources such as solar and wind. A typical nuclear fuel cycle includes uranium mining (of uranium oxides), conversion to UF 6 gas, enrichment of the 235 UF 6 fraction, conversion back to solid uranium oxide, and then production of nuclear fuel assemblies. A key step in this cycle is uranium enrichment, which brings the fraction of uranium-235 from the naturally occurring 0.7% to 3-5% (about 99.3% of naturally occurring U is 238 U) [124]. Not all reactor types require enrichment, but most reactors operating in the world do, while other applications (e.g., defense) require much higher degrees of enrichment. The enrichment is typically done in the gaseous phase, by enriching UF 6 gas in centrifuges. This is a costly process: the enrichment cost can account for about 10% of the electricity cost [125]. For many years now, researchers and industry have been studying laser-driven enrichment, whereby one excites isotope-sensitive hyperfine transitions of either uranium or uranium hexafluoride [126,127]. This requires very high laser coherency (on the order of 10^5) and brings with it various issues. If one could instead use vibrational transitions in UF 6, some of which are isotopomer-selective (such as the mode at around 628 cm−1, which differs by about 0.6 cm−1 between 235 UF 6 and 238 UF 6, implying a necessary coherency on the order of only 10^3) [128], one could use cheaper and less coherent IR lasers.
One of us recently proposed the concept of IR laser-driven isotopomer-selective desorption of UF 6 [129]. We computed that UF 6 can be adsorbed on different graphene derivatives with an adsorption energy tunable by the choice of derivative, and that in the adsorbed state there exists an isotopomer-unique vibrational mode which can be used to heat and desorb the molecules in an isotopomer-selective way, as illustrated in Figure 6. Vibrational dynamics of UF 6 would be critical for such a technique, and accurate modeling of vibrational properties and dynamics is critical for the ability to simulate this process [130]. Unfortunately, good, well-resolved vibrational spectra of UF 6 are not even found in the experimental literature [128,131], and to compute accurate vibrational spectra or vibrational dynamics, one requires a good interatomic potential function (potential energy surface, PES), which for a UF 6 molecule is a 15-dimensional function. Accurate PESs for UF 6 are still unavailable, in particular due to the difficulty of building a 15-dimensional function from sparse data.
In [105], we applied the HDMR-based approach with Gaussian process regression component functions to construct the 15-dimensional interatomic potential of UF 6 from samples computed with density functional theory [130]. We trained the model on 2000, 3000, and 5000 data and tested its quality on 50,000 data, i.e., we used a test set much larger than the training set. The test set errors obtained with different orders d of HDMR and different numbers of training data are summarized in Table 1; more details of the calculations are given in [105]. If we first look at the fit results with 5000 training data, we see something quite expected: the higher the considered order of coupling d, the smaller the test error. The smallest error is achieved with a full-dimensional GPR (i.e., d = D, N_cf = 1). Consider now the results obtained with only 2000 training data. This is a very sparse dataset, with a sampling density of only about 1.7 data per dimension. What we observe here is that the fit with three-dimensional functions gives a better test set error than the full-dimensional fit. This is because with a low density of sampling, it is impossible to recover the full D-dimensional function [120]. The data points were sampled quasi-randomly [132] in the 15-dimensional space; they do not lie on subdimensional manifolds, but the information needed to recover the full D-dimensional terms is simply not there. Low data density also increases the danger of overfitting. All this together argues for representations with lower-dimensional functions such as those based on HDMR. Ultimately, this has to do with the fact that a finite-size dataset in a D-dimensional space is not a D-dimensional object but has a dimension anywhere between 0 and D [83,84].
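The quoted sampling density is simply the geometric-mean number of points per dimension, M^(1/D):

```python
# Effective "data per dimension" for M points in a D-dimensional space.
M, D = 2000, 15
density = M ** (1.0 / D)            # ~1.66 for the sparsest UF6 training set
density_5000 = 5000 ** (1.0 / D)    # even 5000 points give under 2 per dimension
```

This makes plain why even the largest training set here is extremely sparse by the standards of low-dimensional fitting.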

When Are Deep NNs Useful?
Finally, let us touch on the subject of so-called deep neural networks, which have now been widely popularized ("deep learning"). We consider again the problem of kinetic energy functionals (KEFs) for orbital-free DFT, a good solution of which would significantly enhance researchers' capabilities to perform large-scale ab initio simulations, as explained above. We are interested in the electronic kinetic energy or kinetic energy density (KED), which we know how to express through the orbitals φ_i with the expression of Equation (2), but which we want to express as a function of the density only. We can use any derivatives or powers of ρ or any other functions of ρ, but without reference to φ_i: T = T[ρ(x)|∇ρ, ∆ρ, ρ^n, ...] or τ = τ[ρ(x)|∇ρ, ∆ρ, ρ^n, ...]. As an example, we show in Figure 7 how the kinetic energy density τ(x) looks for aluminum, magnesium, and silicon crystals. It is very difficult or impossible to derive analytically an expression for the KEF accurate enough for use in applied simulations of most materials [77]. There is now much progress in machine learning this dependence, in particular with neural networks but also with other methods, including GPR and some of the methods mentioned earlier [41,[78][79][80][81][82]. This problem is a very stringent test for machine learning methods because the required accuracy is very high, on the order of a thousandth of a percent, unless there is significant error cancellation. We consider NN-based learning of τ[ρ(x)]. Figure 8 shows one-dimensional cuts of the kinetic energy densities of bcc Li, hcp Mg, fcc Al, and cubic diamond Si along selected directions in these crystals, computed in [78]. The black lines show the Kohn-Sham kinetic energy density that we want to machine-learn.
Neural networks were trained in a five-dimensional space of density-dependent variables, which were the terms of the fourth-order gradient expansion of the kinetic energy [133]; the five summands in its integrand served as the density-dependent variables (hence "T 4" in Figure 8).
With single hidden layer neural networks, we could accurately fit the data for each of the materials separately, visually overlapping with the black curve [78]. The red lines are from attempts to fit the kinetic energy densities of all four materials simultaneously with a single hidden layer NN. It is important to be able to do so to make sure that the resulting expression of the kinetic energy as a function of the density has a degree of portability across many materials. The result is obviously not good. With a multilayer neural network, however, we could get a good fit simultaneously for all materials, shown in Figure 8 by the turquoise lines for a four-hidden-layer NN with 20 neurons per layer [78]. This difference in the quality of the fit is not due to an inadequate number of neurons: no single-hidden-layer NN was able to learn the KED of all the materials simultaneously.
A single hidden layer NN is a universal approximator, and indeed we saw no advantage of using multilayer networks when fitting individual materials; we also saw no such advantage in our prior works on interatomic potentials [134,135]. One difference between those works and this case is the extremely uneven distribution of data. To illustrate this, we show in Figure 9 distributions (histograms) of the target function values, i.e., the kinetic energy density τ (and τ+, see Equation (2)), and of some of the density-dependent variables we used [41]. In Figure 9, p = |∇ρ|² / (4(3π²)^(2/3) ρ^(8/3)) is the scaled (to satisfy the so-called exact conditions [136]) squared gradient of the density, q = ∆ρ / (4(3π²)^(2/3) ρ^(5/3)) is the scaled Laplacian of the density, TF stands for τ_TF(r) = (3/10)(3π²)^(2/3) ρ^(5/3)(r), the Thomas-Fermi KED, and vW stands for τ_vW(r) = |∇ρ(r)|² / (8ρ(r)), the von Weizsäcker KED [138]. The KED distribution is very uneven, and the distributions of the density-dependent variables are in some cases extremely uneven. This means that there are vast parts of the space which are extremely sparsely sampled but which are still important for the quality of the model, and here we clearly see the advantage of a deep NN. Data distribution is an issue that still needs to be better addressed in machine learning; just using weighted fitting is not sufficient, as we also saw in our research [41,78].
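These descriptors are straightforward to evaluate pointwise from ρ, |∇ρ|, and ∆ρ. A sketch in atomic units (the function and variable names are ours):

```python
import math

# Common scaling constant 4*(3*pi^2)^(2/3) appearing in p and q.
C = 4.0 * (3.0 * math.pi ** 2) ** (2.0 / 3.0)

def ked_descriptors(rho, grad_norm, lap):
    """Scaled squared gradient p, scaled Laplacian q, Thomas-Fermi KED,
    and von Weizsaecker KED at one grid point (atomic units)."""
    p = grad_norm ** 2 / (C * rho ** (8.0 / 3.0))
    q = lap / (C * rho ** (5.0 / 3.0))
    tau_tf = 0.3 * (3.0 * math.pi ** 2) ** (2.0 / 3.0) * rho ** (5.0 / 3.0)
    tau_vw = grad_norm ** 2 / (8.0 * rho)
    return p, q, tau_tf, tau_vw
```

For a uniform electron gas (∇ρ = ∆ρ = 0), p, q, and the von Weizsäcker term all vanish while the Thomas-Fermi term survives, consistent with the definitions above.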
When machine learning the KED from data whose distribution is shown in Figure 9, we also confirmed that the HDMR-GPR combination is able to inform on the relative importance of different combinations of variables (based on Equation (13)) [106]. This capability goes beyond the automatic relevance determination (ARD) used with GPR, whereby the inverse of the optimal length scale parameter (l in Equation (7)) of a squared exponential kernel can be used to determine the importance of different variables [100]. With HDMR-GPR, it is the relative importance of different variable combinations that can be estimated. As non-important variable combinations can be omitted, this provides a possibility to prune the number of HDMR terms, which is important for addressing a key disadvantage of HDMR (the combinatorial scaling of the number of terms). The knowledge of the relative importance of different combinations of descriptors can be used for KEF development with other methods, not necessarily ML-based, including analytic methods.

Figure 9. Distributions (histograms) of the kinetic energy densities and density-dependent variables in a dataset combining data from Al, Mg, and Si at equilibrium geometry as well as under uniform compression and extension. Adapted with permission from [41]. Copyright 2020 American Institute of Physics.

Discussion and Conclusions
Machine learning today is very widely used to assist in the development of different technologies, including energy technologies. Novel technologies such as fuel cells, advanced solar cells, etc., are especially "knowledge-deep", i.e., they require multiple steps of understanding and development. This includes modeling at the materials level, the device level, and the system level, and at all these levels machine learning is useful, usually for the prediction of material or device properties and performance parameters from descriptors. What we attempted to showcase in this perspective is that, as far as the uses of ML for the benefit of energy technologies are concerned, there is more than meets the eye: many more aspects are being addressed with machine learning than just predicting material or device properties from descriptors. Much is being done at the back end, by using machine learning to construct objects, such as interatomic potentials, which are part of the workflow of understanding and modeling materials and phenomena for these technologies, or to improve modeling capabilities. This is important in particular for emergent energy technologies, as they require quantum mechanics-based understanding and modeling, which is difficult and costly but can be helped with machine learning.
We saw that when one learns dependences in multidimensional spaces, one always works with sparse data, and in this regime machine learning methods are effective because they avoid direct product representations. We used neural networks and Gaussian process regressions in our work. When one uses (regression-type) NNs, there is no need to be restricted to the commonly used sigmoid neurons; one can obtain useful properties, such as a sum-of-products form, with other types of activation functions. When the sampling is sparse, the dimensionality of the data is typically lower than the dimensionality of the space, and it may be impossible to reconstruct the full-dimensional function; the danger of overfitting is then also increased. We reviewed combined methods which represent a multidimensional function with low-dimensional functions of any desired dimensionality, where those component functions are built with either NN or GPR and could potentially be built with other methods. This allows working with very sparse datasets without overfitting, and it allows using machine learning methods within their "comfort zone".
Much is being discussed about deep learning and deep networks. One should not forget that a single hidden layer NN is already a universal approximator. In practice, though, we found deep NNs to be very useful when the distribution of the data is very uneven; in other applications, where the data were more uniformly distributed, we did not see an advantage from deep NNs. In general, the issue of extremely uneven data distributions in some applications is a problem that is yet to be solved in substance. Going forward, we therefore expect increased attention to, and use of, methods that go beyond commonly used tools like NN and GPR and instead use them as building blocks of more involved and powerful methods, as well as methods which are specifically tailored to address the issues of data distribution and sparsity, and the issue of missing data.