Gaussian Processes and Polynomial Chaos Expansion for Regression Problem: Linkage via the RKHS and Comparison via the KL Divergence

In this paper, we examine two widely-used approaches, the polynomial chaos expansion (PCE) and Gaussian process (GP) regression, for the development of surrogate models. The theoretical differences between the PCE and GP approximations are discussed. A state-of-the-art PCE approach is constructed based on high precision quadrature points; however, the need for truncation may result in potential precision loss; the GP approach performs well on small datasets and allows a fine and precise trade-off between fitting the data and smoothing, but its overall performance depends largely on the training dataset. The reproducing kernel Hilbert space (RKHS) and Mercer’s theorem are introduced to form a linkage between the two methods. The theorem has proven that the two surrogates can be embedded in two isomorphic RKHS, by which we propose a novel method named Gaussian process on polynomial chaos basis (GPCB) that incorporates the PCE and GP. A theoretical comparison is made between the PCE and GPCB with the help of the Kullback–Leibler divergence. We present that the GPCB is as stable and accurate as the PCE method. Furthermore, the GPCB is a one-step Bayesian method that chooses the best subset of RKHS in which the true function should lie, while the PCE method requires an adaptive procedure. Simulations of 1D and 2D benchmark functions show that GPCB outperforms both the PCE and classical GP methods. In order to solve high dimensional problems, a random sample scheme with a constructive design (i.e., tensor product of quadrature points) is proposed to generate a valid training dataset for the GPCB method. This approach utilizes the nature of the high numerical accuracy underlying the quadrature points while ensuring the computational feasibility. Finally, the experimental results show that our sample strategy has a higher accuracy than classical experimental designs; meanwhile, it is suitable for solving high dimensional problems.


Introduction
Computer simulations are widely used in learning tasks, where a single simulation is an instance of the system [1,2]. A simple approach to the learning task is to randomly sample input variables and run the simulations for each input to obtain the features of the systems. Similar approaches are utilized in Monte Carlo techniques [3]. However, even a single simulation can be computationally costly due to its high complexity, and so, obtaining a trustworthy result via sufficient simulations becomes intractable. Mathematical methods and statistical theorems are introduced to generate surrogate models to replace the simulations, especially when dealing with complex systems with many parameters [4,5]. Although the main drawback of surrogate models is that only approximations can be obtained, they are computationally efficient whilst maintaining the essential information of the systems, hence by kernels; meanwhile, Bayesian linear regression or classification methods are introduced to utilize the prior information.
Both the PCE and GP methods build surrogates, but there are some differences between them. The PCE method builds surrogates of a random variable y as a function of another prior random variable x rather than the distribution density function itself. The PCE surrogates are based on the orthogonal polynomial basis corresponding to the p(x), so it is simple to obtain the mean and standard deviation of y. In contrast, the GP utilizes the covariance information so that it performs better in capturing the local features. Although both the PCE and GP approaches are feasible methods to compute the mean and standard deviation of y, the PCE performs more efficiently than the GP method.
As mentioned above, both the PCE and GP methods have their own trade-offs to consider when building surrogates, and there exists a connection to be explored. According to Paul Constantine's work [27], ordinary kriging (i.e., GP in geostatistics) interpolation can be viewed as a transformed version of the least squares (LS) problem, and the PCE can be viewed as the least squares with selected basis and weights. However, the GP reverts to interpolation when the noise term is zero. When taking the noise term into consideration, the Gaussian process with the kernel (i.e., covariance matrix) X T X can be viewed as a ridge regression problem [28] with a regularization term. Furthermore, different numerical methods can affect the precision of the PCE method, as well. For example, Xiu [20] analyzed the aliasing error w.r.t the projection method and interpolation method. Thus, the inherent connection of the two models cannot be simply summarized as an LS solution, and how to output a model with high precision remains an interesting question.
There are connections between the PCE and GP methods that have been explored by R. Schobi, etc. They introduced a new meta-modeling method naming PC-kriging [29] (polynomial-chaos-based kriging) to solve the problems like rare event estimation [30], structural reliability analysis [31], quantile estimation [32], etc. In their papers, the PCE models can be viewed as a special form of GP where a Dirac function is introduced as the kernel. They also proposed the idea that the PCE models have better performance in capturing the global features and that the GP models approximate the local characteristics. We would like to describe the PC-kriging method as a GP model with a PCE-form trend function along with a noise term. The global features are dominated by the PCE trend, and local structures (residuals) are approximated by the ordinary GP process. The PC-kriging model thus introduces the coefficients as parameters to be optimized, and the solution can be derived by Bayesian linear regression with the basis consisting of the PCE polynomials. They also use the LARSalgorithms to calibrate the model and to select a sparse design. They construct a rigid framework to optimize the parameters, validate and calibrate the model and evaluate the model accuracy.
Unlike the PC-kriging, which takes the PCE as a trend, this paper focuses on the construction of the kernel in the GP to solve the regression problems, through which we can combine the two methods into a unified framework, unifying positive aspects from both and in so doing refining the surrogates. In other words, we wish to find the connection between the GP and the PCE by analyzing the attribution of their solutions, and we want to propose a new approach to achieve high-precision predictions. The main idea of this paper is described as follows. Firstly, the PCE surrogate is embedded in a Hilbert space whose bases are the orthonormal polynomials themselves, then a suitable inner product and a Mercer kernel [33] are defined to build a reproducing kernel Hilbert space (RKHS) [33]. Secondly, on the other hand, the kernel of the GP can be de-composited as the product of eigenfunctions, and we can define an inner product to generate a RKHS, as well. We have explicitly elaborated the two procedures respectively and proven that the two RKHS are isometrically isomorphic. Hence, a connection between these two approaches has been established via RKHS. Furthermore, we can obtain a solution of the PCE model by solving a GP model with the Mercer kernel w.r.t the PCE polynomial basis. We name this approach Gaussian processes on polynomial chaos basis (GPCB). In order to illustrate the capability of the GPCB method, we use the Kullback-Leibler divergence [34] to explicitly compare the PDFs of the posterior prediction of the GPCB and PCE method. Provided that the true function can be approximated by a finite number of PCE bases, it can be concluded that the GPCB can converge to the optimal subset of the RKHS wherein the true function lies.
The experimental design from the PCE model, i.e., the full tensor product of quadratures in each dimension, is used in the GPCB. We have overcome two concerns about the PCE and GP, respectively. Firstly, the PCE is based on a truncated polynomial basis, while the GPCB keeps all polynomials, which can be regarded as maintaining information in every feature. Secondly, the GP's behavior depends on the experimental design; however, it often achieves the optimal result in local small datasets. The quadrature points derived from the PCE model are distributed evenly in the input space, and those points have high numerical precision w.r.t the polynomial basis; hence, they can work well with the GPCB. However, we must admit that the GPCB is still a GP approach, so when the dimension of input variables grows, the computational burden is on the table. In order to cope with the high dimensional problems, sampling strategies to lower the number of experimental designs are put on the table. The AK-MCSmethod [35] is a useful tool that adaptively selects new experimental designs; however, the experimental design tends to validate the selected surrogate model. We propose a new method that is model-free and that makes full use of the quadrature points. We randomly choose a sparse subset from the quadrature points to form a new experimental design while maintaining the accuracy. Several classical sampling strategies like MC, Halton and LHS are introduced to compare their capabilities. Our sample scheme has superior performance under the conditions in this paper. The GPCB is a novel method to build surrogate models, and it can be used for various physical problems such as reliability analysis and risk assessment.
This paper is divided into two parts. In Part 1, we discuss the mathematical rigor of the method: a brief summary of PCE and GP is presented in Section 2; the reproducing kernel Hilbert space (RKHS) is introduced to connect these two methods in Section 3; the GPCB method is proposed based on the discussion in Section 4; meanwhile, a theoretical Kullback-Leibler divergence between the GPCB and PCE method is demonstrated. In Part 2, an explicit Mehler kernel is presented with the Hermite polynomial basis in the last part of Section 4; several tests of the GPCB with some benchmark functions are presented in Section 5, along with the random constructive sampling method for high dimensional problems.

Brief Review of PCE and GP
Firstly, we want to have a clear idea of how the PCE and GP work under the circumstances that, e.g., only samples of input X and output Y are obtained. Different assumptions are made to cope with PCE and GP, respectively, and the processing procedures are presented in the following subsections.

Polynomial Chaos Expansion
Just as discussed in Section 2.2, the output is assumed to be represented by a model y = f (x) + , where f (x) : x ∈ Ω → R is the real function underneath and ∼ N (0, σ 2 ). Here, we define x = (x 1 , x 2 , . . . , x d ) T as a d-dimensional vector of independent random variables in a bounded domain Ω ⊂ R d . Suppose {x i , i = 1, . . . , d} are independent and identically distributed; the joint PDF has the form p(x) = ∏ d i=1 p(x i ). In the context of PCE, we aim to seek a surrogate of the model f (x) as an expansion of a series of orthonormal polynomials φ α (x): with Ω i the marginal domain of Ω and δ mn the Kronecker delta. Xiu et al. [20] have summarized various correspondences between the distribution and polynomial basis to form generalized polynomial chaos.
It is proven that the original model f (x) can be approximated to any degree of accuracy in a strong sense [20], e.g., mean-square norm f (x) − ∑ α∈N D β α φ α (x) in an L 2 norm defined on Ω, although f is not necessarily the span of orthonormal polynomial bases. Since we are unable to calculate an infinite series, the truncation scheme corresponding to multi-index α is introduced such that we can rearrange the polynomials. For simplicity, we can rewrite Equation (1) in the following form: We can simply solve the above system via the ordinary least squares method or the non-intrusive method. Specifically, we focus on the non-intrusive projection method, whereby we can directly obtain the coefficients by taking the expectation value of Equation (2) multiplied by φ l (x): where the second equation is derived by the numerical integration techniques, such as the Gaussian quadrature rule, and {X i , i = 1, . . . , N} and {ω i , i = 1, . . . , N} are the corresponding nodes and weights. The integration is exact when f (x) is of polynomial complexity. Together with Equations (2) and (3), f P (x) has the form: { f (X i ), i = 1, . . . , N} remain unknown to us, and usually, they are substituted by {Y i , i = 1, . . . , N}.
Note Y i = f (X i ) + i , so such a substitution will introduce noise into the surrogate; hence, the approximation error is neglected as a source of uncertainty.

Gaussian Process Regression
The analysis of the Gaussian process regression model [26] is reviewed in this section. A Gaussian prior is placed over function f (x), i.e., f (x) ∼ GP (m(x), k(x, x)), where m is the mean function and k is the kernel function, which is positive semi-definite bounded. More specifically, let X = {X i , i = 1, . . . , N} ∈ Ω N be the input data, and let Y = {Y i , i = 1, . . . , N} ∈ R N be the output data, then we have Y = f (X) + with f (X) ∼ N (m(X), k(X, X)) and ∼ N (0, σ 2 I). Bear in mind that the mathematical expression of f (x) is implicit, so f (x) is approximated to achieve the best guess prediction f G in the statistical sense. With the help of Bayes' theorem, prediction and corresponding variance at a new point x can be obtained by the following equations [36]: where K = k(X, X) ∈ R N×N as the covariance matrix with K ij = k(X i , X j ) and K x = k(X, x) ∈ R N×1 , K xx = k(x, x) ∈ R are defined similarly. Note that Equation (5) shows that the mean value of the posterior distribution can be expressed as a linear combination of N kernel functions as follows:

Links between the PCE and GP
The basic concepts of PCE and GP are discussed in Section 2. GP generates a surrogate based on Bayes' theorem and the Gaussian hypothesis; however, it is controlled by the kernel function and the experimental design and usually does not utilize prior distribution information; PCE substitutes the model with orthonormal polynomials, which is more computational efficient, but performs badly when facing noisy or big data. This section aims to build a connection between PCE and GP; hence, they can be studied in the same structure and be combined to improve the performance of the surrogates. The reproducing kernel Hilbert space will be of great help to build such a bridge, and we are going to present it as follows.

Generate an RKHS from a Mercer Kernel Constructed by the PCE Basis
We have obtained a complete orthonormal basis {φ l } of Hilbert space We can see that the PCE method generates surrogates, which are actually a linear combination of {φ l }, so there exists a unique expansion f = ∑ l f l φ l ∈ H. According to Mercer's theorem [33], we aim to define a kernel having the following form: If we have positive weights λ l that satisfy ∑ l λ l φ 2 l (x) < ∞, then for any x ∈ Ω, together with the Cauchy-Schwarz inequality, we have: By checking the right side of the above inequality, the second term is ensured in advance, then f (x) lies in a subspace of H such that: Proposition 1. H P defined in Equation (9) is an RKHS with Mercer kernel defined in Equation (7).

Generate an RKHS from the Reproducing Kernel Map Construction
We aim to compose a space of functions in which all the GP surrogates are embedded. Given Equation (6) and an arbitrary experimental design X, define a space of functions as follows: Proposition 2. H G is a pre-Hilbert space with the inner product < ·, · > H G Now that H G is a pre-Hilbert space and given the norm Proposition 3. H G defined above is the unique RKHS of the kernel k(·, ·).

Reproducing Kernel Hilbert Spaces as a Linkage
H P and H G are RKHS with the Mercer kernel and GP kernel, respectively. We are going to investigate the relationship between the two RKHS, by which we can discuss the two approaches in a unified structure. Let X be a sample set and GP kernel k(·, ·) be a real positive semi-definite kernel, then according to Mercer's theorem, k(X i , X j ) has an eigenfunction expansion: where the eigenfunctions {φ l } are orthonormal, i.e., ∈ H G with experimental design X, then we can rewrite it according to Equations (10) and (11): where c l (X) is identically determined by l and X, and it has a similar form as a function lies in H P .
Actually, given f X (x), g X (x) ∈ H G , we have: The above equation gives us the information that their inner product stands in H P , as well, so we can conclude that f X lies in H P . It also shows us that the two inner products are equivalent. Next, we are going to propose a rigid theorem to prove that the two spaces are isometrically isomorphic. Theorem 1. The reproducing kernel Hilbert space H G of a given kernel k is isometrically isomorphic to the space H P .
According to the proof of Theorem 1 in Appendix A.4, it is reasonable to introduce a weighted l 2 space l 2 1/λ because it is difficult to find a direct linear map between H G (where c l varies according to X and k) and H P (where c l varies according to the distribution of x). Figure 1 shows two flowcharts used to describe different processes in generating the RKHS.
Furthermore, the GP prediction is a combination of the kernel functions, which consist of infinite eigenfunctions, while the PCE prediction is always a combination of finite polynomial bases. The Kullback-Leibler divergence (KL divergence) is a useful criterion to indicate the performance of different surrogate models. We are going to present the comparison of the GPCB and PCE methods with the help of KL divergence in the next section.

Gaussian Process
Well defined kernel function k(·, ·) Sampling X Space H G defined in Equation (10) RKHS with k(·, ·) Eigenfunction expansion in Equation (11) Isomorphic to l 2 Space H P defined in Equation (9) Mercer Kernel defined in Equation (7) Isomorphic to l 2 1/λ RKHS with Mercer kernel

Gaussian Process on Polynomial Chaos Basis
H G and H P are isomorphic as discussed in previous Section 3, so it is natural to come up with the idea that GP can be conducted with k(·, ·) as the Mercer Kernel generated by polynomial basis in the PCE, and the new model is called Gaussian process on polynomial chaos basis (GPCB). In fact, the GPCB generates a PCE-like model, but with a different philosophy. Note that the posterior distribution of the predictions regarding experimental design {X, Y} can be calculated analytically, so we are able to compute the KL divergence as well, which are presented as follows.

Comparison of the PCE and GPCB with the Kullback-Leibler Divergence
The true distribution of the system is always implicit in practice. Without loss of generality, the underlying true system is assumed to be fP(x) = ∑M l=0β l φ l (x) if fP(x) ∈ C 0 (Ω) such that it can be approximated by the polynomials to any degree of accuracy [20].
Firstly, we presume thatM ≤ M, i.e., β in Equation (4) is an unbiased estimator ofβ. Hence, the PCE approximation can be considered as a precise approximate of the true function. We compare the performance of the GPCB and the PCE method by comparing their difference in the posterior distribution of the prediction. It is known that given experimental design {X, Y} and kernel function , the distribution of the prediction of the GPCB reads: where conditions X, Similarly, the prediction of the PCE with the projection method is as well, the previous results indicate: We can evaluate the discrepancy between p G ( f G (x)) and p P ( f P (x)), hence comparing their performance. The KL divergence can be calculated analytically: where a, b are simplified notations for the corresponding parts in Equation (16). In fact, b is the point-wise ratio between the posterior variances of the predictions, and a represents the difference between the posterior mean of the prediction. We discuss the properties of D KL starting from a special case to the general conditions hereafter. Let k be a truncated kernel with the M basis of PCE, i.e., k(x, x ) = ∑ M l=0 λ l φ l (x)φ l (x ). Actually, it can be seen as the assignment of {λ l , l > M} with the value of zero, which can be achieved by optimizing the value λ l with a specific procedure. b can be simplified as: where Λ = diag{λ l } is a diagonal matrix and K = USU T is the eigenvalue decomposition. Let s max be the maximum eigenvalue and s min be the minimal one; the above interval holds because K is a positive definite matrix. Note that b is an invariant with fixed x. On the other hand, the distribution of a is as follows: Here,Ŷ denotes the mean value of the observations, i.e., the true response. It is necessary to state that the randomness of a is brought by the random variable in observation Y. In fact, we have the expectation of D KL (p P , p G ) as follows: Presume that the observation Y is normalized, as well asŶ; hence, Ŷ 2 can be estimated as O(1). The main difference is affected by the kernel k(·, ·) (or Λ) and the term σ 2 . More specifically, The GPCB can achieve a smaller variance than the PCE method in a point-wise manner because b > 1, and the differences between the predictions of the two methods is of the order of σ 2 . Furthermore, if σ 2 is sufficient small, i.e., σ 2 s min , we have b → 1, thus E [D KL (p P , p G )] → 0. If σ 2 = 0, i.e., we investigate the noise-free models, then b = 1 and a = 0, which enforces Σ 1 = Σ 2 and µ 1 = µ 2 , respectively. This means that p G and p P are identical distributions, i.e., D KL (p P , p G ) = 0.
We can conclude that the expected value of D KL (p P , p G ) is bounded by a certain constant, which mainly depends on the σ 2 and Λ. In other words, since σ 2 is given in the prior, and the Λ are optimized; hence, the D KL (p P , p G ) is constrained, which means that the GPCB is as stable as the PCE method.
Nonetheless, if f P has reached a desired prediction precision, then f G with the kernel constructed with the same basis can have a desirable precision, as well as smaller variance.
Secondly, we consider thatM > M, where the β of the PCE method is not an unbiased estimate ofβ any longer. Let pP denote the PCE approximation with theM basis; under the circumstance, k(x, x ) = ∑M k=0 λ l φ l (x)φ l (x ) is achieved by tuning the value of Λ via a certain learning method; hence, D KL (pP, p G ) is also bounded by a constant according to Equation (19), i.e., the GPCB can converge to the precise PCE prediction pP, as well. However, the KL divergence D KL (pP, p P ) is given as: We denote φ r as the basis that belongs to the model fP, but f P . It is shown that the biased PCE f P has smaller variance, however with a bias whose mean value depends on φ r . We notice that φ r represents the high-order polynomials; hence, the bias can be considerably large for general cases, and so is the D KL (pP, p P ). We can conclude that even though the biased PCE has smaller variance, the relatively large bias can lead to a false prediction.
In fact, if the underlying system is smooth enough to be modeled by a polynomial approximation, then we can adaptively increase the number of polynomial bases (and the experimental designs if necessary) to reach a precise approximation. However, on the other hand, we can directly use the GPCB method, which is a one-step Bayesian approximation, that converges to the hypothetical true system fP. Roughly speaking, the GPCB findsM automatically by tuning the parameters Λ instead of adaptively changing the value of P in the PCE method. It indeed provides more convenience for computation. The key problem is the evaluation of Λ. Specifically, we introduce the Mehler kernel [37], which is an analytic expression of the Mercer kernel constructed by Hermite polynomials, and discuss the learning procedure of Λ l .

Construction of the Kernel with Hermite Polynomials
Recall that we have {φ α } in PCE as an orthonormal basis, then we regard them as eigenfunctions of a kernel k(·, ·). Since p(x) follows the standard Gaussian distribution, then φ Here, we denote the real l-th multi-index of the l-th polynomial in Equation (2) as d !, then according to the previous analysis, we can get φ α (l) (x) = He α (l) (x)/ √ α (l) !. We have Mehler kernel Me(X, X ) [37] with orthonormal Hermite polynomials as eigenfunctions: where the eigenvalue is λ for parameter ρ, D x is the symbol representing the row gradient operator, i.e., D x = (∂/∂x 1 , . . . , ∂/∂x d ), and D x is defined similarly. Specifically, in the one-dimensional case: where the eigenvalue λ l = ρ l > 0. The truncated kernel Me M (x, x ) = ∑ M l=0 ρ l φ l (x)φ l (x ), and its attribution can be investigated by varying ρ and M. Figure 2a illustrates the truncated kernel Me M (x, x ), which shows that Me M (x, x ) tends to converge to Me(x, x ) as M grows. Figure 2b shows the values of Me(x, −0.8) with different ρ. It presents to us that the influence of eigenvalue λ l is greater on the Mehler kernel.

Learning the Hyper-Parameter ρ of Me(x, x )
It is clear that λ l = ρ l has a great impact on the kernel values, hence affecting the convergence of the f G (x). We start with a simple example, where f (x) = 5 + x + exp(x), x ∼ N (0, 2 2 ) is the true underlying function, and the noise term is ignored in the observation. Considering the Taylor expansion of exp(x), f (x) can be approximated by the PCE with sufficiently large M. In fact, we can calculate the projection < f (x), φ l (x) > to seek the value of M. When l > M, < f (x), φ l (x) >≈ 0. On the other hand, as discussed in Section 4.1, the GPCB can find M automatically by tuning the hyper-parameter ρ. Let the experimental design X be the zeros of He 10 (x) of Equation (21), i.e., quadrature points corresponding to degree 10; we compare the performance of the GPCB with ρ equal to 0.1, 0.45, 0.7, respectively. The results are displayed in Figure 3. Note that the projection value is the absolute value of the true value in the figure for better illustration. It shows that we are able to approximate f (x) with polynomials up to degree 40. If ρ = 0.45 for the Mehler kernel, the GPCB almost converges to exact f (x), whereas ρ = 0.1 leads to a fast convergence rate and ρ = 0.7 results in a slow convergence rate.  Figure 3 shows that the ρ has a crucial impact on the performance of the GPCB method, so a tractable method to optimize ρ is needed. A natural criterion is the KL divergence, which can be minimized by finding the optimal hyper-parameters ρ. We discussed the KL divergence of the GPCB and the PCE surrogates under the assumption that the PCE surrogate model can approximate the true system to any degree of accuracy. However, the distribution of the real system is usually unknown, which makes the calculation of KL divergence intractable. In fact, it can be easily deduced that the minimization of KL divergence is equivalent to minimizing the negative log marginal likelihood ∆, which (actually is 2∆) reads: It is important to optimize ρ and σ 2 to obtain a suitable kernel to get an accurate approximation. Classical methods like gradient-based techniques can be used to search for the optimal ρ; however, it may perform poorly because it is locally optimized. As we can see in Equation (23), it is indicated that ρ should take a value between zero and one, so we can propose a global method to solve our optimization problem. The algorithm for generating a GPCB approximation is given in Algorithm 1: Algorithm 1: General procedure of the GPCB method with the Mehler kernel.

Numerical Investigation
In this section, we investigate the GPCB method for various benchmark functions. Firstly, we investigate the same example in Figure 3; however, the noise term is considered, i.e., y = f (x) + = 5 + x + exp(x) + is the observation, where x ∼ N (0, 2 2 ), ∼ N (0, 0.1 2 ). Three methods, i.e., GP with the RBF kernel, PCE and GPCB, are implemented. It is necessary to note that the Monte Carlo (MC) sampling strategy is used in the normal GP approaches, and Gaussian quadrature points are introduced in the GPCB approach. The main reason is that the quadrature points are too sparse for the widely-used kernels to capture the local features. For example, in this case, P = 10 in PCE, then the maximum quadrature point is 10.376, which is beyond 3σ x . In the first set of experiments, let P = 10 in PCE, which means 11 sample points are used in the experiments; furthermore, 10,000 samples are introduced as test dataset to output the ECDF (empirical cumulative distribution function) and RMSE (root-mean-squared error). The GP algorithm is implemented by the gpml toolbox [38] written in MATLAB with four different kernels, i.e., linear, quadratic, Gaussian and Matérn-3/2 kernels. The comparisons of the results are displayed in Figure 4.    (16). It is clear that the distribution of GPCB prediction is statistically closest to the true response, although the GP methods with quadratic, Gaussian and Matérn-3/2 kernels outperform the GPCB at some points. Figure 4b compares the ECDF of y based on the test dataset. It shows that both the PCE and GPCB have a similar ECDF with the true value. Upon closer inspection, which is shown in the magnified subregion, it is obvious that the ECDF of the GPCB is almost exactly the same as the real ECDF, which shows that GPCB has actually captured the feature of f (x) with high precision. At the same time, the RMSEs of the PCE, linear kernel, quadratic kernel, Gaussian kernel, Matérn-3/2 kernel and GPCB are 1.0570, 37.3526, 23.6383, 25.6044, 27.4498, 0.4108, respectively. We have implemented another experiment, which uses the degree P (i.e., the number of experimental design) as the second set of experiments, which are illustrated in Figure 4c. Figure 4c shows that GPCB generally outperforms the ordinary GP with the RBF kernel, which indicates that GPCB performs better with a few (or sparse) training points. It is notable that the PCE and GPCB perform with almost the same precision when the degree is greater than 16. It echoes the idea that PCE and GPCB are statistical equivalent, as we present in Section 4.1.
Similar experiments are conducted with a two-dimensional function, which is expressed as f (x) = exp(x 1 )/ exp(x 2 ). Let x 1 , x 2 ∼ N (0, 1), y = f (x) + be the real model where is an independent noise term with a normal distribution N(0, 0.1 2 I 2 ). Unlike the first test function, this test example is a limit state function. Let the maximum degree for each dimension p t be seven for the PCE method, which makes 64 training points in total. Another dataset of 10,000 independent samples is introduced as the test set to calculate the ECDF and RMSE, as well. Similarly, we have the point-wise KL divergence in the region [−2, 2; −2, 2] as shown in Figure 5a. It is clear that the GPCB is globally closer to the true distribution than other methods. The GP with quadratic, Gaussian and Matérn-3/2 kernels can approximate the center part well, while the PCE does not seem to perform as well. Figure 5b shows that the six methods except the linear kernel are able to reconstruct the distribution of the prediction, and upon closer observation, we find that the ECDF of the GPCB and Matérn-3/2 kernel are the best approximations among the six methods. We also consider another set of experiments focusing on the number of experimental designs, which equals (p t + 1) 2 for the 2D function. The RMSEs of the three methods with respect to different p t are displayed in Figure 5c. It also shows that the PCE and GPCB generally outperform the normal GP approaches, and the GPCB has the best performance. To summarize, the GPCB generates a surrogate of infinite series, while the PCE can only generate a surrogate with up to P + 1 polynomials, and they tend to behave with similar precision when P is large enough. A set of sparse quadrature points sampled in the PCE, which are derived from the Gaussian quadrature rule, is a good design for the GPCB. The GPCB with those training points generally performs better than the normal GP methods and PCE. However, the size of such a training set grows dramatically with the dimension (N = (p t + 1) d in total), so it is not practical in real-life applications. We aim to present a strategy of sampling from those quadrature points, namely candidate points in the next section, and analyze the performance of our algorithm on the selected points.

A Random Constructive Design in High Dimensional Problems
As the dimension of a system grows, so do the number of design points of PCE due to a tensor product of quadrature points in each dimension. It is possible that PCE could deal with thousands of points of training data with acceptable computational time; however, it becomes expensive for GP approaches, including our GPCB approach. Monte Carlo sampling techniques can substitute quadrature design; however, these are not always stable. Other sampling strategies like Halton sampling and Latin hypercube sampling [39] are widely used.
In this work, we want to utilize the high accuracy of quadrature points and also want to reduce the massive number of points. Let x ∈ R d , and p t is the maximum degree in each dimension, so p t + 1 quadrature points are needed in each dimension, which makes the total number of tensor products of quadrature points be #{X c } = (p t + 1) d . We seek to find a subset of the candidate design X c . Furthermore, we wish to obtain a subset having a good coverage rate in the space. Therefore, we proposed the random definite design in our paper. Note that the LHS design can be extended to a larger interval (1, (p t + 1) D ) and can produce points at midpoints (endpoints), so we use the LHS design to sample N indices from the interval. More specifically, we presume that the points in X c are equally important, so we arrange those points with a certain order to get their indices. Then, we sample from the indices with the LHS design, and each index is related to a certain quadrature point. It can be easily implemented by the MATLAB built-in function lhsdesign. The corresponding N points are what we need.
Take a three-dimensional input space as an example, where x i ∼ N (0, 1), i = 1, 2, 3. Set p t = 6, then #{X c } = 343. The candidate design and its subset of 50 points X are illustrated in Figure 6. We can see from the figure that our sampling is sparse in the whole set of candidate points, and it behaves uniformly in dimension one as illustrated in Figure 6b, with similar conclusions in the other two dimensions. When projecting our sampling from dimension three to get Figure 6c, we can see that the selected points almost cover every point of X c , which means it has all features in dimensions one and two, i.e., quadrature point values of the two dimensions. It shows that such a method can generate a sparse subset meanwhile guaranteeing the coverage rate in the whole candidate design. We name it the random constructive design. (c) Figure 6. Left: X c and X in 3D view; the blue dots represent the quadrature points, while the red points represent our samplings; middle: this shows the sparsity of our sampling in X c ; right: this shows that our sampling actually covered almost every feature of X c . (a) X c and X in 3D view; (b) one slice of X c ; (c) projection of X on X c in dimensions one and two. Now, we want to find out whether these samples retain their capability of accuracy. Firstly, we use the PCE method to test those samples. We will look into the benchmark Ishigami function [40]: f (x) = sin(x 1 ) + 7 sin 2 (x 2 ) + 0.1x 4 3 sin(x 1 ), where four different sampling strategies are compared here. Set p t = 15, and the candidate design X c has a size of 4096. The error term is eliminated in this simulation for the accuracy test. The RMSE are computed on 10,000 independently-sampled data, and the results are presented below in Figure 7. It can be seen that our sample always performs better than other samples. When the number of samples surpasses 900, the RMSE becomes 1.0605 × 10 −5 , which equals the RMSE with the whole candidate points. Therefore, we only select 20% of X c and get the same precision. Furthermore, if we set our precision to be 10 −2 , only 400 points are needed.
This shows that the quadrature points have high precision in numerical calculation. In other words, the points in the candidate set are good points. Then, the random constructive design is used with the PCE, GP and GPCB methods for the Ishigami function, with the noise term added in the observations. We take the RMSE as a criterion to compare their performance, and the results are illustrated in Figure 8. Figure 8a shows that the GPCB is always better than the PCE method, and they tend to behave the same. However, as the number of sampling points grows, the GP with the quadratic kernel, Gaussian kernel and Matérn-3/2 kernel generally outperform the other methods. We can see that the Ishigami function is a bounded function; therefore, it is likely to fill the whole observation space as the number of samples increases, hence improving the accuracy of the GP method. We plot the ECDF with respect to the three methods when N = 1000 in Figure 8b. It is clear that the GP with the quadratic kernel, Gaussian kernel and Matérn-3/2 kernel can almost recover the true distribution of the response, which is beyond the capability of the PCE and GPCB. A six-dimensional problem is being tested with the G-function [41], which is not like the Ishigami function and is unbounded in the domain [−∞, ∞] 6 : The experiment is performed with the same approaches, and the results are shown below in Figure 9. Figure 9a shows that the GPCB outperforms the PCE and GP with the Gaussian Kernel, and it has similar precision with the quadratic and Matérn-3/2 kernels. We notice that the GPCB is more stable than the PCE method, which behaves badly especially when N = 150, 300. Figure 9b shows that none of these three methods can reconstruct the probability of y very well; however, we can note that the GPCB is still comparatively the closest. Finally, we are going to present a more complicated model with 15 dimensional functions with the following form: The distribution of x is the product of 15 independent distributions, i.e., x i ∼ N (0, 1), i = 1, . . . , 15. This function is introduced by the work of O'Hagan, where a 1 , a 2 , a 3 , M are defined in [42]. We can see that this function is dominated by the linear and quadratic term, so it may be well approximated by the low-order PCE model. Let p t = 3 in the PCE model; we can see from Figure 10a that the GP with the quadratic kernel performs best among the six methods, while the PCE performs better than the GPCB and other GP methods. On the other hand, the GPCB is always generally better than the GP method except with the quadratic kernel for this function. When N = 1000, the PCE can generate y, which follows the real distribution according to the ECDF in Figure 10b.

Conclusions
This paper has examined two different surrogates of computational models, i.e., polynomial chaos expansion and Gaussian process regression. First, we present a brief review of these two approaches. Next, we discuss the relationship between PCE and GP and find that PCE and GP surrogates are embedded in two isomorphic RKHS. Mercer's theorem is introduced to generate a kernel based on a PCE basis, by which a new approach is proposed, which we name GPCB. An example shows that with the same experimental design, GPCB tends to retain useful information in a suitable subspace of the RKHS by changing the hyper-parameters, whereas PCE simply sets the information of the residual to zero. We further investigate the approximation performance on two test functions in 1D and 3D, respectively, and their approximation properties are illustrated. In order to deal with the high dimensional scenario, a random constructive design from the quadrature points is used to generate an experimental design. The results give us several directions for choosing models: basically, the GPCB outperforms the PCE, but when the original model can be well approximated by low-order PCE (Figure 10), it seems cumbersome to introduce the GPCB and GP; when the response function is bounded (Figure 8), if we have enough training resources, the GP can be a better choice; when the objective function is unbounded (Figures 4 and 9) or cannot be approximated by finite polynomials (Figure 5), we should probably choose the GPCB.
Future work can extend the family of the Mercer kernel or equivalent kernel (other than the Mehler kernel presented in this paper) beyond a classical approximation method. We can also analyze the experimental design for GP regression in many ways. Although we can see that our sampling method behaves fair enough in the experiments, there is also the opportunity to discover further suitable experimental design schemes to fit different computational purposes, which would be of great interest. The stability of our method will be investigated in future work, i.e., how many points are needed to train a good surrogate and whether our method always produces a suitable design. Furthermore, we can establish closer connections between numerical analysis and statistics via such combinations. < f (·), k(x, ·) > H P = ∑ l f l λ l φ l (x)/λ l = f (x) f or ∀ x ∈ Ω, We have the conclusion that H P is an RKHS derived from the Mercer kernel.
Appendix A.2. Proof of Proposition 2 Proof. In fact, H G is a space of all finite linear combinations of functions k(x, ·) : Ω → R, so we can denote H G := span{k(x, ·)|x ∈ Ω}, and the elements in H G have the general form of f (x) = ∑ N i=1 f i k(x, X i ). Therefore, different N and all experimental designs X are allowed, which enables that f (x) = ∑ N i=1 f i k(x, X i ), g(x) = ∑ N i=1 f i k(x, X i ) ∈ H' G . The linearity of H G is given by the following explanation.
Let t 1 , t 2 ∈ R be scalars, then we can rewrite t 1 f (1) (x) + t 2 f (2) (x) as a function f (x) such that: j , j = 1, . . . , N 2 }, X = X (1) ∪ X (2) . Additionally, we have: which means that t 1 f (1) (x) + t 2 f (2) (x) also belongs to H G . Then, we would like to show that H G is an inner product space. Let the kernel function be positive semi-definite, and we define the inner product of H G as follows: < ·, · > H G is a well-defined inner product by checking the following conditions: 1. Symmetry: < f , g > H G = ∑ i,j f i g j k(X i , X j ) = ∑ j,i g j f i k(X j , X i ) =< g, f > H G ; 2. Bi-linearity: i g j k(X (2) i , X j ) =a < f (1) , g > H G +b < f (2) , g > H G ; 3. Positive-definiteness: It is obvious that < f , f > H G = f T K f ≥ 0 with the equality iff f = 0.