Parameter Estimation with Data-Driven Nonparametric Likelihood Functions

In this paper, we consider a surrogate modeling approach using a data-driven nonparametric likelihood function constructed on a manifold on which the data lie (or to which they are close). The proposed method represents the likelihood function using a spectral expansion formulation known as the kernel embedding of the conditional distribution. To respect the geometry of the data, we employ this spectral expansion using a set of data-driven basis functions obtained from the diffusion maps algorithm. The theoretical error estimate suggests that the error bound of the approximate data-driven likelihood function is independent of the variance of the basis functions, which allows us to determine the amount of training data needed for accurate likelihood function estimation. Supporting numerical results demonstrating the robustness of the data-driven likelihood functions for parameter estimation are given on instructive examples involving stochastic and deterministic differential equations. When the dimension of the data manifold is strictly less than the dimension of the ambient space, we find that the proposed approach (which does not require knowledge of the data manifold) is superior to likelihood functions constructed using standard parametric basis functions defined on the ambient coordinates. In an example where the data manifold is not smooth and unknown, the proposed method is more robust than an existing polynomial chaos surrogate model, the non-intrusive spectral projection, which assumes a parametric likelihood. In fact, whenever direct MCMC can be performed, the estimation accuracy is comparable to direct MCMC estimates with only eight likelihood function evaluations, which can be done offline, as opposed to 4000 sequential function evaluations. A robust, accurate estimation is also found using a likelihood function trained on statistical averages of the chaotic 40-dimensional Lorenz-96 model on a wide parameter domain.


Introduction
Bayesian inference is a popular approach for solving inverse problems with far-reaching applications, such as parameter estimation and uncertainty quantification (see for example [14,25,8]). In this article, we will focus on a classical Bayesian inference problem: estimating the conditional distribution of hidden parameters of dynamical systems from a given set of noisy observations. In particular, let x(t; θ) be a time-dependent state variable, which implicitly depends on the parameter θ through the following initial value problem,

ẋ = f(x, θ), x(0) = x_0. (1)

Here, for any fixed θ, f can be either deterministic or stochastic. Our goal is to estimate the conditional distribution of θ, given discrete-time noisy observations y† = {y†_1, ..., y†_T}, defined as

y†_i = h(x†_i) + ξ_i, i = 1, ..., T, (2)

where x†_i ≡ x(t_i; θ†) are the solutions of Eq. (1) for a specific hidden parameter θ†, h is the observation function, and ξ_i are unbiased, i.i.d. noises representing the measurement error. Although the proposed approach can also estimate the conditional density of the initial condition x_0, we will not explore this inference problem in this article. In our setup, we assume that the functional form of f in (1) and h in (2), the distribution of the noises ξ_i, and the initial condition x_0 are known.
Given a prior density, p_0(θ), Bayes' theorem states that the conditional distribution of the parameter θ can be estimated as,

p(θ|y†) ∝ p(y†|θ) p_0(θ), (3)

where p(y†|θ) denotes the likelihood function of θ given the measurements y†, which depend on a hidden parameter value θ† through (2). In most applications, the statistics of the conditional distribution p(θ|y†) are the quantity of interest. For example, one can use the mean statistic as a point estimator of θ† and the higher-order moments for uncertainty quantification. To realize this goal, one draws samples of p(θ|y†) and estimates these statistics via Monte Carlo averages over these samples. In this application, Markov chain Monte Carlo (MCMC) is a natural sampling method that plays a central role in the computational statistics behind most Bayesian inference techniques [6].
One way to define the likelihood function of θ, p(y†|θ), is as a product of the density functions p_ξ of the i.i.d. measurement noises ξ_i,

p(y†|θ) = Π_{i=1}^{T} p_ξ(y†_i − h(x†_i(θ))), (4)

where x†_i(θ) ≡ x†(t_i; θ) are the solutions of the initial value problem in Eq. (1). Although this parametric modeling approach is convenient, in general, the dependence of the likelihood function on the parameter is implicit through the solutions x†_i(θ). Practically, this implicit dependence increases the computational cost of evaluating the likelihood function, since it requires solving the dynamical model in (1) at every proposal in the MCMC chain.
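To make this cost issue concrete, the following sketch (our own toy illustration, not an example from the paper) evaluates the parametric log-likelihood corresponding to (4) for the scalar model ẋ = −θx with identity observation function and Gaussian noise. Here the state solve has a closed form; in general, this step is a full numerical integration of (1) at every proposed θ in the MCMC chain.

```python
import numpy as np

def log_likelihood(theta, y_obs, t_obs, x0=1.0, sigma=0.1):
    # Toy instance of Eq. (4): x' = -theta * x has the closed-form solution
    # x(t) = x0 * exp(-theta * t); in general this step requires a numerical
    # ODE/SDE solve at EVERY proposed theta, which is the bottleneck.
    x = x0 * np.exp(-theta * t_obs)
    # i.i.d. Gaussian measurement noise => the log-likelihood is a sum
    return np.sum(-0.5 * ((y_obs - x) / sigma) ** 2
                  - 0.5 * np.log(2 * np.pi * sigma ** 2))

# Synthetic observations from the "hidden" parameter theta_dagger = 0.7
t = np.linspace(0.1, 2.0, 20)
rng = np.random.default_rng(0)
y = 1.0 * np.exp(-0.7 * t) + 0.1 * rng.standard_normal(20)
```

With these synthetic data, the hidden parameter value scores a far higher log-likelihood than a distant one, as expected.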
Broadly speaking, the existing approaches to overcome this practical issue can be grouped into two classes. The first class consists of methods that improve or accelerate the sampling strategy; for example, Hamiltonian Monte Carlo [21], adaptive MCMC [2], and delayed rejection adaptive Metropolis [9], just to name a few. The second class consists of methods that avoid solving the dynamical model in (1) when running the MCMC chain by replacing it with a computationally more efficient model. This class of approaches, also known as surrogate modeling, includes Gaussian process models [12], polynomial chaos [19,18], and enhanced model error [13]. Another related approach, which also avoids MCMC on top of integrating (1), is to employ a polynomial expansion of the likelihood function [20]. This method represents the parametric likelihood function in (4) with orthonormal basis functions of a Hilbert space weighted by the prior measure. This choice of basis functions makes the computation of the statistics of the posterior density straightforward, and thus MCMC is not needed.
In this work, the proposed method is an extension of the surrogate modeling approach in which MCMC is employed with a nonparametric likelihood function constructed using a data-driven spectral expansion. By nonparametric, we mean that our approach does not model the likelihood function as a product of the given densities of the random noises ξ_i as in (4). Instead, we approximate the likelihood function using the kernel embedding of conditional distributions formulation introduced in [24,23]. In our application, we will extend their formulation onto a Hilbert space weighted by the sampling measure of the training dataset as in [5]. We will rigorously demonstrate that, using the orthonormal basis functions of this data-driven weighted Hilbert space, the resulting error bound is independent of the variance of the basis functions (see Sec. 3).


Conditional density estimation via reproducing kernel weighted Hilbert spaces

In our problem, we assume that the observation y ∈ N ⊆ R^n, where N is a smooth manifold with intrinsic dimension d ≤ n. In practice, we measure the observations in the ambient coordinates and denote the components as y = {y^1, ..., y^n}. For the parameter θ, the space M has a Euclidean structure with components θ = {θ^1, ..., θ^m}, so M is assumed to be either an m-dimensional hyperrectangle or R^m. For training, we are given M training parameters {θ_j}_{j=1,...,M} = {θ^1_j, ..., θ^m_j}_{j=1,...,M}. For each training parameter θ_j, we generate a discrete time series of length N of noisy observation data {y_{i,j}} = {y^1_{i,j}, ..., y^n_{i,j}} ∈ R^n for i = 1, ..., N and j = 1, ..., M. Here, the sub-indices i and j of y_{i,j} correspond to the ith observation datum for the jth training parameter θ_j. Our goal for training is to learn the conditional density p(y|θ) for arbitrary y and θ from the training dataset {θ_j}_{j=1,...,M} and {y_{i,j}}_{i=1,...,N; j=1,...,M}.
The construction of the conditional density p(y|θ) is based on a machine learning tool known as the kernel embedding of conditional distributions, introduced in [24,23]. In their formulation, the representation of conditional distributions is an element of a reproducing kernel Hilbert space (RKHS). Recently, we generalized the representation to an element of a reproducing kernel weighted Hilbert space (RKWHS) in [5]. In this work, we will use our generalized formulation in [5] to represent conditional density functions based on RKWHS. The outcome of the training is an estimate of the conditional density, p(y|θ), for arbitrary y and θ.

Review of nonparametric RKWHS representation of conditional density functions
We first review the RKWHS representation of conditional density functions deduced in [5]. Let ψ_k(y) be the orthonormal basis functions of L²(N, q), where N contains the domain of the training data y_{i,j}, and the weight function q(y) is defined with respect to the volume form inherited by N from the ambient space R^n. Let ϕ_l(θ) ∈ L²(M, q) be the orthonormal basis functions in the parameter θ space, where the training parameters are θ_j ∈ M, with weight function q(θ). For finite modes, k = 1, ..., K_1 and l = 1, ..., K_2, a nonparametric RKWHS representation of the conditional density can be written as follows [5]:

p̂(y|θ) = Σ_{k=1}^{K_1} c_{Y|θ,k} ψ_k(y) q(y), (5)

where p̂(y|θ) denotes an estimate of the conditional density p(y|θ), and the expansion coefficients are defined as

c_{Y|θ,k} = Σ_{l=1}^{K_2} [C_YΘ C_ΘΘ^{-1}]_{kl} ϕ_l(θ). (6)

Here, the matrix C_YΘ is K_1 × K_2 and the matrix C_ΘΘ is K_2 × K_2, whose components can be approximated by Monte Carlo averages [5]:

[C_YΘ]_{kl} = E[ψ_k(Y) ϕ_l(Θ)] ≈ (1/(MN)) Σ_{j=1}^{M} Σ_{i=1}^{N} ψ_k(y_{i,j}) ϕ_l(θ_j), (7)

[C_ΘΘ]_{sl} = E[ϕ_s(Θ) ϕ_l(Θ)] ≈ (1/M) Σ_{j=1}^{M} ϕ_s(θ_j) ϕ_l(θ_j), (8)

where the expectations E are taken with respect to the sampling densities of the training dataset {y_{i,j}}_{i=1,...,N; j=1,...,M} and {θ_j}_{j=1,...,M}. The equation for the expansion coefficients in Eq. (6) is based on the theory of kernel embedding of conditional distributions [24,23,5]. See [5] for the detailed proof of Eqs.
(6)-(8). Note that for the RKWHS representation, the weight functions q(y) and q(θ) can be different from the sampling densities of the training dataset {y_{i,j}}_{i=1,...,N; j=1,...,M} and {θ_j}_{j=1,...,M}, respectively. This generalizes the representation in [5], which sets the weights q(y) and q(θ) to be the sampling densities of the training dataset {y_{i,j}} and {θ_j}, respectively. Incidentally, it is worth mentioning that the conditional density in (5) and (6) is represented as a regression in infinite-dimensional spaces with basis functions ψ_k(y) and ϕ_l(θ). The expression (5) is a nonparametric representation in the sense that we do not assume any particular distribution for the density function p(y|θ). In this representation, only the training dataset {y_{i,j}}_{i=1,...,N; j=1,...,M} and {θ_j}_{j=1,...,M}, with appropriate basis functions, is used to specify the coefficients c_{Y|θ,k} and the densities p̂(y|θ). In Sec. 4, we will demonstrate how to construct the appropriate basis completely from the training data, motivated by the theoretical result in Sec. 3 below.
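As a concrete illustration of the Monte Carlo averages in (7) and (8), the following sketch assembles the matrices C_YΘ and C_ΘΘ from precomputed basis evaluations. The array shapes are our own illustrative convention: `psi[k, i, j]` holds ψ_k(y_{i,j}) and `phi[l, j]` holds ϕ_l(θ_j).

```python
import numpy as np

def estimate_matrices(psi, phi):
    # psi: (K1, N, M) array of psi_k(y_ij); phi: (K2, M) array of phi_l(theta_j).
    K1, N, M = psi.shape
    # Eq. (7): [C_Ytheta]_{kl} ~ (1/(M*N)) * sum_{i,j} psi_k(y_ij) * phi_l(theta_j)
    C_Ytheta = np.einsum('kij,lj->kl', psi, phi) / (M * N)
    # Eq. (8): [C_thetatheta]_{sl} ~ (1/M) * sum_j phi_s(theta_j) * phi_l(theta_j)
    C_thetatheta = phi @ phi.T / M
    return C_Ytheta, C_thetatheta
```

As a sanity check, with the constant basis functions ψ_0 = ϕ_0 = 1 (orthonormal with respect to any sampling density), both Monte Carlo averages reduce to 1.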

Simplification of the expansion coefficients (6)
If the weight function q(θ) is the sampling density of the training parameters {θ_j}_{j=1,...,M}, the matrix C_ΘΘ in (8) can be simplified to

[C_ΘΘ]_{sl} = E[ϕ_s(Θ) ϕ_l(Θ)] = ∫_M ϕ_s(θ) ϕ_l(θ) q(θ) dθ = δ_{sl}, (9)

where δ_{sl} is the Kronecker delta function. Here, the second equality follows from the weight q(θ) being the sampling density, and the third equality follows from the orthonormality of ϕ_l(θ) ∈ L²(M, q) with respect to the weight function q. Then the expansion coefficients c_{Y|θ,k} in (6) can be simplified as

c_{Y|θ,k} = Σ_{l=1}^{K_2} [C_YΘ]_{kl} ϕ_l(θ), (10)

with the K_1 × K_2 matrix C_YΘ still given by (7). In this work, we always take the weight function q(θ) to be the sampling density of the training parameters {θ_j}_{j=1,...,M}, so that the expansion coefficients c_{Y|θ,k} simplify as in (10). This assumption is not too restrictive since the training parameters are specified by the users.
Finally, the formula in (5), combined with the expansion coefficients c_{Y|θ,k} in (10) and the matrix C_YΘ in (7), forms an RKWHS representation of the conditional density p(y|θ) for arbitrary y and θ. Numerically, the training outcome is the matrix C_YΘ in (7), and the conditional density p̂(y|θ) is then represented by (5) with coefficients (10), using the basis functions {ψ_k(y)}_{k=1}^{K_1} and {ϕ_l(θ)}_{l=1}^{K_2}. From the above, one can see that two important questions naturally arise as a consequence of the use of the RKWHS representation. First, whether the representation p̂(y|θ) in (5) is valid for estimating the conditional density p(y|θ). Second, how to construct the orthonormal basis functions ψ_k(y) ∈ L²(N, q) and ϕ_l(θ) ∈ L²(M, q). We will address these two important questions in the next two sections.
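Given the trained matrix C_YΘ and the simplification (10), evaluating p̂(y|θ) at a query point reduces to two inner products. A minimal sketch (function and argument names are our own, illustrative choices):

```python
import numpy as np

def conditional_density(C_Ytheta, phi_at_theta, psi_at_y, q_at_y):
    # phi_at_theta: vector (phi_1(theta), ..., phi_K2(theta));
    # psi_at_y: vector (psi_1(y), ..., psi_K1(y)); q_at_y: scalar q(y).
    # Coefficients (10): c_k = sum_l [C_Ytheta]_{kl} * phi_l(theta).
    c = C_Ytheta @ phi_at_theta
    # Representation (5): p_hat(y|theta) = sum_k c_k * psi_k(y) * q(y).
    return float(c @ psi_at_y) * q_at_y
```

For instance, with K_1 = K_2 = 1, constant basis functions ψ_0 = ϕ_0 = 1, and C_YΘ = [[1]], the estimate collapses to the weight q(y) itself.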

Error estimation
In this section, we focus on the error estimation of the expansion coefficient c_{Y|θ_j,k}, and later the conditional density p̂(y|θ_j), at the training parameter θ_j. The notation c_{Y|θ_j,k} is defined as the expansion coefficient c_{Y|θ,k} in (10), evaluated at the training parameter θ_j. Let the total number of basis functions in parameter space, K_2, be equal to the total number of training parameters, M, that is, K_2 = M. Then we can obtain the following relation,

Σ_{l=1}^{M} ϕ_l(θ_s) ϕ_l(θ_j) = M δ_{sj}, s, j = 1, ..., M. (11)

For the training parameter θ_j, we can simplify the expansion coefficient, denoted ĉ_{Y|θ_j,k} since it is computed from the training data, by substituting Eq. (7) into Eq. (10),

ĉ_{Y|θ_j,k} = Σ_{l=1}^{M} [C_YΘ]_{kl} ϕ_l(θ_j) = (1/(MN)) Σ_{l=1}^{M} Σ_{s=1}^{M} Σ_{i=1}^{N} ψ_k(y_{i,s}) ϕ_l(θ_s) ϕ_l(θ_j) = (1/N) Σ_{i=1}^{N} ψ_k(y_{i,j}), (12)

where the last equality follows from (11).

Error estimation using arbitrary bases
We first study the error estimation for the expansion coefficient ĉ_{Y|θ_j,k}. For each training parameter θ_j, the conditional density function p(y|θ_j) ∈ L²(N, q^{-1}) can be analytically represented in the form,

p(y|θ_j) = Σ_{k=1}^{∞} c_{Y|θ_j,k} ψ_k(y) q(y), (13)

due to the completeness of L²(N, q). Here, the analytic expansion coefficient c_{Y|θ_j,k} is given by,

c_{Y|θ_j,k} = ∫_N ψ_k(y) p(y|θ_j) dV(y) = E_{Y|θ_j}[ψ_k(Y)]. (14)

Note that the estimator ĉ_{Y|θ_j,k} in (12) is a Monte Carlo approximation of the expansion coefficient c_{Y|θ_j,k} in (14), i.e.,

ĉ_{Y|θ_j,k} = (1/N) Σ_{i=1}^{N} ψ_k(y_{i,j}) ≈ E_{Y|θ_j}[ψ_k(Y)] = c_{Y|θ_j,k}, (15)

where the last equality follows from the training dataset {y_{i,j}}_{i=1,...,N} admitting the conditional density p(y|θ_j). Then we can obtain the following unbiasedness and consistency of the estimator ĉ_{Y|θ_j,k}.

Proposition 1 If the variance Var_{Y|θ_j}[ψ_k(Y)] is finite, then the estimator ĉ_{Y|θ_j,k} in (12) is an unbiased and consistent estimator of the expansion coefficient c_{Y|θ_j,k} in (14).
Proof The estimator ĉ_{Y|θ_j,k} is unbiased,

E[ĉ_{Y|θ_j,k}] = (1/N) Σ_{i=1}^{N} E_{Y|θ_j}[ψ_k(Y)] = c_{Y|θ_j,k}, (16)

where the expectation is taken with respect to the conditional density p(y|θ_j). If the variance, Var_{Y|θ_j}[ψ_k(Y)], is finite, then the variance of ĉ_{Y|θ_j,k} converges to 0 as the number of training data N → ∞,

Var[ĉ_{Y|θ_j,k}] = (1/N) Var_{Y|θ_j}[ψ_k(Y)] → 0. (17)

Then we can obtain that the estimator ĉ_{Y|θ_j,k} is consistent,

P(|ĉ_{Y|θ_j,k} − c_{Y|θ_j,k}| ≥ δ) ≤ Var[ĉ_{Y|θ_j,k}]/δ² → 0, for any δ > 0,

where Chebyshev's inequality has been used.
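The 1/N variance decay in (17) is easy to check empirically. In the following toy sketch (our own illustration, with Y standard normal and ψ(y) = y² as an arbitrary square-integrable test function), increasing N by a factor of 100 shrinks the variance of the Monte Carlo estimator by roughly the same factor:

```python
import numpy as np

rng = np.random.default_rng(1)

def c_hat(N):
    # Monte Carlo estimator as in Eq. (12): (1/N) * sum_i psi(y_i),
    # with psi(y) = y**2 and Y ~ N(0, 1).
    y = rng.standard_normal(N)
    return np.mean(y ** 2)

# Empirical variance of the estimator over repeated experiments:
var_small = np.var([c_hat(10) for _ in range(2000)])
var_large = np.var([c_hat(1000) for _ in range(2000)])
```

Per (17), `var_large` should be roughly 100 times smaller than `var_small`.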
If the estimator of p(y|θ_j) is given by the representation with an infinite number of basis functions, p̂(y|θ_j) = Σ_{k=1}^{∞} ĉ_{Y|θ_j,k} ψ_k(y) q(y), then the estimator p̂(y|θ_j) is pointwise unbiased for every observation y. However, in numerical implementations, only a finite number of basis functions can be used in the representation (5). Numerically, the estimator of p(y|θ_j) is given by the representation (5) at the training parameter θ_j, p̂(y|θ_j) = Σ_{k=1}^{K_1} ĉ_{Y|θ_j,k} ψ_k(y) q(y).
Then the pointwise error of the estimator, e(y|θ_j), can be defined as

e(y|θ_j) := p(y|θ_j) − p̂(y|θ_j) = Σ_{k>K_1} c_{Y|θ_j,k} ψ_k(y) q(y) + Σ_{k=1}^{K_1} (c_{Y|θ_j,k} − ĉ_{Y|θ_j,k}) ψ_k(y) q(y). (18)

It can be seen that the estimator p̂(y|θ_j) is no longer unbiased or consistent, due to the first error term in (18) induced by the modes k > K_1. Next, we estimate the expectation and the variance of an L²-norm error of p̂(y|θ_j) over all training parameters θ_j.
Theorem 1 Let the condition in Proposition 1 be satisfied for all {θ_j}_{j=1,...,M}, and let Var_{Y|θ_j}[ψ_k(Y)] be finite for all k ∈ N+. Define the L²-norm error,

‖e‖_{L²} := (Σ_{j=1}^{M} ∫_N e(y|θ_j)² q(y)^{-1} dV(y))^{1/2}, (19)

where e(y|θ_j) is the pointwise error in (18), and dV(y) is the volume form inherited by the manifold N from the ambient space R^n [4,5]. Then,

E[‖e‖_{L²}] ≤ (Σ_{j=1}^{M} Σ_{k>K_1} c_{Y|θ_j,k}² + (1/N) Σ_{j=1}^{M} Σ_{k=1}^{K_1} Var_{Y|θ_j}[ψ_k(Y)])^{1/2}, (20)

Var[‖e‖_{L²}] ≤ Σ_{j=1}^{M} Σ_{k>K_1} c_{Y|θ_j,k}² + (1/N) Σ_{j=1}^{M} Σ_{k=1}^{K_1} Var_{Y|θ_j}[ψ_k(Y)], (21)

where E and Var are defined with respect to the joint distribution of p(y|θ_j) for all {θ_j}_{j=1,...,M}. Moreover, E[‖e‖_{L²}] and Var[‖e‖_{L²}] converge to zero as K_1 → ∞ and then N → ∞, where the limiting operations of K_1 and N are not commutative.
Proof The expectation of ‖e‖_{L²} can be estimated as,

E[‖e‖_{L²}] ≤ (E[‖e‖²_{L²}])^{1/2} = (E[Σ_{j=1}^{M} ∫_N e(y|θ_j)² q(y)^{-1} dV(y)])^{1/2}, (22)

where the first inequality follows from Jensen's inequality. Here, the randomness comes from the estimators ĉ_{Y|θ_j,k}. Due to the orthonormality of the basis functions ψ_k(y) ∈ L²(N, q), the error estimation in (22) can be simplified as,

E[‖e‖_{L²}] ≤ (Σ_{j=1}^{M} Σ_{k>K_1} c_{Y|θ_j,k}² + Σ_{j=1}^{M} Σ_{k=1}^{K_1} Var[ĉ_{Y|θ_j,k}])^{1/2} = (Σ_{j=1}^{M} Σ_{k>K_1} c_{Y|θ_j,k}² + (1/N) Σ_{j=1}^{M} Σ_{k=1}^{K_1} Var_{Y|θ_j}[ψ_k(Y)])^{1/2}, (23)

where the inequality follows from the linearity of expectation, and the equality follows from E[ĉ_{Y|θ_j,k}] = c_{Y|θ_j,k} in (16) and Var[ĉ_{Y|θ_j,k}] = (1/N) Var_{Y|θ_j}[ψ_k(Y)] in (17). Note that in the error estimation (23), the first term is deterministic and the second term is random. We have so far proved that the expectation E[‖e‖_{L²}] is bounded as in (20). Similarly, we can prove that the variance Var[‖e‖_{L²}] is bounded as in (21).
Next, we prove that the expectation E[‖e‖_{L²}] converges to zero as K_1 → ∞ and then N → ∞. Parseval's theorem states that

Σ_{k=1}^{∞} c_{Y|θ_j,k}² = ∫_N p(y|θ_j)² q(y)^{-1} dV(y) < ∞, (24)

where the inequality follows from p(y|θ_j) ∈ L²(N, q^{-1}) for all θ_j. For any ε > 0, there exists an integer K_1(θ_j) for θ_j such that

Σ_{k>K_1(θ_j)} c_{Y|θ_j,k}² < ε/(2M). (25)

Let

K_1 = max_{j=1,...,M} K_1(θ_j), (26)

then the first term in (23) satisfies

Σ_{j=1}^{M} Σ_{k>K_1} c_{Y|θ_j,k}² < ε/2. (27)

Let D be an upper bound of the variances,

D = max_{k≤K_1, j=1,...,M} Var_{Y|θ_j}[ψ_k(Y)]. (28)

Then for any ε > 0, there exists a sufficiently large number of training data,

N_min = 2MK_1D/ε, (29)

such that whenever N > N_min,

(1/N) Σ_{j=1}^{M} Σ_{k=1}^{K_1} Var_{Y|θ_j}[ψ_k(Y)] ≤ MK_1D/N < ε/2. (30)

Since ε > 0 is arbitrary, by substituting Eq. (27) and Eq. (30) into the error estimation (23), we obtain that E[‖e‖_{L²}] converges to zero as K_1 → ∞ and then N → ∞. Note that we first take K_1 → ∞ to ensure that the first error term in (23) vanishes, and then take N → ∞ to ensure that the second error term in (23) vanishes. Thus, the limiting operations of K_1 and N are not commutative. Similarly, we can prove that the variance Var[‖e‖_{L²}] converges to zero as K_1 → ∞ and then N → ∞.
Theorem 1 provides an intuition for specifying the number of training observation data N to achieve any desired accuracy ε > 0, given fixed M parameters and sufficiently large K_1. It can be seen from Theorem 1 that, numerically, the expectation E[‖e‖_{L²}] in (20) and the variance Var[‖e‖_{L²}] in (21) can be bounded within an arbitrarily small ε by choosing sufficiently large K_1 and N. Specifically, there are two error terms in Eqs. (20) and (21): the first is deterministic, induced by the modes k > K_1, and the second is random, induced by the modes k ≤ K_1. For the deterministic term (k > K_1), the error can be bounded by ε/2 by choosing a sufficiently large K_1 satisfying (26). In our implementation, the number of basis functions K_1 is empirically chosen to be large enough to make the first error term in Eqs. (20) and (21) for k > K_1 as small as possible.
For the random term (k ≤ K_1), the error can be bounded by ε/2 by choosing a sufficiently large N satisfying N > N_min = 2MK_1D/ε [Eq. (29)]. The minimum number of training data, N_min, depends on the upper bound D of Var_{Y|θ_j}[ψ_k(Y)]. However, the upper bound D may not exist for some problems; that is, the assumption of finite Var_{Y|θ_j}[ψ_k(Y)] in Theorem 1 may not be satisfied. Even if the upper bound D exists, it is typically not easy to evaluate its value given an arbitrary basis ψ_k ∈ L²(N, q), since one needs to evaluate Var_{Y|θ_j}[ψ_k(Y)] for all k = 1, ..., K_1 and j = 1, ..., M. Note that Theorem 1 holds true for representing p(y|θ_j) with an arbitrary basis {ψ_k} ⊂ L²(N, q), as long as p(y|θ_j) ∈ L²(N, q^{-1}) for all θ_j and Var_{Y|θ_j}[ψ_k(Y)] is finite for k ≤ K_1 and j = 1, ..., M. Next, we provide several cases in which Var_{Y|θ_j}[ψ_k(Y)] is finite for all such k and j.
Remark 1 If the weighted Hilbert space L²(N, q) is defined on a compact manifold N and has smooth basis functions ψ_k, then Var_{Y|θ_j}[ψ_k(Y)] is finite for each fixed k ∈ N+ and j = 1, ..., M. This assertion follows from the fact that continuous functions on a compact manifold are bounded. The smoothness assumption is not unreasonable in many applications, since the orthonormal basis functions are obtained as solutions of an eigenvalue problem of a self-adjoint second-order elliptic differential operator. Note that the bound here is not necessarily a uniform bound of ψ_k(Y) for all k ∈ N+ and j = 1, ..., M. As long as Var_{Y|θ_j}[ψ_k(Y)] is finite for k ≤ K_1 and j = 1, ..., M, the upper bound D is finite and Theorem 1 holds true.
Remark 2 If the manifold N is a hyperrectangle in R^n and the weight q is a uniform distribution on N, then Var_{Y|θ_j}[ψ_k(Y)] is finite for each fixed k ∈ N+ and j = 1, ..., M. This assertion is an immediate consequence of Remark 1.
In Theorem 1, N_min depends on the upper bound D of Var_{Y|θ_j}[ψ_k(Y)], as shown in (29). In the following, we will specify a Hilbert space, referred to as a data-driven Hilbert space, such that N_min is independent of D and depends only on M, K_1, and ε. As a consequence, we can easily determine how many training data N are needed to bound the second error term in Eqs. (20) and (21).

Error estimation using a data-driven Hilbert space
We now turn to the discussion of a specific data-driven Hilbert space L²(N, q̄) with orthonormal basis functions ψ̄_k. Our goal is to specify the weight function q̄ such that the minimum number of training data, N_min, depends only on M, K_1, and ε. Here, the overline (as in q̄ and ψ̄_k) indicates quantities associated with this specific data-driven Hilbert space. The second error term in (23) can be further estimated as,

(1/N) Σ_{j=1}^{M} Σ_{k=1}^{K_1} Var_{Y|θ_j}[ψ̄_k(Y)] ≤ (1/N) Σ_{k=1}^{K_1} Σ_{j=1}^{M} E_{Y|θ_j}[ψ̄_k(Y)²], (31)

where the basis functions are substituted with the specific ψ̄_k. Notice that ψ̄_k(y) are orthonormal basis functions with respect to the weight q̄ in L²(N, q̄). One specific choice of the weight function q̄(y) is

q̄(y) = (1/M) Σ_{j=1}^{M} p(y|θ_j), (32)

where q̄(y) has been normalized, i.e.,

∫_N q̄(y) dV(y) = 1. (33)

If the weight function q̄(y) is not normalized, the error ‖e‖_{L²} in (19) changes only by a constant factor. Note that the weight function q̄(y) in (32) is a discretization of the marginal density function of Y with Θ marginalized out,

q̄(y) ≈ ∫_M p(y|θ) q(θ) dθ = ∫_M p(y, θ) dθ, (34)

where p(y, θ) denotes the joint density of (Y, Θ). Essentially, the weight function q̄(y) in (32) is the sampling density of all the training data {y_{i,j}}_{i=1,...,N; j=1,...,M}, which motivates us to refer to L²(N, q̄) as a data-driven Hilbert space.
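Since q̄ in (32) is just the sampling density of the pooled training observations, it can be approximated with any density estimator. A minimal fixed-bandwidth Gaussian kernel density estimate is sketched below (our own simplification of the kernel density estimation mentioned later; the bandwidth `h` is a hypothetical tuning parameter):

```python
import numpy as np

def q_bar(y_query, Y_train, h=0.1):
    # Y_train: pooled training data of shape (M*N, n). The estimate averages
    # Gaussian bumps centered at every training observation, mirroring the
    # mixture structure q_bar = (1/M) * sum_j p(y|theta_j) in Eq. (32).
    n_pts, dim = Y_train.shape
    d2 = np.sum((y_query[None, :] - Y_train) ** 2, axis=-1)
    return np.exp(-d2 / (2 * h * h)).sum() / (n_pts * (2 * np.pi * h * h) ** (dim / 2))
```

For example, for pooled one-dimensional standard normal data, the estimate at the origin should be close to the true density value 1/√(2π) ≈ 0.40.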
Next, we prove that, by specifying the data-driven basis functions ψ̄_k ∈ L²(N, q̄), the variance Var_{Y|θ_j}[ψ̄_k(Y)] is finite for all k ∈ N+ and j = 1, ..., M. Subsequently, we can obtain a minimum number of training data, N_min, depending only on M, K_1, and ε, such that the expectation E[‖e‖_{L²}] in (20) and the variance Var[‖e‖_{L²}] in (21) are bounded above by any ε > 0.

Proposition 2 Let ψ̄_k ∈ L²(N, q̄) be the orthonormal basis functions with the weight q̄ given by (32). Then the variance Var_{Y|θ_j}[ψ̄_k(Y)] is finite for all k ∈ N+ and j = 1, ..., M.
Proof Notice that for all k ∈ N+, we have

Σ_{j=1}^{M} E_{Y|θ_j}[ψ̄_k(Y)²] = Σ_{j=1}^{M} ∫_N ψ̄_k(y)² p(y|θ_j) dV(y) = M ∫_N ψ̄_k(y)² q̄(y) dV(y) = M, (35)

where the second equality follows from the definition of q̄ in (32), and the last equality follows directly from the orthonormality of the basis functions ψ̄_k(y) ∈ L²(N, q̄). From Eq. (35), we obtain that for all k ∈ N+ and j = 1, ..., M, the variance Var_{Y|θ_j}[ψ̄_k(Y)] ≤ E_{Y|θ_j}[ψ̄_k(Y)²] ≤ M is finite.
Theorem 2 Given the same hypotheses as in Proposition 2, we have

E[‖e‖_{L²}] ≤ (Σ_{j=1}^{M} Σ_{k>K_1} c_{Y|θ_j,k}² + MK_1/N)^{1/2}, (36)

Var[‖e‖_{L²}] ≤ Σ_{j=1}^{M} Σ_{k>K_1} c_{Y|θ_j,k}² + MK_1/N, (37)

where ‖e‖_{L²} is defined by (19) and c_{Y|θ_j,k} is given by (14), both with respect to the data-driven basis ψ̄_k. Moreover, E[‖e‖_{L²}] and Var[‖e‖_{L²}] converge to zero as K_1 → ∞ and then N → ∞, where the limiting operations of K_1 and N are not commutative.
Proof According to Proposition 2, the variance Var_{Y|θ_j}[ψ̄_k(Y)] is finite for all k ∈ N+ and j = 1, ..., M. According to Proposition 1, since Var_{Y|θ_j}[ψ̄_k(Y)] is finite, the estimator ĉ_{Y|θ_j,k} is both unbiased and consistent for c_{Y|θ_j,k}. All conditions in Theorem 1 are thus satisfied, so we obtain the error estimate of the expectation E[‖e‖_{L²}] in (20) and of the variance Var[‖e‖_{L²}] in (21).
The second error term in E[‖e‖_{L²}] (20) and in Var[‖e‖_{L²}] (21) can both be bounded using Eq. (35), which yields the error estimates (36) and (37).
Choose K_1 as in (26) such that the first term in (36) and (37) is bounded by ε/2. The second term, MK_1/N, in (36) and (37) can be bounded by an arbitrarily small ε/2 if the number of training data satisfies

N > N_min = 2MK_1/ε. (38)

Then both the expectation E[‖e‖_{L²}] and the variance Var[‖e‖_{L²}] can be bounded by ε. Since ε > 0 is arbitrary, the proof is complete.
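The criterion (38) makes the data requirement explicit; a small helper (names illustrative) computes it:

```python
def n_min_data_driven(M, K1, eps):
    # Eq. (38): with the data-driven basis, the sampling-error term M*K1/N in
    # (36)-(37) falls below eps/2 once N exceeds 2*M*K1/eps. Unlike Eq. (29),
    # the bound no longer involves the hard-to-evaluate variance bound D.
    return 2 * M * K1 / eps

# e.g. M = 64 training parameters, K1 = 100 observation modes, eps = 0.1
# gives N_min = 128000 observations per training parameter.
```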
Recall that, when applying arbitrary basis functions to represent p̂(y|θ) in (5), it is typically not easy to evaluate the upper bound D in (28), which implies that it is not easy to determine how many observation data, N_min [Eq. (29)], should be used for training. However, when applying the data-driven basis functions ψ̄_k to represent p̂(y|θ) in (5), the minimum number of training data, N_min [Eq. (38)], becomes independent of D and depends only on M, K_1, and ε, as can be seen from Theorem 2. To make the error induced by the modes k ≤ K_1 smaller than a desired ε/2, we can easily determine how many observation data, N_min [Eq. (38)], should be used for training. In this sense, the specific data-driven Hilbert space L²(N, q̄) with the corresponding basis functions ψ̄_k is a good choice for the representation (5).
We have so far theoretically verified the validity of the representation (5) for estimating the conditional density p(y|θ_j) using any appropriate basis functions [Theorem 1]. In particular, using the data-driven basis ψ̄_k(y) ∈ L²(N, q̄), we can easily control the error of the conditional density estimation by specifying the number of training data N [Theorem 2]. To summarize, the training procedure can be outlined as follows:

(1-A) Generate the training dataset, including the training parameters {θ_j}_{j=1,...,M} and the observations {y_{i,j}}_{i=1,...,N; j=1,...,M}. The length of the training data N is empirically determined based on the criteria (29) or (38).
(1-B) Construct the basis functions for the parameter θ space and for the observation y space by using the training dataset. For the y space, we need to empirically choose the number of basis functions K_1 to make the error induced by the modes k > K_1 as small as possible. In particular, for the data-driven Hilbert space, we will provide a detailed discussion on how to estimate the data-driven basis functions of L²(N, q̄) with the sampling density q̄ from the training data in Sec. 4 below. Note that this basis estimation will introduce additional errors beyond the results in this section, which assume the data-driven basis functions are given.
(1-C) Train the matrix C_YΘ in (7) and then estimate the conditional density p̂(y|θ) by using the nonparametric RKWHS representation (5) with the expansion coefficients c_{Y|θ,k} in (10).
(1-D) Finally, for new observations y† = {y†_1, ..., y†_T}, define a likelihood function as a product of the conditional densities of the new observations y†, given any θ,

p(y†|θ) = Π_{i=1}^{T} p̂(y†_i|θ). (39)

Next, we address the second important question for the RKWHS representation [Procedure (1-B)]: how to construct basis functions for θ and y. In particular, we focus on how to construct data-driven basis functions for y.
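In procedure (1-D), the trained estimate enters MCMC only through the product (39). A minimal sketch, with `p_hat` a user-supplied callable wrapping the trained representation (5); accumulating in log space is our own implementation choice for numerical stability, not something prescribed by the paper:

```python
import numpy as np

def log_likelihood(theta, y_new, p_hat):
    # Eq. (39): L(theta) = prod_i p_hat(y_i | theta). The floor guards
    # against log(0) when the estimated density underflows.
    return sum(np.log(max(p_hat(y, theta), 1e-300)) for y in y_new)
```

For instance, with a standard Gaussian stand-in p_hat(y|θ) = N(y; θ, 1), two observations exactly at y = θ yield the log-likelihood −log(2π).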

Basis functions
In this section, we address the main issue of how to construct basis functions for the parameter θ and the observation y. This section is organized as follows. In Sec. 4.1, we discuss how to employ analytic basis functions for the parameter θ and for the observation y, as in the usual polynomial chaos expansion. In Sec. 4.2, we discuss how to construct the data-driven basis functions ψ_k ∈ L²(N, q), with N being the manifold of the training dataset {y_{i,j}}_{i=1,...,N; j=1,...,M} and the weight q given by (32) being the sampling density of {y_{i,j}}_{i=1,...,N; j=1,...,M}.

Analytic basis functions
First, we address the issue of how to choose the training parameters and how to construct the corresponding basis functions for the θ space. If no prior information about the parameter space is available, we can assume that the training parameters are uniformly distributed on the parameter θ space.
In particular, we choose M well-sampled training parameters on

M = [θ^1_min, θ^1_max] × ... × [θ^m_min, θ^m_max], (40)

where × denotes a Cartesian product, and the two parameters θ^s_min and θ^s_max are the minimum and maximum values of the uniform distribution for the sth coordinate of the θ space. Here, the well-sampled uniform distribution corresponds to a regular grid, which is a tessellation of the m-dimensional Euclidean space R^m by congruent parallelotopes. The two parameters θ^s_min and θ^s_max are determined by

θ^s_min = min_j θ^s_j − γ (max_j θ^s_j − min_j θ^s_j), θ^s_max = max_j θ^s_j + γ (max_j θ^s_j − min_j θ^s_j). (41)

For M_s regularly spaced grid points θ^s_j, we set γ = 0.5(M_s − 1)^{-1} in all of our numerical examples below, where M_s is the number of training parameters in the sth coordinate. For example, see Fig. 1 for the 2D well-sampled uniformly distributed data {(5, 5), (6, 5), ..., (12, 12)} (blue circles). In this case, the 2-dimensional box M is [4.5, 12.5]² (red square).
On this simple geometry, we choose ϕ_k to be a tensor product of basis functions on each coordinate. Notice that we have taken the weight function q to be the sampling density of the training parameters in order to simplify the expansion coefficients c_{Y|θ,k} in (10). In this case, the weight q is a uniform distribution on M. Then, for the sth coordinate of the parameter, θ^s, the weight function q_s(θ^s) is a uniform distribution on the interval [θ^s_min, θ^s_max], and one can choose the following cosine basis functions,

Φ_0(θ^s) = 1, Φ_{k_s}(θ^s) = √2 cos(k_s π (θ^s − θ^s_min)/(θ^s_max − θ^s_min)), k_s ≥ 1, (42)

where the Φ_{k_s}(θ^s) form a complete orthonormal basis of L²([θ^s_min, θ^s_max], q_s). This choice of basis functions corresponds exactly to the data-driven basis functions produced by the diffusion maps algorithm on a uniformly distributed dataset on a compact interval, which will be discussed in Sec. 4.2. Although other choices, such as the Legendre polynomials, can be used, such a choice will lead to a larger value of the constant D in (29) that controls the minimum number of training data for accurate estimation. Subsequently, we set L²(M, q) = ⊗_{s=1}^{m} L²([θ^s_min, θ^s_max], q_s), where ⊗ denotes the Hilbert tensor product, and q(θ) = Π_{s=1}^{m} q_s(θ^s) is the uniform distribution on the m-dimensional box M. Correspondingly, the basis functions ϕ_k(θ) are a tensor product of the Φ_{k_s}(θ^s) for s = 1, ..., m,

ϕ_k(θ) = Π_{s=1}^{m} Φ_{k_s}(θ^s), (43)

where k = (k_1, ..., k_m) and θ = (θ^1, ..., θ^m). Based on the properties of the tensor product of Hilbert spaces, {ϕ_k(θ)} forms a complete orthonormal basis of L²(M, q).
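A sketch of the one-dimensional cosine basis in (42), together with a check of its discrete orthonormality on a regular grid (which is exactly the structure exploited in (11)); the grid below mirrors the [4.5, 12.5] example with M_s = 8:

```python
import numpy as np

def cosine_basis(k, theta, a, b):
    # Eq. (42): orthonormal basis of L^2([a, b], q) with uniform weight
    # q = 1/(b - a): Phi_0 = 1, Phi_k = sqrt(2)*cos(k*pi*(theta - a)/(b - a)).
    if k == 0:
        return np.ones_like(theta)
    return np.sqrt(2.0) * np.cos(k * np.pi * (theta - a) / (b - a))

# Regularly spaced training parameters 5, 6, ..., 12 sit at the midpoints of
# an 8-cell partition of [4.5, 12.5], so discrete orthogonality is exact:
a, b, M = 4.5, 12.5, 8
grid = a + (b - a) * (np.arange(M) + 0.5) / M
inner = np.mean(cosine_basis(1, grid, a, b) * cosine_basis(2, grid, a, b))
norm_sq = np.mean(cosine_basis(1, grid, a, b) ** 2)
```

Here `inner` vanishes and `norm_sq` equals 1, which is the discrete orthonormality underlying the relation (11) (up to the factor M).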
We now turn to the discussion of how to construct analytic basis functions for y. The approach is similar to the one for the parameter θ, except that the domain of the data is specified empirically and the weight function is chosen to correspond to some well-known analytic basis functions, independent of the sampling distribution of the data y. That is, we assume that the geometry of the data has the following tensor structure, N = N_1 × ... × N_n, where N_s is specified empirically based on the ambient space coordinates of y. Letting y^s be the sth ambient component of y, we can choose a weighted Hilbert space L²(N_s, q_s(y^s; α_s)) with the weight q_s depending on the parameters α_s and normalized to satisfy ∫_R q_s(y^s; α_s) dy^s = 1. For each coordinate, let Ψ_{k_s}(y^s; α_s) be the corresponding orthonormal basis functions, which possess analytic expressions. Subsequently, we can obtain a set of complete orthonormal basis functions ψ_k ∈ L²(N, q) for y by taking the tensor product of these Ψ_{k_s}, as in (43).
For example, if the weight q_s is uniform, N_s ⊂ R is simply a one-dimensional interval. In this case, we can choose the cosine basis functions Ψ_{k_s} for y as in (42), such that the parameters α_s correspond to the boundaries of the domain N_s, which can be estimated as in (41). In our numerical experiments below, we will set γ = 0.1. Another choice is to set the weight q_s(y^s; α_s) to be Gaussian. In this case, the domain is assumed to be the real line, N_s = R. For this choice, the corresponding orthonormal basis functions Ψ_{k_s} are the Hermite polynomials, and the parameters α_s, corresponding to the mean and variance of the Gaussian distribution, can be empirically estimated from the training data.
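For the Gaussian-weight choice, the normalized probabilists' Hermite polynomials He_k/√(k!) are orthonormal with respect to the standard Gaussian density. A sketch for one coordinate (the standardization by a mean `mu` and standard deviation `s` reflects the empirically estimated parameters α_s; the Monte Carlo check is our own illustration):

```python
import numpy as np
from math import factorial, sqrt
from numpy.polynomial.hermite_e import hermeval

def hermite_basis(k, y, mu=0.0, s=1.0):
    # Probabilists' Hermite polynomial He_k of the standardized variable,
    # normalized so that E[Psi_j * Psi_k] = delta_jk under N(mu, s^2).
    z = (np.asarray(y) - mu) / s
    coef = np.zeros(k + 1)
    coef[k] = 1.0
    return hermeval(z, coef) / sqrt(factorial(k))

# Monte Carlo orthonormality check under the standard Gaussian weight:
rng = np.random.default_rng(2)
z = rng.standard_normal(200_000)
norm_sq = np.mean(hermite_basis(2, z) ** 2)                  # close to 1
cross = np.mean(hermite_basis(1, z) * hermite_basis(2, z))   # close to 0
```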
In the remainder of this paper, we will always use the cosine basis functions for θ. The application of (5) using cosine basis functions for y is referred to as the Cosine representation, and the application of (5) using Hermite basis functions for y is referred to as the Hermite representation. In the next section, we introduce a method to construct data-driven basis functions ψ_k ∈ L²(N, q) that respects both the geometry and the sampling distribution of the training data {y_{i,j}}_{i=1,...,N; j=1,...,M}.

Data-driven basis functions
In this section, we discuss how to construct a set of data-driven basis functions ψ_k ∈ L²(N, q), with N being the manifold of the training dataset {y_{i,j}}_{i=1,...,N; j=1,...,M} and the weight q in (32) being the sampling density of {y_{i,j}} for all i = 1, ..., N and j = 1, ..., M. The issues here are that the analytic expression of the sampling density q is unknown and that the Riemannian metric inherited by the data manifold N from the ambient space R^n is also unknown. Fortunately, these issues can be overcome by the diffusion maps algorithm [7,4,5], described below.

Learning the data-driven basis functions
Given a dataset y_{i,j} ∈ N ⊆ R^n with sampling density q(y) in (32), defined with respect to the volume form inherited by the manifold N from the ambient space R^n, one can use the kernel-based diffusion maps algorithm to construct an MN × MN matrix L that approximates a weighted Laplacian operator, L = ∇log(q)·∇ + Δ, acting on functions that satisfy Neumann boundary conditions when the manifold N is compact with boundary. The eigenvectors ψ_k of the matrix L are discrete approximations of the eigenfunctions ψ_k(y) of the operator L, which form an orthonormal basis of the weighted Hilbert space L^2(N, q). Each component of an eigenvector ψ_k ∈ R^{MN} is a discrete estimate of the eigenfunction ψ_k(y_{i,j}) evaluated at the training data point y_{i,j}. The sampling density q defined in (32) is estimated using a kernel density estimation method [3].
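A minimal fixed-bandwidth illustration of this construction is sketched below (this is a hypothetical simplification, not the variable-bandwidth algorithm of [4]): a Gaussian kernel matrix is Markov-normalized, and the eigenvectors of the symmetrized matrix give discrete approximations of eigenfunctions of L = ∇log(q)·∇ + Δ.

```python
import numpy as np

def dm_basis(Y, eps, K):
    """Fixed-bandwidth diffusion maps sketch.  Y is an (m, n) data array;
    returns the first K discrete eigenvalues/eigenvectors approximating
    eigenpairs of L = grad(log q) . grad + Laplacian."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    W = np.exp(-d2 / (4.0 * eps))                        # Gaussian kernel matrix
    d = W.sum(axis=1)                                    # density estimate (up to a constant)
    S = W / np.sqrt(d[:, None] * d[None, :])             # symmetric conjugate of P = D^{-1} W
    vals, vecs = np.linalg.eigh(S)
    idx = np.argsort(vals)[::-1][:K]                     # leading eigenpairs
    psi = vecs[:, idx] / np.sqrt(d)[:, None]             # right eigenvectors of P
    lam = (vals[idx] - 1.0) / eps                        # discrete eigenvalues of L
    return lam, psi
```

For uniformly distributed one-dimensional data, the leading eigenvector is constant (eigenvalue zero), consistent with the Neumann spectrum discussed below.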
In contrast to the analytic continuous basis functions of Sec. 4.1, the data-driven basis functions ψ_k(y) ∈ L^2(N, q) are represented nonparametrically by the discrete eigenvectors ψ_k ∈ R^{MN} obtained from the diffusion maps algorithm. The outcome of training is a discrete estimate of the conditional density, p̂(y_{i,j}|θ), which evaluates the representation p̂(y|θ) in (5) at each training data point y_{i,j}.
In our implementation, we use the variable-bandwidth diffusion maps (VBDM) algorithm introduced in [4], which extends diffusion maps to non-compact manifolds without boundary. See the supplementary material of [10] for the MATLAB code of this algorithm. We should point out that this discrete approximation induces errors in the basis functions, which are estimated in detail in [4]. These errors are in addition to the error estimates in Sec. 3.
We note that if the data are uniformly distributed on a one-dimensional bounded interval, then the VBDM solutions are the cosine basis functions, which are eigenfunctions of the Laplacian operator on a bounded interval with Neumann boundary conditions. This means that the cosine functions in (42) used to represent each component of θ are analogs of the data-driven basis functions. The difference is that with the parametric choice in (42), one avoids using VBDM at the expense of specifying the boundaries of the domain, [θ^s_min, θ^s_max]. In the remainder of this paper, we refer to an application of (5) with cosine basis functions for θ and VBDM basis functions for y as the VBDM representation.
However, a direct application of the VBDM algorithm suffers from expensive computational cost for large training datasets. In Sec. 4.2.2, we address how to construct basis functions from a large number of training data. A second issue arises from the discrete representation of the conditional density in the observation space using the VBDM algorithm: the VBDM representation, p̂(y_{i,j}|θ), is only estimated at the training data points y_{i,j}. A natural question is how to extend the representation to new observations y_t ∉ {y_{i,j}} that are not part of the training dataset [Procedures (1-C) and (1-D)]. In Sec. 4.2.3, we introduce an extension method for approximating the representation on new observations.

Data reduction
When MN is very large, the VBDM algorithm becomes numerically expensive since it involves solving an eigenvalue problem for a matrix of size MN × MN. Notice that the number of training parameters M grows exponentially as a function of the parameter dimension m if well-sampled uniformly distributed training parameters are used. To overcome this large-data problem, we employ an empirical data reduction method that reduces the original MN training data points {y_{i,j}} to a small number B (≪ MN) of training data points while preserving the sampling density q(y) in (32). Subsequently, we apply the VBDM algorithm on these reduced training data points.
The basic idea of our method is to first cluster the training dataset {y_{i,j}} into B boxes and then take the average of the data points in each box as a reduced training data point. First, we cluster the training data {y_{i,j}}, ordered by the first coordinate y^1_{i,j}, into B_1 groups such that each group has the same number MN/B_1 of data points. After this first clustering, we obtain B_1 groups, each denoted by G^1_{k_1} for k_1 = 1, ..., B_1; the superscript 1 denotes the first clustering and the subscript k_1 the k_1-th group. Second, for each group G^1_{k_1}, we cluster the training data inside G^1_{k_1}, ordered by the second coordinate y^2_{i,j}, into B_2 groups such that each group has the same number MN/(B_1 B_2) of data points. After the second clustering, we obtain B_1 B_2 groups in total, each denoted by G^2_{k_1 k_2} for k_1 = 1, ..., B_1 and k_2 = 1, ..., B_2. We repeat this clustering n times, where n is the dimension of the ambient space of the observation y. After n clusterings, we obtain B ≡ Π_{s=1}^n B_s groups, each denoted by G^n_{k_1 k_2 ... k_n} with k_s = 1, ..., B_s for all s = 1, ..., n. Each group is a box [see Fig. 2 for an example]. After taking the average of the data points in each box G^n_{k_1 k_2 ... k_n}, we obtain B reduced training data points. In the remainder of this paper, we denote these B reduced training data points by {y_b}_{b=1,...,B} and refer to them as the box-averaged data points. Intuitively, this algorithm partitions the domain into hyperrectangles such that Pr(y ∈ G^n_{k_1 ... k_n}) ≈ 1/B. Note that the idea of our data reduction method is analogous to that of multivariate k-nearest-neighbor density estimates [15,17].
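The recursive equal-count clustering above can be sketched as follows; `box_average` is our own implementation of the described procedure, assuming for simplicity that the number of data points is divisible by Π_s B_s.

```python
import numpy as np

def box_average(Y, B_list):
    """Recursive equal-count clustering: sort by the 1st coordinate into B_1
    groups, each group by the 2nd coordinate into B_2 subgroups, and so on;
    return the mean of each of the B = prod(B_list) final boxes."""
    groups = [np.asarray(Y)]
    for s, Bs in enumerate(B_list):
        new_groups = []
        for G in groups:
            G = G[np.argsort(G[:, s], kind="stable")]  # order by the s-th coordinate
            new_groups.extend(np.array_split(G, Bs))   # equal-count subgroups
        groups = new_groups
    return np.array([G.mean(axis=0) for G in groups])
```

On a large uniform sample in [0,1]^2 with B_1 = B_2 = 4, the 16 box averages land near the centers of a 4 × 4 grid, matching the behavior shown in Fig. 2(b).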
To examine whether the distribution of these box-averaged data points is close to the sampling density q(y) of the original dataset, we apply our reduction method to several numerical examples. Figure 2(a) shows the reduction result for a small number (= 64) of uniform data points in [0, 1] × [0, 1]. Here, B_1 = 4 and B_2 = 4, so there are B = 16 boxes in total, and inside each box there are 4 uniform data points (blue circles). It can be seen that the box-averaged data points (red circles) are far from the well-sampled uniform data points (cyan crosses). However, when the number of uniform data points (blue circles) increases to a large value (= 6400), the box-averaged data points (red circles) are very close to the well-sampled uniform data points (cyan crosses), as shown in Fig. 2(b). This suggests that the box-averaged data points nearly follow the uniform distribution when the original uniform dataset is large. Figures 2(c) and (d) show the comparison of kernel density estimates applied to the box-averaged data for different B, for the standard normal distribution and for the distribution proportional to exp[−(−X_1^2 + X_1^3 + X_1^4)], respectively. It can be seen that the reduced box-averaged data points nearly preserve the distribution of the original large dataset, N = 640,000.
When MN is very large, the VBDM construction of the data-driven basis functions [Procedure (1-B)] can be outlined as follows. We first use our data reduction method to obtain B (≪ MN) box-averaged data points {y_b}_{b=1,...,B} ⊆ N ⊆ R^n with sampling density q(y) ≈ (1/M) Σ_{j=1}^M p(y|θ_j) in (32). The sampling density q is estimated at each box-averaged data point y_b from all the box-averaged data points {y_b}_{b=1,...,B} by a kernel density estimation method. Implementing the VBDM algorithm, we obtain orthonormal eigenvectors ψ_k ∈ R^B, which are discrete estimates of the eigenfunctions ψ_k(y) ∈ L^2(N, q); the b-th component of the eigenvector ψ_k is a discrete estimate of the eigenfunction ψ_k(y_b) evaluated at the box-averaged data point y_b. Due to the dramatic reduction of the training data, computing these eigenvectors ψ_k ∈ R^B is much cheaper than computing the eigenvectors ψ_k ∈ R^{MN} from the original training dataset {y_{i,j}}. We can then obtain a discrete representation (5) of the conditional density at the box-averaged data points, p̂(y_b|θ).

Nyström extension
For Procedure (1-C), when training the matrix C_{YΘ} in (7), we need estimates of the eigenfunctions ψ_k(y_{i,j}) at all of the original training data y_{i,j} ∉ {y_b}_{b=1,...,B}. However, we only have discrete estimates of the eigenfunctions ψ_k(y_b) at the box-averaged data points y_b. This suggests that we need to extend the eigenfunctions ψ_k(y_b) to all of the original training data {y_{i,j}}. Notice that this extension problem is the same as the second issue for the VBDM algorithm mentioned above: how to extend the representation to new observations y_t ∉ {y_{i,j}} that are not part of the training dataset [Procedure (1-D)]. One solution is to first extend the eigenfunctions ψ_k(y_{i,j}) to the new observation, ψ_k(y_t), and then represent the density p̂(y_t|θ) with the set of extended basis functions ψ_k(y_t).
For convenience of discussion, the training data used to construct the eigenfunctions are denoted by {y^old_r}_{r=1,...,R}, and any data point not among {y^old_r}_{r=1,...,R} is denoted by y^new. To extend the eigenfunctions ψ_k(y^old_r) to a data point y^new ∉ {y^old_r}_{r=1,...,R}, one approach is the Nyström extension [22], which is based on the basic theory of RKHS [1]. Let L^2(N, q) be an RKWHS with a symmetric positive kernel T: N × N → R. Then for any function f ∈ L^2(N, q), the Moore-Aronszajn theorem states that one can evaluate f at a ∈ N with the inner product f(a) = ⟨f, T̃(a, ·)⟩_q. In our application, this amounts to evaluating ψ_k(y^new) = ⟨ψ_k, T̃(y^new, ·)⟩_q, where the non-symmetric kernel function T̃: N × N → R is related to the symmetric kernel T by T̃(y_i, y_j) = q^{-1/2}(y_i) T(y_i, y_j) q^{1/2}(y_j), with q(y_i) being the sampling density of {y^old_r}_{r=1,...,R} at y_i. See [11] for the detailed evaluation of the kernels T and T̃ for the Nyström extension. After obtaining the estimate of the eigenfunction ψ_k(y^new) via the Nyström extension, we can train the matrix C_{YΘ} in (7) for large MN and then obtain the representation of the conditional density at an arbitrary new observation y_t, p̂(y_t|θ).
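A generic Nyström extension for a symmetric Gaussian kernel can be sketched as follows; this is a hypothetical simplification, since the paper's kernels T and T̃ additionally carry the density conjugation described in [11].

```python
import numpy as np

def nystrom_extend(Y_old, Y_new, eps, K):
    """Extend the first K empirical kernel eigenfunctions, computed on Y_old,
    to the points Y_new via phi_k(y) = (1/lam_k) <k(y, .), phi_k>."""
    R = len(Y_old)
    d2 = ((Y_old[:, None, :] - Y_old[None, :, :]) ** 2).sum(-1)
    vals, vecs = np.linalg.eigh(np.exp(-d2 / (4.0 * eps)) / R)
    idx = np.argsort(vals)[::-1][:K]
    lam = vals[idx]
    phi = vecs[:, idx] * np.sqrt(R)      # orthonormal in the empirical inner product
    d2_new = ((Y_new[:, None, :] - Y_old[None, :, :]) ** 2).sum(-1)
    W_new = np.exp(-d2_new / (4.0 * eps))
    # phi_k(y_new) = (1/lam_k) * (1/R) * sum_r k(y_new, y_r) phi_k(y_r)
    return (W_new @ phi) / (R * lam)
```

Evaluating the extension at the training points themselves reproduces the empirical eigenvectors exactly, which is a convenient sanity check.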
To summarize this section, we have constructed two different sets of basis functions for y: the analytic basis functions of L^2(N, q), such as the Hermite and cosine basis functions, which assume that the manifold is R^n or a hyperrectangle, respectively; and the data-driven basis functions of L^2(N, q), with N the data manifold and q the sampling density, computed using the VBDM algorithm. In the next section, we compare the numerical performance of the Hermite, Cosine, and VBDM basis functions for the parameter estimation application.

Parameter estimation using the Metropolis scheme
Here, we briefly review the Metropolis scheme for estimating the posterior density p(θ|y†) given new observations y† = {y†_1, ..., y†_T} for a specific parameter θ†. The key idea of the Metropolis scheme is to construct a Markov chain that converges to samples of the conditional density p(θ|y†) as the target density. In our application, the parameter estimation procedure can be outlined as follows. (2-A) Suppose we have θ_0 with p(θ_0|y†) > 0; then for i ≥ 1 we sample θ* ∼ κ(θ_{i−1}, θ*), where κ is the proposal kernel density. For example, the random walk Metropolis algorithm generates proposals κ(θ_{i−1}, θ*) = N(θ_{i−1}, C), where the proposal covariance C is a tunable nuisance parameter. (2-B) Accept the proposal, setting θ_i = θ*, with probability min{1, p(θ*|y†)/p(θ_{i−1}|y†)} (for a symmetric proposal such as the random walk); otherwise set θ_i = θ_{i−1}. (2-C) Generate a sufficiently long chain and use the chain's statistics as an estimator of the true parameter θ†.
Take multiple runs of the chain starting from different initial conditions θ_0, and examine whether all runs converge to the same distribution. The convergence of all the examples below has been validated using 10 randomly chosen initial conditions.
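The procedure above can be sketched as follows. The function name is our own, and `log_post` stands in for the log of the trained likelihood representation plus the log prior; this is a minimal illustration, not the paper's implementation.

```python
import numpy as np

def random_walk_metropolis(log_post, theta0, C, n_iter, rng):
    """Random walk Metropolis: propose from N(theta_{i-1}, C) and accept
    with probability min(1, posterior ratio)."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    L = np.linalg.cholesky(np.atleast_2d(np.asarray(C, dtype=float)))
    lp = log_post(theta)
    chain = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        prop = theta + L @ rng.normal(size=theta.size)  # propose
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:        # accept/reject
            theta, lp = prop, lp_prop
        chain[i] = theta                                # collect the chain
    return chain
```

With a standard normal log-posterior, the chain statistics recover the target mean and variance, which is the usual sanity check before applying the sampler to a trained likelihood.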
One can show that the likelihood function for this problem is the inverse matrix gamma distribution. If the prior is also defined to be an inverse matrix gamma distribution, Σ ∼ IMG(α_0, 2, 0) for some value of α_0, then the posterior density p(Σ|X†) can be obtained by applying Bayes' theorem, and the posterior mean can thereafter be computed in closed form. See Section 7 in the supplementary material for the detailed derivation of the posterior density function and the posterior mean.
To compare with the analytic conditional density p(X|Σ) (46), we learn three RKWHS representations of the conditional density function, p̂(X|Σ), using the same training dataset. For training, we use M = 64 well-sampled uniformly distributed training parameters [shown in Fig. 1], (σ²_X1, σ²_X2) with σ²_Xj ∈ {5, 6, ..., 12}, denoted by {Σ_j}_{j=1}^M. For each training parameter Σ_j, we generate N = 640,000 well-sampled normally distributed observations. For the Hermite and Cosine representations, we use 20 basis functions for each coordinate and construct K_1 = 400 basis functions of the two-dimensional observation X by taking the tensor product. For the VBDM representation, we first compute B = B_1 × B_2 = 100 × 100 box-averaged data points by our data reduction method [Sec. 4.2.2]. After reducing the number of training data from MN to B, we learn K_1 = 400 data-driven basis functions from the B box-averaged data points using the VBDM algorithm [4]. Figure 3 shows the comparison at the training parameter (σ²_X1, σ²_X2) = (5, 5). It can be seen from Figs. 3(b)-(d) that all the pointwise errors are small compared to the analytic p(X|Σ) in Fig. 3(a), so that all representations of the conditional density p̂(X|Σ) are in excellent agreement with the analytic p(X|Σ) [Fig. 3(a)]. This suggests that for the Hermite representation, the upper bound D (28) in Theorem 1 is finite, so that the representation is valid for estimating the conditional density, as can be seen from Fig. 3(b). On the other hand, the upper bounds D for the Cosine and VBDM representations are always finite, as mentioned in Remark 2 and Proposition 2, respectively. We should also point out that for this example, the VBDM representation performs the worst, with errors of order 10^{-4}, compared to the Hermite and Cosine representations, whose errors are of order 10^{-6}. The larger error of the VBDM representation arises because the data-driven basis functions are estimated by discrete eigenvectors ψ_k ∈ R^B, so additional errors [4] are introduced through this discrete approximation (especially on the high modes) on the box-averaged data {y_b}_{b=1,...,B}, B = 10,000. In contrast, for the Hermite and Cosine representations the analytic basis functions are known, so the errors can be approximated by (20) in Theorem 1. See Section 7 in the supplementary material for more discussion of the numerical results for this example.
Figures 4(b), (c), and (d) display the densities of the chains obtained using the Hermite, Cosine, and VBDM representations, respectively. The densities are plotted using a kernel density estimate on the chain, ignoring the first 10,000 iterations. For comparison, Fig. 4(a) displays the analytic posterior density (47). It can be seen from Fig. 4 that the posterior densities of the three representations are in excellent agreement with each other and with the analytic posterior density (47). Figure 4 also shows the comparison between the posterior mean (48) and the MCMC mean estimates. From our numerical results, the MCMC mean estimates of all representations and the analytic posterior mean (48) are identical within numerical accuracy. Therefore, for this 2D OU-process example, all representations are valid for estimating the posterior density and posterior mean of the parameter Σ.
Next, we investigate a system for which the intrinsic dimension d is less than the dimension n of the ambient space. For this system, we will find that representations with analytic and with data-driven basis functions exhibit quite different performance in estimating the posterior density.

Example II: Three-dimensional system of SDE's on a torus
Consider a system of SDEs on a torus defined in the intrinsic coordinates (θ, φ) ∈ [0, 2π)², where W_1 and W_2 are two independent Wiener processes with given drift and diffusion coefficients.
The initial condition is (θ, φ) = (π, π). Here, D is a parameter to be estimated. This example exhibits a non-gradient drift, anisotropic diffusion, and multiple time scales. The training dataset is generated by numerically solving the SDE in (49) with time step Δt = 0.1 and then mapping the data into the ambient space R^3 via the standard embedding of the torus, x ≡ (x, y, z) = ((2 + sin θ) cos φ, (2 + sin θ) sin φ, cos θ), where x are the observations. This system on a torus satisfies d < n, where d = 2 is the intrinsic dimension of x and n = 3 is the dimension of the ambient space R^n. Our goal is to estimate the posterior density and the posterior mean of the parameter D given discrete-time observations x† for a specific parameter D†.
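The embedding step can be sketched as follows (the function name is our own); every embedded point satisfies the torus constraint, which is what makes the data manifold a strict submanifold of R^3.

```python
import numpy as np

def torus_embed(theta, phi, r=1.0):
    """Standard embedding of the torus into R^3:
    (x, y, z) = ((2 + r sin(theta)) cos(phi),
                 (2 + r sin(theta)) sin(phi),
                 r cos(theta))."""
    x = (2.0 + r * np.sin(theta)) * np.cos(phi)
    y = (2.0 + r * np.sin(theta)) * np.sin(phi)
    z = r * np.cos(theta)
    return np.stack([x, y, z], axis=-1)
```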
For training, we use M = 8 well-sampled uniformly distributed training parameters, {D_j = j/4}_{j=1}^8. For each training parameter D_j, we generate N = 54,000 observations of x. For the Hermite and Cosine representations, we construct 10 basis functions for each of the x, y, z coordinates in Euclidean space. After taking the tensor product of these basis functions, we obtain K_1 = 1000 basis functions on the ambient space R^3. For the VBDM representation, we first compute B = B_1 × B_2 × B_3 = 30^3 box-averaged data points by our data reduction method. However, we find that some of the B box-averaged data points are far away from the torus. After discarding these points, we retain B = 26,020 box-averaged data points that are close enough to the torus for training. Then we learn K_1 = 1000 data-driven basis functions on N from these 26,020 box-averaged data points using the VBDM algorithm. For diagnostic comparison, we construct another representation of p(x|D_j), named the intrinsic Fourier representation, which can be regarded as an accurate approximation of p(x|D_j) since it uses basis functions defined on the intrinsic coordinates (θ, φ) instead of x ∈ R^3. See Section 8 for details on the construction and convergence of the intrinsic Fourier representation.
Figure 5 displays the comparison of the density estimates. It can be observed from Fig. 5 that the VBDM representation is in good agreement with the intrinsic Fourier representation, whereas the Hermite and Cosine representations of p̂(x|D_j) deviate significantly from it. The reason, in short, is that if the density p(θ, φ|D) in the (θ, φ) coordinates is in L^2([0, 2π)²), then the corresponding VBDM representation with respect to dV(x) is in L^2(N, q^{-1}); however, the Hermite and Cosine representations with respect to dx, x ∈ R^3, are not in L^2(R^3, q^{-1}). A more detailed explanation of this assertion is presented in Section 8.
We now compare the MCMC estimates with the true value D† = 0.9 from T = 10,000 observations. For this simulation, we set the prior to be uniform and empirically choose C = 0.01 for the proposal. Figure 6 displays the posterior densities of the chains for all representations (each density estimate is smoothed with a KDE on a chain of length 400,000), together with the comparison between the true value D† and the MCMC mean estimates. The mean estimate of the intrinsic Fourier representation nearly overlaps with the true value D† = 0.9, as shown in Fig. 6. The mean estimate of the VBDM representation is closer to the true value D† than the estimates of the Hermite and Cosine representations. Moreover, it can be seen from Fig. 6 that the posterior of the VBDM representation is close to the posterior of the intrinsic Fourier representation, whereas the posterior densities of the Hermite and Cosine representations are not. This result suggests that when the intrinsic dimension is less than the ambient space dimension, d < n, the VBDM representation with data-driven basis functions in L^2(N, q) is superior to the representations with analytic basis functions.

Example III: Five-dimensional Lorenz-96 model
Consider a five-dimensional Lorenz-96 model [16] with periodic boundary conditions, x_{j+J} = x_j. The initial condition has a standard normal distribution for each x_j and is kept the same for all training parameters. Our goal is to estimate the posterior density and posterior mean of the hidden parameter F given a time series of noisy observations u = (u_1, u_2, u_3, u_4, u_5), where the noise variance is σ_j² = 2.5 × 10^{-3} × Var[x_j]. Here, x_j(t_m) denotes the approximate solution (computed with a Runge-Kutta method) at discrete times t_m = 20m Δt, where Δt = 0.05 is the integration time step.
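A minimal sketch of the model and the Runge-Kutta time stepping used to generate x_j(t_m) is given below, assuming the standard Lorenz-96 tendency dx_j/dt = (x_{j+1} − x_{j−2}) x_{j−1} − x_j + F with periodic indices; the classical fourth-order scheme is our assumption, since the text does not specify the Runge-Kutta order.

```python
import numpy as np

def lorenz96_rhs(x, F):
    """Lorenz-96 tendency dx_j/dt = (x_{j+1} - x_{j-2}) x_{j-1} - x_j + F,
    with periodic boundary x_{j+J} = x_j handled by np.roll."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4_step(x, F, dt):
    """One classical fourth-order Runge-Kutta step of size dt."""
    k1 = lorenz96_rhs(x, F)
    k2 = lorenz96_rhs(x + 0.5 * dt * k1, F)
    k3 = lorenz96_rhs(x + 0.5 * dt * k2, F)
    k4 = lorenz96_rhs(x + dt * k3, F)
    return x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
```

The constant state x_j = F is an exact fixed point of the tendency, which provides a simple consistency check of the implementation.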
For training, we use M = 8 well-sampled uniformly distributed training parameters, {F_j = 7.55 + 0.1j}_{j=1}^8. For each training parameter F_j, we generate N = 500 noisy observations u. For the Hermite and Cosine representations, we construct 5 Hermite and cosine basis functions for each coordinate, which yields a total of K_1 = 5^5 = 3125 basis functions in R^5. For the VBDM representation, we directly apply the VBDM algorithm to learn K_1 = 3125 data-driven basis functions on the manifold N from the MN = 4000 training data. From the VBDM algorithm, the estimated intrinsic dimension is d = 2.19, which is smaller than the dimension of the ambient space, n = 5.
Figure 7 displays the comparison of the conditional densities p̂(u|F_j) using the Hermite (first row), Cosine (second row), and VBDM (third row) representations. For this example, the Hermite and Cosine representations of the conditional densities p̂(u|F_j) have smaller density values than the VBDM representation. For the VBDM representation, the conditional densities p̂(u|F_j) have higher peaks, as depicted by the magenta circles [see Figs. 7(g)-(i)]. This indicates that the manifolds N_j of the training data {u_{i,j}}_{i=1,...,N} are compact. By comparing Figs. 7(g), (h), and (i), it can further be observed that the magenta points, corresponding to the several largest conditional densities, differ for each p̂(u|F_j). This indicates that the manifolds N_j change and have almost no intersection for different training parameters F_j. This is why we do not need box-averaged data in this example: one can distinguish the training parameters F_j from only a small dataset by comparing the data manifolds N_j. This is the opposite of the previous Example II, where the data manifolds for all training parameters are exactly the same, so that more training data are typically needed to distinguish the training parameters by comparing the conditional densities.
We now compare the MCMC mean estimates with the true value F† = 8 from T = 1000 noisy observations u(t_m). As in the previous example, we use a uniform prior distribution and C = 0.01 for the proposal, and we generate the chain for 400,000 iterations. Figure 8 displays the posterior densities of the chains for all representations, together with the comparison between the true value F† and the MCMC mean estimates. It can be seen from Fig. 8 that the mean estimate of the VBDM representation is very close to the true value F†, compared to the mean estimates of the Hermite and Cosine representations. Notice also that although the mean estimate from the Cosine representation is not too far from the true parameter value, its posterior density is bimodal, which suggests more uncertainty in the mean estimate. Based on this numerical result, where the estimated intrinsic dimension d ≈ 2 of the observations is lower than the ambient space dimension n = 5, again the data-driven VBDM representation is superior to the Hermite and Cosine representations.

Conclusion
We have developed a parameter estimation framework in which MCMC is employed with a nonparametric likelihood function. Our approach approximates the likelihood function using the kernel embedding of conditional distributions based on an RKWHS. By the error estimate in Theorem 1, we have verified the validity of our RKWHS representation of the conditional density as long as p(y|θ_j) ∈ L^2(N, q^{-1}) and Var_{Y|θ_j}[ψ_k(Y)] is finite. Furthermore, the analysis suggests that if the weight q is chosen to be the sampling density of the data, then Var_{Y|θ_j}[ψ_k(Y)] is always finite. This justifies the use of variable-bandwidth diffusion maps (VBDM) for estimating the data-driven basis functions of the Hilbert space weighted by the sampling density on the data manifold.
To overcome the difficulty induced by large training datasets in the VBDM algorithm, we have provided a data reduction method that preserves the sampling distribution of the original data. We have compared the Hermite, Cosine, and VBDM representations in parameter estimation on three instructive numerical examples. In the first example, where the dimension of the data manifold equals the dimension of the ambient space, d = n, all three representations are valid for estimating the posterior density of the parameter. However, in the examples where the dimension of the data manifold is strictly less than the dimension of the ambient space, d < n, only the VBDM representation provides an accurate estimate of the true parameter value.
These theoretical and numerical results encourage us to apply the VBDM representation in many other parameter estimation applications. The next challenge is very high-dimensional problems, where one needs to identify the relevant features for estimating the parameters and construct the conditional densities on the relevant coordinates instead of on the full high-dimensional data space. This is the subject of future work.

Appendix for Example I
This section provides a more detailed discussion of the derivation of the explicit solutions of Example I and presents additional numerical results.
At stationarity, the solution X = (X_1, X_2) of Eq. (45) admits a Gaussian distribution X ∼ N(0, Σ), so that the corresponding conditional density function is p(X|Σ). To estimate the posterior density and the posterior mean of the parameters (σ²_X1, σ²_X2) in (52), we are given a time series of T observations, X† ≡ (X†^1, ..., X†^T), for a specific variance Σ†. Here, each observation X†^i has two components, X†^i = (X†^i_1, X†^i_2), corresponding to the i-th observation of (X_1, X_2).
Next, we derive the analytic posterior density and the posterior mean of the parameters (σ²_X1, σ²_X2) in (52). The likelihood function of the parameter Σ given X† involves a 2 × 2 matrix Ψ. For the VBDM representation, the discrete approximation errors of the eigenvectors are analyzed in [4]; this implies that for more oscillatory functions ψ_k at large k (e.g., Fourier series), a larger B is required to control the magnitude of the gradient in this error bound. In contrast, for the Hermite and Cosine representations the analytic basis functions are known, so the errors e_{l²} can be approximated by (20) in Theorem 1.
However, when only a few basis functions are used (K_1 ≈ 10), it can be seen from Fig. 9 that the errors e_{l²} of the VBDM and Hermite representations are relatively small compared to the error of the Cosine representation. This is one advantage of the VBDM representation: it can approximate the density well using only a few leading eigenfunctions, since these eigenfunctions are data-driven. For the Hermite representation, its weight function happens to be very close to the sampling density q for this OU example, so the Hermite representation can approximate the density well with only a few Hermite polynomials. For the Cosine representation, whose weight function deviates greatly from the sampling density q, one can only expect a fair approximation with few cosine basis functions. As seen in Example III in the paper, this observation plays an important role when the data manifold is a submanifold of R^n with intrinsic dimension much less than n, since the computations with analytic representations become very expensive even for n ≈ 5. For the VBDM representation, however, only very few basis functions are needed to represent the density.
It is worth mentioning that for the VBDM representation, although the l²-norm error e_{l²} (61) is relatively large (∼ 0.7) [Fig. 9] and cannot converge to zero due to the additional error induced by the finite dimension of the eigenvectors, the eventual posterior density and posterior mean can still be well approximated, as shown in Fig. 4. This suggests that to approximate the posterior well, the error is only required to be sufficiently small rather than convergent to zero as the number of training data N → ∞.

Recall that the data are sampled according to the Riemannian metric inherited by N from the ambient space R^n. Furthermore, the orthonormal basis functions of the Hilbert space L^2(N, q^{-1}) are the eigenfunctions of the adjoint (with respect to L^2(N)) of the operator L = ∇log(q)·∇ + Δ constructed by the VBDM algorithm. Incidentally, the adjoint operator L* is the Fokker-Planck operator of a gradient system forced by stochastic noise. The point is that this adjoint operator acts on density functions in the weighted Hilbert space L^2(N, q^{-1}). Since L^2(N, q^{-1}) is the function space of a Fokker-Planck operator acting on densities defined with respect to the geometry of the data, representing the conditional density with basis functions of the weighted Hilbert space L^2(N, q^{-1}) is a natural choice. Thus, the error estimate in Theorem 2 is valid for controlling the error of the estimate.
Next, we show that the representation of the true density p_EX in the ambient space R^3 is not a function in L^2(R, q^{-1}), where for the Hermite representation R = R^3 and q is a normal distribution, and for the Cosine representation R is a hyperrectangle containing the torus and q is a uniform distribution. Recall that the torus is parametrized by x ≡ (x, y, z) = ((2 + r sin θ) cos φ, (2 + r sin θ) sin φ, r cos θ), where θ, φ are angles that make a full circle and r is the radius of the tube, known as the minor radius. All observation data are located on the torus with r = 1. Then the generalized conditional density p(r, θ, φ|D) in the (r, θ, φ) coordinates can be defined using the Dirac delta function as p(r, θ, φ|D) = p_IC(θ, φ|D) δ(r − 1), where p_IC, defined in (62), denotes the conditional density function in the intrinsic coordinates (θ, φ), and δ is the Dirac delta function. After a coordinate transformation, the density p_EX: R^3 → R can be obtained as p_EX(x|D) = p_IC(θ, φ|D) δ(r − 1)/|J|, where J is the Jacobian determinant det[∂(x, y, z)/∂(r, θ, φ)]. It can now be clearly seen that due to the Dirac delta function δ(r − 1) in (66), the density p_EX(x|D) is no longer in the weighted Hilbert space: p_EX(x|D) ∉ L^2(R, q^{-1}). Consequently, the error estimate in Theorem 1 becomes invalid for controlling the error of the conditional density.
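For completeness, the Jacobian determinant of the torus parametrization can be computed explicitly from the partial derivatives of (x, y, z) with respect to (r, θ, φ):

```latex
J = \det\frac{\partial(x,y,z)}{\partial(r,\theta,\phi)}
  = \det\begin{pmatrix}
      \sin\theta\cos\phi & r\cos\theta\cos\phi & -(2+r\sin\theta)\sin\phi\\
      \sin\theta\sin\phi & r\cos\theta\sin\phi & (2+r\sin\theta)\cos\phi\\
      \cos\theta & -r\sin\theta & 0
    \end{pmatrix}
  = r\,(2 + r\sin\theta).
```

Since |J| = r(2 + r sin θ) > 0 near r = 1, the change of variables is well defined there, and the singularity of p_EX comes entirely from the Dirac delta factor, not from the Jacobian.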
Numerically, the density p_EX(x|D) in (66) cannot be well approximated using only a finite number of basis functions in the Hermite and Cosine representations (5). If only finitely many Hermite or cosine basis functions are used, the Gibbs phenomenon is typically observed; that is, the Dirac delta function δ(r − 1) in (66) is approximated by a function with a single tall spike at r = 1 and oscillations on both sides along the r direction. On the other hand, the data-driven basis functions obtained via the diffusion maps algorithm are smooth functions defined on the data manifold N. Therefore, while the Gibbs phenomenon still occurs in this spectral expansion, it is due to the finite truncation in representing positive smooth functions (densities) on the data manifold, not due to the singularity in the ambient direction as in (66).

Figure 2: (Color online) Data reduction for (a) a small number (64) of uniformly distributed data points, and (b) a large number (6400) of uniformly distributed data points. The 64 blue circles correspond to uniformly distributed data, the 16 cyan crosses to well-sampled uniformly distributed data, and the 16 red circles to box-averaged data. Boxes are partitioned by horizontal and vertical black lines: the vertical lines correspond to the first clustering and the horizontal lines to the second clustering. Panels (c) and (d) compare kernel density estimates on the box-averaged data for different numbers of boxes $B$ for (c) the standard normal distribution and (d) the distribution proportional to $\exp[-(-X_1^2 + X_1^3 + X_1^4)]$, respectively. For comparison, the analytic probability density of each distribution is also plotted. The total number of points is 640000. The reduced box-averaged data points nearly preserve the distribution of the original dataset.
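The two-stage box partitioning in the caption (a first clustering along one coordinate, a second along the other, then averaging within each box) can be sketched as follows. The equal-count splits are an assumption for illustration; the paper's exact clustering procedure is defined elsewhere:

```python
import numpy as np

def box_average(points, b):
    """Reduce N 2-D points to b*b box averages: split into b vertical strips
    by the x-coordinate (first clustering), split each strip into b cells by
    the y-coordinate (second clustering), and replace each cell by its mean."""
    strips = np.array_split(points[np.argsort(points[:, 0])], b)
    reduced = []
    for strip in strips:
        cells = np.array_split(strip[np.argsort(strip[:, 1])], b)
        reduced.extend(cell.mean(axis=0) for cell in cells)
    return np.array(reduced)

rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=(6400, 2))
avg = box_average(data, 4)      # 6400 points -> 16 box-averaged points
print(avg.shape)                # (16, 2)
```

Because each cell holds the same number of points, the mean of the box averages equals the mean of the original dataset, consistent with the caption's observation that the reduced points nearly preserve the original distribution.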

Figure 5: (Color online) Comparison of the conditional densities $p(x|D_j)$ estimated using the Hermite representation (first row), Cosine representation (second row), VBDM representation (third row), and intrinsic Fourier representation (fourth row). The left (a)(d)(g)(j), middle (b)(e)(h)(k), and right (c)(f)(i)(l) columns correspond to the densities at the training parameters $D_1 = 0.25$, $D_4 = 1.00$, and $D_7 = 1.75$, respectively. $K_1 = 1000$ basis functions are used for all representations. For a fair visual comparison, all conditional densities are plotted on the same box-averaged data points and normalized to satisfy $\frac{1}{B}\sum_{b=1}^{B} p(x_b|D_j)/q(x_b) = 1$.

Figure 7: (Color online) Comparison of the conditional densities $p(u|F_j)$ estimated using the Hermite representation (first row), Cosine representation (second row), and VBDM representation (third row). The left (a)(d)(g), middle (b)(e)(h), and right (c)(f)(i) columns correspond to the densities at the training parameters $F_1 = 7.65$, $F_4 = 7.95$, and $F_7 = 8.25$, respectively. $K_1 = 3125$ basis functions are used for all representations. For a fair visual comparison, all conditional densities are plotted on the training dataset $\{u_{i,j}\}_{i=1,\dots,N}^{j=1,\dots,M}$ in the coordinates $(u_1, u_3, u_5)$ and normalized to satisfy $\frac{1}{MN}\sum_{j=1}^{M}\sum_{i=1}^{N} p(u_{i,j}|F_k)/q(u_{i,j}) = 1$, with $q$ being the estimated sampling density of the training dataset. For each $p(u|F_j)$, only the largest 500 conditional densities are depicted with marker size 20; the remaining 3500 smaller values are depicted with marker size 1.
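The normalization convention in the caption, $\frac{1}{MN}\sum_{j,i} p(u_{i,j}|F_k)/q(u_{i,j}) = 1$, amounts to a Monte-Carlo estimate of the total mass over the training samples. A minimal sketch (the helper `normalize_density`, the toy sampling density `q`, and the synthetic values `p` are illustrative, not the paper's data):

```python
import numpy as np

def normalize_density(p_vals, q_vals):
    """Rescale density values p evaluated at training samples drawn from q
    so that the Monte-Carlo normalization mean(p / q) = 1 holds."""
    z = np.mean(p_vals / q_vals)   # estimated total mass of p
    return p_vals / z

rng = np.random.default_rng(1)
q = np.full(2000, 0.5)             # toy uniform sampling density on [0, 2]
p = rng.random(2000) + 0.1         # unnormalized density values at the samples
p_norm = normalize_density(p, q)
print(np.mean(p_norm / q))         # -> 1.0
```

This makes different representations directly comparable on the same sample set, which is the point of the "fair visual comparison" in Figures 5 and 7.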

Figure 8: (Color online) Comparison of the posterior density functions for all representations. Also plotted are the true parameter value $F^{\dagger} = 8$ (green asterisk) and the mean estimates from the Hermite representation $\bar{F} = 8.24$ (blue triangle), the Cosine representation $\bar{F} = 8.09$ (red square), and the VBDM representation $\bar{F} = 8.00$ (black circle).

Figure 9: (Color online) The $\ell^2$-norm error $e_{\ell^2}$ in (61) as a function of the number of basis functions $K_1$ for all representations.