Bayesian Uncertainty Quantification with Multi-Fidelity Data and Gaussian Processes for Impedance Cardiography of Aortic Dissection

In 2000, Kennedy and O’Hagan proposed a model for uncertainty quantification that combines data of several levels of sophistication, fidelity, quality, or accuracy, e.g., a coarse and a fine mesh in finite-element simulations. They assumed each level to be describable by a Gaussian process, and used low-fidelity simulations to improve inference on costly high-fidelity simulations. Departing from there, we move away from the common non-Bayesian practice of optimization and marginalize the parameters instead. Thus, we avoid the awkward logical dilemma of having to choose parameters and of neglecting that choice’s uncertainty. We propagate the parameter uncertainties by averaging the predictions and the prediction uncertainties over all the possible parameters. This is done analytically for all but the nonlinear or inseparable kernel function parameters. What is left is a low-dimensional and feasible numerical integral depending on the choice of kernels, thus allowing for a fully Bayesian treatment. By quantifying the uncertainties of the parameters themselves too, we show that “learning” or optimizing those parameters has little meaning when data is little and, thus, justify all our mathematical efforts. The recent hype about machine learning has long spilled over to computational engineering but fails to acknowledge that machine learning is a big data problem and that, in computational engineering, we usually face a little data problem. We devise the fully Bayesian uncertainty quantification method in a notation following the tradition of E.T. Jaynes and find that generalization to an arbitrary number of levels of fidelity and parallelization become rather easy. We scrutinize the method with mock data and demonstrate its advantages in its natural application where high-fidelity data is little but low-fidelity data is not. We then apply the method to quantify the uncertainties in finite element simulations of impedance cardiography of aortic dissection.
Aortic dissection is a cardiovascular disease that frequently requires immediate surgical treatment and, thus, a fast diagnosis beforehand. While traditional medical imaging techniques such as computed tomography, magnetic resonance tomography, or echocardiography certainly do the job, impedance cardiography, itself a clinical standard tool, promises to allow earlier diagnoses as well as to detect patients that would otherwise go under the radar for too long.


Statistical Model
Let C denote the conditioning complex, i.e., all the background information we condition on. Let t = 1, ..., N_t denote the ranked levels of fidelity of a simulation, with level N_t being the highest fidelity. z_t(x_t) is a vector of simulation results of fidelity level t given input vector x_t. We assume that z_t(x_t) is a realisation of a Gaussian process (GP) z_t(x) with a Markov property of order 1, meaning that level t depends on level t − 1 only via the following recursive relationship:

z_t(x) = ρ_{t−1} z_{t−1}(x) + δ_t(x),   t = 2, ..., N_t,   z_1(x) = δ_1(x),    (1)

with a "difference-GP" δ_t(x) and a proportionality constant ρ_{t−1}. Further, we assume that all information about a level is contained in the data corresponding to the same pivot point at that level and its previous level. Formally, that is

Cov( z_t(x), z_{t−1}(x′) | z_{t−1}(x) ) = 0   for x′ ≠ x.

The difference-GP δ_t(x) shall be defined by the covariance matrix σ_t² K_t(x, x′) and the mean function h_t(x) β_t. h_t(x) is a matrix of regression functions h_t^(k) evaluated at x = (x^(1), ..., x^(j), ..., x^(N_x))^T, with size N_x × N_{β_t}, where N_x is the length of input vector x and N_{β_t} is the expansion power, i.e., the number of regression functions, at level t. β_t = (β_t^(1), ..., β_t^(N_{β_t}))^T are the coefficients of level t's regression functions, and α_t is the set of hyperparameters parametrizing the kernel function k_t. Formally, this is

δ_t(x) ~ N( h_t(x) β_t, σ_t² K_t(x, x′) ),   with   ( K_t(x, x′) )_{ij} = k_t( x^(i), x′^(j); α_t ).    (2)

At this point, neither have we chosen the basis functions h_t^(k) building the mean function nor have we chosen the kernel functions k_t building the covariance matrices. Let us subsume the parameters θ_t = (β_t, ρ_{t−1}, σ_t, α_t) and the data D_t, which comprises the input vector at level t, namely x_t, and its corresponding computer-code outputs at level t, namely z_t(x_t), and at the previous level t − 1, namely z_{t−1}(x_t). Further, we require a nested design of input vectors, i.e.,

x_{N_t} ⊆ x_{N_t − 1} ⊆ ... ⊆ x_1.

We want to draw conclusions from the predictive posterior probability of z_{N_t}(x) at a set of points x, p( z_{N_t}(x) | D, C ) with D = {D_t}_{t=1}^{N_t}. This thing is quite unhandy.
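For illustration, mock data obeying this recursion can be generated as follows. This is only a sketch with two levels; the squared exponential kernel, the mean functions, and all numerical values are illustrative assumptions, not the settings used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def se_kernel(x, xp, alpha):
    # Squared exponential kernel; alpha acts as an inverse squared
    # correlation length (illustrative choice, cf. Section 3).
    return np.exp(-0.5 * alpha * (x[:, None] - xp[None, :]) ** 2)

def draw_gp(x, mean, cov):
    # One realisation of a GP evaluated at the pivot points x.
    return rng.multivariate_normal(mean, cov + 1e-10 * np.eye(len(x)))

# Nested design: the HiFi pivot points are a subset of the LoFi ones.
x1 = np.linspace(-1.0, 1.0, 41)   # level 1 (low fidelity)
x2 = x1[::4]                      # level 2 (high fidelity), x2 a subset of x1

# Level 1: z_1(x) = delta_1(x)
z1 = draw_gp(x1, np.zeros_like(x1), se_kernel(x1, x1, 5.0))

# Level 2: z_2(x) = rho_1 * z_1(x) + delta_2(x), as in Equation (1)
rho1 = 1.3                                                   # illustrative value
delta2 = draw_gp(x2, 0.2 * x2, 0.05 * se_kernel(x2, x2, 2.0))
z2 = rho1 * z1[::4] + delta2
```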
We will instead deal with its moments only, namely the posterior mean ⟨z_{N_t}(x)⟩ and the posterior covariance Cov( z_{N_t}(x), z_{N_t}(x′) ), where the diagonal of the posterior covariance is the uncertainty band of the prediction. The moments of z_{N_t}(x) follow from marginalizing over the difference-GPs and the parameters. We have used the fact that p( z_{N_t}(x) | {δ_t(x)}_{t=1}^{N_t}, θ, D ) reduces to Dirac's delta-distribution, since z_{N_t}(x) is uniquely determined by Equation (1) and the knowledge of all difference-GPs:

z_{N_t}(x) = Σ_{t=1}^{N_t} ( Π_{l=t}^{N_t − 1} ρ_l ) δ_t(x).

Since δ_t(x) is assumed to obey a GP, the prior probability of δ_t(x_t) is multivariate normal. If the likelihood p( D_t | δ_t(x), θ_t ) is Gaussian, then the posterior probability of δ_t(x_t), p( δ_t(x_t) | θ_t, D_t ), is multivariate normal as well. Integration with respect to δ_t(x) thus yields the standard result for the posterior mean (see, e.g., Reference [10]) and amounts to replacing each GP with its posterior mean.
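The standard result referred to here, the multivariate normal posterior of a GP under a Gaussian likelihood (see, e.g., Reference [10]), can be sketched as follows; the zero prior mean and the small fixed noise term are simplifying assumptions of this sketch.

```python
import numpy as np

def gp_posterior(x_train, y_train, x_test, kernel, noise_var=1e-8):
    # Standard GP regression posterior with zero prior mean (cf. Reference [10]):
    #   mean = K_*^T (K + noise I)^{-1} y
    #   cov  = K_** - K_*^T (K + noise I)^{-1} K_*
    K = kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    Ks = kernel(x_train, x_test)
    Kss = kernel(x_test, x_test)
    L = np.linalg.cholesky(K)                     # stable solve via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, Ks)
    return Ks.T @ alpha, Kss - v.T @ v
```

At the training points themselves, the posterior mean reproduces the data (up to the noise term) and the posterior variance collapses, which is the behaviour exploited when the difference-GP is replaced by its posterior mean.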

Prediction and Its Uncertainty
To compute posterior mean and covariance, we are thus left with integration with respect to the hyperparameters. This is done analytically for all parameters (β, σ, ρ) but the parameters of the kernel function, α, since those usually occur nonlinearly in the kernel function. We are then left with numerical integration of expectations and covariances, both conditioned on α. We assume flat priors for β and ρ and Jeffreys' prior for σ, i.e., p(σ_t | C) = 1/σ_t. The prior of α needs to be chosen only after the covariance kernel has been chosen. Let ⟨ · | α_t ⟩ denote the expectation value conditioned on α.
Here, for ease of notation, we will write ⟨ · ⟩ only. The technicalities are detailed in Appendix A. In the resulting expressions, summarized in Equation (4), δ_t(x_t) is determined from the data and Equation (2), Γ(·) is the complete Gamma function, and | · | is the matrix determinant. In the conditional expectations of the hyperparameters, δ_{t1} = 1 for t = 1 and 0 otherwise, N_{x_t} is the number of pivot points in input vector x_t, and N_{β_t} is the expansion order of level t's mean function. For t = 1, we need to define ρ_0 = 0, a_1 = 1, and b_1 = 0, with the abbreviations a_t and b_t as introduced in Appendix A. Quite importantly, we find the requirement of Equation (4d) on the number of pivot points relative to the expansion order, since otherwise the second moments of σ_t are not defined. The numerical evaluation of this result merely involves a couple of matrix operations. The only input is the data, the regression functions, and the covariance matrices. No parameters need to be tuned.
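The remaining numerical step, averaging the conditional predictions over α, can be sketched as follows. This is our own simplified illustration (equally spaced grid, conditional moments assumed precomputed), not the paper's implementation; it merely demonstrates the law of total expectation and variance that propagates the α-uncertainty into the prediction.

```python
import numpy as np

def marginalize_alpha(log_post_alpha, cond_means, cond_vars):
    """Average conditional GP predictions over an equally spaced grid of
    kernel hyperparameters alpha. log_post_alpha holds the unnormalized
    log posterior p(alpha | D) at each grid point; cond_means and cond_vars
    hold the conditional predictive mean and variance there, each of shape
    (n_grid, n_prediction_points)."""
    # Normalized Riemann weights (the constant grid spacing cancels).
    w = np.exp(log_post_alpha - np.max(log_post_alpha))
    w /= w.sum()
    mean = w @ cond_means                               # E[z] = E_alpha[ E[z | alpha] ]
    var = w @ (cond_vars + cond_means**2) - mean**2     # law of total variance
    return mean, var
```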

Algorithm and Mock Data Scrutiny
We test the method for the special case of two levels; posterior mean and covariance then simplify accordingly. We generate mock data according to Equations (1) and (2) and compare the data-analysis results to the underlying truth. We have chosen as mean-function bases the Legendre polynomials up to orders 10 and 4 for Levels 1 and 2, respectively. This is convenient since this basis is both orthogonal and normalized on [−1, 1] already, and the map onto the desired domain is trivial. The covariance kernel was chosen to be the squared exponential kernel, where α_1 and α_2 were defined as the inverse of the correlation length squared. This choice was inspired by the typical form of signals encountered in impedance cardiography, about which we will say more in Section 4. The data set was one sample drawn per level; see Figure 2. We chose Jeffreys' prior for α_t. The integration bounds can be read from Figure 1, and an integration grid of 100 × 100 equally sized volumes turned out to be well converged. As an intermediate result, we compare the multi-fidelity estimates to the true parameter values in Table 1. The posterior probabilities of β_t, ρ_t, and σ_t are Gaussian. Since α_1 and α_2 are not, and thus cannot be reasonably well described with mean and variance only, we additionally show their posterior probabilities in Figure 1. The predictions and prediction uncertainties are compared to the true mean in Figure 2. We find that both the parameter estimates and the predictions statistically match the truth within their uncertainties. The mean-function parameter uncertainties (β) clearly illustrate that learning parameters by optimization has little meaning if there is little data. As the data set grows big, the posterior will contract to the maximum-likelihood solution. Still, and luckily, the prediction uncertainties are kept low because the mean-function parameter uncertainties do not enter the prediction uncertainties directly.
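The basis and kernel choices above can be sketched as follows. This is a hedged illustration: it uses NumPy's Legendre convention and assumes the inputs have already been mapped to [−1, 1]; the pivot-point count is arbitrary.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_design(x, order):
    # Design matrix h_t(x): columns are Legendre polynomials P_0, ..., P_order.
    # np.eye(order + 1)[k] selects the coefficient vector of P_k for legval.
    return np.stack(
        [legendre.legval(x, np.eye(order + 1)[k]) for k in range(order + 1)],
        axis=1,
    )

def se_kernel(x, xp, alpha):
    # Squared exponential kernel; alpha is the inverse squared correlation length.
    return np.exp(-0.5 * alpha * (x[:, None] - xp[None, :]) ** 2)

x = np.linspace(-1.0, 1.0, 21)
H1 = legendre_design(x, 10)  # level-1 mean-function basis, orders 0..10
H2 = legendre_design(x, 4)   # level-2 mean-function basis, orders 0..4
```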
We emphasize that our proposed method is naturally suited to little data problems.
Since on Level 2 we only have 11 data points, the expansion order of the mean function is limited to a maximum of 7 according to Equation (4d). Unsurprisingly, the solution rapidly worsens as we approach this constraint and entirely breaks down as we reach it because the posteriors become inconclusive. This is exactly where we find the strength of our multi-fidelity approach: we can choose a high-order mean function on a level where data is abundant and a low-order mean function on a level in which we are actually interested but where data is scarce. The trick is thus that the difference of the levels can be modeled by a low-order mean function. For the sake of completeness, we report the converged log-evidence to be 430 ± 10. In real-life applications of the method, one should and could compare different choices of mean-function expansions and covariance kernel functions by Bayesian model comparison [24], i.e., compute each choice's evidence. Let N_{α_t} be the number of hyperparameters in kernel function k_t. Since p(α | D) factorizes, the integrals are N_{α_t}-dimensional each, rather than one single integral of dimension ∏_t N_{α_t}, making the computation of the evidence relatively easy. In our case, numerical Riemann integration was good enough. When choosing more sophisticated kernel functions with more hyperparameters, one might need to use statistical integration methods such as nested sampling [25], which conveniently and automatically yields the evidence as well.
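The Riemann evidence integral can be sketched for one level and a single kernel hyperparameter; the stable log-space summation (log-sum-exp) is our own implementation choice, not prescribed by the paper.

```python
import numpy as np

def log_evidence_riemann(alpha_grid, log_joint):
    # Riemann approximation of log ∫ p(D, alpha) d(alpha) on an equally
    # spaced 1-D grid, evaluated stably in log-space (log-sum-exp).
    d_alpha = alpha_grid[1] - alpha_grid[0]
    m = np.max(log_joint)
    return m + np.log(np.sum(np.exp(log_joint - m))) + np.log(d_alpha)
```

Model comparison then amounts to computing this quantity once per candidate mean-function expansion or kernel and picking the largest.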
For the sake of numerical stability, it is advisable to rescale the data. Further, one might want to improve the condition numbers of the prior covariance matrices, σ_t² K_t(x, x), by adding a small term proportional to the identity matrix, where the proportionality constant should be several orders of magnitude smaller than σ_t². Finally, we would like to point out that Equation (4) suggests trivial parallelization of the code levels. This is not easily recognizable in the presentation of Kennedy and O'Hagan [3] but was already found by Le Gratiet and Garnier [26].
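The two stability measures can be sketched as follows; the jitter factor of 1e−8 is an illustrative value, chosen only to be several orders of magnitude below σ_t².

```python
import numpy as np

def standardize(z):
    # Rescale the data to zero mean and unit variance for numerical stability.
    mu, sd = z.mean(), z.std()
    return (z - mu) / sd, mu, sd

def regularized_cov(K, sigma2, jitter_factor=1e-8):
    # Add a "nugget" several orders of magnitude smaller than sigma^2
    # to improve the condition number of the prior covariance sigma^2 K.
    return sigma2 * K + jitter_factor * sigma2 * np.eye(K.shape[0])
```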

Application to Finite Element Simulations of Impedance Cardiography of Aortic Dissection
In this section, we quantify the uncertainties of real simulation data. We compare the uncertainties of our Bayesian MuFi-GP with those of a standard Bayesian GP that neglects the additional LoFi data, that is, the special case of N_t = 1 in Equation (4). We have described the physical and physiological model in our previous work [20,27] but restate a brief summary here for the reader's convenience.
We solve Laplace's equation,

∇ · ( (σ + iωε) ∇V ) = 0,

with finite elements on a geometry depicted in Figure 3, where V is the electric potential, σ is the electrical conductivity (not to be confused with the regression parameter in the GP kernel), ω is the angular current frequency, ε is the permittivity, and i is the imaginary unit. We had von Neumann boundary conditions, with V = const. on the top electrode and V = 0 on the bottom electrode, where the surface integral of the current was held constant at 4 mA and air was assumed to be perfectly insulating. The current had a frequency of 100 kHz. We considered one cardiac cycle that spans one second. The dynamics were modelled via a time-dependent radius of the aorta and its dissection, which arises from pressure waves in a pulsatile flow. Further, the blood conductivity is parametrized in time via its dependence on flow velocity. In the dissected aorta, we assume the flow to be stagnant [20,28]. The voltage drop is then measured from just below the top electrode to just above the bottom electrode. The impedance is the ratio of voltage over current, and the admittance is the inverse of the impedance. We used Comsol Multiphysics for the modelling [29]. For the uncertainty quantification, we chose to expand the mean functions in Legendre polynomials up to orders 8 and 2 for HiFi and LoFi, respectively. We chose the squared exponential kernel for both covariance matrices. In principle, one should compute the evidence for a number of plausible choices and choose the one with the greatest evidence. For the mean function, that would most simply mean different expansion orders, while the covariance kernel could be tailored to the PDE at hand to enforce physical behaviour, as suggested in References [30,31].
As the uncertain parameter, we consider the radius of the dissected aorta and perform a number of simulations with sensible values within the physiological and physical range, i.e., 1.0 to 24.0 mm. The LoFi data set consisted of 24 time series, each with 21 pivot points in time. The HiFi data set consisted of 3 time series (5 mm, 11 mm, and 18 mm), each with 11 pivot points in time.
In Figure 4, we show the posterior of the kernel parameters, which turns out to be quite conclusive. In Figure 5, we plot the HiFi predictions and uncertainties which are enhanced by LoFi data and compare them to HiFi predictions and uncertainties which are not enhanced by LoFi data as well as to the test data set.

Conclusions
We devised a fully Bayesian multi-level Gaussian process model to improve the uncertainty quantification of expensive and scarce high-fidelity simulation data by augmenting the data set with low-fidelity simulations. Our proposed method is rigorous and logically consistent, no ad hoc assumptions have been made, and the user is spared the embarrassment of having to tune any parameters. The method was scrutinized with mock data and shown to work with so little data that simple Bayesian Gaussian process regression is not conclusive at all. We applied the method to finite element simulations of impedance cardiography of aortic dissection and quantified the uncertainty due to the unknown size of the aortic dissection. By using meshes of both high fidelity (defined by mesh convergence) and low fidelity, we reduced the uncertainty significantly. We have thus further shown that uncertainties due to geometrical parameters can be described with Gaussian processes on each level of fidelity. With a coarsened mesh, the result is qualitatively, but not quantitatively, similar. Usually, that is not good enough, and the low-fidelity data is entirely useless to the engineer. Here, we show that this is not necessarily true in the context of uncertainty quantification. Ultimately, we want to diagnose aortic dissection from impedance cardiography signals, i.e., in the parlance of probability theory, we need to compare the evidences of healthy and diseased aortae. Unambiguous judgement will most likely, if at all, be possible only with several electrodes at once.
Thus, in the moments of z_{N_t}(x), integration with respect to δ_t(x) allows us to replace δ_t(x) by its posterior mean.

Appendix A.1. Parameter Posterior and Parameter Estimates
We need the posterior of the parameters. By Bayes' theorem, it is proportional to the Gaussian likelihood times the priors. The exponent of this Gaussian is a quadratic form in β_t, and we rewrite it accordingly; from this form, we can read off the (conditional) expectation and covariance of β_t. We can now integrate with respect to β_t, assuming a flat prior for β_t, with N_{x_t} the number of pivot points in input vector x_t and N_{β_t} the number of basis functions of level t. Apparently, the expectation of β is independent of σ, and the covariance of β is independent of ρ. We next tend to integrating with respect to ρ_{t−1}. We find ρ_{t−1} in δ_t(x_t) only, and thus again as a quadratic form; we can then integrate with respect to ρ_{t−1}, assuming a flat prior. The last random variable we can treat analytically is σ_t. With Jeffreys' prior, p(σ_t | C) = 1/σ_t, and the substitution v = Φ_t/(2σ_t²), we find the remaining integral to be a Γ-integral. The moments of σ_t are Γ-integrals as well. For the above Γ-integral and the σ-moments to exist, we require γ_t − 1/2 − ν/2 > 0, i.e., Equation (4d). Note that, in the case of t = 1, we have no integration with respect to ρ_0 and thus find one power of √σ_1 less in the above Γ-integral, i.e., we need to substitute γ_1 − 1/2 → γ_1. The constraint is accordingly weakened for t = 1, which is due to the absence of the single parameter ρ_0. For most choices of the covariance kernel, we cannot go further analytically. Now, we have everything we need to marginalize σ_t in the conditional expectations of ρ_{t−1},
which is now relatively easy.
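For reference, the Γ-integral appearing in this marginalization can be written out explicitly. In this sketch, 2γ_t stands for the total power of σ_t collected from the likelihood, Jeffreys' prior, and the previous marginalizations; the precise bookkeeping of γ_t is as derived in the text above.

```latex
\int_0^\infty \mathrm{d}\sigma_t \,
  \sigma_t^{-2\gamma_t} \exp\!\left(-\frac{\Phi_t}{2\sigma_t^2}\right)
\;\stackrel{v \,=\, \Phi_t/(2\sigma_t^2)}{=}\;
\frac{1}{2}\left(\frac{2}{\Phi_t}\right)^{\gamma_t-\frac{1}{2}}
  \int_0^\infty \mathrm{d}v \, v^{\gamma_t-\frac{3}{2}} e^{-v}
= \frac{1}{2}\left(\frac{2}{\Phi_t}\right)^{\gamma_t-\frac{1}{2}}
  \Gamma\!\left(\gamma_t-\tfrac{1}{2}\right)
```

Inserting a moment σ_t^ν into the integrand shifts the Γ-argument to γ_t − 1/2 − ν/2, which reproduces the existence requirement γ_t − 1/2 − ν/2 > 0 stated above.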
The expected mean-function parameters follow directly, and the variance we read from the quadratic form in β_t, with (A_t)_{kk} the kth diagonal element of A_t.

Appendix A.2. Predictive Mean
We rewrite the posterior mean of δ_t such that it splits into two terms. The first term, u_t, assumes a rather simple form and depends only on ρ_{t−1} and α. Since we found p(ρ, α | D) = ∏_k p(ρ_k, α_k | D_k) previously, and each ρ_k occurs linearly, its expectation follows directly. The second term is rather trivial.
We have thus shown the predictive mean in Equation (4).

Appendix A.3. Predictive Covariance
We easily find that the cross-terms mixing different levels vanish in the expectation value: due to the assumed independence of the levels, or rather of each level's parameters, formally δ_t(x) ⊥ z_{t−1}(x), the double sum reduces to a single sum. Further, we can factor the expectation into the expectation of the product of the ρ_l and the expectation of the remaining sum.
The first factor on the right-hand side can be expressed via the GP's posterior covariance and mean; the posterior covariance is the GP's standard posterior covariance, and the predictive mean we already know from Appendix A.2. We have thus shown everything we claimed.