Bayesian³ Active Learning for the Gaussian Process Emulator Using Information Theory

Gaussian process emulators (GPE) are a machine learning approach that replicates computationally demanding models using training runs of that model. Constructing such a surrogate is very challenging and, in the context of Bayesian inference, the training runs should be well invested. The current paper offers a fully Bayesian view on GPEs for Bayesian inference, accompanied by Bayesian active learning (BAL). We introduce three BAL strategies that adaptively identify training sets for the GPE using information-theoretic arguments. The first strategy relies on Bayesian model evidence, which indicates the GPE's quality of matching the measurement data; the second is based on relative entropy, which indicates the relative information gain for the GPE; and the third is founded on information entropy, which indicates the missing information in the GPE. We illustrate the performance of our three strategies using an analytical benchmark and a carbon-dioxide benchmark. The paper shows evidence of convergence against a reference solution and demonstrates quantification of post-calibration uncertainty by comparing the three introduced strategies. We conclude that the Bayesian model evidence-based and relative entropy-based strategies outperform the entropy-based strategy because the latter can be misleading during the BAL. The relative entropy-based strategy demonstrates superior performance to the Bayesian model evidence-based strategy.


Introduction
The greatest challenge of the scientific modeling workflow is to construct reliable and feasible models that can adequately describe underlying physical concepts and, at the same time, account for uncertainty [1]. Due to the computational complexity of the underlying physical concepts, numerical simulation models are often too expensive for application tasks related to uncertainty quantification, risk assessment and stochastic model calibration. The great difficulty here is to establish a consistent and feasible framework that can provide appropriate conceptual descriptions and can simultaneously maintain a reliable time frame of simulations. The latter is the primary reason why a vast majority of ongoing research has been focusing on accelerating the forward model using surrogate models, such as response surfaces, emulators, meta-models, reduced-order models, etc. Due to the high computational costs of the original numerical simulation required for training runs of such surrogates, their construction has been approached via optimal selection of training points in [64,65] and extended with active learning concepts [13,38] by several authors. Since GPEs are described by their mean and covariance, the choice of a parametric covariance function and the estimation of the related hyper-parameters are decisive for constructing the GPE. Various works in the literature focus on the estimation of these hyper-parameters. Several covariance functions are known and used in the literature; in particular, the squared exponential covariance and the Matérn covariance functions play an important role in geophysical applications [66][67][68].
GPE surrogates have often been used to replicate computationally demanding models. Employing various learning functions for selecting GPE training points helps to ensure accuracy in that procedure (see [69][70][71]). Conventionally, learning approaches only focus on GPE training for an underlying model without considering measurement data in the context of Bayesian updating of model parameters. Contrary to that, the study [72] makes use of data only (no numerical model involved) and constructs a GPE model that directly represents the underlying phenomena. It employs information entropy to perform an optimal design of experiments for sensor placements and, in doing so, it exploits the multivariate Gaussian distribution of the GPE model. The study [73] also looks at model-based optimal design of experiments for sensor/measurement selection. Specifically, they look at sample placement for contaminant source identification problems, where the contaminant source geometry is covered by model parameters that are to be inferred. Within their optimization, they used a GPE-based MCMC simulation with local GPE refinement as an auxiliary tool. This means that their active learning strategy is made for planning real-world data collection, not for planning model runs during GPE training. Recently, the study [74] constructed a framework for accelerating MCMCs in Bayesian inference of model parameters. They introduced local approximations of these models into the Metropolis-Hastings kernel of MCMC. In such a setting, the greatest challenge is to replicate behaviors of the original numerical model while accounting for the available observation data at acceptable computational costs. Going in the same direction, the work [75] uses entropy as a learning function in the context of Bayesian assimilation of available data, but it approximates the posterior distribution of model parameters as log-normal and relies on a multivariate Gaussian approximation of entropy.
Similarly, learning functions for GPE training in Bayesian parameter inference can focus on the posterior mean and variance of model parameters obtained via assimilation of available measurement data. Following this route, the paper [76] suggested minimizing the mean-squared (averaged over the parameter posterior distribution) predictive error of the GPE. Alternatively, the study [77] suggested minimizing the integrated posterior variance of the GPE, again involving an average over the parameter posterior distribution. However, the posterior distribution resulting from Bayesian assimilation of measurement data is typically not multivariate Gaussian and, hence, any corresponding assumptions should be avoided whenever possible. Therefore, the current paper avoids such unnecessary assumptions. Instead, it offers a fully Bayesian view that relies on integral quantities such as BME, RE and IE.
Moreover, the accuracy of a GPE depends strongly on the selection of training points [13], and GPEs with different training points can be seen as different surrogate models. Therefore, the main challenge of GPE-based surrogates in Bayesian inference consists of selecting training runs that capture the relevant features of the underlying full-complexity physical model, while focusing its accuracy on the regions of a good fit with the available observation data.
The current paper introduces a novel GPE-based machine learning framework to replicate behaviors of a computationally demanding physical model using training runs of that model. The suggested framework invests the training runs optimally for parameter inference from observation data in a fully Bayesian view. The introduced framework makes use of Bayesian theory on three different levels: first, the construction of the GPE from available training runs and the identification of hyper-parameters usually employ classical Bayesian principles; second, the incorporation of the available observation data can be rigorously addressed via Bayesian updating; third, the training runs should be well identified using an adaptive Bayesian active learning strategy for GPE construction that captures the relevant features of the full-complexity model and honors the available observation data. Therefore, the overall framework can be seen as fully Bayesian (i.e., Bayesian³) active learning or, in short, is denoted as BAL in the paper. The novelty over previous studies and our main focus lie in the third level, where we suggest three novel active learning strategies incorporating information theory.
The rest of the paper is structured as follows: Section 2 explores the connection between Bayesian inference and information theory for Gaussian process emulators. This section also emphasizes that information entropy and relative entropy for Bayesian parameter inference and for Bayesian active learning can be computed while avoiding any assumptions or unnecessary multidimensional integration. Section 3 summarizes fundamental properties of Gaussian process emulators and offers the three novel Bayesian active learning strategies based on GPEs and information theory for Bayesian parameter inference. Section 4 demonstrates the application of the suggested GPE-based BAL strategies using an analytical example and a carbon dioxide benchmark problem. Additionally, Section 4 shows evidence of convergence against a brute-force reference solution for all proposed BAL strategies and demonstrates parameter inference for the proposed active learning strategies.

Construction and Training of Gaussian Process Emulators
We will consider a full-complexity model M producing a model response M(ω, x, y, z, t) that depends on some multi-dimensional parameter input ω at each physical point in space (x, y, z) and time t. The uncertain modelling parameters ω form a vector of random variables ω = {ω_1, ..., ω_n} from the parameter space Ω, where n is the number of uncertain parameters.
A Gaussian process emulator S(ω, x, y, z, t) (i.e., surrogate model) provides an approximation of the full-complexity model M(ω, x, y, z, t) over the parameter space Ω and for each point of space (x, y, z) and time t:

S(ω, x, y, z, t) = Σ_{l=1}^{m} β_l(x, y, z, t) h_l(ω) + u(ω, x, y, z, t).   (1)

Here, h_l(ω) for l = 1, . . . , m denote trend basis functions over the parameter space Ω, h_1 is a constant and β_l(x, y, z, t) are unknown coefficients of the expansion that only depend on space (x, y, z) and time t. The last term u(ω, x, y, z, t) in Equation (1) indicates the Gaussian process (GP). To introduce this term, we will assume u to be a GP with zero mean E(u) = 0 and covariance Cov(ω, ω') between u(ω, x, y, z, t) and u(ω', x, y, z, t) given by:

Cov(ω, ω') = k(ω, ω').   (2)

Therefore, the GP term u(ω, x, y, z, t) is assumed to be Gaussian distributed, u(ω, x, y, z, t) ∼ N(0, k(ω, ω')), according to the covariance kernel function k(ω, ω') for each point of space (x, y, z) and time t. It is worth mentioning that there are different choices of the covariance kernel function k(·, ·) available. The most common ones are the squared exponential kernel k_SE(·, ·) (the same as the Gaussian kernel) and the Matérn kernel k_Matérn,ν(·, ·), which are defined as follows:

k_SE(ω, ω') = σ² exp( −Σ_{j=1}^{n} (ω_j − ω'_j)² / (2λ_j²) ),
k_Matérn,ν(ω, ω') = σ² (2^{1−ν}/Γ(ν)) ( √(2ν) ‖ω − ω'‖ / λ )^ν K_ν( √(2ν) ‖ω − ω'‖ / λ ).   (3)

Here, λ_j ≡ λ_j(x, y, z, t) and λ ≡ λ(x, y, z, t) represent the length scales of auto-correlation along each component ω_j with j = 1, . . . , n, the variance parameter is denoted by σ² ≡ σ²(x, y, z, t) and the gamma function is denoted by Γ. K_ν is the modified Bessel function of the second kind with the smoothness parameter ν ≡ ν(x, y, z, t). For a more comprehensive description of several kernel functions, we refer the reader to [62]. The parameters σ², λ, λ_j and ν, as well as the trend coefficients β_l, define the mean and correlation function, and are called hyper-parameters of S.
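The two kernels in Equation (3) can be evaluated in a few lines. The following Python snippet is an illustrative sketch (not part of the paper's Matlab implementation): it implements k_SE with per-dimension length scales λ_j and the Matérn kernel via the modified Bessel function K_ν from SciPy.

```python
import numpy as np
from scipy.special import kv, gamma

def k_se(w1, w2, sigma2=1.0, lam=1.0):
    """Squared exponential (Gaussian) kernel with per-dimension length scales lam."""
    w1, w2 = np.atleast_1d(w1), np.atleast_1d(w2)
    lam = np.broadcast_to(np.atleast_1d(lam), w1.shape)
    return sigma2 * np.exp(-np.sum((w1 - w2) ** 2 / (2.0 * lam ** 2)))

def k_matern(w1, w2, nu=1.5, sigma2=1.0, lam=1.0):
    """Matérn kernel with smoothness nu and isotropic length scale lam."""
    r = np.linalg.norm(np.atleast_1d(w1) - np.atleast_1d(w2))
    if r == 0.0:
        return sigma2                        # limit of the formula as r -> 0
    z = np.sqrt(2.0 * nu) * r / lam
    return sigma2 * (2.0 ** (1.0 - nu) / gamma(nu)) * z ** nu * kv(nu, z)
```

A convenient sanity check: for ν = 1/2 the Matérn kernel reduces to the exponential kernel σ² exp(−‖ω − ω′‖/λ).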
Constructing a Gaussian process emulator S(ω, x, y, z, t) for the full-complexity physical model M(ω, x, y, z, t) in Equation (1) involves two steps: inference of the hyper-parameters and training on the available model runs. Here, and in the following, we drop the coordinates for space (x, y, z) and time t in our notation for the sake of readability.
First, we look at hyper-parameter inference. According to the GP, we will assume that each instance of the model response M_Ti for i = 1, . . . , N_T can be modeled in a probabilistic sense as:

M_Ti = Σ_{l=1}^{m} β_l h_l(ω_Ti) + u(ω_Ti),  i = 1, . . . , N_T.   (4)

Introducing vector notation for the training parameter sets ω_T = {ω_T1, . . . , ω_TN_T}^T, the corresponding model responses M_T = {M_T1, . . . , M_TN_T}^T and the GP terms U = {u(ω_T1), . . . , u(ω_TN_T)}^T, we can rewrite the GPE representation for a given parameter set ω in the following form:

M_T = Hβ + U,   (5)

where H is the N_T × m matrix of trend basis functions with entries H_il = h_l(ω_Ti) and β = {β_1, . . . , β_m}^T. Furthermore, the joint distribution of the random vector U for a given parameter set ω is given by:

U ∼ N(0, K(ω_T, ω_T)),   (6)

where the (co)variance matrix K(ω, ω') for parameter sets ω, ω' is defined according to the covariance kernel functions as follows:

K_ij(ω, ω') = k(ω_i, ω'_j).   (7)

Equations (5)-(7) allow for estimating the GPE's hyper-parameters using several GP-based methods and concepts that are available in the literature [13,62,65], such as the maximum likelihood method or more advanced Bayesian principles [62]. The mentioned Bayesian updating that identifies the hyper-parameters of the GPE representation is well known in the literature [13,62,65] and will not be addressed in the current paper. Bayesian principles are again employed to train the Gaussian process emulator S(ω, x, y, z, t) of the full-complexity physical model M(ω, x, y, z, t) based on the available training points ω_T = {ω_T1, . . . , ω_TN_T}^T and the corresponding model responses M_T = {M_T1, . . . , M_TN_T}^T (see, e.g., [78]). The training procedure provides the posterior multivariate Gaussian distribution N_ω(µ_S, σ_S) with mean value µ_S and standard deviation σ_S of S(ω, x, y, z, t) for any given parameter set ω from the parameter space Ω. The accuracy of the GPE strongly depends on the number of training points and how they have been selected [13]. The question of how to select the training points properly is extremely relevant in general. It is even more challenging when observation data should be incorporated into the full-complexity model via Bayesian inference of model parameters ω. Moreover, GPEs with different training points can be seen as, and indeed are, different models.
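The training step described above amounts to standard GP conditioning on the runs (ω_T, M_T). The following is a minimal numpy sketch of the resulting posterior mean µ_S and standard deviation σ_S, assuming a zero trend and fixed hyper-parameters; the paper's fitrgp-based implementation additionally estimates the trend coefficients β_l and the hyper-parameters.

```python
import numpy as np

def se_kernel(A, B, sigma2=1.0, lam=1.0):
    # Squared exponential kernel matrix between row-wise point sets A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return sigma2 * np.exp(-d2 / (2.0 * lam ** 2))

def gpe_posterior(w_train, m_train, w_query, sigma2=1.0, lam=1.0, nugget=1e-10):
    """Zero-trend GP conditioning: posterior mean mu_S and std sigma_S at w_query."""
    K = se_kernel(w_train, w_train, sigma2, lam) + nugget * np.eye(len(w_train))
    Ks = se_kernel(w_query, w_train, sigma2, lam)
    Kss = se_kernel(w_query, w_query, sigma2, lam)
    alpha = np.linalg.solve(K, m_train)
    mu = Ks @ alpha
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mu, np.sqrt(np.clip(np.diag(cov), 0.0, None))
```

At the training points themselves, the posterior mean reproduces the training responses and the posterior standard deviation collapses to (almost) zero, which is the interpolation property of a noise-free GPE.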
Therefore, the current paper will introduce a fully Bayesian view (Bayesian inference in Section 2.2 and Bayesian active learning in Section 3) on the construction of the GPE-based surrogate S(ω, x, y, z, t) that must capture the main features of the full-complexity model M(ω, x, y, z, t) and will be used to assist in Bayesian parameter inference.

Bayesian Updating on Observation Data Using GPE
Bayesian theory offers a statistically rigorous approach to deal with uncertainty during inference, providing probabilistic information on the remaining uncertainty in parameters and predictions while incorporating the available observation data D (a vector of size N_D × 1, with N_D the length of the observation data set) that is usually attributed to specific points in space (x, y, z) and time t. In the Bayesian framework, initial knowledge on the modelling parameters ω is encoded in a prior probability distribution p(ω). After Bayesian parameter inference, one obtains a posterior probability distribution of the parameters p(ω|D), which is more informative than the prior distribution. The posterior probability distribution of the parameters p(ω|D) can be obtained with the help of the full-complexity model M (i.e., p(ω|D, M)) or with the help of the surrogate model S (i.e., p(ω|D, S)) according to the approximation in Equation (1). Due to the high computational demand of the original full-complexity model, we will employ the latter option in the current paper, i.e., p(ω|D) ≡ p(ω|D, S) ≈ p(ω|D, M) (see more details in [79]).
Formally, the posterior parameter distribution p(ω|D) of the n uncertain parameters forming the vector of random variables ω = {ω_1, ..., ω_n} is obtained by updating the prior parameter distribution p(ω) in the light of the observed data D according to Bayes' theorem [21]:

p(ω|D) = p(D|ω) p(ω) / p(D),   (8)

where the term p(D|ω) is the likelihood function that quantifies how well the surrogate model's predictions S(ω, x, y, z, t) match the observed data D (the full notation corresponding to p(D|ω, S) will be avoided in the paper). The term p(D) (i.e., p(D, S) in full notation) is called Bayesian model evidence (BME) and can be seen as a normalizing constant for the posterior distribution of the parameters ω.
In order to describe how well the GPE predictions S(ω, x, y, z, t) in physical space {x, y, z, t} match the observed data D, we use the following likelihood function p(D|ω), assuming independent and Gaussian distributed measurement errors:

p(D|ω) = (2π)^{−N_D/2} |R|^{−1/2} exp( −(1/2) [D − S(ω)]^T R^{−1} [D − S(ω)] ),   (9)

where R (N_D × N_D) is the diagonal (co)variance matrix of the measurement errors. The performance of Bayesian updating on the available observation data using the GPE surrogate can be assessed by employing BME, relative entropy and information entropy [80], which are introduced in the upcoming Sections 2.3-2.5.
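For numerical work, the likelihood in Equation (9) is best evaluated in log space to avoid underflow when the data vector is long. A small sketch for the diagonal-R case assumed above (the function name is ours, not from the paper):

```python
import numpy as np

def log_likelihood(d_obs, s_pred, r_diag):
    """Gaussian log-likelihood ln p(D|w) for independent measurement errors.

    d_obs  : observed data vector D (length N_D)
    s_pred : surrogate prediction S(w) at the observation points
    r_diag : diagonal of the error (co)variance matrix R
    """
    resid = d_obs - s_pred
    n_d = d_obs.size
    return (-0.5 * n_d * np.log(2.0 * np.pi)
            - 0.5 * np.sum(np.log(r_diag))
            - 0.5 * np.sum(resid ** 2 / r_diag))
```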

Bayesian Model Evidence
The BME value p(D) in the denominator of Equation (8) indicates the quality of the model against the available data and can be obtained by integrating Equation (8) over the parameter space Ω:

BME = p(D) = ∫_Ω p(D|ω) p(ω) dω.   (10)

Therefore, BME p(D) can be directly estimated [81] from Equation (10) using Monte Carlo sampling over the prior:

BME ≈ (1/N) Σ_{i=1}^{N} p(D|ω_i),  ω_i ∼ p(ω),   (11)

where N is the size of the Monte Carlo sample. We remark that the GPE-based surrogate model S(ω, x, y, z, t) contains approximation errors because the GPE is merely a surrogate model of the full-complexity model M(ω, x, y, z, t). Therefore, the GPE-based BME value p(D, S) in Equation (11) is an approximation of the full-complexity model's BME value p(D, M), i.e., BME = E_{p(ω)}(p(D|ω, S)) ≈ E_{p(ω)}(p(D|ω, M)). A correction factor for the BME value can be incorporated similar to the one in [79].
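Since the individual likelihood values in Equation (11) can underflow for larger data sets, a practical implementation of the Monte Carlo BME estimate works with log-likelihoods and a log-sum-exp trick. An illustrative sketch:

```python
import numpy as np

def ln_bme(log_lik_samples):
    """Prior-based Monte Carlo estimate of ln BME (Equation (11)).

    log_lik_samples : ln p(D|w_i) evaluated at N prior draws w_i ~ p(w).
    Shifting by the maximum avoids underflow of very small likelihoods.
    """
    m = np.max(log_lik_samples)
    return m + np.log(np.mean(np.exp(log_lik_samples - m)))
```

Even when every likelihood value would underflow to 0.0 in linear space (e.g., log-likelihoods around -2000), the log-space average remains exact.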

Relative Entropy
Relative entropy (RE), also called Kullback-Leibler divergence, measures the difference between two probability distributions [43] in the Bayesian context. Relative entropy D_KL[p(ω|D), p(ω)] measures the so-called information geometry in moving from the prior p(ω) to the posterior p(ω|D), or the information lost when p(ω) is used to approximate p(ω|D):

D_KL[p(ω|D), p(ω)] = ∫_Ω p(ω|D) ln[ p(ω|D)/p(ω) ] dω.   (13)

Estimating the relative entropy in Equation (13) usually requires a multidimensional integration that is often infeasible for most applied problems. However, employing Equation (13) and definition (5) from the recent findings in the paper [40], we avoid this multidimensional integration:

D_KL[p(ω|D), p(ω)] = E_{p(ω|D)}(ln[p(D|ω)]) − ln[BME].   (14)

Therefore, relative entropy D_KL[p(ω|D), p(ω)] can be directly estimated from Equation (14) using Monte Carlo sampling techniques on the GPE:

D_KL[p(ω|D), p(ω)] ≈ (1/N_p) Σ_{j=1}^{N_p} ln[p(D|ω_j)] − ln[BME],  ω_j ∼ p(ω|D).   (15)

The expression for relative entropy in Equation (15) employs the prior-based estimation of BME values in Equation (11) and a posterior-based expectation of the likelihood E_{p(ω|D)}(ln[p(D|ω)]) that could be obtained, e.g., using a rejection sampling technique or similar [26]:

E_{p(ω|D)}(ln[p(D|ω)]) ≈ (1/N_p) Σ_{j=1}^{N_p} ln[p(D|ω_j)],  ω_j ∼ p(ω|D),   (16)

where N_p is the size of the posterior sample according to rejection sampling.
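The RE identity in Equations (14) and (15) can be turned into a few lines of code: estimate ln BME from prior-based log-likelihoods, draw a posterior sample by rejection sampling, and take the posterior average of the log-likelihood. The following sketch is our illustration (the acceptance probability is scaled by the maximum likelihood, a common convention):

```python
import numpy as np

def posterior_indices(log_lik, rng):
    # Rejection sampling: accept prior draw i with probability p(D|w_i)/max_j p(D|w_j).
    return np.nonzero(rng.random(log_lik.size) < np.exp(log_lik - np.max(log_lik)))[0]

def relative_entropy(log_lik, rng):
    """D_KL[p(w|D), p(w)] = E_post[ln p(D|w)] - ln BME (Equations (14)-(16)),
    estimated from log-likelihoods ln p(D|w_i) at prior draws w_i ~ p(w)."""
    m = np.max(log_lik)
    ln_bme = m + np.log(np.mean(np.exp(log_lik - m)))   # Equation (11)
    idx = posterior_indices(log_lik, rng)               # posterior sample
    return np.mean(log_lik[idx]) - ln_bme               # Equation (15)
```

For a constant likelihood the posterior equals the prior and the estimator returns exactly zero; for a Gaussian prior N(0,1) with a Gaussian likelihood around d = 0 (error variance 1), the posterior is N(0, 1/2) and the estimate converges to the closed-form D_KL value.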

Information Entropy
Information entropy (IE) is a measure of the expected missing information and can also be seen as the uncertainty of a random variable ω. According to Shannon [42], the information entropy H[p(ω|D)] for a random variable ω with (posterior) parameter distribution p(ω|D) is defined as:

H[p(ω|D)] = −∫_Ω p(ω|D) ln[p(ω|D)] dω.   (17)

However, information entropy H[p(ω|D)] in Equation (17) cannot be computed directly from a posterior sample because the posterior density values p(ω|D) are unknown. To overcome this situation, we will employ the definition of D_KL[p(ω|D), p(ω)] in Equation (13) to express the information entropy as:

H[p(ω|D)] = −D_KL[p(ω|D), p(ω)] − E_{p(ω|D)}(ln[p(ω)]).   (18)

Therefore, employing Equation (15), information entropy can be directly estimated according to Equation (A3) in the paper [40] using Monte Carlo sampling techniques on the GPE:

H[p(ω|D)] = ln[BME] − E_{p(ω|D)}(ln[p(ω)]) − E_{p(ω|D)}(ln[p(D|ω)]).   (19)

Equation (19) does not contain any assumptions and avoids all multidimensional density estimations and integrals in Equation (17). It employs the prior-based estimation of BME values in Equation (10) and the posterior-based expectations of prior densities E_{p(ω|D)}(ln[p(ω)]) and likelihoods E_{p(ω|D)}(ln[p(D|ω)]). The posterior-based expectation of prior log-densities could be obtained as well using rejection sampling techniques [26]:

E_{p(ω|D)}(ln[p(ω)]) ≈ (1/N_p) Σ_{j=1}^{N_p} ln[p(ω_j)],  ω_j ∼ p(ω|D).   (20)
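The entropy identity in Equation (19) reuses the same prior-based sample; only the posterior expectation of the prior log-density is new. A compact sketch under the same assumptions as above (our illustration, not the paper's Matlab code):

```python
import numpy as np

def information_entropy(log_lik, log_prior, rng):
    """H[p(w|D)] = ln BME - E_post[ln p(w)] - E_post[ln p(D|w)] (Equation (19)),
    from prior draws w_i ~ p(w) with log-likelihoods and log prior densities."""
    m = np.max(log_lik)
    ln_bme = m + np.log(np.mean(np.exp(log_lik - m)))        # Equation (11)
    keep = rng.random(log_lik.size) < np.exp(log_lik - m)    # rejection sampling
    return ln_bme - np.mean(log_prior[keep]) - np.mean(log_lik[keep])
```

Two sanity checks: a constant likelihood with a U(0,1) prior (ln p(ω) = 0) gives H = 0, the entropy of the uniform prior; a Gaussian prior N(0,1) with a Gaussian likelihood around d = 0 (error variance 1, so log-prior and log-likelihood coincide) gives the posterior N(0, 1/2) with entropy 0.5 ln(πe).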

Bayesian Active Learning for Gaussian Process Emulators in Parameter Inference
Section 2 described the standard construction of the GPE S(ω, x, y, z, t) based on available training runs of the physical model M_T = {M_T1, . . . , M_TN_T}^T. As we want to infer the model parameters of the underlying physical model assisted by the constructed GPE surrogate, the training runs of the full model must ensure appropriate convergence. Specifically, the GPE surrogate S(ω, x, y, z, t) must capture the main global features of the full-complexity model M(ω, x, y, z, t) and, at the same time, have local accuracy in the region of high posterior density p(ω|D) that will emanate during Bayesian inference. However, these regions for local accuracy are unknown a priori. Therefore, the current Section 3 focuses on iterative selection of training points. It employs the link between Bayesian inference and information theory [40], similar to Section 2, in order to perform Bayesian active learning (BAL). The latter will iteratively select new training points as the regions with required local accuracy become progressively clear during the Bayesian updating of the Gaussian process emulator described in Section 2.
We will consider that the GPE surrogate S(ω, x, y, z, t) in Equation (1) has been constructed based on at least one training point (N_T ≥ 1). The goal of the current Section 3 is to identify the next training point ω_T^BAL that should be incorporated into the GPE surrogate S(ω, x, y, z, t).
To do so, we will introduce three Bayesian active learning strategies that are based on Bayesian model evidence (Section 3.2), relative entropy (Section 3.3), and information entropy (Section 3.4). These strategies avoid unnecessary approximations and assumptions (such as maximum likelihood estimation, a multivariate Gaussian posterior, etc.). Once a new training point ω_T^BAL has been identified, the model should be evaluated at that new point and the GPE in Equation (1) should be updated with typical GPE-inherent methods. In that way, the presented GPE-based fully Bayesian approach could help to calibrate the physical model to the available measurement data at reduced computational costs. However, the GPE representation can never be better than the underlying physical model.

Bayesian Inference of Gaussian Process Emulator Incorporating Observation Data
The GPE is a collection of random functions over the parameter space Ω, i.e., a random model response S(ω, x, y, z, t) for each point of space (x, y, z) and time t. The Bayesian identification of hyper-parameters during the training on the available model runs M_T(ω, x, y, z, t) in Section 2.1 provides the multivariate Gaussian distribution N_ω(µ_S, σ_S) of the model response S(ω, x, y, z, t), forming the response space S, for any given parameter set ω from the parameter space Ω. Here, µ_S is the mean value and σ_S is the standard deviation of the model response S(ω, x, y, z, t) at each point of space (x, y, z) and time t. Therefore, we can explore the parameter space Ω using an exploration parameter set ω_E and assess how the obtained multivariate Gaussian distribution N_ωE(µ_S, σ_S) meets the observation data D. According to the Bayesian framework [21], we can obtain a posterior probability distribution p^BAL_ωE(S|D) of the model response for the given parameter set ω_E, incorporating the observed data D:

p^BAL_ωE(S|D) = p^BAL_ωE(D|S) N_ωE(µ_S, σ_S) / p^BAL_ωE(D),   (21)

where the term p^BAL_ωE(D|S) is the likelihood function that quantifies how well the GPE predictions S(ω_E, x, y, z, t) drawn from the multivariate Gaussian N_ωE(µ_S, σ_S) match the observed data D, and the term p^BAL_ωE(D) is the BME value of the GPE for the given parameter set ω_E. Assuming independent and Gaussian distributed measurement errors, the likelihood function p^BAL_ωE(D|S) can be written as:

p^BAL_ωE(D|S) = (2π)^{−N_D/2} |R|^{−1/2} exp( −(1/2) [D − S]^T R^{−1} [D − S] ).   (22)

Model Evidence-Based Bayesian Active Learning
As already mentioned in Section 2.2, BME is often used for model selection in order to identify the most suitable model among a set of competing models or to rank the competing models. During the active learning procedure, one has to identify the best "model" in the sense of the next trained version of the GPE, i.e., the best position of the sampling point ω_E. This point ω_E can be chosen such that it provides the highest BME of the next trained GPE. Therefore, the BME value p^BAL_ωE(D) for each point ω_E in the prior parameter space, providing model responses S(ω, x, y, z, t) that form a response space Y, can be obtained using the following equation:

BME^BAL = p^BAL_ωE(D) = ∫_Y p^BAL_ωE(D|S) N_ωE(µ_S, σ_S) dS.   (23)

Equation (23) shows that BME^BAL is equal to the expected value E_{N_ωE(µ_S,σ_S)} of the likelihood p^BAL_ωE(D|S) over the prior N_ωE(µ_S, σ_S) that the GPE provides after training:

BME^BAL = E_{N_ωE(µ_S,σ_S)}( p^BAL_ωE(D|S) ).   (24)

The value BME^BAL can be directly estimated from Equation (24) using Monte Carlo sampling:

BME^BAL ≈ (1/N_BAL) Σ_{i=1}^{N_BAL} p^BAL_ωE(D|S_i),  S_i ∼ N_ωE(µ_S, σ_S),   (25)

where N_BAL is the size of the Monte Carlo sample for Bayesian active learning. Therefore, by formal maximization of the model evidence BME^BAL, one can find the next training point ω_T^BAL from the parameter space Ω:

ω_T^BAL = argmax_{ωE ∈ Ω} BME^BAL.   (26)
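The Monte Carlo estimate of Equation (24) only requires draws from the GPE's predictive Gaussian at the exploration point ω_E, so no further full-model runs are needed. An illustrative log-space sketch (diagonal R, as in Equation (9); the function name is ours):

```python
import numpy as np

def ln_bme_bal(mu_s, sigma_s, d_obs, r_diag, n_bal=100_000, rng=None):
    """Monte Carlo estimate of ln BME^BAL at one exploration point w_E.

    mu_s, sigma_s : GPE predictive mean/std at the observation locations for w_E
    d_obs, r_diag : observation vector D and diagonal of the error covariance R
    """
    rng = rng or np.random.default_rng()
    s = rng.normal(mu_s, sigma_s, size=(n_bal, mu_s.size))   # S_i ~ N(mu_S, sigma_S)
    ll = -0.5 * np.sum((d_obs - s) ** 2 / r_diag + np.log(2 * np.pi * r_diag), axis=1)
    m = ll.max()
    return m + np.log(np.mean(np.exp(ll - m)))               # log-sum-exp average

# The next training point (Equation (26)) maximizes this value over candidate
# exploration points, e.g. (hypothetical mu(w), sd(w) helpers):
# w_bal = candidates[np.argmax([ln_bme_bal(mu(w), sd(w), D, r) for w in candidates])]
```

Since the Gaussian response S integrates out analytically against a Gaussian likelihood, BME^BAL here equals N(D; µ_S, R + diag(σ_S²)), which gives a closed-form check of the Monte Carlo estimate.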

Relative Entropy-Based Bayesian Active Learning
Relative entropy is usually employed for Bayesian experimental design [51] to maximize the expected (marginalized) utility [53]. In the current paper, we will introduce the relative entropy D^BAL_KL[p^BAL_ωE(S|D), N_ωE(µ_S, σ_S)] to assess the information geometry in moving the GPE from the multivariate Gaussian prior N_ωE(µ_S, σ_S) to its posterior p^BAL_ωE(S|D) during the active learning procedure. Formally, the relative entropy D^BAL_KL[p^BAL_ωE(S|D), N_ωE(µ_S, σ_S)] can be defined for each sampling point ω_E from the parameter space Ω as follows:

D^BAL_KL[p^BAL_ωE(S|D), N_ωE(µ_S, σ_S)] = ∫_Y p^BAL_ωE(S|D) ln[ p^BAL_ωE(S|D) / N_ωE(µ_S, σ_S) ] dS.   (27)

Similar to Section 2.4, we can avoid multidimensional integration in Equation (27) using Equation (13) and definition (5) from the paper [40]:

D^BAL_KL[p^BAL_ωE(S|D), N_ωE(µ_S, σ_S)] = E_{p^BAL_ωE(S|D)}(ln[p^BAL_ωE(D|S)]) − ln[BME^BAL].   (28)

The posterior-based expectation of the log-likelihood could be obtained using a rejection sampling technique [26] on the GPE:

E_{p^BAL_ωE(S|D)}(ln[p^BAL_ωE(D|S)]) ≈ (1/N^BAL_p) Σ_{j=1}^{N^BAL_p} ln[p^BAL_ωE(D|S_j)],  S_j ∼ p^BAL_ωE(S|D),   (29)

where N^BAL_p is the size of the posterior sample according to rejection sampling for Bayesian active learning.
Therefore, during the active learning procedure, we will identify the sampling point ω_T^BAL from the parameter space Ω that corresponds to the maximum relative entropy:

ω_T^BAL = argmax_{ωE ∈ Ω} D^BAL_KL[p^BAL_ωE(S|D), N_ωE(µ_S, σ_S)].   (30)

It is evident that the optimization problem for the RE value in Equation (30) relies not only on BME^BAL values from Equation (24). It also relies on the cross entropy represented by the term E_{p^BAL_ωE(S|D)}(ln[p^BAL_ωE(D|S)]), which reflects how informative the likelihood is for the posterior (see details in [40]). The last term could be obtained using a rejection sampling technique on the GPE evaluations.

Information Entropy-Based Bayesian Active Learning Criterion
Minimizing the expected information loss in terms of information entropy [42] has been suggested to identify the best fitting model [83] and is often used in machine learning. Again seeing GPEs with different training points as different models, we will introduce the information entropy H^BAL[p^BAL_ωE(S|D)] to assess the information loss for each parameter set ω_E:

H^BAL[p^BAL_ωE(S|D)] = −∫_Y p^BAL_ωE(S|D) ln[p^BAL_ωE(S|D)] dS.   (31)

Similar to Section 2.5, using Equation (A3) from the paper [40], the information entropy in Equation (31) can be written as follows:

H^BAL[p^BAL_ωE(S|D)] = ln[BME^BAL] − E_{p^BAL_ωE(S|D)}(ln[N_ωE(µ_S, σ_S)]) − E_{p^BAL_ωE(S|D)}(ln[p^BAL_ωE(D|S)]),   (32)

where the posterior-based expectation E_{p^BAL_ωE(S|D)}(ln[N_ωE(µ_S, σ_S)]) could be obtained as well using a rejection sampling technique [26] on the GPE:

E_{p^BAL_ωE(S|D)}(ln[N_ωE(µ_S, σ_S)]) ≈ (1/N^BAL_p) Σ_{j=1}^{N^BAL_p} ln[N_ωE(S_j; µ_S, σ_S)],  S_j ∼ p^BAL_ωE(S|D).   (33)

All terms in Equation (32) could be obtained directly, avoiding any multidimensional integration, using prior-based or posterior-based sampling from the GPE, such as rejection sampling techniques. Therefore, to perform active learning, we will rely on the parameter set ω_T^BAL that corresponds to the minimum of the information entropy H^BAL[p^BAL_ωE(S|D)]:

ω_T^BAL = argmin_{ωE ∈ Ω} H^BAL[p^BAL_ωE(S|D)].   (34)

The optimization in Equation (34) minimizes the IE value in Equation (32). It makes use of BME^BAL in Equation (24) as well as the two posterior-based expectations in Equations (29) and (33).
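Putting Sections 3.2-3.4 together, a full BAL loop evaluates the chosen criterion at candidate points ω_E and adds the winner to the training set. The following self-contained Python sketch runs the RE-based strategy (Equation (30)) on a hypothetical 1D toy model; the model, the prior range and all hyper-parameter values are our illustrative assumptions, not the paper's benchmarks:

```python
import numpy as np

rng = np.random.default_rng(42)

def model(w, t):
    # Hypothetical stand-in for the expensive full-complexity model M(w, t).
    return np.sin(3.0 * w) + w * t

t_obs = np.linspace(0.0, 1.0, 5)                      # observation times
d_obs = model(0.3, t_obs) + rng.normal(0.0, 0.1, 5)   # synthetic data D
r_diag = np.full(5, 0.1 ** 2)                         # diagonal of R

def gpe_fit(w_train, M_train, w_query, lam=0.5, nugget=1e-6):
    # Zero-trend squared-exponential GP, one shared kernel for all output times.
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * lam ** 2))
    Kinv = np.linalg.inv(k(w_train, w_train) + nugget * np.eye(w_train.size))
    Ks = k(w_query, w_train)
    mu = Ks @ Kinv @ M_train                          # (n_query, N_D)
    var = np.clip(1.0 - np.einsum('ij,jk,ik->i', Ks, Kinv, Ks), 1e-12, None)
    return mu, np.sqrt(var)[:, None] * np.ones_like(mu)

def criteria(mu, sd, n_bal=2000):
    # Monte Carlo estimates of ln BME^BAL, RE^BAL and IE^BAL at one point w_E.
    s = rng.normal(mu, sd, size=(n_bal, mu.size))     # GPE predictive draws
    ll = -0.5 * np.sum((d_obs - s) ** 2 / r_diag + np.log(2 * np.pi * r_diag), axis=1)
    m = ll.max()
    ln_bme = m + np.log(np.mean(np.exp(ll - m)))      # Equation (24)
    keep = rng.random(n_bal) < np.exp(ll - m)         # rejection sampling
    lp = -0.5 * np.sum((s - mu) ** 2 / sd ** 2 + np.log(2 * np.pi * sd ** 2), axis=1)
    re = np.mean(ll[keep]) - ln_bme                   # Equation (28)
    ie = ln_bme - np.mean(lp[keep]) - np.mean(ll[keep])  # Equation (32)
    return ln_bme, re, ie

w_train = np.array([0.0])                             # N_T = 1: prior mean
candidates = np.linspace(-1.0, 1.0, 41)
candidates = candidates[np.abs(candidates) > 1e-9]    # exclude the initial point
for _ in range(10):                                   # ten BAL iterations
    M_train = np.vstack([model(w, t_obs) for w in w_train])
    mu_all, sd_all = gpe_fit(w_train, M_train, candidates)
    scores = [criteria(mu_all[i], sd_all[i])[1] for i in range(candidates.size)]
    i_best = int(np.argmax(scores))                   # Equation (30): max RE
    w_train = np.append(w_train, candidates[i_best])
    candidates = np.delete(candidates, i_best)
```

Swapping the selected criterion index switches to the BME-based strategy (Equation (26), argmax of ln BME^BAL) or the IE-based strategy (Equation (34), argmin of IE).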

Application of GPE-Based Bayesian Active Learning
In the previous section, we have introduced three strategies for Bayesian³ active learning during the GPE-assisted Bayesian updating of model parameters as described in Section 2. The current section will make use of an analytical example in Section 4.1 and a carbon dioxide benchmark problem in Section 4.2 to illustrate the suggested active learning strategies from Section 3.
In the present work, we use the Matlab fitrgp function [78] to obtain the values of the GPE parameters and hyper-parameters introduced in Section 2.1 via traditional Bayesian training on the available model runs. The proposed Bayesian active learning strategies in Section 3 for Bayesian inference in Section 2 have been implemented as an extension of the existing Matlab fitrgp function. The fully Bayesian³ active learning extension of the fitrgp function is available online through the Matlab file exchange [84]. For the sake of consistency, in the current publication, we have used the fitrgp function together with the squared exponential kernel k_SE(·, ·) as defined in Equation (3) for all examples. However, various kernel functions can easily be selected within the Matlab fitrgp function using the corresponding training options. Therefore, the reader is invited to test the suggested Bayesian active learning strategies for their own needs, exploring the full range of Matlab fitrgp functionality.

Scenario Set up
We will consider a test case scenario in the form of a nonlinear analytical function M(ω, t) of ten (n = 10) uncertain parameters ω = {ω_1, ..., ω_n} from the paper [40]. The uncertain parameters ω in Equation (35) are considered to be independent and uniformly distributed with ω_i ∼ U(−5, 5) for i = 1, . . . , 10. The prior assumptions on the parameters will be updated using synthetic observation data D = M(ω, t_k) with t_k = (k − 1)/9 and k = 1, . . . , 10 that correspond to the parameter set ω_i = 0 ∀i. The standard deviation of the measurement error is considered to be σ_D = 2.

Likelihood Reconstruction during Bayesian Active Learning
We will construct the Gaussian process emulator S(ω, t) in Equation (1) for the test case problem in Equation (35) to approximate the full model M(ω, t) in the parameter space ω for each point of time t. We will start the Bayesian active learning with one training point only (N_T = 1). The starting training point corresponds to the mean value of the uncertain parameters ω, i.e., ω_T = E_{p(ω)}(ω). We will perform the Bayesian updating in Equation (8) using Monte Carlo sampling [26] on the constructed GPE surrogate S(ω, t) with sample size N = 10^5 (alternative approaches can be used similarly). In order to identify the next training points for the GPE iteration, we employ the three Bayesian active learning strategies introduced in Section 3: the model evidence-based strategy in Equation (26), the relative entropy-based strategy in Equation (30) and the information entropy-based strategy in Equation (34).
Let us illustrate how the GPE-based likelihood function updates during the Bayesian active learning procedure. Additionally, we will assess the corresponding computational costs in terms of the number of full model runs. For illustrative purposes, we will reduce the 10D problem (35) to a 2D problem with only two parameters, i.e., ω_i = 0 for i = 3, . . . , 10. Figures 1-3 show how the GPE's likelihood function covers the 2D parameter space during active learning based on the BME-based strategy, the RE-based strategy and the IE-based strategy, respectively. Moreover, Figures 1-3 show Monte Carlo reference solutions that have been obtained directly using Monte Carlo sampling on the original model introduced in Section 4.1.1. The RE-based Bayesian active learning captures the non-Gaussian aspects of the analysed problem in a remarkably effective manner in comparison to the BME-based and the IE-based approaches. The RE-based active learning provides a likelihood estimation that is practically identical to the MC reference solution after 25 model runs. The BME-based strategy captures the main features only in the beginning and requires a longer learning procedure to reflect details of the reference solution. The reason is that the RE relies on both BME and the cross entropy that indicates how informative the likelihood is for the posterior (see Equation (28)). Contrary to the BME-based and the RE-based strategies, the IE-based active learning manages to cover the likelihood function in the 2D parameter space only partially for the given computational budget. Additionally, it shows a stagnation during the learning procedure, where 50-model-run training shows no significant improvement in comparison to 30-model-run training (see Figure 3). The difference between the RE-based and the IE-based active learning strategies consists in the second term in Equation (32).
That term denotes the cross entropy and reflects how informative the trained multivariate Gaussian distribution N_ωE(µ_S, σ_S) of the GPE is for the posterior p^BAL_ωE(S|D). Formally, it can be seen as a posterior-based expectation of the multivariate Gaussian distribution, E_{p^BAL_ωE(S|D)}(ln[N_ωE(µ_S, σ_S)]), and it can outweigh the RE value in Equation (32). Apparently, once the trained distribution N_ωE(µ_S, σ_S) is extremely informative, the IE-based active learning suggests adding a new training point where N_ωE(µ_S, σ_S) is already very similar to the posterior p^BAL_ωE(S|D). Therefore, this property can lead to a stagnation of the information entropy-based active learning, as in Figure 3.

Assessment of Information Arguments during Bayesian Active Learning
To assess the overall performance of the Bayesian active learning strategies introduced in Section 3, we compute the Bayesian model evidence p(D), the relative entropy D KL [p(ω|D), p(ω)] and the information entropy H [p(ω|D)]. To do so, we employ the resulting GPE surrogates in Equations (11), (15) and (19), respectively. In that way, we avoid any unnecessary multidimensional integration or density estimation via Monte Carlo integration. Figure 4 illustrates how the Bayesian model evidence, the information entropy and the relative entropy adjust their values during Bayesian active learning for the discussed 2D reduction of the original 10D problem. Figure 4 shows the results of the BME-based, the IE-based and the RE-based active learning using red, green and blue lines, respectively. All three approaches reach their plateaus after approximately 20-30 active learning steps, i.e., 20-30 full model runs, evaluating Equations (11), (15) and (19) for the available 10^5 Monte Carlo samples in parameter space. Figure 5 illustrates the convergence of the Bayesian model evidence, the information entropy and the relative entropy estimates obtained using the GPE to the reference Monte Carlo solution during the Bayesian active learning. The RE-based active learning (blue line) converges faster to the reference values than the BME-based (red line) and the IE-based (green line) active learning for all three indicators (BME, D KL [p(ω|D), p(ω)] and H [p(ω|D)]). Figure 5 aligns well with the results and discussion presented in Section 4.1.2. Now, we will consider the full 10D setup of problem (35) from Section 4.1.1. Similar to our discussion above, we will start the Bayesian active learning procedure with one training point only (N T = 1), corresponding to the mean value of the uncertain parameters ω, and we will again employ all three introduced strategies.
To assess the performance of the active learning procedures, we will also compute the BME value p(D), the RE value D KL [p(ω|D), p(ω)] and the IE value H [p(ω|D)] based on the GPE surrogate. We will compare them against the reference values obtained from the plain MC technique on the original model with a sample size of 10^5. The results presented in Figure 6 confirm the expectations from above and demonstrate a superior performance of the RE-based active learning (blue line) in comparison to the BME-based (red line) and IE-based (green line) active learning. From the computational point of view, Figure 6 shows that the RE-based strategy already reaches an acceptable precision after approximately 200 model runs. The BME-based and the IE-based strategies, however, reach this precision only after 500 model runs. It is worth mentioning that the current 10D setup is extremely challenging for GPE surrogates because of the parameter dimensionality and the strong nonlinearity of the problem. From the current section, one can conclude that the relative entropy-based Bayesian active learning demonstrates a highly acceptable performance and seems to be the most suitable one for practical applications.
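The RE-based selection of new training points can be sketched in a strongly simplified 1D form. Everything below is a placeholder assumption (toy forward model, hand-rolled squared-exponential GP with fixed hyper-parameters, a single scalar observation); the per-candidate score mimics the realization-based relative-entropy argument of the acquisition, not the paper's exact equations:

```python
import numpy as np

rng = np.random.default_rng(7)

def model(x):
    # Hypothetical cheap stand-in for the expensive forward model.
    return np.sin(3.0 * x) + 0.5 * x

def gp_predict(Xt, yt, Xs, ell=0.5, s2=1.0, nugget=1e-8):
    # Minimal zero-mean squared-exponential GP predictive mean/variance.
    k = lambda a, b: s2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
    K = k(Xt, Xt) + nugget * np.eye(len(Xt))
    Ks = k(Xs, Xt)
    mu = Ks @ np.linalg.solve(K, yt)
    var = s2 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, np.maximum(var, 1e-12)

def re_score(mu_j, var_j, data, sigma_D, n_real=200):
    # Draw GPE realizations at one candidate point and estimate the relative
    # entropy between realization-based posterior and prior weights there.
    y = mu_j + np.sqrt(var_j) * rng.normal(size=n_real)
    ll = -0.5 * ((data - y)**2 / sigma_D**2 + np.log(2 * np.pi * sigma_D**2))
    log_bme = np.logaddexp.reduce(ll) - np.log(n_real)
    w = np.exp(ll - np.logaddexp.reduce(ll))
    return np.sum(w * ll) - log_bme

# Start from a single training point at the prior mean, as in the paper.
data, sigma_D = 0.3, 0.05
Xcand = np.linspace(-2.0, 2.0, 101)
Xt = np.array([0.0])
yt = model(Xt)

for _ in range(8):
    mu, var = gp_predict(Xt, yt, Xcand)
    scores = [re_score(mu[j], var[j], data, sigma_D) for j in range(len(Xcand))]
    j_star = int(np.argmax(scores))          # most informative candidate
    Xt = np.append(Xt, Xcand[j_star])        # run the full model there
    yt = np.append(yt, model(Xcand[j_star]))
```

At already-trained points the GPE predictive variance collapses and the score vanishes, so the loop keeps exploring informative regions rather than revisiting known ones.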

CO 2 Benchmark Set-up
We will consider a multi-phase flow problem in porous media, where carbon dioxide (CO 2 ) is injected into a deep aquifer and then spreads in a geological formation. This yields a pressure build-up and a plume evolution. The CO 2 injection into the subsurface could be a possible practice to mitigate CO 2 emissions into the atmosphere. In this study, we use the deterministic model provided by Köppel et al. [24], which is a reduced version of the model in the benchmark problem defined in the paper [85]. This reduction considers radial flow in the vicinity of the injection well and was made primarily due to the high computational demand of the original CO 2 model. It is assumed that fluid properties such as density and viscosity are constant, and all processes are isothermal. The CO 2 and the brine build two separate and immiscible phases, and their mutual dissolution is neglected. Additionally, the formation is isotropic, rigid and chemically inert, and capillary pressure is negligible. Overall, the considered CO 2 benchmark problem is strongly nonlinear because the CO 2 saturation spreads as a strongly nonlinear front that can be challenging to capture via surrogates. For detailed information on the governing equations, the modeling assumptions and the approaches, the reader is referred to the original publication [24].
Similar to [24], we consider the combined effects of three sources of uncertainty. We take into account the uncertainty of boundary conditions due to the injection rate, the uncertainty of parameters in the constitutive relations, introduced via uncertainty in the relative permeability definitions, and the uncertainty of material properties, i.e., the porosity of the geological layer. These three sources of uncertainty were introduced for the analysis in [24] using injection rate (IR), power theta (PT) and reservoir porosity (RP), i.e., ω = {IR, PT, RP}.
We consider the CO 2 saturation to be the quantity of interest at a monitoring distance of 15 m from the injection well, measured every 10 days over a period of 100 days. We construct a scenario in which the synthetic observed saturation values have been generated from the deterministic CO 2 benchmark model itself, with the uncertain parameters set to ω Truth = {1.0e−04, 0.2, 0.3}. Additionally, we will assume that a measurement error of 0.02 (σ D = 0.02) applies to each synthetic observation. Using the synthetic measurement data, we construct the reference solution by conducting a Bayesian updating of the original CO 2 model. The reference values have been obtained based on 10^4 Monte Carlo simulations using Equations (11), (15) and (19), respectively. Additionally, the posterior distribution of the modeling parameters has been obtained via the same 10^4 Monte Carlo simulations. One model run of the analyzed CO 2 benchmark problem required approximately 3-7 min on a standard computer, depending strongly on the values of the modeling parameters. In what follows, we present the results and analyze the performance of the three Bayesian active learning methods, introduced earlier in this paper, applied to the aforementioned CO 2 benchmark set-up.
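The synthetic-data setup above can be sketched as follows. The forward model here is an invented placeholder standing in for the reduced CO 2 benchmark simulator of [24] (its functional form is purely illustrative); only the observation schedule, ω Truth and σ D follow the scenario described in the text:

```python
import numpy as np

# Hypothetical placeholder for the reduced CO2 benchmark simulator of [24]:
# it returns a CO2 saturation at the 15 m monitoring point for given times.
# The functional form below is invented purely for illustration.
def co2_model(ir, pt, rp, t_days):
    return np.clip(ir * 1.0e3 * np.sqrt(t_days) * (1.0 - rp)**pt, 0.0, 1.0)

t_obs = np.arange(10.0, 101.0, 10.0)     # every 10 days over 100 days
omega_truth = (1.0e-4, 0.2, 0.3)         # {IR, PT, RP} as in the scenario
sigma_D = 0.02                           # assumed measurement error

# Synthetic observations generated from the model itself at omega_truth;
# the error sigma_D enters through the likelihood rather than the data.
data = co2_model(*omega_truth, t_obs)

# Gaussian log-likelihood used for the Bayesian updating described above.
def log_likelihood(ir, pt, rp):
    resid = data - co2_model(ir, pt, rp, t_obs)
    return -0.5 * np.sum(resid**2 / sigma_D**2 + np.log(2 * np.pi * sigma_D**2))
```

With this likelihood, the reference posterior and the reference BME, RE and IE values would be obtained by plain Monte Carlo over the prior, as the text describes.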

Assessment of Information Arguments during Bayesian Active Learning for the CO 2 Benchmark
We start the Bayesian active learning process for the CO 2 benchmark model with one training point only (N T = 1) using ω T = E p(ω) (ω). Similar to the previous applications in Section 4.1, we perform the Bayesian active learning procedure using the BME-based, RE-based and IE-based strategies. Analogously, the performance of the active learning process is analyzed by comparing the BME values p(D), relative entropies D KL [p(ω|D), p(ω)] and information entropies H [p(ω|D)] based on the GPE surrogates against their corresponding MC reference values. Figure 7 illustrates the convergence of the BME value, the IE value and the RE value obtained during GPE-based Bayesian active learning against the reference Monte Carlo values. The results presented in this figure demonstrate that the RE-based (blue line) and the BME-based (red line) active learning again show a superior performance compared to the IE-based active learning (green line). Here, the RE-based strategy matches the reference BME values slightly better than the BME-based approach, and both approaches perform similarly well for the other quantities of interest. The IE-based active learning demonstrates behavior very similar to that in Section 4.1 and confirms the findings reported for the 10D problem and its 2D reduction.

Posterior Distribution of Modeling Parameters for CO 2 Benchmarks
We will consider how well the introduced Bayesian active learning strategies can capture the posterior distribution of the modeling parameters. To do so, we illustrate the posterior distributions and correlations of the modeling parameters for the CO 2 benchmark problem obtained after 50 active learning iterations. Figure 8 presents the results obtained using the BME-based active learning (Figure 8a), the IE-based active learning (Figure 8b) and the RE-based active learning (Figure 8c), and it compares them with the reference Monte Carlo solution (Figure 8d). The BME-based strategy in Figure 8a and the RE-based strategy in Figure 8c capture very well the posterior distributions of all analyzed parameters and their correlations in comparison to the MC reference solution. The information entropy-based strategy in Figure 8b acceptably captures the distributions and the correlations of the injection rate (IR) and the reservoir porosity (RP) parameters. However, it could not properly capture the distribution of the PT parameter, which controls the relative permeability. Figure 8b shows a very strong overestimation of the probability of high values of this parameter. Very similar posterior distributions for all strategies have already been observed after 25 iterations of active learning, which corresponds to the convergence shown in Figure 7.
We have used 50 iterations of the GPE-based Bayesian active learning for demonstrative purposes only. Such a high number of active learning iterations is unnecessary for practical applications. In the Bayesian context, a stabilization of the posterior distributions indicates convergence of the surrogate representation to the original model in the region of high posterior density. This property can be especially useful once the reference solution cannot be constructed for computational reasons, see [35,36]. The convergence of the posterior distributions in Figure 8 aligns well with the convergence of the information-theoretic indicators shown in Figure 7.
Overall, Section 4.2 indicates that the RE-based Bayesian active learning demonstrates a slightly superior performance over the BME-based strategy, and both strategies are superior to the IE-based approach. It is easy to judge the performance of BAL once the reference solution is available. In many practical cases, however, the reference solution is not available due to a very high computational demand, and conclusions must be drawn without it. All mentioned indicators, i.e., the BME value p(D), the RE value D KL [p(ω|D), p(ω)] and the IE value H [p(ω|D)], reflect relevant information for Bayesian inference. Therefore, the active learning procedure can be stopped once all such information-theoretic indicators stagnate and reach a plateau, because the best possible surrogate representation of the original model has been (almost) reached.
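A plateau-based stopping rule of this kind can be stated in a few lines. The window length and relative tolerance below are placeholder assumptions, not values prescribed by the paper:

```python
import numpy as np

def indicators_stagnated(history, window=5, rel_tol=0.01):
    """Hypothetical stopping rule: return True once the BME, RE and IE
    values have each varied by less than rel_tol (relative) over the last
    `window` active learning iterations."""
    if len(history) < window:
        return False
    h = np.asarray(history[-window:], dtype=float)     # shape (window, 3)
    spread = h.max(axis=0) - h.min(axis=0)             # per-indicator range
    scale = np.maximum(np.abs(h).mean(axis=0), 1e-12)  # avoid divide-by-zero
    return bool(np.all(spread / scale < rel_tol))
```

After each active learning step, one would append the current (BME, RE, IE) triple to `history` and stop the loop as soon as the function returns True.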

Discussion
We use the link between Bayesian inference and information theory introduced in [40]. This means that we compute the BME, RE and IE values while avoiding the criticized multi-Gaussian assumption. Instead, we perform prior-based sampling, i.e., Monte Carlo sampling accompanied by rejection sampling. In doing so, we assume that the main computational cost lies in running the original model and that Monte Carlo sampling on the GPE surrogate is feasible.
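The prior-based rejection sampling mentioned here is assumption-free about the posterior's shape. A minimal sketch, with the GPE-based likelihood replaced by an invented toy Gaussian likelihood over three parameters (only the accept/reject mechanics matter):

```python
import numpy as np

rng = np.random.default_rng(3)

# Prior-based Monte Carlo with rejection sampling, as used in the text.
n_mc = 100_000
omega = rng.uniform(0.0, 1.0, size=(n_mc, 3))                  # prior samples
log_lik = -0.5 * np.sum((omega - 0.4)**2, axis=1) / 0.1**2     # toy likelihood

# Accept sample i with probability L_i / max_j L_j; the accepted samples
# are exact posterior draws, with no multi-Gaussian assumption involved.
accept = np.log(rng.uniform(size=n_mc)) < (log_lik - log_lik.max())
posterior = omega[accept]
```

The price is efficiency: when the posterior occupies a small fraction of the prior volume, most samples are rejected, which is acceptable here because all likelihood evaluations run on the cheap GPE surrogate, not the original model.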
Alternatively, any posterior-based sampling algorithm (e.g., MCMC) could be used during the Bayesian active learning procedure. However, when a posterior-based sampling algorithm is used, the values for BME, RE and IE become quite rough approximations because relatively strong assumptions have to be made. To provide cheap alternatives, various estimates of BME, RE and IE based on known criteria from information theory are listed in Appendix A for the sake of completeness.
However, the estimates in Appendix A have to be used with care. According to [40], the harmonic mean estimate and the maximum-likelihood estimate for BME provide very unreliable results. Moreover, only rough guesses of the true BME value can be obtained from the maximum a posteriori estimate, Chib's estimate [86], the Bayesian information criterion [87] and the Akaike information criterion (with [83] and without [88] second-order bias correction) due to their strong assumptions.
While estimates for BME, RE and IE based on the Kashyap information criterion [89] demonstrated unsatisfactory performance as well, we re-scaled them to a proper scale in [40]. However, the re-scaled Kashyap information criterion still includes unnecessary simplifications of the involved cross entropies. Among all these simplified estimates, the multivariate Gaussian posterior estimate [40] avoids the most unreasonable simplifications for posterior-based sampling and includes the fewest assumptions for estimating BME, RE and IE. The Gelfand and Dey approach [90] includes assumptions similar to the multivariate Gaussian one, but provides slightly inferior results in the cases tested by us. Thus, for posterior-based sampling during Bayesian active learning, we suggested in our 2019 paper [40] using the multivariate Gaussian estimate, which includes the fewest assumptions. (Assuming a multivariate Gaussian distribution is fully appropriate for active learning approaches that focus on GPE training against an underlying model only, without considering measurement data. This can be seen as Bayesian 2 active learning, and the approximation signs in Equations (A25), (A26) and (A27) turn into equalities, containing no assumptions by definition of the GPE.)
Finalizing the discussion, we would like to remark that straightforward applications of the suggested Bayesian active learning strategies (or even already existing approaches) to GPE representations could become computationally very demanding once the problem dimensionality increases. This is mainly caused by the structure of GPE surrogates, which are represented via localized kernels; these localized kernels require many training points in high-dimensional cases. Increasing the amount of measurement data will help to localize the relevant spots better. However, for high-dimensional problems, the structure of the surrogate should be constructed adaptively, and sparse representations will be very beneficial. Alternatively, a preliminary sensitivity analysis (see e.g., [91,92]) could be conducted to partially overcome the problem of dimensionality.

Summary and Conclusions
The current paper deals with Gaussian process emulators that replicate a computationally demanding physical model and honor the available observation data, establishing a fully Bayesian 3 active learning framework. We elaborate the connection between Bayesian inference and information theory and offer a fully Bayesian view on the Gaussian process emulator through Bayesian inference accompanied by Bayesian active learning.
The paper employs the fundamental properties of the Gaussian process emulator and introduces, in Section 3, three Bayesian active learning strategies. These strategies adaptively identify training sets for which the full-complexity model must be evaluated. The first Bayesian active learning strategy, relying on Bayesian model evidence, indicates the quality of the representation against the available measurement data. The second Bayesian active learning strategy, based on the relative entropy, seeks a relative information gain. The third Bayesian active learning strategy, based on information entropy, considers the expected missing information. The introduced strategies improve the Gaussian process emulator-based surrogate representation of a full-complexity physical model in the region of high posterior density. We employ the information-theoretic arguments to adaptively incorporate the measurement data. We emphasize that the information entropy and the relative entropy can be computed while avoiding any assumptions or unnecessary multidimensional integration.
We illustrate the performance of the suggested Bayesian active learning strategies using an analytical example and a carbon dioxide benchmark. Section 4 shows how the suggested approaches capture the likelihood values during an active learning procedure. We also show a visual comparison with the reference Monte Carlo solution for a 2D reduction of the 10D problem. We demonstrate rigorous evidence of convergence against the reference Monte Carlo values for the Bayesian model evidence, the information entropy and the relative entropy obtained via the three Bayesian active learning strategies. Additionally, Section 4 shows the evidence of convergence for the carbon dioxide benchmark problem against the reference solution for all proposed Bayesian active learning strategies. We also illustrate how the suggested Bayesian active learning strategies manage to quantify the post-calibration uncertainty in comparison to available Monte Carlo reference solutions.
Overall, we conclude that the introduced Bayesian active learning strategies for Gaussian process emulators could be very helpful for applied tasks where the underlying full-complexity models are computationally very expensive. Moreover, the employed information-theoretic indicators can be used as stopping criteria for Bayesian active learning once a reference solution is not available due to a very high computational demand. Our analysis indicates that the Bayesian model evidence-based and the relative entropy-based strategies demonstrate more reliable results in comparison to the information entropy-based strategy, which can be misleading. Additionally, the relative entropy-based strategy demonstrates a superior performance relative to the Bayesian model evidence-based strategy and seems to provide very sensitive arguments for the active learning.
Author Contributions: All authors have substantially contributed to this work and have been involved in writing and revising the paper. All authors have read and agreed to the published version of the manuscript.

Figure A1. Convergence of the Bayesian model evidence, information entropy and relative entropy estimates during active learning for the Gaussian process emulator to the reference Monte Carlo solution for the 10D problem: BME-based active learning (red line), IE-based active learning (green line), RE-based active learning (blue line) and IVAR-based active learning (teal line).