Expectation-Maximization Algorithm for the Calibration of a Complex Simulator Using a Gaussian Process Emulator

The approximated non-linear least squares (ALS) method tunes or calibrates a computer model by minimizing the squared error between the computer output and real observations, using an emulator such as a Gaussian process (GP) model. A potential defect of the ALS method is that the emulator is constructed only once and is never re-built. An iterative method is proposed in this study to address this difficulty. In the proposed method, the tuning parameters of the simulation model are calculated by a conditional expectation (E-step), whereas the GP parameters are updated by maximum likelihood estimation (M-step). These EM steps are repeated alternately until convergence, using both the computer and experimental data. For comparative purposes, another iterative method (the max-min algorithm) and a likelihood-based method are considered. Five toy models are tested for a comparative analysis of these methods. According to the toy model study, both the variance and bias of the estimates obtained from the proposed EM algorithm are smaller than those from the existing calibration methods. Finally, an application to a nuclear fusion simulator is demonstrated.


Introduction
Modern researchers have attempted to develop and use simulation code in place of excessively expensive or infeasible physical experiments in many fields. Advanced computer technology has made it possible to realize very complicated simulations. In most cases, the simulation code contains several unknown parameters (or universal constants). A classic method for determining the universal constants in computer models is non-linear least squares estimation (NLSE), which minimizes the sum of squared errors between the real observations and the computer responses. The NLSE, however, becomes too computationally expensive, or infeasible in terms of time, when the computer model is time-consuming to run. In such cases, a statistical emulator can be used to determine the universal constants in the computer model so that the simulator or emulator represents the real experiments sufficiently well. This process is known as "code tuning" or "calibration" [1][2][3][4].
Code tuning is defined as the procedure of enhancing the consistency between code responses and experimental data by determining the parameters inside the simulation code [5]. A distinction can be made between tuning parameters and calibration parameters [3]; however, we use the two terms (calibration and code tuning) interchangeably.

GP Model for Computer Experiments
Sacks et al. [22] proposed the adoption of the GP model for the analysis of deterministic computer experiments. For the response $Y(t)$ at site $t$, the GP model can be expressed as

$$Y(t) = \sum_{i=1}^{p} \beta_i f_i(t) + Z(t), \qquad (1)$$

where the $\beta_i$ are unknown regression coefficients, $\{f_i(\cdot)\}$ is a set of known functions, and $Z(\cdot)$ is a GP (of random variables) with mean zero and covariance function $\sigma_Z^2 R(\cdot, \cdot)$. In the above equation, the first term is a regression function, and the second term $Z(t)$ represents the departure from the assumed regression function. The stochastic process term allows the predictor (5) to interpolate the deterministic responses. For $t_i = (t_{i1}, \ldots, t_{id})$ and $t_j = (t_{j1}, \ldots, t_{jd})$, the covariance between $Z(t_i)$ and $Z(t_j)$ is

$$V(t_i, t_j) = \sigma_Z^2 R(t_i, t_j), \qquad (2)$$

where $\sigma_Z^2$ is the variance of $Z(\cdot)$ and $R(t_i, t_j)$ is a correlation function. We select the Gaussian correlation family [26]:

$$R(t_i, t_j) = \exp\left(-\sum_{k=1}^{d} \theta_k (t_{ik} - t_{jk})^2\right), \qquad (3)$$

where the $\theta_k$ are non-negative parameters. We define $v(t_0)$ and $f(t_0)$ as

$$v(t_0) = [V(t_0, t_1), \ldots, V(t_0, t_n)]^\top, \quad f(t_0) = [f_1(t_0), \ldots, f_p(t_0)]^\top. \qquad (4)$$

In this case, $v(t_0)$ is a correlation vector between the observed (or training) sites and a prediction site $t_0$, and $f(t_0)$ is a design (or functional) vector at $t_0$.
When the correlation function $R(\cdot, \cdot)$ is assigned and the observations $y$ at sites $t$ are given, the best linear unbiased predictor (BLUP) of $Y(t_0)$ is [26]

$$\hat{Y}(t_0) = f(t_0)^\top \hat{\beta} + v(t_0)^\top V^{-1} (y - F \hat{\beta}), \qquad (5)$$

where $F$ is the matrix of the regression functions evaluated at the training sites, $V$ is the covariance matrix of $y$, and $\hat{\beta} = (F^\top V^{-1} F)^{-1} F^\top V^{-1} y$ is the generalized least squares estimate of $\beta$. Furthermore, the hyper-parameters in $V$ are generally unknown; hence, we usually estimate them by maximum likelihood estimation (MLE) from the observations $y$. These estimates are subsequently plugged into (5), so that (5) becomes the so-called empirical BLUP (EBLUP) of $Y(t_0)$ [26]. We used the "DiceKriging" package [27] in R. A brief description of the MLE calculation is provided in Appendix A. Prediction models can be built differently according to the combination of the $\theta$s and $\beta$s in (1) and (3). We construct the following two models in this study:

Model 1: $Y(t) = \sum_{i=1}^{p} \beta_i f_i(t) + Z(t) + \epsilon$, with a common $\theta$;
Model 2: $Y(t) = \sum_{i=1}^{p} \beta_i f_i(t) + Z(t) + \epsilon$, with $d$ different $\theta$s. $\qquad (6)$
In the above, "common θ" indicates that d number of θ s are constrained to be a common θ c so that θ 1 = θ 2 = · · · = θ d := θ c . The final error term ( ) is for the randomness or measurement error in the real experiments. The error term is not applied to the computer responses because only a deterministic computer model is considered in this study. Other models such as the one from variable selection algorithms [28][29][30] are of course usable. We recommend References [26,31] for further details on the GP model.

Data Structure and Notations
The experimental data and computer data are subscripted by "E" and "C", respectively, and the combined data are denoted by "B". Let $\tau$ be a vector of calibration parameters of dimension $q$, and let $T$ represent the input variables of the simulator corresponding to $\tau$. We denote by $X$ the experimental input variables of dimension $p$. In addition, let $n_C$ and $n_E$ be the sample sizes, and $n_B = n_C + n_E$. More details are presented in Appendix B.

ALS Method
In this subsection, we briefly describe the ALS method considered in Reference [1]. If the simulation code is computationally costly to run, it is very difficult, in terms of time, to numerically optimize a quantity computed directly from the code. In this case, the ALS employs the GP model as a surrogate or emulator of the computer code. The ALS first estimates the GP parameters by the MLE from the simulation data. Thereafter, it regards the fitted GP model as if it were the true simulator. The ALS determines $\tau$ by minimizing the following approximated residual sum of squares (ARSS):

$$\mathrm{ARSS}(\tau) = \sum_{i=1}^{n_E} \left( y_{Ei} - \hat{Y}(\tau, x_{Ei}) \right)^2,$$

where $\hat{Y}(\tau, x_E)$ is the EBLUP, as in (5). The advantage of the ALS method is that it does not require additional runs of the simulator to calculate $\mathrm{ARSS}(\tau)$ after a GP surrogate has been constructed from the computer dataset. Because $\mathrm{ARSS}(\tau)$ has no explicit minimizer, we employ the quasi-Newton method in the R function "optim".
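A minimal ALS sketch in R, under an illustrative toy setup (the simulator, designs, and noise level are our assumptions, not the paper's test functions; t1 plays the role of the calibration input $\tau$ and t2 of the design input $x$):

```r
library(DiceKriging)

set.seed(2)
sim <- function(tau, x) sin(2 * pi * tau) + x^2       # stand-in simulator
T_S <- matrix(runif(60), ncol = 2, dimnames = list(NULL, c("t1", "t2")))
y_S <- sim(T_S[, 1], T_S[, 2])                        # computer data

tau_star <- 0.25                                      # "true" constant
x_E <- runif(10)
y_E <- sim(tau_star, x_E) + rnorm(10, sd = 0.05)      # experimental data

## Build the emulator once, from the computer data only
m_C <- km(~1, design = data.frame(T_S), response = y_S, covtype = "gauss")

## ARSS(tau): squared error between the E-data and the EBLUP at (tau, x_E)
arss <- function(tau) {
  nd <- data.frame(t1 = rep(tau, length(x_E)), t2 = x_E)
  sum((y_E - predict(m_C, newdata = nd, type = "UK")$mean)^2)
}

## Quasi-Newton minimization of ARSS(tau) via optim()
optim(par = 0.5, fn = arss, method = "BFGS")$par
```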
A potential disadvantage of the ALS method is that the emulator (a GP model) is constructed only once from the simulation data and is never re-built. To address this defect, an iterative version of the ALS, namely the max-min algorithm, was considered by Seo et al. [4].

Max-Min Algorithm
In the max-min algorithm, the tuning constants and the GP parameters are estimated alternately by the ALS method and the MLE. This method utilizes both the experimental and computer data repeatedly until convergence. The steps are outlined as follows:

Step 1: Find the MLE of the GP parameters $\hat{\theta}$s, $\hat{\beta}$s, and $\hat{\sigma}_Z^2$ in Equations (2) and (6) using the simulation data ($T_S = (\tau_S, x_S)$ and $y_S$) only.

Step 2: Find $\hat{\tau}$ by minimizing $\mathrm{ARSS}(\tau)$, where the prediction $\hat{Y}(\tau, x_E)$ is based on the computer data only.

Step 3: Re-estimate the GP parameters by the MLE using the combined data, in which $\hat{\tau}$ from the previous step is inserted into the experimental data.

Step 4: Update $\hat{\tau}$ by minimizing $\mathrm{ARSS}(\tau)$, where the prediction is based on the combined data.

Step 5: Repeat Steps 3 and 4 until convergence, such as $\|\hat{\tau}^{(k+1)} - \hat{\tau}^{(k)}\| < \epsilon$.
In Steps 1 and 3, a GP model is constructed for the prediction. Steps 2 and 4 both minimize $\mathrm{ARSS}(\tau)$, but Step 2 utilizes the computer data only in the prediction $\hat{Y}(x_E)$, whereas Step 4 uses both datasets. In Step 3, $\hat{\tau}$ is the estimate obtained from the previous step. Steps 1 and 3 both search for the MLE, but Step 1 utilizes the computer data only, whereas Step 3 utilizes both datasets. Quasi-Newton numerical algorithms are employed for the optimizations in Steps 2-4.
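Continuing the ALS sketch above (it reuses T_S, y_S, x_E, y_E, and arss()), a compact R rendering of the max-min loop might look as follows; a single estimated nugget stands in for the experimental error term, which is a simplification of the paper's model:

```r
tau_hat <- optim(0.5, arss, method = "BFGS")$par   # Steps 1-2 (the ALS)
for (k in 1:50) {
  ## Step 3: MLE of the GP parameters from the combined data,
  ## with tau_hat inserted into the experimental rows
  X_B <- rbind(data.frame(T_S), data.frame(t1 = tau_hat, t2 = x_E))
  y_B <- c(y_S, y_E)
  m_B <- km(~1, design = X_B, response = y_B, covtype = "gauss",
            nugget.estim = TRUE)
  ## Step 4: update tau_hat by minimizing ARSS under the combined-data EBLUP
  arss_B <- function(tau) {
    nd <- data.frame(t1 = rep(tau, length(x_E)), t2 = x_E)
    sum((y_E - predict(m_B, newdata = nd, type = "UK")$mean)^2)
  }
  tau_new <- optim(tau_hat, arss_B, method = "BFGS")$par
  ## Step 5: stop when successive estimates agree
  if (abs(tau_new - tau_hat) < 1e-4) break
  tau_hat <- tau_new
}
tau_hat
```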
Seo et al. [4] demonstrated that the max-min method works better than the ALS. One drawback of this method is that it requires more computing time than the ALS. Further details on this algorithm, including the stopping rule, can be found in Reference [4].

Likelihood-Based Calibration
Cox et al. [1] considered the full likelihood function of the combined data for all parameters, including the calibration parameters $\tau$, the variance parameter $\sigma_E^2$, and the GP parameters $\theta$, $\beta$, and $\sigma^2$. Thus, all parameters are estimated simultaneously by the MLE method. This method is called the full MLE, and it is applied to a GP model. Minus two times the profile log likelihood of all parameters is, without constants,

$$-2 \log L(\tau, \theta, \gamma_E) = n_B \log \hat{\sigma}^2(\tau, \theta, \gamma_E) + \log \left| V_B(\tau, \theta, \gamma_E) \right|, \qquad (8)$$

where $\hat{\sigma}^2$ is the profiled variance estimate and $V_B$ is the covariance matrix of the combined data defined in Appendix B. Cox et al. [1] also considered other likelihood-based approaches. One method is the so-called separated MLE (SMLE), which maximizes the conditional likelihood of the experimental data given the simulation data. In this case, the GP parameters $\theta$, $\beta$, and $\sigma^2$ are determined by the marginal MLE from the computer data. These estimates are then inserted into the conditional likelihood, which is subsequently maximized with respect to $\tau$ and $\gamma_E$ to acquire estimates of these parameters. An advantage of likelihood-based methods for calibration is that they utilize the combined data, simultaneously or conditionally, and so enrich the set of calibration methods. The SMLE was demonstrated to be superior to the full MLE in Reference [1]. Thus, in this study, we compare the SMLE with the proposed method. Further details of the SMLE are presented in Appendix C.
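A sketch of the profile criterion (8) in R, under simplifying assumptions (a constant mean, a single calibration parameter placed in the first column of the design, a common $\theta$, and $\gamma_C = 0$; the function and argument names are ours, not the paper's):

```r
neg2pll <- function(par, X_B, y_B, n_E) {
  tau   <- par[1]
  theta <- exp(par[2])                       # common theta > 0
  gamE  <- exp(par[3])                       # gamma_E = sigma_E^2 / sigma^2
  n <- nrow(X_B)
  X_B[(n - n_E + 1):n, 1] <- tau             # insert tau into the E-data rows
  V <- exp(-theta * as.matrix(dist(X_B))^2)  # Gaussian correlation matrix (3)
  V <- V + diag(c(rep(0, n - n_E), rep(gamE, n_E)))  # error on E-data only
  F_B <- matrix(1, n, 1)                     # constant-mean design
  Vi  <- solve(V)
  beta <- solve(t(F_B) %*% Vi %*% F_B, t(F_B) %*% Vi %*% y_B)
  r    <- y_B - F_B %*% beta
  sig2 <- drop(t(r) %*% Vi %*% r) / n        # profiled sigma^2
  n * log(sig2) + as.numeric(determinant(V)$modulus)  # (8), up to a constant
}

## Full MLE: all parameters estimated simultaneously, e.g.,
## optim(c(0.5, 0, 0), neg2pll, X_B = X_B, y_B = y_B, n_E = n_E)
```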

Proposed Method: EM-Based Calibration
An EM algorithm is an iterative method for determining the MLE or maximum a posteriori (MAP) estimates of the parameters in a statistical model [32] that depends on unobserved latent variables. One example of such unobserved latent variables is the calibration parameters in computer experiments. The EM iteration alternates between an expectation (E) step and a maximization (M) step. The E-step calculates the expected value of the log likelihood at the current parameter estimates provided by the M-step. The M-step computes the parameters by maximizing the expected log likelihood determined in the E-step. These parameter estimates are subsequently employed to determine the distribution of the latent variables (here, the calibration parameters) in the following E-step [32].
The EM algorithm was described and given its name in 1977 in a paper by Dempster et al. [33]. Since then, the EM algorithm has been applied to many research areas, including computational statistics, machine learning, computer vision, hidden Markov models, item response theory, and computed tomography. See References [34][35][36] for further information on the EM algorithm and its applications.
To date, the EM algorithm has not been applied to the calibration problem of complex computer code. Because the tuning parameters in the real experiments can be treated as unobserved latent variables, an EM algorithm may be appropriate for obtaining the distribution of the tuning parameters. The steps of the proposed method for a given GP model are as follows:

Initialization: Provide initial values $\hat{\tau}$ for $\tau$ from prior information on $\tau$.

M-step: Determine the MLE $\hat{\psi}$ of the GP parameters from the combined data, in which $\hat{\tau}$ is inserted into the experimental data.

E-step: Set $\hat{\tau}$ to the conditional expectation of $\tau$ given the estimates ($\hat{\psi}$ from the M-step) of the GP parameters obtained under the combined data.

Iterate: Repeat the E- and M-steps until convergence.
In the $k$-th iteration of the E-step, the conditional expectation of $\tau$ is in fact the expectation under the posterior of $\tau$:

$$\hat{\tau}^{(k)} = E\left[\tau \mid y_B; \hat{\psi}^{(k)}\right] = \frac{\int \tau \, L(y_B; \tau, \hat{\psi}^{(k)}) \, p(\tau; x_B, \hat{\psi}^{(k)}) \, d\tau}{\int L(y_B; \tau, \hat{\psi}^{(k)}) \, p(\tau; x_B, \hat{\psi}^{(k)}) \, d\tau}, \qquad (11)$$

where $L(y_B; \tau, \hat{\psi}^{(k)})$ is the likelihood of the combined data and $p(\tau; x_B, \hat{\psi}^{(k)})$ is a prior (pdf) of $\tau$. Note that $F_B$ and $V_B$ are functions of $\tau$. The numerical integration method of Reference [37], implemented in the R package "cubature" [38], was used to calculate (11). We set a uniform distribution as the prior of $\tau$; other priors can easily be incorporated.
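A minimal sketch of this E-step in R with the "cubature" package, assuming a single calibration parameter with a uniform prior on [0, 1]; loglik_B is a hypothetical stand-in for the combined-data log likelihood at the current GP estimates $\hat{\psi}^{(k)}$:

```r
library(cubature)

posterior_mean_tau <- function(loglik_B, lower = 0, upper = 1) {
  ref  <- loglik_B((lower + upper) / 2)            # stabilize exp() overflow
  dens <- function(tau) exp(loglik_B(tau) - ref)   # unnormalized posterior
  num  <- adaptIntegrate(function(tau) tau * dens(tau), lower, upper)$integral
  den  <- adaptIntegrate(dens, lower, upper)$integral
  num / den                                        # E[tau | y_B; psi_hat], as in (11)
}
```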
A slightly modified likelihood function from (8), $-2 \log L(\psi_B; y_B, X_B, \hat{\tau})$, is employed to calculate the MLE in the M-step, where $\hat{\tau}$ is obtained from the E-step. The M-step is basically the same as Step 3 of the max-min algorithm. The major difference between the two algorithms is that the max-min minimizes the ARSS, whereas the EM calculates the conditional expectation. One drawback of the EM method is that it requires more computing time than the ALS or the Kennedy-O'Hagan (KOH) method [2].
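Putting the two steps together, an EM driver might be organized as follows; fit_gp_combined and loglik_B_at are hypothetical helpers (the former performs the M-step MLE on the combined data with $\hat{\tau}$ inserted into the E-rows, the latter returns the combined-data log likelihood as a function of $\tau$), and posterior_mean_tau is the E-step sketch above:

```r
tau_hat <- 0.5                                          # initial value from the prior
for (k in 1:100) {
  psi_hat <- fit_gp_combined(tau_hat)                   # M-step
  tau_new <- posterior_mean_tau(loglik_B_at(psi_hat))   # E-step, Equation (11)
  if (abs(tau_new - tau_hat) < 1e-6) break              # convergence check
  tau_hat <- tau_new
}
```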
In each iteration of the E- and M-steps, the parameters in the emulator and the combined data are updated. We expect this updating procedure to positively affect the estimation of the calibration parameters. The EM and max-min methods use the combined data for constructing the emulator, whereas the ALS uses the computer data only. The addition of relevant data generally enhances the predictive capacity of the emulator.
It is worth noting that, in our experience with the test function study, the EM algorithm converges effectively. The median number of iterations is approximately 10 (the first and third quartiles are 6 and 28 iterations, respectively) based on 10 trials for test function 1, as described in detail in the following section. Table 1 presents a classification of the calibration methods based on the $\tau$ estimation, emulator building, and outer iteration. It can be observed that the EM algorithm can be viewed as an extension of the KOH method.

Test Function Study
In this section, we describe the application of the calibration methods to test functions (or toy models) for which the true tuning parameters are known a priori. A set of five toy models representing different situations was arranged for a comparison of the methods. These test functions are simple toy models that are easy to compute; however, we treated them as if they were time-consuming simulators.
The computer data and experimental data, with sample sizes $n_C$ and $n_E$, respectively, were generated from five test functions with $n_E = n_C = 30$ and known true constants $\tau^*$; each test function is specified by a computer-data form and an experimental-data form. Optimal Latin hypercube designs [39,40] were used for sampling the independent variables for the real experiments ($x_E$) and for the computer experiments ($\tau_S, x_S$). A total of 30 different designs for the computer data were employed for each test function to account for uncertainty in the design, whereas the real experimental design was fixed.
As a result, the averages of the estimates and the standard deviations from the 30 trials are presented. The averaged Euclidean distance from the estimates to the true values was computed to evaluate the performance of the methods. (In addition to the Euclidean distance, one can consider a weighted distance such as the Mahalanobis distance [41]. It may be more meaningful than the Euclidean distance in the sense that the Mahalanobis distance takes into account the covariances among the estimates of the calibration parameters. The weighted distance is not considered in this study, but may be useful in future work.) The root mean squared error (RMSE) of the estimates is also provided. The formula for the RMSE is

$$\mathrm{RMSE}(\hat{\tau}) = \sqrt{\mathrm{Bias}(\hat{\tau})^2 + \sum_{i=1}^{q} \mathrm{std}(\hat{\tau}_i)^2},$$

where $\mathrm{Bias}(\hat{\tau})$ is the averaged Euclidean distance to the true constants and $\mathrm{std}(\hat{\tau}_i)$ is the standard deviation of each estimate over the 30 replications.

Tables 2-6 present the results for each test function. The averaged Euclidean distance from the estimates to the true constants and the RMSEs of the estimates are displayed in each table, with the standard deviations given in parentheses. In terms of the RMSE, the proposed method offered superior results to the ALS, SMLE, and max-min methods. In particular, the EM shows less bias and a lower RMSE than the other methods. This result appears again in Figures 1-5, which present boxplots of the distance to the true value for each of the five test functions. In many cases, the medians of the EM estimates were nearer to the true constants than those from the SMLE, ALS, and max-min methods, and the box lengths of the EM estimates were shorter than those of the other methods. It is notable that the max-min was more effective than the ALS and SMLE for test functions 2-4. One plausible reason that the EM was superior to the max-min is that the numerical integration for the conditional expectation in the EM may be more stable than the numerical optimization of the ARSS in the max-min.
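For concreteness, a small R helper matching this RMSE definition (est and tau_star are illustrative names for the matrix of replicated estimates and the vector of true constants):

```r
## est: n_rep x q matrix of tau estimates, one row per design replication
rmse_of_estimates <- function(est, tau_star) {
  dev  <- sweep(est, 2, tau_star)        # deviations from the true constants
  bias <- mean(sqrt(rowSums(dev^2)))     # averaged Euclidean distance
  stds <- apply(est, 2, sd)              # per-component std over replications
  sqrt(bias^2 + sum(stds^2))
}
```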
The computing times of the max-min and EM algorithms were much longer than those of the ALS and SMLE (see Table 7). The times were obtained using a personal computer with an Intel i5 CPU (3.6 GHz) and 16 GB of memory. Thus, the heavy computing times of the max-min and EM algorithms limited our ability to extend the test functions to higher-dimensional calibration parameters. Numerical integration in more than 10 dimensions may not be practical in the EM algorithm. In such high-dimensional cases, a Monte Carlo integration technique would be useful, although it requires more computing time.
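As an illustration of that alternative (our sketch, not the paper's implementation), a self-normalized Monte Carlo version of the E-step could look like this, with loglik_B as in the E-step sketch above:

```r
## Draw from a uniform prior over the box [lower, upper]^q and weight
## each draw by the combined-data likelihood
mc_posterior_mean <- function(loglik_B, lower, upper, n = 10000) {
  q   <- length(lower)
  tau <- matrix(runif(n * q, lower, upper), n, q, byrow = TRUE)
  ll  <- apply(tau, 1, loglik_B)
  w   <- exp(ll - max(ll)); w <- w / sum(w)    # stabilized, self-normalized weights
  colSums(tau * w)                             # approximate E[tau | y_B; psi_hat]
}
```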

Application to Nuclear Fusion Simulator
In this section, we present the application of the calibration methods to the computer code known as "Baldur" [42], a time-consuming simulator for the energy confinement time in a nuclear fusion device of the type known as a "tokamak" (a term of Russian origin). For a known complex function $f$ calculated from the Baldur code, the mathematical model is simply expressed as

$$Y = f(x_1, x_2, x_3, x_4; \tau_1, \tau_2, \tau_3, \tau_4),$$

where $x_1$ is the input heating power, $x_2$ is the toroidal plasma current, $x_3$ is the line-average electron density, and $x_4$ is the toroidal magnetic field. The calibration parameters $\tau = (\tau_1, \tau_2, \tau_3, \tau_4)$ determine the energy transfer and are related, respectively, to drift waves, rippling, resistive ballooning, and the critical value of $\eta_i$ [42].
The experimental data consisted of $x = (x_1, x_2, x_3, x_4)$ with sample size 42 from the Poloidal Divertor Experiment (PDX) tokamak at Princeton. The computer data consisted of $(\tau, x)$ with sample size 64 from the Baldur code. Details on the data can be found in References [1,42]. Table 8 presents the results of the $\tau$ estimation using the ALS, SMLE, max-min, and EM algorithms on the basis of GP Model 1 and Model 2 for the tokamak data. The results were obtained using R on a personal computer with an Intel i5 CPU (3.6 GHz) and 16 GB of memory. Figure 6 depicts the residuals ($y_E - \hat{Y}(T_E)$) plotted against the predicted values $\hat{Y}(T_E)$ obtained by the various methods using GP Model 1 and Model 2. The residual plots from all methods exhibited linear trends; the trend for the EM algorithm was the weakest among the methods.

Discussion
Certain basic limitations exist when calibrating computer models to real data. We have found that the performance of the calibration methods is influenced significantly by the designs of both the physical and computer experiments [43,44]. Thus, a sequential design approach should be very useful in practice [10,11,16]; appropriate experimental designs under sequential tuning would further improve the calibration.
Iterative versions of the calibration methods, including the KOH method, may also be available. For example, an iterative version of the SMLE method can be summarized as follows:

Step 1: Acquire $\hat{\tau}$ by maximizing the conditional likelihood function of $\tau$ on the E-data, given $(y_C, \hat{\psi}_C)$.

Step 2: [maximization] Acquire $\hat{\psi}_B$ by maximizing the likelihood from the combined data, in which $\hat{\tau}$ is inserted into the E-data.

Step 3: [maximization] Acquire $\hat{\tau}$ by maximizing the conditional likelihood function of $\tau$ on the E-data, given $(y_C, \hat{\psi}_B)$. Steps 2 and 3 are then repeated until convergence.
The estimate $\hat{\tau}$ may vary according to the selected emulator. Thus, an important issue in calibration is the selection of the GP surrogate, with its regression variables and GP parameters. We only used the simple models in (6) in this study; if an optimal GP model were built by a model selection algorithm, the results could differ.

Summary
The ALS method using a GP emulator is a basic calibration technique for complex computer models. However, it exhibits the potential drawback that the emulator is constructed only once and is never re-built. To overcome this defect, an iterative (EM) algorithm has been proposed in this study. The calibration parameters of the simulation code are calculated by a conditional expectation (E-step), whereas the GP parameters are updated by maximum likelihood estimation (M-step). These EM steps are repeated alternately until convergence, using both the computer and experimental data.
We examined five test functions for a performance comparison and confirmed that our proposed method (the EM algorithm) provided better results than the SMLE, ALS, and max-min methods in terms of the RMSE. The disadvantage of the proposed method is that it is more time-consuming than the ALS, because the EM algorithm must optimize complex functions and compute the conditional expectation by numerical integration over the combined data. Nonetheless, the EM method can provide improved calibration, as well as a superior emulator of the computer code, compared with non-iterative methods such as the ALS and SMLE.

Acknowledgments:
The authors would like to thank the reviewers and the guest editors of the special issue for helpful comments and suggestions, which have significantly improved the presentation of this paper. This report is part of an invited talk by the second author at the GdR Mascot-Num annual conference at the Ecole des Mines de Saint-Etienne, France, in 2015. He would like to thank the organizers of that conference for their hospitality. We are grateful to Professor Clifford Singer (Dept. of Nuclear Engineering, Univ. of Illinois at Urbana-Champaign) for providing the tokamak data.

Conflicts of Interest:
The authors declare that they have no conflict of interest.

Appendix B.1. Model for the Experimental Data

The experimental observations are modeled as

$$y_{Ei} = Y(\tau, x_{Ei}) + e_i, \quad i = 1, \ldots, n_E,$$

where $e$ is an independent and normally distributed random variable with mean zero and variance $\sigma_E^2$. Certain contents of this Appendix are similar to those in References [4,6].

Appendix B.2. Computer and Experimental Data
We have the data matrices of the independent variables, $X_C$ and $X_E$, for the computer and experimental data, respectively:

$$X_C = \begin{bmatrix} t_{11} & t_{21} & \cdots & t_{q1} & x_{C11} & x_{C21} & \cdots & x_{Cp1} \\ t_{12} & t_{22} & \cdots & t_{q2} & x_{C12} & x_{C22} & \cdots & x_{Cp2} \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ t_{1n_C} & t_{2n_C} & \cdots & t_{qn_C} & x_{C1n_C} & x_{C2n_C} & \cdots & x_{Cpn_C} \end{bmatrix},$$

$$X_E = \begin{bmatrix} \tau_1 & \tau_2 & \cdots & \tau_q & x_{E11} & x_{E21} & \cdots & x_{Ep1} \\ \tau_1 & \tau_2 & \cdots & \tau_q & x_{E12} & x_{E22} & \cdots & x_{Ep2} \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ \tau_1 & \tau_2 & \cdots & \tau_q & x_{E1n_E} & x_{E2n_E} & \cdots & x_{Epn_E} \end{bmatrix}.$$
In the above, $t_{ij}$ in $X_C$ represents the $j$-th value of the $i$-th $T$ variable ($T_i$), whereas $x_{Eij}$ and $x_{Cij}$ denote the $j$-th value of the $i$-th $X$ variable of the experimental ($X_{Ei}$) and computer ($X_{Ci}$) inputs, respectively. Furthermore, $X_E$ is an $n_E \times (q+p)$ matrix and $X_C$ is an $n_C \times (q+p)$ matrix. Note that the first part of $X_E$ is composed of the unknown parameters $\tau_1, \cdots, \tau_q$, whereas the corresponding part of $X_C$ comprises the input values ($t_{ij}$).

Appendix B.3. Combined Data
The following notations are introduced for the combined computer and experimental data:

$$X_B = \begin{bmatrix} X_C \\ X_E \end{bmatrix}, \quad F_B = \begin{bmatrix} F_C \\ F_E \end{bmatrix}, \quad y_B = \begin{bmatrix} y_C \\ y_E \end{bmatrix},$$

where $X_B$ is the design matrix of the independent variables, $F_B$ contains the regression functions evaluated at the input variable values, and $y_B$ is the vector of combined responses. Note that $T$ denotes the input variables of the computer code: $X_C$ and $F_C$ contain $T$, whereas $X_E$ and $F_E$ are functions of the tuning parameters $\tau$.
The GP model is subsequently applied to the computer and experimental data simultaneously. Let $\psi = (\tau, \theta, \gamma_C, \gamma_E, \sigma^2, \beta)$, where $\gamma_C = \sigma_C^2/\sigma^2$ and $\gamma_E = \sigma_E^2/\sigma^2$. Here, $\sigma_C^2$ and $\sigma_E^2$ are the variances of the error term ($\epsilon$) in the GP model for the computer and experimental data, respectively. We set $\sigma_C^2 = 0$, and thus $\gamma_C = 0$, because only a deterministic computer model is considered in this study. When necessary, $\beta_C$ and $\beta_E$ are used to denote the regression coefficients for the computer and real experimental data. Thus, given the independence and normality assumptions, we have

$$y_B \sim N\left(F_B \beta, \; \sigma^2 V_B\right), \quad V_B = \begin{bmatrix} R(X_C, X_C) + \gamma_C I_{n_C} & R(X_C, X_E) \\ R(X_E, X_C) & R(X_E, X_E) + \gamma_E I_{n_E} \end{bmatrix},$$

in which $R(X_C, X_E)$ represents the $n_C \times n_E$ correlation matrix computed between $X_C$ and $X_E$. $V_B$ is an $n_B \times n_B$ positive definite covariance matrix for the combined data, where $n_B = n_C + n_E$.
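A small R sketch of this construction, under the Gaussian correlation (3) with a common $\theta$ and $\gamma_C = 0$ (the function names are ours; X_C and X_E are numeric matrices as defined above):

```r
## Gaussian correlation matrix R(A, B) between two sets of sites
corr_gauss <- function(A, B, theta) {
  D2 <- outer(rowSums(A^2), rowSums(B^2), `+`) - 2 * A %*% t(B)  # squared distances
  exp(-theta * D2)
}

## V_B: the error term enters the experimental block only (gamma_C = 0)
make_VB <- function(X_C, X_E, theta, gamE) {
  X_B <- rbind(X_C, X_E)                  # combined design matrix
  V   <- corr_gauss(X_B, X_B, theta)      # n_B x n_B correlation matrix
  V + diag(c(rep(0, nrow(X_C)), rep(gamE, nrow(X_E))))
}
```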