Generalized Nonlinear Least Squares Method for the Calibration of Complex Computer Code Using a Gaussian Process Surrogate

The approximated nonlinear least squares (ALS) method has been used to estimate unknown parameters in complex computer code that is very time-consuming to execute. The ALS calibrates or tunes the computer code by minimizing the squared difference between the real observations and the computer output using a surrogate such as a Gaussian process model. When the differences (residuals) are correlated or heteroscedastic, the ALS may yield a distorted code tuning with a large estimation variance. Another potential drawback of the ALS is that it does not take into account the uncertainty in the approximation of the computer model by a surrogate. To address these problems, we propose a generalized ALS (GALS) that constructs the covariance matrix of the residuals; the residuals are weighted by the inverse of this covariance matrix, and the resulting quadratic form is minimized with respect to the tuning parameters. In addition, we consider an iterative version of the GALS, called the max-minG algorithm, in which the parameters are re-estimated and updated by maximum likelihood estimation and the GALS, using both computer and experimental data repeatedly until convergence. Moreover, the iteratively re-weighted ALS method (IRWALS) is considered for comparison purposes. Five test functions under different conditions are examined for a comparative analysis of the four methods. Based on the test function study, we find that both the bias and the variance of the estimates obtained from the proposed methods (the GALS and the max-minG) are smaller than those from the ALS and the IRWALS. In particular, the max-minG works better than the others, including the GALS, for the relatively complex test functions. Lastly, an application to a nuclear fusion simulator is illustrated, and it is shown that the abnormal pattern of residuals under the ALS can be resolved by the proposed methods.


Introduction
In many areas of study, modern scientific researchers attempt to develop and use computer code instead of physical experiments when the cost of the experiments makes them too expensive or infeasible. With the development of computer technology, it has become possible to run very complex computer simulations that numerically implement a physical/mechanistic model, although such a model may contain some unknown parameters. One of the classic methods for estimating unknown parameters in computer code is nonlinear least squares estimation (NLSE), which minimizes the sum of the squared differences between the experimental observations and the responses of the computer simulations. However, if the computer code is complex and time-consuming to execute, the NLSE becomes too expensive or infeasible in terms of time. In these cases, one can use a statistical model to estimate the parameters in the computer code so that the computer simulation model explains the physical experimental data well. This procedure is called "calibration" or "code tuning" [1][2][3][4].
Calibration is formally defined as the process of improving the agreement of a code calculation, or a set of code calculations, with a chosen and fixed set of experimental data via the estimation of the parameters implemented in the simulator [5]. Han et al. [1] distinguished between tuning parameters and calibration parameters. In this study, however, the two are treated as the same, and we use the terms calibration and code tuning interchangeably.
Cox et al. [3] proposed an approximated NLSE (ALS) for code tuning, in which they employed the Gaussian process (GP) as a metamodel (or surrogate) of complex computer code. That is, the ALS first fits the GP to the computer data and then treats the fitted GP as if it were the true simulation code, which makes the ALS computationally feasible. The GP has been successfully used to analyze computer experiments [6][7][8]. In this study, we adopt the same model as a metamodel of the computer simulation code.
In the literature, calibration is often performed within a framework in which the code predictions suffer from a systematic discrepancy, or bias, for any value of the parameters. This reflects the view that the mathematical equations underlying the code should not be considered a perfect model of the real world [4,14]. Although this framework is more realistic, it is outside the scope of this paper; thus, our presentation is centered on a statistical model that does not include the code bias [3]. It would, however, be possible to generalize our framework if the shape of the discrepancy were provided.
The differences between the observations of the experiments and the responses of the computer simulations are often correlated or do not have equal variance; the latter is known as the heteroscedasticity problem in linear model theory [21,22]. In that case, the ordinary least squares approach is not appropriate, and a generalized least squares method is usually employed. In the calibration literature, however, the existing methods, including the ALS, do not consider the correlation between the residuals. Moreover, one potential drawback of the ALS method is that it does not account for the uncertainty in the approximation of the computer response by the fitted GP. To address these two problems, we propose a generalized approximated nonlinear least squares estimation (GALS) method that takes into account the covariance matrix of the residuals and the uncertainty due to the approximation.
Another potential defect of the ALS is that the surrogate GP is built only once, from the computer data, and is not updated thereafter. To address this difficulty, Seo et al. [2] introduced an iterative version of the ALS, called the 'max-min algorithm', which improves the accuracy of calibration and yields more stable results. In this study, we propose a similar iterative version of the GALS method, called the 'max-minG'. In addition, we considered the iteratively re-weighted least squares method in the ALS setting for the calibration of computer code for comparison purposes. Figure 1 shows a schematic flow chart of the proposed methods. These proposed methods are validated and compared with the existing methods using five test functions for which the true parameters are known a priori. This paper is organized as follows. Section 2 describes a GP for computer experiments. The ALS approach is introduced in Section 3. The GALS method and its iterative version (the max-minG) are proposed in Sections 4 and 5, respectively. In Section 6, a test function study is presented. An application to a nuclear fusion simulator is given in Section 7. A discussion is given in Section 8, followed by a summary in Section 9. The formula for the covariance matrix of the residuals is derived in Appendix A.

Gaussian Process Model for Computer Experiments
In this section, we describe the statistical model for the computer simulation data based on the Gaussian process. In many cases, computer simulation code is deterministic. For this reason, Sacks et al. [6] proposed adopting the Gaussian process model (GPM) for computer experiments. The GP model is expressed as

Y(t) = Σ_{i=1}^{p} f_i(t) β_i + Z(t),    (1)

where Y(t) is the response at site t, {f_i(·)} is a set of known functions, β = (β_1, ..., β_p)' is a vector of unknown regression coefficients, and Z(·) is a Gaussian process with mean zero. The first term represents a linear regression model, and the second term represents the departure from the assumed linear model, which allows us to interpolate between the observed sites. For t_i = (t_{i1}, ..., t_{id}) and t_j = (t_{j1}, ..., t_{jd}), the covariance between Z(t_i) and Z(t_j) is given by

Cov(Z(t_i), Z(t_j)) = σ²_Z R(t_i, t_j),    (2)

where σ²_Z is the process variance of Z(·) and R(t_i, t_j) is a correlation function. Our choice is the Gaussian correlation family [23]

R(t_i, t_j) = exp( − Σ_{k=1}^{d} θ_k (t_{ik} − t_{jk})² ),    (3)

where the θ_k are non-negative parameters. For n observed sites t = {t_1, ..., t_n} and a prediction site t_0, we define the correlation vector

v(t_0) = [R(t_i, t_0)]_{1≤i≤n},    (4)

and the design vector

f(t_0) = (f_1(t_0), ..., f_p(t_0))'.    (5)

If the correlation function R(·, ·) is known, the best linear unbiased predictor (BLUP) of Y(t_0), given the observations y, is

Ŷ(t_0) = f(t_0)'β̂ + v(t_0)'V⁻¹(y − Fβ̂),    (6)

where V = V(t, t) is the n × n correlation matrix of the observed sites, y is the vector of observations at the sites t, F = [f_j(t_i)]_{1≤i≤n, 1≤j≤p} is the n × p design matrix of the observed sites, and β̂ = (F'V⁻¹F)⁻¹F'V⁻¹y is the generalized least squares estimator of β. In practice, V is unknown; thus, we estimate the hyper-parameters via maximum likelihood estimation (MLE) from the data collected at the design sites. They are then plugged into (6), which makes (6) the so-called empirical best linear unbiased predictor (EBLUP) of Y(t_0) [23], also known as Kriging in geostatistics. We used the package "DiceKriging" [24] of the R program.
In the estimation of the GP parameters, the starting values of the numerical optimization are important. In this study, we evaluated the likelihood at 200 random points and selected the 20 values with the highest likelihood as starting values. The prediction model can be specified differently according to the combination of the β's and θ's in (1) and (3). In this study, we consider the simple constant-mean model

Y(t) = β_0 + Z(t),    (7)

with the Gaussian correlation (3) having d distinct parameters θ_1, ..., θ_d. Of course, other models can be considered, and models based on a variable selection algorithm are also possible [25][26][27]. We refer the readers to References [23,28] for more information on the GPM and its application to the design and analysis of computer experiments, including calibration.
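To make the GP surrogate concrete, the following sketch fits an ordinary-kriging model (constant mean, Gaussian correlation (3)) by maximum likelihood and evaluates the EBLUP (6). It is a minimal illustration in Python with NumPy/SciPy rather than the R package "DiceKriging" used in this paper; all function names (`gauss_corr`, `fit_gp`, `eblup`) are ours.

```python
import numpy as np
from scipy.optimize import minimize

def gauss_corr(T1, T2, theta):
    """Gaussian correlation family (3): R = exp(-sum_k theta_k (t_ik - t_jk)^2)."""
    d2 = ((T1[:, None, :] - T2[None, :, :]) ** 2 * theta).sum(axis=2)
    return np.exp(-d2)

def fit_gp(T, y, theta0):
    """Estimate theta, beta, sigma2_Z by (profile) maximum likelihood."""
    n = len(y)
    F = np.ones((n, 1))                      # constant-mean design matrix

    def neg_log_lik(log_theta):
        V = gauss_corr(T, T, np.exp(log_theta)) + 1e-8 * np.eye(n)
        Vi = np.linalg.inv(V)
        beta = np.linalg.solve(F.T @ Vi @ F, F.T @ Vi @ y)   # GLS estimator
        r = y - F @ beta
        s2 = (r @ Vi @ r) / n                # profiled process variance
        _, logdet = np.linalg.slogdet(V)
        return 0.5 * (n * np.log(s2) + logdet)

    res = minimize(neg_log_lik, np.log(theta0), method="BFGS")
    theta = np.exp(res.x)
    V = gauss_corr(T, T, theta) + 1e-8 * np.eye(n)
    Vi = np.linalg.inv(V)
    beta = np.linalg.solve(F.T @ Vi @ F, F.T @ Vi @ y)
    return theta, beta, Vi

def eblup(T0, T, y, theta, beta, Vi):
    """EBLUP (6): yhat(t0) = f(t0)'beta + v(t0)' V^{-1} (y - F beta)."""
    v = gauss_corr(T0, T, theta)             # correlation with observed sites
    F = np.ones((len(y), 1))
    return (np.ones((len(T0), 1)) @ beta) + v @ (Vi @ (y - F @ beta))

# Usage: interpolate a deterministic "code" output y = sin(3t) on [0, 2].
T = np.linspace(0, 2, 8)[:, None]
y = np.sin(3 * T[:, 0])
theta, beta, Vi = fit_gp(T, y, theta0=np.array([1.0]))
print(eblup(np.array([[0.95]]), T, y, theta, beta, Vi))
```

The predictor interpolates the 8 training sites exactly (up to the small numerical jitter) and should return a value close to sin(2.85) at the new site.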

Approximated Nonlinear Least Squares
We briefly describe the approximated nonlinear least squares (ALS) method proposed by Reference [3]. The following notations are used for the computer data:

d: dimension of the input variables t = (c, x) of the computer code
q: dimension of the unknown parameters c
c: unknown parameters to be estimated (q-dimensional)
c_S: input variables of the computer model corresponding to the unknown parameters c (q-dimensional)
n_S: number of observations in the computer simulations
x_S: independent variables in the computer model ((d − q)-dimensional)
Y(c, x) or y_S: computer response for the input variables (c, x)

In addition, we use analogous notations for the real experimental data, such as n_E for the number of real observations, x_E for the independent variables, and y_E for the responses of the real experiments. The subscripts S and E represent the computer simulation and the real experiment, respectively. If the computer simulation code explains the real experimental data well, we can approximate y_E using the following model:

y_E = Y(c, x_E) + e,

where e is assumed to be independent and identically distributed N(0, σ²_E I), and σ²_E is the variance of the real experiments.
When the computer code is very time-consuming to execute, it is almost impossible, in terms of time, to optimize some quantity directly from the code. In this case, the ALS uses a GPM as a statistical metamodel, or surrogate, of the computer code. The ALS first fits the GPM to the computer data by estimating the GP parameters with the MLE. Then, it treats the fitted model as if it were the true computer code. The ALS finds the ĉ that minimizes the following approximated residual sum of squares (ARSS):

ARSS(c) = (y_E − Ŷ(c, x_E))'(y_E − Ŷ(c, x_E)),

where Ŷ(c, x_E) is the EBLUP of Y(c, x_E), as in Equation (6). The advantage of this method is that it does not require any additional execution of the computer code to evaluate ARSS(c) once the prediction model has been built from a computer data set. Because no explicit solution is available to minimize ARSS(c), we use the quasi-Newton method of the "optim" function in the R program [29].
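The ALS procedure above can be sketched in a few lines: fit a surrogate to the computer data once, then hand ARSS(c) to a numerical optimizer. The sketch below uses a toy code Y(c, x) = c·x + x² with true c* = 2 and, for brevity, a zero-mean kriging interpolator with a fixed θ instead of the MLE fit; these simplifications, and all names, are ours.

```python
import numpy as np
from scipy.optimize import minimize

def krige(T0, T, y, theta):
    """Interpolating zero-mean kriging predictor with Gaussian correlation (3)."""
    def corr(A, B):
        return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2 * theta).sum(2))
    V = corr(T, T) + 1e-8 * np.eye(len(T))
    return corr(T0, T) @ np.linalg.solve(V, y)

rng = np.random.default_rng(0)

# Computer data: (c_S, x_S) design and deterministic toy-code output y_S.
c_S = rng.uniform(0, 4, 40)
x_S = rng.uniform(0, 2, 40)
T_S = np.column_stack([c_S, x_S])
y_S = c_S * x_S + x_S ** 2

# Real experimental data generated at the true parameter c* = 2 with noise e.
x_E = np.linspace(0.1, 1.9, 10)
y_E = 2.0 * x_E + x_E ** 2 + rng.normal(0, 0.05, len(x_E))

theta = np.array([1.0, 1.0])   # fixed for brevity; the paper estimates it by MLE

def arss(c):
    # Approximated residual sum of squares: the surrogate replaces the code.
    T0 = np.column_stack([np.full_like(x_E, c[0]), x_E])
    r = y_E - krige(T0, T_S, y_S, theta)
    return r @ r

chat = minimize(arss, x0=[1.0], method="L-BFGS-B", bounds=[(0, 4)]).x[0]
print(chat)   # should land near the true value c* = 2
```

Note that evaluating ARSS(c) requires no further runs of the (toy) code: the 40 simulator runs are used only once, to build the surrogate.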
Sometimes, the residuals y_E − Ŷ(c, x_E) appear to be correlated or do not have constant variance. The ALS, however, does not take this correlation into account in calibration; in such a case, an approach that deals with the correlation should be more appropriate. Moreover, a potential drawback of the ALS is that it does not account for the uncertainty in the approximation of Y by Ŷ. We expect that this drawback can be handled by the method presented in the next section, in which the covariance matrix of the residuals y_E − Ŷ(c, x_E) is considered. A further weakness of the ALS is that the surrogate Ŷ is built only once and is not updated thereafter. This can be addressed by the iterative version of the proposed method.

A Generalized ALS
Here, we propose a new method, the generalized approximated nonlinear least squares estimation (GALS), by considering the covariance matrix of the residuals. The GALS finds the ĉ that minimizes the following generalized ARSS:

GARSS(c) = (y_E − Ŷ(c, x_E))' K⁻¹ (y_E − Ŷ(c, x_E)),

where K = Cov(y_E − Ŷ(c, x_E)) is the covariance matrix of the residuals. Through some calculations presented in Appendix A, K can be written as

K = σ²_Z [ γ_E I + R(T_E, T_E) − U V_S⁻¹ U' + (F_E − U V_S⁻¹ F_S) G (F_E − U V_S⁻¹ F_S)' ],

where G = (F_S' V_S⁻¹ F_S)⁻¹, γ_E = σ²_E/σ²_Z, F_S is the n_S × p design matrix of the computer simulations, and F_E is the n_E × p design matrix of the real experiments. See Appendix A for the derivation of the above form. Here, U = [V(T_E,i, T_S,j)], i = 1, ..., n_E, j = 1, ..., n_S, is the n_E × n_S correlation matrix between the observations of the real experiments and the responses of the computer simulations, and V_S = [V(T_S,i, T_S,j)], i, j = 1, ..., n_S, is the n_S × n_S correlation matrix between the responses of the computer simulations, where T_S = {t_S,1, t_S,2, ..., t_S,n_S} for t_S,i = (c_S, x_S,i), i = 1, ..., n_S. It is notable that a similar matrix was considered by Reference [30] in a Bayesian framework for diagnostics of the GPM, not for code tuning.
For the calculation of Equations (12) and (13), the quantities R(T_E, T_E), U, V_S, σ²_Z, and γ_E have to be specified. In this study, the estimates θ̂'s and σ̂²_Z obtained by the MLE from the computer data under the model of Equation (7) are plugged in, while γ̂_E is estimated from the real data using a simple model.
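Under the constant-mean model, the covariance matrix of Equations (12) and (13) can be assembled directly from the correlation blocks. The sketch below reflects our reading of the formula (see Appendix A); the function names and the toy sites are ours.

```python
import numpy as np

# Our reading of K (Equations (12)-(13)), for a constant-mean GP:
#   K = sigma2_Z * [ gamma_E I + R(T_E,T_E) - U V_S^{-1} U'
#                    + (F_E - U V_S^{-1} F_S) G (F_E - U V_S^{-1} F_S)' ],
# with G = (F_S' V_S^{-1} F_S)^{-1} and gamma_E = sigma2_E / sigma2_Z.

def gauss_corr(A, B, theta):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2 * theta).sum(2))

def build_K(T_E, T_S, theta, sigma2_Z, gamma_E):
    n_E, n_S = len(T_E), len(T_S)
    F_E, F_S = np.ones((n_E, 1)), np.ones((n_S, 1))   # constant-mean design
    R_EE = gauss_corr(T_E, T_E, theta)
    U = gauss_corr(T_E, T_S, theta)                    # n_E x n_S cross-corr.
    V_S = gauss_corr(T_S, T_S, theta) + 1e-8 * np.eye(n_S)
    ViU = np.linalg.solve(V_S, U.T)                    # V_S^{-1} U'
    ViF = np.linalg.solve(V_S, F_S)                    # V_S^{-1} F_S
    G = np.linalg.inv(F_S.T @ ViF)
    H = F_E - U @ ViF                                  # F_E - U V_S^{-1} F_S
    return sigma2_Z * (gamma_E * np.eye(n_E) + R_EE - U @ ViU + H @ G @ H.T)

def garss(r, K):
    """Generalized ARSS: the quadratic form r' K^{-1} r."""
    return r @ np.linalg.solve(K, r)

# Tiny usage example with made-up sites.
T_S = np.linspace(0, 1, 6)[:, None]
T_E = np.array([[0.25], [0.55], [0.8]])
K = build_K(T_E, T_S, theta=np.array([2.0]), sigma2_Z=1.0, gamma_E=0.1)
print(np.all(np.linalg.eigvalsh(K) > 0))   # K is a valid covariance: True
```

The term γ_E·I keeps K positive definite even when the experimental sites are well covered by the simulation design, so the quadratic form in GARSS(c) is always well defined.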

Iterative Algorithm For GALS
In this section, we consider the iterative version of the GALS. Seo et al. [2] proposed the so-called max-min algorithm, which is an iterative version of the ALS. In the max-min algorithm, the tuning parameters and the GP parameters are alternately re-estimated and updated by maximum likelihood estimation and the ALS method. The algorithm uses both the computer and the experimental data repeatedly until convergence, and it is motivated by the iteratively reweighted least squares (IRLS) method in regression analysis. Seo et al. [2] showed that the max-min performs better than the ALS both in calibrating and in surrogating the computer code. We present a version of the max-min algorithm applied to the GALS, which is referred to as the 'max-minG'. The steps are given in the following:

Step 1: Estimate the GP parameters β̂'s, θ̂'s, and σ̂²_Z in Equations (2) and (7) using the MLE for the computer simulation data (T_S = (c_S, x_S) and y_S) only.

Step 2: Estimate ĉ by minimizing GARSS(c), where the prediction Ŷ is built from the computer data only.
Step 3: Estimate the GP parameters β̂'s, θ̂'s, and σ̂²_Z for the combined data (T_B and y_B), where T_B = {T_S, (ĉ, x_E)} and y_B = {y_S, y_E}. Here, ĉ is the estimate obtained from the previous step.

Step 4: Re-estimate ĉ by minimizing GARSS(c), where the prediction Ŷ is built from the combined data.
Step 5: Repeat Steps 3 and 4 until the change in ĉ between successive iterations is smaller than a given tolerance.

In Steps 1 and 3, the prediction model is built by the GP model as in Equation (7), with the β's and d different θ's. Steps 2 and 4 are the same in minimizing GARSS(c), but in the prediction Ŷ(x_E), Step 2 uses only the computer data while Step 4 uses the combined data. Steps 1 and 3 are the same in obtaining the MLE by maximizing the likelihood function, but Step 1 uses only the computer data while Step 3 uses the combined data. The likelihood function in Step 3 is L(β, θ; ĉ, y_B, x_B). The subscript B stands for the combined (or both) data. For the optimizations in Steps 2, 3, and 4, quasi-Newton numerical algorithms were employed.
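The five steps above can be outlined in code. The sketch below fixes θ instead of re-running the MLE in Steps 1 and 3, uses a zero-mean kriging surrogate, and takes K as the identity in the GARSS step, so it shows only the structure of the iteration (alternating surrogate refit on the combined data and re-estimation of c); these are our simplifications, not the paper's algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def corr(A, B, theta):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2 * theta).sum(2))

def krige(T0, T, y, theta):
    # Interpolating zero-mean kriging predictor with Gaussian correlation (3).
    V = corr(T, T, theta) + 1e-8 * np.eye(len(T))
    return corr(T0, T, theta) @ np.linalg.solve(V, y)

rng = np.random.default_rng(1)
c_S = rng.uniform(0, 4, 40); x_S = rng.uniform(0, 2, 40)
T_S = np.column_stack([c_S, x_S]); y_S = c_S * x_S + x_S ** 2    # toy "code"
x_E = np.linspace(0.1, 1.9, 10)
y_E = 2.0 * x_E + x_E ** 2 + rng.normal(0, 0.05, len(x_E))       # true c* = 2
theta = np.array([1.0, 1.0])

def estimate_c(T, y):
    """Minimize GARSS(c) (with K = I for brevity) over a surrogate on (T, y)."""
    def obj(c):
        T0 = np.column_stack([np.full_like(x_E, c[0]), x_E])
        r = y_E - krige(T0, T, y, theta)
        return r @ r
    return minimize(obj, x0=[1.0], method="L-BFGS-B", bounds=[(0, 4)]).x[0]

chat = estimate_c(T_S, y_S)                  # Steps 1-2: computer data only
for _ in range(5):                           # Steps 3-5: iterate until stable
    T_B = np.vstack([T_S, np.column_stack([np.full_like(x_E, chat), x_E])])
    y_B = np.concatenate([y_S, y_E])         # combined data at the current chat
    c_new = estimate_c(T_B, y_B)
    if abs(c_new - chat) < 1e-4:
        break
    chat = c_new
print(chat)   # should settle near the true c* = 2
```

The key structural point is that the experimental responses y_E enter the training set of the surrogate at the current estimate ĉ, which is what distinguishes the max-min type algorithms from the one-shot ALS and GALS.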
In each iteration of Steps 3 and 4, the combined data and the parameters of the metamodel are updated, so we expect this to positively affect the estimation of the tuning parameter c. This expectation comes from the fact that the iterative algorithm uses the combined (computer and real experimental) data in building the metamodel, while the ALS and the GALS use the computer data only; adding relevant data usually improves the prediction ability of the metamodel. The advantage of the max-minG is that both the computer data and the real experimental data are used for building the metamodel. Thus, the uncertainty in approximating Y by the metamodel becomes smaller than in the ALS and the GALS, because those methods build the surrogate using the computer data only [2]. We therefore expect the max-minG method to work well both in calibrating and in surrogating the computer code.

Test Function Study
In this section, we apply the proposed methods to test functions. Five test functions under different conditions were studied for a comparative analysis of the methods. These test functions are actually "toy models", that is, simple functions which are in fact easy to compute. We nevertheless treat them as if they were computationally expensive real simulators; they are introduced only to evaluate and compare the performances of the proposed methods. Test functions 1 and 2 are simple one- or two-dimensional problems, while test functions 4 and 5 involve three- or four-dimensional problems. A real simulator which takes a long time to compute is considered in the next section.
The experimental data with sample size n_E and the computer experiment data with sample size n_S were generated from each test function with its true parameter values c*. Table 1 shows the result of the estimation by the nonlinear least squares (NLS) method with the true test functions. The last column shows how many evaluations are needed for the estimation based on the NLS using the simulation code directly. As the simulation code becomes more complex, the number of evaluations required increases dramatically; thus, if the simulation code is very time-consuming, the estimation based on the NLS becomes infeasible in terms of computation time. With the five test functions, we examined how well the four methods estimate the parameters using only a small number of simulated data.

Table 1. Estimates and the distances to the true values for test functions 1-5, obtained by nonlinear least squares with the true functions. The values within the parentheses are the true values of each parameter. The last column is the number of simulation code evaluations for the estimation of each function with a single initial value.
The optimal Latin-hypercube designs [31,32] were used to select the independent variables for the real experiment (x_E) and the input variables for the computer simulations (c_S, x_S). To account for uncertainty in the design, 30 different computer experimental designs were used for each test function, while the real experimental design was fixed. If this design were not fixed, the variability of the c estimation would increase.
For the comparative analysis, we report the average of the 30 estimates and the standard deviation of the calibrated parameter values. The averaged distance from the estimates to the true values is calculated to evaluate the accuracy of the methods. To take account of both the variance and the accuracy of the estimates, we also considered the root mean squared error (RMSE) of the estimates as a performance measure:

RMSE(ĉ) = sqrt( Bias(ĉ)² + Σ_i std(ĉ_i)² ),

where Bias(ĉ) is the averaged distance to the true values and std(ĉ_i) is the standard deviation of each estimate over the 30 repetitions. In addition, the computing time is provided in seconds. The test function study was conducted on a PC with an Intel i5 CPU (3.6 GHz) and 16 GB of memory.
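As a worked example of this performance measure, the following computes the RMSE from a set of repeated estimates, following our reading of the formula (squared bias plus the sum of the per-parameter variances); the estimates here are made up.

```python
import numpy as np

def rmse(estimates, c_true):
    """RMSE = sqrt( Bias^2 + sum_i std(chat_i)^2 ) over repeated estimates."""
    est = np.asarray(estimates)            # shape: (repetitions, q)
    bias = np.linalg.norm(est.mean(axis=0) - c_true)   # distance to truth
    stds = est.std(axis=0, ddof=1)                     # per-parameter std
    return np.sqrt(bias ** 2 + (stds ** 2).sum())

# 30 hypothetical repetitions of a two-parameter calibration, c* = (2, 1).
rng = np.random.default_rng(2)
est = np.array([2.1, 0.9]) + rng.normal(0, 0.05, size=(30, 2))
print(round(rmse(est, np.array([2.0, 1.0])), 3))
```

This measure equals zero only when every repetition returns exactly the true parameter vector, so it penalizes both a systematic offset and design-to-design variability.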

Iteratively Re-Weighted ALS
When the errors do not have equal variance in regression analysis, one can use the weighted least squares method. The weights can be determined iteratively, which leads to the iteratively re-weighted least squares (IRLS) method. It has been employed in robust regression to obtain estimated regression coefficients that are relatively unaffected by extreme observations. In the IRLS, the weights are functions of the residuals from the previous iteration, so that points with larger residuals receive relatively less weight than points with smaller residuals. Consequently, unusual points tend to receive less weight than typical points [33].
We now apply the IRLS method to the ALS setting for the calibration of computer code, which we call the iteratively re-weighted ALS (IRWALS). The algorithm of the IRWALS is as follows:

Step 1: Obtain ĉ^(0) using the ALS, which minimizes the non-weighted residual sum of squares.

Step 2: Using the previously estimated parameters ĉ^(t−1), calculate the t-th weights

w_i^(t) = |y_E,i − Ŷ(ĉ^(t−1), x_E,i)|^(p−2), i = 1, ..., n_E.

In this study, we consider the case p = 1.
Step 3: Calculate the t-th estimate ĉ^(t) by minimizing the weighted ARSS

WARSS(c) = (y_E − Ŷ(c, x_E))' W^(t) (y_E − Ŷ(c, x_E)),

where W^(t) is the diagonal matrix with entries w_i^(t).

Step 4: Repeat Steps 2 and 3 until the change in ĉ between successive iterations is smaller than a given tolerance.

The IRWALS method is similar to the case in which the matrix K in the GALS is diagonal. Thus, it is expected that the IRWALS may perform better than the ALS when the errors do not have equal variances. Note, however, that the IRWALS method still has the same drawback as the ALS: it uses a surrogate Ŷ that is built only once from the computer data and is not updated thereafter.
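The IRWALS loop can be sketched as follows, alternating the weight update of Step 2 (with p = 1, so that w_i = 1/|r_i| and large residuals are down-weighted) and the weighted minimization of Step 3. The toy code, the heteroscedastic errors, and the small floor that keeps 1/|r_i| finite are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

def corr(A, B, theta):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2 * theta).sum(2))

def krige(T0, T, y, theta):
    # Interpolating zero-mean kriging predictor with Gaussian correlation (3).
    V = corr(T, T, theta) + 1e-8 * np.eye(len(T))
    return corr(T0, T, theta) @ np.linalg.solve(V, y)

rng = np.random.default_rng(3)
c_S = rng.uniform(0, 4, 40); x_S = rng.uniform(0, 2, 40)
T_S = np.column_stack([c_S, x_S]); y_S = c_S * x_S + x_S ** 2    # toy "code"
x_E = np.linspace(0.1, 1.9, 10)
sd = 0.02 + 0.08 * x_E                       # heteroscedastic error std. dev.
y_E = 2.0 * x_E + x_E ** 2 + rng.normal(0, sd)                   # true c* = 2
theta = np.array([1.0, 1.0])

def resid(c):
    T0 = np.column_stack([np.full_like(x_E, c), x_E])
    return y_E - krige(T0, T_S, y_S, theta)

def wals(w):
    """Minimize the weighted ARSS r' W r with diagonal weights w."""
    obj = lambda c: resid(c[0]) @ (w * resid(c[0]))
    return minimize(obj, x0=[1.0], method="L-BFGS-B", bounds=[(0, 4)]).x[0]

chat = wals(np.ones(len(x_E)))               # Step 1: plain ALS start
for _ in range(10):                          # Steps 2-4
    w = 1.0 / np.maximum(np.abs(resid(chat)), 1e-6)   # p = 1 weights
    c_new = wals(w)
    if abs(c_new - chat) < 1e-4:
        break
    chat = c_new
print(chat)   # should end near the true c* = 2
```

Because W^(t) is diagonal, this loop reweights each residual independently; it captures unequal variances but, unlike the full K of the GALS, not the correlation between residuals.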
The max-min and max-minG algorithms, in contrast, update the surrogate Ŷ iteratively using both the computer data and the experimental data. The IRWALS differs from them in this respect and thus may not share the advantage of the max-min type algorithms. Tables 2-6 show the results of the performance comparison for each test function. In terms of the RMSE, the proposed methods (the GALS and the max-minG) offer better results than the ALS and the IRWALS overall. In particular, the max-minG offers more accurate results (in terms of the averaged distance to the true values) and a smaller RMSE than the others, including the GALS, in the relatively complex cases such as test functions 4 and 5.
This is also confirmed in Figure 2, which shows the boxplots of the distance to the true value, with the RMSE, for the five test functions. The details are shown in Figures A1-A5, where the last boxplot in each figure shows the distances between the estimates and the true values, and the other boxplots show the estimates together with the true values (horizontal solid lines). In most cases, the medians of the estimates obtained from the GALS and the max-minG are closer to the true values than those from the ALS and the IRWALS. Furthermore, the box lengths of the estimates obtained from the GALS and the max-minG are shorter than those from the ALS and the IRWALS. The boxplots of the distances between the estimates and the true values also confirm that the GALS and the max-minG give more accurate and more stable (shorter box lengths) results overall.
The computing time of the max-minG, however, is much longer than those of the other methods (see Tables 2-6). Figure 3 shows how the performance of each model changes over the runtime (unit: seconds) on sample cases for test functions 4 and 5. The plots were truncated at run times of 100 and 1500 s to show the beginning part in detail.

Table 4. Same as Table 2 but for test function 3 with c* = (2, 1).

Table 6. Same as Table 2 but for test function 5 with c* = (1, 1, 3, 3).

Application to a Nuclear Fusion Simulator
In this section, we apply the proposed methods to a complex computer code called Baldur [34], a computationally expensive simulator of the energy confinement time in a nuclear fusion device (called a tokamak, from the Russian). The model can be written as

y_E = f(c, x) + e,

where f is a known function calculated by the Baldur code, x_1 is the input heating power, x_2 is the toroidal plasma current, x_3 is the line-averaged electron density, x_4 is the toroidal magnetic field, and c = (c_1, c_2, c_3, c_4) are adjustable parameters determining energy transfer: drift waves, rippling, resistive ballooning, and the critical value of η_i (which provokes increased ion energy losses for the drift), respectively. The experimental data consist of x = {x_1, x_2, x_3, x_4} only, whereas the computer data consist of (c, x). We use 42 observations from the PDX (Poloidal Divertor Experiment) tokamak in Princeton and 64 responses from the Baldur simulator. A detailed description of the data is found in References [3,34].
We use a simple model with a common θ, that is, θ_1 = ... = θ_d = θ in the correlation function (3). Table 7 shows the estimates of the tuning parameters obtained by the various tuning methods from the tokamak data, with their 90% confidence intervals within the parentheses. The confidence intervals were calculated by the parametric bootstrap with B = 1000. Figure 4 shows the residuals (y_E − Ŷ(T_E)) and the weighted residuals (K̂^(−1/2)(y_E − Ŷ(T_E)) and Ŵ^(−1/2)(y_E − Ŷ(T_E))) against the predicted values Ŷ(T_E). The residuals from the ALS show a declining pattern as Ŷ(T_E) increases, whereas almost no such pattern is observed in the weighted residuals from the GALS and the max-minG. This can be confirmed via the Pearson correlation coefficient between the residuals (or the weighted residuals) and the predicted values for each method in Figure 4: the correlation coefficient, which was −0.57 for the ALS, barely changed to −0.54 for the IRWALS, but changed substantially to −0.41 and −0.37 for the GALS and the max-minG, respectively.

Discussion
Unlike traditional estimation problems in statistics, we have two kinds of data (real data and computer data) for calibration. Thus, the following issues were considered for the calculation of K̂. First, R(T_E, T_E) and γ_E are related to the real experiment, and V_S is related to the computer data, while U and σ²_Z are related to both data sets. In this sense, it seems reasonable to estimate R(T_E, T_E), γ_E, and V_S using the related data set, and to estimate U and σ²_Z using the combined data. However, we have the tuning parameters c_S for the computer simulations but no c_E for the real data; hence, it is not easy to estimate U and σ²_Z using the combined data. A simple test confirmed that the θ̂ estimated using the real data can be vastly different from those estimated using the computer data, which may not lead to good calibration results. Therefore, we used the θ̂ estimated from the computer data for R(T_E, T_E), U, and V_S, while we used the γ̂_E estimated from the real data. We do not claim that this is the final answer, and one may try another version of γ̂_E in constructing K̂. For example, leave-one-out or k-fold cross-validation would be useful to select an appropriate γ_E in K.
There are basic limitations in tuning computer code to real-world data regarding the experimental design. We found that the performance of the tuning methods depends significantly on the designs of both the computer experiments and the physical experiments. Some authors, including References [35,36], have explored this topic. We agree that a sequential tuning approach is practically useful, as in References [13,15,20]. Further research on relevant designs under a sequential tuning approach would be helpful.
The estimates ĉ may vary according to the selected metamodel; hence, the selection of the model is very important in calibration. In this study, we only used the simple model (Equation (7)) for the simulation study. If the best model selected by a variable selection method were used, the results might be better. In this study, we did not consider the identifiability of the tuning parameters, which is an important focus of recent developments [16,17].
In this study, we assume that the computer code is deterministic, so that the responses do not change for the same input. If the code were stochastic, our results might change: the uncertainty of the c estimation would increase, and a larger sample size would be needed to achieve a precision similar to that in this study.
One issue in GP modeling involves manipulations of an n × n covariance matrix that require O(n³) computations, where n is the sample size. The calculation is computationally intensive and often intractable for big data. Various computationally efficient approximations have been proposed to address this problem. Sparse approximations employ m (m ≪ n) inducing points to summarize the whole training data, which reduces the complexity to O(nm²). Another method is the aggregation of models, also known as the consensus statistical method. This kind of scheme produces the final predictions by aggregating M submodels ("experts") respectively trained on random subsets or disjoint partitions of the data, thus distributing the computations to the experts [37], which still cover the whole space with moderate sample sizes. A third method is the local approximate GP, which searches for the most useful training data points (a subdesign relative to x, such as a nearest-neighbor subset) for predicting at x, without handling large matrices [28]. For big data, the calibration algorithms proposed in this study can be useful when modified with the above computational approximations. Parallelization in implementing a calibration algorithm for big data is essential to take full advantage of contemporary computer architectures.
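As an illustration of the aggregation idea, the following sketch splits the training data into M disjoint partitions, kriges within each partition, and averages the experts' predictions; plain averaging is the simplest consensus rule (Reference [37] and its references cover better-weighted ones), and all settings here are our own toy choices.

```python
import numpy as np

def corr(A, B, theta):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2 * theta).sum(2))

def krige(T0, T, y, theta):
    # Zero-mean kriging predictor; each expert only inverts an (n/M) x (n/M)
    # matrix, so the overall cost drops from O(n^3) to M * O((n/M)^3).
    V = corr(T, T, theta) + 1e-6 * np.eye(len(T))
    return corr(T0, T, theta) @ np.linalg.solve(V, y)

rng = np.random.default_rng(4)
T = rng.uniform(0, 2, size=(240, 1))         # "big" training set
y = np.sin(3 * T[:, 0])
theta = np.array([20.0])

M = 6
parts = np.array_split(rng.permutation(len(T)), M)   # disjoint partitions
T0 = np.array([[0.95]])
preds = [krige(T0, T[idx], y[idx], theta)[0] for idx in parts]
print(np.mean(preds))   # close to the full-data answer sin(2.85)
```

Each expert sees only 40 of the 240 points, yet the averaged prediction is nearly as accurate as the full-data fit here because every random partition still covers the whole input space.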

Summary
The approximated nonlinear least squares (ALS) method using a Gaussian process surrogate is a common calibration technique for complex computer code. It has, however, the potential drawbacks of not dealing with the correlation in the residuals and of not taking into account the uncertainty due to using a surrogate of the computer code. To address these difficulties, we proposed a generalized approximated nonlinear least squares (GALS) method for tuning complex computer codes. We also introduced an iterative algorithm for the GALS, named the max-minG. In addition, the iteratively re-weighted ALS method (IRWALS) was considered for comparison purposes.
For the performance comparison, we examined the five test functions, including cases in which the real experimental data have unequal error variances. From the results of the test function study, we confirmed that our proposed methods (the GALS and the max-minG) provide better results than the ALS and the IRWALS in terms of the root mean squared error (RMSE). In particular, for test functions 4 and 5, which are relatively complex and involve unequal variances, the max-minG provided more accurate estimates, and the variances of its estimates were smaller than those of the others, including the GALS. In addition, we identified that the abnormal pattern of residuals under the ALS can be resolved by using the proposed methods in an application to a nuclear fusion simulator.
The disadvantage of the proposed methods (the GALS and the max-minG) is that they are computationally more complex than the ALS, because they need to optimize more complex functions and to estimate more parameters for the covariance matrix K. Nonetheless, the GALS and the max-minG may provide better code tuning, as well as a better surrogate of the computer code, than the ALS or the IRWALS can.

Acknowledgments: The authors would like to thank the reviewers and the guest editors of the special issue for helpful comments and suggestions, which have greatly improved the presentation of this paper. We are grateful to Professor Clifford Singer (Department of Nuclear Engineering, University of Illinois at Urbana-Champaign) for providing the tokamak data, and to Professor Dennis D. Cox (Department of Statistics, Rice University) for helpful discussions. The authors also thank Yire Shin for his help in computing with the R program.

Conflicts of Interest:
The authors declare that they have no conflict of interest.

Appendix A. Calculation of the Covariance Matrix K
In this appendix, we calculate the n_E × n_E covariance matrix of the residual y_E − Ŷ(c, x_E), which is given by

K = E[(y_E − Ŷ(T_E))(y_E − Ŷ(T_E))'],

where Ŷ(T_E) denotes Ŷ(c, x_E). Writing y_E − Ŷ(T_E) = (y_E − Y(T_E)) + (Y(T_E) − Ŷ(T_E)), K is partitioned into four pieces:

K = E[(y_E − Y(T_E))(y_E − Y(T_E))'] + E[(y_E − Y(T_E))(Y(T_E) − Ŷ(T_E))'] + E[(Y(T_E) − Ŷ(T_E))(y_E − Y(T_E))'] + E[(Y(T_E) − Ŷ(T_E))(Y(T_E) − Ŷ(T_E))'].

We assume that e = y_E − Y(T_E) is independent of Y(T_E) and Ŷ(T_E). Assuming also that E(e) = 0, the two cross terms vanish, and therefore K is given by

K = σ²_E I + Σ(T_E),

where Σ(T_E) = E[(Y(T_E) − Ŷ(T_E))(Y(T_E) − Ŷ(T_E))'] is an n_E × n_E matrix. For the calculation of Σ(T_E), observe that, from Equation (6),

Ŷ(T_E) = F_E β̂ + U V_S⁻¹ (y_S − F_S β̂),    (A5)

using the same notations as in Section 4, with β̂ = G F_S' V_S⁻¹ y_S and G = (F_S' V_S⁻¹ F_S)⁻¹. Since Y(T_E) = F_E β + Z(T_E) and y_S = F_S β + Z(T_S), substituting into (A5) gives

Y(T_E) − Ŷ(T_E) = Z(T_E) − A Z(T_S),  where A = U V_S⁻¹ + (F_E − U V_S⁻¹ F_S) G F_S' V_S⁻¹,

because A F_S = F_E, so that the terms involving β cancel. Hence,

Σ(T_E) = σ²_Z [ R(T_E, T_E) − A U' − U A' + A V_S A' ].

Using A V_S A' = U A' + (F_E − U V_S⁻¹ F_S) G F_E' and A U' = U V_S⁻¹ U' + (F_E − U V_S⁻¹ F_S) G F_S' V_S⁻¹ U', this is rearranged as

Σ(T_E) = σ²_Z [ R(T_E, T_E) − U V_S⁻¹ U' + (F_E − U V_S⁻¹ F_S) G (F_E − U V_S⁻¹ F_S)' ].

The same rearrangement can also be obtained via the inverse formula for a block matrix [38]. Finally, K is written as, for γ_E = σ²_E/σ²_Z,

K = σ²_Z [ γ_E I + R(T_E, T_E) − U V_S⁻¹ U' + (F_E − U V_S⁻¹ F_S) G (F_E − U V_S⁻¹ F_S)' ].

Figure A2. Same as Figure A1 but for test function 2.

Figure A5. Same as Figure A1 but for test function 5.