Developing a Novel Parameter Estimation Method for Agent-Based Model in Immune System Simulation under the Framework of History Matching: A Case Study on Influenza A Virus Infection

Since they can provide a natural and flexible description of nonlinear dynamic behavior of complex system, Agent-based models (ABM) have been commonly used for immune system simulation. However, it is crucial for ABM to obtain an appropriate estimation for the key parameters of the model by incorporating experimental data. In this paper, a systematic procedure for immune system simulation by integrating the ABM and regression method under the framework of history matching is developed. A novel parameter estimation method by incorporating the experiment data for the simulator ABM during the procedure is proposed. First, we employ ABM as simulator to simulate the immune system. Then, the dimension-reduced type generalized additive model (GAM) is employed to train a statistical regression model by using the input and output data of ABM and play a role as an emulator during history matching. Next, we reduce the input space of parameters by introducing an implausible measure to discard the implausible input values. At last, the estimation of model parameters is obtained using the particle swarm optimization algorithm (PSO) by fitting the experiment data among the non-implausible input values. The real Influeza A Virus (IAV) data set is employed to demonstrate the performance of our proposed method, and the results show that the proposed method not only has good fitting and predicting accuracy, but it also owns favorable computational efficiency.


Introduction
Because of the complexity and nonlinearity, precise computational modelling of natural immune systems is virtually impossible. Removing all of the possible details and retaining only the essential interactions provides a possibility to solve this problem. Differential equation (DE) and agent-based modeling (ABM) are two commonly used simulation techniques for immune systems. DE provides mathematical models to describe the dynamic interactions among the components in the immune system, and the immune response can be easily quantified after estimating the control parameters by fitting the experimental data [1][2][3][4][5][6][7]. But, when faced with the simulation of complex phenomena, DE may fall short in constructing a sufficient biological model. When compared to DE, ABM has its benefit from its great flexibility in tuning the complexity of the agents and can be employed to reflect the real sophisticated system [8][9][10][11][12][13][14][15][16][17][18][19]. However, because ABM describes the system at the level of its constituent units, but not at the top level, it is difficult for ABM to estimate the key parameters by incorporating the experimental data.
In view of this, Tong et al. [20] developed an innovative approach, IABMR, by integrating the advantages of both DE and ABM. Firstly, they denoted each cell as an agent with three phenotypes (i.e., quiescence, proliferation, and apoptosis) and employed ABM to describe the dynamic interactions among the components (i.e., epithelial cells, infected epithelial cells, and virus) and simulate the immune system. Then, they employed local regression (LOESS) to build a regression model that is based on the input and output of ABM. Lastly, the model's key parameters were optimized using the particle swarm optimization algorithm (PSO) [21][22][23][24][25][26] by fitting the experimental data. IABMR can not only has the potential to simulate the immune system, but it can also infer the model parameters, like DE. The case study of influenza A virus (IAV) infection [1,2] demonstrated its reliability and efficiency.
However, when the dimension of the input becomes large, the LOESS may severely suffer from "curse of dimensionality" [27]. The model estimation would incur great variabilities. The dimension-reduced type model, generalized additive model (GAM) [28][29][30], is usually a popular alternative. GAM is a nonparametric regression modelling technique that is not restricted by linear relationships, and it is flexible regarding the statistical distribution of the data. On the other hand, although the PSO algorithm is very efficient and convenient when solving the optimization problem, it could be hard to discover the acceptable region of the input space because the acceptable values may hide in a tiny proportion of the initially specified input space. In addition, it will take much time to find the optimal solution if the input region of each parameter is wide.
In this work, a systematic procedure for immune system simulation by integrating the advantages of ABM and GAM under the framework of history matching [31] is developed to address all of the issues mentioned above. A novel parameter estimation method by incorporating the experimental data for the simulator ABM during the procedure is proposed. IAV infection data [2,20] is employed to evaluate the efficiency and accuracy of the proposed method. First, we employed ABM as simulator [31] to simulate the data. Then, GAM is employed as the emulator [31] to train a statistical regression model of the simulator using the input and output data of ABM. Next, input space is reduced by discarding the implausible input values using an implausibility measure. At last, the model's key parameters are optimized using PSO by fitting the experimental data among the retained non-implausible input values.
The results demonstrate that our proposed method not only has good fitting accuracy and prediction precision, and thus possesses good potential in both simulating the immune system and fitting the real experimental data, like IABMR, but also needs less computing cost and owns more computational efficiency than IABMR. Figure 1 shows the procedure route of this research under the framework of history matching. The first step is to simulate the human immune system by using a stochastic ABM model as the simulator. Once obtaining the training data from the input and the output data of the simulator ABM, the next step is to convert the simulator into a statistical GAM model with a higher efficiency, known as emulator. Then, an implausibility measure is introduced. The parameter space is reduced by discarding the implausible input values. At last, the model's key parameters were optimized using PSO in the non-implausible domain by fitting the real experiment data. In this paper, the real Influeza A Virus (IAV) data set [2,20] is employed to demonstrate the performance of our proposed procedure ( Figure 1) and the statements of the detailed method are deferred in Section 3.  Table 1 lists the real experimental data [2,20] from infection of mice with the H3N2 influenza virus A/X31 strain from 0 to 5 days which is used to fit the model. In this study, we use the sample mean at each point as our observations z j in the process of history matching.

Sampling Data
As stated in Section 3.4, the maximin Latin Hypercube design method [32] is employed to generate sampling data. To make the range of sampled data adapt to our input space (0, 2θ 0 ), we use Equation (6) to map the original sampled data. Hereinafter, we use bold symbols to denote vectors distinguished from variables. A set of 40 sampling points is listed in Table 2, which is used as inputs x i , i = 1, 2, . . . , 40 into the simulator ABM described in Section 3.1 to generate the training data for the emulator GAM in Section 3.2. As discussed in Section 3.4, in our case of IAV infection study, the inputs x for the simulator ABM is denoted by a four-dimensional vector θ whose components θ k , k = 1, 2, 3, 4 represent proliferation rate, infection rate and death rate per hour for epithelial cells, infected epithelial cells and virus separately. In order to accommodate the randomness of the simulator ABM, we execute K = 30 times for each point to make the estimation ofĝ j (x i ) with sufficient accuracy, whereĝ j (x i ) represents the sample mean of K = 30 simulated samples for the jth output at input points x i . In our study, 10 outputs are selected from the outputs of the simulator ABM corresponding to the first 10 time points of the real data listed in Table 1. Table 2. Sampling data set as inputs into the simulator ABM, where θ k , k = 1, 2, 3, 4 represent proliferation rate, infection rate and death rate per hour for epithelial cells, infected epithelial cells, and virus, separately.

Non-Implausible Space
As discussed in Section 2.2, we can get the dataset x i ,ĝ j (x i ) , i = 1, 2, . . . , 40, j = 1, 2, . . . , 10 from the simulator ABM. Then, the emulator GAM model M 0 for each output is built based on these simulated data. After that, another set of 10 5 points, namely H 1 , are drawn from a four-dimensional uniform distribution within the region (0, 2θ 0 ), and the implausibility measure is evaluated for each point of H 1 . Since we have 10 outputs in our study, for simplicity, we use the following implausibility measure I(x) = max j=1,2,...,10 where I j (x) is defined by Equation (5) in Section 3.3. According to Pukelsheim's 3σ rule [33], all x ∈ H 1 with I(x) > 3 will be deemed implausible. Those non-implausible points that passed the implausibility test will be remained. Comparisons between the initial 10 5 sampling points and non-implausible sampling points for each parameter are shown in Figure 2. As shown in Figure 2, when compared to the initial 10 5 sampling points, the non-implausible sampling points shrunk by approximately 34.05%, which is retained to form the non-implausible interval for each parameter. Comparisons between the initial interval and non-implausible interval for each parameter are given in Table 3. It is observed that non-implausible intervals for four parameters have been narrowed.

Fitting Experimental Data
As stated in Section 3.4, once we obtain the reduced non-implausible region for the parameters, the optimization method PSO is employed within the non-implausible region to find the estimated parameters by fitting the experimental data Table 1. Considering that PSO is a stochastic optimization technique, we run the PSO algorithm 50 times and 50 parameter estimates are obtained. The initial parameters and the mean with standard error of the 50 parameter estimates are listed in Table 4. In order to evaluate the fitting efficiency of the proposed model, we draw the fitting at the first 10 timepoints, with the real experimental data in Table 1. The well-developed ordinary differential equation method (ODE) [2] is also employed for comparison. The results are illustrated in Figure 3. It is shown that the two methods have comparable fitting accuracy.

Average Relative Error
In order to evaluate the prediction accuracy of the whole procedure, we compute the average relative error (ARE) for each parameter θ k , k = 1, 2, 3, 4, as defined by Equation (8) in Section 3.4, with three different level of random noises 0.04, 0.05, and 0.06. The ARE of the IABMR method [20] is also computed for comparison. The results for each parameter are shown in Figure 4. It is observed that our method have relatively smaller AREs than IABMR, which suggests that our proposed method have favorable prediction precision.

Simulator: Using ABM (Agent-based Model) to Simulate the Immune System
A simulator is a computer model to describe the physical process. The simulator output is composed of a vector of r quantities, which is denoted with the vector f(x) = [ f 1 (x), . . . , f r (x)] ∈ R r . To accommodate the randomness, we run the simulator K times at each fixed input vector x and the observed values would satisfy. f jk (x) = g j (x) + ε jk , j = 1, 2, . . . , r, k = 1, 2, . . . , K where g j (x) is the mean value of the jth output at input value x and ε jk is a random variable with expectation 0. The training data point for input x is then x,ĝ j (x) , whereĝ j (x) is the sample mean of K simulated outputs f jk (x) as an estimator of the true mean value g j (x). For the case of IAV infection study, Tong et al. [20] developed an ABM to describe the dynamic interactions among the components (i.e., epithelial cells, infected epithelial cells, and virus) by denoting each cell as an agent with three phenotypes (i.e., quiescence, proliferation, and apoptosis). Then, the real experiment data from infection of mice with the H3N2 influenza virus A/X31 strain from 0 to 5 days [2,20] listed in Table 1 is used to evaluate the fitting accuracy. In this work, we simulate the immune system by using the ABM described by [20] as our simulator to obtain the training data for the following emulator.

Emulator: GAM Model
Suppose that y is a vector of observed responses and that x 1 , x 2 , · · · , x p are p independent variables. The GAM postulates that the response is additively related to the independent variables via the equation where µ = E y x 1 , x 2 , · · · , x p is the expectation of the responses that are conditioned on the predictors. The function h is some known function, called the link function. Each function m k , k = 1, 2, . . . , p is assumed to be an unknown nonparametric smooth function, which is needed to be estimated from the observed data. When compared with the LOESS model, even if the number of predictors p is large, the inputs in the functions m k in GAM (2) is still of only one dimension. Therefore, GAM (2) is a dimension-reduced type regression model. Because of the flexibility that is provided by allowing m k to be nonparametric functions, as well as the fact that it is easy to fit these models using standard statistical software, such as S-Plus and R, the GAM has become a powerful analytic tool with a wide range of applications [34][35][36].
In this paper, the simulated data x,ĝ j (x) obtained in Section 2.2 is applied to train the GAM emulators for the jth output.

Reducing the Input Space by Using Implausibility Measure
Traditional optimization method is to initialize the parameters' range first, and then process the optimization. However, it would take a long time to finish the task when the model has several key parameters. Since most of the optimization methods can only find the optima locally, it would probably not give the desirable result when the search ranges of the input parameters are wide.
History matching [31] is designed to identify the set of inputs that would give rise to acceptable matches between the model outputs and the observed data. It provides a tractable calculation involving expectations and variances by excluding implausible parts of the input space that are unlikely to make a good match with the observed data. History matching has several advantages. First, the calculations that are involved in history matching are far more efficient and straightforward to implement. Second, it is possible to exclude implausible space without taking the full sets of inputs and outputs simultaneously into consideration.
As demonstrated in [31], history matching assumes the existence of a physical process that is measured through observations z, which is linked to the best simulator input denoted by x * via where ϕ and δ are vectors of errors representing Observation Uncertainty (OU) and Model Discrepancy (MD), respectively. OU refers to the uncertainty involved in the data from observations and MD refers to the imperfect representation of reality for simulator [31]. ε is the random error in (1) whose variability would affect the search of implausible values which is referred to Ensemble Variability (EV). These errors OU, MD, and EV are considered to affect the search for the non-implausible input spaces. For the purpose of evaluating whether the simulator's output at input value x would result in an acceptable match with the observed data, Andrianakis et al. [31] introduced the implausibility measure given the current uncertainties as follows: Here, z j represents the jth output for the observation data. E * g j (x) represents the expectation that is obtained by the emulator-GAM model at input point x for the jth output. V 0 represents the variance that is associated with OU and it equals to the variance of the observation data. V c (x) denotes the code uncertainty as quantified by the emulator-GAM model (2). The variabilities incurred by EV and MD are represented by V s and V m , respectively. According to the suggestions provided by [31], V s is set equal to the sample variance of K simulated samples for each input. V m is considered to be equal to 10% of the variance of the simulator output data at all of the design points for simplicity.
A large value of I j (x) would indicate that despite the uncertainties that are present in the system, the prediction about the simulator's output for x is so far from the observed value z j . In this paper, we use 3 as a cut off value according to Pukelsheim's 3σ rule [33], i.e., all x with I j (x) > 3 will be deemed implausible. After reducing the implausible points, the input space would be narrowed.

Parameter Estimation
Once we obtain the non-implausible parameter space, the optimization method PSO is employed to locate the estimated parameters by fitting the observation data.
In our case of IAV infection study, the input parameter of ABM is denoted by a four-dimensional vector θ, whose components θ k , k = 1, 2, 3, 4 represent proliferation rate, infection rate, and death rate per hour for epithelial cells, infected epithelial cells, and virus separately. The definitions are given in Section 2.2 and Table 1 in [20]. According to [2], θ are estimated as (6.2 × 10 −9 , 2.47 × 10 −7 , 5.98 × 10 −2 , 4.23 × 10 −1 ). We refer to this as initial parameter θ 0 , and set the input parameter space of ABM within the region (0, 2θ 0 ). In this research, the inputs from the parameter vector space for the simulator ABM are obtained by the maximin Latin Hypercube design method [32], which can generate uniformly distributed points in the input space (0, 1). To adapt to our input space (0, 2θ 0 ), the sampling inputs that are obtained by the maximin Latin Hypercube design method are mapped to the region (0, 2θ 0 ) via the mapping function: where d ∈ (0, 1) and a = 0, b = 2θ 0 . Once the sampling input points are obtained, the simulator runs at the selected inputs to get the training dataset for emulator. Then, GAM model M 0 is built by using R function gam(). After that, we regenerate another sampling data set H 1 within the region (0, 2θ 0 ) and take H 1 as input into M 0 to obtain prediction output data G 1 . The implausibility measure (4) is employed to evaluate each point of H 1 and those that do not pass the implausibility test are deemed implausible, meaning that the simulator cannot match the observations given the current error specifications. The initial input space (0, 2θ 0 ) would be reduced to be a subset of non-implausible space. Next, PSO is employed to locate the optimal parameter θ * within the non-implausible space by fitting the real experimental data, i.e., the estimated parameter θ * is obtained by minimizing the following objective function with respect to x where m is the number of outputs andĝ j.GAM (x) is the fitted value from the GAM model M 0 at input x.
In order to evaluate the prediction accuracy of the whole procedure, we repeat the whole procedure, except for that the normal distributed error is used to add noise for each replicate of the data set obtained by the simulator ABM with key parameter given by θ * .
Finally, we compute the average relative error (ARE) [3] of our procedure via Here, θ * i denotes the optimal local parameter of each replicate and M represents the number of total replicates. ARE is widely used to evaluate the prediction of statistical models. The smaller the ARE value is, the better the model performs.

Conclusions
In this work, we proposed a systematic procedure for immune system simulation by integrating ABM and GAM under the framework of history matching. Because of its great flexibility, ABM has been widely used to simulate the biological immune system. However, it is crucial for ABM to obtain an appropriate key parameter by incorporating the real experimental data. One previous study [20] proposed an innovative method IABMR by employing the LOESS model and PSO optimization method. In this research, the dimension-reduced type regression model GAM (2) is employed to avoid "curse of dimensionality" in LOESS to help increase the precision of the model. Moreover, to further improve the computation efficiency during the optimization, the search region for the parameters is reduced by discarding those inputs that have large implausibility measure and would give rise to unacceptable matches between the model outputs and the observed data.
In our case of IAV infection data (Table 1), we historically match a four inputs and 10 outputs simulator by integrating ABM and GAM. By computing the implausibility measure for each point that was sampled from the initial parameter space, the non-implausible input space shrank by 34.05% (Figure 2). In our computation, during processing the optimization PSO by fitting the real data, IABMR [20] needs 3465 runs by sampling 385 inputs and running nine times at each input to make the PSO converge, but our method just needs 1200 runs, which shows that our method has more computation efficiency than IABMR when locating key parameters. Meanwhile, with regard to fitting accuracy, our proposed method is comparable to the previous well-developed ODE model [2] (Figure 3). The average relative error (ARE) is widely used to evaluate the prediction of statistical models. The smaller the ARE value is, the better the model performs. When compared with IABMR [20], our method also shows favorable prediction accuracy ( Figure 4).
However, [37] suggested that in the process of history matching during our procedure for immune system simulation, several computation waves might be needed to further reduce the parameter space in order to locate more precise parameters during optimization. Nevertheless, the simulator ABM at each wave during history matching does cost so much computing resource. It may be of use to bring in parallel computation, such as graphics processing unit technology [18,37,38] to accelerate the procedure. For example, each input of ABM needs to execute K times at each wave during history matching. If we could map the whole computing job to several nodes, it would save much time. The detailed work on how to make the simulator ABM during history matching in our procedure of immune system simulation run at parallel nodes still needs much to rewrite codes and it is out of the scope of this research. We leave this as our future work.
In conclusion, in this study a systematic procedure for immune system simulation by integrating the advantages of ABM, GAM under the framework of history matching is proposed. The estimation for the key parameters to incorporate the real experiment data for the simulator ABM in the procedure is developed. The real data analysis demonstrates that our study has good efficiency and accuracy, and thus could mimic the immune system on multiple levels.