An Algorithm for Nonparametric Estimation of a Multivariate Mixing Distribution with Applications to Population Pharmacokinetics

Population pharmacokinetic (PK) modeling has become a cornerstone of drug development and optimal patient dosing. This approach offers great benefits for datasets with sparse sampling, such as in pediatric patients, and can describe between-patient variability. While most current algorithms assume normal or log-normal distributions for PK parameters, we present a mathematically consistent nonparametric maximum likelihood (NPML) method for estimating multivariate mixing distributions without any assumption about the shape of the distribution. This approach can handle distributions with any shape for all PK parameters. It is shown in convexity theory that the NPML estimator is discrete, meaning that it has finite number of points with nonzero probability. In fact, there are at most N points where N is the number of observed subjects. The original infinite NPML problem then becomes the finite dimensional problem of finding the location and probability of the support points. In the simplest case, each point essentially represents the set of PK parameters for one patient. The probability of the points is found by a primal-dual interior-point method; the location of the support points is found by an adaptive grid method. Our method is able to handle high-dimensional and complex multivariate mixture models. An important application is discussed for the problem of population pharmacokinetics and a nontrivial example is treated. Our algorithm has been successfully applied in hundreds of published pharmacometric studies. In addition to population pharmacokinetics, this research also applies to empirical Bayes estimation and many other areas of applied mathematics. Thereby, this approach presents an important addition to the pharmacometric toolbox for drug development and optimal patient dosing.


Introduction
Pharmacokinetic studies in healthy volunteers commonly collect multiple observations in each subject. These datasets often arise from Phase I clinical trials and have been traditionally analyzed by noncompartmental methods or by modeling via the standard two-stage approach. Patient datasets often contain complex dosage regimens and only a few (or one) observation(s) per patient. To analyze such sparse datasets, e.g., in pediatric or critically ill patients, population modeling is required, since noncompartmental analysis and the standard two-stage approach are not applicable [1,2]. Population modeling has been shown to estimate PK parameters without bias and with good precision for such sparse datasets [3]. Parametric population modeling algorithms are commonly used and assume typically either normal or log-normal distributions for the between subject variability of PK parameters. While this assumption is made for virtually every parametric population PK model, it is difficult to prove, especially for datasets with a small number of subjects. Nonparametric population modeling can describe a multivariate distribution of PK parameters without assuming any shape of the PK parameter distribution. This is a key advantage of the nonparametric approach that is based on the exact log-likelihood. The present work comprehensively describes the foundation of this nonparametric estimation algorithm for the first time. This algorithm has been used in hundreds of peer-reviewed papers.
Pharmacometric observations can be described statistically by a mixture model. In this case, the probability of random variable arguments (the PK population model) of the pharmacokinetic compartmental model are described by a mixing distribution. The problem of estimating the mixing distribution from a set of pharmacometric observations can be stated as follows. Let Y 1 , ..., Y N be a sequence of independent but not necessarily identically distributed random vectors constructed from one or more observations from each of N subjects in the population. Let θ 1 , ..., θ N be a sequence of independent and identically distributed random vectors that represent unknown parameter values of N subjects. The θ belong to a compact subset Θ of Euclidean space with common but unknown distribution F. In other words, F is a distribution of the parameters in the population model. Θ represents the parameter space, and the dimension of this space is the number of parameters in the population model. The {θ i } are not observed. It is assumed that the conditional densities p(Y i |θ i ) are known, for i = 1, ..., N. p(Y i |θ i ) represents the model of observations y i given parameter values θ i including uncertainties of the measurement protocol. The mixing distribution of Y i with respect to F is given by p(Y i |F) = p(Y i |θ i )dF(θ i ). Because of independence of the {Y i }, the mixing distribution of the {Y i } with respect to F is given by The mixing distribution problem is to maximize the likelihood function L(F) with respect to all probability distributions F on Θ.
Remark 1. The distribution F ML that maximizes L(F) is a consistent estimator of the true mixing distribution; which means that F ML will converge to the true distribution if the number of subjects is large. This was proved originally by Kiefer and Wolfowitz in 1956 [4] . The consistency of F ML is especially important for our application to population pharmacokinetics where F ML is used as a prior distribution for Bayesian dosage regimen design [5].
The algorithm described in this paper differs from most other published methods in a number of ways. Our algorithm allows for high dimensional parameter space Θ. Most published methods require the dimension of Θ to be small and many require the dimension of Θ to be 1 , i.e., these methods required the number of parameters in the population model to be small. We have treated examples where the dimension of Θ is as high as 29, see Section 3.
Most published algorithms require the {Y i } to be identically distributed and assume that the population model {p(Y i |θ i )} is rather simple, such as p(Y i |θ i ) is a multivariate normal density with mean vector θ i and covariance matrix Σ. Even if Σ is unknown and has to be estimated, the structure of this model is straightforward. However, the estimation of Σ has to be done carefully to avoid singularities, see Wang and Wang [6]. As will be described in Section 3, we allow p(Y i |θ i ) to be calculated from a system of nonlinear ordinary differential-algebraic equations.
We now describe the details of our algorithm. It was proved by Lindsay [7] and Mallet [8] , under simple hypotheses on the population model {p(Y i |θ i )}, that the global maximizer F ML of L(F) could be represented by a discrete distribution with support on at most N points, i.e., a distribution with nonzero probability located on at most N points.

Remark 2.
One way to motivate this result by Lindsay and Mallet is as follows: Suppose by some lucky chance, we knew the exact parameters for each subject. How can we package this into a distribution? Answer: The "empirical distribution" of the exact parameters. That is the discrete distribution supported at the N exact parameters with equal weights. It turns out in this case that this empirical distribution is also the nonparametric maximum likelihood (NPML) distribution of the parameters. What is remarkable is that if we only have noisy measurements (Y 1 , ..., Y N ) of the N subjects only indirectly related to the subject parameters, the structure of the NPML is the same. A discrete distribution supported at N points. Of course, in this real case, the position and weights of the N support points are unknown. Finding these positions and weights is the subject of this paper.
Problem 1 is a convex programming problem. In our algorithm, we solve this problem by the primal-dual interior-point (PDIP) method. This type of method is standard in convex optimization theory (see Boyd and Vandenberghe [9]). However, the exact implementation for a specific problem varies from problem to problem. The exact details of our implementation are described in the Appendix. See also Bell [10], Baek [11] and Yamada et al. [12]. Our PDIP implementation is fast and can easily handle thousands of variables.
Finding a better set of support points in Problem 2 is a more difficult problem. This location problem is a nonconvex global optimization problem with many local extrema and whose dimension is potentially N × dim Θ. The details of our algorithm, called the adaptive grid (AG) method, will be described in Section 2.3 and in Algorithm 1.
Input: (Y, φ 0 , a, b, ∆ D , ∆ L , ∆ F , ∆ e , ∆ λ ), a and b are the lists of lower and upper bounds, respectively, of Θ; ∆ D is the minimum distance allowable between points in the estimated F ML . ∆ x see Section 2.7. Output: (φ, λ, l(λ, φ)) Estimate F ML given Y 2: 10: end if 11: n ←− n + 1 12: if (n > MAXCYCLES) then 16: end if 20: if |NewLogLike − LogLike| ≤ ∆ L and eps > ∆ e then 21: eps = eps/2 Adjust precision 22: end if 23: if eps ≤ ∆ e then check EXIT conditions 24: LogLike ← NewLogLike 35: end while 36: end procedure Roughly speaking, Problems 1 and 2 are solved as follows. An initial large grid of possible support points is defined in the hypercube Θ. Problem 1 is solved on this large grid. After PDIP, most of the original grid points are removed due to near-zero weights, leaving a smaller high-probability grid. Problem 1 is then solved on this smaller grid. Then, the adaptive grid method for Problem 2 takes place. For each remaining grid point, up to 2 × dim Θ new (daughter) support points are added. A daughter point outside the search space Θ or too close to a parent point is discarded. The new grid contains the current high-probability points plus the added daughter points. The algorithm is then ready for Problem 1 again. By construction, each iteration increases the value of l(λ, φ). This process continues until the function l(λ, φ) does not significantly change.

Comparable Methods
Because of space limitations, in this section, we only discuss NPML methods that optimize Equation (4), methods that treat multivariate distributions, and methods which allow general conditional probabilities {P(Y i , θ i )}. As explained in this paper, any such NPML algorithm has to address two problems: locations of support points and weights of support points. NPAG does locations by an adaptive grid method and weights by the primal-dual interior-point (PDIP) method. The algorithms discussed in this seection are summarized in Table 1.
The original methods of Lindsay [7] and Mallett [8] were based on algorithms of optimal design in the style of Fedorov [13]. In Schumitzky [14], an algorithm was proposed which did both locations and weights by the expectation maximization algorithm (EM). It was stable but also slow.
In Lesperance and Kalbfleisch [15], a new method was introduced which did weights by the dual method described in Section 5 of Lindsay [7] and locations by what they called the intra-simplex direction method (ISDM). Even though the Lesperance and Kalbfleisch paper was restricted to univariate distributions, the ISDM method has been generalized to the multivariate case. To briefly describe ISDM, let D(θ, F) be the directional derivative of log L(F) in the direction of the Dirac distribution δ θ supported at θ ∈ Θ. (This function is defined in Section 4 below). ISDM is an iterative algorithm. At stage k, let F k be the current estimate F ML . Then, find all the local maxima of D(θ, F k ). These local maxima are added to the current set of support points and a new F k+1 is calculated. If there are no new local maxima, then the algorithm is done.
In Pilla, Bartolucci, and Lindsay [16], another new method was developed where the locations were found by an initial fine grid. However, the weights were found by a dual version of the PDIP method.
In Savic, Kjellsson, and Karlsson [17], a nonparametric method (NONMEM-NP) was added to the popular NONMEM ® program. NONMEM-NP is a hybrid parametricnonparametric approach The locations of support points were found by a parametric maximum likelihood algorithm. Then, the weights were found by maximizing Equation (4) relative to the newly found support points. NONMEM-NP can handle high-dimensional and complex multivariate distributions. An extension to NONMEM-NP was developed in Savic and Karlsson [18] where additional support points are added to the original set. A comparison between NONMEM-NP and NPAG is discussed in Leary [19].
In Wang and Wang [6] , a new algorithm was developed for multivariate distributions. The locations were found by a combination of EM and a variant of ISDM. The weights were found by a family of quadratic programs. In [6], examples are performed for 8-and 13-dimensional mutivariate mixing distributions. Note: The quadratic programming algorithm (QP) of Wang and Wang [6] has an attractive feature. For a prescribed set of support points, QP finds the zero probabilities exactly. Thus, QP avoids the grid condensation step where support points from PDIP with sufficiently low probabilities are deleted. However, QP and PDIP are based on different numerical methods and a comparison of the efficiency of both algorithms has not been determined.
We finally mention that the NPML problem is a special case of a finite mixture model problem with unknown supports and weights. For a discussion of this approach, see Tatarinova and Schumitzky [21].
The algorithms which have shown by published examples to handle the highest dimensional multivariate problems are NONMEM-NP, Wang and Wang [6], and NPAG.

Benders Decomposition
For any set of grid points φ = (φ 1 , ..., φ m ) in Θ m , let λ =λ(φ) be the corresponding set of optimal weights given by the PDIP method. Then, the function F(φ) = l(λ(φ), φ) depends only on φ and can be maximized directly. For optimization methods, this technique is called Benders decomposition. The NPAG algorithm maximizes F(φ) by an adaptive search method. In a method proposed by James Burke, F(φ) is maximized by a Newton-type method. Since the function F(φ) is not necessarily differentiable, a relaxed Newton method must be used similar to what is described in the Appendix for the primal-dual algorithm. For details of Benders decomposition as applied to our problem, see Bell [10], Baek [11] and Jordan-Squire [22].
Founded on this prior work, the present study aimed to comprehensively describe, for the first time, the nonparametric adaptive grid algorithm (NPAG). This approach uses the exact log-likelihood to solve population modeling problems and does not make any assumptions about the shape of the PK parameter distributions. We illustrated the features and capabilities of this algorithm using a population PK modeling example. This algorithm presents the computational foundation of several hundred peer-reviewed papers, to date, and is ideally suited to optimize individual patient dosage regimens. The output of the NPAG algorithm becomes the input of the BestDose™ patient dosing software which is used at the bedside in real-time [23].

Pmetrics
The simulations and NPAG optimizations in this paper can be duplicated in R, using programs in the Pmetrics package [24]. R and Pmetrics are free software. R is available from many download sites. Pmetrics is available from lapk.org. NPAG is run using the NPrun() command in Pmetrics. Sample datasets and compartmental models are also available at lapk.org.

NPAG Subprograms
NPAG is a Fortran program consisting of a number of subroutines as described below. The main program performs the adaptive grid (AG) method (consisting of expansion and compression algorithms) and calls the primal-dual interior-point (PDIP) subprogram. The PDIP algorithm solves the maximization problem of Equation (4) for a fixed grid and is described precisely in the Appendix.
In NPAG, there are two types of grids: expanded and condensed. The expanded grids are the initial grid and the grids after grid expansion (Algorithm 2). The condensed grids are generated by grid condensation (Algorithm 3). Each cycle of NPAG begins with an expanded grid. The likelihood calculation is done on the condensed grids. Now for the adaptive grid method. Assume that Θ is a bounded Q-dimensional hyper-rectangle. Initially, we let φ 0 expanded = (φ 0 1 , ..., φ 0 M ) be the set of M Faure grid points in Θ (see [25][26][27]). Alternatively, we could initially let φ 0 expanded be generated by a uniform distribution on Θ or by a prior run of the program.
Remark. The Faure grid points for a hyper-rectangle Θ are a low-discrepancy set which in some sense optimally and uniformly covers Θ. In our implementation of NPAG, the Faure point sets come in discrete sizes which nest with each other. (Allowable number of points equals 2129, 5003, 10,007, 20,011, 40,009, 80,021, and multiples of 80,021.) This nesting property is useful for checking the optimality of F ML (see Section 4). We have found that replacing the initial Faure set with a set generated by a uniform distribution on Θ increases the time to convergence but results in the same optimal distribution. Now set G 0 expanded = (φ 0 ,λ(φ 0 )). Our approach is to generate a sequence of solutions G n to Equation (4) of increasingly greater likelihood, where unless otherwise specified, G n refers to the condensed grid at the n th cycle of the algorithm. If G n has log-likelihood negligibly different than G n−1 , then G n is considered the optimal solution to Equation (4) and is relabeled F ML . If not, then the process continues using the φ n as the new seed. This loop is repeated until F ML is found.
The stopping conditions for NPAG are defined precisely in Algorithm 1. If the stopping conditions are not met prior to a set maximum number of iterations, the program will exit after writing the last calculated G n into a file.
end if 10: for k in = 1 : length(newφ) do x ./y done component-wise 12: dist = min(dist, newdist) end for 14: if dist ≥ ∆ D then Check distance to new support point Check distance to new support point 26: 28: end for end for 30: The crux of the adaptive grid method is how to go from G 0 to G 1 or, more generally, from G n to G n+1 . The details of doing this are now explained roughly below and precisely in Algorithm 1.
Let Q be the dimension of Θ. Suppose at stage n we have a grid of high-probability support points φ n . We then add 2Q daughter points for each support point φ k ∈ φ n . The daughter points are the vertices of a small hyper-rectangle centered at each φ k with size proportional to the original size of the hyper-rectangle defining Θ. The size of this small hyper rectangle decreases as the accuracy of the estimates increases. (See Algorithm 2).

Grid Condensation (CONDENSE-Algorithm 3)
The above solution set G n+1 expanded may have many support points with low probability. We remove all support points which have corresponding probability less than (max λ)∆ λ , where λ is the vector of current probabilities and the default for ∆ λ is 10 −3 . (Note that the remaining probabilities are not normalized at this point.) The probabilities of the remaining support points are normalized by a second call to the PDIP subprogram. This second call to PDIP is fast. The likelihood associated with these remaining support points and normalized probabilities is then used to update the program control parameters and check for convergence (Algorithm 1 and Section 2.7). If convergence is attained, then the output of this second call to PDIP provides the support points and probabilities of the final solution. If convergence is not attained, then the remaining support points are sent to the grid expansion subprogram (Algorithm 2), initializing the next cycle.
At the end of the program, the output of this second call to PDIP provides the location and weights of the final solution.

PDIP Subprogram-See Appendix A
The PDIP subprogram finds the optimal solution to Equation (4) with respect to λ for fixed φ. PDIP employs a primal-dual interior-point method that uses a relaxed Newton method to solve the corresponding Karush-Kuhn-Tucker equations. (See Equations (14)- (17) of Appendix A).

NPAG Stopping Conditions
As explained above, a potential solution to F ML is not accepted as a global optimum until successive sequences of G n produce final distributions evaluating to sufficiently close log-likelihood. The various upper and lower bounds ∆ for NPAG control and stopping conditions are defined below and are used in Algorithms 1-3.
∆ L Primary upper bound on the allowable difference between two successive estimated log-likelihoods; the default initialization is 10 −4 . ∆ F Secondary upper bound on the allowable difference between two successive estimated log-likelihoods of potential F ML ; the default initialization is 10 −2 . ∆ e Sets an upper bound on the accuracy variable eps of Algorithm 1. The default initialization for ∆ e is 10 −4 . The default initialization for eps is 0.2 and is stepped down until eps ≤ ∆ e ∆ F and ∆ e define the two stopping conditions for Algorithm 1. ∆ D Sets a lower bound on how close two support points can get; the default initialization is 10 −4 . ∆ λ Sets a lower bound factor on the probabilities of the weights λ; the default initialization is 10 −3 .

Calculation of p(Y i | φ k )
Given observations Y i , i = 1, ..., N and grid points φ k , k = 1, ..., K, the PDIP subprogram only depends on the N × K matrix {p(Y i |φ k )}. NPAG can be used for any problem once this matrix is defined. However, the default setting of NPAG is for the problem of population pharmacokinetics. For a good background of population pharmacokinetics, see Davidian and Giltinan [28,29].
In population pharmacokinetics, generally Y i = (y i,1 , ..., y i,M ) is a matrix of vector observations for the ith subject. Since NPAG allows multiple outputs, each y i,m is itself a q-dimensional vector y i,m =(y i,m,1 , · · · , y i,m,q ). The observations y i,m,j , are then typically given by a regression equation of the form: ν i,m,j ∼ N(0, (σ i,m,j (θ i )) 2 ) θ i are unobserved parameters specific for Y i In the above Equation (5), f i,m,j is a known nonlinear function depending on the model structure, the dosage regimen, the sampling schedule, all covariates and of course the subject-specific parameter vector θ i . Except for simple models, f i,m,j requires the solution of (possibly nonlinear) ordinary differential equations.
In the current implementation of NPAG, it is assumed that the (y i,1 , ..., y i,M ) are independent. Then where f i,m = ( f i,m,1 , ..., f i,m,q ) and Σ i,m = diag(σ 2 i,m,1 , ..., σ 2 i,m,q ). For the purposes of matrix multiplication in Equation (6) ,we think of y i,m and f i,m as q-dimensional row vectors.
To complete the description of Equation (6) we need to model the standard deviation terms σ i,m,j of the assay noise. In our implementation of NPAG, four different models are allowed. Let and set The parameter γ in Equation (8) is a variance factor. Artificially increasing the variance during the first several cycles of NPAG increases the likelihood for each φ, allowing the algorithm to use these cycles to find a better initial state from which to begin optimization. NPAG also has an option to "optimize" γ. This changes NPAG from a nonparametric method to a "semiparametric" method and will not be discussed here. The interested reader can consult [12].
Next, if c 0 = 0 in Equation (7), then α i,m,j can become small for certain values of φ that in early iterations can be far from optimal. This, in turn, causes numerical problems as the likelihood is infinite if σ i,m,j = 0. One way to avoid this problem is to take σ i,m,j = constant. Another way would be to assume that α i,m,j is known and is given by That is, to approximate σ by using a polynomial of the observed values rather than model predicted values. In our experience with NPAG, the approximation of Equation (9) is useful for ensuring computational stability (especially during the early cycles of the algorithm). However, from a theoretical perspective, this change violates the conditions of maximum likelihood and will not be discussed here. Again, the interested reader can consult [12].

Convergence
For a given initial grid φ 0 , the NPAG algorithm is only guaranteed to find a local maximum of L(F) . More precisely, if φ * is the final grid of NPAG starting from φ 0 , then λ(φ * ) is a global maximum on φ * but the support points φ * may be only a local maximum.
Global convergence of a nonparameteric maximum likelihood method for estimation of a multivariate mixing distribution is difficult. For one-dimensional distributions the problem is straightforward. The idea of proof goes back to at least Fedorov [13] in 1972, which involves the use of directional derivatives.
Let F be any distribution on Θ. Then, the directional derivative of log L(F) in the direction of the Dirac distribution δ θ supported at θ is defined by . Let F k be the current NPML estimate at iteration k. The Fedorov method involves maximizing D(θ, F k ) for θ ∈ Θ, at every iteration. Then, the point at which the maximum occurs is added in an optimal way to F k to give F k+1 . Under the assumptions of regularity, Fedorov shows that L(F k ) converges to L(F ML ), see Fedorov [13], (Theorem 2.5.3). Many improvements to this method have been made. In Lesperance and Kalbfleisch [15] and Wang and Wang [6], instead of just adding the point at which D(θ, F k ) occurs, all the points where local maxima occur are added in an optimal way. Again, under the assumptions of regularity, convergence as above is proved. In one dimension, these methods are efficient. In higher dimensions, these methods are not computationally practical.
We now suggest a method to check whether the final distribution of NPAG is globally optimal and, if not optimal, how close it is to the optimal. It also involves the use of the directional derivative D(θ, F), but only at the last iteration of NPAG. Now define Note that the max in the above expression is only over Θ and not over Θ N . It is proved in Lindsay [7] that F * is a global maximum of L(F), i.e., F * =F ML , if and only if D(F * ) = 0. Even if D(F * ) = 0, it is useful to make this computation as it is also proved in Lindsay [7] that L(F ML ) − L(F * ) ≤ D(F * ), so this last expression gives an estimate of the accuracy of the final NPAG result. Now, even though we said above it is not practical to calculate D(F) at every iteration of an algorithm, we are just suggesting to make this calculation at the end of the algorithm. This calculation can be performed by a deterministic or stochastic optimization algorithm.

Examples
First of all, the NPAG program has been used successfully in high-dimensional and very complex pharmacokinetic-pharmacodynamic models. In Ramos-Martin et al. [30], the NPAG program was used for a population model of the pharmacodynamics of vancomycin for coagulase-negative staphylococci (CoNS) infection in neonates. Vancomycin is an antibiotic used to treat a number of serious bacterial infections. CoNS are the most commonly isolated pathogens in the neonatal intensive care unit. This model had 7 nonlinear differential equations and 11 random parameters. The population was a combination of 300 experimental and animal subjects. In Drusano et al. [31], the NPAG program was used for a population model of two drugs for the treatment of tuberculosis. This model had 5 nonlinear differential equations, 3 nonlinear algebraic equations, 1671 observations from 6 outputs and 29 random parameters. In the algebraic equations, the state variables were only defined implicitly and had to be solved for by an iterative method.
The above two examples are too complex to use for simulation purposes. Consequently, we present here a simpler model which has an analytic solution and which can be checked by other algorithms. Nevertheless, the estimation of parameters in this model is not trivial. We consider a three-compartment PK model with a continuous IV infusion into the central compartment and a bolus input into the absorption compartment. The individual subject model is described by the following differential equations: and output equation The inputs are a bolus b = 2000 mg at t = 5 Hr and a continuous infusion r(t) = 500 Hr −1 , for 0 < t < 16 Hr. This model has 5 random parameters (V, K a , K e , K cp , K pc ). A diagram of this model is given in Figure A1. It is known that this model is structurally identifiable, see Godfrey [32]. However, we have found that for a continuous IV infusion, the parameters K cp and K pc can be difficult to estimate in a noisy environment.
The details of the simulation are as follows. There were 300 simulated subjects. The random variables (V, K a , K cp , K pc ) were independently simulated from normal distributions with means respectively equal to (1.2, 0.8, 2.0, 0.2) and standard deviations equal to 25% coefficient of variation.
The random variable K e was independently simulated from a bimodal mixture of two normal distributions with means respectively equal to 0.5 and 1.5, with standard deviations equal to 10% coefficient of variation, and with weights equal to 0.2 and 0.8. This distribution would apply to an elimination rate constant with a bimodal distribution where 80% of the subjects have a mean of 1.5, and only 20% have a mean of 0.5. The power of the nonparametric method allows the detection of the 20% group. These sampling times were chosen in an ad hoc fashion and are not to be considered optimal. In Figure A2, we show the profiles of the 300 noisy model outputs y 1 . These profiles are plotted as piecewise linear functions with nodes at the observation times.
The initial Faure set had 80, 321 support points dispersed in the volume The assay error (Equation (8)) is not always known. An approximate assay error polynomial can be inferred from literature and Pmetrics includes a routine to estimate the assay error polynomial from the data. Another approach to analyzing data with unknown measurement error is to run successive NPAG optimizations using decreasing error magnitude for each new run. An advantage of this approach is that model development is faster. The first cycle of NPAG, which begins with a relatively large measurement error, can be initialized using a relatively small number of support points. Each NPAG solution is used as a prior to skip the first (and most computationally burdensome) step in the next NPAG run. We demonstrate this approach can converge to the correct solution on this simulated data.
Convergence for this problem was accomplished after applying NPAG four times. For the first application, σ = 0.025Y simulated . The output distribution of this first application was used as a prior to start NPAG again, this time with σ = 7.96 + 0.0125Y simulated . The output of this second application was used as a prior to start a third NPAG run, this time with σ = 7.96 + 0.0065Y simulated . Finally, the output density of this third run was used as a prior to run NPAG a fourth time, this time with σ = 5.5, the same as for the simulation. The step down in assumed observation error happened at convergence for each previous error levels: at cycles 4513, 5972, and 6791. Final convergence happened at cycle 8012. There are 284 support points in the final density.
The simulated and estimated marginal distributions are shown in Figures A3 and A4. It is seen that the estimated marginal distributions are similar to the simulated histograms. In particular, the bimodal shape of K e was uncovered. Similarity is tested using the R routine mtsknn.eq(..., k = 3), which returns a pval = 0.5809. mtsknn.eq applies a K−nearest neighbors approach to estimate the probability that two non-parametric distributions arise from the same distribution.
NPAG is designed to estimate the whole joint distribution of the parameters. As mentioned earlier, the estimate F ML is especially important for our application to population pharmacokinetics where F ML is used as a prior distribution for Bayesian dosage regimen design. However, F ML is a consistent estimator of the true mixing distribution and consequently, the moments of F ML should be consitent estimators of the true moments. Means and variances of parameter estimates for F ML can be easily obtained by integrating the corresponding marginal distributions. So as a check of this fact, in Table A1, the comparisons of estimated versus simulated means and variances are shown. Again, results are quite accurate (see Tables A1 and A2).
Finally, in Figure A5 we include a graph of Predicted versus Observed values which shows the all around good fit of the data. The predicted (right panel: Predicted Bayesian) values are gotten as follows: For each subject, the Bayesian mean estimate of the parameters are found using the final NPAG distribution as a prior and that subject's observations. Then, based on these parameter means, the subject's concentration profile is calculated. The predicted (left panel: Predicted Population) is the weighted average of the evaluation of all support points for subject model.

Conclusions
We have provided the first comprehensive description of the NPAG algorithm for estimating multivariate mixing distributions. This algorithm can describe between subject variability without assuming any shape for the distribution of PK parameters and is excellently suited for optimizing patient dosing [33][34][35]; NPAG is based on an iterative algorithm employing the Primal-Dual Interior-Point method and an Adaptive Grid method. This approach can handle pharmacokinetic, pharmacodynamic and other models with a high number of estimated model parameters and is based on the exact log-likelihood. A detailed description of NPAG is provided along with an application for a common two-compartment PK models. Finally, the NPAG algorithm is arguably the most efficiently parallelizable population modeling algorithm, since it can parallelize over both subjects and support points. This allows one to readily implement this algorithm on supercomputers as our laboratory has done on research projects. In addition to population pharmacokinetics, this research also applies to empirical Bayes estimation, see Koenker and Mizera [36] and to many other areas of applied mathematics, see Banks et al. [37]. Overall, the NPAG algorithm provides an important addition to the pharmacometric toolbox for drug development and optimal patient dosing at the interface of applied mathematics, biomedical scientists, and clinicians.

Recent References Using Npag
As a means for readers to more quickly assess the applicability of NPAG in a wide variety of pharmacometric studies, we refer here to the first 10 references on a recent PubMed search for papers that use NPAG . Note that all population PK studies using Pmetrics employ NPAG.

Recent Studies to Which NPAG Can Be Applied
The references below use excellent parametric methods. All of these methods have the same mathematical structure as our nonparametric NPAG. The main difference is that in the parametric methods, it is assumed that the population distribution is multivariate normal with unknown mean vector and unknown covariance matrix. In NPAG, we do not make any parametric assumptions about the population distribution, as discussed in our paper. Otherwise, the population analysis problem is the same. Bulitta et al. [38] and Shah et al. [39] use the method S-ADAPT. S-ADAPT is based on the ADAPT package developed by D'Argenio, Wang and Schumitzky. Ishihara et al. [40] and Soraluce et al. [41] use the industry standard parametric version of NONMEM. Allard et al. [42] uses the stochastic approximation expectation maximization method (SAEM). NPAG can run any problem that is run by any of these programs.

Comparison of NPAG to Classical Population Analysis Programs
Several studies have compared NPAG to a parametric algorithm. Bustad et al. [43] compared the statistical consistency and efficiency of ITS and two older NONMEM ® routines, FO and FOCE to that of NPEM and NPAG on simulated datasets; the nonparametric methods were more consistent and efficient. Prémaud et al. [44] also compared NPAG to NONMEM ® FOCE, but the two algorithms converged to distinctly different results. NPAG converged to an estimator with better predictive performance allowing for its use in therapeutic drug monitoring. More recently, de Velde et al. [45] compared NONMEM ® FOCE-I to NPAG, with both methods converging to similar parameter estimates. In each of the above studies, NPAG converged to a distribution with greater variance. subprogram. Pmetrics is an R package for nonparametric and parametric population modeling and simulation and is available at www.lapk.org, see Neely et al. [24].

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: AG Adaptive grid CoNS Coagulase-negative Staphylococci EM Expectation-maximization algorithm ISDM Intra-simplex direction method NPAG Nonparametric adaptive grid algorithm NPML Nonparametric maximum likelihood PDIP Primal-dual interior-point method QP Quadratic programming

Appendix A. A Primal-Dual Interior-Point Algorithm (PDIP)
To make this paper self-contained, we outline here the PDIP algorithm which was written by James Burke. This algorithm is a FORTRAN subroutine of NPAG. The description below is based on the Matlab and C++ codes found in Bradley Bell's website, see [10]. Definition of general terms and theorems can be found in Boyd and Vandenberghe [9].

Appendix A.1. Duality Theory and the Basic Problem
Given a set of support points {φ k }, the problem of finding the optimal weights {λ k } in Equation (4) can be posed as the following optimization problem P min Φ(Ψλ) s.t. 0 ≤ λ, e λ = 1, where Ψ ∈ R n×m is the matrix whose (i, j) entry is p(y i | φ j ) and where in general, the function Φ : R k → R ∪ {+∞} is given by The symbol e is always to be interpreted as the vector of all ones of the appropriate dimension.
The problem P is a convex programming problem since the objective function Φ is convex and the constraining region is a convex set. The Fenchel-Rockafellar dual of the convex program P is the problem From Boyd, we obtain the following Karush-Kuhn-Tucker (KKT) equations relating the solutions to the problem P and D.
where for any vector x, we define X to be the diagonal matrix having x along the diagonal.

Appendix A.2. An Interior-Point Path-Following Algorithm
The relaxed KKT is given by for µ > 0. (µ is the relaxation parameter.) A damped Newton's method is used to solve the above system.
Consider the function F : R 2m+n → R 2m+n given by and 0 ≤ λ, 0 ≤ ω, and 0 ≤ y. Path-following algorithms attempt to solve (A9) by applying Newton's method for progressively smaller values of the relaxation parameter µ. We first need the derivative of F. It follows At the kth iteration of the algorithm, the Newton step is given by the solution to the nonsingular linear system where y is constrained to satisfy the first KKT condition y k = e m − Ψ w k . The above set of equations can be reduced by standard techniques. It follows: where H = D 2 − ΨD 1 Ψ , D 2 = ZW −1 , D 1 = ΛY −1 , r 1 = µY −1 e, r 2 = W −1 e − Ψr 1 where the superscript k is suppressed for simplicity.

. Exit Conditions
Iterate Equations (A11)-(A13) until µ ≤ ε and ρ ≤ ε and γ ≤ ε. If these conditions are not satisfied after a set number of iterations, then write "PDIP did not converge in the given number of iterations".    Figure A3. Kernel densities of simulated PK parameters. All K are in Hr −1 , Volume is in dL. Kernel density (dashed line) of the distribution is over-plotted on the histogram (grey), which are normalized to the total observations. Simulated support points are generated in Pmetrics using the function SIMrun(). Details are in the text. Kernel densities are calculated in R using the (S3) generic function density().  Figure A4. Kernel densities of NPAG estimated PK parameters (black line) vs. Simulated r.v.s (dashed with grey fill). K are in Hr −1 , volume is in dL. NPAG estimation is calculated using the NPrun() function in Pmetrics. Similarity of these two distributions is verified in R using the function mtsknn.eq(). See text for further detail.