Approximate methods for maximum likelihood estimation of multivariate nonlinear mixed-effects models

Multivariate nonlinear mixed-effects models (MNLMM) have seen an increasing use due to its flexibility for analyzing multi-outcome longitudinal data following nonlinear profiles. In this work, I present and compare five different numerical algorithms for maximum likelihood estimation of the MNLMM. These algorithmic schemes include the penalized nonlinear least squares coupled with multivariate linear mixed-effects (PNLSMLME) approximation, Laplacian approximation, pseudo-data ECM algorithm, Monte Carlo EM algorithm, and importance sampling EM algorithm. When estimating the MNLMM, it is rather difficult to exactly evaluate the observed log-likelihood function in a closed-form expression because it involves evaluating a multiple integral. Therefore, the corresponding approximations of the observed log-likelihood function under the five algorithms are presented. A comparison of their computational performances is investigated through simulation and real data from an AIDS clinical study.


Introduction
Analysis of multi-outcome longitudinal data with various features has attracted considerable interest in clinical trials, biological psychology, environmental science and medical research, to name a few.The methodology of multivariate linear mixed-effects models (MLMM) [1] and multivariate nonlinear mixed-effects models (MNLMM) [2] has been developed for related work.A comprehensive study of the MLMM along with its applications can be found in [3][4][5][6][7], among others.Nonlinear models for repeated-measures data rest on more complicated mathematical derivations and heavier computational requirements than linear models, but they can offer flexibility in capturing a broader range of data patterns.Several approaches to carrying out maximum likelihood (ML) estimation of nonlinear mixed-effects models (NLMM) for single-outcome longitudinal data have been studied; see, for example, [8][9][10][11][12].Bayesian inference in NLMM via Markov chain Monte Carlo (MCMC) procedures can be found, for instance, in [13][14][15].Although the use of the NLMM, as well as its extensions in other families of distributions have been pretty well established in the literature, to the best of our knowledge, exploration of the inference on MNLMM is relatively rare so far.Analyzing each response variable of the data by fitting the NLMM separately might be inappropriate and fail to take account of the between-variable association, as well as its evolution.
For the general NLMM, the linearization method [8,16] that exploits a first-order Taylor expansion to approximate the nonlinear function in terms of a linear pseudo-data model is by far the most widely-used approach due to its numerical simplicity.Despite its popularity, [17] argued that the linearization method may produce substantial bias in parameter estimation, as the number of observations per subject is too small, and the variability of random effects tends to be large at the same time.Although computationally much simpler, the Laplace approximation method [10] can also lead to considerably-biased parameter estimates, depending on the quality of the mode.As an alternative to the pseudo-data and Laplace approximation approaches, the integral approximation methods that use Monte Carlo integration [18] or importance sampling [19] to approximate the observed likelihood may provide more accurate estimates than the linearization method.However, the numerical integration methods are generally inefficient to implement and become computationally prohibitive when the dimension of random effects increases [20].Over the past few decades, several estimation algorithms for NLMM have been developed and implemented in different software.For example, the linearization methods using the first-order Taylor expansion [21] or the first-order conditional estimation (FOCE) [8,16] are embedded in R function nlme, while the Laplace approximation method is implemented in NONMEM [22] and the SAS macro NLIMIX [23].A new SAS macro NLMIXED incorporating adaptive Gaussian quadrature has shown considerable improvement [24].The other improved procedure based on the stochastic approximation expectation maximization [25] was implemented in MONOLIX [26], NONMEM [27] and R package saemix [28].Multivariate nonlinear mixed-effects models can be fitted using ad hoc manipulation by expanding the design matrix with extra columns of dummy covariates flagging each element of the original multivariate responses.
Consider the multiple repeated measures {(Y i , X i ), i = 1, . . ., N }, where Y i is a s i × r response matrix composed of r response vectors y ij = (y ij,1 , . . ., y ij,s i ) T , j = 1, . . ., r, and X i is the covariate matrix for the i-th subject.Let E i = [e i1 : e i2 : • • • : e ir ] be the s i × r matrix of within-subject errors associated with Y i , where e ij = (e ij,1 , . . ., e ij,s i ).Let y i = vec(Y i ) and e i = vec(E i ) denote the stacked s i r × 1 vectors of all responses and within-subject errors, respectively.
In general, the MNLMM takes the form of: where µ i = µ i (η i , X i ) is a nonlinearly-differentiable function of a subject-specific parameter η i governing the within-profile behaviors and e i is a vector containing normally-distributed error components.Moreover, the fixed effects β and the random effects b i can be incorporated into the model by letting: where A i and B i are design matrices of size s × p and s × q, respectively.We assume that b i follows a multivariate normal distribution with mean vector 0 and q × q variance-covariance matrix D, denoted by b i ∼ N q (0, D), and independent of e i ∼ N s i r (0, R i ).The joint distributions of (b T i , e T i ) T for distinct subjects are independent.To reduce the number of parameters in R i , we assume that the k-th row of E i , say e i•k , follows N r (0, Σ), and the j-th column of E i , say This specification implies that within-subject errors for all responses measured at the same occasion have variance-covariance Σ.To capture the extra autocorrelation of a given response among irregularly-observed occasions, some parsimonious dependence structures can be made on C i , such as the compound symmetry, the p-order autoregressive model [29,30] and the damped exponential correlation [31].For simplicity, we write C i = C i (φ), which depends on subject i according to its dimension s i with each entry being a function of a small set of parameters φ describing within-subject autocorrelation.
Let θ = (β, D, Σ, φ) be the entire model parameters.According to Model Equation (1) with Assumption Equation (2), the marginal density of y i is: where φ d (•|µ, Ω) denotes the probability density function (pdf) of a d-variate normal distribution with mean vector µ and variance-covariance matrix Ω.Typically, this integral cannot yield a closed-form expression when the vector-valued function µ i = µ i (η i , X i ) is nonlinear in random effects b i .Thus, the log-likelihood function of θ for y = {y 1 , . . ., y N } is given by: The purpose of this article is to consider five different methods for carrying out ML estimation of the MNLMM described in Equation (1) along with Equation (2) and for approximating the observed log-likelihood Function Equation (4).The methods include the penalized nonlinear least squares coupled to multivariate linear mixed effects (PNLS-MLME) approximation [8], Laplacian approximation [32], a pseudo-data version of the expectation conditional maximization (ECM) algorithm [33], the Monte Carlo EM (MCEM) algorithm [34] and the importance sampling EM (ISEM) algorithm [35].The approximation to the observed log-likelihood is based on the standard Taylor expansion and is easy to calculate within the algorithms.A simple way of computing standard errors of parameters via the information-based method is provided.
The article is organized as follows.In Section 2, we describe the five computational procedures for ML estimation of the MNLMM together with the calculation of standard errors of parameters.In Section 3, the proposed methodology is illustrated with the analysis of HIV-AIDS data.Section 4 presents a comparison of the five approximation methods through simulation studies.We summarize and discuss implications in Section 5.The technical derivations are collected in the Appendix.

Five Approximate ML Procedures
From Model Equation (1), the j-th column (outcome) of Y i , say y ij = (y ij,1 , . . ., y ij,s i ) T , can be formulated as: where T and e ij = (e ij,1 , . . ., e ij,s i ) T .Analogously, the model for the k-th row (occasion) can be expressed as: where T and e i,k = (e i1,k , . . ., e ir,k ) T .We present five algorithms for employing ML estimation of Model Equation (1).The approximation to the observed log-likelihood Function Equation (4) and the calculation of standard errors of parameters are discussed, as well.

PNLS-MLME Procedure
Following the linear mixed-effects (LME) approximation method suggested by [8], the first procedure consists of two steps: a penalized nonlinear least squares (PNLS) step and a multivariate LME (MLME) step.The basic idea behind this procedure is that we estimate the unobservable random effects b i via the PNLS step and then update the ML estimates of parameters θ based on the formulation of MLMM for the pseudo-data.Specifically, the proposed PNLS-MLME procedure is sketched below.
In the PNLS step, first define: where , is a function of fixed effects β and random effects b i .Fixing the current estimates of parameters ), the conditional modes of random effects b i are obtained through minimizing a penalized nonlinear least-squares objective function: The joint distributions (b T i , e T i ) T for distinct subjects are independent, and thus, all y i are independent of each other.In practice, solving over b(h) i for each subject can be implemented by minimizing g(y i , b i , θ(h) ) with respect to q-dimensional random effects of one subject at a time, rather than finding the solutions with respect to those of all subjects simultaneously.
In the MLME step, which allows updating the parameter estimates, we utilize the first-order Taylor expansion of Model Equation (1) around the current estimates η(h where μj , j = 1, . . ., r, are the first partial derivatives of µ j with respect to η i and β and b i are replaced by β(h) and { b(h) i } N i=1 , respectively.Denote the pseudo-data by: where xijk = μj (η Consequently, Model Equation (1) can be rewritten as: The model for the super vector of the pseudo-data for the i-th subject is: where ỹi is a s i r × 1 vector composed of r pseudo-response vectors ỹij = (ỹ ij,1 , • • • , ỹij,s i ) T , Xi is a s i r × p matrix with rows made up of p × 1 vector xijk and Zi is a s i r × q matrix with rows made up of q ×1 vector zijk .Obviously, Model Equation ( 8) for the pseudo-data is shown in an LME representation, so the estimation procedure becomes much simpler.Therefore, the log-likelihood function of θ according to Model Equation ( 8) can be approximated by: In the MLME step, we update β(h) by a generalized least-squares approach, which yields: Denote the half-vectorization operator by vech(•), which represents a column vector obtained by vectorizing only the lower triangular entries of a symmetric matrix.Given the current estimate ) by the Newton-Raphson method: where ŝ(h+1/2) α and Ĥ(h+1/2) αα are the score vector s α and Hessian matrix H αα evaluated at β = β(h+1) and α = α(h) .Explicit expressions for elements in s α and H αα are given in Appendix.
Iterations of Equations ( 6), (10) and (11) continue until either the maximum number of iterations or the user-specified convergence tolerance has been achieved.

Laplacian Procedure
From Function Equation (3) and Definition Equation (5), we have the joint density of (y i , b i ), denoted by f (y i , b i |θ) = φ s i r y i |µ i , R i φ q (b i |0, D), and the marginal density of y i , given by: Laplacian approximation [32,36] is an alternative technique to estimate the marginal densities or posterior predictive densities, which involve integrating out all non-target variables.We next discuss how to adopt the Laplacian approximation to evaluate Equation (12) and develop the corresponding estimation algorithm.
Set an initial guess of random effects b i to be: Consider the second-order Taylor expansion of g(y i , b i , θ) around bi .It yields: because ġ(y i , bi , θ) = 0, where the first two partial derivatives of g(y i , b i , θ) with respect to b i are: and: respectively.Notice that the contribution of the term involving the second derivative of µ i in g(y i , b i , θ) is usually negligible compared to that involving the product of the first derivative of µ i at b i = bi [37].We hereby define: Consequently, the Laplacian approximation to log-likelihood Equation ( 4) is: with regard to ML estimation of θ, we can treat it as an optimization problem based on LA (θ|y).Subsequently, we estimate D by taking the first partial derivative of Equation ( 15) with respect to D −1 and setting it to zero, yielding: By maximizing Equation ( 15), the estimates of β, Σ and φ react with one another, and thus, we perform an iterative algorithm that proceeds as follows.Given D and the current estimates β(h) and φ(h) , we update the diagonal elements in Σ(h) by: and the off-diagonal elements by: Unfortunately, equating the first partial derivatives of Equation ( 15) with respect to β and φ, respectively, to zero cannot deduce the updated estimators in closed form.Therefore, we use the nlminb routine [38] to perform a numerical search for updating β(h) and φ(h) sequentially.Specifically, and:

Pseudo-ECM Algorithm
According to the pseudo-data model specified in Equation ( 8), treating the random effects {b i } N i=1 as latent data, we establish a complete-data framework of the model: Given the pseudo-complete data ỹ = {ỹ i } N i=1 and b = {b i } N i=1 , the log-likelihood function of θ is: To carry out ML estimation for the MNLMM, we develop an ECM algorithm [33], which is a variant of EM [39], replacing its M steps by several computationally-simpler conditional maximization (CM) steps.It has several appealing features, such as stability of monotone convergence and simplicity of implementation.Hereafter, the procedure is referred to as the pseudo-ECM algorithm, because it is developed under the pseudo-data defined in Equation (7).The proposed implementation approach is outlined below.

E step:
Evaluate the expected complete-data log-likelihood Function Equation ( 16) conditioning on the current estimates θ(h) and the pseudo-responses ỹ = ỹ( ), which linearize the regression function around the previous estimates of mixed effects ( ) and should be updated at each iteration.This gives rise to the so-called Q-function: where: ), and ẽ(h and φ(h) by maximizing the Q-function Equation (17).We obtain: where ω(h+1/2) ijl is an s i × s i matrix consisting of the ((j − 1)s i + 1)-th to the (js i )-th columns and rows of Ω(h) i in which β and D have been replaced by their updated estimates at the h+1 iteration.Besides, , we implement the pseudo-ECM algorithm until the user's specified convergence criterion satisfies.Analogous to the PNLS-MLME method, this algorithm is established under the pseudo-data scenario.Hence, the resulting approximate log-likelihood value can be obtained by using Equation (9).

Monte Carlo EM Algorithm
We offer a Monte Carlo (MC) version of the EM algorithm [40] for ML estimation of Model Equation (1) and evaluate the observed log-likelihood Equation (4) via the MC integration.The MCEM is a modification of the EM algorithm in which the E step is computed numerically through a large number of simulated samples.
Given the complete data (y, b), the log-likelihood function of θ for the MNLMM can be expressed as: In the E step, we compute the expectation of complete data log-likelihood Function Equation ( 18) to yield the Q-function: Obviously, Equation ( 19) cannot be written in closed form, since the conditional distribution of b i given y i : has no standard form.
To simulate random samples from Equation ( 20), we perform the Metropolis-Hastings (M-H) algorithm [41] with the proposal distribution: where ) is the inverse matrix of G(y i , θ) given in Equation ( 13) and evaluated at θ = θ(h) .Note that the idea of considering such a proposal distribution comes from the integration of Equation ( 14) over b i , which is, up to a multiplicative constant, approximately equal to a N ( bi , G −1 y i , θ ).We have the probability min 1, P (b m=1 , the random effects b i , as well as their function f (b i ) in Equation ( 19) can be estimated by b(h i )/M , respectively, at each iteration.In the M step, we find the limited value of the obtained Q-function Equation ( 19) by equating the following functions: and: to zeros, where α = {β, Σ, φ}.By allowing differentiation under the integral sign for Equation ( 22), we update the estimate of D by: Since solving Equation ( 23) is analytically intractable, we perform a profile approximate Q-function approach, which updates βh , Σ(h) and φ(h) by a sequential optimization procedure as the Laplacian method described in Section 2.2.It gives: and: Consequently, the marginal log-likelihood can be approximated as: i , θ).
According to an alternative hierarchy of the MNLMM, the MCEM algorithm that deals with Monte Carlo integration directly on the individual parameters η i rather than subject-specific random effects b i can yield an explicit estimator for the fixed effects β.
However, such an implementation may not be feasible in the framework of MNLMMs due to the possible singularity of B i DB T i .

Importance Sampling EM Algorithm
Importance sampling (IS) is an alternative way of performing MC integration.We provide an ISEM algorithm, which modifies MC approximation of Equation (19) in the E step of the MCEM algorithm by using the IS method.To implement the ISEM algorithm, we first choose an appropriate envelope distribution from which the samples are simulated and the importance weights calculated.Like that used in the M-H algorithm, Equation ( 21) is a natural consideration for the envelop distribution.As suggested by [35], an envelop distribution could be a mixture of two multivariate normal distributions with pdf: where the mixing proportion 0 ≤ P 0 ≤ 1 is a pre-specified value.Notably, ISEM can be performed to evaluate the expected values of any functions of unobservable Having obtained a sufficient number of random effects, denoted by {b . ., M , we adopt the ratio of two MC approximations using IS from Equation (27) to estimate Equation ( 28), given by: In the E step, given the current estimates of parameters θ(h) , we compute Equation (19) in which the required conditional moments of latent data b can be approximated based on Equation (29).In the M step, we update each entry of θ(h) by maximizing the Q-function.Indeed, the ISEM procedure works conceptually similarly to that of MCEM: only D(h+1) shows an explicit solution, while β(h+1) , Σ(h+1) and φ(h+1) are obtained through sequential optimization solutions via Equations ( 24)- (26).The IS approximation to the marginal log-likelihood is:

Initialization
When implementing iterative procedures, a common difficulty encountered in practice is that the algorithm is painfully slow or even non-convergent.Such a computational problem may occur in handling ML estimation of the MNLMM, especially when the data are too sparse or the dimension of random effects is over-specified.To overcome this potential problem, a default procedure of automatically creating a set of good initial values is summarized below.
(i) A direct way of obtaining the initial value for β is to fit the NLMMs to each outcome variable separately by using the nlme R package [12].(ii) Using the fitting results of NLMMs for each outcome, we take the initial value D(0) as a (block) diagonal form with the diagonal entry being the variances (covariances) of random effects under the fitted NLMMs.(iii) For the initial value for Σ, we use the sample variance-covariance matrix of the data.That is, take , where The initial values for φ, depending on the structure, are simply chosen to give a condition of nearly uncorrelated errors.

Application: ACTG 315 Data
We present a comparison of the five algorithms via a real data example from the AIDS Clinical Trial Group protocol 315 (ACTG 315) study developed by the Immunology Research Agenda Committee of the U.S. National Institute of Allergy and Infectious Disease, the ACTG sponsor.The study design and recruitment of participants (patients) were conducted by University Hospitals of Cleveland, Rush-Presbyterian-St. Luke's Medical Center and University of Colorado Health Science Center.In the study, 53 human immunodeficiency virus type 1 (HIV-1)-infected patients were recruited, and their plasma HIV-1 RNA (viral load) copies and CD4 + T cell counts were repeatedly measured at Days 0, 2, 7, 10, 14, 28, 56, 84, 168 and 196 after the start of treatment.A more detailed description of the study can be found in [42,43].
HIV-1 infection is associated with progressive and profound loss of immune function that places infected persons at enhanced risk for opportunistic infections, and even death.A reaction in HIV-1-related immune deficiency can be characterized by decreases in the numbers of circulating CD4 + T helper lymphocytes.CD4 + T cells in blood decline to a lower level after HIV-1 infection and may recover to a high level after antiviral therapies suppress viral load.Generally, there is a negative correlation between the virologic marker (measured by HIV-1 RNA) and the immunologic marker (measured by CD4 + T cells) during antiviral treatments.As a consequence, a joint analysis of HIV-1 RNA and CD4 + counts is helpful to take the evolution of the correlation among responses over time into account.The data have been analyzed by [44][45][46][47] using different modeling approaches.
As a part of the clinical trial on 53 patients, a total of 48 patients were recruited in our analysis after excluding four early drop-out patients and one due to a plasma HIV-1 RNA pattern that suggested intermittent adherence to study therapy.To stabilize the variances and to reduce the strong skewness among the two makers, a base-10 logarithmic transformation is made for HIV-1 RNA and a square-root transformation for CD4 + T cells.Both transformations are widely used in HIV-AIDS clinical trials.Let y i1,k and y i2,k be log 10 RNA and CD4 0.5 markers, respectively, at the k-th time point for patient i.We consider the following bivariate nonlinear mixed-effects model for y i1,k and y i2,k : where t ik = day ik /7 is the k-th visited time point (week) for patient i; rna i is the log 10 RNA levels for patient i at the start of the study; (b i1 , b i2 ) are the bivariate normally-distributed random effects; and (e T i1 , e T i2 ) = (e i1,1 , . . ., e i1,s i , e i2,1 , . . ., e i2,s i ) are the within-subject errors following a multivariate normal distribution with zero mean and variance-covariance matrix Σ ⊗ C i .Because the baseline RNA is a significant covariate in the ACTG 315 study [47], it should be incorporated into the analysis.To account for the extra autocorrelation caused by within-patient dependence among unequally-spaced occasions, we employ a continuous order-one autoregressive structure, i.e., C i = [φ |t ik −t ik | ], for the across-occasion covariance matrix of within-subject errors.
According to the standard formulation in Equation ( 2), we specify: and b i = (b i1 , b i2 ) T , where I d is a diagonal matrix of order d.Define: (10), and where , η 5 = β 5 and η 6 = β 6 .The first derivatives of µ 1 and µ 2 specified in Equation ( 31) with respect to η are: The first derivative of mean function µ i (β, b i ) = (µ 1 , µ 2 ) with respect to b i is: where ξ 1 and ξ 2 are s i × 1 vectors composed of ξ 1 and ξ 2 given by Equation ( 32) with t replaced by a s i × 1 occasion vector t i of the i-th patient.Table 1 presents the parameter estimates and their standard deviations (in parentheses) from the five computational methods, namely PNLS-MLMM, Laplacian, pseudo-ECM, MCEM with 500 Monte Carlo samples and ISEM with mixing proportion P 0 = 0.5.When employing the ISEM algorithm, several choices of the mixing proportion P 0 , ranging from 0-1 with and increment of 0.1, are considered.To save space, we reported only the result for P 0 = 0.5, as it yields the maximized log-likelihood value.The results indicate that the five methods can give very similar estimates and the significance of model parameters.According to the estimates of Σ = [σ jl ], the estimated correlation of log 10 RNA and CD4 0.5 ranges from −0.13-−0.18(around), confirming a negative relationship between the virologic and immunologic markers.The between-patient correlations of the two responses have no statistical significance based on the estimates of D. The estimate of autoregressive parameter φ is significantly different from zero, revealing an existence of autocorrelation among the within-patient variability.Figure 1 displays the observations and estimated mean curves in which the covariate is set to be the average of baseline RNA values of all patients for the five computational methods.Judging from the figure, the considered logarithmic and logistic curves in Equation ( 31) are reasonable functions to describe the evolutions of RNA in the log 10 scale and CD4 in the square-root scale over time.The trend of log 10 RNA decreases at the beginning due to the rapid growth of CD4 0.5 cells in the early days of antiviral therapies.After nearly four weeks, the decline pattern on log 10 RNA and the growth pattern on CD4 0.5 become slow and smooth.As an illustration, the fitted values obtained by the five methods together with the observations for seven randomly-selected patients are displayed in Figure 2. As anticipated, the fitted trajectories for each patient show the slight difference among the five estimating procedures.Generally, they adapt the trend along observed repeated measures, but some of configurations are not ideally captured.It is known that the viral load (RNA copies) and CD4 counts are highly variable immune system markers, making them difficult to fit.31) evaluated at the ML estimates θ obtained respectively by the five estimation procedures are reported in Table 2. To assess the accuracy of the approximations of the log-likelihood function, we also perform the double integral in log-likelihood Function Equation ( 4) by plugging the corresponding θ into Equation ( 4) and using the integrate routine in the R package to get the exact log-likelihoods.The exact log-likelihood values together with the absolute differences (AD) between the approximate and exact values are also listed in Table 2. Roughly, the log-likelihood values under the five approximation methods are similar and close to their corresponding exact values.In this example, the pseudo-ECM yields the most precise evaluation, followed by Laplacian, MCEM, ISEM and PNLS-MLME.Although the proposed five algorithms can provide quite similar estimates of model parameters, as well as the fitted mean profiles shown in Figures 1 and 2, we should give the following remarks.The PNLS-MLMM and Laplacian methods involve solving the fixed effects β and the modes of random effects {b i } N i=1 by implementing optimal iterative procedures.Thus, the two methods are very sensitive to initial values and may suffer from slow or even non-convergence due to singularity of variance-covariance matrices, especially when unnecessary random effects are included in the model.The MCEM and ISEM methods spend more time in generating an adequate number of samples of random effects to evaluate the required conditional expectations.Overall, the pseudo-ECM algorithm is the best method in terms of computational efficiency in this study.However, all of the proposed methods may get trapped in one of many local maxima of the log-likelihood function.To assess the stability of the resulting estimates, a variety of initial values should be employed when implementing the algorithms.The global optimal solution is obtained by choosing the one with the largest log-likelihood value.

Simulation Study
In this section, two simulation studies with data generated from two models with linear and nonlinear profiles, respectively, are undertaken to compare the performance of the five algorithmic procedures for fitting the MNLMM.The performance comparison includes the convergence efficiency in terms of the number of iterations and consumed CPU time, the accuracy of parameter estimates and the precision of log-likelihood approximation.All computations were carried out by R package 2.13.1 in a Win32 environment of a desktop PC machine with a 3.40-GHz/Intel Core(TM) i7-2600 CPU Processor and 4.0 GB RAM.

Bivariate Linear Case
To perform an evaluation of the exact log-likelihood values that is tractable, in this simulation, we restrict ourselves to generating datasets from the following bivariate LMM: When assessing the approximated log-likelihood functions, we find that all approximation methods produce relative biases in log-likelihoods within ±0.12 (the range is not quite large).Because the simulated datasets are generated from a linear scenario, i.e., bivariate LMM specified in Equation ( 33), the pseudo-data model given in Equation ( 8) certainly satisfies the MLMM [1] framework.Therefore, the ML estimates of model parameters, as well as the maximized log-likelihood value obtained by the pseudo-ECM algorithm are exactly the same as those obtained by fitting the MLMM using the EM-based algorithm.Besides, the PNLS-MLME method uses the same approximation of the log-likelihood function, say PD ( θ|y), with that of pseudo-ECM.Thus, the values of relative biases in log-likelihoods obtained by the PNLS-MLME and pseudo-ECM algorithms are quite similar, and they are very close to zero.Additionally, the Laplacian approximation gives near-zero, but slightly under-estimated relative biases in log-likelihoods, and the relative biases are negligible when the sample size and between-outcome correlation are large.The log-likelihood values could be slightly over-estimated by using the MCEM method and slightly under-estimated by using the ISEM method.As anticipated, the approximations of log-likelihood functions will get close to the exact log-likelihood value when the sample size increases.We now turn our attention to observing the estimation performance for model parameters under the five computational methods.From the RMSE rows of Table 3, typically, the five methods give comparable results for estimation accuracy due to negligible differences in RMSE scores.The RMSE decreases as the sample size increases, confirming the good asymptotic properties of ML estimators, at least for the setting of parameters used in this simulation.As mentioned above, the pseudo-ECM method implemented for linear models produces the same results as the EM-type algorithm for MLMM.Judging from Table 3, the pseudo-ECM method has the smallest RMSE among the five computational methods.Furthermore, we compare the estimates of each parameter obtained by PNLS-MLME, Laplacian, MCEM and ISEM against those obtained by pseudo-ECM one-by-one in detail.Figures 3 and 4 display the scatter plots of the estimates of fixed effects (β) and variance-covariance components (D and Σ) separately for the pseudo-ECM method (in the X-axes) versus the other four procedures (in the Y -axes).The dashed lines indicate the true values of parameters.To save space, we present only the case of N = 25 and ρ = 0.9, because the other five cases exhibit almost a similar pattern.It can be seen from the two figures that the estimates are all located in the neighborhood of the true values, indicating that all five computational procedures yield very precise estimates of model parameters.In general, there is a strong agreement in the estimates obtained through the five methods, because the point estimates fall close to the 45-degree line.However, for the estimate of β 4 , PNLS-MLME appears to have a slightly large variability.For the estimates of σ 11 , σ 12 and σ 22 , the other four methods tend to give estimates smaller than does the pseudo-ECM algorithm.

Bivariate Nonlinear Case
In the simulation, the data were generated from the MNLMM with nonlinear mean curves Equation (31).The presumed model parameters are: Each simulated dataset is fitted by the MNLMM using the five approximation methods described in Section 2. To investigate the effect of the size of MC samples for MCEM and mixing proportions of the envelope distribution for ISEM, we consider MC sample sizes M = 500, 1000, 2000 and the mixing proportions P 0 = 0.1, 0.5, 0.9.A total of 100 replications are run for each of sample sizes N = 25 and 50 across nine computational procedures.The convergence rule is the same as the previous simulation.Note that numerical double-integration is performed to calculate the exact log-likelihood, such that the evaluation of the accuracy of the approximate log-likelihood is tractable.
In this simulation study, there are 18 (10) and 12 (7) non-convergence cases out of 100 trials for the PNLS-MLME and Laplacian methods, respectively, under sample size N = 25 (50).To ensure that we are comparing estimates of different methods based on the same simulated data and initial values, an additional dataset will be regenerated in the procedure if one of the methods did not converge for a particular dataset.This can be done by using the R try() function to handle the error-recovery.Table 4 reports the computing results, including the averages of CPU time (Time), numbers of iterations (Iter), converged log-likelihood values ( max ), RB of log-likelihood functions and empirical sums of the RMSE of parameter estimates for each sample size and each algorithm.The results indicate that the pseudo-ECM spent the least CPU time, followed by the PNLS-MLME, Laplacian, ISEM with P 0 = 0.1, 0.5, MCEM with M = 500, 1000, 2000 and, then, ISEM with P 0 = 0.9.The PNLS-MLME demands the fewest numbers of iterations, followed by Laplacian, pseudo-ECM, ISEM with P 0 = 0.1, 0.5, MCEM with M = 2000, 1000, 500 and, then, ISEM with P 0 = 0.9.The performance of the five methods under the bivariate nonlinear model is conceptually similar to that under the bivariate linear model shown in Section 4.1.It makes sense that the consumed CPU time increases with the size of MC samples M for MCEM, but the required iteration number decreases with MC sample size M .Besides, for the ISEM method, when the proportion of importance samples of random effects drawing from the posterior of b i increases (say P 0 decreases), both the CPU time and iteration number decrease.
It can be seen from the RB column of Table 4 that all methods except for the three ISEM procedures provide comparable accuracy for approximate observed log-likelihood values, while the ISEM method tends to get a relatively large bias.Observing the empirical sums of RMSE, the PNLS-MLME and pseudo-ECM methods can yield more accurate estimates of model parameters as N = 25 and N = 50, respectively, while the others show minor difference in RMSE scores.The MCEM method generally offers better precision of the parameter estimates when the size of generated MC samples increases.Although the MCEM spent much CPU time and had larger iteration numbers to achieve convergence, it can produce relatively small bias for the approximation of observed log-likelihood and smaller RMSE for estimates of model parameters, especially for large sizes of sample N = 50 and MC samples M = 2000.Additionally, among the three settings of P 0 for ISEM, the case of equal weights (say P 0 = 0.5) gives smaller RB and RMSE scores.If we want to obtain more accurate results of approximate log-likelihood using the ISEM algorithm, probably a larger number of samples of random effects might be necessary, but it seems inefficient.As expected, when the sample size N increases, the required CPU time and iteration number increase, and the RB and RMSE decrease, confirming the large sample properties of ML estimation.In addition, the RMSE (×10 2 ) for the estimates of each parameter under the nine considered estimating procedures are listed in Table 5.It seems that the estimators for β 5 , β 6 , d 11 , d 21 , d 22 and σ 21 show somewhat less precise point estimates as opposed to the other parameters in the setting of this simulation.Observing the table, there are remarkable differences in the magnitude of RMSE values as the precision of parameter estimates depends heavily on the specification of nonlinear mean functions.Moreover, there are no consistent rankings of precision among the nine considered procedures for each parameter.Although this is a limited study, it demonstrates that all five approximation methods can give reasonable results for parameter estimation.

Discussion and Conclusions
In this article, we describe and compare five approximation methods to carry out ML estimation of the MNLMM, as well as the evaluation of the observed log-likelihood function.The methods, namely PNLS-MLME, Laplacian approximation, pseudo-ECM, MCEM and ISEM algorithms, depend on the result of the first two order Taylor expansions.The PNLS-MLME and pseudo-ECM methods use a linearization of nonlinear mean functions, while the other three methods rely on an approximation of the observed likelihood.Numerical results indicate that the five methods can give comparable accuracy of the estimation of model parameters, as well as approximation of the observed log-likelihood function of the MNLMM.
In summary, the five algorithmic schemes preserve flexibility and simplicity in carrying out ML estimation of the MNLMM.The pseudo-ECM method can offer relatively better efficiency compared to the other four methods.For the PNLS-MLME and Laplacian methods, a poor initial guess of θ can result in poor estimates of {b i } N i=1 , and thereby, the accuracy of parameter estimates and the performance of convergence become worse.To overcome this weakness, the consideration of different starting values for D(0) is recommended by specifying c D(0) , where c is a random draw from the standard normal distribution and the original D(0) is given in Section 2.7.The MCEM and ISEM methods appear to be less efficient, because both of them spend much time to generate MC samples for evaluating the required conditional expectations in each iteration.For the implementation of the ISEM algorithm, the specification of the mixing proportion P 0 depends on the data at hand.We suggest trying a variety of settings and choose the optimal P 0 corresponding to the maximized approximate observed log-likelihood.An R package for fitting MNLMM based on the proposed techniques will be released in the near future.
However, the multivariate normality assumption in the MNLMM might not provide robust inference if the data, even after being transformed, and exhibit fat tails and/or skewness [48][49][50].To alleviate such limitations, it is natural to replace the multivariate normally-distributed random effects and within-subject errors of the MNLMM by a broader family, such as the multivariate skew-normal distribution [51], the multivariate skew-t distribution [52], the multivariate skew-elliptical distribution [53], or the multivariate skew-normal independent distribution [54,55].The proposed methods are readily extendable to carry out ML estimation of the multivariate version of skew-family nonlinear mixed models.This leads to valuable further research on the issue of developing multivariate skew-family nonlinear mixed models together with their ML inference.

Figure 1 .
Figure 1.The log 10 (RNA) and CD4 0.5 observations (•) with the estimated mean curves against time (in days) from ML estimation using the five proposed procedures.

Figure 2 .
Figure 2. The fitted values obtained by the five proposed procedures together with the observations (•) of log 10 (RNA) and CD4 0.5 for seven randomly-selected patients.

Table 2 .
(31)oximate and exact log-likelihood functions for the fitted Model Equation(31)under the five estimation methods.AD, absolute difference.

Table 3 .
Simulation results for the computational performance of five approximation methods under each combination of correlations ρ and sample sizes N .Iter, iteration; RB, relative bias.

Table 4 .
Simulation results for nine estimating procedures under the bivariate nonlinear case.