Open Access
This article is
 freely available
 reusable
Stats 2019, 2(2), 174188; https://doi.org/10.3390/stats2020013
Article
Improved Small Sample Inference Methods for a MixedEffects Model for Repeated Measures Approach in Incomplete Longitudinal Data Analysis
^{1}
Department of Biostatistics, Janssen Pharmaceutical K. K., 52 Nishikanda 3chome, Chiyodaku, Tokyo 1010065, Japan
^{2}
Department of Statistical Science, School of Multidisciplinary Sciences, The Graduate University for Advanced Studies, 103 Midoricho, Tachikawa, Tokyo 1908562, Japan
^{3}
Department of Data Science, The Institute of Statistical Mathematics, 103 Midoricho, Tachikawa, Tokyo 1908562, Japan
^{4}
Department of Biostatistics, Faculty of Medicine, University of Tsukuba, 111 Tennodai, Tsukuba, Ibaraki 3058575, Japan
^{*}
Author to whom correspondence should be addressed.
Received: 16 January 2019 / Accepted: 23 March 2019 / Published: 28 March 2019
Abstract
:The mixedeffects model for repeated measures (MMRM) approach has been widely applied for longitudinal clinical trials. Many of the standard inference methods of MMRM could possibly lead to the inflation of type I error rates for the tests of treatment effect, when the longitudinal dataset is small and involves missing measurements. We propose two improved inference methods for the MMRM analyses, (1) the Bartlett correction with the adjustment term approximated by bootstrap, and (2) the Monte Carlo test using an estimated null distribution by bootstrap. These methods can be implemented regardless of model complexity and missing patterns via a unified computational framework. Through simulation studies, the proposed methods maintain the type I error rate properly, even for small and incomplete longitudinal clinical trial settings. Applications to a postnatal depression clinical trial are also presented.
Keywords:
Bartlett adjustment; MMRM; missing data; longitudinal data analysis; resampling1. Introduction
Clinical trials for new drug development are often longitudinal trials in which outcome variables are repeatedly measured. In these trials, the primary analyses usually compare the treatment efficacy with a comparator at the end of a followup period. However, during the followup period, dropouts or missing outcome variables usually occur, and may seriously influence the validity and precision of the statistical inference. In addition, regulatory guidelines for preventing and treating the missing data in clinical trials have been issued [1,2,3], and adequate practices have been strongly pursued in recent years. Following these discussions, the mixedeffects model for repeated measures (MMRM) [4,5,6] has been widely applied for primary analyses of clinical trials in drug development. This type of model allows for valid statistical inference under incomplete longitudinal repeated measurements based on the direct likelihood approach.
MMRM is a type of linear mixed model (LMM) [7,8,9] that directly models the variancecovariance matrix of the longitudinal multivariate outcome variables [5], in which random effects are included as part of the marginal covariance matrix. One of the advantages of MMRM is that it enables flexible modelling of the correlation structure between time points to ensure the validity of inference in treatment efficacy. Further, the ordinary inference methods for the LMM (e.g., restricted maximum likelihood (REML) method) [10] are based on large sample approximations. Their validities are violated under small or moderate sample settings [11,12]. For the MMRM, Gosho et al. [13] also showed the invalidity of the ordinary inference methods under small or moderate sample settings.
In order to resolve these problems, several related works have been conducted regarding the conventional LMM. One solution is to adopt a higherorder asymptotic theory. Zucker et al. [14] studied the Bartlett correction [15] and the CoxReid adjusted likelihood [16] as well as their combination of the likelihood ratio (LR) test. Lyons and Peters [17] and Guolo et al. [18] proposed a higherorder asymptotic approach by adapting Skovgaard’s improved modified signed loglikelihood ratio [19] introduced by BarndorffNielsen [20]. Stein et al. [21] investigated the modified profile likelihood approach of BarndorffNielsen [20] based on the approximation method of Severini [22]. Although the improved methods by Stein et al. [21] performed well in their simulation studies, they compared their methods only with the naïve LR test. In addition, their improved methods require complicated analytical calculations involved in higherorder differentiations of loglikelihood in casebycase analyses. In particular, when applying the MMRM in longitudinal clinical trials, the marginal covariance structure is usually assumed to be a complicated form to circumvent model misspecifications, and calculations by Stein et al. [21] would not be realistic in practical use. Stein et al. [21] also investigated bootstrapbased approximation inferences, but their discussions and numerical evaluations were also limited within the conventional LMM framework, and the applications to MMRM for incomplete longitudinal studies were not discussed.
In this study, we proposed and investigated two improved inference methods involved in MMRM under small sample size and with incomplete data for longitudinal clinical trials. To circumvent the practical difficulties in implementing the analytical calculations, we adopted numerical approximations using bootstrap inferences [23,24,25]. The first method was the Bartlett correction [15] with the adjustment term approximated by bootstraps [23]. This approach can effectively circumvent casebycase complicated analytical calculations and can be generally applied regardless of model complexity and missing patterns via a unified computational framework. In addition, the second involved the Monte Carlo test using empirical distribution constructed by the bootstrap; this was a straightforward approach, and is widely known to be an effective method under these situations. The resampling schemes of both methods allowed outcome variables to be incomplete, and we evaluated their validities and performances under practical situations of longitudinal clinical trials with missing data. In addition, we compared these methods with those of standard methods such as REML using KenwardRoger’s (KR) [11] method and unstructured covariance structure. We also assessed their practical effectiveness, illustrating their application to a postnatal depression clinical trial [26].
This paper is organized as follows: we first review the MMRM for longitudinal data analyses in Section 2. We then provide our approaches to improve the statistical inferences of MMRM in Section 3. We provide simulation evaluations in Section 4, and we apply our methods to the postnatal depression clinical trial data in Section 5. Lastly, we conclude with some discussion in Section 6.
2. MixedEffects Model for Repeated Measures (MMRM) for Longitudinal Data Analysis
Suppose subjects were randomized to two treatment groups (e.g., active drug vs. placebo). A continuous outcome was measured repeatedly over n time points. Also, we considered that the total number of subjects in the two groups was $N$ and the number of time points that the outcomes were observed for the $i$th individual is ${n}_{i}\text{}\left({n}_{i}\le n\right)$. The primary analysis of interest was to compare the mean difference of the primary endpoint at the final nth time point. In this study, we supposed only monotonic missing data for simplicity, but our discussions can be straightforwardly extended to nonmonotonic missingness.
where ${\mathit{\epsilon}}_{i}$ was the ${n}_{i}\times 1$ random error vector distributed independently as $MVN\left(0,\text{}{\Sigma}_{i}\right)$. ${\Sigma}_{i}$ was the ${n}_{i}\times {n}_{i}$ variance–covariance matrix. The random effect ${\mathit{b}}_{i}$ and the error ${\mathit{\epsilon}}_{i}$ were independent, and all data between different subjects were also assumed to be independent. ${\mathit{Y}}_{i}$ marginally follows $MVN\left({\mathit{X}}_{i}\mathit{\beta},\text{}{\mathit{V}}_{i}\right)$, where ${\mathit{V}}_{i}={\mathit{Z}}_{i}\mathit{D}{\mathit{Z}}_{i}^{\mathit{T}}+{\Sigma}_{i}$. In the MMRM method, the variations explained by random effects were included as part of the marginal covariance matrix ${\mathit{V}}_{i}$ rather than being explicitly modelled as the random effects [5]. An unstructured covariance matrix is often preferred as the structure of ${\mathit{V}}_{i}$ because no assumptions are made on the covariance structure [13]. For the statistical inferences of regression parameters, the restricted maximum likelihood (REML) method [10] has been routinely used in practice. In addition, although missing is a common problem in longitudinal clinical trials, validity of the inferences is assured under the missing at random (MAR) mechanism because MMRM adopts the likelihoodbased methods [27].
$${\mathit{Y}}_{i}={\mathit{X}}_{i}\mathit{\beta}+{\mathit{Z}}_{i}{\mathit{b}}_{i}+{\mathit{\epsilon}}_{i}$$
3. Improved Inference Methods
3.1. Likelihood Ratio (LR) test
We considered the testing problem for individual regression coefficients of MMRM, which corresponded to the primary analysis of longitudinal clinical trials. Without loss of generality, we considered a testing problem of the 1st component of the regression coefficients $\mathit{\beta}={\left({\beta}_{1},\text{}{\beta}_{2},\text{}\dots ,\text{}{\beta}_{p}\right)}^{T}$,
Let ${\mathit{\beta}}_{c}={\left({\beta}_{2},\text{}{\beta}_{3},\text{}\dots ,\text{}{\beta}_{p}\right)}^{T}$ and let $\mathit{\upsilon}$ a parameter vector composed of the components of marginal variance–covariance matrix ${\mathit{V}}_{i}$ (i = 1, 2, …, N).
$$\mathrm{Null}\text{}\mathrm{hypothesis}\text{}{H}_{0}:\text{}{\beta}_{1}={\beta}_{1}^{null}$$
$$\mathrm{Alternative}\text{}\mathrm{hypothesis}\text{}{H}_{1}:{\beta}_{1}\ne {\beta}_{1}^{null}$$
To develop the improved inference methods, we first introduced the LR test for MMRM. The LR test statistic for the hypothesis test above was
where $\left({\widehat{\beta}}_{1},\text{}{\widehat{\mathit{\beta}}}_{c},\text{}\widehat{\mathit{\upsilon}}\right)$ was the maximum likelihood (ML) estimates of $\left({\beta}_{1},\text{}{\mathit{\beta}}_{c},\text{}\mathit{\upsilon}\right)$ and $\left({\tilde{\mathit{\beta}}}_{c},\text{}\tilde{\mathit{\upsilon}}\right)$ was the constrained ML estimates under the null hypothesis. Also, $l\left({\beta}_{1},\text{}{\mathit{\beta}}_{c},\text{}\mathit{\upsilon}\right)$ was the loglikelihood function for MMRM,
$$T\left({\beta}_{1}^{null}\right)=2\left\{l\left({\beta}_{1}^{null},\text{}{\tilde{\mathit{\beta}}}_{c},\text{}\tilde{\mathit{\upsilon}}\right)l\left({\widehat{\beta}}_{1},{\widehat{\mathit{\beta}}}_{c},\text{}\widehat{\mathit{\upsilon}}\right)\right\},$$
$$l\left({\beta}_{1},\text{}{\mathit{\beta}}_{c},\text{}\mathit{\upsilon}\right)=\frac{1}{2}{\displaystyle \sum}_{i=1}^{N}\left\{\mathrm{log}\left{\mathit{V}}_{i}\right+{\left({\mathit{y}}_{i}{\mathit{X}}_{i}\mathit{\beta}\right)}^{T}{\mathit{V}}_{i}^{1}\left({\mathit{y}}_{i}{\mathit{X}}_{i}\mathit{\beta}\right)+{n}_{i}\mathrm{log}2\pi \right\}$$
The ML and the constrained ML estimates were computed by using this loglikelihood function. Asymptotically, the LR test statistic $T\left({\beta}_{1}^{null}\right)$ followed the chisquared distribution with 1 degree of freedom under the null hypothesis [28].
3.2. BartlettType Adjustment by Bootstrap Resampling
Conventionally, it is widely known that the large sample approximation of the LR test statistic $T\left({\beta}_{1}^{null}\right)$ to the chisquared distribution is not accurate under small sample settings [23]. To improve the approximations, several higher order approaches have been developed and the Bartlett correction [15] is one of the effective solutions. The Bartlett correction is a correction method for the LR test statistic that aims to improve the approximation to the reference chisquare distribution dividing by a correction term. The adjustment term is an estimate of the first moment of the null distribution of the LR test statistic $\xi =E\left[T\left({\beta}_{1}^{null}\right)\right]$, and the corrected LR test statistic is given by ${T}^{\ast}\left({\beta}_{1}^{null}\right)=T\left({\beta}_{1}^{null}\right)/\widehat{\xi}$. Intuitively, if the estimate $\widehat{\xi}$ is accurate, the null distribution of the corrected statistic approaches the chisquared distribution. Theoretically, BarndorffNielsen and Hall [29] showed that the Bartlett correction reduces the error of the chisquared approximation from $O\left({N}^{1}\right)$ to $O\left({N}^{2}\right)$.
In this study, we proposed a practical procedure to apply the Bartlett correction to MMRM effectively for incomplete longitudinal data under small sample size. Many previous studies attempted to obtain analytical forms of the Bartlett correction term by analytical methods [14,30,31]. However, analytical forms of the correction term were not necessarily obtainable when complicated models were assumed and missing data was involved. As an alternative effective approach, Rocke [32] proposed to use a resampling approach, which adopted the parametric bootstrap method to estimate the Bartlett correction term $\widehat{\xi}$. Here, we proposed to apply this resampling approach to improve the inferences of MMRM. The resampling approach possibly involved computational burdens, but it had an advantage in that it could be implemented using generic algorithms regardless of the complexity of regression model and covariance structure. The resampling based procedure was formulated as the following Algorithm 1.
Algorithm 1 Bartlett correction using bootstrap resampling technique. 

Also, the corresponding confidence interval of ${\beta}_{1}$ could be constructed by a set of ${\beta}_{1}^{null}$ that fulfill the following inequality,
where ${\chi}_{1,\text{}\alpha}^{2}$ is the upper $\alpha $th quantile of the chisquare distribution with 1 degree of freedom. Note that while it was technically possible to apply a nonparametric bootstrap method, the parametric bootstrap method would be preferred to estimate the Bartlett correction term $\xi $ under small sample settings, because the bootstrap distributions might have been too discrete [33,34].
$${T}_{BS}^{\ast}\left({\beta}_{1}^{null}\right)\le {\chi}_{1,\text{}\alpha}^{2},$$
3.3. Monte Carlo Test Using an Estimated Null Distribution by Bootstrap
Using the parametric bootstrap, we could also estimate the null distribution of the test statistic directly via the Monte Carlo technique [35]. We could construct an estimate of the null distribution of $T\left({\beta}_{1}^{null}\right)$ if we resampled a large number of LR test statistics under the null hypothesis using a parametric bootstrap technique. This method used the Monte Carlo estimate of the null distribution as the reference distribution of LR test, instead of the chisquared distribution. This approach would be an alternative to the former proposed method that had the same advantages for the inferences of small sample settings.
With processes 1–4 of Algorithm 1, we had the bootstrap LR test statistics ${T}^{\left(b\right)}\left({\beta}_{1}^{null}\right)\text{}\left(b=1,\text{}2,\text{}\dots ,\text{}B\right)$. The Monte Carlo estimate of the null distribution was obtained as the empirical distribution of $T\left({\beta}_{1}^{null}\right)$. Also, the bootstrapbased critical value of the nominal $\alpha $ level $\left(0<\alpha <1\right)$ corresponded to the upper $\alpha $th quantile of the empirical distribution function. The Monte Carlo test can be constructed by the following Algorithm 2.
Algorithm 2 Bootstrapbased adjustment of LR test. 

Also, $100\times \left(1\alpha \right)$% confidence intervals can be constructed with a set of ${\beta}_{1}^{null}$ that fulfill [36],
where ${\widehat{q}}_{bs,\left(1\alpha \right)}$ for the upper $\alpha \mathrm{th}\text{}\mathrm{quantile}$ of the estimated null distribution. According to Rocke [32], more than 1000 resamplings were recommended when estimating the tail of a distribution, such as the upper $\alpha $th quantile of the null distribution.
$$T\left({\beta}_{1}^{null}\right)\le {\widehat{q}}_{bs,\text{}\left(1\alpha \right)}.$$
4. Simulation Studies
4.1. Design and Setting
We conducted a series of simulation studies to assess the performances of the two methods, the Bartletttype correction for LR test statisticbased test (LR_{Bart}) and the bootstrap adjustment test for LR test statistic based test (LR_{Boot}) under practical situations of longitudinal small clinical trials. We compared the effectiveness of these methods with the conventional ordinary LR test and KenwardRoger (KR) method [11], which is the current standard inference method in MMRM analyses. We considered the same scenarios of the simulation studies of Gosho et al. [13], which conducted extensive simulations to evaluate the performances of MMRM for longitudinal clinical trials. We supposed two group comparative longitudinal clinical trials (e.g., active drug group vs. placebo group) and the number of post baseline visits (n) to be 7. The total number of subjects was determined as $N=20$ (i.e., 10 subjects per group, respectively). The outcome variables ${Y}_{it}$ (t = 1., 2., …7) were generated from the following model,
where $mea{n}_{it}$ was a fixed effect, $subjec{t}_{i}$ was a subject effect, and $erro{r}_{it}$ was a random error (i = 1, 2, …, N; t = 1, 2, …, 7). The mean values of ${Y}_{it}$ assumed the four scenarios illustrated in Figure 1. Here, we were interested in evaluating the mean difference between the two groups at the final (7th) time point. Scenarios 1 and 2 corresponded to the null hypothesis that the mean values of the outcome variables were the same between the two groups at the final time point. By contrast, in scenarios 3 and 4, the mean value of the treatment efficacy at the final time point differs between the two groups and corresponded to the alternative hypothesis.
$${Y}_{it}=mea{n}_{it}+subjec{t}_{i}+erro{r}_{it}$$
4.2. Correlation Structures
For the variance–covariance structure of the error terms, a first order heterogeneous autoregressive (ARH (1)) structure was adopted, of which $\left(t,\text{}{t}^{\prime}\right)$ element is defined as ${\sigma}_{t}{\sigma}_{{t}^{\prime}}{\rho}^{\left{t}^{\prime}t\right}$ where ${\sigma}_{t}$ and ${\sigma}_{{t}^{\prime}}$ are the standard deviances of $t$th and ${t}^{\prime}$th time points and $\rho $ is the correlation coefficient between the two points. Following Gosho et al. [13], the diagonal elements ${\sigma}_{t}^{2}$ was set to $9\left\{1+3\left(t1\right)/6\right\}$ and the correlation coefficient $\rho $ was set to 0.7. Also, the subject effect was generated by $N\left(0,\text{}{3}^{2}\right)$.
4.3. Missingdata Mechanism
In this simulation, we considered two missingdata mechanisms, missing completely at random (MCAR) and missing at random (MAR). Only the monotone missing was assumed, i.e., once missingness occurred, all outcome values after the time point were missing for the corresponding individual. We denoted the probability of missingness of ${Y}_{it}$ as ${\lambda}_{it}$. The missingness probability ${\lambda}_{it}$ was assumed to follow the logistic regression model,
$$\mathrm{logit}\left({\lambda}_{it}\right)={\gamma}_{0}+{\gamma}_{1}{Y}_{i,\text{}t1},\text{}t=2,\text{}3,\dots ,\text{}7.$$
The regression coefficients of the logistic regression model for the missingness probability were defined as ${\gamma}_{1}=0$ for MCAR and ${\gamma}_{1}=1$ for MAR. Table 1 summarizes the missingdata mechanisms and the coefficients of the logistic regression model that had a defined dropout rate for each treatment group at the final time point. The total dropout rates for the two groups were set to 0%, 20% or 40% for the four scenarios.
4.4. Analysis Methods
The simulated data were analysed using four methods as mentioned above. We adopted the standard MMRM that included a group variable and time variables as dummy variables and the groupbytime interactions in the regression function. An unstructured covariance structure was adopted for the covariance structure model for the outcome variables. Parametric bootstraps for the proposed two methods were performed via 3000 resamplings. The results concerning convergence of MMRM analyses are reported in the Appendix. The numbers of simulations were 1000 for all scenarios. All computations were performed by SAS ver. 9.4. Also, the significance levels were set to be 0.05.
4.5. Results
Figure 2 shows the type I error rates for scenarios 1 and 2 under $N=20$ (i.e., 10 subjects per group). The blue dashed lines correspond to the 95% intervals of the Monte Carlo errors. At first, the type I error rates of LR test increased greatly from 5% as the dropout rate increased. In scenario 1 with a 40% dropout rate, the type I error rates of LR was 11.3% under MCAR, and 9.8% under MAR, respectively. Besides, the type I error rates of LR_{Bart} and LR_{Boot} were maintained at 5% irrespective of the missingdata mechanism and dropout rate. For example, the type I error rates under scenario 1 with a 40% dropout rate under MAR were 5.5%, 5.6% for LR_{Bart} and LR_{Boot}, respectively. On the other hand, the type I error rates of the KR method were not maintained at around 5% under MAR and were too conservative. Under MAR with a 40% dropout rate, the type I error rates of the KR method were 3.6% and 3.5% for scenarios 1 and 2, respectively. Besides, under MCAR scenarios, the type I error rates of the KR method were maintained at around 5%. The convergence rates of these methods were not significantly different (see Appendix). Note that the type I errors for the LR, LR_{Bart} and LR_{Boot} were inflated under scenario 1 with dropout rate 40%, but they fell within the ranges of Monte Carlo errors. The results of the convergence for scenarios 1 and 2 appears in the appendix section as Figure A1.
Figure 3 shows the powers in scenarios 3 and 4 for $N=20$ (i.e., 10 subjects per group). At first, the powers of LR was higher than those of other methods, ranging from approximately 14% to 20% depending on the dropout rate. However, since the type I error rates of LR were not maintained at 5% under scenarios 1 and 2, it should be considered to have liberal properties in general. In all three methods other than LR test, the powers decreased as the dropout rate increased, due to the reduction of available statistical information. In scenario 3, with 10 subjects per group and a 40% dropout rate under MAR, the powers of LR_{Bart}, LR_{Boot} and KR were 7.1%, 7.4% and 6.5%, respectively. The overall trends concerning powers of the four methods agreed with the rejection rates under scenario 1 and 2, although they depended on the sample size and effect sizes.
The results of the convergence for scenarios 3 and 4 appears in the appendix section as Figure A2.
5. Application: Postnatal Depression Trial
Postnatal depression is commonly treated with antidepressants and counselling. Transdermal administration of estrogen has also been shown to be effective, and Gregoire et al [26] conducted a doubleblind, placebocontrolled study in 61 women within 3 months of giving birth [26,37]. Although the study planned to enroll 100 subjects, eventually 61 women were randomly assigned to the placebo group (27 subjects) or the estrogen group (34 subjects). The women were assessed twice prior to treatment and then monthly for 6 months after treatment using the Edinburgh postnatal depression scale (EPDS), with higher scores indicating more severe depression. Approximately 37.0% (10/27) of subjects in the placebo group and 17.6% (6/34) of subjects in the estrogen group had missing EPDS scores at the final time point. All data had monotone missing patterns.
The baseline EPDS score was defined as the average of the scores at the 1st and 2nd months in this study. The outcome variables were measured on the visits between the 3rd and 8th months. We considered analysing this longitudinal dataset using MMRM and the following regression function model,
where ${Y}_{it}$ denotes the EPDS score for the participant $i$ $\left(i=1,\text{}2,\text{}\dots ,\text{}61\right)$ on the $t$th occasion $\left(t=1,\text{}2,\text{}\dots ,\text{}8\right)$. ${G}_{i}$ was a dummy variable that equals 1 if the participant $i$ belongs to the estrogen group and equals 0, otherwise. For the covariance structure of the outcome variables, we assumed the unstructured structure. Here, our primary subject of interest was the evaluation of the mean difference of outcome variables on the final time point. In addition, we considered a subgroup analysis for a group of participants with clinically severe depressive symptoms that was defined as baseline EPDS $>21$ [38]. There were 30 participants in subgroup (15 participants in both placebo and estrogen groups). At the final month, the proportions of dropout were 40.0% (6/15) and 20.0% (3/15) for placebo and estrogen group, respectively.
$$E[{Y}_{it}{G}_{i},{t}_{it}]={\beta}_{0}+{\beta}_{1}{G}_{i}+{\beta}_{2t}{t}_{it}+{\beta}_{3t}{G}_{i}\times {t}_{it}$$
At baseline, the mean EPDS scores of the placebo and estrogen groups were 21.26 (3.11) and 21.59 (3.06), respectively. Table 2 summarizes the mean difference estimates of EPDS scores at the final month, as well as their 95% confidence intervals and $p$values by the conventional and proposed methods. We added the ttest on the single point analysis at the final month in these analyses as a reference. The numbers of resampling for the proposed methods were set to be 3000.
In the whole population analysis, all of the five methods showed significant differences and provided similar estimates. However, the pvalue of LR test was a bit smaller than the proposed methods, and that of KR was a bit larger. These trends might corresponds to the liberal and conservative properties of these methods. These trends became clearer for the subgroup analysis for the participants with severe symptoms. Only the LR test showed significant difference, and the other four methods provided nonsignificant results. The pvalues of LR_{Bart} and LR_{Boot} were 0.0586 and 0.0583, but that of KR was 0.1288. These results might reflect the conservative property of KR, and it was possibly improved by the proposed two methods. In addition, the $t$test for the subgroup analysis provided a larger pvalue (0.2349) with a considerably smaller estimate and larger standard error. Previous numerical evidence (e.g., Ashbeck and Bell [39]) showed possible bias and information reduction of the single time point analysis by ttest, and this result might correspond to this evidence. With LR_{Bart} and LR_{Boot}, the computational times were 55 and 38 minutes for whole population and subgroup, respectively (we used a general laptop computer with an Intel (R) Core (TM) i76500U and SAS 9.4). The computational times would be dramatically improved by applying parallel computation techniques.
Figure 4 shows the histogram of the empirical distribution of the LR test statistics resampled by the parametric bootstrap method under the null effect hypothesis. The mean values of the empirical distribution were designated by the vertical blue dashed line in each histogram and were 1.09 and 1.20 in the whole population and the subgroup, respectively. In addition, the 95th percentiles of the empirical distribution were 4.14 and 4.55 for the whole population and the subgroup, respectively. If the chisquared approximations are accurate, the means and 95th percentiles of the null distribution were expected to be 1.0 and 3.84. These results would show that the distribution of the LR test statistic in incomplete longitudinal data with a small sample size shifted and adequate corrections were needed. The proposed methods would improve the approximations and enable improvements of the inferences as shown in the simulations.
6. Discussion
MMRM with the KR method has been widely applied as a standard analysis method for longitudinal clinical trials. If a sufficient number of samples are available, there are no problems using statistical tests and confidence intervals based on large sample theory. However, the asymptotic approximations cannot be appropriate in cases with small sample sizes. In addition, most clinical trials involve missing data. As methods to improve validity of the statistical inferences, we proposed resamplingbased approaches. Throughout the simulations and real data applications, we demonstrated the effectiveness of the proposed methods compared with existing standard methods.
In the simulation experiments, the KR method and our proposed methods maintained almost the same type I error rate under MCAR, which was close to 5%. However, under MAR scenarios with large dropout rates, the KR method had conservative type I error rate compared with our proposed methods. Our proposed methods might have an advantage even if the missingdata mechanism is MAR compared with KR. In addition, it should be noted that Algorithm 1 uses bootstrap samples to estimate the mean of the null distribution, whereas Algorithm 2 uses them to estimate a quantile of the null distribution. In general, the latter is a more unstable quantity for Monte Carlo inferences and thus requires a larger number of resamplings in general [23]. In our simulation studies, 3000 resamplings were performed for both the LR_{Bart} and LR_{Boot} methods, and we obtained similar results. The number 3000 was determined considering Monte Carlo errors, and they would be sufficient. Although these might require large computation burdens, they would not be so problematic under a modern computational environment, in which parallel computations are available for standard statistical software.
In addition, another possible effective approach to be considered in future research might be the Bayesian approach. The Bayesian method might also accommodate small sample sizes, if the choices of the prior distributions are appropriate. The advantages and potential drawbacks are discussed in Van De Schoot et al. [40]. Also, another concern is extensions to multiparameter inferences. However, the proposed methods are quite general methods and could be straightforwardly extended to the multiparameter inferences.
The effectiveness of our proposed two resampling approaches for MMRM were clearly shown through simulation studies and real data applications. To assure scientific validity in developments of new drug and health technology, accurate statistical inference methods are essential tools. The proposed methods can be applied as effective options in statistical analyses for small and incomplete longitudinal clinical trials.
Author Contributions
Y.U. and H.N. conceived and designed this study. Y.U., H.N., K.M. and M.G. conducted developments of the methods. Y.U. conducted simulation and real data analyses. Y.U. and H.N. interpreted the results, and drafted the manuscript. All authors approved its final version.
Funding
This study was partly supported by GrantsinAid for Scientific Research from the Japan Society for the Promotion of Science (Grant numbers: JP15K15954, JP15H03390).
Conflicts of Interest
Y.U. is employee of Janssen Pharmaceutical K.K. H.N., K.M., and M.G. declare that they have no competing interests.
Appendix A
The results of convergence proportion with N = 20 are presented as figures below. Figure A1 displays the results in scenario 1 and 2 and Figure A2 displays the results in scenario 3 and 4, respectively.
Figure A1.
Convergence proportion (%) in scenarios 1 and 2 under MCAR and MAR assuming 10 subjects per group in the simulation study.
Figure A2.
Convergence proportion (%) in scenarios 3 and 4 under MCAR and MAR assuming 10 subjects per group in the simulation study.
References
 European Medicines Agency. Guideline on Missing Data in Confirmatory Clinical Trials. 2010. Available online: https://www.ema.europa.eu/documents/scientificguideline/guidelinemissingdataconfirmatoryclinicaltrials_en.pdf (accessed on 11 January 2019).
 National Research Council. The Prevention and Treatment of Missing Data in Clinical Trials; National Academies Press: Washington, DC, USA, 2010. [Google Scholar]
 Ratitch, B.; O’Kelly, M.; Tosiello, R. Missing data in clinical trials: From clinical assumptions to statistical analysis using pattern mixture models. Pharm. Stat. 2013, 12, 337–347. [Google Scholar] [CrossRef]
 Mallinckrodt, C.H.; Sanger, T.M.; Dubé, S.; DeBrota, D.J.; Molenberghs, G.; Carroll, R.J.; Potter, W.Z.; Tollefson, G.D. Assessing and interpreting treatment effects in longitudinal clinical trials with missing data. Biol. Psychiatry 2003, 53, 754–760. [Google Scholar] [CrossRef]
 Mallinckrodt, C.H.; Lane, P.W.; Schnell, D.; Peng, Y.; Mancuso, J.P. Recommendations for the primary analysis of continuous endpoints in longitudinal clinical trials. Drug Inf. J. 2008, 42, 303–319. [Google Scholar] [CrossRef]
 Mallinckrodt, C.H.; Watkin, J.G.; Molenberghs, G.; Carroll, R.J. Choice of the primary analysis in longitudinal clinical trials. Pharm. Stat. 2004, 3, 161–169. [Google Scholar] [CrossRef]
 Laird, N.M.; Ware, J.H. Randomeffects models for longitudinal data. Biometrics 1982, 38, 963–974. [Google Scholar] [CrossRef]
 Fitzmaurice, G.M.; Laird, N.M.; Ware, J.H. Applied Longitudinal Analysis; John Wiley & Sons: New York, NY, USA, 2012. [Google Scholar]
 Diggle, P.J.; Heagerty, P.J.; Liang, K.Y.; Zeger, S. Analysis of Longitudinal Data; Oxford University Press: Oxford, UK, 2002. [Google Scholar]
 Patterson, H.D.; Thompson, R. Recovery of interblock information when block sizes are unequal. Biometrika 1971, 58, 545–554. [Google Scholar] [CrossRef]
 Kenward, M.G.; Roger, J.H. Small sample inference for fixed effects from restricted maximum likelihood. Biometrics 1997, 53, 983–997. [Google Scholar] [CrossRef]
 Schluchter, M.D.; Elashoff, J.T. Smallsample adjustments to tests with unbalanced repeated measures assuming several covariance structures. J. Stat. Comput. Simul. 1990, 37, 69–87. [Google Scholar] [CrossRef]
 Gosho, M.; Hirakawa, A.; Noma, H.; Maruo, K.; Sato, Y. Comparison of biascorrected covariance estimators for MMRM analysis in longitudinal data with dropouts. Stat. Methods Med. Res. 2017, 26, 2389–2406. [Google Scholar] [CrossRef] [PubMed]
 Zucker, D.M.; Lieberman, O.; Manor, O. Improved small sample inference in the mixed linear model: Bartlett correction and adjusted likelihood. J. R. Stat. Soc. B 2000, 62, 827–838. [Google Scholar] [CrossRef]
 Bartlett, M.S. Properties of sufficiency and statistical tests. Proc. R. Soc. Lond. A 1937, 160, 268–282. [Google Scholar][Green Version]
 Cox, D.R.; Reid, N. Parameter orthogonality and approximate conditional Inference. J. R. Stat. Soc. B 1987, 49, 1–39. [Google Scholar] [CrossRef]
 Lyons, B.; Peters, D. Applying skovgaard’s modified directed likelihood statistic to mixed linear models. J. Stat. Comput. Simul. 2000, 65, 225–242. [Google Scholar] [CrossRef]
 Guolo, A.; Brazzale, A.R.; Salvan, A. Improved inference on a scalar fixed effect of interest in nonlinear mixedeffects models. Comput. Stat. Data Anal. 2006, 51, 1602–1613. [Google Scholar] [CrossRef]
 Skovgaard, I.M. An explicit largedeviation approximation to oneparameter tests. Bernoulli 1996, 2, 145–165. [Google Scholar] [CrossRef]
 BarndorffNielsen, O. On a formula for the distribution of the maximum likelihood estimator. Biometrika 1983, 70, 343–365. [Google Scholar] [CrossRef]
 Stein, M.C.; da Silva, M.F.; Duczmal, L.H. Alternatives to the usual likelihood ratio test in mixed linear models. Comput. Stat. Data Anal. 2014, 69, 184–197. [Google Scholar] [CrossRef]
 Severini, T.A. An approximation to the modified profile likelihood function. Biometrika 1998, 85, 403–411. [Google Scholar] [CrossRef]
 Cordeiro, G.M.; CribariNeto, F. An Introduction to Bartlett Correction and Bias Reduction; Springer: New York, NY, USA, 2014. [Google Scholar]
 Davison, A.C.; Hinkley, D.V. Bootstrap Methods and Their Application; Cambridge University Press: Cambridge, UK, 1997. [Google Scholar]
 Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; CRC Press: New York, NY, USA, 1994. [Google Scholar]
 Gregoire, A.J.; Kumar, R.; Everitt, B.; Henderson, A.F.; Studd, J.W. Transdermal oestrogen for treatment of severe postnatal depression. Lancet 1996, 347, 930–933. [Google Scholar] [CrossRef]
 Verbeke, G.; Molenberghs, G. Linear Mixed Models for Longitudinal Data; Springer: New York, NY, USA, 2000. [Google Scholar]
 Cox, D.R.; Hinkley, D.V. Theoretical Statistics; Chapman & Hall: London, UK, 1974. [Google Scholar]
 BarndorffNielsen, O.E.; Hall, P. On the levelerror after Bartlett adjustment of the likelihood ratio statistic. Biometrika 1988, 75, 374–378. [Google Scholar] [CrossRef]
 Lawley, D.N. A general method for approximating to the distribution of likelihood ratio criteria. Biometrika 1956, 43, 295–303. [Google Scholar] [CrossRef]
 Melo, T.F.N.; Ferrari, S.L.P.; CribariNeto, F. Improved testing inference in mixed linear models. Comput. Stat. Data Anal. 2009, 53, 2573–2582. [Google Scholar] [CrossRef][Green Version]
 Rocke, D.M. Bootstrap Bartlett adjustment in seemingly unrelated regression. J. Am. Stat. Assoc. 1989, 84, 598–601. [Google Scholar] [CrossRef]
 Noma, H.; Nagashima, K.; Maruo, K.; Gosho, M.; Furukawa, T.A. Bartletttype corrections and bootstrap adjustments of likelihoodbased inference methods for network metaanalysis. Stat. Med. 2018, 37, 1178–1190. [Google Scholar] [CrossRef] [PubMed]
 Wehrens, R.; Putter, H.; Buydens, L.M.C. The bootstrap: A tutorial. Chemometr. Intell. Lab. 2000, 54, 35–52. [Google Scholar] [CrossRef]
 Efron, B. Bootstrap Methods: Another Look at the Jackknife; Springer: New York, NY, USA, 1992. [Google Scholar]
 Zeng, Q.; Davidian, M. Bootstrapadjusted calibration confidence intervals for immunoassay. J. Am. Stat. Assoc. 1997, 92, 278–290. [Google Scholar] [CrossRef]
 Everitt, B.S. A Handbook of Statistical Analyses Using SPlus; CRC Press: New York, NY, USA, 2001. [Google Scholar]
 McCabeBeane, J.E.; Segre, L.S.; Perkhounkova, Y.; Stuart, S.; O’Hara, M.W. The identification of severity ranges for the Edinburgh Postnatal Depression Scale. J. Reprod. Infant Psychol. 2016, 34, 293–303. [Google Scholar] [CrossRef]
 Ashbeck, E.L.; Bell, M.L. Single time point comparisons in longitudinal randomized controlled trials: Power and bias in the presence of missing data. BMC Med. Res. Methodol. 2016, 16, 43. [Google Scholar] [CrossRef] [PubMed]
 Van De Schoot, R.; Broere, J.J.; Perryck, K.H.; ZondervanZwijnenburg, M.; Van Loey, N.E. Analyzing small data sets using Bayesian estimation: The case of posttraumatic stress symptoms following mechanical ventilation in burn survivors. Eur. J. Psychotraumatol. 2015, 6, 25216. [Google Scholar] [CrossRef] [PubMed]
Figure 2.
Type I error rates under scenarios 1 and 2 in the simulation study. Red dashed line, 5%; Blue dashed line, 95% intervals of Monte Carlo error with 1000 iterations.
Figure 4.
Empirical distribution by parametric bootstrap in postnatal depression trial. Blue dashed line, mean of empirical distribution (1.09 and 1.20 for whole population and subgroup, respectively); Red dashed line, 95th percentile of empirical distribution (4.14 and 4.55 for whole population and subgroup, respectively).
Scenario  Missing Mechanism  Overall Dropout (%)  Dropout (%)  ${\mathit{\gamma}}_{0}$  

Placebo  Active  Placebo  Active  
1  MCAR  20  20  20  3.2  3.2 
40  40  40  2.4  2.4  
MAR  20  20  20  7.1  7.1  
40  40  40  4.2  4.2  
2  MCAR  20  22  18  3.1  3.4 
40  44  36  2.3  2.6  
MAR  20  22  18  6.6  6.6  
40  44  36  3.7  3.7  
3  MCAR  20  24  16  3.0  3.5 
40  46  34  2.2  2.6  
MAR  20  24  16  7.8  7.8  
40  46  34  4.8  4.8  
4  MCAR  20  24  16  3.0  3.5 
40  46  34  2.2  2.6  
MAR  20  24  16  7.2  7.2  
40  46  34  4.2  4.2 
Table 2.
Results of the analyses of the postnatal depression trial: Inferences of the mean difference at the final month.
Whole Population (N = 61)  Subgroup (N = 30)  

Estimate [95% CI]  pValue  Estimate [95% CI]  pValue  
LR_{Bart}  4.34 [1.67, 7.66]  0.0031  3.93 [−0.19, 10.24]  0.0586 
LR_{Boot}  4.34 [1.70, 7.64]  0.0050  3.93 [−0.16, 10.21]  0.0583 
LR  4.34 [1.81, 7.52]  0.0019  3.93 [0.29, 9.74]  0.0383 
KR  4.34 [1.45, 7.23]  0.0040  3.93 [−1.23, 9.09]  0.1288 
$t$test  4.36 [1.43, 7.29]  0.0045  3.17 [−2.24, 8.57]  0.2349 
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).